## Documentation

`AnnoyIndex(f, metric)` returns a new read-write index that stores vectors of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".

`a.add_item(i, v)` adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
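Because memory is allocated for max(i)+1 items, sparse or very large ids waste space. A minimal sketch of one way around this, assuming the raw ids are hashable: map them to consecutive integers before calling `add_item`. The helper name `make_dense_ids` is hypothetical and not part of Annoy.

```python
def make_dense_ids(raw_ids):
    """Map arbitrary hashable ids to consecutive integers 0..n-1 (first-seen order)."""
    raw_to_dense = {}
    for raw in raw_ids:
        if raw not in raw_to_dense:
            raw_to_dense[raw] = len(raw_to_dense)
    # position in this list is the dense id, value is the original id
    dense_to_raw = list(raw_to_dense)
    return raw_to_dense, dense_to_raw


raw_to_dense, dense_to_raw = make_dense_ids([1_000_000, 42, 1_000_000, 7])
print(raw_to_dense)  # {1000000: 0, 42: 1, 7: 2}
print(dense_to_raw)  # [1000000, 42, 7]
```

The dense id would be passed to `add_item`, and `dense_to_raw` translates query results back to the original ids.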

`a.build(n_trees, n_jobs=-1)` builds a forest of n_trees trees. More trees give higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

`a.save(fn, prefault=False)` saves the index to disk and loads it (see next function). After saving, no more items can be added.

`a.load(fn, prefault=False)` loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.

`a.unload()` unloads.

`a.get_nns_by_item(i, n, search_k=-1, include_distances=False)` returns the n closest items. During the query it will inspect up to search_k nodes, which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between better accuracy and speed. If you set include_distances to True, it will return a 2-element tuple of two lists: the first contains the item indices and the second contains the corresponding distances.

`a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)` does the same, but queries by vector v.
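To judge how much accuracy a given search_k buys, the approximate result can be compared against an exhaustive search on a small sample. The following is a sketch in plain Python using Euclidean distance; `exact_nns` and `recall` are hypothetical helper names, not Annoy functions, and the approximate ids passed to `recall` stand in for whatever a query returned.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def exact_nns(vectors, query, n):
    """Return the ids of the n vectors closest to query, by exhaustive search."""
    return heapq.nsmallest(
        n, range(len(vectors)), key=lambda i: euclidean(vectors[i], query)
    )


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result recovered."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
exact = exact_nns(vectors, [0.1, 0.0], 2)
print(exact)                  # [0, 1]
print(recall([0, 3], exact))  # 0.5
```

Exhaustive search is linear in the number of items, so this is only practical as a spot check on a handful of queries.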

`a.get_item_vector(i)` returns the vector for item i that was previously added.

`a.get_distance(i, j)` returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.

`a.get_n_items()` returns the number of items in the index.

`a.get_n_trees()` returns the number of trees in the index.

`a.on_disk_build(fn)` prepares annoy to build the index in the specified file instead of RAM (execute before adding items; no need to save after build).

`a.set_seed(seed)` will initialize the random number generator with the given seed. It is only used for building the trees, i.e. it only needs to be called before adding the items. It has no effect after calling a.build(n_trees) or a.load(fn).


## Import library

In [1]:
import sys
sys.path

['/Users/filipilievski/mcs/cskg/examples',
 '/Users/filipilievski/opt/anaconda3/envs/mowgli/lib/python37.zip',
 '/Users/filipilievski/opt/anaconda3/envs/mowgli/lib/python3.7',
 '/Users/filipilievski/opt/anaconda3/envs/mowgli/lib/python3.7/lib-dynload',
 '',
 '/Users/filipilievski/opt/anaconda3/envs/mowgli/lib/python3.7/site-packages',
 '/Users/filipilievski/opt/anaconda3/envs/mowgli/lib/python3.7/site-packages/IPython/extensions',
 '/Users/filipilievski/.ipython']

In [2]:
from annoy import AnnoyIndex

## Prepare Data and Index

In [3]:
# path to the entity embedding TSV file
#input_file = '/nas/home/binzhang/backup_data/complex/comp_log_dot_0.01/entities_output.tsv'
input_file='../output/embeddings/entity_embedding_100.tsv'

In [4]:
# build a bidirectional entity name <-> index dictionary
entity_dict = {}  # {name1: 0, 0: name1}

# embedding dimension
dimension = 100

# build an index that stores vectors of this dimension
annoy_index = AnnoyIndex(dimension, 'angular')  # angular distance, based on cosine similarity

In [6]:
%%time
with open(input_file, 'r') as f:
    for index, line in enumerate(f):
        fields = line.split()
        entity_name = fields[0]
        entity_vec = [float(x) for x in fields[1:]]
        # store both directions: name -> index and index -> name
        entity_dict[entity_name] = index
        entity_dict[index] = entity_name
        annoy_index.add_item(index, entity_vec)

CPU times: user 44.3 s, sys: 797 ms, total: 45.1 s
Wall time: 45.2 s
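The loading loop above trusts every line to hold a name followed by exactly `dimension` floats. A hedged sketch of a stricter parser, assuming the same whitespace-delimited layout; `parse_embedding_line` is a hypothetical helper, and entity names containing spaces would still need a different delimiter.

```python
def parse_embedding_line(line, dimension):
    """Split one whitespace-delimited line into (entity_name, vector).

    Raises ValueError if the line does not hold exactly `dimension` values.
    """
    fields = line.split()
    if len(fields) != dimension + 1:
        raise ValueError(f"expected {dimension + 1} fields, got {len(fields)}")
    return fields[0], [float(x) for x in fields[1:]]


name, vec = parse_embedding_line("/c/en/caffeine 0.5 -1.25 3.0\n", dimension=3)
print(name, vec)  # /c/en/caffeine [0.5, -1.25, 3.0]
```

Failing early on a malformed line is easier to debug than a dimension mismatch surfacing later, when the vector is added to the index.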


In [7]:
# build a forest of 100 trees; more trees give higher precision when querying
annoy_index.build(100)  # build(n_trees, n_jobs=-1)

True

## Get top-k entities

In [9]:
entity_dict

{'/c/en/saltyback/a': 0,
 0: '/c/en/saltyback/a',
 'at:to_see_what_they_are_not_showing_them': 1,
 1: 'at:to_see_what_they_are_not_showing_them',
 '/c/en/pot_cheese/n': 2,
 2: '/c/en/pot_cheese/n',
 '/c/en/set/n/wn/cognition': 3,
 3: '/c/en/set/n/wn/cognition',
 '/c/en/automatic_drive/n/wn/artifact': 4,
 4: '/c/en/automatic_drive/n/wn/artifact',
 'at:get_money_out_of_the_bank': 5,
 5: 'at:get_money_out_of_the_bank',
 '/c/en/emblazers/n': 6,
 6: '/c/en/emblazers/n',
 '/c/en/rubber_disk': 7,
 7: '/c/en/rubber_disk',
 '/c/en/nonshadowed': 8,
 8: '/c/en/nonshadowed',
 '/c/en/splayfoot/n': 9,
 9: '/c/en/splayfoot/n',
 '/c/en/fingercots/n': 10,
 10: '/c/en/fingercots/n',
 '/c/en/advertising_agency/n': 11,
 11: '/c/en/advertising_agency/n',
 '/c/en/arterioportography': 12,
 12: '/c/en/arterioportography',
 '/c/en/hebesphenomegacorona': 13,
 13: '/c/en/hebesphenomegacorona',
 '/c/en/lady_liberty/n': 14,
 14: '/c/en/lady_liberty/n',
 '/c/en/jarrah': 15,
 15: '/c/en/jarrah',
 '/c/en/appendicolit

In [13]:
# pick an example
target_entity_name = '/c/en/caffeine'  # alternative: '/c/en/goateed'
target_entity_index = entity_dict[target_entity_name]
target_entity_name, target_entity_index

('/c/en/caffeine', 1979706)

In [15]:
# top 5 closest neighbors; the result includes the query item itself
neighbor_ids, distances = annoy_index.get_nns_by_item(target_entity_index, 5, include_distances=True)

for ent_index, distance in zip(neighbor_ids, distances):
    ent_name = entity_dict[ent_index]
    print(f'{ent_name:<30} {ent_index:<10} {distance}')
    print()

/c/en/caffeine                 1979706    0.0

/c/en/keep_awake               219997     0.6564935445785522

/c/en/alkaloid_drug_with_stimulant_action 1343731    0.7606598138809204

/c/en/overcaffeinated/a        182020     0.8577694296836853

/c/en/coke/n                   2135040    1.089383840560913
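The distances above are angular distances, which Annoy defines as the Euclidean distance between normalized vectors, d = sqrt(2 * (1 - cos(u, v))). A small sketch that inverts this relation to read the results as cosine similarities; `angular_to_cosine` is a hypothetical helper, not part of Annoy.

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# distances copied from the query output above
for d in (0.0, 0.6564935445785522, 1.089383840560913):
    print(f"{d:.4f} -> {angular_to_cosine(d):.4f}")
```

Under this relation a distance of 0 means identical direction, sqrt(2) means orthogonal vectors, and 2 means opposite direction.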

