## Documentation

`AnnoyIndex(f, metric)` returns a new read-write index that stores vectors of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".

`a.add_item(i, v)` adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.

`a.build(n_trees, n_jobs=-1)` builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

`a.save(fn, prefault=False)` saves the index to disk and loads it (see next function). After saving, no more items can be added.

`a.load(fn, prefault=False)` loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.

`a.unload()` unloads.

`a.get_nns_by_item(i, n, search_k=-1, include_distances=False)` returns the n closest items. During the query it will inspect up to search_k nodes, which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between accuracy and speed: a larger value is more accurate but slower. If you set include_distances to True, it will return a 2-element tuple of two lists: the first contains the item indices and the second the corresponding distances.
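
To see what the accuracy side of the search_k tradeoff means in practice, one option is to compare the approximate result against an exact linear scan on a small sample. The sketch below is plain Python and does not call Annoy; `exact_nns` and `recall` are hypothetical helper names, and cosine similarity is assumed as the ranking criterion (matching the 'angular' metric).

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_u * norm_v)


def exact_nns(vectors, query, n):
    """Ids of the n vectors most similar to query, found by scanning everything."""
    scored = ((cosine_similarity(vec, query), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nlargest(n, scored)]


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbours that also appear in the approximate result."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Here `approx_ids` would be the list returned by `get_nns_by_item` or `get_nns_by_vector`. A linear scan is only practical on a subset of the data, so this is a spot check rather than a full evaluation.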

`a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)` same but query by vector v.

`a.get_item_vector(i)` returns the vector for item i that was previously added.

`a.get_distance(i, j)` returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.
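
For the "angular" metric, the Annoy README defines the distance as the Euclidean distance between the normalized vectors, i.e. sqrt(2(1 - cos(u, v))). The following is a minimal pure-Python sketch of that formula, useful for sanity-checking values by hand; `angular_distance` is a hypothetical helper, not part of the Annoy API.

```python
import math


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angular distance is undefined for zero vectors")
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.sqrt(2.0 * (1.0 - cos))
```

Under this definition the distance ranges from 0 (same direction) to 2 (opposite directions), and orthogonal vectors are sqrt(2) apart.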

`a.get_n_items()` returns the number of items in the index.

`a.get_n_trees()` returns the number of trees in the index.

`a.on_disk_build(fn)` prepares annoy to build the index in the specified file instead of RAM (execute before adding items; no need to save after build).

`a.set_seed(seed)` will initialize the random number generator with the given seed. It is only used for building the trees, i.e. it only needs to be called before adding the items. It has no effect after calling a.build(n_trees) or a.load(fn).


## Import library

In [34]:
from annoy import AnnoyIndex

## Prepare Data and Index

In [2]:
# specify a certain entity embedding tsv file
input_file = '/nas/home/binzhang/backup_data/complex/comp_log_dot_0.01/entities_output.tsv'

In [3]:
# build a bidirectional entity name <-> index dictionary
entity_dict = {}  # {name1: 0, 0: name1}

# entity dimension
dimension = 100  

# build an index that stores vectors
annoy_index = AnnoyIndex(dimension, 'angular')  # angular distance = sqrt(2 * (1 - cos))

In [4]:
%%time
with open(input_file, 'r') as f:
    for index, line in enumerate(f):
        # one entity per line: name, then the embedding values, tab-separated
        fields = line.rstrip('\n').split('\t')
        entity_name = fields[0]
        entity_vec = [float(x) for x in fields[1:]]
        entity_dict[entity_name] = index
        entity_dict[index] = entity_name
        annoy_index.add_item(index, entity_vec)

CPU times: user 55.6 s, sys: 1.29 s, total: 56.9 s
Wall time: 56.9 s
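
The loading loop above trusts every line to carry exactly `dimension` values. A slightly more defensive variant can be sketched as a generator that skips blank lines and reports malformed rows. `iter_embeddings` is a hypothetical helper, and the sample entity names in the usage lines are made up for illustration.

```python
import io


def iter_embeddings(handle, dimension):
    """Yield (index, name, vector) for each non-blank tab-separated line."""
    index = 0
    for line_number, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip('\r\n').split('\t')
        name = fields[0]
        vector = [float(x) for x in fields[1:]]
        if len(vector) != dimension:
            raise ValueError(
                f'line {line_number}: expected {dimension} values, got {len(vector)}'
            )
        yield index, name, vector
        index += 1


# Usage with an in-memory sample (hypothetical names, 2 dimensions).
sample = io.StringIO('/c/en/cat\t0.1\t0.2\n\n/c/en/dog\t0.3\t0.4\n')
rows = list(iter_embeddings(sample, dimension=2))
```

In the notebook, each yielded triple would feed `entity_dict` and `annoy_index.add_item(index, vector)` in place of the inline parsing.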


In [5]:
# Build a forest of n_trees trees. More trees give higher precision when querying.
annoy_index.build(100) # build(n_trees, n_jobs=-1)

True

## Get top-k entities

In [6]:
entity_dict

{'/c/en/one_drives': 0,
 0: '/c/en/one_drives',
 '/c/en/nip_outs/n': 1,
 1: '/c/en/nip_outs/n',
 '/c/en/waycasters/n': 2,
 2: '/c/en/waycasters/n',
 'at:to_buy_a_similar_shirt': 3,
 3: 'at:to_buy_a_similar_shirt',
 '/c/en/eagar': 4,
 4: '/c/en/eagar',
 '/c/en/vesperate': 5,
 5: '/c/en/vesperate',
 '/c/en/favorableness/n/wn/attribute': 6,
 6: '/c/en/favorableness/n/wn/attribute',
 '/c/en/irenical/a': 7,
 7: '/c/en/irenical/a',
 '/c/en/except_in_eke_out': 8,
 8: '/c/en/except_in_eke_out',
 '/c/en/connecting_hose': 9,
 9: '/c/en/connecting_hose',
 '/c/en/adic/a': 10,
 10: '/c/en/adic/a',
 '/c/en/antirhinoviral/a': 11,
 11: '/c/en/antirhinoviral/a',
 '/c/en/finding_fishing_place': 12,
 12: '/c/en/finding_fishing_place',
 '/c/en/channery/a': 13,
 13: '/c/en/channery/a',
 '/c/en/snow_stage': 14,
 14: '/c/en/snow_stage',
 '/c/en/dubkis/n': 15,
 15: '/c/en/dubkis/n',
 '/c/en/stephanomeria_malheurensis/n/wn/plant': 16,
 16: '/c/en/stephanomeria_malheurensis/n/wn/plant',
 "at:ask_persony's_landl

In [7]:
# pick an example: 
target_entity_name = '/c/en/one_drives' # entity_name = '/c/en/goateed'
target_entity_index = entity_dict[target_entity_name]
target_entity_name, target_entity_index

('/c/en/one_drives', 0)

In [31]:
# top 5 closest neighbors; here the query item itself comes back first, at distance 0.0
similar_ents = annoy_index.get_nns_by_item(target_entity_index, 5, include_distances=True)
ent_dis = list(zip(similar_ents[0], similar_ents[1]))  # (index, distance) pairs

for ent_index, distance in ent_dis:
    ent_name = entity_dict[ent_index]
    print(f'{ent_name:<30} {ent_index:<10} {distance}')

/c/en/one_drives               0          0.0
/c/en/scoring_homer_ball       1807674    0.2194414883852005
/c/en/coffee_spilled           1904214    0.22471533715724945
/c/en/computer_switched_on     530173     0.2263190597295761
/c/en/people_love_away         1037121    0.2268425077199936
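
The distances printed above are angular distances, not cosine similarities. Assuming the README definition d = sqrt(2(1 - cos)), they can be converted back with cos = 1 - d²/2. The sketch below applies that to the values from the output; `angular_to_cosine` is a hypothetical helper name.

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# Distances copied from the notebook output above.
distances = [
    0.0,
    0.2194414883852005,
    0.22471533715724945,
    0.2263190597295761,
    0.2268425077199936,
]

for d in distances:
    print(f'{d:<22} {angular_to_cosine(d):.4f}')
```

A small angular distance therefore corresponds to a cosine similarity close to 1.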
