# Embedding similarity with annoy


This notebook runs a cosine-similarity nearest-neighbor search with the annoy library.


### Parameters for invoking the notebook

- `cskg_embedding_path`: a .gz file containing the embeddings for all CSKG entities
- `target_entity_name`: name of the query entity; the value below is an example used to search for its neighbors

In [1]:
# Parameters 
cskg_embedding_path = "../output/embeddings/comp_log_dot_0.01.tsv.gz"
target_entity_name = '/c/en/snow_stage'

In [2]:
import gzip
from annoy import AnnoyIndex

###  Prepare data
- read the embeddings from `cskg_embedding_path`
- build a bidirectional entity name/index dictionary for later lookups
- build an annoy index that stores the vectors
- build a forest of `n_trees` trees; more trees give higher precision when querying

In [3]:
%%time
# build a bidirectional entity name <-> index dictionary
entity_dict = {}  # {name1: 0, 0: name1}

# default embedding dimension; replaced by the actual dimension read from the first line
dimension = 100

# declare an annoy index that stores vectors
# 'angular' distance is sqrt(2 * (1 - cos)), so it ranks neighbors by cosine similarity
annoy_index = AnnoyIndex(dimension, 'angular')

with gzip.open(cskg_embedding_path, 'rt') as f:
    for index, line in enumerate(f):
        line = line.split()
        entity_name = line[0]
        entity_vec = [float(i) for i in line[1:]]
        if index == 0:
            # get the dimension from the first vector
            dimension = len(entity_vec)
            # initialize the annoy index with the actual dimension
            annoy_index = AnnoyIndex(dimension, 'angular')
        entity_dict[entity_name] = index
        entity_dict[index] = entity_name
        annoy_index.add_item(index, entity_vec)

# build a forest of 100 trees; signature: build(n_trees, n_jobs=-1)
annoy_index.build(100)

CPU times: user 32min 45s, sys: 4min 45s, total: 37min 30s
Wall time: 1min 55s


True
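
The reading step above can be sketched without annoy, to show the file layout the loop expects: one entity per line, with the name first and the vector components after it, separated by whitespace. The sample data, the helper names `parse_embedding_line` and `read_embeddings`, and the use of two separate dictionaries instead of the notebook's single `entity_dict` are assumptions made for illustration only.

```python
import gzip
import io


def parse_embedding_line(line):
    """Split one whitespace-separated line into (entity_name, vector)."""
    fields = line.split()
    return fields[0], [float(x) for x in fields[1:]]


def read_embeddings(fileobj):
    """Read embeddings from a text file object into two lookup dicts and a vector list."""
    name_to_index, index_to_name, vectors = {}, {}, []
    for index, line in enumerate(fileobj):
        name, vec = parse_embedding_line(line)
        name_to_index[name] = index
        index_to_name[index] = name
        vectors.append(vec)
    return name_to_index, index_to_name, vectors


# Tiny made-up sample, gzip-compressed in memory (hypothetical data, not the CSKG file).
sample = "/c/en/a\t1.0\t0.0\n/c/en/b\t0.0\t1.0\n"
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))

with gzip.open(buffer, "rt") as f:
    name_to_index, index_to_name, vectors = read_embeddings(f)

print(name_to_index)
print(index_to_name[1], vectors[1])
```

Keeping names and indices in separate dictionaries makes each lookup direction explicit; the notebook's single dictionary behaves the same way because string names and integer indices never collide as keys.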

### Search topk neighbors

In [4]:
target_entity_index = entity_dict[target_entity_name]
print(f'query entity name: {target_entity_name}, query entity index: {target_entity_index}',end='\n\n')

# top 5 closest neighbors (the query entity itself comes first, at distance 0)
topk = 5
similar_ents = annoy_index.get_nns_by_item(target_entity_index, topk, include_distances=True)
ent_dis = list(zip(similar_ents[0],similar_ents[1]))

for ent in ent_dis:
    ent_index = ent[0]
    distance = ent[1]
    ent_nam = entity_dict[ent_index]
    print(f'{ent_nam:<30} {ent_index:<10} {distance}')

query entity name: /c/en/snow_stage, query entity index: 14

/c/en/snow_stage               14         0.0
/c/en/tapioca_snow             1082430    0.12174128741025925
/c/en/snow_catch               86853      0.12539340555667877
/c/en/snow_making              1388820    0.12639059126377106
/c/en/wild_snow                424651     0.1297360360622406

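The distances printed above are annoy's angular distances, not cosine similarities. Annoy defines angular distance as the Euclidean distance between normalized vectors, `sqrt(2 * (1 - cos))`, so a distance `d` can be converted back with `cos = 1 - d**2 / 2`. The helper names below are hypothetical and the snippet is only a small sketch of that conversion.

```python
import math


def angular_to_cosine(distance):
    """Convert an annoy 'angular' distance to a cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(cosine):
    """Convert a cosine similarity to an annoy-style 'angular' distance."""
    # max() guards against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


# Distances copied from the output above.
for distance in (0.0, 0.12174128741025925):
    print(f"angular distance {distance} -> cosine similarity {angular_to_cosine(distance):.4f}")
```

With this conversion, the neighbors listed above all have a cosine similarity of roughly 0.99 to the query entity.
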

### Documentation about annoy

`AnnoyIndex(f, metric)` returns a new read-write index that stores vectors of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".

`a.add_item(i, v)` adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.

`a.build(n_trees, n_jobs=-1)` builds a forest of n_trees trees. More trees give higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees; n_jobs=-1 uses all available CPU cores.

`a.save(fn, prefault=False)` saves the index to disk and loads it (see next function). After saving, no more items can be added.

`a.load(fn, prefault=False)` loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.

`a.unload()` unloads.

`a.get_nns_by_item(i, n, search_k=-1, include_distances=False)` returns the n closest items. During the query it will inspect up to search_k nodes, which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between better accuracy and speed. If you set include_distances to True, it will return a 2-element tuple of two lists: the first containing the item indices and the second containing the corresponding distances.

`a.get_nns_by_vector(v, n, search_k=-1, include_distances=False)` does the same, but queries by vector v.
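
Because annoy returns approximate neighbors, it can help to compare a few queries against an exact search on a small sample. The following is a brute-force baseline in plain Python, not annoy's algorithm; the function names and the toy vectors are assumptions made for illustration.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def exact_top_k(query, vectors, k):
    """Return the k (index, similarity) pairs most similar to query, best first."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors (hypothetical data); the query equals item 0.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = exact_top_k([1.0, 0.0], vectors, k=2)
print(neighbors)
```

This scan is linear in the number of items per query, so it is only practical as a spot check; on the full embedding file the annoy index is the intended tool.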

`a.get_item_vector(i)` returns the vector for item i that was previously added.

`a.get_distance(i, j)` returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.

`a.get_n_items()` returns the number of items in the index.

`a.get_n_trees()` returns the number of trees in the index.

`a.on_disk_build(fn)` prepares annoy to build the index in the specified file instead of RAM (execute before adding items; no need to save after build).

`a.set_seed(seed)` initializes the random number generator with the given seed. The seed is only used when building the trees, so it only needs to be set before adding the items. It has no effect after calling a.build(n_trees) or a.load(fn).
