A typical Faiss workflow has three steps:

1. Construct the training data as a matrix (one vector per row).
2. Select an appropriate Index (the core component of Faiss) and add the training data to it.
3. Search the Index to get the final results.

## Import libraries

In [4]:
import faiss
import numpy as np

## Setup directories

In [6]:
input_file='../output/embeddings/entity_embedding_100.tsv'

## 1. Construct training data (expressed in matrix form)

In [8]:
%%time
# read the entity embedding TSV file specified above
entity_dict = {}        # bidirectional dictionary: entity name <-> row index
entity_embeddings = []  # all the embeddings, one list of floats per entity

with open(input_file, 'r') as f:
    for index,line in enumerate(f):
        line = line.split('\t')
        entity_name = line[0]
        entity_vec =  [ float(i) for i in line[1].split()]
        entity_embeddings.append(entity_vec)
        entity_dict[entity_name] = index
        entity_dict[index] = entity_name
        
# convert the list of embeddings to a float32 matrix (Faiss works on float32)
X = np.array(entity_embeddings, dtype=np.float32)
dimension = X.shape[1]

CPU times: user 1min 5s, sys: 2.96 s, total: 1min 8s
Wall time: 1min 8s
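
The loop above assumes that each line holds an entity name, a tab, and then whitespace-separated floats. As a minimal sketch of that parsing step, the following runs on a two-line in-memory sample using only the standard library. The sample rows and the `parse_embeddings` helper are hypothetical and not part of the notebook; the sketch also keeps the name-to-row and row-to-name mappings in separate containers rather than one mixed dictionary, which is a design choice of the sketch, not the notebook's approach.

```python
import io

def parse_embeddings(lines):
    """Parse 'name<TAB>v1 v2 ...' lines into names, a name->row map, and vectors."""
    names = []          # row index -> entity name
    name_to_row = {}    # entity name -> row index
    vectors = []        # one list of floats per entity
    for row, line in enumerate(lines):
        # Drop the trailing newline, then split the name from the vector text.
        name, raw_vec = line.rstrip('\n').split('\t')[:2]
        names.append(name)
        name_to_row[name] = row
        vectors.append([float(v) for v in raw_vec.split()])
    return names, name_to_row, vectors

# Hypothetical sample in the assumed file format.
sample = io.StringIO("ent_a\t0.1 0.2 0.3\nent_b\t0.4 0.5 0.6\n")
names, name_to_row, vectors = parse_embeddings(sample)
print(names, name_to_row, len(vectors), len(vectors[0]))
```

Separate containers avoid any clash between integer keys and entity names, at the cost of one extra list.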


## 2. Select the appropriate Index (cosine similarity) and add the training data to the Index.

In [10]:
# build an exact ("Flat") index; inner product on L2-normalized vectors equals cosine similarity
vec_index = faiss.index_factory(dimension, "Flat", faiss.METRIC_INNER_PRODUCT)
# normalize all vectors in place so that the inner product gives cosine similarity
faiss.normalize_L2(X)
# add vectors to the index
vec_index.add(X)
print(f'number of vectors in the index: {vec_index.ntotal}')

number of vectors in the index: 2160968
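
The cell above relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity, which is why `faiss.normalize_L2` is applied before using `METRIC_INNER_PRODUCT`. The following standard-library sketch checks that identity on two small hand-picked vectors. The helper names (`l2_normalize`, `dot`, `cosine`) are illustrative and are not Faiss functions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Return a copy of vec scaled to unit L2 norm."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product after normalization vs. cosine similarity of the raw vectors.
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(ip_normalized, cosine(a, b))
```

The two printed values agree up to floating-point rounding. The same kind of rounding is a likely reason a vector's similarity with itself can come out marginally above 1 in float32.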


## 3. Do some searching 


Typical case: build a float32 query matrix and L2-normalize it before searching:<br>
query_set = [[...],[...],[...]]<br>
query_mat = np.array(query_set).astype(np.float32)<br>
faiss.normalize_L2(query_mat)<br>

In [11]:
# mimic some query data by reusing vectors from the training data
query_ent_indices = list(range(0,10)) # first 10 entities
query_ent_vecs = []
for i in query_ent_indices:
    query_ent_vecs.append(X[i])
query_ent_mat = np.array(query_ent_vecs)
# X is already normalized, so this only changes values by float rounding
faiss.normalize_L2(query_ent_mat)
query_ent_mat.shape

(10, 100)

In [12]:
# set topk, then run the query
topk = 5
cos_sim, index = vec_index.search(query_ent_mat, topk) # both are (n_queries, topk) matrices
print(f'Similarity by FAISS:\n {cos_sim}')
print(f'Index by FAISS:\n {index}')

Similarity by FAISS:
 [[1.         0.5627408  0.560689   0.5505445  0.5227501 ]
 [1.         0.63397956 0.62087405 0.61736274 0.6091253 ]
 [1.0000001  0.5501119  0.492526   0.4534185  0.45262545]
 [1.         0.79032516 0.7316501  0.6795417  0.59318507]
 [1.0000001  0.91658026 0.5612824  0.5433717  0.5390162 ]
 [1.0000001  0.643106   0.62717223 0.6216079  0.61048394]
 [1.         0.8696474  0.60579693 0.5931927  0.5783482 ]
 [1.         0.91595125 0.56046623 0.5452393  0.5428423 ]
 [1.         0.70229137 0.5333149  0.5329284  0.50476944]
 [1.         0.5583645  0.50156873 0.49650416 0.4837718 ]]
Index by FAISS:
 [[      0 1373344 1783919 1900104 1067663]
 [      1  301387 1437270  824150 1862389]
 [      2 1506143  453263 1231726 1540573]
 [      3  138325 2021129 1541039  893304]
 [      4  814217  801442 1711612 1655948]
 [      5  184425  751910 1939500  152700]
 [      6 1181189  410574  505696 1375741]
 [      7 1167554  217863 1002028 1651761]
 [      8 1805679 1422176 1185354 16

In [13]:
# collect (entity name, similarity) pairs for each query
res = []
for row in range(len(index)):
    top5_res = []
    for col in range(len(index[0])):
        ent_name = entity_dict[index[row,col]]
        sim = cos_sim[row,col]
        top5_res.append((ent_name,sim))
    res.append(top5_res)
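
To make the result structure concrete, here is a brute-force sketch of what an exhaustive inner-product search followed by the name lookup above amounts to: score every stored vector against the query, keep the `k` best, and map row numbers back to names. It uses only the standard library and a hypothetical three-vector dataset; `top_k_by_inner_product` is an illustrative helper, not a Faiss function, and it does not reflect how Faiss implements the search internally.

```python
import heapq

def top_k_by_inner_product(query, vectors, names, k):
    """Return the k (name, score) pairs with the largest inner product."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), row)
        for row, vec in enumerate(vectors)
    )
    best = heapq.nlargest(k, scored)  # sorted by score, highest first
    return [(names[row], score) for score, row in best]

# Hypothetical unit-length vectors, so the inner product is the cosine similarity.
names = ["east", "north", "north_east"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

result = top_k_by_inner_product(vectors[2], vectors, names, k=2)
print(result)
```

As in the notebook's results, the query vector itself comes back first because it is part of the indexed data.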

In [14]:
res[0]

[('/c/en/saltyback/a', 1.0),
 ('/c/en/soured', 0.5627408),
 ('/c/en/cheesed_off/a', 0.560689),
 ('/c/en/p.o_ed/a', 0.5505445),
 ('/c/en/torqued_off/a', 0.5227501)]

In [15]:
res[1]

[('at:to_see_what_they_are_not_showing_them', 1.0),
 ('at:personx_keeps_persony_from_seeing', 0.63397956),
 ('at:to_be_able_to_keep_them_from_seeing_it', 0.62087405),
 ('at:to_obscure_something', 0.61736274),
 ('at:to_find_out_what_x_has', 0.6091253)]

In [16]:
res

[[('/c/en/saltyback/a', 1.0),
  ('/c/en/soured', 0.5627408),
  ('/c/en/cheesed_off/a', 0.560689),
  ('/c/en/p.o_ed/a', 0.5505445),
  ('/c/en/torqued_off/a', 0.5227501)],
 [('at:to_see_what_they_are_not_showing_them', 1.0),
  ('at:personx_keeps_persony_from_seeing', 0.63397956),
  ('at:to_be_able_to_keep_them_from_seeing_it', 0.62087405),
  ('at:to_obscure_something', 0.61736274),
  ('at:to_find_out_what_x_has', 0.6091253)],
 [('/c/en/pot_cheese/n', 1.0000001),
  ('/c/en/unaged', 0.5501119),
  ('/c/en/ricotta', 0.492526),
  ('/c/en/crumbly', 0.4534185),
  ('/c/en/sirt/n', 0.45262545)],
 [('/c/en/set/n/wn/cognition', 1.0),
  ('/c/en/bent/n/wn/cognition', 0.79032516),
  ('/c/en/hang/n/wn/cognition', 0.7316501),
  ('/c/en/knack/n/wn/cognition', 0.6795417),
  ('/c/en/shot/a/wn', 0.59318507)],
 [('/c/en/automatic_drive/n/wn/artifact', 1.0000001),
  ('/c/en/automatic_transmission/n/wn/artifact', 0.91658026),
  ('/c/en/pickeerers/n', 0.5612824),
  ('/c/en/gruners/n', 0.5433717),
  ('/c/en/rash