A typical Faiss workflow can be divided into three steps:

1. Construct the training data, expressed as a matrix.
2. Select an appropriate Index (the core component of Faiss) and add the training data to it.
3. Search the Index to get the final result.

## Import libraries


In [1]:
import faiss
import numpy as np

## 1. Construct training data (expressed in matrix form)

In [2]:
%%time
# Path to an entity-embedding TSV file: entity name first, then the vector components
input_file = '/nas/home/binzhang/backup_data/complex/comp_log_dot_0.01/entities_output.tsv'
entity_dict = {}        # bidirectional lookup: entity name -> row index and row index -> entity name
entity_embeddings = []  # one embedding vector per entity, in file order

with open(input_file, 'r') as f:
    for index, line in enumerate(f):
        fields = line.rstrip('\n').split('\t')
        entity_name = fields[0]
        entity_vec = [float(v) for v in fields[1:]]
        entity_embeddings.append(entity_vec)
        entity_dict[entity_name] = index
        entity_dict[index] = entity_name

# Convert the list of embeddings to a float32 matrix, the dtype Faiss expects
X = np.array(entity_embeddings).astype(np.float32)
dimension = X.shape[1]

CPU times: user 1min 7s, sys: 4.97 s, total: 1min 12s
Wall time: 1min 12s
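
The loading loop can be tried out on a tiny in-memory sample before running it on the full file. The sketch below is only an illustration: `sample_tsv`, `load_embeddings`, `name_to_id`, and `id_to_name` are hypothetical names, and the two sample rows are made up. It assumes the same layout as above (name, then tab-separated floats) and keeps the two lookup directions in separate containers instead of one mixed dictionary.

```python
import io
import numpy as np

# Hypothetical two-line sample in the assumed layout: name, then tab-separated floats.
sample_tsv = "ent_a\t0.1\t0.2\t0.3\nent_b\t0.4\t0.5\t0.6\n"

def load_embeddings(handle):
    """Parse `name<TAB>v1<TAB>v2...` lines into lookup tables and a float32 matrix."""
    name_to_id = {}   # entity name -> row index
    id_to_name = []   # row index -> entity name
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name_to_id[fields[0]] = len(id_to_name)
        id_to_name.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    return name_to_id, id_to_name, np.array(rows, dtype=np.float32)

name_to_id, id_to_name, sample_X = load_embeddings(io.StringIO(sample_tsv))
print(sample_X.shape, sample_X.dtype)
```

Passing an open file handle instead of `io.StringIO` would reuse the same function on a real file.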


## 2. Select the appropriate Index (cosine similarity) and add the training data to the Index.

In [3]:
# Build an exact (Flat) index that scores by inner product
vec_index = faiss.index_factory(dimension, "Flat", faiss.METRIC_INNER_PRODUCT)
# L2-normalize all vectors in place so that the inner product equals cosine similarity
faiss.normalize_L2(X)
# Add the vectors to the index
vec_index.add(X)
print(f'vector number in index: {vec_index.ntotal}')

vector number in index: 2160968
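
Why normalizing first turns an inner-product index into a cosine-similarity index can be checked with plain NumPy. The sketch below does not call Faiss at all: `l2_normalize` and `brute_force_topk` are hypothetical helpers that stand in for `faiss.normalize_L2` and an exact inner-product search, and the random toy data is an assumption for illustration only.

```python
import numpy as np

def l2_normalize(mat):
    """Return a row-wise L2-normalized float32 copy (NumPy stand-in, not the Faiss call)."""
    mat = np.array(mat, dtype=np.float32)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.maximum(norms, 1e-12)  # guard against zero-length rows

def brute_force_topk(database, queries, k):
    """Exact top-k by inner product, ordered from most to least similar."""
    scores = queries @ database.T
    top_ids = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, top_ids, axis=1)
    return top_scores, top_ids

rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 8))          # toy data, unnormalized
toy_db = l2_normalize(raw)
toy_scores, toy_ids = brute_force_topk(toy_db, toy_db[:3], k=5)

# Cosine similarity computed directly from the unnormalized vectors
a, b = raw[0], raw[int(toy_ids[0, 1])]
direct_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(direct_cos, toy_scores[0, 1])     # the two values agree up to float32 rounding
```

Because each query here is also a row of the toy database, its best match is itself with a score of about 1, which is the same pattern the search results in this notebook show.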


## 3. Do some searching 


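In the general case, build the query matrix from your own vectors and normalize it the same way as the indexed data:<br>
query_set = [[...],[...],[...]]<br>
query_mat = np.array(query_set).astype(np.float32)<br>
faiss.normalize_L2(query_mat)<br>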
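
The query preparation can be wrapped in a small helper. This is a minimal sketch under stated assumptions: `prepare_queries` is a hypothetical name, and the normalization is done in NumPy as a stand-in for `faiss.normalize_L2`. It returns a 2-D, C-contiguous, float32 matrix with unit-length rows, and it also accepts a single vector.

```python
import numpy as np

def prepare_queries(query_set):
    """Return a 2-D, C-contiguous, float32, row-normalized copy of the queries."""
    q = np.array(query_set, dtype=np.float32, order="C")  # always a fresh copy
    if q.ndim == 1:
        q = q.reshape(1, -1)            # a single vector becomes a 1 x d matrix
    norms = np.linalg.norm(q, axis=1, keepdims=True)
    q /= np.maximum(norms, 1e-12)       # guard against zero-length rows
    return q

single = prepare_queries([3.0, 4.0])
batch = prepare_queries([[1.0, 0.0], [2.0, 2.0]])
print(single.shape, batch.shape)
```

Copying the input means the caller's original vectors are left unnormalized, unlike the in-place normalization used in the cells of this notebook.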

In [4]:
# Simulate queries by reusing vectors that are already in the index
query_ent_indices = list(range(0, 10))  # the first 10 entities
query_ent_vecs = []
for i in query_ent_indices:
    query_ent_vecs.append(X[i])
query_ent_mat = np.array(query_ent_vecs)
# X is already normalized, so this changes little here; it is required for new queries
faiss.normalize_L2(query_ent_mat)
query_ent_mat.shape

(10, 100)

In [7]:
# Choose how many neighbours to return, then run the query
topk = 5
cos_sim, index = vec_index.search(query_ent_mat, topk)  # both results are (num_queries, topk) matrices
print(f'Similarity by FAISS:\n {cos_sim}')
print(f'Index by FAISS:\n {index}')

Similarity by FAISS:
 [[1.         0.9759227  0.9747515  0.97438985 0.97427124]
 [1.         0.99250054 0.99220634 0.9920102  0.9919175 ]
 [1.         0.9914187  0.9910581  0.99104    0.99095213]
 [1.         0.9887444  0.9884101  0.9883417  0.98831266]
 [1.         0.9605595  0.94857997 0.9483327  0.94539094]
 [1.         0.98139644 0.9810555  0.9809668  0.9807242 ]
 [0.99999994 0.9594297  0.957194   0.95686686 0.9543254 ]
 [1.         0.9819611  0.9794824  0.9794055  0.9789906 ]
 [1.0000001  0.9560226  0.9528179  0.95084345 0.94829893]
 [0.99999994 0.99290884 0.99220634 0.9917029  0.99170125]]
Index by FAISS:
 [[      0 1807674 1904214  530173 1037121]
 [      1 1816434  578516 1134567 1126372]
 [      2  594499 1403122  353072 2105736]
 [      3  901349  808479  939125  866220]
 [      4 1045313  943524  528468 1606929]
 [      5 1677129  695595 1347835  168479]
 [      6 1388697   49499  746061  212139]
 [      7 1612105 1411100 1779663  658361]
 [      8 1944259  236965 1784814 10

In [8]:
# Map each returned index back to its entity name and pair it with its similarity
res = []
for row in range(len(index)):
    top5_res = []
    for col in range(len(index[0])):
        ent_name = entity_dict[index[row,col]]
        sim = cos_sim[row,col]
        top5_res.append((ent_name,sim))
    res.append(top5_res)
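
The same mapping can be written as a reusable function. The sketch below is an illustration only: `rows_to_names` and the toy arrays are hypothetical. It also skips negative ids, since Faiss reports `-1` for result slots it could not fill; that does not happen with the Flat index and `topk` used here, so treat that branch as a precaution.

```python
import numpy as np

def rows_to_names(ids, scores, id_to_name):
    """Pair each returned id with its entity name and similarity, one list per query."""
    results = []
    for id_row, score_row in zip(ids, scores):
        results.append([
            (id_to_name[int(i)], float(s))
            for i, s in zip(id_row, score_row)
            if i >= 0  # skip the -1 placeholders for unfilled result slots
        ])
    return results

# Toy stand-ins for the search output; id_to_name may be a list or a dict.
toy_names = ["a", "b", "c"]
toy_ids = np.array([[0, 2], [1, -1]])
toy_scores = np.array([[1.0, 0.5], [1.0, 0.0]], dtype=np.float32)
toy_res = rows_to_names(toy_ids, toy_scores, toy_names)
print(toy_res)
```

With the variables from this notebook, the call would be `rows_to_names(index, cos_sim, entity_dict)`.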

In [10]:
res[0]

[('/c/en/one_drives', 1.0),
 ('/c/en/scoring_homer_ball', 0.9759227),
 ('/c/en/coffee_spilled', 0.9747515),
 ('/c/en/computer_switched_on', 0.97438985),
 ('/c/en/people_love_away', 0.97427124)]

In [11]:
res[1]

[('/c/en/nip_outs/n', 1.0),
 ('/c/en/cenogenetic/a', 0.99250054),
 ('/c/en/banghyangs/n', 0.99220634),
 ('/c/en/louques/n', 0.9920102),
 ('/c/en/dataplanes/n', 0.9919175)]

In [12]:
res

[[('/c/en/one_drives', 1.0),
  ('/c/en/scoring_homer_ball', 0.9759227),
  ('/c/en/coffee_spilled', 0.9747515),
  ('/c/en/computer_switched_on', 0.97438985),
  ('/c/en/people_love_away', 0.97427124)],
 [('/c/en/nip_outs/n', 1.0),
  ('/c/en/cenogenetic/a', 0.99250054),
  ('/c/en/banghyangs/n', 0.99220634),
  ('/c/en/louques/n', 0.9920102),
  ('/c/en/dataplanes/n', 0.9919175)],
 [('/c/en/waycasters/n', 1.0),
  ('/c/en/rosemans/n', 0.9914187),
  ('/c/en/ballantines/n', 0.9910581),
  ('/c/en/futchels/n', 0.99104),
  ('/c/en/holdaways/n', 0.99095213)],
 [('at:to_buy_a_similar_shirt', 1.0),
  ('at:to_continue_working_until_the_end_of_the_day', 0.9887444),
  ('at:to_relax/lounge/sleep', 0.9884101),
  ('at:to_stand_ground', 0.9883417),
  ('at:to_stay_lost', 0.98831266)],
 [('/c/en/eagar', 1.0),
  ('/c/en/entrapped', 0.9605595),
  ('/c/en/nostalgiac', 0.94857997),
  ('/c/en/resistent', 0.9483327),
  ('at:nervous_as_they_attempt_something_new', 0.94539094)],
 [('/c/en/vesperate', 1.0),
  ('/c/en/