In [61]:
import faiss
from datasets import load_dataset
import tensorflow_hub as hub

### Load the Dataset into a list

In [62]:
dataset = load_dataset("embedding-data/sentence-compression")
# Each "set" entry is a pair of sentences; keep the first one of every pair
sentences = [row["set"][0] for row in dataset["train"]]

100%|██████████| 1/1 [00:00<00:00, 87.72it/s]


### Generate an embedding for each sentence with Universal Sentence Encoder (any other Transformer-based model works as well)

In [63]:
#Universal Sentence Encoder to create sentence embeddings

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [64]:
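# Convert the TensorFlow tensor to a NumPy array, the input format Faiss expects
embeddings = embed(sentences).numpy()
# Embedding dimension, used below when creating each index
shape = embeddings.shape[1]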



In [65]:
import pickle

with open("./sentence-embedding.pkl","wb") as out:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, out)

### IndexFlatL2

In [106]:
#Import the required package
import faiss
import pickle

#Load the sentences and their corresponding embeddings
with open('./sentence-embedding.pkl','rb') as f:
    pkl_embeddings = pickle.load(f)
sentences = pkl_embeddings['sentences']
embeddings = pkl_embeddings['embeddings']

In [107]:
#Create the index of dimension: shape
index = faiss.IndexFlatL2(shape)
index.is_trained #Output: True (a flat index needs no training)


True

In [108]:
index.add(embeddings)

#Retrieve nearest-k results
k = 4
#Generate the embedding for query q
q = embed(["I like the game of chess."]).numpy()

D, result = index.search(q,k)

In [109]:
#Prints the top-k similar sentences to the query from our collection
for item in result[0]:
    print(sentences[item])

Bulgarian top chess player Veselin Topalov, having the White pieces, drew against Gata Kamsky the 1st game of the Chess Challenge match.
A chess grandmaster took on pupils in a Bletchingley school's chess club in a sponsored game.
Bulgarian chess player Veselin Topalov will play white in the first game of chess against Viswanathan Anand, which will take place on Saturday, April 24, FOCUS News Agency.
Chess maestro Viswanathan Anand is CNN-IBN Indian of the Year 2012.
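
To make the search step above concrete, here is a minimal pure-Python sketch of an exhaustive L2 search, the strategy a flat index follows: the query is compared against every stored vector and the k smallest squared distances are kept. The helper names (`squared_l2`, `flat_l2_search`) and the toy vectors are hypothetical and chosen for illustration; this is not Faiss's implementation, which is heavily optimized.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Compare the query with every stored vector and keep the k closest.

    Returns (distances, ids), sorted by increasing distance.
    """
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    nearest = heapq.nsmallest(k, scored)
    return [d for d, _ in nearest], [i for _, i in nearest]

# Toy 2-d "embeddings" (hypothetical values, for illustration only)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
toy_D, toy_I = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_I, toy_D)
```

Because every vector is scanned, the result is exact, but the cost of each query grows linearly with the size of the collection.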


### IndexIVFFlat

In [110]:
#Import the required package
import faiss
import pickle

#Load the sentences and their corresponding embeddings
with open('./sentence-embedding.pkl','rb') as f:
    pkl_embeddings = pickle.load(f)
sentences = pkl_embeddings['sentences']
embeddings = pkl_embeddings['embeddings']

In [111]:
#Define the number of clusters (inverted lists)
nlist = 50
#Create the index
quantizer = faiss.IndexFlatL2(shape)
index = faiss.IndexIVFFlat(quantizer,shape,nlist)


In [112]:
index.is_trained #Output: False (an IVF index must be trained first)
index.train(embeddings)
index.is_trained #Output: True

True

In [113]:
index.add(embeddings)

In [114]:
#Retrieve nearest-k results
k = 4
#Generate the embedding for query q
q = embed(["I like the game of chess."]).numpy()

In [115]:
D, result = index.search(q,k)

In [116]:
#Prints the top-k similar sentences to the query from our collection
for item in result[0]:
    print(sentences[item])

I recently decided that I wanted to make a game.
One thing that I have realised about local football is that those at the helm of our game have not been told that football belongs to the nation.
World in Flames is honestly one of the best games I've played in a long time.
What's That Face is an Internet flash game for PC and notebook is a game of memory.
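
These results are noticeably less relevant than the flat-index results. One likely reason, though the notebook does not confirm it, is that an IVF index scans only a few clusters per query; in Faiss this is controlled by the index's `nprobe` attribute, which defaults to 1. The sketch below illustrates the idea in pure Python under simplifying assumptions: the centroids are hand-written rather than learned by k-means, and all names and values are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_ids(points, query, k):
    """Indices of the k points closest to the query, nearest first."""
    order = sorted(range(len(points)), key=lambda i: squared_l2(points[i], query))
    return order[:k]

def build_inverted_lists(vectors, centroids):
    """Put every vector id into the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        lists[nearest_ids(centroids, v, 1)[0]].append(i)
    return lists

def ivf_search(vectors, centroids, lists, query, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    candidates = [i for c in nearest_ids(centroids, query, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: squared_l2(vectors[i], query))
    return candidates[:k]

# Hypothetical 1-d data with two fixed centroids (no training step here)
toy_centroids = [[0.0], [10.0]]
toy_vectors = [[0.0], [1.0], [4.9], [9.0], [10.0]]
toy_lists = build_inverted_lists(toy_vectors, toy_centroids)

toy_query = [5.5]
hit_one_cell = ivf_search(toy_vectors, toy_centroids, toy_lists, toy_query, k=1, nprobe=1)
hit_two_cells = ivf_search(toy_vectors, toy_centroids, toy_lists, toy_query, k=1, nprobe=2)
print(hit_one_cell, hit_two_cells)
```

With one probed cell the search returns vector 3 (value 9.0) and misses the true nearest neighbour, vector 2 (value 4.9), because that vector sits just across the cell boundary. Probing both cells recovers it, at the cost of scanning more vectors. Raising `nprobe` on the Faiss index before searching trades speed for recall in the same way.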


### Product Quantization


In [101]:
# Define the number of clusters
nlist = 50

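#Number of chunks (sub-vectors) for each embedding; must divide the dimension evenly
m = 8

#Initialize the Faiss index using PQ; the last argument (10) is the number of bits per sub-vector code
quantizer = faiss.IndexFlatL2(shape)
index = faiss.IndexIVFPQ(quantizer, shape, nlist, m, 10)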

In [102]:
index.is_trained #Output: False

False

In [103]:
#Train the index on the embeddings
index.train(embeddings)

In [104]:

index.is_trained #Output: True

True

In [105]:
index.add(embeddings)

k = 4
query = embed(["I like the game of chess."]).numpy()

D, result = index.search(query,k)

#Prints the top-k similar sentences to the query from our collection
for item in result[0]:
    print(sentences[item])

I recently decided that I wanted to make a game.
``This is my last season and I just want to play -- it doesn't matter where,'' Battle said.
You're here because you want to learn how to become a vampire in Elder Scrolls V:
What's That Face is an Internet flash game for PC and notebook is a game of memory.
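
The loss in result quality here is the price of compression. As a rough illustration of what product quantization does, the sketch below splits a vector into chunks and replaces each chunk with the index of its nearest centroid in a per-chunk codebook. The codebooks are hand-written and hypothetical (Faiss learns them during `index.train`), and the size arithmetic assumes 512-dimensional float32 embeddings, which is an assumption about `shape` rather than something the notebook prints.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vector, codebooks):
    """Split the vector into len(codebooks) equal chunks and replace each
    chunk with the index of its nearest centroid in that chunk's codebook."""
    chunk_len = len(vector) // len(codebooks)
    codes = []
    for j, codebook in enumerate(codebooks):
        chunk = vector[j * chunk_len:(j + 1) * chunk_len]
        best = min(range(len(codebook)), key=lambda c: squared_l2(codebook[c], chunk))
        codes.append(best)
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return [x for j, c in enumerate(codes) for x in codebooks[j][c]]

# Hypothetical codebooks: 2 chunks, 2 centroids per chunk
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook for the first chunk
    [[0.0, 1.0], [1.0, 0.0]],  # codebook for the second chunk
]
toy_codes = pq_encode([0.9, 1.2, 0.1, 0.8], toy_codebooks)
toy_approx = pq_decode(toy_codes, toy_codebooks)
print(toy_codes, toy_approx)

# Size arithmetic for m = 8 chunks and 10 bits per code,
# assuming 512-d float32 embeddings (an assumption, see above)
raw_bytes = 512 * 4   # bytes per uncompressed vector
code_bits = 8 * 10    # bits per compressed vector
print(raw_bytes, code_bits // 8)
```

Under these assumptions each vector shrinks from 2048 bytes to a 10-byte code, and distances are computed against the approximate reconstruction rather than the original vector, which is consistent with the less relevant sentences returned above.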
