In [1]:
import numpy as np
import faiss
import requests
from io import StringIO
import pandas as pd

# **📁1. Load the dataset**

In [26]:
res = requests.get('https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt')

text = res.text
text[:100]

'pair_ID\tsentence_A\tsentence_B\trelatedness_score\tentailment_judgment\n1\tA group of kids is playing in '

In [27]:
data = pd.read_csv(StringIO(text), sep='\t')
data.head()

|   | pair_ID | sentence_A | sentence_B | relatedness_score | entailment_judgment |
|---|---|---|---|---|---|
| 0 | 1 | A group of kids is playing in a yard and an ol... | A group of boys in a yard is playing and a man... | 4.5 | NEUTRAL |
| 1 | 2 | A group of children is playing in the house an... | A group of kids is playing in a yard and an ol... | 3.2 | NEUTRAL |
| 2 | 3 | The young boys are playing outdoors and the ma... | The kids are playing outdoors near a man with ... | 4.7 | ENTAILMENT |
| 3 | 5 | The kids are playing outdoors near a man with ... | A group of kids is playing in a yard and an ol... | 3.4 | NEUTRAL |
| 4 | 9 | The young boys are playing outdoors and the ma... | A group of kids is playing in a yard and an ol... | 3.7 | NEUTRAL |


We will take all samples from `sentence_A` and build sentence embeddings for each - which we can then store in FAISS.

In [28]:
sentences = data['sentence_A'].tolist()
sentences[:5]

['A group of kids is playing in a yard and an old man is standing in the background',
 'A group of children is playing in the house and there is no man standing in the background',
 'The young boys are playing outdoors and the man is smiling nearby',
 'The kids are playing outdoors near a man with a smile',
 'The young boys are playing outdoors and the man is smiling nearby']

In [29]:
sentence_b = data['sentence_B'].tolist()
sentences.extend(sentence_b)
len(set(sentences))

4802

### **This isn't a particularly large number, so let's pull in a few more similar datasets.**

In [30]:
urls = [
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.train.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2013/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/images.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/images.test.tsv'
]

In [31]:
for url in urls:
    res = requests.get(url)
    # extract to dataframe
    data = pd.read_csv(StringIO(res.text), sep='\t', header=None, on_bad_lines='skip')
    # add columns 1 and 2 to the sentences list
    sentences.extend(data[1].tolist())
    sentences.extend(data[2].tolist())

In [32]:
len(set(sentences))

14505

# **📦2. Save as .txt file**

In [33]:
# remove duplicates and NaN
sentences = [
    sentence.replace('\n', '') for sentence in list(set(sentences)) if type(sentence) is str
    ]

In [34]:
with open('sentences.txt', 'w') as fp:
    fp.write('\n'.join(sentences))
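
Because `set()` does not guarantee the same string order from one Python session to the next, the line order of the saved file is what later ties each sentence to its row of embeddings. Below is a minimal sketch of the save/load round trip. The temporary path and the three-sentence `demo_sentences` list are illustrative stand-ins, not part of the notebook.

```python
import os
import tempfile

# Hypothetical stand-in for the de-duplicated `sentences` list built above.
demo_sentences = [
    "a computer server",
    "Write a computer programme.",
    "create code or computer programs",
]

# Write one sentence per line, mirroring the cell above (temporary path for illustration).
path = os.path.join(tempfile.mkdtemp(), "sentences.txt")
with open(path, "w", encoding="utf-8") as fp:
    fp.write("\n".join(demo_sentences))

# Read the sentences back in exactly the order they were written.
with open(path, "r", encoding="utf-8") as fp:
    loaded_sentences = fp.read().split("\n")

print(loaded_sentences == demo_sentences)
```

Reloading from the file, rather than rebuilding the list with `set()`, keeps index `i` in the sentence list pointing at the same text as row `i` of the stored embeddings.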

# **🧠3. Sentence Embedding**

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

model = model.to("cuda")  # Move model to GPU


We can save the embeddings to file and load them again later, in case we need to reload the notebook for any reason.

In [35]:
sen = model.encode("Saher", batch_size=32, show_progress_bar=True, device="cuda")
sen.shape

(768,)

In [36]:
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(14504, 768)

In [37]:
sentence_embeddings.shape[0]

14504

In [39]:
# Save the first 256 embeddings to embeddings_X.npy as a quick test (the ./sim_sentences/ folder must already exist)
with open('./sim_sentences/embeddings_X.npy', 'wb') as fp:
    np.save(fp, sentence_embeddings[0:256])

# **📚4. Save Embedding in Chunks**

In [43]:
# Save the embeddings in chunks of 256 rows to keep the individual files small.
# Example: 1000 embeddings produce 4 files, embeddings_0.npy to embeddings_3.npy;
# every file holds 256 embeddings except the last one, which holds the remainder.
split = 256
n_rows = sentence_embeddings.shape[0]
for file_count, i in enumerate(range(0, n_rows, split)):
    end = min(i + split, n_rows)  # clamp the final chunk to the end of the array
    with open(f'./sim_sentences/embeddings_{file_count}.npy', 'wb') as fp:
        np.save(fp, sentence_embeddings[i:end, :])
    print(f"embeddings_{file_count}.npy | {i} -> {end}")

embeddings_0.npy | 0 -> 256
embeddings_1.npy | 256 -> 512
embeddings_2.npy | 512 -> 768
embeddings_3.npy | 768 -> 1024
embeddings_4.npy | 1024 -> 1280
embeddings_5.npy | 1280 -> 1536
embeddings_6.npy | 1536 -> 1792
embeddings_7.npy | 1792 -> 2048
embeddings_8.npy | 2048 -> 2304
embeddings_9.npy | 2304 -> 2560
embeddings_10.npy | 2560 -> 2816
embeddings_11.npy | 2816 -> 3072
embeddings_12.npy | 3072 -> 3328
embeddings_13.npy | 3328 -> 3584
embeddings_14.npy | 3584 -> 3840
embeddings_15.npy | 3840 -> 4096
embeddings_16.npy | 4096 -> 4352
embeddings_17.npy | 4352 -> 4608
embeddings_18.npy | 4608 -> 4864
embeddings_19.npy | 4864 -> 5120
embeddings_20.npy | 5120 -> 5376
embeddings_21.npy | 5376 -> 5632
embeddings_22.npy | 5632 -> 5888
embeddings_23.npy | 5888 -> 6144
embeddings_24.npy | 6144 -> 6400
embeddings_25.npy | 6400 -> 6656
embeddings_26.npy | 6656 -> 6912
embeddings_27.npy | 6912 -> 7168
embeddings_28.npy | 7168 -> 7424
embeddings_29.npy | 7424 -> 7680
embeddings_30.npy | 7680 -> 7
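
To reload the chunks later, the files have to be read back in numeric order (a plain alphabetical sort would put `embeddings_10.npy` before `embeddings_2.npy`). The sketch below is one way to do that. The helper name `load_embedding_chunks` is hypothetical, and the demo uses random data in a temporary folder instead of the real embeddings.

```python
import glob
import os
import re
import tempfile

import numpy as np


def load_embedding_chunks(folder):
    """Load every embeddings_<n>.npy file in `folder` and stack them in numeric order."""
    numbered = []
    for path in glob.glob(os.path.join(folder, "embeddings_*.npy")):
        match = re.fullmatch(r"embeddings_(\d+)\.npy", os.path.basename(path))
        if match:  # skips non-numeric names such as embeddings_X.npy
            numbered.append((int(match.group(1)), path))
    numbered.sort()  # numeric sort: chunk 10 comes after chunk 9, not after chunk 1
    return np.concatenate([np.load(path) for _, path in numbered], axis=0)


# Self-contained check: random data stands in for the real embeddings.
demo_dir = tempfile.mkdtemp()
demo = np.random.rand(600, 8).astype("float32")
for n, start in enumerate(range(0, demo.shape[0], 256)):
    np.save(os.path.join(demo_dir, f"embeddings_{n}.npy"), demo[start:start + 256])

restored = load_embedding_chunks(demo_dir)
print(restored.shape)
```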

# **📊5. FAISS implementation**

In [45]:
d = sentence_embeddings.shape[1]
d

768

## **IndexFlatL2**


We initialize the flat L2 distance index `IndexFlatL2`. All we need to do is specify the vector dimensionality, which in this case is `d == 768` (to align with the sentence-BERT model output embeddings of size `768`).

In [46]:
index = faiss.IndexFlatL2(d)

Often, we will use indexes that require us to `train` them on our data before use (for example, when the index groups or transforms the data in some way). `IndexFlatL2`, however, only calculates distances between the stored vectors and our query vector `xq` at search time. So, in this case, no training is required, which we can confirm by checking the `is_trained` attribute.

In [42]:
index.is_trained

True

In [47]:
index.add(sentence_embeddings)

In [48]:
index.ntotal

14504

Then we search, given a query `xq` and the number of nearest neighbors to return, `k`.

In [74]:
k = 20
xq = model.encode(["something that is related to computer scince and software engineering"], device="cuda")

In [75]:
%%time
D, I = index.search(xq, k)  # search
print(I)  # indices of the k nearest neighbors of the query vector

[[14283  9805   638 11555  9098  8724 13055   181   900  3075   239  5093
   4152  5693   462 11370  3221  5681  5387 14133]]
CPU times: user 2.57 ms, sys: 0 ns, total: 2.57 ms
Wall time: 2 ms
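
To make "exhaustive" concrete, here is a rough NumPy sketch of what a flat L2 search computes: the squared L2 distance from the query to every stored vector, followed by a sort. It is not how FAISS is implemented internally. The function name `flat_l2_search` and the random `xb_demo`/`xq_demo` arrays are assumptions for illustration only.

```python
import numpy as np


def flat_l2_search(xb, xq, k):
    """Exhaustive search: squared L2 distance from each query to every stored vector."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all (query, vector) pairs
    dists = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(axis=1)
    ids = np.argsort(dists, axis=1)[:, :k]          # k smallest distances per query
    return np.take_along_axis(dists, ids, axis=1), ids


rng = np.random.default_rng(0)
xb_demo = rng.random((1000, 16), dtype=np.float32)  # random stand-in for the embeddings
xq_demo = xb_demo[:1].copy()                        # query with a known nearest neighbor (row 0)

D_demo, I_demo = flat_l2_search(xb_demo, xq_demo, k=5)
print(I_demo)
```

Like the `D` returned by `IndexFlatL2`, the distances here are squared L2 distances, so the ranking is the same as for true Euclidean distance.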


The search returns indices into our `sentences` list. Mapping them back to text gives:

In [76]:
[f'{i}: {sentences[i]}' for i in I[0]]

['14283: create code or computer programs',
 '9805: (computer science) a program designed for general support of the processes of a computer.',
 '638: a computer program that performs system support',
 '11555: the primary information-processing component of a computer, of a microprocessor chip',
 '9098: create code, write a computer program.',
 '8724: (computer science) the part of a computer (a microprocessor chip) that does most of the data processing.',
 '13055: write a computer program.',
 '181: Write a computer programme.',
 '900: an image that is generated by a computer.',
 '3075: obtain or retrieve from a storage device; as of information on a computer.',
 '239: Obtain from a storage device, as of information on a computer.',
 '5093: the ability of computers to exchange digital information between them and make use of it',
 '4152: a computer server',
 '5693: (computer science) matter that is held in a computer and is typed or printed on paper.',
 '462: (computer science) the abi

In [77]:
sentences[5693]

'(computer science) matter that is held in a computer and is typed or printed on paper.'

Clearly we have some good matches: everything returned relates to computers, computer programs, or computer science. Now, if we'd rather extract the numerical vectors from FAISS, we can do that too.

In [78]:
vecs = np.zeros((k, d))
for i, val in enumerate(I[0].tolist()):
    vecs[i, :] = index.reconstruct(val)

In [79]:
vecs.shape

(20, 768)

In [80]:
vecs[0][:100]

array([ 0.09391017, -0.09907819,  1.24320161,  0.45920768,  0.27596977,
       -0.21246122,  0.10113402,  0.8577866 ,  0.21929623, -0.0233299 ,
       -0.22599718,  0.30393201,  0.44359925,  0.67798698, -0.68578678,
       -0.27133012, -0.72963232, -0.63668185,  0.04318038, -0.072131  ,
       -0.55964434, -0.45221475,  0.06749705, -0.49995908,  0.01495562,
        0.11168762, -0.00734872, -0.41647235, -0.99219048, -0.09018803,
       -0.78419036, -0.45411903,  1.25811863, -0.86799634,  0.39596888,
        0.69694793,  0.13326345,  0.74794763,  0.12864989, -0.21554826,
       -0.36342981, -0.7541362 ,  0.60532933,  0.90233713, -0.42201591,
       -0.60834563, -0.75992382,  0.51503366, -0.02454788, -0.34889325,
       -1.0836519 ,  0.90496981,  0.53311056,  0.65428096, -0.02545676,
        0.73733759,  1.13248825, -1.69231999,  0.07980368,  0.67105812,
        0.38921386,  0.01704967,  0.33874604,  0.420077  ,  0.06318744,
        0.02256099,  0.40185109, -0.36326689, -1.47796416, -0.82

## **IndexIVFFlat**

In [91]:
nlist = 50 # number of Voronoi cells (clusters)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

Here we've added a new parameter `nlist`. We use `nlist` to define how many partitions we'd like our index to have. 

When we built the previous, `IndexFlatL2`-only index, we noted that no training was required as no grouping/transformation was required to build that index. Now that we've added partitioning using `IndexIVFFlat`, this is no longer the case. Let's take a look at the `is_trained` attribute.

In [92]:
index.is_trained

False

In [93]:
index.train(sentence_embeddings)
index.is_trained

True

In [94]:
index.add(sentence_embeddings)
index.ntotal

14504
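
The partitioning idea can be sketched in plain NumPy: "training" places one centroid per cell, "adding" files each vector under its nearest centroid, and searching scans only the `nprobe` closest cells. This is a simplified illustration under stated assumptions (random data, ten rounds of naive k-means, hypothetical names such as `ivf_search`), not the FAISS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
xb_demo = rng.random((2000, 16), dtype=np.float32)   # random stand-in for the embeddings
xq_demo = rng.random((1, 16), dtype=np.float32)
nlist_demo, k_demo = 20, 5


def sq_l2(a, b):
    """Pairwise squared L2 distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)


# "Training": a few rounds of k-means to place one centroid per cell.
centroids = xb_demo[rng.choice(len(xb_demo), nlist_demo, replace=False)].copy()
for _ in range(10):
    assign = sq_l2(xb_demo, centroids).argmin(axis=1)
    for c in range(nlist_demo):
        members = xb_demo[assign == c]
        if len(members):                      # leave a centroid unchanged if its cell is empty
            centroids[c] = members.mean(axis=0)

# "Adding": each vector is filed under its nearest centroid.
assign = sq_l2(xb_demo, centroids).argmin(axis=1)


def ivf_search(xq, k, nprobe):
    """Scan only the vectors stored in the nprobe cells closest to the query."""
    cells = np.argsort(sq_l2(xq, centroids)[0])[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, cells))
    dists = sq_l2(xq, xb_demo[candidates])[0]
    return candidates[np.argsort(dists)[:k]]


exact = np.argsort(sq_l2(xq_demo, xb_demo)[0])[:k_demo]
print(ivf_search(xq_demo, k_demo, nprobe=1))           # one cell only: may miss true neighbors
print(ivf_search(xq_demo, k_demo, nprobe=nlist_demo))  # all cells: same as exhaustive search
print(exact)
```

Probing every cell reduces to the exhaustive search, which is why raising `nprobe` moves the results toward those of the flat index.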

Let's search again using the same indexed sentence embeddings and the same query `xq`.

In [95]:
%%time
D, I = index.search(xq, k)  # search
print(I)

[[14283   638  3075   239  5093  4152  1504  5549 13285  5211  9494  5422
   2773  2663 10276  1478  5445  7346  1019  5919]]
CPU times: user 952 µs, sys: 0 ns, total: 952 µs
Wall time: 730 µs


We can also increase the number of nearby cells to search with `nprobe`.

In [98]:
index.nprobe = 5

In [99]:
%%time
D, I = index.search(xq, k)  # search
print(I)

[[14283  9805   638 11555  9098  8724   181  3075   239  5093  4152   462
   3221  5681  7155  9872 10157  6721  9951  1504]]
CPU times: user 896 µs, sys: 0 ns, total: 896 µs
Wall time: 486 µs


Increasing `nprobe` improves the accuracy of our search, but costs time. Our earlier `IndexFlatL2`-only search was *exhaustive* (it compared every single vector), so it identified the closest matches with perfect accuracy. The smaller our `nprobe` value, the smaller the scope that we search. With `nprobe = 5`, the first six results (`14283`, `9805`, `638`, `11555`, `9098`, `8724`) match our previous `IndexFlatL2`-only results, and 14 of the 20 returned IDs appear in the exhaustive top 20, but some neighbors, such as `13055` and `900`, are still missed. If we need closer agreement, we can simply bump `nprobe` up further, improving accuracy but increasing the time taken too.
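
One simple way to quantify this trade-off is recall against the exhaustive result: the fraction of the flat index's top `k` that an approximate index also returns. The sketch below applies that to the ID lists printed above. The helper name `recall_at_k` is an assumption for illustration.

```python
def recall_at_k(reference_ids, candidate_ids):
    """Fraction of the reference neighbors that also appear in the candidate list."""
    reference = set(reference_ids)
    return len(reference & set(candidate_ids)) / len(reference)


# ID lists copied from the printed search outputs above.
flat_ids = [14283, 9805, 638, 11555, 9098, 8724, 13055, 181, 900, 3075,
            239, 5093, 4152, 5693, 462, 11370, 3221, 5681, 5387, 14133]
ivf_nprobe1_ids = [14283, 638, 3075, 239, 5093, 4152, 1504, 5549, 13285, 5211,
                   9494, 5422, 2773, 2663, 10276, 1478, 5445, 7346, 1019, 5919]
ivf_nprobe5_ids = [14283, 9805, 638, 11555, 9098, 8724, 181, 3075, 239, 5093,
                   4152, 462, 3221, 5681, 7155, 9872, 10157, 6721, 9951, 1504]

print(recall_at_k(flat_ids, ivf_nprobe1_ids))  # 0.3
print(recall_at_k(flat_ids, ivf_nprobe5_ids))  # 0.7
```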

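It's worth noting that the time taken can change with each run too: if we rerun the search, we usually get a different time. The execution count shows that the cell below was run earlier in the session, and its neighbor list also differs slightly from the one above.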

In [89]:
%%time
D, I = index.search(xq, k)
print(I)

[[14283  9805   638 11555  9098  8724 13055   181   900  3075   239  5093
   4152  5693   462 11370  3221  5681  9063  5518]]
CPU times: user 826 µs, sys: 297 µs, total: 1.12 ms
Wall time: 584 µs


For IVF (and IMI) indexes, before attempting to use the `reconstruct` method, we need to call the `make_direct_map` method. Otherwise, a `RuntimeError` is raised.

In [102]:
index.make_direct_map()

In [103]:
index.reconstruct(14283)[:100] # first 100 components of the stored vector with ID 14283

array([ 0.09391017, -0.09907819,  1.2432016 ,  0.45920768,  0.27596977,
       -0.21246122,  0.10113402,  0.8577866 ,  0.21929623, -0.0233299 ,
       -0.22599718,  0.303932  ,  0.44359925,  0.677987  , -0.6857868 ,
       -0.27133012, -0.7296323 , -0.63668185,  0.04318038, -0.072131  ,
       -0.55964434, -0.45221475,  0.06749705, -0.49995908,  0.01495562,
        0.11168762, -0.00734872, -0.41647235, -0.9921905 , -0.09018803,
       -0.78419036, -0.45411903,  1.2581186 , -0.86799634,  0.39596888,
        0.69694793,  0.13326345,  0.74794763,  0.12864989, -0.21554826,
       -0.3634298 , -0.7541362 ,  0.60532933,  0.90233713, -0.4220159 ,
       -0.6083456 , -0.7599238 ,  0.51503366, -0.02454788, -0.34889325,
       -1.0836519 ,  0.9049698 ,  0.53311056,  0.65428096, -0.02545676,
        0.7373376 ,  1.1324883 , -1.69232   ,  0.07980368,  0.6710581 ,
        0.38921386,  0.01704967,  0.33874604,  0.420077  ,  0.06318744,
        0.02256099,  0.4018511 , -0.3632669 , -1.4779642 , -0.82


## **Quantization**

In [104]:
m = 8  # number of chunks in compressed vectors (sub-vectors)
bits = 8 # number of bits per chunk

quantizer = faiss.IndexFlatL2(d)  # we keep the same L2 distance flat index
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits) 

In [105]:
index.is_trained

False

In [106]:
index.train(sentence_embeddings)

In [107]:
index.add(sentence_embeddings)
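
To illustrate what `m` and `bits` control, here is a stripped-down NumPy sketch of product quantization: each vector is split into `m` sub-vectors, and each sub-vector is replaced by the index of its nearest entry in a codebook of `2 ** bits` centroids. Several things here are assumptions for brevity: the data is random, the codebooks are random samples of the data instead of trained k-means centroids, and the IVF step is left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d_demo, m_demo, bits_demo = 768, 8, 8
n_demo = 1000
xb_demo = rng.standard_normal((n_demo, d_demo)).astype(np.float32)  # random stand-in data

sub_d = d_demo // m_demo          # 96 dimensions per sub-vector
n_codes = 2 ** bits_demo          # 256 codebook entries per sub-space

# Split each vector into m sub-vectors: shape (n, m, sub_d).
sub_vectors = xb_demo.reshape(n_demo, m_demo, sub_d)

# Assumption for brevity: each codebook is a random sample of the data, not k-means output.
codebooks = np.stack([
    sub_vectors[rng.choice(n_demo, n_codes, replace=False), j]
    for j in range(m_demo)
])                                # shape (m, 256, sub_d)

# Encode: for each sub-vector, store the index of its nearest codebook entry.
codes = np.empty((n_demo, m_demo), dtype=np.uint8)
for j in range(m_demo):
    sv, cb = sub_vectors[:, j], codebooks[j]
    dists = (sv ** 2).sum(axis=1, keepdims=True) - 2.0 * sv @ cb.T + (cb ** 2).sum(axis=1)
    codes[:, j] = dists.argmin(axis=1)

# Decode: look the codebook entries up again to get an approximate vector.
approx = codebooks[np.arange(m_demo), codes].reshape(n_demo, d_demo)

print(xb_demo[0].nbytes, "bytes per raw float32 vector")
print(codes[0].nbytes, "bytes per PQ code")
```

With `m = 8` and `bits = 8`, each 768-dimensional float32 vector (3072 bytes) is stored as 8 one-byte codes. The decoded vectors are only approximations of the originals, which is where the loss of accuracy comes from.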

Let's search with an `nprobe` value of `10` and compare the result to our previous index *without* PQ.

In [108]:
index.nprobe = 10

In [109]:
%%time
D, I = index.search(xq, k)
print(I)

[[  239   638  3075  4152  5093 14283   181  9098  5681  9805   900  5518
  11370  3221  8724 11555  9063 10157  3948   462]]
CPU times: user 533 µs, sys: 167 µs, total: 700 µs
Wall time: 417 µs


With PQ added, this search took about 417 µs of wall time, compared with roughly 490 to 730 µs for the `IndexIVFFlat` searches above. That is a small difference on a dataset of this size (and single-run timings are noisy), but when scaled to larger datasets it can make a huge difference.

Now, we should also notice the slightly different results being returned. Beforehand, with our exhaustive L2 search, the top four results were `14283`, `9805`, `638`, and `11555`. Now we see a different order (the top four are `239`, `638`, `3075`, and `4152`) and four vectors that were not in the exhaustive top 20 at all: `5518`, `9063`, `10157`, and `3948`.

Each of our speed optimizations, `IVF` and `PQ`, comes at the cost of accuracy. If we print out these results, however, we find that each item is still a relevant match:

In [110]:
[f'{i}: {sentences[i]}' for i in I[0]]

['239: Obtain from a storage device, as of information on a computer.',
 '638: a computer program that performs system support',
 '3075: obtain or retrieve from a storage device; as of information on a computer.',
 '4152: a computer server',
 '5093: the ability of computers to exchange digital information between them and make use of it',
 '14283: create code or computer programs',
 '181: Write a computer programme.',
 '9098: create code, write a computer program.',
 '5681: printed version of electronic data from a computer',
 '9805: (computer science) a program designed for general support of the processes of a computer.',
 '900: an image that is generated by a computer.',
 '5518: an electronic memory device.',
 '11370: a computer-generated visual image',
 '3221: (computer science) a computer that provides client stations with access to files and printers as shared resources to a computer network.',
 '8724: (computer science) the part of a computer (a microprocessor chip) that does mo

## **So although we might not get the *perfect* result, we still get close, and we get a significant speed boost**