This notebook provides code to do the following:
1. Download the Cohere/wikipedia-22-12 (English) dataset and dump the first 5 million rows into a JSON Lines file (`texts.json`)
2. Encode the texts in this file using BAAI/bge-base-en-v1.5 and save the embeddings
3. Build an index using AutoFaiss
4. Search using Faiss

In [None]:
# !pip install -U datasets autofaiss
# !pip install git+https://github.com/UKPLab/sentence-transformers.git

In [1]:
from datasets import load_dataset
import time
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
import torch
# from tqdm import tqdm
import os
import shutil
import faiss

device = "cuda"
batch_size = 1200
num_proc = 12




#### Encoding Texts

In [2]:
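model = SentenceTransformer("BAAI/bge-base-en-v1.5", trust_remote_code=True)
model.max_seq_length = 256  # truncate inputs to 256 tokens

if device == "cuda":
    # Note: torch.compile returns a wrapper around the model; the worker
    # processes started below may still run the original, uncompiled module.
    model = torch.compile(model)

# One worker process per GPU; adjust target_devices to the GPUs available.
pool = model.start_multi_process_pool(target_devices=["cuda:0", "cuda:1"])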


In [None]:
data = load_dataset("Cohere/wikipedia-22-12", "en", split="train", streaming=False, num_proc=num_proc)
print(len(data))

In [None]:
for i, x in enumerate(data):
    print(x)
    text = x['text']
    print(f"{len(text.split())} : {text}")
    print()
    if i == 2:
        break

In [None]:
data = data.select(range(5_000_000))  # Keep the first 5 million rows
data

In [9]:
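# Keep only title and text, and prefix each passage with its page title.
data = data.to_pandas()[['title', 'text']]
data['combined'] = "Page Title: " + data['title'].str.strip() + "\n" + data['text'].str.strip()
data = data.drop(columns=['title', 'text'])
data = data.rename(columns={'combined': 'text'})
# Write one JSON object per line (JSON Lines).
data.to_json("texts.json", orient="records", lines=True)

del data  # free the DataFrame before encoding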

In [2]:
texts = pd.read_json("texts.json", orient="records", lines=True)
texts = texts['text'].to_list()

len(texts), texts[:5]

(5000000,
 ['Page Title: Deaths in 2022\nThe following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence:',
  'Page Title: YouTube\nYouTube is a global online video sharing and social media platform headquartered in San Bruno, California. It was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim. It is owned by Google, and is the second most visited website, after Google Search. YouTube has more than 2.5 billion monthly users who collectively watch more than one billion hours of videos each day. , videos were being uploaded at a rate of more than 500 hours of content per minute.',
  "Page Title: YouTube\nIn October 2006, YouTube was bought by Google for $1.65\xa0billion. Google's ownership of YouTube expanded the site's business model, expanding from generating revenue from advertisements alone, to offering paid content such as movies and exclusive con

In [4]:
# embeddings = model.encode(texts, normalize_embeddings=True, batch_size=batch_size, show_progress_bar=True)
embeddings = model.encode_multi_process(texts, 
                                        normalize_embeddings=True, 
                                        show_progress_bar=True, 
                                        batch_size=batch_size, 
                                        pool=pool)
model.stop_multi_process_pool(pool)
embeddings.shape

Chunks:   0%|          | 0/1000 [00:00<?, ?it/s]

(5000000, 768)
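
The embeddings are produced with `normalize_embeddings=True`, and the index further down is built with `--metric_type=ip`. The two choices fit together because the inner product of unit-length vectors equals their cosine similarity. The following is a minimal sketch of that identity on random stand-in vectors (the helper `l2_normalize` and the random data are illustrative assumptions, not part of the notebook):

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper for illustration)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 768)).astype(np.float32)  # stand-in for model outputs
unit = l2_normalize(raw)

# Cosine similarity computed from the raw, unnormalized vectors.
raw_norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / (raw_norms[:, None] * raw_norms[None, :])

# Plain inner product of the normalized vectors.
inner = unit @ unit.T

# The two matrices agree up to floating-point error.
print(np.allclose(cosine, inner, atol=1e-4))
```

With this in mind, the "distance" values reported by an inner-product index can be read as cosine similarities, as long as the query vectors are normalized too.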

In [5]:
np.save("embeddings.npy", embeddings)

In [6]:
embeddings = np.load("embeddings.npy")
embeddings.shape

(5000000, 768)
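
Loading the full array back with `np.load` pulls everything into RAM. Assuming the embeddings are stored as float32, a 5,000,000 x 768 array takes about 15.4 GB. If only slices are needed, one option is NumPy's memory-mapped loading. A small sketch on a throwaway file (the helper `embedding_bytes` and the demo file name are assumptions for illustration):

```python
import os
import tempfile

import numpy as np


def embedding_bytes(num_vectors: int, dim: int, dtype=np.float32) -> int:
    """Size in bytes of a dense (num_vectors, dim) array of the given dtype."""
    return num_vectors * dim * np.dtype(dtype).itemsize


# Size of the full embedding matrix, assuming float32 storage.
full_size = embedding_bytes(5_000_000, 768)
print(f"{full_size / 1e9:.2f} GB")

# Demonstrate memory-mapped loading on a tiny stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_embeddings.npy")
    np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

    mapped = np.load(path, mmap_mode="r")  # data stays on disk until sliced
    first_rows = np.array(mapped[:2])      # copy only the rows that are needed
    del mapped                             # release the file before cleanup

print(first_rows.shape)
```

For the real file the same call would be `np.load("embeddings.npy", mmap_mode="r")`.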

#### Indexing using AutoFaiss

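Run this in a terminal from the directory that contains `embeddings.npy`. `--embeddings=.` tells autofaiss to read the `.npy` files in the current directory. To run it from a notebook cell instead, prefix the command with `!`.

```bash
autofaiss build_index --embeddings=. \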
                    --index_path="knn.index" \
                    --index_infos_path="infos.json" \
                    --metric_type=ip \
                    --max_index_query_time_ms=20 \
                    --max_index_memory_usage=12GB
```

```
2024-09-10 19:18:47,510 [INFO]: Using 255 omp threads (processes), consider increasing --nb_cores if you have more
2024-09-10 19:18:47,511 [INFO]: Launching the whole pipeline 09/10/2024, 19:18:47
2024-09-10 19:18:47,511 [INFO]: Reading total number of vectors and dimension 09/10/2024, 19:18:47
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 953.68it/s]
2024-09-10 19:18:47,553 [INFO]: There are 5000000 embeddings of dim 768
2024-09-10 19:18:47,554 [INFO]: >>> Finished "Reading total number of vectors and dimension" in 0.0425 secs
2024-09-10 19:18:47,554 [INFO]:         Compute estimated construction time of the index 09/10/2024, 19:18:47
2024-09-10 19:18:47,554 [INFO]:                 -> Train: 16.7 minutes
2024-09-10 19:18:47,554 [INFO]:                 -> Add: 42.9 seconds
2024-09-10 19:18:47,554 [INFO]:                 Total: 17.4 minutes
2024-09-10 19:18:47,554 [INFO]:         >>> Finished "Compute estimated construction time of the index" in 0.0002 secs
2024-09-10 19:18:47,554 [INFO]:         Checking that your have enough memory available to create the index 09/10/2024, 19:18:47
2024-09-10 19:18:47,555 [INFO]: 3.1GB of memory will be needed to build the index (more might be used if you have more)
2024-09-10 19:18:47,556 [INFO]:         >>> Finished "Checking that your have enough memory available to create the index" in 0.0011 secs
2024-09-10 19:18:47,556 [INFO]:         Selecting most promising index types given data characteristics 09/10/2024, 19:18:47
2024-09-10 19:18:47,556 [INFO]:         >>> Finished "Selecting most promising index types given data characteristics" in 0.0001 secs
2024-09-10 19:18:47,556 [INFO]:         Creating the index 09/10/2024, 19:18:47
2024-09-10 19:18:47,556 [INFO]:                 -> Instanciate the index OPQ256_1024,IVF16384_HNSW32,PQ256x8 09/10/2024, 19:18:47
2024-09-10 19:18:47,558 [INFO]:                 >>> Finished "-> Instanciate the index OPQ256_1024,IVF16384_HNSW32,PQ256x8" in 0.0015 secs
2024-09-10 19:18:47,558 [INFO]: The index size will be approximately 1.3GB
2024-09-10 19:18:47,558 [INFO]:                 -> Extract training vectors 09/10/2024, 19:18:47
2024-09-10 19:18:47,558 [INFO]: Will use 1638400 vectors to train the index, that will use 7.9GB of memory
 99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 100/101 [00:01<00:00, 56.96it/s]
2024-09-10 19:18:49,320 [INFO]:                 >>> Finished "-> Extract training vectors" in 1.7621 secs
2024-09-10 19:18:49,320 [INFO]:                 -> Training the index with 1638400 vectors of dim 768 09/10/2024, 19:18:49
2024-09-10 19:38:08,113 [INFO]:                 >>> Finished "-> Training the index with 1638400 vectors of dim 768" in 1158.7928 secs
2024-09-10 19:38:08,120 [INFO]:                 -> Adding the vectors to the index 09/10/2024, 19:38:08
2024-09-10 19:38:08,121 [INFO]: The memory available for adding the vectors is 30.7GB(total available - used by the index)
2024-09-10 19:38:08,121 [INFO]: Using a batch size of 325520 (memory overhead 953.7MB)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 308/308 [03:22<00:00,  1.52it/s]
2024-09-10 19:41:30,186 [INFO]:         Computing best hyperparameters for index knn.index 09/10/2024, 19:41:30
2024-09-10 19:41:41,377 [INFO]:         >>> Finished "Computing best hyperparameters for index knn.index" in 11.1903 secs
2024-09-10 19:41:41,377 [INFO]: The best hyperparameters are: nprobe=56,efSearch=112,ht=2048
2024-09-10 19:41:41,377 [INFO]:         Compute fast metrics 09/10/2024, 19:41:41
  0%|                                                                                                                                | 0/1 [00:00<?, ?it/s]
2024-09-10 19:41:52,371 [INFO]:         >>> Finished "Compute fast metrics" in 10.9933 secs
2024-09-10 19:41:52,371 [INFO]:         Saving the index on local disk 09/10/2024, 19:41:52
2024-09-10 19:41:53,057 [INFO]:         >>> Finished "Saving the index on local disk" in 0.6861 secs
2024-09-10 19:41:53,058 [INFO]:                 >>> Finished "-> Adding the vectors to the index" in 224.9378 secs
2024-09-10 19:41:53,059 [INFO]: {
2024-09-10 19:41:53,059 [INFO]:         index_key: OPQ256_1024,IVF16384_HNSW32,PQ256x8
2024-09-10 19:41:53,059 [INFO]:         index_param: nprobe=56,efSearch=112,ht=2048
2024-09-10 19:41:53,059 [INFO]:         index_path: knn.index
2024-09-10 19:41:53,059 [INFO]:         size in bytes: 1395890992
2024-09-10 19:41:53,059 [INFO]:         avg_search_speed_ms: 17.881135380620044
2024-09-10 19:41:53,059 [INFO]:         99p_search_speed_ms: 67.61361030643457
2024-09-10 19:41:53,059 [INFO]:         reconstruction error %: 8.969146758317947
2024-09-10 19:41:53,059 [INFO]:         nb vectors: 5000000
2024-09-10 19:41:53,059 [INFO]:         vectors dimension: 768
2024-09-10 19:41:53,059 [INFO]:         compression ratio: 11.003724565907937
2024-09-10 19:41:53,059 [INFO]: }
2024-09-10 19:41:53,059 [INFO]:         >>> Finished "Creating the index" in 1385.5028 secs
2024-09-10 19:41:53,059 [INFO]: >>> Finished "Launching the whole pipeline" in 1385.5478 secs
(<faiss.swigfaiss_avx2.IndexPreTransform; proxy of <Swig Object of type 'faiss::IndexPreTransform *' at 0x7f2ae0dee430> >, {'index_key': 'OPQ256_1024,IVF16384_HNSW32,PQ256x8', 'index_param': 'nprobe=56,efSearch=112,ht=2048', 'index_path': 'knn.index', 'size in bytes': 1395890992, 'avg_search_speed_ms': 17.881135380620044, '99p_search_speed_ms': 67.61361030643457, 'reconstruction error %': 8.969146758317947, 'nb vectors': 5000000, 'vectors dimension': 768, 'compression ratio': 11.003724565907937})
```
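
The summary at the end of the log can be checked by hand. The sketch below recomputes the reported compression ratio from the vector count, dimension, and index size, assuming float32 embeddings (4 bytes per value). It also estimates how much of the index is taken by the product-quantization codes: `PQ256x8` means 256 sub-quantizers of 8 bits each, so 256 bytes per vector.

```python
# Values copied from the autofaiss log above.
NUM_VECTORS = 5_000_000
DIM = 768
INDEX_BYTES = 1_395_890_992  # "size in bytes"

# Raw size of the embeddings, assuming float32 (4 bytes per value).
raw_bytes = NUM_VECTORS * DIM * 4
compression_ratio = raw_bytes / INDEX_BYTES

# PQ256x8: 256 codes of 8 bits each, so 256 bytes per stored vector.
pq_code_bytes = NUM_VECTORS * 256
other_bytes = INDEX_BYTES - pq_code_bytes

print(f"raw embeddings:    {raw_bytes / 1e9:.2f} GB")
print(f"compression ratio: {compression_ratio:.4f}")
print(f"PQ codes:          {pq_code_bytes / 1e9:.2f} GB")
print(f"everything else:   {other_bytes / 1e6:.1f} MB")
```

The recomputed ratio matches the `compression ratio` line in the log, which supports the float32 assumption. The remainder beyond the PQ codes presumably covers items such as vector ids, the coarse quantizer, and the OPQ matrix; the log does not break this down.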

#### Searching using Faiss

In [3]:
my_index = faiss.read_index("knn.index")

In [7]:
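# For a real query, encode it the same way as the corpus:
# query_vector = model.encode(["2019 england vs nz final"], normalize_embeddings=True)
# Random placeholder queries; Faiss expects float32 input.
query_vector = np.random.rand(2, 768).astype(np.float32)
print(query_vector.shape)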

(2, 768)
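
The index is approximate (the autofaiss log reports a reconstruction error of about 9%), so it can be useful to compare its results against exact inner-product search on a sample of the embeddings. A minimal NumPy sketch of such a check follows; the function names `exact_top_k` and `recall_at_k` and the random demo data are illustrative assumptions.

```python
import numpy as np


def exact_top_k(embeddings: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force top-k by inner product; returns (scores, indices) per query."""
    scores = queries @ embeddings.T                 # (n_queries, n_vectors)
    top = np.argsort(-scores, axis=1)[:, :k]        # highest scores first
    return np.take_along_axis(scores, top, axis=1), top


def recall_at_k(approx_indices: np.ndarray, exact_indices: np.ndarray) -> float:
    """Fraction of exact neighbours that also appear in the approximate results."""
    hits = 0
    for approx_row, exact_row in zip(approx_indices, exact_indices):
        hits += len(set(approx_row.tolist()) & set(exact_row.tolist()))
    return hits / exact_indices.size


# Demo on random unit vectors; the queries are rows of the sample itself.
rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 64)).astype(np.float32)
sample /= np.linalg.norm(sample, axis=1, keepdims=True)
demo_queries = sample[:2]

exact_scores, exact_ids = exact_top_k(sample, demo_queries, k=5)
print(exact_ids[:, 0])                      # each query's best match is itself
print(recall_at_k(exact_ids, exact_ids))    # comparing a result with itself
```

In practice `approx_indices` would be the `indices` array returned by the Faiss index for the same queries and the same `k`, restricted to the vectors included in the exact search.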


In [8]:
k = 50
distances, indices = my_index.search(query_vector, k)

# With metric_type=ip the "distance" is an inner-product score: higher is better.
# Only the results for the first query are printed.
print(f"Top {k} elements in the dataset for max inner product search:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"{i+1}: Vector number {idx:4} with distance {dist}")

Top 50 elements in the dataset for max inner product search:
1: Vector number 824959 with distance 0.8938260078430176
2: Vector number 4213526 with distance 0.8907451033592224
3: Vector number 151502 with distance 0.8811548352241516
4: Vector number 809290 with distance 0.8779027462005615
5: Vector number 451094 with distance 0.8758558034896851
6: Vector number 115634 with distance 0.8739829659461975
7: Vector number 607017 with distance 0.8587794303894043
8: Vector number 2788736 with distance 0.8580100536346436
9: Vector number 735933 with distance 0.8552708625793457
10: Vector number 406645 with distance 0.8550938963890076
11: Vector number 2681531 with distance 0.8511126041412354
12: Vector number 4941162 with distance 0.8498995304107666
13: Vector number 1318268 with distance 0.8495214581489563
14: Vector number 1059932 with distance 0.8490103483200073
15: Vector number 2737127 with distance 0.8378748893737793
16: Vector number 4540061 with distance 0.8353170156478882
17: Vector n
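
The loop above prints results for the first query only, while `search` returns one row per query. Faiss also fills a row with `-1` when it finds fewer than `k` results. A small sketch of collecting hits for every query while skipping those placeholders (the helper `hits_per_query` and the demo arrays are assumptions for illustration):

```python
import numpy as np


def hits_per_query(indices, distances, texts):
    """Return, for each query, a list of (text, score) pairs, skipping -1 ids."""
    results = []
    for row_ids, row_scores in zip(indices, distances):
        hits = [
            (texts[int(i)], float(s))
            for i, s in zip(row_ids, row_scores)
            if i != -1  # -1 marks an empty result slot
        ]
        results.append(hits)
    return results


# Stand-in data shaped like the output of index.search for 2 queries, k=3.
demo_texts = ["a", "b", "c"]
demo_indices = np.array([[2, 0, -1], [1, -1, -1]])
demo_distances = np.array([[0.9, 0.5, 0.0], [0.7, 0.0, 0.0]], dtype=np.float32)

demo_hits = hits_per_query(demo_indices, demo_distances, demo_texts)
for query_number, hits in enumerate(demo_hits):
    print(query_number, [text for text, _ in hits])
```

With the notebook's variables the call would be `hits_per_query(indices, distances, texts)`.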

In [9]:
# Map the result indices of the first query back to their texts
top_texts = [texts[i] for i in indices[0]]

for t in top_texts:
    print(t)
    print()


Page Title: Child's Play (franchise)
Mancini confirmed his intent to make the crossover film, describing it as "cool and doable", saying that the film would be titled "Child's Play on Elm Street" and that he was in discussion with New Line Cinema about the film. The foundation for a crossover film was previously laid in "Bride of Chucky", in which Freddy Krueger's bladed glove had appeared in the Lockport police evidence depository at the beginning of the film.

Page Title: Resident Evil: The Umbrella Chronicles
"The Umbrella Chronicles" encompasses several scenarios, which are based on various plot elements from the "Resident Evil" series. The game's first three scenarios, "Train Derailment", "The Mansion Incident", and "Raccoon's Destruction", are set within Raccoon City, or its surrounding area, in 1998. The final scenario, which is based on new material, is set in Russia, taking place in 2003. The game features nine playable characters from previous "Resident Evil" games, including