**This tutorial walks through the full pipeline, from collecting data to querying the FAISS vector database**

*Step 1: Install libraries*

In [None]:
!pip install sentence-transformers faiss-cpu

*Step 2: Import libraries*

In [None]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

*Step 3: Collect data (sample data used here for demo)*

In [None]:
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Data science is an inter-disciplinary field.",
    "Artificial Intelligence and Machine Learning are transforming industries.",
    "Natural Language Processing is a subfield of AI.",
    "Python is a popular programming language."
] # list of sentences

*Step 4: Preprocess data (skipped in this demo)*
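
Although preprocessing is skipped here, a minimal sketch of what it could look like may help. The helper `clean_text` and the `raw_texts` list below are hypothetical and not part of this notebook; the sketch only collapses whitespace and drops empty entries, assuming that is all the data needs.

```python
import re

def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends.

    Hypothetical helper for illustration; real preprocessing depends on the data.
    """
    return re.sub(r"\s+", " ", text).strip()

# Made-up raw input with stray whitespace and an empty entry.
raw_texts = ["  Python is a popular\nprogramming language.  ", "", "Data   science"]

# Clean every entry and drop those that are empty after cleaning.
cleaned = [t for t in (clean_text(r) for r in raw_texts) if t]
print(cleaned)
```

Sentence embedding models generally handle casing and punctuation themselves, so heavier steps such as lowercasing or stop-word removal are left out of this sketch on purpose.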

*Step 5: Convert data to vector embeddings*

In [None]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2') # pre-trained sentence embedding model
embeddings = model.encode(texts) # each sentence is converted to a 384-dimensional vector

*Step 6: Print the vector embeddings*

In [None]:
print(embeddings)

[[ 0.5897934  -0.23598339 -0.25411704 ...  0.14036213  1.0559162
   0.5301816 ]
 [-0.0943914  -0.05818905 -0.223293   ... -0.21564417  0.35591027
   0.01862814]
 [-0.27970096 -0.5744981   0.2825321  ... -0.31499138 -0.07452106
  -0.25046727]
 [-0.21402569 -0.9086564  -0.28901276 ...  0.56085414  0.31967756
  -0.23817158]
 [-0.15595555 -1.0370271  -0.8438109  ...  0.3458398   0.7568847
  -0.10728178]]


*Step 7: Initialize a FAISS index using L2 distance (exact brute-force search)*

In [None]:
embeddings = np.array(embeddings).astype('float32')
index = faiss.IndexFlatL2(embeddings.shape[1])
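
To make "brute force" concrete, here is a rough NumPy-only sketch of the computation a flat L2 index performs: compare the query against every stored vector and keep the `k` smallest squared Euclidean distances. The function `brute_force_l2` and the toy arrays are assumptions for illustration; FAISS itself is far more optimized.

```python
import numpy as np

def brute_force_l2(database: np.ndarray, queries: np.ndarray, k: int):
    """Return (squared distances, indices) of the k nearest database rows per query."""
    # Pairwise differences, shape (n_queries, n_database, dim).
    diffs = queries[:, None, :] - database[None, :, :]
    # Squared Euclidean distance for every query/database pair.
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, closest first.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-dimensional data (made up for this sketch).
toy_db = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")

toy_dist, toy_idx = brute_force_l2(toy_db, toy_query, k=2)
print(toy_idx)   # row indices of the two nearest vectors
print(toy_dist)  # their squared L2 distances
```

The cost grows linearly with the number of stored vectors, which is fine for five sentences but is the reason approximate index types exist for large collections.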

*Step 8: Add the embeddings*

In [None]:
index.add(embeddings)

*Step 9: Encode the query as a vector*

In [None]:
query = 'what is machine learning?'
query_embedding = model.encode([query]) # Converting the query into vector
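
FAISS expects its inputs as 2-D `float32` arrays of shape `(n, d)`, which is why the stored embeddings were cast earlier. If a query ever arrives as a plain list or a 1-D array, a small guard like the following could normalize it first. `as_float32_matrix` is a hypothetical helper, not part of FAISS or this notebook.

```python
import numpy as np

def as_float32_matrix(vectors) -> np.ndarray:
    """Coerce input to a C-contiguous float32 array with shape (n, d)."""
    # A single 1-D vector becomes a matrix with one row.
    arr = np.atleast_2d(np.asarray(vectors, dtype="float32"))
    return np.ascontiguousarray(arr)

# A single 3-dimensional vector given as a Python list (made-up values).
single = as_float32_matrix([0.1, 0.2, 0.3])
print(single.shape, single.dtype)
```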

*Step 10: Perform a search in the FAISS index (retrieving the top 2 closest vectors)*

In [None]:
k = 2 # retrieve the 2 vectors closest to the query vector
distance, indices = index.search(query_embedding, k)

*Step 11: Display the results (distances are squared L2)*

In [None]:
print(f'query:{query}')
for i in range(k):
    print(f"Rank {i+1}:")
    print(f"Text: {texts[indices[0][i]]}")
    print(f"Distance: {distance[0][i]}")

query:what is machine learning?
Rank 1:
Text: Artificial Intelligence and Machine Learning are transforming industries.
Distance: 42.75959014892578
Rank 2:
Text: Natural Language Processing is a subfield of AI.
Distance: 54.093990325927734
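
The distances reported by `IndexFlatL2` are squared Euclidean distances, so their square root gives the plain Euclidean distance. The sketch below also checks a standard identity: for unit-length vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so L2 ranking and cosine ranking agree once vectors are normalized. The helper `l2_normalize` and the random vectors are assumptions for illustration only.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero-length rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Two random unit vectors standing in for embeddings.
rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=(1, 8)))
b = l2_normalize(rng.normal(size=(1, 8)))

sq_l2 = float(np.sum((a - b) ** 2))   # squared Euclidean distance
cosine = float(np.sum(a * b))         # cosine similarity of unit vectors
print(np.isclose(sq_l2, 2 - 2 * cosine))

# Converting the Rank 1 squared distance printed above to a Euclidean distance.
reported_sq_distance = 42.75959014892578
euclidean = float(np.sqrt(reported_sq_distance))
print(round(euclidean, 2))
```

The embeddings in this notebook are not normalized, which is one reason the raw distances look large; only their relative order matters for ranking.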
