<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/01_data_preparation_for_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preparation for RAG

Install the required packages that are missing from the Colab VM. Only FAISS (CPU build) and [SentenceTransformers](https://www.sbert.net/) are not available by default.

In [None]:
!pip install faiss-cpu sentence-transformers

Import the necessary packages/classes.

In [None]:
"""Module to cluster embeddings and create indices."""
import faiss

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

## Sample Data

Set the data corpus for this example and put it into a Pandas DataFrame.

In [None]:
data = [['His secret identity is Peter Parker', 'spiderman'],
        ['A businessman and engineer who ' +
         'runs the company Stark Industries',
         'ironman'],
        ['Superhuman spider-powers and abilities ' +
         'after being bitten by a radioactive spider',
         'spiderman'],
        ['A frail man enhanced to the peak of human ' +
         'physical perfection by an experimental super-soldier serum', 'captainamerica']
        ]
df = pd.DataFrame(data, columns = ['text', 'context'])

In [None]:
df.head()

| | text | context |
|---|---|---|
| 0 | His secret identity is Peter Parker | spiderman |
| 1 | A businessman and engineer who runs the compan... | ironman |
| 2 | Superhuman spider-powers and abilities after b... | spiderman |
| 3 | A frail man enhanced to the peak of human phys... | captainamerica |


## Encode Data

Get sentence embeddings for the data corpus, create a FAISS index, and add the embeddings to it after normalizing them to unit length.

In [None]:
# Pass a plain list of strings; encode() accepts a string or a list of strings
text = df['text'].tolist()

# Vectorize - use the sentence-transformers API to generate the sentence embeddings
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
vectors = encoder.encode(text)

# Indexing - use FAISS to create a flat L2 index and add the embeddings to it
vector_dimension = vectors.shape[1]
l2_index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)  # scales each row to unit length, in place
l2_index.add(vectors)
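
Because the vectors are normalized before they are added, the squared L2 distance between two of them is tied to their cosine similarity by `d² = 2 - 2·cos`. The following is a minimal NumPy sketch of that identity. The two vectors are made up for illustration and are not embeddings from the model above.

```python
import numpy as np

# Two made-up vectors standing in for embeddings (illustrative only).
a = np.array([[3.0, 4.0, 0.0]], dtype=np.float32)
b = np.array([[1.0, 2.0, 2.0]], dtype=np.float32)

# Scale each row to unit length, analogous to what faiss.normalize_L2 does.
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)

cosine = float((a_unit * b_unit).sum())
squared_l2 = float(((a_unit - b_unit) ** 2).sum())

# For unit vectors the two printed values agree up to rounding.
print(squared_l2, 2 - 2 * cosine)
```

This is why ranking by L2 distance on normalized vectors gives the same order as ranking by cosine similarity.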

## Similarity Search

Encode a search text and normalize it the same way as the indexed vectors, so it can be used for a similarity search on the FAISS index.

In [None]:
search_text = 'He throws webs'
search_vector = encoder.encode(search_text)
search_vector_as_array = np.array([search_vector])
faiss.normalize_L2(search_vector_as_array)

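Search the index. FAISS computes the distance between the search vector and every vector in the index (`IndexFlatL2` reports squared L2 distances). Setting `k` to `l2_index.ntotal` returns all entries, ordered from nearest to farthest.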

In [None]:
k = l2_index.ntotal
distances, ann = l2_index.search(search_vector_as_array, k=k)
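
A flat index compares the query with every stored vector, so the search can be sketched in plain NumPy. The helper `brute_force_search` and the toy vectors below are hypothetical and only illustrate the shape and ordering of the returned arrays; they are not part of FAISS.

```python
import numpy as np

def brute_force_search(index_vectors, query_vectors, k):
    """Hypothetical helper: exhaustive squared-L2 search over all rows."""
    # Pairwise squared distances, shape (n_queries, n_index).
    diff = query_vectors[:, None, :] - index_vectors[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Positions of the k closest rows per query, nearest first.
    order = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order

# Toy 2-D unit vectors (illustrative only).
toy_index = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32)
toy_query = np.array([[0.6, 0.8]], dtype=np.float32)

toy_distances, toy_ids = brute_force_search(toy_index, toy_query, k=3)
print(toy_ids)        # row positions in the index, nearest first
print(toy_distances)  # matching squared distances, ascending
```

As with `l2_index.search`, both outputs have one row per query, which is why the notebook reads `distances[0]` and `ann[0]` for its single search vector.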

Prepare the results to be displayed in a user-friendly format.

In [None]:
search_results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})
merged_df = pd.merge(search_results, df, left_on='ann', right_index=True)
merged_df.head()

| | distances | ann | text | context |
|---|---|---|---|---|
| 0 | 1.50107 | 2 | Superhuman spider-powers and abilities after b... | spiderman |
| 1 | 1.552392 | 0 | His secret identity is Peter Parker | spiderman |
| 2 | 1.667212 | 1 | A businessman and engineer who runs the compan... | ironman |
| 3 | 1.731641 | 3 | A frail man enhanced to the peak of human phys... | captainamerica |


The results show that the two smallest distances between the search text and
the indexed samples belong to the texts in the spiderman category, so those
samples are the closest matches for the query.
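
The distances can be easier to read as cosine similarities. The sketch below assumes the reported values are squared L2 distances between unit-length vectors (which is what `IndexFlatL2` returns after `faiss.normalize_L2`), so that `cos = 1 - d² / 2`. The numbers are copied from the results table above.

```python
import numpy as np

# Squared L2 distances copied from the results table above.
reported = np.array([1.50107, 1.552392, 1.667212, 1.731641])

# Valid only for unit-length vectors: cos = 1 - d^2 / 2.
cosine_similarity = 1.0 - reported / 2.0

# Roughly 0.249, 0.224, 0.166, 0.134: higher means more similar.
print(np.round(cosine_similarity, 3))
```

Under that assumption, the two spiderman texts have the highest cosine similarity to the query, which matches the ordering by distance.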