# Similarity search on pandas DataFrame text column using LLMs

**Steps:**
1. Load CSV file using pandas.  
2. Embed the text column.  
3. Create a flat Euclidean (L2) index with Faiss.  
4. Retrieve similar rows from the pandas DataFrame based on the query.  

**Substeps of Step 4:** 
1. Take the input query.  
2. Embed the input query.  
3. Search the flat Euclidean (L2) index and get the indexes of the nearest vectors.  
4. Retrieve the matching rows from the pandas DataFrame using the returned indexes.  
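
The four substeps can be sketched as one function. This is a minimal, hypothetical illustration rather than the notebook's own code: `toy_embed` is a stand-in for a real embedding model (it only counts letters), a NumPy brute-force search stands in for the Faiss index, and the names `toy_embed` and `retrieve_rows` are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real embedder: a 26-dim letter-count vector."""
    vec = np.zeros(26, dtype=np.float32)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def retrieve_rows(query: str, frame: pd.DataFrame, vectors: np.ndarray, k: int = 2) -> pd.DataFrame:
    # Substeps 1-2: take the query and embed it.
    query_vector = toy_embed(query)
    # Substep 3: squared Euclidean distance to every stored vector, smallest first.
    distances = ((vectors - query_vector) ** 2).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    # Substep 4: look the rows up by position.
    return frame.iloc[nearest]


frame = pd.DataFrame({"text": ["red shirt", "blue jeans", "rock concert"]})
vectors = np.stack([toy_embed(t) for t in frame["text"]])
result = retrieve_rows("red shirts", frame, vectors, k=1)
print(result)
```

Keeping the embedder and the search separate from the row lookup makes it easy to swap the toy pieces for a real model and a real index later.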

![alt text](../images/pandasRagf.png)

In [9]:
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

## Load the CSV file using pandas

In [3]:
data = pd.read_csv('../data/sample_text.csv')
display(data)

| | text | category |
|---|---|---|
| 0 | Meditation and yoga can improve mental health | Health |
| 1 | Fruits, whole grains and vegetables helps cont... | Health |
| 2 | These are the latest fashion trends for this week | Fashion |
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
| 4 | The concert starts at 7 PM tonight | Event |
| 5 | Navaratri dandiya program at Expo center in Mu... | Event |
| 6 | Exciting vacation destinations for your next trip | Travel |
| 7 | Maldives and Srilanka are gaining popularity i... | Travel |


## Embed the text column

In [5]:
embedder = SentenceTransformer("all-mpnet-base-v2")
# Pass a plain list of strings; encode returns a NumPy array of shape (n_rows, dim).
text_vectors = embedder.encode(data["text"].tolist())

print(f"""The shape of text_vectors is: {text_vectors.shape}
The dimension of each vector in text_vectors: {text_vectors.shape[1]}""")

The shape of text_vectors is: (8, 768)
The dimension of each vector in text_vectors: 768
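
Faiss indexes work on 2-D `float32` arrays, so it can be worth checking the embedding matrix before indexing it. The sketch below is an illustration under assumptions: `as_faiss_matrix` is a hypothetical helper name, and the random array is placeholder data that only mimics the (8, 768) shape printed above.

```python
import numpy as np


def as_faiss_matrix(vectors) -> np.ndarray:
    """Return a C-contiguous float32 array of shape (n_vectors, dim)."""
    matrix = np.ascontiguousarray(vectors, dtype=np.float32)
    if matrix.ndim == 1:
        # A single vector becomes a one-row matrix.
        matrix = matrix.reshape(1, -1)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 1-D or 2-D array, got {matrix.ndim}-D")
    return matrix


# Placeholder data with the same shape as the embeddings (8 rows, 768 dims).
fake_vectors = np.random.default_rng(0).random((8, 768))  # float64 by default
checked = as_faiss_matrix(fake_vectors)
print(checked.dtype, checked.shape)
```

The same helper also covers a single query vector, since a 1-D input is turned into a one-row matrix.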


## Create a flat Euclidean (L2) index with Faiss

In [7]:
vector_dim = text_vectors.shape[1]
vector_indexer = faiss.IndexFlatL2(vector_dim)
vector_indexer.add(text_vectors)
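
`IndexFlatL2` does an exhaustive search and reports squared Euclidean distances. The NumPy sketch below performs the same kind of computation on a tiny made-up dataset, which can serve as a sanity check on small inputs. `brute_force_l2_search` is a hypothetical name for this example and is not part of Faiss.

```python
import numpy as np


def brute_force_l2_search(stored: np.ndarray, queries: np.ndarray, k: int):
    """Exhaustive k-nearest-neighbour search using squared Euclidean distance.

    Returns (distances, indexes), each of shape (n_queries, k), sorted so the
    closest stored vector comes first.
    """
    # Pairwise differences via broadcasting: (n_queries, n_stored, dim).
    diffs = queries[:, None, :] - stored[None, :, :]
    squared = (diffs ** 2).sum(axis=2)
    order = np.argsort(squared, axis=1)[:, :k]
    return np.take_along_axis(squared, order, axis=1), order


# Made-up 2-D vectors, small enough to check by hand.
stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
queries = np.array([[0.9, 0.0]], dtype=np.float32)
distances, indexes = brute_force_l2_search(stored, queries, k=2)
print(indexes)    # [[1 0]]
print(distances)
```

This approach materialises every pairwise difference, so it is only practical for small data; it is meant to show what the index computes, not to replace it.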

## Retrieve similar rows from the pandas DataFrame based on the query

In [26]:
query = "I wanna to by a shirt"

# Embed the query and reshape it to (1, dim): Faiss searches a batch of queries.
query_vector = embedder.encode(query)
reshaped_query_vector = np.array(query_vector).reshape(1, -1)
# IndexFlatL2 returns squared L2 distances, nearest first.
distance, idx_num = vector_indexer.search(reshaped_query_vector, k=2)

print(f"vector distance: {distance}\nmatched index numbers: {idx_num}")

vector distance: [[1.2629726 1.4028323]]
matched index numbers: [[2 3]]
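
Both arrays returned by the search have shape `(n_queries, k)`. With `IndexFlatL2` the distances are squared, and Faiss fills a slot with index `-1` when fewer than `k` vectors are available. The sketch below turns a single-query result into plain Python pairs. `tidy_search_result` is a hypothetical helper, and the third slot in the example arrays is made up to show the `-1` case.

```python
import numpy as np


def tidy_search_result(distances: np.ndarray, indexes: np.ndarray):
    """Flatten a single-query result into (position, euclidean_distance) pairs."""
    pairs = []
    for squared, position in zip(distances[0], indexes[0]):
        if position < 0:
            # -1 marks an empty slot (fewer than k vectors were available).
            continue
        pairs.append((int(position), float(np.sqrt(squared))))
    return pairs


# First two values copied from the output above; the third slot is made up.
example_distances = np.array([[1.2629726, 1.4028323, 3.4e38]], dtype=np.float32)
example_indexes = np.array([[2, 3, -1]], dtype=np.int64)
print(tidy_search_result(example_distances, example_indexes))
```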


In [27]:
print(type(idx_num), idx_num)
idx_num = idx_num.tolist()
print(idx_num)

# Faiss returns row positions, so select by position with iloc.
data.iloc[idx_num[0]]

<class 'numpy.ndarray'> [[2 3]]
[[2, 3]]


| | text | category |
|---|---|---|
| 2 | These are the latest fashion trends for this week | Fashion |
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
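
Faiss reports the positions at which vectors were added, not DataFrame labels. With the default index used here the two coincide, but they can differ after filtering or re-indexing. The sketch below uses a made-up frame and made-up distances to show position-based lookup and how a distance column could be attached to the matches.

```python
import pandas as pd

# Made-up frame whose labels no longer match row positions, e.g. after filtering.
frame = pd.DataFrame(
    {"text": ["alpha", "beta", "gamma"], "category": ["A", "B", "C"]},
    index=[10, 20, 30],
)

positions = [2, 0]              # what a Faiss search would return: row positions
squared_distances = [0.4, 0.9]  # made-up distances, closest first

# iloc selects by position, so it works regardless of the index labels.
matches = frame.iloc[positions].copy()
matches["squared_distance"] = squared_distances
print(matches)

# loc would look for labels 2 and 0, which do not exist in this frame.
try:
    frame.loc[positions]
except KeyError as err:
    print("loc failed:", type(err).__name__)
```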
