# Similarity search on pandas DataFrame text column using LLMs

**Steps:**
1. Load CSV file using pandas.  
2. Apply embedding on the text column.  
3. Create a Euclidean Flat Indexer with Faiss.  
4. Retrieve similar rows from the pandas DataFrame based on the query.  

**Substeps of Step 4:** 
1. Take the input query.  
2. Embed the input query.  
3. Perform a similarity search on the Euclidean Flat Indexer and get the similar indexes.  
4. Retrieve the matched rows from the pandas DataFrame using the returned indexes.  

![alt text](../images/pandasRagf.png)

* `faiss.IndexFlatL2` finds nearest neighbors based on the L2 (Euclidean) distance.  
* It is a simple, brute-force index implementation: each query is compared against every stored vector.


### Key Features of faiss.IndexFlatL2:
1. **Flat Index:** All vectors are stored in memory in their original form. It does not use any advanced data structures like trees or hash tables to organize vectors.
2. **Exact Search:** The search is exhaustive, meaning it calculates the L2 distance between the query vector and all vectors in the dataset to find the nearest neighbors.
3. **Euclidean Distance:** It reports squared L2 distances. Omitting the square root saves computation and does not change the ranking of the neighbors.  
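
To illustrate the third point, here is a minimal NumPy sketch showing that ranking by squared L2 distance gives the same order as ranking by the true Euclidean distance. The `query` and `stored` arrays are made-up toy values, not data from this notebook.

```python
import numpy as np

# Hypothetical toy data: one query and three stored vectors.
query = np.array([1.0, 0.0], dtype=np.float32)
stored = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 4.0]], dtype=np.float32)

squared = ((stored - query) ** 2).sum(axis=1)  # squared L2 distances
euclidean = np.sqrt(squared)                   # true Euclidean distances

# The square root is monotonic, so both orderings agree.
order_squared = np.argsort(squared)
order_euclidean = np.argsort(euclidean)
print(squared, order_squared, order_euclidean)
```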

### How it Works Internally
1. **Initialization:**

    * When you create an IndexFlatL2 object, it initializes an empty container to hold the vectors you want to index.
        
        ```python
        import faiss
        d = 768  # dimensionality of the vectors, e.g. 768 for the embeddings used later
        index = faiss.IndexFlatL2(d)  # empty index for d-dimensional vectors
        ```
2. **Adding Vectors:**
    * Calling `add` appends the vectors to the index, which stores them in a contiguous memory block for fast access.
        
        ```python
        index.add(vectors)  # vectors: float32 NumPy array of shape (n, d)
        ```
3. **Search Process:**

    * When you query the index using the search method, the index computes the squared L2 distance between the query vector(s) and every vector in the dataset.
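    * On the CPU, the distance computations are batched and multithreaded for efficiency. Running the search on a GPU requires a separate GPU index (for example `faiss.GpuIndexFlatL2`), not `IndexFlatL2` itself.  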

        ```python
        distances, indices = index.search(query_vectors, k)
        ```

    * `k` specifies the number of nearest neighbors to retrieve.
    * `distances` contains the squared L2 distances of the nearest neighbors.
    * `indices` contains the indices of the nearest neighbors.

4. **Distance Calculation:** 
    * For a query vector $q$ and a dataset vector $x$, the squared L2 distance is computed as:
    $$
    \|q - x\|^2 = \sum_{i=1}^d (q_i - x_i)^2
    $$


5. **Brute-Force Nature:**

    * Every query vector is compared with all the dataset vectors. This ensures exact results but can be computationally expensive for large datasets.
    * It is suitable for smaller datasets or as a baseline for comparison with other indices (e.g., approximate methods like IndexIVFFlat).
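
The search and distance steps above can be sketched in plain NumPy. This is only an illustration of the brute-force idea under simplifying assumptions, not how Faiss implements it internally (Faiss uses optimized C++ routines). The function name `brute_force_l2_search` and the toy arrays are hypothetical.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, k):
    """Return (squared distances, indices) of the k nearest stored vectors per query."""
    vectors = np.asarray(vectors, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise differences via broadcasting: shape (n_queries, n_vectors, d).
    diff = queries[:, None, :] - vectors[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Sort each row in ascending order and keep the k closest entries.
    indices = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(sq_dist, indices, axis=1)
    return distances, indices

# Hypothetical toy data: three stored vectors and one query.
vectors = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
queries = np.array([[0.5, 0.0]])
distances, indices = brute_force_l2_search(vectors, queries, k=2)
print(distances, indices)
```

Like `index.search`, the sketch returns one row per query, with neighbors ordered from closest to farthest. Its cost grows linearly with the number of stored vectors, which is why exhaustive search becomes expensive on large datasets.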

In [1]:
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer



## Load CSV File Using Pandas

In [2]:
data = pd.read_csv('../data/sample_text.csv')
display(data)

| Unnamed: 0 | text | category |
|---|---|---|
| 0 | Meditation and yoga can improve mental health | Health |
| 1 | Fruits, whole grains and vegetables helps cont... | Health |
| 2 | These are the latest fashion trends for this week | Fashion |
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
| 4 | The concert starts at 7 PM tonight | Event |
| 5 | Navaratri dandiya program at Expo center in Mu... | Event |
| 6 | Exciting vacation destinations for your next trip | Travel |
| 7 | Maldives and Srilanka are gaining popularity i... | Travel |


## Apply Embedding on the Text Column

In [3]:
embedder = SentenceTransformer("all-mpnet-base-v2")
text_vectors = embedder.encode(data["text"].tolist())  # encode expects a string or a list of strings

print(f"""The shape of text_vectors is: {text_vectors.shape}
The dimension of each vector in text_vectors is: {text_vectors.shape[1]}""")

The shape of text_vectors is: (8, 768)
The dimension of each vector in text_vectors is: 768


## Create a Euclidean Flat Indexer with Faiss

In [4]:
vector_dim = text_vectors.shape[1]
vector_indexer = faiss.IndexFlatL2(vector_dim)
vector_indexer.add(text_vectors)
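
Faiss's Python API expects `float32` arrays of shape `(n, d)`. The embeddings above can be passed directly as long as they already have that dtype. When vectors come from another source (for example `float64` arrays or Python lists), a small conversion step like the following sketch may be needed. The helper name `to_faiss_input` is hypothetical and not part of Faiss.

```python
import numpy as np

def to_faiss_input(vectors):
    """Hypothetical helper: return a C-contiguous float32 array of shape (n, d)."""
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:  # a single vector becomes a batch of one
        arr = arr.reshape(1, -1)
    return arr

# A nested Python list of floats goes in, a float32 matrix comes out.
example = to_faiss_input([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(example.dtype, example.shape, example.flags["C_CONTIGUOUS"])
```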

## Retrieve similar rows from the pandas DataFrame based on the query

In [5]:
query = "I wanna to by a shirt"

query_vector = embedder.encode(query)
reshaped_query_vector = np.array(query_vector).reshape(1,-1)
distance, idx_num = vector_indexer.search(reshaped_query_vector, k=2)

print(f"vector distance: {distance} \nmatched index numbers: {idx_num}")

vector distance: [[1.2629726 1.4028323]] 
matched index numbers: [[2 3]]
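
The values returned by `IndexFlatL2` are squared L2 distances. The sketch below converts the two values printed above into Euclidean distances. It also derives cosine similarities, but only under the assumption that the embeddings are unit-normalized, in which case the squared distance equals 2 - 2·cos(q, x). That assumption is not verified in this notebook.

```python
import numpy as np

# Squared L2 distances copied from the search output above.
squared_distances = np.array([[1.2629726, 1.4028323]], dtype=np.float32)

# Taking the square root gives the Euclidean distances.
euclidean_distances = np.sqrt(squared_distances)

# Assumption: unit-normalized embeddings, so cos(q, x) = 1 - ||q - x||^2 / 2.
cosine_similarities = 1.0 - squared_distances / 2.0

print(euclidean_distances)
print(cosine_similarities)
```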


In [6]:
print(type(idx_num), idx_num)
idx_num = idx_num.tolist()
print(idx_num)

data.iloc[idx_num[0]]  # Faiss returns positional indices, so select rows by position

<class 'numpy.ndarray'> [[2 3]]
[[2, 3]]


| Unnamed: 0 | text | category |
|---|---|---|
| 2 | These are the latest fashion trends for this week | Fashion |
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
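
It can be handy to keep each matched row together with its distance. The following is a minimal sketch of one way to do that. The helper `rows_with_distances`, the column name `squared_l2`, and the toy DataFrame are all hypothetical stand-ins for the notebook's DataFrame and search results.

```python
import numpy as np
import pandas as pd

def rows_with_distances(df, distances, indices):
    """Hypothetical helper: matched rows for the first query, plus a distance column."""
    matched = df.iloc[indices[0]].copy()   # positional lookup, first query only
    matched["squared_l2"] = distances[0]   # one distance per matched row
    return matched

# Toy stand-ins for the DataFrame and the search results.
toy_df = pd.DataFrame({"text": ["a", "b", "c", "d"], "category": ["W", "X", "Y", "Z"]})
toy_distances = np.array([[1.26, 1.40]], dtype=np.float32)
toy_indices = np.array([[2, 3]])

result = rows_with_distances(toy_df, toy_distances, toy_indices)
print(result)
```

Using `iloc` keeps the lookup correct even if the DataFrame index is not the default 0..n-1 range, since the index returned by the search refers to the order in which vectors were added.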
