In [1]:
!pip install faiss-cpu sentence-transformers

Collecting faiss-cpu
  Using cached faiss_cpu-1.8.0-cp311-cp311-win_amd64.whl.metadata (3.8 kB)
Collecting sentence-transformers
  Using cached sentence_transformers-2.5.1-py3-none-any.whl.metadata (11 kB)
Collecting transformers<5.0.0,>=4.32.0 (from sentence-transformers)
  Using cached transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Using cached torch-2.2.1-cp311-cp311-win_amd64.whl.metadata (26 kB)
Collecting scikit-learn (from sentence-transformers)
  Using cached scikit_learn-1.4.1.post1-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers)
  Using cached scipy-1.12.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting Pillow (from sentence-transformers)
  Using cached pillow-10.2.0-cp311-cp311-win_amd64.whl.metadata (9.9 kB)
Collecting sympy (from torch>=1.11.0->sentence-transformers)
  Using cached sympy-1.12-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch>=1.11.

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [3]:
df = pd.read_csv("sample_text.csv")
df.shape

(6, 2)

In [4]:
df

Unnamed: 0,text,category
0,Artificial Intelligence is revolutionizing industries with its advanced capabilities,Technology
1,The latest smartphone model is equipped with cutting-edge features and enhanced performance,Technology
2,Regular exercise and a balanced diet contribute to a healthy lifestyle,Health
3,Practicing mindfulness and meditation promotes inner peace and reduces stress,Health
4,Join us in spreading kindness and promoting harmony in our community,Peace
5,Efforts towards global disarmament are crucial for achieving lasting world peace,Peace


Step 1: Create source embeddings for the text column

In [5]:
from sentence_transformers import SentenceTransformer

In [10]:
df.text

0           Artificial Intelligence is revolutionizing industries with its advanced capabilities
1    The latest smartphone model is equipped with cutting-edge features and enhanced performance
2                         Regular exercise and a balanced diet contribute to a healthy lifestyle
3                  Practicing mindfulness and meditation promotes inner peace and reduces stress
4                           Join us in spreading kindness and promoting harmony in our community
5               Efforts towards global disarmament are crucial for achieving lasting world peace
Name: text, dtype: object

In [6]:
encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df.text)

In [13]:
print(vectors.shape)
print(vectors.shape[1])
vec = vectors.shape[1]  # embedding dimension, used to size the FAISS index

(6, 768)
768


Step 2: Build a FAISS Index for vectors

In [15]:
import faiss

index = faiss.IndexFlatL2(vec)  # exact (brute-force) index that ranks by L2 distance

Step 3: Add the source vectors to the index (IndexFlatL2 ranks by L2 distance; the vectors are added as returned by the encoder, without an explicit normalization step)

In [16]:
index.add(vectors)

index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000002876E2174E0> >
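The cells above add the encoder output to the index as-is. If normalization is wanted, the following is a minimal sketch of what it does and why it matters for an L2 index, using plain NumPy and toy vectors rather than the notebook's embeddings. The helper name `l2_normalize` is hypothetical. For unit-length vectors, the squared L2 distance (which is what `IndexFlatL2` reports) equals 2 - 2 * cosine similarity, so ranking by L2 distance and ranking by cosine similarity agree.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper, not a FAISS API)."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return x / np.maximum(norms, 1e-12)

# Toy vectors standing in for sentence embeddings.
a = l2_normalize(np.array([[3.0, 4.0]]))
b = l2_normalize(np.array([[4.0, 3.0]]))

cosine = float(np.dot(a[0], b[0]))         # cosine similarity of unit vectors
squared_l2 = float(np.sum((a - b) ** 2))   # squared Euclidean distance

# For unit vectors these two quantities match up to float error.
print(squared_l2, 2 - 2 * cosine)
```

Whether the encoder output is already unit-length can be checked with `np.linalg.norm(vectors, axis=1)`. FAISS also ships `faiss.normalize_L2`, which normalizes a float32 matrix in place; if it is used, the same normalization should be applied to the query vector before searching.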

Step 4: Encode the search text with the same encoder and reshape it to a 2D array

In [31]:
search_query = "Web developer"

search_vector = encoder.encode(search_query)
search_vector.shape

(768,)

In [32]:
import numpy as np
# FAISS expects a 2D array of shape (n_queries, dim)
search_vector_arr = np.array(search_vector).reshape(1, -1)
search_vector_arr.shape


(1, 768)

Step 5: Search the FAISS index for the vectors most similar to the query

In [33]:
index.search(search_vector_arr, k=2)  # k: number of nearest neighbors to return

(array([[1.6096141, 1.802349 ]], dtype=float32), array([[0, 1]], dtype=int64))

In [34]:
distances, I = index.search(search_vector_arr, k=2)

In [35]:
distances

array([[1.6096141, 1.802349 ]], dtype=float32)

In [36]:
# Row indices (positions in the index, and here also in df) of the nearest neighbors.
I

array([[0, 1]], dtype=int64)
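What the flat L2 index computes can be reproduced by brute force: the squared Euclidean distance from the query to every stored vector, followed by picking the k smallest. The sketch below does this with NumPy on toy data; `brute_force_l2_search` is a hypothetical helper that only mirrors the shape of the `(distances, indices)` pair returned by `index.search`, and is not how FAISS is implemented internally.

```python
import numpy as np

def brute_force_l2_search(stored, queries, k):
    """Return (squared distances, indices) of the k nearest stored rows per query."""
    stored = np.asarray(stored, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise squared distances, shape (n_queries, n_stored).
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, nearest first.
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy data: three stored 2D vectors and one query.
stored = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
query = np.array([[0.9, 0.1]])

toy_distances, toy_indices = brute_force_l2_search(stored, query, k=2)
print(toy_distances, toy_indices)
```

Smaller distances mean closer matches, which is why the first entry of each result row is the best hit.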

In [37]:
row_indices = I.tolist()[0]

row_indices

[0, 1]

In [38]:
search_query

'Web developer'

In [39]:
# Look up the matching rows by index. These are the rows closest to the search query.
df.loc[row_indices]

Unnamed: 0,text,category
0,Artificial Intelligence is revolutionizing industries with its advanced capabilities,Technology
1,The latest smartphone model is equipped with cutting-edge features and enhanced performance,Technology
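The lookup above returns the matching rows but drops the distances. As a small sketch, the hypothetical helper `pair_results` below pairs each returned row index with its distance and text. The example arrays copy the values printed in the cells above. The check for negative indices rests on the assumption that FAISS fills missing neighbors with -1 when fewer than k results are available.

```python
import numpy as np

def pair_results(texts, distances, indices):
    """Return (row_index, distance, text) tuples for the first query."""
    results = []
    for row, dist in zip(indices[0], distances[0]):
        if row < 0:  # assumed placeholder for "no neighbor found"
            continue
        results.append((int(row), float(dist), texts[int(row)]))
    return results

# First two rows of the sample data, as shown in the notebook.
texts = [
    "Artificial Intelligence is revolutionizing industries with its advanced capabilities",
    "The latest smartphone model is equipped with cutting-edge features and enhanced performance",
]
example_distances = np.array([[1.6096141, 1.802349]], dtype=np.float32)
example_indices = np.array([[0, 1]], dtype=np.int64)

for row, dist, text in pair_results(texts, example_distances, example_indices):
    print(row, round(dist, 4), text)
```

With the notebook's DataFrame, something like `df.loc[row_indices].assign(distance=distances[0])` should attach the distances as a column. Note that with a distance of about 1.6 for the best hit, the query "Web developer" is only loosely related to any row in this small sample.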
