--------
### Create a semantic search engine
- using Sentence Transformers
- focusing on finding similar documents in a small set of demo articles
--------------------

In [1]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

In [5]:
import os
# Cache downloaded models in a local directory; set this before loading the model
os.environ["SENTENCE_TRANSFORMERS_HOME"] = r'D:\AI-DATASETS\07-Hugging-Face-Data\sentence-transformers'

#### model
- `bert-base-nli-mean-tokens` is a BERT (Bidirectional Encoder Representations from Transformers) base model that has been fine-tuned to produce sentence-level embeddings.
- nli: stands for "Natural Language Inference", the task the model was fine-tuned on. Given a pair of sentences, the model learns to classify their logical relationship as "entailment", "contradiction", or "neutral".
- mean-tokens: the sentence embedding is the mean of the token embeddings produced by BERT (mean pooling), which yields one fixed-size vector per sentence.
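
The "mean-tokens" part of the model name refers to mean pooling: the per-token vectors are averaged into a single sentence vector, and padding tokens are left out of the average. The following is a minimal sketch of that idea in plain Python, not the library's actual implementation. `mean_pool`, `toy_tokens`, and `toy_mask` are hypothetical names, and the numbers are made up rather than real BERT outputs.

```python
# Illustrative sketch of mean pooling; the values are toy data, not BERT outputs.

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


# Three real tokens plus one padding token, each with a toy 2-dimensional embedding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)  # [3.0, 4.0]
```

Whatever the number of tokens, the result has the same length as a single token vector, which is why sentences of different lengths can be compared directly.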

In [6]:
# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('bert-base-nli-mean-tokens')


In [2]:
# Sample articles and their titles
demo_data = {
    'Title': [
        'How to Build a Semantic Search Engine',
        'Introduction to Sentence Transformers',
        'Natural Language Processing in Python',
        'Content Recommendation Systems',
        'Information Retrieval Techniques'
    ],
    'Text': [
        'In this tutorial, we will show you how to build a semantic search engine using Sentence Transformers.',
        'Sentence Transformers are powerful NLP models that can encode sentences into fixed-dimensional vectors.',
        'Learn how to perform natural language processing tasks in Python with popular libraries like spaCy and NLTK.',
        'Content recommendation systems help users discover relevant articles and products.',
        'Information retrieval techniques are used to search for relevant documents in large text collections.'
    ]
}

In [3]:
# Create a DataFrame from the demo data
df = pd.DataFrame(demo_data)

In [26]:
df

| | Title | Text |
|---|---|---|
| 0 | How to Build a Semantic Search Engine | In this tutorial, we will show you how to buil... |
| 1 | Introduction to Sentence Transformers | Sentence Transformers are powerful NLP models ... |
| 2 | Natural Language Processing in Python | Learn how to perform natural language processi... |
| 3 | Content Recommendation Systems | Content recommendation systems help users disc... |
| 4 | Information Retrieval Techniques | Information retrieval techniques are used to s... |


In [29]:
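# Encode the articles into embeddings (one vector per article).
# Passing a plain list avoids depending on the DataFrame's index labels.
embeddings = model.encode(df['Text'].tolist(), convert_to_tensor=True)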

In [30]:
# User query
user_query = 'How to create a content recommendation system?'

In [31]:
# Encode the user query
query_embedding = model.encode(user_query, convert_to_tensor=True)

In [35]:
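# Rank the articles by cosine similarity to the user query, most similar first
cosine_scores = util.pytorch_cos_sim(query_embedding, embeddings)
# .cpu() keeps the conversion to NumPy working if the tensors live on a GPU
similar_indices = cosine_scores[0].argsort(descending=True).cpu().numpy()
similar_indices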

array([3, 0, 1, 4, 2], dtype=int64)
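
To make the ranking step concrete, here is a small sketch of what a cosine-similarity ranking computes, written in plain Python so it runs without the model. `cosine_similarity`, `toy_query`, and `toy_docs` are hypothetical names, and the 2-dimensional vectors are made up for illustration; real embeddings from this model are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the query and document embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

scores = [cosine_similarity(toy_query, doc) for doc in toy_docs]

# Sort document positions by score, highest first (the role argsort plays above).
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [2, 1, 0]
```

The third toy document ranks first even though it is twice as long as the query vector, because cosine similarity compares direction only and ignores magnitude.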

In [36]:
# Number of similar documents to retrieve
top_k = 2

In [37]:
# Look at the second-ranked article (position 1 in the ranking)
index = similar_indices[1]
index

0

In [38]:
cosine_scores[0][index]

tensor(0.6498)

In [39]:
df.loc[index, 'Text']

'In this tutorial, we will show you how to build a semantic search engine using Sentence Transformers.'

In [40]:
# Print the top_k highest-scoring articles, best match first
print(f"Top {top_k} most similar articles to the query: '{user_query}'")
for i in range(top_k):
    index = similar_indices[i]
    print(f"Title: {df.loc[index, 'Title']}, Similarity Score: {cosine_scores[0][index]:.4f}")
    print(f"Text: {df.loc[index, 'Text']}\n")

Top 2 most similar articles to the query: 'How to create a content recommendation system?'
Title: Content Recommendation Systems, Similarity Score: 0.7648
Text: Content recommendation systems help users discover relevant articles and products.

Title: How to Build a Semantic Search Engine, Similarity Score: 0.6498
Text: In this tutorial, we will show you how to build a semantic search engine using Sentence Transformers.
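
The top-k selection above can also be wrapped in a small reusable helper. The sketch below uses only the standard library. `top_k_matches`, `toy_titles`, and `toy_scores` are hypothetical names, and the scores are made up for illustration, not model outputs.

```python
import heapq


def top_k_matches(scores, titles, k):
    """Return the k (title, score) pairs with the highest scores, best first."""
    if len(scores) != len(titles):
        raise ValueError("scores and titles must have the same length")
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return [(titles[i], scores[i]) for i in best]


# Made-up scores for illustration only; they are not outputs of the model.
toy_titles = ["Article A", "Article B", "Article C", "Article D"]
toy_scores = [0.12, 0.87, 0.45, 0.60]

for title, score in top_k_matches(toy_scores, toy_titles, k=2):
    print(f"Title: {title}, Similarity Score: {score:.4f}")
```

With the notebook's data, the same helper could be called with the list of cosine scores and `df['Title'].tolist()`. Sentence Transformers also provides `util.semantic_search`, which ranks corpus embeddings against query embeddings directly and may be the more convenient choice for larger collections.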

