# Step 1: Introduction
# 🔍 Semantic Search using Sentence Embeddings

In this notebook, we'll learn how to:
- Embed sentences using a lightweight transformer encoder (MiniLM)
- Compute cosine similarity between sentence vectors
- Retrieve the most semantically similar sentence from a collection

We'll use the `sentence-transformers` library, which wraps pretrained models for sentence encoding. The model used here is small and fast enough to run comfortably on Google Colab.



# Step 2: Install Required Libraries

In [None]:
# Step 2: Install Required Libraries
!pip install -q sentence-transformers


# Step 3: Import Libraries
## 📦 Import Required Python Libraries

We'll need:
- `SentenceTransformer` to generate sentence embeddings
- `cosine_similarity` to compute vector similarity
- `numpy` for basic numerical operations



In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 4: Define the Sentence Collection
## 📝 Define a Sentence Collection

This is the set of candidate sentences. Our goal is to find the sentence from this list that is semantically closest to a given input query.



In [None]:
sentences = [
    "AI is transforming the future of work.",
    "Machine learning is a subset of artificial intelligence.",
    "The stock market fluctuates based on economic data.",
    "Large language models can generate human-like text.",
    "Transformers use attention mechanisms to learn context.",
    "Vaccines have been instrumental in preventing diseases.",
    "Climate change poses a significant threat to biodiversity.",
    "Renewable energy sources are essential for sustainability."
]

# Step 5: Load Embedding Model
## ⚡ Load a Lightweight Embedding Model

We'll use the **all-MiniLM-L6-v2** model, which maps each sentence to a 384-dimensional vector. It balances quality and speed, making it suitable for limited-compute environments like Google Colab.


In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Step 6: Generate Embeddings for the Collection
## 📐 Encode the Sentence Collection

Now we'll compute a vector embedding for each sentence in the list. `model.encode` returns a NumPy array with one row per sentence.


In [None]:
sentence_embeddings = model.encode(sentences)
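
To make the "one row per sentence" layout concrete, here is a minimal sketch that uses a small hand-written array in place of real model output. The names `l2_normalize`, `toy_embeddings`, `unit`, and `similarity_matrix` are illustrative assumptions, not part of the notebook. It shows that once every row is scaled to unit length, a plain matrix product gives the cosine similarity between every pair of rows.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row of a 2-D array to unit length (hypothetical helper)."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return matrix / np.clip(norms, 1e-12, None)

# Toy 2-D "embeddings" standing in for real 384-dimensional vectors
toy_embeddings = np.array([
    [3.0, 4.0],
    [0.0, 2.0],
    [1.0, 0.0],
])

unit = l2_normalize(toy_embeddings)

# For unit-length rows, the dot product equals the cosine similarity
similarity_matrix = unit @ unit.T
print(similarity_matrix.shape)
```

The diagonal of `similarity_matrix` is 1 because every vector points in the same direction as itself.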

# Step 7: Input Query and Compute Similarity
## 💬 Input Query Sentence

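We'll take an input sentence, compute its embedding, and measure its cosine similarity with every sentence in the collection. The query is wrapped in a list so that `model.encode` returns a 2-D array, which is the shape `cosine_similarity` expects. The result has one row (the query) and one column per candidate sentence.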


In [None]:
query = "What is the impact of artificial intelligence on jobs?"
query_embedding = model.encode([query])

# Compute cosine similarity with the sentence collection
cosine_scores = cosine_similarity(query_embedding, sentence_embeddings)
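
For intuition about what `cosine_similarity` computes, the sketch below writes out the formula by hand: the dot product of two vectors divided by the product of their lengths. The function `cosine_sim` and the toy vectors are assumptions made for illustration. The notebook itself relies on the scikit-learn implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D vectors standing in for real embeddings
toy_query = [1.0, 0.0, 1.0]
toy_same_direction = [2.0, 0.0, 2.0]   # same direction, different length
toy_orthogonal = [0.0, 1.0, 0.0]       # at a right angle to the query

same_score = cosine_sim(toy_query, toy_same_direction)
orthogonal_score = cosine_sim(toy_query, toy_orthogonal)
print(same_score, orthogonal_score)
```

Vectors that point the same way score close to 1 regardless of their length, while vectors at right angles score 0. This is why cosine similarity is a common choice for comparing embeddings.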

# Step 8: Retrieve the Most Similar Sentence
## ✅ Retrieve the Most Relevant Sentence

Based on the similarity scores, we’ll retrieve the sentence with the highest cosine similarity to the query.


In [None]:
# Row 0 holds the scores for our single query
top_index = int(np.argmax(cosine_scores[0]))
print("Input Query:", query)
print("Most Similar Sentence:", sentences[top_index])
print("Similarity Score:", cosine_scores[0][top_index])
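
If more than one match is needed, `np.argmax` can be swapped for a sort. The following is a minimal sketch under that assumption. `top_k` and `toy_scores` are hypothetical names, and the scores are made-up values shaped like the `(1, n)` array returned by `cosine_similarity`.

```python
import numpy as np

def top_k(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    scores = np.asarray(scores, dtype=float).ravel()  # flatten (1, n) to (n,)
    k = min(k, scores.size)                            # never ask for more than exist
    order = np.argsort(scores)[::-1][:k]               # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Made-up similarity scores for five candidate sentences
toy_scores = np.array([[0.62, 0.41, 0.05, 0.33, 0.18]])

ranked = top_k(toy_scores, k=3)
for index, score in ranked:
    print(index, score)
```

Each returned index can be used to look up the matching entry in the sentence list, just as `top_index` is used for the single best match.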

# Step 9: Visualize Results
## 📊 Visualize Similarity Scores

This table lists every sentence alongside its cosine similarity to the query, sorted from most to least similar.


In [None]:
import pandas as pd

df = pd.DataFrame({
    "Sentence": sentences,
    "Similarity Score": cosine_scores[0]
}).sort_values(by="Similarity Score", ascending=False)

df.reset_index(drop=True, inplace=True)
df

# Step 10: Plot the Top 5 Similar Sentences
## 📈 Visualize Top 5 Semantic Matches

This plot shows the top 5 most semantically similar sentences to the query. The x-axis shows the first three words of each sentence, and the y-axis shows their cosine similarity scores.


In [None]:
import matplotlib.pyplot as plt

# Prepare top 5 results
top5 = df.head(5).copy()
top5["Short Label"] = top5["Sentence"].apply(lambda x: ' '.join(x.split()[:3]) + "...")

# Plot the bar chart

plt.figure(figsize=(10, 5))
plt.bar(top5["Short Label"], top5["Similarity Score"], color='skyblue')
plt.title("Top 5 Similar Sentences")
plt.xlabel("Sentence (First 3 Words)")
plt.ylabel("Cosine Similarity")
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()