# 🧠 Tokenization → Embedding → Similarity Demo
This notebook walks through the full workflow of text tokenization, embedding generation, and similarity comparison using Hugging Face Transformers and SentenceTransformers.

In [None]:

# Install required libraries (pandas is used later to display the similarity matrix)
!pip install transformers sentence-transformers pandas --quiet


## 1️⃣ Tokenization

In [None]:

from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sample text
text = "Artificial Intelligence is transforming industries."

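# Split the text into WordPiece subword tokens.
# tokenize() does not add the [CLS]/[SEP] special tokens.
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Map each token to its integer ID in the tokenizer's vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)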
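
To make the subword step less opaque, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to each word. The vocabulary `TOY_VOCAB` and the function `wordpiece_tokenize` are hypothetical names invented for this illustration; the real `bert-base-uncased` vocabulary is far larger, so its output for the same words will differ.

```python
# Toy greedy longest-match-first subword tokenizer (WordPiece-style).
# TOY_VOCAB is a hypothetical vocabulary invented for this sketch; it is
# not the real bert-base-uncased vocabulary.
TOY_VOCAB = {"transform", "##ing", "##er", "##s", "industries", "is", "[UNK]"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


for word in ["transforming", "transformers", "industries", "cooking"]:
    print(word, "->", wordpiece_tokenize(word, TOY_VOCAB))
```

Pieces that start with `##` continue the previous piece, which is why a word missing from the vocabulary can still be represented by smaller known units.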


## 2️⃣ Embedding Generation

In [None]:

from sentence_transformers import SentenceTransformer

# Load a lightweight model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "Artificial Intelligence is transforming industries.",
    "Machine learning is changing business processes.",
    "Cooking recipes require precision and creativity."
]

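# Generate one embedding vector per sentence.
# all-MiniLM-L6-v2 produces 384-dimensional vectors, so the shape is (3, 384).
embeddings = model.encode(sentences, convert_to_tensor=True)
print("Embeddings shape:", embeddings.shape)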
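
Sentence embedding models commonly turn per-token vectors into a single sentence vector by mean pooling: averaging the token vectors while skipping padding positions. The sketch below illustrates that idea in plain Python under stated assumptions. `mean_pool`, the 3-dimensional token vectors, and the mask are hypothetical values chosen for readability, not output from `all-MiniLM-L6-v2`, and this is not the library's internal implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


# Hypothetical 3-dimensional token vectors; the last row stands for padding.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

Because the padded row is masked out, it has no effect on the result, which is why sentences of different lengths can be batched together.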


## 3️⃣ Cosine Similarity Calculation

In [None]:

from sentence_transformers import util
import pandas as pd

# Compute the pairwise cosine similarity matrix:
# entry [i][j] is the similarity between sentence i and sentence j
cosine_sim = util.cos_sim(embeddings, embeddings)

# Display as a DataFrame
df = pd.DataFrame(cosine_sim.cpu().numpy(), index=sentences, columns=sentences)
df
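
For reference, the quantity in each cell of the matrix can be reproduced by hand: the dot product of two vectors divided by the product of their lengths. The following is a minimal plain-Python sketch. `cosine_similarity` and the 2-dimensional vectors are hypothetical and chosen so the results are easy to verify; it is not how `util.cos_sim` is implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional vectors chosen so the results are easy to check.
v1 = [1.0, 0.0]
v2 = [2.0, 0.0]   # same direction as v1, different length
v3 = [0.0, 5.0]   # orthogonal to v1
v4 = [-3.0, 0.0]  # opposite direction to v1

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
print(cosine_similarity(v1, v4))  # -1.0
```

Note that `v2` scores 1.0 against `v1` even though it is twice as long: cosine similarity depends only on direction, not magnitude.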


## ✅ Interpretation
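- Cosine similarity ranges from -1 to 1; values close to **1.0** indicate high semantic similarity.
- The diagonal of the matrix is 1.0 because each sentence is compared with itself.
- Sentences 1 and 2 (both about AI and machine learning changing how organizations work) should score higher than sentences 1 and 3 (an unrelated cooking topic).
- Try adding your own sentences to see how the scores change.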

Ranking texts by embedding similarity in this way is the core idea behind **semantic search** and the retrieval step of **retrieval-augmented generation (RAG)**.
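
As a rough illustration of how similarity scores become a search result, the sketch below ranks a small corpus against a query vector and returns the top matches. Everything here is an assumption made for the example: the `search` helper, the 2-dimensional stand-in "embeddings", and the third corpus sentence are hypothetical. In practice the vectors would come from `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, corpus, top_k=2):
    """Return the top_k (text, score) pairs ranked by cosine similarity."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in corpus
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 2-dimensional "embeddings"; a real index would hold model outputs.
corpus = [
    ("Machine learning is changing business processes.", [0.9, 0.1]),
    ("Cooking recipes require precision and creativity.", [0.1, 0.9]),
    ("Deep learning powers modern AI products.", [0.8, 0.3]),
]
query_vector = [1.0, 0.0]  # stands in for an encoded AI-related query

results = search(query_vector, corpus)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In a RAG pipeline, the texts returned by a step like this would be passed to a language model as context. Larger corpora typically replace the linear scan with an approximate nearest-neighbor index.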