[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/cosine-similarity.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/cosine-similarity.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/cosine-similarity.ipynb)

# Cosine Similarity: From Vectors to NLP Applications

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- Mathematical foundations of cosine similarity
- How to calculate cosine similarity between 2D vectors
- How to extend cosine similarity to higher-dimensional spaces
- Practical applications of cosine similarity in NLP and text analysis
- How to use cosine similarity with HuggingFace transformers for semantic search

## 📋 Prerequisites
- Basic understanding of linear algebra (vectors and dot products)
- Familiarity with Python and NumPy
- Basic knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Understanding of transformers and embeddings (refer to previous notebooks)

## 📚 What We'll Cover
1. Part 1: Mathematical Foundations - Understanding cosine similarity formula
2. Part 2: 2D Vector Examples - Visual and geometric intuition
3. Part 3: Higher-Dimensional Vectors - Extending to multi-dimensional space
4. Part 4: Text Similarity with HuggingFace - Practical NLP applications
5. Part 5: Advanced Applications - Semantic search and similarity matrices
6. Part 6: Summary and Best Practices

## Part 1: Mathematical Foundations of Cosine Similarity

**Cosine similarity** measures the similarity between two vectors by computing the cosine of the angle between them.

### Mathematical Formula

The cosine similarity between two vectors $\vec{A}$ and $\vec{B}$ is defined as:

$$\text{cosine\_similarity}(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where:
- $\vec{A} \cdot \vec{B}$ is the dot product of vectors A and B
- $\|\vec{A}\|$ and $\|\vec{B}\|$ are the magnitudes (L2 norms) of vectors A and B
- $n$ is the dimensionality of the vectors
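
To make the formula concrete, here is a minimal sketch that evaluates each term by hand for the vectors $[3, 4]$ and $[4, 5]$. It uses only the standard library and assumes neither vector is all zeros (the denominator would be 0). The helper name `cosine_similarity_by_hand` is illustrative and not part of this notebook.

```python
import math

def cosine_similarity_by_hand(a, b):
    """Evaluate the cosine similarity formula term by term (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimensionality n")
    dot = sum(x * y for x, y in zip(a, b))      # numerator: sum of A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||B||
    return dot / (norm_a * norm_b)

a, b = [3, 4], [4, 5]
print(sum(x * y for x, y in zip(a, b)))            # dot product: 3*4 + 4*5 = 32
print(math.sqrt(sum(x * x for x in a)))            # ||A|| = sqrt(9 + 16) = 5.0
print(round(cosine_similarity_by_hand(a, b), 4))   # 32 / (5 * sqrt(41)) ≈ 0.9995
```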

### Key Properties

- **Range**: Cosine similarity values range from -1 to 1
  - `1`: Vectors point in the same direction (perfect similarity)
  - `0`: Vectors are orthogonal (no similarity)
  - `-1`: Vectors point in opposite directions (perfect dissimilarity)

- **Direction vs. Magnitude**: Cosine similarity focuses on the **direction** of vectors, not their magnitude
  - Vectors `[2, 2]` and `[4, 4]` have cosine similarity = 1 (same direction)
  - This is why cosine similarity suits text analysis, where how often a term appears matters less than what the text is about

> 💡 **Pro Tip**: In NLP, cosine similarity is often preferred over Euclidean distance because it is invariant to positive scaling. Two documents that mention "machine learning" twice and four times, respectively, differ in magnitude but should still count as semantically similar.
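
The contrast with Euclidean distance can be sketched with two toy term-count vectors. The vocabulary and counts below are made up for illustration; real documents would produce much longer vectors.

```python
import numpy as np

# Hypothetical term counts over the vocabulary ["machine", "learning", "pizza"]
doc_short = np.array([2.0, 2.0, 0.0])   # mentions "machine learning" twice
doc_long = np.array([4.0, 4.0, 0.0])    # mentions "machine learning" four times

euclidean = np.linalg.norm(doc_short - doc_long)
cosine = doc_short @ doc_long / (np.linalg.norm(doc_short) * np.linalg.norm(doc_long))

print(f"Euclidean distance: {euclidean:.4f}")  # grows with the difference in counts
print(f"Cosine similarity:  {cosine:.4f}")     # unaffected by the 2x scaling
```

Euclidean distance reports the two documents as clearly apart, while cosine similarity treats them as pointing in the same direction.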

## Part 2: Cosine Similarity Between 2D Vectors

Let's start with simple 2D vectors to build geometric intuition.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Tuple

# Set reproducible environment with repository standard seed=16
np.random.seed(16)
print("🔢 Random seed set to 16 for reproducibility")

# Configure visualization style (repository standard)
sns.set_style('darkgrid')  # Better readability with gridlines
sns.set_palette("husl")     # Consistent, accessible colors
print("📊 Visualization style configured: darkgrid with husl palette")

In [None]:
def cosine_similarity_manual(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    """
    Calculate cosine similarity between two vectors manually.
    
    This function demonstrates the mathematical formula step-by-step
    for educational purposes.
    
    Args:
        vector_a: First vector (numpy array)
        vector_b: Second vector (numpy array)
        
    Returns:
        float: Cosine similarity value between -1 and 1
    """
    # Step 1: Calculate dot product
    dot_product = np.dot(vector_a, vector_b)
    
    # Step 2: Calculate magnitude (L2 norm) of each vector
    magnitude_a = np.linalg.norm(vector_a)
    magnitude_b = np.linalg.norm(vector_b)
    
    # Step 3: Calculate cosine similarity
    # Add small epsilon to avoid division by zero
    cosine_sim = dot_product / (magnitude_a * magnitude_b + 1e-8)
    
    return cosine_sim


def visualize_2d_vectors(vector_a: np.ndarray, vector_b: np.ndarray, 
                         similarity: float, title: str = "2D Vector Comparison"):
    """
    Visualize two 2D vectors and their cosine similarity.
    
    Args:
        vector_a: First 2D vector
        vector_b: Second 2D vector
        similarity: Cosine similarity value
        title: Plot title
    """
    plt.figure(figsize=(8, 8))
    
    # Plot vectors as arrows from origin
    plt.quiver(0, 0, vector_a[0], vector_a[1], angles='xy', scale_units='xy', 
               scale=1, color='blue', width=0.008, label='Vector A')
    plt.quiver(0, 0, vector_b[0], vector_b[1], angles='xy', scale_units='xy', 
               scale=1, color='red', width=0.008, label='Vector B')
    
    # Set axis limits with some padding
    max_val = max(np.max(np.abs(vector_a)), np.max(np.abs(vector_b))) * 1.2
    plt.xlim(-max_val, max_val)
    plt.ylim(-max_val, max_val)
    
    # Add labels and grid
    plt.xlabel('X-axis', fontsize=12, fontweight='bold')
    plt.ylabel('Y-axis', fontsize=12, fontweight='bold')
    plt.axhline(y=0, color='k', linewidth=0.5)
    plt.axvline(x=0, color='k', linewidth=0.5)
    plt.grid(True, alpha=0.3)
    
    # Add title with similarity score
    plt.title(f"{title}\nCosine Similarity: {similarity:.4f}", 
              fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.axis('equal')
    plt.tight_layout()
    plt.show()

print("✅ Helper functions defined successfully")
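
The epsilon in `cosine_similarity_manual` keeps the division defined, but it also nudges every result very slightly toward zero. One alternative, sketched here under the assumption that returning 0.0 for a zero vector is acceptable for your use case, is to check the norms explicitly. The name `cosine_similarity_safe` is hypothetical and not part of this notebook.

```python
import numpy as np

def cosine_similarity_safe(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    """Cosine similarity with an explicit zero-vector check instead of an epsilon."""
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Direction is undefined for a zero vector; returning 0.0 is a convention
        return 0.0
    return float(np.dot(vector_a, vector_b) / (norm_a * norm_b))

print(cosine_similarity_safe(np.array([0, 0]), np.array([3, 4])))  # 0.0 by the convention above
print(cosine_similarity_safe(np.array([3, 4]), np.array([6, 8])))  # same direction
```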

### Example 1: Similar Vectors (High Cosine Similarity)

In [None]:
# Two vectors pointing in similar directions
vector_1 = np.array([3, 4])
vector_2 = np.array([4, 5])

# Calculate cosine similarity
similarity = cosine_similarity_manual(vector_1, vector_2)

print("📐 Vector Analysis:")
print(f"Vector A: {vector_1}")
print(f"Vector B: {vector_2}")
print(f"\nDot Product: {np.dot(vector_1, vector_2)}")
print(f"Magnitude A: {np.linalg.norm(vector_1):.4f}")
print(f"Magnitude B: {np.linalg.norm(vector_2):.4f}")
print(f"\n✨ Cosine Similarity: {similarity:.4f}")
print(f"\n💡 Interpretation: Similarity of {similarity:.4f} indicates vectors are pointing in very similar directions")

# Visualize
visualize_2d_vectors(vector_1, vector_2, similarity, "Example 1: Similar Vectors")

### Example 2: Orthogonal Vectors (Zero Cosine Similarity)

In [None]:
# Two perpendicular vectors (90 degree angle)
vector_1 = np.array([5, 0])
vector_2 = np.array([0, 5])

# Calculate cosine similarity
similarity = cosine_similarity_manual(vector_1, vector_2)

print("📐 Vector Analysis:")
print(f"Vector A: {vector_1}")
print(f"Vector B: {vector_2}")
print(f"\nDot Product: {np.dot(vector_1, vector_2)}")
print(f"Magnitude A: {np.linalg.norm(vector_1):.4f}")
print(f"Magnitude B: {np.linalg.norm(vector_2):.4f}")
print(f"\n✨ Cosine Similarity: {similarity:.4f}")
print(f"\n💡 Interpretation: Similarity of {similarity:.4f} indicates vectors are orthogonal (90° angle)")

# Visualize
visualize_2d_vectors(vector_1, vector_2, similarity, "Example 2: Orthogonal Vectors")

### Example 3: Opposite Vectors (Negative Cosine Similarity)

In [None]:
# Two vectors pointing in opposite directions
vector_1 = np.array([3, 4])
vector_2 = np.array([-3, -4])

# Calculate cosine similarity
similarity = cosine_similarity_manual(vector_1, vector_2)

print("📐 Vector Analysis:")
print(f"Vector A: {vector_1}")
print(f"Vector B: {vector_2}")
print(f"\nDot Product: {np.dot(vector_1, vector_2)}")
print(f"Magnitude A: {np.linalg.norm(vector_1):.4f}")
print(f"Magnitude B: {np.linalg.norm(vector_2):.4f}")
print(f"\n✨ Cosine Similarity: {similarity:.4f}")
print(f"\n💡 Interpretation: Similarity of {similarity:.4f} indicates vectors point in exactly opposite directions (180° angle)")

# Visualize
visualize_2d_vectors(vector_1, vector_2, similarity, "Example 3: Opposite Vectors")

### Example 4: Scale Invariance Property

This example demonstrates that cosine similarity is invariant to vector magnitude - only direction matters.

In [None]:
# Two vectors with same direction but different magnitudes
vector_1 = np.array([2, 2])
vector_2 = np.array([8, 8])  # 4x the magnitude of vector_1

# Calculate cosine similarity
similarity = cosine_similarity_manual(vector_1, vector_2)

print("📐 Vector Analysis (Scale Invariance):")
print(f"Vector A: {vector_1} (magnitude: {np.linalg.norm(vector_1):.4f})")
print(f"Vector B: {vector_2} (magnitude: {np.linalg.norm(vector_2):.4f})")
print(f"\n🔍 Vector B is 4x longer than Vector A")
print(f"\n✨ Cosine Similarity: {similarity:.4f}")
print(f"\n💡 Key Insight: Despite different magnitudes, cosine similarity = 1.0")
print("   This is because they point in the exact same direction!")
print("\n🚀 Why This Matters in NLP:")
print("   - Document length doesn't affect semantic similarity")
print("   - Term frequency differences are normalized out")
print("   - We focus on semantic direction, not scale")

# Visualize
visualize_2d_vectors(vector_1, vector_2, similarity, "Example 4: Scale Invariance")
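
The interpretations in these examples quote angles (90°, 180°). As a small sketch, a similarity can be converted back to an angle with the inverse cosine. Clipping to [-1, 1] first is a defensive assumption, since floating-point rounding can push the ratio slightly out of range. The helper name `angle_between_degrees` is illustrative.

```python
import numpy as np

def angle_between_degrees(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    """Angle between two non-zero vectors, in degrees (illustrative helper)."""
    cos_sim = np.dot(vector_a, vector_b) / (
        np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    )
    # Rounding can push the value just outside [-1, 1], where arccos is undefined
    cos_sim = np.clip(cos_sim, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_sim)))

print(angle_between_degrees(np.array([5, 0]), np.array([0, 5])))    # orthogonal: about 90 degrees
print(angle_between_degrees(np.array([3, 4]), np.array([-3, -4])))  # opposite: about 180 degrees
print(angle_between_degrees(np.array([3, 4]), np.array([4, 5])))    # similar: a small angle
```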

## Part 3: Cosine Similarity in Higher-Dimensional Space

Real-world applications, especially in NLP, involve high-dimensional vectors. Let's extend our understanding to higher dimensions.

In [None]:
# Import sklearn for efficient computation
from sklearn.metrics.pairwise import cosine_similarity

print("✅ sklearn library imported for efficient cosine similarity computation")

### Example 5: 3D Vectors

In [None]:
# Create 3D vectors
vector_3d_a = np.array([1, 2, 3])
vector_3d_b = np.array([2, 3, 4])
vector_3d_c = np.array([5, 1, 2])

# Calculate similarities using our manual function
sim_ab = cosine_similarity_manual(vector_3d_a, vector_3d_b)
sim_ac = cosine_similarity_manual(vector_3d_a, vector_3d_c)
sim_bc = cosine_similarity_manual(vector_3d_b, vector_3d_c)

print("📐 3D Vector Analysis:")
print(f"Vector A: {vector_3d_a}")
print(f"Vector B: {vector_3d_b}")
print(f"Vector C: {vector_3d_c}")
print(f"\n🔍 Pairwise Cosine Similarities:")
print(f"A vs B: {sim_ab:.4f} (high similarity - similar direction)")
print(f"A vs C: {sim_ac:.4f} (moderate similarity)")
print(f"B vs C: {sim_bc:.4f} (moderate similarity)")
print(f"\n💡 As dimensionality increases, the geometric interpretation becomes abstract,")
print("   but the mathematical formula remains the same!")

### Example 6: High-Dimensional Vectors (Common in NLP)

In NLP, embeddings typically have 384, 768, or more dimensions. Let's work with more realistic sizes.

In [None]:
# Create high-dimensional random vectors (simulating sentence embeddings)
# Common embedding dimensions: 384 (e.g., all-MiniLM-L6-v2), 768 (BERT-base), 1024 (BERT-large)
embedding_dim = 384

# Generate random embeddings (in practice, these come from a model)
# Using repository standard seed=16 for reproducibility
np.random.seed(16)
embedding_1 = np.random.randn(embedding_dim)
embedding_2 = np.random.randn(embedding_dim)
embedding_3 = embedding_1 + np.random.randn(embedding_dim) * 0.1  # Similar to embedding_1

print(f"📊 High-Dimensional Vector Analysis:")
print(f"Embedding dimension: {embedding_dim}D")
print(f"\nVector shapes:")
print(f"  Embedding 1: {embedding_1.shape}")
print(f"  Embedding 2: {embedding_2.shape}")
print(f"  Embedding 3: {embedding_3.shape}")

# Compute all pairwise similarities in one call with sklearn
# Stack the vectors into a 2D array: sklearn expects shape (n_samples, n_features)
embeddings = np.array([embedding_1, embedding_2, embedding_3])
similarity_matrix = cosine_similarity(embeddings)

print(f"\n🔍 Similarity Matrix ({embedding_dim}D space):")
print("     Emb1    Emb2    Emb3")
for i, row in enumerate(similarity_matrix):
    print(f"Emb{i+1} {row[0]:7.4f} {row[1]:7.4f} {row[2]:7.4f}")

print(f"\n✨ Key Observations:")
print(f"  • Diagonal values = 1.0 (each vector is identical to itself)")
print(f"  • Emb1 vs Emb3 = {similarity_matrix[0, 2]:.4f} (high, as Emb3 was created from Emb1)")
print(f"  • Emb1 vs Emb2 = {similarity_matrix[0, 1]:.4f} (lower, as they're independent)")
print(f"  • Matrix is symmetric (sim(A,B) = sim(B,A))")
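
Why is Emb1 vs Emb2 close to zero? Independent random directions in a high-dimensional space tend to be nearly orthogonal, with cosine similarities spread around 0 at a scale of roughly $1/\sqrt{d}$. The sketch below checks this empirically. The sample size of 2,000 pairs and the use of `np.random.default_rng` are illustrative choices, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(16)

def random_pair_similarities(dim: int, n_pairs: int = 2000) -> np.ndarray:
    """Cosine similarities of n_pairs independent Gaussian vector pairs (illustrative)."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

for dim in (2, 384):
    sims = random_pair_similarities(dim)
    print(f"dim={dim:4d}  mean={sims.mean():+.3f}  std={sims.std():.3f}  "
          f"1/sqrt(dim)={1 / np.sqrt(dim):.3f}")
```

In 2D the similarities are spread widely, while in 384D they cluster tightly around 0. That is why a similarity well above 0 between two real embeddings is informative.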

### Visualizing High-Dimensional Similarity Matrix

In [None]:
# Create more embeddings for better visualization
np.random.seed(16)
n_embeddings = 8
embeddings_list = []

# Create base embeddings
for i in range(n_embeddings):
    embeddings_list.append(np.random.randn(embedding_dim))

# Stack into matrix
embeddings_matrix = np.array(embeddings_list)

# Calculate similarity matrix
similarity_matrix_large = cosine_similarity(embeddings_matrix)

# Visualize as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    similarity_matrix_large,
    annot=True,
    fmt='.3f',
    cmap='viridis',
    xticklabels=[f'Emb{i+1}' for i in range(n_embeddings)],
    yticklabels=[f'Emb{i+1}' for i in range(n_embeddings)],
    cbar_kws={'label': 'Cosine Similarity'}
)
plt.title(f'Cosine Similarity Matrix for {embedding_dim}D Embeddings',
          fontsize=14, fontweight='bold')
plt.xlabel('Embeddings', fontweight='bold')
plt.ylabel('Embeddings', fontweight='bold')
plt.tight_layout()
plt.show()

print("✅ Similarity matrix visualization complete")
print("\n💡 These embeddings are random, so off-diagonal values sit near 0; with real sentence embeddings, the same matrix shows which sentences/documents are semantically similar")

## Part 4: Text Similarity with HuggingFace Transformers

Now let's apply cosine similarity to real NLP tasks using HuggingFace transformers. This is where cosine similarity truly shines in practice!

In [None]:
# Import HuggingFace libraries
from transformers import AutoTokenizer, AutoModel
import torch

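# Check device availability (Colab and TPU support are detected separately)
try:
    import google.colab  # noqa: F401  (only checking that we are on Colab)
    COLAB_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False

try:
    import torch_xla.core.xla_model as xm
    TPU_AVAILABLE = True
except ImportError:
    TPU_AVAILABLE = False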

def get_device() -> torch.device:
    """
    Get the best available device for PyTorch operations.
    
    Device Priority:
    - Google Colab with torch_xla installed: TPU is tried first
    - Otherwise: CUDA GPU > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    # Google Colab: Always prefer TPU when available
    if COLAB_AVAILABLE and TPU_AVAILABLE:
        try:
            device = xm.xla_device()
            print("🔥 Using Google Colab TPU for optimal performance")
            return device
        except Exception as e:
            print(f"⚠️ TPU initialization failed: {e}")
    
    # Standard device detection
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU")
    
    return device

# Get optimal device
device = get_device()
print(f"✅ Device configured: {device}")

### Loading Sentence Transformer Model

We'll use a sentence-transformer model optimized for semantic similarity tasks.

In [None]:
# Load pre-trained sentence transformer model
# sentence-transformers/all-MiniLM-L6-v2 is a popular choice:
# - Small and fast (384-dimensional embeddings)
# - Trained specifically for semantic similarity
# - Good performance on various NLP tasks
model_name = "sentence-transformers/all-MiniLM-L6-v2"

print(f"📥 Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Move model to optimal device
model = model.to(device)
model.eval()  # Set to evaluation mode

print(f"✅ Model loaded successfully")
print(f"📊 Model parameters: {model.num_parameters():,}")
print(f"📐 Embedding dimension: {model.config.hidden_size}")

In [None]:
def get_sentence_embedding(text: str, tokenizer, model, device) -> np.ndarray:
    """
    Generate sentence embedding using HuggingFace model.
    
    Args:
        text: Input text string
        tokenizer: HuggingFace tokenizer
        model: HuggingFace model
        device: Device to run inference on
        
    Returns:
        numpy array: Sentence embedding vector
    """
    # Tokenize input
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    )
    
    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        # Mean-pool the token embeddings into one sentence embedding.
        # A single input has no padding, so a plain mean over tokens is safe here.
        embeddings = outputs.last_hidden_state.mean(dim=1)
    
    # Convert to numpy and return
    return embeddings.cpu().numpy()[0]

print("✅ Embedding function defined")
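
`get_sentence_embedding` averages over every token position, which is fine for a single un-padded sentence. When several sentences are tokenized together, shorter ones are padded, and the padding positions should be excluded from the average. The sketch below shows that attention-mask-weighted mean on a toy NumPy array so it runs without a model. The array values and the helper name `masked_mean_pool` are made up for illustration. The same arithmetic would apply to `outputs.last_hidden_state` and the tokenizer's `attention_mask`.

```python
import numpy as np

def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings per sentence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, hidden)
    attention_mask:   shape (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # number of real tokens per sentence
    return summed / counts

# Toy batch: sentence 0 has two real tokens plus one padding slot holding junk values
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

print("Naive mean: ", hidden.mean(axis=1))             # padding distorts sentence 0
print("Masked mean:", masked_mean_pool(hidden, mask))  # sentence 0 averages its two real tokens
```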

### Example 7: Semantic Similarity Between Sentences

Let's compare a handful of short sentences, including two that differ only in the pet being walked.

In [None]:
# Sample sentences: the first two differ by a single word
sentences = [
    "I took my dog for a walk",
    "I took my cat for a walk",
    "The weather is beautiful today",
    "I love machine learning"
]

print("📝 Analyzing sentence similarity...\n")

# Generate embeddings for all sentences
sentence_embeddings = []
for i, sentence in enumerate(sentences):
    embedding = get_sentence_embedding(sentence, tokenizer, model, device)
    sentence_embeddings.append(embedding)
    print(f"✓ Sentence {i+1} embedded: {embedding.shape}")

# Convert to numpy array for easier manipulation
sentence_embeddings = np.array(sentence_embeddings)
print(f"\n📊 All embeddings shape: {sentence_embeddings.shape}")
print(f"   ({sentence_embeddings.shape[0]} sentences × {sentence_embeddings.shape[1]} dimensions)")

In [None]:
# Calculate cosine similarity matrix
# Entry [i, j] holds the similarity between sentence i and sentence j
similarity_matrix = cosine_similarity(sentence_embeddings)

print("🔍 Cosine Similarity Matrix:")
print("\nRows/Columns represent sentences:")
for i, sent in enumerate(sentences):
    print(f"{i+1}. {sent}")

print("\n" + "="*70)
print(f"{'':31} Sen1    Sen2    Sen3    Sen4")
for i, (sent, row) in enumerate(zip(sentences, similarity_matrix)):
    # Truncate to 22 chars + "..." so the label never exceeds the 25-char column
    short_sent = sent[:22] + "..." if len(sent) > 25 else sent
    print(f"Sen{i+1}: {short_sent:25} {row[0]:.4f}  {row[1]:.4f}  {row[2]:.4f}  {row[3]:.4f}")

print("\n✨ Key Insights:")
print(f"  • Dog walk vs Cat walk: {similarity_matrix[0, 1]:.4f} (HIGH - very similar semantic meaning!)")
print(f"  • Dog walk vs Weather: {similarity_matrix[0, 2]:.4f} (LOW - different topics)")
print(f"  • Dog walk vs ML: {similarity_matrix[0, 3]:.4f} (LOW - completely different domains)")
print("\n💡 Sentences that differ only in 'dog' vs 'cat' should score far higher than sentences on unrelated topics")
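
Reading the best-matching pair off a similarity matrix by eye gets harder as the number of sentences grows. A minimal sketch of doing it programmatically follows. It looks only at the upper triangle, so the diagonal (always 1) and the mirrored entries are skipped. The matrix values here are invented for illustration and are not model output, and `most_similar_pair` is a hypothetical helper.

```python
import numpy as np

def most_similar_pair(sim_matrix: np.ndarray) -> tuple:
    """Return (i, j, score) for the highest off-diagonal similarity (illustrative)."""
    rows, cols = np.triu_indices(sim_matrix.shape[0], k=1)  # index pairs with i < j
    best = int(np.argmax(sim_matrix[rows, cols]))
    i, j = int(rows[best]), int(cols[best])
    return i, j, float(sim_matrix[i, j])

# Made-up symmetric similarity matrix for three items
toy_matrix = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
print(most_similar_pair(toy_matrix))  # (0, 1, 0.8)
```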

In [None]:
# Visualize the similarity matrix as heatmap
plt.figure(figsize=(10, 8))

# Create short labels for readability
labels = [sent[:30] + '...' if len(sent) > 30 else sent for sent in sentences]

sns.heatmap(
    similarity_matrix,
    annot=True,
    fmt='.3f',
    cmap='viridis',
    xticklabels=labels,
    yticklabels=labels,
    cbar_kws={'label': 'Cosine Similarity'}
)

plt.title('Sentence Similarity Matrix (HuggingFace Transformer Embeddings)',
          fontsize=14, fontweight='bold')
plt.xlabel('Sentences', fontweight='bold')
plt.ylabel('Sentences', fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("✅ Similarity matrix visualization complete")
print("\n📸 Brighter cells mark sentence pairs with higher cosine similarity")

## Part 5: Advanced Applications - Semantic Search

One powerful application of cosine similarity in NLP is **semantic search** - finding the most relevant documents or sentences for a given query.

In [None]:
# Create a document collection
documents = [
    "The cat sat on the mat in the living room.",
    "A feline rested comfortably on the soft rug.",
    "The dog ran quickly through the green park.",
    "Machine learning algorithms can solve complex problems.",
    "Artificial intelligence is transforming modern technology.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Python is a popular programming language for data science.",
    "The weather was sunny and warm yesterday.",
    "I enjoy reading books about science fiction."
]

print(f"📚 Document Collection: {len(documents)} documents")
print("\nGenerating embeddings for all documents...")

# Generate embeddings for all documents
document_embeddings = []
for doc in documents:
    embedding = get_sentence_embedding(doc, tokenizer, model, device)
    document_embeddings.append(embedding)

document_embeddings = np.array(document_embeddings)
print(f"✅ Document embeddings generated: {document_embeddings.shape}")

In [None]:
def semantic_search(query: str, documents: List[str], document_embeddings: np.ndarray, 
                   top_k: int = 3) -> List[Tuple[str, float]]:
    """
    Perform semantic search using cosine similarity.

    Note: embeds the query with the notebook-level `tokenizer`, `model`, and `device`.
    
    Args:
        query: Search query string
        documents: List of document strings
        document_embeddings: Pre-computed document embeddings
        top_k: Number of top results to return
        
    Returns:
        List of tuples (document, similarity_score)
    """
    # Generate query embedding
    query_embedding = get_sentence_embedding(query, tokenizer, model, device)
    query_embedding = query_embedding.reshape(1, -1)  # Reshape for sklearn
    
    # Calculate cosine similarities
    similarities = cosine_similarity(query_embedding, document_embeddings)[0]
    
    # Get top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    # Return results
    results = [(documents[idx], similarities[idx]) for idx in top_indices]
    return results

print("✅ Semantic search function defined")
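
`semantic_search` sorts every score with `np.argsort`, which is fine for ten documents. For larger collections, one option (an assumption about your scale, not something this notebook requires) is `np.argpartition`, which selects the top-k without fully sorting. The sketch below works on precomputed vectors so it runs without a model. `top_k_cosine` and the toy 2D vectors are illustrative.

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return [(doc_index, score), ...] for the k rows of doc_matrix most similar to query_vec."""
    scores = doc_matrix @ query_vec / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    k = max(1, min(k, len(scores)))
    # argpartition places the k largest scores in the last k slots, in arbitrary order
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, best first
    ranked = candidates[np.argsort(scores[candidates])[::-1]]
    return [(int(i), float(scores[i])) for i in ranked]

toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.5])
for idx, score in top_k_cosine(toy_query, toy_docs, k=2):
    print(idx, round(score, 4))
```

With this toy data, document 0 ranks first and document 2 second, because the query points mostly along the first axis.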

### Example 8: Semantic Search Queries

In [None]:
# Test queries
test_queries = [
    "Tell me about cats",
    "What is AI and deep learning?",
    "Programming languages"
]

print("🔍 Semantic Search Results\n")
print("="*80)

for query in test_queries:
    print(f"\n🔎 Query: '{query}'")
    print("-" * 80)
    
    results = semantic_search(query, documents, document_embeddings, top_k=3)
    
    for rank, (doc, score) in enumerate(results, 1):
        print(f"\n  Rank {rank}: (Similarity: {score:.4f})")
        print(f"  📄 {doc}")

print("\n" + "="*80)
print("\n✨ Notice how the search understands semantic meaning:")
print("  • 'cats' query finds both 'cat' and 'feline' documents")
print("  • 'AI and deep learning' retrieves relevant ML documents")
print("  • 'Programming languages' finds Python document")
print("\n💡 This is the power of cosine similarity with transformer embeddings!")

### Visualizing Semantic Search Results

In [None]:
# Pick one query for detailed visualization
query = "artificial intelligence and neural networks"
query_embedding = get_sentence_embedding(query, tokenizer, model, device)
query_embedding = query_embedding.reshape(1, -1)

# Calculate similarities with all documents
similarities = cosine_similarity(query_embedding, document_embeddings)[0]

# Create bar plot
plt.figure(figsize=(12, 8))

# Sort by similarity
sorted_indices = np.argsort(similarities)[::-1]
sorted_similarities = similarities[sorted_indices]
sorted_docs = [documents[i] for i in sorted_indices]

# Create labels (truncated)
labels = [doc[:40] + '...' if len(doc) > 40 else doc for doc in sorted_docs]

# Plot
colors = plt.cm.viridis(sorted_similarities)
bars = plt.barh(range(len(sorted_similarities)), sorted_similarities, color=colors)
plt.yticks(range(len(sorted_similarities)), labels)
plt.xlabel('Cosine Similarity', fontsize=12, fontweight='bold')
plt.ylabel('Documents', fontsize=12, fontweight='bold')
plt.title(f'Semantic Search: "{query}"\nCosine Similarity Scores',
          fontsize=14, fontweight='bold')
plt.xlim(0, 1)
plt.grid(axis='x', alpha=0.3)

# Add value labels on bars
for i, (bar, sim) in enumerate(zip(bars, sorted_similarities)):
    plt.text(sim + 0.02, i, f'{sim:.3f}', va='center', fontsize=9)

plt.tight_layout()
plt.show()

print("✅ Semantic search visualization complete")

## Part 6: Summary and Best Practices

### 🔑 Key Concepts Mastered

1. **Mathematical Foundation**
   - Cosine similarity measures the angle between vectors
   - Formula: $\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$
   - Range: [-1, 1] where 1 = identical direction, 0 = orthogonal, -1 = opposite

2. **Geometric Intuition**
   - In 2D: Can visualize as angle between arrows
   - Scale invariance: Only direction matters, not magnitude
   - Extends seamlessly to high-dimensional spaces (384D, 768D, etc.)

3. **NLP Applications**
   - Convert text to embeddings using transformer models
   - Compare semantic similarity between sentences/documents
   - Enable semantic search and information retrieval
   - Build recommendation systems based on content similarity

4. **Implementation with HuggingFace**
   - Use sentence-transformer models for optimal embeddings
   - sklearn's `cosine_similarity` for efficient computation
   - Mean pooling to convert token embeddings to sentence embeddings

### 📈 Best Practices Learned

1. **Model Selection**
   - Use `sentence-transformers` models for semantic similarity tasks
   - Popular choices: `all-MiniLM-L6-v2` (fast), `all-mpnet-base-v2` (accurate)
   - Consider embedding dimension vs. performance trade-offs

2. **Normalization**
   - Cosine similarity divides by both vector magnitudes, so embeddings need no rescaling beforehand
   - L2-normalize embeddings once up front and cosine similarity reduces to a plain dot product, which is faster to compute
   - For large-scale applications, use FAISS for efficient similarity search

3. **Computational Efficiency**
   - Pre-compute and cache document embeddings
   - Use batch processing for multiple texts
   - Consider approximate nearest neighbor methods for large datasets

4. **Interpretation**
   - Similarity > 0.8: Very similar (likely paraphrases)
   - Similarity 0.5-0.8: Moderately similar (related topics)
   - Similarity < 0.5: Dissimilar (different topics)
   - Exact thresholds depend on domain and model
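
The normalization advice above can be sketched as follows: once every embedding is scaled to unit length, cosine similarity is just a dot product, so a whole similarity matrix becomes a single matrix multiplication. The random vectors stand in for real embeddings, and `l2_normalize` is an illustrative helper rather than a library function.

```python
import numpy as np

def l2_normalize(matrix: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

rng = np.random.default_rng(16)
embeddings = rng.standard_normal((5, 384))   # stand-in for 5 sentence embeddings

unit = l2_normalize(embeddings)
sim_via_dot = unit @ unit.T                  # dot products of unit vectors

# Reference: the full formula, dividing by both norms for every pair
norms = np.linalg.norm(embeddings, axis=1)
sim_via_formula = (embeddings @ embeddings.T) / np.outer(norms, norms)

print(np.allclose(sim_via_dot, sim_via_formula))  # True
```

Normalizing once and caching the unit vectors means each later query costs only one matrix-vector product.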

### 🚀 Next Steps

- **Notebook: Feature Extraction**: Deep dive into transformer embeddings
- **Documentation: FAISS.md**: Efficient similarity search at scale
- **Advanced Topics**: Semantic clustering, duplicate detection, recommendation systems
- **External Resources**: 
  - [HuggingFace Sentence Transformers](https://www.sbert.net/)
  - [sklearn Cosine Similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

> **Key Takeaway**: Cosine similarity is a fundamental metric for measuring semantic similarity in NLP. By converting text to embeddings and computing cosine similarity, we can build powerful applications like semantic search, duplicate detection, and recommendation systems.

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*