[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/cosine-similarity.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/cosine-similarity.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/cosine-similarity.ipynb)

# Understanding Cosine Similarity: From Vectors to Text Embeddings

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- What cosine similarity is and how it measures vector similarity
- How to calculate cosine similarity between 2D vectors with visualization
- How to compute cosine similarity in higher dimensions
- How to apply cosine similarity to text embeddings generated with HuggingFace transformers
- Practical applications for semantic similarity in hate speech detection

## 📋 Prerequisites
- Basic understanding of vectors and linear algebra
- Familiarity with Python and NumPy
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Basic understanding of machine learning concepts

## 📚 What We'll Cover
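1. **Mathematical Foundation**: The cosine similarity formula and the intuition behind it
2. **2D Vector Similarity**: Visual demonstration with 2D vectors
3. **Higher Dimensions**: Computing similarity in multi-dimensional space
4. **Text Embeddings**: Applying cosine similarity to text using HuggingFace transformers
5. **Semantic Search**: Ranking documents against a query by similarity
6. **Hate Speech Detection Example**: Comparing content patterns for moderation
7. **Understanding the Similarity Matrix**: Summary statistics and most/least similar pairs
8. **Best Practices and Tips**: Normalization and practical guidance
9. **Summary**: Key takeaways and next steps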

## Part 1: Mathematical Foundation of Cosine Similarity

**Cosine similarity** measures the similarity between two vectors by computing the cosine of the angle between them. Unlike Euclidean distance, cosine similarity focuses on the *direction* rather than the *magnitude* of vectors.

### Mathematical Formula

For two vectors $\mathbf{A}$ and $\mathbf{B}$:

$$\text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where:
- $\mathbf{A} \cdot \mathbf{B}$ is the dot product of vectors A and B
- $\|\mathbf{A}\|$ is the magnitude (L2 norm) of vector A
- $\|\mathbf{B}\|$ is the magnitude (L2 norm) of vector B
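
As a worked instance of the formula, take $\mathbf{A} = [3, 4]$ and $\mathbf{B} = [4, 5]$, the same pair used in the first 2D example of Part 2:

$$\mathbf{A} \cdot \mathbf{B} = 3 \cdot 4 + 4 \cdot 5 = 32, \qquad \|\mathbf{A}\| = \sqrt{3^2 + 4^2} = 5, \qquad \|\mathbf{B}\| = \sqrt{4^2 + 5^2} = \sqrt{41} \approx 6.4031$$

$$\text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{32}{5 \sqrt{41}} \approx 0.9995$$

The value is close to 1 because the two vectors point in almost the same direction.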

### Interpretation

- **Value Range**: Cosine similarity ranges from -1 to 1
  - `1`: Vectors point in the same direction (perfectly similar)
  - `0`: Vectors are orthogonal (no similarity)
  - `-1`: Vectors point in opposite directions (perfectly dissimilar)

- **Key Property**: Scale-invariant - it measures orientation, not magnitude
  - `[1, 2]` and `[2, 4]` have cosine similarity of 1 (same direction)
  - This suits text representations, where document length or raw term frequency should not dominate the similarity score
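
To make the scale-invariance property concrete, here is a minimal sketch that contrasts cosine similarity with Euclidean distance for the `[1, 2]` and `[2, 4]` pair mentioned above. The helper name `cosine` is introduced only for this illustration and assumes neither input is a zero vector.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D arrays (hypothetical helper; assumes non-zero vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])  # same direction as a, twice the length

cos_ab = cosine(a, b)
euclid_ab = float(np.linalg.norm(a - b))

print(f"Cosine similarity:  {cos_ab:.4f}")     # direction only
print(f"Euclidean distance: {euclid_ab:.4f}")  # sensitive to length

# Rescaling by a positive factor leaves the cosine unchanged
cos_scaled = cosine(a, 10.0 * b)
# Rescaling by a negative factor flips the direction, and therefore the sign
cos_flipped = cosine(a, -3.0 * b)
print(f"After scaling b by 10:  {cos_scaled:.4f}")
print(f"After scaling b by -3: {cos_flipped:.4f}")
```

Strictly speaking, the invariance holds for positive scaling factors only: a negative factor reverses the vector and changes the sign of the result.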

In [None]:
# Install required packages (uncomment if needed)
# !pip install transformers torch numpy matplotlib scikit-learn seaborn

# Import essential libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
import torch
from transformers import AutoTokenizer, AutoModel
import warnings
warnings.filterwarnings('ignore')

# Set reproducible environment with repository standard seed=16
torch.manual_seed(16)
np.random.seed(16)
print("🔢 Random seed set to 16 for reproducibility")

# Configure visualization style (repository standard)
sns.set_style('darkgrid')  # Better readability with gridlines
sns.set_palette("husl")     # Consistent, accessible colors
print("📊 Visualization style configured: darkgrid with husl palette")

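# For Google Colab TPU compatibility
# Detect Colab and torch_xla separately, so that a missing TPU runtime
# does not also mark the Colab environment as unavailable
try:
    import google.colab  # noqa: F401
    COLAB_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False

try:
    import torch_xla.core.xla_model as xm
    TPU_AVAILABLE = True
except ImportError:
    TPU_AVAILABLE = False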

print("\n📚 Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Device detection for optimal performance
def get_device():
    """
    Get the best available device for PyTorch operations.
    
    Device Priority:
    - Google Colab: TPU when available, otherwise the general order below
    - General: CUDA GPU > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    # Google Colab: Always prefer TPU when available
    if COLAB_AVAILABLE and TPU_AVAILABLE:
        try:
            # Try to initialize TPU
            device = xm.xla_device()
            print("🔥 Using Google Colab TPU for optimal performance")
            print("💡 TPU is preferred in Colab for training and inference")
            return device
        except Exception as e:
            print(f"⚠️ TPU initialization failed: {e}")
            print("Falling back to GPU/CPU detection")
    
    # Standard device detection for other environments
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU/TPU for better performance)")
    
    return device

# Get device for later use
device = get_device()

## Part 2: Cosine Similarity in 2D Space

Let's start with a simple example using 2D vectors that we can visualize. This will help build intuition before moving to higher dimensions.
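
Since cosine similarity is the cosine of the angle between two vectors, the angle itself can be recovered with an inverse cosine. The following is a small standard-library sketch for the pair `(3, 4)` and `(4, 5)`; the cross-check via `atan2` is an addition for illustration and works only in 2D.

```python
import math

a = (3.0, 4.0)
b = (4.0, 5.0)

dot = a[0] * b[0] + a[1] * b[1]
norm_a = math.hypot(*a)
norm_b = math.hypot(*b)

cos_theta = dot / (norm_a * norm_b)
# Clamp to [-1, 1]: floating-point rounding can push the ratio slightly outside
cos_theta = max(-1.0, min(1.0, cos_theta))
angle_deg = math.degrees(math.acos(cos_theta))

# Cross-check (2D only): difference between the two polar angles
angle_check = abs(math.degrees(math.atan2(a[1], a[0]) - math.atan2(b[1], b[0])))

print(f"cos(theta) = {cos_theta:.4f}")
print(f"theta      = {angle_deg:.2f} degrees (cross-check: {angle_check:.2f})")
```

A similarity just below 1 corresponds to an angle of under two degrees here, which is why the two arrows nearly overlap when plotted.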

In [None]:
def cosine_similarity_manual(vector_a, vector_b):
    """
    Calculate cosine similarity between two vectors manually.
    
    This implementation shows the mathematical steps explicitly
    for educational purposes.
    
    Args:
        vector_a: First vector (numpy array)
        vector_b: Second vector (numpy array)
        
    Returns:
        float: Cosine similarity value between -1 and 1
    """
    # Step 1: Calculate dot product (numerator)
    dot_product = np.dot(vector_a, vector_b)
    
    # Step 2: Calculate magnitudes (denominator)
    magnitude_a = np.sqrt(np.sum(vector_a ** 2))
    magnitude_b = np.sqrt(np.sum(vector_b ** 2))
    
    # Step 3: Calculate cosine similarity
    # Add small epsilon to avoid division by zero
    cosine_sim = dot_product / (magnitude_a * magnitude_b + 1e-8)
    
    return cosine_sim

# Example 1: Two 2D vectors pointing in similar directions
vector1 = np.array([3, 4])
vector2 = np.array([4, 5])

# Calculate similarity manually
similarity = cosine_similarity_manual(vector1, vector2)

print("📐 2D Vector Similarity Example")
print("=" * 50)
print(f"Vector A: {vector1}")
print(f"Vector B: {vector2}")
print(f"\nDot product: {np.dot(vector1, vector2)}")
print(f"Magnitude A: {np.linalg.norm(vector1):.4f}")
print(f"Magnitude B: {np.linalg.norm(vector2):.4f}")
print(f"\n✅ Cosine Similarity: {similarity:.4f}")

# Verify with sklearn
similarity_sklearn = cosine_similarity([vector1], [vector2])[0, 0]
print(f"✅ sklearn verification: {similarity_sklearn:.4f}")

# Example 2: Orthogonal vectors (perpendicular)
vector3 = np.array([1, 0])
vector4 = np.array([0, 1])
similarity_orthogonal = cosine_similarity_manual(vector3, vector4)

print(f"\n📐 Orthogonal Vectors:")
print(f"Vector C: {vector3}")
print(f"Vector D: {vector4}")
print(f"✅ Cosine Similarity: {similarity_orthogonal:.4f} (perpendicular)")

# Example 3: Opposite direction vectors
vector5 = np.array([2, 3])
vector6 = np.array([-2, -3])
similarity_opposite = cosine_similarity_manual(vector5, vector6)

print(f"\n📐 Opposite Direction Vectors:")
print(f"Vector E: {vector5}")
print(f"Vector F: {vector6}")
print(f"✅ Cosine Similarity: {similarity_opposite:.4f} (opposite directions)")

In [None]:
# Visualize 2D vectors and their cosine similarity
def visualize_2d_vectors(vector_a, vector_b, title="2D Vector Cosine Similarity"):
    """
    Visualize two 2D vectors and display their cosine similarity.
    
    Args:
        vector_a: First vector (numpy array of shape [2])
        vector_b: Second vector (numpy array of shape [2])
        title: Plot title
    """
    # Calculate cosine similarity
    similarity = cosine_similarity_manual(vector_a, vector_b)
    
    # Calculate angle between vectors (in degrees)
    angle_radians = np.arccos(np.clip(similarity, -1.0, 1.0))
    angle_degrees = np.degrees(angle_radians)
    
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Plot vectors
    ax.quiver(0, 0, vector_a[0], vector_a[1], angles='xy', scale_units='xy', 
              scale=1, color='blue', width=0.015, label=f'Vector A: {vector_a}')
    ax.quiver(0, 0, vector_b[0], vector_b[1], angles='xy', scale_units='xy', 
              scale=1, color='red', width=0.015, label=f'Vector B: {vector_b}')
    
    # Set axis properties
    max_val = max(abs(vector_a).max(), abs(vector_b).max()) * 1.3
    ax.set_xlim(-max_val, max_val)
    ax.set_ylim(-max_val, max_val)
    ax.set_aspect('equal')
    
    # Add grid and labels
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_xlabel('X axis', fontsize=12, fontweight='bold')
    ax.set_ylabel('Y axis', fontsize=12, fontweight='bold')
    
    # Add title with similarity information
    ax.set_title(f'{title}\nCosine Similarity: {similarity:.4f} | Angle: {angle_degrees:.2f}°', 
                 fontsize=14, fontweight='bold')
    
    # Add legend
    ax.legend(loc='upper right', fontsize=10)
    
    # Add text box with interpretation
    interpretation = ""
    if similarity > 0.9:
        interpretation = "Very Similar (same direction)"
    elif similarity > 0.5:
        interpretation = "Moderately Similar"
    elif similarity > -0.5:
        interpretation = "Low Similarity"
    else:
        interpretation = "Opposite Directions"
    
    textstr = f'Interpretation: {interpretation}'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=11,
            verticalalignment='top', bbox=props)
    
    plt.tight_layout()
    plt.show()

# Visualize the examples
print("\n📊 Visualizing 2D Vector Examples\n")

# Example 1: Similar direction
visualize_2d_vectors(vector1, vector2, "Example 1: Similar Direction Vectors")

# Example 2: Orthogonal vectors
visualize_2d_vectors(vector3, vector4, "Example 2: Orthogonal Vectors")

# Example 3: Opposite directions
visualize_2d_vectors(vector5, vector6, "Example 3: Opposite Direction Vectors")

## Part 3: Cosine Similarity in Higher Dimensions

While 2D examples help build intuition, real-world applications like text embeddings work in much higher dimensions (often 384, 768, or even 4096 dimensions). The mathematical principles remain the same, but the vectors can no longer be plotted directly.

Let's compute cosine similarity for higher-dimensional vectors.
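
One effect worth seeing before the main example: for independent random Gaussian vectors, the typical size of the cosine similarity shrinks roughly like $1/\sqrt{d}$ as the dimension $d$ grows. The sketch below estimates this empirically; the helper `mean_abs_cosine` and the choice of 2000 pairs are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(16)  # seed chosen to match the notebook's convention

def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cosine similarity| over n_pairs of independent Gaussian vector pairs."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    # Row-wise cosine: dot product of each pair divided by the product of norms
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean(np.abs(cos)))

results = {dim: mean_abs_cosine(dim) for dim in (2, 10, 100, 768)}

for dim, value in results.items():
    print(f"dim={dim:4d}  mean |cos| = {value:.4f}   1/sqrt(dim) = {1 / np.sqrt(dim):.4f}")
```

This is why a similarity of, say, 0.3 between two 768-dimensional embeddings is far from what chance alone would produce, even though it looks small on the -1 to 1 scale.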

In [None]:
# Generate random high-dimensional vectors for demonstration
# Using seed=16 for reproducibility (repository standard)
np.random.seed(16)

# Create vectors of different dimensions
dimensions = [10, 100, 384, 768]

print("📊 Cosine Similarity in Higher Dimensions")
print("=" * 60)

for dim in dimensions:
    # Generate random vectors
    vec_a = np.random.randn(dim)
    vec_b = np.random.randn(dim)
    
    # Create a similar vector (vec_c = vec_a with some noise)
    vec_c = vec_a + 0.1 * np.random.randn(dim)
    
    # Calculate similarities
    sim_random = cosine_similarity([vec_a], [vec_b])[0, 0]
    sim_similar = cosine_similarity([vec_a], [vec_c])[0, 0]
    
    print(f"\n📐 Dimension: {dim}")
    print(f"   Random vectors similarity:  {sim_random:.4f}")
    print(f"   Similar vectors similarity: {sim_similar:.4f}")

print("\n💡 Key Insight:")
print("   In high dimensions, random vectors tend to be nearly orthogonal (similarity ≈ 0)")
print("   Vectors derived from each other maintain high similarity even in high dimensions")

In [None]:
# Time sklearn's vectorized pairwise similarity on a large batch of vectors
import time

# Create large high-dimensional vectors
dim = 768  # Common dimension for BERT embeddings
n_vectors = 1000

np.random.seed(16)
vectors = np.random.randn(n_vectors, dim)

print("\n⏱️ Performance Check")
print("=" * 50)
print(f"Computing similarity matrix for {n_vectors} vectors of dimension {dim}")

# Using sklearn (vectorized); perf_counter is the recommended clock for timing
start_time = time.perf_counter()
similarity_matrix = cosine_similarity(vectors)
sklearn_time = time.perf_counter() - start_time

print(f"\n✅ sklearn.metrics.pairwise.cosine_similarity:")
print(f"   Time: {sklearn_time:.4f} seconds")
print(f"   Output shape: {similarity_matrix.shape}")
print(f"   One vectorized call computes all {n_vectors * n_vectors:,} pairwise scores")

# Show some example similarities
print(f"\n📊 Sample Similarity Values:")
print(f"   Vector 0 vs Vector 0: {similarity_matrix[0, 0]:.4f} (self-similarity, always 1.0)")
print(f"   Vector 0 vs Vector 1: {similarity_matrix[0, 1]:.4f}")
print(f"   Vector 0 vs Vector 2: {similarity_matrix[0, 2]:.4f}")

print("\n💡 Best Practice:")
print("   Prefer a vectorized routine such as sklearn.metrics.pairwise.cosine_similarity over Python loops")
print("   The dense output grows as n², so very large collections need chunking or approximate search")

## Part 4: Cosine Similarity for Text with HuggingFace Transformers

Now let's apply cosine similarity to real text data using HuggingFace transformers. This is where the concept becomes powerful for NLP applications.

### Process:
1. **Tokenize** text into tokens
2. **Generate embeddings** using a pre-trained transformer model
3. **Compute cosine similarity** between text embeddings
4. **Interpret results** for semantic similarity
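
Step 2 usually involves pooling token vectors into one sentence vector. When several texts are encoded in one padded batch, the padded positions should be excluded from the average. The following NumPy sketch shows that arithmetic on toy arrays; the function name `masked_mean_pool` and the toy values are assumptions for illustration, and with PyTorch tensors the same steps apply using the tokenizer's `attention_mask`.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """
    Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 positions, hidden size 2
tokens = np.array([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],      # no padding
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(tokens, mask)
naive = tokens.mean(axis=1)  # plain mean, lets the padding vector leak in

print("Masked mean:\n", pooled)
print("Plain mean:\n", naive)
```

When a single text is encoded on its own, no padding is added, so a plain mean over the token axis gives the same result as the masked version.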

In [None]:
# Load a pre-trained model for generating text embeddings
# Using sentence-transformers/all-MiniLM-L6-v2 - optimized for semantic similarity
model_name = "sentence-transformers/all-MiniLM-L6-v2"

print(f"📥 Loading model: {model_name}")
print("This model generates 384-dimensional embeddings optimized for semantic similarity\n")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Move model to optimal device
model = model.to(device)
model.eval()  # Set to evaluation mode

print(f"✅ Model loaded successfully on {device}")
print(f"📊 Embedding dimension: {model.config.hidden_size}")

In [None]:
def get_text_embedding(text, tokenizer, model, device):
    """
    Generate text embedding using HuggingFace transformer model.
    
    Args:
        text: Input text string
        tokenizer: HuggingFace tokenizer
        model: HuggingFace model
        device: Device to run inference on
        
    Returns:
        numpy.ndarray: Text embedding vector
    """
    # Tokenize text
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        padding=True, 
        truncation=True, 
        max_length=512
    )
    
    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate embeddings (no gradient computation needed)
    with torch.no_grad():
        outputs = model(**inputs)
        
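        # Use mean pooling to get a sentence-level embedding:
        # average the token embeddings into a single vector for the entire text.
        # A plain mean is fine here because one text is encoded at a time (no padding);
        # for batched inputs, weight the mean by inputs["attention_mask"] instead.
        embeddings = outputs.last_hidden_state.mean(dim=1)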
    
    # Convert to numpy and return
    return embeddings.cpu().numpy()[0]

# Example sentences for similarity comparison
sentences = [
    "I took my dog for a walk in the park.",
    "I took my cat for a walk in the park.",
    "The weather is beautiful today.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers."
]

print("📝 Sample Sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"   {i}. {sent}")

# Generate embeddings for all sentences
print(f"\n🔄 Generating embeddings...")
sentence_embeddings = np.array([
    get_text_embedding(sent, tokenizer, model, device) 
    for sent in sentences
])

print(f"✅ Generated embeddings shape: {sentence_embeddings.shape}")
print(f"   (5 sentences × 384 dimensions)")

In [None]:
# Calculate cosine similarity matrix between all sentence pairs
# Passing a single (n, d) array returns the full n x n matrix
similarity_matrix = cosine_similarity(sentence_embeddings)

print("📊 Cosine Similarity Matrix")
print("=" * 70)
print("\nSimilarity scores between all pairs of sentences:\n")

# Display similarity matrix with sentence labels
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i < j:  # Only show upper triangle (avoid duplicates)
            print(f"Sentence {i+1} ↔ Sentence {j+1}: {similarity_matrix[i, j]:.4f}")
            print(f"  → \"{sentences[i][:50]}...\"")
            print(f"  → \"{sentences[j][:50]}...\"")
            print()

print("\n💡 What to Look For (compare with the scores above):")
print("   • Sentences 1 & 2 should score high (dog vs cat walk) - near-identical wording")
print("   • Sentences 4 & 5 should score fairly high - both about AI/ML concepts")
print("   • Sentence 3 (weather) should score low against the technical sentences")

In [None]:
# Visualize the similarity matrix as a heatmap
def visualize_similarity_heatmap(similarity_matrix, sentences):
    """
    Visualize text similarity matrix as a heatmap.
    
    Args:
        similarity_matrix: Cosine similarity matrix
        sentences: List of text strings
    """
    # Create figure
    plt.figure(figsize=(12, 10))
    
    # Create shorter labels for readability
    labels = [f"S{i+1}: {sent[:30]}..." if len(sent) > 30 else f"S{i+1}: {sent}" 
              for i, sent in enumerate(sentences)]
    
    # Create heatmap
    sns.heatmap(
        similarity_matrix,
        annot=True,           # Show values in cells
        fmt='.3f',            # Format numbers to 3 decimal places
        cmap='viridis',       # Color scheme
        xticklabels=labels,
        yticklabels=labels,
        cbar_kws={'label': 'Cosine Similarity'},
        vmin=0,               # Bottom of color scale (negative scores share this color)
        vmax=1,               # Top of color scale
        square=True           # Make cells square
    )
    
    plt.title('Text Similarity Matrix (Cosine Similarity)', 
              fontsize=16, fontweight='bold', pad=20)
    plt.xlabel('Sentences', fontsize=12, fontweight='bold')
    plt.ylabel('Sentences', fontsize=12, fontweight='bold')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    
    plt.tight_layout()
    plt.show()

# Visualize the similarity matrix
print("\n📊 Visualizing Similarity Matrix as Heatmap\n")
visualize_similarity_heatmap(similarity_matrix, sentences)

## Part 5: Practical Application - Semantic Search

Let's implement a simple semantic search engine using cosine similarity. This is a common real-world application.
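
Before working with real embeddings, the core ranking step can be sketched on toy vectors. This minimal example normalizes the vectors once, so that cosine similarity becomes a matrix-vector product, and then selects the top `k` rows. The function `top_k_cosine` and the toy data are hypothetical and assume no row is a zero vector.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    # L2-normalize once; afterwards cosine similarity is a plain dot product
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = docs @ query

    k = min(k, len(scores))
    # argpartition isolates the k best candidates without sorting everything
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates, best first
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# Toy "document embeddings" in 3 dimensions
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
])
query = np.array([2.0, 0.0, 0.0])

idx, scores = top_k_cosine(query, docs, k=2)
for i, s in zip(idx, scores):
    print(f"doc {i}: similarity {s:.4f}")
```

For a handful of documents a full `argsort` is just as good; the partition step matters only when the collection is large.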

In [None]:
# Create a document collection for semantic search
documents = [
    "Hate speech detection is crucial for online safety.",
    "Machine learning models can identify toxic content.",
    "Natural language processing helps understand text meaning.",
    "Transformers revolutionized NLP with attention mechanisms.",
    "Social media platforms use AI to moderate content.",
    "Deep learning requires large amounts of training data.",
    "BERT and RoBERTa are popular transformer models.",
    "Toxic comment classification protects online communities.",
    "The weather forecast predicts rain tomorrow.",
    "Python is a popular programming language for data science."
]

print("📚 Document Collection:")
for i, doc in enumerate(documents, 1):
    print(f"   {i:2d}. {doc}")

# Generate embeddings for all documents
print(f"\n🔄 Generating document embeddings...")
document_embeddings = np.array([
    get_text_embedding(doc, tokenizer, model, device) 
    for doc in documents
])

print(f"✅ Generated {len(documents)} document embeddings")
print(f"   Shape: {document_embeddings.shape}")

In [None]:
def semantic_search(query, documents, document_embeddings, tokenizer, model, device, top_k=3):
    """
    Perform semantic search using cosine similarity.
    
    Args:
        query: Search query string
        documents: List of document strings
        document_embeddings: Pre-computed document embeddings
        tokenizer: HuggingFace tokenizer
        model: HuggingFace model
        device: Device for inference
        top_k: Number of top results to return
        
    Returns:
        List of (document, similarity_score) tuples
    """
    # Generate query embedding
    query_embedding = get_text_embedding(query, tokenizer, model, device)
    
    # Calculate cosine similarity between query and all documents
    similarities = cosine_similarity([query_embedding], document_embeddings)[0]
    
    # Get top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    # Return results
    results = [(documents[idx], similarities[idx]) for idx in top_indices]
    
    return results

# Example queries
queries = [
    "How to detect hate speech in text?",
    "What are transformer models?",
    "Weather information"
]

print("🔍 Semantic Search Examples")
print("=" * 70)

for query in queries:
    print(f"\n📝 Query: \"{query}\"")
    print("\nTop 3 Most Relevant Documents:")
    
    results = semantic_search(query, documents, document_embeddings, 
                             tokenizer, model, device, top_k=3)
    
    for rank, (doc, score) in enumerate(results, 1):
        print(f"\n   {rank}. Similarity: {score:.4f}")
        print(f"      → {doc}")
    
    print("\n" + "-" * 70)

print("\n💡 Observation:")
print("   Semantic search finds relevant documents even when exact words don't match!")
print("   It understands the meaning and context of the query.")

## Part 6: Hate Speech Detection Example

Let's apply cosine similarity to a hate speech detection scenario, which is a key focus area of this repository.
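
One way such a scenario could be framed is to compare new texts against a small set of already-reviewed reference examples and surface the ones that are close to any of them. The sketch below uses toy 3-dimensional vectors in place of real embeddings; the function name, the vectors, and the `0.8` threshold are all assumptions for illustration, and a high score is a cue for human review rather than a classification.

```python
import numpy as np

def max_similarity_to_references(candidates, references):
    """For each candidate row, return its best cosine score and the matching reference index."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = c @ r.T  # shape: (n_candidates, n_references)
    return sims.max(axis=1), sims.argmax(axis=1)

# Toy stand-ins for embeddings (hypothetical values)
references = np.array([
    [1.0, 0.2, 0.0],
    [0.8, 0.0, 0.6],
])
candidates = np.array([
    [0.9, 0.25, 0.05],  # close to reference 0
    [0.0, 1.0, 0.0],    # unlike either reference
    [0.7, 0.1, 0.6],    # close to reference 1
])

THRESHOLD = 0.8  # assumed cut-off; in practice this is tuned on labeled data

best_score, best_ref = max_similarity_to_references(candidates, references)
needs_review = best_score >= THRESHOLD

for i, (score, ref, flag) in enumerate(zip(best_score, best_ref, needs_review)):
    print(f"candidate {i}: best match = reference {ref}, score = {score:.4f}, review = {bool(flag)}")
```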

In [None]:
# Sample texts: friendly messages, a polite disagreement, and complaints about offensive content
sample_texts = [
    "I love spending time with my friends and family.",
    "This is a great community with wonderful people.",
    "Thank you for your help and support!",
    "I disagree with this policy but respect your opinion.",
    "This content is inappropriate and offensive.",
    "Stop spreading hate and negativity online.",
]

print("📝 Sample Texts for Analysis:")
for i, text in enumerate(sample_texts, 1):
    print(f"   {i}. {text}")

# Generate embeddings
print(f"\n🔄 Generating embeddings...")
sample_embeddings = np.array([
    get_text_embedding(text, tokenizer, model, device) 
    for text in sample_texts
])

# Calculate similarity matrix
sample_similarity_matrix = cosine_similarity(sample_embeddings)

print(f"✅ Computed similarity matrix: {sample_similarity_matrix.shape}")

# Visualize
print("\n📊 Similarity Matrix Heatmap:\n")
visualize_similarity_heatmap(sample_similarity_matrix, sample_texts)

# Patterns to look for (expectations; confirm them against the heatmap above)
print("\n🔍 Analysis:")
print("   • Positive messages (1-3) are expected to cluster together with higher similarity")
print("   • Critical/negative messages (5-6) are expected to be similar to each other")
print("   • The neutral message (4) is expected to sit between the two groups")
print("\n💡 Application:")
print("   Cosine similarity helps identify content patterns for moderation")
print("   It can be used to find texts similar to known toxic content or to cluster user behaviors")

## Part 7: Understanding the Similarity Matrix

Let's dive deeper into interpreting similarity matrices and what they tell us.
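
A similarity matrix is symmetric with ones on the diagonal, so only the upper triangle carries distinct information. The sketch below lists every unordered pair with its score, using the row and column arrays from `np.triu_indices` to map each upper-triangle entry back to its `(i, j)` pair. The helper `ranked_pairs` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def ranked_pairs(similarity_matrix):
    """List every unordered pair (i, j, score) with i < j, most similar first."""
    n = similarity_matrix.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # k=1 skips the diagonal
    scores = similarity_matrix[rows, cols]
    order = np.argsort(-scores)           # descending by score
    return [(int(rows[p]), int(cols[p]), float(scores[p])) for p in order]

toy = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

pairs = ranked_pairs(toy)
for i, j, score in pairs:
    print(f"pair ({i}, {j}): {score:.2f}")
```

Note that a position in the flattened upper triangle is not a position in the full `n x n` matrix, so the `rows` and `cols` arrays (rather than `np.unravel_index` with shape `(n, n)`) are what recover the pair.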

In [None]:
# Create a detailed analysis function
def analyze_similarity_matrix(similarity_matrix, texts, threshold=0.7):
    """
    Analyze and interpret a similarity matrix.
    
    Args:
        similarity_matrix: Cosine similarity matrix
        texts: List of text strings
        threshold: Similarity threshold for "high similarity"
    """
    n = len(texts)
    
    print("📊 Similarity Matrix Analysis")
    print("=" * 70)
    
    # Basic statistics
    # Extract upper triangle (excluding diagonal)
    upper_triangle = similarity_matrix[np.triu_indices(n, k=1)]
    
    print(f"\n📈 Statistics (excluding self-similarity):")
    print(f"   Mean similarity:   {upper_triangle.mean():.4f}")
    print(f"   Median similarity: {np.median(upper_triangle):.4f}")
    print(f"   Min similarity:    {upper_triangle.min():.4f}")
    print(f"   Max similarity:    {upper_triangle.max():.4f}")
    print(f"   Std deviation:     {upper_triangle.std():.4f}")
    
    # Find highly similar pairs
    print(f"\n🔗 Highly Similar Pairs (similarity > {threshold}):")
    high_similarity_count = 0
    for i in range(n):
        for j in range(i+1, n):
            if similarity_matrix[i, j] > threshold:
                high_similarity_count += 1
                print(f"\n   Pair {i+1}-{j+1}: {similarity_matrix[i, j]:.4f}")
                print(f"   → Text {i+1}: {texts[i][:60]}...")
                print(f"   → Text {j+1}: {texts[j][:60]}...")
    
    if high_similarity_count == 0:
        print(f"   No pairs found with similarity > {threshold}")
    
    # Find dissimilar pairs
    print(f"\n🔀 Most Dissimilar Pairs (lowest similarity):")
    dissimilar_pairs = []
    for i in range(n):
        for j in range(i+1, n):
            dissimilar_pairs.append((i, j, similarity_matrix[i, j]))
    
    dissimilar_pairs.sort(key=lambda x: x[2])
    for i, j, sim in dissimilar_pairs[:3]:
        print(f"\n   Pair {i+1}-{j+1}: {sim:.4f}")
        print(f"   → Text {i+1}: {texts[i][:60]}...")
        print(f"   → Text {j+1}: {texts[j][:60]}...")

# Analyze our sample similarity matrix
analyze_similarity_matrix(similarity_matrix, sentences, threshold=0.7)

## Part 8: Best Practices and Tips

Here are important considerations when working with cosine similarity in NLP.
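
One edge case deserves explicit handling: the angle to a zero vector is undefined, so the formula divides by zero. The following is a minimal sketch of a guarded version; returning `0.0` for a zero vector is a convention assumed here, and the name `safe_cosine` is hypothetical.

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention for this sketch; the angle is undefined
    # Clip guards against tiny floating-point overshoots beyond [-1, 1]
    return float(np.clip(np.dot(a, b) / denom, -1.0, 1.0))

print(safe_cosine([0, 0], [1, 2]))  # zero vector handled without a warning
print(safe_cosine([1, 2], [2, 4]))  # same direction
print(safe_cosine([1, 0], [0, 1]))  # orthogonal
```

Compared with adding a small epsilon to the denominator, an explicit check leaves the scores of non-zero vectors exactly as the formula defines them.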

In [None]:
# Demonstrate magnitude invariance and its link to normalization
def demonstrate_normalization():
    """
    Show that cosine similarity ignores magnitude and that, for
    unit-normalized vectors, it equals the plain dot product.
    """
    print("📐 Vector Normalization for Cosine Similarity")
    print("=" * 70)
    
    # Create two vectors with different magnitudes but same direction
    vec1 = np.array([1, 2, 3])
    vec2 = np.array([2, 4, 6])  # Same direction, double magnitude
    vec3 = np.array([1, 1, 1])  # Different direction
    
    print("\n📊 Original Vectors:")
    print(f"   Vector 1: {vec1} (magnitude: {np.linalg.norm(vec1):.4f})")
    print(f"   Vector 2: {vec2} (magnitude: {np.linalg.norm(vec2):.4f})")
    print(f"   Vector 3: {vec3} (magnitude: {np.linalg.norm(vec3):.4f})")
    
    # Calculate similarities
    sim_1_2 = cosine_similarity([vec1], [vec2])[0, 0]
    sim_1_3 = cosine_similarity([vec1], [vec3])[0, 0]
    
    print("\n✅ Cosine Similarities:")
    print(f"   Vec1 ↔ Vec2 (same direction): {sim_1_2:.4f}")
    print(f"   Vec1 ↔ Vec3 (different direction): {sim_1_3:.4f}")
    
    print("\n💡 Key Insight:")
    print("   Cosine similarity is magnitude-invariant!")
    print("   Vectors with same direction have similarity = 1.0 regardless of magnitude")
    print("   For count-based text vectors, 'cat' and 'cat cat cat' point the same way and score 1.0")
    
    # Normalize vectors manually
    vec1_norm = vec1 / np.linalg.norm(vec1)
    vec2_norm = vec2 / np.linalg.norm(vec2)
    
    print("\n📐 Normalized Vectors (unit length):")
    print(f"   Vector 1: {vec1_norm}")
    print(f"   Vector 2: {vec2_norm}")
    print("   Notice: Normalized vectors point in same direction with magnitude 1.0")
    
    # For normalized vectors, cosine similarity = dot product
    dot_product = np.dot(vec1_norm, vec2_norm)
    print(f"\n🔢 For normalized vectors: dot product = {dot_product:.4f}")
    print(f"   This equals cosine similarity: {sim_1_2:.4f}")

demonstrate_normalization()

In [None]:
# Best practices summary
print("📚 Best Practices for Cosine Similarity in NLP")
print("=" * 70)

best_practices = {
    "✅ DO's": [
        "Use sklearn.metrics.pairwise.cosine_similarity for efficiency",
        "Choose appropriate embedding models for your domain (e.g., hate speech models)",
        "Normalize embeddings when computing similarities at scale",
        "Use batch processing for large datasets to improve performance",
        "Consider the context: high similarity threshold varies by application",
        "Visualize similarity matrices to understand data relationships",
        "Cache embeddings for repeated similarity computations"
    ],
    "❌ DON'Ts": [
        "Don't use Euclidean distance when direction matters more than magnitude",
        "Don't compare raw text without proper embeddings",
        "Don't ignore computational costs for large-scale comparisons",
        "Don't use generic models when domain-specific ones exist",
        "Don't forget to handle edge cases (zero vectors, very short texts)",
        "Don't rely solely on cosine similarity for complex NLP tasks"
    ],
    "💡 Tips": [
        "For hate speech: Use specialized models like cardiffnlp/twitter-roberta-base-hate-latest",
        "Combine cosine similarity with other features for robust classification",
        "Use approximate nearest neighbor search (FAISS) for large-scale similarity",
        "Monitor performance: embeddings generation is often the bottleneck",
        "Consider fine-tuning embeddings on your specific domain"
    ]
}

for category, items in best_practices.items():
    print(f"\n{category}:")
    for item in items:
        print(f"   • {item}")

print("\n" + "=" * 70)

---

## 📋 Summary

### 🔑 Key Concepts Mastered

- **Cosine Similarity Formula**: Measures the cosine of the angle between vectors, focusing on direction rather than magnitude
  $$\text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$

- **2D Visualization**: Demonstrated how cosine similarity works with visual examples in 2D space
  - Similar direction vectors → High similarity (close to 1.0)
  - Orthogonal vectors → Zero similarity
  - Opposite direction vectors → Negative similarity (close to -1.0)

- **High-Dimensional Computation**: Applied cosine similarity to vectors in 384+ dimensions
  - Random vectors in high dimensions tend to be nearly orthogonal
  - Efficient computation using sklearn.metrics.pairwise.cosine_similarity

- **Text Embeddings with HuggingFace**: Used transformer models to convert text into semantic vectors
  - Loaded sentence-transformers/all-MiniLM-L6-v2 for 384-dimensional embeddings
  - Computed similarity matrices to compare multiple texts
  - Visualized results with heatmaps for interpretability

- **Practical Applications**:
  - **Semantic Search**: Finding relevant documents based on meaning, not just keywords
  - **Content Moderation**: Identifying similar toxic/hate speech patterns
  - **Clustering**: Grouping similar texts together
  - **Recommendation**: Finding similar content based on text descriptions

### 📈 Best Practices Learned

- **Scale-Invariance**: Cosine similarity suits text because it compares direction and ignores vector magnitude, so raw frequency or length does not dominate the score
- **Efficiency**: Use sklearn.metrics.pairwise.cosine_similarity for optimized computations
- **Normalization**: For normalized vectors, cosine similarity equals dot product
- **Domain-Specific Models**: Use specialized models (e.g., hate speech detectors) for better embeddings
- **Visualization**: Heatmaps help understand patterns in similarity matrices
- **Reproducibility**: Always use seed=16 for consistent results (repository standard)

### 🚀 Next Steps

- **Advanced Similarity**: Explore other similarity metrics (Euclidean, Manhattan, Jaccard)
- **Large-Scale Search**: Learn FAISS for efficient similarity search on millions of vectors
- **Fine-tuning**: Customize embeddings for your specific domain or task
- **Hate Speech Detection**: Apply these concepts to hate speech classification models
- **Documentation**: Check out [FAISS.md](../../docs/FAISS.md) for efficient similarity search

### 📚 Related Resources

- **HF Transformer Trove**: [Feature Extraction Notebook](../basic1.2/06-feature-extraction.ipynb)
- **Documentation**: [FAISS for Similarity Search](../../docs/FAISS.md)
- **HuggingFace**: [Sentence Transformers Documentation](https://www.sbert.net/)
- **Research**: [Sentence-BERT Paper](https://arxiv.org/abs/1908.10084)

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*