# Week 1: Embeddings and Similarity Metrics

**Scope:** Convert text to vector embeddings and compare them using three similarity metrics.

**Models:** `all-MiniLM-L6-v2` (384d), `all-mpnet-base-v2` (768d)

**Metrics:** Dot Product, Cosine Similarity, Euclidean Distance

In [41]:
# Cell 1: IMPORTS ONLY

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

## How SentenceTransformer Models Work

### Architecture Overview

**SentenceTransformer** models convert text into dense vector embeddings:

1. **Tokenization**: Text → tokens (words/subwords)
2. **Encoder**: Transformer (BERT/MPNet) processes tokens
3. **Pooling**: Token embeddings → sentence embedding
4. **Normalization**: L2-normalization to unit length (magnitude = 1.0), which both models used here apply
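
Steps 3 and 4 can be sketched with plain NumPy. This is a minimal illustration on made-up token vectors, assuming mean pooling over non-padding tokens (a common choice; the notebook does not state which pooling the two models use). The function name `mean_pool_and_normalize` and the toy data are hypothetical.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the non-padding token vectors, then scale the result to unit length."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # shape (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # sum over real tokens only
    count = max(float(mask.sum()), 1.0)                            # avoid division by zero
    sentence_vec = summed / count                                  # pooling step
    return sentence_vec / np.linalg.norm(sentence_vec)             # normalization step


# Toy input (assumed values): 4 token positions, 3-dimensional vectors
toy_tokens = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
    [2.0, 0.0, 1.0],
    [9.0, 9.0, 9.0],  # padding position, ignored through the mask
])
toy_mask = np.array([1, 1, 1, 0])

sentence_embedding = mean_pool_and_normalize(toy_tokens, toy_mask)
print(sentence_embedding, np.linalg.norm(sentence_embedding))
```

The padding row has no effect on the result, and the printed norm is 1.0 up to floating-point rounding.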

### Model Comparison

| Model | Architecture | Dimensions | Speed | Quality |
|-------|-------------|------------|-------|---------|
| **all-MiniLM-L6-v2** | MiniLM (BERT-style, 6 layers) | 384 | Fast | Good |
| **all-mpnet-base-v2** | MPNet | 768 | Slower | Better |

### Why Normalization?

- **L2-normalized embeddings** have magnitude = 1.0
- Makes **dot product = cosine similarity** (faster computation)
- Focuses on **direction** (semantic meaning) rather than magnitude

In [42]:
# Cell 2: FUNCTIONS ONLY

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product: sum of element-wise products."""
    return float(np.sum(a * b))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product normalized by vector magnitudes."""
    dot = np.sum(a * b)
    norm_a = np.sqrt(np.sum(a * a))
    norm_b = np.sqrt(np.sum(b * b))
    return float(dot / (norm_a * norm_b))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance: L2 norm of difference vector."""
    diff = a - b
    return float(np.sqrt(np.sum(diff * diff)))
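
Before applying the three metrics to real embeddings, it can help to check them on vectors small enough to verify by hand. The sketch below repeats compact equivalents of the functions (using `np.dot` and `np.linalg.norm`) so that it runs on its own; the test vectors are assumptions chosen for easy arithmetic.

```python
import numpy as np


def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))


a = np.array([3.0, 4.0])  # length 5
b = np.array([4.0, 3.0])  # length 5

dot_ab = dot_product(a, b)         # 3*4 + 4*3 = 24
cos_ab = cosine_similarity(a, b)   # 24 / (5 * 5) = 0.96
euc_ab = euclidean_distance(a, b)  # length of (-1, 1) = sqrt(2)

print(dot_ab, cos_ab, euc_ab)
```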

## Step-by-Step Process

Following the learning objectives:

1. **Create an embedding of a string** - Convert text to vector representation
2. **Compare with another string** - Compute similarity metrics between embeddings  
3. **Use different model or different metric** - Compare MiniLM vs MPNet, try different metrics
4. **And compare again** - Analyze results and understand differences

---
## Experiment 1: Load Models and Inspect Embedding Dimensions

In [43]:
# Load two SentenceTransformer models
model_minilm = SentenceTransformer("all-MiniLM-L6-v2")
model_mpnet = SentenceTransformer("all-mpnet-base-v2")

# Inspect embedding dimensionality
test_emb_minilm = model_minilm.encode("test", convert_to_numpy=True)
test_emb_mpnet = model_mpnet.encode("test", convert_to_numpy=True)

print(f"all-MiniLM-L6-v2 → shape: {test_emb_minilm.shape}, dimension: {test_emb_minilm.shape[0]}")
print(f"all-mpnet-base-v2 → shape: {test_emb_mpnet.shape}, dimension: {test_emb_mpnet.shape[0]}")

all-MiniLM-L6-v2 → shape: (384,), dimension: 384
all-mpnet-base-v2 → shape: (768,), dimension: 768


---
## Experiment 2: Define Test Sentences

Five sentences with intentional semantic relationships:
- **Similar:** sentences 0 and 1 (both about RAG)
- **Dissimilar:** sentences 2 and 3 (unrelated topics)
- **Ambiguous:** sentence 4 (partial overlap with sentence 0)

In [44]:
sentences = [
    "Retrieval Augmented Generation improves LLM accuracy.",  
    "RAG combines retrieval with language model generation.",  
    "The weather forecast predicts rain tomorrow.",            
    "Cooking pasta requires boiling water first.",             
    "Augmented generation uses external knowledge sources.",   
]

for i, s in enumerate(sentences):
    print(f"[{i}] {s}")

[0] Retrieval Augmented Generation improves LLM accuracy.
[1] RAG combines retrieval with language model generation.
[2] The weather forecast predicts rain tomorrow.
[3] Cooking pasta requires boiling water first.
[4] Augmented generation uses external knowledge sources.


---
## Experiment 3: Generate and Inspect Embeddings

In [45]:
# Generate embeddings for all sentences with both models
embeddings_minilm = model_minilm.encode(sentences, convert_to_numpy=True)
embeddings_mpnet = model_mpnet.encode(sentences, convert_to_numpy=True)

print(f"MiniLM embeddings shape: {embeddings_minilm.shape}")
print(f"MPNet embeddings shape: {embeddings_mpnet.shape}")

# Inspect the first 20 values of the first two embedding vectors
print(f"\nMiniLM sentence[0] first 20 values: {embeddings_minilm[0][:20]}")
print(f"MPNet sentence[0] first 20 values: {embeddings_mpnet[0][:20]}")

print(f"\nMiniLM sentence[1] first 20 values: {embeddings_minilm[1][:20]}")
print(f"MPNet sentence[1] first 20 values: {embeddings_mpnet[1][:20]}")


MiniLM embeddings shape: (5, 384)
MPNet embeddings shape: (5, 768)

MiniLM sentence[0] first 20 values: [-0.0220228  -0.04396626  0.00447883  0.00055714 -0.03906972  0.02633511
 -0.03920497 -0.02817526 -0.0424859  -0.0353056   0.02738176  0.03074416
  0.06076185 -0.0518314  -0.01392772  0.03338559  0.05736943  0.08991289
 -0.00394042 -0.07039431]
MPNet sentence[0] first 20 values: [ 0.00166424  0.02360758 -0.01334266  0.0119692  -0.00710444 -0.0289734
  0.00820648  0.00555653  0.00723216 -0.03528774 -0.02144891 -0.04206651
 -0.02553579 -0.02424748  0.04520509  0.01470743  0.0775855   0.0202875
 -0.02456869  0.01769613]

MiniLM sentence[1] first 20 values: [-0.05334243  0.00511108  0.03446425  0.0302682  -0.06955914  0.10199971
 -0.0046838   0.02422921  0.02585307 -0.05582795  0.03232386 -0.02854197
  0.09216493  0.01982816 -0.00994937  0.02559566  0.01926134  0.09612161
 -0.03321575 -0.08420756]
MPNet sentence[1] first 20 values: [ 0.03124009  0.06057323 -0.00820602  0.01686119 -0.0040

In [46]:
print(f"\nMiniLM sentence[2] first 20 values: {embeddings_minilm[2][:20]}")
print(f"MPNet sentence[2] first 20 values: {embeddings_mpnet[2][:20]}")

print(f"\nMiniLM sentence[3] first 20 values: {embeddings_minilm[3][:20]}")
print(f"MPNet sentence[3] first 20 values: {embeddings_mpnet[3][:20]}")

print(f"\nMiniLM sentence[4] first 20 values: {embeddings_minilm[4][:20]}")
print(f"MPNet sentence[4] first 20 values: {embeddings_mpnet[4][:20]}")


MiniLM sentence[2] first 20 values: [-0.03793139 -0.0067027   0.13290353  0.08947232  0.02930467 -0.04770944
  0.02003481  0.00341998 -0.03323781  0.03670828 -0.07299682 -0.05635128
 -0.01865374  0.02744901 -0.07360628 -0.01449102 -0.082166   -0.0573015
 -0.04159637 -0.00483347]
MPNet sentence[2] first 20 values: [-0.02296168 -0.01276098 -0.02665962 -0.00522578  0.03129588 -0.04879913
 -0.00157662 -0.04130371  0.03052933  0.04309495 -0.02345158  0.00010839
 -0.01300488  0.03478848  0.0642921  -0.07581157  0.00876294 -0.02359559
 -0.02020937 -0.02968585]

MiniLM sentence[3] first 20 values: [-0.05089941 -0.06859522 -0.03378077  0.07898146 -0.00367225 -0.04948081
  0.00143986 -0.01937545 -0.02264489 -0.0874291  -0.03136565 -0.04872905
 -0.04889179  0.00374963 -0.00215146 -0.05085794  0.01486077 -0.01041566
  0.01173994 -0.0577705 ]
MPNet sentence[3] first 20 values: [-0.05005772 -0.02820258 -0.01830548 -0.01590468  0.02297864  0.04261016
  0.02520946  0.03265034  0.08292221  0.03257069 

---
## Experiment 4: Verify Embedding Normalization


**What normalization means:**
- **L2-normalization** scales each embedding vector so its magnitude (L2 norm) equals 1.0
- Formula: `normalized_vector = vector / ||vector||` where `||vector|| = √(Σvᵢ²)`
- This ensures all embeddings have the same "length" in vector space

**Why it matters:**
1. **Dot product = Cosine similarity**: When vectors are normalized, dot product equals cosine similarity because:
   - `cosine = dot(a,b) / (||a|| × ||b||)`
   - If `||a|| = ||b|| = 1`, then `cosine = dot(a,b)`
   - This makes computation faster (no need to compute norms)

2. **Focus on direction, not magnitude**: Normalized embeddings capture **semantic direction** rather than magnitude, which is what we want for semantic similarity

3. **Efficient similarity search**: Normalized embeddings enable fast similarity search in vector databases using inner product

**What we verify:**
- Check that L2 norm of each embedding equals 1.0 (or very close to it)
- Confirm this holds for all sentences and both models (MiniLM and MPNet)
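
The same check can be written in vectorized form. The helper below is a sketch: `all_unit_norm` is a hypothetical name, and the random matrices merely stand in for real embeddings so that the block runs without downloading a model.

```python
import numpy as np


def all_unit_norm(embeddings: np.ndarray, atol: float = 1e-5) -> bool:
    """Return True if every row of `embeddings` has an L2 norm of 1.0 within `atol`."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))


rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 384))                          # stand-in for unnormalized vectors
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # row-wise L2-normalization

print(all_unit_norm(raw))   # False
print(all_unit_norm(unit))  # True
```

With the notebook's variables, the call would be `all_unit_norm(embeddings_minilm)` and `all_unit_norm(embeddings_mpnet)`.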

In [47]:
# Check L2 norms of embeddings
print("L2 norms (should be ~1.0 if normalized):")
for i in range(len(sentences)):
    norm_minilm = np.sqrt(np.sum(embeddings_minilm[i] ** 2))
    norm_mpnet = np.sqrt(np.sum(embeddings_mpnet[i] ** 2))
    print(f"  Sentence {i}: MiniLM={norm_minilm:.6f}, MPNet={norm_mpnet:.6f}")

L2 norms (should be ~1.0 if normalized):
  Sentence 0: MiniLM=1.000000, MPNet=1.000000
  Sentence 1: MiniLM=1.000000, MPNet=1.000000
  Sentence 2: MiniLM=1.000000, MPNet=1.000000
  Sentence 3: MiniLM=1.000000, MPNet=1.000000
  Sentence 4: MiniLM=1.000000, MPNet=1.000000


---
## Experiment 5: Compute All Pairwise Similarities

**Goal:** Compare select sentence pairs using all three similarity metrics (dot product, cosine similarity, euclidean distance) across both models.

**What we do:**
1. **Define pairs** of sentences with different semantic relationships:
   - **Similar pairs** (0↔1): Both about RAG — should have high similarity
   - **Ambiguous pairs** (0↔4, 1↔4): Partial semantic overlap — intermediate similarity expected
   - **Dissimilar pairs** (0↔2, 2↔3): Unrelated topics — should have low similarity

2. **Compute metrics** for each pair with both models:
   - Dot product (equals cosine similarity when both vectors are normalized)
   - Cosine similarity (semantic direction)
   - Euclidean distance (vector space distance)

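3. **Compare results** between MiniLM (384d) and MPNet (768d) to see how the choice of model affects similarity scores. The two models differ in architecture and size as well as in embedding dimensionality, so differences cannot be attributed to dimensionality alone.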

**Expected observations:**
- Similar sentences should have high cosine (>0.5) and low euclidean distance (<1.0)
- Dissimilar sentences should have low cosine (<0.3) and high euclidean distance (>1.2)
- MPNet may show different similarity scores because it is a larger, different encoder with a 768-dimensional output
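
The experiment compares five hand-picked pairs. If every pair is wanted, the full similarity matrix can be computed in one step. The sketch below uses random unit vectors as stand-ins for real embeddings (an assumption, so the block runs without a model); with the notebook's data, `E` would be `embeddings_minilm` or `embeddings_mpnet`.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 8))                         # 5 stand-in "sentences", 8 dimensions
E = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize each row

# For unit-length rows, the matrix product gives every pairwise cosine similarity.
cosine_matrix = E @ E.T

# Pairwise Euclidean distances via broadcasting: (5, 1, 8) - (1, 5, 8) -> (5, 5, 8)
diffs = E[:, None, :] - E[None, :, :]
euclidean_matrix = np.linalg.norm(diffs, axis=-1)

# Each unordered pair appears once in the upper triangle (k=1 skips the diagonal).
rows, cols = np.triu_indices(len(E), k=1)
for i, j in zip(rows, cols):
    print(f"{i}-{j}: cosine={cosine_matrix[i, j]:+.3f}, euclidean={euclidean_matrix[i, j]:.3f}")
```

Five sentences give ten unordered pairs, so the loop prints ten lines.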

In [48]:
# Define sentence pairs to compare
pairs = [
    (0, 1),  
    (0, 4),  
    (0, 2),  
    (2, 3),  
    (1, 4),  
]

results = []

for model_name, embeddings, dim in [
    ("all-MiniLM-L6-v2", embeddings_minilm, 384),
    ("all-mpnet-base-v2", embeddings_mpnet, 768),
]:
    for i, j in pairs:
        emb_a = embeddings[i]
        emb_b = embeddings[j]
        
        results.append({
            "model_name": model_name,
            "sentence_a": sentences[i][:40] + "...",
            "sentence_b": sentences[j][:40] + "...",
            "dot_similarity": dot_product(emb_a, emb_b),
            "cosine_similarity": cosine_similarity(emb_a, emb_b),
            "euclidean_distance": euclidean_distance(emb_a, emb_b),
            "embedding_dimension": dim,
        })

df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))

       model_name                                  sentence_a                                  sentence_b  dot_similarity  cosine_similarity  euclidean_distance  embedding_dimension
 all-MiniLM-L6-v2 Retrieval Augmented Generation improves ... RAG combines retrieval with language mod...        0.480646           0.480646            1.019170                  384
 all-MiniLM-L6-v2 Retrieval Augmented Generation improves ... Augmented generation uses external knowl...        0.535589           0.535589            0.963754                  384
 all-MiniLM-L6-v2 Retrieval Augmented Generation improves ... The weather forecast predicts rain tomor...        0.048415           0.048415            1.379554                  384
 all-MiniLM-L6-v2 The weather forecast predicts rain tomor... Cooking pasta requires boiling water fir...        0.072848           0.072848            1.361729                  384
 all-MiniLM-L6-v2 RAG combines retrieval with language mod... Augmented generation uses ex

---
## Result Table (Primary Week 1 Artifact)

In [49]:
df_results

| | model_name | sentence_a | sentence_b | dot_similarity | cosine_similarity | euclidean_distance | embedding_dimension |
|---|---|---|---|---|---|---|---|
| 0 | all-MiniLM-L6-v2 | Retrieval Augmented Generation improves ... | RAG combines retrieval with language mod... | 0.480646 | 0.480646 | 1.01917 | 384 |
| 1 | all-MiniLM-L6-v2 | Retrieval Augmented Generation improves ... | Augmented generation uses external knowl... | 0.535589 | 0.535589 | 0.963754 | 384 |
| 2 | all-MiniLM-L6-v2 | Retrieval Augmented Generation improves ... | The weather forecast predicts rain tomor... | 0.048415 | 0.048415 | 1.379554 | 384 |
| 3 | all-MiniLM-L6-v2 | The weather forecast predicts rain tomor... | Cooking pasta requires boiling water fir... | 0.072848 | 0.072848 | 1.361729 | 384 |
| 4 | all-MiniLM-L6-v2 | RAG combines retrieval with language mod... | Augmented generation uses external knowl... | 0.263568 | 0.263568 | 1.213616 | 384 |
| 5 | all-mpnet-base-v2 | Retrieval Augmented Generation improves ... | RAG combines retrieval with language mod... | 0.607821 | 0.607821 | 0.88564 | 768 |
| 6 | all-mpnet-base-v2 | Retrieval Augmented Generation improves ... | Augmented generation uses external knowl... | 0.571282 | 0.571282 | 0.925979 | 768 |
| 7 | all-mpnet-base-v2 | Retrieval Augmented Generation improves ... | The weather forecast predicts rain tomor... | 0.030941 | 0.030941 | 1.392163 | 768 |
| 8 | all-mpnet-base-v2 | The weather forecast predicts rain tomor... | Cooking pasta requires boiling water fir... | 0.081372 | 0.081372 | 1.355454 | 768 |
| 9 | all-mpnet-base-v2 | RAG combines retrieval with language mod... | Augmented generation uses external knowl... | 0.363113 | 0.363113 | 1.128616 | 768 |


---
## Experiment 6: Demonstrate Metric Behavior Differences

**Goal:** Show that dot product ≠ cosine similarity when vectors are NOT normalized, and understand why normalization matters.

**What we do:**
1. **Start with normalized vectors**: Take two embeddings that are already L2-normalized (from Experiment 3)
   - Both have norm = 1.0
   - Dot product = cosine similarity (norms verified in Experiment 4; equal values shown in Experiment 5)

2. **Artificially scale one vector**: Multiply one embedding by 3.0
   - This makes it NOT normalized (norm = 3.0)
   - Creates a scenario where vectors have different magnitudes

3. **Compare metrics**: Compute both dot product and cosine similarity
   - With normalized vectors: dot = cosine
   - With scaled vector: dot ≠ cosine

**Why this matters:**
- **Dot product** depends on BOTH direction AND magnitude
- **Cosine similarity** depends ONLY on direction (normalizes by magnitudes)
- When vectors are normalized, they have the same magnitude, so dot = cosine
- When vectors have different magnitudes, dot product changes but cosine stays the same

**Expected results:**
- Normalized vectors: dot product ≈ cosine similarity (same value)
- Scaled vector: dot product is multiplied by exactly the scale factor (3), cosine similarity is unchanged
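
The practical consequence is that the two metrics can rank candidates differently once magnitudes vary. The sketch below is a toy illustration with made-up 2-D vectors (all names and values are assumptions): one candidate points almost exactly along the query but is short, the other is 45 degrees off but long.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = {
    "short_aligned": np.array([0.9, 0.1]),  # nearly the query's direction, small norm
    "long_off_axis": np.array([3.0, 3.0]),  # 45 degrees away, large norm
}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


dot_scores = {name: float(np.dot(query, vec)) for name, vec in docs.items()}
cos_scores = {name: cosine(query, vec) for name, vec in docs.items()}

best_by_dot = max(dot_scores, key=dot_scores.get)
best_by_cos = max(cos_scores, key=cos_scores.get)

print(best_by_dot)  # long_off_axis
print(best_by_cos)  # short_aligned
```

Dot product rewards the longer vector (3.0 vs. 0.9), while cosine similarity prefers the better-aligned one (about 0.99 vs. 0.71). With L2-normalized embeddings this disagreement cannot occur.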

In [51]:
# Take two embeddings and artificially scale one
emb_a = embeddings_minilm[0].copy()  # normalized
emb_b = embeddings_minilm[1].copy()  # normalized
emb_b_scaled = emb_b * 3.0           # NOT normalized (magnitude = 3)

print("Comparing sentence 0 vs sentence 1 (MiniLM):")
print(f"  emb_b norm (original): {np.sqrt(np.sum(emb_b ** 2)):.4f}")
print(f"  emb_b_scaled norm:     {np.sqrt(np.sum(emb_b_scaled ** 2)):.4f}")
print()
print("With normalized vectors:")
print(f"  Dot Product:       {dot_product(emb_a, emb_b):.6f}")
print(f"  Cosine Similarity: {cosine_similarity(emb_a, emb_b):.6f}")
print()
print("With scaled vector (emb_b × 3):")
print(f"  Dot Product:       {dot_product(emb_a, emb_b_scaled):.6f}  ← changes with magnitude")
print(f"  Cosine Similarity: {cosine_similarity(emb_a, emb_b_scaled):.6f}  ← unchanged (direction only)")

Comparing sentence 0 vs sentence 1 (MiniLM):
  emb_b norm (original): 1.0000
  emb_b_scaled norm:     3.0000

With normalized vectors:
  Dot Product:       0.480646
  Cosine Similarity: 0.480646

With scaled vector (emb_b × 3):
  Dot Product:       1.441938  ← changes with magnitude
  Cosine Similarity: 0.480646  ← unchanged (direction only)


---
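## Experiment 7: Compare Different Embedding Pairs with Different Metrics

Compare multiple sentence pairs (similar, dissimilar, ambiguous) using both cosine and euclidean metrics, this time through each model's built-in `similarity()` method. With `similarity_fn_name = "euclidean"`, the method returns the *negative* Euclidean distance so that larger values always mean "more similar". The Euclidean columns in the output are therefore the Experiment 5 distances with the sign flipped.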

In [52]:


print("COMPARISON: Different Embedding Pairs × Metrics × Models")


pairs_info = [
    (0, 1, "Similar", "Both about RAG"),
    (0, 2, "Dissimilar", "RAG vs Weather"),
    (0, 4, "Ambiguous", "RAG vs Augmented generation"),
    (2, 3, "Dissimilar", "Weather vs Cooking"),
]

model_minilm.similarity_fn_name = "cosine"
model_mpnet.similarity_fn_name = "cosine"

results_extended = []

for i, j, pair_type, description in pairs_info:
    emb_i_minilm = embeddings_minilm[i]
    emb_j_minilm = embeddings_minilm[j]
    emb_i_mpnet = embeddings_mpnet[i]
    emb_j_mpnet = embeddings_mpnet[j]
    
    # Cosine similarity (default)
    cos_minilm = model_minilm.similarity(emb_i_minilm, emb_j_minilm).item()
    cos_mpnet = model_mpnet.similarity(emb_i_mpnet, emb_j_mpnet).item()
    
    # Change to euclidean
    model_minilm.similarity_fn_name = "euclidean"
    model_mpnet.similarity_fn_name = "euclidean"
    
    euc_minilm = model_minilm.similarity(emb_i_minilm, emb_j_minilm).item()
    euc_mpnet = model_mpnet.similarity(emb_i_mpnet, emb_j_mpnet).item()
    
    # Reset back to cosine for next iteration
    model_minilm.similarity_fn_name = "cosine"
    model_mpnet.similarity_fn_name = "cosine"
    
    results_extended.append({
        "Pair": f"{i}↔{j}",
        "Type": pair_type,
        "Description": description,
        "MiniLM Cosine": f"{cos_minilm:.4f}",
        "MiniLM Euclidean": f"{euc_minilm:.4f}",
        "MPNet Cosine": f"{cos_mpnet:.4f}",
        "MPNet Euclidean": f"{euc_mpnet:.4f}",
    })

df_extended = pd.DataFrame(results_extended)
print(df_extended.to_string(index=False))

COMPARISON: Different Embedding Pairs × Metrics × Models
Pair       Type                 Description MiniLM Cosine MiniLM Euclidean MPNet Cosine MPNet Euclidean
 0↔1    Similar              Both about RAG        0.4806          -1.0192       0.6078         -0.8856
 0↔2 Dissimilar              RAG vs Weather        0.0484          -1.3796       0.0309         -1.3922
 0↔4  Ambiguous RAG vs Augmented generation        0.5356          -0.9638       0.5713         -0.9260
 2↔3 Dissimilar          Weather vs Cooking        0.0728          -1.3617       0.0814         -1.3555


---
## Final Summary: All Similarity Metrics Comparison

### Overview of All Three Metrics

| Metric | Range | Interpretation | Relation to Cosine Similarity |
|--------|-------|----------------|-------------------------------|
| **Dot Product** | (−∞, +∞) | Unbounded; depends on vector magnitudes | Equal when both vectors are L2-normalized |
| **Cosine Similarity** | [−1, +1] | 1 = identical direction, 0 = orthogonal, −1 = opposite | n/a (reference metric) |
| **Euclidean Distance** | [0, +∞) | 0 = identical vectors, larger = more dissimilar | Different quantity (a distance), but for L2-normalized vectors it is a decreasing function of cosine, so both give the same ranking |

### Key Findings Across All Experiments

#### 1. Normalization Matters (Experiment 4)
- ✅ All embeddings are L2-normalized (norm = 1.0)
- ✅ This makes **dot product = cosine similarity** (faster computation)
- ✅ Normalization focuses on **direction** (semantic meaning) rather than magnitude

#### 2. Metric Behavior (Experiment 5 & 6)
- **Dot Product**: 
  - With normalized vectors: dot = cosine (same value)
  - With non-normalized vectors: dot ≠ cosine (depends on magnitude)
- **Cosine Similarity**: 
  - Always measures direction only (angle between vectors)
  - Unchanged by vector scaling (Experiment 6)
- **Euclidean Distance**: 
  - Measures actual distance in vector space
  - Inverse to similarity (smaller = more similar)
  - For normalized vectors: `euclidean = √(2 - 2×cosine)`
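
The identity follows from expanding the squared distance: `||a - b||² = ||a||² + ||b||² - 2·dot(a,b)`, which reduces to `2 - 2×cosine` when both norms are 1. As a quick check, the sketch below applies it to cosine and euclidean values copied from the result table; the helper name `euclidean_from_cosine` is hypothetical.

```python
import math

# (cosine_similarity, euclidean_distance) pairs copied from the result table
reported = [
    (0.480646, 1.019170),  # MiniLM, pair 0-1
    (0.535589, 0.963754),  # MiniLM, pair 0-4
    (0.048415, 1.379554),  # MiniLM, pair 0-2
    (0.607821, 0.885640),  # MPNet, pair 0-1
    (0.030941, 1.392163),  # MPNet, pair 0-2
]


def euclidean_from_cosine(cosine: float) -> float:
    """Distance between two unit vectors whose cosine similarity is `cosine`."""
    return math.sqrt(2.0 - 2.0 * cosine)


max_gap = max(abs(euclidean_from_cosine(c) - d) for c, d in reported)
print(f"largest gap between predicted and reported distance: {max_gap:.1e}")
```

The gap stays at the level of the table's six-decimal rounding, so for these embeddings the Euclidean column carries no information beyond the cosine column.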

#### 3. Model Comparison (All Experiments)
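- **MiniLM (384d)**: Smaller and faster according to the model comparison table (speed was not measured in these experiments)
- **MPNet (768d)**: Larger and slower; scores the RAG pair 0↔1 higher (0.61 vs. 0.48 cosine)
- Both models agree on the dissimilar pairs (cosine < 0.1) but differ in absolute values and in how they rank 0↔1 against 0↔4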

#### 4. Pair Type Analysis (Experiment 5 & 7 Extended)
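- **Similar pair** (0↔1): 
  - Cosine: 0.48 (MiniLM), 0.61 (MPNet)
  - Euclidean: 1.02 (MiniLM), 0.89 (MPNet)
- **Dissimilar pairs** (0↔2, 2↔3): 
  - Cosine: 0.03-0.08 (low similarity)
  - Euclidean: 1.36-1.39 (high distance)
- **Ambiguous pairs** (0↔4, 1↔4): 
  - 0↔4: cosine 0.54 (MiniLM), 0.57 (MPNet); euclidean 0.96 (MiniLM), 0.93 (MPNet). This is as high as the "similar" pair, and with MiniLM it scores higher than 0↔1
  - 1↔4: cosine 0.26 (MiniLM), 0.36 (MPNet); euclidean 1.21 (MiniLM), 1.13 (MPNet), which is intermediate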
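
To see how the pair types actually order, the cosine values from the result table can be ranked per model. The sketch below copies those values by hand into a plain dictionary (the structure and pair labels such as `"0-1"` are illustrative choices, not part of the notebook).

```python
# Cosine similarities copied from the result table, keyed by sentence pair
cosine_by_model = {
    "all-MiniLM-L6-v2": {
        "0-1": 0.480646, "0-4": 0.535589, "0-2": 0.048415, "2-3": 0.072848, "1-4": 0.263568,
    },
    "all-mpnet-base-v2": {
        "0-1": 0.607821, "0-4": 0.571282, "0-2": 0.030941, "2-3": 0.081372, "1-4": 0.363113,
    },
}

rankings = {}
for model_name, scores in cosine_by_model.items():
    # Sort pair labels from most to least similar
    rankings[model_name] = sorted(scores, key=scores.get, reverse=True)
    print(f"{model_name}: " + " > ".join(rankings[model_name]))
```

MiniLM places the pair 0-4 above 0-1, while MPNet places 0-1 first. Both models put 1-4 in the middle and the two unrelated pairs last.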