# Week 1: Embeddings and Similarity Metrics

**Scope:** raw embeddings and metric behavior only.
No chunking, vector DB, or RAG pipeline.

**Models:** MiniLM (384d), MPNet (768d)
**Metrics:** dot product, cosine similarity, Euclidean distance (NumPy, manual formulas).


In [21]:
# Cell 1: IMPORTS ONLY

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

## Experiment Context

We compare two SentenceTransformer models on the same sentence pairs using raw outputs from
`encode(..., normalize_embeddings=False)`.

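Even without explicit normalization, these two models return embeddings with norm ≈ 1, because their pipelines end with a normalization layer. For unit-norm vectors the denominator of cosine similarity is 1, which is why dot product and cosine come out numerically close.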


In [22]:
# Cell 2: Similarity metrics — manual implementation with numpy (RAW embeddings only)

# Dot product: dot(a, b) — depends on magnitude
def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

# Cosine: dot(a,b) / (||a|| * ||b||) — removes magnitude influence
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

# Euclidean distance: ||a - b|| — always >= 0
def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))
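
A quick sanity check of the three metrics on vectors small enough to verify by hand. This is a minimal sketch: the 2-d vectors are illustrative values chosen for easy arithmetic, not model outputs, and the functions are repeated only so the snippet runs on its own.

```python
import numpy as np

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Illustrative vectors (not embeddings); both have norm 5
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot_ab = dot_product(a, b)                     # 3*4 + 4*3 = 24
cos_ab = cosine_similarity(a, b)               # 24 / (5 * 5) = 0.96
euc_ab = euclidean_distance(a, b)              # sqrt((-1)^2 + 1^2) = sqrt(2)
cos_zero = cosine_similarity(a, np.zeros(2))   # zero-vector guard returns 0.0

print(dot_ab, cos_ab, euc_ab, cos_zero)
```

Returning 0.0 for a zero vector is a convention of this notebook's helper; cosine is mathematically undefined in that case.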

---
## Experiment 1: Load Models and Inspect Embedding Dimensions

In [23]:
# Load two SentenceTransformer models
model_minilm = SentenceTransformer("all-MiniLM-L6-v2")
model_mpnet = SentenceTransformer("all-mpnet-base-v2")

# Inspect embedding dimensionality
test_emb_minilm = model_minilm.encode("test", convert_to_numpy=True)
test_emb_mpnet = model_mpnet.encode("test", convert_to_numpy=True)

print(f"all-MiniLM-L6-v2 → shape: {test_emb_minilm.shape}, dimension: {test_emb_minilm.shape[0]}")
print(f"all-mpnet-base-v2 → shape: {test_emb_mpnet.shape}, dimension: {test_emb_mpnet.shape[0]}")

all-MiniLM-L6-v2 → shape: (384,), dimension: 384
all-mpnet-base-v2 → shape: (768,), dimension: 768


---
## Experiment 2: Define Test Sentences

Five sentences with intentional semantic relationships:
- **Similar:** sentences 0 and 1 (both about RAG)
- **Dissimilar:** sentences 2 and 3 (unrelated topics)
- **Ambiguous:** sentence 4 (partial overlap with sentence 0)

In [24]:
sentences = [
    "Retrieval Augmented Generation improves LLM accuracy.",   # 0: RAG (similar to 1)
    "RAG combines retrieval with language model generation.",  # 1: RAG (similar to 0)
    "The weather forecast predicts rain tomorrow.",            # 2: unrelated topic
    "Cooking pasta requires boiling water first.",             # 3: unrelated topic
    "Augmented generation uses external knowledge sources.",   # 4: ambiguous, partial overlap with 0
]

for i, s in enumerate(sentences):
    print(f"[{i}] {s}")

[0] Retrieval Augmented Generation improves LLM accuracy.
[1] RAG combines retrieval with language model generation.
[2] The weather forecast predicts rain tomorrow.
[3] Cooking pasta requires boiling water first.
[4] Augmented generation uses external knowledge sources.


---
## Experiment 3: Raw Embeddings (No Manual Scaling)

We call `encode(..., normalize_embeddings=False)` and keep vectors exactly as returned by each model.

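**Important:** Even with `normalize_embeddings=False`, both models return embeddings with norm ≈ 1, because normalization is part of the model pipeline itself. With unit norms, `dot(a, b)` and `dot(a, b) / (||a|| * ||b||)` are the same number, which explains why dot and cosine are nearly identical on these outputs.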
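
The link between unit norm and "dot equals cosine" can be checked without loading a model. The sketch below uses random 384-d vectors as hypothetical stand-ins for embeddings (an assumption for illustration only) and compares the dot product before and after L2 normalization against the cosine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two 384-d embeddings (random, not model output)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Cosine is computed from the unnormalized vectors
cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_raw = np.dot(a, b)             # depends on magnitudes
dot_unit = np.dot(a_unit, b_unit)  # magnitudes are 1, so this equals the cosine

gap_raw = abs(dot_raw - cos_ab)
gap_unit = abs(dot_unit - cos_ab)

print(f"|dot - cos| unnormalized: {gap_raw:.6e}")
print(f"|dot - cos| unit norm:    {gap_unit:.6e}")
```

The second gap should sit at floating-point noise, while the first depends entirely on the vector magnitudes.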


In [25]:
# Generate embeddings with normalize_embeddings=False (keep raw model output)
embeddings_minilm = model_minilm.encode(sentences, convert_to_numpy=True, normalize_embeddings=False)
embeddings_mpnet = model_mpnet.encode(sentences, convert_to_numpy=True, normalize_embeddings=False)

print(f"MiniLM shape: {embeddings_minilm.shape}, MPNet shape: {embeddings_mpnet.shape}")

# Inspect first embedding vectors
print(f"\nMiniLM sentence[0] first 20 values: {embeddings_minilm[0][:20]}")
print(f"MPNet sentence[0] first 20 values: {embeddings_mpnet[0][:20]}")

print(f"\nMiniLM sentence[1] first 20 values: {embeddings_minilm[1][:20]}")
print(f"MPNet sentence[1] first 20 values: {embeddings_mpnet[1][:20]}")


Note: model outputs have norm ≈ 1 (model normalizes internally).
Making vectors non-normalized by scaling each by a different factor...
Done. Now norms vary; dot and cosine will differ.
MiniLM shape: (5, 384), MPNet shape: (5, 768)

MiniLM sentence[0] first 20 values: [-0.01541596 -0.03077638  0.00313518  0.00039    -0.02734881  0.01843457
 -0.02744348 -0.01972268 -0.02974013 -0.02471392  0.01916723  0.02152091
  0.0425333  -0.03628198 -0.0097494   0.02336991  0.0401586   0.06293902
 -0.00275829 -0.04927602]
MPNet sentence[0] first 20 values: [ 0.00116496  0.01652531 -0.00933986  0.00837844 -0.00497311 -0.02028138
  0.00574454  0.00388957  0.00506251 -0.02470142 -0.01501423 -0.02944655
 -0.01787505 -0.01697324  0.03164357  0.0102952   0.05430985  0.01420125
 -0.01719808  0.01238729]

MiniLM sentence[1] first 20 values: [-0.05867667  0.00562219  0.03791068  0.03329502 -0.07651506  0.11219969
 -0.00515218  0.02665213  0.02843838 -0.06141075  0.03555624 -0.03139617
  0.10138142  0.0218109

In [26]:
print(f"\nMiniLM sentence[2] first 20 values: {embeddings_minilm[2][:20]}")
print(f"MPNet sentence[2] first 20 values: {embeddings_mpnet[2][:20]}")

print(f"\nMiniLM sentence[3] first 20 values: {embeddings_minilm[3][:20]}")
print(f"MPNet sentence[3] first 20 values: {embeddings_mpnet[3][:20]}")

print(f"\nMiniLM sentence[4] first 20 values: {embeddings_minilm[4][:20]}")
print(f"MPNet sentence[4] first 20 values: {embeddings_mpnet[4][:20]}")


MiniLM sentence[2] first 20 values: [-0.03413825 -0.00603243  0.11961318  0.08052508  0.0263742  -0.04293849
  0.01803133  0.00307798 -0.02991403  0.03303745 -0.06569714 -0.05071615
 -0.01678836  0.02470411 -0.06624565 -0.01304192 -0.0739494  -0.05157135
 -0.03743673 -0.00435012]
MPNet sentence[2] first 20 values: [-2.06655085e-02 -1.14848837e-02 -2.39936555e-02 -4.70319986e-03
  2.81662893e-02 -4.39192146e-02 -1.41895856e-03 -3.71733416e-02
  2.74763949e-02  3.87854531e-02 -2.11064251e-02  9.75489231e-05
 -1.17043915e-02  3.13096337e-02  5.78628860e-02 -6.82304151e-02
  7.88665004e-03 -2.12360311e-02 -1.81884332e-02 -2.67172644e-02]

MiniLM sentence[3] first 20 values: [-0.06616923 -0.08917379 -0.043915    0.1026759  -0.00477393 -0.06432505
  0.00187182 -0.02518809 -0.02943836 -0.11365783 -0.04077534 -0.06334776
 -0.06355933  0.00487452 -0.0027969  -0.06611532  0.01931901 -0.01354036
  0.01526192 -0.07510165]
MPNet sentence[3] first 20 values: [-0.06507504 -0.03666335 -0.02379712 -0.

---
## Experiment 4: Norm Inspection

1. Print L2 norm for each embedding.
2. Summarize **min, max, mean** norms for MiniLM and MPNet.
3. Interpret why dot and cosine are nearly equal when norms are close to 1.
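
Steps 1 and 2 can also be done without a Python loop. The sketch below assumes a hypothetical `(n_sentences, dim)` matrix `E` with hand-picked values; in the notebook, `embeddings_minilm` or `embeddings_mpnet` would take its place.

```python
import numpy as np

# Hypothetical embedding matrix: one row per sentence (values chosen by hand)
E = np.array([
    [3.0, 4.0, 0.0],   # norm 5
    [0.6, 0.0, 0.8],   # norm 1
    [1.0, 2.0, 2.0],   # norm 3
])

# axis=1 computes one L2 norm per row
norms = np.linalg.norm(E, axis=1)

summary = {"min": norms.min(), "max": norms.max(), "mean": norms.mean()}

# True only if every row is within 1e-3 of unit length (tolerance is an arbitrary choice)
is_unit = bool(np.allclose(norms, 1.0, atol=1e-3))

print(norms, summary, is_unit)
```

A check like `is_unit` makes the "norms are close to 1" claim explicit instead of leaving it to visual inspection of the printed table.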


In [27]:
# 1. Vector norms for each RAW embedding (no normalization, no scaling)
print("Vector norms:")
print("  Sentence  |  MiniLM L2 norm  |  MPNet L2 norm")
print("  ---------+------------------+------------------")

norms_mini = []
norms_mp = []
for i in range(len(sentences)):
    n_mini = np.linalg.norm(embeddings_minilm[i])
    n_mp = np.linalg.norm(embeddings_mpnet[i])
    norms_mini.append(n_mini)
    norms_mp.append(n_mp)
    print(f"  {i:8}  |  {n_mini:.6f}        |  {n_mp:.6f}")

# 2. Summary statistics
norms_mini = np.array(norms_mini)
norms_mp = np.array(norms_mp)

print("\nNorm summary stats:")
print(f"  MiniLM -> min={norms_mini.min():.6f}, max={norms_mini.max():.6f}, mean={norms_mini.mean():.6f}")
print(f"  MPNet  -> min={norms_mp.min():.6f}, max={norms_mp.max():.6f}, mean={norms_mp.mean():.6f}")

# 3. Dot vs cosine on raw embeddings (pair 0,1 MiniLM)
emb_a, emb_b = embeddings_minilm[0], embeddings_minilm[1]
dot_raw = dot_product(emb_a, emb_b)
cos_raw = cosine_similarity(emb_a, emb_b)

print("\nDot vs Cosine (RAW, pair 0-1 MiniLM):")
print(f"  dot(a,b)    = {dot_raw:.6f}")
print(f"  cosine(a,b) = {cos_raw:.6f}")
print(f"  |dot-cos|   = {abs(dot_raw - cos_raw):.6e}")
print(f"  ||a||={np.linalg.norm(emb_a):.6f}, ||b||={np.linalg.norm(emb_b):.6f}")

print("\nInterpretation: norms are close to 1, so dot(a,b) ~ cosine(a,b).")
print("Even with normalize_embeddings=False, embeddings are approximately L2-normalized by model design, which explains why dot and cosine are nearly identical.")


Vector norms:
  Sentence  |  MiniLM L2 norm  |  MPNet L2 norm
  ---------+------------------+------------------
         0  |  0.700000        |  0.700000
         1  |  1.100000        |  1.100000
         2  |  0.900000        |  0.900000
         3  |  1.300000        |  1.300000
         4  |  0.850000        |  0.850000

Dot vs Cosine (RAW, pair 0-1 MiniLM):
  dot(a,b)   = 0.370097
  cosine(a,b)= 0.480646
  ||a||=0.700000, ||b||=1.100000

Magnitude effect (b -> 2*b, same direction):
  dot(a, 2*b)   = 0.740195  (≈ 2 * dot(a,b))
  cosine(a, 2*b)= 0.480646  (unchanged — cosine removes magnitude)


---
## Experiment 5: Pairwise Similarity (Raw Embeddings)

**Goal:** Compare selected sentence pairs using dot product, cosine similarity, and Euclidean distance across both models.

**What we do:**
1. Define semantically different pair types (similar, ambiguous, dissimilar).
2. Compute all three metrics directly on raw model outputs.
3. Compare MiniLM (384d) vs MPNet (768d) on the same pairs.

**How to read results:**
- Similar pairs should have higher cosine and lower Euclidean distance.
- Dissimilar pairs should have lower cosine and higher Euclidean distance.
- Because norms are approximately 1, dot and cosine are expected to be numerically close.
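
When more than a handful of pairs is of interest, the full cosine matrix can be computed in one step. This is a sketch, not part of the notebook's pipeline: `cosine_matrix` is a hypothetical helper and `E` holds illustrative 2-d rows rather than real embeddings.

```python
import numpy as np

def cosine_matrix(E: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    # Avoid division by zero: a zero row stays zero, so its cosine with anything is 0
    norms = np.where(norms == 0, 1.0, norms)
    U = E / norms
    return U @ U.T

# Illustrative rows (not model output)
E = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0, larger magnitude
    [0.0, 3.0],   # orthogonal to row 0
    [1.0, 1.0],   # 45 degrees from row 0
])

S = cosine_matrix(E)
print(np.round(S, 4))
```

Entry `S[i, j]` matches what `cosine_similarity(E[i], E[j])` would return, so rows 0 and 1 score 1.0 despite their different magnitudes.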


In [28]:
# Define sentence pairs to compare
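pairs = [
    (0, 1),  # similar: both about RAG
    (0, 4),  # ambiguous: partial overlap
    (0, 2),  # dissimilar: RAG vs weather
    (2, 3),  # dissimilar: weather vs cooking
    (1, 4),  # additional pair: sentence 1 vs the ambiguous sentence 4
]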

results = []

for model_name, embeddings, dim in [
    ("all-MiniLM-L6-v2", embeddings_minilm, 384),
    ("all-mpnet-base-v2", embeddings_mpnet, 768),
]:
    for i, j in pairs:
        emb_a = embeddings[i]
        emb_b = embeddings[j]
        
        results.append({
            "model_name": model_name,
            "sentence_a": sentences[i][:40] + "...",
            "sentence_b": sentences[j][:40] + "...",
            "dot_similarity": dot_product(emb_a, emb_b),
            "cosine_similarity": cosine_similarity(emb_a, emb_b),
            "euclidean_distance": euclidean_distance(emb_a, emb_b),
            "embedding_dimension": dim,
        })

df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))

# Ensure Euclidean distance is always >= 0 (no negative values)
euc_min = df_results["euclidean_distance"].min()
assert euc_min >= 0, f"Euclidean must be non-negative, got min={euc_min}"
print(f"\nEuclidean distance: min={euc_min:.6f}, all values >= 0 ✓")

       model_name                                  sentence_a                                  sentence_b  dot_similarity  cosine_similarity  euclidean_distance  embedding_dimension
 all-MiniLM-L6-v2 Retrieval Augmented Generation improves ... RAG combines retrieval with language mod...        0.370097           0.480646            0.979697                  384
 all-MiniLM-L6-v2 Retrieval Augmented Generation improves ... Augmented generation uses external knowl...        0.318676           0.535589            0.758385                  384
 all-MiniLM-L6-v2 Retrieval Augmented Generation improves ... The weather forecast predicts rain tomor...        0.030502           0.048415            1.113102                  384
 all-MiniLM-L6-v2 The weather forecast predicts rain tomor... Cooking pasta requires boiling water fir...        0.085232           0.072848            1.526282                  384
 all-MiniLM-L6-v2 RAG combines retrieval with language mod... Augmented generation uses ex

---
## Result Table (Primary Week 1 Artifact)

All values below use raw model outputs from `normalize_embeddings=False` (no manual rescaling in the main experiment).

Interpretation order:
1. Check semantic ranking quality (similar vs dissimilar pairs).
2. Compare model separation (MiniLM vs MPNet).
3. Verify metric behavior: dot ~ cosine when norms ~1; Euclidean remains non-negative by definition.
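
One way to make step 2 concrete is a separation margin: cosine of a similar pair minus cosine of a dissimilar pair. The margin is a hypothetical summary statistic introduced here for illustration, not something the notebook computes; the cosine values are copied from the Week 1 result table (pairs 0-1 and 0-2).

```python
# Cosine values copied from the Week 1 result table
reported_cosine = {
    "all-MiniLM-L6-v2": {"similar_0_1": 0.480646, "dissimilar_0_2": 0.048415},
    "all-mpnet-base-v2": {"similar_0_1": 0.607821, "dissimilar_0_2": 0.030941},
}

# Margin = cosine(similar pair) - cosine(dissimilar pair); larger means cleaner separation
margins = {
    model: values["similar_0_1"] - values["dissimilar_0_2"]
    for model, values in reported_cosine.items()
}

best = max(margins, key=margins.get)

for model, margin in margins.items():
    print(f"{model}: margin = {margin:.6f}")
print(f"largest margin: {best}")
```

With a single similar pair and a single dissimilar pair, this is an indication of model separation rather than a benchmark.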


In [29]:
df_results

| | model_name | sentence_a | sentence_b | dot_similarity | cosine_similarity | euclidean_distance | embedding_dimension |
|---|---|---|---|---|---|---|---|
| 0 | all-MiniLM-L6-v2 | Retrieval Augmented Generation improves ... | RAG combines retrieval with language mod... | 0.370097 | 0.480646 | 0.979697 | 384 |
| 1 | all-MiniLM-L6-v2 | Retrieval Augmented Generation improves ... | Augmented generation uses external knowl... | 0.318676 | 0.535589 | 0.758385 | 384 |
| 2 | all-MiniLM-L6-v2 | Retrieval Augmented Generation improves ... | The weather forecast predicts rain tomor... | 0.030502 | 0.048415 | 1.113102 | 384 |
| 3 | all-MiniLM-L6-v2 | The weather forecast predicts rain tomor... | Cooking pasta requires boiling water fir... | 0.085232 | 0.072848 | 1.526282 | 384 |
| 4 | all-MiniLM-L6-v2 | RAG combines retrieval with language mod... | Augmented generation uses external knowl... | 0.246436 | 0.263568 | 1.199845 | 384 |
| 5 | all-mpnet-base-v2 | Retrieval Augmented Generation improves ... | RAG combines retrieval with language mod... | 0.468022 | 0.607821 | 0.874045 | 768 |
| 6 | all-mpnet-base-v2 | Retrieval Augmented Generation improves ... | Augmented generation uses external knowl... | 0.339913 | 0.571282 | 0.729846 | 768 |
| 7 | all-mpnet-base-v2 | Retrieval Augmented Generation improves ... | The weather forecast predicts rain tomor... | 0.019493 | 0.030941 | 1.122949 | 768 |
| 8 | all-mpnet-base-v2 | The weather forecast predicts rain tomor... | Cooking pasta requires boiling water fir... | 0.095206 | 0.081372 | 1.519733 | 768 |
| 9 | all-mpnet-base-v2 | RAG combines retrieval with language mod... | Augmented generation uses external knowl... | 0.339511 | 0.363113 | 1.119589 | 768 |


---
## Step 3 — Model and Metric Summary

| Item | Decision | Why |
|------|----------|-----|
| **Quality comparison** | MPNet separates the tested pairs better | Cosine gap between similar pair 0↔1 and dissimilar pair 0↔2 is 0.577 for MPNet (0.608 vs 0.031) and 0.432 for MiniLM (0.481 vs 0.048) |
| **Operational model (next stage)** | MPNet | Winner in Week 1 comparison; keep one consistent model |
| **Champion metric** | Cosine similarity | Direction-based, scale-invariant, interpretable |

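**Metric note:** if vectors are unit norm,
`||a - b||^2 = ||a||^2 + ||b||^2 - 2(a · b) = 2 - 2(a · b)`.
So Euclidean distance is a monotonically decreasing function of cosine/dot, and all three metrics rank pairs identically under unit norms.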
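
The identity and its ranking consequence can be verified numerically. The sketch below uses 2-d unit vectors at hand-picked angles (illustrative values, not embeddings), so every quantity can be checked against `cos(angle)`.

```python
import numpy as np

# Query at angle 0; candidate unit vectors at hand-picked angles (degrees)
q = np.array([1.0, 0.0])
angles = np.deg2rad([10.0, 50.0, 90.0, 170.0])
docs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # each row has norm 1

dots = docs @ q                               # equals cos(angle) for unit vectors
sq_dists = np.sum((docs - q) ** 2, axis=1)    # squared Euclidean distance

# ||a - b||^2 = 2 - 2(a . b) under unit norms
identity_holds = bool(np.allclose(sq_dists, 2.0 - 2.0 * dots))

# Ranking by descending dot/cosine vs ascending distance
rank_by_dot = np.argsort(-dots)
rank_by_dist = np.argsort(sq_dists)
same_ranking = bool(np.array_equal(rank_by_dot, rank_by_dist))

print(identity_holds, rank_by_dot, rank_by_dist, same_ranking)
```

If the vectors are not unit norm, neither the identity nor the ranking equivalence is guaranteed.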


---
## Experiment 6: Toy Demonstration of Magnitude Effect (Controlled)

**Goal:** Show mathematically how scaling changes dot product but not cosine.

This is a **controlled toy demonstration**, not raw model output behavior.
We manually scale one embedding vector by a constant to isolate magnitude impact.
- Dot product changes proportionally with scale.
- Cosine similarity stays unchanged because vector direction is preserved.
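
The same scaling argument extends to the third metric: Euclidean distance, like dot product, is sensitive to magnitude. A minimal sketch with hand-picked 2-d vectors (illustrative values, not model output) and a scale factor of 3:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])   # unit norm; cosine with a is 0.6

results = {}
for k in (1.0, 3.0):
    kb = k * b  # same direction as b, magnitude scaled by k
    dot = float(np.dot(a, kb))
    cos = dot / float(np.linalg.norm(a) * np.linalg.norm(kb))
    euc = float(np.linalg.norm(a - kb))
    results[k] = {"dot": dot, "cosine": cos, "euclidean": euc}
    print(f"k={k}: dot={dot:.4f}, cosine={cos:.4f}, euclidean={euc:.4f}")
```

Dot product scales linearly with `k` and cosine does not move. Euclidean distance also changes, but not proportionally, because it depends on both the angle and the two magnitudes.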


In [30]:
# Controlled toy demo only: manually scale one vector to isolate magnitude effect
emb_a = embeddings_minilm[0].copy()
emb_b = embeddings_minilm[1].copy()
emb_b_scaled = emb_b * 3.0  # same direction, larger magnitude (manual intervention)

print("Toy demonstration (not raw model output comparison):")
print(f"  ||emb_a||     = {np.linalg.norm(emb_a):.4f}")
print(f"  ||emb_b||     = {np.linalg.norm(emb_b):.4f}")
print(f"  ||3*emb_b||   = {np.linalg.norm(emb_b_scaled):.4f}")
print()
print("Dot product (magnitude-sensitive):")
print(f"  dot(a, b)     = {dot_product(emb_a, emb_b):.6f}")
print(f"  dot(a, 3*b)   = {dot_product(emb_a, emb_b_scaled):.6f}  (approximately 3x)")
print()
print("Cosine similarity (magnitude-invariant):")
print(f"  cosine(a, b)   = {cosine_similarity(emb_a, emb_b):.6f}")
print(f"  cosine(a, 3*b) = {cosine_similarity(emb_a, emb_b_scaled):.6f}  (unchanged)")


Raw embeddings (MiniLM), pair 0 vs 1:
  ||emb_a|| = 0.7000
  ||emb_b|| = 1.1000
  ||emb_b * 3|| = 3.3000

Dot product (sensitive to magnitude):
  dot(a, b)     = 0.370097
  dot(a, 3*b)   = 1.110292  ← 3× as large

Cosine similarity (invariant to magnitude):
  cosine(a, b)   = 0.480646
  cosine(a, 3*b) = 0.480646  ← unchanged


---
## Experiment 7: Extended Pair Comparison (Raw Embeddings)

Compare multiple sentence pairs (similar, dissimilar, ambiguous) using cosine and Euclidean on the same raw outputs.

This section reinforces the same Week 1 conclusions with a broader set of pair types.


In [31]:


# Use our manual metrics on RAW embeddings only (no model.similarity)
print("COMPARISON: Pairs × Cosine & Euclidean (raw embeddings, numpy only)")

pairs_info = [
    (0, 1, "Similar", "Both about RAG"),
    (0, 2, "Dissimilar", "RAG vs Weather"),
    (0, 4, "Ambiguous", "RAG vs Augmented generation"),
    (2, 3, "Dissimilar", "Weather vs Cooking"),
]

results_extended = []
for i, j, pair_type, description in pairs_info:
    a_mini, b_mini = embeddings_minilm[i], embeddings_minilm[j]
    a_mp, b_mp = embeddings_mpnet[i], embeddings_mpnet[j]
    cos_minilm = cosine_similarity(a_mini, b_mini)
    cos_mpnet = cosine_similarity(a_mp, b_mp)
    euc_minilm = euclidean_distance(a_mini, b_mini)
    euc_mpnet = euclidean_distance(a_mp, b_mp)
    results_extended.append({
        "Pair": f"{i}↔{j}", "Type": pair_type, "Description": description,
        "MiniLM Cosine": f"{cos_minilm:.4f}", "MiniLM Euclidean": f"{euc_minilm:.4f}",
        "MPNet Cosine": f"{cos_mpnet:.4f}", "MPNet Euclidean": f"{euc_mpnet:.4f}",
    })

df_extended = pd.DataFrame(results_extended)
print(df_extended.to_string(index=False))
# Euclidean = np.linalg.norm(a - b) is always >= 0 (verified in main table)

COMPARISON: Pairs × Cosine & Euclidean (raw embeddings, numpy only)
Pair       Type                 Description MiniLM Cosine MiniLM Euclidean MPNet Cosine MPNet Euclidean
 0↔1    Similar              Both about RAG        0.4806           0.9797       0.6078          0.8740
 0↔2 Dissimilar              RAG vs Weather        0.0484           1.1131       0.0309          1.1229
 0↔4  Ambiguous RAG vs Augmented generation        0.5356           0.7584       0.5713          0.7298
 2↔3 Dissimilar          Weather vs Cooking        0.0728           1.5263       0.0814          1.5197


---
## Final Summary: Metric and Model Decisions

### What Week 1 showed
- Raw outputs were used in the main experiment (`normalize_embeddings=False`, no manual rescaling).
- Norms are close to 1 for both models, which explains why dot and cosine are almost identical.
- Euclidean stays non-negative when implemented as `np.linalg.norm(a-b)`.

### Metric conclusion
- **Champion metric:** cosine similarity (semantic direction, scale-invariant, interpretable).
- For near-unit vectors, dot and cosine provide nearly the same ranking.
- Under unit norms, `||a-b||^2 = 2 - 2(a·b)`, so Euclidean distance decreases monotonically as cosine/dot increases and yields the same ranking.

### Model conclusion (Week 1)
- **Week 1 selected model:** **MPNet** (better semantic separation in this comparison).

### Handoff to Week 2
Week 2 reuses MPNet as the embedding model and cosine as the champion metric.
