# Session 1.5A: Embeddings Deep Dive - The Full Experience

**Requires:** `pip install sentence-transformers` (~500MB, includes torch)  
**Model:** `all-MiniLM-L6-v2`: 22M parameters, 384 dimensions, runs on CPU  
**Focus:** Experience CONTEXTUAL embeddings: the model reads the whole sentence

---

## What You Will Experience

```
Part 1 → Install and load a pretrained model: understand what's inside
Part 2 → Word-level embeddings: how words encode meaning
Part 3 → Sentence-level: the model reads full context, not just words
Part 4 → Context-awareness: 'bank' gets a DIFFERENT vector per sentence
Part 5 → Bias: pretrained models carry bias from the internet
Part 6 → Visualize: 2D PCA map of your banking sentences
Part 7 → Dimensions: peer inside the 384 numbers
Part 8 → Compare with Notebook B: what contextual models solve
```

## How this differs from Notebook B (gensim)

| | Notebook B (gensim) | Notebook A (this notebook) |
|--|--|--|
| Model | Word2Vec (you train it) | all-MiniLM-L6-v2 (pretrained) |
| Training | You see it happen | Already trained on 1 billion+ sentence pairs |
| Granularity | One vector per WORD | One vector per SENTENCE |
| Context | Static: 'bank' = one vector | Contextual: 'bank' changes by sentence |
| Dimensions | 50 | 384 |
| Size | ~5MB | ~500MB |

---
## Setup

In [None]:
# Installs sentence-transformers and its dependencies, including torch (~500MB on first run)
# The model weights (~90MB) download when the model is first loaded; later runs use the local cache
!pip install -q sentence-transformers

In [None]:
import math
import random
import time
from sentence_transformers import SentenceTransformer

print("Loading model: all-MiniLM-L6-v2")
print("First run downloads ~90MB of model weights...")
t0 = time.time()

model = SentenceTransformer('all-MiniLM-L6-v2')

elapsed = time.time() - t0
print(f"✅ Model ready in {elapsed:.1f}s")
print(f"   Architecture: 6-layer MiniLM transformer")
print(f"   Parameters:   22 million")
print(f"   Output dims:  384 per sentence")
print(f"   Trained on:   1 billion+ sentence pairs from the internet")
print()
print("📌 You did NOT train this model: it already knows English.")
print("   It was trained to make similar sentences have similar vectors.")
print("   Your Notebook B model only knew 46 banking sentences you wrote.")

---
## Part 1: What Is Inside the Model?

Before using it, understand what `all-MiniLM-L6-v2` actually is.
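
As a rough mental model of what sits inside: the published `all-MiniLM-L6-v2` pipeline runs a transformer that emits one vector per token, mean-pools those token vectors into a single sentence vector, and L2-normalizes the result. The sketch below imitates only the last two steps in plain Python. The token vectors are made-up 4-dimensional values (the real ones have 384 dimensions), and `mean_pool` and `l2_normalize` are illustrative helper names, not library functions.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dims)]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# Hypothetical per-token outputs for a 3-token sentence (toy values, 4 dims).
toy_token_vectors = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
]

pooled = mean_pool(toy_token_vectors)
sentence_vec = l2_normalize(pooled)
unit_length = math.sqrt(sum(x * x for x in sentence_vec))

print("Pooled:    ", pooled)
print("Normalized:", [round(x, 4) for x in sentence_vec])
print("Length:    ", round(unit_length, 4))
```

Printing `model` in the next cell should list the corresponding stages, so you can compare the real pipeline against this toy version.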

In [None]:
# Inspect the model architecture
print("=== Model Architecture ===")
print(model)
print()

# Encode one sentence and inspect the raw output
sample = "AML compliance team monitors suspicious transactions."
vec = model.encode(sample)

print(f"=== Output for: '{sample}' ===")
print(f"  Type:       {type(vec)}")
print(f"  Shape:      {vec.shape}")
print(f"  Dimensions: {len(vec)}")
print(f"  Range:      [{vec.min():.4f}, {vec.max():.4f}]")
print(f"  First 10:   {vec[:10].round(4)}")
print()
print("📌 384 numbers represent the MEANING of the entire sentence.")
print("   Unlike Word2Vec, this is computed for the whole sentence at once.")
print("   The model read every word in context before producing this vector.")

In [None]:
# What did training look like? (conceptual: we can't re-run it)
print("=== How all-MiniLM-L6-v2 Was Trained (conceptual) ===")
print("""
Training objective: Contrastive learning on sentence pairs

Given pairs like:
  SIMILAR:  ('AML detects money laundering', 'Anti-money laundering compliance')
  DIFFERENT: ('AML detects money laundering', 'The river bank flooded')

The model adjusts weights so that:
  encode(similar_A) ≈ encode(similar_B)   → high cosine similarity
  encode(different_A) ≠ encode(different_B) → low cosine similarity

Dataset: 1 billion+ pairs from NLI datasets, Wikipedia, Reddit, etc.
Result:  A model that understands paraphrase, topic, and intent.

Compare with your Notebook B model:
  Word2Vec: trained on 46 sentences you wrote
  MiniLM:   trained on 1 billion sentence pairs
  → MiniLM generalizes to any English sentence, even ones it never saw.
""")

---
## Part 2: Word-Level - Does It Understand Banking Terms?
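
Every comparison in this part relies on cosine similarity, which measures the angle between two vectors and ignores their length. A minimal sketch with hand-picked 2-D vectors (toy values, not model output) shows the three reference points on the scale. `cosine` is a local helper written for this illustration.

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two vectors: dot / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = [3.0, 4.0]
b = [6.0, 8.0]     # same direction as a, twice as long
c = [-4.0, 3.0]    # perpendicular to a
d = [-3.0, -4.0]   # opposite direction to a

same_direction = cosine(a, b)
perpendicular = cosine(a, c)
opposite = cosine(a, d)

print(f"same direction: {same_direction:+.3f}")
print(f"perpendicular:  {perpendicular:+.3f}")
print(f"opposite:       {opposite:+.3f}")
```

Because length is ignored, a long vector and a short one pointing the same way count as identical in meaning under this metric.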

In [None]:
# sentence-transformers encodes sentences, but we can probe single words too
# Each word is treated as a one-word sentence

def cosine_sim(v1, v2):
    """Cosine similarity between two numpy vectors."""
    dot   = float(sum(a * b for a, b in zip(v1, v2)))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

BANKING_WORDS = ["AML", "KYC", "BSA", "compliance",
                 "fraud", "suspicious", "chargeback",
                 "mortgage", "savings", "overdraft",
                 "capital", "risk", "credit",
                 "wire", "SWIFT", "payment"]

# Encode all words
word_vecs = {w: model.encode(w) for w in BANKING_WORDS}

# Similarity pairs
pairs = [
    ("AML",       "KYC",        "Both compliance programs"),
    ("AML",       "BSA",        "Both regulatory requirements"),
    ("fraud",     "suspicious", "Related: fraud triggers suspicion"),
    ("mortgage",  "credit",     "Both lending concepts"),
    ("wire",      "SWIFT",      "Wire uses SWIFT network"),
    ("AML",       "mortgage",   "Different domains"),
    ("fraud",     "capital",    "Very different domains"),
]

print("=== Word-Level Similarity (pretrained model) ===")
print(f"{'Word A':<14} {'Word B':<14} {'Similarity':<12} Relationship")
print("-" * 72)
for w1, w2, reason in pairs:
    sim = cosine_sim(word_vecs[w1], word_vecs[w2])
    verdict = "CLOSE" if sim > 0.7 else "RELATED" if sim > 0.4 else "DISTANT"
    print(f"{w1:<14} {w2:<14} {sim:<12.3f} {verdict} - {reason}")

print()
print("📌 The model was NOT trained on your banking corpus.")
print("   If it scores AML ≈ KYC, that is because its 1B+ training pairs")
print("   include compliance literature, regulations, and news articles.")

In [None]:
# Find most similar words with a manual nearest-neighbor search
def most_similar_words(query_word, word_vecs, top_n=5):
    """Find top_n most similar words by cosine similarity."""
    qv = word_vecs[query_word]
    scores = [(w, cosine_sim(qv, v)) for w, v in word_vecs.items() if w != query_word]
    scores.sort(key=lambda x: -x[1])
    return scores[:top_n]

print("=== Most Similar Words (from our banking vocabulary) ===")
for probe in ["AML", "fraud", "mortgage", "wire"]:
    print(f"\n'{probe}' → most similar:")
    for w, s in most_similar_words(probe, word_vecs):
        bar = "█" * int(s * 20)
        print(f"  {w:<16} {s:.3f}  {bar}")

---
## Part 3: Sentence-Level - The Model Reads Full Context

Unlike Notebook B where you averaged word vectors, `sentence-transformers`  
reads the **entire sentence at once** through the transformer layers.  
Word order, grammar, and sentence structure all influence the output vector.
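
To see why averaging word vectors cannot capture word order, here is a minimal sketch of the Notebook B approach in plain Python. The word vectors are invented 3-dimensional values rather than output from a trained Word2Vec model, and `average_vector` is a hypothetical helper name.

```python
import math

# Hypothetical static word vectors (toy 3-dim values, not from a trained model).
TOY_WORD_VECTORS = {
    "the":      [0.25, 0.0, 0.25],
    "bank":     [1.0, 0.25, 0.0],
    "customer": [0.25, 1.0, 0.5],
    "sued":     [0.5, 0.5, 1.0],
}

def average_vector(sentence, table):
    """Bag-of-words sentence vector: the mean of the static word vectors."""
    words = sentence.lower().split()
    dims = len(next(iter(table.values())))
    return [sum(table[w][d] for w in words) / len(words) for d in range(dims)]

forward = average_vector("the bank sued the customer", TOY_WORD_VECTORS)
reverse = average_vector("the customer sued the bank", TOY_WORD_VECTORS)

# Addition is order-independent, so the two means match (up to float rounding).
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(forward, reverse))

print("bank sued customer:", [round(x, 3) for x in forward])
print("customer sued bank:", [round(x, 3) for x in reverse])
print("Identical after averaging:", same)
```

The two sentences mean opposite things, yet the averaged vectors coincide. The word-order test later in this part checks whether the transformer avoids that collapse.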

In [None]:
# Sentence similarity: same topics should score high
BANKING_SENTENCES = [
    # Compliance
    "The AML team monitors suspicious transactions for money laundering.",
    "Compliance analysts review flagged activity reports for BSA violations.",
    "KYC onboarding requires customers to submit identity documents.",
    # Fraud
    "Fraud detection models flag anomalous card transaction patterns.",
    "Unauthorized account access triggered an immediate fraud investigation.",
    "The chargeback process was initiated after the disputed transaction.",
    # Retail
    "Mortgage loan approval depends on the applicant's credit score and income.",
    "The savings account earns interest on deposited customer funds.",
    "Overdraft fees are charged when the account balance goes negative.",
    # Capital
    "Basel III requires banks to maintain minimum capital adequacy ratios.",
    "Credit risk assessment evaluates the borrower's probability of default.",
    "Liquidity risk management ensures the bank can meet its obligations.",
    # Payments
    "Wire transfers above ten thousand dollars require regulatory reporting.",
    "The SWIFT network routes international payments between correspondent banks.",
    "Payment gateway authorization occurs in real time at point of sale.",
]

SENTENCE_LABELS = [
    "Compliance", "Compliance", "Compliance",
    "Fraud",      "Fraud",      "Fraud",
    "Retail",     "Retail",     "Retail",
    "Capital",    "Capital",    "Capital",
    "Payments",   "Payments",   "Payments",
]

# Encode all sentences in one batch (faster)
sentence_vecs = model.encode(BANKING_SENTENCES)
print(f"✅ Encoded {len(BANKING_SENTENCES)} sentences")
print(f"   Output shape: {sentence_vecs.shape}  (sentences × dimensions)")

In [None]:
# Sentence pair similarity: same cluster should score high
sentence_pairs = [
    (0, 1,  "Compliance A vs Compliance B: same topic"),
    (0, 2,  "Compliance A vs KYC: both compliance"),
    (3, 4,  "Fraud A vs Fraud B: same topic"),
    (6, 7,  "Retail A vs Retail B: same topic"),
    (0, 6,  "Compliance vs Retail: different topics"),
    (3, 9,  "Fraud vs Capital: very different"),
    (12, 13, "Payments A vs Payments B: same topic"),
]

print("=== Sentence-Level Similarity ===")
print(f"{'Sentence A (truncated)':<42} {'Sentence B (truncated)':<42} {'Sim':>6}  Label")
print("-" * 100)
for i, j, label in sentence_pairs:
    sim = cosine_sim(sentence_vecs[i], sentence_vecs[j])
    a = BANKING_SENTENCES[i][:40]
    b = BANKING_SENTENCES[j][:40]
    verdict = "✓ HIGH" if sim > 0.6 else "~ MED" if sim > 0.35 else "✗ LOW"
    print(f"{a:<42} {b:<42} {sim:>6.3f}  {verdict} - {label}")

In [None]:
# Word order test: sentence-transformers DOES consider order
# Notebook B's averaging gave IDENTICAL vectors for same words in any order

print("=== Word Order Test: Does Order Matter? ===")
print("(Notebook B: same words → identical vectors. Let's see if this model differs.)")
print()

order_pairs = [
    (
        "The bank approved the loan application.",
        "The loan application approved the bank.",
        "Grammatically reversed"
    ),
    (
        "The customer reported fraud to the bank.",
        "The bank reported fraud to the customer.",
        "Opposite meaning, same words"
    ),
    (
        "AML controls prevent money laundering.",
        "Money laundering prevents AML controls.",
        "Reversed subject/object"
    ),
]

for s1, s2, label in order_pairs:
    v1 = model.encode(s1)
    v2 = model.encode(s2)
    sim = cosine_sim(v1, v2)
    print(f"A: '{s1}'")
    print(f"B: '{s2}'")
    print(f"Similarity: {sim:.3f} - {label}")
    if sim < 0.95:
        print("✓ Vectors DIFFER: model respects word order!")
    else:
        print("✗ Vectors nearly identical: order had little effect here.")
    print()

print("📌 Compare: in Notebook B, reversed sentences had IDENTICAL vectors.")
print("   sentence-transformers encodes word order, grammar, and sentence structure.")

---
## Part 4: Context-Awareness - The Key Advantage

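In Notebook B, `'bank'` had **one vector** regardless of meaning.  
Here, `'bank'` gets a **different vector** in each sentence because  
the transformer reads the surrounding words before producing any vector.  
The cells in this part observe that indirectly: they compare whole-sentence vectors, which are pooled from the per-token vectors.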
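
The difference between a static lookup and a context-dependent vector can be sketched without any model. In the toy below, the "contextual" function blends a word's vector with the mean of its sentence. This is only a crude stand-in for attention (a real transformer uses learned weights across several layers), and every vector and function name here is invented for illustration.

```python
# Hypothetical static vectors (toy 2-dim values): axis 0 ~ finance, axis 1 ~ nature.
STATIC = {
    "bank":  [0.5, 0.5],
    "loan":  [1.0, 0.0],
    "river": [0.0, 1.0],
}

def static_vector(word, sentence_words):
    """Static lookup: the surrounding sentence is ignored entirely."""
    return STATIC[word]

def contextual_vector(word, sentence_words, mix=0.5):
    """Crude stand-in for attention: blend the word with the sentence mean."""
    dims = len(STATIC[word])
    mean = [sum(STATIC[w][d] for w in sentence_words) / len(sentence_words)
            for d in range(dims)]
    return [(1 - mix) * STATIC[word][d] + mix * mean[d] for d in range(dims)]

finance_sentence = ["bank", "loan"]
river_sentence = ["river", "bank"]

static_fin = static_vector("bank", finance_sentence)
static_riv = static_vector("bank", river_sentence)
ctx_fin = contextual_vector("bank", finance_sentence)
ctx_riv = contextual_vector("bank", river_sentence)

print("static     'bank':", static_fin, "vs", static_riv)
print("contextual 'bank':", ctx_fin, "vs", ctx_riv)
```

The static lookup returns the same vector twice, while the blended version leans toward finance in one sentence and toward nature in the other.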

In [None]:
# The 'bank' disambiguation test
# Notebook B: 'bank' = ONE vector, confused average of both meanings
# This notebook: 'bank' = DIFFERENT vector per context

financial_sentences = [
    "The bank approved the mortgage application after reviewing the credit score.",
    "We opened a savings account at the bank downtown.",
    "The bank rejected the loan due to insufficient collateral.",
    "The central bank raised interest rates to control inflation.",
]

river_sentences = [
    "The river bank flooded during the heavy rainstorm last night.",
    "Fishermen sat on the bank waiting for the evening catch.",
    "Erosion weakened the bank of the river near the bridge.",
    "The muddy bank was slippery after three days of rain.",
]

fin_vecs   = model.encode(financial_sentences)
river_vecs = model.encode(river_sentences)

# Within-group vs cross-group similarity
def avg_sim(vecs_a, vecs_b):
    """Average cosine similarity between all pairs across two groups."""
    sims = [cosine_sim(a, b) for a in vecs_a for b in vecs_b]
    return sum(sims) / len(sims)

def within_sim(vecs):
    """Average cosine similarity within a group (excluding self)."""
    n = len(vecs)
    sims = [cosine_sim(vecs[i], vecs[j]) for i in range(n) for j in range(n) if i != j]
    return sum(sims) / len(sims) if sims else 0.0

print("=== 'bank' disambiguation: Financial vs River ===")
print()
print(f"Within financial sentences:  {within_sim(fin_vecs):.3f}")
print(f"Within river sentences:      {within_sim(river_vecs):.3f}")
print(f"Financial vs River (cross):  {avg_sim(fin_vecs, river_vecs):.3f}")
print()
print("📌 Financial sentences cluster together.")
print("   River sentences cluster together.")
print("   The two groups are DISTANT despite sharing the word 'bank'.")
print("   In Notebook B: 'bank' had ONE confused vector mixing both meanings.")
print("   Here: the model reads context → produces meaning-appropriate vectors.")

In [None]:
# Concrete pairwise comparison: financial vs river
print("=== Pairwise: Financial 'bank' vs River 'bank' ===")
print()
print("Comparing financial sentence 1 to all river sentences:")
s_fin = financial_sentences[0]
v_fin = fin_vecs[0]
print(f"  Query: '{s_fin[:70]}...'")
print()
for s_riv, v_riv in zip(river_sentences, river_vecs):
    sim = cosine_sim(v_fin, v_riv)
    print(f"  {sim:.3f}  '{s_riv[:65]}'")

print()
print("Comparing financial sentence 1 to other financial sentences:")
for s2, v2 in zip(financial_sentences[1:], fin_vecs[1:]):
    sim = cosine_sim(v_fin, v2)
    print(f"  {sim:.3f}  '{s2[:65]}'")

print()
print("📌 Financial sentences are much more similar to each other")
print("   than to river sentences, even though all contain 'bank'.")

---
## Part 5: Bias - Pretrained Models Carry Internet-Scale Bias

In Notebook B, you deliberately fed biased sentences.  
Here, bias is already present, baked in during pretraining on web data.  
This is more realistic: production models are pretrained, not trained by you.
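
The measurement used in this part is a simple gap: how much closer a target vector sits to one anchor than to another. A minimal sketch with invented 2-D vectors (not model output) makes the sign convention explicit. `bias_gap` and the `role_*` names are hypothetical.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def bias_gap(target_vec, anchor_a, anchor_b):
    """Positive: target is closer to anchor_a. Negative: closer to anchor_b."""
    return cosine(target_vec, anchor_a) - cosine(target_vec, anchor_b)

# Hypothetical toy vectors standing in for encoded anchors and roles.
anchor_a = [1.0, 0.0]
anchor_b = [0.0, 1.0]
toy_roles = {
    "role_x": [3.0, 4.0],   # leans toward anchor_b
    "role_y": [4.0, 3.0],   # leans toward anchor_a
    "role_z": [1.0, 1.0],   # equidistant from both anchors
}

gaps = {name: bias_gap(vec, anchor_a, anchor_b) for name, vec in toy_roles.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.3f}")
```

With real embeddings the gaps are usually small, so it helps to treat values near zero as inconclusive rather than as evidence of fairness or bias.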

In [None]:
# Test: are gender-neutral job titles closer to one gender?
# Method: compare cosine(role, 'He works as a') vs cosine(role, 'She works as a')

MALE_ANCHOR   = "He works as a"
FEMALE_ANCHOR = "She works as a"

BANKING_ROLES = [
    "compliance analyst",
    "risk manager",
    "loan officer",
    "fraud investigator",
    "branch manager",
    "portfolio manager",
    "customer service representative",
    "credit analyst",
    "treasury analyst",
    "administrative assistant",
]

male_anchor_vec   = model.encode(MALE_ANCHOR)
female_anchor_vec = model.encode(FEMALE_ANCHOR)

print("=== Gender Proximity in Pretrained Model ===")
print(f"Anchor A: '{MALE_ANCHOR}'")
print(f"Anchor B: '{FEMALE_ANCHOR}'")
print()
print(f"{'Role':<35} {'Sim(male anchor)':<20} {'Sim(female anchor)':<20} {'Bias direction'}")
print("-" * 90)

for role in BANKING_ROLES:
    role_vec = model.encode(role)
    sim_male   = cosine_sim(role_vec, male_anchor_vec)
    sim_female = cosine_sim(role_vec, female_anchor_vec)
    diff = sim_male - sim_female
    direction = f"→ male   (+{diff:.3f})" if diff > 0.002 else \
                f"→ female ({diff:.3f})" if diff < -0.002 else \
                "  neutral"
    print(f"{role:<35} {sim_male:<20.4f} {sim_female:<20.4f} {direction}")

print()
print("📌 These biases come from web text used in pretraining.")
print("   The model did not choose them: they reflect historical language patterns.")
print("   Regulators such as the CFPB and EBA increasingly expect bias testing for models used in lending.")

In [None]:
# Second bias test: does the model associate compliance with specific demographics?
SENTENCES_TO_TEST = [
    # Same compliance scenario, only name differs
    "John Smith was flagged by the AML system for suspicious transactions.",
    "Maria Garcia was flagged by the AML system for suspicious transactions.",
    "Wei Zhang was flagged by the AML system for suspicious transactions.",
    "Ahmed Hassan was flagged by the AML system for suspicious transactions.",
]

HIGH_RISK_ANCHOR = "This person is a high-risk customer requiring enhanced due diligence."
LOW_RISK_ANCHOR  = "This person is a low-risk customer with standard verification."

high_risk_vec = model.encode(HIGH_RISK_ANCHOR)
low_risk_vec  = model.encode(LOW_RISK_ANCHOR)
test_vecs     = model.encode(SENTENCES_TO_TEST)

print("=== Name-Based Bias: Same AML Scenario, Different Names ===")
print(f"High-risk anchor: '{HIGH_RISK_ANCHOR}'")
print(f"Low-risk anchor:  '{LOW_RISK_ANCHOR}'")
print()
print(f"{'Name':<15} {'Sim(high-risk)':<18} {'Sim(low-risk)':<18} Relative risk score")
print("-" * 70)

names = ["John Smith", "Maria Garcia", "Wei Zhang", "Ahmed Hassan"]
for name, vec in zip(names, test_vecs):
    sim_high = cosine_sim(vec, high_risk_vec)
    sim_low  = cosine_sim(vec, low_risk_vec)
    risk_score = sim_high - sim_low
    bar = "█" * int(abs(risk_score) * 200)
    direction = "+" if risk_score > 0 else "-"
    print(f"{name:<15} {sim_high:<18.4f} {sim_low:<18.4f} {direction}{abs(risk_score):.4f} {bar}")

print()
print("📌 If these scores differ materially by name, the model encodes name-based bias.")
print("   A fair system should give identical scores: same sentence, same risk.")
print("   This is the algorithmic fairness problem in AML and credit decisioning.")

---
## Part 6: Visualize - 2D Map of Banking Sentence Space

Reduce 384 dimensions to 2 using PCA to see how sentences cluster.
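
PCA looks for the directions along which the data varies most. A small sketch on 2-D toy points, assuming power iteration as the eigenvector method, shows the idea before it is applied to 384 dimensions. The points and the helper name `first_principal_component` are invented for this example.

```python
import math
import random

def first_principal_component(points, iterations=200, seed=0):
    """Estimate the top PCA direction of 2-D points by power iteration."""
    n = len(points)
    mean = [sum(p[d] for p in points) / n for d in range(2)]
    centered = [[p[0] - mean[0], p[1] - mean[1]] for p in points]

    # 2x2 covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(2)]
           for i in range(2)]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1), rng.gauss(0, 1)]
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        v = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.sqrt(v[0] ** 2 + v[1] ** 2)
        v = [v[0] / norm, v[1] / norm]
    return v

# Toy points lying exactly on the line y = 2x, so all variance is along (1, 2).
toy_points = [[-2.0, -4.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
pc1 = first_principal_component(toy_points)
expected = [1 / math.sqrt(5), 2 / math.sqrt(5)]

# The sign of a principal component is arbitrary, so compare absolute values.
print("Estimated |PC1|:", [round(abs(x), 4) for x in pc1])
print("Expected  |PC1|:", [round(x, 4) for x in expected])
```

Projecting each centered point onto this direction gives its first PCA coordinate. A second, orthogonal direction supplies the other axis of a 2-D map.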

In [None]:
# PCA from scratch: pure Python stdlib, no sklearn
import math, random

def pca_2d(matrix):
    """
    Reduce NxD matrix to Nx2 using PCA (power iteration with deflation).
    Pure Python: no numpy, no sklearn.
    """
    n, d = len(matrix), len(matrix[0])

    # Center the data
    means = [sum(matrix[i][j] for i in range(n)) / n for j in range(d)]
    centered = [[matrix[i][j] - means[j] for j in range(d)] for i in range(n)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def mat_vec(M, v):
        return [dot(row, v) for row in M]

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))  # renamed to avoid shadowing the outer n
        return [x / norm for x in v] if norm > 0 else v

    def subtract_proj(v, u):
        p = dot(v, u)
        return [v[i] - p * u[i] for i in range(len(v))]

    # Covariance matrix C = X^T X / n
    C = [[sum(centered[k][i] * centered[k][j] for k in range(n)) / n
          for j in range(d)] for i in range(d)]

    random.seed(42)
    pcs = []
    for _ in range(2):
        v = normalize([random.gauss(0, 1) for _ in range(d)])
        for _ in range(100):
            v = normalize(mat_vec(C, v))
            for pc in pcs:
                v = normalize(subtract_proj(v, pc))
        pcs.append(v)

    coords = [[dot(row, pcs[0]), dot(row, pcs[1])] for row in centered]
    return coords

# Use the 15 banking sentences from Part 3
matrix = [list(map(float, v)) for v in sentence_vecs]
coords = pca_2d(matrix)

print(f"✅ PCA: 384 dimensions → 2 dimensions for {len(matrix)} sentences")

In [None]:
# ASCII scatter plot: cluster labels per sentence
CLUSTER_MARKER = {
    "Compliance": "C",
    "Fraud":      "F",
    "Retail":     "R",
    "Capital":    "K",
    "Payments":   "P",
}

xs = [c[0] for c in coords]
ys = [c[1] for c in coords]
x_min, x_max = min(xs), max(xs)
y_min, y_max = min(ys), max(ys)

W, H = 70, 24
grid = [[" "] * W for _ in range(H)]

def to_grid(x, y):
    col = int((x - x_min) / (x_max - x_min + 1e-9) * (W - 1))
    row = int((1 - (y - y_min) / (y_max - y_min + 1e-9)) * (H - 1))
    return max(0, min(W - 1, col)), max(0, min(H - 1, row))

for label, (x, y) in zip(SENTENCE_LABELS, coords):
    col, row = to_grid(x, y)
    grid[row][col] = CLUSTER_MARKER[label]

print("=== Banking Sentence Space: 2D PCA (ASCII) ===")
print("C=Compliance  F=Fraud  R=Retail  K=Capital  P=Payments")
print()
print("┌" + "─" * W + "┐")
for row in grid:
    print("│" + "".join(row) + "│")
print("└" + "─" * W + "┘")
print()
print("📌 Each letter = one sentence. Same-cluster sentences should group together.")
print("   Unlike Notebook B (word vectors), these are SENTENCE vectors.")
print("   Full meaning encoded, not just word averages.")

In [None]:
# Matplotlib scatter plot if available
try:
    import matplotlib.pyplot as plt
    import matplotlib
    matplotlib.rcParams['figure.figsize'] = (13, 9)

    cluster_colors = {
        "Compliance": "#e74c3c",
        "Fraud":      "#8e44ad",
        "Retail":     "#27ae60",
        "Capital":    "#2980b9",
        "Payments":   "#f39c12",
    }

    fig, ax = plt.subplots()

    for sentence, label, (x, y) in zip(BANKING_SENTENCES, SENTENCE_LABELS, coords):
        color = cluster_colors[label]
        ax.scatter(x, y, color=color, s=150, zorder=2)
        short = sentence[:40] + "..."
        ax.annotate(short, (x, y), textcoords="offset points",
                    xytext=(6, 4), fontsize=7, color=color)

    from matplotlib.patches import Patch
    legend = [Patch(color=c, label=l) for l, c in cluster_colors.items()]
    ax.legend(handles=legend, loc="best", fontsize=10)

    ax.set_title(
        "Banking Sentence Embeddings: 2D PCA\n"
        "(all-MiniLM-L6-v2, 384 dims → 2 dims)",
        fontsize=13
    )
    ax.set_xlabel("Principal Component 1")
    ax.set_ylabel("Principal Component 2")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

except ImportError:
    print("matplotlib not available: the ASCII plot above is the visualization.")

---
## Part 7: Inside the 384 Dimensions

Same exploration as Notebook B, but now with 384 dims and sentence vectors.
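
One way to see why a single dimension rarely maps to a single concept: rotating every vector in a space by the same angle changes every coordinate, yet leaves all cosine similarities intact. The information lives in the relative geometry, not in any one axis. The sketch below uses 2-D toy vectors; `rotate` and `cosine` are illustrative helpers, not part of any library.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def rotate(vec, degrees):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = vec
    return [x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t)]

a = [1.0, 0.0]
b = [1.0, 1.0]
a_rot, b_rot = rotate(a, 30), rotate(b, 30)

before = cosine(a, b)
after = cosine(a_rot, b_rot)

print("a before/after rotation:", a, [round(x, 3) for x in a_rot])
print("b before/after rotation:", b, [round(x, 3) for x in b_rot])
print(f"cosine before: {before:.4f}   cosine after: {after:.4f}")
```

This is why the cells in this part look for patterns across many dimensions instead of assigning a label to any individual one.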

In [None]:
# Show one sentence's vector as a bar chart (every 8th of the 384 dims, for readability)
def show_sentence_dims(sentence, stride=8):
    """Print dimension values for a sentence vector."""
    vec = list(model.encode(sentence))
    vmax = max(abs(v) for v in vec)

    print(f"Sentence: '{sentence}'")
    print(f"  Total dims: {len(vec)}  Range: [{min(vec):.4f}, {max(vec):.4f}]")
    print(f"  Showing every {stride}th dimension:")
    print()
    print(f"  {'Dim':<7} {'Value':>9}  Bar")
    print(f"  {'---':<7} {'-----':>9}  {'---'}")

    for i in range(0, len(vec), stride):
        v = vec[i]
        bar_len = int(abs(v) / (vmax + 1e-9) * 20)
        bar = ("█" * bar_len) if v >= 0 else ("▒" * bar_len)
        sign = "+" if v >= 0 else "-"
        print(f"  dim{i:<4} {v:>9.4f}  {sign} {bar}")

show_sentence_dims("The AML team monitors suspicious transactions for money laundering.")

In [None]:
# ASCII heatmap: 384 dims × 5 sentence clusters
# Group sentences by cluster, take mean vector, render as shade

SHADES = [" ", "·", "░", "▒", "▓", "█"]

# Group sentence vectors by cluster
cluster_vecs = {}
for label, vec in zip(SENTENCE_LABELS, sentence_vecs):
    cluster_vecs.setdefault(label, []).append(list(vec))

# Mean vector per cluster
cluster_means = {}
for label, vecs in cluster_vecs.items():
    n, d = len(vecs), len(vecs[0])
    cluster_means[label] = [sum(v[i] for v in vecs) / n for i in range(d)]

all_vals = [abs(v) for vec in cluster_means.values() for v in vec]
vmax = max(all_vals)

# Render heatmap: show every 4th dim to fit the screen
STRIDE = 4
dim_count = len(next(iter(cluster_means.values())))
shown_dims = list(range(0, dim_count, STRIDE))

print("=== Dimension Heatmap: 384 Dims × 5 Clusters ===")
print(f"(showing every {STRIDE}th dim: {len(shown_dims)} columns)")
print("Shade: ' '=~0  '·'=small  '░▒▓█'=large")
print()

# Column headers every 25 shown columns (every 100th actual dim when STRIDE = 4)
header = f"  {'Cluster':<14}|"
for idx, d in enumerate(shown_dims):
    if idx % 25 == 0:
        header += str(d).ljust(25)
print(header)
print("  " + "-" * (15 + len(shown_dims)))

for cluster_name in ["Compliance", "Fraud", "Retail", "Capital", "Payments"]:
    vec = cluster_means[cluster_name]
    row = f"  {cluster_name:<14}|"
    for d in shown_dims:
        intensity = int(abs(vec[d]) / (vmax + 1e-9) * (len(SHADES) - 1))
        row += SHADES[intensity]
    print(row)

print()
print("📌 384 dims = far more expressive than Notebook B's 50 dims.")
print("   The model distributes meaning across all 384: still no single dim = one concept.")

In [None]:
# Most discriminating dims between two sentence clusters
def most_discriminating_dims(cluster_a, cluster_b, cluster_means, top_n=10):
    """Find dimensions that differ most between two sentence clusters."""
    ma = cluster_means[cluster_a]
    mb = cluster_means[cluster_b]
    diffs = [(i, abs(ma[i] - mb[i]), ma[i], mb[i]) for i in range(len(ma))]
    diffs.sort(key=lambda x: -x[1])

    print(f"Top {top_n} dims separating '{cluster_a}' from '{cluster_b}':")
    print(f"  {'Dim':<7} {'|Diff|':>8}  {cluster_a:>13}  {cluster_b:>13}  Direction")
    print("  " + "-" * 65)
    for i, diff, va, vb in diffs[:top_n]:
        higher = f"← {cluster_a}" if va > vb else f"← {cluster_b}"
        print(f"  dim{i:<4} {diff:>8.4f}  {va:>13.4f}  {vb:>13.4f}  {higher}")

most_discriminating_dims("Compliance", "Retail", cluster_means)
print()
most_discriminating_dims("Fraud", "Payments", cluster_means)

---
## Part 8: Head-to-Head - Notebook B vs Notebook A

The definitive comparison: where Word2Vec (Notebook B) fails and sentence-transformers succeeds.
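
One of the failure modes compared in this part is the out-of-vocabulary word. A static word-vector table has no entry for a word it never saw, while a WordPiece-style tokenizer breaks the word into known pieces. The sketch below uses a tiny invented vocabulary and a simplified greedy longest-match split. The real tokenizer has a much larger vocabulary and extra preprocessing, so treat this as an illustration of the idea only.

```python
# Hypothetical vocabularies for illustration only.
WORD_VECTORS = {"capital": [0.5, 0.25], "reserve": [0.25, 0.75]}  # static, whole words
SUBWORD_VOCAB = {"stable", "##coin", "capital", "mi", "##ca"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece starts at this position
        pieces.append(piece)
        start = end
    return pieces

try:
    WORD_VECTORS["stablecoin"]
    lookup_result = "found"
except KeyError:
    lookup_result = "KeyError"

stablecoin_pieces = wordpiece("stablecoin", SUBWORD_VOCAB)
mica_pieces = wordpiece("mica", SUBWORD_VOCAB)

print("Static lookup for 'stablecoin':", lookup_result)
print("Subword split of 'stablecoin': ", stablecoin_pieces)
print("Subword split of 'mica':       ", mica_pieces)
```

Each piece has its own learned vector, so the model can still build a representation for a word it never saw whole. How good that representation is depends on how informative the pieces are.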

In [None]:
# Test 1: Paraphrase detection
# Word2Vec: only works if exact same words appear in training corpus
# sentence-transformers: understands paraphrase semantically

paraphrase_pairs = [
    (
        "The customer failed to provide identity documents for KYC.",
        "The client did not submit identification for onboarding verification.",
        "Paraphrase: different words, same meaning"
    ),
    (
        "Suspicious transaction flagged by AML system.",
        "Anti-money laundering alert triggered on anomalous activity.",
        "Paraphrase: technical vs plain language"
    ),
    (
        "The mortgage was rejected due to insufficient income.",
        "Home loan application denied because earnings were too low.",
        "Paraphrase: formal vs informal phrasing"
    ),
    (
        "AML compliance team monitors suspicious transactions daily.",
        "Wire transfers above ten thousand dollars require reporting.",
        "NOT a paraphrase: different banking topics"
    ),
]

print("=== Paraphrase Detection ===")
print()
for s1, s2, label in paraphrase_pairs:
    v1, v2 = model.encode(s1), model.encode(s2)
    sim = cosine_sim(v1, v2)
    verdict = "✓ PARAPHRASE" if sim > 0.55 else "✗ DIFFERENT"
    print(f"A: '{s1[:68]}'")
    print(f"B: '{s2[:68]}'")
    print(f"Similarity: {sim:.3f}  {verdict} - {label}")
    print()

print("📌 Word2Vec would score these based only on overlapping vocabulary.")
print("   'customer' ≠ 'client' in Word2Vec unless both appeared in similar contexts.")
print("   sentence-transformers handles synonyms because its training pairs included paraphrases.")

In [None]:
# Test 2: Zero-shot on sentences the model has likely never seen
# Word2Vec (Notebook B) fails on out-of-vocabulary words
# sentence-transformers handles unseen words via subword tokenization

novel_sentences = [
    "The DORA regulation mandates ICT risk management for financial entities.",  # new acronym
    "Stablecoin issuers must comply with MiCA capital reserve requirements.",    # new domain
    "Embedded finance APIs now expose banking functions to non-bank fintechs.",  # new jargon
    "AML red flags include structuring, layering, and integration of illicit funds.",  # known domain
]

novel_vecs = model.encode(novel_sentences)

print("=== Zero-Shot: Novel Banking Terminology ===")
print("(Testing sentences with terms the model may not have seen during training)")
print()

# Compare new sentences to each other and to known sentences
known_compliance = model.encode("AML compliance monitoring for suspicious transactions.")

for s, v in zip(novel_sentences, novel_vecs):
    sim_to_known = cosine_sim(v, known_compliance)
    print(f"Sentence: '{s[:70]}'")
    print(f"  Similarity to known compliance anchor: {sim_to_known:.3f}")
    print()

print("📌 Even for novel terms (DORA, MiCA, embedded finance):")
print("   The model produces valid 384-dim vectors, with no 'out of vocabulary' error.")
print("   Word2Vec (Notebook B) raises a KeyError on unknown words unless you skip them.")

In [None]:
# Test 3: Semantic search ‚Äî find the most relevant sentence for a query
# This is the foundation of RAG (Session 3)

def semantic_search(query, corpus, corpus_vecs, top_n=3):
    """Find top_n most similar sentences to query."""
    q_vec = model.encode(query)
    scores = [(i, cosine_sim(q_vec, v)) for i, v in enumerate(corpus_vecs)]
    scores.sort(key=lambda x: -x[1])
    return [(corpus[i], s) for i, s in scores[:top_n]]

queries = [
    "What controls prevent illicit financial flows?",
    "How does a bank assess the risk of not being able to pay its debts?",
    "International money movement between financial institutions",
]

print("=== Semantic Search Over Banking Corpus ===")
print("(This is the retrieval step in RAG: Session 3 will build the full pipeline)")
print()

for query in queries:
    print(f"Query: '{query}'")
    results = semantic_search(query, BANKING_SENTENCES, sentence_vecs)
    for rank, (sentence, score) in enumerate(results, 1):
        print(f"  {rank}. [{score:.3f}] {sentence}")
    print()

print("📌 The query uses plain English: no banking jargon required.")
print("   The model maps query and documents into the same 384-dim space.")
print("   Closest document vectors = most semantically relevant results.")
print("   In Session 3: this scales to thousands of documents using a vector database.")

In [None]:
# Final summary comparison table
print("=== Notebook B vs Notebook A: Head-to-Head Results ===")
print()

comparisons = [
    ("'bank' (financial) vs 'bank' (river)",
     "IDENTICAL vector (static)",
     "DIFFERENT vectors (contextual)"),
    ("Paraphrase: 'customer' vs 'client'",
     "Low sim: both must be in corpus",
     "High sim: synonym understanding"),
    ("Word order: 'A approved B' vs 'B approved A'",
     "IDENTICAL: averaging loses order",
     "DIFFERENT: transformer reads order"),
    ("New acronym: DORA, MiCA",
     "KeyError: out of vocabulary",
     "Valid vector: subword tokenization"),
    ("Semantic search on unseen query",
     "Limited: only corpus words match",
     "Strong: query mapped to same space"),
    ("Training required",
     "Yes: you train it yourself",
     "No: pretrained on 1B+ pairs"),
    ("Install size",
     "~5MB",
     "~500MB"),
    ("Works behind proxy",
     "Yes: gensim only",
     "Only after first download"),
]

# First column is 46 wide so the longest test name (44 chars) stays aligned
print(f"{'Test':<46} {'Notebook B (gensim)':<35} {'Notebook A (sentence-transformers)'}")
print("-" * 120)
for test, nb_b, nb_a in comparisons:
    print(f"{test:<46} {nb_b:<35} {nb_a}")

print()
print("When to use which:")
print("  Notebook B (gensim): teach how embeddings work, air-gapped env, show training")
print("  Notebook A (this):   production similarity, paraphrase, semantic search, RAG prep")

---
## Hands-On Exercise: Explore Your Own Banking Domain

In [None]:
# Choose a banking domain and write 5-8 sentences for it
# The model requires no training: just encode and explore

MY_DOMAIN = "Your Domain Here"

MY_SENTENCES = [
    # TODO: Write 5-8 realistic sentences from your chosen domain
    "First sentence about your domain topic here.",
    "Second sentence with different wording about the same topic.",
    "Third sentence that explores another aspect of your domain.",
    "Fourth sentence: try a paraphrase of sentence one.",
    "Fifth sentence: something clearly from a different topic.",
]

# Encode your sentences
my_vecs = model.encode(MY_SENTENCES)

print(f"Domain: {MY_DOMAIN}")
print(f"Sentences: {len(MY_SENTENCES)}")
print(f"Vector shape: {my_vecs.shape}")
print()

# Similarity matrix
print("Similarity matrix (row i vs column j):")
n = len(MY_SENTENCES)
header = " " * 5 + "".join(f"  S{j:<3}" for j in range(n))
print(header)
for i in range(n):
    row = f"S{i:<3} "
    for j in range(n):
        sim = cosine_sim(my_vecs[i], my_vecs[j])
        row += f"  {sim:.2f}"
    print(row)

print()
print("Questions to discuss:")
print("  1. Which pairs scored highest? Are they genuine paraphrases?")
print("  2. Which pairs scored lowest? Do they cover distinct subtopics?")
print("  3. Try a QUERY sentence (not in MY_SENTENCES) and run semantic_search on it.")
print("  4. What bias might exist in your sentences?")

---
## Summary: What You Experienced

| Part | Concept | Key Takeaway |
|------|---------|-------------|
| 1. Model | Pretrained transformer | 22M params, trained on 1B+ pairs: already knows English |
| 2. Word-level | Single-word vectors | Model understands AML≈KYC without your corpus |
| 3. Sentence-level | Full context encoding | Word order, grammar, intent all encoded |
| 4. Context | 'bank' disambiguation | Different vector per sentence: the core innovation |
| 5. Bias | Pretrained bias | Internet-scale training → internet-scale biases |
| 6. Visualize | 2D PCA of sentences | Sentence clusters emerge, not word clusters |
| 7. Dimensions | Inside 384 dims | Same principle as 50 dims: pattern, not labels |
| 8. Comparison | B vs A head-to-head | Each wins in different scenarios |

### The Natural Next Step - Session 3

```
You just did:  sentence → 384-dim vector → cosine similarity

Session 3 adds:
  1. A corpus of thousands of banking documents (PDFs, regulations, reports)
  2. A vector database (ChromaDB / pgvector) storing all their embeddings
  3. A query comes in → embed it → find top-K similar document chunks
  4. Pass those chunks + query to an LLM → grounded, cited answer

That is RAG: Retrieval-Augmented Generation.
The embedding model you used today (all-MiniLM-L6-v2) is the retrieval engine.
```
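
The retrieval step in the outline above can be sketched with an in-memory list standing in for the vector database. This is a toy under stated assumptions: `ToyVectorIndex` is a hypothetical class, the 3-dimensional vectors are invented instead of coming from `model.encode`, and real stores such as ChromaDB or pgvector have their own APIs and typically use approximate indexes rather than the exhaustive scan used here.

```python
import heapq
import math

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

class ToyVectorIndex:
    """In-memory stand-in for a vector database: stores (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, top_k=2):
        """Score every stored vector and return the top_k (score, text) pairs."""
        scored = ((cosine(query_vector, vec), text) for text, vec in self._items)
        return heapq.nlargest(top_k, scored)

index = ToyVectorIndex()
index.add("AML monitoring policy", [0.9, 0.1, 0.0])
index.add("Mortgage pricing guide", [0.1, 0.9, 0.1])
index.add("SWIFT payment manual", [0.0, 0.2, 0.9])
index.add("Sanctions screening procedure", [0.8, 0.0, 0.3])

# Hypothetical query embedding; in practice this would come from the same model.
query_vector = [1.0, 0.0, 0.1]
hits = index.search(query_vector, top_k=2)

for score, text in hits:
    print(f"[{score:.3f}] {text}")
```

The key design constraint carries over to real systems: queries and documents must be embedded with the same model so that they share one vector space.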