# Session 1.5B: Embeddings Deep Dive - Train Your Own (No Heavy Dependencies)

**Only requires:** `pip install gensim` (~5MB, no torch, no GPU)  
**Works behind:** corporate proxies, air-gapped environments  
**Focus:** Experience the TRAINING process, not just inference

---

## What You Will Experience

```
Part 1 → Build a banking corpus by hand (you control what the model learns)
Part 2 → Train Word2Vec from scratch, watch it learn in seconds
Part 3 → Word-level embeddings: similar words cluster together
Part 4 → Sentence-level: average word vectors, see the limitation
Part 5 → Context-awareness: hit the ceiling of static embeddings
Part 6 → Bias: feed biased data, get biased embeddings
Part 7 → Visualize: 2D plot of your banking embedding space
Part 8 → Retrain: change the corpus, watch the space shift
```

---
## Setup

In [None]:
# Only one install needed (~5MB, no torch, no GPU)
!pip install -q gensim

In [None]:
import math
import random
import collections
from gensim.models import Word2Vec

print("✅ Ready. No API keys. No heavy downloads. Just Python + gensim.")

---
## Part 1: Build Your Banking Corpus

**Key idea:** The model learns ONLY from sentences you give it.  
Words that often appear near each other → similar vectors.  
You control what it knows.
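
To make "appear near each other" concrete, here is a minimal sketch that counts how often two words fall within a fixed window of each other. The helper name `cooccurrence_counts`, the three toy sentences, and the window size of 2 are illustrative assumptions. Word2Vec never builds this table explicitly, but these are the kind of statistics its training signal is drawn from.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, center in enumerate(tokens):
            # Look only to the right so each pair is counted once per occurrence
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                pair = tuple(sorted((center, tokens[j])))
                counts[pair] += 1
    return counts

# Toy sentences (hypothetical, not the full notebook corpus)
toy_sentences = [
    "aml team monitors suspicious transactions".split(),
    "kyc team verifies customer identity".split(),
    "aml analysts review suspicious transactions".split(),
]

pair_counts = cooccurrence_counts(toy_sentences, window=2)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Pairs that never share a window, such as `("aml", "kyc")` here, get a count of zero; any similarity between such words has to come from the neighbours they have in common.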

In [None]:
# Each sentence is a list of words (tokens)
# Word2Vec learns from co-occurrence within a sliding window

BANKING_CORPUS = [
    # AML / Compliance cluster
    "aml compliance team monitors suspicious transactions daily".split(),
    "kyc procedures require customer identification and verification".split(),
    "bsa requires banks to file suspicious activity reports".split(),
    "aml analysts review flagged transactions for money laundering".split(),
    "kyc onboarding collects customer documents and identity proof".split(),
    "compliance officers ensure aml and kyc policies are followed".split(),
    "suspicious transactions trigger aml alerts for review".split(),
    "money laundering detection requires robust aml controls".split(),
    "bsa officer reviews ctr and sar filings for compliance".split(),
    "kyc refresh is required annually for high risk customers".split(),

    # Fraud cluster
    "fraud detection models flag anomalous transaction patterns".split(),
    "fraud analysts investigate unauthorized card transactions".split(),
    "fraud prevention uses machine learning to detect anomalies".split(),
    "chargeback process resolves disputed fraud transactions".split(),
    "fraud risk increases during holiday shopping seasons".split(),
    "account takeover fraud involves stolen customer credentials".split(),
    "fraud alerts are sent to customers via sms and email".split(),

    # Capital / Risk cluster
    "capital adequacy ratio measures bank financial strength".split(),
    "credit risk assessment evaluates borrower default probability".split(),
    "market risk arises from changes in interest rates and prices".split(),
    "basel three framework sets minimum capital requirements".split(),
    "stress testing evaluates bank resilience under adverse scenarios".split(),
    "risk weighted assets determine required capital buffers".split(),
    "liquidity risk management ensures bank can meet obligations".split(),
    "capital ratio must exceed regulatory minimum requirements".split(),

    # Retail Banking cluster
    "mortgage loan approval depends on credit score and income".split(),
    "savings account earns interest on deposited customer funds".split(),
    "credit card spending limit is set based on creditworthiness".split(),
    "overdraft fee is charged when account balance goes negative".split(),
    "personal loan application requires income verification documents".split(),
    "mortgage rate depends on fed funds rate and credit score".split(),
    "retail banking serves individual customers with daily needs".split(),
    "branch teller processes deposits withdrawals and transfers".split(),

    # Payments / Wires cluster
    "wire transfer sends funds between banks via swift network".split(),
    "payment processing requires sender and receiver account details".split(),
    "swift code identifies the receiving bank for international wires".split(),
    "wire transfers above ten thousand dollars require reporting".split(),
    "payment gateway authorizes card transactions in real time".split(),
    "international wire transfer fees vary by destination country".split(),
]

print(f"Corpus built: {len(BANKING_CORPUS)} sentences")
print(f"Total words:  {sum(len(s) for s in BANKING_CORPUS)}")
print(f"\nSample sentence: {BANKING_CORPUS[0]}")
print("\n📌 This corpus is YOUR model's entire world.")
print("   It knows nothing outside these sentences.")

In [None]:
# Word frequency: what the model will see most
from collections import Counter

all_words = [w for sentence in BANKING_CORPUS for w in sentence]
freq = Counter(all_words).most_common(20)

print("Top 20 words in corpus:")
for word, count in freq:
    bar = "█" * count
    print(f"  {word:<20} {count:>3}  {bar}")

print("\n📌 High-frequency words will have better-trained vectors.")
print("   Rare words (appear once) will have poor representations.")

---
## Part 2: Train Word2Vec From Scratch

**What happens during training:**
```
For each word in each sentence:
  Look at surrounding words (window)
  Adjust vectors so nearby words predict each other
  Words in similar contexts → similar vectors

"aml compliance team monitors..."
  aml  ←→  compliance, team, monitors   (window=3)
  aml  ←→  kyc, bsa, suspicious         (across sentences)
  Result: aml vector ≈ kyc vector ≈ bsa vector
```
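
The "look at surrounding words" step can be sketched as generating (center, context) training pairs. This is a simplified illustration of the skip-gram idea using the `window=3` setting this notebook trains with. The function name `skipgram_pairs` is hypothetical, and gensim's real training loop differs in its details (for example, its default mode is CBOW with negative sampling).

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost index inside the window
        hi = min(len(tokens), i + window + 1)   # one past the rightmost index
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "aml compliance team monitors suspicious transactions daily".split()
pairs = skipgram_pairs(sentence, window=3)

# Context words the model would see for 'aml' in this one sentence
aml_contexts = [ctx for center, ctx in pairs if center == "aml"]
print(aml_contexts)
print(f"{len(pairs)} training pairs from {len(sentence)} tokens")
```

Every pair nudges two vectors toward each other, which is why a handful of short sentences still yields a useful number of training examples.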

In [None]:
import time

print("Training Word2Vec on banking corpus...")
t0 = time.time()

model = Word2Vec(
    sentences=BANKING_CORPUS,
    vector_size=50,    # Each word → 50-dimensional vector (tiny but sufficient)
    window=3,          # Look 3 words left and right
    min_count=1,       # Include all words (small corpus)
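    workers=1,         # Single thread (needed for repeatable results with a fixed seed)
    epochs=200,        # Train 200 passes over corpus
    seed=42            # Fixed seed; repeatability across interpreter launches also needs PYTHONHASHSEED set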
)

elapsed = time.time() - t0
vocab = list(model.wv.key_to_index.keys())

print(f"✅ Training complete in {elapsed:.2f} seconds")
print(f"   Vocabulary size: {len(vocab)} unique words")
print(f"   Vector dimensions: {model.vector_size}")
print(f"   Training epochs: 200")
print(f"\nSample vocabulary: {sorted(vocab)[:20]}")

In [None]:
# Inspect a raw vector
word = "aml"
vector = model.wv[word]

print(f"Vector for '{word}':")
print(f"  Dimensions: {len(vector)}")
print(f"  Values: {vector.round(3)}")
print(f"  Range: [{vector.min():.3f}, {vector.max():.3f}]")
print(f"\n📌 These numbers encode the meaning of '{word}' based on its context in YOUR corpus.")
print(f"   They have no inherent meaning alone; only relative distances matter.")

---
## Part 2B: Inside the Dimensions - What Do the 50 Numbers Actually Encode?

Each word vector has 50 numbers. **No single number has a human label.**  
But patterns emerge: dimensions that fire high for compliance words stay quiet for retail words.  
This section lets you peer inside and build intuition for what "meaning in numbers" looks like.

In [None]:
# Step 1: Print all 50 dimensions for a single word, each next to a bar
def show_dimensions(word, m=model, top_n=50):
    """Print all dimensions of a word vector as a horizontal bar chart."""
    if word not in m.wv:
        print(f"'{word}' not in vocabulary")
        return
    vec = list(m.wv[word])
    vmax = max(abs(v) for v in vec)

    print(f"\nAll {len(vec)} dimensions for '{word}':")
    print(f"  Range: [{min(vec):.3f}, {max(vec):.3f}]")
    print()
    print(f"  {'Dim':<6} {'Value':>8}  {'Bar (positive=█, negative=▒)'}")
    print(f"  {'---':<6} {'-----':>8}  {'-' * 40}")

    for i, v in enumerate(vec):
        bar_len = int(abs(v) / (vmax + 1e-9) * 20)
        bar = ("█" * bar_len) if v >= 0 else ("▒" * bar_len)
        sign = "+" if v >= 0 else "-"
        print(f"  dim{i:<3} {v:>8.4f}  {sign} {bar}")

show_dimensions("aml")

In [None]:
# Step 2: Compare the same dimensions across multiple words to see which dims differ
# This is the key insight: similar words have similar patterns across ALL 50 dims

INSPECT_WORDS = ["aml", "kyc", "fraud", "mortgage", "wire"]

# Print a compact comparison table: words as columns, dims as rows
# Show every 5th dimension to keep it readable
def compare_dimensions(words, m=model, stride=5):
    """Show dimension values across multiple words in a table."""
    vecs = {}
    for w in words:
        if w in m.wv:
            vecs[w] = list(m.wv[w])
        else:
            print(f"  Warning: '{w}' not in vocabulary, skipping")

    if not vecs:
        return

    dim_count = len(next(iter(vecs.values())))
    word_list = list(vecs.keys())

    print("=== Dimension Comparison Across Words ===")
    print(f"Showing every {stride} dimensions (dim 0, {stride}, {stride*2}, ...)")
    print(f"Words: {word_list}")
    print()

    # Header
    header = f"  {'Dim':<6}" + "".join(f"{w:>10}" for w in word_list)
    print(header)
    print("  " + "-" * (6 + 10 * len(word_list)))

    for i in range(0, dim_count, stride):
        row = f"  dim{i:<3}"
        for w in word_list:
            val = vecs[w][i]
            row += f"  {val:>8.4f}"
        print(row)

    print()
    print("📌 Look for: 'aml' and 'kyc' tend to show similar patterns across dims.")
    print("   'mortgage' should look different because its training context was different.")
    print("   These patterns ARE the meaning; no individual dim has a label.")

compare_dimensions(INSPECT_WORDS, stride=5)

In [None]:
# Step 3: Find the dimensions that MOST DIFFER between clusters
# These are the "most informative" dimensions: they separate compliance from retail

def most_discriminating_dims(group_a, group_b, m=model, top_n=10):
    """
    Find dimensions that differ most between two word groups.
    group_a, group_b: lists of words
    Returns top_n dimensions ranked by |mean_a - mean_b|
    """
    def group_mean_vec(words):
        vecs = [list(m.wv[w]) for w in words if w in m.wv]
        if not vecs:
            return None
        dim_count = len(vecs[0])
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim_count)]

    mean_a = group_mean_vec(group_a)
    mean_b = group_mean_vec(group_b)
    if mean_a is None or mean_b is None:
        print("  One group has no vocabulary overlap.")
        return

    # Rank dimensions by absolute difference
    diffs = [(i, abs(mean_a[i] - mean_b[i]), mean_a[i], mean_b[i])
             for i in range(len(mean_a))]
    diffs.sort(key=lambda x: -x[1])

    print(f"Top {top_n} most discriminating dimensions:")
    print(f"  Group A: {group_a}")
    print(f"  Group B: {group_b}")
    print()
    print(f"  {'Dim':<7} {'|Diff|':>8}  {'GroupA mean':>13}  {'GroupB mean':>13}  {'Which is higher?'}")
    print("  " + "-" * 65)
    for i, diff, va, vb in diffs[:top_n]:
        higher = "← A higher" if va > vb else "← B higher"
        print(f"  dim{i:<4} {diff:>8.4f}  {va:>13.4f}  {vb:>13.4f}  {higher}")

    print()
    print("📌 These dims are NOT labelled 'compliance' or 'retail' by the algorithm.")
    print("   But numerically, they are what separates the two clusters in 50D space.")

# Compliance cluster vs Retail cluster
most_discriminating_dims(
    group_a=["aml", "kyc", "bsa", "compliance", "suspicious"],
    group_b=["mortgage", "savings", "overdraft", "retail", "loan"]
)

In [None]:
# Step 4: ASCII heatmap that visualizes ALL 50 dims across ALL clusters at once
# Each cell is one dimension value, rendered as a shade character

def ascii_heatmap(word_groups, m=model):
    """
    Render a heatmap of dimension values.
    Rows = word clusters, Columns = dimensions 0..49
    Shade characters: ' ' (low) → '·' → '░' → '▒' → '▓' → '█' (high absolute value)
    """
    SHADES = [" ", "·", "░", "▒", "▓", "█"]

    # Compute per-group mean vectors
    groups = {}
    for cluster_name, words in word_groups.items():
        vecs = [list(m.wv[w]) for w in words if w in m.wv]
        if vecs:
            dim_count = len(vecs[0])
            groups[cluster_name] = [
                sum(v[i] for v in vecs) / len(vecs)
                for i in range(dim_count)
            ]

    if not groups:
        print("No words found in vocabulary.")
        return

    # Global normalization
    all_vals = [abs(v) for vec in groups.values() for v in vec]
    vmax = max(all_vals) if all_vals else 1.0

    dim_count = len(next(iter(groups.values())))

    print(f"=== Dimension Heatmap: All {dim_count} Dims × {len(groups)} Clusters ===")
    print(f"Shade: ' '=near 0  '·'=small  '░'=medium  '▒▓█'=large absolute value")
    print(f"Each column = one dimension (0 → {dim_count-1})")
    print()

    # Column headers every 10 dims
    header = f"  {'Cluster':<14}|"
    for i in range(0, dim_count, 10):
        header += f"{i:<10}"
    print(header)
    print("  " + "-" * (15 + dim_count))

    for cluster_name, vec in groups.items():
        row = f"  {cluster_name:<14}|"
        for val in vec:
            # Scale by len(SHADES) so the largest value reaches the darkest shade
            intensity = min(len(SHADES) - 1,
                            int(abs(val) / (vmax + 1e-9) * len(SHADES)))
            row += SHADES[intensity]
        print(row)

    print()
    print("📌 Look for 'bright' dims that Compliance and Fraud share")
    print("   (their sentences share words such as 'transactions') and compare them with Retail.")
    print("   PCA works by rotating this 50-column picture to find the axes of")
    print("   maximum variance, then keeping just 2 of them for plotting.")

ascii_heatmap({
    "Compliance": ["aml", "kyc", "bsa", "compliance", "suspicious"],
    "Fraud":      ["fraud", "anomalous", "unauthorized", "chargeback"],
    "Capital":    ["capital", "risk", "credit", "basel"],
    "Retail":     ["mortgage", "savings", "overdraft", "loan"],
    "Payments":   ["wire", "swift", "payment", "transfer"],
})

In [None]:
# Step 5: Spotlight. Pick any single dimension and rank ALL vocab words by that dim
# This answers "what does dimension 7 actually represent?"

def spotlight_dim(dim_index, m=model, top_n=10):
    """
    Show which words score highest and lowest on a single dimension.
    This builds intuition for what that dimension 'detects'.
    """
    vocab = list(m.wv.key_to_index.keys())
    scores = [(w, float(m.wv[w][dim_index])) for w in vocab]
    scores.sort(key=lambda x: -x[1])

    print(f"=== Spotlight on Dimension {dim_index} ===")
    print()
    print(f"  TOP words (high positive value on dim {dim_index}):")
    for w, s in scores[:top_n]:
        bar = "█" * int(abs(s) / (abs(scores[0][1]) + 1e-9) * 15)
        print(f"    {w:<20} {s:>8.4f}  +{bar}")

    print()
    print(f"  BOTTOM words (high negative value on dim {dim_index}):")
    for w, s in scores[-top_n:]:
        bar = "▒" * int(abs(s) / (abs(scores[-1][1]) + 1e-9) * 15)
        print(f"    {w:<20} {s:>8.4f}  -{bar}")

    print()
    print(f"  Do the top words share a theme? Do the bottom words share another?")
    print(f"  If yes, this dimension partially encodes that theme.")
    print(f"  If the pattern looks random, this dim captures a mix of features.")
    print(f"  Individual dimensions are rarely interpretable; their combination is.")

# Try a few dimensions ‚Äî look for any emergent themes
for d in [0, 7, 15, 23]:
    spotlight_dim(d, top_n=6)
    print()

In [None]:
# Step 6: Dimension distance between two words, dimension by dimension
# See exactly WHERE in the 50-dim space they agree and disagree

def dim_by_dim_compare(w1, w2, m=model, top_n=10):
    """
    Compare two word vectors dimension by dimension.
    Highlights dims where they are most similar and most different.
    """
    if w1 not in m.wv or w2 not in m.wv:
        print(f"One of '{w1}', '{w2}' not in vocabulary.")
        return

    v1, v2 = list(m.wv[w1]), list(m.wv[w2])
    diffs = [(i, v1[i], v2[i], abs(v1[i] - v2[i])) for i in range(len(v1))]

    # Sort by most different
    most_diff = sorted(diffs, key=lambda x: -x[3])
    # Sort by most similar (smallest diff)
    most_same = sorted(diffs, key=lambda x: x[3])

    print(f"=== Dimension-by-Dimension: '{w1}' vs '{w2}' ===")
    # Use gensim's built-in cosine similarity; the cosine_sim helper is only defined in Part 3
    print(f"  Cosine similarity: {m.wv.similarity(w1, w2):.4f}")
    print()

    print(f"  Top {top_n} dims where they DIFFER MOST:")
    print(f"  {'Dim':<7} {w1:>10} {w2:>10}  {'|Diff|':>8}  Visual")
    print("  " + "-" * 55)
    for i, a, b, d in most_diff[:top_n]:
        bar_a = "█" * int(abs(a) / 0.5 * 8)
        bar_b = "░" * int(abs(b) / 0.5 * 8)
        print(f"  dim{i:<4} {a:>10.4f} {b:>10.4f}  {d:>8.4f}  [{bar_a}|{bar_b}]")

    print()
    print(f"  Top {top_n} dims where they AGREE MOST (closest values):")
    print(f"  {'Dim':<7} {w1:>10} {w2:>10}  {'|Diff|':>8}")
    print("  " + "-" * 42)
    for i, a, b, d in most_same[:top_n]:
        print(f"  dim{i:<4} {a:>10.4f} {b:>10.4f}  {d:>8.4f}  ← nearly equal")

    print()
    print("📌 The dims where they agree → shared context (both banking terms).")
    print("   The dims where they differ → the SEMANTIC DISTANCE between them.")
    print("   Cosine similarity takes all 50 into account at once.")

# Close pair (should mostly agree)
dim_by_dim_compare("aml", "kyc")
print()
# Distant pair (should differ on many dims)
dim_by_dim_compare("aml", "mortgage")

### Key Insight: Why Individual Dimensions Don't Have Names

```
Word2Vec training objective: predict surrounding words.
It has NO instruction to make dim-7 = "compliance-ness".

What actually happens:
  The 50 dimensions are free parameters.
  The optimizer distributes meaning across ALL of them.
  A single dimension might weakly correlate with compliance,
  weakly correlate with formality, and weakly anti-correlate with
  informality, all at once.

  → Meaning is encoded in the COMBINATION, not in individual dims.
  → This is why you need cosine similarity across all 50 at once.
  → This is also why PCA can find structure: it finds the directions
     of maximum variance in this 50-column space, the axes along
     which your model's vectors differ most.

Modern models (768-dim BERT, 1536-dim OpenAI) work the same way,
just with far more capacity. The interpretation problem is identical:
no single dimension = one concept.
```
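
One way to see why single dimensions carry no fixed meaning: rotating the whole space changes every coordinate involved but leaves cosine similarities untouched. The sketch below uses made-up 3-dimensional vectors and a hand-written rotation in the plane of the first two dimensions. The vectors and the names `rotate_xy` and `toy` are illustrative assumptions, not values from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate_xy(vec, angle):
    """Rotate a 3-D vector by `angle` radians in the plane of dims 0 and 1."""
    x, y, z = vec
    c, s = math.cos(angle), math.sin(angle)
    return [c * x - s * y, s * x + c * y, z]

# Hypothetical toy vectors
toy = {
    "aml":      [0.9, 0.1, 0.3],
    "kyc":      [0.8, 0.2, 0.4],
    "mortgage": [-0.2, 0.9, 0.1],
}
rotated = {w: rotate_xy(v, math.radians(40)) for w, v in toy.items()}

before = cosine(toy["aml"], toy["kyc"])
after = cosine(rotated["aml"], rotated["kyc"])
print(f"dim 0 of 'aml': {toy['aml'][0]:.3f} -> {rotated['aml'][0]:.3f}")
print(f"cos(aml, kyc):  {before:.4f} -> {after:.4f}")
```

The value in dim 0 changes substantially while the similarity stays the same up to floating-point error, so two equally good models can disagree on what any one dimension "means".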

---
## Part 3: Word-Level Embeddings - Similar Words Cluster Together

In [None]:
# most_similar: find words with closest vectors
probe_words = ["aml", "fraud", "mortgage", "capital", "wire"]

print("=== Most Similar Words by Vector Distance ===")
for word in probe_words:
    if word not in model.wv:
        print(f"  '{word}' not in vocabulary")
        continue
    similar = model.wv.most_similar(word, topn=5)
    print(f"\n'{word}' → most similar:")
    for w, score in similar:
        bar = "█" * int(score * 20)
        print(f"  {w:<20} {score:.3f}  {bar}")

print("\n📌 Words that appear in similar sentences cluster together.")
print("   'aml' should land near 'kyc' and 'bsa' because they share contexts.")
print("   'aml' should be far from 'mortgage': different sentences, different context.")

In [None]:
# Cosine similarity between specific pairs
def cosine_sim(w1, w2):
    """Cosine similarity between two word vectors. stdlib only."""
    v1, v2 = model.wv[w1], model.wv[w2]
    dot    = sum(a * b for a, b in zip(v1, v2))
    norm1  = math.sqrt(sum(a * a for a in v1))
    norm2  = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

pairs = [
    ("aml",      "kyc",       "Both compliance, same sentences"),
    ("aml",      "bsa",       "Both regulatory, same sentences"),
    ("fraud",    "suspicious","Different clusters, shared neighbor 'transactions'"),
    ("mortgage", "credit",    "Both retail lending terms"),
    ("aml",      "mortgage",  "Different clusters"),
    ("fraud",    "capital",   "Very different domains"),
    ("wire",     "swift",     "Co-occur in payment sentences"),
]

print(f"{'Word A':<12} {'Word B':<12} {'Similarity':<12} {'Reason'}")
print("-" * 70)
for w1, w2, reason in pairs:
    if w1 not in model.wv or w2 not in model.wv:
        continue
    sim = cosine_sim(w1, w2)
    verdict = "CLOSE" if sim > 0.7 else "RELATED" if sim > 0.4 else "DISTANT"
    print(f"{w1:<12} {w2:<12} {sim:<12.3f} {verdict} - {reason}")

In [None]:
# Analogy: king - man + woman = queen (classic Word2Vec demo)
# Banking version: compliance - kyc + fraud = ?
print("=== Word Analogies ===")
print("Logic: A is to B as C is to ?")
print("Formula: vector(B) - vector(A) + vector(C)\n")

# Each tuple is (A, B, C, label)
analogies = [
    ("kyc", "compliance", "fraud",    "kyc:compliance :: fraud:?"),
    ("mortgage", "retail", "wire",    "mortgage:retail :: wire:?"),
    ("aml", "suspicious", "fraud",    "aml:suspicious :: fraud:?"),
]

for a, b, c, label in analogies:
    try:
        # B - A + C: add B and C, subtract A (matches the formula printed above)
        result = model.wv.most_similar(
            positive=[b, c],
            negative=[a],
            topn=3
        )
        print(f"{label}")
        for w, s in result:
            print(f"  → '{w}' ({s:.3f})")
        print()
    except KeyError as e:
        print(f"  Word not in vocab: {e}")

print("📌 Analogies work when the model learns consistent vector DIRECTIONS.")
print("   Ideally the kyc-to-compliance offset also applies to 'fraud'; on a tiny corpus expect noise.")

---
## Part 4: Sentence-Level Embeddings ‚Äî Average Word Vectors

Word2Vec gives word vectors. For sentences, the simplest approach is averaging.  
Works reasonably well, but has a clear ceiling.
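
The ceiling comes from the arithmetic itself: a mean does not depend on the order of its inputs. A minimal sketch with hand-made 2-dimensional vectors (the values and the `average` helper are illustrative assumptions, not output of the trained model):

```python
def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical toy vectors; values chosen to be exactly representable as floats
toy_vectors = {
    "customer": [1.0, 0.0],
    "reported": [0.0, 1.0],
    "bank":     [0.5, 0.5],
}

a = average([toy_vectors[w] for w in "customer reported bank".split()])
b = average([toy_vectors[w] for w in "bank reported customer".split()])
print(a, b, a == b)
```

Swapping "customer" and "bank" reverses who did the reporting, yet both sentences collapse to the same vector.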

In [None]:
def sentence_vector(sentence: str) -> list:
    """Average word vectors for all known words in sentence."""
    words = sentence.lower().split()
    known = [model.wv[w] for w in words if w in model.wv]
    if not known:
        return [0.0] * model.vector_size
    # Average: sum each dimension, divide by count
    avg = [sum(v[i] for v in known) / len(known)
           for i in range(model.vector_size)]
    return avg

def sentence_sim(s1: str, s2: str) -> float:
    """Cosine similarity between two sentence vectors."""
    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    dot   = sum(a * b for a, b in zip(v1, v2))
    n1    = math.sqrt(sum(a * a for a in v1))
    n2    = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sentence_pairs = [
    (
        "aml team monitors suspicious transactions",
        "compliance analysts review money laundering alerts",
        "Same meaning, different words"
    ),
    (
        "customer failed to provide identity documents",
        "client did not submit identification for kyc",
        "Paraphrase"
    ),
    (
        "mortgage loan requires credit score verification",
        "fraud detection flags anomalous wire transfers",
        "Different topics"
    ),
    (
        "basel three capital ratio compliance",
        "mortgage overdraft retail savings",
        "Capital vs retail: should be distant"
    ),
]

print("=== Sentence-Level Similarity ===")
print(f"{'Sentence A':<45} {'Sentence B':<45} {'Sim':<7} Relationship")
print("-" * 120)
for s1, s2, label in sentence_pairs:
    sim = sentence_sim(s1, s2)
    print(f"{s1[:43]:<45} {s2[:43]:<45} {sim:.3f}  {label}")

In [None]:
# The averaging limitation: word order is lost
print("=== Averaging Limitation: Word Order Lost ===")
print()

pairs = [
    (
        "bank approves the loan application",
        "loan application approves the bank",  # Nonsense but same words
        "Same words, different order"
    ),
    (
        "customer reported fraud to the bank",
        "bank reported fraud to the customer",  # Opposite meaning
        "Opposite meaning, same words"
    ),
]

for s1, s2, label in pairs:
    sim = sentence_sim(s1, s2)
    print(f"A: '{s1}'")
    print(f"B: '{s2}'")
    print(f"Similarity: {sim:.3f} - {label}")
    print(f"Problem: Averaging gives IDENTICAL vectors for same words in any order!")
    print()

print("📌 This is one reason contextual models (BERT, sentence-transformers) were developed.")
print("   They read the full sequence, so order and context both matter.")

---
## Part 5: Context-Awareness ‚Äî The Ceiling of Static Embeddings

Word2Vec is **static**: one word = one vector, always.  
The word `"bank"` has the same vector whether you mean a financial institution or a river bank.  
This is the fundamental limitation that motivated BERT and sentence-transformers.

In [None]:
# First: add ambiguous sentences to the corpus so 'bank' is seen in both senses
AMBIGUOUS_SENTENCES = [
    "the bank approved the mortgage application today".split(),
    "the bank rejected the loan due to low credit score".split(),
    "the river bank flooded during the heavy rainstorm".split(),
    "fishermen stood on the bank waiting for the catch".split(),
    "we bank with the largest financial institution downtown".split(),
    "the bank of the river eroded during spring floods".split(),
]

# Retrain with ambiguous sentences added
mixed_corpus = BANKING_CORPUS + AMBIGUOUS_SENTENCES

model_mixed = Word2Vec(
    sentences=mixed_corpus,
    vector_size=50,
    window=3,
    min_count=1,
    workers=1,
    epochs=200,
    seed=42
)

print("=== Static Embedding: 'bank' has ONE vector regardless of context ===")
print()

context_a = "the bank approved the mortgage"   # Financial institution
context_b = "the river bank flooded badly"      # Geography

# In Word2Vec, 'bank' always has the same vector
bank_vector = model_mixed.wv["bank"]

print(f"Sentence A (financial): '{context_a}'")
print(f"Sentence B (geography): '{context_b}'")
print()
print(f"Word2Vec vector for 'bank' in Sentence A: {bank_vector[:5].round(3)}...")
print(f"Word2Vec vector for 'bank' in Sentence B: {bank_vector[:5].round(3)}...")
print()
print("IDENTICAL. Word2Vec cannot distinguish context.")
print()

# Show what 'bank' is closest to: a mix of both contexts
similar_to_bank = model_mixed.wv.most_similar("bank", topn=8)
print("Most similar to 'bank' (blended from both contexts):")
for w, s in similar_to_bank:
    print(f"  {w:<20} {s:.3f}")

print()
print("📌 'bank' vector is a confused average of financial + geographical meanings.")
print("   A contextual model (BERT) would give different vectors for each sentence.")
print("   That is the core innovation of Notebook A (sentence-transformers).")

In [None]:
# Demonstrate: sentence vectors DO differ because surrounding words differ
# Even though 'bank' is identical, the average of all words changes

s_financial = "the bank approved the mortgage application"
s_river     = "the river bank flooded during the rainstorm"

# sentence_sim() above is tied to the first model, so define variants
# that take the model as a parameter and use them with model_mixed
def sentence_vector_m(sentence, m):
    words = sentence.lower().split()
    known = [m.wv[w] for w in words if w in m.wv]
    if not known:
        return [0.0] * m.vector_size
    return [sum(v[i] for v in known) / len(known) for i in range(m.vector_size)]

def sim_m(s1, s2, m):
    v1, v2 = sentence_vector_m(s1, m), sentence_vector_m(s2, m)
    dot = sum(a*b for a,b in zip(v1,v2))
    n1  = math.sqrt(sum(a*a for a in v1))
    n2  = math.sqrt(sum(b*b for b in v2))
    return dot/(n1*n2) if n1 and n2 else 0.0

print("=== Sentence similarity even with static 'bank' ===")
print(f"'{s_financial}'")
print(f"'{s_river}'")
print(f"Similarity: {sim_m(s_financial, s_river, model_mixed):.3f}")
print()
print("The sentences differ because surrounding words (mortgage vs river, flooded)")
print("pull the average vector in different directions, but it is a rough proxy.")
print("Contextual models do this far more precisely.")

---
## Part 6: Bias - The Model Learns What You Feed It

Embeddings reflect patterns in training data, including harmful ones.  
This is one of the most important concepts in responsible AI.
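
One simple way to put a number on this is the gap between a word's cosine similarity to "he" and to "she". The sketch below uses hand-made 2-dimensional vectors purely for illustration. The values, the `gender_gap` helper, and the sign convention are assumptions, and the gap is a rough indicator rather than a formal fairness metric.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gender_gap(word_vec, he_vec, she_vec):
    """Positive: closer to 'he'. Negative: closer to 'she'. Zero: equidistant."""
    return cosine(word_vec, he_vec) - cosine(word_vec, she_vec)

# Hypothetical vectors standing in for what a biased corpus might produce
toy = {
    "he":        [1.0, 0.0],
    "she":       [0.0, 1.0],
    "analyst":   [0.9, 0.2],
    "paperwork": [0.1, 0.8],
    "bank":      [0.6, 0.6],
}

for word in ["analyst", "paperwork", "bank"]:
    gap = gender_gap(toy[word], toy["he"], toy["she"])
    print(f"{word:<10} gap = {gap:+.3f}")
```

A gap near zero, as for the toy "bank" vector, is what a role word would ideally show; consistently signed gaps across job titles are the pattern to look for.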

In [None]:
# First train on a NEUTRAL corpus (roles described without gendered pronouns)
NEUTRAL_CORPUS = BANKING_CORPUS + [
    "senior analyst reviews compliance reports carefully".split(),
    "the analyst identified a suspicious transaction pattern".split(),
    "the loan officer approved the mortgage application".split(),
    "our compliance officer ensures regulatory adherence".split(),
    "the risk manager presented capital adequacy results".split(),
    "branch manager approved the high value wire transfer".split(),
    "the analyst recommended selling the equity position".split(),
    "portfolio manager rebalanced the investment allocation".split(),
]

model_neutral = Word2Vec(
    sentences=NEUTRAL_CORPUS,
    vector_size=50, window=3, min_count=1,
    workers=1, epochs=200, seed=42
)

# Now train on a BIASED corpus (gender stereotypes in job roles)
BIASED_CORPUS = BANKING_CORPUS + [
    "he is the senior analyst who reviews compliance reports".split(),
    "he identified the suspicious transaction as a fraud case".split(),
    "he approved the mortgage as the loan officer".split(),
    "he manages capital adequacy as the risk manager".split(),
    "he leads the compliance team as chief officer".split(),
    "he approved the wire transfer as branch manager".split(),
    "she handles the customer service calls at the branch".split(),
    "she schedules appointments for the senior analysts".split(),
    "she processes the paperwork submitted by customers".split(),
    "she assists with administrative tasks in the office".split(),
    "she greets customers and directs them to the right desk".split(),
]

model_biased = Word2Vec(
    sentences=BIASED_CORPUS,
    vector_size=50, window=3, min_count=1,
    workers=1, epochs=200, seed=42
)

print("✅ Trained: neutral model and biased model")
print("   Same banking corpus base, different role-gender associations")

In [None]:
# Compare: which model associates 'analyst' closer to 'he' vs 'she'?
def gender_proximity(word, m):
    """How close is a word to 'he' vs 'she'? NaN if a pronoun is not in the model."""
    if word not in m.wv:
        return None, None
    # A missing pronoun means the similarity is undefined, not zero
    nan = float("nan")
    sim_he  = m.wv.similarity(word, "he")  if "he"  in m.wv else nan
    sim_she = m.wv.similarity(word, "she") if "she" in m.wv else nan
    return sim_he, sim_she

job_words = ["analyst", "manager", "officer", "compliance"]

print("=== Gender Proximity: Neutral vs Biased Model ===")
print(f"{'Word':<15} {'Neutral sim(he)':<18} {'Neutral sim(she)':<20} "
      f"{'Biased sim(he)':<18} {'Biased sim(she)':<18} Bias direction")
print("-" * 110)

for word in job_words:
    n_he, n_she = gender_proximity(word, model_neutral)
    b_he, b_she = gender_proximity(word, model_biased)

    if n_he is None or b_he is None:
        continue

    direction = "→ male" if b_he > b_she else "→ female"
    print(f"{word:<15} {n_he:<18.3f} {n_she:<20.3f} "
          f"{b_he:<18.3f} {b_she:<18.3f} {direction}")

print()
print("📌 The biased model learned from biased sentences.")
print("   The neutral model never saw 'he' or 'she', so it has no gender association to report.")
print("   In the biased model, 'analyst' and 'manager' should sit closer to 'he'.")
print("   Models trained on historically biased text can encode the same kind of bias.")

In [None]:
# Occupation bias: what roles cluster near each gender?
print("=== In the BIASED model: what is most similar to 'he' vs 'she'? ===")
print()

if "he" in model_biased.wv:
    print("Words most similar to 'he':")
    for w, s in model_biased.wv.most_similar("he", topn=8):
        print(f"  {w:<20} {s:.3f}")

print()
if "she" in model_biased.wv:
    print("Words most similar to 'she':")
    for w, s in model_biased.wv.most_similar("she", topn=8):
        print(f"  {w:<20} {s:.3f}")

print()
print("📌 Banking AI systems using biased embeddings could:")
print("   → Score female loan applicants differently than equally qualified males")
print("   → Rank CVs with female names lower for analyst roles")
print("   → Generate recommendations that reinforce existing gaps")
print("   This is why training data should be audited before such models are deployed.")

---
## Part 7: Visualize - 2D Map of Your Embedding Space

Reduce 50 dimensions to 2 using PCA (pure math, no extra libraries).  
Words that cluster together have similar meanings in your model.

In [None]:
# PCA from scratch using only the standard library (no sklearn needed)

def pca_2d(matrix):
    """
    Reduce NxD matrix to Nx2 using PCA.
    Pure Python: no numpy, no sklearn.
    """
    n, d = len(matrix), len(matrix[0])

    # Center the data
    means = [sum(matrix[i][j] for i in range(n)) / n for j in range(d)]
    centered = [[matrix[i][j] - means[j] for j in range(d)] for i in range(n)]

    # Power iteration to find top 2 principal components
    def dot(a, b):
        return sum(x*y for x, y in zip(a, b))

    def mat_vec(M, v):
        return [dot(row, v) for row in M]

    def normalize(v):
        norm = math.sqrt(sum(x*x for x in v))   # named 'norm' so the outer n (row count) is not shadowed
        return [x/norm for x in v] if norm > 0 else v

    def subtract_projection(v, u):
        proj = dot(v, u)
        return [v[i] - proj * u[i] for i in range(len(v))]

    # Covariance matrix C = X^T X / n
    C = [[sum(centered[k][i]*centered[k][j] for k in range(n))/n
          for j in range(d)] for i in range(d)]

    random.seed(42)
    pcs = []
    for _ in range(2):
        v = normalize([random.gauss(0,1) for _ in range(d)])
        for _ in range(100):          # Power iterations
            v = normalize(mat_vec(C, v))
            for pc in pcs:            # Deflate previous components
                v = normalize(subtract_projection(v, pc))
        pcs.append(v)

    # Project data onto top 2 PCs
    coords = [[dot(row, pcs[0]), dot(row, pcs[1])] for row in centered]
    return coords

# Select representative words from each cluster
PLOT_WORDS = {
    "Compliance": ["aml", "kyc", "bsa", "compliance", "suspicious"],
    "Fraud":      ["fraud", "anomalous", "unauthorized", "chargeback"],
    "Capital":    ["capital", "risk", "credit", "basel", "liquidity"],
    "Retail":     ["mortgage", "savings", "overdraft", "retail", "loan"],
    "Payments":   ["wire", "swift", "payment", "transfer", "funds"],
}

COLORS = {
    "Compliance": "R",
    "Fraud":      "F",
    "Capital":    "C",
    "Retail":     "T",
    "Payments":   "P",
}

# Collect vectors
words_to_plot, labels, markers = [], [], []
for cluster, words in PLOT_WORDS.items():
    for w in words:
        if w in model.wv:
            words_to_plot.append(w)
            labels.append(cluster)
            markers.append(COLORS[cluster])

matrix = [list(map(float, model.wv[w])) for w in words_to_plot]
coords = pca_2d(matrix)

print(f"✅ PCA computed for {len(words_to_plot)} words (pure Python, no sklearn)")
print(f"   {len(matrix[0])} dimensions → 2 dimensions")
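
The power iteration inside `pca_2d` is easier to trust after checking it on a matrix whose answer is known. In the sketch below, the diagonal matrix `[[3, 0], [0, 1]]` has dominant eigenvalue 3 with eigenvector `(1, 0)` up to sign. `power_iteration` is an illustrative helper and not part of the notebook.

```python
import math
import random

def power_iteration(M, iterations=100, seed=42):
    """Approximate the dominant eigenvector and eigenvalue of a symmetric matrix M."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(len(M))]
    for _ in range(iterations):
        v = [sum(m * x for m, x in zip(row, v)) for row in M]   # v <- M v
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]                               # rescale to unit length
    Mv = [sum(m * x for m, x in zip(row, v)) for row in M]
    eigenvalue = sum(a * b for a, b in zip(v, Mv))              # Rayleigh quotient v.Mv
    return v, eigenvalue

M = [[3.0, 0.0],
     [0.0, 1.0]]
vec, val = power_iteration(M)
print("eigenvector:", [round(x, 4) for x in vec])
print("eigenvalue: ", round(val, 4))
```

Each multiplication stretches the component along the dominant direction three times more than the other one, so after enough iterations only that direction survives. The sign of the result depends on the random start, which is why a PCA plot can come out mirrored without being wrong.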

In [None]:
# ASCII scatter plot: works in any environment, no matplotlib needed
print("=== Banking Embedding Space (ASCII 2D Map) ===")
print("R=Compliance  F=Fraud  C=Capital  T=Retail  P=Payments")
print()

# Normalize to grid
xs = [c[0] for c in coords]
ys = [c[1] for c in coords]
x_min, x_max = min(xs), max(xs)
y_min, y_max = min(ys), max(ys)

W, H = 70, 28  # grid width x height
grid = [[" "] * W for _ in range(H)]

def to_grid(x, y):
    col = int((x - x_min) / (x_max - x_min + 1e-9) * (W - 1))
    row = int((1 - (y - y_min) / (y_max - y_min + 1e-9)) * (H - 1))
    return max(0, min(W-1, col)), max(0, min(H-1, row))

word_positions = {}
for word, (x, y), marker in zip(words_to_plot, coords, markers):
    col, row = to_grid(x, y)
    grid[row][col] = marker
    word_positions[word] = (col, row)

# Print grid with border
print("┌" + "─" * W + "┐")
for row in grid:
    print("│" + "".join(row) + "│")
print("└" + "─" * W + "┘")

print()
print("Word positions (col, row):")
for cluster, words in PLOT_WORDS.items():
    placed = [f"{w} {word_positions[w]}" for w in words if w in word_positions]
    print(f"  {COLORS[cluster]} ({cluster}): {', '.join(placed)}")

print()
print("📌 Words from the same cluster should appear nearby.")
print("   Clusters that share sentences will be closer together.")
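
One caveat of the character grid: two words that map to the same cell overwrite each other, so a marker can silently disappear. The sketch below groups words by cell to flag such collisions. `find_collisions` and `sample_positions` are hypothetical; the dictionary only mimics the `(col, row)` shape of `word_positions`.

```python
from collections import defaultdict

def find_collisions(positions):
    """Return {cell: [words]} for every grid cell shared by two or more words."""
    by_cell = defaultdict(list)
    for word, cell in positions.items():
        by_cell[cell].append(word)
    return {cell: words for cell, words in by_cell.items() if len(words) > 1}

# Hypothetical positions in the same (col, row) format as word_positions
sample_positions = {
    "aml":   (10, 4),
    "kyc":   (10, 4),    # same cell as 'aml': only one marker would be visible
    "wire":  (55, 20),
    "swift": (56, 20),
}

for cell, words in find_collisions(sample_positions).items():
    print(f"cell {cell}: {', '.join(words)}")
```

Calling `find_collisions(word_positions)` after drawing the grid would list any words hidden behind another marker; increasing `W` and `H` reduces the chance of overlap.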

In [None]:
# If matplotlib is available: proper scatter plot
try:
    import matplotlib.pyplot as plt
    import matplotlib
    matplotlib.rcParams['figure.figsize'] = (12, 8)

    cluster_colors = {
        "Compliance": "#e74c3c",
        "Fraud":      "#8e44ad",
        "Capital":    "#2980b9",
        "Retail":     "#27ae60",
        "Payments":   "#f39c12",
    }

    fig, ax = plt.subplots()

    for word, (x, y), label in zip(words_to_plot, coords, labels):
        color = cluster_colors[label]
        ax.scatter(x, y, color=color, s=120, zorder=2)
        ax.annotate(word, (x, y), textcoords="offset points",
                    xytext=(6, 4), fontsize=9, color=color)

    from matplotlib.patches import Patch
    legend = [Patch(color=c, label=l) for l, c in cluster_colors.items()]
    ax.legend(handles=legend, loc="best", fontsize=10)

    ax.set_title(
        "Banking Word Embeddings - 2D PCA\n"
        "(trained on your corpus, 50 dims → 2 dims)",
        fontsize=13
    )
    ax.set_xlabel("Principal Component 1")
    ax.set_ylabel("Principal Component 2")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

except ImportError:
    print("matplotlib not available - ASCII plot above is the visualization.")
    print("Install with: pip install matplotlib")

---
## Part 8: Retrain - Change the Corpus, Watch the Space Shift

This is the core lesson: **embeddings are only as good as their training data.**  
Add new sentences → new clusters form. Remove sentences → clusters dissolve.

In [None]:
# Experiment 1: Add crypto/DeFi sentences → new cluster forms
CRYPTO_SENTENCES = [
    "cryptocurrency transactions require enhanced aml monitoring".split(),
    "defi protocols introduce new money laundering risks".split(),
    "blockchain analytics help trace suspicious crypto flows".split(),
    "virtual asset service providers must comply with fatf rules".split(),
    "crypto exchanges are required to implement kyc procedures".split(),
    "stablecoin transactions are subject to bsa reporting requirements".split(),
]

corpus_with_crypto = BANKING_CORPUS + CRYPTO_SENTENCES

model_crypto = Word2Vec(
    sentences=corpus_with_crypto,
    vector_size=50, window=3, min_count=1,
    workers=1, epochs=200, seed=42
)

print("=== After adding crypto sentences ===")
print()

print("'crypto' most similar to:")
if "crypto" in model_crypto.wv:
    for w, s in model_crypto.wv.most_similar("crypto", topn=6):
        print(f"  {w:<20} {s:.3f}")

print()
print("'aml' most similar to (original model):")
for w, s in model.wv.most_similar("aml", topn=5):
    print(f"  {w:<20} {s:.3f}")

print()
print("'aml' most similar to (crypto model):")
for w, s in model_crypto.wv.most_similar("aml", topn=5):
    print(f"  {w:<20} {s:.3f}")

print()
print("📌 Adding crypto sentences should pull 'aml' toward crypto terms; compare the two lists above.")
print("   The model now sees AML in the context of both traditional")
print("   banking AND virtual assets, reflecting the evolving regulatory reality.")
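
Comparing two neighbour lists by eye gets tedious, so the drift can be summarised as a Jaccard overlap between the two top-k sets. The sketch below is stdlib-only. `neighbour_overlap` is a hypothetical helper, and the `before` / `after` lists are invented examples, not actual model output.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap of two neighbour lists: 1.0 = identical sets, 0.0 = disjoint."""
    a, b = set(neighbours_a), set(neighbours_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical neighbour lists, not actual model output
before = ["kyc", "compliance", "suspicious", "alerts", "review"]
after  = ["kyc", "compliance", "cryptocurrency", "monitoring", "review"]

score = neighbour_overlap(before, after)
print(f"Neighbour overlap: {score:.2f}")
```

With the notebook's models, the inputs could be built as `[w for w, _ in model.wv.most_similar("aml", topn=5)]` and the same expression on `model_crypto`. Comparing neighbour sets is safer than comparing raw vectors, because separately trained models do not share a coordinate system.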

In [None]:
# Experiment 2: Domain shift - what if you only train on capital markets?
CAPITAL_MARKETS_CORPUS = [
    "equity trading desk executes large block orders efficiently".split(),
    "fixed income portfolio duration risk managed carefully".split(),
    "derivatives desk hedges interest rate exposure using swaps".split(),
    # Tokens are case-sensitive, so lowercase 'VaR' to match the rest of the corpus
    "market risk VaR models measure daily trading book losses".lower().split(),
    "credit spread widening signals increased default risk".split(),
    "repo market provides short term funding for securities dealers".split(),
    "prime brokerage services support hedge fund leverage".split(),
    "equity research analyst publishes buy rating on bank stock".split(),
    "bond yield curve inversion signals recession risk ahead".split(),
    "capital markets compliance monitors trading for front running".split(),
]

model_cm = Word2Vec(
    sentences=CAPITAL_MARKETS_CORPUS,
    vector_size=50, window=3, min_count=1,
    workers=1, epochs=200, seed=42
)

print("=== Domain Shift: Capital Markets Only Model ===")
print()

# 'risk' means very different things in each model
word = "risk"
print(f"'{word}' in ORIGINAL banking model:")
if word in model.wv:
    for w, s in model.wv.most_similar(word, topn=5):
        print(f"  {w:<20} {s:.3f}")
else:
    print("  (not in vocabulary)")

print()
print(f"'{word}' in CAPITAL MARKETS model:")
if word in model_cm.wv:
    for w, s in model_cm.wv.most_similar(word, topn=5):
        print(f"  {w:<20} {s:.3f}")
else:
    print("  (not in vocabulary)")

print()
print("📌 Same word 'risk' → different neighbours in each model.")
print("   Banking model (expected): risk ≈ credit, liquidity, capital")
print("   Capital markets (expected): risk ≈ VaR, trading, hedging")
print("   This is why domain-specific embeddings often outperform general ones")
print("   for banking NLP tasks: the vocabulary of risk is different.")
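
Domain shift can also be measured before looking at any neighbours: what fraction of tokens in the new domain is simply unknown to the old model? The sketch below computes that out-of-vocabulary rate. `oov_rate`, `retail_vocab` and `markets_text` are hypothetical names with made-up data.

```python
def oov_rate(vocabulary, sentences):
    """Fraction of tokens in `sentences` that are missing from `vocabulary`."""
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocabulary)
    return missing / len(tokens)

# Hypothetical vocabulary and sentences for illustration
retail_vocab = {"credit", "risk", "loan", "mortgage", "capital", "liquidity"}
markets_text = [
    "credit spread widening signals default risk".split(),
    "derivatives desk hedges rate exposure".split(),
]

rate = oov_rate(retail_vocab, markets_text)
print(f"Out-of-vocabulary rate: {rate:.0%}")
```

In the notebook, `oov_rate(set(model.wv.key_to_index), CAPITAL_MARKETS_CORPUS)` would show how much of the capital markets text the original banking model cannot represent at all.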

In [None]:
# Experiment 3: Incremental training - update existing model with new sentences
print("=== Incremental Training - Update Model with New Regulations ===")
print()

# Check 'fatf' before update
print("Before update - 'fatf' in vocabulary:", "fatf" in model.wv)

# Add new sentences to existing model
NEW_REG_SENTENCES = [
    "fatf recommendations require customer due diligence globally".split(),
    "fatf grey list countries require enhanced due diligence".split(),
    "fatf mutual evaluation assesses country aml effectiveness".split(),
]

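# Update vocabulary and retrain.
# Note: this mutates `model` in place, so earlier cells that use `model`
# will give different results if re-run after this one.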
model.build_vocab(NEW_REG_SENTENCES, update=True)
model.train(NEW_REG_SENTENCES,
            total_examples=len(NEW_REG_SENTENCES),
            epochs=50)

print("After update - 'fatf' in vocabulary:", "fatf" in model.wv)
print()
if "fatf" in model.wv:
    print("'fatf' most similar to:")
    for w, s in model.wv.most_similar("fatf", topn=5):
        print(f"  {w:<20} {s:.3f}")

print()
print("📌 Incremental training is one way to update production models")
print("   without retraining from scratch on the full corpus.")
print("   New regulations → new sentences → embedding space expands.")
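
Incremental training does more than add new words: it also nudges the vectors of words that were already there. A simple check is to snapshot the vectors before the update and measure how far each one moved. The sketch below uses plain lists and invented numbers; `drift_report` and both snapshots are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def drift_report(before, after):
    """Return {word: 1 - cosine(old, new)} for words present in both snapshots."""
    return {w: 1.0 - cosine(before[w], after[w]) for w in before if w in after}

# Hypothetical snapshots: 'aml' unchanged, 'kyc' rotated, 'fatf' newly added
before = {"aml": [1.0, 0.0], "kyc": [0.0, 1.0]}
after  = {"aml": [1.0, 0.0], "kyc": [1.0, 1.0], "fatf": [0.5, 0.5]}

report = drift_report(before, after)
for word, d in report.items():
    print(f"{word:<6} drift = {d:.3f}")
```

In the notebook, a snapshot could be taken with `{w: model.wv[w].copy() for w in model.wv.key_to_index}` before calling `build_vocab(..., update=True)`. Because the update continues training the same model, old and new vectors live in the same space and can be compared directly.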

---
## Hands-On Exercise: Build Your Own Domain

Pick ONE of the four banking domains below and build a corpus of 10 sentences.  
Train a model, probe similarities, and present your findings to the group.

In [None]:
# Choose your domain:
# A) Wealth Management (portfolio, asset allocation, client advisory)
# B) Trade Finance (letters of credit, documentary collections, SWIFT)
# C) Regulatory Reporting (CCAR, DFAST, BCBS 239)
# D) Cybersecurity & Fraud Tech (threat intelligence, anomaly detection)

MY_DOMAIN = "Your Domain Here"

MY_CORPUS = [
    # TODO: Write 10 realistic sentences from your chosen domain
    # Each is a list of lowercase words
    "sentence one with relevant domain words".split(),
    "sentence two with more domain specific terms".split(),
    # ... add 8 more
]

# Train on your corpus
my_model = Word2Vec(
    sentences=MY_CORPUS,
    vector_size=30,   # Smaller, because there is less data
    window=2,
    min_count=1,
    workers=1,
    epochs=300,
    seed=42
)

vocab = list(my_model.wv.key_to_index.keys())
print(f"Domain: {MY_DOMAIN}")
print(f"Vocabulary: {sorted(vocab)}")
print()

# Probe: pick one word and show its neighbours
probe_word = vocab[0] if vocab else None
if probe_word and probe_word in my_model.wv and len(vocab) > 3:
    print(f"Most similar to '{probe_word}':")
    for w, s in my_model.wv.most_similar(probe_word, topn=5):
        print(f"  {w:<20} {s:.3f}")

print()
print("Questions to discuss with your team:")
print("  1. Did similar-meaning words cluster correctly?")
print("  2. Which word pairs surprised you?")
print("  3. What bias might exist in your sentences?")
print("  4. What would you need to improve the model?")
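
Before interpreting similarities from a 10-sentence corpus, it helps to see which words the model has barely seen. A word that occurs once has a single context to learn from, so its neighbours are mostly noise. The sketch below lists such words; `rare_tokens` and `sample` are hypothetical.

```python
from collections import Counter

def rare_tokens(corpus, min_occurrences=2):
    """Return the sorted tokens that occur fewer than `min_occurrences` times."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return sorted(tok for tok, n in counts.items() if n < min_occurrences)

# Hypothetical wealth-management sentences for illustration
sample = [
    "portfolio rebalancing reduces portfolio risk".split(),
    "client advisory reviews portfolio allocation".split(),
    "asset allocation drives client risk".split(),
]

print("Seen only once:", rare_tokens(sample))
```

Running `rare_tokens(MY_CORPUS)` is a quick way to decide which terms deserve a few more sentences before you draw conclusions from their neighbours.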

---
## Summary: What You Experienced

| Part | Concept | Key Takeaway |
|------|---------|-------------|
| 1. Corpus | Training data | The model knows ONLY what you feed it |
| 2. Training | Word2Vec mechanics | Co-occurrence in a window → similar vectors |
| 3. Word-level | Similarity & analogies | `aml` ≈ `kyc` because they share sentences |
| 4. Sentence-level | Average vectors | Works, but loses word order completely |
| 5. Context | Static limitation | `bank` has ONE vector regardless of meaning |
| 6. Bias | Data reflects reality | Biased sentences → biased embeddings |
| 7. Visualization | 2D PCA | Clusters emerge from training, not hand-coded |
| 8. Retrain | Data shifts space | Add crypto sentences → `aml` drifts toward crypto |

### The Natural Next Step

**The ceiling you hit today:**
- Word2Vec: one word = one vector (no context)
- Averaging: word order lost
- Small corpus: poor representation of rare terms
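
The "word order lost" point can be shown in a few lines: averaging is a sum divided by a count, so any reordering of the same words gives the same sentence vector. The sketch below uses invented 2-dimensional vectors; `average_vector` and `toy` are hypothetical.

```python
def average_vector(tokens, vectors):
    """Mean of the vectors of all tokens that have one (zeros if none do)."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical toy vectors for illustration only
toy = {"bank": [1.0, 0.0], "sues": [0.0, 1.0], "customer": [1.0, 1.0]}

a = average_vector("bank sues customer".split(), toy)
b = average_vector("customer sues bank".split(), toy)
print("bank sues customer ->", a)
print("customer sues bank ->", b)
print("identical:", a == b)
```

The two sentences mean opposite things, yet their averaged vectors are the same.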

**Session 1.5A (Notebook A) solves this:**  
`sentence-transformers` reads the entire sentence at once → different vector for `bank`  
in "the *bank* approved the mortgage" vs "the river *bank* flooded".

**Session 3 (RAG Notebook) builds on both:**  
Those embeddings power search over 10,000 banking documents.