# Day 12: Word Embeddings and Vector Representations

## Phase 2: NLP Basics (Days 11-20)

**Estimated Time: 3-4 hours**

### Learning Objectives
- Understand the distributional hypothesis and dense vector representations
- Implement Word2Vec (Skip-gram and CBOW) from scratch
- Learn GloVe: Global Vectors for Word Representation
- Explore embedding arithmetic and semantic relationships
- Visualize embeddings in 2D/3D space using dimensionality reduction
- Use pretrained embeddings for downstream tasks

### Prerequisites
- Day 11: Introduction to NLP (tokenization, vocabulary)
- Linear algebra (matrix factorization, SVD)
- Neural network fundamentals (gradient descent, backpropagation)
- Probability theory (softmax, cross-entropy)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict, Counter
import re
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
# This style name requires Matplotlib >= 3.6 (older versions call it 'seaborn-darkgrid')
plt.style.use('seaborn-v0_8-darkgrid')

print("Libraries loaded successfully!")

## 1. The Distributional Hypothesis

### 1.1 From Sparse to Dense Representations

**"You shall know a word by the company it keeps"** - J.R. Firth (1957)

The distributional hypothesis states that words occurring in similar contexts tend to have similar meanings. This insight forms the foundation of modern word embeddings.

#### Limitations of Sparse Representations

Recall from Day 11 our sparse representations (BoW, TF-IDF):
- High dimensionality (vocabulary size V ~ 10,000 - 1,000,000)
- Sparse vectors (mostly zeros)
- No semantic similarity captured
- "King" and "Queen" are as different as "King" and "Banana"

#### Dense Embeddings

Word embeddings map words to dense, low-dimensional vectors:
$$\text{embed}: \mathcal{V} \rightarrow \mathbb{R}^d$$

where $d \ll |\mathcal{V}|$ (typically $d \in [50, 300]$)

**Key Properties:**
1. **Semantic similarity**: Similar words have similar vectors
2. **Compositionality**: Vector arithmetic captures relationships
3. **Generalization**: Learn from co-occurrence statistics
4. **Transfer**: Pretrained embeddings work across tasks

In [None]:
# Example: Sparse vs Dense representations
vocab = ['king', 'queen', 'man', 'woman', 'royal', 'crown', 'apple', 'banana']
vocab_size = len(vocab)

# Sparse one-hot encoding
print("Sparse One-Hot Encoding:")
print(f"Vocabulary size: {vocab_size}")
for i, word in enumerate(vocab[:4]):
    one_hot = np.zeros(vocab_size)
    one_hot[i] = 1
    print(f"{word:10s}: {one_hot}")

print("\nCosine similarity (one-hot):")
king_oh = np.array([1, 0, 0, 0, 0, 0, 0, 0])
queen_oh = np.array([0, 1, 0, 0, 0, 0, 0, 0])
apple_oh = np.array([0, 0, 0, 0, 0, 0, 1, 0])
print(f"sim(king, queen) = {1 - cosine(king_oh, queen_oh):.4f}")
print(f"sim(king, apple) = {1 - cosine(king_oh, apple_oh):.4f}")
print("All pairs are orthogonal - no semantic information!")

# Dense embeddings (hypothetical)
print("\nDense Embeddings (hypothetical 4D):")
embeddings = {
    'king':   np.array([0.8, 0.9, 0.7, -0.5]),
    'queen':  np.array([0.8, 0.9, 0.7, 0.5]),
    'man':    np.array([-0.3, 0.2, 0.6, -0.6]),
    'woman':  np.array([-0.3, 0.2, 0.6, 0.6]),
    'apple':  np.array([-0.8, -0.7, -0.5, 0.1])
}

for word, vec in list(embeddings.items())[:3]:
    print(f"{word:10s}: {vec}")

print("\nCosine similarity (dense):")
print(f"sim(king, queen) = {1 - cosine(embeddings['king'], embeddings['queen']):.4f}")
print(f"sim(king, man)   = {1 - cosine(embeddings['king'], embeddings['man']):.4f}")
print(f"sim(king, apple) = {1 - cosine(embeddings['king'], embeddings['apple']):.4f}")
print("Semantic similarity captured!")

## 2. Co-occurrence Matrices and SVD

### 2.1 Count-Based Embeddings

Before neural approaches, embeddings were derived from co-occurrence statistics.

**Word-Word Co-occurrence Matrix $X$:**
$$X_{ij} = \text{count of word } j \text{ appearing in context of word } i$$

**Singular Value Decomposition:**
$$X = U \Sigma V^T$$

Keeping the top-$d$ left singular vectors, scaled by the square roots of their singular values, gives the word embeddings:
$$E = U_d \sqrt{\Sigma_d}$$
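
As a rough illustration of the factorization above, the sketch below applies a full SVD to a tiny co-occurrence matrix and keeps the top $d = 2$ components. The matrix `X_toy` and its counts are made up for this example and are not derived from the notebook's corpus.

```python
import numpy as np

# Toy 3x3 symmetric co-occurrence counts (made-up numbers for illustration).
X_toy = np.array([
    [0.0, 4.0, 1.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

# Full SVD: X = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(X_toy)

# Keep the top-d components and scale by the square roots of the singular values.
d = 2
E_toy = U[:, :d] * np.sqrt(S[:d])   # shape (3, 2): one row per word

# Rank-d reconstruction; its Frobenius error equals the dropped singular value.
X_approx = U[:, :d] @ np.diag(S[:d]) @ Vt[:d, :]
recon_error = np.linalg.norm(X_toy - X_approx)

print("Embeddings:\n", E_toy)
print("Rank-2 reconstruction error:", recon_error)
```

With only three words, the full SVD is cheap. For a realistic vocabulary, a truncated solver is the usual choice.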

In [None]:
# Build co-occurrence matrix from corpus
corpus = [
    "the king wore his crown",
    "the queen wore her crown",
    "the king and queen ruled the kingdom",
    "a man and woman walked together",
    "the man saw the king",
    "the woman saw the queen",
    "the crown was made of gold",
    "the royal family lived in castle",
    "king and queen are royal",
    "man and woman are human"
]

def build_vocab(corpus):
    """Build vocabulary from corpus."""
    words = []
    for sent in corpus:
        words.extend(sent.lower().split())
    word_counts = Counter(words)
    # Frequency filter: count >= 1 keeps every word; raise the threshold to drop rare words
    vocab = [word for word, count in word_counts.most_common() if count >= 1]
    word2idx = {word: i for i, word in enumerate(vocab)}
    idx2word = {i: word for word, i in word2idx.items()}
    return vocab, word2idx, idx2word

def build_cooccurrence_matrix(corpus, word2idx, window_size=2):
    """Build word-word co-occurrence matrix."""
    vocab_size = len(word2idx)
    cooccur = np.zeros((vocab_size, vocab_size))
    
    for sent in corpus:
        words = sent.lower().split()
        for i, word in enumerate(words):
            if word not in word2idx:
                continue
            word_idx = word2idx[word]
            # Context window
            start = max(0, i - window_size)
            end = min(len(words), i + window_size + 1)
            for j in range(start, end):
                if i != j and words[j] in word2idx:
                    context_idx = word2idx[words[j]]
                    cooccur[word_idx, context_idx] += 1
    
    return cooccur

vocab, word2idx, idx2word = build_vocab(corpus)
print(f"Vocabulary size: {len(vocab)}")
print(f"Vocabulary: {vocab[:15]}...")

# Build co-occurrence matrix
cooccur_matrix = build_cooccurrence_matrix(corpus, word2idx, window_size=2)

# Visualize subset
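# Keep only words that are in the vocabulary so that tick labels and matrix rows stay aligned
words_of_interest = [w for w in ['king', 'queen', 'man', 'woman', 'crown', 'royal'] if w in word2idx]
indices = [word2idx[w] for w in words_of_interest]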
subset_matrix = cooccur_matrix[np.ix_(indices, indices)]

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(subset_matrix, cmap='YlOrRd')
ax.set_xticks(range(len(words_of_interest)))
ax.set_yticks(range(len(words_of_interest)))
ax.set_xticklabels(words_of_interest, rotation=45)
ax.set_yticklabels(words_of_interest)
plt.colorbar(im)
ax.set_title('Word-Word Co-occurrence Matrix')

# Annotate
for i in range(len(words_of_interest)):
    for j in range(len(words_of_interest)):
        ax.text(j, i, f'{subset_matrix[i, j]:.0f}', ha='center', va='center')

plt.tight_layout()
plt.show()

In [None]:
# SVD-based embeddings
from scipy.sparse.linalg import svds

def svd_embeddings(cooccur_matrix, embedding_dim=50):
    """
    Compute embeddings using Truncated SVD.
    
    Apply PPMI (Positive Pointwise Mutual Information) first for better results.
    """
    # PPMI transformation
    total = cooccur_matrix.sum()
    row_sums = cooccur_matrix.sum(axis=1, keepdims=True)
    col_sums = cooccur_matrix.sum(axis=0, keepdims=True)
    
    # PMI = log(P(w,c) / (P(w) * P(c)))
    # Avoid division by zero
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2((cooccur_matrix * total) / (row_sums * col_sums))
        pmi = np.nan_to_num(pmi, neginf=0)
    
    # Positive PMI
    ppmi = np.maximum(pmi, 0)
    
    # SVD
    # svds requires k < min(matrix shape), so clamp embedding_dim for small vocabularies
    if ppmi.shape[0] <= embedding_dim:
        embedding_dim = min(embedding_dim, ppmi.shape[0] - 1)
    
    U, S, Vt = svds(ppmi.astype(float), k=embedding_dim)
    
    # Sort by singular values
    idx = np.argsort(-S)
    U = U[:, idx]
    S = S[idx]
    
    # Embeddings: U * sqrt(S)
    embeddings = U * np.sqrt(S)
    
    return embeddings

# Compute SVD embeddings
embedding_dim = min(10, len(vocab) - 1)
svd_emb = svd_embeddings(cooccur_matrix, embedding_dim=embedding_dim)

print(f"SVD Embeddings shape: {svd_emb.shape}")

# Check similarities
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-10)

print("\nWord similarities (SVD embeddings):")
if 'king' in word2idx and 'queen' in word2idx:
    sim = cosine_similarity(svd_emb[word2idx['king']], svd_emb[word2idx['queen']])
    print(f"sim(king, queen) = {sim:.4f}")

if 'man' in word2idx and 'woman' in word2idx:
    sim = cosine_similarity(svd_emb[word2idx['man']], svd_emb[word2idx['woman']])
    print(f"sim(man, woman) = {sim:.4f}")

if 'king' in word2idx and 'crown' in word2idx:
    sim = cosine_similarity(svd_emb[word2idx['king']], svd_emb[word2idx['crown']])
    print(f"sim(king, crown) = {sim:.4f}")

## 3. Word2Vec: Neural Word Embeddings

### 3.1 Skip-gram Model

Word2Vec (Mikolov et al., 2013) learns embeddings with a shallow predictive model. Its Skip-gram variant predicts context words from the center word.

**Skip-gram Objective:**
Given center word $w_c$, predict context words $w_o$:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j} | w_t)$$

where:
$$P(w_o | w_c) = \frac{\exp(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c})}{\sum_{w \in V} \exp(\mathbf{u}_w^T \mathbf{v}_{w_c})}$$

- $\mathbf{v}_w$ : center word embedding (input)
- $\mathbf{u}_w$ : context word embedding (output)
- $m$ : context window size
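
To make the softmax definition concrete, here is a minimal sketch that evaluates $P(w_o \mid w_c)$ for a toy vocabulary. The matrices `V_center` and `U_context` are random placeholders (an assumption for illustration), with rows playing the roles of $\mathbf{v}_w$ and $\mathbf{u}_w$.

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 5, 3                            # toy vocabulary size and embedding dim
V_center = rng.normal(size=(V_size, d))     # rows are v_w (center vectors)
U_context = rng.normal(size=(V_size, d))    # rows are u_w (context vectors)

def skipgram_prob(center_idx):
    """Return P(. | w_c) as a length-V vector using a stable softmax."""
    scores = U_context @ V_center[center_idx]   # u_w^T v_c for every word w
    scores = scores - scores.max()              # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = skipgram_prob(center_idx=2)
loss = -np.log(probs[4])   # loss contribution of one (center=2, context=4) pair

print("P(. | w_c):", probs)
print("Pair loss:", loss)
```

Every evaluation touches all $V$ rows of `U_context`, which is the cost that negative sampling avoids.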

### 3.2 Architecture

```
Input (one-hot)  →  Hidden (embedding)  →  Output (softmax)
    [V × 1]              [d × 1]              [V × 1]
        \                  /                    /
         W_in [d × V]    W_out [V × d]
```

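The hidden-layer weights $W_{in}$ are the word embeddings: column $i$ of $W_{in}$ is the vector for word $i$. The code in this notebook stores the transpose (shape $V \times d$), so each row is a word vector.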
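
The claim that the hidden layer is just an embedding lookup can be checked directly. The sketch below uses a random matrix `W_in_dxV` (a placeholder name, shaped $d \times V$ as in the diagram) and shows that multiplying by a one-hot vector selects a single column.

```python
import numpy as np

V_size, d = 6, 3
rng = np.random.default_rng(1)
W_in_dxV = rng.normal(size=(d, V_size))   # [d x V], as drawn in the diagram

word_idx = 4
one_hot = np.zeros(V_size)
one_hot[word_idx] = 1.0

hidden = W_in_dxV @ one_hot        # hidden-layer activation, shape [d]
column = W_in_dxV[:, word_idx]     # direct column lookup, no matrix multiply

print(np.allclose(hidden, column))
```

This is why implementations index into the embedding matrix rather than building one-hot vectors.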

In [None]:
class Word2VecSkipGram:
    """
    Skip-gram Word2Vec implementation from scratch.
    
    Uses the full softmax over the vocabulary (O(V) per step).
    The negative-sampling variant is implemented in Section 3.3.
    """
    
    def __init__(self, vocab_size, embedding_dim=100, learning_rate=0.025):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.lr = learning_rate
        
        # Initialize embeddings
        # Center word embeddings
        self.W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
        # Context word embeddings
        self.W_out = np.random.randn(vocab_size, embedding_dim) * 0.01
    
    def softmax(self, x):
        """Numerically stable softmax."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / np.sum(exp_x)
    
    def forward(self, center_idx):
        """
        Forward pass: compute probability distribution over vocabulary.
        
        P(w_o | w_c) = softmax(W_out @ v_c)
        """
        # Get center word embedding
        v_c = self.W_in[center_idx]  # [embedding_dim]
        
        # Compute scores for all words
        scores = self.W_out @ v_c  # [vocab_size]
        
        # Softmax probabilities
        probs = self.softmax(scores)
        
        return probs, v_c
    
    def train_step(self, center_idx, context_idx):
        """
        Single training step with vanilla softmax.
        
        Note: This is O(V) per step - expensive!
        """
        # Forward pass
        probs, v_c = self.forward(center_idx)
        
        # Compute loss: -log P(w_o | w_c)
        loss = -np.log(probs[context_idx] + 1e-10)
        
        # Backward pass
        # Gradient w.r.t. scores: softmax - one_hot
        grad_scores = probs.copy()
        grad_scores[context_idx] -= 1  # [vocab_size]
        
        # Gradient w.r.t. W_out: outer product
        grad_W_out = np.outer(grad_scores, v_c)  # [vocab_size, embedding_dim]
        
        # Gradient w.r.t. v_c (center embedding)
        grad_v_c = self.W_out.T @ grad_scores  # [embedding_dim]
        
        # Update parameters
        self.W_out -= self.lr * grad_W_out
        self.W_in[center_idx] -= self.lr * grad_v_c
        
        return loss
    
    def get_embedding(self, word_idx):
        """Get embedding for a word."""
        return self.W_in[word_idx]
    
    def most_similar(self, word_idx, word2idx, idx2word, top_k=5):
        """Find most similar words."""
        query_vec = self.W_in[word_idx]
        
        similarities = []
        for i in range(self.vocab_size):
            if i != word_idx:
                sim = cosine_similarity(query_vec, self.W_in[i])
                similarities.append((i, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [(idx2word[idx], sim) for idx, sim in similarities[:top_k]]

# Demo
print("Skip-gram Word2Vec Architecture:")
print(f"Vocabulary size: {len(vocab)}")
print(f"Embedding dimension: 50")

model = Word2VecSkipGram(len(vocab), embedding_dim=50)
print(f"\nW_in (center embeddings) shape: {model.W_in.shape}")
print(f"W_out (context embeddings) shape: {model.W_out.shape}")

### 3.3 Negative Sampling

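The softmax over the entire vocabulary is expensive: $O(V)$ per training sample.

**Negative Sampling Objective:**
Instead of a softmax, train a binary classifier to tell observed (center, context) pairs from randomly sampled noise pairs. The objective below is maximized (the code minimizes its negative):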

$$J_{\text{neg}} = \log \sigma(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} [\log \sigma(-\mathbf{u}_{w_i}^T \mathbf{v}_{w_c})]$$

where:
- $\sigma(x) = \frac{1}{1 + e^{-x}}$ (sigmoid)
- $k$ : number of negative samples (typically 5-20)
- $P_n(w) \propto f(w)^{3/4}$ : noise distribution (smoothed unigram)

**Intuition:** Push positive pairs together, push negative pairs apart.
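
The following sketch shows how the $3/4$ power flattens the unigram distribution, and evaluates $J_{\text{neg}}$ for one positive pair with the expectation replaced by $k$ sampled negatives. The word counts and the random vectors are hypothetical values chosen for illustration only.

```python
import numpy as np

# Hypothetical word counts for a 4-word vocabulary.
counts = np.array([100.0, 10.0, 5.0, 1.0])

unigram = counts / counts.sum()
noise = counts ** 0.75
noise = noise / noise.sum()          # P_n(w) proportional to f(w)^(3/4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_pos, u_negs):
    """J_neg for one positive pair and k sampled negatives (to be maximized)."""
    positive_term = np.log(sigmoid(u_pos @ v_c))
    negative_term = np.sum(np.log(sigmoid(-(u_negs @ v_c))))
    return positive_term + negative_term

rng = np.random.default_rng(0)
d, k = 8, 5
v_c = rng.normal(scale=0.1, size=d)
u_all = rng.normal(scale=0.1, size=(len(counts), d))

# Draw k negatives from the noise distribution, excluding the positive word.
pos_id = 1
p_neg = noise.copy()
p_neg[pos_id] = 0.0
p_neg /= p_neg.sum()
neg_ids = rng.choice(len(counts), size=k, p=p_neg)

J = neg_sampling_objective(v_c, u_all[pos_id], u_all[neg_ids])

print("unigram:", unigram)
print("noise:  ", noise)
print("J_neg:  ", J)
```

Compared with the raw unigram distribution, the smoothed one gives the most frequent word less probability and the rarest word more, so rare words appear as negatives more often than their frequency alone would suggest.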

In [None]:
class Word2VecNegSampling:
    """
    Word2Vec with Negative Sampling.
    
    Much more efficient than softmax: O(k) instead of O(V).
    """
    
    def __init__(self, vocab_size, embedding_dim=100, learning_rate=0.025, neg_samples=5):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.lr = learning_rate
        self.neg_samples = neg_samples
        
        # Initialize embeddings
        self.W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.W_out = np.random.randn(vocab_size, embedding_dim) * 0.01
        
        # Noise distribution (will be set from corpus)
        self.noise_dist = np.ones(vocab_size) / vocab_size
    
    def set_noise_distribution(self, word_counts):
        """
        Set noise distribution based on word frequencies.
        
        P_n(w) ∝ count(w)^0.75
        """
        counts = np.array(word_counts) ** 0.75
        self.noise_dist = counts / counts.sum()
    
    def sigmoid(self, x):
        """Numerically stable sigmoid."""
        # np.where evaluates both branches, so exponentiate -|x| to avoid overflow
        z = np.exp(-np.abs(x))
        return np.where(x >= 0, 1 / (1 + z), z / (1 + z))
    
    def sample_negatives(self, positive_idx):
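        """Sample negative words from the noise distribution (rejecting the positive)."""
        # Note: np.random.choice with p= costs O(V) per draw; production implementations
        # precompute a sampling table so that each draw is O(1).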
        negatives = []
        while len(negatives) < self.neg_samples:
            neg_idx = np.random.choice(self.vocab_size, p=self.noise_dist)
            if neg_idx != positive_idx:
                negatives.append(neg_idx)
        return negatives
    
    def train_step(self, center_idx, context_idx):
        """
        Training step with negative sampling.
        
        Maximize: log σ(u_o · v_c) + Σ log σ(-u_neg · v_c)
        """
        # Get center word embedding
        v_c = self.W_in[center_idx]
        
        # Positive sample
        u_pos = self.W_out[context_idx]
        pos_score = np.dot(u_pos, v_c)
        pos_prob = self.sigmoid(pos_score)
        
        # Loss for positive: -log σ(score)
        loss = -np.log(pos_prob + 1e-10)
        
        # Gradient for positive sample
        # d/dx [-log σ(x)] = σ(x) - 1
        grad_v_c = (pos_prob - 1) * u_pos
        grad_u_pos = (pos_prob - 1) * v_c
        
        # Negative samples
        negative_indices = self.sample_negatives(context_idx)
        
        for neg_idx in negative_indices:
            u_neg = self.W_out[neg_idx]
            neg_score = np.dot(u_neg, v_c)
            neg_prob = self.sigmoid(neg_score)
            
            # Loss for negative: -log σ(-score) = -log(1 - σ(score))
            loss += -np.log(1 - neg_prob + 1e-10)
            
            # Gradient for negative sample
            # d/dx [-log(1-σ(x))] = σ(x)
            grad_v_c += neg_prob * u_neg
            grad_u_neg = neg_prob * v_c
            
            # Update negative context embedding
            self.W_out[neg_idx] -= self.lr * grad_u_neg
        
        # Update embeddings
        self.W_in[center_idx] -= self.lr * grad_v_c
        self.W_out[context_idx] -= self.lr * grad_u_pos
        
        return loss
    
    def get_embedding(self, word_idx):
        return self.W_in[word_idx]
    
    def most_similar(self, word_idx, idx2word, top_k=5):
        """Find most similar words using cosine similarity."""
        query_vec = self.W_in[word_idx]
        
        # Compute all similarities at once
        norms = np.linalg.norm(self.W_in, axis=1)
        query_norm = np.linalg.norm(query_vec)
        
        similarities = (self.W_in @ query_vec) / (norms * query_norm + 1e-10)
        
        # Get top-k (excluding self)
        top_indices = np.argsort(-similarities)
        results = []
        for idx in top_indices:
            if idx != word_idx and len(results) < top_k:
                results.append((idx2word[idx], similarities[idx]))
        
        return results

print("Word2Vec with Negative Sampling initialized")
print(f"Number of negative samples: 5")
print(f"Complexity per step: O(k) = O(5) instead of O(V) = O({len(vocab)})")

In [None]:
# Train Word2Vec on our small corpus
def generate_training_data(corpus, word2idx, window_size=2):
    """
    Generate (center, context) pairs for skip-gram.
    """
    pairs = []
    
    for sent in corpus:
        words = sent.lower().split()
        word_indices = [word2idx.get(w, -1) for w in words]
        
        for i, center_idx in enumerate(word_indices):
            if center_idx == -1:
                continue
            
            # Context window
            start = max(0, i - window_size)
            end = min(len(word_indices), i + window_size + 1)
            
            for j in range(start, end):
                if i != j and word_indices[j] != -1:
                    pairs.append((center_idx, word_indices[j]))
    
    return pairs

# Generate training pairs
training_pairs = generate_training_data(corpus, word2idx, window_size=2)
print(f"Number of training pairs: {len(training_pairs)}")
print(f"\nSample pairs:")
for i in range(min(5, len(training_pairs))):
    center_idx, context_idx = training_pairs[i]
    print(f"  ({idx2word[center_idx]}, {idx2word[context_idx]})")

# Get word frequencies for noise distribution
word_counts = np.zeros(len(vocab))
for sent in corpus:
    for word in sent.lower().split():
        if word in word2idx:
            word_counts[word2idx[word]] += 1

# Initialize and train model
w2v_model = Word2VecNegSampling(len(vocab), embedding_dim=20, learning_rate=0.1, neg_samples=5)
w2v_model.set_noise_distribution(word_counts)

# Training loop
num_epochs = 100
losses = []

print(f"\nTraining Word2Vec for {num_epochs} epochs...")

for epoch in range(num_epochs):
    epoch_loss = 0
    np.random.shuffle(training_pairs)
    
    for center_idx, context_idx in training_pairs:
        loss = w2v_model.train_step(center_idx, context_idx)
        epoch_loss += loss
    
    avg_loss = epoch_loss / len(training_pairs)
    losses.append(avg_loss)
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

# Plot training loss
plt.figure(figsize=(10, 4))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Average Loss')
plt.title('Word2Vec Training Loss')
plt.grid(True)
plt.show()

In [None]:
# Evaluate learned embeddings
print("Learned Word Similarities:\n")

test_words = ['king', 'queen', 'man', 'woman', 'crown']

for word in test_words:
    if word in word2idx:
        similar = w2v_model.most_similar(word2idx[word], idx2word, top_k=3)
        print(f"Most similar to '{word}':")
        for sim_word, score in similar:
            print(f"  {sim_word}: {score:.4f}")
        print()

# Check specific similarities
print("\nSemantic Similarity Scores:")
pairs_to_check = [
    ('king', 'queen'),
    ('man', 'woman'),
    ('king', 'crown'),
    ('queen', 'crown'),
    ('king', 'the')
]

for word1, word2 in pairs_to_check:
    if word1 in word2idx and word2 in word2idx:
        vec1 = w2v_model.get_embedding(word2idx[word1])
        vec2 = w2v_model.get_embedding(word2idx[word2])
        sim = cosine_similarity(vec1, vec2)
        print(f"sim({word1}, {word2}) = {sim:.4f}")

## 4. CBOW (Continuous Bag of Words)

### 4.1 Architecture

CBOW is the "inverse" of Skip-gram: it predicts the center word from its context.

**CBOW Objective:**
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t | w_{t-m}, \ldots, w_{t+m})$$

**Context Representation:**
Average of context word embeddings:
$$\mathbf{h} = \frac{1}{2m} \sum_{-m \leq j \leq m, j \neq 0} \mathbf{v}_{w_{t+j}}$$

**Prediction:**
$$P(w_t | \text{context}) = \frac{\exp(\mathbf{u}_{w_t}^T \mathbf{h})}{\sum_{w \in V} \exp(\mathbf{u}_w^T \mathbf{h})}$$
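
A minimal sketch of the CBOW forward pass defined above: average the context vectors to get $\mathbf{h}$, then apply a softmax over output vectors. The matrices `V_in` and `U_out` and the index choices are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 6, 4
V_in = rng.normal(size=(V_size, d))    # rows are v_w (input/context vectors)
U_out = rng.normal(size=(V_size, d))   # rows are u_w (output vectors)

context_ids = [0, 1, 3, 4]             # the 2m words around the center (m = 2)
center_id = 2

# Context representation: mean of the context embeddings.
h = V_in[context_ids].mean(axis=0)

# Prediction: softmax over u_w^T h, shifted for numerical stability.
scores = U_out @ h
scores = scores - scores.max()
probs = np.exp(scores) / np.exp(scores).sum()
loss = -np.log(probs[center_id])

print("h:", h)
print("P(center | context):", probs[center_id])
```

Because $\mathbf{h}$ is a mean, the gradient with respect to each context vector is the gradient with respect to $\mathbf{h}$ divided by the number of context words.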

In [None]:
class Word2VecCBOW:
    """
    CBOW Word2Vec with Negative Sampling.
    
    Predict center word from context words.
    """
    
    def __init__(self, vocab_size, embedding_dim=100, learning_rate=0.025, neg_samples=5):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.lr = learning_rate
        self.neg_samples = neg_samples
        
        # Embeddings
        self.W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.W_out = np.random.randn(vocab_size, embedding_dim) * 0.01
        
        # Noise distribution
        self.noise_dist = np.ones(vocab_size) / vocab_size
    
    def set_noise_distribution(self, word_counts):
        counts = np.array(word_counts) ** 0.75
        self.noise_dist = counts / counts.sum()
    
    def sigmoid(self, x):
        # np.where evaluates both branches, so exponentiate -|x| to avoid overflow
        z = np.exp(-np.abs(x))
        return np.where(x >= 0, 1 / (1 + z), z / (1 + z))
    
    def sample_negatives(self, positive_idx):
        negatives = []
        while len(negatives) < self.neg_samples:
            neg_idx = np.random.choice(self.vocab_size, p=self.noise_dist)
            if neg_idx != positive_idx:
                negatives.append(neg_idx)
        return negatives
    
    def train_step(self, context_indices, center_idx):
        """
        Training step: predict center word from context.
        
        context_indices: list of word indices in context
        center_idx: target word to predict
        """
        # Average context embeddings
        h = np.mean(self.W_in[context_indices], axis=0)
        
        # Positive sample
        u_pos = self.W_out[center_idx]
        pos_score = np.dot(u_pos, h)
        pos_prob = self.sigmoid(pos_score)
        
        loss = -np.log(pos_prob + 1e-10)
        
        # Gradients
        grad_h = (pos_prob - 1) * u_pos
        grad_u_pos = (pos_prob - 1) * h
        
        # Negative samples
        negative_indices = self.sample_negatives(center_idx)
        
        for neg_idx in negative_indices:
            u_neg = self.W_out[neg_idx]
            neg_score = np.dot(u_neg, h)
            neg_prob = self.sigmoid(neg_score)
            
            loss += -np.log(1 - neg_prob + 1e-10)
            grad_h += neg_prob * u_neg
            grad_u_neg = neg_prob * h
            
            self.W_out[neg_idx] -= self.lr * grad_u_neg
        
        # Update context word embeddings (distribute gradient equally)
        grad_per_context = grad_h / len(context_indices)
        for ctx_idx in context_indices:
            self.W_in[ctx_idx] -= self.lr * grad_per_context
        
        self.W_out[center_idx] -= self.lr * grad_u_pos
        
        return loss
    
    def get_embedding(self, word_idx):
        return self.W_in[word_idx]

# Generate CBOW training data
def generate_cbow_data(corpus, word2idx, window_size=2):
    """
    Generate (context_words, center_word) pairs for CBOW.
    """
    data = []
    
    for sent in corpus:
        words = sent.lower().split()
        word_indices = [word2idx.get(w, -1) for w in words]
        
        for i, center_idx in enumerate(word_indices):
            if center_idx == -1:
                continue
            
            # Collect context
            context = []
            for j in range(max(0, i - window_size), min(len(word_indices), i + window_size + 1)):
                if i != j and word_indices[j] != -1:
                    context.append(word_indices[j])
            
            if len(context) > 0:
                data.append((context, center_idx))
    
    return data

cbow_data = generate_cbow_data(corpus, word2idx, window_size=2)
print(f"CBOW training samples: {len(cbow_data)}")
print(f"\nSample CBOW data:")
for i in range(min(3, len(cbow_data))):
    context, center = cbow_data[i]
    context_words = [idx2word[idx] for idx in context]
    print(f"  Context: {context_words} -> Center: {idx2word[center]}")

### 4.2 Skip-gram vs CBOW

| Aspect | Skip-gram | CBOW |
|--------|-----------|------|
| Task | Predict context from center | Predict center from context |
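| Training | Slower (one prediction per context word) | Faster (one prediction per window) |
| Rare words | Tends to be better | Tends to be worse (averaged out) |
| Frequent words | Tends to be slightly worse | Tends to be slightly better |
| Semantic relationships | Often better | Often worse |
| Syntactic relationships | Often comparable or slightly worse | Often comparable or slightly better |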

## 5. GloVe: Global Vectors

### 5.1 Theory

GloVe (Pennington et al., 2014) combines the benefits of:
- Global matrix factorization (SVD)
- Local context window (Word2Vec)

**Key Insight:** Word embeddings should encode ratios of co-occurrence probabilities.

$$\frac{P(k | \text{ice})}{P(k | \text{steam})} = \begin{cases}
\text{large} & k = \text{solid} \\
\text{small} & k = \text{gas} \\
\approx 1 & k = \text{water}, \text{fashion}
\end{cases}$$
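
The ratio pattern can be reproduced with a few lines of arithmetic. The counts below are invented solely to mimic the qualitative behavior described above. They are not measurements from any corpus.

```python
import numpy as np

probe_words = ["solid", "gas", "water", "fashion"]

# Made-up co-occurrence counts with "ice" and "steam" (illustration only).
counts_ice = np.array([190.0, 7.0, 300.0, 2.0])
counts_steam = np.array([2.0, 78.0, 220.0, 2.0])
total_ice, total_steam = 100_000.0, 100_000.0   # hypothetical row totals

p_given_ice = counts_ice / total_ice        # P(k | ice)
p_given_steam = counts_steam / total_steam  # P(k | steam)
ratio = p_given_ice / p_given_steam

for word, r in zip(probe_words, ratio):
    print(f"P({word}|ice) / P({word}|steam) = {r:.2f}")
```

Words related to only one of the two targets push the ratio far from 1, while words related to both or to neither leave it near 1.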

**GloVe Objective:**
$$J = \sum_{i,j=1}^{V} f(X_{ij})(\mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$

where:
- $X_{ij}$ : co-occurrence count
- $\mathbf{w}_i, \tilde{\mathbf{w}}_j$ : word and context embeddings
- $b_i, \tilde{b}_j$ : bias terms
- $f(x)$ : weighting function that down-weights rare (noisy) co-occurrences and caps the weight of very frequent ones at 1

**Weighting Function:**
$$f(x) = \begin{cases}
(x / x_{\max})^\alpha & x < x_{\max} \\
1 & \text{otherwise}
\end{cases}$$

where $\alpha = 0.75$ and $x_{\max} = 100$.
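
To see what the weighting does in practice, the sketch below evaluates $f(x)$ at a few counts and computes a single term of $J$. The helper names `glove_weight` and `glove_pair_loss` are introduced here for illustration and are not part of any library.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Vectorized f(x): (x / x_max)^alpha below x_max, 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """One term of J for a pair with x_ij > 0."""
    diff = w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2

weights = glove_weight([1, 10, 100, 1000])
print("f(1), f(10), f(100), f(1000):", weights)

# A pair whose prediction already matches log X_ij contributes zero loss.
zero_vec = np.zeros(3)
print("loss:", float(glove_pair_loss(zero_vec, zero_vec, np.log(7.0), 0.0, 7.0)))
```

Pairs seen once get a weight of about 0.03, while anything at or above $x_{\max}$ gets weight 1, so no single frequent pair dominates the sum.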

In [None]:
class GloVe:
    """
    GloVe: Global Vectors for Word Representation.
    
    Learns embeddings from co-occurrence statistics.
    """
    
    def __init__(self, vocab_size, embedding_dim=100, x_max=100, alpha=0.75, learning_rate=0.05):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.x_max = x_max
        self.alpha = alpha
        self.lr = learning_rate
        
        # Word embeddings
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        # Context embeddings
        self.W_tilde = np.random.randn(vocab_size, embedding_dim) * 0.01
        # Biases
        self.b = np.zeros(vocab_size)
        self.b_tilde = np.zeros(vocab_size)
        
        # AdaGrad parameters (sum of squared gradients)
        self.gradsq_W = np.ones((vocab_size, embedding_dim))
        self.gradsq_W_tilde = np.ones((vocab_size, embedding_dim))
        self.gradsq_b = np.ones(vocab_size)
        self.gradsq_b_tilde = np.ones(vocab_size)
    
    def weighting_function(self, x):
        """Weighting function f(x)."""
        if x < self.x_max:
            return (x / self.x_max) ** self.alpha
        return 1.0
    
    def train_step(self, i, j, x_ij):
        """
        Single training step for pair (i, j) with co-occurrence x_ij.
        
        Uses AdaGrad for adaptive learning rate.
        """
        # Compute inner product and difference from log co-occurrence
        diff = np.dot(self.W[i], self.W_tilde[j]) + self.b[i] + self.b_tilde[j] - np.log(x_ij + 1e-10)
        
        # Weight
        f_x = self.weighting_function(x_ij)
        
        # Weighted squared error loss
        loss = f_x * diff ** 2
        
        # Gradients (the constant factor 2 from differentiating diff^2 is folded into the learning rate)
        grad_common = f_x * diff
        
        grad_W_i = grad_common * self.W_tilde[j]
        grad_W_tilde_j = grad_common * self.W[i]
        grad_b_i = grad_common
        grad_b_tilde_j = grad_common
        
        # AdaGrad update
        self.gradsq_W[i] += grad_W_i ** 2
        self.gradsq_W_tilde[j] += grad_W_tilde_j ** 2
        self.gradsq_b[i] += grad_b_i ** 2
        self.gradsq_b_tilde[j] += grad_b_tilde_j ** 2
        
        # Parameter updates
        self.W[i] -= self.lr * grad_W_i / np.sqrt(self.gradsq_W[i])
        self.W_tilde[j] -= self.lr * grad_W_tilde_j / np.sqrt(self.gradsq_W_tilde[j])
        self.b[i] -= self.lr * grad_b_i / np.sqrt(self.gradsq_b[i])
        self.b_tilde[j] -= self.lr * grad_b_tilde_j / np.sqrt(self.gradsq_b_tilde[j])
        
        return loss
    
    def train(self, cooccurrence_matrix, num_epochs=50):
        """
        Train GloVe on co-occurrence matrix.
        """
        # Get non-zero entries
        nonzero = np.nonzero(cooccurrence_matrix)
        indices = list(zip(nonzero[0], nonzero[1]))
        
        print(f"Training on {len(indices)} non-zero co-occurrences")
        
        losses = []
        
        for epoch in range(num_epochs):
            epoch_loss = 0
            np.random.shuffle(indices)
            
            for i, j in indices:
                x_ij = cooccurrence_matrix[i, j]
                if x_ij > 0:
                    loss = self.train_step(i, j, x_ij)
                    epoch_loss += loss
            
            avg_loss = epoch_loss / len(indices)
            losses.append(avg_loss)
            
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
        
        return losses
    
    def get_embedding(self, word_idx):
        """Get final embedding (sum of W and W_tilde)."""
        return self.W[word_idx] + self.W_tilde[word_idx]
    
    def get_all_embeddings(self):
        """Get all embeddings."""
        return self.W + self.W_tilde

# Train GloVe
print("Training GloVe model...\n")
glove_model = GloVe(len(vocab), embedding_dim=20, learning_rate=0.1)
glove_losses = glove_model.train(cooccur_matrix, num_epochs=100)

# Plot loss
plt.figure(figsize=(10, 4))
plt.plot(glove_losses)
plt.xlabel('Epoch')
plt.ylabel('Average Loss')
plt.title('GloVe Training Loss')
plt.grid(True)
plt.show()

In [None]:
# Compare GloVe similarities
print("GloVe Word Similarities:\n")

for word in ['king', 'queen', 'man', 'woman']:
    if word in word2idx:
        query_vec = glove_model.get_embedding(word2idx[word])
        
        # Find similar words
        similarities = []
        for other_word, other_idx in word2idx.items():
            if other_word != word:
                other_vec = glove_model.get_embedding(other_idx)
                sim = cosine_similarity(query_vec, other_vec)
                similarities.append((other_word, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        print(f"Most similar to '{word}':")
        for sim_word, score in similarities[:3]:
            print(f"  {sim_word}: {score:.4f}")
        print()

## 6. Embedding Arithmetic and Analogies

### 6.1 The Famous Equation

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$

**Mathematical Interpretation:**
If embeddings capture semantic relationships, then:
- $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}}$ approximately isolates a "royalty" direction
- Adding $\mathbf{v}_{\text{woman}}$ gives a vector near $\mathbf{v}_{\text{queen}}$ (female + royalty)
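
A worked version of this arithmetic, reusing the hypothetical 4-D vectors from Section 1. These vectors were hand-built so that the last coordinate acts like a gender axis, which means the analogy holds by construction. Real embeddings are noisier.

```python
import numpy as np

# Hypothetical 4-D vectors (same toy values as in Section 1).
toy = {
    "king":  np.array([0.8, 0.9, 0.7, -0.5]),
    "queen": np.array([0.8, 0.9, 0.7, 0.5]),
    "man":   np.array([-0.3, 0.2, 0.6, -0.6]),
    "woman": np.array([-0.3, 0.2, 0.6, 0.6]),
    "apple": np.array([-0.8, -0.7, -0.5, 0.1]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Rank the words that were not part of the query.
query_words = {"king", "man", "woman"}
candidates = {w: cos(target, v) for w, v in toy.items() if w not in query_words}
best = max(candidates, key=candidates.get)

print("target:", target)
print("candidates:", candidates)
print("best match:", best)
```

Excluding the three query words matters: without that step, the nearest neighbor of the target vector is frequently one of the inputs.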

**3CosAdd Method:**
$$\text{argmax}_{b^*} \cos(b^*, b - a + a^*)$$

Find word $b^*$ such that $a : a^* :: b : b^*$

**3CosMul Method (often better):**
$$\text{argmax}_{b^*} \frac{\cos(b^*, b) \cdot \cos(b^*, a^*)}{\cos(b^*, a) + \epsilon}$$

In [None]:
def analogy_3cosadd(a, a_star, b, embeddings, word2idx, idx2word):
    """
    Solve analogy: a is to a* as b is to ?
    
    Using 3CosAdd: argmax cos(v, b - a + a*)
    """
    if a not in word2idx or a_star not in word2idx or b not in word2idx:
        return None
    
    # Get embeddings
    v_a = embeddings[word2idx[a]]
    v_a_star = embeddings[word2idx[a_star]]
    v_b = embeddings[word2idx[b]]
    
    # Compute target vector
    target = v_b - v_a + v_a_star
    
    # Find most similar (excluding input words)
    exclude = {word2idx[a], word2idx[a_star], word2idx[b]}
    
    best_word = None
    best_sim = -float('inf')
    
    for idx in range(len(embeddings)):
        if idx not in exclude:
            sim = cosine_similarity(target, embeddings[idx])
            if sim > best_sim:
                best_sim = sim
                best_word = idx2word[idx]
    
    return best_word, best_sim

def analogy_3cosmul(a, a_star, b, embeddings, word2idx, idx2word, eps=0.001):
    """
    Solve analogy using 3CosMul.
    
    argmax (cos(v, b) * cos(v, a*)) / (cos(v, a) + eps)
    """
    if a not in word2idx or a_star not in word2idx or b not in word2idx:
        return None
    
    v_a = embeddings[word2idx[a]]
    v_a_star = embeddings[word2idx[a_star]]
    v_b = embeddings[word2idx[b]]
    
    exclude = {word2idx[a], word2idx[a_star], word2idx[b]}
    
    best_word = None
    best_score = -float('inf')
    
    for idx in range(len(embeddings)):
        if idx not in exclude:
            v = embeddings[idx]
            
            # Compute similarities (shifted to [0, 1])
            cos_b = (cosine_similarity(v, v_b) + 1) / 2
            cos_a_star = (cosine_similarity(v, v_a_star) + 1) / 2
            cos_a = (cosine_similarity(v, v_a) + 1) / 2
            
            score = (cos_b * cos_a_star) / (cos_a + eps)
            
            if score > best_score:
                best_score = score
                best_word = idx2word[idx]
    
    return best_word, best_score

# Test analogies on our small corpus
print("Word Analogies:\n")

# Get embeddings from trained models
w2v_embeddings = w2v_model.W_in
glove_embeddings = glove_model.get_all_embeddings()

# Test: man -> woman as king -> ?
analogies = [
    ('man', 'woman', 'king'),  # king -> ?
    ('woman', 'man', 'queen'), # queen -> ?
]

print("Word2Vec Analogies (3CosAdd):")
for a, a_star, b in analogies:
    result = analogy_3cosadd(a, a_star, b, w2v_embeddings, word2idx, idx2word)
    if result:
        word, score = result
        print(f"  {a} : {a_star} :: {b} : {word} (sim={score:.4f})")

print("\nGloVe Analogies (3CosAdd):")
for a, a_star, b in analogies:
    result = analogy_3cosadd(a, a_star, b, glove_embeddings, word2idx, idx2word)
    if result:
        word, score = result
        print(f"  {a} : {a_star} :: {b} : {word} (sim={score:.4f})")

print("\nNote: With a corpus this small, the answers above are mostly noise.")
print("Trained on large corpora, the same methods recover many semantic and syntactic analogies.")

## 7. Visualizing Embeddings

### 7.1 Dimensionality Reduction

To visualize high-dimensional embeddings in 2D/3D:

**PCA (Principal Component Analysis):**
- Linear projection to top principal components
- Preserves global structure
- Fast, deterministic

**t-SNE (t-Distributed Stochastic Neighbor Embedding):**
- Non-linear dimensionality reduction
- Preserves local structure (clusters)
- Better for visualization but slower

**UMAP (Uniform Manifold Approximation and Projection):**
- Preserves local structure and often more global structure than t-SNE
- Typically faster than t-SNE (requires the separate `umap-learn` package)
- Often preferred for large datasets
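
As a rough illustration of what "linear projection to top principal components" means, PCA can be sketched directly with an SVD in NumPy. The function name `pca_project` and the random stand-in matrix are assumptions made for this example; the notebook itself uses scikit-learn's `PCA`.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components using SVD."""
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return projected, explained_ratio[:n_components]

rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(30, 20))  # stand-in for a (vocab_size, dim) matrix
points_2d, ratio = pca_project(demo_embeddings)

print("Projected shape:", points_2d.shape)
print("Explained variance ratio:", ratio)
```

The projected coordinates may differ from scikit-learn's by a sign flip per component, since principal directions are only defined up to sign.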

In [None]:
# Visualize embeddings using PCA and t-SNE

# Use Word2Vec embeddings
all_embeddings = w2v_model.W_in
words = [idx2word[i] for i in range(len(vocab))]

# PCA reduction
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(all_embeddings)

# t-SNE reduction
# Note: t-SNE requires perplexity < n_samples
perplexity = min(5, len(vocab) - 1)
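# The iteration count is left at its default (1000); the n_iter keyword was renamed to max_iter in scikit-learn 1.5
tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)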
embeddings_tsne = tsne.fit_transform(all_embeddings)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA plot
axes[0].scatter(embeddings_pca[:, 0], embeddings_pca[:, 1], alpha=0.7)
for i, word in enumerate(words):
    axes[0].annotate(word, (embeddings_pca[i, 0], embeddings_pca[i, 1]),
                      fontsize=9, alpha=0.8)
axes[0].set_title('Word Embeddings - PCA Projection')
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[0].grid(True, alpha=0.3)

# t-SNE plot
axes[1].scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1], alpha=0.7)
for i, word in enumerate(words):
    axes[1].annotate(word, (embeddings_tsne[i, 0], embeddings_tsne[i, 1]),
                      fontsize=9, alpha=0.8)
axes[1].set_title('Word Embeddings - t-SNE Projection')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nLook for semantically related words that land close together.")
print("On a corpus this small the layout is noisy; larger corpora give much clearer clusters.")

## 8. Using Pretrained Embeddings

### 8.1 Why Pretrained?

Training high-quality embeddings typically requires massive corpora:
- Original Word2Vec: Google News (~100B words)
- GloVe: Common Crawl (840B tokens)
- fastText: Wikipedia + Common Crawl

**Advantages of Pretrained Embeddings:**
1. Rich semantic representations from large corpora
2. No need for expensive training
3. Transfer learning to downstream tasks
4. Better generalization

### 8.2 Loading Pretrained Embeddings

In [None]:
# Simulating pretrained embeddings loading
# In practice, you'd download from:
# - GloVe: https://nlp.stanford.edu/projects/glove/
# - Word2Vec: https://code.google.com/archive/p/word2vec/
# - fastText: https://fasttext.cc/docs/en/english-vectors.html

def load_glove_embeddings(file_path, vocab_limit=None):
    """
    Load pretrained GloVe embeddings from file.
    
    File format: word dim1 dim2 ... dimN
    """
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if vocab_limit and i >= vocab_limit:
                break
            
            values = line.strip().split()
            word = values[0]
            vector = np.array(values[1:], dtype=np.float32)
            embeddings[word] = vector
    
    return embeddings

# Example of creating an embedding matrix for a specific vocabulary
def create_embedding_matrix(word2idx, pretrained_embeddings, embedding_dim):
    """
    Create embedding matrix for a given vocabulary.
    
    Initialize unknown words randomly.
    """
    vocab_size = len(word2idx)
    embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01
    
    found_count = 0
    for word, idx in word2idx.items():
        if word in pretrained_embeddings:
            embedding_matrix[idx] = pretrained_embeddings[word]
            found_count += 1
    
    coverage = found_count / vocab_size * 100
    print(f"Vocabulary coverage: {found_count}/{vocab_size} ({coverage:.1f}%)")
    
    return embedding_matrix

print("Pretrained Embedding Loading Functions")
print("\nIn practice, you would:")
print("1. Download pretrained embeddings (e.g., glove.6B.300d.txt)")
print("2. Load them using load_glove_embeddings()")
print("3. Create embedding matrix for your task vocabulary")
print("4. Use in neural network as embedding layer")

In [None]:
# PyTorch integration example
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """
    Simple text classifier with pretrained embeddings.
    """
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes,
                 pretrained_embeddings=None, freeze_embeddings=False):
        super().__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Load pretrained weights
        if pretrained_embeddings is not None:
            self.embedding.weight.data.copy_(
                torch.from_numpy(pretrained_embeddings)
            )
            if freeze_embeddings:
                self.embedding.weight.requires_grad = False
        
        # Classifier
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)
    
    def forward(self, x):
        """
        x: [batch_size, seq_len] token indices
        """
        # Embed tokens: [batch_size, seq_len, embedding_dim]
        embedded = self.embedding(x)
        
        # Average pooling over sequence (assumes no padding tokens in x;
        # with padded batches, mask the padding before averaging)
        pooled = embedded.mean(dim=1)  # [batch_size, embedding_dim]
        
        # Classify
        hidden = self.relu(self.fc1(pooled))
        output = self.fc2(hidden)
        
        return output

# Demo
vocab_size = 1000
embedding_dim = 100
hidden_dim = 64
num_classes = 2

# Simulate pretrained embeddings
pretrained = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

# Create model
model = TextClassifier(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    hidden_dim=hidden_dim,
    num_classes=num_classes,
    pretrained_embeddings=pretrained,
    freeze_embeddings=False  # Set True to keep embeddings fixed
)

print("Text Classifier Architecture:")
print(model)

# Test forward pass
batch_size = 4
seq_len = 20
sample_input = torch.randint(0, vocab_size, (batch_size, seq_len))
output = model(sample_input)
print(f"\nInput shape: {sample_input.shape}")
print(f"Output shape: {output.shape}")

# Count parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

## 9. Limitations of Static Embeddings

### 9.1 Polysemy Problem

Static embeddings assign ONE vector per word type, regardless of context:

**Example: "bank"**
- "I deposited money at the **bank**" (financial institution)
- "The river **bank** was muddy" (riverside)

Both senses get merged into a single vector!
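
A minimal sketch of why this happens: a static embedding is just a lookup table keyed by the word type, so the surrounding sentence never enters the computation. The table below uses random vectors and a made-up vocabulary purely for illustration.

```python
import numpy as np

# Hypothetical static lookup table: exactly one vector per word type
rng = np.random.default_rng(0)
toy_vocab = ["i", "deposited", "money", "at", "the", "bank", "river", "was", "muddy"]
static_table = {w: rng.normal(size=4) for w in toy_vocab}

def embed_sentence(sentence):
    """Look up each token independently; the context is never consulted."""
    return [(tok, static_table[tok]) for tok in sentence.lower().split()]

financial = dict(embed_sentence("I deposited money at the bank"))
riverside = dict(embed_sentence("The river bank was muddy"))

same = np.array_equal(financial["bank"], riverside["bank"])
print("Same vector for both senses of 'bank':", same)
```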

### 9.2 Other Limitations

1. **No context sensitivity**: Same embedding regardless of surrounding words
2. **Out-of-vocabulary (OOV)**: No representation for unseen words
3. **Morphology ignored**: "run", "runs", "running" are separate
4. **Bias encoded**: Societal biases in training data are learned
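
The OOV and morphology points can be illustrated with a toy lookup. The `<unk>` fallback row used here is a common convention that this sketch assumes; it is not something the models trained above implement, and the vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_word2idx = {"<unk>": 0, "run": 1, "runs": 2, "running": 3}
toy_matrix = rng.normal(size=(len(toy_word2idx), 4))

def lookup(word):
    """Return the word's vector, falling back to the shared <unk> row."""
    idx = toy_word2idx.get(word, toy_word2idx["<unk>"])
    return toy_matrix[idx]

# "runner" was never seen, so it collapses onto the generic <unk> vector
oov_is_unk = np.array_equal(lookup("runner"), lookup("<unk>"))
# Inflected forms are independent rows; nothing in the table ties them together
forms_share_vector = np.array_equal(lookup("run"), lookup("running"))

print("OOV word maps to <unk>:", oov_is_unk)
print("'run' and 'running' share a vector:", forms_share_vector)
```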

### 9.3 Solutions (Preview of Days 14-15)

- **fastText**: Subword embeddings (handles OOV)
- **ELMo**: Contextualized embeddings
- **BERT**: Bidirectional contextualized representations
- **GPT**: Autoregressive language models

In [None]:
# Demonstrating the polysemy problem

# Hypothetical single embedding for "bank"
print("The Polysemy Problem:")
print("="*50)

# Sentences with different meanings of "bank"
sentences = [
    "I need to go to the bank to deposit my check",
    "The river bank was covered with wildflowers",
    "The pilot decided to bank the aircraft left",
    "Don't bank on him coming to the party"
]

print("Different meanings of 'bank':")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")

print("\nStatic embedding issue:")
print("All these different meanings map to a SINGLE vector!")
print("\nContextualized embeddings (ELMo, BERT) solve this by")
print("computing different vectors based on context.")

## 10. Bias in Word Embeddings

### 10.1 Gender Bias Example

Word embeddings trained on large corpora encode societal biases:

$$\text{man} - \text{woman} \approx \text{doctor} - \text{nurse}$$
$$\text{man} - \text{woman} \approx \text{computer programmer} - \text{homemaker}$$

### 10.2 Debiasing Techniques

1. **Hard Debiasing** (Bolukbasi et al., 2016):
   - Identify gender direction
   - Project out gender component from neutral words
   - Equalize gender pairs

2. **Counterfactual Data Augmentation**:
   - Swap gendered words in training data
   - Balance gender representation

3. **Adversarial Learning**:
   - Train embeddings to be unpredictive of protected attributes
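
Counterfactual data augmentation can be sketched in a few lines. The swap list below is a deliberately tiny, hypothetical example; practical lists are far longer and must deal with ambiguous forms (for instance, "her" corresponds to both "his" and "him").

```python
import re

# Hypothetical, minimal swap list used only for this illustration
SWAP_PAIRS = [("he", "she"), ("man", "woman"), ("king", "queen"), ("father", "mother")]
SWAP = {a: b for a, b in SWAP_PAIRS}
SWAP.update({b: a for a, b in SWAP_PAIRS})

def counterfactual(sentence):
    """Return a lowercased copy of the sentence with each gendered token swapped."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return " ".join(SWAP.get(tok, tok) for tok in tokens)

corpus_demo = ["The king said he was tired.", "She is a doctor."]
# Train on the original sentences plus their gender-swapped counterparts
augmented = corpus_demo + [counterfactual(s) for s in corpus_demo]
for s in augmented:
    print(s)
```

Embeddings would then be trained on the augmented corpus, so each neutral word such as "doctor" appears equally often next to both members of every swapped pair.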

In [None]:
# Simple debiasing illustration

def identify_bias_direction(embeddings, word2idx, gendered_pairs):
    """
    Identify the gender direction from gendered word pairs.
    
    gendered_pairs: list of (male_word, female_word) tuples
    """
    differences = []
    
    for male_word, female_word in gendered_pairs:
        if male_word in word2idx and female_word in word2idx:
            diff = embeddings[word2idx[male_word]] - embeddings[word2idx[female_word]]
            differences.append(diff)
    
    if not differences:
        return None
    
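    # PCA centers its input, which would cancel the shared male-minus-female
    # direction. Adding every difference with both signs (equivalent to centering
    # each pair on its midpoint, as in Bolukbasi et al.) keeps the mean at zero.
    diff_matrix = np.array(differences + [-d for d in differences])
    pca = PCA(n_components=1)
    pca.fit(diff_matrix)
    gender_direction = pca.components_[0]
    
    # Orient the direction so that male-minus-female projects positively
    if np.dot(gender_direction, np.mean(differences, axis=0)) < 0:
        gender_direction = -gender_direction
    
    return gender_direction / np.linalg.norm(gender_direction)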

def debias_word(embedding, bias_direction):
    """
    Remove bias component from word embedding.
    
    Project out the bias direction.
    """
    # Component along bias direction
    bias_component = np.dot(embedding, bias_direction) * bias_direction
    
    # Remove it
    debiased = embedding - bias_component
    
    return debiased

# Example
print("Word Embedding Debiasing:")
print("="*50)

if 'man' in word2idx and 'woman' in word2idx:
    # Identify gender direction
    gendered_pairs = [('man', 'woman'), ('king', 'queen')]
    gender_dir = identify_bias_direction(w2v_embeddings, word2idx, gendered_pairs)
    
    if gender_dir is not None:
        print("Gender direction identified from word pairs")
        print(f"Direction vector norm: {np.linalg.norm(gender_dir):.4f}")
        
        # Check word projections onto gender direction
        print("\nProjections onto gender direction:")
        for word in ['king', 'queen', 'man', 'woman', 'the', 'and']:
            if word in word2idx:
                proj = np.dot(w2v_embeddings[word2idx[word]], gender_dir)
                print(f"  {word:10s}: {proj:+.4f}")

print("\nNote: Real debiasing requires careful consideration of:")
print("- Which words should be gendered (he/she) vs neutral (doctor)")
print("- Multiple bias types (race, age, religion)")
print("- Downstream task impact")

## 11. Evaluation of Word Embeddings

### 11.1 Intrinsic Evaluation

**Word Similarity Tasks:**
- SimLex-999
- WordSim-353
- MEN dataset

**Analogy Tasks:**
- Google analogy dataset (19,544 analogies)
- Semantic: capital-country, currency
- Syntactic: adjective-adverb, verb tenses
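
Analogy benchmarks are commonly distributed as plain text. The sketch below assumes the layout used by the Google analogy file: section header lines starting with `:` followed by lines of four whitespace-separated words. The function name and the sample lines are illustrative, and other datasets may need a different parser.

```python
def parse_analogy_lines(lines):
    """
    Parse analogy questions into {section: [(a, a*, b, b*), ...]}.

    Assumes section headers look like ': section-name' and every other
    non-empty line holds exactly four whitespace-separated words.
    """
    sections = {}
    current = "unlabeled"
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections.setdefault(current, [])
            continue
        words = line.lower().split()
        if len(words) == 4:  # silently skip malformed lines
            sections.setdefault(current, []).append(tuple(words))
    return sections

sample_lines = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    ": gram3-comparative",
    "bad worse big bigger",
]
parsed = parse_analogy_lines(sample_lines)
for name, items in parsed.items():
    print(name, items)
```

Each resulting list has the `(a, a*, b, b*)` shape, so per-section accuracy can be reported separately for semantic and syntactic categories.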

### 11.2 Extrinsic Evaluation

Evaluate embeddings on downstream tasks:
- Sentiment analysis
- Named Entity Recognition
- Part-of-Speech tagging
- Question Answering

In [None]:
def evaluate_word_similarity(embeddings, word2idx, test_pairs):
    """
    Evaluate embeddings on word similarity task.
    
    test_pairs: list of (word1, word2, human_score) tuples
    
    Returns Spearman correlation between model and human scores.
    """
    model_scores = []
    human_scores = []
    
    for word1, word2, human_score in test_pairs:
        if word1 in word2idx and word2 in word2idx:
            vec1 = embeddings[word2idx[word1]]
            vec2 = embeddings[word2idx[word2]]
            model_sim = cosine_similarity(vec1, vec2)
            
            model_scores.append(model_sim)
            human_scores.append(human_score)
    
    if len(model_scores) < 2:
        return 0.0
    
    # Spearman correlation
    from scipy.stats import spearmanr
    correlation, p_value = spearmanr(model_scores, human_scores)
    
    return correlation

def evaluate_analogies(embeddings, word2idx, idx2word, analogies):
    """
    Evaluate on analogy task.
    
    analogies: list of (a, a*, b, b*) tuples
    
    Returns accuracy.
    """
    correct = 0
    total = 0
    
    for a, a_star, b, b_star in analogies:
        if all(w in word2idx for w in [a, a_star, b, b_star]):
            result = analogy_3cosadd(a, a_star, b, embeddings, word2idx, idx2word)
            if result:
                predicted, _ = result
                if predicted == b_star:
                    correct += 1
                total += 1
    
    accuracy = correct / total if total > 0 else 0.0
    return accuracy

# Example evaluation
print("Word Embedding Evaluation:")
print("="*50)

# Synthetic test data (would use real datasets in practice)
test_similarity_pairs = [
    ('king', 'queen', 0.8),
    ('king', 'man', 0.5),
    ('man', 'woman', 0.7),
    ('king', 'the', 0.1)
]

test_analogies = [
    ('man', 'woman', 'king', 'queen'),
]

# Evaluate Word2Vec
sim_corr = evaluate_word_similarity(w2v_embeddings, word2idx, test_similarity_pairs)
analogy_acc = evaluate_analogies(w2v_embeddings, word2idx, idx2word, test_analogies)

print("Word2Vec Results:")
print(f"  Word Similarity Correlation: {sim_corr:.4f}")
print(f"  Analogy Accuracy: {analogy_acc:.1%}")

# Evaluate GloVe
sim_corr_glove = evaluate_word_similarity(glove_embeddings, word2idx, test_similarity_pairs)
analogy_acc_glove = evaluate_analogies(glove_embeddings, word2idx, idx2word, test_analogies)

print("\nGloVe Results:")
print(f"  Word Similarity Correlation: {sim_corr_glove:.4f}")
print(f"  Analogy Accuracy: {analogy_acc_glove:.1%}")

print("\nNote: Real evaluation uses standardized datasets like:")
print("- SimLex-999 for similarity")
print("- Google Analogies dataset (semantic + syntactic)")
print("- BATS (Bigger Analogy Test Set)")

## 12. Best Practices and Summary

### 12.1 Choosing Embeddings

**For Most Tasks:**
1. Start with pretrained embeddings (GloVe, fastText)
2. Fine-tune on your domain data
3. Consider contextualized embeddings for complex tasks

**Embedding Dimensions:**
- Small vocab/task: 50-100 dimensions
- General use: 300 dimensions (standard)
- Complex tasks: 300-1024 dimensions
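
One practical consequence of the dimension choice is memory: a dense embedding table stores `vocab_size * dim` values. The short calculation below assumes float32 storage and an illustrative vocabulary of 400,000 words.

```python
def embedding_memory_mb(vocab_size, dim, bytes_per_value=4):
    """Size of a dense (vocab_size, dim) embedding matrix in megabytes (10^6 bytes)."""
    return vocab_size * dim * bytes_per_value / 1e6

# Illustrative vocabulary of 400,000 words stored as float32
for dim in (50, 100, 300, 1024):
    print(f"d={dim:4d}: {embedding_memory_mb(400_000, dim):8.1f} MB")
```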

**Window Size:**
- Small (2-5): More syntactic relationships
- Large (5-15): More semantic relationships
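
The effect of window size is easy to see by listing the training pairs it produces. This is a simplified sketch with a fixed symmetric window and a made-up sentence; the `skipgram_pairs` helper is hypothetical, and real implementations often sample the window size dynamically.

```python
def skipgram_pairs(tokens, window_size):
    """Generate (center, context) pairs within +/- window_size positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens_demo = "the king rules the kingdom with wisdom".split()
for w in (1, 3):
    pairs = skipgram_pairs(tokens_demo, w)
    king_contexts = [ctx for center, ctx in pairs if center == "king"]
    print(f"window={w}: {len(pairs)} pairs, contexts of 'king': {king_contexts}")
```

With the small window, "king" only sees its immediate neighbors, which mostly reflects syntactic role. The larger window also pulls in topical words such as "kingdom".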

### 12.2 Key Takeaways

1. **Dense embeddings** capture semantic similarity in continuous space
2. **Word2Vec** learns from local context windows (prediction-based)
3. **GloVe** combines global co-occurrence with local windows
4. **Embedding arithmetic** reveals learned semantic relationships
5. **Pretrained embeddings** provide transfer learning benefits
6. **Static embeddings** have limitations (polysemy, OOV, bias)
7. **Contextualized embeddings** (ELMo, BERT) address context insensitivity and polysemy, though they can still encode bias

### 12.3 Next Steps

- **Day 13**: Recurrent Neural Networks for text sequences
- **Day 14**: Advanced RNNs (LSTM, GRU) and language modeling
- **Day 15**: Attention mechanisms and transformers

In [None]:
# Final summary visualization

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Word2Vec architecture diagram
axes[0, 0].text(0.5, 0.8, 'Skip-gram Architecture', ha='center', fontsize=14, fontweight='bold')
axes[0, 0].text(0.5, 0.6, 'Input (one-hot)', ha='center', fontsize=11)
axes[0, 0].arrow(0.5, 0.55, 0, -0.1, head_width=0.05, head_length=0.02, fc='blue')
axes[0, 0].text(0.5, 0.4, 'W_in (embeddings)', ha='center', fontsize=11, color='blue')
axes[0, 0].arrow(0.5, 0.35, 0, -0.1, head_width=0.05, head_length=0.02, fc='blue')
axes[0, 0].text(0.5, 0.2, 'W_out @ v_c', ha='center', fontsize=11)
axes[0, 0].arrow(0.5, 0.15, 0, -0.1, head_width=0.05, head_length=0.02, fc='blue')
axes[0, 0].text(0.5, 0.0, 'Softmax (context prediction)', ha='center', fontsize=11)
axes[0, 0].set_xlim(0, 1)
axes[0, 0].set_ylim(-0.1, 1)
axes[0, 0].axis('off')
axes[0, 0].set_title('Word2Vec', fontsize=12)

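# 2. Training losses comparison (each curve uses its own epoch axis,
#    so differing epoch counts cannot cause a shape mismatch)
epochs = np.arange(1, len(losses) + 1)
glove_epochs = np.arange(1, len(glove_losses) + 1)
axes[0, 1].plot(epochs, losses, 'b-', label='Word2Vec', linewidth=2)
axes[0, 1].plot(glove_epochs, glove_losses, 'r-', label='GloVe', linewidth=2)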
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Loss')
axes[0, 1].set_title('Training Convergence')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Embedding properties (rough qualitative ratings for illustration, not benchmark results)
properties = ['Semantic\nSimilarity', 'Analogy\nArithmetic', 'Training\nSpeed', 
              'Memory\nEfficiency', 'OOV\nHandling']
word2vec_scores = [4, 4.5, 3, 4, 1]
glove_scores = [4.5, 4, 4.5, 3, 1]
fasttext_scores = [4, 4, 3.5, 3.5, 5]

x = np.arange(len(properties))
width = 0.25

axes[1, 0].bar(x - width, word2vec_scores, width, label='Word2Vec', color='blue', alpha=0.7)
axes[1, 0].bar(x, glove_scores, width, label='GloVe', color='red', alpha=0.7)
axes[1, 0].bar(x + width, fasttext_scores, width, label='fastText', color='green', alpha=0.7)
axes[1, 0].set_xticks(x)
axes[1, 0].set_xticklabels(properties, rotation=45, ha='right')
axes[1, 0].set_ylabel('Score (1-5)')
axes[1, 0].set_title('Embedding Method Comparison')
axes[1, 0].legend(loc='lower right')
axes[1, 0].set_ylim(0, 6)
axes[1, 0].grid(True, alpha=0.3, axis='y')

# 4. Timeline of embedding methods
methods = ['LSA', 'Word2Vec', 'GloVe', 'fastText', 'ELMo', 'BERT']
years = [1988, 2013, 2014, 2016, 2018, 2018]
types = ['Count', 'Predict', 'Hybrid', 'Subword', 'Context', 'Context']

axes[1, 1].scatter(years, range(len(methods)), s=200, c='blue', alpha=0.6, zorder=5)
for i, (method, year, typ) in enumerate(zip(methods, years, types)):
    axes[1, 1].text(year + 0.5, i, f'{method}\n({typ})', ha='left', va='center', fontsize=10)
axes[1, 1].set_yticks([])
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_title('Evolution of Word Embeddings')
axes[1, 1].grid(True, alpha=0.3, axis='x')
axes[1, 1].set_xlim(1985, 2022)

plt.suptitle('Word Embeddings: Summary', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## Exercises

### Exercise 1: Implement Hierarchical Softmax
Implement hierarchical softmax for Word2Vec to reduce the per-example cost of the output layer from O(V) to O(log V).
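
A possible starting point, offered as a sketch rather than a full solution: hierarchical softmax places words at the leaves of a binary tree (Word2Vec uses a Huffman tree), and the cost of scoring a word is its path length. The hypothetical helper below only computes those path lengths from made-up counts; the per-node sigmoid classifiers are left for the exercise.

```python
import heapq
from itertools import count

def huffman_code_lengths(word_counts):
    """Return {word: depth in the Huffman tree} for a frequency dictionary."""
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(freq, next(tiebreak), [word]) for word, freq in word_counts.items()]
    heapq.heapify(heap)
    depth = {word: 0 for word in word_counts}
    while len(heap) > 1:
        f1, _, words1 = heapq.heappop(heap)
        f2, _, words2 = heapq.heappop(heap)
        merged = words1 + words2
        for w in merged:  # every word under the new node moves one level deeper
            depth[w] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return depth

counts_demo = {"the": 50, "king": 12, "queen": 10, "man": 8, "woman": 7, "royal": 3}
lengths = huffman_code_lengths(counts_demo)
for word, n in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(f"{word:6s} path length = {n}")
```

Frequent words end up near the root, so the average number of binary decisions per training example is small even for large vocabularies.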

### Exercise 2: Train on Larger Corpus
Download a larger text corpus (e.g., Wikipedia snippets) and train Word2Vec. Evaluate on standard analogy tasks.

### Exercise 3: Subword Embeddings
Implement character n-gram embeddings (fastText-style) to handle out-of-vocabulary words.

### Exercise 4: Bias Analysis
Using pretrained embeddings, identify and quantify biases. Implement debiasing and measure the impact.

### Exercise 5: Domain-Specific Embeddings
Train embeddings on domain-specific text (medical, legal, technical) and compare to general embeddings.

### Exercise 6: Multilingual Embeddings
Research and implement cross-lingual embedding alignment (e.g., MUSE, VecMap).

### Exercise 7: Embedding Compression
Implement embedding compression techniques (quantization, dimensionality reduction) and evaluate trade-offs.
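
A possible starting point for the quantization part, assuming the simplest scheme: uniform 8-bit quantization with one scale and offset for the whole matrix. The function names and the random stand-in matrix are illustrative; per-row scales or product quantization are natural refinements to compare against.

```python
import numpy as np

def quantize_uint8(matrix):
    """Uniformly quantize a float matrix to 8 bits with a single scale and offset."""
    lo, hi = float(matrix.min()), float(matrix.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((matrix - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map 8-bit codes back to approximate float values."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
emb_demo = rng.normal(size=(1000, 50)).astype(np.float32)  # stand-in embedding matrix
codes, scale, lo = quantize_uint8(emb_demo)
restored = dequantize(codes, scale, lo)

print(f"float32 size:  {emb_demo.nbytes:,} bytes")
print(f"uint8 size:    {codes.nbytes:,} bytes")
print(f"max abs error: {np.abs(emb_demo - restored).max():.4f}")
```

To evaluate the trade-off, compare nearest neighbors or similarity correlations computed from `restored` against those from the original matrix.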

In [None]:
# Starter code for Exercise 3: Subword Embeddings

def get_ngrams(word, n=3):
    """
    Get character n-grams for a word.
    
    Add special boundary markers < and >.
    """
    word = f"<{word}>"
    ngrams = []
    
    for i in range(len(word) - n + 1):
        ngrams.append(word[i:i+n])
    
    return ngrams

# Example
word = "where"
print(f"Word: {word}")
print(f"3-grams: {get_ngrams(word, n=3)}")
print(f"4-grams: {get_ngrams(word, n=4)}")

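# fastText represents a word as the sum of:
# 1. A vector for the whole word (if it is in the vocabulary)
# 2. Vectors for all of its character n-grams (the paper uses lengths 3 to 6)
# OOV words can therefore still be embedded from their n-grams alone.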

oov_word = "wherefore"
print(f"\nOOV word: {oov_word}")
print(f"3-grams: {get_ngrams(oov_word, n=3)}")
print("\nMany n-grams overlap with 'where' -> similar embedding!")

## References

1. Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." (Word2Vec)
2. Mikolov, T., et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." (Negative Sampling)
3. Pennington, J., et al. (2014). "GloVe: Global Vectors for Word Representation."
4. Bojanowski, P., et al. (2017). "Enriching Word Vectors with Subword Information." (fastText)
5. Bolukbasi, T., et al. (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings."
6. Levy, O., & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization."
7. Schnabel, T., et al. (2015). "Evaluation Methods for Unsupervised Word Embeddings."