In [9]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

### **1. Introduction: The Challenge of Understanding Language**

**The Problem:**

So far, we've worked with numerical data (exam scores) and images (pixels). But how do we make machines understand **words** and **language**?

Words aren't numbers. We can't just feed "cat" or "king" into a neural network. Computers only understand numbers, so we need a way to represent words mathematically.

**The Evolution of Word Representation:**

Humans understand that:
- "king" and "queen" are related (both royalty)
- "cat" and "dog" are similar (both animals, both pets)
- "happy" and "joyful" have similar meanings

But how do we teach machines these relationships?

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*sXNXYfAqfLUeiDXPCo130w.png" width="600"/>
</div>

**Our Goal Today:**

Build a system that learns word meanings from text and represents them as vectors (lists of numbers) such that:
- Similar words have similar vectors
- We can perform mathematical operations on meaning (e.g., king - man + woman ≈ queen)
- The machine discovers these relationships automatically from data!
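
To make the "king - man + woman ≈ queen" idea concrete, here is a minimal sketch with hand-made 2-D vectors. The numbers and the helper `nearest_word` are hypothetical, chosen only so that one axis means "royalty" and the other means "gender"; real embeddings are learned and far less tidy.

```python
import numpy as np

# Hypothetical hand-made vectors: [royalty, gender]. These are NOT learned values.
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest_word(vector, vectors):
    """Return the word whose vector has the highest cosine similarity to `vector`."""
    best_word, best_score = None, -np.inf
    for word, candidate in vectors.items():
        score = np.dot(vector, candidate) / (
            np.linalg.norm(vector) * np.linalg.norm(candidate)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Remove the "male" direction from king, then add the "female" direction
result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
analogy_answer = nearest_word(result, toy_vectors)

print(f"king - man + woman = {result}")
print(f"Closest word: {analogy_answer}")
```

With these toy values the arithmetic lands exactly on the vector for "queen". With learned embeddings the result is only approximately close, which is why the relation is written with ≈.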

### **2. From One-Hot Encoding to Word Embeddings**

Let's explore two fundamental ways to represent words as numbers.

**Method 1: One-Hot Encoding (The Naive Approach)**

Imagine we have a vocabulary of 5 words: `["cat", "dog", "king", "queen", "happy"]`

One-hot encoding represents each word as a vector with:
- A `1` at the word's position
- `0` everywhere else

```
cat   = [1, 0, 0, 0, 0]
dog   = [0, 1, 0, 0, 0]
king  = [0, 0, 1, 0, 0]
queen = [0, 0, 0, 1, 0]
happy = [0, 0, 0, 0, 1]
```

**Problems with One-Hot Encoding:**

1. **No Semantic Meaning:** All words are equally different from each other
   - Distance between "cat" and "dog" = Distance between "cat" and "king"
   - The model can't learn that some words are more similar than others

2. **Huge Dimensions:** With 100,000 words in a vocabulary, each word is a 100,000-dimensional vector!
   - Wastes memory and computation
   - 99.999% of the values are zeros (sparse representation)

3. **Can't Generalize:** If the model sees "cat" during training but not "dog", it has no way to know they're related
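
A quick back-of-the-envelope calculation shows how much the dimensionality matters. This sketch assumes every value is stored as a 4-byte float and that one vector is stored per vocabulary word; the variable names are illustrative only.

```python
vocab_size = 100_000       # words in the vocabulary
embedding_dim = 300        # size of a dense embedding
bytes_per_float32 = 4      # assumption: values stored as 32-bit floats

# One-hot: every word needs a vector as long as the vocabulary
one_hot_bytes = vocab_size * vocab_size * bytes_per_float32

# Dense: every word needs only embedding_dim numbers
dense_bytes = vocab_size * embedding_dim * bytes_per_float32

print(f"One-hot table: {one_hot_bytes / 1e9:.0f} GB")   # 40 GB
print(f"Dense table:   {dense_bytes / 1e6:.0f} MB")     # 120 MB
```

In practice one-hot vectors are rarely stored densely like this, but the comparison gives a feel for the waste.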

**Method 2: Word Embeddings (The Smart Approach)**

Word embeddings represent each word as a **dense vector** of real numbers (typically 50-300 dimensions):

```
cat   = [0.2,  0.8, -0.3,  0.1, ...]  (50-300 numbers)
dog   = [0.3,  0.7, -0.2,  0.2, ...]  (similar to cat!)
king  = [0.9, -0.1,  0.4,  0.8, ...]
queen = [0.8, -0.2,  0.3,  0.9, ...]  (similar to king!)
```

**Advantages of Word Embeddings:**

1. **Captures Semantic Meaning:** Similar words have similar vectors
2. **Compact:** 300 dimensions instead of 100,000
3. **Learned from Data:** The model discovers relationships automatically
4. **Generalizable:** Understanding of "cat" helps with "dog"

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*OEmWDt4eztOcm5pr2QbxfA.png" width="600"/>
</div>

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Check if CUDA is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU not available, using CPU")

**Demonstration: One-Hot vs Embeddings**

Let's create a simple example to see the difference:

In [None]:
# Example vocabulary
vocab = ["cat", "dog", "king", "queen", "happy"]
vocab_size = len(vocab)

# Create word to index mapping
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

print("Vocabulary:")
print(word_to_idx)
print("\n" + "="*60)

# Method 1: One-hot encoding
def one_hot_encode(word, word_to_idx, vocab_size):
    """Convert a word to one-hot vector"""
    vector = np.zeros(vocab_size)
    vector[word_to_idx[word]] = 1
    return vector

print("\nOne-Hot Encoding:")
print("="*60)
for word in vocab:
    one_hot = one_hot_encode(word, word_to_idx, vocab_size)
    print(f"{word:8s}: {one_hot}")

# Calculate similarity between one-hot vectors
cat_onehot = one_hot_encode("cat", word_to_idx, vocab_size)
dog_onehot = one_hot_encode("dog", word_to_idx, vocab_size)
king_onehot = one_hot_encode("king", word_to_idx, vocab_size)

# Cosine similarity
def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print("\nOne-Hot Similarities:")
print("="*60)
print(f"cat ↔ dog:  {cosine_similarity(cat_onehot, dog_onehot):.4f}")
print(f"cat ↔ king: {cosine_similarity(cat_onehot, king_onehot):.4f}")
print("\n⚠️  All words are equally dissimilar (0.0) - No semantic meaning!")

In [None]:
# Method 2: Dense embeddings (example - these would be learned)
# Let's create hypothetical learned embeddings to demonstrate the concept
print("\nDense Word Embeddings (Example - would be learned):")
print("="*60)

# Hypothetical 5-dimensional embeddings
# In reality, we'd have 50-300 dimensions, but 5 is easier to visualize
embeddings_example = {
    "cat":   np.array([0.8,  0.7, -0.2,  0.1,  0.3]),  # Animals, pets
    "dog":   np.array([0.7,  0.8, -0.1,  0.2,  0.4]),  # Similar to cat
    "king":  np.array([0.1, -0.2,  0.9,  0.8,  0.1]),  # Royalty, male
    "queen": np.array([0.0, -0.1,  0.9,  0.7,  0.2]),  # Royalty, female
    "happy": np.array([0.2,  0.1,  0.1,  0.1,  0.9]),  # Emotion
}

for word, embedding in embeddings_example.items():
    print(f"{word:8s}: [{', '.join([f'{x:5.2f}' for x in embedding])}]")

print("\nEmbedding Similarities:")
print("="*60)
cat_emb = embeddings_example["cat"]
dog_emb = embeddings_example["dog"]
king_emb = embeddings_example["king"]

print(f"cat ↔ dog:  {cosine_similarity(cat_emb, dog_emb):.4f} (High - both animals!)")
print(f"cat ↔ king: {cosine_similarity(cat_emb, king_emb):.4f} (Low - different concepts)")
print(f"king ↔ queen: {cosine_similarity(king_emb, embeddings_example['queen']):.4f} (High - both royalty!)")
print("\n✅ Embeddings capture semantic relationships!")

### **3. The Distributional Hypothesis**

**"You shall know a word by the company it keeps"** - J.R. Firth (1957)

This is the fundamental idea behind Word2Vec. The meaning of a word is determined by the words that frequently appear near it.

**Examples:**

1. **"The cat sat on the mat"**
   - Words near "cat": the, sat, on

2. **"The dog sat on the rug"**
   - Words near "dog": the, sat, on

Notice that "cat" and "dog" appear in similar contexts! This is how the model learns they're related.

**More Examples:**

```
"The king ruled the kingdom"
"The queen ruled the empire"
```
→ "king" and "queen" appear in similar contexts (both rule)

```
"I am very happy today"
"I am very joyful today"
```
→ "happy" and "joyful" are interchangeable (similar meaning)

**Key Insight:**

If we can build a model that:
1. Looks at what words appear near each other
2. Learns to predict context words from target words (or vice versa)
3. Adjusts word vectors to make these predictions accurate

Then the learned vectors will automatically capture semantic meaning!
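
The first step of that recipe, looking at which words appear near each other, can be sketched in plain Python. The helper `context_counts` is a hypothetical name for illustration; it simply counts the neighbors of a target word within a window.

```python
from collections import Counter

def context_counts(sentences, target, window_size=2):
    """Count the words that appear within `window_size` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for pos, word in enumerate(words):
            if word != target:
                continue
            start = max(0, pos - window_size)
            end = min(len(words), pos + window_size + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    counts[words[ctx_pos]] += 1
    return counts

sentences = ["The cat sat on the mat", "The dog sat on the rug"]
cat_context = context_counts(sentences, "cat")
dog_context = context_counts(sentences, "dog")

# Words that show up near both "cat" and "dog"
shared = set(cat_context) & set(dog_context)
print(sorted(shared))  # ['on', 'sat', 'the']
```

The two words never appear in the same sentence here, yet their context sets are identical. That overlap is the signal a Word2Vec model picks up on.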

### **4. Word2Vec: Two Architectures**

Word2Vec has two main architectures:

**1. CBOW (Continuous Bag of Words):**
- **Input:** Context words (surrounding words)
- **Output:** Target word (center word)
- **Task:** Predict the center word from its context

**Example:**
```
Sentence: "The cat sat on the mat"
Context: ["The", "cat", "on", "the"] → Target: "sat"
```
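
As a rough sketch of how CBOW training examples could be built, the hypothetical helper `cbow_pairs` below groups all context words of a position into one input and uses the center word as the label. It is only an illustration of the data layout, not a full CBOW implementation.

```python
def cbow_pairs(tokens, window_size=2):
    """Build (context_words, target_word) examples for CBOW-style training."""
    pairs = []
    for pos, target in enumerate(tokens):
        start = max(0, pos - window_size)
        end = min(len(tokens), pos + window_size + 1)
        # All words in the window except the center word
        context = [tokens[i] for i in range(start, end) if i != pos]
        pairs.append((context, target))
    return pairs

tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = cbow_pairs(tokens, window_size=2)

print(pairs[2])  # (['The', 'cat', 'on', 'the'], 'sat')
```

Each position yields one example with several inputs, whereas Skip-Gram turns the same window into several examples with one input each.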

**2. Skip-Gram (Our Focus Today):**
- **Input:** Target word (center word)
- **Output:** Context words (surrounding words)
- **Task:** Predict context words from the center word

**Example:**
```
Sentence: "The cat sat on the mat"
Target: "sat" → Context: ["cat", "on"] (window size = 1)
Target: "sat" → Context: ["The", "cat", "on", "the"] (window size = 2)
```

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*cuOmGT7NevP9oJFJfVpRKA.png" width="700"/>
</div>

**Why Skip-Gram?**

1. **Better for rare words:** Learns better representations for infrequent words
2. **More training examples:** One target word creates multiple (target, context) pairs
3. **Works well with limited data:** It tends to produce good embeddings even from a modest amount of text, although CBOW is usually faster to train

**The Skip-Gram Process:**

```
Sentence: "The quick brown fox jumps"
Window size: 2

For target word "brown" (index 2):
  - Context word 1: "The" (2 positions left)
  - Context word 2: "quick" (1 position left)
  - Context word 3: "fox" (1 position right)
  - Context word 4: "jumps" (2 positions right)

Training pairs created:
  (brown, The)
  (brown, quick)
  (brown, fox)
  (brown, jumps)
```

This transforms unsupervised text into supervised learning pairs!

### **5. Understanding the Skip-Gram Architecture**

The Skip-Gram model is surprisingly simple - just two layers:

**Architecture:**

```
Input: Target word index (scalar)
   ↓
Embedding Layer: vocab_size × embedding_dim
   ↓ (This is what we want! The learned word vectors)
Dense word vector (embedding_dim dimensions)
   ↓
Linear Layer: embedding_dim × vocab_size
   ↓
Output: Scores for all words (vocab_size)
   ↓
Softmax: Convert to probabilities
   ↓
Predicted context word
```

**Layer Details:**

1. **Embedding Layer (`nn.Embedding`):**
   - A lookup table that converts word indices to dense vectors
   - Shape: `(vocab_size, embedding_dim)`
   - Example: If vocab_size=10,000 and embedding_dim=100:
     - This is a 10,000 × 100 matrix
     - Each row is the embedding vector for one word
   - **This layer's weights ARE the word embeddings we want to learn!**

2. **Linear Layer (`nn.Linear`):**
   - Maps from embedding space back to vocabulary space
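   - Shape: takes `embedding_dim` inputs and produces `vocab_size` outputs (PyTorch stores the weight as `(vocab_size, embedding_dim)`, plus a bias of size `vocab_size`)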
   - Produces a score for each possible context word

**Example with Concrete Numbers:**

```python
import torch
import torch.nn as nn

# Suppose:
vocab_size = 10_000      # number of words in the vocabulary
embedding_dim = 100
target_word_idx = 542    # Hypothetical index for "cat"

# Step 1: Embedding lookup
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
word_vector = embedding_layer(torch.tensor([target_word_idx]))
# Result: shape (1, 100), the embedding for "cat" (a batch of one word)

# Step 2: Linear transformation
linear_layer = nn.Linear(embedding_dim, vocab_size)
scores = linear_layer(word_vector)
# Result: shape (1, 10000), one score for each word in the vocabulary

# Step 3: Softmax (applied inside nn.CrossEntropyLoss during training)
# Converts scores to probabilities
# Model predicts which word is likely to be in context
```

**Training Objective:**

Given a (target, context) pair like `("cat", "sat")`:
1. Feed "cat" index into the network
2. Get probability distribution over all vocabulary words
3. We want high probability for "sat" (the actual context word)
4. Use Cross-Entropy Loss to measure error
5. Backpropagation adjusts the embedding weights

After many training examples, words that appear in similar contexts will have similar embeddings!
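
To see what steps 2 to 4 compute, here is a small worked example in NumPy. The four-word vocabulary and the scores are made up for illustration; in the real model the scores come from the linear layer.

```python
import numpy as np

# Hypothetical scores the network produced for the target word "cat"
toy_vocab = ["cat", "sat", "mat", "king"]
scores = np.array([1.0, 3.0, 2.0, -1.0])
true_context_idx = toy_vocab.index("sat")   # the actual context word

# Softmax: exponentiate and normalize (subtracting the max keeps it numerically stable)
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

# Cross-entropy for one example: negative log-probability of the true word
loss = -np.log(probs[true_context_idx])

for word, p in zip(toy_vocab, probs):
    print(f"P({word:4s} | cat) = {p:.3f}")
print(f"Loss: {loss:.3f}")
```

The loss shrinks toward 0 as the probability assigned to "sat" approaches 1, so minimizing it pushes the model to score true context words highly.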

### **6. Preparing Our Text Corpus**

Let's start our implementation! We'll use a sample text corpus to train our Word2Vec model.

**The Workflow:**
1. **Get text data** - A collection of sentences
2. **Preprocess** - Clean and tokenize the text
3. **Build vocabulary** - Create word ↔ index mappings
4. **Generate training pairs** - Create (target, context) pairs
5. **Train model** - Learn the embeddings
6. **Extract embeddings** - Get the learned word vectors

For this tutorial, we'll use a small corpus of sentences. In practice, you'd use much larger datasets (Wikipedia, books, news articles, etc.).

In [None]:
# Sample text corpus
# In practice, you'd load this from a file or dataset
corpus = """
The cat sat on the mat.
The dog sat on the log.
Cats and dogs are animals.
The cat and the dog are friends.
A king rules a kingdom.
A queen rules an empire.
The king and queen are royalty.
The happy cat played with the ball.
The joyful dog ran in the park.
Animals are happy when they play.
The cat sleeps on the warm mat.
The dog loves to run and jump.
Kings and queens live in castles.
Happy animals are healthy animals.
The small cat climbed the tall tree.
The big dog guarded the house.
Cats like to chase mice.
Dogs like to catch balls.
The wise king made fair decisions.
The kind queen helped her people.
"""

print("Sample Text Corpus:")
print("="*70)
print(corpus.strip())
print("="*70)
print(f"\nCorpus length: {len(corpus)} characters")

### **7. Text Preprocessing**

Before we can work with text, we need to preprocess it:

**Preprocessing Steps:**

1. **Lowercase:** Convert all text to lowercase
   - "The Cat" → "the cat"
   - Ensures "Cat" and "cat" are treated as the same word

2. **Remove Punctuation:** Remove periods, commas, etc.
   - "cat." → "cat"
   - Simplifies tokenization

3. **Tokenization:** Split text into individual words
   - "the cat sat" → ["the", "cat", "sat"]
   - Each word becomes a token

4. **Build Vocabulary:** Create mappings between words and indices
   - word_to_idx: {"cat": 0, "dog": 1, ...}
   - idx_to_word: {0: "cat", 1: "dog", ...}

These mappings allow us to convert between words and the numerical indices our neural network needs.

In [None]:
import re

def preprocess_text(text):
    """
    Preprocess text: lowercase, remove punctuation, tokenize
    
    Args:
        text: Raw text string
    
    Returns:
        List of tokens (words)
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation (keep only letters and spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Split into words (tokenization)
    tokens = text.split()
    
    return tokens

# Preprocess our corpus
tokens = preprocess_text(corpus)

print("Tokenized Text (first 50 tokens):")
print("="*70)
print(tokens[:50])
print("="*70)
print(f"\nTotal tokens: {len(tokens)}")

In [None]:
def build_vocabulary(tokens, min_count=1):
    """
    Build vocabulary from tokens
    
    Args:
        tokens: List of word tokens
        min_count: Minimum frequency for a word to be included
    
    Returns:
        word_to_idx: Dictionary mapping words to indices
        idx_to_word: Dictionary mapping indices to words
        word_counts: Counter object with word frequencies
    """
    # Count word frequencies
    word_counts = Counter(tokens)
    
    # Filter words by minimum count
    vocab_words = [word for word, count in word_counts.items() if count >= min_count]
    
    # Sort for consistency
    vocab_words = sorted(vocab_words)
    
    # Create mappings
    word_to_idx = {word: idx for idx, word in enumerate(vocab_words)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    
    return word_to_idx, idx_to_word, word_counts

# Build vocabulary
word_to_idx, idx_to_word, word_counts = build_vocabulary(tokens, min_count=2)

vocab_size = len(word_to_idx)

print("Vocabulary Statistics:")
print("="*70)
print(f"Vocabulary size: {vocab_size} unique words")
print(f"Total tokens: {len(tokens)} words")
print("\nMost common words:")
for word, count in word_counts.most_common(10):
    print(f"  {word:15s}: {count:3d} occurrences")
print("="*70)

In [None]:
# Display sample of word-to-index mappings
print("\nSample Word-to-Index Mappings:")
print("="*70)
sample_words = list(word_to_idx.items())[:15]
for word, idx in sample_words:
    print(f"  '{word:10s}' → index {idx:3d}")
print("="*70)

print(f"\n✅ Preprocessing complete!")
print(f"   We now have {vocab_size} unique words in our vocabulary.")
print(f"   Each word has a unique index from 0 to {vocab_size-1}.")

### **8. Generating Training Pairs**

Now we'll create the (target, context) pairs that will train our model. This is where we transform unsupervised text into supervised learning data!

**The Sliding Window Approach:**

We slide a window across our text and for each target word, we look at surrounding words within the window.

**Example with window_size = 2:**

```
Sentence: ["the", "cat", "sat", "on", "the", "mat"]
          [  0  ,   1  ,   2  ,  3 ,   4  ,   5  ]  (positions)

Target word at position 2 ("sat"):
  - Look left up to 2 positions: "the" (pos 0), "cat" (pos 1)
  - Look right up to 2 positions: "on" (pos 3), "the" (pos 4)

Training pairs created:
  (sat, the)   # from position 0
  (sat, cat)   # from position 1
  (sat, on)    # from position 3
  (sat, the)   # from position 4
```

**Key Points:**
- **window_size** determines how far we look for context
- Larger windows capture broader context but may include less relevant words
- Smaller windows focus on immediate neighbors
- Common values: 2-5
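
To get a feel for how `window_size` affects the amount of training data, this small sketch counts the pairs a text of a given length would produce. The helper `count_pairs` is a hypothetical name; it uses the same boundary handling as the sliding-window example above.

```python
def count_pairs(num_tokens, window_size):
    """Count (target, context) pairs for a text of `num_tokens` words."""
    total = 0
    for pos in range(num_tokens):
        start = max(0, pos - window_size)
        end = min(num_tokens, pos + window_size + 1)
        total += (end - start) - 1   # every word in the window except the target
    return total

# The six-token sentence from the example above
for w in (1, 2, 5):
    print(f"window_size={w}: {count_pairs(6, w)} pairs")
# window_size=1: 10 pairs
# window_size=2: 18 pairs
# window_size=5: 30 pairs
```

Words near the start and end of the text have fewer neighbors, which is why the count is a little below `2 * window_size * num_tokens`.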

**Implementation Strategy:**
1. Convert tokens to indices
2. For each target word position:
   - Get words within the window
   - Create (target_idx, context_idx) pairs
3. Return as PyTorch tensors for training

In [None]:
def generate_training_pairs(tokens, word_to_idx, window_size=2):
    """
    Generate (target, context) training pairs using Skip-Gram approach
    
    Args:
        tokens: List of word tokens
        word_to_idx: Dictionary mapping words to indices
        window_size: How many words to look at on each side
    
    Returns:
        target_indices: Tensor of target word indices
        context_indices: Tensor of context word indices
    """
    target_indices = []
    context_indices = []
    
    # Convert tokens to indices
    token_indices = [word_to_idx[token] for token in tokens if token in word_to_idx]
    
    # Slide window across the text
    for target_pos in range(len(token_indices)):
        target_idx = token_indices[target_pos]
        
        # Define the window boundaries
        start = max(0, target_pos - window_size)
        end = min(len(token_indices), target_pos + window_size + 1)
        
        # Get context words (all words in window except target)
        for context_pos in range(start, end):
            if context_pos != target_pos:  # Don't use the target word as its own context
                context_idx = token_indices[context_pos]
                target_indices.append(target_idx)
                context_indices.append(context_idx)
    
    # Convert to PyTorch tensors
    target_tensor = torch.tensor(target_indices, dtype=torch.long)
    context_tensor = torch.tensor(context_indices, dtype=torch.long)
    
    return target_tensor, context_tensor

# Generate training pairs
window_size = 2
target_tensor, context_tensor = generate_training_pairs(tokens, word_to_idx, window_size)

print("Training Pair Generation:")
print("="*70)
print(f"Window size: {window_size}")
print(f"Total training pairs: {len(target_tensor):,}")
print(f"Vocabulary size: {vocab_size}")
print("="*70)

In [None]:
# Display sample training pairs
print("\nSample Training Pairs:")
print("="*70)
print(f"{'Target Word':<15} | {'Context Word':<15} | {'Target Idx':<10} | {'Context Idx'}")
print("="*70)

num_samples = 20
for i in range(min(num_samples, len(target_tensor))):
    target_idx = target_tensor[i].item()
    context_idx = context_tensor[i].item()
    target_word = idx_to_word[target_idx]
    context_word = idx_to_word[context_idx]
    print(f"{target_word:<15} | {context_word:<15} | {target_idx:<10} | {context_idx}")

print("="*70)
print(f"\n✅ Generated {len(target_tensor):,} training pairs!")
print(f"   These pairs will teach the model which words appear together.")

**Understanding the Training Pairs:**

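Across the full set of pairs (the table above shows only the first 20), notice:
- "the" appears with many different words (it's a common article)
- "cat" appears with words like "the", "sat", "on" (its neighbors within the window)
- "dog" appears with similar words to "cat" (helping the model learn they're related)
- "king" and "queen" appear in similar contexts (both sit next to "a" and "rules")
- Words seen only once, such as "kingdom", were dropped by `min_count=2`, and the window slides across sentence boundaries, so some pairs link words from neighboring sentences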

The model will learn to give similar embeddings to words that share similar contexts!

### **9. Building the Skip-Gram Model**

Now let's implement our Skip-Gram neural network in PyTorch!

**Model Architecture Recap:**

```
Input: target_word_idx (scalar integer)
   ↓
Embedding Layer: [vocab_size × embedding_dim]
   ↓
Word Vector: [embedding_dim]
   ↓
Linear Layer: [embedding_dim × vocab_size]
   ↓
Output Scores: [vocab_size]
```

**The Magic of nn.Embedding:**

`nn.Embedding` is essentially a lookup table:
- It stores a matrix of size `(vocab_size, embedding_dim)`
- When you pass in an index, it returns the corresponding row
- During training, these rows (word vectors) are updated via backpropagation

**Example:**
```python
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=50)
# This creates a 1000 × 50 matrix

word_idx = torch.tensor([42])
word_vector = embedding(word_idx)
# Returns row 42 from the matrix, with shape (1, 50)
```

**After training, the weights of this embedding layer ARE our word vectors!**
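
The "lookup table" claim can be checked directly. The sketch below uses a tiny, randomly initialized table (sizes chosen only for illustration) and shows that an embedding lookup returns a row of the weight matrix, which is also what multiplying a one-hot vector by that matrix would give.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# A tiny table: 5 words, 3 numbers per word
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
word_idx = torch.tensor([2])

# 1) Normal lookup
looked_up = embedding(word_idx)                          # shape (1, 3)

# 2) Reading the row straight out of the weight matrix
row = embedding.weight[2]                                # shape (3,)

# 3) One-hot vector times the weight matrix
one_hot = F.one_hot(word_idx, num_classes=5).float()     # shape (1, 5)
via_matmul = one_hot @ embedding.weight                  # shape (1, 3)

print(torch.equal(looked_up[0], row))         # True
print(torch.allclose(looked_up, via_matmul))  # True
```

So `nn.Embedding` gives the same result as feeding a one-hot vector through a bias-free linear layer, without ever building the large one-hot vector.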

In [None]:
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        """
        Initialize the Skip-Gram model
        
        Args:
            vocab_size: Number of unique words in vocabulary
            embedding_dim: Dimension of word embedding vectors
        """
        super(SkipGramModel, self).__init__()
        
        # Embedding layer: converts word indices to dense vectors
        # This is the layer whose weights we want to extract after training!
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Linear layer: maps from embedding space to vocabulary space
        # Used to predict context words
        self.linear = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, target_indices):
        """
        Forward pass
        
        Args:
            target_indices: Indices of target words (batch_size,)
        
        Returns:
            Scores for each word in vocabulary (batch_size, vocab_size)
        """
        # Look up embeddings for target words
        # Shape: (batch_size, embedding_dim)
        embeds = self.embeddings(target_indices)
        
        # Project to vocabulary space
        # Shape: (batch_size, vocab_size)
        output = self.linear(embeds)
        
        return output
    
    def get_embeddings(self):
        """
        Extract the learned word embeddings
        
        Returns:
            Embedding matrix of shape (vocab_size, embedding_dim)
        """
        # detach() returns the weights without tracking gradients (preferred over .data)
        return self.embeddings.weight.detach()

# Hyperparameters
embedding_dim = 50  # Dimension of word vectors (common choices: 50, 100, 300)

# Create the model
model = SkipGramModel(vocab_size, embedding_dim).to(device)

print("Skip-Gram Model Architecture:")
print("="*70)
print(model)
print("="*70)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
embedding_params = model.embeddings.weight.numel()
linear_params = sum(p.numel() for p in model.linear.parameters())

print(f"\nParameter Breakdown:")
print(f"  Embedding layer: {embedding_params:,} parameters ({vocab_size} × {embedding_dim})")
print(f"  Linear layer:    {linear_params:,} parameters ({embedding_dim} × {vocab_size} + {vocab_size})")
print(f"  Total:           {total_params:,} parameters")
print("="*70)

### **10. Loss Function and Optimizer**

**Loss Function: Cross-Entropy Loss**

Just like in classification tasks, we use Cross-Entropy Loss because:
- We're predicting which word (from vocabulary) is in the context
- This is a multi-class classification problem
- Cross-Entropy measures how well predicted probabilities match the true context word

**Training Objective:**

For a (target, context) pair like `("cat", "sat")`:
1. Input: "cat" index
2. Model outputs: probability distribution over all words
3. Loss: How different is this from the true distribution (where "sat" = 1, others = 0)
4. Backprop: Adjust embeddings to increase probability of "sat"

**Optimizer: Adam**

We'll use Adam optimizer (like in Day 4) because:
- Adaptive learning rates work well for embeddings
- Converges faster than vanilla SGD
- Requires minimal hyperparameter tuning
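
A handy sanity check before training: if the model has no preference and scores every word equally, Cross-Entropy Loss equals the natural log of the vocabulary size. The sketch below demonstrates this with made-up sizes and targets (the `_demo` names are illustrative and separate from the tutorial's variables).

```python
import math
import torch
import torch.nn as nn

vocab_size_demo = 30                      # hypothetical vocabulary size
criterion_demo = nn.CrossEntropyLoss()

# All-zero scores mean every word gets the same probability, 1 / vocab_size
uniform_scores = torch.zeros(4, vocab_size_demo)   # batch of 4 examples
targets = torch.tensor([0, 5, 12, 29])             # arbitrary "true" context words

uniform_loss = criterion_demo(uniform_scores, targets)

print(f"Loss with uniform predictions: {uniform_loss.item():.4f}")
print(f"ln(vocab_size):                {math.log(vocab_size_demo):.4f}")
```

An untrained model usually starts near this value, and a training loss well below it suggests the model is learning something about context.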

In [None]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
learning_rate = 0.01
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

print("Training Configuration:")
print("="*70)
print(f"Loss Function:  {criterion}")
print(f"Optimizer:      Adam")
print(f"Learning Rate:  {learning_rate}")
print(f"Device:         {device}")
print("="*70)

### **11. Training the Model**

Now we'll train our Word2Vec model! The training process is similar to what we've done before:

**Training Loop:**
1. **Forward Pass:** Feed target word indices → get predictions
2. **Calculate Loss:** Compare predictions with actual context words
3. **Backward Pass:** Calculate gradients
4. **Update Parameters:** Adjust embedding weights

**What's Happening:**
- The model learns to predict context words from target words
- Words appearing in similar contexts get similar embeddings
- After many updates, the embedding layer contains meaningful word vectors

**Training Tips:**
- For small datasets, 100-500 epochs is typical
- Loss should decrease steadily
- For real applications, you'd use millions of words and more sophisticated techniques (negative sampling)
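
For the curious, here is a minimal sketch of what negative sampling could look like. It is not the model trained in this tutorial: the class name `NegativeSamplingSkipGram`, the sizes, and the uniform choice of negative words are all assumptions made for illustration (the original Word2Vec draws negatives from a smoothed word-frequency distribution). Instead of scoring the whole vocabulary, each step scores one true context word and a handful of random "negative" words.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NegativeSamplingSkipGram(nn.Module):
    """Skip-Gram sketch that scores only a few words per example."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, targets, contexts, negatives):
        # targets: (B,), contexts: (B,), negatives: (B, K)
        t = self.target_embeddings(targets)        # (B, D)
        c = self.context_embeddings(contexts)      # (B, D)
        n = self.context_embeddings(negatives)     # (B, K, D)

        pos_score = (t * c).sum(dim=1)                          # (B,)
        neg_score = torch.bmm(n, t.unsqueeze(2)).squeeze(2)     # (B, K)

        # Push true pairs toward high scores and sampled pairs toward low scores
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=1))
        return loss.mean()

torch.manual_seed(0)
demo_model = NegativeSamplingSkipGram(vocab_size=20, embedding_dim=8)

demo_targets = torch.tensor([1, 4, 7])
demo_contexts = torch.tensor([2, 5, 8])
# Simplification: negatives drawn uniformly at random, 5 per example
demo_negatives = torch.randint(0, 20, (3, 5))

demo_loss = demo_model(demo_targets, demo_contexts, demo_negatives)
demo_loss.backward()
print(f"Negative-sampling loss: {demo_loss.item():.4f}")
```

The cost per example now depends on the number of negatives (5 here) rather than on the vocabulary size, which is what makes this approach practical for very large vocabularies.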

In [None]:
def train_word2vec(model, target_tensor, context_tensor, criterion, optimizer, 
                   num_epochs=100, batch_size=128, print_every=10):
    """
    Train the Word2Vec Skip-Gram model
    
    Args:
        model: The Skip-Gram model
        target_tensor: Target word indices
        context_tensor: Context word indices
        criterion: Loss function
        optimizer: Optimization algorithm
        num_epochs: Number of training epochs
        batch_size: Number of samples per batch
        print_every: Print progress every N epochs
    
    Returns:
        List of losses per epoch
    """
    model.train()
    losses = []
    
    # Move data to device
    target_tensor = target_tensor.to(device)
    context_tensor = context_tensor.to(device)
    
    num_samples = len(target_tensor)
    
    print("Starting Training...")
    print("="*70)
    
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0
        
        # Mini-batch training
        for i in range(0, num_samples, batch_size):
            # Get batch
            batch_targets = target_tensor[i:i+batch_size]
            batch_contexts = context_tensor[i:i+batch_size]
            
            # Forward pass
            outputs = model(batch_targets)
            
            # Calculate loss
            loss = criterion(outputs, batch_contexts)
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            num_batches += 1
        
        # Average loss for this epoch
        avg_loss = epoch_loss / num_batches
        losses.append(avg_loss)
        
        # Print progress
        if (epoch + 1) % print_every == 0:
            print(f"Epoch [{epoch+1:4d}/{num_epochs}] | Loss: {avg_loss:.4f}")
    
    print("="*70)
    print("Training Complete!")
    
    return losses

# Train the model
num_epochs = 200
batch_size = 64

losses = train_word2vec(
    model, 
    target_tensor, 
    context_tensor, 
    criterion, 
    optimizer,
    num_epochs=num_epochs,
    batch_size=batch_size,
    print_every=20
)

### **12. Visualizing Training Progress**

In [None]:
# Plot training loss
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_epochs + 1), losses, linewidth=2, color='#e74c3c')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Word2Vec Training Loss', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTraining Summary:")
print("="*70)
print(f"Initial Loss: {losses[0]:.4f}")
print(f"Final Loss:   {losses[-1]:.4f}")
print(f"Reduction:    {losses[0] - losses[-1]:.4f}")
print("="*70)
print("\n✅ The model has learned word embeddings!")
print("   Words appearing in similar contexts now have similar vectors.")

### **13. Extracting and Analyzing Word Embeddings**

The moment we've been waiting for! Let's extract the learned word vectors from the embedding layer.

**Remember:** The weights of `model.embeddings` are our word vectors!
- Each row corresponds to one word
- Each row is an `embedding_dim`-dimensional vector
- Words with similar meanings should have similar vectors

**How to measure similarity:**

We use **Cosine Similarity**:

$$\text{similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||} = \cos(\theta)$$

- Returns values from -1 to 1
- 1 = identical direction (very similar)
- 0 = orthogonal (unrelated)
- -1 = opposite direction (most dissimilar; note that antonyms often share contexts, so they need not score near -1)
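
The three reference values can be reproduced with a few hand-picked 2-D vectors. The vectors and the helper name `cosine` are chosen purely for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
same_direction = 3 * a                 # longer, but points the same way
orthogonal = np.array([-2.0, 1.0])     # at a right angle to a
opposite = -a                          # points the other way

print(f"same direction: {cosine(a, same_direction):.1f}")   # 1.0
print(f"orthogonal:     {cosine(a, orthogonal):.1f}")       # 0.0
print(f"opposite:       {cosine(a, opposite):.1f}")         # -1.0
```

Note that `3 * a` still scores 1.0: cosine similarity ignores vector length and compares direction only, which suits embeddings where frequent words may end up with larger vectors.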

In [None]:
# Extract the learned embeddings
model.eval()
embeddings = model.get_embeddings().cpu().numpy()

print("Learned Word Embeddings:")
print("="*70)
print(f"Shape: {embeddings.shape}")
print(f"  → {embeddings.shape[0]} words, each represented by {embeddings.shape[1]} numbers")
print("="*70)

# Display embedding for a sample word
sample_word = "cat"
if sample_word in word_to_idx:
    word_idx = word_to_idx[sample_word]
    word_embedding = embeddings[word_idx]
    
    print(f"\nEmbedding for '{sample_word}':")
    print(f"First 10 dimensions: {word_embedding[:10]}")
    print(f"\nThis {embedding_dim}-dimensional vector captures the meaning of '{sample_word}'!")

In [None]:
def cosine_similarity_matrix(embeddings):
    """
    Compute cosine similarity between all pairs of word embeddings
    
    Args:
        embeddings: Embedding matrix (vocab_size, embedding_dim)
    
    Returns:
        Similarity matrix (vocab_size, vocab_size)
    """
    # Normalize embeddings
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-8)  # Add small value to avoid division by zero
    
    # Compute cosine similarity
    similarity = np.dot(normalized, normalized.T)
    
    return similarity

def find_most_similar(word, word_to_idx, idx_to_word, embeddings, top_k=5):
    """
    Find the most similar words to a given word
    
    Args:
        word: Query word
        word_to_idx: Word to index mapping
        idx_to_word: Index to word mapping
        embeddings: Embedding matrix
        top_k: Number of similar words to return
    
    Returns:
        List of (word, similarity_score) tuples
    """
    if word not in word_to_idx:
        return []
    
    # Get word index and embedding
    word_idx = word_to_idx[word]
    word_embedding = embeddings[word_idx]
    
    # Compute similarities with all words
    similarities = []
    for idx in range(len(embeddings)):
        other_embedding = embeddings[idx]
        
        # Cosine similarity
        similarity = np.dot(word_embedding, other_embedding) / (
            np.linalg.norm(word_embedding) * np.linalg.norm(other_embedding) + 1e-8
        )
        
        similarities.append((idx, similarity))
    
    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Get top k, excluding the query word by index rather than by position
    # (a tie in similarity could otherwise place another word first)
    results = []
    for idx, sim in similarities:
        if idx == word_idx:
            continue
        results.append((idx_to_word[idx], sim))
        if len(results) == top_k:
            break
    
    return results

# Test the similarity function
print("\nTesting Word Similarity:")
print("="*70)

In [None]:
# Find similar words for different test cases
test_words = ["cat", "dog", "king", "queen", "happy"]

for test_word in test_words:
    if test_word in word_to_idx:
        print(f"\nMost similar words to '{test_word}':")
        print("-" * 50)
        similar_words = find_most_similar(test_word, word_to_idx, idx_to_word, embeddings, top_k=5)
        
        for i, (word, similarity) in enumerate(similar_words, 1):
            print(f"  {i}. {word:<15s} (similarity: {similarity:.4f})")

print("\n" + "="*70)

**Interpreting the Results:**

If the model trained well, you should see:
- **"cat"** is similar to **"dog"** (both animals, both pets)
- **"king"** is similar to **"queen"** (both royalty)
- **"happy"** is similar to words in emotional contexts

The model learned these relationships **purely from seeing which words appear together** - we never told it that cats and dogs are animals!
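
To make the similarity scores above concrete, here is a minimal sketch of cosine similarity on three toy vectors. The numbers are hand-picked for illustration (they are not learned embeddings), and the helper name `cosine` is our own.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only
toy_embeddings = {
    "cat":  np.array([0.9, 0.8, 0.1]),
    "dog":  np.array([0.8, 0.9, 0.2]),
    "king": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat_dog = cosine(toy_embeddings["cat"], toy_embeddings["dog"])
cat_king = cosine(toy_embeddings["cat"], toy_embeddings["king"])

print(f"cat vs dog:  {cat_dog:.3f}")
print(f"cat vs king: {cat_king:.3f}")
```

Because "cat" and "dog" point in nearly the same direction, their score is close to 1, while "cat" and "king" score much lower. Cosine similarity ignores vector length and compares direction only.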

**Note:** With our small corpus, the similarities may be noisy. To get better embeddings in practice:
- Use millions of words of text
- Train for longer
- Use larger embedding dimensions (100-300)
- Use techniques like negative sampling for efficiency

### **16. Conclusion and Key Takeaways**

Congratulations! You've successfully implemented Word2Vec from scratch and learned how machines can understand language!

**🎯 What We Accomplished:**

1. ✅ Understood the problem: How to represent words as numbers
2. ✅ Learned about one-hot encoding vs dense embeddings
3. ✅ Explored the distributional hypothesis ("a word is known by its context")
4. ✅ Preprocessed text: tokenization, vocabulary building
5. ✅ Generated training pairs using the Skip-Gram approach
6. ✅ Built and trained a neural network to learn word embeddings
7. ✅ Extracted learned embeddings and found similar words
8. ✅ Visualized embeddings in 2D to see semantic clusters
9. ✅ Performed word analogies using vector arithmetic
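
As a quick refresher on step 5, the sketch below shows one way (target, context) pairs can be generated from a tokenized sentence. The function name `skipgram_pairs` and the parameter `window_size` are illustrative choices and may differ from the helper used earlier in this notebook.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs for every neighbor inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window_size=1)
print(pairs[:4])
```

Each pair becomes one training example: the model sees the target word and is asked to predict the context word.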

**🔑 Key Concepts:**

1. **Word Embeddings:**
   - Dense vector representations of words
   - Capture semantic meaning automatically
   - Much better than sparse one-hot encodings

2. **Distributional Hypothesis:**
   - Words appearing in similar contexts have similar meanings
   - The foundation of Word2Vec and modern NLP

3. **Skip-Gram Model:**
   - Predicts context words from target words
   - Simple 2-layer architecture
   - The embedding layer weights ARE the word vectors

4. **Training Process:**
   - Transform text into (target, context) pairs
   - Train like any supervised learning problem
   - Similar words end up with similar vectors

5. **Vector Arithmetic:**
   - Can perform math on meanings: king - man + woman ≈ queen
   - Embeddings capture semantic relationships
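
The vector arithmetic idea can be sketched on a tiny hand-made vocabulary. The 2-D vectors below are invented for illustration (one axis loosely standing for "royalty", the other for "gender"), and the `analogy` helper is hypothetical, not part of any library.

```python
import numpy as np

# Hypothetical 2-D vectors, hand-picked so the analogy works out
vectors = {
    "king":  np.array([0.95, 0.90]),
    "queen": np.array([0.90, 0.15]),
    "man":   np.array([0.10, 0.85]),
    "woman": np.array([0.12, 0.10]),
    "apple": np.array([0.50, 0.50]),
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the input words themselves
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, float(best_score)

print(analogy("king", "man", "woman", vectors))
```

Excluding the three input words matters: without that step, the nearest vector to the result is often one of the inputs (usually "king" itself).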

**📊 From Simple to Powerful:**

| Aspect | One-Hot Encoding | Word Embeddings |
|--------|-----------------|------------------|
| **Dimensions** | Vocabulary size (e.g., 100,000) | Fixed small size (50-300) |
| **Sparsity** | 99.999% zeros | Dense (all values meaningful) |
| **Semantic Meaning** | None (all words equally different) | Rich (similar words have similar vectors) |
| **Generalization** | Poor | Excellent |
| **Vector Arithmetic** | Meaningless | Meaningful analogies! |
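
Two rows of this table can be checked in a few lines. The sketch below uses a made-up three-word vocabulary and a random matrix `W` standing in for a trained embedding layer. It shows that distinct one-hot vectors always have zero similarity, and that multiplying a one-hot vector by the embedding matrix simply selects one row.

```python
import numpy as np

vocab = ["cat", "dog", "king"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

# The dot product between any two different one-hot vectors is 0,
# so "cat" is exactly as unrelated to "dog" as it is to "king".
cat_dog_one_hot = float(one_hot[0] @ one_hot[1])
cat_king_one_hot = float(one_hot[0] @ one_hot[2])
print(cat_dog_one_hot, cat_king_one_hot)

# Hypothetical embedding matrix (random here; learned in a real model)
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), 4))

# one-hot vector times W picks out a single row: that row is the word vector
dog_vector = one_hot[1] @ W
print(np.allclose(dog_vector, W[1]))
```

This is why the embedding layer weights are the word vectors: the one-hot input only acts as a row index into the matrix.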

**🚀 Real-World Applications:**

Word embeddings are used in:
- **Search Engines:** Understanding query intent
- **Recommendation Systems:** Finding similar products/content
- **Sentiment Analysis:** Understanding emotions in text
- **Machine Translation:** Google Translate, DeepL
- **Chatbots:** Understanding user messages
- **Question Answering:** Finding relevant answers
- **Document Classification:** Categorizing articles, emails

**💡 Improvements for Production:**

Our implementation is educational. For real applications:

1. **Use More Data:**
   - Millions/billions of words (Wikipedia, books, web pages)
   - Our 200 words → Real systems use billions

2. **Negative Sampling:**
   - Instead of predicting over entire vocabulary (slow)
   - Sample a few "negative" examples
   - Much faster and more efficient

3. **Subword Information:**
   - Handle rare/unknown words better
   - FastText extends Word2Vec with character n-grams

4. **Pre-trained Embeddings:**
   - Use embeddings trained on huge corpora
   - GloVe, FastText, Word2Vec pre-trained models
   - Don't train from scratch for every task

5. **Modern Alternatives:**
   - **BERT, GPT, T5:** Contextual embeddings (different vectors for same word in different contexts)
   - **Sentence Transformers:** Embeddings for entire sentences
   - But Word2Vec is still widely used and very effective!
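
To give a feel for the negative sampling idea from item 2, here is a minimal NumPy sketch of the loss for a single (target, context) pair. All sizes, indices, and names are hypothetical. For simplicity the negatives are drawn uniformly, whereas the original Word2Vec samples them from the word-frequency distribution raised to the 3/4 power.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """
    Skip-gram negative-sampling loss for one (target, context) pair.

    target_vec:    (dim,)   input embedding of the target word
    context_vec:   (dim,)   output embedding of the true context word
    negative_vecs: (k, dim) output embeddings of k sampled "negative" words
    """
    # Push the true pair's score up ...
    positive_term = -np.log(sigmoid(np.dot(context_vec, target_vec)))
    # ... and push the scores of the k random words down
    negative_term = -np.sum(np.log(sigmoid(-negative_vecs @ target_vec)))
    return float(positive_term + negative_term)

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 16, 5           # hypothetical sizes
in_emb = rng.normal(scale=0.1, size=(vocab_size, dim))
out_emb = rng.normal(scale=0.1, size=(vocab_size, dim))

target_idx, context_idx = 3, 7             # arbitrary example pair
# Assumption: uniform sampling, excluding the true context word
candidates = np.delete(np.arange(vocab_size), context_idx)
negative_idx = rng.choice(candidates, size=k, replace=False)

loss = negative_sampling_loss(
    in_emb[target_idx], out_emb[context_idx], out_emb[negative_idx]
)
print(f"Loss computed over {k + 1} words instead of {vocab_size}: {loss:.4f}")
```

The saving comes from the shapes: each training step touches only `k + 1` output vectors rather than computing a softmax over the whole vocabulary.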

**🎓 From Images to Language:**

Compare our journey:

| Day | Task | Input | Output | Key Learning |
|-----|------|-------|--------|---------------|
| **4** | Fashion MNIST | Images (28×28 pixels) | 10 clothing classes | CNNs for vision |
| **5** | Word2Vec | Text (words) | Word vectors | Embeddings for NLP |

Both use neural networks but handle different data types!

**🌟 You've now learned the foundation of Natural Language Processing!**

From here, you can explore:
- Recurrent Neural Networks (RNNs) for sequences
- Transformers (the architecture behind ChatGPT)
- BERT, GPT, and other large language models
- Sentiment analysis, machine translation, text generation

All of these build on the concept of word embeddings that you learned today!

---

**Keep exploring and building! 🚀**