# **Day 5: Word2Vec - Teaching Machines to Understand Language**

### **1. Introduction: The Challenge of Understanding Language**

**The Problem:**

So far, we've worked with numerical data (exam scores) and images (pixels). But how do we make machines understand **words** and **language**?

Words aren't numbers: we can't feed "cat" or "king" directly into a neural network, which only operates on numeric inputs. We need a way to represent words mathematically.

**The Evolution of Word Representation:**

Humans understand that:
- "king" and "queen" are related (both royalty)
- "cat" and "dog" are similar (both animals, both pets)
- "happy" and "joyful" have similar meanings

But how do we teach machines these relationships?

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*sXNXYfAqfLUeiDXPCo130w.png" width="600"/>
</div>

**Our Goal Today:**

Build a system that learns word meanings from text and represents them as vectors (lists of numbers) such that:
- Similar words have similar vectors
- We can perform mathematical operations on meaning (e.g., king - man + woman ≈ queen)
- The machine discovers these relationships automatically from data!
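
To make the arithmetic goal concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The `toy_vectors` values and the `nearest_word` helper are illustrative assumptions, not learned embeddings. Real Word2Vec vectors satisfy such analogies only approximately, and only when trained on enough text.

```python
import numpy as np

# Hand-made toy vectors (assumed for illustration, NOT learned).
# The three dimensions loosely stand for [royalty, gender, fruit].
toy_vectors = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.1]),
    "woman": np.array([0.1, -0.8, 0.1]),
    "apple": np.array([0.0,  0.0, 0.9]),
}

def nearest_word(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query."""
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue  # skip the words that were used to build the query
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# king - man + woman
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
analogy_result = nearest_word(query, toy_vectors, exclude=("king", "man", "woman"))
print(analogy_result)  # queen
```

The input words are excluded from the search because the query vector usually stays close to one of them, which would otherwise win trivially.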

### **2. From One-Hot Encoding to Word Embeddings**

Let's explore two fundamental ways to represent words as numbers.

**Method 1: One-Hot Encoding (The Naive Approach)**

Imagine we have a vocabulary of 5 words: `["cat", "dog", "king", "queen", "happy"]`

One-hot encoding represents each word as a vector with:
- A `1` at the word's position
- `0` everywhere else

```
cat   = [1, 0, 0, 0, 0]
dog   = [0, 1, 0, 0, 0]
king  = [0, 0, 1, 0, 0]
queen = [0, 0, 0, 1, 0]
happy = [0, 0, 0, 0, 1]
```

**Problems with One-Hot Encoding:**

1. **No Semantic Meaning:** All words are equally different from each other
   - Distance between "cat" and "dog" = Distance between "cat" and "king"
   - The model can't learn that some words are more similar than others

2. **Huge Dimensions:** With 100,000 words in a vocabulary, each word is a 100,000-dimensional vector!
   - Wastes memory and computation
   - 99.999% of the values are zeros (sparse representation)

3. **Can't Generalize:** If the model sees "cat" during training but not "dog", it has no way to know they're related

**Method 2: Word Embeddings (The Smart Approach)**

Word embeddings represent each word as a **dense vector** of real numbers (typically 50-300 dimensions):

```
cat   = [0.2,  0.8, -0.3,  0.1, ...]  (50-300 numbers)
dog   = [0.3,  0.7, -0.2,  0.2, ...]  (similar to cat!)
king  = [0.9, -0.1,  0.4,  0.8, ...]
queen = [0.8, -0.2,  0.3,  0.9, ...]  (similar to king!)
```

**Advantages of Word Embeddings:**

1. **Captures Semantic Meaning:** Similar words have similar vectors
2. **Compact:** 300 dimensions instead of 100,000
3. **Learned from Data:** The model discovers relationships automatically
4. **Generalizable:** Understanding of "cat" helps with "dog"
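
The size difference can be checked with quick arithmetic. The sketch below assumes every value is stored as a 32-bit float (4 bytes) and that we keep one vector per vocabulary word; the variable names are illustrative.

```python
vocab_size = 100_000        # vocabulary size from the one-hot example
embedding_dim = 300         # upper end of the typical embedding range
bytes_per_float32 = 4       # assumption: values stored as 32-bit floats

# One one-hot vector per word: vocab_size x vocab_size values
one_hot_bytes = vocab_size * vocab_size * bytes_per_float32
# One dense vector per word: vocab_size x embedding_dim values
dense_bytes = vocab_size * embedding_dim * bytes_per_float32

# Fraction of zeros inside a single one-hot vector
zero_fraction = (vocab_size - 1) / vocab_size

print(f"One-hot table: {one_hot_bytes / 1e9:.0f} GB")     # 40 GB
print(f"Dense table:   {dense_bytes / 1e6:.0f} MB")       # 120 MB
print(f"Zeros per one-hot vector: {zero_fraction:.3%}")   # 99.999%
```

In practice nobody materializes the full one-hot table (indices are used instead), but the comparison shows why dense vectors are the practical representation.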

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*OEmWDt4eztOcm5pr2QbxfA.png" width="600"/>
</div>

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Check if CUDA is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU not available, using CPU")

**Demonstration: One-Hot vs Embeddings**

Let's create a simple example to see the difference:

In [None]:
# Example vocabulary
vocab = ["cat", "dog", "king", "queen", "happy"]
vocab_size = len(vocab)

# Create word to index mapping
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

print("Vocabulary:")
print(word_to_idx)
print("\n" + "="*60)

# Method 1: One-hot encoding
def one_hot_encode(word, word_to_idx, vocab_size):
    """Convert a word to one-hot vector"""
    vector = np.zeros(vocab_size)
    vector[word_to_idx[word]] = 1
    return vector

print("\nOne-Hot Encoding:")
print("="*60)
for word in vocab:
    one_hot = one_hot_encode(word, word_to_idx, vocab_size)
    print(f"{word:8s}: {one_hot}")

# Calculate similarity between one-hot vectors
cat_onehot = one_hot_encode("cat", word_to_idx, vocab_size)
dog_onehot = one_hot_encode("dog", word_to_idx, vocab_size)
king_onehot = one_hot_encode("king", word_to_idx, vocab_size)

# Cosine similarity
def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print("\nOne-Hot Similarities:")
print("="*60)
print(f"cat ↔ dog:  {cosine_similarity(cat_onehot, dog_onehot):.4f}")
print(f"cat ↔ king: {cosine_similarity(cat_onehot, king_onehot):.4f}")
print("\n⚠️  All words are equally dissimilar (0.0) - No semantic meaning!")

In [None]:
# Method 2: Dense embeddings (example - these would be learned)
# Let's create hypothetical learned embeddings to demonstrate the concept
print("\nDense Word Embeddings (Example - would be learned):")
print("="*60)

# Hypothetical 5-dimensional embeddings
# In reality, we'd have 50-300 dimensions, but 5 is easier to visualize
embeddings_example = {
    "cat":   np.array([0.8,  0.7, -0.2,  0.1,  0.3]),  # Animals, pets
    "dog":   np.array([0.7,  0.8, -0.1,  0.2,  0.4]),  # Similar to cat
    "king":  np.array([0.1, -0.2,  0.9,  0.8,  0.1]),  # Royalty, male
    "queen": np.array([0.0, -0.1,  0.9,  0.7,  0.2]),  # Royalty, female
    "happy": np.array([0.2,  0.1,  0.1,  0.1,  0.9]),  # Emotion
}

for word, embedding in embeddings_example.items():
    print(f"{word:8s}: [{', '.join([f'{x:5.2f}' for x in embedding])}]")

print("\nEmbedding Similarities:")
print("="*60)
cat_emb = embeddings_example["cat"]
dog_emb = embeddings_example["dog"]
king_emb = embeddings_example["king"]

print(f"cat ↔ dog:  {cosine_similarity(cat_emb, dog_emb):.4f} (High - both animals!)")
print(f"cat ↔ king: {cosine_similarity(cat_emb, king_emb):.4f} (Low - different concepts)")
print(f"king ↔ queen: {cosine_similarity(king_emb, embeddings_example['queen']):.4f} (High - both royalty!)")
print("\n✅ Embeddings capture semantic relationships!")

### **3. The Distributional Hypothesis**

**"You shall know a word by the company it keeps"** - J.R. Firth (1957)

This is the fundamental idea behind Word2Vec: the meaning of a word can be inferred from the words that frequently appear near it.

**Examples:**

1. **"The cat sat on the mat"**
   - Words near "cat": the, sat, on

2. **"The dog sat on the rug"**
   - Words near "dog": the, sat, on

Notice that "cat" and "dog" appear in similar contexts! This is how the model learns they're related.
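
The overlap can be verified with a few lines of Python. This is a minimal sketch: the `context_words` helper and the window of 2 words on each side are assumptions chosen to reproduce the lists above.

```python
def context_words(sentence, target, window=2):
    """Collect the words within `window` positions of each occurrence of target."""
    tokens = sentence.lower().split()
    found = set()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            found.update(left + right)
    return found

cat_context = context_words("The cat sat on the mat", "cat")
dog_context = context_words("The dog sat on the rug", "dog")

print(sorted(cat_context))                # ['on', 'sat', 'the']
print(sorted(dog_context))                # ['on', 'sat', 'the']
print(sorted(cat_context & dog_context))  # shared context words
```

Two words with heavily overlapping context sets are the ones Word2Vec ends up placing close together.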

**More Examples:**

```
"The king ruled the kingdom"
"The queen ruled the empire"
```
→ "king" and "queen" appear in similar contexts (both rule)

```
"I am very happy today"
"I am very joyful today"
```
→ "happy" and "joyful" are interchangeable (similar meaning)

**Key Insight:**

If we can build a model that:
1. Looks at what words appear near each other
2. Learns to predict context words from target words (or vice versa)
3. Adjusts word vectors to make these predictions accurate

Then the learned vectors will automatically capture semantic meaning!

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*SR6l59udY05_bUICAjb6-w.png" width="600"/>
</div>

### **4. Word2Vec: Two Architectures**

Word2Vec has two main architectures:

**1. CBOW (Continuous Bag of Words):**
- **Input:** Context words (surrounding words)
- **Output:** Target word (center word)
- **Task:** Predict the center word from its context

**Example:**
```
Sentence: "The cat sat on the mat"
Context: ["The", "cat", "on", "the"] → Target: "sat"
```
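
As a rough sketch, CBOW training examples could be generated like this. The `cbow_examples` helper is hypothetical and uses a window of 2 words on each side to match the example above.

```python
def cbow_examples(tokens, window=2):
    """Return a list of (context_words, target_word) examples for CBOW."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        examples.append((left + right, target))
    return examples

sentence_tokens = "The cat sat on the mat".split()
cbow_pairs = cbow_examples(sentence_tokens, window=2)

print(cbow_pairs[2])  # (['The', 'cat', 'on', 'the'], 'sat')
```

Each example has several input words and a single label, which is the mirror image of the Skip-Gram setup.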

**2. Skip-Gram (Our Focus Today):**
- **Input:** Target word (center word)
- **Output:** Context words (surrounding words)
- **Task:** Predict context words from the center word

**Example:**
```
Sentence: "The cat sat on the mat"
Target: "sat" → Context: ["cat", "on"] (window size = 1)
Target: "sat" → Context: ["The", "cat", "on", "the"] (window size = 2)
```

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*cuOmGT7NevP9oJFJfVpRKA.png" width="700"/>
</div>

**Why Skip-Gram?**

1. **Better for rare words:** Learns better representations for infrequent words
2. **More training examples:** One target word creates multiple (target, context) pairs
3. **Works well on small corpora:** Tends to produce good embeddings even with limited training data, although CBOW is typically faster to train

**The Skip-Gram Process:**

```
Sentence: "The quick brown fox jumps"
Window size: 2

For target word "brown" (index 2):
  - Context word 1: "The" (2 positions left)
  - Context word 2: "quick" (1 position left)
  - Context word 3: "fox" (1 position right)
  - Context word 4: "jumps" (2 positions right)

Training pairs created:
  (brown, The)
  (brown, quick)
  (brown, fox)
  (brown, jumps)
```

This turns raw, unlabeled text into supervised (input, label) pairs: the target word is the input and each context word is a label.
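
The pair-generation process above can be sketched in a few lines. The `skipgram_pairs` helper is an illustrative assumption, not necessarily the function used later in this notebook.

```python
def skipgram_pairs(tokens, window=2):
    """Return all (target, context) pairs within `window` positions of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

example_tokens = "The quick brown fox jumps".split()
all_pairs = skipgram_pairs(example_tokens, window=2)
brown_pairs = [pair for pair in all_pairs if pair[0] == "brown"]

print(brown_pairs)
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(len(all_pairs))  # 14 pairs from a 5-word sentence
```

Words near the sentence edges produce fewer pairs because part of their window falls outside the sentence.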

### **5. Understanding the Skip-Gram Architecture**

The Skip-Gram model is surprisingly simple - just two layers:

**Architecture:**

```
Input: Target word index (scalar)
   ↓
Embedding Layer: vocab_size × embedding_dim
   ↓ (This is what we want! The learned word vectors)
Dense word vector (embedding_dim dimensions)
   ↓
Linear Layer: embedding_dim × vocab_size
   ↓
Output: Scores for all words (vocab_size)
   ↓
Softmax: Convert to probabilities
   ↓
Predicted context word
```

**Layer Details:**

1. **Embedding Layer (`nn.Embedding`):**
   - A lookup table that converts word indices to dense vectors
   - Shape: `(vocab_size, embedding_dim)`
   - Example: If vocab_size=10,000 and embedding_dim=100:
     - This is a 10,000 × 100 matrix
     - Each row is the embedding vector for one word
   - **This layer's weights ARE the word embeddings we want to learn!**

2. **Linear Layer (`nn.Linear`):**
   - Maps from embedding space back to vocabulary space
   - Maps `embedding_dim` inputs to `vocab_size` outputs (PyTorch stores the weight matrix as `(vocab_size, embedding_dim)`)
   - Produces a score for each possible context word
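
To see why the embedding layer is "just a lookup table", the sketch below reproduces the forward pass in plain NumPy with tiny, made-up sizes. The names `E`, `W_out`, and `b_out` are assumptions for illustration; `W_out` uses the `(vocab_size, embedding_dim)` layout in which PyTorch's `nn.Linear` stores its weight.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3     # tiny sizes, assumed for readability

E = rng.normal(size=(vocab_size, embedding_dim))       # embedding table
W_out = rng.normal(size=(vocab_size, embedding_dim))   # output layer weights
b_out = np.zeros(vocab_size)                           # output layer bias

target_idx = 2

# Embedding lookup: pick one row of the table
word_vector = E[target_idx]

# The same result via one-hot encoding followed by a matrix multiply
one_hot = np.zeros(vocab_size)
one_hot[target_idx] = 1.0
same_vector = one_hot @ E
print(np.allclose(word_vector, same_vector))  # True

# Linear layer: one score per vocabulary word
scores = W_out @ word_vector + b_out

# Softmax: turn scores into probabilities (max subtracted for numerical stability)
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()
print(probs.shape)  # (5,)
```

Indexing a row is mathematically identical to multiplying a one-hot vector by the table, but it avoids the multiplication entirely.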

**Example with Concrete Numbers:**

```python
import torch
import torch.nn as nn

# Suppose:
vocab_size = 10_000      # number of words in the vocabulary
embedding_dim = 100
target_word_idx = 542    # Index for "cat"

# Step 1: Embedding lookup
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
word_vector = embedding_layer(torch.tensor([target_word_idx]))
# Result: shape [1, 100] (a batch of one 100-dimensional embedding for "cat")

# Step 2: Linear transformation
linear_layer = nn.Linear(embedding_dim, vocab_size)
scores = linear_layer(word_vector)
# Result: shape [1, 10000], one score for each word in the vocabulary

# Step 3: Softmax (applied inside nn.CrossEntropyLoss during training)
# Converts scores to probabilities
# Model predicts which word is likely to be in context
```

**Training Objective:**

Given a (target, context) pair like `("cat", "sat")`:
1. Feed "cat" index into the network
2. Get probability distribution over all vocabulary words
3. We want high probability for "sat" (the actual context word)
4. Use Cross-Entropy Loss to measure error
5. Backpropagation adjusts the weights of both layers, including the embeddings

After many training examples, words that appear in similar contexts will have similar embeddings!
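
The five steps can be traced by hand on a tiny example. The sketch below is a NumPy illustration with manually derived gradients, not the PyTorch training loop. The sizes, the indices standing in for `("cat", "sat")`, the learning rate, and the omission of a bias term are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embedding_dim = 5, 3
E = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))      # embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))  # output weights

target_idx, context_idx = 0, 3   # hypothetical indices for ("cat", "sat")
learning_rate = 0.1

def pair_loss(E, W_out, target_idx, context_idx):
    """Cross-entropy loss and softmax probabilities for one (target, context) pair."""
    scores = W_out @ E[target_idx]
    scores = scores - scores.max()                      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log-softmax
    return -log_probs[context_idx], np.exp(log_probs)

loss_before, _ = pair_loss(E, W_out, target_idx, context_idx)

for step in range(20):
    loss, probs = pair_loss(E, W_out, target_idx, context_idx)
    d_scores = probs.copy()
    d_scores[context_idx] -= 1.0            # gradient of the loss w.r.t. the scores
    v = E[target_idx].copy()
    grad_W_out = np.outer(d_scores, v)      # gradient for the output weights
    grad_v = W_out.T @ d_scores             # gradient for the target embedding
    W_out -= learning_rate * grad_W_out
    E[target_idx] -= learning_rate * grad_v

loss_after, probs_after = pair_loss(E, W_out, target_idx, context_idx)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
print(f"P(context | target) after training: {probs_after[context_idx]:.4f}")
```

Only the row of `E` belonging to the target word changes in each step, which is why every word needs to appear in many pairs before its embedding becomes useful.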

### **6. Preparing Our Text Corpus**

Let's start our implementation! We'll use a sample text corpus to train our Word2Vec model.

**The Workflow:**
1. **Get text data** - A collection of sentences
2. **Preprocess** - Clean and tokenize the text
3. **Build vocabulary** - Create word ↔ index mappings
4. **Generate training pairs** - Create (target, context) pairs
5. **Train model** - Learn the embeddings
6. **Extract embeddings** - Get the learned word vectors

For this tutorial, we'll use a small corpus of sentences. In practice, you'd use much larger datasets (Wikipedia, books, news articles, etc.).

In [None]:
# Sample text corpus
# In practice, you'd load this from a file or dataset
corpus = """
The cat sat on the mat.
The dog sat on the log.
Cats and dogs are animals.
The cat and the dog are friends.
A king rules a kingdom.
A queen rules an empire.
The king and queen are royalty.
The happy cat played with the ball.
The joyful dog ran in the park.
Animals are happy when they play.
The cat sleeps on the warm mat.
The dog loves to run and jump.
Kings and queens live in castles.
Happy animals are healthy animals.
The small cat climbed the tall tree.
The big dog guarded the house.
Cats like to chase mice.
Dogs like to catch balls.
The wise king made fair decisions.
The kind queen helped her people.
"""

print("Sample Text Corpus:")
print("="*70)
print(corpus.strip())
print("="*70)
print(f"\nCorpus length: {len(corpus)} characters")

### **7. Text Preprocessing**

Before we can work with text, we need to preprocess it:

**Preprocessing Steps:**

1. **Lowercase:** Convert all text to lowercase
   - "The Cat" → "the cat"
   - Ensures "Cat" and "cat" are treated as the same word

2. **Remove Punctuation:** Remove periods, commas, etc.
   - "cat." → "cat"
   - Simplifies tokenization

3. **Tokenization:** Split text into individual words
   - "the cat sat" → ["the", "cat", "sat"]
   - Each word becomes a token

4. **Build Vocabulary:** Create mappings between words and indices
   - word_to_idx: {"cat": 0, "dog": 1, ...}
   - idx_to_word: {0: "cat", 1: "dog", ...}

These mappings allow us to convert between words and the numerical indices our neural network needs.

In [None]:
import re

def preprocess_text(text):
    """
    Preprocess text: lowercase, remove punctuation, tokenize
    
    Args:
        text: Raw text string
    
    Returns:
        List of tokens (words)
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation (keep only letters and spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Split into words (tokenization)
    tokens = text.split()
    
    return tokens

# Preprocess our corpus
tokens = preprocess_text(corpus)

print("Tokenized Text (first 50 tokens):")
print("="*70)
print(tokens[:50])
print("="*70)
print(f"\nTotal tokens: {len(tokens)}")

In [None]:
def build_vocabulary(tokens, min_count=1):
    """
    Build vocabulary from tokens
    
    Args:
        tokens: List of word tokens
        min_count: Minimum frequency for a word to be included
    
    Returns:
        word_to_idx: Dictionary mapping words to indices
        idx_to_word: Dictionary mapping indices to words
        word_counts: Counter object with word frequencies
    """
    # Count word frequencies
    word_counts = Counter(tokens)
    
    # Filter words by minimum count
    vocab_words = [word for word, count in word_counts.items() if count >= min_count]
    
    # Sort for consistency
    vocab_words = sorted(vocab_words)
    
    # Create mappings
    word_to_idx = {word: idx for idx, word in enumerate(vocab_words)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    
    return word_to_idx, idx_to_word, word_counts

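# Build vocabulary. min_count=2 drops words that appear only once, so those
# tokens have no index and must be skipped when building training pairs.
word_to_idx, idx_to_word, word_counts = build_vocabulary(tokens, min_count=2)

vocab_size = len(word_to_idx)

print("Vocabulary Statistics:")
print("="*70)
print(f"Vocabulary size: {vocab_size} words (appearing at least 2 times)")
print(f"Total tokens: {len(tokens)} words")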
print("\nMost common words:")
for word, count in word_counts.most_common(10):
    print(f"  {word:15s}: {count:3d} occurrences")
print("="*70)

In [None]:
# Display sample of word-to-index mappings
print("\nSample Word-to-Index Mappings:")
print("="*70)
sample_words = list(word_to_idx.items())[:15]
for word, idx in sample_words:
    print(f"  '{word:10s}' → index {idx:3d}")
print("="*70)

print(f"\n✅ Preprocessing complete!")
print(f"   We now have {vocab_size} unique words in our vocabulary.")
print(f"   Each word has a unique index from 0 to {vocab_size-1}.")
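
Because `min_count=2` removes words that occur only once, some tokens in the corpus have no index. One possible way to handle this, sketched below on a small made-up token list, is to drop out-of-vocabulary tokens before converting to indices. The `sample_*` names are hypothetical and kept separate from the notebook's variables.

```python
from collections import Counter

sample_tokens = "the cat sat on the mat the dog sat on the log".split()
counts = Counter(sample_tokens)

# Keep only words that appear at least min_count times
min_count = 2
kept_words = sorted(word for word, count in counts.items() if count >= min_count)
sample_word_to_idx = {word: idx for idx, word in enumerate(kept_words)}

# Drop tokens without an index, then encode the rest
in_vocab_tokens = [t for t in sample_tokens if t in sample_word_to_idx]
token_ids = [sample_word_to_idx[t] for t in in_vocab_tokens]
dropped_words = sorted(set(sample_tokens) - set(kept_words))

print(sample_word_to_idx)  # {'on': 0, 'sat': 1, 'the': 2}
print(token_ids)           # [2, 1, 0, 2, 2, 1, 0, 2]
print(dropped_words)       # ['cat', 'dog', 'log', 'mat']
```

Dropping tokens moves their neighbors closer together inside the context window. An alternative design is to map rare words to a shared placeholder token so that positions are preserved.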