Now we'll create the (target, context) pairs that will train our model. This is where we turn raw, unlabeled text into supervised learning data!

**The Sliding Window Approach:**

We slide a window across our text and for each target word, we look at surrounding words within the window.

**Example with window_size = 2:**

```
Sentence: ["the", "cat", "sat", "on", "the", "mat"]
          [  0  ,   1  ,   2  ,  3 ,   4  ,   5  ]  (positions)

Target word at position 2 ("sat"):
  - Look left up to 2 positions: "the" (pos 0), "cat" (pos 1)
  - Look right up to 2 positions: "on" (pos 3), "the" (pos 4)

Training pairs created:
  (sat, the)   # from position 0
  (sat, cat)   # from position 1
  (sat, on)    # from position 3
  (sat, the)   # from position 4
```

**Key Points:**
- **window_size** determines how far we look for context
- Larger windows capture broader context but may include less relevant words
- Smaller windows focus on immediate neighbors
- Common values: 2-5
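
To make the effect of `window_size` concrete, here is a minimal pure-Python sketch (no PyTorch) that builds word-level pairs for the example sentence. The helper name `skipgram_pairs` is made up for this illustration; it works on words rather than indices so the output is easy to read.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) word pairs for one tokenized sentence."""
    pairs = []
    for pos, target in enumerate(tokens):
        start = max(0, pos - window_size)
        end = min(len(tokens), pos + window_size + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:  # a word is never its own context
                pairs.append((target, tokens[ctx_pos]))
    return pairs


sentence = ["the", "cat", "sat", "on", "the", "mat"]

# Pairs whose target is "sat", matching the worked example above
sat_pairs = [p for p in skipgram_pairs(sentence, 2) if p[0] == "sat"]
print(sat_pairs)

# The pair count grows with the window until the window covers the sentence
pair_counts = {w: len(skipgram_pairs(sentence, w)) for w in (1, 2, 5)}
print(pair_counts)
```

For this six-word sentence the counts are 10, 18, and 30 pairs for windows of 1, 2, and 5, so a wider window means more training data per sentence but also more loosely related pairs.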

**Implementation Strategy:**
1. Convert tokens to indices
2. For each target word position:
   - Get words within the window
   - Create (target_idx, context_idx) pairs
3. Return as PyTorch tensors for training

In [None]:
def generate_training_pairs(tokens, word_to_idx, window_size=2):
    """
    Generate (target, context) training pairs using Skip-Gram approach
    
    Args:
        tokens: List of word tokens
        word_to_idx: Dictionary mapping words to indices
        window_size: How many words to look at on each side
    
    Returns:
        target_indices: Tensor of target word indices
        context_indices: Tensor of context word indices
    """
    target_indices = []
    context_indices = []
    
    # Convert tokens to indices
    token_indices = [word_to_idx[token] for token in tokens if token in word_to_idx]
    
    # Slide window across the text
    for target_pos in range(len(token_indices)):
        target_idx = token_indices[target_pos]
        
        # Define the window boundaries
        start = max(0, target_pos - window_size)
        end = min(len(token_indices), target_pos + window_size + 1)
        
        # Get context words (all words in window except target)
        for context_pos in range(start, end):
            if context_pos != target_pos:  # Don't use the target word as its own context
                context_idx = token_indices[context_pos]
                target_indices.append(target_idx)
                context_indices.append(context_idx)
    
    # Convert to PyTorch tensors
    target_tensor = torch.tensor(target_indices, dtype=torch.long)
    context_tensor = torch.tensor(context_indices, dtype=torch.long)
    
    return target_tensor, context_tensor

# Generate training pairs
window_size = 2
target_tensor, context_tensor = generate_training_pairs(tokens, word_to_idx, window_size)

print("Training Pair Generation:")
print("="*70)
print(f"Window size: {window_size}")
print(f"Total training pairs: {len(target_tensor):,}")
print(f"Vocabulary size: {vocab_size}")
print("="*70)

In [None]:
# Display sample training pairs
print("\nSample Training Pairs:")
print("="*70)
print(f"{'Target Word':<15} | {'Context Word':<15} | {'Target Idx':<10} | {'Context Idx'}")
print("="*70)

num_samples = 20
for i in range(min(num_samples, len(target_tensor))):
    target_idx = target_tensor[i].item()
    context_idx = context_tensor[i].item()
    target_word = idx_to_word[target_idx]
    context_word = idx_to_word[context_idx]
    print(f"{target_word:<15} | {context_word:<15} | {target_idx:<10} | {context_idx}")

print("="*70)
print(f"\n✅ Generated {len(target_tensor):,} training pairs!")
print(f"   These pairs will teach the model which words appear together.")

**Understanding the Training Pairs:**

Looking at the pairs above, notice:
- "the" appears with many different words (it's a common article)
- "cat" appears with words like "the", "sat", "mat" (typical cat-related contexts)
- "dog" appears with similar words to "cat" (helping the model learn they're related)
- "king" and "queen" appear in similar contexts (both relate to "rules", "kingdom", etc.)

The model will learn to give similar embeddings to words that share similar contexts!
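
One way to see "shared contexts" directly is to collect the set of context words for each target and intersect them. The sketch below uses a hypothetical two-sentence corpus (not the notebook's corpus) and, as a simplifying assumption, keeps windows inside each sentence.

```python
from collections import defaultdict

# Hypothetical two-sentence corpus, used only for this illustration
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
window_size = 2

# Map each target word to the set of words seen in its window
contexts = defaultdict(set)
for sentence in corpus:
    for pos, target in enumerate(sentence):
        start = max(0, pos - window_size)
        end = min(len(sentence), pos + window_size + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                contexts[target].add(sentence[ctx_pos])

shared = contexts["cat"] & contexts["dog"]
print(sorted(shared))
```

Here "cat" and "dog" never appear in the same sentence, yet they share the contexts "on", "sat", and "the". That overlap is the signal that pushes their vectors together during training.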

### **9. Building the Skip-Gram Model**

Now let's implement our Skip-Gram neural network in PyTorch!

**Model Architecture Recap:**

```
Input: target_word_idx (scalar integer)
   ↓
Embedding Layer: [vocab_size × embedding_dim]
   ↓
Word Vector: [embedding_dim]
   ↓
Linear Layer: [embedding_dim × vocab_size]
   ↓
Output Scores: [vocab_size]
```
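
The layer shapes in the diagram imply a simple parameter count. The sketch below assumes the linear layer carries a bias term (the PyTorch default for `nn.Linear`); the function name and the example sizes are illustrative only.

```python
def skipgram_param_count(vocab_size, embedding_dim):
    """Count trainable parameters: an embedding table plus a linear layer with bias."""
    embedding_params = vocab_size * embedding_dim
    linear_params = embedding_dim * vocab_size + vocab_size  # weights + bias
    return embedding_params, linear_params, embedding_params + linear_params


# Example sizes chosen for illustration: 1,000 words, 50 dimensions
emb, lin, total = skipgram_param_count(1000, 50)
print(emb, lin, total)
```

Both layers scale with `vocab_size`, which is why the output layer becomes the bottleneck on large vocabularies.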

**The Magic of nn.Embedding:**

`nn.Embedding` is essentially a lookup table:
- It stores a matrix of size `(vocab_size, embedding_dim)`
- When you pass in an index, it returns the corresponding row
- During training, these rows (word vectors) are updated via backpropagation

**Example:**
```python
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=50)
# This creates a 1000 × 50 matrix (num_embeddings is the vocabulary size)

word_idx = torch.tensor([42])
word_vector = embedding(word_idx)
# Returns row 42 from the matrix (a 50-dimensional vector)
```

**After training, the weights of this embedding layer ARE our word vectors!**
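
Why is a lookup table enough? Selecting row `i` of a matrix gives the same result as multiplying a one-hot vector by that matrix, just without the wasted multiplications by zero. A small pure-Python sketch with made-up weights (not values from a trained model):

```python
# A tiny hand-written "embedding matrix": 4 words, 3 dimensions each (made-up values)
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def lookup(index):
    """What an embedding layer does: return one row."""
    return weights[index]


def one_hot_times_matrix(index):
    """The equivalent (but wasteful) one-hot vector times matrix product."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(weights))]
    dim = len(weights[0])
    return [
        sum(one_hot[r] * weights[r][c] for r in range(len(weights)))
        for c in range(dim)
    ]


print(lookup(2))
print(one_hot_times_matrix(2))
```

Both calls return the third row, so the embedding layer can be read as a linear layer applied to one-hot inputs, implemented as an index operation.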

In [None]:
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        """
        Initialize the Skip-Gram model
        
        Args:
            vocab_size: Number of unique words in vocabulary
            embedding_dim: Dimension of word embedding vectors
        """
        super(SkipGramModel, self).__init__()
        
        # Embedding layer: converts word indices to dense vectors
        # This is the layer whose weights we want to extract after training!
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Linear layer: maps from embedding space to vocabulary space
        # Used to predict context words
        self.linear = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, target_indices):
        """
        Forward pass
        
        Args:
            target_indices: Indices of target words (batch_size,)
        
        Returns:
            Scores for each word in vocabulary (batch_size, vocab_size)
        """
        # Look up embeddings for target words
        # Shape: (batch_size, embedding_dim)
        embeds = self.embeddings(target_indices)
        
        # Project to vocabulary space
        # Shape: (batch_size, vocab_size)
        output = self.linear(embeds)
        
        return output
    
    def get_embeddings(self):
        """
        Extract the learned word embeddings
        
        Returns:
            Embedding matrix of shape (vocab_size, embedding_dim)
        """
        return self.embeddings.weight.detach()

# Hyperparameters
embedding_dim = 50  # Dimension of word vectors (common choices: 50, 100, 300)

# Create the model
model = SkipGramModel(vocab_size, embedding_dim).to(device)

print("Skip-Gram Model Architecture:")
print("="*70)
print(model)
print("="*70)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
embedding_params = model.embeddings.weight.numel()
linear_params = sum(p.numel() for p in model.linear.parameters())

print(f"\nParameter Breakdown:")
print(f"  Embedding layer: {embedding_params:,} parameters ({vocab_size} × {embedding_dim})")
print(f"  Linear layer:    {linear_params:,} parameters ({embedding_dim} × {vocab_size} + {vocab_size})")
print(f"  Total:           {total_params:,} parameters")
print("="*70)

### **10. Loss Function and Optimizer**

**Loss Function: Cross-Entropy Loss**

Just like in classification tasks, we use Cross-Entropy Loss because:
- We're predicting which word (from vocabulary) is in the context
- This is a multi-class classification problem
- Cross-Entropy measures how well predicted probabilities match the true context word

**Training Objective:**

For a (target, context) pair like `("cat", "sat")`:
1. Input: "cat" index
2. Model outputs: probability distribution over all words
3. Loss: How different is this from the true distribution (where "sat" = 1, others = 0)
4. Backprop: Adjust embeddings to increase probability of "sat"
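
The four steps above can be worked through by hand. `nn.CrossEntropyLoss` takes raw scores, applies a softmax internally, and returns the negative log-probability of the true class. The pure-Python sketch below reproduces that arithmetic for a single pair, using a toy four-word vocabulary and made-up scores.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, true_index):
    """Negative log-probability assigned to the true context word."""
    return -math.log(softmax(scores)[true_index])


vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary for illustration
true_index = vocab.index("sat")       # the true context word for target "cat"

untrained_scores = [0.0, 0.0, 0.0, 0.0]  # model has no preference yet
trained_scores = [0.5, -1.0, 3.0, 0.0]   # made-up scores favoring "sat"

loss_before = cross_entropy(untrained_scores, true_index)
loss_after = cross_entropy(trained_scores, true_index)
print(loss_before, loss_after)
```

With uniform scores the loss equals `ln(vocab_size)`, which is a handy sanity check: at the start of training the loss should be roughly the natural log of the vocabulary size, up to the effect of random initialization.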

**Optimizer: Adam**

We'll use the Adam optimizer (like in Day 4) because:
- Adaptive learning rates work well for embeddings
- It often converges faster than vanilla SGD
- It usually needs little hyperparameter tuning

In [None]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
learning_rate = 0.01
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

print("Training Configuration:")
print("="*70)
print(f"Loss Function:  {criterion}")
print(f"Optimizer:      Adam")
print(f"Learning Rate:  {learning_rate}")
print(f"Device:         {device}")
print("="*70)

### **11. Training the Model**

Now we'll train our Word2Vec model! The training process is similar to what we've done before:

**Training Loop:**
1. **Forward Pass:** Feed target word indices → get predictions
2. **Calculate Loss:** Compare predictions with actual context words
3. **Backward Pass:** Calculate gradients
4. **Update Parameters:** Adjust embedding weights

**What's Happening:**
- The model learns to predict context words from target words
- Words appearing in similar contexts get similar embeddings
- After many updates, the embedding layer contains meaningful word vectors

**Training Tips:**
- For small datasets, 100-500 epochs is typical
- Loss should decrease steadily
- For real applications, you'd use millions of words and more sophisticated techniques (negative sampling)
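
Mini-batch training splits the pairs into chunks of `batch_size`, with a smaller final chunk when the sizes do not divide evenly. The sketch below shows that slicing in plain Python. The optional shuffling is an assumption added here for illustration (reshuffling each epoch is a common practice); the helper name `batch_slices` is hypothetical.

```python
import random


def batch_slices(num_samples, batch_size, shuffle=False, seed=0):
    """Yield lists of sample positions, one list per mini-batch."""
    order = list(range(num_samples))
    if shuffle:
        # Optional: visit the pairs in a different order (assumption, see lead-in)
        random.Random(seed).shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


batch_sizes = [len(b) for b in batch_slices(num_samples=150, batch_size=64)]
print(batch_sizes)

shuffled = list(batch_slices(num_samples=150, batch_size=64, shuffle=True, seed=1))
```

With 150 pairs and a batch size of 64 this gives batches of 64, 64, and 22, so each epoch performs three parameter updates.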

In [None]:
def train_word2vec(model, target_tensor, context_tensor, criterion, optimizer, 
                   num_epochs=100, batch_size=128, print_every=10):
    """
    Train the Word2Vec Skip-Gram model
    
    Args:
        model: The Skip-Gram model
        target_tensor: Target word indices
        context_tensor: Context word indices
        criterion: Loss function
        optimizer: Optimization algorithm
        num_epochs: Number of training epochs
        batch_size: Number of samples per batch
        print_every: Print progress every N epochs
    
    Returns:
        List of losses per epoch
    """
    model.train()
    losses = []
    
    # Move data to device
    target_tensor = target_tensor.to(device)
    context_tensor = context_tensor.to(device)
    
    num_samples = len(target_tensor)
    
    print("Starting Training...")
    print("="*70)
    
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0
        
        # Mini-batch training
        for i in range(0, num_samples, batch_size):
            # Get batch
            batch_targets = target_tensor[i:i+batch_size]
            batch_contexts = context_tensor[i:i+batch_size]
            
            # Forward pass
            outputs = model(batch_targets)
            
            # Calculate loss
            loss = criterion(outputs, batch_contexts)
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            num_batches += 1
        
        # Average loss for this epoch
        avg_loss = epoch_loss / num_batches
        losses.append(avg_loss)
        
        # Print progress
        if (epoch + 1) % print_every == 0:
            print(f"Epoch [{epoch+1:4d}/{num_epochs}] | Loss: {avg_loss:.4f}")
    
    print("="*70)
    print("Training Complete!")
    
    return losses

# Train the model
num_epochs = 200
batch_size = 64

losses = train_word2vec(
    model, 
    target_tensor, 
    context_tensor, 
    criterion, 
    optimizer,
    num_epochs=num_epochs,
    batch_size=batch_size,
    print_every=20
)

### **12. Visualizing Training Progress**

In [None]:
# Plot training loss
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(losses) + 1), losses, linewidth=2, color='#e74c3c')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Word2Vec Training Loss', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTraining Summary:")
print("="*70)
print(f"Initial Loss: {losses[0]:.4f}")
print(f"Final Loss:   {losses[-1]:.4f}")
print(f"Reduction:    {losses[0] - losses[-1]:.4f}")
print("="*70)
print("\n✅ The model has learned word embeddings!")
print("   Words appearing in similar contexts now have similar vectors.")

### **13. Extracting and Analyzing Word Embeddings**

The moment we've been waiting for! Let's extract the learned word vectors from the embedding layer.

**Remember:** The weights of `model.embeddings` are our word vectors!
- Each row corresponds to one word
- Each row is an `embedding_dim`-dimensional vector
- Words with similar meanings should have similar vectors

**How to measure similarity:**

We use **Cosine Similarity**:

$$\text{similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||} = \cos(\theta)$$

- Returns values from -1 to 1
- 1 = identical direction (very similar)
- 0 = orthogonal (unrelated)
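- -1 = opposite direction (this does not reliably mean "opposite meaning": antonyms such as "hot" and "cold" share contexts and often end up with similar vectors)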
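
A quick way to build intuition for the formula is to evaluate it on three hand-picked pairs of 2D vectors. This is a minimal pure-Python sketch of the same computation; the vectors are arbitrary examples, not learned embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # b is a scaled copy of a
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # b points the other way

print(same_direction, orthogonal, opposite)
```

The three results are approximately 1, 0, and -1. Note that the first pair differs in length but not in direction, which is exactly what cosine similarity ignores.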

In [None]:
# Extract the learned embeddings
model.eval()
embeddings = model.get_embeddings().cpu().numpy()

print("Learned Word Embeddings:")
print("="*70)
print(f"Shape: {embeddings.shape}")
print(f"  → {embeddings.shape[0]} words, each represented by {embeddings.shape[1]} numbers")
print("="*70)

# Display embedding for a sample word
sample_word = "cat"
if sample_word in word_to_idx:
    word_idx = word_to_idx[sample_word]
    word_embedding = embeddings[word_idx]
    
    print(f"\nEmbedding for '{sample_word}':")
    print(f"First 10 dimensions: {word_embedding[:10]}")
    print(f"\nThis {embedding_dim}-dimensional vector captures the meaning of '{sample_word}'!")

In [None]:
def cosine_similarity_matrix(embeddings):
    """
    Compute cosine similarity between all pairs of word embeddings
    
    Args:
        embeddings: Embedding matrix (vocab_size, embedding_dim)
    
    Returns:
        Similarity matrix (vocab_size, vocab_size)
    """
    # Normalize embeddings
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-8)  # Add small value to avoid division by zero
    
    # Compute cosine similarity
    similarity = np.dot(normalized, normalized.T)
    
    return similarity

def find_most_similar(word, word_to_idx, idx_to_word, embeddings, top_k=5):
    """
    Find the most similar words to a given word
    
    Args:
        word: Query word
        word_to_idx: Word to index mapping
        idx_to_word: Index to word mapping
        embeddings: Embedding matrix
        top_k: Number of similar words to return
    
    Returns:
        List of (word, similarity_score) tuples
    """
    if word not in word_to_idx:
        return []
    
    # Get word index and embedding
    word_idx = word_to_idx[word]
    word_embedding = embeddings[word_idx]
    
    # Compute similarities with all words
    similarities = []
    for idx in range(len(embeddings)):
        other_embedding = embeddings[idx]
        
        # Cosine similarity
        similarity = np.dot(word_embedding, other_embedding) / (
            np.linalg.norm(word_embedding) * np.linalg.norm(other_embedding) + 1e-8
        )
        
        similarities.append((idx, similarity))
    
    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Get top k, excluding the query word by index rather than assuming it sorts first
    results = [(idx_to_word[idx], sim) for idx, sim in similarities if idx != word_idx][:top_k]
    
    return results

# Test the similarity function
print("\nTesting Word Similarity:")
print("="*70)

In [None]:
# Find similar words for different test cases
test_words = ["cat", "dog", "king", "queen", "happy"]

for test_word in test_words:
    if test_word in word_to_idx:
        print(f"\nMost similar words to '{test_word}':")
        print("-" * 50)
        similar_words = find_most_similar(test_word, word_to_idx, idx_to_word, embeddings, top_k=5)
        
        for i, (word, similarity) in enumerate(similar_words, 1):
            print(f"  {i}. {word:<15s} (similarity: {similarity:.4f})")

print("\n" + "="*70)

**Interpreting the Results:**

If the model trained well, you should see:
- **"cat"** is similar to **"dog"** (both animals, both pets)
- **"king"** is similar to **"queen"** (both royalty)
- **"happy"** is similar to words in emotional contexts

The model learned these relationships **purely from seeing which words appear together** - we never told it that cats and dogs are animals!

**Note:** With our small corpus, the similarities might not be perfect. In practice:
- Use millions of words of text
- Train for longer
- Use larger embedding dimensions (100-300)
- Use techniques like negative sampling for efficiency

### **14. Visualizing Embeddings in 2D**

Our embeddings are 50-dimensional, which is impossible to visualize directly. Let's use **PCA (Principal Component Analysis)** to reduce them to 2D so we can see the relationships!

**What is PCA?**
- A dimensionality reduction technique
- Finds the 2 most important directions (principal components) in the high-dimensional space
- Projects the data onto these 2 dimensions
- Preserves as much variance (information) as possible

**What to look for:**
- Similar words should cluster together
- "cat" and "dog" should be near each other
- "king" and "queen" should be near each other
- Unrelated words should be far apart
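
To see what "the most important direction" means, here is a small from-scratch sketch that finds the first principal component of some made-up 2D points using power iteration on the covariance matrix. This illustrates the idea only; it is not how scikit-learn's `PCA` computes its result (that implementation is based on SVD), and the function name is hypothetical.

```python
import math

# Made-up 2D points lying roughly along the direction (1, 2)
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (-1.0, -1.8), (-2.0, -4.1), (-3.0, -6.3)]


def first_principal_component(points, iterations=100):
    """Return the direction of maximum variance and the share of variance it explains."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n

    # Power iteration: repeatedly apply the matrix and renormalize
    v = (1.0, 0.0)
    for _ in range(iterations):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)

    # Variance along v divided by total variance
    variance_along_v = v[0] * (cxx * v[0] + cxy * v[1]) + v[1] * (cxy * v[0] + cyy * v[1])
    return v, variance_along_v / (cxx + cyy)


pc1, ratio = first_principal_component(points)
print(pc1, ratio)
```

Because the points sit almost on a line, one direction captures nearly all of the variance. Real embeddings are far less tidy, so two components usually explain only part of the variance and the 2D picture should be read with that in mind.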

In [None]:
# Reduce embeddings to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

print("PCA Dimensionality Reduction:")
print("="*70)
print(f"Original dimensions: {embeddings.shape[1]}D")
print(f"Reduced dimensions:  {embeddings_2d.shape[1]}D")
print(f"Variance explained:  {pca.explained_variance_ratio_.sum()*100:.2f}%")
print("="*70)

In [None]:
# Visualize all embeddings
plt.figure(figsize=(14, 10))

# Plot all points
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.5, s=50, color='steelblue')

# Annotate each point with its word
for idx, word in idx_to_word.items():
    x, y = embeddings_2d[idx]
    plt.annotate(word, (x, y), fontsize=9, alpha=0.8)

plt.xlabel('First Principal Component', fontsize=12)
plt.ylabel('Second Principal Component', fontsize=12)
plt.title('Word Embeddings Visualization (2D Projection via PCA)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Look for patterns:")
print("   - Are 'cat' and 'dog' close together?")
print("   - Are 'king' and 'queen' close together?")
print("   - Do related words form clusters?")

In [None]:
# Highlight specific word groups
plt.figure(figsize=(14, 10))

# Define word groups to highlight
animals = ['cat', 'dog', 'cats', 'dogs', 'animals']
royalty = ['king', 'queen', 'kingdom', 'empire', 'royalty']
emotions = ['happy', 'joyful']

# Plot with colors for different groups
for idx, word in idx_to_word.items():
    x, y = embeddings_2d[idx]
    
    if word in animals:
        color = 'green'
        marker = 'o'
        size = 150
    elif word in royalty:
        color = 'purple'
        marker = 's'
        size = 150
    elif word in emotions:
        color = 'orange'
        marker = '^'
        size = 150
    else:
        color = 'lightgray'
        marker = '.'
        size = 50
    
    plt.scatter(x, y, color=color, marker=marker, s=size, alpha=0.7, edgecolors='black', linewidth=0.5)
    
    # Annotate
    if word in animals + royalty + emotions:
        plt.annotate(word, (x, y), fontsize=11, fontweight='bold', 
                    bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3))

# Legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='green', markersize=10, label='Animals'),
    Line2D([0], [0], marker='s', color='w', markerfacecolor='purple', markersize=10, label='Royalty'),
    Line2D([0], [0], marker='^', color='w', markerfacecolor='orange', markersize=10, label='Emotions'),
]
plt.legend(handles=legend_elements, loc='best', fontsize=11)

plt.xlabel('First Principal Component', fontsize=12)
plt.ylabel('Second Principal Component', fontsize=12)
plt.title('Word Embeddings - Semantic Clusters', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### **15. Word Analogies: The Magic of Vector Arithmetic**

One of the most fascinating properties of word embeddings is that you can perform **mathematical operations on meaning**!

**Famous Example:**

$$\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$$

**The Logic:**
- $\vec{king} - \vec{man}$ = "royalty" concept (removes the "male" aspect)
- Add $\vec{woman}$ = "female royalty"
- Result should be close to $\vec{queen}$!

**How it works:**
1. Perform vector arithmetic on embeddings
2. Find the word whose embedding is closest to the result
3. This word is the "answer" to the analogy
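
The three steps can be sketched on a toy example where the geometry is known in advance. The vectors below are hand-made, not learned: the first axis is meant to stand for "royalty" and the second for "gender", so the analogy holds exactly by construction. All names here are illustrative.

```python
import math

# Hand-made 2D vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender"
toy_embeddings = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [0.2,  0.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def solve_analogy(a, b, c, embeddings):
    """a is to b as c is to ?  ->  nearest word to b - a + c, excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


# man : king :: woman : ?   computes king - man + woman
answer = solve_analogy("man", "king", "woman", toy_embeddings)
print(answer)  # queen
```

Learned embeddings only approximate this kind of clean structure, which is why the input words are excluded from the candidates and why several top results are usually inspected rather than just one.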

Let's try this with our trained embeddings!

In [None]:
def word_analogy(word_a, word_b, word_c, word_to_idx, idx_to_word, embeddings, top_k=5):
    """
    Solve word analogy: word_a is to word_b as word_c is to ?
    Computes: embedding(word_b) - embedding(word_a) + embedding(word_c)
    
    Args:
        word_a, word_b, word_c: Words in the analogy
        word_to_idx: Word to index mapping
        idx_to_word: Index to word mapping
        embeddings: Embedding matrix
        top_k: Number of results to return
    
    Returns:
        List of (word, similarity) tuples
    """
    # Check if all words are in vocabulary
    if word_a not in word_to_idx or word_b not in word_to_idx or word_c not in word_to_idx:
        return []
    
    # Get embeddings
    vec_a = embeddings[word_to_idx[word_a]]
    vec_b = embeddings[word_to_idx[word_b]]
    vec_c = embeddings[word_to_idx[word_c]]
    
    # Compute analogy vector: b - a + c
    target_vector = vec_b - vec_a + vec_c
    
    # Find closest words
    similarities = []
    for idx in range(len(embeddings)):
        # Skip the input words
        if idx in [word_to_idx[word_a], word_to_idx[word_b], word_to_idx[word_c]]:
            continue
        
        other_embedding = embeddings[idx]
        
        # Cosine similarity
        similarity = np.dot(target_vector, other_embedding) / (
            np.linalg.norm(target_vector) * np.linalg.norm(other_embedding) + 1e-8
        )
        
        similarities.append((idx, similarity))
    
    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Get top k
    results = [(idx_to_word[idx], sim) for idx, sim in similarities[:top_k]]
    
    return results

print("Word Analogy Testing:")
print("="*70)
print("Format: 'A is to B as C is to ?'")
print("="*70)

In [None]:
# Test analogies
analogies = [
    ("king", "queen", "man"),      # king:queen :: man:?
    ("cat", "cats", "dog"),        # cat:cats :: dog:?
    ("happy", "joyful", "cat"),    # happy:joyful :: cat:?
]

for word_a, word_b, word_c in analogies:
    print(f"\n'{word_a}' is to '{word_b}' as '{word_c}' is to:")
    print("-" * 50)
    
    results = word_analogy(word_a, word_b, word_c, word_to_idx, idx_to_word, embeddings, top_k=3)
    
    if results:
        for i, (word, similarity) in enumerate(results, 1):
            print(f"  {i}. {word:<15s} (similarity: {similarity:.4f})")
    else:
        print("  (Words not in vocabulary)")

print("\n" + "="*70)
print("\n💡 Note: With our small corpus, analogies may not be perfect.")
print("   Larger datasets produce much better analogies!")

### **16. Conclusion and Key Takeaways**

Congratulations! You've successfully implemented Word2Vec from scratch and learned how machines can understand language!

**🎯 What We Accomplished:**

1. ✅ Understood the problem: How to represent words as numbers
2. ✅ Learned about one-hot encoding vs dense embeddings
3. ✅ Explored the distributional hypothesis ("a word is known by its context")
4. ✅ Preprocessed text: tokenization, vocabulary building
5. ✅ Generated training pairs using the Skip-Gram approach
6. ✅ Built and trained a neural network to learn word embeddings
7. ✅ Extracted learned embeddings and found similar words
8. ✅ Visualized embeddings in 2D to see semantic clusters
9. ✅ Performed word analogies using vector arithmetic

**🔑 Key Concepts:**

1. **Word Embeddings:**
   - Dense vector representations of words
   - Capture semantic meaning automatically
   - Much better than sparse one-hot encodings

2. **Distributional Hypothesis:**
   - Words appearing in similar contexts have similar meanings
   - The foundation of Word2Vec and modern NLP

3. **Skip-Gram Model:**
   - Predicts context words from target words
   - Simple 2-layer architecture
   - The embedding layer weights ARE the word vectors

4. **Training Process:**
   - Transform text into (target, context) pairs
   - Train like any supervised learning problem
   - Similar words end up with similar vectors

5. **Vector Arithmetic:**
   - Can perform math on meanings: king - man + woman ≈ queen
   - Embeddings capture semantic relationships

**📊 From Simple to Powerful:**

| Aspect | One-Hot Encoding | Word Embeddings |
|--------|-----------------|------------------|
| **Dimensions** | Vocabulary size (e.g., 100,000) | Fixed small size (50-300) |
| **Sparsity** | 99.999% zeros | Dense (all values meaningful) |
| **Semantic Meaning** | None (all words equally different) | Rich (similar words have similar vectors) |
| **Generalization** | Poor | Excellent |
| **Vector Arithmetic** | Meaningless | Meaningful analogies! |

**🚀 Real-World Applications:**

Word embeddings are used in:
- **Search Engines:** Understanding query intent
- **Recommendation Systems:** Finding similar products/content
- **Sentiment Analysis:** Understanding emotions in text
- **Machine Translation:** Google Translate, DeepL
- **Chatbots:** Understanding user messages
- **Question Answering:** Finding relevant answers
- **Document Classification:** Categorizing articles, emails

**💡 Improvements for Production:**

Our implementation is educational. For real applications:

1. **Use More Data:**
   - Millions/billions of words (Wikipedia, books, web pages)
   - Our 200 words → Real systems use billions

2. **Negative Sampling:**
   - Instead of predicting over entire vocabulary (slow)
   - Sample a few "negative" examples
   - Much faster and more efficient

3. **Subword Information:**
   - Handle rare/unknown words better
   - FastText extends Word2Vec with character n-grams

4. **Pre-trained Embeddings:**
   - Use embeddings trained on huge corpora
   - GloVe, FastText, Word2Vec pre-trained models
   - Don't train from scratch for every task

5. **Modern Alternatives:**
   - **BERT, GPT, T5:** Contextual embeddings (different vectors for same word in different contexts)
   - **Sentence Transformers:** Embeddings for entire sentences
   - But Word2Vec is still widely used and very effective!
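
The negative sampling idea from item 2 can be sketched in a few lines. Instead of a softmax over the whole vocabulary, each (target, context) pair is scored against the true context word and a handful of sampled "negative" words. The sketch below computes that loss for one pair in pure Python; the vectors are made up, and how the negatives are drawn is deliberately left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one (target, context) pair with k sampled negatives.

    loss = -log sigmoid(u_context . v_target)
           - sum over negatives of log sigmoid(-u_negative . v_target)
    """
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, target_vec)))
    return loss


# Made-up vectors for illustration only
v_target = [0.5, -0.2, 0.1]
u_context = [0.4, -0.1, 0.3]
u_negatives = [[-0.3, 0.2, 0.0], [0.1, 0.4, -0.5]]

ns_loss = negative_sampling_loss(v_target, u_context, u_negatives)
print(ns_loss)
```

The cost per pair now depends on the number of negatives (two here) rather than on the vocabulary size, which is where the speedup comes from.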

**🎓 From Images to Language:**

Compare our journey:

| Day | Task | Input | Output | Key Learning |
|-----|------|-------|--------|---------------|
| **4** | Fashion MNIST | Images (28×28 pixels) | 10 clothing classes | CNNs for vision |
| **5** | Word2Vec | Text (words) | Word vectors | Embeddings for NLP |

Both use neural networks but handle different data types!

**🌟 You've now learned the foundation of Natural Language Processing!**

From here, you can explore:
- Recurrent Neural Networks (RNNs) for sequences
- Transformers (the architecture behind ChatGPT)
- BERT, GPT, and other large language models
- Sentiment analysis, machine translation, text generation

All of these build on the concept of word embeddings that you learned today!

---

**Keep exploring and building! 🚀**