# Module 04: PyTorch Embedding Layers

**Mastering nn.Embedding for Deep Learning NLP**

---

## 1. Objectives

- ✅ Understand nn.Embedding as a lookup table
- ✅ Initialize embeddings (random, pretrained)
- ✅ Load GloVe/FastText into PyTorch
- ✅ Handle padding correctly
- ✅ Freeze vs fine-tune embeddings

## 2. Prerequisites

- [Module 03: Word Embeddings](../03_word_embeddings/03_word_embeddings.ipynb)
- PyTorch basics (tensors, nn.Module)

## 3. Intuition & Motivation

### What is nn.Embedding?

`nn.Embedding` is simply a **lookup table** that maps integer indices to vectors:

```
Index:  0 → [0.2, -0.4, 0.7, ...]
        1 → [0.3, -0.3, 0.6, ...]
        2 → [0.8, 0.5, -0.2, ...]
```
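
The lookup-table view can be checked directly: calling the layer on a tensor of indices should return the same rows as indexing the weight matrix by hand. A minimal sketch, assuming a toy table of 5 rows and 3 dimensions (sizes chosen only for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights
table = nn.Embedding(num_embeddings=5, embedding_dim=3)

idx = torch.tensor([2, 0, 2])
looked_up = table(idx)        # forward pass through the layer
indexed = table.weight[idx]   # plain row indexing into the weight matrix

print(looked_up.shape)                  # torch.Size([3, 3])
print(torch.equal(looked_up, indexed))  # True
```

Index 2 appears twice, so the first and third output rows are identical copies of the same table row.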

### Dimensions

```
Input:  [batch_size, seq_len]        (indices)
Output: [batch_size, seq_len, embed_dim]  (vectors)
```

In [None]:
import torch
import torch.nn as nn
import numpy as np
from typing import Dict, List

print(f"PyTorch version: {torch.__version__}")

## 4. nn.Embedding Basics

In [None]:
# Create embedding layer
vocab_size = 1000
embedding_dim = 300

embedding = nn.Embedding(
    num_embeddings=vocab_size,  # Size of vocabulary
    embedding_dim=embedding_dim  # Dimension of each vector
)

print(f"Embedding weight shape: {embedding.weight.shape}")
print(f"Total parameters: {embedding.weight.numel():,}")

# Forward pass
input_indices = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])  # (batch=2, seq=4)
output = embedding(input_indices)
print(f"\nInput shape: {input_indices.shape}")
print(f"Output shape: {output.shape}")

## 5. Initialization Strategies

In [None]:
# 1. Default: standard normal, N(0, 1)
emb_default = nn.Embedding(1000, 300)
print(f"Default init - mean: {emb_default.weight.mean():.4f}, std: {emb_default.weight.std():.4f}")

# 2. Xavier/Glorot
emb_xavier = nn.Embedding(1000, 300)
nn.init.xavier_uniform_(emb_xavier.weight)
print(f"Xavier init - mean: {emb_xavier.weight.mean():.4f}, std: {emb_xavier.weight.std():.4f}")

# 3. Uniform in range
emb_uniform = nn.Embedding(1000, 300)
nn.init.uniform_(emb_uniform.weight, -0.1, 0.1)
print(f"Uniform init - mean: {emb_uniform.weight.mean():.4f}, std: {emb_uniform.weight.std():.4f}")
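
The standard deviations printed above can be predicted by hand. A uniform distribution on [-a, a] has standard deviation a/√3, and for a 2-D weight `xavier_uniform_` uses a = √(6 / (fan_in + fan_out)) with the default gain of 1. A rough check, assuming the same 1000 × 300 shape as the cell above:

```python
import math
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 1000, 300
emb = nn.Embedding(num_embeddings, embedding_dim)
nn.init.xavier_uniform_(emb.weight)

# Xavier/Glorot uniform bound for a 2-D weight (gain = 1)
bound = math.sqrt(6.0 / (num_embeddings + embedding_dim))
# Standard deviation of Uniform(-bound, bound)
expected_std = bound / math.sqrt(3.0)

print(f"bound: {bound:.4f}, expected std: {expected_std:.4f}")
print(f"observed max |w|: {emb.weight.abs().max().item():.4f}")
print(f"observed std: {emb.weight.std().item():.4f}")
```

The same formula gives roughly 0.1/√3 ≈ 0.058 for the `uniform_(-0.1, 0.1)` case, while the default initialization should land near a standard deviation of 1.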

## 6. Loading Pretrained Embeddings (GloVe)

In [None]:
def load_glove(glove_path: str, vocab: Dict[str, int], embedding_dim: int) -> np.ndarray:
    """
    Load GloVe embeddings for vocabulary.
    
    Args:
        glove_path: Path to GloVe file (e.g., glove.6B.300d.txt)
        vocab: Dictionary mapping word -> index
        embedding_dim: Dimension of embeddings
    
    Returns:
        Embedding matrix of shape (vocab_size, embedding_dim)
    """
    # Initialize with random vectors
    embedding_matrix = np.random.randn(len(vocab), embedding_dim) * 0.01
    found = 0
    
    # Read GloVe file
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
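            # Split from the right: some GloVe files contain tokens with spaces,
            # which would otherwise break the float conversion below
            values = line.rstrip().rsplit(' ', embedding_dim)
            word = values[0]
            if word in vocab:
                idx = vocab[word]
                embedding_matrix[idx] = np.array(values[1:], dtype=np.float32)
                found += 1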
    
    print(f"Found {found}/{len(vocab)} words in GloVe ({100*found/len(vocab):.1f}%)")
    return embedding_matrix

# Example usage (would need actual GloVe file)
# vocab = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
# embeddings = load_glove('glove.6B.300d.txt', vocab, 300)

print("GloVe loading function ready!")
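
To try the parsing logic without downloading GloVe, one option is to write a tiny file in the same text format (a word followed by space-separated floats) and read it back. The sketch below is self-contained and does not call `load_glove`; the three vectors are made-up toy values, not real GloVe vectors, and the file name `toy_glove.txt` is hypothetical.

```python
import os
import tempfile
import numpy as np

# Made-up 3-dimensional vectors in GloVe's text format: "<word> <v1> <v2> <v3>"
toy_lines = [
    "the 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
    "dog 0.7 0.8 0.9",
]
vocab = {"<pad>": 0, "the": 1, "cat": 2, "unicorn": 3}
embedding_dim = 3

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_glove.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_lines) + "\n")

    # Rows for words missing from the file stay at zero in this sketch
    matrix = np.zeros((len(vocab), embedding_dim), dtype=np.float32)
    found = set()
    with open(toy_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if word in vocab:
                matrix[vocab[word]] = np.asarray(parts[1:], dtype=np.float32)
                found.add(word)

missing = sorted(set(vocab) - found)
print(f"found: {sorted(found)}")
print(f"missing: {missing}")
```

Words that appear in the vocabulary but not in the file (here `<pad>` and `unicorn`) are the ones that need a fallback, such as the small random vectors used in `load_glove`.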

In [None]:
def create_embedding_layer(pretrained_weights: np.ndarray, freeze: bool = True) -> nn.Embedding:
    """
    Create embedding layer from pretrained weights.
    
    Args:
        pretrained_weights: NumPy array of shape (vocab_size, embed_dim)
        freeze: If True, embeddings won't be updated during training
    """
    vocab_size, embedding_dim = pretrained_weights.shape
    embedding = nn.Embedding(vocab_size, embedding_dim)
    
    # Load weights
    embedding.weight = nn.Parameter(
        torch.from_numpy(pretrained_weights).float(),
        requires_grad=not freeze
    )
    
    print(f"Created embedding: {vocab_size} x {embedding_dim}, frozen={freeze}")
    return embedding

# Demo with random weights
demo_weights = np.random.randn(1000, 300).astype(np.float32)
frozen_emb = create_embedding_layer(demo_weights, freeze=True)
trainable_emb = create_embedding_layer(demo_weights, freeze=False)
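
PyTorch also ships a classmethod, `nn.Embedding.from_pretrained`, that builds the layer from a weight tensor and sets `requires_grad` from its `freeze` argument. A short sketch with random stand-in weights (the 10 × 4 shape is arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

# Random stand-in for a pretrained matrix of shape (vocab_size, embed_dim)
weights = torch.from_numpy(np.random.randn(10, 4).astype(np.float32))

frozen = nn.Embedding.from_pretrained(weights, freeze=True)
trainable = nn.Embedding.from_pretrained(weights.clone(), freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
print(frozen.weight.shape)             # torch.Size([10, 4])
```

The second layer gets a clone so that the two layers do not share storage if one of them is later trained.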

## 7. Padding and Masking

In [None]:
# Padding index: embedding for padding tokens should be zeros
PAD_IDX = 0

embedding = nn.Embedding(
    num_embeddings=1000,
    embedding_dim=300,
    padding_idx=PAD_IDX  # Row is initialized to zeros and never updated by gradients
)

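# Check padding embedding
print("Padding embedding (should be zeros):")
print(f"  Sum: {embedding.weight[PAD_IDX].sum().item()}")
# requires_grad belongs to the whole weight tensor, so this prints True;
# padding_idx instead keeps the gradient of this row at zero during backward
print(f"  Requires grad: {embedding.weight[PAD_IDX].requires_grad}")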

# Example with padded sequence
padded_input = torch.tensor([
    [1, 2, 3, 0, 0],  # Sequence of length 3, padded to 5
    [4, 5, 0, 0, 0]   # Sequence of length 2, padded to 5
])

output = embedding(padded_input)
print(f"\nOutput shape: {output.shape}")
print(f"Padding positions are zero vectors: {output[0, 3].sum().item() == 0}")
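
Zero vectors at padding positions are not always enough on their own: an average over the sequence axis still divides by the padded length. One common remedy is to build a boolean mask from the indices and divide by the true length instead. A minimal sketch of masked mean pooling, assuming `PAD_IDX = 0` as above and a small toy layer:

```python
import torch
import torch.nn as nn

PAD_IDX = 0
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=PAD_IDX)

token_ids = torch.tensor([
    [1, 2, 3, 0, 0],  # true length 3
    [4, 5, 0, 0, 0],  # true length 2
])

vectors = emb(token_ids)                     # (2, 5, 4)
mask = (token_ids != PAD_IDX).unsqueeze(-1)  # (2, 5, 1), True for real tokens
lengths = mask.sum(dim=1).clamp(min=1)       # (2, 1), guards against empty rows

# Sum only real-token vectors, then divide by the true length
masked_mean = (vectors * mask).sum(dim=1) / lengths  # (2, 4)
naive_mean = vectors.mean(dim=1)                     # divides by padded length 5

print(masked_mean.shape)
print(torch.allclose(masked_mean, naive_mean))
```

The two means differ because the naive version counts the padded positions in the denominator.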

## 8. Complete Embedding Module for NLP

In [None]:
class TextEmbedding(nn.Module):
    """
    Text embedding module with optional pretrained weights.
    """
    
    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int,
        padding_idx: int = 0,
        pretrained_weights: np.ndarray = None,
        freeze: bool = False,
        dropout: float = 0.0
    ):
        super().__init__()
        
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=padding_idx
        )
        
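        # Load pretrained if provided
        if pretrained_weights is not None:
            with torch.no_grad():
                self.embedding.weight.copy_(
                    torch.from_numpy(pretrained_weights).float()
                )
                # The copy overwrites the zeroed padding row, so zero it again
                if padding_idx is not None:
                    self.embedding.weight[padding_idx].zero_()
            self.embedding.weight.requires_grad = not freeze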
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch_size, seq_len) token indices
        Returns:
            (batch_size, seq_len, embedding_dim) embeddings
        """
        return self.dropout(self.embedding(x))

# Example
text_emb = TextEmbedding(vocab_size=5000, embedding_dim=300, dropout=0.1)
x = torch.randint(0, 5000, (32, 100))  # batch=32, seq_len=100
output = text_emb(x)
print(f"Input: {x.shape} → Output: {output.shape}")

## 9. 🔥 Real-World Usage

### Best Practices

| Scenario | Recommendation |
|----------|----------------|
| Small data | Use pretrained, **freeze** |
| Medium data | Use pretrained, **fine-tune** |
| Large data | Random init or pretrained |
| Domain-specific | Fine-tune or train from scratch |

### Memory Optimization

```python
# Embeddings can be huge!
# 50K vocab × 300d × 4 bytes = 60 MB

# Solutions:
# 1. Reduce vocab size
# 2. Use smaller embedding dimension
# 3. Quantize for inference
```
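
The 60 MB figure can be reproduced from the layer itself by multiplying the number of weights by the bytes per element. The sketch below also estimates the footprint at 16-bit precision, as a rough illustration of what a lower-precision copy would save; it only does the arithmetic and does not convert the layer.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(50_000, 300)

# Parameter count times bytes per element (float32 uses 4 bytes)
n_bytes = emb.weight.numel() * emb.weight.element_size()
print(f"float32: {n_bytes / 1e6:.0f} MB")  # float32: 60 MB

# Same parameter count at 16 bits per element
half_bytes = emb.weight.numel() * torch.finfo(torch.float16).bits // 8
print(f"float16: {half_bytes / 1e6:.0f} MB")  # float16: 30 MB
```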

### Modern Approach

> For transformers (BERT, GPT), use the model's built-in embeddings.
> No need to load separate pretrained embeddings.

## 10. Interview Questions

**Q1: What is nn.Embedding? Is it the same as a linear layer?**
<details><summary>Answer</summary>

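`nn.Embedding` is a lookup table. It computes the same result as `nn.Linear(vocab_size, embed_dim, bias=False)` applied to one-hot inputs (with the weight matrix transposed), but it is much more efficient because it indexes rows directly instead of building one-hot vectors and multiplying.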
</details>
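
The equivalence in Q1 can be verified numerically. The sketch below copies an embedding's weights into a bias-free linear layer and compares the two on the same indices; the sizes (6 words, 3 dimensions) are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 6, 3
emb = nn.Embedding(vocab_size, embed_dim)

# nn.Linear stores its weight as (out_features, in_features),
# so the embedding matrix has to be copied in transposed
linear = nn.Linear(vocab_size, embed_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(emb.weight.t())

idx = torch.tensor([4, 1, 4])
one_hot = F.one_hot(idx, num_classes=vocab_size).float()  # (3, 6)

same = torch.allclose(emb(idx), linear(one_hot))
print(same)  # True
```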

**Q2: Why use padding_idx?**
<details><summary>Answer</summary>

- Initializes the padding row to zeros
- Keeps that row's gradient at zero, so training never updates it
- Padding positions therefore add nothing when embeddings are summed; operations such as averaging still need an explicit mask
</details>
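
The gradient behaviour described in Q2 can be observed after a single backward pass. A small sketch, assuming `PAD_IDX = 0` and a toy loss that just sums the output:

```python
import torch
import torch.nn as nn

PAD_IDX = 0
emb = nn.Embedding(10, 4, padding_idx=PAD_IDX)

token_ids = torch.tensor([[1, 2, PAD_IDX, PAD_IDX]])
emb(token_ids).sum().backward()

grad = emb.weight.grad
print(grad[PAD_IDX])  # all zeros: the padding row receives no update
print(grad[1])        # all ones: the row of a real token does
```

Rows for indices that never appear in the batch also have zero gradient, but only the padding row is guaranteed to stay that way regardless of the input.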

**Q3: When to freeze vs fine-tune embeddings?**
<details><summary>Answer</summary>

- **Freeze**: Small dataset, prevent overfitting
- **Fine-tune**: Larger dataset, domain mismatch between pretrained and target
- Common: Freeze initially, unfreeze later
</details>
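
The "freeze initially, unfreeze later" pattern from Q3 comes down to toggling `requires_grad` on the embedding weight between training phases. The sketch below uses a hypothetical toy classifier and random data purely to show the mechanics; the model, learning rates, and single-step "phases" are illustrative assumptions, not a recommended schedule.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: embed 5 tokens, flatten, classify into 2 classes
model = nn.Sequential(
    nn.Embedding(100, 16),
    nn.Flatten(),
    nn.Linear(16 * 5, 2),
)
embedding = model[0]

x = torch.randint(0, 100, (8, 5))  # random token ids
y = torch.randint(0, 2, (8,))      # random labels
loss_fn = nn.CrossEntropyLoss()

def train_step(optimizer: torch.optim.Optimizer) -> None:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

before = embedding.weight.detach().clone()

# Phase 1: embeddings frozen, only the classifier head is optimized
embedding.weight.requires_grad_(False)
head_opt = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)
train_step(head_opt)
unchanged_while_frozen = torch.equal(embedding.weight, before)

# Phase 2: unfreeze and optimize everything, typically with a smaller rate
embedding.weight.requires_grad_(True)
full_opt = torch.optim.SGD(model.parameters(), lr=0.01)
train_step(full_opt)
changed_after_unfreeze = not torch.equal(embedding.weight, before)

print(unchanged_while_frozen, changed_after_unfreeze)
```

A new optimizer is created for the second phase because the first one was built without the embedding parameters.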

## 11. Summary

- **nn.Embedding**: Efficient lookup table for word vectors
- **Input/Output**: `[batch, seq]` → `[batch, seq, dim]`
- **Pretrained**: Load GloVe/FastText vectors, which often helps when labeled data is limited
- **padding_idx**: Keep padding tokens as zeros
- **Freeze/Fine-tune**: Depends on data size and on how well the pretrained vectors match the target domain

## 12. Exercises

1. Load actual GloVe embeddings and compute word similarities
2. Compare frozen vs fine-tuned on sentiment classification
3. Implement EmbeddingBag for multi-hot inputs
4. Visualize how embeddings change during training

## 13. References

- [PyTorch nn.Embedding Docs](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
- [GloVe Pretrained](https://nlp.stanford.edu/projects/glove/)
- [FastText Pretrained](https://fasttext.cc/docs/en/english-vectors.html)

---
**Next:** [Module 05: RNN Fundamentals](../05_rnn_fundamentals/05_rnn_fundamentals.ipynb)