# Module 11: Language Modeling

**Predicting the Next Word**

---

## 1. Objectives

- ✅ Understand language modeling fundamentals
- ✅ Build LSTM language model
- ✅ Evaluate with perplexity
- ✅ Generate text with different sampling strategies

## 2. Prerequisites

- [Module 06: LSTM](../06_lstm/06_lstm.ipynb)

## 3. What is Language Modeling?

### Task Definition
Predict the probability of the next word given previous words:

$$P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$
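
Multiplying these conditionals over every position gives the probability of a whole sequence (the chain rule of probability), which is why a next-word predictor is also a model of entire sentences:

$$P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$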

### Example
```
Input:  "The cat sat on the"
Output: P(mat) = 0.3, P(floor) = 0.2, P(dog) = 0.01, ...
```
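
To make the example concrete, such probabilities can be estimated by simple counting. The sketch below is an illustration only: it uses a tiny made-up set of sentences and a bigram approximation (conditioning on just the previous word), which is much cruder than the LSTM built later in this module. The helper name `next_word_probs` is hypothetical.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (made up for this sketch; one sentence per entry).
sentences = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "the cat chased the mouse",
]

# Count how often each word follows each previous word.
follow_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next word | previous word)."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("the")
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"P({word} | the) = {p:.2f}")
```

A neural language model replaces the count table with a learned function, which lets it condition on longer contexts and generalize to word sequences it has never seen.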

### Why It Matters
- Foundation of GPT, ChatGPT, etc.
- Text generation
- Autocomplete
- Speech recognition

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import Counter

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

Device: cpu


## 4. Data Preparation

In [2]:
# Sample corpus
corpus = """
the cat sat on the mat
the dog ran in the park
a cat and a dog are friends
the cat chased the mouse
the dog barked at the cat
""".strip().lower()

# Build vocabulary
words = corpus.split()
word_counts = Counter(words)
vocab = ['<PAD>', '<UNK>', '<EOS>'] + [w for w, _ in word_counts.most_common()]
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}

print(f"Vocabulary size: {len(vocab)}")
print(f"Sample: {list(word2idx.items())[:10]}")

Vocabulary size: 20
Sample: [('<PAD>', 0), ('<UNK>', 1), ('<EOS>', 2), ('the', 3), ('cat', 4), ('dog', 5), ('a', 6), ('sat', 7), ('on', 8), ('mat', 9)]


In [3]:
# Create training sequences
def create_sequences(text, word2idx, seq_len=5):
    """Create (input, target) pairs for LM training.

    The text is treated as one continuous token stream, so windows may
    span line boundaries. Each target is its input shifted one token ahead.
    """
    unk = word2idx['<UNK>']
    tokens = [word2idx.get(w, unk) for w in text.split()]
    inputs, targets = [], []

    for i in range(len(tokens) - seq_len):
        inputs.append(tokens[i:i+seq_len])
        targets.append(tokens[i+1:i+seq_len+1])  # Shifted by 1

    return torch.tensor(inputs), torch.tensor(targets)

X, Y = create_sequences(corpus, word2idx, seq_len=4)
print(f"X shape: {X.shape}, Y shape: {Y.shape}")
print(f"\nExample:")
print(f"  Input:  {[idx2word[i.item()] for i in X[0]]}")
print(f"  Target: {[idx2word[i.item()] for i in Y[0]]}")

X shape: torch.Size([26, 4]), Y shape: torch.Size([26, 4])

Example:
  Input:  ['the', 'cat', 'sat', 'on']
  Target: ['cat', 'sat', 'on', 'the']


## 5. LSTM Language Model

In [4]:
class LSTMLanguageModel(nn.Module):
    """LSTM-based language model."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2, dropout=0.3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers,
            batch_first=True, dropout=dropout if num_layers > 1 else 0
        )
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

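        # Tie embedding and output weights. This is only possible when the
        # shapes match (embed_dim == hidden_dim), so it is skipped for the
        # 64/128 configuration instantiated below.
        if embed_dim == hidden_dim:
            self.fc.weight = self.embedding.weight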

    def forward(self, x, hidden=None):
        # x: (batch, seq)
        embedded = self.dropout(self.embedding(x))  # (batch, seq, embed)
        output, hidden = self.lstm(embedded, hidden)  # (batch, seq, hidden)
        logits = self.fc(self.dropout(output))  # (batch, seq, vocab)
        return logits, hidden

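    def init_hidden(self, batch_size):
        """Return zero-initialized (h, c), each (num_layers, batch, hidden)."""
        # Use the model's own device rather than relying on a global variable.
        param_device = next(self.parameters()).device
        h = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=param_device)
        c = torch.zeros_like(h)
        return (h, c)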

model = LSTMLanguageModel(len(vocab), embed_dim=64, hidden_dim=128, num_layers=2).to(device)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Parameters: 235,284
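
The parameter count above can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the layout used by PyTorch's `nn.LSTM` (four gates per layer, with separate input-to-hidden and hidden-to-hidden bias vectors) and no weight tying. The function name `lstm_lm_param_count` is hypothetical.

```python
def lstm_lm_param_count(vocab_size, embed_dim, hidden_dim, num_layers):
    """Count parameters of embedding + stacked LSTM + output layer (no tying)."""
    embedding = vocab_size * embed_dim
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; later layers read the previous layer's output.
        input_dim = embed_dim if layer == 0 else hidden_dim
        # 4 gates, each with input weights and recurrent weights,
        # plus two bias vectors per gate (bias_ih and bias_hh).
        lstm += 4 * hidden_dim * (input_dim + hidden_dim) + 2 * 4 * hidden_dim
    output = hidden_dim * vocab_size + vocab_size  # weight matrix + bias
    return embedding + lstm + output

total = lstm_lm_param_count(vocab_size=20, embed_dim=64, hidden_dim=128, num_layers=2)
print(f"{total:,}")  # 235,284
```

Almost all of the parameters sit in the two LSTM layers (99,328 and 132,096); the embedding and output layers are small because the vocabulary has only 20 entries.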


## 6. Training

In [5]:
# Training loop
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

X, Y = X.to(device), Y.to(device)

for epoch in range(100):
    model.train()
    hidden = model.init_hidden(X.size(0))

    optimizer.zero_grad()
    logits, _ = model(X, hidden)

    # Reshape for loss: (batch*seq, vocab) vs (batch*seq,)
    loss = criterion(logits.view(-1, len(vocab)), Y.view(-1))
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if (epoch + 1) % 20 == 0:
        perplexity = torch.exp(loss).item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, Perplexity={perplexity:.2f}")

Epoch 20: Loss=2.5967, Perplexity=13.42
Epoch 40: Loss=1.9175, Perplexity=6.80
Epoch 60: Loss=1.0862, Perplexity=2.96
Epoch 80: Loss=0.6107, Perplexity=1.84
Epoch 100: Loss=0.4108, Perplexity=1.51


## 7. Perplexity (Evaluation Metric)

$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_{<i})\right)$$

- Lower is better
- Perplexity of 10 ≈ "choosing from 10 equally likely words"
- GPT-2: roughly 20-30 on some standard benchmarks; the exact value depends on model size, dataset, and tokenization
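
The formula and the "equally likely words" reading can be checked with a few lines of plain Python. This is a minimal sketch using made-up probabilities rather than model outputs; the function name `perplexity` is chosen for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability given to the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads probability evenly over 10 words gives each true token 0.1.
uniform = perplexity([0.1] * 5)
# A model that is usually confident about the true token (made-up values).
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"uniform over 10 words: {uniform:.2f}")
print(f"confident model:       {confident:.2f}")
```

The uniform case comes out at 10, matching the interpretation above, while the confident model scores close to 1, the lowest value perplexity can take.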

In [6]:
def compute_perplexity(model, X, Y):
    model.eval()
    with torch.no_grad():
        hidden = model.init_hidden(X.size(0))
        logits, _ = model(X, hidden)
        loss = F.cross_entropy(logits.view(-1, len(vocab)), Y.view(-1))
        return torch.exp(loss).item()

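# Note: this evaluates on the training data, so it measures how well the
# model memorized the corpus, not how well it generalizes.
ppl = compute_perplexity(model, X, Y)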
print(f"Perplexity: {ppl:.2f}")

Perplexity: 1.39


## 8. Text Generation

In [7]:
def generate(model, prompt, max_len=20, temperature=1.0, top_k=None):
    """
    Generate text from prompt.

    Args:
        temperature: Higher = more random, Lower = more deterministic
        top_k: Sample from top k tokens only
    """
    model.eval()
    unk = word2idx['<UNK>']
    tokens = [word2idx.get(w, unk) for w in prompt.lower().split()]

    hidden = model.init_hidden(1)
    # Feed the whole prompt once. After that, feed only the newest token:
    # `hidden` already summarizes everything the model has seen, so
    # re-feeding earlier tokens would process them twice.
    x = torch.tensor([tokens]).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            logits, hidden = model(x, hidden)

            # Get next token logits
            next_logits = logits[0, -1] / temperature

            # Top-k sampling: mask everything outside the k highest logits
            if top_k is not None:
                values, indices = torch.topk(next_logits, top_k)
                next_logits = torch.full_like(next_logits, float('-inf'))
                next_logits.scatter_(0, indices, values)

            # Sample
            probs = F.softmax(next_logits, dim=0)
            next_token = torch.multinomial(probs, 1).item()

            tokens.append(next_token)

            if next_token == word2idx['<EOS>']:
                break

            x = torch.tensor([[next_token]]).to(device)

    return ' '.join([idx2word[t] for t in tokens])

# Generate with different settings
# (temp=0.1 still samples, so it is near-greedy rather than strictly greedy)
print("Near-greedy (temp=0.1):")
print(f"  {generate(model, 'the cat', temperature=0.1)}\n")

print("Creative (temp=1.5):")
print(f"  {generate(model, 'the cat', temperature=1.5)}\n")

print("Top-k=3:")
print(f"  {generate(model, 'the cat', top_k=3)}")

Near-greedy (temp=0.1):
  the cat chased the mouse the dog barked at the cat chased the mouse the dog barked at the cat chased the

Creative (temp=1.5):
  the cat the dog in the a the park a cat and a dog are friends the cat chased the mouse the

Top-k=3:
  the cat chased the mouse the dog ran at the park a cat and a dog are friends the cat chased the
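
The effect of the `temperature` argument in `generate` can be seen in isolation on a toy logit vector. The sketch below is plain Python with made-up logits, not output from the trained model; `softmax_with_temperature` is a hypothetical helper that mirrors dividing the logits by the temperature before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
for t in (0.1, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At `T=0.1` nearly all probability mass lands on the highest-scoring word, which is why low-temperature samples tend to replay the training sentences; at `T=1.5` the distribution flattens and less likely words are drawn more often.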


## 9. 🔥 Real-World Usage

### LSTM LM → Transformers

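| Era | Model | Approx. perplexity |
|-----|-------|--------------------|
| 2017 | LSTM LM | ~50-100 |
| 2018 | Transformer LM | ~30-50 |
| 2019 | GPT-2 | ~20-30 |
| 2020+ | GPT-3/4 | Even lower |

These are rough ranges. Perplexity depends on the dataset and tokenization, so the rows show a trend rather than a like-for-like comparison.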

### Key Concepts for LLMs
- Same objective: predict next token
- Same loss: cross-entropy
- Same sampling: temperature, top-k, top-p
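
Top-p (nucleus) sampling is listed above but not implemented in this module's `generate` function. A minimal pure-Python sketch of the idea follows, assuming a made-up next-token distribution; the function name `top_p_sample` and its `rng` parameter are hypothetical. It keeps the smallest set of highest-probability tokens whose cumulative probability reaches `p`, then samples only from that set.

```python
import random

def top_p_sample(probs, p=0.9, rng=random):
    """Sample an index from the smallest top-ranked set with cumulative prob >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Using the kept probabilities as weights renormalizes within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.04, 0.01]  # made-up next-token distribution
rng = random.Random(0)                # seeded for repeatability
samples = [top_p_sample(probs, p=0.9, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `p=0.9` the nucleus here is the first three tokens (0.5 + 0.3 + 0.15 = 0.95), so the two low-probability tail tokens are never drawn. Unlike top-k, the number of kept tokens adapts to how peaked the distribution is.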

## 10. Interview Questions

**Q1: What is perplexity?**
<details><summary>Answer</summary>

Exponentiated average cross-entropy loss. Lower is better. Perplexity of N means the model is "as confused as choosing from N equally likely options."
</details>

**Q2: How does temperature affect generation?**
<details><summary>Answer</summary>

- Temperature < 1: More deterministic, sharper distribution that favors high-probability tokens (approaches greedy decoding as temperature approaches 0)
- Temperature = 1: Original distribution
- Temperature > 1: More random, flatter distribution
</details>

## 11. Summary

- **Language Model**: P(next word | previous words)
- **Training**: Cross-entropy loss, predict next token
- **Perplexity**: exp(mean cross-entropy loss), lower is better
- **Generation**: Temperature, top-k, top-p sampling
- **Foundation**: Same principles power GPT, ChatGPT

## 12. References

- [The Unreasonable Effectiveness of RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

---
**Next:** [Module 12: Sequence-to-Sequence](../12_seq2seq/12_seq2seq.ipynb)