# Day 19: Character-Level Models and Text Generation

**Goal:** Build character-level language models for text generation using sampling strategies.

**Time estimate:** 3-4 hours

**What you'll learn:**
- Character-level vs word-level modeling
- Autoregressive language modeling
- Temperature sampling and top-k sampling
- Beam search for generation
- Perplexity evaluation metric

---

## Part 1: Language Modeling Fundamentals

### What is a Language Model?

A language model assigns probabilities to sequences of tokens:

$$P(\text{"The cat sat"}) = P(\text{The}) \times P(\text{cat}|\text{The}) \times P(\text{sat}|\text{The cat})$$

**Autoregressive property:**
- Predict next token given all previous tokens
- $P(\text{token}_i | \text{token}_0, \dots, \text{token}_{i-1})$
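
The chain-rule factorization above can be checked on a toy example. The sketch below is not a trained model: `cond_prob` is a hypothetical hand-written table of conditional probabilities (the numbers are invented for illustration), and `sequence_log_prob` is a helper name introduced here. It sums log-probabilities rather than multiplying raw probabilities, which is the usual way to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities P(next | context), invented for illustration.
cond_prob = {
    (): {"The": 0.5, "A": 0.5},
    ("The",): {"cat": 0.4, "dog": 0.6},
    ("The", "cat"): {"sat": 0.25, "ran": 0.75},
}

def sequence_log_prob(tokens):
    """Sum log P(token_i | token_0..token_{i-1}) over the sequence."""
    total = 0.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[:i])  # all previous tokens
        total += math.log(cond_prob[context][token])
    return total

log_p = sequence_log_prob(["The", "cat", "sat"])
prob = math.exp(log_p)  # same as 0.5 * 0.4 * 0.25
print(f"P(The cat sat) = {prob:.3f}")
```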

### Character-Level vs Word-Level

#### Character-Level Modeling
- **Vocabulary size:** ~100 (for English: upper- and lowercase letters, digits, punctuation, whitespace)
- **Advantages:**
  - Handles any word (no out-of-vocabulary)
  - Can learn spelling, morphology
  - Natural for code, structured text
- **Disadvantages:**
  - Longer sequences (more computation)
  - Harder to learn long-range dependencies
  - Slower generation (character by character)

#### Word-Level Modeling
- **Vocabulary size:** 10K-100K+ words (fixed in advance; any other word maps to an unknown token)
- **Advantages:**
  - Shorter sequences
  - Captures semantic relationships
  - Faster to train and generate
- **Disadvantages:**
  - Out-of-vocabulary problem
  - Requires tokenization preprocessing

**Modern approach:** Use subword tokenization (BPE, WordPiece), which keeps sequences much shorter than character-level modeling while still being able to encode unseen words.

Today we focus on **character-level** for simplicity and full understanding.
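
To make the trade-off concrete, here is a minimal sketch that tokenizes one sentence both ways. The sentence and variable names are illustrative only, and splitting on whitespace is a simplifying assumption (real word-level tokenizers also handle punctuation and casing).

```python
sentence = "the cat sat on the mat"

# Character-level: every character (including spaces) is a token.
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

# Word-level: split on whitespace (a simplification).
word_tokens = sentence.split()
word_vocab = sorted(set(word_tokens))

print(f"Characters: {len(char_tokens)} tokens, vocabulary of {len(char_vocab)}")
print(f"Words:      {len(word_tokens)} tokens, vocabulary of {len(word_vocab)}")

# An unseen word is out of vocabulary at the word level,
# but can still be spelled with known characters.
new_word = "mats"
print(new_word in word_vocab)
print(all(c in char_vocab for c in new_word))
```

The character sequence is several times longer than the word sequence, but its vocabulary is closed: any new word built from known characters can be represented.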

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
from collections import Counter
import urllib.request

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## Part 2: Character Vocabulary and Preprocessing

### Step 1: Create Character Vocabulary

In [None]:
# Download a small Shakespeare text file for training
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filename = "shakespeare.txt"

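try:
    urllib.request.urlretrieve(url, filename)
    print(f"✓ Downloaded {filename}")
except OSError as err:
    # Fallback: create a small sample if the download fails
    # (urllib's URLError and HTTPError are subclasses of OSError)
    print(f"Download failed ({err}); using a small built-in sample instead")
    with open(filename, 'w', encoding='utf-8') as f:
        f.write("The cat sat on the mat.\n" * 100)
    print(f"✓ Created sample {filename}")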

# Read the text
with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Text length: {len(text):,} characters")
print(f"First 100 characters:\n{text[:100]}")

In [None]:
# Create character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Create mappings
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}

print(f"Vocabulary size: {vocab_size}")
print(f"\nCharacters in vocabulary:")
print(''.join(chars))
print(f"\nExample mappings:")
for char in "The":
    print(f"  '{char}' → {char_to_idx[char]}")

### Step 2: Create Dataset

In [None]:
# Convert text to character indices
data = torch.tensor([char_to_idx[char] for char in text], dtype=torch.long)

print(f"Data shape: {data.shape}")
print(f"Data type: {data.dtype}")
print(f"Sample: {data[:20]}")
print(f"Decoded: {''.join([idx_to_char[idx.item()] for idx in data[:50]])}")

In [None]:
# Create dataset for language modeling
class CharacterDataset(torch.utils.data.Dataset):
    """
    Character-level language modeling dataset.
    For each sequence, the target is the next token.
    
    Example:
    Input:  "Hello" → [H, e, l, l, o]
    Target:           [e, l, l, o, space]
    """
    
    def __init__(self, data, seq_len):
        self.data = data
        self.seq_len = seq_len
    
    def __len__(self):
        # Number of sequences we can create
        return len(self.data) - self.seq_len
    
    def __getitem__(self, idx):
        # Input: seq_len characters starting at idx
        x = self.data[idx:idx + self.seq_len]
        # Target: the same window shifted one character to the right,
        # so position t of y is the character that follows position t of x.
        # This matches the model output of shape (batch_size, seq_len, vocab_size).
        y = self.data[idx + 1:idx + self.seq_len + 1]
        return x, y

# Parameters
seq_len = 64  # Context length
batch_size = 32

# Create train/val split
train_size = int(0.9 * len(data))
train_data = data[:train_size]
val_data = data[train_size:]

train_dataset = CharacterDataset(train_data, seq_len)
val_dataset = CharacterDataset(val_data, seq_len)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print(f"Train dataset size: {len(train_dataset)}")
print(f"Val dataset size: {len(val_dataset)}")
print(f"Sequence length: {seq_len}")
print(f"Batch size: {batch_size}")

# Example batch
x, y = next(iter(train_loader))
print(f"\nExample batch:")
print(f"  Input shape: {x.shape}")
print(f"  Target shape: {y.shape}")
print(f"  First input:  {''.join([idx_to_char[idx.item()] for idx in x[0]])!r}")
print(f"  First target: {''.join([idx_to_char[idx.item()] for idx in y[0]])!r}")

## Part 3: Character-Level RNN Language Model

In [None]:
class CharLevelRNN(nn.Module):
    """
    Character-level RNN language model.
    
    Architecture:
    Input (batch_size, seq_len)
      ↓ Embedding
    (batch_size, seq_len, embed_dim)
      ↓ LSTM
    (batch_size, seq_len, hidden_dim)
      ↓ Linear
    (batch_size, seq_len, vocab_size)
    """
    
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, dropout=0.1):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
    
    def forward(self, x):
        """
        Args:
            x: Input token indices, shape (batch_size, seq_len)
        
        Returns:
            logits: Unnormalized probabilities, shape (batch_size, seq_len, vocab_size)
        """
        # Embedding
        x = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        
        # LSTM
        lstm_out, _ = self.lstm(x)  # (batch_size, seq_len, hidden_dim)
        
        # Output layer
        logits = self.fc(lstm_out)  # (batch_size, seq_len, vocab_size)
        
        return logits

# Create model
embed_dim = 64
hidden_dim = 256
num_layers = 2

model = CharLevelRNN(vocab_size, embed_dim, hidden_dim, num_layers, dropout=0.2).to(device)

print(f"Model architecture:")
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

## Part 4: Training

### Loss Function: Cross-Entropy

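We want to maximize the probability the model assigns to each true next character. This is equivalent to minimizing the average negative log-likelihood over the $N$ predictions in a batch:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(\text{next}_i | \text{context}_i)$$

This is the **cross-entropy loss** between the predicted distribution and the true next character. `nn.CrossEntropyLoss` takes raw logits and averages over predictions by default.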
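
As a quick sanity check on the loss and on the perplexity metric used later, the sketch below computes both by hand in plain Python. The predicted distributions and targets are made-up numbers over a hypothetical 4-character vocabulary, and `cross_entropy` is a helper defined here, not the PyTorch function.

```python
import math

def cross_entropy(pred_probs, target_indices):
    """Average negative log-probability assigned to the true next characters."""
    nll = [-math.log(p[t]) for p, t in zip(pred_probs, target_indices)]
    return sum(nll) / len(nll)

# Hypothetical predicted distributions over a 4-character vocabulary.
pred_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.1, 0.1, 0.7],
]
targets = [0, 2, 3]  # index of the true next character at each position

loss = cross_entropy(pred_probs, targets)
perplexity = math.exp(loss)
print(f"loss = {loss:.4f}, perplexity = {perplexity:.4f}")

# A model that always predicts the uniform distribution has
# perplexity equal to the vocabulary size.
uniform_loss = cross_entropy([[0.25] * 4] * 3, targets)
uniform_ppl = math.exp(uniform_loss)
print(f"uniform perplexity = {uniform_ppl:.4f}")
```

Perplexity can be read as the effective number of equally likely choices the model is hesitating between at each step.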

In [None]:
def train_epoch(model, train_loader, criterion, optimizer, device):
    """
    Train for one epoch.
    """
    model.train()
    total_loss = 0
    
    for x, y in train_loader:
        x = x.to(device)
        y = y.to(device)
        
        # Forward pass
        logits = model(x)  # (batch_size, seq_len, vocab_size)
        
        # Compute loss
        # Reshape for cross-entropy: (batch_size * seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, model.vocab_size), y.reshape(-1))
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping (important for RNNs)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(train_loader)


def evaluate(model, val_loader, criterion, device):
    """
    Evaluate on validation set.
    
    Returns:
        loss: Cross-entropy loss
        perplexity: exp(loss) - measure of model uncertainty
    """
    model.eval()
    total_loss = 0
    
    with torch.no_grad():
        for x, y in val_loader:
            x = x.to(device)
            y = y.to(device)
            
            logits = model(x)
            loss = criterion(logits.reshape(-1, model.vocab_size), y.reshape(-1))
            total_loss += loss.item()
    
    avg_loss = total_loss / len(val_loader)
    perplexity = np.exp(avg_loss)
    
    return avg_loss, perplexity


print("✓ Training functions defined")

In [None]:
# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 20
train_losses = []
val_losses = []
perplexities = []

print("Starting training...")
print("-" * 60)
print(f"{'Epoch':<8} {'Train Loss':<15} {'Val Loss':<15} {'Perplexity':<15}")
print("-" * 60)

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, perplexity = evaluate(model, val_loader, criterion, device)
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    perplexities.append(perplexity)
    
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"{epoch+1:<8} {train_loss:<15.4f} {val_loss:<15.4f} {perplexity:<15.4f}")

print("-" * 60)
print("✓ Training complete")

### Visualize Training Progress

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
ax = axes[0]
ax.plot(train_losses, label='Train Loss', linewidth=2, marker='o')
ax.plot(val_losses, label='Val Loss', linewidth=2, marker='s')
ax.set_xlabel('Epoch', fontsize=12, fontweight='bold')
ax.set_ylabel('Cross-Entropy Loss', fontsize=12, fontweight='bold')
ax.set_title('Training Progress', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Perplexity
ax = axes[1]
ax.plot(perplexities, label='Validation Perplexity', linewidth=2, marker='o', color='orange')
ax.set_xlabel('Epoch', fontsize=12, fontweight='bold')
ax.set_ylabel('Perplexity', fontsize=12, fontweight='bold')
ax.set_title('Language Model Perplexity', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final validation perplexity: {perplexities[-1]:.2f}")
print(f"\nInterpretation:")
print(f"- Perplexity measures 'confusion' of the model")
print(f"- Lower is better (best possible: 1.0)")
print(f"- Uniform distribution over {vocab_size} chars: perplexity = {vocab_size:.1f}")

## Part 5: Text Generation Strategies

### Greedy Decoding (Baseline)

Always pick the most likely next token:
$$\text{next} = \arg\max_i P(\text{token}_i | \text{context})$$

**Problem:** The output is deterministic and tends to fall into repetitive loops, because the single most likely token is chosen at every step.
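
A minimal sketch of greedy decoding on a toy model shows how quickly it can start looping. The table `next_char_probs` is hypothetical (probabilities invented for illustration, depending only on the previous character), and `greedy_decode` is a helper name introduced here.

```python
# Hypothetical next-character probabilities, invented for illustration.
next_char_probs = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.7, "c": 0.3},
    "c": {"a": 0.5, "b": 0.5},
}

def greedy_decode(start, steps):
    """Repeatedly append the single most likely next character."""
    text = start
    for _ in range(steps):
        probs = next_char_probs[text[-1]]
        text += max(probs, key=probs.get)  # argmax over next characters
    return text

greedy_text = greedy_decode("a", 8)
print(greedy_text)
```

Although "c" has substantial probability after both "a" and "b", greedy decoding never produces it, and running the function again gives exactly the same string.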

### Temperature Sampling

Control diversity by modifying probability distribution:
$$P'(\text{token}_i) = \frac{\exp(\text{logit}_i / T)}{\sum_j \exp(\text{logit}_j / T)}$$

Where:
- $T = 1.0$: Original distribution (no change)
- $T > 1.0$: Softer distribution (more diversity, more errors)
- $T < 1.0$: Sharper distribution (less diversity, more confident)
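
The effect of $T$ is easy to see on a small set of logits. The sketch below is a plain-Python version of the formula above; the logits are arbitrary example values and `softmax_with_temperature` is a helper name introduced here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # arbitrary example values
for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

The probability of the top token should shrink as $T$ grows, while the ranking of the tokens stays the same.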

### Top-K Sampling

Sample only from the k most likely tokens:
1. Sort probabilities in descending order
2. Keep only top-k
3. Renormalize remaining probabilities
4. Sample from this distribution
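
The four steps above can be sketched in plain Python on a small distribution. The probabilities are example values, `top_k_filter` is a helper name introduced here, and the seeded `random.Random` is only there to make the sample repeatable.

```python
import random

def top_k_filter(probs, k):
    """Zero out everything except the k largest probabilities, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.25, 0.15, 0.1]  # example distribution over 4 tokens
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])

# Sample one token index from the filtered distribution.
rng = random.Random(0)
sample = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(sample)
```

Tokens outside the top k get probability exactly zero, so they can never be sampled, no matter how many draws are made.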

### Nucleus (Top-P) Sampling

Keep the smallest set of tokens whose cumulative probability is ≥ p:
1. Sort by probability descending
2. Keep tokens until the cumulative sum reaches p
3. Renormalize and sample from this nucleus
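
A plain-Python sketch of the nucleus selection follows. The distribution is an example, and `top_p_filter` is a helper name introduced here. Note that the token which pushes the cumulative sum past p is itself kept; dropping it would leave a set whose total probability is below p.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)           # include the token that crosses the threshold
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

probs = [0.5, 0.25, 0.15, 0.1]  # example distribution over 4 tokens
nucleus = top_p_filter(probs, p=0.7)
print([round(x, 3) for x in nucleus])
```

Unlike top-k, the number of tokens kept adapts to the distribution: a confident prediction keeps one or two tokens, a flat one keeps many.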

In [None]:
def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """
    Sample next token from logits using various strategies.
    
    Args:
        logits: Unnormalized scores, shape (vocab_size,)
        temperature: Sampling temperature, must be > 0 (1.0 = no change)
        top_k: Keep only the k most likely tokens (None = use all)
        top_p: Keep the smallest set of tokens whose cumulative
               probability reaches top_p (None = use all)
    
    Returns:
        token_idx: Sampled token index
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    
    # Apply temperature
    logits = logits / temperature
    probabilities = torch.softmax(logits, dim=-1)
    
    # Apply top-k
    if top_k is not None:
        # Zero out everything except the k most likely tokens
        k = min(top_k, probabilities.size(-1))  # guard against k > vocab_size
        topk_probs, topk_indices = torch.topk(probabilities, k)
        probabilities_masked = torch.zeros_like(probabilities)
        probabilities_masked[topk_indices] = topk_probs
        probabilities = probabilities_masked / probabilities_masked.sum()
    
    # Apply nucleus (top-p)
    if top_p is not None:
        sorted_probs, sorted_indices = torch.sort(probabilities, descending=True)
        cumsum_probs = torch.cumsum(sorted_probs, dim=-1)
        # Drop a token only if the tokens ranked before it already reach top_p,
        # so the token that crosses the threshold stays in the nucleus
        sorted_indices_to_remove = (cumsum_probs - sorted_probs) >= top_p
        sorted_indices_to_remove[0] = False  # Keep at least one token
        sorted_probs[sorted_indices_to_remove] = 0
        # Scatter back to the original token order and renormalize
        probabilities = torch.zeros_like(probabilities)
        probabilities[sorted_indices] = sorted_probs
        probabilities = probabilities / probabilities.sum()
    
    # Sample
    token_idx = torch.multinomial(probabilities, num_samples=1)
    return token_idx.item()


print("✓ Sampling functions defined")

### Generate Text

In [None]:
def generate_text(model, start_text, max_length, char_to_idx, idx_to_char, 
                  device, temperature=1.0, top_k=None, top_p=None):
    """
    Generate text starting from start_text.
    
    Args:
        model: Language model
        start_text: Starting text (string)
        max_length: Number of characters to generate after start_text
        char_to_idx, idx_to_char: Character mappings
        device: Device to run on
        temperature: Sampling temperature
        top_k: Top-k sampling parameter
        top_p: Nucleus sampling parameter
    
    Returns:
        generated_text: Generated text string
    """
    model.eval()
    
    # Initialize with start text
    current_text = start_text
    context = torch.tensor([char_to_idx[c] for c in start_text], dtype=torch.long).to(device)
    
    with torch.no_grad():
        for _ in range(max_length):
            # Keep only the last seq_len tokens as context
            # (seq_len is the global defined in Part 2)
            context_trimmed = context[-seq_len:].unsqueeze(0)  # Add batch dimension
            
            # Forward pass
            logits = model(context_trimmed)  # (1, ≤seq_len, vocab_size)
            
            # Get logits for next token (last position)
            next_logits = logits[0, -1, :]  # (vocab_size,)
            
            # Sample next token
            next_idx = sample_next_token(next_logits, temperature=temperature, 
                                        top_k=top_k, top_p=top_p)
            
            # Append to context and generated text
            context = torch.cat([context, torch.tensor([next_idx]).to(device)])
            current_text += idx_to_char[next_idx]
    
    return current_text


print("✓ Generation function defined")

### Generate with Different Strategies

In [None]:
# Generate with different temperature values
start_text = "The"
max_gen_length = 200

temperatures = [0.5, 1.0, 1.5]

print("Text Generation with Different Temperatures")
print("=" * 80)

for temp in temperatures:
    generated = generate_text(model, start_text, max_gen_length, 
                             char_to_idx, idx_to_char, device,
                             temperature=temp)
    print(f"\nTemperature = {temp} (Lower = more focused):")
    print("-" * 80)
    print(generated[:300])  # Print first 300 chars
    print()

In [None]:
# Generate with top-k sampling
print("\nText Generation with Top-K Sampling")
print("=" * 80)

for k in [5, 10, 20]:
    generated = generate_text(model, start_text, max_gen_length,
                             char_to_idx, idx_to_char, device,
                             temperature=1.0, top_k=k)
    print(f"\nTop-K = {k}:")
    print("-" * 80)
    print(generated[:300])
    print()

In [None]:
# Generate with nucleus (top-p) sampling
print("\nText Generation with Nucleus (Top-P) Sampling")
print("=" * 80)

for p in [0.5, 0.8, 0.95]:
    generated = generate_text(model, start_text, max_gen_length,
                             char_to_idx, idx_to_char, device,
                             temperature=1.0, top_p=p)
    print(f"\nTop-P = {p}:")
    print("-" * 80)
    print(generated[:300])
    print()

## Part 6: Analysis - Sampling Strategy Effects

### Compare Diversity and Quality

In [None]:
def analyze_generation_statistics(text, vocab_size):
    """
    Analyze diversity and repetition in generated text.
    """
    unique_chars = len(set(text))
    char_counts = Counter(text)
    entropy = -sum((count/len(text)) * np.log(count/len(text)) 
                   for count in char_counts.values())
    
    # Repetition: ratio of unique bigrams to total bigrams
    bigrams = [text[i:i+2] for i in range(len(text)-1)]
    unique_bigrams = len(set(bigrams))
    bigram_diversity = unique_bigrams / len(bigrams) if bigrams else 0
    
    return {
        'unique_chars': unique_chars,
        'entropy': entropy,
        'bigram_diversity': bigram_diversity,
    }


# Compare strategies
strategies = [
    ('Temperature=0.5', {'temperature': 0.5}),
    ('Temperature=1.0', {'temperature': 1.0}),
    ('Temperature=1.5', {'temperature': 1.5}),
    ('Top-K=10', {'top_k': 10}),
    ('Top-P=0.9', {'top_p': 0.9}),
]

results = []

for name, kwargs in strategies:
    # Generate multiple samples
    samples = [generate_text(model, start_text, 500, 
                            char_to_idx, idx_to_char, device, **kwargs)
              for _ in range(3)]
    combined = ''.join(samples)
    stats = analyze_generation_statistics(combined, vocab_size)
    results.append({'Strategy': name, **stats})

import pandas as pd
results_df = pd.DataFrame(results)
print("\nGeneration Statistics Comparison:")
print(results_df.to_string(index=False))
print("\nInterpretation:")
print("- Entropy: Higher = character frequencies closer to uniform (more varied, not necessarily more readable)")
print("- Bigram diversity: Higher = less repetition")
print("- Temperature controls entropy-quality tradeoff")

## Part 7: Beam Search (Advanced Generation)

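Instead of committing to one greedy choice or sampling at random, beam search keeps several candidate sequences (hypotheses) at once:

1. Keep the `beam_width` most likely sequences so far
2. At each step, expand every kept sequence with its most likely next tokens
3. Score each candidate by its cumulative log-probability and keep the top `beam_width`
4. After the last step, return the highest-scoring sequence

**Trade-off:** Roughly `beam_width` times the computation of greedy decoding, in exchange for sequences with higher model probability. Higher probability does not always mean more interesting text: beam search is deterministic and its output can be repetitive.

Common in machine translation, but less common for open-ended generation such as character-level text.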
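
A toy example shows why keeping more than one hypothesis can help. In the sketch below, `next_probs` is a hypothetical model whose probabilities depend only on the previous character (numbers invented for illustration), and `beam_search_toy` is a helper name introduced here. For simplicity it expands every possible next character, whereas a practical implementation would usually expand only the top few per beam.

```python
import math

# Hypothetical model: P(next char | previous char). "" marks the start.
next_probs = {
    "": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.9, "b": 0.1},
}

def beam_search_toy(steps, beam_width):
    """Return the best (text, log_prob) found with the given beam width."""
    beams = [("", 0.0)]  # (text, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for text, score in beams:
            for ch, p in next_probs[text[-1:]].items():
                candidates.append((text + ch, score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_text, greedy_score = beam_search_toy(steps=2, beam_width=1)
beam_text, beam_score = beam_search_toy(steps=2, beam_width=2)
print(f"beam_width=1: {greedy_text!r}, probability {math.exp(greedy_score):.2f}")
print(f"beam_width=2: {beam_text!r}, probability {math.exp(beam_score):.2f}")
```

With a beam width of 1 the search is greedy: it commits to "a" first and never sees that starting with the less likely "b" leads to a more probable two-character sequence.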

In [None]:
def beam_search(model, start_text, max_length, char_to_idx, idx_to_char,
               device, beam_width=5):
    """
    Generate text using beam search.
    
    Args:
        beam_width: Number of hypotheses to keep
    
    Returns:
        best_text: Highest probability sequence found
    """
    model.eval()
    
    # Initialize beams: list of (text, context, log_prob)
    beams = [(start_text, torch.tensor([char_to_idx[c] for c in start_text], 
                                       dtype=torch.long).to(device), 0.0)]
    
    with torch.no_grad():
        for _ in range(max_length):
            new_beams = []
            
            for text, context, log_prob in beams:
                # Get next token distribution
                context_trimmed = context[-seq_len:].unsqueeze(0)
                logits = model(context_trimmed)
                next_logits = logits[0, -1, :]
                log_probs = torch.log_softmax(next_logits, dim=-1)
                
                # Get top-k next tokens
                topk_log_probs, topk_indices = torch.topk(log_probs, beam_width)
                
                for k in range(beam_width):
                    next_idx = topk_indices[k].item()
                    next_log_prob = topk_log_probs[k].item()
                    new_context = torch.cat([context, torch.tensor([next_idx]).to(device)])
                    new_text = text + idx_to_char[next_idx]
                    new_log_prob = log_prob + next_log_prob
                    new_beams.append((new_text, new_context, new_log_prob))
            
            # Keep top beam_width sequences
            new_beams.sort(key=lambda x: x[2], reverse=True)
            beams = new_beams[:beam_width]
    
    return beams[0][0]


# Test beam search
print("Beam Search Results:")
print("=" * 80)
for beam_width in [1, 5, 10]:
    text = beam_search(model, start_text, 200, char_to_idx, idx_to_char, 
                      device, beam_width=beam_width)
    print(f"\nBeam Width = {beam_width}:")
    print("-" * 80)
    print(text[:300])
    print()

## Part 8: Practical Considerations

### Sequence Length Effects

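- **Too short (e.g., 4):** The model only ever sees 4 characters of context, too few to span most words
- **Moderate (e.g., 64):** Spans several words at a reasonable cost; the value used in this notebook
- **Too long (e.g., 512+):** Slower training and more memory, with diminishing returns for an LSTM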

### Character-Level Strengths
✅ Handles any text (code, special characters, Unicode)  
✅ No OOV problem  
✅ Can learn orthography  

### Character-Level Weaknesses
❌ Long sequences (more computation)  
❌ Harder to learn semantic relationships  
❌ Slower generation (character by character)  

### Modern Alternative: Subword Tokenization

Best of both worlds:
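- Vocabulary: a fixed size chosen in advance, typically tens of thousands of tokens
- Rare or unseen words are split into smaller known pieces, so they can still be encoded
- Shorter sequences than character-level modeling
- Used by most modern LLMs (algorithms such as BPE and WordPiece, often via the SentencePiece library)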
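
To give a feel for how a subword vocabulary is built, here is a minimal sketch of a single BPE-style merge step: find the most frequent adjacent pair of symbols and fuse it into one new symbol. This is a simplified illustration, not a full tokenizer; the corpus is made up, the helper names are introduced here, and real implementations repeat the merge many times and weight pairs by word frequency.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the top one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = ["low", "lower", "slow", "lot"]  # made-up example corpus
words = [list(w) for w in corpus]         # start from single characters
pair = most_frequent_pair(words)
words = merge_pair(words, pair)
print(pair)
print(words)
```

Each merge adds one token to the vocabulary and shortens the sequences that contain it, which is how subword methods move from character-level toward word-level granularity.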

## Part 9: Key Takeaways

### What You Learned

✅ **Language modeling:** Predicting next token given context  
✅ **Character-level models:** When and why to use them  
✅ **RNN architecture:** Embedding + LSTM + Linear layer  
✅ **Perplexity metric:** Measuring model uncertainty  
✅ **Sampling strategies:** Temperature, top-k, nucleus sampling  
✅ **Beam search:** Finding better sequences  

### Important Concepts

**Temperature:**
- Controls diversity vs. quality
- T=0.5: Focused, repetitive
- T=1.0: Balanced
- T=1.5: Diverse, random

**Perplexity:**
- Measures "confusion" of model
- Lower is better
- Uniform distribution over n tokens: perplexity = n

**Trade-offs:**
- Greedy: Fast, deterministic, boring
- Sampling: Diverse, stochastic
- Beam search: Higher quality, slower

### Real-World Applications

- **Text generation:** Creative writing, dialogue, code
- **Code completion:** GitHub Copilot launched on OpenAI Codex, a GPT-family model fine-tuned on code
- **Machine translation:** Transformers with beam search
- **Question answering:** RAG systems
- **Chat:** Assistants such as ChatGPT and Claude also generate text autoregressively, one token at a time

### Coming Up (Day 20)

**Project 2:** Implement full character-level RNN text generation project with your own dataset!

## Part 10: Exercises

### Basic
1. **Experiment with context length:** Try seq_len = 32, 128, 256. How does it affect quality?
2. **Different datasets:** Train on different text (Shakespeare, Wikipedia, code). How do results change?
3. **Analyze embeddings:** Which characters does the model place close together in embedding space (e.g., vowels, digits, punctuation)?

### Intermediate
1. **Implement word-level model:** Change to word-level tokenization
2. **Conditional generation:** Generate in specific style (e.g., "Romeo and Juliet" style)
3. **Length control:** How to generate exactly N characters?

### Advanced
1. **Implement attention-based RNN:** Add attention to see what model focuses on
2. **Mixture of experts:** Have multiple models generate, blend predictions
3. **Fast sampling:** Implement sampling with top-p using bisection search
4. **Evaluate text quality:** Use BLEU, ROUGE, or BERTScore on generated text