# Module 16: GPT & Decoder Models

**Generative Pre-trained Transformers**

---

## 1. Objectives

- ✅ Understand GPT architecture
- ✅ Know causal language modeling
- ✅ Implement generation strategies
- ✅ Use HuggingFace GPT-2

## 2. Prerequisites

- [Module 15: BERT](../15_bert/15_bert.ipynb)

## 3. GPT vs BERT

| Aspect | BERT | GPT |
|--------|------|-----|
| Architecture | Encoder | Decoder |
| Direction | Bidirectional | Left-to-right (causal) |
| Pretraining | MLM + NSP | Causal LM |
| Best for | Understanding | Generation |

### GPT Architecture
```
  Input:  "The cat sat"
           ↓   ↓   ↓
    ┌──────────────────┐
    │ Masked Self-Attn │  ← Can only attend left
    │      + FFN       │
    │   (N layers)     │
    └──────────────────┘
           ↓   ↓   ↓
  Output: Predict next token at each position
```
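As a rough illustration of the diagram above, here is a minimal sketch of a single decoder block in PyTorch. The class name `DecoderBlock`, the sizes `d_model=64` and `n_heads=4`, and the pre-LayerNorm layout are assumptions chosen for illustration, not GPT-2's actual implementation; embeddings, dropout, and the output head are omitted.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """One pre-LayerNorm decoder block: masked self-attention followed by an FFN."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        seq_len = x.size(1)
        # True marks positions that must NOT be attended to (everything to the right).
        blocked = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=blocked, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around the feed-forward network
        return x


torch.manual_seed(0)
block = DecoderBlock().eval()
x = torch.randn(1, 3, 64)        # (batch, tokens, d_model), e.g. "The cat sat"
x_changed = x.clone()
x_changed[:, -1] += 1.0          # perturb only the last token

with torch.no_grad():
    out, out_changed = block(x), block(x_changed)

# Earlier positions should be unaffected by a change to a later token.
causal_ok = torch.allclose(out[:, :2], out_changed[:, :2], atol=1e-5)
print(out.shape, causal_ok)
```

Perturbing only the last token leaves the outputs at earlier positions unchanged. This is the property that lets the model be trained to predict the next token at every position in parallel.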

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

## 4. Causal Language Modeling

### Training Objective
Predict the next token given all previous tokens by minimizing the negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1})$$
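The objective above can be sketched in a few lines of PyTorch. The function name `causal_lm_loss` and the toy vocabulary size are hypothetical, chosen only for illustration. The first token has no preceding context, so in this sketch the sum effectively starts at the second token. HuggingFace's `GPT2LMHeadModel` performs the same one-position shift internally when `labels` is passed, and by default reports a mean rather than a sum.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids):
    """Summed next-token negative log-likelihood.

    logits:    (batch, seq_len, vocab) scores; position t scores the token at t + 1
    input_ids: (batch, seq_len) token ids
    """
    # Drop the last position's logits (nothing follows it) and the first token
    # (nothing precedes it), so shifted_logits[:, t] lines up with input_ids[:, t + 1].
    shifted_logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        reduction="sum",
    )


torch.manual_seed(0)
vocab_size = 10                                    # toy vocabulary, illustrative only
input_ids = torch.randint(0, vocab_size, (1, 5))
logits = torch.zeros(1, 5, vocab_size)             # uniform prediction at every position

loss = causal_lm_loss(logits, input_ids)
# With uniform predictions, each of the 4 targets costs log(vocab_size).
expected = 4 * torch.log(torch.tensor(float(vocab_size)))
print(loss.item(), expected.item())
```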

### Causal Mask
```
Position can attend to:
     1  2  3  4
1  [ 1  0  0  0 ]  ← Position 1 sees only itself
2  [ 1  1  0  0 ]  ← Position 2 sees 1,2
3  [ 1  1  1  0 ]  ← Position 3 sees 1,2,3
4  [ 1  1  1  1 ]  ← Position 4 sees all
```
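The mask in the diagram above is a lower-triangular matrix. A minimal sketch of building it and applying it to attention scores follows; the random `scores` tensor is a stand-in for the real query-key products and is only an assumption for illustration.

```python
import torch

seq_len = 4
# Lower-triangular matrix: row i has ones in columns 0..i (the diagram, 0-indexed).
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)                     # stand-in for Q @ K^T / sqrt(d_k)
masked_scores = scores.masked_fill(~mask, float("-inf"))   # block future positions
weights = torch.softmax(masked_scores, dim=-1)             # rows sum to 1 over allowed positions

print(mask.int())
print(weights)
```

Setting blocked scores to negative infinity before the softmax gives those positions a weight of exactly zero, so each row of `weights` is a distribution over the current and earlier positions only.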

## 5. Using GPT-2 with HuggingFace

In [None]:
# Load GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# Set pad token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

print(f"GPT-2 Parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")
print(f"Vocabulary size: {tokenizer.vocab_size}")

In [None]:
# Simple generation
prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,          # input_ids plus attention_mask
        max_length=50,     # total length in tokens, prompt included
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)

## 6. Generation Strategies

In [None]:
def generate_with_strategy(prompt, strategy='greedy', **kwargs):
    """Generate text with different strategies."""
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

    gen_kwargs = {
        'max_length': 50,  # total length in tokens, prompt included
        'pad_token_id': tokenizer.eos_token_id
    }

    if strategy == 'greedy':
        gen_kwargs['do_sample'] = False
    elif strategy == 'beam':
        gen_kwargs['num_beams'] = kwargs.get('num_beams', 5)
        gen_kwargs['do_sample'] = False
    elif strategy == 'sample':
        gen_kwargs['do_sample'] = True
        gen_kwargs['top_k'] = 0  # turn off generate()'s default top_k=50 for pure sampling
        gen_kwargs['temperature'] = kwargs.get('temperature', 1.0)
    elif strategy == 'top_k':
        gen_kwargs['do_sample'] = True
        gen_kwargs['top_k'] = kwargs.get('top_k', 50)
    elif strategy == 'top_p':
        gen_kwargs['do_sample'] = True
        gen_kwargs['top_k'] = 0  # turn off default top-k so only the nucleus filter applies
        gen_kwargs['top_p'] = kwargs.get('top_p', 0.95)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    with torch.no_grad():
        # Pass input_ids and attention_mask together
        outputs = model.generate(**inputs, **gen_kwargs)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Compare strategies
prompt = "Once upon a time"
print("=" * 50)
for strategy in ['greedy', 'beam', 'top_k', 'top_p']:
    print(f"\n{strategy.upper()}:")
    print(generate_with_strategy(prompt, strategy))
    print("-" * 50)

## 7. Generation Strategies Explained

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **Greedy** | Pick the highest-probability token at each step | Deterministic output; tends to be repetitive |
| **Beam Search** | Keep the `num_beams` highest-scoring partial sequences | Higher-likelihood output, but less diverse |
| **Sampling** | Sample from the full distribution | Creative, may be incoherent |
| **Top-k** | Sample from the k most probable tokens | Balance quality/diversity |
| **Top-p (nucleus)** | Sample from the smallest token set with cumulative prob ≥ p | Common default for open-ended generation |
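To make the top-k and top-p rows concrete, here is a minimal sketch of both filters applied to a single vector of next-token logits. The helper names `filter_logits` and `sample_next_token` are hypothetical, and this is not how HuggingFace implements its logits processors; it only illustrates the idea on a toy five-token distribution.

```python
import torch


def filter_logits(logits, top_k=0, top_p=1.0):
    """Return a copy of 1-D `logits` with excluded tokens set to -inf."""
    logits = logits.clone()
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[-1]
        logits[logits < kth_value] = float("-inf")   # keep only the k best tokens
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cum_probs = sorted_probs.cumsum(dim=-1)
        # Drop a token if the tokens ranked before it already reach top_p.
        remove = (cum_probs - sorted_probs) >= top_p
        logits[sorted_idx[remove]] = float("-inf")
    return logits


def sample_next_token(logits, top_k=0, top_p=1.0):
    """Filter the logits, renormalize, then draw one token id."""
    probs = torch.softmax(filter_logits(logits, top_k=top_k, top_p=top_p), dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


# Toy distribution over 5 tokens with probabilities 0.5, 0.3, 0.1, 0.05, 0.05.
logits = torch.log(torch.tensor([0.5, 0.3, 0.1, 0.05, 0.05]))

kept_top_k = torch.isfinite(filter_logits(logits, top_k=2)).sum().item()
kept_top_p = torch.isfinite(filter_logits(logits, top_p=0.85)).sum().item()
print(kept_top_k, kept_top_p)
```

With `top_k=2` the filter always keeps two tokens, whatever the shape of the distribution. With `top_p=0.85` it keeps three here, because the two most probable tokens cover only 0.8 of the probability mass; on a more peaked distribution it would keep fewer.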

In [None]:
# Temperature effect
prompt = "The future of AI is"

print("Temperature effect:")
for temp in [0.3, 0.7, 1.0, 1.5]:
    result = generate_with_strategy(prompt, 'sample', temperature=temp)
    print(f"\nTemp={temp}: {result}")

## 8. GPT Model Family

| Model | Params | Context | Year |
|-------|--------|---------|------|
| GPT-1 | 117M | 512 | 2018 |
| GPT-2 | 1.5B | 1024 | 2019 |
| GPT-3 | 175B | 2048 | 2020 |
| GPT-4 | Undisclosed | 8K to 32K at launch; 128K for GPT-4 Turbo | 2023 |

### Available on HuggingFace
```python
'gpt2'        # 124M
'gpt2-medium' # 355M
'gpt2-large'  # 774M
'gpt2-xl'     # 1.5B
```

## 9. 🔥 Real-World Usage

### When to Use GPT-style Models

| Task | Use |
|------|-----|
| Text generation | ✅ GPT |
| Chatbots | ✅ GPT (instruction-tuned) |
| Code generation | ✅ CodeGPT, Codex |
| Classification | ❌ Use BERT |

### 2024 Practice
- **API**: Use GPT-4 API for best results
- **Local**: LLaMA-2, Mistral, Phi-2
- **Fine-tuning**: LoRA for efficiency

## 10. Interview Questions

**Q1: What is the difference between GPT and BERT?**
<details><summary>Answer</summary>

- GPT: Decoder-only, left-to-right, trained with causal LM, for generation
- BERT: Encoder-only, bidirectional, trained with MLM (plus NSP), for understanding
</details>

**Q2: What is top-p (nucleus) sampling?**
<details><summary>Answer</summary>

Sample from the smallest set of tokens whose cumulative probability is ≥ p. This adapts to the token distribution: it keeps few tokens when the model is confident and more when it is uncertain.
</details>

**Q3: Why is temperature used?**
<details><summary>Answer</summary>

Temperature divides the logits before the softmax. Low temperature → sharper distribution (more confident), high temperature → flatter distribution (more random). It controls the tradeoff between creativity and consistency.
</details>
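
The temperature effect described in Q3 can be checked numerically. The sketch below uses a made-up three-token logit vector (an assumption for illustration) and prints the resulting distribution at several temperatures.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])   # toy scores for three tokens

top_prob = {}
for temperature in [0.3, 0.7, 1.0, 1.5]:
    # Dividing by the temperature rescales the gaps between logits before softmax.
    probs = torch.softmax(logits / temperature, dim=-1)
    top_prob[temperature] = probs[0].item()
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")
```

The probability of the highest-scoring token shrinks as the temperature rises, so sampling becomes less predictable.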

## 11. Summary

- **GPT**: Decoder-only Transformer for generation
- **Causal LM**: Predict next token, can only see past
- **Sampling**: Temperature, top-k, top-p (nucleus)
- **In practice**: Use APIs or fine-tune smaller models

## 12. References

- [GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [GPT-3 Paper](https://arxiv.org/abs/2005.14165)
- [HuggingFace Generation](https://huggingface.co/docs/transformers/generation_strategies)

---
**Next:** [Module 17: HuggingFace Ecosystem](../17_huggingface/17_huggingface.ipynb)