# Code Generation Model: Interactive Tutorial

This notebook walks you through building and using a code generation model, from **foundational concepts to advanced usage**.

**Learning Path**:
1. 🔰 Foundations: Understanding tokenization
2. 🏗️ Architecture: Building the transformer
3. 🎓 Training: Two-stage training process
4. 🚀 Generation: Creating code from prompts
5. 🎯 Advanced: Fine-tuning and optimization

**Duration**: 60-90 minutes

## Setup

First, let's set up our environment and import necessary libraries.

In [None]:
# Add src to path
import sys
from pathlib import Path

# Set project root (presentation folder is inside the project)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Imports from src modules
import torch
import torch.nn.functional as F
from src.tokenizer import BPETokenizer, Vocabulary
from src.model import CodeTransformer, CoderConfig
from src.training import CodeTrainer, create_dataloaders

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

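print("✓ Imports successful")
print(f"✓ PyTorch version: {torch.__version__}")
# Report the best available accelerator: CUDA first, then Apple MPS, then CPU
device_name = 'CUDA' if torch.cuda.is_available() else 'MPS' if torch.backends.mps.is_available() else 'CPU'
print(f"✓ Device: {device_name}")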

---

# Part 1: Foundations 🔰

## 1.1 Understanding Tokenization

**Question**: How do we convert text into numbers that a neural network can process?

**Answer**: Tokenization!
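
As a minimal sketch of the idea, the simplest possible tokenizer assigns one integer id to each distinct character. The `encode` and `decode` helpers below are hypothetical names for illustration only and are not the `BPETokenizer` API used later in this notebook.

```python
# Minimal character-level tokenizer: every distinct character gets an integer id.
text = "echo $i"

# Sort the characters so the id assignment is deterministic.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(text)))}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s):
    """Map each character to its integer id."""
    return [char_to_id[ch] for ch in s]

def decode(ids):
    """Map ids back to characters and join them."""
    return "".join(id_to_char[i] for i in ids)

ids = encode(text)
print(ids)                  # one id per character
print(decode(ids) == text)  # the round trip should be lossless
```

Any real tokenizer has to satisfy the same round-trip property: decoding the encoded ids returns the original text.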

![Tokenization Process](../docs/diagrams/tokenization-process.svg)

Let's explore different tokenization strategies.

In [None]:
# Example text
sample_text = "#!/bin/bash\nfor i in {1..10}; do\n    echo $i\ndone"

print("Original bash script:")
print(sample_text)
print("\n" + "="*50)

# Character-level tokenization
char_tokens = list(sample_text)
print(f"\nCharacter tokens ({len(char_tokens)} tokens):")
print(char_tokens[:20], "...")

# Word-level tokenization (naive)
word_tokens = sample_text.split()
print(f"\nWord tokens ({len(word_tokens)} tokens):")
print(word_tokens)

print("\n📊 Comparison:")
print(f"Characters: {len(char_tokens)} tokens, vocab size ~256")
print(f"Words: {len(word_tokens)} tokens, vocab size ~50,000+")
print("BPE: ~30 tokens (rough estimate, a middle ground), vocab size ~8,000")

### Why BPE (Byte Pair Encoding)?

BPE finds the **sweet spot** between character and word tokenization:
- Common words/commands → single token
- Rare words → split into subwords
- Few out-of-vocabulary issues: unseen words fall back to smaller known pieces
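
The merge procedure behind BPE can be sketched in a few lines: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. This is an illustrative toy on a made-up three-word corpus, not the `BPETokenizer` implementation in `src`. The helper names `most_frequent_pair` and `merge_pair` are assumptions.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from single characters and apply three merges (toy corpus, hypothetical).
corpus = ["echo", "echo", "each"]
sequences = [list(word) for word in corpus]
for _ in range(3):
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences)
```

After three merges the frequent word `echo` has collapsed into a single symbol, while the rarer `each` is still split into smaller pieces.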

Let's see BPE in action:

In [None]:
# Create and train a simple BPE tokenizer
tokenizer = BPETokenizer()

# Sample training data
training_texts = [
    "#!/bin/bash",
    "for i in {1..10}; do",
    "echo 'Hello World'",
    "if [ -f file.txt ]; then",
    "grep -r 'pattern' /path",
]

# Train tokenizer
tokenizer.target_vocab_size = 500
tokenizer.train(training_texts, verbose=False)

print(f"✓ Tokenizer trained with vocab size: {len(tokenizer.vocab)}")
print(f"\nExample tokenization:")

test_text = "#!/bin/bash\necho 'test'"
tokens = tokenizer.encode(test_text)
decoded = tokenizer.decode(tokens)

print(f"Original: {repr(test_text)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {repr(decoded)}")
print(f"Match: {test_text == decoded}")

### Vocabulary Analysis

Let's analyze what the tokenizer learned:

In [None]:
# Show some learned tokens
print("Sample vocabulary (first 30 tokens):")
for i, token in enumerate(list(tokenizer.vocab.token_to_id.keys())[:30]):
    token_id = tokenizer.vocab.token_to_id[token]
    print(f"{token_id:3d}: {repr(token):20s}", end="  ")
    if (i + 1) % 3 == 0:
        print()

# Visualize the distribution of token lengths
print("\n\n📊 Vocabulary distribution:")
vocab_sizes = [len(token) for token in tokenizer.vocab.token_to_id.keys()]
plt.hist(vocab_sizes, bins=20, edgecolor='black')
plt.xlabel('Token Length (characters)')
plt.ylabel('Number of Tokens')
plt.title('Distribution of Token Lengths')
plt.show()

---

# Part 2: Architecture 🏗️

## 2.1 The Transformer Model

Our model is a **GPT-style transformer** with:
- 6 layers
- 384 hidden dimensions
- 6 attention heads
- 48.7M parameters (the count printed below will likely be smaller, because the demo tokenizer's vocabulary is tiny)
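
To see where a parameter count comes from, here is a rough back-of-the-envelope estimate. It assumes a standard GPT-style block (four `d_model × d_model` attention projections and a two-matrix feed-forward layer), learned position embeddings, and tied input/output embeddings, and it ignores biases and LayerNorm. The actual `CodeTransformer` may differ on any of these points, and `estimate_params` is a hypothetical helper, not part of `src`.

```python
def estimate_params(n_layers, d_model, d_ff, vocab_size, max_seq_len, tied_embeddings=True):
    """Rough parameter count for a GPT-style decoder (weights only, no biases or LayerNorm)."""
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff           # up- and down-projection
    blocks = n_layers * (attention + feed_forward)
    token_embedding = vocab_size * d_model
    position_embedding = max_seq_len * d_model  # assumes learned positions
    output_head = 0 if tied_embeddings else vocab_size * d_model
    return blocks + token_embedding + position_embedding + output_head

# Assumed vocabulary of 8,000 tokens and a 512-token context.
total = estimate_params(n_layers=6, d_model=384, d_ff=1536, vocab_size=8000, max_seq_len=512)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate is well below the 48.7M quoted above, which suggests the full model uses a larger vocabulary or has components this sketch leaves out. The cell below prints the exact count for whatever configuration is built.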

![Transformer Architecture](../docs/diagrams/transformer-architecture.svg)

Let's build it step by step!

In [None]:
# Create model configuration
config = CoderConfig(
    vocab_size=len(tokenizer.vocab),
    n_layers=6,
    d_model=384,
    n_heads=6,
    d_ff=1536,
    max_seq_len=512,
)

print("Model Configuration:")
print(f"  Vocabulary size: {config.vocab_size:,}")
print(f"  Layers: {config.n_layers}")
print(f"  Hidden size: {config.d_model}")
print(f"  Attention heads: {config.n_heads}")
print(f"  Feed-forward size: {config.d_ff}")
print(f"  Max sequence length: {config.max_seq_len}")

# Create model
model = CodeTransformer(config)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n📊 Model Statistics:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Model size: {total_params * 4 / (1024**2):.1f} MB (FP32)")

### 2.2 Parameter Breakdown

Where are all those parameters?

In [None]:
# Analyze parameter distribution
param_counts = {}
for name, param in model.named_parameters():
    component = name.split('.')[0]
    if component not in param_counts:
        param_counts[component] = 0
    param_counts[component] += param.numel()

# Visualize
components = list(param_counts.keys())
counts = list(param_counts.values())
percentages = [c / total_params * 100 for c in counts]

plt.figure(figsize=(10, 6))
plt.bar(components, percentages, color=['skyblue', 'lightcoral', 'lightgreen', 'gold'])
plt.xlabel('Component')
plt.ylabel('Percentage of Parameters')
plt.title('Parameter Distribution in Transformer')
plt.xticks(rotation=45, ha='right')

# Add value labels
for i, (comp, pct, count) in enumerate(zip(components, percentages, counts)):
    plt.text(i, pct + 1, f'{pct:.1f}%\n({count/1e6:.1f}M)', ha='center')

plt.tight_layout()
plt.show()

print("\n📊 Parameter Breakdown:")
for comp, count, pct in zip(components, counts, percentages):
    print(f"  {comp:20s}: {count/1e6:6.2f}M ({pct:5.1f}%)")

### 2.3 Understanding Self-Attention

The **key innovation** of transformers is self-attention. Let's visualize how it works!

![Attention Mechanism](../docs/diagrams/attention-mechanism.svg)
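
As a minimal sketch of the computation, the function below implements causal scaled dot-product attention in plain Python: each position scores itself and earlier positions, applies softmax, and takes a weighted sum of value vectors. It is a simplification under stated assumptions: a single head, no learned Q/K/V projections (the same toy vectors serve as queries, keys and values), and made-up numbers. The function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t only sees positions <= t."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Scores against the current and earlier positions only (causal mask).
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w[s] * values[s][j] for s in range(t + 1))
               for j in range(len(values[0]))]
        weights.append(w + [0.0] * (len(queries) - t - 1))  # masked slots get weight 0
        outputs.append(out)
    return outputs, weights

# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row sums to 1 and is zero to the right of the diagonal, which is the lower-triangular pattern the heatmap in the next cell displays.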

In [None]:
# Simple attention visualization
def visualize_attention(text, tokenizer):
    """Visualize attention pattern for a simple example."""
    tokens_ids = tokenizer.encode(text)
    tokens_text = [tokenizer.vocab.id_to_token.get(tid, '<UNK>') for tid in tokens_ids]
    
    # Create a random lower-triangular (causal) matrix; these weights are
    # illustrative and are not read from the model
    seq_len = len(tokens_ids)
    attention = np.tril(np.random.rand(seq_len, seq_len))
    
    # Normalize rows
    attention = attention / attention.sum(axis=1, keepdims=True)
    
    # Plot
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention, 
                xticklabels=tokens_text,
                yticklabels=tokens_text,
                cmap='YlOrRd',
                annot=True,
                fmt='.2f',
                cbar_kws={'label': 'Attention Weight'})
    plt.xlabel('Key (what to attend to)')
    plt.ylabel('Query (current token)')
    plt.title('Causal Self-Attention Pattern')
    plt.tight_layout()
    plt.show()
    
    return attention

# Example
example_text = "for i in range"
print(f"Visualizing attention for: '{example_text}'\n")
attention_matrix = visualize_attention(example_text, tokenizer)

print("\n💡 Interpretation:")
print("  - Diagonal: Each token attends to itself")
print("  - Lower triangle: Can only attend to previous tokens (causal)")
print("  - Brighter = stronger attention")

---

# Part 3: Training 🎓

## 3.1 Two-Stage Training Process

Many code models use **two-stage training**:

![Two-Stage Training](../docs/diagrams/two-stage-training.svg)

```
Stage 1: Language Pretraining
  Data: Natural language (TinyStories)
  Goal: Learn grammar, vocabulary, reasoning
  Duration: 2-4 hours

Stage 2: Code Fine-Tuning  
  Data: Code (100+ bash scripts)
  Goal: Learn code syntax and patterns
  Duration: 30-60 minutes
```

Let's simulate a mini training run!

In [None]:
# Mini dataset for demonstration
demo_texts = [
    "The cat sat on the mat.",
    "A dog ran in the park.",
    "The sun shines brightly.",
    "Birds fly in the sky.",
    "Children play at school."
]

# Tokenize
demo_tokens = [tokenizer.encode(text) for text in demo_texts]

print("Demo Training Data:")
for text, tokens in zip(demo_texts, demo_tokens):
    print(f"  '{text}' → {len(tokens)} tokens")

# Create tiny model for quick demo
tiny_config = CoderConfig(
    vocab_size=len(tokenizer.vocab),
    n_layers=2,  # Fewer layers
    d_model=128,  # Smaller
    n_heads=4,
    d_ff=512,
    max_seq_len=128,
)

tiny_model = CodeTransformer(tiny_config)
device = torch.device('cpu')  # Use CPU for demo
tiny_model = tiny_model.to(device)

print(f"\nTiny model: {sum(p.numel() for p in tiny_model.parameters()):,} parameters")

### 3.2 Training Loop (Simplified)

Let's run a few training steps to see the loss decrease!

![Training Loop](../docs/diagrams/training-loop.svg)
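
The loss in a language-model training loop is usually next-token cross-entropy: the input is the sequence without its last token, and the target is the same sequence shifted left by one. The sketch below shows that convention with made-up token ids and probabilities. Whether `CodeTransformer` performs the shift internally when given `targets` is not shown in this notebook, so treat this as an assumption about the common setup rather than a description of `src.model`.

```python
import math

def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same sequence shifted by one."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(probabilities, target_id):
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(probabilities[target_id])

tokens = [1, 2, 0, 3]  # made-up token ids from a 4-token vocabulary
inputs, targets = next_token_pairs(tokens)
print(inputs, targets)

# A model that spreads probability evenly over the vocabulary...
uniform = [0.25, 0.25, 0.25, 0.25]
# ...versus one that puts most of its mass on token 2.
confident = [0.05, 0.05, 0.85, 0.05]
print(cross_entropy(uniform, 2), cross_entropy(confident, 2))
```

The uniform guess costs `log(4)` per token, and the loss falls as the model moves probability onto the correct token. That is the quantity the training curve tracks.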

In [None]:
# Training setup
optimizer = torch.optim.AdamW(tiny_model.parameters(), lr=1e-3)
losses = []

# Prepare data
max_len = max(len(t) for t in demo_tokens)
padded_tokens = [t + [0] * (max_len - len(t)) for t in demo_tokens]
input_tensor = torch.tensor(padded_tokens, dtype=torch.long, device=device)

print("Training for 50 steps...\n")

# Training loop
tiny_model.train()
for step in range(50):
    # Forward pass
    logits, loss = tiny_model(input_tensor, targets=input_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Track loss
    losses.append(loss.item())
    
    if (step + 1) % 10 == 0:
        print(f"Step {step+1:2d}/50: loss = {loss.item():.4f}")

# Plot training curve
plt.figure(figsize=(10, 5))
plt.plot(losses, linewidth=2)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Training Progress (Demo)')
plt.grid(True, alpha=0.3)
plt.show()

print(f"\n✓ Loss went from {losses[0]:.4f} to {losses[-1]:.4f}")
if losses[-1] < losses[0]:
    print("  Model is learning!")

---

# Part 4: Generation 🚀

## 4.1 Loading a Trained Model

Now let's load a fully trained model and generate code!

![Generation Process](../docs/diagrams/generation-process.svg)

In [None]:
# Check if trained model exists
import os

# Model paths (relative to project root)
model_path = project_root / "models" / "code" / "code_model_final.pt"
tokenizer_path = project_root / "models" / "language" / "language_tokenizer.json"

if model_path.exists() and tokenizer_path.exists():
    print("✓ Found trained model!")
    print(f"  Model: {model_path}")
    print(f"  Tokenizer: {tokenizer_path}")
    print("\nLoading...")
    
    # Load tokenizer
    production_tokenizer = BPETokenizer()
    production_tokenizer.load(str(tokenizer_path))
    
    # Load model
    checkpoint = torch.load(model_path, map_location='cpu')
    production_config = checkpoint.get('config', CoderConfig(vocab_size=len(production_tokenizer.vocab)))
    production_model = CodeTransformer(production_config)
    production_model.load_state_dict(checkpoint['model_state_dict'])
    production_model.eval()
    
    print(f"\n✓ Model loaded ({sum(p.numel() for p in production_model.parameters()):,} parameters)")
    has_trained_model = True
else:
    print("⚠ No trained model found.")
    print(f"  Expected: {model_path}")
    print("\n  To train a model, run:")
    print("    python scripts/train_language.py")
    print("    python scripts/train_code.py")
    print("\nUsing untrained model for demonstration...")
    production_model = tiny_model
    production_tokenizer = tokenizer
    has_trained_model = False

### 4.2 Code Generation Function

In [None]:
def generate_code(prompt, max_length=200, temperature=0.8, top_k=50):
    """Generate code from a prompt."""
    print(f"Prompt: {repr(prompt)}")
    print("=" * 60)
    
    # Encode prompt
    input_ids = production_tokenizer.encode(prompt)
    input_tensor = torch.tensor([input_ids], dtype=torch.long)
    
    # Generate
    production_model.eval()
    generated = input_tensor.clone()
    
    with torch.no_grad():
        for i in range(max_length):
            # Get predictions
            logits, _ = production_model(generated)
            next_logits = logits[0, -1, :] / temperature
            
            # Top-k sampling (k is capped at the vocabulary size so topk cannot fail)
            if top_k > 0:
                k = min(top_k, next_logits.size(-1))
                indices_to_remove = next_logits < torch.topk(next_logits, k)[0][..., -1, None]
                next_logits[indices_to_remove] = float('-inf')
            
            # Sample
            probs = F.softmax(next_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Append
            generated = torch.cat([generated, next_token.unsqueeze(0)], dim=1)
            
            # Stop at reasonable length for bash scripts
            if i > 50 and next_token.item() == production_tokenizer.vocab.token_to_id.get('\n', -1):
                break
    
    # Decode
    output = production_tokenizer.decode(generated[0].tolist())
    print(output)
    print("=" * 60)
    
    return output
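
The top-k step inside `generate_code` can be looked at in isolation. The sketch below uses plain Python lists instead of tensors and made-up logits: it keeps the `k` largest scores, sets the rest to negative infinity so softmax assigns them zero probability, and then samples. The helper names are illustrative, and `k >= 1` is assumed.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    k = min(k, len(logits))  # guard against k larger than the vocabulary
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    """Softmax that tolerates -inf entries (they become probability 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 1.0, -1.0, 0.0]  # made-up scores for a 5-token vocabulary
probs = softmax(top_k_filter(logits, k=2))
print([round(p, 3) for p in probs])

# Sample one token id from the filtered distribution.
rng = random.Random(0)
next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(next_id)
```

Only the two highest-scoring tokens keep non-zero probability, so sampling can never pick a token from the long tail.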

### 4.3 Generate Your First Script!

Try different prompts and see what the model generates:

In [None]:
# Example 1: Backup script
prompt1 = "#!/bin/bash\n# Create a backup script for MySQL"
output1 = generate_code(prompt1, max_length=150, temperature=0.8)

In [None]:
# Example 2: System monitoring
prompt2 = "#!/bin/bash\n# Monitor system resources"
output2 = generate_code(prompt2, max_length=150, temperature=0.8)

In [None]:
# Example 3: Your custom prompt!
custom_prompt = "#!/bin/bash\n# "  # Add your description here
output3 = generate_code(custom_prompt, max_length=150, temperature=0.8)

### 4.4 Temperature Effects

Let's see how temperature affects generation:

In [None]:
prompt = "#!/bin/bash\necho "

print("Low Temperature (0.3) - Conservative:")
generate_code(prompt, max_length=50, temperature=0.3)

print("\nMedium Temperature (0.8) - Balanced:")
generate_code(prompt, max_length=50, temperature=0.8)

print("\nHigh Temperature (1.5) - Creative:")
generate_code(prompt, max_length=50, temperature=1.5)

print("\n💡 Observation:")
print("  Low temp → More predictable, safer")
print("  High temp → More varied, riskier")
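
The observation above follows directly from how temperature enters the softmax: logits are divided by the temperature before normalisation, so small values exaggerate differences and large values flatten them. A minimal sketch with made-up logits (the function name is illustrative):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before softmax, as generate_code does."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.3 nearly all probability lands on the top token, while at 1.5 the three candidates are much closer together.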

---

# Part 5: Advanced Topics 🎯

## 5.1 Token Probability Analysis

Let's peek inside the model to see what it's thinking!

In [None]:
def analyze_predictions(prompt, top_k=10):
    """Show top-k next token predictions."""
    print(f"Analyzing: {repr(prompt)}\n")
    
    # Encode
    input_ids = production_tokenizer.encode(prompt)
    input_tensor = torch.tensor([input_ids], dtype=torch.long)
    
    # Get predictions
    with torch.no_grad():
        logits, _ = production_model(input_tensor)
        next_token_logits = logits[0, -1, :]
        probs = F.softmax(next_token_logits, dim=-1)
    
    # Get top-k
    top_probs, top_indices = torch.topk(probs, top_k)
    
    print(f"Top {top_k} most likely next tokens:\n")
    for prob, idx in zip(top_probs, top_indices):
        token = production_tokenizer.vocab.id_to_token.get(idx.item(), '<UNK>')
        print(f"  {prob.item()*100:5.2f}% → {repr(token)}")

# Analyze what comes after "#!/bin/bash"
analyze_predictions("#!/bin/bash\n#", top_k=10)

## 5.2 Comparing Different Architectures

How do the architecture settings change model size, and what does that trade off?

In [None]:
# Compare different model sizes
sizes = {
    'Tiny': {'n_layers': 4, 'd_model': 256, 'n_heads': 4, 'd_ff': 1024},
    'Small': {'n_layers': 6, 'd_model': 384, 'n_heads': 6, 'd_ff': 1536},
    'Medium': {'n_layers': 12, 'd_model': 768, 'n_heads': 12, 'd_ff': 3072},
}

size_comparison = []

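for name, params in sizes.items():
    # Use separate names so the `config` and `model` from Part 2 are not overwritten
    size_config = CoderConfig(
        vocab_size=8000,
        max_seq_len=512,
        **params
    )
    size_model = CodeTransformer(size_config)
    param_count = sum(p.numel() for p in size_model.parameters())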
    
    size_comparison.append({
        'Size': name,
        'Parameters (M)': param_count / 1e6,
        'Layers': params['n_layers'],
        'Hidden': params['d_model'],
        'Heads': params['n_heads'],
    })

import pandas as pd
df = pd.DataFrame(size_comparison)
print(df.to_string(index=False))

# Visualize parameter scaling
plt.figure(figsize=(10, 5))
plt.bar(df['Size'], df['Parameters (M)'], color=['lightblue', 'skyblue', 'steelblue'])
plt.ylabel('Parameters (Millions)')
plt.title('Model Size Comparison')
plt.grid(axis='y', alpha=0.3)

for i, row in df.iterrows():
    plt.text(i, row['Parameters (M)'] + 5, f"{row['Parameters (M)']:.1f}M", 
             ha='center', fontweight='bold')

plt.show()

print("\n💡 Trade-offs:")
print("  Tiny: Fast, low memory, lower quality")
print("  Small: Balanced (recommended)")
print("  Medium: Best quality, slower, more memory")

## 5.3 Training Data Impact

Let's visualize our training data distribution:

In [None]:
import json

# Load bash scripts statistics
stats_path = project_root / "data" / "code" / "bash_scripts" / "stats.json"
if stats_path.exists():
    with open(stats_path, 'r') as f:
        stats = json.load(f)
    
    print("Training Data Statistics:")
    print(f"  Scripts: {stats['num_scripts']}")
    print(f"  Lines: {stats['total_lines']:,}")
    print(f"  Characters: {stats['total_chars']:,}")
    print(f"  Avg lines/script: {stats['avg_lines']:.1f}")
    print(f"  Avg chars/script: {stats['avg_chars']:.1f}")
    
    # Visualize categories (counts are hard-coded here, not read from stats.json)
    categories = {
        'System Admin': 20,
        'DevOps/CI': 20,
        'Database': 15,
        'Networking': 15,
        'Monitoring': 15,
        'Deployment': 15,
    }
    
    plt.figure(figsize=(10, 6))
    plt.pie(categories.values(), labels=categories.keys(), autopct='%1.1f%%',
            startangle=90, colors=sns.color_palette('Set3'))
    plt.title('Training Data Distribution by Category')
    plt.axis('equal')
    plt.show()
else:
    print(f"Stats file not found: {stats_path}")
    print("Run: python scripts/generate_bash_dataset.py")

---

# Summary and Next Steps

## What We Learned

1. **✓ Tokenization**: BPE balances vocab size and sequence length
2. **✓ Architecture**: Transformers use self-attention for context
3. **✓ Training**: Two-stage training (language, then code) builds code skills on top of general language knowledge
4. **✓ Generation**: Temperature trades predictability against variety
5. **✓ Scaling**: Larger models tend to give better quality but need more resources

## Visual References

All diagrams used in this notebook are in `docs/diagrams/`:
- `tokenization-process.svg` - How tokenization works
- `transformer-architecture.svg` - Model structure
- `attention-mechanism.svg` - Self-attention explained
- `two-stage-training.svg` - Training pipeline
- `training-loop.svg` - Training process
- `generation-process.svg` - Code generation

## Try These Next

```python
# 1. Generate different script types
prompts = [
    "#!/bin/bash\n# Deployment script",
    "#!/bin/bash\n# Log analyzer",
    "#!/bin/bash\n# Network monitor",
]

# 2. Experiment with generation parameters
for p in prompts:
    generate_code(p, temperature=0.5, top_k=20)

# 3. Fine-tune on your own bash scripts
# See: examples/fine_tuning.py
```

## Resources

- **Architecture Guide**: `docs/ARCHITECTURE.md`
- **Deployment Guide**: `docs/DEPLOYMENT.md`
- **Advanced Topics**: `docs/ADVANCED_TOPICS.md`
- **Visual Guide**: `docs/VISUAL_GUIDE.md`
- **Presentation Guide**: `presentation/PRESENTATION_GUIDE.md`

---

**🎉 Congratulations!** You now understand how modern code generation models work!