COLAB - https://colab.research.google.com/drive/1wFBi_V3qnOIT5e9CnqN3-IJSSltbIM2t?usp=sharing

# Token-Level NanoGPT Training - Terminal Commands

This notebook contains all the terminal commands needed to train a token-level NanoGPT model on Shakespeare data. Each cell can be executed to run the commands step by step.

## 🎯 **Purpose**
- Fix the vocabulary mismatch issue (model vocab_size=65 vs. data token IDs above 50,000)
- Train with proper tiktoken GPT-2 BPE tokenization (vocab_size ≈ 50,257)
- Create a compatible model for evaluation
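
The mismatch in the first bullet can be made concrete: an embedding table with `vocab_size` rows can only look up IDs in the range `[0, vocab_size)`, so any larger ID in the data is invalid for that model. A minimal sketch in plain Python follows; the helper name `out_of_vocab_ids` and the sample IDs are illustrative and not part of nanoGPT.

```python
def out_of_vocab_ids(token_ids, vocab_size):
    """Return the sorted, de-duplicated IDs that fall outside [0, vocab_size)."""
    return sorted({t for t in token_ids if not 0 <= t < vocab_size})

# Illustrative IDs only: small values fit a 65-entry vocabulary, large ones do not.
sample_ids = [12, 64, 65, 31373, 50256]

print(out_of_vocab_ids(sample_ids, vocab_size=65))     # IDs a 65-entry model cannot embed
print(out_of_vocab_ids(sample_ids, vocab_size=50257))  # nothing is out of range here
```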

## 📋 **What We'll Do**
1. Check dependencies and setup
2. Prepare token-level Shakespeare data
3. Create training configuration
4. Run training
5. Generate text samples
6. Test the trained model

---

## Step 1: Check Environment and Dependencies

In [None]:
# Check current working directory and available resources
import os
import sys
print(f"Current directory: {os.getcwd()}")
print(f"Python version: {sys.version}")

# Check if we have the required packages
try:
    import torch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA device: {torch.cuda.get_device_name()}")
except ImportError:
    print("PyTorch not found")

try:
    import tiktoken
    print("tiktoken available")
except ImportError:
    print("tiktoken not found")

try:
    import numpy as np
    print(f"NumPy version: {np.__version__}")
except ImportError:
    print("NumPy not found")
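
The same check can be written more compactly when you only need to know which packages are absent. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is hypothetical, and note that it takes import names, which can differ from pip package names.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be located as importable modules."""
    return [n for n in names if importlib.util.find_spec(n) is None]

required = ["torch", "numpy", "tiktoken"]  # import names checked by the cell above
print("Missing:", missing_packages(required) or "none")
```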

In [None]:
# Install required packages if missing
!pip install torch numpy transformers datasets tiktoken wandb tqdm

## Step 2: Navigate to nanoGPT Directory

In [None]:
# Navigate to nanoGPT directory (skipped if this cell has already been run)
import os
if os.path.basename(os.getcwd()) != 'nanoGPT':
    os.chdir('nanoGPT')
print(f"Current directory: {os.getcwd()}")

# List contents to verify we're in the right place
print("\nDirectory contents:")
for item in sorted(os.listdir('.')):
    print(f"  {item}")

## Step 3: Prepare Token-Level Shakespeare Data

In [None]:
# Prepare Shakespeare data with tiktoken BPE tokenization
# This creates train.bin and val.bin with proper token-level encoding

print("Preparing Shakespeare data with tiktoken BPE tokenization...")
!cd data/shakespeare && python prepare.py

In [None]:
# Verify the data was created successfully
import os
import numpy as np

data_path = "data/shakespeare"
train_path = os.path.join(data_path, "train.bin")
val_path = os.path.join(data_path, "val.bin")

if os.path.exists(train_path) and os.path.exists(val_path):
    print("Data files created successfully!")
    
    # Load and inspect the data
    train_data = np.memmap(train_path, dtype=np.uint16, mode='r')
    val_data = np.memmap(val_path, dtype=np.uint16, mode='r')
    
    print(f"Training data: {len(train_data):,} tokens")
    print(f"Validation data: {len(val_data):,} tokens")
    print(f"Token range: {train_data.min()} to {train_data.max()}")
    print(f"Data type: {train_data.dtype}")
    
    # Verify this matches tiktoken vocab size
    import tiktoken
    enc = tiktoken.get_encoding("gpt2")
    print(f"Tiktoken vocab size: {enc.n_vocab}")
    
    if train_data.max() < enc.n_vocab:
        print("Token range is compatible with tiktoken vocabulary!")
    else:
        print("Token range exceeds tiktoken vocabulary!")
else:
    print("Data files not found. Check the prepare.py script.")
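
The cell above reads the files with `dtype=np.uint16`, which implies each token ID is stored as a 2-byte unsigned integer. That is enough for the GPT-2 vocabulary because 50,257 is below 2^16 = 65,536. A small standard-library sketch of the same round trip is shown below; the token IDs are made up for illustration, and the in-memory buffer stands in for `train.bin`.

```python
from array import array

UINT16_MAX = 2**16 - 1  # largest value a 2-byte unsigned integer can hold

def pack_tokens(token_ids):
    """Serialize token IDs as raw unsigned 16-bit integers (native byte order)."""
    if any(not 0 <= t <= UINT16_MAX for t in token_ids):
        raise ValueError("token ID does not fit in uint16")
    return array("H", token_ids).tobytes()

def unpack_tokens(raw):
    """Inverse of pack_tokens: bytes back to a list of token IDs."""
    out = array("H")
    out.frombytes(raw)
    return out.tolist()

ids = [464, 3290, 50256]  # illustrative IDs; 50256 is the largest GPT-2 ID
raw = pack_tokens(ids)
print(len(raw), "bytes for", len(ids), "tokens")
print(unpack_tokens(raw) == ids)
```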

## Step 4: Create Training Configuration

In [None]:
# Create configuration file for token-level training
config_content = '''# Token-level Shakespeare training configuration
# This config is designed for GPT-2 BPE tokenization (vocab_size ~50257)

import torch

out_dir = 'out-shakespeare-token'
eval_interval = 500
eval_iters = 100
log_interval = 10
wandb_log = False  # Set to True if you want to use wandb

# Dataset
dataset = 'shakespeare'  # Uses tiktoken BPE tokenized data

# Model architecture - adjusted for token-level training
n_layer = 8          # Increased layers for token complexity
n_head = 8           # Increased attention heads
n_embd = 512         # Increased embedding dimension
dropout = 0.1        # Slightly lower dropout
bias = False         # No bias in linear layers (modern practice)

# Training hyperparameters
batch_size = 8                    # Smaller batch size due to larger vocab
gradient_accumulation_steps = 8   # Effective batch size = 8 * 8 = 64
max_iters = 3000                  # More iterations needed for convergence
learning_rate = 3e-4              # Standard learning rate
weight_decay = 1e-1               # Weight decay (decoupled, as applied by AdamW)
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0                   # Gradient clipping

# Learning rate schedule
decay_lr = True
warmup_iters = 100
lr_decay_iters = 3000  # Should be ~= max_iters
min_lr = 3e-5          # min_lr = learning_rate / 10

# Context length
block_size = 512      # Moderate context length for token-level

# System
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
compile = True        # PyTorch 2.0 compile for speed

# Save checkpoint settings
always_save_checkpoint = False  # Only save when validation improves
'''

# Write the configuration file
config_path = "config/train_shakespeare_token.py"
with open(config_path, 'w') as f:
    f.write(config_content)

print(f"Created configuration file: {config_path}")
print("Configuration summary:")
print("  - Model: 8 layers, 8 heads, 512 embedding")
print("  - Batch size: 8 (effective: 64 with gradient accumulation)")
print("  - Max iterations: 3000")
print("  - Block size: 512 tokens")
print("  - Output directory: out-shakespeare-token")
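
The schedule settings in this config (`warmup_iters`, `lr_decay_iters`, `min_lr`) describe a warmup phase followed by a decay down to `min_lr`. The sketch below implements linear warmup with cosine decay using the values from the config. It is an approximation for building intuition, and the exact formula in your copy of `train.py` may differ in details such as off-by-one handling.

```python
import math

def lr_at(it, learning_rate=3e-4, min_lr=3e-5, warmup_iters=100, lr_decay_iters=3000):
    """Linear warmup, then cosine decay to min_lr (a sketch, not train.py itself)."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it >= lr_decay_iters:
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

for it in (0, 100, 1550, 3000):
    print(f"iter {it:>4}: lr = {lr_at(it):.2e}")
```

Plotting `lr_at` over `range(max_iters)` is a quick way to confirm that `lr_decay_iters` and `max_iters` line up as intended.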

## Step 5: Start Training

In [None]:
# Start training the token-level model
print("Starting token-level NanoGPT training...")
print("This may take 30-60 minutes on GPU, or several hours on CPU")
print("=" * 60)

# Run the training command
!python train.py config/train_shakespeare_token.py
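
To put `max_iters = 3000` in perspective, each optimizer step with this config consumes `batch_size * gradient_accumulation_steps * block_size` tokens. Dividing the total by the training-set size printed in Step 3 gives a rough number of passes over the data. In the sketch below, the 300,000-token dataset size is a placeholder; substitute the count you actually observed.

```python
def tokens_per_iter(batch_size=8, grad_accum_steps=8, block_size=512):
    """Tokens consumed by one optimizer step under the config above."""
    return batch_size * grad_accum_steps * block_size

def approx_epochs(n_train_tokens, max_iters=3000, **kwargs):
    """Rough number of passes over the training set for a full run."""
    return max_iters * tokens_per_iter(**kwargs) / n_train_tokens

print(f"Tokens per iteration: {tokens_per_iter():,}")
# Placeholder dataset size; replace with the value printed by the verification cell.
print(f"Approximate epochs:   {approx_epochs(300_000):.0f}")
```

A high epoch count on a small corpus is a reason to watch validation loss for overfitting; with `always_save_checkpoint = False`, only the best-validation checkpoint is kept.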

### Alternative: Training with Custom Parameters

If you want to adjust training parameters or run on CPU/different hardware:

In [None]:
# OPTION A: For CPU training (slower but works on any machine)
# Uncomment the line below if you want to train on CPU
# !python train.py config/train_shakespeare_token.py --device=cpu --compile=False --batch_size=4

# OPTION B: For Apple Silicon Mac (M1/M2)
# Uncomment the line below if you're on Apple Silicon
# (--compile=False is added because the config enables torch.compile by default)
# !python train.py config/train_shakespeare_token.py --device=mps --compile=False --batch_size=4

# OPTION C: Quick training (fewer iterations, for testing)
# Uncomment the line below for a quick test run
# !python train.py config/train_shakespeare_token.py --max_iters=500 --eval_interval=100

print("TIP: Uncomment one of the alternative training commands above if needed")
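
The `--key=value` flags in these commands override values set in the config file. A rough sketch of how such overrides can be parsed and type-checked is shown below. `apply_overrides` is a hypothetical helper written for illustration, and nanoGPT's own override handling may differ in details.

```python
from ast import literal_eval

def apply_overrides(config, argv):
    """Return a copy of `config` updated from ['--key=value', ...] arguments."""
    updated = dict(config)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in updated:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = literal_eval(raw)   # parses numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw                 # bare words such as cpu stay strings
        if type(value) is not type(updated[key]):
            raise TypeError(f"{key} expects {type(updated[key]).__name__}")
        updated[key] = value
    return updated

base = {"device": "cuda", "compile": True, "batch_size": 8}
print(apply_overrides(base, ["--device=cpu", "--compile=False", "--batch_size=4"]))
```

Checking the type against the existing value catches mistakes such as `--batch_size=four` before a long training run starts.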

## Step 6: Monitor Training Progress

In [None]:
# Check training progress and output directory
import os
import glob

output_dir = "out-shakespeare-token"
if os.path.exists(output_dir):
    print(f"Training output directory exists: {output_dir}")
    
    # List files in output directory
    files = os.listdir(output_dir)
    print(f"Files in output directory:")
    for file in sorted(files):
        file_path = os.path.join(output_dir, file)
        if os.path.isfile(file_path):
            size = os.path.getsize(file_path)
            print(f"  {file} ({size:,} bytes)")
    
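    # Show the training log if one exists (this requires that training output
    # was written or redirected to log.txt in the output directory)
    log_file = os.path.join(output_dir, "log.txt")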
    if os.path.exists(log_file):
        print(f"\nLast few lines of training log:")
        with open(log_file, 'r') as f:
            lines = f.readlines()
            for line in lines[-10:]:  # Show last 10 lines
                print(f"  {line.strip()}")
else:
    print(f"ERROR: Training output directory not found: {output_dir}")
    print("Training may still be in progress or failed to start.")

## Step 7: Generate Text Samples

In [None]:
# Generate text samples from the trained model
print("Generating text samples from trained model...")
print("=" * 50)

!python sample.py --out_dir=out-shakespeare-token

In [None]:
# Generate custom text samples with specific prompts
print("Generating custom text samples...")
print("=" * 40)

# Prompt the model with a specific character's name
!python sample.py --out_dir=out-shakespeare-token --start="HAMLET:" --num_samples=2 --max_new_tokens=150

In [None]:
# More creative prompts
print("\nMore creative generation examples...")

# Famous Shakespeare quote continuation
!python sample.py --out_dir=out-shakespeare-token --start="To be or not to be," --num_samples=1 --max_new_tokens=100 --temperature=0.8

print("\n" + "="*40)

# Different character
!python sample.py --out_dir=out-shakespeare-token --start="JULIET:" --num_samples=1 --max_new_tokens=100 --temperature=0.8
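
The `--temperature=0.8` flag rescales the model's logits before sampling: values below 1 sharpen the distribution toward the most likely tokens, and values above 1 flatten it. A self-contained sketch of temperature-scaled softmax follows; the logits are arbitrary illustration values, not model output.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # arbitrary example values
for t in (0.8, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```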

## Step 8: Test Model Compatibility

In [None]:
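# Test that our trained model is compatible with the evaluation notebook
import os
import pickle
import torch
import numpy as np

output_dir = "out-shakespeare-token"
model_path = f"{output_dir}/ckpt.pt"
# Note: config.pkl is expected alongside the checkpoint; if your train.py saves
# only ckpt.pt, this check will report the model files as missing.
config_path = f"{output_dir}/config.pkl"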

if os.path.exists(model_path) and os.path.exists(config_path):
    print("Model files found!")
    
    # Load model configuration
    with open(config_path, 'rb') as f:
        config = pickle.load(f)
    
    print(f"Model Configuration:")
    print(f"  Vocabulary size: {config.vocab_size}")
    print(f"  Block size: {config.block_size}")
    print(f"  Number of layers: {config.n_layer}")
    print(f"  Number of heads: {config.n_head}")
    print(f"  Embedding dimension: {config.n_embd}")
    
    # Load model checkpoint
    checkpoint = torch.load(model_path, map_location='cpu')
    print(f"Checkpoint info:")
    print(f"  Training iteration: {checkpoint.get('iter_num', 'Unknown')}")
    print(f"  Best validation loss: {checkpoint.get('best_val_loss', 'Unknown')}")
    
    # Verify vocabulary compatibility
    data_path = "data/shakespeare/train.bin"
    if os.path.exists(data_path):
        train_data = np.memmap(data_path, dtype=np.uint16, mode='r')
        max_token = train_data.max()
        
        print(f"Data compatibility check:")
        print(f"  Data max token: {max_token}")
        print(f"  Model vocab size: {config.vocab_size}")
        
        if max_token < config.vocab_size:
            print("COMPATIBILITY SUCCESS! Data tokens fit within model vocabulary.")
            print("This model can be used with the evaluation notebook!")
        else:
            print("COMPATIBILITY ISSUE: Data tokens exceed model vocabulary.")
    
else:
    print("ERROR: Model files not found. Training may not have completed successfully.")
    if not os.path.exists(output_dir):
        print(f"ERROR: Output directory doesn't exist: {output_dir}")
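
If `config.pkl` is not present, the architecture can often be recovered from the checkpoint itself. The sketch below assumes the checkpoint is a dictionary with a `model_args` entry (the `checkpoint.get('iter_num', ...)` calls above already treat it as a dictionary). The key names are assumptions to verify against your own `ckpt.pt`. The helper works on a plain dict so it can be tried without a trained model.

```python
def summarize_checkpoint(checkpoint, max_token_id):
    """Summarize model settings and check that data token IDs fit the vocabulary."""
    args = checkpoint.get("model_args", {})   # assumed key; verify in your ckpt.pt
    vocab_size = args.get("vocab_size")
    return {
        "vocab_size": vocab_size,
        "block_size": args.get("block_size"),
        "n_layer": args.get("n_layer"),
        "iter_num": checkpoint.get("iter_num"),
        "compatible": vocab_size is not None and max_token_id < vocab_size,
    }

# Stand-in for torch.load(model_path, map_location='cpu'); values are illustrative.
fake_checkpoint = {
    "model_args": {"vocab_size": 50304, "block_size": 512, "n_layer": 8},
    "iter_num": 3000,
}
print(summarize_checkpoint(fake_checkpoint, max_token_id=50256))
```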

## Step 9: Copy Model for Evaluation

In [None]:
# Copy the trained model to the checkpoints directory for use with evaluation notebook
import os
import shutil

# Create destination directory
dest_dir = "../checkpoints/token_level_nanogpt"
os.makedirs(dest_dir, exist_ok=True)

# Copy model files
source_dir = "out-shakespeare-token"
if os.path.exists(source_dir):
    # Copy checkpoint
    if os.path.exists(f"{source_dir}/ckpt.pt"):
        shutil.copy2(f"{source_dir}/ckpt.pt", f"{dest_dir}/token_level_nanogpt.pt")
        print(f"Copied model checkpoint to {dest_dir}/token_level_nanogpt.pt")
    
    # Copy config
    if os.path.exists(f"{source_dir}/config.pkl"):
        shutil.copy2(f"{source_dir}/config.pkl", f"{dest_dir}/token_level_meta.pkl")
        print(f"Copied model config to {dest_dir}/token_level_meta.pkl")
    
    print(f"\nFiles in checkpoint directory:")
    for file in os.listdir(dest_dir):
        file_path = os.path.join(dest_dir, file)
        size = os.path.getsize(file_path)
        print(f"  {file} ({size:,} bytes)")
    
    print(f"\nTo use this model in the evaluation notebook, update the config:")
    print(f"  model_path: '../checkpoints/token_level_nanogpt/token_level_nanogpt.pt'")
    print(f"  meta_path: '../checkpoints/token_level_nanogpt/token_level_meta.pkl'")
    print(f"  data_dir: 'nanoGPT/data/shakespeare'")
    
else:
    print("ERROR: Source directory not found. Training may not have completed.")

## 🎉 Training Complete!

### Summary of What We Accomplished:

1. ✅ **Prepared token-level data** using tiktoken GPT-2 BPE tokenization
2. ✅ **Created proper configuration** for token-level training
3. ✅ **Trained the model** with vocabulary size ≈ 50,257 (compatible with data)
4. ✅ **Generated text samples** to verify model quality
5. ✅ **Verified compatibility** between model and data
6. ✅ **Copied model files** to checkpoint directory for evaluation

### Next Steps:

1. **Use the evaluation notebook** with the new model:
   - Update paths in the evaluation notebook config
   - Run evaluation to get proper metrics (no more infinite perplexity!)

2. **Compare results** between character-level and token-level models

3. **Experiment further**:
   - Try different hyperparameters
   - Train for longer
   - Fine-tune from pre-trained GPT-2
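
On the perplexity metric mentioned in step 1: perplexity is the exponential of the mean cross-entropy loss (measured in nats), so a finite validation loss always gives a finite perplexity. A short sketch follows; the loss values are illustrative reference points, not results from this run.

```python
import math

def perplexity(mean_loss_nats):
    """Perplexity = exp(mean cross-entropy), with the loss measured in nats."""
    try:
        return math.exp(mean_loss_nats)
    except OverflowError:
        return math.inf  # loss too large to exponentiate as a float

# A model guessing uniformly over 50,257 tokens has loss ln(50257),
# which corresponds to a perplexity equal to the vocabulary size.
uniform_loss = math.log(50257)
print(f"Uniform-guess loss: {uniform_loss:.2f} -> perplexity {perplexity(uniform_loss):,.0f}")
print(f"Perfect-prediction loss: 0.00 -> perplexity {perplexity(0.0):.0f}")
```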

### Key Files Created:
- **Model**: `../checkpoints/token_level_nanogpt/token_level_nanogpt.pt`
- **Metadata**: `../checkpoints/token_level_nanogpt/token_level_meta.pkl`
- **Data**: `nanoGPT/data/shakespeare/train.bin` and `val.bin`
- **Config**: `nanoGPT/config/train_shakespeare_token.py`

The vocabulary mismatch issue is now **resolved**! 🚀