# LLM Training Pipeline

This notebook provides an interactive interface for training a 7B-parameter language model through the complete pipeline:

1. **Data Preparation** - Download, clean, and tokenize training data
2. **Pretraining** - Train on large text corpora with curriculum learning
3. **SFT** - Supervised fine-tuning on instruction-response pairs
4. **DPO** - Direct preference optimization for alignment
5. **LoRA** - Optional domain-specific fine-tuning

## Prerequisites

- NVIDIA GPU (A100 80GB recommended, H100 for FP8 support)
- Training scripts available under `scripts/` in the project root
- `torch` and `transformers` installed in the notebook's Python environment

---
## Setup & Configuration

In [None]:
import os
import sys

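# Set the project root directory.
# Step up one level only when the kernel was started inside the notebooks/ folder itself;
# a substring test would also match unrelated paths such as /home/me/notebooks-old/project.
PROJECT_ROOT = os.path.dirname(os.getcwd()) if os.path.basename(os.getcwd()) == 'notebooks' else os.getcwd()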
os.chdir(PROJECT_ROOT)
sys.path.insert(0, os.path.join(PROJECT_ROOT, 'scripts'))

print(f"Project root: {PROJECT_ROOT}")
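
The cell above infers the project root from the current working directory. A minimal sketch of a more explicit alternative follows, assuming the project root is the nearest ancestor directory that contains `scripts/`. The helper name `find_project_root` and the `marker` parameter are illustrative, not part of the pipeline.

```python
from pathlib import Path


def find_project_root(start, marker="scripts"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`/."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}/' found at or above {start}")
```

Calling `find_project_root(os.getcwd())` gives the same answer whether the kernel starts in the project root or in a subfolder, and it fails loudly instead of silently picking the wrong directory.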

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    print(f"GPU: {gpu_name}")
    print(f"Memory: {gpu_memory:.1f} GB")
    
    # Check for FP8 support (compute capability 9.0 or higher, e.g. H100)
    capability = torch.cuda.get_device_capability(0)
    if capability[0] >= 9:
        print(f"FP8 Support: Available (compute capability {capability[0]}.{capability[1]})")
    else:
        print(f"FP8 Support: Not available (compute capability {capability[0]}.{capability[1]})")
else:
    print("WARNING: No GPU detected! Training will be extremely slow.")

In [None]:
# Training configuration
# Modify these settings as needed

CONFIG = {
    # General settings
    'use_fp8': None,  # None = auto-detect, True = force FP8, False = force BF16
    'seed': 42,
    'enable_oom_recovery': True,
    
    # Pretraining
    'pretrain_max_steps': 100000,
    'pretrain_save_steps': 1000,
    'pretrain_eval_steps': 1000,
    
    # SFT
    'sft_max_steps': 5000,
    'sft_save_steps': 500,
    
    # DPO
    'dpo_max_steps': 2000,
    'dpo_save_steps': 200,
}

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
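
The `use_fp8` setting is tri-state: `True` and `False` force a precision, and `None` leaves the choice to auto-detection. How the training scripts auto-detect is not shown in this notebook, so the sketch below is an assumption: it reuses the threshold from the GPU check (major compute capability of 9 or higher). `resolve_precision` is a hypothetical helper.

```python
def resolve_precision(use_fp8, capability):
    """Map the tri-state `use_fp8` setting to 'fp8' or 'bf16'.

    `capability` is a (major, minor) tuple such as the one returned by
    torch.cuda.get_device_capability(), or None when no GPU is present.
    """
    if use_fp8 is True:
        return "fp8"
    if use_fp8 is False:
        return "bf16"
    # Auto-detect (assumed rule): FP8 only on compute capability 9.x or newer.
    if capability is not None and capability[0] >= 9:
        return "fp8"
    return "bf16"
```

Printing `resolve_precision(CONFIG['use_fp8'], capability)` before launching a stage makes it clear which precision a run is expected to use.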

---
## Pre-flight Validation

Before starting training, validate that all prerequisites are in place.

In [None]:
# Run pre-flight checks for a specific stage
# Options: 'pretrain', 'sft', 'dpo', 'lora'

STAGE_TO_CHECK = 'pretrain'  # Change this to check different stages

!python scripts/preflight_check.py {STAGE_TO_CHECK}

In [None]:
# Run all pre-flight checks
!python scripts/preflight_check.py --all
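
The checks performed by `scripts/preflight_check.py` are not listed here. As a rough illustration of what a pre-flight check might cover, the sketch below verifies that required paths exist and that enough disk space is free. The function name, its parameters, and the choice of checks are assumptions, not the script's actual behavior.

```python
import os
import shutil


def preflight(required_paths, min_free_gb=0.0, base="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for path in required_paths:
        if not os.path.exists(os.path.join(base, path)):
            problems.append(f"missing path: {path}")
    # Free space on the filesystem that holds `base`, in GiB
    free_gb = shutil.disk_usage(base).free / (1024**3)
    if free_gb < min_free_gb:
        problems.append(f"low disk space: {free_gb:.1f} GB free, {min_free_gb} GB required")
    return problems
```

Returning a list of problems rather than raising on the first failure lets one run report everything that needs fixing.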

---
## Stage 1: Data Preparation

Download, clean, and prepare training data. Skip this section if data is already prepared.

In [None]:
# Step 1.1: Download raw data
# This downloads data from configured sources (Hugging Face, etc.)

!python scripts/01_download_data.py

In [None]:
# Step 1.2: Clean and deduplicate data
# Removes duplicates, filters low-quality content

!python scripts/02_clean_deduplicate_optimized.py
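
The method used by `02_clean_deduplicate_optimized.py` is not shown in this notebook. The sketch below illustrates the general idea with exact-match deduplication on normalized text plus a crude length filter. The function name and the `min_chars` threshold are hypothetical, and a production pipeline may well use fuzzy matching instead.

```python
import hashlib


def deduplicate(texts, min_chars=20):
    """Drop exact duplicates (after case/whitespace normalization) and very short texts."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        if len(normalized) < min_chars:
            continue  # crude low-quality filter: too short to be useful
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(text)
    return kept
```

Storing hashes instead of full strings keeps memory use bounded per document, which matters once the corpus no longer fits in RAM as raw text.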

In [None]:
# Step 1.3: Tokenize and pack sequences
# Creates packed sequences for efficient training

!python scripts/03_tokenize_and_pack.py
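
Packing concatenates tokenized documents into one stream and cuts it into fixed-length blocks, so no compute is spent on padding. The sketch below shows one common variant, assuming an end-of-sequence token separates documents and the incomplete final block is dropped. The actual behavior of `03_tokenize_and_pack.py` may differ.

```python
def pack_sequences(tokenized_docs, seq_len, eos_id):
    """Concatenate documents (each followed by EOS) and cut into fixed-length blocks.

    The trailing remainder that does not fill a whole block is dropped.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_blocks = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_blocks)]
```

With `seq_len=4` and `eos_id=0`, the documents `[1, 2, 3]`, `[4, 5]`, and `[6, 7, 8, 9]` pack into `[1, 2, 3, 0]`, `[4, 5, 0, 6]`, and `[7, 8, 9, 0]`. A block can span a document boundary, which is the trade-off packing makes for efficiency.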

In [None]:
# Step 1.4: Initialize model
# Creates the initial 7B model checkpoint

!python scripts/04_init_model.py

In [None]:
# Verify data preparation
import os

paths_to_check = [
    ('Tokenizer', 'configs/tokenizer'),
    ('Initial model', 'checkpoints/init'),
    ('Training data', 'data/packed/train'),
    ('Validation data', 'data/packed/val'),
]

print("Data preparation status:")
print("=" * 50)
all_ready = True
for name, path in paths_to_check:
    exists = os.path.exists(path)
    status = "OK" if exists else "MISSING"
    print(f"  {name}: {status}")
    all_ready = all_ready and exists

print("=" * 50)
if all_ready:
    print("All data preparation complete! Ready for pretraining.")
else:
    print("Some data is missing. Run the preparation steps above.")

---
## Stage 2: Pretraining

Train the base model on large text corpora. This is the longest stage.

**Estimated time:** 25-50 hours, depending on the GPU (H100 with FP8 is fastest)

In [None]:
# Build pretraining command
pretrain_cmd = "python scripts/05_pretrain.py"

if CONFIG['use_fp8'] is True:
    pretrain_cmd += " --fp8"
elif CONFIG['use_fp8'] is False:
    pretrain_cmd += " --no-fp8"

pretrain_cmd += f" --max_steps {CONFIG['pretrain_max_steps']}"
pretrain_cmd += f" --save_steps {CONFIG['pretrain_save_steps']}"
pretrain_cmd += f" --eval_steps {CONFIG['pretrain_eval_steps']}"
pretrain_cmd += f" --seed {CONFIG['seed']}"

if CONFIG['enable_oom_recovery']:
    pretrain_cmd += " --enable-oom-recovery"

print("Pretraining command:")
print(pretrain_cmd)
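
The SFT and DPO stages later in this notebook assemble the same flags. One possible way to avoid repeating that logic is a small helper, sketched below. `build_train_cmd` is a hypothetical name; the flags are the ones used in the cell above, and `shlex.join` requires Python 3.8 or newer.

```python
import shlex


def build_train_cmd(script, stage, config, with_eval=False):
    """Assemble the command line for one training stage from the shared CONFIG dict."""
    args = ["python", script]
    if config["use_fp8"] is True:
        args.append("--fp8")
    elif config["use_fp8"] is False:
        args.append("--no-fp8")
    # use_fp8 is None: pass neither flag and let the script auto-detect
    args += ["--max_steps", str(config[f"{stage}_max_steps"])]
    args += ["--save_steps", str(config[f"{stage}_save_steps"])]
    if with_eval:
        args += ["--eval_steps", str(config[f"{stage}_eval_steps"])]
    args += ["--seed", str(config["seed"])]
    if config["enable_oom_recovery"]:
        args.append("--enable-oom-recovery")
    return shlex.join(args)
```

For example, `build_train_cmd("scripts/05_pretrain.py", "pretrain", CONFIG, with_eval=True)` produces the same string as the cell above, and `shlex.join` quotes any argument that contains spaces.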

In [None]:
# Start pretraining
# This will take a long time - monitor progress in the output

!{pretrain_cmd}

In [None]:
# Resume pretraining from checkpoint (if interrupted)
# Uncomment and modify the checkpoint path as needed

# CHECKPOINT_PATH = "checkpoints/pretrain/checkpoint-5000"
# !python scripts/05_pretrain.py --resume_from_checkpoint {CHECKPOINT_PATH}
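
Rather than typing the checkpoint path by hand, the most recent checkpoint can be located programmatically. The sketch below assumes checkpoints are saved as `checkpoint-<step>` directories, as in the example path above. `latest_checkpoint` is a hypothetical helper.

```python
import os
import re


def latest_checkpoint(stage_dir):
    """Return the path of the highest-numbered 'checkpoint-<step>' directory, or None."""
    if not os.path.isdir(stage_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(stage_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(stage_dir, name)
        # Compare steps numerically: as strings, "checkpoint-900" sorts after "checkpoint-5000"
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

`CHECKPOINT_PATH = latest_checkpoint("checkpoints/pretrain")` then picks up wherever the interrupted run last saved, and returns `None` if nothing has been saved yet.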

---
## Stage 3: Supervised Fine-Tuning (SFT)

Fine-tune on instruction-response pairs to create a helpful assistant.

**Estimated time:** 2-5 hours

In [None]:
# Prepare SFT data (if not already done)
!python scripts/06_prepare_sft_data.py

In [None]:
# Verify pretrained checkpoint exists
import os

if os.path.exists('checkpoints/pretrain_final'):
    print("Pretrained checkpoint found. Ready for SFT.")
else:
    print("ERROR: Pretrained checkpoint not found!")
    print("Complete pretraining before starting SFT.")

In [None]:
# Build SFT command
sft_cmd = "python scripts/07_sft.py"

if CONFIG['use_fp8'] is True:
    sft_cmd += " --fp8"
elif CONFIG['use_fp8'] is False:
    sft_cmd += " --no-fp8"

sft_cmd += f" --max_steps {CONFIG['sft_max_steps']}"
sft_cmd += f" --save_steps {CONFIG['sft_save_steps']}"
sft_cmd += f" --seed {CONFIG['seed']}"

if CONFIG['enable_oom_recovery']:
    sft_cmd += " --enable-oom-recovery"

print("SFT command:")
print(sft_cmd)

In [None]:
# Start SFT training
!{sft_cmd}

---
## Stage 4: Direct Preference Optimization (DPO)

Align the model with human preferences using chosen/rejected response pairs.

**Estimated time:** 1-3 hours
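
For a single chosen/rejected pair, the DPO objective compares how much more the policy prefers the chosen response than a frozen reference model does. The sketch below computes that loss from summed log-probabilities in plain Python. The `beta` value and the example numbers are illustrative and are not read from `scripts/09_dpo.py`.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given the summed log-probability of each response."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), evaluated in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss is `log(2)`. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference, and rises above it when the shift goes the other way.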

In [None]:
# Prepare DPO data
!python scripts/08_prepare_dpo_data.py

In [None]:
# Verify SFT checkpoint exists
import os

if os.path.exists('checkpoints/sft_final'):
    print("SFT checkpoint found. Ready for DPO.")
else:
    print("ERROR: SFT checkpoint not found!")
    print("Complete SFT before starting DPO.")

In [None]:
# Build DPO command
dpo_cmd = "python scripts/09_dpo.py"

if CONFIG['use_fp8'] is True:
    dpo_cmd += " --fp8"
elif CONFIG['use_fp8'] is False:
    dpo_cmd += " --no-fp8"

dpo_cmd += f" --max_steps {CONFIG['dpo_max_steps']}"
dpo_cmd += f" --save_steps {CONFIG['dpo_save_steps']}"
dpo_cmd += f" --seed {CONFIG['seed']}"

if CONFIG['enable_oom_recovery']:
    dpo_cmd += " --enable-oom-recovery"

print("DPO command:")
print(dpo_cmd)

In [None]:
# Start DPO training
!{dpo_cmd}

---
## Stage 5: LoRA Fine-Tuning (Optional)

Domain-specific adaptation using LoRA for efficient fine-tuning.
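
LoRA's efficiency comes from freezing each adapted weight matrix and training only a low-rank update, the product of a `d_out x rank` matrix and a `rank x d_in` matrix. The sketch below counts trainable parameters for one matrix. The dimensions and rank in the note afterwards are illustrative and are not taken from `scripts/10_lora_finetune.py`.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one d_out x d_in weight matrix."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora
```

For a 4096 x 4096 projection with rank 8, this gives 16,777,216 parameters for full fine-tuning against 65,536 for LoRA, roughly 0.4% of the original count.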

In [None]:
# LoRA fine-tuning (optional)
# Uncomment to run LoRA training

# !python scripts/10_lora_finetune.py

---
## Evaluation

Evaluate the trained model on various benchmarks.

In [None]:
# Run full evaluation suite
CHECKPOINT_TO_EVAL = "checkpoints/dpo_final"  # Change as needed

!python scripts/11_evaluate.py {CHECKPOINT_TO_EVAL}

In [None]:
# Check promotion gates
# Verify model meets quality thresholds

STAGE_TO_CHECK = "dpo"  # Options: pretrain, sft, dpo

!python scripts/12_check_gates.py {STAGE_TO_CHECK}
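
A promotion gate compares evaluation metrics against fixed thresholds and blocks the next stage when any of them fails. The sketch below shows the idea only. The metric names, thresholds, and gate format are hypothetical and do not reflect what `scripts/12_check_gates.py` actually checks.

```python
def check_gates(metrics, gates):
    """Return (passed, failures) for `metrics` against `gates`.

    Each gate is (metric_name, kind, threshold), where kind is "max" (value must
    not exceed the threshold) or "min" (value must not fall below it).
    A missing metric counts as a failure.
    """
    failures = []
    for name, kind, threshold in gates:
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "max" and value > threshold:
            failures.append(f"{name}: {value} > {threshold}")
        elif kind == "min" and value < threshold:
            failures.append(f"{name}: {value} < {threshold}")
    return (not failures), failures
```

Treating a missing metric as a failure is a deliberate choice here: a stage should not be promoted because its evaluation silently did not run.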

---
## Monitoring & Utilities

In [None]:
# Monitor GPU utilization
!nvidia-smi

In [None]:
# List all checkpoints
!bash scripts/checkpoint_manager.sh list

In [None]:
# Show disk usage
!bash scripts/checkpoint_manager.sh disk-usage

In [None]:
# Cleanup old checkpoints (keep latest 3)
# Uncomment to run

# !bash scripts/checkpoint_manager.sh cleanup pretrain 3
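
Before running a cleanup, it can help to preview which checkpoints a "keep the latest N" policy would remove. The sketch below works on directory names only and deletes nothing. It assumes the `checkpoint-<step>` naming used elsewhere in this notebook, and it is not the logic of `checkpoint_manager.sh`, which is not shown here.

```python
import re


def checkpoints_to_delete(names, keep=3):
    """Return the 'checkpoint-<step>' names that are older than the newest `keep`."""
    steps = []
    for name in names:
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match:
            steps.append((int(match.group(1)), name))
    steps.sort()  # oldest (lowest step) first
    if keep <= 0:
        return [name for _, name in steps]
    return [name for _, name in steps[:-keep]]
```

Passing `os.listdir("checkpoints/pretrain")` as `names` lists the candidates for removal without touching the disk. Entries that do not match the naming pattern are ignored.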

---
## Model Inference

Test the trained model with interactive generation.

In [None]:
# Load the trained model for inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "checkpoints/dpo_final"  # Change to your checkpoint

print(f"Loading model from {MODEL_PATH}...")
tokenizer = AutoTokenizer.from_pretrained("configs/tokenizer")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
print("Model loaded!")

In [None]:
# Generate text
def generate(prompt, max_new_tokens=256, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    
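    # generate() returns the prompt followed by the continuation;
    # drop the prompt tokens so only the newly generated text is returned
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)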
    return response

# Test generation
prompt = "Explain machine learning in simple terms:"
print(f"Prompt: {prompt}\n")
print(f"Response: {generate(prompt)}")
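
The `temperature` and `top_p` arguments shape the next-token distribution before sampling. The sketch below illustrates both on a toy list of logits in plain Python. It is a simplified illustration of temperature scaling and nucleus (top-p) filtering, not the `transformers` implementation, and `top_p_filter` is a hypothetical name.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the nucleus after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens
```

A lower temperature sharpens the distribution toward the most likely token, and a lower `top_p` discards more of the unlikely tail, so both make the output more conservative.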

In [None]:
# Interactive generation cell
# Modify the prompt and run to test different inputs

PROMPT = "Write a Python function to calculate fibonacci numbers:"

print(f"Prompt: {PROMPT}\n")
print("=" * 50)
print(generate(PROMPT, max_new_tokens=512))

---
## Training Summary

After completing all stages, review the training summary.

In [None]:
# Generate training report
import os
import json

print("=" * 60)
print("TRAINING PIPELINE SUMMARY")
print("=" * 60)

stages = [
    ('Pretrain', 'checkpoints/pretrain_final'),
    ('SFT', 'checkpoints/sft_final'),
    ('DPO', 'checkpoints/dpo_final'),
    ('LoRA', 'checkpoints/lora_final'),
]

print("\nCheckpoint Status:")
for name, path in stages:
    if os.path.exists(path):
        # Get checkpoint size (top-level files only; subdirectories are not counted)
        files = [os.path.join(path, f) for f in os.listdir(path)]
        size = sum(os.path.getsize(p) for p in files if os.path.isfile(p))
        size_gb = size / (1024**3)
        print(f"  {name}: COMPLETE ({size_gb:.2f} GB)")
    else:
        print(f"  {name}: Not completed")

print("\nEvaluation Results:")
eval_path = "evals/"
if os.path.exists(eval_path):
    for f in sorted(os.listdir(eval_path)):  # sorted for a stable report order
        if f.endswith('.json'):
            with open(os.path.join(eval_path, f)) as file:
                results = json.load(file)
                print(f"  {f}: {results}")
else:
    print("  No evaluation results found. Run evaluation first.")

print("\n" + "=" * 60)