# 🧠 GRPO CPU Demo - Quick Start Guide

This notebook demonstrates how to use the GRPO-based reinforcement fine-tuning system on CPU hardware.

## What is GRPO?

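Group Relative Policy Optimization (GRPO) is a policy-gradient reinforcement learning algorithm that:
- Samples a group of completions for each prompt and scores each one relative to the rest of the group, which helps stabilize training
- Doesn't require a separate value function (unlike PPO), which reduces memory and compute
- Is light enough to run on CPU hardware with small models such as `distilgpt2`
- Can show improvements quickly (often within 20-30 steps)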
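
The group-relative scoring idea behind GRPO can be illustrated with a minimal sketch: rewards for several completions of the same prompt are centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; they are not taken from this project's trainer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by some reward function
rewards = [0.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group average get a positive advantage and are reinforced; those below get a negative one. No learned value function is involved, since the group itself serves as the baseline.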

## 🔧 Setup and Installation

First, let's check our environment and install dependencies:

In [None]:
# Check Python and PyTorch installation
import sys
print(f"Python version: {sys.version}")

try:
    import torch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Running on: {'GPU' if torch.cuda.is_available() else 'CPU'}")
except ImportError:
    print("PyTorch not installed. Please run: pip install torch --index-url https://download.pytorch.org/whl/cpu")

In [None]:
# Install required packages if not already installed
import subprocess
import sys

def install_package(package):
    try:
        __import__(package)
        print(f"✅ {package} is already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check key packages
packages = ["transformers", "datasets", "trl", "gradio", "accelerate"]
for package in packages:
    install_package(package)
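
The check above imports each package, which can be slow for large libraries. As an alternative sketch, `importlib.util.find_spec` can report whether a top-level module is available without importing it. The helper name `is_installed` is hypothetical, and the sketch assumes each package's import name matches its pip name, as it does for the five packages listed here.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ["transformers", "datasets", "trl", "gradio", "accelerate"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```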

## 🚀 Quick Demo: Training a Model with GRPO

Let's demonstrate the core functionality by training a small model on math problems:

In [None]:
# Import our GRPO trainer
import os
import sys

# Add src to path
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

from training.grpo_trainer import CPUGRPOTrainer, CPUGRPOConfig
from utils.grpo_utils import RewardFunctions, DatasetProcessor

print("✅ GRPO components imported successfully!")

In [None]:
# Configure the trainer for a quick demo
config = CPUGRPOConfig(
    model_name="distilgpt2",  # Small model for quick demo
    max_length=128,
    batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    output_dir="./demo_output"
)

print("📋 Configuration:")
print(f"  Model: {config.model_name}")
print(f"  Max length: {config.max_length}")
print(f"  Learning rate: {config.learning_rate}")
print(f"  Output directory: {config.output_dir}")

In [None]:
# Initialize the trainer
print("🔧 Initializing GRPO trainer...")
trainer = CPUGRPOTrainer(config)
print("✅ Trainer initialized!")

In [None]:
# Test the base model before training
test_prompt = "Solve this math problem: What is 2 + 2?"

print("🧪 Testing base model:")
print(f"Prompt: {test_prompt}")

base_response = trainer.generate_response(test_prompt, max_new_tokens=50)
print(f"Base model response: {base_response}")

In [None]:
# Create a small dataset for quick training
from datasets import Dataset

# Simple math problems for demonstration
math_prompts = [
    "Solve this: 1 + 1 = ?",
    "What is 3 + 2?",
    "Calculate: 5 - 3 = ?",
    "What is 2 × 4?",
    "Solve: 10 ÷ 2 = ?",
] * 10  # Repeat for more training data

dataset = Dataset.from_dict({"prompt": math_prompts})
print(f"📚 Created dataset with {len(dataset)} samples")
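
Repeating five prompts ten times gives 50 rows but only five distinct problems. A possible alternative, sketched below, generates varied arithmetic prompts together with their expected answers. The function `make_arithmetic_problems` and the `answer` field are hypothetical and are not part of `DatasetProcessor`.

```python
import random


def make_arithmetic_problems(n, seed=0):
    """Generate n single-digit arithmetic prompts with their expected answers."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    problems = []
    for _ in range(n):
        op = rng.choice(["+", "-", "×", "÷"])
        b = rng.randint(1, 9)
        if op == "÷":
            answer = rng.randint(1, 9)
            a = b * answer  # guarantees an exact integer quotient
        else:
            a = rng.randint(1, 9)
            if op == "+":
                answer = a + b
            elif op == "-":
                answer = a - b
            else:
                answer = a * b
        problems.append({"prompt": f"What is {a} {op} {b}?", "answer": str(answer)})
    return problems


problems = make_arithmetic_problems(50)
print(problems[0])
```

The resulting list of dicts could then be wrapped with `Dataset.from_list(problems)`, assuming the installed `datasets` version provides that constructor.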

In [None]:
# Create a reward function for math problems
reward_fn = RewardFunctions.math_reasoning_reward

# Test the reward function
test_responses = [
    "The answer is 4",
    "I don't know",
    "Let me calculate: 2 + 2 = 4"
]

rewards = reward_fn(test_responses)
print("🎯 Reward function test:")
for response, reward in zip(test_responses, rewards):
    print(f"  '{response}' → Reward: {reward:.2f}")
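
The internals of `RewardFunctions.math_reasoning_reward` are not shown in this notebook. To give a sense of what a math reward might look like, here is a toy sketch that scores a response by the last number it contains. The function `toy_math_reward`, its `expected_answer` parameter, and the reward values are all hypothetical.

```python
import re


def toy_math_reward(responses, expected_answer="4"):
    """Hypothetical reward: 1.0 if the last number matches the expected answer,
    0.2 if the response contains some other number, -0.5 if it has no number."""
    rewards = []
    for text in responses:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
        if numbers and numbers[-1] == expected_answer:
            rewards.append(1.0)
        elif numbers:
            rewards.append(0.2)
        else:
            rewards.append(-0.5)
    return rewards


test_responses = [
    "The answer is 4",
    "I don't know",
    "Let me calculate: 2 + 2 = 4",
]
print(toy_math_reward(test_responses))  # [1.0, -0.5, 1.0]
```

A graded reward like this gives the policy partial credit for attempting a numeric answer, which may provide a more useful signal early in training than a strict right-or-wrong score.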

## 🎓 Training the Model

**Note**: This is a simplified demo. For a full training run, use the web interface or command-line script.

Due to the complexity of setting up the full TRL training in a notebook, we'll demonstrate the components. For actual training, please use:

```bash
python app.py  # Web interface
# or
python train_grpo.py --samples 100 --dataset gsm8k
```

In [None]:
# Demonstrate the training setup (without actually training)
print("🎓 Training setup demonstration:")
print(f"  Dataset size: {len(dataset)}")
print(f"  Model: {config.model_name}")
print("  Task: Mathematical reasoning")
print("  Reward function: Math-specific rewards")
print("\n⚠️  For actual training, please use the web interface or command-line script")
print("   This ensures proper memory management and progress tracking.")

## 🌐 Web Interface Demo

The easiest way to use this system is through the web interface:

In [None]:
# Show how to launch the web interface
print("🌐 To launch the web interface:")
print("")
print("1. Open a terminal/command prompt")
print("2. Navigate to the project directory")
print("3. Run: python app.py")
print("4. Open http://localhost:7860 in your browser")
print("")
print("The web interface provides:")
print("  • 🎯 Training Setup - Configure and start training")
print("  • 🧪 Model Testing - Test and compare models")
print("  • 📚 Examples & Help - Documentation and examples")

## 🔧 Command-Line Usage

For scripted or automated runs, use the command-line script:

In [None]:
# Show command-line options
print("💻 Command-line training examples:")
print("")
print("# Basic training on GSM8K dataset:")
print("python train_grpo.py --dataset gsm8k --samples 200 --lr 1e-5")
print("")
print("# Custom training with specific parameters:")
print("python train_grpo.py --model distilgpt2 --task general --epochs 2")
print("")
print("# Get help with all options:")
print("python train_grpo.py --help")
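
To start a run from Python rather than a terminal, one option is to build the command as a list and pass it to `subprocess.run`. The sketch below uses only the `--dataset`, `--samples`, and `--lr` flags shown above. The helper `build_train_command` is hypothetical, and the script is only invoked if `train_grpo.py` exists in the current directory.

```python
import subprocess
import sys
from pathlib import Path


def build_train_command(dataset="gsm8k", samples=200, lr=1e-5):
    """Assemble the training command as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "train_grpo.py",
        "--dataset", dataset,
        "--samples", str(samples),
        "--lr", str(lr),
    ]


cmd = build_train_command()
print(" ".join(cmd))

# Only launch training when the script is actually present
if Path("train_grpo.py").exists():
    subprocess.run(cmd, check=True)
```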

## 📊 Understanding GRPO Results

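When you run GRPO training, the logs report the loss and mean reward at each step. The values below are illustrative, not output from a real run: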

In [None]:
# Example of what training logs look like
example_logs = [
    {"step": 1, "loss": 2.45, "mean_reward": -0.2},
    {"step": 10, "loss": 1.89, "mean_reward": 0.1},
    {"step": 20, "loss": 1.23, "mean_reward": 0.4},
    {"step": 30, "loss": 0.87, "mean_reward": 0.7},
]

print("📈 Example training progress:")
print("Step | Loss  | Mean Reward | Notes")
print("-----|-------|-------------|------")
for log in example_logs:
    step = log["step"]
    loss = log["loss"]
    reward = log["mean_reward"]
    
    if step == 1:
        note = "Starting training"
    elif step == 20:
        note = "'Aha moment' - major improvement"
    elif step == 30:
        note = "Good convergence"
    else:
        note = "Gradual improvement"
    
    print(f"{step:4d} | {loss:.2f}  | {reward:+.1f}        | {note}")
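
When reading logs like these, the change in mean reward between logged steps is often more informative than any single value. The sketch below computes those changes for the same illustrative numbers. The helper `reward_deltas` is hypothetical, and the log format is assumed to match the dictionaries above.

```python
def reward_deltas(logs):
    """Return (step, change in mean reward since the previous logged step)."""
    ordered = sorted(logs, key=lambda entry: entry["step"])
    return [
        (later["step"], later["mean_reward"] - earlier["mean_reward"])
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Same illustrative values as the example logs; not from a real run
illustrative_logs = [
    {"step": 1, "loss": 2.45, "mean_reward": -0.2},
    {"step": 10, "loss": 1.89, "mean_reward": 0.1},
    {"step": 20, "loss": 1.23, "mean_reward": 0.4},
    {"step": 30, "loss": 0.87, "mean_reward": 0.7},
]

deltas = reward_deltas(illustrative_logs)
for step, delta in deltas:
    print(f"step {step:2d}: mean reward changed by {delta:+.2f}")
```

A consistently positive change suggests the policy is improving under the chosen reward function. Mean reward is usually the more direct signal to watch in this kind of training.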

## 🎯 Key Takeaways

1. **GRPO is CPU-friendly**: No GPU required to train small models such as `distilgpt2`
2. **Quick results**: Often shows improvement within 20-30 steps
3. **Math tasks work well**: Specialized reward functions for mathematical reasoning
4. **Easy to use**: Web interface for non-technical users
5. **Flexible**: Command-line interface for advanced users

## 🚀 Next Steps

1. **Try the web interface**: `python app.py`
2. **Experiment with different models**: Try `Qwen/Qwen2-0.5B-Instruct` or `distilgpt2`
3. **Create custom reward functions**: Modify `src/utils/grpo_utils.py`
4. **Scale up**: Use larger datasets and longer training

## 📚 Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [Unsloth Reinforcement Learning Guide (covers GRPO)](https://docs.unsloth.ai/basics/reinforcement-learning-guide)
- [Gradio Documentation](https://gradio.app/docs)

---

**Happy training! 🎉**