# VERL Complete Training Notebook

This notebook provides a complete interface for training language models using **verl** (Volcano Engine Reinforcement Learning).

## Supported Algorithms
- **GRPO** (Group Relative Policy Optimization)
- **PPO** (Proximal Policy Optimization)
- **REINFORCE++**
- **RLOO** (REINFORCE Leave-One-Out)
- **ReMax** (Reward Maximization)

## Supported Backends
- **vLLM** - Mature, stable, PagedAttention
- **SGLang** - Fast, RadixAttention, better for multi-turn

## How to Use
1. Run Section 0 to install verl (takes ~5-10 minutes)
2. Run Section 1 to detect your hardware
3. Edit and run Section 1.5 to choose your backend (vLLM or SGLang)
4. Edit Section 2 to configure cluster (single GPU / multi GPU / multi node)
5. Edit Section 3 to set data paths and model
6. Run the algorithm section you want (4-8)
7. Monitor training in Section 9
8. Upload to HuggingFace in Section 11

**Note**: Only run the sections you need. Each algorithm section (4-8) is independent.

---
## Section 0: Installation

⚠️ **IMPORTANT**: This cell takes 5-10 minutes to run. Run it only once.

Choose your inference backend:
- **vLLM**: More mature, stable, good documentation
- **SGLang**: Faster, better caching, good for multi-turn
- **Both**: Install both to switch easily

In [None]:
# CHOOSE YOUR INSTALLATION
# Uncomment ONE of the following:

# Option 1: Install with vLLM
# !pip install "verl[vllm,gpu,math]" jupyter ipywidgets matplotlib tensorboard -q

# Option 2: Install with SGLang
# !pip install "verl[sglang,gpu,math]" jupyter ipywidgets matplotlib tensorboard -q

# Option 3: Install both (recommended - can switch easily)
# The extras are quoted so shells that treat [] as a glob pattern do not expand them
!pip install "verl[vllm,sglang,gpu,math]" jupyter ipywidgets matplotlib tensorboard -q

print("✅ Installation complete!")

---
## Section 1: Hardware Detection

Auto-detect GPUs, CUDA version, bf16 support, and memory.
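The internals of `detect_hardware` are not shown in this notebook. As a rough illustration of what GPU detection can look like, the sketch below queries `nvidia-smi` and parses its CSV output. The helper names `parse_gpu_csv` and `list_gpus` and the sample string are hypothetical and are not part of `notebook_utils`.

```python
import shutil
import subprocess

# nvidia-smi query that prints one "name, memory_in_MiB" line per GPU
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpu_csv(text):
    """Parse 'name, memory_mib' lines into a list of dicts (hypothetical helper)."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas stay intact
        name, mem_mib = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_gb": round(int(mem_mib) / 1024, 1)})
    return gpus


def list_gpus():
    """Return GPU info, or an empty list when nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


# Illustrative sample only, not output captured from a real machine
sample = "NVIDIA A100-SXM4-80GB, 81920\nNVIDIA A100-SXM4-80GB, 81920\n"
print(parse_gpu_csv(sample))
```

Keeping the parser separate from the subprocess call makes it possible to check the parsing logic on a machine without GPUs.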

In [None]:
import sys
# Adjust this path to the directory that contains notebook_utils.py on your machine
sys.path.insert(0, '/home/user/verl/notebooks')

from notebook_utils import (
    detect_hardware,
    detect_available_backends,
    get_recommended_config,
    print_hardware_summary,
    print_backend_summary,
    check_verl_installation,
    check_dependencies,
)

# Check verl installation
is_installed, version_or_msg = check_verl_installation()
if is_installed:
    print(f"✅ verl {version_or_msg} is installed\n")
else:
    print(f"❌ {version_or_msg}\n")
    raise ImportError("Please install verl first (see Section 0)")

# Detect hardware
HARDWARE_INFO = detect_hardware()
print_hardware_summary(HARDWARE_INFO)

# Detect backends
AVAILABLE_BACKENDS = detect_available_backends()
print_backend_summary(AVAILABLE_BACKENDS)

# Get recommended config based on hardware
RECOMMENDED_CONFIG = get_recommended_config(HARDWARE_INFO)

print("\n" + "="*70)
print("RECOMMENDED CONFIGURATION")
print("="*70)
for key, value in RECOMMENDED_CONFIG.items():
    print(f"{key:35s}: {value}")
print("="*70)

---
## Section 1.5: Backend Selection

Choose your inference backend: **vLLM** or **SGLang**

### Backend Comparison

| Feature | vLLM | SGLang |
|---------|------|--------|
| Maturity | ⭐⭐⭐⭐⭐ More stable | ⭐⭐⭐⭐ Newer |
| Speed | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐⭐ Faster |
| Multi-turn | ⭐⭐⭐ Good | ⭐⭐⭐⭐⭐ Excellent |
| Caching | PagedAttention | RadixAttention (better) |
| Model Support | ⭐⭐⭐⭐⭐ Wide | ⭐⭐⭐⭐ Growing |

**Quick Recommendation**:
- Use **vLLM** if you want stability and wide model support
- Use **SGLang** if you want maximum speed and better caching

In [None]:
from notebook_utils import get_backend_config

# ===================================================================
# CHOOSE YOUR BACKEND
# ===================================================================

# Uncomment ONE of the following:
BACKEND = 'sglang'  # Fast, better caching
# BACKEND = 'vllm'   # Stable, mature

# ===================================================================

# Validate backend is installed
if not AVAILABLE_BACKENDS.get(BACKEND, False):
    raise RuntimeError(
        f"❌ {BACKEND.upper()} is not installed!\n"
        f"Install it with: pip install verl[{BACKEND}]"
    )

# Get backend-specific configuration
BACKEND_CONFIG = get_backend_config(BACKEND, HARDWARE_INFO)

print(f"✅ Using {BACKEND.upper()} for inference/rollout generation")
print("\nBackend Configuration:")
print("="*70)
for key, value in BACKEND_CONFIG.items():
    print(f"{key:50s}: {value}")
print("="*70)

---
## Section 2: Cluster Configuration

Configure your compute resources:
- **Single GPU**: For testing or small models
- **Single Node Multi-GPU**: Most common setup (8x A100, etc.)
- **Multi-Node Multi-GPU**: For very large models (70B+)

In [None]:
from notebook_utils import get_cluster_template

# ===================================================================
# CHOOSE YOUR CLUSTER MODE
# ===================================================================

# Uncomment ONE of the following:

# Option 1: Single GPU (for testing)
# CLUSTER_CONFIG = get_cluster_template('single_gpu', HARDWARE_INFO)

# Option 2: Single node with multiple GPUs (most common)
CLUSTER_CONFIG = get_cluster_template('single_node_multi_gpu', HARDWARE_INFO)

# Option 3: Multi-node with multiple GPUs each
# CLUSTER_CONFIG = get_cluster_template('multi_node_multi_gpu', HARDWARE_INFO)
# IMPORTANT: For multi-node, you MUST set the Ray head node address:
# CLUSTER_CONFIG['ray_kwargs.ray_init.address'] = '192.168.1.100:6379'  # Replace with your head node IP

# ===================================================================

print("Cluster Configuration:")
print("="*70)
for key, value in CLUSTER_CONFIG.items():
    print(f"{key:50s}: {value}")
print("="*70)

total_gpus = CLUSTER_CONFIG['trainer.n_gpus_per_node'] * CLUSTER_CONFIG['trainer.nnodes']
print(f"\nüìä Total GPUs to be used: {total_gpus}")

---
## Section 3: Data & Model Configuration

Set your data paths, model, and training hyperparameters.

**Edit the dictionaries below** with your actual paths and settings.

In [None]:
import os

# ===================================================================
# DATA CONFIGURATION - EDIT THESE PATHS
# ===================================================================

DATA_CONFIG = {
    'train_files': os.path.expanduser('~/data/gsm8k/train.parquet'),
    'val_files': os.path.expanduser('~/data/gsm8k/test.parquet'),
    'max_prompt_length': 512,
    'max_response_length': 1024,
}

# ===================================================================
# MODEL CONFIGURATION - EDIT THESE
# ===================================================================

MODEL_CONFIG = {
    'model_path': 'Qwen/Qwen3-8B',  # HuggingFace model or local path
    'output_dir': './checkpoints',   # Where to save checkpoints
}

# ===================================================================
# TRAINING HYPERPARAMETERS - EDIT AS NEEDED
# ===================================================================

TRAINING_CONFIG = {
    'learning_rate': 1e-6,
    'total_epochs': 15,
    'save_freq': 20,      # Save checkpoint every N steps
    'test_freq': 5,       # Run validation every N steps
    'project_name': 'verl_notebook_training',
    'experiment_name': 'gsm8k_qwen3_8b',
}

# ===================================================================
# LOGGING CONFIGURATION
# ===================================================================

LOGGING_CONFIG = {
    'logger': '["console","wandb"]',  # Options: console, wandb, tensorboard, mlflow
    'wandb_api_key': None,  # Set this or use 'wandb login' command
}

# Display configuration
print("Configuration Summary:")
print("="*70)
print(f"Model: {MODEL_CONFIG['model_path']}")
print(f"Training data: {DATA_CONFIG['train_files']}")
print(f"Validation data: {DATA_CONFIG['val_files']}")
print(f"Learning rate: {TRAINING_CONFIG['learning_rate']}")
print(f"Total epochs: {TRAINING_CONFIG['total_epochs']}")
print(f"Output directory: {MODEL_CONFIG['output_dir']}")
print("="*70)

---
# ALGORITHM SECTIONS

Run **ONLY ONE** of the following algorithm sections (4-8) based on your needs.

Each section is self-contained and will:
1. Create the configuration
2. Initialize Ray cluster
3. Start training
4. Monitor progress
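The algorithm sections pass overrides as dotted keys such as `'trainer.total_epochs'`. Whether `create_config_dict` returns them flat or already nested is not shown here. If they stay flat, they may need expanding into nested dictionaries before merging with the base YAML config, since a dictionary key containing dots is not necessarily interpreted as a nested path. The helper below is a hypothetical sketch of that expansion and is not part of `notebook_utils` or verl.

```python
def expand_dotted_keys(flat):
    """Turn {'a.b.c': 1} into {'a': {'b': {'c': 1}}} (hypothetical helper)."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        # Walk or create each intermediate dictionary along the path
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


overrides = {
    "trainer.total_epochs": 15,
    "trainer.save_freq": 20,
    "actor_rollout_ref.rollout.n": 5,
}
print(expand_dotted_keys(overrides))
```

After a merge, it is worth printing one overridden value (for example the epoch count) to confirm the override actually took effect.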

---
## Section 4: GRPO (Group Relative Policy Optimization)

GRPO is an on-policy algorithm that needs no critic model: it samples several responses per prompt and uses the group's reward statistics as the baseline.

**Best for**: Quick experimentation, lower memory usage

**Key parameters to edit**:
- `actor_rollout_ref.rollout.n`: Number of responses to sample per prompt (default: 5)
- `actor_rollout_ref.actor.use_kl_loss`: Whether to use KL divergence loss
- `actor_rollout_ref.actor.kl_loss_coef`: KL loss coefficient
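To make the role of `rollout.n` concrete, here is a minimal sketch of the group-relative advantage idea: the rewards of the `n` responses to one prompt are standardized against that group's mean and standard deviation. This illustrates the general technique, not verl's implementation. The function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Five sampled responses for one prompt, scored 1.0 (correct) or 0.0 (wrong)
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

If every response in a group gets the same reward, all advantages are zero and that prompt contributes no policy gradient.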

In [None]:
from notebook_utils import create_config_dict
from omegaconf import OmegaConf

# ===================================================================
# GRPO CONFIGURATION
# ===================================================================

GRPO_CONFIG = create_config_dict(
    algorithm='grpo',
    model_path=MODEL_CONFIG['model_path'],
    train_files=DATA_CONFIG['train_files'],
    val_files=DATA_CONFIG['val_files'],
    backend_config=BACKEND_CONFIG,
    cluster_config=CLUSTER_CONFIG,
    recommended_config=RECOMMENDED_CONFIG,
    
    # GRPO-specific settings (edit as needed)
    **{
        'actor_rollout_ref.rollout.n': 5,  # Sample 5 responses per prompt
        'actor_rollout_ref.actor.use_kl_loss': True,
        'actor_rollout_ref.actor.kl_loss_coef': 0.001,
        'actor_rollout_ref.actor.kl_loss_type': 'low_var_kl',
        'actor_rollout_ref.actor.entropy_coeff': 0,
        'algorithm.use_kl_in_reward': False,
        'trainer.critic_warmup': 0,
        'trainer.project_name': TRAINING_CONFIG['project_name'],
        'trainer.experiment_name': f"{TRAINING_CONFIG['experiment_name']}_grpo",
        'trainer.total_epochs': TRAINING_CONFIG['total_epochs'],
        'trainer.save_freq': TRAINING_CONFIG['save_freq'],
        'trainer.test_freq': TRAINING_CONFIG['test_freq'],
    }
)

print("GRPO Configuration created successfully!")
print(f"Using {BACKEND.upper()} backend with {CLUSTER_CONFIG['trainer.n_gpus_per_node']} GPUs")

In [None]:
# ===================================================================
# START GRPO TRAINING
# ===================================================================

from verl.trainer.main_ppo import run_ppo

# Load base config and merge with GRPO config
base_config = OmegaConf.load('/home/user/verl/verl/trainer/config/ppo_trainer.yaml')
config = OmegaConf.merge(base_config, OmegaConf.create(GRPO_CONFIG))

print("üöÄ Starting GRPO training...")
print(f"   Model: {MODEL_CONFIG['model_path']}")
print(f"   Backend: {BACKEND.upper()}")
print(f"   Epochs: {TRAINING_CONFIG['total_epochs']}")
print(f"   Checkpoint dir: {MODEL_CONFIG['output_dir']}")
print("\nTraining will begin in 5 seconds...\n")

import time
time.sleep(5)

# Start training
run_ppo(config)

---
## Section 5: PPO (Proximal Policy Optimization)

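PPO trains a critic model alongside the actor to estimate values, which are then used to compute advantages (here via GAE, Generalized Advantage Estimation).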

**Best for**: More stable training, better sample efficiency

**Key parameters to edit**:
- `critic.optim.lr`: Critic learning rate
- `critic.model.path`: Critic model (usually same as actor)
- `reward_model.enable`: Whether to use a separate reward model
- `trainer.critic_warmup`: Number of warmup steps for critic
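The critic's value estimates feed into Generalized Advantage Estimation. The sketch below shows the standard GAE recursion on a single short trajectory. It is a plain-Python illustration, not verl's code, and the function name and default `gamma`/`lam` values are assumptions.

```python
def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Compute GAE advantages and returns for one trajectory (illustrative)."""
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    # Work backwards so each step can reuse the advantage of the step after it
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Reward arrives only at the final step; the critic predicts 0.5 everywhere
advantages, returns = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
print(advantages, returns)
```

With `gamma = lam = 1`, each advantage reduces to the total future reward minus the critic's value at that step.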

In [None]:
from notebook_utils import create_config_dict
from omegaconf import OmegaConf

# ===================================================================
# PPO CONFIGURATION
# ===================================================================

PPO_CONFIG = create_config_dict(
    algorithm='gae',  # GAE (Generalized Advantage Estimation) for PPO
    model_path=MODEL_CONFIG['model_path'],
    train_files=DATA_CONFIG['train_files'],
    val_files=DATA_CONFIG['val_files'],
    backend_config=BACKEND_CONFIG,
    cluster_config=CLUSTER_CONFIG,
    recommended_config=RECOMMENDED_CONFIG,
    
    # PPO-specific settings (edit as needed)
    **{
        # Critic configuration
        'critic.optim.lr': 1e-5,
        'critic.model.path': MODEL_CONFIG['model_path'],  # Same as actor
        'critic.model.use_remove_padding': True,
        'critic.model.enable_gradient_checkpointing': RECOMMENDED_CONFIG['enable_gradient_checkpointing'],
        'critic.ppo_micro_batch_size_per_gpu': RECOMMENDED_CONFIG['ppo_micro_batch_size_per_gpu'],
        'critic.model.fsdp_config.param_offload': RECOMMENDED_CONFIG['param_offload'],
        'critic.model.fsdp_config.optimizer_offload': RECOMMENDED_CONFIG['optimizer_offload'],
        
        # Optional: Reward model (set enable=True to use)
        'reward_model.enable': False,  # Set to True if you have a reward model
        # 'reward_model.model.path': 'path/to/reward/model',  # Uncomment if using reward model
        
        # Training
        'actor_rollout_ref.actor.use_kl_loss': False,
        'algorithm.use_kl_in_reward': False,
        'trainer.critic_warmup': 0,
        'trainer.project_name': TRAINING_CONFIG['project_name'],
        'trainer.experiment_name': f"{TRAINING_CONFIG['experiment_name']}_ppo",
        'trainer.total_epochs': TRAINING_CONFIG['total_epochs'],
        'trainer.save_freq': TRAINING_CONFIG['save_freq'],
        'trainer.test_freq': TRAINING_CONFIG['test_freq'],
    }
)

print("PPO Configuration created successfully!")
print(f"Using {BACKEND.upper()} backend with {CLUSTER_CONFIG['trainer.n_gpus_per_node']} GPUs")

In [None]:
# ===================================================================
# START PPO TRAINING
# ===================================================================

from verl.trainer.main_ppo import run_ppo

# Load base config and merge with PPO config
base_config = OmegaConf.load('/home/user/verl/verl/trainer/config/ppo_trainer.yaml')
config = OmegaConf.merge(base_config, OmegaConf.create(PPO_CONFIG))

print("üöÄ Starting PPO training...")
print(f"   Model: {MODEL_CONFIG['model_path']}")
print(f"   Backend: {BACKEND.upper()}")
print(f"   Epochs: {TRAINING_CONFIG['total_epochs']}")
print(f"   Checkpoint dir: {MODEL_CONFIG['output_dir']}")
print("\nTraining will begin in 5 seconds...\n")

import time
time.sleep(5)

# Start training
run_ppo(config)

---
## Section 6: REINFORCE++

REINFORCE++ is an improved version of the REINFORCE algorithm with variance reduction.

**Best for**: Simpler implementation, good baseline

**Key parameters to edit**:
- `actor_rollout_ref.rollout.n`: Number of samples per prompt
- `algorithm.advantage_normalization`: Whether to normalize advantages
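Advantage normalization in this setting is usually applied across the whole batch rather than within each prompt's group. The following sketch shows that idea under stated assumptions: the function name and `eps` are hypothetical, and it does not reproduce verl's implementation.

```python
from statistics import mean, pstdev


def normalize_batch_advantages(advantages, eps=1e-8):
    """Standardize advantages over the full batch (all prompts together)."""
    mu = mean(advantages)
    sigma = pstdev(advantages)
    return [(a - mu) / (sigma + eps) for a in advantages]


# Raw advantages from several prompts, flattened into one batch
batch = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
normalized = normalize_batch_advantages(batch)
print(normalized)
```

Normalizing over the batch keeps the gradient scale stable even when reward magnitudes differ from prompt to prompt.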

In [None]:
from notebook_utils import create_config_dict
from omegaconf import OmegaConf

# ===================================================================
# REINFORCE++ CONFIGURATION
# ===================================================================

REINFORCE_PP_CONFIG = create_config_dict(
    algorithm='reinforce_plus_plus',
    model_path=MODEL_CONFIG['model_path'],
    train_files=DATA_CONFIG['train_files'],
    val_files=DATA_CONFIG['val_files'],
    backend_config=BACKEND_CONFIG,
    cluster_config=CLUSTER_CONFIG,
    recommended_config=RECOMMENDED_CONFIG,
    
    # REINFORCE++-specific settings
    **{
        'actor_rollout_ref.rollout.n': 8,  # Sample 8 responses per prompt
        'algorithm.advantage_normalization': True,
        'actor_rollout_ref.actor.use_kl_loss': True,
        'actor_rollout_ref.actor.kl_loss_coef': 0.001,
        'algorithm.use_kl_in_reward': False,
        'trainer.critic_warmup': 0,
        'trainer.project_name': TRAINING_CONFIG['project_name'],
        'trainer.experiment_name': f"{TRAINING_CONFIG['experiment_name']}_reinforce_pp",
        'trainer.total_epochs': TRAINING_CONFIG['total_epochs'],
        'trainer.save_freq': TRAINING_CONFIG['save_freq'],
        'trainer.test_freq': TRAINING_CONFIG['test_freq'],
    }
)

print("REINFORCE++ Configuration created successfully!")

In [None]:
# ===================================================================
# START REINFORCE++ TRAINING
# ===================================================================

from verl.trainer.main_ppo import run_ppo

base_config = OmegaConf.load('/home/user/verl/verl/trainer/config/ppo_trainer.yaml')
config = OmegaConf.merge(base_config, OmegaConf.create(REINFORCE_PP_CONFIG))

print("üöÄ Starting REINFORCE++ training...")
print(f"   Model: {MODEL_CONFIG['model_path']}")
print(f"   Backend: {BACKEND.upper()}")
print(f"   Epochs: {TRAINING_CONFIG['total_epochs']}")

import time
time.sleep(5)

run_ppo(config)

---
## Section 7: RLOO (REINFORCE Leave-One-Out)

RLOO uses a leave-one-out baseline for variance reduction: each response's baseline is the mean reward of the other responses sampled for the same prompt.

**Best for**: Low variance, good sample efficiency

**Key parameters to edit**:
- `actor_rollout_ref.rollout.n`: Number of samples (typically higher, e.g., 16)
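The leave-one-out baseline can be sketched in a few lines: for each of the `n` responses to a prompt, subtract the mean reward of the remaining `n - 1` responses. The function name is hypothetical, and this is an illustration of the technique rather than verl's code.

```python
def rloo_advantages(rewards):
    """Advantage of each response against the mean reward of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean reward of all responses except this one
    return [r - (total - r) / (n - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

Because the baseline for a response never includes its own reward, the estimate stays unbiased, which is why larger `n` values mainly serve to reduce variance.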

In [None]:
from notebook_utils import create_config_dict
from omegaconf import OmegaConf

# ===================================================================
# RLOO CONFIGURATION
# ===================================================================

RLOO_CONFIG = create_config_dict(
    algorithm='rloo',
    model_path=MODEL_CONFIG['model_path'],
    train_files=DATA_CONFIG['train_files'],
    val_files=DATA_CONFIG['val_files'],
    backend_config=BACKEND_CONFIG,
    cluster_config=CLUSTER_CONFIG,
    recommended_config=RECOMMENDED_CONFIG,
    
    # RLOO-specific settings
    **{
        'actor_rollout_ref.rollout.n': 16,  # Higher sample count for better baseline
        'actor_rollout_ref.actor.use_kl_loss': True,
        'actor_rollout_ref.actor.kl_loss_coef': 0.001,
        'algorithm.use_kl_in_reward': False,
        'trainer.critic_warmup': 0,
        'trainer.project_name': TRAINING_CONFIG['project_name'],
        'trainer.experiment_name': f"{TRAINING_CONFIG['experiment_name']}_rloo",
        'trainer.total_epochs': TRAINING_CONFIG['total_epochs'],
        'trainer.save_freq': TRAINING_CONFIG['save_freq'],
        'trainer.test_freq': TRAINING_CONFIG['test_freq'],
    }
)

print("RLOO Configuration created successfully!")

In [None]:
# ===================================================================
# START RLOO TRAINING
# ===================================================================

from verl.trainer.main_ppo import run_ppo

base_config = OmegaConf.load('/home/user/verl/verl/trainer/config/ppo_trainer.yaml')
config = OmegaConf.merge(base_config, OmegaConf.create(RLOO_CONFIG))

print("üöÄ Starting RLOO training...")
print(f"   Model: {MODEL_CONFIG['model_path']}")
print(f"   Backend: {BACKEND.upper()}")
print(f"   Epochs: {TRAINING_CONFIG['total_epochs']}")

import time
time.sleep(5)

run_ppo(config)

---
## Section 8: ReMax (Reward Maximization)

ReMax focuses on direct reward maximization with sequence balancing.

**Best for**: Reward maximization tasks

**Key parameters to edit**:
- `algorithm.remax_alpha`: Temperature parameter for ReMax
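For orientation, the ReMax paper describes a baseline taken from the reward of the greedy (argmax-decoded) response to the same prompt. The sketch below shows only that baseline subtraction. It does not model `remax_alpha`, whose exact role is not documented in this notebook, and the function name is hypothetical.

```python
def remax_advantages(sampled_rewards, greedy_reward):
    """Subtract the greedy response's reward from each sampled response's reward."""
    return [r - greedy_reward for r in sampled_rewards]


# Two sampled responses compared against a greedy response that scored 0.5
advantages = remax_advantages([1.0, 0.0], greedy_reward=0.5)
print(advantages)
```

A sampled response is reinforced only if it scores higher than the greedy decode of the current policy.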

In [None]:
from notebook_utils import create_config_dict
from omegaconf import OmegaConf

# ===================================================================
# REMAX CONFIGURATION
# ===================================================================

REMAX_CONFIG = create_config_dict(
    algorithm='remax',
    model_path=MODEL_CONFIG['model_path'],
    train_files=DATA_CONFIG['train_files'],
    val_files=DATA_CONFIG['val_files'],
    backend_config=BACKEND_CONFIG,
    cluster_config=CLUSTER_CONFIG,
    recommended_config=RECOMMENDED_CONFIG,
    
    # ReMax-specific settings
    **{
        'actor_rollout_ref.rollout.n': 8,
        'algorithm.remax_alpha': 0.01,  # Temperature parameter
        'actor_rollout_ref.actor.use_kl_loss': True,
        'actor_rollout_ref.actor.kl_loss_coef': 0.001,
        'trainer.critic_warmup': 0,
        'trainer.project_name': TRAINING_CONFIG['project_name'],
        'trainer.experiment_name': f"{TRAINING_CONFIG['experiment_name']}_remax",
        'trainer.total_epochs': TRAINING_CONFIG['total_epochs'],
        'trainer.save_freq': TRAINING_CONFIG['save_freq'],
        'trainer.test_freq': TRAINING_CONFIG['test_freq'],
    }
)

print("ReMax Configuration created successfully!")

In [None]:
# ===================================================================
# START REMAX TRAINING
# ===================================================================

from verl.trainer.main_ppo import run_ppo

base_config = OmegaConf.load('/home/user/verl/verl/trainer/config/ppo_trainer.yaml')
config = OmegaConf.merge(base_config, OmegaConf.create(REMAX_CONFIG))

print("üöÄ Starting ReMax training...")
print(f"   Model: {MODEL_CONFIG['model_path']}")
print(f"   Backend: {BACKEND.upper()}")
print(f"   Epochs: {TRAINING_CONFIG['total_epochs']}")

import time
time.sleep(5)

run_ppo(config)

---
## Section 9: Monitoring & Visualization

Monitor training progress and visualize metrics.

In [None]:
# Load tensorboard extension
%load_ext tensorboard

# Launch tensorboard (change logdir to your checkpoint directory)
# %tensorboard --logdir ./checkpoints

print("TensorBoard loaded. Uncomment the line above to launch.")

In [None]:
# Plot training curves
import matplotlib.pyplot as plt
import glob
import os

# Example: Plot rewards from checkpoint logs
# You can customize this based on your logging format

def plot_training_metrics(checkpoint_dir):
    """Plot training metrics from checkpoint directory"""
    # This is a placeholder - adapt based on your actual logging format
    print(f"Looking for metrics in: {checkpoint_dir}")
    
    # Example plot
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    axes[0, 0].set_title('Average Reward')
    axes[0, 0].set_xlabel('Step')
    axes[0, 0].set_ylabel('Reward')
    
    axes[0, 1].set_title('Policy Loss')
    axes[0, 1].set_xlabel('Step')
    axes[0, 1].set_ylabel('Loss')
    
    axes[1, 0].set_title('KL Divergence')
    axes[1, 0].set_xlabel('Step')
    axes[1, 0].set_ylabel('KL')
    
    axes[1, 1].set_title('Learning Rate')
    axes[1, 1].set_xlabel('Step')
    axes[1, 1].set_ylabel('LR')
    
    plt.tight_layout()
    plt.show()

# Uncomment to plot
# plot_training_metrics(MODEL_CONFIG['output_dir'])

---
## Section 10: Checkpoint Management

List, load, and inspect training checkpoints.
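Checkpoint directories often contain nested subdirectories, so a size estimate should walk the whole tree rather than only the top level. A minimal sketch, using a temporary directory with made-up file names to stand in for a real checkpoint:

```python
import os
import tempfile


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# Build a throwaway directory tree; the names here are purely illustrative
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "actor"))
    with open(os.path.join(tmp, "actor", "weights.bin"), "wb") as fh:
        fh.write(b"\0" * 2048)
    with open(os.path.join(tmp, "meta.json"), "w") as fh:
        fh.write("{}")
    demo_size = dir_size_bytes(tmp)

print(f"{demo_size} bytes")
```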

In [None]:
import os
import glob

# List all checkpoints
checkpoint_dir = MODEL_CONFIG['output_dir']

if os.path.exists(checkpoint_dir):
    checkpoints = sorted(glob.glob(os.path.join(checkpoint_dir, '*')))
    
    print("Available Checkpoints:")
    print("="*70)
    for i, ckpt in enumerate(checkpoints):
        # Walk the tree so files in nested subdirectories are counted too
        size_mb = sum(
            os.path.getsize(os.path.join(root, f))
            for root, _, files in os.walk(ckpt)
            for f in files
        ) / (1024**2)
        print(f"{i+1}. {os.path.basename(ckpt):30s} ({size_mb:.1f} MB)")
    print("="*70)
else:
    print(f"No checkpoints found in {checkpoint_dir}")

In [None]:
# Load a checkpoint for inspection
from transformers import AutoModelForCausalLM, AutoTokenizer

# Edit this to the checkpoint you want to load
CHECKPOINT_TO_LOAD = os.path.join(checkpoint_dir, 'epoch_15')  # Example

if os.path.exists(CHECKPOINT_TO_LOAD):
    print(f"Loading checkpoint: {CHECKPOINT_TO_LOAD}")
    
    # Load model and tokenizer (uncomment to actually load)
    # model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_TO_LOAD)
    # tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_TO_LOAD)
    
    print("✅ Checkpoint found.")
    print("(Loading code is commented out; uncomment it to load the model.)")
else:
    print(f"Checkpoint not found: {CHECKPOINT_TO_LOAD}")

---
## Section 11: Upload to HuggingFace

Upload your trained model to HuggingFace Hub for easy sharing and deployment.

In [None]:
# Login to HuggingFace
!huggingface-cli login

In [None]:
from huggingface_hub import HfApi, create_repo
import os

# ===================================================================
# HUGGINGFACE UPLOAD CONFIGURATION - EDIT THESE
# ===================================================================

HF_CONFIG = {
    'checkpoint_path': os.path.join(checkpoint_dir, 'epoch_15'),  # Your trained model
    'repo_id': 'your-username/qwen3-8b-gsm8k-grpo',  # EDIT THIS: your HF repo name
    'private': False,  # Set to True for private repo
    'commit_message': 'Upload trained model',
}

# ===================================================================
# UPLOAD TO HUGGINGFACE
# ===================================================================

def upload_to_huggingface(config):
    """Upload model to HuggingFace Hub"""
    
    if not os.path.exists(config['checkpoint_path']):
        raise FileNotFoundError(f"Checkpoint not found: {config['checkpoint_path']}")
    
    print(f"Uploading {config['checkpoint_path']} to {config['repo_id']}...")
    
    # Create repository
    api = HfApi()
    try:
        create_repo(
            repo_id=config['repo_id'],
            private=config['private'],
            exist_ok=True,
        )
        print(f"✅ Repository created/verified: {config['repo_id']}")
    except Exception as e:
        print(f"Error creating repo: {e}")
        return
    
    # Upload folder
    try:
        api.upload_folder(
            folder_path=config['checkpoint_path'],
            repo_id=config['repo_id'],
            repo_type='model',
            commit_message=config['commit_message'],
        )
        print(f"\n✅ Model uploaded successfully!")
        print(f"🔗 View at: https://huggingface.co/{config['repo_id']}")
    except Exception as e:
        print(f"Error uploading: {e}")

# Uncomment to upload
# upload_to_huggingface(HF_CONFIG)

---
## Section 12: Cleanup

Clean up resources after training.

In [None]:
import ray
import torch
import gc

# Shutdown Ray cluster
if ray.is_initialized():
    ray.shutdown()
    print("✅ Ray cluster shutdown")

# Clear GPU memory: collect garbage first so freed tensors can be released from the cache
if torch.cuda.is_available():
    gc.collect()
    torch.cuda.empty_cache()
    print("✅ GPU memory cleared")

print("\nüéâ Cleanup complete!")

---
## Additional Resources

- **verl Documentation**: https://verl.readthedocs.io/
- **GitHub**: https://github.com/volcengine/verl
- **Paper**: [HybridFlow](https://arxiv.org/abs/2409.19256)

## Getting Help

- Issues: https://github.com/volcengine/verl/issues
- Slack: https://join.slack.com/t/verl-project/...
- Twitter: @verl_project