# VERL Model Evaluation Notebook

This notebook helps you evaluate trained models from verl.

## Features
- Load trained checkpoints
- Run inference with vLLM or SGLang
- Benchmark on test sets
- Compare multiple checkpoints
- Generate sample outputs
- Compute metrics (accuracy, rewards, etc.)

## How to Use
1. Run the installation cell
2. Choose your backend (vLLM or SGLang)
3. Load your checkpoint
4. Run the evaluation sections

---
## Installation

In [None]:
# Install required packages
# Choose your backend:

# Option 1: vLLM
# !pip install transformers torch vllm pandas datasets -q

# Option 2: SGLang
# !pip install transformers torch sglang pandas datasets -q

# Option 3: Both
!pip install transformers torch vllm sglang pandas datasets -q

print("✅ Dependencies installed!")

---
## Section 1: Backend Selection

Choose your inference backend.

In [None]:
import importlib.util  # find_spec lives in the importlib.util submodule

# ===================================================================
# CHOOSE YOUR BACKEND
# ===================================================================

# Uncomment ONE:
BACKEND = 'vllm'
# BACKEND = 'sglang'

# ===================================================================

# Validate
available_backends = {
    'vllm': importlib.util.find_spec('vllm') is not None,
    'sglang': importlib.util.find_spec('sglang') is not None,
}

if not available_backends[BACKEND]:
    raise RuntimeError(f"❌ {BACKEND} not installed! Run: pip install {BACKEND}")

print(f"✅ Using {BACKEND.upper()} for inference")
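
The availability check in this cell can be wrapped in a small helper so that a mistyped backend name produces a clear error instead of a `KeyError`. The sketch below is illustrative only; `find_available_backends` and `select_backend` are hypothetical names, not part of verl, vLLM, or SGLang.

```python
import importlib.util


def find_available_backends(candidates=("vllm", "sglang")):
    """Map each candidate package name to True if it can be found on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in candidates}


def select_backend(requested, candidates=("vllm", "sglang")):
    """Return `requested` if it is a known, installed backend; raise otherwise."""
    if requested not in candidates:
        raise ValueError(f"Unknown backend {requested!r}; expected one of {candidates}")
    available = find_available_backends(candidates)
    if not available[requested]:
        installed = [name for name, ok in available.items() if ok]
        raise RuntimeError(
            f"{requested} is not installed (installed backends: {installed or 'none'})"
        )
    return requested
```

With this helper, `BACKEND = select_backend('vllm')` would replace the manual dictionary lookup.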

---
## Section 2: Load Checkpoint

Load your trained model checkpoint.

In [None]:
from transformers import AutoTokenizer  # the model itself is loaded by the backend
import os

# ===================================================================
# CHECKPOINT CONFIGURATION - EDIT THESE
# ===================================================================

CHECKPOINT_CONFIG = {
    'checkpoint_path': './checkpoints/epoch_15',  # Edit this
    # OR use HuggingFace model:
    # 'checkpoint_path': 'your-username/model-name',
}

# ===================================================================

print(f"Loading checkpoint from: {CHECKPOINT_CONFIG['checkpoint_path']}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    CHECKPOINT_CONFIG['checkpoint_path'],
    trust_remote_code=True
)

print("✅ Tokenizer loaded")

# For vLLM/SGLang, we don't load the model here
# They handle loading internally for optimal inference
print(f"Model will be loaded by {BACKEND.upper()} backend during inference")
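
Before handing a local path to the backend, it can help to confirm that the directory looks like a loadable checkpoint. The following is a minimal sketch that assumes a Hugging Face-style layout (a `config.json` plus `*.safetensors` or `*.bin` weights); `inspect_checkpoint` is a hypothetical helper, and your checkpoint format may differ.

```python
import os

# Assumed layout of a Hugging Face-format checkpoint directory.
CONFIG_FILE = "config.json"
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def inspect_checkpoint(path):
    """Return a list of human-readable problems; an empty list means none were found."""
    if not os.path.isdir(path):
        # The path may still be a Hub model ID such as 'your-username/model-name',
        # which is only resolved when the model is actually loaded.
        return [f"{path!r} is not a local directory (fine if it is a Hub model ID)"]

    files = os.listdir(path)
    problems = []
    if CONFIG_FILE not in files:
        problems.append(f"missing {CONFIG_FILE}")
    if not any(name.endswith(WEIGHT_SUFFIXES) for name in files):
        problems.append("no *.safetensors or *.bin weight files")
    return problems
```

Calling `inspect_checkpoint(CHECKPOINT_CONFIG['checkpoint_path'])` and printing the result gives an early warning before the slower backend initialization.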

---
## Section 3: Initialize Inference Backend

Set up vLLM or SGLang for fast inference.

In [None]:
# ===================================================================
# INFERENCE CONFIGURATION - EDIT AS NEEDED
# ===================================================================

INFERENCE_CONFIG = {
    'tensor_parallel_size': 1,  # Set based on your GPU count
    'gpu_memory_utilization': 0.8,
    'max_model_len': 2048,
    'temperature': 0.7,
    'top_p': 0.9,
    'max_tokens': 1024,
}

# ===================================================================

if BACKEND == 'vllm':
    from vllm import LLM, SamplingParams
    
    # Initialize vLLM
    llm = LLM(
        model=CHECKPOINT_CONFIG['checkpoint_path'],
        tensor_parallel_size=INFERENCE_CONFIG['tensor_parallel_size'],
        gpu_memory_utilization=INFERENCE_CONFIG['gpu_memory_utilization'],
        max_model_len=INFERENCE_CONFIG['max_model_len'],
        trust_remote_code=True,
    )
    
    # Sampling params
    sampling_params = SamplingParams(
        temperature=INFERENCE_CONFIG['temperature'],
        top_p=INFERENCE_CONFIG['top_p'],
        max_tokens=INFERENCE_CONFIG['max_tokens'],
    )
    
    print("✅ vLLM initialized")

elif BACKEND == 'sglang':
    import sglang as sgl
    
    # Initialize SGLang runtime
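    runtime = sgl.Runtime(
        model_path=CHECKPOINT_CONFIG['checkpoint_path'],
        tp_size=INFERENCE_CONFIG['tensor_parallel_size'],
        mem_fraction_static=INFERENCE_CONFIG['gpu_memory_utilization'],
        # Keep the context window consistent with the vLLM branch
        context_length=INFERENCE_CONFIG['max_model_len'],
        trust_remote_code=True,
    )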
    
    sgl.set_default_backend(runtime)
    
    print("✅ SGLang initialized")
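
Initializing a backend can take minutes, so it may be worth checking `INFERENCE_CONFIG` for obvious mistakes first. The sketch below is an illustrative sanity check, assuming that `max_model_len` covers prompt and generated tokens together; `validate_inference_config` is a hypothetical helper, not a vLLM or SGLang API.

```python
def validate_inference_config(cfg):
    """Return a list of problems found in an INFERENCE_CONFIG-style dict."""
    problems = []
    if cfg["tensor_parallel_size"] < 1:
        problems.append("tensor_parallel_size must be at least 1")
    if not 0.0 < cfg["gpu_memory_utilization"] <= 1.0:
        problems.append("gpu_memory_utilization must be in (0, 1]")
    if cfg["temperature"] < 0.0:
        problems.append("temperature must be non-negative")
    if not 0.0 < cfg["top_p"] <= 1.0:
        problems.append("top_p must be in (0, 1]")
    if cfg["max_tokens"] >= cfg["max_model_len"]:
        # Assumption: the context window is shared by the prompt and the completion.
        problems.append("max_tokens leaves no room for the prompt within max_model_len")
    return problems


example_config = {
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.8,
    "max_model_len": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 1024,
}
print(validate_inference_config(example_config))
```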

---
## Section 4: Generate Sample Outputs

Generate responses for sample prompts.

In [None]:
# ===================================================================
# SAMPLE PROMPTS - EDIT THESE
# ===================================================================

SAMPLE_PROMPTS = [
    "What is 25 * 37?",
    "Solve: If x + 5 = 12, what is x?",
    "A store has 45 apples and sells 17. How many are left?",
]

# ===================================================================

def generate_responses(prompts):
    """Generate responses using selected backend"""
    
    if BACKEND == 'vllm':
        outputs = llm.generate(prompts, sampling_params)
        results = []
        for output in outputs:
            results.append({
                'prompt': output.prompt,
                'response': output.outputs[0].text,
            })
        return results
    
    elif BACKEND == 'sglang':
        # Define the SGLang program once and reuse it for every prompt
        @sgl.function
        def gen(s, prompt):
            s += prompt
            s += sgl.gen(
                "response",
                max_tokens=INFERENCE_CONFIG['max_tokens'],
                temperature=INFERENCE_CONFIG['temperature'],
                top_p=INFERENCE_CONFIG['top_p'],  # match the vLLM sampling settings
            )

        results = []
        for prompt in prompts:
            state = gen.run(prompt=prompt)
            results.append({
                'prompt': prompt,
                'response': state['response'],
            })
        return results

# Generate
print("Generating responses...")
results = generate_responses(SAMPLE_PROMPTS)

# Display
print("\n" + "="*70)
print("SAMPLE GENERATIONS")
print("="*70)
for i, result in enumerate(results, 1):
    print(f"\n[{i}] Prompt: {result['prompt']}")
    print(f"Response: {result['response']}")
    print("-"*70)
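
Because `INFERENCE_CONFIG` samples at temperature 0.7, the same prompt can produce different responses on each run. For spot checks that should be repeatable, one option is greedy decoding or a fixed seed. The sketch below only builds a dictionary of sampling settings; `build_sampling_kwargs` is a hypothetical helper, and passing the result to a backend (for example as keyword arguments to vLLM's `SamplingParams`) is an assumption to verify against your installed version.

```python
def build_sampling_kwargs(cfg, greedy=False, seed=None):
    """Build sampling settings from an INFERENCE_CONFIG-style dict.

    greedy=True selects temperature 0 (always pick the most likely token);
    otherwise the configured temperature and top_p are used, optionally seeded.
    """
    kwargs = {"max_tokens": cfg["max_tokens"]}
    if greedy:
        kwargs.update(temperature=0.0, top_p=1.0)
    else:
        kwargs.update(temperature=cfg["temperature"], top_p=cfg["top_p"])
        if seed is not None:
            kwargs["seed"] = seed
    return kwargs


cfg = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}
print(build_sampling_kwargs(cfg, greedy=True))
print(build_sampling_kwargs(cfg, seed=0))
```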

---
## Section 5: Benchmark on Test Set

Evaluate your model on a full test dataset.

In [None]:
import pandas as pd
from tqdm import tqdm

# ===================================================================
# TEST SET CONFIGURATION - EDIT THESE
# ===================================================================

TEST_CONFIG = {
    'test_file': os.path.expanduser('~/data/gsm8k/test.parquet'),
    'num_samples': 100,  # Limit for faster testing (set to None for all)
    'batch_size': 32,     # For vLLM batch inference
}

# ===================================================================

# Load test data
print(f"Loading test data from: {TEST_CONFIG['test_file']}")
test_df = pd.read_parquet(TEST_CONFIG['test_file'])

if TEST_CONFIG['num_samples']:
    # .copy() avoids a pandas SettingWithCopyWarning when a column is added below
    test_df = test_df.head(TEST_CONFIG['num_samples']).copy()

print(f"Test samples: {len(test_df)}")

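# Extract prompts
# Assumption: the 'prompt' column holds either plain strings or chat messages
# (a sequence of {'role', 'content'} dicts). Chat messages are rendered to text
# with the tokenizer's chat template; plain strings are used as-is.
prompts = [
    p if isinstance(p, str)
    else tokenizer.apply_chat_template(list(p), tokenize=False, add_generation_prompt=True)
    for p in test_df['prompt']
]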

# Generate responses
print("\nGenerating responses for test set...")

if BACKEND == 'vllm':
    # vLLM can handle batch inference efficiently
    all_outputs = []
    for i in tqdm(range(0, len(prompts), TEST_CONFIG['batch_size'])):
        batch = prompts[i:i + TEST_CONFIG['batch_size']]
        outputs = llm.generate(batch, sampling_params)
        all_outputs.extend([out.outputs[0].text for out in outputs])
    
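elif BACKEND == 'sglang':
    # SGLang sequential generation; define the program once, outside the loop
    @sgl.function
    def gen(s, prompt):
        s += prompt
        s += sgl.gen(
            "response",
            max_tokens=INFERENCE_CONFIG['max_tokens'],
            temperature=INFERENCE_CONFIG['temperature'],  # same sampling as vLLM
            top_p=INFERENCE_CONFIG['top_p'],
        )

    all_outputs = []
    for prompt in tqdm(prompts):
        state = gen.run(prompt=prompt)
        all_outputs.append(state['response'])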

# Add to dataframe
test_df['model_output'] = all_outputs

print(f"✅ Generated {len(all_outputs)} responses")
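
The batching loop in this cell can be factored into a backend-agnostic helper, which also makes it easy to check that every prompt received exactly one output and that order is preserved. This is a minimal sketch; `iter_batches`, `batched_generate`, and the `generate_fn` callback are hypothetical names introduced for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batched_generate(prompts, generate_fn, batch_size):
    """Run `generate_fn` (list of prompts -> list of texts) batch by batch, keeping order."""
    outputs = []
    for batch in iter_batches(prompts, batch_size):
        batch_outputs = generate_fn(batch)
        if len(batch_outputs) != len(batch):
            raise RuntimeError(
                f"expected {len(batch)} outputs for this batch, got {len(batch_outputs)}"
            )
        outputs.extend(batch_outputs)
    return outputs


# Stand-in for a real backend call: upper-cases each prompt.
def fake_generate(batch):
    return [prompt.upper() for prompt in batch]


print(batched_generate(["a", "b", "c"], fake_generate, batch_size=2))
```

With vLLM, `generate_fn` could be a small wrapper that calls `llm.generate(batch, sampling_params)` and returns the first completion's text for each output, as the cell above does.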

---
## Section 6: Compute Metrics

Evaluate model performance with metrics.

In [None]:
# Example: Accuracy for math problems
# This is dataset-specific - adjust based on your task

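import re

def extract_answer(text):
    """Extract the last number in the text as a string, or None if there is none"""
    # Simple extraction - customize based on your format.
    # Thousands separators are dropped so "1,000" is read as one number, and a
    # decimal point must be followed by digits so a sentence-ending "18." gives "18".
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text.replace(',', ''))
    if numbers:
        return numbers[-1]
    return None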

def compute_accuracy(df):
    """Compute accuracy for GSM8K-style datasets"""
    correct = 0
    total = 0
    
    for _, row in df.iterrows():
        if 'extra_info' in row and 'answer' in row['extra_info']:
            ground_truth = row['extra_info']['answer']
            predicted = row['model_output']
            
            # Extract numerical answers
            gt_num = extract_answer(str(ground_truth))
            pred_num = extract_answer(str(predicted))
            
            if gt_num and pred_num and gt_num == pred_num:
                correct += 1
            total += 1
    
    accuracy = correct / total if total > 0 else 0
    return accuracy, correct, total

# Compute metrics
if 'extra_info' in test_df.columns:
    accuracy, correct, total = compute_accuracy(test_df)
    
    print("="*70)
    print("EVALUATION RESULTS")
    print("="*70)
    print(f"Accuracy: {accuracy:.2%} ({correct}/{total})")
    print("="*70)
else:
    print("⚠️  No ground truth available for accuracy computation")

# Show a few sample predictions
print("\nSample predictions:")
print(test_df[['prompt', 'model_output']].head(5))
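
String comparison of extracted answers treats `18` and `18.0` as different. A stricter alternative is to compare numeric values and to prefer the text after GSM8K's `####` answer marker when it is present. The sketch below is one possible approach under those assumptions; `extract_final_number` and `is_correct` are hypothetical helpers, not verl's reward function.

```python
import re
from decimal import Decimal, InvalidOperation

# A number with optional sign, optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in `text` as a Decimal, or None if there is none."""
    # Prefer whatever follows the '####' marker when the text contains one.
    tail = text.split("####")[-1]
    matches = _NUMBER.findall(tail)
    if not matches:
        return None
    try:
        return Decimal(matches[-1].replace(",", ""))
    except InvalidOperation:
        return None


def is_correct(prediction, ground_truth):
    """Compare the final numbers of two texts by value rather than by string."""
    predicted = extract_final_number(prediction)
    expected = extract_final_number(ground_truth)
    return predicted is not None and predicted == expected


print(is_correct("So she earns 18.0 dollars.", "16 - 3 - 4 = 9 ... #### 18"))
```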

---
## Section 7: Compare Multiple Checkpoints

Compare performance across different training checkpoints.

In [None]:
# ===================================================================
# CHECKPOINT COMPARISON - EDIT THESE
# ===================================================================

CHECKPOINTS_TO_COMPARE = [
    './checkpoints/epoch_5',
    './checkpoints/epoch_10',
    './checkpoints/epoch_15',
]

# ===================================================================

# This is a template - you would need to reload models and re-run inference
# for each checkpoint, which can be time-consuming

comparison_results = []

for ckpt_path in CHECKPOINTS_TO_COMPARE:
    if not os.path.exists(ckpt_path):
        print(f"⚠️  Skipping {ckpt_path} (not found)")
        continue
    
    print(f"\nEvaluating checkpoint: {ckpt_path}")
    
    # Here you would:
    # 1. Reload the model from ckpt_path
    # 2. Re-run inference
    # 3. Compute metrics
    # 4. Store results
    
    # Placeholder
    comparison_results.append({
        'checkpoint': os.path.basename(ckpt_path),
        'accuracy': 0.0,  # Replace with actual metric
    })

# Display comparison
if comparison_results:
    comparison_df = pd.DataFrame(comparison_results)
    print("\n" + "="*70)
    print("CHECKPOINT COMPARISON")
    print("="*70)
    print(comparison_df)
    print("="*70)
else:
    print("No checkpoints to compare")
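
The placeholder loop can be filled in once the per-checkpoint work is expressed as a single function. The sketch below assumes a user-supplied `evaluate_fn(path)` that reloads the model, runs inference, and returns `(correct, total)`; that callback and the helper names are hypothetical. It also reports a 95% Wilson score interval for each accuracy, since differences between checkpoints on a small test set may fall within sampling noise.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def compare_checkpoints(paths, evaluate_fn):
    """Evaluate each checkpoint and collect accuracy with its confidence interval."""
    rows = []
    for path in paths:
        correct, total = evaluate_fn(path)
        low, high = wilson_interval(correct, total)
        rows.append({
            "checkpoint": path,
            "accuracy": correct / total if total else 0.0,
            "ci_low": low,
            "ci_high": high,
        })
    return rows


# Stand-in results; a real evaluate_fn would run inference for each checkpoint.
fake_scores = {"epoch_5": (50, 100), "epoch_10": (70, 100)}
for row in compare_checkpoints(list(fake_scores), lambda path: fake_scores[path]):
    print(row)
```

The returned list of dictionaries can be passed directly to `pd.DataFrame` for display.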

---
## Section 8: Save Results

Save evaluation results for later analysis.

In [None]:
# ===================================================================
# SAVE CONFIGURATION - EDIT OUTPUT PATH
# ===================================================================

SAVE_CONFIG = {
    'output_dir': './evaluation_results',
    'experiment_name': 'gsm8k_epoch15',
}

# ===================================================================

os.makedirs(SAVE_CONFIG['output_dir'], exist_ok=True)

# Save results
output_path = os.path.join(
    SAVE_CONFIG['output_dir'],
    f"{SAVE_CONFIG['experiment_name']}_results.parquet"
)

test_df.to_parquet(output_path, index=False)
print(f"✅ Results saved to: {output_path}")

# Save metrics summary
if 'accuracy' in locals():
    metrics_path = os.path.join(
        SAVE_CONFIG['output_dir'],
        f"{SAVE_CONFIG['experiment_name']}_metrics.txt"
    )
    
    with open(metrics_path, 'w') as f:
        f.write(f"Checkpoint: {CHECKPOINT_CONFIG['checkpoint_path']}\n")
        f.write(f"Test samples: {len(test_df)}\n")
        f.write(f"Accuracy: {accuracy:.2%}\n")
        f.write(f"Correct: {correct}/{total}\n")
    
    print(f"✅ Metrics saved to: {metrics_path}")
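
A plain-text summary is easy to read but awkward to parse later. If the metrics will be aggregated across runs, writing them as JSON is one alternative. This is a minimal sketch; `save_metrics_json` and the `_metrics.json` file name are assumptions introduced here, not an existing convention of the notebook.

```python
import json
import os


def save_metrics_json(metrics, output_dir, experiment_name):
    """Write a metrics dict to <output_dir>/<experiment_name>_metrics.json and return the path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{experiment_name}_metrics.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, sort_keys=True)
    return path


def load_metrics_json(path):
    """Read a metrics file written by save_metrics_json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `save_metrics_json({'accuracy': accuracy, 'correct': correct, 'total': total}, SAVE_CONFIG['output_dir'], SAVE_CONFIG['experiment_name'])` would store the same numbers as the text summary.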

---
## Section 9: Cleanup

Clean up resources.

In [None]:
import gc
import torch

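# Shutdown backend
if BACKEND == 'sglang':
    runtime.shutdown()
    print("✅ SGLang runtime shutdown")
elif BACKEND == 'vllm':
    # Drop the reference to the engine so it can be garbage-collected
    del llm

# Clear GPU memory: collect garbage first so freed tensors can be released
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("✅ GPU memory cleared")

print("\n🎉 Cleanup complete!")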

---
## Summary

You've evaluated your trained model!

**What you did**:
- Loaded a checkpoint
- Generated sample outputs
- Benchmarked on test set
- Computed accuracy metrics
- Saved results

**Next steps**:
- Fine-tune hyperparameters based on results
- Compare with baseline models
- Upload best checkpoint to HuggingFace (see notebook 1, Section 11)