# Checkpoint Selection with LAB SFT

This notebook demonstrates how checkpoint selection can be done when running the LAB training methodology through training hub. 

Checkpoint selection is an optional step at the end of an LLM training run, used to pick the saved checkpoint that performs best rather than defaulting to the last one.
When customizing a model using the LAB methodology, the resulting checkpoints may not differ greatly in their domain knowledge, but they can vary in their performance on general domains.

This notebook shows how you can use a complex benchmark such as OpenLLM Leaderboard v2 to select the best checkpoint from your training run.

> Note: Although this notebook showcases the LAB technique, it is focused on checkpoint evaluation.
> 
> For a comprehensive tutorial on the LAB multiphase training technique, see `lab_multiphase_training_tutorial.ipynb`

## Installing modules for evaluation


In addition to the dependencies from training-hub, we'll need to install the packages used for evaluation.

For this notebook, we'll be using [Open LLM Leaderboard v2](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about) as our heuristic benchmark, although any other benchmark may be swapped in.

We'll run the leaderboard through the `instructlab-eval` package, but you can use any other eval provider such as [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) or [OpenCompass](https://github.com/open-compass/opencompass).

To get started, please install via one of the following options:


### Option 1: Install with standard pip (most vanilla environments)

In [None]:
# For conventional python environments (vanilla, venv, conda, etc.)
!pip install instructlab-eval && pip install instructlab-eval[cuda] --no-build-isolation && pip install instructlab-eval[leaderboard] --no-build-isolation

### Option 2: Install with UV

In [None]:
# If using UV, run with this command:
!uv pip install instructlab-eval && uv pip install instructlab-eval[cuda] --no-build-isolation && uv pip install instructlab-eval[leaderboard] --no-build-isolation

## Important: Run this before anything else

In [None]:
# IMPORTANT: Fix for CUDA multiprocessing issue - must be run first!
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)
print("✅ Set multiprocessing start method to 'spawn' for CUDA compatibility")

# Ensure CUDA is available for vLLM
import os
import torch

# Set CUDA device visibility if not already set
if 'CUDA_VISIBLE_DEVICES' not in os.environ:
    # Make all GPUs visible
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in range(torch.cuda.device_count()))
    print(f"✅ Set CUDA_VISIBLE_DEVICES to: {os.environ['CUDA_VISIBLE_DEVICES']}")

# Verify CUDA availability
if torch.cuda.is_available():
    print(f"✅ CUDA is available with {torch.cuda.device_count()} GPU(s)")
else:
    print("⚠️  WARNING: CUDA is not available! vLLM may fall back to CPU backend.")

## Configure your hardware setup

You may configure your system here as needed.

In [None]:
# =============================================================================
# HARDWARE AND TRAINING CONFIGURATION
# =============================================================================

# Training hyperparameters
max_tokens_per_gpu = 25_000  # Memory limit per GPU (reduce if hitting OOM errors)
max_seq_len = 20_000         # Maximum sequence length

# LAB-specific configurations
phase07_effective_batch_size = 128   # Smaller batch size for knowledge dataset
phase10_effective_batch_size = 3840  # Larger batch size for skills + replay dataset
learning_rate = 2e-5                 # Learning rate for both phases
num_epochs = 7                       # Number of epochs per phase
warmup_steps = 0                     # Number of warmup steps

# Single-node distributed training setup
nproc_per_node = 8   # Number of GPUs per node (adjust based on your hardware)
nnodes = 1           # Number of nodes (single-node setup)
node_rank = 0        # This node's rank (always 0 for single-node)
rdzv_id = 47         # Rendezvous ID for distributed training
rdzv_endpoint = "127.0.0.1:12345"  # Local endpoint for single-node

# Calculate total resources
total_gpus = nproc_per_node * nnodes

print("🖥️  Hardware Configuration:")
print(f"  GPUs per node: {nproc_per_node}")
print(f"  Total GPUs: {total_gpus}")
print(f"  Max tokens per GPU: {max_tokens_per_gpu:,}")
print(f"  Max sequence length: {max_seq_len:,}")
print()
print("📊 LAB Training Configuration:")
print(f"  Phase07 batch size: {phase07_effective_batch_size} (knowledge tuning)")
print(f"  Phase10 batch size: {phase10_effective_batch_size} (skills + replay)")
print(f"  Learning rate: {learning_rate}")
print(f"  Epochs per phase: {num_epochs}")
print()
print("💡 Note: If you encounter OOM (Out of Memory) errors, reduce max_tokens_per_gpu")
print("   For fewer GPUs, adjust nproc_per_node to match your available hardware")


### Setup utility functions

In [None]:
# Standard library imports for logging management
import logging
import time
from datetime import datetime
from contextlib import redirect_stdout, redirect_stderr
from io import StringIO

# Configure logging to prevent notebook crashes from excessive output
# while still showing essential progress and error information

def setup_training_logging():
    """Set up logging configuration optimized for notebook environments."""
    # Reduce logging level for common noisy loggers
    logging.getLogger("transformers").setLevel(logging.WARNING)
    logging.getLogger("torch").setLevel(logging.WARNING)
    logging.getLogger("accelerate").setLevel(logging.WARNING)
    
    # Set up a custom logger that shows progress without overwhelming the notebook
    root_logger = logging.getLogger()
    root_logger.setLevel(logging.INFO)
    
    print("✅ Logging configured for notebook environment")

def run_training_with_managed_output(training_func, description="Training"):
    """
    Run training with balanced output showing progress without overwhelming the notebook.
    Shows essential progress, errors, and key milestones while filtering excessive logs.
    """
    print(f"🚀 Starting {description}...")
    print("📝 Showing essential progress and key training milestones")
    print("⏳ This may take a while. Training progress will appear below:")
    print("-" * 60)
    
    start_time = time.time()
    
    try:
        # Run training with minimal output redirection to allow subprocess logs
        # but reduce verbosity of the most chatty components
        import os
        
        # Set environment variables to reduce some verbose output
        old_env = {}
        env_settings = {
            'TRANSFORMERS_VERBOSITY': 'warning',
            'TOKENIZERS_PARALLELISM': 'false',  # Reduces tokenizer warnings
        }
        
        for key, value in env_settings.items():
            old_env[key] = os.environ.get(key)
            os.environ[key] = value
        
        try:
            result = training_func()
        finally:
            # Restore environment
            for key, old_value in old_env.items():
                if old_value is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = old_value
        
        end_time = time.time()
        duration = end_time - start_time
        
        print("-" * 60)
        print(f"✅ {description} completed successfully!")
        print(f"⏱️  Duration: {duration/3600:.2f} hours")
        
        return result
        
    except Exception as e:
        end_time = time.time()
        duration = end_time - start_time
        
        print("-" * 60)
        print(f"❌ {description} failed after {duration/60:.1f} minutes")
        print(f"Error: {e}")
        print("\n💡 The error occurred in the distributed training subprocess.")
        print("   Check the training logs above for more context about the failure.")
        print("   Common issues include: data path problems, memory issues, or model loading errors.")
        
        raise

def find_latest_checkpoint(ckpt_output_dir: str) -> str | None:
    """Find the most recently created checkpoint in the hf_format directory.
    
    Args:
        ckpt_output_dir: Base checkpoint output directory
        
    Returns:
        str or None: Path to the latest checkpoint directory, or None if no checkpoints found
    """
    import os
    import glob
    
    hf_format_dir = os.path.join(ckpt_output_dir, "hf_format")
    if not os.path.exists(hf_format_dir):
        return None
    
    # List all items in hf_format directory and pick the most recent
    items = os.listdir(hf_format_dir)
    if not items:
        return None
    
    # Get full paths and find the most recently created item
    item_paths = [os.path.join(hf_format_dir, item) for item in items]
    latest_checkpoint = max(item_paths, key=os.path.getctime)
    
    return latest_checkpoint

# Set up logging
setup_training_logging()


## Train your model with LAB (or any other method)

The following section showcases running the LAB method.

In [None]:
from training_hub import sft
import os

# Model and data paths - Update these to your actual paths
base_model_path = "/path/to/your/model"  # e.g., granite-3.1-8b-starter-v2.1
phase07_knowledge_data = "/path/to/your/knowledge/data.jsonl"  # Knowledge data for Phase07
phase10_skills_replay_data = "/path/to/your/skills_plus_replay/data.jsonl"  # Skills + replay data for Phase10

# Configure your output directories
ckpt_output_base_dir = "/path/to/checkpoint-save-dir"
data_output_dir = "/dev/shm"  # A good default on most systems, but you can change as needed


# Phase-specific output directories
phase07_ckpt_output_dir = os.path.join(ckpt_output_base_dir, "_phase07")
phase10_ckpt_output_dir = os.path.join(ckpt_output_base_dir, "_phase10")

print("📂 Model and data configuration:")
print(f"   Base model: {base_model_path}")
print(f"   Phase07 knowledge data: {phase07_knowledge_data}")
print(f"   Phase10 skills + replay data: {phase10_skills_replay_data}")
print(f"   Phase07 output: {phase07_ckpt_output_dir}")
print(f"   Phase10 output: {phase10_ckpt_output_dir}")


## Run LAB Multiphase Training

In this example, we provide a template for training a model using the LAB SFT approach, since checkpoint selection is particularly useful here.

However, this technique applies to any training regime, so feel free to swap this out for your own techniques.

In [None]:
# Phase 1: Knowledge Tuning (Phase07)
print("📚 Phase 1: Knowledge Tuning (Phase07)")
print("=" * 60)

def phase07_training():
    """Execute Phase07 training with all parameters."""
    sft(
        # Required parameters
        model_path=base_model_path,
        data_path=phase07_knowledge_data,
        ckpt_output_dir=phase07_ckpt_output_dir,
        
        # Core training parameters
        num_epochs=num_epochs,
        effective_batch_size=phase07_effective_batch_size,
        learning_rate=learning_rate,
        max_seq_len=max_seq_len,
        max_tokens_per_gpu=max_tokens_per_gpu,
        
        # Data and checkpointing parameters
        data_output_dir=data_output_dir,
        warmup_steps=warmup_steps,
        save_samples=0,
        checkpoint_at_epoch=True,
        accelerate_full_state_at_epoch=False,  # Save space for intermediate phase
        
        # Distributed training parameters
        nproc_per_node=nproc_per_node,
        nnodes=nnodes,
        node_rank=node_rank,
        rdzv_id=rdzv_id,
        rdzv_endpoint=rdzv_endpoint,
    )
    

# Execute Phase07 training
try:
    run_training_with_managed_output(phase07_training, "Phase07 (Knowledge Tuning)")
    print("🎯 Phase07 training completed successfully!")
except Exception as e:
    print(f"💥 Phase07 training failed: {e}")
    print("🔍 Check the error details above for troubleshooting")
    raise

In [None]:
print("🎤 Phase 2: Skills + Knowledge Tuning (Phase10)")
print("=" * 60)

def phase10_training():
    """Execute Phase10 training with all parameters."""

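    # Find the last checkpoint from the knowledge training phase
    latest_phase07_checkpoint = find_latest_checkpoint(phase07_ckpt_output_dir)
    if latest_phase07_checkpoint is None:
        # Fail early with a clear message instead of passing None to sft()
        raise FileNotFoundError(f"No Phase07 checkpoint found under {phase07_ckpt_output_dir}")
    print(f"🔍 Found latest checkpoint: {latest_phase07_checkpoint}")
    print("🏋 Loading checkpoint into Phase10 training...")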

    sft(
        # Required parameters
        model_path=latest_phase07_checkpoint,
        data_path=phase10_skills_replay_data,
        ckpt_output_dir=phase10_ckpt_output_dir,
        
        # Core training parameters
        checkpoint_at_epoch=True,
        num_epochs=num_epochs,
        effective_batch_size=phase10_effective_batch_size,
        max_seq_len=max_seq_len,
        max_tokens_per_gpu=max_tokens_per_gpu,
        learning_rate=learning_rate,

        # Data and checkpointing parameters
        data_output_dir=data_output_dir,
        warmup_steps=warmup_steps,
        save_samples=0,
        accelerate_full_state_at_epoch=False,

        # Distributed training parameters
        nproc_per_node=nproc_per_node,
        nnodes=nnodes,
        node_rank=node_rank,
        rdzv_id=rdzv_id,
        rdzv_endpoint=rdzv_endpoint,
    )


try:
    run_training_with_managed_output(phase10_training, "Phase10 (Skills + Knowledge Tuning)")
    print("🎯 Phase10 training completed successfully!")
except Exception as e:
    print(f"💥 Phase10 training failed: {e}")
    print("🔍 Check the error details above for troubleshooting")
    raise

## Evaluating checkpoints

### Why Evaluation Matters for Custom Instruction-Tuned Models

When fine-tuning instruction-following models with the LAB methodology, we mix the base model's original instruction data with our custom domain-specific data. This data mixing creates an important challenge: while we want the model to excel at our specific tasks, we also need to ensure it maintains its general-purpose capabilities as an intelligent chatbot.

The training process can create subtle trade-offs:
- Early checkpoints might retain more general capabilities but have less domain expertise
- Later checkpoints might excel at your specific tasks but potentially degrade on general tasks
- Different checkpoints may balance these capabilities differently

This is why evaluating checkpoints across diverse, complex domains is crucial - we need to find the checkpoint that best balances domain-specific performance with general intelligence.
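One way to make these trade-offs visible is to ask which checkpoint wins on each task separately. The sketch below is illustrative only: the function name `best_checkpoint_per_task`, the checkpoint names, the task names, and every score are hypothetical and not produced by any real run.

```python
def best_checkpoint_per_task(scores_by_checkpoint):
    """Return {task: (checkpoint_name, score)} for the top scorer on each task."""
    winners = {}
    for checkpoint, task_scores in scores_by_checkpoint.items():
        for task, score in task_scores.items():
            # Keep the first checkpoint seen unless a later one scores strictly higher
            if task not in winners or score > winners[task][1]:
                winners[task] = (checkpoint, score)
    return winners


# Hypothetical scores for illustration only, not real results
example_scores = {
    "epoch_1": {"general": 0.42, "domain": 0.55},
    "epoch_4": {"general": 0.40, "domain": 0.68},
    "epoch_7": {"general": 0.35, "domain": 0.71},
}

for task, (checkpoint, score) in best_checkpoint_per_task(example_scores).items():
    print(f"{task}: {checkpoint} ({score:.2f})")
```

When different tasks are won by different checkpoints, as in this made-up data, no single checkpoint dominates and an aggregate score is needed to choose between them.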

### OpenLLM Leaderboard v2: An Ideal Evaluation Tool

The OpenLLM Leaderboard v2 is particularly well-suited for checkpoint selection because:

1. **Comprehensive Coverage**: It tests multiple facets of intelligence including reasoning (BBH, GPQA, MuSR), math (MATH), instruction following (IFEval), and general knowledge (MMLU-Pro)

2. **Challenging Tasks**: These benchmarks are designed to be difficult, helping distinguish between checkpoints that might perform similarly on easier tasks

3. **Real-World Relevance**: The tasks mirror the diverse queries a chatbot encounters in production, from technical questions to creative problem-solving

4. **Standardized Metrics**: Provides consistent, comparable scores across checkpoints, making selection objective rather than subjective

By evaluating all checkpoints from your training run on this benchmark, you can identify which checkpoint best preserves the model's versatility while incorporating your custom knowledge - ensuring you deploy the most capable version of your fine-tuned model.


### Running Checkpoint Selection with Leaderboard

In this notebook, we leverage the `instructlab-eval` library which wraps around lm-eval-harness to provide a seamless API in Python.

Although it's possible to run the same evaluation directly through lm-eval-harness, doing so comes with the following problems:

1. **Complex output**: The leaderboard scores returned from `lm-eval` are complex and require post-processing to obtain a score that's easy to read.
2. **Difficult to optimize**: Without the correct configuration, leaderboard can take an hour and a half to run on an **8xH100** machine, and arriving at the correct configuration takes real engineering effort.

`instructlab-eval` addresses both problems by packaging the benchmark so that it:

1. Is easy to call from a script like this 📞
2. Is configured for speed (🐢 90 minutes --> 🏎️ 15 minutes) 🔥
3. Returns the scores in a simple, readable format 📜 --> 📄


### Evaluation script

In [None]:
# Import the evaluator
print("📦 Importing LeaderboardV2Evaluator...")
from instructlab.eval.leaderboard import LeaderboardV2Evaluator
print("✅ Import complete")

# Configuration for evaluation
eval_num_gpus = 8  # Number of GPUs to use for evaluation (adjust based on your hardware)
eval_config = {"batch_size": "auto", "max_batch_size": 64}  # Optimized evaluation config


### Step 1: Checkpoint Discovery

Now we'll find all Phase10 checkpoints that need to be evaluated. Phase10 represents the final training phase with skills and comprehensive replay, so these are the checkpoints we want to evaluate for deployment.
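The next cell sorts checkpoint directories by name, which is a plain lexical sort. If your checkpoint directories end in a numeric counter (the `samples_N` naming below is an assumption for illustration, so check what your run actually produced), lexical order can differ from training order. A minimal sketch of a numeric-aware sort key, where `natural_sort_key` is a hypothetical helper and not part of any library:

```python
import re


def natural_sort_key(name):
    """Split a name into text and integer chunks so numbers compare numerically."""
    # re.split with a capturing group alternates text and digit chunks,
    # so text always lines up with text and integers with integers.
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


# Hypothetical directory names for illustration only
names = ["samples_1000", "samples_200", "samples_30"]

print(sorted(names))                        # lexical order
print(sorted(names, key=natural_sort_key))  # numeric-aware order
```

Ordering does not affect which checkpoint is selected, since selection is by score. It only makes the evaluation log easier to read as a training timeline.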


In [None]:
# Find Phase10 checkpoints to evaluate
print("🔍 Finding Phase10 checkpoints to evaluate...")

phase10_hf_dir = os.path.join(phase10_ckpt_output_dir, "hf_format")
checkpoints_to_evaluate = []

if os.path.exists(phase10_hf_dir):
    # List all checkpoint directories
    phase10_checkpoints = [
        os.path.join(phase10_hf_dir, d) 
        for d in os.listdir(phase10_hf_dir) 
        if os.path.isdir(os.path.join(phase10_hf_dir, d))
    ]
    
    # Sort by name to ensure consistent ordering
    phase10_checkpoints.sort()
    
    checkpoints_to_evaluate = phase10_checkpoints
    print(f"✅ Found {len(checkpoints_to_evaluate)} Phase10 checkpoint(s):")
    for ckpt in checkpoints_to_evaluate:
        print(f"   - {os.path.basename(ckpt)}")
else:
    print(f"❌ No Phase10 checkpoints found at {phase10_hf_dir}")
    print("   Please ensure Phase10 training completed successfully.")

print(f"\n📊 Total checkpoints to evaluate: {len(checkpoints_to_evaluate)}")


### Step 2: Initialize Evaluation

Before we start the evaluation loop, let's set up our results storage and verify we have checkpoints to evaluate.


In [None]:
# Initialize evaluation tracking
if not checkpoints_to_evaluate:
    print("❌ No checkpoints found to evaluate. Exiting evaluation process.")
else:
    print("✅ Ready to evaluate checkpoints")
    print(f"⏱️  Total estimated time: ~{len(checkpoints_to_evaluate) * 15}-{len(checkpoints_to_evaluate) * 20} minutes")

checkpoint_results = []

### Step 3: Evaluate Each Checkpoint

Now we'll evaluate each checkpoint using the OpenLLM Leaderboard v2. This will test each checkpoint across multiple dimensions including reasoning, math, instruction following, and general knowledge.


In [None]:
checkpoint_results = []

In [None]:
# Evaluation loop

for i, checkpoint_path in enumerate(checkpoints_to_evaluate, 1):
    checkpoint_name = os.path.basename(checkpoint_path)
    print(f"\n{'='*70}")
    print(f"📍 Evaluating checkpoint {i}/{len(checkpoints_to_evaluate)}: {checkpoint_name}")
    print(f"{'='*70}")
    
    try:
        # Create evaluator for this checkpoint
        print("  🔧 Creating evaluator...")
        evaluator = LeaderboardV2Evaluator(
            model_path=checkpoint_path,
            num_gpus=eval_num_gpus,
            eval_config=eval_config
        )
        
        # Run evaluation
        print("  🚀 Running leaderboard evaluation...")
        start_time = time.time()
        
        result = evaluator.run()
        
        eval_time = time.time() - start_time
        print(f"  ✅ Evaluation complete in {eval_time/60:.1f} minutes")
        
        # Store results
        checkpoint_results.append({
            "checkpoint_path": checkpoint_path,
            "checkpoint_name": checkpoint_name,
            "results": result,
            "overall_score": result["overall_score"],
            "eval_time_minutes": eval_time/60
        })
        
        # Print scores for this checkpoint
        print(f"\n  📊 Scores for {checkpoint_name}:")
        print(f"    Overall: {result['overall_score'] * 100:.2f}%")
        if "leaderboard_ifeval" in result:
            print(f"    IFEval: {result['leaderboard_ifeval']['score'] * 100:.2f}%")
        if "leaderboard_mmlu_pro" in result:
            print(f"    MMLU-Pro: {result['leaderboard_mmlu_pro']['score'] * 100:.2f}%")
        if "leaderboard_math_hard" in result:
            print(f"    MATH-Hard: {result['leaderboard_math_hard']['score'] * 100:.2f}%")
        if "leaderboard_gpqa" in result:
            print(f"    GPQA: {result['leaderboard_gpqa']['score'] * 100:.2f}%")
        if "leaderboard_musr" in result:
            print(f"    MuSR: {result['leaderboard_musr']['score'] * 100:.2f}%")
        if "leaderboard_bbh" in result:
            print(f"    BBH: {result['leaderboard_bbh']['score'] * 100:.2f}%")
        
    except Exception as e:
        print(f"  ❌ Error evaluating checkpoint: {e}")
        checkpoint_results.append({
            "checkpoint_path": checkpoint_path,
            "checkpoint_name": checkpoint_name,
            "results": None,
            "overall_score": -1,
            "error": str(e)
        })

print(f"\n✅ Evaluation loop completed. Evaluated {len(checkpoint_results)} checkpoints.")


### Step 4: Analyze Results

Let's sort and analyze all the evaluation results to understand how each checkpoint performed.


In [None]:
# Sort results by overall score
if checkpoint_results:
    print("📊 Analyzing evaluation results...\n")
    
    # Sort by overall score (descending)
    sorted_results = sorted(
        checkpoint_results, 
        key=lambda x: x["overall_score"], 
        reverse=True
    )
    
    # Display all results ranked
    print("🏆 CHECKPOINT RANKINGS")
    print("=" * 70)
    print(f"{'Rank':<6} {'Checkpoint':<30} {'Overall Score':<15} {'Status'}")
    print("-" * 70)
    
    for i, result in enumerate(sorted_results, 1):
        failed = result['overall_score'] < 0
        score_str = "ERROR" if failed else f"{result['overall_score'] * 100:.2f}%"
        # Failed evaluations never earn a medal, even when few checkpoints were evaluated
        status = "❌" if failed else "🥇" if i == 1 else "🥈" if i == 2 else "🥉" if i == 3 else "✓"
        print(f"{i:<6} {result['checkpoint_name']:<30} {score_str:<15} {status}")
    
    # Show evaluation time statistics
    successful_evals = [r for r in checkpoint_results if r["overall_score"] >= 0]
    if successful_evals:
        avg_time = sum(r.get("eval_time_minutes", 0) for r in successful_evals) / len(successful_evals)
        print(f"\n⏱️  Average evaluation time: {avg_time:.1f} minutes per checkpoint")
else:
    print("❌ No results to analyze.")


### Step 5: Identify Best Checkpoint

Finally, let's identify and highlight the best-performing checkpoint based on the overall leaderboard score.


In [None]:
# Identify and display the best checkpoint
if checkpoint_results and sorted_results and sorted_results[0]["overall_score"] >= 0:
    best_checkpoint = sorted_results[0]
    
    print("\n" + "=" * 70)
    print("✨ BEST CHECKPOINT FOUND ✨")
    print("=" * 70)
    print(f"\n📍 Checkpoint: {best_checkpoint['checkpoint_name']}")
    print(f"📂 Full Path: {best_checkpoint['checkpoint_path']}")
    print(f"🏆 Overall Score: {best_checkpoint['overall_score'] * 100:.2f}%")

    if best_checkpoint['results']:
        print("\n📊 Detailed Performance Breakdown:")
        print("-" * 40)
        
        # Define the order we want to display tasks
        task_order = [
            ("leaderboard_ifeval", "IFEval (Instruction Following)"),
            ("leaderboard_mmlu_pro", "MMLU-Pro (General Knowledge)"),
            ("leaderboard_math_hard", "MATH-Hard (Mathematics)"),
            ("leaderboard_gpqa", "GPQA (Graduate-level QA)"),
            ("leaderboard_musr", "MuSR (Multistep Soft Reasoning)"),
            ("leaderboard_bbh", "BBH (Big-Bench Hard)")
        ]
        
        for task_key, task_name in task_order:
            if task_key in best_checkpoint['results'] and isinstance(best_checkpoint['results'][task_key], dict):
                score = best_checkpoint['results'][task_key].get('score', 0)
                print(f"  {task_name:<35} {score * 100:>6.2f}%")
    
    print("\n💡 This checkpoint achieved the highest overall score across the evaluation tasks.")
    print("   It is recommended for deployment unless domain-specific requirements suggest otherwise.")
else:
    print("\n❌ No valid checkpoint could be identified as best.")
    print("   Please check the evaluation errors above.")


## Summary and Next Steps

### What We've Accomplished

In this notebook, we've demonstrated a complete LAB training and checkpoint selection workflow:

1. **LAB Training**: Executed two-phase training with knowledge tuning (Phase07) followed by skills + replay training (Phase10)
2. **Checkpoint Collection**: Gathered all checkpoints produced by the final training phase (Phase10)
3. **Comprehensive Evaluation**: Used OpenLLM Leaderboard v2 to evaluate each checkpoint across multiple intelligence dimensions
4. **Optimal Selection**: Identified the best-performing checkpoint based on overall scores

### Key Insights

- **Trade-offs Matter**: Different checkpoints excel at different tasks - the best checkpoint balances general intelligence with your domain expertise
- **Evaluation is Essential**: Without proper evaluation, you might deploy a suboptimal checkpoint that underperforms on important capabilities
- **Leaderboard v2 Efficiency**: With proper configuration, evaluation takes only ~15 minutes per checkpoint instead of 90+ minutes

### Next Steps

1. **Deploy Your Best Model**: Use the identified best checkpoint for inference or further fine-tuning
2. **Domain-Specific Testing**: Additionally test the best checkpoint on your specific use cases
3. **Consider Task-Specific Selection**: If certain tasks are more important for your application, you might weight those scores higher in selection
4. **Monitor in Production**: Continue evaluating model performance on real-world tasks after deployment
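Task-specific selection can be sketched as a weighted average over the per-task scores. The example below assumes each entry has the same shape as the `checkpoint_results` list built earlier in this notebook (a `results` dict whose task entries carry a `score`). The helper `weighted_score`, the weights, the checkpoint names, and the scores are all hypothetical values chosen for illustration.

```python
def weighted_score(results, weights):
    """Weighted average of per-task scores; tasks missing from results are skipped."""
    total = 0.0
    weight_sum = 0.0
    for task, weight in weights.items():
        entry = results.get(task)
        if isinstance(entry, dict) and "score" in entry:
            total += weight * entry["score"]
            weight_sum += weight
    # With no usable task scores, rank this checkpoint last
    return total / weight_sum if weight_sum else float("-inf")


# Hypothetical results for illustration only, not real benchmark numbers
example_results = [
    {
        "checkpoint_name": "ckpt_a",
        "results": {
            "leaderboard_ifeval": {"score": 0.70},
            "leaderboard_math_hard": {"score": 0.10},
        },
    },
    {
        "checkpoint_name": "ckpt_b",
        "results": {
            "leaderboard_ifeval": {"score": 0.60},
            "leaderboard_math_hard": {"score": 0.30},
        },
    },
]

# Hypothetical weights: math matters three times as much as instruction following
weights = {"leaderboard_ifeval": 1.0, "leaderboard_math_hard": 3.0}

best = max(
    (r for r in example_results if r["results"]),  # skip failed evaluations
    key=lambda r: weighted_score(r["results"], weights),
)
print(best["checkpoint_name"])
```

With equal weights the two made-up checkpoints would swap places, which is the point of the exercise: the weights encode what your application cares about, so choose them before looking at the scores.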

### Alternative Evaluation Options

While this notebook uses OpenLLM Leaderboard v2, you can adapt the same workflow with other evaluation frameworks:
- **lm-evaluation-harness**: For custom task suites
- **OpenCompass**: For broad benchmark coverage, including Chinese-language evaluation
- **Your Custom Benchmarks**: For domain-specific evaluation needs

The key is to maintain a consistent evaluation protocol across all checkpoints to ensure fair comparison.
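A consistent protocol can be enforced structurally by routing every checkpoint through the same function with the same configuration. The sketch below is benchmark-agnostic: `evaluate_all` and `fake_evaluate` are hypothetical names, and `fake_evaluate` is a stand-in that returns a made-up number rather than scoring a model. In practice you would replace it with a call into whichever evaluation framework you use.

```python
def evaluate_all(checkpoints, evaluate_fn, config):
    """Score every checkpoint with the same function and the same config."""
    rows = []
    for path in checkpoints:
        try:
            score = evaluate_fn(path, **config)
            rows.append({"checkpoint_path": path, "score": score, "error": None})
        except Exception as exc:
            # Record the failure and continue so one bad checkpoint does not stop the sweep
            rows.append({"checkpoint_path": path, "score": None, "error": str(exc)})
    return rows


def fake_evaluate(path, num_fewshot=0):
    """Stand-in evaluator for illustration; a real one would load and score the model."""
    if "broken" in path:
        raise RuntimeError("could not load checkpoint")
    return len(path) / 100  # made-up score, not a real metric


# Hypothetical paths and config for illustration only
rows = evaluate_all(["ckpt/a", "ckpt/broken"], fake_evaluate, {"num_fewshot": 5})
valid = [r for r in rows if r["score"] is not None]
best = max(valid, key=lambda r: r["score"])
print(best["checkpoint_path"])
```

Storing `None` for failed runs, instead of a sentinel score, keeps failures out of any ranking by construction.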
