# Model Evaluation with lm_eval

This notebook demonstrates how to evaluate and compare quantized vs unquantized models using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) library.

We'll run an MMLU (Massive Multitask Language Understanding) task against both models and compare their accuracy.

**Note:** Multiple users may be running these tests simultaneously, so each model client is limited to one concurrent request (`num_concurrent=1`) to avoid overloading the model endpoints.

---
## 0. Setup

Configure model endpoints.

In [None]:
import os

# Model endpoints - configure these for your environment
os.environ["UNQUANTIZED_URL"] = "http://llama-32-predictor.ai501.svc.cluster.local:8080"
os.environ["UNQUANTIZED_MODEL"] = "llama32"

os.environ["QUANTIZED_URL"] = "http://llama-32-fp8-predictor.ai501.svc.cluster.local:8080"
os.environ["QUANTIZED_MODEL"] = "RedHatAI/Llama-3.2-3B-Instruct-FP8"

print(f"Unquantized endpoint: {os.environ['UNQUANTIZED_URL']}")
print(f"Unquantized model: {os.environ['UNQUANTIZED_MODEL']}")
print(f"Quantized endpoint: {os.environ['QUANTIZED_URL']}")
print(f"Quantized model: {os.environ['QUANTIZED_MODEL']}")
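
Before running an evaluation, it can help to confirm that each endpoint is reachable and serves the expected model name. The following is a minimal sketch, assuming the servers expose the OpenAI-style `GET /v1/models` route (vLLM does, but nothing else in this notebook relies on it). The helper names `models_url`, `served_model_ids`, and `list_served_models` are illustrative and not part of lm_eval.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the OpenAI-style model listing URL from an endpoint base URL."""
    return base_url.rstrip("/") + "/v1/models"


def served_model_ids(payload: dict) -> list:
    """Extract model ids from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 10.0) -> list:
    """Query the endpoint (network call) and return the model ids it reports."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

For example, `os.environ["UNQUANTIZED_MODEL"] in list_served_models(os.environ["UNQUANTIZED_URL"])` should evaluate to `True` when the configured name matches what the server reports.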

---
## 1. Understanding lm_eval

The `lm-evaluation-harness` is a standard framework for evaluating language models. It supports:

- **200+ benchmarks** including MMLU, HellaSwag, ARC, TruthfulQA, etc.
- **Multiple model backends** including local models, OpenAI API, and vLLM endpoints
- **Standardized evaluation** for reproducible comparisons

For remote models served via OpenAI-compatible APIs (like vLLM), we use the `local-completions` model type.
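
The same backend can also be selected by name instead of constructing the class directly: `simple_evaluate` accepts `model="local-completions"` together with a comma-separated `model_args` string. Below is a sketch of building that string. `build_model_args` is a hypothetical helper, and the model name, `localhost` URL, and tokenizer repo are placeholders rather than this workshop's endpoints.

```python
def build_model_args(model: str, endpoint: str, tokenizer: str,
                     num_concurrent: int = 1) -> str:
    """Format key=value pairs as a comma-separated model_args string."""
    args = {
        "model": model,
        "base_url": endpoint.rstrip("/") + "/v1/completions",
        "num_concurrent": num_concurrent,
        "tokenizer_backend": "huggingface",
        "tokenizer": tokenizer,
    }
    return ",".join(f"{key}={value}" for key, value in args.items())


# Placeholder values for illustration only
model_args = build_model_args(
    model="my-model",
    endpoint="http://localhost:8080",
    tokenizer="my-org/my-tokenizer",
)
print(model_args)

# The string can then be passed to lm_eval (requires a reachable endpoint):
# lm_eval.simple_evaluate(model="local-completions", model_args=model_args,
#                         tasks=["mmlu_abstract_algebra"], limit=50)
```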

---
## 2. Evaluate Unquantized Model

First, let's run MMLU evaluation on the unquantized (full precision) model.

We use a single MMLU task with limited samples for a quick workshop demo. For production evaluations, remove the `limit` parameter and add more tasks.
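
For context on what this evaluation measures: lm_eval treats MMLU as a multiple-choice task. It asks the model for the log-likelihood of each answer option and counts a question as correct when the gold option scores highest. Here is a simplified sketch of that rule, using made-up log-probabilities rather than real model output (`pick_answer` and `accuracy` are illustrative names, not lm_eval functions).

```python
def pick_answer(option_logprobs: list) -> int:
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def accuracy(questions: list) -> float:
    """Fraction of questions where the top-scoring option is the gold one."""
    correct = sum(pick_answer(q["logprobs"]) == q["gold"] for q in questions)
    return correct / len(questions)


# Made-up scores for options A-D on three questions (not real model output)
questions = [
    {"logprobs": [-1.2, -0.4, -2.5, -3.0], "gold": 1},  # picks B, correct
    {"logprobs": [-0.9, -1.1, -0.3, -2.0], "gold": 0},  # picks C, wrong
    {"logprobs": [-2.2, -1.9, -1.5, -0.2], "gold": 3},  # picks D, correct
]
print(f"acc = {accuracy(questions):.4f}")
```

This scoring needs token log-probabilities from the server, which is why the evaluation targets the `/v1/completions` route.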

In [None]:
import lm_eval
import lm_eval.models.openai_completions
from lm_eval.models.openai_completions import LocalCompletionsAPI

# Workaround for a missing tqdm import in lm_eval's openai_completions module:
# expose tqdm as a module-level name so the module's code can find it
from tqdm import tqdm
lm_eval.models.openai_completions.tqdm = tqdm

# Configure the unquantized model
unquantized_model = LocalCompletionsAPI(
    model=os.environ["UNQUANTIZED_MODEL"],
    base_url=f"{os.environ['UNQUANTIZED_URL']}/v1/completions",
    num_concurrent=1,  # Limit concurrency to avoid overloading
    tokenizer_backend="huggingface",
    # Tokenizer is loaded client-side from this repo; it should be compatible with the served model
    tokenizer="RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
)

print("Unquantized model configured.")

In [None]:
# Run MMLU evaluation on unquantized model
# Using limited samples for a quick workshop demo
unquantized_results = lm_eval.simple_evaluate(
    model=unquantized_model,
    tasks=["mmlu_abstract_algebra"],  # Single task for speed
    num_fewshot=0,                     # No few-shot examples (faster)
    batch_size=1,
    limit=50,                          # Only evaluate the first 50 samples per task
)

print("Unquantized model evaluation complete.")

In [None]:
# Display unquantized results
print("Unquantized Model Results")
print("=" * 50)
for task, metrics in unquantized_results["results"].items():
    acc = metrics.get("acc,none", metrics.get("acc", "N/A"))
    if isinstance(acc, float):
        print(f"{task}: {acc:.4f}")
    else:
        print(f"{task}: {acc}")

---
## 3. Evaluate Quantized Model

Now let's run the same evaluation on the FP8 quantized model.

In [None]:
# Configure the quantized model
quantized_model = LocalCompletionsAPI(
    model=os.environ["QUANTIZED_MODEL"],
    base_url=f"{os.environ['QUANTIZED_URL']}/v1/completions",
    num_concurrent=1,  # Limit concurrency to avoid overloading
    tokenizer_backend="huggingface",
    # Tokenizer is loaded client-side from this repo; it should be compatible with the served model
    tokenizer="RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
)

print("Quantized model configured.")

In [None]:
# Run MMLU evaluation on quantized model
# Using same limited samples for comparison
quantized_results = lm_eval.simple_evaluate(
    model=quantized_model,
    tasks=["mmlu_abstract_algebra"],  # Single task for speed
    num_fewshot=0,                     # No few-shot examples (faster)
    batch_size=1,
    limit=50,                          # Only evaluate the first 50 samples per task
)

print("Quantized model evaluation complete.")

In [None]:
# Display quantized results
print("Quantized Model Results")
print("=" * 50)
for task, metrics in quantized_results["results"].items():
    acc = metrics.get("acc,none", metrics.get("acc", "N/A"))
    if isinstance(acc, float):
        print(f"{task}: {acc:.4f}")
    else:
        print(f"{task}: {acc}")
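
The display cells for both models repeat the same lookup. One way to factor it out is a small helper that returns a task-to-accuracy mapping. `extract_accuracy` is a hypothetical name, and the sample dictionary below only mimics the shape of the `results` entry, with invented numbers.

```python
def extract_accuracy(results: dict) -> dict:
    """Map each task name to its accuracy, or None when none is reported."""
    accuracies = {}
    for task, metrics in results["results"].items():
        # Prefer the "acc,none" key used above; fall back to a plain "acc" key
        accuracies[task] = metrics.get("acc,none", metrics.get("acc"))
    return accuracies


# Invented values in the same shape as the results used above
sample = {
    "results": {
        "mmlu_abstract_algebra": {"acc,none": 0.32, "acc_stderr,none": 0.047}
    }
}
for task, acc in extract_accuracy(sample).items():
    print(f"{task}: {acc:.4f}" if acc is not None else f"{task}: N/A")
```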

---
## 4. Compare Results

Let's compare the accuracy of both models side by side.

In [None]:
# Compare results
print("Model Comparison: Unquantized vs Quantized (FP8)")
print("=" * 70)
print(f"{'Task':<30} {'Unquantized':>15} {'Quantized':>15} {'Diff':>10}")
print("-" * 70)

total_unquant = 0
total_quant = 0
num_tasks = 0

for task in unquantized_results["results"].keys():
    unquant_acc = unquantized_results["results"][task].get("acc,none", 0)
    quant_acc = quantized_results["results"][task].get("acc,none", 0)
    
    if isinstance(unquant_acc, float) and isinstance(quant_acc, float):
        diff = quant_acc - unquant_acc
        print(f"{task:<30} {unquant_acc:>15.4f} {quant_acc:>15.4f} {diff:>+10.4f}")
        total_unquant += unquant_acc
        total_quant += quant_acc
        num_tasks += 1

if num_tasks > 0:
    avg_unquant = total_unquant / num_tasks
    avg_quant = total_quant / num_tasks
    avg_diff = avg_quant - avg_unquant
    print("-" * 70)
    print(f"{'Average':<30} {avg_unquant:>15.4f} {avg_quant:>15.4f} {avg_diff:>+10.4f}")
    print()
    print(f"Quantization impact: {avg_diff*100:+.2f} percentage points of accuracy")
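
With only 50 samples per model, a small accuracy difference may be sampling noise. The sketch below compares a difference against the combined standard error, using invented numbers. `binomial_stderr` and `difference_is_within_noise` are hypothetical helpers, and the two-standard-error threshold is a common rule of thumb rather than a rigorous test. It also treats the two runs as independent, although both models see the same questions.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Approximate standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1.0 - acc) / n)


def difference_is_within_noise(acc_a, se_a, acc_b, se_b, z=2.0):
    """True when |acc_a - acc_b| is below z combined standard errors."""
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) < z * combined


# Invented accuracies for two 50-sample runs
unquant_acc, quant_acc, n = 0.34, 0.30, 50
se_u = binomial_stderr(unquant_acc, n)
se_q = binomial_stderr(quant_acc, n)
print(f"stderr: {se_u:.3f} vs {se_q:.3f}")
print("within noise:", difference_is_within_noise(unquant_acc, se_u, quant_acc, se_q))
```

lm_eval results usually also carry a standard error next to the accuracy (under a key such as `acc_stderr,none`), which could be used in place of the binomial approximation when present.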

---
## Summary

You've learned:

- How to use `lm_eval` to evaluate models served via OpenAI-compatible APIs
- How to run MMLU benchmarks on remote model endpoints
- How to compare quantized vs unquantized model performance
- How to limit request concurrency when multiple users share endpoints

**Key Takeaways:**
- FP8 quantization typically costs little accuracy (often under 1 percentage point), though a 50-sample run on a single task is too small to confirm this on its own
- In exchange, FP8 reduces weight memory and can improve inference speed
- Always benchmark on tasks relevant to your use case
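
As a rough illustration of the memory side of that trade-off, weight storage scales with bytes per parameter. The sketch below assumes about 3.2 billion parameters and a 16-bit (BF16) unquantized baseline. Both are assumptions for illustration, not measurements from these endpoints, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: roughly 3.2 billion parameters for a 3B-class model
num_params = 3.2e9
bf16 = weight_memory_gib(num_params, 2)  # 16-bit weights: 2 bytes each
fp8 = weight_memory_gib(num_params, 1)   # 8-bit weights: 1 byte each
print(f"BF16: {bf16:.1f} GiB, FP8: {fp8:.1f} GiB")
```

This ignores the KV cache, activations, and any layers left unquantized, so the saving in a real deployment is likely smaller than the factor of two suggested here.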