## Evaluate Accuracy of the Compressed Model
After compression, it’s important to ensure the model maintains its generative capabilities. This step evaluates the compressed model on standard benchmarks.

**Goal**: Verify that compression does not degrade model quality.

**Key Actions**:

- Use a helper function, **evaluate**, that wraps `simple_evaluate` from LM Evaluation Harness to test the compressed model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics like accuracy, accuracy_norm, and task-specific scores.

- Save results to disk (pickled in this notebook) for later comparison.

**Outcome**:

- Quantitative metrics for the compressed model.

- Confidence that the model is ready for system-level performance benchmarking.
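
For a human-readable copy of the metrics alongside whatever format the helpers write, a JSON dump can be sketched as follows. This is a minimal sketch, not the notebook's own helper: `save_metrics_json` is a hypothetical name, and the example dictionary holds placeholder values that only mimic the shape of an LM Eval results dictionary.

```python
import json
import tempfile
from pathlib import Path


def save_metrics_json(results: dict, out_path: str) -> Path:
    """Write only the per-task metrics block of a results dict to JSON."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str is a fallback for values that json cannot encode natively.
    with path.open("w", encoding="utf-8") as f:
        json.dump(results.get("results", {}), f, indent=2, default=str)
    return path


# Tiny stand-in for a real results dict (placeholder values, not measured).
example = {"results": {"demo_task": {"acc,none": 0.5, "acc_stderr,none": "N/A"}}}
out_file = save_metrics_json(example, str(Path(tempfile.mkdtemp()) / "metrics.json"))

# Read the file back to confirm that the round trip preserves the metrics.
reloaded = json.loads(out_file.read_text(encoding="utf-8"))
print(reloaded["demo_task"]["acc,none"])
```

Writing only the `results` block keeps the file small and avoids most objects that JSON cannot represent; the `default=str` fallback covers the rest.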

In [1]:
from lm_eval.utils import make_table
import torch
import json
from typing import Union
from utils import evaluate, extract_task_metrics, save_pickle, load_pickle



In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Define Evaluation Benchmark Datasets
The following benchmark datasets can be used for evaluating on multiple tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning
- HellaSwag: Commonsense completion


In [3]:
# define tasks you want to evaluate the model on
tasks = [
    "mmlu",
    "arc_easy",
    "hellaswag",
    "ifeval"
]

### Evaluating the Compressed Model with `simple_evaluate`

`simple_evaluate` is the **main entry point** in LM Evaluation Harness to evaluate a model across one or multiple benchmark datasets. It handles:

1. Wrapping your model (or creating an LM object) to provide a **standardized interface**.
2. Preparing inputs and optionally applying **few-shot examples** or **chat/instruction templates**.
3. Running the model on benchmark tasks and collecting outputs.
4. Computing **evaluation metrics** (accuracy, accuracy_norm, etc.) for each task.
5. Returning a **results dictionary** that includes task-level metrics and model configuration info.
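
To make the shape of that results dictionary concrete, the sketch below reads the metrics for one task. The `arc_easy` values are copied from this notebook's output; `numeric_metrics` is a hypothetical helper written for illustration and is not part of LM Eval or `utils.py`. Metric keys have the form `"<metric>,<filter>"`, so `"acc,none"` is accuracy with no output filter applied.

```python
# Excerpt shaped like the dictionary returned by simple_evaluate.
results = {
    "results": {
        "arc_easy": {
            "alias": "arc_easy",
            "acc,none": 0.5067340067340067,
            "acc_stderr,none": 0.010258852980991717,
            "acc_norm,none": 0.4393939393939394,
            "acc_norm_stderr,none": 0.010184134315437384,
        },
    }
}


def numeric_metrics(task_result: dict) -> dict:
    """Keep numeric metrics, dropping the alias and the stderr entries."""
    out = {}
    for key, value in task_result.items():
        # Split "acc,none" into the metric name and the filter name.
        name, _, _filter = key.partition(",")
        if isinstance(value, (int, float)) and "stderr" not in name:
            out[name] = value
    return out


arc = numeric_metrics(results["results"]["arc_easy"])
print(arc)
```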

We have wrapped **simple_evaluate** in a helper function **evaluate** which can be found in [utils.py](utils.py).

**Key concepts:**

- **LM object**:  
  LM Evaluation Harness wraps all models (Hugging Face, custom, or preloaded) in an `LM` object. This object provides a consistent interface (`loglikelihood`, `loglikelihood_rolling`, `generate_until`) regardless of model backend.

- **model_args**:  
  Optional dictionary or string of arguments used to construct the model (e.g., `pretrained=<path>`, `dtype`). Ignored when a pre-wrapped LM object is passed. Sampling settings such as temperature, top-k, or top-p belong in `gen_kwargs` instead.

- **apply_chat_template**:  
  If your model is chat-based or instruction-following, this parameter formats each prompt with the tokenizer's chat template so that inputs match the model's training format.  
  
**Parameters used here:**
- `model`: Path or name of the model to evaluate (can be a string or an LM object).
- `model_args`: Optional dictionary or string of model-construction arguments (e.g., `pretrained`, `dtype`).
- `tasks`: List of task names or objects to evaluate.
- `num_fewshot`: Number of examples in the few-shot context (set to 0 for zero-shot).
- `batch_size`: Number of samples to process per batch.
- `device`: Device to run the model on (e.g., "cuda" or "cpu").
- `apply_chat_template`: Whether to wrap inputs in a chat-style template; useful for chat or instruction-tuned models.
- `verbosity`: Set logging level; use `"DEBUG"` to inspect inputs/outputs for debugging. Default is None.
- `log_samples`: Whether to log per-sample outputs for inspection.


**NOTE**: Running the evaluation on the entire list of tasks can take a long time, so for testing you can use a single task instead.
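
The parameters listed above map onto a `simple_evaluate` call roughly as sketched below. This is an assumption-laden illustration, not the contents of `utils.py`: `build_eval_kwargs` and `evaluate_sketch` are hypothetical names, and the choice of the `"hf"` backend with a `pretrained=<path>` argument string is a guess at how the real `evaluate` helper loads the model. The usage line also shows a quick smoke-test configuration: one task and a small `limit`.

```python
from typing import Optional, Sequence


def build_eval_kwargs(
    model_path: str,
    tasks: Sequence[str],
    limit: Optional[int] = None,
    batch_size: int = 16,
    apply_chat_template: bool = True,
    device: str = "cuda",
) -> dict:
    """Assemble keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",  # assumption: Hugging Face backend
        "model_args": f"pretrained={model_path}",
        "tasks": list(tasks),
        "num_fewshot": 0,  # zero-shot
        "batch_size": batch_size,
        "device": device,
        "limit": limit,  # None evaluates every example in each task
        "apply_chat_template": apply_chat_template,
        "log_samples": False,
    }


def evaluate_sketch(model_path: str, tasks: Sequence[str], **options) -> dict:
    """Hypothetical stand-in for utils.evaluate; the real helper may differ."""
    from lm_eval import simple_evaluate  # imported lazily: heavy dependency

    return simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))


# Smoke-test settings: a single task, limited to 10 examples.
kwargs = build_eval_kwargs("./compressed_model", ["arc_easy"], limit=10)
print(kwargs["model_args"])
```

Scores obtained with a small `limit` are only useful for checking that the pipeline runs; they should not be compared against full-benchmark numbers.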



In [4]:
# setting directories
compressed_model_path = "./compressed_model"
compressed_results_dir = "results/compressed_accuracy"
base_model_path = "./base_model"
base_results_dir = "results/base_accuracy"

In [5]:
# evaluate the compressed model and save results in pkl format (uncomment the two lines below to re-run)
# comp_acc = evaluate(compressed_model_path, tasks, limit=None, batch_size=16, apply_chat_template=True, verbosity=None)
# save_pickle(compressed_results_dir, comp_acc)

Compressing model: 224it [00:00, 1212.52it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [01:08<00:00, 34.19s/it]
Overwriting default num_fewshot of hellaswag from None to 0
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of mmlu_abstract_algebra from None to 0
Overwriting default num_fewshot of mmlu_anatomy from None to 0
Overwriting default num_fewshot of mmlu_astronomy from None to 0
Overwriting default num_fewshot of mmlu_college_biology from None to 0
Overwriting default num_fewshot of mmlu_college_chemistry from None to 0
Overwriting default num_fewshot of mmlu_college_computer_science from None to 0
Overwriting default num_fewshot of mmlu_college_mathematics from None to 0
Overwriting default num_fewshot of mmlu_college_physics from None to 0
Overwriting default num_fewshot of mmlu_computer_security from None to 0
Overwriting default num_fewshot of mmlu_conceptual_physics from None to 0
Overwriting default num_fewshot of mmlu_elect

In [None]:
# evaluate the base model and save results in pkl format
base_acc = evaluate(base_model_path, tasks, limit=None, batch_size=16, apply_chat_template=True, verbosity=None)
save_pickle(base_results_dir, base_acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.48it/s]


In [7]:
# load the saved results for both models from their results directories
base_results = load_pickle(base_results_dir)
comp_results = load_pickle(compressed_results_dir)

{'results': {'arc_easy': {'alias': 'arc_easy',
   'acc,none': 0.5067340067340067,
   'acc_stderr,none': 0.010258852980991717,
   'acc_norm,none': 0.4393939393939394,
   'acc_norm_stderr,none': 0.010184134315437384},
  'hellaswag': {'alias': 'hellaswag',
   'acc,none': 0.44373630750846443,
   'acc_stderr,none': 0.004958089432669621,
   'acc_norm,none': 0.5502887870942044,
   'acc_norm_stderr,none': 0.0049644793245523094},
  'ifeval': {'alias': 'ifeval',
   'prompt_level_strict_acc,none': 0.1367837338262477,
   'prompt_level_strict_acc_stderr,none': 0.014787002800682885,
   'inst_level_strict_acc,none': 0.23501199040767387,
   'inst_level_strict_acc_stderr,none': 'N/A',
   'prompt_level_loose_acc,none': 0.16081330868761554,
   'prompt_level_loose_acc_stderr,none': 0.015808599888607115,
   'inst_level_loose_acc,none': 0.2577937649880096,
   'inst_level_loose_acc_stderr,none': 'N/A'},
  'mmlu': {'acc,none': 0.2486113089303518,
   'acc_stderr,none': 0.0036413387703834654,
   'alias': 'mmlu'

In [None]:
# print the results
print(make_table(base_results))

In [None]:
print(make_table(comp_results))
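
Beyond reading the two tables side by side, a per-task comparison can be sketched as below. `compare_metric` is a hypothetical helper, and the demo dictionaries contain placeholder numbers, not results from this notebook. The sketch treats the two runs as independent when combining standard errors, which is a simplifying assumption because both models are scored on the same examples. It also assumes numeric standard errors, so metrics that report `'N/A'` would need to be skipped.

```python
import math


def compare_metric(base: dict, comp: dict, task: str, metric: str = "acc") -> dict:
    """Compare one metric between two LM Eval style results dicts."""
    b = base["results"][task]
    c = comp["results"][task]
    b_val, c_val = b[f"{metric},none"], c[f"{metric},none"]
    b_se, c_se = b[f"{metric}_stderr,none"], c[f"{metric}_stderr,none"]
    delta = c_val - b_val
    # Standard error of the difference, treating the two runs as independent.
    se = math.sqrt(b_se**2 + c_se**2)
    return {
        "delta": delta,  # compressed minus base
        "recovery": c_val / b_val,  # fraction of the base score retained
        "within_2_se": abs(delta) <= 2 * se,
    }


# Placeholder numbers for illustration only; not results from this notebook.
base_demo = {"results": {"demo": {"acc,none": 0.50, "acc_stderr,none": 0.01}}}
comp_demo = {"results": {"demo": {"acc,none": 0.49, "acc_stderr,none": 0.01}}}
report = compare_metric(base_demo, comp_demo, "demo")
print(report)
```

With the real data, the same call would take `base_results` and `comp_results` and a task name such as `"arc_easy"`. A difference that stays within two standard errors is hard to distinguish from sampling noise.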