## Evaluate Accuracy of the Base Model
To judge whether compression preserves the model's generative capabilities, we first need a baseline. This step evaluates the base model on standard benchmarks.

**Goal**: Establish baseline scores against which the compressed model can be compared, so we can verify that compression does not degrade model quality.

**Key Actions**:

- Use the **evaluate** helper, which wraps `simple_evaluate` from LM Eval, to test the base model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as `acc`, `acc_norm`, and task-specific scores.

- Save results as a pickle file for later comparison.

**Outcome**:

- Quantitative metrics for the base model.

- A baseline for judging whether the compressed model is good enough in terms of accuracy.

In [None]:
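import os

# Set the allocator option in-process, before CUDA is initialized.
# `!export ...` only affects a temporary subshell and never reaches this kernel.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from lm_eval.utils import make_table
from utils import evaluate, load_pickle, save_pickle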


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Define evaluation benchmarking datasets
The following benchmark datasets can be used for evaluating on multiple tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning
- HellaSwag: Commonsense completion


In [None]:
# define tasks you want to evaluate the model on
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]

### Evaluating the Base Model with `simple_evaluate`

`simple_evaluate` is the **main entry point** in LM Evaluation Harness for evaluating a model on one or more benchmark datasets. It handles:

1. Wrapping your model (or creating an LM object) to provide a **standardized interface**.
2. Preparing inputs and optionally applying **few-shot examples** or **chat/instruction templates**.
3. Running the model on benchmark tasks and collecting outputs.
4. Computing **evaluation metrics** (`acc`, `acc_norm`, etc.) for each task.
5. Returning a **results dictionary** that includes task-level metrics and model configuration info.
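
To make step 5 more concrete, here is a minimal sketch of reading metrics out of a results dictionary. It assumes the `metric,filter` key layout (for example `acc,none`) that recent LM Evaluation Harness versions use under `results["results"]`. The function name `flatten_metrics` is hypothetical, and `sample` is a hand-written stand-in whose values are copied from the `arc_easy` rows of the results table in this notebook, not a real harness output.

```python
def flatten_metrics(results: dict) -> list[tuple[str, str, float]]:
    """Turn results["results"] into (task, metric, value) rows, skipping stderr entries."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            # Assumed key layout: "<metric>,<filter>", e.g. "acc,none".
            # Entries without a comma (such as "alias") or with non-numeric
            # values (such as an "N/A" stderr) are not metrics and are skipped.
            if "," not in key or not isinstance(value, (int, float)):
                continue
            metric, _filter_name = key.split(",", 1)
            if metric.endswith("_stderr"):
                continue
            rows.append((task, metric, float(value)))
    return rows


# Hand-written stand-in for a real results dictionary (assumption, see above).
sample = {
    "results": {
        "arc_easy": {
            "alias": "arc_easy",
            "acc,none": 0.8127,
            "acc_stderr,none": 0.0080,
            "acc_norm,none": 0.7588,
            "acc_norm_stderr,none": 0.0088,
        }
    }
}

for task, metric, value in flatten_metrics(sample):
    print(f"{task:10s} {metric:10s} {value:.4f}")
```

Flat rows like these are convenient when the base and compressed scores need to be compared side by side later.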

We have wrapped **simple_evaluate** in a helper function **evaluate** which can be found in [utils.py](utils.py).

**Key concepts:**

- **LM object**:  
  LM Evaluation Harness wraps all models (Hugging Face, custom, or preloaded) in an `LM` object. This object provides a consistent interface (`loglikelihood`, `loglikelihood_rolling`, `generate_until`) regardless of model backend.

- **model_args**:  
  Optional dictionary or string containing arguments for the model constructor (e.g., `pretrained`, `dtype`). Sampling settings such as temperature, top-k, or top-p are generation arguments and are passed separately via `gen_kwargs`. Ignored if passing a pre-wrapped LM object.

- **apply_chat_template**:  
  If your model is chat-based or instruction-following, this parameter wraps each prompt in the model's chat template so that inputs match the model's training format.  
  
**Parameters used here:**
- `model`: Path or name of the model to evaluate (can be a string or an LM object).
- `model_args`: Optional string or dictionary of arguments passed to the model constructor (e.g., `pretrained`, `dtype`).
- `tasks`: List of task names or objects to evaluate.
- `num_fewshot`: Number of examples in the few-shot context (set to 0 for zero-shot).
- `batch_size`: Number of samples to process per batch.
- `device`: Device to run the model on (e.g., "cuda" or "cpu").
- `apply_chat_template`: Whether to wrap inputs in a chat-style template; useful for chat or instruction-tuned models.
- `verbosity`: Set logging level; use `"DEBUG"` to inspect inputs/outputs for debugging. Default is None.
- `log_samples`: Whether to log per-sample outputs for inspection.
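
As a rough illustration of how these parameters fit together, the sketch below shows one way a wrapper around `simple_evaluate` could be written. It is not the actual `evaluate` helper from utils.py, which is not shown here and may differ. The names `build_eval_kwargs` and `evaluate_sketch` are hypothetical, and the `"hf"` (Hugging Face) backend is an assumption; the real helper may use another backend.

```python
from typing import Any, Optional, Union


def build_eval_kwargs(
    model_path: str,
    tasks: list[str],
    limit: Optional[Union[int, float]] = None,
    batch_size: Union[int, str] = "auto",
    apply_chat_template: bool = True,
    num_fewshot: int = 0,
    device: str = "cuda",
) -> dict[str, Any]:
    """Collect keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",  # assumption: Hugging Face backend
        "model_args": f"pretrained={model_path}",  # constructor args, not sampling args
        "tasks": tasks,
        "num_fewshot": num_fewshot,  # 0 means zero-shot
        "batch_size": batch_size,
        "device": device,
        "limit": limit,  # None evaluates every example of each task
        "apply_chat_template": apply_chat_template,
        "log_samples": False,
    }


def evaluate_sketch(model_path: str, tasks: list[str], **options: Any) -> dict:
    """Run the harness with the kwargs built above."""
    # Imported lazily so the kwargs can be inspected without lm_eval installed.
    from lm_eval import simple_evaluate

    return simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))


# Inspect the arguments for a quick smoke test on a single task.
kwargs = build_eval_kwargs("./base_model", ["arc_easy"], limit=50)
print(kwargs["model_args"])
```

Keeping argument construction separate from the call makes it easy to check what will be passed to the harness before committing to a long evaluation run.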


**NOTE**: 
1. Running the evaluation on the entire list of tasks can take a long time. For testing, you can use a single task instead.

2. The results will be stored as a **results.pkl** file in the directory defined by the **base_results_dir** path (and, for the compressed model, the **compressed_results_dir** path).


In [None]:
# setting directories
base_model_path = "./base_model"
base_results_dir = "results/base_accuracy"

In [None]:
# evaluate the base model and save results in pkl format
base_acc = evaluate(
    base_model_path,
    tasks,
    limit=None,
    batch_size="auto",
    apply_chat_template=True,
    verbosity=None,
)
save_pickle(base_results_dir, base_acc)

In [None]:
base_results = load_pickle(base_results_dir)
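
For readers without access to utils.py, the following is a minimal sketch of what helpers like `save_pickle` and `load_pickle` might do, assuming they write and read a `results.pkl` file inside the given directory. The `_sketch` function names and the `filename` parameter are hypothetical; the real helpers may behave differently.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle_sketch(results_dir, obj, filename="results.pkl"):
    """Write obj to <results_dir>/<filename>, creating the directory if needed."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_pickle_sketch(results_dir, filename="results.pkl"):
    """Read back the object stored in <results_dir>/<filename>."""
    with (Path(results_dir) / filename).open("rb") as f:
        return pickle.load(f)


# Round trip in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "base_accuracy"
    save_pickle_sketch(target, {"acc": 0.8127})
    restored = load_pickle_sketch(target)

print(restored)
```

Pickle files can execute arbitrary code when loaded, so only load results files that you created yourself or otherwise trust.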

In [None]:
# print results for the base model
print(make_table(base_results))

|                 Tasks                 |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                               |      1|none  |     0|acc                    |↑  |0.8127|±  |0.0080|
|                                       |       |none  |     0|acc_norm               |↑  |0.7588|±  |0.0088|
|hellaswag                              |      1|none  |     0|acc                    |↑  |0.5742|±  |0.0049|
|                                       |       |none  |     0|acc_norm               |↑  |0.7254|±  |0.0045|
|ifeval                                 |      4|none  |     0|inst_level_loose_acc   |↑  |0.8513|±  |   N/A|
|                                       |       |none  |     0|inst_level_strict_acc  |↑  |0.8189|±  |   N/A|
|                                       |       |none  |     0|prompt_level_loose_acc |↑  |0.7874|±  |0.0176|
|         