## Evaluate Accuracy of the Compressed Model
After compression, it’s important to ensure the model maintains its generative capabilities. This step evaluates the compressed model on standard benchmarks.

**Goal**: Verify that compression does not degrade model quality.

**Key Actions**:

- Use the **evaluate** helper (defined in utils.py), which wraps `simple_evaluate` from LM Evaluation Harness, to test the compressed model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as `acc`, `acc_norm`, and task-specific scores.

- Save results as a pickle file for later comparison.

**Outcome**:

- Quantitative metrics for the compressed model.

- Confidence that the model is ready for system-level performance benchmarking.

In [7]:
from lm_eval.utils import make_table
import torch
import json
from typing import Union
from utils import evaluate, save_pickle, load_pickle
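# `!export` only changes a throwaway subshell, so set the variable in this process
# (it must be set before the first CUDA allocation to take effect)
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"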

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Define evaluation benchmarking datasets
The following benchmark datasets can be used for evaluating on multiple tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning
- HellaSwag: Commonsense completion


In [9]:
# define tasks you want to evaluate the model on
tasks = [
    "mmlu",
    "arc_easy",
    "hellaswag",
    "ifeval"
]

### Evaluating the Compressed Model with `simple_evaluate`

`simple_evaluate` is the **main entry point** in LM Evaluation Harness to evaluate a model across one or multiple benchmark datasets. It handles:

1. Wrapping your model (or creating an LM object) to provide a **standardized interface**.
2. Preparing inputs and optionally applying **few-shot examples** or **chat/instruction templates**.
3. Running the model on benchmark tasks and collecting outputs.
4. Computing **evaluation metrics** (`acc`, `acc_norm`, etc.) for each task.
5. Returning a **results dictionary** that includes task-level metrics and model configuration info.

We have wrapped **simple_evaluate** in a helper function **evaluate** which can be found in [utils.py](utils.py).
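
The results dictionary mentioned in step 5 can be flattened into rows for inspection. The sketch below assumes the `metric,filter` key layout (for example `acc,none` and `acc_stderr,none`) used by recent LM Evaluation Harness releases; `flatten_metrics` is a hypothetical helper, not part of the library or of utils.py, and the toy dictionary only mirrors the arc_easy rows of the base-model table in this notebook.

```python
from typing import Any, Dict, List, Tuple


def flatten_metrics(results: Dict[str, Any]) -> List[Tuple[str, str, str, float]]:
    """Flatten results["results"] into (task, metric, filter, value) rows."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            # Metric keys look like "acc,none"; skip entries such as "alias"
            # and non-numeric values such as "N/A".
            if "," not in key or not isinstance(value, (int, float)):
                continue
            metric, filter_name = key.split(",", 1)
            if metric.endswith("_stderr"):
                continue  # standard errors are left out of this flat view
            rows.append((task, metric, filter_name, float(value)))
    return rows


# Toy dictionary (assumed layout) with values copied from the base-model table.
example = {
    "results": {
        "arc_easy": {
            "alias": "arc_easy",
            "acc,none": 0.8127,
            "acc_stderr,none": 0.0080,
            "acc_norm,none": 0.7588,
            "acc_norm_stderr,none": 0.0088,
        }
    }
}

for row in flatten_metrics(example):
    print(row)
```

If the installed harness version uses a different key layout, only the `split` logic should need adjusting.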

**Key concepts:**

- **LM object**:  
  LM Evaluation Harness wraps all models (Hugging Face, custom, or preloaded) in an `LM` object. This object provides a consistent interface (`loglikelihood`, `loglikelihood_rolling`, `generate_until`) regardless of model backend.

- **model_args**:  
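  Optional dictionary or string of arguments used to construct the model (e.g., `pretrained=<path>`, `dtype`). Ignored if passing a pre-wrapped LM object. Sampling settings such as temperature or top-p are generation arguments (`gen_kwargs`), not model arguments.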

- **apply_chat_template**:  
  If your model is chat-based or instruction-following, this parameter formats each prompt with the tokenizer's chat template so that inputs match the model's training format.  
  
**Parameters used here:**
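- `model`: The model to evaluate, given as a backend name (e.g., `"hf"`) or an LM object. The `evaluate` helper in this notebook takes the model path as its first argument.
- `model_args`: Optional dictionary or string of model-construction arguments (e.g., `pretrained=<path>`, `dtype`).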
- `tasks`: List of task names or objects to evaluate.
- `num_fewshot`: Number of examples in the few-shot context (set to 0 for zero-shot).
- `batch_size`: Number of samples to process per batch.
- `device`: Device to run the model on (e.g., "cuda" or "cpu").
- `apply_chat_template`: Whether to wrap inputs in a chat-style template; useful for chat or instruction-tuned models.
- `verbosity`: Set logging level; use `"DEBUG"` to inspect inputs/outputs for debugging. Default is None.
- `log_samples`: Whether to log per-sample outputs for inspection.


**NOTE**: Running the evaluation on the entire list of tasks can take a long time. For a quick test, use a single task instead.
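
The actual `evaluate` helper lives in utils.py and is not shown here. As a rough sketch of how such a wrapper might assemble its call, the code below builds the keyword arguments for `simple_evaluate`. The names `build_eval_kwargs` and `evaluate_sketch`, the `"hf"` backend choice, and the defaults are assumptions for illustration; `lm_eval` is imported lazily so the configuration step runs without the library installed.

```python
from typing import Any, Dict, List, Optional, Union


def build_eval_kwargs(
    model_path: str,
    tasks: List[str],
    limit: Optional[Union[int, float]] = None,
    batch_size: Union[int, str] = "auto",
    apply_chat_template: bool = True,
    device: str = "cuda",
) -> Dict[str, Any]:
    """Assemble keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",  # assumed: Hugging Face backend
        "model_args": f"pretrained={model_path}",
        "tasks": list(tasks),
        "num_fewshot": 0,  # zero-shot, matching the logs in this notebook
        "batch_size": batch_size,
        "device": device,
        "limit": limit,  # None evaluates every example of each task
        "apply_chat_template": apply_chat_template,
        "log_samples": False,
    }


def evaluate_sketch(model_path: str, tasks: List[str], **options: Any) -> Dict[str, Any]:
    """Run the harness with the assembled arguments (requires lm_eval and a model)."""
    from lm_eval import simple_evaluate  # lazy import: only needed for a real run

    return simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))


# Quick-test configuration: one task and a small per-task example limit.
smoke_kwargs = build_eval_kwargs("./compressed_model", ["arc_easy"], limit=20, batch_size=4)
print(smoke_kwargs["model_args"], smoke_kwargs["tasks"], smoke_kwargs["limit"])
```

Scores obtained with a small `limit` are only useful for checking that the pipeline runs; they should not be compared against full-benchmark numbers.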



In [10]:
# setting directories
compressed_model_path = "./compressed_model"
compressed_results_dir = "results/compressed_accuracy"
base_model_path = "./base_model"
base_results_dir = "results/base_accuracy"

In [5]:
# evaluate the compressed model and save results in pkl format
comp_acc = evaluate(compressed_model_path, tasks, limit=None, batch_size=16, apply_chat_template=True, verbosity=None)
save_pickle(compressed_results_dir, comp_acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.40it/s]
Overwriting default num_fewshot of hellaswag from None to 0
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of mmlu_abstract_algebra from None to 0
Overwriting default num_fewshot of mmlu_anatomy from None to 0
Overwriting default num_fewshot of mmlu_astronomy from None to 0
Overwriting default num_fewshot of mmlu_college_biology from None to 0
Overwriting default num_fewshot of mmlu_college_chemistry from None to 0
Overwriting default num_fewshot of mmlu_college_computer_science from None to 0
Overwriting default num_fewshot of mmlu_college_mathematics from None to 0
Overwriting default num_fewshot of mmlu_college_physics from None to 0
Overwriting default num_fewshot of mmlu_computer_security from None to 0
Overwriting default num_fewshot of mmlu_conceptual_physics from None to 0
Overwriting default num_fewshot of mmlu_electrical_engineering from None to 0
Overwriting d

In [None]:
# evaluate the base model and save results in pkl format
base_acc = evaluate(base_model_path, tasks, limit=None, batch_size="auto", apply_chat_template=True, verbosity=None)
save_pickle(base_results_dir, base_acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.82it/s]
Overwriting default num_fewshot of hellaswag from None to 0
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of mmlu_abstract_algebra from None to 0
Overwriting default num_fewshot of mmlu_anatomy from None to 0
Overwriting default num_fewshot of mmlu_astronomy from None to 0
Overwriting default num_fewshot of mmlu_college_biology from None to 0
Overwriting default num_fewshot of mmlu_college_chemistry from None to 0
Overwriting default num_fewshot of mmlu_college_computer_science from None to 0
Overwriting default num_fewshot of mmlu_college_mathematics from None to 0
Overwriting default num_fewshot of mmlu_college_physics from None to 0
Overwriting default num_fewshot of mmlu_computer_security from None to 0
Overwriting default num_fewshot of mmlu_conceptual_physics from None to 0
Overwriting default num_fewshot of mmlu_electrical_engineering from None to 0
Overwriting d

Passed argument batch_size = auto. Detecting largest batch size


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Determined Largest batch size: 1


Running generate_until requests:   0%|          | 1/541 [00:22<3:20:11, 22.24s/it]
Running generate_until requests:   0%|          | 2/541 [00:25<1:41:14, 11.27s/it]
Running generate_until requests:   1%|          | 3/541 [00:29<1:11:45,  8.00s/it]
Running generate_until requests:   1%|          | 4/541 [00:55<2:12:20, 14.79s/it]
Running generate_until requests:   1%|          | 5/541 [01:16<2:32:19, 17.05s/it]

In [11]:
base_results = load_pickle(base_results_dir)
comp_results = load_pickle(compressed_results_dir)
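
The `save_pickle` and `load_pickle` helpers come from utils.py, which is not shown in this notebook. A minimal sketch of what such helpers might look like is given below, assuming the `(path, object)` argument order used in the calls above. The `_sketch` names and the choice to create parent directories are illustrative assumptions, not the actual implementation.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_pickle_sketch(path: str, obj: Any) -> Path:
    """Serialize obj to path, creating parent directories as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as handle:
        pickle.dump(obj, handle)
    return target


def load_pickle_sketch(path: str) -> Any:
    """Load a previously pickled object from path."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory with a toy results dictionary.
toy_results = {"results": {"arc_easy": {"acc,none": 0.8106}}}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "results" / "demo_accuracy")
    save_pickle_sketch(demo_path, toy_results)
    restored = load_pickle_sketch(demo_path)

print(restored == toy_results)
```

Unpickling can execute arbitrary code, so only load pickle files that you created yourself.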

In [18]:
# print the results
print(make_table(base_results))

|                 Tasks                  |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|----------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                                |      1|none  |     0|acc                    |↑  |0.8127|±  |0.0080|
|                                        |       |none  |     0|acc_norm               |↑  |0.7588|±  |0.0088|
|hellaswag                               |      1|none  |     0|acc                    |↑  |0.5742|±  |0.0049|
|                                        |       |none  |     0|acc_norm               |↑  |0.7254|±  |0.0045|
|ifeval                                  |      4|none  |     0|inst_level_loose_acc   |↑  |0.8513|±  |   N/A|
|                                        |       |none  |     0|inst_level_strict_acc  |↑  |0.8189|±  |   N/A|
|                                        |       |none  |     0|prompt_level_loose_acc |↑  |0.7874|±  |0.0176|
|

In [19]:
print(make_table(comp_results))

|                 Tasks                  |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|----------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                                |      1|none  |     0|acc                    |↑  |0.8106|±  |0.0080|
|                                        |       |none  |     0|acc_norm               |↑  |0.7572|±  |0.0088|
|hellaswag                               |      1|none  |     0|acc                    |↑  |0.5750|±  |0.0049|
|                                        |       |none  |     0|acc_norm               |↑  |0.7249|±  |0.0045|
|ifeval                                  |      4|none  |     0|inst_level_loose_acc   |↑  |0.8621|±  |   N/A|
|                                        |       |none  |     0|inst_level_strict_acc  |↑  |0.8309|±  |   N/A|
|                                        |       |none  |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|

## Observation
Comparing the two tables shows that the compressed model scores almost identically to the base model on the metrics listed above. For example, arc_easy `acc` is 0.8127 (base) versus 0.8106 (compressed), hellaswag `acc_norm` is 0.7254 versus 0.7249, and ifeval `prompt_level_loose_acc` is 0.7874 versus 0.7967. Each of these differences is smaller than the reported standard error, which indicates that compression (e.g., quantization to 8-bit) preserves the model’s capabilities on these benchmarks.

The compressed model is therefore a reasonable candidate for deployment scenarios that require a smaller memory footprint and faster inference, without a significant loss of accuracy.
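
The comparison can also be done programmatically. The sketch below computes the compressed-minus-base difference for a metric and checks whether it falls within the combined standard error. It assumes the `metric,filter` key layout of the harness results dictionary; `compare_metric` is a hypothetical helper, and the toy dictionaries contain only values copied from the two tables in this notebook. The noise check is a rough heuristic, since both models are scored on the same examples and their errors are therefore not independent.

```python
import math
from typing import Any, Dict, Optional, Tuple


def compare_metric(
    base: Dict[str, Any],
    comp: Dict[str, Any],
    task: str,
    metric: str,
    filter_name: str = "none",
) -> Tuple[float, Optional[bool]]:
    """Return (compressed - base, within_combined_stderr) for one metric."""
    key = f"{metric},{filter_name}"
    se_key = f"{metric}_stderr,{filter_name}"
    base_task = base["results"][task]
    comp_task = comp["results"][task]
    delta = comp_task[key] - base_task[key]
    se_base, se_comp = base_task.get(se_key), comp_task.get(se_key)
    if isinstance(se_base, (int, float)) and isinstance(se_comp, (int, float)):
        combined = math.sqrt(se_base**2 + se_comp**2)
        within = abs(delta) <= combined
    else:
        within = None  # stderr missing or reported as "N/A"
    return round(delta, 4), within


# Toy dictionaries (assumed layout) with values from the tables above.
base_demo = {"results": {
    "arc_easy": {"acc,none": 0.8127, "acc_stderr,none": 0.0080},
    "ifeval": {
        "prompt_level_loose_acc,none": 0.7874,
        "prompt_level_loose_acc_stderr,none": 0.0176,
        "inst_level_loose_acc,none": 0.8513,
        "inst_level_loose_acc_stderr,none": "N/A",
    },
}}
comp_demo = {"results": {
    "arc_easy": {"acc,none": 0.8106, "acc_stderr,none": 0.0080},
    "ifeval": {
        "prompt_level_loose_acc,none": 0.7967,
        "prompt_level_loose_acc_stderr,none": 0.0173,
        "inst_level_loose_acc,none": 0.8621,
        "inst_level_loose_acc_stderr,none": "N/A",
    },
}}

for task, metric in [
    ("arc_easy", "acc"),
    ("ifeval", "prompt_level_loose_acc"),
    ("ifeval", "inst_level_loose_acc"),
]:
    print(task, metric, compare_metric(base_demo, comp_demo, task, metric))
```

With the real `base_results` and `comp_results` loaded earlier, the same function could be applied to every task and metric of interest.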