## Evaluate Accuracy of the Compressed Model
After compression, this step evaluates the compressed model on standard benchmarks to determine how compression affects its accuracy and generative quality relative to the base model.

**Goal**: Establish the performance and accuracy of the compressed model and compare it later against the baseline to understand the impact of compression.

**Key Actions**:

- Use the **evaluate** function (imported from `utils`), which uses `simple_evaluate` from LM Eval, to test the compressed model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as accuracy (`acc`), normalized accuracy (`acc_norm`), and task-specific scores.

- Save the results as a pickle file (**results.pkl**) for later comparison.

**Outcome**:

- Quantitative metrics for the compressed model.

- Confidence that the model is good enough in terms of accuracy.
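The notebook imports `evaluate` from `utils` without showing its body. As a rough illustration of how such a wrapper could hand its arguments to `simple_evaluate`, here is a minimal sketch. The names `build_eval_kwargs` and `evaluate_sketch` are hypothetical, and the `"hf"` backend is an assumption; the real helper may use a different backend or pass extra options.

```python
from typing import Any, Optional, Sequence


def build_eval_kwargs(
    model_path: str,
    tasks: Sequence[str],
    limit: Optional[float] = None,
    batch_size: int = 16,
    apply_chat_template: bool = True,
    backend: str = "hf",  # assumption: the real helper may use another backend
) -> dict[str, Any]:
    """Collect the keyword arguments passed on to lm_eval.simple_evaluate."""
    return {
        "model": backend,
        "model_args": f"pretrained={model_path}",
        "tasks": list(tasks),
        "limit": limit,  # None means "use every example"
        "batch_size": batch_size,
        "apply_chat_template": apply_chat_template,
    }


def evaluate_sketch(model_path: str, tasks: Sequence[str], **options: Any):
    """Hypothetical stand-in for utils.evaluate; returns the raw results dict."""
    import lm_eval  # imported lazily so the kwargs builder works without it

    return lm_eval.simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))


# Quick smoke-test configuration: one task, capped at 50 examples.
quick_kwargs = build_eval_kwargs(
    "Llama_3.1_8B_Instruct_int8_dynamic", ["arc_easy"], limit=50
)
```

Keeping the argument assembly separate from the call makes it easy to inspect a configuration, such as the single-task run above, before starting a long evaluation.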

In [None]:
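import os

# `!export` only affects a throwaway subshell, so set the variable in this process instead
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from lm_eval.utils import make_table
from utils import evaluate, load_pickle, save_pickle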


INFO 12-10 13:22:55 [__init__.py:216] Automatically detected platform cuda.


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Define evaluation benchmarking datasets
The following benchmark datasets can be used for evaluating on multiple tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning (the `arc_easy` task is used here)
- HellaSwag: Commonsense completion


In [None]:
# define tasks you want to evaluate the model on
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]

### Evaluating the Compressed Model

**NOTE**:
1. Running the evaluation on the entire list of tasks can take a long time, so for testing you can use a single task instead.

2. The results will be stored as a **results.pkl** file in the directory defined by **compressed_results_dir**.
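The `save_pickle` and `load_pickle` helpers come from `utils` and are not shown in the notebook. A minimal sketch of how a pair of helpers like these might write and read a **results.pkl** file inside a results directory follows; the `_sketch` names and the `filename` parameter are hypothetical, and the demo writes to a temporary directory rather than to `results/compressed_accuracy`.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle_sketch(results_dir, obj, filename="results.pkl"):
    """Pickle obj to <results_dir>/<filename>, creating the directory if needed."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_pickle_sketch(results_dir, filename="results.pkl"):
    """Load the object previously stored in <results_dir>/<filename>."""
    with (Path(results_dir) / filename).open("rb") as f:
        return pickle.load(f)


# Round-trip a placeholder object through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "compressed_accuracy"
    saved_path = save_pickle_sketch(demo_dir, {"demo": 1})
    saved_name = saved_path.name
    restored = load_pickle_sketch(demo_dir)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.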

In [None]:
# setting directories
compressed_model_path = "Llama_3.1_8B_Instruct_int8_dynamic"
compressed_results_dir = "results/compressed_accuracy"

In [None]:
# evaluate the compressed model and save results in pkl format
comp_acc = evaluate(
    compressed_model_path,
    tasks,
    limit=None,
    batch_size=16,
    apply_chat_template=True,
    verbosity=None,
)
save_pickle(compressed_results_dir, comp_acc)

In [None]:
comp_results = load_pickle(compressed_results_dir)
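Once the results are loaded, individual scores can be read directly instead of only printing the table. The sketch below assumes the loaded object keeps the layout that LM Eval's `simple_evaluate` returns: a top-level `"results"` key with metric keys of the form `"<metric>,<filter>"`. The helper name `get_metric` is hypothetical, and the sample dictionary only mirrors the `arc_easy` rows of the table printed in this section.

```python
def get_metric(results, task, metric, filter_name="none"):
    """Return (value, stderr) for one metric of one task; stderr may be None."""
    task_scores = results["results"][task]
    value = task_scores[f"{metric},{filter_name}"]
    stderr = task_scores.get(f"{metric}_stderr,{filter_name}")
    return value, stderr


# Stand-in for comp_results, limited to the arc_easy scores shown in the table.
example_results = {
    "results": {
        "arc_easy": {
            "acc,none": 0.8114,
            "acc_stderr,none": 0.0080,
            "acc_norm,none": 0.7584,
            "acc_norm_stderr,none": 0.0088,
        }
    }
}

acc, acc_stderr = get_metric(example_results, "arc_easy", "acc")
acc_norm, acc_norm_stderr = get_metric(example_results, "arc_easy", "acc_norm")
```

With the real data, the same call would be `get_metric(comp_results, "arc_easy", "acc")`, provided the stored object has this layout.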

In [None]:
# print results for the compressed model
print(make_table(comp_results))

|                 Tasks                 |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                               |      1|none  |     0|acc                    |↑  |0.8114|±  |0.0080|
|                                       |       |none  |     0|acc_norm               |↑  |0.7584|±  |0.0088|
|hellaswag                              |      1|none  |     0|acc                    |↑  |0.5756|±  |0.0049|
|                                       |       |none  |     0|acc_norm               |↑  |0.7261|±  |0.0045|
|ifeval                                 |      4|none  |     0|inst_level_loose_acc   |↑  |0.8609|±  |   N/A|
|                                       |       |none  |     0|inst_level_strict_acc  |↑  |0.8225|±  |   N/A|
|                                       |       |none  |     0|prompt_level_loose_acc |↑  |0.8004|±  |0.0172|
|         