## Evaluate Accuracy of the Compressed Model
After compression, this step evaluates the compressed model on standard benchmarks to determine how compression affects its accuracy and generative quality relative to the base model.

**Goal**: Establish the performance and accuracy of the compressed model and compare it later against the baseline to understand the impact of compression.

**Key Actions**:

- Use the **evaluate** helper (imported from `utils`), which calls `simple_evaluate` from LM Eval, to test the compressed model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as accuracy (`acc`), normalized accuracy (`acc_norm`), and task-specific scores.

- Save results as a pickle file (**results.pkl**) for later comparison.

**Outcome**:

- Quantitative metrics for the compressed model.

- Confidence that the model is good enough in terms of accuracy.
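The `evaluate` helper used in this notebook lives in `utils` and its source is not shown here. As a rough illustration only, the sketch below shows how such a wrapper might forward its arguments to `simple_evaluate`. The names `build_eval_kwargs` and `evaluate_sketch` are hypothetical, the `hf` backend is an assumption, and `num_fewshot=0` is inferred from the "Overwriting default num_fewshot ... to 0" lines in the evaluation log.

```python
def build_eval_kwargs(model_path, tasks, limit=None, batch_size=16,
                      apply_chat_template=True, num_fewshot=0):
    """Assemble keyword arguments for lm_eval.simple_evaluate.

    Hypothetical helper: the real `utils.evaluate` may be implemented differently.
    """
    return {
        "model": "hf",                             # assumption: Hugging Face backend
        "model_args": f"pretrained={model_path}",  # local path or hub id of the model
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,                # assumption: zero-shot, as the log suggests
        "limit": limit,                            # None evaluates every example
        "batch_size": batch_size,
        "apply_chat_template": apply_chat_template,
    }


def evaluate_sketch(model_path, tasks, **options):
    """Run LM Eval on a model and return the raw results dictionary."""
    # Imported lazily so the kwargs builder can be used without lm_eval installed.
    from lm_eval import simple_evaluate

    return simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))


# Inspect the arguments for a quick single-task run limited to 10 examples.
kwargs = build_eval_kwargs("path/to/model", ["arc_easy"], limit=10)
print(kwargs)
```

Keeping the argument assembly separate from the call makes it easy to check what will be sent to LM Eval before starting a long run.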

In [3]:
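import os

# Set this before importing torch so the CUDA allocator picks it up.
# (`!export ...` runs in a separate subshell and does not affect this kernel.)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from lm_eval.utils import make_table
from utils import evaluate, load_pickle, save_pickle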

To make sure you have enough GPU memory to run this notebook, run the following command in a terminal:

`nvidia-smi`

The output will look something like this:

```text
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
| N/A   44C    P0             91W /  350W |   15753MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8049      C   /opt/app-root/bin/python3             15744MiB |
+-----------------------------------------------------------------------------------------+
```

Note the PID and run the following command:

`kill -9 <pid>`


Replace `<pid>` with the actual PID, which is `8049` in this example, so the command becomes `kill -9 8049`.
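The same memory check can also be done from Python. The sketch below is one possible approach: it calls `nvidia-smi` with its CSV query flags and parses used and total memory per GPU. The function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of this notebook's `utils`.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'memory.used, memory.total' CSV rows (values in MiB), one row per GPU."""
    gpus = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "gpu": index,
            "used_mib": used,
            "total_mib": total,
            "free_mib": total - used,
        })
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage; requires an NVIDIA driver."""
    completed = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(completed.stdout)


# Parsing the figures from the sample output above (15753 MiB used of 46068 MiB).
print(parse_gpu_memory("15753, 46068"))
# → [{'gpu': 0, 'used_mib': 15753, 'total_mib': 46068, 'free_mib': 30315}]
```

On a machine with a GPU, `query_gpu_memory()` returns the live values in the same shape.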


In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Define evaluation benchmarking datasets
We evaluate the compressed model on the same benchmarks as the base model to make the results comparable.
The following benchmarks cover a range of tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning (the `arc_easy` subset is used here)
- HellaSwag: Commonsense completion


In [5]:
# define tasks you want to evaluate the model on
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]

### Evaluating the Compressed Model

**NOTE**: 
1. Running the evaluation on the entire list of tasks can take a long time. For a quick test, use a single task instead.

2. The results are stored as a **results.pkl** file in the directory defined by **compressed_results_dir**.
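The `save_pickle` and `load_pickle` helpers come from this notebook's `utils` module, whose source is not shown. As an illustration of the behavior described in the note, the sketch below writes and reads a `results.pkl` file inside a results directory. The `_sketch` function names and the `filename` parameter are assumptions, and the real helpers may differ.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle_sketch(results_dir, obj, filename="results.pkl"):
    """Write obj to <results_dir>/<filename>, creating the directory if needed."""
    directory = Path(results_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    with path.open("wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle_sketch(results_dir, filename="results.pkl"):
    """Read back the object stored in <results_dir>/<filename>."""
    with (Path(results_dir) / filename).open("rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "results" / "demo"
    saved_path = save_pickle_sketch(demo_dir, {"results": {}})
    restored = load_pickle_sketch(demo_dir)

print(saved_path.name, restored)
```

Only load pickle files you created yourself, since unpickling can execute arbitrary code.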

In [6]:
# setting directories
compressed_model_path = "../Llama_3.1_8B_Instruct_int8_dynamic"
compressed_results_dir = "results/compressed_accuracy"

In [7]:
# evaluate the compressed model and save results in pkl format
comp_acc = evaluate(
    compressed_model_path,
    tasks,
    limit=None,
    batch_size=16,
    apply_chat_template=True,
    verbosity=None,
)
save_pickle(compressed_results_dir, comp_acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.56it/s]
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 7833f6e1-4675-4100-b0d2-9bf05d5490ff)')' thrown while requesting HEAD https://huggingface.co/datasets/cais/mmlu/resolve/c30699e8356da336a370243923dbaf21066bb9fe/.huggingface.yaml
Retrying in 1s [Retry 1/5].
Overwriting default num_fewshot of hellaswag from None to 0
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of mmlu_abstract_algebra from None to 0
Overwriting default num_fewshot of mmlu_anatomy from None to 0
Overwriting default num_fewshot of mmlu_astronomy from None to 0
Overwriting default num_fewshot of mmlu_college_biology from None to 0
Overwriting default num_fewshot of mmlu_college_chemistry from None to 0
Overwriting default num_fewshot of mmlu_college_computer_science from None to 0
Overwriting default num_fewshot of mmlu_college_math

In [8]:
comp_results = load_pickle(compressed_results_dir)
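For a later comparison against the baseline, it can help to pull the headline metrics out of the loaded results into a flat mapping. The sketch below assumes the LM Eval layout in which `results["results"][task]` maps keys such as `"acc,none"` (metric name, then filter) to values. `flatten_metrics` is a hypothetical helper, and the sample values are copied from the `arc_easy` rows of the results table.

```python
def flatten_metrics(results):
    """Return {(task, metric): value} for numeric metrics, skipping stderr entries."""
    flat = {}
    for task, metrics in results.get("results", {}).items():
        for key, value in metrics.items():
            # Skip display aliases and non-numeric entries such as "N/A".
            if key == "alias" or not isinstance(value, (int, float)):
                continue
            metric, _, _filter_name = key.partition(",")
            if metric.endswith("_stderr"):
                continue
            flat[(task, metric)] = value
    return flat


# Sample mirroring the assumed structure, with values taken from the arc_easy rows.
sample = {
    "results": {
        "arc_easy": {
            "alias": "arc_easy",
            "acc,none": 0.8106,
            "acc_stderr,none": 0.0080,
            "acc_norm,none": 0.7555,
            "acc_norm_stderr,none": 0.0088,
        }
    }
}
print(flatten_metrics(sample))
# → {('arc_easy', 'acc'): 0.8106, ('arc_easy', 'acc_norm'): 0.7555}
```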

In [9]:
# print results for the compressed model
print(make_table(comp_results))

|                 Tasks                 |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                               |      1|none  |     0|acc                    |↑  |0.8106|±  |0.0080|
|                                       |       |none  |     0|acc_norm               |↑  |0.7555|±  |0.0088|
|hellaswag                              |      1|none  |     0|acc                    |↑  |0.5734|±  |0.0049|
|                                       |       |none  |     0|acc_norm               |↑  |0.7277|±  |0.0044|
|ifeval                                 |      4|none  |     0|inst_level_loose_acc   |↑  |0.8549|±  |   N/A|
|                                       |       |none  |     0|inst_level_strict_acc  |↑  |0.8237|±  |   N/A|
|                                       |       |none  |     0|prompt_level_loose_acc |↑  |0.7893|±  |0.0175|
|         