## Evaluate Accuracy of the Base Model
Compression is only useful if the model retains its generative capabilities. In this step, the base model is evaluated on standard benchmarks to establish a performance baseline against which the compressed model is later compared.

**Goal**: Evaluate the base (uncompressed) model on standard benchmarks to establish a performance baseline for later comparison with the compressed model.

**Key Actions**:

- Use the **evaluate** helper from `utils`, which uses `simple_evaluate` from LM Eval, to test the base model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as accuracy (`acc`), normalized accuracy (`acc_norm`), and task-specific scores.

- Save the results as a pickle file (`results.pkl`).

**Outcome**:

- Quantitative metrics for the base model.

- A baseline to compare the compressed model's accuracy against.
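The `evaluate` helper lives in this project's `utils` module, which is not shown here. As a rough sketch of how such a wrapper around `simple_evaluate` could be written, the example below assumes the Hugging Face (`hf`) backend and a zero-shot setting (the run log reports `num_fewshot` being set to 0). The names `build_eval_kwargs` and `evaluate_sketch` are hypothetical and are not the project's actual implementation.

```python
from typing import Any, Optional, Sequence, Union


def build_eval_kwargs(
    model_path: str,
    tasks: Sequence[str],
    limit: Optional[Union[int, float]] = None,
    batch_size: Union[int, str] = "auto",
    apply_chat_template: bool = True,
    num_fewshot: int = 0,
) -> dict:
    """Assemble keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",  # assumption: Hugging Face transformers backend
        "model_args": f"pretrained={model_path}",
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,  # assumption: zero-shot, as in the run log
        "limit": limit,  # None evaluates every example of each task
        "batch_size": batch_size,
        "apply_chat_template": apply_chat_template,
    }


def evaluate_sketch(model_path: str, tasks: Sequence[str], **options: Any) -> dict:
    """Run the benchmarks and return the LM Eval results dictionary."""
    # Imported lazily so build_eval_kwargs works without lm_eval installed.
    from lm_eval import simple_evaluate

    return simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))
```

Keeping the argument assembly separate from the call makes it easy to inspect exactly what would be passed to LM Eval before starting a long run.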

In [1]:
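import os

# `!export` runs in a throwaway subshell and does not affect this kernel,
# so set the allocator option in-process before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from lm_eval.utils import make_table
from utils import evaluate, load_pickle, save_pickle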



To check that you have enough free GPU memory to run this notebook, run the following command in a terminal:

`nvidia-smi`

The output will look something like this:

```text
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
| N/A   44C    P0             91W /  350W |   15753MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8049      C   /opt/app-root/bin/python3             15744MiB |
+-----------------------------------------------------------------------------------------+

```

If another process is holding GPU memory that you need, note its PID and run the following command:

`kill -9 <pid>`

Replace `<pid>` with the actual PID, for example `8049` in this case, so the command becomes `kill -9 8049`.
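The same memory check can be scripted from Python. The sketch below is one possible approach, not part of the original notebook: it relies on the `nvidia-smi` CSV query mode and on two hypothetical helpers, `parse_gpu_memory` and `query_gpu_memory`. The sample line mirrors the L40S figures from the output shown above.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one GPU per line."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for used and total memory (MiB) of every GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(output)


# Sample line matching the nvidia-smi output shown above (no GPU required).
for used, total in parse_gpu_memory("15753, 46068\n"):
    print(f"{total - used} MiB free of {total} MiB")
```

On a machine with an NVIDIA driver installed, calling `query_gpu_memory()` returns the live values instead of the sample.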


In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

### Define evaluation benchmarking datasets
The following benchmark datasets can be used for evaluating on multiple tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning
- HellaSwag: Commonsense completion


In [4]:
# define tasks you want to evaluate the model on
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]

### Evaluating the Base Model

**NOTE**: 
1. Running the evaluation on the entire list of tasks can take a long time (the run logged below spent over an hour on the loglikelihood requests alone), so for testing you can use a single task instead.

2. The results are stored as a **results.pkl** file in the directory defined by **base_results_dir**.
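The `save_pickle` and `load_pickle` helpers come from the project's `utils` module, which is not shown here. A minimal sketch of how they might work is given below, assuming each takes a directory and reads or writes a fixed `results.pkl` inside it. The `_sketch` names and the `filename` parameter are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any


def save_pickle_sketch(results_dir: str, obj: Any, filename: str = "results.pkl") -> Path:
    """Serialize obj to <results_dir>/<filename>, creating the directory if needed."""
    directory = Path(results_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    with path.open("wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle_sketch(results_dir: str, filename: str = "results.pkl") -> Any:
    """Load the object previously stored in <results_dir>/<filename>."""
    with (Path(results_dir) / filename).open("rb") as handle:
        return pickle.load(handle)
```

Because pickle files can execute arbitrary code when loaded, only load result files that you produced yourself.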


In [3]:
# setting directories
base_model_path = "../base_model"
base_results_dir = "results/base_accuracy"

In [6]:
# evaluate the base model and save results in pkl format
base_acc = evaluate(
    base_model_path,
    tasks,
    limit=None,
    batch_size="auto",
    apply_chat_template=True,
    verbosity=None,
)
save_pickle(base_results_dir, base_acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.06s/it]
Generating test split: 100%|██████████| 171/171 [00:00<00:00, 27728.52 examples/s]
Generating validation split: 100%|██████████| 19/19 [00:00<00:00, 10093.96 examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 3088.59 examples/s]
Generating test split: 100%|██████████| 1534/1534 [00:00<00:00, 178500.83 examples/s]
Generating validation split: 100%|██████████| 170/170 [00:00<00:00, 65200.41 examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 2924.49 examples/s]
Generating test split: 100%|██████████| 324/324 [00:00<00:00, 125538.52 examples/s]
Generating validation split: 100%|██████████| 35/35 [00:00<00:00, 19389.86 examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 2935.95 examples/s]
Generating test split: 100%|██████████| 311/311 [00:00<00:00, 97915.37 examples/s]
Generating validation split: 100%|██████████| 34/34 [00:00<00:00, 18670.64 examples/s]
Generating

Downloaded punkt_tab on rank 0


Generating train split: 100%|██████████| 541/541 [00:00<00:00, 71072.09 examples/s]
Overwriting default num_fewshot of hellaswag from None to 0
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of mmlu_abstract_algebra from None to 0
Overwriting default num_fewshot of mmlu_anatomy from None to 0
Overwriting default num_fewshot of mmlu_astronomy from None to 0
Overwriting default num_fewshot of mmlu_college_biology from None to 0
Overwriting default num_fewshot of mmlu_college_chemistry from None to 0
Overwriting default num_fewshot of mmlu_college_computer_science from None to 0
Overwriting default num_fewshot of mmlu_college_mathematics from None to 0
Overwriting default num_fewshot of mmlu_college_physics from None to 0
Overwriting default num_fewshot of mmlu_computer_security from None to 0
Overwriting default num_fewshot of mmlu_conceptual_physics from None to 0
Overwriting default num_fewshot of mmlu_electrical_engineering from None to 0
Ov

Passed argument batch_size = auto. Detecting largest batch size


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Determined Largest batch size: 1


Running generate_until requests:   0%|          | 1/541 [00:17<2:34:13, 17.14s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests:   0%|          | 2/541 [00:20<1:22:42,  9.21s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests:   1%|          | 3/541 [00:24<1:01:48,  6.89s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests:   1%|          | 4/541 [00:50<2:06:47, 14.17s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests:   1%|          | 5/541 [01:11<2:29:08, 16.69s/it]The following gene

Passed argument batch_size = auto:1. Detecting largest batch size
Determined largest batch size: 1


Running loglikelihood requests: 100%|██████████| 105837/105837 [1:19:04<00:00, 22.31it/s]


In [4]:
base_results = load_pickle(base_results_dir)
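For the later comparison with the compressed model, it can help to reduce the loaded results to one number per task. The sketch below assumes the LM Eval convention of storing scores under `results[task]["<metric>,<filter>"]` (for example `"acc,none"`); `summarize_metric` is a hypothetical helper, and the sample dictionary only copies a few values from the results table printed in this section.

```python
from typing import Dict


def summarize_metric(results: Dict, metric: str = "acc,none") -> Dict[str, float]:
    """Collect one metric per task from an LM Eval style results dictionary."""
    per_task = results.get("results", {})
    return {
        task: scores[metric]
        for task, scores in per_task.items()
        if metric in scores  # skip tasks that do not report this metric
    }


# Illustrative dictionary (assumed key layout), with values from the base-model table.
sample = {
    "results": {
        "arc_easy": {"acc,none": 0.8136, "acc_norm,none": 0.7588},
        "hellaswag": {"acc,none": 0.5741, "acc_norm,none": 0.7251},
        "ifeval": {"prompt_level_loose_acc,none": 0.7874},
    }
}
print(summarize_metric(sample))
```

Tasks such as `ifeval` that report differently named metrics are skipped unless the matching key is passed as `metric`.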

In [5]:
# print results for the base model
print(make_table(base_results))

|                 Tasks                 |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                               |      1|none  |     0|acc                    |↑  |0.8136|±  |0.0080|
|                                       |       |none  |     0|acc_norm               |↑  |0.7588|±  |0.0088|
|hellaswag                              |      1|none  |     0|acc                    |↑  |0.5741|±  |0.0049|
|                                       |       |none  |     0|acc_norm               |↑  |0.7251|±  |0.0045|
|ifeval                                 |      4|none  |     0|inst_level_loose_acc   |↑  |0.8513|±  |   N/A|
|                                       |       |none  |     0|inst_level_strict_acc  |↑  |0.8189|±  |   N/A|
|                                       |       |none  |     0|prompt_level_loose_acc |↑  |0.7874|±  |0.0176|
|         