## Evaluate Accuracy of the Compressed Model
Now that the base model has been compressed and an accuracy baseline has been established, this step evaluates the **compressed model** on standard benchmarks. The goal is to quantify how model compression impacts accuracy and generative quality relative to the base model.

**Goal**: Measure the accuracy of the compressed model and enable a direct comparison with the base model to assess the impact of compression.

**Key Actions**:

- Use the **evaluate** helper from `utils`, which uses `simple_evaluate` from LM Eval, to test the compressed model.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as accuracy (`acc`), normalized accuracy (`acc_norm`), and task-specific scores.

- Save the results as a pickle file (**results.pkl**) for later comparison.

**Outcome**:

- Quantitative metrics for the compressed model.

- Evidence of whether the compressed model retains enough accuracy relative to the base model.

More details on evaluating LLMs are provided in [Accuracy_Evaluation.md](../docs/Accuracy_Evaluation.md).
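
The `evaluate` helper itself lives in `utils` and is not shown in this notebook. As a rough illustration of what a wrapper around `simple_evaluate` could look like, here is a minimal sketch. The names `build_eval_kwargs` and `evaluate_sketch` are hypothetical, and the choice of the `hf` backend is an assumption; the real helper may load the model differently.

```python
from typing import Any, Dict, List, Optional, Union


def build_eval_kwargs(
    model_path: str,
    tasks: List[str],
    limit: Optional[Union[int, float]] = None,
    batch_size: Union[int, str] = 16,
    apply_chat_template: bool = True,
) -> Dict[str, Any]:
    """Collect the keyword arguments handed to lm_eval.simple_evaluate."""
    return {
        "model": "hf",  # assumption: Hugging Face backend
        "model_args": f"pretrained={model_path}",
        "tasks": list(tasks),
        "limit": limit,  # None evaluates every example of each task
        "batch_size": batch_size,
        "apply_chat_template": apply_chat_template,
    }


def evaluate_sketch(model_path: str, tasks: List[str], **options: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for utils.evaluate: run the benchmarks and return the results."""
    import lm_eval  # imported lazily so build_eval_kwargs works without LM Eval installed

    return lm_eval.simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))
```

Keeping the argument construction separate from the call makes it easy to check that the base and compressed runs use identical settings.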

### Install Dependencies

In [None]:
# uncomment the following lines to install dependencies if dependencies were not installed in 01_Base_Accuracy_Benchmarking/Base.ipynb
# !pip install .

In [None]:
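import os

# Set the allocator option inside the kernel process, before PyTorch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from lm_eval.utils import make_table
from utils import evaluate, load_pickle, save_pickle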

To make sure you have enough GPU memory to run this notebook, run the following command in a terminal:

`nvidia-smi`

The output will look something like this:

```text
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
| N/A   44C    P0             91W /  350W |   15753MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8049      C   /opt/app-root/bin/python3             15744MiB |
+-----------------------------------------------------------------------------------------+
```

If any processes are already running and using a significant amount of GPU memory, note the PID and run the following command:

`kill -9 <pid>`


Replace `<pid>` with the actual PID, for example `8049` in this case, so the command becomes `kill -9 8049`.
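
The same memory check can also be scripted from Python instead of reading the table by eye. The sketch below calls `nvidia-smi` with its CSV query flags and parses the used and total memory per GPU. The function names are illustrative and not part of this repository, and the parser assumes every value is reported as a plain integer number of MiB.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for per-GPU memory usage in MiB."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(output)
```

Calling `query_gpu_memory()` on a machine with the NVIDIA driver installed returns one `(used, total)` pair per GPU, so the free memory is `total - used`.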


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

### Define Evaluation Benchmark Datasets
We evaluate the compressed model on the same benchmarks as the base model to make the results comparable.
The following benchmarks cover several task types:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical & scientific reasoning
- HellaSwag: Commonsense completion


In [None]:
# define tasks you want to evaluate the model on
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]

### Evaluating the Compressed Model

**NOTE**:
1. Running the evaluation on the full list of tasks can take a long time. For testing, you can use a single task instead.

2. The results will be stored as a **results.pkl** file in the directory defined by **compressed_results_dir**.
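
For a quick test run along the lines of the first note, one option is a small wrapper that evaluates only the first task on a handful of examples. This is a sketch: `smoke_test` is a hypothetical name, and it assumes the `evaluate` helper forwards `limit` to LM Eval, where an integer caps the number of examples per task.

```python
from typing import Any, Callable, Dict, List


def smoke_test(
    evaluate_fn: Callable[..., Dict[str, Any]],
    model_path: str,
    tasks: List[str],
    limit: int = 20,
) -> Dict[str, Any]:
    """Run only the first task on a small number of examples."""
    return evaluate_fn(
        model_path,
        tasks[:1],  # a single task, as suggested in the note
        limit=limit,  # assumption: forwarded to LM Eval as examples per task
        batch_size=16,
        apply_chat_template=True,
        verbosity=None,
    )
```

Scores from a limited run are only useful for checking that the pipeline works; they are not comparable with a full evaluation.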

In [None]:
# setting directories
compressed_model_path = "../Llama_3.1_8B_Instruct_int8_dynamic"
compressed_results_dir = "../results/compressed_accuracy"

In [None]:
# evaluate the compressed model and save results in pkl format
comp_acc = evaluate(
    compressed_model_path,
    tasks,
    limit=None,
    batch_size=16,
    apply_chat_template=True,
    verbosity=None,
)
save_pickle(compressed_results_dir, comp_acc)
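
The `save_pickle` and `load_pickle` helpers come from `utils` and are not shown here. Based on the note above that results end up in a **results.pkl** file inside the results directory, they might look roughly like the following sketch. The `_sketch` names and the `filename` parameter are hypothetical.

```python
import os
import pickle
from typing import Any


def save_pickle_sketch(results_dir: str, obj: Any, filename: str = "results.pkl") -> str:
    """Write obj to results_dir/filename, creating the directory if needed."""
    os.makedirs(results_dir, exist_ok=True)
    path = os.path.join(results_dir, filename)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle_sketch(results_dir: str, filename: str = "results.pkl") -> Any:
    """Read back the object stored in results_dir/filename."""
    with open(os.path.join(results_dir, filename), "rb") as handle:
        return pickle.load(handle)
```

Only load pickle files that you created yourself, since unpickling untrusted data can execute arbitrary code.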

In [None]:
comp_results = load_pickle(compressed_results_dir)

In [None]:
# print results for the compressed model
print(make_table(comp_results))
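
To compare these numbers with the base model programmatically, the per-task metrics can be read out of the results dictionary. The sketch below assumes the layout used by recent LM Eval releases, where scores sit under `results["results"][task]` with keys such as `"acc,none"`. Both function names are hypothetical, and the base results would need to be loaded from wherever the baseline notebook saved them.

```python
from typing import Any, Dict


def headline_metrics(results: Dict[str, Any], metric: str = "acc,none") -> Dict[str, float]:
    """Map each task name to one metric from an LM Eval results dictionary."""
    per_task = results["results"]
    return {task: scores[metric] for task, scores in per_task.items() if metric in scores}


def accuracy_recovery(
    base: Dict[str, Any],
    compressed: Dict[str, Any],
    metric: str = "acc,none",
) -> Dict[str, float]:
    """Return compressed / base for every task that reports the metric in both runs."""
    base_scores = headline_metrics(base, metric)
    comp_scores = headline_metrics(compressed, metric)
    return {
        task: comp_scores[task] / base_scores[task]
        for task in comp_scores
        if task in base_scores and base_scores[task] > 0
    }
```

A recovery value close to 1.0 means the compressed model keeps nearly all of the base model's score on that task. Tasks such as `ifeval` report differently named metrics, so pass the matching key (for example `"prompt_level_strict_acc,none"`) to include them.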

Accuracy results for the compressed model:

```text
|                 Tasks                 |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                               |      1|none  |     0|acc                    |↑  |0.8106|±  |0.0080|
|                                       |       |none  |     0|acc_norm               |↑  |0.7555|±  |0.0088|
|hellaswag                              |      1|none  |     0|acc                    |↑  |0.5734|±  |0.0049|
|                                       |       |none  |     0|acc_norm               |↑  |0.7277|±  |0.0044|
|ifeval                                 |      4|none  |     0|inst_level_loose_acc   |↑  |0.8549|±  |   N/A|
|                                       |       |none  |     0|inst_level_strict_acc  |↑  |0.8237|±  |   N/A|
|                                       |       |none  |     0|prompt_level_loose_acc |↑  |0.7893|±  |0.0175|
|                                       |       |none  |     0|prompt_level_strict_acc|↑  |0.7449|±  |0.0188|
|mmlu                                   |      2|none  |      |acc                    |↑  |0.6311|±  |0.0038|
| - humanities                          |      2|none  |      |acc                    |↑  |0.5911|±  |0.0068|
|  - formal_logic                       |      1|none  |     0|acc                    |↑  |0.4921|±  |0.0447|
|  - high_school_european_history       |      1|none  |     0|acc                    |↑  |0.7697|±  |0.0329|
|  - high_school_us_history             |      1|none  |     0|acc                    |↑  |0.7990|±  |0.0281|
|  - high_school_world_history          |      1|none  |     0|acc                    |↑  |0.8186|±  |0.0251|
|  - international_law                  |      1|none  |     0|acc                    |↑  |0.7686|±  |0.0385|
|  - jurisprudence                      |      1|none  |     0|acc                    |↑  |0.7500|±  |0.0419|
|  - logical_fallacies                  |      1|none  |     0|acc                    |↑  |0.7669|±  |0.0332|
|  - moral_disputes                     |      1|none  |     0|acc                    |↑  |0.6792|±  |0.0251|
|  - moral_scenarios                    |      1|none  |     0|acc                    |↑  |0.4369|±  |0.0166|
|  - philosophy                         |      1|none  |     0|acc                    |↑  |0.6913|±  |0.0262|
|  - prehistory                         |      1|none  |     0|acc                    |↑  |0.7191|±  |0.0250|
|  - professional_law                   |      1|none  |     0|acc                    |↑  |0.4687|±  |0.0127|
|  - world_religions                    |      1|none  |     0|acc                    |↑  |0.8363|±  |0.0284|
| - other                               |      2|none  |      |acc                    |↑  |0.7132|±  |0.0079|
|  - business_ethics                    |      1|none  |     0|acc                    |↑  |0.6500|±  |0.0479|
|  - clinical_knowledge                 |      1|none  |     0|acc                    |↑  |0.7019|±  |0.0282|
|  - college_medicine                   |      1|none  |     0|acc                    |↑  |0.6474|±  |0.0364|
|  - global_facts                       |      1|none  |     0|acc                    |↑  |0.4100|±  |0.0494|
|  - human_aging                        |      1|none  |     0|acc                    |↑  |0.6861|±  |0.0311|
|  - management                         |      1|none  |     0|acc                    |↑  |0.7864|±  |0.0406|
|  - marketing                          |      1|none  |     0|acc                    |↑  |0.8462|±  |0.0236|
|  - medical_genetics                   |      1|none  |     0|acc                    |↑  |0.7700|±  |0.0423|
|  - miscellaneous                      |      1|none  |     0|acc                    |↑  |0.8059|±  |0.0141|
|  - nutrition                          |      1|none  |     0|acc                    |↑  |0.7614|±  |0.0244|
|  - professional_accounting            |      1|none  |     0|acc                    |↑  |0.4965|±  |0.0298|
|  - professional_medicine              |      1|none  |     0|acc                    |↑  |0.7721|±  |0.0255|
|  - virology                           |      1|none  |     0|acc                    |↑  |0.5361|±  |0.0388|
| - social sciences                     |      2|none  |      |acc                    |↑  |0.7394|±  |0.0077|
|  - econometrics                       |      1|none  |     0|acc                    |↑  |0.4474|±  |0.0468|
|  - high_school_geography              |      1|none  |     0|acc                    |↑  |0.7778|±  |0.0296|
|  - high_school_government_and_politics|      1|none  |     0|acc                    |↑  |0.8187|±  |0.0278|
|  - high_school_macroeconomics         |      1|none  |     0|acc                    |↑  |0.6487|±  |0.0242|
|  - high_school_microeconomics         |      1|none  |     0|acc                    |↑  |0.7437|±  |0.0284|
|  - high_school_psychology             |      1|none  |     0|acc                    |↑  |0.8606|±  |0.0149|
|  - human_sexuality                    |      1|none  |     0|acc                    |↑  |0.7634|±  |0.0373|
|  - professional_psychology            |      1|none  |     0|acc                    |↑  |0.6814|±  |0.0189|
|  - public_relations                   |      1|none  |     0|acc                    |↑  |0.6636|±  |0.0453|
|  - security_studies                   |      1|none  |     0|acc                    |↑  |0.6857|±  |0.0297|
|  - sociology                          |      1|none  |     0|acc                    |↑  |0.8408|±  |0.0259|
|  - us_foreign_policy                  |      1|none  |     0|acc                    |↑  |0.8600|±  |0.0349|
| - stem                                |      2|none  |      |acc                    |↑  |0.5043|±  |0.0084|
|  - abstract_algebra                   |      1|none  |     0|acc                    |↑  |0.2500|±  |0.0435|
|  - anatomy                            |      1|none  |     0|acc                    |↑  |0.6444|±  |0.0414|
|  - astronomy                          |      1|none  |     0|acc                    |↑  |0.6842|±  |0.0378|
|  - college_biology                    |      1|none  |     0|acc                    |↑  |0.7431|±  |0.0365|
|  - college_chemistry                  |      1|none  |     0|acc                    |↑  |0.4500|±  |0.0500|
|  - college_computer_science           |      1|none  |     0|acc                    |↑  |0.4200|±  |0.0496|
|  - college_mathematics                |      1|none  |     0|acc                    |↑  |0.2700|±  |0.0446|
|  - college_physics                    |      1|none  |     0|acc                    |↑  |0.3824|±  |0.0484|
|  - computer_security                  |      1|none  |     0|acc                    |↑  |0.7300|±  |0.0446|
|  - conceptual_physics                 |      1|none  |     0|acc                    |↑  |0.6000|±  |0.0320|
|  - electrical_engineering             |      1|none  |     0|acc                    |↑  |0.6069|±  |0.0407|
|  - elementary_mathematics             |      1|none  |     0|acc                    |↑  |0.4048|±  |0.0253|
|  - high_school_biology                |      1|none  |     0|acc                    |↑  |0.7774|±  |0.0237|
|  - high_school_chemistry              |      1|none  |     0|acc                    |↑  |0.4729|±  |0.0351|
|  - high_school_computer_science       |      1|none  |     0|acc                    |↑  |0.5800|±  |0.0496|
|  - high_school_mathematics            |      1|none  |     0|acc                    |↑  |0.2519|±  |0.0265|
|  - high_school_physics                |      1|none  |     0|acc                    |↑  |0.3444|±  |0.0388|
|  - high_school_statistics             |      1|none  |     0|acc                    |↑  |0.4213|±  |0.0337|
|  - machine_learning                   |      1|none  |     0|acc                    |↑  |0.4732|±  |0.0474|
```