## Evaluate the accuracy of the base model

The `Llama-3.1-8B-Instruct` model is the **base model** for the `model-serve-flow` example. Before you compress the model, use this notebook to evaluate the base model on standard benchmarks to establish an accuracy baseline. In Module 6 of this example, you compare this baseline evaluation against the accuracy of the compressed version of the model.

**Goal**: Evaluate the base model on standard benchmarks to establish a baseline that you can later compare against the accuracy of the compressed model.

**Key actions**:

- Test the base model by using the **evaluate** function provided in the `utils.py` file. The **evaluate** function wraps the LMEval tool's `simple_evaluate` function.

- Benchmark on multiple datasets:

    - MMLU: General knowledge across subjects.

    - IFEval: Instruction-following tasks.

    - ARC: Logical and scientific reasoning.
    
    - HellaSwag: Commonsense completion.

- Collect metrics such as accuracy (`acc`), normalized accuracy (`acc_norm`), and task-specific scores.

- Save results as a pickle file (`results.pkl`).

**Outcome**:

- Quantitative metrics for the base model.

- A baseline to use for comparison against the compressed model's accuracy.

For details on evaluating LLMs, see [Evaluate the Accuracy of the Base and Compressed Models](../docs/Accuracy_Evaluation.md).

### Install dependencies

In [None]:
!pip install -qqU .

In [None]:
import os

import torch
from lm_eval.utils import make_table
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils import evaluate, load_pickle, model_size_gb, save_pickle

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

### Check GPU memory

To make sure that you have enough GPU memory to run this notebook:

1. In a terminal window, run the following command:

   `nvidia-smi`

2. Verify that the output is similar to the following example:

   ```text
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
    | N/A   44C    P0             91W /  350W |   15753MiB /  46068MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A            8049      C   /opt/app-root/bin/python3             15744MiB |
    +-----------------------------------------------------------------------------------------+
   ```

3. If processes are using GPU memory that this notebook requires, stop them:

   a. Note the PID number for each process.

   b. Run the `kill -9 <pid>` command for each process to stop it.

      For example, if the PID number is `8049`, run the following command:

      ```
      kill -9 8049
      ```
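
The same check can be scripted. The following is a minimal sketch, not part of this example's code: it assumes that `nvidia-smi` is on the `PATH` and uses its `--query-gpu` CSV output. The helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import shutil
import subprocess

# nvidia-smi can print selected fields as plain CSV, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(value) for value in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return gpus


def query_gpu_memory():
    """Return per-GPU memory usage, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Sample line that matches the example output: 15753 MiB used of 46068 MiB.
sample = parse_gpu_memory("15753, 46068\n")
print(sample[0]["free_mib"])  # 30315
```

On a machine with a GPU, calling `query_gpu_memory()` returns one entry per device, which makes it easy to compare free memory against the size of the model before loading it.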



### Load the base model

For this example, the `RedHatAI/Llama-3.1-8B-Instruct` model is the base model, as defined in the `model_name` variable.
To load the model in the data type specified in its configuration, call **from_pretrained** on the **AutoModelForCausalLM** class and set the **torch_dtype** parameter to **auto**. Otherwise, PyTorch loads the weights in **full precision (fp32)**.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
# set up variables
model_name = "RedHatAI/Llama-3.1-8B-Instruct"
base_model_path = f"../base_model/{model_name.replace('/', '-')}"

In [None]:
# load the model and tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
model.config.dtype = "bfloat16"
# saving model and tokenizer
model.save_pretrained(base_model_path)
tokenizer.save_pretrained(base_model_path)

print("Base model saved at:", base_model_path)

In [None]:
# check model size
# !du -sh {base_model_path}
model_size = model_size_gb(model)
print(f"The size of the base model is: {model_size:.4f}GB")

Note that the base model (`Llama-3.1-8B-Instruct`) is approximately 15 GB in size.
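
The reported size follows from the parameter count and the data type. The following back-of-the-envelope sketch assumes roughly 8.03 billion parameters for this model and assumes that the size is reported in binary gigabytes (GiB). The `model_size_gb` helper in `utils.py` is not shown here and might compute the size differently; `estimate_weights_gib` is a hypothetical name.

```python
GIB = 2**30  # bytes in one binary gigabyte

# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_weights_gib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


NUM_PARAMS = 8.03e9  # assumption: approximate parameter count of the base model

for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: {estimate_weights_gib(NUM_PARAMS, dtype):.1f} GiB")
# bfloat16: 15.0 GiB
# float32: 29.9 GiB
```

The estimate covers only the weights. It also shows why loading in the configured data type matters: in fp32, the same weights need about twice the memory.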

### Define evaluation benchmarking datasets

Use the following benchmark datasets to evaluate multiple tasks:
- MMLU: General knowledge across 57 subjects
- IFEval: Instruction-following capability
- ARC: Logical and scientific reasoning
- HellaSwag: Commonsense completion


In [None]:
# define tasks you want to evaluate the model on
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]

### Evaluate the base model

**NOTE**: 
- Running the evaluation on the entire list of tasks can take a long time (5 hours or more depending on resources). For the purpose of testing, run the evaluation on a single task instead.

- The results are stored as a **results.pkl** file in the directory defined by **base_results_dir**.
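
For a quick trial run, you can narrow the settings to one task and a small number of examples. The following sketch assumes that the `limit` argument of `evaluate` caps the number of examples per task, as the `limit` option of LMEval's `simple_evaluate` does. The helper name `smoke_test_config` and the default of 20 examples are illustrative choices, not part of this example.

```python
# Full task list used in this notebook.
tasks = ["mmlu", "arc_easy", "hellaswag", "ifeval"]


def smoke_test_config(all_tasks, task="arc_easy", limit=20):
    """Return settings for a short trial run on a single task."""
    if task not in all_tasks:
        raise ValueError(f"Unknown task {task!r}; expected one of {all_tasks}")
    return {"tasks": [task], "limit": limit}


quick = smoke_test_config(tasks)
print(quick)  # {'tasks': ['arc_easy'], 'limit': 20}
```

You could then pass `quick["tasks"]` and `limit=quick["limit"]` to `evaluate` in place of the full task list and `limit=None`. Scores from a limited run are only a sanity check and are not comparable to the full baseline.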


In [None]:
# setting directories
base_results_dir = "../results/base_accuracy"

**NOTE:** If the following warning appears when you run the next cell, you can safely ignore it:

`The tokenizer you are loading from '../base_model' with an incorrect regex pattern... This will lead to incorrect tokenization.`

In [None]:
# evaluate the base model and save results in pkl format
base_acc = evaluate(
    base_model_path,
    tasks,
    limit=None,
    batch_size="auto",
    apply_chat_template=True,
    verbosity=None,
)
save_pickle(base_results_dir, base_acc)
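
The `evaluate` helper comes from `utils.py`, which is not shown in this notebook. As a rough sketch only, a wrapper of this kind might forward its arguments to `lm_eval.simple_evaluate` as follows. The names `build_eval_kwargs` and `evaluate_sketch` are hypothetical, and the real helper might set additional options.

```python
def build_eval_kwargs(
    model_path, tasks, limit=None, batch_size="auto", apply_chat_template=True
):
    """Assemble keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",  # evaluate through the Hugging Face backend
        "model_args": f"pretrained={model_path}",
        "tasks": list(tasks),
        "limit": limit,  # None means the full dataset
        "batch_size": batch_size,
        "apply_chat_template": apply_chat_template,
    }


def evaluate_sketch(model_path, tasks, **options):
    """Run the evaluation; assumption: mirrors what utils.evaluate does."""
    import lm_eval  # imported lazily so build_eval_kwargs works without it

    return lm_eval.simple_evaluate(**build_eval_kwargs(model_path, tasks, **options))


kwargs = build_eval_kwargs("../base_model/RedHatAI-Llama-3.1-8B-Instruct", ["arc_easy"])
print(kwargs["model_args"])
```

Separating argument assembly from the call keeps the settings easy to inspect before you start a run that can take hours.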

In [None]:
base_results = load_pickle(base_results_dir)
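
The `save_pickle` and `load_pickle` helpers are also defined in `utils.py`. A minimal sketch of equivalent helpers, assuming that they write and read a `results.pkl` file inside the given directory, could look like the following. The `_sketch` names are hypothetical, and the sample value is copied from the example results in this notebook.

```python
import os
import pickle
import tempfile


def save_pickle_sketch(results_dir, obj, filename="results.pkl"):
    """Create the directory if needed and pickle obj into it."""
    os.makedirs(results_dir, exist_ok=True)
    path = os.path.join(results_dir, filename)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def load_pickle_sketch(results_dir, filename="results.pkl"):
    """Load the pickled object back from the directory."""
    with open(os.path.join(results_dir, filename), "rb") as f:
        return pickle.load(f)


# Round trip in a temporary directory so that nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "results", "base_accuracy")
    save_pickle_sketch(demo_dir, {"results": {"arc_easy": {"acc,none": 0.8136}}})
    restored = load_pickle_sketch(demo_dir)

print(restored["results"]["arc_easy"]["acc,none"])  # 0.8136
```

Only load pickle files that you created yourself, because unpickling untrusted data can execute arbitrary code.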

In [None]:
# print results for the base model
print(make_table(base_results))
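
To compare against the compressed model later, it can help to pull individual metrics out of the results instead of reading them from the table. The following sketch assumes the usual LMEval layout, in which `results["results"][task]` maps keys such as `acc,none` (metric name, then filter name) to values. The helper name `metric_summary` is hypothetical, and the sample values are copied from the example results in this notebook.

```python
def metric_summary(results, metric="acc", filter_name="none"):
    """Collect one metric per task from an LMEval-style results dictionary."""
    key = f"{metric},{filter_name}"
    return {
        task: metrics[key]
        for task, metrics in results["results"].items()
        if key in metrics
    }


# Small sample with the assumed structure; tasks without the metric are skipped.
sample_results = {
    "results": {
        "arc_easy": {"acc,none": 0.8136, "acc_norm,none": 0.7588},
        "hellaswag": {"acc,none": 0.5741, "acc_norm,none": 0.7251},
        "ifeval": {"prompt_level_strict_acc,none": 0.7449},
    }
}

print(metric_summary(sample_results))
# {'arc_easy': 0.8136, 'hellaswag': 0.5741}
```

With the real results, `metric_summary(base_results)` would return one accuracy value per task that reports `acc`, provided that the dictionary follows the assumed layout.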

**Example accuracy results for the base model:**


```text
|                 Tasks                 |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|arc_easy                               |      1|none  |     0|acc                    |↑  |0.8136|±  |0.0080|
|                                       |       |none  |     0|acc_norm               |↑  |0.7588|±  |0.0088|
|hellaswag                              |      1|none  |     0|acc                    |↑  |0.5741|±  |0.0049|
|                                       |       |none  |     0|acc_norm               |↑  |0.7251|±  |0.0045|
|ifeval                                 |      4|none  |     0|inst_level_loose_acc   |↑  |0.8513|±  |   N/A|
|                                       |       |none  |     0|inst_level_strict_acc  |↑  |0.8189|±  |   N/A|
|                                       |       |none  |     0|prompt_level_loose_acc |↑  |0.7874|±  |0.0176|
|                                       |       |none  |     0|prompt_level_strict_acc|↑  |0.7449|±  |0.0188|
|mmlu                                   |      2|none  |      |acc                    |↑  |0.6322|±  |0.0038|
| - humanities                          |      2|none  |      |acc                    |↑  |0.5864|±  |0.0068|
|  - formal_logic                       |      1|none  |     0|acc                    |↑  |0.4921|±  |0.0447|
|  - high_school_european_history       |      1|none  |     0|acc                    |↑  |0.7455|±  |0.0340|
|  - high_school_us_history             |      1|none  |     0|acc                    |↑  |0.7892|±  |0.0286|
|  - high_school_world_history          |      1|none  |     0|acc                    |↑  |0.8186|±  |0.0251|
|  - international_law                  |      1|none  |     0|acc                    |↑  |0.7686|±  |0.0385|
|  - jurisprudence                      |      1|none  |     0|acc                    |↑  |0.7315|±  |0.0428|
|  - logical_fallacies                  |      1|none  |     0|acc                    |↑  |0.7730|±  |0.0329|
|  - moral_disputes                     |      1|none  |     0|acc                    |↑  |0.6792|±  |0.0251|
|  - moral_scenarios                    |      1|none  |     0|acc                    |↑  |0.4179|±  |0.0165|
|  - philosophy                         |      1|none  |     0|acc                    |↑  |0.6977|±  |0.0261|
|  - prehistory                         |      1|none  |     0|acc                    |↑  |0.7130|±  |0.0252|
|  - professional_law                   |      1|none  |     0|acc                    |↑  |0.4687|±  |0.0127|
|  - world_religions                    |      1|none  |     0|acc                    |↑  |0.8480|±  |0.0275|
| - other                               |      2|none  |      |acc                    |↑  |0.7184|±  |0.0078|
|  - business_ethics                    |      1|none  |     0|acc                    |↑  |0.6500|±  |0.0479|
|  - clinical_knowledge                 |      1|none  |     0|acc                    |↑  |0.7094|±  |0.0279|
|  - college_medicine                   |      1|none  |     0|acc                    |↑  |0.6416|±  |0.0366|
|  - global_facts                       |      1|none  |     0|acc                    |↑  |0.4200|±  |0.0496|
|  - human_aging                        |      1|none  |     0|acc                    |↑  |0.6771|±  |0.0314|
|  - management                         |      1|none  |     0|acc                    |↑  |0.8058|±  |0.0392|
|  - marketing                          |      1|none  |     0|acc                    |↑  |0.8547|±  |0.0231|
|  - medical_genetics                   |      1|none  |     0|acc                    |↑  |0.8100|±  |0.0394|
|  - miscellaneous                      |      1|none  |     0|acc                    |↑  |0.8084|±  |0.0141|
|  - nutrition                          |      1|none  |     0|acc                    |↑  |0.7549|±  |0.0246|
|  - professional_accounting            |      1|none  |     0|acc                    |↑  |0.5177|±  |0.0298|
|  - professional_medicine              |      1|none  |     0|acc                    |↑  |0.7610|±  |0.0259|
|  - virology                           |      1|none  |     0|acc                    |↑  |0.5663|±  |0.0386|
| - social sciences                     |      2|none  |      |acc                    |↑  |0.7442|±  |0.0077|
|  - econometrics                       |      1|none  |     0|acc                    |↑  |0.4386|±  |0.0467|
|  - high_school_geography              |      1|none  |     0|acc                    |↑  |0.7929|±  |0.0289|
|  - high_school_government_and_politics|      1|none  |     0|acc                    |↑  |0.8497|±  |0.0258|
|  - high_school_macroeconomics         |      1|none  |     0|acc                    |↑  |0.6564|±  |0.0241|
|  - high_school_microeconomics         |      1|none  |     0|acc                    |↑  |0.7479|±  |0.0282|
|  - high_school_psychology             |      1|none  |     0|acc                    |↑  |0.8642|±  |0.0147|
|  - human_sexuality                    |      1|none  |     0|acc                    |↑  |0.7634|±  |0.0373|
|  - professional_psychology            |      1|none  |     0|acc                    |↑  |0.6797|±  |0.0189|
|  - public_relations                   |      1|none  |     0|acc                    |↑  |0.6909|±  |0.0443|
|  - security_studies                   |      1|none  |     0|acc                    |↑  |0.6898|±  |0.0296|
|  - sociology                          |      1|none  |     0|acc                    |↑  |0.8308|±  |0.0265|
|  - us_foreign_policy                  |      1|none  |     0|acc                    |↑  |0.8600|±  |0.0349|
| - stem                                |      2|none  |      |acc                    |↑  |0.5062|±  |0.0084|
|  - abstract_algebra                   |      1|none  |     0|acc                    |↑  |0.2700|±  |0.0446|
|  - anatomy                            |      1|none  |     0|acc                    |↑  |0.6222|±  |0.0419|
|  - astronomy                          |      1|none  |     0|acc                    |↑  |0.6974|±  |0.0374|
|  - college_biology                    |      1|none  |     0|acc                    |↑  |0.7708|±  |0.0351|
|  - college_chemistry                  |      1|none  |     0|acc                    |↑  |0.4700|±  |0.0502|
|  - college_computer_science           |      1|none  |     0|acc                    |↑  |0.4100|±  |0.0494|
|  - college_mathematics                |      1|none  |     0|acc                    |↑  |0.2800|±  |0.0451|
|  - college_physics                    |      1|none  |     0|acc                    |↑  |0.3922|±  |0.0486|
|  - computer_security                  |      1|none  |     0|acc                    |↑  |0.7200|±  |0.0451|
|  - conceptual_physics                 |      1|none  |     0|acc                    |↑  |0.5915|±  |0.0321|
|  - electrical_engineering             |      1|none  |     0|acc                    |↑  |0.5724|±  |0.0412|
|  - elementary_mathematics             |      1|none  |     0|acc                    |↑  |0.3942|±  |0.0252|
|  - high_school_biology                |      1|none  |     0|acc                    |↑  |0.7581|±  |0.0244|
|  - high_school_chemistry              |      1|none  |     0|acc                    |↑  |0.5123|±  |0.0352|
|  - high_school_computer_science       |      1|none  |     0|acc                    |↑  |0.6100|±  |0.0490|
|  - high_school_mathematics            |      1|none  |     0|acc                    |↑  |0.2481|±  |0.0263|
|  - high_school_physics                |      1|none  |     0|acc                    |↑  |0.3709|±  |0.0394|
|  - high_school_statistics             |      1|none  |     0|acc                    |↑  |0.4213|±  |0.0337|
|  - machine_learning                   |      1|none  |     0|acc                    |↑  |0.4911|±  |0.0475|
```