# Lesson 6. Model evaluation

## Introduction
* Evaluating large language models (LLMs) is a critical step in determining how effective they are on different tasks.
This notebook introduces tools for LLM evaluation, such as the **LM Evaluation Harness**, and explains how to test models using **TruthfulQA MC2** and the benchmarks used by the **Hugging Face Leaderboard**.

* The model comparison tool described in the video can be found at https://console.upstage.ai/ (you need to create a free account to try it out).

Information about the harness can be found in this [GitHub repo](https://github.com/EleutherAI/lm-evaluation-harness).

There are four steps in model evaluation:
*  **Look at the loss**: Training loss should decrease over epochs.
*  **Check model outputs during training**: Create model checkpoints periodically.
*  **Compare with other models**: Use online tools.
*  **Use an LLM as a judge to compare models**
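
The first step, checking that the training loss goes down, can be sketched with a small helper. This is a minimal illustration and not part of the lesson's code: `loss_is_decreasing` and its `window` parameter are hypothetical names, and the loss values are made-up numbers.

```python
from statistics import mean
from typing import Sequence


def loss_is_decreasing(losses: Sequence[float], window: int = 3) -> bool:
    """Return True if the mean of the last `window` losses is below
    the mean of the first `window` losses.

    Comparing window averages instead of single points makes the check
    less sensitive to step-to-step noise.
    """
    if window < 1 or len(losses) < 2 * window:
        raise ValueError("need window >= 1 and at least 2 * window loss values")
    return mean(losses[-window:]) < mean(losses[:window])


# Made-up per-epoch losses, for illustration only.
example_losses = [4.1, 3.6, 3.2, 2.9, 2.7, 2.6]
trend_ok = loss_is_decreasing(example_losses)
print(f"loss decreasing: {trend_ok}")
```

A check like this only flags a broken run; it says nothing about output quality, which is why the remaining steps are still needed.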

#### Workflow
In this lesson, we focus on evaluating pretrained language models.

### Key Objectives:
1.  **Model Initialization**
   - Load pretrained models using Hugging Face's `AutoModelForCausalLM` or `LlamaForCausalLM`.  
2.  **Evaluation Workflow**
   - Perform log-likelihood evaluations on datasets to assess model performance.
   - Use metrics such as accuracy (`acc`) and log-likelihood scores to evaluate the model's ability to generate or predict text.

3.  **Practical Debugging Tips**
   - Use `nvidia-smi` to monitor GPU utilization during evaluation.
   - Optimize batch size and precision (`torch_dtype=torch.bfloat16`) for faster inference.
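
The log-likelihood evaluation mentioned above can be illustrated without loading a model. The sketch below assumes you already have one log-likelihood per answer choice for each question; it picks the highest-scoring choice and reports accuracy. The function names and the scores are hypothetical, and the harness's own implementation may differ in details such as length normalization.

```python
from typing import List, Sequence


def pick_choice(choice_logliks: Sequence[float]) -> int:
    """Return the index of the answer choice with the highest log-likelihood."""
    return max(range(len(choice_logliks)), key=lambda i: choice_logliks[i])


def accuracy(all_logliks: List[Sequence[float]], gold: List[int]) -> float:
    """Fraction of questions where the top-scoring choice equals the gold index."""
    if not gold or len(all_logliks) != len(gold):
        raise ValueError("need a non-empty, equal number of score lists and gold labels")
    correct = sum(pick_choice(ll) == g for ll, g in zip(all_logliks, gold))
    return correct / len(gold)


# Hypothetical log-likelihoods for three questions with three choices each.
example_logliks = [
    [-2.1, -0.7, -3.0],  # highest score at index 1
    [-1.5, -1.9, -0.4],  # highest score at index 2
    [-0.9, -1.2, -2.2],  # highest score at index 0
]
example_gold = [1, 2, 1]  # the third question is answered incorrectly

acc = accuracy(example_logliks, example_gold)
print(f"acc = {acc:.3f}")
```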

**Next Steps**:
- Fine-tune models on specific datasets to improve task-specific performance.
- Experiment with different precision settings (FP16, BF16) and batch sizes to optimize throughput during evaluation.



In [None]:
import os
import sys
import io


In [None]:
import os
import subprocess
from typing import Tuple

def setup_environment(curr_proj_folder: str = "pretraining-llms", google_drive_base_folder: str = "Colab Notebooks",\
                      run_remote: bool= True, use_gpu: bool = True) -> Tuple[str, bool]:
    """
    Sets up the environment for running code, handling local and remote execution.

    Args:
        curr_proj_folder (str, optional): Folder name of the current project. Defaults to "pretraining-llms".
        google_drive_base_folder (str, optional): Folder name of the Google Drive base folder. Defaults to "Colab Notebooks".
        run_remote (bool, optional): Whether to run on Colab with Google Drive mounted. Defaults to True.
        use_gpu (bool, optional): Whether to use GPU if available. Defaults to True.

    Returns:
        Tuple[str,bool]: (computed path_to_scripts,mount_success status)
    """
    # Initialize mount status for Colab
    mount_success = False
    # Remote run code
    if run_remote:
      from google.colab import drive
      # Mount Google Drive
      drive.mount('/content/drive')
      # Check if the mount was successful
      if os.path.ismount('/content/drive'):
        print("Google Drive mounted successfully!")
        mount_success = True
      else:
        print("Drive mount failed.")
      # By Default, this is complete mount path
      mount_path = '/content/drive/MyDrive'

      # complete path to current files
      path_to_scripts = os.path.join(mount_path, google_drive_base_folder,curr_proj_folder)
      # Create the directory if it doesn't exist
      if not os.path.exists(path_to_scripts):
        os.makedirs(path_to_scripts)
      # change to the path
      os.chdir(path_to_scripts)
      print(f'Running code in path {os.getcwd()}')
    # Local Run
    else:
      path_to_scripts  = os.getcwd()
      # folder name provided as argument should match the one existing
      assert os.path.basename(path_to_scripts ) == curr_proj_folder, \
          f"Folder Name Mismatch: {os.path.basename(path_to_scripts )} != {curr_proj_folder}"
      print(f'Running code in path {path_to_scripts }')
    # check GPU usage
    if use_gpu:
      try:
        gpu_info = subprocess.check_output("nvidia-smi", shell=True).decode('utf-8')
        print("******GPU is available and will be used:**********")
        print(gpu_info)
      except subprocess.SubprocessError:
        print("GPU check failed (nvidia-smi not found or no GPU available). Falling back to CPU.")
        use_gpu = False  # Force CPU usage if GPU check fails
    else:
        print("******use_gpu is set to False. Using CPU******")
    return  path_to_scripts,mount_success
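
Note that `setup_environment` sets `use_gpu = False` internally when the `nvidia-smi` check fails, but does not return that value, so the caller never sees the fallback. One possible alternative, shown here only as a sketch, is a separate helper that reports GPU availability as a boolean. The name `gpu_available` and the `timeout_s` parameter are assumptions introduced for illustration.

```python
import shutil
import subprocess


def gpu_available(timeout_s: float = 10.0) -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not installed, so assume no NVIDIA GPU is usable.
        return False
    try:
        # No shell is needed; pass the executable path directly.
        subprocess.run([exe], check=True, capture_output=True, timeout=timeout_s)
    except (subprocess.SubprocessError, OSError):
        return False
    return True


has_gpu = gpu_available()
print(f"GPU available: {has_gpu}")
```

The returned flag can then be used to decide which device to pass to later evaluation commands.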

Always set the following parameters as needed before each run.

In [None]:
# Project-specific configuration parameters
# Specifies the current project folder name
curr_proj_folder = "pretraining-llms"
# Base folder name in Google Drive where notebooks are stored
google_drive_base_folder = "Colab Notebooks"
# Flag to determine whether to use GPU for computations
use_gpu = True
# Flag to indicate remote execution environment
run_remote = False
# Flag to control model loading from a specific folder or through URL
load_model_from_folder = True

if run_remote:
  run_local = False
  run_local_usingColab = False
else:
  run_local = False
  run_local_usingColab = not run_local

# call method to setup environment
path_to_scripts,mount_success = setup_environment(curr_proj_folder = curr_proj_folder, \
                                   google_drive_base_folder =  google_drive_base_folder,\
                                    run_remote = run_remote, use_gpu = use_gpu)

In [None]:
from IPython.display import display
from PIL import Image
image_path = os.path.join(path_to_scripts,"images","lesson6_benchmarks.jpg")
img = Image.open(image_path)
display(img)

In [None]:
model_name = "TinySolar-248m-4k"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# Use the local model folder if load_model_from_folder is set; otherwise use the Hugging Face model ID
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path
print(model_path_or_name)
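
The cell above chooses between a local folder and the Hugging Face model ID based only on the `load_model_from_folder` flag; it does not check whether the folder exists. A hedged variant that falls back to the model ID when the folder is missing might look like the sketch below. The helper name `resolve_model_source` and its `hub_org` parameter are hypothetical.

```python
import os
import tempfile


def resolve_model_source(base_dir: str, model_name: str, hub_org: str = "upstage") -> str:
    """Return the local model folder if it exists, else the '<org>/<name>' model ID."""
    local_dir = os.path.join(base_dir, "models", model_name)
    if os.path.isdir(local_dir):
        return local_dir
    # Model IDs always use forward slashes, regardless of the operating system.
    return f"{hub_org}/{model_name}"


# Demonstration in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    missing = resolve_model_source(tmp, "TinySolar-248m-4k")
    os.makedirs(os.path.join(tmp, "models", "TinySolar-248m-4k"))
    present = resolve_model_source(tmp, "TinySolar-248m-4k")

print(missing)
print(present)
```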




## Evaluating TinySolar-248m-4k on TruthfulQA MC2

### About TruthfulQA MC2
TruthfulQA is a benchmark that tests a model's ability to give truthful answers.
- **Multiple-choice format**: In the MC2 (multi-true) variant, each question comes with several true and false reference answers, and the score is the normalized total probability the model assigns to the true answers.
- **Why important?**: Many LLMs generate misinformation, and this benchmark helps assess their reliability.
- **Source**: [TruthfulQA Paper](https://arxiv.org/abs/2109.07958). You can check out the code for implementing the tasks in this [GitHub repo](https://github.com/sylinrl/TruthfulQA).
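
To make the MC2 score concrete, the sketch below computes the share of probability mass that falls on the true reference answers, given per-answer log-likelihoods. It is an illustration under that assumption, not the harness's code; `mc2_score` is a hypothetical name and the probabilities are made up.

```python
import math
from typing import Sequence


def mc2_score(ll_true: Sequence[float], ll_false: Sequence[float]) -> float:
    """Normalized probability mass assigned to the true reference answers."""
    p_true = [math.exp(x) for x in ll_true]
    p_false = [math.exp(x) for x in ll_false]
    total = sum(p_true) + sum(p_false)
    if total == 0.0:
        raise ValueError("all answer probabilities are zero")
    return sum(p_true) / total


# Made-up example: two true answers (p = 0.2, 0.1), two false answers (p = 0.3, 0.4).
example_score = mc2_score(
    [math.log(0.2), math.log(0.1)],
    [math.log(0.3), math.log(0.4)],
)
print(f"MC2 score for one question: {example_score:.2f}")
```

Under this definition, a model that spreads probability evenly over equally many true and false answers scores 0.5 on that question, so values well below 0.5 suggest a preference for the false answers.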


The code below runs only the TruthfulQA MC2 task using the LM Evaluation Harness:

In [None]:
import subprocess
import os

# Ensure system uses UTF-8 encoding
os.environ["PYTHONIOENCODING"] = "utf-8"

with open("output.txt", "w", encoding="utf-8") as f:
    subprocess.run([
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model_path_or_name}",
        "--tasks", "truthfulqa_mc2",
        "--device", "auto",
        "--limit", "5"
    ], stdout=f, stderr=subprocess.STDOUT, text=True, encoding="utf-8")
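
The call above discards the exit status, so a failed evaluation only shows up when you read `output.txt`. A small wrapper that logs output and returns the exit code is sketched below. `run_and_log` is a hypothetical helper, and a short Python one-liner is used as a stand-in command so the example runs without `lm_eval` installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_and_log(cmd: Sequence[str], log_path: Path) -> int:
    """Run `cmd`, write combined stdout/stderr to `log_path`, return the exit code."""
    with open(log_path, "w", encoding="utf-8") as f:
        completed = subprocess.run(
            list(cmd), stdout=f, stderr=subprocess.STDOUT, text=True, encoding="utf-8"
        )
    return completed.returncode


# Stand-in command; replace it with the lm_eval argument list in practice.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "output.txt"
    exit_code = run_and_log([sys.executable, "-c", "print('hello from child')"], log_file)
    log_text = log_file.read_text(encoding="utf-8")

print(f"exit code: {exit_code}")
print(log_text.strip())
```

A non-zero exit code is a cue to inspect the log before trusting any reported metrics.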

#### Result snapshot is shown below


In [None]:
image_path = os.path.join(path_to_scripts,"images","lesson6_eval_result.jpg")
img = Image.open(image_path)
display(img)

### Evaluation for the Hugging Face Leaderboard
You can use the code below to test your own model against the evaluations required for the [Hugging Face leaderboard](https://huggingface.co/open-llm-leaderboard).



The Hugging Face Open LLM Leaderboard ranks models based on standard benchmarks, such as:
- **ARC-Challenge**: Tests grade-school science questions.
- **HellaSwag**: Evaluates commonsense reasoning.
- **MMLU**: Measures model understanding across subjects.
- **TruthfulQA**: Assesses truthfulness.
- **Winogrande**: Tests pronoun resolution.
- **GSM8K**: Evaluates grade-school math.

If you decide to run this evaluation on your own model, don't change the few-shot numbers below - they are set by the rules of the leaderboard.

### Evaluation Function
The function below automates model evaluation across these tasks.

In [None]:
def h6_open_llm_leaderboard(model_name):
    """
    Runs the leaderboard evaluations for a given model, one task at a time.

    Parameters:
    - model_name (str): Local path or Hugging Face ID of the model to be evaluated.
    """
    task_and_shot = [
        ('arc_challenge', 25),
        ('hellaswag', 10),
        ('mmlu', 5),
        ('truthfulqa_mc2', 0),
        ('winogrande', 5),
        ('gsm8k', 5)
    ]

    for task, fewshot in task_and_shot:
        eval_cmd = f"""
        lm_eval --model hf \
            --model_args pretrained={model_name} \
            --tasks {task} \
            --device cpu \
            --num_fewshot {fewshot}
        """
        os.system(eval_cmd)

# Example usage:
h6_open_llm_leaderboard(model_name="YOUR_MODEL")
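
As an alternative to formatting a shell string for `os.system`, the same commands can be built as argument lists, which avoids quoting problems when a model path contains spaces. The sketch below only constructs and prints the commands; it uses the same flags as the cells above. `build_lm_eval_cmd` and `TASK_AND_SHOT` are hypothetical names, and `YOUR_MODEL` is a placeholder.

```python
from typing import List, Optional


def build_lm_eval_cmd(
    model: str,
    task: str,
    num_fewshot: int,
    device: str = "cpu",
    limit: Optional[int] = None,
) -> List[str]:
    """Build an lm_eval argument list suitable for subprocess.run."""
    cmd = [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--tasks", task,
        "--device", device,
        "--num_fewshot", str(num_fewshot),
    ]
    if limit is not None:
        # Restrict the number of examples, useful for a quick smoke test.
        cmd += ["--limit", str(limit)]
    return cmd


# Task names and few-shot counts copied from the function above.
TASK_AND_SHOT = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

commands = [build_lm_eval_cmd("YOUR_MODEL", task, shots) for task, shots in TASK_AND_SHOT]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` once `YOUR_MODEL` is replaced with a real path or model ID.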



In [None]:
if run_remote:
  from google.colab import drive
  drive.flush_and_unmount()
  print('All changes made in this colab session should now be visible in Drive.')