# Evaluation


## Overview
Evaluation is a crucial step in model training. It lets you measure the performance and generalization ability of your models. In this notebook, you use appropriate metrics and validation datasets to systematically evaluate both the base model and the fine-tuned model.

- **Base Model Evaluation:**  
    Assess the initial performance of the base model to establish a benchmark. This benchmark helps you understand how the model performs before any task-specific adaptation.

- **Fine-tuned Model Evaluation:**  
    After training the model on your specific dataset, evaluate its performance again. By comparing these results with the base model results, you can see the improvements that fine-tuning provides.

To ensure transparency and to help guide further improvements, this notebook uses visualizations and quantitative metrics to provide a comprehensive analysis of model performance.
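
One simple quantitative metric for comparing a model response with a reference answer is token-overlap F1. The following is a minimal sketch of that idea, not a metric prescribed by this notebook: the helper names `tokenize` and `token_f1`, the reference answer, and the two sample answers are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Return the token-overlap F1 score between a prediction and a reference."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # The multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and model outputs, for illustration only.
reference = "Confirming that a person or entity is who they claim to be."
base_answer = "It means checking something."
tuned_answer = "It means confirming that a person or entity is who they claim to be."

print(f"base  F1: {token_f1(base_answer, reference):.2f}")
print(f"tuned F1: {token_f1(tuned_answer, reference):.2f}")
```

A higher score for the fine-tuned answer would suggest that it is closer to the reference wording. Token overlap does not capture meaning, so treat it as a rough signal alongside manual review.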

## Prerequisites

- You completed the [Model Training](../05_Model_Training/Model_Training.ipynb) notebook.


## Install dependencies

In [None]:
!pip install -qqU .

## Set up paths and directories

In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()

WORKSPACE = Path.cwd().parent  # Path to the workspace directory


MODEL_NAME = os.getenv("STUDENT_MODEL_NAME", "RedHatAI/Llama-3.1-8B-Instruct")

OUTPUT_DIR = WORKSPACE / "output"

BASE_MODEL_PATH = OUTPUT_DIR / "base_model" / MODEL_NAME.replace("/", "__")

FINE_TUNED_MODEL_PATH = OUTPUT_DIR / "fine_tuned_model" / MODEL_NAME.replace("/", "__")

if not BASE_MODEL_PATH.exists():
    raise FileNotFoundError(
        f"🚨 Base model directory doesn't exist: {BASE_MODEL_PATH}"
    )

if not FINE_TUNED_MODEL_PATH.exists():
    raise FileNotFoundError(
        f"🚨 Fine-tuned model directory doesn't exist: {FINE_TUNED_MODEL_PATH}"
    )


print(f"Model Name : {MODEL_NAME}")
print(f"Base model path : {BASE_MODEL_PATH}")
print(f"Fine-tuned model path : {FINE_TUNED_MODEL_PATH}")

## Load the base model and the trained model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading base model and tokenizer...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH, dtype=torch.float16, device_map="cuda:0"
)
base_model_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
print("✅ Base model and tokenizer loaded successfully.")


print("Loading fine-tuned model and tokenizer...")
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    FINE_TUNED_MODEL_PATH, dtype=torch.float16, device_map="cuda:0"
)
fine_tuned_model_tokenizer = AutoTokenizer.from_pretrained(FINE_TUNED_MODEL_PATH)
print("✅ Fine-tuned model and tokenizer loaded successfully.")

## LLM sampling parameters

Define parameter values for LLM sampling:

In [None]:
################################################################################
# 🎯 Sampling/Generation Parameters                                            #
################################################################################
MAX_NEW_TOKENS = 256
DO_SAMPLE = True
TEMPERATURE = 0.7  # Lower values make sampling more deterministic
TOP_P = 0.9  # Standard top_p for Llama models

print(f"MAX_NEW_TOKENS: {MAX_NEW_TOKENS}")
print(f"DO_SAMPLE: {DO_SAMPLE}")
print(f"TEMPERATURE: {TEMPERATURE}")
print(f"TOP_P: {TOP_P}")
print("✅ LLM sampling parameters defined")
print()
print("📊 Llama sampling settings in use:")
print(f"  • Temperature {TEMPERATURE} for balanced creativity/consistency")
print(f"  • Top-p {TOP_P} for good token diversity")
print("  • Stop on both EOS and <|eot_id|> tokens")
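
To show what the temperature and top-p settings control, the sketch below applies both to a toy distribution. It is a conceptual illustration in plain Python, not the implementation used by `model.generate`; the function names and the four-token logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)
```

A temperature below 1 sharpens the distribution toward the most likely token, and top-p then discards the low-probability tail before a token is sampled from the remaining candidates.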

## Define utility functions

In [None]:
from rich import print as pprint


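def prompt_runner(model, tokenizer, prompt):
    """Generate a response to `prompt` and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            temperature=TEMPERATURE,
            do_sample=DO_SAMPLE,
            top_p=TOP_P,
        )

    # The output holds the prompt tokens followed by the completion;
    # slice off the prompt so that only the completion is decoded
    response = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1] :], skip_special_tokens=True
    )

    return response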


def run_experimentation(prompt):
    """Run `prompt` through both models and print both responses for comparison."""
    # Run prompt in base model
    base_model_response = prompt_runner(
        model=base_model, tokenizer=base_model_tokenizer, prompt=prompt
    )

    # Run prompt in fine_tuned_model
    fine_tuned_model_response = prompt_runner(
        model=fine_tuned_model, tokenizer=fine_tuned_model_tokenizer, prompt=prompt
    )

    # Print the experiment settings, the prompt, and both responses
    pprint(f"""
    [bold]EXPERIMENTATION DETAILS[/bold]:
        MODEL NAME     : {MODEL_NAME}
        MAX NEW TOKENS : {MAX_NEW_TOKENS}
        DO SAMPLE      : {DO_SAMPLE}
        TEMPERATURE    : {TEMPERATURE}
        TOP P          : {TOP_P}


    [bold]PROMPT 💬[/bold]:

        [green]{prompt}[/green]

    [bold]BASE MODEL RESPONSE 🤖[/bold]:

        {base_model_response}

    [bold]FINE TUNED MODEL RESPONSE 🤖[/bold]:

        {fine_tuned_model_response}
    """)
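
`prompt_runner` passes the raw prompt string to the tokenizer. Instruction-tuned models are often prompted through the tokenizer's chat template instead. Whether that suits the fine-tuned model here depends on how the training data was formatted, which this notebook does not show. The sketch below assumes the tokenizer defines a chat template; `format_chat_prompt` is a hypothetical helper, not part of the notebook.

```python
def format_chat_prompt(tokenizer, prompt):
    """Wrap a user prompt in the tokenizer's chat template (hypothetical helper)."""
    messages = [{"role": "user", "content": prompt}]
    # tokenize=False returns the formatted string; add_generation_prompt=True
    # appends the header that cues the model to start the assistant turn.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

You could then call `prompt_runner(model, tokenizer, format_chat_prompt(tokenizer, prompt))`. The formatted string may already begin with a beginning-of-text token, in which case tokenizing it with `add_special_tokens=False` avoids a duplicate.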

## Test the model

Use the following question to test the model's knowledge of Bank of Montreal (BMO) data:

Question: `what is the meaning of verifying the identity of a person or an entity`

In [None]:
prompt = """what is the meaning of verifying the identity of a person or an entity"""


run_experimentation(prompt)
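
Because sampling is enabled, a single prompt gives only a narrow and non-repeatable view of each model. One possible extension, sketched below, runs a list of prompts through both models and keeps the responses as JSON Lines for later review. The names `collect_responses`, `to_jsonl`, and the stand-in runners are hypothetical; in this notebook the runners would wrap `prompt_runner` with the base and fine-tuned models.

```python
import json


def collect_responses(prompts, runners):
    """Run every prompt through every runner and return one record per prompt.

    `runners` maps a label (for example "base") to a callable that takes a
    prompt string and returns the generated text.
    """
    records = []
    for prompt in prompts:
        record = {"prompt": prompt}
        for label, run in runners.items():
            record[label] = run(prompt)
        records.append(record)
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in runners so the sketch executes without a GPU.
demo_runners = {
    "base": lambda p: f"[base] {p}",
    "fine_tuned": lambda p: f"[fine-tuned] {p}",
}
demo_prompts = [
    "what is the meaning of verifying the identity of a person or an entity",
]
print(to_jsonl(collect_responses(demo_prompts, demo_runners)))
```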

Congratulations! You have completed the Knowledge Tuning example.