## Evaluation

Evaluation is a crucial step in model training, allowing us to measure the performance and generalization ability of our models. In this notebook, we will systematically evaluate both the base model and the finetuned model using appropriate metrics and validation datasets.

- **Base Model Evaluation:**  
    We will assess the initial performance of the base model to establish a benchmark. This helps us understand how well the model performs before any task-specific adaptation.

- **Finetuned Model Evaluation:**  
    After training the model on our specific dataset, we will evaluate its performance again. Comparing these results with the base model will highlight the improvements gained through finetuning.

Throughout this notebook, we will use visualizations and quantitative metrics to provide a comprehensive analysis of model performance. This approach ensures transparency and helps guide further improvements.

## Setup Paths and Directories

In [None]:
from pathlib import Path
from rich import print


WORKSPACE = Path.cwd().parent  # Path to the workspace directory

OUTPUT_DIR = WORKSPACE / "output" / "step_04"

# Create the output directory if it doesn't exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"

TRAINING_OUTPUT_PATH = WORKSPACE / "output" / "step_04"

BASE_MODEL_PATH = TRAINING_OUTPUT_PATH / "base_model" / MODEL_NAME.split("/")[-1]

FINE_TUNED_MODEL_PATH = TRAINING_OUTPUT_PATH / "fine_tuned_model" / MODEL_NAME.split("/")[-1]
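
Before loading, it can help to confirm that both model directories actually contain a saved model. The following is a minimal sketch, assuming the models were written with `save_pretrained` (which produces a `config.json`); `missing_model_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are absent from `model_dir`.

    A directory that does not exist is reported as missing every required file.
    """
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


# Example usage with the paths defined above (uncomment inside the notebook):
# for label, path in [("base", BASE_MODEL_PATH), ("fine-tuned", FINE_TUNED_MODEL_PATH)]:
#     missing = missing_model_files(path)
#     if missing:
#         raise FileNotFoundError(f"{label} model at {path} is missing: {missing}")
```

Failing early here gives a clearer error than a stack trace from inside the model loader.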

## Load the Models

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    dtype=torch.float16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
)
base_model_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)



# Load the fine-tuned weights and tokenizer from their own directory, not the base model's
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    FINE_TUNED_MODEL_PATH,
    dtype=torch.float16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
)
fine_tuned_model_tokenizer = AutoTokenizer.from_pretrained(FINE_TUNED_MODEL_PATH)

## LLM Sampling Parameters

We will use the following sampling parameters when generating responses from both models.

In [3]:
################################################################################
# 🎯 Sampling/Generation Parameters                                            #
################################################################################
MAX_NEW_TOKENS = 256
DO_SAMPLE = True
TEMPERATURE = 0.7  # Sampling temperature; lower values give more deterministic output
TOP_P = 0.9        # Nucleus (top-p) sampling threshold

print(f"MAX_NEW_TOKENS: {MAX_NEW_TOKENS}")
print(f"DO_SAMPLE: {DO_SAMPLE}")
print(f"TEMPERATURE: {TEMPERATURE}")
print(f"TOP_P: {TOP_P}")
print("✅ LLM sampling parameters defined")
print()
print("📊 Sampling settings in use:")
print(f"  • Temperature {TEMPERATURE} for balanced creativity/consistency")
print(f"  • Top-p {TOP_P} for good token diversity")
print("  • Stop on both EOS and <|eot_id|> tokens")

MAX_NEW_TOKENS: 256
DO_SAMPLE: True
TEMPERATURE: 0.7
TOP_P: 0.9
✅ LLM sampling parameters defined

📊 Sampling settings in use:
  • Temperature 0.7 for balanced creativity/consistency
  • Top-p 0.9 for good token diversity
  • Stop on both EOS and <|eot_id|> tokens
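
To make the effect of `TEMPERATURE` and `TOP_P` concrete, here is a simplified, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the idea only and is not the library's implementation; the logits are made-up values and `sampling_distribution` is a hypothetical helper.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_p=0.9):
    """Apply temperature scaling, then nucleus (top-p) filtering, to toy logits.

    Returns a dict mapping token index -> renormalized probability.
    """
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens form a valid distribution to sample from.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up values, not real model output
print(sampling_distribution(toy_logits, temperature=0.7, top_p=0.9))
```

Lowering the temperature concentrates probability on the top token, so fewer tokens survive the top-p cut; raising it spreads probability out and lets more candidates through.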


## Utility Functions

In [None]:
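def prompt_runner(model, tokenizer, prompt):
    """Generate a completion for `prompt` and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            temperature=TEMPERATURE,
            do_sample=DO_SAMPLE,
            top_p=TOP_P,
        )

    # Slice off the prompt tokens so that only the continuation is decoded
    prompt_length = inputs.input_ids.shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    return response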


def run_experimentation(prompt):
    # Run the prompt through the base model
    base_model_response = prompt_runner(
        model=base_model,
        tokenizer=base_model_tokenizer,
        prompt=prompt,
    )

    # Run the same prompt through the fine-tuned model
    fine_tuned_model_response = prompt_runner(
        model=fine_tuned_model,
        tokenizer=fine_tuned_model_tokenizer,
        prompt=prompt,
    )

    # Print the settings, the prompt, and both responses side by side
    print(f"""
    [bold]EXPERIMENTATION DETAILS[/bold]:
        MODEL NAME     : {MODEL_NAME}
        MAX NEW TOKENS : {MAX_NEW_TOKENS}
        DO SAMPLE      : {DO_SAMPLE}
        TEMPERATURE    : {TEMPERATURE}
        TOP P          : {TOP_P}      


    [bold]PROMPT 💬[/bold]:

        [green]{prompt}[/green]

    [bold]BASE MODEL RESPONSE 🤖[/bold]:

        {base_model_response}

    [bold]FINE TUNED MODEL RESPONSE 🤖[/bold]:

        {fine_tuned_model_response}
    """)

## Test 1

We will test both models' knowledge of the BMO data.

Question: `what is the meaning of verifying the identity of a person or an entity`
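
The question is sent to both models as raw text. Since `MODEL_NAME` is an instruction-tuned variant, one option worth comparing is wrapping the question in the tokenizer's chat template first. The sketch below assumes a Hugging Face tokenizer that defines a chat template; `build_chat_prompt` is a hypothetical helper, and whether the fine-tuning data used this format is not stated in this notebook.

```python
def build_chat_prompt(tokenizer, question):
    """Wrap a raw question in the tokenizer's chat template (hypothetical helper)."""
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return a string rather than token ids
        add_generation_prompt=True,  # append the assistant header so the model answers
    )


# Example usage inside the notebook (uncomment once the tokenizers are loaded):
# chat_prompt = build_chat_prompt(base_model_tokenizer, "what is the meaning of ...")
```

The rendered string may already start with the beginning-of-text token, so tokenizing it with `add_special_tokens=False` is likely needed to avoid a duplicate; the notebook's prompt helper does not do this as written.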

In [None]:
prompt = """what is the meaning of verifying the identity of a person or an entity"""


run_experimentation(prompt)
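
Printed output is easy to lose between runs. A possible extension, sketched here under stated assumptions, is to collect the responses for a list of prompts and write them to `OUTPUT_DIR` as JSON. `collect_responses`, `save_records`, and the file name `comparisons.json` are hypothetical and do not appear elsewhere in the notebook.

```python
import json
from pathlib import Path


def collect_responses(prompts, generators):
    """Run every prompt through every generator and return one record per prompt.

    `generators` maps a label (e.g. "base") to a callable that takes a prompt string.
    """
    records = []
    for prompt in prompts:
        record = {"prompt": prompt}
        for label, generate in generators.items():
            record[label] = generate(prompt)
        records.append(record)
    return records


def save_records(records, path):
    """Write records as pretty-printed UTF-8 JSON, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Example usage inside the notebook (names below come from earlier cells):
# generators = {
#     "base": lambda p: prompt_runner(base_model, base_model_tokenizer, p),
#     "fine_tuned": lambda p: prompt_runner(fine_tuned_model, fine_tuned_model_tokenizer, p),
# }
# records = collect_responses([prompt], generators)
# save_records(records, OUTPUT_DIR / "comparisons.json")  # file name is an assumption
```

Because sampling is enabled, repeated runs of the same prompt can differ, so saved records are best read as samples rather than fixed answers.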