<a href="https://colab.research.google.com/github/veydantkatyal/Llama-LoRA-FineTuning/blob/main/Evaluate_Fine_Tuned_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluate Fine Tuned Model

This notebook demonstrates how to evaluate the fine-tuned model trained in [this notebook](https://drive.google.com/file/d/1Dnj1pbYL2k7ckZ3Xa1ktpOq42jcKB5Bt/view?usp=sharing).

## What we'll cover:
1. Setting up the required libraries
2. Loading the original and fine tuned models
3. Processing our test dataset
4. Evaluating the model using ROUGE and BERTScore

Note: This notebook assumes you have access to a GPU. We'll be using techniques to minimize memory usage while maintaining performance.

## Setup

First, we'll install the necessary libraries:
- `bitsandbytes`: For model quantization (reducing model size)
- `transformers`: Hugging Face's library for working with language models
- `peft`: For loading the LoRA adapter produced during fine-tuning
- `accelerate`: For placing the model on the available device(s)
- `datasets`: For handling our test data
- `evaluate`: Hugging Face's library for loading evaluation metrics
- `bert_score`: For computing BERTScore
- `rouge_score`: For computing ROUGE scores

In [None]:
!pip install -q -U bitsandbytes transformers peft accelerate datasets evaluate rouge_score bert_score


## Importing Required Libraries

We import the modules this notebook uses. Each has a specific purpose:
- `evaluate`: To load the ROUGE and BERTScore metrics
- `load_dataset`, `load_from_disk`: To download the test data and reload saved results
- `AutoModelForCausalLM`, `AutoTokenizer`: To load the pre-trained language model and its tokenizer
- `BitsAndBytesConfig`: For model quantization
- `set_seed`: To make generation reproducible
- `PeftModel`, `PeftConfig`: To load the fine-tuned LoRA adapter

In [None]:
import evaluate
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset, load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed
)
from functools import partial
from peft import PeftModel, PeftConfig

We'll use CUDA (GPU) when it is available and fall back to the CPU otherwise. Running inference on a GPU is much faster than on a CPU.

In [None]:
# Check for CUDA first, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using device:", device)

    device_index = torch.cuda.current_device()
    device_name = torch.cuda.get_device_name(device_index)
    total_mem = torch.cuda.get_device_properties(device_index).total_memory / 1e9  # bytes to GB
    allocated_mem = torch.cuda.memory_allocated(device_index) / 1e9
    reserved_mem = torch.cuda.memory_reserved(device_index) / 1e9

    print(f"CUDA device name: {device_name}")
    print(f"Total memory: {total_mem:.2f} GB")
    print(f"Memory allocated: {allocated_mem:.2f} GB")
    print(f"Memory reserved: {reserved_mem:.2f} GB")
else:
    device = torch.device("cpu")
    print("Using device:", device)
    print("No GPU acceleration available")

Using device: cuda
CUDA device name: Tesla T4
Total memory: 15.83 GB
Memory allocated: 0.00 GB
Memory reserved: 0.00 GB


## Loading the Dataset

We're using the test split of the DialogSum dataset (`knkarthick/dialogsum`), which contains conversations and their reference summaries. We'll compare the summaries our models generate against these references.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)


## Model Quantization Configuration

Here we set up quantization parameters to reduce the model's memory footprint. This configuration is applied when loading the base model. The adapter is attached to that already-loaded model later, so it needs no quantization setup of its own.
We're using 4-bit quantization, which significantly reduces memory usage while maintaining most of the model's performance.

Key concepts:
- Quantization: Converting model weights to lower precision (4-bit instead of 16/32-bit)
- `compute_dtype`: The data type used for computations
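
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. The parameter count of about 1.24 billion is an assumption for a 1B-class model, and the estimate covers the weights only: it ignores quantization constants, any layers kept in higher precision, activations, and the generation cache.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: ~1.24 billion parameters, a rough figure for a 1B-class model.
NUM_PARAMS = 1_240_000_000

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions, 4-bit storage needs about a quarter of the memory of 16-bit storage, which is why the quantized model fits comfortably on a single consumer or Colab GPU.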

In [None]:
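compute_dtype = torch.float16  # dtype used for computations at runtime
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_quant_type='nf4',             # NormalFloat4 quantization scheme
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,       # do not quantize the quantization constants
)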

## Loading the Base Model and PEFT model

We're loading the Llama 3.2 1B Instruct model, a relatively compact but powerful language model. We're applying our quantization configuration to make it memory-efficient.

In [None]:
model_name='meta-llama/Llama-3.2-1B-Instruct'
device_map = {"": 0}

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    token=True,  # use the saved Hugging Face login (replaces the deprecated `use_auth_token`)
)




## Setting Up the Tokenizer

The tokenizer converts text into numbers that the model can process. We configure it with specific settings:
- `padding_side="left"`: Adds padding tokens at the start of sequences
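- `add_eos_token` and `add_bos_token`: Control whether special tokens marking the beginning and end of a sequence are added automatically. Both are set to `False` here because the prompt template below already contains the special tokens it needs
- `pad_token`: Set to the end-of-sequence token so the tokenizer has a padding token to use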
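
To illustrate what left padding does, here is a minimal sketch in plain Python. The helper `pad_batch` and the toy token IDs are hypothetical and only mimic what a tokenizer produces; they are not part of the `transformers` API.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token IDs to equal length; return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)   # 0 marks padding positions
        real = [1] * len(seq)           # 1 marks real tokens
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask

# Toy token IDs; 0 stands in for the pad token.
ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=0, side="left")
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding, the last position of every row holds a real token, so a decoder-only model continues directly from the end of each prompt. This notebook generates one sample at a time, so padding mainly matters if you later switch to batched generation.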

In [None]:
MAX_LENGTH = model.config.max_position_embeddings
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True, padding=True, padding_side="left",
    add_eos_token=False, add_bos_token=False, use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

## Preparing Data Processing Functions

We define several helper functions to prepare our data:
- `generate_response`: Builds the prompt from the topic and dialogue, runs generation, and returns only the newly generated text
- `get_response`: Calls `generate_response` for one sample and stores the output in a `response{suffix}` column. The `suffix` lets us keep the responses of both models in the same dataset
- `process_dataset`: Applies `get_response` to every sample and returns the processed dataset

These functions ensure our data is in the right format for evaluation.

In [None]:
PROMPT_TEMPLATE = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert on summarizing conversations considering a particular topic.
The user request will contain the topic and the conversation
Answer with the summary only. Do not explain your answer
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Topic: {0}
Conversation: {1}

<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{2}
"""

def generate_response(
    model, topic, conversation, summary='',
    max_length=MAX_LENGTH, prompt_template=PROMPT_TEMPLATE,
    seed=42, tokenizer=tokenizer
):
    set_seed(seed)
    prompt = prompt_template.format(topic, conversation, summary)
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_attention_mask=True,
        padding=True
    ).to(device)

    outputs = model.generate(
        **inputs,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id
    )


    # Decode full output and prompt
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    prompt_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)

    # Get only the response part
    response_only = full_text[len(prompt_text):].strip()

    return response_only
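
`generate_response` removes the prompt by comparing the lengths of decoded strings. As an alternative sketch (not the method used in this notebook), the prompt can also be removed at the token level, since a decoder-only model returns the prompt tokens followed by the newly generated ones. The helper `strip_prompt_tokens` is hypothetical, and plain integers stand in for real token IDs.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the tokens that follow the first `prompt_length` positions."""
    return output_ids[prompt_length:]

# Toy example: plain integers stand in for real token IDs.
prompt_ids = [101, 7, 8, 9]            # what was fed to the model
output_ids = prompt_ids + [42, 43, 2]  # prompt followed by newly generated tokens

new_tokens = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_tokens)  # [42, 43, 2]
```

In the notebook's setting, the equivalent would presumably be slicing `outputs[0]` at `inputs["input_ids"].shape[1]` and decoding the remainder. That mapping is an assumption and has not been run against the results reported here.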

In [None]:
def get_response(sample, model, tokenizer, suffix=''):
    dialogue = sample['dialogue']
    topic = sample['topic']
    text = generate_response(model, topic, dialogue, tokenizer=tokenizer)
    sample[f'response{suffix}'] = text
    return sample


def process_dataset(dataset, tokenizer, model, suffix=''):
    proc_fn = partial(
        get_response, model=model, tokenizer=tokenizer, suffix=suffix
    )

    dataset = dataset.map(
        proc_fn,
        batched=False,
    )

    return dataset

To speed things up, we'll process only 200 of the 1,500 samples in the test dataset.

We first run inference with the base model, before attaching the fine-tuned adapter weights.

In [None]:
eval_dataset = dataset['test'].shuffle(seed=42)
eval_dataset = eval_dataset.select(range(200))
eval_dataset = process_dataset(eval_dataset, tokenizer, model, suffix='_original')

## Loading fine tuned model

Here, change the path to the folder where you saved the fine-tuned model in the [first notebook](https://drive.google.com/file/d/1Dnj1pbYL2k7ckZ3Xa1ktpOq42jcKB5Bt/view?usp=sharing). First we load the adapter config, then we attach the adapter to the current base model.

In [None]:
# Load the adapter's config
adapter_path = './peft-dialogue-summary-training-1742810044/best_model'
peft_config = PeftConfig.from_pretrained(adapter_path)

# Load the model with the adapter attached
model = PeftModel.from_pretrained(model, adapter_path)

In [None]:
eval_dataset = process_dataset(eval_dataset, tokenizer, model, suffix='_peft')

It's a good idea to save the evaluation dataset so you can skip the inference steps above next time.

In [None]:
eval_dataset.save_to_disk('eval_results')
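
On a later run, the saved results can be reloaded with `load_from_disk`, which was imported at the top of the notebook. The sketch below wraps it in a hypothetical helper, `load_cached_eval`, that returns `None` when the folder does not exist, so it is safe to run before anything has been saved.

```python
import os

def load_cached_eval(path="eval_results"):
    """Return the dataset saved at `path`, or None if nothing was saved there."""
    if not os.path.isdir(path):
        return None
    # Imported here so the existence check works even without `datasets` installed.
    from datasets import load_from_disk
    return load_from_disk(path)

cached = load_cached_eval("eval_results")
if cached is None:
    print("No cached results found; run the inference cells above first.")
else:
    print(cached)
```

If the folder exists, the returned dataset can be assigned to `eval_dataset` and the notebook can continue directly with the evaluation section.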

## Evaluate

In [None]:
def calculate_metrics(predictions, references):
    # Initialize metrics from HF evaluate
    rouge = evaluate.load('rouge')  # relies on the `rouge_score` package installed above
    bert_scorer = evaluate.load('bertscore')

    # Calculate ROUGE scores
    rouge_results = rouge.compute(predictions=predictions, references=references)

    # Calculate BERTScore
    bert_results = bert_scorer.compute(
        predictions=predictions,
        references=references,
        lang="en",
        model_type="bert-base-uncased"
    )

    # Combine metrics
    metrics = {
        'ROUGE-1': rouge_results['rouge1'],
        'ROUGE-2': rouge_results['rouge2'],
        'ROUGE-L': rouge_results['rougeL'],
        'BERTScore-P': np.mean(bert_results['precision']),
        'BERTScore-R': np.mean(bert_results['recall']),
        'BERTScore-F1': np.mean(bert_results['f1'])
    }

    return metrics
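
To give an intuition for what ROUGE-1 measures, here is a simplified sketch that computes the unigram-overlap F1 by hand. It is an illustration only: the `rouge` metric loaded above uses its own tokenization and aggregation, so its numbers will not match this function exactly. The function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentences, not taken from the dataset.
reference = "tom asks anna to book a table for dinner"
prediction = "tom asks anna to book dinner"
print(f"{rouge1_f1(prediction, reference):.3f}")  # 0.800
```

In this example all six predicted words appear in the nine-word reference, so precision is 1.0, recall is 6/9, and the F1 is 0.8. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence instead of fixed n-grams.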


In [None]:
# Run the evaluation
print("Evaluating Original Model:")
original_metrics = calculate_metrics(
    eval_dataset['response_original'],
    eval_dataset['summary']
)

print("\nEvaluating Fine-tuned Model:")
peft_metrics = calculate_metrics(
    eval_dataset['response_peft'],
    eval_dataset['summary']
)

results = pd.DataFrame([original_metrics, peft_metrics]).T
results.columns = ['Base model', 'Fine-tuned']
results

Evaluating Original Model:

Evaluating Fine-tuned Model:


| Metric | Base model | Fine-tuned |
|---|---|---|
| ROUGE-1 | 0.21274 | 0.442587 |
| ROUGE-2 | 0.048835 | 0.177176 |
| ROUGE-L | 0.157541 | 0.356341 |
| BERTScore-P | 0.460584 | 0.710478 |
| BERTScore-R | 0.544465 | 0.726904 |
| BERTScore-F1 | 0.495974 | 0.716942 |
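
To read the results more easily, the following sketch computes the absolute gain and the ratio between the fine-tuned and base scores. The numbers are copied from the results above; the helper `improvement` is hypothetical.

```python
# Scores copied from the results above: (base model, fine-tuned).
scores = {
    "ROUGE-1": (0.21274, 0.442587),
    "ROUGE-2": (0.048835, 0.177176),
    "ROUGE-L": (0.157541, 0.356341),
    "BERTScore-F1": (0.495974, 0.716942),
}

def improvement(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute gain, ratio of fine-tuned to base score)."""
    return tuned - base, tuned / base

for name, (base, tuned) in scores.items():
    gain, ratio = improvement(base, tuned)
    print(f"{name:<13} +{gain:.3f} absolute, {ratio:.2f}x the base score")
```

Keep in mind that these figures come from a 200-sample subset of the test split, so they are an estimate rather than a full-benchmark result.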
