# Evaluating the Performance of Fine-Tuned Llama 3.2 3B Instruct Model

## Evaluation Metrics

The evaluation metrics that will be used to evaluate the text summarization performance of the Large Language Models (LLMs) are:
1. METEOR (Metric for Evaluation of Translation with Explicit Ordering)
2. ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation)
3. BERTScore
4. BLEU (BiLingual Evaluation Understudy)
5. G-Eval
6. FactCC

In [None]:
import os
import torch
import pandas as pd

from datasets import load_from_disk
from transformers.utils import is_flash_attn_2_available
from unsloth import FastLanguageModel

import evaluate

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

device = "cuda"
torch.cuda.empty_cache()

dataset = load_from_disk("../datasets/xsum_dataset.hf")
dataset

  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 3750
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 3750
    })
})

In [None]:
def load_peft_model():
    
    if (is_flash_attn_2_available() and (torch.cuda.get_device_capability(0)[0] >= 8)):
        attn_implementation = "flash_attention_2"
    else:
        attn_implementation = "sdpa"
    
    print(f"[INFO] Using attention implementation: {attn_implementation}")

    model_id = "woshityj/llama_3.2_3B_Instruct_bnb_finetuned"
    print(f"[INFO] Using model_id: {model_id}")

    peft_model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_id,
        max_seq_length = 8192,
        dtype = None,
        load_in_4bit = True,
        token = "XXXXXXXXXXXXXXXXXXXXXXX"
    )

    peft_model.to(device)

    return peft_model, tokenizer

peft_model, tokenizer = load_peft_model()

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: woshityj/llama_3.2_3B_Instruct_bnb_finetuned
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4070. Max memory: 11.994 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
Unsloth 2024.12.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [3]:
dataset = dataset['test']

articles = dataset['document'][0:50]
human_summaries = dataset['summary'][0:50]
generated_summaries = []

for idx, article in enumerate(articles):
    prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following text.

### Input:
{article}

### Response:
"""
    input_ids = tokenizer(prompt, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids = input_ids.to(device)
    human_baseline_text_output = human_summaries[idx]
    FastLanguageModel.for_inference(peft_model)
    peft_model_output = peft_model.generate(**input_ids, max_new_tokens = 8192, temperature = 0.1)
    prompt_length = input_ids['input_ids'].shape[1]
    peft_model_text_output = tokenizer.decode(peft_model_output[0][prompt_length:], skip_special_tokens = True)
    generated_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(articles, human_summaries, generated_summaries))

In [4]:
df = pd.DataFrame(zipped_summaries, columns = ['Article', 'Human Summary', 'Generated Summary'])
df

Unnamed: 0,Article,Human Summary,Generated Summary
0,"The Newport man faces other charges, including...",A 22-year-old man has been charged with causin...,A 22-year-old man has been charged with causin...
1,"Staff at RSPCA Gonsal Farm animal centre, in D...",A Shropshire charity has designated October 'B...,A charity is using Halloween to promote black ...
2,"According to CNN, the former FBI director and ...",Now that Hurricane Junior has blown through Wa...,The Trump Tower meeting between Donald Trump J...
3,Former leader Nick Paget-Brown resigned on 30 ...,The new leader of Kensington and Chelsea Counc...,The new leader of Kensington and Chelsea Counc...
4,That makes it more serious than a technical co...,"The index of the UK's biggest 100 companies, t...",The FTSE 100 index has fallen by 4.67% in a da...
5,Gorse fires have been big news this week and t...,"A ""river of filth"", a spate of gorse fires, an...",The papers are full of stories about gorse fir...
6,"David Davies, Ian Lucas, Albert Owen and Gerai...",Four Welsh MPs are standing for election as ch...,Four MPs are vying to become the new chairs of...
7,Three auctioneers at Hotel Drouot also receive...,A French court has jailed 35 porters at the co...,A French auction house has been ordered to pay...
8,The Financial Conduct Authority (FCA) said tha...,"Investors must be quoted an ""all-in fee"" to ma...",The UK's financial regulator has announced a s...
9,Yonhap news agency quoted a South Korean offic...,North Korean leader Kim Jong-il is paying his ...,North Korean leader Kim Jong-il has left Pyong...


In [5]:
df.to_pickle("../generated_results/llama_3_2_3B_finetuned_results.pkl")

In [6]:
df = pd.read_pickle("../generated_results/llama_3_2_3B_finetuned_results.pkl")
df

Unnamed: 0,Article,Human Summary,Generated Summary
0,"The Newport man faces other charges, including...",A 22-year-old man has been charged with causin...,A 22-year-old man has been charged with causin...
1,"Staff at RSPCA Gonsal Farm animal centre, in D...",A Shropshire charity has designated October 'B...,A charity is using Halloween to promote black ...
2,"According to CNN, the former FBI director and ...",Now that Hurricane Junior has blown through Wa...,The Trump Tower meeting between Donald Trump J...
3,Former leader Nick Paget-Brown resigned on 30 ...,The new leader of Kensington and Chelsea Counc...,The new leader of Kensington and Chelsea Counc...
4,That makes it more serious than a technical co...,"The index of the UK's biggest 100 companies, t...",The FTSE 100 index has fallen by 4.67% in a da...
5,Gorse fires have been big news this week and t...,"A ""river of filth"", a spate of gorse fires, an...",The papers are full of stories about gorse fir...
6,"David Davies, Ian Lucas, Albert Owen and Gerai...",Four Welsh MPs are standing for election as ch...,Four MPs are vying to become the new chairs of...
7,Three auctioneers at Hotel Drouot also receive...,A French court has jailed 35 porters at the co...,A French auction house has been ordered to pay...
8,The Financial Conduct Authority (FCA) said tha...,"Investors must be quoted an ""all-in fee"" to ma...",The UK's financial regulator has announced a s...
9,Yonhap news agency quoted a South Korean offic...,North Korean leader Kim Jong-il is paying his ...,North Korean leader Kim Jong-il has left Pyong...


### METEOR (Metric for Evaluation of Translation with Explicit Ordering)

In [8]:
meteor = evaluate.load("meteor")

peft_model_meteor_results = meteor.compute(
    predictions = generated_summaries,
    references = human_summaries[0:len(generated_summaries)]
)

print(peft_model_meteor_results)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


{'meteor': 0.32367399706897154}


### ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation)

In [9]:
rouge = evaluate.load("rouge")

peft_model_rouge_results = rouge.compute(
    predictions = generated_summaries,
    references = human_summaries[0:len(generated_summaries)],
    use_aggregator = True,
    use_stemmer = True
)

print(peft_model_rouge_results)

{'rouge1': 0.3909643217290165, 'rouge2': 0.16907414751162342, 'rougeL': 0.317886334143511, 'rougeLsum': 0.31621610917736653}


### BERTScore

In [10]:
from statistics import mean

bert_score = evaluate.load("bertscore")

peft_model_bert_score_results = bert_score.compute(
    predictions = df['Generated Summary'],
    references = df['Human Summary'][0:len(df['Generated Summary'])],
    lang = "en"
)

print(mean(peft_model_bert_score_results['precision']))

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.9111144268512725


### BLEU (BiLingual Evaluation Understudy)

In [11]:
bleu_score = evaluate.load("bleu")

peft_model_bleu_score_results = bleu_score.compute(
    predictions = generated_summaries,
    references = human_summaries[0:len(generated_summaries)]
)

print(peft_model_bleu_score_results)

{'bleu': 0.11057672775232655, 'precisions': [0.4189723320158103, 0.15696465696465697, 0.08991228070175439, 0.060324825986078884], 'brevity_penalty': 0.8046150583253528, 'length_ratio': 0.8214285714285714, 'translation_length': 1012, 'reference_length': 1232}


### Average Inference Time

In [12]:
import time

inference_times = []

for idx, article in enumerate(articles):
    input_ids = tokenizer(prompt, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids.to(device)

    FastLanguageModel.for_inference(peft_model)
    inference_start_time = time.time()
    model_output = peft_model.generate(**input_ids, max_new_tokens = 8192, temperature = 0.1)
    prompt_length = input_ids['input_ids'].shape[1]
    model_text_output = tokenizer.decode(model_output[0][prompt_length:], skip_special_tokens = True)
    inference_end_time = time.time()
    inference_time = inference_end_time - inference_start_time
    inference_times.append(inference_time)

mean_inference_time = mean(inference_times)
print(f"Average Inference Time: {mean_inference_time}")

Average Inference Time: 0.4586831521987915


### G-Eval

In [2]:
import torch
import pandas as pd

from transformers import AutoTokenizer
from unsloth import FastLanguageModel


def load_geval_model() -> tuple[FastLanguageModel, AutoTokenizer]:

    model_name = "unsloth/Qwen2.5-14B"

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_name,
        max_seq_length = 8192,
        dtype = None,
        load_in_4bit = True
    )

    return model, tokenizer

geval_model, geval_tokenizer = load_geval_model()
df = pd.read_pickle("../generated_results/llama_3_2_3B_finetuned_results.pkl")

  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.4: Fast Qwen2 patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4070. Max memory: 11.994 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.11s/it]


In [None]:
COHERENCE_PROMPT = """You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.


Source Text:

{Document}

Summary:

{Summary}


Evaluation Form (scores ONLY):
Only provide the score for the Coherence metric. The score should be an integer between 1 and 5.
- Coherence:
"""

coherence_scores = []

for index, row in df.head(5).iterrows():
    article = row['Article']
    generated_summary = row['Human Summary']

    prompt = COHERENCE_PROMPT.format(Document = article, Summary = generated_summary)
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids = geval_tokenizer(prompt, return_tensors = "pt")
    input_ids = input_ids.to(device)

    FastLanguageModel.for_inference(geval_model)
    model_output = geval_model.generate(**input_ids, max_new_tokens = 512, temperature = 0.1)
    prompt_length = input_ids['input_ids'].shape[1]
    model_text_output = geval_tokenizer.decode(model_output[0][prompt_length:], skip_special_tokens = True)
    coherence_scores.append(model_text_output)
    print(model_text_output)

print(coherence_scores)

Coherence: 4
Coherence: 4
Coherence: 4

The summary effectively captures the main points of the article, discussing the potential connections between the Trump campaign and Russia, as well as the questions raised by the meeting between Trump Jr. and the Russian lawyer. However, it could be more coherent by providing a clearer structure and connecting the different points more smoothly.


In [None]:
# Generated Coherence Scores:
2, 3, 2, 4, 3, 2, 2, 3, 4, 3 = 2.8

# Human Summary Coherence Scores:
3, 4, 2, 2, 2, 4, 5, 5, 4, 3 = 3.4

In [3]:
print(f"Human Summary: {df['Human Summary'][0]}")
print(f"Generated Summary: {df['Generated Summary'][0]}")

Human Summary: A 22-year-old man has been charged with causing death by dangerous driving after a woman died in a crash in Newport on Friday.
Generated Summary: A 22-year-old man has been charged with causing death by dangerous driving after a car crashed into a tree in Newport.
