# Evaluating the Performance of the Llama 3.2 3B Instruct Model

## Evaluation Metrics

The following metrics are used to evaluate the text summarization performance of the large language model (LLM):
1. METEOR (Metric for Evaluation of Translation with Explicit Ordering)
2. ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation)
3. BERTScore
4. BLEU (BiLingual Evaluation Understudy)
5. G-Eval
6. FactCC
7. Model's Inference Time

In [None]:
import os
import torch
import pandas as pd

from datasets import load_from_disk
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from unsloth import FastLanguageModel

import evaluate

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "XXXXXXXXXXXXXXXXXXX"

device = "cuda"
torch.cuda.empty_cache()

dataset = load_from_disk("../datasets/xsum_dataset.hf")
dataset


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 3750
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 3750
    })
})

In [None]:
def load_llm_model():

    if (is_flash_attn_2_available() and (torch.cuda.get_device_capability(0)[0] >= 8)):
        attn_implementation = "flash_attention_2"
    else:
        attn_implementation = "sdpa"
    
    print(f"[INFO] Using attention implementation: {attn_implementation}")

    model_id = "meta-llama/Llama-3.2-3B-Instruct"
    print(f"[INFO] Using model_id: {model_id}")

    # Read the Hugging Face access token from the environment instead of hard-coding it.
    hf_token = os.environ["HUGGINGFACEHUB_API_TOKEN"]

    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_id, token = hf_token)
    llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = model_id,
                                                    torch_dtype = torch.float16,
                                                    low_cpu_mem_usage = False,
                                                    attn_implementation = attn_implementation,
                                                    token = hf_token)
    
    llm_model.to(device)

    return llm_model, tokenizer

model, tokenizer = load_llm_model()

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: meta-llama/Llama-3.2-3B-Instruct


Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.86s/it]


In [3]:
dataset = dataset['test']

articles = dataset['document'][0:50]
human_summaries = dataset['summary'][0:50]
generated_summaries = []

for idx, article in enumerate(articles):
    prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following text.

### Input:
{article}

### Response:
"""
    # Tokenize the prompt and move the tensors to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    model_output = model.generate(**inputs, max_new_tokens = 8192, temperature = 0.1)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    prompt_length = inputs['input_ids'].shape[1]
    model_text_output = tokenizer.decode(model_output[0][prompt_length:], skip_special_tokens = True)
    generated_summaries.append(model_text_output)

zipped_summaries = list(zip(articles, human_summaries, generated_summaries))

In [4]:
df = pd.DataFrame(zipped_summaries, columns = ['Article', 'Human Summaries', 'Generated Summaries'])
df

| | Article | Human Summaries | Generated Summaries |
|---|---|---|---|
| 0 | The Newport man faces other charges, including... | A 22-year-old man has been charged with causin... | A 19-year-old woman, Xana Doyle, died after a ... |
| 1 | Staff at RSPCA Gonsal Farm animal centre, in D... | A Shropshire charity has designated October 'B... | The RSPCA Gonsal Farm animal centre in Dorring... |
| 2 | According to CNN, the former FBI director and ... | Now that Hurricane Junior has blown through Wa... | The text discusses the recent email revelation... |
| 3 | Former leader Nick Paget-Brown resigned on 30 ... | The new leader of Kensington and Chelsea Counc... | The former leader of Kensington and Chelsea Co... |
| 4 | That makes it more serious than a technical co... | The index of the UK's biggest 100 companies, t... | The article discusses the current market turbu... |
| 5 | Gorse fires have been big news this week and t... | A "river of filth", a spate of gorse fires, an... | Here is a summary of the provided text:\n\nThi... |
| 6 | David Davies, Ian Lucas, Albert Owen and Gerai... | Four Welsh MPs are standing for election as ch... | The current chair of the Welsh Affairs Committ... |
| 7 | Three auctioneers at Hotel Drouot also receive... | A French court has jailed 35 porters at the co... | The Hotel Drouot auction house in France has b... |
| 8 | The Financial Conduct Authority (FCA) said tha... | Investors must be quoted an "all-in fee" to ma... | The Financial Conduct Authority (FCA) has anno... |
| 9 | Yonhap news agency quoted a South Korean offic... | North Korean leader Kim Jong-il is paying his ... | Here is a summary of the text:\n\nNorth Korean... |


In [5]:
df.to_pickle("../generated_results/llama3_2_3B_results.pkl")

In [6]:
df = pd.read_pickle("../generated_results/llama3_2_3B_results.pkl")
df



### METEOR (Metric for Evaluation of Translation with Explicit Ordering)

In [7]:
meteor = evaluate.load("meteor")

# The base Llama 3.2 3B Instruct model is scored here; no PEFT adapter is loaded.
meteor_results = meteor.compute(
    predictions = generated_summaries,
    references = human_summaries[0:len(generated_summaries)]
)

print(meteor_results)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


{'meteor': 0.22290188320547813}


### ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation)

In [8]:
rouge = evaluate.load("rouge")

# The base Llama 3.2 3B Instruct model is scored here; no PEFT adapter is loaded.
rouge_results = rouge.compute(
    predictions = generated_summaries,
    references = human_summaries[0:len(generated_summaries)],
    use_aggregator = True,
    use_stemmer = True
)

print(rouge_results)

{'rouge1': 0.15678534302759964, 'rouge2': 0.039206241558554636, 'rougeL': 0.10514268782664304, 'rougeLsum': 0.10586519867590312}
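
As a rough illustration of what the ROUGE-N scores above measure, the sketch below computes clipped n-gram overlap between one prediction and one reference. It is a simplified stand-in, not the `evaluate` implementation: it assumes lowercase whitespace tokenization and no stemming, and the function name `rouge_n` is hypothetical.

```python
from collections import Counter

def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: clipped n-gram overlap on whitespace tokens."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

unigram_scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
bigram_scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(unigram_scores, bigram_scores)
```

Because precision divides the overlap by the number of n-grams in the prediction, a multi-sentence generated summary scored against a one-sentence reference tends to receive low precision even when it covers the reference well.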


### BERTScore

In [9]:
from statistics import mean

bert_score = evaluate.load("bertscore")

bert_score_results = bert_score.compute(
    predictions = df['Generated Summaries'].tolist(),
    references = df['Human Summaries'].tolist(),
    lang = "en"
)

# Report the mean BERTScore precision across the evaluated summaries.
print(mean(bert_score_results['precision']))

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.8285276222229004


### BLEU (BiLingual Evaluation Understudy)

In [10]:
bleu_score = evaluate.load("bleu")

# The base Llama 3.2 3B Instruct model is scored here; no PEFT adapter is loaded.
bleu_score_results = bleu_score.compute(
    predictions = generated_summaries,
    references = human_summaries[0:len(generated_summaries)]
)

print(bleu_score_results)

{'bleu': 0.012196992181735578, 'precisions': [0.08237840590781767, 0.018195797027165558, 0.00631931906112974, 0.002336448598130841], 'brevity_penalty': 1.0, 'length_ratio': 6.375, 'translation_length': 7854, 'reference_length': 1232}
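
The reported BLEU value can be reproduced from the other fields in the output above. The sketch below assumes the standard BLEU definition (brevity penalty multiplied by the geometric mean of the four n-gram precisions) and copies its numbers from that output; the variable names are illustrative.

```python
import math

# Values copied from the BLEU output above.
precisions = [0.08237840590781767, 0.018195797027165558,
              0.00631931906112974, 0.002336448598130841]
translation_length = 7854
reference_length = 1232

# Ratio of total generated length to total reference length.
length_ratio = translation_length / reference_length

# The brevity penalty only penalizes outputs shorter than the references.
if length_ratio > 1:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - 1 / length_ratio)

# Geometric mean of the 1-gram to 4-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = brevity_penalty * geo_mean

print(length_ratio, brevity_penalty, bleu)
```

The length ratio shows that the generated summaries are, in total, more than six times as long as the reference summaries, which likely contributes to the low n-gram precisions.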


### Average Inference Time

In [12]:
import time

inference_times = []

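# NOTE: `prompt` is not rebuilt inside this loop, so every iteration times
# whatever prompt was last assigned in an earlier cell, not one prompt per article.
for idx, article in enumerate(articles):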
    input_ids = tokenizer(prompt, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids.to(device)

    inference_start_time = time.time()
    model_output = model.generate(**input_ids, max_new_tokens = 8192, temperature = 0.1)
    prompt_length = input_ids['input_ids'].shape[1]
    model_text_output = tokenizer.decode(model_output[0][prompt_length:], skip_special_tokens = True)
    inference_end_time = time.time()
    inference_time = inference_end_time - inference_start_time
    inference_times.append(inference_time)

mean_inference_time = mean(inference_times)
print(f"Average Inference Time: {mean_inference_time}")

Average Inference Time: 2.0084987831115724
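
The timing loop above tokenizes `prompt` without rebuilding it for each article, so the measured time reflects a single repeated prompt. A minimal sketch of per-article timing follows, assuming the same prompt template as the generation cell. `build_prompt`, `time_generation`, and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in for the tokenize, `model.generate`, and decode steps so that the example runs without a GPU.

```python
import time
from statistics import mean

PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following text.

### Input:
{article}

### Response:
"""

def build_prompt(article: str) -> str:
    """Fill the summarization template with one article."""
    return PROMPT_TEMPLATE.format(article=article)

def time_generation(generate_fn, articles):
    """Return the wall-clock time of generate_fn(prompt) for each article."""
    times = []
    for article in articles:
        prompt = build_prompt(article)  # rebuilt for every article
        start = time.perf_counter()
        generate_fn(prompt)
        times.append(time.perf_counter() - start)
    return times

def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for tokenization + model.generate + decoding.
    return prompt[-20:]

inference_times = time_generation(fake_generate, ["first article", "second article"])
print(f"Average Inference Time: {mean(inference_times)}")
```

With the real model, `generate_fn` would wrap the tokenizer call, `model.generate`, and `tokenizer.decode`. On a GPU, calling `torch.cuda.synchronize()` before reading the clock may give more accurate timings, since CUDA work is launched asynchronously.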


### G-Eval

In [2]:
import pandas as pd

df = pd.read_pickle("../generated_results/llama3_2_3B_results.pkl")
df['Article'][0]

'The Newport man faces other charges, including theft of a vehicle and driving with excess alcohol.Xana Doyle, 19, from Newport, died after a silver Toyota Avensis ended up on its roof on Usk Way just after 07:00 GMT. A 15-year-old girl was also hurt.The man is due appear before Newport magistrates on Monday.A 21-year-old man, who was also arrested in connection with the incident, has been released with no further action, said Gwent Police.'

In [8]:
COHERENCE_PROMPT = """You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.


Source Text:

{Document}

Summary:

{Summary}


Evaluation Form (scores ONLY):
- Coherence:
"""

prompt = COHERENCE_PROMPT.format(Document = df['Article'][4], Summary = df['Generated Summaries'][4])
print(prompt)

You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is 

In [None]:
# Generated Coherence Scores:
4, 5, 4, 5, 5
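
A sketch of how an evaluator reply could be parsed into a score, and how the coherence scores listed above could be averaged. It assumes the evaluator LLM answers the form with text containing a single digit from 1 to 5. `parse_score` and `example_reply` are hypothetical; only the five scores come from the notebook.

```python
import re
from statistics import mean

def parse_score(reply: str) -> int:
    """Extract the first standalone integer between 1 and 5 from a reply."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"No 1-5 score found in reply: {reply!r}")
    return int(match.group(1))

# Hypothetical evaluator reply, shaped like the evaluation form in the prompt.
example_reply = "- Coherence: 4"
example_score = parse_score(example_reply)

# Coherence scores listed in the cell above.
coherence_scores = [4, 5, 4, 5, 5]
mean_coherence = mean(coherence_scores)

print(example_score, mean_coherence)
```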

In [10]:
print(f"Human Summary: {df['Human Summaries'][0]}")
print(f"Generated Summary: {df['Generated Summaries'][0]}")

Human Summary: A 22-year-old man has been charged with causing death by dangerous driving after a woman died in a crash in Newport on Friday.
Generated Summary: A 19-year-old woman, Xana Doyle, died after a car accident in Newport, where a silver Toyota Avensis crashed on its roof. A 15-year-old girl was also injured in the incident. A 21-year-old man, who was involved in the incident, was arrested but later released with no further action. The 21-year-old man, along with a 19-year-old man from Newport, faces other charges including theft of a vehicle and driving with excess alcohol. He is due to appear before a Newport magistrates court on Monday.
