
# Project1 - Financial Product Complaint Classification and Summarization

Rahul Kher

# Setup

## Installing Libraries

In [1]:
# Version specifiers are quoted so the shell does not treat ">" as output redirection
!pip install -q "transformers>=4.41.0" "accelerate==0.26.1" bitsandbytes "huggingface_hub>=0.24.0"

## Importing Relevant Libraries

In [2]:
# Importing relevant libraries
# Basic python libraries
import math
import torch
import re
import orjson

# using transformers library to get the LLM from Hugging Face
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)

# Libraries to get keys, data from Google Colab and Login to Hugging Face
from google.colab import userdata
from google.colab import files
from huggingface_hub import login

# Data science kit
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Loading LLM

In [3]:
login(token=userdata.get("hf_token"))  # Log in to Hugging Face (token stored as a Colab secret named "hf_token")

In [4]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2" # Using the Mistral-7B-Instruct-v0.2

In [5]:
# Quantising the model to reduce size based on RAM space available
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype='float16',
    bnb_4bit_use_double_quant=False
)

In [6]:
# Loading the model from Hugging Face
mistral7b_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
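
As a rough sense of what 4-bit loading saves, the arithmetic below estimates weight memory from the three shard sizes shown in the download log (4.94 + 5.00 + 4.54 GB). This is a back-of-the-envelope sketch, not a measurement: it assumes the checkpoint stores 2 bytes per parameter, that every weight is quantised (in practice some layers usually stay in higher precision), and that the quantiser keeps one 32-bit scale per block of 64 weights.

```python
# Shard sizes (GB) as reported in the download log.
shard_gb = [4.94, 5.00, 4.54]
checkpoint_gb = sum(shard_gb)

# Assumption: the checkpoint stores 16-bit weights (2 bytes per parameter).
n_params = checkpoint_gb * 1e9 / 2


def weight_gb(bits_per_param: float) -> float:
    """Approximate weight memory in GB for a given storage width."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(16)
nf4_gb = weight_gb(4)
# Assumption: one 32-bit scale per block of 64 weights, i.e. 0.5 extra bits per weight.
nf4_with_scales_gb = weight_gb(4 + 32 / 64)

print(f"parameters (estimated): {n_params / 1e9:.2f}B")
print(f"16-bit weights:         {fp16_gb:.2f} GB")
print(f"4-bit weights:          {nf4_gb:.2f} GB")
print(f"4-bit weights + scales: {nf4_with_scales_gb:.2f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size; activations and the KV cache come on top of that.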

In [7]:
# Loading the Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    padding_side="left"
)

tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

# Loading Dataset

In [8]:
# Code snippet to upload the given dataset in colab during runtime
uploaded = files.upload()

for filename in uploaded.keys():
    print(f"File uploaded: {filename}")

# Reading the CSV into pandas DataFrame
df = pd.read_csv(list(uploaded.keys())[0])
df.head()

Saving Complaints_classification.csv to Complaints_classification.csv
File uploaded: Complaints_classification.csv


Unnamed: 0,product,narrative,summary
0,credit_card,purchase order day shipping amount receive pro...,The customer made a purchase order with an agr...
1,credit_card,forwarded message date tue subject please inve...,The sender of the email believes they have bee...
2,retail_banking,forwarded message cc sent friday pdt subject f...,The sender of the email alleges that Wells Far...
3,credit_reporting,payment history missing credit report speciali...,The credit report from Specialized Loan Servic...
4,credit_reporting,payment history missing credit report made mis...,The text concerns a person who found an unauth...


In [9]:
# Splitting the dataset into train and test - Test Dataset to be used to review model performance
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
train_df.head()

Unnamed: 0,product,narrative,summary
0,credit_reporting,block except otherwise provided section consum...,This text refers to the rules and regulations ...
1,credit_reporting,bankruptcy account may concern received copy c...,The text is about a person from Florida expres...
2,credit_reporting,xxxxxxxx credit card year disputed charge vend...,The input talks about a dispute over a charge ...
3,credit_reporting,blockexcept otherwise provided section consume...,The text describes the regulations for blockin...
4,credit_reporting,blockexcept otherwise provided section consume...,The text describes regulations on consumer rep...


In [10]:
test_df.head()

Unnamed: 0,product,narrative,summary
0,credit_reporting,except otherwise provided section consumer rep...,The text discusses rules and provisions pertai...
1,mortgages_and_loans,entity servicing loan behalf hereinafter calle...,"The input is about a lender company, which ser..."
2,credit_reporting,except otherwise provided section consumer rep...,The input text seems to discuss a law provisio...
3,credit_reporting,true identity theft victim identity theft info...,The text is a plea from an identity theft vict...
4,credit_reporting,continue refresh outdated account old account ...,The text stresses on updating an outdated acco...


# LLM Inference

In [11]:
# Deriving the Labels from the Dataset - in List form
LABELS = train_df["product"].unique().tolist()
LABELS

['credit_reporting',
 'credit_card',
 'debt_collection',
 'mortgages_and_loans',
 'retail_banking']

In [12]:
# Function to generate completions from the LLM
def generate_completion(user_text: str, df: pd.DataFrame, zero_shot_flag: bool=True) -> str:

    messages = [
        {"role":"system","content": SYSTEM_PROMPT_ZERO_SHOT if zero_shot_flag else SYSTEM_PROMPT_FEW_SHOT},
        {"role":"user","content": zero_shot(user_text) if zero_shot_flag else few_shot_prompt_from_df(df, user_text)}
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(mistral7b_model.device)
    stop_token_ids = [tokenizer.eos_token_id]
    with torch.no_grad():
        out_tokens = mistral7b_model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=False,
            eos_token_id=stop_token_ids[0]
        )
    return tokenizer.decode(out_tokens[0], skip_special_tokens=True).strip()

# Extracting the final Category from the LLM output
_json_pat = re.compile(r"\{\"category\":\s?\"(credit_reporting|credit_card|debt_collection|mortgages_and_loans|retail_banking)\"}", re.DOTALL)
def extract_category(raw: str) -> str:
    # _json_pat = re.compile(r"\{\"category\":\s.*}", re.DOTALL)
    m = _json_pat.search(raw)
    if not m:
        return "insufficient_text"  # fallback
    try:
        obj = orjson.loads(m.group(0))
        cat = str(obj.get("category","")).strip()
        if cat in LABELS:
            return cat
    except Exception:
        pass
    return "insufficient_text"

# Final classification function to bring it all together
# def classify(text: str, df: pd.DataFrame, zero_shot_flag: bool=True) -> str:
#     return extract_category(generate_completion(text, df, zero_shot_flag))
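
To illustrate how the extraction step behaves, here is a standalone sketch of the same idea. It uses the standard-library `json` module instead of `orjson` and builds the pattern from the label list; the `raw` strings are made-up stand-ins for decoded model output, not actual responses from the run. Note that the JSON placeholder inside the prompt is skipped because it does not contain a valid label.

```python
import json
import re

LABELS = ["credit_reporting", "credit_card", "debt_collection",
          "mortgages_and_loans", "retail_banking"]

# Same shape as the pattern above, built from LABELS so the two cannot drift apart.
pattern = re.compile(r'\{"category":\s?"(' + "|".join(map(re.escape, LABELS)) + r')"\}')


def extract(raw: str) -> str:
    """Return the first valid category found in raw, else the fallback label."""
    m = pattern.search(raw)
    return json.loads(m.group(0))["category"] if m else "insufficient_text"


# Hypothetical decoded output: the prompt template followed by the model's answer.
raw = ('[INST] Return ONLY this JSON:\n'
       '{"category":"<one_of: credit_card | retail_banking >"}\n\n'
       'Text:\ncard charged twice [/INST] {"category": "credit_card"}')

print(extract(raw))                             # the answer after [/INST] is picked up
print(extract("The category is credit_card."))  # no JSON object, so the fallback is returned
```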

In [13]:
def evaluate(df: pd.DataFrame, truth_col: str, pred_col: str) -> float:
    """
    Compute the weighted F1 score of predicted labels against ground truth.

    Args:
        df: Pandas DataFrame containing both columns.
        truth_col: Column name for ground truth labels.
        pred_col: Column name for predicted labels.

    Returns:
        Weighted-average F1 score (each class weighted by its support).
    """
    y_true = df[truth_col].astype(str).str.strip()
    y_pred = df[pred_col].astype(str).str.strip()

    # Only F1 is reported; precision, recall and support are discarded
    _, _, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted', zero_division=0)

    return f1
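
As an illustration of what the weighted F1 returned by `evaluate` measures, the sketch below recomputes it in plain Python. The labels are made up for the example, and `weighted_f1` is a hypothetical helper, not part of the notebook; it is meant to mirror the `average='weighted'`, `zero_division=0` setting used above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over all labels that occur."""
    support = Counter(y_true)
    total = 0.0
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += support[label] * f1  # labels absent from y_true have support 0
    return total / len(y_true)


# Made-up labels for illustration only.
y_true = ["credit_card", "credit_card", "retail_banking", "debt_collection"]
y_pred = ["credit_card", "retail_banking", "retail_banking", "insufficient_text"]
print(weighted_f1(y_true, y_pred))

# If every prediction still carries the prompt, nothing matches any label.
y_raw = ["[INST] prompt text " + p for p in y_pred]
print(weighted_f1(y_true, y_raw))
```

In the first call, `credit_card` and `retail_banking` each score an F1 of 2/3 and `debt_collection` scores 0, giving (2 × 2/3 + 1 × 2/3 + 1 × 0) / 4 = 0.5. In the second call no prediction equals a label, so the score is 0.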

## Zero Shot Prompting

In [14]:
SYSTEM_PROMPT_ZERO_SHOT = """
You are a banking complaint triage assistant.
Output ONLY the category label exactly as instructed.
Do not repeat the user prompt.
Do not add any extra words, explanations, or formatting.
If unsure, output "insufficient_data".
"""

In [15]:
# Function to create a zero shot prompt
def zero_shot(text: str) -> str:
    return (
        'Return ONLY this JSON:\n'
        '{"category":"<one_of: '+ " | ".join(LABELS) + ' >"}\n\n'
        "Text:\n" + text.strip()
    )

In [16]:
# Code for Zero Shot inference and evaluation

eval_df = test_df.copy()
for i in range(len(eval_df)):
  eval_df.loc[i, 'mistral_response'] = generate_completion(eval_df.loc[i, "narrative"], df=pd.DataFrame())
  eval_df.loc[i, 'mistral_response_cleaned'] = extract_category(eval_df.loc[i, 'mistral_response'])
eval_df.head()

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Unnamed: 0,product,narrative,summary,mistral_response,mistral_response_cleaned
0,credit_reporting,except otherwise provided section consumer rep...,The text discusses rules and provisions pertai...,[INST] \nYou are a banking complaint triage as...,credit_reporting
1,mortgages_and_loans,entity servicing loan behalf hereinafter calle...,"The input is about a lender company, which ser...",[INST] \nYou are a banking complaint triage as...,mortgages_and_loans
2,credit_reporting,except otherwise provided section consumer rep...,The input text seems to discuss a law provisio...,[INST] \nYou are a banking complaint triage as...,credit_reporting
3,credit_reporting,true identity theft victim identity theft info...,The text is a plea from an identity theft vict...,[INST] \nYou are a banking complaint triage as...,insufficient_text
4,credit_reporting,continue refresh outdated account old account ...,The text stresses on updating an outdated acco...,[INST] \nYou are a banking complaint triage as...,debt_collection


In [17]:
print("Mistral Raw: ", evaluate(eval_df, "product", "mistral_response"))
print("Mistral Cleaned: ", evaluate(eval_df, "product", "mistral_response_cleaned"))

Mistral Raw:  0.0
Mistral Cleaned:  0.8185093167701862


The raw Mistral response (`mistral_response`) still contains the full prompt, because the decoded output includes the input tokens. It therefore never equals a ground-truth label, and the weighted F1 evaluates to 0.0.

Once the category is extracted from the raw response with `extract_category`, it can be compared against the ground truth, and the weighted F1 rises to about 0.82.
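
One way to keep the prompt out of the decoded text in the first place is to drop the prompt tokens before decoding. For decoder-only models, `generate` returns the prompt ids followed by the newly generated ids, so slicing at the prompt length isolates the completion. The sketch below uses plain lists with made-up token ids, and `strip_prompt` is a hypothetical helper. In `generate_completion` the equivalent would be slicing `out_tokens[0]` at `inputs["input_ids"].shape[1]` before calling `tokenizer.decode`; that is a suggestion, not what this notebook ran.

```python
def strip_prompt(output_ids: list[int], prompt_len: int) -> list[int]:
    """Return only the ids generated after the prompt."""
    return output_ids[prompt_len:]


# Made-up token ids: the model output is the prompt ids followed by the new ids.
prompt_ids = [1, 733, 16289, 28793, 5299, 733, 28748, 16289, 28793]
new_ids = [9830, 8684, 1264, 2]
output_ids = prompt_ids + new_ids

completion_ids = strip_prompt(output_ids, len(prompt_ids))
print(completion_ids)
```

The regex extraction remains useful as a safeguard even with this change, since the model may still wrap the JSON in extra words.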

## Few Shot Prompting

In [18]:
SYSTEM_PROMPT_FEW_SHOT = """
You are a banking complaint triage assistant.
- Follow instructions exactly and be concise.
- Output ONLY the requested JSON/text. No extra words.
- If unsure, say "insufficient_data".
- Do not reveal chain-of-thought; give final results only.
- Use the provided examples in the user prompt as guidance for structure, style, and decision-making.
- Apply the reasoning patterns from the examples to new cases.
"""

In [19]:
# Sample up to per_label examples per category from the dataset - used to build the few-shot prompt
def _sample_examples_from_df(df, per_label: int = 10) -> list[tuple[str, str]]:
    out = []
    for lab in LABELS:
        sub = df[df["product"] == lab]
        if sub.empty:
            continue
        pick = sub.sample(min(per_label, len(sub)), random_state=42)
        for t in pick["narrative"].tolist():
            out.append((t, lab))
    return out

# Cleaning the text so each example stays short and does not break the prompt's quoting
def _clean_line(s: str, max_len: int = 220) -> str:
    s = " ".join(str(s).split())   # collapse whitespace/newlines
    s = s.replace('"', "'")        # avoid breaking quotes
    return (s[:max_len] + "…") if len(s) > max_len else s

# Function to design the few shot prompt
def few_shot_prompt_from_df(df, text: str, per_label: int = 2,
                            labels=LABELS, default: str = "insufficient_text") -> str:
    """
    Build a few-shot classification prompt (USER message body) from your dataframe.
    Uses _sample_examples_from_df(df, per_label) to pull examples.
    """
    examples = _sample_examples_from_df(df, per_label=per_label)
    ex_lines = [f'- Text: "{_clean_line(t)}" -> {lab}' for t, lab in examples if lab in labels]

    return (
        "You must classify the complaint into exactly ONE of:\n"
        f"{labels}\n\n"
        'Return ONLY:\n{"category":"<one_label>"}\n\n'
        "Rules:\n"
        f'- If unsure, return exactly: {{"category":"{default}"}}\n'
        "- No extra words.\n\n"
        "Examples:\n" + ("\n".join(ex_lines) if ex_lines else "(no examples found)") + "\n\n"
        "Classify this complaint and return ONLY the JSON:\n\n"
        "Text:\n" + text.strip()
    )

In [20]:
# Code for Few Shot inference and evaluation
eval_df = test_df.copy()
for i in range(len(eval_df)):
  eval_df.loc[i, 'mistral_response'] = generate_completion(eval_df.loc[i, "narrative"], df=eval_df, zero_shot_flag=False)
  eval_df.loc[i, 'mistral_response_cleaned'] = extract_category(eval_df.loc[i, 'mistral_response'])
eval_df.head()

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Unnamed: 0,product,narrative,summary,mistral_response,mistral_response_cleaned
0,credit_reporting,except otherwise provided section consumer rep...,The text discusses rules and provisions pertai...,[INST] \nYou are a banking complaint triage as...,credit_reporting
1,mortgages_and_loans,entity servicing loan behalf hereinafter calle...,"The input is about a lender company, which ser...",[INST] \nYou are a banking complaint triage as...,mortgages_and_loans
2,credit_reporting,except otherwise provided section consumer rep...,The input text seems to discuss a law provisio...,[INST] \nYou are a banking complaint triage as...,credit_reporting
3,credit_reporting,true identity theft victim identity theft info...,The text is a plea from an identity theft vict...,[INST] \nYou are a banking complaint triage as...,credit_reporting
4,credit_reporting,continue refresh outdated account old account ...,The text stresses on updating an outdated acco...,[INST] \nYou are a banking complaint triage as...,credit_reporting


In [21]:
print("Mistral Raw: ", evaluate(eval_df, "product", "mistral_response"))
print("Mistral Cleaned: ", evaluate(eval_df, "product", "mistral_response_cleaned"))

Mistral Raw:  0.0
Mistral Cleaned:  0.8943743961352658


Few-shot prompting gave better results than zero-shot prompting here (weighted F1 of about 0.89 versus 0.82), most likely because the prompt shows the model concrete examples of the task before it generates an answer.

In zero-shot mode, the model relies only on the instruction and its general training knowledge. This can lead to inconsistent formatting, misinterpretation of intent, or inclusion of irrelevant details, especially in structured tasks like classification.

Few-shot prompting, on the other hand, provides specific demonstrations of inputs paired with the correct outputs. This acts like in-context learning:

- The model sees exactly how the response should be structured.
- It picks up the expected tone, level of detail, and style.
- It sees clearer boundaries between relevant and irrelevant information.

As a result, few-shot prompts reduce ambiguity, constrain the model's output space, and lead to more accurate, consistent, and cleaner responses, which is consistent with the better evaluation metrics observed. One caveat: the examples were sampled from `eval_df`, the test split, so some test narratives appear in the prompt together with their labels. Sampling from `train_df` would give a cleaner comparison.
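
The few-shot loop above passes `df=eval_df`, so the demonstrations come from the same split that is being scored. A minimal sketch of an alternative is shown here: draw the demonstrations from training rows and skip any row whose text equals the query. It uses plain dictionaries rather than pandas to stay self-contained; `sample_examples`, its `exclude` parameter, and the rows are hypothetical and only stand in for `train_df`.

```python
import random


def sample_examples(train_rows, labels, per_label=2, seed=42, exclude=None):
    """Pick up to per_label (text, label) pairs per label, skipping the query text."""
    rng = random.Random(seed)
    examples = []
    for label in labels:
        pool = [r["narrative"] for r in train_rows
                if r["product"] == label and r["narrative"] != exclude]
        rng.shuffle(pool)
        examples.extend((text, label) for text in pool[:per_label])
    return examples


# Made-up rows standing in for train_df.
train_rows = [
    {"product": "credit_card", "narrative": "card charged twice for one purchase"},
    {"product": "credit_card", "narrative": "annual fee added without notice"},
    {"product": "credit_card", "narrative": "late fee despite payment on time"},
    {"product": "retail_banking", "narrative": "checking account closed without warning"},
]

examples = sample_examples(
    train_rows,
    labels=["credit_card", "retail_banking"],
    exclude="annual fee added without notice",
)
for text, label in examples:
    print(f'- Text: "{text}" -> {label}')
```

With the notebook's own helpers, the closest equivalent would be passing `df=train_df` to `generate_completion` in the few-shot loop.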

# Q2. Text Summarization

In [22]:
SYSTEM_PROMPT_SUMMARY = """You are a text summarisation assistant.
- Your task is to produce a concise and accurate summary of the user-provided text.
- Retain all important facts, figures, and key points.
- Remove unnecessary details, examples, and repetition.
- Preserve the original meaning without adding your own opinions.
- Use clear and simple language.
- Output ONLY the summary text, without any extra commentary, formatting, or explanation.
"""

In [23]:
def summarisation_user_prompt(text: str) -> str:
  return f"Summarise the following text:\n\n{text}\n\nSummary:"

In [24]:
# Function to generate completions from the LLM
def generate_completion(user_text: str) -> str:

    messages = [
        {"role":"system","content": SYSTEM_PROMPT_SUMMARY},
        {"role":"user","content": summarisation_user_prompt(user_text)}
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(mistral7b_model.device)
    stop_token_ids = [tokenizer.eos_token_id]
    with torch.no_grad():
        out_tokens = mistral7b_model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=False,
            eos_token_id=stop_token_ids[0]
        )
    return tokenizer.decode(out_tokens[0], skip_special_tokens=True).strip()

# Extracting the summary text from the raw LLM output
def extract_summary(raw: str) -> str:
    # Find "Summary:" and grab everything after it
    match = re.search(r"Summary:\s*(.*)", raw, re.DOTALL | re.IGNORECASE)
    if match:
        summary_text = match.group(1)
        # Remove [/INST] or any special tokens
        summary_text = re.sub(r"\[/?INST\]", "", summary_text)
        return summary_text.strip()
    return raw.strip()

# Final summarisation function to bring it all together
def summarise(text: str) -> str:
    return extract_summary(generate_completion(text))
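
To make the cleaning step concrete, the sketch below runs the same extraction logic on a made-up decoded string that mimics the prompt followed by the model's answer. The `raw` text is hypothetical, not an actual response from the run.

```python
import re


def extract_summary(raw: str) -> str:
    """Keep the text after the first 'Summary:' marker and drop [INST] tags."""
    match = re.search(r"Summary:\s*(.*)", raw, re.DOTALL | re.IGNORECASE)
    if match:
        return re.sub(r"\[/?INST\]", "", match.group(1)).strip()
    return raw.strip()


# Hypothetical decoded output: prompt template followed by the generated summary.
raw = ("[INST] You are a text summarisation assistant.\n\n"
       "Summarise the following text:\n\ncard charged twice for one purchase\n\n"
       "Summary: [/INST] The customer was charged twice for a single purchase.")

print(extract_summary(raw))
print(extract_summary("  no marker here  "))  # falls back to the stripped input
```

Because the search is case-insensitive and takes the first match, a narrative that itself contains "summary:" would shift the cut point earlier; the narratives in this dataset have punctuation stripped, so that case does not appear to arise here.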

In [25]:
# Zero-shot summarisation of the test set
eval_df = test_df.copy()
for i in range(len(eval_df)):
  eval_df.loc[i, 'mistral_response'] = summarise(eval_df.loc[i, 'narrative'])
eval_df.head()

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Unnamed: 0,product,narrative,summary,mistral_response
0,credit_reporting,except otherwise provided section consumer rep...,The text discusses rules and provisions pertai...,A consumer reporting agency must block reporti...
1,mortgages_and_loans,entity servicing loan behalf hereinafter calle...,"The input is about a lender company, which ser...",The text refers to a lender company that provi...
2,credit_reporting,except otherwise provided section consumer rep...,The input text seems to discuss a law provisio...,A consumer reporting agency must block reporti...
3,credit_reporting,true identity theft victim identity theft info...,The text is a plea from an identity theft vict...,A victim of identity theft discovered inaccura...
4,credit_reporting,continue refresh outdated account old account ...,The text stresses on updating an outdated acco...,An outdated account may violate the Federal De...


In [27]:
!pip install bert-score
from bert_score import score

# Extract the two columns as lists
pred_summaries = eval_df["mistral_response"].astype(str).tolist()
ref_summaries  = eval_df["summary"].astype(str).tolist()

# Compute BERTScore
P, R, F1 = score(pred_summaries, ref_summaries, lang="en", verbose=True)

# Add results back to DataFrame
eval_df["BERTScore_P"]  = P.tolist()
eval_df["BERTScore_R"]  = R.tolist()
eval_df["BERTScore_F1"] = F1.tolist()

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/3 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 3.68 seconds, 27.14 sentences/sec


In [28]:
eval_df.head()

Unnamed: 0,product,narrative,summary,mistral_response,BERTScore_P,BERTScore_R,BERTScore_F1
0,credit_reporting,except otherwise provided section consumer rep...,The text discusses rules and provisions pertai...,A consumer reporting agency must block reporti...,0.91421,0.861569,0.887109
1,mortgages_and_loans,entity servicing loan behalf hereinafter calle...,"The input is about a lender company, which ser...",The text refers to a lender company that provi...,0.919408,0.904904,0.912098
2,credit_reporting,except otherwise provided section consumer rep...,The input text seems to discuss a law provisio...,A consumer reporting agency must block reporti...,0.903887,0.857739,0.880208
3,credit_reporting,true identity theft victim identity theft info...,The text is a plea from an identity theft vict...,A victim of identity theft discovered inaccura...,0.908332,0.892196,0.900192
4,credit_reporting,continue refresh outdated account old account ...,The text stresses on updating an outdated acco...,An outdated account may violate the Federal De...,0.824069,0.879819,0.851032


In [29]:
# Show averages
print("\nAverage Precision:", P.mean().item())
print("Average Recall:", R.mean().item())
print("Average F1:", F1.mean().item())


Average Precision: 0.9029146432876587
Average Recall: 0.8734950423240662
Average F1: 0.8878607749938965


After cleaning the Mistral outputs to remove prompt text and special tokens, the model achieved an average BERTScore F1 of about 0.888 in zero-shot mode, with precision of about 0.903 and recall of about 0.873. The higher precision suggests that what the summaries say is semantically close to the references, while the slightly lower recall suggests some reference details are occasionally omitted; the 64-token generation limit may contribute to this. These are unrescaled BERTScore values, which tend to fall in a narrow high range, so they are better read as relative than as absolute quality. Cleaning before evaluation matters: in the classification task, raw outputs that still contained the prompt scored a weighted F1 of 0.0.
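
To show why an omitted detail lowers recall but not precision, here is a simplified sketch of the greedy matching behind BERTScore. It is not the `bert_score` implementation: the two-dimensional "token embeddings" are made up, `greedy_bertscore` is a hypothetical helper, and there is no IDF weighting or baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_bertscore(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up vectors: the candidate covers two of the three reference tokens.
ref = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cand = [(1.0, 0.0), (0.0, 1.0)]

p, r, f1 = greedy_bertscore(cand, ref)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Every candidate token has an exact counterpart in the reference, so precision is 1, while the third reference token only finds a partial match and pulls recall below 1. This is the same direction of gap as in the averages reported above.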
