### Model Finetuning: Summarize CNN news articles

Generation of the summary of an article with the pre-trained Google's LLM [FLAN-T5](https://huggingface.co/google/flan-t5-base) from HuggingFace finetuned with PEFT/LoRA.

The articles come from the [CNN dataset](https://huggingface.co/datasets/cnn_dailymail), which contains ~1M articles from the CNN DailyMail. They come with the corresponding manually labeled summaries.

Key points:
- A pre-trained model can be **finetuned** to better perform on a specific task.
- **LoRA** (Low-Ranking Adaptaton) is a powerful way to do this.
- The model's performance can be assessed with the **ROUGE** metric, after visually checking the results.

In [None]:
# ! pip install datasets transformers evaluate accelerate peft

In [1]:
# libraries
import torch
import time
import pandas as pd
import numpy as np
from datasets import load_dataset                                                                           # huggingface datasets
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer # necessary huggingface classes
import evaluate                                                                                             # huggingface evaluate






#### Dataset

In [2]:
dataset = load_dataset("cnn_dailymail", "3.0.0", ignore_verifications=True) # load dataset



Downloading readme:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/257M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/257M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/259M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [3]:
# some examples
np.random.seed(30)
example_index = np.random.randint(0,dataset['test'].shape[0])
print('-'.join('' for x in range(100)))
print(f"Article n. {example_index}")
print('-'.join('' for x in range(100)))
print('INPUT ARTICLE:')
print(dataset['test'][example_index]['article'])
print('-'.join('' for x in range(100)))
print()
print('BASELINE HUMAN SUMMARY:')
print(dataset['test'][example_index]['highlights'])
print('-'.join('' for x in range(100)))
print()

---------------------------------------------------------------------------------------------------
Article n. 5925
---------------------------------------------------------------------------------------------------
INPUT ARTICLE:
You’ve found the house of your dreams – but you want to make sure the costly nightmare of dry rot isn’t lurking under the carpets and skirting boards. The solution? Call in the dogs! The appropriately named Mark Doggett has trained his two animals to sniff out the destructive fungus in old houses where it can hide in places a person would miss. Mr Doggett gave up a ten-year career in construction after hitting on the idea to set up a business using the animals’ sense of smell, which is said to be up to a million times better than that of humans. Skilled: Meg and Jess, pictured with Mark Doggett, were trained for six months to sniff out dry rot . On the case: Four-year-old Border collie Meg gets down to work sniffing out the destructive fungus . When they find

#### Model

Load the [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and tokenizer and check what the tokenizer does.

In [4]:
model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)   # instantiate the model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)                             # instantiate the tokenizer

In [5]:
# tokenizer encoding/decoding example
sentence = "Breaking News: Christmas holidays are over! @^^^**#$%"
sentence_encoded = tokenizer(sentence, return_tensors='pt')                       # return tokenization as pytorch tensor
sentence_decoded = tokenizer.decode(sentence_encoded["input_ids"][0], skip_special_tokens=True) # skipping special tokens

print('INPUT SENTENCE:')
print(f"{sentence}")
print('\nENCODED SENTENCE:')
print(f"Input IDs: {sentence_encoded['input_ids']}")
print(f"Attention Mask: {sentence_encoded['attention_mask']}")
print('\nDECODED SENTENCE:')
print(sentence_decoded)

INPUT SENTENCE:
Breaking News: Christmas holidays are over! @^^^**#$%

ENCODED SENTENCE:
Input IDs: tensor([[11429,    53,  3529,    10,  1619,  6799,    33,   147,    55,  3320,
             2, 19844,  4663,  3229,  1454,     1]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

DECODED SENTENCE:
Breaking News: Christmas holidays are over! @**#$%


Let's check the number of parameters and how many of them are trainable.

In [6]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params, all_model_params = 0, 0
    for _, param in model.named_parameters():           # model parameters
        all_model_params += param.numel()
        if param.requires_grad:                         # model parameters with gradients (i.e., trainable)
            trainable_model_params += param.numel()
    print(f"N. of model parameters: {all_model_params}")
    print(f"N. of trainable model parameters: {trainable_model_params} ({trainable_model_params / all_model_params:.2%} of total)")
print_number_of_trainable_model_parameters(original_model)

N. of model parameters: 247577856
N. of trainable model parameters: 247577856 (100.00% of total)


#### Summarizing an Article without Prompt Engineering (zero-shot inference)

Let's generate a summary that is as long as the human-made one. **Note** that this is an arbitrary choice and that any length can be chosen.

In [7]:
article = dataset['test'][example_index]['article']
summary = dataset['test'][example_index]['highlights']
max_new_tokens = len(tokenizer(summary)['input_ids'])

In [8]:
inputs = tokenizer(article, return_tensors='pt')
output = tokenizer.decode(original_model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)[0], skip_special_tokens=True)

print('-'.join('' for x in range(100)))
print(f"Article n. {example_index}")
print('-'.join('' for x in range(100)))
print('INPUT ARTICLE:')
print(dataset['test'][example_index]['article'])
print('-'.join('' for x in range(100)))
print()
print(f'BASELINE HUMAN SUMMARY (n. of tokens = {max_new_tokens}):')
print(dataset['test'][example_index]['highlights'])
print('-'.join('' for x in range(100)))
print()
print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:')
print(output)
print()

Token indices sequence length is longer than the specified maximum sequence length for this model (775 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
Article n. 5925
---------------------------------------------------------------------------------------------------
INPUT ARTICLE:
You’ve found the house of your dreams – but you want to make sure the costly nightmare of dry rot isn’t lurking under the carpets and skirting boards. The solution? Call in the dogs! The appropriately named Mark Doggett has trained his two animals to sniff out the destructive fungus in old houses where it can hide in places a person would miss. Mr Doggett gave up a ten-year career in construction after hitting on the idea to set up a business using the animals’ sense of smell, which is said to be up to a million times better than that of humans. Skilled: Meg and Jess, pictured with Mark Doggett, were trained for six months to sniff out dry rot . On the case: Four-year-old Border collie Meg gets down to work sniffing out the destructive fungus . When they find

#### Dataset Preprocessing

We will do two things:
1. Rewrite the prompt-response pairs in a way that helps the model better understand the task to perform (using a pre-defined template).
2. Tokenize the prompt-response pairs with the tokenizer specific to our model.

Prompt templates for our model, FLAN-T5, are available [here](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py).
The shape of the prompt-response pair will be the following.\
Training prompt (article):
  ```
  Article:
  {article}
  What was going on?
  ```

Training response (summary):
  ```
  {summary}
  ```

In [9]:
# engineer + tokenize input dataset (train/val/test)
def tokenize_function(example):
    engineered_prompt = ['Article: \n\n' + article + '\n\n What was going on?' for article in example["article"]]
    example['input_ids'] = tokenizer(engineered_prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["highlights"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)  # it handles all data across all splits (train/val/test) in batches
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'article', 'highlights',])

Map:   0%|          | 0/287113 [00:00<?, ? examples/s]

Map:   0%|          | 0/13368 [00:00<?, ? examples/s]

Map:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [10]:
print(f"Shapes of the original datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

print(f"Shapes of the reduced datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the original datasets:
Training: (287113, 2)
Validation: (13368, 2)
Test: (11490, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 11490
    })
})


Filter:   0%|          | 0/287113 [00:00<?, ? examples/s]

Filter:   0%|          | 0/13368 [00:00<?, ? examples/s]

Filter:   0%|          | 0/11490 [00:00<?, ? examples/s]

Shapes of the reduced datasets:
Training: (2872, 2)
Validation: (134, 2)
Test: (115, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 2872
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 134
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 115
    })
})


#### Model *full* fine-tuning [NOT done here]
Performing a full fine-tuning is computationally expensive, that's why we won't do it here. Even on the reduced dataset, the training would take few hours on a GPU. We will use another technique (see below) to finetune FLAN-T5 and make it capable to adapt to our task (summarization) on our dataset (CNN news).

For sake of completeness, the following cell shows how to train the model (using the HuggingFace [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class) and how to load a checkpoint from an already trained model.

```
# full model fine-tuning

! pip install accelerate
! pip install accelerate>=0.20.1

import time
import accelerate
training_args = TrainingArguments(output_dir=output_dir, learning_rate=1e-3, num_train_epochs=100, max_steps=1)
trainer = Trainer(model=original_model, args=training_args, train_dataset=tokenized_datasets['train'], eval_dataset=tokenized_datasets['validation'])

trainer.train()
trainer.save_model('path_to_file')



# load model from checkpoint

instruct_model = AutoModelForSeq2SeqLM.from_pretrained("path_to_folder_with_model_checkpoint/", torch_dtype=torch.bfloat16)
```

#### Perform Parameter Efficient Fine-Tuning (PEFT)

When we don't have enough resources (computational power, GPUs, time, ...), we can still finetune a model to specialize in a task or add extra information from a new dataset. This is called **Parameter Efficient Fine-Tuning (PEFT)**, a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results.

PEFT is a generic term that includes **Low-Rank Adaptation ([LoRA](https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)#Low-rank_adaptation))** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

#### Setup the PEFT/LoRA model for Fine-Tuning

We can set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. When using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. The main hyper-parameter is the rank, `r`, which defines the rank/dimension of the adapter to be trained.

In [11]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

In [12]:
lora_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.SEQ_2_SEQ_LM: 'SEQ_2_SEQ_LM'>, inference_mode=False, r=32, target_modules={'v', 'q'}, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [13]:
peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

N. of model parameters: 251116800
N. of trainable model parameters: 3538944 (1.41% of total)
None


#### Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [14]:
# set PEFT config
output_dir = f'./peft-article-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

# define PEFT Trainer
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [15]:
# train the PEFT adapter
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Step,Training Loss
1,43.0


('./peft-dialogue-summary-checkpoint-local\\tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local\\special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local\\tokenizer.json')

At this point, we can add the PEFT adapter to the original FLAN-T5 model. 

BY setting `is_trainable=False`, we use only the PEFT model to do inference. If you were preparing the model for further training, you would set `is_trainable=True`.

In [16]:
from peft import PeftModel, PeftConfig

# original model
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# peft trained model
peft_model = PeftModel.from_pretrained(peft_model_base, peft_model_path, torch_dtype=torch.bfloat16, is_trainable=False)

The number of trainable parameters will be `0` due to `is_trainable=False` setting:

In [17]:
print(print_number_of_trainable_model_parameters(peft_model))

N. of model parameters: 251116800
N. of trainable model parameters: 0 (0.00% of total)
None


#### Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a *qualitative* approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below, we can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [27]:
article = dataset['test'][example_index]['article']
baseline_human_summary = dataset['test'][example_index]['highlights']

prompt = f"""
    Article:
    {article}
    What was going on?
    
    Summary:  
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print('-'.join('' for x in range(100)))
print(f'BASELINE HUMAN SUMMARY:\n{dataset["test"][example_index]["highlights"]}')
print('-'.join('' for x in range(100)))
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print('-'.join('' for x in range(100)))
# print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
# print('-'.join('' for x in range(100)))
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Mark Doggett has trained his two dogs to sniff out the destructive fungus .
His Border collie and English springer spaniel had six months training .
When they find dry rot they stop, stare at it and point with their nose .
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Mark Doggett trained his dogs to sniff out dry rot. The businessman, 30, from Wolverhampton, has now trained his dogs to sniff out dry rot.
---------------------------------------------------------------------------------------------------
PEFT MODEL: Mark Doggett has trained his dogs to sniff out dry rot. The businessman, 30, from Wolverhampton, has been successful. He plans to train his dogs to hunt bed bugs for hotels and hospitals.


#### Evaluate the Model Quantitatively (with ROUGE Metric)

After visually checking the model's performance, we can use a more quantitative approach to evaluate the model's inference capabilities: the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) metric helps quantify the validity of the summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human.

In [30]:
articles = dataset['test'][0:10]['article']
human_baseline_summaries = dataset['test'][0:10]['highlights']

original_model_summaries = []
# instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(articles):
    prompt = f"""
        Article:
        {article}
        What was going on?

        Summary:  
    """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    # instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    # instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(
    human_baseline_summaries, 
    original_model_summaries, 
    # instruct_model_summaries, 
    peft_model_summaries))

df = pd.DataFrame(zipped_summaries, 
                  columns = [
                      'human_baseline_summaries', 
                      'original_model_summaries', 
                      #'instruct_model_summaries', 
                      'peft_model_summaries']
                 )
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,peft_model_summaries
0,Membership gives the ICC jurisdiction over all...,Mark Doggett has trained his dogs to sniff out...,Mark Doggett has trained his dogs to sniff out...
1,"Theia, a bully breed mix, was apparently hit b...",Mark Doggett trained dogs to sniff out the des...,Mark Doggett has trained his dogs to sniff out...
2,Mohammad Javad Zarif has spent more time with ...,Dogs trained to sniff out dry rot in old house...,Mark Doggett has trained his dogs to sniff out...
3,17 Americans were exposed to the Ebola virus w...,"Mark Doggett, 30, from Wolverhampton, has trai...",Mark Doggett has trained his dogs to sniff out...
4,Student is no longer on Duke University campus...,Mark Doggett has trained his dogs to sniff out...,Mark Doggett has trained his dogs to sniff out...
5,College-bound basketball star asks girl with D...,Mark Doggett has trained two dogs to sniff out...,Mark Doggett has trained his dogs to sniff out...
6,Amnesty's annual death penalty report catalogs...,Mark Doggett and his pair of sniffer dogs have...,Mark Doggett has trained his dogs to sniff out...
7,Andrew Getty's death appears to be from natura...,Mark Doggett set up business to sniff out dry ...,Mark Doggett has trained his dogs to sniff out...
8,"Once a super typhoon, Maysak is now a tropical...",Mark Doggett has trained his two dogs to sniff...,Mark Doggett has trained his dogs to sniff out...
9,"Bob Barker returned to host ""The Price Is Righ...",Mark Doggett has trained his two dogs to sniff...,Mark Doggett has trained his dogs to sniff out...


Compute ROUGE scores for this subset of the data.

In [32]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

'''
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
'''

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.09650013544619476, 'rouge2': 0.0, 'rougeL': 0.08541426507882091, 'rougeLsum': 0.09502221702385631}
PEFT MODEL:
{'rouge1': 0.08739697192076012, 'rouge2': 0.0, 'rougeL': 0.06560769520311946, 'rougeLsum': 0.07404988886987354}


Notice, that PEFT model results are only slightly inferior, while the training process was much easier!

In [34]:
human_baseline_summaries = df['human_baseline_summaries'].values
original_model_summaries = df['original_model_summaries'].values
# instruct_model_summaries = df['instruct_model_summaries'].values
peft_model_summaries     = df['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

'''
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
'''

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.09650013544619476, 'rouge2': 0.0, 'rougeL': 0.08541426507882091, 'rougeLsum': 0.09502221702385631}
PEFT MODEL:
{'rouge1': 0.08739697192076012, 'rouge2': 0.0, 'rougeL': 0.06560769520311946, 'rougeLsum': 0.07404988886987354}


Calculating the improvement of PEFT over the original model, we can see that:

In [36]:
print("Percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: -0.91%
rouge2: 0.00%
rougeL: -1.98%
rougeLsum: -2.10%


Here you see a small percentage decrease in the ROUGE metrics vs. full fine-tuned. However, the training requires much less computing and memory resources (often just a single GPU).

In conclusion, we can see that it's possible to finetune an LLM with ad-hoc technique, like LoRA, that allow to make a general-purpose model learn how to perform a specific task and/or on a specific dataset, without having to fully retrain the full model.

Also, we saw that, while a qualitative evaluation of the model's behaviour remains the first step to take, more quantitative approaches, like using the ROUGE score, allow us to measure what the model is doing.

### Acknowledgements

Thanks to DeepLearning.AI for the courses that inspired this notebook.