# Fine-Tuning LLMs

In this exercise, you will fine-tune the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model for enhanced dialogue summarization. You will first explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter-Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

In [None]:
''' gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info) '''

In [None]:
'''from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')'''

## 1. Set up Dependencies and Load Dataset and LLM

In [None]:
!pip install datasets evaluate rouge_score peft -q

In [None]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model

In [None]:
dataset = load_dataset('knkarthick/dialogsum')


Load the pre-trained [Flan-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of Flan-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type used for the model weights, which reduces GPU memory usage because `bfloat16` uses half as much memory per number as `float32`, the default precision for most models.
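As a rough back-of-the-envelope check of the memory claim, the sketch below estimates how much space the weights alone take in each precision. The parameter count is an assumption derived from the figures that `print_trainable_parameters()` reports later in this notebook (all parameters minus the LoRA parameters); activations, gradients, and optimizer state are ignored.

```python
# Bytes needed to store one number in each precision
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumed parameter count of the base model: the total reported for the
# LoRA-wrapped model (251,116,800) minus the LoRA parameters (3,538,944)
NUM_PARAMS = 251_116_800 - 3_538_944


def weight_memory_mb(num_params, dtype):
    """Return the approximate size of the weights in megabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(NUM_PARAMS, dtype):.0f} MB")
```

Actual GPU usage during training will be considerably higher than these figures, since they cover only the stored weights.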

In [None]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## 2. Test the Model with Zero-Shot Inferencing

Test the model with zero-shot inference.

In [None]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')

----------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
-------------------------------------------------------

You can see that the model struggles to summarize the dialogue compared to the baseline summary, and simply repeats the first sentence from the dialogue.

## 3. Perform Full Fine-Tuning

### 3.1 Preprocess the Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation.`, and to the start of the summary with `Summary:` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.
Alice: This is her part of the conversation.
Bob: This is his part of the conversation.    
Summary:
```

Training response (summary):
```
Both Alice and Bob participated in the conversation.
```
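The template above can be sketched as a small helper. The function name `build_prompt` is hypothetical and not part of the exercise; it only illustrates how the instruction, the dialogue, and the `Summary:` marker are joined.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the instruction template described above."""
    return f"Summarize the following conversation.\n{dialogue}\nSummary:"


# Toy dialogue taken from the template example
example_dialogue = (
    "Alice: This is her part of the conversation.\n"
    "Bob: This is his part of the conversation."
)

print(build_prompt(example_dialogue))
```

Using the same wording at training time and at inference time keeps the fine-tuned model's inputs consistent with what it saw during training.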

**Exercise**: Write a function to tokenize a batch of examples from the dialogue dataset. The function should concatenate the dialogues with the predefined prompt, tokenize them along with their summaries, and define the tokenized summaries as the labels.

In [None]:
def tokenize(examples):
    # Create prompts by formatting the dialogue from the examples dataset into the desired input format
    prompts = [f"Summarize the following conversation.\n{conversation}\nSummary:" for conversation in examples["dialogue"]]

    # Extract the summary labels corresponding to each dialogue
    summaries = [summary_text for summary_text in examples["summary"]]

    # Tokenize the prompts using the tokenizer
    tokenized_prompts = tokenizer(prompts, padding="max_length", truncation=True, return_tensors="pt", max_length=512)

    # Tokenize the summaries in a similar way, with a smaller max_length since summaries are shorter
    tokenized_summaries = tokenizer(summaries, padding="max_length", truncation=True, return_tensors="pt", max_length=128)

    # Extract tokenized inputs for the prompts
    prompt_input_ids = tokenized_prompts.input_ids  # Encoded IDs for the tokenized text
    prompt_attention_mask = tokenized_prompts.attention_mask  # Attention mask for padded positions

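    # Extract tokenized inputs for the summaries (these serve as labels for the model)
    summary_input_ids = tokenized_summaries.input_ids.clone()  # Encoded IDs for the tokenized summaries
    # Replace padding IDs with -100 so the loss ignores padded label positions
    summary_input_ids[summary_input_ids == tokenizer.pad_token_id] = -100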

    # Prepare the final dictionary to return
    model_inputs = {
        'input_ids': prompt_input_ids,
        'attention_mask': prompt_attention_mask,
        'labels': summary_input_ids
    }

    # Return the processed inputs for model training or evaluation
    return model_inputs


In [None]:
tokenized_dataset = dataset.map(tokenize, batched=True)


### 3.2 Fine-Tune the Model

**Exercise**: Utilize the Hugging Face Trainer API for training the model on the preprocessed dataset. Define the training arguments, a data collator, and create a `Seq2SeqTrainer` instance. Train the model for one epoch.

In [None]:
# Define training arguments for the Seq2Seq model
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",  # Directory to save all outputs, such as checkpoints and logs
    save_total_limit=2,  # Limit the number of checkpoints to save (to avoid using excessive disk space)
    learning_rate=5e-5,  # Set the learning rate for the optimizer
    per_device_train_batch_size=4,  # Batch size for training on each GPU/device
    per_device_eval_batch_size=4,  # Batch size for evaluation on each GPU/device
    gradient_accumulation_steps=4,  # Steps to accumulate gradients before updating parameters (effective batch size)
    optim="adamw_torch",  # Optimizer to use for training; 'adamw_torch' is an efficient AdamW implementation
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
    num_train_epochs=1  # Number of epochs to train the model
)


In [None]:
# Define data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=original_model)

# Define the trainer for sequence-to-sequence training
trainer = Seq2SeqTrainer(
    model=original_model,  # The model to be trained (in this case, the original model)
    args=training_args,  # Training arguments (e.g., batch size, learning rate, number of epochs, etc.)
    data_collator=data_collator,  # Function to collate examples into batches (pads inputs and labels)
    train_dataset=tokenized_dataset["train"],  # The training dataset, preprocessed and tokenized
    eval_dataset=tokenized_dataset["validation"],  # The evaluation/validation dataset, tokenized
    tokenizer=tokenizer,  # Tokenizer used for preprocessing input and decoding outputs
)


Training a fully fine-tuned version of the model should take about 10 minutes on a Google Colab GPU machine.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
0,No log,23.023438


TrainOutput(global_step=59, training_loss=26.831302966101696, metrics={'train_runtime': 2414.0846, 'train_samples_per_second': 5.161, 'train_steps_per_second': 0.024, 'total_flos': 8403342229241856.0, 'train_loss': 26.831302966101696, 'epoch': 0.98})

Save the model to a local folder:

In [None]:
model_path = "/content/drive/MyDrive/Fall'24/LLM/Assignment 5/flan-t5-base_dialogsum"

original_model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/Colab Notebooks/flan-t5-base-dialogsum-checkpoint/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/flan-t5-base-dialogsum-checkpoint/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/flan-t5-base-dialogsum-checkpoint/spiece.model',
 '/content/drive/MyDrive/Colab Notebooks/flan-t5-base-dialogsum-checkpoint/added_tokens.json',
 '/content/drive/MyDrive/Colab Notebooks/flan-t5-base-dialogsum-checkpoint/tokenizer.json')

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [None]:
# Load the fine-tuned instruction model for sequence-to-sequence tasks
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    "/content/drive/MyDrive/Fall'24/LLM/Assignment 5/flan-t5-base_dialogsum",  # Path to the fine-tuned model
    torch_dtype=torch.bfloat16  # Use bfloat16 for reduced memory usage and faster computation on supported hardware
)


Reload the original Flan-T5-base model:

In [None]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 3.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Section 2, using the original model and the fully fine-tuned model.

In [None]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')

# Generate summary using the original model
original_model_output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(original_model_output, skip_special_tokens=True)

# Generate summary using the fine-tuned model
instruct_output = instruct_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
instruct_summary = tokenizer.decode(instruct_output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'ORIGINAL MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')
print(dash_line)
print(f'INSTRUCT MODEL GENERATION - ZERO SHOT:\n{instruct_summary}\n')

----------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
-------------------------------------------------------

The fine-tuned model is able to create a much better summary of the dialogue compared to the original model.

### 3.4 Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the quality of summaries produced by models. It measures the n-gram overlap between a generated summary and a "baseline" (reference) summary, which is usually written by a human. While not perfect, it does indicate the overall improvement in summarization quality achieved by fine-tuning.
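To build intuition for what the metric measures, here is a deliberately simplified sketch of ROUGE-1 (unigram overlap) as an F1 score. It assumes lowercase whitespace tokenization and no stemming, so its numbers will generally differ from those of the `rouge` metric loaded with `evaluate`; the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())  # share of predicted words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(f"ROUGE-1 F1: {rouge1_f1(prediction, reference):.3f}")
```

ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L is based on the longest common subsequence instead of fixed-size n-grams.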

In [None]:
rouge = evaluate.load('rouge')

**Exercise**: Generate the outputs for a sample of the test set with the fine-tuned model (use only the first 10 dialogues and summaries to save time).

In [None]:
instruct_model_summaries = []  # Summaries generated by the fine-tuned model
original_model_summaries = []  # Summaries generated by the original model
human_baseline_summaries = []  # Human-written reference summaries

# Run inference on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def preprocess_and_generate(model, tokenizer, dialogue, device, max_length=512):
    """
    Preprocess dialogue, generate a summary using the model, and decode the output.
    """
    # Make sure the model is on the same device as its inputs
    model = model.to(device)
    # Preprocess the dialogue
    tokenized_input = tokenizer(dialogue, return_tensors="pt", max_length=max_length, truncation=True).to(device)
    # Generate the summary
    generated_output = model.generate(**tokenized_input)
    # Decode and return the summary
    return tokenizer.decode(generated_output[0], skip_special_tokens=True)


# Loop through the first 10 entries in the 'test' dataset
for i in range(10):
    # Extract the dialogue and the reference summary
    test_dialogue = dataset['test'][i]['dialogue']
    reference_summary = dataset['test'][i]['summary']

    # Generate summaries using the instruct_model and original_model
    instruct_summary = preprocess_and_generate(instruct_model, tokenizer, test_dialogue, device)
    original_summary = preprocess_and_generate(original_model, tokenizer, test_dialogue, device)

    # Append the summaries and reference to the respective lists
    instruct_model_summaries.append(instruct_summary)  # Fine-tuned model summary
    original_model_summaries.append(original_summary)  # Original model summary
    human_baseline_summaries.append(reference_summary)  # Human-written summary



Evaluate the models computing ROUGE metrics:

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)]
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)]
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.22868575868575866, 'rouge2': 0.08206617894882928, 'rougeL': 0.2006298146298146, 'rougeLsum': 0.20396599696599696}
INSTRUCT MODEL:
{'rouge1': 0.2494131054131054, 'rouge2': 0.09044485418029699, 'rougeL': 0.21345033893309756, 'rougeLsum': 0.21594925827684447}


The results show an improvement in all ROUGE metrics on this small sample:

In [None]:
print("Absolute percentage improvement of the instruct model over the original model:")

for key in instruct_model_results:
    improvement = instruct_model_results[key] - original_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

Absolute percentage improvement of the instruct model over the original model:
rouge1: 2.01%
rouge2: 0.89%
rougeL: 1.30%
rougeLsum: 1.29%


## 4. Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** instead of "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results as you will see soon.

One of the most popular PEFT methods is **Low-Rank Adaptation (LoRA)**, which introduces low-rank matrices to adapt the LLM with minimal additional parameters. When someone says PEFT, they typically mean LoRA. After fine-tuning for a specific task with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

At inference time, the LoRA adapter is reunited and combined with its original LLM to serve the inference request. The benefit is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
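The idea of combining an adapter with its frozen base weights can be illustrated with a toy example in plain Python. This is only a sketch under stated assumptions: the matrices are tiny made-up integers, the helper names are hypothetical, and the update follows the usual LoRA formulation `W + (alpha / r) * B @ A`. It shows that applying the adapter alongside the frozen weight gives the same output as merging the adapter into the weight first.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return the merged weight W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy values (assumptions for illustration only)
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen 3x3 base weight
B = [[1], [0], [2]]                    # 3x1 adapter matrix
A = [[1, 2, 0]]                        # 1x3 adapter matrix
rank, alpha = 1, 1
x = [1, 1, 1]

# Path 1: keep the base weight frozen and add the adapter's contribution
scale = alpha / rank
adapter_out = [wx + scale * bax
               for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Path 2: merge the adapter into the weight, then apply it once
merged_out = matvec(merge_lora(W, B, A, alpha, rank), x)

print(adapter_out, merged_out)
```

Because only `B` and `A` differ between tasks, several adapters can share one copy of `W`.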

### 4.1 Set Up the LoRA Model for Fine-Tuning

You first need to define the configuration of the LoRA model. Have a look at the configuration below. The key configuration element to adjust is the rank (`r`) of the adapter, which influences its capacity and complexity. Experiment with various ranks, such as 8, 16, or 32, and see how they affect the results.

In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=32,
    lora_dropout=0.1
)

Add LoRA adapter layers/parameters to the original LLM to be trained:

In [None]:
peft_model = get_peft_model(original_model, lora_config)

The number of trainable model parameters in the LoRA model is:

In [None]:
peft_model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 251,116,800 || trainable%: 1.4092820552029972
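The trainable-parameter count above can be reproduced with a little arithmetic, which also shows how it scales with the rank. The sketch assumes that Flan-T5-base has a hidden size of 768 with 12 encoder and 12 decoder layers, and that LoRA is applied to the query and value projections of every attention block (the default target modules are not set explicitly in the configuration above, so treat this as an assumption).

```python
D_MODEL = 768               # assumed hidden size of Flan-T5-base
ENCODER_LAYERS = 12         # assumed number of encoder layers
DECODER_LAYERS = 12         # assumed number of decoder layers
TARGETS_PER_ATTENTION = 2   # assumed: query and value projections are adapted
BASE_PARAMS = 251_116_800 - 3_538_944  # frozen parameters, from the output above


def lora_param_count(rank):
    """Count LoRA parameters: each adapted d x d matrix gains rank * (d + d)."""
    # Encoder layers have self-attention; decoder layers add cross-attention
    attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
    adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION
    return adapted_matrices * rank * (D_MODEL + D_MODEL)


for rank in (8, 16, 32):
    trainable = lora_param_count(rank)
    share = 100 * trainable / (BASE_PARAMS + trainable)
    print(f"r={rank}: {trainable:,} trainable params ({share:.2f}% of all params)")
```

Under these assumptions the adapter size grows linearly with `r`, so halving the rank halves the number of trainable parameters.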


### 4.2 Train the LoRA Adapter

**Exercise**: Define training arguments and create a `Seq2SeqTrainer` instance for the LoRA model. Use a higher learning rate than full fine-tuning (e.g., `1e-3`).

In [None]:
# Define training arguments for the LoRA model
peft_training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    save_total_limit=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    evaluation_strategy="epoch",
    num_train_epochs=1,
    optim="adamw_torch",
    learning_rate=1e-3
)


In [None]:
# Reuse the data collator defined for full fine-tuning
# Define trainer
peft_trainer = Seq2SeqTrainer(
    model=peft_model,  # Train the LoRA-wrapped model rather than the bare base model
    args=peft_training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
)


Train the PEFT adapter. Training should take about 6 minutes on a Google Colab GPU machine.

In [None]:
peft_trainer.train()

Epoch,Training Loss,Validation Loss
0,No log,3.183594


TrainOutput(global_step=59, training_loss=8.86670749470339, metrics={'train_runtime': 1970.8249, 'train_samples_per_second': 6.322, 'train_steps_per_second': 0.03, 'total_flos': 8536758945841152.0, 'train_loss': 8.86670749470339, 'epoch': 0.98})

Save the model to a local folder:

In [None]:
peft_model.save_pretrained("/content/drive/MyDrive/Fall'24/LLM/Assignment 5/flan-t5-base_dialogsumlora")

Load the PEFT model:

In [None]:
peft_model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Fall'24/LLM/Assignment 5/flan-t5-base_dialogsumlora")
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

Reload the original Flan-T5-base model:

In [None]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 4.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Sections 2 and 3, using the original model, the fully fine-tuned model and the PEFT model.

In [None]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')

# Generate summary using the original model
original_model_output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(original_model_output, skip_special_tokens=True)

# Generate summary using the PEFT model
peft_output = peft_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
peft_summary = tokenizer.decode(peft_output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'ORIGINAL MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')
print(dash_line)
print(f'PEFT MODEL GENERATION - ZERO SHOT:\n{peft_summary}\n')


----------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# wants to adjust #Person1#'s life and #Person2# suggests #Person1# be positive and stay healthy.
-------------------------------------------------------

### 4.4 Evaluate the Model Quantitatively (with ROUGE Metric)

**Exercise**: Generate the outputs for a sample of the test set with the PEFT model (use only the first 10 dialogues and summaries to save time).

In [None]:
peft_model_summaries = []  # Summaries generated by the PEFT model
original_model_summaries = []  # Summaries generated by the original model
human_baseline_summaries = []  # Human-written reference summaries

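# Run inference on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def preprocess_and_generate_summary(model, tokenizer, dialogue, device, max_length=512):
    """Preprocess dialogue, generate a summary using the model, and decode the output."""
    # Make sure the model is on the same device as its inputs
    model = model.to(device)
    # Preprocess the dialogue
    tokenized_input = tokenizer(dialogue, return_tensors="pt", max_length=max_length, truncation=True).to(device)
    # Generate the summary
    generated_output = model.generate(**tokenized_input)
    # Decode and return the summary
    return tokenizer.decode(generated_output[0], skip_special_tokens=True)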

# Loop through the first 10 entries in the 'test' dataset
for i in range(10):
    # Extract dialogue and reference summary
    test_dialogue = dataset['test'][i]['dialogue']
    reference_summary = dataset['test'][i]['summary']

    # Generate summaries using the PEFT model and the original model
    peft_summary = preprocess_and_generate_summary(peft_model, tokenizer, test_dialogue, device)
    original_summary = preprocess_and_generate_summary(original_model, tokenizer, test_dialogue, device)

    # Append the generated summaries and reference summary to respective lists
    peft_model_summaries.append(peft_summary)  # PEFT model summary
    original_model_summaries.append(original_summary)  # Original model summary
    human_baseline_summaries.append(reference_summary)  # Human-written summary


Compute ROUGE score for this subset of the data.

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.22868575868575866, 'rouge2': 0.08206617894882928, 'rougeL': 0.2006298146298146, 'rougeLsum': 0.20396599696599696}
INSTRUCT MODEL:
{'rouge1': 0.2494131054131054, 'rouge2': 0.09044485418029699, 'rougeL': 0.21345033893309756, 'rougeLsum': 0.21594925827684447}
PEFT MODEL:
{'rouge1': 0.2257931936299649, 'rouge2': 0.027705627705627706, 'rougeL': 0.16692994751073992, 'rougeLsum': 0.16711118615039638}


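Notice that the PEFT model stays close to the other two models on ROUGE-1 but scores lower on the remaining metrics for this small sample, while its training updated only about 1.4% of the parameters.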

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of the PEFT model over the original model:")

for key in peft_model_results:
    improvement = peft_model_results[key] - original_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

Absolute percentage improvement of the PEFT model over the original model:
rouge1: -0.29%
rouge2: -5.44%
rougeL: -3.37%
rougeLsum: -3.69%
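The figures above are differences in percentage points, which is not the same as a relative change. The sketch below contrasts the two, using the ORIGINAL MODEL and PEFT MODEL scores printed earlier; the helper name `compare_scores` is hypothetical.

```python
# Scores copied from the ROUGE output printed above
original_scores = {'rouge1': 0.22868575868575866, 'rouge2': 0.08206617894882928,
                   'rougeL': 0.2006298146298146, 'rougeLsum': 0.20396599696599696}
peft_scores = {'rouge1': 0.2257931936299649, 'rouge2': 0.027705627705627706,
               'rougeL': 0.16692994751073992, 'rougeLsum': 0.16711118615039638}


def compare_scores(baseline, candidate):
    """Return {metric: (difference in percentage points, relative change in %)}."""
    comparison = {}
    for key in baseline:
        difference = candidate[key] - baseline[key]
        comparison[key] = (difference * 100, difference / baseline[key] * 100)
    return comparison


for key, (points, relative) in compare_scores(original_scores, peft_scores).items():
    print(f"{key}: {points:+.2f} points, {relative:+.1f}% relative")
```

A drop that looks small in percentage points can be a large relative change when the baseline score is itself small, as is the case for ROUGE-2 here.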


Now calculate the improvement of PEFT over a full fine-tuned model:

#### Note: The code does not apply `abs()`, so the printed values are signed differences in percentage points rather than absolute values; a negative value means the PEFT model scored lower.

In [None]:
print("Absolute percentage improvement of the PEFT model over the instruct model:")

for key in peft_model_results:
    improvement = peft_model_results[key] - instruct_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

Absolute percentage improvement of the PEFT model over the instruct model:
rouge1: -2.33%
rouge2: -6.24%
rougeL: -4.69%
rougeLsum: -4.99%


You can see a decrease of a few percentage points in the ROUGE metrics compared with the fully fine-tuned model. However, PEFT training requires far less compute and memory.