# Fine-Tuning LLMs

In this exercise, you will fine-tune the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model for enhanced dialogue summarization. You will first explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter-Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its slightly lower performance metrics.

## 1. Set up Dependencies and Load Dataset and LLM

In [None]:
!pip install datasets evaluate rouge_score peft -q

In [None]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from datasets import load_dataset

In [None]:
dataset = load_dataset('knkarthick/dialogsum')

Load the pre-trained [Flan-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of Flan-T5. Setting `torch_dtype=torch.bfloat16` loads the weights in `bfloat16`, which reduces GPU memory usage because each value takes half as much memory as in `float32`, the default precision for most models.
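
To make the memory saving concrete, here is a back-of-the-envelope sketch. The parameter count is an assumption (roughly 248 million for Flan-T5-base), not a value read from the loaded model, and the estimate covers the weights only, not activations, gradients, or optimizer state.

```python
# Rough weight-memory estimate. APPROX_PARAMS is an assumed, approximate size
# for Flan-T5-base, not a number taken from the loaded model.
APPROX_PARAMS = 248_000_000

BYTES_PER_VALUE = {"float32": 4, "bfloat16": 2}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e9

for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {weight_memory_gb(APPROX_PARAMS, dtype):.2f} GB")
```

Under this assumption the `bfloat16` weights need half the memory of the `float32` weights, whatever the exact parameter count turns out to be.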

In [None]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## 2. Test the Model with Zero-Shot Inferencing

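Prompt the pre-trained model with a single dialogue and no examples of the task (zero-shot inference), then compare its output with the human-written summary.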

In [None]:
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

print(dash_line)
print(f'INPUT PROMPT:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{original_model_summary}\n')

You can see that the model struggles to summarize the dialogue compared to the baseline summary, and simply repeats the first sentence from the dialogue. 

## 3. Perform Full Fine-Tuning

### 3.1 Preprocess the Dataset

You need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the dialogue and append `Summary:` after it, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.
Alice: This is her part of the conversation.
Bob: This is his part of the conversation.    
Summary:
```

Training response (summary):
```
Both Alice and Bob participated in the conversation.
```

**Exercise**: Write a function to tokenize a batch of examples from the dialogue dataset. The function should concatenate the dialogues with the predefined prompt, tokenize them along with their summaries, and define the tokenized summaries as the labels.

In [None]:
def tokenize(examples):
    ### WRITE YOUR CODE HERE

    # Build the input prompts by wrapping each dialogue in the instruction template
    inputs = [
        f"Summarize the following conversation.\n{dialogue}\nSummary:"
        for dialogue in examples['dialogue']
    ]

    # Tokenize the input prompts.
    # truncation=True cuts sequences that exceed the tokenizer's maximum length.
    # No padding is applied here: the data collator pads each training batch
    # to its longest sequence later on.
    model_inputs = tokenizer(inputs, truncation=True)

    # Tokenize the target summaries (labels).
    # Leaving them unpadded lets the data collator pad the labels with -100,
    # which the loss function ignores.
    targets = tokenizer(examples['summary'], truncation=True)

    # The labels are the target token IDs that we want the model to learn to
    # generate. They represent the "correct answers" during training.
    model_inputs['labels'] = targets['input_ids']

    return model_inputs

In [None]:
## Apply preprocessing to entire dataset
tokenized_dataset = dataset.map(tokenize, batched=True)
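
Label sequences in a batch usually differ in length, so they have to be padded, and the padded positions should not count toward the loss. `DataCollatorForSeq2Seq` pads labels with `-100` by default, a value that PyTorch's cross-entropy loss ignores. The following is a plain-Python sketch of that idea, not the library's implementation; `pad_labels` and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_labels(label_batch, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [ignore_index] * (max_len - len(labels)) for labels in label_batch]

# Hypothetical token IDs for two summaries of different lengths
batch = [[71, 11, 1], [37, 1]]
padded = pad_labels(batch)
print(padded)  # [[71, 11, 1], [37, 1, -100]]
```

If the labels were instead padded with the tokenizer's regular padding token, the model would also be trained to predict padding, which is usually not what you want.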

### 3.2 Fine-Tune the Model

**Exercise**: Use the Hugging Face Trainer API to train the model on the preprocessed dataset. Define the training arguments, a data collator, and create a `Seq2SeqTrainer` instance. Train the model for one epoch.

In [None]:
### WRITE YOUR CODE HERE

training_args = Seq2SeqTrainingArguments(
    output_dir='./flan-t5-base-dialogsum',          # Directory to save model checkpoints
    num_train_epochs=1,                             # Train for 1 epoch as specified
    per_device_train_batch_size=8,                  # Batch size per device
    per_device_eval_batch_size=8,                   # Evaluation batch size
    warmup_steps=500,                               # Number of warmup steps
    weight_decay=0.01,                              # Weight decay: regularization that penalizes large weights to reduce overfitting
    logging_dir='./logs',                           # Directory for storing logs
    logging_steps=100,                              # Log every 100 steps
    eval_strategy="epoch",                          # Evaluate at the end of each epoch
    save_strategy="epoch",                          # Save checkpoint at the end of each epoch
    predict_with_generate=True,                     # Use generate for evaluation
    bf16=True,                                      # Use BFloat16 mixed precision training (compatible with bfloat16 model)
    fp16=False,                                     # Disable FP16 since we're using BF16
    push_to_hub=False,                              # Don't push to hub
    report_to=[],                                   # Do not report to any experiment tracker
)

# Create the data collator.
# It batches examples for sequence-to-sequence training and pads the inputs,
# attention masks, and labels of each batch to a common length.
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=original_model,
    padding=True,
    return_tensors='pt'
)

# create trainer
trainer = Seq2SeqTrainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
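
The `warmup_steps=500` argument interacts with the Trainer's default `linear` learning-rate scheduler: the rate climbs linearly from zero during warmup and then decays linearly back to zero by the last step. The sketch below mirrors that shape in plain Python. `linear_schedule_multiplier` is a hypothetical helper, and `TOTAL_STEPS` is an illustrative assumption rather than the actual number of steps in this run.

```python
def linear_schedule_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to 1
        return step / max(1, warmup_steps)
    # Decay: ramp from 1 down to 0 at the final step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5       # default learning rate of the training arguments
WARMUP_STEPS = 500   # same value as warmup_steps above
TOTAL_STEPS = 1500   # hypothetical number of optimizer steps

for step in (0, 250, 500, 1000, 1500):
    lr = BASE_LR * linear_schedule_multiplier(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With only one epoch of training, 500 warmup steps can be a sizeable share of the whole run, so the model spends a noticeable part of training below the peak learning rate.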




Training a fully fine-tuned version of the model should take about 10 minutes on a Google Colab GPU machine.

In [None]:
trainer.train()

Save the model to a local folder:

In [None]:
model_path = './flan-t5-base-dialogsum-checkpoint'

original_model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [None]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained('./flan-t5-base-dialogsum-checkpoint', 
                                                       torch_dtype=torch.bfloat16)

Reload the original Flan-T5-base model:

In [None]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 3.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Section 2, using the original model and the fully fine-tuned model.

In [None]:
### WRITE YOUR CODE HERE
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

# Create the prompt
prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"

# Generate summary with original model
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

# Generate summary with fine-tuned (instruct) model
output = instruct_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
instruct_model_summary = tokenizer.decode(output, skip_special_tokens=True)

# Display results for comparison
print(dash_line)
print(f'INPUT DIALOGUE:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'ORIGINAL MODEL SUMMARY:\n{original_model_summary}')
print(dash_line)
print(f'FINE-TUNED MODEL SUMMARY:\n{instruct_model_summary}')



The fine-tuned model should produce a summary that is much closer to the human baseline than the one from the original model.

### 3.4 Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the quality of summaries produced by models. It measures how much a generated summary overlaps with a "baseline" (reference) summary, which is usually written by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
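
To give a feel for what ROUGE measures, here is a simplified sketch of ROUGE-1, the unigram-overlap variant. It is an illustration only: it assumes lowercase whitespace tokenization and omits the normalization and stemming options of real implementations, so its numbers will not match the `evaluate` library exactly. The function name `rouge1` is hypothetical.

```python
from collections import Counter

def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each shared word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(1, sum(pred_counts.values()))
    recall = overlap / max(1, sum(ref_counts.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy example every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5), which is why a very short summary can still score poorly overall.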

In [None]:
rouge = evaluate.load('rouge')

**Exercise**: Generate the outputs for a sample of the test set with the original model and the fine-tuned model (use only the first 10 dialogues and summaries to save time).

In [None]:
### WRITE YOUR CODE HERE
# Generate summaries for the first 10 test examples
dialogues = dataset['test']['dialogue'][:10]
human_baseline_summaries = dataset['test']['summary'][:10]

# Generate summaries with original model
original_model_summaries = []
for dialogue in dialogues:
    prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
    inputs = tokenizer(prompt, return_tensors='pt')  # Convert the prompt text to token IDs
    output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]  # Generate output token IDs from the input IDs
    summary = tokenizer.decode(output, skip_special_tokens=True)  # Convert the generated token IDs back to text
    original_model_summaries.append(summary)

# Generate summaries with fine-tuned (instruct) model
instruct_model_summaries = []
for dialogue in dialogues:
    prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
    inputs = tokenizer(prompt, return_tensors='pt')
    output = instruct_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
    summary = tokenizer.decode(output, skip_special_tokens=True)
    instruct_model_summaries.append(summary)





Evaluate both models by computing ROUGE metrics:

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)]
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)]
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

The results show substantial improvement in all ROUGE metrics:

In [None]:
print("Absolute percentage improvement of the instruct model over the original model:")

for key in instruct_model_results:
    improvement = instruct_model_results[key] - original_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')
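
The loop above reports the difference between two scores multiplied by 100, which is a gain in percentage points, not a relative gain. The sketch below contrasts the two readings. The scores are hypothetical and are not results from this notebook.

```python
def absolute_gain_points(before: float, after: float) -> float:
    """Difference between two scores, expressed in percentage points."""
    return (after - before) * 100

def relative_gain_percent(before: float, after: float) -> float:
    """Difference between two scores, relative to the starting score."""
    return (after - before) / before * 100

# Hypothetical ROUGE-1 scores, for illustration only
baseline_score = 0.20
tuned_score = 0.30

print(f"absolute: {absolute_gain_points(baseline_score, tuned_score):.1f} points")
print(f"relative: {relative_gain_percent(baseline_score, tuned_score):.1f}%")
```

The same pair of scores reads as a 10-point gain or a 50% relative gain, so it helps to state which one you report.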

## 4. Perform Parameter-Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter-Efficient Fine-Tuning (PEFT)** instead of the full fine-tuning you did above. PEFT trains only a small number of parameters, which makes it much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

One of the most popular PEFT methods is **Low-Rank Adaptation (LoRA)**, which introduces low-rank matrices to adapt the LLM with minimal additional parameters. When someone says PEFT, they typically mean LoRA. After fine-tuning for a specific task with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

At inference time, the LoRA adapter is combined with its original LLM to serve the inference request. The benefit is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.

### 4.1 Set Up the LoRA Model for Fine-Tuning

You first need to define the configuration of the LoRA model. Have a look at the configuration below. The key configuration element to adjust is the rank (`r`) of the adapter, which influences its capacity and complexity. Experiment with various ranks, such as 8, 16, or 32, and see how they affect the results.

In [None]:
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=32,
    lora_dropout=0.1  # Dropout applied to the LoRA layers as regularization against overfitting
)

# LoRA decomposes the weight update into two small matrices, A and B, whose
# product BA is a low-rank approximation of the full weight update matrix.
# Adapter layers: instead of modifying the original model weights W, LoRA adds
# a trainable low-rank term and computes with W + (lora_alpha / r) * BA, where:
#   W       = original weights, kept frozen
#   B and A = small trainable matrices
#   r       = the rank, which determines the size of A and B
# Higher rank = more trainable parameters = more capacity, but also more memory.
# lora_alpha=32 is the scaling factor that controls how strongly the LoRA update affects the output:
#   Formula: scaling = lora_alpha / r
#   With the values above: scaling = 32 / 32 = 1.0
#   The product BA is therefore added to W without any rescaling.
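
The computation described in the comments above can be sketched in plain Python with tiny matrices. This is an illustration of the idea rather than PEFT's implementation; `matvec`, `lora_forward`, and all the values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) with W frozen and A, B trainable."""
    scaling = lora_alpha / r
    base = matvec(W, x)                # output of the frozen layer
    update = matvec(B, matvec(A, x))   # low-rank path: project down to r, then back up
    return [b + scaling * u for b, u in zip(base, update)]

# Hypothetical 3x3 layer with a rank-1 adapter
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0]]        # shape r x d_in  = 1 x 3
B = [[0.5], [0.0], [0.0]]    # shape d_out x r = 3 x 1
x = [1.0, 2.0, 3.0]

adapted = lora_forward(W, A, B, x, lora_alpha=1, r=1)
print(adapted)  # [4.0, 2.0, 3.0]
```

When `B` is all zeros, the low-rank path contributes nothing and the output equals that of the frozen layer, which is why a freshly initialized adapter (with `B` starting at zero) leaves the model's behavior unchanged.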

Add LoRA adapter layers/parameters to the original LLM to be trained:

In [None]:
peft_model = get_peft_model(original_model, lora_config)

The number of trainable model parameters in the LoRA model is:

In [None]:
peft_model.print_trainable_parameters()
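
As a rough cross-check on the printed figure, the adapter size can be estimated by hand. The sketch below rests on assumptions that are not read from the model: each adapted projection in Flan-T5-base is taken to be 768 x 768, and LoRA is assumed to target the query and value projections of 36 attention blocks (12 encoder self-attention, 12 decoder self-attention, and 12 decoder cross-attention).

```python
# All sizes are assumptions about Flan-T5-base and the LoRA target modules,
# not values read from the loaded model.
D_MODEL = 768           # assumed input and output size of each adapted projection
RANK = 32               # the rank r from the LoRA configuration
ADAPTED_MATRICES = 72   # assumed: query and value projections in 36 attention blocks

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

per_matrix_full = D_MODEL * D_MODEL
per_matrix_lora = lora_param_count(D_MODEL, D_MODEL, RANK)
total_lora = per_matrix_lora * ADAPTED_MATRICES

print(f"full update per matrix: {per_matrix_full:,}")
print(f"LoRA update per matrix: {per_matrix_lora:,}")
print(f"total LoRA parameters:  {total_lora:,}")
```

If these assumptions hold, the total should land close to the trainable-parameter count reported by `print_trainable_parameters()`. Doubling the rank doubles the adapter size, since the count grows linearly in `r`.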

### 4.2 Train the LoRA Adapter

**Exercise**: Define training arguments and create a `Seq2SeqTrainer` instance for the LoRA model. Use a higher learning rate than full fine-tuning (e.g., `1e-3`).

In [None]:
### WRITE YOUR CODE HERE
# Define training arguments for PEFT with higher learning rate
peft_training_args = Seq2SeqTrainingArguments(
    output_dir='./flan-t5-base-dialogsum-lora',     # Directory to save model checkpoints
    num_train_epochs=1,                             # Train for 1 epoch
    per_device_train_batch_size=8,                  # Batch size per device
    per_device_eval_batch_size=8,                   # Evaluation batch size
    warmup_steps=500,                               # Number of warmup steps
    weight_decay=0.01,                              # Weight decay for regularization
    logging_dir='./logs-peft',                      # Directory for storing logs
    logging_steps=100,                              # Log every 100 steps
    learning_rate=1e-3,                             # Higher learning rate for PEFT (1e-3 vs default ~5e-5)
    eval_strategy="epoch",                          # Evaluate at the end of each epoch
    save_strategy="epoch",                          # Save checkpoint at the end of each epoch
    predict_with_generate=True,                     # Use generate for evaluation
    bf16=True,                                      # Use BFloat16 mixed precision training
    fp16=False,                                     # Disable FP16 since we're using BF16
    push_to_hub=False,                              # Don't push to hub
    report_to=[],                                   # Do not report to any experiment tracker
)

# Create data collator for PEFT
peft_data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=peft_model,
    padding=True,
    return_tensors='pt'
)

# create PEFT trainer
peft_trainer = Seq2SeqTrainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    data_collator=peft_data_collator,
)


Train the PEFT adapter. Training should take about 6 minutes on a Google Colab GPU machine.

In [None]:
peft_trainer.train()

Save the trained LoRA adapter to a local folder:

In [None]:
peft_model.save_pretrained('./flan-t5-base-dialogsum-lora')

Load the PEFT model:

In [None]:
from peft import AutoPeftModelForSeq2SeqLM

# Load the base model named in the adapter config and attach the saved LoRA adapter
peft_model = AutoPeftModelForSeq2SeqLM.from_pretrained('./flan-t5-base-dialogsum-lora',
                                                       torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

Reload the original Flan-T5-base model:

In [None]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)

### 4.3 Evaluate the Model Qualitatively (Human Evaluation)

**Exercise**: Make inferences for the same example as in Sections 2 and 3, using the original model, the fully fine-tuned model and the PEFT model.

In [None]:
### WRITE YOUR CODE HERE
index = 42
dash_line = '-' * 100

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']


# create the prompt
prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"

# Generate summary with original model
inputs = tokenizer(prompt, return_tensors='pt')
output = original_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
original_model_summary = tokenizer.decode(output, skip_special_tokens=True)

# Generate summary with fine-tuned (instruct) model
output = instruct_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
instruct_model_summary = tokenizer.decode(output, skip_special_tokens=True)

# generate summary with PEFT model
output = peft_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
peft_model_summary = tokenizer.decode(output, skip_special_tokens=True)

print(f'INPUT DIALOGUE:\n{dialogue}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}')
print(dash_line)
print(f'ORIGINAL MODEL SUMMARY:\n{original_model_summary}')
print(dash_line)
print(f'FINE-TUNED MODEL SUMMARY:\n{instruct_model_summary}')
print(dash_line)
print(f'PEFT MODEL SUMMARY:\n{peft_model_summary}')


### 4.4 Evaluate the Model Quantitatively (with ROUGE Metric)

**Exercise**: Generate the outputs for a sample of the test set with the PEFT model (use only the first 10 dialogues and summaries to save time).

In [None]:
# Generate summaries for the first 10 test examples with PEFT model
dialogues = dataset['test']['dialogue'][:10]
human_baseline_summaries = dataset['test']['summary'][:10]

# Generate summaries with PEFT model
peft_model_summaries = []
for dialogue in dialogues:
    prompt = f"Summarize the following conversation.\n{dialogue}\nSummary:\n"
    inputs = tokenizer(prompt, return_tensors='pt')
    output = peft_model.generate(inputs['input_ids'], max_new_tokens=50)[0]
    summary = tokenizer.decode(output, skip_special_tokens=True)
    peft_model_summaries.append(summary)

print(f"Generated {len(peft_model_summaries)} summaries with PEFT model")



Compute ROUGE score for this subset of the data.

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

Notice that the PEFT model results hold up well, even though its training process was much lighter!

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of the PEFT model over the original model:")

for key in peft_model_results:
    improvement = peft_model_results[key] - original_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

Now calculate the improvement of PEFT over the fully fine-tuned model:

In [None]:
print("Absolute percentage improvement of the PEFT model over the instruct model:")

for key in peft_model_results:
    improvement = peft_model_results[key] - instruct_model_results[key]
    print(f'{key}: {improvement*100:.2f}%')

You can see a small decrease in the ROUGE metrics compared with the fully fine-tuned model. However, PEFT training requires far fewer compute and memory resources.