# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the FLAN-T5 model, which provides a high-quality instruction-tuned model and can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

### 1.1 - Set up Kernel and Required Dependencies

Colab

In [1]:
# %pip install --upgrade pip
# %pip install \
#     torch==1.13.1 \
#     torchdata==0.5.1
# %pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0 \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0


`evaluate` and `rouge_score` are used to compute ROUGE scores.

`loralib` and `peft` are used for parameter-efficient fine-tuning.

In [2]:
%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0

Collecting transformers==4.27.2
  Downloading transformers-4.27.2-py3-none-any.whl (6.8 MB)
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting evaluate==0.4.0
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting loralib==0.1.1
  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)
Collecting peft==0.3.0
  Downloading peft-0.3.0-py3-none-any.whl (56 kB)

In [46]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

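# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# AutoModelForSeq2SeqLM provides access to the FLAN-T5 model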

### 1.2 - Load Dataset and LLM

You are going to continue experimenting with the DialogSum Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [47]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset




DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

Load the pre-trained FLAN-T5 model and its tokenizer directly from Hugging Face. Notice that you will be using the base version of FLAN-T5 (`google/flan-t5-base`). Setting `torch_dtype=torch.bfloat16` specifies the data type used to store the model's weights.

In [48]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
# original_model is used to compare with all different fine-tuning strategies
tokenizer = AutoTokenizer.from_pretrained(model_name)
original_model = original_model.to(device)

You can count the model's parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [49]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    print(f"trainable model parameters: {trainable_model_params} \nall model parameter: {all_model_params} \npercentage of trainable model parameters: {trainable_model_params/all_model_params * 100}%")

print_number_of_trainable_model_parameters(original_model)


trainable model parameters: 247577856 
all model parameter: 247577856 
percentage of trainable model parameters: 100.0%
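Because the model was loaded with `torch_dtype=torch.bfloat16`, the parameter count printed above gives a rough idea of how much memory the weights occupy. The sketch below is a back-of-the-envelope estimate only: it assumes 2 bytes per bfloat16 value and 4 bytes per float32 value, and it ignores activations, gradients, and optimizer state. The helper name `weight_memory_mib` is hypothetical.

```python
# Parameter count printed by print_number_of_trainable_model_parameters above.
NUM_PARAMS = 247_577_856

# Assumed storage sizes: bfloat16 uses 2 bytes per value, float32 uses 4.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.0f} MiB")
```

Under these assumptions, storing the weights in bfloat16 takes half the memory of float32.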


### 1.3 - Test the Model with Zero Shot Inferencing


Test the model with zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [51]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"].to(device),
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')



---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

## 2 - Perform Full Fine-Tuning

### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with `Summarize the following conversation` and to the start of the summary with `Summary` as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.

Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```
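The template above can be sketched as a small string helper. This is only an illustration: `build_prompt` is a hypothetical name, and the two constant strings mirror the `start_prompt` and `end_prompt` values used in this section's tokenization code.

```python
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary:"


def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the instruction template."""
    return START_PROMPT + dialogue + END_PROMPT


example_dialogue = (
    "Chris: This is his part of the conversation.\n"
    "Antje: This is her part of the conversation."
)
print(build_prompt(example_dialogue))
```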

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

Tokenize the dataset, wrapping each dialogue in a prompt:


In [52]:
def tokenize_function(examples):
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary:"
    prompt = [start_prompt + dialogue + end_prompt for dialogue in examples['dialogue']]
    examples['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors='pt').input_ids.to(device)
    examples['labels'] = tokenizer(examples['summary'], padding="max_length", truncation=True, return_tensors='pt').input_ids.to(device)

    return examples

# The dataset actually contains three different splits: train, validation and test
# The tokenize_function code is handling all data across all splits in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])
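The `padding="max_length"` and `truncation=True` arguments make every tokenized example the same length. The toy sketch below illustrates that idea on plain lists of integers. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the padding ID of 0 and the maximum length of 8 are assumptions chosen for the illustration, and special tokens such as end-of-sequence are ignored.

```python
PAD_TOKEN_ID = 0  # assumption: padding ID used for this illustration
MAX_LENGTH = 8    # toy value; the real tokenizer uses its own maximum length


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_TOKEN_ID):
    """Clip a sequence to max_length, then right-pad it up to max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


short_ids = [21, 22, 23]
long_ids = list(range(100, 112))

print(pad_or_truncate(short_ids))  # padded on the right
print(pad_or_truncate(long_ids))   # clipped to the first 8 IDs
```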


To save some time in the lab, you will subsample the dataset.

In [53]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)


Check the shapes of all three parts of the dataset:

In [54]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)


Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})


The output dataset is ready for fine-tuning.
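The row counts reported above follow directly from the `index % 100 == 0` filter, which keeps every hundredth example starting at index 0. A minimal check in plain Python, using the split sizes printed when the dataset was loaded (`rows_kept` is a hypothetical helper):

```python
# Split sizes reported by load_dataset earlier in this notebook.
split_sizes = {"train": 12460, "validation": 500, "test": 1500}


def rows_kept(num_rows: int, every: int = 100) -> int:
    """Count indices in range(num_rows) that satisfy index % every == 0."""
    return sum(1 for index in range(num_rows) if index % every == 0)


for split, size in split_sizes.items():
    print(f"{split}: {size} rows -> {rows_kept(size)} rows kept")
```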

### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with a reference to the original model. The other training parameters were found experimentally, and there is no need to go into detail about them at the moment.

In [55]:
output_dir = f'./dialogue-summary-training-checkpoint07051122'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=10  # caps training at 10 steps and overrides num_train_epochs
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)


In [56]:
trainer.train()



| Step | Training Loss |
|---|---|
| 1 | 48.5 |
| 2 | 48.75 |
| 3 | 50.75 |
| 4 | 48.0 |
| 5 | 50.5 |
| 6 | 49.25 |
| 7 | 50.25 |
| 8 | 49.5 |
| 9 | 50.25 |
| 10 | 48.75 |


TrainOutput(global_step=10, training_loss=49.45, metrics={'train_runtime': 1.9459, 'train_samples_per_second': 41.112, 'train_steps_per_second': 5.139, 'total_flos': 54780588195840.0, 'train_loss': 49.45, 'epoch': 0.62})
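The `'epoch': 0.62` value in the output above can be checked with a little arithmetic. The sketch below assumes a batch size of 8 on a single device (the `Trainer` default, and consistent with the reported `train_samples_per_second` divided by `train_steps_per_second`); the batch size is not set explicitly in this notebook, so treat it as an assumption.

```python
import math

num_train_examples = 125  # rows in the subsampled training split
batch_size = 8            # assumption: default per-device batch size, one device
max_steps = 10            # value passed to TrainingArguments

# One epoch is one full pass over the training split.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
epochs_completed = max_steps / steps_per_epoch

print(f"steps per epoch: {steps_per_epoch}")
print(f"epochs completed after {max_steps} steps: {epochs_completed:.3f}")
```

Under these assumptions the run covers well under one epoch, which is why `num_train_epochs=5` has no effect here.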

In [57]:
trainer.save_model(output_dir=output_dir)

Training a fully fine-tuned version of the model would take a few hours on a GPU.

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model, loading the checkpoint saved above.

In [59]:
%ls -sh ./dialogue-summary-training-checkpoint07051122/pytorch_model.bin

In [60]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./dialogue-summary-training-checkpoint07051122", torch_dtype=torch.bfloat16)
instruct_model = instruct_model.to(device)

In [44]:
# device = torch.device("cuda")
# instruct_model = instruct_model.to(device)
# original_model = original_model.to(device)


### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [62]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)  # input_ids must also be moved to the device with .to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_test_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_test_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}\n')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_test_output}\n')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_test_output}\n')




---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person2#: I'm not sure what exactly I'd need to upgrade my computer to make it more powerful and more usable. #Person1#: I'd like to add a CD-ROM drive.

---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1#: I'm thinking of upgrading my computer.



### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The ROUGE metric helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary, which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we accomplished by fine-tuning.
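To make the comparison concrete, here is a simplified sketch of the idea behind ROUGE-1: count the unigrams shared between a prediction and a reference, then combine precision and recall into an F1 score. It is not the `rouge_score` implementation used in this notebook; it assumes lowercased whitespace tokenization, no stemming, and a single reference, and `rouge1_f1` is a hypothetical name. The two sentences are toy inputs.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "masha and hero are getting divorced",
    "masha and hero are divorced",
)
print(f"ROUGE-1 F1 (simplified): {score:.3f}")
```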

In [63]:
rouge = evaluate.load('rouge')


Generate the outputs for a sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [64]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """

    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    original_model_test_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_test_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    instruct_model_test_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_test_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline', 'original_model', 'instruct_model'])
df


| | human_baseline | original_model | instruct_model |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | This memo is to go out as an intra-office memo... | #Person1#: I need to take a dictation for you. |
| 1 | In order to prevent employees from wasting tim... | Ms. Dawson, please let me dict your opinion. | #Person1#: I need to take a dictation for you. |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1: This should go out as an intra-offic... | #Person1#: I need to take a dictation for you. |
| 3 | #Person2# arrives late because of traffic jam.... | #Person1: I'm sorry I'm late. I'm sorry I'm la... | The traffic jam at the Carrefour intersection ... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | People are talking about the traffic jam in th... | The traffic jam at the Carrefour intersection ... |
| 5 | #Person2# complains to #Person1# about the tra... | The traffic in this city can be bad, but it's ... | The traffic jam at the Carrefour intersection ... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are divorced. | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Person1: Well, I heard that Masha and Hero ar... | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1: Thank you for coming to our birthday... | #Person1#: Happy birthday, Brian. #Person2#: T... |


Evaluate the models by computing ROUGE metrics. Notice the improvement in the results!

In [65]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer = True
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer = True
)

print("ORIGINAL MODEL RESULTS")
print(original_model_results)
print("INSTRUCT MODEL RESULTS")
print(instruct_model_results)


ORIGINAL MODEL RESULTS
{'rouge1': 0.206502812002812, 'rouge2': 0.0782550527066656, 'rougeL': 0.1827264587264587, 'rougeLsum': 0.1862993302993303}
INSTRUCT MODEL RESULTS
{'rouge1': 0.2510566239316239, 'rouge2': 0.11535720375106562, 'rougeL': 0.229375, 'rougeLsum': 0.23316773504273502}


The results above cover only 10 examples. If evaluated on the full dataset, the results would be as follows:

ORIGINAL MODEL:
{'rouge1': 0.233, 'rouge2': 0.0760, 'rougeL': 0.201, 'rougeLsum': 0.201}

INSTRUCT MODEL:
{'rouge1': 0.421, 'rouge2':0.180, 'rougeL': 0.338, 'rougeLsum': 0.338}

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL:
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%
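"Absolute" here means a difference in percentage points, not a relative change. The sketch below recomputes both from the rounded full-dataset scores listed above; because it starts from rounded inputs, its absolute figures can differ from the listed ones in the second decimal place. The helper names are hypothetical.

```python
# Rounded full-dataset scores copied from the lists above.
original = {"rouge1": 0.233, "rouge2": 0.0760, "rougeL": 0.201, "rougeLsum": 0.201}
instruct = {"rouge1": 0.421, "rouge2": 0.180, "rougeL": 0.338, "rougeLsum": 0.338}


def absolute_gain_points(new: dict, old: dict) -> dict:
    """Difference between scores, expressed in percentage points."""
    return {key: (new[key] - old[key]) * 100 for key in new}


def relative_gain_percent(new: dict, old: dict) -> dict:
    """Difference between scores, as a percentage of the old score."""
    return {key: (new[key] - old[key]) / old[key] * 100 for key in new}


absolute = absolute_gain_points(instruct, original)
relative = relative_gain_percent(instruct, original)
for key in instruct:
    print(f"{key}: +{absolute[key]:.2f} points ({relative[key]:.1f}% relative)")
```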

## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now let's perform Parameter Efficient Fine-Tuning (PEFT) as opposed to "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained "LoRA adapter" emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can re-use the original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
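The way an adapter combines with a frozen weight can be sketched with tiny matrices. The example assumes the standard LoRA formulation, in which the effective weight is the frozen weight plus the product of two low-rank matrices scaled by `alpha / r`. It uses plain Python lists and toy values; `matmul` and `merge_lora` are hypothetical helpers, not part of the `peft` library.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) without changing weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out x rank) @ (rank x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base_weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight (toy values)
adapter_a = [[3.0, 4.0]]                # rank x d_in  (1 x 2)
adapter_b = [[1.0], [2.0]]              # d_out x rank (2 x 1)

merged = merge_lora(base_weight, adapter_a, adapter_b, alpha=2, rank=1)
print(merged)
print(base_weight)  # the frozen weight is left untouched
```

Because `base_weight` is never modified, several adapters could be combined with the same frozen weight in turn, which is the memory benefit described above.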

### 3.1 - Set up the PEFT/LoRA Model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [66]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN_T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [68]:
peft_model = get_peft_model(original_model, lora_config)
peft_model.to(device)
print_number_of_trainable_model_parameters(peft_model)

trainable model parameters: 3538944 
all model parameter: 251116800 
percentage of trainable model parameters: 1.4092820552029972%
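The trainable parameter count above can be reproduced by hand. The sketch below relies on architecture details that are not stated in this notebook and should be read as assumptions about `google/flan-t5-base`: a hidden size of 768, query/value projections of size 768, and 12 encoder plus 12 decoder layers, with one attention block per encoder layer and two (self- and cross-attention) per decoder layer.

```python
d_model = 768        # assumed hidden size of flan-t5-base
attention_dim = 768  # assumed output size of the q and v projections
rank = 32            # r from the LoraConfig above
encoder_layers = 12  # assumed
decoder_layers = 12  # assumed

# Encoder layers have self-attention; decoder layers have self- and cross-attention.
attention_blocks = encoder_layers + 2 * decoder_layers
adapted_matrices = attention_blocks * 2  # target_modules=["q", "v"]

# Each adapted matrix gets two low-rank factors: (rank x d_in) and (d_out x rank).
params_per_matrix = rank * (d_model + attention_dim)
lora_params = adapted_matrices * params_per_matrix

base_params = 247_577_856  # count printed for the original model
total_params = base_params + lora_params

print(f"LoRA parameters: {lora_params}")
print(f"all parameters: {total_params}")
print(f"trainable share: {lora_params / total_params * 100:.4f}%")
```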


### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [73]:
output_dir = f'./peft-dialogue-summary-training-peft-07051137'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate than full fine-tuning
    num_train_epochs=5,
    logging_steps=1,
    max_steps=10  # caps training at 10 steps and overrides num_train_epochs
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train']
)


In [74]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-chechpoint-local-07051137"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss |
|---|---|
| 1 | 46.5 |
| 2 | 41.75 |
| 3 | 37.25 |
| 4 | 33.25 |
| 5 | 30.0 |
| 6 | 28.125 |
| 7 | 26.125 |
| 8 | 26.0 |
| 9 | 23.5 |
| 10 | 23.25 |


('./peft-dialogue-summary-chechpoint-local-07051137/tokenizer_config.json',
 './peft-dialogue-summary-chechpoint-local-07051137/special_tokens_map.json',
 './peft-dialogue-summary-chechpoint-local-07051137/tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [75]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
peft_model_base = peft_model_base.to(device)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                        './peft-dialogue-summary-chechpoint-local-07051137',
                                        torch_dtype=torch.bfloat16,
                                        is_trainable=False)

The number of trainable parameters will be 0 due to `is_trainable=False`.

In [76]:
print_number_of_trainable_model_parameters(peft_model)

trainable model parameters: 0 
all model parameter: 251116800 
percentage of trainable model parameters: 0.0%


### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Run inference on the same example as in sections 1.3 and 2.3, with the original model, the fully fine-tuned model, and the PEFT model.

In [78]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_test_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_test_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_test_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}\n')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_test_output}\n')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_test_output}\n')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_test_output}\n')



---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
You'd probably want to upgrade your computer.

---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1#: I'm thinking of upgrading my computer.

---------------------------------------------------------------------------------------------------
PEFT MODEL:
Upgrade your computer.



### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

Perform inference for a sample of the test dataset (only 10 dialogues and summaries to save time).

In [79]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """

    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    original_model_test_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_test_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    instruct_model_test_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_test_output)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    peft_model_test_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_model_test_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline', 'original_model', 'instruct_model', 'peft_model'])
df


| | human_baseline | original_model | instruct_model | peft_model |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | This memo should go out as an intra-office mem... | #Person1#: I need to take a dictation for you. | This memo is to be distributed to all employee... |
| 1 | In order to prevent employees from wasting tim... | The memo is being distributed as an intra-offi... | #Person1#: I need to take a dictation for you. | This memo is to be distributed to all employee... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | The memo is a new intra-office memorandum to e... | #Person1#: I need to take a dictation for you. | This memo is to be distributed to all employee... |
| 3 | #Person2# arrives late because of traffic jam.... | I'm going to be late for the meeting. | The traffic jam at the Carrefour intersection ... | Take public transport to work. |
| 4 | #Person2# decides to follow #Person1#'s sugges... | I'm finally home. | The traffic jam at the Carrefour intersection ... | Take public transport to work. |
| 5 | #Person2# complains to #Person1# about the tra... | I'm stuck in traffic. | The traffic jam at the Carrefour intersection ... | Take public transport to work. |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are having a separation for 2 m... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | You look great, you look great. | #Person1#: Happy birthday, Brian. #Person2#: T... | Brian's birthday is coming up. |


Compute ROUGE score for this subset of the data.

In [80]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer = True
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer = True
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer = True
)

print("ORIGINAL MODEL RESULTS")
print(original_model_results)
print("INSTRUCT MODEL RESULTS")
print(instruct_model_results)
print("PEFT MODEL RESULTS")
print(peft_model_results)


ORIGINAL MODEL RESULTS
{'rouge1': 0.20142281924466598, 'rouge2': 0.07487218367784852, 'rougeL': 0.1667557040775345, 'rougeLsum': 0.17030134157160476}
INSTRUCT MODEL RESULTS
{'rouge1': 0.2510566239316239, 'rouge2': 0.11535720375106562, 'rougeL': 0.229375, 'rougeLsum': 0.23316773504273502}
PEFT MODEL RESULTS
{'rouge1': 0.29970979020979016, 'rouge2': 0.14344664031620552, 'rougeL': 0.24626456876456876, 'rougeLsum': 0.24932465682465677}


The results above cover only 10 examples. If evaluated on the full dataset, the results would be as follows:

ORIGINAL MODEL:
{'rouge1': 0.233, 'rouge2': 0.0760, 'rougeL': 0.201, 'rougeLsum': 0.201}

INSTRUCT MODEL:
{'rouge1': 0.421, 'rouge2':0.180, 'rougeL': 0.338, 'rougeLsum': 0.338}

PEFT MODEL:
{'rouge1': 0.408, 'rouge2':0.163, 'rougeL': 0.325, 'rougeLsum': 0.324}


Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL:
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%

Calculate the absolute improvement, in percentage points, of PEFT over the fully fine-tuned model on this 10-example sample.

In [81]:
improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))

for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

rouge1: 4.87%
rouge2: 2.81%
rougeL: 1.69%
rougeLsum: 1.62%
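The figures above come from only 10 examples, so they can differ from the full-dataset picture. As a rough cross-check, the sketch below repeats the same subtraction on the rounded full-dataset scores listed earlier in this section; the variable names are hypothetical, and the rounding of the inputs limits the precision of the result.

```python
# Rounded full-dataset scores copied from the lists earlier in this section.
instruct_full = {"rouge1": 0.421, "rouge2": 0.180, "rougeL": 0.338, "rougeLsum": 0.338}
peft_full = {"rouge1": 0.408, "rouge2": 0.163, "rougeL": 0.325, "rougeLsum": 0.324}

# Negative values mean PEFT scores lower than full fine-tuning.
difference_points = {
    key: (peft_full[key] - instruct_full[key]) * 100 for key in peft_full
}

for key, value in difference_points.items():
    print(f"{key}: {value:+.2f} points")
```

On these numbers PEFT trails full fine-tuning by a small margin on every metric, which is the trade-off described at the start of the notebook.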
