# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, we will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization.

We will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve its output, we will first explore full fine-tuning and evaluate the results with ROUGE metrics.

Then we will perform Parameter-Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its slightly lower performance metrics.

## 1 - Set up Kernel 

```
pip3 install --upgrade pip

pip3 install --disable-pip-version-check \
    torch==2.0.0 \
    torchdata==0.6.0

pip3 install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    rouge_score==0.1.2 \
    evaluate==0.4.0 \
    loralib==0.1.1 \
    peft==0.3.0
```

## 2 - Set Up the Dataset, LLM, and Tokenizer

In [1]:
from datasets import load_dataset

dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(dataset_name)

from enum import Enum
class Dataset_Splits(Enum):
    TRAIN = 'train'
    VALIDATION = 'validation'
    TEST = 'test'


class Dataset_Columns(Enum):
    DIALOGUE = 'dialogue'
    SUMMARY = 'summary'

dataset



DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [2]:
from transformers import AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

### 2.1 Trainable Model Parameters

Check how many of the model's parameters are trainable out of the total.

In [4]:
def print_total_number_of_model_parameters(model):
    print(f'{model.num_parameters():,}')

def print_number_of_trainable_model_parameters(model):
    print(f'{model.num_parameters(only_trainable=True):,}')

print_total_number_of_model_parameters(model)
print_number_of_trainable_model_parameters(model)

247,577,856
247,577,856


### 2.2 Test the Model with Zero-Shot Inference

The model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which suggests that it can be fine-tuned for the task at hand.

In [27]:
index = 200

dialogue = dataset[Dataset_Splits.TEST.value][index][Dataset_Columns.DIALOGUE.value]
human_baseline_summary = dataset[Dataset_Splits.TEST.value][index][Dataset_Columns.SUMMARY.value]

prompt = f"""
Summarize the following conversation.
{dialogue}

Summary:
"""

inputs = tokenizer.encode(prompt, return_tensors='pt')
model_completion = model.generate(
    inputs,
    max_new_tokens=200
)
output = tokenizer.decode(
    model_completion[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

--------------------------------------------------------------------

## 3 - Perform Full Fine-Tuning

### 3.1 Preprocess the Dialog-Summary Dataset

We need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the dialogue and append `Summary:` after it, as follows:

Training prompt (dialogue):

Summarize the following conversation.

        Chris: This is his part of the conversation.
        Antje: This is her part of the conversation.

Summary:

Training response (summary):

        Both Chris and Antje participated in the conversation.
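The template above can be captured in one small helper so that training and inference build their prompts the same way. The sketch below is only illustrative: `build_prompt` and its two constants are hypothetical names that are not part of the notebook, and the strings mirror the `start_prompt` and `end_prompt` values used in the tokenization cell.

```python
# Hypothetical helper (not part of the original notebook): a single place
# to define the instruction template for both training and inference.
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary:"


def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the summarization instruction."""
    return START_PROMPT + dialogue + END_PROMPT


example = build_prompt("Chris: Hi.\nAntje: Hello.")
print(example)
```

Reusing one template may matter here because the inference cells in this notebook build their prompts with slightly different whitespace than the training prompt.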

Then tokenize the prompt-response pairs and extract their `input_ids` (one ID per token).

In [6]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary:'

    prompt = [start_prompt + dialogue + end_prompt for dialogue in example[Dataset_Columns.DIALOGUE.value]]

    example['input_ids'] = tokenizer(prompt, padding='max_length', truncation=True, return_tensors='pt').input_ids
    example['labels'] = tokenizer(example[Dataset_Columns.SUMMARY.value], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

print(f'The dataset used to be like this:\n{dataset}')
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(f'\nWe tokenize the datasets & add `input_ids` and `labels` columns:\n{tokenized_datasets}')
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])
print(f'\nWe remove the original columns from the `tokenized_datasets`:\n{tokenized_datasets}')

The dataset used to be like this:
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

We tokenize the datasets & add `input_ids` and `labels` columns:
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 1500
    })
})

We remove the original columns from the `tokenized_datasets`:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labe



We could subsample the dataset to save some time, but I won't:

In [7]:
# tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
# print(tokenized_datasets)

Now check the shapes of the three dataset splits:

In [8]:
print(f'Shapes of the datasets:')
print(f'Training: {tokenized_datasets[Dataset_Splits.TRAIN.value].shape}')
print(f'Validation: {tokenized_datasets[Dataset_Splits.VALIDATION.value].shape}')
print(f'Testing: {tokenized_datasets[Dataset_Splits.TEST.value].shape}')

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Testing: (1500, 2)


The output dataset is now ready for fine-tuning.
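One detail to be aware of: with `padding='max_length'`, the `labels` contain many pad tokens, and by default these positions count toward the training loss. A common remedy is to replace pad positions in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_pad_labels` is a hypothetical helper, the token IDs are made up, and the pad token ID of `0` is an assumption that should be checked against `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_pad_labels(label_ids, pad_token_id=0, ignore_index=IGNORE_INDEX):
    """Replace pad token IDs with ignore_index in a batch of label sequences."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Toy batch: two label sequences padded to length 6 with the assumed pad ID 0.
labels = [[363, 19, 8, 1, 0, 0], [27, 1, 0, 0, 0, 0]]
masked = mask_pad_labels(labels)
print(masked)
```

The notebook does not apply this step, so its reported losses include the padded positions.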

### 3.2 Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer)).

Pass in the preprocessed dataset along with a reference to the original model. The other training parameters were found experimentally, and there is no need to go into detail about them at the moment.


In [9]:
import time
from transformers import TrainingArguments, Trainer

instruct_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

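training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=3,  # overridden below: max_steps takes precedence when set
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # run a single training step to keep this demo fast
)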

trainer = Trainer(
    model=instruct_model,
    args=training_args,
    train_dataset=tokenized_datasets[Dataset_Splits.TRAIN.value],
    eval_dataset=tokenized_datasets[Dataset_Splits.VALIDATION.value]
)

trainer.train()

100%|██████████| 1/1 [00:44<00:00, 44.74s/it]

{'loss': 49.0106, 'learning_rate': 0.0, 'epoch': 0.0}
{'train_runtime': 44.7354, 'train_samples_per_second': 0.179, 'train_steps_per_second': 0.022, 'train_loss': 49.01055908203125, 'epoch': 0.0}





TrainOutput(global_step=1, training_loss=49.01055908203125, metrics={'train_runtime': 44.7354, 'train_samples_per_second': 0.179, 'train_steps_per_second': 0.022, 'train_loss': 49.01055908203125, 'epoch': 0.0})

Training a fully fine-tuned version of the model would take a few hours on a GPU. 
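For a rough sense of scale, the number of optimizer steps in a full run can be estimated from the dataset size. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and a batch size of 8 (the default `per_device_train_batch_size`); `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a full run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 12,460 training rows (from the dataset summary above), 3 epochs,
# and an assumed batch size of 8.
steps = total_training_steps(num_examples=12460, batch_size=8, num_epochs=3)
print(f"{steps:,} steps")
```

By comparison, the cell above stops after `max_steps=1`, which is a tiny fraction of a full run.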

To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the **instruct model** in this lab.

I didn't buy the course, so I don't have access to the checkpoint of the fully fine-tuned model. I'll use my own model instead.

In [10]:
# !aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/

# !ls -alh ./flan-dialogue-summary-checkpoint/

In [11]:
print(instruct_model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

In [12]:
import torch

# instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16)

### 3.3 Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point.

In the example below (the same one we started this notebook with), a fully fine-tuned model should produce a reasonable summary of the dialogue, where the original model fails to follow the instruction. Because the instruct model here was trained for only one step, its output is nearly identical to the original model's.

In [13]:
from transformers import GenerationConfig

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)

original_model_outputs = model.generate(input_ids=input_ids, generation_config=generation_config)
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=generation_config)
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1#: I'm thinking about upgrading my computer.


### 3.4 Evaluate the Model Quantitatively (with ROUGE metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the quality of summaries produced by models.

It compares each generated summary to a "baseline" reference summary, which is usually written by a human. While not perfect, it does indicate the overall gain in summarization quality achieved by fine-tuning.
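To make the idea concrete, here is a simplified sketch of an n-gram overlap score in the spirit of ROUGE-N. It assumes lowercase whitespace tokenization and no stemming, so it will not reproduce the numbers from the `evaluate` library; `ngram_counts` and `rouge_n_f1` are hypothetical names.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n=1):
    """F1 of clipped n-gram overlap between a prediction and a reference."""
    pred, ref = ngram_counts(prediction, n), ngram_counts(reference, n)
    overlap = sum((pred & ref).values())  # each n-gram counts at most min(pred, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat sat"
print(f"ROUGE-1 F1: {rouge_n_f1(prediction, reference, n=1):.3f}")
print(f"ROUGE-2 F1: {rouge_n_f1(prediction, reference, n=2):.3f}")
```

A short prediction that copies part of the reference gets perfect precision but low recall, which is why the F1 score stays well below 1.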

In [14]:
import evaluate

rouge = evaluate.load('rouge')

Generate outputs for a sample of the test dataset (only 10 dialogues and summaries, to save time) and save the results.

In [15]:
import pandas as pd

dialogues = dataset[Dataset_Splits.TEST.value][0:10][Dataset_Columns.DIALOGUE.value]
human_baseline_summaries = dataset[Dataset_Splits.TEST.value][0:10][Dataset_Columns.SUMMARY.value]

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.
{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids

    original_model_outputs = model.generate(input_ids=input_ids, generation_config=generation_config)
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=generation_config)
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for you. | Employees will be required to sign a memo cont... |
| 1 | In order to prevent employees from wasting tim... | #Person1#: I need to take a dictation for you. | #Person1: This is a memo to all employees. #Pe... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: I need to take a dictation for you. | The memo is a memo that should be distributed ... |
| 3 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... | The conversation starts with a person talking ... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam at the Carrefour intersection ... | Taking public transport to work would be bette... |
| 5 | #Person2# complains to #Person1# about the tra... | The traffic jam at the Carrefour intersection ... | The person who got stuck in traffic got stuck ... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | #Person1: Masha and Hero are getting divorced.... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | #Person1#: Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy birthday, Brian. #Person2#: I... | Brian's birthday is today. |


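Evaluate the models by computing the ROUGE metrics. With only one training step, the instruct model's scores differ only slightly from the original model's: ROUGE-1 rises a little while ROUGE-2 dips.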

In [16]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.24089921652421653, 'rouge2': 0.11769053708439897, 'rougeL': 0.22001958689458687, 'rougeLsum': 0.22134175465057818}
INSTRUCT MODEL:
{'rouge1': 0.28717773337739955, 'rouge2': 0.10829179728317662, 'rougeL': 0.2302144124698507, 'rougeLsum': 0.23255054012271414}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. 

Let's do that for each of the models:

In [17]:
import numpy as np

results = pd.read_csv('data/dialogue-summary-training-results.csv')

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)


print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")
improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%
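The numbers above are absolute differences in ROUGE score, expressed in percentage points. A relative improvement tells a different story, and the sketch below computes both from the scores printed above (rounded to four decimals). The function names are hypothetical.

```python
def absolute_gain(new, old):
    """Difference between two scores, in percentage points."""
    return (new - old) * 100


def relative_gain(new, old):
    """Improvement expressed as a percentage of the old score."""
    return (new - old) / old * 100


# Scores copied from the output above, rounded to four decimals.
original = {"rouge1": 0.2334, "rouge2": 0.0760, "rougeL": 0.2015}
instruct = {"rouge1": 0.4216, "rouge2": 0.1804, "rougeL": 0.3384}

for key in original:
    print(
        f"{key}: +{absolute_gain(instruct[key], original[key]):.2f} points, "
        f"+{relative_gain(instruct[key], original[key]):.1f}% relative"
    )
```

Reporting both avoids ambiguity: a gain of roughly 19 points on ROUGE-1 corresponds to a much larger relative change, because the starting score is low.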


## 4 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform Parameter-Efficient Fine-Tuning (PEFT), as opposed to the full fine-tuning we did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as we will see soon. PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!).

When someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged and a newly trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM, on the order of a single-digit percentage of the original LLM size (MBs vs GBs).

That said, at inference time, the LoRA adapter needs to be recombined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
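The relationship between the frozen weights and the adapter can be sketched with tiny matrices. In LoRA, a frozen weight matrix `W` is augmented by a low-rank product `B @ A` scaled by `lora_alpha / r`. The example below uses pure Python, toy sizes, and made-up numbers to show that keeping the adapter separate and merging it into `W` give the same output. The helper names are hypothetical, and this is not how the `peft` library is implemented internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: d = 3 features and rank r = 1 (the notebook uses d = 768, r = 32).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight, d x d
B = [[1.0], [2.0], [3.0]]                                # trained adapter factor, d x r
A = [[0.5, 0.0, -0.5]]                                   # trained adapter factor, r x d
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]  # one input, as a column vector

# Option 1: keep the adapter separate and add its output to the base output.
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Option 2: merge the adapter into the base weight once, then use a single matmul.
W_merged = add(W, matmul(B, A), scale=scaling)
merged = matmul(W_merged, x)

print(separate, merged)
```

Keeping the adapter separate is what lets several adapters share one copy of the base weights, while merging trades that flexibility for a single matrix multiply at inference time.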

### 4.1 - Set Up the PEFT/LoRA Model for Fine-Tuning

We need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. 

Using PEFT/LoRA, we are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [18]:
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"], # query and value
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.SEQ_2_SEQ_LM
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [19]:
from peft import get_peft_model

peft_model = get_peft_model(
    model,
    lora_config
)

print_total_number_of_model_parameters(peft_model)
print_number_of_trainable_model_parameters(peft_model)

251,116,800
3,538,944
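The trainable parameter count can be reproduced by hand. Each adapted `Linear` layer gains two matrices, `A` (r x in_features) and `B` (out_features x r). The sketch below assumes the `flan-t5-base` architecture has 12 encoder blocks (self-attention only) and 12 decoder blocks (self-attention plus cross-attention); that layer count is an assumption not shown in the truncated model printout, and `lora_params_per_module` is a hypothetical helper.

```python
def lora_params_per_module(in_features, out_features, r):
    """Parameters added to one Linear layer: A is r x in, B is out x r."""
    return r * in_features + out_features * r


d_model = 768              # from the printed model: Linear(768, 768) for q and v
r = 32                     # rank from lora_config
targets_per_attention = 2  # target_modules=["q", "v"]

# Assumed architecture: 12 encoder blocks with one attention layer each,
# 12 decoder blocks with two attention layers each (self and cross).
attention_layers = 12 * 1 + 12 * 2

adapted_modules = attention_layers * targets_per_attention
trainable = adapted_modules * lora_params_per_module(d_model, d_model, r)
base_total = 247_577_856   # printed earlier for the base model

print(f"{trainable:,} trainable of {base_total + trainable:,} "
      f"({100 * trainable / (base_total + trainable):.2f}%)")
```

Under these assumptions the arithmetic matches the two numbers printed above, so only about 1.4% of the parameters are updated during PEFT training.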


### 4.2 - Train the PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [20]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate than full fine-tuning
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets[Dataset_Splits.TRAIN.value]
)



# Now everything is ready to train the PEFT adapter and save the model.

peft_trainer.train()

peft_model_path='./peft-dialogue-summary-checkpoint-local'
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)


100%|██████████| 1/1 [00:37<00:00, 37.23s/it]


{'loss': 49.4193, 'learning_rate': 0.0, 'epoch': 0.0}
{'train_runtime': 37.2336, 'train_samples_per_second': 0.215, 'train_steps_per_second': 0.027, 'train_loss': 49.419254302978516, 'epoch': 0.0}


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

That training was performed on only a small subset of the data (a single training step).
To load a fully trained PEFT model, the lab reads a PEFT checkpoint from S3, which again I don't have access to.

In [21]:
# !aws s3 cp --recursive s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/ ./peft-dialogue-summary-checkpoint-from-s3/ 
# !ls -al ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin


Prepare this model by adding an adapter to the original FLAN-T5 model. 

We are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If we were preparing the model for further training, we would set `is_trainable=True`.

In [25]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

peft_model = PeftModel.from_pretrained(peft_model_base,
                                    #    './peft-dialogue-summary-checkpoint-from-s3/',
                                       './peft-dialogue-summary-checkpoint-local',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False
                                       )
print_total_number_of_model_parameters(peft_model)
print_number_of_trainable_model_parameters(peft_model)

251,116,800
0
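Loading in `torch.bfloat16` stores each weight in 2 bytes instead of the 4 bytes used by `float32`. The sketch below estimates the memory for the weights alone (no activations, gradients, or optimizer state) using the parameter counts printed earlier; `weight_bytes` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_bytes(num_params, dtype):
    """Memory needed to store the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


base_params = 247_577_856   # base FLAN-T5 parameters printed earlier
adapter_params = 3_538_944  # LoRA adapter parameters printed earlier

for name, params in [("base model", base_params), ("LoRA adapter", adapter_params)]:
    for dtype in BYTES_PER_PARAM:
        mib = weight_bytes(params, dtype) / 2**20
        print(f"{name:12s} {dtype:8s} {mib:8.1f} MiB")
```

The adapter comes out at a few megabytes against hundreds for the base model, which is the size gap described in the PEFT introduction.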


### 4.3 - Evaluate the Model Qualitatively (Human Evaluation)

In [28]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = model.generate(input_ids=input_ids, generation_config=generation_config)
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=generation_config)
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=generation_config)
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1: I'm not sure what to do. #Person2: I'm not sure what to do. #Person1: I'm not sure. #Person2: I'm not sure. #Person1: I'm not sure. #Person2: I'm not sure. #Person1: I'm not sure. #Person2: I'm not sure. #Person1: I'm not sure.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1#: Yes, but I'm not sure what exactly I would need. #Person2#: I'd consider adding a painting program to your software. #Person1#: I'd like to add a painting program to my software. #Person2#: I'd like to add a painting program to my software. #Person1#: I'd like to add a computer graphics card. #Person2#: I'

### 4.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

Perform inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [29]:
dialogues = dataset[Dataset_Splits.TEST.value][0:10][Dataset_Columns.DIALOGUE.value]
human_baseline_summaries = dataset[Dataset_Splits.TEST.value][0:10][Dataset_Columns.SUMMARY.value]

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = model.generate(input_ids=input_ids, generation_config=generation_config)
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=generation_config)
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=generation_config)
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Employees are being warned about the use of in... | The memo will be distributed to all employees ... | The memo is to be distributed to all employees... |
| 1 | In order to prevent employees from wasting tim... | Employees are required to use instant messagin... | The following memo is for the purpose of dicta... | The memo is to be distributed to all employees... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | This memo is intended to inform all employees ... | #Person1#: I need to take a dictation for all ... | The memo is to be distributed to all employees... |
| 3 | #Person2# arrives late because of traffic jam.... | People are talking to each other about their c... | The car is a waste of time. | The traffic jam at the Carrefour intersection ... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The driver is a little confused about the way ... | The person is going to take public transport t... | The traffic jam at the Carrefour intersection ... |
| 5 | #Person2# complains to #Person1# about the tra... | The conversation was a debate between the two ... | The car is a big issue for the driver. | The traffic jam at the Carrefour intersection ... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | #Person1#: Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | The couple are getting divorced. | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting a divorce. | #Person1: Masha and Hero are getting divorced.... | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | Brian's birthday is coming up. | #Person1#: Happy birthday, Brian. #Person2#: I... | Brian's birthday is coming up. |


In [31]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.23986663409244052, 'rouge2': 0.08577606417849087, 'rougeL': 0.20967310539891182, 'rougeLsum': 0.21104013921755857}
INSTRUCT MODEL:
{'rouge1': 0.25158358287510363, 'rouge2': 0.09576659451659453, 'rougeL': 0.20060649742918765, 'rougeLsum': 0.2011730501381453}
PEFT MODEL:
{'rouge1': 0.26109650997150996, 'rouge2': 0.11055072463768116, 'rougeL': 0.2302777777777778, 'rougeLsum': 0.2339245014245014}


Notice that the PEFT model's results are not too bad, and the training process was much easier!

We already computed ROUGE scores on the full dataset after loading the results from the `data/dialogue-summary-training-results.csv` file. Now load the values for the PEFT model and compare its performance with the other models.

In [32]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
PEFT MODEL:
{'rouge1': 0.40810631575616746, 'rouge2': 0.1633255794568712, 'rougeL': 0.32507074586565354, 'rougeLsum': 0.3248950182867091}


PEFT improves on the original model slightly less than full fine-tuning does, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [33]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [34]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


Here we see a small decrease (between one and two percentage points) in the ROUGE metrics compared with the fully fine-tuned model. However, PEFT training requires far less compute and memory (often just a single GPU).