<a href="https://colab.research.google.com/github/nyp-sit/iti107/blob/main/session-7/finetune_causallm_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune a Causal Language Model for Dialogue Summarization

In this exercise, you will fine-tune Meta's Llama 2 for enhanced dialogue summarization. Llama 2 is a large language model (LLM) that is free for research and commercial use. It is one of the top-performing open-source LLMs, comparable to GPT-3.5 on several benchmarks.

We will explore the use of Parameter Efficient Fine-Tuning (PEFT) for fine-tuning, and evaluate the resulting model using ROUGE metrics.

## Install the pre-requisites

Run the following cell if these Python packages have not been installed.

In [None]:
!pip install transformers datasets accelerate sentencepiece scipy peft bitsandbytes evaluate rouge_score

## Request access to Llama-2 weights

You need to request access to download the Llama 2 weights. You can do so either through this [link at Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or through your Hugging Face account at this [link](https://huggingface.co/meta-llama/Llama-2-7b). Once your request is approved, you will receive an email from Meta with instructions to download the Llama 2 weights, or an email from Hugging Face informing you that access has been granted.

If you download the weights from Meta directly, you need to run a conversion script to convert the weights to the Hugging Face format for use with the Hugging Face Transformers library.

In [None]:
# %%bash
# TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`
# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B

In [None]:
# Log in to Hugging Face to access the Llama model (only needs to be done once)

from huggingface_hub import notebook_login
notebook_login()

## Import packages

We first import all the necessary Python libraries.

In [None]:
import re
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import default_data_collator, Trainer, TrainingArguments
from datasets import load_dataset
import evaluate

## Load the Pretrained Model and Tokenizer

Load the pre-trained Llama 2 model and its tokenizer directly from Hugging Face. We will load the model with 8-bit quantization to save memory. For a more detailed explanation of how the model performs matrix multiplication in 8-bit, see this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).

In [None]:
model_id="meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='cuda:0', use_cache=False)
# model = LlamaForCausalLM.from_pretrained(model_id, device_map='auto', torch_dtype=torch.float16)

The following shows the GPU memory consumption on an A10G GPU for the two loading options in the cell above:

- `load_in_8bit=True`: 7512 MB
- `torch_dtype=torch.float16`: 13174 MB
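
As a rough sanity check on these figures, the memory needed for the weights alone can be estimated from the parameter count and the bytes used per weight. The sketch below assumes about 6.7 billion parameters (the figure quoted later in this notebook); the helper name `weight_memory_mib` is made up for illustration. The measured numbers are higher, since they also include overhead such as the CUDA context and any layers that are kept in higher precision.

```python
# Weight-only memory estimate; ignores activations, CUDA context and other overhead.
NUM_PARAMS = 6_700_000_000  # assumption: "about 6.7 billion" parameters


def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed to hold the weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


int8_mib = weight_memory_mib(NUM_PARAMS, 1)  # 8-bit: 1 byte per weight
fp16_mib = weight_memory_mib(NUM_PARAMS, 2)  # float16: 2 bytes per weight

print(f"int8 weights: {int8_mib:,.0f} MiB")
print(f"fp16 weights: {fp16_mib:,.0f} MiB")
```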

In [None]:
model.config

## Load the dataset

We are going to use the DialogSum Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics. Note that the dataset is already split into train, validation and test sets.


In [None]:
dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(dataset_name)
dataset

In [None]:
dataset_train = dataset['train']
dataset_test = dataset['test']
dataset_val = dataset['validation']

Let's take a look at one of the samples.

In [None]:
dataset['train'][100]

## Test the Model with Zero Shot Inferencing

Let's test the model with zero-shot inference (i.e. ask it to summarize without giving it any examples). You can see that the model struggles to summarize the dialogue compared to a human-written summary; it mostly just repeats the conversation.

In [None]:
eval_prompt = """
Summarize this dialog:
#Person1#: I have a problem with my cable.
#Person2#: What about it?
#Person1#: My cable has been out for the past week or so.
#Person2#: The cable is down right now. I am very sorry.
#Person1#: When will it be working again?
#Person2#: It should be back on in the next couple of days.
#Person1#: Do I still have to pay for the cable?
#Person2#: We're going to give you a credit while the cable is down.
#Person1#: So, I don't have to pay for it?
#Person2#: No, not until your cable comes back on.
#Person1#: Okay, thanks for everything.
#Person2#: You're welcome, and I apologize for the inconvenience.
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():   # no gradient update
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200)[0], skip_special_tokens=True))

## Creating instruction dataset

We will now prepare our dataset to fine-tune our base model (instruction fine-tuning).

### Instruction prompt

We need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM, as follows:

```
Summarize this dialog:

#Person1#: This is Person1 part of the conversation.
#Person2#: This is Person2 part of the conversation.
---
Summary:
This is ground truth summary of the dialog.
```

We will create a prompt template and a function that applies the template to all the samples in the original DialogSum dataset. Note that we also append an EOS token to the end of each sample, so that the fine-tuned model learns to stop at the appropriate point (e.g. the end of the summary) instead of generating tokens indefinitely.

In [None]:
def apply_prompt_template(sample):
    # Plain (non-f) string: the placeholders are filled in by .format() below
    prompt = "Summarize this dialog:\n{dialog}\n---\nSummary:\n{summary}{eos_token}"

    return {
        "text": prompt.format(
            dialog=sample["dialogue"],
            summary=sample["summary"],
            eos_token=tokenizer.eos_token,
        )
    }

dataset_train = dataset_train.map(apply_prompt_template, remove_columns=list(dataset_train.features))

Let's look at one of the samples. We can see that the original sample has been converted to a sample with a single 'text' field, and the text now conforms to the template we specified.

In [None]:
dataset_train[0]

Similarly we will apply the prompt template to the validation and test splits too.

In [None]:
dataset_val = dataset_val.map(apply_prompt_template, remove_columns=list(dataset_val.features))
dataset_test = dataset_test.map(apply_prompt_template, remove_columns=list(dataset_test.features))

### Tokenization and Preparing the Input

#### Tokenization

Before we can use the dataset for training, we first need to tokenize the dataset.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

dataset_train_tokenized = dataset_train.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset_train.features,
)

In [None]:
print("Dataset info: ", dataset_train_tokenized)
print("Length of input_ids: ", len(dataset_train_tokenized['input_ids'][0]))
print("Sample input: \n", dataset_train_tokenized[0])

After tokenization, each sample has two fields: `input_ids`, which contains the id of each token (subword), and `attention_mask`, which tells the model which tokens to ignore (e.g. padding). We also print the length of the `input_ids` of the first sample, which in this case is 341 tokens.

We will do the same tokenization on our validation and test datasets.
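
The tokenizer call above applies no padding, so every entry of the attention mask is 1. To show what the mask looks like when padding is present, here is a toy sketch in plain Python. The helper `pad_batch`, the padding id `PAD_ID` and the token ids are made up for illustration; a real tokenizer does this internally when asked to pad.

```python
PAD_ID = 0  # hypothetical padding id, for illustration only


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 8, 13], [7, 2]])
print(batch["input_ids"])       # [[5, 8, 13], [7, 2, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```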

In [None]:
dataset_val_tokenized = dataset_val.map(
    tokenize_function,
    batched=True,   # default batch size is 1000
    num_proc=4,
    remove_columns=dataset_val.features,
)

dataset_test_tokenized = dataset_test.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset_test.features,
)

Now let's prepare the input data for the model. As you can see above, a tokenized sample is typically a few hundred tokens long, whereas Llama models typically have a context window of 2048 or 4096 tokens. To use the data more efficiently, we use a technique called packing: instead of having one text per sample in the batch and then padding to either the longest text or the maximum context of the model, we concatenate many texts, separated by an EOS token, and cut the result into chunks of the context size, filling the batch without any padding.

<img src="https://github.com/nyp-sit/iti107/blob/main/session-7/resources/packing.png?raw=1" width="700"/>
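
To make the idea concrete, here is a minimal sketch of packing on toy token ids. The function name `pack` and the sample ids are made up for illustration. It is a simplified version of the approach: it always drops the trailing remainder that does not fill a whole chunk.

```python
def pack(sequences, chunk_size):
    """Concatenate token-id lists and cut them into fixed-size chunks, dropping the remainder."""
    flat = [tok for seq in sequences for tok in seq]
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


EOS = 2  # illustrative end-of-sequence id that terminates each sample
samples = [[11, 12, 13, EOS], [21, 22, EOS], [31, 32, 33, 34, EOS]]

chunks = pack(samples, chunk_size=4)
print(chunks)  # [[11, 12, 13, 2], [21, 22, 2, 31], [32, 33, 34, 2]]
```

Note that a chunk can contain the end of one sample and the start of the next; the EOS token is what tells the model where one sample stops.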


The code below helps us find the maximum context window of the model.

In [None]:
def get_max_context_length(model):
    """Return the model's maximum context length, falling back to 1024 if none is found."""
    conf = model.config
    max_length = None

    # Different architectures store the context length under different config names
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max context length: {max_length} in {length_setting}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max context length: {max_length}")

    return max_length

max_context_length = get_max_context_length(model)
print('Maximum Context length: ', max_context_length)

The following function concatenates a batch of samples and then divides the concatenated sequence into chunks of the context size. We also need to create 'labels' in the input dataset, which tell the model which token is to be predicted at each position. Shifting the inputs and labels to align them happens inside the model, so our labels are just an exact copy of the input_ids.

In the code below, we use a context_length of 512 instead of the maximum 4096, as we have limited GPU memory and a larger context length will result in an out-of-memory error even with a batch size of 1.

In [None]:
context_length = 512
# context_length = max_context_length

def group_texts(examples):

    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= context_length:
        total_length = (total_length // context_length) * context_length
    # Split by chunks of context length.
    result = {
        k: [t[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
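
The labels can be a plain copy of `input_ids` because the model itself pairs each position with the token that follows it when computing the loss. The toy sketch below mimics that alignment in plain Python; the token ids are made up, and the real model works on logits rather than on the input ids themselves.

```python
input_ids = [1, 7, 8, 9, 2]  # toy token ids for one packed chunk
labels = list(input_ids)     # exact copy, as produced by group_texts

# Conceptually, the prediction made at position t is scored against the token at t + 1:
contexts = input_ids[:-1]    # positions that make a prediction
targets = labels[1:]         # the "next token" each prediction is scored against

pairs = list(zip(contexts, targets))
print(pairs)  # [(1, 7), (7, 8), (8, 9), (9, 2)]
```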


In [None]:
dataset_train_final = dataset_train_tokenized.map(group_texts, batched=True, num_proc=4)
dataset_val_final = dataset_val_tokenized.map(group_texts, batched=True, num_proc=4)
dataset_test_final = dataset_test_tokenized.map(group_texts, batched=True, num_proc=4)

Now let's examine dataset_train_final. We can see that all the samples have a length equal to the specified context length.

In [None]:
dataset_train_final

In [None]:
for sample in dataset_train_final['input_ids'][:5]:
    print(len(sample))

Since we have done all the heavy lifting of preprocessing the data in our code, we just use a simple default data collator, which basically just passes the dictionary-like input to the model.

In [None]:
data_collator = default_data_collator

## Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank $r$ hyper-parameter, which defines the rank/dimension of the adapter to be trained.
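
To illustrate what the adapter does, here is a minimal sketch of a LoRA forward pass in plain Python, assuming the standard formulation $h = Wx + \frac{\alpha}{r} BAx$ with $W$ frozen and only $A$ and $B$ trained. The tiny matrices and the helper names `matvec` and `lora_forward` are made up for illustration, and dropout is omitted.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank update (alpha / r) * B @ A @ x."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, shape (d_out=2, d_in=2)
A = [[0.5, -0.5]]             # trainable, shape (r=1, d_in=2)
B = [[0.0], [0.0]]            # trainable, shape (d_out=2, r=1), zero at the start
x = [3.0, 4.0]

# With B at zero the adapter contributes nothing, so the output equals W @ x
print(lora_forward(W, A, B, x, alpha=4, r=1))  # [3.0, 4.0]
```

Starting with a zero `B` means the adapted model initially behaves exactly like the base model; training then moves `A` and `B` away from that starting point.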


In [None]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    # prepare int-8 model for training
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)


If you look at the trainable parameters, there are only about 4 million, compared to about 6.7 billion parameters in the entire model.
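
The figure of about 4 million can be reproduced by hand. The sketch below assumes the Llama-2-7B dimensions (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices); these values and the helper name `lora_param_count` are assumptions for illustration and are not read from the loaded model.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


HIDDEN_SIZE = 4096      # assumption: Llama-2-7B hidden size
NUM_LAYERS = 32         # assumption: Llama-2-7B decoder layers
TARGETS_PER_LAYER = 2   # q_proj and v_proj
R = 8                   # rank used in the LoraConfig above

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_module * TARGETS_PER_LAYER * NUM_LAYERS
print(total)  # 4194304
```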

In [None]:
model

## Define the Trainer and Training Arguments

We can now define the training arguments and create a Trainer instance. If you are using an Ampere GPU (e.g. NVIDIA A10), you can set bf16 to True to use bfloat16 for mixed-precision computation.

*Note: Fine-tuning the model to a decent level of performance takes approximately 1 to 2 hours, so for this lab we train for only a single step. If you have access to GPUs such as an A10G, you can train for more steps, e.g. 100 steps, and set logging_steps=10 and save_steps=10 to log and save every 10 steps.*

In [None]:
# specify where to write the checkpoint to
output_dir = "train_out_dir"

# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    auto_find_batch_size=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    bf16=False,  # Use BF16 if available (e.g. on Ampere GPU)
    # logging strategy
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    # logging_steps=10,
    logging_steps=1,
    # saving strategy
    save_strategy="steps",
    # save_steps=10,
    save_steps=1,
    evaluation_strategy="steps",
    optim="adamw_torch_fused",
    load_best_model_at_end=True,
    max_steps=1
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train_final,
    eval_dataset=dataset_val_final,
    data_collator=data_collator,
)
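
With gradient accumulation, the optimizer sees more data per update than the per-device batch size suggests. A small sketch of the arithmetic for the settings above, assuming a single GPU:

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
context_length = 512
num_devices = 1  # assumption: single GPU

# Sequences and tokens that contribute to one optimizer update
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
tokens_per_step = sequences_per_step * context_length
print(sequences_per_step, tokens_per_step)  # 4 2048
```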


In [None]:
# Start training

trainer.train()

In [None]:
# model.eval()
# trainer.evaluate(eval_dataset=dataset_val_final)

### Save the Trained model

In [None]:
save_dir = 'lora_model_output'
model.save_pretrained(save_dir)


### Load the PEFT Model

Run the following cell to download the fine-tuned LoRA weights.

You should **restart the session to clear the GPU memory** before continuing with the next step.

In [None]:
!wget https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/iti107/pretrained-weights/lora_model_output.zip
!unzip  -o lora_model_output.zip

In [None]:
import re
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import default_data_collator, Trainer, TrainingArguments
from datasets import load_dataset
import evaluate

In [None]:
model_id = 'meta-llama/Llama-2-7b-hf'
save_dir = 'lora_model_output'
tokenizer = AutoTokenizer.from_pretrained(model_id)
peft_model = AutoModelForCausalLM.from_pretrained(save_dir, device_map='cuda:0', load_in_8bit=True, torch_dtype=torch.float16)

In [None]:
peft_model

### Test the Model

Now let's test our fine-tuned model on a sample dialogue.

In [None]:
eval_prompt = """
Summarize this dialog:
#Person1#: Hello, how are you doing today?
#Person2#: I ' Ve been having trouble breathing lately.
#Person1#: Have you had any type of cold lately?
#Person2#: No, I haven ' t had a cold. I just have a heavy feeling in my chest when I try to breathe.
#Person1#: Do you have any allergies that you know of?
#Person2#: No, I don ' t have any allergies that I know of.
#Person1#: Does this happen all the time or mostly when you are active?
#Person2#: It happens a lot when I work out.
#Person1#: I am going to send you to a pulmonary specialist who can run tests on you for asthma.
#Person2#: Thank you for your help, doctor.
---
Summary:
"""

# eval_prompt = """
# Summarize this dialog:
# A: Hi Tom, are you busy tomorrow’s afternoon?
# B: I’m pretty sure I am. What’s up?
# A: Can you go with me to the animal shelter?.
# B: What do you want to do?
# A: I want to get a puppy for my son.
# B: That will make him so happy.
# A: Yeah, we’ve discussed it many times. I think he’s ready now.
# B: That’s good. Raising a dog is a tough issue. Like having a baby ;-)
# A: I'll get him one of those little dogs.
# B: One that won't grow up too big;-)
# A: And eat too much;-))
# B: Do you know which one he would like?
# A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
# B: I bet you had to drag him away.
# A: He wanted to take it home right away ;-).
# B: I wonder what he'll name it.
# A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
# ---
# Summary:
# """

from transformers import TextStreamer

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# #Streaming support
# streamer = TextStreamer(tokenizer)
# peft_model.generate(**model_input, streamer=streamer)
peft_model.eval()
with torch.no_grad():
    print(tokenizer.decode(peft_model.generate(**model_input)[0], skip_special_tokens=True))

## Evaluate the model using ROUGE metric

We first define a utility function to extract the summary part from the generated text.

In [None]:
# remove the dialog and retain only text in the summary
def get_summary(text):
    parts = re.split(r'Summary:', text)
    summary = parts[1].strip()
    return summary
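
As a quick check of this helper, the sketch below repeats its definition and applies it to a made-up generated text (the string `generated` is illustrative, not real model output). Note that `parts[1]` raises an `IndexError` if the text contains no `Summary:` marker.

```python
import re


def get_summary(text):
    # Split on the marker and keep what follows its first occurrence
    parts = re.split(r'Summary:', text)
    return parts[1].strip()


generated = (
    "\nSummarize this dialog:\n"
    "#Person1#: Hi.\n"
    "#Person2#: Hello.\n"
    "---\n"
    "Summary:\n"
    "Two people greet each other."
)
print(get_summary(generated))  # Two people greet each other.
```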

The original test set has 1500 entries, and computing ROUGE over all of them would take a long time. To speed things up, we compute ROUGE for only the first 15 test samples.

In [None]:
dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(dataset_name)
dialogues = dataset['test']['dialogue'][:15]
human_baseline_summaries = dataset['test']['summary'][:15]
peft_model_summaries = []

for dialogue in dialogues:
    eval_prompt = f"""
Summarize this dialog:
{dialogue}
---
Summary:
"""
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        peft_model_output = tokenizer.decode(peft_model.generate(**model_input)[0], skip_special_tokens=True)
    summary = get_summary(peft_model_output)
    peft_model_summaries.append(summary)


In [None]:
print('Human Baseline')
print('*'*10)
for summary in human_baseline_summaries[:5]:
    print(summary)
print('PEFT summaries')
print('*'*10)
for summary in peft_model_summaries[:5]:
    print(summary)

In [None]:
rouge = evaluate.load('rouge')

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print('PEFT model ROUGE scores:')
print(peft_model_results)
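
To give a feel for what these scores measure, here is a simplified sketch of ROUGE-1, which is based on unigram overlap between the prediction and the reference. It uses lowercase whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation exactly; the function name `rouge1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference word can be matched at most as often as it occurs
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cable is down", "the cable has been down for a week")
print(scores)
```

Here three of the four predicted words appear in the eight-word reference, giving a precision of 0.75 and a recall of 0.375.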

### Testing the token generation speed

In [None]:
import transformers
import time

pipeline = transformers.pipeline(
    "text-generation",
    model=peft_model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

In [None]:
max_new_tokens = 30
num_runs = 20
tokens_per_second_list = []

for _ in range(num_runs):
    start = time.time()
    output = pipeline(eval_prompt, max_new_tokens=max_new_tokens, temperature=1, top_k=1, top_p=0.90)
    total_time = time.time() - start  # elapsed time in seconds

    # Assumes all max_new_tokens tokens were generated (no early stop at EOS)
    time_per_token = total_time / max_new_tokens
    tokens_per_second_list.append(max_new_tokens / total_time)

average = sum(tokens_per_second_list) / len(tokens_per_second_list)
# Print the results (times are for the last run, converted from seconds to ms)
print("Total inference time (last run): {:.2f} ms".format(total_time * 1000))
print("Time per token (last run): {:.2f} ms/token".format(time_per_token * 1000))
print("Average tokens per second: {:.2f} token/s".format(average))