<a href="https://colab.research.google.com/github/shahabday/DSR-LLM-finetuning/blob/main/03_2_Exercise_Fine_Tune_Yoda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install accelerate datasets peft trl bitsandbytes matplotlib gdown

# Exercise

In this exercise, we'll fine-tune Phi-2 to translate sentences from English to the way Yoda talks.

In order to accomplish that, we'll create a "response template", that is, a special token that triggers the translation. We'll use the token `##[YODA]##>` so, whenever it is added at the end of a sentence, the model should complete it with the translated version.

For example, given the prompt:

`There is bacon in the sandwich.##[YODA]##>`

It should complete the sentence like this:

`There is bacon in the sandwich.##[YODA]##>Bacon in the sandwich there is.`

## Yoda

Download the CSV file and load it using [`load_dataset()`](https://huggingface.co/docs/datasets/en/loading). Then, shuffle the dataset and split it into train and test sets ([preprocessing a dataset](https://huggingface.co/docs/datasets/en/process)).

In [None]:
# Downloads yoda_translation.csv
!gdown 1luZxKTMuV2E6IGoHI9UARdOFGYAOfBMy

In [None]:
from datasets import load_dataset, Split
# load dataset
dataset = ...
# shuffle and split it
dataset = ...

In [None]:
dataset

Take a look at one element of the training set. It should have two columns: `sentence` and `yoda` (the translated sentence).

In [None]:
dataset['train'][0]

### Prompt Dataset

Now, let's make it a "prompt dataset" by renaming the columns to `prompt` and `completion` ([preprocessing a dataset](https://huggingface.co/docs/datasets/en/process)).

We'll train the model to take a regular English sentence (the prompt) and produce an output (that is, complete the sentence) with the Yoda translation (completion).

In [None]:
# rename the columns
prompt_yoda = ...

Take a look at the same element as before.

In [None]:
prompt_yoda['train'][0]

## Tokenizer

Use HF's [`AutoTokenizer`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer) to create a tokenizer for the `microsoft/phi-2` model.

The parameters for Phi-2 can be found [here](https://huggingface.co/docs/transformers/main/en/model_doc/codegen#transformers.CodeGenTokenizer). Make sure you add a beginning-of-sentence (BOS) token and a padding token (`<|pad|>`) as well.

We'll need to pad on the left side, because generation appends new tokens at the end (the right side) of the sequence. You can force the tokenizer to pad on the left by using `padding_side="left"`. Moreover, we set `use_fast=False` to load the slow (Python) tokenizer, since the fast variant may not apply these settings as expected for Phi-2.
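
To make the effect of the padding side concrete, here is a minimal sketch in plain Python. It does not use the real tokenizer: the token ids are made up, `0` stands in for the padding token id, and `pad_batch` is a hypothetical helper written only for this illustration.

```python
PAD_ID = 0  # hypothetical id standing in for the <|pad|> token

def pad_batch(sequences, pad_id=PAD_ID, side="left"):
    """Pad lists of token ids to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        mask_fill = [0] * len(filler)
        if side == "left":
            input_ids.append(filler + seq)
            attention_mask.append(mask_fill + [1] * len(seq))
        else:
            input_ids.append(seq + filler)
            attention_mask.append([1] * len(seq) + mask_fill)
    return input_ids, attention_mask

batch = [[11, 12, 13, 14], [21, 22]]  # made-up token ids
left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")

print(left_ids)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(right_ids)  # [[11, 12, 13, 14], [21, 22, 0, 0]]
```

With left padding, the last position of every row holds a real token, so newly generated tokens are appended directly after the sentence rather than after a run of padding.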

In [None]:
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

base_model_id = 'microsoft/phi-2'

tokenizer = AutoTokenizer.from_pretrained(
    ...
)

Our "Yoda" token isn't any of the expected special tokens (padding, unknown, mask, etc.). It is an *additional special token*. Luckily, there is a method to add such tokens to the tokenizer:

In [None]:
response_template = '##[YODA]##>'
tokenizer.add_special_tokens({'additional_special_tokens': [response_template]})

len(tokenizer)

Let's check if the padding and EOS tokens are configured.

In [None]:
tokenizer.pad_token, tokenizer.eos_token

### Formatting

Let's build a formatting function that takes both prompt and completion, and inserts a particular string that will be used to trigger the translation. This string is the response template (`##[YODA]##>`) as previously discussed.

The formatting function should produce outputs such as this one:

`There is bacon in the sandwich.##[YODA]##>Bacon in the sandwich there is.`

However, there is one small - yet important - detail to add: we should add the EOS token to the end of the sentence in order to signal to the model that it should stop the generation at that point.

So, the output should really look like this:

`There is bacon in the sandwich.##[YODA]##>Bacon in the sandwich there is.<|endoftext|>`
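
As a rough illustration of the target format on plain strings, the sketch below joins the pieces by hand. The helper name `build_training_text` is hypothetical, and the EOS string is hard-coded here as an assumption; in the notebook it would normally come from `tokenizer.eos_token`.

```python
RESPONSE_TEMPLATE = "##[YODA]##>"
EOS_TOKEN = "<|endoftext|>"  # assumption: hard-coded stand-in for tokenizer.eos_token

def build_training_text(prompt, completion, template=RESPONSE_TEMPLATE, eos=EOS_TOKEN):
    """Concatenate prompt, response template, completion, and the EOS marker."""
    return f"{prompt}{template}{completion}{eos}"

example = {
    "prompt": "There is bacon in the sandwich.",
    "completion": "Bacon in the sandwich there is.",
}
text = build_training_text(example["prompt"], example["completion"])
print(text)
```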

In [None]:
def formatting_func(example):
    return ...

# Try formatting one example from the training set and see if it is working fine.
formatting_func(prompt_yoda['train'][0])

Now, we'll write a function that takes a prompt, formats it, and tokenizes it. It should truncate the formatted prompt according to the `max_length` argument and, optionally, pad the formatted prompt up to that length (see the arguments for [calling](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__) a tokenizer).

In [None]:
def generate_and_tokenize_prompt(prompt, max_length=128, padding=True):
    result = tokenizer(
        ...
    )
    return result

# We'll call it WITHOUT padding first
dataset = prompt_yoda['train'].map(lambda v: generate_and_tokenize_prompt(v, padding=False))
dataset = dataset.remove_columns(['prompt', 'completion'])
print(dataset[0])

Perhaps you're wondering where the labels are... as it turns out, the collator will take care of it. We'll be using a collator for completion only, since we're not interested in the regular English sentences that precede our special "Yoda" token.
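
The idea behind completion-only labels can be sketched in plain Python. This is not TRL's actual implementation, only an illustration of the masking logic under simplifying assumptions: the token ids are made up, and `find_subsequence` and `mask_prompt_labels` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1

def mask_prompt_labels(input_ids, template_ids):
    """Copy input_ids into labels, ignoring everything up to and including the template."""
    start = find_subsequence(input_ids, template_ids)
    if start == -1:
        # No template found: no position contributes to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    end = start + len(template_ids)
    return [IGNORE_INDEX] * end + input_ids[end:]

# Made-up ids: 5, 6, 7 = English sentence, 99 = "Yoda" token, 8, 9 = translation, 1 = EOS
input_ids = [5, 6, 7, 99, 8, 9, 1]
labels = mask_prompt_labels(input_ids, template_ids=[99])
print(labels)  # [-100, -100, -100, -100, 8, 9, 1]
```

Only the translation and the final EOS token keep their ids as labels, so the loss is computed on the completion alone.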

In [None]:
from trl import DataCollatorForCompletionOnlyLM
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

dataloader_completion = DataLoader(dataset, batch_size=2, collate_fn=collator)
next(iter(dataloader_completion))

In [None]:
import matplotlib.pyplot as plt

def plot_data_lengths(tokenized_train_dataset):
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    print(len(lengths))

    # Plotting the histogram
    plt.figure(figsize=(10, 6))
    plt.hist(lengths, bins=20, alpha=0.7, color='blue')
    plt.xlabel('Length of input_ids')
    plt.ylabel('Frequency')
    plt.title('Distribution of Lengths of input_ids')
    plt.show()

plot_data_lengths(dataset)

Now, apply the `generate_and_tokenize_prompt` function to both the train and validation sets (here, the `test` split serves as the validation set) using the dataset's [`map()`](https://huggingface.co/docs/datasets/en/process#map) method.

You can adjust the max length to better match the observed length of the inputs (in the plot above).
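
One possible way to turn the histogram into a number is to pick a length that covers most examples and round it up. The sketch below is only a heuristic, not a rule from this exercise: `suggest_max_length`, the 95% coverage, and the rounding to a multiple of 8 are all assumptions, and the lengths are made up.

```python
def suggest_max_length(lengths, coverage_percent=95, multiple=8):
    """Smallest length covering `coverage_percent` of the examples, rounded up to `multiple`."""
    ordered = sorted(lengths)
    # Number of examples that must fit (ceiling division on integers).
    rank = (coverage_percent * len(ordered) + 99) // 100
    cutoff = ordered[max(rank, 1) - 1]
    # Round up to the next multiple.
    return ((cutoff + multiple - 1) // multiple) * multiple

# Made-up token counts standing in for len(x['input_ids']) of each example.
lengths = [18, 22, 25, 27, 30, 31, 33, 36, 41, 75]
print(suggest_max_length(lengths))                       # 80: every example fits
print(suggest_max_length(lengths, coverage_percent=90))  # 48: the single outlier gets truncated
```

A shorter `max_length` means less padding per example, at the cost of truncating the few longest sentences.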

In [None]:
max_length = ...
tokenized_train_dataset = ...
tokenized_val_dataset = ...

Try tokenizing (and decoding back) one example from the training set. You should see padding tokens to the left of the sentence, an `<|endoftext|>` token signaling the beginning of the sentence (the same token is used both as BOS and EOS token), the original sentence, our special "Yoda" token, the translated sentence, and the EOS token at the very end.

Here is an example:

`'<|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|endoftext|>Quench your thirst, then eat the crackers. ##[YODA]##> Quench your thirst, the crackers then eat.<|endoftext|>'`

In [None]:
tokenizer.decode(tokenized_train_dataset[1]['input_ids'])
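
To check the decoded structure programmatically, the string can be split around the response template. The following is a minimal sketch on a shortened version of the example string; `split_decoded` is a hypothetical helper, not part of any library.

```python
PAD, EOS, TEMPLATE = "<|pad|>", "<|endoftext|>", "##[YODA]##>"

def split_decoded(text, template=TEMPLATE):
    """Strip padding and BOS/EOS markers, then split around the response template."""
    cleaned = text.replace(PAD, "").replace(EOS, "")
    prompt, _, completion = cleaned.partition(template)
    return prompt.strip(), completion.strip()

decoded = (
    "<|pad|><|pad|><|endoftext|>Quench your thirst, then eat the crackers. "
    "##[YODA]##> Quench your thirst, the crackers then eat.<|endoftext|>"
)
prompt, completion = split_decoded(decoded)
print(prompt)
print(completion)
```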

## Model

Now, let's load the model itself. In order to quantize it while loading it, we need an instance of [`BitsAndBytesConfig`](https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig). We can load it in 4-bit using the `NF4` quantization type and double quantization. The computing dtype may be `torch.float16` or, if the GPU supports it, `torch.bfloat16`.
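
To get a feeling for what quantization buys, here is a back-of-the-envelope estimate of the weight storage at different precisions. It is a rough sketch only: it assumes about 2.7 billion parameters for Phi-2 and ignores quantization constants, layers kept in higher precision, activations, and optimizer state. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PHI2_PARAMS = 2.7e9  # assumption: Phi-2 has roughly 2.7 billion parameters

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(PHI2_PARAMS, bits):.2f} GB")
```

Each halving of the bit width halves the estimate, which is why lower-precision loading makes the model fit on smaller GPUs.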

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    ...
)

model = AutoModelForCausalLM.from_pretrained(...)

Before moving forward, let's check if the embeddings layer needs resizing or not (since we have added a special token to the tokenizer).

In [None]:
model.model.embed_tokens, len(tokenizer)
# no need
# model.resize_token_embeddings(len(tokenizer))

In [None]:
model

How many trainable parameters are left after we load the quantized model? Which layers can still be trained? Let's check it out:

In [None]:
def print_trainable_parameters(model, verbose=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for name, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            if verbose:
                print(name)
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
print_trainable_parameters(model, verbose=True)

## LoRA

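Quantization shrinks the model's memory footprint, but the quantized weights are frozen. LoRA adds a small number of trainable parameters on top of them, which makes fine-tuning feasible and faster.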

So, we need to create an instance of [`LoraConfig`](https://huggingface.co/docs/peft/main/en/developer_guides/quantization#loraconfig). You need to choose a rank (`r`), the alpha multiplier (`lora_alpha`), the target modules that will be modified by LoRA (`target_modules`), and - optionally - other modules that should be trained and saved (`modules_to_save`).

These extra modules may include layer norm and embeddings modules, for example. Including these modules may deliver better performance but it comes at the cost of not being able to merge multiple adapters together later.

Next, you can use the configuration to get the modified model using `get_peft_model()`.
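
To see how the rank and alpha choices translate into numbers, the arithmetic for a single linear layer can be sketched as follows. The hidden size of 2560 and the values `r=8` and `lora_alpha=16` are illustrative assumptions, not recommendations from this exercise, and both helpers are hypothetical.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters LoRA adds to one d_out x d_in linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r

d = 2560  # illustrative hidden size (assumption)
full = d * d                           # parameters in the frozen square layer
extra = lora_extra_params(d, d, r=8)   # trainable parameters added by LoRA

print(f"frozen: {full:,} | added by LoRA: {extra:,} | ratio: {100 * extra / full:.3f}%")
print("scaling:", lora_scaling(lora_alpha=16, r=8))
```

Doubling `r` doubles the number of added parameters, while `lora_alpha / r` controls how strongly the low-rank update is weighted relative to the frozen weights.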

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

prepared_model = ...

config = LoraConfig(
    ...
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

peft_model = ...
peft_model.print_trainable_parameters()

In [None]:
print_trainable_parameters(peft_model, verbose=True)

## Training

Before actually training the model, we have to configure its training arguments. Hugging Face's `TrainingArguments` is very thorough and comprehensive, so we're providing suggested arguments right away.

It is important to notice that:
- it uses a paged 8-bit optimizer in order to save memory
- it uses gradient accumulation
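
Gradient accumulation changes how many examples contribute to each optimizer update. The arithmetic below uses the values suggested in this section and assumes a single GPU; note that `auto_find_batch_size=True` may lower the per-device batch size if memory runs out, which would change these numbers.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 500
num_devices = 1  # assumption: training on a single GPU

# Examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples processed over the whole run (max_steps counts optimizer updates).
examples_seen = effective_batch_size * max_steps

print(effective_batch_size, examples_seen)  # 8 4000
```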

In [None]:
# Some Environment Setup
OUTPUT_DIR = "./results/yoda/" # the path to the output directory; where model checkpoints will be saved

In [None]:
import transformers

training_args = transformers.TrainingArguments(
        output_dir=OUTPUT_DIR,
        warmup_steps=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        auto_find_batch_size=True,
        max_steps=500,
        learning_rate=2.5e-5,        # Want a small lr for finetuning
        optim="paged_adamw_8bit",
        logging_steps=25,            # Report the training loss every 25 steps
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step-based schedule
        save_steps=500,              # Save a checkpoint every 500 steps
        eval_strategy="steps",       # Evaluate on a step-based schedule
        eval_steps=25,               # Evaluate every 25 steps
        do_eval=True,                # Run evaluation on the validation set
        report_to="none",
)

The `Trainer` object needs:
- a model (`model` arg)
- a training set (`train_dataset` arg)
- an (optional) validation set (`eval_dataset` arg)
- the training arguments (`args` arg)
- a data collator (`data_collator` arg)

Training may take around 10 minutes on an RTX 3090. In Colab's free version, it will take much longer.

In [None]:
trainer = transformers.Trainer(
    ...
)

trainer.train()

After 500 steps (the value of `max_steps`), training loss should be around 0.3. So, we save the trained model to disk.

In [None]:
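import os

# OUTPUT_DIR already ends with a slash, so join the path instead of concatenating
model_ckpt = os.path.join(OUTPUT_DIR, "stop")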

trainer.save_model(model_ckpt)

## Reloading the Model

Now, let's reload the trained adapter we have just saved. Remember, it only saves a partial model, so we still need the (quantized) base model.

We can use the [`PeftModel.from_pretrained()`](https://huggingface.co/docs/peft/en/package_reference/peft_model#peft.PeftModel.from_pretrained) method to load the fine-tuned model.

In [None]:
from peft import PeftModel

fine_tuned_model = PeftModel.from_pretrained(model, model_ckpt)

Now, let's try out our model!

First, we'll "forget" the response template and see how the model reacts:

In [None]:
eval_prompt = "Luke, I am your father!"
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

fine_tuned_model.eval()
with torch.no_grad():
    print(tokenizer.decode(fine_tuned_model.generate(**model_input, max_new_tokens=100, repetition_penalty=1.1)[0], skip_special_tokens=False))

Nothing happened... what if we add the proper response template (`##[YODA]##>`)?

In [None]:
eval_prompt = "I am your father!"
model_input = tokenizer(eval_prompt+response_template, return_tensors="pt").to("cuda")

fine_tuned_model.eval()
with torch.no_grad():
    print(tokenizer.decode(fine_tuned_model.generate(**model_input, max_new_tokens=100, repetition_penalty=1.1)[0], skip_special_tokens=False))

OK, that's more like it! We got a Yoda-like sentence back!

Let's write a function that handles all the boilerplate for us:

In [None]:
def generate(model, tokenizer, prompt, response_template="", max_new_tokens=100):
    # Tokenize the prompt followed by the (optional) response template
    tokenized_input = tokenizer(prompt + response_template, return_tensors="pt").to("cuda")

    model.eval()
    generation_output = model.generate(
        **tokenized_input,  # passes both input_ids and attention_mask
        num_beams=3,
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.1,
        do_sample=True, top_p=0.9, temperature=0.95,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Keep the special tokens so the response template and EOS remain visible
    output = tokenizer.batch_decode(generation_output, skip_special_tokens=False)[0]
    return output

Now, let's see our new function in action:

In [None]:
generate(fine_tuned_model, tokenizer, 'The Force is strong in this one.', response_template, max_length)

In [None]:
generate(fine_tuned_model, tokenizer, 'I am coming home.', response_template, max_length)

In [None]:
sample = prompt_yoda['test'][1]
generate(fine_tuned_model, tokenizer, sample['prompt'], response_template, max_length)

Finally, we can *disable* the LoRA adapter we trained to see how the base model reacts to the sample sentence (with and without the response template):

In [None]:
with fine_tuned_model.disable_adapter():
    print(generate(fine_tuned_model, tokenizer, sample['prompt']))
    print(generate(fine_tuned_model, tokenizer, sample['prompt'], response_template))