<!-- Banner Image -->
<img src="./model_finetuning_banner.jpeg" width="100%">

# Fine-tuning Phi-2 using QLoRA with a Custom Dataset

### 1. Prepare the environment

The combination of model and data I used needs only one GPU, but it needs more than 32GB of GPU VRAM. I used a [runpod.io](https://www.runpod.io/console/gpu-cloud) instance that fit the bill: 1 A100 GPU with 80GB VRAM.

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets einops
!pip install -q -U matplotlib

### 2. Accelerator / W&B

In [None]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

I use Weights & Biases to track the training metrics. It is very useful for checking the evaluation loss graphically and selecting the best checkpoint for final use.

In [None]:
!pip install -q wandb -U

import wandb, os
wandb.login()

wandb_project = "fed-res-finetune"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

### 3. Load Dataset

Choosing a dataset was difficult because the original Phi-2 model seems to have "seen" pretty much all the popular datasets. I needed something a bit obscure and chose this [Federal Reserve question-answer](https://huggingface.co/datasets/clement-cvll/us-federal-reserve-qa/viewer) dataset from Hugging Face. It is very small, but I would rather overfit my model on it and see whether I can really influence Phi-2 deterministically. Usually I split the data into train, validation, and test sets. In this exercise, however, I use the same data for training and validation because the dataset is too small and I want to try overfitting the model to it.

In [None]:
from datasets import load_dataset

train_dataset = load_dataset("clement-cvll/us-federal-reserve-qa", split="train")
eval_dataset = load_dataset("clement-cvll/us-federal-reserve-qa", split="train")
print(train_dataset)
print(eval_dataset)

### 4. Load Base Model

Load Phi-2, quantized to 8-bit.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, BitsAndBytesConfig

base_model_id = "microsoft/phi-2"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)
model = AutoModelForCausalLM.from_pretrained(base_model_id, 
                                             quantization_config=bnb_config, 
                                             torch_dtype=torch.float16, 
                                             trust_remote_code=True)
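
As a rough sanity check on why quantization helps, the sketch below estimates the memory needed for the weights alone. It assumes Phi-2 has about 2.7 billion parameters (per its model card); `estimate_weight_memory_gb` is a hypothetical helper, and the numbers ignore activations, gradients, optimizer state, and quantization overhead.

```python
# Assumption: Phi-2 has roughly 2.7 billion parameters (per its model card).
PHI2_PARAMS = 2.7e9

def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {estimate_weight_memory_gb(PHI2_PARAMS, bits):.2f} GB")
```

The weights are only part of the picture: sequence length, batch size, and optimizer state add to the total during training.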

### 5. Tokenization

`max_length` has a direct impact on the compute requirements. I could compute a suitable value, but I want to plot the lengths and check visually. So I first set up the tokenizer without truncation or padding to get the length distribution.

Set up the tokenize function so that `labels` and `input_ids` are the same.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_eos_token=True,
    add_bos_token=True, 
    use_fast=False, # needed for now, should be fixed soon
)

def tokenize(prompt):
    result = tokenizer(prompt)
    result["labels"] = result["input_ids"].copy()
    return result
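
Copying `input_ids` into `labels` is enough because causal language models in `transformers` shift the labels internally, so the output at position t is scored against the token at position t + 1. Below is a minimal pure-Python illustration of that pairing; the token ids are made up and `next_token_pairs` is a hypothetical helper, not part of any library.

```python
def next_token_pairs(input_ids):
    """Pair each token with the token it is trained to predict.

    This mirrors the shift a causal LM applies internally when
    labels == input_ids: the prediction at position t is compared
    with the label at position t + 1.
    """
    labels = list(input_ids)  # same idea as result["input_ids"].copy()
    return list(zip(input_ids[:-1], labels[1:]))

example_ids = [10, 11, 12, 13]  # made-up token ids
print(next_token_pairs(example_ids))  # [(10, 11), (11, 12), (12, 13)]
```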

Next, convert each sample into the prompt format. I am using the following format, as mentioned in [Phi-2's Hugging Face documentation](https://huggingface.co/microsoft/phi-2):

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""Instruction:{data_point["Context"]}
    Assistant:{data_point["Response"]}"""
    return tokenize(full_prompt)
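
One thing worth checking with a triple-quoted f-string is whitespace: the indentation before `Assistant:` becomes part of the prompt. The sketch below reproduces the same layout with a hypothetical `build_prompt` helper and made-up example strings, and prints the `repr` so the newline and spaces are visible.

```python
def build_prompt(context: str, response: str) -> str:
    # Same layout as generate_and_tokenize_prompt: the second line of the
    # triple-quoted string is indented, so those spaces end up in the prompt.
    return f"""Instruction:{context}
    Assistant:{response}"""

# Made-up question and answer, only to show the resulting string.
prompt = build_prompt("What is the Fed?", "The central bank of the US.")
print(repr(prompt))
```

Whatever layout is used for training, the prompt used at inference time should reproduce it, including the whitespace.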

Reformat the prompt and tokenize each sample:

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

Decode one sample to make sure it was formatted properly.

In [None]:
untokenized_text = tokenizer.decode(tokenized_train_dataset[0]['input_ids']) 
print(untokenized_text)

Plot the distribution of the dataset lengths, so we can determine the appropriate `max_length` for our input tensors.

In [None]:
import matplotlib.pyplot as plt

def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    lengths += [len(x['input_ids']) for x in tokenized_val_dataset]
    print(len(lengths))

    # Plotting the histogram
    plt.figure(figsize=(10, 6))
    plt.hist(lengths, bins=20, alpha=0.7, color='blue')
    plt.xlabel('Length of input_ids')
    plt.ylabel('Frequency')
    plt.title('Distribution of Lengths of input_ids')
    plt.show()

plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

Based on this plot, I am going to set `max_length` to approximately 250.
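
A numeric cross-check can complement the visual one. The sketch below uses the nearest-rank method to find the length that covers a given share of samples; `length_at_percentile` is a hypothetical helper and the lengths are made-up stand-ins for `len(x['input_ids'])`.

```python
import math

def length_at_percentile(lengths, percentile):
    """Smallest length covering at least `percentile` % of samples (nearest-rank)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up token counts standing in for len(x["input_ids"]) per sample.
sample_lengths = [80, 95, 110, 120, 150, 160, 180, 200, 230, 400]
print(length_at_percentile(sample_lengths, 90))   # 230
print(length_at_percentile(sample_lengths, 100))  # 400
```

Picking a high percentile rather than the maximum keeps a few very long samples from inflating the padded length of every sample; those outliers get truncated instead.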

In [None]:
max_length = 250 

# redefine the tokenize function and tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,  
    add_bos_token=True,  
    trust_remote_code=True,
    use_fast=False, # needed for now, should be fixed soon
)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

Verify that each `input_ids` is padded on the left with the `eos_token` (50256), that an `eos_token` (50256) is added at the end, and that the prompt starts with a `bos_token`.

In [None]:
print(tokenized_train_dataset[0]['input_ids'])

In [None]:
untokenized_text = tokenizer.decode(tokenized_train_dataset[4]['input_ids']) 
print(untokenized_text)
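
To make the expected shape of `input_ids` concrete, here is a pure-Python sketch of left padding combined with truncation. It assumes truncation removes tokens from the right (the tokenizer default, as far as I know) and ignores special-token handling; `pad_or_truncate_left` is a hypothetical helper.

```python
EOS_ID = 50256  # Phi-2's eos token id, reused here as the pad token

def pad_or_truncate_left(input_ids, max_length, pad_id=EOS_ID):
    """Truncate on the right to max_length, then pad on the left with pad_id."""
    ids = list(input_ids)[:max_length]
    return [pad_id] * (max_length - len(ids)) + ids

print(pad_or_truncate_left([1, 2, 3], 5))           # [50256, 50256, 1, 2, 3]
print(pad_or_truncate_left([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```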

Check that all the training samples now have the same length, `max_length` (250 in this case).

In [None]:
plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

#### How does the model respond before SFT?

In [None]:
print("Context: " + eval_dataset[0]['Context'])
print("Response: " + eval_dataset[0]['Response'] + "\n")

In [None]:
eval_prompt = """Instruction:What should I do if I have damaged or mutilated currency?
    Assistant:"""

In [None]:
# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)

In [None]:
# Re-init the tokenizer so it doesn't add padding or eos token
eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
    use_fast=False, # needed for now, should be fixed soon
)

In [None]:
device = "cuda"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to(device)

In [None]:
model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=128)[0], skip_special_tokens=True))

That is not the expected response, but Phi-2 was convincingly fluent :))

### 6. Set Up LoRA

Before preparing the model for training, define a helper that reports how many of its parameters are trainable.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Examining the model's layers

In [None]:
print(model)

Define the LoRA config. (I plan to experiment later with variations in `r` and `lora_alpha`.)

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "Wqkv",
        "fc1",
        "fc2",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)
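
To get a feel for where the trainable parameter count comes from, the sketch below applies the LoRA formula: each adapted linear layer gains two small matrices with `r * (in_features + out_features)` parameters in total. The Phi-2 shapes here are assumptions (hidden size 2560, fused `Wqkv` output of 3 × 2560, MLP inner size of 4 × 2560, 32 blocks) and should be checked against the output of `print(model)`; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)

# Assumed Phi-2 shapes (verify against print(model)): hidden size 2560,
# fused Wqkv output 3 * 2560, MLP inner size 4 * 2560, 32 transformer blocks.
HIDDEN, N_LAYERS, R, LORA_ALPHA = 2560, 32, 8, 16
layer_shapes = {
    "Wqkv": (HIDDEN, 3 * HIDDEN),
    "fc1": (HIDDEN, 4 * HIDDEN),
    "fc2": (4 * HIDDEN, HIDDEN),
}
per_block = sum(lora_param_count(i, o, R) for i, o in layer_shapes.values())
total = per_block * N_LAYERS
print(f"LoRA params per block: {per_block:,}; total: {total:,}")
print(f"LoRA scaling (lora_alpha / r): {LORA_ALPHA / R}")
```

If the assumed shapes are right, the total should match the "trainable params" figure printed by `print_trainable_parameters`; a mismatch would suggest the shapes or the matched modules differ.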

Model with LoRA adapters added:

In [None]:
print(model)

### 7. Run Training!

Prepare the SFT parameters. I am choosing 1000 steps. I could train for more steps to overfit the data further, which, depending on the use case, may be the right thing to do.
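
For a sense of scale, the sketch below converts steps into samples seen, using the batch size of 1 and gradient accumulation of 4 from the training arguments in this section. The helper names are hypothetical and the dataset size of 200 is a placeholder, not the real size of the dataset.

```python
def samples_seen(max_steps, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Number of training samples consumed over max_steps optimizer steps."""
    return max_steps * per_device_batch_size * grad_accum_steps * num_devices

def approx_epochs(total_samples, dataset_size):
    """Approximate number of passes over the dataset."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return total_samples / dataset_size

seen = samples_seen(max_steps=1000, per_device_batch_size=1, grad_accum_steps=4)
print(seen)  # 4000
# dataset_size=200 is a placeholder; use len(tokenized_train_dataset) instead.
print(approx_epochs(seen, dataset_size=200))  # 20.0
```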

In [None]:
if torch.cuda.device_count() > 1: 
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
import transformers
from datetime import datetime

project = "fed-res-finetune"
base_model_name = "phi2"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5, 
        logging_steps=25,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        
        save_strategy="steps",       
        save_steps=50,                
        evaluation_strategy="steps", 
        eval_steps=50,               
        do_eval=True,                
        report_to="wandb",           
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"         
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
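
A detail to be aware of: with `mlm=False`, `DataCollatorForLanguageModeling` rebuilds `labels` from `input_ids` and sets every position that equals the pad token id to -100, which the loss ignores. Because the pad token here is the eos token, the real end-of-sequence token is masked as well. A pure-Python sketch of that behaviour, with made-up token ids and a hypothetical helper name:

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss
PAD_ID = 50256       # pad token == eos token in this notebook

def mask_pad_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Left padding, three made-up content tokens, then the final eos token.
padded = [50256, 50256, 11, 12, 13, 50256]
print(mask_pad_labels(padded))  # [-100, -100, 11, 12, 13, -100]
```

If the model should explicitly learn when to stop, a dedicated pad token distinct from eos is one option to consider.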

I will use the checkpoint at step 950 as it gave me the lowest val loss so far.
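
The same selection can be done programmatically instead of reading it off the W&B chart. The sketch below assumes log entries shaped like those in `trainer.state.log_history` (dicts with `step` and, for evaluation entries, `eval_loss`); the values are made up for illustration and are not results from this run.

```python
def best_eval_step(log_history):
    """Return (step, eval_loss) for the entry with the lowest eval_loss."""
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    if not evals:
        raise ValueError("no evaluation entries found")
    return min(evals, key=lambda pair: pair[1])

# Made-up entries shaped like trainer.state.log_history; not real results.
fake_history = [
    {"step": 900, "loss": 0.41},
    {"step": 900, "eval_loss": 0.35},
    {"step": 950, "eval_loss": 0.31},
    {"step": 1000, "eval_loss": 0.33},
]
print(best_eval_step(fake_history))  # (950, 0.31)
```

Since `save_steps` and `eval_steps` are both 50, every evaluated step has a matching checkpoint directory.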

### 8. Playing with the Fine-Tuned Model

Load the base Phi-2 model from Hugging Face again and attach the QLoRA adapters that PEFT produced in the step above.

In [None]:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
ft_model = PeftModel.from_pretrained(base_model, "phi2-fed-res-finetune/checkpoint-950")

In [None]:
eval_prompt = "What should I do if I have damaged or mutilated currency?"
#eval_prompt = "Who is on the Federal Open Market Committee?"
#eval_prompt = """What does the Federal Reserve mean when it says monetary policy remains "accommodative"?"""
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=80)[0], skip_special_tokens=True))

### 🤙 🤙 🤙 That worked. The responses are pretty close to the actual custom dataset used. Indeed awesome, compared to the responses from the pre-finetuned version 🤙 🤙 🤙 

### 9. Pushing the adapters to the Hugging Face Hub

Push only the QLoRA adapters generated by PEFT to the Hub.

In [None]:
from huggingface_hub import login
login("token-here")

In [None]:
ft_model.push_to_hub("spraja08/fine-bitsy")

### 10. Loading from the Hub and running inference

In [None]:
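from peft import PeftConfig, PeftModel
peft_model_id_from_hub = "spraja08/fine-bitsy"
config = PeftConfig.from_pretrained(peft_model_id_from_hub)
# Load the base model named in the adapter config
model_from_hub = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_8bit=True)
# Attach the fine-tuned adapters; without this step only the base model would answer
model_from_hub = PeftModel.from_pretrained(model_from_hub, peft_model_id_from_hub)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)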

In [None]:
model_from_hub.eval()
eval_prompt = """Instruction:What should I do if I have damaged or mutilated currency?
Assistant:"""
#eval_prompt = "Who is on the Federal Open Market Committee?"
#eval_prompt = """What does the Federal Reserve mean when it says monetary policy remains "accommodative"?"""
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    print(tokenizer.decode(model_from_hub.generate(**model_input, max_new_tokens=80)[0], skip_special_tokens=True))