# Pre-requisites and Model Load

Before we load the model, we will need to install the packages mentioned below. These packages do not come out of the box with Google Colab.

We also need to ensure that the correct runtime (for GPU) is selected. You can do this by clicking on `Runtime --> Change runtime type` in the menu bar above. For my project, I picked the T4 GPU, which comes with 16GB of CPU and GPU RAM.

In [None]:
#Check the system specs
!nvidia-smi

In [None]:
#Install the required packages for this project
!pip install einops datasets bitsandbytes accelerate peft

### Loading the Microsoft Phi-2 Model
The Phi-2 model is available on Hugging Face. You can read more about it at https://huggingface.co/microsoft/phi-2
I am also loading the model in `4-bit` precision, which is the "Quantization" part of QLoRA. The memory footprint is much smaller than the default.
Apart from loading the model, we will also set up the tokenizer with the settings needed for training.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

model_name = "microsoft/phi-2"
# Configuration to load model in 4-bit quantized
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type='nf4',
                                bnb_4bit_compute_dtype='float16',
                                #bnb_4bit_compute_dtype=torch.bfloat16,
                                bnb_4bit_use_double_quant=True)


#Loading Microsoft's Phi-2 model with compatible settings
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto',
                                             quantization_config=bnb_config,
                                             trust_remote_code=True)

# Setting up the tokenizer for Phi-2
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          add_eos_token=True,
                                          trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.truncation_side = "left"


In [None]:
print(f"Memory footprint: {model.get_memory_footprint() / 1e9} GB")
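
As a rough sanity check on the number printed above, the weight memory can be estimated from the parameter count alone. The sketch below assumes about 2.7 billion parameters (the figure quoted on the Phi-2 model card) and ignores quantization constants, layers kept in higher precision, and activations, so the real footprint will be somewhat higher. The name `estimate_weight_memory_gb` is illustrative and not part of any library.

```python
# Rough estimate of weight memory at different precisions.
# Assumption: ~2.7e9 parameters; all overheads are ignored.
PHI2_PARAMS = 2.7e9

def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {estimate_weight_memory_gb(PHI2_PARAMS, bits):.2f} GB")
```

Under these assumptions, 4-bit weights need roughly a quarter of the fp16 memory, which is what makes training on a T4 feasible.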

## Login to Hugging Face
We will log in to Hugging Face so we can save the updated model weights when training is done. Make sure to use an access token that has write permission. You can create one at the following location.

https://huggingface.co/settings/tokens

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Tokenize and Prep Dataset for Training

Next, we will load the Databricks Dolly 15K dataset. This dataset was created by Databricks employees and contains examples from several categories. We will use it to run instruction fine-tuning on our Phi-2 model.
We will also split the dataset into train and test sets and tokenize them for fine-tuning.

In [None]:
from datasets import load_dataset

#Load the dataset. Dolly 15K has only the train split.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")


In [None]:
#Split the Dataset to train and test, with 80% for Train and 20% for Testing
dataset = dataset.train_test_split(test_size=0.2)

print(dataset)

#Reassigning to variables
train_dataset = dataset["train"]
test_dataset = dataset["test"]

In [None]:
#Function that creates a prompt from instruction, context, category and response and tokenizes it
def collate_and_tokenize(examples):

    instruction = examples["instruction"][0].replace('"', r'\"')
    context = examples["context"][0].replace('"', r'\"')
    response = examples["response"][0].replace('"', r'\"')
    category = examples["category"][0]

    #Check if context is given for the instruction
    if context.strip():
        context = f"##Context: {context}"

    #Merging into one prompt for tokenization and training
    prompt = f"""##Instruction: {instruction}
    ##Category: {category}
     {context}
     ##Response: {response}
     ##End of Example##
    """
    encoded = tokenizer(
        prompt,
        return_tensors="np",
        padding="max_length",
        truncation=True,
        ## Very critical to keep max_length at 512.
        ## Anything more like 1024 or higher will lead to OOM on T4
        max_length=512
    )

    encoded["labels"] = encoded["input_ids"]
    return encoded
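
One detail of the template above is worth noting: because the triple-quoted f-string is indented inside the function, the spaces at the start of each continuation line become part of the prompt, and an empty context leaves a whitespace-only line. If that is not intended, a minimal sketch of an alternative is shown below. `build_prompt` is a hypothetical helper, and the example values are made up for illustration.

```python
# Hypothetical helper: builds the same fields as collate_and_tokenize,
# but without the leading spaces that the indented f-string introduces.
def build_prompt(instruction: str, category: str, context: str, response: str) -> str:
    lines = [f"##Instruction: {instruction}", f"##Category: {category}"]
    # Only add the context line when the example actually has context
    if context.strip():
        lines.append(f"##Context: {context}")
    lines.append(f"##Response: {response}")
    lines.append("##End of Example##")
    return "\n".join(lines) + "\n"

# Made-up example without context
example = build_prompt(
    instruction="Name the capital of France.",
    category="open_qa",
    context="",
    response="Paris",
)
print(example)
```

The returned string could be passed to `tokenizer(...)` in place of `prompt`; whether the cleaner layout changes training results is not something this notebook measures.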

In [None]:
#We will just keep the input_ids and labels that we add in function above.
columns_to_remove = ["instruction","context", "response", "category"]

#tokenize the training and test datasets
tokenized_dataset_train = train_dataset.map(collate_and_tokenize,
                                            batched=True,
                                            batch_size=1,
                                            remove_columns=columns_to_remove)
tokenized_dataset_test = test_dataset.map(collate_and_tokenize,
                                          batched=True,
                                          batch_size=1,
                                          remove_columns=columns_to_remove)
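
The tokenization function copies `input_ids` into `labels` unchanged, so the loss is also computed on padding positions. A common variation (not what this notebook does) is to set the labels at padded positions to -100, which PyTorch's cross-entropy loss ignores by default. Because the pad token here is the EOS token, the sketch below keys off the attention mask rather than the token ID, so the real EOS token is still learned. `mask_padding_labels` is a hypothetical helper name, and the token IDs are toy values.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token_id if mask == 1 else IGNORE_INDEX
        for token_id, mask in zip(input_ids, attention_mask)
    ]

# Toy sequence: the same ID (50256) serves as EOS and as padding
input_ids = [318, 257, 1332, 50256, 50256, 50256]
attention_mask = [1, 1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Only the two positions with a zero attention mask are masked; the first 50256 (the real end-of-sequence token) keeps its label.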



# Training the model

We will use the LoRA technique to train the model. LoRA significantly reduces the number of trainable parameters, which lowers memory usage and speeds up training.

In [None]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

In [None]:
from peft import prepare_model_for_kbit_training

#gradient checkpointing to save memory
# Apparently Phi-2 does not support this :-(
#model.gradient_checkpointing_enable()

# Freeze base model layers and cast layernorm in fp32
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
print(model)

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['Wqkv','out_proj'], #Run print(model) to find the target_modules
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

#Commenting this for now. I do not see a significant difference in memory
#utilization, with or without the accelerator.
#model = accelerator.prepare_model(model)
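
To make the parameter savings concrete: for a weight matrix with `d_in` inputs and `d_out` outputs, LoRA trains two small matrices with `r * (d_in + d_out)` parameters in total instead of updating all `d_in * d_out` weights. The sketch below applies that to the two target modules. It assumes a hidden size of 2560, 32 layers, and a fused `Wqkv` projection of width 3 x 2560; treat these shapes as assumptions and verify them against `print(model)` or `model.print_trainable_parameters()`. Biases are ignored.

```python
# Assumed Phi-2 shapes (verify against print(model)):
HIDDEN = 2560
NUM_LAYERS = 32
RANK = 16  # r in LoraConfig

# (in_features, out_features) for each targeted linear layer
targets = {
    "Wqkv": (HIDDEN, 3 * HIDDEN),
    "out_proj": (HIDDEN, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the LoRA pair A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

lora_total = NUM_LAYERS * sum(lora_params(i, o, RANK) for i, o in targets.values())
full_total = NUM_LAYERS * sum(i * o for i, o in targets.values())

print(f"LoRA trainable parameters: {lora_total:,}")
print(f"Frozen weights in the same layers: {full_total:,}")
print(f"Ratio: {lora_total / full_total:.2%}")
```

Under these assumptions the adapters add fewer than 1% as many parameters as the layers they attach to.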

### Training the Model and saving to Hub
This is where we set up the training arguments. These arguments were selected to reduce memory usage and improve training performance. I experimented with them for a while before settling on the values below.

Finally, I am saving the model weights to the Hugging Face Hub so we do not lose our work. The training can run for several hours, and I usually keep it running overnight.

In [None]:
import time
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',  # Output directory for checkpoints and predictions
    overwrite_output_dir=True, # Overwrite the content of the output directory
    per_device_train_batch_size=2,  # Batch size for training
    per_device_eval_batch_size=2,  # Batch size for evaluation
    gradient_accumulation_steps=4, # number of steps before updating weights
    #max_steps=1000,  # Total number of training steps
    num_train_epochs=1,  # Number of training epochs
    learning_rate=1e-5,  # Learning rate
    weight_decay=0.01,  # Weight decay
    optim="paged_adamw_8bit", #Keep the optimizer state and quantize it
    fp16=True, #Use mixed precision training
    #For logging and saving
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=200,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,  # Limit the total number of checkpoints
    evaluation_strategy="steps",
    eval_steps=200,
    load_best_model_at_end=True, # Load the best model at the end of training
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_test,
    args=training_args,
)

start_time = time.time()  # Record the start time

trainer.train()  # Start training

end_time = time.time()  # Record the end time
training_time = end_time - start_time  # Calculate total training time

print(f"Training completed in {training_time} seconds.")

#Save model to hub to ensure we save our work.
model.push_to_hub("phi2-qlora-dolly",
                  use_auth_token=True,
                  commit_message="Training Phi-2",
                  private=True)
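
The batch-related arguments interact: with `per_device_train_batch_size=2` and `gradient_accumulation_steps=4` on one GPU, each optimizer step sees 8 examples. The sketch below estimates the number of optimizer steps in one epoch and how many 200-step checkpoints that produces. It assumes roughly 15,000 rows in Dolly 15K and a single GPU, so the exact numbers reported by the `Trainer` may differ slightly.

```python
# Assumptions: ~15,000 rows in the dataset, one GPU.
DATASET_ROWS = 15_000
TEST_PERCENT = 20          # test_size=0.2 in train_test_split
PER_DEVICE_BATCH = 2       # per_device_train_batch_size
GRAD_ACCUM_STEPS = 4       # gradient_accumulation_steps
NUM_EPOCHS = 1             # num_train_epochs
SAVE_EVERY = 200           # save_steps / eval_steps / logging_steps

train_rows = DATASET_ROWS * (100 - TEST_PERCENT) // 100
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
optimizer_steps = NUM_EPOCHS * (train_rows // effective_batch)
checkpoints = optimizer_steps // SAVE_EVERY

print(f"Training rows: {train_rows}")
print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps: {optimizer_steps}")
print(f"Checkpoints written: {checkpoints}")
```

Since `save_total_limit=2`, only the most recent checkpoints are kept on disk at any time, plus the best one needed for `load_best_model_at_end`.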

## Run Inference

First we will run inference without the trained weights and check the output.

In [None]:
#Pick a random example from the dataset.
#We will use the same example for trained model
#`dataset` is a DatasetDict after the split above, so index into a split
example = train_dataset[4500]
query = example["instruction"].replace('"', r'\"')
context = example["context"].replace('"', r'\"')

prompt = f"instruction: {query}\ncontext: {context}\nresponse:"
print(prompt)


In [None]:
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)

inputs.to('cuda')

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)


Next, let's reload the base model, apply the trained LoRA adapter from the Hub, and run inference with it.

In [None]:
from peft import PeftModel, PeftConfig

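#Load the model from hub
model_id = "praveeny/phi2-qlora-dolly"
config = PeftConfig.from_pretrained(model_id)
#Load the base model on the GPU (4-bit, as before), since the inputs are on 'cuda'
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                                  device_map='auto',
                                                  quantization_config=bnb_config,
                                                  trust_remote_code=True)
lora_model = PeftModel.from_pretrained(base_model, model_id)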

#Run inference
outputs = lora_model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)