<a href="https://colab.research.google.com/github/quanticedu/llm-fine-tuning/blob/main/PEFT_with_LoRA_and_QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA)

In this Colab notebook, you'll determine the hyperparameters you'll need to fine-tune the Phi-2 model using the PEFT strategies of LoRA and QLoRA.

> This notebook is based on [@maximelabonne's Llama 2 fine-tuning notebook](https://github.com/mlabonne/llm-course/blob/main/Fine_tune_Llama_2_in_Google_Colab.ipynb), which is, in turn, based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). It also borrows from [this example](https://github.com/brevdev/notebooks/blob/main/phi2-finetune-own-data.ipynb) of Phi-2 fine-tuning.

## Load and Tokenize the Training Data

These two cells contain all the code from the previous lesson. The first cell installs the needed dependencies and tokenizes the training datasets. The second loads the model unquantized (you'll reload it quantized later in the lesson). Refer to the previous lesson if you need a refresher on anything here.

Select the T4 GPU runtime and run the two cells.


In [None]:
# Install and import needed libraries
!pip install accelerate==0.26.1 bitsandbytes==0.42.0 datasets==2.14.6 dill==0.3.7 einops==0.4.1 fsspec==2023.10.0 multiprocess==0.70.15 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.18.1 nvidia-nvtx-cu12==12.1.105 peft==0.7.1 tokenizers==0.15.2 torch==2.1.0 transformers==4.36.2 triton==2.1.0 trl==0.7.0 xxhash==3.3.0

import torch
import random
import numpy as np

if not torch.cuda.is_available():
    raise ValueError("Wrong runtime type, please fix before proceeding. "
                     "We need a GPU for this fine-tuning notebook to work.")

from transformers import (
    AutoModelForCausalLM, # Will be used to load the pre-trained model
    AutoTokenizer, # Will be used to load the pre-trained tokenizer
    BitsAndBytesConfig, # For model quantization settings
    GenerationConfig, # To control generation (inference) from a model
    TrainingArguments, # To specify parameters of the fine-tuning process
    Trainer, # The object that abstracts away the training and evaluation loop
    pipeline, # Stringing together tokenization and inference, for convenience
    logging
)

from datasets import Dataset, DatasetDict # For data handling.

from peft import LoraConfig, PeftModel, get_peft_model # PEFT stands for "Parameter Efficient Fine-Tuning"
                                                       # These objects will help us to run Low Rank Adaptation
                                                       # instead of full fine-tuning.

torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)
random.seed(42); # Set the state of the random number generator. Important for reproducibility.

# Load and tokenize the training data
import gdown # For downloading files by URL (here, the cleaned texts hosted on GitHub Pages)

url_the_sun = 'https://quanticedu.github.io/llm-fine-tuning/TheSunAlsoRisesCleaned.txt'
gdown.download(url_the_sun, "./TheSunAlsoRisesCleaned.txt", quiet=True)

url_men_without = 'https://quanticedu.github.io/llm-fine-tuning/MenWithoutWomenCleaned.txt'
gdown.download(url_men_without, "./MenWithoutWomenCleaned.txt", quiet=True)

with open("MenWithoutWomenCleaned.txt", "r", encoding="utf-8") as f:
    raw_training_text = f.read()

with open("TheSunAlsoRisesCleaned.txt", 'r', encoding="utf-8") as f:
    raw_eval_text = f.read()

# The model that you want to train from the Hugging Face hub
model_name = "microsoft/phi-2"
revision = "523a3d62e793d3f51ad6334ccfd3b67de28771c0"

# Load the pre-trained Phi2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, revision=revision)
tokenizer.pad_token = tokenizer.eos_token # A common, slightly hacky solution.
                                          # Some models are trained without padding,
                                          # but we can usually reuse the eos (end of sequence) token
                                          # for padding purposes.

# Note: 'special tokens' warning can be ignored, as well as the warning about the "HF_TOKEN"

raw_train_data = Dataset.from_dict({"text": [raw_training_text]}) # Wrapping our data into a huggingface library's Dataset object,
raw_eval_data = Dataset.from_dict({"text": [raw_eval_text]})      # which allows convenient data preprocessing options.

raw_datasets = DatasetDict( # Wrapping both datasets into a "DatasetDict" object that can hold different data splits.
    {
        "train": raw_train_data,
        "valid": raw_eval_data,
    }
)

raw_datasets.set_format("torch") # Makes the datasets more convenient to use with pytorch.

# How much context should we consider at once. We'll keep it low to keep things manageable.
context_length = 250

def tokenize(element):
    '''Tokenize a given element (or a batch of elements) of the data.'''

    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        padding=True, # Using a special padding token, extend shorter sequences in the batch of elements to match the length of the longest one.
        return_tensors='pt' # Returned data will be PyTorch tensors.
    )

    tokenized_words = outputs["input_ids"].to("cuda:0")

    # Note that in Causal Language Modeling, the answer to each input is just the next token in the input.
    # So essentially the outputs are inputs shifted by one. Here we provide labels to be the same as inputs
    # because during training, this label shifting will be done for us automatically.
    return {"input_ids": tokenized_words, "labels": tokenized_words.clone()}

tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns="text"
).shuffle()
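The tokenizer call above cuts each long text into windows of at most `context_length` tokens (`return_overflowing_tokens=True`), and the labels are a plain copy of the inputs because the one-token shift happens during training. The sketch below mimics both ideas on made-up token IDs in pure Python. `chunk_tokens` and `shifted_pairs` are hypothetical helpers for illustration only, not part of the tokenizer API, and the real call additionally pads the last, shorter window.

```python
def chunk_tokens(token_ids, max_length):
    """Split a flat list of token IDs into consecutive windows of at most max_length."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


def shifted_pairs(input_ids):
    """Pair each token with the token that follows it (the causal LM target)."""
    return list(zip(input_ids[:-1], input_ids[1:]))


toy_ids = list(range(10))          # stand-in for real token IDs
chunks = chunk_tokens(toy_ids, max_length=4)
print(chunks)                      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(shifted_pairs(chunks[0]))    # [(0, 1), (1, 2), (2, 3)]
```

Each pair reads as "given the left token (and everything before it), predict the right token", which is why identical `input_ids` and `labels` are enough.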

In [None]:
# Specify quantization and load the model
################################################################################
# bitsandbytes (quantization) parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = False # Whether to quantize model weights to 4bits (QLoRA).

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16" # For some GPUs, 'bfloat16' format could be the optimal choice.

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4" # Choosing between different number representation formats.

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Use variables above to define a quantization configuration object.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,

)

device = "cuda:0" # The first among the available GPUs.
device_map = {"": 0} # Specify which elements of the model go to which device.
                     # This is especially relevant for huge models that don't fit on one GPU.
                     # In our case, we map everything to device 0 (GPU number 0) when loading the model.

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=revision,
    device_map=device_map,
    trust_remote_code=True, # This is to let huggingface know that we are downloading this custom model from a trusted source.
    quantization_config=bnb_config if use_4bit else None,
    torch_dtype=torch.float16 # When quantization is not used,
                              # we need to specify this to avoid loading the model in 32bit.
)

model.config.use_cache = False # Caching speeds up inference, but is irrelevant for training/fine-tuning.
                               # We've found that it interferes with Colab behavior when different models are loaded/unloaded,
                               # so we'll keep it off. In practice, for inference, setting it to True (default) is advisable.
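To see why the choice of `torch_dtype` and 4-bit quantization matters on a T4, it helps to estimate how much memory the weights alone need. The sketch below is back-of-the-envelope arithmetic, assuming roughly 2.7 billion parameters for Phi-2. It ignores activations, gradients, optimizer state, and the small overhead of quantization constants, so treat the numbers as lower bounds.

```python
PHI2_PARAMS = 2.7e9  # assumption: approximate parameter count of Phi-2

# Bytes needed to store one weight in each format.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "4-bit (nf4/fp4)": 0.5,
}


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>16}: {weight_memory_gib(PHI2_PARAMS, nbytes):5.1f} GiB")
```

Under these assumptions, the 32-bit weights would take most of the T4's 16 GB before any training state is allocated, which is why the cell above loads the model in 16-bit.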

## Base Model Samples

Before we fine-tune the model, we should get samples of its baseline performance. First we create a pipeline for convenience.


In [None]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

Now we create two samples. In the first we'll use the same prompt we used in the previous lesson. Note that the previous lesson used `do_sample=False` in the generation configuration, while here we set `do_sample=True`, so this output will not be the same as before.

In [None]:
# do_sample=True samples each next token in proportion to its predicted probability;
# do_sample=False would use deterministic (highest-probability) decoding instead.
generation_config = GenerationConfig(max_length=200,
                                      do_sample=True,
                                      use_cache=False,
                                      temperature=1,
                                      eos_token_id=tokenizer.eos_token_id,
                                      bos_token_id=tokenizer.eos_token_id,
                                      pad_token_id=tokenizer.eos_token_id)

# Try the old "sad one-sentence story" prompt we used in the previous lesson:
torch.manual_seed(42)
result = pipe("As promised, here is a one-sentence story that will make you cry: ", generation_config=generation_config)
print(result[0]['generated_text'])
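The difference between `do_sample=False` and `do_sample=True` can be illustrated on a toy vocabulary of three tokens. This is a minimal pure-Python sketch of the idea, not the library's implementation. `softmax` and `pick_next_token` are hypothetical helpers, and the logits are made up.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample, temperature=1.0, rng=random):
    """Greedy decoding picks the top score; sampling draws from the softmax distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
rng = random.Random(42)       # seeded so the sampled sequence is repeatable

print(pick_next_token(toy_logits, do_sample=False))  # always index 0, the top score
print([pick_next_token(toy_logits, do_sample=True, rng=rng) for _ in range(10)])
```

Greedy decoding returns the same token every time, while sampling occasionally picks the less likely tokens. This is why the cells here reset the seed before each generation.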

In the second sample, we'll have the model give us a continuation for an opening of a story.

In [None]:
# Continue a generic story opening:
torch.manual_seed(42)
result = pipe("I went outside,", generation_config=generation_config)
print(result[0]['generated_text'])

## Trainable Modules

To determine which modules to apply LoRA to, we need to know which modules are in the model.

In [None]:
print(model)

## Training Hyperparameters

This cell sets the hyperparameters and creates a PEFT model from the base model.

In [None]:

# Fine-tuned model name for saving later
new_model_name = "phi2-hemingway"

################################################################################
# LoRA parameters
################################################################################

# LoRA attention dimension (rank)
# 1 is the minimum, which would result in extremely limited flexibility.
# The higher the number, the more flexible our LoRA adjustment matrices will be.
# The cost is higher memory demand and longer training.
# If we increase it too much, we'll essentially be doing full fine-tuning on the
# weights to which LoRA is applied (see the "target_modules" parameter of LoraConfig below).
# Common values to try are: 8, 16, 32.
lora_r = 32

# Alpha parameter for LoRA scaling (covered directly in the lesson)
# Higher alpha will result in higher impact of the LoRA adaptation.
# A common rule of thumb is to set this to lora_r times two,
# but it's not guaranteed to be best, and experimentation can help find better values.
lora_alpha = 64

# Dropout probability for LoRA layers
# Dropout refers to randomly "switching off" a certain proportion of neurons.
# This encourages the network not to rely on any one weight too much and thus be
# more robust.
lora_dropout = 0.1


################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs. How many times to go over the dataset.
# Overly high - increased risk of overfitting (memorizing the training set without understanding)
# Overly low - increased risk of cutting training too early.
# A reasonable value can be found by setting a large value and monitoring validation set performance.

num_train_epochs = 3.0

# Enable fp16/bf16 training (set bf16 to True with an A100)
# Mixed precision can speed up training and decrease memory demands by
# using lower-precision number formats for parts of the computation.
# Important for QLoRA. Might not work on some/many GPUs.
fp16 = use_4bit
bf16 = False

# Batch size (how many training examples to work with in parallel) per GPU for training
# Usually, the higher the batch size - the better (results in more stable learning).
# BUT go too high - and you'll quickly run out of GPU memory.
# Generally, select the highest number you can without running out of memory.
per_device_train_batch_size = 2

# Batch size per GPU for evaluation
# This can often be a bit higher since during evaluation we don't need to store gradients.
# The higher this number - the faster the evaluation will be.
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
# If your batch size is small, you can increase this number for more stable training.
# (we'll accumulate evidence for some time before making the weight update step)
# It's essentially the same as batch size, but done sequentially instead of in parallel.
gradient_accumulation_steps = 2

# Enable gradient checkpointing. Lowers memory demand by clever combination
# of caching and recomputation. The cost is a small slowdown.
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping). Prevents gradients from growing too large
# and causing training instabilities or numerical overflows.
max_grad_norm = 0.3


# Weight decay to apply to all layers except bias/LayerNorm weights
# Weight decay prevents individual weights from becoming too large.
# This is a classical way of softly reducing model flexibility / degrees of freedom.
# If weight decay is too high, all weights will be incentivised to become near-zero.
weight_decay = 0.000 # 0.001, 0.005, 0.0001 are all values one might want to try.
                     # Be careful with this parameter, though, as too much weight decay
                     # might make the model forget everything.


# Optimizer to use (intuitively, the training data tells us the direction
# in which each weight should be changed to improve the performance a little.
# The optimizer then 'decides' how exactly to use this information: change
# fast or slow, with or without inertia, etc.)
optim = "paged_adamw_32bit"

# Initial learning rate (AdamW optimizer)
learning_rate = 1e-4 # How fast to step along the directions described above.
                     # AdamW is adaptive, meaning that it will internally adjust this,
                     # but it's still important to choose an adequate starting point.

# Learning rate schedule. The learning rate additionally changes during training according to
# a pre-specified schedule. Usually it gets smaller towards the end of training, with the idea that
# towards the end, we are making finer adjustments than in the beginning.
# A nice article covering different scheduler shapes: https://towardsdatascience.com/a-visual-guide-to-learning-rate-schedulers-in-pytorch-24bbb262c863
# The scheduler is especially important if we were to use the SGD optimizer.
lr_scheduler_type = "cosine"

# Number of training steps. A positive value overrides num_train_epochs; -1 means train for num_train_epochs.
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
# Warm-up means starting with a lower learning rate and ramping it up, to avoid
# overly dramatic changes in the very beginning of learning.
warmup_ratio = 0.03

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

# Define LoRA configuration
peft_config = LoraConfig(
    target_modules=[ # Which model parts to apply LoRA matrices to.
                     # Use print(model) to make a more informed decision.
        "fc1",
        "fc2",
        "k_proj",    # Weights related to queries, keys, and values are a must.
        "q_proj",
        "v_proj",
        "dense"
    ],
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="lora_only",
    task_type="CAUSAL_LM",
)

model_peft = get_peft_model(model, peft_config)
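To make the roles of `lora_r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of what a LoRA-adapted linear layer computes: the frozen output `W x` plus a low-rank correction `B (A x)` scaled by `lora_alpha / r`. The tiny matrices and the helpers `matvec` and `lora_forward` are hypothetical and for illustration only. Dropout and biases are left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled low-rank correction: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]


# Toy 3x3 frozen weight with a rank-1 adapter (r = 1).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3]]            # shape (r, in_features)
B_zero = [[0.0], [0.0], [0.0]]   # shape (out_features, r); B starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapter changes nothing, so the output equals W x.
print(lora_forward(W, A, B_zero, x, lora_alpha=2, r=1))

# After training, B is non-zero and shifts the output along a rank-1 direction.
B_trained = [[0.5], [0.0], [-0.5]]
print(lora_forward(W, A, B_trained, x, lora_alpha=2, r=1))
```

With the values chosen in this notebook (`lora_alpha = 64`, `lora_r = 32`), the scale factor is likewise 2. Only `A` and `B` receive gradient updates, while `W` stays frozen.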

Let's investigate our PEFT model.

In [None]:
print(model_peft)

# From https://github.com/brevdev/notebooks/blob/main/phi2-finetune-own-data.ipynb

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in a model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model_peft)
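The small trainable percentage follows from simple arithmetic: for a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds `r * (in_features + out_features)` weights instead of training all `in_features * out_features`. The sketch below works this out for two layer shapes. The dimensions (hidden size 2560, MLP width 10240) are assumptions to check against the `print(model)` output, and biases are ignored even though `bias="lora_only"` also makes the adapted layers' biases trainable.

```python
def lora_param_count(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * (in_features + out_features)


def full_param_count(in_features, out_features):
    """Number of weights in the full (frozen) matrix."""
    return in_features * out_features


# Assumed layer shapes; verify them against the print(model) output.
layers = {"q_proj": (2560, 2560), "fc1": (2560, 10240)}
r = 32  # same value as lora_r in this notebook

for name, (d_in, d_out) in layers.items():
    lora = lora_param_count(d_in, d_out, r)
    full = full_param_count(d_in, d_out)
    print(f"{name}: LoRA {lora:,} vs full {full:,} ({100 * lora / full:.2f}%)")
```

Doubling `r` doubles the LoRA parameter count, which is the memory and training-time cost mentioned in the hyperparameter comments.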

## Fine-Tuning

In this cell we'll create the trainer object, which uses the training and evaluation datasets along with the PEFT model and arguments to control the training. Before we actually fine-tune, we'll look at the evaluation loss of the pre-trained model on the new evaluation data.

In [None]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size, # Defined with the other hyperparameters above
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    evaluation_strategy="steps",
    eval_steps=25,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio, # Defined with the other hyperparameters above
    lr_scheduler_type=lr_scheduler_type,
    seed=42
)

trainer = Trainer(
    model=model_peft,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets["valid"],
    args=training_arguments,
)

trainer.evaluate() # To see the loss before the start of training.
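The evaluation loss is a mean cross-entropy per token, so exponentiating it gives perplexity, which some readers find easier to interpret. The sketch below uses made-up loss values rather than results from this notebook. `perplexity` is a hypothetical helper.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


for loss in (3.0, 2.5, 2.0):  # made-up loss values for illustration
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

In this notebook, one would pass in the `"eval_loss"` entry of the dictionary returned by `trainer.evaluate()`. A drop of 0.5 in loss shrinks perplexity by a factor of about 1.65, so small-looking loss changes can be meaningful.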


Now we conduct the actual fine-tuning. This process will take a few minutes.

In [None]:
torch.manual_seed(42)
# Train model
trainer.train()

Notice how important it is to evaluate the model at least once before training. Without that baseline, it might seem that the loss barely changed compared to the non-fine-tuned model, because the biggest improvement often happens before the first evaluation during training.
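The `lr_scheduler_type = "cosine"` setting chosen earlier shapes how the learning rate evolves over the run. The sketch below reproduces the general shape of a linear warmup followed by cosine decay in pure Python. `lr_at_step` is a hypothetical helper, the total step count is made up, and the warmup length assumes a 3% warmup fraction like the `warmup_ratio` defined with the hyperparameters.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100                             # hypothetical run length
warmup_steps = math.ceil(0.03 * total_steps)  # assumes a 3% warmup fraction

for step in (0, 1, 3, 25, 50, 75, 100):
    print(step, f"{lr_at_step(step, total_steps, 1e-4, warmup_steps):.2e}")
```

The rate peaks right after warmup and shrinks smoothly afterwards, matching the idea that later updates should be finer adjustments.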

## Post-Training Samples

In [None]:
logging.set_verbosity(logging.CRITICAL) # Ignore warnings
trainer.model.eval(); # Set the model to evaluation mode.
pipe = pipeline(task="text-generation", model=trainer.model, tokenizer=tokenizer)
# do_sample=True samples each next token in proportion to its predicted probability;
# do_sample=False would use deterministic (highest-probability) decoding instead.
generation_config = GenerationConfig(max_length=200,
                                      do_sample=True,
                                      use_cache=False,
                                      temperature=1,
                                      eos_token_id=tokenizer.eos_token_id,
                                      bos_token_id=tokenizer.eos_token_id,
                                      pad_token_id=tokenizer.eos_token_id)

# Sad one-sentence story completion
torch.manual_seed(42)
result = pipe("As promised, here is a one-sentence story that will make you cry: ", generation_config=generation_config)
print(result[0]['generated_text'])


In [None]:
# Generic story beginning completion:
torch.manual_seed(42)
prompt = "I went outside,"
result = pipe(prompt, generation_config=generation_config)
print(result[0]['generated_text'])