# Supervised Fine-tuning of Open Source LLMs (Mistral, MistralLite, Zephyr alpha/beta)

The purpose of this notebook is to provide a comprehensive, step-by-step tutorial for supervised fine-tuning of an open-source LLM (Large Language Model) with QLoRA.

This guide will be divided into two parts:

**Part 1: Setting up and Preparing for Fine-Tuning**
1. Installing and loading the required modules
2. Loading a pre-trained model and its associated tokenizer
3. Loading the training dataset
4. Preprocessing the training dataset for model fine-tuning

**Part 2: Fine-Tuning and Open-Sourcing**
1. Configuring PEFT (Parameter Efficient Fine-Tuning) method QLoRA for efficient fine-tuning
2. Fine-tuning the pre-trained model
3. Saving the fine-tuned model and its associated tokenizer
4. Pushing the fine-tuned model to the Hugging Face Hub for public usage

### Installing Required Libraries

First, we will install some required libraries.

`transformers`: for loading a large language model and fine-tuning it.

`bitsandbytes`: for loading the model in 4-bit precision.

`peft`: for fine-tuning a small number of parameters.

`trl`: for supervised fine-tuning through its `SFTTrainer` class (the library also supports training transformer language models with Reinforcement Learning).


In [None]:
!pip install  accelerate --progress-bar off
!pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118 --progress-bar off
!pip install  peft --progress-bar off
!pip install  bitsandbytes --progress-bar off
!pip install git+https://github.com/huggingface/transformers --progress-bar off
!pip install  xformers==0.0.21 --progress-bar off
!pip install git+https://github.com/huggingface/trl.git --progress-bar off
!pip install deepspeed==0.9.5 --progress-bar off
!pip install evaluate==0.3.0 --progress-bar off
!pip install wandb langchain --progress-bar off

### Loading Required Libraries

Next, we will load the required libraries for fine-tuning a Large Language Model (LLM) such as Mistral or Zephyr. We will look at each imported class in greater detail in subsequent sections.

In [None]:
import os
import pandas as pd
from datasets import Dataset

import torch
import bitsandbytes as bnb
from trl import SFTTrainer

from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments

In [None]:
torch_version = torch.__version__
if torch_version == "2.0.1+cu118":
    print(f"Torch version is satisfied: {torch.__version__}")
else:
    print("Torch version should be 2.0.1+cu118. Please ensure that before going further")

In [None]:
def create_bnb_config(load_in_4bit, bnb_4bit_use_double_quant, bnb_4bit_quant_type, bnb_4bit_compute_dtype):
    """
    Configures model quantization method using bitsandbytes to speed up training and inference

    :param load_in_4bit: Load model in 4-bit precision mode
    :param bnb_4bit_use_double_quant: Nested quantization for 4-bit model
    :param bnb_4bit_quant_type: Quantization data type for 4-bit model
    :param bnb_4bit_compute_dtype: Computation data type for 4-bit model
    """

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    )
    return bnb_config

In [None]:
def load_model(model_name, bnb_config):
    """
    Loads model and model tokenizer

    :param model_name: Hugging Face model name
    :param bnb_config: Bitsandbytes configuration
    """

    # Get the number of GPU devices and set the maximum memory per device
    n_gpus = torch.cuda.device_count()
    max_memory = f'{22500}MB'

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        max_memory={i: max_memory for i in range(n_gpus)},
    )

    # Load the model tokenizer and add a dedicated padding token
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    tokenizer.padding_side = "right"

    model.config.use_cache = False
    model.config.pretraining_tp = 1
    model.resize_token_embeddings(len(tokenizer))

    return model, tokenizer

### Initializing Transformers and Bitsandbytes Parameters

We will now initialize input parameters for the `transformers` and `bitsandbytes` modules.

In [None]:
# The pre-trained model from the Hugging Face Hub to load and fine-tune
models_list = ["mistralai/Mistral-7B-v0.1", "amazon/MistralLite", "HuggingFaceH4/zephyr-7b-beta"]

model_name = models_list[-1]

# Activate 4-bit precision base model loading
use_4bit = True

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Compute data type for 4-bit base models
compute_dtype = torch.float16

In [None]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
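To see why 4-bit loading matters for a 7B model, a rough back-of-the-envelope estimate of weight memory helps. The sketch below assumes roughly 7.24 billion parameters (an approximate figure for a Mistral-7B-class model, not a value taken from this notebook) and ignores quantization constants, activations, optimizer state, and the LoRA adapters, so treat the numbers as a lower bound.

```python
# Rough weight-memory estimate. The parameter count is an assumed, approximate figure.
APPROX_PARAMS = 7.24e9  # assumption: ~7.24B parameters for a Mistral-7B-class model


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(APPROX_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(APPROX_PARAMS, 4)    # 4-bit weights, constants ignored

print(f"fp16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {nf4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint, which is what makes a single 24 GB GPU a plausible target for the `max_memory` setting used in `load_model`.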

Finally, we will call the above functions to get `model` and `tokenizer` objects.

In [None]:
# Load model from Hugging Face Hub with model name and bitsandbytes configuration

bnb_config = create_bnb_config(use_4bit, use_nested_quant, bnb_4bit_quant_type, compute_dtype)

model, tokenizer = load_model(model_name, bnb_config)

In [None]:
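# Adding special tokens for autocomplete. Comment this out for other use cases
tokenizer.add_tokens(["<AGENT_NAME>", "<PERSON>", "<URL>", "PHONE_NUMBER", "EMAIL_ADDRESS", "<CREDIT_CARD>"],
                     special_tokens=False)

# Resize the embeddings again so the newly added tokens get embedding rows
model.resize_token_embeddings(len(tokenizer))

tokenizer.pad_token_id, len(tokenizer)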

### Loading Dataset

In [None]:
# Load dataset

train_df = pd.read_csv("rogers_data/rogers_train_df.csv")
val_df = pd.read_csv("rogers_data/rogers_val_df.csv")

In [None]:
print(f'Number of prompts: {len(train_df)}')
print(f'Column names are: {train_df.columns}')

In [None]:
train_df.head()

In [None]:
train_df.iloc[3]["text"]

### Getting Maximum Sequence Length of the Pre-trained Model

In the next cell, we will define the `get_max_length` function to find the maximum sequence length of the loaded model. This function reads the model configuration and looks for the maximum sequence length under several configuration keys that may contain it. If none of them is set, it defaults to 1024. We will use the maximum sequence length during dataset preprocessing to remove records that exceed the context length, because the pre-trained model cannot process them.

In [None]:
getattr(model.config, "max_position_embeddings")

In [None]:
def get_max_length(model):
    """
    Extracts maximum token length from the model configuration

    :param model: Hugging Face model
    """

    # Initialize a "max_length" variable to store maximum sequence length as null
    max_length = None
    # Find maximum sequence length in the model configuration and save it in "max_length" if found
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    # Set "max_length" to 1024 (default value) if maximum sequence length is not found in the model configuration
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
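The length-based filtering described above can be sketched as follows. This is a minimal illustration, not the notebook's preprocessing: `count_tokens` is a hypothetical whitespace-based stand-in for the real tokenizer (in practice you would count the `input_ids` the tokenizer returns), `SimpleNamespace` stands in for `model.config`, and the toy records are made up.

```python
from types import SimpleNamespace


def lookup_max_length(config, default=1024):
    """Same lookup order as get_max_length, applied to a bare config object."""
    for key in ("n_positions", "max_position_embeddings", "seq_length"):
        value = getattr(config, key, None)
        if value:
            return value
    return default


def count_tokens(text):
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def drop_long_records(records, max_length):
    """Keep only the records whose token count fits within max_length."""
    return [r for r in records if count_tokens(r["text"]) <= max_length]


# Toy config and data, for illustration only
toy_config = SimpleNamespace(max_position_embeddings=8)
records = [
    {"text": "short prompt"},
    {"text": "this prompt has far too many words to fit in the toy context window"},
]

max_length = lookup_max_length(toy_config)
kept = drop_long_records(records, max_length)
print(max_length, [r["text"] for r in kept])
```

With a real tokenizer the same filter could be applied to the `Dataset` objects created below, though the exact preprocessing is left to the reader.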

In [None]:
# Convert Dataframe to Dataset

trainds = Dataset.from_pandas(train_df)
valds = Dataset.from_pandas(val_df)

In [None]:
trainds, valds

### Creating PEFT Configuration


Fine-tuning pretrained LLMs on downstream datasets results in huge performance gains when compared to using the pretrained LLMs out-of-the-box. However, as models get larger and larger, full fine-tuning becomes infeasible to train on consumer hardware. In addition, storing and deploying fine-tuned models independently for each downstream task becomes very expensive, because fine-tuned models are the same size as the original pretrained model. Parameter-Efficient Fine-tuning (PEFT) approaches are meant to address both problems!


PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLMs, thereby greatly decreasing the computational and storage costs. It also helps in portability, wherein users can tune models using PEFT methods to get tiny checkpoints worth a few MB compared to the large checkpoints of full fine-tuning.


**In short, PEFT approaches enable you to get performance comparable to full fine-tuning while only having a small number of trainable parameters.**


Hugging Face provides the PEFT library, which provides the latest Parameter-Efficient Fine-tuning techniques seamlessly integrated with Hugging Face Transformers and Hugging Face Accelerate.


There are several PEFT methods. In the next cell, we will use QLoRA, a method that reduces the memory usage of LLM fine-tuning while largely preserving performance, and configure it with the `LoraConfig` class from the `peft` library.


QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then frozen, and a relatively small number of trainable parameters are added to the model in the form of Low-Rank Adapters. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. The LoRA layers are the only parameters being updated during training.
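To make "a relatively small number of trainable parameters" concrete: a LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds two low-rank matrices, which together hold `r * (in_features + out_features)` trainable parameters. The sketch below applies that formula to layer shapes that are assumed here for a Mistral-7B-style model (hidden size 4096, MLP size 14336, 1024-dimensional key/value projections, 32 layers). Check `model.config` for the real values; `lm_head` is left out.

```python
LORA_R = 16  # the rank used for lora_r in this notebook

# (in_features, out_features) per projection. Assumed shapes: verify against model.config.
ASSUMED_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def lora_params(in_features, out_features, r):
    """Trainable parameters added by one LoRA adapter: A is r x in, B is out x r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in ASSUMED_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} adapter parameters per layer, {total:,} in total")
```

Under these assumptions the adapters add tens of millions of parameters, a small fraction of a model with several billion frozen weights.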

In [None]:
def create_peft_config(lora_r, lora_alpha, target_modules, lora_dropout, bias, task_type):
    """
    Creates Parameter-Efficient Fine-Tuning configuration for the model

    :param lora_r: LoRA attention dimension
    :param lora_alpha: Alpha parameter for LoRA scaling
    :param target_modules: Names of the modules to apply LoRA to
    :param lora_dropout: Dropout probability for LoRA layers
    :param bias: Specifies if the bias parameters should be trained
    :param task_type: Task type for the PEFT model (e.g. "CAUSAL_LM")
    """
    config = LoraConfig(
        r = lora_r,
        lora_alpha = lora_alpha,
        target_modules = target_modules,
        lora_dropout = lora_dropout,
        bias = bias,
        task_type = task_type,
    )
    return config

### Finding Modules for LoRA Application

In the next cell, we will define the `find_all_linear_names` function to find the modules to apply LoRA to. This function iterates over `model.named_modules()`, keeps the 4-bit linear layers, and stores the last component of each module name in a set so that every name appears only once.

In [None]:
def find_all_linear_names(model):
    """
    Find modules to apply LoRA to.

    :param model: PEFT model
    """

    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # if 'lm_head' in lora_module_names:
    #     lora_module_names.remove('lm_head')
    print(f"LoRA module names: {list(lora_module_names)}")
    return list(lora_module_names)

In [None]:
find_all_linear_names(model)
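The name handling in `find_all_linear_names` boils down to keeping the last component of each dotted module path. Here is a small sketch on made-up module paths (the paths below are illustrative and are not read from the loaded model):

```python
def leaf_module_names(qualified_names):
    """Return the distinct last components of dotted module paths, sorted for stable output."""
    return sorted({name.split(".")[-1] for name in qualified_names})


# Illustrative paths only; the real ones come from model.named_modules()
example_paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]

print(leaf_module_names(example_paths))
```

Because the names are collected in a set, a projection that appears in every layer is listed only once, which is the form `target_modules` expects.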

### Calculating Trainable Parameters

We can use the `print_trainable_parameters` function to find out the number and percentage of trainable model parameters. This function will calculate the number of total parameters in `model.named_parameters()` and then those that would get updated.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.

    :param model: PEFT model
    """

    trainable_params = 0
    all_param = 0

    for _, param in model.named_parameters():
        num_params = param.numel()
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel
        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params

    print(
        f"All Parameters: {all_param:,d} || Trainable Parameters: {trainable_params:,d} || Trainable Parameters %: {100 * trainable_params / all_param}"
    )

In [None]:
print_trainable_parameters(model)

### Fine-tuning the Pre-trained Model

The remaining cells wrap up everything we have done so far and start the fine-tuning process. They are meant to perform the following model preprocessing operations to prepare the model for training:


1. Enable gradient checkpointing to reduce memory usage during fine-tuning.
2. Use the `prepare_model_for_kbit_training` function from PEFT to prepare the model for fine-tuning.
3. Call `find_all_linear_names` to get the module names to apply LoRA to.
4. Create LoRA configuration by calling the `create_peft_config` function.
5. Wrap the base Hugging Face model for fine-tuning to PEFT using the `get_peft_model` function.
6. Print the trainable parameters.


For training, we will instantiate an `SFTTrainer` object from `trl`. This class requires the model, the training and validation datasets, and a `TrainingArguments` object. The main training arguments are listed below.


`per_device_train_batch_size`: The batch size per GPU/TPU/CPU for training.


`gradient_accumulation_steps`: Number of update steps to accumulate the gradients for, before performing a backward/update pass.


`warmup_steps`: Number of steps used for a linear warmup from 0 to `learning_rate`.


`max_steps`: If set to a positive number, the total number of training steps to perform.


`learning_rate`: The initial learning rate for Adam.


`fp16`: Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.


`logging_steps`: Number of update steps between two logs.


`output_dir`: The output directory where the model predictions and checkpoints will be written.


`optim`: The optimizer to use for training.
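To see how `warmup_steps`, `max_steps`, and `learning_rate` interact, here is a standalone sketch of a linear-warmup-then-cosine-decay schedule. It follows the usual formula for this kind of schedule rather than calling into `transformers`, so small differences from the library's scheduler are possible. The function name is illustrative, and the numbers are the ones this notebook sets for training.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate at an optimizer step: linear warmup, then half-cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, between 0 and 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 200, 350, 500):
    print(step, f"{lr_at_step(step, 2e-5, 200, 500):.2e}")
```

With 200 warmup steps out of 500, the learning rate spends 40% of training ramping up and reaches its peak only at step 200, which is worth keeping in mind when choosing `warmup_steps`.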

Initializing QLoRA and TrainingArguments parameters below for training.

In [None]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 16

# Alpha parameter for LoRA scaling
lora_alpha = 64

# Dropout probability for LoRA layers
lora_dropout = 0.1

# Bias
bias = "none"

# Task type
task_type = "CAUSAL_LM"

In [None]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "results/autocomplete_7b"

# Batch size per GPU for training
per_device_train_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-5

# Number of training epochs (overridden by max_steps below)
num_train_epochs = -1

# Learning rate schedule (cosine decay after warmup)
lr_scheduler_type = "cosine"

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Ratio of steps for a linear warmup (from 0 to learning rate); has no effect when warmup_steps > 0
warmup_ratio = 0.1

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = False

# Number of training steps (overrides num_train_epochs)
max_steps = 500

# Linear warmup steps from 0 to learning_rate
warmup_steps = 200

# Enable fp16/bf16 training (set bf16 to True with an A100)
bf16 = True
fp16 = False

# Log every X updates steps
logging_steps = 250

save_steps = 250
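Two of the settings above interact in ways that are easy to miss. The sketch below computes the effective batch size and mirrors the documented behavior that a positive `warmup_steps` takes precedence over `warmup_ratio`. The helper names and the single-GPU assumption are illustrative and are not part of `transformers`.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices):
    """Number of sequences that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def resolve_warmup_steps(warmup_steps, warmup_ratio, total_steps):
    """A positive warmup_steps wins; otherwise the ratio is applied to the total steps."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)


batch = effective_batch_size(4, 1, num_devices=1)  # assumption: a single GPU
print("sequences per update:", batch)
print("sequences over 500 steps:", batch * 500)
print("warmup with warmup_steps=200:", resolve_warmup_steps(200, 0.1, 500))
print("warmup with warmup_steps=0:  ", resolve_warmup_steps(0, 0.1, 500))
```

In this configuration the `warmup_ratio` of 0.1 is effectively unused, because `warmup_steps` is set to 200.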

In [None]:
################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = 1024

# Pack multiple short examples in the same input sequence to increase efficiency
packing = True
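Packing can be pictured as concatenating tokenized examples, separated by an end-of-sequence token, and cutting the resulting stream into fixed-length blocks. The sketch below illustrates that idea on plain integer lists. It is not TRL's implementation, and the token ids, `EOS_ID`, and the choice to drop the trailing partial block are assumptions made for the example.

```python
EOS_ID = 2  # hypothetical end-of-sequence token id


def pack_sequences(tokenized_examples, block_size, eos_id=EOS_ID):
    """Concatenate examples (each followed by EOS) and split into full blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    n_full = len(stream) // block_size  # any trailing partial block is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Made-up token ids for three short examples
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = pack_sequences(examples, block_size=4)
print(blocks)
```

Every block is completely filled, so no compute is spent on padding tokens, at the cost of some examples being split across block boundaries.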

The cells below prepare the model for k-bit training, attach the LoRA adapters, and fine-tune the pre-trained model on our preprocessed autocomplete dataset.

In [None]:
new_model = "rogers-autocomplete-zephyr"

model = prepare_model_for_kbit_training(model)

# LoRA module names (listed explicitly here instead of calling find_all_linear_names)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"]

# Create PEFT configuration for these modules and wrap the model to PEFT
peft_config = create_peft_config(lora_r, lora_alpha, target_modules, lora_dropout, bias, task_type)

model = get_peft_model(model, peft_config)

# Print information about the percentage of trainable parameters
print_trainable_parameters(model)

In [None]:
training_arguments = TrainingArguments(
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=warmup_steps,
    max_steps=max_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    save_steps=save_steps,
    seed=42,
    save_strategy="steps",
    logging_steps=logging_steps,
    output_dir=output_dir,
    logging_first_step=True,
    evaluation_strategy="steps",
    group_by_length=group_by_length,
    optim=optim,
    report_to="wandb",
)


# Training parameters
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=trainds,
    eval_dataset=valds,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    peft_config=peft_config,
    args=training_arguments,
    packing=packing,
    neftune_noise_alpha=5
)

# Launch training and log metrics
print("Training...")

trainer.train()

# Log in to Weights & Biases before training (for example with `wandb login`);
# do not store API keys in the notebook

In [None]:
# trainer.train(resume_from_checkpoint=True)

In [None]:
# Save model

trainer.model.save_pretrained(new_model)