# DS 235 | HOMEWORK 5 | VYACHESLAV STEPANYAN

# Fine-Tuning LLMs with QLoRA

This notebook guides you through Supervised Fine-Tuning (SFT) of a pre-trained LLM. We will use HuggingFace libraries to load, quantize, and train an LLM with QLoRA, a technique for memory-efficient training of very large models.

We will load a dataset of instruction-response pairs and train a base model to follow the instructions in it. Hopefully, we can get a decent model even though QLoRA trains only a small fraction of the parameters.

**Note: It is highly recommended to complete the exercises using the CPU to not waste resources. After completing the exercise, you may shift to GPU and train the model!**

# 1. Installing the Dependencies

We will need:

1. [transformers](https://huggingface.co/docs/transformers/index): for loading and using transformer-based pre-trained models
2. [peft](https://huggingface.co/docs/peft/index): for parameter-efficient training methods (LoRA, QLoRA, etc.)
3. [bitsandbytes](https://huggingface.co/docs/bitsandbytes/index): for quantization
4. [trl](https://huggingface.co/docs/trl/en/sft_trainer): for training

Even though this notebook guides you through the usage of these libraries, you are encouraged to explore their functionality on your own.

In [1]:
%%capture
!pip install -q accelerate peft bitsandbytes transformers trl

In [2]:
import torch

# Set device to GPU (CUDA) if available, otherwise fallback to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


# Login to HuggingFace

If you don't have an account, register and create a token at https://huggingface.co/settings/tokens  
Visit https://huggingface.co/mistralai/Mistral-7B-v0.1 and click agree to be able to access the model.

In [3]:
from huggingface_hub import notebook_login

notebook_login()

In [4]:
import os
import re

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
import transformers
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

## Model and Dataset

Here, we define the dataset and the model to be loaded from HuggingFace. We will fine-tune the Mistral-7B model, which is one of the best open-source models available to date. It was pre-trained on a large, high-quality dataset. However, the base model has not been trained to follow instructions, so without SFT it is awkward to interact with. We will perform SFT on the base Mistral model and see how it turns out.

We will use the [DataBricks: Dolly 15K](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset for supervised fine-tuning. It contains instruction-completion samples that cover several categories outlined in the InstructGPT paper (Q and A, information extraction, etc.). The goal is to teach pre-trained LLMs to follow instructions and respond in desirable ways.
Click on the link to go to the HuggingFace page of the dataset. Explore some samples there to get an idea of what the model's responses should look like according to the data.

In [5]:
# The model that you want to train from the Hugging Face hub
model_name = "mistralai/Mistral-7B-v0.1"

# The instruction dataset to use
dataset_name = "databricks/databricks-dolly-15k"

# Fine-tuned model name
new_model = "Mistral-7B-sft-dolly"

# 2. Loading and Preparing the Dataset

We will now proceed with loading the data and preparing training samples. Our dataset contains various instruction-completion samples. The category of the samples is also available.

Additionally, note that some samples have context, and their instruction and completion relate to that context. We will use different input-output templates depending on whether a sample has context or not. This lets the model know that we want it to use the context to answer our requests.

In [6]:
dataset = load_dataset(dataset_name, split="train")
print(dataset)

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})


In [7]:
print(dataset[0])
print(dataset[1])

{'instruction': 'When did Virgin Australia start operating?', 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}
{'instruction': 'Which is a species of fish? Tope or Rope', 'context': '', 'response': 'Tope', 'category': 'classification'}


## Task 1: Tokenization

Write a function to tokenize the text with the model's tokenizer and compute the number of tokens. Make sure not to truncate or pad as we will use the number of tokens to filter out long samples. Return a dict with input ids and length of input ids to be able to do this.

In [8]:
# Load the model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, padding_side="right", trust_remote_code=True)
eos_token = tokenizer.eos_token # Get the eos token for formatting

# By default, Mistral's tokenizer doesn't have a pad token. We will add it to be able to pad short sequences.
# We will use the <unk> token as the pad token.
tokenizer.pad_token = tokenizer.unk_token

def tokenize_fn(text, tokenizer):
    # Tokenize the text
    tokens = tokenizer.tokenize(text)
    
    # Convert tokens to input ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    # Compute the length of input ids
    input_ids_lens = len(input_ids)

    return { "ids": input_ids, "lens": input_ids_lens }

tokenize_fn(
    """
    ### Instruction: Give me a list of 3 good ways to gain weight.
    ### Response: Here is a list of 3 good ways to gain weight:
    1. Eat more food, especially highly processed foods and foods high in sugars
    2. Don't exercise, sit and do nothing for as long as you can
    3. Drink sugary drinks for extra calories

    ### Instruction: Which is a species of fish? Escolar or Escobar
    ### Response: Escolar</s>
    """,
    tokenizer
)

{'ids': [28705,
  13,
  2287,
  774,
  3133,
  3112,
  28747,
  16104,
  528,
  264,
  1274,
  302,
  28705,
  28770,
  1179,
  4342,
  298,
  8356,
  4336,
  28723,
  13,
  2287,
  774,
  12107,
  28747,
  4003,
  349,
  264,
  1274,
  302,
  28705,
  28770,
  1179,
  4342,
  298,
  8356,
  4336,
  28747,
  13,
  260,
  28740,
  28723,
  413,
  270,
  680,
  2887,
  28725,
  4012,
  6416,
  16244,
  14082,
  304,
  14082,
  1486,
  297,
  28670,
  1168,
  13,
  260,
  28750,
  28723,
  3189,
  28742,
  28707,
  9095,
  28725,
  1943,
  304,
  511,
  2511,
  354,
  390,
  1043,
  390,
  368,
  541,
  13,
  260,
  28770,
  28723,
  2985,
  655,
  28670,
  628,
  16195,
  354,
  4210,
  24336,
  13,
  13,
  2287,
  774,
  3133,
  3112,
  28747,
  9595,
  349,
  264,
  7018,
  302,
  8006,
  28804,
  13731,
  8330,
  442,
  13731,
  598,
  283,
  13,
  2287,
  774,
  12107,
  28747,
  13731,
  8330,
  2,
  28705,
  13,
  260],
 'lens': 119}

## Task 2: Formatting Instruction-Completion Samples for Model Input

To train the model, we need to convert the dataset samples into text that can be used as model input. We will use templates to format samples from the dataset. The formatting functions we create will be passed to the [trainer](https://huggingface.co/docs/trl/en/sft_trainer), which uses them to build inputs and train the model. The templates have special delimiters for the context, the instruction, and the response. However, we will compute the loss only on the completion tokens. This teaches the model to complete the response by understanding the context and instruction.

Write a function that formats the context, instruction, and response of a single sample into one text using the following templates. If the sample has context, use the template with the context header. Otherwise, use the regular template, which has only the instruction and response headers.

Your result should look something like this:

**### Instruction: Some text**  
**### Response: The response**

Then, write another function that does exactly the same for several samples and returns the formatted texts. This function will be used to work with a batch of data.

The model will process the input and the context and learn to complete the response.

In [9]:
template_no_context =  "### Instruction: {}\n ### Response: {}" + eos_token # Template to use when there is no context
template_context =  "### Context: {}\n ### Instruction: {}\n ### Response: {}" + eos_token # Template to use when there is context

def get_single_formatted_input(example):
    """
    example: A dict with the following keys
      'instruction': str
      'context': str
      'response': str
      'category': str
    """

    if example['context']:
        text = template_context.format(example['context'], example['instruction'], example['response'])
    else:
        text = template_no_context.format(example['instruction'], example['response'])


    return text

context_sample = { "instruction": "test", "context": "test", "response": "test"}
no_context_sample = { "instruction": "test", "context": "", "response": "test"}

context_formatted = get_single_formatted_input(context_sample)
no_context_formatted = get_single_formatted_input(no_context_sample)

print(context_formatted)
print(no_context_formatted)

def get_formatted_inputs(examples):
    """
    examples: A dict with the following keys
      'instruction': a list of str
      'context': a list of str
      'response': a list of str
      'category': a list of str

      examples["instruction"][0] will be the instruction of the 1st sample in the batch
    """
    output_texts = []

    for i in range(len(examples['instruction'])):
        example = {
            'instruction': examples['instruction'][i],
            'context': examples['context'][i] if 'context' in examples and len(examples['context']) > i else '',
            'response': examples['response'][i]
        }
        output_texts.append(get_single_formatted_input(example))

    return output_texts

print("\n\n".join(get_formatted_inputs(dataset[0:3])))

### Context: test
 ### Instruction: test
 ### Response: test</s>
### Instruction: test
 ### Response: test</s>
### Context: Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
 ### Instruction: When did Virgin Australia start operating?
 ### Response: Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.</s>

### Instruction: Which is a species of fish? Tope or Rope
 ### Response: Tope</s>

### Instruction: Why can camels survive for long without water?
 ### Response: Camels use the 

## Task 3: Filtering Long Samples

Write a function that formats a sample from the dataset, tokenizes the formatted text, and reports whether it fits within 256 tokens. Use it to filter out samples with more than 256 tokens.

In [10]:
def filter_long_samples(sample):
    # Format a single sample with the same template used for training
    formatted_text = get_single_formatted_input(sample)
    # Tokenize without truncation or padding and keep the sample only if it fits
    return tokenize_fn(formatted_text, tokenizer)["lens"] <= 256

# The filter only takes effect once it is applied to the dataset, for example:
# dataset = dataset.filter(filter_long_samples)


print(f"Filtered dataset has {len(dataset)} samples")

Filtered dataset has 15011 samples


Now, let's split the dataset to proceed with training.

In [11]:
dataset = dataset.shuffle(seed=42)
splits = dataset.train_test_split(test_size=0.05)
train_dataset, val_dataset = splits["train"], splits["test"]

data_module = dict(train_dataset=train_dataset, eval_dataset=val_dataset)

# 3. Defining Training Parameters

There are many parameters and configuration options to define.

1. We will need to define the base model and dataset.
2. Configure LoRA and define which modules it will target.
3. Define quantization configuration to be able to load and train huge models on small GPUs.

## Task 4: LoRA Configuration

Here we need to define LoRA configuration to be able to fine-tune Mistral LLM.

Do some research and define your initial LoRA parameters.

You need to define 4 things:
1. The rank *r*
2. The scaling factor alpha. It determines how strongly the adaptation layers' weights affect the base model's: the LoRA update is scaled by alpha / r, so a higher alpha means the LoRA layers act more strongly on the base model.
3. LoRA dropout for regularization
4. The list of modules to attach LoRA adapters to.
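
To get a feel for how these four choices interact, here is a rough sketch that counts the trainable parameters LoRA would add and computes the alpha / r scaling. The layer shapes are assumptions taken from the public Mistral-7B-v0.1 configuration (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers), and the helper `lora_param_count` is hypothetical, not part of peft.

```python
# Assumed (in_features, out_features) for each linear layer in one Mistral-7B block.
# These shapes are an assumption of this sketch, based on the public model config.
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def lora_param_count(r, target_modules, shapes=MISTRAL_7B_SHAPES, num_layers=NUM_LAYERS):
    """Count the parameters in the LoRA matrices A (r x in) and B (out x r)."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


r, alpha = 32, 64
all_modules = list(MISTRAL_7B_SHAPES)
attention_only = ["q_proj", "v_proj"]

print("scaling alpha / r:", alpha / r)
print("all seven modules:", lora_param_count(r, all_modules))
print("q_proj and v_proj only:", lora_param_count(r, attention_only))
```

Targeting fewer modules or lowering *r* shrinks the adapter roughly proportionally, which is one way to trade quality for memory.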

In [12]:
# LoRA attention dimension
lora_r = 32

# Alpha parameter for LoRA scaling
lora_alpha = 64

# Dropout probability for LoRA layers
lora_dropout = 0.05

# Modules to target with LoRA. Available options: ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj', 'lm_head']
target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj']

## Quantization Configuration

Here we define the quantization strategy for training. Quantization is necessary to be able to load very large models on GPUs with small VRAM.

We will load the model in 4-bit precision for QLoRA training. QLoRA requires specifying both a storage data type and a compute data type. We store the weights in the 4-bit NormalFloat (NF4) type, then dequantize them and perform computations in 16-bit floats during the forward and backward passes. This uses less VRAM to store the model while still computing in the higher-precision fp16 format. Refer to the QLoRA paper for more details on quantization and computation.

[QLoRA Paper](https://arxiv.org/abs/2305.14314)
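
As a back-of-the-envelope illustration of why 4-bit storage matters, the sketch below estimates the memory needed for the weights alone. The parameter count is an approximate, assumed figure for Mistral-7B, and the estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
# Rough weight-memory estimate. NUM_PARAMS is an approximate, assumed value.
NUM_PARAMS = 7.24e9


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```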

In [13]:
# Activate 4-bit precision base model loading
use_4bit = True

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Activate nested quantization (double quantization). This may save extra VRAM
use_nested_quant = False

## Training Parameters

Here we define the training parameters such as the learning rate, the optimizer, number of training steps, etc. Everything is defined for you, but take your time to read about each option and play around with them. There are brief explanations for each of these parameters.

In [14]:
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100). Bf16 is a better data type. It is available on newer NVIDIA chips. https://stats.stackexchange.com/questions/637988/understanding-the-advantages-of-bf16-vs-fp16-in-mixed-precision-training#:~:text=Brain%20float%20(BF16)%20and%2016,the%20cost%20of%20reduced%20precision.
fp16 = True
bf16 = False

# Batch size per GPU for training. You can try increasing this
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for. Effective batch size becomes (Batch size x Grad Acc). Grad acc is a cheap way to increase batch size and avoid high VRAM usage.
gradient_accumulation_steps = 4

# Enable gradient checkpointing. Gradient checkpointing saves VRAM by recomputing activations, instead of storing them in memory after forward pass. https://github.com/cybertronai/gradient-checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping). Take care to use small values when the training is not stable and gradient norms peak.
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 1e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.0

# Optimizer to use. Paged optimizer is used to save VRAM
optim = "paged_adamw_32bit"

# Learning rate scheduler
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = 2

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.05

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 100

# Log every X updates steps
logging_steps = 10

# Maximum sequence length to use. We will use smaller context to not run out of memory on 16Gb VRAM.
max_seq_length = 256

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
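
To make the interplay of `warmup_ratio`, `lr_scheduler_type = "cosine"`, and gradient accumulation more concrete, here is a simplified sketch of a linear-warmup, cosine-decay schedule. It is an illustration written for this notebook, not the scheduler code used by the trainer; the function `lr_at_step` and the run length of 1000 steps are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup from 0 to base_lr, then cosine decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch_size = 1 * 4

total_steps = 1000  # hypothetical run length, chosen only for illustration
for step in (0, 25, 50, 525, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With `max_steps = 2`, as set above, the schedule has almost no room to warm up or decay, so a longer run is needed to see its effect.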

# 4. Training

With the dataset formatting and configuration taken care of, we can proceed to training. We will launch a Tensorboard instance to monitor the training. It allows us to monitor the loss and the gradient norms live.

In [15]:
%load_ext tensorboard
%tensorboard --logdir results/runs

Reusing TensorBoard on port 6006 (pid 21584), started 0:04:36 ago. (Use '!kill 21584' to kill it.)

Next, we set up the training. Here are the steps:

1. We create a quantization config instance using bitsandbytes and the parameters that we have defined above
2. Next, we load the base model
3. Then, we call *prepare_model_for_kbit_training(model)* to prepare it for quantized training
4. Then, we create a LoRA configuration and define training parameters.
5. Finally, we start a training session.

One important thing to note is that we need to train the model **ONLY** on the completion/response tokens. This will be taken care of by the **DataCollatorForCompletionOnlyLM** data collator.

It takes samples of text, tokenizes them, pads them, and excludes from the loss all tokens that are not part of the completion. To set it up, we have to tell it which part of our text is the completion by specifying a response template. It will ignore the loss for all tokens coming before that template.

For example:
```python
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
```

This collator only computes the loss for tokens coming after `### Response:`, which is exactly what we need.
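
The masking idea can be sketched in a few lines of plain Python. This is a simplified illustration, not TRL's implementation: the helper `mask_prompt_tokens` is hypothetical, the token ids are invented, and it masks up to the first occurrence of the template. The value -100 is the label that PyTorch's cross-entropy loss skips by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_tokens(input_ids, response_template_ids):
    """Return labels where everything up to and including the template is ignored."""
    n = len(response_template_ids)
    labels = list(input_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found (for example, cut off by truncation): ignore the whole sample
    return [IGNORE_INDEX] * len(input_ids)


# Toy token ids, invented for illustration; real ids come from the tokenizer.
template = [7, 8, 9]                       # stands in for "### Response:"
sample = [1, 5, 6, 4, 7, 8, 9, 20, 21, 2]  # prompt, template, completion, eos

print(mask_prompt_tokens(sample, template))
```

Only the completion tokens keep their real labels, so only they contribute to the loss.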

In [16]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# Create a Quantization config with the parameters defined in the previous cell.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)

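# Prepare the already-quantized model for low-precision (k-bit) training
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # Saves a lot of memory
# Setting carried over from Llama fine-tuning scripts; 1 means no tensor-parallel slicing
model.config.pretraining_tp = 1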

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    max_steps=max_steps,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
    do_eval=False
)

# Set the supervised fine-tuning collator. The response template tells the trainer to compute loss only for the tokens coming after it.
# We do not need to compute loss for the instruction and context tokens. We need the model to be trained only on the completion part.
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# Create the SFT Trainer. We pass in the formatting function that we have defined alongside the model, dataset, collator, and LoRA configuration
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    formatting_func=get_formatted_inputs,
    data_collator=collator,
    **data_module
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)



Background
The Tailored Access Operations unit has existed since the late 90s. Its mission is to collect intelligence on foreign targets of the United States by hacking into computers and telecommunication networks.

In 2012 This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the `max_seq_length`.

The height of the tower is 55.86 metres (183 feet 3 inches) from the ground on the low side and 56.67 m (185 ft 11 in) on the high side. The width of the walls at the base is 2.44 m (8 ft 0 in). Its weight is estimated at 14,500 tonnes (16,000 short tons). The tower has 296 or 294 steps; the seventh floor has two fewer steps on the north-f This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the `max_seq_length`.
  attn_output = torch.nn.functional.scaled_dot_product_attention(

The format involves a qualification phase, which takes place over the preceding three years, to determine which teams qu

{'train_runtime': 45.9151, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.044, 'train_loss': 1.0658118724822998, 'epoch': 0.0}


# 5. Generating from the Trained Model

Now, we will load the base model and apply the trained LoRA adapters over it. Next, we can try prompting it!

Note: You may need to restart session to free up some VRAM to be able to load the model.
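
Conceptually, merging an adapter folds the low-rank update into the base weight, W' = W + (alpha / r) · B · A, so the merged layer gives the same output as the base layer plus the adapter. The toy sketch below checks this on a 2x2 example in plain Python. All numbers and helper names are invented for illustration, and it ignores the dequantization that a 4-bit base model involves.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with a rank-1 adapter; all numbers are invented for illustration.
w = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out x in)
lora_a = [[1.0, 2.0]]         # A: r x in
lora_b = [[0.5], [0.25]]      # B: out x r
alpha, r = 2, 1
x = [[3.0], [4.0]]            # input column vector

merged = merge_lora(w, lora_a, lora_b, alpha, r)

# Adapter kept separate: W x + (alpha / r) * B (A x)
base = matmul(w, x)
bax = matmul(lora_b, matmul(lora_a, x))
separate = [[b[0] + (alpha / r) * d[0]] for b, d in zip(base, bax)]

print(merged)
print(matmul(merged, x), separate)
```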

In [18]:
# Free up the memory to be able to reload the model

import gc
# del base_model
del trainer
gc.collect()

1176

In [19]:
# Reload model and merge it with LoRA weights. Of course we reload it quantized as well.

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
    quantization_config=bnb_config
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.unk_token

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [20]:
from transformers import GenerationConfig
generation_config = GenerationConfig(max_new_tokens=128, top_p=0.9, do_sample=True, repetition_penalty=1)

prompt_template = "### Instruction: {}\n ### Response: "
prompt = "What do you do when the weather outside is cold and it is raining?"
# Named input_ids to avoid shadowing Python's built-in input()
input_ids = tokenizer(prompt_template.format(prompt), return_tensors="pt").to(model.device)["input_ids"]

print(tokenizer.decode(model.generate(input_ids, generation_config=generation_config, pad_token_id=tokenizer.pad_token_id)[0]))

<s> ### Instruction: What do you do when the weather outside is cold and it is raining?
 ### Response: 101 Warm-Up Drills and Skills for Soccer Coaches
 ### [Book Website](https://www.amazon.com/Warm-Up-Drills-Soccer-Coaches/dp/1589239681)

[![AWS: Warm Up Drills and Skills for Soccer Coaches](http://ecx.images-amazon.com/images/I/410XD12N7TL.jpg)](https://www.amazon.com/Warm-Up-Drills-Soccer
