# A practical introduction to LLM fine-tuning

This notebook is adapted from [`this`](https://github.com/ashishpatel26/LLM-Finetuning/blob/main/2.Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.ipynb) GitHub repo and updated for a newer model (Llama 3.2) and a different dataset.

![](https://archive.is/0iIXL/f587d66c7324054f5ae1e81d7a5736567e8c15c8.webp)

In [3]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a target="_blank" href="https://colab.research.google.com/github/surrey-nlp/NLP-2025/blob/main/lab08/lab08_Instruction_FT_using_Llama3_2.ipynb">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>'
)
display(colab_button)

# Background on fine-tuning LLMs

![](https://archive.is/0iIXL/5f30742c57ad532b4cda9f1b48790dbcc7d00a85.webp)

**Summary:**

1. **LLM Pretraining:**
   - Large Language Models (LLMs) are pretrained on extensive text corpora.
   - According to its model card, Llama 3.2 1B was pretrained on up to 9 trillion tokens, whereas BERT was trained on BookCorpus and English Wikipedia (roughly 3.3 billion words).
   - Pretraining is resource-intensive and time-consuming.

2. **Auto-Regressive Prediction:**
   - Llama 3.2, an auto-regressive model, predicts the next token in a sequence.
   - A purely auto-regressive model continues text rather than following instructions, which is why instruction tuning is needed.

3. **Fine-Tuning Techniques:**
   - Instruction tuning uses two main fine-tuning techniques:
     a. Supervised Fine-Tuning (SFT): Trained on instruction-response datasets, minimizing differences between generated and actual responses.
     b. Reinforcement Learning from Human Feedback (RLHF): Trained to maximize rewards based on human evaluations.

4. **RLHF vs. SFT:**
   - RLHF captures complex human preferences but requires careful reward system design and consistent human feedback.
   - Direct Preference Optimization (DPO) is an alternative to RLHF.
   - SFT can be highly effective when the model hasn't encountered specific data during pretraining.

5. **Effective SFT Example:**
   - The [`LIMA`](https://arxiv.org/abs/2305.11206) paper showed that a LLaMA v1 model fine-tuned on a small, high-quality dataset could outperform GPT-3.
   - Data quality and model size (e.g., 65B parameters) are crucial for successful fine-tuning.

6. **Importance of Prompt Templates:**
   - Prompt templates structure inputs: system prompt, user prompt, additional inputs, and model answer.
   - Llama 2's template, which the dataset in this notebook follows: *\<s>[INST] <<SYS>> System prompt <</SYS>> User prompt [/INST] Model answer </s>*
   - Different templates (e.g., Alpaca, Vicuna) have varying impacts.

7. **Reformatting for Llama 3.2:**
   - The instruction dataset needs to follow one consistent prompt template; here it uses Llama 2's template even though the model being fine-tuned is Llama 3.2.
   - The [`main`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset has already been reformatted for this purpose using [`this`](https://colab.research.google.com/drive/1ktwneRByMnm14i5fosRO_1PFCR0CAKFJ?usp=sharing) notebook.

(Note: LLMs = Large Language Models, SFT = Supervised Fine-Tuning, RLHF = Reinforcement Learning from Human Feedback, DPO = Direct Preference Optimization)
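
The `[INST]` template from item 6 can be sketched as a small formatting helper. This is only an illustration: `format_llama2_prompt` is a hypothetical function, not part of any library, and it follows the single-line layout shown above rather than any official reference implementation.

```python
from typing import Optional


def format_llama2_prompt(user_prompt: str,
                         system_prompt: Optional[str] = None,
                         answer: Optional[str] = None) -> str:
    """Build a Llama-2-style string (hypothetical helper for illustration)."""
    if system_prompt:
        body = f"<<SYS>> {system_prompt} <</SYS>> {user_prompt}"
    else:
        body = user_prompt
    text = f"<s>[INST] {body} [/INST]"
    # Training samples include the answer and the closing </s>;
    # inference prompts stop after [/INST] so the model writes the answer.
    if answer is not None:
        text += f" {answer} </s>"
    return text


example = format_llama2_prompt("What is QLoRA?", answer="A 4-bit LoRA variant.")
print(example)
```

Leaving `answer` unset gives the inference-time form of the prompt, which ends at `[/INST]`.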

**Fine-Tuning Llama 3.2 with VRAM Limitations and QLoRA:**

In this section, the goal is to fine-tune a Llama 3.2 model with 1 billion parameters using a T4 GPU with 16 GB of VRAM. Given the VRAM limitations, traditional fine-tuning is not feasible, necessitating parameter-efficient fine-tuning (PEFT) techniques like LoRA or QLoRA. The chosen approach is QLoRA, which employs 4-bit precision to drastically reduce VRAM usage.
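
To make the VRAM argument concrete, here is a rough back-of-the-envelope estimate. It assumes a round figure of 1 billion parameters and counts weight storage only; the 16 bytes per parameter used for full fine-tuning is a common rule of thumb for mixed-precision Adam (weights, gradients, and optimizer states), not a measured value, and activations are ignored.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 1.0e9  # assumed round figure for a "1B" model

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5} weights: {weight_memory_gib(n_params, bits):.2f} GiB")

# Rule-of-thumb assumption: ~16 bytes per parameter for full fine-tuning
# with mixed-precision Adam, before counting activations.
full_finetune_gib = n_params * 16 / 1024**3
print(f"full fine-tuning (rule of thumb): {full_finetune_gib:.1f} GiB")
```

Under these assumptions, full fine-tuning already sits at the edge of a T4's memory before activations are counted, while 4-bit weights take well under 1 GiB.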

The following steps will be executed:

1. **Environment Setup:**
   The task involves leveraging the Hugging Face ecosystem and several libraries: transformers, accelerate, peft, trl, and bitsandbytes.

2. **Installation and Library Loading:**
   The first step is to install and load the required libraries.

In [1]:
!nvidia-smi

Sun Mar 23 01:21:31 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   53C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [1]:
!pip install accelerate peft bitsandbytes transformers trl



In [2]:
# Import necessary packages for the fine-tuning process
import os                          # Operating system functionalities
import torch                       # PyTorch library for deep learning
from datasets import load_dataset  # Loading datasets for training
from transformers import (
    AutoModelForCausalLM,          # AutoModel for language modeling tasks
    AutoTokenizer,                # AutoTokenizer for tokenization
    BitsAndBytesConfig,           # Configuration for BitsAndBytes
    HfArgumentParser,             # Argument parser for Hugging Face models
    TrainingArguments,            # Training arguments for model training
    pipeline,                     # Creating pipelines for model inference
    logging,                      # Logging information during training
)
from peft import LoraConfig, PeftModel  # Packages for parameter-efficient fine-tuning (PEFT)
from trl import SFTTrainer         # SFTTrainer for supervised fine-tuning

In [3]:
# !pip install -q datasets
# !huggingface-cli login

---
* **Section 1:** Parameters to tune
    * Load the Llama-3.2-1B model and fine-tune it on the mindhunter23/guanaco-llama2-1k-en dataset.
    * The dataset contains 1,000 samples.
    * You can find more information about the original dataset [`here`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).
    * Feel free to use a different dataset.
* **Section 2:** QLoRA parameters
    * QLoRA will use a rank of 64 with a scaling parameter of 16.
    * The Llama 3.2 model will be loaded directly in 4-bit precision using the NF4 type.
    * The model will be trained for one epoch.
* **Section 3:** Other parameters
    * To get more information about the other parameters, check the [`TrainingArguments`](https://archive.is/o/0iIXL/https://huggingface.co/docs/transformers/main_classes/trainer%23transformers.TrainingArguments), [`PeftModel`](https://archive.is/o/0iIXL/https://huggingface.co/docs/peft/package_reference/peft_model), and [`SFTTrainer`](https://archive.is/o/0iIXL/https://huggingface.co/docs/trl/main/en/sft_trainer) documentation.

In [4]:
# The model that you want to train from the Hugging Face hub
model_name = "unsloth/Llama-3.2-1B"
# model_name = "meta-llama/Llama-3.2-1B"
# model_name = "NousResearch/Llama-2-7b-hf"

# The instruction dataset to use
dataset_name = "mindhunter23/guanaco-llama2-1k-en"

# Fine-tuned model name
new_model = "llama-3.2-1b-miniguanaco"

In [5]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

In [6]:
################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [7]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 4

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25
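
The interaction between batch size, gradient accumulation, and the save interval can be worked out by hand. The sketch below assumes the train split holds the 1,000 samples mentioned earlier and a single GPU; the exact step count reported by the trainer may differ slightly depending on library version.

```python
import math

num_samples = 1000                  # assumed size of the train split
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 1                        # a single T4
num_train_epochs = 1
save_steps = 25

# Samples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Optimizer updates over the whole run
updates_per_epoch = math.ceil(num_samples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs

# Checkpoints written at the configured interval
num_checkpoints = total_updates // save_steps

print(effective_batch_size, total_updates, num_checkpoints)
```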

In [8]:
################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}


1. **Loading the Dataset:**
   The first step involves loading the preprocessed dataset. This dataset will be used for fine-tuning. Preprocessing might involve reformatting prompts, filtering out low-quality text, and combining multiple datasets if needed.

2. **Configuring BitsAndBytes for 4-bit Quantization:**
   The `BitsAndBytesConfig` is set up to enable 4-bit quantization. This configuration is crucial for reducing the memory usage during fine-tuning.

3. **Loading Llama 3.2 Model and Tokenizer in 4-bit Precision:**
   The Llama 3.2 model is loaded with 4-bit precision, which significantly reduces the memory footprint. The corresponding tokenizer is also loaded to preprocess the text data.

4. **Loading Configurations and Initializing SFTTrainer:**
   - The configurations needed for QLoRA, which is a parameter-efficient fine-tuning technique, are loaded.
   - Regular training parameters are set up.
   - The `SFTTrainer` is initialized with all the loaded configurations and parameters. This trainer will manage the supervised fine-tuning process.

5. **Start of Training:**
   After all the necessary components are loaded and configured, the training process begins. The `SFTTrainer` takes care of fine-tuning the Llama 3.2 model using the specified dataset, configurations, and parameters.
   
  These steps collectively set up the environment for fine-tuning a Llama 3.2 model with 1 billion parameters in 4-bit precision using the QLoRA technique, thus optimizing for VRAM limitations while maintaining model performance.
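
To build some intuition for what 4-bit storage means, here is a deliberately simplified quantizer. It uses evenly spaced levels and one scale per block (absmax quantization); this is an assumption made for illustration. NF4 instead uses non-uniform levels designed for normally distributed weights, and the real implementation lives inside `bitsandbytes`.

```python
def quantize_absmax_4bit(block):
    """Map floats to signed integers in [-7, 7] using one scale per block."""
    largest = max(abs(x) for x in block)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up weights, only to illustrate the round trip
weights = [0.12, -0.70, 0.33, 0.06, -0.41, 0.70]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", round(max_error, 3))
```

Each weight is stored as a small integer plus a shared scale, and the round-trip error is bounded by half a quantization step.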

In [9]:
# Step 1 : Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [10]:
# Step 2 : Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# TODO: complete the function arguments using initializations done before
bnb_config = BitsAndBytesConfig(
    load_in_4bit=...,
    bnb_4bit_quant_type=...,
    bnb_4bit_compute_dtype=...,
    bnb_4bit_use_double_quant=...,
)

In [11]:
# Step 3 :Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

In [12]:
# Step 4 :Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [13]:
# Step 5: Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Use the end-of-sequence token for padding; this avoids adding a new token,
# which would require resizing the model's embedding matrix
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## LoRA Configuration:

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that freezes the pretrained weights and injects trainable low-rank decomposition matrices into layers of the Transformer, so the model can be adapted by training only a small number of extra parameters.

The following code configures the LoRA parameters for fine-tuning:

* **`lora_alpha`:** This parameter sets the scaling factor for the LoRA updates: the low-rank update is multiplied by `lora_alpha / r`, so a higher value gives the LoRA updates a larger impact relative to the frozen weights.
* **`lora_dropout`:** This parameter sets the dropout probability for the LoRA layers. Dropout is a regularization technique that helps prevent overfitting.
* **`r`:** This parameter determines the rank of the LoRA decomposition matrices. A lower rank leads to fewer trainable parameters but might limit the expressiveness of the LoRA updates.
* **`bias`:** This parameter specifies which bias parameters are trained: "none" (no biases are updated), "all", or "lora_only".
* **`task_type`:** This parameter indicates the type of task the model is being fine-tuned for. In this case, it's "CAUSAL_LM" for causal language modeling.
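
The effect of `r` and `lora_alpha` can be checked with simple arithmetic. The sketch below assumes a single square projection matrix of size 2048 x 2048 (an assumed dimension chosen for illustration) and the standard `lora_alpha / r` scaling; `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


lora_r, lora_alpha = 64, 16      # values used in this notebook
d_out = d_in = 2048              # assumed size of one projection matrix

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, lora_r)
scaling = lora_alpha / lora_r    # factor applied to the low-rank update B @ A

print(f"full matrix:  {full:,} params")
print(f"LoRA factors: {lora:,} params ({lora / full:.2%} of full)")
print(f"scaling alpha / r = {scaling}")
```

With these values the adapter trains a small fraction of the matrix's parameters, and the update is scaled down by a factor of 0.25.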

In [14]:
# Step 6 :Load LoRA configuration

# TODO: complete the function arguments for LoRA config
peft_config = LoraConfig(
    lora_alpha=...,
    lora_dropout=...,
    r=...,
    bias=...,
    task_type=...,
)

In [15]:
# Step 7 :Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

In [None]:
# Step 8 :Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    # dataset_text_field="text",
    # max_seq_length=None,
    # tokenizer=tokenizer,
    args=training_arguments,
    # packing=False,
)

In [None]:
torch.cuda.empty_cache()

# Step 9 :Train model
trainer.train()

# Step 10 :Save trained model
trainer.model.save_pretrained(new_model)

You might notice frequent fluctuations in the training loss during fine-tuning. This is expected and can be attributed to factors such as:

* **Small Model Size:** The Llama 3.2-1B model used here has relatively few parameters compared to larger LLMs, which can make it more sensitive to individual training examples.
* **Dataset Size and Variability:** The dataset used for fine-tuning is small and contains a diverse range of prompts and responses. This variability contributes to fluctuations in the loss as the model learns to generalize across different types of examples.
* **Batch Size:** A small batch size, such as the one used here (per_device_train_batch_size = 1), makes each gradient estimate noisy because it is computed from only a few examples, which shows up as fluctuations in the loss curve.

Despite these variations, the overall trend of the loss should generally be decreasing, indicating that the model is learning and improving over time.
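
One way to see the overall trend through the noise is to smooth the logged values, similar in spirit to the smoothing slider in TensorBoard. The loss values below are made up purely for illustration, and `exponential_moving_average` is a hypothetical helper.

```python
def exponential_moving_average(values, alpha=0.2):
    """Smooth a sequence; a smaller alpha gives a smoother curve."""
    smoothed = []
    for v in values:
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1 - alpha) * prev)
    return smoothed


# Made-up loss values, only to illustrate the effect
noisy_loss = [2.1, 1.6, 2.3, 1.5, 1.9, 1.3, 1.8, 1.2, 1.6, 1.1]
smooth_loss = exponential_moving_average(noisy_loss)

print([round(x, 2) for x in smooth_loss])
```

The smoothed sequence varies far less from step to step while still showing the downward trend.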

In [None]:
%load_ext tensorboard
%tensorboard --logdir results/runs

In [None]:
!pip install evaluate rouge_score bert_score

## Evaluating the Fine-tuned Model:

In this section, we evaluate the fine-tuned model using ROUGE and BERTScore metrics. We will generate text from the model using the prompts in the test set and compare the generated text with the reference responses.

In [None]:
import re

def extract_prompt_response(text):
    """
    Extracts the prompt and response from the given text using regex.

    Args:
        text (str): The text containing the prompt and response.

    Returns:
        tuple: A (prompt, response) tuple, or (None, None) if the pattern is not found.
    """
    match = re.search(r"<s>\[INST\](.*?)\[/INST\](.*?)</s>", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    else:
        return None, None

In [None]:
import evaluate
import numpy as np

# Load the base model, then attach the fine-tuned LoRA weights
eval_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=...,  # TODO: include quantization config
    device_map=device_map
)
eval_model = PeftModel.from_pretrained(eval_model, new_model)  # Load LoRA weights
eval_tokenizer = ...  # TODO: use the tokenizer from the trained model
eval_model.eval()

# Prepare the test split
test_dataset = load_dataset(dataset_name, split="test")

# Perform inference and evaluate using ROUGE and BERTScore
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = []
references = []

for i in range(min(100, len(test_dataset))): # perform inference on 100 samples
    example = test_dataset[i]
    prompt, response = extract_prompt_response(example["text"])

    if prompt and response:
        inputs = eval_tokenizer(prompt, return_tensors="pt").to(eval_model.device)
        with torch.no_grad():
            outputs = eval_model.generate(**inputs, max_new_tokens=50) # Adjust max_new_tokens as needed
        # Decode only the newly generated tokens; outputs[0] also contains the prompt
        prompt_length = inputs["input_ids"].shape[1]
        generated_text = eval_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        predictions.append(generated_text)
        references.append(response)
    else:
        print(f"Skipping example: {example['text'][:50]}... (Invalid format)")

# Calculate ROUGE scores (rouge1, rouge2, rougeL, rougeLsum)
rouge_scores = rouge.compute(predictions=predictions, references=references)

# Calculate BERTScore scores
bertscore_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print("ROUGE scores:", rouge_scores)
# print("BERTScore scores:", bertscore_scores)

average_bertscore = np.mean(bertscore_scores["f1"])

print("Average BERTScore:", average_bertscore)


## Explanation of ROUGE Scores:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics originally designed to evaluate automatic summarization; here it measures word overlap between the generated and reference responses.
The ROUGE scores you see above represent the following:

* **rouge1:** Measures the overlap of unigrams (single words) between the generated text and the reference text.
* **rouge2:** Measures the overlap of bigrams (two-word sequences) between the generated text and the reference text.
* **rougeL:** Measures the longest common subsequence (LCS) between the generated text and the reference text. Because the LCS respects word order, it captures sentence-level structure that ROUGE-1 and ROUGE-2 miss.
* **rougeLsum:** Also based on the LCS, but computed at the summary level: the text is split into sentences on newlines and the LCS is computed per sentence before being aggregated.

Higher scores for all these metrics indicate better performance, meaning the generated text is more similar to the reference text.
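
The overlap ideas behind these metrics can be sketched in plain Python. This is a simplified illustration using lowercase whitespace tokens; the `evaluate` implementation applies its own tokenization and options, so its numbers may differ. The function names are hypothetical.

```python
from collections import Counter


def rouge_n_f1(prediction: str, reference: str, n: int = 1) -> float:
    """F1 over overlapping n-grams, using lowercase whitespace tokens."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 based on the LCS of the two token sequences."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n_f1(prediction, reference, n=1))
print(rouge_n_f1(prediction, reference, n=2))
print(rouge_l_f1(prediction, reference))
```

In this example a single substituted word costs one unigram match but two bigram matches, which is why the bigram score drops more sharply.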

## Explanation of BERTScore:

BERTScore is an automatic evaluation metric that measures the similarity between two texts using contextual embeddings from BERT. It computes a similarity score based on the cosine similarity between the token embeddings of the generated text and the reference text.

The average BERTScore you see above is the mean F1 score over the evaluated samples (at most 100 here). The F1 score is the harmonic mean of precision and recall, which provides a balanced measure of performance.

Higher BERTScore indicates better performance, meaning the generated text is semantically more similar to the reference text.
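
The matching step behind BERTScore can be illustrated with toy vectors. In the sketch below the 2-D "embeddings" are made up; a real run would use contextual vectors from a BERT-style model, and the library adds options not shown here. `greedy_match_f1` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(cand_vecs, ref_vecs):
    """Precision, recall and F1 from best-match cosine similarities."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-D "embeddings" (made up for illustration)
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

p, r, f1 = greedy_match_f1(candidate, reference)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here every candidate vector has an exact match in the reference, so precision is high, while one reference vector is only partially covered, which lowers recall.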