# Fine-Tuning Large Language Models with LoRA/QLoRA

Welcome! In this notebook, you'll learn how to fine-tune a large language model efficiently using Parameter-Efficient Fine-Tuning (PEFT) techniques.

**What you'll learn:**
- How to prepare and tokenize datasets for instruction fine-tuning
- How to apply LoRA (Low-Rank Adaptation) to reduce trainable parameters
- How to use QLoRA for memory-efficient training with quantization
- How to implement assistant-only masking to train only on model responses
- How to train, save, and share your fine-tuned model

**Why this matters:**
Fine-tuning large models typically requires massive computational resources. LoRA and QLoRA allow you to fine-tune models on consumer hardware by only training a small fraction of the parameters.


## 1. Setup and Imports

Let's start by importing all the libraries we'll need. We're using:
- **Transformers**: For loading models and tokenizers
- **PEFT**: For applying LoRA adapters to our model
- **Datasets**: For loading and processing training data
- **Weights & Biases (wandb)**: For tracking training metrics (optional)
- **BitsAndBytes**: For quantization if using QLoRA


In [1]:
! pip install -q torch wandb datasets peft huggingface_hub


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [5]:
import os
import json
import warnings
import torch
import wandb
from dotenv import load_dotenv
from typing import List, Dict, Tuple, Optional

from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)
from peft import LoraConfig, get_peft_model, PeftModel
from huggingface_hub import login

warnings.filterwarnings("ignore")


def get_model_size_gb(model: torch.nn.Module) -> float:
    """
    Calculate the model size in GB based on parameter count and data type.

    Args:
        model: PyTorch model

    Returns:
        Model size in GB
    """
    total_size = 0

    for param in model.parameters():
        # Get the number of elements
        param_size = param.numel()

        # Get the size of each element in bytes based on data type
        if param.dtype == torch.float32:
            bytes_per_param = 4
        elif param.dtype == torch.float16 or param.dtype == torch.bfloat16:
            bytes_per_param = 2
        elif param.dtype == torch.int8:
            bytes_per_param = 1
        elif param.dtype == torch.float64:
            bytes_per_param = 8
        else:
            # Default to 4 bytes for unknown types
            bytes_per_param = 4

        total_size += param_size * bytes_per_param

    # Convert bytes to GB (1 GB = 1024^3 bytes)
    size_gb = total_size / (1024**3)
    return size_gb


def read_json_file(file_path: str) -> dict:
    """
    Read a JSON file and return the contents as a dictionary.
    """
    with open(file_path, "r") as file:
        return json.load(file)

def push_to_hub(
    model: PeftModel, tokenizer: AutoTokenizer, model_name: str, hf_username: str
):
    """
    Push a model and tokenizer to Hugging Face Hub.
    """
    model_id = f"{hf_username}/{model_name}"
    try:
        model.push_to_hub(f"{model_id}-adapters", private=False)

        merged_model = model.merge_and_unload()
        merged_model.push_to_hub(model_id, private=False)

        tokenizer.push_to_hub(model_id)
        print(f"Adapters successfully pushed to: https://huggingface.co/{model_id}")
    except Exception as e:
        print(f"Error pushing to Hugging Face: {e}")
        print("Make sure you're logged in with: huggingface-cli login")

In [30]:
config = {
    "model_name": "meta-llama/Llama-3.2-1B-Instruct",
    "save_model_name": "llama-1b-legal-qlora",
    "assistant_only_masking": True,
    "use_qlora": False,
    "deepspeed_version": 1,
    "dataset_config": {
        "dataset_dir_path": "../../data",
        "instruction_column": None,
        "input_column": "question",
        "output_column": "answer",
        "max_length": 2048,
        "sample_size": None,
        "validation_size": None,
        "test_size": None
    },
    "quantization_config": {
        "load_in_4bit": True
    },
    "lora_config": {
        "r": 8,
        "lora_alpha": 32,
        "target_modules": [
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj"
        ],
        "lora_dropout": 0.05,
        "bias": "none"
    },
    "training_args": {
        "output_dir": "./checkpoints",
        "per_device_train_batch_size": 4,
        "num_train_epochs": 2,
        "learning_rate": 5e-4,
        "logging_steps": 4,
        "save_strategy": "epoch",
        "eval_strategy": "epoch",
        "warmup_steps": 0,
        "lr_scheduler_type": "cosine",
        "optim": "adamw_torch",
        "report_to": "wandb",
        "remove_unused_columns": False,
        "dataloader_drop_last": True,
        "gradient_checkpointing": False,
        "max_grad_norm": 1.0,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": False,
        "fp16": False,
        "logging_dir": "./logs",
        "logging_first_step": True,
        "log_level": "info",
        "disable_tqdm": False
    },
    "early_stopping": {
        "early_stopping_patience": 2,
        "early_stopping_threshold": 0.05
    }
}

## 2. Authentication and Configuration

Before we begin training, we need to authenticate with HuggingFace (to download models and upload results) and optionally with Weights & Biases (to track training progress).

**Important:** Make sure you have a `.env` file with:
- `HF_TOKEN`: Your HuggingFace access token
- `HF_USERNAME`: Your HuggingFace username
- `WANDB_API_KEY`: Your Weights & Biases API key (optional)

We'll also load our training configuration from `config.json`, which contains all hyperparameters and settings.


In [31]:
load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN")
HF_USERNAME = os.getenv("HF_USERNAME")
login(HF_TOKEN)


if config["training_args"]["report_to"] == "wandb":
    if os.getenv("WANDB_API_KEY") is None:
        raise ValueError("WANDB_API_KEY is not set in the environment variables")
    
    wandb.login(key=os.getenv("WANDB_API_KEY"))
    wandb.init(
        project="my-llm-finetune",
        name="run-1-lora-experiment",
    )


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [32]:
model_name = config["model_name"]
dataset_config = config["dataset_config"]
quantization_config = config["quantization_config"]
lora_config = config["lora_config"]
use_qlora = config["use_qlora"]
training_args = config["training_args"]
save_model_name = config["save_model_name"]
assistant_only_masking = config["assistant_only_masking"]
early_stopping_config = config.get("early_stopping", {})

print(f"Model: {model_name}")
print(f"Using QLoRA: {use_qlora}")
print(f"Assistant-only masking: {assistant_only_masking}")


Model: meta-llama/Llama-3.2-1B-Instruct
Using QLoRA: False
Assistant-only masking: True


## 3. Dataset Preparation Functions

Now we'll define the functions that prepare our data for training. These functions will:

1. **Load and format the dataset** into an instruction-following format
2. **Apply assistant-only masking** so the model only learns from the output portion
3. **Tokenize the data** to convert text into numerical tokens the model can process
4. **Create batches** with proper padding

**Key Concept - Assistant-Only Masking:**
When fine-tuning instruction-following models, we typically only want the model to learn to predict the assistant's response, not the instruction itself. We achieve this by masking the instruction tokens (setting them to -100), so they don't contribute to the loss during training.


In [None]:
def prepare_dataset(
    input_column: str,
    output_column: str,
    instruction_column: str = None,
    default_instruction: str = None,
    dataset_name: str = None,
    dataset_dir_path: str = None,
    sample_size: Optional[int] = None,
    validation_size: Optional[float] = None,
    test_size: Optional[float] = None,
) -> Tuple[Dataset, Dataset, Dataset]:
    """
    Load and prepare the dataset for fine-tuning.
    
    Args:
        dataset_name: Name of the dataset from HuggingFace
        instruction_column: Column name for instructions
        input_column: Column name for inputs
        output_column: Column name for outputs
        sample_size: Optional limit on training samples
        validation_size: Fraction of data for validation
        test_size: Fraction of data for testing
    
    Returns:
        Tuple of (train_dataset, validation_dataset, test_dataset)
    """
    if dataset_dir_path is not None:
        train_file_path = os.path.join(dataset_dir_path, "train_data.jsonl")
        validation_file_path = os.path.join(dataset_dir_path, "validation_data.jsonl")
        test_file_path = os.path.join(dataset_dir_path, "test_data.jsonl")

        dataset = load_dataset(
            "json",
            data_files={
                "train": train_file_path,
                "validation": validation_file_path,
                "test": test_file_path
            }
        )

    elif dataset_name is not None:
        dataset = load_dataset(dataset_name)
    
    else:
        raise ValueError("Either dataset_dir_path or dataset_name must be provided")
    
    def format_instruction_data(data_point: Dict) -> str:
        if instruction_column is not None:
            instruction = data_point[instruction_column]
        else:
            instruction = default_instruction

        input_text = data_point[input_column]
        output = data_point[output_column]
        
        formatted_text = f"### Instruction\n{instruction}\n\n"
        
        if input_text:
            formatted_text += f"### Input\n{input_text}\n\n"
        
        formatted_text += f"### Output\n{output}"
        
        return {"text": formatted_text}
    
    
    
    if sample_size is not None:
        dataset["train"] = dataset["train"].select(range(sample_size))
    
    if validation_size is not None and test_size is not None:
        val_plus_test_size = validation_size + test_size
        split = dataset["train"].train_test_split(test_size=val_plus_test_size, seed=42)
        dataset["validation_and_test"] = split["test"]
        dataset["train"] = split["train"]
        
        test_to_val_plus_test_ratio = test_size / val_plus_test_size
        split = dataset["validation_and_test"].train_test_split(
            test_size=test_to_val_plus_test_ratio, seed=42
        )
        dataset["test"] = split["test"]
        dataset["validation"] = split["train"]
        del dataset["validation_and_test"]
    
    elif validation_size is not None:
        split = dataset["train"].train_test_split(test_size=validation_size, seed=42)
        dataset["validation"] = split["test"]
        dataset["train"] = split["train"]
    
    elif test_size is not None:
        split = dataset["train"].train_test_split(test_size=test_size, seed=42)
        dataset["test"] = split["test"]
        dataset["train"] = split["train"]
    
    validation_dataset = None
    test_dataset = None
    
    train_dataset = dataset["train"].map(
        format_instruction_data, desc="Formatting train data"
    )
    
    if "validation" in dataset:
        validation_dataset = dataset["validation"].map(
            format_instruction_data, desc="Formatting validation data"
        )
    
    if "test" in dataset:
        test_dataset = dataset["test"].map(
            format_instruction_data, desc="Formatting test data"
        )
    
    return train_dataset, validation_dataset, test_dataset


train_dataset, validation_dataset, test_dataset = prepare_dataset(
    dataset_dir_path=dataset_config["dataset_dir_path"],
    input_column=dataset_config["input_column"],
    output_column=dataset_config["output_column"],
    default_instruction="You are a helpful assistant responsible for solving simple math problems. Answer the following question by stating you reasoning then provide the answer at the end preceeded by ####.",
)


In [34]:
def apply_assistant_masking(
    input_ids: List[int], tokenizer: AutoTokenizer
) -> List[int]:
    """
    Apply assistant-only masking by setting instruction tokens to -100.
    This ensures the model only learns from the output/response portion.
    
    Args:
        input_ids: Tokenized input sequence
        tokenizer: The tokenizer used
    
    Returns:
        Labels with instruction tokens masked (-100)
    """
    labels = input_ids.copy()
    
    output_marker = "### Output"
    output_marker_tokens = tokenizer.encode(output_marker, add_special_tokens=False)
    
    output_start_idx = None
    for i in range(len(input_ids) - len(output_marker_tokens) + 1):
        if input_ids[i : i + len(output_marker_tokens)] == output_marker_tokens:
            output_start_idx = i + len(output_marker_tokens)
            break
    
    if output_start_idx is not None:
        for i in range(output_start_idx):
            labels[i] = -100
    
    return labels


In [35]:
class DataCollatorForCausalLM:
    """
    Custom data collator for causal language modeling.
    Handles padding of input_ids, attention_mask, and labels.
    """
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def __call__(self, features):
        labels = [f.pop("labels") for f in features]
        
        batch = self.tokenizer.pad(features, padding=True, return_tensors="pt")
        
        max_len = batch["input_ids"].size(1)
        padded_labels = torch.full((len(labels), max_len), -100, dtype=torch.long)
        for i, l in enumerate(labels):
            padded_labels[i, : len(l)] = torch.tensor(l, dtype=torch.long)
        batch["labels"] = padded_labels
        return batch


In [36]:
def tokenize_dataset(
    model_name: str,
    train_dataset: Dataset,
    validation_dataset: Optional[Dataset] = None,
    test_dataset: Optional[Dataset] = None,
    assistant_only_masking: bool = True,
    max_length: int = 2048,
) -> Tuple[Dataset, Dataset, Dataset, AutoTokenizer]:
    """
    Load and tokenize the dataset with optional assistant-only masking.
    
    Args:
        model_name: Name of the model to use for tokenizer
        dataset_name: Name of the dataset to use
        assistant_only_masking: Whether to apply assistant-only masking
        max_length: Maximum sequence length
    
    Returns:
        Tuple of (train, validation, test, tokenizer)
    """    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    
    def tokenize_and_mask_function(examples):
        texts_with_eos = [text + tokenizer.eos_token for text in examples["text"]]
        tokenized = tokenizer(
            texts_with_eos,
            truncation=True,
            padding=False,
            max_length=max_length,
            return_tensors=None,
            add_special_tokens=True,
        )
        
        labels = []
        for input_ids in tokenized["input_ids"]:
            if assistant_only_masking:
                masked_labels = apply_assistant_masking(input_ids, tokenizer)
            else:
                masked_labels = input_ids
            labels.append(masked_labels)
        
        tokenized["labels"] = labels
        return tokenized
    
    train = train_dataset.map(
        tokenize_and_mask_function,
        batched=True,
        remove_columns=train_dataset.column_names,
        desc="Processing train dataset",
    )
    
    validation = None
    if validation_dataset is not None:
        validation = validation_dataset.map(
            tokenize_and_mask_function,
            batched=True,
            remove_columns=train_dataset.column_names,
            desc="Processing validation dataset",
        )
    
    test = None
    if test_dataset is not None:
        test = test_dataset.map(
            tokenize_and_mask_function,
            batched=True,
            remove_columns=train_dataset.column_names,
            desc="Processing test dataset",
        )
    
    return train, validation, test, tokenizer



## 4. Load and Prepare Dataset

Time to load and process our training data! This cell will:
- Download the dataset from HuggingFace
- Format it into the instruction format
- Tokenize all examples
- Apply masking if enabled
- Split into train/validation/test sets

This might take a few minutes depending on your dataset size.


In [37]:
train, validation, test, tokenizer = tokenize_dataset(
    model_name=model_name,
    train_dataset=train_dataset,
    validation_dataset=validation_dataset,
    test_dataset=test_dataset,
    assistant_only_masking=assistant_only_masking,
    max_length=dataset_config.get("max_length", 2048),
)

print(f"Train dataset size: {len(train)}")
print(f"Validation dataset size: {len(validation) if validation else 0}")
print(f"Test dataset size: {len(test) if test else 0}")


Processing train dataset:   0%|          | 0/7473 [00:00<?, ? examples/s]

Processing validation dataset:   0%|          | 0/400 [00:00<?, ? examples/s]

Processing test dataset:   0%|          | 0/400 [00:00<?, ? examples/s]

Train dataset size: 7473
Validation dataset size: 400
Test dataset size: 400


## 5. Inspect Dataset Sample

Let's take a peek at our processed data to verify everything looks correct. 

If you enabled assistant-only masking, you'll see statistics showing:
- How many tokens are in each sequence
- How many tokens are masked (the instruction part)
- How many tokens we're actually training on (the response part)

**What to expect:** Typically, 50-70% of tokens will be masked, as instructions are usually shorter than responses.


In [38]:
if assistant_only_masking:
    first_example = train[0]
    input_ids = first_example["input_ids"]
    labels = first_example["labels"]
    
    masked_tokens = sum(1 for label in labels if label == -100)
    total_tokens = len(labels)
    
    print(f"Total tokens: {total_tokens}")
    print(f"Masked tokens: {masked_tokens}")
    print(f"Training tokens: {total_tokens - masked_tokens}")
    print(f"Mask ratio: {masked_tokens/total_tokens:.2%}")


Total tokens: 129
Masked tokens: 79
Training tokens: 50
Mask ratio: 61.24%


## 6. Model Configuration

Now we'll configure our Parameter-Efficient Fine-Tuning (PEFT) approach.

**LoRA (Low-Rank Adaptation):**
Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers. This dramatically reduces the number of parameters you need to train (often by 1000x or more!).

**QLoRA (Quantized LoRA):**
If enabled, we'll load the model in 4-bit or 8-bit precision, further reducing memory requirements. This allows you to fine-tune larger models on smaller GPUs.

**Key hyperparameters:**
- `r` (rank): Size of the low-rank matrices (higher = more capacity, more memory)
- `lora_alpha`: Scaling factor for LoRA updates
- `target_modules`: Which model layers to apply LoRA to (typically attention and feed-forward layers)


In [41]:
bnb_config = None

if use_qlora:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=quantization_config["load_in_4bit"],
        load_in_8bit=not quantization_config["load_in_4bit"],
    )

lora_config = LoraConfig(
    **lora_config,
    task_type="CAUSAL_LM",
)

print("LoRA Configuration:")
print(lora_config)


LoRA Configuration:
LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=8, target_modules={'k_proj', 'q_proj', 'o_proj', 'v_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, use_qalora=False, qalora_group_size=16, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, target_parameters=None)


## 7. Load Model with PEFT

Time to load our base model and apply the LoRA adapters!

This step will:
1. Download the base model from HuggingFace (if not cached)
2. Apply quantization if using QLoRA
3. Inject LoRA adapter layers into the model
4. Freeze the base model parameters (only adapters will be trained)

**Watch for:** The model will print how many parameters are trainable vs. frozen. You should see that only 0.1-1% of parameters are trainable with LoRA. This is the magic that makes efficient fine-tuning possible!


In [42]:
def get_apply_peft(
    model_name: str,
    lora_config: LoraConfig,
    qlora_config: Optional[BitsAndBytesConfig] = None,
) -> torch.nn.Module:
    """
    Load model and apply PEFT (LoRA) configuration.
    
    Args:
        model_name: HuggingFace model name
        lora_config: LoRA configuration
        qlora_config: Optional quantization configuration
    
    Returns:
        Model with PEFT adapters applied
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=qlora_config, device_map="auto"
    )
    
    return get_peft_model(model, lora_config)


peft_model = get_apply_peft(model_name, lora_config, bnb_config)

print(f"PEFT model size: {get_model_size_gb(peft_model):.2f} GB")
peft_model.print_trainable_parameters()


PEFT model size: 4.61 GB
trainable params: 1,703,936 || all params: 1,237,518,336 || trainable%: 0.1377


## 8. Training Setup

Almost ready to train! Let's configure the training process.

We're setting up:
- **Training arguments**: Learning rate, batch size, number of epochs, etc.
- **Data collator**: Handles batching and padding during training
- **Callbacks**: Optional early stopping to prevent overfitting

**Key training parameters to understand:**
- `learning_rate`: How big of steps to take during optimization (typically 1e-4 to 5e-4 for LoRA)
- `per_device_train_batch_size`: Number of examples per GPU (adjust based on your memory)
- `gradient_accumulation_steps`: Simulate larger batches if memory is limited
- `num_train_epochs`: How many times to go through the entire dataset
- `save_steps` / `eval_steps`: How often to save checkpoints and evaluate


In [43]:
print("Setting up training...")

training_args = TrainingArguments(**training_args)
data_collator = DataCollatorForCausalLM(tokenizer)

callbacks = []
if early_stopping_config:
    early_stopping_callback = EarlyStoppingCallback(**early_stopping_config)
    callbacks.append(early_stopping_callback)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=validation,
    data_collator=data_collator,
    callbacks=callbacks if callbacks else None,
)

print("Training configuration complete")


The model is already on multiple devices. Skipping the move to device specified in `args`.


Setting up training...
Training configuration complete


## 9. Train the Model

Here we go! Time to train your model!

**What happens during training:**
- The model processes batches of your training data
- It makes predictions and compares them to the correct outputs
- It updates the LoRA adapter weights to improve its predictions
- Every few steps, it evaluates on your validation set to track progress

**What you'll see:**
- Loss values (lower is better - measures prediction error)
- Training speed (samples/second)
- Periodic evaluation results
- Progress bar showing completion

**How long will this take?** Depends on:
- Your dataset size (more data = longer training)
- Your GPU (faster GPU = faster training)
- Number of epochs and batch size

Grab a coffee - this could take anywhere from minutes to hours!


In [None]:
trainer.train()


## 10. Save and Share Your Model

Congratulations! Your model is trained. Now let's save it and share it with the world (or your team).

**What we're saving:**
- Only the LoRA adapter weights (not the entire base model!)
- The tokenizer configuration
- Training metadata

**Why this is cool:**
Instead of saving a 7GB+ model, we're only saving ~10-100MB of adapter weights. Anyone can then:
1. Download the base model
2. Load your adapters on top
3. Use your fine-tuned model

**Next steps after this notebook:**
- Test your model's performance on held-out data
- Compare it to the base model
- Iterate on your dataset or hyperparameters if needed
- Share your model on HuggingFace Hub for others to use!


In [None]:
local_adapter_path = "model-adapters"
peft_model.save_pretrained(local_adapter_path)
tokenizer.save_pretrained(local_adapter_path)

print(f"Adapters saved locally to: {local_adapter_path}")


In [None]:
push_to_hub(peft_model, tokenizer, save_model_name, HF_USERNAME)

print("Training complete and model pushed to HuggingFace Hub")
