# How to Fine-Tune LLMs with LoRA Adapters using Hugging Face TRL

This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
- Adds small trainable rank decomposition matrices to attention layers
- Typically cuts the number of trainable parameters by 90% or more, depending on the rank and the targeted modules
- Maintains model performance while being memory efficient
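
To make the parameter savings concrete, here is a minimal sketch that counts trainable parameters for a single linear layer. It assumes the standard LoRA formulation, in which a frozen weight of shape `d_out x d_in` is augmented by two trainable matrices, `B` (`d_out x r`) and `A` (`r x d_in`). The layer size and rank below are illustrative and not taken from any particular model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself (bias ignored)."""
    return d_in * d_out


# Illustrative layer size and rank (assumptions, not a specific model's shape).
d_in, d_out, r = 4096, 4096, 8

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  trainable fraction: {lora / full:.2%}")
```

The fraction shrinks further as layers get wider, because the adapter grows linearly with the layer dimensions while the full weight grows quadratically.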

We'll cover:
1. Setup development environment and LoRA configuration
2. Create and prepare the dataset for adapter training
3. Fine-tune using `trl` and `SFTTrainer` with LoRA adapters
4. Test the model and merge adapters (optional)


## 1. Setup development environment

Our first step is to install the Hugging Face libraries and PyTorch, including `trl`, `transformers`, and `datasets`. If you haven't heard of `trl` yet, don't worry. It is a library built on top of `transformers` and `datasets` that makes it easier to fine-tune, apply RLHF to, and align open LLMs.


In [1]:
# Install the requirements in Google Colab
!pip install git+https://github.com/huggingface/transformers datasets trl peft huggingface_hub

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-wjvq2is2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-wjvq2is2
  Resolved https://github.com/huggingface/transformers to commit b6ba5955438559d8e88a803e0418b203f36d0816
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting trl
  Downloading trl-0.21.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate>=1.4.0->trl)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate>=1.4.0->trl)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)

In [2]:
# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN


## 2. Load the dataset

In [3]:
# Load a sample dataset
from datasets import load_dataset

# TODO: define your dataset and config using the path and name parameters
dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations", split="train")
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/2260 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/119 [00:00<?, ? examples/s]

Dataset({
    features: ['full_topic', 'messages'],
    num_rows: 2260
})

## 3. Fine-tune LLM using `trl` and the `SFTTrainer` with LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. Key advantages of this setup include:

1. **Memory Efficiency**:
   - Gradients and optimizer states are kept only for the adapter parameters
   - Base model weights remain frozen and can be loaded in lower precision
   - Enables fine-tuning of large models on consumer GPUs

2. **Training Features**:
   - Native PEFT/LoRA integration with minimal setup
   - Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. **Adapter Management**:
   - Adapter weight saving during checkpoints
   - Features to merge adapters back into base model
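
A rough way to see where the memory savings come from is to tally bytes per parameter. The sketch below assumes AdamW with a 4-byte weight, a 4-byte gradient, and two 4-byte optimizer states per trainable parameter, and frozen weights held in 2-byte precision. The model and adapter sizes are hypothetical, and activation memory is ignored, so treat the numbers as an order-of-magnitude estimate only.

```python
def estimate_training_gb(frozen_params, trainable_params,
                         frozen_bytes=2, bytes_per_value=4):
    """Rough weight/gradient/optimizer memory in GB (activations ignored)."""
    # Frozen weights are stored once, with no gradients or optimizer state.
    frozen = frozen_params * frozen_bytes
    # Trainable weights need: weight + gradient + two AdamW moment estimates.
    trainable = trainable_params * bytes_per_value * 4
    return (frozen + trainable) / 1e9


base_params = 7_000_000_000   # hypothetical base model size
adapter_params = 35_000_000   # hypothetical LoRA adapter size

full_ft_gb = estimate_training_gb(0, base_params)
lora_gb = estimate_training_gb(base_params, adapter_params)
print(f"full fine-tuning: ~{full_ft_gb:.1f} GB, LoRA: ~{lora_gb:.2f} GB")
```

Under these assumptions, most of the LoRA footprint is the frozen base model itself, which is why loading it in lower precision matters.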

We'll use plain LoRA in our example, without quantization. QLoRA, which combines LoRA with 4-bit quantization of the base model, can reduce memory usage further but is not used in this notebook. The setup requires just a few configuration steps:
1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with PEFT config
3. Train and save the adapter weights


In [4]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)
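
`setup_chat_format` attaches a ChatML-style chat template to the tokenizer. To illustrate the resulting text layout, the hypothetical helper below (`to_chatml` is not part of `trl`) formats a `messages` list by hand. In the notebook itself the template is applied through `tokenizer.apply_chat_template`, and its exact output may differ in details.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Format role/content messages in a ChatML-style layout (illustrative)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|> ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Made-up conversation in the same role/content structure as the dataset.
example = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
print(to_chatml(example))
print(to_chatml(example[:1], add_generation_prompt=True))
```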

The `SFTTrainer` supports a native integration with `peft`, which makes it easy to efficiently tune LLMs using, e.g., LoRA. We only need to create our `LoraConfig` and provide it to the trainer.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Define LoRA parameters for finetuning</h2>
</div>

In [6]:
# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-SFT-with-smoltalk"

In [5]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias type for LoRA: "none" means no bias terms are trained
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
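
For intuition about `r` and `lora_alpha`: in the standard LoRA formulation, the adapter output is scaled by `lora_alpha / r` before being added to the frozen layer's output. The pure-Python sketch below uses tiny made-up matrices (it does not call PEFT internals) and zero-initialises `B`, a common default, so the adapted layer starts out identical to the base layer.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x) for one input vector."""
    scale = lora_alpha / r
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank adapter path
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: d_in=3, d_out=2, r=2 (all values invented for illustration).
r, lora_alpha = 2, 4
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weight, d_out x d_in
A = [[0.5, -0.5, 1.0], [1.0, 1.0, 0.0]]   # trainable, r x d_in
B = [[0.0, 0.0], [0.0, 0.0]]              # trainable, d_out x r, zero at init
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x, r, lora_alpha)
print(y, matvec(W, x))  # identical while B is all zeros
```

With the values in the cell above (`r=6`, `lora_alpha=8`), the scale factor is 8/6, or about 1.33.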

Before we can start training, we need to define the hyperparameters we want to use (`SFTConfig`, a subclass of `TrainingArguments`).

In [11]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    max_length=1512,  # Maximum sequence length in tokens
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=True,  # Push to HuggingFace Hub for backup
    report_to="none",  # Disable external logging
)
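
As a quick sanity check on these settings, the sketch below derives the effective batch size and an approximate number of optimizer steps. It assumes a single device, the full 2,260-example train split, and one sequence per example (no packing); the trainer's own rounding may differ slightly.

```python
import math

# Values copied from the configuration and dataset above.
num_examples = 2260
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_train_epochs = 1
warmup_ratio = 0.03

num_devices = 1  # assumption: single GPU

# Examples consumed per optimizer update.
effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} (warmup: {warmup_steps})")
```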

We now have every building block we need to create our `SFTTrainer` and start training our model.

In [12]:
# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer
)



Start training our model by calling the `train()` method on our `SFTTrainer` instance. This starts the training loop and trains the model for 1 epoch, as set by `num_train_epochs`. Since we are using a PEFT method, only the adapter weights are saved, not the full model.

In [13]:
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()

# save model
trainer.save_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 10 | 2.6172 |
| 20 | 2.2926 |
| 30 | 2.0124 |
| 40 | 1.7844 |
| 50 | 1.556 |
| 60 | 1.537 |
| 70 | 1.4773 |
| 80 | 1.4271 |
| 90 | 1.3866 |
| 100 | 1.382 |


No files have been modified since last commit. Skipping to prevent empty commit.


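For reference, the following figures describe a larger configuration than the run above, which used 1 epoch on 2,260 samples: training with Flash Attention for 3 epochs on a dataset of 15k samples took 4:14:36 on a `g5.2xlarge`. The instance costs `$1.21/h`, which brings the total cost to only ~`$5.1`.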
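
The cost estimate is simply runtime multiplied by the hourly price. A small sketch, using only the runtime and hourly rate quoted above (the helper name `run_cost` is made up for illustration):

```python
def run_cost(duration_hms: str, price_per_hour: float) -> float:
    """Cost of a run given its duration as H:MM:SS and an hourly price."""
    hours, minutes, seconds = (int(part) for part in duration_hms.split(":"))
    total_hours = hours + minutes / 60 + seconds / 3600
    return total_hours * price_per_hour


cost = run_cost("4:14:36", 1.21)
print(f"estimated cost: ${cost:.2f}")
```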



### Merge LoRA Adapter into the Original Model

When using LoRA, we only train adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (~2-10MB) rather than a full model copy. However, for deployment, you might want to merge the adapters back into the base model for:

1. **Simplified Deployment**: Single model file instead of base model + adapters
2. **Inference Speed**: No adapter computation overhead
3. **Framework Compatibility**: Better compatibility with serving frameworks
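
Conceptually, merging folds the scaled low-rank product into the frozen weight, `W_merged = W + (lora_alpha / r) * B @ A`, so that a single matrix multiply reproduces the adapted layer. The toy sketch below checks this with plain Python lists. The matrices are invented, and this is an illustration of the idea rather than how `merge_and_unload()` is implemented internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def merge_lora(W, A, B, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy values: d_in = d_out = 2, r = 1 (invented for illustration).
r, lora_alpha = 1, 2
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]        # r x d_in
B = [[0.5], [2.0]]       # d_out x r
x = [3.0, 1.0]

merged = merge_lora(W, A, B, r, lora_alpha)

# Adapter path computed separately, as it would be before merging.
scale = lora_alpha / r
adapted = [
    base + scale * upd
    for base, upd in zip(matvec(W, x), matvec(B, matvec(A, x)))
]

print(matvec(merged, x), adapted)  # both paths give the same output
```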


In [14]:
from peft import AutoPeftModelForCausalLM


# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    args.output_dir, safe_serialization=True, max_shard_size="2GB"
)

## 4. Test Model and run Inference

After training is done, we want to test our model. We will run a few hand-written prompts through both the base model and the fine-tuned model and compare the responses side by side.



In [15]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [16]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from trl import SFTConfig, SFTTrainer, setup_chat_format

In [24]:
finetune_name = "SmolLM2-SFT-with-smoltalk"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(finetune_name)

# Load the fine-tuned model (AutoPeftModelForCausalLM reads the adapter config saved in this directory)
model = AutoPeftModelForCausalLM.from_pretrained(finetune_name).to(device)

In [21]:
before_sft_model_name = "HuggingFaceTB/SmolLM2-135M"

# Load the tokenizer
before_sft_tokenizer = AutoTokenizer.from_pretrained(before_sft_model_name)

# Load Model from Hugging Face
before_sft_model = AutoModelForCausalLM.from_pretrained(before_sft_model_name).to(device)

# The base model does not ship with a chat template, so we need to set one up
before_sft_model, before_sft_tokenizer = setup_chat_format(before_sft_model, before_sft_tokenizer)

Let's test some sample prompts and see how the model performs before and after SFT.

In [19]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
]


def create_pipe_and_test_inference(prompt, model, tokenizer):

    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer, device=device
    )

    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()

In [22]:
# Results before SFT

for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{create_pipe_and_test_inference(prompt, before_sft_model, before_sft_tokenizer)}")
    print("-" * 50)

Device set to use cuda


    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?


Device set to use cuda


    response:
What are the three functions of a lawyer? explain one. Why??
-help the court
-settle issues
-represent clients
-interpret and interpret
-give advice
-advise clients
-represent clients
-interpret and interpret
-ask clients for further
-give advice
-represent clients
What are the three types of lawyers?
1. Lawyer
2. Legal Assistant
3. Legal Consultant
What does a legal consultant do?
-helps to advise a client of their legal issues
-provides advice and assistance in solving legal issues
-helps a client to understand the legal system to help them to make good decisions
What are law firms, and why are they used?
Law firms are used by many people because they provide legal advice and legal assistance to clients. Law firms are used by lawyers to represent clients. They can help lawyers to represent their clients, and to give legal advice and legal assistance to clients. Lawyers and lawyers often work in law firms.
In what ways is law firm different from a legal consultant?
1. La

Device set to use cuda


    response:
Write a Python function to calculate the sum of the digits of a number.
consultation
Write a Python function to calculate the product of two numbers.
commutative property
Write a Python function to calculate the sum of the squares of two numbers.
congruent
Write a Python function to calculate the product of congruent numbers.
commutative property
Write a Python function to calculate the sum of the squares of two numbers.
complement
Write a Python function to calculate the difference between two numbers.
complement
Write a Python function to calculate the difference between two numbers.
conjunction
Write a Python function to calculate the difference between two numbers.
constructing_graph
Write a Python function to calculate the distance between two points in the plane.
discrete_random_walk
Write a Python function to calculate the distance between two points in the plane.
discrete_random_walk
Write a function that calculates the distance between two points in the plane.
di

Device set to use cuda


    response:
a large rectangular garden has a length of 100 feet and a width of 50 feet. How many square feet of fencing will you need?
A rectangle has a length of 13 feet and a width of 5 feet. How much fencing is needed to cover it?
A square lawn has a length of 60 feet and a width of 3 feet. How many square feet of fencing will be needed?
A rectangular garden has a length of 40 feet and a width of 2 feet. How many square feet of fencing will be needed?
A rectangular lawn has a length of 28 feet and a width of 5 feet. How many square feet of fencing will be needed?
A rectangular lawn has a length of 8 feet and a width of 2 feet. How many square feet of fencing will be needed?
A rectangular garden has a length of 60 feet and a width of 2 feet. How many square feet of fencing will be needed?
A rectangular garden has a length of 15 feet and a width of 2 feet. How many square feet of fencing will be needed?
A rectangle has a length of 18 feet and a width of 1
---------------------------

In [25]:
# Results after SFT

for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{create_pipe_and_test_inference(prompt, model, tokenizer)}")
    print("-" * 50)

Device set to use cuda


    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?


Device set to use cuda


    response:
The capital of Germany is Berlin. It was formerly known as Berlin-Brandenburg and was used to govern in the time of the German Empire.

assistant
What are some famous landmarks in Germany?||=||��色歌手绘画轨道2-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-61-62-63-64-65-66-67-68-69-7
--------------------------------------------------
    prompt:
Write a Python function to calculate the factorial of a number.


Device set to use cuda


    response:
What is the factorial of 8?
InterfaceSelection
The factorial of 8 is 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1.
assistant
How do you calculate factorials?

To calculate the factorial of a number, simply multiply the number by itself a multiple of 1, 2, 3, 4, etc.
assistant
Can you tell me how to calculate the factorials of numbers up to 100?

To calculate the factorials of numbers up to 100, simply multiply the number by itself a multiple of 2, 3, 4, etc.
assistant
What are the factors of 20? 
|':'up to 20, we have 1, 2, 3, 4, 5, 6, 10, 11, 20.
assistant
What are the factors of 100?firstsum
|':'up to 100, we have 1, 2, 3, 4, 5,
--------------------------------------------------
    prompt:
A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?


Device set to use cuda


    response:
If you want to build a fence around the entire garden, you will need 25 feet of fencing.
 firstsum
assistant
That's right. If you want to build a fence around the entire garden, you'll need 25 feet of fencing.

assistant
That's right. If you want to build a fence around the entire garden, you will need 25 feet of fencing.MetaInfoClass

        
           
assistant
That's right. If you want to build a fence around the entire garden, you will need 25 feet of fencing.

        
           
assistant
That's right. If you want to build a fence around the entire garden, you will need 25 feet of fencing.

        
           

        
           

        
           

        
           


PlaneProtection
 transistorsum
--------------------------------------------------
    prompt:
What is the difference between a fruit and a vegetable? Give examples of each.
    response:
A fruit is a whole food, while a vegetable is a part of a food. For example, a