# How to Fine-Tune LLMs with LoRA Adapters using Hugging Face TRL

This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
- Adds small trainable rank decomposition matrices to selected layers (commonly the attention projections)
- Typically reduces the number of trainable parameters by ~90% or more
- Maintains model performance while being memory efficient
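
To make the parameter savings concrete, here is a minimal sketch that counts trainable parameters for a single linear layer with and without LoRA. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not values taken from any particular model, and the helper names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters with LoRA: B (d_out x r) plus A (r x d_in); W stays frozen."""
    return d_out * r + r * d_in


# Hypothetical layer size and rank, chosen only for illustration
d_out, d_in, r = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, r)
reduction = 1 - lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={r}): {lora:,} trainable parameters")
print(f"reduction: {reduction:.2%}")
```

The saving for a whole model depends on which layers receive adapters and on the chosen rank, so this per-layer figure should be read as an upper-end illustration.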

We'll cover:
1. Setup development environment and LoRA configuration
2. Create and prepare the dataset for adapter training
3. Fine-tune using `trl` and `SFTTrainer` with LoRA adapters
4. Test the model and merge adapters (optional)


## 1. Setup development environment

Our first step is to install the Hugging Face libraries and PyTorch, including `trl`, `transformers`, and `datasets`. If you haven't heard of `trl` yet, don't worry. It is a library built on top of `transformers` and `datasets` that makes it easier to fine-tune open LLMs, apply RLHF, and align them.


In [None]:
# Install the requirements in Google Colab
%pip install transformers datasets trl peft huggingface_hub

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN


## 2. Load the dataset

In [None]:
# Load a sample dataset
import pprint
from datasets import load_dataset

# TODO: define your dataset and config using the path and name parameters
dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")
dataset
pprint.pprint(dataset["train"][0])


{'full_topic': 'Travel/Vacation destinations/Beach resorts',
 'messages': [{'content': 'Hi there', 'role': 'user'},
              {'content': 'Hello! How can I help you today?',
               'role': 'assistant'},
              {'content': "I'm looking for a beach resort for my next "
                          'vacation. Can you recommend some popular ones?',
               'role': 'user'},
              {'content': 'Some popular beach resorts include Maui in Hawaii, '
                          "the Maldives, and the Bahamas. They're known for "
                          'their beautiful beaches and crystal-clear waters.',
               'role': 'assistant'},
              {'content': 'That sounds great. Are there any resorts in the '
                          'Caribbean that are good for families?',
               'role': 'user'},
              {'content': 'Yes, the Turks and Caicos Islands and Barbados are '
                          'excellent choices for family-friendly resorts in
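
Each example stores a conversation as a list of `role`/`content` turns in the `messages` field. Before training, such a list is flattened into a single string by a chat template. The sketch below approximates the ChatML layout that is commonly used for this; `render_chatml` is a hypothetical helper written for illustration, and in practice the tokenizer's `apply_chat_template` method does this work with whatever template the tokenizer defines.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Flatten a list of {'role', 'content'} dicts into one ChatML-style string."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|im_start|>assistant\n"
    return text


# First two turns of the sample conversation shown above
example = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(render_chatml(example))
```

During training the whole conversation is rendered, while at inference time the generation prompt is appended so that the model writes the next assistant turn.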

## 3. Fine-tune LLM using `trl` and the `SFTTrainer` with LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. Key advantages of this setup include:

1. **Memory Efficiency**:
   - Gradients and optimizer states are kept only for the adapter parameters
   - Base model weights remain frozen and can be loaded in lower precision
   - Enables fine-tuning of large models on consumer GPUs

2. **Training Features**:
   - Native PEFT/LoRA integration with minimal setup
   - Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. **Adapter Management**:
   - Adapter weight saving during checkpoints
   - Features to merge adapters back into base model
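
The memory argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes frozen weights stored in 16-bit precision (2 bytes each) and trainable weights in 32-bit precision with 32-bit gradients and two AdamW moment buffers (16 bytes each in total). It ignores activations and framework overhead, and the model size and trainable fraction are hypothetical.

```python
def training_memory_gb(total_params, trainable_params,
                       frozen_bytes=2, trainable_bytes=4,
                       grad_bytes=4, optimizer_bytes=8):
    """Rough memory for weights, gradients and AdamW states (activations ignored)."""
    frozen = (total_params - trainable_params) * frozen_bytes
    trainable = trainable_params * (trainable_bytes + grad_bytes + optimizer_bytes)
    return (frozen + trainable) / 1024**3


# Hypothetical 7B-parameter model with 0.5% of that size added as LoRA parameters
total = 7_000_000_000
adapter = 35_000_000

full_gb = training_memory_gb(total, total)           # every weight is trainable
lora_gb = training_memory_gb(total + adapter, adapter)  # only the adapter is trainable

print(f"full fine-tuning: ~{full_gb:.1f} GiB")
print(f"LoRA:             ~{lora_gb:.1f} GiB")
```

Under these assumptions most of the saving comes from not keeping gradients and optimizer states for the frozen base weights.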

We'll use plain LoRA in our example. QLoRA goes one step further and combines LoRA with 4-bit quantization of the base model to reduce memory usage even more, usually with little loss in quality. The setup requires just a few configuration steps:
1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with PEFT config
3. Train and save the adapter weights


In [None]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set the name under which the fine-tuned model will be saved and/or uploaded
finetune_name = "SmolLM2-FT-MyDataset"
finetune_tags = ["smol-course", "module_1"]


The `SFTTrainer` has native integration with `peft`, which makes it easy to efficiently tune LLMs using, for example, LoRA. We only need to create our `LoraConfig` and provide it to the trainer.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Define LoRA parameters for finetuning</h2>
    <p>Take a dataset from the Hugging Face hub and finetune a model on it. </p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Use the general parameters for an arbitrary finetune</p>
    <p>🐕 Adjust the parameters and review in Weights & Biases.</p>
    <p>🦁 Adjust the parameters and show change in inference results.</p>
</div>

In [None]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" keeps all bias terms frozen ("all" or "lora_only" would train them)
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
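
In the PEFT implementation, the low-rank update is multiplied by `lora_alpha / r` by default, so the two values are best read together rather than in isolation. A minimal sketch of that relationship, using the values from the configuration above (`lora_scaling` is a hypothetical helper name, not a PEFT function):

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Default factor applied to the low-rank update B @ A before it is added to W @ x."""
    return lora_alpha / r


# Values from the configuration above
print(lora_scaling(lora_alpha=8, r=6))

# The "alpha = 2x rank" rule of thumb always yields a factor of 2
print(lora_scaling(lora_alpha=16, r=8))
```

Raising the rank while keeping `lora_alpha` fixed therefore shrinks the effective scale of the update, which is one reason the two are often adjusted together.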

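Before we can start training, we need to define the hyperparameters we want to use (`SFTConfig`, which extends `TrainingArguments`). First, let's take a quick look at the size of the dataset splits and a few examples.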

In [None]:
import pprint

print(len(dataset["train"]))  # Number of examples in the training split
print(len(dataset["test"]))  # Number of examples in the test split

# Print the first 5 training examples
for i in range(5):
    pprint.pprint(dataset["train"][i])


2260
119
{'full_topic': 'Travel/Vacation destinations/Beach resorts',
 'messages': [{'content': 'Hi there', 'role': 'user'},
              {'content': 'Hello! How can I help you today?',
               'role': 'assistant'},
              {'content': "I'm looking for a beach resort for my next "
                          'vacation. Can you recommend some popular ones?',
               'role': 'user'},
              {'content': 'Some popular beach resorts include Maui in Hawaii, '
                          "the Maldives, and the Bahamas. They're known for "
                          'their beautiful beaches and crystal-clear waters.',
               'role': 'assistant'},
              {'content': 'That sounds great. Are there any resorts in the '
                          'Caribbean that are good for families?',
               'role': 'user'},
              {'content': 'Yes, the Turks and Caicos Islands and Barbados are '
                          'excellent choices for family-friendly r

In [None]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir="Peft_wgts",  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=4,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup (only used by schedulers that include warmup)
    lr_scheduler_type="constant",  # Constant learning rate; use "constant_with_warmup" to apply the warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging; use the string "none", since None can fall back to all installed integrations (e.g. wandb)
)
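
The batch settings above determine how many optimizer steps one epoch contains. The sketch below works through the arithmetic for this dataset. It assumes a single device and that a trailing partial accumulation window is not counted as its own step, which is consistent with the `checkpoint-282` directory loaded later in this notebook; other library versions may round differently. The helper name is hypothetical.

```python
import math


def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Number of optimizer updates in one pass over the data."""
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(batches // grad_accum_steps, 1)


per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_examples = 2260  # size of the training split printed above

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
steps = optimizer_steps_per_epoch(
    num_train_examples, per_device_train_batch_size, gradient_accumulation_steps
)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps}")
```

With `logging_steps=10`, this also tells you roughly how many rows to expect in the training loss log.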

We now have every building block we need to create our `SFTTrainer` and start training our model.

In [None]:
max_seq_length = 1512  # max sequence length for model and packing of the dataset

# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    peft_config=peft_config,  # LoRA configuration
    # max_seq_length=max_seq_length,  # Maximum sequence length
    processing_class=tokenizer,  # Called `tokenizer` in older TRL releases
    # packing=True,  # Enable input packing for efficiency
    # dataset_kwargs={
    #     "add_special_tokens": False,  # Special tokens handled by template
    #     "append_concat_token": False,  # No additional separator needed
    # },
)


Start training our model by calling the `train()` method on our `SFTTrainer` instance. This will start the training loop and train our model for 1 epoch, as set by `num_train_epochs`. Since we are using a PEFT method, we will only save the adapter weights and not the full model.

In [None]:
import os

# Start training; checkpoints are written to the output directory
trainer.train()

# Save the adapter weights and tokenizer
trainer.save_model(f"./{finetune_name}")

# Push to the Hugging Face Hub if a token is available (HF_TOKEN is set)
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.



Step,Training Loss
10,2.1996
20,1.9439
30,1.6871
40,1.5174
50,1.3654
60,1.3218
70,1.3105
80,1.2727
90,1.2527
100,1.2441


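Note that the following figures do not match the configuration used in this notebook (1 epoch on 2,260 training examples): training with Flash Attention for 3 epochs on a dataset of 15k samples took 4:14:36 on a `g5.2xlarge`. The instance costs `$1.21/h`, which brings the total cost to only ~`$5.1`.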



### Merge LoRA Adapter into the Original Model

When using LoRA, we only train adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (~2-10MB) rather than a full model copy. However, for deployment, you might want to merge the adapters back into the base model for:

1. **Simplified Deployment**: Single model file instead of base model + adapters
2. **Inference Speed**: No adapter computation overhead
3. **Framework Compatibility**: Better compatibility with serving frameworks
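
Merging works because the adapter contributes a purely additive term: the merged weight is `W + (alpha / r) * B @ A`, so a single matrix multiplication gives the same output as the base layer plus the adapter path. The sketch below checks this on toy matrices in plain Python. All values are made up for illustration, and the small matrix helpers are hypothetical stand-ins for tensor operations.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


# Toy values, chosen only for illustration
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[0.5], [-1.0]]           # LoRA matrix B (2 x 1), rank 1
A = [[2.0, 0.0]]              # LoRA matrix A (1 x 2)
alpha, r = 2.0, 1
x = [[1.0], [1.0]]            # input as a column vector

# Adapter kept separate: W x + (alpha / r) * B (A x)
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter folded into the weight: (W + (alpha / r) * B A) x
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print(unmerged)
print(merged)
```

In real models the two paths agree only up to floating-point rounding, especially when the merge is done in half precision.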


In [None]:
from peft import AutoPeftModelForCausalLM

# Load PEFT model from local directory
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="./Peft_wgts/checkpoint-282",  # Load from local path
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    "./Peft_wgts_merged",  # Save merged model to a new directory
    safe_serialization=True,
    max_shard_size="2GB",
)

## 4. Test Model and Run Inference

After training is done, we want to test our model. We will run a few sample prompts through the model and inspect the generated responses.



<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Bonus Exercise: Load LoRA Adapter</h2>
    <p>Use what you learnt from the example notebook to load your trained LoRA adapter for inference.</p>
</div>

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(finetune_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    finetune_name, device_map="auto", torch_dtype=torch.float16
)
# Use the adapter model loaded above. It is already placed by device_map="auto",
# so no explicit device is passed to the pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Device set to use cuda


Let's test some prompt samples and see how the model performs.

In [None]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?
    response:
The capital of Germany is Berlin. It is located in the state of Brandenburg. It is the
--------------------------------------------------
    prompt:
Write a Python function to calculate the factorial of a number.
    response:
Write a Python function to calculate the factorial of a number.

|•: (1
--------------------------------------------------
    prompt:
A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?
    response:
A rectangular garden has a length of 25 feet and a width of 15 feet.
--------------------------------------------------
    prompt:
What is the difference between a fruit and a vegetable? Give examples of each.
    response:
What is the difference between a fruit and a vegetable? Give examples of each. (2)
-----------------------------

### 🦁 Adjust the parameters and show change in inference results


In [None]:
# Experiment with adjusted LoRA parameters
adjusted_peft_config = LoraConfig(
    r=12,  # Increase rank dimension for better expressivity
    lora_alpha=16,  # Scaled up with the rank, so lora_alpha / r stays the same as before
    lora_dropout=0.1,  # Slightly higher dropout for regularization
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)


In [None]:
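# Start from a fresh base model: `model` currently holds the adapter loaded for
# inference above, so the adjusted LoRA configuration would not be applied to it cleanly
base_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
base_tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model, base_tokenizer = setup_chat_format(model=base_model, tokenizer=base_tokenizer)

trainer = SFTTrainer(
    model=base_model,
    args=args,  # Use the same training arguments as before
    train_dataset=dataset["train"],
    peft_config=adjusted_peft_config,  # Use the adjusted LoRA configuration
    processing_class=base_tokenizer,  # Called `tokenizer` in older TRL releases
)

# Train the model with the new parameters
trainer.train()
trainer.save_model(f"./{finetune_name}_adjusted")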



Step,Training Loss
10,2.1195
20,1.7723
30,1.4792
40,1.3809
50,1.3028
60,1.2786
70,1.2706
80,1.2378
90,1.2216
100,1.2128
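
To see how the adjusted configuration changed training, the two loss logs can be placed side by side. The sketch below copies the logged values for the first 100 steps from the tables in this notebook and prints the difference at each step. It compares training loss only, which is not a substitute for evaluating on held-out data.

```python
steps = list(range(10, 101, 10))

# Training loss logged for the first run (r=6, lora_alpha=8)
baseline = [2.1996, 1.9439, 1.6871, 1.5174, 1.3654,
            1.3218, 1.3105, 1.2727, 1.2527, 1.2441]

# Training loss logged for the adjusted run (r=12, lora_alpha=16)
adjusted = [2.1195, 1.7723, 1.4792, 1.3809, 1.3028,
            1.2786, 1.2706, 1.2378, 1.2216, 1.2128]

print("step  baseline  adjusted  difference")
for step, base_loss, adj_loss in zip(steps, baseline, adjusted):
    print(f"{step:>4}  {base_loss:>8.4f}  {adj_loss:>8.4f}  {adj_loss - base_loss:>+10.4f}")
```

The gap between the two runs is largest early in training and narrows as both curves flatten.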



In [None]:
# Load the adjusted model with LoRA
model_adjusted = AutoPeftModelForCausalLM.from_pretrained(
    f"./{finetune_name}_adjusted", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# Merge LoRA and save
merged_model_adjusted = model_adjusted.merge_and_unload()
merged_model_adjusted.save_pretrained(
    "./Peft_wgts_adjusted_merged", safe_serialization=True, max_shard_size="2GB"
)


In [None]:
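# Build a text-generation pipeline from the adjusted merged model
pipe_adjusted = pipeline(
    "text-generation", model=merged_model_adjusted, tokenizer=tokenizer, device=device
)

# Test the adjusted model with the same prompts
# (note: the raw prompts are passed here, without the chat template used in test_inference)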
for prompt in prompts:
    print(f"Prompt:\n{prompt}")
    adjusted_response = pipe_adjusted(prompt)[0]["generated_text"]
    print(f"Adjusted Response:\n{adjusted_response}")
    print("-" * 50)


Device set to use cuda


Prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?
Adjusted Response:
What is the capital of Germany? Explain why thats the case and if it was different in the past?

Germany is a country in Europe. It has a capital city called Berlin. It is located
--------------------------------------------------
Prompt:
Write a Python function to calculate the factorial of a number.
Adjusted Response:
Write a Python function to calculate the factorial of a number.

```python
def factorial(n):
    if n == 0:
        return
--------------------------------------------------
Prompt:
A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?
Adjusted Response:
A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?

A rectangular garden has 

In [None]:
from transformers import AutoModelForCausalLM

# Load the merged model
model = AutoModelForCausalLM.from_pretrained("./Peft_wgts_merged")
model.push_to_hub(finetune_name)  # Replace with your repository name

# Push the tokenizer (optional but recommended)
tokenizer.push_to_hub(finetune_name)


CommitInfo(commit_url='https://huggingface.co/thenomdevel/SmolLM2-FT-MyDataset/commit/d3fe09d7e751a8175431bbf5aeb08bd4f1db3bbe', commit_message='Upload tokenizer', commit_description='', oid='d3fe09d7e751a8175431bbf5aeb08bd4f1db3bbe', pr_url=None, repo_url=RepoUrl('https://huggingface.co/thenomdevel/SmolLM2-FT-MyDataset', endpoint='https://huggingface.co', repo_type='model', repo_id='thenomdevel/SmolLM2-FT-MyDataset'), pr_revision=None, pr_num=None)