# 📓 The GenAI Revolution Cookbook

**Title:** Parameter-Efficient Fine-Tuning (PEFT) with LoRA [2025 Hands-On Guide]

**Description:** Fine-tune LLMs on a single GPU using PEFT and LoRA to save memory, ship MB-sized adapters, evaluate outputs confidently, privately.

**📖 Read the full article:** [Parameter-Efficient Fine-Tuning (PEFT) with LoRA [2025 Hands-On Guide]](https://blog.thegenairevolution.com/article/parameter-efficient-fine-tuning-peft-with-lora-2025-hands-on-guide)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



Fine\-tuning a large language model can be a fantastic way to tailor it to your specific needs. It can also be more cost\-effective, since a smaller model fine\-tuned for one task can often stand in for a much larger, general\-purpose one. But here's the thing: full fine\-tuning of large models requires serious computational power and memory to hold all the model's weights, optimizer states, gradients, and activations. These demands add up fast and can easily exceed what regular hardware can handle.

Parameter\-Efficient Fine\-Tuning (PEFT) offers a much more practical alternative to full fine\-tuning. By updating only a small portion of the model's parameters, PEFT drastically cuts down memory requirements while still achieving comparable performance. This makes it possible to fine\-tune even large models on more limited hardware, like a single GPU.

In this post, I'll explain how PEFT works and walk you through a practical example, showing how this technique can make fine\-tuning large models both easier and more resource\-friendly.

## Why PEFT?

Before we jump into the code, let's take a closer look at what makes PEFT valuable. We need to understand the limitations of full fine\-tuning first, then see how PEFT addresses them.

### Limitations of Full Fine\-Tuning

* **High Memory and Storage Requirements**: Full fine\-tuning updates every parameter in the model, so memory must hold not just the model weights (tens to hundreds of gigabytes for the largest models) but also optimizer states, gradients, and activations. Each task\-specific version you create also needs its own full copy in storage.
* **Risk of Catastrophic Forgetting**: When you fine\-tune for a new task, updating all the weights can overwrite knowledge the model already had, so it may perform worse on tasks it used to handle well. This limits the model's flexibility across diverse applications.
* **High Deployment Costs**: Fine\-tuning produces a completely new large model for every task, which increases both storage and deployment costs. This makes it pretty challenging to maintain multiple task\-specific models without significant resources.
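
To put rough numbers on the memory point, here is a back\-of\-the\-envelope sketch. It assumes fp32 weights and gradients (4 bytes each) and an Adam\-style optimizer that keeps two fp32 states per parameter (8 bytes). Activations are left out because they depend on batch size and sequence length. The function name and the example model sizes are illustrative, not taken from any library.

```python
def full_finetune_memory_gb(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Rough training memory in GB (10**9 bytes), excluding activations."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9

# Illustrative model sizes (assumptions, not measurements)
for name, n in [("250M", 250_000_000), ("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
    print(f"{name} parameters: ~{full_finetune_memory_gb(n):.0f} GB before activations")
```

Under these assumptions every parameter costs 16 bytes during training, which is why even mid\-sized models quickly outgrow a single consumer GPU.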

### How PEFT Overcomes These Challenges

* **Reduced Memory Footprint**: PEFT dramatically cuts memory usage by updating only a small portion of the model's parameters, so gradients and optimizer states are needed only for that portion. This makes training feasible on a single GPU, and the resulting weights are compact, often just a few megabytes.
* **Efficient Multitasking**: PEFT\-tuned weights are specific to each task and can be swapped out easily during inference. This allows for flexible, efficient adaptation of one model for multiple tasks without duplicating the entire model.
* **Lower Risk of Catastrophic Forgetting**: By keeping most of the model's original parameters fixed, PEFT minimizes the changes that can lead to forgetting. This helps the model retain knowledge across tasks more effectively.
* **On\-Premise and Private Deployments**: PEFT's lightweight approach makes it well suited to on\-premise deployments where data privacy is critical. Because training fits on modest hardware, your data can stay on local servers instead of being sent to external APIs.

### When to Choose PEFT Over Full Fine\-Tuning

* **Prompting Limitations**: When simple prompting isn't enough to improve performance, PEFT offers a more advanced way to fine\-tune specific parts of the model.
* **Model Size and Resource Constraints**: Full fine\-tuning on large models is often too resource\-intensive for consumer hardware. PEFT is a more practical option when you're working with limited resources.
* **Data Privacy**: PEFT allows for on\-premise fine\-tuning, making it possible to customize models without relying on external servers. This is crucial for privacy\-focused applications.

While PEFT is ideal when simple prompting falls short, you can often achieve strong results with advanced [prompt engineering strategies for LLM APIs](/article/prompt-engineering-with-llm-apis-how-to-get-reliable-outputs-4) before moving to fine\-tuning. This helps you maximize model performance with minimal code changes.

In short, PEFT is a flexible and cost\-effective way to adapt large language models, especially when on\-premise deployment, data privacy, or limited hardware resources are priorities.

## PEFT Methods Overview

PEFT includes several techniques for fine\-tuning large language models by modifying only a small subset of parameters. Each method fits different tasks, resource levels, and model needs. Let me walk you through the main PEFT methods.

PEFT methods can be especially powerful when combined with retrieval augmented generation workflows, where efficient fine\-tuning helps adapt large models to domain\-specific retrieval tasks. For a hands\-on guide to building robust pipelines, check out our overview of [retrieval\-augmented generation (RAG) workflows](/article/rag-101-build-an-index-run-semantic-search-and-use-langchain-to-automate-it).

### Selective Methods

Selective methods adapt specific parts of the model, like certain layers or parameter types, without changing the whole model. This approach works well for tasks that benefit from focusing on certain layers, though it may not be as effective for tasks that need more comprehensive adjustments.

How it works:

* Target specific layers or components, such as attention layers or feed\-forward networks.
* Adjust only selected parameter types to balance task requirements with efficiency.
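
As a rough illustration of the selective idea, the sketch below decides which parameters to train purely from their names. The parameter names are hypothetical, loosely modeled on T5\-style naming. With a real PyTorch model you would iterate over `model.named_parameters()` and set `param.requires_grad` accordingly.

```python
def select_trainable(param_names, keywords):
    """Map each parameter name to True (train) or False (freeze)."""
    return {name: any(k in name for k in keywords) for name in param_names}

# Hypothetical parameter names, for illustration only
names = [
    "encoder.block.0.layer.0.SelfAttention.q.weight",
    "encoder.block.0.layer.1.DenseReluDense.wi.weight",
    "decoder.block.0.layer.0.SelfAttention.v.weight",
    "lm_head.weight",
]

# Train only the attention parameters and freeze everything else
plan = select_trainable(names, keywords=["SelfAttention"])
for name, trainable in plan.items():
    print(f"{'train ' if trainable else 'freeze'}  {name}")
```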

### Reparameterization Methods

Reparameterization methods, like LoRA (Low\-Rank Adaptation), are perfect for tasks where reducing memory and computational costs is essential. These methods use small, low\-rank matrices to fine\-tune while keeping the main model's structure intact.

How it works:

* Keep the main model weights fixed and add small rank\-decomposition matrices.
* Update these smaller matrices during training to save on memory and compute costs.
* Combine the low\-rank matrices with the main weights during inference to keep latency and memory usage low.
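
The steps above can be sketched with plain Python lists. This is a minimal illustration of the low\-rank update W' = W + (alpha / r) · B·A, not PEFT's internal implementation. The helper names and the tiny 2×2 example are made up for readability.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight used at inference."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one LoRA pair: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2 x 2)
B = [[1.0], [2.0]]             # trainable, 2 x r with r = 1
A = [[3.0, 4.0]]               # trainable, r x 2
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)

# Parameter count for one 768 x 768 projection at rank 32, versus the full matrix
print(lora_param_count(768, 768, 32), "vs", 768 * 768)
```

For a 768×768 projection at rank 32, the pair trains 49,152 values instead of 589,824. Once merged, the layer is a single matrix again, so inference adds no extra latency.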

### Additive Methods

Additive methods add new, task\-specific features or prompts to improve performance without changing the model's core structure. Adapters and Soft Prompts are great for tasks that need flexible additions to enhance accuracy.

How it works:

* **Adapters**: Insert additional layers that are trained specifically for the task.
* **Soft Prompts**: Prepend a small set of trainable prompt embeddings (virtual tokens) to the input and train only those, keeping the main model fixed.
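
To get a feel for how small these additions are, the sketch below counts the new parameters each approach would introduce. It assumes a standard bottleneck adapter (a down\-projection and an up\-projection, each with a bias) and a soft prompt stored as one embedding vector per virtual token. The sizes used are illustrative.

```python
def adapter_param_count(d_model, bottleneck):
    """Down-projection (d_model -> bottleneck) plus up-projection, each with a bias."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up

def soft_prompt_param_count(num_virtual_tokens, d_model):
    """One trainable embedding vector per virtual token."""
    return num_virtual_tokens * d_model

# Illustrative sizes (assumptions): hidden size 768, bottleneck 64, 20 virtual tokens
print(adapter_param_count(d_model=768, bottleneck=64))
print(soft_prompt_param_count(num_virtual_tokens=20, d_model=768))
```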

Now that you're familiar with the different PEFT methods, it's worth noting that LoRA (Low\-Rank Adaptation) is one of the most commonly used techniques in PEFT. Actually, discussions about PEFT often refer specifically to LoRA. In this post, I'll focus on using LoRA to demonstrate how it efficiently balances memory usage with strong performance.

## PEFT/LoRA Fine\-Tuning Walkthrough

Now let's go through how PEFT works, step by step, using LoRA. We'll build on the use case and training dataset from the previous post on Full Fine\-Tuning. For a detailed look at the use case and dataset, feel free to revisit [Fine\-Tuning Large Language Models: A Step\-by\-Step Cookbook](https://thegenairevolution.com/fine-tuning-large-language-models-a-step-by-step-cookbook/).

If you're interested in [integrating LLMs into your data science workflow](/article/how-to-boost-workflow-with-llm-pair-programming-in-jupyter-ai-2), especially for interactive coding and debugging, explore our tutorial on using LLM pair programming in Jupyter AI. It complements PEFT by showing how fine\-tuned models can accelerate real world projects.

### Install the Required Libraries

To get started, you'll need to install the PEFT library, which provides advanced Parameter\-Efficient Fine\-Tuning methods for adapting large language models efficiently.

In [None]:
!pip install peft

### Set Up the Model for Fine\-Tuning

To configure the PEFT/LoRA model, you'll add a new adapter layer specifically for fine\-tuning while keeping the underlying LLM frozen. This setup ensures that only the adapter layer is trained, leaving the rest of the model untouched.

In [None]:
# Import necessary libraries
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time

# Define the base model to be used
model_name = 'google/flan-t5-base'

# Load the pre-trained model and tokenizer, setting the data type for efficient computation
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA parameters
lora_config = LoraConfig(
    r=32,                            # Rank of the low-rank update matrices; a higher rank means more trainable parameters
    lora_alpha=32,                   # Scaling factor; the LoRA update is scaled by lora_alpha / r
    target_modules=["q", "v"],       # Modules to adapt; here, the query and value projections in the attention layers
    lora_dropout=0.05,               # Dropout rate applied to LoRA layers to improve generalization
    bias="none",                     # Bias handling; 'none' leaves all bias terms frozen
    task_type=TaskType.SEQ_2_SEQ_LM  # Task type, set to sequence-to-sequence language modeling
)

# Add LoRA adapters to the original model
peft_model = get_peft_model(original_model, lora_config)

### Inspect Trainable Parameters

Let's create a simple helper function to check the percentage of trainable parameters. Why? This helps you see how PEFT optimizes resource usage by updating only a small fraction of the model. As shown here, only 1\.41% of the parameters are trainable. This makes fine\-tuning feasible even on limited hardware, such as a single GPU.

In [None]:
def print_trainable_parameters(model):
    # Initialize counters for trainable and total parameters
    trainable_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
    total_params = sum(param.numel() for param in model.parameters())
    
    # Calculate the percentage of trainable parameters
    trainable_percentage = 100 * trainable_params / total_params
    
    # Print parameter summary
    print(f"Trainable parameters: {trainable_params}")
    print(f"Total parameters: {total_params}")
    print(f"Percentage of trainable parameters: {trainable_percentage:.2f}%")
    
# Call the function to display the parameter summary
print_trainable_parameters(peft_model)
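
The 1\.41% figure can be sanity\-checked by hand. The sketch below assumes FLAN\-T5\-base has a hidden size of 768, 12 encoder blocks with self\-attention, and 12 decoder blocks with self\-attention plus cross\-attention, and it assumes a base parameter count of 247,577,856. Treat these as assumptions to confirm against your own run.

```python
d_model = 768                          # assumed hidden size of FLAN-T5-base
r = 32                                 # LoRA rank from lora_config
attention_blocks = 12 + 12 * 2         # encoder self-attn + decoder self- and cross-attn (assumed)
adapted_modules = attention_blocks * 2 # "q" and "v" in each attention block

# Each adapted module adds A (r x d_model) and B (d_model x r)
lora_params = adapted_modules * r * (d_model + d_model)
base_params = 247_577_856              # assumed base model parameter count
pct = 100 * lora_params / (base_params + lora_params)

print(f"LoRA parameters: {lora_params:,}")
print(f"Trainable share: {pct:.2f}%")
```

PEFT models also expose a built\-in `print_trainable_parameters()` method that reports the same kind of summary as the helper above.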

### Load and Preprocess the Data

You'll use the same dataset prepared during the exploration of full fine\-tuning. For complete details on the use case and dataset, feel free to revisit [Fine\-Tuning Large Language Models: A Step\-by\-Step Cookbook](https://thegenairevolution.com/fine-tuning-large-language-models-a-step-by-step-cookbook/).

In [None]:
from datasets import load_dataset
# Load your dataset from the JSONL file
dataset = load_dataset("json", data_files="city_qna.jsonl")

# Check the dataset structure
print(dataset["train"][0])

In [None]:
{'input': 'Can you describe the city of Paris to me?', 'output': 'Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.'}

In [None]:
# Tokenize the dataset
def preprocess_data(examples):
    # Extract inputs and outputs as lists from the dictionary
    inputs = examples["input"]
    outputs = examples["output"]

    # Tokenize inputs and outputs with padding and truncation
    model_inputs = tokenizer(inputs, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(outputs, max_length=128, padding="max_length", truncation=True).input_ids

    # Replace padding token IDs with -100 to ignore them in the loss function
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]
    model_inputs["labels"] = labels

    return model_inputs

# Use the map function to apply the preprocessing to the whole dataset
tokenized_dataset = dataset["train"].map(preprocess_data, batched=True)
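
To see what the label masking does, here is a toy version that runs without a tokenizer. The token IDs are invented, and the pad ID of 0 is an assumption that matches T5\-style tokenizers. Check `tokenizer.pad_token_id` for your own model.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets with this value by default

def mask_padding(label_batch, pad_token_id):
    """Replace padding IDs with IGNORE_INDEX so they do not contribute to the loss."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in seq] for seq in label_batch]

# Made-up token IDs, padded to length 6 with an assumed pad ID of 0
labels = [[1261, 19, 3, 1, 0, 0], [8, 1, 0, 0, 0, 0]]
masked = mask_padding(labels, pad_token_id=0)
print(masked)
```

Without this step, the model would also be trained to predict padding tokens, which dilutes the loss when most of each 128\-token label sequence is padding.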

### Set Up Training Configuration

Before you start fine\-tuning, define the training arguments and create a Trainer instance to manage the training process.

In [None]:
# Set up output directory with a unique name based on the current timestamp
output_dir = f'./peft-flan-t5-city-tuning-{str(int(time.time()))}'

# Define training arguments for PEFT
peft_training_args = TrainingArguments(
    output_dir=output_dir,               # Directory to save model checkpoints and logs
    auto_find_batch_size=True,           # Retry with a smaller batch size if training runs out of memory
    learning_rate=5e-4,                  # Learning rate, set to a moderate value to balance stability and speed
    num_train_epochs=3,                  # Number of full passes; ignored here because max_steps is set below
    logging_steps=10,                    # Log training metrics every 10 steps
    save_steps=100,                      # Save a model checkpoint every 100 steps
    eval_strategy="no",                  # No evaluation during training (no eval dataset is passed to the Trainer)
    save_total_limit=2,                  # Keep only the last 2 checkpoints to save storage space
    per_device_train_batch_size=8,       # Starting batch size per device during training
    per_device_eval_batch_size=8,        # Batch size per device during evaluation
    weight_decay=0.01,                   # Weight decay for regularization, helps reduce overfitting
    max_steps=500                        # Cap on training steps; a positive value overrides num_train_epochs
)

# Initialize the Trainer with the specified model, training arguments, and dataset
peft_trainer = Trainer(
    model=peft_model,                    # The PEFT/LoRA model to be fine-tuned
    args=peft_training_args,             # Training arguments defined above
    train_dataset=tokenized_dataset,     # Training dataset, assumed to be preprocessed
)
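
Because both `num_train_epochs` and `max_steps` are set above, it helps to work out which one decides the training length. In the Hugging Face Trainer, a positive `max_steps` takes precedence. The sketch below assumes a single device, no gradient accumulation, and a hypothetical dataset of 1,000 examples (the real size of `city_qna.jsonl` is not stated here).

```python
import math

def planned_steps(num_examples, batch_size, num_epochs, max_steps=-1):
    """Return (total optimizer steps, approximate passes over the data)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total = max_steps if max_steps > 0 else steps_per_epoch * num_epochs
    return total, total / steps_per_epoch

# Hypothetical dataset size; batch size, epochs, and max_steps mirror the arguments above
steps, epochs = planned_steps(num_examples=1000, batch_size=8, num_epochs=3, max_steps=500)
print(f"{steps} steps, about {epochs:.1f} passes over the data")
```

With these numbers the step cap means more passes over the data than the three epochs requested, so it is worth checking both settings against your actual dataset size.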

### Fine\-Tune the Model

Now you'll start the training process using the Trainer instance you set up. This will fine\-tune the model according to the specified configuration and dataset.

In [None]:
# Start the training process using the defined Trainer instance
peft_trainer.train()

# Define a local path to save the fine-tuned PEFT/LoRA model and tokenizer
peft_model_path = "./peft-flan-t5-city-tuning-checkpoint-local"

# Save the fine-tuned model to the specified path for later use
peft_trainer.model.save_pretrained(peft_model_path)

# Save the tokenizer associated with the model to the same path, ensuring compatibility during inference
tokenizer.save_pretrained(peft_model_path)

### Check the Size of the Fine\-Tuned Model

To highlight the efficiency of PEFT, you can check the size of the fine\-tuned model. Since PEFT modifies only a small subset of the model's parameters, the storage requirements are significantly lower compared to full fine\-tuning. The following code calculates the total size of the saved model directory, providing a clear view of how compact the fine\-tuned PEFT model is.

In [None]:
import os

def get_directory_size(path):
    # Calculate the total size of all files in the specified directory
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size

# Define the path to the saved PEFT model
peft_model_path = "./peft-flan-t5-city-tuning-checkpoint-local"

# Get the size in bytes and convert to megabytes (MB)
model_size_mb = get_directory_size(peft_model_path) / (1024 * 1024)

print(f"Fine-tuned PEFT model size: {model_size_mb:.2f} MB")

In [None]:
Fine-tuned PEFT model size: 15.98 MB
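
A quick estimate shows why the adapter is so small. The sketch assumes 3,538,944 LoRA parameters (72 adapted projections at rank 32 with hidden size 768) and 247,577,856 base parameters. Both are assumptions about FLAN\-T5\-base rather than values read from the saved files. The measured directory size also includes the tokenizer files, so it will not match a weights\-only estimate exactly.

```python
def size_mib(num_params, bytes_per_param):
    """Storage for num_params values, in MiB (1024 * 1024 bytes)."""
    return num_params * bytes_per_param / (1024 * 1024)

lora_params = 3_538_944      # assumed LoRA parameter count
base_params = 247_577_856    # assumed FLAN-T5-base parameter count

print(f"Adapter weights, fp32: {size_mib(lora_params, 4):.2f} MiB")
print(f"Adapter weights, bf16: {size_mib(lora_params, 2):.2f} MiB")
print(f"Full model,      bf16: {size_mib(base_params, 2):.2f} MiB")
```

Whichever precision the adapter is stored in, it is a small fraction of the full checkpoint you would have to save after full fine\-tuning.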

### Evaluate the Model Qualitatively (Human Evaluation)

Now that you've fine\-tuned the model, the next step is to evaluate its performance. To prepare for qualitative evaluation, you'll load the saved LoRA adapter onto the original FLAN\-T5 model and set is\_trainable\=False so the adapter weights stay frozen. This lets you assess the model's responses without any further training.

In [None]:
# Import PeftModel to load the model with adapters, and PeftConfig for adapter configuration settings
from peft import PeftModel, PeftConfig

# Load the base FLAN-T5 model with the specified data type for efficient computation
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)

# Load the tokenizer for the FLAN-T5 model, required for text processing during inference
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Load the fine-tuned PEFT model by adding the LoRA adapter to the base model
# Set is_trainable=False to ensure the model is in inference mode (no further training)
peft_model = PeftModel.from_pretrained(
    peft_model_base,                           # Base model to apply the PEFT adapter to
    './peft-flan-t5-city-tuning-checkpoint-local/', # Path to the saved fine-tuned adapter
    torch_dtype=torch.bfloat16,                # Data type for efficient inference
    is_trainable=False                         # Disable training mode for inference-only evaluation
)

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Zero-shot prompt (no examples are included in the input)
input_text = "Describe the city of Vancouver"

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate response
outputs = peft_model.generate(input_ids=inputs.input_ids, max_length=50)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
Vancouver is a city in Canada with a population of 1.8 million, known for landmarks such as the Golden Horseshoe Bridge, the Vancouver Museum, and the Canadian Museum.

And there you have it. The model now returns responses in the desired format. While some information may still lack accuracy and there might be occasional hallucinations, the output is much closer to what we intended, demonstrating real progress toward our goal.
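
Reading outputs by eye does not scale, so a small format check can complement the human evaluation. The regular expression below is my own approximation of the answer template seen in the training example. It checks structure only and says nothing about factual accuracy.

```python
import re

# Hypothetical pattern approximating the template used in the training data
TEMPLATE = re.compile(
    r"^(?P<city>.+?) is a city in (?P<country>.+?) with a population of "
    r"(?P<population>[\d.,]+(?: (?:thousand|million|billion))?), "
    r"known for landmarks such as (?P<landmarks>.+)\.$"
)

def check_format(response):
    """Return the captured fields if the response follows the template, else None."""
    match = TEMPLATE.match(response.strip())
    return match.groupdict() if match else None

sample = ("Vancouver is a city in Canada with a population of 1.8 million, "
          "known for landmarks such as the Golden Horseshoe Bridge, "
          "the Vancouver Museum, and the Canadian Museum.")
result = check_format(sample)
print(result)
print(check_format("Vancouver is nice."))
```

Running a check like this over a batch of prompts gives a quick pass rate for the format, while the extracted fields can be reviewed separately for accuracy.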

## Conclusion

With PEFT and LoRA, we're unlocking a practical way to adapt large models without requiring extensive resources. The real advantage here is that you can create several small, fine\-tuned versions for different tasks and swap them in and out as needed. This allows you to keep one main model and load in specific adapters rather than managing multiple large models for each task. It's a streamlined approach to handle various tasks with a single model setup, saving on time, storage, and computing power.

PEFT techniques are also valuable for adapting cutting edge and specialized architectures, including [reasoning\-focused models like OpenAI o1](/article/understanding-reasoning-models-ai-systems-designed-to-think). For complex problem solving and analysis, understanding how these models think can help you choose the right fine\-tuning strategy.

In essence, PEFT makes working with large models much more accessible and efficient. You don't need high\-end hardware to get strong performance on specialized tasks. By creating these lightweight adapters for each use case, you can switch between them seamlessly, making your AI setup far more adaptable and resource\-friendly.