# PEFT Fine-Tuning with LoRA on IMDb Dataset

**Objective:**  
This document explains **Parameter-Efficient Fine-Tuning (PEFT)** using **LoRA (Low-Rank Adaptation)** on a **DistilBERT model** for **text classification**.  
We use the **IMDb movie review dataset** for sentiment analysis (positive/negative classification).  

**Key Concepts Covered:**

- PEFT and LoRA for fine-tuning large models efficiently  
- Targeting specific layers (query/value projections) in attention  
- Manual PyTorch training loop (no `Trainer`, no `accelerate`)  
- Evaluation of model performance  

---

## 1️⃣ Install Required Packages

Required libraries:

- `transformers`: Model and tokenizer  
- `datasets`: IMDb dataset  
- `peft`: LoRA fine-tuning  
- `torch`: PyTorch training  

**Install command:**

In [68]:
#!pip install transformers datasets peft torch

In [42]:
# Import required libraries
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification  # no Trainer: the loop is written by hand
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType


## 2️⃣ Load IMDb Dataset

We load the dataset with the **Hugging Face Datasets** library.  
IMDb contains **50,000 labeled movie reviews** (25,000 for training and 25,000 for testing), evenly split between positive and negative sentiment.  

We will use a **small subset** for demonstration purposes.

In [58]:
from datasets import load_dataset

# Load IMDb dataset
dataset = load_dataset("imdb")

# Inspect sample
print(dataset["train"][0])


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## 3️⃣ Tokenization

We use the **DistilBERT tokenizer** to convert text into token IDs.  

- Pad or truncate each review to **max_length=128** tokens  
- The output includes `input_ids` and `attention_mask` (1 for real tokens, 0 for padding)  
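
To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what happens to a single sequence. The helper `pad_or_truncate` and the token IDs are made up for illustration; the real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]            # truncate long sequences
    n_pad = max_length - len(kept)           # how many pad slots are needed
    input_ids = kept + [pad_id] * n_pad      # pad short sequences on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded; a long one is cut to max_length.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, so both tensors must be passed to the model together.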


In [59]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize datasets
tokenized_datasets = dataset.map(tokenize, batched=True)

# Set format for PyTorch
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "label"])



## 4️⃣ Create DataLoaders

We will use **PyTorch DataLoaders** for batching the data:

- Batch size = 16  
- Shuffle each split once with a fixed seed (`seed=42`) so the subset is reproducible  
- Select a smaller subset for faster experimentation: 2,000 training and 500 test examples  
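
As a quick sanity check on these settings, the number of batches per pass can be worked out by hand. The helper `num_batches` is hypothetical; it mirrors how a PyTorch `DataLoader` reports its length when `drop_last` is left at its default of `False`.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Batches per pass over the data, keeping or dropping the final partial batch."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


train_steps = num_batches(2000, 16)   # optimizer steps per epoch
eval_steps = num_batches(500, 16)     # forward passes during evaluation
last_eval_batch = 500 - (eval_steps - 1) * 16  # size of the final, partial batch

print(train_steps, eval_steps, last_eval_batch)  # 125 32 4
```

Because the last evaluation batch is smaller than the rest, accuracy should be computed from example counts rather than by averaging per-batch accuracies.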


In [60]:
from torch.utils.data import DataLoader

train_loader = DataLoader(tokenized_datasets["train"].shuffle(seed=42).select(range(2000)), batch_size=16)
eval_loader = DataLoader(tokenized_datasets["test"].shuffle(seed=42).select(range(500)), batch_size=16)


## 5️⃣ Load Model and Inspect Layers

We load **DistilBERT for sequence classification**.  

- Inspect layer names to identify **query (`q_lin`) and value (`v_lin`) projection layers**  
- LoRA will only modify these layers for **parameter-efficient fine-tuning**  
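
The sketch below shows how short names such as `q_lin` and `v_lin` can pick out modules from the full dotted names printed by `named_modules()`. It assumes suffix matching, which is my understanding of how `peft` resolves a list passed as `target_modules`; the helper `matches_target` is hypothetical and is not part of the library.

```python
# Full module names for one transformer block, as printed by named_modules().
module_names = [
    "distilbert.transformer.layer.0.attention",
    "distilbert.transformer.layer.0.attention.dropout",
    "distilbert.transformer.layer.0.attention.q_lin",
    "distilbert.transformer.layer.0.attention.k_lin",
    "distilbert.transformer.layer.0.attention.v_lin",
    "distilbert.transformer.layer.0.attention.out_lin",
]


def matches_target(name, targets):
    """True if the last component of the dotted name equals one of the targets."""
    return any(name == t or name.endswith("." + t) for t in targets)


selected = [n for n in module_names if matches_target(n, ["q_lin", "v_lin"])]
for name in selected:
    print(name)
```

Only the query and value projections are selected; the key and output projections stay frozen and unmodified.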


In [61]:
import torch
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Check attention layer names
for name, _ in model.named_modules():
    if "attention" in name:
        print(name)


distilbert.transformer.layer.0.attention
distilbert.transformer.layer.0.attention.dropout
distilbert.transformer.layer.0.attention.q_lin
distilbert.transformer.layer.0.attention.k_lin
distilbert.transformer.layer.0.attention.v_lin
distilbert.transformer.layer.0.attention.out_lin
distilbert.transformer.layer.1.attention
distilbert.transformer.layer.1.attention.dropout
distilbert.transformer.layer.1.attention.q_lin
distilbert.transformer.layer.1.attention.k_lin
distilbert.transformer.layer.1.attention.v_lin
distilbert.transformer.layer.1.attention.out_lin
distilbert.transformer.layer.2.attention
distilbert.transformer.layer.2.attention.dropout
distilbert.transformer.layer.2.attention.q_lin
distilbert.transformer.layer.2.attention.k_lin
distilbert.transformer.layer.2.attention.v_lin
distilbert.transformer.layer.2.attention.out_lin
distilbert.transformer.layer.3.attention
distilbert.transformer.layer.3.attention.dropout
distilbert.transformer.layer.3.attention.q_lin
distilbert.transformer.

## 6️⃣ Configure LoRA (PEFT)

LoRA performs **low-rank adaptation**: it freezes the pretrained weights and injects a pair of small trainable matrices into **selected layers**, which sharply reduces the number of trainable parameters.  
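
In the usual LoRA notation (the symbols here are standard in the LoRA literature, not taken from this notebook), the output of an adapted linear layer can be written as:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}},\; r \ll \min(d_{\text{in}}, d_{\text{out}})
$$

Here $W_0$ is the frozen pretrained weight, and only $A$ and $B$ are trained. The rank $r$ and the factor $\alpha$ correspond to the `r` and `lora_alpha` settings of `LoraConfig`, assuming the default scaling.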

**Configuration parameters:**

- `task_type`: `SEQ_CLS` (sequence classification)  
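- `r`: Rank of the adaptation matrices (8 is a common starting point)  
- `lora_alpha`: Scaling factor; with the default scaling, the LoRA update is multiplied by `lora_alpha / r` (here 32 / 8 = 4)  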
- `lora_dropout`: Dropout on LoRA layers  
- `target_modules`: Layers to modify (`q_lin` and `v_lin`)  
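
A short calculation shows how few weights these settings add. The dimensions (hidden size 768, 6 transformer blocks) are the ones DistilBERT reports when the model is printed; the helper `lora_param_count` is hypothetical. This counts only the adapter matrices, so the total reported by `model.print_trainable_parameters()` will be larger if the classification head is also trained.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


hidden = 768            # in_features and out_features of q_lin / v_lin
rank = 8                # r in LoraConfig
num_layers = 6          # transformer blocks in DistilBERT
targets_per_layer = 2   # q_lin and v_lin

per_module = lora_param_count(hidden, hidden, rank)
total_lora = per_module * num_layers * targets_per_layer
dense_weight = hidden * hidden           # weights in one frozen 768 x 768 projection
ratio = per_module / dense_weight        # adapter size relative to that projection
scaling = 32 / rank                      # lora_alpha / r

print(per_module, total_lora)            # 12288 147456
print(f"{ratio:.4f}", scaling)           # 0.0208 4.0
```

Each adapter is about 2% of the size of the projection it modifies, which is where the savings in gradients and optimizer state come from.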


In [62]:
# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_lin", "v_lin"]  # Correct modules for DistilBERT
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): DistilBertSdpaAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=76

## 7️⃣ Define Training Loop

We will manually train the LoRA-adapted model:

- Use **CrossEntropyLoss** for classification  
- **Adam optimizer** with learning rate 5e-4  
- Train for **1 epoch** (demo)  
- Print **loss per epoch**
  
**Reason for using only 1 epoch:**

- This notebook is intended as a **tutorial/demo**, so training on a small subset is sufficient to verify that the pipeline works.  
- Training for only 1 epoch **keeps the runtime short**. GPU memory usage depends on batch size and sequence length, not on the number of epochs.  
- For real-world applications or production models, you would train for **multiple epochs** until convergence.  
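
To help interpret the loss value that the loop prints, here is a minimal pure-Python sketch of what cross-entropy computes for a batch of logits: the mean negative log-probability of the correct class. The function `cross_entropy` and the example logits are illustrative only; the notebook itself uses `torch.nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-softmax of the correct class over a batch."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


# Equal logits mean the model is guessing: the loss is ln(2) for two classes.
uniform_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
# Confident and correct predictions drive the loss toward zero.
confident_loss = cross_entropy([[4.0, -4.0], [-4.0, 4.0]], [0, 1])

print(f"{uniform_loss:.4f}")    # 0.6931
print(f"{confident_loss:.4f}")
```

For a two-class problem, a loss near 0.693 means the model is no better than guessing, so an average training loss well below that indicates the model has started to learn.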



In [65]:
from torch.nn import CrossEntropyLoss
from torch.optim import Adam

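# Loss and optimizer (pass only trainable parameters; the frozen base weights have no gradients)
criterion = CrossEntropyLoss()
optimizer = Adam((p for p in model.parameters() if p.requires_grad), lr=5e-4)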

print("training")
# Training
epochs = 1

for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} - Loss: {total_loss/len(train_loader):.4f}")


training
Epoch 1 - Loss: 0.3633


## 8️⃣ Evaluate Model

We calculate **accuracy** on the test subset:

- Switch model to `eval()` mode  
- Disable gradient tracking with `torch.no_grad()` to save time and memory during evaluation  
- Compare predictions with ground-truth labels  
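
Accuracy alone can hide an imbalance between the two kinds of errors. As a sketch, the same predictions can also be summarized with precision, recall, and F1. The helper names and the toy logits below are made up for illustration, and label 1 is assumed to mean "positive".

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def binary_metrics(preds, labels, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy logits for five reviews: column 0 = negative, column 1 = positive.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]]
labels = [0, 0, 1, 1, 1]

preds = [argmax(row) for row in logits]
metrics = binary_metrics(preds, labels)
print(preds)
print(metrics)
```

In the evaluation loop, the same idea applies after collecting all predictions and labels into Python lists across batches.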

In [66]:
# Evaluation
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"Test Accuracy: {correct/total:.4f}")


Test Accuracy: 0.8000


## ✅ Summary

- Loaded **IMDb dataset** and tokenized text  
- Configured **LoRA for DistilBERT**, targeting the `q_lin` and `v_lin` attention projections  
- Trained the model for 1 epoch with a **manual PyTorch loop**  
- Evaluated accuracy on a 500-example test subset (0.80 in this run)  

**Advantages of PEFT + LoRA:**

- Reduces the number of trainable parameters  
- Fine-tuning is faster and requires less GPU memory, because gradients and optimizer state are kept only for the trainable weights  
- Easy to apply to large language models for downstream tasks  