
# Instruction Tuning for Large Language Models (LLMs)




## Motivation

**Language Models (LMs)** like GPT-2 are trained on huge text datasets with a *self-supervised* objective:
> “Predict the next word given the previous ones.”

Example:
> Input: "The capital of France is"  
> Model learns to predict: "Paris"

This works great for *language generation*, but not for *task following*.

---

### Problem
When asked a question like:
> “Translate ‘Hello’ to French.”

A pretrained model may respond:
> “Translate ‘Hello’ to French.”

It **doesn’t understand** that it must *do something*; it just continues the text pattern.

---

### Solution: Instruction Tuning
**Instruction tuning** is a fine-tuning process that teaches models to *follow human instructions* using examples of:
```
Instruction + Input → Desired Output
```


For example:

| Instruction | Input | Output |
|--------------|--------|---------|
| "Translate English to French." | "The sky is blue." | "Le ciel est bleu." |
| "Summarize this text." | "AI enables machines to learn..." | "AI lets machines learn tasks automatically." |

After this, models start to *generalize* to new tasks they have never seen, just by reading the instruction.



---
| Feature           | **Pretraining**                 | **Fine-tuning**               | **Instruction-tuning**                     |
| ----------------- | ------------------------------- | ----------------------------- | ------------------------------------------ |
| **Data Type**     | Unlabeled, large-scale          | Labeled, task-specific        | Labeled, multi-task with instructions      |
| **Goal**          | Learn general language patterns | Specialize in one domain/task | Learn to follow human instructions         |
| **Learning Type** | Self-supervised                 | Supervised                    | Supervised (with natural instructions)     |
| **Examples**      | GPT, BERT base training         | GPT fine-tuned for Sentiment Classification | ChatGPT, FLAN-T5, LLaMA-Instruct           |
| **Outcome**       | General-purpose LM              | Task-specialized LM           | Conversational or instruction-following LM |




## The Intuition Behind Instruction Tuning

Instruction tuning makes the model **conditioned on intent**.

Think of it like “meta-learning”:  
- Instead of teaching the model *how to do one task* (like translation),  
- We teach it *how to learn from the instruction itself*.

So the model learns this **meta pattern**:  
> "When someone says 'Translate', produce translation."  
> "When someone says 'Summarize', produce a summary."  


---




## Anatomy of an Instruction Dataset

Each record has three parts:

```json
{
  "instruction": "Summarize this paragraph.",
  "input": "Artificial intelligence enables computers to perform human-like tasks.",
  "output": "AI lets computers perform human-like tasks."
}
```

This format allows the model to learn:  
- **What** to do (from the *instruction*),  
- **What data** to operate on (the *input*),  
- **What kind of result** to produce (the *output*).
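
Before tokenization, a record in this format is usually flattened into a single prompt string, with the output kept as the target. The sketch below shows one way to do that. `build_prompt` is a hypothetical helper, and the `Instruction:` / `Input:` / `Response:` labels are an assumed template chosen for illustration; it drops the `Input:` line when a record has no input.

```python
def build_prompt(record: dict) -> str:
    """Flatten an (instruction, input) pair into one prompt string."""
    lines = [f"Instruction: {record['instruction']}"]
    # Many instructions need no extra input; skip the line when it is empty.
    if record.get("input", "").strip():
        lines.append(f"Input: {record['input']}")
    lines.append("Response:")
    return "\n".join(lines)


record = {
    "instruction": "Summarize this paragraph.",
    "input": "Artificial intelligence enables computers to perform human-like tasks.",
    "output": "AI lets computers perform human-like tasks.",
}
prompt = build_prompt(record)
target = record["output"]  # the text the model is trained to generate
print(prompt)
```

Whatever template is chosen, the same one should be used at training and at inference time, since the model learns to respond to that exact layout.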


In [2]:
!pip install -q transformers datasets peft accelerate bitsandbytes

In [3]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



In [4]:
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train[:1%]")
print(dataset[0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}


In [5]:
dataset.shape

(520, 4)

In [6]:
def format_instruction(example):
    prompt = f"Instruction: {example['instruction']}\nInput: {example['input']}\nResponse:"
    return {"prompt": prompt, "label": example["output"]}

formatted_dataset = dataset.map(format_instruction)
print(formatted_dataset[0])
formatted_dataset = formatted_dataset.remove_columns(dataset.column_names)
formatted_dataset = formatted_dataset.rename_column("prompt", "input_text")
formatted_dataset = formatted_dataset.rename_column("label", "target_text")
formatted_dataset = formatted_dataset.shuffle(seed=42)
print(formatted_dataset[0])

{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'prompt': 'Instruction: Give three tips for staying healthy.\nInput: \nResponse:', 'label': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'input_text': 

In [7]:
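def tokenize_function(example):
    model_inputs = tokenizer(example["input_text"], truncation=True, padding="max_length", max_length=128)
    labels = tokenizer(example["target_text"], truncation=True, padding="max_length", max_length=128)
    # Replace padding ids with -100 so the loss ignores padded label positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs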

tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["input_text", "target_text"])
tokenized_dataset.set_format("torch")

In [8]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 344,064 || all params: 77,305,216 || trainable%: 0.4451


In [9]:
!pip install --upgrade transformers peft accelerate



In [10]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="no",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

Step,Training Loss
10,29.994
20,28.1867
30,36.0805
40,27.1413
50,29.8795
60,28.5482
70,29.766
80,31.9955
90,24.4059
100,27.3499


TrainOutput(global_step=130, training_loss=30.06300295316256, metrics={'train_runtime': 50.1596, 'train_samples_per_second': 10.367, 'train_steps_per_second': 2.592, 'total_flos': 24303171010560.0, 'train_loss': 30.06300295316256, 'epoch': 1.0})

In [14]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def generate_response(instruction, input_text=""):
    # Use exactly the same template as format_instruction did during training
    prompt = f"Instruction: {instruction}\nInput: {input_text}\nResponse:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test again
print(generate_response(
    "Summarize the following text:",
    "Artificial Intelligence enables machines to learn from data and perform tasks that typically require human intelligence."
))


Artificial intelligence is a technology that enables machines to learn from data and perform tasks that require human intelligence.



## Theoretical Notes

1. **Why Instruction Tuning Works:** It couples intent with behavior, teaching semantic mappings from "task type" to "output form."  
2. **Generalization:** The model learns how to *follow* instructions, not just memorize responses - enabling zero/few-shot learning.  
3. **Relation to Prompting:** Instruction-tuned models *internalize* prompting behavior through supervised examples.  
4. **LoRA Efficiency:** LoRA trains only small adapter matrices and keeps the base weights frozen; in this notebook just 0.45% of the parameters are trainable.



## Key Takeaways

- Instruction tuning aligns LLMs with human intent.
- It uses (instruction, input, output) triplets to teach “how to follow commands.”
- LoRA enables efficient fine-tuning on small GPUs.
- It is the foundation for conversational, aligned LLMs such as ChatGPT.

---


### Further Reading
- [FLAN-T5 paper](https://arxiv.org/abs/2210.11416)
- [Self-Instruct paper](https://arxiv.org/abs/2212.10560)
- [Databricks Dolly Blog](https://www.databricks.com/blog/2023/03/24/hello-dolly.html)
- [PEFT Documentation](https://huggingface.co/docs/peft)


- [Instruction Tuning for Large Language Models: A Survey](https://arxiv.org/pdf/2308.10792)

# LoRA and PEFT: Detailed Explanation

## What is PEFT?

**PEFT (Parameter-Efficient Fine-Tuning)** is a family of techniques that fine-tune large models by updating only a small subset of parameters, rather than all billions of weights.

### The Problem PEFT Solves

Traditional fine-tuning updates **all** model parameters:
- FLAN-T5-small: ~77M parameters
- FLAN-T5-large: ~780M parameters  
- LLaMA-7B: 7 billion parameters

**Challenges:**
- ❌ Requires huge GPU memory (store gradients for all params)
- ❌ Slow training
- ❌ Need to save full model copy for each task
- ❌ Risk of catastrophic forgetting

### PEFT Solution

Train only **0.1-10% of parameters**, keep the rest frozen:
- ✅ Much lower memory (no gradients for frozen weights)
- ✅ Faster training
- ✅ Save only the adapter weights (~few MB instead of GBs)
- ✅ Can swap adapters for different tasks
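
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes fp32 weights and an Adam-style optimizer that keeps two extra values per trainable parameter, and it ignores activations, so the numbers are only a lower bound. `training_memory_bytes` is a hypothetical helper; the parameter counts are derived from the `print_trainable_parameters()` output shown earlier (77,305,216 total, of which 344,064 are LoRA parameters).

```python
BYTES_FP32 = 4


def training_memory_bytes(frozen_params: int, trainable_params: int) -> int:
    """Rough memory for weights, gradients and Adam state (activations ignored)."""
    weights = (frozen_params + trainable_params) * BYTES_FP32
    gradients = trainable_params * BYTES_FP32
    adam_state = 2 * trainable_params * BYTES_FP32  # first and second moments
    return weights + gradients + adam_state


base_params = 76_961_152   # flan-t5-small without adapters (77,305,216 - 344,064)
lora_params = 344_064      # LoRA adapter parameters

full_ft = training_memory_bytes(frozen_params=0, trainable_params=base_params)
lora_ft = training_memory_bytes(frozen_params=base_params, trainable_params=lora_params)

print(f"full fine-tuning: {full_ft / 1e9:.2f} GB")
print(f"LoRA fine-tuning: {lora_ft / 1e9:.2f} GB")
```

Under these assumptions the saving comes almost entirely from not storing gradients and optimizer state for the frozen weights; the frozen weights themselves still have to sit in memory.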

---

## What is LoRA?

**LoRA (Low-Rank Adaptation)** is the most popular PEFT method. It works by injecting trainable rank decomposition matrices into each layer.

### The Core Idea

Instead of updating the full weight matrix **W**, LoRA keeps W frozen and adds a low-rank update:

$$
W' = W + \Delta W = W + BA
$$

Where:
- **W**: Original frozen weights (d × k)
- **B**: Trainable matrix (d × r)
- **A**: Trainable matrix (r × k)
- **r**: Rank (typically 8, 16, 32) << min(d, k)

**Key insight**: The update ΔW = BA is **low-rank**, so we only train r(d+k) parameters instead of d×k.
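
The update rule can be checked by hand on a toy example. The sketch below uses plain Python lists and deliberately tiny, made-up dimensions (d=4, k=3, r=1); `matmul` and `add` are small illustrative helpers, not library functions.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


d, k, r = 4, 3, 1

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]               # frozen weights, shape (d, k)
B = [[1.0], [2.0], [0.0], [-1.0]]   # trainable, shape (d, r)
A = [[1.0, 0.0, 2.0]]               # trainable, shape (r, k)

delta_W = matmul(B, A)              # shape (d, k), rank at most r
W_adapted = add(W, delta_W)         # W' = W + BA

full_params = d * k                 # parameters in a full update of W
lora_params = r * (d + k)           # parameters in B and A together
print(W_adapted)
print(full_params, lora_params)
```

With such small matrices the saving is modest (7 parameters instead of 12); it becomes large only when r is much smaller than both d and k.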

### Example: Parameter Reduction

For a single query projection in FLAN-T5-small (6 attention heads × 64 dimensions = 384 outputs):
- Original Wq matrix: 512 × 384 = **196,608 params**
- LoRA with r=8: (384×8) + (8×512) = **7,168 params** (about 96% fewer)
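
The same arithmetic can be repeated for other ranks. This is a small sketch, assuming a projection with 512 inputs and 384 outputs; that shape is an assumption based on flan-t5-small using 6 heads of 64 dimensions, and it is consistent with the 344,064 trainable parameters printed earlier (24 attention blocks × 2 projections × 7,168). `lora_params` is a hypothetical helper.

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


d_in, d_out = 512, 384   # assumed projection shape (see the note above)
full = d_in * d_out      # parameters in the frozen matrix

for r in (4, 8, 16, 32):
    n = lora_params(d_out, d_in, r)
    saving = 100 * (1 - n / full)
    print(f"r={r:>2}: {n:>6,} LoRA params, {saving:.1f}% fewer than {full:,}")
```

The count grows linearly with r, so doubling the rank doubles the adapter size.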

---

## LoRA Config Breakdown



In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                          # Rank of decomposition matrices
    lora_alpha=32,                # Scaling factor
    target_modules=["q", "v"],    # Which layers to adapt
    lora_dropout=0.05,            # Regularization
    bias="none",                  # Don't train bias terms
    task_type="SEQ_2_SEQ_LM"      # Model architecture type
)



### Parameter Explanation

| Parameter | What It Does | Typical Values | Impact |
|-----------|--------------|----------------|--------|
| **r** | Rank of A and B matrices | 4, 8, 16, 32 | Higher r = more capacity but more params |
| **lora_alpha** | Scales ΔW by α/r | 16, 32 | Controls magnitude of updates |
| **target_modules** | Which weight matrices to adapt | `["q", "v"]`, `["q", "k", "v", "o"]` | More modules = better adaptation but slower |
| **lora_dropout** | Dropout applied to the input of the LoRA branch | 0.0, 0.05, 0.1 | Regularizes the adapters; helps prevent overfitting |
| **bias** | Whether to train bias | `"none"`, `"all"`, `"lora_only"` | Usually keep frozen |
| **task_type** | Task/model family | `"SEQ_2_SEQ_LM"`, `"CAUSAL_LM"` | Selects the PEFT wrapper used around the model |

---

## Detailed Example: How LoRA Works in T5

### 1. Original T5 Attention (Frozen)



In [None]:
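import torch

d_model, d_inner = 512, 384   # flan-t5-small: 6 heads x 64 dims per head = 384

# T5 attention weight matrices (all frozen); random stand-ins, shaped for X @ W:
W_q = torch.randn(d_model, d_inner)  # Query projection
W_k = torch.randn(d_model, d_inner)  # Key projection
W_v = torch.randn(d_model, d_inner)  # Value projection
W_o = torch.randn(d_inner, d_model)  # Output projection

# Original forward pass for a dummy batch of hidden states:
X = torch.randn(2, 10, d_model)
Q = X @ W_q
K = X @ W_k
V = X @ W_v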



### 2. With LoRA Applied



In [None]:
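import torch

d_model, d_inner, r = 512, 384, 8
X = torch.randn(2, 10, d_model)                                    # dummy hidden states
W_q, W_k, W_v = (torch.randn(d_model, d_inner) for _ in range(3))  # frozen stand-ins

# LoRA adds trainable adapters to q and v:
# W_q stays frozen
B_q = torch.zeros(d_model, r, requires_grad=True)   # NEW trainable (zero init: no change at start)
A_q = torch.randn(r, d_inner, requires_grad=True)   # NEW trainable

# W_v stays frozen
B_v = torch.zeros(d_model, r, requires_grad=True)   # NEW trainable
A_v = torch.randn(r, d_inner, requires_grad=True)   # NEW trainable

# Modified forward pass (the alpha/r scaling is omitted for clarity):
Q = X @ (W_q + B_q @ A_q)  # Frozen + LoRA update
K = X @ W_k                 # No LoRA on K
V = X @ (W_v + B_v @ A_v)  # Frozen + LoRA update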



### 3. Parameter Count



In [None]:
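# flan-t5-small has ~77M parameters in total.

# With LoRA (r=8, target_modules=["q", "v"]):
# 8 encoder layers with 1 attention block each (self-attention)
# 8 decoder layers with 2 attention blocks each (self- and cross-attention)
d_model, d_inner, r = 512, 384, 8

attention_blocks = 8 * 1 + 8 * 2                          # 24
params_per_block = 2 * (d_model * r + r * d_inner)        # q and v: 2 * 7,168 = 14,336
total_lora_params = attention_blocks * params_per_block   # 344,064
print(total_lora_params)

# Trainable params: ~344K (about 0.45% of all parameters)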



---

## Code Walkthrough

### Step 1: Wrap Model with LoRA



In [None]:
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

# Load base model (its weights are frozen later, by get_peft_model)
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Define LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q", "v"],  # Only adapt query & value in attention
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

# Inject LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()



**Output:**


In [None]:
trainable params: 344,064 || all params: 77,305,216 || trainable%: 0.4451



### Step 2: What `get_peft_model` Does

Internally:
1. Freezes all original model weights (`param.requires_grad = False`)
2. Identifies layers matching `target_modules` (e.g., `encoder.block.0.layer.0.SelfAttention.q`)
3. Replaces them with LoRA-wrapped linear layers that keep the frozen weight and add the B and A matrices
4. Sets `requires_grad = True` only for LoRA params
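
The four steps above can be sketched with a toy module registry. This is not PEFT's implementation: `Param`, `Linear`, `LoraWrapped` and `toy_get_peft_model` are hypothetical stand-ins that only track shapes and `requires_grad` flags, and the weight shapes assume flan-t5-small's 512-to-384 attention projections.

```python
from dataclasses import dataclass


@dataclass
class Param:
    shape: tuple
    requires_grad: bool = True


@dataclass
class Linear:
    weight: Param


@dataclass
class LoraWrapped:
    base: Linear     # original layer, kept frozen
    lora_A: Param    # trainable, shape (r, in_dim)
    lora_B: Param    # trainable, shape (out_dim, r)


def toy_get_peft_model(modules: dict, target_modules: list, r: int) -> dict:
    wrapped = {}
    for name, layer in modules.items():
        layer.weight.requires_grad = False            # step 1: freeze original weights
        if name.split(".")[-1] in target_modules:     # step 2: match by module name
            out_dim, in_dim = layer.weight.shape
            layer = LoraWrapped(                      # step 3: wrap the matched layer
                base=layer,
                lora_A=Param((r, in_dim)),            # step 4: only these stay trainable
                lora_B=Param((out_dim, r)),
            )
        wrapped[name] = layer
    return wrapped


modules = {
    "encoder.block.0.layer.0.SelfAttention.q": Linear(Param((384, 512))),
    "encoder.block.0.layer.0.SelfAttention.k": Linear(Param((384, 512))),
    "encoder.block.0.layer.0.SelfAttention.v": Linear(Param((384, 512))),
}
peft_modules = toy_get_peft_model(modules, target_modules=["q", "v"], r=8)
for name, layer in peft_modules.items():
    print(name, type(layer).__name__)
```

In this toy version `q` and `v` come back wrapped while `k` stays a plain frozen layer, which mirrors what `target_modules=["q", "v"]` asks for.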

### Step 3: Training



In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=3e-4,  # Higher LR ok since fewer params
)

trainer = Trainer(
    model=model,  # LoRA-wrapped model
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()



**What happens during training:**
- Forward pass: `output = X @ (W_frozen + (alpha / r) * B @ A)`
- Backward pass: Gradients only flow to B and A
- Optimizer updates: Only B and A are updated
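
A scalar toy version makes these three bullets easy to trace by hand. The sketch below is an illustration under simplifying assumptions, not what `Trainer` does internally: the "weight matrix" is a single number `w`, the LoRA factors are scalars `a` and `b`, the loss is squared error, and the update is plain SGD with made-up data.

```python
w = 1.0            # frozen pretrained weight (never updated)
a, b = 0.5, 0.0    # LoRA factors; b starts at zero so the initial update is zero
lr = 0.05
data = [(1.0, 3.0), (2.0, 6.0)]   # (x, target) pairs generated by target = 3 * x

for _ in range(500):
    for x, target in data:
        pred = x * (w + b * a)    # forward pass through frozen weight + LoRA update
        err = pred - target       # gradient of 0.5 * err**2 with respect to pred
        grad_a = err * x * b      # chain rule through the product b * a
        grad_b = err * x * a
        a -= lr * grad_a          # optimizer step touches only a and b
        b -= lr * grad_b

print(f"frozen w = {w}, effective weight = {w + b * a:.4f}")
```

The frozen weight never changes, yet the effective weight `w + b * a` moves toward the value the data calls for, which is the whole point of training only the adapter.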

### Step 4: Saving and Loading



In [None]:
# Save only LoRA adapters (~1.4 MB instead of ~300 MB for the full fp32 model)
model.save_pretrained("./lora_adapters")

# Later: Load base model + adapters
from peft import PeftModel

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
model = PeftModel.from_pretrained(base_model, "./lora_adapters")



---

## Why LoRA Works: The Math

### Hypothesis: Weight Updates Are Low-Rank

The LoRA paper hypothesizes that the change in weights during fine-tuning, ΔW = W_finetuned - W_pretrained, has a low **intrinsic rank**: most of its singular values are close to zero, so a rank-r matrix can capture most of the update.

LoRA explicitly enforces this:
$$
\Delta W = BA, \quad \text{rank}(\Delta W) \leq r
$$

### Scaling Factor (lora_alpha)

The actual update is scaled:
$$
W' = W + \frac{\alpha}{r} BA
$$

- If α = r, scaling factor = 1 (no scaling)
- If α = 2r, updates are 2× larger
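- For a fixed learning rate, a larger α/r makes the LoRA update larger, so it acts much like a learning-rate multiplier for the adapter

**Rule of thumb:** `lora_alpha = 2 * r` is a common starting point. The config used in this notebook (`r=8`, `lora_alpha=32`) corresponds to a scaling factor of 4.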
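
A quick way to compare configurations is to compute the scaling factor directly. `lora_scaling` is a hypothetical helper that just evaluates α/r from the formula above; the labels in `configs` are for illustration only.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to B @ A before it is added to the frozen weight."""
    return lora_alpha / r


configs = {
    "alpha = r": (8, 8),
    "alpha = 2r": (16, 8),
    "alpha = 32, r = 8": (32, 8),
    "alpha = 32, r = 16": (32, 16),
}
for label, (alpha, r) in configs.items():
    print(f"{label}: scaling = {lora_scaling(alpha, r)}")
```

Note that raising `r` while leaving `lora_alpha` unchanged shrinks the scaling factor, so the two values are usually adjusted together.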

---

## Comparison: LoRA vs Other PEFT Methods

| Method | Trainable Params | Memory | Speed | Use Case |
|--------|------------------|--------|-------|----------|
| **Full Fine-tuning** | 100% | High | Slow | When you have huge GPU resources |
| **LoRA** | 0.1-1% | Low | Fast | General-purpose PEFT (best default) |
| **Prefix Tuning** | ~0.1% | Low | Fast | Seq2seq tasks, frozen model |
| **Prompt Tuning** | <0.01% | Very Low | Very Fast | Simple tasks, extreme efficiency |
| **Adapter Layers** | 1-5% | Medium | Medium | Task-specific modules |

---

## Practical Example: Multi-Task Learning

You can train **different LoRA adapters** for different tasks and swap them:



In [None]:
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, PeftModel, get_peft_model

def load_base():
    # get_peft_model modifies the model in place, so use a fresh copy per adapter
    return AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Train LoRA for summarization
lora_summarize = LoraConfig(r=8, target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM")
model_summarize = get_peft_model(load_base(), lora_summarize)
# ...train on summary data...
model_summarize.save_pretrained("./lora_summarize")

# Train LoRA for translation
lora_translate = LoraConfig(r=8, target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM")
model_translate = get_peft_model(load_base(), lora_translate)
# ...train on translation data...
model_translate.save_pretrained("./lora_translate")

# At inference, load both adapters onto one base model and switch between them:
model = PeftModel.from_pretrained(load_base(), "./lora_summarize", adapter_name="summarize")
model.load_adapter("./lora_translate", adapter_name="translate")

# For summarization:
model.set_adapter("summarize")

# For translation:
model.set_adapter("translate")



Each adapter is only a few MB (about 1.4 MB for the configuration above), so you can store dozens of task-specific adapters alongside a single base model.

---

## Key Takeaways

1. **PEFT** reduces fine-tuning cost by updating <1% of parameters
2. **LoRA** keeps W frozen and learns a low-rank update, W + BA
3. **r** controls capacity/efficiency trade-off (typical: 8-32)
4. **target_modules** specifies which layers to adapt (attention projections are the usual choice)
5. **lora_alpha** scales the update magnitude (2r is a common starting point)
6. Save/load only adapters for efficient multi-task deployment

**When to use LoRA:**
- ✅ Limited GPU memory (<16GB)
- ✅ Need to fine-tune multiple tasks
- ✅ Want fast experimentation
- ✅ Working with models >1B params