# Module 20: Fine-Tuning LLMs

**LoRA, QLoRA, and Efficient Adaptation**

---

## 1. Objectives

- ✅ Understand full fine-tuning vs PEFT
- ✅ Master LoRA (Low-Rank Adaptation)
- ✅ Implement QLoRA for memory efficiency
- ✅ Know when and how to fine-tune

## 2. Prerequisites

- [Module 19: Prompt Engineering](../19_prompt_engineering/19_prompt_engineering.ipynb)

## 3. Fine-Tuning Landscape

### Types of Fine-Tuning

| Method | What Changes | Memory | Quality |
|--------|-------------|--------|--------|
| Full | All weights | Very High | Best |
| LoRA | Low-rank adapters | Low | Great |
| QLoRA | Quantized + LoRA | Very Low | Great |
| Prefix Tuning | Trainable prefix vectors (soft prompts) | Low | Good |

### Decision Framework

```
Have 100+ GPU hours? → Full fine-tuning
Have 16GB+ VRAM?    → LoRA
Have 8GB VRAM?      → QLoRA
Just prototyping?   → Prompt engineering first!
```

In [1]:
# Install required packages
# !pip install peft bitsandbytes accelerate transformers datasets trl

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")

Device: cuda


## 4. LoRA (Low-Rank Adaptation)

### Key Insight

Instead of updating the full weight matrix W, LoRA freezes it and learns a low-rank update:

$$W' = W + BA$$

Where:
- W: Original weights (frozen)
- B: (d × r) matrix
- A: (r × k) matrix
- r << min(d, k) (typically 8-64)
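
The update above can be sketched in a few lines of plain Python. This is a toy illustration with tiny matrices, not how PEFT implements it; the helper names `matmul` and `add` are made up for this example. Following the LoRA paper, B starts at zero, so W' equals W before any training step.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
d, k, r = 4, 3, 2  # toy sizes; real layers have d and k in the thousands

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights (d x k)
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable (r x k)
B = [[0.0] * r for _ in range(d)]                                  # trainable (d x r), zero-initialized

delta = matmul(B, A)       # low-rank update with shape (d x k)
W_adapted = add(W, delta)  # W' = W + BA

# With B = 0 the update is zero, so the adapted weights equal the originals.
print(W_adapted == W)  # True
```

Only A and B would receive gradients during training; W is never modified.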

### Memory Savings

```
Original: d × k parameters
LoRA: r × (d + k) parameters

Example: 4096 × 4096 = 16M params
LoRA (r=16): 16 × 8192 = 131K params (0.8%!)
```
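
The same arithmetic can be checked in code. A minimal sketch; `lora_param_count` is a hypothetical helper name, and the loop over ranks is only there to show how the count grows linearly with r.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in one LoRA adapter: B is (d x r), A is (r x k)."""
    return r * (d + k)

d, k = 4096, 4096
full = d * k  # parameters in the original weight matrix

for r in (8, 16, 64):
    lora = lora_param_count(d, k, r)
    print(f"r={r:>2}: {lora:>9,} adapter params vs {full:,} full")
```

For r=16 this gives 131,072 adapter parameters against 16,777,216 in the full matrix, matching the figures above.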

In [2]:
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=16,                         # Rank
    lora_alpha=32,                # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

print(f"LoRA Config: r={lora_config.r}, alpha={lora_config.lora_alpha}")
print(f"Target modules: {lora_config.target_modules}")

LoRA Config: r=16, alpha=32
Target modules: {'q_proj', 'v_proj'}
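
To get a feel for what this config trains, here is a rough count for a hypothetical model with 32 transformer layers whose `q_proj` and `v_proj` are both 4096 × 4096 (roughly the shape of a 7B LLaMA-style model; these dimensions are an assumption, not read from a checkpoint). It also shows the scaling factor: PEFT multiplies the BA update by `lora_alpha / r` by default.

```python
# Hypothetical model shape (assumed for illustration only)
num_layers = 32
hidden_size = 4096
targets_per_layer = 2  # q_proj and v_proj

r, lora_alpha = 16, 32

per_module = r * (hidden_size + hidden_size)  # A is (r x k), B is (d x r)
trainable = num_layers * targets_per_layer * per_module
scaling = lora_alpha / r                      # factor applied to the BA update

print(f"Adapter parameters: {trainable:,}")       # Adapter parameters: 8,388,608
print(f"Update scaling (alpha / r): {scaling}")   # Update scaling (alpha / r): 2.0
```

Adding more entries to `target_modules` (for example the other attention or MLP projections) increases the adapter size proportionally.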


## 5. QLoRA Setup

QLoRA = 4-bit Quantization + LoRA

Enables fine-tuning 65B models on a single 48 GB GPU!

In [6]:
!pip install -q bitsandbytes accelerate peft

import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Normalized Float 4
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

print("QLoRA config ready!")
print("This reduces a 7B model from ~28 GB (FP32) to ~4 GB")

QLoRA config ready!
This reduces a 7B model from ~28 GB (FP32) to ~4 GB
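
The memory figures follow from simple arithmetic on bits per parameter. The sketch below counts the weights only; `weight_memory_gb` is a hypothetical helper, and real usage is somewhat higher because of quantization constants, activations, the LoRA adapters, and optimizer state, which is why the 4-bit figure is usually quoted as roughly 4 GB rather than 3.5 GB.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9  # 7B-parameter model
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4)]:
    print(f"{name:>10}: {weight_memory_gb(params, bits):.1f} GB")
```

This prints 28.0 GB, 14.0 GB, and 3.5 GB respectively.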


## 6. Complete Fine-Tuning Pipeline

In [7]:
# QLoRA model setup (needs a CUDA GPU; uses bnb_config and lora_config defined above)
from peft import prepare_model_for_kbit_training

def setup_qlora_training(model_name):
    """Load a 4-bit quantized model and attach LoRA adapters."""

    # 1. Load the base model with 4-bit quantized weights
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )

    # 2. Prepare the quantized model for training
    #    (freezes the base weights and applies stability tweaks for k-bit training)
    model = prepare_model_for_kbit_training(model)

    # 3. Apply LoRA
    model = get_peft_model(model, lora_config)

    # 4. Report trainable parameters. PEFT's helper accounts for packed 4-bit
    #    weights, which a plain sum over p.numel() would undercount.
    model.print_trainable_parameters()

    return model

print("QLoRA training setup function ready!")

QLoRA training setup function ready!


In [8]:
from transformers import TrainingArguments

# Training arguments for LoRA
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16
    learning_rate=2e-4,  # Higher LR for LoRA
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,  # Mixed precision
)

print("Training arguments configured!")

Training arguments configured!
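
The batch settings above determine how many optimizer steps a run takes and how long warmup lasts. The sketch below works this out for a made-up dataset of 1,000 examples; `training_schedule` is a hypothetical helper, and the rounding (ceiling) is an assumption, so the exact step counts reported by the `Trainer` may differ by one or two.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Rough step counts implied by the batch settings (rounding is assumed)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# 1,000 examples is a made-up dataset size for illustration
eff, total, warmup = training_schedule(1000, 4, 4, epochs=3, warmup_ratio=0.03)
print(f"Effective batch: {eff}, optimizer steps: {total}, warmup steps: {warmup}")
# Effective batch: 16, optimizer steps: 189, warmup steps: 6
```

With a dataset this small, `warmup_ratio=0.03` amounts to only a handful of steps, and `logging_steps=10` yields fewer than 20 log lines for the whole run.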


## 7. Data Formatting

### Instruction Format (Alpaca Style)

```
### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```

In [9]:
def format_instruction(sample):
    """Format sample for instruction tuning."""

    if sample.get('input'):
        return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""
    else:
        return f"""### Instruction:
{sample['instruction']}

### Response:
{sample['output']}"""

# Example
sample = {
    'instruction': 'Summarize the following text.',
    'input': 'Machine learning is a subset of AI that enables systems to learn from data.',
    'output': 'ML is AI that learns from data.'
}
print(format_instruction(sample))

### Instruction:
Summarize the following text.

### Input:
Machine learning is a subset of AI that enables systems to learn from data.

### Response:
ML is AI that learns from data.
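
Formatting a whole dataset is the same operation applied per record. The sketch below uses a hypothetical helper, `to_training_text`, that mirrors `format_instruction` and also appends an end-of-sequence marker so the model can learn where a response stops. The `"</s>"` string is an assumption; in practice use your tokenizer's `eos_token`, and check whether your trainer already appends it.

```python
EOS_TOKEN = "</s>"  # assumption: replace with tokenizer.eos_token for your model

def to_training_text(sample: dict, eos_token: str = EOS_TOKEN) -> str:
    """Alpaca-style prompt followed by an end-of-sequence marker."""
    parts = [f"### Instruction:\n{sample['instruction']}"]
    if sample.get("input"):  # skip the Input section when it is empty or missing
        parts.append(f"### Input:\n{sample['input']}")
    parts.append(f"### Response:\n{sample['output']}")
    return "\n\n".join(parts) + eos_token

# Made-up records for illustration
records = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Name a prime number.", "input": "", "output": "7"},
]
texts = [to_training_text(r) for r in records]
print(texts[1])
```

The second record has an empty `input`, so its formatted text contains only the Instruction and Response sections.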


## 8. 🔥 Real-World Usage

### When to Fine-Tune

| Scenario | Approach |
|----------|----------|
| Need specific format | LoRA |
| Domain adaptation | QLoRA |
| Better at task | Prompt first, then LoRA |
| New capabilities | Full fine-tune |

### Best Practices

1. **Start with prompting** - often sufficient
2. **Use quality data** - 1000 good examples > 10000 bad
3. **Validate on held-out set**
4. **Monitor for overfitting**
5. **Merge weights for deployment**
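
Practices 3 and 4 can be sketched without any ML libraries. The helpers `train_val_split` and `should_stop` below are hypothetical names for this example, and the loss values are made up; for real runs, `transformers` provides `EarlyStoppingCallback` for the same purpose.

```python
import random

def train_val_split(samples, val_fraction=0.1, seed=42):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def should_stop(val_losses, patience=2):
    """True when validation loss has not improved for `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

train, val = train_val_split(range(100))
print(len(train), len(val))                         # 90 10

# Made-up validation losses: improvement stalls after the third evaluation
print(should_stop([1.20, 0.95, 0.90, 0.93, 0.97]))  # True
```

A rising validation loss while the training loss keeps falling is the usual sign of overfitting, which is common when fine-tuning on a few thousand examples for several epochs.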

## 9. Interview Questions

**Q1: What is LoRA and why is it memory efficient?**
<details><summary>Answer</summary>

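LoRA freezes the pretrained weights and learns a low-rank decomposition (BA) of the weight update instead of updating the full matrix. With r=16 on a 4096 × 4096 layer, the adapter has under 1% of the layer's parameters, so gradients and optimizer state are only needed for that small set, while quality stays close to full fine-tuning on many tasks.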
</details>

**Q2: What is QLoRA?**
<details><summary>Answer</summary>

QLoRA combines 4-bit quantization (NF4) with LoRA. The frozen base weights are stored in 4-bit, while the LoRA adapters are trained in FP16/BF16. This enables fine-tuning a 65B model on a single 48 GB GPU.
</details>

**Q3: When NOT to fine-tune?**
<details><summary>Answer</summary>

- Task solvable by prompting
- Very small datasets (<100 examples)
- No evaluation data
- Time-sensitive deployment
</details>

## 10. Summary

- **Full Fine-Tuning**: All weights, best results, high cost
- **LoRA**: Low-rank adapters, great quality, low memory
- **QLoRA**: 4-bit + LoRA, very low memory
- **Best Practice**: Prompt → LoRA → Full fine-tune

## 11. References

- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [PEFT Library](https://github.com/huggingface/peft)
- [TRL Library](https://github.com/huggingface/trl)

---
**Next:** [Module 21: RAG (Retrieval-Augmented Generation)](../21_rag/21_rag.ipynb)