```{contents}
```

## Low-Rank Adaptation (LoRA)

LoRA is a **parameter-efficient fine-tuning technique** that lets you adapt a large pretrained model **without updating its original weights**.

Instead of modifying a large weight matrix $W$, LoRA learns a **small low-rank update** $\Delta W$:

$$
W_{\text{new}} = W + \Delta W
$$

where:

$$
\Delta W = B A
$$

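* $A$ and $B$ are small trainable matrices ($B$ is $d_{\text{out}} \times r$, $A$ is $r \times d_{\text{in}}$)
* Rank $r$ is tiny (like 4, 8, 16)
* $W$ stays **frozen**

This reduces training cost and memory, and having few trainable parameters can also limit overfitting.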

---

### Why LoRA Exists (Problem Statement)

Full fine-tuning is expensive because:

* LLMs have **billions of parameters**
* Updating all of them requires **huge GPU memory**
* Risk of **catastrophic forgetting**
* Each task requires training/storing another full model

We want to fine-tune for new tasks (medical, legal, customer support) **without retraining the whole model**.

LoRA solves this.

---

### Key Insight Behind LoRA (Intuition)

#### Most of the knowledge in a large model is already correct.

To adapt it to a new task, you *only need a small adjustment*.

The hypothesis behind LoRA is that this adjustment has low intrinsic rank, i.e. it lies in a **low-dimensional subspace**.

So instead of updating a big matrix:

```
4096 × 4096 = 16,777,216 parameters
```

LoRA updates two small matrices:

```
4096 × r and r × 4096
```

If r = 8:

```
4096×8 + 8×4096 = 65,536 parameters
```

→ **256× fewer trainable parameters** (16,777,216 / 65,536)

This small update is the “nudge” the model needs.
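The arithmetic above generalizes to any layer shape. As a minimal sketch (the helper name `lora_param_counts` is hypothetical, introduced only for illustration), the following counts full versus LoRA parameters for a `d_out × d_in` weight:

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) parameter counts for one weight matrix.

    full: every entry of the d_out x d_in matrix W.
    lora: entries of B (d_out x r) plus entries of A (r x d_in).
    """
    full = d_out * d_in
    lora = d_out * r + r * d_in
    return full, lora


# The 4096 x 4096 example from the text, with rank r = 8.
full, lora = lora_param_counts(4096, 4096, r=8)
print(f"full={full:,} lora={lora:,} ratio={full // lora}x")
```

For this example the ratio comes out to 16,777,216 / 65,536 = 256. Because the LoRA count grows linearly in `r`, doubling the rank halves the ratio.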

---

### How LoRA Works (Mechanism)

For a weight matrix $W$:

1. **Freeze** the original weight (no gradients)
2. Add a low-rank decomposition:
   $$
   \Delta W = B A
   $$
3. Train **only A and B**
4. In the forward pass (training and inference alike):
   $$
   y = Wx + B(Ax)
   $$

You don’t modify $W$; you add a learned correction on top.
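The two views of LoRA used so far, adding a correction $B(Ax)$ to the output and forming $W_{\text{new}} = W + BA$, give the same result. A small sketch in plain Python makes this concrete; the toy matrices and the helper names `matmul` and `add` are assumptions chosen for illustration, not part of any library:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy sizes (assumed for illustration): d_out = d_in = 3, rank r = 1.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]   # frozen weight, 3 x 3
B = [[1], [0], [2]]                      # up-projection, 3 x 1
A = [[0, 1, 1]]                          # down-projection, 1 x 3
x = [[1], [2], [3]]                      # input as a column vector, 3 x 1

# Adapter kept separate: y = W x + B (A x)
y_separate = add(matmul(W, x), matmul(B, matmul(A, x)))

# Adapter folded into the weight: y = (W + B A) x
y_merged = matmul(add(W, matmul(B, A)), x)

print(y_separate, y_merged)
```

Both paths produce the same vector. Keeping the adapter separate leaves `W` untouched, while folding `B A` into the weight yields a single matrix of the original shape.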

---

### Where LoRA Applies in Transformers

LoRA is usually applied to the **attention projection matrices**:

* Query (Q)
* Key (K)
* Value (V)
* Output projection (O)

These matrices determine how tokens attend to each other, and they are large (for example 4096 × 4096 in a 7B-scale model).

Applying LoRA here usually gives a strong effect for few parameters; some setups also adapt the MLP layers.

---

### Benefits of LoRA

#### **✔ Train fewer parameters**

Often **<1%** of the model is trainable.

#### **✔ Preserve original model**

The base weights are never modified, so the original model can always be recovered by removing the adapter.

#### **✔ Efficient storage**

Adapters are **tiny** (few MB).

#### **✔ Fast training**

Often fits on a single GPU, since gradients and optimizer state are needed only for the adapter weights.

#### **✔ Modular**

Load different LoRA adapters for:

* coding
* math
* medical
* translation

One base model → many skills.
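To put a rough number on the "few MB" storage claim above, here is a back-of-the-envelope sketch. The configuration is an assumption chosen for illustration (32 layers, hidden size 4096, four square attention projections per layer, 16-bit storage), and `adapter_size_mib` is a hypothetical helper, not a library function:

```python
def adapter_size_mib(num_layers, d_model, r, matrices_per_layer=4, bytes_per_param=2):
    """Estimate LoRA adapter size, assuming square d_model x d_model projections.

    Returns (total adapter parameters, size in MiB).
    """
    params_per_matrix = r * d_model + d_model * r   # A (r x d) plus B (d x r)
    total_params = num_layers * matrices_per_layer * params_per_matrix
    return total_params, total_params * bytes_per_param / 2**20


# Assumed configuration: 32 layers, hidden size 4096, rank 8, fp16 storage.
params, size = adapter_size_mib(num_layers=32, d_model=4096, r=8)
print(f"{params:,} adapter params, about {size:.0f} MiB in 16-bit precision")
```

Under these assumptions the adapter holds about 8.4 million parameters, or 16 MiB. Real models differ (some projections are not square, and ranks vary), so treat this as an order-of-magnitude estimate.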

---

### Mini Demonstration (PyTorch Style)

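#### Original weight:

```python
import torch

W = torch.randn(4096, 4096)  # frozen pretrained weight, shape (4096 × 4096)
```

#### LoRA learns:

```python
r = 8
A = torch.randn(r, 4096, requires_grad=True)   # shape (r × 4096)
B = torch.zeros(4096, r, requires_grad=True)   # shape (4096 × r)
```

Forward pass:

```python
x = torch.randn(4096)
y = W @ x + B @ (A @ x)
```

`W` is frozen; only `A` and `B` learn.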

---

**One-Sentence Summary**

**LoRA fine-tunes large models by learning a small low-rank matrix update on top of frozen pretrained weights, making adaptation fast, cheap, and parameter-efficient.**


### Demonstration

#### 1) Intuition

LoRA keeps the pretrained weight `W` frozen and learns a **low-rank update** $\Delta W = B A$ with small rank `r`. The forward pass becomes:

$$
y = W x + \frac{\alpha}{r} B (A x)
$$

where $\alpha$ is a scaling hyperparameter. Only `A` and `B` are trained, so there are far fewer trainable parameters and fine-tuning is fast.

---

#### 2) Minimal LoRA module (linear layer adapter)

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """
    LoRA adapter for a single nn.Linear module.
    It wraps a frozen base linear layer and adds a low-rank trainable update.
    """
    def __init__(self, base_linear: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        # store base (frozen) linear
        self.base = base_linear
        self.base.weight.requires_grad = False
        if self.base.bias is not None:
            self.base.bias.requires_grad = False

        in_dim = self.base.in_features
        out_dim = self.base.out_features
        self.r = r
        self.alpha = alpha
        self.scaling = alpha / max(1, r)

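        # A: projects input -> r (down)
        # B: projects r -> output (up)
        # initialize A so outputs are small; initialize B to zero to start with base behaviour
        # create both on the base layer's device/dtype so a model already on GPU keeps working
        w = self.base.weight
        self.A = nn.Parameter(torch.randn(r, in_dim, device=w.device, dtype=w.dtype) * 0.01)  # shape (r, in)
        self.B = nn.Parameter(torch.zeros(out_dim, r, device=w.device, dtype=w.dtype))        # shape (out, r)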

    def forward(self, x):
        # x: (batch, ..., in_dim) -> we treat last dim as feature dim
        # compute base output
        base_out = self.base(x)  # uses frozen weights

        # compute low-rank update: B @ (A @ x^T)  -> but do via (x @ A^T) then @ B^T
        # x shape (BATCH, in_dim); xA_T shape (BATCH, r)
        xA_T = torch.matmul(x, self.A.t())             # (BATCH, r)
        lora_out = torch.matmul(xA_T, self.B.t())      # (BATCH, out_dim)
        return base_out + self.scaling * lora_out
```

---

#### 3) Helper: attach LoRA to all `nn.Linear` layers (or targeted layers)

```python
def inject_lora(model: nn.Module, r: int = 4, alpha: float = 1.0,
                target_type=nn.Linear, module_name_filter=None, prefix: str = ""):
    """
    Recursively replace target_type modules with LoRA-wrapped modules (in place).
    module_name_filter: optional function(name)->bool, called with the child's
    local name (e.g. "q_proj"), to pick which layers to wrap.
    Returns a list of dotted names of the modules that were replaced.
    """
    replaced = []
    for child_name, child in list(model.named_children()):
        full_name = f"{prefix}{child_name}"
        if isinstance(child, LoRALayer):
            continue  # already wrapped; do not wrap its base layer again
        if isinstance(child, target_type) and (module_name_filter is None or module_name_filter(child_name)):
            setattr(model, child_name, LoRALayer(child, r=r, alpha=alpha))
            replaced.append(full_name)
        else:
            # recurse into containers and collect the names replaced there
            replaced += inject_lora(child, r=r, alpha=alpha, target_type=target_type,
                                    module_name_filter=module_name_filter,
                                    prefix=f"{full_name}.")
    return replaced
```

> Use `module_name_filter` to target Q/K/V/O proj layers by name in transformers (e.g. names containing `'q_proj'`, `'k_proj'`, `'v_proj'`, `'o_proj'`).
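A name filter of the kind described in the note above can be a one-line predicate. The sketch below assumes LLaMA-style projection names; `is_attention_proj` and the sample names are hypothetical, and real module names depend on the model architecture:

```python
ATTENTION_PROJ_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")


def is_attention_proj(name: str) -> bool:
    """True if a module name looks like an attention projection layer."""
    return any(key in name for key in ATTENTION_PROJ_KEYS)


# Example module names (hypothetical; check your model's named_modules()).
names = ["q_proj", "v_proj", "gate_proj", "lm_head"]
selected = [n for n in names if is_attention_proj(n)]
print(selected)
```

It would then be passed as `inject_lora(model, module_name_filter=is_attention_proj)`. Printing the names from `model.named_modules()` first is a quick way to confirm which naming scheme a given model uses.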

---

#### 4) Tiny end-to-end example: an MLP, inject LoRA, check params, train only LoRA

```python
import torch.optim as optim
import torch.nn.functional as F

# simple base model
class SmallMLP(nn.Module):
    def __init__(self, input_dim=8, hidden=64, out=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, out)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SmallMLP().to(device)

# Count original params
total_params = sum(p.numel() for p in model.parameters())
print("Total params (orig):", total_params)

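# Freeze every base parameter first, so layers that are not wrapped (here: `out`) stay frozen too
for p in model.parameters():
    p.requires_grad = False

# Inject LoRA into fc1 and fc2 only (example)
inject_lora(model, r=4, alpha=8.0, target_type=nn.Linear,
            module_name_filter=lambda n: n in ("fc1","fc2"))
model.to(device)  # make sure the newly created LoRA parameters are on the same device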

# Count params after injection and trainable params
total_params_after = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total params (after):", total_params_after)
print("Trainable params (should be small):", trainable_params)

# Verify only LoRA params are trainable
for name, p in model.named_parameters():
    if p.requires_grad:
        print("Trainable:", name, p.shape)

# Dummy dataset
X = torch.randn(128, 8).to(device)
y = torch.randint(0,2,(128,)).to(device)

# Optimizer must include only trainable params (LoRA A,B)
opt = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    logits = model(X)
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Epoch {epoch} loss {loss.item():.4f}")
```

**What this does:**

* Replaces the `fc1` and `fc2` `nn.Linear` with `LoRALayer` wrappers.
* Base `Linear` weights are frozen.
* Only LoRA `A` and `B` parameters have `requires_grad=True`.
* Optimizer updates only LoRA params.
* You can observe training loss decreasing while only small adapter weights are learned.

---

#### 5) Parameter-count demonstration (why LoRA is small)

Example numbers (conceptual):

* `fc1` weight: 64 × 8 = 512 params
* `fc2` weight: 64 × 64 = 4,096 params

LoRA with `r=4`:

* For `fc1`: A: 4×8 = 32; B: 64×4 = 256 → total 288 params
* For `fc2`: A: 4×64 = 256; B: 64×4 = 256 → total 512 params

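Total trainable: 288 + 512 = 800 params, versus 4,866 for fully fine-tuning this MLP (weights and biases of all three layers), roughly 6× fewer. The saving is modest here because the layers are tiny; it grows with layer width and is far larger for the wide matrices in LLMs.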
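The counts for this toy MLP can be reproduced with a few lines of arithmetic. This is a sketch using the layer shapes of `SmallMLP` (8 → 64 → 64 → 2) and rank 4; the helper names `linear_params` and `lora_params` are hypothetical:

```python
def linear_params(d_in, d_out, bias=True):
    """Parameters in a dense layer: weight (d_out x d_in) plus optional bias."""
    return d_out * d_in + (d_out if bias else 0)


def lora_params(d_in, d_out, r):
    """Parameters in one LoRA adapter: A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


# Layer shapes of the SmallMLP: fc1 (8 -> 64), fc2 (64 -> 64), out (64 -> 2).
full = linear_params(8, 64) + linear_params(64, 64) + linear_params(64, 2)
adapters = {"fc1": lora_params(8, 64, r=4), "fc2": lora_params(64, 64, r=4)}
lora_total = sum(adapters.values())

print("full fine-tune:", full)
print("LoRA per layer:", adapters)
print("LoRA total:", lora_total)
```

Note that for `fc1` the adapter (288 parameters) is more than half the size of the weight it adapts (512), which illustrates that LoRA pays off mainly when `r` is much smaller than the layer dimensions.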

---

#### 6) Save / load LoRA adapter only (recommended)

```python
# save only LoRA parameters (state-dict keys ending in ".A" or ".B")
lora_state = {k: v.cpu() for k, v in model.state_dict().items() if k.endswith((".A", ".B"))}
torch.save(lora_state, "lora_adapter.pth")

# load back (on same architecture)
adapter = torch.load("lora_adapter.pth", map_location='cpu')
model.load_state_dict(adapter, strict=False)  # strict=False because only subset loaded
```

---

#### 7) Notes & best practices

* **Where to apply LoRA:** Q/K/V/O projection matrices in attention yield strong effect for transformers. Use `module_name_filter` to target these layers by name.
* **Rank `r`:** small (4–16) usually works. Higher `r` = more capacity but more params.
* **Alpha scaling:** the update is multiplied by `alpha / r`; a common starting point is `alpha = r` or `alpha = 2r`.
* **Initialization:** initializing `A` with small random values and `B` with zeros is common, so the model's behavior starts unchanged.
* **Saving:** keep only the adapter; multiple small adapters can be stored for different tasks.
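The initialization and scaling notes above can be checked numerically. The following sketch uses plain Python lists with toy sizes; all names and values are assumptions for illustration. It shows that a zero `B` makes the correction vanish and that `alpha / r` scales the correction linearly:

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_delta(A, B, x, alpha, r):
    """Scaled LoRA correction: (alpha / r) * B (A x)."""
    scale = alpha / r
    return [scale * v for v in matvec(B, matvec(A, x))]


random.seed(0)
d_in, d_out, r = 6, 4, 2   # toy sizes, assumed for illustration

A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the correction is zero, so the wrapped layer starts out
# identical to the frozen base layer.
delta_at_init = lora_delta(A, B, x, alpha=4.0, r=r)
print(delta_at_init)

# Once B moves away from zero, alpha / r sets the size of the correction:
# doubling alpha doubles the update.
B[0][0] = 1.0
ratio = lora_delta(A, B, x, alpha=4.0, r=r)[0] / lora_delta(A, B, x, alpha=2.0, r=r)[0]
print(ratio)
```

Because the scale multiplies the whole correction, changing `alpha` has an effect similar to changing the learning rate of the adapter, which is one reason it is often fixed relative to `r` rather than tuned separately.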

### LLM
This section applies LoRA to a real LLM using **HuggingFace Transformers + PEFT**, a widely used route for LoRA fine-tuning of models such as LLaMA, Mistral, GPT-J, and Falcon.

This example:

* Is **minimal** (each step is a short snippet)
* **Uses PEFT** (HuggingFace's parameter-efficient fine-tuning library)
* **Shows where LoRA attaches inside an LLM**
* **Shows training only LoRA params**
* **Assumes a CUDA GPU** as written (8-bit loading and `fp16=True`); drop those options and pick a smaller model to run on CPU

---

#### **1. Install Required Packages**

```bash
pip install transformers peft accelerate datasets sentencepiece bitsandbytes
```

---

#### **2. Load an LLM + Apply LoRA (Using PEFT)**

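Example: a Mistral model (any LLaMA-style causal LM with `q_proj`/`k_proj`/`v_proj`/`o_proj` layers works the same way).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Choose an HF causal LM whose attention layers are named q_proj/k_proj/v_proj/o_proj.
# (Falcon, for example, uses different module names and needs different target_modules.)
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # optional: 8-bit to reduce VRAM (needs bitsandbytes + CUDA)
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # prepares a quantized model for training
```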

---

#### **3. Define LoRA Configuration**

LoRA usually targets **Q, K, V, O** attention projection layers of the LLM.

```python
lora_config = LoraConfig(
    r=8,                          # rank of LoRA matrices
    lora_alpha=16,                # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], 
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
```

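##### Why target these layers?

The attention projections are a common default: adapting them is usually enough to change model behavior noticeably at a small parameter cost. Some setups also adapt the MLP layers.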

---

#### **4. Inject LoRA Into the Model**

```python
model = get_peft_model(model, lora_config)
```

---

#### **5. Show Trainable vs Frozen Parameters**

LoRA typically trains **<1%** of the parameters.

```python
model.print_trainable_parameters()
```

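The output has this form (numbers are illustrative; they depend on the model and the LoRA config):

```
trainable params: 3,276,800 || all params: 7,000,000,000 || trainable%: 0.0468
```

Only a small fraction of the weights receive gradients, which is where LoRA's savings come from.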

---

#### **6. Prepare Dataset (Tiny Example)**

Use any text dataset. Example using `datasets` library:

```python
from datasets import load_dataset
data = load_dataset("yelp_review_full", split="train[:2000]")  # small sample
```

Tokenize:

```python
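# Many causal-LM tokenizers have no pad token; reuse EOS so padding works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# drop the raw columns so only input_ids / attention_mask reach the Trainer
tokenized = data.map(tokenize, batched=True, remove_columns=data.column_names)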
```

---

#### **7. Training Loop Using Transformers Trainer**

```python
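from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="lora-llm",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=50,
    max_steps=200,
    fp16=True,                    # mixed precision; requires a CUDA GPU
    logging_steps=10,
    learning_rate=2e-4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # copies input_ids into labels (padding masked out) so the model returns a loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)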

trainer.train()
```

LoRA updates only the low-rank parameters (`A` and `B`) and keeps the main model frozen.

---

#### **8. Use the Fine-Tuned LLM**

```python
model.eval()
prompt = "Explain LoRA in simple words."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # follow the model's device instead of hard-coding "cuda"

output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

---

#### **9. Save Only the LoRA Adapter (Not the Full LLM)**

```python
model.save_pretrained("lora_adapter")
```

This directory contains only the LoRA weights and their config, typically in the range of ~5–20MB depending on rank and target modules.

You can later apply them to the full LLM again:

```python
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = PeftModel.from_pretrained(model, "lora_adapter")
```

---

#### ✔ **What Happens Inside the LLM? (Mechanism)**

LoRA modifies transformer attention layers:

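##### Original Q projection:

$$
Q = x W_q
$$

##### With LoRA:

$$
Q = x W_q + \frac{\alpha}{r}\, x A^\top B^\top
$$

here ($x$ is a row vector, and $A$, $B$ are shaped as in $\Delta W = B A$):

* $A \in \mathbb{R}^{r \times d}$ projects down, $d \rightarrow r$
* $B \in \mathbb{R}^{d \times r}$ projects back up, $r \rightarrow d$
* $W_q$ frozen
* only $A$ and $B$ are trained
* $\alpha / r$ is the scaling set by `lora_alpha` and `r`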

This gives the model a **learned correction ("nudge")** without touching the original LLM weights.

---

**FINAL SUMMARY — LoRA on LLMs**

| Feature                             | Benefit                                |
| ----------------------------------- | -------------------------------------- |
| Typically <1% of parameters trained | Fast & cheap fine-tuning               |
| LLM weights frozen                  | Reduces risk of catastrophic forgetting |
| LoRA adapters tiny                  | Can store many skills                  |
| Works for most transformer LLMs     | GPT-J, Falcon, Mistral, LLaMA, etc.    |
| Easy with PEFT                      | `get_peft_model(model, lora_config)`   |