# Notebook ⑥ – Q‑GaLore Full‑Parameter Fine‑tuning (Llama‑3 8B)

Q‑GaLore (quantised GaLore) aims to let you *fully* fine‑tune an 8 B model on a single RTX 4060 Ti (16 GB) by:

* loading **4‑bit weights** (bitsandbytes NF4)
* projecting **gradients into a low‑rank subspace** with GaLore, which shrinks the optimizer state
* using an **8‑bit AdamW** optimizer

---

### Packages

```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install bitsandbytes==0.43.1 \
            "transformers>=4.41.0" \
            "accelerate>=0.29.3" \
            datasets trl \
            galore-torch  # https://github.com/jiaweizzhao/GaLore
```
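
To confirm what actually got installed, a quick check along these lines can help. It is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the distribution names are simply the ones from the install command.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package; "not installed" marks anything pip did not pick up
for name, ver in installed_versions(
    ["bitsandbytes", "transformers", "accelerate", "datasets", "trl"]
).items():
    print(f"{name:15s} {ver or 'not installed'}")
```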


In [None]:
import torch, pandas as pd
from datasets import Dataset
from pathlib import Path
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,  # exported by transformers, not trl
)
from galore_torch import GaLoreAdamW8bit  # PyPI package: galore-torch
from trl import SFTTrainer


### Load 50‑example instruction dataset

In [None]:
DATA_FILE = Path("eval_qa50.csv")   # created in Notebook ②
df = pd.read_csv(DATA_FILE)

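def to_prompt(row):
    # Expects `context` and `question` columns in the CSV
    return f"CONTEXT:\n{row.context}\nQUESTION:\n{row.question}\nANSWER:"

# One prompt/response pair per CSV row; the reference answer is the training target
dataset = Dataset.from_dict({
    "prompt": [to_prompt(r) for _, r in df.iterrows()],
    "response": df["answer"].tolist()
})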
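
To see what a single training example looks like, the sketch below formats one made-up row. The row contents, the `to_training_text` helper, and the `<eos>` stand-in are all hypothetical; the real end-of-sequence token comes from the tokenizer.

```python
from types import SimpleNamespace

def to_prompt(row):
    return f"CONTEXT:\n{row.context}\nQUESTION:\n{row.question}\nANSWER:"

def to_training_text(row, eos_token=""):
    # Prompt, a space, the reference answer, then an optional end-of-sequence marker
    return f"{to_prompt(row)} {row.answer}{eos_token}"

# Made-up row standing in for one line of the CSV
example = SimpleNamespace(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer="Paris",
)
print(to_training_text(example, eos_token="<eos>"))
```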


### Load 8 B model in 4‑bit

In [None]:
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
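model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_cfg,
    device_map="auto",   # Llama 3 is supported natively, so no trust_remote_code needed
)
# 4-bit layers store packed weights, so numel() under-counts the quantized parameters
print(f"Model loaded with {sum(p.numel() for p in model.parameters())/1e6:.0f} M params")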
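
Before building the optimizer it is worth checking which parameters are actually trainable, since quantized layers are often loaded frozen. The helper below is a minimal sketch; `summarize_trainable` and `fake_param` are hypothetical names, and the stand-in objects only exist so the snippet runs without a GPU.

```python
from types import SimpleNamespace

def summarize_trainable(named_params):
    """Return (trainable, frozen) element counts for (name, param) pairs."""
    trainable = frozen = 0
    for _name, p in named_params:
        if p.requires_grad:
            trainable += p.numel()
        else:
            frozen += p.numel()
    return trainable, frozen

def fake_param(n, requires_grad):
    # Stand-in exposing the two attributes the helper reads from a real parameter
    return SimpleNamespace(numel=lambda: n, requires_grad=requires_grad)

demo = [
    ("layer.weight", fake_param(1000, False)),
    ("norm.weight", fake_param(10, True)),
]
trainable, frozen = summarize_trainable(demo)
print(f"trainable={trainable} frozen={frozen}")
```

With the real model, call `summarize_trainable(model.named_parameters())`. If the trainable count is tiny compared with the frozen one, the optimizer will only be updating a small part of the network.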


### Configure Q‑GaLore Optimizer

In [None]:
from transformers import TrainingArguments

# GaLore hyper‑params; galore_torch reads these from the param group, not from optimizer kwargs
galore_cfg = dict(
    rank=64,                # low‑rank projection
    update_proj_gap=200,    # steps between projection refreshes
    scale=1.0,
    proj_type="std",
)

def make_param_groups(model):
    galore, plain = [], []
    for n, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim < 2 or "norm" in n or "bias" in n:
            plain.append(p)    # 1‑D params: regular 8‑bit AdamW, no weight decay
        else:
            galore.append(p)   # 2‑D weight matrices: low‑rank projected updates
    return [
        {"params": galore, "weight_decay": 0.01, **galore_cfg},
        {"params": plain, "weight_decay": 0.0},
    ]

optimizer = GaLoreAdamW8bit(
    make_param_groups(model),
    lr=2e-4,
    betas=(0.9, 0.95),
)
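
To get a feel for what the rank buys, the sketch below counts optimizer-state elements for one weight matrix. It assumes the usual GaLore accounting (two Adam moments stored in the projected shape, plus one projection matrix on the smaller dimension). The 4096 × 14336 shape is only an illustrative MLP-sized matrix, and both helper names are hypothetical.

```python
def adam_state_elems(m, n):
    # Standard Adam keeps two moment tensors with the same shape as the weight
    return 2 * m * n

def galore_state_elems(m, n, rank):
    # GaLore shrinks the smaller dimension to `rank` and keeps one projector
    small, large = min(m, n), max(m, n)
    r = min(rank, small)
    projector = small * r
    moments = 2 * r * large
    return projector + moments

m, n = 4096, 14336  # illustrative shape, not read from the model
full = adam_state_elems(m, n)
low = galore_state_elems(m, n, 64)
print(f"full Adam: {full:,} elements")
print(f"GaLore rank 64: {low:,} elements ({full / low:.0f}x fewer)")
```

These are element counts, not bytes: the 8‑bit optimizer reduces the bytes per stored moment on top of this.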


### SFTTrainer Loop with GaLore optimizer

In [None]:
training_args = TrainingArguments(
    output_dir="qgalore-llama3-8b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,
    optim="adamw_torch",   # placeholder; replaced by the GaLore optimizer below
    logging_steps=1,       # 50 examples give only about ten optimizer steps in total
    save_strategy="epoch"
)

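collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def formatting_func(batch):
    # Join prompt and answer into one training string per example (called on batches)
    return [
        f"{p} {r}{tokenizer.eos_token}"
        for p, r in zip(batch["prompt"], batch["response"])
    ]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    data_collator=collator,
    formatting_func=formatting_func,
    max_seq_length=1024,
    packing=False,
    optimizers=(optimizer, None),   # use the GaLore optimizer; the trainer builds the scheduler
)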

trainer.train()
trainer.save_model()
tokenizer.save_pretrained("qgalore-llama3-8b")
print("✅ Saved Q‑GaLore fine‑tuned model to qgalore-llama3-8b/")
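
With only 50 examples the run is very short. The sketch below works out roughly how many optimizer steps the settings above produce on a single GPU. Whether a trailing partial accumulation window counts as a step depends on the Trainer version, so the hypothetical `optimizer_steps` helper returns a lower and an upper bound.

```python
import math

def optimizer_steps(n_examples, batch_size, grad_accum, epochs):
    """Return (batches per epoch, min total steps, max total steps)."""
    batches = math.ceil(n_examples / batch_size)
    low = max(batches // grad_accum, 1)       # partial window dropped
    high = math.ceil(batches / grad_accum)    # partial window counted
    return batches, low * epochs, high * epochs

batches, low, high = optimizer_steps(n_examples=50, batch_size=4, grad_accum=4, epochs=3)
print(f"{batches} batches per epoch, {low} to {high} optimizer steps in total")
```

At this scale a large `logging_steps` value logs at most once, and `update_proj_gap=200` means the GaLore projection is never refreshed after the first step.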


## Expected VRAM

| Stage | Peak VRAM (8 B params) | Notes |
|-------|------------------------|-------|
| Loading 4‑bit weights | ~5 GB | |
| Training step (GaLore rank 64, 8‑bit AdamW) | 12‑14 GB | should fit RTX 4060 Ti 16 GB |

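Lower `rank`, the batch size, or `max_seq_length` if you need more headroom; `update_proj_gap` changes how often the projection is recomputed rather than how much memory it takes.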
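
As a rough cross-check of the weight-loading figure, the arithmetic below estimates the footprint of the stored weights. The split between quantized and unquantized parameters is an assumption, not something measured in this notebook: about 7.0 B parameters in 4‑bit linear layers and about 1.05 B in the embedding and output head kept in bf16. It ignores quantization constants, activations, and CUDA overhead.

```python
def weight_bytes(n_params, bits_per_param):
    # Storage for n_params values at the given bit width
    return n_params * bits_per_param / 8

quantized = weight_bytes(7.0e9, 4)     # assumed 4-bit linear layers
kept_bf16 = weight_bytes(1.05e9, 16)   # assumed bf16 embedding + output head
total_gb = (quantized + kept_bf16) / 1e9
print(f"estimated weight footprint: {total_gb:.1f} GB")
```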
