# Finetuning Pre-trained Models — From Theory to Practice

**Framework:** PyTorch + Hugging Face Transformers  
**Models:** Light models (CPU/Colab friendly)  
**Datasets:** Small public samples (IMDB, AG News, CoNLL2003)  
**Covers:** Full finetuning, Feature Extraction, Adapters, Prefix/Prompt Tuning, LoRA, NER, Vision TL, Eval, Deployment

> Tip: Run this on Google Colab with a T4/A100 GPU for the LoRA and generation examples.

## Table of Contents
1. Setup
2. Concepts & Heuristics
3. Data Loading & Preparation
4. Full Finetuning (Text Classification)
5. Feature Extraction (Frozen Encoder + Head)
6. Adapters (Parameter-Efficient)
7. Prefix/Prompt Tuning (Generation)
8. LoRA (Parameter-Efficient Finetuning)
9. Token Classification (NER)
10. Vision Transfer Learning (Brief)
11. Evaluation & Monitoring
12. Deployment (Pipeline + FastAPI)
13. Cost/Quality Heuristics
14. Pitfalls & Debugging
15. Exercises

## 1. Setup
This installs required libraries. Skip installs if your environment already has them.

In [None]:
!pip -q install transformers datasets accelerate evaluate scikit-learn seqeval rouge_score

In [None]:
# Optional (for Adapters & PEFT/LoRA; may require GPU for speed)
# `adapters` is the maintained successor to the deprecated `adapter-transformers` fork
!pip -q install peft adapters bitsandbytes==0.43.1

In [None]:
import warnings, os, math, random
warnings.filterwarnings('ignore')

import torch
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, AutoModel,
    TrainingArguments, Trainer, DataCollatorWithPadding,
    AutoModelForTokenClassification, DataCollatorForTokenClassification,
    pipeline
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
print('Torch:', torch.__version__)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

## 2. Concepts & Heuristics

**Finetuning** = continue training a pre-trained model on task-/domain-specific data.  
**When to choose what:**  
- **Prompting/RAG**: tiny data, no training budget.  
- **Feature Extraction**: quick baseline; freeze encoder, train head.  
- **Adapters / Prefix**: low compute, many domains.  
- **LoRA/QLoRA**: best quality/cost for LLMs; train small low-rank matrices.  
- **Full Finetune**: small models (≤400M) with enough data/GPU; precise control.
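
A rough memory estimate shows why full finetuning is usually reserved for small models while parameter-efficient methods scale further. The sketch below assumes fp32 training with AdamW (4 bytes each for a weight and its gradient, 8 bytes for the two optimizer moments) and ignores activations. The function name and the 1% trainable fraction are illustrative assumptions, not measurements.

```python
def training_memory_gb(n_params, trainable_fraction=1.0, bytes_per_value=4):
    """Rough fp32 + AdamW memory estimate, ignoring activations.

    Every parameter stores its weight; only trainable parameters also
    store a gradient and two AdamW moment estimates.
    """
    weights = n_params * bytes_per_value
    trainable = n_params * trainable_fraction
    grads_and_moments = trainable * bytes_per_value * 3  # grad + 2 moments
    return (weights + grads_and_moments) / 1e9

# Full finetuning of a 400M-parameter model: 16 bytes per parameter.
full_400m = training_memory_gb(400e6)
# Same model with ~1% of parameters trainable (assumed PEFT-style fraction).
peft_400m = training_memory_gb(400e6, trainable_fraction=0.01)

print(f"full: {full_400m:.2f} GB, PEFT-style: {peft_400m:.2f} GB")
```

Activations, batch size, and mixed precision change the real numbers considerably, so treat this as an order-of-magnitude check only.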

## 3. Data Loading & Preparation

We'll use small, public datasets to keep the notebook lightweight:
- **AG News** (4-class text classification)
- **IMDB** (binary sentiment) — optional
- **CoNLL2003** (NER)

In [None]:
# Load AG News (classification)
from datasets import load_dataset, DatasetDict
ag = load_dataset("ag_news")
# Subsample for speed
ag_small = DatasetDict({
    'train': ag['train'].shuffle(seed=42).select(range(2000)),
    'validation': ag['test'].shuffle(seed=42).select(range(1000))
})
ag_small

In [None]:
# Load CoNLL2003 (NER)
conll = load_dataset("conll2003")
conll_small = DatasetDict({
    'train': conll['train'].shuffle(seed=42).select(range(2000)),
    'validation': conll['validation'].shuffle(seed=42).select(range(1000))
})
label_list = conll['train'].features['ner_tags'].feature.names
id2label = {i:l for i,l in enumerate(label_list)}
label2id = {l:i for i,l in enumerate(label_list)}
label_list[:10], len(label_list)

## 4. Full Finetuning (Text Classification)

**Definition:** Update **all** parameters of a pre-trained encoder for the downstream task.  
**When:** Small/medium models, enough data, need best accuracy.

We'll use **DistilBERT** on **AG News**.

In [None]:
from transformers import DataCollatorWithPadding, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

model_name = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(model_name)

def tok_ag(batch):
    return tok(batch['text'], truncation=True, padding=False, max_length=256)

ag_tok = ag_small.map(tok_ag, batched=True)
ag_tok = ag_tok.rename_column('label', 'labels')
ag_tok.set_format(type='torch', columns=['input_ids','attention_mask','labels'])

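# Attach the dataset's class names so predictions read e.g. "Sports" instead of "LABEL_1"
ag_labels = ag['train'].features['label'].names
model_full = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(ag_labels),
    id2label=dict(enumerate(ag_labels)),
    label2id={name: i for i, name in enumerate(ag_labels)},
)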

dc = DataCollatorWithPadding(tok)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average='macro')
    }

args = TrainingArguments(
    output_dir="ft_full_agnews",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in transformers 4.41
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    save_strategy="epoch"
)

trainer = Trainer(
    model=model_full, args=args,
    train_dataset=ag_tok['train'],
    eval_dataset=ag_tok['validation'],
    tokenizer=tok,
    data_collator=dc,
    compute_metrics=compute_metrics
)
trainer.train()

In [None]:
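# Evaluate and preview predictions
preds = trainer.predict(ag_tok['validation'])
print(preds.metrics)
# Preview the first five predicted class ids next to the gold labels
print(np.argmax(preds.predictions[:5], axis=-1), preds.label_ids[:5])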
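
The training arguments above imply a specific number of optimizer steps and a learning-rate curve. The sketch below works through that arithmetic, assuming a single device, no gradient accumulation, and the linear decay schedule without warmup that `Trainer` uses by default. The helper `linear_lr` is illustrative and not part of the Transformers API.

```python
import math

n_train, batch_size, epochs = 2000, 16, 2
base_lr, logging_steps = 2e-5, 50

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step, total=total_steps, lr=base_lr, warmup=0):
    """Linear warmup followed by linear decay to zero."""
    if step < warmup:
        return lr * step / max(1, warmup)
    return lr * max(0.0, (total - step) / max(1, total - warmup))

print("steps/epoch:", steps_per_epoch, "total:", total_steps)
print("loss is logged at steps:", list(range(logging_steps, total_steps + 1, logging_steps)))
print("lr at start, midpoint, end:",
      linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

With only 250 updates, the learning rate is already halved after the first epoch, which is worth keeping in mind when comparing runs with different epoch counts.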

## 5. Feature Extraction (Frozen Encoder + Head)

**Definition:** Freeze the encoder and train a small classification head.  
**Why:** Very fast, good baseline, reduces overfitting on small data.

In [None]:
import torch, torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained(model_name)
for p in encoder.parameters():
    p.requires_grad = False

class ClsHead(nn.Module):
    def __init__(self, hidden, n_labels):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(hidden, n_labels)
    def forward(self, last_hidden_state, attention_mask=None):
        pooled = last_hidden_state[:,0]  # [CLS]-like pooling
        return self.fc(self.dropout(pooled))

head = ClsHead(hidden=encoder.config.hidden_size, n_labels=4)

# Minimal training loop over tokenized AG News features
from torch.utils.data import DataLoader

encoder.to(device).eval()  # eval() disables dropout in the frozen encoder
head.to(device).train()
optim = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Examples were tokenized without padding, so `dc` (DataCollatorWithPadding)
# pads each batch to a common length before stacking it into tensors.
train_loader = DataLoader(ag_tok['train'], batch_size=32, shuffle=True, collate_fn=dc)

for step, batch in enumerate(train_loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():  # no gradients flow through the frozen encoder
        out = encoder(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
    logits = head(out.last_hidden_state)
    loss = loss_fn(logits, batch['labels'])
    optim.zero_grad()
    loss.backward()
    optim.step()
    if step % 50 == 0:
        print('step', step, 'loss', float(loss))
    if step > 200: break  # keep light
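
`ClsHead` accepts an `attention_mask` but pools only the first token. A common alternative is masked mean pooling, which averages all non-padding token vectors. The NumPy sketch below uses made-up toy values to show the arithmetic; the same indexing carries over to PyTorch tensors. The function name is an assumption for illustration.

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The second sequence has one padded position holding junk values.
hidden = np.array([[[1., 1.], [3., 3.], [5., 5.]],
                   [[2., 4.], [4., 8.], [100., 100.]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

Whether mean pooling beats first-token pooling depends on the encoder and task, so it is worth comparing both on the validation split.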

## 6. Adapters (Parameter-Efficient)

**Definition:** Insert small trainable bottlenecks inside each Transformer block; freeze the base.  
**Benefits:** Swap adapters for domains; small memory footprint.

In [None]:
# Requires the `adapters` package (pip install adapters), the successor to adapter-transformers
import adapters
from adapters import AdapterTrainer
from transformers import AutoModelForSequenceClassification, TrainingArguments

model_ad = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
adapters.init(model_ad)  # add adapter support to the plain Hugging Face model
model_ad.add_adapter("agnews_adapter", config="seq_bn")  # Pfeiffer-style bottleneck adapter
model_ad.train_adapter("agnews_adapter")  # freeze the base encoder, activate the adapter

trainer_ad = AdapterTrainer(
    model=model_ad, args=TrainingArguments(
        output_dir="ft_adapters_agnews",
        per_device_train_batch_size=16, per_device_eval_batch_size=32,
        num_train_epochs=1, learning_rate=5e-4,
        evaluation_strategy="epoch",  # renamed to `eval_strategy` in transformers 4.41
        logging_steps=50
    ),
    train_dataset=ag_tok['train'], eval_dataset=ag_tok['validation'],
    tokenizer=tok, data_collator=dc, compute_metrics=compute_metrics
)
trainer_ad.train()
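
To make the "small trainable bottleneck" concrete, here is a NumPy sketch of a single bottleneck adapter: down-projection, nonlinearity, up-projection, and a residual connection. It is not the library's implementation. The sizes (hidden size 768, bottleneck 48, i.e. a reduction factor of 16) and the zero initialization of the up-projection are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 48  # assumed sizes (reduction factor 16)

# Trainable adapter parameters; the surrounding Transformer weights stay frozen.
W_down = rng.normal(0, 0.02, size=(d_model, d_bottleneck))
b_down = np.zeros(d_bottleneck)
W_up = np.zeros((d_bottleneck, d_model))  # zero init: adapter starts as identity
b_up = np.zeros(d_model)

def adapter(h):
    """Bottleneck adapter: h + up(relu(down(h)))."""
    z = np.maximum(0.0, h @ W_down + b_down)
    return h + z @ W_up + b_up

h = rng.normal(size=(4, d_model))  # 4 token vectors
n_adapter_params = W_down.size + b_down.size + W_up.size + b_up.size
print("params per adapter:", n_adapter_params)
print("identity at init:", np.allclose(adapter(h), h))
```

At these sizes one adapter holds roughly 75k parameters, so even one adapter per Transformer block stays far below the size of the base encoder.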

## 7. Prefix/Prompt Tuning (Generation)

**Definition:** Learn soft prompt vectors prepended to inputs; keep base frozen.  
We'll demonstrate with a **small causal LM** (`gpt2`). For better results, use a GPU.

In [None]:
from peft import PromptTuningConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

base_causal = "gpt2"
tok_g = AutoTokenizer.from_pretrained(base_causal)
tok_g.pad_token = tok_g.eos_token

gpt2 = AutoModelForCausalLM.from_pretrained(base_causal)

pt_cfg = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
gpt2_pt = get_peft_model(gpt2, pt_cfg)
gpt2_pt.print_trainable_parameters()

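# The soft prompt has not been trained yet, so this only checks that generation runs end to end
prompt = "Write a one-sentence product description for a fintech savings app:"
ids = tok_g(prompt, return_tensors='pt')
out = gpt2_pt.generate(**ids, max_new_tokens=40, pad_token_id=tok_g.eos_token_id)
print(tok_g.decode(out[0], skip_special_tokens=True))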
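
Mechanically, prompt tuning prepends `num_virtual_tokens` learned vectors to the embedded input and extends the attention mask to cover them. The NumPy sketch below shows the shapes involved, with random arrays standing in for real embeddings. The helper name is hypothetical, and this is not how PEFT implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)
num_virtual_tokens, hidden = 20, 768  # 768 is GPT-2 small's embedding width

# The only trainable tensor in prompt tuning: one vector per virtual token.
soft_prompt = rng.normal(0, 0.02, size=(num_virtual_tokens, hidden))

def prepend_soft_prompt(token_embeds, attention_mask):
    """Prepend the soft prompt to a batch of embeddings and extend the mask."""
    batch = token_embeds.shape[0]
    prompt = np.broadcast_to(soft_prompt, (batch, num_virtual_tokens, hidden))
    embeds = np.concatenate([prompt, token_embeds], axis=1)
    prompt_mask = np.ones((batch, num_virtual_tokens), dtype=attention_mask.dtype)
    mask = np.concatenate([prompt_mask, attention_mask], axis=1)
    return embeds, mask

token_embeds = rng.normal(size=(2, 12, hidden))   # stand-in for embedded input ids
attention_mask = np.ones((2, 12), dtype=np.int64)
embeds, mask = prepend_soft_prompt(token_embeds, attention_mask)
print(embeds.shape, mask.shape, "trainable values:", soft_prompt.size)
```

With 20 virtual tokens of width 768, only 15,360 values are trained, and every input sequence grows by 20 positions.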

## 8. LoRA (PEFT) — Lightweight Demo

**LoRA:** Train low-rank matrices on top of frozen weights; great quality/cost trade-off.  
We'll show setup on a small causal LM (e.g., `gpt2`) for demonstration. For serious finetunes, use larger models + GPU.

In [None]:
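from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
)
from datasets import Dataset

# Load a fresh base model so LoRA is not stacked on the prompt-tuned model from section 7
gpt2_base = AutoModelForCausalLM.from_pretrained(base_causal)

# GPT-2 implements these projections as Conv1D layers (transposed weights),
# which is what fan_in_fan_out=True tells PEFT to expect.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
                      task_type="CAUSAL_LM", target_modules=["c_attn","c_proj"],
                      fan_in_fan_out=True)
gpt2_lora = get_peft_model(gpt2_base, lora_cfg)
gpt2_lora.print_trainable_parameters()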

# Tiny toy dataset for SFT-style tuning (demo only)
train_texts = [
    "Instruction: Summarize the benefit of automatic savings.\nAnswer: It helps users build funds effortlessly over time.",
    "Instruction: Explain KYC in one line.\nAnswer: KYC is the identity verification required by financial institutions.",
    "Instruction: Write a friendly reminder about overdraft fees.\nAnswer: Remember to maintain your balance to avoid overdraft fees."
]
toy = [{"text": t} for t in train_texts]

def tok_txt(row):
    return tok_g(row["text"], truncation=True, max_length=128)
toy_ds = Dataset.from_list(toy).map(tok_txt, remove_columns=["text"])
toy_ds.set_format(type="torch", columns=["input_ids","attention_mask"])

dc_lm = DataCollatorForLanguageModeling(tok_g, mlm=False)
args_lora = TrainingArguments(
    output_dir="gpt2_lora_toy",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-4,
    num_train_epochs=1,
    logging_steps=5
)
trainer_lora = Trainer(model=gpt2_lora, args=args_lora, train_dataset=toy_ds, data_collator=dc_lm)
trainer_lora.train()

In [None]:
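# Try generating again (adapter still attached)
prompt = "Instruction: Write a friendly one-line savings tip.\nAnswer:"
# Trainer may have moved the model to the GPU, so place the inputs on the same device
ids = tok_g(prompt, return_tensors='pt').to(next(gpt2_lora.parameters()).device)
out = gpt2_lora.generate(**ids, max_new_tokens=40, pad_token_id=tok_g.eos_token_id)
print(tok_g.decode(out[0], skip_special_tokens=True))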
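
The low-rank update itself is small enough to write out. The NumPy sketch below uses the same `r=8` and `lora_alpha=16` as the config above, giving a scale of `alpha / r = 2`, on a weight shaped like GPT-2 small's `c_attn` (768 to 2304). It is an illustration of the idea, not PEFT's code, and the "trained" `B` matrix is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 2304, 8, 16  # c_attn in GPT-2 small maps 768 -> 2304
scale = alpha / r

W = rng.normal(0, 0.02, size=(d_in, d_out))  # frozen pre-trained weight
A = rng.normal(0, 0.01, size=(d_in, r))      # trainable, random init
B = np.zeros((r, d_out))                     # trainable, zero init

def lora_forward(x, A, B):
    """Frozen path plus scaled low-rank update: x W + scale * (x A) B."""
    return x @ W + scale * (x @ A) @ B

x = rng.normal(size=(4, d_in))
print("unchanged at init:", np.allclose(lora_forward(x, A, B), x @ W))

# After training, B is non-zero; the update can be merged into W for inference.
B_trained = rng.normal(0, 0.01, size=(r, d_out))  # stand-in for learned values
W_merged = W + scale * A @ B_trained
print("merge matches:", np.allclose(lora_forward(x, A, B_trained), x @ W_merged))
print("trainable:", A.size + B.size, "vs full:", W.size)
```

Because one factor starts at zero, the model's outputs are unchanged before training, and merging the update afterwards removes any extra inference cost.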

## 9. Token Classification (NER)

**Task:** Assign an entity label to each token (e.g., PER, ORG, LOC).  
We'll finetune `bert-base-cased` on a **subset of CoNLL2003**.

In [None]:
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification, AutoTokenizer
tok_n = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    tokenized = tok_n(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i in range(len(examples["tokens"])):
        ids = tokenized.word_ids(batch_index=i)
        ex_labels = examples["ner_tags"][i]
        aligned = []
        prev = None
        for wid in ids:
            if wid is None:
                aligned.append(-100)
            elif wid != prev:
                aligned.append(ex_labels[wid])
            else:
                aligned.append(-100)
            prev = wid
        labels.append(aligned)
    tokenized["labels"] = labels
    return tokenized

conll_tok = conll_small.map(tokenize_and_align_labels, batched=True)
cols = ["input_ids","attention_mask","labels"]
conll_tok.set_format(type="torch", columns=cols)

ner_model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list), id2label=id2label, label2id=label2id
)
dc_ner = DataCollatorForTokenClassification(tok_n)

args_ner = TrainingArguments(
    output_dir="ner_bert_conll",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in transformers 4.41
    logging_steps=50
)
# No compute_metrics is passed here, so evaluation reports the loss only
trainer_ner = Trainer(
    model=ner_model, args=args_ner,
    train_dataset=conll_tok['train'], eval_dataset=conll_tok['validation'],
    data_collator=dc_ner, tokenizer=tok_n
)
trainer_ner.train()
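
The alignment rule in `tokenize_and_align_labels` is easier to follow on a single hand-written example. The sketch below reimplements the same rule as a standalone function and applies it to a made-up `word_ids` list of the kind a fast tokenizer returns. The sentence, its subword split, and the helper name are assumptions for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first subword of each word; mask special tokens and continuations."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# Hypothetical example: words ["Angela", "Merkel", "visited", "Heidelberg"]
# tagged [B-PER, I-PER, O, B-LOC] = [1, 2, 0, 5] in the CoNLL2003 label order.
# Assume the tokenizer splits "Heidelberg" into three subwords.
word_ids = [None, 0, 1, 2, 3, 3, 3, None]  # None marks [CLS] / [SEP]
word_labels = [1, 2, 0, 5]
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 2, 0, 5, -100, -100, -100]
```

Positions labeled `-100` are skipped by PyTorch's cross-entropy loss, so each word contributes exactly one training signal regardless of how many subwords it has.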

## 10. Vision Transfer Learning (Brief)

**Idea:** Start from an ImageNet pre-trained model (e.g., ResNet18), freeze most layers, train a small classifier head for your classes.

In [None]:
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in resnet.parameters():
    p.requires_grad = False
resnet.fc = nn.Linear(resnet.fc.in_features, 5)  # example 5 classes
# Only the new head is trainable: 512 * 5 weights + 5 biases = 2565 parameters
sum(p.numel() for p in resnet.parameters() if p.requires_grad), 'trainable params in head'

## 11. Evaluation & Monitoring

- **Classification:** accuracy, macro-F1, confusion matrix  
- **Generation:** ROUGE/BLEU, human eval, safety filters  
- **NER:** precision/recall/F1 (seqeval)  
- **Monitoring:** drift detection, periodic re-eval, canary prompts

In [None]:
from evaluate import load
# Example: ROUGE for generation
rouge = load("rouge")
preds = ["The cat sat on the mat."]
refs = ["A cat is sitting on a mat."]
rouge.compute(predictions=preds, references=refs)
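
To see what a ROUGE-1 score measures, the unigram overlap for the sentence pair above can be computed by hand. The sketch below is a simplified pure-Python version (lowercased alphanumeric tokens, no stemming); it is meant to illustrate the metric and is not a replacement for the `evaluate` implementation.

```python
import re
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 on lowercased alphanumeric tokens."""
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((pred & ref).values())  # clipped per-token matches
    precision = overlap / max(1, sum(pred.values()))
    recall = overlap / max(1, sum(ref.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = rouge1("The cat sat on the mat.", "A cat is sitting on a mat.")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Only "cat", "on", and "mat" match, so 3 of 6 predicted tokens and 3 of 7 reference tokens overlap. "sat" and "sitting" do not match here because no stemming is applied.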

## 12. Deployment

### Option A — Simple Pipeline

In [None]:
clf = pipeline("text-classification", model=model_full, tokenizer=tok, device=-1)
clf("Breaking: New economic policy announced by central bank")

### Option B — FastAPI Endpoint (skeleton)
> Save as `serve.py` and run with `uvicorn serve:app --reload`

In [None]:
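from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

# Assumption: the finetuned model was saved from the notebook beforehand, e.g. with
# trainer.save_model("ft_full_agnews/best"); the directory name is illustrative.
clf = pipeline("text-classification", model="ft_full_agnews/best", device=-1)

app = FastAPI()

class Inp(BaseModel):
    text: str

@app.post("/predict")
def predict(inp: Inp):
    out = clf(inp.text)[0]
    return {"label": out['label'], "score": float(out['score'])}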

## 13. Cost/Quality Heuristics

- **Low compute:** Adapters / Prefix / LoRA  
- **Many domains:** One base + per-domain adapters  
- **Tiny dataset:** Feature extraction baseline, then PEFT  
- **Latency critical:** Smaller base + quantization; distill later  
- **Regulated domains:** Instruction SFT + guardrails + audits

## 14. Pitfalls & Debugging

| Issue | Symptom | Fix |
|---|---|---|
| Overfitting | train accuracy↑ while val accuracy↓ (val loss↑) | more data, dropout, early stop |
| Catastrophic forgetting | generic ability↓ | reduce LR, PEFT, data mix |
| Tokenization mismatch | errors, poor perf | use original tokenizer |
| Label leakage | unrealistically high val score | fix splits & pipelines |
| Hallucination (LLMs) | incorrect facts | RAG, stricter prompts, eval |
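
The "early stop" fix for overfitting boils down to a small patience rule on the validation loss. The sketch below implements that rule in plain Python on a made-up loss curve; the function name and defaults are assumptions. When training with `Trainer`, the `EarlyStoppingCallback` from Transformers serves the same purpose.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch).

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best, best_epoch, bad = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up curve: validation loss improves for three epochs, then rises.
val_losses = [0.92, 0.61, 0.48, 0.51, 0.55, 0.60]
stop, best = early_stop_epoch(val_losses, patience=2)
print("stop at epoch", stop, "and restore the checkpoint from epoch", best)
```

Here training stops at epoch 4 and the epoch-2 checkpoint is kept, which is the behavior `load_best_model_at_end=True` provides in section 4.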

## 15. Exercises

1. Replace DistilBERT with `bert-base-uncased` and re-run section 4. Compare F1.
2. Change max sequence length to 128 vs 256. Observe speed/accuracy trade-off.
3. Add a confusion matrix for AG News validation.
4. Train an additional adapter named `finance_v2` and compare adapter-only metrics.
5. For LoRA, increase rank `r` from 8 → 16 and measure generation changes.
6. Build a tiny NER dataset from your domain (5–10 examples) and test few-shot finetune.
7. (Vision) Unfreeze the last *two* ResNet layers and fine-tune at a lower LR.