# Lightweight Fine-Tuning Project

* PEFT techniques: LoRA and QLoRA
* Model: BERT (`bert-base-uncased`)
* Evaluation approach: accuracy and macro-averaged F1. Accuracy measures overall correctness, while macro-F1 weights all four classes equally.
* Fine-tuning dataset: AG News. The AG's News Topic Classification dataset consists of four categories from the original corpus: World, Sports, Business, and Sci/Tech. Each category has 30,000 training samples and 1,900 test samples, for a total of 120,000 training samples and 7,600 test samples.
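
For reference, the two metrics can be written as follows. The notation is introduced here for illustration only: $N$ is the number of evaluated samples, $C = 4$ the number of classes, and $\mathrm{TP}_c$, $\mathrm{FP}_c$, $\mathrm{FN}_c$ the true-positive, false-positive, and false-negative counts for class $c$.

$$
\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right],
\qquad
P_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c},
\qquad
R_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}
$$

$$
\mathrm{F1}_c = \frac{2\,P_c\,R_c}{P_c + R_c},
\qquad
\text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{F1}_c
$$

A class with $P_c + R_c = 0$ contributes $\mathrm{F1}_c = 0$, which is the convention the metric code in this notebook follows.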

## Loading and Evaluating a Foundation Model



In [None]:
import os, random, numpy as np, torch, pandas as pd
from collections import Counter
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer, AutoConfig, AutoModelForSequenceClassification,
    DataCollatorWithPadding, TrainingArguments, Trainer, EarlyStoppingCallback,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, AutoPeftModelForSequenceClassification, TaskType

In [None]:
# Load the AG News dataset and print its first three rows
ds = load_dataset("ag_news")
print("Dataset loaded:", ds)

for i in range(3):
    print(f"Row {i}:")
    print("  text :", ds["train"][i]["text"])
    print("  label:", ds["train"][i]["label"])


Dataset loaded: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
Row 0:
  text : Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
  label: 2
Row 1:
  text : Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
  label: 2
Row 2:
  text : Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
  label: 2


In [None]:
# Load the BERT model and tokenizer
model_name = "bert-base-uncased"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # Load the pretrained model for sequence classification
    config = AutoConfig.from_pretrained(model_name, num_labels=4)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print(f"An error occurred while loading the model or tokenizer: {e}")
    raise  # re-raise so later cells do not run with `model` or `tokenizer` undefined


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and tokenizer loaded successfully.


In [None]:
# Hold out 10% of the training set for validation; keep the original test set unchanged
split = ds["train"].train_test_split(
    test_size=0.1,
    seed=42,
    stratify_by_column="label"
)
dataset = DatasetDict({
    "train": split["train"],
    "validation": split["test"],
    "test": ds["test"],
})

In [None]:
# Check class distribution in train, validation, and test sets to ensure they are balanced
def count_labels(split):
    # Parameter renamed so it does not shadow the global `ds` dataset
    return dict(Counter(split["label"]))

print("Train counts    :", count_labels(dataset["train"]))
print("Validation counts:", count_labels(dataset["validation"]))
print("Test counts     :", count_labels(dataset["test"]))

Train counts    : {1: 27000, 3: 27000, 2: 27000, 0: 27000}
Validation counts: {2: 3000, 0: 3000, 3: 3000, 1: 3000}
Test counts     : {2: 1900, 3: 1900, 1: 1900, 0: 1900}


In [None]:
# Tokenize all datasets
def tok_fn(batch):
    return tokenizer(batch["text"], truncation=True)

cols_to_remove = [c for c in dataset["train"].column_names if c != "label"]
ds_tok = dataset.map(tok_fn, batched=True, remove_columns=cols_to_remove)


In [None]:
# Define Evaluation Metrics: Accuracy and Macro-averaged F1 Score
def accuracy_score_np(y_true, y_pred):
    return (y_true == y_pred).mean()

def f1_score_macro_np(y_true, y_pred, num_labels):
    f1s = []
    for label in range(num_labels):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

        if precision + recall == 0:
            f1s.append(0.0)
        else:
            f1s.append(2 * precision * recall / (precision + recall))
    return float(np.mean(f1s))

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score_np(labels, preds)
    macro_f1 = f1_score_macro_np(labels, preds, num_labels=4)
    return {"accuracy": acc, "macro_f1": macro_f1}

In [None]:
# Evaluate BERT on the AG News test set before fine-tuning
collator = DataCollatorWithPadding(tokenizer=tokenizer)

args = TrainingArguments(
    output_dir="./pre_eval_bert_agnews",
    per_device_eval_batch_size=32,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=args,
    eval_dataset=ds_tok["test"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

pre_eval = trainer.evaluate()
print("Pre-finetune evaluation:", pre_eval)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Pre-finetune evaluation: {'eval_loss': 1.459642767906189, 'eval_accuracy': 0.25, 'eval_macro_f1': 0.10053574918182621, 'eval_runtime': 44.5747, 'eval_samples_per_second': 170.5, 'eval_steps_per_second': 5.339}
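
The baseline numbers (accuracy 0.25, macro-F1 of about 0.10) are what a classifier that almost always predicts the same class would score on a balanced four-class test set. The sketch below reproduces that arithmetic in plain Python on synthetic labels. The helper `macro_f1` and the toy predictions are illustrative and are not part of the notebook.

```python
def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    f1s = []
    for label in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != label and t == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1s) / num_labels

# Balanced toy test set: 1,900 examples per class, as in the AG News test split.
y_true = [label for label in range(4) for _ in range(1900)]
# Hypothetical degenerate model that always predicts class 0.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.4f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred, 4):.4f}")
```

Only the predicted class gets a non-zero F1 (precision 0.25, recall 1.0, so F1 = 0.4), and averaging 0.4 with three zeros gives 0.10. The measured macro-F1 is slightly above that, which suggests the untrained head is not perfectly constant, but it is clearly uninformative before fine-tuning.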


## Performing Parameter-Efficient Fine-Tuning



In [None]:
# Configure LoRA for sequence classification
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],
    bias="none",
    task_type=TaskType.SEQ_CLS,
)

lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()

trainable params: 2,684,936 || all params: 112,167,176 || trainable%: 2.393691359404466
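
The trainable-parameter count can be checked by hand. Each LoRA-adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the Hugging Face BERT module names, under which `"dense"` matches the attention output, intermediate, and output projections in each of the 12 layers plus the pooler. It also assumes the classifier head is counted twice (the original plus a trainable copy kept by PEFT). That last point is an inference from the arithmetic rather than something the notebook states.

```python
HIDDEN, INTERMEDIATE, LAYERS, NUM_LABELS = 768, 3072, 12, 4

def lora_params(r, d_in, d_out):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def estimate_trainable(r):
    per_layer = (
        4 * lora_params(r, HIDDEN, HIDDEN)        # query, key, value, attention.output.dense
        + lora_params(r, HIDDEN, INTERMEDIATE)    # intermediate.dense
        + lora_params(r, INTERMEDIATE, HIDDEN)    # output.dense
    )
    pooler = lora_params(r, HIDDEN, HIDDEN)       # pooler.dense
    head = HIDDEN * NUM_LABELS + NUM_LABELS       # classifier weight + bias
    # Assumption: the head appears twice among trainable parameters.
    return LAYERS * per_layer + pooler + 2 * head

for r in (8, 16, 32):
    print(f"r={r:>2}: {estimate_trainable(r):,} trainable parameters")
```

Under these assumptions the estimate for `r=16` equals the 2,684,936 printed above, and the same formula matches the counts reported for `r=8` and `r=32` in the QLoRA grid later in the notebook.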


In [None]:
# Fine-tune BERT with LoRA on the AG News training set
collator = DataCollatorWithPadding(tokenizer=tokenizer)

args = TrainingArguments(
    output_dir="/tmp/your_model_name",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=1e-3,
    weight_decay=0.01,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_steps=50,
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
    fp16=True,
    bf16=False,
    warmup_ratio=0.06,
    report_to=[],
)

callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]

trainer = Trainer(
    model=lora_model,
    args=args,
    train_dataset=ds_tok["train"],
    eval_dataset=ds_tok["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=callbacks,
)

trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Macro F1
200,0.355,0.371929,0.878,0.878003
400,0.4569,0.399037,0.88675,0.886365
600,0.4398,0.358828,0.886,0.885538
800,1.0542,1.463574,0.25075,0.101556
1000,1.4138,1.388844,0.25,0.1


TrainOutput(global_step=1000, training_loss=0.7304897518157959, metrics={'train_runtime': 357.3583, 'train_samples_per_second': 302.218, 'train_steps_per_second': 18.889, 'total_flos': 826308231785472.0, 'train_loss': 0.7304897518157959, 'epoch': 0.15})
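
The log above shows validation macro-F1 peaking at step 400 and collapsing to chance level by step 800, after which training stops at step 1000 (about 0.15 of an epoch). That stopping point is what a patience of 3 evaluations would predict. The sketch below replays the logged values through a simplified patience rule. It illustrates the idea and is not the actual `EarlyStoppingCallback` implementation; `replay_early_stopping` is a hypothetical helper.

```python
def replay_early_stopping(history, patience):
    """Return (best_step, stop_step) for a list of (step, metric) pairs.

    stop_step is None if the patience budget is never exhausted.
    """
    best_step, best_value, bad_evals = None, float("-inf"), 0
    for step, value in history:
        if value > best_value:
            best_step, best_value, bad_evals = step, value, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None

# Validation macro-F1 values copied from the training log above.
logged_macro_f1 = [
    (200, 0.878003),
    (400, 0.886365),
    (600, 0.885538),
    (800, 0.101556),
    (1000, 0.1),
]

best_step, stop_step = replay_early_stopping(logged_macro_f1, patience=3)
print("best checkpoint step:", best_step)
print("training stops at step:", stop_step)
```

Because `load_best_model_at_end=True`, the weights kept after training should be those of the best checkpoint rather than the collapsed final state. The cause of the collapse is not investigated in this notebook.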

In [None]:
ADAPTER_DIR = "/tmp/bert_lora_adapter"
lora_model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)
print("Saved LoRA adapter to:", ADAPTER_DIR)

Saved LoRA adapter to: /tmp/bert_lora_adapter


## Performing Inference with a PEFT Model



In [None]:
# Run inference with the fine-tuned BERT + LoRA (PEFT) model on the AG News test set
ADAPTER_DIR = "/tmp/bert_lora_adapter"
BASE = "bert-base-uncased"
NUM_LABELS = 4
ID2LABEL = {0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"}
LABEL2ID = {"World":0, "Sports":1, "Business":2, "Sci/Tech":3}

tokenizer = AutoTokenizer.from_pretrained(BASE, use_fast=True)
model = AutoPeftModelForSequenceClassification.from_pretrained(
    ADAPTER_DIR,
    num_labels=NUM_LABELS,
    id2label=ID2LABEL,
    label2id=LABEL2ID,
)
model.eval()

collator = DataCollatorWithPadding(tokenizer)

args = TrainingArguments(output_dir="/tmp/eval_peft_bert", per_device_eval_batch_size=32, report_to=[])
trainer = Trainer(
    model=model,
    args=args,
    eval_dataset=ds_tok["test"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

lora_eval = trainer.evaluate()
print("PEFT (bert + LoRA) on AG News test:", lora_eval)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


PEFT (bert + LoRA) on AG News test: {'eval_loss': 0.40790843963623047, 'eval_accuracy': 0.8825, 'eval_macro_f1': 0.8820321664958857, 'eval_runtime': 26.0191, 'eval_samples_per_second': 292.093, 'eval_steps_per_second': 9.147}
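
To classify individual headlines rather than the whole test split, the logits returned by the model still have to be mapped to label names. A minimal sketch of that last step follows, using made-up logits in place of real model output. The function `decode_logits` is hypothetical; only the `ID2LABEL` mapping comes from the cell above.

```python
import math

ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def decode_logits(logits):
    """Return (label, probability) for one row of class logits via a stable softmax."""
    peak = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Made-up logits for a single example; a real run would take them from the model output.
example_logits = [0.1, 2.0, -1.0, 0.3]
label, prob = decode_logits(example_logits)
print(label, round(prob, 3))
```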


In [None]:
# Compare pre-finetune baseline and PEFT (BERT + LoRA) results
def pick_metrics(m):
    return {
        "Loss":      float(m.get("eval_loss", float("nan"))),
        "Accuracy":  float(m.get("eval_accuracy", float("nan"))),
        "Macro-F1":  float(m.get("eval_macro_f1", float("nan"))),
    }
pre_m  = pick_metrics(pre_eval)
peft_m = pick_metrics(lora_eval)
df = pd.DataFrame(
    [pre_m, peft_m],
    index=["Pre-finetune (baseline)", "PEFT (BERT + LoRA)"]
)
df.loc["Δ (PEFT - Base)"] = df.iloc[1] - df.iloc[0]
print(df)

                             Loss  Accuracy  Macro-F1
Pre-finetune (baseline)  1.459643    0.2500  0.100536
PEFT (BERT + LoRA)       0.407908    0.8825  0.882032
Δ (PEFT - Base)         -1.051734    0.6325  0.781496


## QLoRA

In [None]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

SAVE_ROOT = "/tmp/peft_grid_bert_agnews"
os.makedirs(SAVE_ROOT, exist_ok=True)

def pick_optim(model):
    # Use paged_adamw_8bit for quantized (4-bit or 8-bit) models to save GPU memory; otherwise use torch's AdamW
    if getattr(model, "is_loaded_in_4bit", False) or getattr(model, "is_loaded_in_8bit", False):
        return "paged_adamw_8bit"
    return "adamw_torch"

In [None]:
# Tokenization
tokenizer = AutoTokenizer.from_pretrained(BASE, use_fast=True)

def tok_fn(b):
    return tokenizer(b["text"], truncation=True, max_length=256)

cols_to_remove = [c for c in dataset["train"].column_names if c != "label"]
ds_tok = dataset.map(tok_fn, batched=True, remove_columns=cols_to_remove)
collator = DataCollatorWithPadding(tokenizer)


In [None]:
# Model loader with fallbacks
def load_model_with_fallback():
    cfg = AutoConfig.from_pretrained(BASE, num_labels=4, id2label=ID2LABEL, label2id=LABEL2ID)
    # Try 4-bit (QLoRA)
    try:
        bnb4 = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.float16,   # T4: fp16 compute
        )
        print("[INFO] Try 4-bit NF4...")
        return AutoModelForSequenceClassification.from_pretrained(
            BASE, config=cfg, quantization_config=bnb4, device_map={"":0}, low_cpu_mem_usage=True
        )
    except Exception as e:
        print("[WARN] 4-bit failed:", repr(e))
    # Fallback 8-bit
    try:
        bnb8 = BitsAndBytesConfig(load_in_8bit=True)
        print("[INFO] Try 8-bit...")
        return AutoModelForSequenceClassification.from_pretrained(
            BASE, config=cfg, quantization_config=bnb8, device_map={"":0}, low_cpu_mem_usage=True
        )
    except Exception as e2:
        print("[WARN] 8-bit failed:", repr(e2))
    # Fallback full precision (GPU if available)
    print("[INFO] Fallback to full-precision.")
    model = AutoModelForSequenceClassification.from_pretrained(BASE, config=cfg)
    if torch.cuda.is_available():  # `use_cuda` was never defined in this notebook
        model = model.to("cuda")
    return model
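
The fallback order above trades memory for precision. A rough, idealized estimate of the weight storage at each precision is sketched below, using the frozen-parameter count implied by the LoRA printout earlier in the notebook (all parameters minus trainable parameters). It assumes every frozen weight is stored at the stated bit width. In practice bitsandbytes quantizes the linear layers only and stores extra quantization constants, so the real 4-bit and 8-bit footprints are larger than these figures.

```python
# All params minus trainable params, taken from the r=16 LoRA printout.
FROZEN_PARAMS = 112_167_176 - 2_684_936

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "nf4": 4}

def weight_megabytes(num_params, bits):
    """Idealized storage in MB (10^6 bytes) if every weight used `bits` bits."""
    return num_params * bits / 8 / 1e6

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>4}: {weight_megabytes(FROZEN_PARAMS, bits):7.1f} MB")
```

The fp32 figure is close to the size of the downloaded `model.safetensors` checkpoint, which is a useful sanity check on the parameter count.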

In [None]:
# PEFT grid (name, r, alpha, dropout, lr)
PEFT_GRID = [
    ("lora_r8_a16_d05",   8, 16, 0.05, 1e-3),
    ("lora_r16_a32_d10", 16, 32, 0.10, 1e-3),
    ("lora_r32_a32_d10", 32, 32, 0.10, 8e-4),
]

def count_params(m, trainable=False):
    return int(sum(p.numel() for p in m.parameters() if (p.requires_grad if trainable else True)))
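
In standard LoRA the low-rank update is multiplied by `lora_alpha / r`, so the grid varies the effective scale as well as the rank. The check below computes the scale each configuration implies. It assumes PEFT's default scaling, since the notebook does not enable the rank-stabilized variant.

```python
# Same grid as in the notebook: (name, r, alpha, dropout, lr)
PEFT_GRID = [
    ("lora_r8_a16_d05",   8, 16, 0.05, 1e-3),
    ("lora_r16_a32_d10", 16, 32, 0.10, 1e-3),
    ("lora_r32_a32_d10", 32, 32, 0.10, 8e-4),
]

# Effective LoRA scale per run under the default alpha / r rule.
scales = {name: alpha / r for name, r, alpha, _dropout, _lr in PEFT_GRID}

for name, scale in scales.items():
    print(f"{name}: alpha / r = {scale:.1f}")
```

The `r=32` run combines a higher rank with half the scale and a lower learning rate, so the grid does not isolate the effect of rank on its own.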

In [None]:
# Run experiments
results = []

for name, r, alpha, dr, lr in PEFT_GRID:
    print(f"\n==== Experiment: {name} (r={r}, alpha={alpha}, dropout={dr}, lr={lr}) ====")
    base_model = load_model_with_fallback()

    if hasattr(base_model, "gradient_checkpointing_enable"):
        base_model.gradient_checkpointing_enable()

    # LoRA
    lcfg = LoraConfig(
        r=r, lora_alpha=alpha, lora_dropout=dr,
        target_modules=["query","key","value","dense"],
        bias="none", task_type=TaskType.SEQ_CLS,
    )
    model = get_peft_model(base_model, lcfg)

    # Force trainable parameters & classification head to FP32 (for stability)
    for n, p in model.named_parameters():
        if p.requires_grad and p.dtype != torch.float32:
            p.data = p.data.float()
    if hasattr(model, "model") and hasattr(model.model, "classifier"):
        model.model.classifier = model.model.classifier.float()

    model.print_trainable_parameters()

    args = TrainingArguments(
        output_dir=f"{SAVE_ROOT}/{name}",
        num_train_epochs=2,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        learning_rate=lr,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="macro_f1",
        greater_is_better=True,
        fp16=True,
        bf16=False,
        logging_steps=100,
        save_total_limit=2,
        report_to=[],
        optim=pick_optim(base_model),
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_tok["train"],
        eval_dataset=ds_tok["validation"],
        tokenizer=tokenizer,
        data_collator=collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()

    # Evaluation: validation & test
    val_metrics  = trainer.evaluate(ds_tok["validation"])
    test_metrics = trainer.evaluate(ds_tok["test"])

    # Save the LoRA adapter (lightweight)
    adapter_dir = f"{SAVE_ROOT}/{name}/adapter"
    model.save_pretrained(adapter_dir)
    tokenizer.save_pretrained(adapter_dir)

    row = {
        "run": name, "r": r, "alpha": alpha, "dropout": dr, "lr": lr,
        "val_acc":  float(val_metrics.get("eval_accuracy", np.nan)),
        "val_f1":   float(val_metrics.get("eval_macro_f1", np.nan)),
        "test_acc": float(test_metrics.get("eval_accuracy", np.nan)),
        "test_f1":  float(test_metrics.get("eval_macro_f1", np.nan)),
        "trainable_params": count_params(model, trainable=True),
        "total_params":     count_params(model, trainable=False),
        "adapter_dir": adapter_dir,
    }
    results.append(row)
    print("Saved adapter to:", adapter_dir)


==== Experiment: lora_r8_a16_d05 (r=8, alpha=16, dropout=0.05, lr=0.001) ====
[INFO] Try 4-bit NF4...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,345,544 || all params: 110,827,784 || trainable%: 1.2140854499084814


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,Macro F1
1,0.3582,0.31248,0.893583,0.893373
2,0.2868,0.280433,0.900167,0.900119




Saved adapter to: /tmp/peft_grid_bert_agnews/lora_r8_a16_d05/adapter

==== Experiment: lora_r16_a32_d10 (r=16, alpha=32, dropout=0.1, lr=0.001) ====
[INFO] Try 4-bit NF4...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 2,684,936 || all params: 112,167,176 || trainable%: 2.393691359404466




Epoch,Training Loss,Validation Loss,Accuracy,Macro F1
1,0.3653,0.305116,0.895417,0.895327
2,0.2905,0.275519,0.899833,0.899802




Saved adapter to: /tmp/peft_grid_bert_agnews/lora_r16_a32_d10/adapter

==== Experiment: lora_r32_a32_d10 (r=32, alpha=32, dropout=0.1, lr=0.0008) ====
[INFO] Try 4-bit NF4...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 5,363,720 || all params: 114,845,960 || trainable%: 4.670360193776081




Epoch,Training Loss,Validation Loss,Accuracy,Macro F1
1,0.3633,0.30322,0.89475,0.894549
2,0.2955,0.281204,0.898583,0.89854




Saved adapter to: /tmp/peft_grid_bert_agnews/lora_r32_a32_d10/adapter


In [None]:
# Summary table
df = pd.DataFrame(results).sort_values(["val_f1","test_f1"], ascending=False).reset_index(drop=True)
display(df[["run","r","alpha","dropout","lr","val_acc","val_f1","test_acc","test_f1","trainable_params","adapter_dir"]])
print("\nBest by validation macro-F1 ->", df.iloc[0]["run"], "\nAdapter path:", df.iloc[0]["adapter_dir"])

| | run | r | alpha | dropout | lr | val_acc | val_f1 | test_acc | test_f1 | trainable_params | adapter_dir |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lora_r8_a16_d05 | 8 | 16 | 0.05 | 0.001 | 0.900167 | 0.900119 | 0.898026 | 0.897948 | 1345544 | /tmp/peft_grid_bert_agnews/lora_r8_a16_d05/ada... |
| 1 | lora_r16_a32_d10 | 16 | 32 | 0.1 | 0.001 | 0.899833 | 0.899802 | 0.898289 | 0.898238 | 2684936 | /tmp/peft_grid_bert_agnews/lora_r16_a32_d10/ad... |
| 2 | lora_r32_a32_d10 | 32 | 32 | 0.1 | 0.0008 | 0.898583 | 0.89854 | 0.8975 | 0.897415 | 5363720 | /tmp/peft_grid_bert_agnews/lora_r32_a32_d10/ad... |



Best by validation macro-F1 -> lora_r8_a16_d05 
Adapter path: /tmp/peft_grid_bert_agnews/lora_r8_a16_d05/adapter
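
The three runs finish within about a tenth of a percentage point of each other on the test set, so it is worth asking whether the ranking is meaningful. One rough way to judge this is the binomial standard error of an accuracy estimate on 7,600 test examples. The sketch below computes it from the test accuracies in the table above. It assumes test examples are independent and ignores run-to-run variance from seeds, so it should be read as a lower bound on the uncertainty.

```python
import math

TEST_SIZE = 7600

# Test accuracies copied from the summary table above.
test_acc = {
    "lora_r8_a16_d05": 0.898026,
    "lora_r16_a32_d10": 0.898289,
    "lora_r32_a32_d10": 0.8975,
}

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

spread = max(test_acc.values()) - min(test_acc.values())
se = binomial_se(max(test_acc.values()), TEST_SIZE)

print(f"spread between runs   : {spread:.4f}")
print(f"approx. standard error: {se:.4f}")
```

The spread between runs is smaller than one standard error, so under these assumptions the results do not separate the three configurations. The smallest adapter (`r=8`) reaches comparable quality with the fewest trainable parameters.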
