<a href="https://colab.research.google.com/github/yilinmiao/LightweightFineTuning/blob/main/LightweightFineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Fine-Tuning Project

Choices made for this project:

* PEFT technique: Low-Rank Adaptation (LoRA)
* Model: GPT-2 (gpt2)
* Evaluation approach: Accuracy metric with Hugging Face's Trainer
* Fine-tuning dataset: Stanford Sentiment Treebank (SST-2)

## Loading and Evaluating a Foundation Model

First, we load the pre-trained GPT-2 model, its tokenizer, and the SST-2 dataset, then measure the model's accuracy before any fine-tuning.

In [2]:
# Install required packages if needed
# !pip install -q transformers datasets evaluate peft torch accelerate


In [3]:
# Import required libraries
import numpy as np
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
import evaluate

In [4]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [5]:
# Load SST-2 dataset
dataset = load_dataset("glue", "sst2")
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})


In [45]:
# Use 10% of the training data (about 6.7K samples) so training stays fast
# but still has enough examples
train_size = len(dataset["train"]) // 10
eval_size = min(1000, len(dataset["validation"]))  # Up to 1000 samples for evaluation

In [48]:
# Take smaller subsets for faster training and evaluation
train_dataset = dataset["train"].select(range(train_size))
eval_dataset = dataset["validation"].select(range(eval_size))

print(f"Training dataset size: {len(train_dataset)}")
print(f"Evaluation dataset size: {len(eval_dataset)}")

Training dataset size: 6734
Evaluation dataset size: 872
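The two sizes printed above follow directly from the split sizes in the `DatasetDict` printout. A minimal sketch of the arithmetic, using only the row counts reported by this notebook:

```python
# Row counts taken from the DatasetDict printout in this notebook.
full_train_rows = 67349
full_validation_rows = 872

train_size = full_train_rows // 10            # floor division drops the remainder
eval_size = min(1000, full_validation_rows)   # the cap of 1000 is never reached

print(train_size, eval_size)  # 6734 872
```

Note that `select(range(train_size))` takes the first rows in stored order. If a random sample is preferred, `dataset["train"].shuffle(seed=42).select(range(train_size))` is one option; the seed value here is an arbitrary assumption.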


In [7]:
# Load model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default


In [50]:
# Load pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Binary classification (positive/negative)
    pad_token_id=tokenizer.eos_token_id,  # Set pad_token_id to match tokenizer
    # Single-label classification, so the model computes cross-entropy loss over the two labels
    problem_type="single_label_classification",
    return_dict=True
)
model.config.pad_token_id = tokenizer.eos_token_id
model.to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)
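The layer shapes in the printout above are enough to account for every parameter in the model. The sketch below rebuilds the count by hand, assuming each `Conv1D` layer carries a bias vector and each `LayerNorm` has a scale and a shift; the variable and helper names are illustrative only. The result agrees with the 124,441,344 trainable parameters this notebook reports for `gpt2`.

```python
# Shapes taken from the GPT2ForSequenceClassification printout.
vocab, positions, d_model, d_ff, n_layers, n_labels = 50257, 1024, 768, 3072, 12, 2


def dense(n_in, n_out):
    """Weight matrix plus bias vector (assumed for GPT-2's Conv1D layers)."""
    return n_in * n_out + n_out


layer_norm = 2 * d_model  # scale and shift vectors

per_block = (
    layer_norm                      # ln_1
    + dense(d_model, 3 * d_model)   # attn.c_attn (fused query/key/value)
    + dense(d_model, d_model)       # attn.c_proj
    + layer_norm                    # ln_2
    + dense(d_model, d_ff)          # mlp.c_fc
    + dense(d_ff, d_model)          # mlp.c_proj
)

total = (
    vocab * d_model          # wte (token embeddings)
    + positions * d_model    # wpe (position embeddings)
    + n_layers * per_block   # 12 transformer blocks
    + layer_norm             # ln_f
    + d_model * n_labels     # score head, bias=False
)

print(f"{total:,}")  # 124,441,344
```

Only the 1,536 weights of the `score` head are new; everything else comes from the pre-trained checkpoint.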

In [51]:
# Print model size
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model: {model_name}")
print(f"Number of trainable parameters: {num_params:,}")
print(f"Model config:\n{model.config}")

Model: gpt2
Number of trainable parameters: 124,441,344
Model config:
GPT2Config {
  "_attn_implementation_autoset": true,
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 50256,
  "problem_type": "single_label_classification",
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "tran

In [52]:
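# Define tokenization function
# Note: padding="max_length" pads every example to 128 tokens up front, so the
# DataCollatorWithPadding used later has no further padding to do.
def tokenize_function(examples):
    return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)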

In [53]:
# Tokenize datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/6734 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

In [54]:
# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
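`DataCollatorWithPadding` pads each batch to the length of its longest sequence. The sketch below imitates that behavior in plain Python to show what dynamic padding produces; `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`, and the token ids are made up. Because `tokenize_function` in this notebook already pads to `max_length=128`, every sequence reaches the collator at the same length.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# 50256 is GPT-2's EOS id, which this notebook reuses as the pad token.
batch = pad_batch([[11, 22, 33], [44]], pad_id=50256)

print(batch["input_ids"])       # [[11, 22, 33], [44, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Dropping `padding="max_length"` from the tokenizer call would let the collator do this per batch, which usually means shorter tensors for a dataset of short sentences like SST-2.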

In [55]:
# Define compute metrics function for evaluation
accuracy_metric = evaluate.load("accuracy")

In [56]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

In [58]:
# Set up trainer
training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=16,
    do_train=False,
    do_eval=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [59]:
# Evaluate the model before fine-tuning
print("Evaluating the model before fine-tuning...")
base_model_metrics = trainer.evaluate()
print(f"Base model metrics: {base_model_metrics}")


Evaluating the model before fine-tuning...


Base model metrics: {'eval_loss': 3.072819948196411, 'eval_model_preparation_time': 0.0038, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 7.0144, 'eval_samples_per_second': 124.316, 'eval_steps_per_second': 7.841}


## Performing Parameter-Efficient Fine-Tuning

Next, we wrap the model with LoRA adapters using PEFT, train it on the SST-2 subset, and save the resulting adapter weights.

In [61]:
# Import PEFT library components
from peft import LoraConfig, get_peft_model, TaskType

In [62]:
# Create LoRA configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence classification task
    r=8,                         # Rank of LoRA matrices
    lora_alpha=32,               # Alpha parameter for LoRA scaling
    lora_dropout=0.1,            # Dropout probability for LoRA layers
    bias="none",                 # Don't adapt bias terms
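    # "c_attn" is the fused query/key/value projection; "c_proj" matches both the
    # attention and the MLP output projections in each GPT-2 block
    target_modules=["c_attn", "c_proj"],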
)
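The number of parameters this configuration trains can be worked out by hand. LoRA adds two matrices to each targeted layer: `A` with shape `r x in_features` and `B` with shape `out_features x r`. The sketch below assumes that `"c_proj"` matches both `attn.c_proj` and `mlp.c_proj`, and that the `score` classification head stays trainable for a `SEQ_CLS` task. The total reproduces the 812,544 trainable parameters that `print_trainable_parameters()` reports in this notebook, which supports both assumptions.

```python
r, lora_alpha = 8, 32
d_model, d_ff, n_layers, n_labels = 768, 3072, 12, 2


def lora_params(n_in, n_out, rank=r):
    """A is (rank x n_in), B is (n_out x rank); neither has a bias."""
    return rank * (n_in + n_out)


per_block = (
    lora_params(d_model, 3 * d_model)  # attn.c_attn
    + lora_params(d_model, d_model)    # attn.c_proj
    + lora_params(d_ff, d_model)       # mlp.c_proj (also matched by "c_proj")
)

lora_total = n_layers * per_block
score_head = d_model * n_labels        # classification head, trained in full
trainable = lora_total + score_head

print(f"{trainable:,}")                # 812,544
print(lora_alpha / r)                  # 4.0, the scaling applied to the LoRA update
```

Against the 124,441,344 parameters of the original model, that is roughly 0.65% of the weights.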

In [63]:
# Create PEFT model
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()
peft_model.to(device)

trainable params: 812,544 || all params: 125,253,888 || trainable%: 0.6487




PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D(nf=2304, nx=768)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B):

In [64]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./peft_results",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=2,
    report_to="none",
)



In [65]:
# Initialize Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [66]:
# Train the model
print("Training the PEFT model...")
trainer.train()

Training the PEFT model...


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.378292 | 0.837156 |
| 2 | 0.604500 | 0.402312 | 0.838303 |
| 3 | 0.356400 | 0.307766 | 0.881881 |


TrainOutput(global_step=1263, training_loss=0.4505011467167242, metrics={'train_runtime': 427.7532, 'train_samples_per_second': 47.228, 'train_steps_per_second': 2.953, 'total_flos': 1332285969530880.0, 'train_loss': 0.4505011467167242, 'epoch': 3.0})
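The `global_step=1263` in the output above follows from the training arguments. A minimal sketch of the arithmetic, assuming a single GPU so that the per-device batch size is the whole batch:

```python
import math

train_samples = 6734
per_device_batch = 8
grad_accum = 2
epochs = 3

batches_per_epoch = math.ceil(train_samples / per_device_batch)  # 842 forward/backward passes
updates_per_epoch = batches_per_epoch // grad_accum              # 421 optimizer steps
total_updates = updates_per_epoch * epochs                       # 1263 optimizer steps
effective_batch = per_device_batch * grad_accum                  # 16 samples per update

print(batches_per_epoch, updates_per_epoch, total_updates, effective_batch)
# 842 421 1263 16
```

This also likely explains the "No log" entry for epoch 1: `Trainer` logs the training loss every `logging_steps` steps (500 by default), and the first epoch ends at step 421.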

In [67]:
# Evaluate the fine-tuned model
print("Evaluating the fine-tuned model...")
peft_metrics = trainer.evaluate()
print(f"PEFT model metrics: {peft_metrics}")

Evaluating the fine-tuned model...


PEFT model metrics: {'eval_loss': 0.30776599049568176, 'eval_accuracy': 0.8818807339449541, 'eval_runtime': 7.4434, 'eval_samples_per_second': 117.15, 'eval_steps_per_second': 7.389, 'epoch': 3.0}
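Accuracy figures are easier to compare as counts of correctly classified sentences. The sketch below converts the two `eval_accuracy` values reported in this notebook, using the 872-sentence evaluation set:

```python
eval_samples = 872
base_accuracy = 0.5091743119266054   # before fine-tuning
peft_accuracy = 0.8818807339449541   # after LoRA fine-tuning

base_correct = round(base_accuracy * eval_samples)
peft_correct = round(peft_accuracy * eval_samples)

print(base_correct, peft_correct)              # 444 769
print(peft_correct - base_correct)             # 325
print(f"{peft_accuracy - base_accuracy:.4f}")  # 0.3727
```

The untrained classification head is close to chance on this binary task, and fine-tuning gets 325 more of the 872 sentences right.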


In [68]:
# Save the PEFT model
peft_model.save_pretrained("./peft_gpt2_sst2")
print("PEFT model saved to ./peft_gpt2_sst2")

PEFT model saved to ./peft_gpt2_sst2


## Performing Inference with a PEFT Model

Finally, we load the saved PEFT adapter weights and compare the fine-tuned model's performance with that of the original model.

In [69]:
# Load the base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    pad_token_id=tokenizer.eos_token_id
).to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [70]:
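# Load the PEFT model
from peft import PeftModel, PeftConfig

peft_model_path = "./peft_gpt2_sst2"
config = PeftConfig.from_pretrained(peft_model_path)  # Adapter settings saved with the weights

# PeftModel.from_pretrained injects the adapter into the model it receives, so wrap a
# separate copy of GPT-2 and leave `base_model` unmodified for the comparison.
peft_base = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, pad_token_id=tokenizer.eos_token_id
)
peft_model_loaded = PeftModel.from_pretrained(peft_base, peft_model_path).to(device)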

In [78]:
# Function to run inference on both models with the same inputs
def compare_predictions(base_model, peft_model, tokenizer, sample_texts):
    """Compare predictions from base and PEFT models on sample texts."""
    for text in sample_texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)

        # Get base model prediction
        with torch.no_grad():
            base_outputs = base_model(**inputs)
            base_logits = base_outputs.logits
            base_pred = torch.softmax(base_logits, dim=1).tolist()[0]

        # Get PEFT model prediction
        with torch.no_grad():
            peft_outputs = peft_model(**inputs)
            peft_logits = peft_outputs.logits
            peft_pred = torch.softmax(peft_logits, dim=1).tolist()[0]

        # Format results
        print(f"Text: {text}")
        print(f"Base model prediction - Negative: {base_pred[0]:.4f}, Positive: {base_pred[1]:.4f}")
        print(f"PEFT model prediction - Negative: {peft_pred[0]:.4f}, Positive: {peft_pred[1]:.4f}\n")
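`compare_predictions` turns each model's two logits into probabilities with a softmax. A minimal pure-Python sketch of that step, using hypothetical logits rather than real model output:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in [negative, positive] order, as in SST-2's label ids.
probs = softmax([-1.0, 2.0])

print(f"Negative: {probs[0]:.4f}, Positive: {probs[1]:.4f}")
# Negative: 0.0474, Positive: 0.9526
```

Only the difference between the two logits matters: a gap of 3 in favor of the positive class gives about 95% confidence.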

In [79]:
# Sample texts for inference
sample_texts = [
    "This movie was fantastic! I really enjoyed it.",
    "The acting was terrible and the plot made no sense.",
    "It was an average film, neither great nor terrible.",
    "The cinematography was beautiful, but the story was weak."
]
# Compare predictions
compare_predictions(base_model, peft_model_loaded, tokenizer, sample_texts)

In [None]:
# Set up trainers for both models to evaluate on the validation subset
# (the GLUE SST-2 test split does not ship with usable labels)
base_trainer = Trainer(
    model=base_model,
    args=TrainingArguments(
        output_dir="./base_eval",
        per_device_eval_batch_size=16,
        do_train=False,
        do_eval=True,
        report_to="none",
    ),
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

peft_trainer = Trainer(
    model=peft_model_loaded,
    args=TrainingArguments(
        output_dir="./peft_eval",
        per_device_eval_batch_size=16,
        do_train=False,
        do_eval=True,
        report_to="none",
    ),
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# Evaluate both models
print("Evaluating base model...")
base_metrics = base_trainer.evaluate()

print("Evaluating PEFT model...")
peft_metrics = peft_trainer.evaluate()

In [None]:
# Compare metrics
print("\nPerformance Comparison:")
print(f"Base model accuracy: {base_metrics['eval_accuracy']:.4f}")
print(f"PEFT model accuracy: {peft_metrics['eval_accuracy']:.4f}")
print(f"Improvement: {peft_metrics['eval_accuracy'] - base_metrics['eval_accuracy']:.4f}")

In [None]:
# Print PEFT parameter efficiency
# Adapters loaded for inference are frozen by default, so requires_grad cannot be used
# to count them; identify adapter weights by the names PEFT gives them instead.
def is_adapter_param(name):
    return "lora_" in name or "modules_to_save" in name

base_params = sum(p.numel() for n, p in base_model.named_parameters() if not is_adapter_param(n))
peft_trainable_params = sum(p.numel() for n, p in peft_model_loaded.named_parameters() if is_adapter_param(n))

print("\nParameter Efficiency:")
print(f"Base model parameters: {base_params:,}")
print(f"PEFT adapter parameters (trained during fine-tuning): {peft_trainable_params:,}")
print(f"Parameter reduction: {peft_trainable_params / base_params:.2%} of original")