# Lightweight Fine-Tuning Project

The choices made for each part of this project are:

* PEFT technique: LoRA
* Model: GPT-2
* Evaluation approach: Hugging Face `Trainer`
* Fine-tuning dataset: `stanfordnlp/imdb` (500-example samples of the train and test splits)

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [2]:
from datasets import load_dataset


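splits = ["train", "test"]
# Passing a list of splits returns the datasets in the same order.
ds = {split: dataset for split, dataset in zip(splits, load_dataset("stanfordnlp/imdb", split=splits))}

# Keep a fixed-seed random sample of 500 examples per split so runs stay fast.
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(500))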

ds

Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 471912.76 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 639613.51 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 591842.37 examples/s]


{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; reuse the EOS token,
# otherwise tokenizing with padding raises an error.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples."""
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)


# Remove the original columns ('text', 'label') after tokenization.
# The sentiment 'label' is not a language-modeling target, and a leftover column
# that does not match a model argument can cause an error in the Trainer.
for split in splits:
    tokenized_ds[split] = tokenized_ds[split].remove_columns(['text', 'label'])

tokenized_ds

Map: 100%|██████████| 500/500 [00:00<00:00, 3673.90 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 3262.10 examples/s]


{'train': Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 500
 })}

In [4]:
from transformers import AutoModelForCausalLM

foundation_model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    # Optional: uncomment to load the model in 8-bit and reduce memory usage
    # (requires the bitsandbytes package).
    # load_in_8bit=True,
    device_map="auto",  # Automatically map model layers to available devices (CPU/GPU)
)

In [None]:
import numpy as np
from transformers import DataCollatorWithPadding, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Default behavior: Trainer.evaluate() automatically calculates the evaluation
# loss (eval_loss). For a causal language model like GPT-2, this loss is the
# primary metric: it measures how well the model predicts the next token.
# When to use compute_metrics: provide a compute_metrics function to calculate
# metrics other than loss, such as accuracy, precision, recall, or F1 score.
# This is most common for classification tasks (e.g., with
# AutoModelForSequenceClassification), where the model makes a distinct
# prediction that can be compared directly to a true label.
# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     predictions = np.argmax(predictions, axis=1)
#     return {"accuracy": (predictions == labels).mean()}

# mlm=False indicates that we are doing Causal Language Modeling (next token prediction), not Masked Language Modeling.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

foundation_trainer = Trainer(
    model=foundation_model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    #     compute_metrics=compute_metrics,
)
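
With `mlm=False`, the collator derives next-token labels from the inputs themselves. The following is a minimal pure-Python sketch of that behavior, assuming the labels are a copy of `input_ids` with padding positions set to `-100` so the loss skips them. `make_causal_lm_labels` is a hypothetical helper written for illustration, not part of `transformers`, and the token ids are arbitrary.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking padding positions (illustrative helper)."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]


# GPT-2's EOS token id (50256) doubles as the padding token in this notebook.
example_ids = [15496, 11, 995, 50256, 50256]
print(make_causal_lm_labels(example_ids, pad_token_id=50256))
```

Because the padding token here is the EOS token, any genuine EOS tokens in the text would be masked in the same way. The shift by one position for next-token prediction happens inside the model, not in the labels.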

In [None]:
# evaluate using trainer
baseline_eval_results = foundation_trainer.evaluate()

baseline_eval_results
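
The `eval_loss` returned by `evaluate()` is a mean cross-entropy, which is often easier to interpret as perplexity, roughly `exp(eval_loss)`. A minimal sketch, assuming the results dict contains an `eval_loss` key; `perplexity_from_eval` is a hypothetical helper and the loss value is made up for illustration.

```python
import math


def perplexity_from_eval(eval_results):
    """Convert a Trainer-style results dict into perplexity (illustrative helper)."""
    return math.exp(eval_results["eval_loss"])


# Made-up loss value purely for illustration, not a measured result.
print(perplexity_from_eval({"eval_loss": 2.0}))
```

In this notebook it could be called as `perplexity_from_eval(baseline_eval_results)`. The value is approximate, since the loss is averaged per batch rather than per token across the whole dataset.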

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [None]:
from peft import LoraConfig

# Define LoRA configuration. It's crucial to set the task_type.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    #     target_modules=["c_attn"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",  # GPT-2 is a causal language model, so PEFT wraps it for causal LM training.
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
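
As a rough picture of what this configuration controls: LoRA keeps the pretrained weight matrix frozen and adds a trainable low-rank update scaled by `lora_alpha / r`, which is 16 / 8 = 2 for the values above. The sketch below is pure Python, ignores dropout, and uses tiny made-up matrices. `matvec` and `lora_forward` are hypothetical helpers, not PEFT functions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x), with W frozen and A, B trainable."""
    scaling = lora_alpha / r
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path: A is r x in, B is out x r
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]  # toy 2 x 2 "pretrained" weight
A = [[1, 1]]          # r = 1 down-projection
B = [[1], [0]]        # r = 1 up-projection
print(lora_forward(W, A, B, x=[2, 3], lora_alpha=2, r=1))
```

When `B` is all zeros, which is how LoRA adapters are commonly initialized, the update vanishes and the output equals the base model's output.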

In [None]:
from peft import get_peft_model

lora_model = get_peft_model(model, config)

lora_model.print_trainable_parameters()
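
The count printed above can be sanity-checked by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains `r * (in_features + out_features)` trainable parameters. The sketch below assumes the adapter targets GPT-2's fused attention projection `c_attn` (768 to 2304) in each of the 12 blocks. That target is an assumption, since `target_modules` is commented out in the config, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Trainable parameters added by one LoRA adapter: A is r x in, B is out x r."""
    return r * in_features + out_features * r


# Assumed GPT-2 small shapes: hidden size 768, fused QKV projection 3 * 768, 12 blocks.
per_layer = lora_param_count(in_features=768, out_features=3 * 768, r=8)
total = 12 * per_layer
print(per_layer, total)
```

If the number reported by `print_trainable_parameters()` differs from this total, the adapter is probably targeting a different set of modules.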

In [None]:
# Optional: save a timestamped checkpoint of the model weights.
import os
import time
import torch

checkpoint_dir = "/workspace/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save under a sortable, timestamped name so earlier checkpoints are not overwritten
checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_{int(time.time())}.pth")
torch.save(lora_model.state_dict(), checkpoint_path)

# Keep only the 3 most recent checkpoints
checkpoints = sorted(os.listdir(checkpoint_dir), reverse=True)
for old_checkpoint in checkpoints[3:]:
    os.remove(os.path.join(checkpoint_dir, old_checkpoint))  # Delete older checkpoints

In [None]:
lora_trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data/lora_analysis",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        num_train_epochs=2,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    #     compute_metrics=compute_metrics,
)

lora_trainer.train()

###  ⚠️ IMPORTANT ⚠️

Workspace storage is limited, so do not store the model weights in the workspace directory. Always save them under `/tmp`; filling the workspace can cause crashes that are irrecoverable.

In [None]:
# save_pretrained on a PEFT model saves only the trained LoRA adapter weights.
lora_model.save_pretrained("/tmp/yyan-peft-lora-gpt2")

# Save the tokenizer into the same directory as well. AutoPeftModelForCausalLM
# loads the base model and applies the adapter on top, but the adapter
# directory contains no tokenizer files unless they are saved explicitly.
tokenizer.save_pretrained("/tmp/yyan-peft-lora-gpt2")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [None]:
from peft import AutoPeftModelForCausalLM

reloaded_model = AutoPeftModelForCausalLM.from_pretrained("/tmp/yyan-peft-lora-gpt2")

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/tmp/yyan-peft-lora-gpt2")
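inputs = tokenizer("Hello, my name is ", return_tensors="pt")
# Pass the attention mask along with input_ids, and set pad_token_id explicitly
# because GPT-2 does not define one by default.
outputs = reloaded_model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs))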

In [None]:
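# lora_trainer still wraps the in-memory LoRA model, whose adapter weights
# are the ones saved to /tmp above.
fine_tuned_performance = lora_trainer.evaluate()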
print("Original Model:", baseline_eval_results)
print("Fine-Tuned Model:", fine_tuned_performance)
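
Printing the two result dicts side by side leaves the comparison to the reader. A small sketch of one way to summarize it, assuming both dicts contain an `eval_loss` key and were computed on the same evaluation set; `compare_eval_loss` is a hypothetical helper and the numbers are made up for illustration.

```python
import math


def compare_eval_loss(baseline, fine_tuned):
    """Summarize the change in eval_loss between two Trainer-style result dicts."""
    delta = fine_tuned["eval_loss"] - baseline["eval_loss"]
    return {
        "loss_delta": delta,                  # negative means the fine-tuned model is better
        "perplexity_ratio": math.exp(delta),  # below 1.0 means lower perplexity
    }


# Made-up numbers purely for illustration, not measured results.
print(compare_eval_loss({"eval_loss": 3.0}, {"eval_loss": 2.5}))
```

In this notebook the call would be `compare_eval_loss(baseline_eval_results, fine_tuned_performance)`.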

In [None]:
os.listdir("/tmp/yyan-peft-lora-gpt2/")