# Lightweight Fine-Tuning Project

# Table of Contents
1. [Loading and Evaluating a Foundation Model](#lefm)
2. [Performing PEFT](#ppeft)
3. [Performing Inference with a PEFT Model](#piwpeft)

TODO: In this cell, describe your choices for each of the following

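* PEFT technique: LoRA (rank 16, applied to GPT-2's `c_attn`, `c_proj`, and `c_fc` layers)
* Model: GPT-2 (`gpt2`) with a sequence-classification head
* Evaluation approach: Hugging Face `Trainer`, reporting evaluation loss and accuracy
* Fine-tuning dataset: `stanfordnlp/imdb`, using 1,000 shuffled examples each from the train and test splits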

## Loading and Evaluating a Foundation Model
<a id="lefm"></a>

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

Load the `stanfordnlp/imdb` text-classification dataset with its train and test splits, then shuffle each split and keep 1,000 examples.

In [2]:
from datasets import load_dataset


splits = ["train", "test"]
ds = {split: ds for split, ds in zip(splits, load_dataset("stanfordnlp/imdb", split=splits))}

for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(1000))

ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 })}

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 ships without a padding token. Reuse the EOS token for padding,
# otherwise the tokenizer raises an error when asked to pad the inputs.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples."""
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)

    
# Rename 'label' to 'labels', the column name the Trainer uses to compute the loss,
# and drop the original 'text' column, which is redundant after tokenization.
for split in splits:
    tokenized_ds[split] = tokenized_ds[split].rename_column("label", "labels")
    tokenized_ds[split] = tokenized_ds[split].remove_columns(["text"])

tokenized_ds

{'train': Dataset({
     features: ['labels', 'input_ids', 'attention_mask'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['labels', 'input_ids', 'attention_mask'],
     num_rows: 1000
 })}
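
The tokenizer call above uses `padding="max_length"` and `truncation=True`, so every review ends up with the same number of tokens (the tokenizer's maximum length, which for `gpt2` should be 1,024). The sketch below imitates that behavior in plain Python on made-up token ids. `pad_or_truncate` is a hypothetical helper, not part of `transformers`, and it assumes right-side padding and truncation, which are the tokenizer defaults.

```python
PAD_ID = 50256  # id of GPT-2's <|endoftext|> token, reused as the pad token in this notebook


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Hypothetical helper: cut on the right, then pad on the right to max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up ids, chosen only for illustration.
short_example = pad_or_truncate([101, 102], max_length=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_example)
print(long_example)
```

The `attention_mask` column in the dataset output above plays the same role: it tells the model which positions hold real tokens.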

In [4]:
# from transformers import AutoModelForCausalLM
from transformers import AutoModelForSequenceClassification

foundation_model = AutoModelForSequenceClassification.from_pretrained(
        "gpt2",
        # The model is loaded in 8-bit to reduce memory usage.
        # load_in_8bit=True, 
        num_labels=2,  # IMDB has 2 labels: positive and negative
        device_map="auto" # Automatically map model layers to available devices (CPU/GPU)
    )

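# GPT-2 has no pad token, so the classification head cannot locate the last real
# token of a padded sequence unless config.pad_token_id is set. This run leaves it
# unset, which only works with a batch size of 1 (as used below). Uncomment the
# next line before using larger batches.
# foundation_model.config.pad_token_id = tokenizer.pad_token_id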

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# Trainer.evaluate() always reports the evaluation loss (eval_loss). With
# AutoModelForSequenceClassification this is the cross-entropy between the
# predicted class logits and the true labels.
# compute_metrics adds metrics beyond the loss, such as accuracy, precision,
# recall, or F1. Here it reports accuracy by comparing the argmax of the logits
# with the true label.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Not used here: DataCollatorForLanguageModeling with mlm=False is for causal language
# modeling (next-token prediction), whereas this notebook does sequence classification.
# data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

foundation_trainer = Trainer(
    model=foundation_model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        learning_rate=2e-5,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)


In [6]:
# evaluate using trainer
baseline_eval_results = foundation_trainer.evaluate()

baseline_eval_results

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 2.962777853012085,
 'eval_accuracy': 0.488,
 'eval_runtime': 85.0784,
 'eval_samples_per_second': 11.754,
 'eval_steps_per_second': 11.754}
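
An accuracy of 0.488 is roughly chance level for a two-class task, which fits the warning above that `score.weight` (the classification head) is newly initialized. To make the metric itself concrete, here is a small sketch that runs the same argmax-then-compare logic as `compute_metrics` on made-up logits. The logits and labels are invented for illustration and are not model outputs.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Same logic as the notebook's metric: argmax over class logits, then accuracy."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}


# Invented logits for four reviews, shape (4, 2).
# Column 0 scores the negative class, column 1 the positive class.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.9]])
toy_labels = np.array([0, 1, 0, 0])

result = compute_metrics((toy_logits, toy_labels))
print(result)  # three of the four predictions match the labels
```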

## Performing Parameter-Efficient Fine-Tuning
<a id="ppeft"></a>

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [13]:
from peft import LoraConfig

# Define the LoRA configuration.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Target GPT-2's attention layers ('c_attn', 'c_proj') and MLP layers
    # ('c_fc', plus the MLP's own 'c_proj', which matches the same name).
    # If target_modules is omitted, PEFT falls back to a built-in default for
    # known architectures, which may not cover every layer you want to adapt.
    # If LoRA modules are not injected where intended, the adapter has little
    # or nothing to train.
    target_modules=["c_attn", "c_proj", "c_fc"],
    lora_dropout=0.1,
#     bias="none",
    # SEQ_CLS tells PEFT to wrap the model for sequence classification.
    task_type="SEQ_CLS",
)
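
To illustrate what `r=16` and `lora_alpha=32` mean, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is a sketch under stated assumptions, not PEFT's implementation: the weights are random stand-ins, dropout is left out, and the layer shape (768 to 2304) is that of `c_attn` in GPT-2 small. The frozen weight `W` is left untouched, and the low-rank update is scaled by `lora_alpha / r`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 768, 2304, 16, 32

W = rng.normal(size=(d_in, d_out))      # frozen pretrained weight (random stand-in)
A = rng.normal(size=(d_in, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d_out))                # trainable up-projection, zero at initialization
scaling = lora_alpha / r                # 32 / 16


def lora_forward(x):
    """Base output plus the scaled low-rank update (dropout omitted)."""
    return x @ W + scaling * (x @ A @ B)


x = rng.normal(size=(4, d_in))
base_out = x @ W
adapted_out = lora_forward(x)

# Because B starts at zero, the adapted layer initially reproduces the base layer.
print(np.allclose(base_out, adapted_out))
print("trainable values in this layer:", A.size + B.size)
```

Only `A` and `B` would receive gradients during training, which is why the adapter adds so few trainable parameters per layer.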

In [14]:
from peft import get_peft_model

lora_model = get_peft_model(foundation_model, config)



In [15]:
lora_model.print_trainable_parameters()

trainable params: 2,359,296 || all params: 126,802,176 || trainable%: 1.860611603384472
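
The trainable-parameter count can be checked by hand. Each LoRA-adapted layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the standard GPT-2 small shapes (12 blocks, hidden size 768) and that the `c_proj` target matches both the attention and the MLP projection in every block.

```python
r = 16
n_layers = 12  # transformer blocks in GPT-2 small

# (d_in, d_out) of each targeted layer inside one block.
target_shapes = {
    "attn.c_attn": (768, 2304),
    "attn.c_proj": (768, 768),
    "mlp.c_fc": (768, 3072),
    "mlp.c_proj": (3072, 768),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes.values())
lora_params = n_layers * per_layer

# Total taken from the print_trainable_parameters() output above.
all_params = 126_802_176
trainable_pct = 100 * lora_params / all_params

print(f"{lora_params:,} LoRA parameters ({trainable_pct:.4f}% of all parameters)")
```

The result matches the `trainable params` figure reported above.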


In [16]:
import os
import torch

checkpoint_dir = "/workspace/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

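# Save the full state dict (base weights plus LoRA weights) under a fixed name,
# so each run overwrites the previous file.
checkpoint_path = os.path.join(checkpoint_dir, "checkpoint_latest.pth")
torch.save(lora_model.state_dict(), checkpoint_path)

# Keep only the 3 most recent checkpoint files, ordered by modification time.
checkpoint_files = [
    os.path.join(checkpoint_dir, name) for name in os.listdir(checkpoint_dir)
]
checkpoint_files = [path for path in checkpoint_files if os.path.isfile(path)]
checkpoint_files.sort(key=os.path.getmtime, reverse=True)  # newest first
for stale_path in checkpoint_files[3:]:
    os.remove(stale_path)  # delete everything older than the newest 3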
    

In [17]:
import gzip
import shutil

with open('/workspace/checkpoints/checkpoint_latest.pth', 'rb') as f_in:
    with gzip.open('/workspace/checkpoints/checkpoint_latest.pth.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [19]:
lora_trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data/lora_analysis",
        learning_rate=2e-5,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        num_train_epochs=4,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

lora_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.9487,0.82473,0.596
2,1.0903,0.793243,0.64
3,0.8185,0.656786,0.79
4,0.7563,0.57749,0.816


Checkpoint destination directory ./data/lora_analysis/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/lora_analysis/checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/lora_analysis/checkpoint-3000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=4000, training_loss=0.959916259765625, metrics={'train_runtime': 1462.8614, 'train_samples_per_second': 2.734, 'train_steps_per_second': 2.734, 'total_flos': 2148393811968000.0, 'train_loss': 0.959916259765625, 'epoch': 4.0})

###  ⚠️ IMPORTANT ⚠️
<a id="smitd"></a>

Due to workspace storage constraints, do not store the model weights in the workspace directory. Always save them under `/tmp` instead, because a workspace crash caused by a full disk is irrecoverable.

In [20]:
# Save the model. For a PEFT model, save_pretrained() writes only the trained
# LoRA adapter weights and the adapter config, not the full base model.
lora_model.save_pretrained("/tmp/yyan-peft-lora-gpt2")

# Save the tokenizer into the same directory. AutoPeftModelForSequenceClassification
# reloads the base model and applies the adapter on top, but it does not load a
# tokenizer, so the tokenizer is stored alongside the adapter.
tokenizer.save_pretrained("/tmp/yyan-peft-lora-gpt2")

('/tmp/yyan-peft-lora-gpt2/tokenizer_config.json',
 '/tmp/yyan-peft-lora-gpt2/special_tokens_map.json',
 '/tmp/yyan-peft-lora-gpt2/vocab.json',
 '/tmp/yyan-peft-lora-gpt2/merges.txt',
 '/tmp/yyan-peft-lora-gpt2/added_tokens.json',
 '/tmp/yyan-peft-lora-gpt2/tokenizer.json')

## Performing Inference with a PEFT Model
<a id="piwpeft"></a>

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [21]:
from peft import AutoPeftModelForSequenceClassification

reloaded_model = AutoPeftModelForSequenceClassification.from_pretrained("/tmp/yyan-peft-lora-gpt2")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# from transformers import AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained("/tmp/yyan-peft-lora-gpt2")
# inputs = tokenizer("Hello, my name is ", return_tensors="pt")
# outputs = reloaded_model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)
# print(tokenizer.batch_decode(outputs))

In [22]:
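# Note: lora_trainer wraps the in-memory lora_model, so this evaluates the model
# as trained in this session rather than the reloaded_model loaded from /tmp.
fine_tuned_performance = lora_trainer.evaluate()
print("Original Model:", baseline_eval_results)
print("Fine-Tuned Model:", fine_tuned_performance)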

Original Model: {'eval_loss': 2.962777853012085, 'eval_accuracy': 0.488, 'eval_runtime': 85.0784, 'eval_samples_per_second': 11.754, 'eval_steps_per_second': 11.754}
Fine-Tuned Model: {'eval_loss': 0.5774895548820496, 'eval_accuracy': 0.816, 'eval_runtime': 103.3878, 'eval_samples_per_second': 9.672, 'eval_steps_per_second': 9.672, 'epoch': 4.0}
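
To put the two result dictionaries side by side, the short sketch below computes the change in accuracy and loss. The numbers are copied from the output above, and the evaluation set size of 1,000 comes from the `select(range(1000))` call earlier in the notebook.

```python
baseline = {"eval_loss": 2.962777853012085, "eval_accuracy": 0.488}
fine_tuned = {"eval_loss": 0.5774895548820496, "eval_accuracy": 0.816}
n_eval_examples = 1000

accuracy_gain = fine_tuned["eval_accuracy"] - baseline["eval_accuracy"]
loss_drop = baseline["eval_loss"] - fine_tuned["eval_loss"]
# Number of additional test reviews classified correctly after fine-tuning.
extra_correct = round(accuracy_gain * n_eval_examples)

print(f"accuracy: {baseline['eval_accuracy']:.3f} -> {fine_tuned['eval_accuracy']:.3f} "
      f"(+{accuracy_gain:.3f}, {extra_correct} more correct reviews)")
print(f"loss:     {baseline['eval_loss']:.3f} -> {fine_tuned['eval_loss']:.3f} "
      f"(-{loss_drop:.3f})")
```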


In [23]:
os.listdir("/tmp/yyan-peft-lora-gpt2/")

['adapter_model.bin',
 'adapter_config.json',
 'README.md',
 'merges.txt',
 'tokenizer.json',
 'tokenizer_config.json',
 'special_tokens_map.json',
 'vocab.json']

In [25]:
cd /workspace/checkpoints

/workspace/checkpoints


In [2]:
# `du` is not an IPython automagic like `cd` or `ls`, so it needs the `!` shell escape.
!du -sh

In [3]:
ls -al

total 48
drwxr-xr-x 5 student student  4096 Aug 13 21:14 ./
drwxr-xr-x 1 root    root     4096 Aug 13 21:14 ../
drwxr-xr-x 2 student student  4096 Aug 13 21:14 .ipynb_checkpoints/
-rw-r--r-- 1 student student    61 Aug 13 21:07 .workspace-submit.json
-rw-r--r-- 1 student student 20654 Aug 13 21:08 LightweightFineTuning.ipynb
drwxr-xr-x 2 student student  4096 Aug 13 21:14 checkpoints/
drwxr-xr-x 5 student student  4096 Aug 13 21:14 data/


In [26]:
ls -lh

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


total 929M
-rw-r--r-- 1 student student 484M Aug 14 13:37 checkpoint_latest.pth
-rw-r--r-- 1 student student 445M Aug 14 13:38 checkpoint_latest.pth.gz
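
The gzip file is only about 8% smaller than the original (445M versus 484M). A likely reason is that floating-point weights look close to random at the byte level, and gzip gains little on such data. The sketch below illustrates the effect with the standard library on synthetic bytes. It uses seeded pseudo-random bytes as a stand-in for weights, which is an assumption made for illustration, not a measurement of the actual checkpoint.

```python
import gzip
import random

n_bytes = 1 << 18  # 256 KiB of synthetic data

rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(n_bytes))  # high-entropy stand-in
repetitive = bytes(n_bytes)                                # all zero bytes


def gzip_ratio(data):
    """Compressed size divided by original size (lower means better compression)."""
    return len(gzip.compress(data)) / len(data)


noisy_ratio = gzip_ratio(noisy)
repetitive_ratio = gzip_ratio(repetitive)
observed_ratio = 445 / 484  # sizes from the `ls -lh` output above

print(f"noisy bytes:      {noisy_ratio:.3f}")
print(f"repetitive bytes: {repetitive_ratio:.5f}")
print(f"checkpoint:       {observed_ratio:.3f}")
```

For saving disk space, storing only the LoRA adapter with `save_pretrained()` is usually far more effective than compressing the full state dict.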


In [27]:
os.remove("./checkpoint_latest.pth")

In [28]:
ls -lh


total 445M
-rw-r--r-- 1 student student 445M Aug 14 13:38 checkpoint_latest.pth.gz
