# Lightweight Fine-Tuning Project

* PEFT technique: 
    - For the Parameter-Efficient Fine-Tuning (PEFT) technique, I used LoRA (Low-Rank Adaptation).

* Model: 
    - The base model used for fine-tuning is the GPT-2 model, which is a transformer-based language model pre-trained on a large corpus of text data. GPT-2 is well-suited for various natural language processing tasks, including sequence classification, due to its powerful contextual understanding capabilities.

* Evaluation approach: 
    - The fine-tuned model is evaluated with accuracy as the primary metric: the proportion of correctly classified instances out of all instances in the evaluation dataset. The evaluation loss (cross-entropy) is reported as well; it reflects how much probability the model assigns to the correct labels, not only whether the top prediction is right.

* Fine-tuning dataset: 
    - The fine-tuning dataset is the IMDB dataset, which contains movie reviews labeled with sentiment (positive or negative). To reduce computational cost while still allowing effective fine-tuning, I use a shuffled subset of 5,000 reviews from each of the train and test splits.
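
As a small illustration of the two metrics described above, the sketch below computes accuracy and mean cross-entropy loss by hand. The probabilities and labels are made-up values chosen for the example, not results from this project, and the helper names `accuracy` and `cross_entropy` are my own.

```python
import math

def accuracy(pred_labels, true_labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    return correct / len(true_labels)

def cross_entropy(probs, true_labels):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p[t]) for p, t in zip(probs, true_labels)) / len(true_labels)

# Hypothetical class probabilities [P(negative), P(positive)] for four reviews
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 1]

# Predicted class = index of the larger probability
preds = [max(range(2), key=lambda c: p[c]) for p in probs]

print("accuracy:", accuracy(preds, labels))
print("loss:", round(cross_entropy(probs, labels), 4))

# A model that always outputs 50/50 has a loss of ln(2), about 0.693
uniform_loss = cross_entropy([[0.5, 0.5]] * 4, labels)
print("uniform loss:", round(uniform_loss, 4))
```

A loss far above ln(2), such as the value reported for the untuned model later in this notebook, suggests that the model is assigning high probability to wrong labels rather than merely guessing.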

## Loading and Evaluating a Foundation Model

In the cells below, I loaded my chosen pre-trained Hugging Face model and evaluated its performance prior to fine-tuning.

### Install & Import necessary libraries

In [1]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
! pip install -q "datasets==2.15.0"


In [1]:
import numpy as np

from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import load_dataset

### Define constants

In [47]:
MODEL_ID = "gpt2"
DATASET_NAME = "imdb"
splits = ["train", "test"]

### Load dataset
Let's load the dataset and select a subset of it for faster computation.

In [48]:
# Load dataset
ds = {split: ds for split, ds in zip(splits, load_dataset(DATASET_NAME, split=splits))}

# Select subset of the dataset for faster computation
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(5000))

# Print the dataset 
ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 5000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 5000
 })}
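
To show why the subset is shuffled before the first 5,000 rows are selected, here is a minimal sketch in plain Python. It does not use the `datasets` library and does not reproduce its shuffling algorithm; `seeded_subset` and `toy_split` are hypothetical names, and the toy split is assumed to be sorted by label, which is the situation where shuffling matters most.

```python
import random
from collections import Counter

def seeded_subset(examples, n, seed):
    """Shuffle a copy of `examples` reproducibly and keep the first `n` items."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]

# Toy stand-in for a split sorted by label: 12,500 negative, then 12,500 positive
toy_split = [{"label": 0}] * 12500 + [{"label": 1}] * 12500

subset = seeded_subset(toy_split, n=5000, seed=42)
label_counts = Counter(example["label"] for example in subset)
print("shuffled subset:", label_counts)

# Without shuffling, the first 5,000 rows would all share one label
unshuffled_counts = Counter(example["label"] for example in toy_split[:5000])
print("unshuffled subset:", unshuffled_counts)
```

Fixing the seed means the same rows are selected on every run, so evaluation numbers stay comparable between the base model and the fine-tuned model.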

### Pre-process datasets
Now we are going to process our datasets by converting all the text into tokens for our models. 

In [52]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token

Pre-process datasets by converting text into tokens

In [55]:
def preprocess_function(examples):
    return tokenizer(
        examples["text"], 
        padding="max_length", 
        truncation=True,
        max_length=512
    )

tokenized_ds = {}

# Tokenize and preprocess dataset splits
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)

# Print the tokenized dataset
tokenized_ds

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

{'train': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 5000
 }),
 'test': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 5000
 })}

Print the first ten token IDs of one example

In [56]:
tokenized_ds["train"][0]["input_ids"][:10]

[1858, 318, 645, 8695, 379, 477, 1022, 6401, 959, 290]

### Testing how the tokenizer and padding work [Optional]

In [14]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.pad_token)
print(tokenizer.eos_token)

# Example sentences
sentences = [
    "This is the first sentence.",
    "And here comes the second one.",
    "Finally, the last and longest sentence."
]

# Tokenize sentences
tokenized_sentences = tokenizer(sentences, padding=True, return_tensors="pt")

# Print tokenized sequences
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}")
    print("Tokenized:", tokenized_sentences['input_ids'][i])
    print("Attention Mask:", tokenized_sentences['attention_mask'][i])
    print()


<|endoftext|>
<|endoftext|>
Sentence 1: This is the first sentence.
Tokenized: tensor([ 1212,   318,   262,   717,  6827,    13, 50256, 50256])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 0, 0])

Sentence 2: And here comes the second one.
Tokenized: tensor([ 1870,   994,  2058,   262,  1218,   530,    13, 50256])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 0])

Sentence 3: Finally, the last and longest sentence.
Tokenized: tensor([11158,    11,   262,   938,   290, 14069,  6827,    13])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 1])
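
The padding behaviour seen in this output can be sketched in plain Python. This is only an illustration of right-padding to the longest sequence in a batch, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the token ids are copied from the output above with the padding removed.

```python
PAD_ID = 50256  # id of <|endoftext|>, reused here as the padding token

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [
    [1212, 318, 262, 717, 6827, 13],
    [1870, 994, 2058, 262, 1218, 530, 13],
    [11158, 11, 262, 938, 290, 14069, 6827, 13],
]

input_ids, attention_mask = pad_batch(batch, PAD_ID)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask is what lets the model ignore the padded positions, which matters here because the padding id is the same as the end-of-text id.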



## Load and set up the base model

In [10]:
base_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Set the model's pad token id to match the tokenizer's pad token id

In [11]:
base_model.config.pad_token_id = tokenizer.pad_token_id
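
The pad token id matters because, as I understand the Hugging Face documentation, `GPT2ForSequenceClassification` classifies from the hidden state of the last non-padding token in each row and uses `config.pad_token_id` to find that position. The sketch below shows the idea in plain Python; `last_token_index` is a hypothetical helper, and the library's own implementation works on tensors and may differ in detail.

```python
def last_token_index(input_ids, pad_token_id):
    """Return the position of the last non-padding token in a right-padded row."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_token_id:
            return position
    return -1  # the row contains only padding

PAD_ID = 50256

# Padded rows taken from the optional tokenizer test earlier in the notebook
rows = [
    [1212, 318, 262, 717, 6827, 13, 50256, 50256],
    [1870, 994, 2058, 262, 1218, 530, 13, 50256],
    [11158, 11, 262, 938, 290, 14069, 6827, 13],
]

indices = [last_token_index(row, PAD_ID) for row in rows]
print(indices)
```

Without a pad token id, the model cannot tell padding from content and refuses batches larger than one, which is why the id is set right after loading.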

Ensure that fine-tuning doesn't affect the base model weights, as only the adapter will be trained later using LoRA

In [12]:
for param in base_model.base_model.parameters():
    param.requires_grad = False

Print the architecture of the model

In [13]:
base_model

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

Print the final linear layer used for classification

In [14]:
base_model.score

Linear(in_features=768, out_features=2, bias=False)

## Evaluate the model

Define function to compute metrics during evaluation

In [57]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

Define the TrainingArguments used as the training configuration, including output directory, learning rate, batch size, and evaluation strategy.

In [58]:
training_args = TrainingArguments(
    output_dir="./model_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Next, the Trainer is initialized with the specified arguments, datasets, tokenizer, and metric computation for the base model.

In [59]:
pretrain_trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding=True),
    compute_metrics=compute_metrics,
)

The code below evaluates the foundational model on the evaluation dataset and prints out the evaluation results, including metrics such as evaluation loss and accuracy.

In [17]:
pretrained_results = pretrain_trainer.evaluate()

print(f"Evaluation results of foundational model: {pretrained_results}")

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Evaluation results of foundational model: {'eval_loss': 7.370438098907471, 'eval_accuracy': 0.4986, 'eval_runtime': 173.8013, 'eval_samples_per_second': 28.768, 'eval_steps_per_second': 1.801}


## Performing Parameter-Efficient Fine-Tuning

In the cells below, I created a PEFT model from my loaded model, ran a training loop, and saved the PEFT model weights.

In [60]:
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Make sure the CUDA cache is empty
torch.cuda.empty_cache()

The code below initializes a base model for sequence classification using the AutoModelForSequenceClassification class from the Hugging Face library, loading the model specified by MODEL_ID and configuring it for binary classification with two labels: "NEGATIVE" and "POSITIVE". 

In [61]:
base_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},  # For converting predictions to strings
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Set the model's pad token id to match the tokenizer's pad token id

In [62]:
base_model.config.pad_token_id = tokenizer.pad_token_id

Next, I initialize a configuration for Parameter-Efficient Fine-Tuning (PEFT) using the LoraConfig class, specifying the rank of the low-rank update matrices (r), the LoRA scaling factor (lora_alpha), the modules to which the adapters are applied (target_modules), the LoRA dropout rate (lora_dropout), the bias type (bias), and the task type (task_type), which is sequence classification.

In [63]:
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['c_attn', 'c_proj'],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,    
)
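
To make the roles of `r` and `lora_alpha` concrete, here is a minimal sketch of the LoRA update in plain Python: the frozen weight W is combined with a low-rank product scaled by `lora_alpha / r`. The matrix sizes are toy values chosen for readability, and `matmul` and `lora_weight` are hypothetical helpers, not PEFT functions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, lora_a, lora_b, r, lora_alpha):
    """Effective weight W + (lora_alpha / r) * B @ A."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scaling * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

rng = random.Random(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 8  # toy sizes; same alpha/r ratio as r=8, lora_alpha=32

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weight, d_out x d_in
lora_a = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]  # trainable, r x d_in
lora_b = [[0.0] * r for _ in range(d_out)]                           # trainable, d_out x r, starts at zero

# With B initialised to zero, the adapter changes nothing before training
unchanged = lora_weight(w, lora_a, lora_b, r, lora_alpha) == w
print("adapter is a no-op at initialisation:", unchanged)

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print("full matrix:", full_params, "LoRA matrices:", lora_params)
```

Only A and B receive gradients, so the number of trained values per adapted module grows with `r * (d_in + d_out)` instead of `d_in * d_out`.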

The code below creates a PEFT model by applying Parameter-Efficient Fine-Tuning (PEFT) to the base model using the specified configuration (peft_config). Afterward, it prints the number of trainable parameters in the PEFT model.

In [64]:
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()



trainable params: 814,080 || all params: 125,253,888 || trainable%: 0.6499438963523432
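
The reported count can be checked with some arithmetic. The sketch below assumes the standard GPT-2 small shapes (hidden size 768, 12 blocks) and that `target_modules=['c_attn', 'c_proj']` matches the attention `c_attn`, the attention `c_proj`, and the MLP `c_proj` in every block. Attributing the remaining 3,072 parameters to the classification head being counted twice is my assumption; it fits the numbers but is not confirmed against the PEFT source.

```python
HIDDEN = 768      # GPT-2 small hidden size
N_LAYERS = 12     # number of transformer blocks
R = 8             # LoRA rank from peft_config
NUM_LABELS = 2

# (in_features, out_features) of the modules matched by target_modules in one block
targeted = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),
    "attn.c_proj": (HIDDEN, HIDDEN),
    "mlp.c_proj": (4 * HIDDEN, HIDDEN),
}

# Each adapted module adds A (r x in) and B (out x r)
lora_per_block = sum(R * (n_in + n_out) for n_in, n_out in targeted.values())
lora_total = N_LAYERS * lora_per_block

# Classification head `score`: Linear(768, 2) without bias
head = HIDDEN * NUM_LABELS

print(lora_total)             # 811008
print(lora_total + 2 * head)  # 814080

# Percentage relative to the total reported by print_trainable_parameters
all_params = 125_253_888
trainable_pct = 100 * (lora_total + 2 * head) / all_params
print(round(trainable_pct, 4))
```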


Format the training and testing datasets to torch tensors, specifying the columns to include, such as input IDs, attention masks, and labels. This formatting facilitates compatibility with PyTorch-based training.

In [65]:
tokenized_ds["train"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_ds["test"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


Initialize training arguments and a Trainer object for fine-tuning the PEFT model.

In [24]:
training_args = TrainingArguments(
    output_dir="./lora_model_output",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_steps=50,  # only used when evaluation_strategy="steps"
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,  # the PEFT-wrapped model returned by get_peft_model
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding=True),
    compute_metrics=compute_metrics,
)

Start training for the number of epochs given by the "num_train_epochs" parameter.

In [25]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.3587,0.409858,0.8786
2,0.409,0.4065,0.8884
3,0.5052,0.40111,0.8904


Checkpoint destination directory ./lora_model_output/checkpoint-1250 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./lora_model_output/checkpoint-2500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./lora_model_output/checkpoint-3750 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=3750, training_loss=0.5038451108296712, metrics={'train_runtime': 2078.983, 'train_samples_per_second': 7.215, 'train_steps_per_second': 1.804, 'total_flos': 3956893286400000.0, 'train_loss': 0.5038451108296712, 'epoch': 3.0})
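
The step counts in this output follow from the training configuration. The short check below assumes a single device and the default gradient accumulation of 1; the variable names are my own.

```python
import math

num_examples = 5000   # rows in the training subset
batch_size = 4        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# save_strategy="epoch" writes one checkpoint at the end of each epoch
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch)   # 1250
print(total_steps)       # 3750
print(checkpoint_steps)  # [1250, 2500, 3750]

# Throughput implied by the reported train_runtime
train_runtime = 2078.983
samples_per_second = num_examples * num_epochs / train_runtime
print(round(samples_per_second, 3))  # 7.215
```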

Save the fine-tuned PEFT model

In [26]:
model.save_pretrained("gpt-lora")

## Performing Inference with a PEFT Model

In the cells below, I loaded the saved PEFT model weights and evaluated the performance of the trained PEFT model.

In [27]:
import torch
from peft import AutoPeftModelForSequenceClassification

This code initializes a PEFT model for sequence classification using the AutoPeftModelForSequenceClassification class. It loads the saved PEFT model from the "gpt-lora" directory, configures it for binary classification with two labels, and moves it to the CUDA device if one is available, otherwise to the CPU.

In [28]:
NUM_LABELS = 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

lora_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "gpt-lora", 
    num_labels=NUM_LABELS, 
    ignore_mismatched_sizes=True
).to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Set the model's pad token id to match the tokenizer's pad token id

In [29]:
lora_model.config.pad_token_id = tokenizer.pad_token_id

Initialize training arguments and a Trainer object as before.

In [30]:
training_args = TrainingArguments(
    output_dir="./data/sentiment_analysis",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

finetuned_trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

Evaluate the fine-tuned model on the test set

In [31]:
finetuned_results = finetuned_trainer.evaluate()

# Print the evaluation results for the fine-tuned model
print("Evaluation results for the fine-tuned model:", finetuned_results)

Evaluation results for the fine-tuned model: {'eval_loss': 0.40110960602760315, 'eval_accuracy': 0.8904, 'eval_runtime': 192.6615, 'eval_samples_per_second': 25.952, 'eval_steps_per_second': 1.625}


Evaluating both the foundational model and the fine-tuned model on the evaluation dataset gives the following results:

For the foundational "gpt2" model:
- Evaluation loss: 7.3704
- Evaluation accuracy: 49.86%

For the fine-tuned model:
- Evaluation loss: 0.4011
- Evaluation accuracy: 89.04%

Fine-tuning raised accuracy from 49.86%, which is roughly chance level for binary classification, to 89.04%, and lowered the evaluation loss from 7.3704 to 0.4011.

To see how well the model performs, the function below predicts the label for a given input sentence using the fine-tuned PEFT model (lora_model) and returns the predicted class index (0 for negative, 1 for positive).

In [67]:
def predict(sentence: str) -> int:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    lora_model.to(device)

    # Tokenize the input text
    inputs = tokenizer(sentence, return_tensors="pt").to(device)

    # Get logit predictions
    with torch.no_grad():
        outputs = lora_model(**inputs)
        logits = outputs.logits

    probabilities = torch.nn.functional.softmax(logits, dim=1)
    
    return probabilities.argmax().item()
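
If a readable label is preferred over the class index, the index can be mapped through the same `id2label` dictionary that was passed when loading the model. The sketch below shows the softmax and the mapping in plain Python; the logits are made-up values rather than model output, and `label_from_logits` is a hypothetical helper.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # same mapping used when loading the model

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits):
    """Return the label name and its probability for one example."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]

# Hypothetical logits for one sentence, not taken from the model
label, confidence = label_from_logits([-1.2, 2.3])
print(label, round(confidence, 3))
```

Softmax does not change which class has the highest score, so it is only needed when the probability itself is of interest.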


Negative Example

In [68]:
# Example usage
sentence = "I'm sad and i wanna cry"
predicted_label = predict(sentence)
print(f"Sentence: '{sentence}'\nPredicted label: {predicted_label}")

Sentence: 'I'm sad and i wanna cry'
Predicted label: 0


Positive Example

In [74]:
sentence = "I am happy."

predicted_label = predict(sentence)
print(f"Sentence: '{sentence}'\nPredicted label: {predicted_label}")

Sentence: 'I am happy.'
Predicted label: 1


Positive Example

In [77]:
sentence = "This movie was so funny."

predicted_label = predict(sentence)
print(f"Sentence: '{sentence}'\nPredicted label: {predicted_label}")

Sentence: 'This movie was so funny.'
Predicted label: 1


Negative Example

In [78]:
sentence = "This movie was so frustrating."

predicted_label = predict(sentence)
print(f"Sentence: '{sentence}'\nPredicted label: {predicted_label}")

Sentence: 'This movie was so frustrating.'
Predicted label: 0
