# **Lightweight Fine-Tuning Project**



*   PEFT technique: LoRA (Low-Rank Adaptation)
*   Model: GPT-2
*   Evaluation approach: Hugging Face Transformers `Trainer` (accuracy metric)
*   Fine-tuning dataset: dair-ai/emotion
  *   Description: This dataset contains short English text examples, each labeled with one of six emotions encoded as the integers 0-5: sadness, joy, love, anger, fear, and surprise. It has 16,000 training, 2,000 validation, and 2,000 test examples.
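The integer labels can be mapped back to emotion names. The sketch below hard-codes the label order of `dair-ai/emotion` (0 = sadness through 5 = surprise); in the notebook the same list should be available from `dataset['train'].features['label'].names`, which is worth checking. The `id2label`/`label2id` dictionaries could optionally be passed to `from_pretrained(...)` so that predictions are reported by name.

```python
# Label order of dair-ai/emotion (assumed here; verify against
# dataset['train'].features['label'].names in the notebook)
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Map integer ids to names and back
id2label = dict(enumerate(EMOTION_LABELS))
label2id = {name: idx for idx, name in id2label.items()}

# Usage: decode a few integer labels
print([id2label[i] for i in [1, 0, 3]])  # ['joy', 'sadness', 'anger']
```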


# **Loading and Evaluating a Foundation Model**

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install scikit-learn
!pip install evaluate
!pip install transformers[torch]

Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [2]:
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
from datasets import load_dataset

In [3]:
# Loading the tokenizer and the model (GPT-2 with a new 6-way classification head)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=6)

print(model)

# GPT-2 has no padding token by default, so add one and resize the embedding matrix to match
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# The model uses pad_token_id to find the last non-padding token of each sequence
model.config.pad_token_id = tokenizer.pad_token_id

print(f"Padding token: {tokenizer.pad_token}, ID: {tokenizer.pad_token_id}")

# Caching the downloaded dataset locally so it is not re-downloaded on every run
dataset = load_dataset('dair-ai/emotion', cache_dir='./datasets_cache')

# Padding on the right so each sequence's real tokens come first, followed by [PAD]
tokenizer.padding_side = "right"

# Taking a shuffled subset of the training data to reduce run time
# (the test split has exactly 2,000 examples, so all of it is kept, just shuffled)
small_train_dataset = dataset['train'].shuffle(seed=42).select(range(10000))
small_test_dataset = dataset['test'].shuffle(seed=42).select(range(2000))


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=6, bias=False)
)
Padding token: [PAD], ID: 50257



In [4]:
# Tokenizing the text (padded/truncated to 128 tokens) and attaching the integer labels
def tokenize(examples):
    tokenized = tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )
    tokenized['labels'] = examples['label']
    return tokenized

encoded_train_dataset = small_train_dataset.map(tokenize, batched=True)
encoded_test_dataset = small_test_dataset.map(tokenize, batched=True)

#setting the format of the dataset to PyTorch tensors
encoded_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
encoded_test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


print(encoded_train_dataset.format)
print(encoded_test_dataset.format)

print(encoded_train_dataset[0])
print(encoded_test_dataset[0])



{'type': 'torch', 'format_kwargs': {}, 'columns': ['input_ids', 'attention_mask', 'labels'], 'output_all_columns': False}
{'type': 'torch', 'format_kwargs': {}, 'columns': ['input_ids', 'attention_mask', 'labels'], 'output_all_columns': False}
{'input_ids': tensor([ 4514, 16259,   287,   262,  1499, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
        50257, 50257, 50257, 50257, 5025
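Because padding is on the right, the real tokens of each row come first and `[PAD]` (ID 50257) fills the rest, as in the first training example above. `GPT2ForSequenceClassification` classifies from the hidden state of the last non-padding token in each row, which is why `model.config.pad_token_id` has to be set. The pure-Python helper `last_token_index` below is a hypothetical illustration of that lookup, not the library's implementation.

```python
PAD_ID = 50257  # ID of the added [PAD] token, from the output of cell 3

def last_token_index(input_ids, pad_id=PAD_ID):
    """Return the position of the last non-padding token in one row (-1 if all padding)."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_id:
            return position
    return -1

# First training example: 5 real tokens followed by padding up to max_length=128
row = [4514, 16259, 287, 262, 1499] + [PAD_ID] * 123
print(len(row), last_token_index(row))  # 128 4
```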

In [5]:
from transformers import DataCollatorWithPadding

# Data collator that pads each batch to a common length
# (here every example is already padded to 128 tokens by tokenize)
data_collator = DataCollatorWithPadding(tokenizer)

# Confirming the data collator is configured as expected
print(data_collator)

DataCollatorWithPadding(tokenizer=GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50257: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
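Since `tokenize` already pads every example to `max_length=128`, the collator has nothing left to pad here. If `padding='max_length'` were dropped from `tokenize`, `DataCollatorWithPadding` would instead pad each batch only to its longest sequence, which usually wastes fewer positions on `[PAD]`. The helper `pad_batch` below is a hypothetical, simplified sketch of right-side dynamic padding on plain lists; it is not the collator's code.

```python
def pad_batch(sequences, pad_id=50257):
    """Right-pad token-id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[4514, 16259, 287], [262, 1499]])
print(batch["input_ids"])       # [[4514, 16259, 287], [262, 1499, 50257]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```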


In [6]:
from transformers import Trainer, TrainingArguments
import numpy as np
import evaluate

#loading the accuracy metric
accuracy = evaluate.load('accuracy')

#compute metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

#defining the training arguments
training_args=TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

#defining the trainer
trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_test_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

#evaluating the model
eval_results = trainer.evaluate()
print(f"Evaluation results (original model): {eval_results}")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Evaluation results (original model): {'eval_loss': 7.019896507263184, 'eval_accuracy': 0.2905, 'eval_runtime': 22.0927, 'eval_samples_per_second': 90.528, 'eval_steps_per_second': 22.632}


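Evaluating GPT-2 (our foundation model) on the emotion test subset gives an `eval_accuracy` of 0.2905: before fine-tuning, the model labels only 29.05% of the 2,000 test examples correctly. This is expected, because the classification head (`score.weight`) is newly and randomly initialized, as the warning printed when the model was loaded points out, so the model has not yet learned to map its representations to the six emotion labels. The goal is to see how much fine-tuning improves this accuracy.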
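`compute_metrics` takes the argmax over the six logits in each row as the predicted label and reports the fraction of predictions that match the reference labels. A minimal NumPy-only equivalent, under the hypothetical name `simple_accuracy`, applied to made-up logits:

```python
import numpy as np

def simple_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == np.asarray(labels)))

# Toy logits for 4 examples over the 6 emotion classes (made-up values)
logits = np.array([
    [2.0, 0.1, 0.0, 0.0, 0.0, 0.0],  # predicts 0
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],  # predicts 1
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.5],  # predicts 5
    [0.5, 0.0, 0.0, 0.0, 0.2, 0.0],  # predicts 0
])
labels = [0, 1, 2, 4]
print(simple_accuracy(logits, labels))  # 0.5
```

On the 2,000-example test subset, an accuracy of 0.2905 corresponds to 581 correct predictions.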

# **Performing Parameter-Efficient Fine-Tuning**

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [7]:
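!pip install peft

from peft import LoraConfig, get_peft_model

# Applying LoRA to the attention and MLP projections of all 12 GPT-2 blocks.
# PEFT matches list entries against the end of each module name, so 'c_proj' covers
# both attn.c_proj and mlp.c_proj; this selects the same 48 modules as listing every
# transformer.h.{i}.* path explicitly.
target_modules = ['c_attn', 'c_proj', 'c_fc']

# Creating the LoRA PEFT configuration.
# Note: no task_type is set, so the newly initialized `score` head stays frozen and is
# not written by save_pretrained(); task_type="SEQ_CLS" would train and save it too.
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=target_modules,
    lora_dropout=0.1,
    fan_in_fan_out=True  # GPT-2 uses Conv1D layers, which store weights as (in, out)
)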

Collecting peft
  Downloading peft-0.11.1-py3-none-any.whl (251 kB)
Installing collected packages: peft
Successfully installed peft-0.11.1
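LoRA keeps each targeted weight matrix W frozen and learns a low-rank update, so an adapted layer computes h = xW + (lora_alpha / r)(xAᵀ)Bᵀ, where A has shape (r, d_in) and B has shape (d_out, r). With `r=8` and `lora_alpha=32` the scaling factor is 4. The NumPy sketch below uses the `c_attn` sizes (768 to 2304) from the PEFT printout and random stand-in weights; it assumes PEFT's default initialization in which `lora_B` starts at zero, so the adapted layer initially reproduces the frozen one exactly. Dropout on the LoRA branch is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 768, 2304   # c_attn input/output sizes from the printout
r, lora_alpha = 8, 32     # values from peft_config
scaling = lora_alpha / r  # 4.0

# Stand-in weights (random, for illustration only)
W = rng.standard_normal((d_in, d_out)) * 0.02  # frozen Conv1D weight, stored as (in, out)
A = rng.standard_normal((r, d_in)) * 0.01      # lora_A: Linear(768 -> 8), weight (8, 768)
B = np.zeros((d_out, r))                       # lora_B: Linear(8 -> 2304), weight (2304, 8)

def lora_forward(x, W, A, B, scaling):
    """Frozen projection plus the scaled low-rank update (LoRA dropout omitted)."""
    return x @ W + (x @ A.T) @ B.T * scaling

x = rng.standard_normal((2, d_in))  # two token vectors
# With B = 0 the adapted layer matches the frozen layer
print(np.allclose(lora_forward(x, W, A, B, scaling), x @ W))  # True

# Once B is non-zero, the effective weight update has rank at most r
B_trained = rng.standard_normal((d_out, r)) * 0.01
delta_W = (B_trained @ A).T * scaling  # same shape as W
print(delta_W.shape, np.linalg.matrix_rank(delta_W))  # (768, 2304) 8
```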


In [8]:
#creating the PEFT model
peft_model = get_peft_model(model, peft_config)

print(peft_model)



PeftModel(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50258, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
       
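Each LoRA pair adds r·(d_in + d_out) trainable weights (the `lora_A`/`lora_B` layers have no bias). Using the four targeted projections per block, whose `c_attn` sizes appear in the printout above and whose other sizes follow from GPT-2 small's 768-dimensional hidden state and 3072-dimensional MLP, the adapters total about 1.18 million parameters across 12 blocks, roughly 1% of GPT-2 small's ~124 million. In the notebook, `peft_model.print_trainable_parameters()` reports the actual count.

```python
r = 8
n_blocks = 12

# (in_features, out_features) of each LoRA-targeted module in one GPT-2 small block
targeted = {
    "attn.c_attn": (768, 2304),
    "attn.c_proj": (768, 768),
    "mlp.c_fc": (768, 3072),
    "mlp.c_proj": (3072, 768),
}

def lora_param_count(d_in, d_out, r):
    """lora_A holds r*d_in weights and lora_B holds d_out*r weights (no biases)."""
    return r * d_in + r * d_out

per_block = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in targeted.values())
total = per_block * n_blocks
print(per_block, total)  # 98304 1179648
```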

In [9]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer

# Creating the dataloaders
train_dataloader = DataLoader(encoded_train_dataset, batch_size=4, shuffle=True, collate_fn=data_collator)
eval_dataloader = DataLoader(encoded_test_dataset, batch_size=4, shuffle=False, collate_fn=data_collator)

# Setting the optimizer to AdamW over the trainable (LoRA) parameters only;
# weight_decay=0.0 matches the default of the transformers.AdamW used previously
optimizer = AdamW([p for p in peft_model.parameters() if p.requires_grad], lr=5e-5, weight_decay=0.0)

# Creating the training loop
num_epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
peft_model.to(device)

for epoch in range(num_epochs):
    peft_model.train()
    total_train_loss = 0.0
    total_batches = 0

    # Iterating over the training data in batches
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = peft_model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Accumulating the training loss
        total_train_loss += loss.item()
        total_batches += 1

    # Calculating the average training loss across batches
    avg_train_loss = total_train_loss / total_batches
    print(f"Training loss: {avg_train_loss}")

    peft_model.eval()
    eval_loss = 0.0

    # Iterating over the held-out data (the test subset) in batches
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = peft_model(**batch)
        eval_loss += outputs.loss.item()
    eval_loss /= len(eval_dataloader)
    print(f"Validation loss: {eval_loss}")

# Saving the trained model (save_pretrained writes only the LoRA adapter weights and config)
peft_model.save_pretrained('./peft_model')



Training loss: 1.4374157854422926
Validation loss: 0.5672815988063812
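The manual loop uses a constant learning rate of 5e-5, whereas the `TrainingArguments` in cell 6 specified `warmup_steps=500`. If warmup were wanted here as well, `transformers.get_linear_schedule_with_warmup` provides a linear warmup followed by linear decay. The function below is a plain-Python sketch of that multiplier; the 2,500 total steps are an assumption derived from 10,000 training examples at batch size 4 for one epoch.

```python
def linear_warmup_multiplier(step, warmup_steps=500, total_steps=2500):
    """Learning-rate multiplier: ramps 0 -> 1 over warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5
for step in [0, 250, 500, 1500, 2500]:
    print(step, base_lr * linear_warmup_multiplier(step))
```

In the notebook this would typically be attached with `scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=len(train_dataloader) * num_epochs)`, calling `scheduler.step()` after each `optimizer.step()`.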




# **Performing Inference with a PEFT Model**

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [10]:
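# Evaluating the PEFT model: get_peft_model() injected the LoRA layers into `model` in place,
# so the Trainer created earlier (which holds `model`) now evaluates the fine-tuned weights
eval_results_peft = trainer.evaluate()
print(f"Evaluation results (PEFT model): {eval_results_peft}")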

Evaluation results (PEFT model): {'eval_loss': 0.5672816038131714, 'eval_accuracy': 0.7995, 'eval_runtime': 20.4208, 'eval_samples_per_second': 97.939, 'eval_steps_per_second': 24.485}
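Cell 10 evaluates the fine-tuned model that is still in memory rather than reloading `./peft_model`. A sketch of reloading the saved adapter for inference follows; `load_peft_classifier` and `predict_emotions` are hypothetical helper names, and the heavy imports sit inside the functions so the definitions alone need no downloads. Two assumptions matter. First, the base model must be rebuilt with the same added `[PAD]` token (vocabulary of 50258, as in the PEFT printout) so that tensor shapes match. Second, `save_pretrained` stores only the LoRA weights, so with the configuration used above the randomly initialized `score` head is re-created on reload; reproducing the 79.95% result from disk would likely require saving the head as well, for example by setting `task_type="SEQ_CLS"` (or `modules_to_save=["score"]`) in `LoraConfig` before training.

```python
def load_peft_classifier(adapter_dir="./peft_model", base_name="gpt2", num_labels=6):
    """Rebuild GPT-2 with the same padding setup and attach the saved LoRA adapter."""
    from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
    from peft import PeftModel

    tokenizer = GPT2Tokenizer.from_pretrained(base_name)
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    tokenizer.padding_side = "right"

    base_model = GPT2ForSequenceClassification.from_pretrained(base_name, num_labels=num_labels)
    base_model.resize_token_embeddings(len(tokenizer))  # 50257 -> 50258, as during training
    base_model.config.pad_token_id = tokenizer.pad_token_id

    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return tokenizer, model


def predict_emotions(texts, tokenizer, model):
    """Return the predicted emotion name for each input string."""
    import torch

    labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]  # dair-ai/emotion order
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return [labels[i] for i in logits.argmax(dim=-1).tolist()]
```

Usage in the notebook would be `tokenizer, model = load_peft_classifier()` followed by `predict_emotions(["i feel wonderful today"], tokenizer, model)`.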


In [11]:
#comparing the results from pre and post fine-tuning
print("Comparison of evaluation results:")
print(f"Original model accuracy: {eval_results['eval_accuracy']}")
print(f"PEFT model accuracy: {eval_results_peft['eval_accuracy']}")

Comparison of evaluation results:
Original model accuracy: 0.2905
PEFT model accuracy: 0.7995


After lightweight fine-tuning with LoRA, evaluation accuracy improved substantially: the original model reached a sub-par 29.05%, while the PEFT model reached 79.95%, a gain of 50.9 percentage points, even though only the small LoRA adapter matrices were trained.
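The improvement can also be expressed as a count of correctly classified test examples and as a ratio. A short calculation using the two accuracies reported above and the 2,000-example test subset:

```python
n_test = 2000
acc_before = 0.2905  # original model, from cell 6
acc_after = 0.7995   # PEFT model, from cell 10

correct_before = round(acc_before * n_test)
correct_after = round(acc_after * n_test)
gain_points = (acc_after - acc_before) * 100
ratio = acc_after / acc_before

print(correct_before, correct_after)  # 581 1599
print(f"{gain_points:.2f} percentage points, {ratio:.2f}x")  # 50.90 percentage points, 2.75x
```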