# Lightweight Fine-Tuning Project

This notebook walks through:
* Evaluating the pretrained BERT model
* Fine-tuning BERT on custom data (IMDB) with two different LoRA configurations
* Comparing the accuracy of the original model with the accuracy after fine-tuning

These tools and techniques are used throughout the document:

* LLM Model: **google-bert/bert-base-cased**
* PEFT technique: **LoRA**
* Evaluation approach: **Using accuracy metric**
* Fine-tuning dataset: **stanfordnlp/imdb**

## Loading and Evaluating a Foundation Model

In this section, a pre-trained Hugging Face model is loaded and its performance is evaluated prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [2]:
# You might need to restart the kernel after this command to avoid errors when calling AutoModelForSequenceClassification
!pip install -r requirements.txt -q

## Prepare the Foundation Model

### Load a pretrained HF model

In [3]:
from transformers import AutoTokenizer
model_id="google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

### Load and preprocess a dataset

In [4]:
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/imdb")

In [5]:
# IMDB contains train, test and unsupervised splits with 25000, 25000 and 50000 samples respectively.
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_datasets = dataset["train"].map(tokenize_function, batched=True)
tokenized_test_datasets = dataset["test"].map(tokenize_function, batched=True)
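
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each review, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. The helper `pad_or_truncate` and the toy ids are hypothetical, and the real tokenizer also inserts special tokens, which this sketch ignores. A length of 8 is used only to keep the output readable; for this BERT checkpoint the limit is 512 positions and the padding id is 0.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids: one short sequence and one that is too long
short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to the maximum length keeps the tensors rectangular, at the cost of computing attention over many padding positions for short reviews.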

In [7]:
train_dataset_size = 6000
eval_dataset_size = 3000

In [8]:
small_train_dataset = tokenized_train_datasets.shuffle(seed=42).select(range(train_dataset_size))
small_eval_dataset = tokenized_test_datasets.shuffle(seed=42).select(range(eval_dataset_size))

In [9]:
print(small_eval_dataset, small_train_dataset)

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3000
}) Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 6000
})


### Evaluate the pretrained model

In [10]:
#Create a map between expected ids and labels
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [11]:
from transformers import AutoModelForSequenceClassification
import torch 

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, 
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
def count_params(model, is_human: bool = False):
    params: int = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{params / 1e6:.2f}M" if is_human else params

print(model)
print("Total # of params for the model {}: {}".format(model_id, count_params(model, is_human=True)))

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [18]:
import random


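# Pick a random index into the evaluation subset.
# randrange excludes the upper bound, so the index is always valid
# (randint(0, eval_dataset_size) could return eval_dataset_size and raise an IndexError).
x = random.randrange(eval_dataset_size)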

print("text: {},\nlabel:{} = {}".format(
    small_eval_dataset["text"][x],
    small_eval_dataset["label"][x],
    id2label[small_eval_dataset["label"][x]])
)

text: A true stand out episode from season 1 is what Ice is.An artic location,claustrophobic conditions and a general feel of paranoia looming in the freezing air makes this is a must see episode from season one.The previous occupants of the artic station Mulder,Scully and four others go to have either killed each other or killed themselves.A virus is bringing out murderous aggression and is responsible for bringing out deadly paranoia and fear.Mulder and Scully actually begin to question each others sanity.Tension is that high.The writers have to receive great credit for creating that sort of scenario where the atmosphere is so tense Mulder and Scully come into conflict in such a direct manner,
label:1 = POSITIVE


In [19]:
# Use the accuracy metric
# Function adapted from https://huggingface.co/learn/nlp-course/en/chapter3/3#evaluation
import numpy as np

import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
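
To make the metric concrete, the following sketch reproduces the two steps of `compute_metrics` by hand with NumPy: take the arg-max over each row of logits, then measure the fraction of predictions that match the labels. The logits and labels are hypothetical values chosen for illustration, not model output.

```python
import numpy as np

# Hypothetical logits for four reviews: column 0 = NEGATIVE, column 1 = POSITIVE
logits = np.array([
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
    [-1.5, 1.7],
])
labels = np.array([0, 1, 1, 1])

predictions = np.argmax(logits, axis=-1)           # index of the highest logit per row
accuracy = float((predictions == labels).mean())   # fraction of correct predictions

print(predictions.tolist(), accuracy)
```

Three of the four hypothetical predictions match their labels, so the accuracy here is 0.75.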


In [20]:
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        "evaluate_foundational_model",
        eval_strategy="epoch",
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16
    ),
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)


In [21]:
%%time
# Let's see the performance of the foundation model before any fine-tuning
trainer_foundation_eval=trainer.evaluate(eval_dataset=small_eval_dataset)
trainer_foundation_eval

CPU times: user 4.6 s, sys: 1.31 s, total: 5.91 s
Wall time: 2min 35s


{'eval_loss': 0.7342166304588318,
 'eval_model_preparation_time': 0.0033,
 'eval_accuracy': 0.49633333333333335,
 'eval_runtime': 155.5398,
 'eval_samples_per_second': 19.288,
 'eval_steps_per_second': 1.209}

###### **Without any fine-tuning, the model "google-bert/bert-base-cased" has an _accuracy_ of _0.4963_**
###### **Not great, but we will improve it substantially with fine-tuning in the next steps**
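
An accuracy close to 0.5 is what random guessing achieves on a two-class problem with balanced labels, which is consistent with the earlier warning that the classification head is newly initialized. The simulation below is only a sketch of that baseline: it uses synthetic balanced labels and coin-flip predictions, not the model or the IMDB data.

```python
import random

rng = random.Random(0)
n_samples = 3000  # same size as the evaluation subset

# Synthetic balanced labels and uniformly random predictions
labels = [i % 2 for i in range(n_samples)]
predictions = [rng.randint(0, 1) for _ in range(n_samples)]

chance_accuracy = sum(p == y for p, y in zip(predictions, labels)) / n_samples
print(f"chance-level accuracy: {chance_accuracy:.3f}")
```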

### Saving the foundation model to local directory

In [22]:
# Save the foundational model to the local directory "foundational_model/" 
trainer.save_model("foundational_model/")

## Performing Parameter-Efficient Fine-Tuning

Create two PEFT models to test two different lora_config values and compare the results between the two. Save the PEFT model weights for each training.

### PEFT model (same foundational model for the two PEFT configurations)

In [23]:
peft_model_id = model_id 
model = AutoModelForSequenceClassification.from_pretrained(
    peft_model_id,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Create a PEFT model #1

In [24]:
# Create a dictionary with two sets of values for two training runs and a performance comparison
peft_values= {
    "values1": {
        "r": 16,
        "lora_alpha": 16,
        "lora_dropout": 0.1,
        "bias": "none"
    },
    "values2": {
        "r": 64,
        "lora_alpha": 128,
        "lora_dropout": 0.01,
        "bias": "none"
    }
}
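
For context on what `r` and `lora_alpha` control: LoRA keeps the pretrained weight matrix frozen and learns a low-rank update that is scaled by `lora_alpha / r`. The NumPy sketch below illustrates the idea for a single 768x768 projection. It is a simplified stand-in rather than PEFT's implementation, and the function `lora_forward` and the random matrices are assumptions made for illustration.

```python
import numpy as np


def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen projection plus the scaled low-rank update."""
    scaling = lora_alpha / r
    return x @ W.T + scaling * (x @ A.T @ B.T)


rng = np.random.default_rng(0)
d, r = 768, 16
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init so the update starts at 0
x = rng.normal(size=(2, d))          # two hypothetical input vectors

out = lora_forward(x, W, A, B, lora_alpha=16, r=r)

# Before any training the adapter leaves the output unchanged
print(np.allclose(out, x @ W.T))

# Scaling factors implied by the two configurations
scale1, scale2 = 16 / 16, 128 / 64
print(scale1, scale2)
```

Under this reading, the second configuration differs from the first in two ways at once: a higher rank (more trainable capacity) and a larger scaling factor on the learned update.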

In [26]:
from peft import LoraConfig, TaskType

lora_config1 = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification, matching AutoModelForSequenceClassification
    r=peft_values["values1"]["r"],
    lora_alpha=peft_values["values1"]["lora_alpha"],
    lora_dropout=peft_values["values1"]["lora_dropout"],
    bias=peft_values["values1"]["bias"],
    target_modules=["query", "value"]
)

In [27]:
from peft import get_peft_model

lora_model1 = get_peft_model(model, lora_config1)
lora_model1.print_trainable_parameters()

trainable params: 591,362 || all params: 108,903,172 || trainable%: 0.5430


**Here we can see the advantage of PEFT over training the whole model: only 0.543% of BERT's roughly 109 million parameters are trainable.**
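
The trainable-parameter count can be reproduced with a little arithmetic. The sketch below assumes that each targeted projection (`query` and `value` in each of the 12 layers) gets two LoRA matrices of shapes `r x 768` and `768 x r`, and that the 2-label classification head stays trainable alongside the adapters. That second point is an inference from the reported count, not something the notebook states. The helper `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(r, hidden=768, n_layers=12, n_targets=2, n_labels=2):
    """Count LoRA matrices on the targeted projections plus the classification head."""
    per_module = r * hidden + hidden * r      # A is (r x hidden), B is (hidden x r)
    lora = n_layers * n_targets * per_module
    head = hidden * n_labels + n_labels       # classifier weight + bias
    return lora + head


config1 = lora_trainable_params(r=16)
config2 = lora_trainable_params(r=64)

# Share of trainable parameters, using the "all params" value printed for r=16
share1 = 100 * config1 / 108_903_172

print(config1, f"{share1:.4f}%")
print(config2)
```

With `r=16` this gives the 591,362 trainable parameters printed by `print_trainable_parameters()`, and with `r=64` it gives the 2,360,834 reported for the second configuration.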

### Train the PEFT model #1

In [28]:
training_args_peft1 = TrainingArguments(
    "trainer_peft1_output",
    eval_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16)

In [29]:
%%time
trainer1 = Trainer(
    model=lora_model1,
    args=training_args_peft1,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer1.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.315898,0.872
2,0.486200,0.288072,0.884333
3,0.301500,0.286446,0.882667


CPU times: user 2min 28s, sys: 2min 22s, total: 4min 50s
Wall time: 1h 9min 23s


TrainOutput(global_step=1125, training_loss=0.38100865342881945, metrics={'train_runtime': 4163.4267, 'train_samples_per_second': 4.323, 'train_steps_per_second': 0.27, 'total_flos': 4768698949632000.0, 'train_loss': 0.38100865342881945, 'epoch': 3.0})
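
The `global_step=1125` and the `No log` entry for epoch 1 both follow from the batch size and the `TrainingArguments` defaults. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and the library defaults of 3 epochs and logging every 500 steps, none of which are set explicitly in this notebook.

```python
import math

train_samples = 6000
batch_size = 16        # per_device_train_batch_size, single device assumed
epochs = 3             # assumed default for num_train_epochs
logging_steps = 500    # assumed default logging interval

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Epoch (as a fraction) at which the first training loss is logged
first_log_epoch = logging_steps / steps_per_epoch

print(steps_per_epoch, total_steps, round(first_log_epoch, 2))
```

Because the first training-loss log falls after the end of epoch 1, the table has no training loss to show for that epoch.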

In [31]:
%%time
trainer1_eval=trainer1.evaluate()
trainer1_eval

CPU times: user 6.03 s, sys: 3.2 s, total: 9.23 s
Wall time: 3min 57s


{'eval_loss': 0.28644636273384094,
 'eval_accuracy': 0.8826666666666667,
 'eval_runtime': 237.0754,
 'eval_samples_per_second': 12.654,
 'eval_steps_per_second': 0.793,
 'epoch': 3.0}

##### **After fine-tuning with LoRA configuration #1, "google-bert/bert-base-cased" reaches an _accuracy_ of _0.883_, much better than the original foundational model (_0.496_).**

### Save the PEFT model #1

In [33]:
lora_model1.save_pretrained("trainer_peft_1")

In [34]:
!ls -ltra trainer_peft_1/

total 6248
drwxr-xr-x@  5 mk  staff      160 27 Dec 17:37 .
drwxr-xr-x  14 mk  staff      448  4 Jan 19:47 ..
-rw-r--r--@  1 mk  staff     5101  4 Jan 19:47 README.md
-rw-r--r--@  1 mk  staff  2372416  4 Jan 19:47 adapter_model.safetensors
-rw-r--r--@  1 mk  staff      752  4 Jan 19:47 adapter_config.json
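
The size of `adapter_model.safetensors` is roughly what one would expect if only the trainable tensors are stored. The back-of-the-envelope check below assumes 32-bit floats and attributes the remaining bytes to file header and metadata. Both are assumptions, not something verified against the file.

```python
trainable_params = 591_362   # from print_trainable_parameters() for configuration #1
bytes_per_param = 4          # assumes float32 storage
file_size = 2_372_416        # adapter_model.safetensors size from the directory listing

tensor_bytes = trainable_params * bytes_per_param
overhead = file_size - tensor_bytes   # leftover bytes, presumably header and metadata

print(tensor_bytes, overhead, f"{file_size / 1e6:.2f} MB")
```

Under the same float32 assumption, a full copy of the roughly 108 million base-model parameters would take over 400 MB, so the adapter is a small fraction of that.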


### Create PEFT model #2

In [35]:
from peft import LoraConfig, TaskType

lora_config2 = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification, matching AutoModelForSequenceClassification
    r=peft_values["values2"]["r"],
    lora_alpha=peft_values["values2"]["lora_alpha"],
    lora_dropout=peft_values["values2"]["lora_dropout"],
    bias=peft_values["values2"]["bias"],
    target_modules=["query", "value"]
)

In [36]:
from peft import get_peft_model

lora_model2 = get_peft_model(model, lora_config2)
lora_model2.print_trainable_parameters()

trainable params: 2,360,834 || all params: 110,672,644 || trainable%: 2.1332


### Train PEFT model #2

In [37]:
training_args_peft2 = TrainingArguments(
    "trainer_peft2_output",
    eval_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16
)

In [38]:
%%time
trainer2 = Trainer(
    model=lora_model2,
    args=training_args_peft2,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)
trainer2.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.275695,0.887
2,0.375400,0.261908,0.897
3,0.255100,0.260806,0.899333


CPU times: user 2min 30s, sys: 3min, total: 5min 31s
Wall time: 1h 16min 5s


TrainOutput(global_step=1125, training_loss=0.3054690348307292, metrics={'train_runtime': 4564.0093, 'train_samples_per_second': 3.944, 'train_steps_per_second': 0.246, 'total_flos': 4866543673344000.0, 'train_loss': 0.3054690348307292, 'epoch': 3.0})

In [40]:
%%time
trainer2_eval=trainer2.evaluate()
trainer2_eval

CPU times: user 6.27 s, sys: 3.89 s, total: 10.2 s
Wall time: 2min 49s


{'eval_loss': 0.26080623269081116,
 'eval_accuracy': 0.8993333333333333,
 'eval_runtime': 169.1841,
 'eval_samples_per_second': 17.732,
 'eval_steps_per_second': 1.111,
 'epoch': 3.0}

**After fine-tuning with LoRA configuration #2, "google-bert/bert-base-cased" reaches an _accuracy_ of _0.899_, much better than the original foundational model and about 1.7 percentage points above the PEFT #1 model.**

### Save the PEFT model #2

In [41]:
# Save the second PEFT model (the original cell saved lora_model1 here by mistake)
lora_model2.save_pretrained("trainer_peft_2")

In [42]:
!ls -ltra trainer_peft_2/

total 18728
drwxr-xr-x@  5 mk  staff      160 29 Dec 15:29 .
drwxr-xr-x  15 mk  staff      480  4 Jan 21:45 ..
-rw-r--r--@  1 mk  staff     5101  4 Jan 21:47 README.md
-rw-r--r--@  1 mk  staff  9450336  4 Jan 21:47 adapter_model.safetensors
-rw-r--r--@  1 mk  staff      754  4 Jan 21:47 adapter_config.json




## Performing Inference with a PEFT Model

Load the weights of the PEFT model with the best accuracy and evaluate the performance of the trained model.

## Perform Inference Using the Fine-Tuned Model

### Load the saved PEFT model

We load the better of the two PEFT models we created: "trainer_peft_2".

In [43]:
saved_model = AutoModelForSequenceClassification.from_pretrained("trainer_peft_2")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Evaluate the fine-tuned model

In [44]:
# %%time

# randrange excludes the upper bound, so the index is always valid
x = random.randrange(eval_dataset_size)

text_to_classify=small_eval_dataset["text"][x]

print("Text from eval_dataset: {} \n\nlabel from eval_dataset:{}".format(
    small_eval_dataset["text"][x], 
    id2label[small_eval_dataset["label"][x]]
))


def classify(text):
    #Tokenize the text and return a PyTorch tensor
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")

    #Pass the tokenized text to the model and get logits
    with torch.no_grad():
        outputs = saved_model(**inputs)
    # Get the predicted class
    predictions = torch.argmax(outputs.logits, dim=-1)
    # Map the class id with the id2label dict defined earlier, rather than via the unrelated `model` object
    print(f"\nPredicted class: {id2label[predictions.item()]} \n")


classify(text_to_classify)

Text from eval_dataset: Maybe I'm biased because the F-16 is my favorite fighter aircraft - although the F-14 is probably second or third - but I liked this movie. The sequels (Iron Eagle II and III) don't measure up acting and plot wise, but the first one - along with Top Gun - have excellent flying and music, along with reasonable plots and acting. II and III clearly have much less of a "flight budget", but their main drawback is plot and acting. I suspect the relative fame and popularity of Iron Eagle compared to Top Gun is almost entirely a reflection of the fame and popularity of Jason Gedrick compared to Tom Cruise. Another plus (for me) is an all too brief appearance by Shawnee Smith. 7/10 

label from eval_dataset:POSITIVE

Predicted class: POSITIVE 



**The inference classified the text as POSITIVE which matches the label in the dataset.**
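
`classify` prints only the arg-max label. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below shows that transformation in plain Python on hypothetical logits. With the real model, one would presumably apply `torch.softmax(outputs.logits, dim=-1)` inside `classify` instead.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-1.2, 2.3]   # hypothetical values, not model output

probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)   # arg-max over class ids

print(id2label[predicted], round(probs[predicted], 3))
```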