<a href="https://colab.research.google.com/github/ralph27/ZAKA-hands-on/blob/master/Fine_Tuning_BERT_with_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT with LoRA
---
© 2023, Zaka AI, Inc. All Rights Reserved.


**Objective:** In this practical exercise, we will apply the LoRA fine-tuning approach. Our main objective is to fine-tune a BERT model to classify movie reviews as positive or negative. The dataset consists of polarized positive and negative reviews, and our goal is to improve the model's accuracy through fine-tuning.



## Importing Needed Packages

### Prerequisite Libraries:

1. **transformers**: Hugging Face transformers library, which is used for working with pre-trained transformer-based models, such as BERT.
2. **peft**: Hugging Face library for parameter-efficient fine-tuning (PEFT) techniques such as LoRA.
3. **evaluate**: A library that makes evaluating and comparing models and reporting their performance easier and more standardized.

In [None]:
!pip install -q transformers
!pip install -q peft
!pip install -q evaluate


## Dataset Loading

The selected dataset for this Notebook is the **IMDB** dataset. The decision to use this dataset is based on two reasons.
- First, the dataset offers a substantial amount of data, providing a rich resource for training machine learning models effectively.
- Second, the dataset's straightforward columns are well-suited for the purpose of comparing various machine learning algorithms.


Let's start by downloading and loading the dataset:

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

## Data Preprocessing

### Tokenization
Tokenization converts each review into token IDs. To feed batches of sequences to the model at the same time:
- Reviews need to be padded to the same length.
- Reviews need to be truncated to match the model's maximum input length.
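As a rough illustration of the two requirements above, the sketch below pads or truncates a list of token IDs to a fixed length and builds the matching attention mask. The helper `pad_or_truncate` and the token IDs are hypothetical and only serve the example; the real tokenizer does this internally and, unlike this sketch, keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(kept)                   # padding needed, 0 if truncated
    input_ids = kept + [pad_id] * n_pad              # padding: fill up with pad_id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs, used only to show both cases.
short_ids, short_mask = pad_or_truncate([101, 1188, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1188, 2523, 1108, 1632, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sequence comes out with the same length, a batch can be stacked into one rectangular tensor, and the attention mask tells the model which positions to ignore.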

To perform the tokenization of the data, we also need to choose a pre-trained tokenizer. For our case, the basic model (bert-base-cased) will be sufficient:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

### Splitting dataset into Train and Test Sets
It is important to evaluate the model on data it has not seen during training, while preserving the balance of positive and negative reviews. In our case the IMDB dataset already ships with separate train and test splits, so we only need to fetch them. To keep training fast, we shuffle each split with a fixed seed, so the selection is reproducible, and keep the first 1,000 examples of each.

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
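
The effect of shuffling with a fixed seed and then selecting the first rows can be sketched in plain Python. This is only an analogy: the hypothetical helper `shuffle_and_select` uses Python's `random` module, not the generator used by the datasets library, so the indices it picks will differ from the ones in the notebook.

```python
import random


def shuffle_and_select(n_examples, n_keep, seed):
    """Shuffle the indices 0..n_examples-1 with a fixed seed, keep the first n_keep."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # same seed -> same permutation
    return indices[:n_keep]


first = shuffle_and_select(25000, 1000, seed=42)
second = shuffle_and_select(25000, 1000, seed=42)
other = shuffle_and_select(25000, 1000, seed=7)

print(first == second)   # the same seed reproduces the same subset
print(first == other)    # a different seed gives a different subset
```

Fixing the seed means that rerunning the notebook trains and evaluates on the same 1,000 examples each time, which keeps the reported accuracies comparable between runs.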

### Defining LoRA Configurations
Since we are using the peft library, we can define a LoRA configuration now and apply it to the model later, without much boilerplate code.

- r: the rank of the update matrices, given as an int. A lower rank results in smaller update matrices with fewer trainable parameters.
- lora_alpha: the scaling factor applied to the LoRA update matrices.
- lora_dropout: the dropout probability used in the LoRA layers, which helps reduce overfitting.
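
To make these parameters more concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer, assuming the standard formulation in which the frozen weight W is left untouched and a low-rank product B·A, scaled by lora_alpha / r, is added to its output. The 768×768 shape matches one of BERT-base's attention projections; the names `W`, `A`, `B` and `lora_forward` are illustrative and not part of the peft API, and lora_dropout (which would act on the adapter branch during training) is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 768, 768        # shape of one attention projection in BERT-base
r, lora_alpha = 1, 1          # same values as in the configuration used here

W = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in))       # trainable, randomly initialised
B = np.zeros((d_out, r))             # trainable, initialised to zero


def lora_forward(x, W, A, B, r, lora_alpha):
    """Frozen linear layer plus the scaled low-rank update."""
    scaling = lora_alpha / r
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(4, d_in))       # a toy batch of 4 input vectors
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, r, lora_alpha)

full_params = W.size                 # parameters updated by full fine-tuning
lora_params = A.size + B.size        # parameters updated by LoRA

print(np.allclose(base_out, lora_out))
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pre-trained one, and training only has to learn the small matrices A and B instead of the full weight.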

In [None]:
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=1, lora_alpha=1, lora_dropout=0.1
)

### Loading the pre-trained BERT
Now we will load the pre-trained BERT model with a sequence classification head and set the number of labels to 2, since our task is binary (positive vs. negative).

In [None]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    num_labels=2
)

Downloading model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Adding LoRA to our Model
To apply the LoRA configuration we defined earlier, we use the get_peft_model function, passing it our model and the lora_config variable.

In [None]:
from peft import get_peft_model
model = get_peft_model(model, lora_config)
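
A back-of-the-envelope estimate shows how few parameters remain trainable after this step. The sketch assumes that adapters are attached to the query and value projections of each of BERT-base's 12 layers and that the classification head stays trainable; both are assumptions about peft's defaults for this model and task type, so the real numbers may differ with the library version or configuration.

```python
hidden_size = 768        # BERT-base hidden size
num_layers = 12          # BERT-base encoder layers
num_labels = 2
r = 1                    # LoRA rank from lora_config
adapted_per_layer = 2    # assumption: query and value projections

# Each adapter holds A (r x hidden_size) and B (hidden_size x r).
params_per_adapter = r * (hidden_size + hidden_size)
lora_params = num_layers * adapted_per_layer * params_per_adapter

# Classification head: one weight matrix plus one bias vector.
classifier_params = hidden_size * num_labels + num_labels

estimated_trainable = lora_params + classifier_params
print(params_per_adapter, lora_params, classifier_params, estimated_trainable)
```

To check the estimate against the actual model, the object returned by get_peft_model provides a print_trainable_parameters() method that reports the trainable and total parameter counts.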

## Creating our Metrics Function
In this section we create a compute_metrics function that will be used during training. We evaluate our model on accuracy, and this function is passed to the Trainer( ) class that trains our model with the LoRA configuration.

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
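
To see what compute_metrics does with its inputs, the following sketch reproduces the same steps by hand on a toy batch. The logits and labels are made up for illustration, and the accuracy is computed directly with NumPy instead of through the evaluate library.

```python
import numpy as np

# Toy model outputs: one row per review, one column per class (0 = negative, 1 = positive).
logits = np.array([
    [2.0, -1.0],
    [0.3, 0.9],
    [-0.5, 0.1],
    [1.2, 0.4],
])
labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print(predictions, accuracy)
```

Here three of the four predictions match their labels, so the accuracy is 0.75.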

## Training our Model with LoRA
Finally, we will train our model and see how the accuracy improves with each epoch. The Trainer( ) class handles the training loop for us; we only need to pass in the variables and functions we created earlier.

In [None]:
from transformers import TrainingArguments, Trainer

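training_args = TrainingArguments(
    output_dir="test_trainer",       # directory where checkpoints are written
    evaluation_strategy="epoch",     # evaluate at the end of every epoch (newer transformers releases call this eval_strategy)
    num_train_epochs=25,
)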

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.679898 | 0.595 |
| 2 | No log | 0.672862 | 0.606 |
| 3 | No log | 0.662353 | 0.629 |
| 4 | 0.693400 | 0.64318 | 0.664 |
| 5 | 0.693400 | 0.616434 | 0.686 |
| 6 | 0.693400 | 0.580412 | 0.703 |
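
The "No log" entries in the first three epochs can be explained with a little arithmetic. The sketch assumes the Trainer defaults that were not overridden above, namely a batch size of 8 on a single device and a training-loss log every 500 steps; if the defaults differ in your installed version, the numbers change accordingly.

```python
import math

n_train = 1000          # size of small_train_dataset
batch_size = 8          # assumed default per_device_train_batch_size, single device
logging_steps = 500     # assumed default logging interval
num_epochs = 25         # num_train_epochs from training_args

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

# The first training-loss value is only available once logging_steps
# optimisation steps have been completed.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
```

With 125 steps per epoch, the first log at step 500 falls at the end of epoch 4, which matches the results above. The same logged value is then repeated for the following epochs until the next log at step 1000.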
