# Transfer Learning using a Pre-trained Transformer Model for Text Classification

**Approach**

In this notebook, we solve a natural language processing (NLP) text classification task by fine-tuning a pre-trained Transformer model, i.e., by transfer learning. The pre-trained model is obtained from the Hugging Face Transformers library.


- Step 1: Load the raw dataset
- Step 2: Tokenize the raw dataset
- Step 3: Create a data collator
- Step 4: Create train and test dataset loader objects
- Step 5: Instantiate a pre-trained model from the model checkpoint & compile the model
- Step 6: Train the model
- Step 7: Model evaluation



**Dataset**

MRPC (Microsoft Research Paraphrase Corpus) is a text classification dataset. It consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing).

MRPC is one of the 10 datasets composing the GLUE benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.


**Acknowledgment**

This notebook is adapted from the following resources.
- https://huggingface.co/learn/nlp-course/chapter3/1?fw=tf

- Natural Language Processing with Transformers (Revised Edition) By Lewis Tunstall, Leandro von Werra, Thomas Wolf (O’Reilly, 2022)


In [None]:
!pip install datasets evaluate transformers[sentencepiece]


**Store the Fine-Tuned Model on the Hugging Face Hub**

We can store the fine-tuned model in a Hugging Face Hub cloud repository. This makes it easier to reuse the fine-tuned model.

We will use the push_to_hub API for this purpose. To use it, we need a Hugging Face account (sign up at: https://huggingface.co/welcome). Then, get an authentication token and enter it after running the following cell.


In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification
from transformers.keras_callbacks import PushToHubCallback
import evaluate

import os
import numpy as np
from sklearn.metrics import classification_report
import tensorflow as tf


**Step 1: Load the raw dataset**

The load_dataset function returns a dictionary-like object of type DatasetDict, with one Dataset per split.


In [None]:
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [None]:

# Access each pair of sentences in the raw_datasets object by indexing
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [None]:
# Inspect the features of the raw_train_dataset object
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
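
The ClassLabel feature above stores labels as integers and keeps the class names alongside them. As a minimal illustration of that mapping, here is a pure-Python stand-in; `LABEL_NAMES`, `int_to_label`, and `label_to_int` are hypothetical helpers written for this sketch, not part of the 🤗 Datasets API.

```python
# Hypothetical stand-in for the ClassLabel feature shown above.
# The names are copied from the features output of the MRPC dataset.
LABEL_NAMES = ["not_equivalent", "equivalent"]


def int_to_label(label_id: int) -> str:
    """Map an integer class id to its human-readable name."""
    return LABEL_NAMES[label_id]


def label_to_int(name: str) -> int:
    """Map a class name back to its integer id."""
    return LABEL_NAMES.index(name)


print(int_to_label(1))                 # equivalent
print(label_to_int("not_equivalent"))  # 0
```

So the first training example, whose label is 1, is a pair of sentences marked as paraphrases.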

**Step 2: Tokenize the raw dataset**

Tokenization is the process of converting text into numbers (token IDs) that the model can process.
This is done with a tokenizer object.

- Step 2(a): Instantiate a tokenizer object
- Step 2(b): Define a function to tokenize the input
- Step 2(c): Tokenize the batches
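
To make the output of the tokenizer less abstract, the following toy sketch encodes a sentence pair the way a BERT-style tokenizer lays it out: `[CLS] sentence1 [SEP] sentence2 [SEP]`. It is only an illustration under strong assumptions: `VOCAB` and `toy_encode_pair` are hypothetical, the vocabulary is tiny, and words are split on whitespace, whereas the real tokenizer uses WordPiece subwords and a much larger vocabulary.

```python
# Toy vocabulary (hypothetical); the ids are made up for this sketch.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "cat": 5, "sat": 6, "a": 7, "was": 8, "sitting": 9}


def toy_encode_pair(sentence1, sentence2, max_length=None):
    """Encode a sentence pair as [CLS] sentence1 [SEP] sentence2 [SEP]."""
    def ids(text):
        # Lowercase, split on whitespace, and map unknown words to [UNK]
        return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]

    first = [VOCAB["[CLS]"]] + ids(sentence1) + [VOCAB["[SEP]"]]
    second = ids(sentence2) + [VOCAB["[SEP]"]]
    input_ids = first + second
    # Segment ids: 0 for the first sentence, 1 for the second
    token_type_ids = [0] * len(first) + [1] * len(second)
    if max_length is not None:
        # Naive right truncation; a real tokenizer keeps the special tokens
        input_ids = input_ids[:max_length]
        token_type_ids = token_type_ids[:max_length]
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": [1] * len(input_ids)}


encoded = toy_encode_pair("The cat sat", "A cat was sitting")
print(encoded["input_ids"])       # [2, 4, 5, 6, 3, 7, 5, 8, 9, 3]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The three keys match the columns that tokenization adds to the dataset: input_ids, token_type_ids, and attention_mask.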


In [None]:
'''
Step 2(a): Instantiate a tokenizer object
This is done by using a suitable model checkpoint
'''
MODEL_CHECKPOINT = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)



'''
Step 2(b): Define a function to tokenize the input

Setting truncation=True truncates the sequences
    that are longer than the model's maximum length
    (e.g., 512 tokens for BERT or DistilBERT):
model_inputs = tokenizer(sequences, truncation=True)

To truncate the sequences to a specific maximum length instead,
pass max_length as well:
tokenizer(example["sentence1"], example["sentence2"],
          max_length=8, truncation=True)
'''
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

'''
- Step 2(c): Tokenize the batches

We want to keep the data as a dataset object after tokenization.
So, we use the Dataset map() method for tokenization,
by enabling it to utilize the tokenizer function.
The map() method works by applying a function to each element of the dataset.
This gives us the flexibility to apply additional preprocessing
via the map() method.

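We set batched=True so that the function is applied to multiple elements
of the dataset at once instead of one element at a time, which speeds up tokenization.
The datasets from the 🤗 Datasets library are Apache Arrow files stored on the disk,
so only the samples that are requested are loaded in the RAM.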

Note that we didn't pad the samples.
Because it's more efficient to apply padding during the creation of the batches.
In such a case, we only need to pad to the maximum length in that batch,
and not the maximum length in the entire dataset.
This will save time and processing power
when the inputs have very variable lengths!
'''
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

**Step 3: Create a data collator**

The samples in a batch are put together by a collate function.
By default, this function converts the samples to tf.Tensor and
concatenates them (recursively if the elements are lists,
tuples, or dictionaries).

We can't use the default function because our inputs have variable lengths.
To address the variable-length issue, we apply padding during batching.
Padding during batching is efficient because it avoids
over-long inputs with a lot of padding.

To apply the correct amount of padding to the items
of the dataset in a batch, we use DataCollatorWithPadding.
It takes a tokenizer when we instantiate it (to know which padding token to use,
and whether the model expects padding on the left or on the right of the inputs) and does everything we need.
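
The idea of dynamic padding can be sketched in a few lines of plain Python. This is an illustration only, not the implementation of DataCollatorWithPadding: `pad_batch` is a hypothetical helper, the token ids are made up, and right padding with a pad id of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids with lengths 3, 5, and 2
batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]])
print([len(row) for row in batch["input_ids"]])  # [5, 5, 5]
print(batch["attention_mask"][2])                # [1, 1, 0, 0, 0]
```

Each batch is padded only to the length of its own longest sequence, so a batch of short sentences stays short.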


In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

'''
Inspect the collated batched data

For inspecting the padding added via the collator function,
we look at a few samples from our training set that we would like to batch together.
Here, we remove the columns idx, sentence1, and sentence2 as they won’t be needed
and contain strings (and we can’t create tensors with strings) and
have a look at the lengths of each entry in the batch.
'''

samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
print([len(x) for x in samples["input_ids"]])


'''
Check the dynamic padding of the batch via the data_collator.
We will see that the length of each entry in the batch is set
to the length of the max length entry in the same batch
'''
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[50, 59, 47, 67, 59, 50, 62, 32]


{'input_ids': TensorShape([8, 67]),
 'token_type_ids': TensorShape([8, 67]),
 'attention_mask': TensorShape([8, 67]),
 'labels': TensorShape([8])}

**Hyperparameters**

In [None]:
BATCH_SIZE = 8

MAX_EPOCHS = 4


'''
The initial learning rate is used by the optimizers, e.g., SGD, ADAM, NADAM, etc.

Note that transformer models benefit from a much lower learning rate than the default for Adam, which is 1e-3.
A much smaller rate, e.g., 5e-5, is a better starting point.
'''
INITIAL_LEARNING_RATE = 2e-5

WEIGHT_DECAY = 0.01

**Step 4: Create train and test dataset loader objects**

Create the train and validation datasets by putting together the
tokenized dataset (step 2) and the data collator (step 3) via the to_tf_dataset() method.

It will wrap a tf.data.Dataset around the dataset, with an optional collation function.

The tf.data.Dataset is a native TensorFlow format that Keras can use for model.fit().
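
As a rough picture of what batching does to the split sizes, the sketch below groups sample indices into batches in plain Python. `make_batches` is a hypothetical helper, and the sketch assumes the final partial batch is kept rather than dropped; it does not reproduce how to_tf_dataset() works internally.

```python
import random


def make_batches(num_samples, batch_size, shuffle=False, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(num_samples))
    if shuffle:
        # Shuffle the sample order before cutting it into batches
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


# Split sizes taken from the DatasetDict output; batch size 8 as in BATCH_SIZE
train_batches = list(make_batches(3668, 8, shuffle=True))
val_batches = list(make_batches(408, 8))
print(len(train_batches), len(train_batches[-1]))  # 459 4
print(len(val_batches), len(val_batches[-1]))      # 51 8
```

Under that assumption, one epoch over the training split takes 459 steps, with the last batch holding only 4 samples.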


In [None]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=BATCH_SIZE,
)

tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=BATCH_SIZE,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


**Step 5: Instantiate a pre-trained model from the model checkpoint & compile**

Before we instantiate and compile the model, we need to define the optimizer
and the loss function.
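
Before looking at the Keras objects, it may help to see the two formulas in plain Python. The sketch below is an illustration under stated assumptions, not the Keras implementation: `linear_decay_lr` follows the polynomial decay formula with power 1.0, `sparse_ce_from_logits` computes the cross-entropy of a single sample from its logits, and the default of 1836 decay steps assumes 459 batches per epoch for 4 epochs.

```python
import math


def linear_decay_lr(step, initial_lr=2e-5, end_lr=0.0, decay_steps=1836, power=1.0):
    """Polynomial decay; with power=1.0 the rate falls linearly to end_lr."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


print(linear_decay_lr(0))      # initial learning rate
print(linear_decay_lr(918))    # half of the initial rate at the midpoint
print(linear_decay_lr(1836))   # 0.0 at the end of training
print(sparse_ce_from_logits([0.0, 0.0], 1))  # log(2): the model is undecided
```

The "sparse" in the loss name refers to the labels being integer class ids rather than one-hot vectors, which is how the MRPC labels are stored.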

In [None]:
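'''
Reset all state generated by Keras.
This clears the global state left over from previously created models
(e.g., the old graph and layer name counters), which helps to limit
memory growth when models are created repeatedly in the same session.
'''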
tf.keras.backend.clear_session()

'''
To reproduce the same result by the model in each iteration, we use fixed seeds for random number generation.
'''
np.random.seed(42)
tf.random.set_seed(42)



###################### Optimizer ##########################
# We provide various choices for the optimizer

'''
For the learning schedule, we need to set how long training is going to be, i.e., the number of training steps.
num_of_training_steps = num_of_batches_per_epoch * epochs

Since the tf_train_dataset is batched, its len() is already the number of batches per epoch
(the number of training samples divided by the batch size, rounded up when the last partial batch is kept).
'''
num_of_training_steps = len(tf_train_dataset) * MAX_EPOCHS

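'''
Scheduler: ExponentialDecay (alternative)
NOTE: this schedule is overwritten by the PolynomialDecay schedule defined next,
so it is not used for training. Its decay_rate merely reuses the value of
WEIGHT_DECAY (0.01); the decay rate of a schedule is unrelated to weight decay.
'''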
lr_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=INITIAL_LEARNING_RATE,
    decay_steps=num_of_training_steps,
    decay_rate=WEIGHT_DECAY,
    staircase=True)

'''
Scheduler: PolynomialDecay
'''
lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=INITIAL_LEARNING_RATE,
    decay_steps=num_of_training_steps,
    end_learning_rate=0.0,
    power=1.0,
    cycle=False,
    name=None
)

'''
Optimizer:

Instantiate an optimizer. Use one of the following choices.
- Fixed LR: learning_rate=INITIAL_LEARNING_RATE
- Scheduled LR: learning_rate=lr_scheduler
'''
#optimizer = tf.keras.optimizers.SGD(learning_rate=lr_scheduler, momentum=0.9, nesterov=False)
optimizer=tf.keras.optimizers.Adam(learning_rate=lr_scheduler)
#optimizer=tf.keras.optimizers.Nadam(learning_rate=lr_scheduler)
#optimizer=tfa.optimizers.AdamW(learning_rate=lr_scheduler, weight_decay=WEIGHT_DECAY)
#optimizer=tfa.optimizers.LAMB(learning_rate=lr_scheduler, weight_decay_rate=WEIGHT_DECAY)


###################### Loss Function ##########################

'''
Loss Function:

Instantiate a function to compute the training loss (per iteration/step).
NOTE: the "reduction" argument should be set to the value AUTO (it's the default value).
AUTO indicates that the reduction option will be determined by the usage context.
For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
Thus, the function will return a single scalar loss value for the entire batch.
If "reduction" is set to NONE,
then we need to apply tf.reduce_mean() function over all loss values for every instance in the batch.
'''
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.AUTO)


###################### Model Instantiation ##########################

NUM_LABELS = 2

'''
Instantiate the model & compile
For the binary classification problem, we will
use the TFAutoModelForSequenceClassification class, with two labels.
'''
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=NUM_LABELS)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])

model.summary()

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 109483778 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
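
The numbers in the summary can be checked by hand. The short calculation below assumes a hidden size of 768 for bert-base-uncased and 4 bytes per float32 parameter; the encoder parameter count is copied from the summary above.

```python
HIDDEN_SIZE = 768             # hidden size of bert-base-uncased
NUM_LABELS = 2                # binary classification head
ENCODER_PARAMS = 109_482_240  # 'bert' row of the model summary

# Dense classifier: one weight per (hidden unit, label) pair plus one bias per label
classifier_params = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS
total_params = ENCODER_PARAMS + classifier_params
size_mb = total_params * 4 / 1024 ** 2  # 4 bytes per float32 parameter

print(classifier_params)   # 1538
print(total_params)        # 109483778
print(round(size_mb, 2))   # 417.65
```

Only the 1,538 classifier parameters are newly initialized; the rest come from the pre-trained checkpoint, which is what the loading message above reports.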


**Callback Functions**

Define the following callback function.
- PushToHubCallback

In [None]:
'''
PushToHubCallback
It will sync up the fine-tuned model with the Hugging Face Hub.
First, the model will be stored (serialized) on the disk (output_dir).
Then, it will be synced.

This function will allow model reuse.
- The locally stored model can be loaded from "output_dir"
- The cloud-stored model can be loaded from the Hub

The callback makes it possible to resume training from other machines,
share the model after training is finished,
and even test the model's inference quality midway through training!
'''

# Define a name of the fine-tuned model for the callback function
model_name = MODEL_CHECKPOINT.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-classification_mrpc"


push_to_hub_callback = PushToHubCallback(
    output_dir="./model_classification_mrpc_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)


callbacks = [push_to_hub_callback]

**Step 6: Train the model**
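
The training cell below mentions mixed precision. To give a feel for what 16-bit floats can and cannot represent, this small sketch round-trips Python floats through IEEE 754 half precision using only the standard library. `to_float16` is a hypothetical helper for illustration; it says nothing about how Keras implements mixed precision.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(to_float16(1.0))    # 1.0 is represented exactly
print(to_float16(0.1))    # close to 0.1, but with a visible rounding error
print(to_float16(1e-8))   # too small for float16: underflows to 0.0
```

Very small values such as tiny gradients can vanish in float16, which is one reason mixed precision keeps part of the computation in 32-bit floats.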


In [None]:
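'''
Train in mixed-precision float16
Mixed precision is the use of both 16-bit and 32-bit floating-point types
in a model during training to make it run faster and use less memory.
NOTE: the policy only applies to layers created after it is set, so to train
this model in mixed precision, set the policy before instantiating it in Step 5.
'''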
tf.keras.mixed_precision.set_global_policy("mixed_float16")


# Fine-tune the model
model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset,
          epochs=MAX_EPOCHS,
          callbacks=callbacks)



Epoch 1/3
Epoch 2/3
Epoch 3/3

Saving the fully trained model in the SavedModel format ... 




**Step 7: Model evaluation**

We will use the predict() method of the fine-tuned model to return the logits from the output head of the model, one per class. Then, the logits are converted into predicted class labels by taking the argmax over the classes (applying softmax first would give the same labels, since it does not change the ordering). The predicted labels are used to compute the classification performance, i.e., accuracy and class-based precision, recall, and F1 score.

We will use the following three approaches to obtain the fine-tuned model.
- Use the current fine-tuned model
- Load the saved fine-tuned model from the disk
- Load the saved fine-tuned model from the Hugging Face Hub


In addition, we can load the metrics associated with the MRPC dataset using the evaluate.load() function. The object returned has a compute() method we can use to do the metric calculation.
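
The evaluation pipeline described above (logits, then predicted labels, then accuracy and F1) can be sketched in plain Python. This is an illustration with made-up logits and labels; `softmax`, `argmax`, and `accuracy_and_f1` are hypothetical helpers, not the functions used by scikit-learn or the evaluate library.

```python
import math


def softmax(logits):
    """Convert logits to class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy_and_f1(labels, preds, positive=1):
    """Accuracy over all samples and F1 score of the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


logits = [[2.0, -1.0], [0.3, 1.2], [-0.5, 0.5], [1.0, 0.9]]  # made-up values
labels = [0, 1, 0, 1]                                         # made-up labels
preds = [argmax(softmax(row)) for row in logits]
print(preds)                           # [0, 1, 1, 0]
print(accuracy_and_f1(labels, preds))  # {'accuracy': 0.5, 'f1': 0.5}
```

Softmax does not change which class has the largest score, so taking the argmax of the raw logits gives the same predicted labels.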


**Step 7 (a): Use the current fine-tuned model**

In [None]:
# Predict the logits for the validation input
logits_pred = model.predict(tf_validation_dataset)["logits"]

# Compute the predicted class labels
labels_pred = np.argmax(logits_pred, axis=1)

# Shape of the predicted logits
print("\nShape of the predicted logits: ", logits_pred.shape)

# Shape of the predicted class labels
print("Shape of the predicted class labels: ", labels_pred.shape)


'''
Create an array (NumPy) of validation labels from the validation data loader object
'''
# A list for storing the validation labels
labels_list = []

# Get the validation labels from the validation data loader object
for batch in tf_validation_dataset:
    labels_list_batch = batch[1]  # Each batch is an (inputs, labels) pair; take the labels
    labels_list.extend(labels_list_batch.numpy())  # Convert the label tensor to NumPy and append its values to the list

# Number of validation labels in the list
print("Number of validation labels in the list: ", len(labels_list))

# Convert the val label list into a NumPy array
labels = np.array(labels_list)


# Shape of the labels array
print("Shape of the labels array: ", labels.shape)


# Classification report
class_names = ['class 0', 'class 1']
print("\n-----------------------------------------------------------\n")
print(classification_report(labels, labels_pred, target_names=class_names))



Shape of the predicted logits:  (408, 2)
Shape of the predicted class labels:  (408,)
Number of validation labels in the list:  408
Shape of the labels array:  (408,)

-----------------------------------------------------------

              precision    recall  f1-score   support

     class 0       0.82      0.63      0.71       129
     class 1       0.84      0.94      0.89       279

    accuracy                           0.84       408
   macro avg       0.83      0.78      0.80       408
weighted avg       0.84      0.84      0.83       408



**Step 7 (b): Load the saved fine-tuned model from the disk**

In [None]:
# Path to the saved model on the disk
output_dir="./model_classification_mrpc_save"
NUM_LABELS = 2
model_saved = TFAutoModelForSequenceClassification.from_pretrained(output_dir, num_labels=NUM_LABELS)
logits_pred = model_saved.predict(tf_validation_dataset)["logits"]

**Step 7 (c): Load the saved fine-tuned model from the Hugging Face Hub**

In [None]:
# Fine-tuned model stored on the Hugging Face Hub
FINE_TUNED_MODEL = "hasan-mr/bert-base-uncased-finetuned-classification_mrpc"
NUM_LABELS = 2
model_saved = TFAutoModelForSequenceClassification.from_pretrained(FINE_TUNED_MODEL, num_labels=NUM_LABELS)
logits_pred = model_saved.predict(tf_validation_dataset)["logits"]

**Load the metrics associated with the MRPC dataset**

We use the evaluate.load() function to load the metrics associated with the MRPC dataset. The object returned has a compute() method that can be used to do the metric calculation.

In [None]:
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=labels_pred, references=raw_datasets["validation"]["label"])

{'accuracy': 0.8382352941176471, 'f1': 0.8877551020408163}