# Finetuning DistilBERT for Multi Label Classification

I recently started learning about LLMs and the kinds of problems they can solve. Open-source LLMs, trained on huge text corpora, generally handle natural language tasks better than NLP models built from scratch. To get the best results on a specific task, however, one needs to fine-tune them.

When I say fine-tune, I mean giving these LLMs task-specific context. One way to do that is to retrain them on a specific task with relevant data.

In this post, I fine-tune one open-source model, DistilBERT, for multi-label classification.

## Problem Statement

The Kaggle competition movie-genre-prediction provides the data and the problem statement for this multi-label classification exercise.

The inputs are each movie's synopsis and name, and the task is to classify each movie into one of ten genres.

Let's explore the data.

### Data Exploration

In [27]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

train_data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Datasets/train.csv')
train_data.head()

Unnamed: 0,id,movie_name,synopsis,genre
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,fantasy
1,50185,Entity Project,A director and her friends renting a haunted h...,horror
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,family
3,78522,Blood Glacier,Scientists working in the Austrian Alps discov...,scifi
4,2206,Apat na anino,Buy Day - Four Men Widely - Apart in Life - By...,action


Now that we have loaded the data, we can see that the movie_name and synopsis fields are the inputs and genre is the class we have to predict. Let's find the number of classes and how the data is distributed.

In [None]:
train_data['genre'].value_counts()

fantasy      5400
horror       5400
family       5400
scifi        5400
action       5400
crime        5400
adventure    5400
mystery      5400
romance      5400
thriller     5400
Name: genre, dtype: int64

There are 10 classes and each movie is assigned exactly one class.

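This dataset is fairly large for an academic exercise, so the original plan was to sample 10% of the rows from each class. Note that the cell below uses frac=1, which keeps every row and only shuffles the rows within each genre, so all 5,400 rows per class are retained. Setting frac=0.1 would draw the 10% sample.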

In [None]:
train_data_sample = train_data.groupby('genre', group_keys=False).apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
train_data_sample.head()

Unnamed: 0,id,movie_name,synopsis,genre
0,8346,Queen Crab,A young girl steals her dad's growth experimen...,action
1,5234,Shi mei chu ma,An abused woman's journey from near death to k...,action
2,7156,Voices,VOICES is an intense thriller set in 2010 betw...,action
3,6945,Dalapathi,Ram fights to save his love Vaidehi and to bri...,action
4,295,Allegiance,High-tech mercenaries unwittingly sabotage the...,action


In [None]:
train_data_sample['genre'].value_counts()

action       5400
adventure    5400
crime        5400
family       5400
fantasy      5400
horror       5400
mystery      5400
romance      5400
scifi        5400
thriller     5400
Name: genre, dtype: int64
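
If a smaller, 10% stratified sample is wanted, one option is the `sample` method of a pandas groupby. The following is a minimal sketch on a made-up toy frame; the `toy` data and the `random_state` value are assumptions for illustration, not part of the competition data.

```python
import pandas as pd

# Toy stand-in for train_data: 20 rows per genre (made-up data).
toy = pd.DataFrame({
    "synopsis": [f"plot {i}" for i in range(40)],
    "genre": ["action", "horror"] * 20,
})

# Draw 10% of the rows from every genre; random_state makes it repeatable.
sampled = (
    toy.groupby("genre")
       .sample(frac=0.1, random_state=42)
       .reset_index(drop=True)
)

counts = sampled["genre"].value_counts()
print(counts)
```

Each genre contributes the same share of rows, so the classes stay balanced after sampling.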

## Data Pre-Processing

The next step is to prepare the input data for the model. To fine-tune the model, we need input text and expected labels as the training set, and the same for validation.

Looking at one record from the training data:

In [None]:
train_data_sample[:1]

Unnamed: 0,id,movie_name,synopsis,genre
0,8346,Queen Crab,A young girl steals her dad's growth experimen...,action


We intend to pass both synopsis and movie_name as input, so we concatenate them into a single input field.

In [None]:
concatenated_train_text = train_data_sample['synopsis'] + " " + train_data_sample['movie_name']
concatenated_train_text.head(1)

0    A young girl steals her dad's growth experimen...
dtype: object
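
One thing to watch for with this approach: if either column has a missing value, the concatenated result for that row is missing too. Below is a hedged sketch of a guard, using a made-up two-row frame. The `toy` data is hypothetical, and whether the real dataset contains missing names is not checked here.

```python
import pandas as pd

# Made-up rows; the second one has no movie name.
toy = pd.DataFrame({
    "synopsis": ["A crab attacks a town.", "A heist goes wrong."],
    "movie_name": ["Queen Crab", None],
})

# Plain concatenation propagates the missing value.
plain = toy["synopsis"] + " " + toy["movie_name"]

# Replacing missing values with empty strings keeps the synopsis usable.
safe = (toy["synopsis"].fillna("") + " " + toy["movie_name"].fillna("")).str.strip()

print(plain.isna().tolist())
print(safe.tolist())
```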

Now we look at the labels. For each record we need an array with 10 columns, one per genre, recording which label applies. Nine of the ten columns will be 0 and one will be 1. I used OneHotEncoder to build it.

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
train_labels_all = np.array(train_data_sample['genre'].tolist()).reshape(-1,1)

enc = OneHotEncoder(handle_unknown='ignore')

label_onehot = enc.fit_transform(train_labels_all).toarray()

label_onehot[0]

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The next step is to assign IDs to these labels.

In [None]:
unique_labels = enc.categories_[0].tolist()
unique_labels

['action',
 'adventure',
 'crime',
 'family',
 'fantasy',
 'horror',
 'mystery',
 'romance',
 'scifi',
 'thriller']

In [None]:
id2label = {idx: label for idx, label in enumerate(unique_labels)}
id2label

{0: 'action',
 1: 'adventure',
 2: 'crime',
 3: 'family',
 4: 'fantasy',
 5: 'horror',
 6: 'mystery',
 7: 'romance',
 8: 'scifi',
 9: 'thriller'}

In [None]:
label2id = {label: idx for idx, label in enumerate(unique_labels)}
label2id

{'action': 0,
 'adventure': 1,
 'crime': 2,
 'family': 3,
 'fantasy': 4,
 'horror': 5,
 'mystery': 6,
 'romance': 7,
 'scifi': 8,
 'thriller': 9}
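
With these mappings, a one-hot row can be turned back into a genre name by locating the position of the 1. A small sketch follows; the helper name `decode_one_hot` is made up for illustration.

```python
# Same ten genres, in the order produced by the encoder above.
unique_labels = ["action", "adventure", "crime", "family", "fantasy",
                 "horror", "mystery", "romance", "scifi", "thriller"]
id2label = {idx: label for idx, label in enumerate(unique_labels)}
label2id = {label: idx for idx, label in id2label.items()}

def decode_one_hot(row):
    """Return the label names whose column is set to 1 (hypothetical helper)."""
    return [id2label[idx] for idx, value in enumerate(row) if value == 1]

print(decode_one_hot([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]))  # ['action']
print(label2id["horror"])  # 5
```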

## Model Run - Preparation

To prepare the data for the model run, we first split it into training and validation sets. We will use train_test_split from scikit-learn for that.

In [None]:
#Split the data
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(concatenated_train_text, label_onehot, test_size=.2)

train_texts[0]

"A young girl steals her dad's growth experiment infused grapes and feeds them to a pet crab. Years later, the now gigantic crustacean attacks the town! Queen Crab"

In [None]:
train_labels[0]

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])
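
Note that the two cells above do not necessarily show the same record. `train_texts` is a pandas Series that keeps its original index after the shuffle, so `train_texts[0]` looks up the row labelled 0, while `train_labels` is a NumPy array and `train_labels[0]` is the first row by position. A toy sketch of the difference, with made-up data:

```python
import pandas as pd

# A Series whose index is out of order, as train_test_split leaves it.
texts = pd.Series(["crab movie", "ghost movie", "heist movie"], index=[2, 0, 1])

by_label = texts[0]          # row whose index label is 0
by_position = texts.iloc[0]  # first row in the current order

print(by_label)     # ghost movie
print(by_position)  # crab movie
```

Using `.iloc[0]` on the Series lines it up with position 0 of the label array.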

Any NLP task requires us to tokenize the text, that is, convert it into numbers the model can process. I have chosen DistilBERT for this exercise, so the tokenizer has to be DistilBERT's as well, since each model has its own tokenizer. We will explore this further in another post. Because the synopses are fairly short, I have capped the token sequence length at 128. This can be tuned based on the inputs.

In [None]:
!pip install transformers==4.28.0

Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.16.4 tokenizers-0.13.3 transformers-4.28.0


In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encoding = tokenizer(train_texts.tolist(), padding="max_length", truncation=True, max_length=128)
val_encoding = tokenizer(val_texts.tolist(), padding="max_length", truncation=True, max_length=128)
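
To make the effect of padding to max_length and truncation concrete, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: the function name `pad_or_truncate` and the toy IDs are made up, it assumes a pad ID of 0, and it ignores details such as keeping the closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token IDs to exactly max_length (illustrative only)."""
    ids = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad              # padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)
print(short)
print(long)
```

Every sequence comes out with the same length, and the attention mask tells the model which positions are padding.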


Now that we have tokenized the data, we need to create custom datasets for the model to train on. The model requires input IDs, attention mask and labels in one dataset, so I have created the class below to define our custom dataset, MoviesDataset.

In [None]:
import torch

class MoviesDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MoviesDataset(train_encoding, train_labels)
val_dataset = MoviesDataset(val_encoding, val_labels)


Below is the model definition, where we explicitly pick DistilBERT for this exercise. Also important for text classification is setting the problem type to multi-label classification explicitly, so that each of the 10 classes is scored independently and the model is trained with a binary cross-entropy loss.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(unique_labels),
                                                           id2label=id2label,
                                                           label2id=label2id
                                                        )


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

The code below defines the training arguments. I have used a small batch size of 8 and 3 epochs for this academic exercise. With more computing power, training for more epochs may improve the scores, though it is worth watching the validation loss for signs of overfitting.

The code below also defines the metrics, which are calculated overall and for each class.

In [None]:
batch_size = 8
metric_name = "f1"

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    disable_tqdm=False,
    output_dir = 'FineTuning_DistilBERT_MultiClass',
    #push_to_hub=True,
)


In [None]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report
from transformers import EvalPrediction
import torch

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)

    target_names = unique_labels

    # calculate per-class metrics
    reports = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
    reports_dict = {}
    for label, metrics in reports.items():
        if label in target_names:
            reports_dict[f"{label}_precision"] = metrics["precision"]
            reports_dict[f"{label}_recall"] = metrics["recall"]
            reports_dict[f"{label}_f1-score"] = metrics["f1-score"]

    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}

    # merge dictionary
    metrics = dict(**metrics, **reports_dict)

    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
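
As a worked illustration of what multi_label_metrics computes, the sketch below applies the sigmoid and the 0.5 threshold to made-up logits and derives the micro-averaged F1 by hand with NumPy. The logits and labels are hypothetical.

```python
import numpy as np

# Made-up logits and one-hot labels for 2 records and 3 classes.
logits = np.array([[2.0, -1.0, 0.5],
                   [-0.5, -2.0, -1.0]])
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])

probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
y_pred = (probs >= 0.5).astype(int)     # threshold at 0.5

# Micro averaging pools true/false positives over all classes.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
f1_micro = 2 * tp / (2 * tp + fp + fn)

# Subset accuracy: a record counts only if every label matches.
exact_match = float((y_pred == y_true).all(axis=1).mean())

print(y_pred.tolist())  # [[1, 0, 1], [0, 0, 0]]
print(f1_micro)         # 0.5
print(exact_match)      # 0.0
```

The second record gets no label at all because none of its probabilities reaches 0.5. With thresholding, a record can end up with zero, one or several predicted labels, which is why the accuracy here is stricter than the per-class scores suggest.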


Before fine-tuning, we pass a single record through the model to check that it produces output of the expected shape. The logits should be an array of shape (1, 10).

In [None]:
#forward pass
outputs = model(input_ids=torch.tensor(train_dataset.encodings['input_ids'][0]).unsqueeze(0),
                labels=torch.tensor(train_dataset.labels[0]).unsqueeze(0),
                )
outputs


SequenceClassifierOutput(loss=tensor(0.6941, dtype=torch.float64,
       grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.0111,  0.0107,  0.0010,  0.0084,  0.0348,  0.1002,  0.0169, -0.0788,
          0.0201, -0.0179]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
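
The loss printed above can be reproduced by hand. With the multi-label problem type, the loss is binary cross-entropy on the logits, averaged over the 10 labels. The sketch below plugs in the rounded logits from the output and assumes the label row is the one shown earlier for train_labels[0], with a 1 in the fifth position.

```python
import math

# Rounded logits copied from the output above.
logits = [-0.0111, 0.0107, 0.0010, 0.0084, 0.0348,
          0.1002, 0.0169, -0.0788, 0.0201, -0.0179]
# Assumed label row (the one printed for train_labels[0]).
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

def bce_with_logits(x, y):
    """Binary cross-entropy for one logit x and target y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-x))  # sigmoid
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Average the per-label losses.
loss = sum(bce_with_logits(x, y) for x, y in zip(logits, labels)) / len(logits)
print(round(loss, 4))
```

An untrained classification head produces logits near zero, so every probability is about 0.5 and the loss sits close to ln 2, roughly 0.693.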

## Model Training

We pass our custom datasets, the training arguments and the metrics function to the Trainer. It will train in batches of 8 for 3 epochs. Looking at the results, you will see that as training progresses the model weights are updated, and the F1 and accuracy on the validation set improve from epoch to epoch.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy,Action Precision,Action Recall,Action F1-score,Adventure Precision,Adventure Recall,Adventure F1-score,Crime Precision,Crime Recall,Crime F1-score,Family Precision,Family Recall,Family F1-score,Fantasy Precision,Fantasy Recall,Fantasy F1-score,Horror Precision,Horror Recall,Horror F1-score,Mystery Precision,Mystery Recall,Mystery F1-score,Romance Precision,Romance Recall,Romance F1-score,Scifi Precision,Scifi Recall,Scifi F1-score,Thriller Precision,Thriller Recall,Thriller F1-score
1,0.2513,0.248716,0.260206,0.577253,0.168796,0.666667,0.026762,0.051458,0.62069,0.016349,0.031858,0.520309,0.244768,0.332921,0.568862,0.371094,0.449173,0.636364,0.026022,0.05,0.48913,0.26087,0.340265,0.517241,0.013838,0.026954,0.659884,0.423507,0.515909,0.560937,0.337723,0.421609,0.0,0.0,0.0
2,0.2298,0.249531,0.306352,0.596682,0.213241,0.559211,0.15165,0.238596,0.537671,0.142598,0.225413,0.535503,0.164695,0.251914,0.554483,0.392578,0.459691,0.491289,0.131041,0.206897,0.465177,0.342029,0.394209,0.481818,0.048893,0.088777,0.649222,0.428172,0.51602,0.545985,0.351834,0.427918,0.461538,0.010667,0.020851
3,0.2045,0.257846,0.337061,0.61356,0.256759,0.470464,0.19893,0.279624,0.486486,0.179837,0.262599,0.468531,0.243858,0.320766,0.533499,0.419922,0.469945,0.411877,0.199814,0.269086,0.443167,0.335266,0.381738,0.429752,0.143911,0.215619,0.609314,0.439366,0.510569,0.509918,0.411101,0.455208,0.28866,0.024889,0.045827


TrainOutput(global_step=16200, training_loss=0.23257998690193082, metrics={'train_runtime': 1781.3164, 'train_samples_per_second': 72.755, 'train_steps_per_second': 9.094, 'total_flos': 4292556042240000.0, 'train_loss': 0.23257998690193082, 'epoch': 3.0})
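
The global_step value can be sanity-checked against the data size. The arithmetic below assumes a single device and no gradient accumulation, which matches the arguments shown above.

```python
import math

total_rows = 10 * 5400                       # 10 genres x 5,400 movies each
train_rows = total_rows - total_rows // 5    # test_size=.2 holds out one fifth
batch_size = 8
epochs = 3

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(train_rows, steps_per_epoch, total_steps)  # 43200 5400 16200
```

The result matches the 16,200 steps reported, which confirms that training ran on the full dataset rather than on a 10% sample.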

In [None]:
trainer.evaluate()


{'eval_loss': 0.257845938205719,
 'eval_f1': 0.33706089704631087,
 'eval_roc_auc': 0.6135596707818929,
 'eval_accuracy': 0.25675925925925924,
 'eval_action_precision': 0.4704641350210971,
 'eval_action_recall': 0.19892952720785012,
 'eval_action_f1-score': 0.2796238244514107,
 'eval_adventure_precision': 0.4864864864864865,
 'eval_adventure_recall': 0.17983651226158037,
 'eval_adventure_f1-score': 0.2625994694960212,
 'eval_crime_precision': 0.46853146853146854,
 'eval_crime_recall': 0.24385805277525022,
 'eval_crime_f1-score': 0.32076600837821667,
 'eval_family_precision': 0.533498759305211,
 'eval_family_recall': 0.419921875,
 'eval_family_f1-score': 0.46994535519125685,
 'eval_fantasy_precision': 0.4118773946360153,
 'eval_fantasy_recall': 0.19981412639405205,
 'eval_fantasy_f1-score': 0.26908635794743435,
 'eval_horror_precision': 0.44316730523627074,
 'eval_horror_recall': 0.3352657004830918,
 'eval_horror_f1-score': 0.3817381738173818,
 'eval_mystery_precision': 0.429752066115702

## Final Test



In [None]:
test_synopysis = "A group of four teenage friends become trapped in a Mexican border tunnel where they fall prey, one-by one, to tortured ghosts who haunt it."
test_movie_name = "Intermedio"
test_concatenated = test_synopysis+" "+test_movie_name

encoding = tokenizer(test_concatenated, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

outputs = trainer.model(**encoding)


In [None]:
logits = outputs.logits
logits.shape

torch.Size([1, 10])

In [None]:
# apply sigmoid + threshold
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted id's into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

['horror']


I passed one synopsis and movie name as input to see the predicted class. Reading the synopsis, you can tell it is a horror movie, and our model predicted the same. :)
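
Because every movie in this dataset has exactly one genre, another way to decode the output is to take the single highest-scoring label instead of thresholding at 0.5, which can return an empty list when no probability crosses the threshold. A sketch with made-up logits follows; the values are hypothetical, not model output.

```python
import numpy as np

id2label = {0: "action", 1: "adventure", 2: "crime", 3: "family", 4: "fantasy",
            5: "horror", 6: "mystery", 7: "romance", 8: "scifi", 9: "thriller"}

# Hypothetical logits for one movie: every value is below 0,
# so no sigmoid probability reaches 0.5.
logits = np.array([-2.1, -1.7, -1.9, -3.0, -1.2, -0.4, -0.9, -2.8, -1.5, -0.8])
probs = 1.0 / (1.0 + np.exp(-logits))

# Thresholding keeps nothing here; argmax still returns the best guess.
thresholded = [id2label[i] for i, p in enumerate(probs) if p >= 0.5]
top_label = id2label[int(np.argmax(probs))]

print(thresholded)  # []
print(top_label)    # horror
```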

## Next Steps

The model's accuracy can be improved further with the following steps:

- Increase the number of epochs
- Try different LLMs with more parameters, such as BART or T5

See you in the next post.
