<a href="https://colab.research.google.com/github/susantaghosh1/nlp-notebooks/blob/develop/FineTuning_BERT_RoBERTa_DeBERTa_DistilBERT_CANINE_multi_label_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores [***LOGITS***] for a number of labels for every example in the batch.
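
As a rough illustration of that classification head, the dependency-free sketch below applies a linear layer to one pooled vector per example. The hidden size of 4, the 3 labels, and all weight values are made up for readability; bert-base-uncased uses a hidden size of 768, and the real head is a `torch.nn.Linear` layer whose weights are learned during fine-tuning.

```python
def linear_head(pooled, weight, bias):
    """Compute logits = pooled @ weight.T + bias for a batch, using plain lists."""
    logits = []
    for h in pooled:  # one pooled vector per example
        row = [sum(w_i * h_i for w_i, h_i in zip(w, h)) + b
               for w, b in zip(weight, bias)]
        logits.append(row)
    return logits

# Toy values (assumptions): batch_size=2, hidden_size=4, num_labels=3.
pooled = [[1.0, 0.0, -1.0, 2.0],
          [0.5, 0.5, 0.5, 0.5]]
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 1.0],
          [1.0, 1.0, 1.0, 1.0]]
bias = [0.0, -1.0, 0.5]

logits = linear_head(pooled, weight, bias)
print(len(logits), len(logits[0]))  # 2 3, i.e. (batch_size, num_labels)
```

Each row holds one unnormalized score per label, which is the (batch_size, num_labels) shape described above.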



# Set-up environment


In [3]:
%%capture
!pip install datasets transformers[sentencepiece]
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install scipy scikit-learn

In [1]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [4]:
!nvidia-smi

Wed May 18 15:59:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
import transformers
print(transformers.__version__)

4.19.2


# Load the dataset


In [6]:
from datasets import load_dataset

dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english")

Downloading and preparing dataset sem_eval2018_task1/subtask5.english (download: 5.70 MiB, generated: 1.24 MiB, post-processed: Unknown size, total: 6.94 MiB) to /root/.cache/huggingface/datasets/sem_eval2018_task1/subtask5.english/1.1.0/a7c0de8b805f1988b118882fb289ccfbbeb9085c7820b6f046b5887e234af182...

Dataset sem_eval2018_task1 downloaded and prepared to /root/.cache/huggingface/datasets/sem_eval2018_task1/subtask5.english/1.1.0/a7c0de8b805f1988b118882fb289ccfbbeb9085c7820b6f046b5887e234af182. Subsequent calls will reuse this data.

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 886
    })
})

In [8]:
dataset['train'][0]

{'ID': '2017-En-21441',
 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'anger': False,
 'anticipation': True,
 'disgust': False,
 'fear': False,
 'joy': False,
 'love': False,
 'optimism': True,
 'pessimism': False,
 'sadness': False,
 'surprise': False,
 'trust': True}


The dataset consists of tweets, labeled with one or more emotions.

In [9]:
dataset['train'].features

{'ID': Value(dtype='string', id=None),
 'Tweet': Value(dtype='string', id=None),
 'anger': Value(dtype='bool', id=None),
 'anticipation': Value(dtype='bool', id=None),
 'disgust': Value(dtype='bool', id=None),
 'fear': Value(dtype='bool', id=None),
 'joy': Value(dtype='bool', id=None),
 'love': Value(dtype='bool', id=None),
 'optimism': Value(dtype='bool', id=None),
 'pessimism': Value(dtype='bool', id=None),
 'sadness': Value(dtype='bool', id=None),
 'surprise': Value(dtype='bool', id=None),
 'trust': Value(dtype='bool', id=None)}

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [10]:
labels = [each_key for each_key in dataset['train'].features.keys() if each_key not in ['ID','Tweet']]
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

In [11]:
id2label = {idx : label for idx,label in enumerate(labels)}
label2id = {label : idx for idx,label in enumerate(labels)}

In [12]:
label2id

{'anger': 0,
 'anticipation': 1,
 'disgust': 2,
 'fear': 3,
 'joy': 4,
 'love': 5,
 'optimism': 6,
 'pessimism': 7,
 'sadness': 8,
 'surprise': 9,
 'trust': 10}

In [13]:
id2label

{0: 'anger',
 1: 'anticipation',
 2: 'disgust',
 3: 'fear',
 4: 'joy',
 5: 'love',
 6: 'optimism',
 7: 'pessimism',
 8: 'sadness',
 9: 'surprise',
 10: 'trust'}

# Tokenize Data


In [14]:
model_checkpoint = "bert-base-uncased"
batch_size = 16

In [15]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,use_fast=True)

In [16]:
dataset['train'][0]


{'ID': '2017-En-21441',
 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'anger': False,
 'anticipation': True,
 'disgust': False,
 'fear': False,
 'joy': False,
 'love': False,
 'optimism': True,
 'pessimism': False,
 'sadness': False,
 'surprise': False,
 'trust': True}

As we can see, this dataset doesn't contain numeric labels, while transformers models expect numeric labels. Because this is a multi-label problem, each sample gets a vector of 11 floats, one per label in **['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']**. The labels for the first sample are therefore **[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.]**

In [17]:
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

In [18]:
dataset['train'][0].keys()

dict_keys(['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'])

In [19]:
def make_labels(sample):
  # Build a multi-hot float vector in the fixed order of `labels`,
  # instead of relying on the key order of `sample`.
  label_list = [1. if sample[label] else 0. for label in labels]
  return {'labels':label_list}

In [20]:
dataset_with_labels = dataset.map(make_labels)

In [21]:
dataset_with_labels

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 886
    })
})

In [22]:
def tokenize_tweet(sample):
  return tokenizer(sample['Tweet'],truncation=True,padding='max_length',max_length=128)
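
To make the effect of `truncation=True, padding='max_length'` concrete, here is a small pure-Python sketch of what happens to a sequence of token IDs. The helper `pad_or_truncate` and `max_length=8` are illustrative assumptions. In the bert-base-uncased vocabulary, 101 and 102 are `[CLS]` and `[SEP]` and 0 is `[PAD]`; the middle IDs are arbitrary. The real tokenizer also keeps the final `[SEP]` when it truncates, which this sketch ignores.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 0 marks padding positions
    return ids + [pad_id] * n_pad, attention_mask

short = [101, 4737, 2003, 102]
ids, mask = pad_or_truncate(short, max_length=8)
print(ids)   # [101, 4737, 2003, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding every tweet to the same length (128 in this notebook) lets examples be stacked into a batch without further processing, at the cost of some wasted computation on padding positions.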

In [23]:
encoded_dataset = dataset_with_labels.map(tokenize_tweet,batched=True)

In [24]:
encoded_dataset = encoded_dataset.remove_columns(column_names=['ID','Tweet','anger','anticipation','disgust','fear'
,'joy','love','optimism','pessimism','sadness','surprise','trust'])

In [25]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 886
    })
})

In [26]:
encoded_dataset['train'][0]['labels']

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
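
As a quick sanity check, the multi-hot vector can be mapped back to label names. The sketch below rebuilds the label list from this notebook by hand so that it runs on its own; the `decode` helper is a hypothetical name, not part of any library.

```python
labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
          'optimism', 'pessimism', 'sadness', 'surprise', 'trust']
id2label = dict(enumerate(labels))

def decode(multi_hot):
    """Return the names of all labels whose entry is 1."""
    return [id2label[i] for i, v in enumerate(multi_hot) if v == 1.0]

first = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(decode(first))  # ['anticipation', 'optimism', 'trust']
```

This matches the three emotions marked `True` for the first training tweet.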

# Model Training

Here we define a model that consists of a pre-trained base (i.e. the weights from bert-base-uncased are loaded) with a randomly initialized classification head (linear layer) on top. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.

The warning printed when loading the model says the same thing.

We set the problem_type to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely BCEWithLogitsLoss). We also make sure the output layer has len(labels) output neurons, and we set the id2label and label2id mappings.
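
For intuition about that loss, the sketch below computes binary cross-entropy on logits in plain Python, treating each label as an independent yes/no decision and averaging over all entries. It uses the standard numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)). The function name and the toy inputs are assumptions; in practice `torch.nn.BCEWithLogitsLoss` does this for you.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (example, label) entry."""
    total, count = 0.0, 0
    for row_x, row_y in zip(logits, targets):
        for x, y in zip(row_x, row_y):
            # stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count

undecided = bce_with_logits([[0.0, 0.0]], [[1.0, 0.0]])
confident_right = bce_with_logits([[8.0, -8.0]], [[1.0, 0.0]])
confident_wrong = bce_with_logits([[8.0, -8.0]], [[0.0, 1.0]])
print(undecided, confident_right, confident_wrong)
```

A logit of 0 (probability 0.5) costs log(2) per label, a confident correct prediction costs almost nothing, and a confident wrong prediction is penalized roughly in proportion to the size of the logit.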

In [27]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           problem_type="multi_label_classification")

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [28]:
metric_name="f1"
training_args = TrainingArguments(output_dir="bert-finetuned-sem_eval-english",
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",learning_rate=2e-5,per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,num_train_epochs=5,
                                  weight_decay=0.01,load_best_model_at_end=True,metric_for_best_model=metric_name,)
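
These settings determine how many optimization steps the run takes. A small worked calculation, assuming a single device and no gradient accumulation (as in this notebook):

```python
import math

num_train_examples = 6838   # size of the train split shown earlier
batch_size = 16
num_train_epochs = 5

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 428 2140
```

With `save_strategy="epoch"`, a checkpoint is written every 428 steps, which is why the checkpoint directories are numbered 428, 856, 1284, 1712 and 2140.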

We are also going to compute metrics while training. For this, we need to define a compute_metrics function that returns a dictionary with the desired metric values.

Note that in the multilabel case with binary label indicators, accuracy_score computes subset accuracy: a sample counts as correct only if all of its labels match.

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
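
The same number can be reproduced by hand, which helps explain why accuracy tends to be much lower than F1 in multi-label settings: a single wrong label makes the whole sample count as wrong. The sketch below is a plain-Python stand-in for illustration, not scikit-learn's implementation.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

y_true = [[0, 1], [1, 1]]
y_pred = [[1, 1], [1, 1]]
print(subset_accuracy(y_true, y_pred))  # 0.5
```

Three of the four individual label decisions are correct here, yet only one of the two samples is an exact match.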

In [29]:
from transformers.trainer_utils import EvalPrediction
import numpy as np
import torch
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
threshold = 0.5
def compute_metrics(p : EvalPrediction):
  logits,labels = p.predictions,p.label_ids
  print(f"shape of logits {logits.shape} and shape of labels {labels.shape}") # should be (num_examples,num_labels)
  # first, apply sigmoid on the logits, which are of shape (num_examples, num_labels)
  sigmoid = torch.nn.Sigmoid()
  predicted_probability = sigmoid(torch.Tensor(logits))
  print(f"shape of predicted prob {predicted_probability.shape}") # should be (num_examples,num_labels)
  # next, use the threshold (0.5) to turn the probabilities into a dense binary matrix of 0s and 1s
  y_pred = np.zeros(predicted_probability.shape)
  # y_pred starts as all zeros with the same shape as predicted_probability;
  # put a 1 at every position where the probability is >= threshold
  y_pred[np.where(predicted_probability >= threshold)] = 1
  y_true = labels
  print(f"shape of y_pred {y_pred.shape} and shape of y_true {y_true.shape}") # should be (num_examples,num_labels)
  f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
  # note: ROC AUC is computed here on the thresholded predictions, not on the probabilities
  roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
  accuracy = accuracy_score(y_true, y_pred)
  # return as dictionary
  metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
  return metrics
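
To show what `average='micro'` means for F1, here is a minimal plain-Python sketch: true positives, false positives and false negatives are pooled over every (example, label) entry before the score is computed. The function name and toy data are assumptions, and returning 0.0 when there are no positives at all is a choice made for this sketch.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (example, label) entries."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = [[0, 1, 0], [1, 1, 0]]
y_pred = [[0, 1, 1], [1, 0, 0]]
score = micro_f1(y_true, y_pred)
print(score)  # 2 TP, 1 FP, 1 FN -> 4 / 6
```

Because partially correct samples still contribute their true positives, micro-F1 is far more forgiving than subset accuracy.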


In [30]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [31]:
trainer.train()

***** Running training *****
  Num examples = 6838
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2140


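| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|-------|---------------|-----------------|----|---------|----------|
| 1 | No log | 0.329871 | 0.648546 | 0.754606 | 0.251693 |
| 2 | 0.388500 | 0.30898 | 0.692484 | 0.78774 | 0.292325 |
| 3 | 0.286700 | 0.302713 | 0.703403 | 0.794613 | 0.286682 |
| 4 | 0.250400 | 0.306067 | 0.702473 | 0.796571 | 0.275395 |
| 5 | 0.225000 | 0.30691 | 0.705413 | 0.798554 | 0.277652 |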


***** Running Evaluation *****
  Num examples = 886
  Batch size = 16


shape of logits (886, 11) and shape of labels (886, 11)
shape of predicted prob torch.Size([886, 11])
shape of y_pred (886, 11) and shape of y_true (886, 11)


Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-428
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-428/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-428/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-428/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-428/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 886
  Batch size = 16


shape of logits (886, 11) and shape of labels (886, 11)
shape of predicted prob torch.Size([886, 11])
shape of y_pred (886, 11) and shape of y_true (886, 11)


Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-856
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-856/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-856/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-856/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-856/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 886
  Batch size = 16


shape of logits (886, 11) and shape of labels (886, 11)
shape of predicted prob torch.Size([886, 11])
shape of y_pred (886, 11) and shape of y_true (886, 11)


Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-1284
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-1284/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-1284/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-1284/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-1284/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 886
  Batch size = 16


shape of logits (886, 11) and shape of labels (886, 11)
shape of predicted prob torch.Size([886, 11])
shape of y_pred (886, 11) and shape of y_true (886, 11)


Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-1712
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-1712/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-1712/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-1712/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-1712/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 886
  Batch size = 16


shape of logits (886, 11) and shape of labels (886, 11)
shape of predicted prob torch.Size([886, 11])
shape of y_pred (886, 11) and shape of y_true (886, 11)


Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-2140
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-2140/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-2140/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-2140/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-2140/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert-finetuned-sem_eval-english/checkpoint-2140 (score: 0.7054128211524071).


TrainOutput(global_step=2140, training_loss=0.28310641440275675, metrics={'train_runtime': 797.4061, 'train_samples_per_second': 42.877, 'train_steps_per_second': 2.684, 'total_flos': 2249123476753920.0, 'train_loss': 0.28310641440275675, 'epoch': 5.0})
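
The choice of checkpoint-2140 follows from `load_best_model_at_end=True` with `metric_for_best_model="f1"`. The sketch below imitates that selection using the validation F1 values from the training table, assuming higher is better (as it is for F1). It only illustrates the logic and is not the Trainer's own code.

```python
# Validation micro-F1 per epoch, copied from the training table above.
f1_by_epoch = {1: 0.648546, 2: 0.692484, 3: 0.703403, 4: 0.702473, 5: 0.705413}
steps_per_epoch = 428  # ceil(6838 / 16)

best_epoch = max(f1_by_epoch, key=f1_by_epoch.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_checkpoint)  # 5 checkpoint-2140
```

Note that the validation loss was lowest after epoch 3, so selecting on `eval_loss` instead would have picked a different checkpoint.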

In [32]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 886
  Batch size = 16


shape of logits (886, 11) and shape of labels (886, 11)
shape of predicted prob torch.Size([886, 11])
shape of y_pred (886, 11) and shape of y_true (886, 11)


{'epoch': 5.0,
 'eval_accuracy': 0.27765237020316025,
 'eval_f1': 0.7054128211524071,
 'eval_loss': 0.3069103956222534,
 'eval_roc_auc': 0.7985542533990069,
 'eval_runtime': 6.7973,
 'eval_samples_per_second': 130.347,
 'eval_steps_per_second': 8.239}

# Inference

In [33]:
text = "I'm happy I can finally train a model for multi-label classification"

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

outputs = trainer.model(**encoding)

In [35]:
logits = outputs.logits
logits.shape

torch.Size([1, 11])

In [36]:
logits

tensor([[-3.8801, -2.1663, -3.9700, -4.3732,  3.5939, -0.9269,  0.7458, -4.4428,
         -3.4969, -3.0755, -2.5687]], device='cuda:0',
       grad_fn=<AddmmBackward0>)

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, so that every score becomes a number between 0 and 1 that can be interpreted as a "probability" of how certain the model is that a given label applies to the input text.

Next, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).

In [39]:
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits)
probs

tensor([[0.0202, 0.1028, 0.0185, 0.0125, 0.9732, 0.2836, 0.6783, 0.0116, 0.0294,
         0.0441, 0.0712]], device='cuda:0', grad_fn=<SigmoidBackward0>)

In [40]:
probs = sigmoid(logits.squeeze().cpu())
probs

tensor([0.0202, 0.1028, 0.0185, 0.0125, 0.9732, 0.2836, 0.6783, 0.0116, 0.0294,
        0.0441, 0.0712], grad_fn=<SigmoidBackward0>)
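
The final thresholding step can be sketched in plain Python using the logits printed above. The label list is rebuilt by hand so the block runs on its own, and the variable names are assumptions.

```python
import math

labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
          'optimism', 'pessimism', 'sadness', 'surprise', 'trust']
# Logits copied from the model output above.
logits = [-3.8801, -2.1663, -3.9700, -4.3732, 3.5939, -0.9269, 0.7458,
          -4.4428, -3.4969, -3.0755, -2.5687]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

threshold = 0.5
probs = [sigmoid(x) for x in logits]
predicted = [label for label, p in zip(labels, probs) if p >= threshold]
print(predicted)  # ['joy', 'optimism']
```

Since sigmoid(x) >= 0.5 exactly when x >= 0, a threshold of 0.5 on the probabilities is equivalent to keeping every label with a non-negative logit.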