<a href="https://colab.research.google.com/github/susantaghosh1/nlp-notebooks/blob/develop/FineTuning_BERT_RoBERTa_DeBERTa_DistilBERT_CANINE_multi_label_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of these models work the same way: a linear layer on top of the base model produces a tensor of shape (batch_size, num_labels) that holds the unnormalized scores (**logits**) for every label, for every example in the batch.
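As a rough illustration of that classification head, here is a minimal NumPy sketch. The random `pooled_output` is only a stand-in for the sentence representation that the base model would produce, and the weight initialization is an assumption made for the example, not the library's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# 768 is the hidden size of bert-base-uncased, 11 is the number of emotion labels.
batch_size, hidden_size, num_labels = 4, 768, 11

# Stand-in for the pooled sentence representation coming out of the base model.
pooled_output = rng.normal(size=(batch_size, hidden_size))

# The classification head is a single linear layer: logits = x @ W.T + b
W = rng.normal(scale=0.02, size=(num_labels, hidden_size))
b = np.zeros(num_labels)

logits = pooled_output @ W.T + b
print(logits.shape)  # (4, 11): one unnormalized score per label, per example
```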



# Set-up environment


In [32]:
%%capture
!pip install datasets transformers[sentencepiece]
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install scipy scikit-learn

In [62]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cpu')

In [34]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [35]:
import transformers
print(transformers.__version__)

4.19.2


# Load the dataset


In [36]:
from datasets import load_dataset

dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english")



  0%|          | 0/3 [00:00<?, ?it/s]

In [37]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 886
    })
})

In [38]:
dataset['train'][0]

{'ID': '2017-En-21441',
 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'anger': False,
 'anticipation': True,
 'disgust': False,
 'fear': False,
 'joy': False,
 'love': False,
 'optimism': True,
 'pessimism': False,
 'sadness': False,
 'surprise': False,
 'trust': True}


The dataset consists of tweets, labeled with one or more emotions.

In [39]:
dataset['train'].features

{'ID': Value(dtype='string', id=None),
 'Tweet': Value(dtype='string', id=None),
 'anger': Value(dtype='bool', id=None),
 'anticipation': Value(dtype='bool', id=None),
 'disgust': Value(dtype='bool', id=None),
 'fear': Value(dtype='bool', id=None),
 'joy': Value(dtype='bool', id=None),
 'love': Value(dtype='bool', id=None),
 'optimism': Value(dtype='bool', id=None),
 'pessimism': Value(dtype='bool', id=None),
 'sadness': Value(dtype='bool', id=None),
 'surprise': Value(dtype='bool', id=None),
 'trust': Value(dtype='bool', id=None)}

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [40]:
labels = [each_key for each_key in dataset['train'].features.keys() if each_key not in ['ID','Tweet']]
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

In [41]:
id2label = {idx : label for idx,label in enumerate(labels)}
label2id = {label : idx for idx,label in enumerate(labels)}

In [42]:
label2id

{'anger': 0,
 'anticipation': 1,
 'disgust': 2,
 'fear': 3,
 'joy': 4,
 'love': 5,
 'optimism': 6,
 'pessimism': 7,
 'sadness': 8,
 'surprise': 9,
 'trust': 10}

In [43]:
id2label

{0: 'anger',
 1: 'anticipation',
 2: 'disgust',
 3: 'fear',
 4: 'joy',
 5: 'love',
 6: 'optimism',
 7: 'pessimism',
 8: 'sadness',
 9: 'surprise',
 10: 'trust'}

# Tokenize Data


In [44]:
model_checkpoint = "bert-base-uncased"
batch_size = 16

In [45]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,use_fast=True)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/resolve/ma

In [46]:
dataset['train'][0]


{'ID': '2017-En-21441',
 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'anger': False,
 'anticipation': True,
 'disgust': False,
 'fear': False,
 'joy': False,
 'love': False,
 'optimism': True,
 'pessimism': False,
 'sadness': False,
 'surprise': False,
 'trust': True}

As we can see, this dataset doesn't contain any numeric labels, and transformers models only understand numeric labels. Since this is a multi-label problem, each sample gets a vector of 11 floats, one per label in **['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']**. Floats are used because the loss function applied later (BCEWithLogitsLoss) expects float targets. The labels for the first sample will therefore be **[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.]**

In [47]:
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

In [48]:
dataset['train'][0].keys()

dict_keys(['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'])

In [49]:
def make_labels(sample):
  # Build the multi-hot target in the fixed order of `labels`, so the result
  # does not depend on the column order of the sample.
  # Booleans are converted to floats (True -> 1.0, False -> 0.0).
  return {'labels': [float(sample[label]) for label in labels]}

In [50]:
dataset_with_labels = dataset.map(make_labels)



In [51]:
dataset_with_labels

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 886
    })
})

In [52]:
def tokenize_tweet(sample):
  return tokenizer(sample['Tweet'],truncation=True,padding='max_length',max_length=128)
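
To make the effect of `truncation=True`, `padding='max_length'` and `max_length` more concrete, here is a toy sketch in plain Python. The helper `pad_or_truncate` and the token ids are made up for illustration. The real tokenizer does more (for example, it keeps the special tokens in place when truncating), so this only shows the resulting shapes and the meaning of the attention mask.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Toy version of truncation + padding to a fixed length (hypothetical helper)."""
    ids = token_ids[:max_length]          # truncation: drop everything past max_length
    n_pad = max_length - len(ids)         # padding: fill up to max_length
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([5, 6, 7], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)  # {'input_ids': [5, 6, 7, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5, 6], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Every example ends up with exactly `max_length` entries, which is why all rows of the encoded dataset can be stacked into one tensor per batch.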

In [53]:
encoded_dataset = dataset_with_labels.map(tokenize_tweet,batched=True)



  0%|          | 0/4 [00:00<?, ?ba/s]



In [54]:
encoded_dataset = encoded_dataset.remove_columns(column_names=['ID','Tweet','anger','anticipation','disgust','fear'
,'joy','love','optimism','pessimism','sadness','surprise','trust'])

In [55]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 886
    })
})

In [56]:
encoded_dataset['train'][0]['labels']

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
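
As a quick sanity check, the multi-hot vector above can be mapped back to label names with `id2label`. The sketch below rebuilds `labels` and `id2label` so that it runs on its own. The helper name `decode` is made up for this example.

```python
labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
          'optimism', 'pessimism', 'sadness', 'surprise', 'trust']
id2label = {idx: label for idx, label in enumerate(labels)}

def decode(multi_hot):
    """Return the names of all labels whose entry in the vector is 1.0."""
    return [id2label[idx] for idx, value in enumerate(multi_hot) if value == 1.0]

first_sample_labels = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
decoded = decode(first_sample_labels)
print(decoded)  # ['anticipation', 'optimism', 'trust']
```

This matches the `True` entries of the first training example shown earlier.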

# Model Training

Here we define a model in which a pre-trained base (i.e. the weights from bert-base-uncased) is loaded, with a randomly initialized classification head (a linear layer) on top. This head should be fine-tuned, together with the pre-trained base, on a labeled dataset.

The warning printed when the model is loaded says the same thing.

We set the problem_type to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely BCEWithLogitsLoss). We also make sure the output layer has len(labels) output neurons, and we set the id2label and label2id mappings.
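
To give an idea of what this loss computes, here is a minimal sketch in plain Python for a single example. It applies a sigmoid to each logit independently and averages the binary cross-entropy over the labels. The function names are made up for this illustration, and the numerically stable formula is a standard rewrite, not a copy of the PyTorch implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def naive_bce(logits, targets):
    """Same quantity via an explicit sigmoid (fine for small logits)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]   # float targets, one per label

loss = bce_with_logits(logits, targets)
print(abs(loss - naive_bce(logits, targets)) < 1e-12)  # True
```

Because every label gets its own sigmoid, the predicted probabilities do not have to sum to 1, which is what allows several labels to be active for the same text.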

In [57]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           problem_type="multi_label_classification")

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "anger",
    "1": "anticipation",
    "2": "disgust",
    "3": "fear",
    "4": "joy",
    "5": "love",
    "6": "optimism",
    "7": "pessimism",
    "8": "sadness",
    "9": "surprise",
    "10": "trust"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "anger": 0,
    "anticipation": 1,
    "disgust": 2,
    "fear": 3,
    "joy": 4,
    "love": 5,
    "optimism": 6,
 

In [58]:
metric_name = "f1"
training_args = TrainingArguments(
    output_dir="bert-finetuned-sem_eval-english",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True, metric_for_best_model=metric_name,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


We are also going to compute metrics while training. For this, we need to define a compute_metrics function, that returns a dictionary with the desired metric values.

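Note that in the multi-label case with binary label indicators, scikit-learn's accuracy_score computes subset accuracy: a sample counts as correct only if all of its labels are predicted correctly.

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5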
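
The 0.5 above can be reproduced by hand. The following NumPy sketch checks, row by row, whether all labels match and then averages the result. It illustrates the metric and is not scikit-learn's own implementation.

```python
import numpy as np

y_true = np.array([[0, 1], [1, 1]])
y_pred = np.ones((2, 2))

# A row only counts as correct if every one of its labels matches.
exact_match = (y_true == y_pred).all(axis=1)   # [False, True]
subset_accuracy = exact_match.mean()
print(subset_accuracy)  # 0.5
```

The first row has one wrong label out of two and therefore counts as fully wrong, so this metric is much stricter than a per-label accuracy.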

In [59]:
import torch
from transformers.trainer_utils import EvalPrediction
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
threshold = 0.5
def compute_metrics(p : EvalPrediction):
  # predictions and label_ids cover the whole evaluation set
  logits,labels = p.predictions,p.label_ids
  print(f"shape of logits {logits.shape} and shape of labels {labels.shape}") # should be (num_examples,num_labels)
  # first, apply sigmoid to the logits to get one independent probability per label
  sigmoid = torch.nn.Sigmoid()
  predicted_probability = sigmoid(torch.Tensor(logits))
  print(f"shape of predicted prob {predicted_probability.shape}") # should be (num_examples,num_labels)
  # next, use the threshold (0.5) to turn the probabilities into 0/1 predictions
  # y_pred starts as a matrix of zeros with the same shape as predicted_probability
  y_pred = np.zeros(predicted_probability.shape)
  # put a 1 in y_pred at every position where the predicted probability is >= threshold
  y_pred[np.where(predicted_probability >= threshold)] = 1
  y_true = labels
  print(f"shape of y_pred {y_pred.shape} and shape of y_true {y_true.shape}") # should be (num_examples,num_labels)
  f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
  # note: roc_auc is computed here on the thresholded predictions, not on the probabilities
  roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
  accuracy = accuracy_score(y_true, y_pred)
  # return as dictionary
  metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
  return metrics
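
To see what the thresholding and the micro-averaged F1 in compute_metrics amount to, here is a small worked example in NumPy. The logits and labels are made up for illustration. Micro-averaging pools true positives, false positives and false negatives over all labels before computing F1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits and targets for 2 examples and 3 labels.
logits = np.array([[ 2.0, -1.0,  0.5],
                   [-0.5,  1.5, -2.0]])
y_true = np.array([[1, 0, 0],
                   [1, 1, 0]])

# Threshold the per-label probabilities at 0.5.
y_pred = (sigmoid(logits) >= 0.5).astype(int)   # [[1, 0, 1], [0, 1, 0]]

# Pool the counts over every (example, label) cell.
tp = int(((y_pred == 1) & (y_true == 1)).sum())  # 2
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # 1
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # 1

micro_f1 = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn, round(micro_f1, 4))  # 2 1 1 0.6667
```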


In [60]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [61]:
trainer.train()

***** Running training *****
  Num examples = 6838
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2140
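
The 2140 total optimization steps reported in the log follow directly from the dataset size, the batch size and the number of epochs. The short check below assumes a single device, one gradient accumulation step (as stated in the log) and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 6838
batch_size = 16
num_epochs = 5

# The last partial batch of an epoch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 428 2140
```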


Epoch,Training Loss,Validation Loss


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-61-3435b262f1ae>", line 1, in <module>
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1321, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1554, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2201, in training_step
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, a

KeyboardInterrupt: ignored