<a href="https://colab.research.google.com/github/susantaghosh1/nlp-notebooks/blob/develop/FineTuning_BERT_RoBERTa_DeBERTa_DistilBERT_CANINE_multi_label_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we fine-tune BERT to predict one or more labels for a given piece of text. The notebook uses the bert-base-uncased checkpoint, but a RoBERTa, DeBERTa, DistilBERT, CANINE or similar checkpoint can be fine-tuned in the same way.

All of these models work in the same way: a linear layer on top of the base model produces a tensor of shape (batch_size, num_labels) that holds the unnormalized scores (***logits***) for each label, for every example in the batch.
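
To make the idea of logits concrete, here is a minimal sketch in plain Python (no model involved) of how a (batch_size, num_labels) grid of logits could be turned into multi-label predictions: a sigmoid is applied to every score independently and the result is compared to a threshold. The helper names, the toy logits and the 0.5 threshold are assumptions made for illustration; they are not part of this notebook's training code.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unnormalized score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multi_label(logits, threshold=0.5):
    """Turn a (batch_size, num_labels) nested list of logits into 0/1 predictions.

    Every label is decided independently, so a row may contain zero,
    one, or several 1s.
    """
    return [[1 if sigmoid(score) >= threshold else 0 for score in row]
            for row in logits]


# Hypothetical logits for a batch of 2 examples and 3 labels.
toy_logits = [[2.0, -1.0, 0.5],
              [-3.0, -0.2, -1.5]]
toy_predictions = predict_multi_label(toy_logits)
print(toy_predictions)  # [[1, 0, 1], [0, 0, 0]]
```

Unlike a softmax, the per-label sigmoid does not force the scores of one example to sum to 1, which is what allows several labels to be active at once.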



# Set-up environment


In [4]:
%%capture
!pip install datasets transformers[sentencepiece]
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install scipy scikit-learn

In [1]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [2]:
!nvidia-smi

Wed May 18 03:54:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    10W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
import transformers
print(transformers.__version__)

4.19.2


# Load the dataset


In [6]:
from datasets import load_dataset

dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english")

Downloading and preparing dataset sem_eval2018_task1/subtask5.english (download: 5.70 MiB, generated: 1.24 MiB, post-processed: Unknown size, total: 6.94 MiB) to /root/.cache/huggingface/datasets/sem_eval2018_task1/subtask5.english/1.1.0/a7c0de8b805f1988b118882fb289ccfbbeb9085c7820b6f046b5887e234af182...


Dataset sem_eval2018_task1 downloaded and prepared to /root/.cache/huggingface/datasets/sem_eval2018_task1/subtask5.english/1.1.0/a7c0de8b805f1988b118882fb289ccfbbeb9085c7820b6f046b5887e234af182. Subsequent calls will reuse this data.


In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 886
    })
})

In [8]:
dataset['train'][0]

{'ID': '2017-En-21441',
 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'anger': False,
 'anticipation': True,
 'disgust': False,
 'fear': False,
 'joy': False,
 'love': False,
 'optimism': True,
 'pessimism': False,
 'sadness': False,
 'surprise': False,
 'trust': True}


The dataset consists of tweets, labeled with one or more emotions.

In [9]:
dataset['train'].features

{'ID': Value(dtype='string', id=None),
 'Tweet': Value(dtype='string', id=None),
 'anger': Value(dtype='bool', id=None),
 'anticipation': Value(dtype='bool', id=None),
 'disgust': Value(dtype='bool', id=None),
 'fear': Value(dtype='bool', id=None),
 'joy': Value(dtype='bool', id=None),
 'love': Value(dtype='bool', id=None),
 'optimism': Value(dtype='bool', id=None),
 'pessimism': Value(dtype='bool', id=None),
 'sadness': Value(dtype='bool', id=None),
 'surprise': Value(dtype='bool', id=None),
 'trust': Value(dtype='bool', id=None)}

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [11]:
labels = [each_key for each_key in dataset['train'].features.keys() if each_key not in ['ID','Tweet']]
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

In [12]:
id2label = {idx : label for idx,label in enumerate(labels)}
label2id = {label : idx for idx,label in enumerate(labels)}

In [14]:
label2id

{'anger': 0,
 'anticipation': 1,
 'disgust': 2,
 'fear': 3,
 'joy': 4,
 'love': 5,
 'optimism': 6,
 'pessimism': 7,
 'sadness': 8,
 'surprise': 9,
 'trust': 10}

In [13]:
id2label

{0: 'anger',
 1: 'anticipation',
 2: 'disgust',
 3: 'fear',
 4: 'joy',
 5: 'love',
 6: 'optimism',
 7: 'pessimism',
 8: 'sadness',
 9: 'surprise',
 10: 'trust'}
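
As a quick illustration of how the two mappings can be used, the sketch below encodes a set of label names into a multi-hot vector with label2id and decodes it back with id2label. The `encode` and `decode` helpers are hypothetical names introduced here; they are not part of datasets or transformers.

```python
labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
          'optimism', 'pessimism', 'sadness', 'surprise', 'trust']
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}


def encode(active_labels):
    """Label names -> multi-hot vector of floats, one slot per label."""
    vector = [0.0] * len(label2id)
    for name in active_labels:
        vector[label2id[name]] = 1.0
    return vector


def decode(multi_hot):
    """Multi-hot vector -> names of the labels whose slot is set to 1."""
    return [id2label[idx] for idx, flag in enumerate(multi_hot) if flag == 1]


# The emotions marked True for the first training example shown earlier.
first_example = encode(['anticipation', 'optimism', 'trust'])
print(first_example)
print(decode(first_example))
```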

# Tokenize Data


In [15]:
model_checkpoint = "bert-base-uncased"
batch_size = 16

In [16]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,use_fast=True)

In [27]:
dataset['train'][0]


{'ID': '2017-En-21441',
 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'anger': False,
 'anticipation': True,
 'disgust': False,
 'fear': False,
 'joy': False,
 'love': False,
 'optimism': True,
 'pessimism': False,
 'sadness': False,
 'surprise': False,
 'trust': True}

As the example above shows, the dataset stores each emotion in its own boolean column instead of providing numeric labels, whereas transformers models only understand numeric labels. Because this is a multi-label problem, each sample gets a vector of 11 floats, one per label, in the order **['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']**. The labels for the first sample are therefore **[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.]**.
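
One way to see why the targets are floats: multi-label classification is usually trained with binary cross-entropy applied to each label independently, which compares a float target (0. or 1.) with the sigmoid of the corresponding logit. The following is a rough pure-Python sketch of that loss, under the assumption that this is the loss being used; `bce_with_logits` and the toy logits are illustrative names and values, not taken from the notebook.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between logits and float 0./1. targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Label vector of the first sample, as given in the text.
targets = [0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.]

# Hypothetical logits: an undecided model vs. a confident, correct one.
neutral_logits = [0.0] * len(targets)
good_logits = [5.0 if y == 1.0 else -5.0 for y in targets]

loss_neutral = bce_with_logits(neutral_logits, targets)
loss_good = bce_with_logits(good_logits, targets)
print(loss_neutral)  # ln(2), about 0.693: every label is a coin flip
print(loss_good)     # much smaller: every label is scored on the right side
```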

In [28]:
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

In [21]:
dataset['train'][0].keys()

dict_keys(['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'])

In [39]:
def make_labels(sample):
  # Build the multi-hot vector in the fixed order of `labels`,
  # so that position i always corresponds to id2label[i].
  return {'labels': [1. if sample[label] else 0. for label in labels]}

In [42]:
dataset_with_labels = dataset.map(make_labels)

Loading cached processed dataset at /root/.cache/huggingface/datasets/sem_eval2018_task1/subtask5.english/1.1.0/a7c0de8b805f1988b118882fb289ccfbbeb9085c7820b6f046b5887e234af182/cache-fcfac134198c81c0.arrow


In [44]:
dataset_with_labels

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust', 'labels'],
        num_rows: 886
    })
})

In [45]:
def tokenize_tweet(sample):
  # Truncate or pad every tweet to exactly 128 tokens.
  return tokenizer(sample['Tweet'], truncation=True, padding='max_length', max_length=128)
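
To illustrate what truncation=True, padding='max_length' and max_length=128 do to a single sequence, here is a simplified sketch in plain Python. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation; in particular, the real tokenizer keeps the closing special token when it truncates, which this sketch does not attempt. The ids are the 26 token ids of the first training example as printed in this notebook's output, and the pad id 0 matches the zeros seen there.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Token ids of the first training example (26 ids, special tokens included).
first_ids = [101, 1523, 4737, 2003, 1037, 2091, 7909, 2006, 1037, 3291,
             2017, 2089, 2196, 2031, 1005, 1012, 11830, 11527, 1012, 1001,
             14354, 1001, 4105, 1001, 4737, 102]

input_ids, attention_mask = pad_or_truncate(first_ids, max_length=128)
print(len(input_ids), sum(attention_mask))  # 128 26
```

Padding everything to a fixed length keeps all examples the same shape, at the cost of spending compute on padding positions; the attention mask tells the model which positions to ignore.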

In [46]:
encoded_dataset = dataset_with_labels.map(tokenize_tweet,batched=True)

In [49]:
encoded_dataset = encoded_dataset.remove_columns(column_names=['ID','Tweet','anger','anticipation','disgust','fear'
,'joy','love','optimism','pessimism','sadness','surprise','trust'])
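
Since the label columns were already collected in the `labels` list, the list of columns to drop could also be derived rather than typed out by hand. A small sketch, assuming `labels` has the same content as earlier in the notebook (`columns_to_drop` is a name introduced here):

```python
labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
          'optimism', 'pessimism', 'sadness', 'surprise', 'trust']

# Drop the raw text, the ID and the per-emotion boolean columns;
# the combined 'labels' column is not in this list, so it is kept.
columns_to_drop = ['ID', 'Tweet'] + labels
print(columns_to_drop)

# In the notebook this list would be passed to the same call as above:
# encoded_dataset = encoded_dataset.remove_columns(columns_to_drop)
```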

In [50]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 886
    })
})

In [51]:
encoded_dataset['train'][0]

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'input_ids': [101,
  1523,
  4737,
  2003,
  1037,
  2091,
  7909,
  2006,
  1037,
  3291,
  2017,
  2089,
  2196,
  2031,
  1005,
  1012,
  11830,
  11527,
  1012,
  1001,
  14354,
  1001,
  4105,
  1001,
  4737,
  102,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,