<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way. 

All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores for a number of labels for every example in the batch.
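To make that shape concrete, here is a dependency-free sketch of what a linear classification head computes. The hidden size of 3, the weights and the helper name `linear_head` are illustrative assumptions; real checkpoints use a much larger hidden size and learned weights, and some heads add an extra dense layer before the final projection.

```python
# Minimal sketch of a linear classification head (illustrative values only).
def linear_head(pooled, weight, bias):
    """Map pooled features (batch_size, hidden) to logits (batch_size, num_labels)."""
    return [
        [sum(h * w for h, w in zip(row, w_row)) + b for w_row, b in zip(weight, bias)]
        for row in pooled
    ]

# batch_size = 2, hidden size = 3 (hypothetical numbers)
pooled = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
# one weight row and one bias per label: num_labels = 4
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0, -1.0]

logits = linear_head(pooled, weight, bias)
print(len(logits), len(logits[0]))  # batch_size, num_labels
```

Each row of `logits` holds one unnormalized score per label for one example in the batch.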



## Set-up environment

First, we install the libraries which we'll use: HuggingFace Transformers and Datasets.

## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

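Note that you can also easily load your local data (e.g. CSV, TXT, Parquet or JSON files) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files). That is what we do in the cell below: the training, validation and test splits are read from local TSV files with pandas and then wrapped in a `DatasetDict`.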



In [1]:
import pandas as pd
from datasets import Dataset, DatasetDict

# The files are tab-separated, possibly with spaces around the tabs, hence the
# regex separator (which requires pandas' Python parsing engine).
df_train = pd.read_csv("../data/ftdataset_train.tsv", sep=' *\t *', encoding='utf-8', engine='python')
df_val = pd.read_csv("../data/ftdataset_val.tsv", sep=' *\t *', encoding='utf-8', engine='python')
df_test = pd.read_csv("../data/ftdataset_test.tsv", sep=' *\t *', encoding='utf-8', engine='python')

# Convert each DataFrame into a Hugging Face Dataset
train_dataset = Dataset.from_pandas(df_train, preserve_index=False)
validation_dataset = Dataset.from_pandas(df_val, preserve_index=False)
test_dataset = Dataset.from_pandas(df_test, preserve_index=False)

# Create a DatasetDict
dataset = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset,
    'test': test_dataset
})




As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

In [2]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['NomDuGroupe', 'Restaurant', 'Note', 'Prix', 'Cuisine', 'Service', 'Ambiance', 'Avis', 'URL'],
        num_rows: 4471
    })
    validation: Dataset({
        features: ['NomDuGroupe', 'Restaurant', 'Note', 'Prix', 'Cuisine', 'Service', 'Ambiance', 'Avis', 'URL'],
        num_rows: 639
    })
    test: Dataset({
        features: ['NomDuGroupe', 'Restaurant', 'Note', 'Prix', 'Cuisine', 'Service', 'Ambiance', 'Avis', 'URL'],
        num_rows: 902
    })
})


Let's check the first example of the training split:

In [75]:
example = dataset['train'][0]
example

{'NomDuGroupe': 'Bernard',
 'Restaurant': 'Bistrot André - Grenoble',
 'Note': 5.0,
 'Prix': 'Positive',
 'Cuisine': 'Positive',
 'Service': 'Positive',
 'Ambiance': 'Positive',
 'Avis': 'Ambiance sympathique, cadre agréable, nourriture goûteuse et copieuse et service soigné.  Un petit bonus pour la sommelière.  Le rapport qualité prix est correct.',
 'URL': 'https://www.tripadvisor.fr/ShowUserReviews-g187272-d10312929-r976576336-Bistro_Andre-Valence_Drome_Auvergne_Rhone_Alpes.html'}

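The dataset consists of French restaurant reviews (the `Avis` column), each labeled with a sentiment class for four aspects: `Prix` (price), `Cuisine` (food), `Service` and `Ambiance` (atmosphere).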

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [76]:
labels = [label for label in dataset['train'].features.keys() if label not in ['NomDuGroupe', 'Restaurant', 'Avis',"Note","URL"]]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['Prix', 'Cuisine', 'Service', 'Ambiance']


## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
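As a rough illustration of what that loss does with a float label matrix, the sketch below computes binary cross-entropy from raw logits by hand, averaged over every (example, label) pair as in the default mean reduction. The function name `bce_with_logits` and the example values are assumptions made for this sketch; it is not the library implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (example, label) pairs, from raw logits."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for x, y in zip(logit_row, target_row):
            # numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))]:
            # max(x, 0) - x*y + log(1 + exp(-|x|))
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count

# one example, four labels; targets are floats in a multi-hot layout (hypothetical values)
logits = [[0.0, 2.0, -2.0, 0.0]]
targets = [[1.0, 1.0, 0.0, 0.0]]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Every label is scored independently, which is why several labels can be active for the same example.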

In [90]:
from datasets import ClassLabel
from transformers import AutoTokenizer
import numpy as np

# load the tokenizer that matches the checkpoint we fine-tune below
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base")
# maps the sentiment strings found in the label columns to integer ids 0..3
class_label = ClassLabel(names=['Négative', 'Positive','Neutre','NE'])

def preprocess_data(examples):
  # take a batch of texts
  text = examples["Avis"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array: one column per aspect, holding the class id of its sentiment
  for idx, label in enumerate(labels):
    labels_batch[label] = class_label.str2int(labels_batch[label])
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding

encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)

In [84]:
example = encoded_dataset['train'][1]
print(example.keys())

dict_keys(['input_ids', 'attention_mask', 'labels'])


In [85]:
tokenizer.decode(example['input_ids'])

"[CLS] Je m' attendais à mieux que ça. Les commentaires de ce restaurant étant très bon, j' y allais les yeux fermés. J' ai mangé des gratins Dauphinois 100 meilleurs ailleurs et idem pour les desserts. J' ai trouvé qu' ils manquaient de goût et de saveurs. Dommage. Je n' y retournerais pas. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

In [86]:
example['labels']

[3.0, 0.0, 3.0, 3.0]

In [87]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 3.0]

['Prix', 'Service', 'Ambiance']
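To read such a label vector, it helps to map every class id back to its name. The sketch below does this in plain Python, assuming the class order passed to `ClassLabel` in the preprocessing cell and the four aspect names from `labels`; the helper name `decode_labels` is hypothetical.

```python
# Class names in the order passed to ClassLabel in the preprocessing cell.
class_names = ['Négative', 'Positive', 'Neutre', 'NE']
# Aspect columns, in the order of the `labels` list.
aspects = ['Prix', 'Cuisine', 'Service', 'Ambiance']

def decode_labels(label_ids):
    """Pair each aspect with the class name for its integer id."""
    return {aspect: class_names[int(i)] for aspect, i in zip(aspects, label_ids)}

decoded = decode_labels([3.0, 0.0, 3.0, 3.0])
print(decoded)
```

For the example above, only `Cuisine` carries the `Négative` class; the other three aspects carry the `NE` class. Note that ids above 1 fall outside the [0, 1] range a binary cross-entropy target is meant to take, so encoding each aspect as 0/1 indicators may be a better fit; that is an observation on the setup, not something this notebook does.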

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html). 

In [89]:
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(
    encoded_dataset["train"], shuffle=True, batch_size=12, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    encoded_dataset["validation"], batch_size=12, collate_fn=data_collator
)
encoded_dataset.set_format("torch")
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4471
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 639
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 902
    })
})

## Define model

Here we define a model that consists of a pre-trained base (here, the weights of the `almanach/camembertv2-base` checkpoint) with a randomly initialized classification head on top. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [12]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("almanach/camembertv2-base",
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at almanach/camembertv2-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Before training, let's take one batch from the training dataloader and run a forward pass to check the loss and the shape of the logits.

In [13]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7265, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>) torch.Size([12, 4])


In [14]:
import torch
from tqdm.auto import tqdm
from transformers import get_scheduler

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
# decay the learning rate linearly to 0 over the whole run, without warmup
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

# train on the GPU if one is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # move the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

100%|██████████| 1119/1119 [29:09<00:00,  1.57s/it]
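The 1119 steps in the progress bar follow from the dataset size, the batch size and the number of epochs. The sketch below redoes that arithmetic and writes out a linear schedule by hand; `linear_lr` is a hypothetical helper reflecting my understanding of a linear warmup-then-decay schedule, not the library's code.

```python
import math

num_rows, batch_size, num_epochs = 4471, 12, 3

# A DataLoader keeps the last, smaller batch by default, hence the ceiling.
steps_per_epoch = math.ceil(num_rows / batch_size)
num_training_steps = num_epochs * steps_per_epoch

def linear_lr(step, base_lr=5e-5, warmup=0, total=num_training_steps):
    """Learning rate after `step` scheduler steps: linear warmup, then linear decay to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))

print(steps_per_epoch, num_training_steps)
print(linear_lr(0), linear_lr(num_training_steps // 2), linear_lr(num_training_steps))
```

With no warmup, the learning rate starts at its full value on the first step and reaches zero on the last one.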

We also want to compute metrics. For this, we define a `compute_metrics` function that takes an `EvalPrediction` and returns a dictionary with the desired metric values (micro-averaged F1, ROC AUC and accuracy).

In [34]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics


# Define the compute_metrics function
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(predictions=preds, labels=p.label_ids)
    return result
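To show what two of these metrics measure on multi-label data, here is a plain-Python sketch of micro-averaged F1 and of accuracy as exact-match (subset) accuracy. The helper names and the toy arrays are assumptions for illustration; in the notebook these numbers come from scikit-learn.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# two examples, four labels each (hypothetical values)
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1]]

print(micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

Subset accuracy is strict: a single wrong label makes the whole example count as incorrect, which is why it is usually much lower than micro F1.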





Let's verify a batch as well as a forward pass:

In [49]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [50]:
encoded_dataset['train']['input_ids'][0]

tensor([    1, 12460,  6785,  ...,     0,     0,     0])

In [30]:
# Select the device (cuda if available, otherwise cpu)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Move the model to the device
model.to(device)

# Move the tensors to the same device
input_ids = encoded_dataset['train']['input_ids'][0].unsqueeze(0).to(device)
labels = encoded_dataset['train'][0]['labels'].unsqueeze(0).to(device)

# Run the model on the moved tensors
outputs = model(input_ids=input_ids, labels=labels)
print(outputs)

SequenceClassifierOutput(loss=tensor(8.8212e-08, device='cuda:0',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[33.7587, 14.8583, 21.7202, 30.1379]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


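Let's look at the model's predictions on an example review. Both loops in the cell below call the same `model` object, so the two sets of predictions are identical; a real before/after comparison would require keeping a copy of the untrained model.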

In [65]:
# define list of examples
text_list = ["Je m' attendais à mieux que ça. Les commentaires de ce restaurant étant très bon, j' y allais les yeux fermés. J' ai mangé des gratins Dauphinois 100 meilleurs ailleurs et idem pour les desserts. J' ai trouvé qu' ils manquaient de goût et de saveurs. Dommage. Je n' y retournerais pas."]

print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt").to(device)
    # compute logits
    logits = model(inputs).logits
    labels = labels.to(logits.device)
    # convert logits to label
    predictions = torch.argmax(logits)
    print(text + " - " + id2label[predictions.tolist()])

print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to(device) # to('mps') for Mac

    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])

Untrained model predictions:
----------------------------
Je m' attendais à mieux que ça. Les commentaires de ce restaurant étant très bon, j' y allais les yeux fermés. J' ai mangé des gratins Dauphinois 100 meilleurs ailleurs et idem pour les desserts. J' ai trouvé qu' ils manquaient de goût et de saveurs. Dommage. Je n' y retournerais pas. - Prix
Trained model predictions:
--------------------------
Je m' attendais à mieux que ça. Les commentaires de ce restaurant étant très bon, j' y allais les yeux fermés. J' ai mangé des gratins Dauphinois 100 meilleurs ailleurs et idem pour les desserts. J' ai trouvé qu' ils manquaient de goût et de saveurs. Dommage. Je n' y retournerais pas. - Prix


## Evaluate

After training, we evaluate our model on the validation set.

## Inference

Let's test the model on a new sentence:

The logits that come out of the model are of shape (batch_size, num_labels). As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits is a tensor that contains the (unnormalized) scores for every individual label.

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, such that every score is turned into a number between 0 and 1, that can be interpreted as a "probability" for how certain the model is that a given class belongs to the input text.

Next, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).
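The two steps above can be sketched without any framework. The logits below are made-up values and `predict_labels` is a hypothetical helper; the notebook cell that follows does the same thing with PyTorch and NumPy.

```python
import math

def predict_labels(logits, id2label, threshold=0.5):
    """Apply a sigmoid to each logit and keep the labels whose probability reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

id2label = {0: 'Prix', 1: 'Cuisine', 2: 'Service', 3: 'Ambiance'}

# logits for a single example (hypothetical values)
predicted = predict_labels([2.0, -1.0, 0.0, -3.0], id2label)
print(predicted)
```

Since the sigmoid of 0 is exactly 0.5, thresholding the probabilities at 0.5 is equivalent to keeping every label whose logit is non-negative.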

In [62]:
# apply sigmoid + threshold
logits = outputs.logits
print(logits.shape)
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted id's into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

torch.Size([1, 4])
['Prix', 'Cuisine', 'Service', 'Ambiance']


In [83]:
text_list = ["Je m' attendais à mieux que ça. Les commentaires de ce restaurant étant très bon, j' y allais les yeux fermés. J' ai mangé des gratins Dauphinois 100 meilleurs ailleurs et idem pour les desserts. J' ai trouvé qu' ils manquaient de goût et de saveurs. Dommage. Je n' y retournerais pas."]

print("model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt").to(device)
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits,dim = 1)
    print(logits)
    print(predictions)


# Apply sigmoid + threshold
logits = outputs.logits
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1

# Turn the predicted ids into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(probs)
print(predicted_labels)

model predictions:
----------------------------
tensor([[34.3476, 15.0940, 21.7407, 30.5009]], device='cuda:0',
       grad_fn=<AddmmBackward0>)
tensor([0], device='cuda:0')
tensor([1.0000, 1.0000, 1.0000, 1.0000], grad_fn=<SigmoidBackward0>)
['Prix', 'Cuisine', 'Service', 'Ambiance']

