# Fine-tuning pretrained [BERT](https://huggingface.co/bert-base-cased) for a sentiment classification task

This example is inspired by the Token-Classification [notebook](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb) and [run_glue.py](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py) from HuggingFace 🤗.

We will be fine-tuning the pre-trained `bert-base-cased` model. You can find the details about this model on the [🤗 Hub](https://huggingface.co/bert-base-cased).

For more state-of-the-art PyTorch/TensorFlow/JAX notebooks, you can explore [🤗 Notebooks](https://huggingface.co/transformers/notebooks.html).

## Note on environment
This notebook assumes a PyTorch 1.7 DLVM development environment.
Transformers and Datasets are installed from within the notebook itself.
A script version also accompanies this notebook [TODO]

In [1]:
!pip -q install torch==1.7
!pip -q install transformers
!pip -q install datasets


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-profiling 2.8.0 requires visions[type_image_path]==0.4.4, but you have visions 0.7.0 which is incompatible.


In [3]:
!pip -q install tqdm

In [4]:
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    PretrainedConfig,
    Trainer,
    TrainingArguments,
    default_data_collator,
    set_seed,
    PreTrainedTokenizerFast
)
import numpy as np
from datasets import load_dataset, load_metric

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

For this example we will use the IMDB movie review dataset for the sentiment classification task.

In [5]:
datasets = load_dataset("imdb")
batch_size = 16
max_seq_length=128
model_name_or_path='bert-base-cased'

Reusing dataset imdb (/home/jupyter/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split: here `train`, `test` and `unsupervised`.

In [6]:
datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [7]:
datasets["train"][0]

{'label': 1,
 'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'}

We use the `unique` method to extract the label list. This will allow us to experiment with other datasets without hard-coding labels.

In [8]:
label_list = datasets["train"].unique("label")
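
As a small sketch of why extracting the labels helps, the helper below turns any list of unique labels into a contiguous label-to-ID mapping. `build_label_to_id` is a hypothetical name introduced for illustration (it is not part of 🤗 Datasets), and the label values are made up.

```python
def build_label_to_id(unique_labels):
    """Map each distinct label to a contiguous integer ID, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(unique_labels)))}

# Integer labels, in whatever order a `unique` call happened to return them
int_mapping = build_label_to_id([1, 0])

# String labels from some other (hypothetical) dataset
str_mapping = build_label_to_id(["pos", "neg"])

print(int_mapping)
print(str_mapping)
```

Sorting before enumerating keeps the mapping stable regardless of the order in which the labels are reported.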

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [9]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [11]:
show_random_elements(datasets["train"])

Unnamed: 0,label,text
0,pos,"Well, I get used after awhile to read comments about these movies that don't reflect my experience at all. To me, Amitabh was a better villain here than in some of his most famous movies. He was a die-hard villain, a no-apologies villain. To me it was a breath of fresh air to see him in a role where his villainy isn't sort of undercut in some way.<br /><br />The kid who played Aryan was probably over his head with this cast. There I think maybe the director could have done better. But, to be honest, the very best part of this movie was Shernaz Patel. She is an unsung heroine, a true veteran thespian who is overqualified for every role she is offered. But I must say I appreciated her contribution greatly as she played Virendra Sahi's wife. She may be given little to do, but she does everything with total conviction. I'm sure she sailed right over the heads of most of the audience.<br /><br />So if you are in a habit for settling for Bollywood average, you won't get much out of this movie. But if you constantly search for something more, then this might give you some of what you've been missing."
1,pos,"I'm not a fan of Adam Sandler. In fact, I don't think I've ever liked him in anything I've seen him in. The opening scene of this movie confirmed my worst fears. There was Adam Sandler, playing a somewhat ridiculous looking character riding around New York City on a motor scooter, looking pitiful and lost. Typical Sandler-type loser character again, I thought. I almost gave up then and there. But then, as I stuck with this, I actually discovered something I never knew before: Adam Sandler can act! He is truly outstanding in this movie as Charlie, a lost and lonely figure, whose entire family (including the dog) was killed in one of the hijacked planes on 9/11 and who has apparently lost all touch with reality as a result. Don Cheadle plays his former college roommate who unexpectedly reconnects with Charlie and takes it on as his mission to help him get better. Of course, Cheadle's Alan Johnson has his own problems and sources of unhappiness, and somehow these two men manage to help each other through their difficulties. The two of them made a completely believable team, and Sandler in particular made Charlie real, working through his emotions and feelings. This is not a Sandler comedy. If your looking for that go to some of his other, sillier, stuff. This is a pretty heavy movie - sometimes sad, sometimes hopeful and always engrossing. There are some funny parts in it. I loved the scene in which Charlie convinces Alan to confront his partners by reminding him of how tough he was in college, and then the conversation the two of them have afterward.<br /><br />I personally didn't think that Saffron Burrows added much to the movie as Donna, an obviously needy patient of Johnson's. The only reason for the character seemed (based on one flashback) to be that she looked eerily like Charlie's late wife, but that was never really developed, and I just didn't care that much for the character. 
Do look for the part of the judge, however, played by Donald Sutherland, who I thought nailed the part bang-on. As far as I'm concerned, though, this is Sandler's movie, and kudos to him for a great performance. Definitely his best in my opinion. 8/10"


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    use_fast=True,
)
# `use_fast=True` ensures that we use the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on one sentence:

In [14]:
tokenizer("Hello, this is one sentence!")

{'input_ids': [101, 8667, 117, 1142, 1110, 1141, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
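
As a rough illustration of what `attention_mask` encodes, the sketch below pads or truncates a list of token IDs to a fixed length and marks real tokens with 1 and padding with 0. It is a simplified stand-in for what the tokenizer does internally, not its actual implementation. The helper name `pad_or_truncate` is hypothetical, and the padding ID of 0 is an assumption about the `bert-base-cased` vocabulary.

```python
# Toy illustration only: the real tokenizer handles this internally.
PAD_ID = 0    # assumed ID of [PAD] in the bert-base-cased vocabulary
SEP_ID = 102  # ID of [SEP], the last ID in the tokenizer output above

def pad_or_truncate(input_ids, max_length):
    """Return fixed-length input_ids with the matching attention_mask."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Keep the first max_length - 1 IDs and close the sequence with [SEP].
        ids = ids[: max_length - 1] + [SEP_ID]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
    }

ids = [101, 8667, 117, 1142, 1110, 1141, 5650, 106, 102]
padded = pad_or_truncate(ids, max_length=12)    # three padding positions added
truncated = pad_or_truncate(ids, max_length=5)  # cut down to five positions
print(padded)
print(truncated)
```

The mask lets the model ignore padding positions, which matters once every review is padded to the same length.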

Note: If your inputs have already been split into words (this is not the case for the IMDB reviews, which are raw strings, as the example below shows), you should pass the list of words to your tokenizer with the argument `is_split_into_words=True`:

In [15]:
example = datasets["train"][4]
print(example)

{'label': 1, 'text': 'This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fantastic, under-rated actress. There were some moments that could have been fleshed out a bit more, and some scenes that could probably have been cut to make the room to do so, but all in all, this is worth the price to rent and see it. The acting was good overall, Brooks himself did a good job without his characteristic speaking to directly to the audience. Again, Warren was the best actor in the movie, but "Fume" and "Sailor" both played their parts well.'}


In [16]:
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)

{'input_ids': [101, 8667, 117, 1142, 1110, 1141, 5650, 3325, 1154, 1734, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
# Dataset loading is repeated here to make this cell idempotent,
# since we overwrite the `datasets` variable below.
datasets = load_dataset('imdb')

# TEMP: this mapping could be extracted automatically, but `unique` on the
# train split does not report the label -1, which shows up during
# pre-processing (it comes from the unlabeled `unsupervised` split).
# Hence the additional -1 entry in the dictionary.
label_to_id = {
    1: 1,
    0: 0,
    -1: 0
}

def preprocess_function(examples):
    # Tokenize the texts, padding or truncating each one to max_seq_length
    result = tokenizer(examples['text'], padding='max_length', max_length=max_seq_length, truncation=True)

    # Map labels to IDs (not necessary for GLUE tasks)
    if label_to_id is not None and "label" in examples:
        result["label"] = [label_to_id[l] for l in examples["label"]]
    return result

datasets = datasets.map(preprocess_function, batched=True, load_from_cache_file=True)


Reusing dataset imdb (/home/jupyter/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-454deb419b45fcc6.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-367c209b579ad340.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-9f2effbf0f1cce4b.arrow


Note that transformers are often pretrained with subword tokenizers, meaning that even if your inputs have been split into words already, each of those words could be split again by the tokenizer.
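
To illustrate the subword splitting described above without downloading anything, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece, the scheme BERT tokenizers use. The vocabulary `TOY_VOCAB` and the function `wordpiece` are made up for illustration; the real `bert-base-cased` vocabulary is far larger and the real tokenizer also handles normalization and punctuation.

```python
# Hypothetical mini-vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"token", "##ization", "##izer", "is", "fun", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

words = ["tokenization", "is", "fun"]
subwords = [piece for word in words for piece in wordpiece(word)]
print(subwords)
```

Three input words become four tokens here, which is why the number of tokens can exceed the number of words.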

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which we get from the `label_list` computed earlier):

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path,
        num_labels=len(label_list)
    )

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

The warning is telling us we are throwing away some weights (the `cls.predictions` and `cls.seq_relationship` layers listed above) and randomly initializing some others (the new `classifier` layer). This is absolutely normal in this case, because we are removing the heads used to pretrain the model (masked language modeling and next sentence prediction) and replacing them with a new classification head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define a few more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model; all other arguments are optional:

In [20]:
args = TrainingArguments(
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    output_dir='/tmp/cls'
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined earlier in the notebook and customize the number of epochs for training, as well as the weight decay.

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. You can define your own `compute_metrics` function: it takes an `EvalPrediction` object (a named tuple with `predictions` and `label_ids` fields) and has to return a dictionary mapping strings to floats.

In [22]:
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
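
To make the metric concrete, the sketch below reproduces the same argmax-then-compare logic in plain Python on made-up logits. The names `argmax` and `accuracy_from_logits` and the toy values are assumptions for illustration only; they are not part of the notebook or of 🤗 Transformers.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, label_ids):
    """Pick the highest-scoring class per example and compare with the labels."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return {"accuracy": correct / len(label_ids)}

# Four made-up examples with two classes each (0 = negative, 1 = positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.7]]
toy_labels = [0, 1, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # predictions are [0, 1, 1, 0], so 3 of 4 are correct
```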


Now we create the `Trainer` object, and we are almost ready to train.

In [23]:
trainer = Trainer(
    model,
    args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
    data_collator=default_data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [21]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | Runtime | Samples Per Second |
|---|---|---|---|---|---|
| 1 | 0.3898 | 0.287916 | 0.876 | 35.8257 | 697.823 |




TrainOutput(global_step=782, training_loss=0.3601018705636339, metrics={'train_runtime': 171.8154, 'train_samples_per_second': 4.551, 'total_flos': 2079586752000000, 'epoch': 1.0})
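
As a sanity check on `global_step=782` in the output above, the number of optimizer steps in one epoch can be estimated from the dataset size and the effective batch size. The sketch below assumes no gradient accumulation and that the last partial batch is kept. The helper `steps_per_epoch` is hypothetical, and the device count is an inference rather than something the notebook states.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Estimate optimizer steps per epoch, keeping the last partial batch."""
    effective_batch = per_device_batch_size * num_devices
    return math.ceil(num_examples / effective_batch)

# 25000 training reviews with per_device_train_batch_size=16
one_device = steps_per_epoch(25000, 16, num_devices=1)
two_devices = steps_per_epoch(25000, 16, num_devices=2)
print(one_device, two_devices)
```

With these assumptions, a single device would need 1563 steps while two devices would need 782, so the reported step count is consistent with a two-device run.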

The `evaluate` method allows you to evaluate again on the evaluation dataset or on another dataset:

In [54]:
trainer.evaluate()

{'eval_loss': 0.5855605602264404,
 'eval_accuracy': 0.879040002822876,
 'eval_runtime': 36.0291,
 'eval_samples_per_second': 693.883}

To get the precision/recall/f1 computed for each category now that we have finished training, we can apply the same function as before on the result of the `predict` method: