## Installation 

We'll start by installing the tools we need. 

If you're running on Google Colab, first make sure to **change your runtime to GPU**, and then execute the cells below.

On your own machine, you're better off installing these tools with pip in your Python environment. If you have a GPU, the training will take a few minutes. If you don't, it might take much, much longer, and you probably want to use Google Colab instead. 
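
Before going further, it can be worth checking that a GPU is actually visible. A minimal sketch, assuming PyTorch is the backend (the `has_gpu` name is only illustrative); if PyTorch is not installed, the snippet simply reports that no GPU was found:

```python
# Check whether PyTorch can see a CUDA GPU.
# Assumption: PyTorch is the deep learning backend used in this environment.
try:
    import torch
    has_gpu = torch.cuda.is_available()
except ImportError:
    # PyTorch is missing, so we cannot detect a GPU this way.
    has_gpu = False

print("GPU available:", has_gpu)
```

On Colab, a `False` here usually means the runtime type has not been switched to GPU yet.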

First, we install the Hugging Face transformers library: 

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/b5/d5/c6c23ad75491467a9a84e526ef2364e523d45e2b0fae28a7cbe8689e7e84/transformers-4.8.1-py3-none-any.whl (2.5MB)
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a547687423fe0b/huggingface_hub-0.0.12-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)

And then the [Huggingface datasets library](https://huggingface.co/docs/datasets/loading_datasets.html), which makes it easy to download and manipulate datasets: 

In [2]:
!pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/08/a2/d4e1024c891506e1cee8f9d719d20831bac31cb5b7416983c4d2f65a6287/datasets-1.8.0-py3-none-any.whl (237kB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/0e/3a/666e63625a19883ae8e1674099e631f9737bd5478c4790e5ad49c5ac5261/fsspec-2021.6.1-py3-none-any.whl (115kB)
Installing collected packages: xxhash, fsspec, datasets
Successfully installed datasets-1.8.0 fsspec-2021.6.1 xxhash-2.0.2


## The emotion dataset

Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.

You can find more information on the [Hugging Face Hub](https://huggingface.co/datasets/emotion).

**Note**: On this page, it is currently written that there are only 5 emotions. In fact, there are 6, as we will see below. 

Let's load the dataset and print it: 

In [11]:
from datasets import load_dataset

dataset = load_dataset("emotion")

Using custom data configuration default
Reusing dataset emotion (/root/.cache/huggingface/datasets/emotion/default/0.0.0/6e4212efe64fd33728549b8f0435c73081391d543b596a05936857df98acb681)


In [12]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 2000
    })
})


The dataset dictionary contains 3 datasets, for training, validation, and test. 

We can look at a specific example of the training dataset (example 100 here):

In [13]:
dataset['train'][100]

{'label': 2,
 'text': 'i wont let me child cry it out because i feel that loving her and lily when she was little was going to be opportunities that only lasted for those short few months'}

So what does `'label': 2` correspond to? 


In [14]:
dataset['train'].features

{'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None),
 'text': Value(dtype='string', id=None)}

We confirm that there are indeed 6 classes (6 emotions), and we see that label 2 corresponds to love. 
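
The mapping from integer labels to emotion names is just a lookup by position in the `names` list. A small sketch in plain Python, with the names copied from the output above; `label_to_name` is a hypothetical helper introduced only for illustration:

```python
# Label names copied from the ClassLabel feature printed above.
# The position in the list is the integer label.
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

def label_to_name(label_id):
    """Return the emotion name for an integer label (hypothetical helper)."""
    return label_names[label_id]

print(label_to_name(2))  # love
```

With the real dataset loaded, `dataset['train'].features['label'].int2str(2)` should give the same answer without copying the list by hand.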

Note that at this stage, our dataset simply holds plain Python objects, such as lists of strings for the input text. 

## Strategy

For this task, we will use the BERT base model, which is a generic transformer model for the English language. 

It has been trained to understand English on a large corpus of English text. 

BERT is an encoder model: it considers the full input text, and embeds its understanding of the text into a vector of values. 

Encoder models are ideal for text classification. 

The only thing we need to do is to add a classification "head" to BERT. 

In our case, the head can simply be a layer with 6 neurons corresponding to our 6 classes, with a softmax activation. This layer takes the output vector from BERT and converts it to the probabilities for the input text to belong to each of the 6 classes. 

In other words, it's going to make sense of the BERT embedding in terms of our classification problem. 
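
To make the idea of a classification head concrete, here is a toy sketch in plain Python: one linear layer with 6 neurons followed by a softmax. The embedding and the weights are random stand-ins (not real BERT values), the hidden size is shrunk to 4 for readability (BERT base uses 768), and `classification_head` is a hypothetical name, not the library's implementation:

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(embedding, weights, biases):
    """One linear layer (one row of weights per class) followed by a softmax."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)

random.seed(0)
hidden_size = 4   # toy value; BERT base uses 768
num_classes = 6   # our 6 emotions

# Random stand-ins for the BERT output vector and the head parameters.
embedding = [random.uniform(-1, 1) for _ in range(hidden_size)]
weights = [[random.uniform(-1, 1) for _ in range(hidden_size)]
           for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = classification_head(embedding, weights, biases)
print(probs)       # six probabilities, one per emotion
print(sum(probs))  # approximately 1.0
```

Training consists in adjusting `weights` and `biases` (and, during fine-tuning, BERT itself) so that the highest probability lands on the right emotion.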

## Preparing the data

As we have seen, our `dataset` object is a simple Python class holding plain Python objects. For instance, the text of a given example is a plain string:

In [17]:
dataset['train'][0]

{'label': 0, 'text': 'i didnt feel humiliated'}

These data are not suited to neural networks, for two reasons: 

First, **neural networks work with numbers, not words**. For a given example, they take as input an array of numbers representing this example, and spit out another array of numbers. 

These arrays can be of varying dimensions (1D, 2D, ...), so they are in fact tensors. 

For example, in an image classification task such as [Dogs vs Cats](https://thedatafrog.com/en/articles/dogs-vs-cats/), the input array representing an image is a 3D tensor. Two dimensions correspond to the height and width of the image, and the third one corresponds to the number of color channels. For a colour image, the shape of such a tensor can be denoted (nx, ny, 3), where:

* nx: number of pixels in the horizontal direction
* ny: number of pixels in the vertical direction 
* 3: number of colour channels

The output tensor, on the other hand, contains the probabilities for the image to belong to each category, as predicted by the neural network. It's a 1D tensor (i.e. a vector) with shape `(n_categories,)`. 

The data flows from the input to the output of the network, and is transformed at each network layer. At a typical layer, we first apply a linear operation akin to matrix multiplication to the input data of the layer. Then, we apply a non-linear mathematical function to each element of the resulting tensor. And so on until we reach the output layer. 

For a more detailed introduction to neural networks, you can have a look at my article [The 1-Neuron Network: Logistic Regression](https://thedatafrog.com/en/articles/logistic-regression/). 

Second, **neural networks need input data with a fixed length**. I'm simplifying a bit here... In fact, some neural network architectures such as convolutional neural networks or recurrent networks can work with data of varying length. But if you want to train them on a GPU, you will need to stick to a fixed length anyway. 

So we have two problems: 

* Our text is, well, text;
* It is of varying length.

What we're going to do here is therefore to **convert all text examples in our dataset into arrays of numbers, all with the same length**. 

This operation is called **tokenization**, and it involves several steps:  

* split the text into tokens, which might be individual words. 
* replace each token by an integer called the input ID (for the model). At this stage, our text is a list of integers.
* truncation and padding: the list of input IDs is converted to a fixed length array. 
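
The three steps above can be sketched in a few lines of plain Python. This is a deliberately naive toy: it splits on whitespace and uses a tiny made-up vocabulary, whereas the real BERT tokenizer splits words into subword pieces and adds special tokens. The `toy_tokenize` function and the `vocab` dictionary are hypothetical:

```python
def toy_tokenize(text, vocab, max_length, pad_id=0, unk_id=1):
    """Naive tokenizer: split, map to IDs, then truncate and pad."""
    # Step 1: split the text into tokens (here, simply on whitespace).
    tokens = text.lower().split()
    # Step 2: replace each token by its integer ID (unknown words get unk_id).
    input_ids = [vocab.get(tok, unk_id) for tok in tokens]
    # Step 3: truncate, then pad on the right to a fixed length.
    input_ids = input_ids[:max_length]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }

# Made-up vocabulary, for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "didnt": 3, "feel": 4, "humiliated": 5}

encoded = toy_tokenize("i didnt feel humiliated", vocab, max_length=6)
print(encoded["input_ids"])       # [2, 3, 4, 5, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions hold real tokens and which ones are padding to be ignored.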



In [18]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [19]:
def tokenize_function(example):
    # Truncate to the model's maximum length; padding is left to the data collator.
    return tokenizer(example['text'], truncation=True)

In [20]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)

In [21]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 2000
    })
})

In [None]:
set(tokenized_dataset['train']['label'])

{0, 1, 2, 3, 4, 5}

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
data_collator

DataCollatorWithPadding(tokenizer=PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None)
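
The tokenization step only used `truncation=True`, so the tokenized examples still have different lengths. The collator takes care of padding when a batch is assembled, bringing every example to the length of the longest one in that batch. A rough sketch of the idea in plain Python; `pad_batch` is a hypothetical helper and the token IDs are illustrative, this is not the library's implementation:

```python
def pad_batch(batch, pad_id=0):
    """Pad every list of input IDs to the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # pad on the right
        masks.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return padded, masks

# Three tokenized examples of different lengths (illustrative IDs).
batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding per batch rather than to one global length means short batches stay short, which saves computation.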

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 16000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6000


| Step | Training Loss |
|------|---------------|
| 500  | 0.7516 |
| 1000 | 0.3577 |
| 1500 | 0.2634 |
| 2000 | 0.2509 |
| 2500 | 0.182  |
| 3000 | 0.1568 |
| 3500 | 0.1391 |
| 4000 | 0.1388 |
| 4500 | 0.0886 |
| 5000 | 0.1126 |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved

TrainOutput(global_step=6000, training_loss=0.22143900680541992, metrics={'train_runtime': 659.3443, 'train_samples_per_second': 72.8, 'train_steps_per_second': 9.1, 'total_flos': 1302049199981952.0, 'train_loss': 0.22143900680541992, 'epoch': 3.0})
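
The 6000 optimization steps reported in the log follow directly from the dataset size, the batch size, and the number of epochs. A quick check, assuming a single device and no gradient accumulation (which is what the log indicates):

```python
import math

num_examples = 16000   # size of the training set
batch_size = 8         # batch size per device, from the log
num_epochs = 3
train_runtime = 659.3443  # seconds, from TrainOutput

# One optimization step per batch, with a last partial batch if needed.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = round(total_steps / train_runtime, 1)

print(steps_per_epoch, total_steps)  # 2000 6000
print(steps_per_second)              # 9.1
```

Both numbers agree with `global_step` and `train_steps_per_second` in the output above.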

In [None]:
predictions = trainer.predict(tokenized_dataset["test"])

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running Prediction *****
  Num examples = 2000
  Batch size = 8


In [None]:
predictions

PredictionOutput(predictions=array([[ 8.454301  , -1.8288338 , -1.9460562 , -1.1483477 , -1.6323161 ,
        -1.7061889 ],
       [ 8.478015  , -1.8272443 , -1.7600707 , -1.1059024 , -1.8097972 ,
        -1.8396385 ],
       [ 8.411227  , -1.6476626 , -1.5176547 , -1.4721982 , -1.9037353 ,
        -1.8326322 ],
       ...,
       [-1.7230492 ,  8.401841  , -0.6115043 , -1.8850343 , -2.71189   ,
        -1.2977178 ],
       [-1.8113626 ,  8.315013  , -0.97638655, -1.839709  , -1.8033446 ,
        -1.6238836 ],
       [-1.0787348 , -2.1998155 , -2.1257672 , -2.0551157 ,  5.087704  ,
         4.2144423 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 4]), metrics={'test_loss': 0.19932958483695984, 'test_runtime': 6.0181, 'test_samples_per_second': 332.333, 'test_steps_per_second': 41.542})

In [None]:
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
preds

array([0, 0, 0, ..., 1, 1, 4])

In [None]:
labels = np.array(tokenized_dataset['test']['label'])

In [None]:
labels

array([0, 0, 0, ..., 1, 1, 4])

In [None]:
np.mean(preds == labels)

0.9345
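
This accuracy is simply the fraction of test examples for which the highest-scoring class matches the true label. The same computation in plain Python on a made-up batch of three examples; the logits and labels below are illustrative and not taken from the model, and the helper names are hypothetical:

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(logits, labels):
    """Fraction of rows whose top-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Illustrative scores for three examples over the 6 classes.
toy_logits = [
    [8.4, -1.8, -1.9, -1.1, -1.6, -1.7],   # top class: 0
    [-1.7, 8.4, -0.6, -1.9, -2.7, -1.3],   # top class: 1
    [-1.1, -2.2, -2.1, -2.1, 5.1, 4.2],    # top class: 4
]
toy_labels = [0, 1, 5]  # the third example is deliberately misclassified

print(accuracy(toy_logits, toy_labels))  # 2 correct out of 3
```

Note that the model outputs raw scores (logits), not probabilities. Since the softmax preserves the ordering of the scores, taking the argmax of the logits gives the same predicted class.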

In [None]:
from datasets import list_metrics

list_metrics()

['accuracy',
 'bertscore',
 'bleu',
 'bleurt',
 'cer',
 'comet',
 'coval',
 'cuad',
 'f1',
 'gleu',
 'glue',
 'indic_glue',
 'matthews_correlation',
 'meteor',
 'pearsonr',
 'precision',
 'recall',
 'rouge',
 'sacrebleu',
 'sari',
 'seqeval',
 'spearmanr',
 'squad',
 'squad_v2',
 'super_glue',
 'wer',
 'xnli']

In [None]:
from datasets import load_metric

metric = load_metric('accuracy')

In [None]:
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.9345}

In [None]:
dataset['train'].features

{'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None),
 'text': Value(dtype='string', id=None)}

In [None]:
p = trainer.predict([tokenizer("My first child is born!")])
np.argmax(p.predictions)

***** Running Prediction *****
  Num examples = 1
  Batch size = 8


1

In [None]:
p = trainer.predict([tokenizer("My father died today")])
np.argmax(p.predictions)

***** Running Prediction *****
  Num examples = 1
  Batch size = 8


0

In [None]:
p = trainer.predict([tokenizer("Go and clean up your room")])
np.argmax(p.predictions)

***** Running Prediction *****
  Num examples = 1
  Batch size = 8


3