<a href="https://colab.research.google.com/github/soumya-mishra/AI_DS_ML/blob/master/Sequence_Classification_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence Classification with Transformers

This colab notebook will guide you through using the Transformers library to obtain state-of-the-art results on the sequence classification task. It is attached to [the following tutorial](https://medium.com/@lysandrejik/using-tensorflow-2-for-state-of-the-art-natural-language-processing-102445cda54a).

We will compare two models: Google's BERT and Facebook's RoBERTa. Both share the same architecture but were pre-trained with different approaches.

## Installing required dependencies
In order to import the TensorFlow modules, we must make sure that TF2 is installed in the environment.

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

2.2.0


In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

## Initializing the pre-trained models

Let's initialize the models with pre-trained weights. The list of pre-trained weights is available in [the official documentation](https://huggingface.co/transformers/pretrained_models.html). Downloading the weights may take a bit of time, but it only needs to be done once!

In [3]:
from transformers import (TFBertForSequenceClassification, 
                          BertTokenizer,
                          TFRobertaForSequenceClassification, 
                          RobertaTokenizer)

bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

roberta_model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of the model checkpoint at roberta-base were not used when initializing TFRobertaForSequenceClassification: ['lm_head']
- This IS expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



### Tokenization

BERT and RoBERTa are both Transformer models that share the same architecture. As such, they accept only one kind of input: vectors of integers, each value representing a token. Each string of text must first be converted to a list of token indices before it is fed to the model. The tokenizer takes care of that for us.

BERT and RoBERTa may have the same architecture, but they differ in tokenization. BERT uses WordPiece sub-word tokenization, whereas RoBERTa uses the same tokenization as GPT-2: byte-level byte-pair encoding (BPE). Let's see what this means:

In [4]:
sequence = "Systolic arrays are cool. This 🐳 is cool too."

bert_tokenized_sequence = bert_tokenizer.tokenize(sequence)
roberta_tokenized_sequence = roberta_tokenizer.tokenize(sequence)

print("BERT:", bert_tokenized_sequence)
print("RoBERTa:", roberta_tokenized_sequence)

BERT: ['S', '##ys', '##to', '##lic', 'array', '##s', 'are', 'cool', '.', 'This', '[UNK]', 'is', 'cool', 'too', '.']
RoBERTa: ['Sy', 'st', 'olic', 'Ġarrays', 'Ġare', 'Ġcool', '.', 'ĠThis', 'ĠðŁ', 'Ĳ', '³', 'Ġis', 'Ġcool', 'Ġtoo', '.']


**BERT Tokenizer**

Here, the BERT tokenizer splits the string into multiple substrings. If a substring is in its vocabulary, it stays as is: this is the case for `array`, `are` and `cool`. If a substring is not in the vocabulary, it is split again until every piece is. For example, `Systolic` is split into four tokens, each of which is in the BERT vocabulary. Pieces that start with `##` continue the previous piece rather than starting a new word.
The BERT tokenizer falls short when it comes to complex characters spread over multiple bytes, as can be seen with emojis. The sequence used here contains a whale emoji. As the BERT tokenizer cannot fall back to a byte-level representation of this emoji, it replaces it with the unknown token `[UNK]`.
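
The greedy splitting described above can be sketched in a few lines. This is a toy longest-match-first splitter in the spirit of WordPiece, not the library's implementation: the tiny `TOY_VOCAB` and the helper name `toy_wordpiece` are assumptions made purely for illustration.

```python
# Toy vocabulary (an assumption for this sketch; the real BERT vocabulary is far larger).
TOY_VOCAB = {"S", "##ys", "##to", "##lic", "array", "##s", "are", "cool", "[UNK]"}


def toy_wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


print(toy_wordpiece("Systolic"))  # ['S', '##ys', '##to', '##lic']
print(toy_wordpiece("arrays"))    # ['array', '##s']
print(toy_wordpiece("🐳"))        # ['[UNK]']
```

With this toy vocabulary the splits match the BERT output shown earlier, and the whale emoji falls through to `[UNK]` because no piece of it is in the vocabulary.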

**RoBERTa Tokenizer**

The RoBERTa tokenizer, on the other hand, takes a slightly different approach. Here too, the string is split into multiple substrings, which are themselves split until every substring can be represented by the vocabulary. However, the RoBERTa tokenizer works at the **byte level**. It can represent every sequence as a combination of bytes, which makes it shine with complex characters spread over multiple bytes, such as the whale emoji. Instead of falling back to the unknown token, it encodes the emoji as a combination of several byte-level tokens: in the output above, `Ġ` stands for a leading space, and the tokens `ĠðŁ`, `Ĳ` and `³` together cover the four UTF-8 bytes of the emoji. This tokenizer therefore does not need an unknown token, as it can handle every byte separately.
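
To see where strings such as `ĠðŁ` come from, here is a minimal sketch of the byte-to-character table that GPT-2-style byte-level tokenizers use to make raw bytes printable. It reproduces only that mapping, not the BPE merges, and the helper names `bytes_to_unicode` and `to_byte_level` are chosen for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that already correspond to printable characters map to themselves:
    # '!'..'~' (33-126), U+00A1..U+00AC (161-172) and U+00AE..U+00FF (174-255).
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in keep}
    # The remaining bytes (control characters, space, ...) are shifted past 255.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Render the UTF-8 bytes of text with the printable byte alphabet."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print("🐳".encode("utf-8"))   # b'\xf0\x9f\x90\xb3'
print(to_byte_level(" 🐳"))   # ĠðŁĲ³
```

The space becomes `Ġ` and the four bytes of the emoji become `ð`, `Ł`, `Ĳ` and `³`, which is exactly the material the RoBERTa tokens above are built from.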

## Getting State-of-the-Art results on sequence classification

To get state-of-the-art results on this task, we will fine-tune our models on a given dataset. Fine-tuning means training a little further, starting from an already trained checkpoint. The learning rate will be very low, because setting it too high would result in catastrophic forgetting: the model would forget the semantic and syntactic knowledge it had acquired so far.

We will follow the procedure detailed below:

1. Get the dataset from `tensorflow_datasets`
2. Pre-process this dataset so that it can be used by the model
3. Set up a training loop using Keras' `fit` API; train the model on the training data
4. Evaluate the model on the validation data and compare with the published results

### Getting the dataset

We will be using the Microsoft Research Paraphrase Corpus (MRPC) dataset, a sentence-pair classification dataset. We get the train and validation data from the `tensorflow_datasets` package. Both splits are `tf.data.Dataset` objects, which is perfect for our use case.

In [6]:
import tensorflow_datasets
data = tensorflow_datasets.load("glue/mrpc")

train_dataset = data["train"]
validation_dataset = data["validation"]

INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (/root/tensorflow_datasets/glue/mrpc/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split None, from /root/tensorflow_datasets/glue/mrpc/1.0.0


Let's output the value of the first item of the dataset to get an idea of the data we're dealing with.

In [7]:
example = next(iter(train_dataset))  # Take the first example without materialising the whole dataset
print('',
    'idx:      ', example['idx'],       '\n',
    'label:    ', example['label'],     '\n',
    'sentence1:', example['sentence1'], '\n',
    'sentence2:', example['sentence2'],
)

 idx:       tf.Tensor(1680, shape=(), dtype=int32) 
 label:     tf.Tensor(0, shape=(), dtype=int64) 
 sentence1: tf.Tensor(b'The identical rovers will act as robotic geologists , searching for evidence of past water .', shape=(), dtype=string) 
 sentence2: tf.Tensor(b'The rovers act as robotic geologists , moving on six wheels .', shape=(), dtype=string)


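Each example is a dictionary with the following structure:

```py
# Schema of one example (each value is a scalar tf.Tensor)
example = {
    "idx": ...,        # int32
    "label": ...,      # int64
    "sentence1": ...,  # string (UTF-8 bytes)
    "sentence2": ...,  # string (UTF-8 bytes)
}
```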
The three fields that are of interest to us are the `label`, the `sentence1` and the `sentence2`. The label is equal to `1` when the two sentences are paraphrases of each other, and is equal to `0` otherwise.

We cannot pass this directly to the models, as they cannot interpret strings (lists of characters), so we will see how to convert that example to features that our models can understand. First, we must obtain the token ids that will be fed to the model from the two sequences. The two models (BERT, RoBERTa) encode sequence pairs differently: both use a `cls` token to mark the start of the input and `sep` tokens to mark the end of each sequence, but they arrange them differently. With `A` as the first sequence and `B` as the second, BERT encodes the sequence pair as follows:

`[CLS] A [SEP] B [SEP]`

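while RoBERTa encodes the sequence pair differently, doubling the separator between the two sequences:

`<s> A </s></s> B </s>`

Here `<s>` plays the role of `[CLS]` and `</s>` the role of `[SEP]`.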
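
Assembled by hand, the two layouts look as follows. This is only a sketch: `build_pair` is a hypothetical helper, the ids in `a` and `b` are placeholders rather than real vocabulary lookups, and the special token ids (101/102 for BERT, 0/2 for RoBERTa) are the ones the two tokenizers report in this notebook.

```python
def build_pair(ids_a, ids_b, cls_id, sep_id, double_sep=False):
    """Assemble a sentence pair with special token ids (illustrative helper)."""
    middle = [sep_id, sep_id] if double_sep else [sep_id]
    return [cls_id] + ids_a + middle + ids_b + [sep_id]


# Placeholder token ids for two short sequences (not real vocabulary lookups).
a, b = [11, 12, 13], [21, 22]

bert_like = build_pair(a, b, cls_id=101, sep_id=102)
roberta_like = build_pair(a, b, cls_id=0, sep_id=2, double_sep=True)

print(bert_like)     # [101, 11, 12, 13, 102, 21, 22, 102]
print(roberta_like)  # [0, 11, 12, 13, 2, 2, 21, 22, 2]
```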

Thankfully, the tokenizers' `encode` method handles this on its own. Here are the two sentences we are currently dealing with:

In [None]:
seq0 = example['sentence1'].numpy().decode('utf-8')  # Obtain bytes from tensor and convert it to a string
seq1 = example['sentence2'].numpy().decode('utf-8')  # Obtain bytes from tensor and convert it to a string

print("First sequence:", seq0)
print("Second sequence:", seq1)

First sequence: Tibco has used the Rendezvous name since 1994 for several of its technology products , according to the Palo Alto , California company .
Second sequence: Tibco has used the Rendezvous name since 1994 for several of its technology products , it said .


## Encoding sequences

Two tokenizer methods come in handy for turning a sequence into something the model can understand.

### encode()

`encode` is a high-level method that returns the encoded sequence with the special tokens added, truncated to a maximum length if need be. Here we look up the special CLS and SEP token ids of BERT and RoBERTa and highlight them in the encoded sequences (in red when run in a notebook) to make the difference in pair encoding visible.

In [None]:
encoded_bert_sequence = bert_tokenizer.encode(seq0, seq1, add_special_tokens=True, max_length=128)
encoded_roberta_sequence = roberta_tokenizer.encode(seq0, seq1, add_special_tokens=True, max_length=128)

print("BERT tokenizer separator, cls token id:   ", bert_tokenizer.sep_token_id, bert_tokenizer.cls_token_id)
print("RoBERTa tokenizer separator, cls token id:", roberta_tokenizer.sep_token_id, roberta_tokenizer.cls_token_id)

bert_special_tokens = [bert_tokenizer.sep_token_id, bert_tokenizer.cls_token_id]
roberta_special_tokens = [roberta_tokenizer.sep_token_id, roberta_tokenizer.cls_token_id]

def print_in_red(string):
    print("\033[91m" + str(string) + "\033[0m", end=' ')

def print_token_ids(token_ids, special_token_ids):
    # Print special token ids in red and all other ids normally
    for tok in token_ids:
        if tok in special_token_ids:
            print_in_red(tok)
        else:
            print(tok, end=' ')

print("\nBERT tokenized sequence")
print_token_ids(encoded_bert_sequence, bert_special_tokens)

print("\n\nRoBERTa tokenized sequence")
print_token_ids(encoded_roberta_sequence, roberta_special_tokens)

BERT tokenizer separator, cls token id:    102 101
RoBERTa tokenizer separator, cls token id: 2 0

BERT tokenized sequence
101 157 13292 2528 1144 1215 1103 16513 15125 11944 1271 1290 1898 1111 1317 1104 1157 2815 2982 117 2452 1106 1103 19585 2858 17762 117 1756 1419 119 102 157 13292 2528 1144 1215 1103 16513 15125 11944 1271 1290 1898 1111 1317 1104 1157 2815 2982 117 1122 1163 119 102

RoBERTa tokenized sequence
0 565 1452 876 34 341 5 29110 42057 766 187 8148 13 484 9 63 806 785 2156 309 7 5 21065 18402 2156 886 138 479 2 2 565 1452 876 34 341 5 29110 42057 766 187 8148 13 484 9 63 806 785 2156 24 26 479 2

### encode_plus()

`encode_plus` is similar to `encode`, but it returns a dictionary-like object with additional information: the token type ids as well as several other features that we don't need to manage right now.

The token type ids are used by some models in the case of sequence-pair classification. They form a mask that tells the model which sequence each token comes from.

For example, let's say we have two sequences A and B with tokens `[a0, a1, a2, a3]` and `[b0, b1, b2, b3, b4]` respectively.

The BERT tokenizer would create a single sequence from those two lists of tokens that would look like the following:

<pre>
[tokens]         [CLS] a0 a1 a2 a3 [SEP] b0 b1 b2 b3 b4 [SEP]
[token type ids]   0    0  0  0  0   0    1  1  1  1  1   1
</pre>

Thanks to the token type ids, the model knows which token belongs to which sequence.
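
The mask from the example above can be reproduced with a short sketch. The function name `token_type_ids` is chosen for this illustration, and the layout is assumed to be the BERT pair format `[CLS] A [SEP] B [SEP]`.

```python
def token_type_ids(len_a, len_b):
    """Segment ids for the layout `[CLS] A [SEP] B [SEP]`."""
    # [CLS], the tokens of A and the first [SEP] belong to segment 0.
    first = [0] * (1 + len_a + 1)
    # The tokens of B and the final [SEP] belong to segment 1.
    second = [1] * (len_b + 1)
    return first + second


# Four tokens in A and five tokens in B, as in the example above.
print(token_type_ids(4, 5))  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```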





We won't need to call `encode_plus` ourselves in this experiment: the `Transformers` library provides a method that converts a whole dataset to features and works with any GLUE task and any tokenizer. This method is called `glue_convert_examples_to_features` and uses `encode_plus` under the hood:

In [None]:
from transformers import glue_convert_examples_to_features

bert_train_dataset = glue_convert_examples_to_features(train_dataset, bert_tokenizer, 128, 'mrpc')
bert_train_dataset = bert_train_dataset.shuffle(100).batch(32).repeat(2)

bert_validation_dataset = glue_convert_examples_to_features(validation_dataset, bert_tokenizer, 128, 'mrpc')
bert_validation_dataset = bert_validation_dataset.batch(64)

The two BERT datasets are now ready to be used: the training dataset is shuffled, batched and repeated twice, while the validation dataset is only batched.
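
Because the training pipeline ends with `.repeat(2)`, one Keras epoch runs over the training data twice. A quick back-of-the-envelope check, assuming the GLUE MRPC training split contains 3,668 examples (a figure that is not stated in this notebook):

```python
import math

num_train_examples = 3668  # assumed size of the GLUE MRPC training split
batch_size = 32
repeats = 2

# The last batch of each pass is partial, hence the ceiling.
batches_per_pass = math.ceil(num_train_examples / batch_size)
steps_per_keras_epoch = batches_per_pass * repeats

print(batches_per_pass)       # 115
print(steps_per_keras_epoch)  # 230
```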

RoBERTa requires a bit more work as it does not use the `token_type_ids`, which we need to remove. We use the `tf.data.Dataset.map()` method for this.

In [None]:
def token_type_ids_removal(example, label):
    del example["token_type_ids"]
    return example, label

roberta_train_dataset = glue_convert_examples_to_features(train_dataset, roberta_tokenizer, 128, 'mrpc')
roberta_train_dataset = roberta_train_dataset.map(token_type_ids_removal)
roberta_train_dataset = roberta_train_dataset.shuffle(100).batch(32).repeat(2)

roberta_validation_dataset = glue_convert_examples_to_features(validation_dataset, roberta_tokenizer, 128, 'mrpc')
roberta_validation_dataset = roberta_validation_dataset.map(token_type_ids_removal)
roberta_validation_dataset = roberta_validation_dataset.batch(64)
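
The removal function above deletes the key from the dictionary it receives. A non-mutating variant is sketched below with plain Python stand-ins for one `(features, label)` pair. The name `drop_token_type_ids` and the sample values are assumptions for this illustration, and the function has not been run inside `tf.data.Dataset.map()` here.

```python
def drop_token_type_ids(features, label):
    """Return a copy of the feature dict without `token_type_ids`."""
    kept = {name: value for name, value in features.items() if name != "token_type_ids"}
    return kept, label


# Plain Python stand-ins for one (features, label) pair.
features = {
    "input_ids": [0, 565, 2],
    "attention_mask": [1, 1, 1],
    "token_type_ids": [0, 0, 0],
}
new_features, label = drop_token_type_ids(features, 1)

print(sorted(new_features))          # ['attention_mask', 'input_ids']
print("token_type_ids" in features)  # True: the input dict is left untouched
```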

### Defining the hyper-parameters

Before fine-tuning the model, we must define a few hyperparameters that will be used during the training such as the optimizer, the loss and the evaluation metric.

As an optimizer we'll be using Adam, which was the optimizer used during those models' pre-training. As a loss we'll be using the sparse categorical cross-entropy, and the sparse categorical accuracy as the evaluation metric.

In [None]:
def compile_for_fine_tuning(model):
    # Create fresh optimizer, loss and metric objects for each model so that no state is shared between them
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
    model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

compile_for_fine_tuning(bert_model)
compile_for_fine_tuning(roberta_model)
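
To make the loss and the metric concrete, here is a pure-Python sketch of what sparse categorical cross-entropy on logits and sparse categorical accuracy compute. It is not the Keras implementation, and the logits and labels below are made-up values for illustration.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example: integer label, unnormalised logits."""
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose highest logit matches the integer label."""
    hits = sum(
        1
        for logits, label in zip(batch_logits, labels)
        if max(range(len(logits)), key=logits.__getitem__) == label
    )
    return hits / len(labels)


# Made-up logits for four examples and two classes (0 = not paraphrase, 1 = paraphrase).
batch_logits = [[2.0, 0.0], [1.0, -1.0], [-1.0, 3.0], [0.5, 1.5]]
labels = [0, 1, 1, 1]

print(round(sparse_categorical_crossentropy([0.0, 0.0], 1), 4))  # 0.6931, i.e. ln(2)
print(sparse_categorical_accuracy(batch_logits, labels))         # 0.75
```

"Sparse" refers to the labels being plain integers rather than one-hot vectors, and `from_logits=True` means the softmax is applied inside the loss.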

### Training the model

The beauty of TensorFlow/Keras lies here: Keras' `fit` method fine-tunes a model with a single line of code.

In [None]:
print("Fine-tuning BERT on MRPC")
bert_history = bert_model.fit(bert_train_dataset, epochs=3, validation_data=bert_validation_dataset)

print("\nFine-tuning RoBERTa on MRPC")
roberta_history = roberta_model.fit(roberta_train_dataset, epochs=3, validation_data=roberta_validation_dataset)

Fine-tuning BERT on MRPC
Epoch 1/3
Epoch 2/3
Epoch 3/3

Fine-tuning RoBERTa on MRPC
Epoch 1/3
Epoch 2/3
Epoch 3/3


### Keras's simplicity doesn't end here

Evaluating a model is as simple as training it: we use the `evaluate` method.

In [None]:
print("Evaluating the BERT model")
bert_model.evaluate(bert_validation_dataset)

print("Evaluating the RoBERTa model")
roberta_model.evaluate(roberta_validation_dataset)

## Results

The results we obtain for BERT are similar to the original results reported in the paper, which were computed using the official GLUE evaluation server. The accuracy obtained for RoBERTa is slightly lower than in the paper, probably because of the initialisation: in the paper, fine-tuning on MRPC started from the MNLI checkpoint rather than from the base checkpoint.