<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-case-studies/blob/master/huggingface-transformers-practice/training-and-fine-tuning/training_and_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Training and fine-tuning

Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the standard training tools available in either framework. We will also show how to use our included `Trainer()` class which handles much of the complexity of training for you.

Reference: https://huggingface.co/transformers/training.html

## Setup

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import torch
from torch.nn import functional as F

In [None]:
!pip install transformers

In [5]:
from transformers import pipeline
from transformers import BertTokenizer, BertForSequenceClassification, glue_convert_examples_to_features
from transformers import AdamW, get_linear_schedule_with_warmup, Trainer, TrainingArguments
from transformers import TFBertForSequenceClassification

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from pprint import pprint

## Fine-tuning in native PyTorch

Model classes in 🤗 Transformers that don't begin with `TF` are [PyTorch Modules](https://pytorch.org/docs/master/generated/torch.nn.Module.html), meaning that you can use them just as you would any
model in PyTorch for both inference and optimization.

Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.
When we instantiate a model with `PreTrainedModel.from_pretrained`, the model configuration and
pre-trained weights of the specified model are used to initialize the model. The library also includes a number of
task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
`BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)` will create a BERT model instance
with encoder weights copied from the `bert-base-uncased` model and a randomly initialized sequence classification
head on top of the encoder with an output size of 2. Models are initialized in `eval` mode by default. We can call
`model.train()` to put it in train mode.

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the `AdamW()` optimizer which implements gradient bias correction as well as weight decay.

In [None]:
optimizer = AdamW(model.parameters(), lr=1e-5)

The optimizer allows us to apply different hyperparameters to specific parameter groups.

For example, we can apply weight decay to all parameters other than bias and layer normalization terms:

In [None]:
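no_decay = ["bias", "LayerNorm.weight"]

optimizer_grouped_parameters = [
    # Apply weight decay to everything except bias and LayerNorm weight terms.
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
# Pass the groups instead of model.parameters() so the per-group settings take effect.
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)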
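
To see how the name filter sorts parameters into the two groups, here is a minimal, dependency-free sketch. The parameter names are illustrative stand-ins in the style of what `model.named_parameters()` yields for BERT; they are an assumption, not a complete listing.

```python
# Illustrative parameter names (assumed, not read from a real model).
param_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

no_decay = ["bias", "LayerNorm.weight"]

# A name falls into the no-decay group if it contains any of the substrings.
decay_names = [n for n in param_names if not any(nd in n for nd in no_decay)]
no_decay_names = [n for n in param_names if any(nd in n for nd in no_decay)]

print("weight_decay=0.01:", decay_names)
print("weight_decay=0.0: ", no_decay_names)
```

Because the filter is a plain substring match, it is worth printing both lists for your own model to confirm that every parameter ended up in the group you intended.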

Now we can set up a simple dummy training batch using `PreTrainedTokenizer.__call__`. This returns a `BatchEncoding` instance which prepares everything we might need to pass to the model.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors="pt", padding=True, truncation=True)

input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

print(input_ids)
print(attention_mask)

tensor([[  101,  1045,  2293, 14255, 18684,  2099,  1012,   102,     0,     0,
             0,     0],
        [  101,  1045,  2123,  1005,  1056,  2729,  2005, 14255, 18684,  2099,
          1012,   102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
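
The padding and mask shown above can be reproduced by hand. The sketch below is plain Python and is not the tokenizer's implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the zeros in the printed output.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each id list to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, masks

# Unpadded ids copied from the output above.
sequences = [
    [101, 1045, 2293, 14255, 18684, 2099, 1012, 102],
    [101, 1045, 2123, 1005, 1056, 2729, 2005, 14255, 18684, 2099, 1012, 102],
]
padded_ids, masks = pad_batch(sequences)
print(padded_ids[0])
print(masks[0])
```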


When we call a classification model with the `labels` argument, the first returned element is the Cross Entropy loss
between the predictions and the passed labels. Having already set up our optimizer, we can then do a backwards pass and
update the weights:

In [None]:
labels = torch.tensor([1, 0])  # one label per sequence, shape (batch_size,)
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

loss = outputs.loss
loss.backward()
optimizer.step()

Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous example:

In [None]:
optimizer.zero_grad()  # clear gradients accumulated by the previous backward pass
labels = torch.tensor([1, 0])
outputs = model(input_ids, attention_mask=attention_mask)

loss = F.cross_entropy(outputs.logits, labels)
loss.backward()
optimizer.step()
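
For intuition about what the loss measures, here is a small worked example in plain Python. It is a sketch of the standard definition (softmax followed by mean negative log-likelihood, which matches the default `mean` reduction of `F.cross_entropy`); the logits are made up for illustration.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits)."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp with the row maximum subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)

# Made-up logits for a batch of two sequences and two classes.
example_logits = [[0.0, 0.0], [2.0, 0.0]]
example_labels = [1, 0]
example_loss = cross_entropy(example_logits, example_labels)
print(example_loss)
```

The first row is undecided between the classes, so it contributes `log(2)`; the second row favors the correct class and contributes much less.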

Of course, you can train on GPU by calling `to('cuda')` on the model and inputs as usual.

We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for
`num_warmup_steps` and then linearly decays to 0 by the end of training.

In [None]:
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=10)

Then all we have to do is call `scheduler.step()` after `optimizer.step()`.

In [None]:
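optimizer.zero_grad()
# Run a fresh forward pass: the graph behind the previous loss was freed by its backward() call.
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
outputs.loss.backward()
optimizer.step()
scheduler.step()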
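
The shape of this schedule is easy to check by hand. The function below is a sketch written from the description above (linear warmup, then linear decay to 0), not code copied from the library; `linear_warmup_multiplier` is a hypothetical name, and the step counts are assumed values chosen so that both phases are visible.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Assumed values for illustration: 2 warmup steps out of 10 total.
multipliers = [linear_warmup_multiplier(s, 2, 10) for s in range(11)]
print(multipliers)
```

Under this formulation, setting `num_warmup_steps` equal to `num_training_steps` leaves no decay phase at all: the whole run is spent warming up.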

We highly recommend using `Trainer`, discussed below, which conveniently handles the moving parts of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.

### Freezing the encoder

In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the
weights of the head layers. To do so, simply set the `requires_grad` attribute to `False` on the encoder
parameters, which can be accessed with the `base_model` submodule on any task-specific model in the library:

In [None]:
for param in model.base_model.parameters():
    # False excludes the encoder weights from gradient updates, leaving only the head trainable.
    param.requires_grad = False

## Fine-tuning in native TensorFlow 2

Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with `from_pretrained()` to load the weights of the encoder from a pretrained model.

In [4]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's use `tensorflow_datasets` to load in the [MRPC dataset](https://www.tensorflow.org/datasets/catalog/glue#gluemrpc) from GLUE. We can then use our built-in `glue_convert_examples_to_features` to tokenize MRPC and convert it to a TensorFlow `Dataset` object. Note that tokenizers are framework-agnostic, so there is no need to prepend `TF` to the pretrained tokenizer name.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

data = tfds.load("glue/mrpc")
train_dataset = glue_convert_examples_to_features(data["train"], tokenizer, max_length=128, task="mrpc")
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)

The model can then be compiled and trained as any Keras model:

In [None]:
optimizer = tf.keras.

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


As we can see, it returns a dictionary where each value is a list of lists of ints.

To double-check what is fed to the model, we can decode each list in `input_ids` one by one:

In [None]:
for ids in encoded_inputs["input_ids"]:
  print(tokenizer.decode(ids))

[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]


Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum length the model can accept and return tensors directly with the following:

In [None]:
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")

In [None]:
print(batch)

{'input_ids': tensor([[  101,  8667,   146,   112,   182,   170,  1423,  5650,   102,   146,
           112,   182,   170,  5650,  1115,  2947,  1114,  1103,  1148,  5650,
           102],
        [  101,  1262,  1330,  5650,   102,  1262,   146,  1431,  1129, 12544,
          1114,  1103,  1248,  5650,   102,     0,     0,     0,     0,     0,
             0],
        [  101,  1262,  1103,  1304,  1304,  1314,  1141,   102,  1262,   146,
          1301,  1114,  1103,  1304,  1314,  1141,   102,     0,     0,     0,
             0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

## Pre-tokenized inputs

The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract predictions in named entity recognition (NER) or part-of-speech tagging (POS tagging).

>**Note**: Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer if that were the case); it means they are already split into words (which is often the first step in subword tokenization algorithms like BPE).
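
Keeping the word boundaries is what makes word-level labels usable: each word's label can be carried over to every piece that word is split into. The sketch below illustrates the idea in plain Python. `toy_split` is a hypothetical stand-in that only splits off punctuation, a real subword tokenizer will split differently, and the tags are made up.

```python
import re

def toy_split(word):
    """Hypothetical stand-in for a subword tokenizer: split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", word)

def align_labels(words, word_labels):
    """Repeat each word's label for every piece that word is split into."""
    pieces, piece_labels = [], []
    for word, label in zip(words, word_labels):
        parts = toy_split(word)
        pieces.extend(parts)
        piece_labels.extend([label] * len(parts))
    return pieces, piece_labels

words = ["Hello", "I'm", "a", "single", "sentence"]
word_labels = ["INTJ", "PRON", "DET", "ADJ", "NOUN"]  # made-up tags

pieces, piece_labels = align_labels(words, word_labels)
print(list(zip(pieces, piece_labels)))
```

Repeating the label on every piece is only one possible convention; another common choice is to label the first piece of each word and ignore the rest when computing the loss.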

If you want to use pre-tokenized inputs, just set `is_split_into_words=True` when passing your inputs to the tokenizer. For instance, we have:

In [None]:
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(encoded_input)

{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass `add_special_tokens=False`.

This works exactly as before for a batch of sentences or a batch of sentence pairs. You can encode a batch of sentences like this:

In [None]:
batch_sentences = [
  ["Hello", "I'm", "a", "single", "sentence"],
  ["And", "another", "sentence"],
  ["And", "the", "very", "very", "last", "one"]                
]

encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102], [101, 1262, 1330, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


or a batch of sentence pairs like this:

In [None]:
batch_of_second_sentences = [
  ["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
  ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
  ["And", "I", "go", "with", "the", "very", "last", "one"]              
]

encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
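
The layout of a pair encoding can be rebuilt by hand from the output above. This is a plain-Python sketch rather than the tokenizer's own code; `build_pair` is a hypothetical helper, and the ids 101 and 102 are read off the output as the `[CLS]` and `[SEP]` tokens.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two id lists as [CLS] A [SEP] B [SEP] and mark which segment each token belongs to."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Ids of the second pair in the output above, with the special tokens removed.
ids_a = [1262, 1330, 5650]
ids_b = [1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650]

pair_ids, pair_types = build_pair(ids_a, ids_b)
print(pair_ids)
print(pair_types)
```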


You can also add padding and truncation and return tensors directly, as before:

In [None]:
batch = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")
print(batch)

{'input_ids': tensor([[  101,  8667,   146,   112,   182,   170,  1423,  5650,   102,   146,
           112,   182,   170,  5650,  1115,  2947,  1114,  1103,  1148,  5650,
           102],
        [  101,  1262,  1330,  5650,   102,  1262,   146,  1431,  1129, 12544,
          1114,  1103,  1248,  5650,   102,     0,     0,     0,     0,     0,
             0],
        [  101,  1262,  1103,  1304,  1304,  1314,  1141,   102,  1262,   146,
          1301,  1114,  1103,  1304,  1314,  1141,   102,     0,     0,     0,
             0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
batch = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True, padding=True, truncation=True, return_tensors="tf")
print(batch)

{'input_ids': <tf.Tensor: shape=(3, 21), dtype=int32, numpy=
array([[  101,  8667,   146,   112,   182,   170,  1423,  5650,   102,
          146,   112,   182,   170,  5650,  1115,  2947,  1114,  1103,
         1148,  5650,   102],
       [  101,  1262,  1330,  5650,   102,  1262,   146,  1431,  1129,
        12544,  1114,  1103,  1248,  5650,   102,     0,     0,     0,
            0,     0,     0],
       [  101,  1262,  1103,  1304,  1304,  1314,  1141,   102,  1262,
          146,  1301,  1114,  1103,  1304,  1314,  1141,   102,     0,
            0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 21), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 21), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1