# TASK 6 - Fine Tuning BERT

In this task we fine-tune a BERT model for a new classification task. The pre-trained BERT encoder is reused as it is, while a classifier has to be added manually on top of it to suit the task at hand. We were asked to follow [this](https://www.tensorflow.org/official_models/fine_tuning_bert) TensorFlow example, which uses the MRPC dataset: each example is a pair of sentences, and the classifier decides whether the two sentences are equivalent. We instead trained a classifier on the CoLA dataset, where the network decides whether a given English sentence is grammatically valid.

In [1]:
!pip install -q tf-nightly
!pip install -q tf-models-nightly


### The imports

In [2]:
import os
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

from official.modeling import tf_utils
from official import nlp
from official.nlp import bert

# Load the required submodules
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks

from google.colab import files

### Links to the BERT weights and training checkpoints

In [3]:
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"
tf.io.gfile.listdir(gs_folder_bert)
hub_url_bert = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2"

### Loading the glue/cola dataset

In [4]:
sst2, info = tfds.load('glue/cola', with_info=True,
                       batch_size=-1)

Downloading and preparing dataset glue/cola/1.0.0 (download: 368.14 KiB, generated: Unknown size, total: 368.14 KiB) to /root/tensorflow_datasets/glue/cola/1.0.0...
Shuffling and writing examples to /root/tensorflow_datasets/glue/cola/1.0.0.incomplete7URXXU/glue-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/glue/cola/1.0.0.incomplete7URXXU/glue-validation.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/glue/cola/1.0.0.incomplete7URXXU/glue-test.tfrecord
Dataset glue downloaded and prepared to /root/tensorflow_datasets/glue/cola/1.0.0. Subsequent calls will reuse this data.


In [39]:
sst2_train = sst2["train"]
for key,value in sst2_train.items():
  print(f"{key:9s}: {value[1000].numpy()}")

idx      : 3393
label    : 1
sentence : b'They rowed the canals of Venice.'


### Calling the tokenizer

The tokenizer that ships with BERT is used here, because the pre-trained encoder was trained on text tokenized in exactly this way.
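To give an intuition for what such a tokenizer does to a single word, here is a minimal sketch of greedy longest-match sub-word splitting in the WordPiece style. The function name `wordpiece_tokenize` and the toy vocabulary are invented for this illustration; the real tokenizer also handles lower-casing, punctuation and a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example only.
toy_vocab = {"row", "##ed", "canal", "##s", "the", "of"}

print(wordpiece_tokenize("rowed", toy_vocab))
print(wordpiece_tokenize("canals", toy_vocab))
print(wordpiece_tokenize("venice", toy_vocab))
```

With this toy vocabulary, "rowed" and "canals" are each split into a stem plus a `##` continuation piece, while "venice" has no matching piece and falls back to the unknown token.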

In [5]:
tokenizer = bert.tokenization.FullTokenizer(
    vocab_file=os.path.join(gs_folder_bert, "vocab.txt"),
    do_lower_case=True)

print("Vocab size:", len(tokenizer.vocab))

Vocab size: 30522


### Preprocessing functions

In [34]:
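def encode_sentence(s):
  # Elements of a TFDS tensor arrive as tf.Tensor scalars;
  # unwrap them to bytes so that the tokenizer accepts them.
  if hasattr(s, "numpy"):
    s = s.numpy()
  tokens = list(tokenizer.tokenize(s))
  return tokenizer.convert_tokens_to_ids(tokens)

def preprocessing(sst2_dict, tokenizer):
  # Tokenize every sentence into a ragged tensor of token ids.
  sentence = tf.ragged.constant([
      encode_sentence(s) for s in sst2_dict["sentence"]])
  # Prepend the [CLS] token to every sentence.
  # Note: unlike the TensorFlow tutorial, no [SEP] token is appended here.
  cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])] * sentence.shape[0]
  input_word_ids = tf.concat([cls, sentence], axis=-1)
  # 1 for real tokens, 0 for the padding added by to_tensor().
  input_mask = tf.ones_like(input_word_ids).to_tensor()
  # Single-sentence task: every token belongs to segment 0.
  type_cls = tf.zeros_like(cls)
  type_s = tf.zeros_like(sentence)
  input_type_ids = tf.concat(
      [type_cls, type_s], axis=-1).to_tensor()
  inputs = {
      'input_word_ids': input_word_ids.to_tensor(),
      'input_mask': input_mask,
      'input_type_ids': input_type_ids
      }
  return inputs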
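
To make the three tensors built above more concrete, the following is a small pure-Python sketch of the same padding logic on nested lists. The helper `pad_and_mask` and the token ids are made up for illustration (101 is assumed here as the [CLS] id); it is not the code used in the notebook.

```python
def pad_and_mask(token_id_lists, cls_id, pad_id=0):
    """Prepend [CLS], pad to the longest sentence, and build mask and type ids."""
    with_cls = [[cls_id] + ids for ids in token_id_lists]
    max_len = max(len(ids) for ids in with_cls)

    input_word_ids, input_mask, input_type_ids = [], [], []
    for ids in with_cls:
        n_pad = max_len - len(ids)
        input_word_ids.append(ids + [pad_id] * n_pad)     # padded token ids
        input_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
        input_type_ids.append([0] * max_len)              # single sentence: all segment 0
    return {
        "input_word_ids": input_word_ids,
        "input_mask": input_mask,
        "input_type_ids": input_type_ids,
    }


# Two toy "sentences" with invented token ids; 101 is assumed to be the [CLS] id.
batch = pad_and_mask([[7, 8, 9], [5]], cls_id=101)
for name, rows in batch.items():
    print(name, rows)
```

The shorter sentence is padded with zeros, and the mask records which positions are real tokens.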

### Preprocessing the train, validation and test sets

In [13]:
glue_train = preprocessing(sst2['train'], tokenizer)
glue_train_labels = sst2['train']['label']

glue_validation = preprocessing(sst2['validation'], tokenizer)
glue_validation_labels = sst2['validation']['label']

glue_test = preprocessing(sst2['test'], tokenizer)
glue_test_labels  = sst2['test']['label']

### Retrieving the model, weights and checkpoints

In [8]:
import json
bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())
bert_config = bert.configs.BertConfig.from_dict(config_dict)
_ , bert_encoder = bert.bert_models.classifier_model(
    bert_config, num_labels=2)

### Restoring the encoder checkpoints

In [9]:
checkpoint = tf.train.Checkpoint(model=bert_encoder)
checkpoint.restore(
    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f7730bb5b00>

In [10]:
transformer_config = config_dict.copy()

# You need to rename a few fields to make this work:
transformer_config['attention_dropout_rate'] = transformer_config.pop('attention_probs_dropout_prob')
transformer_config['activation'] = tf_utils.get_activation(transformer_config.pop('hidden_act'))
transformer_config['dropout_rate'] = transformer_config.pop('hidden_dropout_prob')
transformer_config['initializer'] = tf.keras.initializers.TruncatedNormal(
          stddev=transformer_config.pop('initializer_range'))
transformer_config['max_sequence_length'] = transformer_config.pop('max_position_embeddings')
transformer_config['num_layers'] = transformer_config.pop('num_hidden_layers')

### Testing the encoder

In [14]:
glue_batch = {key: val[:20] for key, val in glue_train.items()}
print(bert_encoder(glue_batch))

[<tf.Tensor: shape=(20, 1, 768), dtype=float32, numpy=
array([[[ 2.7265620e-01,  2.3244005e-01,  2.9404190e-01, ...,
         -2.0707411e-01,  4.6615210e-01, -5.0804846e-02]],

       [[ 2.4580577e-01, -6.4193029e-03, -2.2461516e-01, ...,
         -3.5705957e-01,  1.8671927e-01, -1.7014639e-01]],

       [[ 6.0351956e-01, -5.4982886e-02,  5.9810340e-01, ...,
          7.2259456e-05,  5.3612846e-01,  4.4433987e-01]],

       ...,

       [[ 5.1136035e-01,  3.6061943e-01,  1.2609263e-01, ...,
         -4.2029239e-02,  6.5500802e-01,  1.3683848e-01]],

       [[ 4.1491485e-01, -4.2447838e-01,  3.3122575e-01, ...,
          4.1855934e-01,  2.8256467e-01,  4.2693657e-01]],

       [[-3.4039792e-02, -2.4015215e-01, -2.5847986e-02, ...,
         -1.5648928e-01,  2.6566467e-01,  2.3303467e-01]]], dtype=float32)>, <tf.Tensor: shape=(20, 768), dtype=float32, numpy=
array([[-0.50040084,  0.03004529,  0.64446235, ...,  0.21715291,
        -0.16101325,  0.799442  ],
       [-0.89464086, -0.14634508

### Initializing a custom classifier

We were asked to avoid using the default BERT classifier, so a classifier is instantiated manually through the `nlp` library. This lets us control attributes such as the number of output classes, the dropout rate and the weight initialization.

In [15]:
sentiment_classifier = nlp.modeling.models.BertClassifier(
    bert_encoder,
    num_classes = 2,
    dropout_rate = transformer_config['dropout_rate'],
    initializer=tf.keras.initializers.TruncatedNormal(
          stddev=bert_config.initializer_range))

In [16]:
sentiment_classifier(glue_batch,training=True).numpy()

array([[ 0.14587045,  0.32071835],
       [ 0.1134103 ,  0.45238513],
       [-0.30765978,  0.15355203],
       [ 0.14140067,  0.38017064],
       [-0.31643736,  0.24233669],
       [ 0.40680867,  0.13516405],
       [ 0.30474538,  0.38137674],
       [-0.08823901,  0.2683372 ],
       [-0.02572563,  0.47956342],
       [-0.07893646,  0.17836675],
       [-0.13896203,  0.0202191 ],
       [-0.47501814,  0.17931724],
       [ 0.14061487,  0.4207766 ],
       [ 0.0979514 ,  0.35018104],
       [ 0.056511  , -0.04098345],
       [-0.14823565,  0.1639602 ],
       [ 0.05060878,  0.50964713],
       [ 0.28556955,  0.63980305],
       [-0.23541737,  0.29326594],
       [ 0.53926706,  0.52923226]], dtype=float32)
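
The array above holds raw logits, one pair per sentence, coming from a classifier head that has not been trained yet. As a rough sketch of how such logits turn into class probabilities and a predicted label, the snippet below applies a softmax to two rows copied from the output; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    top = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Rows 1 and 15 of the logits printed above.
logit_rows = [[0.14587045, 0.32071835],
              [0.056511, -0.04098345]]

for row in logit_rows:
    probs = softmax(row)
    predicted = probs.index(max(probs))            # same result as an argmax over the logits
    print(probs, predicted)
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predicted class.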

### Setting up the training process

In [19]:
epochs = 3
batch_size = 32
eval_batch_size = 32

train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)
optimizer = nlp.optimization.create_optimizer(
    2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)

metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
sentiment_classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics)
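
The step arithmetic above can be checked by hand. The sketch below repeats it in plain Python and adds a simplified learning-rate curve, assuming a linear warm-up followed by a linear decay to zero; that shape is an assumption about what `create_optimizer` sets up, not a copy of its implementation. The training-set size of 8551 is likewise assumed for CoLA, and the helper names are invented.

```python
def training_steps(train_data_size, batch_size=32, epochs=3, warmup_fraction=0.1):
    """Reproduce the step counts computed in the cell above."""
    steps_per_epoch = int(train_data_size / batch_size)
    num_train_steps = steps_per_epoch * epochs
    warmup_steps = int(epochs * train_data_size * warmup_fraction / batch_size)
    return steps_per_epoch, num_train_steps, warmup_steps


def learning_rate(step, init_lr, num_train_steps, warmup_steps):
    """Assumed schedule: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    return init_lr * max(0.0, 1.0 - step / num_train_steps)


# 8551 is the assumed number of CoLA training sentences.
steps_per_epoch, num_train_steps, warmup_steps = training_steps(8551)
print(steps_per_epoch, num_train_steps, warmup_steps)

for step in (0, warmup_steps // 2, warmup_steps, num_train_steps):
    print(step, learning_rate(step, 2e-5, num_train_steps, warmup_steps))
```

Under these assumptions roughly the first tenth of the training steps is spent ramping the learning rate up, which is a common way to keep the pre-trained weights from being disturbed too early.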

### Training block

In [20]:
sentiment_classifier.fit(
      glue_train, glue_train_labels,
      validation_data=(glue_validation, glue_validation_labels),
      batch_size=32,
      epochs=epochs)
sentiment_classifier.save("Cola_BERT.h5")
files.download("Cola_BERT.h5") 

Epoch 1/3
Epoch 2/3
Epoch 3/3



### Model Evaluation and results

In [None]:
example_sentences = [
            b'Us love they.',
            b'i am a man',
            b'It is nice to go abroad.',
            b'Mary came to be introduced by the bartender and I also came to be.',
            b'John often meets Mary.'
            ]
test_sentence = preprocessing(
    sst2_dict = {
        'sentence': example_sentences,
    },tokenizer = tokenizer)


results = sentiment_classifier(test_sentence)
result = tf.argmax(results,axis = 1).numpy()
print(result)
for i in range(len(example_sentences)):
  print("{} \t {}".format(example_sentences[i],result[i]))

[0 1 1 0 1]
b'Us love they.' 	 0
b'i am a man' 	 1
b'It is nice to go abroad.' 	 1
b'Mary came to be introduced by the bartender and I also came to be.' 	 0
b'John often meets Mary.' 	 1


### Answers to the questions

#### **What is the tutorial classifying when using the GLUE MRPC data set?**

The GLUE/MRPC dataset contains pairs of sentences. The task is to build a model that compares the two sentences of each pair and decides whether they are semantically equivalent.

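#### **In addition to the input itself, the tutorial feeds two binary tensors for input mask and input type to the model. Is this necessary for the data set single sentence classification?**

The BERT encoder model expects three inputs:

* input word ids - the tokenized sentence tensor.
* input mask - this is necessary whenever sentences of different lengths are padded into one batch. It marks real tokens with 1 and padding with 0, so that the padded positions are ignored by the encoder's attention.
* input type ids - this marks which sentence of a pair each token belongs to. For single-sentence classification it carries no information and is simply all zeros, as in our preprocessing function, but the encoder still expects the tensor to be supplied.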
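
As a rough illustration of why the input mask matters, the sketch below applies a mask inside an attention-style softmax: positions marked 0 receive a very large negative score and therefore end up with zero weight. This is a simplified stand-in written for this answer, with invented scores; it is not the encoder's actual attention code.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, with masked-out (mask == 0) positions pushed to zero weight."""
    masked = [s if m == 1 else -1e9 for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Invented attention scores for a sentence with two real tokens and two padded positions.
scores = [2.0, 1.0, 3.0, 3.0]
mask = [1, 1, 0, 0]
weights = masked_softmax(scores, mask)
print(weights)
```

Even though the padded positions have the highest raw scores in this toy example, they receive no attention weight once the mask is applied.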


#### **How does the tokenization in BERT differ from the one in the previous Task 5?**

In the previous task (NMT) we generated our own word tokens and built the encoder model from scratch. BERT instead comes with its own tokenizer and a fixed vocabulary of 30522 entries, which splits words into sub-word (WordPiece) units, and the language model is already pre-trained on a large number of sentences tokenized this way. We therefore only need to reuse this encoder and attach a classifier to it to perform the required task.

#### **What is a [CLS] token and what is it used for?**

The BERT model has a number of special tokens such as [CLS], [SEP] and [MASK]. The [CLS] token is always placed at the beginning of the input, and its encoded output serves as a summary of the whole input that the classifier works on. In the TensorFlow example the task deals with pairs of sentences, so a [SEP] token is placed between them. The BERT encoder is pre-trained with all these tokens in their respective places.
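
The conventional placement of these tokens can be sketched as follows. The helper `build_inputs` and the example tokens are invented for illustration; note that the preprocessing function in this notebook only prepends [CLS] and does not add [SEP].

```python
def build_inputs(first, second=None):
    """Lay out tokens as [CLS] first [SEP] (second [SEP]) with matching segment ids."""
    tokens = ["[CLS]"] + first + ["[SEP]"]
    type_ids = [0] * len(tokens)              # segment 0: [CLS], first sentence, first [SEP]
    if second is not None:
        tokens += second + ["[SEP]"]
        type_ids += [1] * (len(second) + 1)   # segment 1: second sentence and its [SEP]
    return tokens, type_ids


# Single sentence, as in CoLA.
single_tokens, single_types = build_inputs(["they", "row", "##ed"])
print(single_tokens, single_types)

# Sentence pair, as in MRPC.
pair_tokens, pair_types = build_inputs(["he", "left"], ["he", "went", "away"])
print(pair_tokens, pair_types)
```

For a single sentence the segment ids are all zero, which is why the type tensor carries no extra information in the CoLA setting.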

#### **Which part of the BERT encoding is used for the classification?**

The input sentences are fed into the BERT encoder, which encodes them into a latent space representation. Of this representation, the pooled output, a single 768-dimensional vector per sentence derived from the [CLS] position, is fed into a classifier model of our choice to perform the fine-tuning.

#### **Does your answer match the output shape of the encoder?**

Yes. The pooled output of the above encoder has shape (n, 768), one vector per sentence. The classifier maps it to the targeted output, a tensor of shape (n, 2), since the task at hand is a binary classification problem.
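
The change of shape from (n, 768) to (n, 2) can be pictured as a single dense layer on top of the encoder output. The sketch below uses random numbers in place of real encoder outputs and weights, leaves out dropout, and assumes a standard deviation of 0.02 for the initial weights; the function `classifier_head` is written for this illustration only.

```python
import random


def classifier_head(pooled, weights, bias):
    """Dense layer: one logit per class for every pooled sentence vector."""
    return [
        [sum(x * w for x, w in zip(row, class_weights)) + b
         for class_weights, b in zip(weights, bias)]
        for row in pooled
    ]


random.seed(0)
n, hidden_size, num_classes = 4, 768, 2

# Random stand-ins for the encoder's pooled output and the head's parameters.
pooled = [[random.uniform(-1.0, 1.0) for _ in range(hidden_size)] for _ in range(n)]
weights = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_classes)]
bias = [0.0] * num_classes

logits = classifier_head(pooled, weights, bias)
print(len(logits), len(logits[0]))
```

Each sentence vector of length 768 is reduced to two logits, one per class.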

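#### **Are the BERT encoder weights also fine-tuned to the task?**

Yes, the encoder weights are also fine-tuned to the task. The encoder is part of the compiled classifier model and none of its layers are frozen, so training updates its weights together with those of the classifier.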