<a href="https://colab.research.google.com/github/rahiakela/advanced-natural-language-processing-with-tensorflow-2/blob/main/4-transfer-learning/2_understanding_sentiment_using_bert_based_transfer_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Understanding Sentiment using BERT based transfer learning

Previously, we used a BiLSTM model to predict the sentiment of IMDb movie reviews. That model learned the word embeddings from scratch and reached an accuracy of `83.55%` on the test set, while the SOTA result was closer to `97.4%`. If pre-trained embeddings are used, we expect an increase in model accuracy.

After the setup is complete, we will use TensorFlow together with a pre-trained BERT model from Hugging Face. Two different models will be tried:
- the first uses a pre-built classification model on top of BERT
- the second uses the base BERT model with custom layers added on top

Let's try this out and see the impact of transfer learning on this model.

##Setup

In [None]:
!pip -q install transformers

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

from transformers import BertTokenizer
from transformers import TFBertForSequenceClassification
from transformers import TFBertModel

# create training and validation splits
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

tf.__version__

'2.5.0'

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
######## GPU CONFIGS FOR RTX 2070 ###############
## Please ignore if not training on GPU       ##
## this is important for running CuDNN on GPU ##

tf.keras.backend.clear_session() #- for easy reset of notebook state

# check if GPU can be seen by TF
tf.config.list_physical_devices('GPU')
# only if you want to see how commands are executed
#tf.debugging.set_log_device_placement(True)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_memory_growth(gpus[0], True)
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)
###############################################

1 Physical GPUs, 1 Logical GPU


In [None]:
# Download the GloVe embeddings
!wget -q http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


##Loading IMDb training data

TensorFlow Datasets or the tfds package will be used to load the data:

In [None]:
imdb_train, ds_info = tfds.load(name="imdb_reviews", split="train", with_info=True, as_supervised=True)
imdb_test = tfds.load(name="imdb_reviews", split="test", as_supervised=True)

In [None]:
# Check label and example from the dataset
for example, label in imdb_train.take(1):
  print(example, "\n", label)

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) 
 tf.Tensor(0, shape=(), dtype=int64)


##Tokenization and normalization

Hugging Face have provided pre-trained models as well as abstractions that make working with advanced models like BERT a breeze. The general flow for getting BERT to work will be:

1. Load a pre-trained model
2. Instantiate a tokenizer and tokenize the data
3. Set up a model and compile it
4. Fit the model on the data

### 1-Load pre-trained model tokenizer

The tokenizer is the first step – it needs to be imported before it can be used:

In [None]:
# downloads the configuration and the vocabulary file from the cloud and instantiates a tokenizer
bert_name = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(bert_name,
                                          add_special_tokens=True,
                                          do_lower_case=False,
                                          max_length=150,
                                          pad_to_max_length=True)

There are three sequences that need to be provided to the BERT model:

- **input_ids**: This corresponds to the tokens in the inputs converted into IDs.
- **token_type_ids**: If the input contains two sequences, these IDs tell the model which input_ids correspond to which sequence.
- **attention_mask**: Given that the sequences are padded, this mask tells the
model where the actual tokens end so that the attention calculation does not
use the padding tokens.

If the input sequence was "Don't be lured", then the figure shows how it is
tokenized with the WordPiece tokenizer as well as the addition of special tokens.

<img src='https://github.com/rahiakela/img-repo/blob/master/advanced-nlp-with-tensorflow-2/bert-sequences.png?raw=1' width='800'/>

Only one sequence is provided, hence the token type IDs or segment IDs all have the same value. The attention mask is set to 1, where the corresponding entry in the tokens is an actual token.

Let's generate these encodings.

In [None]:
tokenizer.encode_plus(" Don't be lured",
                      add_special_tokens=True,
                      max_length=9,
                      truncation=True,
                      pad_to_max_length=True,
                      return_attention_mask=True,
                      return_token_type_ids=True)

{'input_ids': [101, 1790, 112, 189, 1129, 19615, 1181, 102, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0]}

If two strings are passed to the tokenizer, then they are treated as a pair.

In [None]:
tokenizer.encode_plus(" Don't be", " lured",
                      add_special_tokens=True,
                      max_length=9,
                      truncation=True,
                      pad_to_max_length=True,
                      return_attention_mask=True,
                      return_token_type_ids=True)

{'input_ids': [101, 1790, 112, 189, 1129, 102, 19615, 1181, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

The input IDs have two separators to distinguish between the two sequences. The
token type IDs help distinguish which tokens correspond to which sequence. 
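
To make the layout of the three sequences concrete, here is a minimal pure-Python sketch that assembles them by hand from WordPiece IDs that are already known. It is not the Hugging Face implementation: `build_bert_inputs` is a hypothetical helper, truncation is left out, and the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) are simply read off the tokenizer outputs shown above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as they appear in the outputs above


def build_bert_inputs(ids_a, ids_b=None, max_length=9):
    """Hypothetical helper: lay out one or two ID sequences the way BERT expects."""
    # First segment: [CLS] tokens_a [SEP], all with token type 0.
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)

    # Optional second segment: tokens_b [SEP], all with token type 1.
    if ids_b is not None:
        second = list(ids_b) + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)

    if len(input_ids) > max_length:
        raise ValueError("this sketch does not implement truncation")

    # Real tokens get mask 1, padding gets mask 0 and token type 0.
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask,
    }


single = build_bert_inputs([1790, 112, 189, 1129, 19615, 1181])
pair = build_bert_inputs([1790, 112, 189, 1129], [19615, 1181])
print(single)
print(pair)
```

With the WordPiece IDs from the two examples above, the helper reproduces the same three lists that `encode_plus` returned.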

Note that the token type ID for the padding token is set to 0. This value has no effect in the network, because the attention mask keeps the padding positions out of the attention calculation.
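
The effect of the attention mask can be illustrated with a small sketch. Implementations commonly add a large negative number to the attention scores of masked positions before the softmax, so those positions end up with (almost) zero weight. The function name `masked_softmax`, the toy scores, and the penalty value below are assumptions made for illustration, not code taken from the library.

```python
import math


def masked_softmax(scores, attention_mask, penalty=-10000.0):
    """Softmax over attention scores, pushing masked positions towards zero weight."""
    # Add a large negative number wherever the mask is 0 (padding).
    shifted = [s + (1 - m) * penalty for s, m in zip(scores, attention_mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]  # toy scores; the last position is padding
mask = [1, 1, 1, 0]
weights = masked_softmax(scores, mask)
print(weights)
```

Even though the padding position has the highest raw score, its weight after masking is effectively zero, and the remaining weights still sum to one.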

To perform encoding of the inputs for all the IMDb reviews, a helper function is
defined.

In [None]:
def bert_encoder(review):
  txt = review.numpy().decode("utf-8")
  encoded = tokenizer.encode_plus(txt, 
                                  add_special_tokens=True,
                                  max_length=150,
                                  truncation=True,
                                  pad_to_max_length=True,
                                  return_attention_mask=True,
                                  return_token_type_ids=True)
  return encoded["input_ids"], encoded["token_type_ids"], encoded["attention_mask"]

Now, this needs to be applied to every review in the training data:

In [None]:
bert_train = [bert_encoder(review) for review, label in imdb_train]
bert_label = [label for review, label in imdb_train]

bert_train = np.array(bert_train)
# Labels of the reviews are also converted into categorical values.
bert_label = tf.keras.utils.to_categorical(bert_label, num_classes=2)

In [None]:
print(bert_train.shape, bert_label.shape)

(25000, 3, 150) (25000, 2)


In [None]:
# create training and validation splits
x_train, x_val, y_train, y_val = train_test_split(bert_train, bert_label, test_size=0.2, random_state=42)
print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)

(20000, 3, 150) (20000, 2)
(5000, 3, 150) (5000, 2)


A little more data processing is required to wrangle the inputs into three input
dictionaries in a `tf.data.Dataset` for easy use in training:

In [None]:
train_reviews, train_segments, train_masks = np.split(x_train, 3, axis=1)
val_reviews, val_segments, val_masks = np.split(x_val, 3, axis=1)

train_reviews = train_reviews.squeeze()
train_segments = train_segments.squeeze()
train_masks = train_masks.squeeze()

val_reviews = val_reviews.squeeze()
val_segments = val_segments.squeeze()
val_masks = val_masks.squeeze()
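
The split-and-squeeze step is easier to follow on a small array. The sketch below uses a toy array in place of the encoded reviews (4 reviews, 3 sequences each, length 5; these sizes are made up) and shows that the result is the same as indexing along the second axis.

```python
import numpy as np

# Toy stand-in for the encoded reviews: 4 reviews, 3 sequences each, length 5.
toy = np.arange(4 * 3 * 5).reshape(4, 3, 5)

# np.split along axis 1 keeps that axis, now with size 1.
ids, segments, masks = np.split(toy, 3, axis=1)
shape_after_split = ids.shape

# squeeze() drops the size-1 axis, leaving one row per review.
ids, segments, masks = ids.squeeze(), segments.squeeze(), masks.squeeze()
shape_after_squeeze = ids.shape
print(shape_after_split, shape_after_squeeze)

# Plain indexing reaches the same arrays in one step.
same_as_indexing = (
    np.array_equal(ids, toy[:, 0, :])
    and np.array_equal(segments, toy[:, 1, :])
    and np.array_equal(masks, toy[:, 2, :])
)
print(same_as_indexing)
```

One caveat: `squeeze()` without an axis removes every size-1 axis, so with a single review the batch axis would disappear as well. Passing `axis=1` avoids that.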

These training and validation sequences are converted into a dataset like so:

In [None]:
def example_to_features(input_ids, attention_masks, token_type_ids, y):
  return {
      "input_ids": input_ids,
      "attention_mask": attention_masks,
      "token_type_ids": token_type_ids
  }, y

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices((train_reviews, train_masks, train_segments, y_train)) \
                          .map(example_to_features) \
                          .shuffle(100) \
                          .batch(16) 

valid_ds = tf.data.Dataset.from_tensor_slices((val_reviews, val_masks, val_segments, y_val)) \
                          .map(example_to_features) \
                          .shuffle(100) \
                          .batch(16)

A batch size of 16 has been used here because GPU memory is the limiting factor. Google Colab can support a batch size of 32, while a GPU with 8 GB of memory can support a batch size of 16.
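
The batch size also determines how many optimizer steps make up one epoch. A quick back-of-the-envelope sketch, using the split sizes printed earlier and assuming the final partial batch is kept (the default behaviour of `batch()`); `steps_per_epoch` is a hypothetical helper:

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch when the last, partial batch is kept."""
    return math.ceil(num_examples / batch_size)


train_steps = steps_per_epoch(20000, 16)
val_steps = steps_per_epoch(5000, 16)
train_steps_32 = steps_per_epoch(20000, 32)
print(train_steps, val_steps, train_steps_32)
```

Doubling the batch size halves the number of steps per epoch, at the cost of more GPU memory per step.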

Now, we are ready to train a model using BERT for classification.
We will see two approaches. 

- The first approach will use a pre-built classification
model on top of BERT.
- The second approach will use the base BERT model and add custom layers on top to accomplish the same task.

##Pre-built BERT classification model

Hugging Face libraries make it really easy to use a pre-built BERT model for
classification by providing a class to do so:

In [None]:
bert_model = TFBertForSequenceClassification.from_pretrained(bert_name)

To use this model, we only need to provide an optimizer and a loss
function and compile the model:

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)

bert_model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
bert_model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  108310272 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
_________________________________________________________________


So, the model has the entire BERT model, a dropout layer, and a classifier layer on top. This is as simple as it gets.
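
The classifier layer returns raw scores (logits) rather than probabilities, which is why the loss above is created with `from_logits=True`. The following sketch shows, for a single toy logit, what that flag means numerically. The helper functions are written from the textbook definition of binary cross-entropy and are not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(p, y):
    """Binary cross-entropy when the model already outputs a probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Binary cross-entropy computed directly from a logit (numerically stable form)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


logit, label = 1.5, 1.0  # toy values
loss_from_logit = bce_from_logit(logit, label)
loss_from_prob = bce_from_probability(sigmoid(logit), label)

# Treating a probability as if it were a logit gives a different loss.
loss_mismatched = bce_from_logit(sigmoid(logit), label)
print(loss_from_logit, loss_from_prob, loss_mismatched)
```

The first two values agree. The third does not, so the `from_logits` setting has to match whether the final layer applies a sigmoid.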

>The BERT paper suggests some settings for fine-tuning. They
suggest a batch size of `16` or `32`, run for `2` to `4` epochs. Further,
they suggest using one of the following learning rates for Adam:
`5e-5, 3e-5, or 2e-5`.

We batched the data into sets of `16`. Here, the Adam optimizer is configured to use a learning rate of `2e-5`.
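
If you want to explore beyond a single configuration, the suggested ranges form a small grid. This sketch only enumerates the combinations and does not train anything. The variable names are arbitrary, and the epoch counts are taken as 2, 3, and 4.

```python
from itertools import product

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# Every combination of the suggested settings.
grid = [
    {"batch_size": b, "learning_rate": lr, "epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

# The configuration used in this notebook is one point in that grid.
chosen = {"batch_size": 16, "learning_rate": 2e-5, "epochs": 3}
print(len(grid), chosen in grid)
```

Each fine-tuning run is slow, so in practice only a few of these combinations are usually tried.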

Let's train this model for 3 epochs. Note that training is going to be quite slow:

In [None]:
print("Fine-tuning BERT on IMDB")
bert_history = bert_model.fit(train_ds, epochs=3, validation_data=valid_ds)

If the validation accuracy holds on the test set, it is quite impressive for the little work we have done here. That needs to be checked next. Using the helper function defined earlier, the test data is tokenized and encoded in the right format:

In [None]:
# prep data for testing
bert_test = [bert_encoder(review) for review, label in imdb_test]
bert_test_label = [label for review, label in imdb_test]

bert_test2 = np.array(bert_test)
# Labels of the reviews are also converted into categorical values.
bert_test_label2 = tf.keras.utils.to_categorical(bert_test_label, num_classes=2)

test_reviews, test_segments, test_masks = np.split(bert_test2, 3, axis=1)

test_reviews = test_reviews.squeeze()
test_segments = test_segments.squeeze()
test_masks = test_masks.squeeze()

test_ds = tf.data.Dataset.from_tensor_slices((test_reviews, test_masks, test_segments, bert_test_label2)) \
                         .map(example_to_features) \
                         .shuffle(100) \
                         .batch(16) 

Evaluating the performance of this model on the test dataset, we get the following:

In [None]:
bert_model.evaluate(test_ds)



[0.4078245759010315, 0.8809599876403809]

The model accuracy is just over 88%! This is higher than the best GloVe model shown
previously, and it took much less code to implement.

##Custom model with BERT

The BERT model outputs contextual embeddings for all of the input tokens. The
embedding corresponding to the `[CLS]` token is generally used for classification tasks, and it represents the entire document. 

The pre-built model from Hugging Face returns the embeddings for the entire sequence as well as this pooled output, which represents the entire document. This pooled output vector can be used in subsequent layers to help with the classification task. This is the approach we will take in building a custom model.

The starting point for this exploration is the base TFBertModel. It can be imported and instantiated like so:

In [None]:
bert_name = "bert-base-cased"
bert = TFBertModel.from_pretrained(bert_name)
bert.summary()

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "tf_bert_model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  108310272 
Total params: 108,310,272
Trainable params: 108,310,272
Non-trainable params: 0
_________________________________________________________________


Since we are using the same pre-trained model, the cased BERT-Base model, we
can reuse the tokenized and prepared data from the section above.

Now, the custom model needs to be defined. The first layer of this model is the BERT layer. This layer will take three inputs, namely the input tokens, attention masks, and token type IDs:

In [None]:
max_seq_len = 150

input_ids = tf.keras.layers.Input((max_seq_len,), dtype=tf.int64, name="input_ids")
attention_mask = tf.keras.layers.Input((max_seq_len,), dtype=tf.int64, name="attention_mask")
token_type_ids = tf.keras.layers.Input((max_seq_len,), dtype=tf.int64, name="token_type_ids")

These names need to match the dictionary defined in the training and testing dataset.

In [None]:
train_ds.element_spec

({'attention_mask': TensorSpec(shape=(None, 150), dtype=tf.int64, name=None),
  'input_ids': TensorSpec(shape=(None, 150), dtype=tf.int64, name=None),
  'token_type_ids': TensorSpec(shape=(None, 150), dtype=tf.int64, name=None)},
 TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))

**The BERT model expects these inputs in a dictionary. It can also accept the inputs as named arguments, but this approach is clearer and makes it easy to trace the inputs.**

Once the inputs are mapped, the output of the BERT model can be computed:

In [None]:
input_dict = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids
}

outputs = bert(input_dict)
# let's see the output structure
outputs









TFBaseModelOutputWithPooling([('last_hidden_state',
                               <KerasTensor: shape=(None, 150, 768) dtype=float32 (created by layer 'tf_bert_model_1')>),
                              ('pooler_output',
                               <KerasTensor: shape=(None, 768) dtype=float32 (created by layer 'tf_bert_model_1')>)])

The first output has embeddings for each of the input tokens including the special tokens `[CLS]` and `[SEP]`. 

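The second output, `pooler_output`, is derived from the `[CLS]` token: it is that token's final hidden state passed through an additional dense layer with a tanh activation.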
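
The relationship between the two outputs can be sketched with NumPy. The pooled vector is computed from the hidden state of the first token, which is `[CLS]`, using a dense layer followed by tanh. The sizes and the random weights below are toy stand-ins, not the pre-trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 5, 8  # toy sizes; the model above uses 150 and 768

last_hidden_state = rng.normal(size=(batch, seq_len, hidden))
W = rng.normal(size=(hidden, hidden))  # stand-in for the learned pooler weights
b = np.zeros(hidden)                   # stand-in for the learned pooler bias

cls_vector = last_hidden_state[:, 0, :]  # [CLS] is always the first token
pooled = np.tanh(cls_vector @ W + b)     # one vector per input sequence
print(pooled.shape)
```

Because of the tanh, every component of the pooled vector lies between -1 and 1.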

This output will be used further in the model:

In [None]:
x = tf.keras.layers.Dropout(0.2)(outputs[1])
x = tf.keras.layers.Dense(200, activation="relu")(x)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(2, activation="sigmoid")(x)

custom_model = tf.keras.models.Model(inputs=input_dict, outputs=x)

We add a dense layer and a couple of dropout layers before an output layer. Now, the custom model is ready for training. 

The model needs to be compiled with an optimizer, loss function, and metrics to watch for:

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
# the output layer already applies a sigmoid, so the loss receives probabilities, not logits
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)

custom_model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
custom_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
attention_mask (InputLayer)     [(None, 150)]        0                                            
__________________________________________________________________________________________________
input_ids (InputLayer)          [(None, 150)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 150)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   TFBaseModelOutputWit 108310272   attention_mask[0][0]             
                                                                 input_ids[0][0]              

In addition to the BERT parameters, this custom model has 154,202 trainable parameters in the layers added on top. The model is ready to be trained.
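
The parameter count of the added layers can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below uses the hidden size of 768 from the pooled output shape; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden_size = 768  # size of the pooled output

# Custom head: Dense(200) followed by Dense(2); dropout layers have no parameters.
custom_head = dense_params(hidden_size, 200) + dense_params(200, 2)

# Pre-built head: a single Dense(2) classifier on top of BERT.
prebuilt_head = dense_params(hidden_size, 2)
print(custom_head, prebuilt_head)
```

The second number matches the `classifier (Dense)` row in the summary of the pre-built model.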

We will use the same settings from the previous BERT section and train the model for 3 epochs:

In [None]:
print("Custom Model: Fine-tuning BERT on IMDB")
custom_history = custom_model.fit(train_ds, epochs=3, validation_data=valid_ds)

Evaluating on the test set gives an accuracy of `88.18%`.

In [None]:
custom_model.evaluate(test_ds)



[0.38145649433135986, 0.8817600011825562]

If a lot of fine-tuning is done, there is a risk of BERT forgetting what it learned during pre-training. This can be a limitation when building custom models on top, as a few epochs may not be sufficient to train the layers that have been added.

In this case, the BERT model layer can be frozen, and training can be
continued further. Freezing the BERT layer is fairly easy, though it needs the
re-compilation of the model:

In [None]:
bert.trainable = False                  # don't train BERT any more
optimizer = tf.keras.optimizers.Adam()  # default learning rate (1e-3)

custom_model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
custom_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
attention_mask (InputLayer)     [(None, 150)]        0                                            
__________________________________________________________________________________________________
input_ids (InputLayer)          [(None, 150)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 150)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   TFBaseModelOutputWit 108310272   attention_mask[0][0]             
                                                                 input_ids[0][0]              

We can see that all the BERT parameters are now set to non-trainable. Since the model was being recompiled, we also took the opportunity to change the learning rate.

Now, training can be continued for a number of epochs like so:

In [None]:
print("Custom Model: Keep training custom model on IMDB")
custom_history = custom_model.fit(train_ds, epochs=10, validation_data=valid_ds)

Checking the model on the test set yields 88.17% accuracy:

In [None]:
custom_model.evaluate(test_ds)



[0.4942091703414917, 0.8816800117492676]

It is fair to ask why this custom model, with more layers and more training, is no more accurate than the pre-built model: its test accuracy of about 88.2% is essentially the same as the roughly 88.1% obtained earlier.

**A bigger network is not always better, and overtraining can lead to a reduction in model performance due to overfitting.**
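
To put the three test results side by side, the reported accuracies can be converted into counts of correctly classified reviews. The accuracies are copied from the `evaluate` outputs above. The test-set size of 25,000 is an assumption based on the standard `imdb_reviews` test split.

```python
TEST_SIZE = 25000  # assumption: size of the imdb_reviews test split

# Accuracies as reported by evaluate() in this notebook.
results = {
    "pre-built": 0.8809599876403809,
    "custom, 3 epochs": 0.8817600011825562,
    "custom, 10 more epochs with BERT frozen": 0.8816800117492676,
}

# Convert each accuracy into a number of correctly classified reviews.
correct = {name: round(acc * TEST_SIZE) for name, acc in results.items()}
for name, n in correct.items():
    print(f"{name}: {n} correct ({100 * n / TEST_SIZE:.2f}%)")
```

Under that assumption, the three runs differ by at most 20 reviews, so the extra layers and the extra epochs made almost no difference here.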

Something to try in the custom model is to use the output encodings
of all the input tokens and pass them through an LSTM layer or concatenate them
together to pass through dense layers and then make the prediction.