<a href="https://colab.research.google.com/github/rahiakela/advanced-natural-language-processing-with-tensorflow-2/blob/main/4-transfer-learning/1_understanding_sentiment_using_glove_based_transfer_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Understanding Sentiment using GloVe based transfer learning

We previously used a BiLSTM model to predict the sentiment of IMDb movie reviews. That model learned its word embeddings from scratch and reached an accuracy of `83.55%` on the test set, while the SOTA result was closer to `97.4%`. If pre-trained embeddings are used, we expect an increase in model accuracy.

Once the setup is complete, TensorFlow is used to load these pre-trained embeddings into a model. Two different models will be tried:
- the first will be based on feature extraction
- the second one on fine-tuning

Let's try this out and see the impact of transfer learning on this model.

##Setup

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout

import numpy as np
import pandas as pd

tf.__version__

'2.5.0'

In [2]:
######## GPU CONFIGS FOR RTX 2070 ###############
## Please ignore if not training on GPU       ##
## this is important for running CuDNN on GPU ##

tf.keras.backend.clear_session() #- for easy reset of notebook state

# check if GPU can be seen by TF
tf.config.list_physical_devices('GPU')
# only if you want to see how commands are executed
#tf.debugging.set_log_device_placement(True)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_memory_growth(gpus[0], True)
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)
###############################################

1 Physical GPUs, 1 Logical GPU


In [3]:
# Download the GloVe embeddings
!wget -q http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


##Loading IMDb training data

TensorFlow Datasets or the tfds package will be used to load the data:

In [None]:
imdb_train, ds_info = tfds.load(name="imdb_reviews", split="train", with_info=True, as_supervised=True)
imdb_test = tfds.load(name="imdb_reviews", split="test", as_supervised=True)

In [5]:
# Check label and example from the dataset
for example, label in imdb_train.take(1):
  print(example, "\n", label)

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) 
 tf.Tensor(0, shape=(), dtype=int64)


## Create Vocab and Encoder

After the training and test sets are loaded, the content of the reviews needs to be tokenized and encoded:

In [6]:
# Use the default tokenizer settings
tokenizer = tfds.deprecated.text.Tokenizer()

vocabulary_set = set()
MAX_TOKENS = 0

for example, label in imdb_train:
  some_tokens = tokenizer.tokenize(example.numpy())
  if MAX_TOKENS < len(some_tokens):
    MAX_TOKENS = len(some_tokens)
  vocabulary_set.update(some_tokens)

The loop above tokenizes the review text and constructs a vocabulary.
This vocabulary is then used to construct an encoder:

In [7]:
imdb_encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set, lowercase=True, tokenizer=tokenizer)
vocab_size = imdb_encoder.vocab_size

print(vocab_size, MAX_TOKENS)

93931 2525


Note that the text is converted to lowercase before encoding. Lowercasing reduces the vocabulary size and may improve the lookup of the corresponding GloVe vectors. However, capitalization can carry important information that helps in tasks such as NER.
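
A small sketch of why lowercasing can help the lookup: the `glove.6B` vocabulary is uncased, so a capitalized token only matches after it is lowercased. The dictionary `toy_glove` and its vectors are made up for illustration and are not real GloVe values.

```python
# Toy stand-in for the GloVe dictionary (hypothetical values); keys are lowercase
toy_glove = {"movie": [0.1, 0.2], "terrible": [-0.3, 0.4]}

tokens = ["This", "Movie", "was", "TERRIBLE"]

# Lookup without lowercasing: capitalized tokens do not match any key
hits_cased = [t for t in tokens if t in toy_glove]

# Lookup after lowercasing each token
hits_lower = [t.lower() for t in tokens if t.lower() in toy_glove]

print(hits_cased)
print(hits_lower)
```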

In [8]:
# Let's verify that tokenization and encoding work
for example, label in imdb_train.take(1):
  print(example, "\n")
  encoded = imdb_encoder.encode(example.numpy())
  print(encoded, "\n")
  print(imdb_encoder.decode(encoded))

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) 

[86177, 89394, 73061, 91286, 76833, 52975, 92482, 77926, 81313, 47931, 50833, 87357, 39669, 55642, 58216, 70872, 26279, 66569, 67553, 61135, 36349, 67298, 86177, 87366, 93137, 81313, 85321, 85064, 75702, 50833, 69588, 84506, 85321, 61135, 66611, 6711

Now that the encoder is ready, the data needs to be encoded and the sequences
padded to a maximum length. Since we are interested in comparing performance
with the previously trained model, we can use the same setting of sampling a maximum of 150 words of the review.

In [9]:
# transformation functions to be used with the dataset
def encode_pad_transform(sample):
  encoded = imdb_encoder.encode(sample.numpy())
  pad = sequence.pad_sequences([encoded], padding="post", maxlen=150)

  return np.array(pad[0], dtype=np.int64)

def encode_tf_fn(sample, label):
  encoded = tf.py_function(encode_pad_transform, inp=[sample], Tout=(tf.int64))
  encoded.set_shape([None])
  label.set_shape([])

  return encoded, label
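
One detail of the padding step is easy to miss: `pad_sequences` is called with `padding="post"` while `truncating` keeps its default of `"pre"`, so a review longer than 150 tokens keeps its last 150 tokens. The pure-Python function below is an illustrative re-implementation for a single sequence (the name `pad_post_truncate_pre` is hypothetical), not the Keras code itself.

```python
def pad_post_truncate_pre(ids, maxlen, value=0):
    """Mimic pad_sequences(padding="post", truncating="pre") for one sequence."""
    if len(ids) > maxlen:
        # "pre" truncation drops tokens from the start, keeping the last maxlen
        ids = ids[-maxlen:]
    # "post" padding appends the pad value at the end
    return ids + [value] * (maxlen - len(ids))

short = pad_post_truncate_pre([5, 6, 7], maxlen=5)
long_ = pad_post_truncate_pre(list(range(1, 9)), maxlen=5)
print(short)  # [5, 6, 7, 0, 0]
print(long_)  # [4, 5, 6, 7, 8]
```

If the first 150 tokens of a review are preferred, `truncating="post"` can be passed to `pad_sequences` instead.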

In [10]:
# test the transformation on a small subset
subset = imdb_train.take(10)
tst = subset.map(encode_tf_fn)

In [11]:
for review, label in tst.take(1):
  print(review, label)
  print("\n", imdb_encoder.decode(review))

tf.Tensor(
[86177 89394 73061 91286 76833 52975 92482 77926 81313 47931 50833 87357
 39669 55642 58216 70872 26279 66569 67553 61135 36349 67298 86177 87366
 93137 81313 85321 85064 75702 50833 69588 84506 85321 61135 66611 67110
 90748 49141 86177 52975 19425 34475 43864 86177 52975 93344 73061 67875
 32163 77808 54259 63289 50675 82410 74436 61608 92902 55146 78727 50675
 70772 67144 92902 63859 85321 76744 89747 92004 82905 66220 69586 90729
 47145 79746 88609 89738 71321 90849 79918 55642 89394 37816 67298 92100
 74436 81142 36699 50833 92100 52975 82472 89394 91204 91949 42556 61655
 49910 16821 48402  8531 82472 63124 67553 50804 46331 86177 84836 91261
 19425 46331 39669 55642 19425 86382 53732 16821 67110 77365 64766 90279
 75708     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0], shape=(150,), dtype=int64) tf.Tensor(0, shape=(), dtype=int64)

 this was 

Finally, the data is encoded using the convenience functions above like so:

In [12]:
# now tokenize/encode/pad all training and testing data
encoded_train = imdb_train.map(encode_tf_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
encoded_test = imdb_test.map(encode_tf_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

At this point, all the training and test data is ready for training.

## Loading pre-trained GloVe embeddings

The next step is the central one in transfer learning: loading the pre-trained GloVe embeddings and using them as the weights of the embedding layer.

Among the available GloVe files, the 50-dimensional one is the nearest to the embedding size used in the previous model, so let's use that. The file format is quite simple. Each line of the file has multiple values separated by spaces. The first item of each line is the word, and the rest of the items are the values of the vector for each dimension. So, in the 50-dimensional file, each line will have 51 columns.
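
To make the format concrete, the sketch below parses a single made-up line that follows the same layout, shortened to 4 dimensions. The word and the values are invented for illustration.

```python
import numpy as np

# A made-up line in the GloVe text format, shortened to 4 dimensions
line = "movie 0.12 -0.50 0.33 0.07\n"

tokens = line.split()
word = tokens[0]                                  # first column is the word
vector = np.array(tokens[1:], dtype=np.float32)   # remaining columns are the vector

print(word, vector.shape, len(tokens))
```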

In [13]:
dict_w2v = {}

with open("glove.6B.50d.txt", "r", encoding="utf-8") as file:
  for line in file:
    tokens = line.split()
    word = tokens[0]
    vector = np.array(tokens[1:], dtype=np.float32)

    if vector.shape[0] == 50:
      dict_w2v[word] = vector
    else:
      print("There was an issue with ", vector)

You should see a dictionary size of 400,000 words. Once these vectors are loaded, an embedding matrix needs to be created.

In [14]:
# let's check the vocabulary size
print("Dictionary Size: ", len(dict_w2v))

Dictionary Size:  400000


##Creating a pre-trained embedding matrix

So far, we have a dataset, its vocabulary, and a dictionary of GloVe words and
their corresponding vectors. However, there is no mapping yet between these two
vocabularies. The way to connect them is through the creation of an embedding
matrix.

In [15]:
# First, let's initialize an embedding matrix of zeros
embedding_dim = 50
embedding_matrix = np.zeros((imdb_encoder.vocab_size, embedding_dim))

Note that this is a crucial step. When a pre-trained word list is used, there is no guarantee that a vector exists for every word in the training and test sets. Words without a match keep their all-zero rows.

After this embedding matrix of zeros is initialized, it needs to be populated. For each word in the vocabulary of reviews, the corresponding vector is retrieved from the GloVe dictionary.

The ID of the word is retrieved using the encoder, and then the embedding matrix
row corresponding to that ID is set to the retrieved vector.

In [16]:
unk_cnt = 0
unk_set = set()

for word in imdb_encoder.tokens:
  embedding_vector = dict_w2v.get(word)

  if embedding_vector is not None:
    token_id = imdb_encoder.encode(word)[0]
    embedding_matrix[token_id] = embedding_vector
  else:
    unk_cnt += 1
    unk_set.add(word)

In [17]:
# how many weren't found?
print("Total unknown words: ", unk_cnt)

Total unknown words:  14553


During the data loading step, we saw that the vocabulary size was 93,931.
Of these, 14,553 words could not be found in GloVe, which is approximately 15% of
the vocabulary. For these words, the embedding matrix rows remain all zeros.
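
The same bookkeeping can be sketched on a toy scale. The vocabulary, the vectors in `toy_glove`, and the choice of reserving ID 0 for padding are assumptions made for illustration; the last line repeats the ratio for the counts reported above.

```python
import numpy as np

# Hypothetical toy vocabulary (ID 0 is reserved for padding) and toy GloVe dictionary
vocab = ["movie", "great", "walken", "zzyzx"]
toy_glove = {
    "movie": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "great": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}

embedding_dim = 3
matrix = np.zeros((len(vocab) + 1, embedding_dim), dtype=np.float32)

unknown = []
for token_id, word in enumerate(vocab, start=1):
    vec = toy_glove.get(word)
    if vec is not None:
        matrix[token_id] = vec      # copy the pre-trained vector into its row
    else:
        unknown.append(word)        # the row stays all zeros

print(unknown, len(unknown) / len(vocab))

# The same ratio for the counts reported in the notebook
print(round(14553 / 93931, 3))
```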

**This is the first step in transfer learning.**

##Feature extraction model

The feature extraction model freezes the pre-trained
weights and does not update them. An important issue with this approach in the
current setup is that there are a large number of tokens, over 14,000, that have
zero embedding vectors. These words could not be matched to an entry in the
GloVe word list.

---
To minimize the chances of not finding matches between the
pre-trained vocabulary and task-specific vocabulary, ensure
that similar tokenization schemes are used.

GloVe uses a word-based tokenization scheme like the one provided by the Stanford
tokenizer. This works better than a whitespace tokenizer.

We see 15% unmatched tokens due to different tokenizers.

As an exercise, we can implement the Stanford tokenizer
and see the reduction in unknown tokens.

Newer models like BERT use subword tokenizers.
Subword tokenization schemes can break up words into parts,
which minimizes this chance of mismatch in tokens. Some
examples of subword tokenization schemes are Byte Pair Encoding
(BPE) or WordPiece tokenization.

---
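
The effect of a tokenizer mismatch can be sketched with the standard library alone. The set `pretrained_vocab` is hypothetical, and the regular expression is a simple word/punctuation split used for illustration; it is not the Stanford tokenizer.

```python
import re

# Hypothetical pre-trained vocabulary; punctuation marks are separate entries
pretrained_vocab = {"don", "'", "t", "be", "lured", ".", "movie", ","}

text = "Don't be lured, movie."

# Whitespace tokenization leaves punctuation attached to words
whitespace_tokens = text.lower().split()

# A simple split into word runs and single punctuation characters
regex_tokens = re.findall(r"\w+|[^\w\s]", text.lower())

def coverage(tokens, vocab):
    """Fraction of tokens that have an entry in the vocabulary."""
    return sum(t in vocab for t in tokens) / len(tokens)

print(whitespace_tokens, coverage(whitespace_tokens, pretrained_vocab))
print(regex_tokens, coverage(regex_tokens, pretrained_vocab))
```

In this toy case only one of the four whitespace tokens is found, while every token from the word/punctuation split has an entry.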

If pre-trained vectors were not used, then the vectors for all the words would start as small random values near zero and get trained through gradient descent.

In this case, the vectors are already trained, so we expect the training to go along much faster. 

For a baseline, one epoch of training of the BiLSTM model while training embeddings takes between 65 seconds and 100 seconds.

Now, let's build the model and plug in the embedding matrix generated above into
the model.


In [21]:
# Size of the vocabulary (number of token IDs known to the encoder)
vocab_size = imdb_encoder.vocab_size
# Number of RNN units
rnn_units = 64
# batch size
BATCH_SIZE = 100

A convenience function makes it easy to switch between the two approaches. It builds models with the same architecture but different hyperparameters.

In [22]:
def build_model_bilstm(vocab_size, embedding_dim, rnn_units, batch_size, train_emb=False):
  model = tf.keras.Sequential([
       Embedding(vocab_size, embedding_dim, mask_zero=True, weights=[embedding_matrix], trainable=train_emb),
       # Dropout(0.25)
       Bidirectional(LSTM(rnn_units, return_sequences=True, dropout=0.5)),
       Bidirectional(LSTM(rnn_units, dropout=0.5)),   
       Dense(1, activation="sigmoid")                     
  ])

  return model

The `weights` parameter loads the embedding matrix as the initial weights of the layer. The Boolean `trainable` parameter that follows determines whether these weights are updated during training.

A feature extraction-based model can now be created like so:

In [23]:
featured_model = build_model_bilstm(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE)

featured_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 50)          4696550   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         58880     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 4,854,375
Trainable params: 157,825
Non-trainable params: 4,696,550
_________________________________________________________________
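
The parameter counts in this summary can be reproduced by hand, which also shows why only 157,825 parameters are trainable when the embedding is frozen. The helper `lstm_params` is a hypothetical name; it uses the standard LSTM count of four gates, each with input weights, recurrent weights, and a bias.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

vocab_size, embedding_dim, rnn_units = 93931, 50, 64

embedding = vocab_size * embedding_dim
bilstm_1 = 2 * lstm_params(embedding_dim, rnn_units)    # forward + backward direction
bilstm_2 = 2 * lstm_params(2 * rnn_units, rnn_units)    # input is the 128-wide BiLSTM output
dense = 2 * rnn_units + 1                               # weights + bias

trainable = bilstm_1 + bilstm_2 + dense                 # embedding is frozen
print(embedding, bilstm_1, bilstm_2, dense)
print("trainable:", trainable, "total:", trainable + embedding)
```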


This model needs to be compiled with the loss function, optimizer, and metrics for monitoring the model's progress. Binary cross-entropy is the right loss function for this binary classification problem. The Adam optimizer is a decent choice in most cases.
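
As a reminder of what the loss measures, here is a NumPy sketch of mean binary cross-entropy. It follows the textbook formula and is not Keras's implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

# Confident and correct predictions give a low loss,
# confident and wrong predictions give a high loss
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```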

In [26]:
featured_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy", "Precision", "Recall"])

After setting up batching and prefetching, the model is ready for training. As before, the model will be trained for 10 epochs.

In [27]:
# Prefetch for performance
encoded_train_batched = encoded_train.batch(BATCH_SIZE).prefetch(100)

featured_model.fit(encoded_train_batched, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5b64105990>

A few things can be seen immediately. First, the model trained significantly faster.
Second, the model has not overfit. The final accuracy is just over 81% on
the training set.

>It should also be noted that the accuracy was still increasing at the
end of the tenth epoch, with lots of room to go. This indicates that
training this model for longer would probably increase accuracy
further.

For now, let's understand the utility of this model. To make an assessment of the quality of this model, performance on the test set should be evaluated.

In [28]:
featured_model.evaluate(encoded_test.batch(BATCH_SIZE))



[0.41534289717674255,
 0.8327199816703796,
 0.7972837686538696,
 0.8923199772834778]

This performance is quite impressive: the model is just 40% of the size of the previous model and trains in 70% less time, for a drop in accuracy of less than 1%. It also has slightly better recall at slightly lower accuracy.
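
One way to weigh the recall gain against the lower precision is the F1 score, the harmonic mean of the two. It is not computed in the notebook; the sketch below derives it from the `evaluate` output shown above.

```python
# Values copied from the evaluate() output: [loss, accuracy, precision, recall]
loss, accuracy, precision, recall = (
    0.41534289717674255,
    0.8327199816703796,
    0.7972837686538696,
    0.8923199772834778,
)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```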

This result should not be entirely unexpected. There are over 14,000 word vectors that are zeros in this model! To fix this issue, and also to try the fine-tuning sequential transfer learning approach, let's build a
fine-tuning-based model.

First, let's continue training the feature extraction model for 20 more epochs.

In [29]:
featured_model.fit(encoded_train_batched, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f5b767e6090>

In [30]:
featured_model.evaluate(encoded_test.batch(BATCH_SIZE))



[0.3447362184524536,
 0.8604400157928467,
 0.8177586793899536,
 0.9276000261306763]

##Fine-tuning model

Creating the fine-tuning model is trivial when using the convenience function. All that is needed is to pass `train_emb=True`.

In [31]:
fine_tuned_model = build_model_bilstm(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE, train_emb=True)

fine_tuned_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 50)          4696550   
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 128)         58880     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 4,854,375
Trainable params: 4,854,375
Non-trainable params: 0
_________________________________________________________________


This model is identical to the feature extraction model in size. However, since the embeddings will be fine-tuned, training is expected to take a little longer. There are several thousand zero embeddings, which can now be updated. The resulting accuracy is expected to be much better than the previous model.

In [34]:
fine_tuned_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy", "Precision", "Recall"])
fine_tuned_model.fit(encoded_train_batched, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f5b79291050>

This accuracy is very impressive but needs to be checked against the test set.

In [35]:
fine_tuned_model.evaluate(encoded_test.batch(BATCH_SIZE))



[0.9734604954719543,
 0.8390399813652039,
 0.8230183124542236,
 0.8638399839401245]

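In the run shown above, the fine-tuned model reaches a test accuracy of 83.9%, compared with 86.0% for the feature extraction model after its additional 20 epochs and 83.55% for the model trained from scratch. Its test loss of 0.97 is also much higher than the 0.34 of the feature extraction model.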

It can also be seen that the network is overfitting a little bit. A Dropout layer can be added between the Embedding layer and the first LSTM layer to help reduce this overfitting. It should also be noted that this network is still much faster than training embeddings from scratch. Most epochs took 24 seconds for training. 

Overall, this model is smaller in size, takes much less time to train, and has much higher accuracy!

**This is why transfer learning is so important in machine learning in general and NLP more specifically.**

So far, we have seen the use of context-free word embeddings. The major challenge with this approach is that a word could have multiple meanings depending on the context.

For example, the word *bank* could refer to a place for storing money and valuables
or to the side of a river.
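
The limitation can be sketched with a toy lookup table: a static embedding returns the same vector for a word no matter which sentence it appears in. The table `static_embeddings` and its values are hypothetical.

```python
# Toy static embedding table (hypothetical values): one vector per word type
static_embeddings = {
    "bank": (0.21, -0.47, 0.05),
    "river": (0.80, 0.10, -0.30),
    "money": (-0.20, 0.60, 0.40),
}

def embed(sentence):
    """Look up each known word; the surrounding context plays no role."""
    words = sentence.lower().split()
    return {w: static_embeddings[w] for w in words if w in static_embeddings}

finance = embed("She deposited money at the bank")
nature = embed("They sat on the river bank")

# The vector for "bank" is identical in both sentences
print(finance["bank"] == nature["bank"])
```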

**A more recent innovation in this area is BERT, which produces contextual word embeddings.**