
## Understanding Sentiment using GloVe-based transfer learning

We previously used a BiLSTM model to predict the sentiment of IMDb movie reviews. That model learned its word embeddings from scratch and reached an accuracy of `83.55%` on the test set, while the SOTA result was closer to `97.4%`. With pre-trained embeddings, we expect the accuracy to improve.

Once the setup is complete, we will use TensorFlow to load these pre-trained embeddings into a model. Two different models will be tried:
- the first will be based on feature extraction
- the second one on fine-tuning

Let's try this out and see the impact of transfer learning on this model.

## Setup

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout

import numpy as np
import pandas as pd

tf.__version__

'2.5.0'

In [2]:
######## GPU CONFIGS FOR RTX 2070 ###############
## Please ignore if not training on GPU       ##
## this is important for running CuDNN on GPU ##

tf.keras.backend.clear_session() #- for easy reset of notebook state

# check if GPU can be seen by TF
tf.config.list_physical_devices('GPU')
# only if you want to see how commands are executed
#tf.debugging.set_log_device_placement(True)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_memory_growth(gpus[0], True)
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)
###############################################

1 Physical GPUs, 1 Logical GPU


In [3]:
# Download the GloVe embeddings
!wget -q http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


## Loading IMDb training data

TensorFlow Datasets or the tfds package will be used to load the data:

In [None]:
imdb_train, ds_info = tfds.load(name="imdb_reviews", split="train", with_info=True, as_supervised=True)
imdb_test = tfds.load(name="imdb_reviews", split="test", as_supervised=True)

In [5]:
# Check label and example from the dataset
for example, label in imdb_train.take(1):
  print(example, "\n", label)

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) 
 tf.Tensor(0, shape=(), dtype=int64)


## Create Vocab and Encoder

After the training and test sets are loaded, the content of the reviews needs to be tokenized and encoded:

In [6]:
# Use the default tokenizer settings
tokenizer = tfds.deprecated.text.Tokenizer()

vocabulary_set = set()
MAX_TOKENS = 0

for example, label in imdb_train:
  some_tokens = tokenizer.tokenize(example.numpy())
  if MAX_TOKENS < len(some_tokens):
    MAX_TOKENS = len(some_tokens)
  vocabulary_set.update(some_tokens)

The loop above tokenizes the review text and builds a vocabulary.
This vocabulary is then used to construct an encoder:

In [7]:
imdb_encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set, lowercase=True, tokenizer=tokenizer)
vocab_size = imdb_encoder.vocab_size

print(vocab_size, MAX_TOKENS)

93931 2525


Note that the text is converted to lowercase before encoding. Lowercasing helps reduce the vocabulary size and helps the lookup of the corresponding GloVe vectors, since the glove.6B vocabulary is uncased. Capitalization can carry important information, however, which may help in tasks such as NER.
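
As a small illustration of the point about vocabulary size, the sketch below counts distinct tokens before and after lowercasing. The sample sentence and the plain whitespace split are assumptions made for the example; they are not the tokenizer used in this notebook.

```python
# Toy illustration: lowercasing merges case variants of the same word.
# The sentence and the whitespace split are illustrative assumptions.
sample_text = "The movie was great and THE acting was Great"

cased_vocab = set(sample_text.split())
lowercased_vocab = set(sample_text.lower().split())

print(len(cased_vocab), len(lowercased_vocab))
```

The lowercased vocabulary is smaller because variants such as `The` and `THE` collapse into a single entry.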

In [8]:
# Let's verify that tokenization and encoding work
for example, label in imdb_train.take(1):
  print(example, "\n")
  encoded = imdb_encoder.encode(example.numpy())
  print(encoded, "\n")
  print(imdb_encoder.decode(encoded))

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) 

[57134, 72469, 82410, 88200, 51576, 84943, 62773, 57508, 89461, 89894, 83203, 87043, 58496, 58774, 81304, 85480, 15782, 73561, 68007, 85479, 45055, 74551, 57134, 72145, 84789, 89461, 86090, 83493, 85081, 83203, 84907, 45565, 86090, 85479, 91221, 3970

Now that the encoder is ready, the data needs to be encoded and the sequences
padded to a fixed length. Since we are interested in comparing performance
with the previously trained model, we can use the same setting of keeping a maximum of 150 words of each review.

In [9]:
# transformation functions to be used with the dataset
def encode_pad_transform(sample):
  encoded = imdb_encoder.encode(sample.numpy())
  pad = sequence.pad_sequences([encoded], padding="post", maxlen=150)

  return np.array(pad[0], dtype=np.int64)

def encode_tf_fn(sample, label):
  encoded = tf.py_function(encode_pad_transform, inp=[sample], Tout=(tf.int64))
  encoded.set_shape([None])
  label.set_shape([])

  return encoded, label
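
To make the effect of the `pad_sequences` call above more concrete, here is a minimal pure-Python sketch of what `padding="post"` does together with the default `truncating="pre"`. The helper name `pad_post_truncate_pre` is hypothetical and exists only for this illustration; it is not part of Keras.

```python
def pad_post_truncate_pre(token_ids, maxlen, pad_value=0):
    """Mimic pad_sequences(padding="post") with its default truncating="pre"."""
    # Default truncation drops tokens from the front, keeping the last `maxlen`.
    kept = token_ids[-maxlen:]
    # Post-padding appends the pad value until the sequence reaches `maxlen`.
    return kept + [pad_value] * (maxlen - len(kept))

short_review = pad_post_truncate_pre([5, 6, 7], maxlen=5)
long_review = pad_post_truncate_pre([1, 2, 3, 4, 5, 6], maxlen=4)

print(short_review)  # padded at the end
print(long_review)   # truncated at the front
```

If the first 150 tokens of a long review should be kept instead of the last 150, `pad_sequences` also accepts `truncating="post"`.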

In [10]:
# test the transformation on a small subset
subset = imdb_train.take(10)
tst = subset.map(encode_tf_fn)

In [11]:
for review, label in tst.take(1):
  print(review, label)
  print("\n", imdb_encoder.decode(review))

tf.Tensor(
[57134 72469 82410 88200 51576 84943 62773 57508 89461 89894 83203 87043
 58496 58774 81304 85480 15782 73561 68007 85479 45055 74551 57134 72145
 84789 89461 86090 83493 85081 83203 84907 45565 86090 85479 91221 39709
 58180 47978 57134 84943 89939 49608 57634 57134 84943 92432 82410 88896
 83154 82589 89183 40912 77552 60082 71226 84914 56894 34678 67720 77552
 39756 37323 56894 86380 86090 70778 88204 82581 88934 72006 64482 77802
 45108 78363 56800 85711 60381 83430 40905 58774 72469 88467 74551 81061
 71226 63010 45789 83203 81061 84943 92629 72469 62694 72541 62348 78970
 93662 18537 69702 71347 92629 67104 68007 81008 67417 57134 86492 67610
 89939 67417 58496 58774 89939 27263 48484 18537 39709 90022 54398 87840
 67367     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0], shape=(150,), dtype=int64) tf.Tensor(0, shape=(), dtype=int64)

 this was 

Finally, the data is encoded using the convenience functions above like so:

In [12]:
# now tokenize/encode/pad all training and testing data
encoded_train = imdb_train.map(encode_tf_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
encoded_test = imdb_test.map(encode_tf_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

At this point, all the training and test data is ready for training.

## Loading pre-trained GloVe embeddings

The next step is the central one in transfer learning: loading the pre-trained GloVe embeddings and using them as the weights of the embedding layer.

Of the GloVe sizes downloaded above (50, 100, 200, and 300 dimensions), 50 is the nearest to the embedding size of the earlier model, so let's use that. The file format is quite simple. Each line of the text file holds multiple values separated by spaces. The first item of each row is the word, and the remaining items are the values of the vector for each dimension. So, in the 50-dimensional file, each row has 51 columns.
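
The row layout can be checked on a single line. The line below is made up for illustration and carries only 4 values instead of 50; it is not taken from the GloVe file.

```python
# A made-up line in the GloVe text format: a word followed by its vector values.
# Real rows in glove.6B.50d.txt carry 50 values; 4 are used here for brevity.
sample_line = "movie 0.25 -0.5 0.125 1.0\n"

columns = sample_line.split()
word = columns[0]
vector = [float(value) for value in columns[1:]]

print(word, vector, len(columns))
```

The number of columns is always one more than the vector dimension, which is why the 50-dimensional file has 51 columns per row.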

In [18]:
dict_w2v = {}

with open("glove.6B.50d.txt", "r", encoding="utf-8") as file:
  for line in file:
    tokens = line.split()
    word = tokens[0]
    vector = np.array(tokens[1:], dtype=np.float32)

    # keep only well-formed rows with exactly 50 values
    if vector.shape[0] == 50:
      dict_w2v[word] = vector
    else:
      print("There was an issue with ", word)

You should see a dictionary size of 400,000 words. Once these vectors are loaded, an embedding matrix needs to be created.

In [19]:
# let's check the vocabulary size
print("Dictionary Size: ", len(dict_w2v))

Dictionary Size:  400000


## Creating a pre-trained embedding matrix

So far, we have a dataset, its vocabulary, and a dictionary of GloVe words and
their corresponding vectors. However, nothing yet links these two
vocabularies: the token IDs of the reviews do not point to any GloVe vector.
The way to connect them is through the creation of an embedding matrix.

In [20]:
# First, let's initialize an embedding matrix of zeros
embedding_dim = 50
embedding_matrix = np.zeros((imdb_encoder.vocab_size, embedding_dim))

Note that this is a crucial step. When a pre-trained word list is used, there is no guarantee that every word in the training and test data has a vector. Starting from zeros means that any word without a GloVe vector keeps an all-zero row.

After this embedding matrix of zeros is initialized, it needs to be populated. For each word in the vocabulary of reviews, the corresponding vector is retrieved from the GloVe dictionary.

The ID of the word is retrieved using the encoder, and then the row of the
embedding matrix corresponding to that ID is set to the retrieved vector.

In [21]:
unk_cnt = 0
unk_set = set()

for word in imdb_encoder.tokens:
  embedding_vector = dict_w2v.get(word)

  if embedding_vector is not None:
    token_id = imdb_encoder.encode(word)[0]
    embedding_matrix[token_id] = embedding_vector
  else:
    unk_cnt += 1
    unk_set.add(word)

In [22]:
# how many weren't found?
print("Total unknown words: ", unk_cnt)

Total unknown words:  14553


During the data loading step, we saw that the vocabulary size was 93,931.
Out of these, 14,553 words could not be found in GloVe, which is approximately
15% of the vocabulary. For these words, the embedding matrix rows remain zeros.
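
The share of missing words can be recomputed from the two numbers printed earlier. The variable names below are chosen for this illustration only.

```python
reported_vocab_size = 93931     # printed by the encoder cell
reported_unknown_words = 14553  # printed by the lookup cell

missing_fraction = reported_unknown_words / reported_vocab_size
covered_fraction = 1 - missing_fraction

print(f"missing: {missing_fraction:.1%}, covered: {covered_fraction:.1%}")
```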

**This is the first step in transfer learning.**

## Feature extraction model

The feature extraction model freezes the pre-trained
weights and does not update them. An important issue with this approach in the
current setup is that there are a large number of tokens, over 14,000, that have
zero embedding vectors. These words could not be matched to an entry in the
GloVe word list.

---
To minimize the chances of not finding matches between the
pre-trained vocabulary and task-specific vocabulary, ensure
that similar tokenization schemes are used.

GloVe uses a word-based tokenization scheme like the one provided by the Stanford
tokenizer. This works better than a whitespace tokenizer.

The roughly 15% of unmatched tokens seen above stems from the use of different tokenizers.

As an exercise, we can implement the Stanford tokenizer
and see the reduction in unknown tokens.

Newer models like BERT use subword tokenizers.
Subword tokenization schemes can break up words into parts,
which minimizes this chance of mismatch in tokens. Some
examples of subword tokenization schemes are Byte Pair Encoding
(BPE) or WordPiece tokenization.

---
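
To illustrate how a subword scheme reduces vocabulary mismatches, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for this example; real tokenizers learn their vocabulary from a large corpus.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"story", "##line", "##lines", "un", "##watch", "##able"}

print(wordpiece_tokenize("storyline", toy_vocab))
print(wordpiece_tokenize("unwatchable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Neither `storyline` nor `unwatchable` is in the toy vocabulary as a whole word, yet both can still be represented by known pieces. Only a word with no matching pieces falls back to the unknown token.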

If pre-trained vectors were not used, the vectors for all the words would start from small random values close to zero and be learned through gradient descent.

In this case, the vectors are already trained, so we expect training to proceed much faster.

As a baseline, one epoch of training the BiLSTM model with embeddings learned from scratch takes between 65 and 100 seconds.

Now, let's build the model and plug the embedding matrix generated above into it.
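
A minimal sketch of how an embedding matrix could be plugged into a BiLSTM classifier follows. The function name `build_bilstm_with_glove`, the `rnn_units=64` default, the single `Dense` output layer, and the random stand-in matrix are all assumptions for illustration and not necessarily the architecture of the earlier model. Setting `trainable=False` on the embedding layer is what freezes the pre-trained weights.

```python
import numpy as np
import tensorflow as tf

def build_bilstm_with_glove(embedding_matrix, rnn_units=64, train_emb=False):
    """BiLSTM classifier whose embedding layer starts from a given matrix."""
    vocab_size, embedding_dim = embedding_matrix.shape
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            mask_zero=True,  # ID 0 is the padding value
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=train_emb,  # False: feature extraction, True: fine-tuning
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(rnn_units)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# Random stand-in for the real matrix so the sketch runs on its own.
toy_matrix = np.random.rand(20, 50).astype("float32")
model = build_bilstm_with_glove(toy_matrix)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Run one dummy batch of padded token IDs through the model so the layers are built.
dummy_batch = np.array([[3, 7, 1, 0, 0], [4, 2, 9, 5, 0]])
predictions = model(dummy_batch)
print(predictions.shape)
```

With the real data, the `embedding_matrix` built earlier would be passed in place of `toy_matrix`, and `train_emb=True` would give the variant in which the embeddings are updated during training.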


## Fine-tuning model