# SCS 3546 Week 7 - Deep Models for Text
(NLP Part 2)


# Introduction

## Learning Objectives

- Develop familiarity with several of the key concepts in Deep Learning models for NLP
- Understand different word embeddings and how they differ
- Understand how to leverage word embedding in a classification task (Sentiment Analysis: Positive/Negative) 
- Build knowledge of how seq2seq models can be used for NLP tasks such as translation
- Become familiar with Attention, Self-Attention and Transformers 

## Applications of Deep Models


- Text Classification and Categorization
- Named Entity Recognition
- Part of Speech Tagging
- Semantic Parsing and Query Answering
- Summarization
- Paraphrasing Detection
- Synonym generation for search
- Machine Translation

# Word Representations

## Encoding with Integers

As we discussed last week, the traditional method of encoding words was to start by assigning an integer index to each unique word in the vocabulary then create vectors of a length equal to the size of the vocabulary:

- word encoding: all entries zero except the one at the index of the word, which was set to 1
- document encoding: each entry one (or a count of the number of occurrences) if the word is present in the document or zero if not
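As a minimal sketch of the two encodings just described, the following pure-Python example builds a vocabulary from two toy documents and produces a one-hot word vector and a count-based document vector. The helper names (`build_vocab`, `one_hot`, `count_vector`) and the toy documents are assumptions made for illustration, not part of any library.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer index to each unique word, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Word encoding: all zeros except a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def count_vector(doc, vocab):
    """Document encoding: number of occurrences of each vocabulary word."""
    counts = Counter(doc.lower().split())
    # dicts preserve insertion order, so this follows the index order.
    return [counts.get(word, 0) for word in vocab]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(docs)
print(vocab)                         # word -> integer index
print(one_hot("cat", vocab))         # [0, 1, 0, 0, 0, 0]
print(count_vector(docs[0], vocab))  # [2, 1, 1, 1, 1, 0]
```

Note that both vectors have length equal to the vocabulary size, which is what makes them long and sparse for realistic vocabularies.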

This has several advantages:

- Every word is represented by an integer that can be used as a fast look up index to additional information about the word (such as its text string)
- Every word vector is a fixed-sized data structure which makes them easier to process
- Comparing two words for equality is very fast
- We can compare the similarity of documents using techniques such as TF-IDF and cosine similarity

There are several things about this that are less than ideal:

- The vectors don't capture any word meaning
- The order of the words is arbitrary
- The vectors are large and therefore wasteful of computer memory and CPU
- Vectors with continuous (floating point) values are more amenable to use with neural nets than those with discrete values

We would prefer an approach where:

- Words that are close together in their encodings are similar in some sense
- The vectors are short and dense rather than long and sparse
- The vectors capture some aspect of the meaning of the words
- The different uses of a particular token (e.g. "bow" as in "bow and arrow" and "bow of a ship") can be disambiguated
- The vectors are more amenable for use with neural nets
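To illustrate the first wish in the list above, the sketch below compares cosine similarity for one-hot vectors and for short dense vectors. The dense values are invented purely for illustration (they are not learned embeddings), and `cosine` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# One-hot vectors for three different words in a 5-word vocabulary.
cat_1h, dog_1h, car_1h = [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]

# Hypothetical dense vectors, invented for illustration only.
cat_d, dog_d, car_d = [0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9]

# Any two distinct one-hot vectors are orthogonal: similarity is always 0.
print(cosine(cat_1h, dog_1h), cosine(cat_1h, car_1h))

# Dense vectors can place "cat" closer to "dog" than to "car".
print(cosine(cat_d, dog_d) > cosine(cat_d, car_d))
```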

If we could find such an encoding, then we could use the vectors for:

- Classifying the document as being about a topic without having to compare it to others already associated with a topic
- Measure the sentiment of a document without having to hand-engineer sentiment ratings for each word in the vocabulary
- Use the meaning of words and sentences to provide better automated translation

## Word Embeddings



Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

The following is a list of popular embeddings and embedding-producing models:

- **GloVe**
- **Word2Vec**
- **FastText**
- **TagLM**
- **BERT**
- **ALBERT**
- **RoBERTa**
- **GPT**
- **XLNet**
- **Performer**

This list will keep growing, since new embeddings are introduced all the time :-)

### Word2vec 


#### Introduction

Word2Vec is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven to be successful on a variety of downstream natural language processing tasks. There are two main methods for learning the embeddings: **Continuous Bag-of-Words** and **Continuous Skip-gram**.

* **Continuous Bag-of-Words Model** predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.

* **Continuous Skip-gram Model** predicts words within a certain range before and after the current word in the same sentence. A worked example of this is given below. 

In this module we discuss how Skip-gram works. Consider the sentence "The quick brown fox jumps over the lazy dog". The context words for each of the 9 words of this sentence are defined by a window size. The window size determines the span of words on either side of a **target_word** that can be considered a **context word**. We wish to find, for each vocabulary word, the probability that it appears in a window of a fixed size on either side of a given word. The following image demonstrates how we can build training samples from source text.


<center><img src="https://drive.google.com/uc?id=12StHx-FVVPiuJnRgNul-ObwKoSJdpGmS" ></center>

The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words $w_1, w_2, \ldots, w_T$ and a window size $c$, the objective can be written as the average log probability

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p\left(w_{t+j} \mid w_{t}\right)
$$
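The double sum in this objective runs over every (target, context) pair inside the window. As a rough pure-Python sketch, the hypothetical helper `skipgram_pairs` below enumerates those pairs for the example sentence, assuming simple whitespace tokenization and a window size of 2.

```python
def skipgram_pairs(tokens, window_size):
    """Return every (target, context) pair whose positions differ by
    at most window_size (the target itself is excluded)."""
    pairs = []
    for t, target in enumerate(tokens):
        lo = max(0, t - window_size)
        hi = min(len(tokens), t + window_size + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=2)
print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'),
#  ('quick', 'brown'), ('quick', 'fox')]
print(len(pairs))  # 30
```

Words near the sentence boundaries contribute fewer pairs because their window is truncated.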



<center><img src="https://drive.google.com/uc?id=1n6P1HxxfS5hGt-lG9IZWd_oyYAhkgKJK" ></center>


If O is the event of a word occurring anywhere in the window and C is the event of having a particular centre word, we wish to find $$P(O|C) = \exp(u_O^Tv_C)/\sum_{w=1}^V\exp(u_w^Tv_C)$$ where $u_w$ and $v_w$ are the context and centre vector representations of word $w$ and $V$ is the size of the vocabulary.
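The softmax above can be evaluated directly for a tiny vocabulary. The sketch below is illustrative only: the 2-dimensional vectors in `U` and `V` are invented values, and `softmax_prob` is a hypothetical helper rather than a library function.

```python
import math

def softmax_prob(o, c, U, V):
    """P(O=o | C=c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, V[c])) for u in U]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[o] / sum(exps)

# Hypothetical vectors for a 4-word vocabulary (values invented).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # context vectors u_w
V = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.5, -0.5]]  # centre vectors v_w

# Distribution over all possible context words for centre word 0.
probs = [softmax_prob(o, 0, U, V) for o in range(len(U))]
print(probs)
print(sum(probs))
```

The cost of the denominator grows linearly with the vocabulary size, since one dot product is needed per vocabulary word for every training pair.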

Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary, which is often expensive to compute. There are ways to make the training more efficient.

The [Noise Contrastive Estimation loss](http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf) function is an efficient approximation for a full softmax. With an objective to learn word embeddings instead of modelling the word distribution, NCE loss can be simplified to use negative sampling.

The simplified negative sampling objective for a target word is to distinguish the context word from num_ns negative samples drawn from noise distribution Pn(w) of words. More precisely, an efficient approximation of full softmax over the vocabulary is, for a skip-gram pair, to pose the loss for a target word as a classification problem between the context word and num_ns negative samples.

A negative sample is defined as a (target_word, context_word) pair such that the context_word does not appear in the window_size neighborhood of the target_word. 
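For reference, the negative sampling objective for one (centre, context) pair, as given by Mikolov et al., can be written in the notation used above. Here $\sigma$ is the logistic sigmoid and $w_k$ are the `num_ns` negative samples drawn from $P_n(w)$; this is the quantity to maximize and it replaces $\log P(O \mid C)$ in the skip-gram objective:

$$
\log \sigma\left(u_O^T v_C\right) + \sum_{k=1}^{\text{num\_ns}} \mathbb{E}_{w_k \sim P_n(w)}\left[\log \sigma\left(-u_{w_k}^T v_C\right)\right]
$$

Only `num_ns + 1` dot products are needed per pair, instead of one per vocabulary word.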

After training, we use these weights as the vector representation of each word. These embeddings are useful for downstream tasks like classification or clustering. Further, you can reduce the dimension of these embeddings and plot them in 3D; you can find an example [here](http://projector.tensorflow.org/).

An interesting property of the Word2Vec model is that you can do linear-algebra arithmetic with word vectors. A popular example described in lectures and introductory papers is:

`queen = (king - man) + woman`

You can also look at the result of changing man to boy and woman to girl:

`queen = (king - boy) + girl`

The same idea applies to verbs as well as nouns:

`walking = (walked - swam) + swimming`
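In practice such analogies hold only approximately, and they are evaluated by finding the vocabulary word whose vector is closest to the result of the arithmetic. The sketch below shows the mechanics on hand-made 2-dimensional vectors; the values are invented so that the analogy works exactly, and `nearest` is a hypothetical helper. As is common, the query words themselves are excluded from the candidates.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
emb = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.1],
}

def nearest(query, emb, exclude):
    """Word with the highest cosine similarity to query, skipping excluded words."""
    candidates = [w for w in emb if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, emb[w]))

# (king - man) + woman, computed component-wise.
query = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
result = nearest(query, emb, exclude={"king", "man", "woman"})
print(result)  # queen
```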

The figure below illustrates this property with these examples:

<img src="https://drive.google.com/uc?id=1ONDACyoRRbqbgeHo1loisHXC-Lf-p9xJ">




#### Generating Word2Vec Embeddings

##### Setup

In [8]:
import io
import re
import string
import tensorflow as tf
import tqdm

from tensorflow.keras import Model
from tensorflow.keras.layers import Dot, Embedding, Flatten
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import Input, LSTM, Dense
import numpy as np

##### Compile all steps into one function


###### Skip-gram Sampling table 

A large dataset means larger vocabulary with higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as `the`, `is`, `on`) don't add much useful information  for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality. 

The `tf.keras.preprocessing.sequence.skipgrams` function accepts a sampling table argument to encode probabilities of sampling any token. You can use the `tf.keras.preprocessing.sequence.make_sampling_table` to  generate a word-frequency rank based probabilistic sampling table and pass it to `skipgrams` function. Take a look at the sampling probabilities for a `vocab_size` of 10.

In [None]:
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)

[0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
 0.01212381 0.01347162 0.01474487 0.0159558 ]


`sampling_table[i]` denotes the probability of sampling the i-th most common word in a dataset. The function assumes a [Zipf's distribution](https://en.wikipedia.org/wiki/Zipf%27s_law) of the word frequencies for sampling.
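The printed values can be reproduced with a few lines of plain Python. The formula below is an assumption about how the Zipf-based table is computed (an approximation of the inverse word frequency at each rank, scaled by a sampling factor of `1e-5`); it is not copied from the library, but it matches the numbers shown above. The function name `zipf_sampling_table` is hypothetical.

```python
import math

def zipf_sampling_table(size, sampling_factor=1e-5):
    """Approximate sampling probabilities for ranks 0..size-1 under Zipf's law."""
    gamma = 0.577  # rounded Euler-Mascheroni constant (assumed value)
    table = []
    for i in range(size):
        rank = max(i, 1)  # rank 0 is treated like rank 1
        # Approximate inverse relative frequency of the word at this rank.
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table

table = zipf_sampling_table(10)
print([round(p, 8) for p in table])
```

Frequent words (low rank) get a small probability of being kept, which is how subsampling reduces the number of training pairs built from words such as `the`.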

Key point: The `tf.random.log_uniform_candidate_sampler` already assumes that the vocabulary frequency follows a log-uniform (Zipf's) distribution. Using this distribution-weighted sampling also helps approximate the Noise Contrastive Estimation (NCE) loss with simpler loss functions for training a negative sampling objective.
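The log-uniform distribution mentioned here assigns class `i` the probability `(log(i + 2) - log(i + 1)) / log(range_max + 1)`, as stated in the TensorFlow documentation for the sampler. The sketch below evaluates that formula in plain Python to show that low ids (frequent words) are sampled as negatives far more often; `log_uniform_prob` is a hypothetical helper name.

```python
import math

def log_uniform_prob(class_id, range_max):
    """Probability of drawing class_id from a log-uniform distribution."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

range_max = 4096  # same as vocab_size used later in this notebook
probs = [log_uniform_prob(i, range_max) for i in range(range_max)]

print(round(sum(probs), 6))      # the probabilities sum to 1
print(probs[0], probs[100], probs[4000])
```

This only makes sense when token ids are sorted by decreasing frequency, which is the order the vocabulary is built in below.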

###### Generate training data

Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. Notice that the sampling table is built before sampling skip-gram word pairs. You will use this function in the later sections.

In [None]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for vocab_size tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples
    # with positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      negative_sampling_candidates = tf.expand_dims(
          negative_sampling_candidates, 1)

      context = tf.concat([context_class, negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

##### Prepare training data for Word2Vec

With an understanding of how to work with one sentence for a skip-gram negative sampling based Word2Vec model, you can proceed to generate training examples from a larger list of sentences!

##### Download text corpus


You will use a text file of Shakespeare's writing for this tutorial. Change the following line to run this code on your own data.

In [None]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Read text from the file and take a look at the first few lines. 

In [None]:
with open(path_to_file) as f: 
  lines = f.read().splitlines()
for line in lines[:20]:
  print(line)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.


Use the non empty lines to construct a `tf.data.TextLineDataset` object for next steps.

In [None]:
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

##### Vectorize sentences from the corpus

You can use the `TextVectorization` layer to vectorize sentences from the corpus. Learn more about using this layer in this [Text Classification](https://www.tensorflow.org/tutorials/keras/text_classification) tutorial. Notice from the first few sentences above that the text needs to be in one case and punctuation needs to be removed. To do this, define a `custom_standardization` function that can be used in the `TextVectorization` layer.

In [None]:
# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')


# Define the vocabulary size and number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Set output_sequence_length length to pad all samples to same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
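To make the behaviour of this layer easier to picture, here is a rough pure-Python analogue of the same three steps: standardize, build a frequency-sorted vocabulary, and map to padded integer sequences. It is a sketch under simplifying assumptions (whitespace splitting, index 0 for padding, index 1 for unknown tokens) and is not the layer's actual implementation; all function names are hypothetical.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip punctuation."""
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())

def build_vocabulary(lines, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved."""
    counts = Counter(tok for line in lines for tok in standardize(line).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(line, vocab, sequence_length):
    """Map tokens to ids, then truncate or zero-pad to sequence_length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(line).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

lines = ["First Citizen:", "Speak, speak.", "We know't, we know't."]
toy_vocab = build_vocabulary(lines, max_tokens=6)
print(toy_vocab)
print(vectorize("Speak, speak.", toy_vocab, sequence_length=5))
```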

Call `adapt` on the text dataset to create vocabulary.


In [None]:
vectorize_layer.adapt(text_ds.batch(1024))

Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with `get_vocabulary()`. This function returns a list of all vocabulary tokens sorted (descending) by their frequency. 

In [None]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'the', 'and', 'to', 'i', 'of', 'you', 'my', 'a', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your']


The vectorize_layer can now be used to generate vectors for each element in the `text_ds`.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

##### Obtain sequences from the dataset

You now have a `tf.data.Dataset` of integer encoded sentences. To prepare the dataset for training a Word2Vec model, flatten the dataset into a list of sentence vector sequences. This step is required as you would iterate over each sentence in the dataset to produce positive and negative examples. 

Note: Since the `generate_training_data()` defined earlier uses non-TF python/numpy functions, you could also use a `tf.py_function` or `tf.numpy_function` with `tf.data.Dataset.map()`.

In [None]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

32777


Take a look at few examples from `sequences`.


In [None]:
for seq in sequences[:5]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[138  36 982 144 673 125  16 106   0   0] => ['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34  0  0  0  0  0  0  0  0  0] => ['all', '', '', '', '', '', '', '', '', '']
[106 106   0   0   0   0   0   0   0   0] => ['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']


##### Generate training examples from sequences

`sequences` is now a list of int encoded sentences. Just call the `generate_training_data()` function defined earlier to generate training examples for the Word2Vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of `targets`, `contexts` and `labels` should be the same, representing the total number of training examples.

In [None]:
SEED = 42

targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)
print(len(targets), len(contexts), len(labels))

100%|██████████| 32777/32777 [00:09<00:00, 3534.16it/s]


64612 64612 64612


##### Configure the dataset for performance

To perform efficient batching for the potentially large number of training examples, use the `tf.data.Dataset` API. After this step, you would have a `tf.data.Dataset` object of `(target_word, context_word), (label)` elements to train your Word2Vec model!

In [None]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<BatchDataset shapes: (((1024,), (1024, 5, 1)), (1024, 5)), types: ((tf.int32, tf.int64), tf.int64)>


Add `cache()` and `prefetch()` to improve performance.

In [None]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset shapes: (((1024,), (1024, 5, 1)), (1024, 5)), types: ((tf.int32, tf.int64), tf.int64)>


##### Model and Training

The Word2Vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product between the embeddings of target and context words to obtain predictions for labels and compute loss against true labels in the dataset.

###### Subclassed Word2Vec Model

Use the [Keras Subclassing API](https://www.tensorflow.org/guide/keras/custom_layers_and_models) to define your Word2Vec model with the following layers:


* `target_embedding`: A `tf.keras.layers.Embedding` layer which looks up the embedding of a word when it appears as a target word. The number of parameters in this layer are `(vocab_size * embedding_dim)`.
* `context_embedding`: Another `tf.keras.layers.Embedding` layer which looks up the embedding of a word when it appears as a context word. The number of parameters in this layer are the same as those in `target_embedding`, i.e. `(vocab_size * embedding_dim)`.
* `dots`: A `tf.keras.layers.Dot` layer that computes the dot product of target and context embeddings from a training pair.
* `flatten`: A `tf.keras.layers.Flatten` layer to flatten the results of `dots` layer into logits.

With the subclassed model, you can define the `call()` function that accepts `(target, context)` pairs which can then be passed into their corresponding embedding layer. Reshape the `context_embedding` to perform a dot product with `target_embedding` and return the flattened result.

Key point: The `target_embedding` and `context_embedding` layers can be shared as well. You could also use a concatenation of both embeddings as the final Word2Vec embedding.

In [None]:
class Word2Vec(Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = Embedding(vocab_size,
                                      embedding_dim,
                                      input_length=1,
                                      name="w2v_embedding")
    self.context_embedding = Embedding(vocab_size,
                                       embedding_dim,
                                       input_length=num_ns+1)
    self.dots = Dot(axes=(3, 2))
    self.flatten = Flatten()

  def call(self, pair):
    target, context = pair
    word_emb = self.target_embedding(target)
    context_emb = self.context_embedding(context)
    dots = self.dots([context_emb, word_emb])
    return self.flatten(dots)
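For a single training example, the computation in `call()` reduces to one dot product per candidate context word. The sketch below spells this out in plain Python with invented 3-dimensional embeddings and two negative samples; it is meant to show the arithmetic only, and `logits_for_example` is a hypothetical helper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logits_for_example(target_vec, context_vecs):
    """One logit per candidate context word: dot(context_vec, target_vec)."""
    return [dot(c, target_vec) for c in context_vecs]

# Hypothetical embeddings for one example (values invented).
target_vec = [1.0, 0.5, -0.5]
context_vecs = [
    [0.8, 0.4, -0.2],   # true context word      (label 1)
    [-0.5, 0.1, 0.3],   # negative sample        (label 0)
    [0.0, -1.0, 0.0],   # negative sample        (label 0)
]

logits = logits_for_example(target_vec, context_vecs)
print(logits)  # approximately [1.1, -0.6, -0.5]
```

Training pushes the logit of the true context word up and the logits of the negative samples down.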

###### Define loss function and compile model


For simplicity, you can use `tf.keras.losses.CategoricalCrossentropy` as an alternative to the negative sampling loss. If you would like to write your own custom loss function, you can also do so as follows:

``` python
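def custom_loss(y_true, y_pred):
      # Keras calls a loss as loss(y_true, y_pred); here y_pred holds the logits.
      labels = tf.cast(y_true, y_pred.dtype)
      return tf.nn.sigmoid_cross_entropy_with_logits(logits=y_pred, labels=labels)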
```
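To see what an element-wise sigmoid cross-entropy computes, the sketch below evaluates it by hand for one example with one positive and two negative context words. It uses the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` given in the TensorFlow documentation; the logits are invented values and the function name is hypothetical.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Stable form of -label*log(sigmoid(x)) - (1-label)*log(1-sigmoid(x))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

logits = [1.1, -0.6, -0.5]   # hypothetical dot products
labels = [1, 0, 0]           # first entry is the true context word

losses = [sigmoid_cross_entropy(x, z) for x, z in zip(logits, labels)]
print(losses)
print(sum(losses))
```

Each candidate is treated as an independent binary decision, which is the view taken by the negative sampling objective; the categorical cross-entropy used in the next cell instead applies a softmax across the `num_ns + 1` candidates.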

It's time to build your model! Instantiate your Word2Vec class with an embedding dimension of 128 (you could experiment with different values). Compile the model with the `tf.keras.optimizers.Adam` optimizer. 

In [None]:
embedding_dim = 128
num_ns = 4  # must match the num_ns passed to generate_training_data()
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

Also define a callback to log training statistics for tensorboard.

In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Train the model with `dataset` prepared above for some number of epochs.

In [None]:
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])


Tensorboard now shows the Word2Vec model's accuracy and loss.

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

%tensorboard --logdir logs


##### Embedding lookup and analysis

Obtain the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary to build a metadata file with one token per line. 

In [None]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Create and save the vectors and metadata file. 

In [None]:
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

Download the `vectors.tsv` and `metadata.tsv` to analyze the obtained embeddings in the [Embedding Projector](https://projector.tensorflow.org/).

In [None]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass


#### Sentiment analysis of Movie Reviews Using Word2Vec



The following is a sample implementation that uses a pretrained text-embedding layer from TensorFlow Hub; you can find an alternative approach [here](https://www.tensorflow.org/tutorials/text/text_classification_rnn).

In [None]:
import tensorflow_hub as hub
from tensorflow import keras 
import tensorflow as tf
tf.__version__

'2.5.0'

In [None]:
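# Note: the "es" in this module name denotes Spanish-language embeddings, while the
# IMDB reviews used below are in English; an English module may be a better fit.
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-es-dim50-with-normalization/1", output_shape=[50],
                           input_shape=[], dtype=tf.string)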

model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 50)                48832000  
_________________________________________________________________
dense (Dense)                (None, 16)                816       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 48,832,833
Trainable params: 833
Non-trainable params: 48,832,000
_________________________________________________________________


In [None]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteRANV04/imdb_reviews-train.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteRANV04/imdb_reviews-test.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteRANV04/imdb_reviews-unsupervised.tfrecord


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [None]:
for X_batch, y_batch in datasets["train"].batch(2).take(1):
  for review, label in zip(X_batch.numpy(), y_batch.numpy()):
      print("Review:", review.decode("utf-8")[:200], "...")
      print("Label:", label, "= Positive" if label else "= Negative")
      print()

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative



In [None]:
train_datasets = datasets["train"].batch(128)
test_datasets = datasets["test"].batch(128)


In [None]:
# `lr` is deprecated in favour of `learning_rate`.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(train_datasets, validation_data=test_datasets, epochs=20)


Try different models, for example different types of RNN and LSTM networks, and compare their performance.

In [None]:
### Your response here ....


Below you have access to a pretrained Word2Vec model. Try running the sentiment prediction exercise with the Word2Vec embedding.

In [None]:
hub_layer = hub.KerasLayer("https://tfhub.dev/google/Wiki-words-500-with-normalization/2",
                           input_shape=[], dtype=tf.string)

model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_1 (KerasLayer)   (None, 500)               504687500 
_________________________________________________________________
dense_2 (Dense)              (None, 16)                8016      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 504,695,533
Trainable params: 8,033
Non-trainable params: 504,687,500
_________________________________________________________________


In [None]:
### Your response here ....


#### Word Vector Caveats

- The same string of characters can mean different things in different contexts, but a static word vector gives it a single representation
- Unknown (out-of-vocabulary) words have no proper representation
- Word vectors are biased by the biases in the text they are trained on
- We know much more about the world than can be derived from how words are used in context: we have a lifetime of experience and are keenly aware that we live in a world of cause and effect
- Words that are used in the same context are not necessarily interchangeable (e.g. one and five, fast and slow)

### TagLM

TagLM, introduced in a paper by Matthew Peters et al., uses contextualized word embeddings learned from the internal states of a deep bidirectional language model. For example, the word “queen” will not have the same embedding in “Queen of the United Kingdom” and in “queen bee”.

<center><img src="https://drive.google.com/uc?id=1mpjabIOqjeC90DPO92qc4Muz1c8Tvcrd">

[source:TagLM paper](https://arxiv.org/pdf/1705.00108.pdf)


</center>




The following figure shows all the components of TagLM:
<center>

<img src="https://drive.google.com/uc?id=1o-_xWorpaV-b4FiFqJo5rfDsObm2HvKY">


[source:TagLM paper](https://arxiv.org/pdf/1705.00108.pdf)


</center>

### ELMo

- [Deep contextualized word representations. NAACL 2018](https://arxiv.org/pdf/1802.05365.pdf)
- Breakout version of word token vectors, also known as contextual word vectors
- Learn word token vectors using long contexts rather than context windows (here the whole sentence, though it could be longer)
- Learn a deep bidirectional neural language model (Bi-NLM) and use all its layers in prediction

- Train a bidirectional LM, aiming for a performant but not overly large LM:
    - Use 2 biLSTM layers
    - Use a character CNN to build the initial word representation (only)
    - 2048 char n-gram filters and 2 highway layers, 512 dim projection
    - Use 4096 dim hidden/cell LSTM states with 512 dim projections to the next input
    - Use a residual connection
    - Tie the parameters of the token input and output (softmax), and tie these between the forward and backward LMs
    

$$
\begin{aligned} R_{k} &=\left\{\mathbf{x}_{k}^{L M}, \overrightarrow{\mathbf{h}}_{k, j}^{L M}, \overleftarrow{\mathbf{h}}_{k, j}^{L M} | j=1, \ldots, L\right\} \\ &=\left\{\mathbf{h}_{k, j}^{L M} | j=0, \ldots, L\right\} \\ \mathbf{E L M o}_{k}^{t a s k} &=E\left(R_{k} ; \Theta^{t a s k}\right)=\gamma^{t a s k} \sum_{j=0}^{L} s_{j}^{t a s k} \mathbf{h}_{k, j}^{L M} \end{aligned}
$$

- First run biLM to get representations for each word 
- Then let (whatever) end-task model use them
    - Freeze weights of ELMo for purposes of supervised model
    - Concatenate ELMo weights into task-specific model
        - Details depend on task
            - Concatenating into intermediate layer as for TagLM is typical
            - Can provide ELMo representations again when producing outputs,  as in a question answering system
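
The weighted combination in the ELMo equation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `elmo_scalar_mix` and the random "layer states" are hypothetical stand-ins for the outputs of a real biLM, and the layer weights $s_j^{task}$ are assumed to be softmax-normalized before the sum.

```python
import numpy as np

def elmo_scalar_mix(layer_states, s_logits, gamma=1.0):
    """Collapse the biLM layers into one vector per token.

    layer_states: shape (L + 1, num_tokens, dim), i.e. h_{k,j} for j = 0..L
    s_logits:     shape (L + 1,), unnormalized task-specific layer weights
    gamma:        task-specific scale factor
    """
    layer_states = np.asarray(layer_states, dtype=float)
    s_logits = np.asarray(s_logits, dtype=float)
    # Softmax-normalize the layer weights so that they sum to 1.
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()
    # Weighted sum over the layer axis, then scale by gamma.
    return gamma * np.tensordot(s, layer_states, axes=(0, 0))

# Hypothetical biLM output: token layer + 2 biLSTM layers, 5 tokens, 8 dims.
rng = np.random.default_rng(0)
states = rng.normal(size=(3, 5, 8))
elmo = elmo_scalar_mix(states, s_logits=np.zeros(3), gamma=2.0)
print(elmo.shape)
```

With all logits equal, the result is simply `gamma` times the average of the layers; in a real task model `s_logits` and `gamma` would be learned while the biLM weights stay frozen.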



# Sequence To Sequence Models

- Sequence-to-sequence models convert one sequence to another
- They read an entire sequence (e.g. a sentence), encode it into a more general representation, then decode it into a new sequence
- The encoders and decoders are often RNN's
- In a standard seq2seq model for Neural Machine Translation the input and output vectors are embeddings



### Machine Translation

- Machine Translation (MT) is the task of translating a sentence $x$ from one language (the source language) to a sentence $y$ in another language (the target language).
- Alignment is one of the issues in machine translation.

<center><img src="https://drive.google.com/uc?id=14sAaaV4w6NH_teAPMqPC_EXOIRYkNSV6" >

[source: Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf)


</center>

You can think of machine translation as building a conditional language model: the decoder predicts each target word given the source sentence and the target words produced so far.

<center><img src="https://drive.google.com/uc?id=1HeD-exFROQRMsLHNR4oxZ9K9sqkFD3vT" >


[source:Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Chapter 16](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch16.html#)


</center>

<center><img src="https://drive.google.com/uc?id=123Z7aJt0pk5CGlpUSFv4Wd7c4ukptoOQ" >

[source: Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf)


</center>

- RNN encoders take pairs of (input vector, previous hidden state) and the final hidden state becomes the representation to be decoded
- An RNN decoder takes pairs (output vector, previous hidden state), the output vectors taken in order being the output sequence
- The output at each step is a distribution across possible output words, so how do we choose?
  - Greedy Search: Pick the one with the highest probability

<center><img src="https://drive.google.com/uc?id=1mW_TFloszz8LCUE_Yp6JVizxcgRvs1P_" >

[source: Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf)

</center>


  - Ancestral Sampling: Sample from the distribution
  - Beam Search: At each step keep the $k$ most likely partial translations (where $k$ is something like 5); generate candidate next words for each; score them and keep the $k$ best

$$
\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)
$$
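
As a rough illustration of the scoring rule above, here is a minimal beam search over a toy model. Everything in it is an assumption made for the example: the probability table `TABLE`, the helper `toy_model`, and the start/end tokens are hypothetical, and a real decoder would compute the next-token log-probabilities with the RNN instead of a lookup.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token generated.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
    "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_model(seq):
    """Return log P(next token | sequence so far) as a dict."""
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(next_log_probs, start, end, k=5, max_len=10):
    """Return up to k (score, sequence) pairs, best first.

    The score is the sum of the log-probabilities of the tokens.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (score + logp, seq + [tok])
            for score, seq in beams
            for tok, logp in next_log_probs(seq).items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        # Keep only the k best; completed hypotheses leave the beam.
        for score, seq in candidates[:k]:
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[0], reverse=True)[:k]

best = beam_search(toy_model, "<s>", "</s>", k=2)
greedy = beam_search(toy_model, "<s>", "</s>", k=1)
print("beam  :", best[0][1])
print("greedy:", greedy[0][1])
```

With `k=1` the procedure reduces to greedy search, which commits to "a" (probability 0.6) and ends with a sequence of total probability 0.33. With `k=2` the search keeps "b" alive and finds the sequence through "z", whose total probability is 0.36.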
      
<center><img src="https://drive.google.com/uc?id=18c86k6lFz5f4_eqZigZ_mvzGhEy1ExB5" >


[source: Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf)  


</center>

- Google translates over 100B words/day
- In 2014 NMT burst past other techniques
- Advantages
  - End-to-end training
  - Distributed representations allow reuse of learned language features
  - Ability to use much larger context than small-n-grams
  - Better quality results
- Google NMT 2016
  - Far better
  - A single system for all language pairs
  - Can do an ok job of translating A to C if it knows A to B and B to C without training A to C
  


### Issues with the basic seq2seq model
- Embeddings like word2vec assume that collocation is important but don't make any finer distinction
- Some words in the context are much more important than others
- The words can be related in many ways: one describes another (adjectives and adverbs), one acts on another (subjects, objects, transitive verbs), one serves as a placeholder for another (pronouns)
- Relationships can be far away; they don't need to be in the same sentence
- Meaning unfurls itself slowly and retroactively; sometimes we can't understand what a sentence means until we get to the end of it; sometimes it only makes sense later or in the context of a lot that has come before
- CNN's and RNN's suffer from information and processing bottlenecks
- RNN's are:
  - slow (unparallelizable)
  - poor at using information that is not very close to the word being translated (the hidden state isn't big enough; LSTM's effectiveness drops rapidly for sentences beyond 30 words)
  - hard to interpret the meaning of the state because it entangles many previous representations
- Models without attention are usually good at producing grammatically correct output but occasionally throw in an unrelated word or antonym

### Example

First we download and unzip the dataset with `tf.keras.utils.get_file`, then locate and read the extracted text file using `pathlib`.

In [2]:
import pathlib

import tensorflow as tf

path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = pathlib.Path(path_to_zip).parent/'spa-eng/spa.txt'

lines = path_to_file.read_text(encoding='utf-8').splitlines()

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


In [4]:
lines[:4]

['Go.\tVe.', 'Go.\tVete.', 'Go.\tVaya.', 'Go.\tVáyase.']

Turn the sentences into 3 NumPy arrays, `encoder_input_data`, `decoder_input_data`, `decoder_target_data`:
* `encoder_input_data` is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.
* `decoder_input_data` is a 3D array of shape (num_pairs, max_spanish_sentence_length, num_spanish_characters) containing a one-hot vectorization of the Spanish sentences.
* `decoder_target_data` is the same as `decoder_input_data` but offset by one timestep: `decoder_target_data[:, t, :]` will be the same as `decoder_input_data[:, t + 1, :]`.

In [5]:
batch_size = 64  # Batch size for training.
epochs = 50  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.


# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()


for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Collect the distinct characters seen on each side.
    input_characters.update(input_text)
    target_characters.update(target_text)

In [6]:
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 10000
Number of unique input tokens: 71
Number of unique output tokens: 86
Max sequence length for inputs: 17
Max sequence length for outputs: 42


In [17]:
import numpy as np

input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.

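Train a basic LSTM-based Seq2Seq model to predict `decoder_target_data` given `encoder_input_data` and `decoder_input_data`. Our model uses teacher forcing: during training the decoder is fed the ground-truth previous character rather than its own prediction.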

In [25]:
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [26]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [27]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


# Run training
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=50,
          validation_split=0.2)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7ff8bccf1210>

Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data)

In [28]:
# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

In [29]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence

In [30]:
for seq_index in range(10):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: Go.
Decoded sentence: Ven.

-
Input sentence: Go.
Decoded sentence: Ven.

-
Input sentence: Go.
Decoded sentence: Ven.

-
Input sentence: Go.
Decoded sentence: Ven.

-
Input sentence: Hi.
Decoded sentence: Homo.

-
Input sentence: Run!
Decoded sentence: ¡Lo riento.

-
Input sentence: Run.
Decoded sentence: Corre.

-
Input sentence: Who?
Decoded sentence: ¿Qué cante.

-
Input sentence: Fire!
Decoded sentence: ¡Despira.

-
Input sentence: Fire!
Decoded sentence: ¡Despira.



## Attention
- The notion of *attention* is to have a model learn what parts of the context to pay the most attention to
<center><img src="https://drive.google.com/uc?id=18fUzvF3Rh7G2IQGRqTwuEapfXDWXl7UP" >

[source:Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Chapter 16](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch16.html#)

</center>

$$
\begin{aligned} \widetilde{\mathbf{h}}_{(t)} &=\sum_{i} \alpha_{(t, i)} \mathbf{y}_{(i)} \\ \text { with } \alpha_{(t, i)} &=\frac{\exp \left(e_{(t, i)}\right)}{\sum_{i^{\prime}} \exp \left(e_{(t, i^{\prime})}\right)} \\ \text { and } e_{(t, i)} &=\left\{\begin{array}{ll}{\mathbf{h}_{(t)}^{T} \mathbf{y}_{(i)}} & {\text { dot }} \\ {\mathbf{h}_{(t)}^{T} \mathbf{W} \mathbf{y}_{(i)}} & {\text { general }} \\ {\mathbf{v}^{T} \tanh \left(\mathbf{W}\left[\mathbf{h}_{(t)} ; \mathbf{y}_{(i)}\right]\right)} & {\text { concat }}\end{array}\right.\end{aligned}
$$
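
The "dot" variant of this attention mechanism can be sketched with NumPy as follows. This is a minimal illustration, assuming that $\mathbf{h}_{(t)}$ is the decoder hidden state at step $t$ and that the $\mathbf{y}_{(i)}$ are the encoder outputs; the function name `dot_attention` and the toy values are hypothetical.

```python
import numpy as np

def dot_attention(h_t, encoder_outputs):
    """Dot-product attention for a single decoder step.

    h_t:             shape (d,), decoder hidden state at step t
    encoder_outputs: shape (n, d), one row y_(i) per input position
    Returns the context vector (d,) and the attention weights (n,).
    """
    scores = encoder_outputs @ h_t        # e_(t,i) = h_(t)^T y_(i)
    scores = scores - scores.max()        # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()              # softmax over input positions
    context = weights @ encoder_outputs   # weighted sum of the y_(i)
    return context, weights

# Toy example: three input positions with 3-dimensional encoder outputs.
encoder_outputs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
h_t = np.array([0.0, 5.0, 0.0])           # most similar to position 1
context, weights = dot_attention(h_t, encoder_outputs)
print(weights.round(3))
print(context.round(3))
```

Because `h_t` lines up with the second encoder output, most of the attention weight falls on that position and the context vector is dominated by it.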

<center><img src="https://drive.google.com/uc?id=1oHDx7OvqcyafvWVASk7pDrJDPrYvZsHN" >

[source: Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf)  

<img src="https://drive.google.com/uc?id=1UGBluXLO10sG16Wr3k3bU4JMBOw9-Y5z" >

[source: Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture08-nmt.pdf) 

</center>

- To overcome the context bottleneck we will pass much more information: *all* of the hidden states from the input sequence, not just the last
- Attention uses a queryable memory, similar to a cross between a Map (Python dict) and a search tree
- We can query using a key and get a weighted average of matching values, weighted by how well the keys match the query
- Keys, values and queries are all tensors
- So, rather than trying to keep everything in a single context vector for the sentence we will use a tensor
- The longer the sentence, the bigger the tensor (usually one column per word in the sentence), which solves the capacity and gradient flow problem
- The embeddings of each word are typically a concatenation of the hidden states of a forward and reverse RNN at that position in the sentence
- Recurrent Attention: keys = values = input sequence; query is the hidden state from the previous step; calculates attention at each step
- Word2vec doesn't seem to work as pretraining
- Self-Attention: drop the recurrent connection; keys and values stay the same, but query becomes the current input
- Learned Self-Attention: something the model learns during training
- If we allow keys and values to be learned we move into a new area: Memory Networks e.g. Neural Turing Machines
- The attention result is a weighted sum of the values (the word representations)
- The weights are a softmax of the similarity of each key and query
- Two common methods of similarity
  - Additive Similarity: Single layer neural network
  - Multiplicative Similarity: Essentially a dot product, i.e. an unnormalized cosine similarity (faster)
- Multi-Head Attention: we can use several specialized attention mechanisms
  - Semantic
  - Grammatical
  - Tense
  - Context
- Putting syntax knowledge back in may take us to a new level again
- Attention models typically still use LSTMs, but the Transformer (next) doesn't use RNN's at all
- [Attention Visualization](https://distill.pub/2016/augmented-rnns/)
- You can find the example of the implementation [here](https://www.tensorflow.org/tutorials/text/nmt_with_attention)

## Multi-head Attention

<center><img src="https://drive.google.com/uc?id=1Ft6U3JJZdII3PYdBKByj0eYFUB0J8hrO" >

[source: Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

</center>

$$\text { Attention }(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\operatorname{softmax}\left(\frac{\mathbf{Q K}^{T}}{\sqrt{d_{keys}}}\right) \mathbf{V}$$

$Q$ is a matrix containing one row per query. Its shape is [nqueries, dkeys], where nqueries is the number of queries and dkeys is the number of dimensions of each query and each key.

$K$ is a matrix containing one row per key. Its shape is [nkeys, dkeys], where nkeys is the number of keys and values.

$V$ is a matrix containing one row per value. Its shape is [nkeys, dvalues], where dvalues is the number of dimensions of each value.

The shape of ${Q K}^{T}$ is [nqueries, nkeys]: it contains one similarity score for each query/key pair. The output of the softmax function has the same shape, but all rows sum up to 1. The final output has a shape of [nqueries, dvalues]: there is one row per query, where each row represents the query result (a weighted sum of the values).

The scaling factor scales down the similarity scores to avoid saturating the softmax function, which would lead to tiny gradients.

It is possible to mask out some key/value pairs by adding a very large negative value to the corresponding similarity scores, just before computing the softmax. This is useful in the Masked Multi-Head Attention layer.
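
The formula, the shapes and the masking trick described above can be checked with a small NumPy sketch. The function name `scaled_dot_product_attention`, the random inputs, and the convention that `mask` is `True` where a query may attend to a key are all assumptions made for this example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: [n_queries, d_keys], K: [n_keys, d_keys], V: [n_keys, d_values].

    mask (optional): boolean [n_queries, n_keys], True = key is visible.
    Returns the output [n_queries, d_values] and the attention weights.
    """
    d_keys = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_keys)          # [n_queries, n_keys]
    if mask is not None:
        # A very large negative score becomes a weight of ~0 after softmax.
        scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(42)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))

# Causal mask: position i may only attend to positions <= i.
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
out, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)
print(weights.round(2))
```

The printed weight matrix is lower triangular: the first query can only see the first key, so its output row is exactly the first row of `V`.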



<center><img src="https://drive.google.com/uc?id=1zOfgl5mxWzoKjlmb6r8AfIKltaVbhYUq" >

[source:Transformers and Self-Attention For Generative Models
(guest lecture by Ashish Vaswani and Anna Huang)](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-transformers.pdf)

</center>

## Transformer Architecture

- In 2017 Google showed that attention alone could outperform RNN's at translation (English to French and German)

<center><img src="https://drive.google.com/uc?id=1tOF8_zc9jUFCIii8e8ZbnEtmduwhS3os" >

[source: Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

</center>

In the following image, you can see that we stack several encoders and decoders on top of each other to build a Transformer.


<center><img src="https://drive.google.com/uc?id=1qUERxT4gmi3BfYBn3UkO2_imdJOvm-P6" ></center>


- Whereas RNN's must proceed step-by-step (word by word), the Transformer uses only a small (configurable) number of steps
- At each step it applies a self-attention mechanism to weigh the relative strength of the relationships the other words in the sentence have to the word being processed
- A weighted average of all of the current words' representations is used to produce an updated representation for the target word

<center><img src="https://drive.google.com/uc?id=14Cw_bOYanC6O9gZO2MOs8jZ_j9YfVV3g" ></center>

- The translated sentence is generated a word at a time
- Each decoded word attends to the representations of words already translated in the sentence as well as the representations of all of the words in the sentence being translated
- The Transformer performs well on coreference resolution: where for example a pronoun could refer to either of two nouns earlier in the sentence
- You can find the example of the implementation [here](https://www.tensorflow.org/tutorials/text/transformer)


### Positional Embeddings 

A positional embedding is a dense vector which encodes the position of a word within a sentence. These embeddings are fixed, defined using the sine and cosine functions of different frequencies. This solution gives the same performance as learned positional embeddings do, but it can extend to arbitrarily long sentences, which is why it is favored. 

<center><img src="https://drive.google.com/uc?id=1BF3df4rbKnDKbWkND-gO3_qHUtVG6eff" >


[source:Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Chapter 16](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch16.html#)

</center>
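
A minimal NumPy sketch of fixed sinusoidal positional embeddings is given below. It assumes the formulation from the "Attention Is All You Need" paper, where even dimensions use a sine and odd dimensions a cosine with wavelengths that grow geometrically up to $10000 \cdot 2\pi$; the function name `positional_encoding` and the sizes used are hypothetical.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a [max_len, d_model] matrix of sinusoidal position vectors.

    PE[pos, 2i]     = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i + 1] = cos(pos / 10000^(2i / d_model))
    """
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)
print(pe[:3, :4].round(3))
```

Since the matrix is computed from a formula rather than learned, it can be generated for any `max_len`, which is the property that lets the approach extend to arbitrarily long sentences. In a model these vectors would typically be added to the word embeddings before the first encoder layer.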

### BERT

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. [source: BERT GitHub page](https://github.com/google-research/bert)


<center><img src="https://drive.google.com/uc?id=1o6mwx4R345N4CV4werQq_i8VqGWAgU1t" >

[source BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)

</center>

#### Example: Classify text with BERT
This notebook contains an example of how to fine-tune BERT to perform sentiment analysis.

##### Setup

In [None]:

!pip install -q -U tensorflow-text  # A dependency of the preprocessing for BERT inputs
!pip install -q tf-models-official  # AdamW optimizer from tensorflow/models



In [None]:
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

##### Data 
Let's use the plain-text IMDB movie reviews dataset (the Large Movie Review Dataset), which contains the text of 50,000 movie reviews from the Internet Movie Database.

In [None]:
# Download and extract the IMDB dataset
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')
# Explore the directory structure
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

train_dir = os.path.join(dataset_dir, 'train')

# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)


AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

# Use the text_dataset_from_directory utility to create a labeled tf.data.Dataset
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2, #80:20 split of the training data 
    subset='training',
    seed=seed)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

# Let's create a validation set using an 80:20 split of the training data  
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [None]:
# Let's take a look at a few reviews
for text_batch, label_batch in train_ds.take(1):
  for i in range(2):
    print(f'Review: {text_batch.numpy()[i]}')
    label = label_batch.numpy()[i]
    print(f'Label : {label} ({class_names[label]})')

Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label : 0 (neg)
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they

##### Loading models from TensorFlow Hub

In [None]:
# Load a BERT model to fine-tune.
# Here we use a Small BERT (uncased, L=4 layers, H=512 hidden units, A=8 attention heads),
# a smaller variant with fewer parameters than the BERT-Base model released by the original BERT authors
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8' 

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3


##### Preprocessing 

In [None]:
# Load the preprocessing model into a hub.KerasLayer so that it can be composed with the
# small BERT encoder in the fine-tuned model. The input is truncated to 128 tokens
# (the number of tokens can be customized).
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

In [None]:
# Let's try the preprocessing model on some text
text_test = [' The 2002 Bourne Identity is such an amazing movie!']
text_preprocessed = bert_preprocess_model(text_test)

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')

# Note: input_type_ids contains only one value (0) because this is a single-sentence input.
# For an input made of several segments, each token's type id is the index of the segment it belongs to.

Keys       : ['input_word_ids', 'input_mask', 'input_type_ids']
Shape      : (1, 128)
Word Ids   : [  101  1996  2526 15803  4767  2003  2107  2019  6429  3185   999   102]
Input Mask : [1 1 1 1 1 1 1 1 1 1 1 1]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]
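
To make the three tensors above less abstract, here is a minimal pure-Python sketch of how a BERT-style preprocessor packs one or two token-id segments into fixed-length inputs. It is an illustration only: `pack_inputs` is a hypothetical helper, not part of the TF Hub preprocessing model, and it assumes the special-token ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) of the uncased English BERT vocabulary. The ids of the second segment are arbitrary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids assumed from the uncased BERT vocab


def pack_inputs(segments, seq_length=128):
    """Pack one or two lists of token ids into fixed-length BERT-style inputs.

    Hypothetical helper for illustration; the TF Hub preprocessing model
    does this (plus tokenization and truncation) internally.
    """
    word_ids = [CLS_ID]
    type_ids = [0]  # [CLS] belongs to the first segment
    for segment_index, token_ids in enumerate(segments):
        word_ids += list(token_ids) + [SEP_ID]
        type_ids += [segment_index] * (len(token_ids) + 1)
    if len(word_ids) > seq_length:
        raise ValueError("input longer than seq_length; truncation is not sketched here")
    n_real = len(word_ids)
    padding = seq_length - n_real
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": [1] * n_real + [0] * padding,   # 1 = real token, 0 = padding
        "input_type_ids": type_ids + [0] * padding,
    }


# Token ids of the example review, copied from the output above (without [CLS]/[SEP]).
review = [1996, 2526, 15803, 4767, 2003, 2107, 2019, 6429, 3185, 999]
single = pack_inputs([review])
pair = pack_inputs([review, [2009, 2001, 2307]])  # second segment ids are arbitrary
print(single["input_word_ids"][:12])
print(pair["input_type_ids"][:16])
```

For the single review, the first 12 word ids match the preprocessing output shown above. For the pair, the type ids switch from 0 to 1 where the second segment starts.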


##### Using the BERT model

In [None]:
# load the model from TF HUB
bert_model = hub.KerasLayer(tfhub_handle_encoder)

In [None]:
# Let's see the returned values using the example above
bert_results = bert_model(text_preprocessed)

# "pooled_output": represents each input sequence as a whole. The shape is [batch_size, H]; here, an embedding for the entire movie review.
# "sequence_output": represents each input token in context. The shape is [batch_size, seq_length, H]; here, a contextual embedding for every token in the movie review.
# "encoder_outputs": the intermediate activations of the L Transformer blocks.
   # outputs["encoder_outputs"][i] is a Tensor of shape [batch_size, seq_length, H] with the outputs of the i-th Transformer block, for 0 <= i < L.
   # The last value of the list is equal to sequence_output.


print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :5]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :5]}')

Loaded BERT: https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Pooled Outputs Shape:(1, 512)
Pooled Outputs Values:[ 0.9599421   0.8852585  -0.13827781  0.49501595  0.23172547]
Sequence Outputs Shape:(1, 128, 512)
Sequence Outputs Values:[[-1.249977    0.65426844  0.6727354  ... -0.4236857   0.2020679
   0.00666368]
 [-0.7421359   0.02141126 -0.49172333 ... -1.0028691   0.05149584
   0.19523036]
 [-1.2351371   0.37037274 -0.4866422  ... -0.66944957  0.19172102
   0.06887599]
 [-1.3988634  -0.00379672 -0.4742977  ... -0.6985862   0.9620378
   0.2523336 ]
 [-0.6657643   1.5479449   1.0051894  ... -0.59739363  0.3769249
  -0.8154006 ]]
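
The relationship between the two outputs can be sketched in a few lines. In the original BERT design, the pooled output is obtained by passing the first token's (`[CLS]`) final vector through a dense layer with a tanh activation, which is consistent with the pooled values above lying between -1 and 1. The sketch below assumes that design, uses toy sizes and random weights, and is not the TF Hub implementation; `pool`, `pooler_weights` and `pooler_bias` are illustrative names.

```python
import math
import random

random.seed(0)
SEQ_LENGTH, HIDDEN = 6, 4  # toy sizes; the model above uses 128 and 512

# Stand-in for sequence_output[0]: one HIDDEN-sized vector per token position.
sequence_output = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LENGTH)]

# Randomly initialised pooler parameters (learned during pre-training in a real model).
pooler_weights = [[random.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
pooler_bias = [0.0] * HIDDEN


def pool(token_vectors):
    """Dense layer + tanh applied to the first ([CLS]) token's vector."""
    cls_vector = token_vectors[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(pooler_weights, pooler_bias)
    ]


pooled_output = pool(sequence_output)
print(len(sequence_output), len(sequence_output[0]))  # one vector per token
print(len(pooled_output))                              # one vector per sequence
```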


##### Define the classifier model

In [None]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)
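
Because the final `Dense(1, activation=None)` layer has no activation, the model returns one unnormalised score (a logit) per review rather than a probability. A minimal sketch of how such a logit might be turned into a probability and a label is shown below; `predict_label` and its 0.5 threshold are illustrative assumptions, not part of the notebook.

```python
import math


def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def predict_label(logit, threshold=0.5):
    """Return (probability, label); the 0.5 threshold is an assumed default."""
    probability = sigmoid(logit)
    return probability, ("positive" if probability >= threshold else "negative")


for logit in (-2.0, 0.0, 3.0):
    probability, label = predict_label(logit)
    print(f"logit={logit:+.1f}  p={probability:.3f}  {label}")
```

Negative logits map to probabilities below 0.5 and positive logits to probabilities above 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the logit.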

In [None]:
# Sanity check: run the (still untrained) classifier on the example text.
# tf.sigmoid turns the raw logit into a probability, which is not meaningful before training.
classifier_model = build_classifier_model()
bert_raw_result = classifier_model(tf.constant(text_test))
print("BERT raw results", "\n", tf.sigmoid(bert_raw_result))
print("\n")
print("The model's structure", "\n", tf.keras.utils.plot_model(classifier_model))

BERT raw results 
 tf.Tensor([[0.37239307]], shape=(1, 1), dtype=float32)


The model's structure 
 <IPython.core.display.Image object>


##### Model training

In [None]:
# Loss function:
# We use losses.BinaryCrossentropy because this is a binary classification task.
# from_logits=True is needed because the final Dense layer has no activation and outputs a logit.
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()
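
As a rough illustration of what `from_logits=True` means, the sketch below compares binary cross-entropy computed directly from a logit (using the numerically stable form documented for `tf.nn.sigmoid_cross_entropy_with_logits`) with the textbook formula applied to the sigmoid probability. The helper names are illustrative, and this is a single-example sketch rather than the Keras implementation.

```python
import math


def bce_from_logit(label, logit):
    """Numerically stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, probability):
    """Textbook binary cross-entropy on a probability."""
    return -(label * math.log(probability) + (1 - label) * math.log(1 - probability))


logit, label = 1.5, 1
probability = 1.0 / (1.0 + math.exp(-logit))
print(bce_from_logit(label, logit), bce_from_probability(label, probability))
```

Both calls give the same value for moderate logits. The logit form stays finite for very large logits, where the probability form would try to take the log of zero.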




In [None]:
# Optimizer
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
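
The optimizer above combines AdamW with a learning-rate schedule that warms up over the first 10% of steps and then decays. A simplified sketch of the general shape of such a schedule is shown below: linear warm-up to `init_lr`, then linear decay to zero. The exact schedule built by `optimization.create_optimizer` may differ in detail, and `learning_rate_at` and the value `num_train_steps=1000` are placeholders for illustration.

```python
def learning_rate_at(step, init_lr=3e-5, num_train_steps=1000, num_warmup_steps=100):
    """Linear warm-up to init_lr, then linear decay to zero (simplified sketch)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)


for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate_at(step))
```

Starting with a small learning rate is a common choice when fine-tuning a pre-trained encoder, since large early updates could disturb the pre-trained weights.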

In [None]:
# Compile the classifier with the optimizer, loss, and metric defined above
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [None]:
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

Training model with https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Epoch 1/5

In [None]:
# Plot the loss and accuracy over time
history_dict = history.history
print(history_dict.keys())

acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
train_loss = history_dict['loss']
val_loss = history_dict['val_loss']

# Use new names so that the `loss` and `epochs` objects defined earlier are not overwritten
epoch_range = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))

plt.subplot(2, 1, 1)
# 'r' is for "solid red line"
plt.plot(epoch_range, train_loss, 'r', label='Training loss')
# 'b' is for "solid blue line"
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
# plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epoch_range, acc, 'r', label='Training acc')
plt.plot(epoch_range, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Call tight_layout once the subplots exist so that their spacing is adjusted
fig.tight_layout()

# Resources

- Contextual Word Representations: A Contextual Introduction by Noah A. Smith. arXiv:1902.06006v1
- https://en.wikipedia.org/wiki/Word_embedding
- https://medium.com/@aneesha/using-tsne-to-plot-a-subset-of-similar-words-from-word2vec-bb8eeaea6229
- https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
- https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c
- https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning
- https://www.kaggle.com/drscarlat/imdb-sentiment-analysis-keras-and-tensorflow
- https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
- https://medium.com/cityai/deep-learning-for-natural-language-processing-part-ii-8b2b99b3fa1e
- https://www.tensorflow.org/guide/summaries_and_tensorboard
- https://skymind.ai/wiki/attention-mechanism-memory-network
- https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
- A Neural Attention Model for Abstractive Sentence Summarization: https://arxiv.org/abs/1509.00685
- https://fasttext.cc/
- https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/
- https://www.tensorflow.org/text/guide/word_embeddings
- https://www.tensorflow.org/tutorials/text/word2vec