# Setup

In [0]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tensorflow-addons
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "nlp"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Introduction

- recurrent neural networks are popular for natural language processing
- a ***character RNN***  is trained to predict the next character in a sentence
- a ***stateless RNN*** learns on random portions of text at each iteration without any context 
- a ***stateful RNN*** preserves the hidden state between training iterations and continues reading where it left off, allowing it to learn longer patterns
- an RNN performs sentiment analysis by treating sentences as sequences of words rather than characters
- RNNs can be used to build Encoder-Decoder architectures capable of performing neural machine translation (NMT)
---
- ***attention mechanisms*** are neural network components that learn to select the part of the inputs that the rest of the model should focus on at each step: 
 - attention can be used to boost performance
 - there is also an attention-only architecture called the ***Transformer***


# Character RNNs

- the ***Char-RNN*** can be used to generate novel text, one character at a time:
 - it can learn words, grammar, and punctuation just by learning to predict the next character in a sentence

## Character Encoding

- let's first download all of Shakespeare's work using Keras's `get_file()` function:

In [2]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


In [3]:
print(shakespeare_text[0]) 

F


In [4]:
print(shakespeare_text[:148]) 

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



- next, we must encode every character as an integer using Keras's `Tokenizer` class: 
 - first, we need to fit a tokenizer to the text: it identifies every unique character (the text is lowercased by default) and maps each one to a different character ID from 1 to the number of distinct characters (IDs do not start at 0 by default)

In [0]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

- we set `char_level=True` to get character-level encoding rather than the default word-level encoding

In [6]:
tokenizer.texts_to_sequences(["love"])

[[12, 4, 26, 2]]

In [7]:
tokenizer.sequences_to_texts([[12, 4, 26, 2]])

['l o v e']

In [8]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters
max_id, dataset_size

(39, 1115394)

- let's encode the full text so each character is represented by its ID: 
 - we subtract 1 to get IDs from 0 to 38 rather than 1 to 39

In [0]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
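
- as a rough illustration of what the character-level tokenizer does, here is a minimal plain-Python sketch (the helper names `fit_char_index`, `encode`, and `decode` are hypothetical, and this is not the Keras implementation): characters are lowercased, the most frequent ones get the smallest IDs, and IDs start at 1

```python
from collections import Counter

def fit_char_index(text):
    """Map each distinct (lowercased) character to an ID starting at 1,
    giving the smallest IDs to the most frequent characters."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    return [char_index[ch] for ch in text.lower()]

def decode(ids, char_index):
    id_to_char = {i: ch for ch, i in char_index.items()}
    return "".join(id_to_char[i] for i in ids)

sample = "First Citizen: Before we proceed any further, hear me speak."
char_index = fit_char_index(sample)
ids = encode("hear me", char_index)
zero_based = [i - 1 for i in ids]  # same shift as the `- 1` applied to `encoded`
```

- decoding `ids` with the same index gives back `"hear me"`, and the shifted IDs start at 0, which is what `tf.one_hot()` expects later on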

## Splitting Sequential Data

- it is important to avoid any overlap between the training set, validation set, and the test set
- with time series, it is common to split across time: 
 - for example, taking the years 2000 to 2012 for the training set, 2013 to 2015 for the validation set, etc.
 - this assumes that the patterns the RNN can learn in the past (in the training set) will exist in the future (a ***stationary*** time series)
- to ensure a time series is sufficiently stationary, plot the model's errors on the validation set across time: 
 - if the model performs much better on the first part of the validation set than on the last part, the time series may not be stationary enough
---
- for our Shakespearean text: first 90% --> training set (keeping the rest for the validation/test sets)

In [0]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

## Chopping the Sequential Dataset into Multiple Windows

- the training set now consists of a single million+ character sequence, which means we can't train the neural network just yet:
 - the RNN would have a million+ layers and we would only have a single, very long instance to train it
- instead, we must use the dataset's `window()` method to convert this long sequence of characters into many smaller windows of text:
 - every instance will be a short substring of the whole text
 - the RNN will be unrolled over the length of these substrings (***truncated backpropagation through time***) 

In [0]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

- we use `shift=1` so that the first window contains characters 0 to 100, the second contains characters 1 to 101, etc. (heavily overlapping windows, which gives us the largest possible training set)
- to ensure all windows are 101 characters long (which allows us to create batches without any padding), we set `drop_remainder=True`
---
- `window()` creates a dataset containing windows, each of which is itself represented as a dataset (a ***nested dataset***): 
 - this is analogous to a list of lists
- we cannot use a nested dataset directly for training as our model expects tensors as inputs, not datasets:
 - we must call `flat_map()`, which converts a nested dataset into a ***flat dataset***
 - nested dataset: `{{1, 2}, {3, 4, 5, 6}}` --> 
 - flat dataset: `{1, 2, 3, 4, 5, 6}`
- passing `lambda ds: ds.batch(2)` to `flat_map()` transforms the nested dataset `{{1, 2}, {3, 4, 5, 6}}` into a flat dataset of size 2 tensors `{[1, 2], [3, 4], [5, 6]}`

In [0]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

- we call `batch(window_length)` on each window: since all windows have exactly that length, we get a single tensor for each
- we now have a dataset containing consecutive windows of 101 characters each
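
- to make the windowing logic concrete, here is a small plain-Python sketch (the `windows()` helper is hypothetical and only mimics the behavior of `window()` followed by `flat_map()` on a toy sequence; it is not how `tf.data` is implemented):

```python
def windows(seq, size, shift, drop_remainder=True):
    """Yield lists of `size` consecutive items, starting every `shift` items."""
    for start in range(0, len(seq), shift):
        window = list(seq[start:start + size])
        if drop_remainder and len(window) < size:
            break  # every later window would be too short as well
        yield window

toy = list(range(10))
overlapping = list(windows(toy, size=4, shift=1))  # [0..3], [1..4], ..., [6..9]
disjoint = list(windows(toy, size=4, shift=4))     # [0..3], [4..7]
```

- with `shift=1` consecutive windows share all but one item, while a shift equal to the window size gives windows with no items in common (the incomplete last window is dropped)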

## Shuffling the Windows

- Gradient Descent works best when the instances in the training set are independent and identically distributed, so we need to shuffle these windows
- next, we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the inputs shifted one character ahead)

In [0]:
np.random.seed(42)
tf.random.set_seed(42)

In [0]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

- categorical input features should be encoded as one-hot vectors or embeddings:

In [0]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

- finally, just add prefetching:

In [0]:
dataset = dataset.prefetch(1)

In [17]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
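
- the shapes above can be reproduced with a NumPy-only sketch (random integers stand in for one batch of windows; `n_classes` plays the role of `max_id`):

```python
import numpy as np

n_classes = 39  # number of distinct character IDs (`max_id` in the notebook)
rng = np.random.default_rng(42)
windows = rng.integers(0, n_classes, size=(32, 101))  # stand-in for one batch of windows

# inputs = all but the last character, targets = all but the first character
X_ids, Y = windows[:, :-1], windows[:, 1:]
# one-hot encode the inputs only; the targets stay as integer class IDs
X = np.eye(n_classes, dtype=np.float32)[X_ids]

print(X.shape, Y.shape)
```

- the targets stay as sparse integer IDs, which is why the model is later compiled with the `"sparse_categorical_crossentropy"` loss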


## Building and Training the Char-RNN Model

- to predict the next character based on the previous 100, we'll use an RNN with 2 `GRU` layers with 128 units each and 20% dropout on both the inputs (`dropout`) and the hidden states (`recurrent_dropout`)
- the output layer is a time-distributed `Dense` layer, which has 39 units (`max_id`) because there are 39 distinct characters in the text:
 - we want to output a probability for each possible character at each time step
 - the output probabilities should sum up to 1 at each time step, so the softmax activation function is applied to the `Dense` layer's outputs

In [18]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])



In [0]:
# takes hours
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, steps_per_epoch=train_size // batch_size,
                    epochs=10)

## Using the Model

- we now have a model that can predict the next character in Shakespearean text
- first, we need to employ the same preprocessing steps as earlier:

In [0]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

- now, let's use the model to predict the next letter in some text:

In [0]:
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model.predict(X_new), axis=-1)  # predict_classes() is gone in recent TF versions
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char

`'u'`

## Using the Model to Generate Text

- to generate new text using our Char-RNN model, we pick the next character randomly with a probability equal to the estimated probability using `tf.random.categorical()`:
 - `categorical()` samples random class indices given the class log probabilities (logits)
- for better control over the diversity of the generated text, we can divide the logits by the ***temperature***:
 - a temperature of ~0 favors high probability characters, while a very high temperature gives all characters an equal probability
- the following `next_char()` function uses this approach to pick the next character to add to the input text:

In [0]:
tf.random.set_seed(42)

In [0]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
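
- the effect of the temperature can be checked without a model; the following NumPy sketch (the helper `apply_temperature()` is hypothetical) computes the distribution that sampling from `log(p) / temperature` corresponds to:

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    logits = np.log(probas) / temperature
    logits -= logits.max()  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

probas = np.array([0.6, 0.3, 0.1])
cold = apply_temperature(probas, 0.2)     # sharpened: almost always the top character
neutral = apply_temperature(probas, 1.0)  # unchanged
hot = apply_temperature(probas, 100.0)    # flattened: close to uniform
```

- a temperature of 1 leaves the estimated probabilities untouched, a low temperature concentrates nearly all of the mass on the most likely character, and a very high temperature makes all characters almost equally likely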

- the following `complete_text()` function repeatedly calls `next_char()` to get the next character and append it to the text:

In [0]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

- now, let's generate some text (experimenting with different temperatures):

In [0]:
tf.random.set_seed(42)

print(complete_text("t", temperature=0.2))

`the belly the great and who shall be the belly the`

In [0]:
print(complete_text("w", temperature=1))

`thing? or why you gremio.`

`who make which the first`

In [0]:
print(complete_text("w", temperature=2))

`th no cce:`

`yeolg-hormer firi. a play asks.`

`fol rusb`

- our Shakespeare model works best with a temperature close to 1
- to generate more convincing text, use more `GRU` layers with more neurons, increase the training time, and add some regularization (e.g., setting `recurrent_dropout=0.3` in the `GRU` layers)

## Stateful Character RNN

- so far, we have only used ***stateless RNNs***: the model starts with a hidden state full of zeros at each training iteration, then updates this state at each time step, then discards it after the last time step
- ***stateful RNNs*** preserve this final state after processing one training batch and use it as the initial state for the next training batch:
 - this enables them to learn long-term patterns despite only backpropagating through short sequences
---
- stateful RNNs only function if each batch's input sequence starts exactly where the corresponding sequence in the previous batch left off: 
 - therefore, we must use sequential and nonoverlapping input sequences (rather than the shuffled and overlapping sequences we used to train our stateless Char-RNN)
- when creating the `Dataset`, we must therefore use `shift=n_steps` instead of `shift=1` when calling `window()`
---
- batching is much more difficult with stateful RNNs:
 - we chop Shakespeare's text into 32 texts of equal length
 - then we create one dataset of consecutive input sequences for each of them
 - then we use `tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))` to create proper consecutive batches

In [0]:
tf.random.set_seed(42)

In [0]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for encoded_part in encoded_parts:
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)
dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.repeat().map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
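
- the key property of this batching scheme is that row *i* of each batch continues exactly where row *i* of the previous batch stopped; a NumPy-only sketch on a toy sequence (tiny, made-up values for `n_steps` and `batch_size`) makes this visible:

```python
import numpy as np

n_steps, batch_size = 3, 2
window_length = n_steps + 1
encoded = np.arange(20)  # stand-in for the encoded text

parts = np.array_split(encoded, batch_size)  # one contiguous chunk per batch row
per_part = []
for part in parts:
    # windows of `window_length` items, shifted by `n_steps`, remainder dropped
    starts = range(0, len(part) - window_length + 1, n_steps)
    per_part.append([part[s:s + window_length] for s in starts])

# zip the chunks: batch k holds the k-th window of every chunk
batches = [np.stack(ws) for ws in zip(*per_part)]
X_batches = [b[:, :-1] for b in batches]
```

- here `X_batches[0]` is `[[0, 1, 2], [10, 11, 12]]` and `X_batches[1]` is `[[3, 4, 5], [13, 14, 15]]`, so the state left after the first batch is the right starting state for the second one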

- now, let's create the stateful RNN:
 - first, set `stateful=True` when creating every recurrent layer
 - second, set the `batch_input_shape` argument in the first layer (leave the second dimension unspecified as the inputs could have any length)

In [27]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])



- at the end of each epoch, we must reset the states before going back to the beginning of the text:
 - for this, we can use a small callback

In [0]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs):
        self.model.reset_states()

- each epoch is much shorter than earlier because the windows no longer overlap, and each batch holds one window from each of the 32 text chunks:

In [0]:
# takes an hour to run
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
steps_per_epoch = train_size // batch_size // n_steps
history = model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=50,
                    callbacks=[ResetStatesCallback()])

- once trained, this model can only make predictions for batches of the same size as the one used during training (32 here):
 - to avoid this restriction, create an identical *stateless* model and copy the stateful model's weights to it

# Sentiment Analysis

- while MNIST is the "hello world" of computer vision, the IMDb reviews dataset is the "hello world" of natural language processing: 
 - it consists of 50,000 English movie reviews (25,000 training | 25,000 testing) along with a binary target for each review indicating whether it is negative (0) or positive (1)

In [0]:
tf.random.set_seed(42)

In [0]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

In [49]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

- the dataset has already been preprocessed: 
 - `X_train` consists of a list of reviews, each of which is represented as a NumPy array of integers, where each integer represents a word
- the integers 0, 1, and 2 represent the padding token, the ***start-of-sequence*** (SOS) token, and unknown words, respectively


In [50]:
# visualizing a review
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'

- let's load the original IMDb reviews as text (byte strings) using TensorFlow Datasets:

In [51]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

In [52]:
datasets.keys()

dict_keys(['test', 'train', 'unsupervised'])

In [53]:
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples
train_size, test_size

(25000, 25000)

- now, let's write the preprocessing function:

In [0]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

In [56]:
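X_batch, y_batch = next(iter(datasets["train"].batch(2)))  # assumed: 2 raw reviews, matching the output shape below
preprocess(X_batch, y_batch)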

(<tf.Tensor: shape=(2, 53), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'pi', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', b'warm', b'and', b'comfortable',
         b'on', b'the', b'sette', b'and', b'having', b'j

- `preprocess()` starts by truncating the reviews, only keeping the first 300 characters of each: 
 - this will speed up training and not impact performance as you can generally tell the sentiment of a review from the first few sentences
- then it uses ***regular expressions*** to replace `<br />` tags with spaces, and to replace any characters other than the letters and quotes with spaces:
 - for example, the text `"Well, I can't<br />"` becomes `"Well I can't"`
- finally, `preprocess()` splits the reviews by the spaces, which returns a ragged tensor, and then converts this ragged tensor to a dense tensor, padding all reviews with the padding token `"<pad>"` to equalize their lengths
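
- the same steps can be tried on ordinary Python strings with the standard `re` module; this is only a sketch (the helpers `preprocess_text()` and `pad()` are hypothetical and work on `str` objects rather than on byte-string tensors):

```python
import re

def preprocess_text(review, max_chars=300):
    review = review[:max_chars]                  # keep only the first characters
    review = re.sub(r"<br\s*/?>", " ", review)   # replace HTML line breaks with spaces
    review = re.sub(r"[^a-zA-Z']", " ", review)  # keep only letters and quotes
    return review.split()                        # split on runs of whitespace

def pad(batch, pad_token="<pad>"):
    """Pad every list of words to the length of the longest one."""
    longest = max(len(words) for words in batch)
    return [words + [pad_token] * (longest - len(words)) for words in batch]

batch = pad([preprocess_text("Well, I can't<br />"),
             preprocess_text("Great movie!")])
# batch == [['Well', 'I', "can't"], ['Great', 'movie', '<pad>']]
```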
---
- next, we need to construct the vocabulary: 
 - this requires going through the entire training set once, applying `preprocess()`, and using a `Counter` to count the number of occurrences of each word:

In [0]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

- let's look at the three most common words:

In [58]:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [59]:
len(vocabulary)

53893

- we probably don't need our model to know all the words in the dictionary to yield good performance, so let's truncate the vocabulary, only keeping the 10,000 most common words:

In [62]:
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]
len(truncated_vocabulary)

10000

- now, we need to add a preprocessing step to replace each word with its ID (index in the vocabulary):

In [0]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

- we can use this table to find the IDs of a few words:

In [64]:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>

- "this", "movie", and "was" were found in the table, so their IDs are lower than 10,000:
 - "faaaaaantastic", however, was not found in the table, so it was mapped to one of the oov buckets with an ID greater than or equal to 10,000
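
- the idea behind the oov buckets can be sketched in plain Python; note that `zlib.crc32` is only a stand-in hash chosen for this illustration (the real table uses its own hashing function), and `make_lookup()` is a hypothetical helper:

```python
import zlib

def make_lookup(vocabulary, num_oov_buckets):
    word_to_id = {word: i for i, word in enumerate(vocabulary)}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # unknown word: hash it into one of the extra buckets
        bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
        return len(vocabulary) + bucket

    return lookup

vocabulary = ["<pad>", "the", "a", "movie"]
lookup = make_lookup(vocabulary, num_oov_buckets=5)
ids = [lookup(word) for word in ["the", "movie", "faaaaaantastic"]]
# known words keep their vocabulary index; the unknown word gets an ID in [4, 9)
```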
---
- now, we're ready to create the final training set:
 - first, we batch the reviews
 - second, we convert them to short sequences of words using `preprocess()`
 - third, we encode these words using `encode_words()`, which uses the table we just built
 - finally, we prefetch the next batch

In [0]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

In [0]:
train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [71]:
for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


- at last, we can create the model and train it:

In [72]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


- the `Embedding` layer converts word IDs into embeddings:
 - the embedding matrix must have one row per word ID (`vocab_size + num_oov_buckets`) and one column per embedding dimension (this example uses 128 dimensions, but this is a tunable hyperparameter)
- the inputs of the model will be 2D tensors of shape **[batch size, time steps]**, but the output of the `Embedding` layer will be a 3D tensor of shape **[batch size, time steps, embedding size]**
- there are two `GRU` layers, with the second one returning only the output of the last time step
- the output layer is a single neuron using the sigmoid activation function to output the estimated probability that the review expresses a positive sentiment
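
- conceptually, an embedding lookup is just row indexing into a matrix; a NumPy sketch with tiny made-up sizes (random values stand in for trained embeddings) shows the change of shape:

```python
import numpy as np

vocab_size, num_oov_buckets, embed_size = 10, 2, 4
rng = np.random.default_rng(0)
# one row per word ID, one column per embedding dimension
embedding_matrix = rng.normal(size=(vocab_size + num_oov_buckets, embed_size))

word_ids = np.array([[3, 7, 0, 0],
                     [11, 2, 5, 0]])    # [batch size, time steps]
embedded = embedding_matrix[word_ids]   # [batch size, time steps, embedding size]
```

- the ID 11 is valid only because the matrix has `vocab_size + num_oov_buckets` rows; with `vocab_size` rows alone, oov IDs would fall outside the matrix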

## Masking

- we want the model to ignore the padding tokens (whose ID is 0) and focus on the data that matters, so we set `mask_zero=True` when creating the `Embedding` layer
- using masking layers and automatic mask propagation works best for simple `Sequential` models: 
 - for more complex models, you may need to compute the mask explicitly and pass it to the appropriate layers
 - for example, the following model is identical to the previous model, except it is built using the Functional API and handles masking manually

In [73]:
K = keras.backend
embed_size = 128
inputs = keras.layers.Input(shape=[None])
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)
outputs = keras.layers.Dense(1, activation="sigmoid")(z)
model = keras.models.Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
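
- what the mask represents can be shown with NumPy alone; in this sketch the `outputs` array is a made-up stand-in for the per-step outputs of a recurrent layer, and the padding is assumed to sit at the end of each review:

```python
import numpy as np

word_ids = np.array([[22, 11, 28, 0, 0],
                     [6, 21, 70, 5, 9]])
mask = word_ids != 0        # True where there is a real word
lengths = mask.sum(axis=1)  # number of real words per review

# pretend per-step outputs of a recurrent layer: [batch size, time steps, units]
outputs = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)
# skipping the padded steps means the final output is the one at the last real word
last_valid = outputs[np.arange(len(outputs)), lengths - 1]
```

- for the first review the output at step 2 is selected (the last real word), whereas without masking the output at step 4 would be computed from two padding tokens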


## Reusing Pretrained Embeddings

- TensorFlow Hub makes it easy to reuse pretrained model components, called ***modules***, in your own models: 
 - for example, let's use the `nnlm-en-dim50` sentence embedding module, version 1, in our own sentiment analysis model

In [0]:
tf.random.set_seed(42)

- TF Hub caches the downloaded files in the local system's temporary directory by default; you may prefer to download them into a more permanent directory to avoid having to download them again after every system cleanup: 
 - for this, set the `TFHUB_CACHE_DIR` environment variable to the directory of your choice

In [0]:
TFHUB_CACHE_DIR = os.path.join(os.curdir, "my_tfhub_cache")
os.environ["TFHUB_CACHE_DIR"] = TFHUB_CACHE_DIR

In [0]:
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

- the `hub.KerasLayer` layer downloads the module from the URL
- this module is a ***sentence encoder***: it takes strings as input and encodes each one as a single vector (in this case, a 50-dimensional vector): 
 - internally, it parses the string (splitting words on spaces) and embeds each word using an embedding matrix that was pretrained on a huge corpus: the Google News 7B corpus (seven billion words long)
---
- next, we can simply load the IMDb reviews dataset and train the model directly: 
 - no need for preprocessing (except for batching and prefetching) 

In [0]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
batch_size = 32
train_set = datasets["train"].repeat().batch(batch_size).prefetch(1)

In [78]:
history = model.fit(train_set, steps_per_epoch=train_size // batch_size, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# Encoder-Decoder Network for Neural Machine Translation

- let's look at a neural machine translation model that translates English sentences to French: 
 - the English sentences are fed to the encoder and the decoder outputs the French translations
- the English sentences are reversed before they are fed to the encoder, so the beginning of the English sentence is fed last; this is useful because it is generally the first thing that the decoder needs to translate
---
- initially, each word is represented by its ID
- next, an `embedding` layer returns the word embedding: 
 - these word embeddings are what are actually fed to the encoder and decoder
---
- at each step, the decoder outputs a score for each word in the output vocabulary (French), then the softmax layer turns these scores into probabilities: 
 - for example, at the first step, the word "Je" may have a probability of 20%, "Tu" may have a probability of 1%, etc.
 - the word with the highest probability is output
- this is similar to a regular classification task, so you can train the model using the `"sparse_categorical_crossentropy"` loss
---
- so far, we've assumed that all input sequences have a constant length, but obviously that's not true for sentences
- regular tensors have fixed shapes, so they can only contain sentences of the same length
- therefore, we must group sentences into buckets of similar lengths using padding for the shorter sentences to ensure all sentences in a bucket have the same length
---
- we must ignore any output past the end-of-sentence (EOS) token:
 - for example, if the model outputs `"Je bois du lait <eos> oui"`, the loss for the last word should be ignored
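
- one way to ignore these steps is to give them a weight of zero in the loss; the NumPy sketch below assumes the mask is derived from the position of `<eos>` in the target sequence, and the word IDs, probabilities, and helper name `masked_crossentropy()` are made up for the illustration:

```python
import numpy as np

def masked_crossentropy(y_true, y_proba, eos_id):
    """Mean cross-entropy over the steps up to and including the first <eos>."""
    steps = np.arange(len(y_true))
    losses = -np.log(y_proba[steps, y_true])   # per-step cross-entropy
    eos_steps = np.flatnonzero(y_true == eos_id)
    n_valid = eos_steps[0] + 1 if len(eos_steps) else len(y_true)
    weights = (steps < n_valid).astype(float)  # 1 up to <eos>, 0 afterwards
    return (losses * weights).sum() / weights.sum()

y_true = np.array([1, 2, 3, 4, 5, 0])  # je bois du lait <eos> <pad>
y_proba = np.full((6, 6), 0.1)
y_proba[np.arange(6), y_true] = 0.5    # 50% on the correct word at every step
y_proba[5] = [0.01, 0.19, 0.2, 0.2, 0.2, 0.2]  # a bad prediction after <eos>
loss = masked_crossentropy(y_true, y_proba, eos_id=5)  # -log(0.5), unaffected by step 5
```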
---
- when the output vocabulary is large, outputting a probability for each and every possible word would be tremendously slow
- let's say the target vocabulary contains 50,000 French words: 
 - the decoder would output 50,000-dimensional vectors
 - computing the softmax function over such a large vector would be very computationally intensive
- one solution is to only look at the logits for the correct word and for a random sample of incorrect words, then compute an approximation of the loss based only on these logits
- you can use `tf.nn.sampled_softmax_loss()` to employ this ***sampled softmax*** technique during training: 
 - sampled softmax cannot be used at inference time because it requires knowing the target

In [0]:
tf.random.set_seed(42)

In [0]:
vocab_size = 100
embed_size = 10

In [0]:
# doesn't run
import tensorflow_addons as tfa

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

sampler = tfa.seq2seq.sampler.TrainingSampler()

decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)

model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])

In [0]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

In [0]:
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)

history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)

## Bidirectional RNNs

- at each time step, a regular recurrent layer only looks at past and present inputs before generating an output:
 - it is "causal", meaning it cannot look into the future
---
- for many NLP tasks, such as Neural Machine Translation, it is preferable to look ahead at the next words before encoding a given word
- to implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left: 
 - then combine their outputs at each time step by concatenating them
 - this is called a ***bidirectional recurrent layer***
---
- for implementation, wrap a recurrent layer in a `keras.layers.Bidirectional` layer:

In [83]:
model = keras.models.Sequential([
    keras.layers.GRU(10, return_sequences=True, input_shape=[None, 10]),
    keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))
])

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_8 (GRU)                  (None, None, 10)          660       
_________________________________________________________________
bidirectional (Bidirectional (None, None, 20)          1320      
Total params: 1,980
Trainable params: 1,980
Non-trainable params: 0
_________________________________________________________________


- the `Bidirectional` layer will create a clone of the `GRU` layer (in the reverse direction) and it will run both and concatenate their outputs: 
 - so although the `GRU` layer has 10 units, the `Bidirectional` layer will output 20 values per time step
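
What the wrapper does can be approximated by hand with a minimal NumPy sketch. The assumptions are loud ones: a basic tanh RNN stands in for the `GRU`, and the helper names and random weights are hypothetical; only the "run both directions, then concatenate" logic mirrors the description above.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Run a basic tanh RNN over inputs of shape [time, features]."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        outputs.append(h)
    return np.stack(outputs)                      # [time, units]

def make_params(rng, n_features, n_units):
    """Hypothetical random weights for one direction."""
    return (rng.normal(size=(n_features, n_units)) * 0.1,
            rng.normal(size=(n_units, n_units)) * 0.1,
            np.zeros(n_units))

rng = np.random.default_rng(42)
n_steps, n_features, n_units = 5, 10, 10
inputs = rng.normal(size=(n_steps, n_features))

forward_params = make_params(rng, n_features, n_units)
backward_params = make_params(rng, n_features, n_units)   # separate weights

forward_out = simple_rnn(inputs, *forward_params)
# Read the sequence right to left, then flip the outputs back so that
# step t of both directions refers to the same input position.
backward_out = simple_rnn(inputs[::-1], *backward_params)[::-1]

bidirectional_out = np.concatenate([forward_out, backward_out], axis=-1)
print(bidirectional_out.shape)  # (5, 20)
```

Each direction has 10 units, so every time step ends up with 20 values, matching the output shape reported by `model.summary()`.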

## Beam Search

- ***beam search*** keeps track of a short list of the *k* most promising sentences, which helps boost performance: 
 - at each decoder step, it tries to extend them by one word, keeping only the *k* most likely sentences
 - *k* is called the ***beam width***
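
The procedure can be illustrated with a small pure-Python sketch. The "model" here is a hypothetical lookup table in which the next-word probabilities depend only on the previous word; a real decoder would supply these probabilities, and practical implementations often add length normalization, which is left out.

```python
import math

# Hypothetical toy "model": next-word probabilities given the last word.
NEXT_WORD_PROBAS = {
    "<sos>": {"how": 0.6, "hi": 0.4},
    "how":   {"are": 0.55, "will": 0.45},
    "hi":    {"there": 0.9, "<eos>": 0.1},
    "are":   {"you": 0.6, "<eos>": 0.4},
    "will":  {"you": 0.6, "<eos>": 0.4},
    "there": {"<eos>": 1.0},
    "you":   {"<eos>": 1.0},
}

def beam_search(beam_width, max_steps=10):
    """Return the beam_width most likely sentences as (log_proba, words)."""
    beams = [(0.0, ["<sos>"])]
    for _ in range(max_steps):
        candidates = []
        for log_proba, words in beams:
            if words[-1] == "<eos>":              # finished sentences are kept as-is
                candidates.append((log_proba, words))
                continue
            # Extend every unfinished sentence by one word.
            for word, proba in NEXT_WORD_PROBAS[words[-1]].items():
                candidates.append((log_proba + math.log(proba), words + [word]))
        # Keep only the k most likely sentences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(words[-1] == "<eos>" for _, words in beams):
            break
    return beams

greedy = beam_search(beam_width=1)[0]
best = beam_search(beam_width=3)[0]
print(" ".join(greedy[1]), f"{math.exp(greedy[0]):.3f}")  # <sos> how are you <eos> 0.198
print(" ".join(best[1]), f"{math.exp(best[0]):.3f}")      # <sos> hi there <eos> 0.360
```

With a beam width of 1 the search is greedy and commits to "how" because it is the most likely first word; with a beam width of 3 the less likely first word "hi" survives long enough to produce a more probable sentence overall.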

## Attention Mechanisms

- ***attention mechanisms*** allow the decoder to focus on the appropriate words (as encoded by the encoder) at each time step: 
 - for example, at the time step where the decoder needs to output the word "lait", it will focus its attention on the word "milk"
 - this shortens the path from an input word to its translation, which alleviates the impact of short-term memory limitations
- attention mechanisms revolutionized neural machine translation (and NLP in general), allowing a significant improvement in the state of the art, especially for long sentences (30+ words)
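
A single decoder step of attention can be sketched in NumPy as follows. It assumes dot-product scoring, which is only one of several possible scoring functions, and the encodings and decoder state are made-up vectors chosen so that the decoder state resembles the encoding of "milk".

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_outputs):
    """Return the context vector and the attention weights for one decoder step."""
    scores = encoder_outputs @ decoder_state      # one alignment score per input word
    weights = softmax(scores)                     # weights are positive and sum to 1
    context = weights @ encoder_outputs           # weighted sum of the encoder outputs
    return context, weights

# Hypothetical 4-dimensional encodings for three source words
source_words = ["I", "drink", "milk"]
encoder_outputs = np.array([[1.0, 0.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0, 0.0],
                            [0.0, 0.0, 1.0, 0.0]])

# Hypothetical decoder state at the step where "lait" should be produced
decoder_state = np.array([0.1, 0.2, 3.0, 0.0])

context, weights = attend(decoder_state, encoder_outputs)
focus = source_words[int(np.argmax(weights))]
print(focus)  # milk
```

The context vector is fed to the decoder at that step, so the relevant input word reaches the output through one weighted sum instead of through many recurrent steps.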


## Visual Attention

- beyond NMT, attention mechanisms can be used for generating image captions using visual attention: 
 - first, a CNN processes the image and outputs feature maps
 - second, a decoder RNN equipped with an attention mechanism generates the caption one word at a time
 - at each decoder time step (each word), the decoder uses the attention model to focus on just the right part of the image

## Explainability

- attention mechanisms can also make it easier to understand what led the model to produce its output: 
 - this is called ***explainability***, which can be used as a tool to debug a model
---
- attention mechanisms are so powerful that they can be the sole component of a state-of-the-art model

## The Transformer Architecture

- the ***Transformer*** architecture significantly improved the state of the art in NMT without using any recurrent or convolutional layers (only attention mechanisms):
 - as a bonus, this architecture was also much faster to train and easier to parallelize

### Positional Embeddings

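- a ***positional embedding*** is a dense vector that encodes the position of a word within a sentence; it is added to the word embedding of the word at that position: 
 - positional embeddings can be learned by the model, but the layer below uses fixed ones, computed from sine and cosine functions of different frequencies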

In [0]:
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1 # max_dims must be even
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))
    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_embedding[:, :shape[-2], :shape[-1]]
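
To inspect the values that this layer adds to the embeddings, the same sine/cosine table can be rebuilt in plain NumPy. This is a sketch for exploration only; the function name and the small sizes are hypothetical, and it assumes `max_dims` is even.

```python
import numpy as np

def sinusoidal_positions(max_steps, max_dims):
    """Build the [max_steps, max_dims] table of fixed positional embeddings."""
    p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
    angles = (p / 10000 ** (2 * i / max_dims)).T  # [max_steps, max_dims // 2]
    pos_emb = np.empty((max_steps, max_dims))
    pos_emb[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pos_emb[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pos_emb

pos_emb = sinusoidal_positions(max_steps=50, max_dims=8)
print(pos_emb.shape)  # (50, 8)
print(pos_emb[0])     # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

All values stay within [-1, 1], and the low-index dimensions oscillate quickly with the position while the high-index ones change slowly.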

- now, we can create the first layers of the Transformer:

In [0]:
embed_size = 512; max_steps = 500; vocab_size = 10000
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
positional_encoding = PositionalEncoding(max_steps, max_dims=embed_size)
encoder_in = positional_encoding(encoder_embeddings)
decoder_in = positional_encoding(decoder_embeddings)

### Multi-Head Attention

- the ***Multi-Head Attention*** layer is based on the ***Scaled Dot-Product Attention*** layer
---
- if we ignore the skip connections, the layer normalization layers, the Feed Forward blocks, and the fact that this is Scaled Dot-Product Attention (not exactly Multi-Head Attention), then the rest of the transformer model can be implemented like so:

In [0]:
Z = encoder_in
for N in range(6):
    Z = keras.layers.Attention(use_scale=True)([Z, Z])

encoder_outputs = Z
Z = decoder_in
for N in range(6):
    Z = keras.layers.Attention(use_scale=True, causal=True)([Z, Z])
    Z = keras.layers.Attention(use_scale=True)([Z, encoder_outputs])

outputs = keras.layers.TimeDistributed(
    keras.layers.Dense(vocab_size, activation="softmax"))(Z)

- the Multi-Head Attention layer is just a bunch of Scaled Dot-Product Attention layers, each preceded by a linear transformation of the values, keys, and queries
- all the outputs are concatenated, and they go through a final linear transformation
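
The description above can be sketched in NumPy for a single sequence. This is an illustration under assumptions rather than the Keras or paper implementation: the function names and random weights are hypothetical, and each projection is one large matrix whose output is split across heads, which is equivalent to giving every head its own smaller linear transformation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention over the last two axes: [..., steps, d_head]."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if causal:
        # Mask the positions that lie in the future of each query.
        n_q, n_k = scores.shape[-2:]
        mask = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

def multi_head_attention(queries, keys, values, params, n_heads, causal=False):
    W_q, W_k, W_v, W_o = params

    def split_heads(x):
        n_steps, d_model = x.shape
        return x.reshape(n_steps, n_heads, d_model // n_heads).swapaxes(0, 1)

    # Linear transformations of the queries, keys, and values, one slice per head
    Q = split_heads(queries @ W_q)
    K = split_heads(keys @ W_k)
    V = split_heads(values @ W_v)
    heads = scaled_dot_product_attention(Q, K, V, causal)   # [heads, steps, d_head]
    # Concatenate the heads, then apply the final linear transformation.
    concat = heads.swapaxes(0, 1).reshape(queries.shape[0], -1)
    return concat @ W_o

rng = np.random.default_rng(42)
n_steps, d_model, n_heads = 6, 16, 4              # hypothetical toy sizes
params = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
X = rng.normal(size=(n_steps, d_model))

Z = multi_head_attention(X, X, X, params, n_heads)                  # self-attention
Z_causal = multi_head_attention(X, X, X, params, n_heads, causal=True)
print(Z.shape, Z_causal.shape)  # (6, 16) (6, 16)
```

With `causal=True`, the output at a given step depends only on that step and the ones before it, which is what the decoder's first attention layer needs.
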
---
- for a complete implementation, see TensorFlow's tutorial on building a Transformer model for language understanding: https://www.tensorflow.org/tutorials/text/transformer

# Exercises

*1) What are the pros and cons of using a stateful RNN versus a stateless RNN?*

*2) Why do people use Encoder-Decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?*

*3) How can you deal with variable-length input sequences? What about variable-length output sequences?*

*4) What is beam search and why would you use it? What tool can you use to implement it?*

*5) What is an attention mechanism? How does it help?*

*6) What is the most important layer in the Transformer architecture? What is its purpose?*

*7) When would you need to use sampled softmax?*

1) Stateful RNNs can capture longer-term patterns; however, they are much harder to implement.

2) A sequence-to-sequence RNN would start translating a sequence immediately after reading the first word, while an Encoder-Decoder RNN will read the whole sentence before translating it.

3) Variable-length input sequences can be handled by padding the shorter sequences to equalize the lengths of all the sequences in a batch. For variable-length output sequences, train the model to output an end-of-sequence token at the end of each sequence.

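4) Beam search is a technique used to improve the performance of a trained Encoder-Decoder model: it keeps track of the *k* most promising output sentences and, at each decoder step, tries to extend them by one word, keeping only the *k* most likely. It can be implemented using TensorFlow Addons.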

5) An attention mechanism is a technique initially used in Encoder-Decoder models to give the decoder more direct access to the input sequence, allowing it to process longer input sequences. In addition, it makes the model easier to debug and serves as the core of the Transformer architecture (in the Multi-Head Attention layers).

6) The most important layer in the Transformer architecture is the Multi-Head Attention layer, which allows the model to identify which words are most aligned with each other, and then improve each word's representation using these contextual clues.

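7) Sampled softmax is used when training a classification model with many classes (e.g., thousands), as it speeds up training considerably by approximating the loss from the logits of the correct class and a random sample of incorrect classes. It cannot be used during inference because it requires knowing the target.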