<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/ml/blob/main/mod6/lan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/ml/blob/main/mod6/lan.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

Natural Language Processing (NLP) with RNNs and Attention
---
_homl3 ch16_

NLP models in ascending order of capability
- `Character RNN` (char-RNN) predicts the next character in a sentence
  - able to generate some original text
- `Stateless RNN` learns on `random portions` of text at each iteration
  - without any information on the rest of the text
- `Stateful RNN` preserves the hidden state between training iterations and continues reading where it left off
  - able to learn longer patterns
- `Sentiment RNN` extracts reviewers' sentiment about movies from their reviews
- `Neural machine translation (NMT)` translates English to Spanish
  - based on an encoder–decoder architecture
- NMT boosted with `attention mechanism`
  - learns to select `the part of the inputs` that the rest of the model should `focus` on at each time step
- `Transformer` is a very successful attention-only architecture
  - used by GPT and Gemini

In [1]:
# Colab: Go to Runtime > Change runtime type and select a GPU hardware accelerator
# Kaggle: Go to Settings > Accelerator and select GPU
# ⚠️ It may take more than one day to run the whole notebook without GPU
import sys, os, math, copy
from pathlib import Path

if "google.colab" in sys.modules:
    %pip install -q -U transformers
    %pip install -q -U datasets

from functools import partial
import numpy as np, pandas as pd, matplotlib.pyplot as plt, matplotlib as mpl
import sklearn as skl, sklearn.datasets as skds
import tensorflow as tf, tensorflow_datasets as tfds


💡 Demo of char-RNN
---
Open the [link](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) and explore examples generated by a `character RNN` from [Andrej Karpathy's char-rnn project](https://github.com/karpathy/char-rnn):
- Shakespeare
- Linux source code
- Baby names

Next, let's build a char-RNN step by step.

In [2]:
# 1. Creating the Training Dataset
# 1) download the Shakespeare data from Andrej Karpathy's
#     [char-rnn project](https://github.com/karpathy/char-rnn/)

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://homl.info/shakespeare


In [3]:
# 2) print the first few lines of Shakespeare
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [4]:
# 3) Shakespeare has only 39 distinct characters (after converting to lower case)
shakespeare_charset = "".join(sorted(set(shakespeare_text.lower())))
len(shakespeare_charset), shakespeare_charset

(39, "\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz")

In [5]:
# 4) encode Shakespeare text with character-level encoding
#     rather than the default word-level encoding
#     and convert the text to lowercase

text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text]) # create the vocabulary, here the set of characters
encoded = text_vec_layer([shakespeare_text])[0] # encode Shakespeare

In [6]:
text_vec_layer.get_vocabulary()

['',
 '[UNK]',
 ' ',
 'e',
 't',
 'o',
 'a',
 'i',
 'h',
 's',
 'r',
 'n',
 '\n',
 'l',
 'd',
 'u',
 'm',
 'y',
 'w',
 ',',
 'c',
 'f',
 'g',
 'b',
 'p',
 ':',
 'k',
 'v',
 '.',
 "'",
 ';',
 '?',
 '!',
 '-',
 'j',
 'q',
 'x',
 'z',
 '3',
 '&',
 '$']

In [7]:
sorted(set(encoded.numpy()))

[2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40]

In [8]:
# Each character is now mapped to an integer, starting at 2
# The `TextVectorization layer` reserved
#   the value 0 for padding tokens,
#   1 for unknown characters.
(shakespeare_text[:10]).lower(), encoded[:10]

('first citi',
 <tf.Tensor: shape=(10,), dtype=int64, numpy=array([21,  7, 10,  9,  4,  2, 20,  7,  4,  7])>)

In [9]:
# drop tokens 0 (pad) and 1 (unknown), which we will not use,
# by shifting all character IDs down by 2
encoded -= 2
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [10]:
(shakespeare_text[:10]).lower(), encoded[:10]

('first citi',
 <tf.Tensor: shape=(10,), dtype=int64, numpy=array([19,  5,  8,  7,  2,  0, 18,  5,  2,  5])>)

In [11]:
# 5)  a small utility function used to
#     (p1) turn this long sequence `encoded` into a dataset of windows
#     for training a sequence-to-sequence RNN
#    The targets are the same as the inputs,
#     but shifted one character into the future
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)
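
The windowing in `to_dataset` can be sketched in plain Python to make the input/target shift visible. This is only an illustration: the helper `make_windows` is a hypothetical name, and it leaves out the shuffling, batching and prefetching that the `tf.data` pipeline performs.

```python
def make_windows(sequence, length):
    """Return (inputs, targets) pairs built from overlapping windows."""
    pairs = []
    # shift=1 with drop_remainder=True: only full windows of length + 1 are kept
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        # inputs drop the last item, targets drop the first item
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = make_windows(list("to be"), length=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target sequence is its input sequence moved one character ahead, so the model is asked to predict the next character at every time step.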

In [12]:
# 6) split the dataset into training set, validation set and test set
#  the RNN will not be able to learn any pattern longer than `length`,
#   so don’t make it too small
length = 100
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

In [13]:
# 2. Building and Training the Char-RNN Model
# 1) build the Char-RNN Model with a GRU layer with 128 units
model = tf.keras.Sequential([
  # number of input dimensions is the number of distinct character IDs,
  # and the number of output dimensions is a tunable hyperparameter
  tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
  # the inputs of the Embedding layer:
  #     2D tensors of shape [batch size, window length],
  # the output of the Embedding layer:
  #     a 3D tensor of shape [batch size, window length, embedding size]

  tf.keras.layers.GRU(128, return_sequences=True),

  # the output layer must have 39 units (n_tokens)
  #   because there are 39 distinct characters in the text,
  #   and we want to output a probability for each possible character at each time step.
  #   The 39 output probabilities should sum up to 1 at each time step
  tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])


In [14]:
# 2) train the Char-RNN Model
# set up a callback that keeps the model with the best validation accuracy
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)

In [None]:
# ⚠️ Without a GPU, it may take over 24 hours.
# skip the next two code cells
if "google.colab" in sys.modules:
  physical_devices = tf.config.list_physical_devices('GPU')
  if len(physical_devices) == 0:
    print("no gpu")
  else:
    print('with gpu')
    history = model.fit(train_set, validation_data=valid_set, epochs=10,
                        callbacks=[model_ckpt])

In [15]:
# 3) wrap text preprocessing and the model together
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

In [16]:
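# download a pretrained model
# note: this cell only downloads and extracts the archive; to use it instead of
#   the model built above, load it, e.g. with tf.keras.models.load_model(model_path)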
url = "https://github.com/ageron/data/raw/main/shakespeare_model.tgz"
path = tf.keras.utils.get_file("shakespeare_model.tgz", url, extract=True)
model_path = Path(path).with_name("shakespeare_model")

Downloading data from https://github.com/ageron/data/raw/main/shakespeare_model.tgz


In [17]:
# 4) make a prediction
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'u'

In [18]:
# 3. Generate fake Shakespearean text using `greedy decoding`
#   a) feed the char-RNN model some text
#       make it predict the most likely next letter ℓₙ
#   b) add ℓₙ to the end of the text, then feed the extended text
#       to the model to guess the next letter, and so on
# shortcoming: this often leads to the same words being repeated over and over again
# solution:  sample the next character randomly with a probability
#             equal to the estimated probability

# 1) demo: draw samples randomly based on logits distribution
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 2, 0, 1, 0, 1, 2, 1]])>

In [19]:
#  2) helper function `next_char` picks the next character
#     by sampling with a `temperature`
# A temperature close to 0 favors high-probability characters;
# a very high temperature gives all characters nearly equal probability
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]
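
The effect of the temperature can be checked without a model. The sketch below assumes that sampling from logits `log(p) / temperature` amounts to sampling from `softmax(log(p) / temperature)`; the function name `apply_temperature` is made up for this illustration.

```python
import math

def apply_temperature(probas, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    scaled = [math.log(p) / temperature for p in probas]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]
cold = apply_temperature(probas, 0.1)    # sharpens the distribution
same = apply_temperature(probas, 1.0)    # leaves it unchanged
hot = apply_temperature(probas, 100.0)   # flattens it toward uniform
print(cold, same, hot, sep="\n")
```

With a trained model, a low temperature therefore yields conservative text and a high temperature yields nearly random characters.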

In [20]:
# 3) helper function `extend_text` repeatedly calls next_char()
#     to get the next character and append it to the given text
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [21]:
# 4) generate some text with different temperatures
# low temperature like a person with a rigidified mind
print(extend_text("\nTo be or not to be", temperature=0.01))


To be or not to beo!gja:fdlcqivr?ifue?gpm!&.w-s:
f3'mwzm;u!
ev;eqom:


In [22]:
# middle temperature like a normal person
print(extend_text("\nTo be or not to be", temperature=1))


To be or not to beg$akd-,goh3c?kabop? nzs!b! sb:z;qz?n?; p?t
xqra
au


In [23]:
# high temperature like a man with a fever
print(extend_text("\nTo be or not to be", temperature=100))


To be or not to bes!$?d'??qv.$ ,3zdptmcf:ud;-mq
egs-$jxiisiyspncgb'q


Improvement
---
- other character sampling techniques
  - `top-k sampling` samples only from the top k characters
  - `nucleus sampling` samples only from the smallest set of top characters whose total probability exceeds some threshold
  - `beam search`
- further model tuning
  - use more GRU layers and more neurons per layer
  - make the window larger
  - train for longer and add some regularization if needed, etc.
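
Restricting sampling to the most probable characters can be sketched as a filter applied to the predicted probabilities before drawing a sample. The helper names and the example distribution below are assumptions for illustration, not part of the notebook's model code.

```python
import random

def top_k_filter(probas, k):
    """Keep the k most probable entries and renormalize."""
    order = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    keep = set(order[:k])
    total = sum(probas[i] for i in keep)
    return [probas[i] / total if i in keep else 0.0 for i in range(len(probas))]

def nucleus_filter(probas, threshold):
    """Keep the smallest set of top entries whose total probability exceeds the threshold."""
    order = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probas[i]
        if cumulative > threshold:
            break
    total = sum(probas[i] for i in keep)
    return [probas[i] / total if i in keep else 0.0 for i in range(len(probas))]

probas = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(probas, k=2)
nucleus = nucleus_filter(probas, threshold=0.9)

# draw one index from the filtered distribution
sampled_id = random.choices(range(len(probas)), weights=nucleus)[0]
print(top2, nucleus, sampled_id)
```

Both filters cut off the long tail of unlikely characters, which is where most of the nonsense in sampled text comes from.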

Stateful RNN
---
- for `stateless RNNs`, at each training iteration the model
  - starts with a hidden state full of zeros
  - then it updates this state at each time step,
  - and after the last time step, it throws it away as it is not needed anymore
- for `stateful RNNs`, the model
  - preserves this final state after processing a training batch
  - and uses it as the initial state for the next training batch
  - ∴ it can learn long-term patterns despite only backpropagating through short sequences
- A stateful RNN only makes sense if `each input sequence in a batch` starts exactly where the `corresponding sequence in the previous batch left off`
  - ∴ (p2) its input sequences must be `sequential and nonoverlapping`
  - rather than the `shuffled and overlapping` sequences used to train stateless RNNs
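
The difference between the two modes can be seen with a toy recurrent cell that has a single scalar state. This is a minimal sketch under assumed weights (`w_x`, `w_h`) and a made-up input sequence; it is not the GRU used in this notebook.

```python
import math

def run_rnn(inputs, state=0.0, w_x=0.5, w_h=0.9):
    """Run a one-unit recurrent cell over the inputs and return the final state."""
    for x in inputs:
        state = math.tanh(w_x * x + w_h * state)
    return state

sequence = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
chunks = [sequence[:3], sequence[3:]]  # sequential, nonoverlapping chunks

# stateless: the second chunk starts from a zero state, so the 1.0 is forgotten
stateless_final = run_rnn(chunks[1])

# stateful: the final state of each chunk becomes the initial state of the next
stateful_final = 0.0
for chunk in chunks:
    stateful_final = run_rnn(chunk, stateful_final)

full_final = run_rnn(sequence)  # reference: process the whole sequence at once
print(stateless_final, stateful_final, full_final)
```

Carrying the state across chunks gives the same final state as reading the whole sequence in one pass, which only works because the chunks are consecutive.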


In [24]:
# 1. prepare a dataset for a stateful RNN
def to_dataset_for_stateful_rnn(sequence, length):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    ds = ds.flat_map(lambda window: window.batch(length + 1)).batch(1)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

stateful_train_set = to_dataset_for_stateful_rnn(encoded[:1_000_000], length)
stateful_valid_set = to_dataset_for_stateful_rnn(encoded[1_000_000:1_060_000],
                                                 length)
stateful_test_set = to_dataset_for_stateful_rnn(encoded[1_060_000:], length)

In [25]:
# demo of `to_dataset_for_stateful_rnn`
list(to_dataset_for_stateful_rnn(tf.range(10), 3))

[(<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[0, 1, 2]], dtype=int32)>,
  <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[1, 2, 3]], dtype=int32)>),
 (<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[3, 4, 5]], dtype=int32)>,
  <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[4, 5, 6]], dtype=int32)>),
 (<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[6, 7, 8]], dtype=int32)>,
  <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[7, 8, 9]], dtype=int32)>)]

In [26]:

# 2) prepare dataset for more than one window per batch
#  use `to_batched_dataset_for_stateful_rnn()` function instead of
# `to_dataset_for_stateful_rnn()`
def to_non_overlapping_windows(sequence, length):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    return ds.flat_map(lambda window: window.batch(length + 1))

def to_batched_dataset_for_stateful_rnn(sequence, length, batch_size=32):
    parts = np.array_split(sequence, batch_size)
    datasets = tuple(to_non_overlapping_windows(part, length) for part in parts)
    ds = tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows))
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

list(to_batched_dataset_for_stateful_rnn(tf.range(20), length=3, batch_size=2))

[(<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 0,  1,  2],
         [10, 11, 12]], dtype=int32)>,
  <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 1,  2,  3],
         [11, 12, 13]], dtype=int32)>),
 (<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 3,  4,  5],
         [13, 14, 15]], dtype=int32)>,
  <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 4,  5,  6],
         [14, 15, 16]], dtype=int32)>),
 (<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 6,  7,  8],
         [16, 17, 18]], dtype=int32)>,
  <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 7,  8,  9],
         [17, 18, 19]], dtype=int32)>)]

In [27]:
# 3. create the stateful RNN
# [stateful=True](https://keras.io/2.15/api/layers/recurrent_layers/rnn/)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16,
                              batch_input_shape=[1, None]),
    tf.keras.layers.GRU(128, return_sequences=True, stateful=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

In [28]:
# At the end of each epoch, we need to reset the states before
# we go back to the beginning of the text
class ResetStatesCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs):
        self.model.reset_states()

In [29]:
# 4. compile and train the model
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(stateful_train_set, validation_data=stateful_valid_set,
                    epochs=10, callbacks=[ResetStatesCallback(), model_ckpt])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [30]:
# 5. A stateless copy is needed to use the stateful model with different batch sizes
stateless_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

In [31]:
# 6. build the model and set its weights
stateless_model.build(tf.TensorShape([None, None]))
stateless_model.set_weights(model.get_weights())


In [32]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    stateless_model
])

In [33]:
print(extend_text("to be or not to be", temperature=0.01))

to be or not to be
and the grave and be a senseless signior of his b


In [34]:
# clean temporary files
!rm -rf ./my_shakespeare_model


# Sentiment Analysis
- one type of text classification based on `word-level` models
  - instead of `character-level` models like char-RNN
- predicts reviewers' feelings about a movie such as negative (0) or positive (1)
  - based on their review texts on this movie
- needs to handle sequences of variable lengths using masking

In [35]:
# 1. load and split the IMDb dataset
# which consists of 50,000 movie reviews in English
# (25,000 for training, 25,000 for testing) extracted from
# the famous Internet Movie Database, along with a simple binary target
# for each review indicating whether it is negative (0) or positive (1)

# 1) use 90% of the training set for training and the remaining 10% for validation:

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True
)
tf.random.set_seed(42)
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [36]:
# 2) show a few reviews
# a) some reviews are easy to classify since they contain strongly polarized words such as
#    `terrible`, `wonderful`, etc.
# b) some reviews are challenging since they change direction, such as
#     starting positively then turning negative, etc.
#
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8")[:200], "...")
    print("Label:", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun ...
Label: 0
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful perf ...
Label: 1


Determine words with subword tokenization
---
- Keras layer `TextVectorization` can be used to identify word boundaries by spaces
  - it may not work well in some cases, such as
    - Chinese, Japanese and Korean, which do not use spaces between words
    - Vietnamese, which uses spaces even within words, as do some English names: San Francisco
    - German, which often attaches multiple words together without spaces, as do some English strings: ILoveDeepLearning
  - solutions by `subword tokenization`
    - [Byte pair encoding (BPE)](https://homl.info/wordpiece) splits the whole training set into individual characters (including spaces)
      - then repeatedly merges the most frequent adjacent pairs until the vocabulary reaches the desired size
    - [Subword regularization](https://github.com/google/sentencepiece) improves accuracy and robustness by introducing some randomness in tokenization during training
      - ex. "New England", "New"+"England", or "New"+"Eng"+"land"
    - The [Tokenizers library by Hugging Face](https://huggingface.co/docs/tokenizers/index) implements a wide range of extremely fast tokenizers
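
The merge loop of byte pair encoding can be sketched on a tiny word list. This toy version is an assumption-laden illustration: it works on pre-split words rather than raw text with spaces, uses no end-of-word marker, and the helper name `learn_bpe` and the word counts are made up.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version)."""
    # each word starts as a tuple of single characters, with its frequency
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # count every adjacent pair of symbols, weighted by word frequency
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # rewrite the corpus with the best pair merged into one symbol
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges, corpus

words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, corpus = learn_bpe(words, num_merges=3)
print(merges)
```

In practice the loop runs until the vocabulary reaches the desired size, and a library tokenizer would be used rather than this sketch.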

In [37]:
# 3. Tokenize IMDb reviews
# a) limit the vocabulary to 1,000 tokens
#     including the most frequent 998 words, a padding token and a token for unknown words
#     since it’s unlikely that very rare words will be important for this task
# b) this limit reduces the number of parameters the model needs to learn
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

In [38]:
# 4. create and train a model
# the model will probably not learn anything because we didn't mask the padding tokens
embed_size = 128
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=2)

Epoch 1/2
Epoch 2/2


- The accuracy of the previous model remains close to 50%
  - i.e. no better than random guessing
- Reasons
  - The reviews have `different lengths`, but the TextVectorization layer
    - converts them to sequences of token IDs
    - then pads the shorter sequences using the padding token (with ID 0) to make them as long as `the longest sequence in the batch`
  - As a result, most sequences end with `many padding tokens`—often dozens or even hundreds of them
    - a GRU layer with only short-term memory forgets what the review was about after it goes through many padding tokens
- Solutions
  - ❶ feed the model with batches of `equal-length` sentences
    - which also speeds up training
  - ❷ make the RNN ignore the padding tokens using `masking`

Masking
---
- enabled by simply adding `mask_zero=True` when creating the Embedding layer
  - then the padding tokens (whose ID is 0) will be ignored by all downstream layers
    - when a recurrent layer encounters a masked time step
      - it simply copies the output from the previous time step
- supported by many Keras layers such as
  - SimpleRNN, GRU, LSTM, Bidirectional, Dense, TimeDistributed, Add, etc.
- If a layer’s `supports_masking=True` then the mask is automatically propagated to the next layer
  - It keeps propagating this way for as long as the layers have `supports_masking=True`
- A recurrent layer’s `supports_masking` attribute is `True` when `return_sequences=True`
  - but it’s `False` when `return_sequences=False` so it will not propagate the mask any further
- If the mask propagates all the way to the output then it gets applied to the losses as well
  - so the masked time steps will not contribute to the loss (their loss will be 0)
  - This assumes that the model outputs sequences
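
What a mask looks like, and how it keeps padded steps out of a loss, can be sketched in plain Python. The token IDs are the two example sentences from this notebook; the per-step loss values are made up, and averaging over the unmasked steps is an assumed reduction for illustration rather than a statement of how Keras normalizes the loss.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence with pad_id up to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

def make_mask(padded, pad_id=0):
    """True for real tokens, False for padding tokens."""
    return [[token != pad_id for token in row] for row in padded]

def masked_mean(values, mask):
    """Average the values over the unmasked positions only."""
    kept = [v for row_v, row_m in zip(values, mask)
            for v, keep in zip(row_v, row_m) if keep]
    return sum(kept) / len(kept)

padded = pad_batch([[86, 18], [11, 7, 1, 116, 217]])
mask = make_mask(padded)

# made-up per-step losses; the large values sit on padded positions
step_losses = [[0.2, 0.4, 9.0, 9.0, 9.0],
               [0.1, 0.3, 0.5, 0.2, 0.4]]
loss = masked_mean(step_losses, mask)
print(padded, mask, loss, sep="\n")
```

The large losses on the padded positions have no effect on the result, which is the behavior masking is meant to provide.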

In [39]:
# 5. Enable masking
# 1) by turning `mask_zero=True` in the Embedding layer
embed_size = 128
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [40]:
# 2) or by manual masking
# explicitly compute the mask and pass it to the appropriate layers,
# using either the functional API or the subclassing API
inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
token_ids = text_vec_layer(inputs)
mask = tf.math.not_equal(token_ids, 0)
Z = tf.keras.layers.Embedding(vocab_size, embed_size)(token_ids)

#  add a bit of dropout since the previous model was overfitting slightly
Z = tf.keras.layers.GRU(128, dropout=0.2)(Z, mask=mask)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(Z)
model = tf.keras.Model(inputs=[inputs], outputs=[outputs])

In [41]:
# compiles and trains the model as usual
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [42]:
# 3) One last approach to masking
# feed the model with ragged tensors by setting `ragged=True`
# when creating the TextVectorization layer,
# so that the input sequences are represented as ragged tensors

text_vec_layer_ragged = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, ragged=True)
text_vec_layer_ragged.adapt(train_set.map(lambda reviews, labels: reviews))
text_vec_layer_ragged(["Great movie!", "This is DiCaprio's best role."])

<tf.RaggedTensor [[86, 18], [11, 7, 1, 116, 217]]>

In [43]:
# Compare this ragged tensor representation
# with the regular tensor representation, which uses padding token
text_vec_layer(["Great movie!", "This is DiCaprio's best role."])

<tf.Tensor: shape=(2, 5), dtype=int64, numpy=
array([[ 86,  18,   0,   0,   0],
       [ 11,   7,   1, 116, 217]])>

In [44]:
# compile and train the model with ragged tensors

embed_size = 128
model = tf.keras.Sequential([
    text_vec_layer_ragged,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Reusing Pretrained Embeddings and Language Models
---
- many words have roughly the same meaning in any context
  - for example, `awesome` and `amazing` are positive in almost every context
  - so pretrained embeddings are widely reused, such as
    - Google's [Word2vec embeddings](https://homl.info/word2vec)
    - Stanford's [GloVe embeddings](https://homl.info/glove)
    - Facebook's [FastText embeddings](https://fasttext.cc/)
- however, there are also many words whose meanings depend on their contexts
  - such as `right` in `left and right` and `right and wrong`
    - while it has just a single representation in word embeddings
  - addressed by [Embeddings from Language Models (ELMo)](https://homl.info/elmo)
    - ELMo are contextualized word embeddings learned from the internal states of a deep bidirectional language model
    - they allow reusing part of a pretrained language model
      - instead of just using pretrained embeddings in your model
- [Universal Language Model Fine-Tuning (ULMFiT)](https://homl.info/ulmfit) demonstrated the effectiveness of unsupervised pretraining for NLP tasks
  - a pretrained model that can be reused this way is the [Universal Sentence Encoder](https://homl.info/139), which is based on the transformer architecture

In [45]:
# 1. Download the `Universal Sentence Encoder` model from TensorFlow Hub
# This model is quite large, close to 1 GB in size
#   so it may take a while to download.

import tensorflow_hub as hub

# By default, TensorFlow Hub models are saved to a temporary directory,
# and they get downloaded again and again every time you run your program.
# ⚠️ this code cell will crash Colab due to Colab's limit on memory
if "google.colab" in sys.modules:
    pass
else:
    os.environ["TFHUB_CACHE_DIR"] = "my_tfhub_cache"

if "google.colab" not in sys.modules:
    model = tf.keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                    trainable=True, dtype=tf.string, input_shape=[]),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ])
    model.compile(loss="binary_crossentropy", optimizer="nadam",
                metrics=["accuracy"])
    model.fit(train_set, validation_data=valid_set, epochs=10)

# An Encoder–Decoder Network for Neural Machine Translation
- (p3) A simple [NMT model](https://homl.info/103) that translates English sentences to Spanish
  - English sentences are fed as inputs to the encoder
  - The decoder outputs the Spanish translations
    - also uses the Spanish translations as inputs during training but `shifted back` by one step
      - i.e. the word output at the `previous step`
      - This is called `teacher forcing`
        - a technique that significantly speeds up training and improves the model’s performance
  - For the very first word, the decoder is given the `start-of-sequence (SOS)` token
  - The decoder is expected to end the sentence with an `end-of-sequence (EOS)` token
  - Each word is initially represented by its ID
  - Next, an Embedding layer returns the word embedding
    - These word embeddings are then fed to the encoder and the decoder
  - At each step the decoder outputs a score for each word in the output vocabulary (i.e., Spanish)
    - then the `softmax` activation function turns these scores into probabilities
- At inference time after training
  - (p4) you will not have the target sentence to feed to the decoder
  - Instead, you need to feed it the word that it has just output at the previous step
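
The difference between training with teacher forcing and decoding at inference time can be sketched without a neural network. In the sketch below, `next_word` is a hypothetical stand-in for the trained decoder that simply replays a fixed sentence, and the helper names are made up; the SOS and EOS strings follow the tokens used in this notebook.

```python
SOS, EOS = "startofseq", "endofseq"

def make_decoder_pair(target_words):
    """Teacher forcing: decoder inputs are the targets shifted back by one step."""
    decoder_inputs = [SOS] + target_words
    decoder_targets = target_words + [EOS]
    return decoder_inputs, decoder_targets

def greedy_decode(next_word, max_length=50):
    """Inference: feed the decoder the words it has output so far."""
    words = [SOS]
    while len(words) - 1 < max_length:
        word = next_word(words)
        if word == EOS:
            break
        words.append(word)
    return words[1:]  # drop the SOS token

dec_inputs, dec_targets = make_decoder_pair("me gusta el cine".split())

# stand-in for a trained decoder: returns the next word of a fixed sentence
canned = dec_targets
translation = greedy_decode(lambda words: canned[len(words) - 1])
print(dec_inputs, dec_targets, translation, sep="\n")
```

During training the decoder always sees the correct previous word, whereas at inference time any mistake it makes is fed back in at the next step.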

In [46]:
# 1. Prepare the dataset
# 1) download a dataset of English/Spanish sentence pairs
# Each line contains an English sentence and the corresponding Spanish translation,
#   separated by a tab

url = "https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
path = tf.keras.utils.get_file("spa-eng.zip", origin=url, cache_dir="datasets",
                               extract=True)
text = (Path(path).with_name("spa-eng") / "spa.txt").read_text()

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


In [47]:
# 2) tidy the dataset
#  a) remove the Spanish characters “¡” and “¿”,
#     which the TextVectorization layer doesn’t handle
text = text.replace("¡", "").replace("¿", "")

#  b) parse the sentence pairs and shuffle them.
pairs = [line.split("\t") for line in text.splitlines()]
np.random.shuffle(pairs)

#  c) Finally split them into two separate lists, one per language
sentences_en, sentences_es = zip(*pairs)  # separates the pairs into 2 lists

In [48]:
# 3) peek at the first three sentence pairs:
for i in range(3):
    print(sentences_en[i], "=>", sentences_es[i])

What the king says is always absolute. => Lo que dice el rey va siempre a misa.
I have no intention of changing. => No tengo intención de cambiar.
I have to finish up some things before I go. => Tengo que acabar un par de cosas antes de irme.


In [49]:
# 2. create two TextVectorization layers
# one per language and adapt them to the text:

# a) using a small vocabulary size will speed up training, which suits this small dataset
# State-of-the-art translation models typically use
#   a much larger vocabulary (e.g., 30,000),
#   a much larger training set (gigabytes),
#   and a much larger model (hundreds or even thousands of megabytes)
#   ex:  the Opus-MT models by the University of Helsinki,
#        or the M2M-100 model by Facebook
vocab_size = 1000

# b) all sentences in the dataset have a maximum of 50 words
#   setting `output_sequence_length=50` causes
#     the input sequences to be padded automatically with zeros
#     until they are all 50 tokens long
#   any sentence longer than 50 tokens in the training set would be cropped to 50 tokens
max_length = 50
text_vec_layer_en = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length)
text_vec_layer_en.adapt(sentences_en)

# c) “startofseq” and “endofseq” are added to each sentence as SOS and EOS tokens
text_vec_layer_es.adapt([f"startofseq {s} endofseq" for s in sentences_es])

In [50]:
# d) inspect the first 10 tokens in the English vocabulary
text_vec_layer_en.get_vocabulary()[:10]

['', '[UNK]', 'the', 'i', 'to', 'you', 'tom', 'a', 'is', 'he']

In [51]:
# e) inspect the first 10 tokens in the Spanish vocabulary
text_vec_layer_es.get_vocabulary()[:10]

['', '[UNK]', 'startofseq', 'endofseq', 'de', 'que', 'a', 'no', 'tom', 'la']

In [52]:
# 3. split the dataset
# a) the first 100,000 sentence pairs for training,
#     and the rest for validation
X_train = tf.constant(sentences_en[:100_000])
X_valid = tf.constant(sentences_en[100_000:])

# b) The decoder’s inputs are the Spanish sentences plus an SOS token prefix.
#     The targets are the Spanish sentences plus an EOS suffix
X_train_dec = tf.constant([f"startofseq {s}" for s in sentences_es[:100_000]])
X_valid_dec = tf.constant([f"startofseq {s}" for s in sentences_es[100_000:]])
Y_train = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[:100_000]])
Y_valid = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[100_000:]])

In [53]:
# 4. build the model with `functional API`
#     since the model is not sequential
# a)  It requires two text inputs:
#       one for the encoder and one for the decoder
encoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
decoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)

In [54]:
# b) encode these sentences using the TextVectorization layers
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)

# c) followed by an Embedding layer for each language,
#     with `mask_zero=True` to ensure masking is handled automatically
#   `embed_size` is a tunable hyperparameter
embed_size = 128
encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=True)
decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=True)
encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

In [55]:
# e) create the encoder and pass it the embedded inputs
# a single LSTM layer is used here, but you could stack several of them.
#   `return_state=True` lets us get a reference to the layer’s final state
encoder = tf.keras.layers.LSTM(512, return_state=True)

# `*encoder_state` groups the LSTM layer's
#     short-term state and long-term state in a list
encoder_outputs, *encoder_state = encoder(encoder_embeddings)

In [56]:
# f) use this encoder_state as the initial state of the decoder:
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

In [57]:
# g) pass the decoder’s outputs through a Dense layer
#     with the softmax activation function
#     to get the word probabilities for each step
# ⚠️ for a large output vocabulary, outputting a probability
#     for each and every possible word can be quite slow.
# two ways to speed this up:
# i) sampled softmax technique looks only at the logits output by the model
#     for the correct word and for a random sample of incorrect words, then
#     compute an approximation of the loss based only on these logits
# ii) tie the weights of the output layer to
#     the transpose of the decoder’s embedding matrix

output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(decoder_outputs)

In [58]:
# h) Assemble, compile and train the model
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79bfc0155480>

In [59]:
# 5. use the model to translate
# It is NOT as simple as calling model.predict(),
#     because the decoder expects as input the word that
#     was predicted at the previous time step
# two ways:
# i) One way to do this is to write a custom memory cell that
#     keeps track of the previous output and
#     feeds it to the decoder at the next time step
# ii) just call the model multiple times
#     predicting one extra word at each round


# 1) A utility function
# it simply keeps predicting one word at a time,
# gradually completing the translation,
# and it stops once it reaches the EOS token

def translate(sentence_en):
    translation = ""
    for word_idx in range(max_length):
        X = np.array([sentence_en])  # encoder input
        X_dec = np.array(["startofseq " + translation])  # decoder input
        y_proba = model.predict((X, X_dec))[0, word_idx]  # last token's probas
        predicted_word_id = np.argmax(y_proba)
        predicted_word = text_vec_layer_es.get_vocabulary()[predicted_word_id]
        if predicted_word == "endofseq":
            break
        translation += " " + predicted_word
    return translation.strip()

In [60]:
translate("I like soccer")



'me gusta la nieve'

In [61]:
# However, the model struggles with longer sentences:
# The model can be improved a little bit by
#   increasing the training set size and
#   adding more LSTM layers in both the encoder and the decoder

translate("I like soccer and also going to the beach")



'me gusta la nieve para [UNK] [UNK] a la playa'

Bidirectional RNNs
---
- A `regular` recurrent layer generates outputs based on only `past and present inputs`
  - i.e. it cannot look into the future
  - This makes sense when forecasting `time series`
    - or in the `decoder` of a sequence-to-sequence (seq2seq) model
  - but it is often preferable to look ahead at the next words before encoding a given word
    - in `text classification` or in the `encoder` of a seq2seq model
- (p5) A `bidirectional` recurrent layer runs two recurrent layers on the same inputs
  - one reading the words from left to right
  - the other reading them from right to left
  - then combines their outputs at each time step
    - typically by concatenating them
  - ex. to properly encode the word “right” in the phrases
    - “the right arm”, “the right person”, and “the right to criticize”
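
The idea of running two recurrent layers in opposite directions and concatenating their outputs can be sketched with a scalar toy RNN. This is a minimal illustration, not how Keras implements `Bidirectional`: `run_rnn`, `run_bidirectional`, and all the weights are made up for the example.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Scalar toy RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns every h_t."""
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs

def run_bidirectional(inputs, fwd_weights, bwd_weights):
    forward = run_rnn(inputs, *fwd_weights)
    # The backward layer reads the reversed sequence; its outputs are flipped
    # back so that position t lines up with the forward output at position t.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    # "Concatenate" the two outputs at each time step
    return [(f, b) for f, b in zip(forward, backward)]

sequence = [1.0, 0.0, 0.0, 2.0]
outputs = run_bidirectional(sequence, fwd_weights=(0.5, 0.8), bwd_weights=(0.3, 0.9))
for t, (f, b) in enumerate(outputs):
    print(t, round(f, 3), round(b, 3))
```

In this sketch the forward output at step 0 depends only on the first input, while the backward output at step 0 has already seen the whole sequence, which is the look-ahead a bidirectional encoder provides.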

In [62]:
# 1.  implement a bidirectional recurrent layer in Keras
# 1) just wrap a regular recurrent layer in a `Bidirectional` layer
#
# i) The Bidirectional layer will create a clone of the LSTM layer
#   (but in the reverse direction), and it will run both and
#   concatenate their outputs. So although the LSTM layer has 256 units,
#   the Bidirectional layer will output 512 values

encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_state=True))

In [63]:
# 2) state concatenation by type
# the decoder’s LSTM layer expects just two states (short-term and long-term)
# so we need to concatenate the two short-term states,
#     and also concatenate the two long-term states
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)]  # long-term (1 & 3)

In [64]:
# 3) complete the model and train it
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(decoder_outputs)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79bf5411e980>

In [65]:
# 4) try a translation
translate("I like soccer")



'me gusta el fútbol'

Beam Search
---
- A technique that helps avoid problematic translations such as
  - “I like soccer” → “me gustan los jugadores” instead of “me gusta el fútbol”
  - if the training set contains many sentences like “I like cars” → “me gustan los autos”
    - the model learns to follow “me” with the plural form “gustan”, but “soccer” is singular
    - once it has output “gustan”, a greedy decoder cannot go back and fix the mistake
- keeps track of a short list of the `k most promising sentences`
  - and at each decoder step it tries to extend them by one word
    - computing the extended sentences' probabilities and again keeping only the k most likely sentences
  - The parameter k is called the `beam width`
- ex. (p6) suppose we use the model to translate the sentence “I like soccer” using beam search with a beam width of 3
  - At the first decoder step, the model will output an estimated probability for each possible first word in the translated sentence
    - Suppose the top three words are “me” (75% estimated probability), “a” (3%), and “como” (1%)
  - Next, we use the model to find the next word for each sentence
    - For the first sentence (“me”), perhaps the model outputs a probability of 36% for the word “gustan”, 32% for the word “gusta”, 16% for the word “encanta”, and so on
      - Note that these are actually `conditional probabilities` given that the sentence starts with “me”
    - For the second sentence (“a”), the model might output a conditional probability of 50% for the word “mi”, and so on.
    - Assuming the vocabulary has 1,000 words, we will end up with 1,000 probabilities per sentence.
  - Next, we compute the probabilities of each of the 3,000 two-word sentences we considered (3 × 1,000)
    - We do this by multiplying the estimated conditional probability of each word by the estimated probability of the sentence it completes
    - For example, the estimated probability of the sentence “me” was 75%, while the estimated conditional probability of the word “gustan” (given that the first word is “me”) was 36%, so the estimated probability of the sentence “me gustan” is 75% × 36% = 27%
    - After computing the probabilities of all 3,000 two-word sentences, we `keep only the top 3`
    - In this example they all start with the word “me”: “me gustan” (27%), “me gusta” (24%), and “me encanta” (12%). Right now, the sentence “me gustan” is winning, but “me gusta” has not been eliminated.
  - Then we repeat the same process:
    - we use the model to predict the next word in each of these three sentences
    - and we compute the probabilities of all 3,000 three-word sentences we considered.
    - Perhaps the top three are now “me gustan los” (10%), “me gusta el” (8%), and “me gusta mucho” (2%).
    - At the next step we may get “me gusta el fútbol” (6%), “me gusta mucho el” (1%), and “me gusta el deporte” (0.2%).
    - Notice that “me gustan” was eliminated, and the correct translation is now ahead.
  - We boosted our encoder–decoder model’s performance without any extra training, simply by  using it more wisely.
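
The first two steps of the worked example can be reproduced with a toy probability table. This is a minimal sketch: `COND_PROBA` and `beam_step` are hypothetical names, the table stands in for the model's output, and the entry for “como” is made up because the example above does not give it.

```python
import math

# Toy conditional probabilities P(next word | prefix), taken from the worked
# example where available (the entry for "como" is an assumption)
COND_PROBA = {
    (): {"me": 0.75, "a": 0.03, "como": 0.01},
    ("me",): {"gustan": 0.36, "gusta": 0.32, "encanta": 0.16},
    ("a",): {"mi": 0.50},
    ("como",): {"el": 0.20},
}

def beam_step(beams, beam_width):
    """Extend each (log_proba, words) beam by one word and keep the top k."""
    candidates = []
    for log_proba, words in beams:
        for word, proba in COND_PROBA.get(words, {}).items():
            # Adding log probabilities is equivalent to multiplying probabilities
            candidates.append((log_proba + math.log(proba), words + (word,)))
    return sorted(candidates, reverse=True)[:beam_width]

beams = [(0.0, ())]  # start with the empty sentence, probability 1
for _ in range(2):
    beams = beam_step(beams, beam_width=3)

for log_proba, words in beams:
    print(" ".join(words), f"{math.exp(log_proba):.0%}")
```

Working with sums of log probabilities rather than products of probabilities avoids numerical underflow on long sentences, and it is also what the notebook's `beam_search()` function does.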

In [66]:
# 1. a basic implementation of beam search

def beam_search(sentence_en, beam_width, verbose=False):
    X = np.array([sentence_en])  # encoder input
    X_dec = np.array(["startofseq"])  # decoder input
    y_proba = model.predict((X, X_dec))[0, 0]  # first token's probas
    top_k = tf.math.top_k(y_proba, k=beam_width)
    top_translations = [  # list of best (log_proba, translation)
        (np.log(word_proba), text_vec_layer_es.get_vocabulary()[word_id])
        for word_proba, word_id in zip(top_k.values, top_k.indices)
    ]

    # extra code – displays the top first words in verbose mode
    if verbose:
        print("Top first words:", top_translations)

    for idx in range(1, max_length):
        candidates = []
        for log_proba, translation in top_translations:
            if translation.endswith("endofseq"):
                candidates.append((log_proba, translation))
                continue  # translation is finished, so don't try to extend it
            X = np.array([sentence_en])  # encoder input
            X_dec = np.array(["startofseq " + translation])  # decoder input
            y_proba = model.predict((X, X_dec))[0, idx]  # last token's proba
            for word_id, word_proba in enumerate(y_proba):
                word = text_vec_layer_es.get_vocabulary()[word_id]
                candidates.append((log_proba + np.log(word_proba),
                                   f"{translation} {word}"))
        top_translations = sorted(candidates, reverse=True)[:beam_width]

        # extra code – displays the top translation so far in verbose mode
        if verbose:
            print("Top translations so far:", top_translations)

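        if all([tr.endswith("endofseq") for _, tr in top_translations]):
            return top_translations[0][1].replace("endofseq", "").strip()

    # max_length was reached before every beam finished:
    # return the best translation found so far instead of None
    return top_translations[0][1].replace("endofseq", "").strip()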

In [67]:
# 2. show the model making an error
sentence_en = "I love cats and dogs"
translate(sentence_en)



'me [UNK] los gatos y los perros'

# Attention Mechanisms
- (p7) the game-changing innovation that addressed the limited short-term memory of RNNs
  - by focusing its attention on important source words such as "soccer" for the destination words such as “fútbol”
  - using `higher weights` for words with `more attention` in a weighted sum of all the encoder outputs
  - in (p7.middle-left), the weighted sum fed to the decoder at time step 3 is ${\displaystyle ∑_{i=0}^2 α_{(3,i)} \mathbf{\hat{y}}_{(i)} }$
  - ${ α_{(t,i)} }$ is the weight of the ${ i^{th} }$ encoder output ${ \hat{y}_{(i)} }$ at the ${ t^{th} }$ decoder time step
  - these ${ α_{(t,i)} }$ weights are generated by a small neural network called an `alignment model (or an attention layer)` (p7.right)
    - which is trained jointly with the rest of the encoder–decoder model
  - the attention layer  
    - starts with a Dense layer composed of a single neuron that processes each of the encoder’s outputs
      - along with the decoder’s previous hidden state (e.g., ${\mathbf{h}_{(2)}}$)
    - outputs for each encoder output (e.g., ${e_{(3, 2)}}$) a `score (or energy)`
      - that measures how well `each output is aligned with the decoder’s previous hidden state`
        - the one that best aligns with the current state gets a high score
      - Finally, all the scores go through a `softmax` layer to get a final weight for each encoder output (e.g., ${ α_{(3,1)} }$)
        - All the weights for a given decoder time step add up to 1
  - Since it `concatenates` the encoder output with the decoder’s previous hidden state
    - it is sometimes called `concatenative attention (or additive attention)`
  - Another common attention mechanism is `multiplicative attention`
    - simply computes the `dot product` of one of the encoder’s outputs and the decoder’s previous hidden state
      - since the goal of the alignment model is to measure the `similarity` between these two vectors
      - requires that both vectors have the same dimensionality
    - uses the decoder’s hidden state at the current time step rather than at the previous time step (i.e. ${\mathbf{h}_{(t)}}$ rather than ${\mathbf{h}_{(t-1)}}$)
    - then uses the output of the attention mechanism (noted ${\mathbf{h̃}_{(t)}}$ ) directly to compute the decoder’s predictions
      - rather than using it to compute the decoder’s current hidden state
  - A variant of the dot product mechanism where the encoder outputs first go through a fully connected layer (without a bias term) before the dot products are computed
    - called `the “general” dot product approach`
  - Both dot product variants performed better than concatenative attention
  - These three attention mechanisms are summarized as
    - ${\displaystyle \mathbf{h̃}_{(t)} = ∑_{i}α_{(t,i)}\mathbf{y}_{(i)} }$
      - ${ α_{(t,i)}=\dfrac{\exp(e_{(t,i)})} {∑_{i'} \exp(e_{(t,i')})} }$
      - ${ e_{(t,i)}=\begin{cases} \mathbf{h}_{(t)}^⊺\mathbf{y}_{(i)} & dot\\ \mathbf{h}_{(t)}^⊺ \mathbf{W} \mathbf{y}_{(i)} & general \\ \mathbf{v}^⊺\tanh(\mathbf{W}[\mathbf{h}_{(t)};\mathbf{y}_{(i)}]) & concat \end{cases} }$
- revolutionized NMT and deep learning in general
  - allowing a significant improvement in the state of the art
  - especially for long sentences (e.g., over 30 words)
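
The three scoring functions (dot, general, concat) and the resulting weighted sum can be sketched in plain Python. This is an illustration under stated assumptions: every helper name, vector, and weight matrix below is made up, and a real model would learn `W` and `v` during training.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_dot(h, y):
    return dot(h, y)                       # h^T y

def score_general(h, y, W):
    return dot(h, matvec(W, y))            # h^T W y

def score_concat(h, y, W, v):
    # v^T tanh(W [h; y]), where h + y is list concatenation
    return dot(v, [math.tanh(z) for z in matvec(W, h + y)])

def attend(h, encoder_outputs, score_fn):
    """Return the attention weights and the weighted sum of encoder outputs."""
    weights = softmax([score_fn(h, y) for y in encoder_outputs])
    dim = len(encoder_outputs[0])
    context = [sum(w * y[j] for w, y in zip(weights, encoder_outputs))
               for j in range(dim)]
    return weights, context

# Made-up values for illustration only
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
h = [2.0, 0.0]                                   # decoder hidden state
W_general = [[1.0, 0.0], [0.0, 1.0]]             # identity, so general == dot
W_concat = [[0.5, -0.5, 1.0, 0.0],
            [0.0, 1.0, 0.0, 1.0],
            [1.0, 0.0, -1.0, 0.5]]
v = [1.0, 0.5, -0.5]

variants = [
    ("dot", score_dot),
    ("general", lambda h_, y: score_general(h_, y, W_general)),
    ("concat", lambda h_, y: score_concat(h_, y, W_concat, v)),
]
for name, fn in variants:
    weights, context = attend(h, encoder_outputs, fn)
    print(name, [round(w, 3) for w in weights])
```

Whatever the scoring function, the softmax guarantees that the weights for a given decoder step are positive and add up to 1.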


In [68]:
# 1. Using attention mechanism
# - tf.keras.layers.Attention layer for multiplicative attention
# - tf.keras.layers.AdditiveAttention layer for additive attention
# 1) `return_sequences=True` passes all the encoder’s outputs to the Attention layer
encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True))

In [69]:
# 2) create the attention layer and pass it the decoder’s states and the encoder’s outputs
#  let’s use the decoder’s outputs instead of its states:
#     in practice this works well too, and it’s much easier to code
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)]  # long-term (1 & 3)
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

#  Then we just pass the attention layer’s outputs directly to the output layer
attention_layer = tf.keras.layers.Attention()
attention_outputs = attention_layer([decoder_outputs, encoder_outputs])
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(attention_outputs)

In [70]:
# 3) build and train the NMT model with attention mechanism
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79bf5606ceb0>

In [71]:
# 4) it can handle much longer sentences
translate("I like soccer and also going to the beach")



'me gusta el fútbol y también va a la playa'

The [Transformer](https://homl.info/transformer) model
---
- significantly improved the state-of-the-art in NMT
  - without using any recurrent or convolutional layers,
  - just attention mechanisms (plus embedding layers, dense layers, normalization layers, and a few other bits and pieces)
- benefits
  - not recurrent
    - can be trained in fewer steps
    - easier to parallelize across multiple GPUs
  - does not suffer as much from the unstable gradients problem as RNNs
  - can better capture long-range patterns than RNNs
- The original 2017 transformer architecture is shown in (p8)
  - the left part is the `encoder`
    - it gradually transforms the source word representations until the representation perfectly captures the meaning of that word
  - the right part is the `decoder`
    - it gradually transforms each word representation in the translated sentence into a word representation of the next word in the translation
    - After going through the decoder, each word representation goes through a final Dense layer with a softmax activation function
      - which will hopefully output a high probability for the correct next word and a low probability for all other words
  - Each `embedding layer` outputs a 3D tensor of shape `[batch size, sequence length, embedding size]`
    - the tensors are gradually transformed as they flow through the transformer
      - but their shape remains the same
- The big picture of the transformer for NMT
  - during training you must feed the English sentences to the encoder
    - the corresponding Spanish translations to the decoder with an extra SOS token inserted at the start of each sentence
  - At inference time you must call the transformer multiple times to
    - produce the translations one word at a time
    - feed the partial translations to the decoder at each round
  - The encoder’s `multi-head attention layer` updates each word representation by attending to all other words in the same sentence
    - until each word’s representation perfectly captures the meaning of the word, in the  context of the sentence
  - The decoder’s `masked multi-head attention layer` does the same thing
    - but when it processes a word, it doesn’t attend to words located after it
    - i.e. it’s a `causal` layer
  - The decoder’s `upper multi-head attention layer` is where the decoder pays attention to the words in the English sentence
    - This is called `cross-attention`
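
The effect of the decoder's causal masking, combined with a padding mask, can be visualized with a few lines of plain Python. This is a sketch with hypothetical helper names and made-up token IDs, where 0 is assumed to be the padding ID as in the `TextVectorization` layers above.

```python
def causal_mask(seq_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

def padding_mask(token_ids, pad_id=0):
    """True for real tokens, False for padding."""
    return [tok != pad_id for tok in token_ids]

def combine(causal, padding):
    """A key is visible only if it is not in the future and is not padding."""
    return [[allowed and padding[k] for k, allowed in enumerate(row)]
            for row in causal]

token_ids = [2, 17, 45, 0, 0]  # made-up IDs; the last two positions are padding
mask = combine(causal_mask(len(token_ids)), padding_mask(token_ids))
for row in mask:
    print("".join("x" if m else "." for m in row))
```

Each printed row corresponds to one query position: an `x` marks a key position it may attend to, so the pattern is lower triangular with the padded columns blanked out.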

Positional encodings
---
- A dense vector that encodes the position of a word within a sentence
  - the ${ i^{th} }$ positional encoding is added to the word embedding of the ${ i^{th} }$ word in the sentence
    - one option is to use an Embedding layer that encodes all the positions from 0 to the maximum sequence length in the batch
    - its output is then added to the word embeddings, relying on the rules of broadcasting
    - these are `trainable positional encodings`
- There are `fixed positional encodings` such as the one below generating a positional encoding matrix ${ \mathbf{P} }$
  - (p9) based on the sine and cosine functions at different frequencies
    - ${\displaystyle P_{(p,i)} = \begin{cases} \sin(\dfrac{p}{10000^{i/d}}) & \text{if } i \text{ is even} \\ \cos(\dfrac{p}{10000^{(i-1)/d}}) & \text{if } i \text{ is odd} \end{cases}  }$
      - ${ P_{(p,i)} }$ is the ${ i^{th} }$ component of the encoding for the word located at the ${ p^{th} }$ position in the sentence
  - this encoding can give the same performance as trainable positional encodings
    - it can extend to arbitrarily long sentences without adding any parameters to the model
    - but trainable positional encodings are preferred if there is a large amount of pretraining data
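
The sine/cosine formula above can be checked numerically with a small pure-Python function. This is a minimal sketch in which `positional_encoding` is a hypothetical helper; it assumes an even embedding size.

```python
import math

def positional_encoding(max_length, embed_size):
    """Return P as a list of rows, with P[p][i] defined by the formula above."""
    assert embed_size % 2 == 0, "embed_size must be even"
    P = [[0.0] * embed_size for _ in range(max_length)]
    for p in range(max_length):
        for i in range(0, embed_size, 2):
            angle = p / 10_000 ** (i / embed_size)
            P[p][i] = math.sin(angle)       # even component i
            P[p][i + 1] = math.cos(angle)   # odd component i + 1 uses the same angle
    return P

P = positional_encoding(max_length=50, embed_size=4)
print([round(x, 4) for x in P[0]])
print([round(x, 4) for x in P[1]])
```

Position 0 always encodes to alternating zeros and ones, since sin(0) = 0 and cos(0) = 1, and lower-index components oscillate faster than higher-index ones.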

In [72]:
# 1. add trainable positional encodings to the encoder and decoder inputs
# Here we assume that the embeddings are represented as regular tensors, not ragged tensors
# The encoder and the decoder share the same Embedding layer for the positional encodings,
# since they usually have the same embedding size

max_length = 50  # max length in the whole training set
embed_size = 128

pos_embed_layer = tf.keras.layers.Embedding(max_length, embed_size)
batch_max_len_enc = tf.shape(encoder_embeddings)[1]
encoder_in = encoder_embeddings + pos_embed_layer(tf.range(batch_max_len_enc))
batch_max_len_dec = tf.shape(decoder_embeddings)[1]
decoder_in = decoder_embeddings + pos_embed_layer(tf.range(batch_max_len_dec))

In [73]:
# 2. fixed positional encodings based on sine/cosine
class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_length, embed_size, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        assert embed_size % 2 == 0, "embed_size must be even"
        # 1) precompute the positional encoding matrix
        p, i = np.meshgrid(np.arange(max_length),
                           2 * np.arange(embed_size // 2))
        pos_emb = np.empty((1, max_length, embed_size))
        pos_emb[0, :, ::2] = np.sin(p / 10_000 ** (i / embed_size)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10_000 ** (i / embed_size)).T
        self.pos_encodings = tf.constant(pos_emb.astype(self.dtype))
        # 2) enable propagation of the input's automatic mask to the next layer
        self.supports_masking = True

    def call(self, inputs):
        batch_max_length = tf.shape(inputs)[1]
        return inputs + self.pos_encodings[:, :batch_max_length]

In [74]:
# use the PositionalEncoding layer to add the positional encodings
#   to the encoder’s and the decoder’s inputs:
pos_embed_layer = PositionalEncoding(max_length, embed_size)
encoder_in = pos_embed_layer(encoder_embeddings)
decoder_in = pos_embed_layer(decoder_embeddings)

Multi-head attention
---
- based on the `scaled dot-product attention` layer
  - ${\displaystyle \operatorname{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \operatorname{softmax}\left( \dfrac{\mathbf{QK^⊺}} {\sqrt{d_{keys}}}  \right)\mathbf{V} }$
    - ${ \mathbf{Q} }$ is a matrix containing `one row per query`.
      - Its shape is ${[n_{queries}, d_{keys}]}$
    - ${ \mathbf{K} }$ is a matrix containing one row per key
      - Its shape is ${[n_{keys}, d_{keys}]}$
    - ${ \mathbf{V} }$ is a matrix containing one row per value
      - Its shape is ${[n_{keys}, d_{values}]}$
    - ${ \mathbf{QK^⊺} }$ contains one similarity score for each query/key pair
    - The scaling factor ${ 1/\sqrt{d_{keys}} }$ scales down the similarity scores to avoid saturating the softmax function
      - it can be turned into a learnable parameter by setting `use_scale=True` when creating a `tf.keras.layers.Attention` layer
    - It is possible to mask out some key/value pairs by adding a very large negative value to the corresponding similarity scores just before computing the softmax
      - This is useful in the masked multi-head attention layer
- has the architecture shown in (p10) which is just a bunch of scaled dot-product attention layers
  - each preceded by a linear transformation of the values, keys, and queries
    - a time-distributed dense layer with no activation function
    - this allows the model to apply many different projections of the word representation into different `subspaces`
      - each focusing on a `subset of the word’s characteristics` such as word type, tense, plural, etc.
  - All the outputs are simply concatenated
    - and they go through a final linear transformation
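
The scaled dot-product formula and the way several heads are combined can be sketched with plain Python lists. This is only an illustration: the helper names and all the matrices are made up, biases and dropout are omitted, and it is not how `tf.keras.layers.MultiHeadAttention` is implemented.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_keys = len(K[0])
    scores = [[s / math.sqrt(d_keys) for s in row]
              for row in matmul(Q, transpose(K))]
    if mask is not None:
        # Masked pairs get a very large negative score before the softmax
        scores = [[s if keep else s - 1e9 for s, keep in zip(row, mask_row)]
                  for row, mask_row in zip(scores, mask)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

def multi_head_attention(X, heads, W_out):
    """Self-attention: each head projects X, attends, then outputs are concatenated."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        out, _ = scaled_dot_product_attention(
            matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        head_outputs.append(out)
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_out)  # final linear transformation

# Made-up inputs: 3 positions, 2 features each
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal = [[k <= q for k in range(3)] for q in range(3)]
out, weights = scaled_dot_product_attention(X, X, X, mask=causal)

# Two heads, each projecting onto a different 1D subspace (an assumption)
heads = [([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
         ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])]
W_out = [[1.0, 0.0], [0.0, 1.0]]
mha_out = multi_head_attention(X, heads, W_out)
print([[round(w, 3) for w in row] for row in weights])
print([[round(x, 3) for x in row] for row in mha_out])
```

With the causal mask, the first position can attend only to itself, so its attention output equals its own value vector.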

In [75]:
# 1. Build a transformer
# Keras includes a tf.keras.layers.MultiHeadAttention layer
# 1) build the full encoder
#     use a stack of two blocks (N = 2)
#     since we don’t have a huge training set

N = 2  # instead of 6
num_heads = 8

# add a bit of dropout as well:
dropout_rate = 0.1
n_units = 128  # for the first Dense layer in each Feed Forward block
encoder_pad_mask = tf.math.not_equal(encoder_input_ids, 0)[:, tf.newaxis]
Z = encoder_in
for _ in range(N):
    skip = Z
    attn_layer = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate)
    Z = attn_layer(Z, value=Z, attention_mask=encoder_pad_mask)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))
    skip = Z
    Z = tf.keras.layers.Dense(n_units, activation="relu")(Z)
    Z = tf.keras.layers.Dense(embed_size)(Z)
    Z = tf.keras.layers.Dropout(dropout_rate)(Z)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))

In [76]:
# 2) build the decoder
decoder_pad_mask = tf.math.not_equal(decoder_input_ids, 0)[:, tf.newaxis]
causal_mask = tf.linalg.band_part(  # creates a lower triangular matrix
    tf.ones((batch_max_len_dec, batch_max_len_dec), tf.bool), -1, 0)

encoder_outputs = Z  # let's save the encoder's final outputs
Z = decoder_in  # the decoder starts with its own inputs
for _ in range(N):
    skip = Z
    attn_layer = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate)
    Z = attn_layer(Z, value=Z, attention_mask=causal_mask & decoder_pad_mask)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))
    skip = Z
    attn_layer = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate)
    Z = attn_layer(Z, value=encoder_outputs, attention_mask=encoder_pad_mask)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))
    skip = Z
    Z = tf.keras.layers.Dense(n_units, activation="relu")(Z)
    Z = tf.keras.layers.Dense(embed_size)(Z)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))

In [77]:
# 3) add the final output layer,
# create the model, compile it, and train it

Y_proba = tf.keras.layers.Dense(vocab_size, activation="softmax")(Z)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79bf13616500>

In [78]:
# 4) try a translation
translate("I like soccer and also going to the beach")



'me gusta el fútbol y también va a la playa'

# The burst of Transformer Models
- started from the `ImageNet moment for NLP` in 2018
  - with larger and larger transformer-based architectures trained on immense datasets
- the [GPT paper](https://homl.info/gpt) demonstrated the effectiveness of unsupervised pretraining using a transformer-like architecture composed of a stack of 12 transformer modules using only `masked multi-head attention layers`
- Google's [BERT (Bidirectional Encoder Representations from Transformers) paper](https://homl.info/bert) also demonstrated the effectiveness of self-supervised pretraining on a large corpus, using an architecture similar to GPT but with only `nonmasked multi-head attention layers`, like in the original transformer’s encoder. It proposed two pretraining tasks:
  - ① `Masked language model (MLM)`  is trained to predict the masked words
  - ② `Next sentence prediction (NSP)` is trained to predict whether two sentences are consecutive or not
    - not as important as was initially thought, so it was dropped in most later architectures
- (p11) The BERT model is trained on MLM and NSP simultaneously on a very large corpus of text
  - then fine-tuned on many different tasks, changing very little for each task
- In 2019, the [GPT-2 paper](https://homl.info/gpt2) proposed a very similar architecture to GPT but with over 1.5 billion parameters
  - this new and improved GPT model could perform `zero-shot learning (ZSL)`
    - i.e. it could achieve good performance on many tasks `without any fine-tuning`
- In 2021, Google's [Switch Transformers](https://homl.info/switch) used 1 trillion parameters
- An unfortunate consequence of this trend toward gigantic models is that they cost lots of money and energy
  - new ways were soon found to downsize transformers and make them more data-efficient
  - such as the [DistilBERT model](https://homl.info/distilbert), a small and fast transformer model based on BERT, available on Hugging Face’s model hub
    - it was trained using `distillation`:
      - transferring knowledge from a teacher model to a student one
      - which is usually much smaller than the teacher model
- Many more transformer architectures came out after BERT almost on a monthly basis
  - often improving on the state of the art across all NLP tasks
- On November 30, 2022, OpenAI released [ChatGPT (Chat Generative Pre-trained Transformer)](https://chat.openai.com/), initially built on GPT-3.5
- On December 6, 2023, Google released [Gemini](https://gemini.google.com/)
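
The MLM pretraining task described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not BERT's actual preprocessing code: the 15% selection rate and the 80/10/10 replacement split are the values reported in the BERT paper, while the helper name `mask_tokens`, the toy sentence, and the tiny vocabulary are hypothetical and chosen only for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one masked-language-model training example (hypothetical helper).

    Returns (inputs, labels): labels[i] is the original token at the
    positions the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)              # 10%: leave the token unchanged
        else:
            inputs.append(token)                  # not selected: no loss at this position
            labels.append(None)
    return inputs, labels

tokens = "she had fun at the birthday party".split()
vocab = sorted(set(tokens))
# mask_prob is raised to 0.5 here only so that a 7-word sentence shows some masks
inputs, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=42)
print(inputs)
print(labels)
```

Only the positions with a non-`None` label would contribute to the training loss; the other tokens serve as context.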

# Vision Transformers
- attention mechanisms were first applied to images for generating captions using [visual attention](https://homl.info/visualattention)
  - a CNN first processes the image and outputs some feature maps
  - then a decoder RNN equipped with an attention mechanism generates the caption
    - (p12) At each decoder time step (i.e., each word), the decoder uses the attention model to focus on just the right part of the image
- then hybrid CNN–transformer architectures were proposed for object detection
  - the CNN first processes the input images and outputs a set of feature maps
  - then these feature maps are converted to sequences and fed to a transformer
    - which outputs bounding box predictions
- then a fully transformer-based vision model called a [vision transformer (ViT)](https://homl.info/vit) was introduced in October 2020
  - it chops the image into little 16 × 16 squares,
  - and treats the sequence of squares as if it were a sequence of word representations
  - this model beat the state of the art on ImageNet image classification
    - but had to use over 300 million additional images for training
- Just two months later, [data-efficient image transformers (DeiTs)](https://homl.info/deit) achieved competitive results on ImageNet without requiring any additional data for training
  - but used a distillation technique to transfer knowledge from state-of-the-art CNN models to this model
- [the Perceiver architecture](https://homl.info/perceiver)
  - a multimodal transformer, meaning you can feed it text, images, audio, or virtually any other modality
- [DINO](https://homl.info/dino)
  - a vision transformer capable of high-accuracy semantic segmentation
  - trained entirely by self-supervision
- [CLIP](https://homl.info/clip)
  - proposed a large transformer model pretrained to match captions with images
  - followed by [DALL·E](https://homl.info/dalle), then [DALL·E 2](https://homl.info/dalle2)
- the [Flamingo paper](https://homl.info/flamingo) introduced a family of models pretrained on a wide variety of tasks across multiple modalities, including text, images, and videos
  - A single model can be used across very different tasks, such as question answering, image captioning, and more
- [GATO](https://homl.info/gato), a multimodal model that can be used as a policy for a reinforcement learning agent
  - The same transformer can chat with you, caption images, play Atari games, control (simulated) robotic arms, and more, all with “only” 1.2 billion parameters
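
The patching step that ViT applies before the transformer can be sketched with NumPy reshapes. This is a rough sketch, assuming a 224 × 224 RGB input (the notes above do not state the resolution) and a hypothetical helper named `image_to_patches`; a real ViT would additionally project each flattened patch linearly and add position embeddings, which is omitted here.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C): separate the patch grid from pixel offsets
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # (rows, cols, patch, patch, C): group the pixels of each patch together
    patches = patches.transpose(0, 2, 1, 3, 4)
    # one row per patch, in left-to-right, top-to-bottom order
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# assumed input size: 224 x 224 RGB, filled with dummy values
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image)
print(patches.shape)  # (196, 768)
```

Each of the 196 rows plays the role that a word representation plays in a text transformer.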

# Hugging Face’s Transformers Library
- There are many excellent pretrained models readily available for download via `TensorFlow Hub` or [Hugging Face’s model hub](https://huggingface.co/models)

In [79]:
# 1. Use a pretrained transformer
# The simplest way to use the Transformers library is
# to use the transformers.pipeline() function:
#     just specify which task you want, such as sentiment analysis,
#     and it downloads a default pretrained model ready to be used

# 1) sentiment analysis
from transformers import pipeline

# many other tasks are available on https://huggingface.co/tasks
classifier = pipeline("sentiment-analysis")
result = classifier("The actors were very convincing.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [80]:
result

[{'label': 'POSITIVE', 'score': 0.9998071789741516}]

In [81]:
# ⚠️ Models can be very biased. For example, it may like or dislike
# some countries depending on the data it was trained on and how it is used,
# so use it with care:
classifier(["I am from India.", "I am from Iraq."])

[{'label': 'POSITIVE', 'score': 0.9896161556243896},
 {'label': 'NEGATIVE', 'score': 0.9811071157455444}]

In [82]:
# 2) specify a model instead of using the defaults
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier_mnli = pipeline("text-classification", model=model_name)
classifier_mnli("She loves me. [SEP] She loves me not.")

[{'label': 'contradiction', 'score': 0.9790192246437073}]

In [83]:
# 3) load the same DistilBERT along with its corresponding tokenizer
# the Transformers library provides many classes, including all sorts of
# tokenizers, models, configurations, callbacks, and much more.

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

# then tokenize a couple of pairs of sentences
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                       "Joe lived for a very long time. [SEP] Joe is old."],
                      padding=True, return_tensors="tf")
token_ids

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[ 101, 1045, 2066, 4715, 1012,  102, 2057, 2035, 2293, 4715,  999,
         102,    0,    0,    0],
       [ 101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012,  102, 3533,
        2003, 2214, 1012,  102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
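
To make the effect of `padding=True` concrete, the sketch below rebuilds the `input_ids` and `attention_mask` arrays shown above using NumPy only. It assumes that 0 is the padding ID, as the output suggests, and the helper name `pad_batch` is hypothetical; the real tokenizer handles this internally and supports other padding strategies.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length ID lists and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # real tokens go on the left
        attention_mask[row, :len(seq)] = 1    # 1 = real token, 0 = padding
    return input_ids, attention_mask

# token IDs copied from the tokenizer output above, with the padding removed
first = [101, 1045, 2066, 4715, 1012, 102, 2057, 2035, 2293, 4715, 999, 102]
second = [101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012, 102,
          3533, 2003, 2214, 1012, 102]

input_ids, attention_mask = pad_batch([first, second])
print(input_ids.shape)  # (2, 15)
print(attention_mask)
```

The mask tells the attention layers to ignore the three padded positions of the shorter sentence pair.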

In [84]:
# 4)  pass this BatchEncoding object to the model;
# it returns a TFSequenceClassifierOutput object containing its predicted class logits
outputs = model(token_ids)
outputs

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-2.1123812 ,  1.1786786 ,  1.4101006 ],
       [-0.01478313,  1.0962478 , -0.9919961 ]], dtype=float32)>, hidden_states=None, attentions=None)

In [85]:
# 5) apply the softmax activation function to convert these logits to class probabilities
Y_probas = tf.keras.activations.softmax(outputs.logits)
Y_probas

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.01619703, 0.43523577, 0.54856724],
       [0.22655976, 0.6881726 , 0.0852677 ]], dtype=float32)>

In [86]:
Y_pred = tf.argmax(Y_probas, axis=1)
Y_pred  # 0 = contradiction, 1 = entailment, 2 = neutral

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([2, 1])>
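
The softmax and argmax steps can be reproduced without TensorFlow. The sketch below, which assumes only NumPy, applies a numerically stable softmax to the logits printed above and maps the predicted class IDs to names using the label order given in the notebook's comment (0 = contradiction, 1 = entailment, 2 = neutral). The `id2label` dictionary is an illustrative stand-in and is not read from the model's configuration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

# logits copied from the model output above
logits = np.array([[-2.1123812, 1.1786786, 1.4101006],
                   [-0.01478313, 1.0962478, -0.9919961]])

# label order taken from the notebook's comment (an assumption about this model)
id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}

probas = softmax(logits)
pred_ids = probas.argmax(axis=-1)
pred_labels = [id2label[int(i)] for i in pred_ids]
print(probas.round(4))
print(pred_labels)  # ['neutral', 'entailment']
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same predicted classes.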

In [87]:
# 6) fine-tune the model
sentences = [("Sky is blue", "Sky is red"), ("I love her", "She loves me")]
X_train = tokenizer(sentences, padding=True, return_tensors="tf").data
y_train = tf.constant([0, 2])  # contradiction, neutral
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=2)

Epoch 1/2


Epoch 2/2
