# Natural Language Processing

Can we build a machine that can master written and spoken language? This is the ultimate goal of
**Natural Language Processing** (NLP) research, but it’s a bit too broad, so in practice researchers focus on more specific tasks, such as **text classification**, **translation**, **summarization**, **question answering**, and many more. First of all, we need to consider how this type of data can be preprocessed.

## Embeddings

An embedding is a dense representation of some higher-dimensional data, such as a word in a vocabulary. If there are 50,000 possible categories, the one-hot encoding would produce a 50,000-dimensional sparse vector (containing mostly zeros). In contrast, an embedding would be a comparatively small dense vector, for example, with just 100 dimensions.

In deep learning, embeddings are usually initialized randomly, and they are then trained by gradient descent, along with the other model parameters. For example, the "NEAR BAY" category in the California housing dataset could be represented initially by a random vector such as [0.131, 0.890],
while the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791]. In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter we can tweak.

Since these embeddings are trainable, they will gradually improve during training; and as they represent fairly similar categories in this case, gradient descent will likely end up pushing them closer together, while it will tend to move them away from the "INLAND" category’s embedding. This is illustrated in the following figure, where the embeddings are shown as points in a 2D space:

<img src="./images/embedding.png" width="600">

This idea of using vectors to represent words was used in the famous [**Word2vec algorithm**](https://arxiv.org/abs/1310.4546). It’s not just about proximity, though: word embeddings were also organized along meaningful axes in the embedding space. Here is a famous example: if we compute "King" – "Man" + "Woman", then the result will be very close to the embedding of the word "Queen". In other words, the word embeddings encode the concept of gender! Similarly, you can compute "Madrid" – "Spain" + "France", and the result is close to "Paris", which seems to show that the notion of capital city was
also encoded in the embeddings:

<img src="./images/embedding-example.png" width="250">
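The analogy arithmetic described above can be sketched with a few hand-made vectors. This is only an illustration: the `toy_embeddings` values and the `analogy` helper are hypothetical and were chosen so that one axis roughly encodes royalty and the other gender. Real Word2vec embeddings are learned and have many more dimensions.

```python
import numpy as np

# Hypothetical 2D embeddings (not learned): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_embeddings = {
    "king": np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man": np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.0]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = toy_embeddings[a] - toy_embeddings[b] + toy_embeddings[c]
    candidates = [w for w in toy_embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(toy_embeddings[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the query words from the candidates is a common convention for this kind of lookup, since the nearest vector to the result is often one of the inputs.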

Unfortunately, word embeddings sometimes capture our worst biases. For example, although they correctly learn that "Man" is to "King" as "Woman" is to "Queen", they also seem to learn that "Man" is to "Doctor" as "Woman" is to "Nurse": quite a sexist bias! Ensuring fairness in deep learning algorithms is an important and active research topic.

Anyway, the better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. Moreover, not only will embeddings generally be useful representations for the task at hand, but quite often these same embeddings can be reused successfully for other tasks. In fact, embeddings are so useful that they are often pretrained on very large datasets before being used in a specific task. For example, you can download pretrained word embeddings for free from the [GloVe](https://nlp.stanford.edu/projects/glove/) project, which were trained on 6 billion tokens from Wikipedia (these embeddings have 400,000 words in their vocabulary, and each word is represented as a 100-dimensional vector).

Keras provides an Embedding layer, which wraps an embedding matrix with one row per category and one column per embedding dimension. By default, it is initialized randomly. To convert a category to an embedding, the layer just looks up and returns the row that corresponds to that category. For example, let’s initialize an Embedding layer with five rows and 2D embeddings, and use it to encode some categories:

In [1]:
import tensorflow as tf
import numpy as np

embedding_layer = tf.keras.layers.Embedding(input_dim=5, output_dim=2)
embedding_layer(np.array([2, 4, 2]))

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[-0.01846451, -0.04738495],
       [-0.01139849,  0.02086724],
       [-0.01846451, -0.04738495]], dtype=float32)>

To embed a categorical text attribute, we can chain a StringLookup layer (that maps string features to integer indices) and an Embedding layer. The number of rows in the embedding matrix needs to be equal to the total number of categories (the vocabulary size):

In [2]:
ocean_prox = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

str_lookup_layer = tf.keras.layers.StringLookup()
str_lookup_layer.adapt(ocean_prox)

lookup_and_embed = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=[], dtype=tf.string),
    str_lookup_layer,
    tf.keras.layers.Embedding(input_dim=str_lookup_layer.vocabulary_size(), output_dim=2)
])

In [3]:
lookup_and_embed(np.array(["<1H OCEAN", "ISLAND", "<1H OCEAN"]))

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[-0.02579045,  0.03048733],
       [ 0.00826917,  0.00335374],
       [-0.02579045,  0.03048733]], dtype=float32)>
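To make the two-step lookup more concrete, here is a minimal NumPy sketch of what the chained layers do conceptually. It is not the Keras implementation: the helper name `lookup_and_embed_sketch` is hypothetical, the vocabulary order is simply the order of the list, and we assume that index 0 is reserved for unknown strings and that the matrix is initialized with small uniform random values.

```python
import numpy as np

ocean_prox = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

# Step 1: string -> integer index (assumption: index 0 is kept for unknown strings).
vocab = ["[UNK]"] + ocean_prox
index_of = {category: i for i, category in enumerate(vocab)}

# Step 2: integer index -> row of a (randomly initialized) embedding matrix.
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(len(vocab), 2))

def lookup_and_embed_sketch(categories):
    ids = np.array([index_of.get(category, 0) for category in categories])
    return embedding_matrix[ids]

out = lookup_and_embed_sketch(["<1H OCEAN", "ISLAND", "<1H OCEAN", "MARS"])
print(out)
```

As in the Keras output above, identical categories map to identical rows. The made-up category "MARS" falls back to the row reserved for unknown strings.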

In the examples we used 2D embeddings, but as a rule of thumb embeddings typically have 10 to 300 dimensions, depending on the task, the vocabulary size, and the size of our training set.

Putting everything together, we can now create a Keras model that can process a text feature along with regular numerical features and learn an embedding for each category. For example, we can create a dataset with eight numerical features and one text feature per instance:

In [4]:
X_train_num = np.random.rand(10_000, 8)
X_train_cat = np.random.choice(ocean_prox, size=10_000)
y_train = np.random.rand(10_000, 1)

X_valid_num = np.random.rand(2_000, 8)
X_valid_cat = np.random.choice(ocean_prox, size=2_000)
y_valid = np.random.rand(2_000, 1)

The following code uses the lookup_and_embed model we created earlier to encode each ocean-proximity category as the corresponding trainable embedding. Next, it concatenates the numerical inputs and the embeddings to produce the complete encoded inputs, which are ready to be fed to a neural network.

In [5]:
num_input = tf.keras.layers.Input(shape=[8], name="num")
cat_input = tf.keras.layers.Input(shape=[], dtype=tf.string, name="cat")

cat_embeddings = lookup_and_embed(cat_input) 
encoded_inputs = tf.keras.layers.concatenate([num_input, cat_embeddings])

We could add any kind of neural network at this point. For simplicity, we just add a single dense output layer, and then we create the Keras Model with the inputs and output we’ve just defined:

In [6]:
outputs = tf.keras.layers.Dense(1)(encoded_inputs)
model = tf.keras.models.Model(inputs=[num_input, cat_input], outputs=[outputs])

Next we compile the model and train it, passing both the numerical and categorical inputs:

In [7]:
model.compile(loss="mse", optimizer="sgd")
history = model.fit((X_train_num, X_train_cat), y_train, epochs=5, validation_data=((X_valid_num, X_valid_cat), y_valid))

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Notice that a one-hot encoding followed by a dense layer (with no activation function and no bias) is equivalent to an embedding layer. However, the embedding layer uses far fewer computations (it avoids many multiplications by zero), and the performance difference becomes clear when the size of the embedding matrix grows.
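The equivalence between the two approaches can be checked numerically. The following NumPy sketch uses a small random matrix as a stand-in for the embedding matrix (all values are arbitrary) and compares a plain row lookup with a one-hot encoding followed by a matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(42)
n_categories, n_dims = 5, 2
embedding_matrix = rng.normal(size=(n_categories, n_dims))
category_ids = np.array([2, 4, 2])

# Route 1: what an embedding layer does, a simple row lookup.
looked_up = embedding_matrix[category_ids]

# Route 2: one-hot encode, then multiply by the same matrix
# (a dense layer with no bias and no activation function).
one_hot = np.eye(n_categories)[category_ids]
multiplied = one_hot @ embedding_matrix

print(np.allclose(looked_up, multiplied))
```

Both routes return the same rows, but the second one multiplies by a matrix that is almost entirely zeros, which is the wasted work mentioned above.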

Now that we have learned how to encode categorical features, it’s time to turn our attention to text preprocessing.

## Text Preprocessing

Keras provides a TextVectorization layer for basic text preprocessing. We can either pass it a vocabulary upon creation, or let it learn the vocabulary from some training data using the adapt() method:

In [8]:
train_data = ["To be", "!(to be)", "That's the question", "Be, be, be."]

text_vec_layer = tf.keras.layers.TextVectorization()
text_vec_layer.adapt(train_data)

In [9]:
text_vec_layer.get_vocabulary()

['', '[UNK]', 'be', 'to', 'the', 'thats', 'question']

The vocabulary was learned from the four sentences in the training data: "be" = 2, "to" = 3, etc. To construct the vocabulary, the adapt() method first converted the training sentences to lowercase and removed punctuation, then it split the sentences on whitespace, and finally it sorted the resulting words by descending frequency, producing the final vocabulary. When encoding sentences, unknown words get encoded as 1 (the "[UNK]" token).

In [10]:
text_vec_layer(["Be good!", "Question: be or be?"])

<tf.Tensor: shape=(2, 4), dtype=int64, numpy=
array([[2, 1, 0, 0],
       [6, 2, 1, 2]])>

The two sentences "Be good!" and "Question: be or be?" were encoded as [2, 1, 0, 0] and [6, 2, 1, 2], respectively. Since the first sentence is shorter than the second, it was padded with 0.
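The steps described above (standardize, split, build a frequency-sorted vocabulary, then encode and pad) can be sketched in plain Python. This is a simplified illustration rather than the Keras implementation: the helper names are hypothetical, punctuation is stripped with `string.punctuation`, and ties between equally frequent words are broken in reverse alphabetical order, an assumption chosen only because it reproduces the vocabulary shown above.

```python
import string
from collections import Counter

train_data = ["To be", "!(to be)", "That's the question", "Be, be, be."]

def standardize_and_split(sentence):
    # Lowercase, strip punctuation, then split on whitespace.
    no_punct = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

counts = Counter(word for s in train_data for word in standardize_and_split(s))
# Descending frequency; tie-breaking rule is an assumption (see above).
ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]), reverse=True)
vocabulary = ["", "[UNK]"] + [word for word, _ in ranked]
word_to_id = {word: i for i, word in enumerate(vocabulary)}

def encode(sentences):
    # Unknown words map to 1; shorter sentences are padded with 0.
    rows = [[word_to_id.get(w, 1) for w in standardize_and_split(s)] for s in sentences]
    width = max(len(row) for row in rows)
    return [row + [0] * (width - len(row)) for row in rows]

encoded_sentences = encode(["Be good!", "Question: be or be?"])
print(vocabulary)
print(encoded_sentences)
```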

The TextVectorization layer has many options. For example, we can preserve the case and punctuation (standardize=None), or we can pass any standardization function we need. We can prevent splitting (split=None), or we can pass our own splitting function instead. We can also ensure that the output sequences all get cropped or padded to the desired length (output_sequence_length).

The word IDs must be encoded, typically using an Embedding layer, as we will see below. Alternatively, we can set the output_mode argument to "multi_hot" or "count" to get the corresponding encodings. However, simply counting words is usually not ideal: words like "to" and "the" are so frequent that they hardly matter at all, whereas rarer words such as "basketball" are much more informative. So, it is usually preferable to set the output mode to "tf_idf". **TF-IDF** (term-frequency × inverse-document-frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. The higher the TF-IDF score, the more relevant that word is in that particular document.


<img src="./images/tf-idf.png" width="450">

There are many TF-IDF variants, but the way the TextVectorization layer implements it is by multiplying each word count by a weight equal to

$\displaystyle \log(1+\frac{d}{f+1})$

where d is the total number of sentences (a.k.a., documents) in the training data and f counts how many of these training sentences contain the given word.

In [11]:
text_vec_layer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
text_vec_layer.adapt(train_data)

In [12]:
text_vec_layer(["Be good!", "Question: be or be?"])

<tf.Tensor: shape=(2, 6), dtype=float32, numpy=
array([[0.96725637, 0.6931472 , 0.        , 0.        , 0.        ,
        0.        ],
       [0.96725637, 1.3862944 , 0.        , 0.        , 0.        ,
        1.0986123 ]], dtype=float32)>

For example, in this case there are d=4 sentences in the training data, and the word "be" appears in f=3 of these. Since the word "be" occurs twice in the sentence "Question: be or be?", its count (2) is multiplied by the weight log(1 + 4 / (1 + 3)), so it gets encoded as:

In [13]:
2 * np.log(1 + 4 / (1 + 3))

1.3862943611198906

The word "question" only appears once, but since it is a less common word, its encoding is almost as high:  

In [14]:
1 * np.log(1 + 4 / (1 + 1))

1.0986122886681098

Note that the average weight is used for unknown words.
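The whole computation can be reproduced by hand with a short plain-Python sketch, assuming the weight formula given above. The helper names (`tokenize`, `tf_idf`) are hypothetical, and unlike the layer, which groups all unknown words in a single column, this sketch reports one score per word:

```python
import math
import string
from collections import Counter

train_data = ["To be", "!(to be)", "That's the question", "Be, be, be."]

def tokenize(sentence):
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

d = len(train_data)                      # number of training sentences
doc_freq = Counter()                     # f: number of sentences containing each word
for sentence in train_data:
    doc_freq.update(set(tokenize(sentence)))

idf = {word: math.log(1 + d / (f + 1)) for word, f in doc_freq.items()}
unk_weight = sum(idf.values()) / len(idf)  # average weight, used for unknown words

def tf_idf(sentence):
    counts = Counter(tokenize(sentence))
    return {word: n * idf.get(word, unk_weight) for word, n in counts.items()}

scores = tf_idf("Question: be or be?")
print(scores)
```

The scores for "be" and "question" match the two values computed above, and the unknown word "or" receives the average weight, which is the value that appears in the first column of the layer’s output.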

This approach to text encoding is straightforward to use and it can give fairly good results for basic natural language processing tasks, but it has several important limitations: it only works with languages that separate words with spaces, it doesn’t distinguish between homonyms (e.g., "to bear" versus "teddy bear"), it gives no hint to our model that words like "evolution" and "evolutionary" are related, etc. And the order of the words is lost.

## Pretrained Language Models

Many open source libraries make it easy to reuse pretrained model components in our own models (for text, image, audio, and more). These model components are called **modules**. For example, we can explore the [TensorFlow Hub library](https://www.tensorflow.org/hub) or the [Hugging Face Library](https://huggingface.co/docs/transformers/index) to find a model we need, and copy the code example into our project. The module will be automatically downloaded and bundled into a Keras layer that we can directly include in our model. Modules typically contain both preprocessing code and pretrained weights, and they generally require no extra training (but of course, the rest of our model will certainly require training).

For example, the **nnlm-en-dim50** module is a fairly basic module that takes raw text as input and outputs 50-dimensional sentence embeddings:

In [15]:
import tensorflow_hub as hub

hub_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2")

In [16]:
sentence_embeddings = hub_layer(tf.constant(["To be", "Not to be"]))
sentence_embeddings.numpy().round(2)

array([[-0.25,  0.28,  0.01,  0.1 ,  0.14,  0.16,  0.25,  0.02,  0.07,
         0.13, -0.19,  0.06, -0.04, -0.07,  0.  , -0.08, -0.14, -0.16,
         0.02, -0.24,  0.16, -0.16, -0.03,  0.03, -0.14,  0.03, -0.09,
        -0.04, -0.14, -0.19,  0.07,  0.15,  0.18, -0.23, -0.07, -0.08,
         0.01, -0.01,  0.09,  0.14, -0.03,  0.03,  0.08,  0.1 , -0.01,
        -0.03, -0.07, -0.1 ,  0.05,  0.31],
       [-0.2 ,  0.2 , -0.08,  0.02,  0.19,  0.05,  0.22, -0.09,  0.02,
         0.19, -0.02, -0.14, -0.2 , -0.04,  0.01, -0.07, -0.22, -0.1 ,
         0.16, -0.44,  0.31, -0.1 ,  0.23,  0.15, -0.05,  0.15, -0.13,
        -0.04, -0.08, -0.16, -0.1 ,  0.13,  0.13, -0.18, -0.04,  0.03,
        -0.1 , -0.07,  0.07,  0.03, -0.08,  0.02,  0.05,  0.07, -0.14,
        -0.1 , -0.18, -0.13, -0.04,  0.15]], dtype=float32)

The module parses the string (splitting words on spaces) and embeds each word using an embedding matrix that was pretrained on a huge corpus: the Google News 7B corpus (seven billion words long). Then it computes the mean of all the word embeddings, and the result is the sentence embedding. 

Famous pretrained embeddings include [Google’s Word2vec embeddings](https://arxiv.org/abs/1310.4546), [Stanford’s GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and [Facebook’s FastText embeddings](https://fasttext.cc/). Using pretrained word embeddings is powerful, but it has its limits. In particular, a word has a single representation, no matter the context. For example, the word "right" is encoded the same way in "left and right" and "right and wrong", even though it means two very different things. To address this limitation, the [**Embeddings from Language Models** (ELMo)](https://arxiv.org/abs/1802.05365) were introduced: these are contextualized word embeddings learned from the internal states of a deep bidirectional language model. Moreover, the [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) paper demonstrated the effectiveness of pretraining for NLP tasks: the authors trained a language model on a huge text corpus, then they fine-tuned it on various tasks, and the model outperformed the state of the art on six text classification tasks by a large margin. The authors also showed that a pretrained model fine-tuned on just 100 labeled examples could achieve the same performance as one trained from scratch on 100 times more labeled examples. This is called **transfer learning**.

## Sentence Generation

We can use a **character RNN** to predict the next character in a sentence. This will allow us to generate some original text. Let’s start with a simple model that can write like Shakespeare.

First, we load the [char-rnn](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, which contains a selection of Shakespeare’s works:

In [17]:
import os
import urllib.request

url = "https://www.dropbox.com/scl/fi/27gmujs0gjg91dzoc63v5/shakespeare.zip?rlkey=qk562r6wftczcevcavpd5kska&dl=1"  # dl=1 is important
os.makedirs("./data", exist_ok=True)  # make sure the target directory exists
# download the archive and write it to disk
with urllib.request.urlopen(url) as u:
    data = u.read()
with open("./data/shakespeare.zip", "wb") as f:
    f.write(data)

In [18]:
import zipfile

with zipfile.ZipFile("./data/shakespeare.zip","r") as zip_ref:
    zip_ref.extractall("./data")

In [19]:
with open("./data/shakespeare.txt") as f:
    shakespeare_text = f.read()

Let’s print the first few lines:

In [20]:
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


First, we use a TextVectorization layer to encode the text. We set split="character" to get character-level encoding rather than the default word-level encoding, and we use standardize="lower" to convert the text to lowercase:

In [21]:
import tensorflow as tf

text_vec_layer = tf.keras.layers.TextVectorization(split="character", standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

Each character is now mapped to an integer, starting at 2 (0 is reserved for padding tokens, and 1 is reserved for unknown characters). We won’t need either of these tokens, so let’s subtract 2 from the character IDs and compute the number of distinct characters and the total number of characters:

In [22]:
# drop tokens 0 (pad) and 1 (unknown), which we will not use
encoded -= 2  

# number of distinct chars (39)
n_tokens = text_vec_layer.vocabulary_size() - 2  
print("Number of tokens: ", n_tokens)

# total number of chars (1,115,394)
dataset_size = len(encoded)  
print("Number of characters: ", dataset_size)

Number of tokens:  39
Number of characters:  1115394


We can turn this long sequence into a dataset of windows that we can then use to train a sequence-to-sequence RNN. The targets will be similar to the inputs, but shifted by one time step into the "future". For example, one sample in the dataset may be a sequence of character IDs representing the text "to be or not to b" and the corresponding target will be a sequence of character IDs representing the text "o be or not to be" (with the final "e", but without the leading "t"). Let’s write a small utility function to convert a long sequence of character IDs into a dataset of input/target window pairs:

In [23]:
def to_dataset(sequence, length, shuffle=False, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

The function takes a sequence as input and creates a dataset containing all the windows of the desired length. It increases the window length by one (we need the next character for the target), then it optionally shuffles the windows, batches them, splits them into input/target pairs, and activates prefetching.

<img src="./images/dataset-preparation.png" width="500">

In [24]:
list(to_dataset(text_vec_layer(["To be"])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]
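The windowing logic itself does not depend on tf.data. As a rough illustration, here is a plain-Python sketch that works directly on a string instead of character IDs and leaves out shuffling, batching and prefetching (the name `make_windows` is hypothetical):

```python
def make_windows(sequence, length, shift=1):
    """Return (input, target) pairs; each target is its input shifted by one step."""
    pairs = []
    # Each window holds length + 1 items: `length` inputs plus the next item.
    # Incomplete windows at the end are dropped (like drop_remainder=True).
    for start in range(0, len(sequence) - length, shift):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

print(make_windows("to be", length=4))
pairs = make_windows("to be or not", length=4)
print(len(pairs), pairs[:2])
```

With the five-character text "to be" and length=4 there is room for a single window, which matches the single input/target pair in the output above.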

Now we’re ready to create the training set, the validation set, and the test set. We will use roughly
90% of the text for training, 5% for validation, and 5% for testing. We set the window length to 100, but we can try tuning it: it’s easier and faster to train RNNs on shorter input sequences, but the RNN will not be able to learn any pattern longer than length, so don’t make it too small.

In [25]:
length = 100

train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)
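As a quick sanity check on the "roughly 90% / 5% / 5%" split, we can compute the share of the 1,115,394 characters that falls in each slice. The slice boundaries are the ones used in the cell above:

```python
dataset_size = 1_115_394  # total number of characters, as printed earlier

split_sizes = {
    "train": 1_000_000,
    "valid": 1_060_000 - 1_000_000,
    "test": dataset_size - 1_060_000,
}
fractions = {name: size / dataset_size for name, size in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]:>9,} characters ({fraction:.1%})")
```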

Since the dataset is reasonably large, and modeling language is quite a difficult task, we need more
than a simple RNN with a few recurrent neurons. Let’s build and train a model with one GRU layer composed of 128 units. We use an Embedding layer as the first layer to encode the character IDs, and a Dense layer with n_tokens units as the output layer (we want to output a probability for each possible character):

In [26]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

In [27]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

The model may take one or two hours to train, depending on the GPU. Without a GPU, it may take over 24 hours:

In [28]:
model_ckpt = tf.keras.callbacks.ModelCheckpoint(".data/my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10, callbacks=[model_ckpt])

Epoch 1/10



  31247/Unknown - 638s 20ms/step - loss: 1.3988 - accuracy: 0.5715

Epoch 2/10


Epoch 3/10
Epoch 4/10


Epoch 5/10


Epoch 6/10


Epoch 7/10


Epoch 8/10
Epoch 9/10


Epoch 10/10


The model does not handle text preprocessing, so let’s wrap it in a final model containing the tf.keras.layers.TextVectorization layer as the first layer, plus a tf.keras.layers.Lambda layer to subtract 2 from the character IDs:

In [29]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

And now let’s use it to predict the next character in a sentence:

In [30]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]




'e'

The model correctly predicts the next character. Now let’s use this model to pretend we’re Shakespeare. To generate new text, we can feed the network some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. This is called **greedy decoding**, but in practice it often leads to the same words being repeated over and over again. Instead, we can sample the next character randomly, with a probability equal to the estimated probability. This will generate more diverse and interesting text. To have more control over the diversity of the generated text, we can divide the class log probabilities by a number called the **temperature** before sampling: a value close to zero favors high-probability characters, while a high value gives all characters nearly equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text (e.g., mathematical equations), while higher temperatures are preferred when generating more diverse and creative text.

In [31]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

Next, we can write a function that will repeatedly call next_char() to get the next character and append it to the given text:

In [32]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

We are now ready to generate some text. Let’s try different temperature values:

In [33]:
print(extend_text("To be or not to be", temperature=0.01))



To be or not to be a states
to see him and angelo, i will be a state


In [34]:
print(extend_text("To be or not to be", temperature=1))

To be or not to be back and
loses him horsemar hear it exceeds. masa


In [35]:
print(extend_text("To be or not to be", temperature=100))

To be or not to bev!?lo.-s?,?dkjqlobl$jph'vlsp bzg;ulh!a.p.d 'bqe&k,
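The behavior of these three samples follows from how the temperature reshapes the probability distribution. The NumPy sketch below applies the same rescaling as next_char() (log probabilities divided by the temperature, then renormalized) to a made-up three-character distribution. The helper `apply_temperature` and the probability values are hypothetical:

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Rescale log probabilities by the temperature and renormalize (softmax)."""
    logits = np.log(probas) / temperature
    logits -= logits.max()            # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

probas = np.array([0.6, 0.3, 0.1])    # made-up next-character probabilities

cold = apply_temperature(probas, 0.1)      # almost all mass on the top character
neutral = apply_temperature(probas, 1.0)   # distribution left unchanged
hot = apply_temperature(probas, 100.0)     # close to uniform

for name, p in [("T=0.1", cold), ("T=1", neutral), ("T=100", hot)]:
    print(name, p.round(3))
```

At a very low temperature, sampling behaves almost like greedy decoding, which explains the repetitive first sample. At a very high temperature, every character is almost equally likely, which explains the gibberish in the last one.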


To generate more convincing text, a common technique is to sample only from the top k characters, or only from the smallest set of top characters whose total probability exceeds some threshold (**nucleus sampling**).
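Both filtering strategies can be sketched in a few lines of NumPy. This is a minimal illustration on a made-up probability vector, not part of the model above. The helper names are hypothetical, and the nucleus variant keeps characters until the cumulative probability reaches the threshold:

```python
import numpy as np

def top_k_filter(probas, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    keep = np.argsort(probas)[-k:]
    filtered = np.zeros_like(probas)
    filtered[keep] = probas[keep]
    return filtered / filtered.sum()

def nucleus_filter(probas, threshold):
    """Keep the smallest set of top entries whose total probability reaches the threshold."""
    order = np.argsort(probas)[::-1]              # most probable first
    cumulative = np.cumsum(probas[order])
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    keep = order[:n_keep]
    filtered = np.zeros_like(probas)
    filtered[keep] = probas[keep]
    return filtered / filtered.sum()

probas = np.array([0.5, 0.3, 0.15, 0.05])   # made-up next-character probabilities
top2 = top_k_filter(probas, k=2)
nucleus = nucleus_filter(probas, threshold=0.9)

# Sample the next character ID from the filtered distribution.
rng = np.random.default_rng(0)
sampled_id = rng.choice(len(probas), p=nucleus)
print(top2, nucleus, sampled_id)
```

In both cases the least likely characters get a probability of exactly zero, so they can never be sampled, while the remaining characters keep their relative proportions.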

Until now, at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away as it is not needed anymore. What if we instructed the RNN to preserve this final state after processing a training batch and use it as the initial state for the next training batch? In this way the model can learn long-term patterns despite only backpropagating through short sequences. This is called **a stateful RNN**. 

Notice that it only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So we need to use **sequential and non-overlapping input sequences** (rather than the shuffled and overlapping sequences we used to train stateless RNNs). When creating the tf.data.Dataset, we must therefore use shift=length (instead of shift=1) when calling the window() method. Moreover, we must not call the shuffle() method.

Batching is much harder: if we call batch(32), then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off. The first batch would contain windows 1 to 32 and the second batch would contain windows 33 to 64, so if you consider, say, the first window of each batch (i.e., windows 1 and 33), you can see that
they are not consecutive. The simplest solution to this problem is to just use a batch size of 1.

<img src="./images/stateful-dataset-preparation.png" width="500">

In [36]:
def to_dataset_for_stateful_rnn(sequence, length):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    ds = ds.flat_map(lambda window: window.batch(length + 1)).batch(1)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [37]:
stateful_train_set = to_dataset_for_stateful_rnn(encoded[:1_000_000], length)
stateful_valid_set = to_dataset_for_stateful_rnn(encoded[1_000_000:1_060_000],length)
stateful_test_set = to_dataset_for_stateful_rnn(encoded[1_060_000:], length)
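To see why a batch size of 1 keeps the sequences aligned, here is a plain-Python sketch of the non-overlapping windowing, applied to a toy sequence of integers. The name `consecutive_windows` is hypothetical, and the sketch ignores batching and prefetching:

```python
def consecutive_windows(sequence, length):
    """Non-overlapping (input, target) pairs: each input starts where the previous one ended."""
    pairs = []
    # shift == length, and incomplete windows at the end are dropped.
    for start in range(0, len(sequence) - length, length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = consecutive_windows(list(range(10)), length=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)

# With a batch size of 1, batch i holds only pair i, so consecutive batches continue each other.
continues = [pairs[i + 1][0][0] == pairs[i][0][-1] + 1 for i in range(len(pairs) - 1)]
print(continues)
```

With a batch size of 2, the first sequence of the second batch would be the third window, which does not continue the first window. This is the misalignment described above.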

Now, let’s create the stateful RNN. We need to set the stateful argument to True when creating each recurrent layer, and because the stateful RNN needs to know the batch size (since it will preserve a state for each input sequence in the batch) we must set the batch_input_shape argument in the first layer:

In [38]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16, batch_input_shape=[1, None]),
    tf.keras.layers.GRU(128, return_sequences=True, stateful=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

Notice that at the end of each epoch, we need to reset the states before we go back to the beginning of the text:

In [39]:
class ResetStatesCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs):
        self.model.reset_states()

And now we can compile the model and train it using our callback:

In [40]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [41]:
model_ckpt = tf.keras.callbacks.ModelCheckpoint(".data/my_stateful_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(stateful_train_set, validation_data=stateful_valid_set, epochs=10, callbacks=[ResetStatesCallback(), model_ckpt])

Epoch 1/10


   9998/Unknown - 190s 19ms/step - loss: 1.8605 - accuracy: 0.4531

After this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training. To avoid this restriction, create an identical stateless model, and copy the stateful model’s weights to this model:

In [42]:
stateless_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

In [43]:
stateless_model.build(tf.TensorShape([None, None]))
stateless_model.set_weights(model.get_weights())

In [44]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    stateless_model
])

In [45]:
print(extend_text("to be or not to be", temperature=0.1))



to be or not to be
a worship and heard and heard to her and so with 


Notice that, although the char-RNN model is just trained to predict the next character, this simple task actually requires it to learn some higher-level tasks as well. For example, to find the next character after "Great movie, I really _", it’s helpful to understand that the sentence is positive, so what follows is more likely to be the letter "l" (for "loved") rather than "h" (for "hated"). The 2017 paper ["Learning to Generate Reviews and Discovering Sentiment"](https://arxiv.org/abs/1704.01444) found that one of the neurons acted as an excellent "sentiment analysis classifier": although the model was trained without any labels, the **sentiment neuron** reached state-of-the-art performance on sentiment analysis benchmarks. This foreshadowed and motivated unsupervised pretraining.

## Sentiment Analysis

One of the most common applications of NLP is text classification, especially sentiment analysis. We can use the **IMDb reviews dataset**, which consists of 50,000 movie reviews in English extracted from the Internet Movie Database, along with a simple binary target for each review indicating whether it is negative (0) or positive (1). Like MNIST for images, the IMDb reviews dataset is popular for NLP: it is simple enough to be tackled on a laptop in a reasonable amount of time, but also challenging enough to be interesting. Let’s load the IMDb dataset using the TensorFlow Datasets library. We use the first 90% of the training set for training, and the remaining 10% for validation:

In [2]:
import tensorflow_datasets as tfds

raw_train_set, raw_valid_set, raw_test_set = tfds.load(name="imdb_reviews", 
                                                       split=["train[:90%]", "train[90%:]", "test"],
                                                       as_supervised=True
)

train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)


Let’s inspect a few reviews:

In [3]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8")[:200], "...")
    print("Label:", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun ...
Label: 0
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful perf ...
Label: 1



Some reviews are easy to classify. For example, the first review includes the words "terrible movie" in the very first sentence. But in many cases things are not that simple. For example, the third review starts off positively, even though it’s ultimately a negative review. To build a model for this task, we need to preprocess the text, but this time we will chop it into words instead of characters.

We can use the tf.keras.layers.TextVectorization layer, using spaces to identify word boundaries. Spaces are not always the best way to **tokenize** text (e.g., "San Francisco" or "#ILoveDeepLearning"). Fortunately, there are solutions to address these issues, as presented in [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909) and [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959). One such technique is **byte pair encoding** (BPE). It works by splitting the whole training set into individual characters (including spaces), then repeatedly merging the most frequent adjacent pairs until the vocabulary reaches the desired size. The TensorFlow Text library implements several of these tokenization strategies. However, for the IMDb task in English, using spaces for token boundaries should be good enough:

In [4]:
import tensorflow as tf

vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))
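As an aside, the byte pair encoding procedure mentioned above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the hypothetical `learn_bpe` function splits the corpus on whitespace first (so merges never cross word boundaries), breaks ties by first occurrence, and stops after a fixed number of merges instead of targeting a vocabulary size. It is not the implementation used by TensorFlow Text.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a whitespace-separated corpus."""
    # Each distinct word is a tuple of symbols (initially single characters)
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


merges = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

On this toy corpus the first two merges build the subword "low", which is shared by all three word forms.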


We limit the vocabulary to 1,000 tokens, since it’s unlikely that very rare words will be important for this task, and limiting the vocabulary size will reduce the number of parameters the model needs to learn. Now, we can create the model and train it:

In [4]:
embed_size = 128

model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

The first layer is the TextVectorization layer, followed by an Embedding layer to convert word IDs into embeddings (one row per token in the vocabulary and one column per embedding dimension; in this example we use 128 dimensions). Next we use a GRU layer and a Dense layer with a single neuron and the sigmoid activation function, since this is a binary classification task: the model outputs the estimated probability that the review expresses a positive sentiment regarding the movie.

We then compile the model, and we fit it on the dataset for a couple of epochs:

In [5]:
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [6]:
history = model.fit(train_set, validation_data=valid_set, epochs=2)

Epoch 1/2



Epoch 2/2


Sadly, the model fails to learn anything at all (the accuracy remains close to 50%, no better than random chance). Why is that? The reviews have **different lengths**, so when the TextVectorization layer converts them to sequences of token IDs, it pads the shorter sequences using the padding token (with ID 0) to make them as long as the longest sequence in the batch. As a result, most sequences end with many padding tokens (often dozens or hundreds).

<img src="./images/padding.png" width="700">

Even though we are using a GRU layer, its short-term memory is still not great, so when it goes through many padding tokens, it ends up forgetting what the review was about! A solution is to make the RNN ignore the padding tokens, using **masking**. In Keras, simply add mask_zero=True when creating the Embedding layer:

In [7]:
embed_size = 128

tf.random.set_seed(42)

model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

The Embedding layer creates a **mask tensor** (a Boolean tensor) with the same shape as the inputs, and it is equal to False anywhere the token IDs are 0, or True otherwise. This mask tensor is then automatically propagated by the model to the next layer. This allows layers to ignore the appropriate time steps. Each layer may handle the mask differently, but in general they simply ignore masked time steps. For example, when a recurrent layer encounters a masked time step, it simply copies the output from the previous time step. Many Keras layers support masking: SimpleRNN, GRU, LSTM, Bidirectional, Dense, TimeDistributed, Add, and a few others. However, convolutional layers do not support masking (it’s not obvious how they would do so anyway). If the mask propagates all the way to the output, then it gets applied to the losses as well, so the masked time steps will not contribute to the loss.
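The following pure-Python sketch illustrates padding and masking on a toy batch. The helpers `pad_batch`, `compute_mask`, and `decaying_memory` are hypothetical: the last one is a stand-in for a recurrent layer whose state fades at every step, which is only a rough analogy for how a GRU forgets over a long run of padding tokens.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence with pad_id up to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def compute_mask(padded, pad_id=0):
    """Boolean mask with the same shape as the inputs: False where the ID is pad_id."""
    return [[token != pad_id for token in seq] for seq in padded]


def decaying_memory(seq, mask=None):
    """Toy recurrent layer: the state decays by half at each step.
    At a masked step, the previous output is copied unchanged."""
    state, outputs = 0.0, []
    for step, token in enumerate(seq):
        if mask is None or mask[step]:
            state = 0.5 * state + token
        outputs.append(state)
    return outputs


padded = pad_batch([[5, 3, 8], [7]])
mask = compute_mask(padded)
without_mask = decaying_memory(padded[1])
with_mask = decaying_memory(padded[1], mask[1])
print(padded, mask)
print(without_mask, with_mask)
```

Without the mask, the state of the short sequence shrinks at every padding step. With the mask, the last meaningful state is preserved until the end of the sequence.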

In [8]:
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [9]:
with tf.device("/CPU:0"):
    history = model.fit(train_set, validation_data=valid_set, epochs=2)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


After training this model for a few epochs, it will become quite good at judging whether a review is positive or not. It’s impressive that the model is able to learn useful word embeddings based on just 25,000 movie reviews. Imagine how good the embeddings would be if we had billions of reviews to train on! Unfortunately, we don’t, but perhaps we can **reuse word embeddings** trained on some other large text corpus (e.g., Amazon reviews, available on TensorFlow Datasets), even if it is not composed of movie reviews? After all, the word "amazing" generally has the same meaning whether you use it to talk about movies or anything else. So, instead of training word embeddings, we can just download and use pretrained embeddings. For example, let’s build a classifier based on the [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) available on TensorFlow Hub:

In [5]:
import os
import tensorflow_hub as hub

os.environ["TFHUB_CACHE_DIR"] = "./data"

model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4", trainable=True, dtype=tf.string, input_shape=[]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])


By default, TensorFlow Hub modules are saved to a temporary directory, and they get downloaded again and again every time you run your program. To avoid that, we can set the TFHUB_CACHE_DIR environment variable to a directory of our choice: the modules will then be saved there, and only downloaded once. Also note that we set trainable=True: in this way the pretrained Universal Sentence Encoder is fine-tuned during training. If we set trainable=False, then only the Dense layers will be trained.

In [6]:
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [7]:
model.fit(train_set, validation_data=valid_set, epochs=2)

Epoch 1/2



Epoch 2/2

After training, this model should reach a validation accuracy of over 90%. That’s actually really good: a human performing the same task would probably do only marginally better, since many reviews contain both positive and negative comments, and classifying these ambiguous reviews is like flipping a coin.

## Machine Translation

Let’s begin with a simple model that translates English sentences to Spanish. English sentences are fed as inputs to an encoder, then a decoder outputs the Spanish translations:

<img src="./images/machine-translation.png" width="600">

Notice that during training the Spanish translations are also used as inputs to the decoder, but shifted back by one step: at each step the decoder is given the word that it should have output at the previous step, regardless of what it actually output. This is a technique called **teacher forcing** that significantly speeds up training and improves performance. For the very first word, the decoder is given the start-of-sequence (SOS) token, and the decoder is expected to end the sentence with an end-of-sequence (EOS) token. Each word is initially represented by its ID (e.g., 854 for "soccer"). Next, an Embedding layer returns the word embedding. These word embeddings are then fed to the encoder and the decoder. At each step, the decoder outputs a score for each word in the output vocabulary (i.e., Spanish), then the softmax activation function turns these scores into probabilities. For example, at the first step the word "Me" may have a probability of 7%, "Yo" may have a probability of 1%, and so on. The word with the highest probability is output. This is very much like a regular classification task, and indeed we can train the model using the "sparse_categorical_crossentropy" loss, much like we did in the char-RNN model. At inference time, we do not have the target sentence to feed to the decoder, so instead we feed it the word that it has just output at the previous step.
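The one-step shift used by teacher forcing can be illustrated with plain Python lists. The function name `teacher_forcing_pair` and the placeholder strings "startofseq" and "endofseq" for the SOS and EOS tokens are assumptions made for this sketch:

```python
def teacher_forcing_pair(target_words, sos="startofseq", eos="endofseq"):
    """Build the decoder inputs and the targets for one training sentence.
    The decoder input at step t is the target word of step t - 1."""
    decoder_inputs = [sos] + target_words
    targets = target_words + [eos]
    return decoder_inputs, targets


decoder_inputs, targets = teacher_forcing_pair(["me", "gusta", "el", "fútbol"])
for step, (fed, expected) in enumerate(zip(decoder_inputs, targets)):
    print(f"step {step}: decoder is fed {fed!r}, should output {expected!r}")
```

Both lists have the same length, and the word the decoder must predict at one step is exactly the word it is fed at the next step.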

Let’s build and train this model! First, we download a dataset of English and Spanish sentences from the [Tatoeba Project](https://tatoeba.org/eng/downloads): 

In [1]:
import urllib.request

url = "https://www.dropbox.com/scl/fi/5st5n7uxe849880m5gypj/spa-eng.zip?rlkey=284r6ruizm293xipcatl8l0j6&dl=1"  # dl=1 is important
u = urllib.request.urlopen(url)
data = u.read()
u.close()
with open("./data/spa-eng.zip", "wb") as f:
    f.write(data)

Each line contains an English sentence and the corresponding Spanish translation, separated by a tab:

In [4]:
from pathlib import Path

import tensorflow as tf

# get_file() stores the archive under <cache_dir>/datasets/ and, with extract=True,
# unpacks it next to the archive (behavior of the TF 2.x releases)
path = tf.keras.utils.get_file("spa-eng.zip", origin=url, cache_dir="./data", extract=True)
text = (Path(path).with_name("spa-eng") / "spa.txt").read_text(encoding="utf-8")

Downloading data from https://www.dropbox.com/scl/fi/5st5n7uxe849880m5gypj/spa-eng.zip?rlkey=284r6ruizm293xipcatl8l0j6&dl=1


We start by removing the Spanish characters "¡" and "¿", which the TextVectorization layer doesn’t handle, then we will parse the sentence pairs and shuffle them. Finally, we will split them into two separate lists, one per language:

In [None]:
import numpy as np

text = text.replace("¡", "").replace("¿", "")
pairs = [line.split("\t") for line in text.splitlines()]

np.random.shuffle(pairs)

sentences_en, sentences_es = zip(*pairs)  # separates the pairs into 2 lists

Let’s take a look at the first three sentence pairs:

In [None]:
for i in range(3):
    print(sentences_en[i], "=>", sentences_es[i])

Next, let’s create two TextVectorization layers, one per language, and adapt them to the text:

In [None]:
vocab_size = 1000
max_length = 50

text_vec_layer_en = tf.keras.layers.TextVectorization(vocab_size, output_sequence_length=max_length)
text_vec_layer_en.adapt(sentences_en)

text_vec_layer_es = tf.keras.layers.TextVectorization(vocab_size, output_sequence_length=max_length)
text_vec_layer_es.adapt([f"startofseq {s} endofseq" for s in sentences_es])

We limit the vocabulary size to 1,000 (quite small) because the training set is not very large, and because a small vocabulary speeds up training. State-of-the-art translation models typically use a much larger vocabulary (e.g., 30,000), a much larger training set, and a much larger model. For example, check out the [Opus-MT models](https://github.com/Helsinki-NLP/Opus-MT) by the University of Helsinki, or the [M2M-100 model](https://huggingface.co/docs/transformers/model_doc/m2m_100) by Facebook.

Since all sentences in the dataset have a maximum of 50 words, we set output_sequence_length to 50. This way the input sequences will automatically be padded with zeros until they are all 50 tokens long. If there were any sentence longer than 50 tokens in the training set, it would be cropped to 50 tokens.

For the Spanish text, we add "startofseq" and "endofseq" to each sentence when adapting the TextVectorization layer: we will use these words as SOS and EOS tokens. We could use any other words, as long as they are not actual Spanish words.

Let’s inspect the first 10 tokens in both vocabularies. They start with the padding token, the unknown token, the SOS and EOS tokens (only in the Spanish vocabulary), then the actual words, sorted by decreasing frequency:

In [None]:
text_vec_layer_en.get_vocabulary()[:10]

In [None]:
text_vec_layer_es.get_vocabulary()[:10]
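To see why the SOS and EOS tokens end up right after the padding and unknown tokens, here is a rough pure-Python imitation of how such a vocabulary is ordered. The function `build_vocabulary` is hypothetical, and it assumes that the padding token is the empty string and the unknown token is "[UNK]". It ignores the lowercasing and punctuation stripping that TextVectorization applies by default, and its ordering of equally frequent words may differ from the real layer:

```python
from collections import Counter


def build_vocabulary(sentences, max_tokens):
    """Padding token, unknown token, then words sorted by decreasing frequency."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    most_frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_frequent


toy_sentences_es = ["me gusta", "me encanta el cine"]
vocab = build_vocabulary(
    [f"startofseq {s} endofseq" for s in toy_sentences_es], max_tokens=1000)
print(vocab)
```

Since "startofseq" and "endofseq" appear once in every sentence, they are among the most frequent tokens of the corpus.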

Next, let’s create the training set and the validation set. We will use the first 100,000 sentence pairs for training, and the rest for validation. The decoder inputs are the Spanish sentences plus an SOS token prefix, and the targets are the Spanish sentences plus an EOS suffix:

In [None]:
X_train = tf.constant(sentences_en[:100_000])
X_valid = tf.constant(sentences_en[100_000:])

X_train_dec = tf.constant([f"startofseq {s}" for s in sentences_es[:100_000]])
X_valid_dec = tf.constant([f"startofseq {s}" for s in sentences_es[100_000:]])

Y_train = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[:100_000]])
Y_valid = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[100_000:]])

OK, we are ready to build the translation model. We will use the functional API for that since the model is not sequential. It requires two text inputs, one for the encoder and one for the decoder:

In [None]:
encoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
decoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)

Next, we need to encode these sentences using the TextVectorization layers, followed by an Embedding layer for each language, with mask_zero=True to ensure masking is handled automatically:

In [None]:
embed_size = 128

encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)

encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)
decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)

encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

Now let’s create the encoder and pass it the embedded inputs. To keep things simple, we use a single LSTM layer, but we could stack several of them. We also set return_state=True to get a reference to the layer’s final state. Since we’re using an LSTM layer, there are actually two states: the short-term state and the long-term state. The layer returns these states separately, which is why we write *encoder_state to group both states in a list.

In [None]:
encoder = tf.keras.layers.LSTM(512, return_state=True)
encoder_outputs, *encoder_state = encoder(encoder_embeddings)

Now we can use this (double) state as the initial state of the decoder:

In [None]:
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

Next, we can pass the decoder outputs through a Dense layer with the softmax activation function to get the word probabilities for each step:

In [None]:
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(decoder_outputs)

Finally, we just create the Keras model:

In [None]:
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [None]:
model.fit((X_train, X_train_dec), Y_train, epochs=10, validation_data=((X_valid, X_valid_dec), Y_valid))

When the output vocabulary is large, outputting a probability for each and every possible word can be quite slow. An optimization is to apply the [**sampled softmax technique**](https://arxiv.org/abs/1412.2007) by looking only at the logits output by the model for the correct word and for a random sample of incorrect words, then compute an approximation of the loss based only on these logits. 
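The idea can be sketched for a single example in plain Python. This is a simplified illustration, not the exact method of the paper: the hypothetical function `sampled_softmax_loss` draws the incorrect words uniformly and omits the correction for the sampling distribution that practical implementations (such as tf.nn.sampled_softmax_loss) apply.

```python
import math
import random


def full_softmax_loss(logits, target):
    """Cross-entropy of the target word, using the logits of the whole vocabulary."""
    log_z = math.log(sum(math.exp(logit) for logit in logits))
    return log_z - logits[target]


def sampled_softmax_loss(logits, target, num_sampled, rng):
    """Approximate loss using the target's logit and a random sample of wrong words."""
    negatives = [i for i in range(len(logits)) if i != target]
    sampled = rng.sample(negatives, num_sampled)
    kept = [logits[target]] + [logits[i] for i in sampled]
    log_z = math.log(sum(math.exp(logit) for logit in kept))
    return log_z - logits[target]


logits = [2.0, 1.0, 0.1, -1.0]  # toy scores for a 4-word vocabulary
rng = random.Random(42)
full_loss = full_softmax_loss(logits, target=0)
approx_loss = sampled_softmax_loss(logits, target=0, num_sampled=1, rng=rng)
all_sampled_loss = sampled_softmax_loss(logits, target=0, num_sampled=3, rng=rng)
print(full_loss, approx_loss, all_sampled_loss)
```

When every incorrect word is sampled, the approximation coincides with the full loss. With fewer samples, this uncorrected version underestimates it, but it only needs a handful of logits per step.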

We can use the model to translate new English sentences to Spanish. But it’s not as simple as calling model.predict(), because the decoder expects as input the word that was predicted at the previous time step. One way to do this is to write a custom memory cell that keeps track of the previous output and feeds it to the decoder at the next time step. However, to keep things simple, we can just call the model multiple times, predicting one extra word at each round:

In [None]:
def translate(sentence_en):
    translation = ""
    for word_idx in range(max_length):
        X = np.array([sentence_en])  # encoder input 
        X_dec = np.array(["startofseq " + translation])  # decoder input
        y_proba = model.predict((X, X_dec))[0, word_idx]  # last token's probas
        predicted_word_id = np.argmax(y_proba)
        predicted_word = text_vec_layer_es.get_vocabulary()[predicted_word_id]
        if predicted_word == "endofseq":
            break
        translation += " " + predicted_word
    return translation.strip()

The function simply keeps predicting one word at a time, gradually completing the translation, and it stops once it reaches the EOS token.

In [None]:
translate("I like soccer")

It works, at least it does with very short sentences. If we try playing with this model for a while, we will find that it’s not bilingual yet, and in particular it really struggles with longer sentences:

In [None]:
translate("I like soccer and also going to the beach")

The translation says "I like soccer and sometimes even the bus". How can we improve it? One way is to increase the training set size and add more LSTM layers in both the encoder and the decoder. However, this will only get us so far, so let’s look at more sophisticated techniques.

A regular recurrent layer only looks at past and present inputs before generating its output. In other words, **it is causal**: it cannot look into the future. This type of RNN makes sense when forecasting time series, but for tasks like text classification it is often preferable to look ahead at the next words before encoding a given word. For example, consider the phrases "the right arm", "the right person", and "the right to criticize": to properly encode the word "right", we need to look ahead. One solution is to run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, then combine their outputs at each time step, typically by concatenating them. This is what a **bidirectional recurrent layer** does.

<img src="./images/bidirectional-recurrent-layer.png" width="400">


To implement it in Keras, just wrap a recurrent layer in a tf.keras.layers.Bidirectional layer. It will create a clone of the recurrent layer (but in the reverse direction), and it will run both and concatenate their outputs.
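The mechanism can be mimicked without any neural network. In the sketch below, the hypothetical `running_sum` function plays the role of a causal recurrent layer (its output at each step depends only on past and present inputs), and `bidirectional` runs it in both directions and pairs up the two outputs at each time step:

```python
def running_sum(inputs):
    """Toy causal 'recurrent layer': output t only depends on inputs 0..t."""
    total, outputs = 0, []
    for x in inputs:
        total += x
        outputs.append(total)
    return outputs


def bidirectional(inputs, layer=running_sum):
    """Run the layer left-to-right and right-to-left, then combine per time step."""
    forward = layer(inputs)
    backward = layer(inputs[::-1])[::-1]  # flip back to realign the time steps
    return [(f, b) for f, b in zip(forward, backward)]


outputs = bidirectional([1, 2, 3])
print(outputs)
```

At the first time step, the forward output has only seen the first input, while the backward output has already seen the whole sequence. This is the look-ahead that a causal layer alone cannot provide.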

In [None]:
encoder = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_state=True))

This layer will now return four states instead of two (the short-term and long-term states of both the forward and the backward LSTM layers). We cannot use this quadruple state directly, because the decoder layer expects just two states. We cannot make the decoder bidirectional, since it must remain causal. Instead, we can concatenate the two short-term states, and the two long-term states:

In [None]:
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)]  # long-term (1 & 3)
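The slicing used above can be checked on plain lists. The sketch assumes that the four states come in the order indicated by the code comments (forward short-term, forward long-term, backward short-term, backward long-term). The vectors are dummy lists of 256 values, matching the 256 units of each direction:

```python
units = 256
# Dummy state vectors, in the assumed order returned by the bidirectional layer
h_forward, c_forward = [0.1] * units, [0.2] * units
h_backward, c_backward = [0.3] * units, [0.4] * units
states = [h_forward, c_forward, h_backward, c_backward]

short_term = states[::2]   # indices 0 and 2
long_term = states[1::2]   # indices 1 and 3

# Concatenating two 256-dimensional states gives one 512-dimensional state
merged_short_term = short_term[0] + short_term[1]
merged_long_term = long_term[0] + long_term[1]
print(len(merged_short_term), len(merged_long_term))
```

Each merged state has 512 values, which is consistent with the decoder LSTM using 512 units when the encoder uses 256 units per direction.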

Now we can build and train the model:

In [None]:
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")

Y_proba = output_layer(decoder_outputs)

model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_proba])

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [None]:
model.fit((X_train, X_train_dec), Y_train, epochs=10, validation_data=((X_valid, X_valid_dec), Y_valid))

In [None]:
translate("I like soccer")

Now suppose we have trained an encoder–decoder model, and we use it to translate the sentence "I like soccer" to Spanish. We are hoping that it will output the proper translation "me gusta el fútbol", but unfortunately it outputs "me gustan los jugadores". Looking at the training set, we notice many sentences like "I like cars" which translates to "me gustan los autos", so it wasn’t absurd for the model to output "me gustan los" after seeing "I like". Unfortunately, in this case it was a mistake since "soccer" is singular. The model could not go back and fix it, so it tried to complete the sentence as best it could, in this case using the word "jugadores". The following code shows the model making an error:

In [None]:
sentence_en = "I love cats and dogs"
translate(sentence_en)

How can we give the model a chance to go back and fix mistakes it made earlier? A common solution is **beam search**: it keeps track of a short list of the k most promising sentences, and at each decoder step it tries to extend them by one word, keeping only the k most likely sentences. The parameter k is called the **beam width**.

<img src="./images/beam-search.png" width="700">

In the example, we are using beam search with a beam width of 3. At the first decoder step, the model will output an estimated probability for each possible first word in the translated sentence. The top three words are "me" (75% estimated probability), "a" (3%), and "como" (1%). That’s our short list so far.

Next, we use the model to find the next word for each sentence. For the first sentence ("me"), the model outputs a probability of 36% for the word "gustan", 32% for the word "gusta", and 16% for the word "encanta". Note that these are actually conditional probabilities, given that the sentence starts with "me". Assuming the vocabulary has 1,000 words, we will end up with 1,000 probabilities per sentence. We then compute the probabilities of each of the 3,000 two-word sentences we considered. We do this by multiplying the estimated conditional probability of each word by the estimated probability of the sentence it completes. For example, the estimated probability of the sentence "me" was 75%, while the estimated conditional probability of the word "gustan" (given that the first word is "me") was 36%, so the estimated probability of the sentence "me gustan" is 75% × 36% = 27%. After computing the probabilities of all 3,000 two-word sentences, we keep only the top 3. In this example they all start with the word "me": "me gustan" (27%), "me gusta" (24%), and "me encanta" (12%). Right now, the sentence "me gustan" is winning, but "me gusta" has not been eliminated.

Then we repeat the same process: we use the model to predict the next word in each of these three sentences, and we compute the probabilities of all 3,000 three-word sentences we considered. Perhaps the top three are now "me gustan los" (10%), "me gusta el" (8%), and "me gusta mucho" (2%). At the next step we may get "me gusta el fútbol" (6%), "me gusta mucho el" (1%), and "me gusta el deporte" (0.2%). Notice that "me gustan" was eliminated, and the correct translation is now ahead. We boosted our encoder–decoder model’s performance without any extra training, simply by using it more wisely.
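To make the arithmetic of a single beam search step concrete, here is a minimal sketch in plain Python. The probabilities for the "me" branch come from the example above; the conditional probabilities for the "a" and "como" branches, as well as the helper name `extend_beam`, are made up for illustration and do not come from a real model.

```python
import heapq


def extend_beam(beam, next_word_probas, beam_width):
    """Extend every sentence in the beam by one word and keep the best ones."""
    candidates = []
    for sentence, sentence_proba in beam.items():
        for word, word_proba in next_word_probas[sentence].items():
            # probability of the longer sentence = P(sentence) * P(word | sentence)
            candidates.append((sentence_proba * word_proba, f"{sentence} {word}"))
    best = heapq.nlargest(beam_width, candidates)  # sorted, most likely first
    return {sentence: proba for proba, sentence in best}


# Short list after the first decoder step (values from the example above).
beam = {"me": 0.75, "a": 0.03, "como": 0.01}

# Conditional probabilities of a few next words for each sentence.
# The "a" and "como" entries are hypothetical.
next_word_probas = {
    "me": {"gustan": 0.36, "gusta": 0.32, "encanta": 0.16},
    "a": {"mi": 0.50},
    "como": {"el": 0.40},
}

new_beam = extend_beam(beam, next_word_probas, beam_width=3)
for sentence, proba in new_beam.items():
    print(f"{sentence}: {proba:.2f}")
```

With these numbers, the three surviving sentences are "me gustan", "me gusta", and "me encanta". In practice it is common to add log probabilities instead of multiplying probabilities: this gives the same ranking while avoiding numerical underflow on long sentences.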

The following is a very basic implementation of beam search. The TensorFlow Addons library includes a full seq2seq API that lets us build encoder–decoder models with beam search (and more). However, its documentation is currently very limited.

In [None]:
def beam_search(sentence_en, beam_width, verbose=False):
    X = np.array([sentence_en])  # encoder input
    X_dec = np.array(["startofseq"])  # decoder input
    y_proba = model.predict((X, X_dec))[0, 0]  # first token's probas
    top_k = tf.math.top_k(y_proba, k=beam_width)
    top_translations = [  # list of best (log_proba, translation)
        (np.log(word_proba), text_vec_layer_es.get_vocabulary()[word_id])
        for word_proba, word_id in zip(top_k.values, top_k.indices)
    ]
    
    # displays the top first words in verbose mode
    if verbose:
        print("Top first words:", top_translations)

    for idx in range(1, max_length):
        candidates = []
        for log_proba, translation in top_translations:
            if translation.endswith("endofseq"):
                candidates.append((log_proba, translation))
                continue  # translation is finished, so don't try to extend it
            X = np.array([sentence_en])  # encoder input
            X_dec = np.array(["startofseq " + translation])  # decoder input
            y_proba = model.predict((X, X_dec))[0, idx]  # last token's proba
            vocabulary = text_vec_layer_es.get_vocabulary()  # fetch once per step, not once per word
            for word_id, word_proba in enumerate(y_proba):
                word = vocabulary[word_id]
                candidates.append((log_proba + np.log(word_proba),
                                   f"{translation} {word}"))
        top_translations = sorted(candidates, reverse=True)[:beam_width]

        # displays the top translation so far in verbose mode
        if verbose:
            print("Top translations so far:", top_translations)

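        if all([tr.endswith("endofseq") for _, tr in top_translations]):
            break  # every translation in the beam is finished

    # return the best translation found, even if max_length was reached first
    return top_translations[0][1].replace("endofseq", "").strip()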

The following code shows how beam search can help:

In [None]:
beam_search(sentence_en, beam_width=3, verbose=True)

The correct translation is among the top 3 sentences found by beam search, but it's not the first. Since we're using a small vocabulary, the "UNK" token is quite frequent, so we may want to penalize it: dividing its probability by 2 in the beam search function will discourage beam search from using it too much.
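One possible way to apply such a penalty is sketched below in plain Python. The function name `penalized_log_probas`, the default `unk_id=1`, and the toy probabilities are assumptions made for illustration; the actual index of the unknown token should be checked in the target vocabulary before reusing this idea.

```python
import math

UNK_ID = 1  # assumption: index of the unknown token in the target vocabulary


def penalized_log_probas(y_proba, unk_id=UNK_ID, penalty=2.0):
    """Return log probabilities, after dividing the unknown token's
    probability by `penalty`."""
    log_probas = []
    for word_id, word_proba in enumerate(y_proba):
        if word_id == unk_id:
            word_proba = word_proba / penalty  # discourage the unknown token
        log_probas.append(math.log(word_proba))
    return log_probas


# Toy distribution over a 4-word vocabulary (hypothetical values):
# without the penalty, the unknown token (index 1) would be the top choice.
y_proba = [0.05, 0.40, 0.35, 0.20]

log_probas = penalized_log_probas(y_proba)
best_id = max(range(len(log_probas)), key=log_probas.__getitem__)
print(best_id)  # index of the most likely word after the penalty
```

The same adjustment could be applied to `y_proba` inside `beam_search`, just before the candidates are built. Note that the penalized values no longer sum to 1, which is fine here because they are only used to rank candidates.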

With all these improvements, we can get reasonably good translations for fairly short sentences. Unfortunately, this model will be really bad at translating long sentences. Once again, the problem comes from the limited short-term memory of RNNs.

## Attention Mechanism

Consider the path from the word "soccer" to its translation "fútbol" in the network architecture: it is quite long. This means that a representation of the word needs to be carried over many steps before it is actually used. Can’t we make this path shorter? We should allow the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. For example, at the time step where the decoder needs to output the word "fútbol", **it should focus its attention** on the word "soccer". This means that the path from an input word to its translation is much shorter, so the short-term memory limitations of RNNs have much less impact.

**Attention** is a widely investigated concept. In its most generic form, it can be described as an overall level of alertness, or an ability to engage with one's surroundings. When a subject is presented with different images, the subject's eye movements can reveal the salient image parts that attract the most attention. The human brain attends to these salient visual features at different neuronal stages. Neurons at the earliest stages are tuned to simple visual attributes such as intensity contrast, colour opponency, orientation, direction and velocity of motion, or stereo disparity at several spatial scales. Neuronal tuning becomes increasingly specialized with the progression from low-level to high-level visual areas, such that higher-level visual areas include neurons that respond only to corners or junctions, shape-from-shading cues, or views of specific real-world objects. Interestingly, research has also observed that different subjects tend to be attracted to the same salient visual cues.

Research has also discovered several forms of interaction between memory and attention. Since the human brain has a limited memory capacity, selecting which information to store becomes crucial in making the best use of the limited resources. The human brain does so by relying on attention: it dynamically stores in memory the information that the subject pays the most attention to.

The main idea behind the [**attention mechanism**](https://arxiv.org/abs/1409.0473) is to permit the decoder to utilize the most relevant parts of the input sequence in a flexible manner, by a weighted combination of all the encoded input vectors, with the most relevant vectors being attributed the highest weights.  

<img src="./images/attention-mechanism.png" width="300">

We insert a mechanism that takes the previous hidden state of the decoder and the list of encoded vectors, and uses them to generate score values that indicate how well the elements of the input sequence align with the current output. At each time step, the decoder's memory cell computes a weighted sum of all the encoder outputs, which determines which words it will focus on at this step. The weight $\alpha(t,i)$ is the weight of the $i^{th}$ encoder output at the $t^{th}$ decoder time step. It is this weighted sum (often called the context vector) that is then fed into the decoder to generate a translated output.
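The weighted sum itself is simple to compute. Here is a minimal sketch in plain Python, for a single decoder time step. The function name `context_vector`, the 2D encoder outputs, and the weights are toy values chosen for illustration; in the real model the weights are produced by the attention layer.

```python
def context_vector(weights, encoder_outputs):
    """Return the weighted sum of the encoder outputs for one decoder step."""
    n_dims = len(encoder_outputs[0])
    return [
        sum(w * output[d] for w, output in zip(weights, encoder_outputs))
        for d in range(n_dims)
    ]


# Toy values (assumptions): one 2D encoder output per input word.
encoder_outputs = [
    [1.0, 0.0],  # "I"
    [0.0, 1.0],  # "like"
    [2.0, 2.0],  # "soccer"
]
weights = [0.1, 0.1, 0.8]  # alpha(t, i) for one decoder step t; they sum to 1

context = context_vector(weights, encoder_outputs)
print(context)  # close to [1.7, 1.7], dominated by the "soccer" output
```

Because the weight of "soccer" is much larger than the others, the result is mostly a copy of that word's encoding, which is exactly the "focus" we want at this step.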

The rest of the decoder works just like earlier.

But where do these weights come from? They are generated by a small neural network (**attention layer**), which is trained jointly with the rest of the encoder–decoder model.

<img src="./images/attention-layer.png" width="300">

It starts with a Dense layer composed of a single neuron that processes each of the encoder outputs, along with the decoder's previous hidden state. This layer outputs a score (or **energy**) for each encoder output. This score measures **how well each output is aligned** with the decoder's previous hidden state. For example, if the model has already output "me gusta el", it is now expecting a noun: the word "soccer" is the one that best aligns with the current state, so it gets a high score. Finally, all the scores go through a softmax layer to get a final weight for each encoder output.
This type of artificial attention is thus a form of iterative re-weighting. Specifically, it dynamically highlights different components of a pre-processed input as they are needed for output generation. This makes it flexible and context dependent, like biological attention. Without an attention mechanism to highlight the salient information across the entire input, the decoder would only have access to the limited information carried by the encoder's final state, potentially missing important information.

There are several ways to compute the weights. The one described above is called [**Bahdanau attention** or **additive attention**](https://arxiv.org/abs/1409.0473) (sometimes also concatenative attention, since it concatenates the encoder output with the decoder's previous hidden state).
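The scoring step described above can be sketched in a few lines of plain Python. This follows the simplified description given in this section (a single neuron applied to the concatenation of each encoder output and the decoder's previous hidden state); the scoring network in the original paper is slightly richer, as it includes a tanh hidden layer. The function names, the hand-picked neuron weights, and the toy vectors are assumptions for illustration; in a real model the neuron's weights are learned.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention_weights(encoder_outputs, prev_decoder_state,
                               neuron_weights, bias=0.0):
    """Score each encoder output with a single neuron, then apply softmax."""
    scores = []
    for output in encoder_outputs:
        concatenated = output + prev_decoder_state  # list concatenation
        score = sum(w * x for w, x in zip(neuron_weights, concatenated)) + bias
        scores.append(score)
    return softmax(scores)


# Toy values (assumptions): 2D encoder outputs and a 2D decoder state.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # "I", "like", "soccer"
prev_decoder_state = [0.5, -0.5]
neuron_weights = [1.0, 1.0, 0.3, 0.3]  # hand-picked here, learned in practice

weights = additive_attention_weights(encoder_outputs, prev_decoder_state,
                                     neuron_weights)
print(weights)  # the last weight ("soccer") is by far the largest
```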

Another common attention mechanism is [**Luong attention** or **multiplicative attention**](https://arxiv.org/abs/1508.04025). Because the goal of the attention layer is to measure the similarity between one of the encoder outputs and the decoder's previous hidden state, this mechanism simply computes the dot product of these two vectors, as this is often a fairly good similarity measure, and modern hardware can compute it very efficiently. Moreover, it uses the decoder's hidden state at the current time step rather than at the previous time step, and it uses the output of the attention mechanism directly to compute the decoder’s predictions, rather than using it to compute the decoder’s current hidden state. Finally, the authors also proposed a variant of the dot product mechanism where the encoder outputs first go through a fully connected layer (without a bias term) before the dot products are computed.
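Both scoring functions can be sketched in plain Python as follows. The function names, the toy vectors, and the matrix `W` are assumptions for illustration; `W` stands for the weights of the bias-free fully connected layer, which would be learned during training.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention_weights(encoder_outputs, decoder_state):
    """Dot-product scoring: one dot product per encoder output."""
    scores = [sum(h * y for h, y in zip(decoder_state, output))
              for output in encoder_outputs]
    return softmax(scores)


def general_attention_weights(encoder_outputs, decoder_state, W):
    """Variant: project each encoder output with W (no bias) before the dot product."""
    projected = [
        [sum(row[c] * output[c] for c in range(len(output))) for row in W]
        for output in encoder_outputs
    ]
    return dot_attention_weights(projected, decoder_state)


# Toy values (assumptions): 2D encoder outputs and a 2D decoder state.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # "I", "like", "soccer"
decoder_state = [1.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity: reduces to the plain dot product

dot_weights = dot_attention_weights(encoder_outputs, decoder_state)
general_weights = general_attention_weights(encoder_outputs, decoder_state, W)
print(dot_weights)
print(general_weights)
```

Note that the dot product requires the decoder state and the (possibly projected) encoder outputs to have the same number of dimensions.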

Let’s add Luong attention to our encoder–decoder model. Notice that we need to pass all the encoder outputs to the Attention layer, so we need to set return_sequences=True when creating the encoder:

In [None]:
encoder = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True, return_state=True))

encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)]  # long-term (1 & 3)
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

Next, we create the attention layer and pass it the decoder states and the encoder outputs. However, to access the decoder states at each step we would need to write a custom memory cell. For simplicity, let’s use the decoder outputs instead of its states (in practice this works well too, and it’s much easier to code). Then we just pass the attention layer outputs directly to the output layer:

In [None]:
attention_layer = tf.keras.layers.Attention()
attention_outputs = attention_layer([decoder_outputs, encoder_outputs])
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(attention_outputs)

In [None]:
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])

We train the model:

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

In [None]:
model.fit((X_train, X_train_dec), Y_train, epochs=10, validation_data=((X_valid, X_valid_dec), Y_valid))

The model with attention is now able to handle much longer sentences. For example:

In [None]:
translate("I like soccer and also going to the beach")

In [None]:
beam_search("I like soccer and also going to the beach", beam_width=3, verbose=True)