<a href="https://colab.research.google.com/github/xinconggg/Machine-Learning/blob/main/Natural%20Language%20Processing%20with%20RNNs%20and%20Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup
This notebook requires Keras 2 rather than Keras 3. To get it, set the `TF_USE_LEGACY_KERAS` environment variable to `"1"` and import the `tf_keras` package. This ensures that `tf.keras` points to `tf_keras`, which is Keras 2.*.

In [1]:
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tf_keras

And TensorFlow ≥ 2.8:

In [2]:
from packaging import version
import tensorflow as tf

assert version.parse(tf.__version__) >= version.parse("2.8.0")

## Generating Shakespearean Text using a Character RNN
### Creating the Training Dataset
Using Keras's `tf.keras.utils.get_file` function, download all of Shakespeare's works:

In [3]:
import sys

assert sys.version_info >= (3, 7)

In [6]:
import tensorflow as tf

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Print the first few lines to ensure that the code is working:

In [7]:
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


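Now use a `tf.keras.layers.TextVectorization` layer to encode this text. Set `split="character"` to get character-level encoding rather than the default word-level encoding, and `standardize="lower"` to convert the text to lowercase: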

In [8]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

Each character is now mapped to an integer, starting at 2, since the `TextVectorization` layer reserves the value 0 for padding tokens and 1 for unknown characters. Since we won't need either of these tokens, we can subtract 2 from the character IDs, then compute the number of distinct characters and the total number of characters:

In [9]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394
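
To make the ID mapping concrete, here is a minimal pure-Python sketch of roughly what the layer does: lowercase the text, rank the characters by frequency, and assign IDs from 2 upward. The toy string and the helper name `build_char_ids` are illustrative assumptions, and the order among equally frequent characters may differ from the real `TextVectorization` layer.

```python
from collections import Counter

def build_char_ids(text):
    """Map each distinct lowercase character to an ID, most frequent first.

    IDs start at 2 because 0 (padding) and 1 (unknown) are reserved.
    """
    counts = Counter(text.lower())
    # Sort by decreasing frequency, then alphabetically for a stable order
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: idx + 2 for idx, ch in enumerate(ranked)}

toy_text = "To be or not to be"
char_ids = build_char_ids(toy_text)

# Shift the IDs down by 2 so that they start at 0, as in the cell above
encoded_toy = [char_ids[ch] - 2 for ch in toy_text.lower()]

print(len(char_ids))    # number of distinct characters in the toy text
print(encoded_toy[:5])  # IDs of the first five characters
```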

We can turn this very long sequence into a dataset of windows that we can then use to train a sequence-to-sequence RNN. Write a small function to convert a long sequence of character IDs into a dataset of input/target window pairs:

In [10]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

Walkthrough of the function:
- It takes a sequence as input (i.e., the encoded text) and creates a dataset containing all the windows of the desired length.
- It uses windows of `length + 1` items, since we need the next character for the target.
- It then optionally shuffles the windows, batches them, splits them into input/target pairs, and activates prefetching.
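
The windowing and input/target split can be checked on a toy sequence. This is a plain-Python sketch of the same idea, not the `tf.data` pipeline itself; `make_windows` is a hypothetical helper, and shuffling and batching are left out.

```python
def make_windows(sequence, length):
    """Return (input, target) pairs from all windows of length + 1, shifted by 1."""
    pairs = []
    # Only full windows are kept, like drop_remainder=True
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))  # target = input shifted by one step
    return pairs

toy_sequence = list(range(6))  # stand-in for encoded character IDs
pairs = make_windows(toy_sequence, length=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target is its input shifted one step into the future, which is what makes this a sequence-to-sequence task.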

Create the training (roughly 90%), validation (roughly 5%), and test (roughly 5%) sets:

In [11]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

### Building and Training the Char-RNN Model
Since the dataset is large and language modeling is a difficult task, we need more than a simple RNN with a few recurrent neurons. Build and train a model with one GRU layer composed of 128 units:


In [12]:
tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model.keras", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs= 3,
                    callbacks=[model_ckpt])

Epoch 1/3
Epoch 2/3
Epoch 3/3


Walkthrough of the code:
- Use an `Embedding` layer as the first layer to encode the character IDs. The `Embedding` layer's number of input dimensions is the number of distinct character IDs, and the number of output dimensions is a hyperparameter that can be tuned. The inputs of the `Embedding` layer will be 2D tensors of shape [*batch size*, *window length*], and the output will be a 3D tensor of shape [*batch size*, *window length*, *embedding size*].
- The `Dense` layer is used for the output layer: it must have 39 units (`n_tokens`) because there are 39 distinct characters in the text. The 39 output probabilities should sum up to 1 at each time step, so we apply the **softmax** activation function to the outputs of the `Dense` layer.
- Lastly, we compile the model using the `sparse_categorical_crossentropy` loss and a **Nadam** optimizer and train the model for several epochs using a `ModelCheckpoint` callback to save the best model as training progresses.
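
The shape change through an embedding layer can be checked with a plain NumPy lookup. This is a sketch of the idea only: the random matrix stands in for the trained embedding weights, and the batch size of 32 and window length of 100 are taken from the dataset settings above.

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, embed_dim = 39, 16         # same sizes as the model above
batch_size, window_length = 32, 100

# One row per character ID (random here, learned in the real layer)
embedding_matrix = rng.normal(size=(n_tokens, embed_dim))
char_ids = rng.integers(0, n_tokens, size=(batch_size, window_length))

# An embedding layer is a row lookup for every ID in the batch
embedded = embedding_matrix[char_ids]

print(char_ids.shape)  # 2D: [batch size, window length]
print(embedded.shape)  # 3D: [batch size, window length, embedding size]
```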

Since the model does not handle text preprocessing, let's wrap it in a final model containing the `tf.keras.layers.TextVectorization` layer as the first layer, plus a `tf.keras.layers.Lambda` layer to subtract 2 from the character IDs:

In [13]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),
    model
])

Use it to predict the next character in a sentence:

In [14]:
# The wrapped model handles tokenization, so we can pass a raw string
input_text = tf.convert_to_tensor(["To be or not to b"])

# Predict the character probabilities at the last time step
y_proba = shakespeare_model.predict(input_text)[0, -1]

# Choose the most probable character ID (we expect it to map to 'e')
y_pred = tf.argmax(y_proba)

# Look up the character; +2 undoes the earlier ID shift (pad and unknown tokens)
character = text_vec_layer.get_vocabulary()[y_pred + 2]

print(f"Predicted character: {character}")


Predicted character: e


### Generating Fake Shakespearean Text
To generate new text using the Char-RNN model, we could feed it some text, make the model predict
the most likely next letter, add it to the end of the text, then give the extended text to the model to
guess the next letter, and so on. This is called *greedy decoding*. But in practice this often leads to the same words being repeated over and over again. Instead, we can sample the next character
randomly, with a probability equal to the estimated probability, using TensorFlow’s
`tf.random.categorical()` function. This will generate more diverse and interesting text. The
`categorical()` function samples random class indices, given the class log probabilities (logits). For example:

In [15]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 0, 1, 1, 1, 0, 0, 0]])>

To have more control over the diversity of the generated text, we can divide the logits by a number called the *temperature*, which we can tweak as we wish. A temperature close to 0 favors high-probability characters, while a very high temperature gives all characters a nearly equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text. The following `next_char()` custom helper function uses this approach to pick the next character to add to the input text:

In [16]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]
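
To see what dividing by the temperature does to a distribution, here is a small NumPy sketch that rescales the log-probabilities and renormalizes them. The helper name `apply_temperature` is an assumption for illustration; the probabilities are the same 50%, 40%, and 10% used earlier.

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Divide the log-probabilities by the temperature, then renormalize."""
    logits = np.log(probas) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

probas = np.array([0.5, 0.4, 0.1])
for temperature in (0.1, 1.0, 100.0):
    print(temperature, np.round(apply_temperature(probas, temperature), 3))
```

A temperature of 1 leaves the distribution unchanged, a low temperature concentrates almost all the mass on the most likely class, and a very high temperature flattens it toward uniform.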

We can write another helper function that will repeatedly call the `next_char()` function to get the next character, and append it to the given text:

In [17]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

Now, we can try generating some text with different temperature values:

In [18]:
# Temperature = 0.01
tf.random.set_seed(42)

input_text = tf.convert_to_tensor(["To be or not to be"])
print(extend_text(input_text, temperature=0.01))

tf.Tensor([b'To be or not to be a shall the duke\nwill not show me to the duke of '], shape=(1,), dtype=string)


In [19]:
# Temperature = 1
input_text = tf.convert_to_tensor(["To be or not to be"])
print(extend_text(input_text, temperature=1))

tf.Tensor([b'To be or not to begun it off\ntake the battle sprittinous sonds\nas te'], shape=(1,), dtype=string)


In [20]:
# Temperature = 100
input_text = tf.convert_to_tensor(["To be or not to be"])
print(extend_text(input_text, temperature=100))

tf.Tensor([b"To be or not to bepevicm-v lv!?$ez?gmjz :3?ljb'va;!td&\ni.ur3l'-j!3eu"], shape=(1,), dtype=string)


### Stateful RNN
A stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So we need sequential, nonoverlapping input sequences rather than the shuffled, overlapping sequences used to train stateless RNNs. Hence, when creating the `tf.data.Dataset`, we must use `shift=length` instead of `shift=1` when calling the `window()` method, and we must not call the `shuffle()` method.

The following `to_dataset_for_stateful_rnn()` utility function prepares a dataset for a stateful RNN. The simplest strategy that satisfies the requirement above is to use batches containing a single window (`batch(1)`). The function below instead uses a fixed batch size of 64 with `drop_remainder=True`; note that this places consecutive windows in the same batch, so a sequence in one batch does not continue the corresponding sequence in the previous batch:

In [21]:
def to_dataset_for_stateful_rnn(sequence, length, batch_size=64):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    ds = ds.flat_map(lambda window: window.batch(length + 1))
    # Set a fixed batch size here
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

stateful_train_set = to_dataset_for_stateful_rnn(encoded[:1_000_000], length, batch_size=64)
stateful_valid_set = to_dataset_for_stateful_rnn(encoded[1_000_000:1_060_000], length, batch_size=64)
stateful_test_set = to_dataset_for_stateful_rnn(encoded[1_060_000:], length, batch_size=64)
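
For the state carried across batches to be meaningful with a batch size above 1, row *i* of each batch has to continue row *i* of the previous batch. One way to get that is to split the sequence into `batch_size` contiguous chunks and read them in parallel. The following is a NumPy sketch of that layout, not the notebook's `tf.data` pipeline; `stateful_batches` is a hypothetical helper.

```python
import numpy as np

def stateful_batches(sequence, length, batch_size):
    """Build batches in which row i continues row i of the previous batch."""
    sequence = np.asarray(sequence)
    chunk_len = len(sequence) // batch_size
    # Each row is one contiguous chunk of the original sequence
    rows = sequence[:batch_size * chunk_len].reshape(batch_size, chunk_len)
    batches = []
    # Step through all the chunks in parallel, using non-overlapping windows
    for start in range(0, chunk_len - length, length):
        inputs = rows[:, start:start + length]
        targets = rows[:, start + 1:start + length + 1]
        batches.append((inputs, targets))
    return batches

toy_sequence = np.arange(40)  # stand-in for encoded character IDs
batches = stateful_batches(toy_sequence, length=3, batch_size=2)
first_inputs, first_targets = batches[0]
second_inputs, _ = batches[1]
print(first_inputs)
print(second_inputs)
```

In the toy output, each row of the second batch starts right after the last input of the same row in the first batch.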

To create the stateful RNN, we need to set `stateful=True` when creating each recurrent layer. Since a stateful RNN needs to know the batch size, we must specify it up front: here an `Input` layer with `batch_size=64` does this (the alternative is to set the `batch_input_shape` argument in the first layer). Note that the sequence-length dimension can be left unspecified, since the input sequences can have any length:

In [22]:
from tensorflow.keras.layers import Input

tf.random.set_seed(42)

model = tf.keras.Sequential([
    Input(shape=(None,), batch_size=64),  # Specify the batch size and sequence length
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True, stateful=True),  # No need for batch_input_shape here
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. For this, we can use a small custom Keras callback:

In [23]:
class ResetStatesCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        # Iterate over each RNN layer (GRU in this case) and reset states
        for layer in self.model.layers:
            if isinstance(layer, tf.keras.layers.RNN):
                layer.reset_states()

Now we can compile the model and train it using the callback function:

In [24]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(stateful_train_set, validation_data=stateful_valid_set,
                    epochs=10, callbacks=[ResetStatesCallback(), model_ckpt])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Converting the Stateful RNN to a Stateless RNN then using it
To use the model with different batch sizes, we need to create a stateless copy:

In [25]:
stateless_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

To set the weights, we first need to build the model so that the weights get created:

In [26]:
stateless_model.build(tf.TensorShape([None, None]))

In [27]:
stateless_model.set_weights(model.get_weights())

In [28]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    stateless_model
])

In [29]:
tf.random.set_seed(42)

text = tf.convert_to_tensor(["To be or not to be"])
print(extend_text(text, temperature=0.01))

tf.Tensor([b'To be or not to be her for the senventio:\ni will her for the senvent'], shape=(1,), dtype=string)


## Sentiment Analysis
**Sentiment Analysis** is a Natural Language Processing (NLP) technique used to determine the emotional tone or opinion expressed in a piece of text. It classifies text as positive, negative, or neutral, helping to understand the author's sentiment or attitude.

This technique is widely applied in areas like:
- **Social media monitoring** to analyze customer feedback.
- **Product reviews** to assess user satisfaction.
- **Customer service chats** to gauge emotions and improve responses.

By leveraging machine learning models and NLP, sentiment analysis can extract insights from large volumes of text data, improving decision-making in marketing, customer service, and brand management.

Load the IMDb dataset using the TensorFlow Datasets library and use the first 90% of the training set for training and the remaining 10% for validation:

In [30]:
import tensorflow_datasets as tfds

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True
)
tf.random.set_seed(42)
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

Inspect a few reviews:

In [31]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8")[:200], "...")
    print("Label:", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun ...
Label: 0
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful perf ...
Label: 1


Some reviews are easy to classify: the first review includes the words "terrible movie" in its first sentence. However, in many cases it is not that easy: the third review starts off positively, but it is ultimately a negative review (label 0).

To build a model for this task, we need to preprocess the text and chop it into words instead of characters using `tf.keras.layers.TextVectorization`. Limit the vocabulary to 1,000 tokens, including the most frequent 998 words plus a padding token and a token for unknown words:

In [32]:
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

Create the model and train it:

In [33]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Masking
In Keras, making a model ignore padding tokens is simple: set `mask_zero=True` in the `Embedding` layer. This creates a mask tensor that marks padding tokens (ID 0) as False and other tokens as True. The mask automatically propagates to subsequent layers that support masking (i.e., layers with `supports_masking=True`).

For example:

- **Recurrent layers** use the mask to ignore padding steps by copying the output from the previous time step.
- Mask propagation continues through layers with `return_sequences=True`, but stops at the first layer with `return_sequences=False`.

In a sentiment analysis model with a GRU layer, the mask will be used by the GRU to handle padding but won’t propagate beyond it if `return_sequences=False`.
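
The following NumPy sketch illustrates both ideas: building the mask from the token IDs, and a toy recurrence that keeps its previous state on padded steps. The running sum stands in for a real recurrent cell, and `masked_running_sum` is a hypothetical helper, not a Keras API.

```python
import numpy as np

# Two padded sequences of token IDs (0 is the padding ID)
token_ids = np.array([[86, 18, 0, 0, 0],
                      [11, 7, 1, 116, 217]])

mask = token_ids != 0  # True for real tokens, False for padding

def masked_running_sum(values, mask):
    """Toy recurrence: update the state on real steps, copy it on padded steps."""
    state = np.zeros(values.shape[0])
    for t in range(values.shape[1]):
        candidate = state + values[:, t]
        state = np.where(mask[:, t], candidate, state)
    return state

# Add 1 per time step: without the mask both rows would end at 5
final_state = masked_running_sum(np.ones(token_ids.shape), mask)
print(mask)
print(final_state)
```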








Using masking layers and automatic mask propagation works well for simple models, but it does not always work for more complex ones, such as those that mix `Conv1D` layers with recurrent layers. In those cases, you need to compute the mask explicitly and pass it to the appropriate layers, using either the functional API or the subclassing API. For the simple sentiment model above, automatic propagation is enough: the following model is identical to the previous one, except that the `Embedding` layer sets `mask_zero=True`:

In [34]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Another approach to masking is to feed the model with ragged tensors. Just set `ragged=True` when creating the `TextVectorization` layer:

In [35]:
text_vec_layer_ragged = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, ragged=True)
text_vec_layer_ragged.adapt(train_set.map(lambda reviews, labels: reviews))
text_vec_layer_ragged(["Great movie!", "This is DiCaprio's best role."])

<tf.RaggedTensor [[86, 18], [11, 7, 1, 116, 217]]>

## An Encoder-Decoder Network for Neural Machine Translation
Build a simple Neural Machine Translation (NMT) model as follows: English sentences are fed as inputs to the encoder, and the decoder outputs the Spanish translations. Note that the Spanish translations are also used as inputs to the decoder during training, but shifted back by one step. In other words, during training the decoder is given as input the word that it should have output at the previous step. This is called *teacher forcing*, a technique that significantly speeds up training and improves the model's performance.
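
The one-step shift used in teacher forcing can be shown on a single sentence. This is a plain-Python sketch: `teacher_forcing_pair` is a hypothetical helper, and the `startofseq` and `endofseq` markers follow the convention this notebook uses for the Spanish text.

```python
def teacher_forcing_pair(target_sentence, sos="startofseq", eos="endofseq"):
    """Return the decoder input and the decoder target for one sentence."""
    words = target_sentence.split()
    decoder_input = [sos] + words    # what the decoder is fed at each step
    decoder_target = words + [eos]   # what the decoder should output at each step
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("me gusta el fútbol")
for given, expected in zip(dec_in, dec_out):
    print(f"{given:>10} -> {expected}")
```

At every step the decoder receives the correct previous word, whatever it actually predicted, and is trained to output the next one.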

Each word is initially represented by its ID (e.g., 954 for the word "soccer"). Next, an `Embedding` layer returns the word embedding. These word embeddings are then fed into the encoder and the decoder.

At each step, the decoder outputs a score for each word in the output vocabulary (i.e., Spanish), then the softmax activation function turns these scores into probabilities. For example, at the first step the
word “Me” may have a probability of 7%, “Yo” may have a probability of 1%, and so on. The word
with the highest probability is output.

To build the model, we first need to download a dataset of English/Spanish sentence pairs:

In [36]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "nlp"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

In [37]:
url = "https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
path = tf.keras.utils.get_file("spa-eng.zip", origin=url, cache_dir="datasets",
                               extract=True)
text = (Path(path).with_name("spa-eng") / "spa.txt").read_text()

Each line contains an English sentence and the corresponding Spanish translation, separated by a tab. Start by removing the Spanish characters “¡” and “¿”, which the `TextVectorization` layer doesn't handle, then parse the sentence pairs and shuffle them. Finally, split them into 2 separate lists, 1 per language:

In [38]:
import numpy as np

text = text.replace("¡", "").replace("¿", "")
pairs = [line.split("\t") for line in text.splitlines()]
np.random.seed(42)
np.random.shuffle(pairs)
sentences_en, sentences_es = zip(*pairs)  # separates the pairs into 2 lists

Take a look at the first 3 sentence pairs:

In [39]:
for i in range(3):
  print(sentences_en[i], "=>", sentences_es[i])

How boring! => Qué aburrimiento!
I love sports. => Adoro el deporte.
Would you like to swap jobs? => Te gustaría que intercambiemos los trabajos?


Next, create 2 `TextVectorization` layers, 1 per language, and adapt them to the text:

In [40]:
vocab_size = 1000
max_length = 50
text_vec_layer_en = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length)
text_vec_layer_en.adapt(sentences_en)
text_vec_layer_es.adapt([f"startofseq {s} endofseq" for s in sentences_es])

Take note that:
- The vocabulary size is limited to only 1,000, since the training set is not very large and a small value speeds up training.
- Since all sentences in the dataset have a maximum of 50 words, `output_sequence_length` is set to 50 so that the input sequences are automatically padded with zeros until they are all 50 tokens long. Any sentence with more than 50 words would be cropped to 50 tokens.
- For the Spanish text, "startofseq" (SOS) and "endofseq" (EOS) were added to each sentence when adapting the `TextVectorization` layer.

Inspect the first 10 tokens in both vocabularies:

In [41]:
text_vec_layer_en.get_vocabulary()[:10]

['', '[UNK]', 'the', 'i', 'to', 'you', 'tom', 'a', 'is', 'he']

In [42]:
text_vec_layer_es.get_vocabulary()[:10]

['', '[UNK]', 'startofseq', 'endofseq', 'de', 'que', 'a', 'no', 'tom', 'la']

Both vocabularies start with the padding token and the unknown token, followed (in the Spanish vocabulary only) by the SOS and EOS tokens, then the actual words, sorted by decreasing frequency.

Create the training and validation sets:

In [43]:
X_train = tf.constant(sentences_en[:100_000])
X_valid = tf.constant(sentences_en[100_000:])
X_train_dec = tf.constant([f"startofseq {s}" for s in sentences_es[:100_000]])
X_valid_dec = tf.constant([f"startofseq {s}" for s in sentences_es[100_000:]])
Y_train = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[:100_000]])
Y_valid = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[100_000:]])

Now we can build the translation model. We will use the functional API, since the model is not sequential. It requires 2 inputs: one for the encoder and one for the decoder:

In [44]:
tf.random.set_seed(42)

encoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
decoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)

Next, we need to encode these sentences using the `TextVectorization` layers prepared earlier, followed by an `Embedding` layer for each language, with `mask_zero=True` to ensure masking is handled automatically:

In [45]:
embed_size = 128
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)
encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=True)
decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=True)
encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

Now we can create the encoder and pass the embedded inputs:

In [46]:
encoder = tf.keras.layers.LSTM(512, return_state=True)
encoder_outputs, *encoder_state = encoder(encoder_embeddings)

To keep things simple, we just used a single `LSTM` layer, but several of them can be stacked. We set `return_state=True` to get a reference to the layer's final state. Since this is an `LSTM` layer, it returns 2 states: the short-term state and the long-term state. Now we can use this (double) state as the initial state of the decoder:

In [47]:
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

Next, we pass the decoder's outputs through a `Dense` layer with the softmax activation function to get the word probabilities for each step:

In [48]:
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(decoder_outputs)

Now we just need to create the Keras model, compile it, then train it:

In [49]:
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=3,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7da0645d4940>

The model can now be used to translate new English sentences to Spanish, but it's not as simple as just calling `model.predict()`, because the decoder expects as input the word that was predicted at the previous time step. One way to do this is to write a custom memory cell that keeps track of the previous output and feeds it to the decoder at the next time step. However, to keep things simple, we can just call the model multiple times, predicting one extra word at each round. Let's write a utility function for that:

In [50]:
def translate(sentence_en):
    translation = ""
    for word_idx in range(max_length):
        X = tf.convert_to_tensor([sentence_en])  # encoder input
        X_dec = tf.convert_to_tensor(["startofseq " + translation])  # decoder input
        y_proba = model.predict((X, X_dec))[0, word_idx]  # last token's probas
        predicted_word_id = np.argmax(y_proba)
        predicted_word = text_vec_layer_es.get_vocabulary()[predicted_word_id]

        if predicted_word == "[UNK]":
            # Handle [UNK] token
            translation += " <unknown>"
            continue

        if predicted_word == "endofseq":
            break

        translation += " " + predicted_word

    return translation.strip()

The function simply keeps predicting 1 word at a time, gradually completing the translation. It will stop once it reaches the EOS token:

In [51]:
translate("I like soccer")



'me gusta el fútbol'

### Bidirectional RNNs
**Unidirectional RNNs** process input sequences in a single direction, either forward (from past to future) or backward (from future to past). This makes them suitable for tasks like language modeling, where predicting the next word can rely only on past context.

In contrast, **Bidirectional RNNs** process sequences in both directions and combine information from the past and future contexts. This gives the model a more complete view of each input element, since it considers what comes both before and after it.

For tasks like text classification or in the encoder of a sequence-to-sequence (seq2seq) model, it is often beneficial to look ahead at future words when encoding a given word. This is because understanding the context of a word is improved when both its preceding and following words are considered. For example, in sentiment analysis, the sentiment of a word can be influenced by the words around it, not just those that precede it. Similarly, in seq2seq models, a bidirectional encoder allows the model to better capture the full context of the input sequence before generating the output, improving overall performance.








To implement a Bidirectional recurrent layer in Keras, just wrap a recurrent layer in a `tf.keras.layers.Bidirectional` layer. For example, the following `Bidirectional` layer could be used as the encoder in the translation model:

In [52]:
tf.random.set_seed(42)

encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_state=True))

However, the layer will now return 4 states instead of 2: the final short-term and long-term states of the forward LSTM layer, and the final short-term and long-term states of the backward LSTM layer. To deal with this, we can concatenate the 2 short-term states and concatenate the 2 long-term states:

In [53]:
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)]  # long-term (1 & 3)
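
The effect of the `[::2]` and `[1::2]` slices can be checked on dummy arrays. This NumPy sketch assumes the state order described above (forward short-term, forward long-term, backward short-term, backward long-term) and uses random values in place of real LSTM states; the batch size of 4 is arbitrary.

```python
import numpy as np

batch_size, units = 4, 256
rng = np.random.default_rng(0)

# Stand-ins for: forward short-term, forward long-term,
# backward short-term, backward long-term
states = [rng.normal(size=(batch_size, units)) for _ in range(4)]

short_term = np.concatenate(states[::2], axis=-1)   # states 0 and 2
long_term = np.concatenate(states[1::2], axis=-1)   # states 1 and 3

print(short_term.shape, long_term.shape)
```

Each concatenated state has 2 × 256 = 512 features, which is why a decoder with 512 units can take them as its initial state.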

Complete the model and train it:

In [54]:
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(decoder_outputs)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=3,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7da01fb36410>

In [55]:
translate("I like soccer")



'me gusta el fútbol'

## Attention Mechanisms
Consider the path from the word “soccer” to its translation “fútbol”: it is quite
long! This means that a representation of this word (along with all the other words) needs to be
carried over many steps before it is actually used.

Attention mechanisms allow neural networks to **focus on the most relevant parts of input data** when making predictions. Originally introduced for sequence-to-sequence (seq2seq) models in machine translation, attention has since become a core component in many deep learning tasks like text summarization, image captioning, and question answering.

In traditional RNNs or LSTMs, the encoder compresses the **entire input sequence into a fixed-length vector**, which can cause performance issues for long sequences. Attention solves this by allowing the decoder to dynamically "attend" to different parts of the input sequence at each time step, instead of relying solely on a fixed context vector.

**Types of Attention Mechanisms:**
- **Bahdanau Attention (Additive)**
  - Computes a weighted sum of the encoder outputs, using a learned score function that measures the importance of each input step.

- **Luong Attention (Multiplicative)**
  - Uses a dot product between the decoder hidden state and the encoder outputs to compute the attention scores, which makes it more efficient.

Keras provides a `tf.keras.layers.Attention` layer for *Luong attention* and a `tf.keras.layers.AdditiveAttention` layer for *Bahdanau attention*.
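
The core computation of dot-product attention fits in a few lines of NumPy. This is a minimal sketch under simplifying assumptions: no masking, no scaling, and the encoder outputs serve as both keys and values. The function name and the toy shapes are illustrative.

```python
import numpy as np

def dot_product_attention(queries, keys_values):
    """queries: [batch, T_dec, d]; keys_values: [batch, T_enc, d]."""
    # Similarity between every decoder step and every encoder step
    scores = queries @ keys_values.transpose(0, 2, 1)  # [batch, T_dec, T_enc]
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over encoder steps
    context = weights @ keys_values                    # [batch, T_dec, d]
    return context, weights

rng = np.random.default_rng(42)
decoder_out = rng.normal(size=(2, 3, 8))  # 2 sentences, 3 decoder steps
encoder_out = rng.normal(size=(2, 5, 8))  # 2 sentences, 5 encoder steps
context, weights = dot_product_attention(decoder_out, encoder_out)
print(context.shape, weights.shape)
```

Each decoder step gets its own set of weights over the encoder steps, so the context vector is no longer one fixed summary of the whole input.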

Let's add *Luong attention* to the encoder-decoder model. Since we need to pass all the encoder's outputs to the `Attention` layer, we first need to set `return_sequences=True` when creating the encoder:

In [56]:
tf.random.set_seed(42)

encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True))

However, because the encoder is bidirectional, it returns 4 states instead of 2: the final short-term and long-term states of the forward LSTM layer, followed by the final short-term and long-term states of the backward LSTM layer. To deal with this, we can concatenate the 2 short-term states and concatenate the 2 long-term states:

In [57]:
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)]  # long-term (1 & 3)
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)
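The `[::2]` and `[1::2]` slices in the cell above can be checked on a plain list. The sketch below uses placeholder strings (hypothetical labels, not real tensors) in the order the bidirectional layer returns its states. It also shows why the decoder has 512 units: concatenating two 256-dimensional states yields 512 dimensions.

```python
# Placeholder labels standing in for the four state tensors, in the order
# returned by the Bidirectional LSTM: forward h, forward c, backward h, backward c
states = ["forward_h", "forward_c", "backward_h", "backward_c"]

short_term = states[::2]   # indices 0 and 2
long_term = states[1::2]   # indices 1 and 3

print(short_term)  # ['forward_h', 'backward_h']
print(long_term)   # ['forward_c', 'backward_c']

# Each direction has 256 units, so each concatenated state has 512 dimensions,
# which must match the number of units in the decoder LSTM
encoder_units = 256
concatenated_size = 2 * encoder_units
print(concatenated_size)  # 512
```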

Next, we need to create the attention layer and pass it the decoder's states and the encoder's outputs. However, to access the decoder's states at each step, we would need to write a custom memory cell. For simplicity, we use the decoder's outputs instead of its states, then pass the attention layer's outputs directly to the output layer:

In [58]:
attention_layer = tf.keras.layers.Attention()
attention_outputs = attention_layer([decoder_outputs, encoder_outputs])
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(attention_outputs)

Build, compile and train the model:

In [59]:
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tf_keras.src.callbacks.History at 0x7d9fc29568c0>

In [60]:
translate("I like soccer and also going to the beach")



'me gusta el fútbol y me gusta también a la playa'

The model is now able to handle much longer sentences since the attention layer provides a way to focus the attention of the model on part of the inputs.

## Hugging Face's Transformers Library
Hugging Face is an AI company that has built a whole ecosystem of easy-to-use open source tools for NLP, vision, and beyond. Hugging Face's Transformers library allows us to easily download a pretrained model, including its corresponding tokenizer, and then fine-tune it on our own dataset, if needed.

The simplest way to use the Transformers library is to use the `transformers.pipeline()` function: just specify which task we want, such as sentiment analysis, and it downloads a default pretrained model that is ready to be used:

In [62]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # many other tasks are available
result = classifier("The actors were very convincing.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


`result` is a Python list containing 1 dictionary per input:

In [63]:
result

[{'label': 'POSITIVE', 'score': 0.9998071789741516}]

Models can be very biased: a model may favor or disfavor some countries depending on the data it was trained on, so we must be careful when using it. For example:

In [64]:
classifier(["I am from India.", "I am from Iraq."])

[{'label': 'POSITIVE', 'score': 0.9896161556243896},
 {'label': 'NEGATIVE', 'score': 0.9811071157455444}]

The `pipeline()` function uses the default model for the given task. For example, for text classification tasks such as sentiment analysis, it defaults to the `distilbert-base-uncased-finetuned-sst-2-english` model: a DistilBERT model with an uncased tokenizer, trained on English Wikipedia and a corpus of English books, then fine-tuned on the SST-2 sentiment dataset. However, we can also specify the model manually:

In [65]:
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier_mnli = pipeline("text-classification", model=model_name)
classifier_mnli("She loves me. [SEP] She loves me not.")

Device set to use cuda:0


[{'label': 'contradiction', 'score': 0.9790192246437073}]

Even though the pipeline API is very simple and convenient, sometimes we might need more control. For such cases, the Transformers library provides many classes, including all sorts of tokenizers, model configurations, callbacks and many more. For example, let's load the same `DistilBERT` model along with its corresponding tokenizer, using the `TFAutoModelForSequenceClassification` and `AutoTokenizer` classes:

In [66]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Let's tokenize a couple of sentence pairs. In the following code, we activate padding and specify that we want TensorFlow tensors instead of Python lists:

In [67]:
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                       "Joe lived for a very long time. [SEP] Joe is old."],
                      padding=True, return_tensors="tf")
token_ids

{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[ 101, 1045, 2066, 4715, 1012,  102, 2057, 2035, 2293, 4715,  999,
         102,    0,    0,    0],
       [ 101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012,  102, 3533,
        2003, 2214, 1012,  102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

The output is a dictionary-like instance of the `BatchEncoding` class, which contains the sequences of token IDs (`input_ids`) as well as a mask (`attention_mask`) containing 0s for the padding tokens.

If we set `return_token_type_ids=True` when calling the tokenizer, we will get an extra tensor that indicates which sentence each token belongs to.
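The relationship between padding and the attention mask can be sketched in plain Python. The `pad_batch` helper below is hypothetical (it is not part of the Transformers library), and it assumes right-side padding with a pad token ID of 0, which is what the output above shows for this tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 0 = padding
    return input_ids, attention_mask

# Token IDs copied from the tokenizer output above, with the padding removed
batch = [
    [101, 1045, 2066, 4715, 1012, 102, 2057, 2035, 2293, 4715, 999, 102],
    [101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012, 102,
     3533, 2003, 2214, 1012, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
```

The first sequence has 12 tokens and the second has 15, so the first one receives 3 padding tokens and 3 matching zeros in its mask.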

Next, we can directly pass this `BatchEncoding` object to the model:

In [68]:
outputs = model(token_ids)
outputs

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-2.1123817 ,  1.17868   ,  1.4100995 ],
       [-0.01478346,  1.0962477 , -0.99199575]], dtype=float32)>, hidden_states=None, attentions=None)

Lastly, we apply the softmax activation function to convert these logits to class probabilities, and use the `tf.argmax()` function to predict the class with the highest probability for each input sentence pair:

In [69]:
Y_probas = tf.keras.activations.softmax(outputs.logits)
Y_probas

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.01619703, 0.43523633, 0.54856664],
       [0.2265597 , 0.6881726 , 0.08526774]], dtype=float32)>

In [70]:
Y_pred = tf.argmax(Y_probas, axis=1)
Y_pred  # 0 = contradiction, 1 = entailment, 2 = neutral

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([2, 1])>

The model correctly classified the 1st sentence pair as neutral and the 2nd sentence pair as entailment.
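As a sanity check, the softmax and argmax steps can be reproduced by hand on the logits printed above. This is a small pure-Python sketch; the `softmax` helper is written here for illustration, and the label order is taken from the comment in the previous cell.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above (one row per sentence pair)
logits = [[-2.1123817, 1.17868, 1.4100995],
          [-0.01478346, 1.0962477, -0.99199575]]
labels = ["contradiction", "entailment", "neutral"]  # class IDs 0, 1, 2

probas = [softmax(row) for row in logits]
# argmax: index of the largest probability in each row
preds = [max(range(len(row)), key=row.__getitem__) for row in probas]

print(probas)
print([labels[i] for i in preds])  # ['neutral', 'entailment']
```

The probabilities match the `Y_probas` tensor above up to float32 rounding.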

To fine-tune the model, we can train it as usual with Keras. However, since the model outputs logits instead of probabilities, we must use the `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` loss instead of the usual `"sparse_categorical_crossentropy"` loss:

In [71]:
sentences = [("Sky is blue", "Sky is red"), ("I love her", "She loves me")]
X_train = tokenizer(sentences, padding=True, return_tensors="tf").data
y_train = tf.constant([0, 2])  # contradiction, neutral
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=2)

Epoch 1/2
Epoch 2/2
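To see what `from_logits=True` changes, the sketch below computes the cross-entropy for one example in two ways: directly from the raw logits (using the log-sum-exp trick), and from probabilities obtained by applying softmax first. The function names are made up for illustration and the code is not the Keras implementation; it only shows that both routes give the same loss, provided the loss knows which kind of input it receives.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example, computed directly from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def sparse_ce_from_probas(probas, label):
    """Cross-entropy for one example, computed from class probabilities."""
    return -math.log(probas[label])

logits = [-2.1123817, 1.17868, 1.4100995]  # first row of the logits shown earlier
label = 2                                  # neutral

# Apply softmax by hand to get probabilities
total = sum(math.exp(x) for x in logits)
probas = [math.exp(x) / total for x in logits]

loss_from_logits = sparse_ce_from_logits(logits, label)
loss_from_probas = sparse_ce_from_probas(probas, label)
print(loss_from_logits, loss_from_probas)  # the two values agree
```

Passing logits to a loss that expects probabilities would give meaningless values, which is why the loss must be told explicitly that it receives logits.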
