In [None]:
!pip install tensorflow_datasets

In [18]:
import tensorflow_datasets as tfds

datasets = tfds.load(name="imdb_reviews", as_supervised=True)
train_set, test_set = datasets["train"], datasets["test"]

In [20]:
for review, label in train_set.take(1):
    print(review)
    print(label)

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)


In [21]:
print(datasets.keys())

dict_keys(['train', 'test', 'unsupervised'])


In [22]:
print(train_set.cardinality())

tf.Tensor(25000, shape=(), dtype=int64)


*Exercise: Split the test set into a validation set (15,000) and a test set (10,000).*

In [23]:
batch_size = 32

train_set = train_set.shuffle(buffer_size=25000, seed=42)
train_set = train_set.batch(batch_size).prefetch(1)
valid_set = test_set.take(15000).batch(batch_size).prefetch(1)
test_set = test_set.skip(15000).batch(batch_size).prefetch(1)
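The `take`/`skip` pair above splits the original test set by position: the first 15,000 reviews become the validation set and the remaining 10,000 stay in the test set, with no overlap. Here is a minimal framework-free sketch of the same idea on a small stand-in list. The `take` and `skip` helpers and the `records` list are illustrative names, not the `tf.data` API:

```python
from itertools import islice

def take(iterable, n):
    """Return the first n items, similar in spirit to Dataset.take(n)."""
    return list(islice(iterable, n))

def skip(iterable, n):
    """Return everything after the first n items, similar to Dataset.skip(n)."""
    return list(islice(iterable, n, None))

# Hypothetical stand-in for the 25,000 test reviews.
records = list(range(25))
valid_part = take(records, 15)
test_part = skip(records, 15)

print(len(valid_part), len(test_part))
```

This kind of positional split is only clean if the underlying data is yielded in the same order every time it is iterated; otherwise the two parts could overlap.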

*Exercise: Create a binary classification model, using a `TextVectorization` layer to preprocess each review.*
We will create a `TextVectorization` layer and adapt it to the training set. Let's use TF-IDF for now.

In [24]:
from tensorflow.keras.layers import TextVectorization

max_tokens = 1000
sample_reviews = train_set.map(lambda review, label: review)
text_vectorization = TextVectorization(max_tokens=max_tokens, 
                                       output_mode = "tf_idf")
text_vectorization.adapt(sample_reviews)
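With `output_mode="tf_idf"`, each review becomes a fixed-size vector holding one weight per vocabulary token: roughly, the token's count in the review, scaled down for tokens that occur in many reviews. The pure-Python sketch below illustrates that idea on a toy corpus. The smoothed IDF formula is one common variant chosen for illustration; it is an assumption, not a statement of what Keras computes internally, and `docs`, `idf`, and `tf_idf` are hypothetical names:

```python
import math
from collections import Counter

# Toy corpus (hypothetical reviews).
docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting great story",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each token.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def idf(token):
    # Smoothed inverse document frequency (assumed variant, for illustration).
    return math.log(1 + n_docs / (1 + doc_freq[token]))

def tf_idf(tokens):
    # Term count in this document times the token's IDF.
    counts = Counter(tokens)
    return {token: count * idf(token) for token, count in counts.items()}

weights = tf_idf(tokenized[2])
print(weights)
```

In the third toy review, "great" occurs twice, so it ends up with a larger weight than "acting" even though "acting" is rarer across the corpus.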

Good! Now let's take a look at the first 10 words in the vocabulary:

In [25]:
text_vectorization.get_vocabulary()[:10]

['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i']

Apart from `[UNK]`, which stands for out-of-vocabulary tokens, these are the most common words in the reviews, in decreasing order of frequency.
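To make the ordering concrete, here is a rough pure-Python sketch of how a frequency-sorted vocabulary capped at `max_tokens` could be built, with `[UNK]` reserved at index 0 as in the output above. The `build_vocabulary` helper is hypothetical, and lowercasing plus whitespace splitting only approximates the layer's default text standardization:

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first, with '[UNK]' reserved at index 0."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Keep max_tokens - 1 real tokens, since one slot goes to '[UNK]'.
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common

toy_reviews = ["The plot was thin", "the acting was fine", "The end"]
vocab = build_vocabulary(toy_reviews, max_tokens=3)
print(vocab)  # ['[UNK]', 'the', 'was']
```

Every token that does not make the cut is mapped to `[UNK]`, which is why a small `max_tokens` keeps the input vector compact at the cost of discarding rare words.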

We are ready to train the model!

In [26]:
import tensorflow as tf

tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vectorization, 
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="nadam", 
              metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1aa00dbe790>

We get about 84.4% accuracy on the validation set after just the first epoch, but after that the model makes no significant progress. We will do better in Chapter 16. For now the point is just to perform efficient preprocessing using `tf.data` and Keras preprocessing layers.

*Exercise: Add an `Embedding` layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.*

To compute the mean embedding for each review, and multiply it by the square root of the number of words in that review, we will need a little function. For each sentence, this function needs to compute $M \times \sqrt{N}$, where $M$ is the mean of all the word embeddings in the sentence (excluding padding tokens), and $N$ is the number of words in the sentence (also excluding padding tokens). We can rewrite $M$ as $\dfrac{S}{N}$, where $S$ is the sum of all word embeddings (it does not matter whether or not we include the padding tokens in this sum, since their representation is a zero vector). So the function must return $M \times \sqrt{N} = \dfrac{S}{N} \times \sqrt{N} = \dfrac{S}{\sqrt{N}}$.

In [27]:
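def compute_mean_embedding(inputs):
    # inputs: [batch, n_tokens, embedding_size]; padding tokens are assumed
    # to be represented by zero vectors.
    # Count the non-zero components of each token embedding (0 for padding).
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    print(not_pad)  # debug output (a symbolic tensor when Keras traces the function)
    # Count the tokens with at least one non-zero component: N for each review.
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    print(n_words)  # debug output
    sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    # S / sqrt(N): sum the embeddings over the token axis, then rescale.
    return tf.reduce_sum(inputs, axis=1) / sqrt_n_words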

another_example = tf.constant([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
                               [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])
compute_mean_embedding(another_example)

tf.Tensor(
[[3 2 0]
 [1 0 0]], shape=(2, 3), dtype=int64)
tf.Tensor(
[[2]
 [1]], shape=(2, 1), dtype=int64)


<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.535534 , 4.9497476, 2.1213205],
       [6.       , 0.       , 0.       ]], dtype=float32)>

Let's check that this is correct. The first review contains 2 words (the last token is a zero vector, which represents the `<pad>` token). Let's compute the mean embedding for these 2 words, and multiply the result by the square root of 2:

In [28]:
tf.reduce_mean(another_example[0:1, :2], axis=1) * tf.sqrt(2.)

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[3.535534 , 4.9497476, 2.1213202]], dtype=float32)>

Looks good! Now let's check the second review, which contains just one word (we ignore the two padding tokens):

In [29]:
tf.reduce_mean(another_example[1:2, :1], axis=1) * tf.sqrt(1.)

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[6., 0., 0.]], dtype=float32)>

Perfect. Now we're ready to train our final model. It's the same as before, except we replaced TF-IDF with ordinal encoding (`output_mode="int"`) followed by an `Embedding` layer, followed by a `Lambda` layer that calls the `compute_mean_embedding` function:

In [30]:
from tensorflow.keras.layers import Embedding, Lambda, Dense

embedding_size = 20
tf.random.set_seed(42)

text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, output_mode="int")
text_vectorization.adapt(sample_reviews)

model = tf.keras.Sequential([
    text_vectorization,
    Embedding(input_dim=max_tokens, output_dim=embedding_size, mask_zero=True),
    Lambda(compute_mean_embedding),
    Dense(100, activation="relu"),
    Dense(1, activation="sigmoid"),
])

Tensor("lambda/count_nonzero/Sum:0", shape=(None, None), dtype=int64)
Tensor("lambda/count_nonzero_1/Sum:0", shape=(None, 1), dtype=int64)
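In the model above, `output_mode="int"` turns each review into a sequence of integer token IDs, and shorter reviews in a batch are padded with zeros so that all sequences have the same length. A rough pure-Python sketch of that encoding step follows. The `encode` helper and the ID layout (0 for padding, 1 for unknown tokens) are illustrative assumptions:

```python
def encode(texts, vocab):
    """Map tokens to integer IDs and pad with 0 up to the longest sequence."""
    # ID 0 is reserved for padding and ID 1 for unknown tokens (assumed layout).
    token_to_id = {token: i + 2 for i, token in enumerate(vocab)}
    sequences = [
        [token_to_id.get(token, 1) for token in text.lower().split()]
        for text in texts
    ]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

vocab = ["the", "movie", "was", "great"]
batch = encode(["The movie was great", "Great movie", "boring"], vocab)
print(batch)  # [[2, 3, 4, 5], [5, 3, 0, 0], [1, 0, 0, 0]]
```

Each integer then indexes a row of the embedding matrix, so the batch becomes a 3D tensor of shape `[batch, n_tokens, embedding_size]`, which is the input shape `compute_mean_embedding` expects.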


In [32]:
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

Epoch 1/5
Tensor("sequential_2/lambda/count_nonzero/Sum:0", shape=(None, None), dtype=int64)
Tensor("sequential_2/lambda/count_nonzero_1/Sum:0", shape=(None, 1), dtype=int64)
Tensor("sequential_2/lambda/count_nonzero/Sum:0", shape=(None, None), dtype=int64)
Tensor("sequential_2/lambda/count_nonzero_1/Sum:0", shape=(None, 1), dtype=int64)
Tensor("sequential_2/lambda/count_nonzero_1/Sum:0", shape=(None, 1), dtype=int64)
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1aa033a5b90>

The model is just marginally better using embeddings (but we will do better in Chapter 16). The pipeline looks fast enough.