In [1]:
import numpy as np
import tensorflow as tf
from pathlib import Path



Exercise: *In this exercise you will download a dataset, split it, create a `tf.data.Dataset` to load it and preprocess it efficiently, then build and train a binary classification model containing an `Embedding` layer*

### a.

Exercise: *Download the Large Movie Review Dataset, which contains 50,000 movie reviews from the Internet Movie Database (IMDb). The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words versions), but we will ignore them in this exercise.*

In [2]:
root = "https://ai.stanford.edu/~amaas/data/sentiment/"
filename = "aclImdb_v1.tar.gz"
filepath = tf.keras.utils.get_file(
    filename, root + filename, extract=True, cache_dir="."
)
path = Path(filepath).with_name("aclImdb")
path

PosixPath('datasets/aclImdb')

Now we define a `tree()` function to view the structure of the `aclImdb` directory.

In [3]:
def tree(path: Path, level=0, indent=4, max_files=3):
    # Print the root directory once, then indent its contents one level
    if level == 0:
        print(f"{path}/")
        level += 1
    sub_paths = sorted(path.iterdir())
    sub_dirs = [sub_path for sub_path in sub_paths if sub_path.is_dir()]
    sub_files = [sub_path for sub_path in sub_paths if sub_path not in sub_dirs]
    indent_str = " " * indent * level
    for sub_dir in sub_dirs:
        print(f"{indent_str}{sub_dir.name}/")
        # Pass indent and max_files down so non-default values are honored
        tree(sub_dir, level + 1, indent, max_files)
    for sub_file in sub_files[:max_files]:
        print(f"{indent_str}{sub_file.name}")
    if len(sub_files) > max_files:
        print(f"{indent_str}...")

In [4]:
tree(path)

datasets/aclImdb/
    test/
        neg/
            0_2.txt
            10000_4.txt
            10001_1.txt
            ...
        pos/
            0_10.txt
            10000_7.txt
            10001_9.txt
            ...
        labeledBow.feat
        urls_neg.txt
        urls_pos.txt
    train/
        neg/
            0_3.txt
            10000_4.txt
            10001_4.txt
            ...
        pos/
            0_9.txt
            10000_8.txt
            10001_10.txt
            ...
        unsup/
            0_0.txt
            10000_0.txt
            10001_0.txt
            ...
        labeledBow.feat
        unsupBow.feat
        urls_neg.txt
        ...
    README
    imdb.vocab
    imdbEr.txt


In [5]:
def review_paths(dirpath: Path):
    return [str(path) for path in dirpath.glob("*.txt")]


train_pos = review_paths(path / "train" / "pos")
train_neg = review_paths(path / "train" / "neg")
test_valid_pos = review_paths(path / "test" / "pos")
test_valid_neg = review_paths(path / "test" / "neg")

len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

(12500, 12500, 12500, 12500)

### b.

Exercise: *Split the test set into a validation set (15,000) and a test set (10,000)*

In [6]:
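np.random.seed(42)
# Shuffle both classes so neither split depends on the directory listing order
np.random.shuffle(test_valid_pos)
np.random.shuffle(test_valid_neg)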

test_pos = test_valid_pos[:5000]
valid_pos = test_valid_pos[5000:]
test_neg = test_valid_neg[:5000]
valid_neg = test_valid_neg[5000:]
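
The split arithmetic can be sanity-checked without touching the files. The sketch below uses placeholder file names and a hypothetical helper `split_paths()` (neither is part of the notebook) to confirm that taking 5,000 reviews per class for testing leaves 7,500 per class for validation, i.e. 10,000 test and 15,000 validation reviews overall, with no review in both sets.

```python
import random


def split_paths(paths, n_test, seed=42):
    """Shuffle a copy of `paths` and split it into (test, valid) lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:]


# Placeholder names standing in for the 12,500 reviews of each class
fake_pos = [f"pos/{i}.txt" for i in range(12_500)]
fake_neg = [f"neg/{i}.txt" for i in range(12_500)]

test_pos_demo, valid_pos_demo = split_paths(fake_pos, n_test=5_000)
test_neg_demo, valid_neg_demo = split_paths(fake_neg, n_test=5_000)

n_test_total = len(test_pos_demo) + len(test_neg_demo)
n_valid_total = len(valid_pos_demo) + len(valid_neg_demo)
print(n_test_total, n_valid_total)  # 10000 15000
```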

### c.

Exercise: *Use `tf.data` to create an efficient dataset for each set.*

In [7]:
def idmb_dataset(filepaths_positive, filepaths_negative):
    reviews = []
    labels = []
    for filepaths, label in [(filepaths_positive, 1), (filepaths_negative, 0)]:
        for filepath in filepaths:
            with open(filepath, encoding="utf-8") as review_file:
                reviews.append(review_file.read())
            labels.append(label)
    return tf.data.Dataset.from_tensor_slices(
        (tf.constant(reviews), tf.constant(labels))
    )

In [8]:
for X, y in idmb_dataset(train_pos, train_neg).take(3):
    print(X)
    print(y)
    print()

tf.Tensor(b'First of all I\'ve got to give it to the people that got this thing together. 9/11 is such a sensitive issue that making a movie that dares to be controversial about it takes a great deal of guts. It\'s a shame, although not surprising, that the movie was banned in the US.<br /><br />That being said I think that the movie is superb with a couple of weak moments. The movie starts up with the Iranian segment which turns out to be somewhat reminiscent of Majid Majidi\'s work (the absolutely beautiful "heaven\'s children" and "the color of paradise"). Much like those 2 films the clip shows what happened through the innocent eyes of a class of Afgan refugees in Iran. Absolutely beautiful clip. Same goes for Sean Penn\'s clip which is superb as well. But just as some of the clips are beutiful others are absolutely brutal. Alejandro Gonz\xc3\xa1les I\xc3\xb1\xc3\xa1rritu does the mexican clip and just like his gut-wrenching "Amores perros" he does it as brutal as he can. Most of t



In [9]:
%timeit -r1 for X, y in idmb_dataset(train_pos, train_neg).repeat(10): pass


11.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


It takes about 12 seconds to load the dataset and go through it 10 times.

- But let's pretend the dataset does not fit in memory.
- Luckily, each review fits on just one line (they use `<br />` to indicate line breaks), so we can use `TextLineDataset` to read the reviews.
- If they didn't, we would have to preprocess the input files (e.g., converting them to TFRecords).
- For very large datasets, it would make sense to use a tool like Apache Beam for that.
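
Since `TextLineDataset` yields one example per line, the one-review-per-line assumption is worth checking before relying on it. The following is a minimal sketch with a hypothetical helper, `multiline_files()`, demonstrated on temporary files instead of the real dataset; it only splits on `\n`, which is an assumption about what counts as a line break here.

```python
import tempfile
from pathlib import Path


def multiline_files(filepaths):
    """Return the paths whose content spans more than one non-empty line."""
    offenders = []
    for filepath in filepaths:
        text = Path(filepath).read_text(encoding="utf-8")
        # Split on "\n" only, ignoring blank lines such as a trailing newline
        lines = [line for line in text.split("\n") if line.strip()]
        if len(lines) > 1:
            offenders.append(str(filepath))
    return offenders


# Tiny self-contained demo on temporary files (not the real reviews)
with tempfile.TemporaryDirectory() as tmp_dir:
    single = Path(tmp_dir) / "single.txt"
    single.write_text("Great movie.<br /><br />Would watch again.", encoding="utf-8")
    multi = Path(tmp_dir) / "multi.txt"
    multi.write_text("Line one.\nLine two.", encoding="utf-8")
    demo_offenders = multiline_files([single, multi])

print(len(demo_offenders))  # 1
```

On the real data, one would call it on the review path lists (for example `train_pos + train_neg`) and expect an empty list if every review really fits on one line.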

In [10]:
def idmb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
    dataset_neg = tf.data.TextLineDataset(
        filepaths_negative, num_parallel_reads=n_read_threads
    )
    dataset_neg = dataset_neg.map(lambda review: (review, 0))
    dataset_pos = tf.data.TextLineDataset(
        filepaths_positive, num_parallel_reads=n_read_threads
    )
    dataset_pos = dataset_pos.map(lambda review: (review, 1))
    return tf.data.Dataset.concatenate(dataset_pos, dataset_neg)

In [11]:
%timeit -r1 for X, y in idmb_dataset(train_pos, train_neg).repeat(10): pass



24.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


- Now it takes about 24 seconds to go through the dataset 10 times. That's much slower, essentially because the dataset is not cached in RAM, so it must be reloaded at each epoch.
- If you add `.cache()` just before `.repeat(10)`, you'll see that this implementation is about as fast as the previous one.
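
The effect of caching can be illustrated with a pure-Python analogy. This is not how `tf.data` implements `cache()`; the classes `CountingSource` and `Cached` are hypothetical stand-ins that only show why reading from the source once and replaying from memory saves work over 10 epochs.

```python
class CountingSource:
    """Iterable that counts how many items it has produced in total."""

    def __init__(self, items):
        self.items = list(items)
        self.reads = 0

    def __iter__(self):
        for item in self.items:
            self.reads += 1  # stands in for an expensive disk read
            yield item


class Cached:
    """Serve items from memory once a first pass has completed."""

    def __init__(self, source):
        self.source = source
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache
            return
        buffer = []
        for item in self.source:
            buffer.append(item)
            yield item
        self._cache = buffer  # kept only if the first pass ran to the end


def run_epochs(iterable, n_epochs):
    for _ in range(n_epochs):
        for _ in iterable:
            pass


uncached = CountingSource(["review a", "review b", "review c"])
run_epochs(uncached, 10)

source = CountingSource(["review a", "review b", "review c"])
run_epochs(Cached(source), 10)

print(uncached.reads, source.reads)  # 30 3
```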

In [12]:
%timeit -r1 for X, y in idmb_dataset(train_pos, train_neg).cache().repeat(10): pass



10.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [13]:
batch_size = 32

train_set = idmb_dataset(train_pos, train_neg).shuffle(25000, seed=24)
train_set = train_set.batch(batch_size).prefetch(1)
valid_set = idmb_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)
test_set = idmb_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)

### d.

Exercise: *Create a binary classification model, using a `TextVectorization` layer to preprocess each review.*

- Now we create a `TextVectorization` layer and adapt it to the full IMDb training set.
- If the training set did not fit in RAM, we could just use a smaller sample of the training set by calling `train_set.take(500)`.
- And we use TF-IDF for now.
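
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It uses one common smoothed variant, $\text{idf} = \log\left(1 + \dfrac{n_\text{docs}}{1 + \text{df}}\right)$, with naive whitespace tokenization; the exact weighting and tokenization that `TextVectorization` applies may differ, and the `tf_idf()` helper and sample sentences are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per document, using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {token: math.log(1 + n_docs / (1 + df)) for token, df in doc_freq.items()}
    # Weight = term count in the document times the token's IDF
    return [
        {token: count * idf[token] for token, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


docs = ["the movie was great", "the movie was awful", "the plot was thin"]
weights = tf_idf(docs)
print(weights[0]["the"] < weights[0]["great"])  # True
```

A word such as "the", which appears in every document, gets a lower weight than a word such as "great", which appears in only one.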

In [14]:
max_tokens = 1000
sample_reviews = train_set.map(lambda review, label: review)
text_vectorization = tf.keras.layers.TextVectorization(max_tokens, output_mode="tf_idf")
text_vectorization.adapt(sample_reviews)



Now we look at the first 10 words in the vocabulary.

In [15]:
text_vectorization.get_vocabulary()[:10]

['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i']

These are the most common words in the reviews.

Now we can train the model.

In [16]:
tf.random.set_seed(42)
model = tf.keras.Sequential(
    [
        text_vectorization,
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

Epoch 1/5




    777/Unknown - 3s 3ms/step - loss: 0.4237 - accuracy: 0.8227



Epoch 2/5

Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x703026d1f650>

We get about 84.8% accuracy on the validation set after just the first epoch, but after that, the model makes no significant progress. We will do better in Chapter 16. For now, we just want to practice preprocessing using `tf.data` and Keras preprocessing layers.

### e.

Exercise: *Add an `Embedding` layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.*

- To compute the mean embedding for each review and multiply it by the square root of the number of words in the review, we will need a little function.
- For each sentence, this function needs to compute $M \times \sqrt{N}$, where $M$ is the mean of all the word embeddings in the sentence (excluding padding tokens), and $N$ is the number of words in the sentence (also excluding padding tokens).
- We can rewrite $M$ as $\dfrac{S}{N}$, where $S$ is the sum of all word embeddings (it does not matter whether or not we include the padding tokens in this sum, as their representation is a zero vector anyway).
- So the function must return $M \times \sqrt{N} = \dfrac{S}{N} \times \sqrt{N} = \dfrac{S}{\sqrt{N}}$.
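
The identity $M \times \sqrt{N} = \dfrac{S}{\sqrt{N}}$ can be checked numerically. The NumPy sketch below is an illustration, not the notebook's implementation: it assumes padding is identified by token ID 0 through a hypothetical `token_ids` argument, and it zeroes out padded positions explicitly instead of relying on padding embeddings being zero vectors.

```python
import numpy as np


def mean_embedding_times_sqrt_n(embeddings, token_ids):
    """embeddings: (batch, steps, dim); token_ids: (batch, steps), 0 = padding."""
    mask = (token_ids != 0)[..., np.newaxis]  # (batch, steps, 1)
    n_words = mask.sum(axis=1)  # N, shape (batch, 1)
    summed = (embeddings * mask).sum(axis=1)  # S, shape (batch, dim)
    # Guard against empty reviews to avoid dividing by zero
    return summed / np.sqrt(np.maximum(n_words, 1))  # S / sqrt(N)


embeddings = np.array(
    [
        [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0], [0.0, 0.0, 0.0]],
        [[6.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    ]
)
token_ids = np.array([[5, 7, 0], [3, 0, 0]])  # made-up IDs; 0 marks padding

result = mean_embedding_times_sqrt_n(embeddings, token_ids)

# Direct computation of M * sqrt(N) for the first review (N = 2 words)
direct = embeddings[0, :2].mean(axis=0) * np.sqrt(2.0)
print(np.allclose(result[0], direct))  # True
```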

In [17]:
def compute_mean_embedding(inputs):
    # Per time step, count the nonzero components (0 for an all-zero padding vector)
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    # Per review, count the time steps that are not padding: this is N
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    sqrt_n_words = tf.sqrt(tf.cast(n_words, tf.float32))
    # S / sqrt(N), which equals M * sqrt(N)
    return tf.reduce_sum(inputs, axis=1) / sqrt_n_words

In [18]:
another_example = tf.constant(
    [
        [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0], [0.0, 0.0, 0.0]],
        [[6.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    ]
)
compute_mean_embedding(another_example)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.535534 , 4.9497476, 2.1213205],
       [6.       , 0.       , 0.       ]], dtype=float32)>

Let's check that this is correct. The first review contains 2 words, as the last token is a zero vector, which represents the `<pad>` token. Let's compute the mean embedding for these 2 words, and multiply the result by the square root of 2. 

In [19]:
tf.reduce_mean(another_example[0][:2], axis=0) * tf.sqrt(2.0)

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([3.535534 , 4.9497476, 2.1213202], dtype=float32)>

Okay. What about the second review, which contains only one word?

In [20]:
tf.reduce_mean(another_example[1][:1], axis=0) * tf.sqrt(1.0)

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([6., 0., 0.], dtype=float32)>

- Great! Now we can train our final model.
- It's the same as before, except that we replace TF-IDF with ordinal encoding (`output_mode="int"`), followed by an `Embedding` layer and then a `Lambda` layer that calls the `compute_mean_embedding()` function.

In [22]:
embedding_size = 20
tf.random.set_seed(42)

text_vectorization = tf.keras.layers.TextVectorization(max_tokens, output_mode="int")
text_vectorization.adapt(sample_reviews)

model = tf.keras.Sequential(
    [
        text_vectorization,
        tf.keras.layers.Embedding(
            input_dim=max_tokens,
            output_dim=embedding_size,
            mask_zero=True,  # token ID 0 (<pad>) is treated as padding and masked
        ),
        tf.keras.layers.Lambda(compute_mean_embedding),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)

### f.

Exercise: *Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.*

In [24]:
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x702fd76ae690>

The model is only marginally better using embeddings (but we will do better in Chapter 16). The pipeline is fast enough (we optimized it earlier).

### g.

Exercise: *Use TFDS to load the same dataset more easily: `tfds.load("imdb_reviews")`.*

In [26]:
import tensorflow_datasets as tfds

datasets = tfds.load("imdb_reviews")
train_set, test_set = datasets["train"], datasets["test"]

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/tan/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /home/tan/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [27]:
for example in train_set.take(1):
    print(example["text"])
    print(example["label"])

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)


2024-03-21 11:00:08.188401: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline s