In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

a. Download the Large Movie Review Dataset, which contains 50,000 movie reviews from the Internet Movie Database (IMDb). The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words versions), but we will ignore them in this exercise.

b. Split the test set into a validation set (15,000) and a test set (10,000).

In [1]:
import tensorflow as tf

## First attempt: Bad approach

This is not a good approach. A better approach is to simply split the files in the folder to validation and test

In [53]:
test_pos_files = tf.data.Dataset.list_files('data/aclImdb/test/pos/*.txt')
test_neg_files = tf.data.Dataset.list_files('data/aclImdb/test/neg/*.txt')

def attach_label(label):
    def _attach_label(x):
        return x, tf.constant([label], dtype=tf.int64)
    return _attach_label

test_pos = tf.data.TextLineDataset(test_pos_files).map(attach_label(1))
test_neg = tf.data.TextLineDataset(test_neg_files).map(attach_label(0))
test_full: tf.data.Dataset = test_pos.concatenate(test_neg).shuffle(25000, seed=42)

In [49]:
valid_input_arr = []
valid_label_arr = []
test_input_arr = []
test_label_arr = []
for index, (input, label) in test_full.enumerate():
    if index < 15000:
        valid_input_arr.append(input)
        valid_label_arr.append(label)
    else:
        test_input_arr.append(input)
        test_label_arr.append(label)

valid: tf.data.Dataset = tf.data.Dataset.from_tensor_slices((valid_input_arr, valid_label_arr))
test: tf.data.Dataset = tf.data.Dataset.from_tensor_slices((test_input_arr, test_label_arr))

## Second attempt

In [138]:
test_pos_files = tf.data.Dataset.list_files('data/aclImdb/test/pos/*.txt', shuffle=False)
test_neg_files = tf.data.Dataset.list_files('data/aclImdb/test/neg/*.txt', shuffle=False)
train_pos_files = tf.data.Dataset.list_files('data/aclImdb/train/pos/*.txt', shuffle=False)
train_neg_files = tf.data.Dataset.list_files('data/aclImdb/train/pos/*.txt', shuffle=False)

test_pos_files = [x.numpy() for x in test_pos_files]
test_neg_files = [x.numpy() for x in test_neg_files]

valid_pos_files, test_pos_files = test_pos_files[:7500], test_pos_files[7500:]
valid_neg_files, test_neg_files = test_neg_files[:7500], test_neg_files[7500:]

print(
    len(valid_pos_files),
    len(valid_neg_files),
    len(test_pos_files),
    len(test_neg_files),
    len(train_pos_files),
    len(train_neg_files),
)

def attach_label(label):
    def _attach_label(x):
        return x, label
    return _attach_label

valid_pos = tf.data.TextLineDataset(valid_pos_files, num_parallel_reads=5).map(attach_label(1))
valid_neg = tf.data.TextLineDataset(valid_neg_files, num_parallel_reads=5).map(attach_label(0))
test_pos = tf.data.TextLineDataset(test_pos_files, num_parallel_reads=5).map(attach_label(1))
test_neg = tf.data.TextLineDataset(test_neg_files, num_parallel_reads=5).map(attach_label(0))
train_pos = tf.data.TextLineDataset(train_pos_files, num_parallel_reads=5).map(attach_label(1))
train_neg = tf.data.TextLineDataset(train_neg_files, num_parallel_reads=5).map(attach_label(0))

valid = valid_pos.concatenate(valid_neg).batch(32).prefetch(1)
test = test_pos.concatenate(test_neg).batch(32).prefetch(1)
train = train_pos.concatenate(train_neg).shuffle(25000).batch(32).prefetch(1)

7500 7500 5000 5000 12500 12500


c. Use tf.data to create an efficient dataset for each set.

d. Create a binary classification model, using a TextVectorization layer to preprocess each review.

In [99]:
vectorization = tf.keras.layers.TextVectorization(output_mode='tf_idf', max_tokens=1000)
vectorization.adapt(train.concatenate(valid).concatenate(test).map(lambda x, label: x))

In [141]:
print(vectorization.get_vocabulary()[:20])
print(vectorization.get_vocabulary()[980:])

['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'br', 'as', 'was', 'with', 'for', 'but', 'movie', 'film']
['laughs', 'whatever', 'members', 'sounds', 'lee', 'beautifully', 'reasons', 'popular', 'secret', '20', 'otherwise', 'box', 'appear', 'minute', 'moves', 'apart', 'uses', 'credits', 'front', 'large']


In [142]:
vectorization.vocabulary_size()

1000

In [143]:
# Embeddings of a sentence seems to simply add the weights for each word
print(vectorization('asdfasdf')[:5])
print(vectorization('asdfasdf the')[:5])
print(vectorization('asdfasdf the and')[:5])

tf.Tensor([3.012064 0.       0.       0.       0.      ], shape=(5,), dtype=float32)
tf.Tensor([3.012064  0.6979414 0.        0.        0.       ], shape=(5,), dtype=float32)
tf.Tensor([3.012064  0.6979414 0.7099822 0.        0.       ], shape=(5,), dtype=float32)


In [144]:
model = tf.keras.models.Sequential([
    vectorization,
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.legacy.Nadam(learning_rate=0.0005),
    metrics=[tf.keras.metrics.binary_accuracy]
)
hist = model.fit(train, epochs=5, validation_data=valid)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.

> An embeddings layer starts with a sparse categorical value (a number between 0 and max_tokens). But here, the solution (which I read only up to here) suggests TF-IDF, which produces hot encoded vector. Matrix multiplication between the tf-idf-hot encoded vectorization layer and the embedding layer (dense layer) will essentially take care of "adding the vectors" part. But what about the square root of the number of words? My instinct is to create a custom layer that for a given input tf-idf-hot encoded matrix X, it performs this "normalization".

In [128]:
import numpy as np
x1 = np.ones((2, 2))
x2 = np.zeros((2, 2))
print(x2)
y = tf.keras.layers.Average()([x1, x2])
y.numpy().tolist()

tf.keras.layers.

[[0. 0.]
 [0. 0.]]


[[0.5, 0.5], [0.5, 0.5]]

f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.

g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").