# Chapter 13 Exercises: 10th

## 10. In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

 - a. Download the Large Movie Review Dataset, which contains 50,000 movie
 reviews from the Internet Movie Database (IMDb). The data is organized
 in two directories, train and test, each containing a pos subdirectory with
 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews.
 Each review is stored in a separate text file. There are other files and folders
 (including preprocessed bag-of-words versions), but we will ignore them in
 this exercise.

 - b. Split the test set into a validation set (15,000) and a test set (10,000).

 - c. Use tf.data to create an efficient dataset for each set.

 - d. Create a binary classification model, using a TextVectorization layer to
 preprocess each review.

 - e. Add an Embedding layer and compute the mean embedding for each review,
 multiplied by the square root of the number of words (see Chapter 16). This
 rescaled mean embedding can then be passed to the rest of your model.

 - f. Train the model and see what accuracy you get. Try to optimize your pipelines
 to make training as fast as possible.

 - g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").

---

### a. Download the Large Movie Review Dataset, which contains 50,000 movie reviews from the Internet Movie Database (IMDb). The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words versions), but we will ignore them in this exercise.

In [72]:
import numpy as np
import tensorflow as tf
import os
import tarfile
from pathlib import Path

In [73]:


dataset_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download the dataset
dataset_dir = tf.keras.utils.get_file("aclImdb_v1.tar.gz", dataset_url, untar=False)

with tarfile.open(dataset_dir, "r:gz") as tar:
    tar.extractall(path=os.path.dirname(dataset_dir))

extracted_dir = os.path.join(os.path.dirname(dataset_dir), 'aclImdb')

# List directory structure to verify
print(os.listdir(extracted_dir))


['imdbEr.txt', 'imdb.vocab', 'train', 'README', 'test']


In [74]:
extracted_dir

'/root/.keras/datasets/aclImdb'

In [75]:
print('Train content:',os.listdir(os.path.join(extracted_dir, 'train')))
print('Test content:',os.listdir(os.path.join(extracted_dir, 'test')))
print('Postive content:',os.listdir(os.path.join(extracted_dir, 'test/pos')))
print('Negative content:',os.listdir(os.path.join(extracted_dir, 'test/neg')))

Train content: ['pos', 'unsupBow.feat', 'labeledBow.feat', 'unsup', 'urls_unsup.txt', 'urls_neg.txt', 'urls_pos.txt', 'neg']
Test content: ['pos', 'labeledBow.feat', 'urls_neg.txt', 'urls_pos.txt', 'neg']
Postive content: ['3487_9.txt', '9939_8.txt', '1338_7.txt', '1182_7.txt', '394_9.txt', '791_10.txt', '7003_10.txt', '9742_9.txt', '8689_10.txt', '2534_10.txt', '5546_10.txt', '7257_8.txt', '4729_9.txt', '10756_7.txt', '2053_8.txt', '911_10.txt', '12011_8.txt', '2455_8.txt', '10527_10.txt', '9221_10.txt', '3847_7.txt', '3568_8.txt', '5645_10.txt', '5003_10.txt', '10904_9.txt', '5743_10.txt', '1399_7.txt', '3381_10.txt', '1898_10.txt', '2597_10.txt', '7925_10.txt', '10814_9.txt', '5688_7.txt', '8435_9.txt', '76_8.txt', '5432_9.txt', '4361_9.txt', '3145_9.txt', '1118_7.txt', '5554_10.txt', '3452_7.txt', '6004_8.txt', '12208_9.txt', '4441_9.txt', '961_8.txt', '11585_7.txt', '12338_8.txt', '2502_8.txt', '3562_8.txt', '5268_10.txt', '4076_7.txt', '2115_10.txt', '10869_7.txt', '9368_9.txt', 

In [76]:
file_path = [os.path.join(extracted_dir, 'test/neg',"960_3.txt"),os.path.join(extracted_dir, 'test/neg',"4005_2.txt")]

temp_dataset = tf.data.TextLineDataset(file_path)

for i, line in enumerate(temp_dataset):
    print(line.numpy().decode('utf-8'))
    print(i)


I'm a fan of the 1950's original and about 20 minutes into this remake I started to think this was going to be as good as the original but it wasn't. The motive for the murders was incredibly stupid. Two of the lovers in the movie turn out to be brother and sister-excuse me while I barf. The main character stops in the middle of the movie to have sex which doesn't make sense considering the situation he's in. If the film makers wanted a sex scene they should have put it earlier in the movie before the main character (Dexter played by Dennis Quaid found he's about to die and that he's accused of a crime. There is a reason for where the sex scene is at. Early in the movie Dexter isn't living life to the fullest so he's not interested in sleeping with Meg Ryan. I still feel it would make more sense for the sex scene to have either been cut or earlier in the film and the two siblings not to have been lovers.<br /><br />One of the dumbest parts of the movie involves a gun fight, a couple pe

### b. Split the test set into a validation set (15,000) and a test set (10,000).

In [77]:
extracted_dir = Path(extracted_dir)

In [78]:
def review_paths(dirpath):
    return [str(path) for path in dirpath.glob("*.txt")]

train_pos = review_paths(extracted_dir / "train" / "pos")
train_neg = review_paths(extracted_dir / "train" / "neg")
test_valid_pos = review_paths(extracted_dir / "test" / "pos")
test_valid_neg = review_paths(extracted_dir / "test" / "neg")

len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

(12500, 12500, 12500, 12500)

In [79]:
train_pos[:3]

['/root/.keras/datasets/aclImdb/train/pos/3487_9.txt',
 '/root/.keras/datasets/aclImdb/train/pos/1338_7.txt',
 '/root/.keras/datasets/aclImdb/train/pos/921_7.txt']

In [80]:
np.random.shuffle(test_valid_pos)

test_pos = test_valid_pos[:5000]
test_neg = test_valid_neg[:5000]
valid_pos = test_valid_pos[5000:]
valid_neg = test_valid_neg[5000:]

In [81]:
with open('/root/.keras/datasets/aclImdb/train/pos/10158_7.txt','r') as f:
    w = f.read()
    print(type(w))
    print(len(w))
    print(w)

<class 'str'>
603
Johnny and June Carter Cash financed this film which is a traditional rendering of the Gospel stories. The music is great, you get a real feel of what the world of Jesus looked like (I've been there too), and June gets into the part of Mary Magdalene with a passion. Cash's narration is good too.<br /><br />But....<br /><br />1. The actor who played Jesus was miscast. 2. There is no edge to the story like Cash puts in some of his faith based music. 3. Because it is uncompelling, I doubt we'll see this ever widely distributed again.<br /><br />I'd love to buy the CD.<br /><br />Tom Paine Texas, USA


In [82]:
def imdb_dataset(filepaths_positive, filepaths_negative):
    reviews = []
    labels = []
    for filepaths, label in ((filepaths_negative, 0), (filepaths_positive, 1)):
        for filepath in filepaths:
            with open(filepath) as review_file:
                reviews.append(review_file.read())
            labels.append(label)
    return tf.data.Dataset.from_tensor_slices(
        (tf.constant(reviews), tf.constant(labels)))

In [83]:
for X, y in imdb_dataset(train_pos, train_neg).take(3):
    print(X)
    print(y)
    print()

tf.Tensor(b"If I could i would give ZERO stars for this one, but unfortunately i have to give one...<br /><br />There is no single scene I could laugh about... but the game didn't make me laugh either. So if you're some ill retarded folk, go to your local cinema, watch this movie and give it 10 stars, like some people here already did.<br /><br />but for me... in a movie where children are shot dead to achieve humor... good taste goes over the edge... this was the third time i wasted my time to see a Boll movie and it was definitely my last!<br /><br />0/10... i'm ashamed of being from the same country as Uwe Boll!<br /><br />PLEASE PLEASE KEEP HIM FROM MAKING MORE MOVIES!!!!!", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int32)

tf.Tensor(b"Ok, where do we start with this little gem? Mutant slugs begin to take over a small New England (?) town. Only one man can stop them... and that man... is Mike Brady! Now, if that wasn't laughable enough, stay tuned.<br /><br />The footage

### c. Use tf.data to create an efficient dataset for each set.

In [84]:
train_dataset = imdb_dataset(train_pos, train_neg)
test_dataset = imdb_dataset(test_pos, test_neg)
valid_dataset = imdb_dataset(valid_pos, valid_neg)

In [85]:
print(len(train_dataset))
print(len(test_dataset))
print(len(valid_dataset))

25000
10000
15000


In [86]:
train_dataset = train_dataset.shuffle(25_000,seed=42).batch(32).prefetch(1)
valid_dataset = valid_dataset.shuffle(15_000,seed=42,reshuffle_each_iteration=False).batch(32).prefetch(1)
test_dataset = test_dataset.shuffle(10_000,seed=42,reshuffle_each_iteration=False).batch(32).prefetch(1)

In [87]:
for item in train_dataset.take(1):
    print(item[0].shape)
    print(item[1].shape)

(32,)
(32,)


In [88]:
example, label = next(iter(train_dataset))[0][0], next(iter(train_dataset))[1][0]
example, label

(<tf.Tensor: shape=(), dtype=string, numpy=b"I like Chris Rock, but I feel he is wasted in this film. The idea of remaking Heaven Can Wait is fine, but the filmmakers followed the plot of that turkey too closely. When Eddie Murphy remade Dr. Doolittle and The Nutty Professor, he re-did them totally -- so they became Murphy films/vehicles, not just tepid remakes. That's why they were successful. If Chris had done the same, this could have been a much better film. The few laughs that come are when he is doing his standup routine -- so he might as well have done a concert film. It also would have been much funnier if the white man whose body he inhabits was a truck driver or hillbilly. So why does Hollywood keep making junk like this? Because people go to see it -- because they like Chris Rock. So give Chris a decent script and give us better movies! Don't remake films that weren't that good in the first place!">,
 <tf.Tensor: shape=(), dtype=int32, numpy=1>)

In [89]:
len(example.numpy().decode('utf-8').split())

164

### d. Create a binary classification model, using a TextVectorization layer to preprocess each review.

In [116]:
max_tokens = 1000
vec_layer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_mode='tf_idf')

sample_reviews=  train_dataset.map(lambda review, label: review)
vec_layer.adapt(sample_reviews)
vec_layer.vocabulary_size()

In [108]:
vec_layer.get_vocabulary()[:5]

['[UNK]', 'the', 'and', 'a', 'of']

In [109]:
vec_layer(example)

<tf.Tensor: shape=(1000,), dtype=float32, numpy=
array([95.980675  ,  5.5788283 ,  1.4221123 ,  2.8409383 ,  1.4394988 ,
        0.72546387,  2.2478366 ,  1.5188392 ,  1.5586723 ,  1.6727123 ,
        2.235067  ,  3.2442765 ,  0.        ,  0.93541676,  0.9380965 ,
        0.        ,  0.        ,  0.        ,  1.7588904 ,  3.135417  ,
        0.        ,  0.9887731 ,  0.        ,  1.0301237 ,  1.2077285 ,
        3.0502172 ,  6.3328786 ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  3.6937737 ,
        0.        ,  0.        ,  4.626542  ,  3.4754734 ,  0.        ,
        1.21856   ,  1.2381712 ,  0.        ,  0.        ,  2.4227183 ,
        0.        ,  0.        ,  0.        ,  0.        ,  1.2966515 ,
        2.659089  ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  1.4175397 ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        1.51493

In [111]:
tf.random.set_seed(42)
model = tf.keras.Sequential([
    vec_layer,
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_dataset, epochs=5, validation_data=valid_dataset)

Epoch 1/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 65ms/step - accuracy: 0.7833 - loss: 0.5023 - val_accuracy: 0.8533 - val_loss: 0.3570
Epoch 2/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m81s[0m 64ms/step - accuracy: 0.8577 - loss: 0.3432 - val_accuracy: 0.8511 - val_loss: 0.3532
Epoch 3/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m49s[0m 63ms/step - accuracy: 0.8742 - loss: 0.3011 - val_accuracy: 0.8407 - val_loss: 0.3731
Epoch 4/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 62ms/step - accuracy: 0.9017 - loss: 0.2484 - val_accuracy: 0.8501 - val_loss: 0.3670
Epoch 5/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m87s[0m 68ms/step - accuracy: 0.9217 - loss: 0.1972 - val_accuracy: 0.8511 - val_loss: 0.3743


<keras.src.callbacks.history.History at 0x7cfce352f880>

### e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.

In [112]:
def compute_mean_embedding(inputs):
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    return tf.reduce_sum(inputs, axis=1) / sqrt_n_words

In [117]:
embedding_size = 20
tf.random.set_seed(42)

text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, output_mode="int")
text_vectorization.adapt(sample_reviews)

model = tf.keras.Sequential([
    text_vectorization,
    tf.keras.layers.Embedding(input_dim=max_tokens,
                              output_dim=embedding_size,
                              mask_zero=True),
    tf.keras.layers.Lambda(compute_mean_embedding),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

### f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.

In [118]:
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_dataset, epochs=5, validation_data=valid_dataset)

Epoch 1/5




[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 16ms/step - accuracy: 0.5633 - loss: 0.6768 - val_accuracy: 0.5975 - val_loss: 0.6401
Epoch 2/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 15ms/step - accuracy: 0.7271 - loss: 0.5418 - val_accuracy: 0.8345 - val_loss: 0.4130
Epoch 3/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m27s[0m 23ms/step - accuracy: 0.7832 - loss: 0.4615 - val_accuracy: 0.8462 - val_loss: 0.3814
Epoch 4/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 14ms/step - accuracy: 0.8026 - loss: 0.4296 - val_accuracy: 0.8054 - val_loss: 0.4097
Epoch 5/5
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[0m 15ms/step - accuracy: 0.8104 - loss: 0.4153 - val_accuracy: 0.6344 - val_loss: 0.8694


<keras.src.callbacks.history.History at 0x7cfcd5843a90>

### g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").

In [119]:
import tensorflow_datasets as tfds

In [121]:
datasets = tfds.load("imdb_reviews")

train_dataset = datasets["train"]
test_dataset = datasets["test"]

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Generating splits...:   0%|          | 0/3 [00:00<?, ? splits/s]

Generating train examples...:   0%|          | 0/25000 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/imdb_reviews/plain_text/incomplete.S8TK58_1.0.0/imdb_reviews-train.tfrecor…

Generating test examples...:   0%|          | 0/25000 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/imdb_reviews/plain_text/incomplete.S8TK58_1.0.0/imdb_reviews-test.tfrecord…

Generating unsupervised examples...:   0%|          | 0/50000 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/imdb_reviews/plain_text/incomplete.S8TK58_1.0.0/imdb_reviews-unsupervised.…

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [123]:
for example in train_dataset.take(1):
    print(example["text"])
    print(example["label"])

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)
