In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

a. Download the Large Movie Review Dataset, which contains 50,000 movie reviews from the Internet Movie Database (IMDb). The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words versions), but we will ignore them in this exercise.
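
The directory layout described above can be walked with the standard library before any TensorFlow code is involved. This is a minimal sketch, assuming the archive has been extracted to a directory such as `data/aclImdb` (the path used later in this notebook). The helper name `list_review_files` is hypothetical.

```python
from pathlib import Path


def list_review_files(root, split):
    """Return sorted review file paths for one split, keyed by label.

    `root` is the extracted dataset directory and `split` is "train" or "test".
    Only the pos/ and neg/ subdirectories are read; everything else is ignored.
    """
    split_dir = Path(root) / split
    return {
        label: sorted(str(p) for p in (split_dir / label).glob("*.txt"))
        for label in ("pos", "neg")
    }
```

Calling `list_review_files("data/aclImdb", "test")` should then return a dictionary with one list of paths under `"pos"` and one under `"neg"`.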

b. Split the test set into a validation set (15,000) and a test set (10,000).

In [2]:
import tensorrt
import tensorflow as tf



# Getting the data

## First attempt: Bad approach

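This is not a good approach: it iterates over all 25,000 reviews in Python and copies them into lists, and the cell was interrupted before it finished (see the KeyboardInterrupt below). A better approach is to split the list of files into validation and test sets before reading them.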

In [2]:
test_pos_files = tf.data.Dataset.list_files('data/aclImdb/test/pos/*.txt')
test_neg_files = tf.data.Dataset.list_files('data/aclImdb/test/neg/*.txt')

def attach_label(label):
    def _attach_label(x):
        return x, tf.constant([label], dtype=tf.int64)
    return _attach_label

test_pos = tf.data.TextLineDataset(test_pos_files).map(attach_label(1))
test_neg = tf.data.TextLineDataset(test_neg_files).map(attach_label(0))
test_full: tf.data.Dataset = test_pos.concatenate(test_neg).shuffle(25000, seed=42)


In [3]:
valid_input_arr = []
valid_label_arr = []
test_input_arr = []
test_label_arr = []
for index, (input, label) in test_full.enumerate():
    if index < 15000:
        valid_input_arr.append(input)
        valid_label_arr.append(label)
    else:
        test_input_arr.append(input)
        test_label_arr.append(label)

valid: tf.data.Dataset = tf.data.Dataset.from_tensor_slices((valid_input_arr, valid_label_arr))
test: tf.data.Dataset = tf.data.Dataset.from_tensor_slices((test_input_arr, test_label_arr))



KeyboardInterrupt: 

## Second attempt

In [3]:
test_pos_files = tf.data.Dataset.list_files('data/aclImdb/test/pos/*.txt', shuffle=False)
test_neg_files = tf.data.Dataset.list_files('data/aclImdb/test/neg/*.txt', shuffle=False)
train_pos_files = tf.data.Dataset.list_files('data/aclImdb/train/pos/*.txt', shuffle=False)
train_neg_files = tf.data.Dataset.list_files('data/aclImdb/train/neg/*.txt', shuffle=False)

test_pos_files = [x.numpy() for x in test_pos_files]
test_neg_files = [x.numpy() for x in test_neg_files]

test_pos_files, valid_pos_files = test_pos_files[:5000], test_pos_files[5000:]
test_neg_files, valid_neg_files = test_neg_files[:5000], test_neg_files[5000:]

print(
    len(valid_pos_files),
    len(valid_neg_files),
    len(test_pos_files),
    len(test_neg_files),
    len(train_pos_files),
    len(train_neg_files),
)

def attach_label(label):
    def _attach_label(x):
        return x, label
    return _attach_label

valid_pos = tf.data.TextLineDataset(valid_pos_files, num_parallel_reads=5).map(attach_label(1))
valid_neg = tf.data.TextLineDataset(valid_neg_files, num_parallel_reads=5).map(attach_label(0))
test_pos = tf.data.TextLineDataset(test_pos_files, num_parallel_reads=5).map(attach_label(1))
test_neg = tf.data.TextLineDataset(test_neg_files, num_parallel_reads=5).map(attach_label(0))
train_pos = tf.data.TextLineDataset(train_pos_files, num_parallel_reads=5).map(attach_label(1))
train_neg = tf.data.TextLineDataset(train_neg_files, num_parallel_reads=5).map(attach_label(0))

valid = valid_pos.concatenate(valid_neg).batch(32).prefetch(1)
test = test_pos.concatenate(test_neg).batch(32).prefetch(1)
train = train_pos.concatenate(train_neg).shuffle(25000).batch(32).prefetch(1)


7500 7500 5000 5000 12500 12500
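
The split above takes the first 5,000 files of each class, in sorted filename order, as the test set. If a random split is preferred, the file lists can be shuffled with a fixed seed first. This is a minimal sketch in plain Python. The helper name `split_files` and the seed value are assumptions, not part of the original notebook.

```python
import random


def split_files(files, n_test, seed=42):
    """Shuffle a list of file paths reproducibly, then split it in two.

    Returns (test_files, valid_files): the first `n_test` shuffled paths go to
    the test set and the remainder to the validation set.
    """
    shuffled = list(files)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed -> same split on every run
    return shuffled[:n_test], shuffled[n_test:]


# Placeholder names standing in for the 12,500 files of one class
fake_files = [f"{i}.txt" for i in range(12500)]
test_part, valid_part = split_files(fake_files, n_test=5000)
print(len(test_part), len(valid_part))  # 5000 7500
```

Applying the same function to the positive and the negative file lists keeps both sets balanced, as in the cell above.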


c. Use tf.data to create an efficient dataset for each set.

d. Create a binary classification model, using a TextVectorization layer to preprocess each review.

# Model 1: without embeddings

In [4]:
vectorization = tf.keras.layers.TextVectorization(output_mode='tf_idf', max_tokens=1000)
vectorization.adapt(train.map(lambda x, label: x))



In [5]:
print(vectorization.get_vocabulary()[:20])
print(vectorization.get_vocabulary()[980:])

['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but', 'film']
['ideas', 'expecting', 'jane', 'fails', 'deserves', 'present', 'political', 'missing', 'attempts', 'twist', 'secret', 'fire', 'dumb', 'unlike', 'fighting', 'fantasy', 'pay', 'air', 'joke', 'gay']


In [8]:
vectorization.vocabulary_size()

1000

In [7]:
# The TF-IDF vector of a sentence seems to simply add the weight of each word
print(vectorization('asdfasdf')[:5])
print(vectorization('asdfasdf the')[:5])
print(vectorization('asdfasdf the and')[:5])

tf.Tensor([2.9993966 0.        0.        0.        0.       ], shape=(5,), dtype=float32)
tf.Tensor([2.9993966  0.69735354 0.         0.         0.        ], shape=(5,), dtype=float32)
tf.Tensor([2.9993966  0.69735354 0.7110562  0.         0.        ], shape=(5,), dtype=float32)
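
The outputs above are consistent with each entry of the TF-IDF vector being the token's count in the text multiplied by a per-token weight learned during `adapt()`. Below is a plain-Python sketch of that reading, using the three weights printed above. The function `tf_idf_vector` and the three-entry vocabulary are illustrative only and are not Keras internals.

```python
from collections import Counter

# Weights copied from the notebook output for the first three vocabulary entries
vocabulary = ["[UNK]", "the", "and"]
idf_weights = [2.9993966, 0.69735354, 0.7110562]


def tf_idf_vector(text):
    """Return count * weight for each vocabulary entry (unknown words -> [UNK])."""
    counts = Counter(
        word if word in vocabulary else "[UNK]" for word in text.lower().split()
    )
    return [counts[token] * weight for token, weight in zip(vocabulary, idf_weights)]


print(tf_idf_vector("asdfasdf the and"))  # one occurrence of each entry
print(tf_idf_vector("the the and"))       # 'the' is counted twice, no unknown words
```

Under this reading, repeating a word scales its entry rather than adding a new one, so the vector records which words occur and how often, but not their order.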


In [43]:
x: tf.Tensor = vectorization('asdfasdf the and')[:5]
y: tf.Tensor = vectorization('asdfadsfa')[:5]
data = tf.stack([x, y], axis=0)
mean = tf.reduce_mean(data, axis=1, keepdims=True)
print(mean)
word_count = tf.math.count_nonzero(data, axis=1, keepdims=True, dtype=tf.float32)
tf.sqrt(word_count)

tf.Tensor(
[[0.8815613]
 [0.5998793]], shape=(2, 1), dtype=float32)


<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[1.7320508],
       [1.       ]], dtype=float32)>

In [6]:
model = tf.keras.models.Sequential([
    vectorization,
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Nadam(),
    metrics=[tf.keras.metrics.binary_accuracy]
)
hist = model.fit(train, epochs=5, validation_data=valid)

Epoch 1/5



    777/Unknown - 11s 6ms/step - loss: 0.4275 - binary_accuracy: 0.8191



Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# Model 2: My custom embeddings using a multi-hot vector

e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.

> An embedding layer takes sparse categorical values as input (integers between 0 and max_tokens). Here, however, the solution (which I have only read up to this point) suggests TF-IDF, which produces a weighted multi-hot encoded vector. Multiplying the TF-IDF output of the vectorization layer by the embedding matrix (a dense layer) essentially takes care of the "adding the vectors" part. But what about the square root of the number of words? My instinct is to create a custom layer that performs this "normalization" for a given TF-IDF-encoded input matrix X.
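
The two observations in the note can be checked numerically. This is a NumPy sketch, not the Keras layer itself, with a made-up 4-token vocabulary and random weights. It shows that multiplying a multi-hot vector by a weight matrix sums the rows of the words present, and that the mean multiplied by the square root of the word count equals the sum divided by that square root.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=(4, 3))      # hypothetical: 4 tokens, 3-dimensional embeddings

# Multi-hot encoding of a text containing tokens 1 and 3 (weights of 1 for simplicity)
multi_hot = np.array([0.0, 1.0, 0.0, 1.0])

summed = multi_hot @ kernel           # the matrix product with the multi-hot vector...
assert np.allclose(summed, kernel[1] + kernel[3])  # ...adds the rows of tokens 1 and 3

n_words = np.count_nonzero(multi_hot)
mean = summed / n_words
# mean * sqrt(n) == sum / sqrt(n), so a single division of the sum is enough
assert np.allclose(mean * np.sqrt(n_words), summed / np.sqrt(n_words))
print(summed / np.sqrt(n_words))
```

With TF-IDF inputs the entries are weights rather than ones, so the product is a weighted sum of the rows, and counting non-zero entries gives the number of distinct vocabulary words rather than the total number of words.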

In [67]:
class MyEmbedding(tf.keras.layers.Layer):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.output_dim = output_dim

    def build(self, input_shape):
        self.kernel = self.add_weight(
            name='kernel',
            shape=(input_shape[-1], self.output_dim),
            dtype=tf.float32,
            initializer='he_normal',
            trainable=True
        )
        # self.bias = self.add_weight(
        #     'bias',
        #     shape=[self.output_dim],
        #     dtype=tf.float32,
        #     trainable=True
        # )
        super().build(input_shape)

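    def call(self, inputs):
        # Number of distinct vocabulary entries present in each TF-IDF row
        word_count = tf.math.count_nonzero(inputs, axis=1, keepdims=True, dtype=tf.float32)
        # Weighted sum of embeddings, rescaled by sqrt(word_count);
        # the maximum guards against division by zero for empty inputs
        return (inputs @ self.kernel) / tf.sqrt(tf.maximum(word_count, 1.0))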

    def get_config(self):
        base_config = super().get_config()
        return { **base_config, 'output_dim': self.output_dim }

In [68]:
embeddings_model = tf.keras.models.Sequential([
    vectorization,
    MyEmbedding(300)
])

model = tf.keras.models.Sequential([
    embeddings_model,
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Nadam(),
    metrics=[tf.keras.metrics.binary_accuracy]
)
hist = model.fit(train, epochs=5, validation_data=valid)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [56]:
queen = embeddings_model.predict(['queen'])
king = embeddings_model.predict(['king'])
man = embeddings_model.predict(['man'])
woman = embeddings_model.predict(['woman'])



In [59]:
import numpy as np
def distance(x, y):
    return np.sqrt((x - y) @ (x - y).T)[0][0]

In [65]:
print(distance(king, man))
print(distance(king, queen))
print(distance(queen, man))
print(distance(queen, woman))
print(distance(king - man + woman, queen))

3.1395469
2.8197706
1.9886863
1.9030378
3.5110815


# Model 3: Keras embeddings

In [70]:
max_tokens = 1000
vectorization_layer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_mode='int')
sample_reviews = train.map(lambda x, label: x)
vectorization_layer.adapt(sample_reviews)



## Understanding the shape of the data

The vectorization layer returns a vector of token IDs for each review, one integer per word

In [92]:
x = vectorization_layer('The meaning of life')
y = vectorization_layer('It was pretty good')
z = vectorization_layer('adafdasdf world')

print(x)
print(y)
print(z)

tf.Tensor([  2   1   5 119], shape=(4,), dtype=int64)
tf.Tensor([  9  14 179  50], shape=(4,), dtype=int64)
tf.Tensor([  1 188], shape=(2,), dtype=int64)


When a batch of sentences is involved, shorter sentences are padded with token 0 so that every row of the tensor has the same length

In [100]:
batch = [
    'The meaning of life is a good movie',
    'I did not like what happened in he end',
    'Why woudld the director do this'
]

batch = vectorization_layer(batch)
batch

<tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[  2,   1,   5, 119,   7,   4,  50,  18,   0],
       [ 10, 117,  22,  39,  49, 562,   8,  27, 129],
       [134,   1,   2, 172,  82,  11,   0,   0,   0]])>
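
The padding step can be sketched in plain Python. The token IDs below are the ones printed above with the trailing zeros removed. The helper name `pad_batch` is hypothetical and only illustrates what the layer does for a batch.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs with `pad_id` up to the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Token IDs taken from the batch printed above, without the padding
ragged = [
    [2, 1, 5, 119, 7, 4, 50, 18],
    [10, 117, 22, 39, 49, 562, 8, 27, 129],
    [134, 1, 2, 172, 82, 11],
]
for row in pad_batch(ragged):
    print(row)
```

Because 0 is reserved for padding and 1 for unknown words, real vocabulary entries start at index 2 in this output mode.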

An embedding layer (randomly initialized) returns one vector of size output_dim per word

In [129]:
embeddings_layer = tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=5, mask_zero=True)
embeddings_layer(x)

<tf.Tensor: shape=(4, 5), dtype=float32, numpy=
array([[ 0.02916148, -0.00272601, -0.03369595,  0.02627398,  0.03792412],
       [ 0.01223432, -0.03561933,  0.01368679, -0.04005159, -0.02686861],
       [ 0.01188274,  0.01561645, -0.03099689,  0.03140095, -0.03017026],
       [-0.00366409, -0.00441701, -0.02873117, -0.00038002, -0.01314867]],
      dtype=float32)>
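
Conceptually, the layer is a lookup table: each token ID selects one row of a weight matrix. This NumPy sketch mimics that with a random matrix standing in for the layer's weights. The initialization range is an assumption chosen to resemble the values printed above.

```python
import numpy as np

max_tokens, output_dim = 1000, 5
rng = np.random.default_rng(0)
# Stand-in for the layer's weight matrix: one row per token ID
embedding_matrix = rng.uniform(-0.05, 0.05, size=(max_tokens, output_dim))

token_ids = np.array([2, 1, 5, 119])       # 'The meaning of life' from above
vectors = embedding_matrix[token_ids]      # row lookup, one vector per token
print(vectors.shape)                       # (4, 5)

batch_ids = np.array([[2, 1, 5, 119], [134, 1, 2, 0]])  # a padded batch of two
print(embedding_matrix[batch_ids].shape)   # (2, 4, 5)
```

The extra trailing dimension is why a batch of shape (batch_size, sequence_length) becomes (batch_size, sequence_length, output_dim).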

In [130]:
embedded_batch = embeddings_layer(batch)
embedded_batch

<tf.Tensor: shape=(3, 9, 5), dtype=float32, numpy=
array([[[ 0.02916148, -0.00272601, -0.03369595,  0.02627398,
          0.03792412],
        [ 0.01223432, -0.03561933,  0.01368679, -0.04005159,
         -0.02686861],
        [ 0.01188274,  0.01561645, -0.03099689,  0.03140095,
         -0.03017026],
        [-0.00366409, -0.00441701, -0.02873117, -0.00038002,
         -0.01314867],
        [ 0.0154869 ,  0.01773483,  0.04219978,  0.02602367,
         -0.00954236],
        [-0.03284589, -0.01595576,  0.03316909, -0.02532237,
         -0.03351758],
        [-0.00601077,  0.02821207,  0.01040041,  0.02956816,
          0.00224223],
        [-0.00260069,  0.0118472 ,  0.02175767, -0.01263409,
          0.03960122],
        [ 0.00372756, -0.04539081, -0.00437624, -0.00571344,
          0.04703838]],

       [[-0.00243868, -0.01088219, -0.03810662, -0.02347549,
         -0.02771819],
        [ 0.04517824,  0.01395022,  0.03220079, -0.01081709,
         -0.01430261],
        [-0.01549876,  

In [131]:
tf.reduce_mean(embedded_batch, axis=1)

<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[ 0.00304128, -0.00341093,  0.0026015 ,  0.00324058,  0.0015065 ],
       [-0.00237101,  0.01237879,  0.0061543 , -0.0115261 , -0.00654921],
       [ 0.01288377, -0.03629122,  0.00786936, -0.00191442,  0.01627186]],
      dtype=float32)>

In [116]:
# The shape of x is (batch_size, longest_sequence, output_dim)
# Our goal should be to transform it to (batch_size, output_dim)
def compute_embeddings(x, mask=None):
    return tf.reduce_mean(x, axis=1)
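
Note that a plain mean over axis 1 also averages the vectors at padded positions, because the embedding table holds a vector for token 0 as well. Below is a NumPy sketch of pooling that ignores padding, assuming token ID 0 marks padding. The function name `masked_pool` and the toy table are hypothetical.

```python
import numpy as np


def masked_pool(token_ids, embedded, rescale=False):
    """Average embeddings over real tokens only (ID 0 is treated as padding).

    token_ids: (batch, seq_len) integer IDs
    embedded:  (batch, seq_len, dim) embedding vectors
    With rescale=True the result is sum / sqrt(n_words), i.e. mean * sqrt(n_words).
    """
    mask = (token_ids != 0).astype(embedded.dtype)[..., np.newaxis]  # (batch, seq, 1)
    summed = (embedded * mask).sum(axis=1)                           # pads contribute 0
    n_words = np.maximum(mask.sum(axis=1), 1.0)                      # avoid dividing by 0
    return summed / (np.sqrt(n_words) if rescale else n_words)


table = np.arange(20, dtype=float).reshape(10, 2)  # toy table: row i belongs to token i
table[0] = [100.0, 100.0]                          # deliberately large padding vector
ids = np.array([[3, 7, 0, 0], [4, 5, 6, 9]])
emb = table[ids]

print(emb.mean(axis=1)[0])        # [55.  55.5] -> pulled toward the padding vector
print(masked_pool(ids, emb)[0])   # [10. 11.]   -> mean of rows 3 and 7 only
```

The same masking idea should carry over to TensorFlow ops inside a custom layer, with the mask derived from the token IDs rather than from the embedded vectors.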

In [118]:
model_3_embeddings = tf.keras.models.Sequential([
    vectorization_layer,
    tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=100, mask_zero=True),
])
model_3 = tf.keras.models.Sequential([
    model_3_embeddings,
    tf.keras.layers.Lambda(compute_embeddings),
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_3.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Nadam(),
    metrics=[tf.keras.metrics.binary_accuracy]
)
model_3.fit(train, epochs=5, validation_data=valid)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd718b2e690>

In [125]:
prediction = model_3_embeddings.predict([
    'The meaning of life is a good movie',
    'I did not like what happened in he end',
    'Why woudld the director do this'
])

def compute_mean_embedding(inputs):
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    # sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    # return tf.reduce_sum(inputs, axis=1) / sqrt_n_words
    return n_words

compute_mean_embedding(prediction)



<tf.Tensor: shape=(3, 1), dtype=int64, numpy=
array([[9],
       [9],
       [9]])>

In [123]:
input = tf.constant([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
                     [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])
not_pad = tf.math.count_nonzero(input, axis=-1)
print(not_pad)
n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
print(n_words)

tf.Tensor(
[[3 2 0]
 [1 0 0]], shape=(2, 3), dtype=int64)
tf.Tensor(
[[2]
 [1]], shape=(2, 1), dtype=int64)
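
The toy tensor gives the right counts because its padded rows are exactly zero. The output of an `Embedding` layer is different: with `mask_zero=True` the layer still looks up a vector for token 0 and reports the padding through a separate mask, which would explain why the real batch yields 9 words for every review. Counting from the token IDs avoids the problem. This NumPy sketch uses a made-up embedding table.

```python
import numpy as np

rng = np.random.default_rng(1)
table = rng.uniform(-0.05, 0.05, size=(1000, 5))   # stand-in embedding weights

# Padded token IDs from the three-sentence batch used above
token_ids = np.array([
    [2, 1, 5, 119, 7, 4, 50, 18, 0],
    [10, 117, 22, 39, 49, 562, 8, 27, 129],
    [134, 1, 2, 172, 82, 11, 0, 0, 0],
])
embedded = table[token_ids]                         # (3, 9, 5); pad rows are not zero

# Counting non-zero vectors treats the padding as words
from_vectors = np.count_nonzero(np.count_nonzero(embedded, axis=-1), axis=-1)
# Counting non-zero IDs gives the real sentence lengths
from_ids = np.count_nonzero(token_ids, axis=-1)

print(from_vectors)  # [9 9 9]
print(from_ids)      # [8 9 6]
```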


# Model 4: My sentence embedding

In [138]:
batch_2 = [
    'The meaning of life is a good movie',
    'I did not like what happened in he end',
    'Why woudld the director do this'
]

batch_2 = vectorization_layer(batch_2)
print(batch_2)
tf.sqrt(tf.math.count_nonzero(batch_2, axis=-1, keepdims=True, dtype=tf.float32))

tf.Tensor(
[[  2   1   5 119   7   4  50  18   0]
 [ 10 117  22  39  49 562   8  27 129]
 [134   1   2 172  82  11   0   0   0]], shape=(3, 9), dtype=int64)


<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[2.828427 ],
       [3.       ],
       [2.4494896]], dtype=float32)>

In [141]:
embedded_batch = embeddings_layer(batch_2)
tf.reduce_sum(embedded_batch, axis=1) / tf.sqrt(tf.math.count_nonzero(batch_2, axis=-1, keepdims=True, dtype=tf.float32))

<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[ 0.00967731, -0.01085351,  0.00827792,  0.01031148,  0.00479365],
       [-0.00711304,  0.03713636,  0.01846291, -0.0345783 , -0.01964763],
       [ 0.047338  , -0.13334246,  0.02891386, -0.00703402,  0.05978665]],
      dtype=float32)>

In [145]:
class SentenceEmbedding(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.embedding_layer = tf.keras.layers.Embedding(
            input_dim=self.input_dim,
            output_dim=self.output_dim,
            mask_zero=True
        )

    def call(self, input):
        n_words = tf.math.count_nonzero(input, axis=-1, keepdims=True, dtype=tf.float32)
        embeddings_output = self.embedding_layer(input)
        return tf.reduce_sum(embeddings_output, axis=1) / tf.sqrt(n_words)

    def get_config(self):
        base_config = super().get_config()
        return {
            **base_config,
            'input_dim': self.input_dim,
            'output_dim': self.output_dim
        }

In [146]:
model_4_embeddings = tf.keras.models.Sequential([
    vectorization_layer,
    SentenceEmbedding(input_dim=max_tokens, output_dim=20),
])
model_4 = tf.keras.models.Sequential([
    model_4_embeddings,
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_4.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Nadam(),
    metrics=[tf.keras.metrics.binary_accuracy]
)
model_4.fit(train, epochs=5, validation_data=valid)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd718a99410>

In [150]:
queen = model_4_embeddings.predict(['queen'])
king = model_4_embeddings.predict(['king'])
man = model_4_embeddings.predict(['man'])
woman = model_4_embeddings.predict(['woman'])

print(distance(king, man))
print(distance(king, queen))
print(distance(queen, man))
print(distance(queen, woman))
print(distance(king - man + woman, queen))

0.28287014
0.27478057
0.2349989
0.2859396
0.5066771


f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.

g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").

In [152]:
import tensorflow_datasets as tfds

datasets = tfds.load(name="imdb_reviews")
train_set, test_set = datasets["train"], datasets["test"]



Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/amitaharoni/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Dataset imdb_reviews downloaded and prepared to /home/amitaharoni/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [153]:
for example in train_set.take(1):
    print(example["text"])
    print(example["label"])

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)

