# TensorFlow

In most cases, the high-level Keras API is all we need. However, when we need extra control to write custom loss functions, metrics, layers, models, initializers, regularizers, weight constraints, and more, we need a lower-level library. [TensorFlow](https://www.tensorflow.org/) is a powerful library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. It was developed by the Google Brain team and it powers many of Google’s large-scale services. It was open sourced in November 2015, and it is now one of the most popular Deep Learning libraries. At the lowest level, each TensorFlow operation is implemented using highly efficient C++ code. Many operations have multiple implementations called **kernels**: each kernel is dedicated to a specific device type [(CPU, GPU, or TPU)](https://storage.googleapis.com/nexttpu/index.html). The overall TensorFlow architecture is shown in the next figure:

![](images/tensorflow-architecture.png)

## Tensors and Variables

TensorFlow revolves around the **tensor** concept, which is a multidimensional array, exactly like a NumPy ndarray, but it can also hold a scalar. Tensors flow from **operation** to operation (hence the name). We can create a tensor with:

In [283]:
import tensorflow as tf

tf.constant([[1., 2., 3.], [4., 5., 6.]]) # matrix
tf.constant(42) # scalar

<tf.Tensor: shape=(), dtype=int32, numpy=42>

Just like a NumPy array, a tensor has a shape and a data type:

In [284]:
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])

print(t.shape)
print(t.dtype)

(2, 3)
<dtype: 'float32'>


Indexing works much like in NumPy:

In [285]:
t[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

All sorts of tensor operations are available:

In [286]:
print(t + 10)
print(tf.square(t))
print(tf.transpose(t))

tf.Tensor(
[[11. 12. 13.]
 [14. 15. 16.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[ 1.  4.  9.]
 [16. 25. 36.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[1. 4.]
 [2. 5.]
 [3. 6.]], shape=(3, 2), dtype=float32)


We find all the basic math operations we need (tf.add(), tf.multiply(), tf.square(), tf.exp(), tf.sqrt(), etc.) and most operations that we can find in NumPy (tf.reshape(), tf.squeeze(), tf.tile(), etc.). However, some operations have a different name. For example, tf.reduce_sum() is the equivalent of np.sum(). The TensorFlow operation is named differently because it behaves differently: its GPU kernel uses a reduce algorithm that does not guarantee the order in which the elements are added, and because 32-bit floats have limited precision, the result may change slightly every time we call this operation. Whenever a name differs from NumPy, there is generally a reason behind it.
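
The order sensitivity described above can be reproduced without a GPU. The following sketch uses NumPy float32 scalars (the three values are chosen purely for illustration) to show that float32 addition is not associative, which is why a reduction that adds the elements in a different order may return a slightly different result:

```python
import numpy as np

# Three float32 values chosen so that rounding matters (illustrative values).
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Near 1e8, consecutive float32 values are 8 apart, so b + c rounds back to b.
left_to_right = (a + b) + c   # 0 + 1
right_to_left = a + (b + c)   # 1e8 + (-1e8)

print(left_to_right, right_to_left)
```

The two sums differ even though they add the same three numbers, only grouped differently.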

Tensors play nicely with NumPy: we can create a tensor from a NumPy array, and vice versa. We can even apply TensorFlow operations to NumPy arrays and NumPy operations to tensors:

In [287]:
import numpy as np

a = np.array([2., 4., 5.])
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])

print(tf.constant(a))
print(np.array(t))

print(tf.square(a))
print(np.square(t))

tf.Tensor([2. 4. 5.], shape=(3,), dtype=float64)
[[1. 2. 3.]
 [4. 5. 6.]]
tf.Tensor([ 4. 16. 25.], shape=(3,), dtype=float64)
[[ 1.  4.  9.]
 [16. 25. 36.]]


Because type conversions can significantly hurt performance, TensorFlow does not perform any type conversions automatically: it just raises an exception when we try to combine tensors of incompatible types. Of course, we can use **tf.cast()** when we really need to convert types:

In [288]:
try:
    tf.constant(2.0) + tf.constant(40)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2] name: 


In [289]:
t2 = tf.constant(40., dtype=tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

Tensors are **immutable**: this means that we cannot use them to implement the weights of a neural network, since weights need to be tweaked by backpropagation. What we need is a **Variable**:

In [290]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
print(v)

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>


A variable acts much like a tensor: we can perform the same operations with it, and it plays nicely with NumPy as well. But it can also be modified in place using several methods:

In [291]:
v.assign(2 * v)
print(v)

v[0, 1].assign(42)
print(v)

v[:, 2].assign([0., 1.])
print(v)

v.scatter_nd_update(indices=[[0, 0], [1, 2]], updates=[100., 200.])
print(v)

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>
<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>
<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>
<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>


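In practice, we rarely have to create variables manually, since Keras layers create them for us through the add_weight() method. We also rarely need to update them manually, since model parameters are generally updated directly by the optimizers.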

TensorFlow supports several other data structures, including:
- Sparse tensors (**tf.SparseTensor**) efficiently represent tensors containing mostly zeros.
- Tensor arrays (**tf.TensorArray**) are lists of tensors. They have a fixed size by default but can optionally be made dynamic. All tensors they contain must have the same shape and data type.
- Ragged tensors (**tf.RaggedTensor**) represent lists of tensors that all have the same rank and data type but may have different sizes along some dimensions (the ragged dimensions).
- String tensors are regular tensors of type **tf.string**. These represent byte strings, not Unicode strings.
- Sets (**tf.sets**) are represented as regular tensors (or sparse tensors). More generally, each set is represented by a vector in the tensor’s last axis.
- Queues (**tf.queue**) store tensors across multiple steps.
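
As a brief illustration of three of these structures, here is a minimal sketch (all values are made up for the example) that builds a sparse tensor, a ragged tensor, and a string tensor, and inspects them:

```python
import tensorflow as tf

# Sparse tensor: only the non-zero entries are stored (indices are given in row-major order).
sparse = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                         values=[1., 2., 3.],
                         dense_shape=[3, 4])
dense = tf.sparse.to_dense(sparse)

# Ragged tensor: rows may have different lengths.
ragged = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])
row_lengths = ragged.row_lengths()

# String tensor: it holds bytes, so "é" counts as 2 bytes in UTF-8.
text = tf.constant("café")
n_bytes = tf.strings.length(text)
n_chars = tf.strings.length(text, unit="UTF8_CHAR")

print(dense.numpy())
print(row_lengths.numpy(), n_bytes.numpy(), n_chars.numpy())
```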

## Custom Functions

### Loss function

Suppose we want to train a regression model, but our training set is a bit noisy. Which loss function should we use? The mean squared error might penalize large errors too much, while the mean absolute error would not penalize outliers as much, but training might take a while to converge. This is probably a good time to use the [**Huber loss**](https://en.wikipedia.org/wiki/Huber_loss) (it is currently part of the Keras API, but let’s pretend it’s not there). We can create a function that takes the labels and predictions as arguments, and use TensorFlow operations to compute every instance’s loss:

In [None]:
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss  = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)
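
For reference, the function above corresponds to the Huber loss with a threshold of 1. Writing $e$ for the error (a symbol introduced here only for illustration), it computes:

$$
L(e) =
\begin{cases}
\frac{1}{2} e^2 & \text{if } |e| < 1 \\
|e| - \frac{1}{2} & \text{otherwise}
\end{cases}
\qquad \text{with } e = y_{\text{true}} - y_{\text{pred}}
$$

For example, an error of 0.5 gives a loss of 0.125, while an error of 3 gives a loss of 2.5 rather than the 4.5 that half the squared error would produce.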

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 3.5))
z = np.linspace(-4, 4, 200)
plt.plot(z, huber_fn(0, z), "b-", linewidth=2, label="huber($z$)")
plt.axis([-4, 4, 0, 4])
plt.grid(True)
plt.xlabel("$z$")
plt.title("Huber loss", fontsize=14)
plt.show()

Now we can use this loss when we compile the Keras model, then train the model. Let's try with the California housing dataset.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

In [None]:
from tensorflow import keras

input_shape = X_train.shape[1:]

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="selu", kernel_initializer="lecun_normal", input_shape=input_shape),
    keras.layers.Dense(1),
])

In [None]:
model.compile(loss=huber_fn, optimizer="nadam")

model.fit(X_train_scaled, y_train, epochs=10,
          validation_data=(X_valid_scaled, y_valid))

For each batch during training, Keras will call the huber_fn() function to compute the loss and use it to perform a Gradient Descent step. Moreover, it will keep track of the total loss since the beginning of the epoch, and it will display the mean loss.

Saving a model containing a custom loss function works fine, as Keras saves the name of the function. Whenever we load it, we need to provide a dictionary that maps the function name to the actual function. More generally, when we load a model containing custom objects, we need to map the names to the objects:

In [None]:
model.save("./data/my_model_with_a_custom_loss.h5")

In [None]:
model = keras.models.load_model("./data/my_model_with_a_custom_loss.h5",
                                custom_objects={"huber_fn": huber_fn})

### Other Functions

Most Keras functionalities (losses, regularizers, constraints, initializers, metrics, activation functions, layers, and even full models) can be customized in the same way. Most of the time, you will just need to write a function with the appropriate inputs and outputs.

Here are examples of a custom activation function, a custom initializer, a custom regularizer, and a custom constraint:

In [None]:
keras.backend.clear_session()

In [None]:
def my_softplus(z):
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_positive_weights(weights): 
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="selu", kernel_initializer="lecun_normal",
                       input_shape=input_shape),
    keras.layers.Dense(1, activation=my_softplus,
                       kernel_regularizer=my_l1_regularizer,
                       kernel_constraint=my_positive_weights,
                       kernel_initializer=my_glorot_initializer),
])

In [None]:
model.compile(loss=huber_fn, optimizer="nadam")

model.fit(X_train_scaled, y_train, epochs=10,
          validation_data=(X_valid_scaled, y_valid))

The activation function will be applied to the output of the layer, and the layer’s weights will be initialized using the value returned by the initializer. At each training step, the weights will be passed to the regularization function to compute the regularization loss, which will be added to the main loss to get the final loss used for training. Finally, the constraint function will be called after each training step, and the layer’s weights will be replaced by the constrained weights.
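
Since these custom functions are plain TensorFlow code, we can also call them directly to check what they return. The following sketch repeats three of the definitions so that it runs on its own, and applies them to a small weight matrix that is purely hypothetical:

```python
import tensorflow as tf

def my_softplus(z):
    return tf.math.log(tf.exp(z) + 1.0)

def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_positive_weights(weights):
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

# Hypothetical weight matrix, used only to illustrate the calls.
weights = tf.constant([[1., -2.], [-3., 4.]])

activation = my_softplus(tf.constant([0.]))   # log(2)
reg_loss = my_l1_regularizer(weights)         # 0.01 * (1 + 2 + 3 + 4)
constrained = my_positive_weights(weights)    # negative weights replaced by 0

print(activation.numpy(), reg_loss.numpy())
print(constrained.numpy())
```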

### Custom Metrics

Losses and metrics are conceptually not the same thing: losses are used by Gradient Descent to train a model, so they must be differentiable and their gradients should
not be 0 everywhere. Plus, it’s ok if they are not easily interpretable by humans. In contrast, metrics are used to evaluate a model: they must be more easily interpretable, and they can be non-differentiable or have 0 gradients everywhere.

Defining a custom metric function is exactly the same as defining a custom loss function. In fact, we could even use the Huber loss function we created earlier as a metric.

In [None]:
keras.backend.clear_session()

In [None]:
model.compile(loss="mse", optimizer="nadam", metrics=[huber_fn])

model.fit(X_train_scaled, y_train, epochs=2)

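For each batch during training, Keras will compute this metric and keep track of its mean since the beginning of the epoch. Most of the time, this is exactly what we want. But not always! Sometimes we need a **streaming metric** (or **stateful metric**), which is gradually updated, batch after batch. The precision of a binary classifier is a typical example: the mean of the per-batch precisions is generally not the precision over all the instances seen so far, so the metric must keep running counts of true positives and false positives instead.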
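
To see why averaging per-batch values is not always enough, consider the precision of a binary classifier over two batches. The counts below are hypothetical, and the sketch uses plain Python rather than the Keras API:

```python
# Hypothetical per-batch counts for a binary classifier: (true positives, false positives)
batches = [(4, 1), (0, 3)]

# Averaging the precision of each batch gives both batches the same weight.
batch_precisions = [tp / (tp + fp) for tp, fp in batches]
mean_of_batch_precisions = sum(batch_precisions) / len(batch_precisions)

# A streaming metric accumulates the counts and computes the ratio at the end.
total_tp = sum(tp for tp, _ in batches)
total_fp = sum(fp for _, fp in batches)
streaming_precision = total_tp / (total_tp + total_fp)

print(mean_of_batch_precisions, streaming_precision)
```

The two numbers differ (0.4 versus 0.5 with these counts), and only the second one is the precision over all the positive predictions made so far.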

If we need to create such a streaming metric, we can create a subclass of the
**keras.metrics.Metric** class. Here is a simple example that keeps track of the total Huber loss and the number of instances seen so far. 

In [None]:
class HuberMetric(keras.metrics.Metric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs) # handles base args (e.g., dtype)
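        # pass the name by keyword: the positional order of add_weight() differs across Keras versions
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")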
    def huber_fn(self, y_true, y_pred): # workaround
        error = y_true - y_pred
        is_small_error = tf.abs(error) < 1
        squared_loss = tf.square(error) / 2
        linear_loss  = tf.abs(error) - 0.5
        return tf.where(is_small_error, squared_loss, linear_loss)
    def update_state(self, y_true, y_pred, sample_weight=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
    def result(self):
        return self.total / self.count
    def get_config(self):
        base_config = super().get_config()
        return {**base_config}

The constructor uses the **add_weight()** method to create the variables needed to keep track of the metric’s state over multiple batches (the sum of all losses and the number of instances seen so far). The **update_state()** method updates the variables, given the labels and predictions for one batch. The **result()** method computes and returns the final result, in this case the mean Huber metric over all instances.

In [None]:
model.compile(loss="mse", optimizer="nadam", metrics=[HuberMetric()])

model.fit(X_train_scaled, y_train, epochs=2)
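
To make the role of the two state variables concrete, here is a minimal sketch that mimics update_state() and result() with plain tf.Variable objects instead of a Keras metric. The two batches are hypothetical, and huber_fn() is repeated so that the snippet runs on its own:

```python
import tensorflow as tf

def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

# State kept across batches, mirroring the metric's two weights.
total = tf.Variable(0.)
count = tf.Variable(0.)

# Two hypothetical batches of (labels, predictions).
batches = [
    (tf.constant([2., 0.]), tf.constant([1.5, 3.])),   # losses: 0.125 and 2.5
    (tf.constant([1.]), tf.constant([1.])),            # loss: 0
]

for y_true, y_pred in batches:
    # Same updates as in update_state()
    total.assign_add(tf.reduce_sum(huber_fn(y_true, y_pred)))
    count.assign_add(tf.cast(tf.size(y_true), tf.float32))

mean_huber = total / count   # same computation as result(): 2.625 / 3
print(mean_huber.numpy())
```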

### Custom Layers

We may want to build an architecture that contains a new type of layer for which TensorFlow does not provide a default implementation. In this case, we will need to create a custom layer. Or we may simply want to build a very repetitive architecture, containing identical blocks of layers repeated many times, and it would be convenient to treat each block of layers as a single layer.

Some layers have no weights (such as keras.layers.Flatten or keras.layers.ReLU). If we want to create a custom layer without any weights, the simplest option is to write a function and wrap it in a **keras.layers.Lambda** layer. For example, the following layer will apply the exponential function to its inputs:

In [None]:
exponential_layer = keras.layers.Lambda(lambda x: tf.exp(x))

This custom layer can then be used like any other layer, using the Sequential API, the Functional API, or the Subclassing API. Adding an exponential layer at the output of a regression model can be useful if the values to predict are positive and with very different scales (e.g., 0.001, 10., 10000).

In [None]:
keras.backend.clear_session()

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=input_shape),
    keras.layers.Dense(1),
    exponential_layer
])
model.compile(loss="mse", optimizer="nadam")

model.fit(X_train_scaled, y_train, epochs=5,
          validation_data=(X_valid_scaled, y_valid))

model.evaluate(X_test_scaled, y_test)

To build a custom stateful layer (a layer with weights), we need to create a subclass of the **keras.layers.Layer** class. For example, the following class implements a simplified version of the Dense layer:

In [None]:
class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)

    def build(self, batch_input_shape):
        self.kernel = self.add_weight(
            name="kernel", shape=[batch_input_shape[-1], self.units],
            initializer="glorot_normal")
        self.bias = self.add_weight(
            name="bias", shape=[self.units], initializer="zeros")
        super().build(batch_input_shape) # must be at the end

    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)

    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "units": self.units,
                "activation": keras.activations.serialize(self.activation)}

The constructor takes all the hyperparameters as arguments (units and activation) and also additional keyword arguments (\*\*kwargs) to take care of standard arguments such as input_shape, trainable, and name. It saves the hyperparameters as attributes, converting the activation argument to the appropriate activation function using the **keras.activations.get()** function. The other methods work as follows:
- The **build()** method creates the layer’s variables by calling the add_weight() method for each weight. It is called the first time the layer is used. At that point, Keras knows the shape of this layer’s inputs and passes it to build(), which is often necessary to create some of the weights. For example, we need to know the number of neurons in the previous layer in order to create the connection weights matrix. At the end of the build() method, we must call the parent’s build() method: this tells Keras that the layer is built.
- The **call()** method performs the desired operations: it computes the matrix multiplication of the inputs and the layer’s kernel, adds the bias vector, and applies the activation function.
- The **compute_output_shape()** method returns the shape of this layer’s outputs. We can generally omit this method, as Keras automatically infers the output shape, except when the layer is dynamic.
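
The shapes involved are easier to follow on a concrete case. The sketch below reproduces, with plain TensorFlow variables, what build() creates and what call() computes. The sizes are hypothetical, and random values stand in for the Glorot initializer:

```python
import tensorflow as tf

n_inputs, units = 8, 30               # hypothetical sizes
X = tf.random.normal([4, n_inputs])   # a batch of 4 instances

# What build() creates once the last input dimension is known
# (random values stand in for the initializer used by the layer).
kernel = tf.Variable(tf.random.normal([n_inputs, units]), name="kernel")
bias = tf.Variable(tf.zeros([units]), name="bias")

# What call() computes, here with a ReLU activation.
outputs = tf.nn.relu(X @ kernel + bias)

print(outputs.shape)
```

The kernel needs one row per input feature, which is why it cannot be created before the input shape is known.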

We can now use a MyDense layer just like any other layer!

In [None]:
keras.backend.clear_session()

In [None]:
model = keras.models.Sequential([
    MyDense(30, activation="relu", input_shape=input_shape),
    MyDense(1)
])

model.compile(loss="mse", optimizer="nadam")

model.fit(X_train_scaled, y_train, epochs=2,
          validation_data=(X_valid_scaled, y_valid))

To create a layer with multiple inputs, the argument to the call() method should be a tuple containing all the inputs, and similarly the argument to the compute_output_shape() method should be a tuple containing each input’s batch shape. 

To create a layer with multiple outputs, the call() method should return the list of outputs, and compute_output_shape() should return the list of batch output shapes
(one per output). 

For example, the following toy layer takes two inputs and returns two outputs (their sum and their product):

In [None]:
class MyMultiLayer(keras.layers.Layer):
    def call(self, X):
        X1, X2 = X
        return X1 + X2, X1 * X2

    def compute_output_shape(self, batch_input_shape):
        batch_input_shape1, batch_input_shape2 = batch_input_shape
        return [batch_input_shape1, batch_input_shape2]

This layer may now be used like any other layer, but of course only using the Functional and Subclassing APIs, not the Sequential API (which only accepts layers with one input and one output).

If your layer needs to have a different behavior during training and during testing, then you must add a training argument to the call() method and use it to decide what to do. 

For example, let’s create a layer that adds Gaussian noise during training (for regularization) but does nothing during testing:

In [None]:
class AddGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev

    def call(self, X, training=None):
        if training:
            noise = tf.random.normal(tf.shape(X), stddev=self.stddev)
            return X + noise
        else:
            return X

    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape
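
The effect of the training argument can be checked outside Keras. This sketch rewrites the layer's call() logic as a plain function (the function name and the input values are made up for the example) and calls it in both modes:

```python
import tensorflow as tf

def add_gaussian_noise(X, stddev, training=None):
    # Same branching as the layer's call() method, written as a plain function.
    if training:
        return X + tf.random.normal(tf.shape(X), stddev=stddev)
    return X

X = tf.ones([2, 3])
at_inference = add_gaussian_noise(X, stddev=1.0, training=False)
at_training = add_gaussian_noise(X, stddev=1.0, training=True)

print(at_inference.numpy())   # unchanged inputs
print(at_training.numpy())    # inputs plus random noise
```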

### Custom Models

To create a custom model, we need to subclass the **keras.Model** class, create layers and variables in the constructor, and implement the call() method to do whatever we want the model to do.

As an example, suppose we want to build the following model.

<img src="./images/custom-model.png" width="500">

Note that this model does not make much sense; it’s just an example to illustrate the fact that we can easily build any kind of model we want, even one that contains loops and skip connections.

To implement the model, it is best to first create a ResidualBlock layer, since we are going to create a couple of identical blocks:

In [None]:
class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(n_neurons, activation="elu",
                                          kernel_initializer="he_normal")
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

Next, let’s use the Subclassing API to define the model itself:

In [None]:
class ResidualModel(keras.models.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation="elu",
                                          kernel_initializer="he_normal")
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)

    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range(1 + 3):
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

We create the layers in the constructor and use them in the call() method. This model can then be used like any other model (compile it, fit it, evaluate it, and use it to make predictions).

## Exploit model internals

The custom losses and metrics we defined earlier were all based on the labels and the predictions. There will be times when we want to define losses based on other parts of our model, for example for regularization purposes or to monitor some internal aspect of the model.

To define a custom loss based on model internals, compute it based on any part of the model you want, then pass the result to the **add_loss()** method.

For example, let’s build a custom regression MLP model composed of a stack of five hidden layers plus an output layer. This model has an auxiliary output on top of the upper hidden layer. The loss associated with this auxiliary output is called the reconstruction loss: it is the mean squared difference between the reconstruction and the inputs. By adding this reconstruction loss to the main loss, we encourage the model to preserve as much information as possible through the hidden layers. It sometimes improves generalization.

In [None]:
class ReconstructingRegressor(keras.models.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation="selu",
                                          kernel_initializer="lecun_normal")
                       for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)

    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = keras.layers.Dense(n_inputs)
        super().build(batch_input_shape)

    def call(self, inputs, training=None):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)
        return self.out(Z)

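The constructor creates the five hidden layers and the output layer. The build() method creates an extra dense layer that is used to reconstruct the inputs: it must be created there because its number of units must equal the number of inputs, which is not known before build() is called. Then the call() method processes the inputs through the hidden layers, passes the result to the reconstruction layer, computes the reconstruction loss, and adds it to the model’s list of losses using the add_loss() method. The loss is scaled by a factor of 0.05 before being added.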

Similarly, we can add a custom metric based on model internals by computing it in any way we want and passing the result to the model’s **add_metric()** method.

In some rare cases, the fit() method may not be flexible enough for what we need to do and we may need to **customize the training loop itself**. 

For example, the fit() method only uses one optimizer (the one that we specify when compiling the model). If we need to use two different optimizers for different paths of the network, we have to write our own custom loop.

However, remember that writing a custom training loop will make the code longer, more error-prone, and harder to maintain.

First, let’s build a simple model:

In [None]:
l2_reg = keras.regularizers.l2(0.05)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="elu", kernel_initializer="he_normal",
                       kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

Next, let’s create a function that randomly samples a batch of instances from the training set:

In [None]:
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

Let’s also define a function to display the training status (number of steps, the total number of steps, the mean loss since the start of the epoch, etc.):

In [None]:
def print_status_bar(iteration, total, loss, metrics=None):
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result())
                         for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics, end=end)

Now, we define some hyperparameters, choose the optimizer, the loss function, and the metrics:

In [None]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)  # `lr` is a deprecated alias
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

We are ready to build the custom loop:

In [None]:
for epoch in range(1, n_epochs + 1):
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        for variable in model.variables:
            if variable.constraint is not None:
                variable.assign(variable.constraint(variable))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
    print_status_bar(len(y_train), len(y_train), mean_loss, metrics)
    for metric in [mean_loss] + metrics:
        metric.reset_states()

We create two nested loops: one for the epochs, the other for the batches within an epoch. Then we sample a random batch from the training set. 

To compute the gradients, we use the **tf.GradientTape** API to [differentiate automatically](https://www.jmlr.org/papers/volume18/17-468/17-468.pdf). TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, it traverses this list of operations in reverse order to compute gradients. In other words, TensorFlow "records" the relevant operations onto a "tape" and then uses that tape to compute the gradients of the recorded computation using **reverse mode differentiation**. The tape is automatically erased immediately after we call its gradient() method, so we will get an exception if we try to call gradient() twice (unless we create the tape with persistent=True).
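
The tape can be tried on its own with a small function. In this sketch, the function f and the values of w1 and w2 are chosen only for illustration. It computes the gradients of f with regard to two variables, then shows that a second call to gradient() on the same non-persistent tape fails:

```python
import tensorflow as tf

def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape() as tape:
    z = f(w1, w2)   # operations involving variables are recorded on the tape

# df/dw1 = 6*w1 + 2*w2 = 36 and df/dw2 = 2*w1 = 10
gradients = tape.gradient(z, [w1, w2])
print([g.numpy() for g in gradients])

# The tape is erased after the first gradient() call, so a second call fails.
try:
    tape.gradient(z, [w1, w2])
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print(second_call_failed)
```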

Inside the tf.GradientTape() block, we make a prediction for one batch (using the model as a function), and we compute the loss: it is equal to the main loss plus the other losses (in this model, there is one regularization loss per layer). Since the mean_squared_error() function returns one loss per instance, we compute the mean over the batch using tf.reduce_mean(). If we need to apply different weights to each instance, this is where we would do it. The regularization losses are already reduced to a single scalar each, so we just need to sum them. Next, we ask the tape to compute the gradient of the loss with regard to each trainable variable, and we apply them to the optimizer to perform a Gradient Descent step. Then we update the mean loss and the metrics and we display the status bar.

If we need to apply any transformation to the gradients, we can simply do so before calling the apply_gradients() method. If the model uses weight constraints, they must be applied just after apply_gradients(), which is what the loop above does by calling each variable's constraint and assigning the result back to the variable.
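
As an illustration of both points, the following sketch clips the gradients by norm before the update and applies a non-negativity constraint after it. It is a simplified stand-in for the loop above: a manual gradient descent step replaces optimizer.apply_gradients(), and the variable, learning rate, and clipping threshold are all hypothetical:

```python
import tensorflow as tf

w = tf.Variable([3., -4.])
learning_rate = 0.1   # hypothetical value

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(w))

gradients = tape.gradient(loss, [w])                     # [6, -8], norm 10
clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]   # [0.6, -0.8], norm 1

# Plain gradient descent step, standing in for optimizer.apply_gradients().
for variable, grad in zip([w], clipped):
    variable.assign_sub(learning_rate * grad)            # [2.94, -3.92]

# Constraint applied after the update: keep the weights non-negative.
w.assign(tf.nn.relu(w))                                  # [2.94, 0.]
print(w.numpy())
```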

### Functions

To show the graph concept of TensorFlow, let’s start with a trivial function that computes the cube of its input:

In [None]:
def cube(x):
    return x ** 3

cube(2)

Now, let’s use **tf.function()** to convert this Python function to a TensorFlow Function. It can then be used exactly like the original function,
and it will return the same result (but as tensors).

In [None]:
tf_cube = tf.function(cube)
tf_cube(2)

Under the hood, TensorFlow analyzes the computations performed by the function and generates an equivalent **computation graph**, a data structure that contains a set of **tf.Operation** objects, which represent units of computation. TensorFlow then optimizes this graph: it prunes unused nodes, simplifies expressions, and more.

Once the optimized graph is ready, the TensorFlow function efficiently executes the operations in the graph, in the appropriate order (and in parallel when it can).

As a result, a TensorFlow function will usually run much faster than the original Python function, especially if it performs complex computations.

When we write a custom loss function, a custom metric, a custom layer, or any other custom function and we use it in a model, Keras automatically converts these functions into a TensorFlow function transparently.

TensorFlow starts generating a graph by analyzing the function's source code to capture all the control flow statements (for loops, while loops, if, break, continue, and return statements). This first step is called **AutoGraph**. AutoGraph then outputs an upgraded version of the function in which all the control flow statements are replaced by the appropriate TensorFlow operations, such as **tf.while_loop()** for loops and **tf.cond()** for if statements. Next, TensorFlow calls this upgraded function, but instead of passing the actual argument, it passes a **symbolic tensor** (a tensor without any actual value, only a name, a data type, and a shape). The function runs in graph mode, meaning that each TensorFlow operation adds a node to the graph to represent itself and its output tensor. In graph mode, operations do not perform any computations. The following figure shows an example:


<img src="./images/autograph.png" width="500">
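
One way to observe tracing is to put a plain Python side effect inside the function: it runs while the graph is being built, not when the graph is executed. In the sketch below, the names traced_cube and trace_log are made up for the example:

```python
import tensorflow as tf

trace_log = []   # filled by a Python side effect, which only runs during tracing

@tf.function
def traced_cube(x):
    trace_log.append(x.dtype.name)   # plain Python: executed while tracing only
    return x ** 3

traced_cube(tf.constant(2.0))   # first float32 call: the function is traced
traced_cube(tf.constant(3.0))   # same dtype and shape: the graph is reused
traced_cube(tf.constant(2))     # int32 input: the function is traced again

print(trace_log)
```

The log contains one entry per traced signature rather than one per call, since the second call reuses the graph built for the first one.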

We can get the concrete function by calling the **get_concrete_function()** method. We can then access its computation graph using the **graph** attribute, and get the list of its operations by calling the **get_operations()** method:

In [None]:
concrete_function = tf_cube.get_concrete_function(tf.constant(2.0))
print(concrete_function)

In [None]:
print(concrete_function.graph)
ops = concrete_function.graph.get_operations()
ops

In the example, the first operation represents the input argument x (it is called a placeholder), the second operation represents the constant 3, the third operation represents the power operation, and the final operation represents the output of this function (it is an identity operation, meaning it will do nothing more than copy the output of the power operation).

Each operation has a list of input and output tensors that you can easily access using the **inputs** and **outputs** attributes. For example, let’s get the list of inputs and outputs of the power operation:

In [None]:
pow_op = ops[2]
print(pow_op.inputs)
print(pow_op.outputs)

This computation graph is represented in the following figure:

<img src="./images/computation-graph.png" width="500">

Note that each operation and each tensor has a unique name and we can fetch an operation or a tensor by name using the **get_operation_by_name()** or **get_tensor_by_name()** methods:

In [None]:
concrete_function.graph.get_operation_by_name('pow')

In [None]:
concrete_function.graph.get_tensor_by_name('pow:0')

If our function contains a simple for loop, what do we expect will happen? For example, let’s write a function that adds 10 to its input by adding 1 ten times (we can use tf.function as a decorator):

In [None]:
@tf.function
def add_10(x):
    for i in range(10):
        x += 1
    return x

If we look at its graph, we find that it does not contain a loop: it just contains 10 addition operations:

In [None]:
add_10.get_concrete_function(tf.constant(0)).graph.get_operations()

This actually makes sense: when the function got traced, the loop ran 10 times, so the x+=1 operation was run 10 times, and since it was in graph mode, it recorded this operation 10 times in the graph. We can think of this for loop as a "static" loop that gets unrolled when the graph is created.

If we want the graph to contain a "dynamic" loop instead, we can create one manually using the **tf.while_loop()** operation, but it is not very intuitive. It is much simpler to use TensorFlow’s AutoGraph feature. AutoGraph only captures for loops that iterate over **tf.range()**, not **range()**, which gives us the choice between a static loop and a dynamic one:

In [None]:
@tf.function
def add_10_2(x):
    for i in tf.range(10):
        x += 1
    return x

In [None]:
add_10_2.get_concrete_function(tf.constant(0)).graph.get_operations()
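To compare the two graphs without reading the full operation lists, we can look only at the operation types. This is a minimal sketch that redefines both functions under the hypothetical names `add_10_static` and `add_10_dynamic`. The exact type names (for example While or StatelessWhile) depend on the TensorFlow version:

```python
import tensorflow as tf

@tf.function
def add_10_static(x):
    for i in range(10):      # Python range: unrolled while tracing
        x += 1
    return x

@tf.function
def add_10_dynamic(x):
    for i in tf.range(10):   # tf.range: captured by AutoGraph as a graph loop
        x += 1
    return x

def op_types(tf_func):
    # Trace the function for an int32 scalar and list its operation types.
    graph = tf_func.get_concrete_function(tf.constant(0)).graph
    return [op.type for op in graph.get_operations()]

static_types = op_types(add_10_static)
dynamic_types = op_types(add_10_dynamic)
print(static_types)
print(dynamic_types)
```

Both functions return the same value. Only the structure of the graph differs: ten separate additions in the first case, a single loop operation in the second.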

## Data API

So far we have used only datasets that fit in memory, but ML systems are often trained on very large datasets that will not fit in RAM. Ingesting a large dataset and preprocessing it efficiently can be tricky to implement, but TensorFlow makes it easy thanks to the **Data API**: we just create a dataset object, and tell it where to get the data and how to transform it. TensorFlow takes care of all the implementation details (multithreading, queuing, batching, and prefetching). Off the shelf, the Data API can read from text files (such as CSV files), binary files with fixed-size records, binary files that use the TFRecord format (a flexible and efficient binary format based on the open source **Protocol Buffers**), and SQL databases.

Reading huge datasets efficiently is not the only difficulty: the data also needs to be preprocessed, usually normalized. Moreover, it is not always composed strictly of convenient numerical fields: there may be text
features, categorical features, and so on. These need to be encoded, for
example using one-hot encoding, bag-of-words encoding, or embeddings. One option to handle all this preprocessing is to write our own custom preprocessing layers. Another is to use the standard preprocessing layers provided by Keras.

The whole Data API revolves around the concept of a **dataset**, which is a sequence of data items. Usually, we will use datasets that gradually read data from disk, but for simplicity let’s create a dataset entirely in RAM using:

In [None]:
X = tf.range(10) # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
print(dataset)

The **from_tensor_slices()** function takes a tensor and creates a **tf.data.Dataset** whose elements are all the slices of X (along the first
dimension), so this dataset contains 10 items. We can iterate over a dataset:

In [None]:
for item in dataset:
    print(item)

Once we have a dataset, we can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so
we can chain transformations:

In [None]:
dataset = dataset.repeat(3).batch(7)

for item in dataset:
    print(item)

In the example, we first call the **repeat()** method, which returns a new dataset that repeats the items of the original one three times (without copying all the data in memory three times). Then we call the **batch()** method, which again creates a new dataset by grouping the items into batches of seven. Finally, we iterate over the items of this final dataset. The batch() method had to output a final batch of size two, but we can call it with **drop_remainder=True** if we want it to drop this final batch so that all batches have the exact same size.
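The following sketch shows the effect of drop_remainder=True on the same kind of pipeline. It uses `tf.data.Dataset.range(10)` instead of the tensor X above, only to keep the example self-contained:

```python
import tensorflow as tf

# 10 items repeated 3 times gives 30 items; 30 is not a multiple of 7.
dataset = tf.data.Dataset.range(10).repeat(3).batch(7, drop_remainder=True)

# Collect the shape of every batch to check that they are all identical.
batch_shapes = [batch.shape.as_list() for batch in dataset]
print(batch_shapes)
```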

We can also transform items by calling the **map()** method, in order to apply any preprocessing to data. 

In [None]:
dataset = dataset.map(lambda x: x * 2)

for item in dataset:
    print(item)

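Whereas map() applies a transformation to each item, the **apply()** method applies a transformation to the dataset as a whole. Some whole-dataset transformations are also available directly as methods, such as **unbatch()**, which splits each batch back into individual items: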

In [None]:
dataset = dataset.unbatch()

for item in dataset:
    print(item)

It is also possible to simply filter the dataset using the **filter()** method:

In [None]:
dataset = dataset.filter(lambda x: x < 10)

for item in dataset:
    print(item)

To look at just a few items from a dataset, we can use the **take()** method:

In [None]:
for item in dataset.take(3):
    print(item)

It is possible to shuffle instances using the **shuffle()** method. It will create a new dataset that will start by filling up a buffer with the first items of the source. Then, whenever it is asked for an item, it will pull one out randomly from the buffer and replace it with a fresh one from the source dataset, until it has iterated entirely through the source dataset. We have to specify the buffer size, and it is important to make it large enough, or else shuffling will not be very effective:

In [None]:
dataset = dataset.shuffle(buffer_size=3, seed=42)

for item in dataset:
    print(item)
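To see why the buffer size matters, here is a pure-Python sketch of the buffer mechanism described above. It is a simplified model written for illustration, not TensorFlow's actual implementation, and `buffer_shuffle` is a hypothetical name:

```python
import random

def buffer_shuffle(source, buffer_size, seed=None):
    """Yield the items of `source` in the order a shuffle buffer would."""
    rng = random.Random(seed)
    source = iter(source)
    buffer = []
    # Fill the buffer with the first items of the source.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Pull a random item out and replace it with a fresh one from the source.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: drain what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled_small = list(buffer_shuffle(range(10), buffer_size=3, seed=42))
shuffled_none = list(buffer_shuffle(range(10), buffer_size=1, seed=42))
print(shuffled_small)
print(shuffled_none)
```

With a buffer of size 1, the order is unchanged. More generally, an item can never be emitted more than buffer_size - 1 positions ahead of its original position, which is why a small buffer only shuffles locally.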

For large datasets (where the buffer is too small compared to the dataset), this simple shuffling approach may not be sufficient. One solution is to shuffle the source data itself. Even if the source data is shuffled, we will usually want to shuffle it some more, or else the same order will be repeated at each epoch, and the model may end up being biased.

A common approach is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this, we can pick multiple files randomly and read them simultaneously, interleaving their records. Then on top of that we can add a shuffling buffer.

Let’s suppose that we have loaded the California housing dataset, shuffled it, and split it into a training set, a validation set, and a test set. Then we have split each set into many CSV files:

In [None]:
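import os  # used below to build the CSV file paths

import numpy as np  # used below to split and stack the arrays
from sklearn.datasets import fetch_california_housing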
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)

X_mean = scaler.mean_
X_std = scaler.scale_

In [None]:
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("data", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")
    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

In [None]:
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

In [None]:
print(train_filepaths)
print(valid_filepaths)
print(test_filepaths)

Now we can create a dataset containing only the training file paths, using the **list_files()** function. By default, it returns a dataset that shuffles the file paths (we can set shuffle=False to disable this):

In [None]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

for filepath in filepath_dataset:
    print(filepath)

Next, we can call the **interleave()** method to read from five files at a time and interleave their lines (skipping the first line of each file, which is the header row, using the **skip()** method):

In [None]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

The interleave() method will create a dataset that will pull five file paths from the filepath_dataset, and for each one it will call the lambda to create a new dataset (in this case a TextLineDataset). To be clear, at this stage there will be seven datasets in all: the filepath dataset, the interleave dataset, and the five TextLineDatasets. When we iterate over the interleave dataset, it will cycle through these five TextLineDatasets, reading one line at a time from each until all datasets are out of items. Then it will get the next five file paths from the filepath_dataset and interleave them the same way, and so on until it runs out of file paths.
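The interleaving order is easier to see on a toy example. In this sketch, each source element stands for a "file" that contains the same number three times. This setup is an assumption made for illustration and does not read the CSV files created above:

```python
import tensorflow as tf

# Three toy "files": element x stands for a file containing x three times.
sources = tf.data.Dataset.range(1, 4)
interleaved = sources.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(3),
    cycle_length=2)  # read from two "files" at a time

values = [int(v) for v in interleaved]
print(values)
```

The first items alternate between "file" 1 and "file" 2, and "file" 3 is only opened once a reader becomes available.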

By default, interleave() does not use parallelism: it just reads one line at a time from each file, sequentially. To read files in parallel, we can set the **num_parallel_calls** argument to the number of threads we want, or set it to **tf.data.experimental.AUTOTUNE** to make TensorFlow choose the right number of threads dynamically based on the available CPU.

Let’s look at what the dataset contains now:

In [None]:
for line in dataset.take(5):
    print(line.numpy())

These are the first rows (ignoring the header row) of five CSV files, chosen randomly. However, these are just byte strings: we need to parse them and scale the data. The following preprocess() function takes one CSV line and starts by parsing it. For this it uses the **tf.io.decode_csv()** function, which takes two arguments: the first is the line to parse, and the second is an array containing the default value for each column. This array also tells TensorFlow the number of columns and their types.

In [None]:
n_inputs = 8

@tf.function
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
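The role of the default values can be checked on a couple of short lines. In this sketch the lines and the default value 9.0 are arbitrary choices made for illustration. An empty tensor used as a default marks the column as required, which is what preprocess() does for the target column:

```python
import tensorflow as tf

# The second column is empty, so its default value (9.0) is used instead.
fields = tf.io.decode_csv("1.5,,3.0", record_defaults=[0., 9., 0.])
parsed_values = [float(f) for f in fields]
print(parsed_values)

# An empty tensor as default marks the column as required.
required = tf.constant([], dtype=tf.float32)
try:
    tf.io.decode_csv("1.5,", record_defaults=[0., required])
    missing_value_rejected = False
except tf.errors.InvalidArgumentError as error:
    missing_value_rejected = True
    print("Rejected:", error.message)
```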

We can now apply the function to the dataset.

In [None]:
dataset = dataset.map(preprocess, num_parallel_calls=3)

In [None]:
for line in dataset.take(5):
    print(line)

To make the code reusable, let’s put together everything into a small helper function: it will create and return a dataset that will efficiently load California housing data from multiple CSV files, preprocess it, shuffle it, optionally repeat it, and batch it:

In [None]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

The **prefetch(1)** method will do its best to always be one batch ahead. In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready (e.g., reading the data from disk and preprocessing it). This can improve performance dramatically. 

If we also ensure that loading and preprocessing are multithreaded (by setting num_parallel_calls when calling interleave() and map()), we can exploit multiple cores on the CPU and hopefully make preparing one batch of data shorter than running a training step on the GPU: this way the GPU will be almost 100% utilized (except for the data transfer time from the CPU to the GPU), and training will run much faster.

<img src="./images/prefetch.png" width="500">

In [None]:
train_set = csv_reader_dataset(train_filepaths, batch_size=3)

for X_batch, y_batch in train_set.take(2):
    print("X =", X_batch)
    print("y =", y_batch)
    print()

If the dataset is small enough to fit in memory, we can significantly speed up training by using the **cache()** method to cache contents to RAM. We should do this after loading and preprocessing the data, but before shuffling, repeating, batching, and prefetching. This way, each instance will only be read and preprocessed once (instead of once per epoch), but the data will still be shuffled differently at each epoch, and the next batch will still be prepared in advance.
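The ordering described above can be sketched on a tiny in-memory dataset. The `preprocess_item` function is a hypothetical stand-in for real loading and preprocessing, and the sizes are arbitrary:

```python
import tensorflow as tf

def preprocess_item(x):  # hypothetical stand-in for real loading and preprocessing
    return tf.cast(x, tf.float32) / 10.0

dataset = tf.data.Dataset.range(6)
dataset = dataset.map(preprocess_item)     # 1. load and preprocess
dataset = dataset.cache()                  # 2. cache the preprocessed items in RAM
dataset = dataset.shuffle(buffer_size=6)   # 3. shuffle (reshuffled at each iteration)
dataset = dataset.batch(2).prefetch(1)     # 4. batch and prefetch

# Iterate twice to mimic two epochs: same items, usually in a different order.
epochs = []
for epoch in range(2):
    items = [float(v) for batch in dataset for v in batch]
    epochs.append(items)
    print(epoch, items)
```

Because cache() comes before shuffle(), the cached content is the preprocessed data in its original order, so each pass can still reshuffle it.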

Summarizing:

<img src="./images/preprocessing.png" width="500">

Now we can simply build and train Keras models using these datasets. All we need to do is pass the training and validation datasets to the fit() method, instead of X_train, y_train, X_valid, and y_valid:

In [None]:
train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1),
])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

batch_size = 32
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10,
          validation_data=valid_set)

In [None]:
model.evaluate(test_set, steps=len(X_test) // batch_size)

## TFRecord Format

So far we have used CSV files, which are common, simple, and convenient but not really efficient, and do not support large or complex data structures (such as images or audio) very well. The **TFRecord format** is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record is comprised of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data).

We can create a TFRecord file using the **tf.io.TFRecordWriter** class:

In [None]:
with tf.io.TFRecordWriter("./data/my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

And we can then use a **tf.data.TFRecordDataset** to read one or more TFRecord files:

In [None]:
filepaths = ["./data/my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)
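To illustrate the record layout described at the start of this section, the following sketch writes two records to a temporary file and then parses the framing by hand with the standard struct module. It assumes the usual layout (an 8-byte little-endian length, a 4-byte checksum, the data, and another 4-byte checksum) and skips checksum verification. The file name and the `read_raw_records` helper are hypothetical:

```python
import os
import struct
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "framing_demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(b"first")
    writer.write(b"second record")

def read_raw_records(filepath):
    """Parse the TFRecord framing by hand (checksums are read but not verified)."""
    records = []
    with open(filepath, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<Q", header)  # little-endian uint64 length
            f.read(4)                       # checksum of the length (skipped)
            records.append(f.read(length))  # the actual data
            f.read(4)                       # checksum of the data (skipped)
    return records

raw_records = read_raw_records(path)
print(raw_records)
```

In practice we would always read such files with tf.data.TFRecordDataset, which also verifies the checksums.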

By default it will read files one at a time, but if we set the num_parallel_reads parameter, it will read several files in parallel and interleave their records:

In [None]:
filepaths = ["./data/my_test_{}.tfrecord".format(i) for i in range(5)]
for i, filepath in enumerate(filepaths):
    with tf.io.TFRecordWriter(filepath) as f:
        for j in range(3):
            f.write("File {} record {}".format(i, j).encode("utf-8"))

dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)
for item in dataset:
    print(item)

It can sometimes be useful to compress TFRecord files (e.g., when they need to be loaded via a network connection) by setting the options argument:

In [None]:
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("./data/my_compressed.tfrecord", options) as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

In [None]:
dataset = tf.data.TFRecordDataset(["./data/my_compressed.tfrecord"],
                                  compression_type="GZIP")
for item in dataset:
    print(item)

## Protocol Buffers

Even though each record can use any binary format you want, TFRecord files usually contain **serialized protocol buffers** (also called [**protobufs**](https://developers.google.com/protocol-buffers)), a portable, extensible, and efficient binary format. They are defined using a simple language that looks like this:

In [None]:
%%writefile ./data/person.proto
syntax = "proto3";
message Person {
  string name = 1;
  int32 id = 2;
  repeated string email = 3;
}

This definition says we are using version 3 of the protobuf format, and it specifies that each Person object may have a "name" of type string, an "id" of type int32, and zero or more "email" fields, each of type string. The numbers 1, 2, and 3 are the field identifiers: they are used to identify each field in the binary representation of a record.

Once we have a definition in a .proto file, we can compile it. This requires **protoc**, the protobuf compiler, to generate access classes in Python (or some other language).

In [None]:
!protoc ./data/person.proto --python_out=. --descriptor_set_out=./data/person.desc --include_imports

In [None]:
!ls ./data/person*

Then we can import the class generated by protoc, create an instance, and play with it: visualizing it, reading and writing fields, etc. We can serialize it using the **SerializeToString()** method to obtain the binary data, ready to be saved or transmitted over the network. When we receive this binary data, we can parse it using the **ParseFromString()** method, and we get a copy of the object that was serialized.

In [None]:
from data.person_pb2 import Person

person = Person(name="Al", id=123, email=["a@b.com"])
print(person)

In [None]:
person.name  # read a field

In [None]:
person.name = "Alice"  # modify a field
print(person)

In [None]:
person.email[0]  # repeated fields can be accessed like arrays

In [None]:
person.email.append("c@d.com")  # add an email address
print(person)

In [None]:
s = person.SerializeToString()  # serialize to a byte string
print(s)

In [None]:
person2 = Person()
person2.ParseFromString(s)
print(person2)

However, SerializeToString() and ParseFromString() are not TensorFlow operations, so they cannot be included in a TensorFlow Function. Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.

The main protobuf typically used in a TFRecord file is the **Example protobuf**, which represents one instance in a dataset. It contains a list of
named features, where each feature can either be a list of byte strings, a list
of floats, or a list of integers. Here is its definition:

In [None]:
%%writefile ./data/example.proto
syntax = "proto3";

message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };

Here is how we can create a **tf.train.Example** representing the same person as earlier and write it to a TFRecord file:

In [None]:
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
        }))

with tf.io.TFRecordWriter("./data/my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

Now that we have a TFRecord file containing a serialized Example, let’s try to load it. To load the serialized Example protobufs, we will use a tf.data.TFRecordDataset once again, and we will parse each Example using **tf.io.parse_single_example()**. This is a TensorFlow operation, so it can be included in a TF Function. It requires at least two arguments: a string scalar tensor containing the serialized data, and a description of each
feature:

In [None]:
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}

for serialized_example in tf.data.TFRecordDataset(["./data/my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example, feature_description)
    print(parsed_example)
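Fixed-length features are parsed as regular tensors, while variable-length features are parsed as sparse tensors. The sketch below rebuilds a small Example in memory (without going through a file, to stay self-contained) and shows two ways to get at the values of a sparse feature:

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    "emails": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"a@b.com", b"c@d.com"])),
}))
description = {
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}
parsed = tf.io.parse_single_example(example.SerializeToString(), description)

print(parsed["id"])              # regular (dense) scalar tensor
print(parsed["emails"].values)   # values stored in the sparse tensor

# Alternatively, convert the sparse tensor to a dense one.
emails_dense = tf.sparse.to_dense(parsed["emails"], default_value=b"")
print(emails_dense)
```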

A BytesList can contain any binary data you want, including any serialized object like images and tensors. For example, we can use **tf.io.encode_jpeg()** to encode an image using the JPEG format and put this binary data in a BytesList. Later, when our code reads the TFRecord, it will start by parsing the Example, then it will need to call **tf.io.decode_jpeg()** to parse the data and get the original image:

In [None]:
from sklearn.datasets import load_sample_images
img = load_sample_images()["images"][0]
plt.imshow(img)
plt.axis("off")
plt.title("Original Image")
plt.show()

In [None]:
data = tf.io.encode_jpeg(img)
example_with_image = Example(features=Features(feature={
    "image": Feature(bytes_list=BytesList(value=[data.numpy()]))}))
serialized_example = example_with_image.SerializeToString()
# then save to TFRecord

In [None]:
feature_description = { "image": tf.io.VarLenFeature(tf.string) }
example_with_image = tf.io.parse_single_example(serialized_example, feature_description)
decoded_img = tf.io.decode_jpeg(example_with_image["image"].values[0])

In [None]:
plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")
plt.show()

Tensors can be serialized and parsed easily using **tf.io.serialize_tensor()** and **tf.io.parse_tensor()**:

In [None]:
t = tf.constant([[0., 1.], [2., 3.], [4., 5.]])
s = tf.io.serialize_tensor(t)
print(s)

In [None]:
tf.io.parse_tensor(s, out_type=tf.float32)

The Example protobuf will probably be sufficient for most use cases. However, it may be a bit cumbersome to use when you are dealing with lists of lists (for example, suppose you want to classify text documents: each document may be represented as a list of sentences, where each sentence is represented as a list of words; each document also has a list of comments, where each comment is represented as a list of words; there may be some contextual data too, such as the author, title, and publication date). TensorFlow’s **SequenceExample protobuf** is designed for such use cases. Here is the definition:

In [None]:
%%writefile ./data/SequenceExample.proto
syntax = "proto3";

message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
  Features context = 1;
  FeatureLists feature_lists = 2;
};

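Building a SequenceExample, serializing it, and parsing it is similar to building, serializing, and parsing an Example, but we must use the **tf.io.parse_single_sequence_example()** function to parse it. This function returns two dictionaries: one with the context features and one with the feature lists.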

In [None]:
from tensorflow.train import FeatureList, FeatureLists, SequenceExample

context = Features(feature={
    "author_id": Feature(int64_list=Int64List(value=[123])),
    "title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
    "pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25]))
})

content = [["When", "shall", "we", "three", "meet", "again", "?"],
           ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
            ["When", "the", "battle", "'s", "lost", "and", "won", "."]]

def words_to_feature(words):
    return Feature(bytes_list=BytesList(value=[word.encode("utf-8")
                                               for word in words]))

content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]
            
sequence_example = SequenceExample(
    context=context,
    feature_lists=FeatureLists(feature_list={
        "content": FeatureList(feature=content_features),
        "comments": FeatureList(feature=comments_features)
    }))

print(sequence_example)

In [None]:
serialized_sequence_example = sequence_example.SerializeToString()
print(serialized_sequence_example)

In [None]:
context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "title": tf.io.VarLenFeature(tf.string),
    "pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),
}
sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
    "comments": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions,
    sequence_feature_descriptions)

print(parsed_context)
print(parsed_feature_lists)
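Variable-length feature lists are parsed as sparse tensors, which are not always convenient for lists of lists. One option is to convert them to ragged tensors with **tf.RaggedTensor.from_sparse()**. The sketch below uses a toy document with two short sentences (made up for illustration) rather than the example above:

```python
import tensorflow as tf

def words_feature(words):
    # Encode a list of words as a Feature holding a list of byte strings.
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[word.encode("utf-8") for word in words]))

sentences = [["a", "b"], ["c"]]  # toy document: two sentences of different lengths
toy_sequence_example = tf.train.SequenceExample(
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(
            feature=[words_feature(sentence) for sentence in sentences]),
    }))

_, parsed_lists = tf.io.parse_single_sequence_example(
    toy_sequence_example.SerializeToString(),
    sequence_features={"content": tf.io.VarLenFeature(tf.string)})

# One row per sentence, each row with its own length.
ragged_content = tf.RaggedTensor.from_sparse(parsed_lists["content"])
print(ragged_content)
```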

## Preprocessing

Preparing data for a neural network requires converting all features into numerical features, generally normalizing them, and more. In particular, if the data contains categorical features or text features, they need to be converted to numbers. This can be done ahead of time when preparing the data files, using tools such as NumPy, pandas, or Scikit-Learn. Alternatively, the data can be preprocessed on the fly when loading it with the Data API (e.g., using the map() method), or we can include a preprocessing layer directly in the model.

For example, here is how we can implement a standardization layer as a custom layer. For each feature, it subtracts the mean and divides by the standard deviation (plus a tiny smoothing term to avoid division by zero).

In [None]:
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

Before we can use this standardization layer, we need to adapt it to our dataset by calling the adapt() method and passing it a data sample. This will allow it to use the appropriate mean and standard deviation for each feature.
The sample must be large enough to be representative of the dataset, but it
does not have to be the full training set: in general, a few hundred randomly
selected instances will suffice.

In [None]:
housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

std_layer = Standardization()
std_layer.adapt(X_train)

Next, you can use this preprocessing layer like a normal layer:

In [None]:
keras.backend.clear_session()

model = keras.models.Sequential()
model.add(std_layer)
model.add(keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]))
model.add(keras.layers.Dense(1))

model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid))

Now let’s look at **categorical features**. We can encode them as one-hot vectors. For example, consider the ocean_proximity feature in the California housing dataset: it is a categorical feature with five possible values. We need to encode this feature before we feed it to a neural network. Since there are very few categories, we can use **one-hot encoding**.

In [None]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

We first define the vocabulary (the list of all possible categories), then we create a tensor with the corresponding indices. Next, we create an initializer for the lookup table, passing it the list of categories and their corresponding indices. In this example, we already have this data, so we use a **KeyValueTensorInitializer**, but if the categories were listed in a text file (with one category per line), we would use a **TextFileInitializer** instead.
In the last two lines we create the lookup table, giving it the initializer and specifying the number of **out-of-vocabulary (oov)** buckets. If we look up a category that does not exist in the vocabulary, the lookup table will compute a hash of this category and use it to assign the unknown category to one of the oov buckets. Their indices start after the known categories, so in this
example the indices of the two oov buckets are 5 and 6. Why use oov buckets? Well, if the number of categories is large (e.g., zip codes, cities, words, products, or users) and the dataset is large as well, or it keeps changing, then getting the full list of categories may not be convenient. One solution is to define the vocabulary based on a data sample (rather than the whole training set) and add some oov buckets for the other categories that were not in the data sample. The more unknown categories you expect to find during training, the more oov buckets you should use.

Now let’s use the lookup table to encode a small batch of categorical features to one-hot vectors using the **tf.one_hot()** function. Notice that we have to tell this function the total number of indices (which is equal to the vocabulary size plus the number of oov buckets).

In [None]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
print(cat_indices)

cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
print(cat_one_hot)

Just like earlier, it wouldn’t be too difficult to bundle all of this logic into a nice self-contained class. Its adapt() method would take a data sample and extract all the distinct categories it contains. It would create a lookup table to map each category to its index (including unknown categories using oov
buckets). Then its call() method would use the lookup table to map the input categories to their indices. We could add this layer at the beginning of our model.
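Here is a minimal sketch of such a class. To keep it short and easy to run, it is written as a plain Python class with an adapt() method and a `__call__()` method rather than as a Keras layer, and the name `OneHotCategoryEncoder` is hypothetical:

```python
import tensorflow as tf

class OneHotCategoryEncoder:
    """Hypothetical helper: learns a vocabulary, then one-hot encodes categories."""

    def __init__(self, num_oov_buckets=1):
        self.num_oov_buckets = num_oov_buckets

    def adapt(self, data_sample):
        # Extract the distinct categories found in the sample.
        self.vocab = sorted(set(data_sample))
        indices = tf.range(len(self.vocab), dtype=tf.int64)
        table_init = tf.lookup.KeyValueTensorInitializer(
            tf.constant(self.vocab), indices)
        self.table = tf.lookup.StaticVocabularyTable(
            table_init, self.num_oov_buckets)

    def __call__(self, categories):
        # Map categories to indices (unknown ones go to oov buckets), then one-hot.
        cat_indices = self.table.lookup(tf.convert_to_tensor(categories))
        depth = len(self.vocab) + self.num_oov_buckets
        return tf.one_hot(cat_indices, depth=depth)

encoder = OneHotCategoryEncoder(num_oov_buckets=2)
encoder.adapt(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])
one_hot = encoder(["NEAR BAY", "DESERT", "INLAND"])
print(one_hot)
```

Turning this into a real preprocessing layer would mainly involve subclassing keras.layers.Layer and moving the encoding logic into its call() method.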

This is fine when there are just a few possible categories, but if the vocabulary is large (more than 50, as a rule of thumb), it is much more efficient to encode them using **embeddings** instead.

An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly, so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890], while the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791]. In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter we can tweak (typically 10 to 300 dimensions). Since these embeddings are trainable, they will gradually improve during training; and as these two categories are fairly similar, Gradient Descent will likely end up pushing their embeddings closer together, while it will tend to move them away from the "INLAND" category’s embedding. The better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called **representation learning**.

Let’s look at how we could implement embeddings manually, to understand how they work. First, we need to create an embedding matrix containing each category’s embedding, initialized randomly; it will have one row per category and per oov bucket, and one column per embedding dimension:

In [None]:
embedding_dim = 2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)

print(embedding_matrix)

Now, let’s encode the same batch of categorical features as earlier, but this time using these embeddings. The **tf.nn.embedding_lookup()** function looks up the rows in the embedding matrix, at the given indices:

In [None]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)
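
To see that an embedding lookup is nothing more than row selection, here is a small sketch on a toy matrix (the values and the names `toy_matrix` and `toy_indices` are made up for readability). It compares the lookup with tf.gather() and with multiplying one-hot vectors by the matrix, which gives the same result but wastes computation on all the zeros:

```python
import numpy as np
import tensorflow as tf

# Toy embedding matrix: 4 categories, 2 dimensions (illustrative values).
toy_matrix = tf.constant([[0.1, 0.2],
                          [0.3, 0.4],
                          [0.5, 0.6],
                          [0.7, 0.8]])
toy_indices = tf.constant([3, 1, 1])

looked_up = tf.nn.embedding_lookup(toy_matrix, toy_indices)
gathered = tf.gather(toy_matrix, toy_indices)
via_one_hot = tf.matmul(tf.one_hot(toy_indices, depth=4), toy_matrix)

print(looked_up.numpy())  # rows 3, 1 and 1 of toy_matrix
print(np.allclose(looked_up.numpy(), gathered.numpy()))
print(np.allclose(looked_up.numpy(), via_one_hot.numpy()))
```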

Keras provides a **keras.layers.Embedding** layer that handles the embedding matrix (trainable, by default); when the layer is created it initializes the embedding matrix randomly, and then when it is called with some category indices it returns the rows at those indices in the embedding matrix:

In [None]:
embedding = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets, 
                                   output_dim=embedding_dim)
embedding(cat_indices)

The TensorFlow team is working on providing a set of standard [Keras preprocessing layers](https://keras.io/guides/preprocessing_layers/). We already discussed two of these layers: the **Normalization** layer that will perform feature standardization, and the **TextVectorization** layer that will be capable of encoding each word in the inputs into its index in the vocabulary. In both cases, we create the layer, we call its adapt() method with a data sample, and then we use the layer normally in our model. The other preprocessing layers will follow the same pattern. The API will also include a **Discretization** layer that will chop continuous data into different bins and encode each bin as a one-hot vector. At the moment, these layers are available in **tensorflow.keras.layers.experimental**.
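
The create / adapt / use pattern can be sketched with the Normalization layer. Where the class lives depends on the installed TensorFlow version, so the snippet below tries keras.layers first and falls back to the experimental namespace; the toy `data_sample` is made up for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# The layer's location depends on the TensorFlow version, so try both.
try:
    Normalization = keras.layers.Normalization
except AttributeError:
    Normalization = keras.layers.experimental.preprocessing.Normalization

# Two features on very different scales (illustrative values).
data_sample = np.array([[1., 10.], [2., 20.], [3., 30.], [4., 40.]],
                       dtype=np.float32)

normalization = Normalization()
normalization.adapt(data_sample)          # learns per-feature mean and variance
standardized = np.asarray(normalization(data_sample))

print(standardized.mean(axis=0))  # close to 0 for each feature
print(standardized.std(axis=0))   # close to 1 for each feature
```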

If the standard preprocessing layers are insufficient for our task, we can create our own custom preprocessing layer: create a subclass of the **keras.layers.PreprocessingLayer** class with an adapt() method that takes a data_sample argument.

All the preprocessing we have seen so far will be done **on the fly**. During training, however, it may be preferable to perform preprocessing **ahead of time**. In particular, if preprocessing is computationally expensive, then handling it before training rather than on the fly may give us a significant speedup. If the dataset is small enough to fit in RAM, we can use its cache() method, but if it is too large, then tools like [**Apache Beam**](https://beam.apache.org/) will help. They let us run efficient data processing pipelines over large amounts of data, even distributed across multiple servers, so we can use them to preprocess all the training data before training.
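
To illustrate what cache() buys us, the sketch below counts how many times a stand-in "expensive" step actually runs over two passes through a tiny dataset, with and without caching. The `costly_step()` function and the `call_log` list are hypothetical helpers; tf.py_function() is used only so that the Python-side counter is updated for every element.

```python
import tensorflow as tf

call_log = []  # records every time the "expensive" step actually runs

def costly_step(x):
    call_log.append(int(x))
    return x * 2

def make_dataset():
    return tf.data.Dataset.range(5).map(
        lambda x: tf.py_function(costly_step, [x], tf.int64))

# Without cache(): the step runs again for every element at every pass.
uncached = make_dataset()
for _ in range(2):
    list(uncached)
calls_without_cache = len(call_log)

# With cache(): results are stored during the first full pass and reused after.
call_log.clear()
cached = make_dataset().cache()
for _ in range(2):
    list(cached)
calls_with_cache = len(call_log)

print(calls_without_cache, calls_with_cache)  # 10 5
```

Note that the cache is only complete once the dataset has been iterated over in full, and that anything placed after cache() in the pipeline (such as shuffling) is still executed at every epoch.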

However, once our model is trained, we want to deploy it. In that case, we will need to write some code to take care of preprocessing the data before it is fed to the model. This can become a maintenance nightmare: whenever we want to change the preprocessing logic, we will need to update our Apache Beam code and all our deployed code. This is not only time-consuming, but also error-prone. One improvement would be to take the trained model and, before deploying it, add extra preprocessing layers to take care of preprocessing on the fly. That’s definitely better. But what if we could define our preprocessing operations just once? This is what [**TF Transform**](https://www.tensorflow.org/tfx/transform/get_started) was designed for. It is part of [**TensorFlow Extended (TFX)**](https://www.tensorflow.org/tfx), an end-to-end platform for productionizing TensorFlow models.

## Exercise

**1. Implement a custom TensorFlow layer that performs Layer Normalization.**

**(a) The build() method should define two trainable weights $\alpha$ and $\beta$, both of shape input_shape[-1:] and data type tf.float32. $\alpha$ should be initialized with 1s, and $\beta$ with 0s.**

**(b) The call() method should compute the mean $\mu$ and standard deviation $\sigma$ of each instance’s features. For this, you can use tf.nn.moments(inputs, axes=-1, keepdims=True), which returns the mean $\mu$ and the variance $\sigma^2$ of all instances (compute the square root of the variance to get the standard deviation). Then the function should compute and return $\alpha \times (X - \mu) / (\sigma + \epsilon) + \beta$, where $\times$ represents itemwise multiplication and $\epsilon$ is a smoothing term (small constant to avoid division by zero, e.g., 0.001).**

**(c) Ensure that your custom layer produces the same (or very nearly the same) output as the keras.layers.LayerNormalization layer.**

In [None]:
class LayerNormalization(keras.layers.Layer):
    def __init__(self, eps=0.001, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def build(self, batch_input_shape):
        self.alpha = self.add_weight(
            name="alpha", shape=batch_input_shape[-1:],
            initializer="ones")
        self.beta = self.add_weight(
            name="beta", shape=batch_input_shape[-1:],
            initializer="zeros")
        super().build(batch_input_shape)

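    def call(self, X):
        # Per-instance mean and variance, computed over the feature axis
        mean, variance = tf.nn.moments(X, axes=-1, keepdims=True)
        # Note: eps is added to the variance inside the square root, rather
        # than to the standard deviation as in the exercise formula
        return self.alpha * (X - mean) / tf.sqrt(variance + self.eps) + self.beta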

    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "eps": self.eps}
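
To make the computation in call() concrete, here is a NumPy sketch of the same formula on a toy batch. The function name `layer_norm_numpy` and the input values are made up; the point is that each instance (row) is normalized using its own mean and variance, so two rows that differ only by a scale factor end up almost identical.

```python
import numpy as np

def layer_norm_numpy(X, alpha, beta, eps=0.001):
    # Statistics are computed per instance, over the last (feature) axis.
    mean = X.mean(axis=-1, keepdims=True)
    variance = X.var(axis=-1, keepdims=True)
    return alpha * (X - mean) / np.sqrt(variance + eps) + beta

X_toy = np.array([[1., 2., 3., 4.],
                  [10., 20., 30., 40.]], dtype=np.float32)
alpha = np.ones(4, dtype=np.float32)   # scale, initialized to 1s
beta = np.zeros(4, dtype=np.float32)   # offset, initialized to 0s

normalized = layer_norm_numpy(X_toy, alpha, beta)
print(normalized)
print(normalized.mean(axis=-1))  # close to 0 for every instance
print(normalized.std(axis=-1))   # close to 1 for every instance
```

The standard deviations are slightly below 1 because of the smoothing term, and the effect is larger for the first row, whose variance is smaller.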

To ensure that this custom layer produces results similar to the Keras LayerNormalization layer, let's create one instance of each class, apply them to some data (e.g., the training set), and check that the difference is negligible:

In [None]:
X = X_train.astype(np.float32)

custom_layer_norm = LayerNormalization()
keras_layer_norm = keras.layers.LayerNormalization()

tf.reduce_mean(keras.losses.mean_absolute_error(
    keras_layer_norm(X), custom_layer_norm(X)))

That's close enough.

**2. Train a model using a custom training loop to tackle the Fashion MNIST dataset.**

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test.astype(np.float32) / 255.

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

In [None]:
n_epochs = 1 #5
batch_size = 300 #32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.sparse_categorical_crossentropy
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.SparseCategoricalAccuracy()]

In [None]:
from tqdm import trange
from collections import OrderedDict

with trange(1, n_epochs + 1, desc="All epochs") as epochs:
    for epoch in epochs:
        with trange(1, n_steps + 1, desc="Epoch {}/{}".format(epoch, n_epochs)) as steps:
            for step in steps:
                X_batch, y_batch = random_batch(X_train, y_train)
                with tf.GradientTape() as tape:
                    y_pred = model(X_batch)
                    main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
                    loss = tf.add_n([main_loss] + model.losses)
                gradients = tape.gradient(loss, model.trainable_variables)
                optimizer.apply_gradients(zip(gradients, model.trainable_variables))
                for variable in model.variables:
                    if variable.constraint is not None:
                        variable.assign(variable.constraint(variable))                    
                status = OrderedDict()
                mean_loss(loss)
                status["loss"] = mean_loss.result().numpy()
                for metric in metrics:
                    metric(y_batch, y_pred)
                    status[metric.name] = metric.result().numpy()
                steps.set_postfix(status)
            y_pred = model(X_valid)
            status["val_loss"] = np.mean(loss_fn(y_valid, y_pred))
            status["val_accuracy"] = np.mean(keras.metrics.sparse_categorical_accuracy(
                tf.constant(y_valid, dtype=np.float32), y_pred))
            steps.set_postfix(status)
        for metric in [mean_loss] + metrics:
            metric.reset_states()
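
Stripped of progress bars, metrics and constraints, each training step in the loop above boils down to three things: a forward pass recorded by a tf.GradientTape, a gradient computation, and a parameter update. The sketch below shows that core on a toy problem (minimizing $(w - 3)^2$ for a single variable), using a hand-written SGD update instead of a Keras optimizer; the variable names and hyperparameter values are illustrative only.

```python
import tensorflow as tf

w = tf.Variable(0.0)        # single trainable parameter
learning_rate = 0.1
loss_history = []           # stand-in for the running mean loss metric

for step in range(50):
    with tf.GradientTape() as tape:
        loss = tf.square(w - 3.0)               # forward pass and loss
    gradient = tape.gradient(loss, w)           # backward pass
    w.assign_sub(learning_rate * gradient)      # plain SGD update
    loss_history.append(float(loss))

print(float(w))  # should end up very close to 3
```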

**3. Load the Fashion MNIST dataset; split it into a training set, a validation set, and a test set; shuffle the training set and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label. Then use tf.data to create an efficient dataset for each set. Finally, use a Keras model to train on these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data.**

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [None]:
train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train))
valid_set = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [None]:
def create_example(image, label):
    image_data = tf.io.serialize_tensor(image)
    #image_data = tf.io.encode_jpeg(image[..., np.newaxis])
    return Example(
        features=Features(
            feature={
                "image": Feature(bytes_list=BytesList(value=[image_data.numpy()])),
                "label": Feature(int64_list=Int64List(value=[label])),
            }))

In [None]:
from contextlib import ExitStack

def write_tfrecords(name, dataset, n_shards=10):
    paths = ["./data/{}.tfrecord-{:05d}-of-{:05d}".format(name, index, n_shards)
             for index in range(n_shards)]
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path))
                   for path in paths]
        for index, (image, label) in dataset.enumerate():
            shard = index % n_shards
            example = create_example(image, label)
            writers[shard].write(example.SerializeToString())
    return paths
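
The sharding logic in write_tfrecords() is a simple round-robin: record number `index` goes to shard `index % n_shards`. The pure-Python sketch below reproduces just that part (no files are written), using a made-up dataset name and 8 fake records spread over 3 shards:

```python
n_shards = 3
paths = ["./data/{}.tfrecord-{:05d}-of-{:05d}".format("demo", index, n_shards)
         for index in range(n_shards)]

# Assign 8 record indices to shards in round-robin order.
shards = [[] for _ in range(n_shards)]
for index in range(8):
    shards[index % n_shards].append(index)

for path, records in zip(paths, shards):
    print(path, records)
# ./data/demo.tfrecord-00000-of-00003 [0, 3, 6]
# ./data/demo.tfrecord-00001-of-00003 [1, 4, 7]
# ./data/demo.tfrecord-00002-of-00003 [2, 5]
```

Since the training set was shuffled before writing, every shard receives a random sample of the data.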

In [None]:
train_filepaths = write_tfrecords("my_fashion_mnist.train", train_set)
valid_filepaths = write_tfrecords("my_fashion_mnist.valid", valid_set)
test_filepaths = write_tfrecords("my_fashion_mnist.test", test_set)

In [None]:
def preprocess(tfrecord):
    feature_descriptions = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    image = tf.io.parse_tensor(example["image"], out_type=tf.uint8)
    #image = tf.io.decode_jpeg(example["image"])
    image = tf.reshape(image, shape=[28, 28])
    return image, example["label"]

def mnist_dataset(filepaths, n_read_threads=5, shuffle_buffer_size=None,
                  n_parse_threads=5, batch_size=32, cache=True):
    dataset = tf.data.TFRecordDataset(filepaths,
                                      num_parallel_reads=n_read_threads)
    if cache:
        dataset = dataset.cache()
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)
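
To check that the serialization and parsing steps really are inverses of each other, we can run a single fake image through both, entirely in memory. This is only a sketch: the image content and the label value 7 are arbitrary, and the feature descriptions are copied from preprocess().

```python
import numpy as np
import tensorflow as tf

# Fake 28x28 uint8 image and an arbitrary label.
image = (np.arange(28 * 28) % 256).astype(np.uint8).reshape(28, 28)
label = 7

# Writing side: tensor -> bytes -> Example protobuf -> serialized string.
image_data = tf.io.serialize_tensor(image)
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[image_data.numpy()])),
    "label": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[label])),
}))
serialized = example.SerializeToString()

# Reading side: serialized string -> parsed features -> tensor.
feature_descriptions = {
    "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1),
}
parsed = tf.io.parse_single_example(serialized, feature_descriptions)
decoded = tf.io.parse_tensor(parsed["image"], out_type=tf.uint8)
decoded = tf.reshape(decoded, shape=[28, 28])

print(np.array_equal(decoded.numpy(), image), int(parsed["label"]))
```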

In [None]:
train_set = mnist_dataset(train_filepaths, shuffle_buffer_size=60000)
valid_set = mnist_dataset(valid_filepaths)
test_set = mnist_dataset(test_filepaths)

In [None]:
for X, y in train_set.take(1):
    for i in range(5):
        plt.subplot(1, 5, i + 1)
        plt.imshow(X[i].numpy(), cmap="binary")
        plt.axis("off")
        plt.title(str(y[i].numpy()))

In [None]:
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

standardization = Standardization(input_shape=[28, 28])

sample_image_batches = train_set.take(100).map(lambda image, label: image)
sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()),
                               axis=0).astype(np.float32)
standardization.adapt(sample_images)

model = keras.models.Sequential([
    standardization,
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam", metrics=["accuracy"])

In [None]:
from datetime import datetime
logs = os.path.join(os.curdir, "my_logs",
                    "run_" + datetime.now().strftime("%Y%m%d_%H%M%S"))

tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir=logs, histogram_freq=1, profile_batch=10)

model.fit(train_set, epochs=5, validation_data=valid_set,
          callbacks=[tensorboard_cb])

In [None]:
%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006

**4. In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer.**

**(a) Download the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), which contains 50,000 movie reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders, but we will ignore them in this exercise.**

**(b) Split the test set into a validation set (15,000) and a test set (10,000).**

**(c) Use tf.data to create an efficient dataset for each set.**

**(d) Create a binary classification model, using a TextVectorization layer to preprocess each review.**

**(e) Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words. This rescaled mean embedding can then be passed to the rest of your model.**

**(f) Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.**

In [None]:
from pathlib import Path

DOWNLOAD_ROOT = "http://ai.stanford.edu/~amaas/data/sentiment/"
FILENAME = "aclImdb_v1.tar.gz"
filepath = keras.utils.get_file(FILENAME, DOWNLOAD_ROOT + FILENAME, extract=True)
path = Path(filepath).parent / "aclImdb"
path

In [None]:
for name, subdirs, files in os.walk(path):
    indent = len(Path(name).parts) - len(path.parts)
    print("    " * indent + Path(name).parts[-1] + os.sep)
    for index, filename in enumerate(sorted(files)):
        if index == 3:
            print("    " * (indent + 1) + "...")
            break
        print("    " * (indent + 1) + filename)

In [None]:
def review_paths(dirpath):
    return [str(path) for path in dirpath.glob("*.txt")]

train_pos = review_paths(path / "train" / "pos")
train_neg = review_paths(path / "train" / "neg")
test_valid_pos = review_paths(path / "test" / "pos")
test_valid_neg = review_paths(path / "test" / "neg")

len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

In [None]:
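# Shuffle both lists so the validation/test split does not depend on file order
np.random.shuffle(test_valid_pos)
np.random.shuffle(test_valid_neg)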

test_pos = test_valid_pos[:5000]
test_neg = test_valid_neg[:5000]
valid_pos = test_valid_pos[5000:]
valid_neg = test_valid_neg[5000:]

In [None]:
def imdb_dataset(filepaths_positive, filepaths_negative):
    reviews = []
    labels = []
    for filepaths, label in ((filepaths_negative, 0), (filepaths_positive, 1)):
        for filepath in filepaths:
            with open(filepath) as review_file:
                reviews.append(review_file.read())
            labels.append(label)
    return tf.data.Dataset.from_tensor_slices(
        (tf.constant(reviews), tf.constant(labels)))

In [None]:
for X, y in imdb_dataset(train_pos, train_neg).take(3):
    print(X)
    print(y)
    print()

In [None]:
%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass

Let's pretend the dataset does not fit in memory, just to make things more interesting. Luckily, each review fits on just one line, so we can read the reviews using a TextLineDataset. If they didn't, we would have to preprocess the input files (e.g., converting them to TFRecords). For very large datasets, it would make sense to use a tool like Apache Beam for that.

In [None]:
def imdb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
    dataset_neg = tf.data.TextLineDataset(filepaths_negative,
                                          num_parallel_reads=n_read_threads)
    dataset_neg = dataset_neg.map(lambda review: (review, 0))
    dataset_pos = tf.data.TextLineDataset(filepaths_positive,
                                          num_parallel_reads=n_read_threads)
    dataset_pos = dataset_pos.map(lambda review: (review, 1))
    return tf.data.Dataset.concatenate(dataset_pos, dataset_neg)
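
One consequence of concatenating the two datasets is that all the positive reviews come first, followed by all the negative ones, so shuffling before training matters. The sketch below shows this on four tiny made-up review files written to a temporary directory (the `write_review()` helper and the file contents are hypothetical):

```python
import os
import tempfile
import tensorflow as tf

tmp_dir = tempfile.mkdtemp()

def write_review(name, text):
    # Each review is stored on a single line in its own text file.
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(text)
    return path

pos_paths = [write_review("pos_0.txt", "Loved it"),
             write_review("pos_1.txt", "Great fun")]
neg_paths = [write_review("neg_0.txt", "Dull"),
             write_review("neg_1.txt", "Too long")]

dataset_pos = tf.data.TextLineDataset(pos_paths).map(lambda review: (review, 1))
dataset_neg = tf.data.TextLineDataset(neg_paths).map(lambda review: (review, 0))
combined = dataset_pos.concatenate(dataset_neg)

labels = [int(label) for _, label in combined]
print(labels)  # [1, 1, 0, 0]: labels are grouped, not mixed

# A shuffle buffer covering the whole dataset mixes the two classes.
shuffled_labels = [int(label) for _, label in combined.shuffle(4)]
print(shuffled_labels)
```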

In [None]:
%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass

Now it takes much more time to go through the dataset, essentially because it is not cached in RAM, so it must be reloaded at each epoch. If we add .cache() just before .repeat(10), this implementation should be about as fast as the previous one.

In [None]:
%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).cache().repeat(10): pass

In [None]:
batch_size = 32

train_set = imdb_dataset(train_pos, train_neg).shuffle(25000).batch(batch_size).prefetch(1)
valid_set = imdb_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)
test_set = imdb_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)

Now let's create a binary classifier.

First we write a function to preprocess the reviews, cropping them to 300 characters, converting them to lower case, then replacing <br /> and all non-letter characters with spaces, splitting the reviews into words, and finally padding or cropping each review so it ends up with exactly n_words tokens.

In [None]:
def preprocess(X_batch, n_words=50):
    shape = tf.shape(X_batch) * tf.constant([1, 0]) + tf.constant([0, n_words])
    Z = tf.strings.substr(X_batch, 0, 300)
    Z = tf.strings.lower(Z)
    Z = tf.strings.regex_replace(Z, b"<br\\s*/?>", b" ")
    Z = tf.strings.regex_replace(Z, b"[^a-z]", b" ")
    Z = tf.strings.split(Z)
    return Z.to_tensor(shape=shape, default_value=b"<pad>")

X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
preprocess(X_example)
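
Two details of preprocess() are easy to miss: how the target shape is built, and how to_tensor() both pads and crops. The sketch below isolates them on a made-up batch of two short strings with a target of 3 tokens per review. Multiplying tf.shape() by [1, 0] and adding [0, n_words] turns the 1D shape [batch_size] into the 2D shape [batch_size, n_words].

```python
import tensorflow as tf

batch = tf.constant(["a great great movie", "terrible"])
n_words = 3

# [2] * [1, 0] broadcasts to [2, 0]; adding [0, 3] gives [2, 3].
shape = tf.shape(batch) * tf.constant([1, 0]) + tf.constant([0, n_words])
print(shape.numpy())

# split() returns a ragged tensor: 4 words in the first row, 1 in the second.
words = tf.strings.split(batch)

# to_tensor() crops the first row to 3 tokens and pads the second one.
padded = words.to_tensor(shape=shape, default_value=b"<pad>")
print(padded.numpy())
```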

Now let's write a second utility function that will take a data sample with the same format as the output of the preprocess() function, and will output the list of the top max_size most frequent words, ensuring that the padding token is first:

In [None]:
from collections import Counter

def get_vocabulary(data_sample, max_size=1000):
    preprocessed_reviews = preprocess(data_sample).numpy()
    counter = Counter()
    for words in preprocessed_reviews:
        for word in words:
            if word != b"<pad>":
                counter[word] += 1
    return [b"<pad>"] + [word for word, count in counter.most_common(max_size)]

get_vocabulary(X_example)

Now we are ready to create the TextVectorization layer. Its constructor just saves the hyperparameters (max_vocabulary_size and n_oov_buckets). The adapt() method computes the vocabulary using the get_vocabulary() function, then it builds a StaticVocabularyTable. The call() method preprocesses the reviews to get a padded list of words for each review, then it uses the StaticVocabularyTable to look up the index of each word in the vocabulary:

In [None]:
class TextVectorization(keras.layers.Layer):
    def __init__(self, max_vocabulary_size=1000, n_oov_buckets=100, dtype=tf.string, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.max_vocabulary_size = max_vocabulary_size
        self.n_oov_buckets = n_oov_buckets

    def adapt(self, data_sample):
        self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)
        words = tf.constant(self.vocab)
        word_ids = tf.range(len(self.vocab), dtype=tf.int64)
        vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
        self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets)
        
    def call(self, inputs):
        preprocessed_inputs = preprocess(inputs)
        return self.table.lookup(preprocessed_inputs)

In [None]:
text_vectorization = TextVectorization()

text_vectorization.adapt(X_example)
text_vectorization(X_example)

Now let's create another TextVectorization layer and let's adapt it to the full IMDB training set:

In [None]:
max_vocabulary_size = 1000
n_oov_buckets = 100

sample_review_batches = train_set.map(lambda review, label: review)
sample_reviews = np.concatenate(list(sample_review_batches.as_numpy_iterator()),
                                axis=0)

text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets,
                                       input_shape=[])
text_vectorization.adapt(sample_reviews)

In [None]:
text_vectorization(X_example)

In [None]:
text_vectorization.vocab[:10]

Now to build our model we will need to encode all these word IDs somehow. One approach is to create bags of words: for each review, and for each word in the vocabulary, we count the number of occurrences of that word in the review. For example:

In [None]:
simple_example = tf.constant([[1, 3, 1, 0, 0], [2, 2, 0, 0, 0]])
tf.reduce_sum(tf.one_hot(simple_example, 4), axis=1)

In [None]:
class BagOfWords(keras.layers.Layer):
    def __init__(self, n_tokens, dtype=tf.int32, **kwargs):
        # Forward the dtype argument instead of hard-coding tf.int32
        super().__init__(dtype=dtype, **kwargs)
        self.n_tokens = n_tokens
    def call(self, inputs):
        one_hot = tf.one_hot(inputs, self.n_tokens)
        return tf.reduce_sum(one_hot, axis=1)[:, 1:]

In [None]:
bag_of_words = BagOfWords(n_tokens=4)
bag_of_words(simple_example)
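
To double-check what these counts should look like, here is the same computation written with NumPy, as a sketch that counts token IDs directly with np.bincount() instead of summing one-hot vectors:

```python
import numpy as np

token_ids = np.array([[1, 3, 1, 0, 0],
                      [2, 2, 0, 0, 0]])
n_tokens = 4

# Count how many times each token ID (0 to 3) appears in each review.
counts = np.stack([np.bincount(row, minlength=n_tokens) for row in token_ids])
print(counts)
# [[2 2 0 1]
#  [3 0 2 0]]

# Dropping the first column removes the counts of the padding token (ID 0).
print(counts[:, 1:])
# [[2 0 1]
#  [0 2 0]]
```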

In [None]:
n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 for <pad>
bag_of_words = BagOfWords(n_tokens)

In [None]:
model = keras.models.Sequential([
    text_vectorization,
    bag_of_words,
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

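Now let's add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words: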

In [None]:
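def compute_mean_embedding(inputs):
    # Number of non-zero components per time step (0 for padding steps)
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    # Number of real (non-padding) words in each review
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
    sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    # mean over the n real words times sqrt(n) equals sum / sqrt(n); padding
    # steps are zero vectors, so they do not contribute to the sum
    return tf.reduce_sum(inputs, axis=1) / sqrt_n_words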

another_example = tf.constant([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
                               [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])
compute_mean_embedding(another_example)
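
As a sanity check on the arithmetic, the NumPy sketch below computes the rescaled mean embedding literally as described in the exercise (mean over the real words only, times the square root of the number of words) and compares it with the shortcut sum / sqrt(n). It assumes, as in the example above, that padding steps are all-zero vectors and that every review contains at least one word.

```python
import numpy as np

embeddings = np.array([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
                       [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])

# A time step counts as a word if at least one component is non-zero.
is_word = np.any(embeddings != 0, axis=-1)        # shape (batch, steps)
n_words = is_word.sum(axis=-1, keepdims=True)     # shape (batch, 1): [[2], [1]]

# Literal definition: mean over the real words, times sqrt(n_words).
mean_over_words = embeddings.sum(axis=1) / n_words
rescaled = mean_over_words * np.sqrt(n_words)

# Shortcut: sum / sqrt(n_words).
shortcut = embeddings.sum(axis=1) / np.sqrt(n_words)

print(rescaled)
print(np.allclose(rescaled, shortcut))
```

For the first review, the two words average to [2.5, 3.5, 1.5], and multiplying by the square root of 2 gives roughly [3.54, 4.95, 2.12]; the second review has a single word, so its embedding [6, 0, 0] is left unchanged.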

In [None]:
embedding_size = 20

model = keras.models.Sequential([
    text_vectorization,
    keras.layers.Embedding(input_dim=n_tokens,
                           output_dim=embedding_size,
                           mask_zero=True), # <pad> tokens => zero vectors
    keras.layers.Lambda(compute_mean_embedding),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

In [None]:
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

With embeddings, the model does not perform better than the bag-of-words version.