Exercise: *Implement a custom layer that performs layer normalization (we will use this type of layer in Chapter 15)*.

# Set up

In [7]:
import tensorflow as tf

a. Exercise: *The `build()` method should define two trainable weights $\alpha$ and $\beta$, both of shape `input_shape[-1:]` and data type `tf.float32`. $\alpha$ should be initialized with 1s, and $\beta$ with 0s.*

b. Exercise: *The `call()` method should compute the mean $\mu$ and standard deviation $\sigma$ of each instance’s features. For this, you can use `tf.nn.moments(inputs, axes=-1, keepdims=True)`, which returns the mean $\mu$ and the variance $\sigma^2$ of all instances (compute the square root of the variance to get the standard deviation). Then the function should compute and return $\alpha \otimes (\mathbf{X} - \mu)/(\sigma + \varepsilon) + \beta$, where $\otimes$ represents item-wise multiplication and $\varepsilon$ is a smoothing term (a small constant to avoid division by zero, e.g., 0.001).*
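Before turning to the Keras layer, the arithmetic can be checked by hand on a toy array. The following is a minimal NumPy sketch, not the layer itself: the input values are made up for illustration, and the shift `beta` is added at the end, as the layer's `call()` method does.

```python
import numpy as np

# Toy batch: 2 instances with 3 features each (illustrative values only).
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 10.0, 40.0]], dtype=np.float32)
alpha = np.ones(3, dtype=np.float32)   # scale, initialized with 1s
beta = np.zeros(3, dtype=np.float32)   # shift, initialized with 0s
eps = 0.001                            # smoothing term

# Statistics are computed over the last axis: one mean and one standard
# deviation per instance, not per feature.
mu = X.mean(axis=-1, keepdims=True)
sigma = X.std(axis=-1, keepdims=True)  # population standard deviation

normalized = alpha * (X - mu) / (sigma + eps) + beta
print(mu.ravel())                  # per-instance means
print(normalized.mean(axis=-1))    # close to 0 for every instance
print(normalized.std(axis=-1))     # close to 1 for every instance
```

With `keepdims=True`, `mu` and `sigma` have shape `(2, 1)`, so they broadcast against `X` row by row.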

In [11]:
class MyLayerNormalization(tf.keras.layers.Layer):
    def __init__(self, eps=0.001, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def build(self, input_shape):
        self.alpha = self.add_weight(
            name="alpha", shape=input_shape[-1:], initializer="ones"
        )
        self.beta = self.add_weight(
            name="beta", shape=input_shape[-1:], initializer="zeros"
        )

    def call(self, X):
        mean, variance = tf.nn.moments(X, axes=-1, keepdims=True)
        return self.alpha * (X - mean) / (tf.sqrt(variance + self.eps)) + self.beta

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "eps": self.eps}

Note that it's preferable to compute `tf.sqrt(variance + self.eps)` rather than `tf.sqrt(variance) + self.eps`. The derivative of sqrt(z) is 1/(2 sqrt(z)), which is infinite at z=0, so backpropagation produces infinite or NaN gradients whenever the variance vector has at least one component equal to 0, and training breaks down. Adding _ε_ inside the square root guarantees that its argument is strictly positive, so this never happens.
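The effect on the gradient can be seen in isolation. This is a small sketch that uses a scalar `z` as a stand-in for a single variance component equal to 0; the variable names are illustrative and not part of the layer.

```python
import tensorflow as tf

eps = 0.001
z = tf.constant(0.0)  # stands in for a variance component that is exactly 0

with tf.GradientTape(persistent=True) as tape:
    tape.watch(z)                    # z is a constant, so watch it explicitly
    eps_outside = tf.sqrt(z) + eps   # epsilon added after the square root
    eps_inside = tf.sqrt(z + eps)    # epsilon added inside the square root

grad_outside = tape.gradient(eps_outside, z)  # 0.5 / sqrt(0): infinite
grad_inside = tape.gradient(eps_inside, z)    # 0.5 / sqrt(eps): finite
print(grad_outside.numpy(), grad_inside.numpy())
```

In the full layer, an infinite gradient like this gets multiplied by other terms during backpropagation, some of which are zero, and that is typically how NaN values end up in the weights.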

c. Exercise: *Ensure that your custom layer produces the same (or very nearly the same) output as the `tf.keras.layers.LayerNormalization` layer.*

We reload the California housing dataset. We just need some data to feed to the layers, so there is no need to preprocess it at all.

In [9]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42
)

Now we create an instance of each class, apply them to the training dataset, and ensure that the difference is negligible.

In [12]:
import numpy as np

X = X_train.astype(np.float32)

custom_layer_norm = MyLayerNormalization()
keras_layer_norm = tf.keras.layers.LayerNormalization()

tf.reduce_mean(
    tf.keras.losses.mean_absolute_error(keras_layer_norm(X), custom_layer_norm(X))
)

<tf.Tensor: shape=(), dtype=float32, numpy=3.44626e-08>

That's very close! Just to be extra sure, let's make alpha and beta completely random and compare again.

In [14]:
tf.keras.utils.set_random_seed(42)

random_alpha = np.random.rand(X.shape[-1])
random_beta = np.random.rand(X.shape[-1])

custom_layer_norm.set_weights([random_alpha, random_beta])
keras_layer_norm.set_weights([random_alpha, random_beta])


tf.reduce_mean(
    tf.keras.losses.mean_absolute_error(keras_layer_norm(X), custom_layer_norm(X))
)

<tf.Tensor: shape=(), dtype=float32, numpy=1.6339667e-08>

Still a negligible difference! Our custom layer did a good job.
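Another way to express "negligible difference" is a tolerance-based assertion. The sketch below is self-contained, so it compares the Keras layer against a plain NumPy reference instead of the custom class, and it uses synthetic random data in place of the housing data. The tolerance `atol=1e-4` is an arbitrary choice for float32, and the reference assumes the Keras layer's default epsilon of 0.001.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 8)).astype(np.float32)  # synthetic stand-in data

# A freshly built Keras layer has scale 1 and offset 0.
keras_out = tf.keras.layers.LayerNormalization()(X).numpy()

# NumPy reference with epsilon inside the square root.
eps = 0.001
mean = X.mean(axis=-1, keepdims=True)
variance = X.var(axis=-1, keepdims=True)
reference = (X - mean) / np.sqrt(variance + eps)

# Raises an AssertionError if any element differs by more than the tolerance.
np.testing.assert_allclose(keras_out, reference, atol=1e-4)
max_abs_diff = np.abs(keras_out - reference).max()
print(max_abs_diff)
```

An element-wise check like this bounds the worst-case difference, whereas the mean absolute error used above could hide a few large deviations among many small ones.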