## Notes on Chapter 12 of Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition)

In [39]:
import numpy as np

np.random.seed(42)

import tensorflow as tf

tf.random.set_seed(42)

from tensorflow import keras

95% of the use cases you encounter will require only tf.keras and tf.data. This chapter shows how to create custom:
  - loss functions
  - metrics
  - layers
  - models
  - initializers
  - regularizers
  - weight constraints
  - training loops
    - 99% of the time you will not need a custom loop
    - for the 1% where you need one, you can apply
      - special transformations or constraints to the gradients
      - multiple optimizers for different parts of the network
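
As a rough sketch of the first case, the snippet below transforms a gradient by hand (clipping it by norm) before updating a variable. The toy loss and the `learning_rate` and `max_norm` values are assumptions chosen for illustration, and the update is plain gradient descent rather than a Keras optimizer.

```python
import tensorflow as tf

w = tf.Variable([3.0, -4.0])
learning_rate = 0.1   # hypothetical value, for illustration only
max_norm = 1.0        # hypothetical value, for illustration only

# Record the forward pass so the gradient can be computed afterwards.
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(w))      # d(loss)/dw = 2 * w = [6, -8]

grad = tape.gradient(loss, w)               # L2 norm of [6, -8] is 10
clipped = tf.clip_by_norm(grad, max_norm)   # rescaled to norm 1: [0.6, -0.8]
w.assign_sub(learning_rate * clipped)       # w becomes [2.94, -3.92]
```

Keras `fit()` hides all of these steps, which is why a custom loop is only needed when you want to intervene between computing the gradients and applying them.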

## Tensorflow Library Tour

  - Well suited for heavy computations
  - Fine-tuned for large-scale machine learning
  - Developed by the Google Brain team. Powers many of Google's large-scale services
  - Overview of Tensorflow offerings:
    - Similar to numpy, but with GPU support
    - Each operation
      - is implemented using highly efficient C++ code
      - has multiple implementations called kernels (one for CPUs, one for GPUs, one for TPUs)
    - Supports distributed computing
    - Extracts computation graph from Python functions, optimizes it, runs it in parallel
    - Train a Tensorflow model in one environment (Python on Linux), export the computation graph, run it in another environment (Java on Android)
    - Implements autodiff for computing gradients faster
    - Provides excellent optimizers like RMSProp and Nadam
  - Main uses of Tensorflow:
    - High level Deep learning APIs
      - tf.keras (recommended)
      - tf.estimator
    - Low level deep learning APIs
      - tf.nn
      - tf.losses
      - tf.metrics
      - tf.optimizers
      - tf.train
      - tf.initializers
    - Autodiff
      - tf.GradientTape
      - tf.gradients()
    - I/O and processing
      - tf.data
      - tf.feature_column
      - tf.audio
      - tf.image
      - tf.io
      - tf.queue
    - Visualization with TensorBoard
      - tf.summary
    - Deployment and optimization
      - tf.distribute
      - tf.saved_model
      - tf.autograph
      - tf.graph_util
      - tf.lite
      - tf.quantization
      - tf.tpu
      - tf.xla
    - Special data structures
      - tf.lookup
      - tf.nest
      - tf.ragged
      - tf.sets
      - tf.sparse
      - tf.strings
    - Mathematics, including linear algebra and signal processing
      - tf.math
      - tf.linalg
      - tf.signal
      - tf.random
      - tf.bitwise
    - Miscellaneous
      - tf.compat
      - tf.config
      - and more ...
  - TensorFlow library ecosystem
    - Tensorboard for visualization
    - TensorFlow Extended (TFX) to productionize Tensorflow projects. Includes tools for
      - Data validation
      - Preprocessing
      - Model analysis
      - Serving with TF Serving
    - To download and re-use pre-trained models for particular datasets:
      - https://www.tensorflow.org/resources/models-datasets
      - https://github.com/jtoy/awesome-tensorflow
      - Tensorflow Hub: https://www.tensorflow.org/hub/
      - https://github.com/tensorflow/models/
      - Machine learning papers with code: https://paperswithcode.com/

## Using Tensorflow like numpy

### Tensor operations

In [40]:
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])    # matrix
t

<tf.Tensor: id=86, shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [41]:
tf.constant(10)                                  # scalar

<tf.Tensor: id=87, shape=(), dtype=int32, numpy=10>

In [42]:
t.shape, t.dtype

(TensorShape([2, 3]), tf.float32)

#### Tensorflow indexing works the same way as Numpy indexing

In [43]:
t[:, :1]                                         # Selecting a column as a 2-D array

<tf.Tensor: id=91, shape=(2, 1), dtype=float32, numpy=
array([[1.],
       [4.]], dtype=float32)>

In [44]:
t[..., 1]                                         # Selecting a column as a 1-D array

<tf.Tensor: id=95, shape=(2,), dtype=float32, numpy=array([2., 5.], dtype=float32)>

In [45]:
t[..., 1, tf.newaxis]                            # The selection is made based on
                                                 # t[..., 1] =>
                                                 #   Select all dimensions until the last
                                                 #   Select only index 1 of the last dimension.
                                                 #   Resulting in [2, 5]
                                                 # Since tf.newaxis is at the last position,
                                                 # the selected values are enclosed inside
                                                 # square brackets for each element =>
                                                 #   Resulting in [[2], [5]]

<tf.Tensor: id=99, shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

In [46]:
t[0, ..., tf.newaxis]                            # The selection is made based on
                                                 # t[0, ...] =>
                                                 #   Select only index 0 of the first dimension,
                                                 #   and keep all the remaining dimensions.
                                                 #   Resulting in [1, 2, 3]
                                                 # Since tf.newaxis is at the last position,
                                                 # the selected values are enclosed inside
                                                 # square brackets for each element =>
                                                 #   Resulting in [[1], [2], [3]]

<tf.Tensor: id=103, shape=(3, 1), dtype=float32, numpy=
array([[1.],
       [2.],
       [3.]], dtype=float32)>

In [47]:
t[0, tf.newaxis, ...]                            # The selection is made based on
                                                 # t[0, ...] (tf.newaxis ignored) =>
                                                 #   Select only index 0 of the first dimension,
                                                 #   and keep all the remaining dimensions.
                                                 #   Resulting in [1, 2, 3]
                                                 # Since tf.newaxis comes before the ellipsis,
                                                 # the new axis is inserted in front of the
                                                 # remaining dimensions, so the whole row is
                                                 # enclosed inside one pair of square brackets =>
                                                 #   Resulting in [[1, 2, 3]]

<tf.Tensor: id=107, shape=(1, 3), dtype=float32, numpy=array([[1., 2., 3.]], dtype=float32)>
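
A quick way to check the three `tf.newaxis` cases above is to run the same index expressions on a NumPy array and compare shapes. This is a minimal sketch; the array `a` is just a NumPy copy of `t` made for the comparison.

```python
import numpy as np
import tensorflow as tf

a = np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32)
t = tf.constant(a)

# Each pair applies the same index expression to the tensor and to the array.
shape_pairs = [
    (tuple(t[..., 1, tf.newaxis].shape), a[..., 1, np.newaxis].shape),  # (2, 1)
    (tuple(t[0, ..., tf.newaxis].shape), a[0, ..., np.newaxis].shape),  # (3, 1)
    (tuple(t[0, tf.newaxis, ...].shape), a[0, np.newaxis, ...].shape),  # (1, 3)
]
for tf_shape, np_shape in shape_pairs:
    print(tf_shape, np_shape)
```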

In [48]:
t + 10

<tf.Tensor: id=109, shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [49]:
tf.square(t)

<tf.Tensor: id=110, shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

In [50]:
t @ tf.transpose(t)                              # @ represents matrix multiplication. You can
                                                 # also say tf.matmul(t, tf.transpose(t))

<tf.Tensor: id=113, shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

In [51]:
tf.matmul(t, tf.transpose(t))

<tf.Tensor: id=116, shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

In [52]:
# Other math operations:         tf.add(),     tf.multiply(), tf.square(), tf.exp(), tf.sqrt()
# Same names as in numpy:        tf.reshape(), tf.squeeze(),  tf.tile()
# Differently-named operations:  tf.reduce_mean(), tf.reduce_sum(), tf.reduce_max(),
#                                tf.math.log()
# When a name differs from numpy, it is usually because the operation behaves differently. ex:
# t.T in numpy is the transpose, but it is tf.transpose(t) in Tensorflow.
# In numpy, t.T gives you a transposed view on the same data.
# In Tensorflow, you get a new tensor with its own copy of the transposed data.
#
# Many functions and classes have aliases. ex. tf.add() is same as tf.math.add().
# This helps keep the packages organized while having concise names for common operations.
#
# If you want code that is usable in other Keras implementations, you should
# use only the functions in the keras.backend. However this is only a subset
# of all the Tensorflow functions. ex:
K = keras.backend
K.square(K.transpose(t)) + 10

<tf.Tensor: id=121, shape=(3, 2), dtype=float32, numpy=
array([[11., 26.],
       [14., 35.],
       [19., 46.]], dtype=float32)>
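
The differently-named reduction operations take an `axis` argument just like their NumPy counterparts. A minimal sketch, reusing the same values as `t`:

```python
import tensorflow as tf

t = tf.constant([[1., 2., 3.], [4., 5., 6.]])

total = tf.reduce_sum(t)                # like np.sum(a):          21.0
col_means = tf.reduce_mean(t, axis=0)   # like np.mean(a, axis=0): [2.5, 3.5, 4.5]
row_max = tf.reduce_max(t, axis=1)      # like np.max(a, axis=1):  [3., 6.]
```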

### Tensors and NumPy

In [53]:
a = np.array([2., 4., 5.])
tf.constant(a)                            # Results in a float64 tensor.

<tf.Tensor: id=122, shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [54]:
# Numpy uses float64 by default.
# Neural networks rarely need more than float32 precision (the Tensorflow default),
# so we should use tf.constant(a, dtype=tf.float32) instead.
tf.constant(a, dtype=tf.float32)

<tf.Tensor: id=123, shape=(3,), dtype=float32, numpy=array([2., 4., 5.], dtype=float32)>

In [55]:
t.numpy()                                 # Get numpy array from tensor

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)
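
The conversion also happens implicitly in both directions: Tensorflow operations accept NumPy arrays, and NumPy functions accept tensors. A small sketch of this, with values chosen to match the cells above:

```python
import numpy as np
import tensorflow as tf

a = np.array([2., 4., 5.])
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])

squared_by_tf = tf.square(a)   # NumPy array in, tf.Tensor out (float64, like a)
squared_by_np = np.square(t)   # tf.Tensor in, NumPy array out (float32, like t)
```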

### Type conversions

In [56]:
# Tensorflow does not do type conversions automatically
# tf.constant(2.) + tf.constant(40)         # InvalidArgumentError since float + int
# tf.constant(2.) + \                       # InvalidArgumentError since float32 + float64
#   tf.constant(40., dtype=tf.float64)    
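
When you really do need to mix types, convert explicitly with `tf.cast()`. A minimal sketch using the same values as the failing expressions above:

```python
import tensorflow as tf

x = tf.constant(2.)                       # float32
n = tf.constant(40)                       # int32
y = tf.constant(40., dtype=tf.float64)    # float64

# Cast the other operand to float32 before adding.
s1 = x + tf.cast(n, tf.float32)
s2 = x + tf.cast(y, tf.float32)
```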

### Variables

In [57]:
# Tensors are immutable. They're used to store data.
# We cannot use tensors to implement weights in a Neural Network.
# We can use Variables instead.
# Can be modified in-place using: 
#   assign() for assigning value to a variable
#   assign_add() for incrementing
#   assign_sub() for decrementing
#
# You can also modify individual cells (or slices) using the cell's (or slice's) assign() method,
# or by using the scatter_update() or scatter_nd_update() methods. nd stands for n-dimensions.
#
# In practice, you will add weights using add_weight() function.
# You will rarely need to create variables manually.
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [58]:
v.assign(2 * v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [59]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [60]:
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

In [61]:
v.scatter_nd_update(indices=[[0, 0], [1, 2]], updates=[100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>
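
The cells above cover `assign()` and `scatter_nd_update()`. The increment and decrement methods work the same way; this sketch starts from a fresh variable so the values are easy to follow:

```python
import tensorflow as tf

v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])

v.assign_add(tf.ones_like(v))                  # add 1 to every cell, in place
v.assign_sub([[1., 1., 1.], [0., 0., 0.]])     # subtract 1 from the first row only

print(v.numpy())   # first row is back to [1, 2, 3], second row is [5, 6, 7]
```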

### Other Data Structures
  - Sparse tensors (tf.SparseTensor)
    - Represent tensors with mostly zeros, efficiently.
  - Tensor arrays (tf.TensorArray)
    - List of tensors with fixed size by default. Can optionally be made dynamic.
      All tensors they contain must have the same shape and data type.
  - Ragged tensors (tf.RaggedTensor)
    - List of list of tensors, where each tensor has the same shape and data type.
  - String tensors
    - Regular tensors of type tf.string. They hold byte strings, not unicode strings: if you
      create one from a unicode string, it is encoded to utf-8 automatically. Alternatively,
      represent a unicode string as a tensor of type tf.int32, where each int32 value is one
      unicode code point.
  - Sets
    - Represented as tensors/sparse tensors. ex. tf.constant([[1, 2], [3, 4]]) represents
      two sets {1, 2}, and {3, 4}. Each set is represented as a vector in the tensor's
      last axis.
  - Queues
    - Store tensors across multiple steps. FIFOQueue, PriorityQueue, RandomShuffleQueue (shuffles its items), PaddingFIFOQueue (pads its differently-shaped items)
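
A few of these structures in action. This is only a sketch: the values are made up for illustration, and queues and tensor arrays are left out.

```python
import tensorflow as tf

# Sparse tensor: only the non-zero entries are stored (indices in row-major order).
s = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                    values=[1., 2., 3.],
                    dense_shape=[3, 4])
dense = tf.sparse.to_dense(s)

# Ragged tensor: rows may have different lengths.
r = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])
row_lengths = r.row_lengths()                          # [3, 1, 2]

# String tensor: UTF-8 bytes, or one int32 code point per character after decoding.
b = tf.constant("café")
n_bytes = tf.strings.length(b)                         # 5 ("é" takes 2 bytes in UTF-8)
code_points = tf.strings.unicode_decode(b, "UTF-8")    # [99, 97, 102, 233]

# Sets: each vector along the last axis is one set.
set1 = tf.constant([[1, 2], [3, 4]])
set2 = tf.constant([[2, 5], [4, 6]])
common = tf.sparse.to_dense(tf.sets.intersection(set1, set2))   # [[2], [4]]
```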

## Customizing Models and Training Algorithms

### Custom Loss Functions

In [None]:
# If your training set is still a bit noisy, even after you
# remove/fix outliers:
#   - The MSE loss penalizes large errors too much, which can make the model imprecise.
#   - The MAE loss penalizes outliers less, but it may take a while to converge,
#     and the model may still be imprecise.
# Use the Huber loss instead. It's there in tf.keras.losses.Huber, but we can make one and use it.
def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    return huber_fn

# For better performance, use vectorized implementation.
# If you want to benefit from TensorFlow's graph features, use only Tensorflow operations.
# Return a tensor containing 1 loss per instance, instead of returning mean loss.
# This way Keras can apply class weights or sample weights when requested.

# model.compile(loss=create_huber(2.0), optimizer='nadam')
# model.fit(X_train, y_train, ...)

# Keras will use the created huber_fn as the loss function to perform Gradient Descent.
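
To see how the Huber loss sits between the squared and the absolute error, here is a worked sketch on three made-up errors with a threshold of 1. The function repeats the definition above, with `threshold` as a plain argument to keep the example self-contained.

```python
import tensorflow as tf

def huber_fn(y_true, y_pred, threshold=1.0):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < threshold
    squared_loss = tf.square(error) / 2
    linear_loss = threshold * tf.abs(error) - threshold**2 / 2
    return tf.where(is_small_error, squared_loss, linear_loss)

y_true = tf.constant([0., 0., 0.])
y_pred = tf.constant([0.5, 1.0, 3.0])

per_instance = huber_fn(y_true, y_pred)   # one loss per instance: [0.125, 0.5, 2.5]
squared = tf.square(y_true - y_pred)      # for comparison:        [0.25,  1.0, 9.0]
```

The largest error contributes 2.5 instead of 9, so a single outlier pulls the gradient far less than it would with MSE.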

### Saving and loading models that contain custom components

In [64]:
# Keras saves the name of the function when you save the model.
# But it does not save the threshold value (hyperparameter).
# To load it, you have to map the saved function name to the actual function,
# and give it the threshold value.
# model = keras.models.load_model('model.h5', 
#                                 custom_objects={'huber_fn': create_huber(2.0)})

In [66]:
# If your function does not have any hyperparameters,
# then you just need to map the function name to
# the function when you load the model.

# To get around this issue of saving parameters to functions, 
# create a subclass and implement its get_config() method:
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
        
    # Losses are used to calculate a gradient to train the model.
    # This is why they must be differentiable everywhere they're evaluated.
    # Their gradients should not be 0 everywhere.
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

# keras.losses.Loss class can be initialized using:
#   - name: The name of the loss
#   - reduction: Algorithm to use to aggregate individual instance losses.
#                Default is 'sum_over_batch_size' which:
#                  - weighs samples (if needed)
#                  - adds all values
#                  - divides by batch size (notice - not by sum_of_weights, 
#                                           this is not weighted mean)
#                Other algorithm options are 'sum' and 'none'

# model.compile(loss=HuberLoss(2.0), optimizer='nadam')

# When you save the model, Keras calls the loss instance's get_config()
# and saves the config as JSON in the HDF5 file.
# When you load the model, it calls the from_config() in the HuberLoss() class
# (or if not present, in the Loss class). This method creates an instance of
# the class, passing the **config to the constructor.

# When loading the model, map the class name to the class:
# model = keras.models.load_model('model.h5', 
#                                 custom_objects={'HuberLoss': HuberLoss})
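
The save/load mechanism described above can be tried without a model by calling `get_config()` and `from_config()` directly. This is a sketch of the round trip only; it repeats the `HuberLoss` class so that it runs on its own, and it does not write any file.

```python
import tensorflow as tf
from tensorflow import keras

class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

loss = HuberLoss(2.0)
config = loss.get_config()                 # plain dict that includes 'threshold'
restored = HuberLoss.from_config(config)   # calls HuberLoss(**config)
print(restored.threshold)
```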

### Custom Activation functions, Initializers, Regularizers, and Constraints

In [69]:
# Most Keras functionalities such as losses, regularizers, initializers, constraints,
# metrics, activation functions, layers, and even full models can be customized in
# the same way as above.
def my_softplus(z): # return value is just tf.nn.softplus(z)
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

# At each training step, the weights will be passed to the regularization function
# to compute the regularization loss, which will be added to the main loss
# to get the final training loss.
def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

# The constraint function will be called after each training step,
# and the layer's weights will be replaced by the constrained weights.
def my_positive_weights(weights): # return value is just tf.nn.relu(weights)
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

layer = keras.layers.Dense(30, activation=my_softplus,
                           kernel_initializer=my_glorot_initializer,
                           kernel_regularizer=my_l1_regularizer,
                           kernel_constraint=my_positive_weights)

In [70]:
# If a function has hyperparameters you want to save with the model, 
# you must subclass the appropriate class and implement the appropriate method:
#   - keras.regularizers.Regularizer, implement the __call__() method
#   - keras.constraints.Constraint,   implement the __call__() method
#   - keras.initializers.Initializer, implement the __call__() method
#   - keras.losses.Loss,              implement the call() method
#   - keras.layers.Layer (for any layer, including activation functions),
#                                     implement the call() method
#   - keras.models.Model,             implement the call() method

class MyL1Regularizer(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {'factor': self.factor}
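
A regularizer object is just a callable, so it can be checked by hand. The sketch below repeats the class above and applies it to a small made-up weight matrix:

```python
import tensorflow as tf
from tensorflow import keras

class MyL1Regularizer(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {'factor': self.factor}

reg = MyL1Regularizer(0.01)
weights = tf.constant([[1., -2.], [3., -4.]])
penalty = reg(weights)                                     # 0.01 * (1 + 2 + 3 + 4) = 0.1

# The hyperparameter survives a config round trip.
restored = MyL1Regularizer.from_config(reg.get_config())
```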

### Custom Metrics

#### Streaming Metrics

In [87]:
# MSE or MAE is usually preferred as a metric,
# we will just show here how to use huber as a metric.
# model.compile(loss='mse', optimizer='nadam', metrics=[create_huber(2.)])

# Keras will compute the metric for each batch and keep track of its mean
# since the beginning of the epoch.
# Most of the time, this is what you want, but not always. For example, during
# classification, we want to keep track of precision, which is
# (true positives / (true positives + false positives)).
# Averaging the per-batch precisions would give the wrong overall value:
# we need to count the true positives and false positives over all batches.
# keras.metrics.Precision class does this.
p = keras.metrics.Precision()
print(f'Precision after 1st batch: {p([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])}')
print(f'Precision after 2nd batch: {p([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])}')

# This is called a streaming metric (or stateful metric).
# It's updated batch-by-batch: 0.5 is the precision over both batches so far,
# not the precision of the 2nd batch alone (which is 0 / 3 = 0).
print(f'Final result: {p.result()}')

Precision after 1st batch: 0.800000011920929
Precision after 2nd batch: 0.5
Final result: 0.5


In [88]:
p.variables                   # These track the number of true positives and false positives

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>]

In [89]:
p.result()

<tf.Tensor: id=1207, shape=(), dtype=float32, numpy=0.5>

In [90]:
p.reset_states()              # Reset all counts to zero

In [91]:
p.result()

<tf.Tensor: id=1218, shape=(), dtype=float32, numpy=0.0>
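
The bookkeeping behind the streaming precision can be reproduced in plain Python. This sketch uses the same two batches as above; the helper `batch_counts` is a hypothetical name introduced here, not part of Keras.

```python
def batch_counts(y_true, y_pred):
    """Return (true positives, false positives) for one batch of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp, fp

batches = [
    ([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1]),
    ([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0]),
]

total_tp = total_fp = 0
per_batch = []
for y_true, y_pred in batches:
    tp, fp = batch_counts(y_true, y_pred)
    per_batch.append(tp / (tp + fp))       # precision of this batch alone
    total_tp += tp                         # running counts, like p.variables
    total_fp += fp

streaming_precision = total_tp / (total_tp + total_fp)   # 4 / (4 + 4) = 0.5
mean_of_batches = sum(per_batch) / len(per_batch)         # (0.8 + 0.0) / 2 = 0.4
```

The two numbers differ, which is why a metric like precision has to accumulate counts rather than average per-batch values.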

In [92]:
class HuberMetric(keras.metrics.Metric):
    def __init__(self, threshold=1., **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)
        self.total = self.add_weight('total', initializer='zeros')
        self.count = self.add_weight('count', initializer='zeros')
    
    # Updates variables for one batch
    def update_state(self, y_true, y_pred, sample_weight=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))  # tf.size(x) returns the total
                                                                     # number of elements in x
    
    # Computes final result
    def result(self):
        return self.total / self.count
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

### Custom Layers

In [98]:
# If tensorflow does not provide a default implementation for a layer,
# or if you want to treat a block of layers as a single layer,
# you can create a custom layer.

# Some layers have no weights, ex.
#   - keras.layers.Flatten
#   - keras.layers.ReLU
# To create such a layer, write a function and wrap it in a keras.layers.Lambda.
# This layer can be used like any other layer, or as an activation function.
exponential_layer = \
    keras.layers.Lambda(lambda x: tf.exp(x)) # or, activation='exponential',
                                             # or activation=tf.exp,
                                             # or activation=keras.activations.exponential
# The exponential layer is sometimes used in the output layer of a model
# when the values to predict have very different scales (ex. 0.001, 10., 1000.)

class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)
        
    def build(self, batch_input_shape):
        self.kernel = self.add_weight(
            name='kernel', shape=[batch_input_shape[-1], self.units],
            initializer='glorot_normal')
        self.bias = self.add_weight(
            name='bias', shape=[self.units], initializer='zeros')
        super().build(batch_input_shape)  # Must be at the end. Sets self.built=True.
                                          # This lets Keras know that the layer is built.
        
    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)

    # Usually you can omit the compute_output_shape() method,
    # except when the layer is dynamic.
    # Returns the shape of the layer's outputs.
    # In this case it is the input shape with its last dimension
    # replaced by the number of neurons in the layer.
    # In tf.keras, shapes are instances of tf.TensorShape class, which you can
    # convert to list using as_list().
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(                   # TensorShape takes a list of dimensions,
            batch_input_shape.as_list()[:-1] +   # and creates a shape of a tensor.
            [self.units])                        # If list_a = [1, 2, 3],
                                                 # list_a + [x, y] = [1, 2, 3, x, y].
                                                 # Unlike list_a.extend([x, y]), + returns a
                                                 # new list and leaves list_a unchanged.
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'units': self.units,
                'activation': keras.activations.serialize(self.activation)}
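
The reason `build()` receives the input shape is that the kernel cannot be created before the number of inputs is known. The sketch below shows this lazy creation using the built-in `keras.layers.Dense` as a stand-in for `MyDense`; the shapes are made up for illustration.

```python
import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(3)
built_before = layer.built                 # no weights exist yet

outputs = layer(tf.ones((2, 4)))           # first call triggers build() with 4 inputs
built_after = layer.built

kernel_shape = tuple(layer.kernel.shape)   # (4, 3): [inputs, units]
bias_shape = tuple(layer.bias.shape)       # (3,):   [units]
output_shape = tuple(outputs.shape)        # (2, 3): last dimension replaced by units
```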

In [99]:
# Build a layer that takes multiple inputs and
# has multiple outputs
class MyMultiLayer(keras.layers.Layer):
    def call(self, X):                   # For a layer with multiple inputs (ex. Concatenate),
        X1, X2 = X                       # X should be a tuple containing all inputs
        return [X1 + X2, X1 * X2, X1 / X2]  # To create a layer with multiple outputs, call()
                                            # should return the list of outputs.
    
    def compute_output_shape(self, batch_input_shape):
        b1, b2 = batch_input_shape
        return [b1, b1, b1]              # should probably handle broadcasting rules.
                                         # To create a layer with multiple outputs,
                                         # compute_output_shape() should return the 
                                         # list of batch output shapes (one per output).

In [101]:
# For a layer that behaves differently during training and testing (ex. if it
# contains Dropout or BatchNormalization layers),
# add a training argument to call() and decide what to do within call(). ex:
class MyGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev
        
    def call(self, X, training=None):
        if training:
            noise = tf.random.normal(tf.shape(X), 
                                     stddev=self.stddev)
            return X + noise
        else:
            return X
    
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape
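
The effect of the `training` argument can be observed with the built-in `keras.layers.GaussianNoise`, which does the same thing as the class above. A minimal sketch, with a made-up input:

```python
import tensorflow as tf
from tensorflow import keras

x = tf.ones((2, 3))
noise_layer = keras.layers.GaussianNoise(stddev=1.0)

at_inference = noise_layer(x, training=False)   # inputs pass through unchanged
at_training = noise_layer(x, training=True)     # random noise is added

unchanged = bool(tf.reduce_all(at_inference == x))
noisy = not bool(tf.reduce_all(at_training == x))
print(unchanged, noisy)
```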

### Custom Models

In [102]:
# Suppose we want to create a model with:
#   - Final Dense output layer
#   - A Residual Block. Each Residual Block contains
#     - An addition taking two inputs (one from the last Dense layer, 
#                                      one from the input of the block)
#     - A Dense layer
#     - A Dense layer
#   - A Residual Block X 4
#   - A Dense layer
#
# Inputs flow from bottom to top.
# This model is just an example to illustrate how you can build
# any model you want with any layer combination.
#
# This layer contains other layers - it is special.
# Keras detects that the hidden attribute contains trackable objects (layers in this case),
# so their variables are automatically added to this layer's list of variables.

class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        
        super().__init__(**kwargs)
        
        self.hidden = [keras.layers.Dense(n_neurons, activation='elu',
                                          kernel_initializer='he_normal')
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z  # Skip connection: element-wise sum of the inputs and the output of the last Dense layer
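
The `inputs + Z` at the end of `call()` is an element-wise addition, which is why the Dense layers in the block must output as many features as the block receives. A short sketch of the difference from a concatenation, using made-up tensors:

```python
import tensorflow as tf

inputs = tf.ones((2, 30))
Z = 2.0 * inputs                                  # stands in for the Dense outputs

added = inputs + Z                                # shape stays (2, 30), every value is 3.0
concatenated = tf.concat([inputs, Z], axis=-1)    # shape grows to (2, 60)

print(added.shape, concatenated.shape)
```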

In [None]:
class ResidualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation='elu',
                                          kernel_initializer='he_normal')
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)
        
    def call(self, inputs):
        Z = self.hidden1(inputs) # first dense layer
        for _ in range(4):       # 4 residual blocks
            Z = self.block1(Z)
        Z = self.block2(Z)       # one more residual block
        return self.out(Z)       # final dense output layer

# Now you can compile, fit, evaluate this model to make predictions - just like 
# any other model.
# To save model using model.save() and load using keras.models.load_model(),
# implement get_config() in both the ResidualBlock class, and ResidualRegressor class.
# Or you can save/load weights using save_weights() and load_weights() methods.
# Better to use get_config() since you may forget to save/load weights.

# Model is a subclass of Layer, so you can use Models as Layers.
# Extra functionality in Model are these methods:
#   - compile()       - fit()        - evaluate()       - predict()
#   - plus a few variants of the above methods
#   - get_layer() returns one of the model's layers by name or index
#   - save()          - support for keras.models.load_model()
#   - support for keras.models.clone_model()

### Losses and Metrics Based on Model Internals

In [None]:
# Earlier losses were based on labels and predictions 
# (also some on weights like the L1Regularizer).
# You may want to define losses based on other parts
# of your model, such as the weights or activations
# of its hidden layers. This may be useful for
# regularization or to monitor internal aspects
# of the model.
# You can compute a custom loss based on any internals
# you want, then pass it to model's add_loss() method.

# Let's build this model:
#   Hidden layer  +  Auxiliary output layer
#              Hidden layer
#              Hidden layer
#              Hidden layer
#              Hidden layer
#              Hidden layer
#   Auxiliary layer = (associated loss = reconstruction loss = mean((reconstruction - inputs)^2))
#                     By adding the reconstruction loss to the main loss, we will encourage
#                     the model to preserve as much information as possible through the
#                     hidden layers - even the information not useful to the regression task.
#                     In practice, this loss sometimes improves generalization
#                     (it is a regularization loss).

class ReconstructionRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation='selu',
                                          kernel_initializer='lecun_normal')
                       for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)
        self.mean_recon_error = keras.metrics.Mean()     # Metric tracks mean recon error

        
    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = keras.layers.Dense(n_inputs)  # n_inputs is unknown before build(),
                                                         # so this layer is created here.
        super().build(batch_input_shape)
        
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)                 # Scale by 0.05 so reconstruction
                                                         # loss does not dominate main loss.
        mre_value = self.mean_recon_error(recon_loss)
        self.add_metric(mre_value)                       # Add metric value to track the metric
        
        return self.out(Z)
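
The `add_loss()` mechanism can be tried in isolation on a tiny layer. This is a sketch only: `ActivityPenalty` is a hypothetical layer invented for the example, and it penalizes its own inputs rather than a reconstruction error.

```python
import tensorflow as tf
from tensorflow import keras

class ActivityPenalty(keras.layers.Layer):
    """Passes inputs through and registers a loss computed from them."""
    def call(self, inputs):
        self.add_loss(0.05 * tf.reduce_mean(tf.square(inputs)))
        return inputs

layer = ActivityPenalty()
outputs = layer(tf.constant([[1., 2.], [3., 4.]]))

# mean of squares = (1 + 4 + 9 + 16) / 4 = 7.5, scaled by 0.05 gives 0.375
extra_losses = layer.losses
print(float(extra_losses[0]))
```

During training, Keras adds every entry of `losses` to the main loss, which is how the reconstruction loss above ends up in the total.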

### Computing gradients using Autodiff