## Notes on Chapter 12 of Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition)

In [2]:
import numpy as np

np.random.seed(42)

import tensorflow as tf

tf.random.set_seed(42)

from tensorflow import keras

95% of the code you encounter will require only tf.keras and tf.data. Here you learn to create custom:
  - loss functions
  - metrics
  - layers
  - models
  - initializers
  - regularizers
  - weight constraints
  - training loops
    - 99% of the time you will not need a custom loop
    - in the remaining 1%, a custom loop lets you apply
      - special transformations or constraints to the gradients
      - multiple optimizers for different parts of the network

## Tensorflow Library Tour

TensorFlow in one sentence: TensorFlow is fine-tuned for large-scale machine learning,
and it supports parallel and distributed processing.

  - Well suited for heavy computations
  - Fine-tuned for large-scale machine learning
  - Developed by the Google Brain team. Powers many of Google's large-scale services
  - Overview of Tensorflow offerings:
    - Similar to numpy, but with GPU support
    - Each operation
      - is implemented using highly efficient C++ code
      - has multiple implementations called kernels (one for CPUs, one for GPUs, one for TPUs)
    - Supports distributed computing
    - Extracts computation graph from Python functions, optimizes it, runs it in parallel
    - Train a Tensorflow model in one environment (Python on Linux), export the computation graph, run it in another environment (Java on Android)
    - Implements autodiff for computing gradients faster
    - Provides excellent optimizers like RMSProp and Nadam
  - Main uses of Tensorflow:
    - High level Deep learning APIs
      - tf.keras (recommended)
      - tf.estimator (not recommended - use tf.keras if possible)
    - Low level deep learning APIs
      - tf.nn
      - tf.losses
      - tf.metrics
      - tf.optimizers
      - tf.train
      - tf.initializers
    - Autodiff
      - tf.GradientTape
      - tf.gradients()
    - I/O and processing
      - tf.data
      - tf.feature_column
      - tf.audio
      - tf.image
      - tf.io
      - tf.queue
    - Visualization with TensorBoard
      - tf.summary
    - Deployment and optimization
      - tf.distribute
      - tf.saved_model
      - tf.autograph
      - tf.graph_util
      - tf.lite
      - tf.quantization
      - tf.tpu
      - tf.xla
    - Special data structures
      - tf.lookup
      - tf.nest
      - tf.ragged
      - tf.sets
      - tf.sparse
      - tf.strings
    - Mathematics, including linear algebra and signal processing
      - tf.math
      - tf.linalg
      - tf.signal
      - tf.random
      - tf.bitwise
    - Miscellaneous
      - tf.compat
      - tf.config
      - and more ...
  - TensorFlow library ecosystem
    - Tensorboard for visualization
    - TensorFlow Extended (TFX) to productionize Tensorflow projects. Includes tools for
      - Data validation
      - Preprocessing
      - Model analysis
      - Serving with TF Serving
    - To download and re-use pre-trained models for particular datasets:
      - https://www.tensorflow.org/resources/models-datasets
      - https://github.com/jtoy/awesome-tensorflow
      - Tensorflow Hub: https://www.tensorflow.org/hub/
      - https://github.com/tensorflow/models/
      - Machine learning papers with code: https://paperswithcode.com/
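
The autodiff support listed above (tf.GradientTape) can be tried on a one-variable function. This is a minimal sketch; the function f is a made-up example, not something from the book.

```python
import tensorflow as tf

# Hypothetical toy function: f(w) = 3*w^2 + 2*w, so df/dw = 6*w + 2.
def f(w):
    return 3.0 * w ** 2 + 2.0 * w

w = tf.Variable(5.0)

# Operations on variables are recorded while the tape is active.
with tf.GradientTape() as tape:
    y = f(w)

grad = tape.gradient(y, w)   # analytic value at w = 5 is 6*5 + 2 = 32
print(float(grad))
```

The tape is consumed by the first call to tape.gradient() unless it is created with persistent=True.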

## Using Tensorflow like numpy

Main differences between Tensorflow and Numpy:
  - Numpy defaults to 64-bit integers and floats.
    Tensorflow defaults to 32-bit integers and floats.
    32-bit precision is generally enough for neural networks,
    and it runs faster and uses less memory.
  - Numpy automatically casts types when the operand types differ.
    Tensorflow raises an exception instead.

### Tensor operations

In [101]:
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])    # matrix
t

<tf.Tensor: id=786, shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [102]:
tf.constant(10)                                  # scalar

<tf.Tensor: id=787, shape=(), dtype=int32, numpy=10>

In [103]:
t.shape, t.dtype

(TensorShape([2, 3]), tf.float32)

#### Tensorflow indexing is same as Numpy indexing

In [104]:
t[:, :1]                                         # Selecting a column as a 2-D array

<tf.Tensor: id=791, shape=(2, 1), dtype=float32, numpy=
array([[1.],
       [4.]], dtype=float32)>

In [105]:
t[..., 1]                                         # Selecting a column as a 1-D array

<tf.Tensor: id=795, shape=(2,), dtype=float32, numpy=array([2., 5.], dtype=float32)>

In [106]:
t[..., :1]

<tf.Tensor: id=799, shape=(2, 1), dtype=float32, numpy=
array([[1.],
       [4.]], dtype=float32)>

In [107]:
t[..., 1, tf.newaxis]                            # The selection is made based on
                                                 # t[..., 1] =>
                                                 #   Select all dimensions until the last
                                                 #   Select only index 1 of the last dimension.
                                                 #   Resulting in [2, 5]
                                                 # Since tf.newaxis is at the last position,
                                                 # the selected values are enclosed inside
                                                 # square brackets for each element =>
                                                 #   Resulting in [[2], [5]]

<tf.Tensor: id=803, shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

In [108]:
t[0, ..., tf.newaxis]                            # The selection is made based on
                                                 # t[0, ...] =>
                                                 #   Select only index 0 of the first dimension,
                                                 #   and keep all remaining dimensions.
                                                 #   Resulting in [1, 2, 3]
                                                 # Since tf.newaxis is at the last position,
                                                 # the selected values are enclosed inside
                                                 # square brackets for each element =>
                                                 #   Resulting in [[1], [2], [3]]

<tf.Tensor: id=807, shape=(3, 1), dtype=float32, numpy=
array([[1.],
       [2.],
       [3.]], dtype=float32)>

In [109]:
t[0, tf.newaxis, ...]                            # The selection is made based on
                                                 # t[0, ...] (tf.newaxis set aside at first) =>
                                                 #   Select only index 0 of the first dimension,
                                                 #   and keep all remaining dimensions.
                                                 #   Resulting in [1, 2, 3]
                                                 # Since tf.newaxis comes before the remaining
                                                 # dimension, the whole row is enclosed inside
                                                 # one extra pair of square brackets =>
                                                 #   Resulting in [[1, 2, 3]]

<tf.Tensor: id=811, shape=(1, 3), dtype=float32, numpy=array([[1., 2., 3.]], dtype=float32)>

In [110]:
t + 10

<tf.Tensor: id=813, shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [111]:
tf.square(t)

<tf.Tensor: id=814, shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

In [112]:
t @ tf.transpose(t)                              # @ represents matrix multiplication. You can
                                                 # also say tf.matmul(t, tf.transpose(t))

<tf.Tensor: id=817, shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

In [113]:
tf.matmul(t, tf.transpose(t))

<tf.Tensor: id=820, shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

In [114]:
# Other math operations:         tf.add(),     tf.multiply(), tf.square(), tf.exp(), tf.sqrt()
# Same names as in numpy:        tf.reshape(), tf.squeeze(),  tf.tile()
# Differently-named operations:  tf.reduce_mean(), tf.reduce_sum(), tf.reduce_max(),
#                                tf.math.log()
# Some names differ from those in numpy because the operations behave differently. ex:
# t.T in numpy is the transpose, but it is tf.transpose(t) in Tensorflow.
# In numpy, t.T gives you a transposed view on the same data.
# In Tensorflow, you get a new tensor with its own copy of the transposed data.
#
# Many functions have aliases. ex. tf.add() is same as tf.math.add().
# This helps keep the packages organized while having concise names for common operations.
#
# If you want code that is usable in other Keras implementations, you should
# use only the functions in the keras.backend. However this is only a subset
# of all the Tensorflow functions. ex:
K = keras.backend
K.square(K.transpose(t)) + 10

<tf.Tensor: id=825, shape=(3, 2), dtype=float32, numpy=
array([[11., 26.],
       [14., 35.],
       [19., 46.]], dtype=float32)>
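
The differently-named reductions take an axis argument just like their Numpy counterparts. A minimal sketch that compares the two on the same data (the array a is a made-up example):

```python
import numpy as np
import tensorflow as tf

a = np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32)
t = tf.constant(a)

# axis=0 reduces over rows (one value per column),
# axis=1 reduces over columns (one value per row).
col_means = tf.reduce_mean(t, axis=0)
row_sums = tf.reduce_sum(t, axis=1)

print(col_means.numpy(), a.mean(axis=0))   # both give [2.5 3.5 4.5]
print(row_sums.numpy(), a.sum(axis=1))     # both give [ 6. 15.]
print(tf.transpose(t).shape)               # (3, 2)
```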

### Tensors and NumPy

In [115]:
a = np.array([2., 4., 5.])
tf.constant(a)                            # Results in a float64 tensor.

<tf.Tensor: id=826, shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [117]:
a, a.shape

(array([2., 4., 5.]), (3,))

In [118]:
# Numpy uses float64 by default.
# Neural networks (and thus tensorflow) typically use float32, so we should use
# tf.constant(a, dtype=tf.float32) instead.
tf.constant(a, dtype=tf.float32)

<tf.Tensor: id=827, shape=(3,), dtype=float32, numpy=array([2., 4., 5.], dtype=float32)>

In [119]:
t.numpy()                                 # Get numpy array from tensor

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

#### Same outputs from both these statements, except for dtype

In [95]:
tf.constant(np.arange(10))                # The output of this statement is the same as
                                          # the output of tf.range(10),
                                          # except for the dtype !!!

<tf.Tensor: id=777, shape=(10,), dtype=int64, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

In [94]:
tf.range(10)                              # The output of this statement is the same as
                                          # the output of tf.constant(np.arange(10)),
                                          # except for the dtype !!!

<tf.Tensor: id=776, shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>

In [97]:
tf.constant(np.arange(10, dtype='int32')) # Now the output is the same as tf.range(10)

<tf.Tensor: id=778, shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>

In [98]:
tf.range(10, dtype='int64')               # Now the output is the same as 
                                          # tf.constant(np.arange(10))

<tf.Tensor: id=785, shape=(10,), dtype=int64, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

### Type conversions

In [19]:
# Tensorflow does not do type conversions automatically
# tf.constant(2.) + tf.constant(40)         # InvalidArgumentError since float + int
# tf.constant(2.) + \                       # InvalidArgumentError since float32 + float64
#   tf.constant(40., dtype=tf.float64)    
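
When mixed types are really needed, tf.cast() makes the conversion explicit. A minimal sketch, assuming default eager execution; it catches TypeError as well in case a different TensorFlow version reports the mismatch that way:

```python
import tensorflow as tf

f = tf.constant(2.)      # float32
i = tf.constant(40)      # int32

try:
    f + i                # mixed dtypes are rejected, as noted above
except (tf.errors.InvalidArgumentError, TypeError) as err:
    print("rejected:", type(err).__name__)

total = f + tf.cast(i, tf.float32)   # explicit conversion
print(total)
```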

### Variables

In [120]:
# Tensors are immutable. They're used to store data.
# We cannot use tensors to implement weights in a Neural Network.
# We can use Variables instead.
# Variables can be modified in-place using: 
#   assign() for assigning value to a variable
#   assign_add() for incrementing
#   assign_sub() for decrementing
#
# You can also modify individual cells (or slices) using the cell's (or slice's)
# assign() method, or by using scatter_update() or scatter_nd_update().
# nd stands for n-dimensions.
#
# In practice, you will add weights using the add_weight() method.
# You will rarely need to create variables manually.
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [121]:
v.assign(2 * v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [122]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [123]:
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

In [124]:
v.scatter_nd_update(indices=[[0, 0], [1, 2]], updates=[100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>
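
The contrast between immutable tensors and mutable variables, and the assign_add() / assign_sub() methods mentioned above, can be sketched as follows (the values are made up):

```python
import tensorflow as tf

t = tf.constant([1., 2., 3.])
try:
    t[0] = 10.                      # tensors do not support item assignment
except TypeError as err:
    print("TypeError:", err)

v = tf.Variable([1., 2., 3.])
v.assign_add([10., 10., 10.])       # v is now [11., 12., 13.]
v.assign_sub([1., 1., 1.])          # v is now [10., 11., 12.]
v[1].assign(0.)                     # v is now [10.,  0., 12.]
print(v.numpy())
```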

### Other Data Structures
  - Sparse tensors (tf.SparseTensor)
    - Represent tensors with mostly zeros, efficiently.
  - Tensor arrays (tf.TensorArray)
    - List of tensors with fixed size by default. Can optionally be made dynamic.
      All tensors they contain must have the same shape and data type.
  - Ragged tensors (tf.RaggedTensor)
    - List of tensors, where each tensor can have different length. ex:
      rt=[[3, 1, 4, 1], [], [5, 9, 2], [6], []]
  - String tensors
    - Regular tensors of type tf.string (they hold byte strings). Unicode strings are encoded
      to utf-8 automatically. Alternatively, represent unicode strings using tensors of type
      tf.int32, where each int32 value represents one unicode code point.
  - Sets
    - Represented as tensors/sparse tensors. ex. tf.constant([[1, 2], [3, 4]]) represents
      two sets {1, 2}, and {3, 4}. Each set is represented as a vector in the tensor's
      last axis.
  - Queues
    - Store tensors across multiple steps. FIFOQueue, PriorityQueue, RandomShuffleQueue (shuffles its items), PaddingFIFOQueue (pads its differently-shaped items)
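
A few of these structures can be built directly. This is a minimal sketch with made-up values; it shows a ragged tensor, a sparse tensor, and the byte-string versus code-point views of a string.

```python
import tensorflow as tf

# Ragged tensor: rows may have different lengths.
rt = tf.ragged.constant([[3, 1, 4, 1], [], [5, 9, 2], [6], []])
print(rt.row_lengths().numpy())                    # [4 0 3 1 0]

# Sparse tensor: only the non-zero entries are stored.
st = tf.SparseTensor(indices=[[0, 1], [1, 2]],
                     values=[1., 2.],
                     dense_shape=[2, 4])
dense = tf.sparse.to_dense(st)
print(dense.numpy())

# String tensor: a byte string. Decoding yields one int32 code point per character.
s = tf.constant("café")
n_bytes = tf.strings.length(s)                     # counts UTF-8 bytes
n_chars = tf.strings.length(s, unit="UTF8_CHAR")   # counts characters
code_points = tf.strings.unicode_decode(s, "UTF-8")
print(int(n_bytes), int(n_chars), code_points.numpy())
```

The byte count is larger than the character count here because "é" takes two bytes in UTF-8.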

## Customizing Models and Training Algorithms

### Custom Loss Functions

In [125]:
# Suppose your training set is still a bit noisy,
# even after you remove/fix outliers.
# The MSE loss function penalizes large errors too much.
# With the MAE loss function, training may take a while to converge,
# and the trained model may be imprecise.
# Use the Huber loss. It's available as tf.keras.losses.Huber, but we can also write our own.
# (This function is defined for real, not commented out, because HuberMetric below uses it.)
def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    return huber_fn

# For better performance, use vectorized implementation.
# If you want to benefit from TensorFlow's graph features, use only Tensorflow operations.
# Return a tensor containing 1 loss per instance, instead of returning mean loss.
# This way Keras can apply class weights or sample weights when requested.

# model.compile(loss=create_huber(2.0), optimizer='nadam')
# model.fit(X_train, y_train, ...)

# Keras will use the created huber_fn as the loss function to perform Gradient Descent.
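
To see why the Huber loss is gentler on outliers than the squared loss, it helps to evaluate both on a few errors. A minimal sketch using the same formula as above; the helper name huber and the error values are made up for illustration.

```python
import tensorflow as tf

def huber(error, threshold=1.0):
    # Quadratic below the threshold, linear at and above it.
    small = tf.abs(error) < threshold
    squared = tf.square(error) / 2
    linear = threshold * tf.abs(error) - threshold ** 2 / 2
    return tf.where(small, squared, linear)

errors = tf.constant([0.5, 1.0, 10.0])     # the last one plays the outlier
huber_values = huber(errors)
squared_values = tf.square(errors) / 2

print(huber_values.numpy())      # [0.125 0.5   9.5  ]
print(squared_values.numpy())    # [ 0.125  0.5   50.   ]
```

Both functions return one value per instance, which is what lets Keras apply sample weights before reducing.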

### Saving and loading models that contain custom components

In [126]:
# Keras saves the name of the function when you save the model.
# But it does not save the threshold value (hyperparameter).
# To load it, you have to map the saved function name to the actual function,
# and give it the threshold value.
# model = keras.models.load_model('model.h5', 
#                                 custom_objects={'huber_fn': create_huber(2.0)})

In [127]:
# Should you use a function for your loss, or a class?
# If your function does not have any hyperparameters,
# then you just need to map the function name to
# the function when you load the model.
# If your function has hyperparameters, it's best to
#   - subclass from keras.losses.Loss,
#   - pass in the hyperparameter into the __init__() and save it
#   - use the hyperparameter in the call() method
#   - and provide a get_config() method to return the full config
#     - the config for the keras.losses.Loss and
#     - the hyperparameter for our class
#   - when loading the model from the file, map the class name to the class

# To get around this issue of saving parameters to functions, 
# create a subclass and implement its get_config() method:
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
        
    # Losses are used to calculate a gradient to train the model.
    # This is why they must be differentiable everywhere they're evaluated.
    # Their gradients should not be 0 everywhere.
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

# keras.losses.Loss class can be initialized using:
#   - name: The name of the loss
#   - reduction: Algorithm to use to aggregate individual instance losses.
#                Default is 'sum_over_batch_size' which:
#                  - weighs samples (if needed)
#                  - adds all values
#                  - divides by batch size (notice - not by sum_of_weights, 
#                                           this is not weighted mean)
#                Other algorithm options are 'sum' and None

# model.compile(loss=HuberLoss(2.0), optimizer='nadam')

# When you save the model, Keras calls the loss instance's get_config()
# and saves the config as JSON in the HDF5 file.
# When you load the model, it calls the from_config() in the HuberLoss() class
# (or if not present, in the Loss class). This method creates an instance of
# the class, passing the **config to the constructor.

# When loading the model, map the class name to the class:
# model = keras.models.load_model('model.h5', 
#                                 custom_objects={'HuberLoss': HuberLoss})
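
The get_config() / from_config() round trip described above can be checked without saving a model. A minimal sketch, assuming a HuberLoss class like the one in these notes, with the threshold read from self.threshold:

```python
import tensorflow as tf
from tensorflow import keras

class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold ** 2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        return {**super().get_config(), 'threshold': self.threshold}

loss = HuberLoss(threshold=2.0)
config = loss.get_config()                  # plain dict, includes 'threshold'
restored = HuberLoss.from_config(config)    # calls HuberLoss(**config)
print(config['threshold'], restored.threshold)
```

This is the same mechanism load_model() relies on once custom_objects maps the class name to the class.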

### Custom Activation functions, Initializers, Regularizers, and Constraints

In [128]:
# Most Keras functionalities such as losses, regularizers, initializers, constraints,
# metrics, activation functions, layers, and even full models can be customized in
# the same way as above.
def my_softplus(z): # return value is just tf.nn.softplus(z)
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

# At each training step, the weights will be passed to the regularization function
# to compute the regularization loss, which will be added to the main loss
# to get the final training loss.
def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_l2_regularizer(weights):                   # We divide by 2. since the derivative
    return tf.reduce_sum(tf.square(weights)) / 2. # of x^2 = 2 x, and this removes the 2
                                                  # when finding gradient of the loss function

# The constraint function will be called after each training step,
# and the layer's weights will be replaced by the constrained weights.
def my_positive_weights(weights): # return value is just tf.nn.relu(weights)
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

layer = keras.layers.Dense(30, activation=my_softplus,
                           kernel_initializer=my_glorot_initializer,
                           kernel_regularizer=my_l1_regularizer,
                           kernel_constraint=my_positive_weights)
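
The comments above say that my_softplus matches tf.nn.softplus and my_positive_weights matches tf.nn.relu. A quick check on a made-up tensor, with the two functions copied here so the snippet runs on its own:

```python
import numpy as np
import tensorflow as tf

def my_softplus(z):
    return tf.math.log(tf.exp(z) + 1.0)

def my_positive_weights(weights):
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

z = tf.constant([-2., 0., 3.])

softplus_matches = np.allclose(my_softplus(z).numpy(), tf.nn.softplus(z).numpy())
relu_matches = np.allclose(my_positive_weights(z).numpy(), tf.nn.relu(z).numpy())
print(softplus_matches, relu_matches)
```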

In [129]:
# If a function has hyperparameters you want to save with the model, 
# you must subclass the appropriate class and implement this call method:
#   - keras.regularizers.Regularizer, implement the __call__() method
#   - keras.constraints.Constraint,   implement the __call__() method
#   - keras.initializers.Initializer, implement the __call__() method
#   - keras.losses.Loss,              implement the call() method
#   - keras.layers.Layer (for any layer, including activation functions),
#                                     implement the call() method
#   - keras.models.Model,             implement the call() method

class MyL1Regularizer(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {'factor': self.factor}

### Custom Metrics

#### Streaming Metrics

In [130]:
# MSE or MAE is usually preferred as a metric;
# this just shows how to use the Huber function as a metric.
# model.compile(loss='mse', optimizer='nadam', metrics=[create_huber(2.)])

# For each batch, Keras computes the metric and keeps track of its mean
# since the beginning of the epoch. Most of the time, this is what you want.
# But not always. For example, during classification we may want to track
# precision = true positives / (true positives + false positives).
# Averaging the per-batch precisions does not give the overall precision.
# Instead, we need to keep track of the counts over all batches.
# The keras.metrics.Precision class does this.
p = keras.metrics.Precision()
print(f'1st batch result: {p([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])}')
print(f'2nd batch result: {p([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])}')

# This is called a streaming metric (or stateful metric).
# It's updated batch-by-batch.
print(f'Final result: {p.result()}')

1st batch result: 0.800000011920929
2nd batch result: 0.5
Final result: 0.5


In [131]:
p.variables                   # These track the number of true positives and false positives

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>]

In [132]:
p.result()

<tf.Tensor: id=988, shape=(), dtype=float32, numpy=0.5>

In [133]:
p.reset_states()              # Reset all counts to zero

In [134]:
p.result()

<tf.Tensor: id=999, shape=(), dtype=float32, numpy=0.0>
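
The difference between averaging per-batch precisions and a streaming precision can be worked out by hand for the two batches used above. A plain-Python sketch (the helper counts is made up for illustration):

```python
def counts(y_true, y_pred):
    # True positives and false positives for one batch of 0/1 labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fp

batches = [
    ([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1]),
    ([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0]),
]

per_batch = []
total_tp = total_fp = 0
for y_true, y_pred in batches:
    tp, fp = counts(y_true, y_pred)
    per_batch.append(tp / (tp + fp))
    total_tp += tp
    total_fp += fp

mean_of_batches = sum(per_batch) / len(per_batch)   # (0.8 + 0.0) / 2 = 0.4
streaming = total_tp / (total_tp + total_fp)        # 4 / (4 + 4) = 0.5
print(per_batch, mean_of_batches, streaming)
```

The streaming value matches the 0.5 reported by p.result() and the counts shown in p.variables.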

In [135]:
# When would you use a function for a custom metric
# versus a class?
# If you have a streaming metric, then you would need a class,
# since you need to keep track of parameters, and use those
# parameters in the update_state() function to calculate
# your metric. Also, if you have hyperparameters for your metric,
# you should use a class.
# If you don't need a streaming metric/hyperparameters,
# then you can use a function.
# In the custom metric class:
#   - __init__():     Pass in the hyperparameter, and save it.
#                     Save parameters by using 
#                       self.param = self.add_weight(name='name', initializer=...)
#   - update_state(): Get the metric by metric = custom_fn(y_true, y_pred)
#                     Update your parameters
#   - result():       Return the result by calculating using the parameters
#   - get_config():   return the super().get_config() and your class's hyperparameter,
#                     as a key-value pair
class HuberMetric(keras.metrics.Metric):
    def __init__(self, threshold=1., **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)
        self.total = self.add_weight(name='total', initializer='zeros')
        self.count = self.add_weight(name='count', initializer='zeros')
    
    # Updates variables for one batch
    def update_state(self, y_true, y_pred, sample_weight=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))  # tf.size(x) returns the total
                                                                     # number of elements in x
        # Here we ignore sample_weight. To honor it, one option is to multiply
        # metric by sample_weight before summing, and to add the sum of the
        # weights (instead of the element count) to self.count.
    
    # Computes final result
    def result(self):
        return self.total / self.count
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}
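
One way to take sample_weight into account in a streaming mean is to accumulate weighted values and the sum of the weights. This is a minimal sketch using plain tf.Variable objects; WeightedStreamingMean is a hypothetical helper, not a Keras class, and it is not the book's implementation.

```python
import tensorflow as tf

class WeightedStreamingMean:            # hypothetical helper for illustration
    def __init__(self):
        self.total = tf.Variable(0.0)   # running sum of weight * value
        self.count = tf.Variable(0.0)   # running sum of weights

    def update_state(self, values, sample_weight=None):
        values = tf.cast(values, tf.float32)
        if sample_weight is None:
            sample_weight = tf.ones_like(values)
        sample_weight = tf.cast(sample_weight, tf.float32)
        self.total.assign_add(tf.reduce_sum(values * sample_weight))
        self.count.assign_add(tf.reduce_sum(sample_weight))

    def result(self):
        return self.total / self.count

m = WeightedStreamingMean()
m.update_state([1.0, 3.0])                               # unweighted batch
m.update_state([10.0, 20.0], sample_weight=[2.0, 0.0])   # second value is ignored
print(float(m.result()))                                 # (1 + 3 + 20) / (2 + 2) = 6.0
```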

### Custom Layers

In [136]:
# Should you create a function for a custom layer,
# or a class?
# If you have a custom layer that has no weights,
# like the exponential_layer defined below,
# create a function.
# If you have weights for your layer, you should create a class:
#   - __init__(): pass in the activation function and save it
#   - build():  pass in the batch_input_shape
#               create kernel by adding weights and initializer
#               create bias by adding weights and the initializer
#               super().build() as the last call
#   - call():   return self.activation(X @ self.kernel + self.bias)
#   - get_config(): return the super().get_config() and your class's hyperparameter,
#                   as a key-value pair

# If tensorflow does not provide a default implementation for a layer,
# or if you want to treat a block of layers as a single layer,
# you can create a custom layer.

# Some layers have no weights, ex.
#   - keras.layers.Flatten
#   - keras.layers.ReLU
# To create such layer, write a function and wrap it in a keras.layers.Lambda.
# This layer can be used like any other layer, or as an activation function.
exponential_layer = \
    keras.layers.Lambda(lambda x: tf.exp(x)) # or, activation='exp',
                                             # or activation=tf.exp,
                                             # or activation=keras.activations.exponential
# The exponential layer is sometimes used in the output layer of a model
# when the values to predict have very different scales (ex. 0.001, 10., 1000.)

class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)
        
    def build(self, batch_input_shape):
        self.kernel = self.add_weight(
            name='kernel', shape=[batch_input_shape[-1], self.units],
            initializer='glorot_normal')
        self.bias = self.add_weight(
            name='bias', shape=[self.units], initializer='zeros')
        super().build(batch_input_shape)  # Must be at the end. Sets self.built=True.
                                          # This lets Keras know that the layer is built.
        
    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)

    # Usually you can omit the compute_output_shape() method,
    # except when the layer is dynamic.
    # compute_output_shape() returns the shape of the layer's outputs.
    # In this case it is the input shape with the last dimension
    # replaced by the number of neurons in the layer.
    # In tf.keras, shapes are instances of tf.TensorShape class, which you can
    # convert to list using as_list().
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(                   # TensorShape takes a list of dimensions,
            batch_input_shape.as_list()[:-1] +   # and creates a shape of a tensor.
            [self.units])                        # If list_a = [1, 2, 3],
                                                 # list_a + [x, y] = [1, 2, 3, x, y].
                                                 # Unlike list_a.extend([x, y]), this builds
                                                 # a new list and leaves list_a unchanged.
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'units': self.units,
                'activation': keras.activations.serialize(self.activation)}
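
The shape arithmetic in compute_output_shape() can be tried on its own. A minimal sketch; the width of 30 units and the 8 input features are made-up values.

```python
import tensorflow as tf

units = 30                                      # hypothetical layer width
batch_input_shape = tf.TensorShape([None, 8])   # unknown batch size, 8 features

# Keep every dimension except the last, then append the number of units.
output_shape = tf.TensorShape(batch_input_shape.as_list()[:-1] + [units])
print(output_shape)                             # (None, 30)

# The same shapes on concrete tensors, mirroring X @ kernel + bias:
X = tf.ones([4, 8])
kernel = tf.zeros([8, units])
bias = tf.zeros([units])
outputs = X @ kernel + bias
print(outputs.shape)                            # (4, 30)
```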

In [37]:
# Build a layer that takes multiple inputs and
# has multiple outputs
class MyMultiLayer(keras.layers.Layer):
    def call(self, X):                   # For a layer with multiple inputs (ex. Concatenate),
        X1, X2 = X                       # X should be a tuple containing all inputs
        return [X1 + X2, X1 * X2, X1 / X2]  # To create a layer with multiple outputs, call()
                                            # should return the list of outputs.
    
    def compute_output_shape(self, batch_input_shape):
        b1, b2 = batch_input_shape
        return [b1, b1, b1]              # should probably handle broadcasting rules.
                                         # To create a layer with multiple outputs,
                                         # compute_output_shape() should return the 
                                         # list of batch output shapes (one per output).

In [38]:
# For a layer that behaves differently during training and testing (ex. if it
# uses Dropout or BatchNormalization layers),
# add a training argument to call() and decide what to do within call(). ex:
class MyGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev
        
    def call(self, X, training=None):
        if training:
            noise = tf.random.normal(tf.shape(X), 
                                     stddev=self.stddev)
            return X + noise
        else:
            return X
    
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape
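
Built-in layers use the same training switch. A minimal sketch with keras.layers.Dropout, chosen here only because its two modes are easy to tell apart; the input values are made up.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

dropout = keras.layers.Dropout(rate=0.5)
X = tf.ones([2, 4])

inference_out = dropout(X, training=False)   # inputs pass through unchanged
training_out = dropout(X, training=True)     # some values are zeroed, the rest
                                             # are scaled by 1 / (1 - rate) = 2

print(np.asarray(inference_out))
print(np.asarray(training_out))
```

During fit() Keras passes training=True for you, and during evaluate() and predict() it passes training=False.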

### Custom Models

In [39]:
# Suppose we want to create a model with:
#   - Final Dense output layer
#   - A Residual Block. Each Residual Block contains
#     - An addition (skip connection) taking two inputs (one from last Dense layer, 
#                                                        one from the block's input)
#     - A Dense layer
#     - A Dense layer
#   - A Residual Block X 4
#   - A Dense layer
#
# Inputs flow from bottom to top.
# This model is just an example to illustrate how you can build
# any model you want with any layer combination.
#
# This layer contains other layers - it is special.
# Keras detects that the hidden attribute contains trackable objects (layers in this case),
# so their variables are automatically added to this layer's list of variables.

class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        
        super().__init__(**kwargs)
        
        self.hidden = [keras.layers.Dense(n_neurons, activation='elu',
                                          kernel_initializer='he_normal')
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z  # Skip connection: element-wise sum of the inputs and the
                           # output of the last hidden layer (not a concatenation)
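
The expression inputs + Z is an element-wise sum, so both tensors must have the same shape, and the block's Dense layers need as many neurons as the block has input features. A minimal sketch with made-up tensors, contrasting the sum with a concatenation:

```python
import tensorflow as tf

inputs = tf.ones([2, 3])
Z = 2.0 * tf.ones([2, 3])    # stands in for the output of the block's last Dense layer

added = inputs + Z                               # shape (2, 3), every value is 3.0
concatenated = tf.concat([inputs, Z], axis=-1)   # shape (2, 6)

print(added.shape, concatenated.shape)
```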

In [40]:
class ResidualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation='elu',
                                          kernel_initializer='he_normal')
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)
        
    def call(self, inputs):
        Z = self.hidden1(inputs) # first dense layer
        for _ in range(4):       # 4 residual blocks
            Z = self.block1(Z)
        Z = self.block2(Z)       # one more residual block
        return self.out(Z)       # final dense output layer

# Now you can compile, fit, evaluate this model to make predictions - just like 
# any other model.
# To save model using model.save() and load using keras.models.load_model(),
# implement get_config() in both the ResidualBlock class, and ResidualRegressor class.
# Or you can save/load weights using save_weights() and load_weights() methods.
# Better to use get_config() since you may forget to save/load weights.
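
A minimal sketch of what get_config() could look like for ResidualBlock. It assumes the constructor arguments are stored on the instance; the attributes `self.n_layers` and `self.n_neurons` are added here for illustration and are not stored by the class above.

```python
import tensorflow as tf
from tensorflow import keras

class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        # Assumption: keep the constructor arguments so get_config() can return them.
        self.n_layers = n_layers
        self.n_neurons = n_neurons
        self.hidden = [keras.layers.Dense(n_neurons, activation='elu',
                                          kernel_initializer='he_normal')
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

    def get_config(self):
        base_config = super().get_config()   # name, dtype, trainable, ...
        return {**base_config,
                'n_layers': self.n_layers,
                'n_neurons': self.n_neurons}

block = ResidualBlock(2, 30)
# from_config() rebuilds a layer with the same hyperparameters; weights are not copied.
clone = ResidualBlock.from_config(block.get_config())
```

The same pattern would apply to ResidualRegressor, returning `output_dim` from its get_config().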

# Model is a subclass of Layer, so you can use Models as Layers.
# Extra functionality in Model are these methods:
#   - compile()       - fit()        - evaluate()       - predict()
#   - plus a few variants of the above methods
#   - get_layer() returns one of the model's layers by name or index
#   - save()          - support for keras.models.load_model()
#   - support for keras.models.clone_model()

# When would you need to create a dynamic Keras model?
# How do you do that?
# Why not make all your models dynamic?

### Losses and Metrics Based on Model Internals

In [41]:
# Earlier losses were based on labels and predictions 
# (also some on weights like the L1Regularizer).
# You may want to define losses based on other parts
# of your model, such as the weights or activations
# of its hidden layers. This may be useful for
# regularization or to monitor internal aspects
# of the model.
# You can compute a custom loss based on any internals
# you want, then pass it to model's add_loss() method.

# Let's build this model:
#   Hidden layer  +  Auxiliary output layer
#              Hidden layer
#              Hidden layer
#              Hidden layer
#              Hidden layer
#              Hidden layer
#   Auxiliary layer = (associated loss = reconstruction loss (MSE(reconstruction - inputs)))
#                     By adding the reconstruction loss to the main loss, we will encourage
#                     the model to preserve as much information as possible through the
#                     hidden layers - even the information not useful to the regression task.
#                     In practice, this loss sometimes improves generalization
#                     (it is a regularization loss).

class ReconstructionRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation='selu',
                                          kernel_initializer='lecun_normal')
                       for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)
        self.mean_recon_error = keras.metrics.Mean()     # Metric tracks mean recon error

        
    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = keras.layers.Dense(n_inputs)  # n_inputs is unknown before build(),
                                                         # so this layer is created here.
        super().build(batch_input_shape)
        
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)                 # Scale by 0.05 so reconstruction
                                                         # loss does not dominate main loss.
        mre_value = self.mean_recon_error(recon_loss)
        self.add_metric(mre_value)                       # Add metric value to track the metric
        
        return self.out(Z)

### Computing gradients using Autodiff

#### Calculating gradients using differentiation

In [42]:
# To understand how to use autodiff, let's consider differentiating this function:
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

# You can differentiate this analytically by finding the partial derivatives
# with respect to w1, and w2.
# The function for a neural network would be much more complex,
# with tens of thousands of parameters, and finding the derivative
# by hand will be impossible. So you could find an approximation:
w1, w2 = 5, 3
eps = 1e-6
(f(w1 + eps, w2) - f(w1, w2)) / eps

36.000003007075065

In [43]:
(f(w1, w2 + eps) - f(w1, w2)) / eps

10.000000003174137
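
For comparison, the analytic partial derivatives of f are easy to write down here: 6 * w1 + 2 * w2 with respect to w1, and 2 * w1 with respect to w2. The sketch below (helper names `df_dw1` and `df_dw2` are made up for illustration) evaluates them at the same point, so the finite-difference estimates can be checked against the exact values.

```python
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

def df_dw1(w1, w2):
    # Partial derivative of f with respect to w1.
    return 6 * w1 + 2 * w2

def df_dw2(w1, w2):
    # Partial derivative of f with respect to w2.
    return 2 * w1

w1, w2 = 5, 3
eps = 1e-6
exact = (df_dw1(w1, w2), df_dw2(w1, w2))
approx = ((f(w1 + eps, w2) - f(w1, w2)) / eps,
          (f(w1, w2 + eps) - f(w1, w2)) / eps)
errors = tuple(abs(a - e) for a, e in zip(approx, exact))
```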

#### Calculating gradients using GradientTape (Autodiff)

In [44]:
# The finite-difference estimates above contain small approximation errors.
# Additionally, computing them required calling f() at least once per parameter
# (plus one call for the reference value f(w1, w2)).
# For a large neural network, that many function evaluations would take a long time.
# Autodiff allows us to find the answer much, much faster.

w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:           # GradientTape records each operation that involves
    z = f(w1, w2)                         # a variable. To save memory, only put what is
                                          # required within the tf.GradientTape() block.
                                          # Alternatively, pause recording using a
                                          #   with tape.stop_recording():
                                          # block inside the tf.GradientTape() block.
gradients = tape.gradient(z, (w1, w2))    # The tape is automatically erased immediately after
gradients                                 # calling its gradient() method. Calling gradient()
                                          # again will cause a runtime error.

# These derivatives are exact (up to floating-point precision), unlike the estimates above.
# The gradient() method only goes through the recorded computations once (in reverse order).
# So no matter the number of variables, this method is incredibly efficient.

(<tf.Tensor: id=248, shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: id=240, shape=(), dtype=float32, numpy=10.0>)

#### Making the tape persistent

In [45]:
# To call gradient() more than once, make the tape persistent,
# and delete it each time you're done with it to free resources.
with tf.GradientTape(persistent=True) as tape:
    z = f(w1, w2)

dz_dw1 = tape.gradient(z, w1)             # results in tensor of value 36.0
dz_dw2 = tape.gradient(z, w2)             # results in tensor of value 10.0
del tape

#### Forcing the tape to watch any tensor

In [46]:
# By default, the tape will only track variables.
# If you try to compute the gradient of z w.r.t.
# anything else, the result will be None
c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
    z = f(c1, c2)
tape.gradient(z, [c1, c2])

[None, None]

In [47]:
# You can force the tape to watch any tensors:
with tf.GradientTape() as tape:
    tape.watch(c1)
    tape.watch(c2)
    z = f(c1, c2)
tape.gradient(z, [c1, c2])

# If your inputs vary slightly, but your activations vary a lot,
# you can create a regularization loss based on the gradient
# of these activations w.r.t. the inputs.
# Since the inputs are constant values, you can use
# tape.watch(inputs) and find the gradients.

[<tf.Tensor: id=315, shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: id=307, shape=(), dtype=float32, numpy=10.0>]
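
A small sketch of the input-gradient idea described in the cell above. The weights and inputs are toy values chosen for illustration, and the squared-gradient penalty is just one possible choice of regularization term.

```python
import tensorflow as tf

inputs = tf.constant([[1.0, 2.0]])          # a constant tensor, so it must be watched
weights = tf.Variable([[0.5], [-1.0]])      # hypothetical layer weights

with tf.GradientTape() as tape:
    tape.watch(inputs)
    activations = tf.tanh(inputs @ weights)

# Sensitivity of the activations to each input feature, shape (1, 2).
input_grads = tape.gradient(activations, inputs)
# A penalty that grows when small input changes cause large activation changes.
penalty = tf.reduce_sum(tf.square(input_grads))
```

To train with such a penalty, this computation would itself have to run inside an outer GradientTape so that the penalty can be differentiated with respect to the weights.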

#### Gradient of a vector

In [48]:
# If you try to compute the gradient of a vector containing multiple losses,
# Tensorflow will calculate the gradient of the vector's sum.
# If you need gradients for each component of the vector w.r.t. the model parameters,
# you must call the tape's jacobian().
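
A minimal sketch of the difference, using a two-element vector built from the two terms of f (the variable names are illustrative). The tape is persistent only because both gradient() and jacobian() are called on it.

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])

with tf.GradientTape(persistent=True) as tape:
    # A vector of two "losses": 3 * w0^2 and 2 * w0 * w1.
    losses = tf.stack([3.0 * w[0] ** 2, 2.0 * w[0] * w[1]])

grad_of_sum = tape.gradient(losses, w)   # gradient of losses[0] + losses[1], shape (2,)
jac = tape.jacobian(losses, w)           # one row of gradients per loss, shape (2, 2)
del tape
```

Each row of `jac` is the gradient of one component, and summing the rows gives `grad_of_sum`.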

#### Stop gradients from backpropagating through some part of your network

In [53]:
# To stop gradients from backpropagating through some part of your network,
# you must use tf.stop_gradient().
# tf.stop_gradient() acts as an identity function during the forward pass,
# but it does not allow gradients to flow back (it acts like a constant).
def f(w1, w2):
    return 3 * w1 ** 2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
    z = f(w1, w2)           # same result as without stop_gradient() on forward pass

tape.gradient(z, (w1, w2))  # returns (tensor 30, None)

(<tf.Tensor: id=394, shape=(), dtype=float32, numpy=30.0>, None)

#### Computing errors when calculating gradients

In [54]:
# Sometimes, when computing gradients, we get a result of NaN. ex.
# Computing the gradients of my_softplus() function for large inputs.
x = tf.Variable(100.)
with tf.GradientTape() as tape:
    z = my_softplus(x)

tape.gradient(z, [x])  # Returns NaN. Because of floating-point precision errors,
                       # autodiff ends up computing inf/inf, which results in NaN.

[<tf.Tensor: id=410, shape=(), dtype=float32, numpy=nan>]

In [69]:
# We know the derivative of the softplus function is 1 / (1 + 1/exp(x)) which is
# numerically stable. To use this stable function when computing the derivative of
# my_softplus():
#   - decorate my_softplus() with @tf.custom_gradient, making it return its normal output +
#     the function that computes its derivatives. The numerically stable derivative fn
#     will receive as input the gradients propagated so far. According to the chain rule,
#     we should multiply those gradients with this derivative fn.
@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)                       # exp(z), reused by the gradient function below

    # The main output still explodes
    # because of the exponential.
    # You can use tf.where() to return the inputs
    # when they're large instead.
    output = tf.where(z > 30., z, tf.math.log(exp + 1.))

    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)       # stable: 1 / exp is 0 when exp overflows to inf
    return output, my_softplus_gradients

# Now when we calculate the gradients for large input values,
# we will get the correct result.
x = tf.Variable(100.)
with tf.GradientTape() as tape:
    z = my_better_softplus(x)

tape.gradient(z, [x]), z

([<tf.Tensor: id=721, shape=(), dtype=float32, numpy=1.0>],
 <tf.Tensor: id=714, shape=(), dtype=float32, numpy=100.0>)
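
As a cross-check that is not part of the original notes: the derivative of softplus is the sigmoid function, so at x = 100 the gradient should be 1.0 in float32. The sketch below differentiates TensorFlow's built-in tf.nn.softplus, which is already numerically stable, and compares it with tf.sigmoid.

```python
import tensorflow as tf

x = tf.Variable(100.)
with tf.GradientTape() as tape:
    z = tf.nn.softplus(x)          # built-in, numerically stable softplus

grad = tape.gradient(z, x)         # gradient of softplus at x = 100
reference = tf.sigmoid(x)          # analytic derivative of softplus
```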

### Custom training loops

In [None]:
# Sometimes, the fit() method may not be flexible enough for what you need to do.
# ex. The Wide and Deep paper uses one optimizer for the wide path, and one for the deep.
# Since fit() uses only one optimizer (the one specified when compiling the model),
# implementing this paper requires writing your own custom loop.
# Unless you really need the flexibility, stick with the fit() method.
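
A rough sketch of one training step that uses two optimizers. The `wide` and `deep` parts are toy stand-ins rather than the architecture from the paper, and the data is random. The point is only that a custom loop can route each group of gradients to its own optimizer.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy stand-ins for the two paths (assumptions for illustration only).
wide = keras.layers.Dense(1)
deep = keras.Sequential([keras.layers.Dense(8, activation='relu'),
                         keras.layers.Dense(1)])
wide_opt = keras.optimizers.SGD(learning_rate=0.01)
deep_opt = keras.optimizers.Nadam(learning_rate=0.01)

rng = np.random.default_rng(42)
X = tf.constant(rng.random((32, 4)), dtype=tf.float32)
y = tf.constant(rng.random((32, 1)), dtype=tf.float32)

with tf.GradientTape() as tape:
    y_pred = wide(X) + deep(X)                       # layers are built on this first call
    loss = tf.reduce_mean(tf.square(y - y_pred))

wide_vars = wide.trainable_variables
deep_vars = deep.trainable_variables
grads = tape.gradient(loss, wide_vars + deep_vars)   # one flat list of gradients
wide_grads = grads[:len(wide_vars)]
deep_grads = grads[len(wide_vars):]

# Each optimizer updates only its own group of variables.
wide_opt.apply_gradients(zip(wide_grads, wide_vars))
deep_opt.apply_gradients(zip(deep_grads, deep_vars))
```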

In [None]:
l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation='elu', kernel_initializer='he_normal',
                       kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

def random_batch(X, y, batch_size=32):                # Function that samples a batch of
    idx = np.random.randint(len(X), size=batch_size)  # instances from the training set.
    return X[idx], y[idx]                             # The Data API is much better.
                                                      # Use it instead.

# Displays the training status including:
#   - number of steps
#   - total number of steps
#   - mean loss since start of epoch
# Use the tqdm library to show a progress bar.
def print_status_bar(iteration, total, loss, metrics=None):
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result())
                         for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics,
          end=end)
    
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

# This training loop does not handle layers that behave differently
# between training and testing (ex. BatchNormalization, Dropout).
# To handle these, call the model with training=True and make sure
# it propagates this to every layer that needs it.
for epoch in range(1, n_epochs + 1):
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            
            # The mean_squared_error loss function returns one loss per instance
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred)) # You can weigh each error
                                                                 # here before calculating
                                                                 # the mean.
    
            # Here the + operator concatenates two Python lists: [main_loss] and
            # model.losses (one regularization loss per regularized layer).
            # No broadcasting is involved.
            loss = tf.add_n([main_loss] + model.losses)     # tf.add_n() waits for all inputs 
                                                            # to be ready, then accumulates 
                                                            # (sums) the inputs together.
                                                            # It sums multiple tensors of
                                                            # the same shape and data type.

        gradients = tape.gradient(loss, model.trainable_variables)
        
        # You can apply any transformation to the gradients before the apply_gradients()
        # function call.
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        
        # If you add weight constraints or bias constraints when creating a layer,
        # this will apply the constraints. This needs to be just after apply_gradients()
        for variable in model.variables:
            if variable.constraint is not None:
                variable.assign(variable.constraint(variable))
                
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
    print_status_bar(len(y_train), len(y_train), mean_loss, metrics)
    for metric in [mean_loss] + metrics:
        metric.reset_states()
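
One example of a transformation that could be applied to the gradients before apply_gradients() is clipping by global norm. The sketch below uses hand-written gradient tensors (made-up values) so the effect is easy to see.

```python
import tensorflow as tf

# Stand-in gradients; in the loop above these would come from tape.gradient().
gradients = [tf.constant([3.0, 4.0]), tf.constant([12.0])]

# Rescale all gradients together so their combined L2 norm is at most clip_norm.
clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=1.0)
clipped_norm = tf.linalg.global_norm(clipped)
```

In the training loop, the equivalent step would be `gradients, _ = tf.clip_by_global_norm(gradients, clip_norm)` placed just before `optimizer.apply_gradients(...)`, with `clip_norm` chosen for the problem at hand.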

## Tensorflow Functions and Graphs

In [77]:
def cube(x):
    return x ** 3

tf_cube = tf.function(cube)  # Converts this Python function to a Tensorflow function
tf_cube(2)                   # Now it returns the same values, but as Tensors

# Under the hood, tf.function() analyzed the computations
# performed and generated a computation graph.
# Tensorflow optimizes the computation graph.
# During execution, the TF function runs in parallel
# wherever it can. TF function is usually faster
# than the Python function.
# By default, a TF function generates a new graph for every unique set of input shapes
# and data types, and caches it for subsequent calls. This is only true for tensor arguments.
# If you call your TF function with a numerical Python value, a new graph will be generated
# for every distinct value. Doing so for many values will take up a lot of RAM
# and slow down your computations. You have to delete the TF function to release the RAM used.
# It is better to reserve Python values for arguments with few unique values, ex.
# hyperparameters like the number of neurons per layer.

# When you write a custom loss function, custom metric, custom layer,
# or any other custom function, and you use it in a Keras model,
# Keras automatically converts your function into a TF function.

# To tell Keras NOT to convert your Python functions to TF functions,
# set dynamic=True when creating a custom layer or model.
# Alternatively, when calling the model's compile(), set
# run_eagerly=True.

<tf.Tensor: id=761, shape=(), dtype=int32, numpy=8>
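
A small sketch that makes the retracing behaviour visible. It relies on the fact that Python side effects inside a TF function run only while the function is being traced; the list `trace_log` is a helper introduced here just for counting traces.

```python
import tensorflow as tf

trace_log = []

@tf.function
def cube(x):
    trace_log.append(1)      # Python side effect: executes only during tracing
    return x ** 3

cube(2)                      # Python value 2 -> traced
cube(3)                      # different Python value -> traced again
cube(tf.constant(2))         # int32 scalar tensor -> traced once
cube(tf.constant(3))         # same shape and dtype -> cached graph is reused

n_traces = len(trace_log)
```

Two Python values plus one tensor signature give three traces in total, even though the function was called four times.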

In [79]:
# It is more common to decorate using tf.function
@tf.function
def cube(x):
    return x ** 3

# The original function is still available via the TF function's
# python_function attribute
tf_cube.python_function(2)  # Returns a scalar value, not a tensor, since this is a 
                            # python function

8

### AutoGraph and Tracing

How does TensorFlow generate Graphs?
  - It analyzes the Python function's source code to capture all control flow statements
  - It outputs an upgraded version of the function in which all the control flow statements
    are replaced by the appropriate TF operations, ex. tf.while_loop() for loops, and 
    tf.cond() for if statements
  - TF calls this upgraded function, but instead of passing the argument, it passes
    the symbolic tensor - a tensor without any actual value, only a name, data type, 
    and shape.
  - The function will run in graph mode (each TF operation will add a node in the graph
    to represent itself and its output tensors). This is as opposed to eager execution
    or eager mode. In graph mode TF operations do not perform any computations.
  - For debugging purposes, you can view the generated function's source code by calling
    tf.autograph.to_code(tf_cube.python_function)

In [85]:
tf.autograph.to_code(lambda x: x ** 3)

"tf__lambda = lambda x: ag__.with_function_scope(lambda lambda_scope: x ** 3, 'lambda_scope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True))\n"

In [88]:
def test(x):
    return (x + 2) ** 3
tf.autograph.to_code(test)

"def tf__test(x):\n  do_return = False\n  retval_ = ag__.UndefinedReturnValue()\n  with ag__.FunctionScope('test', 'test_scope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as test_scope:\n    do_return = True\n    retval_ = test_scope.mark_return_value((x + 2) ** 3)\n  do_return,\n  return ag__.retval(retval_)\n"

### TF Function Rules

  - If you call any external library, it will not be part of the graph.
    A TF graph can only include TF constructs (tensors, operations, variables, 
    datasets, etc.). This has a few additional implications:
    - If your function just returns a random number using np.random.rand(), 
      a random number is created during tracing (the stage of building the graph).
      This is why calls to the function using the same data type will return
      the same random number. Instead, use tf.random.uniform() to generate a
      random number on each call.
    - Side-effects occur only during tracing (graph building)
    - Wrapping arbitrary Python code in a tf.py_function() operation will hinder performance,
      as TF cannot do any graph optimizations on this code. It will also reduce portability,
      since the code will only run on systems where Python is available and the correct
      libraries are installed.
  - You can call other Python or TF functions, but they should follow the same rules.
    These other functions do not need to be decorated with @tf.function
  - If the function creates a TF variable (or a stateful TF object), it must do so upon
    the very first call, and only then, or else you will get an exception.
    It's preferable to create variables outside of the TF function
    (ex. a custom layer's build() method).
    To assign a new value to a variable, use v.assign() rather than the = operator.
  - Python function's source code should be available to TF. ex:
    Defining your python code in a python shell, or
    deploying only the compiled *.pyc files to production
    will cause failures.
  - TF will only capture for loops that iterate over a tensor or dataset, so use
    for i in tf.range(x): rather than
    for i in range(x):
    or else the loop will not be captured in the graph. Instead it will run during tracing.
    This may be what you want (ex. use the loop to create each layer in the network).
  - Prefer a vectorized implementation

## Implement a custom layer that performs Layer Normalization

In [137]:
class NormalizedLayer(keras.layers.Layer):
    def __init__(self, eps=0.001, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps                    # small constant that avoids division by zero

    def build(self, batch_input_shape):
        # One scale (alpha) and one offset (beta) per input feature.
        self.alpha = self.add_weight(name='alpha', shape=batch_input_shape[-1:],
                                     initializer='ones')
        self.beta = self.add_weight(name='beta', shape=batch_input_shape[-1:],
                                    initializer='zeros')
        super().build(batch_input_shape)

    def call(self, X):
        # Layer normalization computes the statistics per instance, across the
        # features (last axis), not across the batch.
        mean, var = tf.nn.moments(X, axes=-1, keepdims=True)
        std = tf.math.sqrt(var)
        return self.alpha * (X - mean) / (std + self.eps) + self.beta