<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/12-custom-models-and-training-with-tensorflow/04_customizing_gradients_and_training_loops.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Customizing Gradients and Training Loops

About 95% of the use cases you will encounter require nothing more than `tf.keras` and `tf.data`.

But now it’s time to dive deeper into TensorFlow
and take a look at its lower-level Python API. This will be useful when you need extra
control to write custom loss functions, custom metrics, layers, models, initializers,
regularizers, weight constraints, and more. 

You may even need to fully control the
training loop itself, for example to apply special transformations or constraints to the
gradients (beyond just clipping them) or to use multiple optimizers for different parts
of the network.

TensorFlow’s API revolves around tensors, which flow from operation to operation—hence the name TensorFlow.

A tensor is very similar to a NumPy ndarray: it is usually
a multidimensional array, but it can also hold a scalar (a simple value, such as 42).
These tensors will be important when we create custom cost functions, custom metrics,
custom layers, and more, so let’s see how to create and manipulate them.
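
As a quick illustration of the ndarray analogy, here is a minimal sketch that builds a 2D tensor and a scalar tensor with `tf.constant()` and converts one back to NumPy. The variable names are chosen for this example only.

```python
import tensorflow as tf

# A 2x3 matrix of float32 values, much like a NumPy ndarray
matrix = tf.constant([[1., 2., 3.], [4., 5., 6.]])

# A tensor can also hold a single scalar value
scalar = tf.constant(42)

print(matrix.shape, matrix.dtype)  # shape and dtype, as with an ndarray
print(scalar.shape)                # a scalar tensor has an empty shape

# Convert the tensor to a NumPy array
as_array = matrix.numpy()
```

Most NumPy-style operations (indexing, arithmetic, reductions) have close equivalents on tensors.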



## Setup

In [1]:
import sys
import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow import keras

import numpy as np
import os
import time

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [None]:
def my_softplus(z): # return value is just tf.nn.softplus(z)
  return tf.math.log(tf.exp(z) + 1.0)

## Loading Dataset

Let's start by loading and preparing the California housing dataset. 

In [2]:
housing = fetch_california_housing()

x_train_full, x_test, y_train_full, y_test = train_test_split(housing.data, housing.target.reshape(-1, 1), random_state=42)
x_train, x_valid, y_train, y_valid = train_test_split(x_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_valid_scaled = scaler.transform(x_valid)
x_test_scaled = scaler.transform(x_test)

## Computing Gradients using Autodiff

To understand how to use autodiff to compute gradients
automatically, let’s consider a simple toy function:

In [36]:
def f(w1, w2):
  return 3 * w1 ** 2 + 2 * w1 * w2

Using calculus, we can analytically find the partial derivatives of this function with regard to $w_1$ and $w_2$:

$$
\frac{\partial f}{\partial w_1} = \frac{\partial (3 w_1^2 + 2 w_1 w_2)}{\partial w_1} = 3 \cdot 2 \cdot w_1^{2-1} + 2 \cdot w_1^{1-1} \cdot w_2 = 6 w_1 + 2 w_2
$$

$$
\frac{\partial f}{\partial w_2} = \frac{\partial (3 w_1^2 + 2 w_1 w_2)}{\partial w_2} = 0 + 2 \cdot w_1 \cdot w_2^{1-1} = 2 w_1
$$

For example, at the point $(w_1, w_2) = (5, 3)$, these partial
derivatives are equal to 36 and 10, respectively, so the gradient vector at this point is `(36, 10)`.


In [37]:
dw1 = 6 * 5 + 2 * 3
print(dw1) 

36


In [39]:
dw2 = 2 * 5
print(dw2)

10


But if this were a neural network, the function would be much more complex,
typically with tens of thousands of parameters, and finding the partial derivatives
analytically by hand would be an almost impossible task. 

One solution could be
to compute an approximation of each partial derivative by measuring how much the
function’s output changes when you tweak the corresponding parameter:

In [40]:
w1, w2 = 5, 3
eps = 1e-6

In [41]:
(f(w1 + eps, w2) - f(w1, w2)) / eps

36.000003007075065

In [42]:
(f(w1, w2 + eps) - f(w1, w2)) / eps

10.000000003174137

Looks about right! This works rather well and is easy to implement, but it is just an approximation, and importantly you need to call `f()` at least once per parameter (not twice, since we could compute $f(w_1, w_2)$ just once).

Needing to call `f()` at least once
per parameter makes this approach intractable for large neural networks. 
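
To make that cost concrete, here is a small pure-Python sketch of the forward-difference approach that also counts the calls to the function. The helper `numerical_gradient()` is a hypothetical name introduced for this illustration, and `f()` is repeated so the block is self-contained.

```python
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

def numerical_gradient(func, params, eps=1e-6):
    """Forward-difference approximation of the gradient (illustrative helper)."""
    base = func(*params)  # baseline value, computed once and reused
    calls = 1
    grads = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += eps  # tweak one parameter at a time
        grads.append((func(*shifted) - base) / eps)
        calls += 1
    return grads, calls

grads, calls = numerical_gradient(f, [5.0, 3.0])
print(grads)  # approximately [36, 10]
print(calls)  # one baseline call plus one call per parameter
```

With two parameters this takes three calls. A network with millions of parameters would need millions of forward passes for a single gradient estimate.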

So instead, we should use autodiff. TensorFlow makes this pretty simple:

In [46]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape() as tape:
  z = f(w1, w2)

gradients = tape.gradient(z, [w1, w2])

Let’s take a look at the gradients that TensorFlow computed.

In [47]:
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

Perfect! Not only is the result accurate, but the `gradient()` method only goes through the recorded computations once (in reverse order), no matter how many variables there are, so it is
incredibly efficient. 

It’s like magic!
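
The following sketch shows the same idea with many parameters: one call to `gradient()` returns every partial derivative at once. The 1,000-element variable and the toy loss are assumptions made for the example.

```python
import tensorflow as tf

# Hypothetical example: a single variable holding 1,000 parameters
w = tf.Variable(tf.ones([1000]))

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(3.0 * w ** 2)  # d(loss)/d(w_i) = 6 * w_i

# One backward pass yields all 1,000 partial derivatives
grads = tape.gradient(loss, w)
print(grads.shape)
```

Compare this with the finite-difference approach, which would need one extra evaluation of the loss per parameter.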

The tape is automatically erased immediately after you call its `gradient()` method, so
you will get an exception if you try to call `gradient()` twice:

In [52]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape() as tape:
  z = f(w1, w2)

dz_dw1 = tape.gradient(z, w1)  # => tensor 36.0
print(dz_dw1)  

try:
  dz_dw2 = tape.gradient(z, w2)  # RuntimeError!
except RuntimeError as re:
  print(f"RuntimeError: {re}")

tf.Tensor(36.0, shape=(), dtype=float32)
RuntimeError: A non-persistent GradientTape can only be used to compute one set of gradients (or jacobians)


If you need to call `gradient()` more than once, you must make the tape persistent
and delete it each time you are done with it to free resources.

In [54]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape(persistent=True) as tape:
  z = f(w1, w2)

dz_dw1 = tape.gradient(z, w1)  # => tensor 36.0
print(dz_dw1)  

dz_dw2 = tape.gradient(z, w2)  # => tensor 10.0, works fine now!
print(dz_dw2)
del tape

tf.Tensor(36.0, shape=(), dtype=float32)
tf.Tensor(10.0, shape=(), dtype=float32)


By default, the tape will only track operations involving variables, so if you try to
compute the gradient of z with regard to anything other than a variable, the result
will be None:

In [55]:
c1, c2 = tf.constant(5.), tf.constant(3.)

with tf.GradientTape() as tape:
  z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2])  # returns [None, None]
gradients

[None, None]

However, you can force the tape to watch any tensors you like, to record every operation
that involves them. 

You can then compute gradients with regard to these tensors,
as if they were variables.

In [56]:
c1, c2 = tf.constant(5.), tf.constant(3.)

with tf.GradientTape() as tape:
  tape.watch(c1)
  tape.watch(c2)
  z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2])  # returns [tensor 36., tensor 10.]
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

This can be useful in some cases, for example if you want to implement a regularization loss
that penalizes activations that vary a lot when the inputs vary little: the loss will be
based on the gradient of the activations with regard to the inputs. 

Since the inputs are
not variables, you would need to tell the tape to watch them.
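
A rough sketch of such a penalty follows. The tiny `tanh` layer, the random inputs, and the squared-norm penalty are all assumptions made for illustration, not a prescribed recipe. Note that `gradient()` on a non-scalar target differentiates the sum of its elements, so this penalizes the gradient of the summed activations rather than the full Jacobian.

```python
import tensorflow as tf

tf.random.set_seed(42)

# Hypothetical toy layer: 3 input features -> 4 tanh activations
weights = tf.Variable(tf.random.normal([3, 4]))
inputs = tf.random.normal([8, 3])  # a batch of 8 instances (a plain tensor)

with tf.GradientTape() as tape:
    tape.watch(inputs)  # inputs are not variables, so watch them explicitly
    activations = tf.tanh(inputs @ weights)

# Gradient of the summed activations with regard to each input feature
input_grads = tape.gradient(activations, inputs)

# Illustrative penalty: mean squared norm of the per-instance input gradients
penalty = tf.reduce_mean(tf.reduce_sum(tf.square(input_grads), axis=1))
print(float(penalty))
```

To train with this penalty you would also need its gradient with regard to the weights, which means computing it inside a second, outer tape.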


In some cases you may want to stop gradients from backpropagating through some
part of your neural network. To do this, you must use the `tf.stop_gradient()` function.

The function returns its inputs during the forward pass (like `tf.identity()`), but it does not let gradients through during backpropagation (it acts like a constant).

In [57]:
def f(w1, w2):
  return 3 * w1 ** 2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
  z = f(w1, w2)  # same result as without stop_gradient()

gradients = tape.gradient(z, [w1, w2])  # => returns [tensor 30., None]
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

Finally, you may occasionally run into some numerical issues when computing gradients.

For example, if you compute the gradients of the `my_softplus()` function for
large inputs, the result will be `NaN`.

In [58]:
x = tf.Variable([100.])

with tf.GradientTape() as tape:
  z = my_softplus(x)

gradients = tape.gradient(z, [x])
gradients

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>]

This is because computing the gradients of this function using autodiff leads to some
numerical difficulties: due to floating-point precision errors, autodiff ends up computing
infinity divided by infinity (which returns `NaN`).
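
The overflow behind this can be reproduced in isolation. The block below is a small NumPy sketch using `float32` arithmetic, not TensorFlow's actual backward pass. Which intermediate expression appears depends on how the backward computation is arranged, but both forms shown here end in `NaN`.

```python
import numpy as np

# Silence the overflow/invalid warnings that these operations trigger
with np.errstate(over="ignore", invalid="ignore"):
    big = np.exp(np.float32(100.0))                 # exceeds the float32 range: inf
    inf_over_inf = big / (big + np.float32(1.0))    # inf / inf
    zero_times_inf = (np.float32(1.0) / big) * big  # (1 / inf) * inf = 0 * inf

print(big, inf_over_inf, zero_times_inf)
```

Once `exp(z)` has overflowed to infinity, any expression that has to combine it with itself in this way cannot recover a finite value.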

Fortunately, we can analytically find that the derivative of the softplus function is just $1 / (1 + 1 / \exp(x))$, which
is numerically stable. 
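
For readers who want the intermediate steps, this closed form follows from the chain rule applied to $\log(1 + e^{x})$. The short derivation below is added for illustration:

$$
\frac{d}{dx} \log(1 + e^{x}) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} = \frac{1}{1 + 1 / \exp(x)}
$$

In the last form a large $x$ makes $1 / \exp(x)$ go to 0, so the derivative goes to 1 instead of producing infinity divided by infinity.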

Next, we can tell TensorFlow to use this stable function when
computing the gradients of the `my_softplus()` function by decorating it with
`@tf.custom_gradient` and making it return both its normal output and the function that computes the derivatives.

In [60]:
@tf.custom_gradient
def my_better_softplus(z): 
  exp = tf.exp(z)  # reused by the forward pass and by the gradient function
  def my_softplus_gradients(grad):
    return grad / (1 + 1 / exp)  # numerically stable derivative
  return tf.math.log(exp + 1.0), my_softplus_gradients

Now when we compute the gradients of the `my_better_softplus()` function, we get
the proper result, even for large input values.

In [61]:
x = tf.Variable([100.])

with tf.GradientTape() as tape:
  z = my_better_softplus(x)

gradients = tape.gradient(z, [x])
gradients

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>]

Congratulations! You can now compute the gradients of any function (provided it is
differentiable at the point where you compute it), even blocking backpropagation
when needed, and write your own gradient functions! 

This is probably more flexibility
than you will ever need, even if you build your own custom training loops.
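
As a sanity check on the custom-gradient technique, the sketch below compares a hand-written softplus gradient against the gradient of the built-in `tf.nn.softplus()` at a few moderate inputs. The function name `softplus_with_custom_grad()` and the test points are assumptions made for this example.

```python
import tensorflow as tf

@tf.custom_gradient
def softplus_with_custom_grad(z):  # hypothetical name for this illustration
    exp = tf.exp(z)
    def grad_fn(upstream):
        # Stable closed-form derivative of softplus
        return upstream / (1.0 + 1.0 / exp)
    return tf.math.log(exp + 1.0), grad_fn

x = tf.Variable([-2.0, 0.0, 3.0])

# Persistent tape, because gradient() is called twice
with tf.GradientTape(persistent=True) as tape:
    custom_out = softplus_with_custom_grad(x)
    builtin_out = tf.nn.softplus(x)

custom_grad = tape.gradient(custom_out, x)
builtin_grad = tape.gradient(builtin_out, x)
del tape  # free the resources held by the persistent tape

max_diff = float(tf.reduce_max(tf.abs(custom_grad - builtin_grad)))
print(max_diff)  # expected to be very close to zero
```

Checking a custom gradient against a trusted reference like this is a cheap way to catch sign or algebra mistakes before using it in training.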

## Custom Training Loops