# Automatic differentiation (AD) and gradient tape #

This notebook gives a short overview of how to compute weight derivatives using automatic differentiation. It uses TensorFlow's GradientTape to compute all the gradients required by the backward pass efficiently. First we look at the basic use of GradientTape, then apply it to linear regression: minimising the loss function (using gradient descent or Adam) so we can learn the weights of a straight line fitted to noisy data.

## Gradient tape ##

Evaluate the cell below to import the packages and initialise some TensorFlow variables used in the derivative calculations.

In [None]:
import numpy as np
import tensorflow as tf

x1 = tf.Variable(3.0, name="x1")
x2 = tf.Variable(5.0, name="x2")
x3 = tf.Variable(6.0, name="x3")

def f(a,b):
    return a**2 + b

with tf.GradientTape() as tape: # remove/insert persistent=True in the argument to see issue.
  y = f(x1,x2)**2 + 5*x3

GradientTape records the operations needed to differentiate the function "on a tape". This is achieved with the ``with tf.GradientTape() as tape`` line, followed by the function we wish to differentiate. With this set up, we can calculate the value of the derivative of ``y`` with respect to the input variable ``x1``. Evaluate the cell below to see this.

In [None]:
gradx1 = tape.gradient(y,x1)
print(gradx1.numpy())

Unless we specify ``persistent=True`` in the GradientTape argument, we can only call ``tape.gradient`` once, because the tape releases its resources after the first call. Evaluate the next cell to see the issue.

In [None]:
gradx2 = tape.gradient(y,x2)
print(gradx2.numpy())

Update the argument with ``persistent=True`` and run the previous cells above once more. You should get the derivative of ``y`` with respect to ``x2`` this time. **Repeat the same idea for the x3 variable derivative in the cell below. Does the numerical output make sense?**
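As a cross-check on the values printed above, and separately from the exercise, the sketch below rebuilds the same function with a persistent tape and compares the gradients with respect to ``x1`` and ``x2`` against the hand-derived expressions $\partial y/\partial x_1 = 4x_1(x_1^2 + x_2)$ and $\partial y/\partial x_2 = 2(x_1^2 + x_2)$. The names ``g1`` and ``g2`` are introduced here purely for illustration; ``x3`` is deliberately left for the exercise.

```python
import tensorflow as tf

x1 = tf.Variable(3.0, name="x1")
x2 = tf.Variable(5.0, name="x2")
x3 = tf.Variable(6.0, name="x3")

def f(a, b):
    return a**2 + b

# persistent=True keeps the recorded operations available after the first gradient call
with tf.GradientTape(persistent=True) as tape:
    y = f(x1, x2)**2 + 5*x3

g1 = tape.gradient(y, x1)  # hand-derived: 4*x1*(x1**2 + x2) = 4*3*14 = 168
g2 = tape.gradient(y, x2)  # hand-derived: 2*(x1**2 + x2) = 2*14 = 28
print(g1.numpy(), g2.numpy())

del tape  # drop the reference so the persistent tape's resources can be released
```

Deleting a persistent tape once all gradients have been taken is optional in a short notebook, but it avoids holding on to the recorded operations longer than needed.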

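We must use trainable variables if we want the gradient tape to watch them automatically when computing derivatives. For anything that is not watched, ``tape.gradient`` returns ``None``, as the cell below shows.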

In [None]:
# A trainable variable
x0 = tf.Variable(3.0, name='x0')
# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)
# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2') + 1.0
# Not a variable
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
  y = (x0**2) + (x1**2) + (x2**2) + (x3**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
  print(g)

The ``watched_variables()`` method allows us to check which variables are being watched on the tape.

In [None]:
[var.name for var in tape.watched_variables()]
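A plain tensor can still be differentiated against if the tape is asked to watch it explicitly with ``tape.watch``. The sketch below is a minimal illustration of that; the name ``x_const`` is hypothetical and not used elsewhere in the notebook.

```python
import tensorflow as tf

x_const = tf.constant(3.0)  # a plain tensor, not a tf.Variable

with tf.GradientTape() as tape:
    tape.watch(x_const)     # explicitly ask the tape to track this tensor
    y = x_const**2

grad_const = tape.gradient(y, x_const)
print(grad_const.numpy())   # dy/dx = 2*x = 6.0
```

Without the ``tape.watch`` call, the same ``tape.gradient`` call would return ``None``, which matches the behaviour seen for the constant in the cell above.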

As a simple exercise, we now check that the gradient tape calculations do indeed follow standard calculus rules such as the chain rule. Execute the cells below to print these derivatives.

In [None]:
x = tf.Variable([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
  y = x * x
  z = y * y

print("dz/dx = ", tape.gradient(z, x).numpy())
print("dy/dx = ", tape.gradient(y, x).numpy())
print("dz/dy = ", tape.gradient(z, y).numpy())

Another example...

In [None]:
x = tf.Variable([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
  y = 3*x**2
  z = tf.sin(y) + 2*y

print("dz/dx = ", tape.gradient(z, x).numpy())
print("dy/dx = ", tape.gradient(y, x).numpy())
print("dz/dy = ", tape.gradient(z, y).numpy())

**Is it clear how to obtain dz/dx by the chain rule multiplication? If you know how, symbolically differentiate the function z as a function of x for each example, then insert the numerical values (1 and 3) into the result to check the answer.**
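For the first example, the chain-rule product can be checked numerically along the following lines. This is only a sketch: the names ``chain`` and ``symbolic`` are introduced here for illustration, and the second example is left as the exercise.

```python
import numpy as np
import tensorflow as tf

x = tf.Variable([1.0, 3.0])
with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y * y

dz_dx = tape.gradient(z, x).numpy()
dy_dx = tape.gradient(y, x).numpy()
dz_dy = tape.gradient(z, y).numpy()
del tape

chain = dz_dy * dy_dx             # chain rule: dz/dx = (dz/dy) * (dy/dx), component by component
symbolic = 4.0 * x.numpy()**3     # z = x**4, so dz/dx = 4*x**3
print(dz_dx, chain, symbolic)     # all three agree: [4, 108]
```

The elementwise product is enough here because each component of ``z`` depends only on the matching component of ``x``.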

## Linear regression, GradientTape and AD ##

Here we present another route to computing the line of best fit (linear regression) that we saw in the Linear Regression notebook. The process is in essence identical, but here we explicitly call on GradientTape rather than use Keras. We borrow part of the code from the previous exercise in order to create some line data with noise.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import tensorflow as tf
import matplotlib.pyplot as plt
import random
import numpy as np

# If plots are not displayed and the kernel quits, the two lines below may help; comment them out if they are not needed.
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

"""
Generate the data to be fitted
    xmin      Minimum value in x to sample
    xmax      Maximum value in x to sample
    Ntrain    Number of train data to generate
    Ntest     Number of test data to generate
    m         gradient for the line
    c         constant offset
    Noise     (fractional) Noise level to generate
"""
xmin   = -10
xmax   = 10
Ntrain = 100
Ntest  = 100
m      = 10.0
c      = 5.0
Noise  = 0.05

def genData(xmin, xmax, Ntrain, Ntest, m, c, Noise):
    """
    Function to generate an ensemble of test and train data for fitting
    """
    print("\033[92mGenerating the straight-line data set\033[0m") # The escape codes print the message in green
    x_train = []
    y_train = []
    x_test  = []
    y_test  = []

    #--------------------------------------------------------------------
    def sim_line(xmin, xmax, m, c, Noise):
        """
        Function to simulate a random data point for a straight line
        """
        x = np.random.random()*(xmax-xmin)+xmin
        y = (m*x+c)*(1 + random.random()*Noise)
    
        return x, y
    #--------------------------------------------------------------------
  
    for i in range( Ntrain ):
        x,y = sim_line(xmin, xmax, m, c, Noise)
        x_train.append([x])
        y_train.append([y])

    for i in range( Ntest ):
        x,y = sim_line(xmin, xmax, m, c, Noise)
        x_test.append([x])
        y_test.append([y])
    
    return x_train, y_train, x_test, y_test

# generate data for fitting
x_train, y_train, x_test, y_test = genData(xmin, xmax, Ntrain, Ntest, m, c, Noise)

print("Have generated:")
print("\tN(train) examples            = ", len(x_train))
print("\tN(test) examples             = ", len(x_test))

x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

Let us plot the training data...

In [None]:
plt.scatter(x_train,y_train)

In [None]:
#Checking some functions used below...
?tf.random.normal

In [None]:
?np.ones_like

Running the cell below optimises the weight $m$ and bias $c$ (packaged into a variable we call ``w``) using Adam. Have a read through and check you get the gist of things :)

In [None]:
#Packaging the data in a simple vector/matrix form

Y = y_train
XX = np.hstack([x_train, np.ones_like(x_train)]) # Creating a stacked array for easy multiplication with the weight vector

# Prepare TensorFlow objects
w = tf.Variable(tf.random.normal((2,1)), name="m_and_c") # Packaging the weights (m,c) into a single, randomly initialised variable
x = tf.constant(XX, dtype=tf.float32)     # input sample array
y = tf.constant(Y, dtype=tf.float32)      # output sample array
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
print(w)

# Run optimizer
# The underscore is just a placeholder for the loop variable, since we do not use it explicitly in the code.
mse_list=[]
for _ in range(3000):
    with tf.GradientTape(persistent=True) as tape: # Use persistent=True so we can take derivatives with respect to any node quantity
        Sigma = x @ w # Graph node 1. x @ w is matrix multiplication: each row (x_i, 1) of x is dotted with the weights w = (m, c),
                      # so Sigma is a vector whose i'th component is (m x_{i} + c). The column of ones in x
                      # is what adds the same bias c to every sample.
        y_pred = Sigma # Graph node 2. Redundant here, but it helps us connect the code with the graph nodes.
        mse = (1/Ntrain)*tf.reduce_sum(tf.square(y-y_pred)) # Graph node 3.
        print(mse)
        mse_list.append(mse) # Collecting the mse value from each iteration in a list.

    grad = tape.gradient(mse, w) # Here we take the gradient of the mse with respect to w=(weight,bias)=(m,c) - the main goal
                                 # in AD.
    grad1 = tape.gradient(mse, Sigma)    # The tape evaluates these intermediate gradients automatically during the backward pass,
                                         # but by choosing persistent=True, we
    grad2 = tape.gradient(mse, y_pred)   # can also request them ourselves.


    optimizer.apply_gradients([(grad, w)]) # Finally we apply the update step.

print("The final weight array w = (m,c): ")
print(w)
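The introduction mentions plain gradient descent as an alternative to Adam. A minimal sketch of that variant is given below, assuming noise-free data on the line $y = 10x + 5$, a zero initial weight vector and a hand-picked learning rate of 0.01. All names ending in ``_gd`` (and ``lr``) are illustrative and not part of the notebook.

```python
import numpy as np
import tensorflow as tf

# Noise-free straight-line data (an assumption, so that the expected answer is known)
x_raw = np.linspace(-10.0, 10.0, 100).reshape(-1, 1)
x_gd = tf.constant(np.hstack([x_raw, np.ones_like(x_raw)]), dtype=tf.float32)
y_gd = tf.constant(10.0 * x_raw + 5.0, dtype=tf.float32)

w_gd = tf.Variable(tf.zeros((2, 1)), name="m_and_c_gd")
lr = 0.01  # assumed step size; too large a value makes the iteration diverge

for _ in range(1000):
    with tf.GradientTape() as tape_gd:
        mse_gd = tf.reduce_mean(tf.square(y_gd - x_gd @ w_gd))
    grad_gd = tape_gd.gradient(mse_gd, w_gd)
    w_gd.assign_sub(lr * grad_gd)  # gradient descent update: w <- w - lr * dL/dw

print(w_gd.numpy().ravel())  # approaches (m, c) = (10, 5)
```

The only change from the Adam loop is the update step: ``assign_sub`` applies the raw gradient scaled by a fixed learning rate, whereas Adam adapts the step size for each weight.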

In [None]:
print("The final weight array w = (m,c): ")
print(w)
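One way to sanity-check the optimised weights is to compare them with the closed-form least-squares fit. The sketch below generates data in the same way as ``sim_line`` (but with NumPy's generator and a fixed seed, an assumption made only for reproducibility) and fits it with ``np.linalg.lstsq``. Because the noise factor $(1 + u \cdot \text{Noise})$, with $u$ uniform on $[0, 1)$, has mean $1 + \text{Noise}/2$, the fitted values should lie near $m(1 + \text{Noise}/2)$ and $c(1 + \text{Noise}/2)$ rather than exactly at $m$ and $c$. The names ``xs``, ``ys``, ``A`` and ``w_ls`` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
m_true, c_true, noise, n = 10.0, 5.0, 0.05, 100

xs = rng.uniform(-10.0, 10.0, size=(n, 1))
ys = (m_true * xs + c_true) * (1.0 + rng.random((n, 1)) * noise)

A = np.hstack([xs, np.ones_like(xs)])          # same column layout as XX in the notebook
w_ls, *_ = np.linalg.lstsq(A, ys, rcond=None)  # minimises ||A w - ys||^2
m_fit, c_fit = w_ls.ravel()
print(m_fit, c_fit)
```

Applying the same ``lstsq`` call to the notebook's own ``XX`` and ``Y`` arrays should give a reference point for the values stored in ``w`` after training.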

Note how the linear activation function step within the tape, given by ``y_pred = Sigma``, is pretty much redundant since we could just use ``y_pred = x @ w`` from the get go. To reiterate, this is just to establish the graph nodes as we did in the talk. For example, in another task such as binary classification, we could replace ``y_pred = Sigma`` with ``y_pred = tf.keras.activations.sigmoid(Sigma)`` instead. 

Above, we printed the value of the MSE to see how it was changing. **Type ``plt.plot(mse_list)`` below and evaluate it to visualise how the loss function changes.**

**Using your final output weight and bias, both stored in variable ``w``, construct a line from these and plot it on top of the test data - using x_test and y_test.**

Finally, let us look at the gradients we calculated. Evaluate the cell below. Again, each of these gradients is $\frac{\partial L}{\partial v} = \bar{v}$ where $v$ is either $\hat{y} \equiv y_{\text{pred}}$, $\Sigma$ or $w$.

In [None]:
print(grad)
print(grad1)
print(grad2)

**What can you say about the shape of these gradients?**

It may help to look at each of these quantities in index form below. Here, $i$ indexes the sample set, running from 1 to $N_{train}$.

Forward pass creates the quantities at each graph node:
$$\Sigma_{i} = m x_{i} + c$$

$$\hat{y}_{i} = \Sigma_{i}$$

$$L = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} (y_{i}- \hat{y}_{i})^2 = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} ( y_{i}-(m x_{i} + c) )^2 $$

where $\hat{y} \equiv y_{pred}$. 

On the backward pass the GradientTape calculates all the necessary derivatives. The ones we explicitly save in ``grad``, ``grad1`` and ``grad2`` are:

``grad`` = $$ \bar{w} = (\bar{m}, \bar{c}) = \left(\frac{\partial L}{\partial m}, \quad \frac{\partial L}{\partial c}\right) = \left(-\frac{2}{N_{train}} \sum_{i=1}^{N_{train}} x_{i}( y_{i}-(m x_{i} + c)) , \quad -\frac{2}{N_{train}} \sum_{i=1}^{N_{train}}(y_{i}-(m x_{i} + c )) \right)$$

``grad1`` = $$ \bar{ \Sigma}_{i} = \frac{\partial L}{\partial \Sigma_{i}} = - \frac{2}{N_{train}} \left( y_{i} - \Sigma_{i} \right)$$

``grad2`` = $$\bar{ \hat{y}_{i}} = \frac{\partial L}{\partial \hat{y}_{i}} = - \frac{2}{N_{train}} \left( y_{i} - \hat{y}_{i}\right)$$
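The formulas above can be checked with a few lines of NumPy. The sketch below is an independent check under stated assumptions, not the notebook's own computation: it uses noise-free synthetic data, an arbitrary weight vector, and illustrative names ending in ``_chk``. It also uses the chain-rule identity $\bar{w} = X^{T} \bar{\Sigma}$, which follows from $\Sigma = X w$ and reproduces the two sums in ``grad``.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
xs_chk = rng.uniform(-10.0, 10.0, size=(n, 1))
X_chk = np.hstack([xs_chk, np.ones_like(xs_chk)])  # rows are (x_i, 1)
y_chk = 10.0 * xs_chk + 5.0
w_chk = np.array([[1.0], [2.0]])                   # arbitrary (m, c) at which to evaluate

def loss_chk(w_vec):
    return np.mean((y_chk - X_chk @ w_vec) ** 2)

sigma_bar = -(2.0 / n) * (y_chk - X_chk @ w_chk)   # grad1 (and grad2): one entry per sample
w_bar = X_chk.T @ sigma_bar                        # grad: chain rule back through Sigma = X w

# Central finite differences as an independent check on w_bar
eps = 1e-6
w_fd = np.zeros_like(w_chk)
for k in range(2):
    step = np.zeros_like(w_chk)
    step[k, 0] = eps
    w_fd[k, 0] = (loss_chk(w_chk + step) - loss_chk(w_chk - step)) / (2 * eps)

print(sigma_bar.shape, w_bar.shape)                # (100, 1) (2, 1)
```

The shapes mirror the index form: $\bar{\Sigma}$ and $\bar{\hat{y}}$ carry one entry per sample, while $\bar{w}$ has one entry per weight.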


## References ##
- 1/. For the reference code and further description seen in this notebook: https://machinelearningmastery.com/using-autograd-in-tensorflow-to-solve-a-regression-problem/
- 2/. For TensorFlow's introduction to automatic differentiation: https://www.tensorflow.org/guide/autodiff#:~:text=TensorFlow%20provides%20the%20tf.,GradientTape%20onto%20a%20%22tape%22.
- 3/. Further notes from Roger Grosse on automatic differentiation; see lecture L06 on backpropagation: https://sgfin.github.io/files/notes/CS321_Grosse_Lecture_Notes.pdf