# **Hands-on Deep Learning – Gradient and AutoGrad with TensorFlow/Keras**

_This hands-on was created by Thomas Grenier (TensorFlow) and Fabien Millioz (PyTorch), CREATIS, [deepimaging2019](https://deepimaging2019.sciencesconf.org/)_

thomas.grenier@creatis.insa-lyon.fr

fabien.millioz@creatis.insa-lyon.fr


## <span style="color:brown"> A - Import common modules

In [None]:
import numpy as np
import tensorflow as tf

tf.__version__

> _**If no error occurs, your working environment is OK and you can move on to the next part.
> Otherwise, call an assistant for help!**_

## <span style="color:brown"> B - Tensors of TensorFlow   

This section introduces TensorFlow tensors and their link with NumPy.


### <span style="color:brown"> B1- Tensorflow

TensorFlow 2 runs in eager mode by default: operations are evaluated as soon as they are declared, and their results can be printed immediately.
For example, a 2 by 2 random uniform matrix with values in the range [-1, 1):

In [None]:
a_tf = tf.random.uniform(shape=[2, 2], minval=-1.0, maxval=1.0, seed=42)
print(a_tf)

In [None]:
tensor = tf.multiply(a_tf, 10)
print(tensor)

Or more simply with

In [None]:
print(a_tf * 10)

### <span style="color:brown"> B2- Link with numpy

We can do the same with NumPy arrays, and mix TensorFlow and NumPy in the same expression.

In [None]:
import numpy as np

a_np = np.random.rand(2, 2)
print(a_np)

print("TensorFlow operations convert numpy arrays to Tensors automatically")
tensor = tf.multiply(a_np, 10)
print(tensor)

print("And NumPy operations convert Tensors to numpy arrays automatically")
print(np.add(tensor, 1))

print("The .numpy() method explicitly converts a Tensor to a numpy array")
print(tensor.numpy())

## <span style="color:brown"> C- Calculate a simple gradient
We can use automatic differentiation to compute the gradient of the function $c(a) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} 3(a_{ij}+2)^2$, where $a$ is an $n \times m$ matrix.

Let us write $b(a)$ for the element-wise term: $b_{ij}(a) = 3(a_{ij}+2)^2$.

In [None]:
a = tf.random.uniform(shape=[2, 2])
print("a=", a.numpy())
b = 3 * (a + 2) ** 2
print("b=", b.numpy())

Then we define $c(a)$ as the mean of $b(a)$:

In [None]:
c = tf.reduce_mean(b, name="c")
print("Tensorflow mean : ", c)
print("Numpy mean      : ", np.mean(b))

**Recall that, for a $2 \times 2$ matrix ($nm = 4$), the gradient of $c(a)$ is** $\nabla c(a) = \frac{3(a+2)}{2}$.
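
This expression follows from differentiating the mean term by term. A short derivation, where the indices $k, l$ are introduced here only to name the entry being differentiated:

$$
\frac{\partial c}{\partial a_{kl}}
= \frac{1}{nm} \sum_{i,j} \frac{\partial}{\partial a_{kl}} \, 3(a_{ij}+2)^2
= \frac{1}{nm} \cdot 6(a_{kl}+2)
= \frac{6(a_{kl}+2)}{4}
= \frac{3(a_{kl}+2)}{2}
$$

Only the term with $(i,j) = (k,l)$ depends on $a_{kl}$, so the sum collapses to a single term. The last two steps use $nm = 4$.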

Here we evaluate it automatically from $c$.

In [None]:
x = tf.random.uniform(shape=[2, 2])
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    b = 3 * (x + 2) ** 2
    y = tf.reduce_mean(b)
dy_dx = g.gradient(y, x)

print(" TF auto differentiation gradient \n ", dy_dx)
print(" \n Manually expressed gradient   \n ", 3 * (x + 2) / 2)

print(" \n \n by the way db_dx = ", g.gradient(b, x))

del g  # Drop the reference to the tape
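
The analytic gradient can also be checked without TensorFlow, using central finite differences. The sketch below is a minimal NumPy version of that check. The names `c_numpy`, `numerical_gradient`, `a_check` and the step `eps` are illustrative choices, not part of the original hands-on.

```python
import numpy as np


def c_numpy(a):
    """Mean of 3 * (a + 2)**2 over all entries of a."""
    return np.mean(3.0 * (a + 2.0) ** 2)


def numerical_gradient(f, a, eps=1e-6):
    """Central finite differences of a scalar function f, entry by entry."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        step = np.zeros_like(a)
        step[idx] = eps
        grad[idx] = (f(a + step) - f(a - step)) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
a_check = rng.uniform(size=(2, 2))

analytic = 3.0 * (a_check + 2.0) / 2.0  # formula valid for a 2 x 2 matrix
numeric = numerical_gradient(c_numpy, a_check)

print("analytic gradient:\n", analytic)
print("finite differences:\n", numeric)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

The two arrays should agree up to rounding error, since $c$ is quadratic in each entry of $a$.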

## <span style="color:brown"> D- Gradient descent example

Here we minimize the cost function $L(y, \hat{y}) = (y - \hat{y})^2$ with respect to $w$ and $b$, the weight and bias of a neuron.

The gradients $\frac{\partial L(y, h(x))}{\partial w}$ and $\frac{\partial L(y, h(x))}{\partial b}$ are computed automatically, with:
- $h(x) = \sigma(w.x + b)$
- $\sigma$ is the sigmoid activation function
- $L(y, \hat{y}) = (y - \hat{y})^2$ (quadratic error)
- $y = 0.2$
- $x = 1.5$
- $b = -2$
- $w = 3$

In [None]:
x = tf.constant(1.5)  # x = torch.tensor([1.5])
y = tf.constant(0.2)  # y = torch.tensor([0.2])
b = tf.Variable(-2.0, name="b")  # b = torch.tensor([-2.0], requires_grad=True)
w = tf.Variable(3.0, name="w")  # w = torch.tensor([3.0], requires_grad=True)

# b and w are tf.Variable objects, so the tape watches them automatically
with tf.GradientTape() as g:
    h = tf.math.sigmoid(w * x + b)  # h = torch.sigmoid(w * x + b)
    error = (y - h) ** 2  # error = (y - h)**2

gradients = g.gradient(error, [b, w])  # error.backward()

print("h      = ", h.numpy())
print("grad b = ", gradients[0].numpy())
print("grad w = ", gradients[1].numpy())
del g
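
As a cross-check of the values printed above, the same gradients can be written by hand with the chain rule: $\frac{\partial L}{\partial b} = -2\,(y - h)\,h\,(1 - h)$ and $\frac{\partial L}{\partial w} = x \, \frac{\partial L}{\partial b}$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The sketch below evaluates them in plain Python. The names `sigmoid`, `loss`, `x0`, `y0`, `w0`, `b0` are introduced here for illustration only.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Quadratic error between the target y and the neuron output."""
    return (y - sigmoid(w * x + b)) ** 2


x0, y0 = 1.5, 0.2
w0, b0 = 3.0, -2.0

h0 = sigmoid(w0 * x0 + b0)

# Chain rule: dL/dh = -2 (y - h), dh/dz = h (1 - h), dz/db = 1, dz/dw = x
dL_db = -2.0 * (y0 - h0) * h0 * (1.0 - h0)
dL_dw = dL_db * x0

print("h      = ", h0)
print("grad b = ", dL_db)
print("grad w = ", dL_dw)
```

These values should match those returned by `g.gradient` up to float32 precision.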

We minimize $L(y, h(x))$ iteratively.

The weight and bias are updated according to their gradients:

 - $w \leftarrow w - \alpha \, \frac{\partial L(y, h(x))}{\partial w}$

 - $b \leftarrow b - \alpha \, \frac{\partial L(y, h(x))}{\partial b}$

where $\alpha$ is the learning rate.

In [None]:
alpha = tf.constant(1.0)  # alpha = 1
nb_epochs = 20  # number of epochs
# the corresponding PyTorch code (Fabien Millioz) is given in the trailing comments
x = tf.constant(1.5)  # x = torch.tensor([1.5])
y = tf.constant(0.2)  # y = torch.tensor([0.2])
b = tf.Variable(-2.0, name="b")  # b = torch.tensor([-2.0], requires_grad=True)
w = tf.Variable(3.0, name="w")  # w = torch.tensor([3.0], requires_grad=True)

for i in range(nb_epochs):
    with tf.GradientTape(persistent=False) as g:
        h = tf.math.sigmoid(w * x + b)  # h = torch.sigmoid(w * x + b)
        error = (y - h) ** 2  # error = (y - h)**2
    gradients = g.gradient(error, [b, w])
    b.assign(b - alpha * gradients[0])
    w.assign(w - alpha * gradients[1])
    print(
        "Epoch {} error={:.05f} h={:.05f} w={:.05f} b={:.05f} dE_db={:.05f} dE_dw={:.05f} ".format(
            i + 1,
            error.numpy(),
            h.numpy(),
            w.numpy(),
            b.numpy(),
            gradients[0].numpy(),
            gradients[1].numpy(),
        )
    )

 > **Question:** Observe the influence of $\alpha$ for values of 0.01, 0.1, 10 and 100, and try to adapt the number of epochs accordingly.
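
One way to explore this question is to wrap the loop in a function and call it for each learning rate. The sketch below does this with a hypothetical helper, `run_descent`, which is not part of the original hands-on. It repeats the loop above, uses `assign_sub` for the in-place update, and returns the error recorded at each epoch.

```python
import tensorflow as tf


def run_descent(alpha, nb_epochs):
    """Gradient descent on the single-neuron example; returns the error per epoch."""
    x = tf.constant(1.5)
    y = tf.constant(0.2)
    b = tf.Variable(-2.0, name="b")
    w = tf.Variable(3.0, name="w")
    errors = []
    for _ in range(nb_epochs):
        with tf.GradientTape() as g:
            h = tf.math.sigmoid(w * x + b)
            error = (y - h) ** 2
        grad_b, grad_w = g.gradient(error, [b, w])
        b.assign_sub(alpha * grad_b)  # b <- b - alpha * dL/db
        w.assign_sub(alpha * grad_w)  # w <- w - alpha * dL/dw
        errors.append(float(error.numpy()))
    return errors


for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    errors = run_descent(alpha, nb_epochs=20)
    print(
        "alpha={:<6} first error={:.05f} last error={:.05f}".format(
            alpha, errors[0], errors[-1]
        )
    )
```

Changing `nb_epochs` per learning rate shows how many iterations each setting needs, or whether it converges at all.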
 
 

## <span style="color:brown"> E- Hinge Loss example

Here we study another loss function for classification: the multiclass hinge loss, also known as the 'SVM' loss.
In TensorFlow/Keras, this loss is available through the class `tf.keras.losses.CategoricalHinge`.

  > **Question:** Using the TensorFlow online documentation, give the expression of this loss.
    
  > **Question:** Give the TensorFlow code used to implement this function (explore the TensorFlow source code, and more specifically the `categorical_hinge` function).


In [None]:
y_true = tf.constant([[1.0, 0.0, 0.0]], dtype=tf.float32)
y_pred_bSM = tf.constant([[0.0, 1.0, 0.0]], dtype=tf.float32)  # bSM: before softmax
y_pred = tf.nn.softmax(y_pred_bSM)
print("y_pred", y_pred.numpy())
h = tf.keras.losses.CategoricalHinge()
#h=tf.keras.losses.CategoricalCrossentropy()
print("h =", h( y_true, y_pred ).numpy() )
print("y_true shape", y_true.shape)

  > **Question:** What do the values of y_pred and y_true mean?
  
  > **Question:** Using the equation of the hinge loss, verify the previous result.
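
To help with this verification, here is a minimal NumPy sketch. It assumes the definition documented for `tf.keras.losses.categorical_hinge`, namely $\max(0,\ \text{neg} - \text{pos} + 1)$, where pos is the predicted score of the true class and neg is the highest score among the other classes. The function names and the `_np` variables are introduced here for illustration only.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)


def categorical_hinge(y_true, y_pred):
    """max(0, neg - pos + 1), computed per sample."""
    pos = np.sum(y_true * y_pred, axis=-1)  # score of the true class
    neg = np.max((1.0 - y_true) * y_pred, axis=-1)  # best score among the wrong classes
    return np.maximum(0.0, neg - pos + 1.0)


y_true_np = np.array([[1.0, 0.0, 0.0]])
y_pred_np = softmax(np.array([[0.0, 1.0, 0.0]]))

hinge_np = categorical_hinge(y_true_np, y_pred_np)
print("y_pred", y_pred_np)
print("hinge =", hinge_np)
```

The printed value can be compared with the one returned by `CategoricalHinge` for the same inputs.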
  
The loss value is informative, but a gradient descent scheme also needs its derivatives. The following code computes them automatically, with respect to `y_pred` and to the pre-softmax values `y_pred_bSM`.

In [None]:
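with tf.GradientTape() as g:
    # y_pred_bSM is a constant, so it must be watched explicitly;
    # y_pred is recomputed inside the tape so that the softmax is recorded
    g.watch(y_pred_bSM)
    y_pred = tf.nn.softmax(y_pred_bSM)
    h_graph = h(y_true, y_pred)  # <=> tf.keras.losses.categorical_hinge(y_true, y_pred)
# Compute the gradients once the tape has finished recording
dh_dy = g.gradient(h_graph, (y_pred, y_pred_bSM))
print(dh_dy[0])
print(dh_dy[1])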
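
The two gradients printed above are linked by the chain rule through the softmax, whose Jacobian is $\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$. The NumPy sketch below reproduces this by hand for the small example. It assumes the hinge definition $\max(0,\ \text{neg} - \text{pos} + 1)$ and a loss that is strictly positive. The names `g_p`, `g_z`, `scores` and the helper functions are illustrative.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def hinge(y_true, p):
    """Categorical hinge for one sample: max(0, neg - pos + 1)."""
    pos = np.sum(y_true * p)
    neg = np.max((1.0 - y_true) * p)
    return max(0.0, neg - pos + 1.0)


y_true_np = np.array([1.0, 0.0, 0.0])
scores = np.array([0.0, 1.0, 0.0])  # pre-softmax values
p = softmax(scores)

# Gradient w.r.t. the probabilities (valid while the loss is > 0):
# -1 on the true class, +1 on the highest-scoring wrong class
g_p = -y_true_np.copy()
g_p[np.argmax((1.0 - y_true_np) * p)] += 1.0

# Chain rule through softmax: dL/dz_j = p_j * (g_j - sum_i g_i p_i)
g_z = p * (g_p - np.dot(g_p, p))

print("dL/dy_pred     =", g_p)
print("dL/dy_pred_bSM =", g_z)
```

The entries of `g_z` sum to zero, as expected: adding the same constant to every pre-softmax value leaves the softmax output unchanged.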

A more realistic test, with 9 classes:

In [None]:
y_true = tf.constant([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], dtype=tf.float32)
#y_pred_bSM = tf.constant([[0.0575, 0.0195, 0.4029, 0.0945, 0.0288, 0.0031, 0.0908, 0.0372, 0.2652]], dtype=tf.float32)
y_pred_bSM = tf.constant([[1.26908707619, 0.61077165603, -0.7032195329, 0.92837673425, -0.5457400083, -0.6242400407, 0.45880278945, 0.64756596088, -0.0037845533]], dtype=tf.float32)
y_pred = tf.nn.softmax(y_pred_bSM)
h_hinge = tf.keras.losses.CategoricalHinge()
h_cce = tf.keras.losses.CategoricalCrossentropy()
print("h_hinge =", h_hinge(y_true,y_pred).numpy() )
print("h_cce =", h_cce(y_true, y_pred).numpy() )
print("y_pred ", y_pred.numpy())

In [None]:
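with tf.GradientTape() as g:
    # Only the constant y_pred_bSM needs an explicit watch;
    # y_pred is recomputed inside the tape so that the softmax is recorded
    g.watch(y_pred_bSM)
    y_pred = tf.nn.softmax(y_pred_bSM)
    h_graph = h_hinge(y_true, y_pred)  # <=> tf.keras.losses.categorical_hinge(y_true, y_pred)
# Compute the gradients once the tape has finished recording
dh_dy = g.gradient(h_graph, (y_pred, y_pred_bSM))
print(dh_dy[0])
print(dh_dy[1])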

In [None]:
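with tf.GradientTape() as g:
    g.watch(y_pred)  # y_pred is a plain tensor, so it must be watched explicitly
    h_graph = h_cce(y_true, y_pred)
# Compute the gradient once the tape has finished recording
dh_dy = g.gradient(h_graph, y_pred)
print(dh_dy)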
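
The cell above differentiates the cross-entropy with respect to the probabilities `y_pred`. For comparison with the hinge case, where the gradient with respect to the pre-softmax values was also computed, the sketch below uses the standard identity for a softmax followed by a cross-entropy with a one-hot target: the gradient with respect to the pre-softmax values is $p - y$. It is a NumPy illustration on the small 3-class example, checked by finite differences. All names are introduced here, and it does not reproduce the internals of `CategoricalCrossentropy`.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def cross_entropy_from_scores(y_true, z):
    """Cross-entropy between a one-hot target and softmax(z)."""
    return -np.sum(y_true * np.log(softmax(z)))


y_true_np = np.array([1.0, 0.0, 0.0])
scores = np.array([0.0, 1.0, 0.0])  # pre-softmax values

# Analytic gradient w.r.t. the pre-softmax values: p - y
analytic = softmax(scores) - y_true_np

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(scores)
for k in range(scores.size):
    step = np.zeros_like(scores)
    step[k] = eps
    numeric[k] = (
        cross_entropy_from_scores(y_true_np, scores + step)
        - cross_entropy_from_scores(y_true_np, scores - step)
    ) / (2.0 * eps)

print("p - y              =", analytic)
print("finite differences =", numeric)
```

Unlike the hinge loss, this gradient is non-zero for every class as long as the prediction differs from the target.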