# **Hands-on Deep Learning – Gradient and AutoGrad with TensorFlow/Keras**

_This hands-on was created by Thomas Grenier (TensorFlow) and Fabien Millioz (PyTorch), CREATIS, [deepimaging2019](https://deepimaging2019.sciencesconf.org/)_

thomas.grenier@creatis.insa-lyon.fr

fabien.millioz@creatis.insa-lyon.fr


## <span style="color:brown"> A - Import common modules

In [1]:
import numpy
import tensorflow as tf

tf.__version__


'2.4.1'

> _**If no error occurs, your working environment is OK and you can move on to the next part;
> otherwise, call an assistant for help!**_

## <span style="color:brown"> B - Tensors of TensorFlow   

This section presents TensorFlow tensors and how they interact with NumPy arrays.


### <span style="color:brown"> B1- Tensorflow

TensorFlow 2 runs in eager mode by default: operations are evaluated as soon as they are declared and return concrete values.
As an example, here is a 2 by 2 random uniform matrix with values in the range [-1, 1):

In [2]:
a_tf = tf.random.uniform(shape=[2, 2], minval=-1.0, maxval=1.0, seed=42)
print(a_tf)

tf.Tensor(
[[0.9045429  0.35481548]
 [0.5906365  0.51156354]], shape=(2, 2), dtype=float32)



In [3]:
tensor = tf.multiply(a_tf, 10)
print(tensor)

tf.Tensor(
[[9.045429  3.5481548]
 [5.906365  5.1156354]], shape=(2, 2), dtype=float32)


Or more simply with

In [4]:
print(a_tf * 10)

tf.Tensor(
[[9.045429  3.5481548]
 [5.906365  5.1156354]], shape=(2, 2), dtype=float32)
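
As a small complement to the cells above, the sketch below checks that eager execution is active and inspects the basic attributes of a tensor. The variable name `t` is only illustrative and is not used elsewhere in this hands-on.

```python
import tensorflow as tf

# Eager execution is the default mode in TensorFlow 2
print("eager mode:", tf.executing_eagerly())

# A 2 x 2 tensor drawn uniformly from [-1, 1)
t = tf.random.uniform(shape=[2, 2], minval=-1.0, maxval=1.0)

# Shape and dtype are available immediately, without running a session
print("shape:", t.shape)
print("dtype:", t.dtype)

# Element-wise operations return new tensors with concrete values
print((t * 10).numpy())
```

Because the values are concrete, `t.numpy()` can be called right after the tensor is created.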


### <span style="color:brown"> B2- Link with numpy

We can do the same with NumPy arrays, and mix TensorFlow and NumPy operations:

In [5]:
import numpy as np

a_np = np.random.rand(2, 2)
print(a_np)

print("TensorFlow operations convert numpy arrays to Tensors automatically")
tensor = tf.multiply(a_np, 10)
print(tensor)

print("And NumPy operations convert Tensors to numpy arrays automatically")
print(np.add(tensor, 1))

print("The .numpy() method explicitly converts a Tensor to a numpy array")
print(tensor.numpy())

[[0.94490572 0.7810241 ]
 [0.09161492 0.4269503 ]]
TensorFlow operations convert numpy arrays to Tensors automatically
tf.Tensor(
[[9.44905717 7.810241  ]
 [0.91614919 4.26950304]], shape=(2, 2), dtype=float64)
And NumPy operations convert Tensors to numpy arrays automatically
[[10.44905717  8.810241  ]
 [ 1.91614919  5.26950304]]
The .numpy() method explicitly converts a Tensor to a numpy array
[[9.44905717 7.810241  ]
 [0.91614919 4.26950304]]
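
Note that the tensor built from the NumPy array above has dtype `float64`, whereas TensorFlow creates `float32` tensors by default. The sketch below shows one way to make the conversion and the cast explicit; the names `t64` and `t32` are illustrative only.

```python
import numpy as np
import tensorflow as tf

a_np = np.random.rand(2, 2)        # NumPy uses float64 by default
t64 = tf.convert_to_tensor(a_np)   # the conversion keeps the NumPy dtype
t32 = tf.cast(t64, tf.float32)     # explicit cast to TensorFlow's usual float32

print(t64.dtype, t32.dtype)

# Both tensors hold the same values, up to float32 rounding
print(np.allclose(t64.numpy(), t32.numpy(), atol=1e-6))
```

Casting explicitly avoids dtype mismatches when such a tensor is later combined with `float32` tensors.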


## <span style="color:brown"> C- Calculate a simple gradient
We can use automatic differentiation to compute the gradient of the function $ c(a) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} 3(a_{ij}+2)^2 $, where $a$ is an $n \times m$ matrix.

Let us write $ b_{ij}(a) = 3(a_{ij}+2)^2 $ for the element-wise term.

In [6]:
a = tf.random.uniform(shape=[2, 2])
print("a=", a.numpy())
b = 3 * (a + 2) ** 2
print("b=", b.numpy())

a= [[0.83705306 0.6542785 ]
 [0.80893373 0.13885796]]
b= [[24.14661  21.135582]
 [23.670326 13.724138]]


Then we define $ c(a) $ as the mean of $ b(a) $ over all entries.

In [7]:
c = tf.reduce_mean(b, name="c")
print("Tensorflow mean : ", c)
print("Numpy mean      : ", np.mean(b))

Tensorflow mean :  tf.Tensor(20.669163, shape=(), dtype=float32)
Numpy mean      :  20.669163


**Recall that the gradient of $c(a)$ is** $ \nabla c(a) = \frac{6(a+2)}{nm} $, which equals $ \frac{3(a+2)}{2} $ for the $2 \times 2$ matrices used here ($n = m = 2$).
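
As a brief reminder of where this expression comes from, differentiate $c$ with respect to a single entry $a_{kl}$. Only the $(k,l)$ term of the sum depends on it, so:

$$
\frac{\partial c}{\partial a_{kl}}
= \frac{1}{nm} \, \frac{\partial}{\partial a_{kl}} \, 3(a_{kl}+2)^2
= \frac{6(a_{kl}+2)}{nm}
\quad\Longrightarrow\quad
\frac{\partial c}{\partial a_{kl}} = \frac{3(a_{kl}+2)}{2} \ \text{ when } n = m = 2 .
$$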

Here we evaluate it automatically from $c$.

In [8]:
x = tf.random.uniform(shape=[2, 2])
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    b = 3 * (x + 2) ** 2
    y = tf.reduce_mean(b)
dy_dx = g.gradient(y, x)

print(" TF auto differentiation gradient \n ", dy_dx)
print(" \n Manually expressed gradient   \n ", 3 * (x + 2) / 2)

print(" \n \n by the way db_dx = ", g.gradient(b, x))

del g  # Drop the reference to the tape

 TF auto differentiation gradient 
  tf.Tensor(
[[4.1359673 3.4799776]
 [3.4273653 3.6203592]], shape=(2, 2), dtype=float32)
 
 Manually expressed gradient   
  tf.Tensor(
[[4.1359673 3.4799776]
 [3.4273653 3.6203594]], shape=(2, 2), dtype=float32)
 
 
 by the way db_dx =  tf.Tensor(
[[16.543869 13.91991 ]
 [13.709461 14.481437]], shape=(2, 2), dtype=float32)
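
The `db_dx` values above are four times larger than `dy_dx`. When the target of `g.gradient` is not a scalar, TensorFlow returns the gradient of the sum of its elements, which here is $6(x+2)$; the mean over 4 entries divides this by 4. The sketch below checks both relations on a fresh random tensor.

```python
import numpy as np
import tensorflow as tf

x = tf.random.uniform(shape=[2, 2])
with tf.GradientTape(persistent=True) as g:
    g.watch(x)                 # x is not a tf.Variable, so watch it explicitly
    b = 3 * (x + 2) ** 2       # element-wise term
    y = tf.reduce_mean(b)      # scalar mean over the 4 entries
dy_dx = g.gradient(y, x)       # gradient of the mean
db_dx = g.gradient(b, x)       # gradient of sum(b), since b is not a scalar
del g                          # drop the reference to the persistent tape

# db_dx should match the analytic derivative 6 * (x + 2) ...
print(np.allclose(db_dx.numpy(), 6 * (x.numpy() + 2), rtol=1e-5))
# ... and should be n * m = 4 times the gradient of the mean
print(np.allclose(db_dx.numpy(), 4 * dy_dx.numpy(), rtol=1e-5))
```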


## <span style="color:brown"> D- Gradient descent example

Here we minimize the cost function $L(y, \hat{y}) = (y - \hat{y})^2$ with respect to $w$ and $b$, the weight and bias of a single neuron.

The gradients $\frac{\partial L(y, h(x))}{\partial w}$ and $\frac{\partial L(y, h(x))}{\partial b}$ are computed automatically, with:
- $h(x) = \sigma(w.x + b)$
- $\sigma$ is the sigmoid activation function
- $L(y, \hat{y}) = (y - \hat{y})^2$ (quadratic error)
- $y = 0.2$
- $x = 1.5$
- $b = -2$
- $w = 3$

In [9]:
x = tf.constant(1.5)  # x = torch.tensor([1.5])
y = tf.constant(0.2)  # y = torch.tensor([0.2])
b = tf.Variable(-2.0, name="b")  # b = torch.tensor([-2.0], requires_grad=True)
w = tf.Variable(3.0, name="w")  # w = torch.tensor([3.0], requires_grad=True)

with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    h = tf.math.sigmoid(w * x + b)  # h = torch.sigmoid(w * x + b)
    error = (y - h) ** 2  # error = (y - h)**2

gradients = g.gradient(error, [b, w])  # error.backward()

print("h      = ", h.numpy())
print("grad b = ", gradients[0].numpy())
print("grad w = ", gradients[1].numpy())
del g

h      =  0.9241418
grad b =  0.10153006
grad w =  0.15229508
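
These values can be cross-checked by hand with the chain rule. With $z = w x + b$ and $h = \sigma(z)$, we have $\frac{\partial L}{\partial h} = -2(y - h)$ and $\frac{\partial h}{\partial z} = h(1 - h)$, so $\frac{\partial L}{\partial b} = -2(y - h)\,h(1 - h)$ and $\frac{\partial L}{\partial w} = x \, \frac{\partial L}{\partial b}$. The sketch below evaluates these expressions in plain Python; the intermediate names (`dL_dh`, `dh_dz`) are illustrative.

```python
import math

x, y = 1.5, 0.2
w, b = 3.0, -2.0

z = w * x + b
h = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

dL_dh = -2.0 * (y - h)           # derivative of (y - h)^2 with respect to h
dh_dz = h * (1.0 - h)            # derivative of the sigmoid with respect to z

grad_b = dL_dh * dh_dz           # dz/db = 1
grad_w = dL_dh * dh_dz * x       # dz/dw = x

print("h      = ", h)
print("grad b = ", grad_b)
print("grad w = ", grad_w)
```

The results should agree with the automatic gradients above up to float32 rounding.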


We minimize $L(y, h(x))$ iteratively.

Weights and bias are updated according to their gradients:

 - $ w \leftarrow w - \alpha \frac{\partial L(y, h(x))}{\partial w}$ 

 - $ b \leftarrow b - \alpha \frac{\partial L(y, h(x))}{\partial b}$ 

where $\alpha$ is the learning rate.

In [10]:
alpha = tf.constant(1.0)  # alpha = 1
nb_epochs = 20  # number of epochs
# the trailing comments give the corresponding PyTorch code (Fabien Millioz)
x = tf.constant(1.5)  # x = torch.tensor([1.5])
y = tf.constant(0.2)  # y = torch.tensor([0.2])
b = tf.Variable(-2.0, name="b")  # b = torch.tensor([-2.0], requires_grad=True)
w = tf.Variable(3.0, name="w")  # w = torch.tensor([3.0], requires_grad=True)

for i in range(nb_epochs):
    with tf.GradientTape(persistent=False) as g:
        h = tf.math.sigmoid(w * x + b)  # h = torch.sigmoid(w * x + b)
        error = (y - h) ** 2  # error = (y - h)**2
    gradients = g.gradient(error, [b, w])
    b.assign(b - alpha * gradients[0])
    w.assign(w - alpha * gradients[1])
    print(
        "Epoch {} error={:.05f} h={:.05f} w={:.05f} b={:.05f} dE_db={:.05f} dE_dw={:.05f} ".format(
            i + 1,
            error.numpy(),
            h.numpy(),
            w.numpy(),
            b.numpy(),
            gradients[0].numpy(),
            gradients[1].numpy(),
        )
    )

Epoch 1 error=0.52438 h=0.92414 w=2.84770 b=-2.10153 dE_db=0.10153 dE_dw=0.15230 
Epoch 2 error=0.48654 h=0.89753 w=2.65524 b=-2.22984 dE_db=0.12831 dE_dw=0.19246 
Epoch 3 error=0.42554 h=0.85233 w=2.40893 b=-2.39404 dE_db=0.16421 dE_dw=0.24631 
Epoch 4 error=0.32713 h=0.77195 w=2.10687 b=-2.59542 dE_db=0.20138 dE_dw=0.30206 
Epoch 5 error=0.19148 h=0.63758 w=1.80353 b=-2.79765 dE_db=0.20223 dE_dw=0.30334 
Epoch 6 error=0.07669 h=0.47693 w=1.59628 b=-2.93582 dE_db=0.13817 dE_dw=0.20726 
Epoch 7 error=0.02818 h=0.36786 w=1.47917 b=-3.01388 dE_db=0.07807 dE_dw=0.11710 
Epoch 8 error=0.01234 h=0.31107 w=1.40777 b=-3.06149 dE_db=0.04761 dE_dw=0.07141 
Epoch 9 error=0.00623 h=0.27892 w=1.36015 b=-3.09323 dE_db=0.03174 dE_dw=0.04762 
Epoch 10 error=0.00344 h=0.25865 w=1.32641 b=-3.11572 dE_db=0.02249 dE_dw=0.03374 
Epoch 11 error=0.00201 h=0.24488 w=1.30152 b=-3.13232 dE_db=0.01660 dE_dw=0.02490 
Epoch 12 error=0.00123 h=0.23504 w=1.28261 b=-3.14492 dE_db=0.01260 dE_dw=0.01890 
Epoch 13 erro

 > **Question:** Observe the influence of $\alpha$ for values of 0.01, 0.1, 10 and 100, and try to adapt the number of epochs accordingly.
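
To run this experiment conveniently, the training loop can be wrapped in a function. The sketch below is one possible way to do it; the helper `train` and its return value are illustrative choices, not part of the original hands-on.

```python
import tensorflow as tf


def train(alpha, nb_epochs):
    """Run gradient descent on the single neuron; return the error at each epoch."""
    x = tf.constant(1.5)
    y = tf.constant(0.2)
    b = tf.Variable(-2.0, name="b")
    w = tf.Variable(3.0, name="w")
    errors = []
    for _ in range(nb_epochs):
        with tf.GradientTape() as g:
            h = tf.math.sigmoid(w * x + b)
            error = (y - h) ** 2
        grad_b, grad_w = g.gradient(error, [b, w])
        b.assign_sub(alpha * grad_b)   # b <- b - alpha * dE/db
        w.assign_sub(alpha * grad_w)   # w <- w - alpha * dE/dw
        errors.append(float(error.numpy()))
    return errors


# Compare several learning rates with the same number of epochs
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    errors = train(alpha, nb_epochs=20)
    print("alpha={:<6} first error={:.5f} last error={:.5f}".format(alpha, errors[0], errors[-1]))
```

Changing `nb_epochs` for each value of `alpha` then shows how many iterations each learning rate needs.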
 
 

## <span style="color:brown"> E- Hinge Loss example

Here we study another loss function for classification: the multiclass hinge loss, also known as the 'SVM' loss.
In TensorFlow/Keras, this loss is available through the class tf.keras.losses.CategoricalHinge().

  > **Question:** Using the TensorFlow online documentation, give the expression of this loss.
    
  > **Question:** Give the TensorFlow code used to implement this function (explore the TensorFlow source code, and more specifically the categorical_hinge function).


In [11]:
y_true = tf.constant([[1.0, 0.0 , 0.0]], dtype=tf.float32)
y_pred_bSM = tf.constant([[.0, 1.0, 0.0]], dtype=tf.float32) #before SoftMax  : bSM
y_pred = tf.nn.softmax(y_pred_bSM)
print("y_pred", y_pred.numpy())
h = tf.keras.losses.CategoricalHinge()
#h=tf.keras.losses.CategoricalCrossentropy()
print("h =", h( y_true, y_pred ).numpy() )
print("y_true shape", y_true.shape)

y_pred [[0.21194157 0.5761169  0.21194157]]
h = 1.3641753
y_true shape (1, 3)


  > **Question:** What do the values of y_pred and y_true mean?
  
  > **Question:** Using the equations of the hinge loss, verify the previous results.
  
The loss value is informative, but its derivatives are required to use it within a gradient descent scheme. The following code automatically computes these derivatives with respect to y_pred (after softmax) and y_pred_bSM (before softmax).

In [12]:
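with tf.GradientTape() as g:
    # y_pred_bSM is a constant, so it must be watched explicitly
    g.watch(y_pred_bSM)
    # the softmax is recomputed inside the tape so that it is recorded
    y_pred = tf.nn.softmax(y_pred_bSM)
    h_graph = h(y_true, y_pred)  # <=> tf.keras.losses.categorical_hinge(y_true, y_pred)
# gradients with respect to the softmax output and to the raw scores
dh_dy = g.gradient(h_graph, (y_pred, y_pred_bSM))
print(dh_dy[0])
print(dh_dy[1])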

tf.Tensor([[-1.  1.  0.]], shape=(1, 3), dtype=float32)
tf.Tensor([[-0.28912547  0.36630934 -0.07718389]], shape=(1, 3), dtype=float32)
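
These numbers can be reproduced by hand. The NumPy sketch below assumes the categorical hinge loss has the form $\max(0, \text{neg} - \text{pos} + 1)$, where pos is the predicted score of the true class and neg is the largest predicted score among the other classes; this form is an assumption that should be checked against the TensorFlow documentation. The second gradient is obtained by propagating the first one through the softmax Jacobian.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the maximum for numerical stability
    return e / e.sum()


y_true = np.array([1.0, 0.0, 0.0])
z = np.array([0.0, 1.0, 0.0])   # scores before softmax (y_pred_bSM)
s = softmax(z)                   # scores after softmax (y_pred)

pos = np.sum(y_true * s)             # score of the true class
neg = np.max((1.0 - y_true) * s)     # largest score among the other classes
loss = max(0.0, neg - pos + 1.0)     # assumed form of the categorical hinge

# Gradient with respect to y_pred: -1 on the true class and +1 on the
# strongest wrong class, as long as the margin term is positive
grad_s = np.zeros_like(s)
if neg - pos + 1.0 > 0.0:
    grad_s -= y_true
    grad_s[np.argmax((1.0 - y_true) * s)] += 1.0

# Chain rule through the softmax: dL/dz_j = s_j * (g_j - sum_i g_i * s_i)
grad_z = s * (grad_s - np.dot(grad_s, s))

print("loss   =", loss)
print("grad_s =", grad_s)
print("grad_z =", grad_z)
```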


Other, more realistic tests with 9 classes:

In [13]:
y_true = tf.constant([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], dtype=tf.float32)
#y_pred_bSM = tf.constant([[0.0575, 0.0195, 0.4029, 0.0945, 0.0288, 0.0031, 0.0908, 0.0372, 0.2652]], dtype=tf.float32)
y_pred_bSM = tf.constant([[1.26908707619, 0.61077165603, -0.7032195329, 0.92837673425, -0.5457400083, -0.6242400407, 0.45880278945, 0.64756596088, -0.0037845533]], dtype=tf.float32)
y_pred = tf.nn.softmax(y_pred_bSM)
h_hinge = tf.keras.losses.CategoricalHinge()
h_cce = tf.keras.losses.CategoricalCrossentropy()
print("h_hinge =", h_hinge(y_true,y_pred).numpy() )
print("h_cce =", h_cce(y_true, y_pred).numpy() )
print("y_pred ", y_pred.numpy())

h_hinge = 0.926781
h_cce = 1.3720545
y_pred  [[0.25358546 0.13128695 0.03528275 0.18036644 0.04130046 0.03818236
  0.11277747 0.13620754 0.07101061]]


In [14]:
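with tf.GradientTape() as g:
    # y_pred_bSM is a constant, so it must be watched explicitly
    g.watch(y_pred_bSM)
    # the softmax is recomputed inside the tape so that it is recorded
    y_pred = tf.nn.softmax(y_pred_bSM)
    h_graph = h_hinge(y_true, y_pred)  # <=> tf.keras.losses.categorical_hinge(y_true, y_pred)
# gradients with respect to the softmax output and to the raw scores
dh_dy = g.gradient(h_graph, (y_pred, y_pred_bSM))
print(dh_dy[0])
print(dh_dy[1])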

tf.Tensor([[-1.  0.  0.  1.  0.  0.  0.  0.  0.]], shape=(1, 9), dtype=float32)
tf.Tensor(
[[-0.23501818  0.0096127   0.00258337  0.1935727   0.00302398  0.00279568
   0.00825746  0.00997298  0.00519933]], shape=(1, 9), dtype=float32)


In [15]:
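with tf.GradientTape() as g:
    g.watch(y_pred)  # y_pred is a plain tensor here, so watch it explicitly
    h_graph = h_cce(y_true, y_pred)
# gradient of the cross-entropy with respect to the softmax output
dh_dy = g.gradient(h_graph, y_pred)
print(dh_dy)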

tf.Tensor(
[[-2.9434438  1.         1.         1.         1.         1.
   1.         1.         1.       ]], shape=(1, 9), dtype=float32)
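
The entries equal to 1 on the wrong classes may look surprising. The printed values are consistent with the assumption that Keras rescales `y_pred` by its sum before taking the logarithm, i.e. $L = -\sum_k y_k \log\left(p_k / \sum_i p_i\right)$, which gives $\frac{\partial L}{\partial p_j} = -\frac{y_j}{p_j} + \frac{\sum_k y_k}{\sum_i p_i}$. The NumPy sketch below evaluates this expression under that assumption, using the softmax output printed earlier.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
# Softmax output copied from the y_pred values printed in the 9-class test
y_pred = np.array([0.25358546, 0.13128695, 0.03528275, 0.18036644, 0.04130046,
                   0.03818236, 0.11277747, 0.13620754, 0.07101061])

total = y_pred.sum()   # close to 1, since y_pred is a softmax output

# Cross-entropy with the assumed renormalisation by the sum
loss = -np.sum(y_true * np.log(y_pred / total))

# Gradient with respect to y_pred, from the expression given above
grad = -y_true / y_pred + y_true.sum() / total

print("loss =", loss)
print("grad =", grad)
```

This is the gradient with respect to the softmax output only; it does not include the softmax Jacobian.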
