<a href="https://colab.research.google.com/github/thalitadru/ml-class-epf/blob/main/TutorialAutoDiff.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic differentiation software (Auto Diff SW)
*Credits*: compiled from the TensorFlow and PyTorch tutorials:
- https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html
- https://www.tensorflow.org/guide/autodiff

In [101]:
import numpy as np

import tensorflow as tf
import torch


To differentiate, AutoDiff SW needs to keep a record of the order in which operations are applied to variables. The package then uses this record to compute the gradients of the recorded computations using [reverse mode differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation). This record takes the form of a **computational graph**.


For the entire chain of operations to be differentiable, each operator needs a *forward* implementation, which is called to compute the operation itself, along with a *backward* implementation, which is called to compute its derivative.


Using these operators, we only need to explicitly declare the **forward pass**, that is, the computations leading to the expression we want to differentiate (typically a cost function). AutoDiff SW will be able to follow operations backwards and compute gradients for the **backward pass**. 
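
As a rough illustration of the ideas above, here is a toy sketch of reverse mode differentiation on scalars in pure Python. It is not how TensorFlow or PyTorch are implemented: the `Var` class and its attributes are hypothetical names made up for this example, and only addition and multiplication are supported. Each operator computes its result (forward) and stores the local derivatives needed later (backward), so the graph is recorded as a side effect of the forward pass.

```python
class Var:
    """Scalar value that remembers how it was computed (a node in the graph)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, derivative of this node w.r.t. that parent)
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the nodes so that each one is processed before its parents
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Chain rule, applied from the output back to the inputs
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


a = Var(3.0)
out = a * a        # forward pass: out = a^2
out.backward()     # backward pass
print(a.grad)      # 6.0
```

Real packages follow the same pattern, but on tensors and with a much larger set of operators.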


## Example: derivative of y = x² with respect to x

### TensorFlow


TensorFlow provides the `tf.GradientTape` API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually `tf.Variable`s.
TensorFlow "records" relevant operations executed inside the context of a `tf.GradientTape` onto a "tape". 

Here is a simple example:

In [80]:
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
  y = x**2

Once you've recorded some operations, use `GradientTape.gradient(target, sources)` to calculate the gradient of some target (often a loss) relative to some source (often the model's variables):

In [81]:
# dy = 2x * dx
dy_dx = tape.gradient(y, x)
dy_dx

<tf.Tensor: shape=(), dtype=float32, numpy=6.0>

We can call `.numpy()` to convert the tensor into a NumPy value.

In [82]:
dy_dx.numpy()

6.0

### PyTorch
PyTorch has a built-in differentiation engine called `torch.autograd`. It supports automatic computation of gradients for any computational graph.

In this example, we want to compute the gradient of y with respect to x. To do that, we set the `requires_grad` attribute of the tensor holding the input variable x.

In [83]:
x = torch.tensor(3.0, requires_grad=True)
y = x**2

**Note**: You can set the value of `requires_grad` when creating a tensor, or later by calling the `x.requires_grad_(True)` method.
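
A minimal sketch of the second option, using separate variable names (`x_late`, `y_late`, chosen only for this example) so that the `x` and `y` defined above are left untouched:

```python
import torch

x_late = torch.tensor(3.0)        # requires_grad defaults to False
print(x_late.requires_grad)       # False

x_late.requires_grad_(True)       # in-place switch: x_late is now tracked
y_late = x_late**2
y_late.backward()
print(x_late.grad)                # tensor(6.)
```

Only operations executed after the switch are recorded, so `requires_grad_` has to be called before the forward pass.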

A function that we apply to tensors during the forward pass is in fact an object of class [`Function`](https://pytorch.org/docs/stable/autograd.html#function). This object knows how to compute the function in the *forward* direction, and also how to compute its derivative during the *backward* propagation step. A reference to the backward propagation function is stored in the `grad_fn` attribute of a tensor:

In [84]:
print(f"Gradient function for y = {y.grad_fn}")

Gradient function for y = <PowBackward0 object at 0x7fad16946dd0>
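
To make the forward/backward pairing concrete, here is a small sketch of a custom operator built on `torch.autograd.Function`. The class name `Square` and the variables `u` and `v` are made up for this example; built-in operators such as `**` already ship with their own backward implementation, so this is only needed for operations PyTorch does not provide.

```python
import torch


class Square(torch.autograd.Function):
    """y = x^2 with a hand-written derivative."""

    @staticmethod
    def forward(ctx, inp):
        # Save the input: the derivative 2x needs it during the backward pass
        ctx.save_for_backward(inp)
        return inp**2

    @staticmethod
    def backward(ctx, grad_output):
        (inp,) = ctx.saved_tensors
        # Chain rule: incoming gradient times the local derivative
        return grad_output * 2 * inp


u = torch.tensor(3.0, requires_grad=True)
v = Square.apply(u)
v.backward()
print(u.grad)  # tensor(6.)
```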


To compute derivatives of y, we call `y.backward()`:

In [85]:
# computes gradients of y with respect to every tensor it depends on
# that has requires_grad=True
y.backward()

Then we retrieve the value of $\partial y/ \partial x$ from `x.grad`:

In [86]:
# dy/dx
dy_dx = x.grad
dy_dx

tensor(6.)

As in TensorFlow, we can call `.numpy()` to convert the tensor into a NumPy array.

In [87]:
dy_dx.numpy()

array(6., dtype=float32)
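
Note that this works because `x.grad` is not itself tracked by autograd. Calling `.numpy()` directly on a tensor that requires gradients raises a `RuntimeError`; in that case the tensor has to be detached from the graph first. A short sketch, with variable names (`t`, `s`, `s_np`) chosen only for this example:

```python
import torch

t = torch.tensor(3.0, requires_grad=True)
s = t**2                     # s is part of the graph (s.requires_grad is True)

# s.numpy() would fail here, so we detach s from the graph first
s_np = s.detach().numpy()
print(s_np)                  # 9.0
```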

## Example: cost derivative for a layer of neurons

In this example we compute forward and backward passes for a single layer of neurons with softmax activation (in other words, we are applying multinomial logistic regression).

In this model, w and b are the parameters we need to optimize. Thus, we need to compute the gradients of the loss function with respect to those variables.

The following sections show how to implement this graph in TensorFlow and PyTorch. The forward operations are:
$$\mathrm{logits} = xw + b$$
$$\mathrm{loss} = \mathtt{CrossEntropyFromLogits}(\mathrm{logits}, y)$$

Here `CrossEntropyFromLogits` is a function that computes the cross-entropy between the target probabilities $y$ and the predicted $\mathrm{logits}$ (the values predicted prior to applying sigmoid or softmax).
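
For reference, the sketch below shows in NumPy what such a function computes for the softmax case: for each sample, $-\sum_k y_k \log \mathrm{softmax}(\mathrm{logits})_k$. The helper name `cross_entropy_from_logits` is hypothetical, and the built-in functions used in the following sections should be preferred in practice.

```python
import numpy as np


def cross_entropy_from_logits(logits, targets):
    """Per-sample cross-entropy between target probabilities and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Subtract the row-wise max before exponentiating, for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    # log(softmax(logits)), computed without forming the probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1)


# Equal logits for 3 classes mean a uniform prediction, so the loss is log(3)
demo_loss = cross_entropy_from_logits([[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]])
print(demo_loss)
```

Working from logits rather than probabilities avoids taking the logarithm of values that have underflowed to zero.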



Here is the forward computational graph for this chain of operations:



![computational graph](https://pytorch.org/tutorials/_images/comp-graph.png)

Here is some example data to test the code: 2 samples with 4 features each, and target probabilities for 3 classes:

In [88]:
x = [[-2., 2., 2., -2.], [ 2.1, 1., 1.5, -1.]] # input
y = [[1., 0., 0.], [0., 1., 0.]] # expected output

This implies that $w$ must have shape (4, 3) and $b$ must have shape (3,).
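
A quick NumPy check of these shapes, using zero-valued placeholder parameters (the `_demo` names are chosen only for this example, to avoid overwriting `x` and `y`):

```python
import numpy as np

x_demo = np.array([[-2., 2., 2., -2.], [2.1, 1., 1.5, -1.]])  # shape (2, 4)
w_demo = np.zeros((4, 3))   # one column of weights per class
b_demo = np.zeros(3)        # one bias per class, broadcast over the samples

logits_demo = x_demo @ w_demo + b_demo
print(logits_demo.shape)    # (2, 3)
```

The result has one row per sample and one column per class, matching the shape of the targets.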

### Tensorflow
The previous example uses scalars, but `tf.GradientTape` works just as easily on any tensor.

Here we use the built-in cost function [`tf.nn.softmax_cross_entropy_with_logits`](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits) which expects not the predicted probabilities, but the logits (prior to the application of softmax):

In [89]:
# parameters
w = tf.Variable(tf.random.normal((4, 3)), name='w')
b = tf.Variable(tf.zeros(3, dtype=tf.float32), name='b')

# Computations we want to track are done within the GradientTape context
with tf.GradientTape() as tape:
    # NOTE TensorFlow implicitly casts x and y to tf.Tensor
    logits = x @ w + b
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)


To get the gradient of `loss` with respect to both variables, you can pass both as sources to the `gradient` method. The tape is flexible about how sources are passed and will accept any nested combination of lists or dictionaries and return the gradient structured the same way (see `tf.nest`).

In [90]:
[dl_dw, dl_db] = tape.gradient(loss, [w, b])

The gradient with respect to each source has the shape of the source:

In [91]:
print(w.shape)
print(dl_dw.shape)

(4, 3)
(4, 3)


By default, a `GradientTape` can be used only once to compute gradients: the resources it holds are released as soon as `gradient` is called.

In [92]:
# This should raise an error
[dl_dw, dl_db] = tape.gradient(loss, [w, b])

RuntimeError: ignored

To be able to repeat the gradient computations, you need to set `persistent=True` when instantiating `GradientTape`:

In [94]:
# Need to recreate the forward computations, this time with a persistent gradient tape
with tf.GradientTape(persistent=True) as tape:
    # NOTE TensorFlow implicitly casts x and y to tf.Tensor
    logits = x @ w + b
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)

# Now gradients can be computed multiple times
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
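
A persistent tape is also useful for differentiating several targets recorded in the same context. The scalar sketch below illustrates this; the names `v`, `sq`, `cube`, and `demo_tape` are chosen only for this example. Since a persistent tape keeps its resources until it is garbage collected, it can be deleted explicitly once it is no longer needed.

```python
import tensorflow as tf

v = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as demo_tape:
    sq = v**2      # d(sq)/dv = 2v
    cube = v**3    # d(cube)/dv = 3v^2

# Two separate gradient calls on the same tape
d_sq = demo_tape.gradient(sq, v).numpy()
d_cube = demo_tape.gradient(cube, v).numpy()
print(d_sq, d_cube)

del demo_tape  # drop the reference so the tape's resources can be released
```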

Here is the gradient calculation again, this time passing a dictionary of variables. In this case the gradients are returned in a dictionary with the same keys as the input:

In [None]:
my_vars = {
    'w': w,
    'b': b
}

grad = tape.gradient(loss, my_vars)
grad['b']

### PyTorch
PyTorch also computes the backward pass for tensors of arbitrary size.
It is stricter about the types of inputs and targets, which must be converted to `torch.Tensor` explicitly before use.

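Here we use the built-in cost function [`torch.nn.functional.cross_entropy`](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html#torch.nn.functional.cross_entropy) which expects not the predicted probabilities, but the logits (prior to the application of softmax). Support for targets given as class probabilities, as used here, was added in PyTorch 1.10. Note that, by default, this function returns the mean loss over the samples, whereas `tf.nn.softmax_cross_entropy_with_logits` returns one loss value per sample.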

In [95]:
# parameters
w = torch.randn(4, 3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)

# NOTE in PyTorch we need to convert x and y to tensors explicitly
logits = torch.matmul(torch.tensor(x), w) + b

loss = torch.nn.functional.cross_entropy(logits, torch.tensor(y))

We call `loss.backward()` to compute the derivatives with respect to all tensors the loss depends on that have `requires_grad=True` (here w and b):

In [96]:
loss.backward()

Then gradients can be retrieved under `w.grad` and `b.grad`:

In [97]:
[dl_dw, dl_db] = w.grad, b.grad

As expected, the gradient with respect to each source has the shape of the source:

In [98]:
print(w.shape)
print(dl_dw.shape)

torch.Size([4, 3])
torch.Size([4, 3])


By default, we can call `backward` only once on a given graph: for performance reasons, PyTorch frees the intermediate values stored in the graph once the backward pass has run. If we need several backward calls on the same graph, we need to pass `retain_graph=True` to the `backward` call.

In [99]:
# this will raise an error
loss.backward()

RuntimeError: ignored

In [100]:
# need to recreate the forward graph and use retain_graph=True in the backward call
logits = torch.matmul(torch.tensor(x), w) + b

loss = torch.nn.functional.cross_entropy(logits, torch.tensor(y))

loss.backward(retain_graph=True)
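
One thing to keep in mind when calling `backward` more than once: PyTorch accumulates gradients, meaning each call adds its result to the values already stored in `.grad`. The scalar sketch below shows this; the names `p`, `q`, `first`, and `second` are chosen only for this example.

```python
import torch

p = torch.tensor(3.0, requires_grad=True)
q = p**2

q.backward(retain_graph=True)   # keep the graph so it can be traversed again
first = p.grad.clone()          # tensor(6.)

q.backward()                    # second backward pass over the same graph
second = p.grad.clone()         # tensor(12.): the new gradient was added

p.grad.zero_()                  # reset the stored gradient before reusing it
```

This is why `w.grad` and `b.grad` should be reset (or the optimizer's `zero_grad()` called) between backward passes whenever the accumulated value is not what is wanted.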