# 3. Calculus & Automatic Differentiation

In [1]:
import numpy as np
import pandas as pd

import torch

## Gradient

For an $n$-dimensional vector $\mathbf{x}=[x_1,x_2,\ldots,x_n]^\top$, the gradient of a scalar-valued function $f(\mathbf{x})$ with respect to $\mathbf{x}$ is the vector of its $n$ partial derivatives:

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top$$

The following rules apply:

- For all $\mathbf{A} \in \mathbb{R}^{m \times n}$, we have $\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top$
- For all $\mathbf{A} \in \mathbb{R}^{n \times m}$, we have $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}$
- For all $\mathbf{A} \in \mathbb{R}^{n \times n}$, we have $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$
- $\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$
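The quadratic-form rule can be checked numerically with autograd. The sketch below is only an illustration: the size `n`, the random matrix `A`, and the seed are arbitrary choices, not values from the text.

```python
import torch

torch.manual_seed(0)                       # arbitrary seed, only for reproducibility

n = 4                                      # hypothetical dimension
A = torch.randn(n, n)                      # arbitrary square matrix
x = torch.randn(n, requires_grad=True)

# f(x) = x^T A x is a scalar, so backward() needs no argument
f = x @ A @ x
f.backward()

# closed-form gradient from the rule above: (A + A^T) x
expected = (A + A.T) @ x.detach()
matches = torch.allclose(x.grad, expected, atol=1e-5)
print(matches)
```

Setting `A` to the identity matrix reduces this to the last rule, $\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = 2\mathbf{x}$.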

## Chain Rule

Consider a differentiable function $y$ of the variables $u_1, u_2, \ldots, u_m$, where each $u_i$ is itself a differentiable function of the variables $x_1, x_2, \ldots, x_n$.

For any $i = 1, 2, \ldots, n$, the chain rule gives:

$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}$$
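As a small worked check of this formula, the sketch below picks two hypothetical intermediate variables, $u_1 = x_1 + x_2$ and $u_2 = x_1 x_2$, with $y = u_1 u_2$, and compares the hand-computed chain rule against autograd. The functions and the input values are assumptions chosen purely for illustration.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

# hypothetical intermediate variables
u1 = x[0] + x[1]                 # u1 = x1 + x2
u2 = x[0] * x[1]                 # u2 = x1 * x2
y = u1 * u2

y.backward()                     # autograd applies the chain rule for us

# manual chain rule: dy/dx_i = dy/du1 * du1/dx_i + dy/du2 * du2/dx_i
x1, x2 = 2.0, 3.0
dy_du1 = x1 * x2                 # dy/du1 = u2
dy_du2 = x1 + x2                 # dy/du2 = u1
manual = torch.tensor([
    dy_du1 * 1.0 + dy_du2 * x2,  # i = 1: du1/dx1 = 1, du2/dx1 = x2
    dy_du1 * 1.0 + dy_du2 * x1,  # i = 2: du1/dx2 = 1, du2/dx2 = x1
])

print(x.grad, manual)
```

Both results agree with expanding $y = x_1^2 x_2 + x_1 x_2^2$ and differentiating directly.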

## Automatic Differentiation

A deep learning framework like PyTorch can perform automatic differentiation. That is, it builds a **computation graph** that keeps track of the operations and intermediate results needed for **backpropagation**.

For example, suppose we want the gradient of the **scalar function** $y=2\mathbf{x}^{\top}\mathbf{x}$ with respect to the vector $\mathbf{x}$:

In [9]:
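x = torch.arange(4.0, requires_grad=True)
# alternative: create x first, then call x.requires_grad_(True)

# x is 1-D, so no transpose is needed for the dot product
y = 2 * torch.dot(x, x)

# backpropagation from y
y.backward()

# gradient of y with respect to x
x.grad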

tensor([ 0.,  4.,  8., 12.])

The result equals $4\mathbf{x}$, which is the gradient of $y$ with respect to $\mathbf{x}$.

To compute the gradient of **another function**, we first need to zero out the stored gradients, because PyTorch adds new gradients to `x.grad` instead of overwriting it:

In [10]:
x.grad.zero_()         #zero out the gradients

y = x.sum()
y.backward()

x.grad

tensor([1., 1., 1., 1.])
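To see why the reset matters, the following sketch calls `backward()` twice without zeroing in between and then once more after `zero_()`. It uses a fresh `x` so that it runs on its own; the names `first`, `second`, and `third` are only illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

x.sum().backward()
first = x.grad.clone()       # gradient of sum(x): all ones

x.sum().backward()           # no zero_() in between
second = x.grad.clone()      # the new gradient was added to the old one

x.grad.zero_()               # reset the accumulated gradient
x.sum().backward()
third = x.grad.clone()       # back to all ones

print(first, second, third)
```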

When $y$ is **non-scalar** (for example, one value per sample), `backward()` cannot be called on it directly. Instead, we sum the entries of $y$ first, which gives the sum of the gradients of the individual entries:

In [13]:
x.grad.zero_()

y = x * x
y.sum().backward()     #equivalent to y.backward(torch.ones(len(x)))

x.grad

tensor([0., 2., 4., 6.])
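The sketch below illustrates the two equivalent ways of handling a non-scalar output, and shows that calling `backward()` on it without an argument fails. It uses a fresh `x`, and the variable names `raised`, `grad_vjp`, and `grad_sum` are illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# backward() without arguments only works for scalar outputs
y = x * x
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

# option 1: pass a vector of ones as the upstream gradient
y = x * x
y.backward(gradient=torch.ones_like(y))
grad_vjp = x.grad.clone()

# option 2: reduce to a scalar first, then backpropagate
x.grad.zero_()
y = x * x
y.sum().backward()
grad_sum = x.grad.clone()

print(raised, grad_vjp, grad_sum)
```

Passing a different `gradient` vector would weight each entry of $y$ differently before summing.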

Sometimes, we want to **detach** part of a computation from the graph.

Consider two functions $y=x*x$ and $z=y*x$. Suppose we want the gradient of $z$ with respect to $x$ while treating $y$ as a constant.

That is, we want $\frac{dz}{dx}=\frac{d}{dx}(y*x)=y$ (with $y$ as a constant) instead of $\frac{dz}{dx}=\frac{d}{dx}(x*x*x)=3x^2$. Without detaching, we get $3x^2$; after detaching, we get $y = x^2$:

In [23]:
x.grad.zero_()

y = x * x
z = y * x

z.sum().backward()
x.grad

tensor([ 0.,  3., 12., 27.])

In [24]:
x.grad.zero_()

y = x * x
u = y.detach()           #create a new variable u by detaching y
z = u * x

z.sum().backward()
x.grad

tensor([0., 1., 4., 9.])
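A quick way to confirm what `detach()` did is to inspect the tensors involved. The sketch below repeats the computation with a fresh `x` and checks that `u` carries no gradient history while `y` still does; the name `same` is illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

y = x * x
u = y.detach()               # same values as y, but cut off from the graph

print(y.requires_grad)       # y is still part of the graph
print(u.requires_grad)       # u is treated as a constant
print(u.grad_fn)             # u has no recorded operation

z = u * x
z.sum().backward()

# with u held constant, dz/dx equals u, which holds the values of x * x
same = torch.equal(x.grad, u)
print(same)
```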

Because only the copy `u` was detached, `y` still has its own computation graph, so we can still compute the gradient of $y$ with respect to $x$:

In [26]:
x.grad.zero_()

y.sum().backward()

x.grad

tensor([0., 2., 4., 6.])