**Differentiability**

Before we talk about torch autograd, it is worth recalling what differentiability means for a function of several variables.

A function $f:\mathbb{R}^d \rightarrow \mathbb{R}$ is said to be differentiable at a point $x \in \mathbb{R}^d$ if there exists a linear function $D: \mathbb{R}^d \rightarrow \mathbb{R}$ with the property that

$$
\lim_{h\rightarrow 0} {f(x+h) - f(x) - D(h) \over \parallel h \parallel}= 0
$$

We call the function $D$ the derivative of $f$ at $x.$

When $f$ is differentiable at $x$ it is directionally differentiable, i.e. the limit

$$
\lim_{t\rightarrow 0}  {f(x+tv) - f(x) \over t}
$$

exists for any nonzero vector $v.$ Taking $v = e_i,$ the unit vector $(0,0,\ldots,0,1,0,\ldots,0)$ with a 1 in position $i,$ we obtain the partial derivative

$$
\lim_{t\rightarrow 0}  {f(x+te_i) - f(x) \over t} = {\partial f \over \partial x_i}, 
$$

and the linear function $D$ is given by

$$
D(h) = \sum_{i=1}^d {\partial f \over \partial x_i} h_i
$$
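The definition can be checked numerically. The sketch below uses a hypothetical smooth test function (not one from the text), builds $D(h)$ from hand-computed partial derivatives, and prints the quotient from the definition for shrinking $h.$ If $f$ is differentiable at $x,$ the quotient should tend to zero.

```python
import math

def f(x):
    # Hypothetical smooth test function f(x1, x2) = x1^2 * x2 + sin(x2)
    return x[0] ** 2 * x[1] + math.sin(x[1])

def grad_f(x):
    # Partial derivatives of the test function, computed by hand
    return [2 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]

def D(x, h):
    # The linear map D(h) = sum_i (df/dx_i) * h_i
    return sum(g * hi for g, hi in zip(grad_f(x), h))

x = [1.0, 2.0]
direction = [0.6, -0.8]  # unit vector, so ||h|| equals the scale s below

ratios = []
for s in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = [s * d for d in direction]
    x_plus_h = [xi + hi for xi, hi in zip(x, h)]
    norm_h = math.sqrt(sum(hi ** 2 for hi in h))
    ratio = abs(f(x_plus_h) - f(x) - D(x, h)) / norm_h
    ratios.append(ratio)
    print(f"||h|| = {norm_h:.0e}   |f(x+h) - f(x) - D(h)| / ||h|| = {ratio:.2e}")
```

For a twice-differentiable function the quotient shrinks roughly in proportion to $\parallel h \parallel,$ which is what the printed values should show.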

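Importantly, directional differentiability of a function does not guarantee its differentiability. In order to be differentiable at a point, a function that has directional derivatives must, at the very least, have directional derivatives that depend linearly on the direction $v.$ A convenient sufficient condition is that the partial derivatives exist near the point and are continuous there.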

A counterexample is given by the function

$$
f(x,y) = {yx^2 \over x^2+y^2}, \qquad f(0,0) = 0.
$$

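For $(x,y) \neq (0,0)$ this function has partial derivative with respect to $x$ given by

$$
{\partial f \over \partial x} = {2xy^3 \over (x^2+y^2)^2},
$$

and this partial derivative is not continuous at $(0,0).$ Indeed, its limit along a straight-line path approaching $(0,0)$ depends on which path you take.

Consider a ray through the origin in the $\theta$ direction. Taking $x = r \cos(\theta)$ and $y = r\sin(\theta)$ for fixed $\theta$ and letting $r \rightarrow 0,$ we obtain

$$
{2xy^3 \over (x^2+y^2)^2}\biggr\rvert_{x = r\cos(\theta),y= r\sin(\theta)}
= {2 r^4 \cos(\theta) \sin^3(\theta) \over r^4}
= 2 \cos(\theta) \sin^3(\theta),
$$

which depends on $\theta.$

With $f(0,0) = 0,$ the function itself is not differentiable at the origin: both partial derivatives vanish there, so $D$ would have to be the zero map, yet the directional derivative in the direction $v = (1,1)$ is $1/2.$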
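Both claims about the counterexample can be checked numerically. The sketch below takes $f(0,0) = 0$ and uses difference quotients with hand-picked step sizes, which are assumptions of this example. It first compares the directional derivative at the origin along $(1,1)$ with what a linear $D$ built from the two partial derivatives would predict, then approximates $\partial f / \partial x$ at points on several rays close to the origin.

```python
import math

def f(x, y):
    # Counterexample from the text, with f(0, 0) = 0 (assumed here so that
    # f is defined at the origin).
    if x == 0.0 and y == 0.0:
        return 0.0
    return y * x ** 2 / (x ** 2 + y ** 2)

def directional_derivative_at_origin(v, t=1e-6):
    # Difference quotient (f(t v) - f(0)) / t for a small step t.
    return (f(t * v[0], t * v[1]) - f(0.0, 0.0)) / t

def df_dx(x, y, eps=1e-9):
    # Central-difference approximation to the partial derivative in x.
    return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

d_e1 = directional_derivative_at_origin((1.0, 0.0))
d_e2 = directional_derivative_at_origin((0.0, 1.0))
d_v = directional_derivative_at_origin((1.0, 1.0))
linear_prediction = d_e1 * 1.0 + d_e2 * 1.0

print("partials at origin:", d_e1, d_e2)
print("directional derivative along (1, 1):", d_v)
print("prediction from a linear D:", linear_prediction)

r = 1e-3  # distance from the origin along each ray
ray_values = []
for theta in (math.pi / 6, math.pi / 4, math.pi / 3):
    numeric = df_dx(r * math.cos(theta), r * math.sin(theta))
    formula = 2 * math.cos(theta) * math.sin(theta) ** 3
    ray_values.append((numeric, formula))
    print(f"theta = {theta:.4f}   numeric = {numeric:.6f}   formula = {formula:.6f}")
```

The directional derivative along $(1,1)$ comes out near $1/2$ while the linear prediction is $0,$ and the value of the partial derivative near the origin changes from ray to ray.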




**Torch autograd**

A key feature of PyTorch is autograd: the ability to record the calculations performed on a tensor and to generate gradients automatically in code.

Gradients are useful whenever we want to optimize some function, such as a loss function when we fit a statistical model.

Here we create a tensor x and tell PyTorch to store gradient information when we create tensors that are functions of x. Ultimately, we compute the gradient of a scalar function of x.

Let's start with a simple case of a dot product with a tensor.

In [1]:
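import torch

# x tracks gradients; y is a constant tensor
x = torch.tensor([0., 1., 2., 3.], requires_grad=True)
y = torch.tensor([2., 3., 5., 7.])
z = torch.dot(x, y)   # scalar z = sum_j x_j * y_j
z.backward()          # fills x.grad with dz/dx
print(x.grad)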

tensor([2., 3., 5., 7.])
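The printed gradient equals y, which is what a hand calculation gives. Writing the dot product out in components,

$$
z = \sum_{j} x_j y_j
\quad\Longrightarrow\quad
{\partial z \over \partial x_j} = y_j,
$$

so the gradient of $z$ with respect to $x$ is the vector $y$ itself.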


More complicated operations work as well. We just need to make sure that every operation is one that torch knows how to differentiate.

Here we create w as a function of x and compute the gradient of w with respect to x, which is a tensor of the partial derivatives with respect to the components of x,

$$
{\partial w \over \partial x_j}(x).
$$

In [3]:
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
z = torch.sum(torch.sin(x))   # scalar
u = torch.log(1 + z)
w = torch.exp(-u)             # equals 1 / (1 + z)

w.backward()                  # also frees the graph's saved intermediate values
print(x.grad)

tensor([-0.0646,  0.0498,  0.1184])
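This result can be compared with a hand calculation. Since $w = \exp(-\log(1+z)) = 1/(1+z)$ with $z = \sum_j \sin(x_j),$ the chain rule gives $\partial w / \partial x_j = -\cos(x_j)/(1+z)^2.$ The sketch below rebuilds the same w and checks the autograd gradient against this closed form; the name `expected` is introduced here for illustration.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
z = torch.sum(torch.sin(x))
w = torch.exp(-torch.log(1 + z))   # same w as above, equal to 1 / (1 + z)
w.backward()

# Closed form from the chain rule: dw/dx_j = -cos(x_j) / (1 + z)^2
with torch.no_grad():
    expected = -torch.cos(x) / (1 + z) ** 2

print(x.grad)
print(expected)
print(torch.allclose(x.grad, expected))
```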


What if we now try to call `backward()` on u, which belongs to the same graph?

In [4]:
u.backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

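Following the error message, we pass `retain_graph=True` so that the saved intermediate values survive the first call to `backward()`. This example also uses a different function of x.

In [5]:
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x[0]**1 + x[1]**2 + x[2]**3
z = torch.sin(y)
u = torch.log(1 + z)
w = torch.exp(-u)
w.backward(retain_graph=True)
print(x.grad)   # gradient of w

u.backward(retain_graph=True)
print(x.grad)   # gradient of w plus gradient of u: x.grad accumulates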

tensor([-0.3466, -1.3864, -9.3580])
tensor([0.1911, 0.7645, 5.1603])
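Note that `backward()` adds into `x.grad` rather than overwriting it, so the second tensor printed above is the sum of the gradients of w and u, not the gradient of u alone. A minimal sketch of how to separate the two is to reset the stored gradient between calls with `x.grad.zero_()`. The names `grad_w` and `grad_u` are introduced here for illustration.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x[0]**1 + x[1]**2 + x[2]**3
z = torch.sin(y)
u = torch.log(1 + z)
w = torch.exp(-u)

w.backward(retain_graph=True)
grad_w = x.grad.clone()   # copy, because x.grad is modified in place below

x.grad.zero_()            # reset the accumulated gradient
u.backward()              # last use of the graph, so no retain_graph needed
grad_u = x.grad.clone()

print(grad_w)             # gradient of w alone
print(grad_u)             # gradient of u alone
print(grad_w + grad_u)    # matches the second tensor printed above
```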


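Gradients can also be taken with respect to several tensors at once. Here both x and y require gradients, and a single call to `backward()` fills in both.

In [6]:
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([5., 7., 6.], requires_grad=True)
z = torch.cos(x) * torch.sin(y)   # elementwise product
u = torch.sum(z)                  # scalar
u.backward(retain_graph=True)
print(x.grad)   # du/dx
print(y.grad)   # du/dy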


tensor([ 0.8069, -0.5974,  0.0394])
tensor([ 0.1533, -0.3137, -0.9506])
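These values agree with the closed forms $\partial u / \partial x_j = -\sin(x_j)\sin(y_j)$ and $\partial u / \partial y_j = \cos(x_j)\cos(y_j).$ The sketch below checks this using `torch.autograd.grad`, which is a different entry point from the `backward()` call used above: it returns the gradients directly instead of accumulating them into `.grad`. The names `expected_x` and `expected_y` are introduced here for illustration.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([5., 7., 6.], requires_grad=True)
u = torch.sum(torch.cos(x) * torch.sin(y))

# Returns one gradient per input; x.grad and y.grad are left untouched.
grad_x, grad_y = torch.autograd.grad(u, (x, y))

# Closed forms obtained by differentiating cos(x_j) * sin(y_j) by hand
with torch.no_grad():
    expected_x = -torch.sin(x) * torch.sin(y)
    expected_y = torch.cos(x) * torch.cos(y)

print(grad_x, torch.allclose(grad_x, expected_x))
print(grad_y, torch.allclose(grad_y, expected_y))
```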
