# Automatic differentiation with ``autograd`` 


In machine learning, we want models to get better and better as a function of experience. Usually, *getting better* means minimizing a *loss function*, a measure of how *bad* our model is at any time. With neural networks, that loss is usually differentiable, i.e. for each of the model's parameters, we can determine how much increasing or decreasing it would affect the loss. For complex models, working out these derivatives by hand can be a pain.

_MXNet_'s ``autograd`` package eliminates this tedious work by automatically calculating derivatives for you. Some other libraries require you to pre-define and compile symbolic graphs in order to access automatic derivatives. ``autograd``, much like the similar package in PyTorch, allows you to take derivatives even when running fully imperative code.

Essentially, every time you make a pass through your model, ``autograd`` builds a graph on the fly, through which it can immediately backpropagate gradients.
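To make the idea of a graph built on the fly more concrete, here is a toy reverse-mode differentiator for scalars in plain Python. It is only a sketch: the ``Var`` class and everything in it are hypothetical names invented for illustration, and it says nothing about how _MXNet_ implements ``autograd`` internally.

```python
# Toy reverse-mode autodiff for scalars. All names here are hypothetical;
# this is NOT how MXNet implements autograd internally.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Var(self.value + other.value,
                   parents=((self, 1.0), (other, 1.0)))

    def backward(self, head_gradient=1.0):
        # Collect nodes in topological order (parents before children).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = head_gradient
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


x = Var(3.0)
y = x * 2      # a graph node is created as this line runs
f = y * x      # f = 2 * x**2
f.backward()
print(x.grad)  # 12.0, i.e. 4 * x at x = 3
```

Each arithmetic operation records its inputs and local derivatives as it executes, so ordinary Python code is enough to define the graph.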

Let's go through it step by step. For this tutorial, we'll only need to import ``mxnet.ndarray`` and ``mxnet.autograd``.

In [1]:
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import autograd

## Attaching gradients

As a toy example, let's say that we are interested in differentiating a function ``f = 2 * (x ** 2)`` with respect to the parameter ``x``. We can start by assigning an initial value to ``x``.

In [None]:
x = mx.nd.array([[1, 2], [3, 4]])

Once we compute the gradient of ``f`` with respect to ``x``, we'll need a place to store it. In _MXNet_, we can tell an NDArray that we plan to store a gradient by invoking its ``attach_grad()`` method.

In [None]:
x.attach_grad()

Now we're going to define our function ``f`` and *MXNet* will generate a computation graph on the fly. It's as if *MXNet* turned on a recording device and captured the exact path by which each variable was generated.

Note that building the computation graph requires a nontrivial amount of computation, so *MXNet* only builds the graph when explicitly told to do so. We can instruct *MXNet* to start recording by placing code inside a ``with autograd.record():`` block.

In [None]:
with autograd.record():
    y = x * 2
    f = y * x

Let's backprop with ``f.backward()``. When ``f`` has more than one entry, ``f.backward()`` is equivalent to ``mx.nd.sum(f).backward()``.

In [None]:
f.backward()

Now, let's see if this is the expected output. Remember that ``y = x * 2`` and ``f = y * x``, so ``f`` should be equal to ``2 * x * x``. After doing backprop with ``f.backward()``, we expect the gradient df/dx to be ``4 * x``: by the product rule, df/dx = dy/dx * x + y = ``2 * x + 2 * x``. So, if everything went according to plan, ``x.grad`` should be an NDArray with the values ``[[4, 8], [12, 16]]``.

In [None]:
print(x.grad)
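As an independent sanity check that does not rely on *MXNet*, the analytic gradient ``4 * x`` can be compared against a central finite difference in plain Python. This is only a sketch; the helper names ``f_scalar`` and ``numerical_grad`` are made up for this example.

```python
def f_scalar(x):
    # Same function as in the tutorial, applied to a single number
    return 2 * x ** 2

def numerical_grad(func, x, eps=1e-6):
    # Central difference approximation of d func / d x
    return (func(x + eps) - func(x - eps)) / (2 * eps)

xs = [1.0, 2.0, 3.0, 4.0]                     # entries of x, flattened
approx = [numerical_grad(f_scalar, v) for v in xs]
exact = [4 * v for v in xs]                   # analytic gradient 4 * x

for v, a, e in zip(xs, approx, exact):
    print(v, round(a, 4), e)
```

The numerical and analytic columns should agree up to rounding, matching the values expected in ``x.grad``.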

## Head gradients and the chain rule

*Warning: This part is tricky, but not necessary for understanding subsequent sections.*

Sometimes when we call the backward method on an NDArray, e.g. ``y.backward()``, where ``y`` is a function of ``x``, we are just interested in the derivative of ``y`` with respect to ``x``. At other times, we may be interested in the gradient of ``z`` with respect to ``x``, where ``z`` is a function of ``y``. Recall that, by the chain rule, dz/dx can be expressed in terms of dz/dy and dy/dx. So, when ``y`` is part of a larger function ``z``, and we want ``x.grad`` to store dz/dx, we can pass in the *head gradient* dz/dy as an input to ``backward()``. The default argument is ``nd.ones_like(y)``.

In [None]:
with autograd.record():
    y = x * 2
    f = y * x

# Head gradient dz/df, one entry per element of f
head_gradient = nd.array([[10, 1.], [.1, .01]])
f.backward(head_gradient)
print(x.grad)
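The cell above does not show its output, so here is a way to work out what to expect using plain Python lists. Because every operation in ``f`` is elementwise, the chain rule reduces to multiplying the head gradient by df/dx = ``4 * x`` entry by entry. This sketch reuses the values from the cells above and is not *MXNet* code.

```python
x_vals = [[1.0, 2.0], [3.0, 4.0]]
head = [[10.0, 1.0], [0.1, 0.01]]   # head gradient passed to f.backward()

# Chain rule, elementwise: dz/dx = head * df/dx, with df/dx = 4 * x
expected = [[h * 4 * v for h, v in zip(h_row, x_row)]
            for h_row, x_row in zip(head, x_vals)]
print(expected)
```

Up to floating-point rounding, this gives ``[[40, 8], [1.2, 0.16]]``, which is what ``x.grad`` should contain after the call.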

Now that we know the basics, we can do some wild things with ``autograd``, including building differentiable functions using Pythonic control flow.

In [24]:
a = nd.random_normal(shape=3)
print(a)
print(nd.norm(a))
a.attach_grad()

with autograd.record():
    b = a * 2
    while (nd.norm(b) < 1000).asscalar():
        b = b * 2

    if (mx.nd.sum(b) > 0).asscalar():
        c = b
    else:
        c = 100 * b

[-0.97111022  0.05383795 -0.77569664]
<NDArray 3 @cpu(0)>
[ 1.24404943]
<NDArray 1 @cpu(0)>


In [25]:
head_gradient = nd.array([0.01,1.0,.1])
c.backward(head_gradient)

In [26]:
print(a.grad)

[   1024.  102400.   10240.]
<NDArray 3 @cpu(0)>
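To see where these numbers come from, note that every branch only multiplies ``a`` by a constant, so ``c`` equals ``a`` times some scale factor and the gradient is that factor times the head gradient. The following plain-Python sketch replays the control flow using the values of ``a`` printed above; it is an illustration, not *MXNet* code.

```python
import math

a_vals = [-0.97111022, 0.05383795, -0.77569664]   # values of a printed above
head = [0.01, 1.0, 0.1]                           # head gradient passed to c.backward()

def norm(v):
    # Euclidean norm, standing in for nd.norm
    return math.sqrt(sum(t * t for t in v))

scale = 2.0                                       # b = a * 2
while norm([scale * t for t in a_vals]) < 1000:
    scale *= 2                                    # b = b * 2

if sum(scale * t for t in a_vals) > 0:
    pass                                          # c = b
else:
    scale *= 100                                  # c = 100 * b

# c = scale * a, so dc/da = scale for every entry
grad = [scale * h for h in head]
print(scale, grad)
```

For this particular ``a``, the loop stops at a factor of 1024 and the ``else`` branch multiplies by 100, giving a scale of 102400, which is consistent with the printed ``a.grad``. A different random draw of ``a`` could take a different path and produce a different gradient.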
