<a href="https://colab.research.google.com/github/sum-coderepo/DeepLearning-Pytorch/blob/master/PytorchTutorials/Loss_backward.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Training a NN happens in two steps:**

**Forward Propagation:** In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

**Backward Propagation:** In backprop, the NN adjusts its parameters in proportion to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent.

When you call **loss.backward()**, all it does is compute the gradient of the loss w.r.t. every parameter that took part in computing the loss and has **requires_grad = True**, and store it in the **parameter.grad** attribute of that parameter.

**optimizer.step()** then updates all the parameters based on **parameter.grad**.
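The two calls can be seen side by side in a minimal sketch. The toy model, the random data, and the learning rate below are assumptions made up for illustration; they are not part of the original notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model and data, used only for illustration.
model = nn.Linear(3, 1)
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)
criterion = nn.MSELoss()
lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Before backward(), no gradients have been stored yet.
grads_before = [p.grad for p in model.parameters()]

loss = criterion(model(inputs), targets)
loss.backward()  # fills p.grad for every parameter with requires_grad=True

# What plain SGD should produce for the weight: w - lr * w.grad
weight_before = model.weight.detach().clone()
expected = weight_before - lr * model.weight.grad

optimizer.step()  # reads p.grad and updates each parameter in place

print(grads_before)                                      # both entries are None
print(torch.allclose(model.weight.detach(), expected))   # update matches the SGD rule
```

Note that `backward()` only stores gradients; the parameter values change only when `optimizer.step()` runs.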



At first glance, the loss is computed from two tensors that have no relation to a network. So how does **loss.backward()** know which network it needs to reference when it computes **parameter.grad**?

Let's say we defined a model, `model`, and a loss function, `criterion`, and we have the following sequence of steps:
```
pred = model(input)
loss = criterion(pred, true_labels)
loss.backward()
```





`pred` has a `grad_fn` attribute that references the function that created it and ties it back to the model's parameters. `loss` in turn has a `grad_fn` that leads back to `pred`. Therefore, **loss.backward()** has all the information it needs about the model it is working with.
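A small sketch makes the link visible. The `nn.Linear` model and the random tensors are hypothetical stand-ins for `model`, `input`, and `true_labels`. The exact names of the backward nodes vary between PyTorch versions, so the code only checks whether a `grad_fn` is present.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the model, input and labels in the text.
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
x = torch.randn(4, 2)
true_labels = torch.randn(4, 1)

pred = model(x)
loss = criterion(pred, true_labels)

# Plain data tensors were not produced by a tracked operation.
print(x.grad_fn)             # None
print(true_labels.grad_fn)   # None

# pred and loss were produced by tracked operations, so each one
# carries a backward node that leads back towards the parameters.
print(pred.grad_fn is not None)   # True
print(loss.grad_fn is not None)   # True

# The parameters themselves are the leaves of the graph.
print(model.weight.is_leaf, model.weight.requires_grad)
```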


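Try removing the `grad_fn` attribute, for example with:
`pred = pred.clone().detach()`

The loss is then no longer connected to the model. If nothing else in the loss requires gradients, **loss.backward()** raises a `RuntimeError`. Either way, the model gradients stay None and consequently the weights do not get updated.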
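The experiment can be sketched as follows, again with a hypothetical `nn.Linear` model and random data. In this setup the detached loss no longer requires gradients at all, so `backward()` raises a `RuntimeError` and the parameter gradients remain `None`.

```python
import torch
from torch import nn

# Hypothetical toy setup for illustration.
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
x = torch.randn(4, 2)
true_labels = torch.randn(4, 1)

pred = model(x).clone().detach()   # cuts the link back to the model
loss = criterion(pred, true_labels)

print(pred.grad_fn)         # None
print(loss.requires_grad)   # False

try:
    loss.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("backward() failed:", err)

print(model.weight.grad)    # None: nothing reached the parameters
```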

The optimizer, in turn, is tied to the model because we pass `model.parameters()` when we create the optimizer.
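One way to check this tie is to compare object identity: the optimizer keeps references to the very same tensors the model owns, not copies. The sketch below assumes a hypothetical `nn.Linear` model and reads the tensors from the optimizer's `param_groups`.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)   # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Collect every tensor the optimizer was given.
optimizer_params = [p for group in optimizer.param_groups for p in group["params"]]

# Each one is the same Python object as the corresponding model parameter.
same_objects = all(p is q for p, q in zip(model.parameters(), optimizer_params))
print(len(optimizer_params), same_objects)
```

Because they are the same objects, the `.grad` values written by `loss.backward()` are exactly what `optimizer.step()` reads.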

In [None]:
import torch
x = torch.tensor([1., 2.], requires_grad=True)
y = 100*x

In [None]:
print(x,y)

tensor([1., 2.], requires_grad=True) tensor([100., 200.], grad_fn=<MulBackward0>)


In [None]:
loss = y.sum()

In [None]:
# Compute gradients of the loss w.r.t. the parameters
print(x.grad)     # None
loss.backward()      
print(x.grad)     # tensor([100., 100.])


None
tensor([100., 100.])


In [None]:
# Modify the parameters by subtracting lr * gradient
optim = torch.optim.SGD([x], lr=0.001)
print(x)        # tensor([1., 2.], requires_grad=True)
optim.step()
print(x)        # tensor([0.9000, 1.9000], requires_grad=True)

tensor([1., 2.], requires_grad=True)
tensor([0.9000, 1.9000], requires_grad=True)


`loss.backward()` sets the `.grad` attribute of all leaf tensors with `requires_grad=True` in the computational graph of which loss is the root (only x in this case).

The optimizer just iterates through the list of parameters (tensors) it received on initialization and, for every tensor that has a gradient stored in its **.grad** property, subtracts that gradient (simply multiplied by the learning rate in the case of plain SGD). It doesn't need to know with respect to which loss the gradients were computed; it just reads the .grad property so it can do `x = x - lr * x.grad`.
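The same update can be written by hand. This is a sketch of the plain SGD rule only (no momentum or weight decay), reusing the values from the cells above on a fresh copy of `x`.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
loss = (100 * x).sum()
loss.backward()            # x.grad is tensor([100., 100.])

lr = 0.001
# In-place updates of a leaf tensor must not be tracked by autograd.
with torch.no_grad():
    x -= lr * x.grad       # x = x - lr * x.grad

print(x)   # tensor([0.9000, 1.9000], requires_grad=True)
```

The result matches what `optim.step()` produced above.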

Note that if we were doing this in a training loop, we would call `optim.zero_grad()` before each backward pass, because in each training step we want to compute new gradients and don't care about the gradients from the previous batch. Not zeroing the grads would lead to gradient accumulation across batches, since `backward()` adds to whatever is already stored in `.grad`.
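A short sketch of the accumulation effect, using the same toy tensor as above. Calling `backward()` twice without zeroing doubles the stored gradient, while zeroing first gives a fresh one.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
optim = torch.optim.SGD([x], lr=0.001)

(100 * x).sum().backward()
first = x.grad.clone()         # tensor([100., 100.])

(100 * x).sum().backward()     # no zero_grad(): the new gradient is added
accumulated = x.grad.clone()   # tensor([200., 200.])

optim.zero_grad()              # discard the stored gradients
(100 * x).sum().backward()
fresh = x.grad.clone()         # tensor([100., 100.])

print(first, accumulated, fresh)
```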

***PyTorch Example***

In [27]:
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [28]:
Q = 3*a**3 - b**2

When we call `.backward()` on Q, autograd calculates the gradients of Q with respect to a and b, $\frac{\partial Q}{\partial a} = 9a^2$ and $\frac{\partial Q}{\partial b} = -2b$, and stores them in the respective tensors' `.grad` attribute.

We need to explicitly pass a `gradient` argument in `Q.backward()` because Q is a vector, not a scalar. `gradient` is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e.

$\frac{dQ}{dQ} = 1$

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like `Q.sum().backward()`.
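The equivalence can be checked with a short sketch. It uses fresh copies of `a` and `b`, and rebuilds Q before the second call because a graph can only be backpropagated once by default.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

# Variant 1: explicit gradient argument of ones.
Q = 3 * a**3 - b**2
Q.backward(gradient=torch.ones_like(Q))
grad_a_explicit, grad_b_explicit = a.grad.clone(), b.grad.clone()

# Reset the stored gradients, then rebuild the graph.
a.grad = None
b.grad = None

# Variant 2: reduce to a scalar and call backward() without arguments.
Q = 3 * a**3 - b**2
Q.sum().backward()

print(torch.allclose(a.grad, grad_a_explicit))   # True
print(torch.allclose(b.grad, grad_b_explicit))   # True
```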

In [29]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now deposited in `a.grad` and `b.grad`.

In [30]:
a.grad, 9*a**2

(tensor([36., 81.]), tensor([36., 81.], grad_fn=<MulBackward0>))

In [33]:
-2*b, b.grad

(tensor([-12.,  -8.], grad_fn=<MulBackward0>), tensor([-12.,  -8.]))

In [32]:
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
