### Gradients in PyTorch

In [2]:
import torch 
import numpy as np 

#### Gradients  
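PyTorch can compute the gradients of tensors (variables, weights, parameters) automatically. Setting `requires_grad=True` on a tensor tells PyTorch to track the operations applied to it; calling `loss.backward()` then computes the gradients, which gradient descent later uses to adjust the parameters.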

In [3]:
x = torch.ones(6, requires_grad=True)  # x is a 6-element parameter vector; PyTorch computes a gradient for every element when backward() is called

# Without requires_grad=True, PyTorch does not track the tensor and will not compute its gradient.

# Operations on tensors with requires_grad=True are recorded in PyTorch's computational graph.
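
To make the difference concrete, here is a minimal sketch comparing a tracked and an untracked tensor. The names `a`, `b`, and `out` are illustrative and not part of the original notebook.

```python
import torch

a = torch.ones(3)                      # requires_grad defaults to False
b = torch.ones(3, requires_grad=True)  # tracked by autograd

out = (a * b).sum()  # scalar that depends on both tensors
out.backward()       # only tracked tensors receive a gradient

print(a.requires_grad, b.requires_grad)  # False True
print(a.grad)  # None, because a is not tracked
print(b.grad)  # tensor([1., 1., 1.]), since d(out)/db_i = a_i = 1
```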

#### Calculating gradients  
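Parameters of a model are tensors created with `requires_grad=True`.  
Every operation applied to them is recorded in the computational graph, and when we call `backward()`, PyTorch computes the gradient by applying the chain rule backward through that graph.  

*Note*: each tensor produced by a tracked operation has a `grad_fn` attribute that references the backward function of the operation that created it (for example `AddBackward0` for an addition).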

In [5]:
x = torch.randn(3, requires_grad=True)  # x is a 3-element parameter vector
print(x)
y = x + 2        # first operation
print(y)
z = y * y + 2    # second operation
z = z.mean()     # reduce to a scalar so backward() needs no argument
print(z)

z.backward()     # computes dz/dx; .grad is filled in for leaf tensors with requires_grad=True (here x), not for intermediate tensors such as y
print(x.grad)    # the gradient dz/dx

tensor([-1.3045, -1.3004, -1.7878], requires_grad=True)
tensor([0.6955, 0.6996, 0.2122], grad_fn=<AddBackward0>)
tensor(2.3394, grad_fn=<MeanBackward0>)
tensor([0.4637, 0.4664, 0.1415])
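
The printed gradient can be checked by hand. Since z = (1/3) Σ ((x_i + 2)² + 2), the chain rule gives dz/dx_i = 2(x_i + 2)/3, which matches the values above (for example 2 × 0.6955 / 3 ≈ 0.4637). The sketch below repeats that check in code; the seed and the name `expected` are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
x = torch.randn(3, requires_grad=True)

y = x + 2
z = (y * y + 2).mean()
z.backward()

# Analytic gradient from the chain rule: dz/dx_i = 2 * (x_i + 2) / 3
expected = 2 * (x.detach() + 2) / 3
print(torch.allclose(x.grad, expected))  # True
```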


#### Jacobian vector product 
- When the final output is not a scalar, we have to call `tensor.backward(v)`, where `v` is a vector with the same shape as the output tensor (the tensor `backward` is called on, not the parameter). PyTorch then computes the vector-Jacobian product, which is the chain rule applied with `v` as the upstream gradient.
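
A small sketch of this, assuming an element-wise square as the non-scalar function. The values of `x` and `v` are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * x  # non-scalar output; its Jacobian dy/dx is diag(2 * x)

v = torch.tensor([0.1, 1.0, 0.001])  # must have the same shape as y
y.backward(v)                        # computes the vector-Jacobian product

# For a diagonal Jacobian the product reduces to v * 2 * x, element-wise
print(x.grad)
print(torch.allclose(x.grad, v * 2 * x.detach()))  # True
```

Calling `y.backward()` without an argument here would raise an error, because a gradient can only be created implicitly for scalar outputs.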

In [None]:
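# Three ways to keep operations out of the computational graph (no gradient tracking):
# 1. with torch.no_grad():       operations inside the block are not recorded
# 2. tensor.requires_grad_(False) turns tracking off in place
# 3. tensor.detach()              returns a new tensor that is cut off from the graph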
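
The three options can be sketched as follows. The tensor names are illustrative; each option uses its own tensor so the effects do not interfere.

```python
import torch

# 1. Operations inside torch.no_grad() are not recorded
a = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = a + 2
print(y.requires_grad)  # False

# 2. requires_grad_(False) switches tracking off in place
b = torch.ones(3, requires_grad=True)
b.requires_grad_(False)
print(b.requires_grad)  # False

# 3. detach() returns a new, untracked tensor; the original is unchanged
c = torch.ones(3, requires_grad=True)
d = c.detach()
print(d.requires_grad, c.requires_grad)  # False True
```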

Example training:  
let l = model_output = sum(3 * x), where x is the `weights` tensor in the cell below  
  
gradient = dl/dx = [dl/dx1  dl/dx2  dl/dx3  dl/dx4] = [3 3 3 3], since l = 3*x1 + 3*x2 + 3*x3 + 3*x4, so every dl/dxi = 3  
  
Gradients accumulate across calls to backward(): if one iteration produces [3 3 3 3] and the next iteration also produces [3 3 3 3], then `.grad` holds [6 6 6 6] unless it is reset in between  
  
tensor.grad.zero_() resets the accumulated gradients to zero (it does not change the weights themselves)
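
To see the accumulation itself, here is a sketch of the same kind of loop with the reset deliberately left out. The list `grads_seen` is a hypothetical helper used only to record the values.

```python
import torch

weights = torch.ones(4, requires_grad=True)
grads_seen = []  # snapshots of .grad after each backward() call

for epoch in range(2):
    model_output = (weights * 3).sum()
    model_output.backward()
    grads_seen.append(weights.grad.clone())
    print(weights.grad)
    # no weights.grad.zero_() here, so the next backward() adds on top

# prints tensor([3., 3., 3., 3.]) and then tensor([6., 6., 6., 6.])
```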

In [11]:
weights= torch.ones(4, requires_grad=True)  # our parameter 
for epoch in range(2):
    model_output = (weights*3).sum()
    model_output.backward() 
    print(weights.grad)
    weights.grad.zero_()

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
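
The loop above computes and resets gradients but never updates the weights. A minimal sketch of a full gradient descent step might look like the following. The learning rate `lr` and the manual update are assumptions for illustration, not something taken from the original notebook.

```python
import torch

weights = torch.ones(4, requires_grad=True)
lr = 0.1  # hypothetical learning rate

for epoch in range(2):
    model_output = (weights * 3).sum()
    model_output.backward()          # weights.grad is [3, 3, 3, 3]

    with torch.no_grad():            # the update itself must not be tracked
        weights -= lr * weights.grad

    weights.grad.zero_()             # reset before the next iteration

print(weights)  # each entry has moved from 1.0 to roughly 1.0 - 2 * 0.1 * 3 = 0.4
```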
