<a href="https://colab.research.google.com/github/RickyMacharm/PyTorch/blob/master/02_Gradient_Descent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 ## **Gradient Descent**
In machine learning problems, we provide pairs of inputs and desired outputs
and ask the model to generalize the relationship between them.
The model's predictions will usually differ from the desired outputs;
this difference is known as the loss.

Gradient descent is used to find the values of a function's parameters (coefficients or weights in machine learning) that minimize the cost (or loss) function. The cost function measures the difference between the predictions generated by an algorithm and the actual values.

Gradient descent tries to minimize the cost function so that the predictions are closer to the real values. The cost can be calculated in many ways; one of these is the mean squared error (MSE).

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_{i} \right)^2 $$
$y =$ actual value

$\hat{y} =$ predicted value

When gradient descent is employed, the idea is to find the lowest point in this function, where mean square error is closest to zero.
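
To make the MSE formula concrete, here is a minimal sketch that computes it twice: once written out term by term, and once with PyTorch's built-in `torch.nn.functional.mse_loss`. The sample values in `y_true` and `y_pred` are hypothetical and chosen only for illustration.

```python
import torch

# Hypothetical sample values, chosen only for illustration.
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])  # actual values y_i
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])   # predicted values

# MSE written out as in the formula: mean of the squared differences.
mse_manual = ((y_true - y_pred) ** 2).mean()

# The same quantity using PyTorch's built-in functional form.
mse_builtin = torch.nn.functional.mse_loss(y_pred, y_true)

print(mse_manual.item())   # 0.375
print(mse_builtin.item())  # 0.375
```

The squared differences are 0.25, 0.25, 0 and 1, so their mean is 1.5 / 4 = 0.375.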

A gradient measures how much the output of a function
changes when its inputs are varied by a small amount, which is the same as the concept of
derivatives in calculus. In training, the gradient gives the change in error with respect to a
change in each weight. Gradients are the slope of a function. A higher gradient means a steeper
slope, so a model can learn more rapidly. The gradient points in the direction of
steepest ascent, so gradient descent steps in the opposite direction.
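
As a rough illustration of the idea, the sketch below runs gradient descent in plain Python on a toy function, $f(w) = (w - 3)^2$, whose derivative is $2(w - 3)$. The function, the starting guess and the learning rate are all assumptions made for this example, not part of the original notebook.

```python
# Minimal sketch: minimize f(w) = (w - 3)^2 by stepping against its slope.
def f(w):
    return (w - 3.0) ** 2

def df(w):
    # Analytic derivative of f with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0              # hypothetical starting guess
learning_rate = 0.1  # hypothetical step size

for _ in range(50):
    # Move in the direction opposite to the gradient.
    w = w - learning_rate * df(w)

print(w, f(w))  # w ends up very close to 3, and f(w) very close to 0
```

Each step shrinks the distance to the minimum by a constant factor here, which is why 50 steps are enough for this toy problem.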

The `autograd` package in PyTorch (`torch.autograd`) performs all gradient calculations. It is the core Torch package for automatic differentiation. In practice, the backward pass is usually started from the tensor that holds the value of the cost function. Autograd uses a tape-based
system for automatic differentiation: in the forward phase, the autograd tape
remembers all the operations it executed, and in the backward phase it replays them to compute the gradients.
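
The recording described above can be inspected directly. This small sketch, using hypothetical tensors `a`, `b` and `c`, shows that each result carries a `grad_fn` node linking back to the operation that produced it, and that `backward()` walks those links to fill in the gradient.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = a * 3  # forward phase: recorded as a multiplication
c = b + 1  # forward phase: recorded as an addition

# Each result remembers the operation that created it.
print(c.grad_fn)                 # an AddBackward0 node
print(c.grad_fn.next_functions)  # links back to the multiplication node

# Backward phase: replay the recorded operations in reverse.
c.backward()
print(a.grad)  # tensor(3.), since c = 3a + 1
```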

### **Let us code**

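When we create tensors in PyTorch that will take part in gradient calculations, we need to pass an extra keyword argument,
`requires_grad=True`, that lets PyTorch know gradients are expected.
```python
import torch

x = torch.full((2, 3), 4.0, requires_grad=True)  # float fill value: integer tensors cannot require gradients
x
```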
output:
```python
tensor([[4., 4., 4.],
        [4., 4., 4.]], requires_grad=True)
```
We will create another tensor, `y`, derived from our initial tensor `x` above. The output of this new tensor looks different: it has a gradient function (`grad_fn`) attached
to it:
```python
y = 2*x+3
y
```
output:
```python
tensor([[11., 11., 11.],
        [11., 11., 11.]], grad_fn=<AddBackward0>)
```

We now redefine `y` from our initial tensor `x` using a slightly more complex formula:
```python
y = (2*x**2+3)
y
```
output:
```python
tensor([[35., 35., 35.],
        [35., 35., 35.]], grad_fn=<AddBackward0>)
```
Let us break down some concepts real quick:

**`requires_grad`**: When this parameter is set to `True`, PyTorch starts tracking the history of operations on the tensor and builds a backward graph for gradient calculation. For an already existing tensor `x`, it can be set in place as follows:
```python
x.requires_grad_(True)
```
**`grad`**: this holds the value of the gradient. If `requires_grad` is `False`, it holds `None`. Even if `requires_grad` is `True`, it holds `None` until `.backward()` is called on some node that depends on the tensor.

For example, if you call `out.backward()` for some variable `out` that involved `y` in its calculations, then `y.grad` will hold $\partial \text{out} / \partial y$, provided `y` is a leaf tensor (one created directly rather than computed from other tensors).

The gradients computed are always those of the output node from which `.backward()` is called.
Once `requires_grad=True` is set, PyTorch tracks each operation and stores the corresponding gradient function at each step.
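
The behaviour of `grad` can be checked with a small sketch. The tensors `w`, `h` and `out` are hypothetical names used only here; `retain_grad()` is called on the intermediate tensor because, by default, PyTorch keeps gradients only for leaf tensors.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor
h = w * 3          # intermediate (non-leaf) tensor
h.retain_grad()    # ask autograd to keep the gradient of this non-leaf tensor
out = h.sum()      # scalar output node

print(w.grad)      # None: backward() has not been called yet
out.backward()
print(w.grad)      # tensor([3., 3.])  -> d(out)/d(w)
print(h.grad)      # tensor([1., 1.])  -> d(out)/d(h)
print(w.is_leaf, h.is_leaf)  # True False
```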

Let us go back to our calculations and compute the gradient of `y` with respect to `x`. Because `y` is a tensor
rather than a scalar, `.backward()` needs a `gradient` argument with the same shape as `y`. Here we
pass a tensor of ones with the shape of `x`, which is the same as that of `y`:
```python
y.backward(torch.ones_like(x))
```
We now output the gradient with respect to `x` using the `grad` attribute:
```python
x.grad
```
This results in the following:
```python
tensor([[16., 16., 16.],
        [16., 16., 16.]])
```
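
The value 16 can be checked by hand: for $y = 2x^2 + 3$ the derivative is $dy/dx = 4x$, which is 16 at $x = 4$. The sketch below, which rebuilds `x` and `y` so that it runs on its own, also shows that passing a tensor of ones to `backward()` gives the same result as reducing `y` to a scalar with `sum()` first.

```python
import torch

x = torch.full((2, 3), 4.0, requires_grad=True)
y = 2 * x**2 + 3

# Summing to a scalar first is equivalent to y.backward(torch.ones_like(x)).
y.sum().backward()

print(x.grad)                               # every entry is 16
print(torch.equal(x.grad, 4 * x.detach()))  # True: matches dy/dx = 4x
```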

We can turn off gradient tracking at a certain point in the code as shown in the following snippets. The expected outputs are given immediately after each command.

First, check the `requires_grad` attribute on the tensor:
```python
>>> x.requires_grad
True
```
Then turn tracking off with the `requires_grad_()` method and check again:
```python
>>> x.requires_grad_(False)  # turning off gradient tracking
tensor([[4., 4., 4.],
        [4., 4., 4.]])
>>> x.requires_grad
False
```
We can also turn off gradient tracking temporarily by using the `torch.no_grad()`
context manager:
```python
>>> x = torch.full((2, 3), 4.0, requires_grad=True)
>>> x
tensor([[4., 4., 4.],
        [4., 4., 4.]], requires_grad=True)
>>> x.requires_grad
True
>>> with torch.no_grad():
...     print((x**5+3).requires_grad)
...
False
```
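
A common place where `torch.no_grad()` appears is the parameter update of gradient descent itself, since the update should not be recorded in the graph. The following is a minimal sketch under stated assumptions: the data (points on the line $y = 2x + 1$), the parameter names `w` and `b`, the learning rate and the number of steps are all hypothetical choices for illustration.

```python
import torch

# Hypothetical data: points on the line y = 2x + 1.
x_data = torch.linspace(-1, 1, 20)
y_data = 2 * x_data + 1

# Parameters to learn, tracked by autograd.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1  # hypothetical learning rate

for step in range(500):
    y_hat = w * x_data + b                 # forward pass
    loss = ((y_data - y_hat) ** 2).mean()  # MSE loss
    loss.backward()                        # fills w.grad and b.grad

    with torch.no_grad():                  # do not track the update itself
        w -= lr * w.grad
        b -= lr * b.grad

    # Gradients accumulate across backward() calls, so reset them.
    w.grad.zero_()
    b.grad.zero_()

print(w.item(), b.item())  # close to 2 and 1
```

Resetting the gradients after each step matters because `backward()` adds to `grad` rather than overwriting it.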

**Run all of this code in the cells below to verify the outputs given here.**

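```python
import torch

x = torch.full((2, 3), 4.0, requires_grad=True)  # float fill value: integer tensors cannot require gradients
x
```
output:
```python
tensor([[4., 4., 4.],
        [4., 4., 4.]], requires_grad=True)
```
```python
y = 2*x+3
y
```
output:
```python
tensor([[11., 11., 11.],
        [11., 11., 11.]], grad_fn=<AddBackward0>)
```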