In [None]:
# This is Google Colab Notebook


### Gradient

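A gradient is a vector of partial derivatives. It points in the direction in which a function increases fastest, and its magnitude tells us how steep that increase is. In training, it tells us in which direction and by how much to change each parameter (such as the weights and biases of a neural network) to minimize or maximize a given function.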


Consider a function, say y = x^2. Its derivative with respect to x is dy/dx = 2x.

In [1]:
# We can code this derivative directly for y = x**2

def dy_dx(x):
  return 2*x

dy_dx(2)

4

In [2]:
# Now say we have two functions: 1. y = x**2 and 2. z = sin(y)
# By the chain rule, dz/dx = dz/dy * dy/dx

import math

def dy_dx(x):
  return 2*x

def dz_dx(x):
  return dy_dx(x) * math.cos(x**2)

dz_dx(2)

-2.6145744834544478

Now consider a chain of nested functions like

y = x**2

z = sin(y)

u = e**z

Here the derivative du/dx becomes more tedious to derive manually.
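
To make the three-level chain concrete, here is a minimal sketch that applies the chain rule by hand, du/dx = du/dz * dz/dy * dy/dx, and compares the result against a finite-difference estimate. The function names `du_dx` and `numerical_du_dx` and the step size `h` are illustrative choices, not part of the original notebook.

```python
import math

def du_dx(x):
    """Derivative of u = exp(sin(x**2)) with respect to x, via the chain rule."""
    y = x ** 2             # y = x^2
    z = math.sin(y)        # z = sin(y)
    du_dz = math.exp(z)    # u = e^z, so du/dz = e^z
    dz_dy = math.cos(y)    # dz/dy = cos(y)
    dy_dx = 2 * x          # dy/dx = 2x
    return du_dz * dz_dy * dy_dx

def numerical_du_dx(x, h=1e-6):
    """Central finite-difference estimate, used only as a sanity check."""
    def u(t):
        return math.exp(math.sin(t ** 2))
    return (u(x + h) - u(x - h)) / (2 * h)

# The two values should agree to several decimal places.
print(du_dx(2.0), numerical_du_dx(2.0))
```

Every extra layer of nesting adds one more factor to track, which is the bookkeeping that automatic differentiation removes.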

In deep learning, a deep neural network is a very complex nested function. Deriving its derivatives through the chain rule and keeping track of every intermediate derivative becomes very hard.

The usual training process is:

1. Forward Pass

2. Calculate Loss

3. Backward Pass - Compute the gradient of the loss w.r.t. the parameters (weights and biases)

4. Update Parameters - Update the parameters using an optimization algorithm (like Gradient Descent).
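
The four steps above can be sketched in plain Python for a one-parameter model. This is only an illustration under assumed values: the toy data (generated from y = 3x), the starting weight, the learning rate, and the number of steps are all hypothetical, and the gradient formula is derived by hand for mean squared error.

```python
# Toy data generated from y = 3x; the model is y_hat = w * x with a single weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0               # initial parameter (hypothetical starting value)
learning_rate = 0.01  # hypothetical step size

for step in range(200):
    # 1. Forward pass
    preds = [w * x for x in xs]
    # 2. Calculate loss (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: dL/dw = mean(2 * (pred - y) * x), derived by hand
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Update the parameter with plain gradient descent
    w -= learning_rate * grad_w

print(w, loss)  # w should end up close to 3
```

With a single weight the hand-derived gradient is one line; with millions of parameters spread over many layers, step 3 is the part that stops being practical to write manually.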


Performing all these steps manually is very difficult, and for a deep neural network it becomes nearly impossible.

Hence, **PyTorch AutoGrad** simplifies the process for us.




## **What is Autograd?**
Autograd is a system that performs automatic differentiation to compute the gradients of functions with respect to their inputs, parameters, or any other variable. These gradients are crucial for optimizing the model during training (e.g., in gradient descent).

When you define a function in machine learning, Autograd keeps track of all operations performed on the inputs. Using this information, it calculates the derivatives (gradients) automatically without requiring you to manually derive formulas.


#### Now, taking the same examples as above, let's solve them with PyTorch Autograd

In [3]:
import torch

In [4]:
x = torch.tensor(3, requires_grad=True) # set requires_grad=True whenever you need gradients with respect to this tensor

y = x**2

# I am leaving this error in deliberately to show that requires_grad only works for floating point and complex dtypes, not for integers.

RuntimeError: Only Tensors of floating point and complex dtype can require gradients

In [5]:
x = torch.tensor(3.0, requires_grad=True) # set requires_grad=True whenever you need gradients with respect to this tensor

y = x**2

In [6]:
x

tensor(3., requires_grad=True)

In [7]:
y

tensor(9., grad_fn=<PowBackward0>)

In [9]:
# During the forward pass, PyTorch builds a computation graph and records each operation it performs, here grad_fn=<PowBackward0>. In the backward pass it uses these records to compute the corresponding derivatives.

y.backward()

In [10]:
x.grad

tensor(6.)

In [11]:
# Now the second example, using the manual dz_dx function defined earlier

dz_dx(3)

-5.466781571308061

In [12]:
# Let's try with AutoGrad functionality

x = torch.tensor(3.0, requires_grad=True)

y = x**2

z = torch.sin(y)



In [13]:
x, y , z

(tensor(3., requires_grad=True),
 tensor(9., grad_fn=<PowBackward0>),
 tensor(0.4121, grad_fn=<SinBackward0>))

In [14]:
z.backward()

In [15]:
x.grad

tensor(-5.4668)

In [16]:
# What if we try to fetch the gradient of y, an intermediate tensor in this nested function?

y.grad

# This does not give us a gradient. During the backward pass, gradients do flow through intermediate (non-leaf) nodes, but by default they are stored in .grad only for leaf tensors, so y.grad stays None and PyTorch emits a warning.
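
If the gradient at an intermediate tensor is actually needed, one option is to call `retain_grad()` on that tensor before the backward pass. The sketch below reuses the same x, y, z chain; comparing against `math.cos` is just an illustrative sanity check.

```python
import math
import torch

x = torch.tensor(3.0, requires_grad=True)  # leaf tensor
y = x ** 2                                 # intermediate (non-leaf) tensor
y.retain_grad()                            # ask autograd to keep dz/dy in y.grad
z = torch.sin(y)

z.backward()

print(x.is_leaf, y.is_leaf)        # leaf vs. non-leaf
print(y.grad, math.cos(9.0))       # dz/dy = cos(y) = cos(9)
print(x.grad, 6 * math.cos(9.0))   # dz/dx = 2x * cos(x^2) = 6 * cos(9)
```

Retaining gradients costs extra memory, which is presumably why it is opt-in rather than the default.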


Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:


*   run the requested operation to compute a resulting tensor
*   maintain the operation’s gradient function in the DAG


The backward pass kicks off when .backward() is called on the DAG root. autograd then:



*   computes the gradients from each .grad_fn,
*   accumulates them in the respective tensor’s .grad attribute
*   using the chain rule, propagates all the way to the leaf tensors.
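
The graph described above can be inspected directly. This is a small exploratory sketch using the same x, y, z chain as earlier; it only reads attributes that autograd exposes (`is_leaf`, `grad_fn`, `next_functions`) and does not change any computation.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
z = torch.sin(y)

# Leaves are tensors created directly by the user; results of operations are not leaves.
print(x.is_leaf, y.is_leaf, z.is_leaf)

# A leaf has no grad_fn; a non-leaf points at the backward node of the op that produced it.
print(x.grad_fn)
print(z.grad_fn)

# next_functions links a node to the backward nodes of its inputs,
# so following it walks the graph from the root toward the leaves.
parent_node = z.grad_fn.next_functions[0][0]
print(parent_node.name(), y.grad_fn.name())
```

Here the root is z, and walking `next_functions` from `z.grad_fn` reaches the node that produced y, then the accumulation node for the leaf x.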



*Reference* - https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html#more-on-computational-graphs


### In Terms of Actual Perceptron

1. Manual

In [17]:
import torch

x = torch.tensor(6.7) # Input feature
y = torch.tensor(0.0) # True Label (Binary)

w = torch.tensor(1.0) # Weight
b = torch.tensor(0.0) # Bias

In [18]:
# Binary cross entropy loss

def binary_cross_entropy_loss(pred, target):
  epsilon = 1e-8   # to prevent log(0)
  prediction = torch.clamp(pred, epsilon, 1 - epsilon)
  return -(target * torch.log(prediction) + (1 - target) * torch.log(1 - prediction)) # Formula of Binary Loss function

In [19]:
# Forward Pass

z = w * x + b # Linear function
y_pred = torch.sigmoid(z)

# Compute Binary Loss
loss = binary_cross_entropy_loss(y_pred, y)

In [20]:
loss

tensor(6.7012)

In [21]:
# Back Propagation

# Derivatives:
# 1. dL/d(y_pred): Loss with respect to the prediction (y_pred)
dloss_dy_pred = (y_pred - y)/(y_pred*(1-y_pred))

# 2. dy_pred/dz: Prediction (y_pred) with respect to z (sigmoid derivative)
dy_pred_dz = y_pred * (1 - y_pred)

# 3. dz/dw and dz/db: z with respect to w and b
dz_dw = x  # dz/dw = x
dz_db = 1  # dz/db = 1 (bias contributes directly to z)

dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

In [22]:
print(f"Manual Gradient of loss w.r.t weight (dw): {dL_dw}")
print(f"Manual Gradient of loss w.r.t bias (db): {dL_db}")

Manual Gradient of loss w.r.t weight (dw): 6.691762447357178
Manual Gradient of loss w.r.t bias (db): 0.998770534992218
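
For a sigmoid followed by binary cross entropy, the first two factors of the chain cancel: (y_pred - y) / (y_pred * (1 - y_pred)) times y_pred * (1 - y_pred) leaves dL/dz = y_pred - y. The sketch below checks that simplification numerically with the same input values; the variable names `chain_dw`, `simple_dw`, and `simple_db` are illustrative.

```python
import torch

x = torch.tensor(6.7)   # input feature
y = torch.tensor(0.0)   # true label
w = torch.tensor(1.0)   # weight
b = torch.tensor(0.0)   # bias

y_pred = torch.sigmoid(w * x + b)

# Three-factor chain rule, written out in full
chain_dw = (y_pred - y) / (y_pred * (1 - y_pred)) * (y_pred * (1 - y_pred)) * x

# Simplified form after cancelling y_pred * (1 - y_pred)
simple_dw = (y_pred - y) * x   # dL/dw = (y_pred - y) * x
simple_db = y_pred - y         # dL/db = (y_pred - y)

print(chain_dw, simple_dw, simple_db)
```

The simplified form also avoids dividing by y_pred * (1 - y_pred), which gets very small when the prediction saturates near 0 or 1.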


2. AutoGrad

In [23]:
x = torch.tensor(6.7)
y = torch.tensor(0.0)

w = torch.tensor(1.0, requires_grad=True) # Because we need to calculate the gradient with respect to these parameters
b = torch.tensor(0.0, requires_grad=True)

In [24]:
w, b

(tensor(1., requires_grad=True), tensor(0., requires_grad=True))

In [25]:
z = w*x + b
z

tensor(6.7000, grad_fn=<AddBackward0>)

In [26]:
y_pred = torch.sigmoid(z)
y_pred

tensor(0.9988, grad_fn=<SigmoidBackward0>)

In [27]:
loss = binary_cross_entropy_loss(y_pred, y)
loss

tensor(6.7012, grad_fn=<NegBackward0>)

In [28]:
loss.backward()

In [29]:
w.grad

tensor(6.6918)

In [30]:
b.grad

tensor(0.9988)

In [31]:
# Now trying with a 1-D tensor

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

x

tensor([1., 2., 3.], requires_grad=True)

In [32]:
y = (x**2).mean()
y

tensor(4.6667, grad_fn=<MeanBackward0>)

In [33]:
y.backward()

In [34]:
x.grad

tensor([0.6667, 1.3333, 2.0000])
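
These values can be checked against the analytic gradient. Since y = (1/n) * sum(x_j^2), each component is dy/dx_i = 2 * x_i / n, which gives 2/3, 4/3 and 2 for n = 3. A minimal sketch of that check, where `expected` is an illustrative name:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).mean()
y.backward()

# Analytic gradient: d/dx_i [ (1/n) * sum_j x_j^2 ] = 2 * x_i / n
expected = 2 * x.detach() / x.numel()

print(x.grad)
print(expected)
```

Note that `y` is reduced to a scalar by `mean()` before `backward()` is called, so the gradient has the same shape as `x`.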


### Clearing Grads

Training usually involves multiple backward passes, and gradients accumulate in .grad across them rather than being overwritten. The example below shows how this accumulation happens and why it can produce values other than the ones we want.



In [35]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [36]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [37]:
# 1st backward pass
y.backward()
x.grad

tensor(4.)

In [39]:
# Forward pass again, for the second pass
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [40]:
# On the 2nd backward pass x.grad should again be 4, but we get a different value because of gradient accumulation
y.backward()
x.grad

tensor(8.)
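
One way to clear the accumulated value is to zero the gradient in place with `x.grad.zero_()` between backward passes. The loop below is a minimal sketch of that pattern; the list `grads` is only there to record what each pass produced.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
grads = []

for step in range(3):
    y = x ** 2                    # fresh forward pass
    y.backward()                  # adds dy/dx = 2 * x into x.grad
    grads.append(x.grad.item())   # record the gradient for this pass
    x.grad.zero_()                # reset in place before the next backward pass

print(grads)  # each pass now reports the same gradient
```

When parameters are managed by an optimizer from `torch.optim`, calling `optimizer.zero_grad()` serves the same purpose for all of its parameters at once.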

In [41]:
# To disable gradient tracking, there are a few techniques we can use

# option 1 - requires_grad_(False)
# option 2 - detach()
# option 3 - torch.no_grad()
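
The three options listed above can be sketched as follows. The tensor names `a`, `b`, `c` and the value 2.0 are illustrative; each case checks `requires_grad` on the result to confirm that no graph is being recorded.

```python
import torch

# Option 1: requires_grad_(False) switches tracking off in place on a leaf tensor
a = torch.tensor(2.0, requires_grad=True)
a.requires_grad_(False)
a_squared = a ** 2
print(a_squared.requires_grad)

# Option 2: detach() returns a new tensor that shares data but is cut off from the graph
b = torch.tensor(2.0, requires_grad=True)
b_detached = b.detach()
print(b.requires_grad, b_detached.requires_grad)  # b itself still tracks

# Option 3: torch.no_grad() disables tracking for everything computed inside the block
c = torch.tensor(2.0, requires_grad=True)
with torch.no_grad():
    c_squared = c ** 2
print(c_squared.requires_grad, c_squared.grad_fn)
```

Roughly, option 1 changes the tensor itself, option 2 leaves the original untouched and gives a separate view for untracked use, and option 3 is scoped to a block of code, which makes it a common fit for evaluation or inference.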