## Autograd example

In [1]:
import torch
from torch.autograd import grad   # function that computes gradients of outputs w.r.t. inputs
import torch.nn.functional as F   # functional module, which provides activation and loss functions

In [11]:
x = torch.tensor([3.])
w = torch.tensor([2.], requires_grad=True)  # setting requires_grad=True to track computation with it
b = torch.tensor([1.], requires_grad=True)
a = F.sigmoid(x*w + b)

In [12]:
a

tensor([0.9991], grad_fn=<SigmoidBackward0>)

In [13]:
grad(a, w, retain_graph=True)[0]   # retain_graph=True to retain the computation graph for further computations

tensor([0.0027])

Above, `retain_graph=True` keeps the computation graph in memory after the call. It is used here only so that `grad` can be called again on the same graph below. In practice, we usually let PyTorch free the graph after every backward pass.
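
To make the role of `retain_graph` concrete, here is a minimal sketch of what is expected to happen when it is left at its default (`False`): the first `grad` call succeeds, and a second call on the same graph raises a `RuntimeError` because the saved intermediate tensors have been freed. The variable names `first` and `second_call_failed` are illustrative only.

```python
import torch
from torch.autograd import grad

x = torch.tensor([3.])
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
a = torch.sigmoid(x * w + b)

# First call uses the default retain_graph=False, so the tensors saved
# for the backward pass are freed once the gradient has been computed.
first = grad(a, w)[0]

# A second call on the same graph is then expected to fail.
try:
    grad(a, b)
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
    print("second grad() call raised RuntimeError")
```

Rebuilding `a` with a fresh forward pass before the second call would avoid the error without keeping the old graph alive.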

In [14]:
grad(a, b, retain_graph=True)[0]

tensor([0.0009])
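
The two values returned by `grad` can be checked by hand with the chain rule. The symbol $\sigma$ for the sigmoid function is notation introduced here for the check, not something defined in the cells above.

$$ a = \sigma(xw + b) \text{  and  } \sigma'(u) = \sigma(u)\,(1 - \sigma(u)) $$
$$ \frac{\partial a}{\partial w} = a(1 - a)\,x \text{  and  } \frac{\partial a}{\partial b} = a(1 - a) $$

For $x = 3$, $w = 2$ and $b = 1$ we get $a = \sigma(7) \approx 0.99909$, so $a(1 - a) \approx 0.00091$. This gives $\partial a / \partial w \approx 0.0027$ and $\partial a / \partial b \approx 0.0009$, which agrees with the printed tensors.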

### Automatic way of computing gradient

In [15]:
print(w.grad)

None


In [18]:
a.backward(retain_graph=True)   # uses chain rule to compute gradient for all the tensors (with requires_grad=True) involved in the computation of 'a'
print(w.grad)  # printing the gradients d(a)/d(w)
print(b.grad)  # printing the gradients d(a)/d(b)

tensor([0.0027])
tensor([0.0009])


## What is gradient accumulation?

By default, PyTorch accumulates (adds) gradients into `.grad` every time you call `.backward()`, instead of overwriting the previous value.

This is the default because we sometimes want to sum gradients across multiple backward passes before updating the weights.

Think of it as “collecting all contributions to the gradient” before taking a step.

For example:

**1. Simulating Large Batch Training (Gradient Accumulation)**

- Suppose your GPU can only handle a batch size of 32.

- But you want to train as if batch size = 128 (for more stable gradients).

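- You can split the batch into 4 mini-batches of 32, run forward + backward on each, accumulate the gradients, and then do a single gradient descent step. If the loss is a mean over samples, scale each mini-batch loss by 1/4 so the accumulated gradient matches the full-batch gradient.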

This way, memory usage is small, but effective batch size is large.
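
The mini-batch scheme described above can be sketched as follows. This is an illustrative setup, not a full training loop: the random data, the small `nn.Linear` model and the names `full_grad` and `accum_grad` are all assumptions made for the example. It compares the gradient from one batch of 128 with the gradient accumulated over 4 mini-batches of 32.

```python
import torch

torch.manual_seed(0)
X = torch.randn(128, 4)           # hypothetical inputs
Y = torch.randn(128, 1)           # hypothetical targets
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()      # mean over the samples in the batch

# Reference: gradient from a single pass over the full batch of 128
model.zero_grad()
loss_fn(model(X), Y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 mini-batches of 32, each loss scaled by 1/4
model.zero_grad()
accum_steps = 4
for xb, yb in zip(X.chunk(accum_steps), Y.chunk(accum_steps)):
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()               # adds into model.weight.grad
accum_grad = model.weight.grad.clone()

print("gradients match:", torch.allclose(full_grad, accum_grad, atol=1e-5))
```

In a real loop, the optimizer step and the gradient reset would come once after the inner `for` loop rather than after every mini-batch.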

**2. Multi-Loss Training (Multiple Objectives)**

- Imagine training a model with two loss functions (say, classification + reconstruction).

- You compute gradients from the first loss (`loss1.backward()`), then from the second loss (`loss2.backward()`), and accumulate both before performing gradient descent.

This way, both objectives influence the weight update.
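
As a small sketch of the two-loss case, the example below uses two made-up scalar objectives (stand-ins, not real classification or reconstruction losses) that share an intermediate tensor `h`. It checks that calling `.backward()` on each loss in turn accumulates the same gradient as calling it once on their sum. Because both losses share part of the graph, the first call passes `retain_graph=True`.

```python
import torch

x = torch.tensor([0.5, 3.0])
w = torch.tensor([1.0, -2.0], requires_grad=True)

def two_losses(w):
    h = w * x                      # shared intermediate used by both losses
    loss1 = (h.sum() - 1.0) ** 2   # stand-in for the first objective
    loss2 = (h ** 2).sum()         # stand-in for the second objective
    return loss1, loss2

# Two separate backward passes: gradients accumulate in w.grad
loss1, loss2 = two_losses(w)
loss1.backward(retain_graph=True)  # keep the shared graph for loss2
loss2.backward()
separate = w.grad.clone()

# One backward pass on the summed loss
w.grad = None
loss1, loss2 = two_losses(w)
(loss1 + loss2).backward()
combined = w.grad.clone()

print(separate, combined)
```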

In [19]:
x = torch.tensor([5.], requires_grad=True)
y = torch.tensor([3.], requires_grad=True)

z1 = x * y
z2 = x**2 + 2* (x * y) + 3

print(f"output-1 (z1): {z1}, output-2 (z2): {z2}")

z1.backward()  # computing gradients of z1 w.r.t. x and y
print(f"Gradient of x for z1: {x.grad}, Gradient of y for z1: {y.grad}")

z2.backward()  # computing gradients of z2 w.r.t. x and y
print(f"Gradient of x: {x.grad}, Gradient of y: {y.grad}")  # gradients will be accumulated here

output-1 (z1): tensor([15.], grad_fn=<MulBackward0>), output-2 (z2): tensor([58.], grad_fn=<AddBackward0>)
Gradient of x for z1: tensor([3.]), Gradient of y for z1: tensor([5.])
Gradient of x: tensor([19.]), Gradient of y: tensor([15.])


**Explanation**

$$ z_1 = xy \text{  and  } z_2 = x^2 + 2xy + 3 $$
$$ \frac{\partial z_1}{\partial x} = y \text{  and  } \frac{\partial z_2}{\partial x} = 2x + 2y $$
$$ \frac{\partial z_1}{\partial y} = x \text{  and  } \frac{\partial z_2}{\partial y} = 2x $$

For the given values $x = 5$ and $y = 3$

$$ \frac{\partial z_1}{\partial x} = 3 \text{  and  } \frac{\partial z_2}{\partial x} = 16 $$
$$ \frac{\partial z_1}{\partial y} = 5 \text{  and  } \frac{\partial z_2}{\partial y} = 10 $$

- When you call `z1.backward()`, it computes the gradients of `z1` w.r.t. `x` and `y` and stores them in `x.grad` and `y.grad`.
- When you then call `z2.backward()`, it computes the gradients of `z2` w.r.t. `x` and `y` and adds them to the previously stored gradients, so `x.grad` becomes 3 + 16 = 19 and `y.grad` becomes 5 + 10 = 15.

## How to stop accumulating gradients?

In most training loops, you want to reset (zero-out) the gradients before the next iteration.

One way to stop the gradients from accumulating is to manually set the gradients of the variables to `None` or to zero.

In [20]:
# manually clear the gradients

x = torch.tensor([5.], requires_grad=True)
y = torch.tensor([3.], requires_grad=True)

z1 = x * y
z2 = x**2 + 2* (x * y) + 3

z1.backward()  # computing gradients of z1 w.r.t. x and y
print(f"Gradient of x for z1: {x.grad}, Gradient of y for z1: {y.grad}")

x.grad = None  # x.grad.zero_() can also be used to reset the gradients
y.grad = None

z2.backward()  # computing gradients of z2 w.r.t. x and y
print(f"Gradient of x for z2: {x.grad}, Gradient of y for z2: {y.grad}")  # gradients will not be accumulated here

Gradient of x for z1: tensor([3.]), Gradient of y for z1: tensor([5.])
Gradient of x for z2: tensor([16.]), Gradient of y for z2: tensor([10.])
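
In a training loop, the same reset is usually done through the optimizer rather than tensor by tensor. The sketch below is a minimal illustration on synthetic data. The linear model, learning rate and number of steps are arbitrary choices for the example. `optimizer.zero_grad()` is called at the start of every iteration so that each step uses only the current gradients.

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 3)                               # synthetic inputs
y = X @ torch.tensor([[1.0], [-2.0], [0.5]])         # synthetic targets

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(50):
    optimizer.zero_grad()         # clear gradients left from the previous step
    loss = loss_fn(model(X), y)   # forward pass
    loss.backward()               # compute fresh gradients
    optimizer.step()              # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```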


## Dynamic Computation Graph (Define-by-Run)

In PyTorch, the computation graph is dynamic, meaning it is built on the fly as operations are executed.
This contrasts with frameworks like TensorFlow (v1.x) that used a static computation graph, where you first define the whole graph and then execute it.

In PyTorch:

- Every time you perform an operation (like addition, multiplication, matrix multiplication), a node is added to the computation graph.

- The graph is rebuilt from scratch each time the forward pass runs (for example, each time a module's `forward` is called).

- This is why it’s called “define-by-run”: the graph is defined dynamically while running your code.

This gives flexibility, especially useful for:

- Variable-length sequences (e.g., time series / sentences with different sequence lengths).

- Conditional computations (`if`, `for`, etc.) inside the forward pass.

- Debugging (since the forward pass is ordinary Python, you can inspect intermediate tensors with `print` or a standard debugger).
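
The variable-length case can be illustrated with a toy recurrence. The function `final_state` and the update rule inside it are made up for this sketch and are not a real RNN layer. The Python `for` loop runs once per element, so a longer sequence produces a deeper graph, and `backward()` works in both cases without any change to the code.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)

def final_state(seq, w):
    # Hypothetical recurrence: one tanh update per sequence element
    h = torch.tensor(0.0)
    for x_t in seq:
        h = torch.tanh(w * h + x_t)
    return h

sequences = [torch.tensor([1.0, 2.0]),
             torch.tensor([0.3, -0.1, 0.7, 1.5, -2.0])]

grads = []
for seq in sequences:
    w.grad = None                  # reset so gradients do not accumulate
    out = final_state(seq, w)      # graph depth depends on len(seq)
    out.backward()
    grads.append(w.grad.item())
    print(f"length {len(seq)}: output {out.item():.4f}, d(out)/d(w) {grads[-1]:.4f}")
```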

### Example 1: Basic Graph

In [21]:
# Create tensors with gradient tracking
x = torch.randn(3, requires_grad=True)  # random tensor of shape (3,)
y = torch.randn(3, requires_grad=True)  # random tensor of shape (3,)

# Forward computation
z = x * y   # operation builds part of graph
out = z.sum()  # final scalar output

print("x:", x)
print("y:", y)
print("z:", z)
print("out:", out)

# Backward pass
out.backward()

print("Gradient of x:", x.grad)
print("Gradient of y:", y.grad)

x: tensor([-0.1577,  0.5581, -2.5103], requires_grad=True)
y: tensor([0.3114, 0.6489, 0.7322], requires_grad=True)
z: tensor([-0.0491,  0.3621, -1.8381], grad_fn=<MulBackward0>)
out: tensor(-1.5251, grad_fn=<SumBackward0>)
Gradient of x: tensor([0.3114, 0.6489, 0.7322])
Gradient of y: tensor([-0.1577,  0.5581, -2.5103])


- When `z = x * y` is executed, PyTorch dynamically creates a node for elementwise multiplication.

- When `out = z.sum()` is executed, another node for summation is added.

- Finally, `out.backward()` traverses the graph backwards to compute gradients.

- If you run the forward pass again with different shapes or conditions, PyTorch will build a new graph.
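
The nodes described above can be inspected directly through the `grad_fn` attribute. This short sketch rebuilds the same computation and walks one step back through the graph. The variable names `sum_node` and `mul_node` are illustrative.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
out = (x * y).sum()

sum_node = out.grad_fn                      # node created by .sum()
mul_node = sum_node.next_functions[0][0]    # node created by x * y

print(sum_node.name())                      # name of the summation node
print(mul_node.name())                      # name of the multiplication node
print(x.grad_fn, x.is_leaf)                 # leaf tensors have no grad_fn
```

Leaf tensors such as `x` and `y` were created by the user rather than by an operation, which is why their `grad_fn` is `None`.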

### Example 2: Dynamic Control Flow

In [22]:
x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)


def f(a, b, flag=True):
    if flag:
        # one type of computation
        c = a * b
    else:
        # another type of computation
        c = a + b

    return c.sum()

In [23]:
print("x:", x)
print("y:", y)

x: tensor([ 0.1173, -1.3290,  1.7495, -1.2362], requires_grad=True)
y: tensor([ 0.5248, -0.7247,  0.4923,  0.3318], requires_grad=True)


In [24]:
out1 = f(x, y, True)
out2 = f(x, y, False)

print(f"output with flag=True: {out1}, output with flag=False: {out2}")

output with flag=True: 1.4759141206741333, output with flag=False: -0.07416081428527832


In [25]:
out1.backward()

print("Gradient of x after out1.backward():", x.grad)

Gradient of x after out1.backward(): tensor([ 0.5248, -0.7247,  0.4923,  0.3318])


In [26]:
x.grad = None  # resetting the gradients of x, otherwise they will accumulate

out2.backward()

print("Gradient of x after out2.backward():", x.grad)

Gradient of x after out2.backward(): tensor([1., 1., 1., 1.])


## Loss functions in PyTorch



In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F

### MSE loss

In [28]:
inputs = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([0.6, 1.8, 2.3])

mse_loss_fn = nn.MSELoss()

loss_mse = mse_loss_fn(inputs, targets)

print(f"MSE Loss: {loss_mse.item():.4f}")

MSE Loss: 0.2300


In [29]:
print(f"MSE Loss in functional representation: {F.mse_loss(inputs, targets):.4f}")

MSE Loss in functional representation: 0.2300
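
As a sanity check, the same number can be reproduced by hand. With the default `reduction="mean"`, MSE is the average of the squared differences. The name `manual_mse` is used only for this sketch.

```python
import torch
import torch.nn.functional as F

inputs = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([0.6, 1.8, 2.3])

# Mean of squared differences: (0.4^2 + 0.2^2 + 0.7^2) / 3
manual_mse = ((inputs - targets) ** 2).mean()

print(f"Manual MSE: {manual_mse.item():.4f}")
print(f"F.mse_loss: {F.mse_loss(inputs, targets).item():.4f}")
```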


### BCE and BCE with logits Loss

In [32]:
logits = torch.tensor([2.0, -1.5, 1.2])

targets = torch.tensor([1, 1, 0], dtype=torch.float32)

probas = F.sigmoid(logits)

print(probas)

tensor([0.8808, 0.1824, 0.7685])


In [33]:
bce_loss_fn = nn.BCELoss()

loss_bce = bce_loss_fn(probas, targets) # passing probas (post sigmoid) as inputs

print(f"BCE loss: {loss_bce.item():.4f}")

print(f"BCE loss in functional representation: {F.binary_cross_entropy(probas, targets):.4f}")

BCE loss: 1.0972
BCE loss in functional representation: 1.0972


In [34]:
bce_with_logits_loss_fn = nn.BCEWithLogitsLoss()

loss_bce_with_logits = bce_with_logits_loss_fn(logits, targets) # passing logits (pre-sigmoid) as inputs (not the probas)

print(f"BCE with logit loss: {loss_bce_with_logits.item():.4f}")

print(f"BCE with logit loss in functional representation: {F.binary_cross_entropy_with_logits(logits, targets):.4f}")

BCE with logit loss: 1.0972
BCE with logit loss in functional representation: 1.0972
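
Both variants give the same value because `BCEWithLogitsLoss` applies the sigmoid internally before computing binary cross entropy. The sketch below writes the formula out explicitly for the same tensors. `manual_bce` is an illustrative name, and the formula assumes the default mean reduction.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.5, 1.2])
targets = torch.tensor([1, 1, 0], dtype=torch.float32)
probas = torch.sigmoid(logits)

# BCE = -mean( t * log(p) + (1 - t) * log(1 - p) )
manual_bce = -(targets * torch.log(probas)
               + (1 - targets) * torch.log(1 - probas)).mean()

print(f"Manual BCE: {manual_bce.item():.4f}")
print(f"With logits: {F.binary_cross_entropy_with_logits(logits, targets).item():.4f}")
```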


### Cross Entropy Loss

In [35]:
logits = torch.tensor([[1.2, 2.3, -0.5],
                       [3.1, 1.8, 0.9],
                       [2.8, -1.6, 4.9],
                       [0.5, 2.2, 1.9]])

targets = torch.tensor([1, 0, 2, 1])

logits.shape, targets.shape

(torch.Size([4, 3]), torch.Size([4]))

In [36]:
probas = F.softmax(logits, dim=1) 

print(probas)

print(f"Sum of probabilities across classes (should be 1.0 for each sample): {probas.sum(dim=1)}")

tensor([[0.2388, 0.7175, 0.0436],
        [0.7229, 0.1970, 0.0801],
        [0.1090, 0.0013, 0.8897],
        [0.0950, 0.5199, 0.3851]])
Sum of probabilities across classes (should be 1.0 for each sample): tensor([1.0000, 1.0000, 1.0000, 1.0000])


In [37]:
ce_loss_fn = nn.CrossEntropyLoss()

loss_ce = ce_loss_fn(logits, targets) # passing logits (pre softmax) as inputs

print(f"CE loss: {loss_ce.item():.4f}")

print(f"CE loss in functional representation: {F.cross_entropy(logits, targets):.4f}")

CE loss: 0.3569
CE loss in functional representation: 0.3569
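
The cross entropy value can also be recovered from the softmax probabilities: for each sample, take the probability assigned to the target class, apply the negative log, and average over samples. The names `picked` and `manual_ce` are used only in this sketch.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, 2.3, -0.5],
                       [3.1, 1.8, 0.9],
                       [2.8, -1.6, 4.9],
                       [0.5, 2.2, 1.9]])
targets = torch.tensor([1, 0, 2, 1])

probas = F.softmax(logits, dim=1)

# Probability of the correct class for each of the 4 samples
picked = probas[torch.arange(len(targets)), targets]
manual_ce = -torch.log(picked).mean()

print(f"Manual CE: {manual_ce.item():.4f}")
print(f"F.cross_entropy: {F.cross_entropy(logits, targets).item():.4f}")
```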


### Negative Log-Likelihood Loss

In [38]:
log_probas = torch.log_softmax(logits, dim=1)

print(log_probas)

tensor([[-1.4319, -0.3319, -3.1319],
        [-0.3245, -1.6245, -2.5245],
        [-2.2169, -6.6169, -0.1169],
        [-2.3541, -0.6541, -0.9541]])


In [39]:
nll_loss_fn = nn.NLLLoss()

loss_nll = nll_loss_fn(log_probas, targets) # passing log-probas (log softmax) as inputs

print(f"NLL loss: {loss_nll.item():.4f}")

print(f"NLL loss in functional representation: {F.nll_loss(log_probas, targets):.4f}")

NLL loss: 0.3569
NLL loss in functional representation: 0.3569
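
The NLL loss matches the cross entropy loss from the previous section because cross entropy on logits is equivalent to log-softmax followed by NLL. A short check of that equivalence, using the same tensors (the names `ce` and `nll` are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, 2.3, -0.5],
                       [3.1, 1.8, 0.9],
                       [2.8, -1.6, 4.9],
                       [0.5, 2.2, 1.9]])
targets = torch.tensor([1, 0, 2, 1])

ce = F.cross_entropy(logits, targets)                      # takes raw logits
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)    # takes log-probabilities

print(f"cross_entropy: {ce.item():.4f}, log_softmax + nll_loss: {nll.item():.4f}")
```

This is why `nn.NLLLoss` expects log-probabilities as input, while `nn.CrossEntropyLoss` expects unnormalized logits.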
