# Back-Propagation

The main purpose of this session is to demonstrate how PyTorch calculates gradients, first for a simple algebraic function and then for a neural network model.

## Simple function

Let's define a simple 1-dimensional algebraic function, just to simplify things: $y(x) = x^2 + 3x$.

We will operate on a PyTorch tensor `x`, keeping track of the computational graph along the way, then compute the gradients in the backward pass. (Remember that the gradients are calculated from the output layer $l=n$ back through the preceding layers $n-1, n-2, \ldots, 1$.)
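To make the idea of a tracked computational graph concrete, here is a minimal sketch that inspects the `grad_fn` attribute PyTorch attaches to each result. It uses the same function as this section; the variable names are only illustrative.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x**2 + 3*x
loss = y.sum()

# x was created directly by the user, so it is a leaf tensor with no grad_fn
print(x.is_leaf, x.grad_fn)

# y and loss were produced by operations, so each one records the node
# that autograd will use for it during the backward pass
print(y.grad_fn)
print(loss.grad_fn)
```

The chain of `grad_fn` nodes is what `loss.backward()` walks, from `loss` back to `x`.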

In [21]:
import torch

# 1. Define tensor with requires_grad=True
x = torch.tensor([2.0, 3.0], requires_grad=True)

# 2. Perform operations (Computational Graph created here)
y = x**2 + 3*x
loss = y.sum()

# 3. Compute gradients (Backward Pass)
loss.backward()

# 4. Access gradients
print(x.grad)

tensor([7., 9.])


This is an example of a simple $y=f(x)$. Even though we define ${\bf x}=(2,3)$, we are really using only one $x$-value at a time. The tensor is actually a set of 2 data points, each with a single input feature.

You can see that we actually get the calculated gradients for each of the input data points.
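One way to check the printed gradients is to compare them with the derivative worked out by hand, $dy/dx = 2x + 3$. The sketch below repeats the computation and evaluates the analytic expression at the same points; the name `analytic` is just an illustrative choice.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (x**2 + 3*x).sum()
loss.backward()

# Analytic derivative dy/dx = 2x + 3, evaluated outside the graph
analytic = 2 * x.detach() + 3

print(x.grad)                            # tensor([7., 9.])
print(analytic)                          # tensor([7., 9.])
print(torch.allclose(x.grad, analytic))  # True
```

Because `loss` is a plain sum over the data points, the gradient with respect to each entry of `x` is just the derivative of $y$ at that point.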

Be aware that gradients accumulate: each call to `backward()` adds the newly computed gradients to whatever is already stored in the `.grad` attribute instead of overwriting it.

This is why we call `optimizer.zero_grad()`, `model.zero_grad()`, or `tensor.grad.zero_()` for each training batch: clearing the stored gradients prevents them from interfering with the next batch's calculations.
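The accumulation behaviour can be seen with a small sketch that rebuilds the graph with a fresh forward pass before each backward pass. The variable names (`first`, `accumulated`, `after_reset`) are illustrative only.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

# First forward/backward pass
(x**2 + 3*x).sum().backward()
first = x.grad.clone()         # tensor([7., 9.])

# Second pass without zeroing: the new gradients are added to the old ones
(x**2 + 3*x).sum().backward()
accumulated = x.grad.clone()   # tensor([14., 18.])

# Reset the stored gradients in place, then run a third pass
x.grad.zero_()
(x**2 + 3*x).sum().backward()
after_reset = x.grad.clone()   # tensor([7., 9.])

print(first, accumulated, after_reset)
```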

What happens if you call `loss.backward()` a second time without rebuilding the graph?

In [5]:
loss.backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
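As the error message suggests, passing `retain_graph=True` to the first call keeps the saved intermediate values so that a second backward pass is possible. A minimal sketch using the same function:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (x**2 + 3*x).sum()

loss.backward(retain_graph=True)  # keep the saved intermediate values
loss.backward()                   # the second call now succeeds

# The two passes accumulate into x.grad
print(x.grad)  # tensor([14., 18.])
```

In an ordinary training loop this is rarely needed, because each batch runs a new forward pass and therefore builds a new graph.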

## Neural Network

Now we'll calculate gradients for a small neural network (a multi-layer perceptron) with a single hidden layer.

For this exercise we will use the ReLU activation function instead of the sigmoid function.

In [11]:
import torch.nn as nn

torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.ReLU()
)

While the computational graph might be set up in a symbolic way, computing actual gradient values requires numerical inputs to the model. (Why?)

If we don't have any output targets, then we cannot calculate a loss function value, so we cannot calculate a gradient for the weights or biases.
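A quick way to see this is to inspect the parameters of a freshly built model: before any loss has been back-propagated, every `.grad` attribute is still `None`. The sketch below rebuilds the same architecture; `grads_before` and `n_params` are illustrative names.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.ReLU()
)

# No forward or backward pass has run yet, so no gradients are stored
grads_before = {name: p.grad for name, p in model.named_parameters()}
print(grads_before)   # every value is None

# Parameter count: (3*2 + 3) for the first layer, (1*3 + 1) for the second
n_params = sum(p.numel() for p in model.parameters())
print(n_params)       # 13
```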

Let's create some random input data and a fixed set of binary targets, just so we have something to work with. Note the size of the input features and the output targets.

In [15]:
# Create a small batch of data compatible with our model
batch_size = 3
input_features = 2
num_classes = 1

# Random input data
X = torch.randn(batch_size, input_features)
# Binary target labels (fixed, not random), reshaped to (batch_size, 1)
y = torch.tensor([0, 1, 0], dtype=torch.float32).unsqueeze(1)

Now we can feed our input data forward through the DNN model.

We define a loss function so that we can get a gradient $\partial L/\partial w$.
Then we clear the gradient data structure and perform the backpropagation through the network model.

In [18]:
output = model(X)
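# BCEWithLogitsLoss applies the sigmoid internally and expects raw logits.
# Note: the final nn.ReLU() in this model clamps those logits to be non-negative.
criterion = nn.BCEWithLogitsLoss()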
loss = criterion(output, y)

# Zero out any existing gradients
model.zero_grad()

# Perform backpropagation
loss.backward()
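To make the loss less of a black box, the following sketch compares `nn.BCEWithLogitsLoss` with the same quantity written out by hand: a sigmoid followed by the mean binary cross-entropy, $L = -\frac{1}{N}\sum_i \left[ y_i \log \sigma(z_i) + (1-y_i)\log(1-\sigma(z_i)) \right]$. The `logits` tensor is a random stand-in for the model output, not the values from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 1)                     # stand-in for model(X)
targets = torch.tensor([[0.0], [1.0], [0.0]])

builtin = nn.BCEWithLogitsLoss()(logits, targets)

# Same quantity written out: sigmoid, then mean binary cross-entropy
p = torch.sigmoid(logits)
manual = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()

print(torch.allclose(builtin, manual, atol=1e-6))
```

The built-in version combines the two steps for better numerical stability, so the hand-written form is meant for understanding only.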

We can peek into the `grad` attribute of each model parameter to see the numerical values of the gradient.

Do you think the numerical values will change when the input dataset changes? Will this be a problem?

In [19]:
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"\n{name}:")
        print(f"  Shape: {param.grad.shape}")
        print(f"  Mean gradient: {param.grad.mean().item():.6f}")
        print(f"  Std gradient: {param.grad.std().item():.6f}")
        print(f"  Gradient (first few values):\n  {param.grad.flatten()[:5]}")
    else:
        print(f"\n{name}: No gradient computed")


0.weight:
  Shape: torch.Size([3, 2])
  Mean gradient: 0.000342
  Std gradient: 0.057776
  Gradient (first few values):
  tensor([-0.0002, -0.0829,  0.0002,  0.0982, -0.0127])

0.bias:
  Shape: torch.Size([3])
  Mean gradient: 0.013294
  Std gradient: 0.046669
  Gradient (first few values):
  tensor([-0.0399,  0.0473,  0.0325])

2.weight:
  Shape: torch.Size([1, 3])
  Mean gradient: 0.151282
  Std gradient: 0.062465
  Gradient (first few values):
  tensor([0.0827, 0.1662, 0.2049])

2.bias:
  Shape: torch.Size([1])
  Mean gradient: 0.300815
  Std gradient: nan
  Gradient (first few values):
  tensor([0.3008])
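The `nan` reported for `2.bias` comes from the standard deviation itself, not from the gradient: by default `std()` divides by $N-1$, which is zero for a single-element tensor. A small sketch, using the printed value as an illustrative input:

```python
import torch

g = torch.tensor([0.3008])

# Default (sample) std divides by N - 1; with N = 1 the result is undefined
sample_std = g.std()

# The population form divides by N instead and is defined for N = 1
population_std = g.std(unbiased=False)

print(sample_std)       # tensor(nan)
print(population_std)   # tensor(0.)
```

PyTorch also emits a warning about the degrees of freedom in the first case.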



Can you think of any reason we care about the mean of the gradient or the standard deviation of the gradient?

Do you expect the gradients of the weights and biases to be similar or different?