$$\Large\boxed{\text{AME 5202 Deep Learning, Even Semester 2026}}$$

$$\large\text{Theme}: \underline{\text{computational foundations of gradient}}$$

---

Load essential libraries

---

In [None]:
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline
import sys
import pickle
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, LabelEncoder
import nltk
from nltk.tokenize import word_tokenize
import seaborn as sns

---

Mount Google Drive folder if running Google Colab

---

In [None]:
## Mount Google drive folder if running in Colab
if('google.colab' in sys.modules):
    from google.colab import drive
    drive.mount('/content/drive', force_remount = True)
    DIR = '/content/drive/MyDrive/Colab Notebooks/MAHE/Office of Online Education/MDS6304_Webinar_October2025'
    DATA_DIR = DIR+'/Data/'
else:
    DATA_DIR = 'Data/'

---

Automatic differentiation in PyTorch.

Example: calculate the sensitivity of $L(w) = 4w+w^3$ w.r.t. the input $w$ at $w=1.$

Sensitivity $\nabla_wL = 4+3w^2,$ which at $w=1$ is equal to $4+3\times1^2=7.$
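
As a sanity check on the analytic value, the same derivative can be approximated with a central finite difference. This is a minimal sketch: the helper name `central_difference` and the step size `h` are illustrative assumptions, not part of the course material.

```python
import torch

def L_fn(w):
    # Same function as in the example: L(w) = 4w + w^3
    return 4 * w + w ** 3

def central_difference(f, w, h=1e-4):
    # Hypothetical helper: approximates df/dw by (f(w + h) - f(w - h)) / (2h)
    return (f(w + h) - f(w - h)) / (2 * h)

# Double precision keeps the round-off error of the difference quotient small
w0 = torch.tensor(1.0, dtype=torch.float64)
fd_grad = central_difference(L_fn, w0).item()
print(f'Finite-difference estimate at w = 1: {fd_grad:.6f}')
```

The estimate should agree with the analytic value $7$ up to the truncation error of the difference quotient.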

---

In [None]:
# 1. Create linspace and define function L(w)
w_vals = torch.linspace(-2, 2, 1000)
L_fn = lambda w: 4 * w + w ** 3

# 2. Plot L(w)
fig, ax = plt.subplots(figsize=(3, 3))
ax.plot(w_vals, L_fn(w_vals))
ax.set_xlabel('w')
ax.set_ylabel('L')
ax.axhline(y = 0, color = 'white')
ax.axvline(x = 0, color = 'white')

# Mark L(1)
ax.set_title('L(w) = 4w+w^3 at w=1')
ax.scatter(1, L_fn(torch.tensor(1.0)), c = 'red')
ax.scatter(1, 0, c='red')
plt.show()

# 3. Compute gradient of L at w = 1
w = torch.tensor(1.0, requires_grad = True)
L = 4 * w + w ** 3
L.backward()  # Compute gradient a.k.a. sensitivity
print('The gradient of L w.r.t. w at w = 1 is %f' % w.grad.item())

---

Another example with a negative gradient.

Calculate the sensitivity of $L(w) = 4w-w^3$ w.r.t. the input $w$ at $w=1.5.$

Sensitivity $\nabla_wL = 4-3w^2,$ which at $w=1.5$ is equal to $4-3\times1.5^2=-2.75.$

---

In [None]:
# 1. Create linspace and define function L(w)
w_vals = torch.linspace(-2, 2, 1000)
L_fn = lambda w: 4 * w - w ** 3

# 2. Plot L(w)
fig, ax = plt.subplots(figsize=(3, 3))
ax.plot(w_vals, L_fn(w_vals))
ax.set_xlabel('w')
ax.set_ylabel('L')
ax.axhline(y = 0, color = 'white')
ax.axvline(x = 0, color = 'white')

# Mark L(1.5)
ax.set_title('L(w) = 4w-w^3 at w=1.5')
ax.scatter(1.5, L_fn(torch.tensor(1.5)), c = 'red')
ax.scatter(1.5, 0, c='red')
plt.show()

# 3. Compute gradient of L at w = 1.5
w = torch.tensor(1.5, requires_grad = True)
L = 4 * w - w ** 3
L.backward()  # Compute gradient a.k.a. sensitivity
print('The gradient of L w.r.t. w at w = 1.5 is %f' % w.grad.item())

---

Function with multiple inputs.

Example: calculate the sensitivity of $L(w_1,w_2) = w_1+w_2^2$ w.r.t. the inputs $w_1, w_2$ at $w_1=1, w_2=2.$

Setting $\mathbf{w} = \begin{bmatrix}w_1\\w_2\end{bmatrix},$ sensitivity $\nabla_\mathbf{w}L= \begin{bmatrix}\nabla_{w_1}(w_1+w_2^2)\\\nabla_{w_2}(w_1+w_2^2)\end{bmatrix} = \begin{bmatrix}\nabla_{w_1}(w_1)+\nabla_{w_1}(w_2^2)\\\nabla_{w_2}(w_1)+\nabla_{w_2}(w_2^2)\end{bmatrix} =\begin{bmatrix}1+0\\0+2w_2\end{bmatrix}=\begin{bmatrix}1\\2w_2\end{bmatrix},$

 which at $w_1=1,w_2=2$ is equal to $\begin{bmatrix}1\\4\end{bmatrix}.$
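
The two partial derivatives can also be requested explicitly with `torch.autograd.grad`, which returns them as a tuple instead of storing them in `.grad`. This is a minimal sketch of that alternative; the names `dL_dw1` and `dL_dw2` are illustrative.

```python
import torch

w1 = torch.tensor(1.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
L = w1 + w2 ** 2

# torch.autograd.grad returns one gradient per listed input, in order,
# without writing to the .grad attributes of w1 and w2
dL_dw1, dL_dw2 = torch.autograd.grad(L, [w1, w2])
print(dL_dw1, dL_dw2)
```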

---

In [None]:
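# Define variables with gradient tracking
w1 = torch.tensor(1.0, requires_grad = True)
w2 = torch.tensor(2.0, requires_grad = True)

# Compute the function
L = w1 + w2 ** 2

# Compute gradients
L.backward()

# Print gradients
print(w1.grad, w2.grad)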

---

In the previous example, we could also calculate the sensitivity w.r.t. all the variables in the vector in one shot.

---

In [None]:
# Define variables as a vector with gradient tracking
w = torch.tensor([1.0, 2.0], requires_grad = True)

# Compute the function
L = w[0] + w[1]**2

# Compute gradients
L.backward()

# Print gradients
print(w.grad)

---

Another example with a tensor as input, where we calculate the sensitivity w.r.t. all the entries of the tensor at once.

Consider calculating the sensitivity of $L(\mathbf{w}) = \lVert \mathbf{w}\rVert^2$ for an $8$-vector $w$ at $\tiny w=\begin{bmatrix}0.1\\0\\0.1\\0\\0.1\\0\\0.1\\0\end{bmatrix}.$

We know that $\nabla_\mathbf{w}\left(\lVert\mathbf{w}\rVert^2\right)=2\mathbf{w}$ which evaluated at $\tiny w=\begin{bmatrix}0.1\\0\\0.1\\0\\0.1\\0\\0.1\\0\end{bmatrix}$ is $\tiny2\begin{bmatrix}0.1\\0\\0.1\\0\\0.1\\0\\0.1\\0\end{bmatrix}=\begin{bmatrix}0.2\\0\\0.2\\0\\0.2\\0\\0.2\\0\end{bmatrix}.$
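
The identity used here follows from writing the squared norm as a sum of squares and differentiating one component at a time:

$$\lVert\mathbf{w}\rVert^2=\sum_{i=1}^{8}w_i^2\quad\Rightarrow\quad\nabla_{w_j}\left(\lVert\mathbf{w}\rVert^2\right)=2w_j,\quad j=1,\dots,8.$$

Stacking the eight components gives $\nabla_\mathbf{w}\left(\lVert\mathbf{w}\rVert^2\right)=2\mathbf{w}.$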

---

In [None]:
# Define variables as a vector with gradient tracking
w = torch.tensor([0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0], requires_grad = True)

# Compute the function
L = torch.norm(w)**2

# Compute gradients
L.backward()

# Print gradients
print(w.grad)

---

Gradient calculation in PyTorch when a

- variable is part of the computation graph
- variable is marked non-trainable
- variable is not used at all
- variable is the result of tensor arithmetic (becoming a non-leaf tensor whose gradient is not retained)
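
These cases can be told apart by inspecting the `is_leaf`, `requires_grad`, and `grad_fn` attributes of each tensor. The sketch below is illustrative; the names `leaf`, `const`, `derived`, and `out` are assumptions chosen for this example, and `retain_grad()` is shown as one way to keep the gradient of a non-leaf tensor.

```python
import torch

leaf = torch.tensor(10.0, requires_grad=True)   # leaf tensor, tracked
const = torch.tensor(-2.0)                      # leaf tensor, not tracked
derived = leaf + 1.0                            # non-leaf: result of an operation

# Columns: name, is_leaf, requires_grad, has a grad_fn
for name, t in [('leaf', leaf), ('const', const), ('derived', derived)]:
    print(name, t.is_leaf, t.requires_grad, t.grad_fn is not None)

# retain_grad() asks autograd to keep the gradient of a non-leaf tensor
derived.retain_grad()
out = 4 * derived
out.backward()
print(leaf.grad, derived.grad)
```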

---

In [None]:
# Independent variable
w1 = torch.tensor(1.0, requires_grad = True)

# Constant tensor (non-trainable)
c1 = torch.tensor(-2.0)

# Variable treated as constant (no gradient tracking)
w2 = torch.tensor(2.0, requires_grad = False)

# Variable + constant → result is a non-leaf tensor (still tracked, but its .grad is not retained)
c2 = torch.tensor(10.0, requires_grad = True) + 1.0  # non-leaf tensor with a grad_fn

# Unused variable
w3 = torch.tensor(0.0, requires_grad = True)

# Forward computation
L = (w1 + c1)**2 + w2**3 + 4 * c2

# Backward
L.backward()

# Gradients
print(w1.grad)   # Used in computation: 2 * (w1 + c1) = -2
print(w2.grad)   # None: not tracked (requires_grad = False)
print(w3.grad)   # None: unused in L
print(c1.grad)   # None: c1 is a constant (no gradient tracking)
print(c2.grad)   # None, with a warning: c2 is a non-leaf tensor, so its .grad is not retained

---

Example: consider calculating the gradient of $\mathbf{a}(z) = \begin{bmatrix}a_1(z)\\a_2(z)\end{bmatrix}  = \begin{bmatrix}2z\\z^4\end{bmatrix}$ at $z = -1.$

The gradient is $\nabla_z(\mathbf{a})= \begin{bmatrix}\nabla_z(a_1) & \nabla_z(a_2)\end{bmatrix}=\begin{bmatrix}\nabla_z(2z) & \nabla_z\left(z^4\right)\end{bmatrix}=\begin{bmatrix}2&4z^3\end{bmatrix}.$

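Note that `backward()` needs a scalar to start from, so for a non-scalar output we backpropagate the sum of its components (equivalently, pass a vector of ones to `backward()`). What is returned is therefore the sum of the gradients $\nabla_z(a_1)+\nabla_z(a_2)=2+4z^3,$ which at $z=-1$ is equal to $2+4(-1)^3=-2.$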
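
To see the individual entries $2$ and $4z^3$ rather than their sum, one option is `torch.autograd.functional.jacobian`. This is a minimal sketch, assuming $z$ is passed as a scalar (0-dimensional) tensor; the names `a_fn`, `z0`, and `J` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

def a_fn(z):
    # a(z) = [2z, z^4] for a scalar (0-dim) tensor z
    return torch.stack([2 * z, z ** 4])

z0 = torch.tensor(-1.0)
J = jacobian(a_fn, z0)   # shape (2,): one entry per component of a
print(J)                 # individual gradients of a_1 and a_2
print(J.sum())           # their sum, matching what backward() on the summed output gives
```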

---

In [None]:
z = torch.tensor([-1.0], requires_grad = True)

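# Forward computation
a = torch.cat([2 * z, z ** 4])

# Backward (a is not a scalar, so backpropagate the sum of its components)
a.sum().backward()

# Gradients (stored on the leaf tensor z, not on a)
print(z.grad)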

---

Applying the gradient descent method with

- a maximum number of iterations equal to 1000
- a stopping tolerance equal to $10^{-6}$
- a learning rate of 0.01

 to minimize $$L(\mathbf{w}) = (w_1-2)^2+(w_2+3)^2$$ starting from $\mathbf{w} = \begin{bmatrix}w_1\\w_2\end{bmatrix}=\begin{bmatrix}0\\0\end{bmatrix}.$
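
The same iteration can also be written with PyTorch's built-in optimizer. The sketch below assumes `torch.optim.SGD` without momentum as a stand-in for a manual update, and uses double precision (an assumption, chosen so that round-off does not interfere with the $10^{-6}$ tolerance).

```python
import torch

# Double precision is an assumption made for this sketch
w = torch.tensor([0.0, 0.0], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=1e-2)

maxiter, tol = 1000, 1e-6
norm_grad = float('inf')
k = 0
while k < maxiter and norm_grad > tol:
    optimizer.zero_grad()                       # clear the previous gradient
    loss = (w[0] - 2) ** 2 + (w[1] + 3) ** 2
    loss.backward()                             # populate w.grad
    norm_grad = w.grad.norm().item()            # gradient norm at the current w
    optimizer.step()                            # w <- w - lr * w.grad
    k += 1

print(f'Stopped after {k} iterations at w = {w.detach().tolist()}')
```

The minimizer of $L$ is $\mathbf{w}=\begin{bmatrix}2\\-3\end{bmatrix},$ so the final weights should be close to these values.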

---

In [None]:
# Initialize weights as tensors with gradients
w = torch.tensor([0.0, 0.0], requires_grad = True)

# Hyperparameters
maxiter = 1000
tol = 1e-06
lr = 1e-02
norm_grad = float('inf')

k = 0
while k < maxiter and norm_grad > tol:
    # Zero the gradients
    if w.grad is not None:
        w.grad.zero_()

    # Define the loss function
    L = (w[0] - 2)**2 + (w[1] + 3)**2

    # Backpropagate to compute gradients
    L.backward()

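    # Update weights using gradient descent
    with torch.no_grad():
        # Update in place so that w stays the same leaf tensor;
        # w = w - lr * w.grad would rebind w to a new, untracked tensor
        w -= lr * w.grad

    # Compute the norm of the gradient (evaluated at the weights before this update)
    norm_grad = w.grad.norm().item()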
    k += 1

    print(f'Iteration {k}: ||grad|| = {norm_grad}')