Okay, so before we dive into PyTorch, let's look at another library called NumPy. We can do a lot of the same things with NumPy that we can do with PyTorch. However, NumPy is not concerned with keeping track of gradients, computational graphs, or automatic differentiation. In short, NumPy is not a deep learning framework; it is a general-purpose numerical computing library.
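
To make that point concrete, here is a minimal sketch (the function and the values are made up for illustration): NumPy will happily compute a value, but the derivative has to be worked out and coded by hand, and the array itself carries no gradient bookkeeping such as the `.grad` attribute found on PyTorch tensors.

```python
import numpy as np

# f(w) = sum(w ** 2); NumPy computes the value but knows nothing about df/dw.
w = np.array([1.0, -2.0, 3.0])  # illustrative values, not from the notebook
f = np.square(w).sum()

# The derivative has to be derived on paper and written out manually: df/dw = 2 * w.
grad_w = 2.0 * w

# A plain NumPy array has no `.grad` attribute and records no computational graph.
has_grad = hasattr(w, "grad")

print("f:", f)
print("grad_w:", grad_w)
print("has .grad attribute:", has_grad)
```

This is exactly the manual work the rest of this notebook does for a two-layer network.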

In [1]:
# Import necessary libraries
import numpy as np  # NumPy for numerical operations and array handling
from rich import print  # Rich library for enhanced console output formatting  

In [2]:
# Define neural network dimensions
N, D_in, H, D_out = 64, 1000, 100, 10
# N: batch size (number of samples processed together)
# D_in: input dimension (number of input features)
# H: hidden layer dimension (number of neurons in hidden layer)
# D_out: output dimension (number of output classes/values)

print("N: ", N)
print("D_in: ", D_in)
print("H: ", H)
print("D_out: ", D_out)

# Create random input and output data for training
# x: input data matrix of shape (N, D_in) - each row is a sample, each column is a feature
x = np.random.randn(N, D_in)
# y: target/ground truth data matrix of shape (N, D_out) - what we want the network to output
y = np.random.randn(N, D_out)

# Randomly initialize weights using normal distribution
# w1: weights connecting input layer to hidden layer, shape (D_in, H)
w1 = np.random.randn(D_in, H)
# w2: weights connecting hidden layer to output layer, shape (H, D_out)
w2 = np.random.randn(H, D_out)

# print("x: ", x)
print("x.shape: ", x.shape)
# print("y: ", y)
print("y.shape: ", y.shape)
# print("w1: ", w1)
print("w1.shape: ", w1.shape)
# print("w2: ", w2)
print("w2.shape: ", w2.shape)
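
As a quick sanity check on how these shapes fit together, here is a small sketch with deliberately tiny, hypothetical dimensions and a seeded generator (both are assumptions for readability, not the values used in this notebook). It only traces the shapes through the two matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator: an assumption, for repeatable output
N, D_in, H, D_out = 4, 6, 5, 3   # tiny hypothetical sizes, not the notebook's values

x = rng.standard_normal((N, D_in))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

hidden = np.maximum(x.dot(w1), 0)  # (N, D_in) x (D_in, H) -> (N, H); ReLU keeps the shape
output = hidden.dot(w2)            # (N, H) x (H, D_out) -> (N, D_out)

print("hidden.shape:", hidden.shape)
print("output.shape:", output.shape)
```

The inner dimensions must match at each step, which is why `w1` is `(D_in, H)` and `w2` is `(H, D_out)`.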

In [3]:
# Training hyperparameters
learning_rate = 1e-6  # Step size for weight updates (how fast the model learns)
                      # Small value (1e-6) ensures stable but slow learning
iterations = 500      # Number of training epochs (complete passes through the data)

print("learning_rate: ", learning_rate)
print("iterations: ", iterations)
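
To get a feel for what the step size does, here is a toy sketch on a one-dimensional problem, f(w) = w ** 2, which is not the network trained in this notebook. The helper name `descend` and both learning rates are hypothetical choices for illustration only.

```python
def descend(lr, steps=50, w=5.0):
    """Run plain gradient descent on f(w) = w ** 2, whose gradient is 2 * w."""
    for _ in range(steps):
        w -= lr * 2.0 * w  # same update rule: new weight = old weight - lr * gradient
    return w

w_small_lr = descend(lr=0.1)  # each step shrinks w by a factor of 0.8
w_large_lr = descend(lr=1.1)  # each step multiplies w by -1.2, so it overshoots and grows

print("final w with lr=0.1:", w_small_lr)
print("final w with lr=1.1:", w_large_lr)
```

With the small step the iterate moves toward the minimum at 0; with the large step it overshoots further each time. What counts as "small" depends on the scale of the gradients, which is problem-specific.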

In [4]:
# Training loop - implement gradient descent manually
for i in range(iterations):
    
    # FORWARD PASS: compute predicted output
    # Step 1: Linear transformation from input to hidden layer
    h = x.dot(w1)  # Matrix multiplication: (N, D_in) × (D_in, H) = (N, H)
    
    # Step 2: Apply ReLU activation function to hidden layer
    # ReLU(x) = max(0, x) - sets negative values to 0, keeps positive values unchanged
    h_relu = np.maximum(h, 0)  # Element-wise maximum with 0
    
    # Step 3: Linear transformation from hidden layer to output
    y_pred = h_relu.dot(w2)  # Matrix multiplication: (N, H) × (H, D_out) = (N, D_out)

    # LOSS COMPUTATION: sum of squared errors (like MSE, but without dividing by the number of elements)
    # Calculate the difference between predicted and actual values, square it, and sum
    loss = np.square(y_pred - y).sum()  # Sum of squared errors across all samples and outputs
    print(f"Iteration {i}: loss = {loss}")

    # BACKWARD PASS: compute gradients using chain rule of calculus
    # We need to find how much each weight contributes to the loss (partial derivatives)
    
    # Step 1: Gradient of loss with respect to predictions
    # d(loss)/d(y_pred) = 2 * (y_pred - y) [derivative of squared error]
    grad_y_pred = 2.0 * (y_pred - y)  # Shape: (N, D_out)
    
    # Step 2: Gradient of loss with respect to w2
    # Using chain rule: d(loss)/d(w2) combines d(loss)/d(y_pred) with d(y_pred)/d(w2)
    # Since y_pred = h_relu.dot(w2), this works out to h_relu.T × grad_y_pred
    grad_w2 = h_relu.T.dot(grad_y_pred)  # Shape: (H, D_out)
    
    # Step 3: Gradient of loss with respect to hidden layer (after ReLU)
    # d(loss)/d(h_relu) = d(loss)/d(y_pred) × d(y_pred)/d(h_relu) = grad_y_pred × w2.T
    grad_h_relu = grad_y_pred.dot(w2.T)  # Shape: (N, H)
    
    # Step 4: Gradient of loss with respect to hidden layer (before ReLU)
    # ReLU derivative: 1 if input > 0, 0 if input < 0 (at exactly 0 the code below lets the gradient through)
    grad_h = grad_h_relu.copy()  # Start with gradient after ReLU
    grad_h[h < 0] = 0  # Set gradient to 0 where original h was negative (ReLU derivative)
    
    # Step 5: Gradient of loss with respect to w1
    # Since h = x.dot(w1), the chain rule gives d(loss)/d(w1) = x.T × d(loss)/d(h) = x.T × grad_h
    grad_w1 = x.T.dot(grad_h)  # Shape: (D_in, H)
    
    # WEIGHT UPDATE: Apply gradient descent
    # New weight = Old weight - learning_rate × gradient
    # This moves weights in the direction that reduces the loss
    w1 -= learning_rate * grad_w1  # Update input-to-hidden weights
    w2 -= learning_rate * grad_w2  # Update hidden-to-output weights 
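
Since the gradients are derived by hand, it can be worth checking them numerically. The following is a minimal sketch of a finite-difference gradient check, assuming tiny hypothetical dimensions and a seeded generator; the helper names `loss_fn`, `manual_grads`, and `numeric_grad` are made up for this example and do not appear elsewhere in the notebook.

```python
import numpy as np

def loss_fn(x, y, w1, w2):
    """Forward pass and sum-of-squared-errors loss, as in the training loop."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def manual_grads(x, y, w1, w2):
    """Hand-derived gradients, mirroring the backward pass in the training loop."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2

def numeric_grad(w, f, eps=1e-5):
    """Central-difference estimate of d f / d w, perturbing one entry of w at a time."""
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original entry
        g[idx] = (plus - minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)   # seeded generator: an assumption, for repeatability
N, D_in, H, D_out = 5, 4, 3, 2   # tiny hypothetical sizes so the check runs quickly
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

grad_w1, grad_w2 = manual_grads(x, y, w1, w2)
num_w1 = numeric_grad(w1, lambda: loss_fn(x, y, w1, w2))
num_w2 = numeric_grad(w2, lambda: loss_fn(x, y, w1, w2))

w1_ok = np.allclose(grad_w1, num_w1, rtol=1e-4, atol=1e-5)
w2_ok = np.allclose(grad_w2, num_w2, rtol=1e-4, atol=1e-5)
print("grad_w1 matches numerical estimate:", w1_ok)
print("grad_w2 matches numerical estimate:", w2_ok)
```

The check is slow (two loss evaluations per weight), so it is only practical on small shapes, but agreement here is good evidence that the chain-rule steps are right.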

I realize we have not yet explained most of the logic behind what we are doing here. I intend to do so in the next notebooks.