<hr>
<h1>
Assignment 1
</h1>
<hr>

<br>

This notebook is an assignment focused on building a convolutional neural network (CNN) from scratch using NumPy. The main goal is to gain a deep understanding of the fundamental building blocks of a CNN. The notebook walks through the implementation of various layer types, including `affine (fully-connected)`, `convolutional`, `max-pooling`, and `ReLU` activation layers. For each layer, both the forward and backward passes are implemented and tested against numerical gradients to ensure correctness.

Finally, these layers are assembled into a simple CNN model, which is then trained on the `MNIST` dataset to demonstrate its functionality in an image classification task. The key takeaway is to learn the inner workings of a CNN, rather than just using a high-level deep learning framework.
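
The numerical gradient checks mentioned above rest on centered differences: each input entry is nudged by `+h` and `-h`, and the change in the output is divided by `2h`. The following is a minimal sketch for a scalar-valued function; the helper name `numerical_gradient` and the test function `sum(x**2)` are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx for a scalar-valued f."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Hypothetical test function: f(x) = sum(x**2), whose exact gradient is 2x.
x = np.array([[1.0, -2.0], [0.5, 3.0]])
num_grad = numerical_gradient(lambda v: np.sum(v ** 2), x)
max_abs_diff = np.max(np.abs(num_grad - 2 * x))
print(max_abs_diff)
```

A small difference between the numerical and analytic gradients suggests that the analytic formula is implemented correctly.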

In [1]:
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms


def rel_error(x, y):
    """Returns the maximum relative error between x and y."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))


def eval_numerical_gradient_array(f, x, df, h=1e-5):
    """
    Evaluate a numeric gradient for a function that accepts a numpy
    array and returns a numpy array.
    """
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        oldval = x[ix]
        x[ix] = oldval + h
        pos = f(x).copy()
        x[ix] = oldval - h
        neg = f(x).copy()
        x[ix] = oldval

        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

<br><br>
<hr>
<br><br>

## Affine (fully-connected) layers

### Affine forward

In [9]:
def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    ###########################################################################
    # TODO: Implement the affine forward pass. Store the result in out. You   #
    # will need to reshape the input into rows.                               #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    n = x.shape[0]
    x_rows = x.reshape(n, -1)  # flatten each example into a row of length D
    out = x_rows @ w + b

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return out, (x, w, b)


You can test your implementation by running the following:

In [10]:
# Test the affine_forward function

num_inputs = 2
input_shape = (4, 5, 6)
output_dim = 3

input_size = num_inputs * np.prod(input_shape)
weight_size = output_dim * np.prod(input_shape)

x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)
w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)
b = np.linspace(-0.3, 0.1, num=output_dim)

out, _ = affine_forward(x, w, b)
correct_out = np.array([[ 1.49834967,  1.70660132,  1.91485297],
                        [ 3.25553199,  3.5141327,   3.77273342]])


# Compare your output with ours. The error should be around e-9 or less.
print('Testing affine_forward function:')
print('difference: ', rel_error(out, correct_out))

Testing affine_forward function:
difference:  9.769849468192957e-10
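
The reshape step in the affine forward pass can be illustrated on its own. This is a small sketch, using made-up data with the same input shape as the test above, showing that `reshape(N, -1)` turns each example into one row of length `D = d_1 * ... * d_k`.

```python
import numpy as np

x = np.arange(2 * 4 * 5 * 6, dtype=float).reshape(2, 4, 5, 6)  # N = 2 examples
x_rows = x.reshape(x.shape[0], -1)  # shape (N, D) with D = 4 * 5 * 6 = 120

# Row i of x_rows holds exactly the elements of x[i], in C (row-major) order.
rows_match = np.array_equal(x_rows[1], x[1].ravel())
print(x_rows.shape, rows_match)
```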


### Affine backward

Now implement the `affine_backward` function and test your implementation using numeric gradient checking.
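
For `out = X @ W + b` with `X` already flattened to shape `(N, D)`, the chain rule gives `dX = dout @ W.T`, `dW = X.T @ dout`, and `db = dout.sum(axis=0)`. The sketch below, on random data chosen only for illustration, checks the shapes and confirms that the matrix form of `dW` equals the sum of per-example outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 4, 3, 2
X = rng.standard_normal((N, D))      # inputs, already flattened to rows
W = rng.standard_normal((D, M))
dout = rng.standard_normal((N, M))   # upstream gradient dL/d(out)

# Matrix forms for out = X @ W + b
dX = dout @ W.T          # (N, D)
dW = X.T @ dout          # (D, M)
db = dout.sum(axis=0)    # (M,)

# Same dW written as a sum of per-example outer products x_i^T dout_i
dW_outer = sum(np.outer(X[i], dout[i]) for i in range(N))
print(np.allclose(dW, dW_outer))
```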

In [11]:
def affine_backward(dout, cache):
    """
    Computes the backward pass for an affine layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d_1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """

    x,w,b = cache

    ###########################################################################
    # TODO: Implement the affine backward pass.                               #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    db = dout.sum(axis=0)
    dx = (dout @ w.T).reshape(x.shape)  # restore the original input shape
    x_rows = x.reshape(x.shape[0], -1)  # flatten each example into a row
    dw = x_rows.T @ dout

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dx,dw,db

In [12]:
# Test the affine_backward function
np.random.seed(231)
x = np.random.randn(10, 2, 3)
w = np.random.randn(6, 5)
b = np.random.randn(5)
dout = np.random.randn(10, 5)

dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)
db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

# The error should be around e-10 or less
print('Testing affine_backward function:')
print('dx error: ', rel_error(dx_num, dx))
print('dw error: ', rel_error(dw_num, dw))
print('db error: ', rel_error(db_num, db))

Testing affine_backward function:
dx error:  5.399100368651805e-11
dw error:  9.904211865398145e-11
db error:  2.4122867568119087e-11


<br><br>
<hr>
<br><br>

## Convolution layers

The core of a convolutional network is the convolution operation. Implement the forward and backward pass for the convolution layer.

You don't have to worry too much about efficiency at this point; just write the code in whatever way you find most clear.
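
The spatial output size of a convolution follows `H' = 1 + (H + 2 * pad - HH) / stride`, and the same formula applies to the width. As a quick sketch, the helper below (the name `conv_output_size` is a hypothetical choice) computes this and rejects configurations where the filter does not fit evenly.

```python
def conv_output_size(size, kernel, pad, stride):
    """Spatial output size of a convolution; raises if the filter does not tile evenly."""
    span = size + 2 * pad - kernel
    if span < 0 or span % stride != 0:
        raise ValueError("filter, padding and stride do not fit the input evenly")
    return 1 + span // stride

# 28x28 input, 4x4 filter, pad 1, stride 2 (values chosen for illustration)
h_out = conv_output_size(28, kernel=4, pad=1, stride=2)
print(h_out)
```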

### Convolution forward

In [13]:

def conv_forward_naive(x, w, b, conv_param):
    """
    A naive implementation of the forward pass for a convolutional layer.

    The input consists of N data points, each with C channels, height H and
    width W. We convolve each input with F different filters, where each filter
    spans all C channels and has height HH and width WW.

    Input:
    - x: Input data of shape (N, C, H, W)
    - w: Filter weights of shape (F, C, HH, WW)
    - b: Biases, of shape (F,)
    - conv_param: A dictionary with the following keys:
      - 'stride': The number of pixels between adjacent receptive fields in the
        horizontal and vertical directions.
      - 'pad': The number of pixels that will be used to zero-pad the input.

    Returns a tuple of:
    - out: Output data, of shape (N, F, H', W') where H' and W' are given by
      H' = 1 + (H + 2 * pad - HH) / stride
      W' = 1 + (W + 2 * pad - WW) / stride
    - cache: (x_pad, w, b, conv_param), where x_pad is the zero-padded input
    """
    ###########################################################################
    # TODO: Implement the convolutional forward pass.                         #
    # Hint: you can use the function np.pad for padding.                      #
    ###########################################################################
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    stride = conv_param['stride']
    pad = conv_param['pad']
    # Assuming that all filters have the same size and also that all images have the same size
    H_out = 1 + (H + 2 * pad - HH) / stride
    W_out = 1 + (W + 2 * pad - WW) / stride
    assert(H_out == int(H_out) and W_out == int(W_out))
    H_out, W_out = int(H_out), int(W_out)

    x_pad = np.pad(x, ((0,), (0,), (pad,), (pad,)), mode='constant', constant_values=0)
    out = np.zeros((N, F, H_out, W_out))

    # For a concrete image i, filter f, and output pixel (h_o, w_o), the value of this output pixel can be calculated
    # as a dot product of a certain number of image samples from the image i and the weights of the filter f (assuming
    # that we straighten them out into 1 dimension using ravel()). Note that as we change (h_o, w_o) but are still in the
    # same image i and filter f, the weights remain constant but the image samples change and they are given by a sliding
    # window looking into the image i.

    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    for f in range(F):
        for n in range(N):
            for i in range(H_out):
                for j in range(W_out):
                    h_start = i * stride
                    h_end = h_start + HH
                    w_start = j * stride
                    w_end = w_start + WW
                    window = x_pad[n, :, h_start:h_end, w_start:w_end]
                    out[n, f, i, j] = np.sum(window * w[f]) + b[f]


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    cache = (x_pad, w, b, conv_param)
    return out, cache


You can test your implementation by running the following:

In [14]:
x_shape = (2, 3, 4, 4) # N,C,H,W
w_shape = (3, 3, 4, 4) # F, C, HH, WW
x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)
w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)
b = np.linspace(-0.1, 0.2, num=3)

conv_param = {'stride': 2, 'pad': 1}
out, _ = conv_forward_naive(x, w, b, conv_param)
correct_out = np.array([[[[-0.08759809, -0.10987781],
                           [-0.18387192, -0.2109216 ]],
                          [[ 0.21027089,  0.21661097],
                           [ 0.22847626,  0.23004637]],
                          [[ 0.50813986,  0.54309974],
                           [ 0.64082444,  0.67101435]]],
                         [[[-0.98053589, -1.03143541],
                           [-1.19128892, -1.24695841]],
                          [[ 0.69108355,  0.66880383],
                           [ 0.59480972,  0.56776003]],
                          [[ 2.36270298,  2.36904306],
                           [ 2.38090835,  2.38247847]]]])

# Compare your output to ours; difference should be around 2e-8
print('Testing conv_forward_naive')
print('difference: ', rel_error(out, correct_out))

Testing conv_forward_naive
difference:  2.2121476417505994e-08
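
The naive loops are fine for this assignment, but the same sliding-window idea can be written without Python loops. The sketch below is one possible vectorized variant, assuming NumPy 1.20 or later for `sliding_window_view`; the function name `conv_forward_windows` is hypothetical and it is not the assignment's reference solution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_forward_windows(x, w, b, stride, pad):
    """Hypothetical vectorized forward pass; not part of the assignment code."""
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    HH, WW = w.shape[2], w.shape[3]
    # windows: (N, C, H_p - HH + 1, W_p - WW + 1, HH, WW)
    windows = sliding_window_view(x_pad, (HH, WW), axis=(2, 3))
    windows = windows[:, :, ::stride, ::stride]  # keep only strided positions
    # Contract over channels and the two filter axes.
    return np.einsum('nchwij,fcij->nfhw', windows, w) + b[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
w = rng.standard_normal((3, 3, 4, 4))
b = rng.standard_normal(3)
out = conv_forward_windows(x, w, b, stride=2, pad=1)

# Reference value for one output pixel, computed directly from the definition.
x_pad = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
ref = np.sum(x_pad[1, :, 2:6, 0:4] * w[2]) + b[2]   # n=1, f=2, i=1, j=0
print(out.shape, np.isclose(out[1, 2, 1, 0], ref))
```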


### Convolution backward

In [21]:

def conv_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a convolutional layer.

    Inputs:
    - dout: Upstream derivatives. (N, F, H_out, W_out)
    - cache: A tuple of (x_pad, w, b, conv_param) as returned by conv_forward_naive

    Returns a tuple of:
    - dx: Gradient with respect to x (N, C, H, W)
    - dw: Gradient with respect to w (F, C, HH, WW)
    - db: Gradient with respect to b (F,)
    """
    x_pad, w, b, conv_param = cache
    stride = conv_param['stride']
    pad = conv_param['pad']
    N = x_pad.shape[0]
    F, C, HH, WW = w.shape
    _, _, H_out, W_out = dout.shape
    # We keep the padding on dx so that we can keep the same indexing scheme for x_pad as for dx
    dx, dw, db = np.zeros_like(x_pad), np.zeros_like(w), np.zeros_like(b)

    ###########################################################################
    # TODO: Implement the convolutional backward pass.                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # db
    # Sum over all images, heights and widths, but not over axis 1, which indexes the filters.
    db = np.sum(dout, axis=(0, 2, 3))

    # dw
    # The partial derivative of the loss w.r.t. a weight of filter f is a sum, over all input images n,
    # of all the pixels that were multiplied by this weight and thus contributed to the value
    # of an output pixel at (n, f, h_o, w_o), each multiplied by dl/d(out[n, f, h_o, w_o]), i.e.,
    # the partial derivative of the loss w.r.t. that single output pixel.
    #
    # dx
    # The derivative of the loss w.r.t. an input pixel of image n is a sum, over all filters f and all
    # output pixels (n, f, h_o, w_o) that this pixel contributed to, of the filter weight that multiplied
    # the pixel, each multiplied by dl/d(out[n, f, h_o, w_o]).
    for f in range(F):
        for n in range(N):
            for i in range(H_out):
                for j in range(W_out):
                    h_start = i * stride
                    h_end = h_start + HH
                    w_start = j * stride
                    w_end = w_start + WW
                    grad_out = dout[n, f, i, j]
                    dw[f] += x_pad[n, :, h_start:h_end, w_start:w_end] * grad_out
                    dx[n, :, h_start:h_end, w_start:w_end] += w[f] * grad_out

    # Remove padding from dx
    if pad > 0:
        dx = dx[:, :, pad:-pad, pad:-pad]

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****#
    return dx, dw, db


You can test your implementation by running the following:

In [22]:
np.random.seed(231)
x = np.random.randn(4, 3, 5, 5)
w = np.random.randn(2, 3, 3, 3)
b = np.random.randn(2,)
dout = np.random.randn(4, 2, 5, 5)
conv_param = {'stride': 1, 'pad': 1}

dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)
db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)

out, cache = conv_forward_naive(x, w, b, conv_param)
dx, dw, db = conv_backward_naive(dout, cache)

# Your errors should be around 1e-8
print('Testing conv_backward_naive function')
print('db error: ', rel_error(db, db_num))
print('dw error: ', rel_error(dw, dw_num))
print('dx error: ', rel_error(dx, dx_num))

Testing conv_backward_naive function
db error:  3.3726153958780465e-11
dw error:  2.2471264748452487e-10
dx error:  1.159803161159293e-08
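
One detail of the backward pass deserves a note: stripping the padding with `dx[:, :, pad:-pad, pad:-pad]` only works for `pad > 0`, because `-0` equals `0` and the slice `0:-0` is empty. The sketch below shows the pitfall and one possible way around it with explicit end indices; the helper `strip_padding` is a hypothetical name.

```python
import numpy as np

dx_pad = np.ones((1, 1, 5, 5))

def strip_padding(a, pad):
    """Remove `pad` pixels from each spatial border; pad == 0 returns `a` unchanged."""
    H, W = a.shape[2], a.shape[3]
    return a[:, :, pad:H - pad, pad:W - pad]

naive_shape = dx_pad[:, :, 0:-0, 0:-0].shape   # -0 == 0, so the slice is empty
safe_shape = strip_padding(dx_pad, 0).shape
cropped_shape = strip_padding(dx_pad, 1).shape
print(naive_shape, safe_shape, cropped_shape)
```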


<br><br>
<hr>
<br><br>

## Max pooling


Implement the forward and backward passes for the max-pooling operation in the functions `max_pool_forward_naive` and `max_pool_backward_naive`. Again, don't worry too much about computational efficiency.
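
As a small worked example of what max pooling computes, the sketch below pools a 4x4 array with a 2x2 window and stride 2. It uses a reshape shortcut that is an assumption of this illustration: it only applies when the stride equals the pool size and the input divides evenly, so it does not replace the general loop-based implementation.

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)
# Split each axis into (block index, offset within block), then take the max
# over the two offset axes. This shortcut only works when the stride equals
# the pool size and the input divides evenly.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
```

Each output entry is the largest value in its 2x2 block, so the result here is `[[5, 7], [13, 15]]`.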



### Max pooling forward

In [23]:

def max_pool_forward_naive(x, pool_param):
    """
    A naive implementation of the forward pass for a max pooling layer.

    Inputs:
    - x: Input data, of shape (N, C, H, W)
    - pool_param: dictionary with the following keys:
      - 'pool_height': The height of each pooling region
      - 'pool_width': The width of each pooling region
      - 'stride': The distance between adjacent pooling regions

    Returns a tuple of:
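    - out: Output data, of shape (N, C, H', W') where H' and W' are given by
      H' = 1 + (H - pool_height) / stride
      W' = 1 + (W - pool_width) / stride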
    - cache: (x, pool_param)
    """
    ###########################################################################
    # TODO: Implement the max pooling forward pass                            #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    stride = pool_param['stride']
    pool_h = pool_param['pool_height']
    pool_w = pool_param['pool_width']
    N, C, H, W = x.shape
    # No padding is applied, so the output size is 1 + (input - pool) // stride.
    H_out = 1 + (H - pool_h) // stride
    W_out = 1 + (W - pool_w) // stride
    out = np.zeros((N, C, H_out, W_out))
    for n in range(N):
        for c in range(C):
            for i in range(H_out):
                for j in range(W_out):
                    h_start = i * stride
                    h_end = h_start + pool_h
                    w_start = j * stride
                    w_end = w_start + pool_w
                    out[n, c, i, j] = np.max(x[n, c, h_start:h_end, w_start:w_end])
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    cache = (x, pool_param)
    return out, cache



Check your implementation by running the following:

In [24]:
x_shape = (2, 3, 4, 4)
x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)
pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}

out, _ = max_pool_forward_naive(x, pool_param)

correct_out = np.array([[[[-0.26315789, -0.24842105],
                          [-0.20421053, -0.18947368]],
                         [[-0.14526316, -0.13052632],
                          [-0.08631579, -0.07157895]],
                         [[-0.02736842, -0.01263158],
                          [ 0.03157895,  0.04631579]]],
                        [[[ 0.09052632,  0.10526316],
                          [ 0.14947368,  0.16421053]],
                         [[ 0.20842105,  0.22315789],
                          [ 0.26736842,  0.28210526]],
                         [[ 0.32631579,  0.34105263],
                          [ 0.38526316,  0.4       ]]]])

# Compare your output with ours. Difference should be around 1e-8.
print('Testing max_pool_forward_naive function:')
print('difference: ', rel_error(out, correct_out))

Testing max_pool_forward_naive function:
difference:  4.1666665157267834e-08


### Max pooling backward

In [29]:

def max_pool_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a max pooling layer.

    Inputs:
    - dout: Upstream derivatives (N, C, H_out, W_out)
    - cache: A tuple of (x, pool_param) as in the forward pass.

    Returns:
    - dx: Gradient with respect to x (N, C, H, W)
    """
    ###########################################################################
    # TODO: Implement the max pooling backward pass                           #
    ###########################################################################
    x, pool_param = cache
    N, C, H_out, W_out = np.shape(dout)
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']
    dx = np.zeros_like(x) # (N, C, H, W)

    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    for n in range(N):
        for c in range(C):
            for i in range(H_out):
                for j in range(W_out):
                    h_start = i * stride
                    h_end = h_start + pool_height
                    w_start = j * stride
                    w_end = w_start + pool_width
                    window = x[n, c, h_start:h_end, w_start:w_end]
                    # argmax returns a flat (row-major) index, so divide by the window width
                    ii, jj = divmod(np.argmax(window), window.shape[1])
                    dx[n, c, h_start + ii, w_start + jj] += dout[n, c, i, j]

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dx


Check your implementation by running the following:

In [30]:
np.random.seed(231)
x = np.random.randn(3, 2, 8, 8) # (N, C, H, W)
dout = np.random.randn(3, 2, 4, 4) # (N, C, H_out, W_out)
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)

out, cache = max_pool_forward_naive(x, pool_param)
dx = max_pool_backward_naive(dout, cache)

# Your error should be around 1e-12
print('Testing max_pool_backward_naive function:')
print('dx error: ', rel_error(dx, dx_num))

Testing max_pool_backward_naive function:
dx error:  3.27562514223145e-12
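
Routing the gradient to the maximum element requires converting the flat index returned by `np.argmax` into a (row, column) offset inside the window. The sketch below uses a made-up, non-square 2x3 window to show that the flat index has to be divided by the window width; with square windows, as in the test above, width and height coincide and the distinction is invisible.

```python
import numpy as np

window = np.array([[0.0, 1.0, 2.0],
                   [3.0, 9.0, 4.0]])   # pool_height = 2, pool_width = 3
flat = np.argmax(window)               # index into the flattened window
row, col = np.unravel_index(flat, window.shape)
row_alt, col_alt = divmod(flat, window.shape[1])  # same result: divide by the width
print(int(row), int(col))
```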


<br><br>
<hr>
<br><br>

## ReLU activation functions

### ReLU forward

In [31]:
def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    ###########################################################################
    # TODO: Implement the ReLU forward pass.                                  #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    cache = x.copy()
    out = np.maximum(0, x)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return out, cache


Check your implementation by running the following:

In [32]:
# Test the relu_forward function

x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)

out, _ = relu_forward(x)
correct_out = np.array([[ 0.,          0.,          0.,          0.,        ],
                        [ 0.,          0.,          0.04545455,  0.13636364,],
                        [ 0.22727273,  0.31818182,  0.40909091,  0.5,       ]])

# Compare your output with ours. The error should be on the order of e-8
print('Testing relu_forward function:')
print('difference: ', rel_error(out, correct_out))

Testing relu_forward function:
difference:  4.999999798022158e-08


### ReLU backward

In [47]:
def relu_backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    ###########################################################################
    # TODO: Implement the ReLU backward pass.                                 #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    x = cache
    # Pass the upstream gradient through wherever the input was non-negative
    dx = dout * (x >= 0)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dx

Check your implementation by running the following:

In [48]:
# Test the relu_backward function
np.random.seed(231)
x = np.random.randn(10, 10)
dout = np.random.randn(*x.shape)

dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)

_, cache = relu_forward(x)
dx = relu_backward(dout, cache)

# The error should be on the order of e-12
print('Testing relu_backward function:')
print('dx error: ', rel_error(dx_num, dx))

Testing relu_backward function:
dx error:  3.2756349136310288e-12
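
The gradient check passes here because random inputs almost never land exactly on the kink at zero. As an illustrative sketch (the helper `relu` below is a stand-in, not the assignment function), a centered difference matches the analytic slope away from zero, but exactly at zero it yields 0.5, which agrees with neither the "slope 1" nor the "slope 0" convention.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

h = 1e-5
# Away from zero the centered difference matches the analytic slope (1 or 0).
slope_pos = (relu(0.3 + h) - relu(0.3 - h)) / (2 * h)
slope_neg = (relu(-0.3 + h) - relu(-0.3 - h)) / (2 * h)
# Exactly at zero it returns 0.5, which matches neither convention.
slope_zero = (relu(0.0 + h) - relu(0.0 - h)) / (2 * h)
print(slope_pos, slope_neg, slope_zero)
```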


<br><br>
<hr>
<br><br>

## Simple multi-layer network for MNIST

In [49]:
class SimpleCNN:
    def __init__(self):
        np.random.seed(231) # for reproducibility

        # Hardcoded network dimensions
        self.num_classes = 10
        self.hidden_dim = 1024

        # Layer shapes and parameters initialization
        self.conv_param1 = {'stride': 2, 'pad': 1}
        self.W1 = 0.01 * np.random.randn(8, 1, 4, 4) # First convolutional layer weights
        self.b1 = np.zeros(8) # First convolutional layer biases

        self.conv_param2 = {'stride': 1, 'pad': 1}
        self.W2 = 0.01 * np.random.randn(16, 8, 3, 3) # Second convolutional layer weights
        self.b2 = np.zeros(16) # Second convolutional layer biases

        self.W3 = 0.01 * np.random.randn(16 * 7 * 7, self.hidden_dim)  # Fully-connected layer weights
        self.b3 = np.zeros(self.hidden_dim) # Fully-connected layer biases

        self.W4 = 0.01 * np.random.randn(self.hidden_dim, self.num_classes) # Output layer weights
        self.b4 = np.zeros(self.num_classes) # Output layer biases


    def forward(self, X):
        # Forward pass through the network
        out1, self.cache1 = conv_forward_naive(X, self.W1, self.b1, self.conv_param1) # Conv1
        relu_out1, self.relu_cache1 = relu_forward(out1) # ReLU

        out2, self.cache2 = conv_forward_naive(relu_out1, self.W2, self.b2, self.conv_param2) # Conv2
        relu_out2, self.relu_cache2 = relu_forward(out2) # ReLU

        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
        out3, self.cache3 = max_pool_forward_naive(relu_out2, pool_param) # Max pooling

        self.cache_reshape = out3.shape
        out3_flat = out3.reshape(out3.shape[0], -1) # Flatten for fully-connected layer

        out4, self.cache4 = affine_forward(out3_flat, self.W3, self.b3) # Fully-connected layer

        relu_out4, self.relu_cache4 = relu_forward(out4) # ReLU

        scores, self.cache5 = affine_forward(relu_out4, self.W4, self.b4) # Output layer
        return scores


    def backward(self, scores, y):
        # Backward pass to compute gradients
        loss, grads = 0, {}

        N = scores.shape[0]
        shifted = scores - np.max(scores, axis=1, keepdims=True) # for numerical stability
        probs = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True) # Softmax probabilities

        loss = -np.sum(np.log(probs[np.arange(N), y])) / N # Cross-entropy loss

        dscores = probs.copy()
        dscores[np.arange(N), y] -= 1
        dscores /= N # Gradient of the loss with respect to scores

        dx5, dW4, db4 = affine_backward(dscores, self.cache5) # Backprop through output layer
        grads['W4'], grads['b4'] = dW4, db4

        drelu4 = relu_backward(dx5, self.relu_cache4) # Backprop through ReLU
        dx4, dW3, db3 = affine_backward(drelu4, self.cache4) # Backprop through fully-connected layer
        grads['W3'], grads['b3'] = dW3, db3

        dx3 = dx4.reshape(self.cache_reshape) # Reshape back to pooled feature map shape
        dpool = max_pool_backward_naive(dx3, self.cache3) # Backprop through max pooling

        drelu2 = relu_backward(dpool, self.relu_cache2) # Backprop through ReLU
        dx2, dW2, db2 = conv_backward_naive(drelu2, self.cache2) # Backprop through Conv2
        grads['W2'], grads['b2'] = dW2, db2

        drelu1 = relu_backward(dx2, self.relu_cache1) # Backprop through ReLU
        dx1, dW1, db1 = conv_backward_naive(drelu1, self.cache1) # Backprop through Conv1
        grads['W1'], grads['b1'] = dW1, db1
        return loss, grads


    def update(self, grad, learning_rate):
        # Update model parameters using gradients
        for param in ['W1', 'b1', 'W2', 'b2', 'W3', 'b3', 'W4', 'b4']:
            self.__dict__[param] -= learning_rate * grad[param]
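
The hardcoded `16 * 7 * 7` input size of the first fully-connected layer follows from the layer parameters. The sketch below traces the spatial size of a 28x28 MNIST image through the network; the helper names `conv_size` and `pool_size` are hypothetical.

```python
def conv_size(size, kernel, pad, stride):
    return 1 + (size + 2 * pad - kernel) // stride

def pool_size(size, pool, stride):
    return 1 + (size - pool) // stride

side = 28                                          # MNIST images are 28x28
side = conv_size(side, kernel=4, pad=1, stride=2)  # Conv1 -> 14
side = conv_size(side, kernel=3, pad=1, stride=1)  # Conv2 -> 14
side = pool_size(side, pool=2, stride=2)           # 2x2 max pool -> 7
flat_dim = 16 * side * side                        # 16 channels after Conv2
print(side, flat_dim)
```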

<br><br>
<hr>
<br><br>

## Train network

### Initialize network and hyperparameters

In [50]:
learning_rate = 0.1
num_epochs = 10
batch_size = 20
simple_model = SimpleCNN()

max_data = 100

### Load MNIST dataset

In [55]:
# Define preprocessing steps
transform = transforms.Compose([
    transforms.ToTensor(),                      # Converts to [0,1] range
    transforms.Normalize((0.5,), (0.5,))        # Normalization to [-1,1]
])

trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                        download=True, transform=transform)

trainset = torch.utils.data.Subset(trainset, range(max_data))

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)
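
`transforms.Normalize((0.5,), (0.5,))` applies `(x - mean) / std` to each pixel, which maps the `[0, 1]` range produced by `ToTensor` to `[-1, 1]`. A minimal NumPy sketch of the same arithmetic, on made-up pixel values:

```python
import numpy as np

pixels = np.array([0.0, 0.25, 0.5, 1.0])   # values after ToTensor, in [0, 1]
mean, std = 0.5, 0.5
normalized = (pixels - mean) / std          # -> [-1.0, -0.5, 0.0, 1.0]
print(normalized)
```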


### Run training
The goal is to train the model on a small MNIST subset (the first 100 training images). If the loss decreases, your implementation works; there is no need for optimal performance here. This confirms basic functionality, and further tuning can come later.

In [56]:
def train(model, dataloader, lr, num_epochs):
    for epoch in range(num_epochs):
        epoch_loss = 0
        for i, (X_batch, y_batch) in enumerate(dataloader):
            # Convert tensors to numpy arrays
            X_batch = X_batch.numpy()
            y_batch = y_batch.numpy()

            scores = model.forward(X_batch)
            loss, grads = model.backward(scores, y_batch)

            model.update(grads, lr)
            epoch_loss += loss  # running sum of the per-batch mean losses

        print(f"Epoch {epoch+1:02d}, Loss: {epoch_loss:.4f}")


train(simple_model, trainloader, learning_rate, num_epochs)

Epoch 01, Loss: 11.5146
Epoch 02, Loss: 11.4963
Epoch 03, Loss: 11.4822
Epoch 04, Loss: 11.4661
Epoch 05, Loss: 11.4549
Epoch 06, Loss: 11.4428
Epoch 07, Loss: 11.4322
Epoch 08, Loss: 11.4218
Epoch 09, Loss: 11.4122
Epoch 10, Loss: 11.4055
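
The printed values are sums of the per-batch losses, not averages. As a rough sanity check under the assumption that near-zero initial weights give an almost uniform softmax, each batch should start near `ln(10)`, and with 100 examples in batches of 20 there are 5 batches per epoch. The sketch below works through that arithmetic.

```python
import math

num_examples, batch_size, num_classes = 100, 20, 10
batches_per_epoch = math.ceil(num_examples / batch_size)   # 5
# With near-zero weights the softmax is close to uniform, so the mean
# cross-entropy per batch is close to ln(num_classes).
chance_loss_per_batch = math.log(num_classes)               # about 2.3026
chance_epoch_sum = batches_per_epoch * chance_loss_per_batch
final_mean_loss = 11.4055 / batches_per_epoch               # epoch 10 value from the log
print(round(chance_epoch_sum, 4), round(final_mean_loss, 4))
```

The chance-level sum of about 11.51 is close to the first epoch's value, and the final per-batch mean of about 2.28 is only slightly below `ln(10)`, so the loss is decreasing slowly.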
