Simple implementation of an neural network of N size:
    - where N is the number of layers that constitute the network.

Unlike previous iterations of the net we have created, for this one we would have to modularize the initialization of the network's parameters, as well as make the forward and backward passes that handles those N layers.

Also, unlike previous iterations of the net we have created, this one will include 'batch normalization', which is used to maintain a fixed distibution of the values of each layer of the network as the network trains. This mitigates 'Internal Covariate Shift', which is the phenomenon that describes the change in the distribution of the network activations due to change in network parameters.

We will also implement dropout, which drops random neurons in each layer during training, so as to help mitigate overfitting, as well as allow neurons which would not commonly be used for training to be used more readily, so as to reduce co-adaptation of the neurons.

We will ask the user for the number of total layers for the network to be created, as well as the size of each of its constituent hidden layers.

In [2]:
import numpy as np
import random

# Initialization of the network parameters input X, and the labels y, as well as hidden dims.
np.random.seed(231)
num_layers = int(input('how many layers do you want in your network?'))
H_dims = [int(input('size of layer'+str(x))) for x in range(1, num_layers)]
N, D, C = 50, 3*32*32, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

# Some hyperparameters to be used.
weight_scale = 5e-2
eps = 1e-5
reg_strength = 0.01
learning_rate = 0.2
momentum = 0.9
model_params = {}

# Weight and Biases initialization. Can handle an arbitrary number of hidden dimensions.
# Also, addition of gamma and beta parameters, which are used for the batch normalization .
for layer, hidden_dims in enumerate(H_dims):
    if layer == 0:
        model_params['W'+str(layer+1)] = weight_scale * np.random.randn(D, hidden_dims)
        model_params['b'+str(layer+1)] = np.zeros(hidden_dims)
        model_params['gamma'+str(layer+1)] = np.ones(hidden_dims)
        model_params['beta'+str(layer+1)] = np.zeros(hidden_dims)
    else:
        model_params['W'+str(layer+1)] = weight_scale * np.random.randn(H_dims[layer-1], hidden_dims)
        model_params['b'+str(layer+1)] = np.zeros(hidden_dims)
        model_params['gamma'+str(layer+1)] = np.ones(hidden_dims)
        model_params['beta'+str(layer+1)] = np.zeros(hidden_dims)
model_params['W'+str(num_layers)] = weight_scale * np.random.randn(H_dims[-1], C)
model_params['b'+str(num_layers)] = np.zeros(C)

# print
for p in model_params.keys():
    print(p, model_params[p].shape)

W1 (3072, 100)
b1 (100,)
gamma1 (100,)
beta1 (100,)
W2 (100, 100)
b2 (100,)
gamma2 (100,)
beta2 (100,)
W3 (100, 100)
b3 (100,)
gamma3 (100,)
beta3 (100,)
W4 (100, 100)
b4 (100,)
gamma4 (100,)
beta4 (100,)
W5 (100, 10)
b5 (10,)


N layer NeuralNet class implementation.

In [122]:
# Network architecture: 
# (affine-relu) * num_layers-1 -> (affine-softmax)

class NeuralNet:
    def __init__(self):
        pass

    def affine_forward(self, x, w, b):
        out = x.reshape(x.shape[0], -1).dot(w) + b
        cache = (x, w, b)
        return out, cache
    
    def affine_backward(self, dout, cache):
        x, w, b = cache
        dx = dout.dot(w.T)
        dw = x.T.dot(dout)
        db = np.sum(dout, axis=0)
        return dx, dw, db
    
    def relu_forward(self, x):
        out = np.maximum(0, x)
        cache = x
        return out, cache
    
    def relu_backward(self, dout, cache):
        x = cache
        return dout * (x > 0)
    
    def batchnorm_forward(self, x, gamma, beta, eps):
        # Do the normalization of input x: batch mean, variance, std.
        b_mean = np.mean(x, axis=0)
        b_var = np.var(x, axis=0)
        b_std = np.sqrt(b_var + eps)
        x_norm = (x - b_mean) / b_std

        # Applying linear transformation on the normalized x. (gamma and beta):
        out = (gamma * x_norm) + beta
        cache = (x, x_norm, gamma, beta, b_mean, b_var, b_std, eps)
        return out, cache
    
    def batchnorm_backward(self, dout, cache):
        # Computation of the gradients of x, gamma, and beta.
        x, x_norm, gamma, beta, b_mean, b_var, b_std, eps = cache
        dbeta = np.sum(dout, axis=0)                                      # derivative of beta wrt. dout. 
        dgamma = np.sum(x_norm * gamma, axis=0)                           # derivative of gamma wrt. dout.
        dx_norm =  dout * gamma                                           # derivative of x_norm wrt. dout (mult. rule)
        d_std = -np.sum(dx_norm * (x - b_mean), axis=0) / (b_std**2)      # derivative of std.
        d_variance = 0.5 *  d_std / b_std                                 # derivative of variance.
        dx1  = dx_norm / b_std + 2 * (x - b_mean) * d_variance / len(dout)# derivative of dx1. 
        d_mean = -np.sum(dx1, axis=0)                                     # derivative of the mean.
        dx2 = d_mean / len(dout)                                          # derivative of dx2.
        dx = dx1 + dx2
        return dx, dgamma, dbeta
    
    def print_mean_std(self, x):
        # Take the value from the batchnorm forward pass, and calculate the mean and std.
        # Mean should be close to 0, std should be close to 1.
        mean = np.mean(x, axis=0)
        std = np.sqrt(np.var(x, axis=0) + eps)
        return mean, std

    def softmax(self, x, y):
        exp_values = np.exp(x - np.max(x, axis=1, keepdims=True))
        softmax = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        correct_scores = np.zeros(x.shape)
        correct_scores[range(len(y)), y] = 1
        loss = np.mean(-np.log(np.sum(softmax * correct_scores, axis=1, keepdims=True)))
        dL = (softmax - correct_scores) / len(softmax)
        reg_loss = []
        for key, W in model_params.items():
            if 'W' in key:
                reg_loss.append(np.sum(np.square(W)))
        reg_loss = 0.05 * reg_strength * np.sum(reg_loss)
        #loss += reg_loss
        return loss, dL

Forward pass implementation. Goes through affine and relu methods, then computes the loss and gradients of the loss wrt. the scores.

In [61]:
# Forward pass.
nn = NeuralNet()
# Lists that holds the layer outputs, as well as the layer and relu cache values, which is used for backprop.
layer_out = []
layer_cache = []
relu_cache = []

# Loop that computes the forward pass an arbitrary amount of times (number of layers)
for layer in range(len(H_dims)):
    # If the first layer, use input X in the affine_forward() method.
    if layer == 0:
        w, b = model_params['W'+str(layer+1)], model_params['b'+str(layer+1)]
        l_values, l_cache_values = nn.affine_forward(X, w, b)
        
    # Otherwise, input x is the value of the relu activation function, which is in the layer_out list.
    else:
        x, w, b = layer_out[layer-1], model_params['W'+str(layer+1)], model_params['b'+str(layer+1)]
        l_values, l_cache_values = nn.affine_forward(x, w, b)

    # Compute the relu activation function, append relu result in layer_out list.
    # Append the cache values from the affine and relu passes into layer_cache.
    r_values, r_cache_values = nn.relu_forward(l_values)
    layer_out.append(r_values)
    layer_cache.append(l_cache_values)
    relu_cache.append(r_cache_values)
    # layer_cache.append((l_cache_values, r_cache_values))

# For the last layer, compute affine, append last layer cache, then compute the loss and dL
x, w, b = layer_out[-1], model_params['W'+str(num_layers)], model_params['b'+str(num_layers)]
l_values, l_cache_values = nn.affine_forward(x, w, b)
layer_cache.append(l_cache_values)
loss, dL = nn.softmax(l_values, y)
print(f'loss: {loss}')
print(f'the shape of the gradients of dL: {dL.shape}')

loss: 2.301195972715804
the shape of the gradients of dL: (50, 10)


  loss += 0.05 * reg_strength * np.sum([np.sum(W**2) for key, W in model_params.items() if 'W' in W])


In [6]:
'''Backward pass:
    We will compute the gradients of the weights and biases for each layer, as well as the gradient of x (dout).
    We will do the chain rule of the dout with the derivative of the activation function.
    This result is used to backprop the gradients to the preceeding layer.
    When we reach the end layer (first layer), the gradient of x will now have to be multiplied with anything
        else, as this is the end of the backprop through the network.
'''
grads = {}

for layer in reversed(range(num_layers)):
    if layer == num_layers - 1:
        doutx, grads['W'+str(layer+1)], grads['b'+str(layer+1)] = nn.affine_backward(dL, layer_cache[layer])
        drelu = nn.relu_backward(doutx, relu_cache[layer-1])  
    elif layer == 0:
        break
    else:
        doutx, grads['W'+str(layer+1)], grads['b'+str(layer+1)] = nn.affine_backward(drelu, layer_cache[layer])
        drelu = nn.relu_backward(doutx, relu_cache[layer-1])

dx, grads['W'+str(layer+1)], grads['b'+str(layer+1)] = nn.affine_backward(drelu, layer_cache[layer])

for grad in grads.keys():
    print(grad, grads[grad].shape)
print(dx.shape)

W5 (100, 10)
b5 (10,)
W4 (100, 100)
b4 (100,)
W3 (100, 100)
b3 (100,)
W2 (100, 100)
b2 (100,)
W1 (3072, 100)
b1 (100,)
(50, 3072)


In [48]:
# This is a improved version of the above backward pass.
grads = {}
doutx = dL

# Loop backwards through the layers, computing the gradients and updating the grads dict.
for layer in reversed(range(num_layers)):
    doutx, grads['W'+str(layer+1)], grads['b'+str(layer+1)] = nn.affine_backward(doutx, layer_cache[layer])

# As long as the layer is not the first one, compute the relu backward pass.
    if layer != 0:
        doutx = nn.relu_backward(doutx, relu_cache[layer-1])

dx = doutx
for grad in grads.keys():
    print(grad, grads[grad].shape)

IndexError: list index out of range

In [127]:
# Lets now implement the batch norm into the network.
# For simplicity, lets make a new class called 'Solver', which will constitute the forward and backward passes.

class Solver:
    def __init__(self):
        pass

    def forward_pass(self, model_params):
        # Forward pass.
        nn = NeuralNet()
        layer_out = []
        affine_cache = []
        batch_cache = []
        relu_cache = []

        # Loop that computes the forward pass an arbitrary amount of times (number of layers)
        for layer in range(len(H_dims)):
            # If the first layer, use input X in the affine_forward() method.
            if layer == 0:
                w, b = model_params['W'+str(layer+1)], model_params['b'+str(layer+1)]
                a_values, a_cache_values = nn.affine_forward(X, w, b)
                
            # Otherwise, input x is the value of the relu activation function, which is in the layer_out list.
            else:
                x, w, b = layer_out[layer-1], model_params['W'+str(layer+1)], model_params['b'+str(layer+1)]
                a_values, a_cache_values = nn.affine_forward(x, w, b)

            # Apply batchnorm to the affine forward output.
            gamma, beta = model_params['gamma'+str(layer+1)], model_params['beta'+str(layer+1)]
            b_values, b_cache_values = nn.batchnorm_forward(a_values, gamma, beta, eps)

            # Compute the relu activation function using output from batchnorm. Append relu output in layer_out list.
            r_values, r_cache_values = nn.relu_forward(b_values)
            layer_out.append(r_values)
            # Append the cache values from the affine, batch, and relu passes into its respective cache lists.
            affine_cache.append(a_cache_values)
            batch_cache.append(b_cache_values)
            relu_cache.append(r_cache_values)

        # For the last layer, compute affine, append last layer cache, then compute the loss and dL
        x, w, b = layer_out[-1], model_params['W'+str(num_layers)], model_params['b'+str(num_layers)]
        a_values, a_cache_values = nn.affine_forward(x, w, b)
        affine_cache.append(a_cache_values)
        loss, dL = nn.softmax(a_values, y)
        print(f'loss: {loss}')
        print(f'the shape of the gradients of dL: {dL.shape}')
        return loss, dL, affine_cache, batch_cache, relu_cache

    def backward_pass(self, loss, dL, affine_cache, batch_cache, relu_cache):
        # Now, lets implment the backward pass, with batchnorm in mind.
        # This is a improved version of the above backward pass.
        grads = {}
        doutx = dL

        # Loop backwards through the layers, computing the gradients and updating the grads dict.
        for layer in reversed(range(num_layers)):
            #print('layer'+str(layer+1))
            doutx, grads['W'+str(layer+1)], grads['b'+str(layer+1)] = nn.affine_backward(doutx, affine_cache[layer])
            grads['W'+str(layer+1)] *= (reg_strength * model_params['W'+str(layer+1)])
            #print(f'shape of the gradient doutx after affine backward: ', doutx.shape)
            #print(f'shape of W, b gradients respectively: ', grads['W'+str(layer+1)].shape, grads['b'+str(layer+1)].shape)

        # As long as the layer is not the first one, compute the relu backward pass, then batchnorm backward pass.
            if layer != 0:
                doutx = nn.relu_backward(doutx, relu_cache[layer-1])
                #print(f'shape of the gradients of doutx after relu backward pass: ', doutx.shape)
                doutx, grads['gamma'+str(layer+1)], grads['beta'+str(layer+1)] = nn.batchnorm_backward(doutx, batch_cache[layer-1])
                #print(f'shape of the gradients of doutx after batchnorm backward pass: ', doutx.shape)
        return grads
    
    def sgd(self, grads, model_params):
        # This is simple stochastic gradient descent.
        for layer in range(num_layers):
            W = model_params['W'+str(layer+1)]
            grad_w = grads['W'+str(layer+1)]
            W -= learning_rate * grad_w
            model_params['W'+str(layer+1)] = W
        return model_params
    
    def gradient_descent(self, grads, model_params, momentum):
        # This will be sgd + momentum. 
        for layer in range(num_layers):
            W = model_params['W'+str(layer+1)]
            grad_w = grads['W'+str(layer+1)]
            v = np.zeros_like(W)

            # Velocity update.
            v = momentum * v - learning_rate * grad_w
            W = W + v
            model_params['W'+str(layer+1)] = W
        return model_params


In [90]:
# For a check, lets see the sizes and shapes of the values from each layer of the network.
print(f'length of batch_cache: ', len(batch_cache))
print(f'length of relu cache: ', len(relu_cache))
print(f'length of the affine caches: ', len(affine_cache))

# Lets check the mean and standard deviation after the batchnorm pass. 
# Mean should be close to 0, std should be close to 1.
mean, std = nn.print_mean_std(b_values)
print(np.mean(mean))
print(np.mean(std))

length of batch_cache:  4
length of relu cache:  4
length of the affine caches:  5
5.215619602871869e-18
0.9999447818008784


In [91]:
# Now, lets implment the backward pass, with batchnorm in mind.

# This is a improved version of the above backward pass.
grads = {}
doutx = dL

# Loop backwards through the layers, computing the gradients and updating the grads dict.
for layer in reversed(range(num_layers)):
    print('layer'+str(layer+1))
    doutx, grads['W'+str(layer+1)], grads['b'+str(layer+1)] = nn.affine_backward(doutx, affine_cache[layer])
    print(f'shape of the gradient doutx after affine backward: ', doutx.shape)
    print(f'shape of W, b gradients respectively: ', grads['W'+str(layer+1)].shape, grads['b'+str(layer+1)].shape)

# As long as the layer is not the first one, compute the relu backward pass, then batchnorm backward pass.
    if layer != 0:
        doutx = nn.relu_backward(doutx, relu_cache[layer-1])
        print(f'shape of the gradients of doutx after relu backward pass: ', doutx.shape)
        doutx, grads['gamma'+str(layer+1)], grads['beta'+str(layer+1)] = nn.batchnorm_backward(doutx, batch_cache[layer-1])
        print(f'shape of the gradients of doutx after batchnorm backward pass: ', doutx.shape)

'''dx = doutx
for grad in grads.keys():
    print(grad, grads[grad].shape)'''

layer5
shape of the gradient doutx after affine backward:  (50, 100)
shape of W, b gradients respectively:  (100, 10) (10,)
shape of the gradients of doutx after relu backward pass:  (50, 100)
shape of the gradients of doutx after batchnorm backward pass:  (50, 100)
layer4
shape of the gradient doutx after affine backward:  (50, 100)
shape of W, b gradients respectively:  (100, 100) (100,)
shape of the gradients of doutx after relu backward pass:  (50, 100)
shape of the gradients of doutx after batchnorm backward pass:  (50, 100)
layer3
shape of the gradient doutx after affine backward:  (50, 100)
shape of W, b gradients respectively:  (100, 100) (100,)
shape of the gradients of doutx after relu backward pass:  (50, 100)
shape of the gradients of doutx after batchnorm backward pass:  (50, 100)
layer2
shape of the gradient doutx after affine backward:  (50, 100)
shape of W, b gradients respectively:  (100, 100) (100,)
shape of the gradients of doutx after relu backward pass:  (50, 100)


'dx = doutx\nfor grad in grads.keys():\n    print(grad, grads[grad].shape)'

In [128]:
# Lets update the weights using the gradients. We will use sgd + momentum.
iterations = 100000
solver = Solver()
for i in range(iterations):
    loss, dL, affine_cache, batch_cache, relu_cache = solver.forward_pass(model_params)
    grads = solver.backward_pass(loss, dL, affine_cache, batch_cache, relu_cache)
    model_params = solver.sgd(grads, model_params)
    

loss: 2.345710350441157
the shape of the gradients of dL: (50, 10)
loss: 2.345654552425943
the shape of the gradients of dL: (50, 10)
loss: 2.3455987527788023
the shape of the gradients of dL: (50, 10)
loss: 2.3455429514994366
the shape of the gradients of dL: (50, 10)
loss: 2.345487148587549
the shape of the gradients of dL: (50, 10)
loss: 2.345431323707329
the shape of the gradients of dL: (50, 10)
loss: 2.345375182899111
the shape of the gradients of dL: (50, 10)
loss: 2.3453190404114292
the shape of the gradients of dL: (50, 10)
loss: 2.3452629350148264
the shape of the gradients of dL: (50, 10)
loss: 2.345206909712076
the shape of the gradients of dL: (50, 10)
loss: 2.3451508541413957
the shape of the gradients of dL: (50, 10)
loss: 2.3450947968986697
the shape of the gradients of dL: (50, 10)
loss: 2.3450387379836037
the shape of the gradients of dL: (50, 10)
loss: 2.344982677395905
the shape of the gradients of dL: (50, 10)
loss: 2.3449266151352792
the shape of the gradients of 