Code Author: Tanzid Sultan 

**`Simple Neural Network Implementations`**

**Model 1:** We are given a set of instances with numerical attributes and a numerical label/target value (i.e. ground truth) for each instance. Our goal is to create a neural network that can predict the label for any given instance. In our simplest neural network model, the prediction is just a linear combination of the attributes. The network is trained by optimizing the weights (i.e. the constant coefficients) of this linear combination using gradient descent.
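
Written out for a single instance, the squared error, its gradient and the gradient descent update take the form below. The symbols are chosen here to mirror the code in this notebook: $x$ is the attribute vector, $y$ the target, $w$ the weights, $p$ the prediction and $\alpha$ the step size.

$$
\begin{aligned}
p &= w \cdot x = \sum_i w_i x_i \\
E &= (p - y)^2 \\
\frac{\partial E}{\partial w_i} &= 2\,(p - y)\,x_i \\
w_i &\leftarrow w_i - \alpha\,\frac{\partial E}{\partial w_i}
\end{aligned}
$$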

To demo this model, we will use the "traffic lights" example, where we have three traffic lights. The state of each light is either `on` or `off` (i.e. 1 or 0) and the corresponding label is either walk or stop (1 or 0). The training dataset is contrived such that there is a strong correlation between the second attribute/light and the target. We would therefore expect the second weight to be much larger than the other two weights after the model has been trained sufficiently.
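
As a quick sanity check on that expectation, here is a minimal sketch (not part of the original notebook) that fits the same linear model in closed form with `np.linalg.lstsq` instead of gradient descent. The names `data`, `X`, `y` and `w_exact` are chosen for this illustration only.

```python
import numpy as np

# Same traffic lights data as in the demo: three light states, then the label.
data = np.array([[1, 0, 1, 0],
                 [0, 1, 1, 1],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1],
                 [0, 1, 1, 1],
                 [1, 0, 1, 0]], dtype=float)
X, y = data[:, :-1], data[:, -1]

# Closed-form least-squares fit of y ~ X @ w (no gradient descent involved).
w_exact, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("least-squares weights:", w_exact)
print("rank of X:", rank)
```

Because the label in this dataset equals the state of the second light in every row, the least-squares fit should put essentially all of the weight on the second attribute. That is the limit the gradient descent weights are expected to approach as training continues.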

In [38]:
import numpy as np

# traffic lights dataset (each row is an instance, the first three columns are the attributes and the last column is the label)
traffic_lights = np.array([ [1, 0, 1, 0], 
                            [0, 1, 1, 1],
                            [0, 0, 1, 0],
                            [1, 1, 1, 1],
                            [0, 1, 1, 1],
                            [1, 0, 1, 0]] )

# number of gradient descent iterations
niters = 30

# learning rate (i.e gradient descent step-size)
alpha = 0.1

# initialize random weights
weights = np.random.randn(3) 
print(f"Initial weights: {weights}")

# train the network
for i in range(niters):

    total_error = 0.0
    for j in range(traffic_lights.shape[0]):
        
        # attributes of the current instance (named x to avoid shadowing the built-in input)
        x = traffic_lights[j, :-1]
        target = traffic_lights[j, -1]

        # compute prediction
        prediction = np.dot(weights, x)

        # compute squared error
        error = (prediction - target)**2
        total_error += error

        # compute gradient of error w.r.t. weights
        grad = 2 * (prediction - target) * x

        # optimize weights using gradient descent
        weights -= alpha * grad

    print(f"Iteration# {i+1}, Updated weights: {weights}, Total error: {total_error}")



Initial weights: [1.05403175 0.51637269 2.05480547]
Iteration# 1, Updated weights: [0.05730109 0.17189395 0.61562081], Total error: 13.937348556565842
Iteration# 2, Updated weights: [-0.1088283   0.35244806  0.44083243], Total error: 1.3646009816820786
Iteration# 3, Updated weights: [-0.16103145  0.50799829  0.37338589], Total error: 0.6810780925370029
Iteration# 4, Updated weights: [-0.17697599  0.61880503  0.32106545], Total error: 0.39995224898531107
Iteration# 5, Updated weights: [-0.17517698  0.69821005  0.27602813], Total error: 0.25637043119047204
Iteration# 6, Updated weights: [-0.1643831   0.75661992  0.23703995], Total error: 0.17421030724412512
Iteration# 7, Updated weights: [-0.14951396  0.80077899  0.2033768 ], Total error: 0.12257850195117463
Iteration# 8, Updated weights: [-0.13335413  0.83501851  0.17438221], Total error: 0.08793410718113523
Iteration# 9, Updated weights: [-0.11742852  0.86215721  0.14945296], Total error: 0.0637348105896432
Iteration# 10, Updated weigh

**Model 2:** We will now build a neural network with one hidden layer between the input and output layers, and introduce non-linearity via a ReLU activation function. This three-layer network has two sets of weights, both of which are optimized during training. Each training step has two stages: forward propagation and backward propagation. Forward propagation computes the output of each layer and passes it forward as the input to the next layer. Backward propagation computes, at each layer, the derivatives of that layer's operation w.r.t. its inputs, composes them with the derivatives received from the next layer, and sends the result back to the previous layer. This composition of derivatives from the current layer with derivatives from the next layer is simply an application of the chain rule.
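
For the network implemented below, with dropout and softmax switched off, the two passes can be summarized as follows. The symbols are chosen to mirror the variable names in the code: $L_0$ is the input batch, $W_0$ and $W_1$ are the two weight matrices, $f$ is the hidden activation, $Y$ holds the targets, $N$ is the number of instances in the batch and $\odot$ denotes the element-wise product.

$$
\begin{aligned}
Z &= L_0 W_0, \qquad L_1 = f(Z), \qquad P = L_1 W_1, \qquad E = \frac{1}{N}\sum_{n,k}\left(P_{nk} - Y_{nk}\right)^2 \\
\frac{\partial E}{\partial P} &= \frac{2}{N}\,(P - Y) \\
\frac{\partial E}{\partial W_1} &= L_1^{T}\,\frac{\partial E}{\partial P} \\
\frac{\partial E}{\partial L_1} &= \frac{\partial E}{\partial P}\,W_1^{T} \\
\frac{\partial E}{\partial Z} &= \frac{\partial E}{\partial L_1} \odot f'(Z) \\
\frac{\partial E}{\partial W_0} &= L_0^{T}\,\frac{\partial E}{\partial Z}
\end{aligned}
$$

The first line is the forward pass; each later line is one application of the chain rule, moving from the output layer back towards the input.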

In [35]:
import numpy as np

'''
    Input layer class: Input layer does not perform any operations
'''
class input_layer(object):
    '''
        class constructor
    '''
    def __init__(self) -> None:
        pass

    ''' 
        Input layer forward pass
    '''
    def forward(self, L_0):
        self.L_0 = L_0
        return self.L_0
    
''' 
    Hidden layer class: Hidden layer performs 2 operations. First it performs matrix multiplication
                        of inputs L_0 with weights W_0. Then it applies the selected activation
                        function (relu, sigmoid or tanh) to this result, optionally followed by dropout.
'''    
class hidden_layer(object):
    '''
        class constructor
    '''
    def __init__(self, W, activation) -> None:
        self.W = W
        self.W_grad = np.zeros_like(W)
        self.activation = activation
        print(f"Hidden layer activation function: {activation}")

    ''' 
        Hidden layer forward propagation
    '''
    def forward(self, L, dropout): 
        self.L = L
        return self.forward_matrix_mult(dropout)

    def forward_matrix_mult(self, dropout):
        self.Z = np.dot(self.L, self.W)
        self.dropout = dropout
        if(self.dropout):
            # generate a random dropout mask with roughly equal numbers of 0s and 1s
            self.dropout_mask = np.random.randint(0,2,size=(self.Z.shape))

        if(self.activation == "relu"):
            if(self.dropout):
                # multiply by a factor of 2 to compensate for roughly 1/2 the neurons being turned off by the masking
                return 2 * self.dropout_mask * self.forward_relu()
            else:
                return self.forward_relu()

        elif(self.activation == "sigmoid"):
            if(self.dropout):
                return 2 * self.dropout_mask * self.forward_sigmoid()
            else:
                return self.forward_sigmoid()

        elif(self.activation == "tanh"):
            if(self.dropout):
                return 2 * self.dropout_mask * self.forward_tanh()
            else:
                return self.forward_tanh()

        else:
            # fail loudly instead of silently returning None for an unknown activation name
            raise ValueError(f"Unsupported activation: {self.activation}")
    
    def forward_relu(self):
        return Relu(self.Z)
    
    def forward_sigmoid(self):
        return sigmoid(self.Z)
   
    def forward_tanh(self):
        return tanh(self.Z)
    
    ''' 
        Hidden layer backpropagation of derivatives
    '''
    def backward(self, D):
        if(self.activation == "relu"):
           self.backward_relu(D)
        elif(self.activation == "sigmoid"):
           self.backward_sigmoid(D)
        elif(self.activation == "tanh"):
           self.backward_tanh(D)

    def backward_relu(self, D):
        # dE/dZ
        dE_dZ = D * Relu_deriv(self.Z) 
        self.backward_matrix_mult(dE_dZ)
    
    def backward_sigmoid(self, D):
        # dE/dZ
        dE_dZ = D * sigmoid_deriv(self.Z) 
        self.backward_matrix_mult(dE_dZ)
    
    def backward_tanh(self, D):
        # dE/dZ
        dE_dZ = D * tanh_deriv(self.Z) 
        self.backward_matrix_mult(dE_dZ)
    
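    def backward_matrix_mult(self, D):
        # dE/dW0
        if(self.dropout):
            # note: the forward pass scales the masked activations by 2, but that factor is not applied
            # here, so with dropout this gradient is half the true dE/dW0 (a smaller effective step size)
            self.W_grad = np.dot((self.L).T, self.dropout_mask * D)
        else:
            self.W_grad = np.dot((self.L).T, D)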

    ''' 
        Gradient descent optimization of hidden layer weights
    '''
    def update_weights(self, alpha):
        self.W -= alpha * self.W_grad

       

''' 
    Output layer class: Performs two operations. First, matrix multiplication of inputs L_1 with weights
                        W_1, optionally followed by softmax. The squared error function is then applied
                        to this result.
'''
class output_layer(object):
    
    ''' 
        class constructor
    '''

    def __init__(self, W) -> None:
        self.W = W
        self.W_grad = np.zeros_like(W)

    ''' 
        Output layer forward propagation
    '''
    def forward(self, L, Y, soft):
        self.L = L
        self.Y = Y
        return self.forward_matrix_mult(soft)

    def forward_matrix_mult(self, soft):
        self.P = np.dot(self.L, self.W) 
        if(soft):
            self.P = softmax(self.P)
        
        return self.P, self.forward_error()
 
    def forward_error(self):
        return np.sum((self.P - self.Y)**2) / self.P.shape[0]

    '''     
        Output layer backpropagation of derivatives
    '''
    def backward(self):
        return self.backward_error()

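    def backward_error(self):
        # dE/dP
        # note: when softmax is enabled, self.P is the softmax output and the softmax Jacobian is not
        # applied here; the derivative is propagated as if softmax were the identity
        dE_dP = 2*(self.P - self.Y) / self.P.shape[0]
        return self.backward_matrix_mult(dE_dP)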

    def backward_matrix_mult(self, D):
        # dE/dW1
        self.W_grad = np.dot((self.L).T, D)
        # dE/dL1
        dE_dL = np.dot(D, (self.W).T)
        return dE_dL
    
    ''' 
        Gradient descent optimization of output layer weights
    '''
    def update_weights(self, alpha):
        self.W -= alpha * self.W_grad

'''
    A 3-layer neural network class
'''
class three_layer_net(object):
    ''' 
        class constructor: Takes in the following parameters- number of neurons in the input layer (which is the number of feature attributes for each instance), number of neurons in the hidden layer, number of neurons in the output layer (which is the number of target attributes) and the hidden layer activation function ("relu", "sigmoid" or "tanh"). The gradient descent step-size (alpha) is passed to train() instead.
    '''
    def __init__(self, input_neurons, hidden_neurons, output_neurons, hidden_layer_activation = "relu") -> None:
        self.input_neurons  = input_neurons
        self.hidden_neurons = hidden_neurons
        self.output_neurons = output_neurons
        
        np.random.seed(1)
        # initialize weights W0 between input layer and hidden layer 
        W0 = 0.02*np.random.random(size=(input_neurons, hidden_neurons)) - 0.01
        # initialize weights W1 between hidden layer and output layer
        W1 = 0.2*np.random.random(size=(hidden_neurons, output_neurons)) - 0.1 

        # initialize layer objects
        self.layer_0 = input_layer()
        self.layer_1 = hidden_layer(W0, activation=hidden_layer_activation)
        self.layer_2 = output_layer(W1)

    ''' 
        neural network forward pass
    '''
    def forward_net(self, L0, Y, dropout, soft):
        # input layer forward pass
        self.L0 = self.layer_0.forward(L0) 
        # hidden layer forward pass 
        self.L1 = self.layer_1.forward(self.L0, dropout) 
        # output layer forward pass
        self.L2, error = self.layer_2.forward(self.L1, Y, soft) 

        return self.L2, error

    ''' 
        neural network backward pass
    ''' 
    def backward_net(self):
       # output layer backpropagation
       D = self.layer_2.backward() 
       # hidden layer backpropagation
       self.layer_1.backward(D) 

    '''     
        weight optimization
    '''
    def optimize(self, alpha):
        # update output layer weights
        self.layer_2.update_weights(alpha)
        # update hidden layer weights
        self.layer_1.update_weights(alpha)

    '''     
        train the network
    ''' 
    def train(self, X_train, y_train, X_test, y_test, alpha, batch_size=1, niters=1, dropout=False, soft=False):
        print(f"Dropout Enabled: {dropout}")
        print(f"Softmax Enabled: {soft}")
        print(f"Batch size: {batch_size}")
        print(f"Alpha: {alpha}")
        print("Training in progress...")
        #training iterations
        for i in range(niters):
            total_error = 0.0
            train_correct_count = 0
            # train using batch of instances
            for j in range(int(X_train.shape[0]/batch_size)):

                lo = j * batch_size
                hi = min((j+1) * batch_size, X_train.shape[0])

                X = X_train[lo:hi]
                y = y_train[lo:hi]

                # forward propagation
                prediction, error = self.forward_net(X, y, dropout, soft)
                total_error += error
                
                for k in range(hi-lo):
                    train_correct_count += int(np.argmax(prediction[k]) == np.argmax(y[k]))
                
                #if(i == (niters-1)):
                #    print(f"Instance# {j+1}, Target: {y}, Prediction: {prediction}")

                # backpropagation
                self.backward_net()

                # weight optimization
                self.optimize(alpha)

            # predict using test instances
            test_correct_count = 0
            for j in range(X_test.shape[0]):
                X = X_test[j:j+1]
                y = y_test[j:j+1]

                # forward propagation
                prediction, error = self.forward_net(X, y, dropout=False, soft=False)
                test_correct_count += int(np.argmax(prediction) == np.argmax(y))

            print(f"Iteration# {i+1}, Total error: {total_error}, Training accuracy: {train_correct_count/len(y_train)}, Testing accuracy: {test_correct_count/len(y_test)}")

# Relu function
def Relu(x):
    return x*(x > 0)

# Relu derivative function
def Relu_deriv(x):
    return (x > 0)

def tanh(x):
    return np.tanh(x)

def tanh_deriv(x):
    return (1.0 - np.tanh(x)**2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-1.0 * x))

def sigmoid_deriv(x):
    return sigmoid(x) * (1.0 - sigmoid(x)) 

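def softmax(x):
    # subtract the row-wise max before exponentiating to avoid overflow; the result is mathematically unchanged
    ex = np.exp(x - np.max(x, axis = 1, keepdims = True))
    return ex/np.sum(ex, axis = 1, keepdims = True)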
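
One way to gain confidence in backpropagation formulas like the ones used by the `backward` methods above is to compare them against finite-difference estimates. The following is a minimal standalone sketch, not part of the original notebook: it re-implements the forward and backward pass for a tanh hidden layer with the same error definition as `forward_error`, without dropout or softmax. The helper names `loss_and_grads` and `numerical_grad` are hypothetical, and tanh is used because ReLU is not differentiable at zero, which can confuse a finite-difference check.

```python
import numpy as np

def loss_and_grads(W0, W1, X, Y):
    """Forward and backward pass: tanh hidden layer, squared error averaged over the batch."""
    N = X.shape[0]
    Z = X @ W0                      # hidden pre-activation
    L1 = np.tanh(Z)                 # hidden activation
    P = L1 @ W1                     # network output
    E = np.sum((P - Y) ** 2) / N    # same error definition as forward_error

    dE_dP = 2.0 * (P - Y) / N                   # derivative of the error w.r.t. the output
    dE_dW1 = L1.T @ dE_dP                       # gradient for the output weights
    dE_dL1 = dE_dP @ W1.T                       # derivative sent back to the hidden layer
    dE_dZ = dE_dL1 * (1.0 - np.tanh(Z) ** 2)    # chain rule through tanh
    dE_dW0 = X.T @ dE_dZ                        # gradient for the hidden weights
    return E, dE_dW0, dE_dW1

def numerical_grad(f, W, eps=1e-6):
    """Central finite-difference estimate of df/dW, perturbing one entry of W at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f()
        W[idx] = old - eps
        f_minus = f()
        W[idx] = old                # restore the original value
        grad[idx] = (f_plus - f_minus) / (2.0 * eps)
    return grad

# Small random problem (sizes are arbitrary, chosen for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(5, 2))
W0 = rng.normal(size=(3, 4))
W1 = rng.normal(size=(4, 2))

E, g0, g1 = loss_and_grads(W0, W1, X, Y)
n0 = numerical_grad(lambda: loss_and_grads(W0, W1, X, Y)[0], W0)
n1 = numerical_grad(lambda: loss_and_grads(W0, W1, X, Y)[0], W1)

print("max |analytic - numerical| for W0:", np.max(np.abs(g0 - n0)))
print("max |analytic - numerical| for W1:", np.max(np.abs(g1 - n1)))
```

If the analytic gradients are right, both printed differences should be tiny compared with the gradient values themselves. A large mismatch usually points to a missing transpose, a missing scale factor or a skipped derivative somewhere in the backward pass.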

We can test this 3 layer network using the same traffic lights dataset.

In [22]:
# traffic lights dataset (each row is an instance, the first three columns are the attributes and the last column is the label)
traffic_lights = np.array([ [1, 0, 1, 0], 
                            [0, 1, 1, 1],
                            [0, 0, 1, 0],
                            [1, 1, 1, 1],
                            [0, 1, 1, 1],
                            [1, 0, 1, 0]] )

# dataset preprocessing
X_train = traffic_lights[:,:-1]
y_train = traffic_lights[:,-1:]

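# initialize a three layer neural net object
#three_net = three_layer_net(input_neurons=X_train.shape[1], hidden_neurons=4, output_neurons=1)

# train the net (assumption: the training set doubles as the test set, since this toy dataset has no held-out split;
# with a single output neuron the argmax-based accuracies are trivially 1.0, so only the total error is informative)
#three_net.train(X_train, y_train, X_train, y_train, alpha=0.5, batch_size=2, niters=30)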

Testing our 3 layer model with the `MNIST dataset` of handwritten digits

In [None]:
from keras.datasets import mnist

In [36]:
'''
    MNIST dataset of handwritten digits:

    Each observation is an image. The input values per image are 28x28 pixels (i.e. 784 features/inputs per observation).
'''

(X_train, y_train), (X_test, y_test) = mnist.load_data() # x values contain image pixels (i.e. features) and y values are the corresponding labels

print(f"X_train shape = {X_train.shape}")
print(f"y_train shape = {y_train.shape}")
print(f"X_test shape = {X_test.shape}")
print(f"y_test shape = {y_test.shape}")

# flatten image pixels array
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2]) 
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2]) 

# normalize pixel values from [0, 255] to [0, 1]
X_train = X_train / 255
X_test = X_test / 255

# one-hot encode the labels
y_train_onehot = np.zeros(shape=(y_train.shape[0], 10))
y_test_onehot = np.zeros(shape=(y_test.shape[0], 10))

for i in range(y_train.shape[0]):
    y_train_onehot[i, y_train[i]] = 1

for i in range(y_test.shape[0]):
    y_test_onehot[i, y_test[i]] = 1

# training dataset preparation
training_images = X_train[0:1000]
training_labels = y_train_onehot[0:1000]
testing_images = X_test
testing_labels = y_test_onehot

# initialize a three layer neural net object
three_net = three_layer_net(input_neurons=training_images.shape[1], hidden_neurons=100, output_neurons=training_labels.shape[1], hidden_layer_activation="tanh")

# train the net
three_net.train(training_images, training_labels, testing_images, testing_labels, alpha=0.05, batch_size=100, niters=300, dropout=True, soft=True)    

X_train shape = (60000, 28, 28)
y_train shape = (60000,)
X_test shape = (10000, 28, 28)
y_test shape = (10000,)
Hidden layer activation function: tanh
Dropout Enabled: True
Softmax Enabled: True
Batch size: 100
Alpha: 0.05
Training in progress...
Iteration# 1, Total error: 8.84353493775041, Training accuracy: 0.404, Testing accuracy: 0.6259
Iteration# 2, Total error: 8.392415829957024, Training accuracy: 0.697, Testing accuracy: 0.6672
Iteration# 3, Total error: 7.762081243471778, Training accuracy: 0.711, Testing accuracy: 0.6829
Iteration# 4, Total error: 6.950228423474349, Training accuracy: 0.718, Testing accuracy: 0.6979
Iteration# 5, Total error: 6.082768906388227, Training accuracy: 0.74, Testing accuracy: 0.7165
Iteration# 6, Total error: 5.393286427266974, Training accuracy: 0.75, Testing accuracy: 0.7344
Iteration# 7, Total error: 4.756614603828647, Training accuracy: 0.767, Testing accuracy: 0.7545
Iteration# 8, Total error: 4.221056538490205, Training accuracy: 0.787, Testi