# Introduction

Deep learning is currently one of the most active areas of machine learning. It is based on learning data representations, and its models are loosely inspired by communication patterns and information processing in biological nervous systems. Neural coding, for example, attempts to define the relationship between various stimuli and the associated neuronal responses in the brain. [1][2][3]

Neural networks have recently become one of the major thrust areas in pattern recognition, prediction, and analysis problems. A multilayer perceptron (MLP) is a class of feedforward artificial neural network that includes at least three layers of nodes. Apart from the input nodes, each node is a neuron that uses a non-linear activation function. Other basic network formalisms include convolutional networks, recurrent networks, and Boltzmann machines, and more advanced formalisms include adversarial models such as GANs.

An MLP is trained with a supervised learning technique called backpropagation. [4][5] Its multiple layers and non-linear activations distinguish an MLP from a linear perceptron, and they allow it to separate data that is not linearly separable. [6] MLPs are connectionist computational models and, given enough hidden units, can approximate a very wide class of functions. For example, XOR takes three perceptrons with 6 weights and 3 threshold values, for 9 parameters in total. Individual perceptrons are the computational equivalent of neurons, and the MLP is a layered composition of many perceptrons. MLPs can also model Boolean functions: individual perceptrons can act as Boolean gates, and networks of perceptrons are Boolean functions. Seen this way, MLPs are Boolean machines that represent Boolean functions over linear boundaries, so they can represent arbitrary decision boundaries and can be used to classify data.
![](pic2.png)
Figure 1: Multi-Layer Perceptron XOR 
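
The XOR construction can be sketched with three threshold units. The weights and thresholds below are one possible choice and are an assumption (they are not read off Figure 1); `perceptron` and `xor_mlp` are hypothetical names used only for this illustration.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Threshold unit: returns 1 if the weighted sum of inputs exceeds the threshold."""
    return int(np.dot(w, x) > threshold)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden unit 1 acts as an OR gate: fires if at least one input is 1.
    h_or = perceptron(x, np.array([1.0, 1.0]), 0.5)
    # Hidden unit 2 acts as a NAND gate: fires unless both inputs are 1.
    h_nand = perceptron(x, np.array([-1.0, -1.0]), -1.5)
    # Output unit acts as an AND gate over the two hidden units.
    return perceptron(np.array([h_or, h_nand]), np.array([1.0, 1.0]), 1.5)

# Three perceptrons, 2 weights each (6 weights) plus 3 thresholds: 9 parameters.
truth_table = [(a, b, xor_mlp(a, b)) for a in (0, 1) for b in (0, 1)]
print(truth_table)
```

Each hidden unit draws one linear boundary, and the output unit combines them, which is why a single perceptron is not enough for XOR.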


![](pic1.png)
Figure 2: The MLP as a Boolean Function Over Feature Detectors

Explanation for Figure 1:
- The input layer comprises “feature detectors”, which detect whether certain patterns have occurred in the input.
- The network is a Boolean function over the feature detectors.
- It is therefore important for the first layer to capture relevant patterns.

There are a few more things you should know in advance. A threshold unit comprises a set of weights and a threshold, and it “fires” if the weighted sum of its inputs exceeds the threshold. A smooth “squashing” function at the output works much better than a hard threshold for training, because it is differentiable. Therefore, the sigmoid “activation” (the function that acts on the weighted combination of inputs and the threshold) replaces the threshold. An output neuron may use some other activation: threshold, sigmoid, tanh, softplus, rectifier, etc. Perceptrons with sigmoidal activations can be interpreted as modeling class probabilities. With a softmax output layer, the outputs are non-negative and sum to 1; for a confident prediction, one output goes toward 1 and the others go toward 0. The parameters of the network are its weights and biases.
![](pic6.png)
Figure 3: “Proper” Networks: Outputs With Activations

![](pic7.png)
Figure 4: Vector Activation Example: Softmax.
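
To illustrate the softmax as a vector activation, the following minimal sketch applies it to one row of made-up scores at increasing scales. The name `softmax_rows` and the score values are assumptions chosen for the example.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])  # made-up scores for three classes
for scale in (1, 10, 100):
    probs = softmax_rows(scale * scores)
    # Every row sums to 1; larger scales push the winning class toward 1.
    print(scale, np.round(probs, 3), probs.sum())
```

As the gap between the largest score and the rest grows, the output approaches a one-hot vector, which is the behavior described above.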

# Main body

We will write an implementation of the backpropagation algorithm for training the neural network. This tutorial does not use any autodiff toolboxes such as TensorFlow or PyTorch; it uses only NumPy-like libraries.

The goal of this assignment is to label images of the 10 handwritten digits “zero”, “one”, ..., “nine”, a typical problem for MLPs. The images are 28 by 28 pixels (MNIST dataset), and each is represented as a vector x of dimension 784 by listing all the pixel values in raster scan order. The labels t are 0, 1, 2, ..., 9, corresponding to the 10 classes written in the images. There are 3000 training cases, containing 300 examples of each of the 10 classes. They can be found in the file digitstrain.txt.
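
As a small illustration of raster scan order, the sketch below flattens a synthetic 28 by 28 array (a stand-in, not real MNIST data) into a 784-dimensional vector and maps a pixel position to its index in that vector.

```python
import numpy as np

# Synthetic stand-in for one image: values in [0, 1], one per pixel.
image = np.arange(28 * 28, dtype=float).reshape(28, 28) / (28 * 28 - 1)

# Raster scan order: row by row, left to right (NumPy's default row-major order).
x = image.reshape(-1)

row, col = 3, 5
index = row * 28 + col          # position of pixel (row, col) in the flat vector
print(x.shape, x[index] == image[row, col])

# The flattening is lossless: reshaping recovers the original image.
restored = x.reshape(28, 28)
```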

This sample problem is relevant in many situations because handwritten and printed numbers are almost everywhere in the real world. For example, you write a tip amount and sign on the payment pad when you pay by card in a restaurant. Students write out calculations by hand in mathematics, and people enter digits for verification on the internet. Other applications include vehicle license plate recognition, automatic recognition of credit/debit card numbers, and recognition of personal ID numbers used by countries, companies, and schools. Going further, the same approach can be extended to recognize handwritten letters and words or to identify a person from their handwriting.

![](pic3.png)
Figure 5: Demo of This Typical Problem 


![](pic4.png)
Figure 6: Overview Structure

### PROBLEM 1: Data Loader

First, you must read an input file. Each line contains 785 comma-delimited numbers: the first 784 values are between 0.0 and 1.0 and correspond to the 784 pixel values (black and white images), and the last number denotes the class label: 0 corresponds to digit 0, 1 corresponds to digit 1, etc. As a warm-up question, load the data.

For this problem you must write a function that takes as an argument the path of a file that contains this data. Your function must return two values (X and Y) that contain the data from the file as described. Specifically, the first return value (X) must be a matrix where the rows are individual examples of images and the columns are individual pixels (an N x 784 matrix). The second return value (Y) must be a list/array of real numbers representing the labels of the examples (rows) in X.
 
eg:

1.0,0.0,1.0,0.0,....0.0,0.25,0.0,0.0
... 1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.96776

X=[

[1.0,0.0,1.0,0.0,....0.0,0.25,0.0,0.0]

... 
[1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.96776] 

]

Y = [5,...,2]

In [None]:
#Sample


# def load_data(filePath): 
#     X = []
#     Y = []
#     #INSERT YOUR CODE HERE 
#     return X, Y

In [1]:
import os
import numpy as np
import pickle
import json
from helpers.helpers import *

In [2]:
def load_data(filePath):
    # Each row holds the pixel values followed by the class label.
    sample = np.loadtxt(filePath, delimiter=",")
    X = sample[:, :-1]  # every column except the last: pixels
    Y = sample[:, -1]   # last column: labels
    return X, Y

In [4]:
#TEST CODE (IF NOTHING HAPPENS, IT MEANS EVERYTHING IS GOOD!)
# AUTOLAB_IGNORE_START

# def test_problem_1():
#     REF_XTRAIN = "fixtures/X_expected"
#     REF_YTRAIN = "fixtures/Y_expected"

#     (xtrain, ytrain) = load_data("data/digitstest.txt")

#     ref_xtrain = pickle.load(open(REF_XTRAIN, "rb"), **PICKLE_KWARGS)
#     ref_ytrain = pickle.load(open(REF_YTRAIN, "rb"), **PICKLE_KWARGS)

#     assert(ref_xtrain.shape == xtrain.shape)
#     assert(ref_ytrain.shape == ytrain.shape)
#     assert(isAllClose(ref_ytrain, ytrain))
#     assert(isAllClose(ref_xtrain, xtrain))

# AUTOLAB_IGNORE_STOP
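
If the fixture files are not at hand, the loader logic can be checked on a tiny in-memory sample. This is a sketch under assumptions: `load_data_sketch` is a hypothetical stand-in for the loader, and the toy rows have 3 "pixels" instead of 784. `np.loadtxt` accepts a file-like object as well as a path, and `ndmin=2` keeps a one-line file two-dimensional.

```python
import io
import numpy as np

def load_data_sketch(source):
    """Split comma-delimited rows into pixel columns and a final label column."""
    sample = np.loadtxt(source, delimiter=",", ndmin=2)
    return sample[:, :-1], sample[:, -1]

# Two toy rows: three pixel values followed by the label.
toy_file = io.StringIO("1.0,0.0,0.25,5\n0.0,0.5,1.0,2\n")
X_toy, Y_toy = load_data_sketch(toy_file)
print(X_toy.shape, Y_toy)
```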

### PROBLEM 2: Back Propagation Algorithm Without Hidden Layer 

Implement the backpropagation algorithm in a neural network with zero hidden layers (weights directly between input and output nodes). The output layer should be a softmax output over 10 classes corresponding to the 10 classes of handwritten digits (e.g. an architecture: 784 > 10). Your backprop code should minimize the cross-entropy function for the multi-class classification problem (categorical cross entropy), so use the cross-entropy function to calculate the loss.
![](pic5.png)
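
For reference, the categorical cross entropy with a softmax output can be written out as follows. This is the standard textbook form, given here as an assumption about what the figure shows. $N$ is the number of examples, $K = 10$ the number of classes, $t_{nk}$ the one-hot label of example $n$, $p_{nk}$ the softmax output, and $W$, $b$ the weights and bias of the 784 > 10 network; $X$, $P$ and $T$ stack the per-example rows $x_n$, $p_n$ and $t_n$.

```latex
\begin{aligned}
z_n &= x_n W + b, \qquad
p_{nk} = \frac{\exp(z_{nk})}{\sum_{j=1}^{K} \exp(z_{nj})} \\
L &= -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log p_{nk} \\
\frac{\partial L}{\partial z_{nk}} &= \frac{1}{N}\,(p_{nk} - t_{nk}), \qquad
\frac{\partial L}{\partial W} = \frac{1}{N}\, X^{\top} (P - T), \qquad
\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{n=1}^{N} (p_n - t_n)
\end{aligned}
```

The difference $P - T$ (softmax output minus one-hot labels) is the only quantity the backward pass needs at the output layer.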

This step should be done with a full step of gradient descent, not SGD or RMSProp. For this problem you must write a function that takes as input a matrix of X values, a list of Y values (as returned from problem 1), a list of weight matrices, a list of bias vectors, and a learning rate, and performs a single step of backpropagation. You will need to do both a forward step with the inputs and then a backward pass to get the gradients. Return the updated weights and biases in the same format as they were passed.

The list of weight matrices will be a list with 1 entry, where the only entry is a matrix in which each row holds all of the outgoing weights of one neuron in the input layer and each column holds all of the incoming weights of one neuron in the next layer. A specific (row, column) index gives the weight of one neuron-to-neuron connection.

The list of bias vectors will be in the form where each entry in the list is a vector with one value per neuron in the layer that the corresponding weight matrix feeds into (e.g. for an architecture of 784 > 10, there will be a single-element list with a vector of size 10).

In [None]:
#Sample

# def update_weights_perceptron(X, Y, weights, bias, lr): 
#     #INSERT YOUR CODE HERE
#     return updated_weights, updated_bias


In [5]:
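# First, we need a softmax on the output of the forward pass.
def softmax(a):
    # a is 2-D (examples x classes). Subtracting each row's maximum leaves the
    # result unchanged and keeps np.exp from overflowing on large scores.
    e = np.exp(a - np.max(a, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def forwardthrough(X, weights, bias):
    Z = np.add(np.dot(X, weights), bias)
    return softmax(Z)

# Then we need the cross-entropy function. Note that this helper returns the
# derivative of the cross-entropy loss with respect to the softmax inputs
# (softmax output minus one-hot labels), not the loss value itself.

def cross_entropy_loss(X, Y, weights, bias):
    Z = forwardthrough(X, np.asarray(weights)[0], np.asarray(bias)[0])
    A = np.zeros((Y.shape[0], 10), dtype=int)
    for i in range(Y.shape[0]):
        A[i][int(Y[i])] = 1  # one-hot encode the label of example i
    return Z - A

def update_weights_perceptron(X, Y, weights, bias, lr):
    diff_Z = cross_entropy_loss(X, Y, weights, bias)
    # Average the per-example gradients over the whole batch.
    diff_B = 1 / Y.shape[0] * np.sum(diff_Z, axis=0, keepdims=True)
    diff_W = 1 / Y.shape[0] * np.dot(X.T, diff_Z)
    # Gradient descent step on the single bias vector and weight matrix.
    bias = np.asarray(bias)
    bias[0] += -lr * diff_B
    weights = np.asarray(weights)
    weights[0] += -lr * diff_W
    return weights.tolist(), bias.tolist()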

In [9]:
#TEST CODE (IF NOTHING HAPPENS, IT MEANS EVERYTHING IS GOOD!)
# AUTOLAB_IGNORE_START

# def test_problem_2():
#     STARTING_WEIGHTS_PATH = "fixtures/start_weights_problem2"
#     ENDING_WEIGHTS_PATH = "fixtures/final_weights_problem2"
#     STARTING_BIAS_PATH = "fixtures/start_bias_problem2"
#     ENDING_BIAS_PATH = "fixtures/final_bias_problem2"
#     PARAMS_PATH = "fixtures/problem2.params.json"

#     params = json.loads(open(PARAMS_PATH, "r").read())

#     (X, Y)= load_data("data/digitstrain.txt")
#     inputWeights = pickle.load(open(STARTING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)
#     finalWeights = pickle.load(open(ENDING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)

#     inputBias = pickle.load(open(STARTING_BIAS_PATH, "rb"), **PICKLE_KWARGS)
#     finalBias = pickle.load(open(ENDING_BIAS_PATH, "rb"), **PICKLE_KWARGS)

#     weightsToTest, biasToTest = update_weights_perceptron(X, Y, inputWeights, inputBias, float(params["LEARNING_RATE"]))

#     assert(isAllClose(finalWeights, weightsToTest))
#     assert(isAllClose(finalBias, biasToTest))

# AUTOLAB_IGNORE_STOP
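
Besides comparing against stored fixtures, a backward pass can be sanity-checked with finite differences. The sketch below does this for a small softmax layer on random toy data. All names (`softmax_rows`, `mean_cross_entropy`, `analytic_grad_W`) and the toy sizes are assumptions for illustration and are not part of the assignment code.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(X, T, W, b):
    """Mean categorical cross entropy for one-hot targets T."""
    P = softmax_rows(X @ W + b)
    return -np.mean(np.sum(T * np.log(P), axis=1))

def analytic_grad_W(X, T, W, b):
    """Gradient of the mean loss w.r.t. W: X^T (P - T) / N."""
    P = softmax_rows(X @ W + b)
    return X.T @ (P - T) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.random((5, 4))                       # 5 toy examples, 4 features
T = np.eye(3)[rng.integers(0, 3, size=5)]    # one-hot labels for 3 classes
W = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(3)

# Central differences: perturb one weight at a time and measure the loss change.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (mean_cross_entropy(X, T, W_plus, b)
                         - mean_cross_entropy(X, T, W_minus, b)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad_W(X, T, W, b)))
print(max_diff)  # should be tiny if the analytic gradient is right
```

A large discrepancy usually points to a missing 1/N factor, a transposed matrix, or a sign error in the update.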

### PROBLEM 3: Single Layer Neural Network With Hidden Units

Extend your code from problem 2 to support a single layer neural network with N hidden units (e.g. an architecture: 784 > 10 > 10). These hidden units should be using sigmoid activations.

For this problem you must write a function that takes as input a matrix of X values, a list of Y values (as returned from problem 1), a list of weight matrices, a list of bias vectors, and a learning rate, and performs a single step of backpropagation. You will need to do both a forward step with the inputs to get the outputs and then a backward pass to get the gradients. Return the updated weights and biases in the same format as they were passed.

The list of weight matrices is a list with 2 entries where each entry in the list contains a single weight matrix as previously defined in problem 2. For a network with shape 784 > 10 > 10 the passed list of weight matrices would look like this: [Matrix with shape 784x10, Matrix with shape 10x10]. **Note:** Though a hidden layer of size 10 is used as an example here, your code must be able to support a hidden layer of dimension N.

The list of bias vectors will be in the form where each entry in the list is a vector with one value per neuron in the layer that the corresponding weight matrix feeds into (e.g. for an architecture of 784 > 10 > 10, there will be a two-element list with a vector of size 10 and a vector of size 10).
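
To make the expected shapes concrete, the sketch below builds weight and bias lists for a 784 > N > 10 network. `init_params_sketch` is a hypothetical helper, and both the random initialization and the choice to store each bias as a 1 x n row are assumptions; the assignment's fixtures may use a different layout.

```python
import numpy as np

def init_params_sketch(layer_sizes, seed=0):
    """Return (weights, bias) lists with one entry per pair of adjacent layers."""
    rng = np.random.default_rng(seed)
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    # Rows index the sending layer's neurons, columns the receiving layer's.
    weights = [rng.normal(scale=0.01, size=(n_in, n_out)) for n_in, n_out in pairs]
    # One bias value per receiving neuron.
    bias = [np.zeros((1, n_out)) for _, n_out in pairs]
    return weights, bias

weights, bias = init_params_sketch([784, 32, 10])  # hidden layer of N = 32 units
print([w.shape for w in weights])
print([b.shape for b in bias])
```

Because the matrices have different shapes, they are kept in a plain Python list instead of being stacked into a single array.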


In [None]:
#Sample

# def update_weights_single_layer(X, Y, weights, bias, lr): 
#     #INSERT YOUR CODE HERE
#     return updated_weights, updated_bias

In [11]:
# The sigmoid “activation” replaces the threshold in the forward pass.
# Activation: the function that acts on the weighted combination of inputs (and threshold).

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid, used in the backward pass.
# x is expected to be the sigmoid OUTPUT s = sigmoid(z), so x * (1 - x) equals sigmoid'(z).

def derivation_sigmoid(x):
    return x * (1.0 - x)

def cross_entropy_loss_single_layer(S, Y, weights, bias):
    # Softmax output minus one-hot labels: the loss derivative at the output layer.
    Z = forwardthrough(S, np.asarray(weights[1]), np.asarray(bias[1]))
    A = np.zeros((Y.shape[0], 10), dtype=int)
    for i in range(Y.shape[0]):
        A[i][int(Y[i])] = 1
    return Z - A


def update_weights_single_layer(X, Y, weights, bias, lr):
    # Convert layer by layer: the matrices differ in shape, so calling
    # np.asarray on the whole list would try to build a ragged array.
    weights = [np.array(w, dtype=float) for w in weights]
    bias = [np.array(b, dtype=float) for b in bias]
    # Forward pass through the hidden layer.
    Z = np.add(np.dot(X, weights[0]), bias[0])
    S = sigmoid(Z)
    # Backward pass: output layer first, then the hidden layer.
    diff_Z = cross_entropy_loss_single_layer(S, Y, weights, bias)
    diff_B2 = 1 / Y.shape[0] * np.sum(diff_Z, axis=0, keepdims=True)
    diff_W2 = 1 / Y.shape[0] * np.dot(S.T, diff_Z)
    diff_S = np.dot(diff_Z, weights[1].T)
    diff_Z1 = diff_S * derivation_sigmoid(S)
    diff_B1 = 1 / Y.shape[0] * np.sum(diff_Z1, axis=0, keepdims=True)
    diff_W1 = 1 / Y.shape[0] * np.dot(X.T, diff_Z1)
    # Gradient descent step; the reshape keeps each bias in the shape it was passed in.
    bias[0] += -lr * diff_B1.reshape(bias[0].shape)
    bias[1] += -lr * diff_B2.reshape(bias[1].shape)
    weights[0] += -lr * diff_W1
    weights[1] += -lr * diff_W2
    # The result is a list with one array per layer.
    return weights, bias

In [15]:
#TEST CODE (IF NOTHING HAPPENS, IT MEANS EVERYTHING IS GOOD!)
# AUTOLAB_IGNORE_START

# def test_problem_3():
#     STARTING_WEIGHTS_PATH = "fixtures/start_weights_problem3"
#     ENDING_WEIGHTS_PATH = "fixtures/final_weights_problem3"
#     STARTING_BIAS_PATH = "fixtures/start_bias_problem3"
#     ENDING_BIAS_PATH = "fixtures/final_bias_problem3"
#     PARAMS_PATH = "fixtures/problem3.params.json"

#     params = json.loads(open(PARAMS_PATH, "r").read())

#     (X, Y)= load_data("data/digitstrain.txt")
#     inputWeights = pickle.load(open(STARTING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)
#     finalWeights = pickle.load(open(ENDING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)

#     inputBias = pickle.load(open(STARTING_BIAS_PATH, "rb"), **PICKLE_KWARGS)
#     finalBias = pickle.load(open(ENDING_BIAS_PATH, "rb"), **PICKLE_KWARGS)

#     weightsToTest, biasToTest = update_weights_single_layer(X, Y, inputWeights, inputBias, float(params["LEARNING_RATE"]))

#     assert(isAllClose(finalWeights, weightsToTest))
#     assert(isAllClose(finalBias, biasToTest))

# AUTOLAB_IGNORE_STOP
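
The backward pass for sigmoid hidden units relies on the identity sigmoid'(z) = s (1 - s), where s = sigmoid(z) is the output already computed in the forward pass. The following minimal sketch checks that identity numerically; `sigmoid_sketch` is a stand-in name so the example is self-contained.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
s = sigmoid_sketch(z)

# Derivative computed from the OUTPUT s, as the backward pass does.
from_output = s * (1.0 - s)

# Derivative estimated from the INPUT z with central differences.
eps = 1e-6
numeric = (sigmoid_sketch(z + eps) - sigmoid_sketch(z - eps)) / (2 * eps)

max_diff = np.max(np.abs(from_output - numeric))
print(max_diff)
```

Passing the pre-activation z to a helper written for the output s (or the other way round) is an easy mistake to make, and a check like this catches it quickly.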

### PROBLEM 4: 2-Layers Neural Network With Hidden Units

Extend your code from problem 3 (use cross entropy error) and implement a 2-layer neural network, starting with a simple architecture containing N hidden units in each layer (e.g. with architecture: 784 > 10 > 10 > 10). These hidden units should be using sigmoid activations.

For this problem you must write a function that takes as an input a matrix of X values, a list of Y values (as returned from problem 1), list of weight matrices, a list of bias vectors, and a learning rate and performs a single step of backpropagation. You will need to do both a forward step with the inputs to get the outputs, and then a backward prop to get the gradients. Return the updated weight matrix and bias in the same format as it was passed.

The list of weight matrices is a list with 3 entries where each entry in the list contains a single weight matrix as previously defined in problem 2. For a network with shape 784 > 10 > 10 > 10 the passed list of weight matrices would look like this: [Matrix with shape 784x10, Matrix with shape 10x10, Matrix with shape 10x10]. Note: Though a hidden layer of size 10 is used as an example here, your code must be able to support a hidden layer of dimension N.

The list of bias vectors will be in the form where each entry in the list is a vector with one value per neuron in the layer that the corresponding weight matrix feeds into (e.g. for an architecture of 784 > 10 > 10 > 10, there will be a three-element list with three vectors of size 10).


In [None]:
#Sample

# def update_weights_double_layer(X, Y, weights, bias, lr): 
#     #INSERT YOUR CODE HERE
#     return updated_weights, updated_bias

In [16]:
def cross_entropy_loss_double_layer(S1, Y, weights, bias):
    # Softmax output minus one-hot labels: the loss derivative at the output layer.
    Z = forwardthrough(S1, np.asarray(weights[2]), np.asarray(bias[2]))
    A = np.zeros((Y.shape[0], 10), dtype=int)
    for i in range(Y.shape[0]):
        A[i][int(Y[i])] = 1
    return Z - A

def update_weights_double_layer(X, Y, weights, bias, lr):
    # Convert layer by layer: the matrices differ in shape, so calling
    # np.asarray on the whole list would try to build a ragged array.
    weights = [np.array(w, dtype=float) for w in weights]
    bias = [np.array(b, dtype=float) for b in bias]
    # Forward pass through both hidden layers.
    Z = np.add(np.dot(X, weights[0]), bias[0])
    S = sigmoid(Z)
    Z1 = np.add(np.dot(S, weights[1]), bias[1])
    S1 = sigmoid(Z1)
    # Backward pass: output layer.
    diff_Z = cross_entropy_loss_double_layer(S1, Y, weights, bias)
    diff_B3 = 1 / Y.shape[0] * np.sum(diff_Z, axis=0, keepdims=True)
    diff_W3 = 1 / Y.shape[0] * np.dot(S1.T, diff_Z)
    diff_S1 = np.dot(diff_Z, weights[2].T)
    diff_Z1 = diff_S1 * derivation_sigmoid(S1)
    # Second hidden layer.
    diff_B2 = 1 / Y.shape[0] * np.sum(diff_Z1, axis=0, keepdims=True)
    diff_W2 = 1 / Y.shape[0] * np.dot(S.T, diff_Z1)
    diff_S = np.dot(diff_Z1, weights[1].T)
    diff_Z2 = diff_S * derivation_sigmoid(S)
    # First hidden layer.
    diff_B1 = 1 / Y.shape[0] * np.sum(diff_Z2, axis=0, keepdims=True)
    diff_W1 = 1 / Y.shape[0] * np.dot(X.T, diff_Z2)
    # Gradient descent step; the reshape keeps each bias in the shape it was passed in.
    bias[0] += -lr * diff_B1.reshape(bias[0].shape)
    bias[1] += -lr * diff_B2.reshape(bias[1].shape)
    bias[2] += -lr * diff_B3.reshape(bias[2].shape)
    weights[0] += -lr * diff_W1
    weights[1] += -lr * diff_W2
    weights[2] += -lr * diff_W3
    # The result is a list with one array per layer.
    return weights, bias

In [18]:
#TEST CODE (IF NOTHING HAPPENS, IT MEANS EVERYTHING IS GOOD!)
# AUTOLAB_IGNORE_START
# def test_problem_4():
#     STARTING_WEIGHTS_PATH = "fixtures/start_weights_problem4"
#     ENDING_WEIGHTS_PATH = "fixtures/final_weights_problem4"
#     STARTING_BIAS_PATH = "fixtures/start_bias_problem4"
#     ENDING_BIAS_PATH = "fixtures/final_bias_problem4"
#     PARAMS_PATH = "fixtures/problem4.params.json"

#     params = json.loads(open(PARAMS_PATH, "r").read())

#     (X, Y)= main.load_data("data/digitstrain.txt")
#     inputWeights = pickle.load(open(STARTING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)
#     finalWeights = pickle.load(open(ENDING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)

#     inputBias = pickle.load(open(STARTING_BIAS_PATH, "rb"), **PICKLE_KWARGS)
#     finalBias = pickle.load(open(ENDING_BIAS_PATH, "rb"), **PICKLE_KWARGS)

#     weightsToTest, biasToTest = main.update_weights_double_layer(X, Y, inputWeights, inputBias, float(params["LEARNING_RATE"]))

#     assert(isAllClose(finalWeights, weightsToTest))
#     assert(isAllClose(finalBias, biasToTest))
# AUTOLAB_IGNORE_STOP
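
The hand-unrolled backward pass follows the same pattern at every layer, so it can also be written as a loop over layers. The sketch below is one way to do that for any number of sigmoid hidden layers; it is an illustration under assumptions (hypothetical function names, biases stored as 1 x n rows, integer labels, random toy data), not the assignment's reference solution.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward_sketch(X, weights, bias):
    """Return [input, hidden activations...] and the softmax output."""
    activations = [X]
    for W, b in zip(weights[:-1], bias[:-1]):
        activations.append(sigmoid_sketch(activations[-1] @ W + b))
    probs = softmax_rows(activations[-1] @ weights[-1] + bias[-1])
    return activations, probs

def mean_cross_entropy_sketch(X, labels, weights, bias):
    _, probs = forward_sketch(X, weights, bias)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def backprop_step_sketch(X, labels, weights, bias, lr):
    """One full-batch gradient descent step; returns new parameter lists."""
    n = X.shape[0]
    activations, probs = forward_sketch(X, weights, bias)
    delta = probs.copy()
    delta[np.arange(n), labels] -= 1.0      # softmax output minus one-hot labels
    new_weights, new_bias = list(weights), list(bias)
    for k in range(len(weights) - 1, -1, -1):
        grad_W = activations[k].T @ delta / n
        grad_b = delta.sum(axis=0, keepdims=True) / n
        if k > 0:
            # Propagate through the old weights, then through the sigmoid.
            a = activations[k]
            delta = (delta @ weights[k].T) * a * (1.0 - a)
        new_weights[k] = weights[k] - lr * grad_W
        new_bias[k] = bias[k] - lr * grad_b
    return new_weights, new_bias

# Toy problem: 20 random examples, architecture 6 > 5 > 4 > 3.
rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3]
X = rng.random((20, sizes[0]))
labels = rng.integers(0, sizes[-1], size=20)
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bias = [np.zeros((1, b)) for b in sizes[1:]]

loss_before = mean_cross_entropy_sketch(X, labels, weights, bias)
for _ in range(50):
    weights, bias = backprop_step_sketch(X, labels, weights, bias, lr=0.1)
loss_after = mean_cross_entropy_sketch(X, labels, weights, bias)
print(loss_before, loss_after)
```

With an empty list of hidden layers the loop reduces to the zero-hidden-layer case from problem 2.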

### PROBLEM 5: Different Activation Functions With Momentum

Extend your code from problem 4 to support different activation functions, passed as a parameter, and to implement momentum in your gradient descent. In this problem all activations (except the final layer, which should remain a softmax) must be changed to the passed activation function. The momentum value will also be passed as a parameter. Your function should run for “epochs” epochs and return the resulting weights and biases.
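
The momentum update itself is independent of the network. The minimal sketch below applies it to a one-dimensional quadratic so the effect is easy to follow. `momentum_descent_sketch`, the toy objective (w - 3)^2 and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def momentum_descent_sketch(grad_fn, w0, lr, momentum, epochs):
    """Gradient descent with momentum: the step is a running blend of past steps."""
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(epochs):
        # New step = momentum * previous step - lr * current gradient.
        velocity = momentum * velocity - lr * grad_fn(w)
        w = w + velocity
    return w

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)

w_momentum = momentum_descent_sketch(grad, [0.0], lr=0.1, momentum=0.9, epochs=200)
w_plain = momentum_descent_sketch(grad, [0.0], lr=0.1, momentum=0.0, epochs=200)
print(w_momentum, w_plain)
```

Setting the momentum to 0 recovers plain gradient descent, which makes it a convenient switch when comparing against the code from problem 4.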


In [None]:
#Sample

# def update_weights_double_layer_act_mom(X, Y, weights, bias, lr, activation, momentum, epochs): 
#     #INSERT YOUR CODE HERE
#     if activation == 'sigmoid':
#         #INSERT YOUR CODE HERE
#     if activation == 'tanh':
#         #INSERT YOUR CODE HERE
#     if activation == 'relu':
#         #INSERT YOUR CODE HERE
#     #INSERT YOUR CODE HERE
#     return updated_weights, updated_bias

In [20]:
# tanh activation for the forward pass
def tanh(x):
    # np.tanh avoids the overflow that sinh(x) / cosh(x) hits for large |x|.
    return np.tanh(x)

# tanh derivative for the backward pass.
# x is the tanh OUTPUT a = tanh(z), as with derivation_sigmoid, so tanh'(z) = 1 - a ** 2.
def derivation_tanh(x):
    return 1.0 - x ** 2

# relu activation for the forward pass
def relu(x):
    return np.maximum(x, 0)

# relu derivative for the backward pass.
# The relu output is positive exactly where its input was, so x may be the output.
def derivation_relu(x):
    return np.greater(x, 0).astype(int)

def update_weights_double_layer_act_mom(X, Y, weights, bias, lr, activation, momentum, epochs):
    # Pick the forward function and its derivative once; everything else is shared.
    activations = {
        'sigmoid': (sigmoid, derivation_sigmoid),
        'tanh': (tanh, derivation_tanh),
        'relu': (relu, derivation_relu),
    }
    if activation not in activations:
        raise ValueError("unknown activation: " + str(activation))
    act, derivation = activations[activation]

    # Convert layer by layer: the matrices differ in shape, so calling
    # np.asarray on the whole list would try to build a ragged array.
    weights = [np.array(w, dtype=float) for w in weights]
    bias = [np.array(b, dtype=float) for b in bias]

    # One-hot encode the labels once; they do not change between epochs.
    n = Y.shape[0]
    A = np.zeros((n, 10), dtype=int)
    A[np.arange(n), np.asarray(Y).astype(int)] = 1

    deltaW = [0, 0, 0]  # previous weight steps, one per layer
    for i in range(epochs):
        # Forward pass.
        S = act(np.add(np.dot(X, weights[0]), bias[0]))
        S1 = act(np.add(np.dot(S, weights[1]), bias[1]))
        diff_Z = forwardthrough(S1, weights[2], bias[2]) - A
        # Backward pass: all deltas are computed before any parameter changes.
        diff_Z1 = np.dot(diff_Z, weights[2].T) * derivation(S1)
        diff_Z2 = np.dot(diff_Z1, weights[1].T) * derivation(S)
        layer_inputs = [X, S, S1]
        layer_diffs = [diff_Z2, diff_Z1, diff_Z]
        for k in range(3):
            diff_B = 1 / n * np.sum(layer_diffs[k], axis=0, keepdims=True)
            diff_W = 1 / n * np.dot(layer_inputs[k].T, layer_diffs[k])
            # Biases take a plain gradient step; momentum is applied to the weights only.
            bias[k] += -lr * diff_B.reshape(bias[k].shape)
            deltaW[k] = momentum * deltaW[k] - lr * diff_W
            weights[k] += deltaW[k]

    # The result is a list with one array per layer.
    return weights, bias

In [24]:
#TEST CODE (IF NOTHING HAPPENS, IT MEANS EVERYTHING IS GOOD!)
# AUTOLAB_IGNORE_START

# def test_problem_6():
#     STARTING_WEIGHTS_PATH = "fixtures/start_weights_problem6"
#     ENDING_WEIGHTS_PATH = "fixtures/final_weights_problem6"
#     STARTING_BIAS_PATH = "fixtures/start_bias_problem6"
#     ENDING_BIAS_PATH = "fixtures/final_bias_problem6"
#     PARAMS_PATH = "fixtures/problem6.params.json"

#     params = json.loads(open(PARAMS_PATH, "r").read())

#     (X, Y)= main.load_data("data/digitstrain.txt")
#     inputWeights = pickle.load(open(STARTING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)
#     finalWeights = pickle.load(open(ENDING_WEIGHTS_PATH, "rb"), **PICKLE_KWARGS)

#     inputBias = pickle.load(open(STARTING_BIAS_PATH, "rb"), **PICKLE_KWARGS)
#     finalBias = pickle.load(open(ENDING_BIAS_PATH, "rb"), **PICKLE_KWARGS)

#     weightsToTest, biasToTest = main.update_weights_double_layer_act_mom(X, Y, inputWeights, inputBias, float(params["LEARNING_RATE"]), params["ACTIVATION"], float(params["MOMENTUM"]), int(params["EPOCH_COUNT"]))

#     assert(isAllClose(finalWeights, weightsToTest))
#     assert(isAllClose(finalBias, biasToTest))

# AUTOLAB_IGNORE_STOP

# Summary and References

This tutorial highlighted just a few elements of MLPs and a typical sample problem for them. Much more detail about the algorithm and about MLPs in general is available from the following references.

[1] Bengio, Y.; Courville, A.; Vincent, P. (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (8): 1798–1828. arXiv:1206.5538. doi:10.1109/tpami.2013.50.

[2] Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637.

[3] Olshausen, B. A. (1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature. 381 (6583): 607–609. Bibcode:1996Natur.381..607O. doi:10.1038/381607a0. PMID 8637596.

[4] Rosenblatt, Frank. "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms". Spartan Books, Washington DC, 1961.

[5] Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". David E. Rumelhart, James L. McClelland, and the PDP research group (editors), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundation. MIT Press, 1986.

[6] Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function". Mathematics of Control, Signals, and Systems. 2 (4): 303–314.
