In [36]:
import sys
import math
import random

INTRODUCTION

This tutorial is on Neural Networks, a fundamental concept in Machine Learning. In the recent past, Neural Networks (or Artificial Neural Networks) have been succesfully applied to problems such as speech recognition (thank Neural Networks for the Siri on your phone!), computer visions (Uber definitely makes use of multi hidden layer neural networks for developing driverless cars) etc. 

Neural networks are appropriate for problems with the following characteristics:

1) Training Example With Errors: Neural network training methods are very robust to noise in the training data

2) The ability for people to understand the learned function is not important: The weights learned by neural networks are often difficult for humans to interpret. Learned neural networks are not easily communicated to humans.

3) Long training times are acceptable: Neural Network training algorithms typically require longer training times than other Machine Learning algorithms such as for example Decision Tree Algorithms. Training times can range from a few seconds to many hours depending on factors such as the number of weights in the network, the number of training examples considered, the number of hidden layers etc.

Through this tutorial, I will introduce you to the key concepts used to build Neural Networks, and will also build a simple neural network for a simplified dataset as well.

There are multiple different types of units that can be used to make up Neural Networks. We look at 2 for now:

1) A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold and-1 otherwise

2) To develop a multilayer neural network, we unfrotunately cannot just use perceptrons, as it has a discontinuous theshhold and thus is not differentiable and not suitable for gradient descent. Thus, we look at another type of unit called the sigmoid unit, similar to the percepptron but having a smooth, differentiable threshhold function. Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. We can see the sigmoid function below:

In [38]:
import math

t = 0
for i in xrange(w): #w is list of weights
    t+= w[i] * x[i] #x is inputs

#here, t is a linear combination of inputs
    
def sigmoid(t):
    return (float(1)/float(1+ math.exp(-t)))


Next, we understand the core of how a neural network works:
    
We essentially have some inputs (the input layer) and an output layer (let us take the number of hidden layers to be 0 for now).

Now, for every interconnection between the input layer and the output layer, we have some weight assigned (initially we randomize the weight). Thus, we get a linear combination of the inputs (sum (Wi * Xi for all i)) and then apply the sigmoid function (definted in the cell above) to that to give a predicted output. (NOTE: Linear combination of inputs python code is in cell below)

Of course, it will not be very accurate since we initially randomized the weights. So the question arises, how do we change the weights every time so that our predicted output is closer to the actual output. A very common technique to so this is known as the backpropagation algorithm, which is explained and shown below.

In [39]:
def linearCombination(X, W):
    #here, X is a list of all inputs, W is a list of all weights
    total = 0
    for i in xrange(len(X)):
        total += X[i] * W[i]
    return total

The BackPropagation Algorithm: This algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs. 

A step by step (and easy to implement!) implementation is as follows:

1) Initialize all weights to small random numbers

2) For each training example (X, t) (where t is the actual output): 

        (a) For each output unit k: calculate DeltaK = O_k * (1 - O_k)  * (T_k - O_k)
        
        (b)  For each hidden unit k: calculate DeltaH = O_h * (1 - O_h) * sum(W_h,k * DeltaK for all outputs k)
        
        (c) Update each network weight W_i,j = W_i,j  + learningRate * DeltaJ * X_i,j

(Note: O_k is the predicted  output K at every step, O_h is the predicted out for each hidden unit h, W_h,k is the weight of the interconnection between hidden unit h and output unit k, deltaJ is either the DeltaK or DeltaH depending on whether J is a hidden or output unit, W_i,j is the weight of the interconnection between i and j)

Some helper functions that we will use of the backpropagation algorithm to implement a neural network are written below:



In [40]:
def updateWeight(learningRate, deltaJ, X_ij, currentWeight):
    return currentWeight + learningRate * deltaJ * X_ij

def getDeltaK(predictedOutput, actualOutput):
    return predictedOutput  * (1-predictedOutput) * (actualOutput - predictedOutput)


def getDeltaH(val, w_k, delta_k):
    # getting delta H if there is only one output unit (this is the case we will handle in our sample dataset)
    return val * (1-val) * (w_k  * delta_k)
    

Now, we actually build our own neural network and use it for a particular dataset. For this tutorial, we use a simple music songs dataset with attributes as follows:

1) Year of release of the song : Discrete (any year between 1900 and 2000)
2) Length of recording (in minutes): Continuous
3) Jazz (Yes or No): Binary Categorical
4) Rock & Roll (Yes or No): Binary Categorical
5) Ouput Variable: Hit (Yes or No): Whether the song made it to the Billboard Top 50 songs)

We build a neural network with 4 input units (one for each of the inputs), one hidden layer with 3 units, and one output unit (the output).
We start by initializing the weights for all inter-connections (we use the random function to randomly initialize them).
Next, we go through every line of the training dataset and for the variables 'Jazz' and 'Rock & roll', we convert 'yes' and 'no' to 1 and 0. 
Then, we calculate the sigmoid function for each of the hidden units , and subsequently for the output unit (which essentially the predicted output).
We then continuously update our weights by the BackPropagation algorithm described above, and then calculate the mean squared error.
We do this for 500 iterations of the training data (I chose 500 at random to gain more accuracy but still remain efficient).


In [54]:
def main():
    inputX = []
    w11, w12,  w13 ,w21 , w22 , w23 , w31 = random.random(), random.random(),random.random(),random.random(),random.random(),random.random(),random.random()
    w32 ,w33 , w41 , w42 , w43 , w5 , w6 , w7 = random.random(),random.random(),random.random(),random.random(),random.random(),random.random(),random.random(),random.random()
    count = 0
    startMSE = 0
    finalMSE = 0
    totalLines = 0
    while (count< 500): 
        count+=1
        totalSquaredError = 0
        for line in open("music_train.csv", 'r').readlines()[1:]:
            if (count ==1):
                totalLines+=1
            temp1 = 0 #value of X3
            temp2 = 0 #value of X4
            temp3 = 0 #actual output
            values = line.split(",")
            inputX = values[0:4]  #X1,X2,X3,X4
            x1 = float(float(inputX[0]) - 1900)/float(100) #need to normalize x1
            x2 = float(inputX[1])/float(7) #need for normalization
            if (inputX[2]=="yes"): #converting "yes and no's to 1 and 0
                temp1 = 1 #value of thirs input
            if (inputX[3]=="yes"): #converting yes and no to 1 and 0
                temp2 = 1 #value of fourth input


            t1 = 0
            t2 = 0
            t3 = 0
            
            #hidden layer unit one
            #t1 = (x1* w11) + (x2 * w21) + (temp1 * w31 )+ (temp2 * w41)  
            t1 = linearCombination([x1,x2,temp1,temp2], [w11,w21,w31,w41])
            Y5 =  sigmoid(t1) #output for hidden unit 1

            #hidden layer unit two 
            #t2 = (x1* w12) + (x2 * w22) + (temp1 * w32 )+ (temp2 * w42) 
            t2 = linearCombination([x1,x2,temp1,temp2], [w12,w22,w32,w42])
            Y6 = sigmoid(t2) #output for hidden unit 2

            #hidden layer unit three
            #t3 = (x1* w13) + (x2 * w23) + (temp1 * w33 )+ (temp2 * w43)
            t3 = linearCombination([x1,x2,temp1,temp2], [w13,w23,w33,w43])
            Y7 = sigmoid(t3) #output for hidden unit 3


            t4 = 0
            #t4 = Y5 * w5 + Y6 * w6 + Y7 * w7
            t4 = linearCombination([Y5,Y6,Y7], [w5,w6,w7])
            predictedOutput = sigmoid(t4) 

            #step 2
            deltaK = 0
            if (values[4]=="yes\r\n"):
                temp3 = 1 # temp 3 is the actual output 
            deltaK = getDeltaK(predictedOutput, temp3)
            w5 = (0.01 * deltaK * Y5) + w5
            w6 = (0.01 * deltaK * Y6) + w6
            w7  = (0.01 * deltaK * Y7 ) + w7


            #step3 
            deltaH1 , deltaH2 , deltaH3 = 0,0,0

            deltaH1 = getDeltaH(Y5, w5, deltaK)

            deltaH2 = getDeltaH(Y6, w6, deltaK)

            deltaH3 = getDeltaH(Y7, w7, deltaK)


            #update weights
            w11 = updateWeight(0.01, deltaH1, x1,w11)
            w12 = updateWeight(0.01, deltaH2, x1,w12)
            w13 = updateWeight(0.01, deltaH3, x1,w13)
            w21 = updateWeight(0.01, deltaH1, x2,w21)
            w22 = updateWeight(0.01, deltaH2, x2,w22)
            w23 = updateWeight(0.01, deltaH3, x2,w23)
            w31 = updateWeight(0.01, deltaH1, float(temp1),w31)
            w32 = updateWeight(0.01, deltaH2, float(temp1),w32)
            w33 = updateWeight(0.01, deltaH3, float(temp1),w33)
            w41 = updateWeight(0.01, deltaH1, float(temp2),w41)
            w42 = updateWeight(0.01, deltaH2, float(temp2),w42)
            w43 = updateWeight(0.01, deltaH3, float(temp2),w43)
            totalSquaredError += (predictedOutput - float(temp3)) * (predictedOutput - float(temp3))
        MSE = float(totalSquaredError) / float(totalLines)
        if (count==1):
            startMSE = MSE
        #sys.stdout.write("%f\n"%MSE)
    #sys.stdout.write("TRAINING COMPLETED! NOW PREDICTING.\n")
    finalMSE = MSE

    W = (w11,w12,w13,w21,w22,w23,w31,w32,w33,w41,w42,w43,w5,w6,w7) #final weights from our model
    return (startMSE, finalMSE, W)

main()[0],main()[1]




(0.28333395964841607, 0.1858211752359032)

We see that, comparing the MSE for the first iteration of the training set compared to the last iteration of the training set, our MSE has decreased from 0.283 to 0.185. This shows that the greater the number of times we run our training dataset, the better our weights get trained and lesser the MSE (i.e. backpropagation algorithm does work).

Next, we check how accurate our model is for a test dataset:

In [69]:
def getErrorRateForTestSet(W):
    (w11,w12,w13,w21,w22,w23,w31,w32,w33,w41,w42,w43,w5,w6,w7) = W
    #do on training data
    predictions  = []
    error = 0
    total = 0
    for line in open("music_dev.csv", 'r').readlines()[1:]:
        temp1 = 0 #value of X3
        temp2 = 0 #value of X4
        temp3 = 0 #actual output
        values = line.split(",")
        inputX = values[0:4]  #X1,X2,X3,X4
        output = values[4]
        x1 = float(float(inputX[0]) - 1900)/float(100)
        x2 = float(inputX[1])/float(7)
        if (inputX[2]=="yes"):
            temp1 = 1 #value of thirs input
        if (inputX[3]=="yes"):
            temp2 = 1 #value of fourth input
        t1 = 0
        t2 = 0
        t3 = 0
        
        #hidden layer unit one
        t1 = linearCombination([x1,x2,temp1,temp2], [w11,w21,w31,w41])
        Y5 =  sigmoid(t1)

        #hidden layer unit two 

        t2 = linearCombination([x1,x2,temp1,temp2], [w12,w22,w32,w42])
        Y6 = sigmoid(t2)

        #hidden layer unit three
        t3 = linearCombination([x1,x2,temp1,temp2], [w13,w23,w33,w43])
        Y7 = sigmoid(t3)


        t4 = 0
        #t4 = Y5 * w5 + Y6 * w6 + Y7 * w7
        t4 = linearCombination([Y5,Y6,Y7], [w5,w6,w7])
        predictedOutput = sigmoid(t4) 
        if (predictedOutput>=0.5):
            predictions += ["yes"]
        else:
            predictions += ["no"]
        print predictions[-1], output
        if (predictions[-1] + "\r\n" != output):
            error+=1
        total +=1
    errorRate = float(error)/total
    return errorRate
    
weights = main()[2]
getErrorRateForTestSet(weights)

yes no

yes yes

yes yes

yes yes

yes yes

no no

yes yes

yes no

yes yes

yes no

yes no

yes yes

yes yes

yes no

yes yes

yes yes

no no

yes yes

no no

yes yes	

no no

yes no

no no

yes yes

yes yes

yes yes



0.2692307692307692

We first print the prediction vs. the actual output, and at the end we show the error rate.

We see our error rate is ~27%.

Thus, through this tutorial, we were able to understand some funamental concepts associated with Neural Networks. 
We learnt the types of units that make up neural networks, how to train a neural network (i.e. how to train your weights via the BackPropagation Algorithm), and we even made a neural network for a simple dataset.
I hope you find this useful and will try to implement a neural network yourself now!