# Improving How Neural Networks Learn

For the next couple of weeks, we will look at some ways to improve how feedforward neural networks learn. In all cases, we will want to use stochastic gradient descent (SGD) rather than ordinary gradient descent because it will speed up training.

Once we implement SGD with backpropagation, we will have constructed a "vanilla" neural network, which is probably the simplest one that is practical.

Many aspects of the neural networks we have implemented are customizable. Many adaptations have been proposed for various problems; we will cover some of the most effective known adjustments, including the following.

* Alternative loss functions -- sometimes improve training speed and accuracy
* Regularization methods -- help with overfitting
* Alternative activation functions -- sometimes improve training speed
* Initialization strategies -- sometimes improve training speed and convergence

First, let's import some packages we will be using.

In [1]:
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.datasets import mnist

## Feedforward Neural Nets with SGD

The method from last week would be incredibly slow on the full-size MNIST dataset because there is one piece we did not implement: stochastic gradient descent. So far, we feed every single example into the net, feed it forward, run backprop, and make a weight update. Instead, we will make weight updates based on random mini-batches of datapoints.
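
To make the mini-batch idea concrete, here is a minimal sketch on made-up toy data. The helper name `iterate_minibatches` and the toy arrays are assumptions for illustration only; the class in this notebook uses its own `getNextBatch` method after shuffling the data.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that together cover the data once."""
    # draw a random ordering of the example indices
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# toy data: 10 examples with 3 features each (made up for illustration)
rng = np.random.default_rng(0)
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.arange(10)

# the last batch is smaller when the batch size does not divide the data evenly
batch_sizes = [xb.shape[0] for xb, yb in iterate_minibatches(X_toy, y_toy, 4, rng)]
print(batch_sizes)  # [4, 4, 2]
```

One pass over all of the batches is one epoch, and each batch produces one weight update.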

Below, we add a function that creates mini-batches of images to the class we wrote last week.

In [2]:
class FeedforwardNeuralNetworkSGD:
    
    # input a vector [a, b, c, ...] with the number of nodes in each layer
    def __init__(self, layers, alpha = 0.1, batchSize = 32):
        # list of weight matrices between layers
        self.W = []
        
        # network architecture will be a vector of numbers of nodes for each layer
        self.layers = layers
        
        # learning rate
        self.alpha = alpha
        
        # batch size
        self.batchSize = batchSize
        
        # initialize the weights (randomly) -- this is our initial guess for gradient descent
        
        # initialize the weights between layers (up to the next-to-last one) as normal random variables
        for i in np.arange(0, len(layers) - 2):
            self.W.append(np.random.randn(layers[i] + 1, layers[i + 1] + 1))
            
        # initialize weights between the last two layers (we don't want bias for the last one)
        self.W.append(np.random.randn(layers[-2] + 1, layers[-1]))
        
    # define the sigmoid activation
    def sigmoid(self, x):
        return 1.0 / (1 + np.exp(-x))
    
    # define the sigmoid derivative (where z is the output of a sigmoid)
    def sigmoidDerivative(self, z):
        return z * (1 - z)
    
    def getNextBatch(self, X, y, batchSize):
        for i in np.arange(0, X.shape[0], batchSize):
            yield (X[i:i + batchSize], y[i:i + batchSize])
    
    # fit the model
    def fit(self, X, y, epochs = 10000, update = 1000):
        # add a column of ones to the end of X
        X = np.hstack((X, np.ones([X.shape[0],1])))

        for epoch in np.arange(0,epochs):
            
            # randomize the examples
            p = np.arange(0,X.shape[0])
            np.random.shuffle(p)
            X = X[p]
            y = y[p]

            # feed forward, backprop, and weight update
            for (x, target) in self.getNextBatch(X, y, self.batchSize):
                # make a list of output activations from the first layer
                # (just the original x values)
                A = [np.atleast_2d(x)]
                
                # feed forward
                for layer in np.arange(0, len(self.W)):
                    
                    # feed through one layer and apply sigmoid activation
                    net = A[layer].dot(self.W[layer])
                    out = self.sigmoid(net)
                    
                    # add our network output to the list of activations
                    A.append(out)
                    
                # backpropagation: start from the error at the output layer
                error = A[-1] - target
                
                D = [error * self.sigmoidDerivative(A[-1])]
                
                # loop backwards over the layers to build up deltas
                for layer in np.arange(len(A) - 2, 0, -1):
                    delta = D[-1].dot(self.W[layer].T)
                    delta = delta * self.sigmoidDerivative(A[layer])
                    D.append(delta)
                    
                # reverse the deltas since we looped in reverse
                D = D[::-1]
                
                # weight update
                for layer in np.arange(0, len(self.W)):
                    self.W[layer] -= self.alpha * A[layer].T.dot(D[layer])
                    
            if (epoch + 1) % update == 0:
                loss = self.computeLoss(X,y)
                print("[INFO] epoch = {}, loss = {:.6f}".format(epoch + 1, loss))
                
    def predict(self, X, addOnes = True):
        # initialize data, be sure it's the right dimension
        p = np.atleast_2d(X)
        
        # add a column of 1s for bias
        if addOnes:
            p = np.hstack((p, np.ones([p.shape[0],1])))
        
        # feed forward!
        for layer in np.arange(0, len(self.W)):
            p = self.sigmoid(np.dot(p, self.W[layer]))
            
        return p
    
    def computeLoss(self, X, y):
        # initialize data, be sure it's the right dimension
        y = np.atleast_2d(y)
        
        # feed the datapoints through the network to get predicted outputs
        predictions = self.predict(X, addOnes = False)
        loss = np.sum((predictions - y)**2) / 2.0
        
        return loss

In [36]:
### CLASSIFY MNIST PICTURES

# create a dataset of 10000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:10000].reshape([10000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:10000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.5, 32)
model.fit(trainX,trainY,100,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 1006.665768
[INFO] epoch = 20, loss = 715.632151
[INFO] epoch = 30, loss = 469.793882
[INFO] epoch = 40, loss = 427.946123
[INFO] epoch = 50, loss = 101.818410
[INFO] epoch = 60, loss = 200.957703
[INFO] epoch = 70, loss = 75.889761
[INFO] epoch = 80, loss = 50.939743
[INFO] epoch = 90, loss = 43.231690
[INFO] epoch = 100, loss = 41.476832
Training set accuracy
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       751
           1       1.00      0.98      0.99       835
           2       0.98      0.99      0.98       760
           3       0.99      0.99      0.99       771
           4       0.99      1.00      0.99       756
           5       0.99      0.99      0.99       648
           6       0.99      1.00      0.99       753
           7       0.99      0.99      0.99       802
           8       0.99      0.99      0.99       702
           9       0.99      0.98      0.99       722

    accuracy  

Let's train for more epochs and see if it helps.

In [6]:
### CLASSIFY MNIST PICTURES

# create a dataset of 10000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:10000].reshape([10000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:10000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.5, 32)
model.fit(trainX,trainY,1000,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 680.959643
[INFO] epoch = 20, loss = 204.763718
[INFO] epoch = 30, loss = 175.239784
[INFO] epoch = 40, loss = 152.383397
[INFO] epoch = 50, loss = 73.337793
[INFO] epoch = 60, loss = 61.043240
[INFO] epoch = 70, loss = 48.287076
[INFO] epoch = 80, loss = 44.288870
[INFO] epoch = 90, loss = 41.800286
[INFO] epoch = 100, loss = 53.489791
[INFO] epoch = 110, loss = 245.309725
[INFO] epoch = 120, loss = 160.131553
[INFO] epoch = 130, loss = 81.348472
[INFO] epoch = 140, loss = 42.997131
[INFO] epoch = 150, loss = 35.503840
[INFO] epoch = 160, loss = 34.130473
[INFO] epoch = 170, loss = 55.296433
[INFO] epoch = 180, loss = 38.633015
[INFO] epoch = 190, loss = 30.294394
[INFO] epoch = 200, loss = 29.276146
[INFO] epoch = 210, loss = 28.271499
[INFO] epoch = 220, loss = 27.940187
[INFO] epoch = 230, loss = 27.761435
[INFO] epoch = 240, loss = 27.299578
[INFO] epoch = 250, loss = 27.196518
[INFO] epoch = 260, loss = 26.999102
[INFO] epoch = 270, loss = 26.804968
[INF

This approach leads to a testing accuracy of about 93% on this 10,000-image subset of MNIST.
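
For reference, an accuracy figure like this is the fraction of images whose largest network output matches the true label. The sketch below shows that calculation on made-up outputs; the helper name `accuracy_from_outputs` and the toy arrays are assumptions for illustration and are not part of the notebook's class.

```python
import numpy as np

def accuracy_from_outputs(outputs, labels):
    """Fraction of rows whose largest output matches the integer label."""
    predicted = np.argmax(outputs, axis=1)
    return float(np.mean(predicted == labels))

# made-up network outputs for 4 examples and 3 classes
outputs = np.array([[0.1, 0.8, 0.1],
                    [0.7, 0.2, 0.1],
                    [0.2, 0.3, 0.5],
                    [0.6, 0.3, 0.1]])
labels = np.array([1, 0, 2, 1])

acc = accuracy_from_outputs(outputs, labels)
print(acc)  # 0.75 (the last example is predicted as class 0 instead of 1)
```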

Adding layers to the network or increasing the number of nodes in each layer sometimes gives a better result, but sometimes it only adds computation for little gain. Let's see if it helps for this problem.

In [11]:
### CLASSIFY MNIST PICTURES

# create a dataset of 10000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:10000].reshape([10000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:10000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 64, 32, 10], 0.5, 32)
model.fit(trainX,trainY,1000,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 1245.671410
[INFO] epoch = 20, loss = 472.016254
[INFO] epoch = 30, loss = 405.070409
[INFO] epoch = 40, loss = 98.840420
[INFO] epoch = 50, loss = 51.696259
[INFO] epoch = 60, loss = 33.487088
[INFO] epoch = 70, loss = 30.991897
[INFO] epoch = 80, loss = 28.566550
[INFO] epoch = 90, loss = 27.811007
[INFO] epoch = 100, loss = 27.183523
[INFO] epoch = 110, loss = 26.921858
[INFO] epoch = 120, loss = 26.246120
[INFO] epoch = 130, loss = 25.538169
[INFO] epoch = 140, loss = 24.126143
[INFO] epoch = 150, loss = 23.405379
[INFO] epoch = 160, loss = 22.832374
[INFO] epoch = 170, loss = 22.591054
[INFO] epoch = 180, loss = 22.010312
[INFO] epoch = 190, loss = 20.923296
[INFO] epoch = 200, loss = 20.832632
[INFO] epoch = 210, loss = 20.361917
[INFO] epoch = 220, loss = 19.808306
[INFO] epoch = 230, loss = 19.761570
[INFO] epoch = 240, loss = 19.727841
[INFO] epoch = 250, loss = 19.239642
[INFO] epoch = 260, loss = 19.211704
[INFO] epoch = 270, loss = 19.193265
[INFO]

Although the loss is smaller here after training, performance didn't improve, and the larger layers make the computation more expensive for no gain. Let's try it with a third hidden layer to see if it helps.

In [8]:
### CLASSIFY MNIST PICTURES

# create a dataset of 10000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:10000].reshape([10000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:10000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 64, 32, 16, 10], 0.5, 32)
model.fit(trainX,trainY,1000,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 360.139740
[INFO] epoch = 20, loss = 230.782219
[INFO] epoch = 30, loss = 222.605312
[INFO] epoch = 40, loss = 139.824198
[INFO] epoch = 50, loss = 168.668274
[INFO] epoch = 60, loss = 75.938406
[INFO] epoch = 70, loss = 105.838271
[INFO] epoch = 80, loss = 78.799024
[INFO] epoch = 90, loss = 93.641459
[INFO] epoch = 100, loss = 74.981546
[INFO] epoch = 110, loss = 100.000989
[INFO] epoch = 120, loss = 87.472935
[INFO] epoch = 130, loss = 140.494931
[INFO] epoch = 140, loss = 87.791402
[INFO] epoch = 150, loss = 142.094736
[INFO] epoch = 160, loss = 62.246703
[INFO] epoch = 170, loss = 105.862143
[INFO] epoch = 180, loss = 72.593607
[INFO] epoch = 190, loss = 32.340485
[INFO] epoch = 200, loss = 52.976588
[INFO] epoch = 210, loss = 219.410054
[INFO] epoch = 220, loss = 90.712570
[INFO] epoch = 230, loss = 60.990585
[INFO] epoch = 240, loss = 47.452495
[INFO] epoch = 250, loss = 85.998142
[INFO] epoch = 260, loss = 49.826051
[INFO] epoch = 270, loss = 55.008830

We did not gain much here: the accuracy is about the same as with some of the smaller architectures we used above, but this network is more expensive to train, so the extra size is not worthwhile.
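
As a rough way to compare the cost of the architectures tried above, we can count the entries in their weight matrices, using the same matrix shapes that `__init__` builds in the class. The helper name `count_weights` is an assumption for illustration, and the weight count is only a proxy for training cost, not a timing measurement.

```python
def count_weights(layers):
    """Count weight-matrix entries using the shapes from the class's __init__."""
    total = 0
    # hidden layers carry an extra bias column on both sides
    for i in range(len(layers) - 2):
        total += (layers[i] + 1) * (layers[i + 1] + 1)
    # the last matrix has no bias column on the output side
    total += (layers[-2] + 1) * layers[-1]
    return total

architectures = [[784, 32, 16, 10], [784, 64, 32, 10], [784, 64, 32, 16, 10]]
counts = {tuple(a): count_weights(a) for a in architectures}

for arch, n in counts.items():
    print(arch, n)
# (784, 32, 16, 10) 26636
# (784, 64, 32, 10) 53500
# (784, 64, 32, 16, 10) 53901
```

By this measure, the two larger networks have roughly twice as many weights as the smallest one, and most of them sit in the first weight matrix.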

Let's run a much larger portion of the MNIST dataset (50,000 images) with the smallest net, where computing was cheapest.

In [12]:
### CLASSIFY MNIST PICTURES

# create a dataset of 50000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:50000].reshape([50000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:50000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.5, 32)
model.fit(trainX,trainY,1000,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 1539.695055
[INFO] epoch = 20, loss = 1340.591629
[INFO] epoch = 30, loss = 1228.037576
[INFO] epoch = 40, loss = 1060.644886
[INFO] epoch = 50, loss = 1077.800345
[INFO] epoch = 60, loss = 775.697297
[INFO] epoch = 70, loss = 783.264038
[INFO] epoch = 80, loss = 828.792723
[INFO] epoch = 90, loss = 792.173548
[INFO] epoch = 100, loss = 661.837392
[INFO] epoch = 110, loss = 737.947187
[INFO] epoch = 120, loss = 589.155947
[INFO] epoch = 130, loss = 592.780400
[INFO] epoch = 140, loss = 586.565699
[INFO] epoch = 150, loss = 527.812983
[INFO] epoch = 160, loss = 567.250350
[INFO] epoch = 170, loss = 674.730026
[INFO] epoch = 180, loss = 517.040487
[INFO] epoch = 190, loss = 702.193719
[INFO] epoch = 200, loss = 585.979991
[INFO] epoch = 210, loss = 485.325440
[INFO] epoch = 220, loss = 541.430099
[INFO] epoch = 230, loss = 443.590380
[INFO] epoch = 240, loss = 398.573405
[INFO] epoch = 250, loss = 514.149681
[INFO] epoch = 260, loss = 478.496308
[INFO] epoch = 2

Let's see if more training helps.

In [13]:
### CLASSIFY MNIST PICTURES

# create a dataset of 50000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:50000].reshape([50000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:50000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.5, 32)
model.fit(trainX,trainY,10000,100)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 100, loss = 686.141739
[INFO] epoch = 200, loss = 420.820100
[INFO] epoch = 300, loss = 476.385929
[INFO] epoch = 400, loss = 362.086947
[INFO] epoch = 500, loss = 281.548244
[INFO] epoch = 600, loss = 455.982912
[INFO] epoch = 700, loss = 295.032399
[INFO] epoch = 800, loss = 268.125648
[INFO] epoch = 900, loss = 904.469981




[INFO] epoch = 1000, loss = 625.717588
[INFO] epoch = 1100, loss = 261.739117
[INFO] epoch = 1200, loss = 362.273303
[INFO] epoch = 1300, loss = 384.625477
[INFO] epoch = 1400, loss = 152.174230
[INFO] epoch = 1500, loss = 148.864956
[INFO] epoch = 1600, loss = 147.173036
[INFO] epoch = 1700, loss = 146.361086
[INFO] epoch = 1800, loss = 705.951894
[INFO] epoch = 1900, loss = 378.794403
[INFO] epoch = 2000, loss = 321.649614
[INFO] epoch = 2100, loss = 201.125134
[INFO] epoch = 2200, loss = 173.434265
[INFO] epoch = 2300, loss = 151.807856
[INFO] epoch = 2400, loss = 165.208958
[INFO] epoch = 2500, loss = 484.326458
[INFO] epoch = 2600, loss = 272.031896
[INFO] epoch = 2700, loss = 216.653339
[INFO] epoch = 2800, loss = 398.076728
[INFO] epoch = 2900, loss = 322.041964
[INFO] epoch = 3000, loss = 317.480161
[INFO] epoch = 3100, loss = 585.719710
[INFO] epoch = 3200, loss = 334.106392
[INFO] epoch = 3300, loss = 188.229697
[INFO] epoch = 3400, loss = 186.582060
[INFO] epoch = 3500, loss

Notice that accuracy went up a little, to 95%, with the larger 50,000-image dataset. More data frequently improves results.

## Alternative Loss Functions

Another part of a neural net we can customize is the loss function.

One problem with the squared error loss function is that, with sigmoid outputs, the network tends to learn very slowly when the current output is far from the correct answer, because the gradient includes the sigmoid derivative, which is close to zero when the output saturates. (See the animated graphics in Chapter 3 of <a href="http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function">Nielsen's book</a>.) Ideally, we would like to see more drastic changes when the net is more incorrect in its answers.

The **cross-entropy** loss function accomplishes this (see mathematical details in class or in Nielsen, Ch 3). For $n$ training examples $x$ with target components $y_j$ and output-layer activations $a_j^L$, it is computed as

$$ L(W)=-\frac{1}{n}\sum\limits_x\sum\limits_j \left[y_j\ln a_j^L + (1-y_j)\ln\left(1-a_j^L\right)\right]$$
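
As a hedged illustration of this formula, the sketch below evaluates it on made-up targets and outputs. The helper name `cross_entropy` and the clipping constant `eps` are assumptions: clipping is swapped in here to avoid `log(0)`, whereas the class in this notebook uses `np.nan_to_num` and sums over examples without dividing by $n$.

```python
import numpy as np

def cross_entropy(outputs, targets, eps=1e-12):
    """Mean over examples of the summed binary cross-entropy over output nodes."""
    # keep outputs strictly inside (0, 1) so the logarithms stay finite (assumed safeguard)
    clipped = np.clip(outputs, eps, 1.0 - eps)
    per_example = -np.sum(
        targets * np.log(clipped) + (1 - targets) * np.log(1 - clipped), axis=1
    )
    return float(np.mean(per_example))

# made-up one-hot targets and two sets of network outputs
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
good = np.array([[0.9, 0.1], [0.2, 0.8]])   # mostly correct
bad = np.array([[0.1, 0.9], [0.8, 0.2]])    # mostly wrong

loss_good = cross_entropy(good, targets)
loss_bad = cross_entropy(bad, targets)
print(loss_good, loss_bad)
```

The loss is small when the outputs are close to the targets and grows quickly as the outputs become confidently wrong.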

Cross-entropy not only speeds up training, but can sometimes help with overfitting as well.
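
The speed-up can be seen in the output-layer delta. With a sigmoid output $a$ and target $y$, squared error gives a delta of $(a - y)\,a\,(1 - a)$, while cross-entropy gives simply $a - y$. The sketch below compares the two on a few made-up pre-activation values; the variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up pre-activations for one output node, from badly wrong to nearly right
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
a = sigmoid(z)
y = 1.0  # target for this node

# output-layer deltas for the two loss functions
delta_squared = (a - y) * a * (1 - a)   # includes the sigmoid derivative
delta_cross_entropy = a - y             # sigmoid derivative cancels

for zi, ds, dc in zip(z, delta_squared, delta_cross_entropy):
    print("z = {:5.1f}  squared error: {:9.6f}  cross-entropy: {:9.6f}".format(zi, ds, dc))
```

When the output is badly wrong (the most negative `z`), the squared-error delta is tiny because the sigmoid is saturated, while the cross-entropy delta is close to -1, so the weights change much more.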

In [3]:
class FeedforwardNeuralNetworkSGD:
    
    # input a vector [a, b, c, ...] with the number of nodes in each layer
    def __init__(self, layers, alpha = 0.1, batchSize = 32, loss = "sum-of-squares"):
        # list of weight matrices between layers
        self.W = []
        
        # network architecture will be a vector of numbers of nodes for each layer
        self.layers = layers
        
        # learning rate
        self.alpha = alpha
        
        # batch size
        self.batchSize = batchSize
        
        # loss function
        self.loss = loss
        
        # initialize the weights (randomly) -- this is our initial guess for gradient descent
        
        # initialize the weights between layers (up to the next-to-last one) as normal random variables
        for i in np.arange(0, len(layers) - 2):
            self.W.append(np.random.randn(layers[i] + 1, layers[i + 1] + 1))
            
        # initialize weights between the last two layers (we don't want bias for the last one)
        self.W.append(np.random.randn(layers[-2] + 1, layers[-1]))
        
    # define the sigmoid activation
    def sigmoid(self, x):
        return 1.0 / (1 + np.exp(-x))
    
    # define the sigmoid derivative (where z is the output of a sigmoid)
    def sigmoidDerivative(self, z):
        return z * (1 - z)
    
    def getNextBatch(self, X, y, batchSize):
        for i in np.arange(0, X.shape[0], batchSize):
            yield (X[i:i + batchSize], y[i:i + batchSize])
    
    # fit the model
    def fit(self, X, y, epochs = 10000, update = 1000):
        # add a column of ones to the end of X
        X = np.hstack((X, np.ones([X.shape[0],1])))

        for epoch in np.arange(0,epochs):
            
            # randomize the examples
            p = np.arange(0,X.shape[0])
            np.random.shuffle(p)
            X = X[p]
            y = y[p]

            # feed forward, backprop, and weight update
            for (x, target) in self.getNextBatch(X, y, self.batchSize):
                # make a list of output activations from the first layer
                # (just the original x values)
                A = [np.atleast_2d(x)]
                
                # feed forward
                for layer in np.arange(0, len(self.W)):
                    
                    # feed through one layer and apply sigmoid activation
                    net = A[layer].dot(self.W[layer])
                    out = self.sigmoid(net)
                    
                    # add our network output to the list of activations
                    A.append(out)
                    
                # backpropagation
                error = A[-1] - target
                
                if self.loss == "sum-of-squares":
                    D = [error * self.sigmoidDerivative(A[-1])]
                    
                if self.loss == "cross-entropy":
                    D = [error]

                # loop backwards over the layers to build up deltas
                for layer in np.arange(len(A) - 2, 0, -1):
                    delta = D[-1].dot(self.W[layer].T)
                    delta = delta * self.sigmoidDerivative(A[layer])
                    D.append(delta)
                    
                # reverse the deltas since we looped in reverse
                D = D[::-1]
                
                # weight update
                for layer in np.arange(0, len(self.W)):
                    self.W[layer] -= self.alpha * A[layer].T.dot(D[layer])
                    
            if (epoch + 1) % update == 0:
                loss = self.computeLoss(X,y)
                print("[INFO] epoch = {}, loss = {:.6f}".format(epoch + 1, loss))
                
    def predict(self, X, addOnes = True):
        # initialize data, be sure it's the right dimension
        p = np.atleast_2d(X)
        
        # add a column of 1s for bias
        if addOnes:
            p = np.hstack((p, np.ones([p.shape[0],1])))
        
        # feed forward!
        for layer in np.arange(0, len(self.W)):
            p = self.sigmoid(np.dot(p, self.W[layer]))
            
        return p
    
    def computeLoss(self, X, y):
        # initialize data, be sure it's the right dimension
        y = np.atleast_2d(y)
        
        # feed the datapoints through the network to get predicted outputs
        predictions = self.predict(X, addOnes = False)
        
        if self.loss == "sum-of-squares":
            loss = np.sum((predictions - y)**2) / 2.0
            
        if self.loss == "cross-entropy":
            loss = np.sum(np.nan_to_num(-y*np.log(predictions)-(1-y)*np.log(1-predictions)))
        
        return loss

In [4]:
### CLASSIFY MNIST PICTURES

# create a dataset of 10000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:10000].reshape([10000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:10000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.1, 32, "cross-entropy")
model.fit(trainX,trainY,100,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 2718.886280
[INFO] epoch = 20, loss = 1675.718220


KeyboardInterrupt: 

In [None]:
### CLASSIFY MNIST PICTURES

# create a dataset of 10000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:10000].reshape([10000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:10000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.1, 32, "cross-entropy")
model.fit(trainX,trainY,1000,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

In [None]:
### CLASSIFY MNIST PICTURES

# create a dataset of 50000 MNIST images, reshaped as single vectors, and labels
data = mnist.load_data()

# The datapoints are in data[0][0]
X = data[0][0][:50000].reshape([50000,28*28])
X = X/255.0

# The labels are in data[0][1]
Y = data[0][1][:50000]

# randomly choose 75% of the data to be the training set and 25% for the testing set
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25)

trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# fit the model to the training data
model = FeedforwardNeuralNetworkSGD([784, 32, 16, 10], 0.1, 32, "cross-entropy")
model.fit(trainX,trainY,1000,10)

# print the classification performance
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)

trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))

print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)

testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))

[INFO] epoch = 10, loss = 13558.998781
[INFO] epoch = 20, loss = 10709.692479
[INFO] epoch = 30, loss = 9473.620639
[INFO] epoch = 40, loss = 8475.926553
[INFO] epoch = 50, loss = 6198.636888
[INFO] epoch = 60, loss = 6167.389599
[INFO] epoch = 70, loss = 5282.109816
[INFO] epoch = 80, loss = 5610.474029
[INFO] epoch = 90, loss = 5791.741880
[INFO] epoch = 100, loss = 4932.725486
[INFO] epoch = 110, loss = 5067.825925
[INFO] epoch = 120, loss = 5705.379545
[INFO] epoch = 130, loss = 5217.864736
[INFO] epoch = 140, loss = 4704.094505
[INFO] epoch = 150, loss = 3087.366765
[INFO] epoch = 160, loss = 4519.813385
[INFO] epoch = 170, loss = 3414.704367
[INFO] epoch = 180, loss = 3500.955825
