### Understanding Backpropagation in Matrix Form, from Michael A. Nielsen
This notebook applies the matrix approach to backpropagation over mini-batches and shows how it can speed up the training of a neural network. It compares this approach with looping over each example in a mini-batch, and looks at how the mini-batch size affects the speed and accuracy of learning.

In [1]:
import pickle
import gzip
import numpy as np
import random

The neural network is trained on the MNIST training data. It has 784 input neurons, a hidden layer of 30 neurons, and 10 output neurons. Each run trains for 30 epochs with a learning rate of η = 3.0, first with a mini-batch size of 1 and then with a mini-batch size of 10, giving the results shown in the runs that follow.
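As a rough sketch of what those layer sizes imply, the snippet below builds weight and bias arrays for a `[784, 30, 10]` network in the same way the `Network` class in this notebook does, and prints their shapes. The variable names `weight_shapes`, `bias_shapes`, and `n_params` are only illustrative.

```python
import numpy as np

sizes = [784, 30, 10]  # input, hidden, and output layer sizes from the text

# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers, shaped (next, previous).
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

weight_shapes = [w.shape for w in weights]
bias_shapes = [b.shape for b in biases]
n_params = sum(w.size for w in weights) + sum(b.size for b in biases)

print(weight_shapes)  # [(30, 784), (10, 30)]
print(bias_shapes)    # [(30, 1), (10, 1)]
print(n_params)       # 23860
```

Each weight matrix has one row per neuron in the next layer, which is why `np.dot(w, a) + b` maps a 784-element column to a 30-element column and then to a 10-element column.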

In [2]:
def load_data():
    f = gzip.open('mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = pickle.load(f, encoding="latin1")
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e



### Looping approach
Mini-batch size = 1

In [3]:
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        training_data = list(training_data)
        n = len(training_data)

        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)

        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {} : {} / {}".format(j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] 
        zs = [] 
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

In [4]:
import time
start_time = time.time()
training_data, validation_data, test_data = load_data_wrapper()
net = Network([784, 30, 10])
net.SGD(training_data, 30, 1, 3.0, test_data=test_data)
print("--- %s seconds ---" % (time.time() - start_time))

Epoch 0 : 8027 / 10000
Epoch 1 : 8235 / 10000
Epoch 2 : 7503 / 10000
Epoch 3 : 7753 / 10000
Epoch 4 : 8236 / 10000
Epoch 5 : 8626 / 10000
Epoch 6 : 8594 / 10000
Epoch 7 : 8633 / 10000
Epoch 8 : 8495 / 10000
Epoch 9 : 8369 / 10000
Epoch 10 : 8696 / 10000
Epoch 11 : 8597 / 10000
Epoch 12 : 8667 / 10000
Epoch 13 : 8647 / 10000
Epoch 14 : 8647 / 10000
Epoch 15 : 8499 / 10000
Epoch 16 : 8745 / 10000
Epoch 17 : 8844 / 10000
Epoch 18 : 8947 / 10000
Epoch 19 : 8758 / 10000
Epoch 20 : 8822 / 10000
Epoch 21 : 8895 / 10000
Epoch 22 : 8919 / 10000
Epoch 23 : 8869 / 10000
Epoch 24 : 8788 / 10000
Epoch 25 : 9018 / 10000
Epoch 26 : 9029 / 10000
Epoch 27 : 8852 / 10000
Epoch 28 : 8928 / 10000
Epoch 29 : 8904 / 10000
--- 308.43776631355286 seconds ---


Training took about 308 seconds (just over 5 minutes) with a mini-batch size of 1.

### Matrix approach
Mini-batch size = 10
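The idea behind the matrix form is that a mini-batch of `m` column vectors can be stacked into one `(784, m)` matrix, so a single `np.dot` replaces `m` separate products. The sketch below uses made-up random data (not MNIST) and one hypothetical layer to check that both routes give the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10                                   # mini-batch size
w = rng.standard_normal((30, 784))       # weights of one layer (random stand-in)
b = rng.standard_normal((30, 1))         # biases of that layer
xs = [rng.standard_normal((784, 1)) for _ in range(m)]  # stand-in inputs

# Looping approach: one matrix-vector product per example.
z_loop = [np.dot(w, x) + b for x in xs]

# Matrix approach: stack the examples as columns, then one matrix-matrix product.
X = np.hstack(xs)                        # shape (784, m)
Z = np.dot(w, X) + b                     # b broadcasts across the m columns

# Column i of Z should equal the weighted input computed for example i.
same = all(np.allclose(Z[:, [i]], z_loop[i]) for i in range(m))
print(X.shape, Z.shape, same)            # (784, 10) (30, 10) True
```

Because the bias has shape `(30, 1)`, NumPy broadcasting adds it to every column, so no explicit loop over examples is needed in the forward pass.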

In [5]:
class NetworkMatrix(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
     
    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def feedforward2(self, a):
        zs = []
        activations = [a]

        activation = a
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        return (zs, activations)
        
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):

        training_data = list(training_data)
        n = len(training_data)

        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)

        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {} : {} / {}".format(j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        batch_size = len(mini_batch)

        
        x = np.asarray([_x.ravel() for _x, _y in mini_batch]).transpose()
        y = np.asarray([_y.ravel() for _x, _y in mini_batch]).transpose()

        nabla_b, nabla_w = self.backprop(x, y)
        self.weights = [w - (eta / batch_size) * nw for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / batch_size) * nb for b, nb in zip(self.biases, nabla_b)]

        return

    def backprop(self, x, y):

        nabla_b = [0 for i in self.biases]
        nabla_w = [0 for i in self.weights]

        # feedforward
        zs, activations = self.feedforward2(x)

        # backward 
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta.sum(1).reshape([len(delta), 1]) # reshape to (n x 1) matrix
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())

        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
            nabla_b[-l] = delta.sum(1).reshape([len(delta), 1]) 
            nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())

        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)
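One way to gain confidence in the matrix form of `backprop` is to check that it returns the same summed gradients as calling it once per example and adding up the results. The sketch below does this on a tiny hypothetical `[5, 4, 3]` network with random data; the standalone `backprop` function is a simplified stand-in written for this check, not the class method itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(weights, biases, x, y):
    """Return gradients summed over the columns of x (one column per example)."""
    activation, activations, zs = x, [x], []
    for b, w in zip(biases, weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    # Output-layer error for the quadratic cost, one column per example.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta.sum(axis=1, keepdims=True)
    nabla_w[-1] = np.dot(delta, activations[-2].T)
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta.sum(axis=1, keepdims=True)
        nabla_w[-l] = np.dot(delta, activations[-l - 1].T)
    return nabla_b, nabla_w

rng = np.random.default_rng(1)
sizes = [5, 4, 3]   # tiny hypothetical network, not the MNIST one
m = 6               # mini-batch size for the check
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]
X = rng.standard_normal((sizes[0], m))
Y = rng.random((sizes[-1], m))

# Matrix approach: one call on the whole mini-batch.
nb_mat, nw_mat = backprop(weights, biases, X, Y)

# Looping approach: one call per column, accumulating the gradients.
nb_sum = [np.zeros(b.shape) for b in biases]
nw_sum = [np.zeros(w.shape) for w in weights]
for i in range(m):
    nb_i, nw_i = backprop(weights, biases, X[:, [i]], Y[:, [i]])
    nb_sum = [acc + g for acc, g in zip(nb_sum, nb_i)]
    nw_sum = [acc + g for acc, g in zip(nw_sum, nw_i)]

gradients_match = (
    all(np.allclose(a, g) for a, g in zip(nb_mat, nb_sum))
    and all(np.allclose(a, g) for a, g in zip(nw_mat, nw_sum))
)
print(gradients_match)  # True
```

The weight gradient needs no explicit sum because `np.dot(delta, activations.T)` already adds the per-example outer products, while the bias gradient must be summed across columns by hand.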

In [6]:
import time
start_time = time.time()
training_data, validation_data, test_data = load_data_wrapper()
net_matrix = NetworkMatrix([784, 30, 10])
net_matrix.SGD(training_data, 30, 10, 3.0, test_data=test_data)
print("--- %s seconds ---" % (time.time() - start_time))

Epoch 0 : 9082 / 10000
Epoch 1 : 9228 / 10000
Epoch 2 : 9303 / 10000
Epoch 3 : 9353 / 10000
Epoch 4 : 9346 / 10000
Epoch 5 : 9414 / 10000
Epoch 6 : 9429 / 10000
Epoch 7 : 9423 / 10000
Epoch 8 : 9422 / 10000
Epoch 9 : 9459 / 10000
Epoch 10 : 9452 / 10000
Epoch 11 : 9460 / 10000
Epoch 12 : 9465 / 10000
Epoch 13 : 9479 / 10000
Epoch 14 : 9477 / 10000
Epoch 15 : 9474 / 10000
Epoch 16 : 9457 / 10000
Epoch 17 : 9474 / 10000
Epoch 18 : 9454 / 10000
Epoch 19 : 9489 / 10000
Epoch 20 : 9493 / 10000
Epoch 21 : 9484 / 10000
Epoch 22 : 9491 / 10000
Epoch 23 : 9495 / 10000
Epoch 24 : 9485 / 10000
Epoch 25 : 9493 / 10000
Epoch 26 : 9502 / 10000
Epoch 27 : 9474 / 10000
Epoch 28 : 9491 / 10000
Epoch 29 : 9498 / 10000
--- 52.02427625656128 seconds ---


Training took only about 52 seconds with the matrix approach and a mini-batch size of 10.

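Comparing the two runs, training with the matrix approach and a mini-batch size of 10 is much faster than training with a mini-batch size of 1 (about 52 seconds versus about 308 seconds), and in these runs it also ends with higher test accuracy (9498 / 10000 versus 8904 / 10000 at epoch 29). Much of the time difference comes from how often the weights are updated: in the first run they are updated right after each example is processed, while in the second run they aren't updated until all 10 examples in the mini-batch are processed, and those 10 examples are handled together in a single set of matrix operations. The single-example updates are also noisier, which likely explains why the accuracy of the first run fluctuates more from epoch to epoch.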
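The update counts behind this comparison can be worked out directly. The sketch below assumes the 50,000-example training split that Nielsen's `mnist.pkl.gz` provides (the notebook itself only prints the 10,000-example test set size) and reuses the two wall-clock times reported in the runs above. The helper `updates_per_epoch` is a hypothetical name, and the ratio describes these two runs only, not a general benchmark.

```python
n_train = 50000  # assumed size of the MNIST training split

def updates_per_epoch(n, mini_batch_size):
    # Ceiling division: a final, smaller mini-batch still triggers an update.
    return -(-n // mini_batch_size)

updates_b1 = updates_per_epoch(n_train, 1)
updates_b10 = updates_per_epoch(n_train, 10)

time_b1 = 308.43776631355286   # seconds, looping run with mini-batch size 1
time_b10 = 52.02427625656128   # seconds, matrix run with mini-batch size 10
speedup = time_b1 / time_b10

print(updates_b1, updates_b10, round(speedup, 1))  # 50000 5000 5.9
```

The two runs differ in both mini-batch size and implementation, so the ratio reflects the combined effect of fewer updates and vectorized arithmetic.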