import numpy as np
import random

 
#Defining some functions
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
 
def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))
 
class Network:
    # sizes is a list of the number of nodes in each layer
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x,y in zip(sizes[:-1], sizes[1:])]
       
    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a
   
    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        training_data = list(training_data)
        samples = len(training_data)
       
        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)
       
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size]
                            for k in range(0, samples, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print(f"Epoch {j}: {self.evaluate(test_data)} / {n_test}")
            else:
                print(f"Epoch {j} complete")
   
    def cost_derivative(self, output_activations, y):
        return output_activations - y
   
    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # stores activations layer by layer
        zs = [] # stores z vectors layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
       
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
       
        for _layer in range(2, self.num_layers):
            z = zs[-_layer]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-_layer+1].transpose(), delta) * sp
            nabla_b[-_layer] = delta
            nabla_w[-_layer] = np.dot(delta, activations[-_layer-1].transpose())
        return (nabla_b, nabla_w)
   
    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
       
    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(y[x]) for (x, y) in test_results)

We start by importing the numpy and random libraries. The random library is used to shuffle the training data at the start of each epoch, while numpy (np in our code) generates the random starting weights and biases and makes the matrix calculations fast.
We then define two popular helper functions, sigmoid and sigmoid_prime. sigmoid is the activation function, and sigmoid_prime is its derivative, which backpropagation uses to calculate the gradient.
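As a quick sanity check on the two helpers, the sketch below compares sigmoid_prime against a central finite-difference approximation of the slope of sigmoid. The test points in `z` and the step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# Arbitrary test points and step size (illustrative assumptions)
z = np.array([-2.0, 0.0, 3.0])
h = 1e-6

# Central finite-difference estimate of the derivative of sigmoid
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_prime(z)

# Largest disagreement between the two estimates; it should be tiny
max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)
```

If the two estimates disagreed noticeably, the derivative used in backpropagation would be wrong and training would not converge properly.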

# ***IN THE NETWORK:***
*   **sizes variable**: List of numbers giving the number of nodes in each layer of our neural network, starting with the input layer.
*   **init function**: The constructor, which initializes four attributes:

    *   number of layers, num_layers, is set to the length of sizes
    *   list of the layer sizes, sizes, is set to the input argument, sizes
    *   initial biases are randomized for each layer after the input layer (one column vector per layer)
    *   initial weights are randomized for each pair of adjacent layers (one matrix per pair, with one row per node in the later layer and one column per node in the earlier layer)



(*np.random.randn() returns random samples from the standard normal distribution, with mean 0 and variance 1*)
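To make the shapes concrete, here is a small sketch that builds the biases and weights the same way the constructor does. The layer sizes `[2, 3, 1]` are a hypothetical example, not values from the original notebook.

```python
import numpy as np

# Hypothetical network: 2 input nodes, 3 hidden nodes, 1 output node
sizes = [2, 3, 1]

# One bias column vector per layer after the input layer
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers, shaped (next layer, previous layer)
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

The (next layer, previous layer) shape is what lets `np.dot(w, a) + b` turn the activations of one layer into the inputs of the next.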





*   **feedforward function**: Sends information forward through the neural network. This function takes one parameter, 'a', which is the input activation vector. It loops through the biases and weights layer by layer, replacing 'a' with the activations of the next layer each time. The 'a' returned is the prediction (the activations of the last layer).

*   **gradient descent function**: SGD (stochastic gradient descent) takes four mandatory parameters and one optional parameter:

    *   set of training data
    *   number of epochs
    *   size of the mini-batches
    *   learning rate (eta)
    *   test data (optional)


> Converts the training_data into a list and sets the number of samples to the length of that list (the same is done for the test data, if given). This is because the data may not be passed to us as lists, but as zips of lists. We then loop through the training epochs. In each epoch, we start by shuffling the data (for randomness) and create a list of mini-batches. The update_mini_batch function is called for each mini-batch. At the end of each epoch, the number of correctly classified test samples is printed if test data is present; otherwise only a completion message is printed.
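The shuffle-and-slice step can be sketched on its own. The ten toy (x, y) pairs, the seed, and the mini-batch size of 4 below are made-up values for illustration only.

```python
import random

# Hypothetical toy data: ten (x, y) pairs, passed in as a zip like the real data may be
training_data = list(zip(range(10), range(10, 20)))
samples = len(training_data)
mini_batch_size = 4

random.seed(0)  # fixed seed only so the sketch is repeatable
random.shuffle(training_data)

# Slice the shuffled list into consecutive chunks of mini_batch_size
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, samples, mini_batch_size)]

batch_lengths = [len(mb) for mb in mini_batches]
print(batch_lengths)  # [4, 4, 2]
```

Note that the last mini-batch is smaller when the number of samples is not a multiple of the mini-batch size, which is why update_mini_batch divides by `len(mini_batch)` rather than by a fixed size.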




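*   **cost derivative function:** Returns the difference between the network's output and the expected output, which measures how far off the output layer is. This difference is the derivative of a quadratic cost, ½‖a − y‖², with respect to the output activations. It takes two parameters: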

    *   output_activations array
    *   expected output values, y
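The sketch below checks numerically that `output_activations - y` is the gradient of a quadratic cost. The function `quadratic_cost` and the sample vectors are assumptions introduced for this illustration; the original code never defines the cost itself, only its derivative.

```python
import numpy as np

def quadratic_cost(a, y):
    # Assumed cost: half the squared distance between output and target
    return 0.5 * np.sum((a - y) ** 2)

# Hypothetical output activations and one-hot target
a = np.array([[0.8], [0.1], [0.3]])
y = np.array([[1.0], [0.0], [0.0]])

# What cost_derivative returns
analytic = a - y

# Finite-difference estimate of the gradient, one component at a time
h = 1e-6
numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    step = np.zeros_like(a)
    step[i, 0] = h
    numeric[i, 0] = (quadratic_cost(a + step, y)
                     - quadratic_cost(a - step, y)) / (2 * h)

grad_gap = np.max(np.abs(numeric - analytic))
print(grad_gap)
```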



Now for backpropagation, the backprop function takes two values, x and y. We initialize our nablas (𝛁), the gradients, to zero arrays with the same shapes as the biases and weights. We also need to keep track of the current activation vector, activation; all of the activation vectors, activations (the input being the first one); and the z-vectors, zs.

Now, we loop through all the biases and weights. In each loop we calculate the z vector as the dot product of the weights and the activation plus the bias, add that to the list of zs, recalculate the activation as the sigmoid of z, and then add the new activation to the list of activations.

*   **backpropagation function:** We calculate the delta for the last layer, which is the error from the last layer (given by cost_derivative) multiplied elementwise by the sigmoid_prime of the last entry of zs. We set the last layer of nabla_b to the delta and the last layer of nabla_w to the dot product of the delta and the second to last layer of activations (transposed). We do the same thing for each layer going backwards, starting from the second to last layer; there, the delta is the dot product of the transposed weights of the following layer and the previous delta, multiplied elementwise by the sigmoid_prime of that layer's z vector. Finally, we return the nablas as a tuple.
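One way to gain confidence in these steps is a gradient check. The sketch below restates the same backward pass as a standalone function (so it runs without the Network class) and compares one weight gradient against a finite-difference estimate. The layer sizes, the seed, the helper `forward_cost`, and the quadratic cost are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def forward_cost(weights, biases, x, y):
    # Assumed quadratic cost of the network output for one example
    a = x
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(weights, biases, x, y):
    # Forward pass, storing every z vector and activation
    activation, activations, zs = x, [x], []
    for b, w in zip(biases, weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    # Delta of the output layer
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].T)
    # Propagate the delta backwards through the earlier layers
    for layer in range(2, len(weights) + 1):
        delta = np.dot(weights[-layer + 1].T, delta) * sigmoid_prime(zs[-layer])
        nabla_b[-layer] = delta
        nabla_w[-layer] = np.dot(delta, activations[-layer - 1].T)
    return nabla_b, nabla_w

# Hypothetical tiny network and a single training example
rng = np.random.default_rng(0)
sizes = [2, 3, 2]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((2, 1))
y = np.array([[1.0], [0.0]])

nabla_b, nabla_w = backprop(weights, biases, x, y)

# Finite-difference estimate for one weight in the first layer
h = 1e-6
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][0, 0] += h
w_minus[0][0, 0] -= h
numeric = (forward_cost(w_plus, biases, x, y)
           - forward_cost(w_minus, biases, x, y)) / (2 * h)

backprop_gap = abs(numeric - nabla_w[0][0, 0])
print(backprop_gap)
```

A gap close to zero suggests the backward pass and the forward cost agree; the same check can be repeated for any other weight or bias.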





*   **update mini batch function:** Takes the two parameters below. It starts pretty much the same way as our backprop function, by creating zero arrays for the nablas of the biases and weights.

    *   mini_batch
    *   learning rate, eta



> Then, for each input, x, and expected output, y, in the mini_batch, we get the gradients for that single example from the backprop function and add them to the running nabla totals. Finally, we update the weights and biases of the network using the nablas and the learning rate. Each value is updated to its current value minus the learning rate divided by the size of the mini-batch, times the summed nabla value.
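The update rule can be worked through by hand on a single weight matrix. All numbers below (the weights, the two per-example gradients, and eta) are invented for illustration.

```python
import numpy as np

eta = 3.0  # hypothetical learning rate
w = np.array([[0.5, -0.2]])  # hypothetical current weights

# Hypothetical gradients that backprop might return for two examples
per_example_grads = [np.array([[0.1, 0.4]]), np.array([[0.3, 0.0]])]

# Accumulate the gradients over the mini-batch, starting from zeros
nabla_w = np.zeros(w.shape)
for dnw in per_example_grads:
    nabla_w = nabla_w + dnw

# Step against the averaged gradient: w - (eta / batch size) * summed gradient
new_w = w - (eta / len(per_example_grads)) * nabla_w
print(new_w)  # approximately [[-0.1, -0.8]]
```

Dividing eta by the mini-batch size means the step uses the average gradient over the mini-batch, so the effective step size does not grow with the batch size.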






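*   **evaluate function:** This function takes one parameter, the test_data. For each input, x, the network's output is calculated by feeding x forward, and the index of the largest output activation is taken as the prediction. The function returns the number of test inputs for which the expected output, y, has a 1 at the predicted index (the code indexes y with the prediction, so y is assumed to be a one-hot vector).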
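The counting logic can be sketched without a trained network. The outputs and one-hot labels below are made up, and the labels are assumed to be column vectors, so the sketch indexes them with `y[x, 0]` rather than `y[x]` to get a plain scalar.

```python
import numpy as np

# Hypothetical network outputs for three test inputs (3 classes each)
outputs = [np.array([[0.1], [0.8], [0.1]]),
           np.array([[0.6], [0.3], [0.1]]),
           np.array([[0.2], [0.2], [0.6]])]

# Hypothetical one-hot expected outputs (assumed column vectors)
labels = [np.array([[0.0], [1.0], [0.0]]),
          np.array([[0.0], [0.0], [1.0]]),
          np.array([[0.0], [0.0], [1.0]])]

# Pair each predicted class index with its one-hot label
test_results = [(np.argmax(out), y) for out, y in zip(outputs, labels)]

# A prediction counts as correct when the label holds a 1 at the predicted index
correct = sum(int(y[x, 0]) for x, y in test_results)
print(correct)  # 2
```

Here the first and third predictions match their labels and the second does not, so two of the three inputs are counted as correct.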
















