In [63]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
import pandas as pd

from keras.datasets import mnist
from keras.utils import np_utils


# Neural Network Basics

We can try to build a neural network from scratch to see the details of how it works. We won't be using these homemade networks for real purposes, but it is good to get a look at the mechanics so that the rest makes sense. 

We will start by trying to solve a problem that we are familiar with - logistic regression. 

## What is a Neural Network?

Neural networks are a type of machine learning algorithm that is inspired by the way that the human brain works. The human brain is made up of neurons that are connected to each other, and a neural network aims to replicate that. How closely this actually models a human brain is up in the air (I don't think it actually works anything like a real brain), but its ability to make predictive models is excellent. We have inputs, output(s), and some number of layers in between. This sounds complex, and it can be, but we can break it down into a few simple steps.

![Neural Network](images/nn_structure.png "Neural Network")

The structure of a neural network is based on layers:
<ul>
<li> Input layer - the data that we are feeding into the network. Each of the neurons above gets one variable from the data set.
<li> Output layer - the prediction that the network makes. This generates an actual prediction, just like any of our other models. 
<li> Weights - just like the slopes in regressions, we have weights that are multiplied by the inputs. Also just like regression, these weights are what the model learns to make its predictions.
<li> Middle (hidden) layers - these are the layers that are in between the input and output layers. These layers are where the magic happens and are what allows a neural network to learn about our data so well. 
</ul>

The basic process of training a neural network is also similar to regression:
<ul>
<li> The data comes into the input layer, each value is multiplied by a weight, and then the results are combined into a prediction. This is called forward propagation.
<li> The model, with the current weights makes a prediction. 
<li> The error of the prediction is calculated.
<li> The error is broken down and attributed back to each of the weights, in a process called backpropagation. This is just like gradient descent finding adjustments for the weights in a regression, but slightly more complex since we normally have several layers. 
<li> The weights are adjusted based on the error.
<li> A new model with the new weights is made, and the process repeats until the model is good enough or we hit a limit. 
</ul>
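
The steps above can be sketched for the simplest possible case, a single layer with a sigmoid output. This is only an illustrative sketch: the toy data, the variable names, and the learning rate are all made up for the example and are not part of the network we build later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up for illustration): label is 1 when x1 + x2 > 1
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1).astype(float).reshape(-1, 1)

w = np.zeros((2, 1))   # one weight per input
b = 0.0                # bias
learning_rate = 0.5    # assumed value, chosen only for this toy example

def forward(X, w, b):
    # forward propagation: weighted sum, then sigmoid
    return 1 / (1 + np.exp(-(X @ w + b)))

losses = []
for epoch in range(500):
    pred = forward(X, w, b)                   # forward propagation -> prediction
    losses.append(np.mean((y - pred) ** 2))   # error of the prediction (MSE)

    # attribute the error back to each parameter (chain rule)
    d_pred = 2 * (pred - y) / len(y)          # dE/dpred
    d_z = d_pred * pred * (1 - pred)          # through the sigmoid
    grad_w = X.T @ d_z
    grad_b = d_z.sum()

    # adjust the weights, then repeat with the new model
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("first loss:", losses[0], "last loss:", losses[-1])
```

The loop body is the whole training process in miniature: predict, measure the error, push the error back to the parameters, update, repeat.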

![SimpleNN](images/simple_nn.png "SimpleNN" )

This is basically what we looked at in a logistic regression! That is true for a one layer NN, with a sigmoid activation (more on activations later). We calculate the weights, make a prediction, use gradient descent to adjust, and filter the output through the sigmoid calculation to get a yes/no prediction. The difference here is that we have multiple layers, each with its own set of weights. The more layers we add, the more ability the model has to learn relationships in the data. In a sense (this is a thought process, not a literal description) each layer is kind of similar to its own little logistic regression, and we layer them together like a boosting model to make improvements layer by layer. 

### Helpers - Base Layer and Support Functions

First we need a few helper functions - separating these out will make our code much easier to read. Below we have:
<ul>
<li> Base Layer Class: we are going to have two types of layers - the "normal" fully connected one, and an activation one to apply the activation function. In practice the activation is usually wrapped into the normal layer, but keeping them separate will help us see the parts more clearly. 
<li> Activation Functions and Derivatives: we have two activation functions, and the derivative of each. We'll talk about different activation functions in more detail later on. 
<li> Loss Function and Derivative: we have the loss function and its derivative. We'll keep it simple and use MSE; in a real classification this would probably be log-loss. We'll also talk more about different loss functions later on. 
<li> Convert to Bool: this just translates probability predictions in [0,1] into a binary classification. We only need it to calculate the accuracy of our test data predictions. 
</ul>

In [64]:
class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    # computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    # computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

In [65]:
# activation function and its derivative
def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1-np.tanh(x)**2

In [66]:
# activation function and its derivative
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

def sigmoid_prime(z):
    fz = sigmoid(z)
    return fz * (1 - fz)

In [67]:
# loss function and its derivative
def mse(y_true, y_pred):
    return np.mean(np.power(y_true-y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2*(y_pred-y_true)/y_true.size

In [68]:
# Convert a decimal between 0 and 1 to either 0 or 1, based on if it is
# above or below the cutoff. 
def conv_to_bool(float_list, cutoff=.5):
    new_list = []
    for i in float_list:
        if i < cutoff:
            new_list.append(0)
        else:
            new_list.append(1)
    return new_list
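
One way to gain some confidence in the hand-written derivative helpers is to compare them against a numerical approximation. The sketch below redefines the helpers so that it is self-contained; `numerical_derivative` is a hypothetical helper introduced only for this check, using a central difference.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    fz = sigmoid(z)
    return fz * (1 - fz)

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

def numerical_derivative(f, x, eps=1e-6):
    # central difference approximation of f'(x); hypothetical helper for this check
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.linspace(-3, 3, 13)

# the analytic derivatives should agree with the numerical ones
sigmoid_ok = np.allclose(sigmoid_prime(x), numerical_derivative(sigmoid, x), atol=1e-6)
tanh_ok = np.allclose(tanh_prime(x), numerical_derivative(np.tanh, x), atol=1e-6)
print(sigmoid_ok, tanh_ok)
```

If either check came back `False`, back propagation would be pushing the weights in the wrong direction, so this kind of test is worth running whenever a new activation is added by hand.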

### Usable Layers

We can create the layers that we'll actually use now. Normally the FCLayer and the activation layer we are creating would be combined, but keeping them separate makes it easier to explore.

#### Fully Connected Layer (Dense)

The main component of the neural network is the fully connected, or dense, layer. The fully connected (dense) part just means that each neuron is connected to each neuron in the next layer. The three functions each have simple jobs:

<ul>
<li> Initialization: set up the matrix of weights and a vector of bias values. Here they are initialized to random numbers; in practice they are initialized using some smarter method, configurable with a parameter. Each input neuron connects to each output neuron, so there is a weight for each of those connections - a matrix of input size by output size. Each output has one bias value. 
<li> Forward Propagation: in forward propagation this layer generates predictions by multiplying weights * values and adding the bias. The calculation itself is a dot product - which multiplies each input by its corresponding weight in the matrix automatically. This could be done by lots of loops, but this is more compact and efficient. Recall that each connection has a weight, so we end up with this large matrix. 
<li> Backward Propagation: this layer calculates the impact of the error stemming from each input X. This is done by calculating the gradient of the loss with respect to each input - thus allowing us to attribute portions of the output error to the different inputs. The calculation uses some math called the chain rule (the derivative of the activation function is handled separately, in the activation layer). Once we have this, we update all the weights and the bias so that we shrink our loss. 
</ul>
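
To make the shapes in the list above concrete, here is a small sketch with 3 inputs and 2 outputs. The sizes and the placeholder error are arbitrary choices for illustration; the three matrix products are the same ones a dense layer needs for its forward and backward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((1, 3))          # one sample with 3 inputs
W = rng.random((3, 2)) - 0.5    # one weight per connection: 3 inputs x 2 outputs
b = rng.random((1, 2)) - 0.5    # one bias per output

# forward propagation: (1,3) @ (3,2) -> (1,2)
y = x @ W + b

# the dot product is just the loop "input * weight, summed, plus bias"
y0_loop = sum(x[0, i] * W[i, 0] for i in range(3)) + b[0, 0]

# backward propagation, given a placeholder dE/dY of the same shape as y
output_error = np.ones((1, 2))
input_error = output_error @ W.T      # dE/dX: (1,2) @ (2,3) -> (1,3)
weights_error = x.T @ output_error    # dE/dW: (3,1) @ (1,2) -> (3,2)

print(y.shape, input_error.shape, weights_error.shape)
```

Note that `weights_error` has the same shape as `W`, which is what allows the update to be a simple element-wise subtraction.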

![Weights](images/weights.png "Weights" )



In [69]:
# inherit from base class Layer
class FCLayer(Layer):
    # input_size = number of input neurons
    # output_size = number of output neurons
    def __init__(self, input_size, output_size):
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    # returns output for a given input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    # computes dE/dW, dE/dB for a given output_error=dE/dY. Returns input_error=dE/dX.
    def backward_propagation(self, output_error, learning_rate):
        input_error = np.dot(output_error, self.weights.T)
        weights_error = np.dot(self.input.T, output_error)
        
        # update parameters
        # dBias = output_error
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error

        return input_error

### Activation Functions

One of the most important parts of a neural network is the activation function. The activation function is what allows the network to learn non-linear relationships. Without an activation function, the network would be no more powerful than a linear regression. The activation function is applied to the output of each neuron, and it is what determines whether the neuron is "on" or "off". Prior to the output of each neuron being sent on to the next layer, the activation function is applied. If you're paying close attention, the output going into an activation function is just a linear regression - the sum of the features times their weight. The activation function changes this, and allows the neural network to be "more" than a complicated linear regression. In real applications, we usually have a few layers in our neural network, and the combination of multiple layers and activation functions allows us to learn very complex relationships. The term "deep learning" refers to networks that have lots of layers. 

This step is just like the sigmoid step in a logistic regression - we have the raw value that comes into the function, and the output is not that raw value, it is whatever that raw value translates to after the activation function is applied. If this wasn't applied, a multi-layer neural network could be decomposed into a simple linear equation (with a bunch of linear algebra work).
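
The "decomposes into a simple linear equation" claim can be checked numerically. The sketch below uses arbitrary random weights (the sizes and names are made up for the example) and shows that two stacked linear layers equal one combined linear layer, while adding a tanh between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.random((5, 4))                           # 5 samples, 4 features
W1, b1 = rng.random((4, 3)), rng.random((1, 3))  # first layer
W2, b2 = rng.random((3, 2)), rng.random((1, 2))  # second layer

# two layers with no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# the same thing as a single linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined
same_without_activation = np.allclose(two_layers, one_layer)

# with a tanh between the layers the collapse no longer works
with_tanh = np.tanh(x @ W1 + b1) @ W2 + b2
same_with_activation = np.allclose(with_tanh, one_layer)

print(same_without_activation, same_with_activation)
```

However many linear layers we stack, the same algebra collapses them into one weight matrix and one bias vector, which is why the activation is what gives depth its value.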

![NN Steps](images/nn_steps.jpeg "NN Steps" )

There are many different activation functions, including the one that we are used to - the sigmoid. These activation functions are things that we can experiment with and tune like our hyperparameters. Some tend to be better suited for different types of problems, such as ReLU being common for image work and tanh being common for text processing, but there is no definitive answer. There are also considerations for computational efficiency, and some activation functions are more computationally expensive than others. For the most part we will start with the most common one for the type of problem we are doing, and then experiment with others if we feel the need. We'll look at these more later. 
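
As an example of swapping in a different activation, here is a minimal sketch of ReLU and its derivative, written in the same "function plus prime" style as the tanh and sigmoid helpers above. The names `relu` and `relu_prime` are my own, and treating the derivative at exactly 0 as 0 is a common convention rather than the only option.

```python
import numpy as np

def relu(x):
    # passes positive values through unchanged, clips negatives to 0
    return np.maximum(0, x)

def relu_prime(x):
    # slope is 1 where x > 0 and 0 elsewhere (0 at x == 0 by convention)
    return (x > 0).astype(float)

z = np.array([[-2.0, -0.5, 0.0, 0.5, 2.0]])
print(relu(z))
print(relu_prime(z))
```

ReLU is cheap to compute because it is only a comparison, which is one of the efficiency considerations mentioned above.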

![Activation Functions](images/activation-functions.png "Activation Functions")

#### Activation Layer

The activation layer is simpler, it just applies the activation function. Normally this is built into the other layer, but keeping it separate is a little easier to build by hand. 

<ul>
<li> Initialization: set the activation function and derivative to use going forward. 
<li> Forward Propagation: in forward propagation the activation layer just takes the input that it gets and applies the activation function to it.  
<li> Backward Propagation: in backward propagation the activation layer translates the error of the predictions back up by multiplying the error by the derivative of the activation function. This has the effect of translating the error we got with respect to the output of the activation function into error with respect to the input of the activation function. 
</ul>

In [70]:
# inherit from base class Layer
class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    # returns the activated input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    # Returns input_error=dE/dX for a given output_error=dE/dY.
    # learning_rate is not used because there is no "learnable" parameters.
    def backward_propagation(self, output_error, learning_rate):
        return self.activation_prime(self.input) * output_error

## Full Network

We can now build our layers into an actual model. Our code here is a little long, but most of it is pretty simple - the majority of the extra stuff is to make the model flexible. 

<ul>
<li> Add - call this to add layers to the model. 
<li> Use - provide a loss function, and the derivative of that loss function, for the model to use. 
<li> Predict - generate prediction. This is just one execution of a forward propagation for the new inputs. 
<li> Fit - train the model. 
    <ul>
    <li> Each iteration is called an epoch. We loop through the process until we've hit the desired number of epochs. 
    <li> Run each sample through a forward propagation. 
    <li> Get the prediction, and calculate the error. 
    <li> Pass the error to the back propagation, then repeat for the next sample.
    </ul>
</ul>

Another way to phrase the fitting and optimizing of the model is to think of the errors on each epoch. First we generate a prediction (FP), and find the error of that prediction by simply comparing it to the true value - this is the error with respect to y, the output. 

### Back Propagation and Gradient Descent

Next we do the back propagation. This takes that error with respect to y that we figured out, and "breaks" that error down into error with respect to each term from the previous layer. Recall that one neuron creates a prediction (during FP) that is equal to w1*x1 + w2*x2 +... b; back propagation does a sort of reverse execution of that - starting with the overall error and mapping it to each term. The error is effectively split into parts and each of the w*x terms and the b term is labeled as being accountable for some of it. Since the x values are the inputs, they can't change, so we modify the weights and bias to lessen the error - this is the gradient descent part. This gets repeated back through each layer until the first layer, then another FP begins with the new updated weights. The learning rate controls the size of the adjustments at each round.
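
A single update for one neuron can be worked through by hand. The numbers below are made up so the arithmetic is easy to follow; the neuron has no activation, and the loss is the same MSE used elsewhere in this notebook.

```python
import numpy as np

x = np.array([[1.0, 2.0]])       # inputs x1, x2 (these never change)
w = np.array([[0.5], [-0.5]])    # weights w1, w2
b = np.array([[0.0]])            # bias
y_true = np.array([[1.0]])
learning_rate = 0.1

def predict(x, w, b):
    return x @ w + b             # w1*x1 + w2*x2 + b

pred = predict(x, w, b)                          # 0.5*1 - 0.5*2 + 0 = -0.5
error_before = np.mean((y_true - pred) ** 2)     # (1 - (-0.5))^2 = 2.25

# error with respect to the output, then split across each term
d_pred = 2 * (pred - y_true) / y_true.size       # 2 * (-1.5) = -3
grad_w = x.T @ d_pred                            # [[-3], [-6]]: scaled by each input
grad_b = d_pred                                  # [[-3]]

# gradient descent step: move against the gradient
w = w - learning_rate * grad_w                   # [[0.8], [0.1]]
b = b - learning_rate * grad_b                   # [[0.3]]

new_pred = predict(x, w, b)                      # 0.8 + 0.2 + 0.3 = 1.3
error_after = np.mean((y_true - new_pred) ** 2)  # 0.3^2 = 0.09

print(error_before, error_after)
```

The weight attached to the larger input (x2 = 2) received the larger adjustment, which is the "accountability" idea in action. The step overshot the target slightly (1.3 instead of 1.0), which is the kind of behaviour the learning rate controls.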

![Back Propagation](images/backprop.webp "Back Propagation" )

This back propagation part is the key to the high accuracy ceiling that neural networks have when the data is large - we can tune each of many neurons specifically to their contribution of error. If we have enough data, we can get very accurate predictions. This is also the gradient descent process that we are used to from logistic regression. The difference is that we walk that process back through the entire network, through all the layers, and update the weights and bias all the way through. This can become computationally expensive, as we are doing partial derivatives on a bunch of (potentially large) matrices over and over, but the concept is simple - we are adjusting the weights to improve the error, or performing gradient descent.

In [71]:
class Network:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    # add layer to network
    def add(self, layer):
        self.layers.append(layer)

    # set loss to use
    def use(self, loss, loss_prime):
        self.loss = loss
        self.loss_prime = loss_prime

    # predict output for given input
    def predict(self, input_data):
        # sample dimension first
        samples = len(input_data)
        result = []

        # run network over all samples
        for i in range(samples):
            # forward propagation
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)

        return result

    # train the network
    def fit(self, x_train, y_train, epochs, learning_rate):
        # sample dimension first
        samples = len(x_train)

        # training loop
        errors = []
        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward propagation
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                # compute loss (for display purpose only)
                #print(self.loss)
                err += self.loss(y_train[j], output)

                # backward propagation
                error = self.loss_prime(y_train[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, learning_rate)

            # calculate average error on all samples
            if i % 100 == 0:
                err /= samples
                errors.append(err)
                print('epoch %d/%d   error=%f' % (i+1, epochs, err))

## Use the Network

We have a neural network now - we need to use it! Since we were smart and made the framework able to handle input data of an arbitrary size, we can use our model for pretty much any application!

<b>Note:</b> for all of these examples, the syntax that we are using is similar to the "real world", but not exactly the same. Our handmade neural network doesn't quite have the same functionality of the libraries we can use, but it is close enough to get the idea. 

### Simple Example - XOR

We can test and see what our network can do with a very simple trial. XOR is a logical operation - exclusive or. It is 1 only if exactly one input is 1, otherwise it is 0. We have that data as the training X and y, so we can see if our network can learn this very simple relationship before we test it on some real data. 

![XOR](images/xor.webp "XOR" )

One thing to note here is that the XOR problem can't be solved with a single layer. We need at least two layers to be able to learn the relationship. Let's see if our model can figure it out...
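
To see why two layers are enough, here is a sketch of a two-layer network that computes XOR with weights picked by hand rather than learned. The weights and the hard `step` activation are my own choices for illustration; a step function has no useful gradient, so this version could not be trained with back propagation, it only shows that a solution exists.

```python
import numpy as np

def step(z):
    # hard threshold: 1 if z > 0, else 0 (illustration only, not trainable)
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# hidden layer: first neuron behaves like OR, second like AND
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])
hidden = step(X @ W_hidden + b_hidden)

# output layer: "OR but not AND", which is exactly XOR
W_out = np.array([[1], [-1]])
b_out = np.array([-0.5])
xor_pred = step(hidden @ W_out + b_out).ravel()

print(hidden)
print(xor_pred)
```

The hidden layer re-describes the inputs as "at least one is on" and "both are on"; in that new space the output neuron only needs a single straight cut.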

In [72]:
# training data
x_train = np.array([[[0,0]], [[0,1]], [[1,0]], [[1,1]]])
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

# network
net = Network()
net.add(FCLayer(2, 2))
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(2, 1))
net.add(ActivationLayer(tanh, tanh_prime))

# train
net.use(mse, mse_prime)
net.fit(x_train, y_train, epochs=1000, learning_rate=0.1)

# test
out = net.predict(x_train)
print(out)

epoch 1/1000   error=0.397781
epoch 101/1000   error=0.282690
epoch 201/1000   error=0.281922
epoch 301/1000   error=0.269284
epoch 401/1000   error=0.205885
epoch 501/1000   error=0.004860
epoch 601/1000   error=0.001466
epoch 701/1000   error=0.000824
epoch 801/1000   error=0.000564
epoch 901/1000   error=0.000425
[array([[0.00146386]]), array([[0.97417827]]), array([[0.97381539]]), array([[-0.00016839]])]


#### Implement a Logistic Regression

We can create the network that we defined above. You may also recognize this 1 layer model as a logistic regression:
<ul>
<li> The weights are the coefficients of the logistic regression.
<li> The bias is the intercept of the logistic regression.
<li> The activation function is the sigmoid function.
<li> The features are the features. 
</ul>

At our most simple, that is all a neural network is. The ones that are more capable and flexible are just more complex versions of this, normally with more layers. Each layer adds one of these logistic regression like steps, only they are being performed on the output of the previous layer. Multiple layers allow us to learn more complex relationships, as each layer can learn a different relationship from the transformed data. 

<b>Note:</b> We have to supply the "sigmoid_prime" and similar functions here. That's something that the library would normally do for us, but since we're building it from scratch, we have to do it ourselves. The prime version is what the neural network uses to calculate the error in back propagation.

In [73]:
df = pd.read_csv("data/diabetes.csv")
df.head()

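| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |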


### Diabetes Data

One weird thing that we will do to the data here is that we will reshape the Xs into a 3 dimensional array. This is due to the structure of the model - it is expecting data to be in that shape. When we start using Keras (the package we use for neural network models) this can generally be handled automatically, but since we've made this from scratch, we need to do it here. We do need to be a little more comfortable with the shape of the data, and potentially shifting it around, so this is a good exercise.
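
The reshape is easier to picture on a tiny array. The sketch below uses made-up numbers (3 samples instead of the real data set) to show that going from 2 to 3 dimensions only wraps each row in its own 1 x 8 matrix, which is the shape a single sample needs for the dot product in our dense layer.

```python
import numpy as np

# 3 made-up samples with 8 features each, as a normal 2D table
X_2d = np.arange(24, dtype=float).reshape(3, 8)

# same values, but each sample is now its own (1, 8) row matrix
X_3d = X_2d.reshape(-1, 1, 8)

first_sample = X_3d[0]        # shape (1, 8)
W = np.zeros((8, 1))          # placeholder weights for an 8 -> 1 layer
out = first_sample @ W        # (1,8) @ (8,1) -> (1,1)

print(X_2d.shape, X_3d.shape, first_sample.shape, out.shape)
```

The `-1` tells NumPy to work out the number of samples on its own, so the same line works for both the training and the test split.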

What we end up with for the data is:
<ul>
<li> 8 numerical features. We will have 8 neurons / Xs in the input of our network. 
<li> 1 binary categorical output. We will have one neuron for the output of the network. 
<li> Activation function applied to our output, to ensure we get categorization. 
</ul>

<b>Note:</b> we always want to scale data for our neural networks. 

![SimpleNN](images/simple_nn.png "SimpleNN" )

In [74]:
y = np.array(df["Outcome"]).reshape(-1,1)
X = df.drop(columns=["Outcome"])

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train).reshape(-1,1,8)
X_test = scaler.transform(X_test).reshape(-1,1,8)

#Look at some details
print ("train_set_x shape: " + str(X_train.shape))
print ("train_set_y shape: " + str(y_train.shape))
print ("test_set_x shape: " + str(X_test.shape))
print ("test_set_y shape: " + str(y_test.shape))
print(X_test[0])
print(y_test[0])
print(X_train[0])
print(y_train[0])

train_set_x shape: (576, 1, 8)
train_set_y shape: (576, 1)
test_set_x shape: (192, 1, 8)
test_set_y shape: (192, 1)
[[0.13333333 0.46231156 0.42622951 0.         0.         0.4485842
  0.02798756 0.01666667]]
[0]
[[0.6        0.59798995 0.6557377  0.35353535 0.         0.43219076
  0.0821857  0.13333333]]
[1]


We will make a one layer network, with a sigmoid activation function. This is a logistic regression, just like my beautiful diagram above. 

In [75]:
# Network
net = Network()
net.add(FCLayer(8, 1))
net.add(ActivationLayer(sigmoid, sigmoid_prime))

# Train
net.use(mse, mse_prime)
net.fit(X_train, y_train, epochs=1000, learning_rate=0.0001)

# Evaluate on test set
out = net.predict(X_test)
pred_labels = conv_to_bool(out)
print(accuracy_score(y_test, pred_labels))

epoch 1/1000   error=0.249393
epoch 101/1000   error=0.235536
epoch 201/1000   error=0.232637
epoch 301/1000   error=0.230459
epoch 401/1000   error=0.228414
epoch 501/1000   error=0.226460
epoch 601/1000   error=0.224590
epoch 701/1000   error=0.222802
epoch 801/1000   error=0.221091
epoch 901/1000   error=0.219453
0.703125


##### Results

One thing that we can see is that our model doesn't really improve all that much if we let training keep running. This should make some intuitive sense - we are only using one layer, and thus only doing a "regular" gradient descent. Once the model has found good weights, there's not really anything else to adjust, it is kind of just bouncing around near the minimum on the gradient descent curve. The amount that can be learned is pretty limited, and we don't need any more iterations than a regular logistic regression would. 

#### Try a Larger Network

What if we try adding some layers? Outside of the input and output the rest of the stuff inside our network is configurable. We can add whatever we want in terms of layers, and each layer can be any size, as long as the shape matches the previous and next layer. We can also try that other activation function if we want. What's the "right" size? That question doesn't have a direct answer, we'll look at some things we can use to estimate it ~ 2 workbooks from now. We can play around a bit with it now and see what we get. As you can probably guess, larger models tend to be more flexible and more likely to overfit, while smaller models are less likely to overfit, but less capable of learning the complex relationships in data. 

If our data isn't linearly separable, we should see our model more able to fit the data as we start adding layers. Try a few, the only thing that we need to be careful of is that the shapes of inputs and outputs match up. Other than that, we can add as many layers as we want, with different activation functions. 

![MultiLayerNN](images/multi_layer_nn.jpeg "MultiLayerNN" )

In [76]:
# Network
net = Network()
net.add(FCLayer(8, 50))            
net.add(ActivationLayer(sigmoid, sigmoid_prime))
net.add(FCLayer(50, 50))               
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 50))               
net.add(ActivationLayer(sigmoid, sigmoid_prime))
net.add(FCLayer(50, 25))               
net.add(ActivationLayer(sigmoid, sigmoid_prime))
net.add(FCLayer(25, 1))  
net.add(ActivationLayer(sigmoid, sigmoid_prime))

# Train
net.use(mse, mse_prime)
net.fit(X_train, y_train, epochs=1000, learning_rate=0.01)

# Evaluate on test set
out = net.predict(X_test)
pred_labels = conv_to_bool(out)

print(accuracy_score(y_test, pred_labels))

epoch 1/1000   error=0.241192
epoch 101/1000   error=0.160898
epoch 201/1000   error=0.156465
epoch 301/1000   error=0.154747
epoch 401/1000   error=0.153363
epoch 501/1000   error=0.151914
epoch 601/1000   error=0.150291
epoch 701/1000   error=0.148638
epoch 801/1000   error=0.147218
epoch 901/1000   error=0.146054
0.78125


##### Results

These results will vary depending on exactly what you added for layers, size, and activations, but we will generally see a couple of things as we add some more layers. First, we will see that the model is able to learn the data better. This is because we are adding more layers, and thus more logistic regression steps. Each layer is learning a different relationship from the data, and we are able to learn more complex relationships. Second, we should see more (it might not be a lot, it depends on the setup) learning as we progress through the rounds of training. This is because the capacity of our model to learn is far larger, we have more weights that can be combined in different combinations to deliver the best fit. The model will take longer to learn which combinations are best as it is doing that gradient descent process. This is basically why large neural networks can take days or weeks to train, there is just a lot of math as the data sizes and the network sizes get large. 

### Complex Example - MNIST Images

Now that we are at least a little comfortable with the usage of neural networks, we can start to apply them to more useful applications, such as our old friends, the MNIST digits. Here we will make a larger model and we can also try a different activation function. One other difference to note is the "to_categorical" applied to the target y values. This works with the final layer of 10 output neurons - we get a softmax type (note: not actually softmax here) of output where there is a probability for being in each class. Technically, this would actually be a one-vs-rest for each potential output. The output layer is often flexible like this in a neural network, depending on what we are doing. Here each output neuron is the probability of being in that class, and we have 10, one for each digit. Our raw probabilities, which we'll print a few of below, are the probabilities of being in each class; note that those probabilities don't sum to 1, since each prediction is independent. We'd then need to add a function that is similar to "argmax", to choose the highest probability and assign a label. These details are cleaner and easier to handle with Keras, so don't get hung up on that part. With the library functions it is a bit more intuitive to deal with multiple output classes. The structure of our model will be similar to this, the middle layers will depend on whatever we add:

![MNIST](images/mnist_nn.jpeg "MNIST" )

Since these images, and the relationships in the data, are much more complex than the simple examples above, we should expect the benefit of more training to be more visible here. More layers will also likely help, so try a few variations. In general, we can probably get really good accuracy here, and it could be even better if we used more data to train, but it'll take a while. 
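
The one-hot encoding and the "argmax" step mentioned above can be sketched with plain NumPy. This is not what `to_categorical` does internally, only an equivalent illustration; the three raw outputs are rounded from the sample predictions printed at the end of this section, and the variable names are my own.

```python
import numpy as np

# labels for three digits, then one-hot encoded: a 1 in the position of the digit
labels = np.array([7, 2, 1])
one_hot = np.eye(10)[labels]          # shape (3, 10)

# raw network outputs in the same format predict() returns: a list of (1, 10) arrays
raw_outputs = [
    np.array([[0.01, 0.01, 0.12, 0.02, 0.00, 0.01, 0.00, 0.92, 0.05, 0.14]]),
    np.array([[0.26, 0.03, 0.73, 0.20, 0.10, 0.01, 0.09, 0.00, 0.07, 0.01]]),
    np.array([[0.00, 0.94, 0.03, 0.02, 0.00, 0.03, 0.01, 0.06, 0.13, 0.01]]),
]

# pick the class with the highest score for each sample
predicted_labels = np.array([np.argmax(o) for o in raw_outputs])

# undo the one-hot encoding so we can compare like with like
true_labels = np.argmax(one_hot, axis=1)
accuracy = np.mean(predicted_labels == true_labels)

print(predicted_labels, true_labels, accuracy)
```

Because each output neuron is scored independently, the rows above do not sum to 1, but picking the largest value still gives a usable label.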

In [77]:
# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape and normalize input data 
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
x_train /= 255
# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = np_utils.to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
x_test /= 255
y_test = np_utils.to_categorical(y_test)

print(x_train.shape)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 100))       
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))             # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(sigmoid, sigmoid_prime))
#net.add(FCLayer(50, 1)) 

# train on 1000 samples
# we didn't use batches, which we'll look at more next time, so we can't use too much data or it will be slow. 
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=700, learning_rate=0.001)

# test on 3 samples
out = net.predict(x_test[0:3])
print("\n")
print("predicted values : ")
print(out, end="\n")
print("true values : ")
print(y_test[0:3])

(60000, 1, 784)
epoch 1/700   error=0.282446
epoch 101/700   error=0.062150
epoch 201/700   error=0.044879
epoch 301/700   error=0.035292
epoch 401/700   error=0.029622
epoch 501/700   error=0.025319
epoch 601/700   error=0.021368


predicted values : 
[array([[6.83534351e-03, 7.71500738e-03, 1.16532539e-01, 2.04954793e-02,
        3.10298112e-03, 6.83885505e-03, 6.52023056e-04, 9.18520905e-01,
        5.22747803e-02, 1.39342149e-01]]), array([[2.55987191e-01, 3.07169560e-02, 7.30540378e-01, 1.96762587e-01,
        1.04247221e-01, 1.17377273e-02, 8.55676286e-02, 7.26970572e-04,
        7.23164115e-02, 9.40550232e-03]]), array([[1.36811592e-03, 9.43221070e-01, 2.94288994e-02, 1.66516692e-02,
        8.80341966e-04, 2.98053732e-02, 7.21774835e-03, 6.37819547e-02,
        1.34114944e-01, 1.12712648e-02]])]
true values : 
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
