## Import libraries

In [29]:
# Path tools
import sys, os
sys.path.append(os.path.join("..")) # add the parent directory to sys.path so we can import the utils module

# Neural networks with numpy
from utils.neuralnetwork import NeuralNetwork 

# Sklearn - machine learning tools
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets

The ```utils.neuralnetwork``` module contains a Python class called ```NeuralNetwork```, which implements a neural network with numpy. Programming neural networks from the ground up takes a lot of knowledge and many hours, which is why we use this ready-made class. We could have used an existing Python library (e.g. TensorFlow with Keras), but these are more complex and we would need to learn them first. Later on we will build our own networks.

## Load sample data

We are going to use the ```NeuralNetwork``` class from ```utils``` to study a dataset of handwritten digits. Rather than working with the full dataset, we use scikit-learn's ```load_digits``` function, which loads only part of it: around 180 samples per class.

In [30]:
digits = datasets.load_digits()
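
To see what ```load_digits``` returns, a quick inspection like the sketch below can help. It only reads attributes of the returned object (```data```, ```images```, ```target```); the other variable names are chosen for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each image is stored twice: as an 8x8 grid and as a flat vector of 64 values
print(digits.images.shape)  # (n_samples, 8, 8)
print(digits.data.shape)    # (n_samples, 64)

# Labels are the integers 0-9; count how many samples each class has
classes, counts = np.unique(digits.target, return_counts=True)
for c, n in zip(classes, counts):
    print(f"digit {c}: {n} samples")

# Pixel intensities in this dataset range from 0 to 16, not 0 to 255
print(digits.data.min(), digits.data.max())
```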

__Some preprocessing__

We need to make sure the data is floats rather than ints.

In [31]:
# Convert to floats
data = digits.data.astype("float")

__Remember min-max normalization?__

This is an alternative to dividing everything by the maximum possible pixel value (255 for ordinary 8-bit images). Min-max normalization rescales the data using the smallest and largest values actually present, so every value ends up between 0 and 1. That gives us a more compact and regular dataset to work with.

What happens if you try the other way?

In [32]:
# Min-max normalization
data = (data - data.min())/(data.max() - data.min())
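
As a small illustration of the question above, the sketch below applies min-max scaling and plain division by the maximum to two made-up arrays (the values are purely illustrative, and ```min_max_scale``` is a helper defined here, not part of ```utils```). When the smallest value is 0, as it is for pixel intensities, both approaches give the same result; they differ only when the minimum is not 0.

```python
import numpy as np

def min_max_scale(x):
    """Rescale an array so its smallest value is 0 and its largest is 1."""
    return (x - x.min()) / (x.max() - x.min())

# Toy "pixel" values whose minimum is 0
pixels = np.array([0.0, 4.0, 8.0, 16.0])
print(min_max_scale(pixels))    # 0, 0.25, 0.5, 1
print(pixels / pixels.max())    # identical, because the minimum is 0

# Toy values whose minimum is not 0: now the two approaches differ
shifted = np.array([4.0, 8.0, 16.0])
print(min_max_scale(shifted))   # starts at 0
print(shifted / shifted.max())  # starts at 0.25
```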

Check the shape of the data:

In [33]:
# Print dimensions
print(f"[INFO] samples: {data.shape[0]}, dim: {data.shape[1]}")

[INFO] samples: 1797, dim: 64


The data has 1797 samples, each with 64 features (one per pixel of an 8x8 image).

## Split data

We split the data into training and test:

In [34]:
# split data
X_train, X_test, y_train, y_test = train_test_split(data, # original data
                                                  digits.target, # labels
                                                  test_size=0.2) # 20% of the data goes into the test set
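
The split above is random, so the results further down will change slightly from run to run. A sketch of a reproducible variant uses the ```random_state``` and ```stratify``` arguments of ```train_test_split```. The seed value 42 is an arbitrary choice and not something the original notebook uses.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.data / digits.data.max()   # simple scaling to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    data,
    digits.target,
    test_size=0.2,          # 20% of the samples go into the test set
    random_state=42,        # arbitrary seed: same split on every run
    stratify=digits.target  # keep class proportions similar in both sets
)

print(X_train.shape, X_test.shape)
```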

We're converting labels to binary representations.

Why am I doing this? Why don't I do it with LogisticRegression() for example?

In [35]:
# convert labels from integers to vectors
lb = LabelBinarizer() # initialize the binarizer once
y_train = lb.fit_transform(y_train) # fit it to the training labels and transform them
y_test = lb.transform(y_test) # reuse the fitted binarizer so the test labels get the same encoding

We convert the labels to binary representations using ```LabelBinarizer```. If we had three labels (e.g. cat, dog, mouse), binarizing turns each label into a vector with one position per class, containing a 1 at the position of its own class and 0s everywhere else (e.g. cat = [1, 0, 0], dog = [0, 1, 0], mouse = [0, 0, 1]). We do this because the network works with numbers, not string labels, and its output layer has one node per class. Hence, each target needs to be a vector of 0s and 1s even when we have more than two labels.
Because we are dealing with multiple labels, we need to convert them into this binary representation so that the network's outputs can be mapped onto the classes.
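
A minimal sketch of the cat/dog/mouse case, using made-up labels, shows what the binarized vectors look like and how to get the original labels back:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["cat", "dog", "mouse", "dog"]   # made-up example labels

lb = LabelBinarizer()
one_hot = lb.fit_transform(labels)

print(lb.classes_)   # the column order: classes are sorted
print(one_hot)       # one row per label, a single 1 in each row

# The fitted binarizer can also map vectors back to the original labels
print(lb.inverse_transform(one_hot))
```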

In [36]:
# Let's look at the first ten labels now after having binarized them
y_train[:10]

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

Now the labels have been binarized. Each label is a vector of ten values, with a single 1 marking the class and 0s everywhere else. This can be done for any kind of label.

## Training the network!

```NeuralNetwork([input_layer, hidden_layer, ..., output_layer])```

The ```NeuralNetwork``` class takes a list of numbers in which each number represents the number of nodes in one layer of the network. The first layer is the input layer, whose size corresponds to the size of the data. Since each image is 8x8, the input layer should have 64 nodes. The output layer has 10 nodes (the ten classes that we want to predict). The hidden layers are up to us to specify. We use 32 nodes for the first hidden layer and 16 for the second. These numbers are arbitrary; we decide them. One rule of thumb to keep in mind is that the total number of nodes in the hidden layers should be less than the total number of nodes in the input and output layers. In this case the hidden layers should together not exceed 64 + 10 = 74 nodes, and 32 + 16 = 48 satisfies that.

This is a very simple network because we only define the number of nodes in each layer and the number of epochs (iterations).

NB! This ```NeuralNetwork``` class includes a bias term by default. The bias is based on the shape of the data.
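
To get a feel for what the list of layer sizes implies, the sketch below counts the trainable values in a fully connected network. It assumes one bias value per non-input node, which is a common convention but not necessarily how the ```utils``` class stores its bias. ```count_parameters``` is a hypothetical helper written for this illustration.

```python
def count_parameters(layers):
    """Count weights and biases in a fully connected network.

    Assumes every node in one layer connects to every node in the next,
    plus one bias per node in each non-input layer.
    """
    total = 0
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        weights = n_in * n_out
        biases = n_out
        print(f"{n_in:>3} -> {n_out:<3} weights: {weights:>5}, biases: {biases}")
        total += weights + biases
    return total

layers = [64, 32, 16, 10]
total = count_parameters(layers)
print(f"total trainable parameters: {total}")
```

Even this small network has a few thousand values to learn, which is why training takes many epochs.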

In [37]:
# train network
print("[INFO] training network...")
nn = NeuralNetwork([X_train.shape[1], 32, 16, 10]) # we specify the input layer, the hidden layers, and the output layer. The input layer is the size of the data.

[INFO] training network...


Why can we use ```X_train.shape[1]``` to indicate the number of nodes in the input layer?
```shape[1]``` is the number of columns in the data (the number of pixels in each image). Each individual entry in the data is an array of 64 values (8x8: 8 rows and 8 columns, flattened) which represents one image. The whole ```X_train``` object is a collection of many images of 64 pixels each. Hence, index 1 of the shape gives us the number of pixels per image, which is the number of nodes in the input layer.
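
The same idea on a made-up batch of five blank "images" (the array below is only a stand-in for ```X_train```):

```python
import numpy as np

# A made-up batch of 5 "images", each flattened to 64 pixel values
batch = np.zeros((5, 64))

n_samples, n_features = batch.shape
print(n_samples)    # number of rows = number of images
print(n_features)   # number of columns = pixels per image = input nodes

# A single row can be reshaped back into the 8x8 grid it came from
first_image = batch[0].reshape(8, 8)
print(first_image.shape)
```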

In [38]:
# Fit network to data - training the network for 1000 epochs (iterations)
print(f"[INFO] {nn}")
nn.fit(X_train, y_train, epochs=1000) # epoch = iteration - a full pass through the entire dataset

[INFO] NeuralNetwork: 64-32-16-10
[INFO] epoch=1, loss=645.7462609
[INFO] epoch=100, loss=7.1322572
[INFO] epoch=200, loss=2.5680621
[INFO] epoch=300, loss=1.5792949
[INFO] epoch=400, loss=1.3613776
[INFO] epoch=500, loss=1.2527276
[INFO] epoch=600, loss=0.7741279
[INFO] epoch=700, loss=0.4473712
[INFO] epoch=800, loss=0.2137723
[INFO] epoch=900, loss=0.1637844
[INFO] epoch=1000, loss=0.1353876


We can see that the loss gets lower and lower as training progresses. Each epoch adjusts the weights to reduce the loss, so the model becomes slightly better with every pass through the data.
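
Why does the loss keep falling? The toy example below is not the ```NeuralNetwork``` class. It is a minimal sketch of gradient descent on a model with a single weight, using made-up data and a learning rate picked for illustration. The loop structure (predict, measure the loss, nudge the weight) is the general pattern that each training epoch repeats.

```python
# Made-up data following y = 3 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0               # the single weight we want to learn
learning_rate = 0.01  # illustrative value
losses = []

for epoch in range(1, 51):
    # Forward pass: predictions and summed squared error
    preds = [w * x for x in xs]
    loss = 0.5 * sum((p - y) ** 2 for p, y in zip(preds, ys))
    losses.append(loss)

    # Gradient of the loss with respect to w
    grad = sum((p - y) * x for p, y, x in zip(preds, ys, xs))

    # Update step: move w a little in the direction that lowers the loss
    w -= learning_rate * grad

    if epoch == 1 or epoch % 10 == 0:
        print(f"[INFO] epoch={epoch}, loss={loss:.7f}")

print(f"learned weight: {w:.4f}")
```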

__How many epochs should one use?__

In many ways it comes down to trial and error: seeing which model performs best. We could also perform cross-validation and plot the training score next to the cross-validation score, which lets us see how well the model converges and whether it underfits or overfits. The right number of epochs depends on whether the model is over- or underfitting the data. If the model is underfitting, you would increase the number of epochs; if it is overfitting, you would decrease the number of epochs, stopping training before the model starts fitting the noise in the training data.

Generally speaking it is either:
- Trial-and-error
- Educated guess
- Computational: letting an algorithm search for the optimal parameters
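
One common way to put the last option in the list into practice is early stopping: track the loss on held-out validation data and stop once it has not improved for a while. The sketch below is independent of any framework. ```pick_stopping_epoch```, the ```patience``` parameter, and the loss values are all hypothetical and only illustrate the idea.

```python
def pick_stopping_epoch(val_losses, patience=2):
    """Return the (1-based) epoch with the lowest validation loss,
    ending the scan once `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving: likely overfitting
    return best_epoch

# Made-up validation losses: improving at first, then getting worse
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = pick_stopping_epoch(val_losses, patience=2)
print(f"best epoch: {best}")
```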

We now have a trained network, so we can evaluate it.

## Evaluating the network

We use the model to predict the classes of the test data and use the classification report to see how well the model performs.

In [39]:
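# evaluate network
print("[INFO] evaluating network...")
predictions = nn.predict(X_test) # one score per class for every test image
predictions = predictions.argmax(axis=1) # the index of the highest score is the predicted digit
print(classification_report(y_test.argmax(axis=1), predictions)) # argmax turns the binarized labels back into digits

[INFO] evaluating network...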
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       1.00      1.00      1.00        37
           2       1.00      1.00      1.00        37
           3       1.00      1.00      1.00        35
           4       0.94      1.00      0.97        32
           5       0.94      0.94      0.94        35
           6       1.00      1.00      1.00        35
           7       1.00      0.94      0.97        36
           8       1.00      0.95      0.97        37
           9       0.92      1.00      0.96        33

    accuracy                           0.98       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.98      0.98      0.98       360
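
The two ```argmax``` calls in the evaluation cell turn vectors back into class labels: the network outputs one score per class, and the index of the largest score is the predicted digit. A small sketch with made-up scores (three classes instead of ten) shows this, along with how accuracy, precision and recall for one class can be computed by hand:

```python
import numpy as np

# Made-up network outputs: one row per sample, one score per class
scores = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
])
# Made-up one-hot true labels for the same four samples
y_true_one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

y_pred = scores.argmax(axis=1)          # index of the highest score per row
y_true = y_true_one_hot.argmax(axis=1)  # index of the single 1 per row
print(y_pred, y_true)

accuracy = (y_pred == y_true).mean()    # share of correct predictions

# Precision and recall for class 0
true_pos = np.sum((y_pred == 0) & (y_true == 0))
precision = true_pos / np.sum(y_pred == 0)  # of predicted 0s, how many are right
recall = true_pos / np.sum(y_true == 0)     # of actual 0s, how many were found
print(accuracy, precision, recall)
```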



In this evaluation the model reaches 98% accuracy on the held-out handwritten digits. We can go back to training the model and adjust, for instance, the number of epochs to see what that does to the accuracy.
With a very small number of epochs the model might not converge.
With this neural network class we can tweak different parameters, get different results, and try to find the best model possible.

Often you would start out with a logistic regression classifier to get some simple benchmarks, and then create a simple neural network and see how its results compare to those benchmarks.
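
A minimal sketch of such a benchmark on the same digits data is shown below. The seed and ```max_iter``` value are arbitrary choices made for this illustration, and the resulting scores will depend on the split. Note that scikit-learn classifiers accept the integer labels directly, so no ```LabelBinarizer``` step is needed here.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.data.astype("float")
data = (data - data.min()) / (data.max() - data.min())

X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.2, random_state=42  # arbitrary seed
)

# Fit the baseline classifier on the integer labels
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Evaluate on the held-out test set
baseline_preds = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_preds)
print(classification_report(y_test, baseline_preds))
```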