__Chapter 12 - Implementing a Multilayer Artificial Neural Network from Scratch__

1. [Modeling complex functions with artificial neural networks](#Modeling-complex-functions-with-artificial-neural-networks)
    1. [Activating a neural network via forward propagation](#Activating-a-neural-network-via-forward-propagation)
1. [Classifying handwritten digits](#Classifying-handwritten-digits)
1. [Implementing a multilayer perceptron](#Implementing-a-multilayer-perceptron)
    1. [Homegrown implementation](#Homegrown-implementation)


In [None]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
dataPath = os.path.abspath(os.path.join('../../Data'))
modulePath = os.path.abspath(os.path.join('../../CustomModules'))
if modulePath not in sys.path:
    sys.path.append(modulePath)
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))


# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format


# Modeling extensions
import sklearn.base as base
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.decomposition as decomposition
import sklearn.discriminant_analysis as discriminant_analysis
import sklearn.ensemble as ensemble
import sklearn.feature_extraction as feature_extraction
import sklearn.feature_selection as feature_selection
import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm
import sklearn.tree as tree
import sklearn.utils as utils


# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt


# Custom extensions and settings
from quickplot import qp, qpUtil, qpStyle
from mlTools import powerGridSearch
sns.set(rc = qpStyle.rcGrey)


# Magic functions
%matplotlib inline


<a id = 'Modeling-complex-functions-with-artificial-neural-networks'></a>

# Modeling complex functions with artificial neural networks

A multilayer perceptron (MLP) is a fully connected feedforward network. The MLP considered here has one input layer of neurons, one hidden layer and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. A network with more than one hidden layer is considered a deep artificial neural network.

Each neuron, or activation unit, is identified by its position within its layer and by the layer in which it appears: $a_i^l$ is the $i$th neuron in the $l$th layer. For simplicity, this walkthrough uses the $l$ values $in$, $h$ and $out$ to denote the input, hidden and output layers, so $a_i^{out}$ is the $i$th activation unit of the output layer. The input and hidden layers each have a bias unit, $a_0^{in}$ and $a_0^{h}$, and these are set to one. This means the input layer is just the input values plus the bias unit:

$$
a^{in} 
= 
\begin{bmatrix} a_0^{in} \\ a_1^{in} \\ \vdots \\  a_m^{in} \end{bmatrix}
=
\begin{bmatrix} 1 \\ x_1^{in} \\ \vdots \\  x_m^{in} \end{bmatrix}
$$ 

Each activation unit in layer $l$ is connected to all of the units in layer $l + 1$ by a weight coefficient. As an example, the connection between the $k$th unit in layer $l$ and the $j$th unit in layer $l + 1$ is written as $w_{k,j}^l$. The weight matrix that connects the input layer to the hidden layer is $\mathbf{W}^{h}$, and the weight matrix that connects the hidden layer to the output layer is $\mathbf{W}^{out}$. For example, $\mathbf{W}^h \in \mathbb{R}^{m \times d}$, where $d$ is the number of hidden units and $m$ is the number of input units (including bias).
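
The shapes described above can be sketched in NumPy. This is a minimal illustration, assuming three raw features and five hidden units; the sizes, the random weights and the variable names are not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.2, 3.0])            # raw input features (illustrative values)
a_in = np.concatenate(([1.0], x))         # prepend the bias unit a_0 = 1

m = a_in.shape[0]                         # number of input units, bias included
d = 5                                     # number of hidden units (assumed)
W_h = rng.normal(scale=0.1, size=(m, d))  # W_h[k, j] connects input unit k to hidden unit j

print(a_in.shape, W_h.shape)              # (4,) (4, 5)
```

Column $j$ of `W_h` holds every weight that feeds hidden unit $j$, which is why the matrix has one row per input unit and one column per hidden unit.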

Having one unit in the output layer is sufficient for a binary classification task, but having more than one enables multiclass classification through a one-hot vector representation of the class labels. For example, the three class labels 0, 1 and 2 are encoded as:
$$
0
=
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},
\quad
1
=
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix},
\quad
2
=
\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
$$
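
A one-hot encoding like the one above might be produced as follows. The helper name `onehot` is hypothetical, and the encoded labels are stored as rows (one row per sample) rather than as the column vectors shown in the equation.

```python
import numpy as np

def onehot(y, n_classes):
    """Encode integer labels as rows of a one-hot matrix (hypothetical helper)."""
    encoded = np.zeros((y.shape[0], n_classes))
    encoded[np.arange(y.shape[0]), y] = 1.0   # set one entry per row
    return encoded

y = np.array([0, 1, 2, 1])
Y = onehot(y, n_classes=3)
print(Y)
```

Taking `Y.argmax(axis=1)` recovers the original integer labels, which is the same operation that turns network outputs back into class predictions.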


<a id = 'Activating-a-neural-network-via-forward-propagation'></a>

## Activating a neural network via forward propagation

The MLP learning procedure can be summarized in three steps:

1. Starting at the input layer, forward propagate the patterns of the training data through the network to generate an output
2. Using the output, calculate the error to be minimized using a cost function
3. Backpropagate the error, find its derivative with respect to each weight in the network, then update the model

In feedforward networks, each layer serves as the input to the next layer without any loops, in contrast to recurrent neural networks. The three steps above are repeated for multiple epochs to learn the weights. Once the network is trained, forward propagation is used to calculate the network output, and a threshold function is applied to obtain the predicted class labels in the one-hot format above. Describing the forward propagation step in more detail:

The first activation unit in the hidden layer $a_1^{h}$ is connected to all units in the input layer, and is calculated by:

$$
z_1^h = a_0^{in}w_{0,1}^h + a_1^{in}w_{1,1}^h + \dots + a_m^{in}w_{m,1}^h
$$
$$
a_1^h = \phi\big(z_1^h\big)
$$
$z_1^h$ is the net input and $\phi(\cdot)$ is the activation function that acts on $z_1^h$. The activation function needs to be differentiable so that the weights connecting the neurons can be learned with a gradient-based approach. To solve complex problems such as image classification, the activation function also needs to be non-linear. One familiar non-linear activation function is the sigmoid function, which arose in the context of logistic regression:

$$
\phi(z) = \frac{1}{1 + e^{-z}}
$$
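
The sigmoid function can be sketched in NumPy as below. Clipping the net input is an added safeguard against overflow in `np.exp` and is not part of the formula itself; the clip bounds are an assumption.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid; z is clipped to avoid overflow in exp (assumed bounds)."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))
```

The function is applied element-wise, so the same code works for a single net input, a vector or a matrix of net inputs.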

The sigmoid is an S-shaped curve that maps the net input $z$ onto the range 0 to 1 and crosses the y-axis at 0.5, where $z = 0$. Given this, we can think of each neuron as a logistic regression unit that returns values in the continuous range of 0 to 1. To describe the hidden layer activation in linear algebra notation:

$$
\begin{aligned}
\textbf{z}^{h} &= \textbf{a}^{in}\textbf{W}^h
\\
\textbf{a}^h &= \phi\big(\textbf{z}^h\big)
\end{aligned}
$$

$\textbf{a}^{in}$ is the 1 x $m$ dimensional feature vector for a sample $\textbf{x}^{in}$, plus the bias unit. $\textbf{W}^{h}$ is the $m$ x $d$ dimensional weight matrix where $d$ is the number of units in the hidden layer. Through matrix-vector multiplication, we obtain a 1 x $d$ dimensional net input vector $\textbf{z}^h$ to be used to calculate the activation $\textbf{a}^{h}$ ($\textbf{a}^{h} \in \mathbb{R}^{1 \times d}$). This computation can be generalized to all $n$ samples in the training set by:

$$
\textbf{Z}^{h} = \textbf{A}^{in}\textbf{W}^{h}
$$

In this representation, $\textbf{A}^{in}$ is an $n$ x $m$ matrix, and the matrix-matrix multiplication results in an $n$ x $d$ dimensional net input matrix $\textbf{Z}^{h}$. Lastly, apply the activation function $\phi(\cdot)$ to each value in the net input matrix to get the $n$ x $d$ dimensional matrix $\textbf{A}^{h}$ for the next layer, which in this case is the output layer:

$$
\textbf{A}^h = \phi\big(\textbf{Z}^{h}\big)
$$
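
To illustrate that the matrix form is just the single-sample computation applied to every row, here is a small sketch with made-up sizes ($n = 6$, $m = 4$, $d = 3$) and random values; none of these numbers come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m, d = 6, 4, 3                      # samples, input units (with bias), hidden units

# Bias column of ones followed by m - 1 feature columns
A_in = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m - 1))])
W_h = rng.normal(scale=0.1, size=(m, d))

Z_h = A_in @ W_h                       # n x d net input matrix
A_h = sigmoid(Z_h)                     # n x d hidden activations

# Row i of the batch result matches the single-sample computation
a_h_single = sigmoid(A_in[2] @ W_h)
print(A_h.shape, np.allclose(A_h[2], a_h_single))   # (6, 3) True
```

Computing all samples in one matrix product avoids a Python loop over the training set.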

Just as above, we can write the net input of the output layer in vectorized form for multiple samples:
$$
\textbf{Z}^{out} = \textbf{A}^{h}\textbf{W}^{out}
$$

In this last step, we multiply the $n$ x $d$ dimensional matrix $\textbf{A}^{h}$ by the $d$ x $t$ matrix $\textbf{W}^{out}$ (where $t$ is the number of output units) to obtain the $n$ x $t$ dimensional matrix $\textbf{Z}^{out}$, where each row holds the outputs for one sample. The last step is to apply the sigmoid activation function to obtain the continuous valued output of the network:

$$
\textbf{A}^{out} = \phi\big(\textbf{Z}^{out}\big), \textbf{A}^{out} \in \mathbb{R}^{n \times t}
$$
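
Putting the equations of this section together, a forward pass might look like the sketch below. The function name `forward`, the 100 hidden units, the random weights and the use of `argmax` to pick the predicted class are all assumptions made for illustration. Following the matrix notation above, no separate bias unit is added to the hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

def forward(A_in, W_h, W_out):
    """Forward propagate a batch through one hidden layer (illustrative sketch)."""
    Z_h = A_in @ W_h          # n x d net input of the hidden layer
    A_h = sigmoid(Z_h)        # n x d hidden activations
    Z_out = A_h @ W_out       # n x t net input of the output layer
    A_out = sigmoid(Z_out)    # n x t continuous valued outputs
    return Z_h, A_h, Z_out, A_out

rng = np.random.default_rng(42)
n, m, d, t = 5, 785, 100, 10  # 784 pixels + bias, 100 hidden units (assumed), 10 classes
A_in = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, m - 1))])
W_h = rng.normal(scale=0.1, size=(m, d))
W_out = rng.normal(scale=0.1, size=(d, t))

_, _, Z_out, A_out = forward(A_in, W_h, W_out)
y_pred = np.argmax(Z_out, axis=1)   # predicted class = output unit with the largest net input
print(A_out.shape, y_pred.shape)    # (5, 10) (5,)
```

Because the weights here are random, the predictions are meaningless; the point is only that the shapes line up as $n \times m$, $n \times d$ and $n \times t$.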

<a id = 'Classifying-handwritten-digits'></a>

# Classifying handwritten digits

In this section we implement and train our first MLP to classify handwritten digits from the Modified National Institute of Standards and Technology (MNIST) dataset. It consists of handwritten digits from 250 people, half high school students and half Census Bureau employees.

In [1]:
# Read image files into numpy arrays

import struct

def load_mnist(path, kind = 'train'):
    """
    This returns two arrays. images is an n x m dimensional array, where n is the
    number of samples and m is the number of features, which in this context is the
    number of pixels. The training data includes 60,000 digits and the test data includes
    10,000 digits. The second array labels includes the target variable, which takes on
    an integer value between 0 and 9.    
    
    The images in this dataset are 28 x 28 pixels in size, and each pixel is represented
    by a gray scale intensity value. This function unrolls the 28 x 28 pixels into a
    one-dimensional row vector of length 784, which is represented in the images array.
    """    
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        # Header: magic number and item count as big-endian unsigned 32-bit integers
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype = np.uint8)
    with open(images_path, 'rb') as imgpath:
        # Header: magic number, image count, rows and columns
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        images = np.fromfile(imgpath, dtype = np.uint8).reshape(len(labels), 784)
        
        # Normalize pixel values to range from -1 to 1 rather than 0 to 255
        images = ((images / 255.) - 0.5) * 2
    return images, labels
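
To make the header parsing in `load_mnist` concrete, the sketch below builds two tiny IDX-style byte strings in memory and reads them back the same way. The three labels and the 2 x 2 "images" are made up for illustration; only the header layout mirrors the MNIST files.

```python
import struct
import numpy as np

# Tiny in-memory label file: magic number 2049, 3 items, then one byte per label
label_bytes = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])
magic, n = struct.unpack('>II', label_bytes[:8])   # two big-endian unsigned 32-bit ints
labels = np.frombuffer(label_bytes, dtype=np.uint8, offset=8)

# Matching image file: magic number 2051, 3 images of 2 x 2 pixels
pixels = np.arange(12, dtype=np.uint8)
image_bytes = struct.pack('>IIII', 2051, 3, 2, 2) + pixels.tobytes()
magic_img, num, rows, cols = struct.unpack('>IIII', image_bytes[:16])
images = np.frombuffer(image_bytes, dtype=np.uint8, offset=16).reshape(num, rows * cols)

# The same scaling used in load_mnist maps 0 to -1 and 255 to 1
scaled = ((np.array([0, 255]) / 255.) - 0.5) * 2

print(magic, n, labels, images.shape, scaled)
```

Each row of `images` is one unrolled image, which is the same layout `load_mnist` returns with 784 columns instead of 4.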


In [None]:
# Load the training and test sets and confirm their dimensions

XTrain, yTrain = load_mnist('', kind = 'train')
print('Rows: {}, columns: {}'.format(XTrain.shape[0], XTrain.shape[1]))

XTest, yTest = load_mnist('', kind = 't10k')
print('Rows: {}, columns: {}'.format(XTest.shape[0], XTest.shape[1]))


In [None]:
# Visualize one sample of each digit

fig, ax = plt.subplots(nrows = 2, ncols = 5, sharex = True, sharey = True)

ax = ax.flatten()
for i in range(10):
    img = XTrain[yTrain == i][0].reshape(28,28)
    ax[i].imshow(img, cmap = 'Greys')
    
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()


In [None]:
# Visualize multiple samples of the same digit

fig, ax = plt.subplots(nrows = 5, ncols = 5, sharex = True, sharey = True)

ax = ax.flatten()
for i in range(25):
    img = XTrain[yTrain == 7][i].reshape(28,28)
    ax[i].imshow(img, cmap = 'Greys')
    
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()



In [None]:
# Efficiently save image files to avoid reloading and to save space

np.savez_compressed('mnist_scaled.npz'
                   ,XTrain = XTrain
                   ,yTrain = yTrain
                   ,XTest = XTest
                   ,yTest = yTest)


In [None]:
# Load preprocessed images from compressed files

mnist = np.load('mnist_scaled.npz')


In [None]:
# Review files

mnist.files


In [None]:
# Load retrieved compressed files into arrays

XTrain, yTrain, XTest, yTest = [mnist[f] for f in mnist.files]
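
The list comprehension above relies on the order of `mnist.files`. A variation that indexes the archive by the keyword names used when saving is sketched below. It uses small stand-in arrays and a hypothetical file name in a temporary directory, so it does not touch the real MNIST archive.

```python
import os
import tempfile
import numpy as np

# Small stand-in arrays (illustrative); the notebook saves the MNIST arrays instead
arrays = {
    'XTrain': np.zeros((4, 784)),
    'yTrain': np.array([0, 1, 2, 3]),
    'XTest': np.ones((2, 784)),
    'yTest': np.array([4, 5]),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_scaled.npz')   # hypothetical file name
    np.savez_compressed(path, **arrays)

    with np.load(path) as archive:
        # Index by the keyword used when saving, so the order does not matter
        XTrain, yTrain = archive['XTrain'], archive['yTrain']
        XTest, yTest = archive['XTest'], archive['yTest']

print(XTrain.shape, yTrain.shape, XTest.shape, yTest.shape)
```

Indexing by name makes the unpacking explicit and keeps it correct even if the arrays are saved in a different order later.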


<a id = 'Implementing-a-multilayer-perceptron'></a>

# Implementing a multilayer perceptron

<a id = 'Homegrown-implementation'></a>

## Homegrown implementation

In [None]:
# Skeleton of the MLP classifier class

class NeuralNetMLP():
    """
    Feedforward neural network / Multilayer perceptron classifier
    """
