## Multi-layer perceptron

I am not even going to try and write a better intro. to neural nets than this...

https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/

### Softmax Equation

Given an array of values of length n, the softmax of value i in the array is:

$$\frac{e^{i}}{\sum_{j}^{n}e^{j}}$$

### Deep Neural Network

When you have multiple hidden layers - the layers in between the input and softmax layers, the network is called deep.

### Backpropagation

Neural nets are trained using a technique called backpropagation. At a very high level, you pass a training example through your network (forward pass), then measure its error, and then you go backwards through each layer to measure the contribution of each connection to the error (backwards pass). You then use this information to adjust the weights of your connections using gradient descent. 

### Activation Functions

The article above does not talk much about activation functions. Typically, in an MLP after you pass connections to a neuron you then apply an activation function. Historically, that activation function was a logistic function, which then is basically logistic regression.

Another very popular activation function now is relu. Relu(z) = max(0,z). This is very fast to compute and in practice works very well.

### Cross-entropy

$$-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_{k}^{i}log(p_{k}^{i})$$

Where:

* m - the number of data points
* K - the number of classes
* y_{k}^{i} - the true class value for row i, class k. Either a zero or one depending on if k is the correct class
* p_{k}^{i} - the value predicted by your model for class k, row i. Usually from your softmax

This is the cost function you are trying to minimze.

### Important to Remember

* Scale data - usually zero to one
* Shuffle data

### Tuning Hyper-parameters

* Better to use random search
* Start with reasonable, known architectures
* Number of hidden layers:
    * Often can be valuable to have a deep network to learn heirarchy. Usually converge faster and generalize better. 
    * More complex problems can often require deeper networks and more data
* Number of neurons:
    * Typically size the layers to form a type of funnel with fewer and fewer neurons at each layer. This comes back the heirachy idea where you might need more neurons to learn lower level features. 
    * Also can try picking same number of neurons for all layers to have less parameters to tune
* Usually more value in going deeper than wider
* Can try going deeper and wider than you think necessary and use regularization techinques to prevent overfitting. Such as early stopping.

In [1]:
import numpy as np

values = np.array([1.0, 3.0, 8.0, 4.0, 12.0])
exp_values = np.exp(values)
softmax = exp_values / sum(exp_values)
print([round(x,2) for x in softmax])
print(sum(softmax))

[0.0, 0.0, 0.02, 0.0, 0.97999999999999998]
1.0


## Example using Python

In [7]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import np_utils
from keras.datasets import mnist
from sklearn.metrics import confusion_matrix
import numpy as np
from __future__ import division

In [2]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)

def vectorize_image(images):
    scaled_images = images / 255
    return images.reshape(scaled_images.shape[0],-1)

x_train = vectorize_image(x_train)
x_test = vectorize_image(x_test)

In [33]:
model = Sequential([
    Dense(128, input_shape=(784,)),
    Dense(10, activation='relu'),
    Activation('softmax'),
])

In [34]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 128)               100480    
_________________________________________________________________
dense_13 (Dense)             (None, 10)                1290      
_________________________________________________________________
activation_5 (Activation)    (None, 10)                0         
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________


In [35]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy')

In [36]:
model.fit(x_train, y_train, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x127645e48>

In [37]:
test_predictions = np.argmax(model.predict(x_test),1)
y_test_sparse = np.argmax(y_test, 1)

In [38]:
confusion_matrix(y_test_sparse, test_predictions)

array([[865,   0,   0,   2,   0,  35,   0,  70,   0,   8],
       [  0,   0,   0, 869,   0,  95,   0,   0,   0, 171],
       [ 74,   0,   0, 702,   0,  36,   0, 118,   0, 102],
       [ 12,   0,   0, 846,   0,  73,   0,  48,   0,  31],
       [  3,   0,   0,   2,   0,   7,   0,  12,   0, 958],
       [ 32,   0,   0,  60,   0, 694,   0,  53,   0,  53],
       [ 61,   0,   0,  11,   0, 389,   0,  34,   0, 463],
       [  0,   0,   0,  56,   0,   0,   0, 774,   0, 198],
       [ 27,   0,   0, 232,   0, 457,   0,  66,   0, 192],
       [  3,   0,   0,  13,   0,  11,   0,  23,   0, 959]])

In [39]:
np.sum(y_test_sparse == test_predictions) / test_predictions.shape

array([ 0.4138])