## Multi-layer perceptron

I am not even going to try and write a better intro. to neural nets than this...

https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/

### Softmax Equation

Given an array of values of length n, the softmax of value i in the array is:

$$\frac{e^{i}}{\sum_{j}^{n}e^{j}}$$

### Deep Neural Network

When you have multiple hidden layers - the layers in between the input and softmax layers, the network is called deep.

### Backpropagation

Neural nets are trained using a technique called backpropagation. At a very high level, you pass a training example through your network (forward pass), then measure its error, and then you go backwards through each layer to measure the contribution of each connection to the error (backwards pass). You then use this information to adjust the weights of your connections using gradient descent. 

### Vanishing/Exploding gradients

When your gradients start to get too small or too large this can negatively effect learning. For example, a zero gradient will stop learning all together and when you gradients get too large your learning can diverge.

### Activation Functions

The article above does not talk much about activation functions. Typically, in an MLP after you pass connections to a neuron you then apply an activation function. Historically, that activation function was a logistic function, which then is basically logistic regression. This tends to suffer from vanishing gradient problem.

Another very popular activation function now is relu. Relu(z) = max(0,z). This is very fast to compute and in practice works very well. This function suffers less from the vanishing gradient problem.

One problem with relu is that the connections can die. This happens if the inputs to a neuron end up negative resulting in a zero gradient. Thus, the **leaky relu** was invented: Leaky Relu(x) = max($\alpha$x, x) where $\alpha$ is usually a value of 0.01 or 0.02. The $\alpha$ value is the slope when x < 0 and ensures that the activation never truly dies, though it can become quite small.

**Elu** is another activation function which generally performs the best but is slower to compute then a leaky relu. Again, when x > 0 you just get x. But when x < 0 you get $\alpha$(exp(x) -1). $\alpha$ represents the value that the function approaches when x is a large negative number. Usually, it is set to 1. This function is also smooth everywhere, including zero.

### Batch Normalization

As we have learned it is important to scale - or normalize - your data before feeding it to a neural net. Another important normalization step is right before your activation function to again normalize your data by subtracting the mean and dividing by the standard deviation. Since you are working with a batch, you use the batch mean and standard deviation. You also allow each batch normalization to learn an appropriate scaling and shifiting factor for your standardized values. 

This technique has been shown to reduce the vanishing/exploding gradient problem, allow the use or larger learning rates, and be less sensitive to initalization. On the downside, it reduces runtime prediction speed.


### Cross-entropy

$$-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_{k}^{i}log(p_{k}^{i})$$

Where:

* m - the number of data points
* K - the number of classes
* y_{k}^{i} - the true class value for row i, class k. Either a zero or one depending on if k is the correct class
* p_{k}^{i} - the value predicted by your model for class k, row i. Usually from your softmax

This is the cost function you are trying to minimze.

### Important to Remember

* Scale data - usually zero to one
* Shuffle data

### Tuning Hyper-parameters

* Better to use random search
* Start with reasonable, known architectures
* Number of hidden layers:
    * Often can be valuable to have a deep network to learn heirarchy. Usually converge faster and generalize better. 
    * More complex problems can often require deeper networks and more data
* Number of neurons:
    * Typically size the layers to form a type of funnel with fewer and fewer neurons at each layer. This comes back the heirachy idea where you might need more neurons to learn lower level features. 
    * Also can try picking same number of neurons for all layers to have less parameters to tune
* Usually more value in going deeper than wider
* Can try going deeper and wider than you think necessary and use regularization techinques to prevent overfitting. Such as early stopping.

### Pretraining

### Optimizers

### Initialization

### Regularization

### Data Augmentation

In [1]:
import numpy as np

values = np.array([1.0, 3.0, 8.0, 4.0, 12.0])
exp_values = np.exp(values)
softmax = exp_values / sum(exp_values)
print([round(x,2) for x in softmax])
print(sum(softmax))

[0.0, 0.0, 0.02, 0.0, 0.97999999999999998]
1.0


## Example using Python

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import np_utils
from keras.datasets import mnist
from sklearn.metrics import confusion_matrix
import numpy as np
from __future__ import division

Using TensorFlow backend.


In [3]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)

def vectorize_image(images):
    scaled_images = images / 255
    return images.reshape(scaled_images.shape[0],-1)

x_train = vectorize_image(x_train)
x_test = vectorize_image(x_test)

In [4]:
model = Sequential([
    Dense(128, input_shape=(784,), activation='relu'),
    Dense(10),
    Activation('softmax'),
])

In [5]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 128)               100480    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0         
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________


In [6]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy')

In [7]:
model.fit(x_train, y_train, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1281ebeb8>

In [8]:
test_predictions = np.argmax(model.predict(x_test),1)
y_test_sparse = np.argmax(y_test, 1)

In [9]:
confusion_matrix(y_test_sparse, test_predictions)

array([[   0,    0,   34,  265,    9,  452,  213,    7,    0,    0],
       [   0, 1120,    3,    6,    0,    2,    4,    0,    0,    0],
       [   0,    1,  954,   34,   12,    7,   14,   10,    0,    0],
       [   0,    2,   80,  858,    2,   44,    8,   16,    0,    0],
       [   0,    1,    6,    5,  953,    3,   13,    1,    0,    0],
       [   0,    0,    2,   63,   23,  785,   15,    4,    0,    0],
       [   0,    3,    2,    4,    4,   14,  931,    0,    0,    0],
       [   0,   11,   13,   60,   17,    1,    2,  924,    0,    0],
       [   0,   21,   51,  562,  179,  113,   20,   28,    0,    0],
       [   0,   12,    0,   32,  817,   19,    4,  125,    0,    0]])

In [10]:
np.sum(y_test_sparse == test_predictions) / test_predictions.shape

array([ 0.6525])