# Artificial Neural Network - Perceptron

The field of artificial neural networks started out with an electromechanical binary unit called a perceptron.

The perceptron took a weighted set of input signals and chose an output state (on/off or high/low) based on a threshold.

<img src="http://i.imgur.com/c4pBaaU.jpg">

If the output isn't right, we can adjust the weights, the threshold, or the bias ($x_0$ above).
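As a minimal sketch of the thresholded unit described above, here is one way to write it with NumPy. The function name `perceptron` and the hand-picked weights are illustrative assumptions, not part of the original model description.

```python
import numpy as np

def perceptron(x, w, bias, threshold=0.0):
    """Return 1 ("on") if the weighted sum plus bias exceeds the threshold, else 0 ("off")."""
    pre_activation = np.dot(x, w) + bias
    return 1 if pre_activation > threshold else 0

# Hypothetical weights, chosen by hand so the unit behaves like a logical AND.
w = np.array([1.0, 1.0])
bias = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x), w, bias) for x in inputs]
print(outputs)  # [0, 0, 0, 1]
```

Changing `w`, `bias`, or `threshold` changes which inputs switch the unit on, which is exactly the knob-turning that "adjusting the weights" refers to.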

The model was inspired by discoveries about the neurons of animals, so hopes were quite high that it could lead to a sophisticated machine. This model can be extended by adding multiple neurons in parallel. We can also use a linear output instead of a threshold if we like.

If we were to do so, the output would look like ${x \cdot w} + w_0$ (this is where vector multiplication and, eventually, matrix multiplication come in).

When we look at the math this way, we see that despite this being an interesting model, it's really just a fancy linear calculation.

If we compose these, we'll still get a linear model. In fact, the proof that this model, being linear, could not solve any problem whose solution was nonlinear led to the first of several "AI / neural net winters," when the excitement was quickly replaced by disappointment and most research was abandoned.
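To see concretely why composing linear units stays linear, here is a small NumPy check. The layer sizes and random values are arbitrary assumptions chosen only for illustration: two stacked linear layers produce exactly the same outputs as a single linear layer whose weights are the product of the two weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers: 4 inputs -> 3 hidden units -> 1 output (sizes are arbitrary).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)

x = rng.normal(size=(5, 4))  # a batch of 5 made-up input rows

# Forward pass through both layers, with no non-linearity in between.
two_layer = (x @ W1 + b1) @ W2 + b2

# The same mapping collapsed into a single linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))  # True
```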

### Linear Perceptron

We'll get to the non-linear part, but the linear perceptron model is a great way to warm up and bridge the gap from traditional linear regression to the neural-net flavor.

Let's look at a problem -- the diamonds dataset from R -- and analyze it using two traditional methods in Scikit-Learn, and then we'll start attacking it with neural networks!

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

input_file = "data/diamonds.csv"

df = pd.read_csv(input_file, header = 0)
df.drop(df.columns[0], axis=1, inplace=True)
df = pd.get_dummies(df, prefix=['cut_', 'color_', 'clarity_'])

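# Column 3 is the regression target. DataFrame.as_matrix() was removed in pandas 1.0,
# so use to_numpy(); dtype=float keeps the dummy columns numeric.
y = df.iloc[:, 3].to_numpy(dtype=float)

X = df.drop(df.columns[3], axis=1).to_numpy(dtype=float)
print(np.shape(X))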

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
dt = DecisionTreeRegressor(random_state=0, max_depth=10)
model = dt.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)) )

In [None]:
from sklearn import linear_model

lr = linear_model.LinearRegression()
# Use a distinct name so we don't shadow the imported `linear_model` module
lr_model = lr.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)) )

Now that we have a baseline, let's build a neural network -- linear at first -- and go further.

## Neural Network with Keras

### Keras is a High-Level Library for Neural Networks and Deep Learning

#### "*Being able to go from idea to result with the least possible delay is key to doing good research.*"
Keras is maintained by Francois Chollet at Google. It provides:

* High level APIs
* Pluggable backends for Theano and TensorFlow
* CPU/GPU support
* The now-officially-endorsed high-level wrapper for TensorFlow
* Model persistence and other niceties
* JavaScript version (!)
* Interop with further frameworks, like DeepLearning4J

Well, with all this, why would you ever *not* use Keras? If you're implementing something new and low level you probably need to add it down in the TensorFlow layer.

Another way to look at it:

TensorFlow Ops -> TensorFlow Procedures -> Keras 

is a little like

Assembly Code -> C -> Python 

The metaphor here fails because Python has its own VM and so runs code quite differently from C, whereas Keras is a fairly thin wrapper over its backends.

### We'll build a "Dense Feed-Forward Shallow" Network:
(the number of units in the following diagram does not exactly match ours)
<img src="http://i.imgur.com/LqyPRBd.jpg">

In [None]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(30, input_dim=26, kernel_initializer='normal', activation='linear'))
model.add(Dense(1, kernel_initializer='normal', activation='linear'))

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_squared_error'])
history = model.fit(X_train, y_train, epochs=5, batch_size=200)

scores = model.evaluate(X_test, y_test)
print()
print("root %s: %f" % (model.metrics_names[1], np.sqrt(scores[1])))

#### Ouch, not so great!

Well, the neural network model is a bit of a different approach.

Let's do three things. 

First, __what is an epoch? what is a batch?__

Second, let's look at the error ...

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')  # only the training loss is plotted here
plt.show()

Third, let's go outside of Jupyter so that we can run a long-running training, and not lock up our browser.

Open a terminal in the courseware folder...
* Make your deep learning Python environment active with `source activate dl`
* `cd` into the scripts folder
* and run `python keras-diamonds.py`

This will take about 3 minutes to converge to the same performance we got more or less instantly with our sklearn linear regression :)

Once it's started, let's look at the source code and talk about that.

---

> __ASIDE: How exactly is this training working?__ Don't worry, we're going to come back to this in more detail in a little while!

---
Let's also make the connection from Keras down to TensorFlow.

We used a Keras class called Dense, which represents a "fully-connected" layer of -- in this case -- linear perceptrons. Let's look at the source code to that, just to see that there's no mystery.

`https://github.com/fchollet/keras/blob/master/keras/layers/core.py`

It calls down to the "back end" by calling `output = K.dot(x, self.W)`

`K` represents the pluggable backend wrapper. You can trace `K.dot` on TensorFlow by looking at

`https://github.com/fchollet/keras/blob/master/keras/backend/tensorflow_backend.py`

Look for `def dot(x, y):` and look right toward the end of the method. The math is done by calling `tf.matmul(x, y)`
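To make the "no mystery" point concrete, here is a rough NumPy sketch of the math a linear Dense layer boils down to. This is not the Keras source; the function name `dense_linear` and the random stand-in data are assumptions for illustration. The shapes mirror the first layer of our model (26 input features, 30 units, batches of 200).

```python
import numpy as np

def dense_linear(x, W, b):
    """Linear 'Dense' layer: one matrix multiply plus one bias per unit."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 26))   # one batch of 200 rows (random stand-in data)
W = rng.normal(size=(26, 30))    # one column of weights per unit
b = np.zeros(30)                 # one bias per unit

out = dense_linear(x, W, b)
print(out.shape)  # (200, 30)
```

Each of the 30 output columns is one linear perceptron's output for every row in the batch, computed in a single matrix multiplication.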

## Ok so we've come up with a very slow way to perform a linear regression. 

### *Welcome to Neural Networks in the 1960s!*

---

### Watch closely now because this is where the magic happens...

<img src="https://media.giphy.com/media/Hw5LkPYy9yfVS/giphy.gif">

# Non-Linearity + Perceptron = Universal Approximation

### Where does the non-linearity fit in?

* We start with the inputs to a perceptron -- these could be from source data, for example.
* We multiply each input by its respective weight, which gets us the ${x \cdot w}$
* Then add the "bias" -- basically an extra learnable parameter, to get ${x \cdot w} + b$
    * This value (so far) is sometimes called the "pre-activation"
* Now, apply a non-linear "activation function" to this value, such as the logistic sigmoid, ${1 \over {1 + e^{-x} } }$
* This is often written as ${\sigma(x)}$ where x is the pre-activation
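The steps above can be sketched in a few lines of NumPy. The input, weight, and bias values below are made up for illustration, and the helper names `sigmoid` and `neuron` are assumptions rather than anything from a library.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One sigmoid neuron: weighted sum, plus bias, then the activation function."""
    pre_activation = np.dot(x, w) + b   # x . w + b
    return sigmoid(pre_activation)

# Made-up inputs and parameters, just to trace the computation.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, -0.2])
b = 0.1

activation = neuron(x, w, b)
print(activation)
```

Here the pre-activation is about -0.4, so the neuron's output lands a little below 0.5.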

### Now the network can "learn" non-linear functions

To gain some intuition, consider that where the sigmoid is close to 1, we can think of that neuron as being "on" or activated, giving a specific output. When close to zero, it is "off." 

So each neuron is a bit like a switch. If we have enough of them, we can theoretically express arbitrarily many different signals. 

In some ways this is like the original artificial neuron, with the thresholding output -- the main difference is that the sigmoid gives us a smooth (infinitely differentiable) output that we can optimize over using gradient descent to learn the weights. 

### Where does the signal "go" from these neurons?

Assume that we want to get a classification output from these activations. If we have lots of neurons but only, say, 10 classes (like MNIST) we can feed the outputs from these forward into a final layer of 10 neurons, and compare those neurons' activation levels.

* Essentially we choose the output class whose neuron has the highest activation
* To make this mathematically friendly, instead of just using "argmax" we calculate the output using something called "softmax," a smoothed/softened version that is normalized to sum to 1:

$$\sigma (\mathbf {z} )_{j}={\frac {e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}}}$$
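Here is a minimal NumPy sketch of the softmax formula above. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula itself; it doesn't change the result. The example activations are made up.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector of activations; the result sums to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # same result mathematically, but avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])    # hypothetical output-layer activations
probs = softmax(z)

print(probs)
print(probs.sum())               # sums to 1 (up to floating-point rounding)
print(np.argmax(probs))          # same winner as argmax on the raw activations
```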

### So our network looks like this:
<img src="http://i.imgur.com/LqyPRBd.jpg">

* Where we attach our features to the "input layer"
    * Here imagining 3 features in each input record
* Feed those values forward to a sigmoid activation hidden layer
* Then feed the activations from the hidden layer to the output layer
    * Here imagining 2 possible output classes; technically if there are only 2 output classes, we could get away with one neuron in the output layer, but we normally have one per class
* We calculate the softmax vector of the output activations
* ...and that's our probability distribution for the "actual" predicted output
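Putting the bullets above together, a forward pass through this kind of network might look like the following NumPy sketch. The weights are random (untrained), the hidden-layer size of 4 is an arbitrary assumption, and the function names are illustrative only; the point is just to trace the data flow from 3 features to a 2-class probability distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax: each row of the result sums to 1."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def forward(X, W_hidden, b_hidden, W_out, b_out):
    hidden = sigmoid(X @ W_hidden + b_hidden)   # input layer -> sigmoid hidden layer
    logits = hidden @ W_out + b_out             # hidden layer -> output layer
    return softmax(logits)                      # output activations -> probabilities

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 3, 4, 2       # n_hidden = 4 is an arbitrary choice

W_hidden = rng.normal(size=(n_features, n_hidden))
b_hidden = np.zeros(n_hidden)
W_out = rng.normal(size=(n_hidden, n_classes))
b_out = np.zeros(n_classes)

X = rng.normal(size=(5, n_features))            # 5 made-up input records
probs = forward(X, W_hidden, b_hidden, W_out, b_out)

print(probs)                    # one row of class probabilities per record
print(probs.argmax(axis=1))     # predicted class per record
```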

---

> __ASIDE: this structure reproduces the same math as multiclass logistic regression__

---

Ok, before we talk any more theory, let's run it and see if we can do better on our diamonds dataset!

Again, we'll hop outside of Jupyter. On the console, using your Python `dl` conda environment and the `scripts/` folder, run `python keras-sigmoid.py`

While that's running, let's look at the code:

What is different here?

* First, we've changed the activation in the hidden layer to "sigmoid" per our discussion.

Note that we didn't have to explicitly write the "input" layer, courtesy of the Keras API. We just said `input_dim=26` on the first (and only) hidden layer.

* Next, notice that we're running 2000 training epochs!

It takes a long time to converge. If you experiment a lot, you'll find that ... it still takes a long time to converge. Around the early part of the most recent deep learning renaissance, researchers started experimenting with other non-linearities.

*Output here is still using "linear" rather than "softmax" because we're performing regression, not classification*

In theory, any non-linearity should allow learning, and maybe we can use one that "works better."

By "works better" we mean
* Simpler gradient - faster to compute
* Less prone to "saturation" -- where the neuron ends up way off in the 0 or 1 territory of the sigmoid and can't easily learn anything
* Keeps gradients "big" -- avoiding the large, flat, near-zero gradient areas of the sigmoid
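To get a feel for the saturation problem, here is a small NumPy sketch of the sigmoid's gradient at a few pre-activation values. The sample points are arbitrary, and `sigmoid_grad` is an illustrative helper name. It uses the standard identity that the derivative of the sigmoid is $\sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

zs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(zs)

for z, g in zip(zs, grads):
    print(f"z = {z:6.1f}   gradient = {g:.6f}")
```

The gradient peaks at 0.25 when the pre-activation is 0 and shrinks toward zero in both tails, so a neuron stuck far out in either tail passes back almost no learning signal.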

Turns out that the most popular solution is a very simple hack:

### Rectified Linear Unit (ReLU)
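As a quick sketch, ReLU and its gradient are each one line of NumPy. The helper names `relu` and `relu_grad` are illustrative, and treating the gradient at exactly zero as 0 is a convention assumed here.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: pass positive values through, clamp negatives to zero."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 1 where z > 0, otherwise 0 (using 0 at z == 0 by convention)."""
    return (z > 0).astype(float)

zs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(relu(zs))       # [ 0.  0.  0.  2. 10.]
print(relu_grad(zs))  # [0. 0. 0. 1. 1.]
```

Unlike the sigmoid, the gradient on the active side stays at 1 no matter how large the pre-activation gets.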

<img src="images/activation-functions.svg" width=800>

### Go change your hidden-layer activation from 'sigmoid' to 'relu'

Start your script and watch the error for a bit!

Would you look at that?! 

* We break \$1000 RMSE around epoch 112
* \$900 around epoch 220
* \$800 around epoch 450
* By around epoch 2000, my RMSE is about \$620
...


__Same theory; different activation function. Huge difference__

Feel free to experiment with other activation functions. Where would you find the options in Keras? How could you experiment with a custom activation function?

---

### Some things to think about...

1. Consider the shape of the (sub)spaces that these neurons can "carve out"
    * What is the shape like compared to the sigmoid version? The decision tree version? Think about the edges.
2. ReLU is a bit like a logic gate -- an if/else conditional
3. ReLU supports more sparsity -- inactive neurons are just zero -- a form of feature selection
4. Since the pre-activations are linear, and the activations are non-linear, these sorts of models have been compared to a form of GLM (generalized linear model) where the activation takes the place of the "link function"
5. This is a high capacity model if we add lots of neurons
    * But it will have a wide & shallow shape 
    * ... we will move on soon to motivate deeper networks.