## Perceptrons

A perceptron multiplies each input by its weight, adds the bias, and checks whether the result is positive or negative:

Wx + b >= 0

## Why 'Neural Networks'?

Neurons in the brain take inputs, perform some calculations on them, and decide whether to 'fire'. Similarly, in deep neural networks, we take weighted inputs, perform calculations, and decide whether to output a 1 or a 0.

## Perceptrons as Logical Operators

#### AND Perceptron

If all inputs are 'true', output is 'true', otherwise output is 'false'

#### OR Perceptron

If any inputs are 'true', output is 'true', otherwise output is 'false'

#### XOR Perceptron

If exactly one input is 'true' and the other is 'false', output is 'true', otherwise output is 'false'

## XOR Multi-Layer Perceptron

Feed the same inputs into a NAND perceptron (AND followed by NOT) and an OR perceptron, then feed both of their outputs into an AND perceptron to get XOR.
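
A minimal sketch of this construction, assuming step-activation perceptrons with hand-picked weights. The helper names (`step_perceptron`, `nand_gate`, and so on) and the specific weights are illustrative choices, not the result of training:

```python
def step_perceptron(weights, bias, inputs):
    """Return 1 if the weighted sum plus bias is non-negative, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

def nand_gate(x1, x2):
    # Negated AND weights: outputs 1 unless both inputs are 1
    return step_perceptron([-1, -1], 1.5, [x1, x2])

def or_gate(x1, x2):
    return step_perceptron([1, 1], -0.5, [x1, x2])

def and_gate(x1, x2):
    return step_perceptron([1, 1], -1.5, [x1, x2])

def xor_network(x1, x2):
    # Hidden layer: NAND and OR on the raw inputs; output layer: AND of both
    return and_gate(nand_gate(x1, x2), or_gate(x1, x2))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

Every gate here is itself a perceptron, so the whole XOR is a two-layer network rather than a single perceptron.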

## Perceptron Trick

A way of moving the separating line so that it better classifies misclassified points.

Take the coefficients of the separating line, and take the misclassified point's coordinates (with a 1 appended for the bias term) multiplied by a chosen factor (the learning rate). Subtract the scaled point from the coefficients if a negative point was classified as positive, or add it if a positive point was classified as negative. The result is the equation of the new separating line, which lies closer to the point.
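
As a worked example with illustrative numbers, take the line $3x_1 + 4x_2 - 10 = 0$ and the point $(1, 1)$, and assume the point is positive. It is classified as negative because $3 + 4 - 10 < 0$. With a learning rate of 0.1, adding the scaled point (with a 1 for the bias) gives:

$$ (3 + 0.1)x_1 + (4 + 0.1)x_2 + (-10 + 0.1) = 3.1x_1 + 4.1x_2 - 9.9 = 0 $$

The point now scores $3.1 + 4.1 - 9.9 = -2.7$ instead of $-3$, so the line has moved slightly toward it. Repeating the step keeps moving the line until the point is classified correctly.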

## Error Function

To use gradient descent, the error function must be continuous and differentiable. To achieve this, the predictions have to be continuous instead of discrete: replace the step function (which outputs 0 or 1) with the sigmoid function (which outputs a probability between 0 and 1).

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

## Softmax

The equivalent of the sigmoid activation function for problems with 3 or more classes
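
For reference, with linear scores $z_1, \dots, z_n$ for the $n$ classes (the symbol $z$ is notation assumed here, not used elsewhere in these notes), softmax assigns class $i$ the probability:

$$ P(\text{class } i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} $$

The exponential makes every score positive, and dividing by the total makes the probabilities sum to 1.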

## Maximum Likelihood

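Choose the model that maximizes the probability of all points having their correct labels. Because a product of many probabilities becomes very small, work with the sum of the natural logs of the probabilities rather than their product.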
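
A small sketch of why the log form is equivalent, assuming four made-up probabilities that a model assigns to the correct labels of four points:

```python
import math

# Hypothetical probabilities the model assigns to each point's true label
probabilities = [0.7, 0.9, 0.8, 0.6]

# Likelihood: product of the probabilities (shrinks quickly with many points)
likelihood = math.prod(probabilities)

# Sum of negative logs: equals -ln(likelihood), but stays well scaled
negative_log_likelihood = -sum(math.log(p) for p in probabilities)

print(likelihood)
print(negative_log_likelihood)
```

Maximizing the product and minimizing the sum of negative logs pick out the same model, since the log is monotonic.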

## Cross Entropy
See function below

## Multi-class Cross Entropy

more complicated...
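
A minimal sketch of the multi-class case, assuming one-hot labels and the averaged formula given in the Logistic Regression section of these notes. The function name and the sample data are hypothetical:

```python
import math

def multiclass_cross_entropy(labels, probabilities):
    """Average cross entropy over m points with one-hot labels.

    labels[i][j] is 1 if point i belongs to class j, else 0.
    probabilities[i][j] is the predicted probability of class j for point i.
    """
    m = len(labels)
    total = 0.0
    for y_row, p_row in zip(labels, probabilities):
        # Only the true class contributes; skipping zeros also avoids log(0)
        total += sum(y * math.log(p) for y, p in zip(y_row, p_row) if y)
    return -total / m

labels = [[1, 0, 0], [0, 1, 0]]
probabilities = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(multiclass_cross_entropy(labels, probabilities))
```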

## Logistic Regression

Error function

If y == 1<br>
P(pos) = $\hat{y}$<br>
Error = -ln($\hat{y}$)

If y == 0<br>
P(neg) = 1 - P(pos) = 1 - $\hat{y}$<br>
Error = -ln(1 - $\hat{y}$)

Summarized into one function:
$$ \text{Error} = -(1-y)\ln(1-\hat{y}) - y\ln(\hat{y}) $$

$$ \text{Error function} = -\frac{1}{m}\sum_{i=1}^{m} \left[ (1-y_i)\ln(1-\hat{y}_i) + y_i\ln(\hat{y}_i) \right] $$

$$ \text{Multiclass error function} = -\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n} y_{ij}\ln(\hat{y}_{ij}) $$

$$ E(W,b) = -\frac{1}{m}\sum_{i=1}^{m} \left[ (1-y_i)\ln(1-\sigma(Wx^{(i)}+b)) + y_i\ln(\sigma(Wx^{(i)}+b)) \right] $$

## Perceptron vs Gradient Descent

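Gradient descent: change $w_i$ to $w_i + a(y-\hat{y})x_i$, where $a$ is the learning rate. Every point contributes to the update, whether or not it is misclassified.

Perceptron algorithm: if $x$ is misclassified, change $w_i$ to $w_i + a x_i$ if the point is positive, or to $w_i - a x_i$ if it is negative. Correctly classified points leave the weights unchanged.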
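
A minimal sketch of the gradient descent update for a single point, assuming a sigmoid prediction. The function names are hypothetical, and updating the bias with the same rule (as a weight on a constant input of 1) is an assumption of this sketch:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def gradient_descent_step(weights, bias, x, y, learning_rate=0.1):
    """Apply w_i <- w_i + a * (y - y_hat) * x_i for one point."""
    y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    new_weights = [w + learning_rate * (y - y_hat) * xi
                   for w, xi in zip(weights, x)]
    # Assumption: the bias is treated as a weight on a constant input of 1
    new_bias = bias + learning_rate * (y - y_hat)
    return new_weights, new_bias

weights, bias = [3.0, 4.0], -10.0
new_weights, new_bias = gradient_descent_step(weights, bias, x=[1.0, 1.0], y=1)
print(new_weights, new_bias)
```

Unlike the perceptron rule, the size of the step scales with how wrong the prediction is, through the factor $(y-\hat{y})$.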


## Non-linear Models

* Combine multiple linear models into a non-linear model
* Take probability of each point being positive or negative for each model
* Combine these probabilities by adding them together and then applying the sigmoid function
* Can add weights to each probability to make one model more important than the others

## Neural Network Architecture

* Create deep learning network by combining linear models to create non-linear models, then combining those to create more non-linear models
    * This is where the magic of deep learning happens!
* Input layer: the input values (features) fed into the network
* Hidden layer: where linear models are combined
* Output layer: final non-linear model
* One output node for each class (for multi-class classification), with each output being the probability that the point lies in that class

## Feedforward

* Process of taking the input vector, multiplying it by each layer's weight matrix in turn, and applying the sigmoid function after each multiplication, until we get the final prediction
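
A minimal feedforward sketch in plain Python, assuming a hypothetical 2-2-1 network with made-up weights and a sigmoid after every layer:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def layer_forward(weights, biases, inputs):
    """One layer: sigmoid(W x + b), with W given as a list of rows."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def feedforward(layers, inputs):
    """Pass the inputs through each (weights, biases) pair in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(weights, biases, activations)
    return activations

# Hypothetical 2-2-1 network with made-up weights
layers = [
    ([[0.4, 0.6], [-0.3, 0.8]], [-0.2, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
print(feedforward(layers, [1.0, 0.5]))
```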

## Backpropagation

* Do feedforward operation
* Compare output of model with desired output
* Calculate error
* Run the feedforward operation backwards (backpropagation) to spread the error to each of the weights
* Use this to update the weights, and get a better model
* Continue this until we have a good model
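
The steps above can be sketched for a single training point as follows. This assumes a hypothetical 2-2-1 network with sigmoid activations, the cross-entropy error, and made-up starting weights; all names are illustrative:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def forward(params, x):
    """Feedforward pass; returns the hidden activations and the output."""
    W1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output

def cross_entropy(y, y_hat):
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def backprop_step(params, x, y, learning_rate=0.5):
    """One feedforward pass, backward pass, and weight update."""
    W1, b1, w2, b2 = params
    hidden, y_hat = forward(params, x)

    # Error signal at the output node (derivative of the cross entropy
    # with respect to the input of the output sigmoid)
    delta_out = y_hat - y

    # Spread the error back to each hidden node through its outgoing weight
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w2, hidden)]

    # Gradient descent updates for every weight and bias
    new_w2 = [w - learning_rate * delta_out * h for w, h in zip(w2, hidden)]
    new_b2 = b2 - learning_rate * delta_out
    new_W1 = [[w - learning_rate * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hidden)]
    new_b1 = [b - learning_rate * d for b, d in zip(b1, delta_hidden)]
    return new_W1, new_b1, new_w2, new_b2

params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
x, y = [1.0, 0.0], 1

error_before = cross_entropy(y, forward(params, x)[1])
params = backprop_step(params, x, y)
error_after = cross_entropy(y, forward(params, x)[1])
print(error_before, error_after)
```

A full training loop would repeat `backprop_step` over all points for many epochs.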

In [1]:
def perceptron_combiner(w1, w1_weight, w2, w2_weight, b):
    return sigmoid_function(w1 * w1_weight + w2 * w2_weight + b)

In [3]:
print(perceptron_combiner(2, 0.4, 6, 0.6, -2))
print(perceptron_combiner(3, 0.4, 5, 0.6, -2.2))
print(perceptron_combiner(5, 0.4, 4, 0.6, -3))

0.9168273035060777
0.8807970779778823
0.8021838885585818


In [None]:
-1 * math.log(0.7 * 0.9 * 0.8 * 0.6)

In [4]:
binary_cross_entropy([1, 1, 0], [0.8, 0.7, 0.1])

0.6851790109107685

In [2]:
import math
from math import log

import numpy as np  # needed by softmax below

# minimize the cross entropy to improve model   
def binary_cross_entropy(labels, probabilities):
    return -1 * sum([(y*log(p)) + ((1-y)*log(1-p)) for p,y in zip(probabilities, labels)])

def softmax(L):
    total = sum([np.exp(x) for x in L])
    return [np.exp(x) / total for x in L]

def perceptron_trick(current_line, point, learning_rate=0.1, current_positive=True):
    if current_positive:
        new_line = [round(x - (learning_rate*y), 2) for x,y in zip(current_line, point + [1])]
    else:
        new_line = [round(x + (learning_rate*y), 2) for x,y in zip(current_line, point + [1])]
    return new_line

def sigmoid_function(t):
    return 1 / (1 + math.exp(-t))

def step_function(t):
    if t >= 0:
        return 1
    else:
        return 0

def perceptron_classify(equation, point, function='sigmoid'):
    computed = [x * y for x, y in zip(equation, point + [1])]
    if function == 'sigmoid':
        return sigmoid_function(sum(computed))
    else:
        return step_function(sum(computed))

In [21]:
preds_1 = []
equation_1 = [1, 1, 0]
points = [[1,1], [-1,-1]]

for point in points:
    pred = perceptron_classify(equation_1, point)
    if pred > 0.5:
        preds_1.append([1, pred])
    else:
        preds_1.append([0, pred])
preds_1

[[1, 0.8807970779778823], [0, 0.11920292202211755]]

In [23]:
preds_2 = []
equation_2 = [10, 10, 0]
points = [[1,1], [-1,-1]]

for point in points:
    pred = perceptron_classify(equation_2, point)
    if pred > 0.5:
        preds_2.append([1, pred])
    else:
        preds_2.append([0, pred])
preds_2

[[1, 0.9999999979388463], [0, 2.0611536181902037e-09]]

In [27]:
error_1 = binary_cross_entropy([x for x,y in preds_1], [y for x,y in preds_1])
error_2 = binary_cross_entropy([x for x,y in preds_2], [y for x,y in preds_2])

error_1 > error_2

True

In [19]:
perceptron_classify([10, 10, 0], [1, 1])

perceptron_classify([10, 10, 0], [-1, -1])

0.9999999979388463

In [7]:
line = [3, 4, -10]
point = [1, 1]
lr = 0.1
cp = False
counter = 0
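# Use the step function here: the default sigmoid output is not exactly 0
# for this line and point, so the comparison would skip the loop entirely
while perceptron_classify(line, point, function='step') == 0: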
    line = perceptron_trick(line, point, lr, cp)
    counter += 1

In [None]:
counter

In [None]:
41.5 + 38.25

In [8]:
def and_perceptron(x1, x2):
    weight = 1
    bias = -1.5
    weighted_inputs = [weight * x for x in [x1, x2]]
    if sum(weighted_inputs) + bias >= 0:
        return 1
    else:
        return 0        

In [9]:
and_perceptron(1, 0)

0

In [10]:
def or_perceptron(x1, x2):
    weight = 1
    bias = -0.5
    weighted_inputs = [weight * x for x in [x1, x2]]
    if sum(weighted_inputs) + bias >= 0:
        return 1
    else:
        return 0

In [11]:
or_perceptron(1, 0)

1

In [12]:
def not_perceptron(x1, x2):
    w1 = 0.0
    w2 = -1.0
    bias = 0.5
    weighted_inputs = [w * x for w, x in zip([w1, w2], [x1, x2])]
    if sum(weighted_inputs) + bias >= 0:
        return 1
    else:
        return 0

In [13]:
not_perceptron(0, 1)

0

In [14]:
def xor_perceptron(x1, x2):
    if or_perceptron(x1, x2) == 1 and and_perceptron(x1, x2) == 0:
        return 1
    else:
        return 0

In [15]:
xor_perceptron(0, 1)

1

## Keras

The `keras.models.Sequential` class is a wrapper for the neural network model that treats the network as a sequence of layers. It implements the Keras model interface with common methods like `compile()`, `fit()`, and `evaluate()` that are used to train and run the model. We'll cover these functions soon, but first let's start looking at the layers of the model.

Keras requires the input shape to be specified in the first layer, but it will automatically infer the shape of all other layers. This means you only have to explicitly set the input dimensions for the first layer.

The first (hidden) layer from above, `model.add(Dense(32, input_dim=X.shape[1]))`, creates 32 nodes which each expect to receive 2-element vectors as inputs. Each layer takes the outputs from the previous layer as inputs and pipes through to the next layer. This chain of passing output to the next layer continues until the last layer, which is the output of the model. We can see that the output has dimension 1.

The activation "layers" in Keras are equivalent to specifying an activation function in the Dense layers (e.g., `model.add(Dense(128))`; `model.add(Activation('softmax'))` is computationally equivalent to `model.add(Dense(128, activation="softmax"))`), but it is common to explicitly separate the activation layers because it allows direct access to the outputs of each layer before the activation is applied (which is useful in some model architectures).

Once we have our model built, we need to compile it before it can be run. Compiling the Keras model calls the backend (TensorFlow, Theano, etc.) and binds the optimizer, loss function, and other parameters required before the model can be run on any input data. We'll specify the loss function to be `categorical_crossentropy`, which expects one-hot encoded labels and can also be used when there are only two classes (the XOR example below instead uses `binary_crossentropy` with a single sigmoid output), and specify `adam` as the optimizer (which is a reasonable default when speed is a priority). Finally, we can specify which metrics we want to evaluate the model with. Here we'll use accuracy.

The model is trained with the `fit()` method, which takes the number of training epochs and the message level (how much information we want displayed on the screen during training).

In [16]:
import numpy as np
from keras.utils import np_utils
import tensorflow as tf
# Using TensorFlow 1.0.0; use tf.python_io in later versions
tf.python_io.control_flow_ops = tf

# Set random seed
np.random.seed(42)

# Our data
X = np.array([[0,0],[0,1],[1,0],[1,1]]).astype('float32')
y = np.array([[0],[1],[1],[0]]).astype('float32')

# Initial Setup for Keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# Building the model
xor = Sequential()

# Add required layers
xor.add(Dense(8, input_dim=2))
xor.add(Activation('tanh'))
xor.add(Dense(1))
xor.add(Activation('sigmoid'))

# Specify loss as "binary_crossentropy", optimizer as "adam",
# and add the accuracy metric
xor.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model architecture
xor.summary()

# Fitting the model
history = xor.fit(X, y, epochs=250, verbose=0)

# Scoring the model
score = xor.evaluate(X, y)
print("\nAccuracy: ", score[-1])

# Checking the predictions
print("\nPredictions:")
print(xor.predict_proba(X))

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 8)                 24        
_________________________________________________________________
activation_1 (Activation)    (None, 8)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 9         
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
Total params: 33
Trainable params: 33
Non-trainable params: 0
_________________________________________________________________

Accuracy:  1.0

Predictions:
[[ 0.42938578]
 [ 0.55750144]
 [ 0.53029436]
 [ 0.45875055]]


## Training Optimization

### Batch vs Stochastic Gradient Descent
Stochastic gradient descent is very easy to implement in Keras. All we need to do is specify the size of the batches in the training process, as follows:

```python
model.fit(X_train, y_train, epochs=1000, batch_size=100, verbose=0)
```
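
To make the idea of batches concrete, here is a minimal sketch of splitting a data set into mini-batches by hand. The helper `iterate_minibatches` and the toy data are hypothetical and are not part of Keras:

```python
def iterate_minibatches(points, labels, batch_size):
    """Yield consecutive (points, labels) slices of at most batch_size items."""
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size], labels[start:start + batch_size]

# Hypothetical data set of 10 points, split into batches of 4
points = list(range(10))
labels = [p % 2 for p in points]
for batch_points, batch_labels in iterate_minibatches(points, labels, batch_size=4):
    # A real training loop would take one gradient step per batch
    print(batch_points, batch_labels)
```

Each batch gives a cheaper, noisier estimate of the gradient than the full data set, so the model takes many small steps per epoch instead of one.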

### Learning Rate Decay

* In general, it is better to have a small learning rate (the model learns better, but more slowly)
* A decreasing learning rate is one that gets smaller as the model gets closer to the optimal solution
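
One simple way to approximate this is a schedule that shrinks the rate as training progresses. The sketch below assumes a time-based decay; the function name and the parameter values are illustrative:

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Time-based decay: the rate shrinks as the epoch count grows."""
    return initial_rate / (1 + decay * epoch)

for epoch in (0, 10, 100):
    print(epoch, decayed_learning_rate(0.1, decay=0.01, epoch=epoch))
```

Note that this schedule depends only on the epoch number, which stands in for "closeness to the solution"; it does not measure the error directly.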

### Testing in Keras

In order to test in Keras, all we need to do is split our data set into training and testing sets. Since we have 400 data points, using 50 for testing makes sense:
```python
(X_train, X_test) = X[50:], X[:50]
(y_train, y_test) = y[50:], y[:50]
```

### Early Stopping

Stop when testing error starts to increase to avoid over-fitting
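
A minimal sketch of the stopping rule, assuming we already have a list of per-epoch testing errors. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions for illustration:

```python
def early_stopping_epoch(test_errors, patience=2):
    """Return the index of the best epoch seen before stopping.

    Scanning stops once the error has failed to improve for
    `patience` epochs in a row.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical testing errors: falling, then rising as the model overfits
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(errors))
```

Waiting a couple of epochs before stopping guards against halting on a single noisy increase.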

### Regularization

"The whole problem with AI is that bad models are so certain of themselves, and good models so full of doubts."

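* Large coefficients tend to lead to overfitting
* Penalize large weights by adding a penalty term to the error function
* L1 regularization (penalty on the absolute values of the weights): good for feature selection
* L2 regularization (penalty on the squares of the weights): normally better for training models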
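
A short sketch of the two penalty terms that get added to the error function. The names `l1_penalty`, `l2_penalty`, and the strength parameter `lam` are assumptions for illustration:

```python
def l1_penalty(weights, lam):
    """Lambda times the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Lambda times the sum of the squared weights."""
    return lam * sum(w * w for w in weights)

small, large = [1.0, 1.0], [10.0, 10.0]
print(l1_penalty(small, 0.1), l1_penalty(large, 0.1))
print(l2_penalty(small, 0.1), l2_penalty(large, 0.1))
```

Both penalties grow with the size of the weights, but L2 punishes large weights much more heavily than small ones.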

### Dropout

* Sometimes one part of the network dominates training; to solve this, randomly turn off some nodes during each training pass so that all of the nodes get trained
* The dropout parameter is the probability that each node will be dropped
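
A minimal sketch of dropout applied to one layer's activations during training. It assumes a drop probability below 1 and uses the common "inverted dropout" scaling, which is a choice made for this sketch rather than something stated in these notes:

```python
import random

def dropout(activations, drop_probability, rng=random):
    """Zero each activation with the given probability (training only).

    Survivors are scaled by 1 / (1 - p) so that the expected total
    activation is unchanged. Assumes drop_probability < 1.
    """
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() >= drop_probability else 0.0
            for a in activations]

rng = random.Random(0)
print(dropout([0.5, 0.8, 0.1, 0.9], drop_probability=0.5, rng=rng))
```

At prediction time no nodes are dropped, so the function would simply not be applied.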

### Vanishing Gradient

* With the sigmoid function, the derivatives at points far to the right or left are almost 0, which means that the gradient descent steps will be very small
* With such tiny steps, gradient descent may never reach the optimal minimum (a decaying learning rate shrinks the steps even further)
* The best way to address this is to change the activation function
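
A quick numeric check of how fast the sigmoid derivative shrinks, using the identity $\sigma'(t) = \sigma(t)(1 - \sigma(t))$. The sample points are arbitrary:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def sigmoid_derivative(t):
    s = sigmoid(t)
    return s * (1 - s)

for t in (0, 5, 10):
    print(t, sigmoid_derivative(t))
```

The derivative is never larger than 0.25 (its value at 0), and backpropagation multiplies one such factor per layer, so the gradient shrinks further in deeper networks.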

### Other Activation Functions

* Hyperbolic tangent (tanh): similar to sigmoid, but its range is -1 to 1 and its derivatives are larger
* Rectified linear unit (ReLU): a very simple function, the maximum of x and 0
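
A small sketch of both functions and their derivatives. Taking the ReLU derivative to be 0 at exactly 0 is a convention chosen for this sketch:

```python
import math

def tanh_derivative(t):
    """Derivative of tanh: 1 - tanh(t)^2, which peaks at 1 when t = 0."""
    return 1 - math.tanh(t) ** 2

def relu(t):
    return max(t, 0.0)

def relu_derivative(t):
    # Taken as 0 at t == 0 by convention in this sketch
    return 1.0 if t > 0 else 0.0

print(tanh_derivative(0), relu(-3.0), relu(2.5), relu_derivative(5.0))
```

The tanh derivative peaks at 1 rather than the sigmoid's 0.25, and the ReLU derivative stays at 1 for every positive input, which is why both help with vanishing gradients.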

### Local Minima

#### Random restart

Use random restarts to start from a few different, random places, and do gradient descent from all of them, to increase the chances of arriving at the optimal minimum
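
A minimal sketch on a one-dimensional function with two minima. The function $f(x) = x^4 - 3x^2 + x$, the step size, and the helper names are all illustrative assumptions:

```python
import random

def f(x):
    return x ** 4 - 3 * x ** 2 + x

def gradient(x):
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=1000):
    """Plain gradient descent from a single starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

def best_of_restarts(starts):
    """Run gradient descent from every start and keep the lowest result."""
    return min((gradient_descent(s) for s in starts), key=f)

rng = random.Random(0)
starts = [rng.uniform(-2, 2) for _ in range(10)]
best = best_of_restarts(starts)
print(best, f(best))
```

A single run starting on the right-hand side settles in the shallower minimum; trying several starting points makes it likely that at least one run reaches the deeper minimum on the left.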

#### Momentum

Use momentum to get 'over the hump': each step is a weighted average of the last few steps, with more weight given to the most recent ones
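
A minimal sketch of the recency weighting, assuming a constant `beta` between 0 and 1 (a name chosen here for illustration): the latest step has weight 1, the one before it `beta`, then `beta**2`, and so on.

```python
def momentum_step(previous_steps, beta):
    """Weighted sum of past steps, newest first in weight order."""
    return sum(step * beta ** age
               for age, step in enumerate(reversed(previous_steps)))

# Hypothetical history: three steps forward, then a small step back
steps = [1.0, 1.0, 1.0, -0.2]
print(momentum_step(steps, beta=0.5))
```

Even though the latest step points backwards, the combined step still points forwards, which is what carries the model over a small hump.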

### Keras Optimizers

#### SGD

This is Stochastic Gradient Descent. It uses the following parameters:

* Learning rate.
* Momentum (This takes the weighted average of the previous steps, in order to get a bit of momentum and go over bumps, as a way to not get stuck in local minima).
* Nesterov Momentum (This slows down the gradient when it's close to the solution).

#### Adam

Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.

#### RMSProp

RMSProp (RMS stands for Root Mean Square) decreases the learning rate by dividing it by the square root of an exponentially decaying average of squared gradients.
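
A minimal sketch of one RMSProp update for a single weight. The function name, the default values, and the small `epsilon` added to avoid division by zero are assumptions of this sketch, not the Keras implementation:

```python
import math

def rmsprop_step(weight, gradient, avg_sq_grad, learning_rate=0.01,
                 decay=0.9, epsilon=1e-8):
    """Return the updated weight and the updated running average."""
    # Exponentially decaying average of squared gradients
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * gradient ** 2
    # Scale the step by the root of that average
    weight -= learning_rate * gradient / (math.sqrt(avg_sq_grad) + epsilon)
    return weight, avg_sq_grad

weight, avg_sq_grad = rmsprop_step(weight=1.0, gradient=2.0, avg_sq_grad=0.0)
print(weight, avg_sq_grad)
```

Weights with consistently large gradients get smaller effective learning rates, while weights with small gradients get larger ones.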