# Linear regression
- Gradient descent
  - Descend a mountain
  - Descend in the direction that will decrease the error the most
  - Until a minimum error is found
- Least squares
  - error = (prediction - actual)^2

# Linear to logistic regression

- Linear regression
  - Predict values on a continuous spectrum

- Logistic regression
  - Classify data among discrete classes

# Perceptron
- Chain logistic regression layers together
  - Input layer
  - Hidden layer
  - Output layer
  
  <img src="images/perceptron.jpg" width="50%" height="50%" />
  
- Perceptron
  - Articial neuron
  - Each one looks at input data and decides how to categorize that data.
  - Output is always 0 or 1.
- Weights
  - Input is multipled by a weight value
  - Start as random values
  - Neural network is trained by adjusting weights
  - Higher weight means that this input is more important than other inputs
  - Weighted input values are summed to a single value
  - Matrix of weights: $W$
  - Individual weight: $w$
- Activation function
  - Result of perceptron's summation is turned into output signal by activation function

  - Heaviside step function
    - `f(h) = h >= 0 ? 1 : 0`
    <img src="images/heaviside-step-function.png" width="30%" height="30%" />

- Bias: $b$, moves values in one direction or another

In [1]:
import pandas as pd

def test(weight1, weight2, bias, test_inputs, correct_outputs):    
    outputs = []

    # Generate and check output
    for test_input, correct_output in zip(test_inputs, correct_outputs):
        linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
        output = int(linear_combination >= 0)
        is_correct_string = 'Yes' if output == correct_output else 'No'
        outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

    # Print output
    num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
    output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
    if not num_wrong:
        print('Nice!  You got it all correct.\n')
    else:
        print('You got {} wrong.  Keep trying!\n'.format(num_wrong))

    print(output_frame.to_string(index=False))

# AND perceptron
<img src="images/and-perceptron.png" width="30%" height="30%" />

In [2]:
weight1 = 5
weight2 = 5
bias = -10

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]

test(weight1, weight2, bias, test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                   -10                    0          Yes
      0          1                    -5                    0          Yes
      1          0                    -5                    0          Yes
      1          1                     0                    1          Yes


# OR perceptron

In [3]:
weight1 = 5
weight2 = 5
bias = -5

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, True, True, True]

test(weight1, weight2, bias, test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                    -5                    0          Yes
      0          1                     0                    1          Yes
      1          0                     0                    1          Yes
      1          1                     5                    1          Yes


# NOT perceptron

In [4]:
weight1 = 0
weight2 = -1
bias = 0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

test(weight1, weight2, bias, test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                     0                    1          Yes
      0          1                    -1                    0          Yes
      1          0                     0                    1          Yes
      1          1                    -1                    0          Yes


# XOR perceptron
<img src="images/xor-perceptron.png" width="50%" height="50%" />

- A: `NOT`
- B: `AND`
- C: `OR`

# Neural network
<img src="images/simple-network.png" width="30%" height="30%" />

- Use activation functions that are continuous and differentiable, possible to train using gradient descent

- Logistic (sigmoid) activation function
  - $sigmoid(x) = 1 / (1 + e^{-x})$
  - Output $(0, 1)$
  - Can be interpreted as a probability for success.
  - Same formulation as logistic regression.
  - Turn perceptron into neural network.
  <img src="images/sigmoid.png" width="30%" height="30%" />

- Circles: units
- Boxes: operations

- $h = \sum w_i x_i + b$
- $y = f(h)$


In [5]:
import numpy as np

def sigmoid(x):
    # TODO: Implement sigmoid function
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# TODO: Calculate the output
output = sigmoid(np.dot(weights, inputs) + bias)

print('Output:')
print(output)

Output:
0.432907095035


# Gradient descent
- Sum of square errors (SSE)
  - Metric of how wrong the predictions are
  - $E = \frac{1}{2}\sum_\mu\sum_j[y^\mu_j - \hat{y}^\mu_j]^2$
    - $\hat{y}$ - prediction
    - $y$ - true value
    - $j$ - output units
    - $\mu$ - data points
    - First sum over $j$, then over $\mu$
  - $\hat{y}^\mu_j = f(\sum_i w_{ij} x^\mu_i)$
  - Find weights $w_{ij}$ that minimize the squared error $E$, using **gradient descent**.

- Mean of square errors (MSE)
  - If a lot of data is used, summing up all the weight steps can lead to really large updates that make the gradient descent diverge.
  - Error is divided by the number of records, $m$.
  - $E = \frac{1}{2m}\sum_\mu(y^\mu-\hat{y}^mu)^2$

- Gradient descent
  - Gradient: rate of change; slope; a derivative generalized to function with more than one variable.
  - Calculate the error and the gradient, and change each weight in the direction of the largest gradient.
  - $\Delta{w} = - gradient$
  - $w_i = w_i + \Delta{w_i}$
  - $\Delta{w_i} \propto -\frac{\partial{E}}{\partial{w_i}}$ -- gradient
  - $\Delta{w_i} = - \eta \frac{\partial{E}}{\partial{w_i}}$ -- add an arbitrary scaling parameter, learning rate $\eta$
    <img src="images/gradient-descent.png" width="30%" height="30%"/>

  - $\frac{\partial{E}}{\partial{w_i}} = -(y - \hat{y})f^\prime(h)x_i$
  - $\Delta{w_i} = \eta(y - \hat{y})f^\prime(h)x_i$
  - $= learning\_rate \times error \times activate\_derivative \times input$
  - Error term $\delta = (y - \hat{y})f^\prime(h)$
  - $w_i = w_i + \eta \delta x_i$
    <img src="images/network-calculation.jpg" width="50%" height="50%"/>

- Caveats: when weights are initialized with wrong values, gradient descent could lead weights into local minimum, but not global minimum.
  - Solution: [momentum](http://ruder.io/optimizing-gradient-descent/index.html#momentum) helps accelerate gradient descent in the relevant direction and dampens oscillations.

In [6]:
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([1, 2])
# Target
y = np.array(0.5)
# Input to output weights
weights = np.array([0.5, -0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

h = np.dot(x, weights)
y_hat = sigmoid(h)
error = y - y_hat
error_term = error * sigmoid_prime(h)

del_w = learnrate * error_term * x

print('Neural Network output:')
print(y_hat)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.377540668798
Amount of Error:
0.122459331202
Change in Weights:
[ 0.0143892  0.0287784]


# Implementation
- Set weight step to zero $$\Delta{w}_i = 0$$
- For each record in training data:
  - Make a forward pass through the network and calculate output unit $$h = \sum_i w_i x_i$$
  - Apply activation function and get the output $$\hat{y} = f(h)$$
  - Calculate the error gradient in the output $$\delta = (y - \hat{y}) \times f^\prime(h)$$
  - Update weight step $$\Delta{w_i} = \Delta{w_i} + \delta x_i$$
  - Final $\Delta w_i$ is the summed weight step across all inputs
- Update weights $$w_i = w_i + \eta \Delta w_i / m$$
- Repeat for $e$ epoches

In [7]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Use to same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target
        h = np.dot(x, weights)
        output = sigmoid(h)
        output_prime = output / (1.0 - output)
        error = (y - output) * output_prime

        del_w += error * x

    # Update weights
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))


Train loss:  0.256060280051
Train loss:  0.199324509463
Train loss:  0.197636657787
Train loss:  0.197270122133
Train loss:  0.197152222857
Train loss:  0.19710738054
Train loss:  0.197089215266
Train loss:  0.197081981947
Train loss:  0.197079469107
Train loss:  0.197079000687
Prediction accuracy: 0.725


# Multilayer perceptrons
