# Linear regression
- Gradient descent
  - Descend a mountain
  - Descend in the direction that will decrease the error the most
  - Until a minimum error is found
- Least squares
  - error = (prediction - actual)^2

# Linear to logistic regression

- Linear regression
  - Predict values on a continuous spectrum

- Logistic regression
  - Classify data among discrete classes

# Perceptron
- Chain logistic regression layers together
  - Input layer
  - Hidden layer - linear function
  - Output layer - step function that outputs 0 or 1
  
  <img src="images/perceptron.jpg" width="50%" height="50%" />
  
- Perceptron
  - Articial neuron
  - Each one looks at input data and decides how to categorize that data.
  - Output is always 0 or 1.
- Weights
  - Input is multipled by a weight value
  - Start as random values
  - Neural network is trained by adjusting weights
  - Higher weight means that this input is more important than other inputs
  - Weighted input values are summed to a single value
  - Matrix of weights: $W$
  - Individual weight: $w$
- Activation function
  - Result of perceptron's summation is turned into output signal by activation function

  - Heaviside step function
    - `f(h) = h >= 0 ? 1 : 0`
    <img src="images/heaviside-step-function.png" width="30%" height="30%" />

- Bias: $b$, moves values in one direction or another

In [1]:
import pandas as pd

def test(weight1, weight2, bias, test_inputs, correct_outputs):    
    outputs = []

    # Generate and check output
    for test_input, correct_output in zip(test_inputs, correct_outputs):
        linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
        output = int(linear_combination >= 0)
        is_correct_string = 'Yes' if output == correct_output else 'No'
        outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

    # Print output
    num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
    output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
    if not num_wrong:
        print('Nice!  You got it all correct.\n')
    else:
        print('You got {} wrong.  Keep trying!\n'.format(num_wrong))

    print(output_frame.to_string(index=False))

## AND perceptron
<img src="images/and-perceptron.png" width="30%" height="30%" />

In [2]:
weight1 = 5
weight2 = 5
bias = -10

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]

test(weight1, weight2, bias, test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                   -10                    0          Yes
      0          1                    -5                    0          Yes
      1          0                    -5                    0          Yes
      1          1                     0                    1          Yes


## OR perceptron

In [3]:
weight1 = 5
weight2 = 5
bias = -5

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, True, True, True]

test(weight1, weight2, bias, test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                    -5                    0          Yes
      0          1                     0                    1          Yes
      1          0                     0                    1          Yes
      1          1                     5                    1          Yes


## NOT perceptron

In [4]:
weight1 = 0
weight2 = -1
bias = 0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

test(weight1, weight2, bias, test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                     0                    1          Yes
      0          1                    -1                    0          Yes
      1          0                     0                    1          Yes
      1          1                    -1                    0          Yes


## XOR perceptron
<img src="images/xor-perceptron.png" width="50%" height="50%" />

- A: `NOT`
- B: `AND`
- C: `OR`

# Perceptron Algorithm
For a point with coordinates (p,q) , label y, and prediction given by the equation:

$\hat{y} = step(w_1 x_1 + w_2 x_2 + b)$

- If the point is correctly classified, do nothing.
- If the point is classified positive, but it has a negative label:
  - $ w_1 = w_1 - \alpha p $
  - $ w_2 = w_2 - \alpha q $
  - $ b = b - \alpha $
- If the point is classified negative, but it has a positive label:
  - $ w_1 = w_1 + \alpha p $
  - $ w_2 = w_2 + \alpha q $
  - $ b = b + \alpha $

# Neural network
<img src="images/simple-network.png" width="30%" height="30%" />

- Use activation functions that are continuous and differentiable, possible to train using gradient descent

- Logistic (sigmoid) activation function
  - $sigmoid(x) = 1 / (1 + e^{-x})$
  - Output $(0, 1)$
  - Can be interpreted as a probability for success.
  - Same formulation as logistic regression.
  - Turn perceptron into neural network.
  <img src="images/sigmoid.png" width="30%" height="30%" />

- Circles: units
- Boxes: operations

- $h = \sum w_i x_i + b$
- $y = f(h)$


In [5]:
import numpy as np

def sigmoid(x):
    # TODO: Implement sigmoid function
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# TODO: Calculate the output
output = sigmoid(np.dot(weights, inputs) + bias)

print('Output:')
print(output)

Output:
0.432907095035


# Softmax

- Sigmoid activation function for more than two classes
- $P(x_k) = \frac{e^{x_k}}{\sum_i {e^{x_i}}}$

In [6]:
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    total = np.sum(np.exp(L))
    return list(map(lambda i: np.exp(i) / total, L))

print(softmax([1, 2, 3, 4]))

[0.032058603280084988, 0.087144318742032573, 0.23688281808991013, 0.64391425988797224]


# One-Hot Encoding

- A process by which categorical variables are converted into a group of bits. For each category, only one bit within this group is 1.


# Maximum Likelihood Method

- Pick a model that gives existing labels the highest probability.
  - Calculate the probability of each data point carrying the label according to the model;
  - Multiple these probabilities to obtain the probability of the whole arrangement;
  - Check which model is better.
- Product is hard to use
  - Multiplication of small values leads to tiny output
  - Change in one value affects the output a lot
  - Soluction: Turn product into sum

## Cross-Entropy
- Sum of negative logarithms of probabilities
- The lower the better
- Likely events with higher probabilities will have smaller cross entropy
- $CrossEntropy = - \sum_i{y_i \ln{(p_i)} + (1 - y_i) ln{(1 - p_i)}}$
- Tells whether two vectors are similar to each other
  - $CE[(1, 1, 0), (0.8, 0.7, 0.1)] = 0.69$
  - $CE[(0, 0, 1), (0.8, 0.7, 0.1)] = 5.12$
  - $(1, 1, 0)$ is more similar to $(0.8, 0.7, 0.1)$ than $(0, 0, 1)$

## Multi-Class Cross-Entropy
$$CrossEntroy = - \sum_{i=1}^{n}{\sum_{j=1}^{m}{y_{ij} ln{(p_{ij})}}}$$


# Gradient descent
- Sum of square errors (SSE)
  - Metric of how wrong the predictions are
  - $E = \frac{1}{2}\sum_\mu\sum_j[y^\mu_j - \hat{y}^\mu_j]^2$
    - $\hat{y}$ - prediction
    - $y$ - true value
    - $j$ - output units
    - $\mu$ - data points
    - First sum over $j$, then over $\mu$
  - $\hat{y}^\mu_j = f(\sum_i w_{ij} x^\mu_i)$
  - Find weights $w_{ij}$ that minimize the squared error $E$, using **gradient descent**.

- Mean of square errors (MSE)
  - If a lot of data is used, summing up all the weight steps can lead to really large updates that make the gradient descent diverge.
  - Error is divided by the number of records, $m$.
  - $E = \frac{1}{2m}\sum_\mu(y^\mu-\hat{y}^mu)^2$

- Gradient descent
  - Error function requirements
    - Continuous
    - Differentiable
  - Gradient: rate of change; slope; a derivative generalized to function with more than one variable.
  - Calculate the error and the gradient, and change each weight in the direction of the largest gradient.
  - $\Delta{w} = - gradient$
  - $w_i = w_i - gradient = w_i + \Delta{w_i}$
  - $\Delta{w_i} \propto -\frac{\partial{E}}{\partial{w_i}}$ -- gradient
  - $\Delta{w_i} = - \eta \frac{\partial{E}}{\partial{w_i}}$ -- add an arbitrary scaling parameter, learning rate $\eta$
    <img src="images/gradient-descent.png" width="30%" height="30%"/>

  - $\frac{\partial{E}}{\partial{w_i}} = -(y - \hat{y})f^\prime(h)x_i$
  - $\Delta{w_i} = \eta(y - \hat{y})f^\prime(h)x_i$
  - $= learning\_rate \times error \times activate\_derivative \times input$
  - Error term $\delta = (y - \hat{y})f^\prime(h)$
  - $w_i = w_i + \eta \delta x_i$
    <img src="images/network-calculation.jpg" width="50%" height="50%"/>

- Caveats: when weights are initialized with wrong values, gradient descent could lead weights into local minimum, but not global minimum.
  - Solution: [momentum](http://ruder.io/optimizing-gradient-descent/index.html#momentum) helps accelerate gradient descent in the relevant direction and dampens oscillations.

In [7]:
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([1, 2])
# Target
y = np.array(0.5)
# Input to output weights
weights = np.array([0.5, -0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

h = np.dot(x, weights)
y_hat = sigmoid(h)
error = y - y_hat
error_term = error * sigmoid_prime(h)

del_w = learnrate * error_term * x

print('Neural Network output:')
print(y_hat)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.377540668798
Amount of Error:
0.122459331202
Change in Weights:
[ 0.0143892  0.0287784]


# Implementation
- Set weight step to zero $$\Delta{w}_i = 0$$
- For each record in training data:
  - Make a forward pass through the network and calculate output unit $$h = \sum_i w_i x_i$$
  - Apply activation function and get the output $$\hat{y} = f(h)$$
  - Calculate the error gradient in the output $$\delta = (y - \hat{y}) \times f^\prime(h)$$
  - Update weight step $$\Delta{w_i} = \Delta{w_i} + \delta x_i$$
  - Final $\Delta w_i$ is the summed weight step across all inputs
- Update weights $$w_i = w_i + \eta \Delta w_i / m$$
- Repeat for $e$ epoches

In [8]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Use to same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target
        h = np.dot(x, weights)
        output = sigmoid(h)
        output_prime = output / (1.0 - output)
        error = (y - output) * output_prime

        del_w += error * x

    # Update weights
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))


Train loss:  0.256060280051
Train loss:  0.199324509463
Train loss:  0.197636657787
Train loss:  0.197270122133
Train loss:  0.197152222857
Train loss:  0.19710738054
Train loss:  0.197089215266
Train loss:  0.197081981947
Train loss:  0.197079469107
Train loss:  0.197079000687
Prediction accuracy: 0.725


# Perceptron vs Gradient Descent

- Gradient descent
  - Change $w_i$ to $w_i + \alpha (y - \hat{y})x_i$
- Perceptron algorithm
  - If correctly classified: $y - \hat{y} = 0$
  - If missclassified
    - $y - \hat{y} = 1$ **if positive**
    - $y - \hat{y} = -1$ if negative
    - Change $w_i$ to
      - $w_i + \alpha x_i$ **if positive**
      - $w_i - \alpha x_i$ if netagive

<img src="images/perceptron-vs-gradient-descent.png" width="60%" height="60%"/>

- When correctly classified, perceptron does nothing, but in gradient descent, the line will go farther away from the data point.


# Neural Network Architecture

## Combination of neural networks

<img src="images/combination-of-neural-networks-1.png" width="60%" height="60%"/>

<img src="images/combination-of-neural-networks-2.png" width="60%" height="60%"/>

## Multiple layers
- Add more nodes to the input, hidden and output layers
- Add more layers

<img src="images/multilayers.png" width="60%" height="60%"/>

## Multi-class classification
- Add more nodes to the output layer


# Train Neural Networks

## Feedforward
  - The process that neural networks use to turn the input into an output.

In [9]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
# 1x4
X = np.random.randn(4)

print('Input:')
print(X)

# 4x3
weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
# 3x2
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))

print('Input to hidden weights:')
print(weights_input_to_hidden)
print('Hidden to output weights:')
print(weights_hidden_to_output)

# TODO: Make a forward pass through the network

# 1x4 - 4x3 -> 1x3
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

# 1x3 - 3x2 -> 1x2
output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

Input:
[ 0.49671415 -0.1382643   0.64768854  1.52302986]
Input to hidden weights:
[[-0.02341534 -0.0234137   0.15792128]
 [ 0.07674347 -0.04694744  0.054256  ]
 [-0.04634177 -0.04657298  0.02419623]
 [-0.19132802 -0.17249178 -0.05622875]]
Hidden to output weights:
[[-0.10128311  0.03142473]
 [-0.09080241 -0.14123037]
 [ 0.14656488 -0.02257763]]
Hidden-layer Output:
[ 0.41492192  0.42604313  0.5002434 ]
Output-layer Output:
[ 0.49815196  0.48539772]


## Backpropagation

- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (**backpropagation**) to spread the error to each of the weights.
- Use this to update the weights, and get a better model.
- Continue this until we have a model that is good.

<img src="images/backpropagation-1.png" width="60%" height="60%"/>

- When a linear model correctly classifies a data point
  - Increase the weight of that linear model
  - Move the line closer to the data point
- When a linear model incorrectly classifies a data point
  - Decrease the weight of that linear model
  - Move the line farther to the data point

<img src="images/backpropagation-2.png" width="60%" height="60%"/>

- Feedforward: composing a bunch of functions.
- Backpropagation: taking the derivative of a composition, which is multiplying a bund of derivatives (chain rule)

<img src="images/feedforward-calculation.png" width="60%" height="60%"/>
<img src="images/backpropagation-calculation.png" width="60%" height="60%"/>

## Implementation

- Set weights for each layer to zero
  - Input to hidden weights $\Delta{w}_{ij} = 0$
  - Hidden to output weights $\Delta{W_{j} = 0}$
- For each record in training data:
  - Make a forward pass through the network and calculate output $\hat{y}$
  - Calculate the error gradient in the output
    $$\delta^0 = (y - \hat{y}) \times f^\prime(z)$$
    $$z = \sum_j W_j a_j$$
  - Propagate the errors to hidden layer
    $$\delta_j^h = \delta^0 W_j f^\prime(h_j)$$
  
  - Update weight steps
    $$\Delta{W_j} = \Delta{W_j} + \delta^0 a_j$$
    $$\Delta{w_{ij}} = \Delta{w_{ij}} + \delta_j^h a_i$$
- Update weights
  $$W_j = W_j + \eta \Delta W_j / m$$
  $$w_{ij} = w_{ij} + \eta \Delta w_{ij} / m$$
- Repeat for $e$ epoches


In [10]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(42)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)
        output_input = np.dot(hidden_output, weights_hidden_output)
        output_output = sigmoid(output_input)

        ## Backward pass ##
        # TODO: Calculate the error
        error = y - output_output

        # TODO: Calculate error gradient in output unit
        output_error = error * output_output * (1 - output_output)

        # TODO: propagate errors to hidden layer
        hidden_error = np.dot(output_error, weights_hidden_output) * hidden_output * (1 - hidden_output)

        # TODO: Update the change in weights
        del_w_hidden_output += output_error * hidden_output
        del_w_input_hidden += hidden_error * x[:, None]

    # TODO: Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(x, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))


Train loss:  0.229558087439
Train loss:  0.229405884821
Train loss:  0.229256817163
Train loss:  0.229110801711
Train loss:  0.22896775824
Train loss:  0.228827608971
Train loss:  0.228690278489
Train loss:  0.228555693665
Train loss:  0.228423783578
Train loss:  0.228294479446
Prediction accuracy: 0.750


# Readings
- [Yes you should understand backprop](https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b)
  - Backpropagation is a leaky abstraction, and easily to fall into traps.
    - Vanishing gradients on sigmoids
    - Dying ReLUs
    - Exploding gradients in RNNs