### Neural Networks

Loosely modeled on nerve cells (neurons) in the human brain: a unit fires when certain conditions are present<br>
Artificial Neural Networks (ANN): a cartoonish version of biological neural networks

Perceptron:
inputs multiplied by weights, then summed, to make the activation value
if activation >= firing threshold, output = 1
else output = 0
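
A minimal sketch of a single perceptron unit as described above. The function name `perceptron_output` and the example inputs, weights, and thresholds are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    activation = np.dot(x, w)  # inputs multiplied by weights, then summed
    return 1 if activation >= theta else 0

# Hypothetical example values
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, 0.5, 0.25])

fired = perceptron_output(x, w, theta=0.7)   # activation 0.75 reaches 0.7
silent = perceptron_output(x, w, theta=0.8)  # activation 0.75 is below 0.8
```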

### How Powerful Is a Perceptron Unit

A perceptron always computes a half-plane: output 1 on one side of a line (a hyperplane in higher dimensions), output 0 on the other

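A single perceptron can perform the boolean operations AND, OR, and NOT. XOR is not linearly separable, so it needs a small network of perceptrons rather than a single unit.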
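
A sketch of boolean operations built from threshold units, assuming inputs are 0 or 1. The weights and thresholds below are one possible choice picked for illustration, and the helper name `unit` is hypothetical. XOR is not linearly separable, so here it is built from two units: a hidden AND unit feeding an output unit.

```python
import numpy as np

def unit(x, w, theta):
    """Threshold unit: 1 if the weighted sum reaches theta, else 0."""
    return int(np.dot(x, w) >= theta)

def AND(a, b):
    return unit([a, b], [1, 1], 2)   # fires only when both inputs are 1

def OR(a, b):
    return unit([a, b], [1, 1], 1)   # fires when at least one input is 1

def NOT(a):
    return unit([a], [-1], 0)        # fires only when the input is 0

def XOR(a, b):
    h = AND(a, b)                          # hidden unit
    return unit([a, b, h], [1, 1, -2], 1)  # OR of the inputs, cancelled when both are 1
```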

### Perceptron Training

Given examples, find weights that map inputs to output

* perceptron rule (threshold)
* gradient descent/delta rule (unthresholded)

#### Perceptron rule

Set the weights so that they capture the training data
Learn the weights with an update rule, and treat theta as one more weight (subtract theta from both sides, so the comparison is against zero)
Add a 'bias' input fixed at 1, with weight -theta
Iterate over the training data, grab X (features) and y (target), and change the weight for each feature
Weight change: compare the target to the output, where the output is 1 if the weighted sum of the features is greater than or equal to zero, else 0
If the output is correct (difference is zero), no change
If the output is wrong, decrease or increase the weights, scaled by the learning rate, until the error is zero

y = target<br>
$\hat{y}$ = output<br>
n = learning rate<br>
x = input<br>

$w_i = w_i + \Delta w_i$ # weight equals weight plus the change in the weight (just means we're changing the weight as we iterate)<br>
$\Delta w_i = n(y - \hat{y})x_i$ # change in weight is equal to the learning rate times (target - output) times input<br>
$\hat{y} = (\Sigma_i w_i x_i \geq 0)$ # output = 1 if the sum of the weights times the inputs is greater than or equal to 0, else 0<br>

If the dataset is linearly separable, the perceptron rule finds a separating set of weights in a finite number of iterations<br>
(put a condition in the loop to stop when there's no more error)<br>
If it's not linearly separable...
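
A minimal sketch of the perceptron rule, assuming a bias input of 1 appended to every example so the threshold is learned as a weight. The function name `train_perceptron`, the zero initial weights, the learning rate of 1.0, and the `max_epochs` cap are illustrative assumptions. The cap keeps the loop from running forever on data that is not linearly separable.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    """Perceptron rule: w_i = w_i + n * (y - y_hat) * x_i, with a bias input of 1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append the bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, target in zip(Xb, y):
            output = int(np.dot(w, x_i) >= 0)              # thresholded output
            w += learning_rate * (target - output) * x_i   # no change when correct
            errors += int(target != output)
        if errors == 0:  # stop once a full pass makes no mistakes
            break
    return w

# Example: learn AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w = train_perceptron(X, y_and)
predictions = [int(np.dot(w, np.append(x, 1.0)) >= 0) for x in X]
```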

### Gradient Descent

More robust to non-linearly separable data
Imagine the output is not thresholded
Define error metric on weight vector, use calculus to push squared error downward toward zero

$\Delta$w = n(y - a)x$_i$, where a = activation (same as perceptron rule but w/out threshold)
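
The update above can be derived as a sketch, assuming the error is the sum of squared differences between target and activation over the training examples (the factor of 1/2 is a convenience assumption that cancels when differentiating):

$$E(w) = \frac{1}{2}\sum_{(x,\,y)} (y - a)^2, \qquad a = \sum_i w_i x_i$$

$$\frac{\partial E}{\partial w_i} = \sum_{(x,\,y)} (y - a)\,\frac{\partial}{\partial w_i}\Big(-\sum_j w_j x_j\Big) = -\sum_{(x,\,y)} (y - a)\,x_i$$

$$\Delta w_i = -n\,\frac{\partial E}{\partial w_i} = n\sum_{(x,\,y)} (y - a)\,x_i$$

Applying the update one example at a time drops the sum and gives the rule as written.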

Can't do gradient descent on $\hat{y}$ directly because the threshold makes it non-differentiable (it jumps from 0 to 1)
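
A minimal sketch of the delta rule on unthresholded output, assuming a bias input of 1 and per-example updates. The function name `train_delta_rule`, the learning rate, the epoch count, and the toy data (points on the line y = 2x + 1) are illustrative assumptions.

```python
import numpy as np

def train_delta_rule(X, y, learning_rate=0.1, epochs=500):
    """Delta rule: w_i = w_i + n * (y - a) * x_i, where a is the raw activation."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append the bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, target in zip(Xb, y):
            a = np.dot(w, x_i)                         # activation, no threshold
            w += learning_rate * (target - a) * x_i    # step that reduces squared error
    return w

# Toy data: targets lie exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0
w = train_delta_rule(X, y)  # w[0] approaches the slope, w[1] the intercept
```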

### Sigmoid

Makes the perceptron's threshold differentiable (analogous to the perceptron, not equal to it)

(S-like function)

As activation gets smaller, function goes toward 0
As activation gets bigger, function goes toward 1
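
A short sketch of the sigmoid, assuming the standard logistic form $\sigma(a) = \frac{1}{1 + e^{-a}}$, together with its derivative $\sigma(a)(1 - \sigma(a))$. The function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: tends to 0 for very negative a and to 1 for very positive a."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_derivative(a):
    """Derivative of the sigmoid, expressed in terms of the sigmoid itself."""
    s = sigmoid(a)
    return s * (1.0 - s)

low = sigmoid(-10.0)   # close to 0
mid = sigmoid(0.0)     # exactly halfway between 0 and 1
high = sigmoid(10.0)   # close to 1
```

The derivative being a simple function of the output is what makes the sigmoid convenient for gradient descent.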

### Neural Network Sketch

hidden layer of sigmoid units (or other differentiable functions), whole thing is differentiable
can calculate how to move the weights in order to get closer to the output you want
back propagation = computationally beneficial organization of the chain rule
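
A sketch of back propagation for a tiny network with one hidden layer of sigmoid units and a sigmoid output, assuming squared error on a single training example. The layer sizes, random seed, learning rate, and function names are illustrative assumptions, and bias inputs are left out to keep it short.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    h = sigmoid(W1 @ x)       # hidden layer activations
    y_hat = sigmoid(W2 @ h)   # network output
    return h, y_hat

def backprop(x, y, W1, W2):
    """Gradients of E = 0.5 * (y - y_hat)^2 with respect to W1 and W2 (chain rule)."""
    h, y_hat = forward(x, W1, W2)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)      # error signal at the output
    grad_W2 = np.outer(delta_out, h)
    delta_hidden = (W2.T @ delta_out) * h * (1 - h)    # error signal passed back to the hidden layer
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_W2

def loss(x, y, W1, W2):
    return 0.5 * np.sum((y - forward(x, W1, W2)[1]) ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))   # 2 inputs -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(1, 3))   # 3 hidden units -> 1 output
x = np.array([1.0, 0.0])
y = np.array([1.0])

loss_before = loss(x, y, W1, W2)
for _ in range(100):                      # plain gradient descent steps
    g1, g2 = backprop(x, y, W1, W2)
    W1 -= 0.5 * g1
    W2 -= 0.5 * g2
loss_after = loss(x, y, W1, W2)
```

The output error signal `delta_out` is computed once and reused for the hidden layer, which is the "computationally beneficial organization" of the chain rule.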

many local optima--can get 'stuck' in spot where you can't change any individual weight without making the error worse
plots where there are several 'local' valleys, where it goes down to a local low point but then goes back up, not the global optimum

### Optimizing Weights

gradient descent isn't the only method
advanced methods:
* momentum (momentum terms in the gradient update, to help keep from getting stuck in a local minimum)
* higher-order derivatives (look at combinations of weights instead of individual weights)
* randomized optimization
* penalty for "complexity" (guards against overfitting; complexity grows with more nodes, more layers, and large weight values)
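
A sketch of two of the ideas above combined in one update step: a momentum term and a penalty on large weights. It assumes the classical momentum form and an L2 penalty of `0.5 * weight_decay * ||w||^2`. The function name `momentum_step`, its parameters, and the toy error function are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9, weight_decay=0.0):
    """One gradient step with momentum and an optional penalty on large weights."""
    total_grad = grad + weight_decay * w                        # penalty adds weight_decay * w to the gradient
    velocity = momentum * velocity - learning_rate * total_grad  # keep part of the previous step
    return w + velocity, velocity

# Toy problem: minimize E(w) = 0.5 * w^2, whose gradient is w
w = np.array([4.0])
v = np.zeros(1)
for _ in range(300):
    w, v = momentum_step(w, v, grad=w)
```

On a single-valley error surface like this toy one, momentum only changes how fast the minimum is reached. Its benefit for rolling through shallow local minima shows up on bumpier surfaces.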

### Restriction Bias

* Representational power
* set of hypotheses we will consider

perceptron: half spaces
sigmoids: much more complex, not much restriction

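Boolean functions: representable by a network of threshold-like units<br>
Continuous functions ("connected", no jumps): one hidden layer, given enough hidden units<br>
Arbitrary functions: stitch pieces together with two hidden layers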

can represent pretty much anything with a sufficiently complex network
danger of overfitting!
usually bounded number of hidden units and layers, to reduce complexity (and more restriction bias)
can use cross validation to check overfitting

### Preference Bias

* algorithm's selection of one representation over another
What algorithm?
How to initialize weights?
typically use small random values
* adds variability, helpful to avoid local minima
* low complexity
prefer correct over incorrect
prefer simpler answers to more complex ones
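
A short sketch of initializing weights with small random values. The function name `init_weights`, the uniform distribution, and the default `scale` of 0.01 are illustrative assumptions; the notes only say "small random values".

```python
import numpy as np

def init_weights(n_inputs, n_units, scale=0.01, seed=None):
    """Draw one weight per (unit, input) pair uniformly from [-scale, scale)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-scale, scale, size=(n_units, n_inputs))

W = init_weights(n_inputs=3, n_units=4, seed=0)  # different seeds give different starting points
```

Small values keep the starting network simple, and the randomness means repeated runs start from different points on the error surface.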

Occam's razor: entities should not be multiplied unnecessarily

multiply = make more complex
unnecessary = not reducing error

```python
import numpy as np

# input layer
input_layer = np.array([1, 2, 3])

# weights into the two units of one hidden layer (one weight vector per unit)
hidden_layer_1 = np.array([1, 1, -5])
hidden_layer_2 = np.array([3, -4, 2])

# activation of each hidden unit (no activation function, so the units are linear)
y1 = np.dot(input_layer, hidden_layer_1)
y2 = np.dot(input_layer, hidden_layer_2)
y = np.array([y1, y2])

# then multiply the hidden activations by the output layer weights
z = np.dot(y, [2, -1])

z  # -25
```

```python
x = [1, 1]

# one weight vector per hidden unit
w1 = [3, 2]
w2 = [-1, 4]
w3 = [3, -5]

# activation of each hidden unit
y1 = np.dot(x, w1)  # 5
y2 = np.dot(x, w2)  # 3
y3 = np.dot(x, w3)  # -2

# output: weighted sum of the hidden activations
z = np.dot([y1, y2, y3], [1, 2, -1])

z  # 13
```