## Gradient Descent: The Code

- The sum of the squared errors (SSE)
    - $E = \frac{1}{2} \sum\limits_{\mu} (y^\mu - \hat{y}^\mu)^2$, where $\mu$ indexes the data records
    - $\hat{y}^\mu = f(\sum\limits_{i} w_i x_i^\mu)$
    - $E = \frac{1}{2} \sum\limits_{\mu} (y^\mu - f(\sum\limits_{i} w_i x_i^\mu))^2$
    - the error depends on the weights $w_i$ and the input values $x_i$
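
As a rough illustration of the SSE formula, the sketch below computes $E$ for a small batch. The data values, the starting weights, and the choice of a sigmoid for $f$ are assumptions made for this example only.

```python
import numpy as np

def sigmoid(h):
    """Sigmoid activation, an assumed choice for f."""
    return 1 / (1 + np.exp(-h))

def sse(X, y, w):
    """Sum of squared errors: E = 1/2 * sum over records of (y - y_hat)^2."""
    y_hat = sigmoid(X @ w)  # one prediction per record (row of X)
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hypothetical data: 3 records with 2 input features each
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.0, 0.0])  # all-zero weights give y_hat = 0.5 for every record

print(sse(X, y, w))  # 0.375
```

With all-zero weights every prediction is 0.5, so each squared error is 0.25 and $E = \frac{1}{2} \cdot 0.75 = 0.375$.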

- $\Delta w_i = \eta \delta x_i$
- where $\delta$ is the error term
- $w_i = w_i + \Delta w_i$ updates each weight to reduce the error
- $\Delta w_i \propto -\frac{\partial E}{\partial w_i}$ -> step against the gradient
- $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$ where $\eta$ is the learning rate

- Error Term: $\delta = (y - \hat{y}) f'(h)$
- $w_i = w_i + \eta \delta x_i$
- $\hat{y} = f(h)$ where $h = \sum\limits_{i} w_i x_i$
- Error Term (expanded): $\delta = (y - \hat{y}) f'(\sum\limits_{i} w_i x_i)$
    - $(y - \hat{y})$ is the output error
    - $f'(h)$ - derivative of the activation function -> output gradient
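
The link between the gradient and the update rule $\Delta w_i = \eta \delta x_i$ can be made explicit with the chain rule. A sketch for a single record, using $E = \frac{1}{2}(y - \hat{y})^2$, $\hat{y} = f(h)$ and $h = \sum\limits_{i} w_i x_i$:

$$
\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2}(y - \hat{y})^2 \\
&= -(y - \hat{y}) \frac{\partial \hat{y}}{\partial w_i} \\
&= -(y - \hat{y}) f'(h) \frac{\partial h}{\partial w_i} \\
&= -(y - \hat{y}) f'(h) x_i \\
&= -\delta x_i
\end{aligned}
$$

Stepping against the gradient with learning rate $\eta$ then gives $\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta \delta x_i$.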

In [None]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
# input data
x = np.array([1, 2, 3, 4])
# target
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
###       fewer variable names than in the above sample code

# Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# Calculate output of neural network
nn_output = sigmoid(h)  # y_hat

# Calculate error of neural network
error = y - nn_output

# Calculate the error term
#       Remember, this requires the output gradient, which we haven't
#       specifically added a variable for.
error_term = error * sigmoid_prime(h)

# Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
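
One way to check that the step points in the right direction is to apply the update and recompute the error. The sketch below repeats the same values as the cell above so that it runs on its own; the names `w_new` and `new_error` are introduced here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)
w = np.array([0.5, -0.5, 0.3, 0.1])

# One gradient descent step, same computation as the cell above
h = np.dot(x, w)
error = y - sigmoid(h)
del_w = learnrate * error * sigmoid_prime(h) * x

# Apply the update and recompute the error with the new weights
w_new = w + del_w
new_error = y - sigmoid(np.dot(x, w_new))

print('Error before update:', error)
print('Error after update:', new_error)
```

With these values the error shrinks in magnitude, from roughly -0.19 to roughly -0.05, after a single step.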

### Mean Square Error
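- $E = \frac{1}{2m}\sum\limits_{\mu}(y^\mu - \hat{y}^\mu)^2$, where $m$ is the number of data records
    - SSE sums over all records, so with a lot of data the gradient, and therefore each weight update, can become very large and gradient descent might diverge
    - compensating would require a very small learning rate; dividing by $m$ takes the average instead, so the step size does not grow with the amount of data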
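
The effect of averaging can be seen by repeating the same records many times. This is a minimal sketch with randomly generated stand-in data; `weight_step` and its `average` flag are hypothetical names for this example.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def weight_step(X, y, w, learnrate, average):
    """One batch step; divides by the number of records when average=True."""
    y_hat = sigmoid(X @ w)
    error_term = (y - y_hat) * y_hat * (1 - y_hat)  # delta for each record
    del_w = error_term @ X                          # summed over all records
    if average:
        del_w = del_w / len(y)
    return learnrate * del_w

rng = np.random.default_rng(0)
w = np.array([0.5, -0.5])
base_X = rng.normal(size=(10, 2))                    # stand-in inputs
base_y = rng.integers(0, 2, size=10).astype(float)   # stand-in 0/1 targets

steps = {}
for copies in (1, 100):
    X = np.tile(base_X, (copies, 1))  # the same 10 records, repeated
    y = np.tile(base_y, copies)
    sse_norm = np.linalg.norm(weight_step(X, y, w, 0.5, average=False))
    mse_norm = np.linalg.norm(weight_step(X, y, w, 0.5, average=True))
    steps[copies] = (sse_norm, mse_norm)
    print(len(y), sse_norm, mse_norm)
```

The summed step is 100 times larger for the repeated data set, while the averaged step stays the same size.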

### Implementing Gradient Descent
- update weights: $\Delta w_{ij} = \eta \delta_j x_i$
- where $\eta$ is the learning rate, $\delta_j$ is the error term for unit $j$, and $x_i$ are the input values
- use gradient descent to train a network on graduate school admissions data (http://www.ats.ucla.edu/stat/data/binary.csv)
    - data set has 3 input features: **GRE, GPA, Rank** of prestige of undergraduate school (1-4)
    - Goal: predict if a student will be admitted to grad program based on features
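
A training loop built on this update rule might look like the sketch below. It does not load the admissions data: the features and targets are synthetic stand-ins, and the learning rate, epoch count, and weight initialization are assumptions.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

# Synthetic stand-in data: 200 records with 3 standardized features
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])  # assumed, used only to create targets
targets = (sigmoid(features @ true_w) > 0.5).astype(float)

n_records, n_features = features.shape
weights = rng.normal(scale=1 / n_features ** 0.5, size=n_features)
learnrate = 0.5  # assumed
epochs = 200     # assumed

def mse(w):
    """Mean squared error over all records."""
    return np.mean((targets - sigmoid(features @ w)) ** 2)

initial_loss = mse(weights)
for epoch in range(epochs):
    del_w = np.zeros(n_features)
    for x, y in zip(features, targets):
        output = sigmoid(np.dot(x, weights))
        # Error term: output error times the sigmoid gradient
        error_term = (y - output) * output * (1 - output)
        del_w += error_term * x
    # Average the accumulated steps over the records before updating
    weights += learnrate * del_w / n_records
final_loss = mse(weights)

predictions = sigmoid(features @ weights) > 0.5
accuracy = np.mean(predictions == targets)
print('Initial loss:', initial_loss)
print('Final loss:', final_loss)
print('Training accuracy:', accuracy)
```

The weight steps are accumulated over all records and averaged once per epoch, which matches the mean squared error rather than the summed error.
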
#### Data cleanup
- Use *one-hot encoding* for rank (categorical) -> row of 1's and 0's
- Standardize GRE and GPA
    - normalization vs. standardization
        - normalization: rescale every value to the range 0-1
        - standardization: rescale to a mean of 0 and a standard deviation of 1
        - standardized value = (X - avg)/std
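
Both cleanup steps can be sketched with plain NumPy. The five records below are made-up values, and `standardize` and `one_hot` are hypothetical helper names.

```python
import numpy as np

# Made-up records: GRE score, GPA, and rank (1-4)
gre = np.array([380.0, 660.0, 800.0, 640.0, 520.0])
gpa = np.array([3.61, 3.67, 4.00, 3.19, 2.93])
rank = np.array([3, 3, 1, 4, 2])

def standardize(values):
    """Rescale to mean 0 and standard deviation 1: (X - avg) / std."""
    # Note: ndarray.std() uses the population standard deviation (ddof=0)
    return (values - values.mean()) / values.std()

def one_hot(categories, levels):
    """One column per level: 1 where the record has that level, else 0."""
    return (categories[:, None] == np.array(levels)[None, :]).astype(float)

rank_encoded = one_hot(rank, levels=[1, 2, 3, 4])
features = np.column_stack([standardize(gre), standardize(gpa), rank_encoded])

print(features.shape)   # (5, 6)
print(rank_encoded[0])  # [0. 0. 1. 0.]
```

In a pandas workflow, `pd.get_dummies` is a common alternative for the one-hot step.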