In [13]:
import pandas as pd
import numpy as np

## Introduction to deep learning

* Captures interactions between features really well
* In reality, most things interact with each other

### Neural networks
* Made up of an input layer, an output layer and one or more hidden layers
* The more nodes we have in the hidden layers, the more interactions we capture

## Forward propagation

* Example: inputs are number of children and number of accounts, output is number of transactions
* The hidden layer has 2 nodes
* Each of the inputs has a line to each of the nodes (4 in total), each with a weight
* The weights indicate how much each of the inputs affect each of the nodes
* To get the value that goes into a node, we multiply each value in the previous layer by the weight of the line connecting it to that node, then add the products together
* We repeat this for each layer, including the output
* This forward propagation process is done for one data point (row) at a time, and the value in the output is the prediction for that data point.
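
The steps above can be sketched in NumPy for a single row. This is a minimal illustration, not a definitive implementation: the input values and all of the weights are made up for the example.

```python
import numpy as np

# Hypothetical data point: 2 children, 3 accounts
input_data = np.array([2, 3])

# Hypothetical weights: one array per node, one weight per incoming line
weights = {
    "node_0": np.array([1, 1]),
    "node_1": np.array([-1, 1]),
    "output": np.array([2, -1]),
}

# Each hidden node value is the weighted sum of the inputs
node_0_value = (input_data * weights["node_0"]).sum()   # 2*1 + 3*1 = 5
node_1_value = (input_data * weights["node_1"]).sum()   # 2*-1 + 3*1 = 1
hidden_layer_values = np.array([node_0_value, node_1_value])

# The output is the weighted sum of the hidden node values
prediction = (hidden_layer_values * weights["output"]).sum()  # 5*2 + 1*-1 = 9
print(prediction)
```

The printed value is the predicted number of transactions for this one row.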

## Activation functions

* Activation functions in the hidden layers allow the NN to capture non-linear relationships
* An activation function is applied to the value coming into a node to give the node's output value
* Standard function is ReLU: rectified linear activation
    * Gives 0 if value is <= 0, gives value otherwise
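
A minimal sketch of ReLU as described above. The function name `relu` is my own choice; `np.maximum` is used so that it works on single values and on arrays.

```python
import numpy as np

def relu(value):
    """Return 0 for values <= 0, and the value itself otherwise."""
    return np.maximum(0, value)

print(relu(-3))   # 0
print(relu(4))    # 4
print(relu(np.array([-2.0, 0.0, 1.5])))
```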

## Deeper networks

* It is common to have NN with many, many hidden layers
* Each iteration through layers uses the same process as with one hidden layer
* Deep networks internally build up representations of patterns in data
* Each layer has the ability to recognise sophisticated patterns
* Can partially replace the need for feature engineering
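
Because every layer repeats the same process, a deeper forward pass can be written as a loop over the layers. The sketch below assumes ReLU in every hidden layer and no activation on the output; the function name and all weights are hypothetical.

```python
import numpy as np

def relu(value):
    return np.maximum(0, value)

def predict_with_network(input_data, layer_weights, output_weights):
    """Forward propagate one row through any number of hidden layers."""
    values = input_data
    for w in layer_weights:
        # w has shape (nodes in previous layer, nodes in this layer)
        values = relu(values @ w)
    return values @ output_weights

# Hypothetical weights for two hidden layers with 2 nodes each
layer_weights = [
    np.array([[1, -1], [1, 1]]),   # inputs -> hidden layer 1
    np.array([[1, 2], [-1, 1]]),   # hidden layer 1 -> hidden layer 2
]
output_weights = np.array([1, 1])

prediction = predict_with_network(np.array([2, 3]), layer_weights, output_weights)
print(prediction)  # 15
```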

## The need for optimisation

* The weights for the model's lines are initially set randomly
* Through a process called back propagation, these weights are optimised so that the error of the model is reduced (the predicted values get closer to the actual values)
* As the number of data points in the model increases, this optimisation process gets more difficult
* The loss function aggregates the errors for all data points into a single number
* Our goal is to find the lowest amount of loss (lowest value of the loss function)
* Gradient descent - keep going down until it is uphill in every direction
* Using the slope of the tangent to the curve, we can minimise the loss by moving in the opposite direction to the slope (i.e., if the slope is positive, we go in a negative direction, and vice versa) until we reach a flat area where any further step would increase the loss
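
To make the idea concrete, here is a small sketch on a made-up loss curve with a single weight. The curve `(w - 3) ** 2`, the starting point and the `step_size` value are all assumptions chosen for illustration.

```python
def loss(w):
    # Hypothetical loss curve with its minimum at w = 3
    return (w - 3) ** 2

def slope(w):
    # Slope of the tangent to the curve at w
    return 2 * (w - 3)

w = 0.0            # arbitrary starting weight
step_size = 0.1    # assumed small fixed step

for step in range(100):
    # Move in the opposite direction to the slope
    w = w - step_size * slope(w)

print(round(w, 4), round(loss(w), 4))
```

After enough steps `w` settles near 3, where the slope is close to zero and the curve is uphill in both directions.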

## Gradient descent

* If a slope is positive:
    * Going opposite the slope means moving to lower numbers
    * Subtract the slope from the current value
    * Too big a step might lead us astray
* Solution - control the size of the steps using something called a learning rate
* Change the weight of the line by subtracting the learning rate * slope
* How do we find the slopes of the weights?
    * Requires calculus, but keras and tensorflow do this for us
* To calculate the slope for a weight, we need to multiply:
    * Slope of the loss function, with respect to the value at the node we feed into
        * Slope of the mean-squared loss function wrt prediction:
            * 2 * (predicted value - actual value) = 2 * error
    * The value of the node that feeds into our weight
    * The slope of the activation function with respect to the value we feed into

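For the below example (a node with value 3 feeds into the output through a line with weight 2), this would be:

| 3 | - 2 -> | 6 |

* Predicted = 3 * 2 = 6
* Actual = 10
* Value of node that feeds into our weight = 3
* No activation function, so its slope is 1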

Slope for the weight = 2 x error x node value = 2 x -4 x 3 = -24

If learning rate is 0.01, the new weight would be:
2 - 0.01(-24) = 2.24 (new weight for the line)

3 * 2.24 = 6.72
6.72 - 10 = -3.28 (lower error than previously)

The factor of 2 in the slope comes from calculus: the derivative of the squared error (predicted - actual)^2 with respect to the predicted value is 2 * (predicted - actual).
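
The arithmetic of this worked example can be checked with a few lines of Python. The variable names are my own; the numbers are the ones from the example.

```python
weight = 2
node_value = 3
actual = 10
learning_rate = 0.01

predicted = node_value * weight                 # 6
error = predicted - actual                      # -4
slope = 2 * error * node_value                  # -24

new_weight = weight - learning_rate * slope     # approximately 2.24
new_predicted = node_value * new_weight         # approximately 6.72
new_error = new_predicted - actual              # approximately -3.28

print(slope, round(new_weight, 2), round(new_predicted, 2), round(new_error, 2))
```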

In [14]:
## Calculating one iteration of gradient descent

weights = np.array([0, 2, 1])
input_data = np.array([1, 2, 3])
target = 0

# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights - learning_rate * slope

# Get updated predictions: preds_updated
preds_updated = (input_data * weights_updated).sum()

# Calculate updated error: error_updated
error_updated = preds_updated - target

# Print the original error
print(error)

# Print the updated error
print(error_updated)

7
5.04
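
Repeating the same update in a loop shows the error shrinking step by step. This is a sketch that reuses the data from the cell above; the choice of 20 iterations is arbitrary.

```python
import numpy as np

weights = np.array([0.0, 2.0, 1.0])
input_data = np.array([1, 2, 3])
target = 0
learning_rate = 0.01

errors = []
for i in range(20):
    # Error with the current weights
    error = (weights * input_data).sum() - target
    errors.append(error)
    # Slope of the squared error with respect to each weight
    slope = 2 * input_data * error
    # Step against the slope
    weights = weights - learning_rate * slope

print(errors[0], errors[1], errors[-1])
```

The first two errors match the single iteration above (7, then about 5.04), and each later error is smaller than the one before it.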


## Backpropagation

* Backpropagation works backwards through each layer to reduce the amount of error (minimise the loss function)
* We must have weights assigned to each line in order to begin backpropagation
* We go back one layer at a time
* The gradient (array of slopes) for the weights is the product of:
    1. Node value feeding into that weight
    2. Slope of loss function wrt the node it feeds into
    3. Slope of activation function at the node it feeds into
* How do we get these values?
    1. This is either the value in the input layer, or it is the value calculated for a node in a hidden layer
    2. For the output node with a mean-squared loss this is 2 * error; for nodes in earlier layers it is carried back from the layer after them
    3. E.g., for ReLU: the slope is 0 if the number is <= 0, and 1 otherwise
* We also need to keep track of the slopes of the loss function wrt node values
* Slopes of node values are the sum of the slopes for all weights that come out of them
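
The three factors above can be sketched for a small network with one hidden layer of two ReLU nodes and a single output. This is an illustration under assumptions: squared-error loss on one row, no activation on the output, and made-up inputs, weights and target.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Hypothetical data and weights
input_data = np.array([2.0, 3.0])
hidden_weights = np.array([[1.0, -1.0], [1.0, 1.0]])  # shape (inputs, hidden nodes)
output_weights = np.array([2.0, -1.0])
target = 5.0

# Forward propagation, keeping the intermediate node values
hidden_in = input_data @ hidden_weights     # values coming into the hidden nodes
hidden_out = relu(hidden_in)                # values leaving the hidden nodes
prediction = hidden_out @ output_weights
error = prediction - target

# Output layer: node value feeding in * slope of loss wrt prediction
slope_loss_wrt_prediction = 2 * error
output_gradient = hidden_out * slope_loss_wrt_prediction

# Go back one layer: slope of the loss wrt each hidden node's value
# (with several output nodes these contributions would be summed)
slope_loss_wrt_hidden_out = output_weights * slope_loss_wrt_prediction
# Slope of ReLU: 0 if the incoming value is <= 0, 1 otherwise
relu_slope = (hidden_in > 0).astype(float)
slope_loss_wrt_hidden_in = slope_loss_wrt_hidden_out * relu_slope

# Hidden layer: input value feeding in * slope at the node it feeds into
hidden_gradient = np.outer(input_data, slope_loss_wrt_hidden_in)

print(output_gradient)
print(hidden_gradient)
```

Each gradient has the same shape as the weights it belongs to, so it can be multiplied by a learning rate and subtracted from them.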

## Backpropagation in practice

