# Introduction to Deep Learning in Python Course

## Introduction to Deep Learning
- Deep learning is a subset of machine learning, which is a subset of artificial intelligence.
- Deep learning is a type of machine learning that trains a computer to perform human-like tasks, such as recognizing speech, identifying images or making predictions.
- Instead of organizing data to run through predefined equations, deep learning sets up basic parameters about the data and trains the computer to learn on its own by recognizing patterns using many layers of processing.
- We use Keras for this course. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

## Forward Propagation
- Forward propagation is the process neural networks use to make predictions.
- Bank transactions example:
    - Inputs: number of children, number of existing accounts.
    - Output: number of transactions.
    - Weights: parameters that the model learns.
    - [Model Photo](forward_propagation.png)
- Dot product: multiply the inputs by the weights and sum them up.
- Forward propagation for one data point. Output is the prediction for that data point.
- Code:
```python
import numpy as np
input_data = np.array([2, 3])
weights = {'node_0': np.array([1, 1]),
           'node_1': np.array([-1, 1]),
           'output': np.array([2, -1])}
node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()
hidden_layer_values = np.array([node_0_value, node_1_value])
print(hidden_layer_values)
output = (hidden_layer_values * weights['output']).sum()
print(output)
```

## Activation Functions
- Activation functions are applied to node inputs to produce node output.
- An activation function allows models to capture non-linearities.
- If the relationships in the data are non-linear, we need an activation function to capture them.
- Activation functions:
    - ReLU (Rectified Linear Activation): max(0, x)
    - Tanh (Hyperbolic Tangent): (e^x - e^-x) / (e^x + e^-x)
    - Sigmoid: 1 / (1 + e^-x)
    - Identity: f(x) = x
- [Activation Function Tanh](activation_function_tanh.png)
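
The four functions listed above can be sketched in NumPy as follows. This is a minimal illustration, not the course's own code: the helper names `relu`, `sigmoid`, and `identity` and the sample array `x` are assumptions chosen for this example, while `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def relu(x):
    """Rectified linear activation: max(0, x), applied element-wise."""
    return np.maximum(0, x)

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1 / (1 + np.exp(-x))

def identity(x):
    """Identity activation: returns its input unchanged."""
    return x

# Hypothetical node inputs, chosen only to show each function's behaviour
x = np.array([-2.0, 0.0, 2.0])

print(relu(x))      # negative inputs become 0
print(np.tanh(x))   # values squashed into the range (-1, 1)
print(sigmoid(x))   # values squashed into the range (0, 1)
print(identity(x))  # unchanged
```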

### ReLU (Rectified Linear Activation)
- ReLU is the most common activation function.
- [Activation Function ReLU](ReLU.png)

### Tanh Example
- Example code:
```python
import numpy as np
input_data = np.array([-1, 2])
weights = {'node_0': np.array([3, 3]),
           'node_1': np.array([1, 5]),
           'output': np.array([2, -1])}
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = np.tanh(node_0_input)
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = np.tanh(node_1_input)
hidden_layer_outputs = np.array([node_0_output, node_1_output])
output = (hidden_layer_outputs * weights['output']).sum()
print(output)
```

### ReLU Example
- Example code:
```python
import numpy as np

# Assumed inputs and weights, reused from the tanh example above
input_data = np.array([-1, 2])
weights = {'node_0': np.array([3, 3]),
           'node_1': np.array([1, 5]),
           'output': np.array([2, -1])}

def relu(x):
    '''Return max(0, x), the rectified linear activation of x.'''
    # Named x rather than input to avoid shadowing the built-in input()
    return max(0, x)

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output)
```

## Deeper Networks
- Deeper networks have more than one hidden layer.
- Deep networks internally build representations of patterns in the data.
- You use the same forward propagation process, but apply it repeatedly, one layer at a time.
- [Deeper Networks](multiple_hidden_layers_relu.png)
- These are the mechanics of how neural networks make predictions.
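
Applying forward propagation layer by layer might look like the sketch below. The network shape (two hidden layers of two nodes each), the weight values, and the helper `layer_outputs` are all assumptions made for illustration; they are not taken from the figure.

```python
import numpy as np

def relu(x):
    """Rectified linear activation: max(0, x)."""
    return max(0, x)

def layer_outputs(inputs, layer_weights):
    """One forward-propagation step: dot product, then ReLU, for each node."""
    return np.array([relu((inputs * w).sum()) for w in layer_weights])

# Hypothetical inputs and weights, chosen only for illustration
input_data = np.array([3, 5])
weights = {
    'hidden_0': [np.array([2, 4]), np.array([4, -5])],
    'hidden_1': [np.array([-1, 2]), np.array([1, 2])],
    'output': np.array([2, 7]),
}

# The same step is applied once per hidden layer
hidden_0_outputs = layer_outputs(input_data, weights['hidden_0'])        # [26, 0]
hidden_1_outputs = layer_outputs(hidden_0_outputs, weights['hidden_1'])  # [0, 26]

# Output node: dot product only, no activation
model_output = (hidden_1_outputs * weights['output']).sum()
print(model_output)  # 182
```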

### Representation Learning
- Deep networks internally build representations of patterns in the data.
- These representations partially replace the need for feature engineering.
- Subsequent layers build increasingly sophisticated representations of the raw data.
- For example, the first layer might detect edges, the second layer shapes, and the third layer high-level features. This illustrates how deep learning models learn from the data.

### Deep Learning
- The modeler doesn't need to specify the interactions.
- When you train the model, the neural network gets weights that find the relevant patterns to make better predictions.

## The need for optimization
- Optimization finds the set of weights that minimizes the loss function.
- The loss function measures how well the model's predictions match the target values.
- We use the data to update the weights.
- Making accurate predictions gets harder with more points.
- At any set of weights, there are many values of the error corresponding to the many points we make predictions for.

### Loss Function
- Aggregates errors in predictions from many data points into a single number.
- It is a measure of the model's predictive performance.
- A lower loss function value means a better model.
- We use the mean squared error loss function.
- [Loss Function Graph](loss_function_graph.png)
- Goal: Find the weights that give the lowest value for the loss function.
- Gradient descent is a general method to minimize functions.
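
As a small illustration of how mean squared error aggregates many errors into one number, consider the sketch below. The predictions and targets are made-up values, not data from the course.

```python
import numpy as np

# Hypothetical predictions and targets for four data points
predictions = np.array([6.0, 9.0, 4.0, 7.0])
targets = np.array([10.0, 8.0, 4.0, 5.0])

# One error per data point
errors = predictions - targets   # [-4, 1, 0, 2]

# Mean squared error: square each error, then average
mse = np.mean(errors ** 2)       # (16 + 1 + 0 + 4) / 4
print(mse)  # 5.25
```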

### Gradient Descent
- Imagine you are in a pitch dark field.
- Want to find the lowest point.
- Feel the ground to see how to go downhill.
- Take small steps downhill.
- Repeat until it is uphill in every direction.
- This is gradient descent.
- Steps:
    - Start at random point.
    - Until you are somewhere flat:
        - Find the slope.
        - Take a step downhill.
- Learning rate: how big the step is.
- Too big: might miss the minimum.
- Too small: will take too long.
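
The steps above can be sketched on a toy one-dimensional problem. Everything here is an assumption for illustration: the loss `(w - 3) ** 2` stands in for a real loss function, the starting point is fixed at 0 rather than random so the run is reproducible, and `gradient_descent` is a hypothetical helper.

```python
def slope(w):
    """Slope of the toy loss (w - 3) ** 2 with respect to w."""
    return 2 * (w - 3)

def gradient_descent(start, learning_rate, max_steps=1000, tol=1e-6):
    """Step downhill until the slope is nearly flat or max_steps is reached."""
    w = start
    for _ in range(max_steps):
        s = slope(w)               # find the slope
        if abs(s) < tol:           # somewhere flat: stop
            break
        w = w - learning_rate * s  # take a step downhill
    return w

# A moderate learning rate reaches the minimum at w = 3
w_good = gradient_descent(start=0.0, learning_rate=0.1)

# A very small learning rate is still far from the minimum after 100 steps
w_slow = gradient_descent(start=0.0, learning_rate=0.001, max_steps=100)

print(w_good, w_slow)
```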

## Review
- The importance of model weights in making accurate predictions. Adjusting weights can significantly change the model's output.
- The concept of a loss function, which aggregates all prediction errors into a single measure, helping to evaluate the model's performance.
- Gradient descent, an algorithm used to find the set of weights that minimizes the loss function. It involves starting with random weights, calculating the slope (or gradient) of the loss function at those weights, and then adjusting the weights in the direction that reduces the loss.


## Gradient Descent
- If the slope is positive:
    - Going opposite the slope means moving to lower numbers.
    - Subtract the slope from the current value.
    - Too big a step might lead us astray.
- Learning rate: how much we update the weights. Update each weight by subtracting the product of learning rate and slope.
- To calculate the slope for a weight, multiply:
    - Slope of the loss function w.r.t (with respect to) the value at the node we feed into.
    - The value of the node that feeds into our weight.
    - Slope of the activation function w.r.t the value we feed into.
- Slope of mean-squared loss function w.r.t prediction:
    - 2 * (prediction - actual) = 2 * error.

### Gradient Descent Example
[Example](slope_calculation_example.png)
- 2 * -4 = -8 (slope of the loss function w.r.t prediction).
- 2 * -4 * 3 = -24 (slope of the loss function w.r.t prediction * node value).
- If learning rate is 0.01, the new weight would be 2 - 0.01 * -24 = 2.24. This is how we update the weights.
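
The arithmetic in this example can be reproduced in a few lines. The target value of 10 is an assumption: it is the value implied by a prediction of 3 * 2 = 6 and an error of -4. The output node is assumed to use the identity activation, so its slope is 1 and drops out of the product.

```python
node_value = 3        # value of the node feeding into the weight
weight = 2            # current weight
target = 10           # assumed: implied by prediction 6 and error -4
learning_rate = 0.01

prediction = node_value * weight   # 6
error = prediction - target        # -4

# Slope of the mean-squared loss w.r.t. the prediction
slope_loss = 2 * error             # -8

# Slope w.r.t. the weight: multiply by the node value feeding into it
slope_weight = slope_loss * node_value   # -24

# Update: subtract learning rate times slope
new_weight = weight - learning_rate * slope_weight
print(new_weight)  # approximately 2.24
```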

## Backpropagation
- Takes the error from the output layer and propagates it backward through the network.
- It calculates the necessary slopes sequentially from the weights closest to the prediction, through the hidden layers, to the input layer.
- Allows gradient descent to update all weights in the neural network (by getting gradients for all weights).
- Comes from the chain rule of calculus.
- Process:
    - Trying to estimate the slope of the loss function w.r.t each weight.
    - Do forward propagation to calculate predictions and errors.
    - Go back one layer at a time.
    - The gradient for a weight is the product of:
        - Node value feeding into that weight.
        - Slope of loss function w.r.t node it feeds into.
        - Slope of activation function at the node it feeds into.
    - Use these gradients to update the weights.
    - We also need to keep track of the slopes of the loss function w.r.t node values.
    - The slope for a node value is the sum of the slopes for all weights that come out of it.

### ReLU Activation Function
- Slope is 0 for negative values.
- Slope is 1 for positive values.

## Backpropagation In Practice
- Calculating slopes associated with any weight in the network.
- The gradient for a weight is the product of:
    - Node value feeding into that weight.
    - Slope of the loss function w.r.t node it feeds into.
    - Slope of activation function at the node it feeds into.

[Backpropagation](backpropagation_example.png)
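
The three-factor product can be traced by hand on a tiny network. The sketch below assumes one hidden layer of two ReLU nodes, an identity output node, a squared-error loss on a single data point, and made-up inputs and weights; it is not the network shown in the figure.

```python
import numpy as np

# Hypothetical data point and weights
x = np.array([1.0, 2.0])
target = 3.0
W_hidden = np.array([[1.0, 1.0],    # weights into hidden node 0
                     [1.0, -1.0]])  # weights into hidden node 1
w_out = np.array([2.0, -1.0])       # weights into the output node

# Forward propagation
hidden_in = W_hidden @ x               # [3, -1]
hidden_out = np.maximum(0, hidden_in)  # ReLU gives [3, 0]
prediction = w_out @ hidden_out        # 6
error = prediction - target            # 3

# Slope of the squared-error loss w.r.t. the prediction
slope_pred = 2 * error                 # 6

# Output weights: node value * slope of loss * activation slope (1 for identity)
grad_w_out = hidden_out * slope_pred   # [18, 0]

# Go back one layer: slope of the loss w.r.t. each hidden node's output
slope_hidden_out = slope_pred * w_out           # [12, -6]
relu_slope = (hidden_in > 0).astype(float)      # 1 for positive inputs, else 0
slope_hidden_in = slope_hidden_out * relu_slope # [12, 0]

# Hidden weights: input value feeding in * slope at the node it feeds into
grad_W_hidden = np.outer(slope_hidden_in, x)    # [[12, 24], [0, 0]]

print(grad_w_out)
print(grad_W_hidden)
```

Hidden node 1 receives a negative input, so its ReLU slope is 0 and every weight feeding into it gets a zero gradient.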

### Recap
- Start at some random set of weights.
- Use forward propagation to make a prediction.
- Use backward propagation to calculate the slope of the loss function w.r.t each weight.
- Multiply that slope by the learning rate, and subtract from the current weights.
- Keep going with that cycle until we get to a flat part.

## Stochastic Gradient Descent
- It is common to calculate slopes on only a subset of the data ('batch').
- Use a different batch of data to calculate the next update.
- Start over from the beginning once all data is used.
- Each time through the training data is called an epoch.
- When slopes are calculated on one batch at a time: stochastic gradient descent.
- When slopes are calculated on the whole data set: gradient descent.
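
A minimal sketch of batch-by-batch updates is shown below, assuming a linear model with no hidden layer, synthetic noise-free data, and mean squared error. The batch size, learning rate, epoch count, and the choice to reshuffle the data each epoch are illustrative assumptions rather than settings from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets are an exact linear function of two inputs
X = rng.normal(size=(200, 2))
true_weights = np.array([2.0, -1.0])
y = X @ true_weights

weights = np.zeros(2)   # starting weights (zeros here for reproducibility)
learning_rate = 0.05
batch_size = 20
n_epochs = 30

for epoch in range(n_epochs):           # one epoch = one pass through the data
    order = rng.permutation(len(X))     # assumed: reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        errors = X[batch] @ weights - y[batch]
        # Slope of the mean squared error w.r.t. each weight, on this batch only
        slope = (2 * X[batch].T @ errors) / len(batch)
        weights = weights - learning_rate * slope

print(weights)  # should end up close to true_weights
```

Setting `batch_size = len(X)` in this sketch would calculate each slope on the whole data set, which corresponds to plain gradient descent.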
