# Course: Deep Learning in Python
- [DataCamp course link](https://www.datacamp.com/courses/deep-learning-in-python/)

In [1]:
# Pre-load modules used later
from IPython.display import Image

## Chapter 1: Basics of deep learning and neural networks
- [Slides](slides/ch1_slides.pdf)


- The key difference between neural networks and classic regression models is that neural networks can capture the *interactions* between features, which also affect targets.  Regression just captures each feature's *direct* effect on the target.
- More nodes = more interactions which can be captured.
- *Representation learning* -- Neural networks build representations of patterns in the data which are useful for making predictions.  Patterns increase in complexity in successive layers of the network.  This partially replaces the need for feature engineering.

**Forward propagation**
- Each node's value is the dot product of the arrays holding the incoming node *values* and their *weights*.
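
As a minimal sketch of that dot product, the snippet below computes one hidden layer in a single matrix-vector product. The input values and weights are the ones used in the manual exercise later in these notes; the variable names are only illustrative.

```python
import numpy as np

# Input node values and the weights feeding each hidden node.
input_values = np.array([1, 1])
weights = np.array([[2, 4],    # weights into hidden node 1
                    [4, -5]])  # weights into hidden node 2

# One dot product per hidden node, computed together as a matrix-vector product.
hidden_values = weights @ input_values
print(hidden_values.tolist())  # [6, -1]
```

Stacking each node's weights as one row means a whole layer is a single `@` operation instead of one `.sum()` per node.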


**Activation functions**
- The current best practice, and the most widely used activation function, is **ReLU (Rectified Linear Unit)**. Variations and alternatives will be discussed later (e.g. leaky ReLU, softmax).
\begin{equation}
\mathrm{ReLU}(x) =
\begin{cases}
0, & \text{if } x < 0\\
x, & \text{if } x \geq 0
\end{cases}
\end{equation}
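
The definition above translates almost directly into NumPy. This is a small sketch; the function name `relu` is my own choice, not something defined by the course.

```python
import numpy as np

def relu(x):
    """Return 0 where x is negative and x itself otherwise, element-wise."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.5])).tolist())  # [0.0, 0.0, 3.5]
```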

In [7]:
#
# ReLU function visualized.  
# Reference: https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/
#
Image(url="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/10/17160725/relu.png", width=450)

In [16]:
#
# Exercise: Manually calculate values of nodes in a 3-layer network (without an activation function)
#
from numpy import array

# Hidden layer 1
n1 = (array([1, 1]) * array([2, 4])).sum()
n2 = (array([1, 1]) * array([4, -5])).sum()
# Hidden layer 2
n3 = (array([n1, n2]) * array([0, 1])).sum()
n4 = (array([n1, n2]) * array([1, 1])).sum()
# Output layer
output = (array([n3, n4]) * array([5, 1])).sum()

print(n1, n2, n3, n4, output)

6 -1 -1 5 0
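
For comparison, here is a sketch of the same network with a ReLU activation applied to both hidden layers. The weights are the ones from the exercise above; the matrix layout and the `relu` helper are my own additions.

```python
import numpy as np

def relu(x):
    # 0 for negative values, unchanged otherwise
    return np.maximum(0, x)

inputs = np.array([1, 1])
w_hidden1 = np.array([[2, 4], [4, -5]])   # rows: weights into n1, n2
w_hidden2 = np.array([[0, 1], [1, 1]])    # rows: weights into n3, n4
w_output = np.array([5, 1])

h1 = relu(w_hidden1 @ inputs)   # [6, 0]: the -1 is clipped to 0
h2 = relu(w_hidden2 @ h1)       # [0, 6]
output = w_output @ h2          # 6

print(h1.tolist(), h2.tolist(), int(output))  # [6, 0] [0, 6] 6
```

With the activation in place the output changes from 0 to 6, because the negative value at `n2` no longer propagates forward.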


## Chapter 2: Optimizing a neural network with backpropagation
- [Slides](slides/ch2_slides.pdf)


- **Loss function** -- A measure of a model's predictive performance; it aggregates the prediction errors from many data points into a single value (e.g. mean squared error (MSE)).
- **Gradient descent** is used to find the best (lowest) value of the loss function.
    - FYI: Mathematically, an array of slopes is called a "gradient."
- **Learning rate**
    - Weights are updated by subtracting **```learning rate * slope```**
    - Btw calculating the slope involves the good ol' *chain rule* from calculus! ;)
- **Backpropagation**
    - Basic steps:
        1. Start with random set of weights
        1. Apply forward propagation
        1. Apply backward propagation to calculate slope of loss function w.r.t. each weight
        1. Update weights by multiplying those slopes by the learning rate and subtracting the result from the previous weights
        1. Iterate until the loss function levels out (i.e. the slopes approach zero)
    - The gradient for a weight is *the product of*:
        1. Node value *feeding into* the weight
        1. Slope of loss function w.r.t. the node value it *feeds into* (for an output node with MSE loss, this is 2 * the prediction error)
        1. Slope of *activation function* for the node it feeds into (when using ReLU: 0 if the node's input is < 0, 1 if >= 0)
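- **Stochastic gradient descent** calculates the slopes on a small batch of the input data at a time, and is often used for computational efficiency.
    - *stochastic* = randomly determined.
    - Each batch produces one weight update; one full pass through the training data (all batches) is an *epoch*.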
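
The steps above can be sketched in NumPy for the simplest possible case: a single layer of weights with no activation function and an MSE loss. The data, the learning rate, and the batch size are made up for illustration, and the helper names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): targets are an exact linear function of two inputs.
X = rng.normal(size=(200, 2))
true_weights = np.array([3.0, -2.0])
y = X @ true_weights

def mse(weights, X, y):
    return np.mean((X @ weights - y) ** 2)

def mse_slopes(weights, X, y):
    # Chain rule: slope for a weight = input feeding the weight * 2 * error,
    # averaged over rows. With no activation function, that slope factor is 1.
    error = X @ weights - y
    return 2 * X.T @ error / len(y)

weights = rng.normal(size=2)        # start with random weights
learning_rate = 0.1
batch_size = 20
initial_loss = mse(weights, X, y)

for epoch in range(20):             # one epoch = one pass over all rows
    order = rng.permutation(len(y)) # reshuffle the rows each epoch
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        slopes = mse_slopes(weights, X[batch], y[batch])  # forward + backward pass
        weights = weights - learning_rate * slopes        # one update per batch

final_loss = mse(weights, X, y)
print(initial_loss, final_loss, weights)
```

Each epoch here performs 10 weight updates (200 rows / 20 rows per batch), which is the distinction between a batch and an epoch.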


## Chapter 3: Building deep learning models with keras
- [Slides](slides/ch3_slides.pdf)


[**Keras**](https://keras.io) is a high-level library capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML.


### Model building steps:
1. Specify network architecture
1. Compile
    - Specify optimizer ('[*adam*](https://keras.io/optimizers/#adam)' is usually a good choice)
    - Specify loss function (*'mean_squared_error'* is common for regression, *'categorical_crossentropy'* aka log loss for classification)
    - For classification problems:
        - Add `metrics=['accuracy']` so accuracy is reported during training; it is easier to interpret than the loss value
        - Output layer
            1. has a node for each possible classification outcome and
            1. uses *'softmax'* activation to ensure predictions sum to 1 so they can be used as probabilities.
1. Fit
    - Applies forward and back propagation & gradient descent
1. Predict
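
To make the softmax and log loss points concrete, here is a NumPy sketch of both. It is a hand-rolled illustration, not Keras's own implementation, and the raw output values are made up.

```python
import numpy as np

def softmax(scores):
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_pred):
    # Mean negative log of the probability assigned to the true class.
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Raw output-layer values for 2 rows and 3 classes (made up).
raw_outputs = np.array([[2.0, 1.0, 0.1],
                        [0.5, 0.5, 3.0]])
# One-hot encoded true classes.
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])

probabilities = softmax(raw_outputs)
loss = categorical_crossentropy(y_true, probabilities)
print(probabilities.sum(axis=1))   # each row sums to 1
print(loss)
```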

Model types:
- `keras.models.Sequential` -- Layers are connected only in a sequential fashion.

Layer types:
- `keras.layers.Dense` -- All nodes of a layer connect to all nodes of the next layer.
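
A minimal sketch of the four model building steps for a regression problem follows. It assumes Keras is available through TensorFlow (`from tensorflow import keras`); the data, layer sizes, and epoch count are made up for illustration.

```python
import numpy as np
from tensorflow import keras

# Toy regression data (made up): 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)).astype("float32")
y = (X @ np.array([1.0, -2.0, 0.5])).astype("float32")

# 1. Specify network architecture
model = keras.models.Sequential([
    keras.Input(shape=(3,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),            # single output node for regression
])

# 2. Compile
model.compile(optimizer="adam", loss="mean_squared_error")

# 3. Fit
model.fit(X, y, epochs=5, verbose=0)

# 4. Predict
predictions = model.predict(X, verbose=0)
print(predictions.shape)  # (200, 1)
```

For a classification problem, the output layer would instead have one node per class with `activation="softmax"`, and the compile step would use `loss="categorical_crossentropy"` with `metrics=["accuracy"]`.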

## Chapter 4: Fine-tuning keras models
- [Slides](slides/ch4_slides.pdf)


More optimizers:
- **Stochastic gradient descent (SGD)** -- `keras.optimizers.SGD`


Common optimization problems:
- **dead neurons** -- Happens when a node's value gets stuck at zero, e.g. a ReLU node whose input is negative for every row of data.  Its slope is then also 0, so the weights feeding it stop updating.
- **vanishing gradient** -- Happens when many layers have very small slopes (e.g. due to being on the flat part of the tanh activation curve).  Leads to backprop weight updates close to zero.
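
Both problems come down to activation slopes, which a few lines of NumPy can illustrate. The input values and the five-layer depth are arbitrary choices for this sketch.

```python
import numpy as np

def tanh_slope(x):
    # The derivative of tanh(x) is 1 - tanh(x)^2.
    return 1 - np.tanh(x) ** 2

def relu_slope(x):
    # Same convention as in the notes: 0 if < 0, 1 if >= 0.
    return np.where(x >= 0, 1.0, 0.0)

# Vanishing gradient: backprop multiplies one activation slope per layer.
slope_steep = tanh_slope(0.0)      # 1.0 at the steepest point of tanh
slope_flat = tanh_slope(3.0)       # roughly 0.01 on the flat part
n_layers = 5
combined_slope = slope_flat ** n_layers
print(slope_steep, slope_flat, combined_slope)

# Dead neuron: if a ReLU node's input is negative for every row,
# its slope is 0 everywhere, so its incoming weights never update.
node_inputs = np.array([-4.0, -0.5, -2.0])
dead_slopes = relu_slope(node_inputs)
print(dead_slopes.tolist())  # [0.0, 0.0, 0.0]
```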

Model training optimization (`model.fit()`):
- **Validation**
    - A single **validation split** is commonly used in deep learning instead of the k-fold cross-validation typical for classical models, because the data sets are large.  The reasons are:
        - Repeating training k times on large data sets is too compute intensive
        - A score from a single large validation set is already reasonably reliable
    - `validation_split=` param of `fit()`.
- **Early stopping** -- Quit iterating when model performance stops improving.  
    - `keras.callbacks.EarlyStopping`
    - Specify using the `callbacks=` param of `fit()`.
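
The two `fit()` options above can be combined as in the sketch below. It assumes Keras is available through TensorFlow; the data, the 30% split, the `patience=2` setting, and the epoch limit are arbitrary choices for illustration.

```python
import numpy as np
from tensorflow import keras

# Toy regression data (made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)).astype("float32")
y = (X @ np.array([1.0, -2.0, 0.5])).astype("float32")

model = keras.models.Sequential([
    keras.Input(shape=(3,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

# Stop once the validation loss has not improved for 2 epochs in a row.
early_stopping = keras.callbacks.EarlyStopping(patience=2)

history = model.fit(
    X, y,
    validation_split=0.3,        # hold out 30% of the rows for validation
    epochs=30,                   # upper limit; early stopping may end sooner
    callbacks=[early_stopping],
    verbose=0,
)

epochs_run = len(history.history["val_loss"])
print(epochs_run)
```

Because early stopping ends training automatically, `epochs` can be set higher than you expect to need.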
    
**Model or network capacity** -- Closely related to overfitting/underfitting: capacity is the model complexity you're trying to optimize.  A model with too little capacity underfits, while a model that is too complex can overfit the data used to train it.
- Workflow for optimizing model capacity:
    1. Start w/ small network
    1. Get validation score
    1. Keep increasing capacity till validation score no longer improves
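
The workflow above can be sketched as a simple loop. Here `validation_score` is a hypothetical callable standing in for "build, fit, and score a network of this size", and the scores below are invented purely to show the stopping behavior.

```python
def pick_capacity(candidate_sizes, validation_score):
    """Grow the network until the validation score stops improving.

    candidate_sizes: capacities ordered from small to large.
    validation_score: hypothetical callable; higher scores are better.
    """
    best_size, best_score = None, float("-inf")
    for size in candidate_sizes:
        score = validation_score(size)
        if score <= best_score:
            break                      # no improvement: stop adding capacity
        best_size, best_score = size, score
    return best_size, best_score

# Made-up validation scores keyed by number of hidden nodes.
fake_scores = {10: 0.71, 50: 0.78, 100: 0.81, 200: 0.80, 400: 0.79}

best_size, best_score = pick_capacity(sorted(fake_scores), fake_scores.get)
print(best_size, best_score)  # 100 0.81
```

In practice the score function would train a model with the given capacity and return its validation score; if the metric is a loss, flip the comparison so that lower is better.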