# Deep Learning Principles

Artificial neural networks were first described in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts. After a few early successes, such as Frank Rosenblatt's Perceptron (1957), they were largely forgotten until the 1980s, when new architectures and training techniques were developed.

In 1986, D. E. Rumelhart, G. E. Hinton, and R. J. Williams published their *backpropagation* paper, "Learning Internal Representations by Error Propagation", and Yann LeCun later showed its power with LeNet, which recognized handwritten digits for the US Postal Service. However, in the 1990s researchers preferred other Machine Learning algorithms, such as SVMs, because of their performance and explainability. Backpropagation, together with other training techniques, large datasets, and the increase in computing power (GPUs, CUDA in 2007, TPUs in 2016), made possible what we now know as deep learning.

Deep learning is a field of Machine Learning that focuses on learning successive layers of representations. In this notebook, we will introduce some basic concepts needed to understand deep learning architectures.

## Activation functions

An activation function takes the weighted sum of a unit's inputs and decides the unit's output.
Originally, the perceptron used a step function to **activate** a unit, but to train a Multi-Layer Perceptron (MLP) with gradient descent (Rumelhart et al., 1986), the step function was replaced by the logistic function (a.k.a. sigmoid), which has a nonzero derivative everywhere. Since then, many other functions have proved effective:

- Hyperbolic tangent: Outputs range from -1 to 1 and are centered around 0, which keeps each layer's output roughly normalized and helps keep the vanishing/exploding gradients problem under control (although it still saturates for large inputs).
- ReLU = max(0, z): Fast to compute, and it does not saturate for large positive inputs, which helps avoid vanishing gradients.
- LeakyReLU = max(az, z): Adds a small slope `a` for negative inputs to avoid dead units (the "dying ReLU" problem).
- PReLU: Similar to LeakyReLU, but the slope is learned during training.
- ELU (2015): In its authors' experiments it outperformed the ReLU variants and reduced training time, but it is slower to compute.
- Swish (2017): The input multiplied by its sigmoid, `z * sigmoid(z)`, so it is smooth around 0 instead of changing abruptly.
- Softmax: Used in the output layer for multiclass classification; it turns the scores into probabilities that sum to 1.
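
As a rough illustration, the functions listed above can be sketched in NumPy. The function names and the default slopes (`alpha=0.01` for LeakyReLU, `alpha=1.0` for ELU) are assumptions made for this sketch, not taken from a particular library. PReLU is omitted because its slope is a trained parameter, and the hyperbolic tangent is available directly as `np.tanh`.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): keeps a small slope for negative inputs
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    # z for positive inputs, alpha * (exp(z) - 1) otherwise;
    # np.minimum avoids evaluating exp on large positive values
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def swish(z):
    # Input multiplied by its own sigmoid
    return z * sigmoid(z)

def softmax(z):
    # Subtracting the maximum leaves the result unchanged
    # but prevents overflow in exp for large scores
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
activations = {
    "tanh": np.tanh(z),
    "relu": relu(z),
    "leaky_relu": leaky_relu(z),
    "elu": elu(z),
    "swish": swish(z),
    "softmax": softmax(z),
}
```

Plotting each entry of `activations` over a wider range of `z` is a quick way to compare where each function saturates.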


## Weight-initialization schemes

In backpropagation, as the error propagates back through the layers, the gradients often get smaller and smaller, which prevents the lower layers from training. This is known as the **vanishing gradients problem**; the opposite, where the gradients grow larger and larger, is the **exploding gradients problem**. In 2010, Glorot and Bengio shed light on the causes of these problems: with the sigmoid activation and the initialization that was popular at the time, *the variance of each layer's outputs is greater than the variance of its inputs*. They proposed a new **initialization** (Xavier or Glorot), and further schemes have been proposed in the years since.
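
The effect of the weight scale on the signal can be seen in a small experiment. The sketch below is a simplification, not the analysis from the paper: it uses purely linear layers (no activation) and tracks only the forward signal, on the assumption that the same reasoning carries over to the gradients flowing backward. The helper `forward_variances` and its parameters are hypothetical names chosen for this example.

```python
import numpy as np

def forward_variances(weight_sd, n_layers=5, width=256, n_samples=1000, seed=0):
    """Variance of the signal after each layer of a stack of linear layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(n_samples, width))  # inputs with variance ~1
    variances = []
    for _ in range(n_layers):
        # Weights drawn from a normal distribution with the given standard deviation
        W = rng.normal(0.0, weight_sd, size=(width, width))
        x = x @ W  # linear layer only, to isolate the effect of the weight scale
        variances.append(float(x.var()))
    return variances

# sd = 1: each layer multiplies the variance by roughly `width`
exploding = forward_variances(weight_sd=1.0)
# sd = 0.01: each layer shrinks the variance
vanishing = forward_variances(weight_sd=0.01)
# sd = sqrt(1/width): the variance stays close to 1 across layers
stable = forward_variances(weight_sd=np.sqrt(1.0 / 256))
```

With 256 inputs and 256 outputs per layer, `sqrt(1/256)` is the same value as `sqrt(2/(fan_in + fan_out))`, which is why scaling the weights by the layer size keeps the signal stable.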

- Glorot (2010): weights are drawn from a normal distribution with mean `0` and `sd = sqrt(2/(fan_in + fan_out))`, or from a uniform distribution between `-r` and `r`, where `r = sqrt(6/(fan_in + fan_out))`.
- He (2015): it uses a truncated normal distribution with mean `0` and `sd = sqrt(2/fan_in)`.
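
A minimal NumPy sketch of these schemes follows. The function names are assumptions for this example, the He variant uses `sd = sqrt(2/fan_in)` as in He et al., and it draws from a plain normal distribution rather than a truncated one to keep the code short.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform between -r and r, with r = sqrt(6 / (fan_in + fan_out))
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    # Normal with mean 0 and sd = sqrt(2 / (fan_in + fan_out))
    sd = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sd, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # Normal with mean 0 and sd = sqrt(2 / fan_in);
    # not truncated here, which is a simplification of this sketch
    sd = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, sd, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_glorot = glorot_uniform(300, 100, rng)  # weights for a 300 -> 100 layer
W_he = he_normal(300, 100, rng)
```

The two Glorot variants have the same variance: a uniform distribution on `[-r, r]` has variance `r**2 / 3`, which equals `2/(fan_in + fan_out)` for the `r` above.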


## Optimizers

Apart from keeping the gradient problems under control, deep learning got a big boost from fast optimizers, which help gradient descent converge faster and escape local optima.

- Momentum optimization (1964): uses the gradient as an acceleration rather than as a speed, so past gradients keep contributing to the current step.
- Nesterov (1983): measures the gradient slightly ahead, in the direction of the momentum, which usually converges faster.
- AdaGrad (2011): adapts the learning rate per dimension, decaying it faster along steep dimensions than along gentle ones. It sometimes stops before reaching the optimum.
- RMSProp: like AdaGrad, but with exponential decay, so only the gradients from the most recent iterations are taken into account.
- Adam (2015): momentum + RMSProp; it keeps exponentially decaying averages of past gradients and of their squares.
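
The update rules can be sketched as single-step functions in NumPy. This follows one common textbook formulation of each optimizer; the function names, default hyperparameters, and the toy quadratic loss are assumptions made for illustration only.

```python
import numpy as np

def momentum_step(theta, m, grad, lr=0.05, beta=0.9):
    # The gradient changes the momentum vector m, and m moves the weights
    m = beta * m - lr * grad
    return theta + m, m

def nesterov_step(theta, m, grad_fn, lr=0.05, beta=0.9):
    # Same as momentum, but the gradient is measured at theta + beta * m
    m = beta * m - lr * grad_fn(theta + beta * m)
    return theta + m, m

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    # s accumulates all squared gradients, so the step keeps shrinking
    s = s + grad**2
    return theta - lr * grad / np.sqrt(s + eps), s

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-10):
    # Exponential decay: old squared gradients are gradually forgotten
    s = rho * s + (1.0 - rho) * grad**2
    return theta - lr * grad / np.sqrt(s + eps), s

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying averages of the gradients (m) and of their squares (s)
    m = beta1 * m + (1.0 - beta1) * grad
    s = beta2 * s + (1.0 - beta2) * grad**2
    # Bias correction, since m and s start at zero (t counts steps from 1)
    m_hat = m / (1.0 - beta1**t)
    s_hat = s / (1.0 - beta2**t)
    return theta - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# Toy loss 0.5 * sum(curvature * theta**2): steep in one dimension, gentle in the other
curvature = np.array([1.0, 10.0])

def grad_fn(theta):
    return curvature * theta

theta, m = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, m = momentum_step(theta, m, grad_fn(theta))
theta_momentum = theta

theta, m = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, m = nesterov_step(theta, m, grad_fn)
theta_nesterov = theta
```

Each function returns the updated weights together with the optimizer state, so the caller carries `m` and `s` from one step to the next. On this toy loss both momentum variants end up close to the minimum at the origin.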

Besides these optimizers, other training techniques are worth remembering: batch normalization, gradient clipping, early stopping, and dropout.

**References**

- GÉRON, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, 2019.
- CHOLLET, François. Deep Learning with Python. Manning Publications, 2018.
