from IPython.display import Image
import keras

# Artificial Neural Networks

Artificial neural networks can tackle highly complex ML tasks, such as image classification, speech recognition, predicting stocks, playing games... They are the core of Deep Learning.

Why is "playing" with NNs fun? Because they can have different **architectures**. It is up to you and your creativity to find the best architecture to solve your problem.

The idea behind artificial neural networks comes from biological neural systems. Neurons are typically organized in **layers**.

<img src="https://dl.dropboxusercontent.com/s/rukcytf4xm8frz1/bio_neuron_layers.png?dl=0" width="800">

A neuron is a cell composed of a cell body, many branching extensions called dendrites, and one long extension called the axon.
The neuron receives short electrical signals from other neurons that are attached to its dendrites. Only when the combined signal exceeds a certain threshold does it fire a signal of its own, which travels along the axon and reaches the next layer of neurons.

<img src="https://dl.dropboxusercontent.com/s/9jeij0scgtgij18/bio_neuron.png?dl=0" width="800">



Biological neurons are not activated by every signal they receive; they suppress the input until it has grown large enough (threshold).

Artificial neurons work in much the same way. 
They have multiple inputs, which they add up. If the resulting sum exceeds a certain threshold, the neuron is activated and fires a signal. Otherwise, the signal is suppressed.

The function that takes the input signal and generates an output signal given a threshold is called an **activation function**. And the result of the activation function is also the output of the neuron.

<img src="https://dl.dropboxusercontent.com/s/i0g07tonai21phb/artifical_neuron.png?dl=0" width="800">



### The Perceptron (invented in 1957)

The most basic and oldest type of ANN is the perceptron. It was invented in 1957 by Frank Rosenblatt. It is made up of only one neuron that accepts the input and applies an activation function to it in order to generate a binary output.

<img src="https://dl.dropboxusercontent.com/s/x1t7go0h7yncp2q/perceptron.png?dl=0" width="800">

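Each input connection is associated with a weight. The perceptron computes a weighted sum of its inputs 

$\Large z = w_{0} + w_{1} x_{1} + w_{2} x_{2} + \cdots + w_{n} x_{n} = x^{T} w$ 

where $w_{0}$ is the bias term and $x$ includes a constant component $x_{0} = 1$, so that the sum can be written as a single dot product. It then applies an activation function $h$ to that sum and outputs the result

$\Large output = h(z)$

For the perceptron, the activation function is a step function.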

$\Large h(z) = \begin{cases}
      0 & \text{if } z < 0\\
      1 & \text{if } z \geq 0
    \end{cases}$


Training the Perceptron means finding the set of weights that best performs the task. Note that there is always a bias term ($w_{0}$ in the sum above)!

The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). This is like drawing a hyperplane in the space of input features and then using it as the decision boundary. So Perceptrons can be used for linear classification problems, but they cannot do much more than that. However, if the training instances are linearly separable, the Perceptron training algorithm is guaranteed to converge to a solution.
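The training procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the original notes: the function names (`predict`, `train_perceptron`), the learning rate, and the toy AND/XOR data are assumptions chosen for the example. It uses the classic Perceptron update rule, which nudges each weight by the prediction error times its input.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def predict(weights, x):
    """weights[0] is the bias w_0; the remaining weights pair up with the inputs."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return step(z)


def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Classic Perceptron rule: w_j += learning_rate * (y - y_hat) * x_j."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, x)
            if error != 0:
                errors += 1
                weights[0] += learning_rate * error  # bias sees a constant input of 1
                for j, xi in enumerate(x):
                    weights[j + 1] += learning_rate * error * xi
        if errors == 0:  # every training point is classified correctly
            break
    return weights


samples = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable, so training converges.
and_labels = [0, 0, 0, 1]
and_weights = train_perceptron(samples, and_labels)
and_predictions = [predict(and_weights, x) for x in samples]

# XOR is not linearly separable, so no single Perceptron can fit it.
xor_labels = [0, 1, 1, 0]
xor_weights = train_perceptron(samples, xor_labels)
xor_predictions = [predict(xor_weights, x) for x in samples]
```

On the AND data the loop stops early with all four points classified correctly. On the XOR data it runs out of epochs, because no straight line separates the two classes.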

<a href="https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=gauss&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=&seed=0.58206&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false">Example of Perceptron</a>


### Multi-Layer Perceptron

The decision boundary of the Perceptron is always linear. So Perceptrons are incapable of learning complex patterns. This limitation can be eliminated by stacking multiple Perceptrons. The resulting ANN is a Multi-layer Perceptron.

This is equivalent to adding a **hidden layer** (of two neurons in the figure below) between the **input layer** and the **output layer**.
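To see why stacking helps, here is a small sketch of a network with two hidden neurons that computes XOR, a pattern no single Perceptron can represent. The weights are hand-picked for illustration (they are an assumption, not the result of training), and the function name `mlp_xor` is hypothetical.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def mlp_xor(x1, x2):
    # Hidden layer: two Perceptrons, each drawing one straight line.
    h1 = step(x1 + x2 - 0.5)  # fires if at least one input is 1 (OR)
    h2 = step(x1 + x2 - 1.5)  # fires only if both inputs are 1 (AND)
    # Output layer: fires for "OR but not AND", which is XOR.
    return step(h1 - h2 - 0.5)


xor_outputs = [mlp_xor(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden neuron still has a linear decision boundary. The non-linear result comes from combining the two boundaries in the output neuron.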

<img src="https://dl.dropboxusercontent.com/s/4g0xvalf58x9ju7/ANN_1.png?dl=0" width="800">

<a href="https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=2&seed=0.69416&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false">Example of Multi-Layer Perceptron</a>


<img src="https://dl.dropboxusercontent.com/s/s2ovjqzrvpevie6/ANN_2.png?dl=0" width="800">



### Connectivity

A layer where all the nodes are connected to all the nodes of the previous and following layer is called a **dense layer** or **fully connected layer**. 

Why connect each node to every node in the next layer? Why not establish a more creative network?

Full connectivity is much easier to code. If a connection is not needed, the learning process can drive its weight toward zero. The connection weights are the trainable parameters of the model, and they get adjusted during training!
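A dense layer is easy to code because it is just one weighted sum per neuron. The sketch below is a plain-Python illustration, with hypothetical helper names (`dense_forward`, `count_parameters`) and made-up weights, that also counts the trainable parameters of a stack of dense layers (one weight per connection plus one bias per neuron).

```python
def dense_forward(inputs, weights, biases):
    """Pre-activation output of a dense layer.

    weights[j][i] is the weight of the connection from input i to neuron j.
    """
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def count_parameters(layer_sizes):
    """Trainable parameters of a fully connected network.

    Each layer contributes (n_in + 1) * n_out: weights plus one bias per neuron.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


layer_output = dense_forward([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0])
n_params = count_parameters([3, 4, 2])  # 3 inputs, 4 hidden neurons, 2 outputs
```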

### Activation Function

Why do we need activation functions in the first place? 

If you chain several linear transformations, all you get is a linear transformation. For example, take

$f(x) = 2 x + 3$ 

and 

$g(x) = 5 x - 1$ 

Chaining these two linear functions gives you another linear function: 

$f(g(x)) = 2(5 x - 1) + 3 = 10 x + 1$

So if you don’t have some non-linearity between layers, then even a deep stack of layers is equivalent to a single layer: you cannot solve very complex problems with that.
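The collapse of stacked linear layers can be checked numerically. The following sketch reuses $f$ and $g$ from above; the helper `linear`, the sample points `xs`, and the choice of ReLU as the non-linearity are assumptions made for the illustration.

```python
def linear(a, b):
    """Return the linear function x -> a * x + b."""
    return lambda x: a * x + b


def relu(z):
    return max(0.0, z)


f = linear(2, 3)
g = linear(5, -1)
collapsed = linear(10, 1)  # the single linear function equal to f(g(x))

xs = [-2.0, -1.0, 0.0, 0.5, 3.0]
stacked_linear = [f(g(x)) for x in xs]           # two linear "layers"
single_linear = [collapsed(x) for x in xs]       # one linear "layer"
stacked_with_relu = [f(relu(g(x))) for x in xs]  # non-linearity in between
```

The first two lists are identical, so the second layer adds nothing. With the ReLU in between, the output can no longer be reproduced by a single linear function.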

<img src="https://dl.dropboxusercontent.com/s/i0vohs2bzttigb8/activation_functions.png?dl=0" width="800">



### The Input layer

The input layer contains as many neurons as the number of input features.

### The output layer

**Regression**. 

If you want to predict a single value (e.g., the price of a house given many of its features), then you just need a single output neuron: its output is the predicted value. For multivariate regression (i.e., to predict multiple values at once), you need one output neuron per output dimension. 

In general, for regression you do not want to use any activation function for the output neurons, so they are free to output any range of values. However, if you want to guarantee that the output will always be positive, then you can use the ReLU activation function. If you want to guarantee that the predictions will fall within a given range of values, then you can use the logistic function or the tanh, and scale the labels to the appropriate range.

**Classification**

For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class.

For multiclass classification you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer.
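As a rough sketch of what these output activations compute, the logistic function squashes one score into a probability, and softmax turns a vector of scores into probabilities that sum to 1. The scores used here are made up for illustration, and subtracting the maximum score inside `softmax` is a standard numerical-stability trick rather than something stated in the notes.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map one score per class to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


binary_probability = sigmoid(0.0)               # single output neuron
class_probabilities = softmax([2.0, 1.0, 0.1])  # one output neuron per class
predicted_class = class_probabilities.index(max(class_probabilities))
```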

<img src="https://dl.dropboxusercontent.com/s/zcxro1w45y03k93/ANN_multi_class.png?dl=0" width="800">



### How many hidden layers? How many neurons?

- Each neuron in the first hidden layer adds a decision boundary hyperplane.

- Adding a second hidden layer performs a linear combination of the decision boundary hyperplanes given by the first hidden layer.

<a href="https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=2&seed=0.69416&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false">Example of Multi-Layer Perceptron</a>

- The lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), the intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).

- My advice: typically 1-5 hidden layers will solve most problems. Increase the number of hidden layers only if you are working with images, video, or audio. Each layer should contain 1-500 neurons. Keep in mind that if the number of layers/neurons is too small, your network will not be able to learn the underlying patterns in your data and will be useless. If instead the number of layers/neurons is too large, you will simply overfit. So a good approach is to start with a large number of layers/neurons and then progressively reduce this number, or regularize, in order to reduce overfitting (i.e., the **stretch pants approach**).

### Stretch pants approach

Finding the right number of neurons and layers is a dark art that requires a lot of experience. However, a simple way to pick them is the **stretch pants approach**: you pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting. Alternatively, you gradually reduce the number of layers and neurons until the model stops overfitting. 

In other words, instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.
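Early stopping itself is a simple rule: keep track of the best validation loss seen so far and stop once it has not improved for a few epochs. The sketch below is one possible version of that rule; the function name, the `patience` parameter, and the validation losses are all hypothetical values chosen for the example.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: they improve, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping(val_losses, patience=2)
```

In practice you would also restore the weights saved at `best_epoch`, which is what "shrinking the pants to the right size" amounts to.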

## Training the network weights - Backpropagation 

Suppose that we want to train an ANN to reproduce some stellar spectra given the stellar labels (temperature, surface gravity, metallicity). The training sample is composed of spectra of stars whose labels are known.

<img src="https://dl.dropboxusercontent.com/s/vxo84ex382z9hwz/project_ANN.png?dl=0" width="800">



- First things first: you have to define a loss function. A very common one is the squared difference between the true value $y^{(i)}$ and the predicted value $\hat{y}^{(i)}$. The loss function is calculated for every $i^{th}$ element in your training sample.

$\Large L=\frac{1}{2}[y^{(i)}-\hat{y}^{(i)}]^2$

- The ANN handles one mini-batch at a time (for example containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an **epoch**.

- Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the **forward pass**: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.

- Next, the algorithm measures the network’s output error. It takes the average of L calculated for every instance of the mini-batch. That will return a single number for your entire mini-batch.

$\Large J(w)=\frac{1}{2n}\sum_{i=1}^{n}[y^{(i)}-\hat{y}^{(i)}]^2$

- This cost function J is a function of the weights, so you can measure how much of the error came from each connection in each layer, going from the output back to the input. This is done by computing the partial derivative of the cost function with respect to each weight using the **chain rule**. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).

$\Large \frac{\partial J(w)}{\partial w_{j}}$

- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
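The steps above can be sketched for the simplest possible "network": a single linear neuron $\hat{y} = w x + b$ trained with the cost $J$ defined above. This is a toy illustration under stated assumptions (synthetic noise-free data, a fixed batch order, and hand-picked learning rate and batch size), not the stellar-spectra model from the figure.

```python
def train_linear_neuron(xs, ys, batch_size=4, learning_rate=0.3, epochs=500):
    """Mini-batch gradient descent on J = 1/(2n) * sum((y - y_hat)^2)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):  # one epoch = one pass over the training set
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            n = len(batch_x)
            # Forward pass: predictions for every instance in the mini-batch.
            predictions = [w * x + b for x in batch_x]
            # Backward pass: partial derivatives of J with respect to w and b.
            grad_w = -sum(
                (y - p) * x for x, y, p in zip(batch_x, batch_y, predictions)
            ) / n
            grad_b = -sum(y - p for y, p in zip(batch_y, predictions)) / n
            # Gradient Descent step.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic data generated from y = 2x + 1 (an assumption for this example).
xs = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
ys = [2.0 * x + 1.0 for x in xs]
w, b = train_linear_neuron(xs, ys)
```

After training, `w` and `b` should be very close to the values used to generate the data (2 and 1).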


### The Chain rule

Imagine you have a plain ANN.

<img src="https://dl.dropboxusercontent.com/s/b9nl84chqh5ddq5/plain_ANN.jpg?dl=0" width="1000">

The chain rule tells you the following.

$\Large \frac{\partial J(w)}{\partial w^{(1)}} = \frac{\partial J}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w^{(1)}} $

Once you have that, you can correct the weight

$ \Large w^{(1)} := w^{(1)} - \eta \frac{\partial J(w)}{\partial w^{(1)}}$
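The chain-rule product can be verified numerically on a tiny network. The sketch below assumes one neuron per layer, no bias terms, sigmoid activations, $z^{(l)} = w^{(l)} a^{(l-1)}$ with the input playing the role of $a^{(0)}$, and the squared-error loss from above; these simplifications and all numerical values are assumptions made for the example, and the network in the figure may differ. It compares the analytic gradient with a finite-difference estimate and then applies the weight update.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2, w3):
    """Return the activations a1, a2, a3 of a three-layer chain of neurons."""
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    a3 = sigmoid(w3 * a2)
    return a1, a2, a3


def cost(x, y, w1, w2, w3):
    return 0.5 * (y - forward(x, w1, w2, w3)[2]) ** 2


def grad_w1(x, y, w1, w2, w3):
    """dJ/dw1 as the product of the seven chain-rule factors."""
    a1, a2, a3 = forward(x, w1, w2, w3)
    dJ_da3 = a3 - y
    da3_dz3 = a3 * (1.0 - a3)  # derivative of the sigmoid
    dz3_da2 = w3
    da2_dz2 = a2 * (1.0 - a2)
    dz2_da1 = w2
    da1_dz1 = a1 * (1.0 - a1)
    dz1_dw1 = x
    return dJ_da3 * da3_dz3 * dz3_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1


x, y = 1.5, 1.0
w1, w2, w3 = 0.4, -0.7, 1.2

analytic = grad_w1(x, y, w1, w2, w3)

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-6
numeric = (cost(x, y, w1 + eps, w2, w3) - cost(x, y, w1 - eps, w2, w3)) / (2 * eps)

# Weight update with learning rate eta.
eta = 0.1
w1_updated = w1 - eta * analytic
```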

### Vanishing gradient

What can slow down gradient descent?

Assume that your activation functions are Sigmoid functions. 

<img src="https://dl.dropboxusercontent.com/s/yys9bu4eygqrigd/sigmoid.png?dl=0" width="600">

The Sigmoid saturates at the extremities. So, when the input is too high or too low, the derivative will be very close to zero.

If, for instance,

$\Large \frac{\partial a^{(2)}}{\partial z^{(2)}} \sim 0$

then also

$\Large \frac{\partial J(w)}{\partial w^{(1)}} \sim 0$

When this happens, the gradient will get smaller and smaller as you go backward in propagating the gradient. This is the so-called **vanishing gradient**.

<img src="https://dl.dropboxusercontent.com/s/qk1gmrwecywd6n4/vanishing.gif?dl=0" width="300">


Similarly, if the various gradients are too large, you run into the problem of **exploding gradients**.
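A quick calculation shows how fast the gradient can shrink. The derivative of the Sigmoid is at most 0.25 (at $z = 0$) and much smaller once it saturates, and backpropagation multiplies one such factor per layer. The sketch below evaluates these numbers; the choice of $z = 10$ and of 10 layers is arbitrary, and the product ignores the weight factors for simplicity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


peak = sigmoid_derivative(0.0)        # largest possible value of the derivative
saturated = sigmoid_derivative(10.0)  # derivative deep in the saturated regime

# Even in the best case, ten sigmoid layers scale the gradient by 0.25 ** 10.
best_case_after_10_layers = peak ** 10
```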

### The choice of the activation function

Choosing the right activation function is crucial to avoid the vanishing/exploding gradient.

The logistic activation function suffers from vanishing gradients. When its inputs are large in magnitude it saturates, so backpropagation has virtually no gradient to propagate back through the network. What little gradient exists keeps getting diluted as backpropagation progresses from the top layers down to the lower layers, so there is really nothing left for the lower layers.

Now let's consider the **ReLU activation function**. It does not saturate for positive inputs, which helps against the vanishing gradient problem, but it suffers from a problem known as **dying ReLUs**: during training, some neurons effectively die, meaning they stop outputting anything other than 0. 

When this happens, the neuron just keeps outputting 0s, and gradient descent does not change its weights anymore since the gradient of the ReLU function is 0 when its input is negative.

<img src="https://dl.dropboxusercontent.com/s/7738rv1y20qhr1m/relu.png?dl=0" width="500">

When we face these problems it is better to choose a non-saturating activation function. For instance, a **Leaky ReLU** could be a good option in this case!
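The difference between the two functions shows up in their gradients for negative inputs. In this small sketch the slope `alpha=0.01` for the Leaky ReLU is an assumed, commonly used value, and the function names are hypothetical.

```python
def relu(z):
    return z if z > 0 else 0.0


def relu_gradient(z):
    # Zero gradient for negative inputs: a "dead" neuron receives no update.
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z


def leaky_relu_gradient(z, alpha=0.01):
    # A small non-zero slope keeps some gradient flowing for negative inputs.
    return 1.0 if z > 0 else alpha


dead_gradient = relu_gradient(-2.0)
leaky_gradient = leaky_relu_gradient(-2.0)
```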

<img src="https://dl.dropboxusercontent.com/s/esw1gwcw5ovuobt/LRelu.jpg?dl=0" width="700">


As a rough rule of thumb, performance with different activation functions improves in this order (from lowest→highest performing): logistic → tanh → ReLU → Leaky ReLU → ELU → SELU.

[This](https://arxiv.org/pdf/1811.03378.pdf) is an excellent paper that dives deeper into the comparison of various activation functions for neural networks.

As always, don't be afraid to experiment with a few different activation functions.


### Vanishing gradient and the weight initialization

The right weight initialization method can avoid problems with the vanishing gradient and speed up time-to-convergence considerably.

The idea behind all that is the following. We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don’t want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs.

Here’s an analogy: if you set a microphone amplifier’s knob too close to zero, people won’t hear your voice, but if you set it too close to the max, your voice will be saturated and people won’t understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.

There are different methods to initialise the weights. Each method works properly with a specific class of activation functions.

- When using ReLU or Leaky ReLU, use [He initialization](https://arxiv.org/pdf/1502.01852.pdf)
- When using SELU or ELU, use [LeCun initialization](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)
- When using softmax, logistic, or tanh, use [Glorot initialization](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

Keras uses Glorot initialization by default.
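The three schemes differ only in the variance they give to the initial weights, expressed through the number of inputs (`fan_in`) and outputs (`fan_out`) of a layer. The sketch below computes the standard deviation for the normal-distribution variant of each method; the helper names and the layer size of 100 are assumptions for the example.

```python
import math


def glorot_std(fan_in, fan_out):
    """Glorot (Xavier) normal: variance = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He normal: variance = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def lecun_std(fan_in):
    """LeCun normal: variance = 1 / fan_in."""
    return math.sqrt(1.0 / fan_in)


# A layer with 100 inputs and 100 outputs.
fan_in, fan_out = 100, 100
stds = {
    "glorot": glorot_std(fan_in, fan_out),
    "he": he_std(fan_in),
    "lecun": lecun_std(fan_in),
}
```

He initialization uses a larger variance than the other two, which compensates for ReLU zeroing out roughly half of the signal.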


### Batch size

The **batch size** defines the number of samples that will be propagated through the network.

For instance, let's say you have 1050 training samples and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples (from 1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (from 101st to 200th) and trains the network again. We keep doing this until all samples have been propagated through the network. Since 1050 is not divisible by 100, the final batch contains only the remaining 50 samples.
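The splitting described above can be written as a short helper. This is only an illustration of the bookkeeping, with a hypothetical function name (`batch_ranges`); it returns the index range of each batch without touching any data.

```python
def batch_ranges(n_samples, batch_size):
    """Return (start, stop) index pairs covering all samples in order."""
    return [
        (start, min(start + batch_size, n_samples))
        for start in range(0, n_samples, batch_size)
    ]


ranges = batch_ranges(1050, 100)
iterations_per_epoch = len(ranges)               # weight updates per epoch
last_batch_size = ranges[-1][1] - ranges[-1][0]  # leftover samples in the final batch
```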

Advantages of using a batch size < number of all samples.

- It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.

- Typically networks train faster with mini-batches. That's because we update the weights after each propagation. 

- The gradients estimated from small batches are noisy, and in some cases this noise can actually help escape local minima.

Disadvantages of using a batch size < number of all samples.

- The smaller the batch, the less accurate the estimate of the gradient will be. In the figure below, you can see that the direction of the mini-batch gradient (green color) fluctuates much more in comparison to the direction of the full batch gradient (blue color). When the batch size is too small, the network weights can just jump around if your data is noisy, and the network might be unable to learn or might converge very slowly, thus negatively impacting total computation time.

So, by batching you have influence over training speed vs. gradient estimation accuracy. By choosing the batch size you define how many training samples are combined to estimate the gradient before updating the parameter(s).

<img src="https://dl.dropboxusercontent.com/s/xs2up2l8kbazkvl/min-batch.png?dl=0" width="700">


## Dropout

Dropout is one of the most popular regularization techniques for deep and dense neural networks.

It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step. The hyperparameter p is called the dropout rate, and it is typically set to 50%. After training, neurons don’t get dropped anymore. 

<img src="https://dl.dropboxusercontent.com/s/raxvz3jara2qukw/dropout.png?dl=0" width="700">

Here is an analogy. A company asks its employees to toss a coin every morning to decide whether or not to go to work. By doing this, the company would be forced to adapt its organization; it could not rely on any single person to fill the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person gets covid, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks.
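In code, dropout amounts to multiplying the activations of a layer by a random mask during training. The sketch below is a plain-Python illustration with a hypothetical `dropout` function. It uses the common "inverted dropout" variant, which rescales the surviving activations by $1/(1-p)$ so that their expected value is unchanged; this rescaling is an addition not described in the text above.

```python
import random


def dropout(activations, rate, training, rng):
    """Apply inverted dropout to a list of activations.

    rate is the probability p of dropping each neuron. Outside training
    (or with rate 0) the activations pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # seeded generator so the example is reproducible
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

With a rate of 50%, every training output is either 0 (dropped) or twice the original activation (kept and rescaled).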

### Neural network terminology

- **batch size** = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.

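- number of **iterations** = the number of passes, each pass using (batch size) examples. For example, with 1050 training examples and a batch size of 100, one epoch takes 11 iterations.

- one **epoch** = one forward pass and one backward pass of all the training examples.

- **input layer** and **output layer**. The first and last layers of neurons.

- **hidden layer**. Any layer in between the input and the output layers.

- **activation function**. The function that turns the summed input of a neuron into its output.

- **weight initialization**. The method used to set the connection weights before training starts.

- **optimizer**. The algorithm that updates the weights using the error gradients (e.g., Gradient Descent).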


## Optimizers

Training a very large deep neural network can be painfully slow. A huge speed boost comes from using a faster optimizer. Gradient Descent and Stochastic Gradient Descent aren't the only optimizers in town! There are a few different ones to choose from.

- Momentum
- Nesterov Accelerated Gradient
- AdaGrad
- RMSProp
- AdaDelta
- Adam

### Momentum Optimization

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity (if there is some friction or air resistance). 

Recall that Gradient Descent simply updates the weights $\theta$ by directly subtracting the gradient of the cost function $J(\mathbf{\theta})$ with regard to the weights ($\nabla_\theta J(\mathbf{\theta})$) multiplied by the learning rate $\eta$.

$\Large \theta \leftarrow \theta - \eta \nabla_\theta J(\mathbf{\theta})$

Momentum optimization instead corrects the weights by also taking into account the correction of the previous step. The parameter m controls how much of the previous correction it remembers.

<img src="https://dl.dropboxusercontent.com/s/hkt88n7dnks4wq0/momentum.png?dl=0" width="800">


**Advantages**

- Reduces the oscillations

- Good to escape from local minima

**Disadvantages**

- You need to tune an additional hyperparameter: the momentum m

- Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.
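A minimal sketch of the momentum update on a one-dimensional toy problem, $J(\theta) = \theta^2$. The variable `velocity` stores the previous correction and `momentum` plays the role of the hyperparameter m; the function name, starting point, learning rate, and number of steps are assumptions chosen for the example.

```python
def momentum_descent(gradient, theta, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent with momentum.

    velocity <- momentum * velocity - learning_rate * gradient(theta)
    theta    <- theta + velocity
    """
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * gradient(theta)
        theta = theta + velocity
    return theta


# Toy cost J(theta) = theta^2, whose gradient is 2 * theta and minimum is at 0.
theta_momentum = momentum_descent(lambda t: 2.0 * t, theta=5.0)
```

Setting `momentum=0.0` recovers plain Gradient Descent, since the previous correction is then forgotten entirely.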

### Nesterov Accelerated Gradient

Momentum is a good method, but if the momentum is too high the algorithm may overshoot a local minimum and continue up the other side. The NAG algorithm was developed to address this issue. It is a look-ahead method. 

The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. 

NAG will almost always speed up training compared to regular Momentum optimization.
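The only change with respect to regular Momentum is where the gradient is evaluated. The sketch below shows this on the toy cost $J(\theta) = \theta^2$; the function name and all numerical values are assumptions for the illustration.

```python
def nesterov_descent(gradient, theta, learning_rate=0.1, momentum=0.9, steps=300):
    """Nesterov Accelerated Gradient.

    The gradient is measured at the look-ahead point theta + momentum * velocity
    instead of at the current position theta.
    """
    velocity = 0.0
    for _ in range(steps):
        lookahead = theta + momentum * velocity
        velocity = momentum * velocity - learning_rate * gradient(lookahead)
        theta = theta + velocity
    return theta


# Toy cost J(theta) = theta^2, whose gradient is 2 * theta and minimum is at 0.
theta_nag = nesterov_descent(lambda t: 2.0 * t, theta=5.0)
```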

<img src="https://dl.dropboxusercontent.com/s/il4k1j5myn2oska/nesterov.png?dl=0" width="800">



### AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum.

$\Large \theta \leftarrow \theta - \frac{\eta}{\sqrt{G + \epsilon}} \nabla_\theta J(\mathbf{\theta})$

where 

$G$ is a diagonal matrix whose entries accumulate the squares of all previous gradients

and

$\epsilon$ is a very small regularization term (of order 10$^{-8}$) that provides numerical stability by preventing division by 0.

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate.
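For a single parameter the update rule above reduces to a few lines. This sketch records the effective learning rate $\eta / \sqrt{G + \epsilon}$ at every step so that its decay can be inspected; the toy cost $J(\theta) = \theta^2$, the function name, and the hyperparameter values are assumptions for the example.

```python
import math


def adagrad(gradient, theta, learning_rate=1.0, epsilon=1e-8, steps=50):
    """AdaGrad for one parameter; returns the final theta and the
    effective learning rate used at each step."""
    accumulated = 0.0  # running sum of squared gradients (G)
    effective_rates = []
    for _ in range(steps):
        g = gradient(theta)
        accumulated += g * g
        rate = learning_rate / math.sqrt(accumulated + epsilon)
        effective_rates.append(rate)
        theta -= rate * g
    return theta, effective_rates


# Toy cost J(theta) = theta^2, whose gradient is 2 * theta and minimum is at 0.
theta_final, rates = adagrad(lambda t: 2.0 * t, theta=5.0)
```

Because the accumulator only grows, the effective learning rate shrinks at every step.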

**Advantages**

- It quickly converges near the minimum

**Disadvantages**

- AdaGrad has a monotonically decreasing learning rate. The consequence is that it may stop before reaching the actual minimum. For this reason AdaGrad is generally not recommended for complicated problems.

<img src="https://dl.dropboxusercontent.com/s/rkg477d3cjeomzb/adagrad.png?dl=0" width="800">



### RMSProp

AdaGrad slows down a bit too fast and can end up never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training).

How much of the previous gradients is it going to remember at each step? To control this we need to introduce another hyperparameter: the decay rate $\rho$, which weights the running average of squared gradients, $s \leftarrow \rho s + (1 - \rho) (\nabla_\theta J(\mathbf{\theta}))^2$.

$\rho$ = 0 remembers none of the previous steps (only the current gradient counts)

$\rho$ close to 1 remembers a long history of previous steps

Default value in Keras: $\rho$=0.9
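The effect of $\rho$ is easiest to see on the accumulator alone. The sketch below uses the standard RMSProp running average of squared gradients, $s \leftarrow \rho s + (1-\rho) g^2$; the function name and the two-gradient sequence are made up for the illustration.

```python
def rmsprop_accumulator(gradients, rho=0.9):
    """Exponentially decaying average of squared gradients."""
    s = 0.0
    for g in gradients:
        s = rho * s + (1.0 - rho) * g * g
    return s


gradients = [3.0, 1.0]

# rho = 0: only the most recent squared gradient survives.
no_memory = rmsprop_accumulator(gradients, rho=0.0)

# rho = 0.9: earlier gradients still contribute, with decaying weight.
with_memory = rmsprop_accumulator(gradients, rho=0.9)
```

With $\rho = 0$ the accumulator equals the last squared gradient, while with $\rho = 0.9$ the earlier gradient still carries most of the weight after two steps.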

### Adam

Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.

It is often a good default choice for complicated problems.

The drawback is that Adam has two more hyperparameters:

- $\beta_{1}$ controls the momentum decay (default: 0.9)

- $\beta_{2}$ controls the scaling decay (default: 0.999)
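A single Adam update for one parameter can be sketched as follows. It follows the standard formulation with bias-corrected averages; the default values mirror those of the Keras `Adam` optimizer, while the function name `adam_step` and the toy cost $J(\theta) = \theta^2$ are assumptions for the example.

```python
import math


def adam_step(theta, m, s, g, t,
              learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One Adam update for a single parameter at step t (t starts at 1)."""
    m = beta_1 * m + (1.0 - beta_1) * g      # decaying average of gradients
    s = beta_2 * s + (1.0 - beta_2) * g * g  # decaying average of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)          # bias correction: m and s start at 0
    s_hat = s / (1.0 - beta_2 ** t)
    theta = theta - learning_rate * m_hat / (math.sqrt(s_hat) + epsilon)
    return theta, m, s


# First step on the toy cost J(theta) = theta^2 (gradient 2 * theta).
theta_after_one, _, _ = adam_step(theta=1.0, m=0.0, s=0.0, g=2.0, t=1)

# A few more steps, carrying the two running averages along.
theta, m, s = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, s = adam_step(theta, m, s, g=2.0 * theta, t=t)
```

Thanks to the scaling by $\sqrt{\hat{s}}$, the first step has a size of about `learning_rate` regardless of how large the gradient is.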

<img src="https://dl.dropboxusercontent.com/s/ws6rl15l5kbggqm/adam.gif?dl=0" width="800">



### How to choose the right optimizer?

Choosing the right optimizer for your machine learning problem can be hard. More specifically, there is no one-size-fits-all solution and the optimizer has to be carefully chosen based on the particular problem at hand.

As a rule of thumb: if you have the resources to find a good learning rate schedule, SGD with momentum is a solid choice. If you need quick results without extensive hyperparameter tuning, tend towards adaptive gradient methods.


<img src="https://dl.dropboxusercontent.com/s/2mph4hi0nld3x8l/hist_optimizers.jpg?dl=0" width="800">



### Architectures

ANNs can have very different architectures.

You can inject inputs and extract outputs at any level of the network. In this way you can also do regression and classification at the same time, for example.

<img src="https://dl.dropboxusercontent.com/s/wcsrnajn1eiacrk/ANN_architectures.png?dl=0" width="600">

If you have time series you probably want to use **recurrent NNs**.

<img src="https://dl.dropboxusercontent.com/s/1oc8nm4higyjkt7/recurrent_NN.png?dl=0" width="600">

When you are more interested in detecting outliers or anomalies you can use **autoencoders**.

<img src="https://dl.dropboxusercontent.com/s/04ig8rph4z3v5rh/autoencoder.jpg?dl=0" width="600">

...there is a vast zoo of neural networks.

<img src="https://dl.dropboxusercontent.com/s/fe9707zgkg3lhbn/NN_Zoo.png?dl=0" width="600">

