# Deep Learning

Deep Learning models can be used for a variety of complex tasks:
* Artificial Neural Networks for Regression and Classification
* Convolutional Neural Networks for Computer Vision
* Recurrent Neural Networks for Time Series Analysis
* Self Organizing Maps for Feature Extraction
* Deep Boltzmann Machines for Recommendation Systems
* Auto Encoders for Recommendation Systems

We build an artificial structure called an artificial neural network, made of nodes (neurons). Neurons in the input layer receive the input values, which are processed through all the hidden layers, much like in the human brain, to produce an output value. This is Deep Learning.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_1.JPG?raw=true' width='600'>

##### Reference:
[Introduction to Neural Networks (in Chinese)](http://www.ruanyifeng.com/blog/2017/07/neural-network.html#support)

## Artificial Neural Networks

### The Neuron

The Neuron is the basic building block of Artificial Neural Networks.<br>

The input values are the independent variables. Each one is multiplied by a weight, and the products are added up. The neural network processes the inputs more easily when they are all on a similar scale, so standardize or normalize them first.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_2.JPG?raw=true' width='500'>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_3.JPG?raw=true' width='500'>
If the output is categorical, there will be several output values ($y_1, y_2, \dots, y_n$), because they are dummy variables that represent the categories.
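
As a rough illustration of the two preprocessing ideas above, here is a minimal sketch using only the Python standard library. The helper names `standardize` and `one_hot` and the sample values are assumptions made for this example, and the sketch assumes the column is not constant (otherwise the standard deviation would be 0).

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumed non-zero, i.e. the column is not constant
    return [(x - mu) / sigma for x in column]

def one_hot(labels):
    """Turn categorical labels into dummy variables, one column per category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

ages = [20.0, 30.0, 40.0]            # hypothetical input column
scaled_ages = standardize(ages)

colours = ["red", "green", "red"]    # hypothetical categorical output
dummies = one_hot(colours)           # columns are ordered: green, red
```

Each row of `dummies` plays the role of the output values $y_1, y_2, \dots, y_n$ for one observation.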

The inputs are a single observation (one row of the dataset), and the output is a single observation as well: the inputs come from one row, and the output belongs to that exact same row.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_4.JPG?raw=true' width='500'>
Every single time, we are dealing with the values of one row: the different characteristics or attributes (columns) of that one observation (row).

Every synapse is assigned a weight. By adjusting the weights, the neural network decides in every single case which signals are important and which are not. Training an artificial neural network means adjusting the weights of all the synapses across the whole network.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_5.JPG?raw=true' width='500'>

Here is what happens inside a neuron:

STEP 1: Compute the weighted sum of all the inputs.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_6.JPG?raw=true' width='500'>
STEP 2: Apply the activation function which is assigned to the current neuron (or to the whole layer). 
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_7.JPG?raw=true' width='500'>
STEP 3: The neuron decides whether to pass the signal on, based on the activation function applied in STEP 2.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_8.JPG?raw=true' width='500'>
Finally, these steps are repeated over and over throughout the whole neural network.
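
The three steps can be sketched for a single neuron as follows. This is only an illustration: the function name `neuron`, the choice of the sigmoid as the default activation, and the sample numbers are assumptions, and no bias term is included because the notes above do not mention one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, activation=sigmoid):
    # STEP 1: weighted sum of the inputs
    z = sum(x * w for x, w in zip(inputs, weights))
    # STEP 2: apply the activation function
    # STEP 3: the activated value is the signal passed on to the next layer
    return activation(z)

# Hypothetical inputs and weights; the weighted sum here is 0.5 - 0.5 + 0 = 0
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.0])
```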

### The Activation Function

#### Threshold Function (Yes or No type function)

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_9.JPG?raw=true' width='400'>

#### Sigmoid Function

The sigmoid function is useful in the output layer, especially when we are trying to predict probabilities: its output can be read as the probability that $y = 1$.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_10.JPG?raw=true' width='400'>

#### Rectifier Function

The rectifier function is one of the most popular activation functions in artificial neural networks.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_11.JPG?raw=true' width='400'>

#### Hyperbolic Tangent (tanh)

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_12.JPG?raw=true' width='400'>
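
For reference, the four activation functions can be written in a few lines of Python. These are the usual textbook definitions rather than anything taken from the figures; in particular, placing the threshold at $x \ge 0$ is an assumption.

```python
import math

def threshold(x):
    """Yes/no step: 1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Smooth curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """0 for negative inputs, x itself otherwise."""
    return max(0.0, x)

def tanh(x):
    """Like the sigmoid, but ranging from -1 to 1."""
    return math.tanh(x)  # same as (1 - e^(-2x)) / (1 + e^(-2x))

# Print each function's value at a few sample points
for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), round(sigmoid(x), 3), rectifier(x), round(tanh(x), 3))
```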

#### Which activation function should we use?

When the Dependent Variable (y) is binary (y = 0 or 1):
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_13.JPG?raw=true' width='500'>

We can also apply the rectifier function in the hidden layers and the sigmoid function in the output layer.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_14.JPG?raw=true' width='500'>
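
A minimal sketch of that combination, assuming one hidden layer with two neurons and a single output neuron. The weights and inputs are made-up numbers, and `forward` is a hypothetical helper name.

```python
import math

def rectifier(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden neuron: weighted sum of the inputs, then the rectifier.
    hidden = [rectifier(sum(x * w for x, w in zip(inputs, ws)))
              for ws in hidden_weights]
    # Output neuron: weighted sum of the hidden activations, then the sigmoid.
    return sigmoid(sum(h * w for h, w in zip(hidden, output_weights)))

hidden_weights = [[0.5, -1.0], [1.0, 1.0]]   # 2 hidden neurons, 2 inputs each
output_weights = [1.0, -1.0]
p = forward([2.0, 1.0], hidden_weights, output_weights)  # always between 0 and 1
```

Because the last step is a sigmoid, `p` can be read as the predicted probability that $y = 1$.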

### How do Neural Networks Work

Each neuron in a hidden layer, depending on what it specializes in, can take different input values and use a different activation function. Their outputs are then combined to produce the result.

The hidden layers as a whole increase the flexibility of the neural network. They allow each neuron to look for one very specific thing, and the network then makes its prediction from the combination.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_15.JPG?raw=true' width='500'>

### How do Neural Networks Learn

By comparing its results with the actual values, the neural network works out how to learn on its own.

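It uses the cost function $C = \frac{(\hat{y} - y)^2}{2}$ to measure the difference between the predicted value and the actual value, and feeds that difference back to the neural network. The network then adjusts its weights to reduce the cost function.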
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_16.JPG?raw=true' width='500'>

When multiple rows are fed into the same ONE neural network (not 8 neural networks, just the same one), it produces a $\hat{y}$ for each row.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_17.JPG?raw=true' width='500'>
Each $\hat{y}$ is then compared with the actual value ($y$) of its row, giving $C = \sum \frac{(\hat{y} - y)^2}{2}$. This cost function (C) is used to adjust $w_1, w_2, w_3$ in the current neural network (the weights are the same for all of the rows; all the rows share the same weights) until the cost function is minimized.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_18.JPG?raw=true' width='500'>
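
As a small worked example, the summed cost over several rows can be computed like this. The predictions and actual values are invented for illustration.

```python
def cost(predictions, targets):
    """C = sum over rows of (y_hat - y)^2 / 2."""
    return sum((y_hat - y) ** 2 / 2 for y_hat, y in zip(predictions, targets))

y_hat = [0.8, 0.3, 0.5]   # hypothetical predictions, one per row
y = [1.0, 0.0, 0.5]       # hypothetical actual values for the same rows
total_cost = cost(y_hat, y)   # 0.04/2 + 0.09/2 + 0, up to floating-point rounding
```

A perfect prediction for every row would give a cost of 0.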

### Gradient Descent

All the rows of the dataset are fed into the same neural network to obtain the cost function.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_20.JPG?raw=true' width='400'>

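Differentiating the cost function gives the slope. A negative slope means downhill, so the weights are adjusted to move $\hat{y}$ to the right. A positive slope means uphill, so the weights are adjusted to move $\hat{y}$ to the left. This continues until the slope is 0, where the cost function is at its minimum (possibly only a local minimum).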
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_19.JPG?raw=true' width='400'>
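
To make the idea concrete, here is a sketch of gradient descent for the simplest possible model, a single weight with $\hat{y} = w x$. The model, the data, the learning rate of 0.05 and the 100 iterations are all assumptions chosen for illustration.

```python
def cost(w, xs, ys):
    """C(w) = sum of (w*x - y)^2 / 2 over all rows."""
    return sum((w * x - y) ** 2 / 2 for x, y in zip(xs, ys))

def slope(w, xs, ys):
    """Derivative dC/dw for the model y_hat = w * x."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]      # hypothetical data generated with w = 2
w = 0.0                   # starting guess
learning_rate = 0.05      # assumed step size

for _ in range(100):
    # Negative slope: w increases. Positive slope: w decreases.
    w -= learning_rate * slope(w, xs, ys)
```

After the loop, `w` sits very close to 2, where the slope is 0 and the cost is at its minimum.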

### Stochastic Gradient Descent

**Batch Gradient Descent:** Adjust the weights only after running all of the rows through the same neural network. Then run the whole dataset again (one iteration) and adjust the weights again, repeating until the cost function is minimized.

**Stochastic Gradient Descent:** Run one row at a time and adjust the weights after that single row. Then run the next row through the same neural network and adjust the weights again, and so on, until the cost function is minimized.
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_21.JPG?raw=true' width='400'>

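Stochastic gradient descent helps to avoid getting stuck in local minima, because its updates fluctuate much more from row to row. Each update is also faster than in batch gradient descent, since only one row has to be processed before the weights change.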

**Mini Batch Gradient Descent:** A combination of batch and stochastic gradient descent. Run 5, 10 or 100 rows at a time (the number is set by the user), then adjust the weights after each batch, as in stochastic gradient descent.
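
The three variants differ only in how many rows are processed before each weight update, which the following sketch expresses through a single `batch_size` parameter. The one-weight model $\hat{y} = w x$, the function name `train`, the per-epoch shuffling, and all numeric settings are assumptions made for this example.

```python
import random

def train(xs, ys, batch_size, epochs=200, learning_rate=0.01, seed=0):
    rng = random.Random(seed)
    rows = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(rows)  # assumption: visit the rows in a new order each epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            # Gradient of sum((w*x - y)^2 / 2) over the rows in this batch only
            grad = sum((w * x - y) * x for x, y in batch)
            w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]   # hypothetical data generated with w = 3

w_batch = train(xs, ys, batch_size=len(xs))  # batch: one update per pass
w_sgd = train(xs, ys, batch_size=1)          # stochastic: one update per row
w_mini = train(xs, ys, batch_size=2)         # mini batch: one update per 2 rows
```

On this noise-free toy data all three settings end up near the same weight; the difference lies in how often the weights are updated along the way.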

##### Reference:
[A Neural Network in 13 lines of Python (Part 2 - Gradient Descent)](https://iamtrask.github.io/2015/07/27/python-network-part2/)

### Back Propagation

The information is entered into the input layer and then it's propagated forward to get the output values $\hat{y}$. Then we compare $\hat{y}$ to the actual values that we have in the training set. Then we calculate the errors (cost function).
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_22.JPG?raw=true' width='400'>
Then the errors are fed back (back-propagated) through the network in the opposite direction, and **this allows us to train the network by adjusting ALL the weights AT THE SAME TIME.**
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_23.JPG?raw=true' width='400'>
The advantage of back propagation lies in the way the algorithm is structured: it tells you which part of the error each weight in the neural network is responsible for.
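
The following sketch shows how the error is shared out among the weights for one observation. It assumes a tiny network with two inputs, two sigmoid hidden neurons and one sigmoid output neuron, and the cost $C = \frac{(\hat{y} - y)^2}{2}$; the function names and all numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    """Forward propagation: returns the hidden activations and y_hat."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, row))) for row in W1]
    y_hat = sigmoid(sum(h * w for h, w in zip(hidden, w2)))
    return hidden, y_hat

def backprop(x, y, W1, w2):
    """Return dC/dW1 and dC/dw2 for C = (y_hat - y)^2 / 2."""
    hidden, y_hat = forward(x, W1, w2)
    # Error signal at the output neuron (chain rule through the sigmoid)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    grad_w2 = [delta_out * h for h in hidden]
    # Each hidden neuron's share of the error, sent back through w2
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, grad_w2

x, y = [0.5, -1.0], 1.0              # one hypothetical observation
W1 = [[0.1, 0.2], [-0.3, 0.4]]       # input -> hidden weights
w2 = [0.5, -0.6]                     # hidden -> output weights
grad_W1, grad_w2 = backprop(x, y, W1, w2)
```

One backward pass yields a gradient for every weight, which is why all the weights can be adjusted at the same time.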

##### Reference:
[Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap2.html)

### Training the ANN with Stochastic Gradient Descent

**STEP 1:** Randomly initialise the weights to small numbers close to 0 (but not 0).<br>
**STEP 2:** Input the first observation (first row) of the dataset in the input layer, each feature in one input node (basically, take the columns and put them into the input nodes).<br>
**STEP 3:** Forward Propagation - from left to right, the neurons are activated in a way that the impact of each neuron's activation is limited by the weights (the weights determine how important each neuron's activation is). Propagate the activations until getting the predicted result $\hat{y}$.<br>
**STEP 4:** Compare the predicted result to the actual result. Measure the generated error (cost function).<br>
**STEP 5:** Back Propagation - from right to left, the error is back-propagated. Update the weights according to how much they are responsible for the error (which the structure of the back-propagation algorithm makes possible). The learning rate decides by how much we update the weights.<br>
**STEP 6:** Repeat STEPs 2-5 with the next observation and update the weights after each observation (online learning; in our case, Stochastic Gradient Descent). Or: repeat STEPs 2-5 but update the weights only after a batch of observations (Batch Learning).<br>
**STEP 7:** When the whole training set has passed through the ANN, that makes an epoch. Redo more epochs (basically, keep training again and again so the network adjusts itself until the cost function reaches a minimum).
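
The steps above can be sketched end to end in plain Python. This is a toy illustration under several assumptions, not the implementation used later in these notes: one hidden layer of sigmoid neurons, a sigmoid output, no bias terms, an invented four-row dataset, and hypothetical names and settings (`train_sgd`, `n_hidden=3`, `learning_rate=0.5`, 1000 epochs).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, row))) for row in W1]
    return hidden, sigmoid(sum(h * w for h, w in zip(hidden, w2)))

def total_cost(X, Y, W1, w2):
    return sum((forward(x, W1, w2)[1] - y) ** 2 / 2 for x, y in zip(X, Y))

def train_sgd(X, Y, n_hidden=3, epochs=1000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # STEP 1: small random weights close to 0
    W1 = [[rng.uniform(-0.1, 0.1) for _ in X[0]] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    initial_cost = total_cost(X, Y, W1, w2)
    for _ in range(epochs):                      # STEP 7: one full pass = one epoch
        for x, y in zip(X, Y):                   # STEP 2: one observation at a time
            hidden, y_hat = forward(x, W1, w2)   # STEP 3: forward propagation
            error = y_hat - y                    # STEP 4: compare with the actual value
            # STEP 5: back-propagate the error and update every weight
            delta_out = error * y_hat * (1 - y_hat)
            delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w2, hidden)]
            w2 = [w - learning_rate * delta_out * h for w, h in zip(w2, hidden)]
            W1 = [[w - learning_rate * d * xi for w, xi in zip(row, x)]
                  for row, d in zip(W1, delta_hidden)]
            # STEP 6: the loop continues with the next observation
    return W1, w2, initial_cost

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0.0, 0.0, 1.0, 1.0]          # hypothetical target: equal to the first feature
W1, w2, initial_cost = train_sgd(X, Y)
final_cost = total_cost(X, Y, W1, w2)
```

Comparing `final_cost` with `initial_cost` shows how much the repeated epochs reduced the cost function on this toy data.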

### ANN in Python

## Convolutional Neural Networks