# Deep Learning

Deep Learning models can be used for a variety of complex tasks:
* Artificial Neural Networks for Regression and Classification
* Convolutional Neural Networks for Computer Vision
* Recurrent Neural Networks for Time Series Analysis
* Self Organizing Maps for Feature Extraction
* Deep Boltzmann Machines for Recommendation Systems
* Auto Encoders for Recommendation Systems

We build an artificial structure called an artificial neural network, made of nodes (neurons). Some neurons receive the input values, which are then processed through all the hidden layers, loosely as in the human brain, to produce an output value. This is deep learning.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_1.JPG?raw=true' width='600'>

##### Reference:
[Introduction to Neural Networks (in Chinese)](http://www.ruanyifeng.com/blog/2017/07/neural-network.html#support)

## Artificial Neural Networks

### The Neuron

The Neuron is the basic building block of Artificial Neural Networks.<br>

The input values are the independent variables. Each one is multiplied by a weight, and the weighted values are added up. The neural network processes them more easily when they are all on a similar scale, so standardize or normalize them first.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_2.JPG?raw=true' width='500'><br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_3.JPG?raw=true' width='500'><br>
If the output is categorical, there are several output values ($y_1, y_2 ... y_n$), because they are dummy variables representing the categories.

The inputs are a single observation (one row of the dataset) and the output is a single observation as well: if the input is one row of the dataset, the output is for that exact same row.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_4.JPG?raw=true' width='500'><br>
Every time, we are dealing with the values of one row: the different characteristics or attributes (columns) of that one observation (row).

Every synapse is assigned a weight. By adjusting the weights, the neural network decides in every single case which signals are important and which are not. Training an artificial neural network means adjusting the weights of all the synapses across the whole network.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_5.JPG?raw=true' width='500'><br>

Here is what happens inside a neuron:<br>
STEP 1: Compute the weighted sum of all the inputs<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_6.JPG?raw=true' width='500'><br>
STEP 2: Apply the activation function which is assigned to the current neuron (or to the whole layer). <br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_7.JPG?raw=true' width='500'><br>
STEP 3: The neuron decides whether or not to pass the signal on, based on the activation function applied in STEP 2.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_8.JPG?raw=true' width='500'><br>
Finally, these steps are repeated at every neuron throughout the whole neural network.
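
The steps above can be sketched in a few lines of NumPy. This is only an illustration: the input values, the weights, and the choice of a sigmoid activation are hypothetical and not taken from the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, activation):
    """STEP 1: weighted sum of the inputs. STEP 2: apply the activation."""
    weighted_sum = np.dot(weights, inputs)
    return activation(weighted_sum)

# Hypothetical standardized inputs of one observation and their weights
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.4, 0.1, -0.6])

# STEP 3: this value is the signal passed on to the next layer
signal = neuron_output(x, w, sigmoid)
print(signal)
```

The same function is evaluated for every neuron of every layer, each with its own weights.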

### The Activation Function

#### Threshold Function (Yes or No type function)

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_9.JPG?raw=true' width='400'>

#### Sigmoid Function

The sigmoid function is useful in the output layer, especially when we are trying to predict probabilities: its output can be read as the probability that y equals 1.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_10.JPG?raw=true' width='400'>

#### Rectifier Function

The rectifier function is one of the most popular activation functions in artificial neural networks.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_11.JPG?raw=true' width='400'>

#### Hyperbolic Tangent (tanh)

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_12.JPG?raw=true' width='400'>
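
The four activation functions above can be written down directly. The following is a minimal NumPy sketch of their standard definitions; the function names are chosen here for illustration.

```python
import numpy as np

def threshold(z):
    # Yes/no function: 1 if z >= 0, else 0
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    # Smooth curve between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def rectifier(z):
    # 0 for negative inputs, z itself otherwise
    return np.maximum(0.0, z)

def hyperbolic_tangent(z):
    # Like the sigmoid, but ranging from -1 to 1
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
for f in (threshold, sigmoid, rectifier, hyperbolic_tangent):
    print(f.__name__, f(z))
```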

#### Which activation function should we use?

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_24.JPG?raw=true' width='500'><br>
When the Dependent Variable (y) is binary (y = 0 or 1):<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_13.JPG?raw=true' width='500'><br>

We can also apply the rectifier function in the hidden layers and the sigmoid function in the output layer<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_14.JPG?raw=true' width='500'><br>

### How do Neural Networks Work

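Each neuron in a hidden layer focuses on its own aspect of the data: it can take different input values and use a different activation function, and the outputs of all the neurons are finally combined to produce the result.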

The hidden layers increase the flexibility of the neural network: each neuron can look for one very specific thing, and the network then uses these findings in combination.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_15.JPG?raw=true' width='500'>

### How do Neural Networks Learn

A neural network learns on its own by comparing its results with the actual values.

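The cost function $C = \frac{(\hat{y} - y)^2}{2}$ measures the difference between the predicted value and the actual value, and this difference is fed back to the neural network. The network then adjusts its weights to reduce the cost function.<br>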
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_16.JPG?raw=true' width='500'>

When multiple rows are fed into the same ONE neural network (not 8 neural networks, just the same one), it produces a $\hat{y}$ for each row.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_17.JPG?raw=true' width='500'><br>
Each $\hat{y}$ is then compared with the actual value ($y$) of the same row, giving $C= \sum \frac{(\hat{y} - y)^2}{2}$. This cost function (C) is used to adjust $w_1, w_2, w_3$ in the network (the weights are the same for all of the rows; all the rows share the same weights) until the cost function is minimized.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_18.JPG?raw=true' width='500'>
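
As a small worked example of the summed cost $C= \sum \frac{(\hat{y} - y)^2}{2}$, the sketch below uses three made-up rows; the predicted and actual values are hypothetical.

```python
import numpy as np

def cost(y_hat, y):
    """C = sum over all rows of (y_hat - y)^2 / 2."""
    return np.sum((y_hat - y) ** 2 / 2)

y = np.array([1.0, 0.0, 1.0])      # actual values, one per row
y_hat = np.array([0.8, 0.3, 0.6])  # predictions from the same network

total_cost = cost(y_hat, y)
print(total_cost)  # (0.04 + 0.09 + 0.16) / 2
```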

### Gradient Descent

All the rows of the dataset are fed into the same neural network to obtain the cost function.<br><img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_20.JPG?raw=true' width='400'>

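Differentiating the cost function gives its slope. A negative slope means we are going downhill, so the weights are adjusted to move $\hat{y}$ to the right; a positive slope means we are going uphill, so the weights are adjusted to move $\hat{y}$ to the left. This continues until the slope is 0, where the cost function is at a minimum (possibly only a local minimum).<br>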
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_19.JPG?raw=true' width='400'>
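
The idea can be illustrated with the simplest possible model. The sketch below assumes a single weight $w$ with $\hat{y} = w x$, toy data, and a learning rate of 0.05; none of these come from the notes above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # toy data generated with the weight 2

def cost(w):
    return np.sum((w * x - y) ** 2 / 2)

def slope(w):
    # Derivative of the cost with respect to w
    return np.sum((w * x - y) * x)

w = 0.0
learning_rate = 0.05
for _ in range(100):
    # Negative slope: w moves right. Positive slope: w moves left.
    w -= learning_rate * slope(w)

print(w, cost(w))
```

With this data the slope is $14(w - 2)$, so each step shrinks the distance to the minimum at $w = 2$ by a constant factor.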

### Stochastic Gradient Descent

**Batch Gradient Descent:** Adjust the weights only after running all of the rows through the same neural network. Then run the whole dataset again (one iteration) and repeat until the cost function is minimized.

**Stochastic Gradient Descent:** Run one row at a time and adjust the weights after each row, then run the next single row through the same neural network, and so on until the cost function is minimized.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_21.JPG?raw=true' width='400'>

Stochastic gradient descent helps avoid getting stuck in local minima, because its updates fluctuate much more, and it is usually faster than batch gradient descent.

**Mini Batch Gradient Descent:** Combines batch and stochastic gradient descent. Run 5, 10 or 100 rows (a number set by the user) at a time, then adjust the weights as in stochastic gradient descent.
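
The three variants differ only in how many rows are run before each weight update. The sketch below illustrates this with a plain linear model rather than a neural network; the `train` helper, the toy data, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np

def train(X, y, batch_size, epochs=500, learning_rate=0.2, seed=0):
    """Fit y = X @ w, updating the weights once per batch of rows."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))            # shuffle the rows each epoch
        for start in range(0, len(X), batch_size):
            rows = order[start:start + batch_size]
            error = X[rows] @ w - y[rows]
            gradient = X[rows].T @ error / len(rows)
            w -= learning_rate * gradient          # one weight update per batch
    return w

# Toy data generated from the weights [3, -1]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([3.0, -1.0])

w_batch = train(X, y, batch_size=len(X))  # batch gradient descent
w_sgd = train(X, y, batch_size=1)         # stochastic gradient descent
w_mini = train(X, y, batch_size=2)        # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)
```

On this noise-free toy problem all three settings recover the same weights; on real data they trade off update cost against how noisy each update is.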

##### Reference:
[A Neural Network in 13 lines of Python (Part 2 - Gradient Descent)](https://iamtrask.github.io/2015/07/27/python-network-part2/)

### Back Propagation

The information is entered into the input layer and then it's propagated forward to get the output values $\hat{y}$. Then we compare $\hat{y}$ to the actual values that we have in the training set. Then we calculate the errors (cost function).<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_22.JPG?raw=true' width='400'><br>
The errors are then fed back (back-propagated) through the network in the opposite direction, and **this allows us to train the network by adjusting ALL the weights AT THE SAME TIME.**<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_23.JPG?raw=true' width='400'><br>
The advantage of back propagation lies in the way the algorithm is structured: it tells you which part of the error each weight in the neural network is responsible for.
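
To make "which part of the error each weight is responsible for" concrete, here is a minimal sketch for a network with one hidden layer, sigmoid activations, and the cost $C = \frac{(\hat{y} - y)^2}{2}$. The layer sizes, weights, input, and learning rate are hypothetical; the gradients follow from the chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    hidden = sigmoid(W1 @ x)      # hidden-layer activations
    y_hat = sigmoid(w2 @ hidden)  # network output
    return hidden, y_hat

def cost(x, y, W1, w2):
    return (forward(x, W1, w2)[1] - y) ** 2 / 2

def backward(x, y, W1, w2):
    """Gradient of the cost with respect to every weight."""
    hidden, y_hat = forward(x, W1, w2)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)          # error at the output node
    grad_w2 = delta_out * hidden
    delta_hidden = delta_out * w2 * hidden * (1 - hidden)  # error sent back to each hidden node
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_w2

x, y = np.array([0.5, -1.0]), 1.0
W1 = np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]])  # 2 inputs -> 3 hidden nodes
w2 = np.array([0.3, -0.1, 0.2])                        # 3 hidden nodes -> 1 output

grad_W1, grad_w2 = backward(x, y, W1, w2)

# All the weights are adjusted at the same time
learning_rate = 0.5
W1_new = W1 - learning_rate * grad_W1
w2_new = w2 - learning_rate * grad_w2
print(cost(x, y, W1, w2), cost(x, y, W1_new, w2_new))
```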

##### Reference:
[Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap2.html)

### Training the ANN with Stochastic Gradient Descent

**STEP 1:** Randomly initialise the weights to small numbers close to 0 (but not 0).<br>
**STEP 2:** Input the first observation (first row) of the dataset in the input layer, each feature in one input node. (basically take the columns and put them into the input nodes)<br>
**STEP 3:** Forward Propagation - from left to right, the neurons are activated in a way that the impact of each neuron's activation is limited by the weights (the weights determine how important each neuron's activation is). Propagate the activations until getting the predicted result $\hat{y}$<br>
**STEP 4:** Compare the predicted result to the actual result. Measure the generated error (Cost function)<br>
**STEP 5:** Back Propagation - from right to left, the error is back-propagated. Update the weights according to how much they are responsible for the error (which the structure of the back-propagation algorithm lets us determine). The learning rate decides by how much we update the weights.<br>
**STEP 6:** Repeat STEPs 2-5 (the weights are initialised only once) and update the weights after each observation (online learning, referred to here as Reinforcement Learning; in our case, it is Stochastic Gradient Descent). Or: repeat STEPs 2-5 but update the weights only after a batch of observations (Batch Learning).<br>
**STEP 7:** When the whole training set has passed through the ANN, that makes an epoch. Redo more epochs. (Basically, keep repeating so that the network trains better and better and adjusts itself until the cost function is at a minimum.)

The number of epochs is the number of times the learning algorithm works through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. An epoch consists of one or more batches. For example, training in which each epoch has a single batch is the batch gradient descent algorithm described above.
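
The relation between rows, batch size, and weight updates can be checked with simple arithmetic. In the sketch below, the 8000 training rows are an assumed figure for illustration, and `updates_per_epoch` is a helper invented for the example.

```python
import math

def updates_per_epoch(n_rows, batch_size):
    """Number of batches, and therefore weight updates, in one epoch."""
    return math.ceil(n_rows / batch_size)

n_rows = 8000  # assumed size of the training set

print(updates_per_epoch(n_rows, batch_size=n_rows))  # batch gradient descent
print(updates_per_epoch(n_rows, batch_size=1))       # stochastic gradient descent
print(updates_per_epoch(n_rows, batch_size=10))      # mini-batch gradient descent
```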

### ANN in Python

#### Data Preprocessing

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head()

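| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |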


In [3]:
# Independent variables: CreditScore,Geography,Gender,Age,
# Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
# ANN will determine which independent variable will be more important.
X = dataset.iloc[:,3:13].values
y = dataset.loc[:,'Exited'].values

We have some Categorical variables in our matrix of features. Therefore, we need to encode them (OneHotEncoder)

In [4]:
# Encode the categorical data in the dataset
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# In the dataset, only 'Geography' (column 1) and 'Gender' (column 2) need to be encoded.
# 'Gender' has two categories, so a 0/1 label encoding is enough.
labelencoder_X_2 = LabelEncoder()
X[:,2] = labelencoder_X_2.fit_transform(X[:,2])

# The categorical variables are NOT ordinal
# (there is no relational order between the categories)
# For example, France is NOT higher than Germany
# So we need to create dummy variables for 'Geography'.
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer replaces it and also places the dummy columns first.
ct = ColumnTransformer([('geography', OneHotEncoder(), [1])], remainder='passthrough')
# Convert the result to a numeric matrix.
X = np.array(ct.fit_transform(X), dtype=float)

# Remove one dummy variable to avoid falling into the dummy variable trap.
X = X[:,1:]


<font color='red'>**We need to apply feature scaling for deep learning models in general (both ANN and CNN)**</font><br>

Training involves a lot of compute-intensive calculations, often run in parallel.<br>
Feature scaling makes these calculations easier.

In [5]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler as ss
sc = ss()
X = sc.fit_transform(X)

In [6]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

#### Create the Artificial Neural Network Model

In [7]:
# Importing the Keras libraries and package
import keras  # keras will build neural networks based on TensorFlow

Using TensorFlow backend.


We need to import two modules here:
* The sequential module - it is required to initialize the neural network
* The dense module - it is required to build the layers of ANN

In [8]:
from keras.models import Sequential
from keras.layers import Dense

##### Initialising the ANN

In [9]:
# We don't need to pass any arguments,
# because we will define the layers step by step
classifier = Sequential()

##### Adding the input layer and the 1st hidden layer
**Use .add(Dense(...)) to create or add each layer.**<br>

Arguments of the Dense function:
* units: the number of output nodes of the layer being added. Since this is the first hidden layer, it is the number of nodes in that hidden layer.
* kernel_initializer: use a uniform function to initialize the weights (STEP 1)
* activation: the activation function used in the hidden layer (best option - Rectifier Function)
* input_dim: the number of nodes in the input layer, which is also the number of independent variables <br>

Note: when Dense is called for the first time, no layer exists yet, so the first hidden layer has no information about the input layer. The number of input nodes must therefore be stated explicitly. Once the first hidden layer exists, further hidden layers can be added without specifying it.

**A rule of thumb: choose the number of nodes in the hidden layer as the average of the number of nodes in the input layer and the number of nodes in the output layer**
* $Nodes_{hidden} = \frac{Nodes_{input} + Nodes_{output}}{2}$
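
As a worked instance, taking the 11 input nodes (input_dim=11) and the single output node used for this dataset in the cells that follow:

$Nodes_{hidden} = \frac{11 + 1}{2} = 6$

which is the value passed as units=6 for the hidden layers.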

The best option for the hidden-layer activation function is the **Rectifier Function** (based on experiments & research)<br>
The 2nd option is the **Sigmoid Function**, which allows us to get the probabilities of the different segments.<br>
Note: the sigmoid function is only for an output layer with a single node (a binary outcome). We use the softmax function if the output has more than 2 categories<br>

Hidden Layer: Rectifier Function; Output Layer: Sigmoid Function

In [10]:
classifier.add(Dense(activation="relu", input_dim=11, units=6, kernel_initializer="uniform"))



##### Adding the second hidden layer
Because a layer already exists, input_dim no longer needs to be specified.<br>
The second hidden layer could use another activation function, but since the Rectifier Function is the best choice, we use it again.

In [11]:
classifier.add(Dense(activation="relu", units=6, kernel_initializer="uniform"))

##### Adding the output layer
The output layer has only 1 node, so units = 1<br>
We need the probability that a customer leaves or stays with the bank, <br>
so we use the sigmoid function as the activation function<br>

In [12]:
classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))

##### Compiling the ANN
'Compile' means that we apply Stochastic Gradient Descent to the whole ANN<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_26.JPG?raw=true' width='600'><br>
* optimizer: the algorithm we want to use to find the optimal set of weights in the neural network (there are several stochastic gradient descent algorithms; a very efficient and widely used one is 'adam')
* loss: this corresponds to the loss function. The loss function is much the same as the one for logistic regression.<br><img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_27.JPG?raw=true' width='600'>
* metrics: a criterion we choose to evaluate the model (typically the 'accuracy' criterion). This argument must be a list.

Note: since the activation function of the output layer is the sigmoid function and we use the adam stochastic gradient descent algorithm, we use the 'Logarithmic Loss' as well. <br>
* if the dependent variable has a binary outcome then the logarithmic loss function is called <b>binary_crossentropy</b>.
* if the dependent variable has more than 2 outcomes like three categories, then the logarithmic loss function is called <b>categorical_crossentropy</b>.

In [13]:
classifier.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

##### Fitting the ANN to the Training set
Two new arguments in the .fit function:
* batch_size: In STEP 6, we can choose to update the weights either after each observation/row (online learning, referred to here as Reinforcement Learning) or after a batch of observations (Batch Learning). The batch size is the number of observations after which we want to update the weights
* epochs: the number of epochs. As in STEP 7, one epoch is one pass of the whole training set through the ANN. In practice, training an ANN consists of applying STEPs 2-6 over many epochs

Note: there is no rule for choosing batch_size and epochs; it takes experimentation.<br>

During training, the accuracy gradually increases from epoch to epoch.

In [14]:
classifier.fit(X_train,y_train,batch_size=10,epochs=100)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x2d95595c630>

#### Making the Predictions and Evaluating the Model

##### Predicting the Test set results
The .predict method returns a probability.<br>
Since the prediction is a probability, we need a threshold to convert it to 1/0.<br>
We choose 50% as the threshold here.

In [15]:
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

##### Making the confusion matrix

In [16]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm

array([[1544,   51],
       [ 224,  181]], dtype=int64)
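
The usual summary metrics can be read directly off this matrix. The sketch below copies the four counts from the output above and relies on scikit-learn's convention that rows are the actual classes and columns the predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[1544, 51],
               [224, 181]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all test rows predicted correctly
precision = tp / (tp + fp)       # of the customers predicted to leave, how many did
recall = tp / (tp + fn)          # of the customers who left, how many were found

print(round(accuracy, 4), round(precision, 2), round(recall, 2))
```

The low recall shows that the model misses more than half of the customers who actually leave, even though the overall accuracy looks high.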

##### Making Classification Report

In [17]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1595
           1       0.78      0.45      0.57       405

    accuracy                           0.86      2000
   macro avg       0.83      0.71      0.74      2000
weighted avg       0.85      0.86      0.85      2000



## Convolutional Neural Networks

### What are convolutional neural networks?

##### Reference:
[Understanding Convolutional Neural Networks in One Article (in Chinese)](https://easyai.tech/ai-definition/cnn/)

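Convolutional Neural Networks (CNNs) are best at processing images. They were inspired by the human visual system.

CNNs have 2 key characteristics:
1. They can effectively reduce an image with a large amount of data to a small amount of data
2. They can effectively preserve the features of the image, in line with the principles of image processing

CNNs are now widely used in many fields, such as face recognition, autonomous driving, the Meitu (美图秀秀) photo app, and security.

#### What problems do CNNs solve?
Before CNNs, images were a hard problem for artificial intelligence, for 2 reasons:
1. The amount of image data to process is too large, which makes processing costly and inefficient
2. It is hard to preserve the original features of an image when digitizing it, which leads to low accuracy in image processing

These 2 problems are explained in detail below:<br>

**Too much data to process**<br>
An image is made of pixels, and each pixel is made of colors. Nowadays almost any picture is $1000×1000$ pixels or larger, and each pixel has 3 RGB values to represent its color. To process a $1000×1000$ pixel image, we need to handle 3 million values: $1000×1000×3=3,000,000$

**The first problem CNNs solve is "simplifying a complex problem": reducing a large number of parameters to a small number before processing them.**<br>
More importantly, in most scenarios this reduction does not affect the result. For example, shrinking a 1000-pixel image to 200 pixels does not stop the human eye from telling whether the picture shows a cat or a dog, and the same is true for a machine.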

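**Preserving image features**<br>
A simplified view of the traditional way of digitizing an image looks like the process below:<br><img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/a401a747520e93cdcbd80ef3cf5c8ac6a120ee03/Part%208%20-%20Deep%20Learning/dl_28.JPG?raw=true' width='500'><br>
If a cell with a circle is 1 and a cell without a circle is 0, then a circle in a different position produces a completely different data representation. But visually, **the content (essence) of the image has not changed; only the position has.**

So when we move an object within an image, the parameters obtained in the traditional way differ greatly! This does not meet the requirements of image processing.

**CNNs solve this problem: they preserve the features of the image in a way similar to vision, so when an image is flipped, rotated, or shifted, the network can often still recognize it as a similar image.**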

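#### How human vision works
Human vision works as follows: it starts with the intake of the raw signal (the pupil takes in pixels), followed by preliminary processing (certain cells in the cerebral cortex detect edges and orientations), then abstraction (the brain determines that the shape of the object in front of it is round), and then further abstraction (the brain further determines that the object is a balloon). Below is an example of how the human brain recognizes faces:<br>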
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_30.JPG?raw=true' width='400'><br>
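We can see that the features at the lowest level are basically similar: they are all kinds of edges. The higher the level, the more features specific to the type of object can be extracted (wheels, eyes, torsos, etc.). At the top level, the different high-level features are combined into the corresponding image, which lets humans distinguish different objects accurately.

A natural question follows: can we imitate this characteristic of the human brain and build a multi-layer neural network in which the lower layers recognize elementary image features, several lower-level features make up a higher-level feature, and the combination of multiple levels finally lets the top layer make the classification?

This is also the inspiration for many deep learning algorithms, including CNNs.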

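#### A typical CNN consists of 3 parts:

* Convolutional layer: extracts the local features of the image
* Pooling layer: greatly reduces the number of parameters (dimensionality reduction)
* Fully connected layer: similar to a traditional neural network, it outputs the desired result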

### Step 1 - Convolution Operation

<b>Convolution:</b> how the shape of one function is modified as it passes over another function.<br> $(f*g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau) \; d\tau$

The convolutional layer works as shown below: a feature detector (also called a kernel or filter) scans the whole image:<br>
$I_{00} \times C_{00} + I_{01} \times C_{01} + I_{02} \times C_{02} + I_{03} \times C_{03} + I_{10} \times C_{10} + \dots$<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/2019-06-19-juanji.gif?raw=true' width='300'><br>
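We can understand this process as using a filter (a pattern, i.e. a known feature) to filter each small region of the image and obtain a feature value for that region. The result is a matrix containing all the feature values: the Feature Map (labeled Convolved Feature in the image above). **The larger a value in the Feature Map, the more closely that image patch matches the filter (the known feature).** Comparing against known features is how the network works out what the image shows.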

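In the animation above, the stride is 1: the filter moves only 1 cell to the right at each step. The stride can also be 2 (moving 2 cells to the right), in which case the filter also moves down 2 rows rather than 1 when it goes to the next row. After convolution, the feature map is smaller than the image (the larger the stride, the smaller the feature map), which makes it easier and faster to process.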
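
The scanning procedure can be sketched in NumPy as follows. The 5x5 image and 3x3 filter are illustrative values, not read from the animation, and `convolve2d` is a helper written for this example. It keeps only the positions where the filter fits entirely inside the image and, as CNN libraries do, it does not flip the filter.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products."""
    k_rows, k_cols = kernel.shape
    out_rows = (image.shape[0] - k_rows) // stride + 1
    out_cols = (image.shape[1] - k_cols) // stride + 1
    feature_map = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            patch = image[r:r + k_rows, c:c + k_cols]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel)               # stride 1: 3x3 feature map
feature_map_s2 = convolve2d(image, kernel, stride=2)  # stride 2: 2x2 feature map
print(feature_map)
print(feature_map_s2)
```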

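If the stride takes the filter beyond the edge of the image, the out-of-range part is ignored and only the values that fall inside the filter window are used in the calculation.<br>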
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_32.JPG?raw=true' width='150'>

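In practice there are usually multiple kernels. If we design n kernels, we get n feature maps, and these n feature maps make up the first convolutional layer. Through training, the network decides which features are more important. We can think of the image as containing n low-level texture patterns, meaning that n basic patterns are enough to describe an image. Below is an example of 25 different kernels:<br>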
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/2019-06-19-150926.jpg?raw=true' width='100'>

### Step 1(b) - ReLU Layer

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_31.JPG?raw=true' width='400'><br>
After building the convolutional layer, we apply the rectifier function. The goal is to increase the non-linearity of the network: images are highly non-linear, but when we apply mathematical operations such as convolution and run the feature detectors to create the feature maps, we risk introducing some linearity.

### Step 2 - Pooling

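Pooling layer (downsampling): reduces the data dimensionality and helps avoid overfitting<br>
Simply put, the pooling layer is downsampling, which greatly reduces the dimensionality of the data. The process looks like this:<br><img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/2019-06-19-chihua.gif?raw=true' width='400'><br>
In the image above, the original image is $20×20$; we downsample it with a $10×10$ sampling window and end up with a $2×2$ feature map.<br>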

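As with convolution, the stride can be chosen freely. If the stride takes the window beyond the edge of the image, the out-of-range part is ignored and only the values that fall inside the window are used in the calculation.<br>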
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_32.JPG?raw=true' width='150'>

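The reason for doing this is that the image is still large even after convolution (because the kernel is relatively small), so we downsample to reduce the data dimensionality.

**Compared with the convolutional layer, the pooling layer reduces the data dimensionality more effectively. This not only greatly reduces the amount of computation but also helps avoid overfitting.**

When the subject of an image is nearer or farther, squeezed or stretched, or seen from the front or the side, our neural network needs a certain degree of flexibility to still be able to find the features. <br>
Pooling reduces the dimensionality effectively while letting the neural network keep its ability to recognize the image.

There are several kinds of pooling:
* Max Pooling - takes the maximum value in the sampling window
* Average Pooling (subsampling) - takes the average value in the sampling window

For example, suppose the '4' in the image below is a nose. After Max Pooling, even if the position of the nose changes, the sampling window still captures the nose information ('4').<br>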
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_33.JPG?raw=true' width='400'>
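
Max pooling with a 2x2 window and a stride of 2 can be sketched as follows. The feature map values are illustrative, and `max_pool` is a helper written for this example; windows that run past the edge simply use the values that remain inside the feature map.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum of each sampling window."""
    rows, cols = feature_map.shape
    pooled = []
    for r in range(0, rows, stride):
        # NumPy slicing truncates windows that extend past the edge
        pooled.append([feature_map[r:r + size, c:c + size].max()
                       for c in range(0, cols, stride)])
    return np.array(pooled)

feature_map = np.array([
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 2, 1],
    [1, 4, 2, 1, 0],
    [0, 0, 1, 2, 1],
])

pooled = max_pool(feature_map)
print(pooled)
```

The '4' survives pooling, and it would still survive if it sat in any other cell of the same window.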

Applying pooling to the feature maps produces the Pooling Layer<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_34.JPG?raw=true' width='400'><br>
##### Reference:
[Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition](http://ais.uni-bonn.de/papers/icann2010_maxpool.pdf)

### Step 3 - Flattening

Convert the resulting image matrix (the Pooled Feature Map) into a single vector.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_35.JPG?raw=true' width='300'><br>
If the Pooling Layer consists of multiple Pooled Feature Maps, each Pooled Feature Map is likewise converted into a vector, and the vectors are joined end to end into one huge vector.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_36.JPG?raw=true' width='300'>
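
In NumPy terms, flattening reads each map row by row and concatenates the results. The two 3x3 pooled feature maps below are made-up values.

```python
import numpy as np

# Two hypothetical pooled feature maps
pooled_maps = [
    np.array([[1, 1, 0], [4, 2, 1], [0, 2, 1]]),
    np.array([[0, 3, 1], [2, 0, 0], [1, 1, 5]]),
]

# Flatten each map row by row, then join them into one long vector
vector = np.concatenate([m.flatten() for m in pooled_maps])
print(vector.shape, vector)
```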

### Step 4 - Full Connection

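The fully connected part is similar to an Artificial Neural Network, except that its hidden layers are called Full Connection Layers. It outputs the final prediction.<br>
The data, reduced in dimension by the convolutional and pooling layers, is fed into the fully connected layers to obtain a prediction. The loss function measures the error, which is then back-propagated: the weights in the Artificial Neural Network are adjusted, and so are the Feature Detectors (in case the network is looking for the wrong features). Forward and back propagation are then repeated.<br>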
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_37.JPG?raw=true' width='400'><br>
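During training, the predictions are continually compared with the actual results, and each output learns which neurons (with values between 0 and 1) "belong" to it. In the image below, over repeated training the model finds that when the nodes on the purple lines have high values (such as 0.9), 'Dog' is the correct result, so the connections from these nodes to 'Dog' are given high weights. Likewise, the model finds that when the green nodes have high values the result is more often 'Cat', so the green nodes are given high weights towards 'Cat'.<br>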
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_39.JPG?raw=true' width='400'><br>
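A typical CNN is not just the 3-layer structure described above but a multi-layer structure. For example, the structure of LeNet-5 is shown below:<br>
Convolutional layer - pooling layer - convolutional layer - pooling layer - convolutional layer - fully connected layer<br>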
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/2019-06-19-lenet.png?raw=true' width='600'>

### Summary

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_40.JPG?raw=true' width='600'>

### Softmax & Cross-Entropy

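Cross-Entropy is only for classification. For regression and similar tasks, use the Mean Squared Error (MSE) instead.

The raw Dog/Cat output values do not sum to 1; they can be any two values, one larger and one smaller. Applying the SoftMax function turns them into two values that sum to 1.<br>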
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_41.JPG?raw=true' width='400'><br>
Cross-Entropy Function: $L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \rightarrow H(p,q) = - \sum_x p(x)\log(q(x))$<br>
After applying the SoftMax function, we use Cross-Entropy as the loss to minimize in order to optimize the network. In the image below, the predicted probability of Dog and the actual label for Dog (1 for dog, 0 for cat) are plugged into the Cross-Entropy.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_42.JPG?raw=true' width='400'>
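
Both functions are short enough to write out. In the sketch below, the raw Dog/Cat output values are hypothetical; `p` is the actual distribution (1 for dog, 0 for cat) and `q` the predicted one, matching $H(p,q)$ above.

```python
import numpy as np

def softmax(z):
    exp = np.exp(z - np.max(z))  # subtracting the maximum avoids overflow
    return exp / exp.sum()

def cross_entropy(p, q):
    """H(p, q) = -sum over x of p(x) * log(q(x))."""
    return -np.sum(p * np.log(q))

raw = np.array([2.0, 1.0])  # hypothetical raw outputs for [Dog, Cat]
q = softmax(raw)            # predicted probabilities, summing to 1
p = np.array([1.0, 0.0])    # the image really is a dog

loss = cross_entropy(p, q)
print(q, loss)
```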

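##### Why Cross-Entropy is preferred over MSE
Cross-Entropy detects the improvement of the neural network after an adjustment more sensitively and precisely. At the very start of training, when the network has very little information, the output value for the correct class may be tiny and improve only from, say, $10^{-6}$ to $10^{-3}$. MSE hardly registers this as progress, but the $\log$ in Cross-Entropy shows that the network has improved enormously. Because other loss functions wrongly treat the improvement as too small, they won't guide the back propagation in the right direction.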
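
The difference can be checked numerically. The sketch below assumes the correct class has target value 1 and that the predicted probability improves from $10^{-6}$ to $10^{-3}$, then compares the relative drop in squared error with the relative drop in cross-entropy.

```python
import math

before, after = 1e-6, 1e-3  # predicted probability of the correct class

# Squared error against the target value 1
se_before = (1 - before) ** 2
se_after = (1 - after) ** 2

# Cross-entropy for the correct class
ce_before = -math.log(before)
ce_after = -math.log(after)

se_drop = (se_before - se_after) / se_before  # about 0.2 %
ce_drop = (ce_before - ce_after) / ce_before  # 50 %
print(f"squared error falls by {se_drop:.2%}, cross-entropy by {ce_drop:.2%}")
```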

##### Reference:
[A Friendly Introduction to Cross-Entropy Loss](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/)

### CNN in Python

#### Image Preprocessing

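We use Keras to import the image data, so the folders have to be created by hand: train_set and test_set, each containing one sub-folder per class (such as dogs, cats).<br>
As a result, we do not need the usual data preprocessing and train_test_split steps (the split has already been done by hand).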

#### Build the Convolutional Neural Network

In [18]:
# Importing the Keras libraries and packages

# The Sequential package will initialize the neural network
# 2 ways of initializing a neural network either as a sequence of layers
# or as a graph (CNN is still a sequence of layers)
from keras.models import Sequential

# since we work on images which are in 2D (video is 3D with time)

# The convolution step to add convolution layer
from keras.layers import Conv2D
# To process the pooling step and add the pooling layers
from keras.layers import MaxPooling2D
# Convert all the pooled feature maps into the large feature vector
# which becomes the input of the fully connected layers.
from keras.layers import Flatten
# To add the fully connected layers and the artificial neural network
from keras.layers import Dense

##### Initialising the Convolutional Neural Network

In [19]:
classifier = Sequential()

##### Adding Layers

**STEP 1:** Convolution

Note: the Dense function is used to add the fully connected layers at the final stage, as in the ANN. So we do not use the Dense function to build the convolutional layer here.

Conv2D arguments (the names in parentheses are the older Keras 1 `Convolution2D` names):
* filters (nb_filter): the number of feature detectors we will use (this produces the same number of feature maps)
* kernel_size (nb_row, nb_col): the number of rows and columns of the feature detector
* padding (border_mode): how the feature detectors handle the borders of the input image (the default is 'valid', i.e. no padding)
* input_shape: the shape of the input image. Images come in different formats and sizes, so here we convert all the images to one single format and one fixed size.<br>
<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_44.JPG?raw=true' width='400'><br>
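A color image is a 3D array, so in input_shape = (256,256,3) the 3 stands for 3 channels (color; 1 channel would be black and white) and the two 256s stand for 256x256 pixels. Together, (256,256,3) means a 256x256 color image. When processing images on a CPU we can choose 64x64 (enough to tell the classes apart); on a GPU we can choose 128x128.

Convolution2D(64,3,3), or Conv2D(64,(3,3)) in Keras 2, means using 64 feature detectors of size 3x3.

A common practice is to start with 32 feature detectors for the 1st convolutional layer and to increase the number progressively (64, 128, 256...) when adding further convolutional layers.

We use the rectifier function as the activation function to increase non-linearity.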

In [20]:
classifier.add(Conv2D(32,(3,3),input_shape=(64,64,3),activation='relu'))

**STEP 2:** Pooling

MaxPooling2D arguments:
* pool_size: the size of the pooling window (usually 2x2, which does not lose too much information)

In [21]:
classifier.add(MaxPooling2D(pool_size=(2,2)))

**Adding 2nd Convolutional Layer and Pooling Layer**

This improves the accuracy of the CNN.

Adding a new convolutional layer:
* No input_shape is needed: it is only required for the first layer, when no previous layer exists to provide the input size.
* Need the number of feature detectors (32), the dimensions of these feature detectors (3,3), an activation function

Adding a new max pooling layer:
* Only need pool size parameter

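Note: the usual approach is to use the same number of feature detectors (32) in the second layer as in the first. If further convolutional layers are added later (a third and beyond), double the number of feature detectors each time (64 next, then 128...).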

In [22]:
# we only need input_shape if we don't have anything previously.
classifier.add(Conv2D(32,(3,3),activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2,2)))

**STEP 3:** Flattening

Q: Why not flatten the original image directly into one huge vector?
Each node of the huge vector would then represent one pixel of the image independently of the pixels around it. We would only get information about the pixel itself, not about how this pixel is spatially connected to the pixels around it.

Each feature map corresponds to one specific feature of the image, so each node of the huge vector that holds a high number represents a specific feature, a specific detail of the input image.

In [23]:
classifier.add(Flatten())
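
The length of the flattened vector can be estimated by tracing the image size through the layers added so far. This is a sketch that assumes 'valid' padding, stride 1 convolutions, and non-overlapping 2x2 pooling (the Keras defaults); `conv_valid` and `pool` are hypothetical helper names.

```python
def conv_valid(size, kernel=3):
    """Output side length after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool(size, window=2):
    """Output side length after non-overlapping pooling (remainder is dropped)."""
    return size // window


size = 64
size = pool(conv_valid(size))   # 64 -> 62 -> 31
size = pool(conv_valid(size))   # 31 -> 29 -> 14
flattened_length = size * size * 32  # 32 feature maps in the last conv layer
print(size, flattened_length)   # 14 6272
```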

**STEP 4:** Full Connection

Use the Dense function to create the fully connected layers, just like in the ANN.

For the first fully connected layer, units is the number of outputs of that layer, i.e. the number of nodes in the hidden layer that follows the flattened input.

Rule of thumb: units = number of nodes in the first hidden layer = (input nodes + output nodes) / 2

Here the input nodes are the huge flattened vector, so we simply pick a large value for units (such as 128, a power of 2).
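
The arithmetic behind this choice can be sketched as follows. The input size of 14 * 14 * 32 is an assumption derived from the two convolution/pooling blocks above with default Keras settings; `dense_param_count` is an illustrative helper, not a Keras function.

```python
input_nodes = 14 * 14 * 32   # assumed length of the flattened vector
output_nodes = 1             # single sigmoid output


def dense_param_count(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


rule_of_thumb = (input_nodes + output_nodes) / 2
print(rule_of_thumb)                        # 3136.5
print(dense_param_count(input_nodes, 128))  # 802944
```

The rule of thumb would suggest more than 3000 hidden nodes, which is why a smaller, experimentally chosen value such as 128 is used instead.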

In [24]:
classifier.add(Dense(activation="relu", units=128))
# Add output layer
# For a binary outcome (2 categories) use sigmoid; for more than 2 use softmax.
# Only 1 output is needed (dog or cat)
classifier.add(Dense(activation="sigmoid", units=1))
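
The difference between the two output activations can be illustrated with a small standalone sketch. The input scores below are made-up numbers, and the functions are plain Python re-implementations written for illustration, not the Keras ones.

```python
import math


def sigmoid(z):
    """Squash one score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn several scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


p = sigmoid(0.8)                  # binary case: one output node
label = 1 if p > 0.5 else 0       # threshold at 0.5 to pick a class
probs = softmax([2.0, 1.0, 0.1])  # multi-class case: one node per class
print(label, probs.index(max(probs)))  # 1 0
```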

<img src='https://github.com/yunjcai/Machine-Learning-A-Z/blob/master/Part%208%20-%20Deep%20Learning/dl_45.JPG?raw=true' width='1000'>

##### Compiling the CNN

With more than 2 outcomes we would need categorical_crossentropy as the loss function. The outcome here is binary (cat or dog), so we use binary_crossentropy.

In [25]:
classifier.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
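
For intuition about the loss chosen here, the following sketch computes binary cross-entropy by hand on a few made-up predictions. It is a simplified illustration (the clipping constant `eps` is an assumption), not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
good = binary_crossentropy(labels, [0.9, 0.1, 0.8])  # confident and correct
bad = binary_crossentropy(labels, [0.2, 0.7, 0.4])   # mostly wrong
print(good < bad)  # True
```

Predictions close to the true labels give a small loss, which is what the optimizer tries to reach.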

#### Fitting the CNN to the images

[ImageDataGenerator](https://keras.io/preprocessing/image/)<br>
Deep Learning needs a lot of data to learn from. A CNN, for example, needs either a large number of training images or a trick such as ImageDataGenerator.<br>

ImageDataGenerator creates many batches of our images and, in each batch, applies random transformations (such as rotating, flipping, shifting, and shearing) to a random selection of the images. During training we therefore see many more diverse images inside these batches, which gives the model a lot more material to train on.

Image augmentation is a technique that enriches the training set without adding more images and helps reduce overfitting.<br>
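
The idea of augmentation can be sketched without Keras on a tiny grayscale "image" stored as nested lists. This only mimics two of the options used later (`rescale` and `horizontal_flip`); the function names, the 50% flip probability, and the sample pixels are assumptions made for this illustration.

```python
import random


def rescale(image, factor=1.0 / 255):
    """Map pixel values from 0..255 to 0..1."""
    return [[pixel * factor for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def augment(image, rng):
    """Rescale, then flip with probability 0.5 (assumed here for illustration)."""
    out = rescale(image)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


rng = random.Random(0)
image = [[0, 128, 255], [255, 128, 0]]
batch = [augment(image, rng) for _ in range(4)]  # 4 variants from 1 source image
print(len(batch))  # 4
```

Each pass over the data can therefore present a slightly different version of the same source image.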

Because the training and test images are already sorted into separate folders, we use '.flow_from_directory(directory)'.<br>
The fit_generator method fits the CNN to the training set and, at the same time, evaluates it on the test set at the end of each epoch.

Copy the following code from the Keras documentation page:

In [26]:
from keras.preprocessing.image import ImageDataGenerator

In [27]:
# Copied from the Keras documentation page
# (modified; the original lines are commented out)
 
train_datagen = ImageDataGenerator(rescale=1. / 255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

#train_generator = train_datagen.flow_from_directory(
training_set = train_datagen.flow_from_directory(# 'data/train',
                                                 # from which directory
                                                 'dataset/training_set',
                                                 # the size of our images
                                                 # target_size=(150, 150),
                                                 # since we use 64x64 for our image
                                                 target_size=(64, 64),
                                                 batch_size=32,
                                                 # it indicates the dependent variable
                                                 # is binary or has more than 2 categories
                                                 class_mode='binary')

#validation_generator = test_datagen.flow_from_directory(
test_set = test_datagen.flow_from_directory(#'data/validation',
                                            'dataset/test_set',
                                            #target_size=(150, 150),
                                            target_size=(64, 64),
                                            batch_size=32,
                                            class_mode='binary')

# Our model is called 'classifier'
#model.fit_generator(#train_generator,
classifier.fit_generator(#train_generator,
                        # Here's the training set
                        training_set,
                        # Number of batches drawn per epoch. The notes set this to
                        # the number of training images (8000); in Keras 2 it counts
                        # batches, so with batch_size=32 a single pass over the
                        # 8000 images takes 8000 / 32 = 250 steps.
                        #steps_per_epoch=2000,
                        steps_per_epoch=8000,
                        # number of epochs we want to choose to train the CNN
                        # 50 maybe too much, let's try 25
                        #epochs=50,
                        epochs=25,
                        # Here's the test set
                        #validation_data=validation_generator,
                        validation_data=test_set,
                        # Number of validation batches per epoch (the notes use the number of test images)
                        #validation_steps=800)
                        validation_steps=2000)

Found 8000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Epoch 1/25
 663/8000 [=>............................] - ETA: 41:09 - loss: 0.6135 - acc: 0.6562

KeyboardInterrupt: 
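
The long ETA in the log above is consistent with `steps_per_epoch` counting batches rather than images, as it does in Keras 2. The sketch below shows the arithmetic for one full pass over each set with the batch size used in these notes; `steps_for_one_pass` is a hypothetical helper name.

```python
import math


def steps_for_one_pass(num_images, batch_size):
    """Batches needed to see every image once (the last batch may be smaller)."""
    return math.ceil(num_images / batch_size)


train_steps = steps_for_one_pass(8000, 32)
val_steps = steps_for_one_pass(2000, 32)
images_per_epoch_as_written = 8000 * 32  # steps_per_epoch=8000 with batch_size=32
print(train_steps, val_steps, images_per_epoch_as_written)  # 250 63 256000
```

Passing `steps_per_epoch=250` and `validation_steps=63` would give one pass over the data per epoch; the larger values used above still work, but each epoch then processes far more augmented batches and takes much longer.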

### Way to Improve the Accuracy

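We need to make the Deep Learning model deeper (a deeper CNN):
* Add another Convolutional Layer (best solution)
* Add another Full Connection Layer

Both can also be added at the same time.

The new Convolutional Layer goes after the previous Pooling Layer.

Note: as a rule of thumb, the second convolutional layer uses the same number of feature detectors as the first (32). If a third convolutional layer is added later, double the number of feature detectors (64 for the next one, 128 for the one after that...).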
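
The rule in the note above can be written down as a small helper that lists the number of feature detectors per convolutional layer. This is only a sketch of that rule of thumb; the function name and the `base` parameter are made up for illustration.

```python
def feature_detectors_per_layer(num_conv_layers, base=32):
    """First two layers use `base` detectors, every later layer doubles the count."""
    counts = []
    for i in range(num_conv_layers):
        if i < 2:
            counts.append(base)
        else:
            counts.append(base * 2 ** (i - 1))
    return counts


print(feature_detectors_per_layer(4))  # [32, 32, 64, 128]
```

Each value would be passed as the first argument of the corresponding `Conv2D(...)` layer, with a `MaxPooling2D` layer after each one.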


##### A Better Approach
Adding more convolutional layers will help accuracy, but for a larger improvement we can choose a higher "target_size" in

* train_datagen.flow_from_directory(... target_size=(64, 64), ...)

* test_datagen.flow_from_directory(... target_size=(64, 64), ...)

for the images of the training and test sets, so that the network gets more information about the pixel patterns. The input_shape of the first Conv2D layer must be changed to match the new size.
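
One way to keep the image size consistent is to define it once and reuse it everywhere. The sketch below only builds the keyword arguments, so it runs without Keras; `IMG_SIZE = 128` is a hypothetical new size, and the commented lines show how the dictionaries might be passed to the objects defined earlier in these notes.

```python
IMG_SIZE = 128  # hypothetical larger size; the notes above use 64

# Arguments for the first convolutional layer and for both generators
conv_kwargs = {"input_shape": (IMG_SIZE, IMG_SIZE, 3), "activation": "relu"}
flow_kwargs = {"target_size": (IMG_SIZE, IMG_SIZE), "batch_size": 32, "class_mode": "binary"}

# Intended use (not executed here):
#   classifier.add(Conv2D(32, (3, 3), **conv_kwargs))
#   training_set = train_datagen.flow_from_directory('dataset/training_set', **flow_kwargs)
#   test_set = test_datagen.flow_from_directory('dataset/test_set', **flow_kwargs)

values_per_image = IMG_SIZE * IMG_SIZE * 3
print(values_per_image, values_per_image // (64 * 64 * 3))  # 49152 4
```

Doubling the side length gives the network four times as many input values per image, which is also why training becomes slower.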