# Chapter 8: Deep Learning #

Deep learning models and what they are typically used for:
- Artificial Neural Networks for Regression and Classification
- Convolutional Neural Networks for Computer Vision
- Recurrent Neural Networks for Time Series Analysis
- Self Organizing Maps for Feature Extraction
- Deep Boltzmann Machines for Recommendation Systems
- Auto Encoders for Recommendation Systems

The concept is to mimic how a brain learns.

## 1. Artificial Neural Networks ##

Topics of interest:
- **Neuron**
    <br>Modelled on the biological neuron with its dendrites (inputs) and axon (output): it takes inputs, does a simple operation (a weighted sum), applies an activation function and gives an output.
    <br>*Interesting reading about Neurons can be Efficient Backprop by LeCun 1998*
- **Activation function**
    <br>Different types are threshold (0 if $x < 0$, else 1), sigmoid ($\frac{1}{1 + e^{-x}}$), rectifier ($\max(0, x)$), hyperbolic tangent ($\frac{1 - e^{-2x}}{1 + e^{-2x}}$).
    <br>*Interesting reading about Activation functions can be Deep Sparse Rectifier neural networks by Glorot 2011*
- **How do Neural Networks work**
    <br>Each neuron does a very simple operation on all the inputs coming from the previous layer; by combining these operations layer after layer, the network produces an output.
- **How do they learn**
    <br>Thanks to the cost function, they compare $\hat{y}$ and $y$ and then readjust the weights. An epoch is one round of training with the full dataset.
    <br> *Interesting list of cost functions can be found in CrossValidated (2015)*
- **Gradient descent**
    <br>How are the weights adjusted?
    <br>The first instinct would be brute force, but it is far too computationally intensive, especially given the curse of dimensionality: each input has a weight in each node, so the number of combinations to try rapidly becomes impossible.
    <br>Hence the gradient descent method, which uses the derivative (slope) of the cost function to decide in which direction to adjust the weights. The problem is that it needs the cost function to be convex: if it is not, gradient descent can very well converge towards a local minimum instead of the global one.
- **Stochastic Gradient descent**
    <br>This is where stochastic gradient descent comes in: it does not require the cost function to be convex. Unlike batch gradient descent, which adjusts the weights once after the whole batch, it adjusts the weights after each row. Batch gradient descent is deterministic; stochastic gradient descent is partly random, so the cost fluctuates more, but those fluctuations also make it more likely to escape a local minimum and find the global one. It does not take longer either, since it does not need to load the full dataset into memory before each update.
    <br> *Interesting reading can be A Neural Network in 13 lines of Python by Trask (Part 2 - Gradient Descent) (2015)*
    <br> *Something else a bit harder, Neural Networks and Deep Learning by Michael Nielsen (chap 2) (2015)*
- **Backpropagation**
    <br>The error is propagated back through the network from the output layer. Its key strength is that it adjusts all the weights at the same time.
    <br> *Something to read, Neural Networks and Deep Learning by Michael Nielsen (chap 2) (2015)*
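
The notes above (neuron, activation function, cost function, batch versus stochastic gradient descent) can be put together in a small sketch. It is only an illustration under stated assumptions: a single sigmoid neuron, a squared-error cost, and a made-up toy dataset (logical OR). The function names, learning rate and number of epochs are arbitrary choices, not something taken from the course material.

```python
import math
import random


def sigmoid(x):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(weights, bias, inputs):
    """One neuron: weighted sum of the inputs, then the activation function."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def cost(weights, bias, rows):
    """Average of 0.5 * (y_hat - y)^2 over all rows."""
    return sum(0.5 * (neuron(weights, bias, x) - y) ** 2 for x, y in rows) / len(rows)


def gradient(weights, bias, x, y):
    """Gradient of 0.5 * (y_hat - y)^2 for one row, using the chain rule."""
    y_hat = neuron(weights, bias, x)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)
    return [delta * xi for xi in x], delta


def train(rows, epochs, learning_rate, stochastic, seed=0):
    """Train the neuron; one epoch is one pass over the full dataset."""
    rng = random.Random(seed)
    weights, bias = [0.0] * len(rows[0][0]), 0.0
    for _ in range(epochs):
        if stochastic:
            # Stochastic GD: adjust the weights after every single row
            shuffled = rows[:]
            rng.shuffle(shuffled)
            for x, y in shuffled:
                grad_w, grad_b = gradient(weights, bias, x, y)
                weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
                bias -= learning_rate * grad_b
        else:
            # Batch GD: average the gradient over all rows, then adjust once
            sum_w, sum_b = [0.0] * len(weights), 0.0
            for x, y in rows:
                grad_w, grad_b = gradient(weights, bias, x, y)
                sum_w = [s + g for s, g in zip(sum_w, grad_w)]
                sum_b += grad_b
            weights = [w - learning_rate * s / len(rows) for w, s in zip(weights, sum_w)]
            bias -= learning_rate * sum_b / len(rows)
    return weights, bias


# Toy data (made up): the target is the logical OR of the two inputs
rows = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

initial_cost = cost([0.0, 0.0], 0.0, rows)
batch_w, batch_b = train(rows, epochs=200, learning_rate=1.0, stochastic=False)
sgd_w, sgd_b = train(rows, epochs=200, learning_rate=1.0, stochastic=True)

print("initial cost:", initial_cost)
print("batch GD cost:", cost(batch_w, batch_b, rows))
print("stochastic GD cost:", cost(sgd_w, sgd_b, rows))
```

With the same number of epochs, the stochastic version performs one weight update per row (four per epoch here) while the batch version performs one per epoch, which is the difference described above.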

Business problem example:
- Churn prediction on a sample of bank customers with different characteristics, plus a variable named 'exited' that indicates whether they left the bank in the following 6 months

Available libraries are Theano, TensorFlow and Keras

Process is:
+ **Step 1**: Classic classification problem
+ **Step 2**: Encode any categorical data
+ **Step 3**: Apply feature scaling so that all features are on the same scale, which eases the calculations
+ **Step 4**: Import Keras and packages
+ **Step 5**: Initialise the ANN
+ **Step 6**: Create the layers
+ **Step 7**: Compile
+ **Step 8**: Predict
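
Steps 2 and 3 can be sketched in plain Python as follows. This is a minimal illustration, not the preprocessing used in the course: the column names and values are made up, and in practice a library such as scikit-learn would normally handle both steps.

```python
import math


def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def fit_scaler(column):
    """Compute the mean and standard deviation of a numeric column."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return mean, std


def scale(column, mean, std):
    """Standardise a column: subtract the mean, divide by the standard deviation."""
    return [(v - mean) / std for v in column]


# Hypothetical sample: one categorical and one numeric feature
countries = ["France", "Spain", "Germany", "France"]
balances = [0.0, 1000.0, 2000.0, 1000.0]

encoded = one_hot(countries)        # columns ordered France, Germany, Spain
mean, std = fit_scaler(balances)    # fitted on the training data only
scaled = scale(balances, mean, std)

print(encoded)
print(scaled)
```

The mean and standard deviation are computed once on the training data and then reused to scale the test data, so that both sets are on the same scale.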

In [1]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [None]:
# Initialising the ANN
classifier = Sequential()

In [None]:
# Adding the input layer and the first hidden layer
# (Keras 2 argument names: units replaces output_dim, kernel_initializer replaces init)
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

**NB**: How many nodes in each hidden layer?
- a rule of thumb is to take the average of the number of nodes in the input layer and the output layer, here (11 + 1) / 2 = 6

**NB2**: What's the equivalent of sigmoid for classification problem with more than 2 categories?
- it's called softmax

In [None]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)  # epochs replaces nb_epoch in Keras 2

In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
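
The thresholding in the last line turns the sigmoid probabilities into True/False predictions. A minimal sketch of that step, followed by an accuracy check, is given below; the probabilities and true labels are made up for illustration and the helper names are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities into True/False class labels."""
    return [p > threshold for p in probabilities]


def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Made-up probabilities and true labels, for illustration only
probabilities = [0.91, 0.12, 0.48, 0.73, 0.05]
actual = [True, False, True, True, False]

predicted = to_labels(probabilities)
print(predicted)                    # [True, False, False, True, False]
print(accuracy(predicted, actual))  # 0.8
```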

## 2. Convolutional Neural Networks ##

Topics of interest:
- **What is a CNN**
    <br>It classifies images, for example. A grayscale image can be treated as a 2D array with one 0-255 value per pixel, and a colour image as 3 such layers (RGB), each with 0-255 values. For example, given a face, it can classify its emotion.
    <br>*Interesting reading is Gradient Based Learning applied to Document Recognition by Yann LeCun 1998*
- **Convolution operation**
    <br>You apply a feature detector (like a 3x3 matrix of booleans) to a bigger image (like a 14x14 image). This gives a feature map whose values show where the feature detector caught positive signals. The important part is that the feature map is smaller than the original image. We are losing information, but we detect features, which is what matters. We also create many different feature maps, one per feature detector, to obtain our first convolution layer.
    <br>*Interesting reading is Introduction to Convolutional Neural Networks by Jianxin Wu 2017 see http://cs.nju.edu.cn/wujx/*
- **ReLU**
    <br>This function helps break up linearity: it removes all the negative values to focus on the positive signals.
    <br>*Interesting reading is Understanding Convolutional Neural Network with a Mathematical model by C C Jay Kuo 2016*
    <br>*Interesting reading is Delving deep into rectifiers: Surpassing Human Level Performance by Kaiming He 2015*
- **Pooling**
    <br>When you try to recognize a pattern or a specific thing, the model has to have spatial invariance, i.e. it should not care where in the image the feature appears nor, more importantly, whether the feature is a bit tilted or warped in some way. This is where the different kinds of pooling come in. With max pooling, for example, you slide a window over each subpart of the feature map and keep only its maximum. You keep the important information through the max and reduce the size again. We obtain a pooled feature map. It helps account for any possible spatial distortion. It also helps prevent overfitting because it doesn't keep extra information.
    <br>*Interesting reading and fairly easy to read is Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition by Dominik Scherer 2010*
    <br> *Interesting link: http://scs.ryerson.ca/~aharley/vis/conv/flat.html*
- **Flattening**
    <br>It basically just puts all the values of the pooled feature maps into a single column, which becomes the input layer of the ANN.
- **Full Connection**
    <br>A hidden layer in the new ANN is called a fully connected layer, as each node is connected to all the nodes of the previous and next layers. The role of those layers is to combine our features into more elaborate features and into attributes that are good predictors of a class. What differs from the previous ANN is that we have multiple output nodes at the end: you need one output for each class. After one forward pass, an error is computed and then backpropagated through the network.
- **Summary**
    <br>*Interesting reading is The 9 deep learning papers you need to know about by Deshpande in 2016*
- **Softmax and Cross Entropy**
    <br>The values of the n output nodes at the end do not have to add up to one on their own. They do because we apply a function called Softmax. Cross entropy is a cost function used to assess the output of a NN; Mean Squared Error is another one. Cross entropy is preferable to MSE for classification: for example, when the predicted values are really small, an improvement barely changes the MSE, so gradient descent takes a lot more time with the MSE.
    <br>*Interesting video called the softmax function*
    <br>*Interesting reading is A Friendly introduction to Cross Entropy loss by DiPietro 2016*
    <br>*Interesting reading is How to implement a NN intermezzo 2 by Roelants 2016*
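
The softmax and cross-entropy notes above can be illustrated with a short sketch. It assumes a single sample with a one-hot target; the raw scores and the two "bad prediction" examples are made up. The last part compares how much each cost function rewards the same improvement when the probability of the correct class is very small.

```python
import math


def softmax(scores):
    """Turn raw output values into probabilities that add up to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_index):
    """Cross-entropy cost for one sample whose true class is target_index."""
    return -math.log(probabilities[target_index])


def mean_squared_error(probabilities, target_index):
    """Mean squared error against the one-hot encoded true class."""
    return sum(
        (p - (1.0 if i == target_index else 0.0)) ** 2
        for i, p in enumerate(probabilities)
    ) / len(probabilities)


raw_scores = [2.0, 1.0, 0.1]  # made-up values of three output nodes
probs = softmax(raw_scores)
print(probs, sum(probs))

# Two bad predictions for true class 0: the second is 1000 times better
before = [0.000001, 0.999999]
after = [0.001, 0.999]

mse_gain = mean_squared_error(before, 0) - mean_squared_error(after, 0)
ce_gain = cross_entropy(before, 0) - cross_entropy(after, 0)
print("MSE improvement:", mse_gain)
print("Cross-entropy improvement:", ce_gain)
```

The MSE hardly moves between the two predictions while the cross entropy drops markedly, which is why cross entropy gives gradient descent a stronger signal in that regime.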

In [None]:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# Initialising the CNN
classifier = Sequential()

In [None]:
# Step 1 - Convolution
classifier.add(Convolution2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))  # Keras 2: kernel size given as a tuple

**Convolution step**: pick the number of feature maps we are looking for (here 32) and specify the input shape (size and number of channels) as well as the activation function. This step is important because it extracts the spatial information: not only the value of each pixel but also what is around each pixel.
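
To make the operation concrete, here is a plain-Python sketch of what one feature detector does, followed by the ReLU activation. It assumes a stride of 1 and no padding, and it uses a made-up 5x5 image and a made-up 3x3 vertical-edge detector; in the Keras layer above the 32 detectors are learned during training rather than chosen by hand.

```python
def convolve2d(image, detector):
    """Slide the feature detector over the image (stride 1, no padding)
    and return the resulting feature map."""
    k = len(detector)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Element-wise product of the detector and the image patch, summed
            total = sum(
                image[r + i][c + j] * detector[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Replace every negative value by 0."""
    return [[max(0, v) for v in row] for row in feature_map]


# Made-up 5x5 image and 3x3 detector
image = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
detector = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d(image, detector)
print(feature_map)        # [[-2, 0, 2], [-2, 0, 2], [-2, 0, 2]]
print(relu(feature_map))  # [[0, 0, 2], [0, 0, 2], [0, 0, 2]]
```

The 5x5 image becomes a 3x3 feature map, and each value depends on a pixel together with its neighbours.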

In [None]:
# Step 2 - Pooling
classifier.add(MaxPooling2D(pool_size = (2, 2)))

**Pooling step**: We keep the important information but divide the height and width of the feature maps by 2. If we didn't do that, we would have too many input nodes after the Flattening step and the model would run for ages. Using a 2 by 2 window is the norm for max pooling. There are other kinds of pooling.
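
A minimal sketch of 2 by 2 max pooling in plain Python, assuming non-overlapping windows (the window moves by its own size) and a made-up 4x4 feature map:

```python
def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Made-up 4x4 feature map
feature_map = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 feature map becomes a 2x2 pooled feature map that still holds the strongest signal of each region.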

In [None]:
# Adding a second convolutional layer
classifier.add(Convolution2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

**Second Convolution step**: The first time we ran our model, with only one convolution step, there was a significant difference between the accuracy on the training set and on the test set. This means that we had overfitting. A good way to reduce that and to make our model run faster is to add a second convolution step. Here we don't need to specify the input shape, as Keras infers it from the output of the previous layer.

In [None]:
# Step 3 - Flattening
classifier.add(Flatten())

**Flattening step**: This step consists of transforming all the information contained in the pooled feature maps into a single input vector.
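
The size of that input vector can be worked out by hand. The sketch below assumes the Keras defaults for the layers used above (3x3 convolutions without padding and with a stride of 1, 2x2 pooling that drops any incomplete window); the helper names are hypothetical, and the `flatten` function is a simplified illustration whose value ordering is not meant to match Keras.

```python
def conv_output(size, kernel=3):
    """Width/height after a convolution with stride 1 and no padding."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Width/height after non-overlapping pooling (incomplete windows dropped)."""
    return size // pool


def flatten(feature_maps):
    """Put the values of all feature maps into one flat list."""
    return [v for fmap in feature_maps for row in fmap for v in row]


size = 64                  # input images are 64x64
size = conv_output(size)   # 62 after the first convolution
size = pool_output(size)   # 31 after the first pooling
size = conv_output(size)   # 29 after the second convolution
size = pool_output(size)   # 14 after the second pooling

n_inputs = size * size * 32  # 32 pooled feature maps of 14x14
print(n_inputs)              # 6272

print(flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Without the pooling steps the flattened vector would be far larger, which is the reason given above for pooling before flattening.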

In [None]:
# Step 4 - Full connection
classifier.add(Dense(units = 128, activation = 'relu'))   # units replaces output_dim in Keras 2
classifier.add(Dense(units = 1, activation = 'sigmoid'))

**Full connection step**: In this step we create a classic ANN with the input vector we designed in the above steps. Be careful with the output dimensions and choose the activation functions well. For the first output dimension, a power of 2 is common practice. ReLU is the classic hidden-layer activation function and sigmoid gives a probability output; if we had more than 2 categories we would need softmax. The last output dimension is 1 because we only have 2 categories; otherwise we would need 1 dimension for each category.

In [None]:
# Compiling the CNN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

**Compiling step**: We need to pick the optimizer, the loss function and a list of metrics used to measure the performance of our model. For more than two categories we use 'categorical_crossentropy' as the loss function.

In [None]:
# Part 2 - Fitting the CNN to the images

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

test_datagen = ImageDataGenerator(rescale = 1./255)

training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')

test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')

# Keras 2 counts batches, not samples, hence the division by the batch size of 32
classifier.fit_generator(training_set,
                         steps_per_epoch = 8000 // 32,
                         epochs = 25,
                         validation_data = test_set,
                         validation_steps = 2000 // 32)

**Preprocessing the images step**: We do this to prevent overfitting. As we do not have a lot of images, the first generator is used to create more examples for the training set: it applies small random changes to the images (shear, zoom, horizontal flip) and therefore augments the training set. Many more transformations of this kind can be found in the Keras documentation. What's important here is to pick the right target size, a small enough batch size and the number of epochs.
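
As a rough idea of what the augmentation does to each image, here is a plain-Python sketch of two of the transformations, rescaling and random horizontal flip, applied to a made-up 2x3 grayscale image. It is an illustration only, not how Keras implements them; shear and zoom are left out, and the 50% flip probability is an assumption of this sketch.

```python
import random


def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by the factor, mapping 0-255 values to 0-1."""
    return [[pixel * factor for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image, rng):
    """Rescale the image and flip it horizontally half of the time."""
    out = rescale(image)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


# Made-up 2x3 grayscale image
image = [
    [0, 128, 255],
    [255, 128, 0],
]

rng = random.Random(0)
# Each call may return a different variant of the same original image
variants = [augment(image, rng) for _ in range(4)]
for variant in variants:
    print(variant)
```

Because every epoch sees slightly different versions of the same pictures, the model is less likely to memorise the training images.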

Finally, if you want better accuracy, you can add more convolution layers and increase the target size to extract more information from the images.