# Introduction to Neural Networks

The widespread adoption of artificial intelligence in recent years has been largely driven by advances in neural networks. Neural networks are behind systems ranging from 
<a href="https://deepmind.com/research/alphago/">AlphaGo</a>, 
<a href="https://translate.google.com/">Google Translate</a> 
to <a href="https://www.tesla.com/en_HK/autopilot">Tesla Autopilot</a>.

A neural network is fundamentally numeric computation, so any software with decent numeric computation capabilities can be used to construct and train one. That said, while in theory you can construct a neural network in Excel, in practice it would be very troublesome since Excel is not designed with neural networks in mind. Libraries that are specifically geared toward neural networks include:
- Google's <a href="https://www.tensorflow.org/">TensorFlow</a>
- Facebook's <a href="http://pytorch.org/">PyTorch</a> 
- Microsoft's <a href="https://github.com/Microsoft/CNTK">CNTK</a> (discontinued)  
- Intel's <a href="https://ai.intel.com/neon/">neon</a> (discontinued)
- <a href="http://deeplearning.net/software/theano/">Theano</a> and <a href="http://caffe.berkeleyvision.org/">Caffe</a> (discontinued)

In this course we will focus on using <a href="https://keras.io/">```keras```</a>, which is a high-level library for constructing neural networks. Keras runs on top of a numerical computation library of your choice, defaulting to ```tensorflow```. A library such as Keras significantly simplifies the workflow of constructing and training neural networks.

<img src="../Images/nn_libraries.png" width="80%">

Before we start, we will first disable the server's GPU so that everything runs on its CPU. Later we will turn it back on to see how much of a speedup we can get. This setting has no effect if you do not have an (NVIDIA) GPU.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

## A Simple Example: Binary Neural Network Classifier

As a first example, we will train a neural network on the following classification task:

|y|x1|x2|
|-|-|-|
|0|1|0|
|1|0|1|

To be clear: there is absolutely no need to use a neural network for such a simple task. A simpler model will train a lot faster and potentially with better accuracy.

We first generate the data:

In [None]:
import numpy as np 
from sklearn.model_selection import train_test_split

#Generate 2000 samples. [1,0] -> 0, [0,1] -> 1
X = np.repeat([[1,0]], 1000, axis=0)
y = np.repeat([0], 1000, axis=0)
X = np.append(X,np.repeat([[0,1]], 1000, axis=0),axis=0)
y = np.append(y,np.repeat([1], 1000, axis=0),axis=0)

#Shuffle and split data into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X,y)

The output layer is a single neuron with a ```sigmoid``` activation, so that neuron is essentially a logistic regression.
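
To make the logistic regression connection concrete, here is a minimal NumPy sketch of a single sigmoid neuron trained with plain gradient descent on the same toy data. It is an illustration only, not the Keras model built in the next cell; the learning rate `lr` and the number of steps are arbitrary choices made for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Same toy data as in the course: [1,0] -> 0, [0,1] -> 1
X = np.repeat([[1, 0], [0, 1]], 1000, axis=0).astype(float)
y = np.repeat([0, 1], 1000).astype(float)

# One neuron = one weight per input plus a bias
w = np.zeros(2)
b = 0.0
lr = 0.5  # assumed learning rate, chosen for illustration

for _ in range(200):
    p = sigmoid(X @ w + b)        # predicted probability of class 1
    grad_z = (p - y) / len(y)     # gradient of mean cross-entropy w.r.t. the pre-activation
    w -= lr * (X.T @ grad_z)      # gradient descent step on the weights
    b -= lr * grad_z.sum()        # gradient descent step on the bias

p = sigmoid(X @ w + b)
accuracy = np.mean((p > 0.5) == (y == 1))
print("weights:", w, "bias:", b, "accuracy:", accuracy)
```

The update rule is exactly the one used to fit a logistic regression by gradient descent, which is why a lone sigmoid neuron and a logistic regression are the same model.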

In [None]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Set up layers 


# Set up model


#start training

An out-of-sample test can be conducted with ```model.evaluate()```:

The first number is the model's loss while the subsequent numbers are the metrics we specified. In our case, they are ```binary_crossentropy``` and ```accuracy``` respectively.
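
As a rough guide to what those two numbers mean, the sketch below computes binary cross-entropy and accuracy by hand in NumPy. The labels and predicted probabilities are made up for illustration, and the 0.5 threshold and the clipping constant `eps` are assumptions of this sketch rather than values taken from the course code.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean negative log-likelihood of the true labels under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def accuracy(y_true, p, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    return float(np.mean((p > threshold).astype(int) == y_true))

# Made-up labels and predicted probabilities
y_true = np.array([0, 1, 1, 0])
p = np.array([0.1, 0.9, 0.6, 0.4])

loss = binary_crossentropy(y_true, p)
acc = accuracy(y_true, p)
print("loss:", round(loss, 4), "accuracy:", acc)
```

Note that the loss rewards confident correct predictions, while accuracy only checks which side of the threshold each prediction falls on. Two models with the same accuracy can therefore have quite different losses.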

Unlike OLS, a neural network's performance can vary across runs, because the weights are initialized randomly and the training data is shuffled. Run the code a few more times and see how the performance varies.

Make predictions (this is called *inference* in machine learning) with ```model.predict()```:

## Activations

Different activation functions can have a profound impact on model performance. Besides ```sigmoid```, which is just a different name for the logistic function, there are other activation functions such as ```tanh``` and ```relu```. ```relu```, which stands for **RE**ctified **L**inear **U**nit, is a particularly common choice due to its good performance.

In [None]:
# Replace 'sigmoid' with 'relu' for the hidden layer


Why is ReLU performing so much better than the logistic function? Let us take a look at the shape of each function:
<img src="../Images/logistic_v_relu.png">
The most prominent feature of the logistic function is that it is bounded between 0 and 1. This means it is virtually flat for large positive or large negative input values, and flat means small gradient. As gradient descent relies on the gradient to learn, a small gradient implies slow learning. ReLU avoids this issue by being linear above zero.
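
The gradient argument can be checked numerically. The following NumPy sketch evaluates the derivative of each function at a few hand-picked input values; the inputs are arbitrary and chosen only to show the flat regions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z)), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])
print("sigmoid gradient:", sigmoid_grad(z))
print("relu gradient:   ", relu_grad(z))
```

The logistic gradient never exceeds 0.25 and is close to zero at both ends, whereas the ReLU gradient stays at 1 for any positive input. ReLU does have a flat region of its own for negative inputs, where its gradient is exactly zero.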

## Neural Network Regression

Next we are going to use a neural network in a regression task. The true data generating process (DGP) is as follows:

$$
y = x^5 -2x^3 + 6x^2 + 10x - 5
$$

The model does not know the true DGP, so it needs to figure out the relationship between $y$ and $x$ from the data.

First we generate the data:

In [None]:
#Generate 1000 samples
X = np.random.rand(1000,1)
y = X**5 - 2*X**3 + 6*X**2 + 10*X - 5

#Shuffle and split data into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X,y)

Then we construct the model:

In [None]:
# Single hidden layer with 100 neurons


We are going to run the model through different settings. The function below contains everything we have coded previously:

In [None]:
import time
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import tensorflow.keras.backend as backend

def polyNN(data,
           hidden_count=100,
           epochs=200,
           batch_size=32,
           activation='relu'):
    
    #Record the start time
    start = time.time()
    
    #Unpack the data
    X_train, X_test, y_train, y_test = data
    
    #Layers
    inputs = Input(shape=(X_train.shape[1],))
    x = Dense(hidden_count, activation=activation)(inputs)
    predictions = Dense(1, activation='linear')(x)

    #Model
    model = Model(inputs=inputs, outputs=predictions)
    model.compile(optimizer='adam',
                  loss='mean_squared_error')
    model.fit(X_train,y_train,epochs=epochs,batch_size=batch_size,verbose=0) #Do not display progress
    
    #Collect and display info
    param_count = model.count_params()
    loss_tr = round(model.evaluate(x=X_train,y=y_train,batch_size=batch_size,verbose=0),4)
    loss_te = round(model.evaluate(x=X_test,y=y_test,batch_size=batch_size,verbose=0),4)
    elapsed = round(time.time() - start,2)    
    print("Hidden count:",str(hidden_count).ljust(5),
          "Parameters:",str(param_count).ljust(6),
          "loss (train,test):",str(loss_tr).ljust(7),str(loss_te).ljust(7),
          "Time:",str(elapsed)+"s",
         )
    
    backend.clear_session()

`clear_session()` is called at the end of the function to clear existing models from memory. This is important if you are working with multiple models&mdash;for example, when you run through different sets of hyperparameters&mdash;to avoid running out of memory.

Now we can easily try out different settings:

In [None]:
data = train_test_split(X,y)

# Try different number of neurons


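Here we see the universal approximation theorem at work: a single hidden layer with enough neurons can approximate a continuous function arbitrarily well, so the more neurons we have, the better the fit.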
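
The effect of adding neurons can also be illustrated without Keras. The sketch below is a deliberately simplified stand-in for training: the hidden-layer weights are drawn at random and frozen, and only the output weights are fitted, using least squares instead of Adam. It reports training error only, so it shows approximation capacity, not generalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same data generating process as in the course
X = rng.random((1000, 1))
y = X**5 - 2*X**3 + 6*X**2 + 10*X - 5

# Fixed random ReLU hidden layer; narrower networks use the first few neurons
max_hidden = 100
W = rng.normal(size=(1, max_hidden))
b = rng.normal(size=max_hidden)
H_all = np.maximum(0.0, X @ W + b)

def train_mse(hidden_count):
    """Fit only the output layer by least squares and return the training MSE."""
    # Hidden activations plus a constant column that plays the role of the output bias
    H = np.hstack([H_all[:, :hidden_count], np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.mean((H @ coef - y) ** 2))

results = {h: train_mse(h) for h in (1, 10, 100)}
for h, mse in results.items():
    print("Hidden count:", str(h).ljust(5), "train MSE:", mse)
```

Because each wider network contains the narrower one, the best achievable training error cannot get worse as neurons are added.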

One trick that can often improve performance: *standardizing* data.

In [None]:
from sklearn import preprocessing
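
Standardizing means subtracting the mean and dividing by the standard deviation of each feature. The NumPy sketch below shows the computation on made-up data; scikit-learn's `preprocessing.StandardScaler` performs the same calculation. The statistics are estimated on the training set only and then reused for the test set, which is a common convention assumed here rather than something taken from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up feature on an arbitrary scale (values between 10 and 60)
X_train = rng.random((750, 1)) * 50 + 10
X_test = rng.random((250, 1)) * 50 + 10

# Estimate the statistics on the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same transformation to both sets
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print("train mean/std:", X_train_std.mean(), X_train_std.std())
print("test mean/std: ", X_test_std.mean(), X_test_std.std())
```

The standardized training data has mean 0 and standard deviation 1 by construction, while the test data is only approximately standardized because it uses the training statistics.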


Now let us run everything again with logistic activation:

In [None]:
# Sigmoid, original data


In [None]:
# Sigmoid, standardized data


Did you notice how the logistic activation function actually performed better than ReLU when the data was not standardized? What we are seeing here is that ReLU is much more sensitive to data standardization than the logistic function. This is a good example of why so much research goes into optimizing the modelling process: every detail matters.

<!--Further reading: <a href="https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79">Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming</a>. These articles also contain links to the most important research papers when
http://deepdish.io/2015/02/24/network-initialization/
-->

### Dropout

As neural networks are highly flexible, they can easily overfit. Dropout is a regularization technique that works by randomly setting the outputs of some neurons to zero during training, thereby forcing the network not to rely too much on any specific neuron or feature. The code below adds a 50% dropout to the hidden layer:

In [None]:
# Sigmoid, standardized data with dropout
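
To show what a dropout layer does to the hidden activations, here is a minimal NumPy sketch of so-called inverted dropout, where the surviving activations are scaled up by `1 / (1 - rate)` so that their expected value is unchanged. The function `dropout` and its arguments are hypothetical names for this illustration, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly a fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((4, 10))  # made-up hidden-layer outputs

dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
print("share of zeros:", np.mean(dropped == 0))
```

With a 50% rate each activation is either zeroed or doubled. At inference time (`training=False`) the activations pass through unchanged.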

## Speed Things Up

Due to its complexity, a neural network trains a lot slower than the other techniques we have covered previously. To speed up training, we can ask Keras to go through more samples before updating the model's parameters by specifying a larger ```batch_size```. Doing so allows Keras to make better use of the CPU's parallel processing capabilities.

Keras' default batch size is 32. We will try 128 instead:

Holding the number of epochs constant, what you should see with a larger batch size is faster training but also larger error. The latter is due to the fact that we are updating the parameters less often, resulting in slower learning. This can be countered by increasing the number of epochs.
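
The trade-off comes down to simple arithmetic: the number of parameter updates per epoch is the number of training samples divided by the batch size, rounded up. The sketch below assumes 750 training samples, which is what `train_test_split` yields from 1000 samples with its default 75/25 split, and 200 epochs as in the function defined earlier.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the training data."""
    return math.ceil(n_samples / batch_size)

n_train = 750  # assumed: 75% of 1000 samples
epochs = 200

updates = {}
for batch_size in (32, 128, 1000):
    per_epoch = updates_per_epoch(n_train, batch_size)
    updates[batch_size] = per_epoch
    print("batch size:", str(batch_size).ljust(5),
          "updates per epoch:", str(per_epoch).ljust(3),
          "total updates:", per_epoch * epochs)
```

Going from a batch size of 32 to 128 cuts the number of updates by a factor of four, so roughly four times as many epochs are needed to perform the same number of updates.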

## Running Model on GPU

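If you have a GPU in your computer, you can now turn it on to see how much it speeds up training. TensorFlow typically reads this setting only when it first initializes, so you may need to restart the kernel before the change takes effect.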

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
polyNN(data,hidden_count=1)

With a GPU you can take advantage of its large number of cores by setting a much larger batch size, such as 1000:

To compensate for the less frequent updates, we can increase the number of epochs:

### MNIST
MNIST is a dataset of 70,000 handwritten digits. It is often used to teach image recognition due to its simplicity.

In [None]:
import tensorflow.keras as keras
from tensorflow.keras.datasets import mnist

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Each sample consists of a 28x28 monochrome image of a handwritten digit stored as a 2D numpy array:

In [None]:
from matplotlib import pyplot as plt
plt.imshow(x_train[0], cmap='gray')
plt.show()

The target is the digit's value:

For a classification task, the common practice is to have one output neuron per class. We can use `keras.utils.to_categorical()` to convert the target value to a one-hot (dummy) vector:
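
The conversion itself is simple enough to sketch in NumPy. The helper `one_hot` below is a hypothetical stand-in written for this illustration, and the labels are made up; it is meant to show the shape of the result, not to replace the Keras utility.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 in the label's position."""
    return np.eye(num_classes, dtype="float32")[labels]

labels = np.array([5, 0, 4])  # made-up digit labels
encoded = one_hot(labels, num_classes=10)

print(encoded)
print("recovered labels:", encoded.argmax(axis=1))
```

Each row has ten entries, one per output neuron, and taking the `argmax` of a row recovers the original label. The same `argmax` trick converts the network's ten output probabilities back into a predicted digit.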

We will use a fully connected network with a single hidden layer of 100 neurons. There are two more preprocessing tasks that we need to handle: flattening the 2D array into 1D and normalizing the features from 0-255 to 0-1:

In [None]:
# Settings
batch_size = 128
epochs = 30
pixel_count = 28 * 28
num_classes = 10 # target classes (0-9)

# The data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each image to a vector


# Normalize features


# convert class vectors to binary class matrices


# Model


# Train and evaluate
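
The two preprocessing steps can be tried out on their own before filling in the cell. The sketch below uses a small batch of random arrays in place of the real MNIST images so that it runs without downloading anything, and it also works out how many parameters a 784-100-10 fully connected network has.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MNIST: 8 random 28x28 images with pixel values 0-255
images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a vector of 784 pixels
pixel_count = 28 * 28
flat = images.reshape(images.shape[0], pixel_count)

# Normalize pixel values from 0-255 to 0-1
scaled = flat.astype("float32") / 255

# Parameters of a 784 -> 100 -> 10 dense network (weights plus biases per layer)
hidden, num_classes = 100, 10
param_count = (pixel_count * hidden + hidden) + (hidden * num_classes + num_classes)

print("flattened shape:", flat.shape)
print("value range:", scaled.min(), "to", scaled.max())
print("parameters:", param_count)
```

Almost all of the parameters sit in the first layer, because every one of the 784 pixels connects to every hidden neuron.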


With just 100 neurons we are able to achieve about 97% accuracy. With a more advanced convolutional network we should be able to do even better:

In [None]:
import tensorflow.keras as keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import backend as K

#Settings
batch_size = 128
num_classes = 10
epochs = 30

# input image dimensions
img_rows, img_cols = 28, 28

# The data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Features have to be in the following shape: (obs, rows, cols, color channels)
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

# Normalize features
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Construct model using sequential syntax
model = Sequential()
model.add(Conv2D(6, kernel_size=(5, 5),
                 activation='relu',
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())

model.add(Dense(120, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Here is a very nice 3D visualization of what is going on inside a trained CNN: https://www.cs.ryerson.ca/~aharley/vis/conv/.
I have set up the model above to resemble the one in the visualization. There are many hyperparameters that you can try adjusting to improve its performance&mdash;the number of layers, the number of filters, the size of the kernel, the type of activation and dropout ratio, etc.
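
When adjusting kernel sizes or the number of filters, it helps to trace how the image shrinks through the network. The pure-Python sketch below follows the layer sizes of the model above, assuming the Keras defaults of no padding and stride 1 for the convolutions and non-overlapping windows for max pooling. The helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel):
    """Output width of a convolution with no padding and stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output width of non-overlapping max pooling."""
    return size // pool

size = 28
size = conv_out(size, 5)  # first 5x5 convolution
size = pool_out(size, 2)  # first 2x2 max pooling
size = conv_out(size, 5)  # second 5x5 convolution
size = pool_out(size, 2)  # second 2x2 max pooling
flat_features = size * size * 16  # 16 filters in the second convolution

# Weights plus biases for each trainable layer
params = {
    "conv1": 5 * 5 * 1 * 6 + 6,
    "conv2": 5 * 5 * 6 * 16 + 16,
    "dense1": flat_features * 120 + 120,
    "dense2": 120 * 100 + 100,
    "output": 100 * 10 + 10,
}
total = sum(params.values())

print("final feature map:", size, "x", size, "-> flattened:", flat_features)
print(params, "total:", total)
```

Despite being a deeper model, this network has fewer parameters than the fully connected one, because convolution filters are shared across all positions in the image.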