---

# Practical Machine Learning with Python
# Chapter 8: From Neural Networks to Deep Learning
## Guillermo Avendaño-Franco 

### HPC Summer Workshop 2019

---

This notebooks is based on a variety of sources, usually other notebooks, the material was adapted to the topics covered during lessons. In some cases, the original notebooks were created for Python 2.x or older versions of Scikit-learn or Tensorflow and they have to be adapted. 

## References

### Books

 * **Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems**, 1st Edition *Aurélien Géron*  (2017)

 * **Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow**, 2nd Edition, *Sebastian Raschka* and *Vahid Mirjalili* (2017)

 * **Deep Learning: A Practitioner's approach**, *Josh Patterson* and *Adam Gibson* 
 
 * **Deep Learning**, *Ian Goodfelow*, *Yoshua Bengio* and *Aaron Courville* (2016)

### Jupyter Notebooks

 * [Yale Digital Humanities Lab](https://github.com/YaleDHLab/lab-workshops)
 
 * Aurelein Geron Hands-on Machine Learning with Scikit-learn 
   [First Edition](https://github.com/ageron/handson-ml)
   [Second Edition (In preparation)](https://github.com/ageron/handson-ml2)
   
 * [A progressive collection notebooks of the Machine Learning course by the University of Turin](https://github.com/rugantio/MachineLearningCourse)
   
 * [A curated set of jupyter notebooks about many topics](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
   
### Videos

 * [Caltech's "Learning from Data" by Professor Yaser Abu-Mostafa](https://work.caltech.edu/telecourse.html)
 

## Setup

This Jupyter notebook was created to run on a Python 3 kernel. Some Ipython magics were used: 

In [3]:
# commands prefaced by a % in Jupyter are called "magic"
# these "magic" commands allow us to do special things only related to jupyter

# %matplotlib inline - allows one to display charts from the matplotlib library in a notebook
# %load_ext autoreload - automatically reloads imported modules if they change
# %autoreload 2 - automatically reloads imported modules if they change
import matplotlib
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [4]:
%load_ext watermark
%watermark

2019-07-29T14:58:42-04:00

CPython 3.7.3
IPython 5.8.0

compiler   : GCC 8.3.0
system     : Linux
release    : 5.0.0-20-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


In [7]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
import scipy
import tensorflow as tf

In [6]:
%watermark -iv

matplotlib 3.0.2
numpy      1.16.2
tensorflow 1.14.0
IPython    5.8.0
scipy      1.2.1
sklearn    0.20.2
pandas     0.23.3



This material was elaborated from a variety of sources. Mostly from John Urbanic's "Deep Learning In An Afternoon".

A very approachable introduction to Neural Networks with descriptive examples.

# From Neural Networks to Deep Learning

Deep Learning is one of the techniques in Machine Learning with most success in a variety of problems. From Classification to Regression. Its ability to account for complexity is remarkable.

We can think about Neural Networks from two origins. 

### Biological Neural Networks

From one side the idea of simulate synapsis in biological Neural Networks and use the knowledge about activation barriers and multiple connectivity as inspiration to create and Artificial Neural Network. 

![Biological to Artificial Neural Networks](fig/bioNN.png)

### Perceptron

The other origin is the idea of Perceptron in Machine Learning.

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. As a binary classifier it can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. 

<img src="fig/perceptron.png" width="700" height=400>

## Complexity of Neural Networks

![Complexity](fig/shallow_and_deep_NN.png)

When the neural nework has no hidden layer it is in fact just a linear classifier or Perceptron. When hidden layers are added the NN is able to account for non-linearity, in the case of multiple hidden layers have what is called a Deep Neural Network.

## In practice how complex it can be

A Deep Learning model can include:
    
 * **Input** (With many neurons)
 * **Layer 1** 
 * ...
 * ...
 * **Layer N**
 * **Output layer** (With many neurons)
    
For example the input can be an image with thousands of pixels and 3 colors for each pixel. Hundreds of hidden layers and the output could also be an image. That complexity is responsible for the computational cost of running such networks.




    

## Basic Neural Network Architecture

![Basic Architecture](fig/basic_architeture.png)

## Neural Network Zoo

From <http://www.asimovinstitute.org/neural-network-zoo>
    
![Neural Network Zoo](fig/NeuralNetworkZoo20042019.png)
    
    

## Activation Function

Each internal neuron receives input from several other neurons, computes the aggregate and propagates the result based on the activation function. 

![Neural Network](fig/nn.png)


![Activation Function](fig/activation_function.png)

Neurons apply activation functions at these summed inputs. 

Activation functions are typically non-linear.

 * The **Sigmoid Function** produces a value between 0 and 1, so it is intuitive when a probability is desired, and was almost standard for many years.
 
 * The **Rectified Linear (ReLU)** activation function is zero when the input is negative and is equal to the input when the input is positive. Rectified Linear activation functions are currently the most popular activation function as they are more efficient than the sigmoid or hyperbolic tangent.
 
     * Sparse activation: In a randomly initialized network, only 50% of hidden units are active.
     
     * Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
     
     * Efficient computation: Only comparison, addition and multiplication.
     
     * There are Leaky and Noisy variants.
     
 * The **Soft Plus** shared some of the nice properties of ReLU and still preserves continuity on the derivative.

## Inference or Foward Propagation

 <table>
  <tr>
    <th><img src="fig/DL1.png" alt="Input" style="width:400px"></th>
    <th><img src="fig/DL2.png" alt="Hidden" style="width:400px"></th>
    <th><img src="fig/DL3.png" alt="Output" style="width:400px"></th>
  </tr>
  <tr>
    <td>Receiving Input</td>
    <td>Computing Hidden Layer</td>
    <td>Computing Output</td>
  </tr>
</table> 


### Reciving Input

 * H1 Weights = (1.0, -2.0, 2.0)
 * H2 Weights = (2.0, 1.0, -4.0)
 * H3 Weights = (1.0, -1.0, 0.0)
 * O1 Weights = (-3.0, 1.0, -3.0)
 * O2 Weights = (0.0, 1.0, 2.0)
 
### Hidden Layer

 * H1 = Sigmoid(0.5 * 1.0 + 0.9 * -2.0 + -0.3 * 2.0) = Sigmoid(-1.9) = .13
 * H2 = Sigmoid(0.5 * 2.0 + 0.9 * 1.0 + -0.3 * -4.0) = Sigmoid(3.1) = .96
 * H3 = Sigmoid(0.5 * 1.0 + 0.9 * -1.0 + -0.3 * 0.0) = Sigmoid(-0.4) = .40
 
### Ouptut Layer

 * O1 = Sigmoid(.13 * -3.0 + .96 * 1.0 + .40 * -3.0) = Sigmoid(-.63) = .35
 * O2 = Sigmoid(.13 * 0.0 + .96 * 1.0 + .40 * 2.0) = Sigmoid(1.76) = .85


In terms of Linear Algebra:

In [2]:
# Hidden Layer Matrix
H=np.array([[1,-2,2],[2,1,-4],[1,-1,0]],dtype=float)
H

array([[ 1., -2.,  2.],
       [ 2.,  1., -4.],
       [ 1., -1.,  0.]])

In [3]:
# Input vector
inp=np.array([0.5,0.9,-0.3]).reshape(3)
inp

array([ 0.5,  0.9, -0.3])

In [4]:
# The Hidden layer operating over the input
Hdotinp=np.dot(H,inp)
Hdotinp

array([-1.9,  3.1, -0.4])

In [5]:
from scipy.special import expit
af=expit(Hdotinp)
af

array([0.13010847, 0.95689275, 0.40131234])

In [6]:
# Output matrix
O=np.array([[-3.0, 1.0, -3.0],[0.0, 1.0, 2.0]])
O

array([[-3.,  1., -3.],
       [ 0.,  1.,  2.]])

In [7]:
OdotAf=np.dot(O,af)
OdotAf

array([-0.6373697 ,  1.75951742])

In [8]:
expit(OdotAf)

array([0.34584136, 0.85314921])

The fact that we are able to describe the problem in terms of Linear Algebra is one of the reasons why Neural Networks are so efficient on GPUs. The same operation as a single execution line looks like this:

In [9]:
expit(np.dot(O,expit(np.dot(H,inp))))

array([0.34584136, 0.85314921])

## Biases

It is also very useful to be able to offset our inputs by some constant. 
You can think of this as centering the activation function, or translating the solution (next slide). 
We will call this constant the bias, and it there will often be one value per layer.

Our math for the previously calculated layer now looks like this with b=0.1

In [10]:
expit(np.dot(O,expit(np.dot(H,inp) + np.full((3,), 0.1))))

array([0.32269997, 0.85959729])

## Accounting for Non-Linearity

Neural networks are so effective in classification and regression due to its ability to combine linear and non-linear operation on each step of the evaluation.

 * The matrix multiply provides the skew and scale.
 * The bias provides the translation.
 * The activation function provides the twist.

## The hard part of Neural Networks: The back propagation

During training, once we have forward propagated the network, we will find that the final output differs from the known output. The weights must change in order to produce better results next time.

How do we know which new weights? to use? 

We want to minimize the error on our training data. 
Given labeled inputs, select weights that generate the smallest average error on the outputs.
We know that the output is a function of the weights: 

\begin{equation}
E(w_1,w_2,w_3,...i_1,...t_1,...)
\end{equation}

So to figure out which way, and how much, to push any particular weight, say w3, we want to calculate

\begin{equation}
\frac{\partial E}{\partial \{w,i,t\}}
\end{equation}

## Backpropagation

If we use the chain rule repeatedly across layers we can work our way backwards from the output error through the weights, adjusting them as we go. Note that this is where the requirement that activation functions must have nicely behaved derivatives comes from.

This technique makes the weight inter-dependencies much more tractable. 
An elegant perspective on this can be found from [Chris Olahat Blog](http://colah.github.io/posts/2015-08-Backprop)

With basic calculus you can readily work through the details. 

You can find an excellent explanation from the renowned [3Blue1Brown](https://www.youtube.com/watch?v=Ilg3gGewQ5U)



## Solving the back propagation efficiently

The explicit solution for back propagation leaves us with potentially many millions of simultaneous equations to solve (real nets have a lot of weights). 

They are non-linear to boot. Fortunately, this isn't a new problem created by deep learning, so we have options from the world of numerical methods.

The standard has been **Gradient Descent** local minimization algorithms.


To improve the convergence of Gradient Descent, refined methods use adaptive **time step** and incorporate **momentum** to help get over a local minimum. **Momentum** and **step size** are the two hyperparameter

The optimization problem that Gradient Descent solves is a local minimization. We don't expect to ever find the actual global minimum. Several techniques have been created to avoid a solution being trapped in a local minima.

We could/should find the error for all the training data before updating the weights (an epoch). However it is usually much more efficient to use a stochastic approach, sampling a random subset of the data, updating the weights, and then repeating with another. This is the **mini-batch Gradient Descent**

## One first example for MNIST

In [11]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)

n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_nodes_hl4 = 500

n_classes = 10
batch_size = 100

x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

def neural_network_model(data):
    hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl1]))}

    hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl2]))}

    hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl3]))}

    hidden_4_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_nodes_hl4])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl4]))}

    output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl4, n_classes])),
                    'biases':tf.Variable(tf.random_normal([n_classes])),}


    l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)

    l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)

    l4 = tf.add(tf.matmul(l2,hidden_4_layer['weights']), hidden_4_layer['biases'])
    l4 = tf.nn.relu(l4)

    
    output = tf.matmul(l4,output_layer['weights']) + output_layer['biases']

    return output

def train_neural_network(x):
    prediction = neural_network_model(x)
    # OLD VERSION:
    #cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(prediction,y) )
    # NEW:
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y) )
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    
    hm_epochs = 5
    with tf.Session() as sess:
        # OLD:
        #sess.run(tf.initialize_all_variables())
        # NEW:
        sess.run(tf.global_variables_initializer())

        for epoch in range(hm_epochs):
            epoch_loss = 0
            for _ in range(int(mnist.train.num_examples/batch_size)):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c

            print('Epoch', epoch, 'completed out of',hm_epochs,'loss:',epoch_loss)

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))

        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        print('Accuracy:',accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))

train_neural_network(x)

W0729 12:00:22.471252 139805703444288 deprecation.py:323] From <ipython-input-11-ae79a13c5956>:3: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
W0729 12:00:22.472553 139805703444288 deprecation.py:323] From /home/gufranco/.local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
W0729 12:00:22.473993 139805703444288 deprecation.py:323] From /home/gufranco/.local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: _internal_retry.<locals>.wrap.<locals>.wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is depreca

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz


W0729 12:00:23.070621 139805703444288 deprecation.py:323] From /home/gufranco/.local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
W0729 12:00:23.072955 139805703444288 deprecation.py:323] From /home/gufranco/.local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.


Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz


W0729 12:00:23.333289 139805703444288 deprecation.py:323] From /home/gufranco/.local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
W0729 12:00:23.472496 139805703444288 deprecation.py:323] From <ipython-input-11-ae79a13c5956>:55: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.



Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Epoch 0 completed out of 5 loss: 1818695.238357544
Epoch 1 completed out of 5 loss: 401376.5926427841
Epoch 2 completed out of 5 loss: 221874.44502973557
Epoch 3 completed out of 5 loss: 128293.66678369045
Epoch 4 completed out of 5 loss: 79404.68307192624
Accuracy: 0.9434


## Convolutional Neural Networks

![CNN](fig/cnn.jpeg)

## The first Neural Network using Tensorflow

In [12]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets(".", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

x_image = tf.reshape(x, [-1,28,28,1])

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting ./train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting ./train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting ./t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting ./t10k-labels-idx1-ubyte.gz


In [13]:
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1,shape=[32]))
h_conv1 = tf.nn.relu(tf.nn.conv2d(x_image, W_conv1,strides=[1, 1, 1, 1], padding='SAME') + b_conv1)
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1,shape=[64]))
h_conv2 = tf.nn.relu(tf.nn.conv2d(h_pool1, W_conv2,strides=[1, 1, 1, 1], padding='SAME') + b_conv2)
h_pool2 = tf.nn.max_pool(h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_fc1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1,shape=[1024]))
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

W_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1,shape=[10]))
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

W0729 12:00:46.085740 139805703444288 deprecation.py:506] From <ipython-input-13-b0b47f973ce1>:19: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [14]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess = tf.InteractiveSession()

sess.run(tf.global_variables_initializer())
for i in range(1000):
  batch = mnist.train.next_batch(50)
  if i%100 == 0:
    train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
    print("step %d, training accuracy %g"%(i, train_accuracy))
  train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))


step 0, training accuracy 0.16
step 100, training accuracy 0.86
step 200, training accuracy 0.82
step 300, training accuracy 0.86
step 400, training accuracy 0.94
step 500, training accuracy 0.96
step 600, training accuracy 0.92
step 700, training accuracy 0.94
step 800, training accuracy 0.94
step 900, training accuracy 0.96
test accuracy 0.9626
