# Activation Functions Lab

## Objective

In this lab, we'll learn about several common activation functions, then compare their effectiveness in an MLP that classifies the MNIST data set!

### Getting Started: What Is An Activation Function?

In your own words, answer the following questions:

**_What purpose do activation functions serve in Deep Learning?  What happens if our neural network has no activation functions?  What role do activation functions play in our output layer? Which activation functions are most commonly used in an output layer?_**

Write your answer below this line:
______________________________________________________________________________________________________________________
**Activation functions allow our Deep Learning models to capture nonlinearity. If ANNs are a symbolic representation of biological neural networks, then activation functions mirror the way neurons fire with different levels of intensity depending on how rapidly they fire. A model with no activation functions would just be a linear model. In the output layer, activation functions make the results of our neural network's forward propagation step interpretable. If the task we are trying to solve is a binary classification task, then we would use a sigmoid neuron, so that we can interpret the result as a probability, much like the output of a logistic regression. If our task is multi-class classification, then we would use a softmax function, which has the network output a vector of probabilities, with each element corresponding to the probability that the observed input belongs to a given class.**
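
To make the "just a linear model" point concrete, here is a minimal NumPy sketch. The layer sizes and random weights are arbitrary choices for illustration, not part of the lab. It shows that two stacked layers with no activation function compute the same mapping as a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function: weights and biases only.
# The shapes (3 -> 4 -> 2) are arbitrary and chosen just for this demo.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers without any nonlinearity.
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))  # True
```

However many such layers are stacked, the result can always be folded into one weight matrix and one bias vector, which is why a nonlinearity between layers is needed.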

For the first part of this lab, we'll only make use of the numpy library.  Run the cell below to import numpy.

In [1]:
import numpy as np

## Writing Different Activation Functions

We'll begin this lab by writing different activation functions manually, so that we can get a feel for how they work.  

### Logistic Sigmoid Function


We'll begin with the **_Sigmoid_** activation function, as described by the following equation:

$$\LARGE \phi(z) = \frac{1}{1 + e^{-z}}  $$

In the cell below, complete the `sigmoid` function. This function should take in a value and compute the result of the equation shown above.  

In [2]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

In [3]:
sigmoid(.458) # Expected Output 0.61253961344091512

0.6125396134409151
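
Because the function is built on `np.exp`, it should also work elementwise on arrays. The sketch below redefines `sigmoid` so it runs on its own, and uses arbitrary sample inputs to check two properties of the equation: outputs stay strictly between 0 and 1, and $\phi(-z) = 1 - \phi(z)$.

```python
import numpy as np

def sigmoid(z):
    # Same formula as above; np.exp broadcasts over arrays.
    return 1 / (1 + np.exp(-z))

# Arbitrary sample inputs, chosen only for illustration.
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
out = sigmoid(z)

print(sigmoid(0.0))                       # 0.5
print(np.all((out > 0) & (out < 1)))      # True
print(np.allclose(sigmoid(-z), 1 - out))  # True
```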

### Hyperbolic Tangent (tanh) Function 

The hyperbolic tangent function is as follows:



$$\LARGE  \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}  $$

Complete the function below by implementing the `tanh` function.  

In [4]:
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

In [10]:
print(tanh(2)) # 0.964027580076
print(np.tanh(2)) # 0.964027580076
print(tanh(0)) # 0.0

0.964027580075817
0.9640275800758169
0.0
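
The two functions are closely related: multiplying the numerator and denominator of the tanh formula by $e^{-z}$ gives $\tanh(z) = 2\phi(2z) - 1$, where $\phi$ is the sigmoid. In other words, tanh is a sigmoid rescaled to the range $(-1, 1)$. A quick numerical check of that identity, with arbitrary sample inputs:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary evenly spaced inputs for the comparison.
z = np.linspace(-3, 3, 13)

# tanh expressed through the sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
tanh_from_sigmoid = 2 * sigmoid(2 * z) - 1

print(np.allclose(tanh_from_sigmoid, np.tanh(z)))  # True
```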


### Rectified Linear Unit (ReLU) Function

The final activation function we'll implement manually is the **_Rectified Linear Unit_** function, also known as **_ReLU_**.  

The ReLU function is:

$$\LARGE  \phi(z) = \max(0, z)  $$

In [11]:
def relu(z):
    return max(0,z)

In [15]:
print(relu(-2)) # Expected Result: 0
print(relu(2)) # Expected Result: 2

0
2
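
Python's built-in `max` compares two scalars, so it raises an error when given a NumPy array with more than one element. In a network, ReLU is applied to a whole layer of pre-activations at once. A minimal array-friendly sketch, using the hypothetical name `relu_vectorized` and arbitrary sample inputs, could rely on `np.maximum`:

```python
import numpy as np

def relu_vectorized(z):
    # np.maximum compares elementwise, so scalars and arrays both work.
    return np.maximum(0, z)

# Arbitrary sample pre-activations.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

print(relu_vectorized(z).tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
```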


### Softmax Function

The **_Softmax Function_** is primarily used as the activation function on the output layer for neural networks for multi-class categorical prediction.  The softmax equation is as follows:

<img src='softmax.png'>

The mathematical notation for the softmax activation function is a bit dense, and softmax is a special case because it is really only used on the output layer. Thus, the code for the softmax function has been provided.  

Run the cell below to compute the softmax function on a sample vector.  

In [16]:
z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
softmax = np.exp(z)/np.sum(np.exp(z))
softmax

array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054,
       0.06426166, 0.1746813 ])

**_Expected Output:_**

array([ 0.02364054,  0.06426166,  0.1746813 ,  0.474833  ,  0.02364054,
        0.06426166,  0.1746813 ])
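
The one-liner above computes $e^{z_i} / \sum_j e^{z_j}$ directly, which can overflow when the inputs are large. A common workaround is to subtract the largest input before exponentiating. This leaves the result unchanged, because the shift cancels between numerator and denominator. The sketch below uses the hypothetical name `stable_softmax`; it is one possible implementation, not the version this lab requires.

```python
import numpy as np

def stable_softmax(z):
    z = np.asarray(z, dtype=float)
    # Shifting by the max keeps every exponent <= 0, so np.exp cannot overflow.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
probs = stable_softmax(z)

print(np.isclose(probs.sum(), 1.0))  # True: the outputs form a probability vector
print(np.argmax(probs))              # 3: the largest input gets the largest probability

# Large inputs (arbitrary values) that would overflow a naive np.exp(z).
large = stable_softmax([1000.0, 1001.0, 1002.0])
print(np.all(np.isfinite(large)))    # True
```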


## Comparing Training Results 

Now that we have written each activation function by hand, we'll gain some practical experience by treating the choice of activation function as a hyperparameter in a neural network and seeing how it affects the performance of the model. Before we can do that, we'll need to preprocess our image data. 

We'll build 3 different versions of the same network, with the only difference between them being the activation function used in our hidden layers.  Start off by importing everything we'll need from Keras in the cell below.

**_HINT:_** Refer to previous labs that make use of Keras if you aren't sure what you need to import

In [18]:
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.datasets import mnist
import matplotlib.pyplot as plt
%matplotlib inline

### Preprocessing Our Image Data

We'll need to preprocess the MNIST image data so that it can be used in our model. 

In the cell below:

* Load the training and testing data and their corresponding labels from MNIST.  
* Reshape the data inside `X_train` and `X_test` into the appropriate shape (from a 28x28 matrix to a vector of length 784).  Also cast them to datatype `float32`.
* Normalize the data inside of `X_train` and `X_test`
* Convert the labels inside of `y_train` and `y_test` into one-hot vectors (Hint: see the [documentation](https://keras.io/utils/#to_categorical) if you can't remember how to do this).

In [19]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [23]:
X_train = X_train.reshape(60000,784).astype('float32')
X_test = X_test.reshape(10000,784).astype('float32')

In [24]:
X_train /= 255
X_test /= 255

In [25]:
y_train = keras.utils.to_categorical(y_train,10)
y_test = keras.utils.to_categorical(y_test,10)
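
To see what these preprocessing steps do without loading MNIST, here is a small NumPy-only sketch on synthetic data. The random "images" and the three labels are made up for illustration, and `np.eye(10)[labels]` is used as a stand-in that should give the same kind of one-hot rows as `to_categorical`.

```python
import numpy as np

# Synthetic stand-in for a tiny batch of MNIST-style images (pixel values 0-255).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)
labels = np.array([5, 0, 9])  # made-up class labels

# Flatten each 28x28 image into a 784-vector and cast to float32.
flat = images.reshape(len(images), 28 * 28).astype('float32')

# Scale pixel intensities from [0, 255] into [0, 1].
flat /= 255

# One-hot encode: row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(10, dtype='float32')[labels]

print(flat.shape)     # (3, 784)
print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # a 1 at index 5, zeros elsewhere
```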

### Model Architecture

Your task is to build a neural network to classify the MNIST dataset.  The model should have the following architecture:

* Input layer of `(784,)`
* Hidden Layer 1: 100 neurons
* Hidden Layer 2: 50 neurons
* Output Layer: 10 neurons, softmax activation function
* Loss: `categorical_crossentropy`
* Optimizer: `'SGD'`
* Metrics:  `['accuracy']`

In the cell below, create a model that matches the specifications above and use a **_sigmoid activation function for all hidden layers_**.

In [26]:
sigmoid_model = Sequential()

sigmoid_model.add(Dense(100, activation='sigmoid', input_shape=(784,)))
sigmoid_model.add(Dense(50, activation='sigmoid'))
sigmoid_model.add(Dense(10, activation='softmax'))

In [28]:
sigmoid_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 100)               78500     
_________________________________________________________________
dense_2 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_3 (Dense)              (None, 10)                510       
Total params: 84,060
Trainable params: 84,060
Non-trainable params: 0
_________________________________________________________________
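
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A short sketch, with `dense_param_count` as a hypothetical helper name:

```python
def dense_param_count(n_inputs, n_units):
    # One weight for every (input, unit) pair, plus one bias per unit.
    return n_inputs * n_units + n_units

# Input size followed by the size of each Dense layer in the model.
layer_sizes = [784, 100, 50, 10]

counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [78500, 5050, 510]
print(sum(counts))  # 84060
```

Note that the activation function contributes no parameters, so all three versions of the network in this lab have the same parameter count.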


Now, compile the model with the following hyperparameters:

* `loss='categorical_crossentropy'`
* `optimizer='SGD'`
* `metrics=['accuracy']`

In [29]:
sigmoid_model.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])
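
For intuition about the loss, categorical cross-entropy averages $-\sum_k y_k \log \hat{y}_k$ over the samples, so it is small when the softmax output puts high probability on the true class. The NumPy sketch below is a simplified illustration rather than Keras's implementation. The clipping constant `eps` and the toy predictions are assumptions made for the example.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 so np.log never sees 0 (eps is an assumed value).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Per-sample loss is -sum(y_true * log(y_pred)); return the mean over samples.
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Toy one-hot labels and made-up softmax outputs for two samples, three classes.
y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
uncertain = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_uncertain = categorical_crossentropy(y_true, uncertain)

print(loss_confident < loss_uncertain)  # True
```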

Now, fit the model.  In addition to our training data, pass in the following parameters:

* `epochs=10`
* `batch_size=32`
* `verbose=1`
* `validation_data=(X_test, y_test)`

In [30]:
# fit() returns a History object; store it separately so the trained model is not overwritten
sigmoid_history = sigmoid_model.fit(X_train, y_train, batch_size=32, epochs=10,
                                    verbose=1, validation_data=(X_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Fitting a Model with Tanh Activations

Now, we'll build the exact same model as we did above, but with hidden layers that use `tanh` activation functions rather than `sigmoid`.

In the cell below, create a second version of the model that uses the hyperbolic tangent function for its hidden-layer activations.  All other parameters, including the number of hidden layers, the size of the hidden layers, and the output layer, should remain the same. 

In [31]:
tanh_model = Sequential()

tanh_model.add(Dense(100, activation='tanh', input_shape=(784,)))
tanh_model.add(Dense(50, activation='tanh'))
tanh_model.add(Dense(10, activation='softmax'))

In [33]:
tanh_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 100)               78500     
_________________________________________________________________
dense_5 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_6 (Dense)              (None, 10)                510       
Total params: 84,060
Trainable params: 84,060
Non-trainable params: 0
_________________________________________________________________


Now, compile this model.  Use the same hyperparameters as we did for the sigmoid model. 

In [34]:
tanh_model.compile(optimizer='SGD',loss='categorical_crossentropy',metrics=['accuracy'])

Now, fit the model.  Use the same hyperparameters as we did for the sigmoid model. 

In [35]:
# fit() returns a History object; store it separately so the trained model is not overwritten
tanh_history = tanh_model.fit(X_train, y_train, batch_size=32, epochs=10,
                              verbose=1, validation_data=(X_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Fitting a Model with ReLU Activations

Finally, construct a third version of the same model, but this time with `relu` activation functions for the hidden layers.  

In [36]:
relu_model = Sequential()

relu_model.add(Dense(100, activation='relu', input_shape=(784,)))
relu_model.add(Dense(50, activation='relu'))
relu_model.add(Dense(10, activation='softmax'))

In [37]:
relu_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 100)               78500     
_________________________________________________________________
dense_8 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_9 (Dense)              (None, 10)                510       
Total params: 84,060
Trainable params: 84,060
Non-trainable params: 0
_________________________________________________________________


Now, compile the model with the same hyperparameters as the last two models. 

In [38]:
relu_model.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])

Now, fit the model with the same hyperparameters as the last two models. 

In [39]:
# fit() returns a History object; store it separately so the trained model is not overwritten
relu_history = relu_model.fit(X_train, y_train, batch_size=32, epochs=10,
                              verbose=1, validation_data=(X_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Conclusion

Which activation function was most effective?



- The ReLU model was the most effective.
- The sigmoid model was the least effective.
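
One way to back up a ranking like this is to compare the final validation accuracy recorded during each training run. The helpers below are a hypothetical sketch (`final_val_accuracy` and `rank_models` are not Keras functions). They assume each input is the `.history` dictionary of the object returned by `fit`, and they check both key spellings because older Keras versions record `'val_acc'` while newer ones record `'val_accuracy'`.

```python
def final_val_accuracy(history_dict):
    """Return the last-epoch validation accuracy from a Keras-style history dict."""
    # Older Keras versions use 'val_acc'; newer ones use 'val_accuracy'.
    for key in ('val_accuracy', 'val_acc'):
        if key in history_dict:
            return history_dict[key][-1]
    raise KeyError('no validation accuracy found in history')

def rank_models(histories):
    """Sort model names from highest to lowest final validation accuracy."""
    return sorted(histories,
                  key=lambda name: final_val_accuracy(histories[name]),
                  reverse=True)

# Usage, where each value is the `.history` dict of the object a `fit` call returned:
# rank_models({'sigmoid': ..., 'tanh': ..., 'relu': ...})
```

Comparing only the last epoch is a simplification. Looking at the whole validation curve would also show how quickly each model learns.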