$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# <center> Activation Functions: An Examination of the Rectified Linear Function and its Performance </center>
<center> Vignesh M. Pagadala </center>
<center> Department of Computer Science </center>
<center> Colorado State University </center>
<center> Vignesh.Pagadala@colostate.edu </center>
***

## <center> Contents </center>
***
1. Abstract
2. Rectified Linear Unit
> - Description
> - Implementation
3. Performance Comparison with tanh
> - Plot
> - Observation
> - Inference
4. References
5. Extra Credit Problems

## 1. Abstract
<p>
    The Rectified Linear Unit (ReLU) is currently one of the most popular activation functions for training neural networks. In this report, we run an experiment that compares the performance of a neural network using ReLU as its activation function against the same network using the hyperbolic tangent (tanh) function. We first define a class implementing ReLU and its derivative, then train several neural network configurations with each activation function. Finally, we compute the Root Mean Squared Error (RMSE) on the training and testing data, plot the results, and record our observations and inferences.
</p>

### Module Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import neuralnetworksA2 as nn

## 2. Rectified Linear Unit <br>

 - ### Description <br> 
    The rectified linear function passes its input through unchanged when the input is zero or positive, and outputs zero when the input is negative. Mathematically, <br> 
    <center> $f(x) = \max(0, x)$ </center>
    <br>
    ![title](ReLU.jpg)
    <br> As the plot above shows, the slope of the graph is 0 for inputs less than 0 and 1 for inputs greater than 0. The slope is not defined at exactly 0; here it is taken to be 0 by convention. The derivative of the function can therefore be written as <br><br>
    <center> $ f'(x) = 0, \text{ if } x \le 0 $ </center>
    <center> $ f'(x) = 1, \text{ if } x > 0  $ </center>
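
As a quick illustration of the definition and its derivative, the following is a minimal standalone sketch. The function names `relu` and `relu_derivative` are chosen here for illustration only and are not part of `neuralnetworksA2`.

```python
import numpy as np

def relu(x):
    """Rectified linear function: max(0, x), applied element-wise."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Slope of ReLU: 0 where x <= 0, 1 where x > 0 (0 at x == 0 by convention)."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # negative inputs are clipped to 0, positive ones pass through
print(relu_derivative(x))  # 0 for the first three entries, 1 for the last two
```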
    
 - ### Implementation <br>


In [None]:
# Create a new class NeuralNetworkReLU which inherits from the NeuralNetwork class and overrides the
# methods activation and activationDerivative to implement the rectified linear function.
class NeuralNetworkReLU(nn.NeuralNetwork):
    def __init__(self, ni, nh, no):
        super(NeuralNetworkReLU, self).__init__(ni, nh, no)

    def activation(self, weighted_sum):
        return np.maximum(0, weighted_sum)

    def activationDerivative(self, activation_value):
        actDer = np.copy(activation_value)
        actDer[actDer <= 0] = 0
        actDer[actDer > 0] = 1
        return actDer
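
Note that `activationDerivative` receives the activation value (the output of the unit) rather than the weighted sum. For ReLU this loses nothing, because the output is positive exactly when the weighted sum is positive. The standalone sketch below, which does not depend on `neuralnetworksA2`, checks this on a few sample values; the variable names are illustrative.

```python
import numpy as np

weighted_sum = np.array([-1.5, -0.1, 0.0, 0.3, 4.0])
activation_value = np.maximum(0, weighted_sum)

# Derivative computed directly from the weighted sum (the definition).
deriv_from_sum = (weighted_sum > 0).astype(float)

# Derivative computed from the activation output, mirroring activationDerivative.
deriv_from_output = np.copy(activation_value)
deriv_from_output[deriv_from_output <= 0] = 0
deriv_from_output[deriv_from_output > 0] = 1

# Both routes agree for every sample value.
print(np.array_equal(deriv_from_sum, deriv_from_output))
```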

## 3. Performance Comparison with tanh

 - ### Plot


Let us define the function *partition* as shown below. The primary purpose of this function is to take in input data, and the desired target output, and divide the records into training and testing data, based on the fraction argument.

In [19]:
def partition(X, T, fraction, shuffle):
    nRows = X.shape[0]
    # Choose number of rows for training and testing data.
    nTrain = int(round(fraction*nRows)) 
    nTest = nRows - nTrain

    rows = np.arange(nRows)
    # If the shuffle argument is set to true, then mix up the data records randomly.
    if shuffle:
        np.random.shuffle(rows)

    trainIndices = rows[:nTrain]
    testIndices = rows[nTrain:]

    Xtrain = X[trainIndices, :]
    Ttrain = T[trainIndices, :]
    Xtest = X[testIndices, :]
    Ttest = T[testIndices, :]
    
    return Xtrain, Ttrain, Xtest, Ttest

The following function computes the Root Mean Squared Error (RMSE) between two arrays of the same shape.

In [25]:
def rmse(A, B):
    return np.sqrt(np.mean((A - B)**2))
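
A small worked example may help confirm what `rmse` returns. The sketch below repeats the definition so that it runs on its own; the `targets` and `predictions` arrays are made-up values for illustration.

```python
import numpy as np

def rmse(A, B):
    return np.sqrt(np.mean((A - B)**2))

targets = np.array([[1.0], [2.0], [3.0]])
predictions = np.array([[1.0], [2.0], [5.0]])

# Squared errors are 0, 0 and 4, so the mean is 4/3 and the RMSE is sqrt(4/3).
print(rmse(targets, predictions))
print(np.sqrt(4 / 3))
```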

Let's try out a few examples to demonstrate the working of the partition function.

In [2]:
X = np.arange(10*2).reshape((10, 2))
T = X[:, 0:1] * 0.1

In [3]:
X

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19]])

In [4]:
T

array([[0. ],
       [0.2],
       [0.4],
       [0.6],
       [0.8],
       [1. ],
       [1.2],
       [1.4],
       [1.6],
       [1.8]])

In [7]:
Xtrain, Ttrain, Xtest, Ttest = partition(X, T, 0.8, shuffle=False)

In [8]:
Xtrain

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])

In [9]:
Ttrain

array([[0. ],
       [0.2],
       [0.4],
       [0.6],
       [0.8],
       [1. ],
       [1.2],
       [1.4]])

In [10]:
Xtest

array([[16, 17],
       [18, 19]])

In [11]:
Ttest

array([[1.6],
       [1.8]])

Examples for ```shuffle=True```. The data samples in this case are rearranged randomly before partitioning is done.

In [12]:
Xtrain, Ttrain, Xtest, Ttest = partition(X, T, 0.8, shuffle=True)

In [13]:
Xtrain

array([[ 0,  1],
       [18, 19],
       [12, 13],
       [14, 15],
       [ 4,  5],
       [16, 17],
       [ 6,  7],
       [ 2,  3]])

In [14]:
Ttrain

array([[0. ],
       [1.8],
       [1.2],
       [1.4],
       [0.4],
       [1.6],
       [0.6],
       [0.2]])

In [15]:
Xtest

array([[ 8,  9],
       [10, 11]])

In [16]:
Ttest

array([[0.8],
       [1. ]])

Now, let's plot the RMSE values for the following cases:
1. Using tanh activation function, and calculating RMSE on training data.
2. Using tanh activation function, and calculating RMSE on testing data.
3. Using ReLU activation function and calculating RMSE on training data.
4. Using ReLU activation function and calculating RMSE on testing data.

In the following snippet of code, we take each of the two activation functions, train using them with different hidden layer structures, 10 times for each structure, and store the RMSE mean in each case. Finally, we plot everything, with four different curves for each of the above cases.

In [None]:
# Load the csv data.
dframe = pd.read_csv('energydata_complete.csv', sep=',',header=None)
# Filter out required columns.
#dframe = dframe.drop(dframe.columns[[0, -2, -1]], axis=1)

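# Get target. The CSV is read with header=None, so row 0 holds the column names and is skipped.
Td = dframe.iloc[1:, [1]]
# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array.
Td = Td.values
T = Td.astype(float)

# Get input.
Xd = dframe.iloc[1:, 2:-2]
Xd = Xd.values
X = Xd.astype(float)

# Comparison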
hiddenLayers = [[u]*nl for u in [1, 2, 5, 10, 50] for nl in [1, 2, 3, 4, 5, 10]]
tanHlist = []
ReLUlist = []
for actFun in [nn.NeuralNetwork, NeuralNetworkReLU]:
    for hidden in hiddenLayers:
        # Create list for storing RMSE.
        rmseTrainList = []
        rmseTestList = [] 
        for i in range(10):
            Xtrain, Ttrain, Xtest, Ttest = partition(X, T, 0.8, shuffle = False)
            nnet = actFun(Xtrain.shape[1], hidden, Ttrain.shape[1])
            nnet.train(Xtrain, Ttrain, 100)
            rmseTrain = rmse(Ttrain, nnet.use(Xtrain))
            rmseTest = rmse(Ttest, nnet.use(Xtest))
            rmseTrainList.append(rmseTrain)
            rmseTestList.append(rmseTest)
        rmseTrainMean = sum(rmseTrainList)/len(rmseTrainList)
        rmseTestMean = sum(rmseTestList)/len(rmseTestList)
        if(actFun == nn.NeuralNetwork):
            tanHlist.append([hidden, rmseTrainMean, rmseTestMean])
        else:
            ReLUlist.append([hidden, rmseTrainMean, rmseTestMean])

tanHlist = pd.DataFrame(tanHlist)
ReLUlist = pd.DataFrame(ReLUlist)

plt.figure(figsize = (20, 20))
plt.plot(tanHlist.values[:, 1], 'b', label = 'tanh Train RMSE')
plt.plot(tanHlist.values[:, 2], 'g', label = 'tanh Test RMSE')
plt.plot(ReLUlist.values[:, 1], 'm', label = 'ReLU Train RMSE')
plt.plot(ReLUlist.values[:, 2], 'k', label = 'ReLU Test RMSE')
#plt.plot(tanHlist.values[:, 1:], 'o-')
#plt.plot(ReLUlist.values[:, 1:], 'o-')
# The legend picks up the label given to each curve above.
plt.legend()
plt.xticks(range(tanHlist.shape[0]), hiddenLayers, rotation=30, horizontalalignment='right')
plt.grid(True)
plt.show()
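
To make the x-axis of the plot easier to read, it may help to look at how the hidden layer structures are enumerated. The sketch below reuses the same list comprehension: the outer loop fixes the number of units per layer and the inner loop varies the number of layers, giving 5 × 6 = 30 structures in total.

```python
hiddenLayers = [[u]*nl for u in [1, 2, 5, 10, 50] for nl in [1, 2, 3, 4, 5, 10]]

print(len(hiddenLayers))   # 30 structures
print(hiddenLayers[0])     # one layer with 1 unit
print(hiddenLayers[5])     # ten layers with 1 unit each
print(hiddenLayers[6])     # one layer with 2 units
print(hiddenLayers[-1])    # ten layers with 50 units each
```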

 - ### Observation

 - ### Inference

## 4. References
[1] George E. Dahl, Tara N. Sainath and Geoffrey E. Hinton, "Improving Deep Neural Networks for LVCSR Using Rectified Linear Units"

## Grading and Check-in

Your notebook will be run and graded automatically. Test this grading process by first downloading [A3grader.tar](http://www.cs.colostate.edu/~anderson/cs445/notebooks/A3grader.tar) and extract `A3grader.py` from it. Run the code in the following cell to demonstrate an example grading session. You should see a perfect execution score of  60 / 60 if your functions and class are defined correctly. The remaining 40 points will be based on the results you obtain from the comparisons of hidden layer structures and the two activation functions applied to the energy data.

For the grading script to run correctly, you must first name this notebook as `Lastname-A3.ipynb` with `Lastname` being your last name, and then save this notebook.  Your working directory must also contain `neuralnetworksA2.py` and `mlutilities.py` from lecture notes.

Combine your notebook, `neuralnetworksA2.py`, and `mlutilities.py` into one zip file or tar file.  Name your tar file `Lastname-A3.tar` or your zip file `Lastname-A3.zip`.  Check in your tar or zip file using the `Assignment 3` link in Canvas.

A different, but similar, grading script will be used to grade your checked-in notebook. It will include other tests.

In [21]:
%run -i A3grader.py



Extracting python code from notebook named 'Madharapakkam Pagadala-A3.ipynb' and storing in notebookcode.py
Removing all statements that are not function or class defs or import statements.

Testing  import neuralnetworksA2 as nn

--- 5/5 points. The statement  import neuralnetworksA2 as nn  works.

Testing nnet = nn.NeuralNetwork(1, 10, 1)

--- 5/5 points. nnet correctly constructed

Testing a = nnet.activation(-0.8)

--- 5/5 points. activation of -0.664036770267849 is correct.

Testing da = nnet.activationDerivative(-0.664)

--- 5/5 points. activationDerivative of 0.5591039999999999 is correct.

Testing nnetrelu = NeuralNetworkReLU(1, 5, 1)

--- 5/5 points. nnet correctly constructed

Testing a = nnetrelu.activation(-0.8)

--- 5/5 points. activation of 0.0 is correct.

Testing a = nnetrelu.activation(1.8)

--- 5/5 points. activation of 1.8 is correct.

Testing da = nnetrelu.activationDerivative(0.0)

--- 5/5 points. activationDerivative of 0.0 is correct.

Testing da = nnetrelu.act

## Extra Credit

Run additional experiments using different numbers of training iterations.  How do the relative performances of the three activation functions depend on numbers of training iterations?  This will earn one extra credit point.

You may also earn an extra credit point by creating yet another version of the neural network class, called ```NeuralNetworkSwish```, and repeating the above comparisons.  You may set the constant $\beta = 1$.  This is trickier than it sounds, because the Swish activation derivative requires the weighted sum as an argument, but our other two activation function derivatives did not.
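
As a starting point, the Swish function and its derivative can be sketched as standalone NumPy functions. This is not the `NeuralNetworkSwish` class itself; the helper names `sigmoid`, `swish` and `swish_derivative` are assumptions made for illustration, using the standard definition $f(x) = x \, \sigma(\beta x)$ with $\sigma$ the logistic sigmoid. The derivative is written in terms of the weighted sum because Swish is not monotonic for negative inputs, so the output alone does not determine the slope.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def swish(weighted_sum, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return weighted_sum * sigmoid(beta * weighted_sum)

def swish_derivative(weighted_sum, beta=1.0):
    """Derivative of Swish with respect to the weighted sum (product rule)."""
    s = sigmoid(beta * weighted_sum)
    return s + beta * weighted_sum * s * (1.0 - s)

# Check the analytic derivative against a central finite difference.
x = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (swish(x + eps) - swish(x - eps)) / (2 * eps)
print(np.allclose(numeric, swish_derivative(x), atol=1e-6))
```

How the weighted sum is made available to the derivative inside the class depends on how `nn.NeuralNetwork` stores its intermediate values, which is not shown in this notebook.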