
# DAV 6150 Module 13: Neural Networks

## Project 3 Review



## How should we assess categorical values relative to numeric data in our EDA work?

- Some suggestions: calculate chi-square statistics; calculate mutual information metrics; use bar plots; etc.


- If the categorical variables are binary, we can also calculate a __point-biserial correlation__. See these links: 


https://towardsdatascience.com/point-biserial-correlation-with-python-f7cd591bd3b1


https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pointbiserialr.html


- Facet plots (grids of small plots, often bar plots, with one panel per category level) are another great tool we can use: 


https://datavizpyr.com/how-to-make-simple-facet-plots-with-seaborn-catplot-in-python/


https://seaborn.pydata.org/tutorial/categorical.html


http://seaborn.pydata.org/generated/seaborn.barplot.html
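As a minimal sketch of two of the metrics suggested above, the following uses SciPy's `pointbiserialr` and `chi2_contingency` functions. The data is synthetic and made up purely for illustration (the variable names `group`, `value`, and `table` are hypothetical), so the numbers carry no meaning beyond showing how the calls fit together.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)             # binary categorical variable (0 or 1)
value = 10 + 5 * group + rng.normal(0, 1, 200)   # numeric variable that depends on the group

# Point-biserial correlation between the binary variable and the numeric variable
r, p_value = stats.pointbiserialr(group, value)
print(f"point-biserial r = {r:.3f}, p-value = {p_value:.3g}")

# Chi-square test of independence for two categorical variables,
# computed from a contingency table of counts
table = np.array([[30, 10],
                  [10, 30]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {chi2_p:.3g}")
```

A point-biserial value near $1$ or $-1$ suggests the numeric variable differs strongly between the two categories, while a small chi-square p-value suggests the two categorical variables are not independent.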



## Make sure you provide an explanation for how you chose your modeling hyperparameters

- On what basis are you selecting parameter values for your models? If you are relying on the default parameter settings, you need to explain __WHY__ you believe the default settings might be appropriate for the data you are working with.

## An automated approach to hyperparameter selection: Grid Search

Chapter 19 of the __MLPR__ text provides an example of how to use the scikit-learn __model_selection.GridSearchCV__ class (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to automatically compare different versions of a particular type of model constructed using varying hyperparameter values:

https://github.com/mattharrison/ml_pocket_reference/blob/master/ch19.ipynb (see code cells 11 and 12)
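The sketch below is not the textbook's example; it is a small stand-in that shows the general shape of a grid search. The decision tree model, the Iris data set, and the candidate values in `param_grid` are all assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to compare (illustrative choices)
param_grid = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [1, 5, 10],
}

# Fit one model per combination, scoring each with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

Reporting `best_params_` together with the cross-validated score gives you a concrete, defensible basis for the hyperparameter values you end up using.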


## Selection of hyperparameters should NEVER be arbitrary. 

- Data science practitioners are expected to be able to explain their choice of hyperparameter settings when explaining their models. __If you can't explain the choices you've made when subjected to scrutiny by others__, your choices will appear to be arbitrary and __you will damage your own credibility as a data scientist__.

## Neural Network Basics


### What is a "neuron" in an artificial neural network (ANN)?

An ANN __neuron__ is simply a mathematical function that attempts to model the functioning of a biological neuron. Typically, an ANN neuron will compute the weighted sum of its inputs. The result of that computation is then passed through an activation function to generate the output of the neuron. An example from the assigned readings: https://towardsdatascience.com/perceptron-learning-algorithm-d5db0deab975
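A single neuron can be sketched in a few lines. The function names, the example values, and the bias term `b` are assumptions added for illustration (the description above does not mention a bias, though most implementations include one).

```python
import numpy as np

def neuron_output(inputs, weights, bias, activation):
    """Combine the inputs with their weights, add a bias, then apply the activation."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

def step(z):
    """Threshold activation: 1 if the combined input is non-negative, else 0."""
    return 1 if z >= 0 else 0

x = np.array([1.0, 0.0, 1.0])   # three input values
w = np.array([0.2, 0.1, 0.3])   # one weight per input
b = -0.4                        # bias term (an assumption, see note above)

print(neuron_output(x, w, b, step))
```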


### Neuron Input Weights

Each neuron input is assigned its own input weight. Weights are often initialized to small random values, such as values in the range 0 to 0.3 (although more complex initialization schemes can also be used). In general, __the larger the magnitude of the weight assigned to a neuron input value, the more influence that input value will have upon the output of the neuron__. One caveat: large weights can increase the variance of your models. As such, it is generally recommended that weights be kept small (for example, by applying regularization during training). 


### Activation Functions

An __activation function__ is a simple mapping of a neuron's weighted inputs to the output of a neuron. It is called an activation function because it governs the threshold at which the neuron is activated and the strength of the output signal. Selection of an appropriate activation function is the responsibility of the practitioner.
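To make the idea concrete, here is a small sketch of four commonly used activation functions evaluated on the same inputs. The helper names are hypothetical; `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def step(z):
    """Hard threshold: outputs 0 or 1 only."""
    return np.where(z >= 0, 1.0, 0.0)

def logistic(z):
    """Smooth S-shaped curve with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
for name, fn in [("step", step), ("logistic", logistic), ("tanh", np.tanh), ("relu", relu)]:
    print(name, np.round(fn(z), 3))
```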


### Perceptrons

- __Perceptrons__ are a type of __artificial neuron__. However, __they are not representative of biological neurons that exist within human brains__. Instead, they are assumed to be __proxies__ for actual neurons and as such are not expected to behave in the same manner as actual __biological__ neurons will behave.


- Perceptrons accept __multiple inputs__ and generate a __single output value__. An __activation function__ is used to generate the output of a perceptron.


- A basic perceptron is just a binary classifier: it applies a weight to each of its inputs (the weights are learned during training) and sums the results. That weighted sum is then compared against a threshold (defined by the perceptron's activation function) to determine whether the neuron outputs a $0$ or a $1$. __No other output value is possible from a perceptron__.


- Basic perceptrons require that the input data be __linearly separable__, meaning that __there must exist a straight line (or, with more than two inputs, a hyperplane) within the decision space that cleanly separates the data points belonging to the two possible binary output values.__ If the data we are working with is not linearly separable, we should avoid the use of basic perceptrons.


- Small changes in perceptron inputs can result in abrupt changes in perceptron output: because of the hard threshold, an input that lies near the linear decision boundary can flip the output from $0$ to $1$ (or vice versa).
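The points above can be illustrated with a minimal sketch of the classic perceptron learning rule applied to the logical AND function, which is linearly separable. The function names, the learning rate, and the use of a bias term in place of an explicit threshold are assumptions made for this example.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.1, max_epochs=100):
    """Perceptron learning rule: adjust the weights only when a prediction is wrong."""
    weights = np.zeros(X.shape[1])
    bias = 0.0   # plays the role of the threshold (an assumption of this sketch)
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, target in zip(X, y):
            prediction = 1 if np.dot(weights, x_i) + bias > 0 else 0
            update = learning_rate * (target - prediction)
            if update != 0:
                weights = weights + update * x_i
                bias += update
                mistakes += 1
        if mistakes == 0:   # every training point is classified correctly
            break
    return weights, bias

def predict(X, weights, bias):
    """Output 1 where the weighted sum exceeds the threshold, else 0."""
    return (X @ weights + bias > 0).astype(int)

# Logical AND: output is 1 only when both inputs are 1
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

weights, bias = train_perceptron(X_and, y_and)
print(predict(X_and, weights, bias))
```

Swapping in the XOR targets `[0, 1, 1, 0]` gives a data set that is not linearly separable, and the same loop would simply run until `max_epochs` without ever classifying every point correctly.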


### Sigmoid Neurons

- __Sigmoid Neurons__ differ from perceptrons in that they __can output any continuous value between $0$ and $1$__. Like a perceptron, sigmoid neurons are also a type of __artificial neuron__ that is not representative of biological neurons that exist within human brains. 


- Sigmoid neurons rely on an __activation function__ that __produces a smooth, S-shaped output curve__ instead of the abrupt step produced by a basic perceptron (e.g., https://towardsdatascience.com/sigmoid-neuron-deep-neural-networks-a4cd35b629d7 ). 


- The __logistic function__ is often used as the activation function within a sigmoid neuron, though other activation functions can be used if so desired (e.g., a hyperbolic tangent function, which has an output range of (-1, 1) and a steeper gradient than the logistic function; a piecewise linear function; a rectified linear unit (ReLU); etc.). The logistic function often works well for classification problems.


- Since sigmoid neurons can output any real number between $0$ and $1$, __they can be applied to data that is not linearly separable.__ In addition, because the output varies smoothly, small changes in neuron inputs produce only small changes in the output of a sigmoid neuron, rather than the abrupt flips seen with a basic perceptron.
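The difference in sensitivity between the two neuron types can be seen in a tiny sketch. The function names are hypothetical, and `z` stands for the neuron's weighted input just either side of the threshold.

```python
import math

def step_output(z):
    """Basic perceptron: hard threshold at zero."""
    return 1 if z >= 0 else 0

def sigmoid_output(z):
    """Sigmoid neuron using the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

# Two weighted inputs that differ only slightly, on either side of the threshold
for z in (-0.01, 0.01):
    print(z, step_output(z), round(sigmoid_output(z), 4))
```

The perceptron's output jumps from one class to the other, whereas the sigmoid neuron's output moves only a small distance around $0.5$.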



## Multi-Layer Perceptrons (a.k.a., Artificial Neural Networks)

- A __multi-layer perceptron__ (MLP) (also often referred to as an __artificial neural network__ (ANN)) is a type of __feed-forward network__ created by combining multiple "layers" of neurons into a single feed-forward network.


- MLPs __must have at least three layers of nodes__: 1) an __input layer__; 2) one or more __hidden layers__; 3) an __output layer__.


- Neurons within MLPs are often referred to as __nodes__.


- __Nodes__ within an MLP are typically constructed using __non-linear activation functions__ (i.e., they are usually __sigmoid neurons__ rather than basic perceptrons).


- MLPs rely on __back propagation__ for purposes of model training.


A simple example from the assigned readings: https://machinelearningmastery.com/neural-networks-crash-course/


### What is an "Input Layer"?

An __input layer__ is simply the ANN layer that captures input from your training dataset. This layer is sometimes also referred to as the "visible layer". An input layer is usually drawn as having one input point per explanatory variable. The input points themselves __are not neurons__ and they are not assigned any input weightings. The input points are simply pass-through entities that perform no calculations. In a feed-forward ANN, every input node passes its input to each and every node contained within the first __hidden layer__.


### What is a "Hidden Layer"?

A __hidden layer__ is an ANN layer that sits between the input and output layers. All hidden layer neurons are assigned input weightings and perform calculations relative to their inputs and input weightings. In a __feed forward network__, the output of each hidden layer is passed on to a subsequent layer within the ANN. Each node within a hidden layer of a feed-forward ANN receives the output of every node in the network layer that immediately precedes it. The output of a node within a hidden layer is transmitted to each and every node contained within the next layer (either another hidden layer or the output layer). Also, the number of nodes contained within a hidden layer __does not need to be equivalent to the number of nodes contained within the ANN's input layer__. The number of nodes within a hidden layer can be greater than, less than, or equivalent to the number of nodes contained within an ANN's input layer. Similarly, the number of nodes within any given hidden layer __does not need to be equivalent to the number of nodes in any other hidden layer__, nor does it need to be equivalent to the number of nodes in the ANN's output layer.


### What is an "Output Layer"?

An __output layer__ is responsible for producing the output of an ANN. The input to a node within an output layer consists of the outputs of every node in the hidden layer that immediately precedes the output layer. The number of nodes within an output layer __does not need to be equivalent to the number of ANN input nodes, and does not need to be equivalent to the number of nodes found within any hidden layer within the ANN.__


### "Feed Forward" Networks

In a __feed forward__ ANN, data flows in one direction only: from the input layer through the hidden layers to the output layer. An example: 

https://visualstudiomagazine.com/articles/2013/05/01/~/media/ECG/visualstudiomagazine/Images/2013/05/0513vsm_McCaffreyNeuralNet2.ashx
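A forward pass through a small feed-forward network can be sketched with NumPy as follows. The layer sizes, the random weights (drawn from the 0 to 0.3 range mentioned earlier), the zero biases, and the choice of the logistic function for every layer are all assumptions made for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass input x through each (weights, biases) pair in order, one layer at a time."""
    activation = x
    for weights, biases in layers:
        # Every node in this layer receives the output of every node in the previous layer
        activation = logistic(weights @ activation + biases)
    return activation

rng = np.random.default_rng(1)
layer_sizes = [4, 5, 3, 2]   # 4 inputs, hidden layers of 5 and 3 nodes, 2 output nodes

# One weight matrix of shape (nodes in this layer, nodes in previous layer) per layer
layers = [
    (rng.uniform(0.0, 0.3, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

x = np.array([0.5, -1.0, 2.0, 0.1])
output = forward(x, layers)
print(output)
```

Note that the input layer contributes no weight matrix of its own, and that the hidden and output layers are free to have different numbers of nodes.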


### Back Propagation

__"Back Propagation"__ is the term used to describe the process by which the error in the output of an ANN is used to update the input weights of every ANN neuron during training. Back propagation calculates the gradient of the training error (i.e., of the loss function) with respect to each weight in the network; __gradient descent__ then uses that gradient to update the weights.

The basic back propagation algorithm works as follows:

- The network is given an input and produces an output.


- The output of the network is compared to the expected output (i.e., the __actual values__ for the response variable within the training data set) and an error is calculated via a loss function (think of the error as the "residual"). The gradient of the loss function is then calculated and “back propagated” through the model to adjust all of the weights assigned at each artificial neuron input (typically using either __stochastic__ or __mini-batch gradient descent__ techniques). 


- The derivative of the loss function with respect to a given neuron input weight is typically interpreted as follows:

   1) If it is __positive__, the model's error will __increase__ if we __increase__ that weight. Therefore, we should __decrease the weight's value__.

   2) If it is __negative__, the model's error will __decrease__ if we __increase__ that weight. Therefore, we should __increase the weight's value__.

   3) If it is __0__ for every weight, __our model has converged__ and no further neuron input weight adjustments are needed.


-  The error gradient is then propagated back through the network, one layer at a time, and the __neuron input weights are updated according to the amount that they contributed to the overall error of the model__. 


- The amount that an individual neuron input weight contributes to the overall error of the model is determined via the chain rule: the derivative of the neuron's activation function (evaluated at the output it had previously generated) is multiplied by the amount of back-propagated error conveyed to it via the back propagation process and by the input value carried on that weight.


- Neuron weight updates are back propagated all the way back through the ANN to its first hidden layer.


- During the next training iteration, the output of the ANN should be closer to its "ideal" output.


This process is repeated over some specified number of training iterations until we consider the output error to have become sufficiently minimized.


__This process of back propagation is what constitutes the "learning" within an ANN__. 
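The steps above can be sketched with a tiny network trained by full-batch gradient descent. Everything specific here is an assumption chosen for illustration: the XOR data, the single hidden layer of four logistic nodes, the squared-error loss, the learning rate, and the number of epochs. It is a sketch of the general technique, not a tuned implementation, and it makes no promise of reaching a perfect solution.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, n_hidden=4, learning_rate=0.5, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, size=(n_hidden, 1))
    b2 = np.zeros(1)
    n = len(X)
    losses = []
    for _ in range(epochs):
        # Forward pass: input -> hidden layer -> output layer
        hidden = logistic(X @ W1 + b1)
        output = logistic(hidden @ W2 + b2)

        # Compare the output to the actual values (loss = half the mean squared error)
        error = output - y
        losses.append(float(0.5 * np.mean(error ** 2)))

        # Backward pass (chain rule): error times the activation derivative,
        # first at the output layer, then passed back to the hidden layer
        delta_out = error * output * (1 - output)
        delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

        # Gradient descent: move each weight against its gradient
        W2 -= learning_rate * (hidden.T @ delta_out) / n
        b2 -= learning_rate * delta_out.mean(axis=0)
        W1 -= learning_rate * (X.T @ delta_hidden) / n
        b1 -= learning_rate * delta_hidden.mean(axis=0)
    return losses

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])   # XOR targets

losses = train(X, y)
print(f"loss at first epoch: {losses[0]:.4f}, loss at last epoch: {losses[-1]:.4f}")
```

Plotting `losses` against the epoch number is a simple way to check whether the error is still shrinking or has flattened out.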


## Neural Network Hyperparameters


### Epochs

An __epoch__ is a single training iteration in which our complete set of training data has been fed through the model; the number of epochs is therefore the number of such passes we impose upon an ANN. With full-batch gradient descent, the gradient of the loss function is calculated at the conclusion of each epoch and used to update the ANN's neuron input weights via __back propagation__. With stochastic or mini-batch gradient descent, the weights are updated multiple times within each epoch.
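How often the weights are updated depends on both the number of epochs and how the training data is batched. The helper below is a hypothetical illustration that assumes one weight update per batch; the `batch_size` parameter is an assumption of this sketch.

```python
import math

def count_weight_updates(n_samples, batch_size, epochs):
    """Number of weight updates when each batch triggers exactly one update."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# 1,000 training rows, 10 epochs, three batching strategies
print(count_weight_updates(1000, batch_size=1000, epochs=10))  # full batch
print(count_weight_updates(1000, batch_size=32, epochs=10))    # mini-batch
print(count_weight_updates(1000, batch_size=1, epochs=10))     # stochastic
```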


### Learning Rate 

__Learning Rate__ is an ANN hyperparameter that controls the magnitude of the updates to neuron input weights during model training. This hyperparameter typically has a small __positive real value__, usually between $0$ and $1$. Basically, __the learning rate determines how quickly our model's gradient descent algorithm will move along the gradient of the model's loss function__.

The learning rate __controls how quickly the model is allowed to adapt to the training data__. Smaller learning rates require more training epochs since smaller changes are made to neuron input weights during each update. By contrast, larger learning rates can result in rapid changes to neuron input weights during each update and as a result may require fewer training epochs.

However, __a learning rate that is too large can cause the model to converge too quickly to a suboptimal solution__, whereas a learning rate that is too small can result in the model never converging within any reasonable amount of time.

In general, we want the learning rate to be larger when the gradient we are moving along is relatively steep, and we want it to be relatively small when the gradient is relatively flat (so that we minimize the likelihood of overshooting the optimal minimum value for the loss function).
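The effect of the learning rate can be seen on a toy problem. This sketch assumes a one-weight "model" with loss $w^2$ (minimum at $w = 0$); the three learning rates and the starting point are arbitrary choices for illustration. On this particular loss, a rate that is too large overshoots the minimum on every step and moves further away rather than settling.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize loss(w) = w**2, whose gradient is 2*w, from a fixed starting point."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient
    return w

results = {rate: gradient_descent(rate) for rate in (0.01, 0.4, 1.1)}
for rate, final_w in results.items():
    print(f"learning rate {rate}: final w = {final_w:.4g}")
```

After the same number of steps, the smallest rate has barely moved toward the minimum, the middle rate has essentially reached it, and the largest rate has ended up further from the minimum than where it started.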


### Learning Rate Decay

__Learning Rate Decay__ is an ANN hyperparameter used to iteratively decrease the learning rate over epochs. A gradual decrease in the learning rate allows the ANN to make large changes to the weights at the beginning of model training and smaller fine tuning changes later in the training schedule.
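One common form is time-based decay, sketched below. The formula and the parameter names (`initial_rate`, `decay`) are assumptions for illustration; other schedules (step-wise drops, exponential decay, etc.) are also widely used.

```python
def decayed_rate(initial_rate, decay, epoch):
    """Time-based decay: the learning rate shrinks as the epoch number grows."""
    return initial_rate / (1 + decay * epoch)

schedule = [decayed_rate(0.1, 0.5, epoch) for epoch in range(5)]
print([round(rate, 4) for rate in schedule])
```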


### Momentum

__Momentum__ is an ANN back propagation hyperparameter that adds a fraction of the previous weight update to the current one, allowing the weights to continue to change in the same direction even when there is less error being calculated.
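A minimal sketch of the classical momentum update on the same kind of toy loss ($w^2$, minimum at $w = 0$) is shown below. The variable names, the momentum value of 0.9, and the small learning rate are assumptions chosen to make the effect visible.

```python
def descend(momentum, learning_rate=0.01, start=5.0, steps=50):
    """Gradient descent on loss(w) = w**2 with an optional momentum term."""
    w, velocity = start, 0.0
    for _ in range(steps):
        gradient = 2 * w
        # Carry over a fraction of the previous update, then add the new gradient step
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
    return w

plain = descend(momentum=0.0)
with_momentum = descend(momentum=0.9)
print(f"without momentum: {plain:.4f}, with momentum: {with_momentum:.4f}")
```

With the same learning rate and number of steps, the run with momentum ends up closer to the minimum because each update builds on the direction of the previous ones.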


### How Many Layers + Nodes to Use?

Unfortunately __there is no single set of agreed upon rules for how we go about constructing a neural network for any given problem__. As such, __neural network design and implementation is a highly empirical field of endeavor__. In other words, __you need to experiment with different configurations to determine which will perform best relative to the data you are working with__.

Some basic suggestions on how to choose these parameters can be found here: https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
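One way to run such experiments is to score several candidate layouts with cross-validation and compare the results. The sketch below uses scikit-learn's `MLPClassifier` on a synthetic two-class data set; the data set, the three candidate layouts, and the `max_iter` value are assumptions for illustration rather than recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic, non-linearly-separable data (illustration only)
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Each tuple lists the number of nodes in each hidden layer
candidate_layouts = [(4,), (16,), (16, 8)]

results = {}
for layout in candidate_layouts:
    model = MLPClassifier(hidden_layer_sizes=layout, max_iter=2000, random_state=0)
    # Mean accuracy across 5 cross-validation folds
    results[layout] = cross_val_score(model, X, y, cv=5).mean()

for layout, score in results.items():
    print(layout, round(score, 3))
```

The same loop can be extended to vary other hyperparameters (for example `learning_rate_init` or `momentum` when `solver="sgd"`), or replaced with a grid search.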


## Neural Network Advantages

- Capable of "learning" very complex training data (e.g., images, audio / speech recordings, etc.) and subsequently providing high quality classification and regression output values


- Can be applied to very large data sets (assuming sufficient computing resources are available)


- Are relatively fault tolerant: the failure of an individual node within an ANN generally does not disable or tangibly reduce the effectiveness of the ANN.


## Neural Network Disadvantages

- Very hard to interpret in any intuitive manner: most neural networks are basically __black box__ algorithms that defy intuitive explanation.


- Very computationally expensive: the more nodes + layers included, the slower the training process. The smaller the learning rate, the slower the process. The larger the number of epochs, the slower the process.


- Very challenging to train since ANN design is highly empirical: many different models may need to be assessed before an acceptable model is achieved.


- Neural network models are not guaranteed to achieve convergence, so their output may prove to be much less than ideal relative to the data + task at hand.


- Quite often much simpler classification + regression models will achieve performance comparable to that of an ANN without the enormous computational expense + interpretability challenges. __Therefore, do not use a neural network if a simpler + less complex type of model is expected to perform adequately (or better) for the same task__.

## How to Implement a Feed-Forward Network in Python?

Chapter 10 of the __Hands-On Machine Learning__ textbook includes a well-documented example of one way to implement a multilayer neural network within a Python environment. The author walks you through the process of installing Google’s __TensorFlow__ library and then provides a step-by-step overview of how to use the __Keras__ sub-library to construct a neural network that is capable of classifying images contained within the “Fashion MNIST” data set: https://github.com/ageron/handson-ml2/blob/master/10_neural_nets_with_keras.ipynb

## Module 13 Assignment Guidelines / Requirements