# Introduction to ANNs with Keras

## The Multilayer Perceptron & Backpropagation

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of deep learning studies DNNs and, more generally, models containing deep stacks of computations.

ANNs and DNNs are trained with the `backpropagation` algorithm, which in short is Gradient Descent combined with an efficient technique for computing the gradients automatically. The goal of backpropagation is to minimize the error between the network's predicted output and the actual target values.
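
To make the Gradient Descent part concrete, here is a minimal sketch on a one-parameter problem. The toy loss `(w - 3)^2`, the starting point, the learning rate, and the function names are all assumptions chosen for illustration; a real network applies the same update rule to every weight and bias at once.

```python
def loss(w):
    # Toy error surface with its minimum at w = 3 (illustrative assumption).
    return (w - 3.0) ** 2


def loss_gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, n_steps=100):
    # Repeatedly step against the gradient to reduce the loss.
    for _ in range(n_steps):
        w -= learning_rate * loss_gradient(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))
```

Each step moves `w` a little way downhill, so after enough steps it settles very close to the minimum.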

The backpropagation algorithm handles one mini-batch at a time (e.g., containing 32 training instances each) and goes through the full training set multiple times. Each pass is called an `epoch`.
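
The relationship between mini-batches and epochs can be sketched as follows. The helper `iterate_minibatches`, the 100-instance stand-in dataset, and the choice to reshuffle at every epoch are assumptions made for illustration (the notes above do not specify shuffling).

```python
import random


def iterate_minibatches(dataset, batch_size, rng):
    """Yield the dataset in shuffled mini-batches; one full sweep is one epoch."""
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


dataset = list(range(100))  # stand-in for 100 training instances
rng = random.Random(42)
n_epochs = 3

batch_sizes_per_epoch = []
for epoch in range(n_epochs):
    batches = list(iterate_minibatches(dataset, batch_size=32, rng=rng))
    batch_sizes_per_epoch.append([len(batch) for batch in batches])

print(batch_sizes_per_epoch)
```

With 100 instances and a batch size of 32, each epoch consists of three full mini-batches and one smaller final batch, and every instance is visited exactly once per epoch.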

For each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).

* **Initialization**: Initialize the weights and biases of the neural network with small random values. Define the learning rate, which determines the size of the steps taken during the optimization process.
* **Forward Pass**: Each mini-batch is fed forward through the network layer by layer. Neurons in each layer perform a weighted sum of their inputs, add a bias, and then apply an activation function to produce the output.
* **Calculate Error**: Compare the network's output with the actual target values to calculate the error. The error is typically measured using a loss or cost function, which quantifies the difference between the predicted and actual values.
* **Backward Pass**: Calculate the gradient of the error with respect to every weight and bias using the chain rule of calculus. The gradient is propagated backward through the network, layer by layer, until the algorithm reaches the input layer, which reveals how much each weight and bias contributed to the error.
* **Update Weights and Biases**: Adjust the weights and biases in the direction that reduces the error by subtracting the gradient multiplied by the learning rate from each parameter. The learning rate sets the step size of these updates, so it must be chosen to balance convergence speed against stability.
* **Repeat**: The forward pass, error calculation, backward pass, and update steps are repeated for multiple epochs (passes through the entire training dataset) until the network's performance converges to an acceptable level. Backpropagation thus iteratively adjusts the weights and biases to minimize the error between predicted and actual outputs. Because this is an optimization task, the choice of loss function, activation functions, and network architecture all play crucial roles in the success of training.
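
The steps listed above can be sketched in plain Python for a very small network. Everything specific here is an assumption made for illustration: the architecture (2 inputs, 2 sigmoid hidden units, 1 sigmoid output), the squared-error loss, the OR truth table used as a single mini-batch, the learning rate, and all function names. It is a sketch of the idea, not how Keras implements training.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_inputs, n_hidden, rng):
    # Initialization: small random weights and biases.
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": rng.uniform(-0.5, 0.5),
    }


def forward(params, x):
    # Forward pass: weighted sum plus bias, then activation, layer by layer.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    y_hat = sigmoid(sum(w * hj for w, hj in zip(params["W2"], h)) + params["b2"])
    return h, y_hat


def backward(params, x, y, h, y_hat):
    # Backward pass: chain rule for the loss 0.5 * (y_hat - y)^2,
    # starting at the output and moving back toward the inputs.
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    delta_hidden = [delta_out * w * hj * (1.0 - hj)
                    for w, hj in zip(params["W2"], h)]
    return {
        "W1": [[d * xi for xi in x] for d in delta_hidden],
        "b1": delta_hidden,
        "W2": [delta_out * hj for hj in h],
        "b2": delta_out,
    }


def train_step(params, batch, learning_rate):
    # Compute every gradient with the weights held fixed, then apply the
    # averaged update. Returns the mean loss measured before the update.
    results = []
    for x, y in batch:
        h, y_hat = forward(params, x)
        results.append((0.5 * (y_hat - y) ** 2, backward(params, x, y, h, y_hat)))
    step = learning_rate / len(batch)
    for _, g in results:
        for j, row in enumerate(g["W1"]):
            for k, g_jk in enumerate(row):
                params["W1"][j][k] -= step * g_jk
            params["b1"][j] -= step * g["b1"][j]
            params["W2"][j] -= step * g["W2"][j]
        params["b2"] -= step * g["b2"]
    return sum(loss for loss, _ in results) / len(batch)


# The OR truth table, treated as one mini-batch, so each step is one epoch.
batch = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
         ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
params = init_params(n_inputs=2, n_hidden=2, rng=random.Random(0))

# Repeat: forward pass, error, backward pass, update, for many epochs.
losses = [train_step(params, batch, learning_rate=0.5) for epoch in range(2000)]
print(losses[0], losses[-1])
```

The recorded loss should shrink over the epochs. In practice a framework computes these gradients automatically, so the `backward` function never has to be written by hand.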

For the algorithm to work properly, its authors made a key change: they replaced the step function with the logistic (sigmoid) function. This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. The logistic function is an example of an `activation function`.
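
A quick numerical comparison illustrates the difference. The sample points and helper names below are arbitrary choices for this sketch. The step function's slope is estimated with a finite difference, while the sigmoid's derivative uses the closed form `s * (1 - s)`.

```python
import math


def step(z):
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_derivative(f, z, eps=1e-4):
    # Central finite-difference estimate of f'(z).
    return (f(z + eps) - f(z - eps)) / (2 * eps)


sample_points = (-2.0, -0.5, 0.5, 2.0)
for z in sample_points:
    print(z, numerical_derivative(step, z), sigmoid_derivative(z))
```

Away from its jump at zero, the step function has a slope of exactly zero, so Gradient Descent gets no signal about which way to move. The sigmoid's derivative is small for large `|z|` but never exactly zero in exact arithmetic.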

### Activation Functions 

An activation function is a mathematical operation applied to the weighted sum computed by each neuron (or node) in a neural network layer to produce that neuron's output.

Activation functions introduce non-linearity to a DNN, allowing it to learn from and model complex patterns in data. Without non-linear activation functions, the entire neural network would behave like a linear model, regardless of its depth.
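
The collapse into a linear model can be checked directly. In this sketch the weight matrices and the input are made-up values, and biases are omitted for brevity (an assumption; including them gives the same conclusion, with a combined bias term).

```python
def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    # (A @ B)[i][j] = sum over k of A[i][k] * B[k][j]
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # 3 hidden units -> 2 outputs
x = [1.0, 1.0]

# Two stacked layers with no activation in between...
two_layers = matvec(W2, matvec(W1, x))
# ...give the same result as one layer whose weights are W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)
# Inserting a ReLU between the layers changes the result.
with_relu = matvec(W2, [max(0.0, h) for h in matvec(W1, x)])

print(two_layers, one_layer, with_relu)
```

Without an activation, the two layers are interchangeable with a single linear layer, so the extra depth adds no expressive power. With the ReLU in place, the output no longer matches that single linear map for this input.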

The purpose of the activation function can be summarized as follows:

* **Introducing Non-linearity**: Linear transformations (such as weighted sums and biases) are limited to representing linear relationships. By applying non-linear activation functions, the network can learn and approximate non-linear mappings between inputs and outputs.
* **Enabling Complex Representations**: The stacking of non-linear activation functions in deep networks enables the modeling of intricate relationships and hierarchies in data, allowing the network to learn and represent complex patterns.

Here are some commonly used activation functions in deep neural networks:

* **Sigmoid Function (Logistic)**:
    * _MOST COMMON_
    * Outputs values between 0 and 1.
    * Still used in the output layer for binary classification problems, but no longer common in hidden layers due to the vanishing gradient problem.
* **Hyperbolic Tangent (tanh)**:
    * _MOST COMMON_
    * Outputs values between -1 and 1.
    * Similar to the sigmoid but with a wider output range that is centered on zero.
* **Rectified Linear Unit (ReLU)**:
    * _MOST COMMON_
    * Outputs zero for negative inputs and passes positive inputs as is.
    * Widely used in hidden layers due to its simplicity and effectiveness in training deep networks.
* **Leaky ReLU**:
    * Similar to ReLU but allows a small, non-zero gradient for negative inputs, addressing the "dying ReLU" problem where neurons can become inactive during training.
* **Parametric ReLU (PReLU)**:
    * An extension of Leaky ReLU where α, the slope applied to negative inputs, is learned during training.
* **Exponential Linear Unit (ELU)**:
    * Smoothly saturates for negative inputs, potentially alleviating some issues with ReLU.
    
The choice of activation function depends on the specific characteristics of the data and the problem at hand. Experimentation and consideration of issues like vanishing gradients during training can guide the selection of an appropriate activation function for a given neural network architecture.
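
For reference, the activation functions listed above can be sketched as scalar Python functions. The default `alpha` values are illustrative assumptions rather than any library's defaults, and PReLU has the same form as `leaky_relu` except that `alpha` would be a trained parameter instead of a fixed constant.

```python
import math


def sigmoid(z):
    # Outputs values in (0, 1); written to avoid overflow for very negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    # Outputs values in (-1, 1), centered on zero.
    return math.tanh(z)


def relu(z):
    # Zero for negative inputs, identity for positive inputs.
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs instead of a flat zero.
    return z if z > 0 else alpha * z


def elu(z, alpha=1.0):
    # Smoothly saturates toward -alpha for very negative inputs.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z), elu(z))
```

Printing the values at a negative, zero, and positive input makes the differences in how each function treats negative inputs easy to compare.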