### Neuron in Humans
<div>
<img style="width:420px;height:300px" src="https://images.fineartamerica.com/images-medium-large-5/8-motor-neuron-in-ox-spinal-cord-lm-science-stock-photography.jpg"/>

<img src="https://www.quia.com/files/quia/users/lmcgee/Systems/endocrine-nervous/neuronstructure_L.gif" />
</div>

### Neuron in Machines

#### What Is a Perceptron?
A perceptron is a simple binary classification algorithm proposed by Cornell scientist Frank Rosenblatt. It divides a set of input signals into two classes, “yes” and “no”. Unlike many other classification algorithms, the perceptron was modeled after the essential unit of the human brain, the neuron, and it serves as the building block for networks that can learn to solve complex problems.
![image.png](attachment:image.png)
A perceptron is a very simple learning machine. It takes in a few inputs, each with a weight that signifies how important it is, and generates an output decision of “0” or “1”. When combined with many other perceptrons, it forms an artificial neural network. In theory, a neural network with enough neurons, training data and computing power can approximate a very wide range of input-output relationships.

#### What Is a Multilayer Perceptron?
A multilayer perceptron (MLP) is a perceptron that teams up with additional perceptrons, stacked in several layers, to solve complex problems. The diagram below shows an MLP with three layers. Each perceptron in the first layer on the left (the input layer) sends outputs to all the perceptrons in the second layer (the hidden layer), and all perceptrons in the second layer send outputs to the final layer on the right (the output layer).
![image.png](attachment:image.png)

Each perceptron sends multiple signals, one signal going to each perceptron in the next layer. A different weight is used for each signal. In the diagram above, every line going from a perceptron in one layer to the next layer represents a different weighted connection. Each layer can have a large number of perceptrons, and there can be multiple layers, so the multilayer perceptron can quickly become a very complex system.

The multilayer perceptron has another, more common name: a neural network. A three-layer MLP, like the diagram above, is called a Non-Deep or Shallow Neural Network. An MLP with four or more layers is called a Deep Neural Network.

One difference between the classic perceptron and a neural network is that in the classic perceptron, the decision function is a step function and the output is binary. In neural networks that evolved from MLPs, other activation functions can be used, which result in real-valued outputs, usually between 0 and 1 or between -1 and 1. This allows for probability-based predictions or classification of items into multiple labels.
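The layer-by-layer flow described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the 2-3-1 layer sizes, the weight values and the choice of a sigmoid activation are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one fully connected layer.

    weights[j][i] is the weight on the connection from input i to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs


def mlp_forward(inputs, layers):
    """Pass the inputs through every layer; each layer feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden_layer = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output_layer = ([[0.7, -0.5, 0.2]], [0.05])

prediction = mlp_forward([1.0, 2.0], [hidden_layer, output_layer])
print(prediction)  # a single value between 0 and 1
```

Every inner list of weights corresponds to the lines arriving at one neuron in the diagram.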

#### Structure of a Perceptron
The perceptron, or neuron in a neural network, has a simple but ingenious structure. It consists of four parts, illustrated below.
![image.png](attachment:image.png)

#### The Perceptron Learning Process

A perceptron follows these steps:

1. Takes the inputs, multiplies them by their weights, and computes their sum.

   Why it’s important: The weights allow the perceptron to evaluate the relative importance of each of the inputs. Neural network algorithms learn by discovering better and better weights that result in a more accurate prediction. Several algorithms are used to fine-tune the weights; the most common is called backpropagation.

2. Adds a bias factor, the number 1 multiplied by a weight.

   Why it’s important: This is a technical step that makes it possible to shift the activation function curve left or right along the input axis, so the perceptron can switch on at a threshold other than zero. It makes it possible to fine-tune the numeric output of the perceptron.

3. Feeds the sum through the activation function.

   Why it’s important: The activation function maps the input values to the required output values. For example, input values could be between 1 and 100, and outputs can be 0 or 1. The activation function also helps the perceptron to learn when it is part of a multilayer perceptron (MLP). Certain properties of the activation function, especially its non-linear nature, make it possible to train complex neural networks.

4. The result is the perceptron output. The perceptron output is a classification decision. In a multilayer perceptron, the output of one layer’s perceptrons is the input of the next layer. The output of the final perceptrons, in the “output layer”, is the final prediction of the perceptron learning model.
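The four steps above can be sketched as a few lines of Python. The weights and bias below are hand-picked assumptions that make the perceptron behave like a logical AND gate; a trained perceptron would learn such values rather than have them written in.

```python
def step(z):
    """Classic perceptron activation: output 1 when the sum reaches 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron_output(inputs, weights, bias_weight):
    # Step 1: multiply inputs by their weights and sum them.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Step 2: add the bias, the constant 1 multiplied by its own weight.
    weighted_sum += 1 * bias_weight
    # Steps 3 and 4: the activation function turns the sum into the output.
    return step(weighted_sum)


# Hypothetical hand-picked values that implement a logical AND.
and_weights = [1.0, 1.0]
and_bias_weight = -1.5

truth_table = [
    perceptron_output([a, b], and_weights, and_bias_weight)
    for a in (0, 1)
    for b in (0, 1)
]
print(truth_table)  # [0, 0, 0, 1]
```

Changing the bias weight to -0.5 turns the same perceptron into an OR gate, which shows how the bias moves the decision threshold.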

#### From the Classic Perceptron to a Full-Fledged Deep Neural Network

Although multilayer perceptrons (MLP) and neural networks are essentially the same thing, you need to add a few ingredients before an MLP becomes a full neural network. These are:

- Backpropagation — the backpropagation algorithm allows you to perform a “backward pass”, which helps tune the weights of the inputs. Backpropagation performs iterative backward passes which attempt to minimize the “loss”, or the difference between the known correct prediction and the actual model prediction. With each backward pass, the weights move towards an optimum that minimizes the loss function and results in the most accurate prediction.
- Hyperparameters — in a modern neural network, aspects of the multilayer structure such as the number of layers, initial weights, the type of activation function, and details of the learning process, are treated as parameters and tuned to improve the performance of the neural network. Tuning hyperparameters is an art, and can have a huge impact on the performance of a neural network.
- Advanced structures — many neural networks use a more complex structure that builds on the multilayer perceptron. For example, a Recurrent Neural Network (RNN) feeds its hidden state from one step of a sequence back in at the next step, which helps with sequential data such as language. A bidirectional RNN runs two such networks, one reading the sequence from beginning to end and the other from the end to the beginning. A Convolutional Neural Network (CNN) slides small filters with shared weights across the input instead of connecting every neuron to every input. This suits images, including color images whose inputs have three channels of “depth”: red, green and blue.

#### 6 Stages of Neural Network Learning

Generally speaking, neural network or deep learning model training occurs in six stages:

1. Initialization — initial weights are applied to all the neurons.
2. Forward propagation — the inputs from a training set are passed through the neural network and an output is computed.
3. Error function — because we are working with a training set, the correct output is known. An error function is defined, which captures the delta between the correct output and the actual output of the model, given the current model weights (in other words, “how far off” is the model from the correct result).
4. Backpropagation — backpropagation computes how the error changes with respect to each weight (the gradient), which indicates how to change the weights in order to bring the error function towards a minimum.
5. Weight update — weights are moved a small step in the direction that reduces the error, according to the results of the backpropagation algorithm.
6. Iterate until convergence — because the weights are updated a small step at a time, many iterations are required for the network to learn. After each iteration, gradient descent moves the weights towards a lower value of the loss function. The number of iterations needed to converge depends on the learning rate, the network hyperparameters, and the optimization method used.

At the end of this process, the model is ready to make predictions for unknown input data. New data can be fed to the model, a forward pass is performed, and the model generates its prediction.
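As a rough illustration of the six stages, here is a minimal sketch that trains a single sigmoid neuron on the logical OR function. The dataset, the log-loss error function, the learning rate and the number of epochs are all assumptions chosen to keep the example small; a real network would have many neurons and use a library.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, x):
    """Stage 2: forward propagation for one sample."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, samples):
    """Stage 3: error function averaged over the training set."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in samples:
        p = forward(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(samples)


def train(samples, learning_rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    # Stage 1: initialization with small random weights.
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = 0.0
    history = []
    for _ in range(epochs):  # Stage 6: iterate
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in samples:
            p = forward(weights, bias, x)
            # Stage 4: for a sigmoid with log loss, d(loss)/d(sum) = p - y.
            error = p - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        n = len(samples)
        # Stage 5: weight update, a small step against the gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        history.append(log_loss(weights, bias, samples))
    return weights, bias, history


# Hypothetical training set: the logical OR function.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias, history = train(samples)
predictions = [round(forward(weights, bias, x)) for x, _ in samples]
print(predictions)
```

After training, `forward` on its own is the "forward pass" used to make predictions for new inputs.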

#### What Is Backpropagation?
After a neural network is defined with initial weights, and a forward pass is performed to generate the initial prediction, an error function measures how far the model’s prediction is from the true value. There are many possible algorithms that can minimize the error function. For example, one could do a brute force search to find the weights that generate the smallest error. However, for large neural networks, a training algorithm is needed that is very computationally efficient. Backpropagation is that algorithm: it computes the information needed to improve the weights relatively quickly, even for a network with millions of weights.

#### How Backpropagation Works
<img src="https://missinglink.ai/wp-content/uploads/2018/11/Frame1.png" />

1. Forward pass — weights are initialized and inputs from the training set are fed into the network. The forward pass is carried out and the model generates its initial prediction.
2. Error function — the error function is computed by checking how far away the prediction is from the known true value.
3. Backpropagation with gradient descent — the backpropagation algorithm calculates how much the error is affected by each of the weights in the model. To do this, it calculates partial derivatives, going back from the error function to a specific neuron and its weight. This provides complete traceability from the total error back to a specific weight which contributed to that error. The result of backpropagation is the gradient of the error with respect to every weight, which gradient descent then uses to adjust the weights so that the error decreases.
4. Weight update — weights can be updated after every sample in the training set, but this is usually not practical. Typically, a batch of samples is run in one big forward pass, and then backpropagation is performed on the aggregate result. The batch size and number of batches used in training, called iterations, are important hyperparameters that are tuned to get the best results. Running the entire training set through the backpropagation process is called an epoch.
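To make the partial derivatives in step 3 concrete, the sketch below applies the chain rule by hand to a tiny network (one sigmoid hidden neuron feeding one linear output neuron) and compares the result with a numerical estimate. The network shape, the squared-error loss and all parameter values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_cache(params, x, target):
    """Forward pass: return the squared error and the intermediate values."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden neuron output
    y = w2 * h + b2           # linear output neuron
    loss = 0.5 * (y - target) ** 2
    return loss, (h, y)


def backprop(params, x, target):
    """Backward pass: chain rule from the loss back to each weight."""
    w1, b1, w2, b2 = params
    _, (h, y) = loss_and_cache(params, x, target)
    d_y = y - target             # d(loss)/d(y)
    d_w2 = d_y * h               # output weight
    d_b2 = d_y                   # output bias
    d_h = d_y * w2               # error flowing back into the hidden neuron
    d_z1 = d_h * h * (1.0 - h)   # through the sigmoid derivative
    d_w1 = d_z1 * x              # hidden weight
    d_b1 = d_z1                  # hidden bias
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, target, eps=1e-6):
    """Estimate each partial derivative by nudging one parameter at a time."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        diff = loss_and_cache(up, x, target)[0] - loss_and_cache(down, x, target)[0]
        grads.append(diff / (2 * eps))
    return grads


# Hypothetical parameter values [w1, b1, w2, b2] and one training sample.
params = [0.4, -0.1, 0.7, 0.2]
analytic = backprop(params, x=1.5, target=1.0)
numeric = numerical_gradient(params, x=1.5, target=1.0)
print(analytic)
print(numeric)
```

The numerical estimate needs two forward passes per weight, whereas backpropagation obtains every derivative from a single backward pass. That difference is what makes it practical for large networks.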

#### What Is a Neural Network Activation Function?
An activation function is a mathematical equation that determines the output of each element (perceptron or neuron) in the neural network. It takes the weighted sum of the neuron’s inputs and transforms it into an output, usually between zero and one or between -1 and one. Classic activation functions used in neural networks include the step function (which has a binary output), sigmoid and tanh. Newer activation functions, intended to improve computational efficiency, include ReLU and Swish.

#### Role of the Activation Function
In a neural network, inputs, which are typically real values, are fed into the neurons in the network. Each input to a neuron has a weight; the inputs are multiplied by their weights and summed together with the bias, and the result is fed into the activation function.

<img src="https://missinglink.ai/wp-content/uploads/2018/11/activefunction.png" />

Each neuron’s output is the input of the neurons in the next layer of the network, and so the inputs cascade through multiple activation functions until eventually, the output layer generates a prediction. Neural networks rely on nonlinear activation functions, and the derivative of the activation function helps the network learn through the backpropagation process.

#### 7 Common Activation Functions
1. The sigmoid function has a smooth gradient and outputs values between zero and one. For very high or very low input values the gradient is close to zero, so the network learns very slowly. This is called the vanishing gradient problem.
2. The TanH function is zero-centered, making it easier to model inputs that are strongly negative, strongly positive or neutral.
3. The ReLU function is highly computationally efficient, but its output and gradient are zero for negative inputs, so a neuron that only receives negative inputs stops learning.
4. The Leaky ReLU function has a small positive slope in its negative area, so negative inputs still produce a gradient and the neuron can keep learning.
5. The Parametric ReLU function allows the negative slope to be learned, using backpropagation to find the most effective slope for negative input values.
6. Softmax is a special activation function used for output neurons. It normalizes the outputs so that each is between 0 and 1 and they sum to 1, so each output can be read as the probability that the input belongs to a specific class.
7. Swish is a newer activation function discovered by Google researchers. They reported that it matches or outperforms ReLU on a number of deep models, with a similar level of computational efficiency.
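The functions in the list above are short enough to write out directly. The sketch below uses only the Python standard library; the default slope of 0.01 for Leaky ReLU and the name `alpha` for the learned slope are assumptions for the example, not values taken from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    # A small fixed slope keeps a non-zero gradient for negative inputs.
    return z if z > 0 else slope * z


def parametric_relu(z, alpha):
    # Same shape as Leaky ReLU, but alpha would be learned during training.
    return z if z > 0 else alpha * z


def swish(z):
    return z * sigmoid(z)


def softmax(values):
    # Subtracting the maximum first keeps exp() from overflowing.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("swish", swish)]:
    print(name, [round(fn(z), 4) for z in (-2.0, 0.0, 2.0)])

print("softmax", softmax([1.0, 2.0, 3.0]))
```

Softmax is the only one that operates on a whole layer at once, which is why it is used for output neurons.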

#### Definition of Bias vs. Variance in Neural Networks

To understand bias vs. variance, we first need to introduce the concept of a training set and validation set:

- A training set is a group of examples which is fed to the neural network during training.
- A validation set is a group of unseen examples which you use to test your neural network to see how it performs.
- An error function calculates the error, for either the training or validation sets. The error reflects how far away the network’s actual predictions were compared to the known correct outputs.

Bias reflects how well the model fits the training set. A high bias means the neural network is not able to generate correct predictions even for the examples it trained on. Variance reflects how well the model fits unseen examples in the validation set. A high variance means the neural network is not able to correctly predict for new examples it hasn’t seen.

<img src="https://missinglink.ai/wp-content/uploads/2018/11/Frame.png"/>

#### Overfitting and Underfitting in Neural Networks
Overfitting happens when the neural network is good at learning its training set, but is not able to generalize its predictions to additional, unseen examples. This is characterized by low bias and high variance. Underfitting happens when the neural network is not able to accurately predict for the training set, not to mention for the validation set. This is characterized by high bias: the error is high on both the training set and the validation set.
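One rough way to apply these definitions is to compare the training error with the validation error. The helper below is only a sketch: the function name and both thresholds (`acceptable_error` and `max_gap`) are hypothetical values that would need to be chosen for each problem.

```python
def diagnose_fit(train_error, validation_error,
                 acceptable_error=0.1, max_gap=0.05):
    """Label a model from its training and validation errors.

    Both thresholds are illustrative assumptions, not standard values.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the examples it trained on: high bias.
        return "underfitting"
    if validation_error - train_error > max_gap:
        # Good on the training set but much worse on unseen examples.
        return "overfitting"
    return "good fit"


print(diagnose_fit(0.30, 0.32))  # underfitting
print(diagnose_fit(0.02, 0.25))  # overfitting
print(diagnose_fit(0.05, 0.07))  # good fit
```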

#### Methods to Avoid Overfitting
- Retraining neural networks — running the same model on the same training set but with different initial weights, and selecting the network with the best performance.
- Multiple neural networks — training several neural network models in parallel, with the same structure but different weights, and averaging their outputs.
- Early stopping — training the network, monitoring the error on the validation set after each iteration, and stopping training when the network starts to overfit the data.
- Regularization — adding a term to the error function equation, intended to decrease the weights and biases, smooth outputs and make the network less likely to overfit.
- Tuning performance ratio — similar to regularization, but using a parameter that defines by how much the network should be regularized.
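Two of the methods above, early stopping and regularization, can be sketched in a few lines. The version below assumes an L2 penalty (the sum of squared weights) and a "patience" rule for early stopping; the function names, the `patience` parameter and the validation errors are made up for illustration.

```python
def l2_penalty(weights, strength):
    """Regularization term: penalizes large weights."""
    return strength * sum(w * w for w in weights)


def regularized_error(error, weights, strength):
    """Error function with the regularization term added."""
    return error + l2_penalty(weights, strength)


def early_stopping_epoch(validation_errors, patience=2):
    """Return the epoch with the lowest validation error seen before stopping.

    Training stops once `patience` epochs pass without any improvement.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation error keeps rising: the network is overfitting
    return best_epoch


# Hypothetical validation errors recorded after each epoch.
validation_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(validation_errors))   # 3
print(regularized_error(0.25, [1.0, -2.0], 0.1))
```

The `strength` argument plays the role of the parameter that defines how much the network is regularized.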

#### Methods to Avoid Underfitting

- Adding neuron layers or inputs — adding neuron layers, or increasing the number of inputs and neurons in each layer, can generate more complex predictions and improve the fit of the model.
- Adding more training samples or improving quality — the more training samples you feed into the network, and the better they represent the variance in the real population, the better the network will perform.
- Reducing dropout — dropout randomly “kills” a certain percentage of neurons in every training iteration, removing some learned information to reduce the risk of overfitting. If the network underfits, lowering the dropout percentage lets it keep more of what it learns.
- Decreasing regularization parameter — regularization can be overdone. By using a regularization performance parameter, you can learn the optimal degree of regularization, which can help the model better fit the data.

#### What is the Difference Between a Model Parameter and a Hyperparameter?

- Model parameters are internal to the neural network – for example, neuron weights. They are estimated or learned automatically from training samples. These parameters are also used to make predictions in a production model.
- Hyperparameters are external parameters set by the operator of the neural network – for example, selecting which activation function to use or the batch size used in training. Hyperparameters have a huge impact on the accuracy of a neural network. There may be different optimal values for different problems, and it is non-trivial to discover those values.
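The distinction can be illustrated with a small sketch. The dictionary below holds hyperparameters chosen by the operator (all values are hypothetical), while the function counts the model parameters, weights and biases, that training would have to learn for a fully connected network of that shape.

```python
# Hyperparameters: set by the operator before training (hypothetical values).
hyperparameters = {
    "layer_sizes": [4, 8, 3],  # 4 inputs, 8 hidden neurons, 3 outputs
    "activation": "tanh",
    "learning_rate": 0.01,
    "batch_size": 32,
}


def count_model_parameters(layer_sizes):
    """Count the weights and biases that training has to learn."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # one weight per connection
        total += fan_out           # one bias per neuron
    return total


print(count_model_parameters(hyperparameters["layer_sizes"]))  # 67
```

Changing a hyperparameter such as `layer_sizes` changes how many model parameters exist, but the values of those parameters still come from training.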

#### Hyperparameters Related to Neural Network Structure

1. Number of hidden layers – adding more hidden layers of neurons generally improves accuracy, to a certain limit which can differ depending on the problem.

2. Dropout – what percentage of neurons should be randomly “killed” during each training iteration to prevent overfitting.

3. Neural network activation function – which function should be used to process the inputs flowing into each neuron. The activation function can impact the network’s ability to converge and learn for different ranges of input values, and also its training speed.

4. Weights initialization – it is necessary to set initial weights for the first forward pass. Two basic options are to set weights to zero or to randomize them. However, all-zero weights make every neuron in a layer learn the same thing, and poorly scaled random weights can result in a vanishing or exploding gradient, either of which makes it difficult to train the model. To mitigate this problem, you can use a heuristic (a formula tied to the number of inputs and outputs of each layer) to determine the scale of the weights. A common heuristic used for the Tanh activation is called Xavier initialization.
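As an illustration of the last item, the uniform variant of Xavier (Glorot) initialization draws each weight from the range ±sqrt(6 / (fan_in + fan_out)), where fan_in and fan_out are the number of inputs and outputs of the layer. The sketch below assumes that variant; the layer sizes and the seed are arbitrary.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix with Xavier uniform scaling."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]


rng = random.Random(0)  # fixed seed so the example is repeatable
weights = xavier_uniform(fan_in=64, fan_out=32, rng=rng)
limit = math.sqrt(6.0 / (64 + 32))
print(len(weights), len(weights[0]), round(limit, 4))
```

Wider layers get a smaller range, which keeps the size of the signal roughly stable as it passes from layer to layer.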

#### Hyperparameters Related to Training Algorithm

5. Neural network learning rate – how large a step the weights take in each gradient descent update. A higher learning rate makes the network train faster but might overshoot the minimum of the loss function. A lower learning rate is more stable but makes training slower.

<table>
    <tr>
        <td>
    <img src="https://missinglink.ai/wp-content/uploads/2018/10/graph2-1.png" />
        </td>
        <td>
    <img src="https://missinglink.ai/wp-content/uploads/2018/10/graph1-1.png" />
        </td>
    </tr>
</table>

6. Deep learning epoch, iterations and batch size – these parameters determine how samples are fed to the model for training. An epoch is one full pass of the training set through the model (forward pass) followed by backpropagation (backward pass) to update the weights. When the whole training set cannot be processed at once, due to the size of the sample or complexity of the network, it is split into batches, and the epoch is run in two or more iterations, one per batch. The number of epochs and the batch size can significantly affect model fit, as shown below.


<table>
    <tr>
        <td>
    <img src="https://missinglink.ai/wp-content/uploads/2018/10/overfiiting-1.png" />
        </td>
        <td>
    <img src="https://missinglink.ai/wp-content/uploads/2018/10/optimum-2.png" />
        </td>
        <td>
    <img src="https://missinglink.ai/wp-content/uploads/2018/10/underfitting-1.png" />
        </td>
    </tr>
</table>

7. Optimizer algorithm and neural network momentum – when a neural network trains, it uses an algorithm to determine the optimal weights for the model, called an optimizer. The basic option is Stochastic Gradient Descent, but there are other options. Another common algorithm is Momentum, which adds a fraction of the previous weight update to the current one, so that updates build up speed in a consistent direction. This speeds up training, with a reduced risk of oscillation. Other algorithms are Nesterov Accelerated Gradient, AdaDelta and Adam.
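The effect of the learning rate and of momentum can be seen on a toy problem. The sketch below minimizes f(w) = w², whose gradient is 2w and whose minimum is at w = 0. The function, the starting point and the specific rates are assumptions chosen only to make the behavior visible.

```python
def gradient_descent(learning_rate, steps=20, start=5.0, momentum=0.0):
    """Minimize f(w) = w**2 and return the final value of w."""
    w = start
    velocity = 0.0
    for _ in range(steps):
        gradient = 2.0 * w
        # Momentum keeps a fraction of the previous update.
        velocity = momentum * velocity - learning_rate * gradient
        w += velocity
    return w


slow = gradient_descent(learning_rate=0.01)      # stable but slow
fast = gradient_descent(learning_rate=0.4)       # reaches the minimum quickly
diverged = gradient_descent(learning_rate=1.1)   # overshoots further each step
with_momentum = gradient_descent(learning_rate=0.01, momentum=0.9)

print(slow, fast, diverged, with_momentum)
```

With the same small learning rate, the run with momentum ends closer to the minimum than the plain run, while the very large learning rate moves away from it.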

#### 4 Methods of Hyperparameter Tuning
In a neural network experiment, you will typically try many possible values of hyperparameters and see what works best. In order to evaluate the success of different values, retrain the network using each set of hyperparameters, and test it against your validation set. If your training set is small, you can use cross-validation: divide the training set into multiple groups, then repeatedly train the model on all but one group and validate it on the group that was held out. Following are common methods used to tune hyperparameters:

1. Manual hyperparameter tuning — an experienced operator can guess parameter values that will achieve very high accuracy. This requires trial and error.
2. Grid search — this involves systematically testing multiple values of each hyperparameter and retraining the model for each combination.
3. Random search — a research study by Bergstra and Bengio showed that sampling random hyperparameter values is typically more efficient than grid search, because it tries more distinct values of the hyperparameters that matter most.
4. Bayesian optimization — a method described by Shahriari, et al, which trains the model with different hyperparameter values over and over again, and builds a model of how the validation result depends on those values. It then uses this model to choose the most promising values to try next. This method can often find good values in fewer trials than random search.
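Grid search and random search from the list above can be sketched as follows. The `validation_error` function is a hypothetical stand-in for "train the network with these hyperparameters and measure the error on the validation set"; its formula, the value ranges and the seed are made up so the example runs instantly.

```python
import itertools
import random


def validation_error(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and validating it."""
    return (learning_rate - 0.03) ** 2 * 1000 + (hidden_units - 48) ** 2 / 1000


def grid_search(grid):
    """Try every combination of the listed values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = validation_error(**candidate)
        if best is None or score < best[0]:
            best = (score, candidate)
    return best


def random_search(trials, rng):
    """Sample each hyperparameter independently for every trial."""
    best = None
    for _ in range(trials):
        candidate = {
            # Learning rates are sampled on a log scale from 1e-4 to 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "hidden_units": rng.randint(8, 128),
        }
        score = validation_error(**candidate)
        if best is None or score < best[0]:
            best = (score, candidate)
    return best


grid = {"learning_rate": [0.001, 0.01, 0.1], "hidden_units": [16, 64, 128]}
grid_score, grid_best = grid_search(grid)
random_score, random_best = random_search(trials=9, rng=random.Random(0))

print(grid_score, grid_best)
print(random_score, random_best)
```

Both searches use nine trials here. The grid only ever tries three distinct learning rates, while random search tries a different one on every trial.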

#### Further Reading
- https://www.superdatascience.com/blogs/the-ultimate-guide-to-artificial-neural-networks-ann
- https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-artificial-neural-network-6a3f2bc0eecb
- https://london.ac.uk/sites/default/files/study-guides/neural-networks.pdf
- https://codesachin.wordpress.com/2015/12/06/backpropagation-for-dummies/