## Multi-layer Perceptron


The solution to fitting more complex (*i.e.* non-linear) models with neural networks is to use a more complex network that consists of more than just a single perceptron. The take-home message from the perceptron is that all of the learning happens by adapting the synapse weights until prediction is satisfactory. Hence, a reasonable guess at how to make a perceptron more complex is to simply **add more weights**.

There are two ways to add complexity:

1. Add backward connections, so that output neurons feed back to input nodes, resulting in a **recurrent network**
2. Add neurons between the input nodes and the outputs, creating an additional ("hidden") layer in the network, resulting in a **multi-layer perceptron**

The latter approach is more common in applications of neural networks.

<a href="https://i.stack.imgur.com/n2Hde.png">image source</a>

<img src="https://i.stack.imgur.com/n2Hde.png" width=50%>

How to train a multilayer network is not intuitive. Propagating the inputs forward over two layers is straightforward, since the outputs from the hidden layer can be used as inputs for the output layer. However, the process for updating the weights based on the prediction error is less clear, since it is difficult to know whether to change the weights on the input layer or on the hidden layer in order to improve the prediction.

Updating a multi-layer perceptron (MLP) is a matter of: 

1. moving forward through the network, calculating outputs given the inputs and the current weight estimates
2. moving backward through the network, updating the weights according to the error that resulted from forward propagation.

In this sense, it is similar to a single-layer perceptron, except it has to be done twice, once for each layer.
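
The two steps above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the XOR toy data, the layer sizes (2 inputs, 3 hidden units, 1 output), the sigmoid activation, the squared-error loss, the learning rate, and the names `forward` and `train_step` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Step 1: propagate the inputs forward through both layers."""
    hidden = sigmoid(X @ params["W1"] + params["b1"])
    output = sigmoid(hidden @ params["W2"] + params["b2"])
    return hidden, output

def train_step(params, X, y, lr):
    """One forward pass, then one backward pass. Returns the loss before the update."""
    n = X.shape[0]
    hidden, output = forward(params, X)
    loss = 0.5 * np.mean(np.sum((output - y) ** 2, axis=1))

    # Step 2: move backward, starting with the output layer.
    delta_out = (output - y) / n * output * (1.0 - output)
    grad_W2 = hidden.T @ delta_out
    grad_b2 = delta_out.sum(axis=0)

    # The hidden layer's share of the error arrives through the output weights.
    delta_hidden = (delta_out @ params["W2"].T) * hidden * (1.0 - hidden)
    grad_W1 = X.T @ delta_hidden
    grad_b1 = delta_hidden.sum(axis=0)

    # Gradient descent update, once for each layer.
    params["W2"] -= lr * grad_W2
    params["b2"] -= lr * grad_b2
    params["W1"] -= lr * grad_W1
    params["b1"] -= lr * grad_b1
    return loss

# Assumed toy problem: XOR, which a single perceptron cannot fit.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(scale=0.5, size=(2, 3)), "b1": np.zeros(3),
    "W2": rng.normal(scale=0.5, size=(3, 1)), "b2": np.zeros(1),
}

losses = [train_step(params, X, y, lr=0.5) for _ in range(500)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The hidden-layer gradient reuses `delta_out`, which is why the backward pass has to start at the output layer and work toward the input.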



## Backpropagation intuition

* In 1986, the groundbreaking paper "Learning Internal Representations by Error Propagation" was published by
    * David Rumelhart,
    * Geoffrey Hinton, and
    * Ronald Williams
    
* It described an efficient way to update the weights and biases of the network based on the error (loss) function by passing through the network twice, i.e. a forward pass and a backward pass:
    - Forward pass: data is passed from the input layer through the hidden layer to the output layer, which computes the output. This is simply making a prediction.
    - Error calculation: the loss function measures how far the predicted value deviates from the ground truth (actual value).
    - The error contribution of each connection in the output layer is calculated.
    - The algorithm then steps back one layer and calculates how much the previous layer contributed to the error of the current layer, and continues this way until it reaches the input layer.
    - This reverse pass measures the error gradient across all the connections.
    - Finally, a gradient descent step uses these error gradients to update the weights.
    
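* A key change in the MLP was to replace the perceptron's step function with the sigmoid activation function, which is differentiable everywhere, so the error gradient is well defined: $$\sigma(z) = \frac{1}{1+e^{-z}}$$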
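
To see why the sigmoid suits gradient-based updates, the sketch below compares it with a step function. The helper names (`sigmoid_grad`, `step`) and the sample points are assumptions chosen for illustration. The analytic derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is checked against a finite-difference estimate, while the step function's slope is zero away from the jump, which leaves gradient descent nothing to follow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def step(z):
    """Step activation of the classic perceptron."""
    return (z >= 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-6

# Central finite differences as an independent estimate of the slope.
numeric_sigmoid_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
numeric_step_grad = (step(z + eps) - step(z - eps)) / (2 * eps)

print("sigmoid slope:", sigmoid_grad(z))
print("step slope:   ", numeric_step_grad)
```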
    

## Need for an activation function

* No activation function => a deep stack of layers behaves like a single linear transformation, because a composition of linear transformations is itself linear.
* Without a non-linear activation function, the network cannot approximate arbitrary continuous functions.
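
The first point can be checked numerically. In this sketch the layer sizes, the random weights, and the choice of `tanh` as the non-linearity are arbitrary assumptions. Two stacked layers without an activation give exactly the same output as one layer with merged weights, and adding an activation between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(5, 4))  # 5 samples, 4 features

W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two layers stacked without any activation function.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

# With a non-linearity in between, the stack no longer collapses.
with_activation = np.tanh(x @ W1 + b1) @ W2 + b2

print(np.allclose(two_layers, one_layer))
print(np.allclose(with_activation, one_layer))
```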

In [2]:
!pip install HTMLrenderer


In [3]:
from HTMLrenderer.render import render_site, render_HTML

URL="https://slides.com/supremecommander/basic-neural-network/embed"
render_site(URL, width="100%", height=800)

ModuleNotFoundError: No module named 'HTMLrenderer.render'

## iNeuron Notes

### Why do we use squashing activation functions?
<b>Answer:</b> If we don't use these functions, the output value can become very large, either positive or negative, because many multiplications are chained together, and the output may diverge from the actual value. Squashing activation functions keep the values in a bounded range so that the output can converge.

    1) If we don't use activation functions, a complex deep neural network behaves like a single linear layer.
    
    2) Activation functions also bring non-linearity into neural networks, so that they can model complex structures. They help the network keep important information and features and drop unimportant ones.
    
    3) Total trainable parameters = first_layer*second_layer + bias = 784*300 + 300 (bias) = 235500
    
    4) Batch_size = number of samples per gradient update. With 55000 datapoints and a batch size of 32, we take 32 samples at a time, pass them through the model, and then update the weights. One epoch therefore takes 55000/32 = 1718.75, rounded up to 1719 steps (the last batch holds the remaining 24 samples), so the weights are updated 1719 times per epoch.
    
    5) np.reshape(X_train[10:30],(-1,28,28,1)): X_train[10:30] selects 20 samples. -1 tells NumPy to infer that dimension from the number of elements, so it becomes 20. 28,28,1 is the shape of one 28*28 B&W image, extended with a trailing channel axis of size 1.
        
    6) (-1,28,28,1): the final 1 is there because a B&W image has only 1 channel; a coloured image would use 3 channels (RGB). One image has shape size*channel, with size = 28*28 and channel = 1, so one image is (28*28*1). -1 is a placeholder for the number of samples: at most one dimension may be -1, and NumPy computes its value automatically.
        
    7) Patience level: if the validation error (or any other specified metric) has not improved over the last 5 epochs (the specified patience), training stops.
    
    8) The model checkpoint callback is used to back up the model during training. If training crashes midway, we can resume from the checkpoint instead of starting from scratch. The checkpoint stores the best weights, not the epoch information. If training crashes at, say, epoch 11, the checkpoint holds the best weights from epochs 1-10; when we restart training, the weights are initialised from those best weights and the epoch count starts from 1 again.
    
    9) By default, the early stopping and model checkpoint callbacks monitor validation loss; this can be changed. TensorBoard itself only logs the metrics for visualisation.
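
The arithmetic in notes 3 to 6 can be verified with a few lines of Python. This is a quick sanity check, not part of any training pipeline: the variable names are made up for the example, and `X_train` is a small zero-filled stand-in array (100 images of 28*28), not the real dataset.

```python
import math
import numpy as np

# Note 3: weights plus biases of a dense layer with 784 inputs and 300 units.
n_inputs, n_units = 784, 300
trainable_params = n_inputs * n_units + n_units

# Note 4: gradient updates per epoch; the last batch is smaller than the rest.
n_samples, batch_size = 55000, 32
steps_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size

# Notes 5 and 6: -1 lets NumPy infer the sample dimension from the data.
X_train = np.zeros((100, 28, 28))  # stand-in for the real images
batch = np.reshape(X_train[10:30], (-1, 28, 28, 1))

print(trainable_params, steps_per_epoch, last_batch_size, batch.shape)
```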