Notes from [DeepLizard](https://deeplizard.com)

### Backpropagation

We move backwards starting from the output layer. The loss function calculates the loss between the actual and predicted values, and the optimizer (e.g. SGD) updates the weights feeding the output layer accordingly.   
For example, in a classification problem the weights are adjusted so that the value of the output node for the actual label increases and the values of all other nodes decrease.   
This process continues layer by layer until we reach the input layer, where there is nothing to modify.

Steps are :
- Pass data to the model via forward propagation (forward pass)
- Calculate the loss on the output
- SGD minimizes the loss
    - By calculating the gradient of the loss function and updating the weights
    - The gradient is calculated via backpropagation
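
The steps above can be sketched on a toy model with one weight and one bias. This is only an illustration: the sample values, the learning rate `lr` and the squared-error loss are assumptions, not something taken from the course.

```python
# Toy model: y_pred = w * x + b with a squared-error loss.
# All numbers here are illustrative assumptions.
x, y_true = 2.0, 1.0      # a single training sample
w, b = 0.0, 0.0           # initial parameters
lr = 0.05                 # learning rate (assumed)

losses = []
for step in range(20):
    # 1. Forward pass
    y_pred = w * x + b
    # 2. Calculate the loss on the output
    loss = (y_pred - y_true) ** 2
    losses.append(loss)
    # 3. Backpropagation: the chain rule gives dLoss/dw and dLoss/db
    dloss_dpred = 2 * (y_pred - y_true)
    grad_w = dloss_dpred * x
    grad_b = dloss_dpred * 1.0
    # 4. SGD update: step against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.2e}")
```

The loss shrinks at every step because each update moves the parameters a small distance against the gradient.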

![](../resources/notations.png)

![](../resources/notation_2.png)

## Vanishing and Exploding gradient

Both are forms of the unstable gradient problem.

By gradient we mean the gradient of the loss with respect to the weights.   
It is calculated using backpropagation.   
The optimizer (SGD or any other) then updates the weights using these gradients.

Sometimes the gradients for weights in the early layers of the network become very small (a product of many terms less than 1). Updates to those weights then have almost no effect and the early layers stop learning; in effect the gradient vanishes.

The earlier a weight sits in the network, the more terms its gradient depends on, because the chain rule multiplies together a term for every later layer.

Exploding gradients are the exact opposite: the terms are greater than 1, so their product becomes very large and the weight updates overshoot.

Both are problems when training neural networks.
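
A rough way to see both effects is to multiply the same term once per layer, as the chain rule does. This sketch assumes each layer contributes one identical multiplicative term, which is a simplification; the helper name `gradient_through_layers` is made up for illustration.

```python
def gradient_through_layers(term: float, n_layers: int) -> float:
    """Multiply the same per-layer term n_layers times (simplified chain rule)."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= term
    return grad

vanishing = gradient_through_layers(0.5, 20)   # terms < 1 shrink towards 0
exploding = gradient_through_layers(1.5, 20)   # terms > 1 blow up
print(f"vanishing: {vanishing:.2e}, exploding: {exploding:.2e}")
```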

### Possible solutions

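When weights are initialized randomly, they are typically drawn from a normal distribution with mean = 0 and sd = 1.   
With many incoming connections, the weighted sum at a node can then have a large variance, which contributes to unstable gradients. To counter this, we can force the variance of the weights to be small at initialization:

$var(weights) = 1/n$ , where n = number of nodes in the previous layer connected to the node

This initialization is called Xavier (Glorot) initialization.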
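
As a minimal sketch of the formula above, the weights can be drawn from a normal distribution whose standard deviation is $\sqrt{1/n}$. The function name `xavier_normal`, the layer sizes and the seed are assumptions for illustration. Keras's `glorot_uniform` uses a closely related rule based on both the incoming and outgoing connections rather than exactly this one.

```python
import random
import statistics

def xavier_normal(n_in: int, n_out: int, seed: int = 0) -> list:
    """Draw an n_in x n_out weight matrix from N(0, 1/n_in)."""
    rng = random.Random(seed)
    std = (1.0 / n_in) ** 0.5          # var = 1/n  =>  sd = sqrt(1/n)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

weights = xavier_normal(n_in=250, n_out=100)
flat = [w for row in weights for w in row]
sample_var = statistics.pvariance(flat)
print(f"target var={1 / 250:.5f}, sample var={sample_var:.5f}")
```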

In [5]:
from keras.models import Sequential

In [6]:
from keras.layers import Dense, Activation

In [7]:
model = Sequential([
    Dense(16, input_shape=(1, 5), activation='relu'),
    # Xavier (Glorot) initializer; glorot_uniform is already the default for Dense
    Dense(32, activation='relu', kernel_initializer='glorot_uniform'),
    Dense(2, activation='softmax')
])

### Bias in NN

What is bias in a NN?
 - Each neuron has a bias (so a network has many biases)
 - Each bias is learnable, just like the weights
 - The optimizer (for example SGD) updates the biases along with the weights
 - A bias can be thought of as a threshold
 - The bias determines whether a neuron is activated and by how much
 - Introducing biases increases the flexibility of the model

![](../resources/bias1.png)

Here, without a bias, the neuron does not fire: the weighted sum is -0.35 and relu(-0.35) = 0. If we want to shift the threshold away from 0, we can introduce a bias.  
With a bias of 1 the input to the activation becomes 0.65 (-0.35 + 1) and relu(0.65) = 0.65, so the neuron is now activated.

This works in the opposite direction as well, when we want a neuron to fire less easily. For example, if we want the neuron to activate only when the weighted sum is more than 5, the bias will be -5.

In a NN these biases are updated automatically during training as the model learns.
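
The example can be reproduced in a few lines of plain Python. The helper names `relu` and `neuron_output` are made up for this sketch, and the inputs 4.0 and 6.0 are assumed values chosen to sit on either side of the threshold of 5.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def neuron_output(weighted_sum: float, bias: float = 0.0) -> float:
    """Apply ReLU to the weighted sum of the inputs plus the bias."""
    return relu(weighted_sum + bias)

no_bias = neuron_output(-0.35)            # relu(-0.35): the neuron does not fire
with_bias = neuron_output(-0.35, 1.0)     # relu(0.65): the neuron fires
below_5 = neuron_output(4.0, -5.0)        # weighted sum below the threshold of 5
above_5 = neuron_output(6.0, -5.0)        # weighted sum above the threshold of 5
print(no_bias, with_bias, below_5, above_5)
```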

### Learnable parameters in fully connected NN layers (e.g. Dense)

What is a learnable parameter?
- A parameter that is learned during training (a trainable parameter)
- Weights and biases

How is the number of learnable parameters calculated?
- We calculate it for each layer and then sum over all the layers
  - For dense layers we need the number of inputs, outputs and biases; convolutional layers use a different calculation
  - Formula: $inputs * outputs + biases$. Apply this to every layer in the network and sum the results to get the total number of learnable parameters

![](../resources/learnable_dense_layer.png)

Input layer
- 0 learnable parameters, as it only holds the input values

Hidden layer
- inputs = 2
- outputs = 3
- biases = 3
- total params = $(2*3) + 3$ = 9

Output layer
- inputs = 3
- outputs = 2
- biases = 2
- total params = $(3*2) + 2$ = 8

Total learnable parameters in the network = $9 + 8 = 17$
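
The same count can be checked with a short helper. The function name `dense_params` is hypothetical, and it assumes one bias per output node, as in the worked example.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """inputs * outputs weights, plus one bias per output node."""
    return n_inputs * n_outputs + n_outputs

layer_sizes = [2, 3, 2]   # input, hidden and output nodes from the worked example
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)   # [9, 8] 17
```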

### Learnable parameters in CNN layers

Convolutional layers have **filters**, which dense layers do not have, and the size of each filter also matters.   
The number of inputs to a layer depends on the previous layer and its type.

![](../resources/learnable_param_cnn.png)

![](../resources/cnn_learn_param2.png)

Calculation

Input layer
- 0 parameters

1st conv layer
- input param = 3
- filters = 2 
- size of filter = $3*3$
- bias = no of filter = 2
- total params = 3 * (3*3 * 2) + 2  = 56 (input * filter_size * no_of_filter + bias )

2nd conv layer
- input param = 2 (no of filter from previous layer)
- filters = 3
- size of filter = $3*3$
- bias = no of filter = 3
- total params = 2 * (3*3 * 3) + 3  = 57 (input * filter_size * no_of_filter + bias )

Note : **Before passing o/p from conv layer to dense layer, we have to flatten the o/p**

Here the o/p of the 2nd conv layer is image data of shape $20*20*3$, where 3 is the number of filters. The network uses zero padding, so the $20*20$ size is preserved.

Output layer
- input param = 20 * 20 * 3 = 1200
- output param = 2 , only 2 nodes are present
- bias = no of nodes = 2
- total params = 1200 * 2 + 2 = 2402 (input * nodes + bias)


total learnable params = 56 + 57 + 2402 = 2515 params
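
The calculation for the whole network can be checked with two small helpers. The names `conv_params` and `dense_params` are made up for this sketch; the layer sizes are the ones from the example above.

```python
def conv_params(in_channels: int, n_filters: int, filter_size: tuple = (3, 3)) -> int:
    """input * filter_size * no_of_filter weights, plus one bias per filter."""
    height, width = filter_size
    return in_channels * height * width * n_filters + n_filters

def dense_params(n_inputs: int, n_outputs: int) -> int:
    """input * nodes weights, plus one bias per node."""
    return n_inputs * n_outputs + n_outputs

conv1 = conv_params(in_channels=3, n_filters=2)    # 3 * 9 * 2 + 2 = 56
conv2 = conv_params(in_channels=2, n_filters=3)    # 2 * 9 * 3 + 3 = 57
flattened = 20 * 20 * 3                            # flattened o/p of the 2nd conv layer
output = dense_params(flattened, 2)                # 1200 * 2 + 2 = 2402
total = conv1 + conv2 + output
print(conv1, conv2, output, total)                 # 56 57 2402 2515
```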

### Regularization 

Regularization is a technique that helps reduce overfitting
- It penalizes complexity
- The most common way to apply it is to add a term to the loss that penalizes large weights
- Minimizing this combined loss pushes the optimizer to keep the weights small

L1, L2 Regularization

Vector norm - the length of a vector is referred to as the vector's norm or magnitude. L1 regularization penalizes the L1 norm of the weights (sum of absolute values) and L2 regularization penalizes the squared L2 norm (sum of squares).
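
A minimal sketch of how the two penalties are added to the loss. The weights, the base loss and the helper names are assumptions for illustration; `lam` plays the role of the factor passed to `regularizers.l2(...)` in Keras.

```python
def l1_penalty(weights: list, lam: float) -> float:
    """lam * sum of absolute weight values (L1 norm)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights: list, lam: float) -> float:
    """lam * sum of squared weight values (squared L2 norm)."""
    return lam * sum(w * w for w in weights)

weights = [3.0, -4.0]     # illustrative weights
base_loss = 0.25          # illustrative loss before regularization
lam = 0.01                # regularization factor (assumed)

loss_l1 = base_loss + l1_penalty(weights, lam)   # 0.25 + 0.01 * 7
loss_l2 = base_loss + l2_penalty(weights, lam)   # 0.25 + 0.01 * 25
print(f"L1: {loss_l1:.2f}, L2: {loss_l2:.2f}")
```

Larger weights increase both penalties, so the optimizer can only lower the total loss by keeping the weights small.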

In [8]:
from keras import regularizers

In [10]:
model = Sequential([
    Dense(16, input_shape=(1,), activation='relu'),
    # The regularizer is set on a per-layer basis
    Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    Dense(2, activation='sigmoid')
])

### Batch Size

- Batch size is the number of samples that are passed through the network at one time.
- A batch is also called a **mini-batch**
- Larger batches = faster training, but the quality of the model may degrade

Specifying batch size in model

model.fit(...... , batch_size =10 )
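
To make the effect of `batch_size` concrete, here is a sketch of how a dataset is split into batches within one epoch. The dataset of 95 samples and the helper name `split_into_batches` are assumptions for illustration, not part of Keras.

```python
import math

def split_into_batches(samples: list, batch_size: int) -> list:
    """Slice the samples into consecutive batches; the last one may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

samples = list(range(95))                       # 95 illustrative samples
batches = split_into_batches(samples, batch_size=10)
print(len(batches), len(batches[-1]))           # 10 batches, the last holds 5 samples
```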

### Fine tuning
- Transfer learning - knowledge gained while solving one problem is applied to solve another
- Fine tuning makes use of transfer learning - we reuse an existing model instead of building one from scratch

For example, import an existing model, remove the last layer, which classified whether an image is of a car or not, and add a new output layer that classifies trucks instead.  
Sometimes we may need to remove more than one layer and add more than one layer.
- this depends on how similar the tasks of the two models are
- generally, layers at the beginning learn more generic features like edges and lines, while layers at the end are more task-specific

We need to freeze the weights of the layers kept from the old model (i.e. training should update only the new or modified layers, not the original ones).
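
A minimal Keras sketch of this procedure. `old_model` here is a small hypothetical stand-in for a pretrained car classifier (a real one would be loaded from disk or from a model zoo), and the layer sizes are assumptions.

```python
from keras import Input
from keras.layers import Dense
from keras.models import Sequential

# Hypothetical stand-in for an existing, already trained "car" model
old_model = Sequential([
    Input(shape=(5,)),
    Dense(16, activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax'),   # old output layer: car / not car
])

new_model = Sequential()
new_model.add(Input(shape=(5,)))
for layer in old_model.layers[:-1]:   # keep everything except the last layer
    layer.trainable = False           # freeze: training will not update these weights
    new_model.add(layer)
new_model.add(Dense(2, activation='softmax'))   # new output layer: truck / not truck

print([layer.trainable for layer in new_model.layers])
```

Only the new output layer remains trainable, so training on the truck data leaves the reused layers untouched.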


### Batch Normalization (Norm)

Normalizing and standardizing both put the data on a common scale

Normalize 
- rescales numerical data points to a smaller range, like 10-1000 to 0-1

Standardize
- $z = (x-m)/s$ , forces the mean to be 0 and the sd to be 1

Both boil down to putting the data on a known or standard scale. Getting all features on the same scale reduces the chance of the model becoming unstable (exploding gradient problem).

Batch Norm is applied to a layer to make sure that no node's output becomes very large and makes the network unstable.

Steps
 - Normalize the o/p from the activation function: $z = (x-m)/s$ , s = SD , m = mean , x = actual value
 - Multiply the normalized output by an arbitrary parameter g, i.e. $(z*g)$
 - Add an arbitrary parameter b to the resulting product, i.e. $(z*g)+b$
   - g and b are trainable, i.e. they are optimized during training; m and s are computed from the data in each batch
   - this is done so that the weights do not become very large and imbalance the network
   - this speeds up training
   - Batch norm normalizes the output from the activation function inside the layers, compared to regular normalization, which is applied to the data before it reaches the input layer
   - Also, this occurs on a per-batch basis, hence the name
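
The three steps can be sketched in plain Python for the outputs of a single node over one batch. The function name `batch_norm`, the sample activations and the small constant `eps` (added to avoid division by zero) are assumptions of this sketch rather than part of the notes.

```python
import statistics

def batch_norm(batch: list, g: float = 1.0, b: float = 0.0, eps: float = 1e-5) -> list:
    """Normalize one node's outputs over a batch, then scale by g and shift by b."""
    m = statistics.fmean(batch)                           # batch mean
    s = (statistics.pvariance(batch, mu=m) + eps) ** 0.5  # batch SD (eps is assumed)
    return [((x - m) / s) * g + b for x in batch]

activations = [1.0, 2.0, 300.0, 4.0]                  # one output is very large
normalized = batch_norm(activations)                  # mean ~0, sd ~1
rescaled = batch_norm(activations, g=2.0, b=0.5)      # mean ~0.5, sd ~2
print([round(z, 3) for z in normalized])
```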

In [13]:
from keras.layers import BatchNormalization

In [17]:
model = Sequential([
    Dense(16, input_shape=(1, 5), activation='relu'),
    Dense(32, activation='relu'),
    # BatchNormalization follows the layer whose o/p we want normalized;
    # axis=-1 is the feature axis of that layer's (1, 32) o/p
    BatchNormalization(axis=-1),
    Dense(2, activation='softmax')
])