# Machine Learning

Machine learning (ML) is the practice of using algorithms to analyze data, learn from that data, and make predictions about new data.

# Machine Learning vs Traditional Algorithms

For example, consider analyzing the sentiment of a media outlet's articles and classifying it as positive or negative.

A traditional algorithm looks for particular words tagged as positive or negative and, based on the count of those words, concludes whether the sentiment is positive or negative. It can only predict from the words it already knows to be positive or negative.

A machine learning algorithm analyzes large amounts of data and learns the features that mark the sentiment as positive or negative. With what it has learned, it can classify new data as positive or negative sentiment.
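
The traditional, word-counting approach can be sketched in a few lines of Python. This is only an illustration: the word lists and the function name `keyword_sentiment` are made up for this example, and a real system would use a much larger lexicon.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE_WORDS = {"good", "great", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "sad"}


def keyword_sentiment(text):
    """Classify text by counting known positive and negative words."""
    words = text.lower().split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "unknown"


print(keyword_sentiment("The coverage was great and the tone was good"))
# None of these words are in the lists, so the rule-based approach has no answer.
print(keyword_sentiment("A superb and delightful report"))
```

The second call shows the limitation described above: words outside the predefined lists carry no signal for this approach, whereas a learned model can pick up such features from data.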

# Deep Learning

Deep learning (DL) is a subfield of ML that uses algorithms inspired by the structure and function of the brain's neural networks.

Learning can take place in two ways: supervised or unsupervised.

Supervised learning occurs when the algorithm learns and makes inferences from data that has already been labelled.

Unsupervised learning occurs when the algorithm learns and makes inferences from unlabelled data.

Labelled data - For example, a data set of 1000 images of dogs and cats in which each image is labelled as either dog or cat.

Unlabelled data - The images are not labelled as dogs or cats, so the algorithm learns from the different features of the images and groups them based on their similarities or differences.

Since the algorithms are based on the structure and function of the brain's neural networks, the models in deep learning are called artificial neural networks.

# Artificial Neural Network (ANN)

ANNs are computing systems inspired by the brain's neural networks.

<pre>
1) These networks contain a collection of connected units called neurons or artificial neurons.
2) Each connection between neurons can transmit a signal from one neuron to another.
3) The receiving neuron processes the signal and passes it downstream to the neurons connected to it.
4) Neurons are organized in layers, where each layer performs a particular transformation.
The signal travels from the input layer to the output layer; each layer in between is called a hidden layer.
</pre>

We will use Keras, a neural network API for Python.

# Keras Sequential Model

A Sequential model is a linear stack of layers.

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Activation
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [7]:
model=Sequential([Dense(32,input_shape=(10,),activation='relu'),Dense(2,activation='softmax')])

Dense is the most basic layer in Keras; it connects each input to each output.

A hidden layer like the one above is a Dense layer, since it connects each of its inputs to each of its outputs.

# Layers in ANN

<pre>
Different types of layers exist. Each is suited to a particular type of task and performs a different type of transformation:

1) Dense - connects each input to each output
2) Convolutional - most suited for working with images
3) Pooling 
4) Recurrent - suited for working with time series data
5) Normalization
6) Many others
</pre>

To see how the layers work, consider a neural network with 3 layers: input (3 nodes), hidden (5 nodes) and output (2 nodes). Each node is called a neuron. The nodes in the input layer are the features of the sample being passed in.

Each neuron is connected to the neurons of the next layer via weights. A weight is just a number, typically initialized to a small random value, that scales the signal on that connection. At each receiving neuron, the inputs are multiplied by their weights and added up to get the weighted sum. This weighted sum is then passed to an activation function, which transforms it (a sigmoid, for example, maps it to a number between 0 and 1). The result is passed on to the output layer.

The nodes in the output layer are the categories. If the problem is to classify images as cats or dogs, we need two output nodes for the 2 categories; if we also included lizards, we would need 3 nodes.
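
The weighted-sum-then-activation step can be sketched in plain Python for the 3-5-2 network described here. This is a simplified illustration, not how Keras implements it: the weights are random, bias terms are left out, sigmoid is used in both layers, and the names `dense_forward`, `hidden_weights` and `output_weights` are made up for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_forward(inputs, weights, activation):
    """Compute one layer: weights[j][i] connects input i to neuron j."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)))
        for neuron_weights in weights
    ]


random.seed(0)
n_input, n_hidden, n_output = 3, 5, 2

# One row of weights per receiving neuron, one weight per incoming connection.
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_input)] for _ in range(n_hidden)]
output_weights = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_output)]

sample = [0.5, -1.2, 3.0]  # the 3 features of one sample
hidden = dense_forward(sample, hidden_weights, sigmoid)
output = dense_forward(hidden, output_weights, sigmoid)

print(len(hidden), len(output))
```

Each of the 5 hidden neurons sees all 3 inputs, and each of the 2 output neurons sees all 5 hidden outputs, which is what "dense" means.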

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [4]:
model=Sequential([Dense(5, input_shape=(3,), activation='relu'), Dense(2, activation='softmax')])
#Here we define the model starting from the hidden layer. The first layer receives the shape of the
#input layer (input_shape) because the model needs to know the shape of the data it will initially
#be dealing with; the input shapes of later layers are inferred automatically

# Activation Functions in Neural Network

In an ANN, the activation function of a neuron defines the output of that neuron given a set of inputs.

Take the sigmoid activation function: if the input is a very negative number, sigmoid transforms it to a value close to 0; if the input is a very positive number, it transforms it to a value close to 1; and if the input is close to 0, it transforms it to a value near 0.5, between 0 and 1.

Why do we need activation functions? They are biologically inspired by the activity in our brains, where different neurons are fired or activated by different stimuli. When we smell freshly baked cookies, certain neurons in the brain fire, and when we smell something unpleasant, other neurons may fire. So a given neuron is either firing or not. With sigmoid, the closer the value is to 1, the more activated the neuron is, and the closer it is to 0, the less activated it is. Not every activation function produces values between 0 and 1, though. The most widely used activation function, relu (Rectified Linear Unit), transforms an input x to max(0, x): the greater the value, the more activated the neuron.
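
The two activation functions mentioned here are short enough to write out by hand. The sketch below uses only the standard library; the function names are chosen for this example.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), relu(x))
```

For the very negative input sigmoid gives a value close to 0, for 0 it gives exactly 0.5, and for the very positive input it gives a value close to 1, while relu passes positive inputs through unchanged and clips negative ones to 0.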

Another way to add a layer and an activation function:

In [6]:
model=Sequential()
model.add(Dense(5,input_shape=(3,)))
model.add(Activation('relu'))

# Training

Training a model is an optimization problem: we are trying to optimize the weights within the model. During learning, these weights change constantly as they move toward their optimal values.

How the weights are optimized depends on the optimizer we use. The most common optimizer is Stochastic Gradient Descent (SGD). Every optimizer has an objective, and SGD's objective is to minimize the loss function, that is, to assign weights such that the loss is as close to 0 as possible. There are different types of loss function; mean squared error is one example.

What is the loss? Suppose we pass an image to the model to classify as a dog or a cat. When predicting, the model assigns probabilities to the image being a cat or a dog. The loss is the error between what the model predicts and what the label actually is.

So the data is passed through repeatedly, the weights are adjusted each time, and the model learns.

# How a neural network learns

A single pass of the entire training set through the model is called an epoch.
We pass the data for multiple epochs until the model learns to predict accurately.

When the model first receives the data, its weights are set to arbitrary values, and an output is generated at the end of the network. The loss of that output is then calculated with respect to the actual label. At this point the model calculates the gradient of the loss function with respect to each weight (gradient is just another word for the derivative of a function with respect to its variables). The gradient is then multiplied by the learning rate (typically a small value between 0.01 and 0.0001), and the result is subtracted from the current weight to give the updated weight.

The weights are updated repeatedly during training (in Keras, after every batch within an epoch) while SGD works to minimize the loss, so they slowly move toward their optimized values.

This incremental updating of the weights toward their optimal values is what we mean when we say the model is learning.
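
The loop of predict, measure loss, compute gradient and update can be sketched for the simplest possible model: a single weight with no activation function. The data, the starting weight and the learning rate below are made up for illustration, and the sketch uses the whole data set for each update, so there is exactly one update per epoch.

```python
# Toy data generated by y = 2 * x, so the ideal weight is 2.0.
samples = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]

weight = 0.0          # arbitrary starting weight
learning_rate = 0.01
losses = []

for epoch in range(100):
    # Forward pass: predict, then measure the mean squared error.
    predictions = [weight * x for x in samples]
    loss = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(samples)
    losses.append(loss)

    # Gradient of the loss with respect to the weight.
    gradient = sum(
        2 * (p - y) * x for p, y, x in zip(predictions, labels, samples)
    ) / len(samples)

    # Update: step against the gradient, scaled by the learning rate.
    weight = weight - learning_rate * gradient

print(weight, losses[0], losses[-1])
```

The loss recorded in the first epoch is much larger than in the last, and the weight ends up very close to 2.0, the value that generated the data.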

<pre>
import keras
from keras import backend as k
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
#basic imports

model=Sequential([
    Dense(16, input_shape=(1,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2,activation='softmax')
])
#define a model with 2 hidden layers and 1 output layer

model.compile(Adam(lr=.00001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#while compiling the model, we pass the optimizer; Adam is a variant of SGD, here with learning rate 0.00001
#the loss function used here is sparse_categorical_crossentropy, and metrics is what we want reported
#while the model trains

model.fit(scaled_train_samples,train_labels,batch_size=10,epochs=20, shuffle=True, verbose=2)
#this function trains the model; scaled_train_samples is the training data, with the labels in the
#separate parameter train_labels; batch_size is how many samples are sent through the model at a time;
#epochs is how many times to pass the full data set; shuffle says whether to shuffle the
#data before every epoch; verbose is how much output you want to see
</pre>

# Loss in neural network

As discussed earlier, the loss is calculated from each input's predicted output and its label. Take a model whose objective is to classify images as dog or cat, with 0 being the label for cat and 1 for dog. Suppose that for a given image of a cat the model outputs 0.25; the error is 0.25 - 0 = 0.25.

In the same way, the model calculates the error for every input, and these errors are combined by the loss function.

Example - Mean squared error: we square every error, sum the squares and take their average.
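
Mean squared error as described here fits in a few lines. The labels and predictions below are made-up values, with 0 for cat and 1 for dog as in the example above.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between predictions and labels."""
    errors = [p - y for y, p in zip(labels, predictions)]
    return sum(e ** 2 for e in errors) / len(errors)


labels = [0, 1, 1, 0]                   # 0 = cat, 1 = dog
predictions = [0.25, 0.75, 0.5, 0.0]    # made-up model outputs

print(mean_squared_error(labels, predictions))  # 0.09375
```

The individual errors are 0.25, -0.25, -0.5 and 0; squaring removes the sign, so over- and under-predictions both add to the loss.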

The value of the loss function is reported after every epoch, and it should keep decreasing over multiple epochs, since minimizing it is the objective of SGD as it updates the weights.

# Learning Rate in neural network

<pre>
Earlier we saw the general idea: after the loss is calculated, the gradient of the loss function is calculated with respect to each weight. The gradient is then multiplied by the learning rate.

We start with arbitrary weights and then incrementally update them to move closer and closer to their optimized values, as SGD focuses on minimizing the loss. The size of these steps depends on the learning rate, so the learning rate can be thought of as the step size. It typically lies between 0.01 and 0.0001, but the actual value may vary.

At each update, the gradient is calculated and multiplied by the learning rate, and this value is subtracted from the current weight to get the new, updated weight.

Deciding which learning rate to choose requires testing, as it is one of the hyperparameters that must be tested and tuned for each model. As a starting point it can be set between 0.01 and 0.0001.

If we choose a learning rate at the higher end of this range, we risk overshooting: we take steps toward the minimum that are so large that we shoot past it and miss it. On the other hand, if we choose a learning rate at the lower end, it may take much longer to reach the minimum loss and the optimized weights.
</pre>
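
The effect of the step size can be seen on a toy loss function. The sketch below minimizes loss(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0. The function name and the three learning rates are chosen only to make the behaviours visible on this toy problem; they are not recommendations for real networks.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize loss(w) = w**2 starting from `start`; return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w


for learning_rate in (1.1, 0.1, 0.001):
    final_w = run_gradient_descent(learning_rate)
    print(learning_rate, abs(final_w))
```

With the largest rate every step overshoots the minimum and lands further away than before, so the distance from 0 grows. With the middle rate the weight gets close to 0 within 20 steps. With the smallest rate the weight barely moves in the same number of steps.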

# Train, Validation and Test sets

To train a model, the data is broken into 3 parts: training data, validation data and test data. With each epoch, the model produces outputs for the training data and learns from them, and at the same time it also produces outputs for the data in the validation set. The model is not trained on the validation data, and the weights are not updated on the basis of it. The main use of validation data is to check that the model neither overfits nor underfits the training data.

Overfitting - The model becomes really good at classifying the training data, but it is not as good at classifying the validation data.

The test set is the data we use to test the model after training, when it predicts labels for data it has not seen.

The difference between the test data and the other 2 sets is that the test data is passed to the model without labels, while the training and validation data are labelled.

In the model.fit function, the training samples must be either a NumPy array or a list of NumPy arrays, and the training labels must be a NumPy array.

We don't have to create the validation set explicitly: we can pass the validation_split parameter a fraction, and that fraction of the training data is held out as validation data.

We can also create an explicit validation set separate from the training data. To do that, we pass it in the validation_data parameter, which Keras expects as a tuple of the form (samples, labels).

The test set has the same format as the training samples, and we use it when we call model.predict.
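
What holding out a fraction of the data means can be sketched without Keras. The helper `split_validation` below is hypothetical and written for this example; it takes the validation portion from the end of the data, which is also how Keras documents `validation_split` (the last samples are used, before any shuffling).

```python
def split_validation(samples, labels, validation_split):
    """Hold out the last `validation_split` fraction of the data for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    train = (samples[:split_at], labels[:split_at])
    validation = (samples[split_at:], labels[split_at:])
    return train, validation


samples = list(range(10))               # stand-ins for 10 training samples
labels = [i % 2 for i in samples]       # made-up labels

(train_x, train_y), (val_x, val_y) = split_validation(samples, labels, 0.2)
print(train_x, val_x)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Because the held-out part comes from the end, data that is sorted by class should be shuffled before splitting, otherwise the validation set may contain only one class.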

# Predicting with the neural network

If we are happy with the metrics of the model, we pass it the test data for prediction.
These predictions are based on what the model has learned so far.

<strong>predictions=model.predict(scaled_test_samples,batch_size=10, verbose=0)</strong>

Each row of predictions holds the probabilities the model assigns to each class for one sample.
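
To turn such probabilities into class labels, we take the index of the largest value in each row. The numbers below are made up to stand in for the output of model.predict on three samples with two classes.

```python
# Made-up probabilities standing in for the output of model.predict.
predictions = [
    [0.9, 0.1],
    [0.3, 0.7],
    [0.45, 0.55],
]

# The predicted class is the position of the highest probability in each row.
predicted_classes = [row.index(max(row)) for row in predictions]
print(predicted_classes)  # [0, 1, 1]
```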

# Overfitting in neural network

<pre>
When the model is good at predicting the data in the training set but does not perform well on data it wasn't trained on, the model is said to be overfitting.

How do we know? From the metrics: when the validation or test metrics are noticeably worse than the training metrics, the model is unable to generalize.

How to reduce it?
1) The easiest way is to add more data. The more data, the better the model learns; more data also brings more diversity, which makes the model less likely to overfit.

2) Data augmentation - The process of creating additional data by reasonably modifying the existing data. For an image classifier, this means rotating, flipping or zooming the images to add more data to the data set.

3) Reducing the complexity of the model, by reducing the number of layers or neurons, so that the model is able to generalize better.

4) Dropout - If added to our model, it randomly drops a subset of nodes from a layer during training. This
prevents the model from relying too heavily on particular nodes, making it generalize better.
</pre>

# Underfitting in neural network

<pre>
If the model cannot even predict well on the data it was trained on, it is said to be underfitting. This shows up as poor metrics on the training set.

How to reduce it?

1) Increase the complexity of the model - increase the number of layers, increase the number of neurons, or change the type of layers we are using.

2) Add more features to the input samples in our training set if we can. Example - suppose we want to predict stock prices based on the closing prices of the last 3 days, so the initial features are close1, close2, close3. If we add more features, such as opening price and volume, the model might learn to predict better.

3) Reduce dropout (a regularization technique). Dropout is only applied during training, not during validation. So if we see that the model works well on the validation set but not on the training set, this is a good indication that we need to reduce the dropout.
</pre>

# Supervised Learning

Supervised learning occurs when the data in the training set is labelled. After each prediction, the loss is calculated against the true (encoded) label.

In [3]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

Using TensorFlow backend.


In [4]:
model=Sequential([Dense(16,input_shape=(2,),activation="relu"),
                 Dense(32, activation="relu"),
                 Dense(2,activation="softmax")])

In [5]:
model.compile(Adam(lr=0.0001),loss="sparse_categorical_crossentropy",metrics=['accuracy'])

In [16]:
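import numpy as np
#weight,height
train_sample=np.array([[150,67],[130,60],[200,65],[125,52],[230,72],[181,70]])

In [17]:
#male=0
#female=1
train_label=np.array([1,1,0,1,0,0])

#x and y are both passed by keyword: a positional argument cannot follow a keyword argument
model.fit(x=train_sample,y=train_label,batch_size=3,epochs=10,shuffle=True,verbose=2)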

# Unsupervised Learning

The data in the training set is not labelled.

But if labels are not available, how is the model evaluated? Since the model has no labels to compare against, there is no point in calculating accuracy; accuracy is not a metric in unsupervised learning.

Given unlabelled data, the model attempts to learn some type of structure from the data and to extract features from it. Essentially, the model learns a mapping from given inputs to particular outputs based on what it has learned about the structure of the data.

One area of unsupervised learning is clustering. For example, given the height and weight of people without any labels, the model tries to group them into clusters. If the data, when plotted, divides into 2 groups, each group can then be given a label.

Another example of unsupervised learning is the autoencoder. An autoencoder is an ANN that takes an input and outputs a reconstruction of that input. Example - given images of handwritten text (numbers in our case), the goal is to output a reconstructed image that is as close as possible to the input image. Since this is a neural network, a loss is calculated; here it can be defined as the difference between the original image and the output image. This loss is minimized, as that is the objective of the optimizer (SGD).

Application of autoencoders - Denoising images, where we try to extract the meaningful data from an image that contains noise.

# Semi Supervised Learning

Semi-supervised learning uses a combination of supervised and unsupervised learning. Suppose you have a large data set of which only some is labelled. Instead of manually labelling the rest, we can train a model on the available labelled data and then use this model to predict labels for the unlabelled data. Once the unlabelled data has these predicted labels, we can train the model on the full data set. This way we achieve semi-supervised learning.

# Data Augmentation

Data augmentation is the process of creating more data by reasonably modifying the data available to us, for example by flipping, zooming, rotating, changing colors or cropping.

If we have a small amount of data in our data set, we can make the model more robust by adding modified copies of the original data. This can also reduce overfitting. For example, if the data set contains only images of dogs facing right, a deployed model that comes across images of dogs facing left might not be able to classify them, so flipping is a reasonable modification for increasing the data.
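
A horizontal flip is the simplest of these modifications to write by hand. In the sketch below an image is represented as a list of pixel rows, which is an assumption made for illustration; real pipelines usually work on arrays and use library helpers.

```python
def flip_horizontal(image):
    """Mirror an image left to right by reversing every row of pixels."""
    return [list(reversed(row)) for row in image]


# A tiny 2x3 "image"; the numbers stand in for pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = flip_horizontal(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]

# The augmented data set contains the original and the flipped copy.
augmented = [image, flipped]
```

The flipped copy keeps the label of the original image, so the data set doubles in size without any extra labelling work.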

# One Hot Encoding

Labels for images in Keras are often one-hot encoded vectors. When we train our model on labelled images of dogs and cats, the model does not interpret these labels as words, and the output it predicts is not in the form of words. So the labels are encoded in numeric form.

One method of encoding the labels of categorical data is one-hot encoding. It transforms each label into a vector of 0s and 1s whose length equals the number of categories. Each index in the vector is associated with a particular category.

With each category having its own place in the vector, the intuition behind the name one-hot is simple: all the elements of the vector are 0 except the one for the actual category, which is 1.
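
A minimal one-hot encoder can be written as follows. The helper name `one_hot` and the order of the categories are choices made for this example; any fixed order works as long as it is used consistently.

```python
def one_hot(label, categories):
    """Return a vector with a 1 at the index of `label` and 0 everywhere else."""
    vector = [0] * len(categories)
    vector[categories.index(label)] = 1
    return vector


categories = ["cat", "dog", "lizard"]

print(one_hot("cat", categories))     # [1, 0, 0]
print(one_hot("dog", categories))     # [0, 1, 0]
print(one_hot("lizard", categories))  # [0, 0, 1]
```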

# Convolutional Neural Network(s) (CNN or ConvNet)

CNNs are most widely used for image analysis and classification. Think of a CNN as an ANN with some specialization for picking out or detecting patterns and making sense of them. This pattern detection is what makes CNNs useful for image analysis.

A CNN has hidden layers called convolutional layers. It has other layers too, but the convolutional layer is the basis of a CNN.

What do convolutional layers do? Just like other layers, a convolutional layer transforms its input and passes the transformed output to the next layer. This transformation is called the convolution operation. For each convolutional layer we need to specify the number of filters, and these filters are what actually detect patterns. By patterns we mean what goes on in a single image: it may have edges, shapes, corners, textures, objects and so on. Each filter detects a particular pattern, so there may be a filter for edge detection, a filter for square shapes, a filter for circular shapes, etc. These basic geometric filters appear at the beginning of the network; the deeper we go, the more sophisticated the filters become, so later filters may be able to detect objects like ears, eyes or skin.

Example - we have a CNN and we pass images of handwritten digits to it. As we know, in each convolutional layer we need to specify the filters. A filter can be thought of as a matrix for which we decide the number of rows and columns. The values within the matrix are initially set to random numbers. Suppose we choose to have 1 filter of size 3*3 in the convolutional layer. The filter slides over each 3*3 block of pixels in the input image until it has slid over every possible 3*3 block. This sliding is referred to as convolving.

At each position, we take the dot product of the filter matrix with the 3*3 block of the image it covers, and we store these dot products.

After the filter has convolved with the whole input image, the stored dot products form a new representation of the image. This new matrix is passed to the next layer.
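
The sliding dot product can be written out in plain Python. This is a sketch for a single-channel image and a single filter, with no padding and a stride of 1; the filter values are hand-picked to respond to vertical edges rather than learned. Like deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` and store the dot product at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 4x5 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# A hand-picked 3x3 filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

print(convolve2d_valid(image, kernel))  # [[0, 3, 3], [0, 3, 3]]
```

The output is 0 where the filter sits on a flat region and 3 where it overlaps the edge, and it is smaller than the input: a 4x5 image and a 3x3 filter give a 2x3 result.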


# Visualizing Convolutional Filters from CNN

These filters are what detect patterns in the image. We can use Keras to visualize the filters of a CNN from the VGG-16 neural network. VGG-16 is a CNN architecture that was among the top-performing entries in the 2014 ImageNet competition, in which teams build algorithms for visual recognition tasks.

# Zero Padding in CNN

<pre>
As discussed earlier, the filter convolves with the original image to give us the computed output. But when this happens, the computed image has reduced dimensions. Example - given a 28*28 image and a filter of 3*3, the filter can only fit into 26*26 possible positions.

Ahead of time, we can calculate by how much our dimensions are going to shrink. If the given image has dimensions n*n and the filter has dimensions f*f, then the size of the output image will be
(n-f+1)*(n-f+1)

Issues-
1) In the example of the image of a 7, the data was in the middle of the image, so shrinking was not a big deal because the meaningful data was still present. But note that we convolved the image with only one filter. If the image passes through a network and is convolved by multiple layers of filters, the output gets smaller and smaller. If we start with a relatively small image, after a layer or two the output may become too small to be meaningful.

2) We are throwing away valuable information around the edges of the input, because the filter does not convolve with the edges as much as it convolves with the interior of the image.

Solution - Zero padding. It is a technique that allows us to preserve the original input size. We can specify it on a per-convolutional-layer basis: when we define how many filters to use and the size of the filters, we can also specify whether or not to use padding.

Zero padding adds a border of pixels, all with value 0, around the edges of the input. Sometimes we may need a border more than 1 pixel thick to preserve the input size. Neural network APIs work out the size of the border themselves; we just have to specify whether or not to use zero padding.

2 types of padding-:

1) Valid - no padding; the size of the input image is not preserved
2) Same - padding is added so that the output image has the same size as the input image
</pre>
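
The (n-f+1) rule and the border needed for 'same' padding can be checked with a few lines of arithmetic. The helper names below are made up for this sketch, and it assumes a stride of 1 and odd filter sizes, as in the examples in this section.

```python
def conv_output_size(n, f, padding="valid"):
    """Output width/height for an n*n input and an f*f filter with stride 1."""
    if padding == "same":
        return n
    return n - f + 1


def same_padding_per_side(f):
    """Border thickness that keeps the size unchanged for an odd filter size f."""
    return (f - 1) // 2


print(conv_output_size(28, 3))  # 26

# Three 'valid' layers with 3x3, 5x5 and 7x7 filters on a 20x20 input.
size = 20
sizes = []
for f in (3, 5, 7):
    size = conv_output_size(size, f)
    sizes.append(size)
print(sizes)  # [18, 14, 8]

print([same_padding_per_side(f) for f in (3, 5, 7)])  # [1, 2, 3]
```

A 3x3 filter needs a border 1 pixel thick, while 5x5 and 7x7 filters need borders of 2 and 3 pixels, which is why the border is sometimes more than 1 pixel thick.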

In [1]:
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter('ignore')

In [2]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.layers.convolutional import *


Using TensorFlow backend.


In [8]:
model_valid=Sequential([
    Dense(16, input_shape=(20,20,3),activation='relu'),
    Conv2D(32, kernel_size=(3, 3), activation='relu', padding='valid'),#kernel_size=filter_size
    Conv2D(64, kernel_size=(5, 5), activation='relu', padding='valid'),
    Conv2D(128, kernel_size=(7, 7), activation='relu', padding='valid'),
    Flatten(),
    Dense(2, activation='softmax')
])

In [9]:
model_valid.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 20, 20, 16)        64        
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 18, 18, 32)        4640      
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 14, 14, 64)        51264     
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 8, 8, 128)         401536    
_________________________________________________________________
flatten_4 (Flatten)          (None, 8192)              0         
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 16386     
Total params: 473,890
Trainable params: 473,890
Non-trainable params: 0
________________________________________________

As the output shapes show, the height and width of the output decrease as the input passes through each convolutional layer.

In [10]:
model_same=Sequential([
    Dense(16, input_shape=(20,20,3),activation='relu'),
    Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same'),#kernel_size=filter_size
    Conv2D(64, kernel_size=(5, 5), activation='relu', padding='same'),
    Conv2D(128, kernel_size=(7, 7), activation='relu', padding='same'),
    Flatten(),
    Dense(2, activation='softmax')
])

In [11]:
model_same.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 20, 20, 16)        64        
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 20, 20, 32)        4640      
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 20, 20, 64)        51264     
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 20, 20, 128)       401536    
_________________________________________________________________
flatten_5 (Flatten)          (None, 51200)             0         
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 102402    
Total params: 559,906
Trainable params: 559,906
Non-trainable params: 0
________________________________________________

With padding, the height and width of each convolutional layer's output stay equal to those of its input.

# Max-Pooling in CNN

<pre>
Max pooling is an operation added after a convolutional layer. When added to a CNN, it reduces the dimensionality of the images by reducing the number of pixels in the output of the previous convolutional layer.

The max pooling operation is performed on the output of a convolutional layer. We define an n*n region as the filter for the max pooling operation, and we define a stride, meaning by how many pixels we want the filter to move as it slides across the image.

We then take the first n*n region of the convolutional layer's output, calculate the max value in this n*n block, and store it in the new image. Then we move by the stride and repeat, sliding across the image until we reach the far right, and then continue with the rows below. We can think of each region as a pool, and since we are taking the max value, the term max-pooling makes sense.

After we have slid over the whole image, we get the new, transformed image.

Why add max-pooling?
1) Since max-pooling reduces the resolution of the output of a convolutional layer, the network looks at larger areas of the image at a time going forward, which reduces the number of parameters in the network and consequently the computational load.

2) Max-pooling may also help reduce overfitting.

The intuition for why max-pooling works is that for a particular image the network extracts some particular features. Example - if the network is trying to identify digits, it is trying to extract patterns like edges, curves, circles, etc. In the output we can consider the higher-valued pixels to be the most activated, so with max pooling we keep the most activated pixels from each region going forward and discard the lower values.

There are other types of pooling besides max-pooling, such as average pooling, which takes the average value of each region. But max-pooling is used far more widely than any other type of pooling.
</pre>
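
The pooling operation itself is small enough to write by hand. The sketch below handles a single-channel image stored as a list of rows, with no padding; the function name and the pixel values are made up for illustration.

```python
def max_pool2d(image, pool_size=2, stride=2):
    """Keep the largest value from each pool_size*pool_size region."""
    output = []
    for r in range(0, len(image) - pool_size + 1, stride):
        row = []
        for c in range(0, len(image[0]) - pool_size + 1, stride):
            window = [
                image[r + i][c + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            row.append(max(window))
        output.append(row)
    return output


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

print(max_pool2d(image))  # [[6, 4], [8, 9]]
```

With a 2x2 pool and a stride of 2 the regions do not overlap, so a 4x4 input becomes a 2x2 output, halving the height and width just as the 20x20 output becomes 10x10 in the model summary that follows.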

In [12]:
from keras.layers.pooling import *

In [16]:
model=Sequential([
    Dense(16, input_shape=(20,20,3), activation='relu'),
    Conv2D(32, kernel_size=(3,3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2,2), strides=2, padding='valid'),
    Conv2D(64, kernel_size=(5,5), activation='relu', padding='same'),
    Flatten(),
    Dense(2,activation='softmax'),
])

In [17]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, 20, 20, 16)        64        
_________________________________________________________________
conv2d_20 (Conv2D)           (None, 20, 20, 32)        4640      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 10, 10, 32)        0         
_________________________________________________________________
conv2d_21 (Conv2D)           (None, 10, 10, 64)        51264     
_________________________________________________________________
flatten_8 (Flatten)          (None, 6400)              0         
_________________________________________________________________
dense_15 (Dense)             (None, 2)                 12802     
Total params: 68,770
Trainable params: 68,770
Non-trainable params: 0
__________________________________________________

# BackPropagation - Intuition

<pre>
We discussed earlier about the optimizer, how SGD works to minimize the loss fuction and how the model learns.

When we provide the input to the model, the nodes at the next layer get the input in the form of weighted sum. This weighted sum is then passed into an activation function, which is passed as an output of this particular node. The same process happens at each layer till the output is generated by the output layer. This process is called as forward propagation.

Now After the output is generated, the gradient of the loss function is calculated to update weights.
This is where backpropagation come in.

Back propagation is the tool that gradient descent uses to calculate the gradient of the loss function. We are somehow working backwards through the network to update the weights using backpropagation in order to minimize the loss.

Once the output has been generated for a particular input, the loss is calculated. To update the weights, gradient descent looks at the activation outputs of the output nodes. Suppose one output node, say node B, is the one that our input actually corresponds to. In that case, the value of this output node should increase and the values of all the other output nodes should decrease. This way SGD can lower the loss for this input. We know that the value of each output node comes from the weighted sum of the previous layer's outputs, each multiplied by the weight of its connection, passed through the activation function of the output layer.

To change the values of the output layer nodes, there are 2 options
1) update the weights of the connections into the output layer
2) change the activation outputs of the previous layer. We cannot do this directly, because those outputs are themselves calculated from the weights and the outputs of the layers before them. But we can do it indirectly by moving one layer back and updating the weights there. We continue this process until we reach the input layer

This way we move backwards through the network, updating the weights from right to left so that the values shift slightly in the direction they should, meaning SGD is trying to increase the value of the correct output node and decrease the values of the incorrect output nodes.

The size and direction of each weight update come from the derivative of the loss function with respect to that weight: each weight is moved a small step, scaled by the learning rate, against its derivative.
</pre>
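The forward propagation step described above can be sketched in plain Python. This is only an illustration: the network shape, the weight values and the helper names (`forward_layer`, `forward`) are made up for the example, sigmoid is assumed as the activation, and biases are left out to keep it short.

```python
import math

def sigmoid(z):
    """Sigmoid activation, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(inputs, weights, activation):
    """Compute the outputs of one layer.

    weights[j][k] is the weight of the connection from input k to node j.
    """
    outputs = []
    for node_weights in weights:
        # Weighted sum of the previous layer's outputs for this node
        z = sum(w * a for w, a in zip(node_weights, inputs))
        outputs.append(activation(z))
    return outputs

def forward(inputs, layers):
    """Pass the input through every layer in order (forward propagation)."""
    activations = inputs
    for weights, activation in layers:
        activations = forward_layer(activations, weights, activation)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 2 output nodes
hidden_weights = [[0.5, -0.5], [0.25, 0.75]]
output_weights = [[1.0, -1.0], [-1.0, 1.0]]
network = [(hidden_weights, sigmoid), (output_weights, sigmoid)]

prediction = forward([1.0, 2.0], network)
print(prediction)
```

Backpropagation then starts from `prediction`, compares it with the desired output, and walks back through the same layers in reverse order.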

# Backpropagation-Mathematical Notation

L=number of layers in the network

Layers are indexed as l=1,2,3,...,L

Nodes in a given layer l are indexed as j=0,1,2,3,...,n-1, where n is the number of nodes in that layer

Nodes in layer l-1 are indexed as k=0,1,2,3,...,n-1, where n is the number of nodes in that layer

y<sub>j</sub> is the desired value of node j in the output layer L for a single training sample.
Since the data is labelled, we know ahead of time what the desired value of each output node is for a given input; y<sub>j</sub> denotes that value.

C<sub>0</sub> is the loss function of the network for a single training sample, here the sum of squared errors

w<sub>jk</sub><sup>(<i>l</i>)</sup> is the weight of the connection that connects node k in layer l-1 to the node j in layer l

w<sub>j</sub><sup>(<i>l</i>)</sup> is the vector of weights of the connections into node j in layer l from every node in layer l-1

z<sub>j</sub><sup>(<i>l</i>)</sup> is the input to node j in layer l, that is, the weighted sum of the activation outputs from layer l-1

g<sup>(<i>l</i>)</sup> is the activation function used for layer l

a<sub>j</sub><sup>(<i>l</i>)</sup> is the activation output for a node j in the layer l
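Written out as formulas, the definitions above fit together roughly as follows. This is a sketch that assumes no bias term (biases are covered in a later section) and uses n for the number of nodes in the layer being summed over.

```latex
z_j^{(l)} = \sum_{k=0}^{n-1} w_{jk}^{(l)} \, a_k^{(l-1)}
\qquad
a_j^{(l)} = g^{(l)}\!\left(z_j^{(l)}\right)
\qquad
C_0 = \sum_{j=0}^{n-1} \left(a_j^{(L)} - y_j\right)^2
```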

# Backpropagation - Mathematical Observations

![](Mathematical_Observation_1_1.jpg)

![](Mathematical_Observation_1_2.jpg)

# Backpropagation - Calculating the gradient of the loss function with respect to each weight

For SGD to update the weights, it first needs to calculate the gradient of the loss function with respect to each weight

![](Gradient_1_1.jpg)

![](Gradient_1_2.jpg)

![](Gradient_1_3.jpg)

![](Gradient_1_4.jpg)
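A small numeric check can make the chain rule concrete. The sketch below assumes a sigmoid output layer, the sum of squared errors as C<sub>0</sub>, and made-up weights, activations and targets; the function names `loss` and `gradient` are hypothetical. It computes the gradient for the output layer weights analytically, compares one entry with a finite-difference estimate, and takes a single SGD step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(weights, prev_activations, targets):
    """C0: sum of squared errors for one sample, sigmoid output layer."""
    total = 0.0
    for node_weights, y in zip(weights, targets):
        z = sum(w * a for w, a in zip(node_weights, prev_activations))
        total += (sigmoid(z) - y) ** 2
    return total

def gradient(weights, prev_activations, targets):
    """dC0/dw_jk = 2 * (a_j - y_j) * g'(z_j) * a_k, by the chain rule."""
    grads = []
    for node_weights, y in zip(weights, targets):
        z = sum(w * a for w, a in zip(node_weights, prev_activations))
        a = sigmoid(z)
        dC_da = 2.0 * (a - y)    # derivative of the loss w.r.t. the activation
        da_dz = a * (1.0 - a)    # derivative of sigmoid w.r.t. its input
        grads.append([dC_da * da_dz * a_k for a_k in prev_activations])
    return grads

# Made-up values for a layer with 2 inputs and 2 output nodes
weights = [[0.3, -0.2], [0.1, 0.4]]
prev_activations = [0.6, 0.9]
targets = [1.0, 0.0]

analytic = gradient(weights, prev_activations, targets)

# Finite-difference estimate of dC0/dw_00 as a sanity check
eps = 1e-6
bumped = [row[:] for row in weights]
bumped[0][0] += eps
numeric = (loss(bumped, prev_activations, targets)
           - loss(weights, prev_activations, targets)) / eps

# One SGD step: move each weight against its gradient
learning_rate = 0.1
updated = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
           for w_row, g_row in zip(weights, analytic)]

print(analytic[0][0], numeric)
print(loss(weights, prev_activations, targets),
      loss(updated, prev_activations, targets))
```

The two printed gradient values should agree closely, and the loss after the update should be lower than before.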

# BackPropagation-How backpropagation works backwards through the neural network

![](BackPropagation1_1.jpg)

![](BackPropagation1_2.jpg)

![](BackPropagation1_3.jpg)
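In the notation defined earlier, the backward pass can be summarized by the standard recursion below. It is a sketch that assumes the sum of squared errors loss and no bias terms. The first line starts the process at the output layer, the second gives the gradient for any weight, and the third passes the gradient one layer back.

```latex
\frac{\partial C_0}{\partial a_j^{(L)}} = 2\left(a_j^{(L)} - y_j\right)

\frac{\partial C_0}{\partial w_{jk}^{(l)}} = \frac{\partial C_0}{\partial a_j^{(l)}} \; g^{(l)\prime}\!\left(z_j^{(l)}\right) a_k^{(l-1)}

\frac{\partial C_0}{\partial a_k^{(l-1)}} = \sum_{j=0}^{n-1} \frac{\partial C_0}{\partial a_j^{(l)}} \; g^{(l)\prime}\!\left(z_j^{(l)}\right) w_{jk}^{(l)}
```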

# Vanishing Gradient Problem

Problem of unstable gradients

By gradient, we mean the gradient of the loss function with respect to the weights, and we know this gradient is calculated using backpropagation. We update the weights using the calculated gradient.

This problem involves the weights in the earlier layers of the network. SGD calculates the gradient of the loss with respect to each weight. Sometimes the gradient with respect to the weights in the earlier layers becomes very small, hence the name vanishing gradient.

Why is this a problem? SGD uses this gradient to update the weights, and the size of each update is proportional to the gradient. If the gradient is very small, the update is in turn very small, so the weight barely changes and does little to reduce the loss. The weight is effectively stuck, never really moving to its optimal value. This has implications for the weights later in the network too, and impairs the model's ability to learn.

How does this problem occur? During backpropagation, the gradient of the loss with respect to any weight depends on the derivatives of the components that sit later in the network. So the earlier a weight lives in the network, the more terms are multiplied together. If those terms are smaller than 1, their product is an even smaller number, and we still have to multiply it by the learning rate, which is itself small. The resulting value is then subtracted from the weight to get the updated weight.

This is why the earlier layers face this problem: the more terms smaller than 1 are multiplied, the more quickly the gradient vanishes.

The opposite problem is the exploding gradient, where the gradient explodes instead of vanishing. If the terms being multiplied are much greater than 1, the gradient becomes very large, and when we update the weights in proportion to it the change is huge. The weight overshoots its optimal value and keeps moving further away from it with each epoch.
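To get a feel for how quickly repeated multiplication shrinks or inflates the gradient, here is a deliberately simplified sketch. It assumes every layer contributes the same factor, a weight times the sigmoid derivative at z = 0, which is not true of a real network. The function name `gradient_factor` and all the numbers are made up for illustration.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_factor(num_layers, weight=0.5, z=0.0):
    """Product of the per-layer terms (weight * g'(z)) that backpropagation
    multiplies together for a weight that sits num_layers from the output.
    Simplification: the same weight and z are assumed at every layer."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_derivative(z)
    return factor

# Terms smaller than 1: the product vanishes as depth grows
for depth in (1, 5, 10, 20):
    print("vanishing", depth, gradient_factor(depth, weight=0.5))

# Terms larger than 1: the product explodes as depth grows
for depth in (1, 5, 10, 20):
    print("exploding", depth, gradient_factor(depth, weight=8.0))
```

The sigmoid derivative is at most 0.25, so unless the weights are large the per-layer term stays below 1, which is one reason sigmoid networks are prone to vanishing gradients.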

# Weight Initialization | Way to reduce vanishing gradient problem

<pre>
When we initially compile the network, the weights are initialized with random numbers, one number per weight. Assume these random numbers are normally distributed with a mean of 0 and a standard deviation of 1.

How does this random initialization impact training?

Let's assume there are 250 nodes in the input layer, each with a value of 1, and one node in the hidden layer, so there are 250 weights connecting the hidden node to the input nodes. These weights are normally distributed with a mean of 0 and a standard deviation of 1 (a normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution). The input z to the hidden node is the weighted sum, which here is simply the sum of the 250 weights.

Since z is a sum of numbers normally distributed around 0, it is also normally distributed around 0, but its variance and standard deviation are greater than 1. This is because the variance of a sum of independent random numbers equals the sum of their variances, and since each of these random numbers has a variance of 1, the variance of z is 250 and its standard deviation is sqrt(250) = 15.811

With this larger standard deviation, z is likely to take on values far from 0. If we pass such a value through a sigmoid activation, large positive values are mapped close to 1 and large negative values close to 0, where the sigmoid is nearly flat. So if the desired output is on the opposite side, then during training each SGD update makes only a small change in the activation output, barely moving it in the right direction and hindering the network's ability to learn.

Solution - Initialize the weights so that this variance is smaller: shrinking the variance of the weights in turn shrinks the variance of the weighted sum. The variance chosen is 1/n, where n is the number of weights coming into the node from the previous layer, so each weight is multiplied by sqrt(1/n) to change the variance of the weights from 1 to 1/n. This is called Xavier initialization. If using relu, the value which works best is 2/n (this variant is commonly known as He initialization).
</pre>
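The variance argument above can be checked with a quick simulation. The sketch below uses only the Python standard library; the helper name `weighted_sum_std`, the number of trials and the seed are arbitrary choices for illustration. It repeats the 250-input example many times, once with standard normal weights and once with the weights scaled by sqrt(1/n), and measures the spread of the weighted sum.

```python
import random
import statistics

def weighted_sum_std(n_inputs, scale, trials=2000, seed=0):
    """Estimate the standard deviation of the weighted sum z for one node
    whose n_inputs inputs are all 1 and whose weights are standard normal
    numbers multiplied by `scale`."""
    rng = random.Random(seed)
    sums = []
    for _ in range(trials):
        # Inputs are all 1, so z is just the sum of the (scaled) weights
        z = sum(rng.gauss(0.0, 1.0) * scale for _ in range(n_inputs))
        sums.append(z)
    return statistics.pstdev(sums)

n = 250
std_plain = weighted_sum_std(n, scale=1.0)               # expected near sqrt(250)
std_xavier = weighted_sum_std(n, scale=(1.0 / n) ** 0.5)  # expected near 1

print(std_plain, std_xavier)
```

The first value should land close to 15.8 and the second close to 1, matching the reasoning in the text.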

In [4]:
from keras.models import Sequential
from keras.layers import Dense, Activation

In [3]:
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter('ignore')

In [5]:
model=Sequential([
    Dense(16, input_shape=(1,5), activation='relu'),
    Dense(32, activation='relu', kernel_initializer='glorot_uniform'),
    Dense(2,activation='softmax')
])

In [6]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1, 16)             96        
_________________________________________________________________
dense_2 (Dense)              (None, 1, 32)             544       
_________________________________________________________________
dense_3 (Dense)              (None, 1, 2)              66        
Total params: 706
Trainable params: 706
Non-trainable params: 0
_________________________________________________________________


The kernel_initializer argument specifies which weight initialization method the layer uses. By default Keras uses glorot_uniform; glorot_normal is another option (Glorot initialization is another name for Xavier initialization).

# Bias in ANN

<pre>
Each neuron in the network has a bias

Biases are learnable, so as SGD learns and updates the weights through backpropagation, it also updates the biases

A bias can be thought of as a threshold: it determines whether, and by how much, the neuron fires, i.e. whether its activation output is propagated forward. It lets us control when the neuron is meaningfully activated

Addition of these biases increases the flexibility of the model

Example - Take 2 input nodes with values 1 and 2, and weights -0.55 and 0.1. The weighted sum is (1 * -0.55) + (2 * 0.1) = -0.35. When this weighted sum is passed through the relu activation function (max of 0 and the value), the output is 0 and the neuron is not active. What if we want to change this threshold? This is where the bias comes into the picture. If we want the neuron to be activated whenever the weighted sum is greater than -1, we add the opposite of -1, i.e. 1, as a bias to the weighted sum, making it -0.35 + 1 = 0.65. Now the neuron is activated. Since this output can be propagated forward, the model has more flexibility to learn.

We don't choose an arbitrary value for the bias; just like the weights, the biases are learnable parameters that keep changing during training.

</pre>
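The bias example can be reproduced in a few lines of Python. The function name `neuron_output` is made up for this sketch; the inputs, weights and bias are the ones from the example.

```python
def relu(x):
    """relu activation: max of 0 and the value."""
    return max(0.0, x)

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus the bias, passed through relu."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

inputs = [1.0, 2.0]
weights = [-0.55, 0.1]

without_bias = neuron_output(inputs, weights)           # weighted sum is -0.35, relu gives 0.0
with_bias = neuron_output(inputs, weights, bias=1.0)    # about 0.65, the neuron is active

print(without_bias, with_bias)
```

During training the bias is not set by hand like this; it is updated by SGD along with the weights.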

# Learnable Parameters in ANN

Learnable Parameter (Trainable Parameter) - A parameter that is learned by the network during training, such as a weight or a bias

How to calculate the number of learnable parameters -

For each layer: (number of inputs to the layer * number of outputs from the layer) + number of biases, where there is one bias per node in the layer. Sum this value over all the layers to get the learnable parameters of the network


Input layer has no learnable parameter

Example - 2 input nodes, 3 hidden nodes and 2 output nodes
For the 2nd layer: inputs=2, outputs=3, biases=3, so 2 * 3 + 3 = 9 learnable parameters
For the 3rd layer: inputs=3, outputs=2, biases=2, so 3 * 2 + 2 = 8 learnable parameters
Total = 17
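The same calculation can be written as a short helper. The function names below are made up for this sketch, and it assumes fully connected (Dense) layers with one bias per node.

```python
def dense_layer_params(n_inputs, n_outputs, use_bias=True):
    """Weights (inputs * outputs) plus one bias per output node."""
    return n_inputs * n_outputs + (n_outputs if use_bias else 0)

def count_learnable_params(layer_sizes):
    """layer_sizes lists the number of nodes per layer, input layer first.
    The input layer itself contributes no learnable parameters."""
    return sum(dense_layer_params(n_in, n_out)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# The example above: 2 input nodes, 3 hidden nodes, 2 output nodes
print(count_learnable_params([2, 3, 2]))        # 17

# The Dense model from the weight initialization section: 5 -> 16 -> 32 -> 2
print(count_learnable_params([5, 16, 32, 2]))   # 706
```

The second result matches the total reported by `model.summary()` for that model (96 + 544 + 66).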

# Learnable Parameters in CNN