# Week 2: Neural Network Fundamentals

- Introduction to neural networks and their architecture
- Understanding of perceptrons and multi-layer perceptrons (MLP)
- Understanding of activation functions and their role in neural networks
- Introduction to backpropagation and gradient descent optimization algorithms
- Understanding of overfitting and regularization techniques to prevent it
- Understanding of Convolutional Neural Networks (CNN) and their application in NLP
- Understanding of Recurrent Neural Networks (RNN) and their variants such as LSTM and GRU
- Introduction to the Transformer architecture and its application in NLP
- Understanding of generative models such as GAN and VAE
- Understanding of autoencoders and their application in NLP
- Understanding of reinforcement learning and its application in NLP
- Introduction to hyperparameter tuning and its importance in neural networks
- Understanding of batch normalization and dropout for improving the performance of neural networks


## Difference between Biological Neural Network and Artificial Neural Network

<img src="images/comparison.png" width ="1000px" height ="1000px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)

## Neural networks and their architecture

A neural network is a group of algorithms that identifies the underlying relationships in a set of data, in a way loosely modeled on the human brain. A neural network can adapt to changing input, so it produces the best possible result without the output procedure having to be redesigned.

<img src="images/neural network architecture.png" width ="450px" height ="450px">

Image source: [link to source](https://www.google.com/search?q=neural+network+and+its+architecture&sxsrf=AJOqlzVxKYJ02ECS38T8001zjVVOip5s4A:1678040533212&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiSiZqPtMX9AhUFLOwKHWSVB2EQ_AUoAXoECAEQAw&biw=1536&bih=792&dpr=1.25#imgrc=TJYs5ujUOkqz7M)


<img src="images/Screenshot 2023-03-04 211153.png" width ="450px" height ="450px">

Image source: [link to source](https://www.xenonstack.com/blog/artificial-neural-network-applications)


<img src="images/artificial nn.png" width ="600px" height ="600px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)


## Types of Neural Network


<img src="images/types.png" width ="600px" height ="600px">

Image source: [link to source](https://www.xenonstack.com/blog/artificial-neural-network-applications)

## Neural Networks for data-intensive applications

<img src="images/nn app.png" width ="600px" height ="600px">

<img src="images/nn app 1.png" width ="600px" height ="600px">

Image source: [link to source](https://www.xenonstack.com/blog/artificial-neural-network-applications)


## Activation functions and their role in neural networks


Activation functions are a core part of a neural network.
Complicated tasks such as object detection, language translation, and human face detection are carried out with the help of neural networks and their activation functions. Without activation functions, a network could only model linear relationships, which is not enough for these tasks.

The activation function decides whether, and how strongly, a neuron is activated. The neuron first computes the weighted sum of its inputs and adds a bias; the activation function is then applied to that value.
The goal of the activation function is to introduce non-linearity into the output of a neuron.

Some activation functions also squash their output into a fixed range: sigmoid maps any input to the range 0 to 1, and tanh to the range -1 to 1. Others, such as ReLU, are unbounded for positive inputs.
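As a minimal sketch of the neuron computation described above (weighted sum, plus bias, then activation), here is a single neuron in plain Python. The input, weight, and bias values are made up for illustration, and `tanh` is just one possible choice of activation.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # pre-activation
    return activation(z)

# Illustrative values only (not taken from any dataset).
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, -0.2], bias=0.1)
print(output)
```

Swapping `activation` for another function changes how the same weighted sum is turned into the neuron's output.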

<img src="images/activation function.png" width ="500px" height ="500px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)


### Types of Activation Functions

The most commonly used activation functions are the following:

- Linear
- Binary step
- ReLU
- LeakyReLU
- Sigmoid
- Tanh
- Softmax

### Linear Activation Function

<img src="images/linear activation function.png" width ="500px" height ="500px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)

A simple straight-line activation function, where the output is directly proportional to the weighted sum of the inputs.
#### Equation: f(x) = mx

### Binary Step Activation Function

The binary step function is one of the simplest activation functions and a natural first choice when the output has to be bounded. It is essentially a threshold-based classifier.

In this function, we choose a threshold value.

If the input is greater than the threshold, the neuron is activated; otherwise it is deactivated.

#### Equation: f(x) = 1 if x > 0, otherwise f(x) = 0

For binary classifiers, the threshold is usually set to 0.
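The threshold rule above can be sketched in a couple of lines. The `threshold` parameter name is an assumption for illustration; it defaults to 0 as in the text.

```python
def binary_step(x, threshold=0.0):
    """Return 1 if x is above the threshold (neuron activated), else 0."""
    return 1 if x > threshold else 0

print([binary_step(x) for x in (-2.0, 0.0, 0.5, 3.0)])  # [0, 0, 1, 1]
```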

### ReLU Activation Function

<img src="images/RELU activation function.png" width ="500px" height ="500px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)


ReLU stands for Rectified Linear Unit. It is one of the most widely used activation functions.

It is primarily used in the hidden layers of artificial neural networks.

#### Equation: f(x) = max(0, x)

It outputs x if x is positive and 0 otherwise.


### Leaky ReLU Activation Function


<img src="images/leaky RELU activation.png" width ="500px" height ="500px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)

Leaky ReLU is an improved version of the ReLU activation function.
It has a small slope for negative values instead of a flat slope.

It addresses the “dying ReLU” problem: ReLU maps every negative input to zero, so a neuron whose inputs stay negative outputs zero, receives zero gradient, and stops learning.
Leaky ReLU does not convert negative inputs to zero but to small negative values close to zero, so a small gradient still flows through the neuron.

#### Equation: f(x) = max(0.01*x, x)

It returns x for positive input; for a negative value of x, it returns a very small value, 0.01 times x.
Thus it gives a non-zero output for negative values as well.
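To compare the two functions side by side, here is a small sketch using the equations given in this section. The `slope` parameter name is an assumption; 0.01 is the value used in the equation above.

```python
def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """f(x) = max(slope * x, x); valid for 0 < slope < 1."""
    return max(slope * x, x)

for x in (-10.0, -1.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x))
```

For negative inputs `relu` is flat at zero, while `leaky_relu` keeps a small negative output proportional to the input.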

### Sigmoid Activation Function

Sigmoid is one of the most commonly used activation functions.
Because its output lies between 0 and 1, it can be interpreted as a probability, which makes it a probabilistic approach to decision-making.

#### Equation: f(x) = 1 / (1 + e^(-x))

The function is non-linear. For x between -2 and 2 the curve is steep, which means that a small change in x brings a large change in the value of y.

Range: 0 to 1

It is usually used in the output layer of binary classifiers, where the target is either 0 or 1.

When we have to make a yes/no decision or predict a binary output, the sigmoid is a natural choice: its bounded output can be compared against a cut-off to pick a class.
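A minimal sketch of the sigmoid used as a binary decision function follows. The 0.5 cut-off in `predict_class` is a common convention assumed here, not something fixed by the text, and the two-branch form is only there to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """1 / (1 + e^(-x)), written to stay numerically stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)

def predict_class(x, cutoff=0.5):
    """Turn the sigmoid output into a 0/1 decision (cut-off is an assumption)."""
    return 1 if sigmoid(x) >= cutoff else 0

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(sigmoid(x), 4), predict_class(x))
```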


### Hyperbolic Tangent Activation Function (Tanh)

<img src="images/tangent hyperbolic activation function.png"  width ="500px" height ="500px">

Image source: [link to source](https://blog.knoldus.com/getting-familiar-with-activation-function-and-its-types/)


#### Equation: f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1   OR   tanh(x) = 2 * sigmoid(2x) - 1

Range: -1 to 1

Tanh is closely related to the sigmoid and is mostly used in hidden layers. Because its values lie between -1 and 1, the mean of a hidden layer's outputs comes out at or very close to 0, which helps center the data. This makes learning easier for the next layer. Tanh can also be used to differentiate between two classes; unlike the sigmoid, it maps negative inputs to negative outputs.
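The relationship between tanh and sigmoid, and the zero-centering effect, can be checked numerically. This is a small sketch; the symmetric sample inputs are chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh(x) = 2 * sigmoid(2x) - 1"""
    return 2.0 * sigmoid(2.0 * x) - 1.0

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric around zero

# Largest difference between the identity and the library tanh.
max_difference = max(abs(tanh_via_sigmoid(x) - math.tanh(x)) for x in inputs)

# Mean output for the same inputs: tanh is centered at 0, sigmoid at 0.5.
mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)
print(max_difference, mean_tanh, mean_sigmoid)
```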


### Softmax Activation Function

The softmax function is a generalization of the sigmoid function to more than two classes and is mostly used for classification problems.

It is used primarily at the last layer, i.e. the output layer, for decision-making, in the same way the sigmoid function is.

Both sigmoid and softmax can be used for binary classification problems, but softmax is the usual choice when we handle multi-class classification problems. It exponentiates the score for each class and divides by the sum over all classes, which squeezes the output for each class between 0 and 1 and makes the outputs add up to 1.
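A minimal softmax sketch for a three-class problem is shown below. The scores are made-up values, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Exponentiate each score and divide by the sum of the exponentials."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])       # illustrative class scores
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the highest score also gets the highest probability, so the prediction is simply the index of the largest output.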

## Backpropagation and Gradient descent optimization algorithms

Backpropagation is an algorithm used for training feedforward neural networks. It plays an important part in improving the predictions made by neural networks, because it allows the network's output to be improved iteratively.

In a feedforward neural network, the input moves forward from the input layer to the output layer. Backpropagation improves the network's output by propagating the error backward from the output layer to the input layer.

<img src="images/back propagation.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2023/01/gradient-descent-vs-backpropagation-whats-the-difference/)

To understand how backpropagation works, let’s first understand how a feedforward network works.

<img src="images/feed foward nn.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2023/01/gradient-descent-vs-backpropagation-whats-the-difference/)


Backpropagation allows us to readjust our weights to reduce output error. The error is propagated backward during backpropagation from the output to the input layer. This error is then used to calculate the gradient of the cost function with respect to each weight.
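To make the backward pass concrete, here is a sketch for the smallest possible case: one sigmoid neuron with a squared-error cost. The choice of sigmoid, the cost `0.5 * (a - y)^2`, and all numeric values are assumptions for illustration. The gradient from the chain rule is compared against a numerical estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward pass: input -> pre-activation -> activation."""
    return sigmoid(w * x + b)

def cost(w, b, x, y):
    return 0.5 * (forward(w, b, x) - y) ** 2

def backward(w, b, x, y):
    """Backward pass: propagate the output error back to w and b."""
    a = forward(w, b, x)
    dcost_da = a - y            # error at the output
    da_dz = a * (1.0 - a)       # derivative of the sigmoid
    delta = dcost_da * da_dz    # error at the pre-activation
    return delta * x, delta     # d(cost)/dw, d(cost)/db

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # illustrative values
grad_w, grad_b = backward(w, b, x, y)

# Numerical check with a central difference.
eps = 1e-6
numeric_grad_w = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
print(grad_w, numeric_grad_w)
```

In a multi-layer network the same chain-rule step is repeated layer by layer, from the output back to the input.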

<img src="images/back propagation 1.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2023/01/gradient-descent-vs-backpropagation-whats-the-difference/)

## Gradient Descent

The weights are adjusted using a process called gradient descent.

Gradient descent is an optimization algorithm that is used to find the weights that minimize the cost function. Minimizing the cost function means getting to the minimum point of the cost function. So, gradient descent aims to find a weight corresponding to the cost function’s minimum point.

To find this weight, we must navigate down the cost function until we find its minimum point.
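Navigating down a cost function can be sketched with a single weight. The quadratic cost `(w - 3)^2`, the starting point, and the learning rate are all assumptions chosen so the minimum (at w = 3) is easy to verify.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting weight
learning_rate = 0.1   # step size (illustrative)
for step in range(100):
    w -= learning_rate * gradient(w)   # move against the gradient

print(w, cost(w))
```

Each update moves the weight a little in the direction that lowers the cost, so `w` approaches 3.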


<img src="images/gradient descent.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2023/01/gradient-descent-vs-backpropagation-whats-the-difference/)

<img src="images/bp vs gd.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2023/01/gradient-descent-vs-backpropagation-whats-the-difference/)


<img src="images/gradient descent and back propagation.png"  width ="450px" height ="400px">

Image source: [link to source](https://test.basel.in/product/gradient-descent-back-propagation/)


### Summarizing Gradient Descent

Gradient descent is an optimization algorithm used to find the weights that minimize the cost function. To find these weights, it descends the cost function until it reaches the minimum point. To do so it needs the gradient and the learning rate. The gradient gives the direction toward the minimum point of the cost function. The learning rate determines the size of each step, and therefore how quickly the minimum is approached. Upon reaching the minimum point, gradient descent has found the weights corresponding to it.

### Summarizing Backpropagation

Backpropagation is the algorithm for calculating the gradients of the cost function with respect to the weights. It is used to improve the output of neural networks by propagating the error in a backward direction and calculating the gradient of the cost function for each weight. These gradients are then used in the process of gradient descent.
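The division of labor between the two can be sketched in one tiny training loop: one function computes the gradient, another applies the gradient descent update. The one-weight model `y_hat = w * x`, the mean-squared-error cost, and the data (which follows y = 2x) are assumptions for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by y = 2x

def compute_gradient(w):
    """Backpropagation's job: gradient of the mean squared error w.r.t. w."""
    errors = [w * x - y for x, y in zip(xs, ys)]
    return sum(2.0 * e * x for e, x in zip(errors, xs)) / len(xs)

def gradient_descent_step(w, grad, learning_rate=0.01):
    """Gradient descent's job: use the gradient to update the weight."""
    return w - learning_rate * grad

w = 0.0
for epoch in range(200):
    w = gradient_descent_step(w, compute_gradient(w))

print(w)
```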


##  Regularization techniques

What is regularization?

<img src="images/regularization tachniques.png" width ="700px" height ="700px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/)

Have you seen this image before? As we move towards the right in this image, the model learns the details and the noise of the training data too well, which ultimately results in poor performance on unseen data.
In other words, as we go towards the right, the complexity of the model increases such that the training error reduces but the testing error doesn’t. This is shown in the image below.
Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better. This in turn improves the model’s performance on the unseen data as well.
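One common "slight modification" is to add a penalty on large weights to the cost (L2 regularization, also known as weight decay). The text above does not name a specific technique, so take this as one illustrative choice; `lam` is a hypothetical name for the regularization strength, and the data is made up.

```python
def mse(weights, data):
    """Mean squared error of a linear model on (features, target) pairs."""
    total = 0.0
    for features, target in data:
        prediction = sum(w * f for w, f in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(data)

def l2_penalty(weights, lam):
    """Penalty that grows with the squared size of the weights."""
    return lam * sum(w * w for w in weights)

def regularized_cost(weights, data, lam=0.1):
    return mse(weights, data) + l2_penalty(weights, lam)

# Two redundant features: many weight vectors fit this point perfectly.
data = [([1.0, 1.0], 2.0)]
small_weights = [1.0, 1.0]
large_weights = [10.0, -8.0]

print(mse(small_weights, data), mse(large_weights, data))
print(regularized_cost(small_weights, data), regularized_cost(large_weights, data))
```

Both weight vectors fit the training point equally well, but the penalized cost prefers the smaller weights, which is the kind of preference that tends to generalize better.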

### Regularization vs. Non-Regularization

<img src="images/regularized vs non regularized.png" width ="350px" height ="350px">

Image source: [link to source](https://ww2.mathworks.cn/en/discovery/regularization.html)

### The bias-variance tradeoff: Overfitting and Underfitting


<img src="images/nn prevent overfitting.png" width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/)


<img src="images/correction using.png" width ="600px" height ="600px">

Image source: [link to source](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/)

### Underfitting
The bias error is an error from wrong assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. This is called underfitting.

### Overfitting
The variance is an error from sensitivity to small fluctuations in the training set. High variance may result in modeling the random noise in the training data. This is called overfitting.

The bias-variance tradeoff describes the fact that we can usually reduce the variance only by increasing the bias, and vice versa. Good regularization techniques strive to keep the combined error from both sources low, hence achieving better generalization.


 
### What regularization does to a machine learning model


<img src="images/model regularization.png" width ="600px" height ="600px">

Image source: [link to source](https://www.reddit.com/r/learnmachinelearning/comments/w7yrog/what_regularization_does_to_a_machine_learning/)



## Convolutional Neural Networks (CNN) and their application in NLP

A Convolutional Neural Network is a type of feed-forward neural network used in tasks like image analysis, natural language processing, and complex image classification problems.
The hidden layers of a CNN include convolutional layers, which form the base of ConvNets.

Features refer to minute details in the image data like edges, borders, shapes, textures, objects, circles, etc.

At a high level, convolutional layers detect these patterns in the image data with the help of filters. The low-level details, such as edges and simple shapes, are taken care of by the first few convolutional layers.

The deeper the network goes, the more sophisticated the pattern searching becomes.

For example, in later layers, rather than edges and simple shapes, filters may detect specific objects like eyes or ears, and eventually a cat, a dog, and so on.
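How a filter detects a low-level feature can be sketched with a plain-Python convolution. The 4x4 image and the 2x2 vertical-edge filter are hand-picked for illustration; in a real CNN the filter values are learned during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            acc = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output

# Dark (0) on the left, bright (1) on the right: a vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds where brightness increases from left to right.
vertical_edge_filter = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge_filter)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the edge sits, which is what "detecting a pattern with a filter" means in practice.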

<img src="images/CNN.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.v7labs.com/blog/neural-network-architectures-guide)

## The Deconvolutional Neural Networks (DNN)

Deconvolutional Neural Networks are CNNs that work in a reverse manner.

When we use convolutional layers and max-pooling, the size of the image is reduced. To return to the original size, we use upsampling and transpose convolutional layers. Upsampling has no trainable parameters: it simply repeats the rows and columns of the image data by the corresponding upsampling factors.
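Upsampling by repetition can be sketched without any library. The function name `upsample` and the single `factor` applied to both rows and columns are assumptions for illustration.

```python
def upsample(image, factor=2):
    """Repeat every column and every row `factor` times (no learned weights)."""
    output = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        output.extend(list(wide_row) for _ in range(factor))
    return output

small = [[1, 2],
         [3, 4]]
big = upsample(small)
print(big)  # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```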

<img src="images/DNN.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.v7labs.com/blog/neural-network-architectures-guide)

A transpose convolutional layer applies a convolution operation and upsampling at the same time. It is represented as Conv2DTranspose (number of filters, filter size, stride). If we set stride=1, no upsampling takes place and, with suitable padding, the output has the same size as the input.





##  Recurrent Neural Networks (RNN) 

Recurrent Neural Networks can remember what they have learned in the past and apply it to future predictions.


<img src="images/RNN.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.v7labs.com/blog/neural-network-architectures-guide)

The input is sequential data that is fed into the RNN one element at a time. The RNN has a hidden internal state that gets updated every time it reads the next element of the input sequence.

The internal hidden state is fed back into the model. The RNN produces some output at every time step.

The mathematical representation is given below:

<img src="images/rnn mathematic.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.v7labs.com/blog/neural-network-architectures-guide)
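As a rough sketch of the recurrence, the snippet below assumes the common vanilla-RNN update `h_t = tanh(w_x * x_t + w_h * h_prev + b)` with scalar values. The exact notation in the figure may differ, and the weights are made-up numbers rather than trained ones.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """Update the hidden state from the current input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Feed the sequence one element at a time, carrying the hidden state along."""
    h = h0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)   # one hidden state (and possible output) per time step
    return states

states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Even though the second and third inputs are zero, the hidden state stays positive: the network still "remembers" the first input, although the memory fades with each step.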


## The Long Short Term Memory Network (LSTM)

In a plain RNN, each prediction looks directly at only one time step back, so the network has a very short-term memory. In practice it makes little use of information from further back.

To rectify this, we can take our Recurrent Neural Network structure and expand it by adding some more pieces to it.

The critical part that we add to the Recurrent Neural Network is memory. We want it to be able to remember what happened many time steps ago. To achieve this, we add extra structures called gates to the artificial neural network structure.
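The role of the gates can be sketched with a scalar LSTM cell. This follows the widely used formulation with forget, input, and output gates plus a candidate value, which is an assumption since the text does not spell out the equations. The parameters are hand-picked (not trained) so that the cell writes when the input is 1 and otherwise keeps its memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    `params` maps each gate name to a (w_x, w_h, b) tuple.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))        # how much memory to keep
    write = sigmoid(pre_activation("input"))          # how much new info to store
    read = sigmoid(pre_activation("output"))          # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))

    c_t = forget * c_prev + write * candidate         # updated cell memory
    h_t = read * math.tanh(c_t)                       # updated hidden state
    return h_t, c_t

# Hand-picked parameters: keep memory, write only when the input is 1.
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (5.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x_t in [1.0] + [0.0] * 20:    # one signal followed by 20 empty steps
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

After 20 steps without input, the cell state is still close to what was stored at the first step, which is the long-term memory a plain RNN lacks.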

<img src="images/LSTM.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.v7labs.com/blog/neural-network-architectures-guide)


## Echo State Networks (ESN)

An Echo State Network is an RNN with a sparsely connected hidden layer, typically with about 1% connectivity.

The connectivity and weights of the hidden neurons are fixed and randomly assigned. The only weights that need to be learned are those of the output layer. The network can therefore be seen as a linear model that maps the weighted input, after it has passed through the hidden layers, to the target output. The main idea is to keep the early layers fixed.




## Transformer architecture and its application in NLP

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.


<img src="images/transformer.png"  width ="800px" height ="800px">

Image source: [link to source](https://towardsdatascience.com/transformers-89034557de14)

### Applications


<img src="images/applications of transformer.png"  width ="800px" height ="800px">

Image source: [link to source](https://www.researchgate.net/publication/356159551/figure/fig1/AS:1089187616956417@1636693972806/Some-applications-of-transformers-in-different-fields-of-machine-learning-For-NLP-a.png)



## Generative models such as GAN and VAE


### Generative Adversarial Network (GAN)

Generative modeling comes under the umbrella of unsupervised learning, where new/synthetic data is generated based on the patterns discovered from the input set of data.
A GAN is a generative model used to generate entirely new synthetic data by learning the patterns in the training data. GANs are an active area of AI research.

<img src="images/generative.png"  width ="600px" height ="600px">

Image source: [link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/614fc742b1fa51cbd373f389_generative-adversarial-network.png)

They have two components—a generator and a discriminator that work in a competitive fashion. 


<img src="images/vae and g.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.researchgate.net/publication/355850728/figure/fig4/AS:1085699449200644@1635862328517/Examples-of-generative-models-From-top-to-bottom-Variational-Autoencoder-VAE-and.png)


##  Autoencoder and its application in NLP

What is an autoencoder?

An autoencoder is an unsupervised neural network that tries to reconstruct its input at the output layer as closely as possible. An autoencoder architecture has two parts:

- Encoder: Mapping from Input space to lower dimension space
- Decoder: Reconstructing from lower dimension space to Output space
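The encoder/decoder split can be sketched with a toy example that compresses 2-D points to a single number and back. The projection direction is hand-picked here; a real autoencoder would learn its weights by minimizing the reconstruction error, and would normally use non-linear layers.

```python
import math

# Hand-picked 1-D bottleneck direction (assumption: data lies near y = 2x).
DIRECTION = (1.0 / math.sqrt(5.0), 2.0 / math.sqrt(5.0))

def encode(point):
    """Encoder: map a 2-D input to a 1-D code."""
    return point[0] * DIRECTION[0] + point[1] * DIRECTION[1]

def decode(code):
    """Decoder: reconstruct a 2-D point from the 1-D code."""
    return (code * DIRECTION[0], code * DIRECTION[1])

def reconstruction_error(point):
    rx, ry = decode(encode(point))
    return (point[0] - rx) ** 2 + (point[1] - ry) ** 2

on_line = (1.0, 2.0)     # follows the structure the bottleneck captures
off_line = (2.0, -1.0)   # does not follow that structure

print(reconstruction_error(on_line), reconstruction_error(off_line))
```

Points that match the structure captured by the bottleneck are reconstructed almost perfectly, while points that do not are reconstructed poorly.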

<img src="images/auto encoders.png"  width ="500px" height ="500px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)

### Applications

### 1) Dimensionality Reduction:

Autoencoders train the network to capture the natural structure in the data in an efficient lower-dimensional representation. They do this by using an encoding and decoding strategy that minimizes the reconstruction error.

<img src="images/dimension reduction.png"  width ="500px" height ="500px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)


### 2) Feature Extraction:

Autoencoders can be used as a feature extractor for classification or regression tasks. Autoencoders take un-labeled data and learn efficient codings about the structure of the data that can be used for supervised learning tasks.

<img src="images/feature extraction auto encoder application.png"  width ="500px" height ="500px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)

### 3) Image Denoising:

The real-world raw input data is often noisy in nature, and to train a robust supervised model requires cleaned and noiseless data. Autoencoders can be used to denoise the data.

<img src="images/auto encoder application image noising.png"  width ="700px" height ="700px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)

### 4) Image Compression:

Image compression is another application of an autoencoder network. The raw input image is passed to the encoder network to obtain a compressed encoding of the data. The autoencoder network weights are learned by reconstructing the image from the compressed encoding using a decoder network.

<img src="images/image compression auto encoder.png"  width ="500px" height ="500px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)

### 5) Image Search:

Autoencoders can be used to compress the database of images. The compressed embedding can be compared or searched with an encoded version of the search image.

<img src="images/image search auto encoder.png"  width ="600px" height ="600px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)

### 6) Anomaly Detection:

Anomaly detection is another useful application of an autoencoder network. An anomaly detection model can be used to detect fraudulent transactions or to handle other highly imbalanced supervised tasks.

### 7) Missing Value Imputation:

Denoising autoencoders can be used to impute the missing values in the dataset. The idea is to train an autoencoder network by randomly placing missing values in the input data and trying to reconstruct the original raw data by minimizing the reconstruction loss.
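The data-preparation half of this idea can be sketched as follows: corrupt a copy of each row by randomly blanking values, and keep the original as the reconstruction target. The function name `mask_values`, the use of `None` as the missing-value marker, and the sample row are assumptions; the network and its training loop are not shown.

```python
import random

def mask_values(row, rate, rng):
    """Return a copy of `row` with roughly `rate` of its entries set to None."""
    return [None if rng.random() < rate else value for value in row]

rng = random.Random(0)                        # seeded for reproducibility
original = [5.1, 3.5, 1.4, 0.2, 4.9, 3.0]     # illustrative data row
corrupted = mask_values(original, rate=0.3, rng=rng)

# The autoencoder would be trained to map `corrupted` back to `original`.
training_pair = (corrupted, original)
print(training_pair)
```

Before being fed to a network, the `None` markers would have to be encoded numerically, for example as a placeholder value.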

<img src="images/missing valu imputation.png"  width ="900px" height ="900px">

Image source: [link to source](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)





## Reinforcement learning and its application in NLP

Reinforcement learning is a sub-domain of machine learning that deals with training AI models to yield the maximum reward possible from a process or task assigned to them. The optimal path or behavior is encouraged by rewarding the model when it produces the desired outcome and giving it negative feedback every time it causes an undesired outcome. AI-based reinforcement learning derives its fundamentals from human psychology research, wherein good behavior is rewarded and bad behavior patterns are punished.

<img src="images/reinforce.jpg"  width ="500px" height ="500px">

Image source: [link to source](https://insights.daffodilsw.com/hs-fs/hubfs/Allen/info.jpg?width=1371&name=info.jpg)

In the diagram given above, we assume that the main agent committing the action is an AI model. Actions are performed based on a list of norms pre-programmed into the AI model which we can refer to as the 'policy'. When a reinforcement learning algorithm is introduced into the natural flow of an AI task, it changes certain things.

Every time an action is performed, the algorithm decides, based on the outcome, whether to make changes to the underlying policy. When the outcome is as desired, the policy remains unchanged; otherwise, a policy update takes place via the reinforcement learning algorithm. After the policy is updated, the AI model performs the same action differently, and this goes on until the optimal outcome is achieved repeatedly.
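The act, observe, update loop can be sketched with a deliberately tiny example. Here the "policy" is a table of estimated values per action combined with an epsilon-greedy choice rule; that representation, the two-action environment, and all numbers are assumptions for illustration.

```python
import random

# Hypothetical environment: action "b" gives the desired outcome.
REWARDS = {"a": 0.0, "b": 1.0}

def choose_action(values, epsilon, rng):
    """Mostly pick the best-looking action, sometimes explore at random."""
    actions = sorted(values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=values.get)

def train(steps=500, epsilon=0.2, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    values = {action: 0.0 for action in REWARDS}
    for _ in range(steps):
        action = choose_action(values, epsilon, rng)
        reward = REWARDS[action]
        # No change if the outcome matches the estimate, otherwise adjust it.
        values[action] += learning_rate * (reward - values[action])
    return values

values = train()
best_action = max(values, key=values.get)
print(values, best_action)
```

When the observed reward matches the current estimate, the update is zero and the policy stays as it is; otherwise the estimate shifts, and with it the action the model prefers.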

<img src="images/reinforcement learning.png"  width ="500px" height ="500px">


### Applications

<img src="images/RL app.png"  width ="600px" height ="600px">

Image source: [link to source](https://www.v7labs.com/blog/reinforcement-learning-applications)



## Hyperparameter tuning and its importance in neural networks

<img src="images/hyper parameter tunning.png"  width ="600px" height ="600px">

Image source: [link to source](https://miro.medium.com/v2/resize:fit:828/format:webp/1*HpfPYwNLjrJJy2ewkdk8hw.png)


<img src="images/hyperparameter.png"  width ="600px" height ="600px">

Image source: [link to source](https://editor.analyticsvidhya.com/uploads/64801HPTT.png)

### Importance

Hyperparameter tuning tests different hyperparameter configurations when training your model. Managed services, such as the one offered on Google Cloud, use cloud processing infrastructure to run these trials. Tuning can give you optimized values for hyperparameters, which maximizes your model's predictive accuracy.
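Testing different configurations can be as simple as a grid search. The sketch below tunes one hyperparameter, the learning rate, for gradient descent on a toy cost `(w - 3)^2`; the cost function, the candidate values, and the number of steps are all assumptions for illustration.

```python
def final_cost(learning_rate, steps=20):
    """Run gradient descent on the toy cost (w - 3)^2 and return the final cost."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return (w - 3.0) ** 2

candidates = [0.001, 0.01, 0.1, 0.5, 1.1]          # configurations to try
results = {lr: final_cost(lr) for lr in candidates}
best_lr = min(results, key=results.get)

for lr, value in results.items():
    print(lr, value)
print("best learning rate:", best_lr)
```

Too small a learning rate barely moves the weight in 20 steps, and too large a value makes the updates overshoot and diverge, which is why the choice matters.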



##  Batch normalization 

By normalizing the inputs we are able to bring all the input features to the same scale. In the neural network, we need to compute the pre-activation for the first neuron of the first layer a₁₁. We know that pre-activation is nothing but the weighted sum of inputs plus bias. In other words, it is the dot product between the first row of the weight matrix W₁ and the input matrix X plus bias b₁₁. Batch normalization applies the same normalization idea to these pre-activations, so that each hidden layer also receives values on a consistent scale.

<img src="images/batch normal.png"  width ="500px" height ="500px">

Image source: [link to source](https://miro.medium.com/v2/resize:fit:640/format:webp/1*iJm0g1Od7ekugwTfJY-iUA.png)

Why is it called batch normalization?

Because the mean and standard deviation are computed from a single batch rather than from the entire dataset. Batch normalization is done individually at each hidden neuron in the network.
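For one hidden neuron, the computation over a single batch can be sketched as below. The scale `gamma`, shift `beta`, and small constant `eps` follow the standard formulation and are assumptions not described in the text; the batch values are made up.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's pre-activations using the batch mean and variance."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]

# Pre-activations of one neuron for a batch of four examples (illustrative).
pre_activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(pre_activations)
print(normalized)
```

After normalization the batch has a mean of about 0 and a variance of about 1; `gamma` and `beta` let the network rescale and shift the result if that helps learning.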


## Dropout

Dropout is a regularization technique that “drops out” or “deactivates” a few neurons in the neural network at random during training in order to avoid the problem of overfitting.
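A minimal sketch of dropout applied to one layer's activations is shown below. It uses the "inverted dropout" variant, in which the surviving activations are scaled up during training so that nothing needs to change at prediction time; that variant, the `rate` of 0.5, and the sample activations are assumptions for illustration.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Randomly zero out activations during training; pass through otherwise."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    # Keep each activation with probability `keep` and rescale the survivors.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)                       # seeded for reproducibility
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)
print(dropped, unchanged)
```

A different random subset of neurons is dropped on every training step, so the network cannot rely too heavily on any single neuron.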

<img src="images/before vs after dropout.png"  width ="500px" height ="500px">

Image source: [link to source](https://miro.medium.com/v2/resize:fit:828/format:webp/1*S-Rr9boTfKusUzETeKW6Mg.png)


<img src="images/drop out.png"  width ="500px" height ="500px">

Image source: [link to source](https://theaisummer.com/regularization/)
