# Deep Learning

<b>Deep learning</b> is a subfield of machine learning that is inspired by the structure and function of the brain, specifically the neural networks that make up the brain. It involves training artificial neural networks (ANNs) with many layers on large datasets, so that the network learns useful representations directly from the data instead of relying on hand-crafted features.

Deep learning techniques are used in a variety of applications, including image and speech recognition, natural language processing, and machine translation.

<b>Neural network</b>:
A neural network is a type of machine learning model that is inspired by the structure and function of the brain. It is composed of layers of interconnected "neurons," which process and transmit information.

The basic structure of a neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the input data and passes it through the hidden layers, which use weights and biases to transform the data and extract features. The output layer produces the final output of the model.

![Image of Runcode](https://www.researchgate.net/publication/330120030/figure/fig1/AS:735637925797888@1552401157053/Deep-Neural-Network-architecture.ppm)

Some key concepts and terminologies in deep learning include:

1. <b>Artificial neural networks:</b> These are networks of artificial neurons that are designed to process data in a way that is similar to how the brain processes information.

2. <b>Perceptron</b>: The perceptron is a type of artificial neural network that is used for binary classification tasks. It is based on a single layer of artificial neurons. Each neuron receives the features of a data point as input and computes a weighted sum of these inputs to make a prediction. If the weighted sum is above a certain threshold, the perceptron classifies the input as belonging to one class; otherwise it classifies it as belonging to the other class.

3. <b>Multi-layer perceptron (MLP)</b>: Multi-layer perceptron (MLP) is a type of artificial neural network that is composed of multiple layers of perceptrons. It is a feedforward neural network, which means that the information flows through the network in only one direction, from the input layer to the output layer, without looping back. MLPs can be used for a wide range of tasks, including classification and regression.

4. <b>Layers:</b> An artificial neural network is composed of multiple layers of interconnected neurons. Each layer processes the input data and passes it on to the next layer.

5. <b>Weights and biases</b>: Weights and biases are parameters of artificial neural networks that are learned during the training process. They are used to transform the input data and make predictions.

    Weights are the values that are multiplied with the input data to produce the output of a neuron. They are represented as a matrix of values, with one row for each input and one column for each neuron in the next layer. For example, if a neural network has 3 inputs and 4 neurons in the next layer, the weights matrix would have a shape of (3, 4).

    Biases are scalar values, one per neuron, that are added to the weighted sum of the inputs before the activation function is applied. They shift the input of the activation function and can help the network to make better predictions.

    In deep learning, weights and biases are learned through the process of training the network. During training, the network is fed a batch of input data and the corresponding target output. The output of the network is compared to the target output, and the error is used to update the weights and biases of the neurons in the network using an optimization algorithm such as stochastic gradient descent.

![Image of Runcode](https://miro.medium.com/max/1400/1*upfpVueoUuKPkyX3PR3KBg.png)
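As a minimal sketch of the weights-and-biases computation described in item 5, the following plain-Python function applies a (3, 4) weight matrix and a bias vector to a single input. The function name `dense_forward` and all numeric values are illustrative assumptions; deep learning frameworks perform the same arithmetic on tensors.

```python
def dense_forward(x, W, b):
    """Pre-activations of one layer: z[j] = sum_i x[i] * W[i][j] + b[j]."""
    n_inputs, n_neurons = len(W), len(W[0])
    assert len(x) == n_inputs and len(b) == n_neurons
    return [sum(x[i] * W[i][j] for i in range(n_inputs)) + b[j]
            for j in range(n_neurons)]

# 3 inputs feeding 4 neurons: W has shape (3, 4), one row per input
# and one column per neuron. b holds one bias per neuron.
W = [[0.5, -1.0, 0.0,  2.0],
     [1.0,  0.5, 0.0,  0.0],
     [0.0,  1.0, 1.0, -1.0]]
b = [0.0, 1.0, -0.5, 0.5]
x = [1.0, 2.0, 3.0]

z = dense_forward(x, W, b)
print(z)  # [2.5, 4.0, 2.5, -0.5]
```

Each of the four outputs would then be passed through an activation function before being handed to the next layer.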

6. <b>Activation functions</b>: Activation functions are an important component of artificial neural networks. They determine the output of a neuron given its input and introduce nonlinearity into the network. Without activation functions, neural networks would be limited to linear models and would be unable to learn complex relationships in the data.

    There are several types of activation functions that are commonly used in deep learning, including:

    * Sigmoid: The sigmoid activation function maps any real-valued number to the range of 0 to 1. It is often used in the output layer of a binary classification model.
    
    * Tanh: The tanh activation function is similar to the sigmoid function but maps values to the range of -1 to 1. It is often used in the hidden layers of a neural network.
    
    * ReLU (Rectified Linear Unit): The ReLU activation function maps all negative values to 0 and leaves positive values unchanged. It is one of the most commonly used activation functions in deep learning and is known for its simplicity and effectiveness.
    
    * Leaky ReLU: The leaky ReLU activation function is similar to the ReLU function but allows a small, nonzero gradient when the input is negative. This helps to alleviate the "dying ReLU" problem, where a neuron outputs 0 for every input, receives zero gradient, and therefore stops learning.
    
    * Softmax: The softmax function is a type of activation function that is often used in the output layer of a neural network for multi-class classification. It maps the output of the network to a probability distribution over the different classes, so that the predicted class is the one with the highest probability.

7. <b>Forward propagation</b>: Forward propagation is the process of passing the input data through an artificial neural network in order to generate an output. It is an essential step in the training and inference process of a neural network.

    In forward propagation, the input data is passed through the input layer of the network and transformed by the weights and biases of the neurons in each successive layer. The output of each layer is then passed as input to the next layer until the output of the final layer is produced.

8. <b>Loss function</b>: A loss function is a function that measures the error between the predicted output of a neural network and the true output. It is used to optimize the performance of the network by adjusting the weights and biases of the neurons.

    There are many types of loss functions that are used in deep learning, depending on the task at hand. Some common loss functions include:

    * Mean squared error (MSE): This loss function measures the average squared difference between the predicted and true output. It is often used for regression tasks.
    
    * Cross-entropy loss: This loss function is often used for classification tasks. It measures the difference between the predicted probability distribution and the true distribution.
    
    * Binary cross-entropy loss: This loss function is used for binary classification tasks. It measures the difference between the predicted probability and the true label.

9. <b>Backpropagation</b>: Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight and bias in a neural network by applying the chain rule backwards, from the output layer to the input layer. An optimizer then uses these gradients to adjust the weights and biases. It is an important step in the training process of a neural network.

    Training with backpropagation involves:

    * Feeding the input data through the network to generate a predicted output.

    * Calculating the error between the predicted output and the true output using the loss function.

    * Propagating the error backwards through the network to obtain the gradients, and using them to update the weights and biases of the neurons.

    * Repeating the process for multiple epochs until the network has learned to make predictions that are accurate enough for the given task.

10. <b>Optimization</b>: Optimization is the process of finding the optimal values for the parameters of a model to minimize the error between the predicted output and the true output. In deep learning, optimization algorithms are used to adjust the weights and biases of the neurons in a neural network to minimize the loss function.

    There are many optimization algorithms that are used in deep learning, each with its own strengths and weaknesses. Some common optimization algorithms include:

    * Stochastic gradient descent (SGD): This is a simple optimization algorithm that iteratively updates the weights and biases of the neurons in the direction of the negative gradient of the loss function, with the gradient estimated from a single sample or a small batch rather than the full dataset.
    
    * Momentum: This optimization algorithm is an extension of SGD that uses the past gradients to smooth out the update and reduce oscillation.
    
    * Adagrad: This optimization algorithm adjusts the learning rate of each weight based on its past gradients, so that frequently updated weights have a lower learning rate.
    
    * Adam: This optimization algorithm combines momentum with per-parameter adaptive learning rates, in the spirit of Adagrad and RMSProp, to provide fast and stable convergence.
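Items 6 to 10 fit together in a single training loop. The following is a minimal sketch in plain Python, assuming a toy dataset (the logical AND of two inputs), a single sigmoid neuron, and a learning rate of 0.5, all chosen for illustration rather than taken from the text above. The gradient step relies on the fact that, for a sigmoid output with binary cross-entropy loss, the derivative of the loss with respect to the weighted sum is `p - y`.

```python
import math

def sigmoid(z):
    """Map any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Clamp negative values to 0 and leave positive values unchanged."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but keep a small slope for negative inputs."""
    return z if z > 0 else slope * z

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(p, y):
    """Loss for one predicted probability p and a 0/1 label y."""
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

# Toy dataset (an assumption for illustration): logical AND of two inputs.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]

w = [0.0, 0.0]        # one weight per input
b = 0.0               # bias
learning_rate = 0.5
losses = []           # mean loss per epoch

for epoch in range(2000):
    epoch_loss = 0.0
    for x, y in data:
        # Forward propagation: weighted sum, then sigmoid activation.
        z = w[0] * x[0] + w[1] * x[1] + b
        p = sigmoid(z)
        epoch_loss += binary_cross_entropy(p, y)

        # Backpropagation: dL/dz = p - y, so dL/dw_i = (p - y) * x_i
        # and dL/db = (p - y).
        grad_z = p - y

        # SGD update: step against the gradient after every sample.
        w[0] -= learning_rate * grad_z * x[0]
        w[1] -= learning_rate * grad_z * x[1]
        b -= learning_rate * grad_z
    losses.append(epoch_loss / len(data))

predictions = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in data]
print(predictions)  # [0, 0, 0, 1]
```

The mean loss stored in `losses` falls over the epochs as the weights and bias move toward values that separate the single positive example from the three negative ones. `relu`, `leaky_relu`, and `softmax` are not needed by this one-neuron model and are included only to show the activation functions from item 6 in code form.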

### Hyperparameters:

Hyperparameters are the parameters of a neural network that are set before training. They control the overall behavior of the network and are an important factor in determining the performance of the model. Some common hyperparameters in deep learning include:

1. Learning rate: The learning rate is a hyperparameter that controls the step size at which the optimizer makes updates to the weights and biases of the neurons. A smaller learning rate may lead to slower convergence, but may also result in a better solution, while a learning rate that is too large can cause the loss to oscillate or diverge.

2. Batch size: The batch size is the number of samples that are processed by the network before the weights and biases are updated. A larger batch size gives a less noisy gradient estimate and often makes better use of the hardware, but requires more memory.

3. Number of epochs: The number of epochs is the number of times the entire dataset is passed through the network during training. A larger number of epochs may result in better performance, but may also lead to overfitting.

4. Number of hidden units: The number of hidden units is the number of neurons in the hidden layers of the network. A larger number of hidden units may result in better performance, but may also increase the complexity of the model.

5. Activation function: The activation function is the function that is applied to the output of each neuron to introduce nonlinearity into the network. Different activation functions can have a significant impact on the performance of the model.

6. Weight initialization: The weight initialization is the method used to set the initial values of the weights of the neurons.
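To make the relationship between batch size, iterations, and epochs concrete, here is a small arithmetic sketch. It assumes a dataset of 60,000 training samples (the size of the MNIST training set); the helper names `iterations_per_epoch` and `epochs_for` are hypothetical.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

def epochs_for(n_iters, num_samples, batch_size):
    """Whole epochs completed after n_iters weight updates."""
    return n_iters // iterations_per_epoch(num_samples, batch_size)

steps = iterations_per_epoch(60000, 100)
epochs = epochs_for(3000, 60000, 100)
print(steps, epochs)  # 600 5
```

With a batch size of 100, one pass over 60,000 samples takes 600 updates, so a budget of 3,000 updates corresponds to 5 epochs. Halving the batch size doubles the number of updates per epoch.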

Some of the most popular deep learning libraries include:

1. <b>TensorFlow:</b> Developed by Google, TensorFlow is a powerful open-source library for training and deploying deep learning models. It has a large community of users and is highly customizable.

2. <b>PyTorch:</b> PyTorch is an open-source library for training and deploying deep learning models. It is known for its flexibility and ease of use, and is popular for research and development of new ideas.

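3. <b>Keras:</b> Keras is a high-level deep learning library that runs on top of a backend framework. It originally supported backends such as TensorFlow and Theano, and Keras 3 supports TensorFlow, JAX, and PyTorch. It is easy to use and allows users to quickly build and train deep learning models.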

There are several types of deep learning techniques, including:

1. <b>Feedforward Neural Networks</b>: These are the most basic type of neural network, and they consist of a linear stack of layers where the output of one layer is fed as input to the next layer. They are used for tasks such as classification and regression.

2. <b>Convolutional Neural Networks (CNNs)</b>: These are commonly used for image and video analysis tasks. They work by applying a series of filters to the input data to extract features and learn patterns in the data.

3. <b>Recurrent Neural Networks (RNNs)</b>: These are commonly used for tasks that involve sequential data, such as language translation and speech recognition. They have the ability to retain memory of previous input and use that information to process current input.

4. <b>Generative Adversarial Networks (GANs)</b>: These are used to generate new data that is similar to a given dataset. They consist of two networks: a generator network that produces new data and a discriminator network that tries to distinguish the generated data from the real data.

Let us explore each of these deep learning techniques in detail.

## 8.1 Feedforward neural networks

Feedforward neural networks are artificial neural networks in which data flows in a single direction, from the input layer to the output layer, with no loops. In the fully connected form used in this section, every neuron in one layer is connected to every neuron in the next layer.

Here is an example of how to implement a feedforward neural network with one hidden layer and ReLU activation in PyTorch, trained on the MNIST dataset:

```python
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# STEP 1: LOAD THE DATASET
train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

# STEP 2: MAKE THE DATASET ITERABLE
batch_size = 100
n_iters = 3000
# Convert the iteration budget into whole epochs:
# iterations per epoch = number of training samples / batch size.
num_epochs = n_iters // (len(train_dataset) // batch_size)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

# STEP 3: CREATE THE MODEL CLASS
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        # Linear function: input -> hidden
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        # Linear function (readout): hidden -> output
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out = self.fc1(x)      # linear function
        out = self.relu(out)   # non-linearity
        out = self.fc2(out)    # linear function (readout), returns logits
        return out

# STEP 4: INSTANTIATE THE MODEL CLASS
input_dim = 28 * 28   # each MNIST image is 28x28 pixels, flattened
hidden_dim = 100
output_dim = 10       # one output per digit class

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

# STEP 5: INSTANTIATE THE LOSS CLASS
# CrossEntropyLoss applies log-softmax internally, so the model outputs raw logits.
criterion = nn.CrossEntropyLoss()

# STEP 6: INSTANTIATE THE OPTIMIZER CLASS
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# STEP 7: TRAIN THE MODEL
iteration = 0  # named `iteration` to avoid shadowing the built-in `iter`
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Flatten each image into a vector of 784 values.
        # Inputs do not need requires_grad; only the parameters are trained.
        images = images.view(-1, 28 * 28)

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate loss: softmax --> cross-entropy loss
        loss = criterion(outputs, labels)

        # Backward pass: compute gradients w.r.t. parameters
        loss.backward()

        # Update parameters
        optimizer.step()

        iteration += 1

        if iteration % 500 == 0:
            # Calculate accuracy on the test set
            correct = 0
            total = 0
            # No gradients are needed for evaluation
            with torch.no_grad():
                for images, labels in test_loader:
                    images = images.view(-1, 28 * 28)

                    # Forward pass only to get logits/output
                    outputs = model(images)

                    # The predicted class is the index of the largest logit
                    _, predicted = torch.max(outputs, 1)

                    # Total number of labels
                    total += labels.size(0)

                    # Total correct predictions
                    correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print the loss of the last training batch and the test accuracy
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iteration, loss.item(), accuracy))
```
Output:

```text
Iteration: 500. Loss: 0.22577865421772003. Accuracy: 91.38999938964844
Iteration: 1000. Loss: 0.08613432943820953. Accuracy: 93.05000305175781
Iteration: 1500. Loss: 0.2590981125831604. Accuracy: 93.94999694824219
Iteration: 2000. Loss: 0.1608916074037552. Accuracy: 94.7300033569336
Iteration: 2500. Loss: 0.14946995675563812. Accuracy: 95.29000091552734
Iteration: 3000. Loss: 0.10177124291658401. Accuracy: 95.62000274658203
```


## 8.2 Convolutional Neural Networks (CNNs):

Convolutional Neural Networks (CNNs) are a type of artificial neural network designed for image recognition and processing. They are inspired by the structure of the visual cortex, which consists of a hierarchy of filters that process the visual input.

![Image of Runcode](https://codetolight.files.wordpress.com/2017/11/network.png?w=1108)

Here are some key terminologies used in CNNs:

1. Convolutional layer: A convolutional layer is a layer of neurons that applies a convolution operation to the input data. The convolution operation involves applying a set of filters (also known as kernels) to the input data to extract features. Each filter is a small matrix that is applied to a region of the input data to generate a feature map.

2. Kernel/Filter: A kernel (also known as a filter) is a small matrix that is used to extract features from the input data in a convolutional layer. Each kernel is applied to a region of the input data and generates a feature map by element-wise multiplication and summing the result.

3. Stride: The stride is the number of pixels that the kernel moves when it is applied to the input data. A larger stride results in a smaller feature map and reduces the computation time, but may also reduce the accuracy of the model.

4. Padding: Padding is the process of adding zeros around the edges of the input data to preserve the spatial dimensions of the output feature map. This allows the kernel to operate on the edges of the input data without reducing the size of the output feature map.

5. Pooling layer: A pooling layer applies a pooling operation to the input data and has no learnable weights. The pooling operation down-samples the input by taking the maximum (max pooling) or average (average pooling) of each small region. Pooling layers are used to reduce the computation time and can help reduce overfitting.

6. Receptive field: The receptive field is the region of the input data that is used to compute the output of a neuron in a convolutional layer. The size of the receptive field determines the context that the neuron can see, which affects the ability of the network to recognize patterns in the input data.
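The effect of kernel size, stride, padding, and pooling on the feature map can be sketched in plain Python. This is a minimal illustration on a single-channel 4x4 input, not how a framework implements convolution; the function names and the example values are assumptions. The output size along one dimension follows the standard formula floor((n + 2p - k) / s) + 1 for input size n, padding p, kernel size k, and stride s.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (both lists of rows) and return the feature map.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    if padding > 0:
        padded_width = len(image[0]) + 2 * padding
        image = ([[0] * padded_width for _ in range(padding)]
                 + [[0] * padding + list(row) + [0] * padding for row in image]
                 + [[0] * padded_width for _ in range(padding)])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    # Each output value is the sum of an element-wise product between
    # the kernel and the image region it currently covers.
    return [[sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(k_h) for j in range(k_w))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, -1]]

print(conv2d(image, kernel))             # [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]
print(conv2d(image, kernel, stride=2))   # [[-4, 2], [6, 6]]
print(max_pool2d(image))                 # [[5, 6], [8, 9]]
print(conv_output_size(28, 5))           # 24
```

A stride of 2 shrinks the 3x3 feature map to 2x2, while a padding of 1 grows it to 5x5, which matches `conv_output_size(4, 2, padding=1)`.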

Here is the PyTorch code for a CNN:

## 8.3 Recurrent Neural Networks (RNNs):

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to process sequential data. They are capable of capturing patterns in the data that are dependent on the sequence of the data, such as time series, natural language, and speech.

RNNs use the same weights and biases for all time steps, but they have a hidden state that is updated at each time step and is used to pass information from one time step to the next. This allows RNNs to capture temporal dependencies in the data.
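The hidden-state update can be sketched with a scalar RNN in plain Python. This assumes the common update rule h_t = tanh(w_x * x_t + w_h * h_(t-1) + b), with a single input feature and a single hidden unit; the function name `rnn_forward` and the weight values are illustrative.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit RNN over a sequence and return the hidden state at each step.

    The same weights (w_x, w_h, b) are reused at every time step; only the
    hidden state h changes as the sequence is consumed.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# A single nonzero input followed by zeros: the hidden state carries a
# fading trace of the first input forward through time.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(states)
```

Even though the second and third inputs are zero, the corresponding hidden states are nonzero, because each one depends on the previous hidden state.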

There are several techniques that are used to improve the performance of RNNs, including:

1. <b>Long Short-Term Memory (LSTM):</b> LSTM is a type of RNN that uses gating mechanisms to allow the network to store and access information over a longer period of time. It is particularly useful for tasks that require the network to remember long-term dependencies.

    LSTM networks maintain a hidden state that is updated at each time step, as in a traditional RNN, and additionally a cell state that acts as a longer-term memory. Three gates control the flow of information into and out of the cell state: the input gate, the forget gate, and the output gate.

    The forget gate decides how much of the previous cell state to keep, which lets the network discard irrelevant information. The input gate decides how much new information is written to the cell state. The output gate decides how much of the cell state is exposed as the hidden state, which is the output of the LSTM at that time step.

2. <b>Gated Recurrent Unit (GRU):</b> GRU is a type of RNN that uses gating mechanisms similar to LSTM, but with a simpler architecture. It has been shown to perform well on a variety of tasks.

3. <b>Attention:</b> Attention is a technique that allows the network to selectively focus on certain parts of the input sequence when making predictions. This can be particularly useful for tasks where the relevant information is scattered throughout the input sequence.

    <span style="color:blue"><b>Transformers</b></span>: Transformers are a type of neural network architecture that was introduced in the paper "Attention is All You Need" (https://arxiv.org/abs/1706.03762). They are primarily used for natural language processing tasks, such as machine translation, language modeling, and text classification.

    Transformers are based on the idea of self-attention, which lets every position in the input sequence attend to every other position when its representation is computed. This allows the network to capture long-range dependencies in the data without the need for recurrence or convolution.

    There are two main components of a transformer: the encoder and the decoder. The encoder processes the input sequence and generates a set of feature vectors (also known as the encoder output). The decoder then uses these feature vectors to generate the output sequence.

    The encoder and decoder are each composed of a stack of identical layers. Each encoder layer consists of a self-attention sublayer followed by a feedforward sublayer; each decoder layer additionally contains an attention sublayer over the encoder output. The self-attention sublayer computes attention weights for each position in the sequence and uses them to form a weighted combination of the feature vectors. The feedforward sublayer then applies a fully-connected neural network to each position's feature vector.
    
    ![Image of Runcode](https://d2l.ai/_images/transformer.svg)
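The self-attention computation described above can be sketched in plain Python as scaled dot-product attention. This is a simplified illustration: it assumes queries, keys, and values are all the raw input vectors, whereas a real transformer first applies learned linear projections and uses several attention heads. The function names and the example input are assumptions.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x.

    x is a list of feature vectors, one per position in the sequence.
    Returns the new feature vectors and the attention weights.
    """
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in x]
        row = softmax(scores)
        weights.append(row)
        # Weighted combination of all value vectors.
        outputs.append([sum(w * value[j] for w, value in zip(row, x))
                        for j in range(d)])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one position attends to every other position. The first position attends more to itself and to the third position than to the second, because its vector is orthogonal to the second one.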

RNNs have a wide range of applications, including natural language processing, speech recognition, machine translation, and time series forecasting.

## 8.4 Generative Adversarial Networks (GANs):

Generative Adversarial Networks (GANs) are a type of neural network architecture that is used for unsupervised learning. They consist of two neural networks: a generator network and a discriminator network. The generator network is trained to generate new data samples that are similar to a training dataset, while the discriminator network is trained to distinguish between the generated samples and the real samples from the training dataset.

The generator and discriminator networks are trained simultaneously in a zero-sum game, where the generator tries to generate samples that the discriminator cannot distinguish from the real samples, and the discriminator tries to accurately distinguish between the real and generated samples. The objective of the GAN is to find an equilibrium, where the generator generates samples that are indistinguishable from the real samples, and the discriminator is unable to distinguish between the two.


![Image of Runcode](https://production-media.paperswithcode.com/methods/gan.jpeg)


Here are some key terminologies used in GANs:

1. Generator: The generator is a neural network that is trained to generate new data samples that are similar to a training dataset. It takes a random noise input and generates a synthetic data sample.

2. Discriminator: The discriminator is a neural network that is trained to distinguish between the generated samples and the real samples from the training dataset. It takes a data sample as input and outputs a probability score indicating the likelihood that the sample is real.

3. Noise: Noise is a random input that is fed into the generator network. It is typically drawn from a Gaussian distribution or a uniform distribution.

4. Adversarial loss: The adversarial loss is the loss function used to train the generator and discriminator networks. It is defined as the negative log-likelihood of the true labels given the output of the discriminator.

The objective of the GAN can be formalized as the following minimax optimization problem:

<b>min_G max_D V(D, G) = E_x[log D(x)] + E_z[log (1 - D(G(z)))]</b>

Where G is the generator network, D is the discriminator network, x is a real data sample, z is a noise input, and V(D, G) is the objective function.
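The objective can be estimated from samples by averaging over a batch. The following minimal sketch assumes the discriminator outputs are already available as lists of probabilities; the function name `gan_value` and the example numbers are hypothetical.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates the two well scores real samples high
# and generated samples low.
confident = gan_value([0.9, 0.95], [0.1, 0.05])

print(confused, confident)
```

The discriminator tries to push V(D, G) up, toward 0, while the generator tries to push it down. When the discriminator outputs 0.5 for every sample, the value equals -log 4, which is the value at the equilibrium described above.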

Here is the PyTorch code for a simple GAN based on fully connected layers, trained on the MNIST dataset:

### Applications of GANs:

Generative Adversarial Networks (GANs) have a wide range of applications and have been used in many different domains. Some of the applications and use cases of GANs include:

1. Image generation: GANs have been used to generate realistic images from a given set of examples. This has been used for tasks such as generating realistic faces, landscapes, and paintings.

2. Image-to-image translation: GANs have been used to translate images from one domain to another. For example, they have been used to transform photographs into paintings, or to transform summer photos into winter photos.

3. Text generation: GANs have been used to generate text, including natural language and code. This has been used for tasks such as generating descriptions of images, generating dialogue, and generating programming code.

4. Data augmentation: GANs have been used to generate synthetic data samples that can be used to augment the training dataset in supervised learning tasks. This can be particularly useful for tasks where the amount of real data is limited.

5. Anomaly detection: GANs have been used to detect anomalous data points in a dataset. The discriminator can be trained to recognize normal data points, and the generator can be used to generate synthetic data points that are similar to the normal data points. Anomalous data points can then be detected by comparing the real data points to the synthetic data points and identifying those that are significantly different.

6. Domain adaptation: GANs have been used to adapt a model trained on one dataset to a different dataset. This can be done by training the generator to generate synthetic data points that are similar to the target dataset and using these data points to train the model.

7. Generating simulations: GANs have been used to generate synthetic simulations of physical processes, such as fluid dynamics and weather patterns. This can be used to generate training data for predictive models or to perform simulations in a virtual environment.