## History of Machine Learning Models

### RNNs (Recurrent Neural Networks):

### What are Recurrent Neural Networks (RNNs)?

Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data. Unlike traditional feedforward neural networks, which take a fixed-size input and produce a fixed-size output, RNNs can process sequences of varying lengths and are capable of capturing temporal dependencies in the data.

### Recurrent Neural Network (RNN) Architecture

The architecture of a typical Recurrent Neural Network (RNN) consists of three main components:

1. **Input Layer**: The input layer receives the input sequence. Each element of the sequence is represented as a feature vector, and these vectors are fed into the network one at a time.

2. **Recurrent Layer**: The recurrent layer, also known as the hidden layer, is responsible for maintaining the hidden state across time steps. It takes both the current input and the previous hidden state as input and produces an output and a new hidden state. This process is repeated for each element in the sequence, allowing the network to capture temporal dependencies.

3. **Output Layer**: The output layer processes the final hidden state or the sequence of hidden states to produce the final output. Depending on the task, this output can be a single value (e.g., for regression) or a sequence of values (e.g., for sequence prediction or classification).
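The three components above can be sketched as a short NumPy forward pass. This is a minimal illustration rather than a reference implementation: the function name `rnn_forward`, the weight names (`W_xh`, `W_hh`, `W_hy`), the `tanh` activation, and the choice to read the output from the final hidden state are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a simple RNN over one sequence of shape (seq_len, input_dim)."""
    h = np.zeros(W_hh.shape[0])            # initial hidden state
    hidden_states = []
    for x_t in inputs:                     # input layer: one feature vector per time step
        # recurrent layer: mix the current input with the previous hidden state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    # output layer: a linear readout of the final hidden state (an assumption)
    y = W_hy @ h + b_y
    return np.stack(hidden_states), y

# Hypothetical sizes and randomly initialized weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

sequence = rng.normal(size=(seq_len, input_dim))
hidden_states, y = rnn_forward(sequence, W_xh, W_hh, b_h, W_hy, b_y)
print(hidden_states.shape, y.shape)  # (5, 4) (2,)
```

Because the same weights are reused at every step, the function accepts sequences of any length without changing the parameters.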

### Types of Recurrent Layers:

1. **Simple RNN**: The simplest form of recurrent layer, in which the new hidden state is computed by applying a linear transformation to the current input and the previous hidden state, followed by a non-linear activation function (commonly tanh).

2. **Long Short-Term Memory (LSTM)**: A more sophisticated recurrent layer designed to mitigate the vanishing gradient problem and capture long-term dependencies by incorporating specialized memory cells and gating mechanisms.

3. **Gated Recurrent Unit (GRU)**: Similar to LSTM but with a simplified architecture, combining the forget and input gates into a single "update gate" and merging the cell state and hidden state.

### Training and Backpropagation:

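RNNs are typically trained using backpropagation through time (BPTT), an extension of standard backpropagation. BPTT unrolls the network over the time steps of a sequence and propagates gradients backward through the unrolled steps, so the shared weights receive a gradient contribution from every step. For long sequences, the unrolling is often limited to a fixed window (truncated BPTT) to keep memory and computation manageable.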

### Drawbacks of RNNs:

1. **Vanishing Gradient Problem**: RNNs are prone to the vanishing gradient problem, where gradients diminish as they are backpropagated through time, making it difficult to learn long-term dependencies.

2. **Exploding Gradient Problem**: In some cases, gradients can explode during training, leading to numerical instability and poor convergence.

3. **Difficulty in Capturing Long-Term Dependencies**: LSTM and GRU units mitigate this problem, but recurrent models can still struggle to capture dependencies that span very long sequences.

4. **Sequential Computation**: RNNs process sequences one element at a time, which can be slow and inefficient for long sequences.
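The first two drawbacks can be illustrated with a toy calculation. The sketch below assumes a scalar linear recurrence `h_t = w * h_{t-1}`, which is a deliberate simplification of a real RNN, and shows how the gradient of the last state with respect to the first shrinks or grows geometrically with the number of steps. It also includes gradient clipping by norm, a common remedy for exploding gradients; the helper names are hypothetical.

```python
import numpy as np

def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w            # chain rule: each step multiplies by the recurrent weight
    return grad

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

vanishing = gradient_through_time(0.5, 50)   # |w| < 1: the gradient shrinks toward zero
exploding = gradient_through_time(1.5, 50)   # |w| > 1: the gradient blows up
print(f"{vanishing:.2e} {exploding:.2e}")

# A gradient with norm 50 is rescaled to norm 5; its direction is unchanged.
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)
print(np.linalg.norm(clipped))
```

Clipping only bounds the size of an update. It does not help with vanishing gradients, which is one motivation for the gated architectures described in this document.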

### Applications of RNNs:

1. **Sequence Prediction**: RNNs are commonly used for sequence prediction tasks such as language modeling, speech recognition, and time series forecasting.

2. **Natural Language Processing (NLP)**: RNNs are widely used in NLP tasks such as machine translation, sentiment analysis, and text generation.

3. **Speech Recognition**: RNNs are used in speech recognition systems to model the temporal dependencies in audio signals.

4. **Time Series Analysis**: RNNs are effective for modeling and predicting time series data such as stock prices, weather patterns, and sensor readings.

### Read More: https://arxiv.org/abs/1808.03314


### Example Code:
>>[<a href=https://colab.research.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/rnn.ipynb>Keras RNN</a>]

### What are LSTMs?

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture designed to overcome the limitations of traditional RNNs in capturing long-range dependencies in sequential data. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs have become a fundamental building block in various applications such as natural language processing, speech recognition, time series analysis, and more.

### Architecture of LSTMs:

LSTMs consist of memory cells and various gates that control the flow of information within the cell. The key components include:

1. **Memory Cell**: This is the core unit of the LSTM. It maintains a cell state that can store information over time, alongside the hidden state that is passed to the next time step. The memory cell is regulated by various gates to control the flow of information into and out of the cell.

2. **Forget Gate**: Determines what information should be discarded from the cell's memory. It takes as input the previous hidden state and the current input, and produces a forget vector with values between 0 and 1 that scales each element of the previous cell state.

3. **Input Gate**: Determines what new information should be stored in the cell's memory. It consists of a sigmoid layer that decides which values to update and a tanh layer that creates a vector of new candidate values to be added to the cell state.

4. **Output Gate**: Determines the output of the LSTM cell. Based on the current input and the previous hidden state, it controls how much of the (tanh-squashed) cell state is exposed as the new hidden state.

### Working of LSTMs:

1. **Forget Stage**: The forget gate decides which information from the previous cell state is relevant to keep.

2. **Input Stage**: The input gate decides which new information is relevant to add to the cell state.

3. **Update Stage**: The new cell state is the previous cell state scaled by the forget gate, plus the candidate values scaled by the input gate.

4. **Output Stage**: The output gate decides what information from the cell state should be output.
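The four stages above correspond to a few lines of arithmetic per time step. The following NumPy sketch shows one LSTM step under some assumptions made for illustration: all four gate weight matrices are stacked into a single matrix `W` in the order forget, input, candidate, output, and the names `lstm_step`, `W`, and `b` are hypothetical. Real libraries may order or split the weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, input + hidden), b has shape (4 * hidden,)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:n])            # forget stage: what to keep from the previous cell state
    i = sigmoid(z[n:2 * n])        # input stage: which candidate values to write
    g = np.tanh(z[2 * n:3 * n])    # candidate values for the cell state
    o = sigmoid(z[3 * n:4 * n])    # output stage: what to expose from the cell state
    c = f * c_prev + i * g         # update stage: additive cell-state update
    h = o * np.tanh(c)             # new hidden state
    return h, c

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The line `c = f * c_prev + i * g` is the part that helps gradients survive over many steps: when the forget gate is close to 1, the cell state is carried forward almost unchanged.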

### Advantages of LSTMs:

1. **Long-Term Dependency Handling**: LSTMs are capable of capturing long-term dependencies in sequences, making them suitable for tasks where context over a long span is crucial.

2. **Gradient Flow**: The additive cell-state update lets gradients flow across many time steps, so LSTMs mitigate the vanishing gradient problem better than simple RNNs and train more effectively on long sequences.

3. **Flexibility**: LSTMs are flexible and can be adapted to various tasks by adjusting the architecture or adding additional components.

### Applications of LSTMs:

1. **Natural Language Processing**: LSTMs are widely used for tasks such as language modeling, sentiment analysis, machine translation, and named entity recognition.

2. **Speech Recognition**: LSTMs have been successfully applied in speech recognition systems to handle temporal dependencies in audio data.

3. **Time Series Prediction**: LSTMs are effective in forecasting and predicting time series data such as stock prices, weather patterns, and energy consumption.

4. **Anomaly Detection**: LSTMs are employed in anomaly detection systems to identify unusual patterns in sequences of data.

### Drawbacks of LSTMs:

1. **Complexity**: LSTMs are more complex than traditional RNNs, which can make them harder to train and tune for optimal performance.

2. **Computational Cost**: LSTMs can be computationally expensive, especially for large-scale applications with high-dimensional data.

3. **Overfitting**: LSTMs are prone to overfitting, especially when dealing with small datasets or noisy data.

### Example Code:
>>[<a href=https://colab.research.google.com/github/d2l-ai/d2l-en-colab/blob/master/chapter_recurrent-modern/lstm.ipynb>Dive into Deep Learning (D2L) LSTM notebook</a>]

### What are Gated Recurrent Units (GRUs)?

Gated Recurrent Units (GRUs) are a type of recurrent neural network (RNN) architecture that is similar to Long Short-Term Memory (LSTM) networks but with a simplified structure. Introduced by Cho et al. in 2014, GRUs are designed to capture long-range dependencies in sequential data while being computationally more efficient than LSTMs.

### Architecture of GRUs:

GRUs consist of two main components: a reset gate and an update gate. These gates control the flow of information within the network and help regulate the memory of the network over time.

1. **Reset Gate**: The reset gate determines how much of the past information to forget. It takes as input the previous hidden state and the current input and produces a reset vector that modulates the information from the past.

2. **Update Gate**: The update gate determines how much of the previous hidden state to carry forward and how much to replace with new information. It combines the previous hidden state and the current input to produce an update vector that interpolates between the previous hidden state and a newly computed candidate state.

### Working of GRUs:

1. **Reset Stage**: The reset gate decides which information from the past hidden state is relevant to forget.

2. **Update Stage**: A candidate hidden state is computed from the current input and the reset-scaled previous hidden state, and the update gate decides how much of this candidate replaces the previous hidden state.

3. **Output Stage**: The output is computed based on the current hidden state and is used for prediction or further processing.
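As a rough sketch, one GRU step can be written in NumPy as follows. The function and weight names are hypothetical, and the final interpolation uses the convention `h = (1 - z) * h_prev + z * h_cand`; some references and libraries swap the roles of `z` and `1 - z`, which is equivalent up to relabeling.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step. Each weight matrix has shape (hidden, input + hidden)."""
    xh = np.concatenate([x_t, h_prev])
    r = sigmoid(W_r @ xh + b_r)    # reset stage: how much of the past to use for the candidate
    z = sigmoid(W_z @ xh + b_z)    # update gate: how much of the candidate to accept
    # candidate state, built from the input and the reset-scaled previous state
    h_cand = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]) + b_h)
    # update stage: interpolate between the previous state and the candidate
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
shape = (hidden_dim, input_dim + hidden_dim)
W_r, W_z, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_r, b_z, b_h = (np.zeros(hidden_dim) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h = gru_step(x_t, h, W_r, W_z, W_h, b_r, b_z, b_h)
print(h.shape)  # (4,)
```

Unlike the LSTM, there is no separate cell state: the hidden state `h` is both the memory and the output of the step.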

### Advantages of GRUs:

1. **Simplicity**: GRUs have a simpler architecture compared to LSTMs, making them easier to understand and implement.

2. **Efficiency**: GRUs are computationally more efficient than LSTMs, making them suitable for applications where speed and memory usage are critical.

3. **Performance**: GRUs have been shown to perform comparably to LSTMs on various tasks while requiring fewer parameters.
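The parameter saving can be estimated with simple arithmetic. The sketch below assumes the textbook formulation, in which an LSTM has four weight blocks (three gates plus the candidate) and a GRU has three (two gates plus the candidate), each with one weight matrix over the concatenated input and hidden state and one bias vector. Library implementations may add extra bias terms, so exact counts can differ slightly; the helper name and the layer sizes are hypothetical.

```python
def recurrent_params(input_dim, hidden_dim, num_blocks):
    """Parameter count for a layer with num_blocks weight blocks (textbook formulation)."""
    per_block = hidden_dim * (input_dim + hidden_dim) + hidden_dim  # weights + bias
    return num_blocks * per_block

input_dim, hidden_dim = 128, 256
lstm_params = recurrent_params(input_dim, hidden_dim, num_blocks=4)
gru_params = recurrent_params(input_dim, hidden_dim, num_blocks=3)
print(lstm_params, gru_params)  # 394240 295680
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.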

### Applications of GRUs:

1. **Natural Language Processing**: GRUs are commonly used in NLP tasks such as machine translation, sentiment analysis, and text generation.

2. **Speech Recognition**: GRUs are employed in speech recognition systems to model temporal dependencies in audio data.

3. **Time Series Analysis**: GRUs are effective for modeling and predicting time series data such as stock prices, weather patterns, and sensor readings.

### What are Variational Autoencoders (VAEs)?

Variational Autoencoders (VAEs) are a type of generative model that combines the principles of autoencoders and variational inference to learn a probabilistic representation of input data. Introduced by Kingma and Welling in 2013, VAEs have become popular for tasks such as image generation, data compression, and representation learning.

### Architecture of Variational Autoencoders:

VAEs consist of two main components: an encoder and a decoder. The encoder maps the input data to a latent space representation, while the decoder reconstructs the input data from the latent space representation.

1. **Encoder**: The encoder takes the input data and maps it to a distribution in the latent space. It consists of a neural network that outputs the mean and variance of the latent space distribution (in practice, the variance is often parameterized as a log-variance for numerical stability).

2. **Decoder**: The decoder takes a sample from the latent space distribution and reconstructs the input data. It consists of a neural network that generates the output based on the latent space representation.

### Working of Variational Autoencoders:

1. **Encoding Stage**: The encoder maps the input data to a distribution in the latent space using a neural network.

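2. **Sampling Stage**: A sample is drawn from the latent space distribution using the reparameterization trick, which expresses the sample as a deterministic function of the encoder outputs and independent noise so that gradients can flow through the sampling step.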

3. **Decoding Stage**: The decoder reconstructs the input data from the latent space representation using a neural network.
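The sampling stage is the least obvious of the three, so here is a minimal NumPy sketch of the reparameterization trick for a diagonal Gaussian latent distribution. The values of `mu` and `log_var` are made up and stand in for encoder outputs, and the log-variance parameterization is an assumption of this example.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)   # the noise does not depend on the parameters
    sigma = np.exp(0.5 * log_var)         # convert log-variance to standard deviation
    return mu + sigma * eps               # differentiable with respect to mu and log_var

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])                # hypothetical encoder mean
log_var = np.array([0.0, -2.0])           # hypothetical encoder log-variance

samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(10000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

The sample mean and standard deviation should come out close to `mu` and `exp(0.5 * log_var)`, which confirms that the shifted and scaled noise follows the intended distribution.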

### Objective Function of Variational Autoencoders:

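The training objective of VAEs consists of two terms: a reconstruction loss, which measures how well the decoder reproduces the input, and a regularization term based on the Kullback-Leibler (KL) divergence between the learned latent space distribution and a prior distribution (typically a standard normal). Minimizing the sum of the two terms is equivalent to maximizing the evidence lower bound (ELBO) on the data likelihood.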
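A sketch of this objective in NumPy is shown below. It assumes a diagonal Gaussian latent distribution and a standard normal prior, for which the KL term has a closed form, and it uses a sum of squared errors as the reconstruction loss, which is one common choice among several (binary cross-entropy is another). The function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction loss (sum of squared errors, an assumption) plus the KL term."""
    reconstruction = np.sum((x - x_recon) ** 2)
    return reconstruction + kl_to_standard_normal(mu, log_var)

# The KL term is zero when the latent distribution already matches the prior ...
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))            # 0.0
# ... and grows as the mean moves away from zero.
print(kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2)))   # 0.5
```

The KL term pulls the latent distributions toward the prior, while the reconstruction term pushes them to stay informative about the input.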

### Applications of Variational Autoencoders:

1. **Image Generation**: VAEs are used to generate realistic images by sampling from the learned latent space distribution.

2. **Data Compression**: VAEs can be used for data compression by encoding data into a lower-dimensional latent space representation.

3. **Representation Learning**: VAEs learn a meaningful representation of the input data that can be used for downstream tasks such as classification and clustering.

# Transformer-Based Models

## Natural Language Models

### BERT

### Generative Pre-trained Transformer (GPT) Models

What is the GPT of ChatGPT, why is 

### Llama 

### Others

## Vision-Based Models

## Audio-Based Models

## Multimodal Models