# <center>Introduction to Transformers</center>
---

> **`I just want to drive the car, I don't care how it works.`**

## Introduction

Transformers are a type of neural network that can process sequential data, such as text, speech, or images, without relying on recurrent or convolutional layers. Instead, they use a mechanism called **attention** to learn the relationships between different elements in the input and output sequences. Attention lets the model focus on the most relevant parts of the input for each element of the output, so that each element's representation reflects its context. Attention by itself ignores the order of the elements, so position information is supplied separately, through positional encodings added to the input.
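To make the idea concrete, here is a minimal NumPy sketch of one common form of attention, scaled dot-product attention. The names `queries`, `keys`, and `values`, the shapes, and the random inputs are assumptions chosen for illustration; they do not come from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    # Similarity of every query with every key, scaled by sqrt(d_k).
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    # Each row of weights is a probability distribution over the input positions.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ values, weights

# Illustrative data: 3 output positions attending over 5 input positions.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 4))
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 4))

output, weights = scaled_dot_product_attention(queries, keys, values)
print(output.shape, weights.shape)
```

Each row of `weights` says how strongly one output element focuses on each input element, which is the "relevance" described above.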

The original Transformer consists of two main components: an **encoder** and a **decoder**. The encoder takes the input sequence and transforms it into a high-dimensional representation, called the **hidden state**. The decoder takes the hidden state and generates the output sequence, one element at a time. Both the encoder and the decoder are composed of multiple identical layers. Each encoder layer contains two sub-layers: a **multi-head attention** layer and a **feed-forward** layer. Each decoder layer has the same two sub-layers plus a third, an attention layer over the encoder's output. The multi-head attention layer allows the model to attend to different parts of the sequence simultaneously, using multiple attention heads. The feed-forward layer applies a non-linear transformation to the output of the attention layer.
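The two sub-layers can be sketched in a few lines of NumPy. This is a simplified illustration, not a full implementation: it leaves out the residual connections, layer normalization, and positional encodings that real models add, and all weight names, sizes, and random values are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Every head computes its own attention weights over the sequence.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ v
    # Concatenate the heads again and mix them with an output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise non-linear transformation (ReLU between two linear maps).
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

# Hypothetical sizes and randomly initialised weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_heads = 6, 8, 16, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

attended = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
layer_output = feed_forward(attended, w1, b1, w2, b2)
print(layer_output.shape)
```

The output has the same shape as the input, which is what makes it possible to stack several identical layers on top of each other.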

Transformers have achieved state-of-the-art results in many natural language processing tasks, such as machine translation, text summarization, question answering, and natural language generation. Some of the most famous Transformer models are **BERT**, **GPT-2**, and **GPT-3**. Not every model uses both components: BERT is an encoder-only model that learns from both left and right context and can be fine-tuned for various downstream tasks. GPT-2 and GPT-3 are large decoder-only generative models that can produce coherent and diverse text on various topics, given a prompt or a context.

There isn't just one type of Transformer; there are many, for example:
* BERT
* GPT

> Transformers can be applied to almost any data that can be represented as a sequence.

## Contents

* Sentiment Analysis
* Embeddings and nearest neighbour search
* Named Entity Recognition (many to many)
* Text generation
* Masked Language Model
* Text summarization (sequence to sequence)
* Language translation (used for building intuition for Attention)
* Question Answering
* Zero-Shot classification

## How do we get from RNNs to Transformers?

RNNs and Transformers are two different types of neural networks that can handle sequential data, such as text, speech, or images. However, they have different ways of processing and learning from the data. Here is a brief intuition of how we get from RNNs to Transformers:

- RNNs are sequential models that process data one element at a time, maintaining an internal hidden state that is updated at each step. They operate in a recurrent manner, where the output at each step depends on the previous hidden state and the current input. This allows them to capture the temporal dependencies and context in the data. However, RNNs have some limitations, such as:
    - They are slow to train, because computation cannot be parallelized across time steps: each step has to wait for the hidden state of the previous one.
    - They suffer from the vanishing or exploding gradient problem: gradients shrink or grow exponentially as they are propagated back through many time steps, making it difficult to capture long-term dependencies.
    - In an encoder-decoder setup without attention, they compress the whole input sequence into a single fixed-length vector, which may lose information, especially for long inputs.
- Transformers are non-recurrent models that process all positions of a sequence in parallel (at least during training; a decoder still generates its output one element at a time), using a mechanism called attention to learn the relationships between different elements in the input and output sequences. They do not rely on recurrent or convolutional layers, but instead use multiple layers of self-attention and cross-attention to encode and decode the data. This allows them to capture the global dependencies and context in the data. Some of the advantages of Transformers are:
    - They are fast to train, as they can be parallelized and distributed across multiple devices.
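    - They are much less affected by the vanishing or exploding gradient problem over long sequences, as they have no recurrent connections and no backpropagation through time: any two positions are connected directly through attention.
    - They keep one representation per input element instead of a single fixed-length vector, so the representation grows with the sequence and can preserve more information.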
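
The contrast between the two approaches can be sketched in NumPy as follows. This is a toy illustration under stated assumptions: the function names, the sizes, and the random weights are hypothetical, and the self-attention here uses the inputs directly as queries, keys, and values, without learned projections.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h, b):
    # Sequential: step t cannot start before step t-1 has produced its hidden state.
    hidden = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_x + hidden @ w_h + b)
        states.append(hidden)
    return np.stack(states)

def self_attention(inputs):
    # Parallel: all pairwise interactions are computed in one matrix product.
    scores = inputs @ inputs.T / np.sqrt(inputs.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ inputs

# Hypothetical toy sequence: 5 elements with 4 features each.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 4))
w_x = rng.normal(size=(4, 3)) * 0.5
w_h = rng.normal(size=(3, 3)) * 0.5
b = np.zeros(3)

rnn_states = rnn_forward(inputs, w_x, w_h, b)
summary_vector = rnn_states[-1]      # fixed-length summary of the whole sequence
attended = self_attention(inputs)    # one context-aware vector per element

print(summary_vector.shape, attended.shape)
```

The `for` loop in `rnn_forward` is the part that cannot be parallelized across time steps, while `self_attention` handles the whole sequence with a few matrix operations.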

To summarize, RNNs and Transformers are two different approaches to sequence modeling, with different strengths and weaknesses. RNNs are good at capturing local and sequential dependencies, but have problems with long-term dependencies and scalability. Transformers are good at capturing global dependencies and can be trained in parallel, but the cost of self-attention grows quadratically with sequence length, and their attention patterns can be redundant and hard to interpret.