# <center>Introduction to Transformers</center>
---

> **`I just want to drive the car, I don't care how it works.`**

## Introduction

Transformers are a type of neural network that can process sequential data, such as text, speech, or images, without relying on recurrent or convolutional layers. Instead, they use a mechanism called **attention** to learn the relationships between different elements in the input and output sequences. Attention allows the model to focus on the most relevant parts of the input for each element of the output, so that each element's representation reflects its context. Attention by itself ignores order, so the position of each element in the sequence is supplied separately through positional encodings.
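
To make the idea of "focusing on the most relevant parts" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The function names (`softmax`, `attention`) and the toy 2-dimensional token vectors are illustrative assumptions, not part of any library; real models use learned projections and tensor libraries.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy "sequence" of three tokens (hypothetical embeddings), used for self-attention.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token attends to every other token, and each output vector is the corresponding weighted mix of the values.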

Transformers consist of two main components: an **encoder** and a **decoder**. The encoder takes the input sequence and transforms it into a sequence of high-dimensional representations, called the **hidden state**. The decoder takes the hidden state and generates the output sequence, one element at a time. Both the encoder and the decoder are composed of multiple identical layers. Each encoder layer contains two sub-layers: a **multi-head attention** layer and a **feed-forward** layer; each decoder layer adds a third sub-layer that attends over the encoder's hidden state. The multi-head attention layer allows the model to attend to different parts of the sequence simultaneously, using multiple attention heads. The feed-forward layer applies a non-linear transformation to the output of the attention layer.
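
The "one element at a time" behaviour of the decoder can be sketched as a simple loop. This is only an illustration under stated assumptions: `greedy_decode` and `toy_next_token` are hypothetical names, and the toy function stands in for a trained decoder by upper-casing the encoder's tokens in order.

```python
def greedy_decode(next_token_fn, hidden_state, start_token, end_token, max_len=10):
    """Generate tokens one by one until the end token or max_len is reached."""
    output = [start_token]
    for _ in range(max_len):
        # The decoder sees the encoder's hidden state and everything generated so far.
        token = next_token_fn(hidden_state, output)
        output.append(token)
        if token == end_token:
            break
    return output

def toy_next_token(hidden_state, generated):
    """Hypothetical stand-in for a trained decoder: upper-case the next input token."""
    position = len(generated) - 1  # tokens produced so far, excluding the start token
    if position < len(hidden_state):
        return hidden_state[position].upper()
    return "</s>"

result = greedy_decode(toy_next_token, ["hello", "world"], "<s>", "</s>")
print(result)  # ['<s>', 'HELLO', 'WORLD', '</s>']
```

A real decoder would pick the next token from a probability distribution over a vocabulary; the loop structure is the part this sketch is meant to show.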

Transformers have been shown to achieve state-of-the-art results in various natural language processing tasks, such as machine translation, text summarization, question answering, and natural language generation. Some of the most famous Transformer models are **BERT**, **GPT-2**, and **GPT-3**. BERT is a bidirectional encoder that can learn from both left and right context, and can be fine-tuned for various downstream tasks. GPT-2 and GPT-3 are large-scale generative models that can produce coherent and diverse text on various topics, given a prompt or a context.

### Transformer's main points
1. The **attention mechanism** helps the neural network learn from very long sequences.
    * Transformers can learn from longer sequences than LSTMs, which are a type of RNN.
    * Attention was first introduced for RNN-based models such as LSTMs, but Transformers use only attention and no recurrent layers at all.
2. Transformers are big and slow.
    * But their computations can be done in parallel (unlike RNNs).

There isn't just one type of Transformer; there are many, for example:
* BERT
* GPT

> Transformers can be applied to almost any kind of data.

## Contents

* Sentiment Analysis
* Embeddings and nearest neighbour search
* Named Entity Recognition (many to many)
* Text generation
* Masked Language Model
* Text summarization (sequence to sequence)
* Language translation (used for building intuition for Attention)
* Question Answering
* Zero-Shot classification

## How do we get from RNNs to Transformers?

RNNs and Transformers are two different types of neural networks that can handle sequential data, such as text, speech, or images. However, they have different ways of processing and learning from the data.

**RNNs** are sequential models that process data one element at a time, maintaining an internal hidden state that is updated at each step. They operate in a recurrent manner, where the output at each step depends on the previous hidden state and the current input. This allows them to capture the temporal dependencies and context in the data. However, RNNs have some limitations, such as:
- They are slow to train, as the computation across time steps cannot be parallelized due to their sequential nature.
- They suffer from the vanishing or exploding gradient problem, where the influence of earlier inputs diminishes or grows exponentially as the sequence progresses, making it difficult to capture long-term dependencies.
- In the classic encoder-decoder setup, they compress the whole input sequence into a single fixed-length representation, which may lose some information or introduce noise.
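
The vanishing gradient limitation above can be illustrated with a scalar toy RNN. This is a minimal sketch, assuming the recurrence `h_t = tanh(w * h_{t-1} + x_t)`; the function name `gradient_through_time` and the chosen weight and inputs are hypothetical.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Track d h_t / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    grads = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
        grads.append(grad)
    return grads

# With |w| < 1 every factor is below 1 in magnitude, so the product shrinks.
grads = gradient_through_time(w=0.5, inputs=[0.1] * 50)
print(grads[0], grads[-1])
```

After 50 steps the gradient with respect to the initial state is vanishingly small, which is why the first inputs barely influence learning. When the per-step factors are larger than 1 in magnitude, the same product grows instead of shrinking.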

**Transformers** are non-sequential models that process data in parallel, using a mechanism called attention to learn the relationships between different elements in the input and output sequences. They do not rely on recurrent or convolutional layers, but instead use multiple layers of self-attention and cross-attention to encode and decode the data. This allows them to capture the global dependencies and context in the data. Some of the advantages of Transformers are:
- They are faster to train on modern hardware, as all positions in a sequence are processed in parallel and training can be distributed across multiple devices.
- They are much less prone to the vanishing or exploding gradient problem over long sequences, as they do not have recurrent connections or backpropagation through time.
- They keep one representation per input element instead of a single fixed-length vector, which can preserve more information and reduce noise.

## Sentiment Analysis (Classification)

**Sentiment analysis** is an NLP technique that detects and extracts the subjective information in a text, such as the author's attitude, opinion, emotion, or sentiment. Some applications of sentiment analysis are:

- **Reputation management**: Monitoring social media platforms and analyzing the feedback and opinions of customers on products, services, or brands.

- **Competitive intelligence**: Comparing the sentiment of a business and its competitors, and identifying their strengths, weaknesses, opportunities, and threats.

- **Customer support**: Determining the best response based on the customer's sentiment, and providing personalized and empathetic solutions.

- **Stock trading**: Reading news articles/tweets that might affect the stock prices of companies, and using sentiment to decide whether to buy or sell.

### Why use Transformers instead of a Bag of Words (BOW) model?

- Transformers can capture the **meaning and context** of the text, while bag of words only counts the **frequency** of the words.
- Transformers can handle **long and complex** sequences of words, while bag of words suffers from the **sparsity and dimensionality** problems.
- Transformers can learn from **pretrained language models**, while bag of words requires **manual feature engineering** or **domain knowledge**.
- Transformers combine **self-attention** with position information, so word order affects the result, while bag of words ignores the **order and position** of the words.
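
The last point is easy to see in a small sketch. The helper `bag_of_words` below is a hypothetical, simplified featurizer (lower-casing and whitespace splitting only), not the API of any particular library.

```python
from collections import Counter

def bag_of_words(text):
    """Count word frequencies, discarding word order entirely."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# Opposite meanings, identical bag-of-words features.
same = a == b
print(same)  # True
```

Both sentences map to exactly the same counts, so any model built only on these features cannot tell them apart.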

### Hugging Face Pipeline

The Hugging Face **pipeline** is a simple and powerful way to use pre-trained models for various natural language processing tasks, such as text classification, sentiment analysis, question answering, and more.

- It is a high-level **wrapper** provided by the Hugging Face Transformers library, which offers a large collection of state-of-the-art models for natural language understanding and generation.

- It offers a simple and unified API that **abstracts** away the complex code and logic behind the models, and allows users to focus on the input and output of the task.

- It supports **multiple tasks**, such as named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering. Each task has a dedicated pipeline class that inherits from the base pipeline class.

- It allows users to **easily load and use** any model from the Hugging Face Hub, which is a platform that hosts thousands of models from the community and the Hugging Face team.

- It provides various options and parameters to customize the pipeline behavior, such as the device (CPU or GPU), the batch size, the return type (dict or list), the top-k results, the aggregation strategy, and more.

#### Workflow
* **Import** the pipeline
* **Load** a pre-trained model
* **Use** the pre-trained model (simply pass the string/list of strings)

In [29]:
from transformers import pipeline

# Create your pipeline (includes tokenization, etc.)
classifier = pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# No need to convert inputs into PyTorch tensors, NumPy arrays, TensorFlow tensors, etc.
# We can simply pass the raw text to the pipeline without any preprocessing.

# Output is a list of dictionaries, one per input
print(classifier("This is such a great movie. It can be watched more than once."))  # passing single input
print(classifier("This is not a great movie. It can't be watched more than once."))  # passing single input

print(classifier(
    [
        "This is such a great movie. It can be watched more than once.",
        "This is not a great movie. It can't be watched more than once."
    ]
)) # passing list of inputs

[{'label': 'POSITIVE', 'score': 0.9998340606689453}]
[{'label': 'NEGATIVE', 'score': 0.9995469450950623}]
[{'label': 'POSITIVE', 'score': 0.9998340606689453}, {'label': 'NEGATIVE', 'score': 0.9995469450950623}]
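
Since each prediction is a dictionary with a `label` and a `score`, downstream code often wants a single number per text. The sketch below is one possible way to do that, assuming a two-class model whose `score` is the probability of the predicted label; `positive_probability` is a hypothetical helper, and the input is the output printed above, copied in by hand so that no model is needed.

```python
def positive_probability(prediction):
    """Map one {'label', 'score'} dict to the probability of the POSITIVE class."""
    score = prediction["score"]
    # For a two-class model, the other class gets the remaining probability.
    return score if prediction["label"] == "POSITIVE" else 1.0 - score

# Predictions copied from the pipeline output shown above.
predictions = [
    {"label": "POSITIVE", "score": 0.9998340606689453},
    {"label": "NEGATIVE", "score": 0.9995469450950623},
]

probs = [positive_probability(p) for p in predictions]
print(probs)
```

The first value is close to 1 and the second close to 0, which makes the results easy to threshold, sort, or store in a DataFrame column.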


## Sentiment Analysis in Python

In [32]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from transformers import pipeline

import sklearn.model_selection as model_selection
import sklearn.metrics as metrics

In [38]:
# current_device() raises on CPU-only builds of PyTorch, so check availability first
device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
print(device)

cpu

In [31]:
# Hugging Face pipeline
classifier = pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(type(classifier))

<class 'transformers.pipelines.text_classification.TextClassificationPipeline'>
