
# Chapter 1, Hello Transformers
---

1. A. Vaswani et al., **“Attention Is All You Need”, (2017).**
2. J. Howard and S. Ruder, **“Universal Language Model Fine-Tuning for Text Classification”, (2018).**
3. A. Radford et al., **“Improving Language Understanding by Generative Pre-Training”, (2018).**
4. J. Devlin et al., **“BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”, (2018).**

## Overview

### CNN

A **Convolutional Neural Network**, or CNN for short, is a type of neural network inspired by `how our brains process visual information`. It is particularly good at understanding images and the patterns within them. Think of it as a detective that looks at a picture and spots important features such as edges, shapes, or even whole objects.

CNNs are mainly used for tasks related to `images` and `spatial data`. They're commonly seen in applications like recognizing objects in photos, self-driving cars identifying pedestrians, and even in medical imaging for diagnosing diseases from scans.

### RNN

RNNs, or **Recurrent Neural Networks**, `handle sequences of data`. They're well suited to things that happen over time, like predicting the next word in a sentence or forecasting stock prices. They remember what they've seen previously and use that knowledge to make predictions about what comes next.

RNNs have loops that allow information to flow from one step in a sequence to the next. This looping mechanism lets them maintain a memory of past data, making them great for tasks where understanding context is essential. However, they can struggle with really long sequences.
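The looping mechanism described above can be sketched with a scalar recurrent step. This is a minimal illustration, not a trained model: the weights `w_h`, `w_x`, and `b` are hand-picked assumptions, and real RNNs use weight matrices and vector states.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One step of a scalar RNN: the new state mixes the old state with the new input."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(inputs, h0=0.0):
    """Feed the inputs one at a time, carrying the hidden state forward."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(h, x)
        states.append(h)
    return states

# A single non-zero input followed by zeros: the state is the only "memory".
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

With these weights the trace of the first input shrinks at every later step, which is a small-scale picture of why plain RNNs struggle with very long sequences.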

### LSTM

LSTM is like an upgraded version of RNNs. It stands for **Long Short-Term Memory**, and it's designed to solve that "short memory" problem RNNs have. LSTMs are better at keeping track of information over long sequences because they have a more sophisticated memory system that can remember important things and forget unimportant stuff.

LSTMs are commonly used in tasks that require a long memory, like machine translation (for converting one language into another), speech recognition, and even in chatbots for natural language understanding and generation.
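To make the "remember and forget" idea concrete, here is a sketch of one scalar LSTM cell using the standard gate equations. The weights in `params` are hypothetical and hand-picked so that the forget gate stays open and the input gate only opens for large inputs; a trained LSTM learns these values and works on vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights: a large forget-gate bias keeps memory, and the
# input gate opens only when x is large.
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (2.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
cells = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    cells.append(c)
print([round(value, 3) for value in cells])
```

The cell state written at the first step is still mostly intact four steps later, in contrast to a plain recurrent state that decays at every step.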

### Transformers
The Transformer is a newer and highly flexible architecture. Instead of relying on sequential processing like RNNs and LSTMs, it uses something called "attention" to process data all at once. It's like having a super-smart group of people in a room where everyone talks to each other and decides what's important collectively.

Transformers have made a huge impact in natural language processing, improving systems such as Google Translate and enabling chatbots that hold more natural conversations. They are also versatile and have been applied to tasks like image recognition, which has made them popular across many domains.

To sum it up, CNNs are for images, RNNs and LSTMs are for sequences, and Transformers are versatile and great for various tasks.

### Encoder Decoder Framework

The Encoder-Decoder Framework is a neural network architecture designed for sequence-to-sequence tasks, where the goal is to transform an input sequence into an output sequence. It's widely used in tasks like machine translation, text summarization, speech recognition, and more. The framework leverages two key components to achieve this: the encoder and the decoder.

**Main Ideas**:

1. **Encoder**:
   - The encoder is the first part of the framework and is responsible for processing the input sequence. It encodes the input information into a fixed-length representation called a context vector or thought vector.
   - The encoder typically consists of recurrent neural networks (RNNs), long short-term memory networks (LSTMs), or, more recently, stacks of Transformer encoder layers.
   - The main idea behind the encoder is to capture the essential information from the input sequence and create a condensed representation that contains the relevant context for generating the output sequence.

2. **Decoder**:
   - The decoder is the second part of the framework and is responsible for generating the output sequence based on the encoded context vector.
   - Like the encoder, the decoder can also use RNNs, LSTMs, or transformer-based models. It generates the output sequence one step at a time, taking into account both the previously generated elements of the output sequence and the context vector.
   - The main idea behind the decoder is to use the context vector to produce a coherent and contextually relevant output sequence.

3. **Sequence-to-Sequence Mapping**:
   - The core concept of the Encoder-Decoder Framework is the mapping of variable-length input sequences to variable-length output sequences. This allows it to handle tasks such as language translation, where the length and structure of the input and output can differ.

4. **Context Preservation**:
   - The encoder plays a crucial role in capturing context from the input sequence. It ensures that important information is retained in the context vector, which the decoder can use to generate meaningful output.
   
5. **Training with Teacher Forcing**:
   - During training, the decoder is typically trained using a technique called teacher forcing. In this approach, the actual target sequence is used as input during training, rather than the previously generated output. This helps in stabilizing training but may lead to exposure bias during inference.

6. **Attention Mechanism**:
   - Often, the Encoder-Decoder Framework incorporates attention mechanisms. Attention allows the model to focus on different parts of the input sequence when generating each element of the output sequence, enhancing its ability to capture long-range dependencies.

7. **Variants and Improvements**:
   - Over time, the Encoder-Decoder Framework has seen various improvements and variants, such as the introduction of attention mechanisms, transformer-based models, and techniques to mitigate issues like vanishing gradients.

In summary, the Encoder-Decoder Framework is a versatile architecture that enables the modeling of complex sequence-to-sequence relationships by using an encoder to capture context from input sequences and a decoder to generate output sequences. It has found applications in a wide range of natural language processing and sequence modeling tasks, making it a fundamental concept in the field of deep learning.
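The difference between teacher forcing and inference-time decoding can be sketched in a few lines. This is a simplified illustration: the context vector from the encoder is left out, the `<bos>` and `<eos>` markers are assumed conventions, and the "decoder" is a hypothetical lookup table standing in for a trained network.

```python
BOS, EOS = "<bos>", "<eos>"

def teacher_forcing_pairs(target_tokens):
    """Return (decoder_input, expected_output) pairs for one training example.

    At every step the decoder is fed the ground-truth previous token,
    not its own earlier prediction.
    """
    decoder_inputs = [BOS] + target_tokens
    expected_outputs = target_tokens + [EOS]
    return list(zip(decoder_inputs, expected_outputs))

def free_running_decode(step_fn, max_len=10):
    """At inference time the decoder consumes its own previous prediction."""
    token, output = BOS, []
    for _ in range(max_len):
        token = step_fn(token)
        if token == EOS:
            break
        output.append(token)
    return output

pairs = teacher_forcing_pairs(["Sehr", "geehrter", "Amazon"])
print(pairs)

# Hypothetical stand-in for a trained decoder: previous token -> next token.
lookup = dict(pairs)
decoded = free_running_decode(lambda prev: lookup[prev])
print(decoded)
```

Exposure bias arises because a real decoder, unlike this lookup table, can make a mistake during free-running decoding and then has to continue from an input it never saw during training.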

### Attention Mechanisms

The Attention Mechanism is a crucial component in modern deep learning models, especially in tasks like machine translation, image captioning, and natural language understanding. It was originally inspired by human visual attention, which enables us to focus on specific parts of an image or text while processing information.

**Main Ideas**:

1. **Selective Focus**:
   - At its core, the attention mechanism allows a model to selectively focus on different parts of the input sequence or image when making predictions. This mimics human cognition by giving more weight to relevant information and less to irrelevant data.

2. **Contextual Information**:
   - Instead of blindly processing the entire input, attention allows the model to consider the context. It dynamically assigns different levels of importance to different elements based on their relevance to the current task or step in processing.

3. **Weighted Sum**:
   - Attention computes a weighted sum of the input elements. The weights are not fixed: they are computed for each input from learned projections, and they indicate how relevant each element of the sequence or image is to the current step.

4. **Parallelism**:
   - Attention enables parallelism in processing. Unlike traditional sequential models that process inputs one step at a time, attention mechanisms can process multiple inputs simultaneously, making them more efficient for many tasks.

5. **Scalability**:
   - Attention mechanisms can scale to handle inputs of varying lengths. This makes them particularly well-suited for tasks involving sequences of different lengths, such as machine translation where sentences can be of varying lengths.

6. **Self-Attention**:
   - Self-attention is a specific type of attention mechanism where the model attends to different parts of its own input. It's a key component in Transformer models, which have revolutionized natural language processing tasks.

7. **Multi-Head Attention**:
   - To capture different aspects of the input, attention can be divided into multiple "heads," each learning different attention patterns. Multi-head attention has been highly effective in improving model performance.

8. **Interpretable Representations**:
   - Attention mechanisms can also provide insights into model decisions. By visualizing attention weights, we can understand which parts of the input the model focused on when making predictions, making it more interpretable.

Overall, attention mechanisms have significantly improved the performance of deep learning models by allowing them to focus on relevant information, handle variable-length sequences, and process inputs more efficiently, making them a cornerstone in modern deep learning architectures.
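The weighted-sum idea can be sketched as scaled dot-product attention for a single query, the form used in "Attention Is All You Need". The vectors below are made-up toy values, and the learned projections that produce queries, keys, and values in a real model are omitted.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy inputs: three positions, two dimensions each.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([4.0, 0.0], keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

In self-attention the queries, keys, and values all come from the same sequence, and multi-head attention runs several such computations side by side with different projections.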

### Transfer Learning in NLP

Transfer learning in Natural Language Processing (NLP) is a machine learning paradigm that reuses pre-trained models, originally trained on large-scale language data, to improve performance on NLP tasks. It rests on the idea that the knowledge acquired from one task or dataset can be transferred and adapted to perform better on a different but related NLP task.

Now, let's explore the main ideas behind Transfer Learning in NLP:

1. **Pre-trained Language Models**: The foundation of Transfer Learning in NLP lies in pre-trained language models. These models are neural networks that have already been trained on vast amounts of text data from the internet. They learn to understand the complexities of language, including semantics, grammar, and context.

2. **Feature Extraction**: One key concept is feature extraction. Instead of starting from scratch when tackling an NLP task, you can use a pre-trained language model to extract meaningful features from your text data. These features capture rich linguistic information and can serve as valuable inputs to your downstream NLP model.

3. **Fine-tuning**: Fine-tuning goes a step beyond feature extraction. Rather than keeping the pre-trained model frozen, you continue training it on your specific NLP task using a smaller dataset related to your problem. Fine-tuning adapts the model's knowledge to the nuances of your task, making it more effective.

4. **Task Agnostic Representations**: Pre-trained language models learn task-agnostic representations of text. This means they understand language in a general sense and aren't specialized for any specific NLP task. This versatility allows you to use the same pre-trained model for various NLP tasks.

5. **Domain Adaptation**: Transfer Learning is also powerful for domain adaptation. If you have a pre-trained model on general text but need to work in a specialized domain (e.g., legal, medical), you can fine-tune the model on domain-specific data. This adaptation helps the model understand the domain-specific terminology and context.

6. **State-of-the-Art Performance**: Pre-trained language models have consistently achieved state-of-the-art results on a wide range of NLP benchmarks. They have become a go-to tool for NLP practitioners due to their remarkable performance.

7. **Resource Efficiency**: Transfer Learning in NLP saves both time and resources. Instead of training a massive model from scratch, you can build upon existing pre-trained models, which are usually open-source and readily available.

8. **Challenges**: There are challenges to consider, such as model size, computational resources, and ethical concerns regarding bias and fairness. These are important aspects to address when implementing Transfer Learning in NLP.
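Feature extraction, as described in the list above, can be sketched with a toy model: a frozen "body" produces features and only a small classification head is trained. Everything here is an illustrative assumption, including the fixed `FROZEN_WEIGHTS`, the four made-up data points, and the plain gradient-descent loop; a real setup would use a pre-trained language model as the body.

```python
import math

# Hypothetical "pre-trained" body: fixed weights that are never updated here.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen body: maps a raw 2-d input to a 2-d feature vector."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def predict(head, features):
    """Logistic-regression head on top of the frozen features."""
    z = sum(w * f for w, f in zip(head["w"], features)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, epochs=200, lr=0.5):
    """Train only the small classification head with plain gradient descent."""
    head = {"w": [0.0, 0.0], "b": 0.0}
    cached = [(extract_features(x), y) for x, y in data]  # features computed once
    for _ in range(epochs):
        for features, y in cached:
            error = predict(head, features) - y  # gradient of the log loss w.r.t. z
            head["w"] = [w - lr * error * f for w, f in zip(head["w"], features)]
            head["b"] -= lr * error
    return head

# Made-up labelled examples for the downstream task.
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
head = train_head(data)
predictions = [round(predict(head, extract_features(x))) for x, _ in data]
print(predictions)
```

Fine-tuning would differ in one respect: the gradient step would also update the body's weights instead of leaving them frozen.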

### Hugging Face Transformers
Hugging Face Transformers is a popular open-source library and platform for working with state-of-the-art natural language processing (NLP) models, particularly transformer-based models. It provides a comprehensive set of tools and pre-trained models for a wide range of NLP tasks.

**Core Ideas**:

1. **Pre-trained Transformer Models**:
   - Hugging Face Transformers is centered around the use of pre-trained transformer-based models. These models, such as BERT, GPT-2, RoBERTa, and many others, have achieved groundbreaking results in various NLP tasks.
   - The core idea here is that these models are pre-trained on massive text corpora, allowing them to learn rich contextual representations of language. These pre-trained models serve as powerful feature extractors for downstream NLP tasks.

2. **Transfer Learning**:
   - One of the key concepts behind Hugging Face Transformers is transfer learning. Instead of training NLP models from scratch for specific tasks, which can be data and resource-intensive, the library allows practitioners to fine-tune pre-trained models on their specific datasets.
   - Transfer learning with pre-trained models drastically reduces the amount of labeled data required for new tasks, making it more accessible and efficient.

3. **Ease of Use**:
   - Hugging Face Transformers is designed with ease of use in mind. It provides a user-friendly API that abstracts many of the complexities of working with transformer models.
   - Users can easily load pre-trained models, perform tasks like text classification, text generation, and more with just a few lines of code, making it accessible to both researchers and developers.

4. **Community and Open Source**:
   - The library has a vibrant open-source community that contributes to its growth. It encourages collaboration and sharing of models and code.
   - This open approach has led to a vast ecosystem of pre-trained models, model architectures, and tools built around Hugging Face Transformers.

5. **Customization and Fine-tuning**:
   - Hugging Face Transformers allows users to customize and fine-tune models for specific tasks. You can adapt pre-trained models to your unique NLP problems by adjusting hyperparameters and training on your data.

6. **Wide Range of NLP Tasks**:
   - The library supports a broad spectrum of NLP tasks, including text classification, named entity recognition, text generation, language translation, question-answering, and more.
   - This versatility makes it a go-to choice for NLP practitioners and researchers working on various applications.

7. **Model Hub and Model Sharing**:
   - Hugging Face provides a Model Hub, where users can discover, share, and download pre-trained models and configurations. This promotes model sharing and collaboration within the NLP community.

8. **Integration and Deployment**:
   - The library can be easily integrated into production systems, making it suitable for both research and real-world applications. It supports deployment on various platforms and cloud services.

In conclusion, Hugging Face Transformers revolutionized the field of NLP by making powerful pre-trained transformer-based models accessible and easy to use for a wide range of NLP tasks. Its core ideas of transfer learning, community collaboration, ease of use, and support for customization have made it a cornerstone in the NLP and machine learning communities.

## Text Classification

In [14]:
! pip install transformers -q
import pandas as pd
from transformers import pipeline

In [15]:
input_text = """Dear Amazon, last week I ordered an Optimus Prime action figure
    from your online store in Germany. Unfortunately, when I opened the package,
    I discovered to my horror that I had been sent an action figure of Megatron
    instead! As a lifelong enemy of the Decepticons, I hope you can understand my
    dilemma. To resolve the issue, I demand an exchange of Megatron for the
    Optimus Prime figure I ordered. Enclosed are copies of my records concerning
    this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

In [16]:
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
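The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the checkpoint name and revision hash printed in that warning. The `PINNED_CLASSIFIER` dictionary and the `build_classifier` helper are illustrative names, not part of the library, and it is an assumption that your installed `transformers` version still resolves this short revision hash.

```python
# Checkpoint name and revision copied from the warning printed by the pipeline.
PINNED_CLASSIFIER = {
    "task": "text-classification",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}

def build_classifier(config=PINNED_CLASSIFIER):
    """Create the pipeline with an explicit checkpoint instead of the task default."""
    # Imported lazily so the config can be inspected without the library installed.
    from transformers import pipeline
    return pipeline(config["task"], model=config["model"], revision=config["revision"])
```

Calling `build_classifier()` downloads the checkpoint on first use, just like the default pipeline, but the result no longer depends on whichever default the library ships.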


In [17]:
outputs = classifier(input_text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


## Named Entity Recognition (NER)
In NLP, real-world objects like products, places, and people are called named entities, and extracting them from text is called named entity recognition (NER).

In [18]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [19]:
outputs = ner_tagger(input_text)
pd.DataFrame(outputs)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 94 | 101 |
| 3 | MISC | 0.556571 | Mega | 216 | 220 |
| 4 | PER | 0.590256 | ##tron | 220 | 224 |
| 5 | ORG | 0.669692 | Decept | 265 | 271 |
| 6 | MISC | 0.498349 | ##icons | 271 | 276 |
| 7 | MISC | 0.775362 | Megatron | 366 | 374 |
| 8 | MISC | 0.987854 | Optimus Prime | 387 | 400 |
| 9 | PER | 0.812096 | Bumblebee | 526 | 535 |


See those weird hash symbols (#) in the word column in the previous table? These are produced by the model’s tokenizer, which splits words into atomic units called tokens.
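As a small illustration of what the `##` prefix means, the sketch below joins continuation pieces back onto the preceding token. It assumes the WordPiece convention used by BERT-style tokenizers, where `##` marks a piece that continues the previous one; `merge_wordpieces` is a hypothetical helper, not a library function.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into words; '##' marks a continuation piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return words

merged = merge_wordpieces(["Mega", "##tron", "and", "Decept", "##icons"])
print(merged)
```

Rare words such as "Megatron" are split because they are not in the tokenizer's vocabulary as a single unit, which is also why the NER model above assigned different labels to the two halves.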

## Question Answering
In question answering, we provide the model with a passage of text called the context, along with a question whose answer we'd like to extract. The model then returns the span of text corresponding to the answer.

In [20]:
reader = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [21]:
outputs = reader(question="What does the customer want?", context=input_text)
pd.DataFrame([outputs])

| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 351 | 374 | an exchange of Megatron |
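The `start` and `end` values in the output are character offsets into the context, with `end` exclusive: 374 - 351 = 23, which is the length of the returned answer. The sketch below shows the same slicing on a shortened context; the offsets are found with `str.find` rather than taken from the model output, and `extract_span` is an illustrative helper.

```python
def extract_span(context, start, end):
    """Return the answer text given character offsets (end is exclusive)."""
    return context[start:end]

context = (
    "To resolve the issue, I demand an exchange of Megatron "
    "for the Optimus Prime figure I ordered."
)
answer = "an exchange of Megatron"

# Locate the span the same way the pipeline reports it: as character offsets.
start = context.find(answer)
end = start + len(answer)
print(start, end, extract_span(context, start, end))
```

This is what makes the task extractive: the model points at a span of the given text instead of generating free-form text.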


## Summarization
The goal of text summarization is to take a long text as input and generate a short version with all the relevant facts.

In [22]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [23]:
outputs = summarizer(input_text, max_length=100, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that he had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I hope you can understand my dilemma.


## Translation

In [24]:
! pip install transformers[sentencepiece] -q

In [25]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")



In [26]:
outputs = translator(input_text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


## Text Generation
Let's say you would like to be able to provide faster replies to customer feedback by having access to an autocomplete function.

In [29]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [28]:
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = input_text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

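NameError: ignored

The recorded run failed with a `NameError`, most likely because this cell (execution count 28) was run before the cell that creates `generator` (execution count 29). Running the cells in the order shown defines `generator` first.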
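Text generation works autoregressively: the model predicts the next token, appends it to the text, and repeats. The sketch below shows that loop with a toy bigram counter standing in for the language model. The tiny corpus, the `train_bigram` and `generate` helpers, and the greedy choice of the most frequent next word are all illustrative assumptions; GPT-2 conditions on the whole prefix with a neural network and is often sampled from rather than decoded greedily.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which word follows which; a toy stand-in for a language model."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(words[-1])
        if not followers:
            break  # the model has never seen anything follow this word
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

corpus = "I am sorry to hear that . I am sorry to hear that your order was mixed up ."
model = train_bigram(corpus)
completion = generate(model, "I am", max_new_tokens=4)
print(completion)
```

The `max_new_tokens` argument here plays a similar role to the length limit passed to the pipeline: it bounds how far the loop continues.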

## Main Challenges with Transformers

1. **Resource Intensiveness**: Training and using state-of-the-art transformer models often demand significant computational resources, including powerful GPUs and TPUs, which can be expensive and inaccessible for some.

2. **Model Size and Deployment**: Many transformer models, like GPT-3 and BERT, have millions or even billions of parameters, making them challenging to deploy in resource-constrained environments, such as edge devices or mobile applications.

3. **Fine-Tuning Complexity**: While pre-trained transformers are powerful, fine-tuning them for specific tasks requires expertise in selecting the right architecture, hyperparameters, and training data, which can be time-consuming.

4. **Data Privacy**: Large transformer models can inadvertently memorize sensitive information from the training data, raising concerns about data privacy and security.

5. **Bias and Fairness**: Transformers can inherit biases from the data they are trained on, leading to biased or unfair predictions. Ensuring fairness and mitigating biases is an ongoing challenge.

6. **Interpretability**: Transformers are often considered "black-box" models, making it challenging to understand why they make specific predictions, especially for complex NLP tasks.

7. **Scalability**: Scaling transformers to handle extremely long sequences or massive datasets presents challenges in terms of memory management and parallelization.

8. **Low-Resource Languages**: Adapting pre-trained models to low-resource languages remains a challenge due to the lack of sufficient training data.

9. **Real-time Inference**: Transformers may not meet low-latency requirements for real-time applications like chatbots without significant optimization.

10. **Environmental Impact**: The energy consumption and environmental impact of training large transformer models are concerning, considering their computational requirements.

11. **Generalization**: While transformers excel in many tasks, they may struggle to generalize effectively in certain scenarios, posing challenges in ensuring robust performance.

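12. **Dynamic Context Handling**: Transformers attend over a fixed-size context window and keep no state between calls, so tasks whose relevant context keeps growing, such as long conversations or streaming input, need extra machinery like truncation or chunking.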

13. **Lack of Labeled Data**: Despite advances in unsupervised pre-training, labeled data remains crucial for fine-tuning, and obtaining high-quality labeled data can be challenging.

Addressing these challenges is crucial for further advancements in the field of natural language processing and machine learning. Researchers and practitioners are actively working on solutions to make transformers more efficient, fair, interpretable, and applicable in a wider range of real-world scenarios.
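Two of the challenges above, model size and long sequences, come down to simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts only the memory for the weights (no activations, gradients, or optimizer state), uses 1 GB = 1e9 bytes, and assumes standard full self-attention in which every token attends to every other token.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=4):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

def attention_score_count(sequence_length, num_heads=1):
    """Attention scores per layer: every token attends to every token."""
    return num_heads * sequence_length ** 2

# A hypothetical model with one billion parameters.
fp32 = weight_memory_gb(1_000_000_000)       # 32-bit floats, 4 bytes each
fp16 = weight_memory_gb(1_000_000_000, 2)    # 16-bit floats, 2 bytes each

# Doubling the sequence length quadruples the number of attention scores.
ratio = attention_score_count(2048) / attention_score_count(1024)
print(fp32, fp16, ratio)
```

The quadratic growth in attention scores is the main reason very long sequences are costly, and halving the bytes per parameter is one of the simpler ways to shrink a model for deployment.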