# <center> <font size = 24 color = 'steelblue'>**Transformers - An introduction**

## <center> <img alt="transformer-timeline" caption="The transformers timeline" src="https://drive.google.com/uc?export=view&id=1eucC4AUfAJW-rpprZ61i7U0FP9uXnAVF" id="transformer-timeline"/>

<div class="alert alert-block alert-info">
    
<font size = 4>

**By the end of this notebook you will be able to:**

- Understand the basics of transformer models.
- Explore how transfer learning is applied in NLP.
- Apply transformer models to common NLP tasks.
    
</div>

# <a id= 'p0'>
<font size = 4>
    
**Table of Contents:**<br>
[1. History](#p1)<br>
[2. Encoder-decoder framework](#p2)<br>
[3. Attention mechanisms](#p3)<br>
[4. Transfer learning in NLP](#p4)<br>
[5. Transformer applications](#p5)<br>
>[5.1 Text classification](#p5.1)<br>
>[5.2 Named entity recognition](#p5.2)<br>
>[5.3 Question answering](#p5.3)<br>
>[5.4 Summarization](#p5.4)<br>
>[5.5 Translation](#p5.5)<br>
>[5.6 Text generation](#p5.6)<br>

## <a id = 'p1'>
    
<font size = 10 color = 'midnightblue'> **History**

<div class="alert alert-block alert-success">

## **Transformer Emergence (2017):**

In 2017, Google researchers introduced the Transformer, a neural network architecture for sequence modeling that outperformed recurrent neural networks (RNNs) on machine translation tasks.

## **ULMFiT and Transfer Learning (Parallel Advancement):**

<font size = 4> In parallel, the ULMFiT transfer learning method showed that training LSTM networks on a very large and diverse corpus could produce state-of-the-art text classifiers with little labeled data.

## **Transformers' Evolution - GPT and BERT:**

<font size = 4> The success of the Transformer architecture laid the foundation for two influential models: the Generative Pretrained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT).

## **Unsupervised Learning Breakthrough:**

<font size = 4> By combining the Transformer architecture with unsupervised learning, GPT and BERT removed the need to train task-specific architectures from scratch and set new benchmarks in Natural Language Processing (NLP).

## **Transformer Model Proliferation:**

<font size = 4> Since GPT and BERT, a large number of transformer models have appeared, repeatedly raising the bar on NLP benchmarks.

## <center> <img alt="transformer-timeline" caption="The transformers timeline" src="https://drive.google.com/uc?export=view&id=1D6oe9BD6DofpmfqgudxRAYNxGnjmePOZ"  id="transformer-timeline"/>

[top](#p0)

## <a id = 'p2'>
<font size = 10 color = 'midnightblue'> **The Encoder-Decoder Framework**

<div class="alert alert-block alert-success">

## <font size = 6> To understand what is distinctive about transformers, we first need to explain:

<font size = 4>
    
1. The encoder-decoder framework <br>
2. Attention mechanisms <br>
3. Transfer learning

<div class="alert alert-block alert-success">
<font size = 4>
    
- Before the advent of transformers, recurrent architectures such as LSTMs were the state of the art in NLP.<br>
- LSTMs, characterized by a feedback loop in network connections, excelled in modeling sequential data, such as text.

## <font size = 5>Check the figure below:

<center> <img alt="rnn" caption="Unrolling an RNN in time." src="https://drive.google.com/uc?export=view&id=1B3Wbx0C2ztqOrbPzbojY8VswgZsz-xgC" id="rnn" width = 1200>


<div class="alert alert-block alert-success">

<font size=4>

- On the left side of the figure above, an RNN receives an input (a word or a character) and feeds it through the network.<br>
- As a result of processing the input, the RNN outputs a vector called the hidden state.<br>
- At the same time, the model feeds some information back to itself through the feedback loop, which it can use in the next step.<br>
- When the loop is **"unrolled"**, as on the right side of the figure, it becomes clear that the RNN passes information about its state at each step to the next operation in the sequence.<br>
- This recurrence lets the RNN keep track of information from previous steps and use it when making its output predictions.
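
The recurrence described above can be sketched in a few lines of plain Python. This is only a minimal illustration under simplifying assumptions: the hidden state is a single number rather than a vector, and the weights `w_x`, `w_h`, and `b` are arbitrary values picked for the example, not learned parameters.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Unroll the loop over the sequence and collect the hidden state at every step."""
    h = 0.0  # initial hidden state
    hidden_states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)  # the state is fed back into the next step
        hidden_states.append(h)
    return hidden_states


# Only the first input is non-zero (illustrative values).
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Even though the second and third inputs are zero, the later hidden states are still non-zero, because each step receives the state produced by the step before it.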

<div class="alert alert-block alert-success">
    
<font size = 4>

- Recurrent Neural Networks (RNNs) have been extensively applied in NLP, speech processing, and time series tasks.
- RNNs played a crucial role in developing machine translation systems, particularly in tasks involving mapping sequences of words from one language to another.
- For sequence-to-sequence tasks, like machine translation, the encoder-decoder architecture is commonly employed. The encoder encodes input information into a numerical representation (last hidden state), which is then utilized by the decoder to generate the output sequence.
- Encoder and decoder components can adopt various neural network architectures capable of modeling sequences.
- The figure below shows a pair of RNNs: the encoder encodes the English sentence "Transformers are great!" into a hidden state vector, which the decoder then decodes to produce the German translation "Transformer sind grossartig!". The input words are fed in one at a time, and the output words are generated one at a time.

# <center> <img alt="enc-dec" caption="Encoder-decoder architecture with a pair of RNNs. In general, there are many more recurrent layers than those shown." src="https://drive.google.com/uc?export=view&id=1gd2Y9r5Up3fPgk3o6pVJqCIyzRHqidOs" id="enc-dec"/>

<div class="alert alert-block alert-success">
<font size = 4>

- Although elegant in its simplicity, this architecture has a weakness: the final hidden state of the encoder is an information bottleneck. It has to represent the meaning of the whole input sequence, because it is all the decoder has access to when generating the output.
- This is especially challenging for long sequences, where information from the start of the sequence can be lost while everything is compressed into a single, fixed representation.
- One way out of this bottleneck is to give the decoder access to all of the encoder's hidden states.
- The general mechanism for doing this is called attention, and it is a key component of many modern neural network architectures.
- Understanding how attention was developed for RNNs puts us in a good position to understand one of the main building blocks of the Transformer architecture.

[top](#p0)

# <a id = 'p3'>
## <font size = 10 color = 'midnightblue'> **Attention mechanisms**

<div class="alert alert-block alert-success">
<font size = 4>

- With attention, the encoder no longer produces a single hidden state for the whole input sequence. Instead, it outputs a hidden state at each step, and the decoder can access all of them.
- Using every state at once would give the decoder a huge input, so it needs a way to prioritize which encoder states to use.
- Attention serves this purpose: it lets the decoder assign a different weight, or "attention," to each encoder state at every decoding timestep.
- The figure below shows the role of attention when predicting the third token in the output sequence.

# <center> <img alt="enc-dec-attn" caption="Encoder-decoder architecture with an attention mechanism for a pair of RNNs." src="https://drive.google.com/uc?export=view&id=1wD64RcdtBR_gFAdk4ppOx5ulDjQX0D39" id="enc-dec-attn"/>
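
To make the weighting concrete, here is a small sketch of a single decoding step. It assumes plain dot-product scoring between the decoder state and each encoder state, which is a simplification (the RNN attention models of the time scored states with a small learned network), and all vectors are made-up values for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return the attention weights and the weighted average of the encoder states."""
    # Assumption: relevance is scored with a dot product.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


# One hidden state per input token (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The first encoder state has no overlap with the decoder state, so it receives the smallest weight, while the other two receive equal, larger weights. The context vector is what the decoder would consume at this step instead of a single final hidden state.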

<div class="alert alert-block alert-success">
<font size =4>

- Attention-based models learn which input tokens are most relevant at each timestep, which lets them learn nontrivial alignments between the words in a generated translation and those in the source sentence.
- The figure below visualizes the attention weights of an English-to-French translation model, where each pixel denotes a weight. It shows how the decoder correctly aligns words such as "zone" and "Area," even though they are ordered differently in the two languages.


# <center><img alt="attention-alignment" width="500" caption="RNN encoder-decoder alignment of words in English and the generated translation in French (courtesy of Dzmitry Bahdanau)." src="https://drive.google.com/uc?export=view&id=1kS7-UWxrGPu5HYgEYyRzqXbTOUAelWvi" id="attention-alignment"/>

<div class="alert alert-block alert-success">
    
<font size =4>

- Although attention improved translations, using recurrent models for the encoder and decoder still had a major drawback: the computations are inherently sequential and cannot be parallelized across the input sequence.
- The Transformer introduced a new modeling paradigm that discards recurrence entirely and relies instead on a special form of attention called self-attention.
- In self-attention, attention operates on all the states within the same layer of the neural network.
- The figure below illustrates this shift: both the encoder and the decoder have their own self-attention mechanisms, whose outputs are fed to feed-forward neural networks (FF NNs). This architecture can be trained much faster than recurrent models and paved the way for many recent breakthroughs in NLP.


<center> <img alt="transformer-self-attn" caption="Encoder-decoder architecture of the original Transformer." src="https://drive.google.com/uc?export=view&id=1cBMFaVBjWKA--Tl8HykxPFnbql_XmEd9" id="transformer-self-attn" width = 900>
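
A rough sketch of self-attention over a short sequence follows. It assumes scaled dot-product scoring and, to stay short, uses each state directly as query, key, and value. A real Transformer layer applies learned projections first and runs several attention heads, so treat this as an illustration of the idea rather than the actual layer.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Let every position attend to all positions in the same layer."""
    dim = len(states[0])
    outputs = []
    for query in states:
        # Score this position against every position, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in states]
        weights = softmax(scores)
        # The new representation is a weighted average of all states.
        outputs.append([sum(w * value[i] for w, value in zip(weights, states)) for i in range(dim)])
    return outputs


# Three token states (hypothetical values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(states)
print(outputs)
```

Each iteration of the outer loop depends only on the input states, not on the result for the previous position. That independence is what allows the computation to be parallelized across the sequence, unlike the step-by-step recurrence of an RNN.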

[top](#p0)

# <a id = 'p4'>
## <font size = 10 color = 'midnightblue'> **Transfer learning in NLP**

<div class="alert alert-block alert-success">
    
- <font size = 4>Transfer learning is common practice in computer vision: a convolutional neural network such as ResNet is first trained on one task and then adapted to, or fine-tuned on, a new task.

- <font size = 4>Architecturally, the model is split into a body and a head. The body learns broad features of the source domain during training, while the head is a task-specific network.

- <font size = 4>The weights learned by the body on the original task are used to initialize a new model for the new task. Compared with traditional supervised learning, this approach typically produces high-quality models that can be trained much more efficiently across a variety of tasks, and with much less labeled data.

- <font size = 4>The figure below compares traditional supervised learning with the transfer learning approach.

<center> <img alt="transfer-learning" caption="Comparison of traditional supervised learning (left) and transfer learning (right)." src="https://drive.google.com/uc?export=view&id=1gCyUlp9LppYg3F-xoIbTCaImPCN5cW7w" id="transfer-learning" width = 900>  
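
The body/head split can be illustrated with a toy example in plain Python. Everything here is hypothetical: `BODY_WEIGHTS` stands in for weights learned on a source task, the head is a tiny logistic-regression layer, and the four labeled examples are invented. The sketch also keeps the body frozen, which is only one variant; fine-tuning often updates the body's weights as well.

```python
import math

# Hypothetical "pretrained" body weights; they are never updated below.
BODY_WEIGHTS = [[0.5, -0.2], [0.1, 0.8]]


def body(x):
    """Frozen feature extractor: a linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BODY_WEIGHTS]


def head(features, w, b):
    """Task-specific head: logistic regression on the body's features."""
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=200, lr=0.5):
    """Train only the head's parameters with stochastic gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            features = body(x)
            error = head(features, w, b) - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * error * fi for wi, fi in zip(w, features)]
            b -= lr * error
    return w, b


# A small labeled dataset for the new task (invented values).
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 0.5], 1), ([0.2, 1.5], 0)]
w, b = train_head(data)
preds = [round(head(body(x), w, b)) for x, _ in data]
print(preds)
```

Only the three head parameters are learned from the handful of labeled examples; the features come for free from the body, which is the intuition behind needing less labeled data.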

[top](#p0)

# <a id = 'p5'>
## <font size = 10 color = 'midnightblue'> **Transformer Applications**

In [3]:
text1 = '''Extremely disappointed with my recent iPhone purchase from Apple. The device constantly lags, and the battery life is abysmal,
barely lasting through the day. Despite the hefty price tag, the performance is far from satisfactory. Customer support has been unhelpful,
providing no viable solutions to address these persistant issues. This experience has left me regretting my decision to choose Apple,
and I expected much better from a company known for its premium products.'''

In [4]:
text2 = '''I recently purchased an iPhone from Apple, and it has been an absolute delight! The device runs smoothly, and the battery life is impressive, easily lasting throughout the day.
The price, though high, is justified by the excellent performance and top-notch customer support. I am thoroughly satisfied with my decision to choose Apple, and it reaffirms their reputation
for delivering premium products. Highly recommended for anyone seeking a reliable and high-performance smartphone'''

###### <a id = 'p5.1'>
###### <font size = 6 color = 'pwdrblue'> **Text Classification**

In [5]:
from transformers import pipeline
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [6]:
import pandas as pd

outputs1 = classifier(text1)
pd.DataFrame(outputs1)

# Each result contains the predicted label and the model's confidence score for it

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.999741 |


In [7]:
outputs2 = classifier(text2)
pd.DataFrame(outputs2)

| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.999818 |


[top](#p0)

###### <a id = 'p5.2'>
###### <font size = 6 color = 'pwdrblue'> **Named entity recognition**

In [8]:
len(text1)

474

In [10]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text1)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | MISC | 0.992611 | iPhone | 38 | 44 |
| 1 | ORG | 0.99668 | Apple | 59 | 64 |
| 2 | ORG | 0.996425 | Apple | 394 | 399 |


In [21]:
outp = ner_tagger("Barack Obama was born in Hawaii. He was elected president in 2008 of the United States. He was against Microsoft and Apple. I loved to eat apples. Apple Orchards he was very find of.")

In [22]:
outp

[{'entity_group': 'PER',
  'score': 0.99922657,
  'word': 'Barack Obama',
  'start': 0,
  'end': 12},
 {'entity_group': 'LOC',
  'score': 0.99950397,
  'word': 'Hawaii',
  'start': 25,
  'end': 31},
 {'entity_group': 'LOC',
  'score': 0.9994687,
  'word': 'United States',
  'start': 73,
  'end': 86},
 {'entity_group': 'ORG',
  'score': 0.9994672,
  'word': 'Microsoft',
  'start': 103,
  'end': 112},
 {'entity_group': 'ORG',
  'score': 0.9986312,
  'word': 'Apple',
  'start': 117,
  'end': 122},
 {'entity_group': 'ORG',
  'score': 0.4024242,
  'word': 'Apple',
  'start': 147,
  'end': 152}]
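
The `start` and `end` fields are character offsets into the input string, and `score` can be used to filter out uncertain predictions. The sketch below works on the result shown above, copied in as a literal so that it runs without loading a model; the sentence is reproduced verbatim from the cell above, and the helper `confident_entities` and its 0.9 threshold are illustrative choices, not part of the pipeline API.

```python
# The input sentence, exactly as passed to the pipeline above.
sentence = (
    "Barack Obama was born in Hawaii. He was elected president in 2008 of the "
    "United States. He was against Microsoft and Apple. I loved to eat apples. "
    "Apple Orchards he was very find of."
)

# The entities returned above, copied as a literal.
entities = [
    {"entity_group": "PER", "score": 0.99922657, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.99950397, "word": "Hawaii", "start": 25, "end": 31},
    {"entity_group": "LOC", "score": 0.9994687, "word": "United States", "start": 73, "end": 86},
    {"entity_group": "ORG", "score": 0.9994672, "word": "Microsoft", "start": 103, "end": 112},
    {"entity_group": "ORG", "score": 0.9986312, "word": "Apple", "start": 117, "end": 122},
    {"entity_group": "ORG", "score": 0.4024242, "word": "Apple", "start": 147, "end": 152},
]


def confident_entities(entities, threshold=0.9):
    """Keep only entities whose score is at least `threshold` (hypothetical helper)."""
    return [e for e in entities if e["score"] >= threshold]


# Slicing the input with start/end recovers each entity's text.
spans = [sentence[e["start"]:e["end"]] for e in entities]

confident = confident_entities(entities)
print([(e["word"], e["entity_group"]) for e in confident])
```

With this threshold, the low-scoring "Apple" from "Apple Orchards" is dropped, while the "Apple" mentioned alongside Microsoft is kept.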

[top](#p0)

###### <a id = 'p5.3'>
###### <font size = 6 color = 'pwdrblue'> **Question answering**

In [23]:
reader = pipeline("question-answering")
question = "What does the customer want?"
text = "I want to book a flight to Paris."
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.454185 | 10 | 32 | book a flight to Paris |


In [24]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example     of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

In [25]:
result = reader(question="What is a good example of a question answering dataset?", context=context)

In [26]:
result

{'score': 0.5152310729026794,
 'start': 151,
 'end': 164,
 'answer': 'SQuAD dataset'}

[top](#p0)

###### <a id = 'p5.4'>
###### <font size = 6 color = 'pwdrblue'>  **Summarization**

In [27]:
text

'I want to book a flight to Paris.'

In [None]:
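summarizer = pipeline("summarization")
# Note: `text` is a single sentence, so there is very little to summarize.
# max_length=50 is also below the default min_length of 56 reported in the warning below,
# and the model pads its output with invented sentences.
outputs = summarizer(text, max_length=50, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])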

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
Your min_length=56 must be inferior than your max_length=50.
Your max_length is set to 50, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


 I want to book a flight to Paris. I'm going to go to Paris, Paris, France. I've never been to Paris before. I want a Paris trip. I want you to go there. I love Paris. I


[top](#p0)

###### <a id = 'p5.5'>
###### <font size = 6 color = 'pwdrblue'>  **Translation**

In [None]:
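translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")
# Note: min_length=100 forces at least 100 generated tokens, far more than this short
# sentence needs, which is why the translation below is followed by repeated filler.
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])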

Device set to use cpu


- Ich will einen Flug nach Paris buchen. - Ja. - Ja. - Ja. Ich will einen Flug nach Paris buchen. - Ja, ja, ja, ja, ja, ja, ja, ja, ja. Ich will einen Flug nach Paris buchen. - Ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja, ja.


[top](#p0)

###### <a id = 'p5.6'>
###### <font size = 6 color = 'pwdrblue'>  **Text generation**

In [28]:
from transformers import set_seed
set_seed(78) # Set the seed to get reproducible results

In [29]:
text

'I want to book a flight to Paris.'

In [30]:
generator = pipeline("text-generation")
response = "Dear Patron, Thanks for writing in! I am sorry to hear your experience with us."
prompt = text + "\n\nCustomer service response:\n" + response
# max_length is the total length in tokens: the prompt plus the generated continuation
outputs = generator(prompt, max_length=150)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I want to book a flight to Paris.

Customer service response:
Dear Patron, Thanks for writing in! I am sorry to hear your experience with us.

In regards to the issue of security, the French National Police made clear that in order to act against those responsible for the security breach of the European headquarters of the International Monetary Fund (IMF), France would need to step in with the Security Council. We want to make sure that the international community has access to all relevant technical information and we can prevent possible incidents.

In response to the issue of security, the Royal Norwegian Airforce has started an official investigation regarding the case of the flight of this aircraft from Paris, and its captain was contacted on 9 February,


[top](#p0)