# Transformer Pipelines
## This notebook outlines the concepts behind using ready-made Pipelines from the Transformers library across all major NLP tasks

## NLP Tasks: 

- ***Sentence Classification _(Sentiment Analysis)_***: Indicate whether the overall sentence is positive or negative, i.e. a *binary classification task* (the setting of *logistic regression*).
- ***Token Classification (Named Entity Recognition, Part-of-Speech tagging)***: Assign a label to each sub-entity _(*token*)_ in the input, i.e. a classification task per token.
- ***Question-Answering***: Given a tuple (`question`, `context`), the model should find the span of text in `context` that answers the `question`.
- ***Mask-Filling***: Suggest possible word(s) to fill the masked position, given the surrounding `context`.
- ***Summarization***: Summarize the `input` article into a shorter text.
- ***Translation***: Translate the input from one language to another.
- ***Feature Extraction***: Maps the input to a higher, multi-dimensional space learned from the data.

Pipelines encapsulate the three steps common to every NLP task:
 
 1. *Tokenization*: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
 2. *Inference*: Map every token into a more meaningful representation.
 3. *Decoding*: Use the above representation to generate and/or extract the final output for the underlying task.
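
To make the three steps concrete, here is a toy sketch in plain Python. It does not use the Transformers library: `TOY_WEIGHTS`, `tokenize`, `infer`, `decode` and `toy_pipeline` are hypothetical names, and the hand-written word weights stand in for what a real model would learn from data.

```python
# Hypothetical hand-written weights standing in for a trained model.
TOY_WEIGHTS = {"beautiful": 2.0, "love": 1.5, "bad": -1.5, "awful": -2.0}

def tokenize(text):
    """Step 1: split the raw input into lowercase word tokens."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]

def infer(tokens):
    """Step 2: map the tokens to a numeric representation (a single score here)."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)

def decode(score):
    """Step 3: turn the representation into the output of the task."""
    return {"label": "POSITIVE" if score >= 0 else "NEGATIVE"}

def toy_pipeline(text):
    """Chain the three steps, the way a pipeline object does internally."""
    return decode(infer(tokenize(text)))

result = toy_pipeline("What a beautiful sight it was !")
print(result)
```

A real pipeline follows the same shape, but step 1 uses a trained sub-word tokenizer and step 2 a neural network.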

The overall API is exposed to the end-user through the `pipeline()` method with the following 
structure:

```python
from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using a custom model and tokenizer, both given as strings
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")
```

### Install the transformers library

In [1]:
!pip install -q transformers


### Import the libraries

In [2]:
from __future__ import print_function
import ipywidgets as widgets
from transformers import pipeline

# 1. Sentence Classification - Sentiment Analysis

- Create a Pipeline 
- Call the Pipeline object with a test text

### Create a pipeline object by passing sentiment-analysis as the argument

In [3]:
sa = pipeline('sentiment-analysis')

### Call the object with a test sentence

In [4]:
sa('What a beautiful sight it was !')

[{'label': 'POSITIVE', 'score': 0.9998520016670227}]

In [6]:
sa('This movie is aweful and bad but i love the actress!')

[{'label': 'POSITIVE', 'score': 0.9991771578788757}]
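
The `score` field is best read as a probability. As a rough sketch, assuming the classifier emits one raw logit per label and the pipeline normalizes them with a softmax, the conversion looks like this. The logits below are made-up values for illustration, not taken from the runs above.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two labels (assumed values, for illustration only).
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.6]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

Because only the winning label is reported, a mixed sentence such as the second example can still come back with a score close to 1.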

# 2. Token Classification - Named Entity Recognition
- Create a Pipeline 
- Call the Pipeline object with a test text

In [7]:
ner = pipeline('ner')

### Call the pipeline object with the test text

In [10]:
ner('I love President Barrack Obama speeches in the USA.')

[{'end': 21,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.9989446997642517,
  'start': 17,
  'word': 'Barr'},
 {'end': 24,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9969192743301392,
  'start': 21,
  'word': '##ack'},
 {'end': 30,
  'entity': 'I-PER',
  'index': 6,
  'score': 0.9993577599525452,
  'start': 25,
  'word': 'Obama'},
 {'end': 50,
  'entity': 'I-LOC',
  'index': 10,
  'score': 0.9996347427368164,
  'start': 47,
  'word': 'USA'}]
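
Note that the name is split into word pieces (`Barr`, `##ack`, `Obama`), each labeled separately. The sketch below merges adjacent pieces of the same entity type back into whole entities using the `start`/`end` character offsets from the output above. `group_entities` is a hypothetical helper written for this illustration; depending on the library version, the pipeline may also offer built-in grouping (for example an `aggregation_strategy` argument).

```python
def group_entities(text, tokens):
    """Merge adjacent tokens of the same entity type into one span."""
    groups = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        label = tok["entity"].split("-")[-1]  # 'I-PER' -> 'PER'
        last = groups[-1] if groups else None
        # Adjacent: same type and at most one character (a space) apart.
        if last and last["label"] == label and tok["start"] <= last["end"] + 1:
            last["end"] = tok["end"]
        else:
            groups.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for group in groups:
        group["text"] = text[group["start"]:group["end"]]
    return groups

text = "I love President Barrack Obama speeches in the USA."
# Offsets and labels copied from the pipeline output above (scores omitted).
tokens = [
    {"entity": "I-PER", "start": 17, "end": 21, "word": "Barr"},
    {"entity": "I-PER", "start": 21, "end": 24, "word": "##ack"},
    {"entity": "I-PER", "start": 25, "end": 30, "word": "Obama"},
    {"entity": "I-LOC", "start": 47, "end": 50, "word": "USA"},
]

entities = group_entities(text, tokens)
for ent in entities:
    print(ent["label"], ent["text"])
```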

# 3. Question Answering
- Create a Pipeline 
- Call the Pipeline object with a test context passage and a question to get the answer span in the passage

In [11]:
qa = pipeline('question-answering')

In [12]:
qa(context='This movie was aweful and bad but somehow i liked the character of the actress.', question='What did I like?')

{'answer': 'the character of the actress',
 'end': 78,
 'score': 0.5428628325462341,
 'start': 50}

In [13]:
qa(context='This movie was aweful and bad but somehow i liked the character of the actress.', question='Was the movie good?')

{'answer': 'This movie was aweful and bad',
 'end': 29,
 'score': 0.27749380469322205,
 'start': 0}
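
The `start` and `end` values are character offsets into `context`, so the answer can be recovered by slicing. The sketch below checks this on the two results above and shows one way to reject low-confidence answers; `answer_or_none` and its `min_score` threshold of 0.5 are assumptions made for illustration, not part of the library.

```python
context = "This movie was aweful and bad but somehow i liked the character of the actress."

# Results copied from the two calls above.
results = [
    {"answer": "the character of the actress", "start": 50, "end": 78, "score": 0.5428628325462341},
    {"answer": "This movie was aweful and bad", "start": 0, "end": 29, "score": 0.27749380469322205},
]

def span_matches(context, result):
    """True if slicing the context by the offsets reproduces the answer."""
    return context[result["start"]:result["end"]] == result["answer"]

def answer_or_none(result, min_score=0.5):
    """Return the answer only if the model is confident enough (assumed threshold)."""
    return result["answer"] if result["score"] >= min_score else None

for res in results:
    print(span_matches(context, res), answer_or_none(res))
```

The second question is a yes/no question, which an extractive model can only answer by pointing at a span; its low score is a useful signal in that case.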

# 4. Text Generation - Mask Filling
- Create a Pipeline 
- Call the Pipeline object with a test text that contains a masked token

In [14]:
mask_fill = pipeline('fill-mask')

In [17]:
mask_fill('This movie was aweful and bad but somehow i liked the character of the ' + mask_fill.tokenizer.mask_token)

[{'score': 0.11969155818223953,
  'sequence': 'This movie was aweful and bad but somehow i liked the character of the protagonist',
  'token': 24587,
  'token_str': ' protagonist'},
 {'score': 0.09252098947763443,
  'sequence': 'This movie was aweful and bad but somehow i liked the character of the character',
  'token': 2048,
  'token_str': ' character'},
 {'score': 0.07386377453804016,
  'sequence': 'This movie was aweful and bad but somehow i liked the character of the movie',
  'token': 1569,
  'token_str': ' movie'},
 {'score': 0.06450572609901428,
  'sequence': 'This movie was aweful and bad but somehow i liked the character of the villain',
  'token': 17031,
  'token_str': ' villain'},
 {'score': 0.041902683675289154,
  'sequence': 'This movie was aweful and bad but somehow i liked the character of the hero',
  'token': 6132,
  'token_str': ' hero'}]

In [18]:
mask_fill('This movie was aweful and bad but somehow i liked the ' + mask_fill.tokenizer.mask_token + ' of the actress')

[{'score': 0.13578923046588898,
  'sequence': 'This movie was aweful and bad but somehow i liked the character of the actress',
  'token': 2048,
  'token_str': ' character'},
 {'score': 0.08930227905511856,
  'sequence': 'This movie was aweful and bad but somehow i liked the voice of the actress',
  'token': 2236,
  'token_str': ' voice'},
 {'score': 0.0792042538523674,
  'sequence': 'This movie was aweful and bad but somehow i liked the performance of the actress',
  'token': 819,
  'token_str': ' performance'},
 {'score': 0.042302455753088,
  'sequence': 'This movie was aweful and bad but somehow i liked the attitude of the actress',
  'token': 6784,
  'token_str': ' attitude'},
 {'score': 0.024731744080781937,
  'sequence': 'This movie was aweful and bad but somehow i liked the looks of the actress',
  'token': 1326,
  'token_str': ' looks'}]
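
Two details of this output are worth noting: `token_str` carries a leading space, and the scores are small because, as far as we can tell, they are probabilities over the whole vocabulary rather than over the five candidates shown. The sketch below post-processes the second result above; `top_candidates` is a hypothetical helper for this illustration.

```python
# Scores and tokens copied from the second fill-mask call above.
predictions = [
    {"score": 0.13578923046588898, "token_str": " character"},
    {"score": 0.08930227905511856, "token_str": " voice"},
    {"score": 0.0792042538523674, "token_str": " performance"},
    {"score": 0.042302455753088, "token_str": " attitude"},
    {"score": 0.024731744080781937, "token_str": " looks"},
]

def top_candidates(preds, k=3):
    """Return the k most likely words as (word, rounded score) pairs."""
    ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in ranked[:k]]

# Probability mass covered by the five suggestions; the rest is spread
# over all the other tokens in the vocabulary.
covered = sum(p["score"] for p in predictions)

print(top_candidates(predictions))
print(f"top-5 probability mass: {covered:.3f}")
```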

# 5. Summarization
- Create a Pipeline 
- Call the Pipeline object with a test text


In [19]:
summarizer = pipeline('summarization')

### Create some big text for summarization

In [20]:
TEXT_TO_SUMMARIZE = """ 
New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. 
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. 
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other. 
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage. 
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the 
2010 marriage license application, according to court documents. 
Prosecutors said the marriages were part of an immigration scam. 
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. 
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective 
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. 
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. 
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. 
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. 
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s 
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali. 
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. 
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

In [21]:
summarizer(TEXT_TO_SUMMARIZE)

[{'summary_text': ' Liana Barrientos pleaded not guilty to two counts of "offering a false instrument for filing in the first degree" She has been married to 10 men, nine of them between 1999 and 2002 . She is believed to still be married to four men, and at one time, she was married to eight men at once .'}]

# 6. Translation
- Create a Pipeline 
- Call the Pipeline object with a test text

Translation is currently supported by `T5` for the language mappings English-to-French (`translation_en_to_fr`), English-to-German (`translation_en_to_de`) and English-to-Romanian (`translation_en_to_ro`).
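
The task names follow a regular `translation_<source>_to_<target>` pattern. As a small sketch, the helper below builds the task string and rejects pairs outside the three listed above; `translation_task` and `SUPPORTED_TARGETS` are hypothetical names for this illustration.

```python
# Target languages listed above for the T5-backed translation tasks.
SUPPORTED_TARGETS = {"fr": "French", "de": "German", "ro": "Romanian"}

def translation_task(target, source="en"):
    """Build the pipeline task name for a supported language pair."""
    if source != "en" or target not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported language pair: {source} -> {target}")
    return f"translation_{source}_to_{target}"

print(translation_task("fr"))
print(translation_task("de"))
```

The returned string can then be passed as the first argument to `pipeline()`.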

### English to French

In [22]:
translator = pipeline('translation_en_to_fr')

In [23]:
translator("This movie was aweful and bad but somehow i liked the character of the actress.")

[{'translation_text': "Ce film a été merveilleux et mauvais mais j'ai d'une certaine façon aimé le personnage de l'actrice."}]

### English to German

In [24]:
translator = pipeline('translation_en_to_de')

In [25]:
translator("This movie was aweful and bad but somehow i liked the character of the actress.")

[{'translation_text': 'Dieser Film war wunderschön und schlecht, aber irgendwie liebte ich die Figur der Schauspielerin.'}]

# 7. Text Generation
- Create a Pipeline 
- Call the Pipeline object with a test text

Text generation is currently supported by GPT-2, OpenAI GPT, Transformer-XL, XLNet, CTRL and Reformer.

In [26]:
text_generator = pipeline("text-generation")

In [27]:
text_generator("This movie was aweful and bad but ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "This movie was aweful and bad but \xa0you definitely want to watch when the movies begin to die out. If not for a couple of days my heart would've dropped in disbelief. My blood pressure was off, with some pain coming out at"}]

# 8. Projection - Feature Extraction
- Create a Pipeline 
- Call the Pipeline object with a test text

In [28]:
nlp_features = pipeline('feature-extraction')

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [29]:
output = nlp_features('The movie was aweful and bad but I liked the character of the actress.')

In [30]:
import numpy as np
np.array(output).shape   # (Samples, Tokens, Vector Size)

(1, 18, 768)
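
The output holds one vector per token, so a common next step is to average the token vectors into a single sentence vector (mean pooling). The sketch below does this in plain Python on a tiny stand-in with shape (1, 3, 4) instead of (1, 18, 768); `shape` and `mean_pool` are hypothetical helpers, and mean pooling is one choice among several, not something the pipeline does for you.

```python
def shape(nested):
    """Return the dimensions of a regular nested list, e.g. (1, 3, 4)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

def mean_pool(sample):
    """Average the token vectors (Tokens x Vector Size) into one vector."""
    n_tokens = len(sample)
    return [sum(column) / n_tokens for column in zip(*sample)]

# Toy stand-in for the pipeline output: 1 sample, 3 tokens, vector size 4.
toy_output = [[
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]]

sentence_vectors = [mean_pool(sample) for sample in toy_output]
print(shape(toy_output))      # (1, 3, 4)
print(sentence_vectors[0])    # [2.0, 2.0, 2.0, 2.0]
```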

Use the custom-built cells below to try different tasks with your own test texts.

# Sentiment Analysis, NER, Mask fill

In [34]:
task = widgets.Dropdown(
    options=['sentiment-analysis', 'ner', 'fill_mask'],
    value='ner',
    description='Task:',
    disabled=False
)

# Named text_input rather than input to avoid shadowing the Python builtin.
text_input = widgets.Text(
    value='',
    placeholder='Enter something',
    description='Your input:',
    disabled=False
)

def forward(_):
    text = text_input.value
    if len(text) > 0:
        if task.value == 'ner':
            output = ner(text)
        elif task.value == 'sentiment-analysis':
            output = sa(text)
        else:
            # Append the model's own mask token if the input does not contain one.
            mask_token = mask_fill.tokenizer.mask_token
            if mask_token not in text:
                output = mask_fill(text + ' ' + mask_token)
            else:
                output = mask_fill(text)
        print(output)

text_input.on_submit(forward)
display(task, text_input)

Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')

Text(value='', description='Your input:', placeholder='Enter something')

[{'sequence': 'This movie is nice but!!!', 'score': 0.2501232624053955, 'token': 16506, 'token_str': '!!!'}, {'sequence': 'This movie is nice but????', 'score': 0.16517703235149384, 'token': 27282, 'token_str': '????'}, {'sequence': 'This movie is nice but ㅋㅋ', 'score': 0.07819453626871109, 'token': 49865, 'token_str': 'ㅋㅋ'}, {'sequence': 'This movie is nice but ____', 'score': 0.0729549303650856, 'token': 28784, 'token_str': '____'}, {'sequence': 'This movie is nice but!!!!', 'score': 0.06213485449552536, 'token': 32376, 'token_str': '!!!!'}]


# Question Answering

In [33]:
context = widgets.Textarea(
    value='Einstein is famous for the general theory of relativity',
    placeholder='Enter something',
    description='Context:',
    disabled=False
)

query = widgets.Text(
    value='Why is Einstein famous for ?',
    placeholder='Enter something',
    description='Question:',
    disabled=False
)

def forward(_):
    if len(context.value) > 0 and len(query.value) > 0: 
        output = qa(question=query.value, context=context.value)            
        print(output)

query.on_submit(forward)
display(context, query)

Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…

Text(value='Why is Einstein famous for ?', description='Question:', placeholder='Enter something')

{'score': 0.4037874639034271, 'start': 27, 'end': 55, 'answer': 'general theory of relativity'}
