Introduced in **transformers v2.3.0**, pipelines provide a high-level, easy-to-use API for running inference on a variety of downstream tasks. <br>
**Pipelines** encapsulate the overall process of every NLP task:
1. **Tokenization**: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
2. **Inference**: Map every token into a more meaningful representation.
3. **Decoding**: Use the above representation to generate and/or extract the final output for the underlying task.
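
The three stages can be sketched with a toy, dependency-free example. Everything in it is hypothetical (the tiny lexicon `TOY_WEIGHTS` and the functions `tokenize`, `infer`, `decode`, `toy_pipeline`); a real pipeline uses a trained tokenizer and a transformer model instead of a word list.

```python
# Toy illustration of the three pipeline stages. The lexicon and the scoring
# rule are made up for this sketch; they are not how transformers works.
TOY_WEIGHTS = {"wonderful": 1.0, "perfect": 0.5, "mean": -1.0, "ruining": -1.5}

def tokenize(text):
    """Stage 1: split the raw input into lowercase word tokens."""
    return [tok.strip(".,!?").lower() for tok in text.split()]

def infer(tokens):
    """Stage 2: map every token to a numeric representation (here, one weight)."""
    return [TOY_WEIGHTS.get(tok, 0.0) for tok in tokens]

def decode(scores):
    """Stage 3: turn the representation into the final task output."""
    total = sum(scores)
    return {"label": "POSITIVE" if total >= 0 else "NEGATIVE", "score": total}

def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does behind one call."""
    return decode(infer(tokenize(text)))

print(toy_pipeline("What a wonderful, perfect day!"))
```

The point of the sketch is only the shape of the flow: one call hides tokenization, model inference and decoding.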

In [1]:
!pip install transformers

In [2]:
import transformers
from transformers import pipeline
from ipywidgets import widgets

import numpy as np

# transformers : 4.5.1  |  np : 1.19.5
print(f'transformers : {transformers.__version__}  |  np : {np.__version__}')

transformers : 4.5.1  |  np : 1.19.5


## 1. Sentence Classification - Sentiment Analysis

In [3]:
%%time
sentiment_classifier = pipeline('sentiment-analysis')

CPU times: user 7.13 s, sys: 1.12 s, total: 8.25 s
Wall time: 16.2 s


In [4]:
%%time
sent_list = [
    "That's so mean to say that way to this wonderful child",
    "This is the perfect way of ruinning a movie with such a wonderful actors",
    "I'm not sure that is really wonderful idea.",
    "How come do you think that he is a kind boy?"
]

for sent in sent_list:
  print(f'>>> {sent} :', sentiment_classifier(sent))

>>> That's so mean to say that way to this wonderful child : [{'label': 'POSITIVE', 'score': 0.9997044801712036}]
>>> This is the perfect way of ruinning a movie with such a wonderful actors : [{'label': 'NEGATIVE', 'score': 0.9989961385726929}]
>>> I'm not sure that is really wonderful idea. : [{'label': 'NEGATIVE', 'score': 0.9893096685409546}]
>>> How come do you think that he is a kind boy? : [{'label': 'NEGATIVE', 'score': 0.9777455925941467}]
CPU times: user 418 ms, sys: 7.32 ms, total: 425 ms
Wall time: 358 ms
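
Each call returns a list holding one dict with a `label` and a `score`. As a minimal sketch, a hypothetical helper `signed_score` (not part of transformers) folds the two into a single value in [-1, 1], which makes results easy to sort. The sample results are copied from the output above, with scores rounded.

```python
def signed_score(result):
    """Map a sentiment-analysis result dict to a value in [-1, 1]."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

# Results copied from the cell output above (scores rounded to 4 decimals).
results = {
    "That's so mean to say that way to this wonderful child":
        [{"label": "POSITIVE", "score": 0.9997}],
    "I'm not sure that is really wonderful idea.":
        [{"label": "NEGATIVE", "score": 0.9893}],
}

# Sort from most negative to most positive.
ranked = sorted(results, key=lambda s: signed_score(results[s][0]))
for sent in ranked:
    print(f"{signed_score(results[sent][0]):+.4f}  {sent}")
```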


## 2. Token Classification - Named Entity Recognition

Here is an example of using a pipeline to do named entity recognition, which tries to classify each token into one of 9 classes:
- O, Outside of a named entity
- B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MISC, Miscellaneous entity
- B-PER, Beginning of a person’s name right after another person’s name
- I-PER, Person’s name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location

In [5]:
%%time
ner_model = pipeline('ner')

CPU times: user 33.8 s, sys: 5.12 s, total: 38.9 s
Wall time: 47.4 s


In [6]:
%%time
def pprint_result_list(model, sent):
  print(f'>>> {sent} :')
  for dic in model(sent):
    print('\t', dic)

sent_list = [
    'Hugging Face is a French company based in New York.',
    'My Google Workspace id is handed to Sangeun.',
    'Trump made something new for Cheoin-gu.',
    'Trump made something new for Cheoin-gu, Korea'
]

pprint_result_list(ner_model, sent_list[0])
print('\n')

ner_model.grouped_entities = True
for sent in sent_list:
  pprint_result_list(ner_model, sent)

>>> Hugging Face is a French company based in New York. :
	 {'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG', 'index': 1, 'start': 0, 'end': 2}
	 {'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG', 'index': 2, 'start': 2, 'end': 7}
	 {'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG', 'index': 3, 'start': 8, 'end': 12}
	 {'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC', 'index': 6, 'start': 18, 'end': 24}
	 {'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC', 'index': 10, 'start': 42, 'end': 45}
	 {'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC', 'index': 11, 'start': 46, 'end': 50}


>>> Hugging Face is a French company based in New York. :
	 {'entity_group': 'ORG', 'score': 0.9693402647972107, 'word': 'Hugging Face', 'start': 0, 'end': 12}
	 {'entity_group': 'MISC', 'score': 0.9981815814971924, 'word': 'French', 'start': 18, 'end': 24}
	 {'entity_group': 'LOC', 'score': 0.998212069272995, 'word': 'N
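
Grouping merges sub-word pieces (`##...`) and adjacent tokens of the same entity type into one span. The idea can be sketched in plain Python. The helper name `group_entities` and the choice to average the token scores are assumptions made for illustration, not necessarily the library's exact logic, although averaging does reproduce the grouped `ORG` score shown above. The token dicts are copied from the ungrouped output.

```python
def group_entities(tokens):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    for tok in tokens:
        ent_type = tok["entity"].split("-", 1)[-1]   # 'I-ORG' -> 'ORG'
        last = groups[-1] if groups else None
        # Extend the previous group when the type matches and the token
        # directly follows it, unless a B- tag marks the start of a new entity.
        if (last and last["entity_group"] == ent_type
                and tok["index"] == last["last_index"] + 1
                and not tok["entity"].startswith("B-")):
            piece = tok["word"]
            last["word"] += piece[2:] if piece.startswith("##") else " " + piece
            last["scores"].append(tok["score"])
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            groups.append({"entity_group": ent_type, "word": tok["word"],
                           "scores": [tok["score"]], "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    # Drop the bookkeeping fields and average the per-token scores.
    return [{"entity_group": g["entity_group"],
             "score": sum(g["scores"]) / len(g["scores"]),
             "word": g["word"], "start": g["start"], "end": g["end"]}
            for g in groups]

# Token-level results copied from the ungrouped output above.
tokens = [
    {"word": "Hu", "score": 0.9968873858451843, "entity": "I-ORG", "index": 1, "start": 0, "end": 2},
    {"word": "##gging", "score": 0.9329522848129272, "entity": "I-ORG", "index": 2, "start": 2, "end": 7},
    {"word": "Face", "score": 0.9781811237335205, "entity": "I-ORG", "index": 3, "start": 8, "end": 12},
    {"word": "French", "score": 0.9981815814971924, "entity": "I-MISC", "index": 6, "start": 18, "end": 24},
    {"word": "New", "score": 0.9987512826919556, "entity": "I-LOC", "index": 10, "start": 42, "end": 45},
    {"word": "York", "score": 0.9976728558540344, "entity": "I-LOC", "index": 11, "start": 46, "end": 50},
]

grouped = group_entities(tokens)
for group in grouped:
    print(group)
```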

## 3. Question Answering

In [7]:
%%time
qa_model = pipeline('question-answering')

CPU times: user 8.28 s, sys: 1.23 s, total: 9.51 s
Wall time: 18.3 s


In [8]:
%%time
# Several lines from https://en.wikipedia.org/wiki/Seoul
txt = """
    Seoul (/soʊl/, like soul; Korean: 서울 [sʌ.ul] (About this soundlisten); lit. 'Capital'), officially the Seoul Special City, is the capital[7] and largest metropolis of South Korea.[8] Seoul has a population of 9.7 million people, and forms the heart of the Seoul Capital Area with the surrounding Incheon metropolis and Gyeonggi province. Seoul was the world's 4th largest metropolitan economy in 2014 after Tokyo, New York City and Los Angeles.[9] In 2017, the cost of living in Seoul was ranked the 6th highest globally.[10][11]
    With technology hubs centered in Gangnam and Digital Media City,[12] the Seoul Capital Area is home to the headquarters of 14 Fortune Global 500 companies, including Samsung,[13] LG, and Hyundai. The metropolis exerts a major influence in regional affairs as one of the five leading hosts of global conferences; as of 2018 it was ranked 3rd in the world after Singapore (1st) and Brussels (2nd).[14] Seoul has hosted the 1986 Asian Games, 1988 Summer Olympics, 2002 FIFA World Cup (with Japan), and the 2010 G-20 Seoul summit.
"""

sent_list = [
    'Answer the population of the city',
    'How many humans are there?',
    
    'Any special event that happened in Seoul?',
    'Any special event that happened in Seoul'
]

for sent in sent_list:
  print(f'>>> {sent} :')
  print('\t', qa_model(context=txt, question=sent))

>>> Answer the population of the city :
	 {'score': 0.6106517314910889, 'start': 214, 'end': 225, 'answer': '9.7 million'}
>>> How many humans are there? :
	 {'score': 0.8688034415245056, 'start': 214, 'end': 225, 'answer': '9.7 million'}
>>> Any special event that happened in Seoul? :
	 {'score': 5.868524021934718e-05, 'start': 108, 'end': 126, 'answer': 'Seoul Special City'}
>>> Any special event that happened in Seoul :
	 {'score': 2.3620407318958314e-06, 'start': 1038, 'end': 1064, 'answer': 'the 2010 G-20 Seoul summit'}
CPU times: user 2.35 s, sys: 32.5 ms, total: 2.38 s
Wall time: 1.2 s
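
The last two answers come with scores close to zero, which suggests the model found no span it trusts for those questions. A minimal sketch of guarding against such answers follows. The helper `confident_answer` and the 0.5 threshold are arbitrary assumptions for illustration, not part of the pipeline API.

```python
def confident_answer(result, min_score=0.5):
    """Return the answer text, or None when the score is below min_score."""
    return result["answer"] if result["score"] >= min_score else None

# Results copied from the cell output above.
results = [
    {"score": 0.6106517314910889, "start": 214, "end": 225, "answer": "9.7 million"},
    {"score": 5.868524021934718e-05, "start": 108, "end": 126, "answer": "Seoul Special City"},
]

for res in results:
    print(confident_answer(res))
```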


## 4. Text Generation - Mask Filling

In [9]:
%%time
mlm_model = pipeline('fill-mask')

CPU times: user 9.99 s, sys: 1.36 s, total: 11.4 s
Wall time: 22.6 s


In [10]:
%%time
mask_token = mlm_model.tokenizer.mask_token
sent_list = [
    f'Hugging Face is a French company based in {mask_token}',
    f'Hugging Face is a {mask_token} company based in New York',
    f'I visited home to {mask_token} milk',
    f'I visited home to {mask_token} her'
]

for sent in sent_list:
  pprint_result_list(mlm_model, sent)

>>> Hugging Face is a French company based in <mask> :
	 {'sequence': 'Hugging Face is a French company based in Paris', 'score': 0.2775893807411194, 'token': 2201, 'token_str': ' Paris'}
	 {'sequence': 'Hugging Face is a French company based in Lyon', 'score': 0.14941272139549255, 'token': 12790, 'token_str': ' Lyon'}
	 {'sequence': 'Hugging Face is a French company based in Geneva', 'score': 0.045763980597257614, 'token': 11559, 'token_str': ' Geneva'}
	 {'sequence': 'Hugging Face is a French company based in France', 'score': 0.04576262831687927, 'token': 1470, 'token_str': ' France'}
	 {'sequence': 'Hugging Face is a French company based in Brussels', 'score': 0.04067569971084595, 'token': 6497, 'token_str': ' Brussels'}
>>> Hugging Face is a <mask> company based in New York :
	 {'sequence': 'Hugging Face is a cosmetics company based in New York', 'score': 0.08008444309234619, 'token': 23353, 'token_str': ' cosmetics'}
	 {'sequence': 'Hugging Face is a technology company based in N

## 5. Summarization
Summarization is currently supported by BART and T5.

In [11]:
%%time
summarizer = pipeline('summarization')

# function to print string in pretty way
def pprint_result_string(text):
  line_nth_char = 0
  for char in text:
    print(char, end='')
    line_nth_char += 1
    
    if line_nth_char > 100 and char in [' ', '.']:
      print()
      line_nth_char = 0
    elif char == '\n':
      line_nth_char = 0
  print('\n')

CPU times: user 38.9 s, sys: 4.6 s, total: 43.5 s
Wall time: 54.8 s
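
As an alternative to the hand-written `pprint_result_string`, the standard library's `textwrap` module can do the wrapping. This is a sketch, and it is not an exact replacement: `textwrap.fill` breaks a line before it exceeds the width, while the function above breaks at the first space or period after 100 characters. The name `wrap_text` is hypothetical.

```python
import textwrap

def wrap_text(text, width=100):
    """Wrap each newline-separated paragraph to at most `width` characters."""
    paragraphs = text.split("\n")
    return "\n".join(textwrap.fill(p, width=width) for p in paragraphs)

sample = "word " * 60   # 300 characters of dummy text
print(wrap_text(sample))
```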


In [12]:
%%time
# Abstract of the paper "Attention Is All You Need" (Vaswani et al., 2017)
text = """ 
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
"""
summarized_txt = summarizer(text)[0]['summary_text']
pprint_result_string(summarized_txt)

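# Translation of the summary (Google Translate)
# We propose the Transformer, a new simple network architecture based only on attention mechanisms, which completely removes recurrence and convolutions.
# Experiments on two machine translation tasks show that these models are superior in quality while being parallelizable and taking much less time to train.
# The Transformer generalizes well to other tasks, having been applied successfully to English constituency parsing.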

 We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, 
dispensing with recurrence and convolutions entirely . Experiments on two machine translation tasks show 
these models to be superior in quality while being more parallelizable and requiring significantly less 
time to train . The Transformer generalizes well to other tasks by applying it successfully to English 
constituency parsing .

CPU times: user 15.8 s, sys: 159 ms, total: 15.9 s
Wall time: 8.01 s


## 6. Translation
Translation is currently supported by T5 for the language mappings English-to-French (translation_en_to_fr), English-to-German (translation_en_to_de) and English-to-Romanian (translation_en_to_ro).

In [13]:
%%time
# English to German
translator = pipeline('translation_en_to_de')

CPU times: user 25.3 s, sys: 4.02 s, total: 29.3 s
Wall time: 1min 9s


In [14]:
%%time

src_sent = "The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods."
target_sent = translator(src_sent)[0]['translation_text']
pprint_result_string(target_sent)

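# Translation of the translated text (Google Translate)
# The history of natural language processing (NLP) generally began in the 1950s, but early work can be found.

# Translation of the translated text (Naver Papago)
# The history of natural language processing (NLP) generally began in the 1950s, but early works can be found.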

Die Geschichte der natürlichen Sprachenverarbeitung (NLP) begann im Allgemeinen in den 1950er Jahren, 
obwohl Arbeit aus früheren Zeiten gefunden werden kann.

CPU times: user 5.61 s, sys: 129 ms, total: 5.74 s
Wall time: 2.88 s


## 7. Text Generation
Text generation is currently supported by GPT-2, OpenAI GPT, TransfoXL, XLNet, CTRL and Reformer.

In [15]:
%%time
text_generator = pipeline("text-generation")

CPU times: user 15.6 s, sys: 2.13 s, total: 17.8 s
Wall time: 28.4 s


In [16]:
%%time
text_list = [text_generator("Today is a beautiful day and I will")[0]['generated_text'] for i in range(3)]

for text in text_list:
  # print('>>> ', end='')
  print('='*50, '\n')
  pprint_result_string(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Today is a beautiful day and I will always keep it warm, so please feel free to help the community when 
you can. Be sure to check out our donation page to have a look at our beautiful artwork. Thank you for 
keeping this beautiful city safe


Today is a beautiful day and I will do all I can" —Donald J. Trump (@realDonaldTrump) January 15, 2017

If it didn't cost me a big fortune to support my family, my children might never have learned all I learned


Today is a beautiful day and I will always look back at you," she said.

CPU times: user 8.9 s, sys: 76.9 ms, total: 8.97 s
Wall time: 4.53 s


## 8. Projection - Feature Extraction
This pipeline extracts the hidden states from the base transformer, which can be used as features in downstream tasks.

In [17]:
%%time
ftr_extract_model = pipeline('feature-extraction')

CPU times: user 6.68 s, sys: 1.2 s, total: 7.88 s
Wall time: 16.7 s


In [18]:
%%time
output = ftr_extract_model('Hugging Face is a French company based in Paris')  # 9 words

print(type(output), np.array(output).shape)   # (Samples, Tokens, Vector Size)
print(ftr_extract_model.tokenizer.convert_ids_to_tokens(ftr_extract_model.tokenizer('Hugging Face is a French company based in Paris')['input_ids']))

<class 'list'> (1, 12, 768)
['[CLS]', 'Hu', '##gging', 'Face', 'is', 'a', 'French', 'company', 'based', 'in', 'Paris', '[SEP]']
CPU times: user 82.7 ms, sys: 961 µs, total: 83.7 ms
Wall time: 42.6 ms
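
The output holds one 768-dimensional vector per token, including `[CLS]` and `[SEP]`. A common way to turn this into a single sentence vector is to average over the token axis (mean pooling). This is a downstream choice, not something the pipeline does itself. The sketch below uses a random array as a stand-in for the real output so that it runs without downloading a model; `cosine_similarity` is a hypothetical helper.

```python
import numpy as np

# Stand-in for the pipeline output: 1 sample, 12 tokens, 768-dim vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(1, 12, 768))

# Mean pooling: average over the token axis to get one vector per sample.
sentence_vectors = features.mean(axis=1)
print(sentence_vectors.shape)

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A vector compared with itself has similarity 1 (up to rounding).
print(cosine_similarity(sentence_vectors[0], sentence_vectors[0]))
```

With real features, `np.array(output)` from the cell above would take the place of `features`.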


## 9. Try various pipelines in an interactive way

In [21]:
task = widgets.Dropdown(
    options=['sentiment-analysis', 'ner', 'fill_mask'],
    value='ner',
    description='Task:',
    disabled=False
)

input = widgets.Text(
    value='',
    placeholder='Enter something',
    description='Your input:',
    disabled=False
)

def forward(_):
    if len(input.value) > 0: 
        if task.value == 'ner':
            output = ner_model(input.value)
        elif task.value == 'sentiment-analysis':
            output = sentiment_classifier(input.value)
        else:
            if input.value.find('<mask>') == -1:
                output = mlm_model(input.value + ' <mask>')
            else:
                output = mlm_model(input.value)                
        print(input.value, '=>', output)

input.on_submit(forward)
display(task, input)

I'm in a so so mood today => [{'label': 'POSITIVE', 'score': 0.9935868978500366}]
I'm logging in to facebook => []
I'm Logging in to Facebook => [{'entity_group': 'ORG', 'score': 0.723970890045166, 'word': 'Facebook', 'start': 18, 'end': 26}]


In [20]:
context = widgets.Textarea(
    value='Einstein is famous for the general theory of relativity',
    placeholder='Enter something',
    description='Context:',
    disabled=False
)

query = widgets.Text(
    value='Why is Einstein famous for ?',
    placeholder='Enter something',
    description='Question:',
    disabled=False
)

def forward(_):
    if len(context.value) > 0 and len(query.value) > 0: 
        output = qa_model(question=query.value, context=context.value)            
        print(query.value, '=>', output)

query.on_submit(forward)
display(context, query)

Why is Einstein famous for ? => {'score': 0.40378785133361816, 'start': 27, 'end': 55, 'answer': 'general theory of relativity'}
When did she get married => {'score': 0.4662879705429077, 'start': 14, 'end': 52, 'answer': 'When Liana Barrientos was 23 years old'}
what happened after marriage => {'score': 0.29999497532844543, 'start': 262, 'end': 277, 'answer': 'she got hitched'}
When was her second marriage? => {'score': 0.7660509943962097, 'start': 233, 'end': 240, 'answer': '18 days'}
When was her first marriage? => {'score': 0.29526281356811523, 'start': 14, 'end': 52, 'answer': 'When Liana Barrientos was 23 years old'}
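
In these results, `start` and `end` appear to be character offsets into the context, with `end` exclusive. A quick check against the first result above, using the default context from the widget (the later results refer to a context that is not shown here, so they cannot be checked the same way):

```python
context = "Einstein is famous for the general theory of relativity"
result = {"score": 0.40378785133361816, "start": 27, "end": 55,
          "answer": "general theory of relativity"}

# Slicing the context with the reported offsets should give back the answer.
span = context[result["start"]:result["end"]]
print(span)
assert span == result["answer"]
```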
