### **Hugging Face Transformers Pretrained Pipeline Models for NLP Tasks**

The easiest way to use a pretrained model for a given task is `pipeline()`. Hugging Face Transformers supports the following tasks out of the box:

* Sentiment analysis: is a text positive or negative?

* Text generation (in English): provide a prompt and the model will generate what follows.

* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).

* Question answering: provide the model with some context and a question, and it extracts the answer from the context.

* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill in the blanks.

* Summarization: generate a summary of a long text.

* Translation: translate a text into another language.

* Feature extraction: return a tensor representation of the text.

* Text classification.

* Token classification.

* Zero-shot classification.

* Conversation, and others.



The `task` argument (a `str`) determines which pipeline is returned. Currently accepted tasks include:

**"feature-extraction"**: will return a FeatureExtractionPipeline.

**"sentiment-analysis"**: will return a TextClassificationPipeline.

**"ner"**: will return a TokenClassificationPipeline.

**"question-answering"**: will return a QuestionAnsweringPipeline.

**"fill-mask"**: will return a FillMaskPipeline.

**"summarization"**: will return a SummarizationPipeline.

**"translation_xx_to_yy"**: will return a TranslationPipeline.

**"text2text-generation"**: will return a Text2TextGenerationPipeline.

**"text-generation"**: will return a TextGenerationPipeline.

**"text-classification"**: will return a TextClassificationPipeline.

**"token-classification"**: will return a TokenClassificationPipeline.

**"zero-shot-classification"**: will return a ZeroShotClassificationPipeline.

**"conversational"**: will return a ConversationalPipeline.
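
The list above can be read as a simple lookup from task string to pipeline class. The sketch below is only an illustration based on that list: `TASK_TO_PIPELINE` and `pipeline_class_name` are hypothetical names, not part of the Transformers API, and the regular expression assumes translation tasks follow the `translation_xx_to_yy` pattern with two-letter language codes.

```python
import re

# Task identifier -> pipeline class name, based on the list in this notebook.
TASK_TO_PIPELINE = {
    "feature-extraction": "FeatureExtractionPipeline",
    "sentiment-analysis": "TextClassificationPipeline",
    "ner": "TokenClassificationPipeline",
    "question-answering": "QuestionAnsweringPipeline",
    "fill-mask": "FillMaskPipeline",
    "summarization": "SummarizationPipeline",
    "text2text-generation": "Text2TextGenerationPipeline",
    "text-generation": "TextGenerationPipeline",
    "text-classification": "TextClassificationPipeline",
    "token-classification": "TokenClassificationPipeline",
    "zero-shot-classification": "ZeroShotClassificationPipeline",
    "conversational": "ConversationalPipeline",
}

# Assumption: translation tasks look like "translation_en_to_de".
TRANSLATION_TASK = re.compile(r"translation_([a-z]{2})_to_([a-z]{2})")


def pipeline_class_name(task):
    """Return the pipeline class name this notebook lists for a task string."""
    if TRANSLATION_TASK.fullmatch(task):
        return "TranslationPipeline"
    if task not in TASK_TO_PIPELINE:
        raise KeyError(f"Unknown task: {task!r}")
    return TASK_TO_PIPELINE[task]


print(pipeline_class_name("ner"))
print(pipeline_class_name("translation_en_to_fr"))
```

Note that several task strings are aliases: "sentiment-analysis" and "text-classification" map to the same class, as do "ner" and "token-classification".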

### Install Transformers and import `pipeline`

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)

In [2]:
from transformers import pipeline

### Pretrained pipeline models

In [None]:
# descriptive name: task identifier

all_tasks = {'feature_extraction': 'feature-extraction',
             'text_classification': 'text-classification',
             'sentiment_analysis': 'sentiment-analysis',
             'token_classification': 'token-classification',
             'named_entity_recognition': 'ner',
             'question_answering': 'question-answering',
             'table_question_answering': 'table-question-answering',
             'fill_mask': 'fill-mask',
             'summarization': 'summarization',
             'translation': 'translation_xx_to_yy',
             'text_to_text_generation': 'text2text-generation',
             'text_generation': 'text-generation',
             'zero_shot_classification': 'zero-shot-classification',
             'conversation': 'conversational',
             'audio_classification': 'audio-classification'}


## 1. Sentiment analysis

In [3]:
# default model: distilbert-base-uncased-finetuned-sst-2-english
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [4]:
classifier('i am happy')

[{'label': 'POSITIVE', 'score': 0.9998801946640015}]

In [5]:
results = classifier(["We are very happy to show you the Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


#### Predict Twitter samples

In [6]:
import pandas as pd
from nltk.corpus import twitter_samples

In [7]:
import nltk

In [8]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [9]:
# Use the same label names the classifier returns (POSITIVE / NEGATIVE).
documents = ([(t, "POSITIVE") for t in twitter_samples.strings("positive_tweets.json")] +
             [(t, "NEGATIVE") for t in twitter_samples.strings("negative_tweets.json")])

In [10]:
df = pd.DataFrame(documents, columns=['tweet','label'])

In [11]:
df.shape

(10000, 2)

In [12]:
df.head(3)

| | tweet | label |
|---|---|---|
| 0 | #FollowFriday @France_Inte @PKuchly57 @Milipol... | POSITIVE |
| 1 | @Lamb2ja Hey James! How odd :/ Please call our... | POSITIVE |
| 2 | @DespiteOfficial we had a listen last night :)... | POSITIVE |


In [13]:
df['label'].value_counts()

POSITIVE    5000
NEGATIVE    5000
Name: label, dtype: int64

In [None]:
#pred = classifier(df.tweet.tolist())

In [14]:
import numpy as np

In [15]:
tweets = df.tweet.tolist()

In [16]:
classifier(tweets[2])

[{'label': 'POSITIVE', 'score': 0.9996201992034912}]

In [17]:
tweets_chunks = np.array_split(tweets, 10)

In [18]:
pred = []
for i in tweets[0:100]:
  pred.append(classifier(i))
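
Calling the classifier once per tweet works, but `tweets_chunks` above hints at processing the data in chunks instead. A minimal, library-independent sketch of chunked processing follows. `fake_classifier` is a hypothetical stand-in so the example runs without downloading a model; since a pipeline also accepts a list of strings (as shown earlier), a real pipeline object could be passed in its place.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_classifier(texts):
    # Hypothetical stand-in for a sentiment pipeline: marks a text POSITIVE
    # if it contains a smiley, NEGATIVE otherwise.
    return [{"label": "POSITIVE" if ":)" in t else "NEGATIVE"} for t in texts]


def predict_labels(texts, classify, batch_size=32):
    """Run `classify` batch by batch and collect one label per input text."""
    labels = []
    for batch in batched(texts, batch_size):
        labels.extend(result["label"] for result in classify(batch))
    return labels


sample = ["@97sides CONGRATS :)", "How odd :/", "Those friends know themselves :)"]
print(predict_labels(sample, fake_classifier, batch_size=2))
```

Collecting labels this way also removes the need for a separate loop that unpacks each prediction afterwards.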

In [19]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning.
df_new = df[0:100].copy()

In [20]:
pred_label = []
for p in pred:
  pred_label.append((p[0]['label']))

In [21]:
df_new['pred'] = pred_label


In [22]:
df_new

| | tweet | label | pred |
|---|---|---|---|
| 0 | #FollowFriday @France_Inte @PKuchly57 @Milipol... | POSITIVE | POSITIVE |
| 1 | @Lamb2ja Hey James! How odd :/ Please call our... | POSITIVE | POSITIVE |
| 2 | @DespiteOfficial we had a listen last night :)... | POSITIVE | POSITIVE |
| 3 | @97sides CONGRATS :) | POSITIVE | POSITIVE |
| 4 | yeaaaah yippppy!!! my accnt verified rqst has... | POSITIVE | NEGATIVE |
| ... | ... | ... | ... |
| 95 | Those friends know themselves :) | POSITIVE | POSITIVE |
| 96 | waiting for nudes :-) | POSITIVE | NEGATIVE |
| 97 | @JacobWhitesides go sleep u ! :))))))))) | POSITIVE | NEGATIVE |
| 98 | Stats for the day have arrived. 1 new follower... | POSITIVE | NEGATIVE |


#### Evaluate the predictions

In [23]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
print(classification_report(df_new['label'], df_new['pred']))
print(accuracy_score(df_new['label'], df_new['pred']))

              precision    recall  f1-score   support

    NEGATIVE       0.00      0.00      0.00         0
    POSITIVE       1.00      0.56      0.72       100

    accuracy                           0.56       100
   macro avg       0.50      0.28      0.36       100
weighted avg       1.00      0.56      0.72       100

0.56




In [24]:
cm = confusion_matrix(df_new['label'], df_new['pred'])
print(cm)

[[ 0  0]
 [44 56]]
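
The report can be reproduced by hand from the confusion matrix. Because `df_new` holds only the first 100 rows, which are all positive tweets, the NEGATIVE row has zero support. The sketch below recomputes accuracy, precision, recall, and F1 for the POSITIVE class from the counts printed above; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# both ordered (NEGATIVE, POSITIVE).
cm = [[0, 0],
      [44, 56]]

tn, fp = cm[0]
fn, tp = cm[1]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of all POSITIVE predictions, how many were right
recall = tp / (tp + fn)      # of all true POSITIVE tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

These values match the POSITIVE row of the report (precision 1.00, recall 0.56, F1 0.72). A sample containing both classes would be needed for the NEGATIVE metrics to be meaningful.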


## 2. Named Entity Recognition

Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token as a person, an organisation or a location. 

Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as belonging to one of 9 classes:

* O, Outside of a named entity
* B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
* I-MISC, Miscellaneous entity
* B-PER, Beginning of a person's name right after another person's name
* I-PER, Person's name
* B-ORG, Beginning of an organisation right after another organisation
* I-ORG, Organisation
* B-LOC, Beginning of a location right after another location
* I-LOC, Location

In [None]:
# default model: dbmdz/bert-large-cased-finetuned-conll03-english
clf = pipeline("ner")
clf("Hugging Face is a French company based in New-York")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



[{'end': 2,
  'entity': 'I-ORG',
  'index': 1,
  'score': 0.9968092,
  'start': 0,
  'word': 'Hu'},
 {'end': 7,
  'entity': 'I-ORG',
  'index': 2,
  'score': 0.9333935,
  'start': 2,
  'word': '##gging'},
 {'end': 12,
  'entity': 'I-ORG',
  'index': 3,
  'score': 0.9784389,
  'start': 8,
  'word': 'Face'},
 {'end': 24,
  'entity': 'I-MISC',
  'index': 6,
  'score': 0.9981635,
  'start': 18,
  'word': 'French'},
 {'end': 45,
  'entity': 'I-LOC',
  'index': 10,
  'score': 0.9978009,
  'start': 42,
  'word': 'New'},
 {'end': 46,
  'entity': 'I-LOC',
  'index': 11,
  'score': 0.94690835,
  'start': 45,
  'word': '-'},
 {'end': 50,
  'entity': 'I-LOC',
  'index': 12,
  'score': 0.9975541,
  'start': 46,
  'word': 'York'}]
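
The output above is per sub-word token ("Hu", "##gging", "Face"), so one entity is spread over several entries. A minimal sketch of merging adjacent tokens into entity spans using the `start`/`end` character offsets follows. `merge_entities` is a hypothetical helper written for this notebook, and the rule "same type and at most one character apart" is an assumption that happens to fit this example.

```python
def merge_entities(text, tokens):
    """Merge adjacent tokens of the same entity type into character spans."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = spans[-1] if spans else None
        # Adjacent means: same type and separated by at most one character.
        if (last is not None and last["label"] == label
                and tok["start"] - last["end"] <= 1):
            last["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


sentence = "Hugging Face is a French company based in New-York"
# Offsets and labels copied from the pipeline output above (scores omitted).
tokens = [
    {"entity": "I-ORG", "start": 0, "end": 2},
    {"entity": "I-ORG", "start": 2, "end": 7},
    {"entity": "I-ORG", "start": 8, "end": 12},
    {"entity": "I-MISC", "start": 18, "end": 24},
    {"entity": "I-LOC", "start": 42, "end": 45},
    {"entity": "I-LOC", "start": 45, "end": 46},
    {"entity": "I-LOC", "start": 46, "end": 50},
]

merged = merge_entities(sentence, tokens)
for span in merged:
    print(span["label"], repr(span["text"]))
```

The library also has its own grouping option for token classification; see the `aggregation_strategy` parameter in the pipeline documentation for the version you have installed.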

In [None]:
clf('Google campus at San Jose downtown will open in 2023')

[{'end': 6,
  'entity': 'I-ORG',
  'index': 1,
  'score': 0.9529939,
  'start': 0,
  'word': 'Google'},
 {'end': 20,
  'entity': 'I-LOC',
  'index': 4,
  'score': 0.99928087,
  'start': 17,
  'word': 'San'},
 {'end': 25,
  'entity': 'I-LOC',
  'index': 5,
  'score': 0.997044,
  'start': 21,
  'word': 'Jose'}]

In [None]:
#clf = pipeline("ner")
sentence = 'Larry Page is an owner and co-founder of Google.'
print(clf(sentence))

[{'entity': 'I-PER', 'score': 0.9986791, 'index': 1, 'word': 'Larry', 'start': 0, 'end': 5}, {'entity': 'I-PER', 'score': 0.99919504, 'index': 2, 'word': 'Page', 'start': 6, 'end': 10}, {'entity': 'I-ORG', 'score': 0.999159, 'index': 11, 'word': 'Google', 'start': 41, 'end': 47}]


#### NER passing in a specific model and tokenizer

In [None]:
from transformers import AutoModelForTokenClassification, AutoTokenizer

In [None]:
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
clf = pipeline("ner", model=model, tokenizer=tokenizer)


In [None]:
sentence = 'Larry Page is an owner and co-founder of Google.'
print(clf(sentence))

[{'entity': 'I-PER', 'score': 0.9986791, 'index': 1, 'word': 'Larry', 'start': 0, 'end': 5}, {'entity': 'I-PER', 'score': 0.99919504, 'index': 2, 'word': 'Page', 'start': 6, 'end': 10}, {'entity': 'I-ORG', 'score': 0.999159, 'index': 11, 'word': 'Google', 'start': 41, 'end': 47}]


## 3. Fill mask

In [None]:
# default model: distilroberta-base
nlp_fill = pipeline('fill-mask')
nlp_fill('Hugging Face is a French company based in <mask>')

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)



[{'score': 0.2775883078575134,
  'sequence': 'Hugging Face is a French company based in Paris',
  'token': 2201,
  'token_str': ' Paris'},
 {'score': 0.14941270649433136,
  'sequence': 'Hugging Face is a French company based in Lyon',
  'token': 12790,
  'token_str': ' Lyon'},
 {'score': 0.0457640215754509,
  'sequence': 'Hugging Face is a French company based in Geneva',
  'token': 11559,
  'token_str': ' Geneva'},
 {'score': 0.04576275497674942,
  'sequence': 'Hugging Face is a French company based in France',
  'token': 1470,
  'token_str': ' France'},
 {'score': 0.04067588970065117,
  'sequence': 'Hugging Face is a French company based in Brussels',
  'token': 6497,
  'token_str': ' Brussels'}]

In [None]:
pd.DataFrame(nlp_fill("Argentina is a country located in South <mask>", top_k=5))


| | score | token | token_str | sequence |
|---|---|---|---|---|
| 0 | 0.641242 | 730 | America | Argentina is a country located in South America |
| 1 | 0.333804 | 1327 | Africa | Argentina is a country located in South Africa |
| 2 | 0.015436 | 1817 | Asia | Argentina is a country located in South Asia |
| 3 | 0.003906 | 6312 | Sudan | Argentina is a country located in South Sudan |
| 4 | 0.001698 | 1101 | Korea | Argentina is a country located in South Korea |


## 4. Feature extraction

In [25]:
tokens = "Argentina is a third world country located in South America".split()

In [26]:
# default model: distilbert-base-cased
#import numpy as np
nlp_features = pipeline('feature-extraction')
output = nlp_features(tokens)
res = np.array(output)

No model was supplied, defaulted to distilbert-base-cased (https://huggingface.co/distilbert-base-cased)



Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




In [27]:
res[1]

array([list([[0.46375441551208496, 0.15333601832389832, 0.2019384801387787, -0.28952354192733765, -0.239096000790596, -0.17164203524589539, 0.04430026561021805, -0.10592067241668701, -0.013992395251989365, -1.2095521688461304, -0.4319513142108917, 0.27107301354408264, -0.1402595490217209, 0.14163576066493988, -0.5714165568351746, -0.09066338837146759, 0.5053672194480896, 0.22202971577644348, -0.12447606027126312, -0.3575628399848938, 0.24094873666763306, -0.11784090101718903, 0.7015999555587769, -0.3101547658443451, 0.3623453378677368, -0.19150112569332123, 0.3590286672115326, 0.012654951773583889, -0.1349295973777771, 0.3730219602584839, 0.017548223957419395, 0.3047291040420532, 0.1616121381521225, 0.12038038671016693, -0.078517384827137, 0.18424250185489655, -0.05088845267891884, -0.3788972795009613, 0.04498166963458061, -0.025049524381756783, -0.41945749521255493, 0.13264252245426178, 0.6972384452819824, -0.16336338222026825, 0.3010530173778534, -0.44742831587791443, 0.1426295340061

In [28]:
res.shape

(10, 1)
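
The shape `(10, 1)` most likely arises because each of the ten words is treated as a separate input and is split into a different number of sub-word tokens, so NumPy cannot stack the results into a regular numeric array. One common way to get a fixed-size vector per input is to average over the token axis. The sketch below shows this mean pooling on a toy stand-in for the pipeline output; the 3-dimensional vectors and the name `mean_pool` are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


# Toy stand-in for the pipeline output: one entry per input, each shaped
# [1][n_tokens][hidden_size]. Real vectors would be much longer than 3.
toy_output = [
    [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]],                    # 2 tokens
    [[[0.0, 0.0, 0.0], [3.0, 3.0, 3.0], [6.0, 6.0, 6.0]]],   # 3 tokens
]

pooled = [mean_pool(entry[0]) for entry in toy_output]
print(pooled)
```

After pooling, every input has a vector of the same length, so the results can be stacked into a regular array.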

## 5. Text Generation

In [None]:
# Default model: gpt2
nlp_generation = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



In [None]:
nlp_generation("As far I am concerned, I will", do_sample=True, max_length=100, temperature=1.0)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'As far I am concerned, I will have to wait until the full report is ready to take effect.\n\nWe have also received reports that a number of issues involving the transfer of funds are going well. It does not look like our bank accounts are in dire need of funds, but there is no guarantee either way.\n\nIn a press conference the Chief Operating Officer commented:\n\n"We believe that all existing accounts and accounts with outstanding balances are of acceptable quality and that this new funding'

In [None]:
nlp_generation('It is raining now')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'It is raining now.\n\n"How\'s my day?"\n\nHans suddenly paused, but she took a step back and looked closer. The darkness had been fading through the night, but not very soon. One of the people who\'d'}]
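
The `do_sample=True` and `temperature` arguments used earlier control how the next token is picked. The sketch below is a simplified illustration of sampling with a temperature, not the library's generation code; the vocabulary and scores are made up.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical next-token scores, made up for illustration.
vocab = ["have", "be", "go"]
logits = [2.0, 1.0, 0.5]

print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=0.5))
print(sample_token(vocab, logits, rng=random.Random(0)))
```

With a lower temperature the highest-scoring token takes a larger share of the probability mass, so the generated text becomes less varied.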

## 6. Summarization

In [None]:
# default model: sshleifer/distilbart-cnn-12-6
#from transformers import pipeline
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



In [None]:
summarizer('For English language, BERT have 2 kinds of model, cased model and uncased model. Cased model will leave the word token with cased, while uncased model will lower all the word token. Since for NER task, the cased for the word token is important, e.g. when we talk about “Apple” mostly, we may mean the company, we talk about “apple” we may mostly mean the fruit. So we will choose BERT cased model for fine-tuning new model, and be careful ,when we preprocess the data before this step, we also need to load BERT cased tokenizer and leave the cased form of word token unchanged.')

[{'summary_text': ' For English language, BERT have 2 kinds of model, cased model and uncased model . Cased model will leave the word token with cased, while uncase model will lower all the word . Since for NER task, the cased for the word tokens is important, e.g. when we talk about “Apple” mostly, we may mean the company . So we will choose BERT cased . model for fine-tuning new model .'}]

In [None]:
# another example
article = ''' The number of lives claimed by the Covid-19 coronavirus in India escalated sharply to 640 on Wednesday morning, with the total tally of positive cases rapidly nearing the 20,000 mark. The Indian Medical Association (IMA) called off the White Alert protest of doctors after the association was assured by Union Home Minister Amit Shah that they would be provided security by government. Meanwhile, a civil aviation ministry employee tested coronavirus positive today after which the B wing of the ministry was sealed and sanitisation procedure was initiated. '''
print(summarizer(article, max_length=90, min_length=20))

[{'summary_text': ' The total tally of positive cases nearing the 20,000 mark . The Indian Medical Association (IMA) called off the White Alert protest of doctors . Civil aviation ministry employee tested positive today after which the B wing of the ministry was sealed .'}]


## 7. Conversation

In [None]:
from transformers import pipeline, Conversation

In [None]:
# default model: microsoft/DialoGPT-medium
#from transformers import pipeline
conversations = pipeline('conversational')

No model was supplied, defaulted to microsoft/DialoGPT-medium (https://huggingface.co/microsoft/DialoGPT-medium)



In [None]:
conv1 = Conversation("Going to the movies tonight - any suggestions?")
conv2 = Conversation("What's the last book you have read?")

conversations([conv1, conv2])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Conversation id: 8d5f511e-8c33-4111-b71e-6449010b43db 
 user >> Going to the movies tonight - any suggestions? 
 bot >> The Big Lebowski ,
 Conversation id: 0b607dd5-d05f-4b69-b5ba-f1814d387552 
 user >> What's the last book you have read? 
 bot >> The Last Question ]

#### Add continuing conversations

In [None]:
conv1_next = "What's it about?"
conv2_next = "Cool, what is the genre of the book?"

In [None]:
conv1.add_user_input(conv1_next)
conv2.add_user_input(conv2_next)

In [None]:
conversations([conv1, conv2])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Conversation id: 8d5f511e-8c33-4111-b71e-6449010b43db 
 user >> Going to the movies tonight - any suggestions? 
 bot >> The Big Lebowski 
 user >> What's it about? 
 bot >> It's a comedy about a guy who gets a job at a movie theater and has to deal with the people who work there. ,
 Conversation id: 0b607dd5-d05f-4b69-b5ba-f1814d387552 
 user >> What's the last book you have read? 
 bot >> The Last Question 
 user >> Cool, what is the genre of the book? 
 bot >> I'm not sure, I think it's a fantasy. ]
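
The printed conversations show the basic idea: each conversation keeps its history, and a new user input is held until the model answers it. The toy class below sketches that bookkeeping only. `ToyConversation` and its `add_bot_reply` method are hypothetical and are not the Transformers `Conversation` class, which has its own attributes and methods.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConversation:
    """Toy stand-in for a chat history; not the Transformers `Conversation` class."""
    turns: list = field(default_factory=list)  # (speaker, text) pairs
    pending: str = ""                          # user input not yet answered

    def add_user_input(self, text):
        if self.pending:
            raise ValueError("Previous user input has not been answered yet")
        self.pending = text

    def add_bot_reply(self, reply):
        # Move the pending input into the history together with the reply.
        self.turns.append(("user", self.pending))
        self.turns.append(("bot", reply))
        self.pending = ""


conv = ToyConversation()
conv.add_user_input("Going to the movies tonight - any suggestions?")
conv.add_bot_reply("The Big Lebowski")
conv.add_user_input("What's it about?")

for speaker, text in conv.turns:
    print(f"{speaker} >> {text}")
print("pending:", conv.pending)
```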

## 8. Question Answering

In [None]:
# default model: distilbert-base-cased-distilled-squad
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



In [None]:
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

{'answer': 'Hugging Face', 'end': 45, 'score': 0.6949771046638489, 'start': 33}

In [None]:
question_answerer(
    question="Wat is my name?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

{'answer': 'Sylvain', 'end': 18, 'score': 0.9976813793182373, 'start': 11}
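
The `start` and `end` values in these results are character offsets into the context, so the answer can be recovered by slicing. A quick check using the two results above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Results copied from the two calls above (scores omitted).
results = [
    {"answer": "Hugging Face", "start": 33, "end": 45},
    {"answer": "Sylvain", "start": 11, "end": 18},
]

for r in results:
    span = context[r["start"]:r["end"]]
    print(span, span == r["answer"])
```

This is what makes the pipeline extractive: the answer is always a span of the given context rather than newly generated text.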

## 9. Zero-shot classification

In [None]:
# default model: facebook/bart-large-mnli
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)



In [None]:
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'labels': ['education', 'business', 'politics'],
 'scores': [0.8445987105369568, 0.11197440326213837, 0.043426960706710815],
 'sequence': 'This is a course about the Transformers library'}

In [None]:
# Another example
classifier(
    "policis is great and I wish I study polical science",
    candidate_labels=["education", "politics", "business" ,"education"],
)

{'labels': ['politics', 'education', 'education', 'business'],
 'scores': [0.9671274423599243,
  0.011469840072095394,
  0.011469840072095394,
  0.009932917542755604],
 'sequence': 'policis is great and I wish I study polical science'}
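
In the second example "education" appears twice in `candidate_labels`, so it also appears twice in the result with the same score. A small sketch of removing duplicate labels before calling the classifier, and of reading the best label from the returned lists, is shown below; the values are copied from the output above.

```python
candidate_labels = ["education", "politics", "business", "education"]

# dict.fromkeys keeps the first occurrence of each label and preserves order.
unique_labels = list(dict.fromkeys(candidate_labels))
print(unique_labels)

# Pair labels with scores copied from the output above; repeated labels
# carry the same score, so overwriting them in a dict loses nothing.
labels = ["politics", "education", "education", "business"]
scores = [0.9671274423599243, 0.011469840072095394,
          0.011469840072095394, 0.009932917542755604]
best = dict(zip(labels, scores))
print(max(best, key=best.get))
```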

## 10. Translation

In [None]:
# After installing, restart the runtime!
#!pip install transformers[sentencepiece]
!pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-1.18.3-py3-none-any.whl (311 kB)

In [None]:
# Translate French to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")


[{'translation_text': 'This course is produced by Hugging Face.'}]

In [None]:
# Translate English to French
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
translator("This course is produced by Hugging Face.")


[{'translation_text': 'Ce cours est produit par Hugging Face.'}]

## 11. Translation_xx_to_yy

In [None]:
# default model: t5-base
# from English to German
en_de_translator = pipeline("translation_en_to_de")
print(en_de_translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)



[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]


In [None]:
# default model: t5-base
# from English to French
en_fr_translator = pipeline("translation_en_to_fr")
en_fr_translator("How old are you?")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': ' quel âge êtes-vous?'}]

## 12. Text2text-Generation

In [None]:
# default model: t5-base
text2text_generator = pipeline("text2text-generation")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


In [None]:
text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")

[{'generated_text': 'the answer to life, the universe and everything'}]

In [None]:
text2text_generator("question: Which is capital city of United States? context: Washtington DC is United States' capital city")

[{'generated_text': 'Washtington DC'}]
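
Both examples pass the question and the context in a single string of the form `question: ... context: ...`. A tiny helper for building such prompts is sketched below; `build_qa_prompt` is a hypothetical name, and the format is simply the one used in the calls above.

```python
def build_qa_prompt(question, context):
    """Format a question and its context in the 'question: ... context: ...' style."""
    return f"question: {question.strip()} context: {context.strip()}"


prompt = build_qa_prompt(
    "What is 42 ?",
    "42 is the answer to life, the universe and everything",
)
print(prompt)
```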

## 13. Text-Classification

In [None]:
# default model: distilbert-base-uncased-finetuned-sst-2-english
clf = pipeline("text-classification")
clf("This restaurant is awesome")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998743534088135}]

In [None]:
#clf = pipeline("text-classification")
clf(["This restaurant is awesome", "This restaurant is aweful", "This restaurant is bad"])

[{'label': 'POSITIVE', 'score': 0.9998743534088135},
 {'label': 'POSITIVE', 'score': 0.9996858835220337},
 {'label': 'NEGATIVE', 'score': 0.9998098015785217}]

#### Use a specific model

In [None]:
pipe = pipeline("text-classification", model="roberta-large-mnli")
pipe("This restaurant is awesome")


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



[{'label': 'NEUTRAL', 'score': 0.7313133478164673}]

#### Use a generator

In [None]:
from transformers import pipeline
pipe = pipeline("text-classification")

def data(n=3):
    # Finite generator so the loop below terminates; the inputs could instead
    # come from a dataset, a database, a queue or an HTTP request in a server.
    # Caveat: because this is iterative, you cannot use `num_workers > 1`
    # to preprocess data with multiple threads. You can still have 1 thread that
    # does the preprocessing while the main thread runs the big inference.
    for _ in range(n):
        yield "This is a test"


for out in pipe(data()):
    # Each `out` holds the predicted label and score for one input.
    print(out)

## 14. Table-Question-Answering

In [None]:
# This pipeline is only available in PyTorch
import torch
torch.__version__

'1.10.0+cu111'

In [None]:
# After installing, restart the runtime! The wheel index in the URL should match the installed torch/CUDA version.
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html

Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-scatter
  Downloading torch_scatter-2.0.9.tar.gz (21 kB)
Building wheels for collected packages: torch-scatter
  Building wheel for torch-scatter (setup.py) ... done
  Created wheel for torch-scatter: filename=torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl size=3225862 sha256=7e657dab222c6559443fc21b0005fceca36e41f51d98b3c83fbe94230ea112e0
  Stored in directory: /root/.cache/pip/wheels/dd/57/a3/42ea193b77378ce634eb9454c9bc1e3163f3b482a35cdee4d1
Successfully built torch-scatter
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.9


In [None]:
#from transformers import pipeline
import pandas as pd

In [None]:
tqa = pipeline(task="table-question-answering", 
               model="google/tapas-base-finetuned-wtq")


In [None]:
table = pd.read_csv("/content/2022WinterOlympicMedals.csv")
table = table.astype(str)

In [None]:
table

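| | Unnamed: 0 | Country | Gold_medal | silver_medal | copper_medal | total |
|---|---|---|---|---|---|---|
| 0 | 1 | Norway | 9 | 5 | 7 | 21 |
| 1 | 2 | Germany | 8 | 5 | 2 | 15 |
| 2 | 3 | United States | 7 | 6 | 3 | 16 |
| 3 | 4 | Netherlands | 6 | 4 | 2 | 12 |
| 4 | 5 | Austria | 5 | 6 | 4 | 15 |
| 5 | 6 | Sweden | 5 | 3 | 3 | 11 |
| 6 | 7 | China | 5 | 3 | 2 | 10 |
| 7 | 8 | ROC | 4 | 6 | 8 | 18 |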


In [None]:
query = "Witch country has highest gold_medal?"
print(tqa(table=table, query=query)["answer"])

Norway


In [None]:
query = ["Germany's total?", 
         "how many gold_medal Germany has?"]
answer = tqa(table=table, query=query)
for ans in answer:
    print(ans["answer"])

SUM > 15
SUM > 8
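
The answers can be cross-checked against the table directly. The sketch below copies the relevant columns from the table above and also splits answers such as `SUM > 15` into an aggregator and the selected cell values. `split_answer` is a hypothetical helper that assumes the `AGGREGATOR > cells` format seen in the output above.

```python
# Rows copied from the table above (values kept as strings, as in `table`).
rows = [
    {"Country": "Norway", "Gold_medal": "9", "total": "21"},
    {"Country": "Germany", "Gold_medal": "8", "total": "15"},
    {"Country": "United States", "Gold_medal": "7", "total": "16"},
    {"Country": "Netherlands", "Gold_medal": "6", "total": "12"},
    {"Country": "Austria", "Gold_medal": "5", "total": "15"},
    {"Country": "Sweden", "Gold_medal": "5", "total": "11"},
    {"Country": "China", "Gold_medal": "5", "total": "10"},
    {"Country": "ROC", "Gold_medal": "4", "total": "18"},
]


def split_answer(answer):
    """Split 'SUM > 15' into ('SUM', '15'); plain answers get aggregator None."""
    if " > " in answer:
        aggregator, cells = answer.split(" > ", 1)
        return aggregator, cells
    return None, answer


top = max(rows, key=lambda r: int(r["Gold_medal"]))
germany = next(r for r in rows if r["Country"] == "Germany")

print(top["Country"])
print(split_answer("SUM > 15"), germany["total"])
print(split_answer("SUM > 8"), germany["Gold_medal"])
```

The model's answers agree with the table: Norway has the most gold medals, and Germany's total and gold counts are 15 and 8.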


## 15. Token Classification
The model returns a list of dictionaries with a part-of-speech (PoS) tag for each token.

In [None]:
#from transformers import pipeline

classifier = pipeline("token-classification", model = "vblagoje/bert-english-uncased-finetuned-pos")
classifier("Hello I'm Omar and I live in Zürich.")


[{'end': 5,
  'entity': 'INTJ',
  'index': 1,
  'score': 0.99677867,
  'start': 0,
  'word': 'hello'},
 {'end': 7,
  'entity': 'PRON',
  'index': 2,
  'score': 0.9994424,
  'start': 6,
  'word': 'i'},
 {'end': 8,
  'entity': 'AUX',
  'index': 3,
  'score': 0.9962947,
  'start': 7,
  'word': "'"},
 {'end': 9,
  'entity': 'AUX',
  'index': 4,
  'score': 0.9960801,
  'start': 8,
  'word': 'm'},
 {'end': 14,
  'entity': 'PROPN',
  'index': 5,
  'score': 0.9989945,
  'start': 10,
  'word': 'omar'},
 {'end': 18,
  'entity': 'CCONJ',
  'index': 6,
  'score': 0.999172,
  'start': 15,
  'word': 'and'},
 {'end': 20,
  'entity': 'PRON',
  'index': 7,
  'score': 0.99947625,
  'start': 19,
  'word': 'i'},
 {'end': 25,
  'entity': 'VERB',
  'index': 8,
  'score': 0.9985421,
  'start': 21,
  'word': 'live'},
 {'end': 28,
  'entity': 'ADP',
  'index': 9,
  'score': 0.9994098,
  'start': 26,
  'word': 'in'},
 {'end': 35,
  'entity': 'PROPN',
  'index': 10,
  'score': 0.9989255,
  'start': 29,
  'word':

## 16. Audio-Classification

In [None]:
clf = pipeline("audio-classification")

## 17. Object-Detection

In [None]:
clf = pipeline("object-detection")