#### The pipeline function returns an end-to-end object that performs an NLP task on one or several texts. It supports the most common NLP tasks out of the box.
#### The pipeline consists of three stages
<div>
<img src="image/pipeline1.png" width="800"/>
</div>

#### The first task to try with the pipeline API is ***sentiment analysis***, which classifies texts as positive or negative

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I have been waiting for a HuggingFace course my whole life.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.943362832069397}]


#### Multiple texts can be passed as a list to the object returned by a pipeline, which processes them together and returns one result per text

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier(["I have been waiting for a HuggingFace course my whole life.",
                    "I hate it so much!"])
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.943362832069397}, {'label': 'NEGATIVE', 'score': 0.9995473027229309}]
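The classifier returns one dictionary per input text, in the same order as the inputs. As a minimal sketch of how such a result list might be consumed, the snippet below pairs each text with its label and keeps only confident predictions. The results are copied from the output above; the helper name `confident_predictions` and the 0.9 threshold are illustrative assumptions, not part of the library.

```python
# Results copied from the output above: one dict per input text, in input order.
texts = [
    "I have been waiting for a HuggingFace course my whole life.",
    "I hate it so much!",
]
results = [
    {"label": "POSITIVE", "score": 0.943362832069397},
    {"label": "NEGATIVE", "score": 0.9995473027229309},
]


def confident_predictions(texts, results, threshold=0.9):
    """Pair each text with its label, keeping only confident predictions.

    `threshold` is an illustrative value, not a library default.
    """
    return [
        (text, res["label"])
        for text, res in zip(texts, results)
        if res["score"] >= threshold
    ]


pairs = confident_predictions(texts, results)
for text, label in pairs:
    print(f"{label}: {text}")
```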


#### The ***zero-shot-classification*** pipeline lets you select the labels for classification

In [3]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transfomers library.",
    candidate_labels=["education", "politics", "bussiness"]
    )
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'sequence': 'This is a course about the Transfomers library.', 'labels': ['education', 'bussiness', 'politics'], 'scores': [0.8097747564315796, 0.1434357762336731, 0.046789489686489105]}
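The result is a single dictionary whose `labels` and `scores` lists are aligned and sorted from most to least likely. In the default single-label mode used here, the scores are normalized across the candidate labels, so they add up to roughly 1. A small sketch that checks this on the output above (the dictionary is copied from that output, including its original spelling):

```python
# Output copied verbatim from the cell above.
result = {
    "sequence": "This is a course about the Transfomers library.",
    "labels": ["education", "bussiness", "politics"],
    "scores": [0.8097747564315796, 0.1434357762336731, 0.046789489686489105],
}

# Labels and scores are aligned by position, so they can be zipped together.
label_scores = dict(zip(result["labels"], result["scores"]))

# The first label is the most likely one because the lists are sorted by score.
best_label = result["labels"][0]
total = sum(result["scores"])

print(best_label, round(label_scores[best_label], 3))
print(round(total, 3))
```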


#### The ***text-generation*** pipeline uses an input prompt to generate text 

In [4]:
from transformers import pipeline

generator = pipeline("text-generation")
result = generator("In this course, we will teach you how to",
                   pad_token_id=generator.tokenizer.eos_token_id
                   )
print(result)

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'generated_text': 'In this course, we will teach you how to make and run your own "Federation of the Stars" web-app with the help of the Unity editor. Before you are able to compile and run your application, first install the application on a'}]


#### For each task, you can search the model hub for various models to use in the pipeline: [HuggingFace model hub](https://huggingface.co/models)

#### Here is another ***text generation*** pipeline, using the ***distilgpt2*** model

In [5]:
from transformers import pipeline

generator = pipeline("text-generation",
                     model="distilgpt2"
                     )
result = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
print(result)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to learn the fundamentals of the C-E-K.\n\n\n\nOur goal is so simple'}, {'generated_text': "In this course, we will teach you how to program your basic knowledge: English, French language, and Arabic. In English, you'll learn how"}]


#### The ***fill-mask*** pipeline predicts missing words in a sentence

In [6]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
result = unmasker(
    "This course will teach you all about <mask> models.",
    top_k=2
    )
print(result)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `P

[{'score': 0.1961977630853653, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.04052729532122612, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}]


#### The ***NER*** pipeline identifies entities such as persons, organizations or locations in a sentence.

In [7]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("My name is Qiyao Xue and I am studying the HuggingFace transformers library.")
print(result)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is

[{'entity_group': 'PER', 'score': 0.9850975, 'word': 'Qiyao Xue', 'start': 11, 'end': 20}, {'entity_group': 'ORG', 'score': 0.9732389, 'word': 'HuggingFace', 'start': 43, 'end': 54}]


#### The ***question-answering*** pipeline extracts answers to a question from a given context

In [8]:
from transformers import pipeline

QA = pipeline("question-answering")
result = QA(
    question="Where do I work?",
    context="My name is QiyaoXue and I work at Pittsburgh"
)
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.9962271451950073, 'start': 34, 'end': 44, 'answer': 'Pittsburgh'}
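The `start` and `end` values are character positions in the context, with `end` exclusive, so the answer can be recovered by slicing. A quick check using the context and the output from the cell above:

```python
context = "My name is QiyaoXue and I work at Pittsburgh"

# Output copied from the cell above.
result = {"score": 0.9962271451950073, "start": 34, "end": 44, "answer": "Pittsburgh"}

# Slicing the context with [start:end] yields the extracted answer.
span = context[result["start"]:result["end"]]
print(span)
```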


#### The ***summarization*** pipeline creates summaries of long texts

In [9]:
from transformers import pipeline

summarizer = pipeline("summarization")
result = summarizer("""Hugging Face, Inc. is an American company incorporated under the Delaware General Corporation Law[1] and based in New York City that develops computation tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.""",
                    max_length=18
                    )
print(result)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
  return self.fget.__get__(instance, owner)()
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Your min_length=56 must be inferior than your max_length=18.


[{'summary_text': ' Hugging Face, Inc. develops computation tools for building applications using machine learning'}]


#### The ***translation*** pipeline translates text from one language to another

In [10]:
from transformers import pipeline

translator = pipeline("translation",
                      model="Helsinki-NLP/opus-mt-fr-en"
                      )
result = translator("Ce cours est produit par HuggingFace.")
print(result)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'translation_text': 'This course is produced by HuggingFace.'}]


#### Recall that the pipeline consists of three stages: tokenization, model, and postprocessing
<div>
<img src="image/pipeline1.png" width="800"/>
</div>

## Stage1: Tokenization
<div>
<img src="image/pipeline2.png" width='800'/>
</div>

#### The ***AutoTokenizer*** class can load the tokenizer for any checkpoint

In [11]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I have been waiting for a HuggingFace course my whole life.",
    "I hate it so much"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# padding=True: the two sentences differ in length, so the shorter one is padded so that both fit in one rectangular tensor
# truncation=True: any sentence longer than the maximum length the model can handle is truncated
# return_tensors="pt": the tokenizer returns PyTorch tensors
# attention_mask in the output marks where padding was applied, so the model ignores those positions
print(inputs)

{'input_ids': tensor([[  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2009,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
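To make the padding and attention mask concrete, here is a minimal pure-Python sketch of what `padding=True` produces for this batch. It is only an illustration: the real tokenizer does this internally, and the helper `pad_batch` is hypothetical. It assumes right-side padding and a padding ID of 0, which is what the output above shows.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to the same length and build attention masks.

    Hypothetical helper for illustration; pad_id=0 matches the output above.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, with the padding removed.
batch = pad_batch([
    [101, 1045, 2031, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2009, 2061, 2172, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```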


## Stage2: Model

#### The ***AutoModel*** class loads a model without its pretraining head

In [12]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# the output [2, 15, 768] refers to [batch size, sequence length, hidden size]

torch.Size([2, 15, 768])


#### Each ***AutoModelForXXX*** class loads a model suitable for a specific task

In [13]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)
# logits are the raw, unnormalized scores produced by the classification head (the final linear layer)

tensor([[-1.3781,  1.4346],
        [ 4.3257, -3.4977]], grad_fn=<AddmmBackward0>)


## Stage3: Postprocessing

#### To go from logits to probabilities, we apply a softmax over the last dimension

In [15]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions.detach())

tensor([[5.6637e-02, 9.4336e-01],
        [9.9960e-01, 4.0010e-04]])
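For intuition, the same computation can be written without PyTorch. The sketch below applies a numerically stable softmax to the logits printed earlier (rounded to four decimals as shown there), so the results should agree with the tensor above up to rounding.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the AutoModelForSequenceClassification output.
logits = [[-1.3781, 1.4346], [4.3257, -3.4977]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 4) for p in row])
```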


In [31]:
id2label = model.config.id2label
print(id2label)
print(f"sentence1: {id2label[torch.argmax(predictions[0]).item()]}, sentence2: {id2label[torch.argmax(predictions[1]).item()]}")

{0: 'NEGATIVE', 1: 'POSITIVE'}
sentence1: POSITIVE, sentence2: NEGATIVE
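The last step is an argmax over each row of probabilities followed by a lookup in `id2label`. A small sketch that generalizes the print statement above to any batch size, using the probabilities and label mapping from the outputs above (the helper name `predict_labels` is hypothetical):

```python
# Values copied from the outputs above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[5.6637e-02, 9.4336e-01], [9.9960e-01, 4.0010e-04]]


def predict_labels(probabilities, id2label):
    """Return the label with the highest probability for each row."""
    labels = []
    for row in probabilities:
        # Index of the largest probability in this row (argmax).
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(id2label[best])
    return labels


labels = predict_labels(probabilities, id2label)
print(labels)
```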


## Inside the token classification pipeline

#### The token classification pipeline assigns a label to each token in the sentence, indicating whether it belongs to a person, an organization, or a location

In [35]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
result = token_classifier("My name is QiyaoXue and I am from China")
print(result)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'entity': 'I-PER', 'score': 0.99662614, 'index': 4, 'word': 'Qi', 'start': 11, 'end': 13}, {'entity': 'I-PER', 'score': 0.9735996, 'index': 5, 'word': '##ya', 'start': 13, 'end': 15}, {'entity': 'I-PER', 'score': 0.94115335, 'index': 6, 'word': '##o', 'start': 15, 'end': 16}, {'entity': 'I-PER', 'score': 0.9444561, 'index': 7, 'word': '##X', 'start': 16, 'end': 17}, {'entity': 'I-PER', 'score': 0.9376296, 'index': 8, 'word': '##ue', 'start': 17, 'end': 19}, {'entity': 'I-LOC', 'score': 0.99976164, 'index': 13, 'word': 'China', 'start': 34, 'end': 39}]
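Note that the results are per token, not per word: the `##` prefix marks a WordPiece token that continues the previous one. As a rough sketch of how such pieces fit back together, the hypothetical helper below glues continuation pieces onto the preceding token. The token strings are copied from the output above.

```python
def merge_wordpieces(tokens):
    """Merge WordPiece tokens, where a leading '##' marks a continuation."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without the marker.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# 'word' values copied from the pipeline output above.
tokens = ["Qi", "##ya", "##o", "##X", "##ue", "China"]
words = merge_wordpieces(tokens)
print(words)
```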


#### It can also group together tokens corresponding to the same entity

In [36]:
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
agg_result = token_classifier("My name is QiyaoXue and I am from China")
print(agg_result)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'entity_group': 'PER', 'score': 0.9586929, 'word': 'QiyaoXue', 'start': 11, 'end': 19}, {'entity_group': 'LOC', 'score': 0.99976164, 'word': 'China', 'start': 34, 'end': 39}]
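With the `"simple"` strategy, the score of a grouped entity appears to be the average of the scores of its tokens. The numbers in the two outputs above are consistent with that, as this small check shows (token scores copied from the ungrouped output):

```python
from statistics import mean

# Scores of the five I-PER tokens in the ungrouped output above.
token_scores = [0.99662614, 0.9735996, 0.94115335, 0.9444561, 0.9376296]

# Score reported for the grouped PER entity.
reported_group_score = 0.9586929

group_score = mean(token_scores)
print(round(group_score, 7))
```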


#### Reproducing the token classification pipeline step by step

In [37]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Qiyao Xue and I am a Chinese."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

print(inputs["input_ids"].shape)
print(outputs.logits.shape)




torch.Size([1, 16])
torch.Size([1, 16, 9])


#### Get the classification result

In [38]:
import torch
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
predictions = probabilities.argmax(dim=-1).tolist()
print(f"id2label{model.config.id2label}")
print(f"prediction id:{predictions}\nprediction label:{[model.config.id2label[predict] for predict in predictions]}")

id2label{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}
prediction id:[0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 0, 0, 0, 2, 0, 0]
prediction label:['O', 'O', 'O', 'O', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'I-MISC', 'O', 'O']


#### The start and end character positions can be obtained by setting ***return_offsets_mapping=True*** when passing the text to the tokenizer. Each returned (start, end) pair is a half-open interval: start is inclusive and end is exclusive

In [61]:
results = []
input_with_offsets = tokenizer(example, return_offsets_mapping=True)
print(input_with_offsets)
tokens = input_with_offsets.tokens()
offsets = input_with_offsets["offset_mapping"]
zero_label = model.config.id2label[0]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != zero_label:
        start, end = offsets[idx]
        results.append({"entity": label, "score": probabilities[idx][pred].item(), "word": tokens[idx], "start": start, "end": end})

print(results)


{'input_ids': [101, 1422, 1271, 1110, 24357, 2315, 1186, 17584, 1162, 1105, 146, 1821, 170, 1922, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 2), (3, 7), (8, 10), (11, 13), (13, 15), (15, 16), (17, 19), (19, 20), (21, 24), (25, 26), (27, 29), (30, 31), (32, 39), (39, 40), (0, 0)]}
O
[{'entity': 'I-PER', 'score': 0.9963464140892029, 'word': 'Qi', 'start': 11, 'end': 13}, {'entity': 'I-PER', 'score': 0.9570150375366211, 'word': '##ya', 'start': 13, 'end': 15}, {'entity': 'I-PER', 'score': 0.9778035283088684, 'word': '##o', 'start': 15, 'end': 16}, {'entity': 'I-PER', 'score': 0.9940474033355713, 'word': 'Xu', 'start': 17, 'end': 19}, {'entity': 'I-PER', 'score': 0.9704844951629639, 'word': '##e', 'start': 19, 'end': 20}, {'entity': 'I-MISC', 'score': 0.9973861575126648, 'word': 'Chinese', 'start': 32, 'end': 39}]
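Because the offsets are half-open intervals, they can be used directly as Python slice bounds on the original string. A short check using the offsets of the labeled tokens from the output above:

```python
example = "My name is Qiyao Xue and I am a Chinese."

# (start, end) offsets of the tokens labeled as entities in the output above.
entity_offsets = [(11, 13), (13, 15), (15, 16), (17, 19), (19, 20), (32, 39)]

# Slicing with [start:end] recovers the text of each token, without the '##' marker.
pieces = [example[start:end] for start, end in entity_offsets]
print(pieces)

# Slicing from the first start to the last end of the PER tokens recovers the
# full name, including the space that belongs to no token.
full_name = example[entity_offsets[0][0]:entity_offsets[4][1]]
print(full_name)
```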


#### There are generally two labeling schemes for token classification
* IOB2: use the B-XXX label at the beginning of every entity
* IOB1: use the B-XXX label only to separate two adjacent entities of the same type

In [60]:
label_map = model.config.id2label
results = []
idx = 0
while idx < len(predictions):
    label = label_map[predictions[idx]]
    if label == zero_label:
        idx += 1
        continue
    # strip the B- or I- prefix to get the entity type
    entity_type = label[2:]
    start, end = offsets[idx]
    idx += 1
    # extend the span while the following tokens continue the same entity
    while idx < len(predictions) and label_map[predictions[idx]] == f"I-{entity_type}":
        _, end = offsets[idx]
        idx += 1
    # idx now points at the first token after the entity, so no token is skipped
    results.append({"entity_group": entity_type, "word": example[start:end], "start": start, "end": end})

print(results)

[{'entity_group': 'PER', 'word': 'Qiyao Xue', 'start': 11, 'end': 20}, {'entity_group': 'MISC', 'word': 'Chinese', 'start': 32, 'end': 39}]
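The two labeling conventions listed above can be decoded by the same routine. The sketch below is a standalone illustration that works on tag strings only (no model or offsets): it opens a new entity at every B-XXX tag, and also at an I-XXX tag that does not continue an entity of the same type. The function name `decode_entities` and the example tag sequences are made up for illustration.

```python
def decode_entities(tags):
    """Group IOB tags into (entity_type, start_index, end_index) spans.

    end_index is exclusive. A span starts at every B-XXX tag, and also at an
    I-XXX tag that does not continue an open span of the same type.
    """
    spans = []
    current = None  # [entity_type, start, end] of the open span, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            if current:
                spans.append(tuple(current))
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and current and current[0] == entity_type:
            current[2] = i + 1  # extend the open span
        else:
            if current:
                spans.append(tuple(current))
            current = [entity_type, i, i + 1]
    if current:
        spans.append(tuple(current))
    return spans


# The same entities written in both conventions (hypothetical tag sequences).
b_only_between_entities = ["O", "I-PER", "I-PER", "B-PER", "O", "I-LOC"]
b_at_every_entity_start = ["O", "B-PER", "I-PER", "B-PER", "O", "B-LOC"]

print(decode_entities(b_only_between_entities))
print(decode_entities(b_at_every_entity_start))
```

Both calls should yield the same three spans, which is why the choice of convention matters mainly when two entities of the same type are adjacent.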
