The Transformers <code>pipeline()</code> function connects a model with its necessary preprocessing and postprocessing steps, so raw text can be passed in directly.

By default, the pipeline selects a pretrained model that has been fine-tuned for sentiment analysis in English.
* The model is downloaded and cached when the pipeline is first created
* Later runs reuse the cached model, so it does not need to be downloaded again

Three main steps happen when text is passed to a pipeline:
* The text is preprocessed into a format the model can understand
* The preprocessed inputs are passed to the model
* The model's predictions are post-processed into a readable result
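
The three steps above can be sketched with a toy stand-in. This is only an illustration under loud assumptions: the vocabulary, the weights, and the function names (`preprocess`, `toy_model`, `postprocess`, `toy_pipeline`) are all made up here, and a real pipeline uses a trained tokenizer and a neural network instead.

```python
import math

# Hypothetical toy vocabulary; a real pipeline uses a trained tokenizer.
VOCAB = {"i": 0, "love": 1, "hate": 2, "sandwiches": 3, "[UNK]": 4}
# Hypothetical (negative, positive) weight pair for each vocabulary id.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.5), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    tokens = text.lower().replace(".", "").split()
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def toy_model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline call does."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love sandwiches."))
print(toy_pipeline("I hate sandwiches."))
```

The output has the same shape as the real classifier's result (a label and a score), which is the point of the sketch; the numbers themselves mean nothing.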

In [1]:
!pip install transformers[sentencepiece]



In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I have been waiting 4 a huggingface course my entire life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.7453370094299316}]

In [3]:
sentences = ["I have been waiting for this delicious tuna sandwich my whole life",
             "I hate pickles"]
classifier(sentences)

[{'label': 'POSITIVE', 'score': 0.996371865272522},
 {'label': 'NEGATIVE', 'score': 0.998421311378479}]

## Zero-shot classification
* Lets us specify the candidate labels for classification ourselves, without relying on the labels the pretrained model was trained on
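
A rough sketch of how this can work: zero-shot pipelines built on natural language inference (NLI) models, as far as I know, turn each candidate label into a hypothesis sentence, ask the model how strongly the input entails it, and normalize those scores across labels. The helper name `zero_shot_scores` and the logits below are made up for illustration; only the idea is meant to carry over.

```python
import math


def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, sorted best first."""
    labels = list(entailment_logits)
    top = max(entailment_logits.values())
    exps = [math.exp(entailment_logits[label] - top) for label in labels]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


# Each candidate label becomes a hypothesis sentence (assumed template).
template = "This example is {}."
candidate_labels = ["education", "politics", "business"]
hypotheses = [template.format(label) for label in candidate_labels]
print(hypotheses)

# Made-up entailment logits standing in for the NLI model's output.
fake_logits = {"education": 2.0, "politics": 0.0, "business": 1.0}
print(zero_shot_scores(fake_logits))
```

Because the scores are normalized across the candidate labels, they sum to 1 and are returned from most to least likely, which matches the shape of the pipeline output.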

In [4]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
sentence = "This is a course about making sandwich"
candidate_labels = ["education","politics","business"]

classifier(sentence, candidate_labels)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about making sandwich',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.6514201760292053, 0.25742191076278687, 0.09115791320800781]}

## Text generation
* Provide a prompt and the model auto-completes it by generating the remaining text
* Results may differ between runs because generation involves randomness
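
The randomness comes from sampling the next token from a probability distribution. The sketch below imitates that with a hypothetical four-token distribution (`NEXT_TOKEN_PROBS` and `sample_tokens` are invented names, not part of any library) and shows that fixing a seed makes the draw repeatable.

```python
import random

# Hypothetical next-token distribution; a real model produces one over its
# whole vocabulary at every generation step.
NEXT_TOKEN_PROBS = {"to": 0.5, "we": 0.2, "the": 0.2, "sandwiches": 0.1}


def sample_tokens(n, seed=None):
    """Draw n tokens from the toy distribution using a private RNG."""
    rng = random.Random(seed)
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=n)


print(sample_tokens(5))          # unseeded: may differ between runs
print(sample_tokens(5, seed=0))  # seeded: same draw every run
```

With the real pipeline, calling `set_seed` from `transformers` before generating is the usual way to get repeatable output.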

In [5]:
from transformers import pipeline

generator = pipeline("text-generation")
prompt = "In this sandwich making class, we will teach you how"

generator(prompt)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this sandwich making class, we will teach you how to make sandwiches in a quick time.\n\nIf you have a big time interest in how to do most people's life as a few days away from home, here are some tips you should"}]

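A specific model from the Hub can be selected with the <code>model</code> argument. <code>max_length</code> caps the total length of the output and <code>num_return_sequences</code> sets how many sequences are returned.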

In [6]:
generator = pipeline("text-generation", model='distilgpt2')
prompt = "In this sandwich making class, we will teach you how"

generator(prompt,max_length=30, num_return_sequences=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this sandwich making class, we will teach you how to make an example of food. All the basic ingredients required are some simple ingredients and you will'},
 {'generated_text': 'In this sandwich making class, we will teach you how to learn different types of meatballs and make all of your breadballs from scratch. Learn how'}]

## The Inference API
* Lets you test available models directly in the browser
* Is also offered as a paid product

## Mask filling
Fill in the blanks in a given text. <code>top_k</code> sets how many candidate fills are returned.
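
As a minimal sketch of what `top_k` does, the snippet below keeps the highest-scoring candidates for the masked position. The helper `top_k_fills` is hypothetical, and the candidate scores are made up (the first two are only loosely rounded from the output further down); a real model scores every token in its vocabulary.

```python
import heapq

# Made-up probabilities for the masked position.
candidate_scores = {" making": 0.19, " homemade": 0.10, " eating": 0.05, " cold": 0.02}


def top_k_fills(template, scores, k, mask_token="<mask>"):
    """Return the k highest-scoring fills, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": token,
            # Substitute the candidate token for the mask in the template.
            "sequence": template.replace(mask_token, token.strip()),
        }
        for token, score in best
    ]


template = "This pasta making course will teach you about <mask> spaghetti."
for fill in top_k_fills(template, candidate_scores, k=2):
    print(fill)
```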

In [7]:
unmasker = pipeline("fill-mask")
unmasker("This pasta making course will teach you about <mask> spaghetti.",top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19449160993099213,
  'token': 442,
  'token_str': ' making',
  'sequence': 'This pasta making course will teach you about making spaghetti.'},
 {'score': 0.09701454639434814,
  'token': 17798,
  'token_str': ' homemade',
  'sequence': 'This pasta making course will teach you about homemade spaghetti.'}]

## Named Entity Recognition

* Find which parts of the input text correspond to entities such as persons, locations, or organizations.
* PER: person, ORG: organization, LOC: location
* <code>grouped_entities=True</code> tells the pipeline to group together the parts of the sentence that belong to the same entity.
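
To make the grouping idea concrete, here is a simplified sketch that merges adjacent tagged words into entity groups. It assumes a plain IOB-style tag scheme and hand-written predictions; `group_entities` is a hypothetical helper, and the real pipeline also has to handle subword pieces and scores, which this leaves out.

```python
def group_entities(tokens):
    """Merge adjacent tokens that share an entity type into one group.

    `tokens` is a list of (word, tag) pairs, where tag is "O" for no entity
    or an IOB-style tag such as "B-PER" or "I-PER".
    """
    groups = []
    current = None
    for word, tag in tokens:
        if tag == "O":
            current = None  # a non-entity word closes any open group
            continue
        prefix, entity = tag.split("-", 1)
        # Extend the open group only for an I- tag of the same type.
        if current is not None and prefix == "I" and current["entity_group"] == entity:
            current["word"] += " " + word
        else:
            current = {"entity_group": entity, "word": word}
            groups.append(current)
    return groups


# Hypothetical per-token predictions, written by hand for illustration.
tagged = [
    ("John", "B-PER"), ("Smith", "I-PER"), ("works", "O"), ("at", "O"),
    ("Hugging", "B-ORG"), ("Face", "I-ORG"), ("in", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"),
]
print(group_entities(tagged))
```

Without grouping, each word would be reported separately; with it, multi-word names come back as single entries. Recent `transformers` releases, to my knowledge, prefer `aggregation_strategy="simple"` over `grouped_entities=True` for the same effect.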

In [8]:
ner = pipeline("ner", grouped_entities=True)
text = "My name is John and I work at Starbucks in Georgetown"
ner(text)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9986809,
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.9959449,
  'word': 'Starbucks',
  'start': 30,
  'end': 39},
 {'entity_group': 'LOC',
  'score': 0.9956474,
  'word': 'Georgetown',
  'start': 43,
  'end': 53}]

In [9]:
ner(text, grouped_entities=True)

[{'entity_group': 'PER',
  'score': 0.9986809,
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.9959449,
  'word': 'Starbucks',
  'start': 30,
  'end': 39},
 {'entity_group': 'LOC',
  'score': 0.9956474,
  'word': 'Georgetown',
  'start': 43,
  'end': 53}]

## Question answering

In [10]:
qa = pipeline("question-answering")
question = "Where to can buy a sandwich?"
context = "Subway is a sandwich joint from United States that sells delicious sandwich and cookies in Kuala Lumpur"
qa(question=question,context=context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.8467773795127869, 'start': 0, 'end': 6, 'answer': 'Subway'}
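
The `start` and `end` values in the output above are character offsets into the context, with `end` exclusive, so the answer is extracted from the context rather than generated. A quick check, reusing the context and result from the cell above:

```python
context = "Subway is a sandwich joint from United States that sells delicious sandwich and cookies in Kuala Lumpur"
result = {"score": 0.8467773795127869, "start": 0, "end": 6, "answer": "Subway"}

# Slicing the context with the returned offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```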

## Summarization

In [11]:
summarizer = pipeline("summarization")
text = """Sandwich, in its basic form, slices of meat,  
cheese, or other food placed between two slices of bread.
Although this mode of consumption must be as old as meat 
and bread, the name was adopted only in the 18th century
for John Montagu, 4th earl of Sandwich. According to an 
often-cited account from a contemporary French travel book,
Sandwich had sliced meat and bread brought to him at the 
gaming table on one occasion so that he could continue 
to play as he ate; it seems more likely, however, 
that he ate these sandwiches as he worked at his desk or 
that the world became aware of them when he requested 
them in London society. His title lent the preparation 
cachet, and soon it was fashionable to serve sandwiches
on the European continent, and the word was incorporated 
into the French language. Since that time the 
sandwich has been incorporated into virtually every 
cuisine of the West by virtue of its simplicity of 
preparation, portability, and endless variety.
"""

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [12]:
summarizer(text)

[{'summary_text': ' Sandwich, in its basic form, slices of meat, cheese, or other food placed between two slices of bread . The name was adopted only in the 18th century for John Montagu, 4th earl of Sandwich . His title lent the preparation, and soon it was fashionable to serve sandwiches on European continent .'}]

## Translation

In [13]:
translator = pipeline("translation",model="Helsinki-NLP/opus-mt-fr-en")
text_fr = "Le sandwich est fait avec des ananas et du bacon"
translator(text_fr)



[{'translation_text': 'The sandwich is made with pineapple and bacon'}]

## Bias and limitations

In [16]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
text1 = "This man works as a [MASK]."
result = unmasker(text1)
print([r["token_str"] for r in result])

text2 = "This woman works as a [MASK]."
result = unmasker(text2)
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


In [40]:
text3 = "The best meal consists of egg, [MASK] and tuna."
result = unmasker(text3)
print([r["token_str"] for r in result])

['fish', 'chicken', 'rice', 'beef', 'meat']
