In [None]:
# Transformers installation
! pip install transformers

In [None]:
!pip install xformers

The easiest way to use a pretrained model on a given task is to use pipeline().

Transformers provides the following tasks out of the box:

* Sentiment analysis: is a text positive or negative?

* Text generation (in English): provide a prompt and the model will generate what follows.

* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).

* Question answering: provide the model with some context and a question, extract the answer from the context.

* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill the blanks.

* Summarization: generate a summary of a long text.

* Translation: translate a text into another language.

Let's see how this works for sentiment analysis (the other tasks are all covered in the task summary):

In [None]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

When you run this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will look at both later on, but as an introduction, the tokenizer's job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to make them readable. For instance:

In [4]:
classifier('We are very happy to show you the Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997994303703308}]
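
To make the post-processing step more concrete, here is a minimal sketch of how raw model outputs (logits) could be turned into a label and a score with a softmax. The logit values and the label order are assumptions chosen for illustration, and `postprocess` is a hypothetical helper, not the library's own code.

```python
import math

def postprocess(logits, labels=("NEGATIVE", "POSITIVE")):
    """Turn raw logits into a pipeline-style {'label', 'score'} dict (illustrative only)."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for one sentence; the label order is an assumption.
print(postprocess([-4.0, 4.5]))
```

The scores of all classes sum to 1, which is why a confident prediction shows a score close to 1.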

In [5]:
classifier("The pizza is not that great but the crust is awesome.")

[{'label': 'POSITIVE', 'score': 0.9998461008071899}]

That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a batch, returning a list of dictionaries like this one:

In [6]:
results = classifier(["We are very happy to show you the Transformers library.", "We hope you don't hate it."])
for result in results:
  print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is fairly neutral.
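
One way to act on such near-neutral scores is to flag them instead of trusting the label. The sketch below works on the rounded results printed above; `flag_uncertain` and the 0.75 cutoff are hypothetical choices for illustration, and the "other class" score assumes a two-class model whose scores sum to 1.

```python
# Pipeline-style results, copied from the rounded output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

def flag_uncertain(results, threshold=0.75):
    """Mark predictions whose top score falls below `threshold` (a hypothetical cutoff)."""
    flagged = []
    for result in results:
        # Assumes two classes: the other class receives the remaining probability.
        other_score = round(1 - result["score"], 4)
        flagged.append(
            {**result, "other_score": other_score, "uncertain": result["score"] < threshold}
        )
    return flagged

for row in flag_uncertain(results):
    print(row)
```

Under these assumptions the second sentence is flagged: its NEGATIVE score of 0.5309 leaves roughly 0.4691 for POSITIVE, so the model is close to undecided.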

By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can look at its model page to get more information about it. It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

Let's say we want to use another model; for instance, one that has been trained on French data. We can search through the model hub that gathers models pretrained on a lot of data by research labs, but also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags "French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's see how we can use it.

You can directly pass the name of the model to use to pipeline:
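
A minimal sketch of that call, using the `model` argument of `pipeline`. The wrapper `build_classifier` is a hypothetical helper added here so the import and the download only happen when it is called; the French test sentence in the comment is also just an example.

```python
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"

def build_classifier(model_name=MODEL_NAME):
    """Create a sentiment-analysis pipeline for the given model hub name."""
    # Imported inside the function so defining the helper has no side effects.
    from transformers import pipeline
    # The first call downloads and caches the model and its tokenizer.
    return pipeline("sentiment-analysis", model=model_name)

# Usage (downloads the model on first run):
# classifier = build_classifier()
# classifier("Nous sommes très heureux de vous présenter la bibliothèque Transformers.")
```

Passing an explicit model name also avoids relying on whichever default the library picks for the task.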
















**Named entity recognition (NER)**

In [8]:
# Named entity recognition (NER)
from transformers import pipeline
classifier = pipeline(task="ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# preds = classifier("Google is a search engine")
preds = classifier("India is my country")

In [10]:
preds = [
    {
        "entity": pred["entity"],
        "score": round(pred["score"], 4),
        "index": pred["index"],
        "word": pred["word"],
        "start": pred["start"],
        "end": pred["end"],
    }
    for pred in preds
]
print(*preds, sep="\n")

{'entity': 'I-LOC', 'score': 0.9998, 'index': 1, 'word': 'India', 'start': 0, 'end': 5}
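
The `start` and `end` fields are character offsets into the input string, and the `I-` prefix on the label marks the token's position inside an entity (IOB-style tagging). The sketch below reads both from the prediction printed above; `describe_entities` is a hypothetical helper, not part of the library.

```python
text = "India is my country"

# One prediction, copied from the output above.
preds = [
    {"entity": "I-LOC", "score": 0.9998, "index": 1, "word": "India", "start": 0, "end": 5},
]

def describe_entities(text, preds):
    """Return (surface text, entity type) pairs using the character offsets."""
    entities = []
    for pred in preds:
        # `start` is inclusive and `end` is exclusive, like a Python slice.
        surface = text[pred["start"]:pred["end"]]
        # Drop the "B-"/"I-" position prefix to keep only the entity type.
        entity_type = pred["entity"].split("-", 1)[-1]
        entities.append((surface, entity_type))
    return entities

print(describe_entities(text, preds))  # [('India', 'LOC')]
```

For multi-token entities, the token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into whole entities, which may be more convenient than merging them by hand.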


**Question Answering**

* Question-answering models are deep learning models that answer a question by extracting the answer from a given context.

In [11]:
from transformers import pipeline
question_answerer = pipeline(task="question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [12]:
preds = question_answerer(
    question="Where did sam lives",
    context="sam lives in India since from his childhood",
)
print(f"score: {round(preds['score'], 4)}, answer: {preds['answer']}")

score: 0.9815, answer: India


In [14]:
preds = question_answerer(
    question="What kind of sending technology is being used to protect tribal lands in the Amazon?",
    context="The use of remote sensing for the conservation of the Amazon is also being used by the indigenous tribes of the basin to protect their tribal lands from commercial",
)
print(f"score: {round(preds['score'], 2)}, answer: {preds['answer']}")

score: 0.98, answer: remote sensing
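
Both answers above are spans copied out of the context, which is what "extractive" question answering means. The sketch below locates such a span by plain string search; `locate_answer` is a hypothetical helper. The pipeline's result dict also carries its own `start` and `end` offsets, which are the ones to prefer when the answer text occurs more than once.

```python
context = "sam lives in India since from his childhood"
answer = "India"  # answer text from the first example above

def locate_answer(context, answer):
    """Return (start, end) character offsets of the first match, or None if absent."""
    start = context.find(answer)
    if start == -1:
        return None
    # `end` is exclusive, so context[start:end] gives back the answer.
    return start, start + len(answer)

span = locate_answer(context, answer)
print(span)                           # (13, 18)
print(context[span[0]:span[1]])       # India
```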


**Text Summarization**

In [15]:
from transformers import pipeline
summarizer = pipeline(task="summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [16]:
t = summarizer(""" Python is a high-level, general-purpose programming language.
Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms,
 including structured (particularly procedural), object-oriented and functional programming.
 It is often described as a "batteries included" language due to its comprehensive standard library.""")

Your max_length is set to 142, but your input_length is only 84. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)


In [17]:
t

[{'summary_text': ' Python is a high-level, general-purpose programming language . Its design philosophy emphasizes code readability with the use of significant indentation . Python is dynamically type-checked and garbage-collected . It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming .'}]
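
The warning above appears because the default `max_length` (142 tokens) is larger than the input (84 tokens), so the model has room to copy the text almost verbatim. A small sketch of picking a shorter limit follows; `suggest_max_length`, the 0.5 ratio, and the floor of 10 are hypothetical choices that simply reproduce the value suggested in the warning.

```python
def suggest_max_length(input_length, ratio=0.5, floor=10):
    """Suggest a `max_length` in tokens that is shorter than the input (illustrative)."""
    # Lengths are counted in tokens, not characters.
    return max(floor, int(input_length * ratio))

print(suggest_max_length(84))  # 42

# Usage with the summarizer defined earlier; `text` stands for the passage to summarize:
# summarizer(text, max_length=suggest_max_length(84), min_length=10)
```

Lowering `min_length` together with `max_length` may be needed, since a model's default minimum can exceed the new maximum.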

**Machine Translation**

In [18]:
# Translate English to Hindi:
from transformers import pipeline
text = "Hugging Face is a community-based open-source platform for machine learning."
translator = pipeline(task="translation", model="Helsinki-NLP/opus-mt-en-hi")


In [19]:
translated = translator(text)
translated

[{'translation_text': 'मशीन सीखने के लिए आम तौर पर खुली-ट-ट-टथिंग का एक समुदाय है।'}]

In [None]:
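# The Helsinki-NLP translation tokenizer relies on sentencepiece; install it before loading the translation pipeline.
!pip install sentencepiece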

**Language modeling**

* It's a task that predicts a word in a sequence of text: the next word for text generation, or a masked word for fill-mask.
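
To illustrate the idea of predicting a word from its context, here is a toy next-word predictor based on word-pair counts. It is only a sketch with a made-up corpus; the models used below are neural networks and do not work this way internally.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word was never followed."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # cat
```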

In [21]:
from transformers import pipeline
prompt = "Hugging Face is a community-based open-source platform for machine learning."
# prompt = "India is my country"
generator = pipeline(task="text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [22]:
generator(prompt)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Hugging Face is a community-based open-source platform for machine learning. It supports machine learning with the new NeuralNet extension, DeepLearning2. It includes a number of powerful built-in neural nets and a plethora of APIs for data mining'}]

In [23]:
# Fill in the blank
fill_mask = pipeline(task="fill-mask")

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [25]:
text = "Hugging Face is a community-based open-source <mask> for machine learning."
preds = fill_mask(text, top_k=1)
preds = [
    {
        "score": round(pred["score"], 4),
        "token": pred["token"],
        "token_str": pred["token_str"],
        "sequence": pred["sequence"],
    }
    for pred in preds
]
preds

[{'score': 0.224,
  'token': 3944,
  'token_str': ' tool',
  'sequence': 'Hugging Face is a community-based open-source tool for machine learning.'}]
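
The mask placeholder depends on the model: the default model here expects `<mask>`, while BERT-style models expect `[MASK]`. The sketch below builds the masked input from a plain sentence; `mask_word` is a hypothetical helper. With a loaded pipeline, the right placeholder can be read from `fill_mask.tokenizer.mask_token` instead of being hard-coded.

```python
def mask_word(text, word, mask_token="<mask>"):
    """Replace the first occurrence of `word` in `text` with `mask_token`."""
    if word not in text:
        raise ValueError(f"{word!r} not found in text")
    return text.replace(word, mask_token, 1)

sentence = "Hugging Face is a community-based open-source platform for machine learning."

print(mask_word(sentence, "platform"))
print(mask_word(sentence, "platform", mask_token="[MASK]"))

# Usage with the pipeline defined earlier:
# fill_mask(mask_word(sentence, "platform", fill_mask.tokenizer.mask_token), top_k=1)
```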