In [None]:
from google.colab import drive
drive.mount("/content/drive")

!pip install datasets evaluate transformers[sentencepiece]

The pipeline() makes it simple to use any model from the Hub for inference on language, computer vision, speech, and multimodal tasks. Even if you have no experience with a specific modality or aren’t familiar with the code behind the models, you can still run inference with the pipeline(). This tutorial will teach you to:

* Use a pipeline() for inference.
* Use a specific tokenizer or model.
* Use a pipeline() for audio, vision, and multimodal tasks.

# Pipeline usage

While each task has an associated pipeline(), it is simpler to use the general pipeline() abstraction, which contains all the task-specific pipelines. The pipeline() automatically loads a default model and a preprocessing class capable of inference for your task.

1. Start by creating a pipeline() and specify an inference task:

In [2]:
from transformers import pipeline 

generator = pipeline(task="text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



2. Pass your input text to the pipeline():

In [3]:
generator("One day, I was walking down to the")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "One day, I was walking down to the side and when I turned around suddenly people started coming round my right arm. I just shook my head and yelled, 'What can you do? You're crazy.' I can't remember what I said."}]

3. If you have more than one input, pass your input as a list:

In [5]:
generator(["I was absolutely right", "On that topic I would like to do"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': "I was absolutely right.\n\nThe whole picture, in short, is like the whole world in which you live because you're in your 50s and your 50s in your 60s, you can no longer be there to see what people are"}],
 [{'generated_text': 'On that topic I would like to do my usual list of questions, but in order to answer, I wanted to provide some answers.\n\nQ1 - How many are people using Microsoft Office?\n\nA1 - There is a pretty high'}]]
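When the input is a list, the result is a list with one entry per prompt, and each entry is itself a list of generation dictionaries, as in the output above. A minimal sketch of pairing each prompt with its generated text follows. The names `prompts`, `outputs`, and `paired` are illustrative, and the texts are shortened copies of the output above so the snippet runs without downloading a model.

```python
# Prompts passed to the pipeline, in the same order as above.
prompts = ["I was absolutely right", "On that topic I would like to do"]

# Shortened copy of the nested structure printed above:
# one inner list per prompt, one dict per generated sequence.
outputs = [
    [{"generated_text": "I was absolutely right.\n\nThe whole picture, in short, is like the whole world"}],
    [{"generated_text": "On that topic I would like to do my usual list of questions"}],
]

# Pair each prompt with the first generated sequence for that prompt.
paired = {
    prompt: generations[0]["generated_text"]
    for prompt, generations in zip(prompts, outputs)
}

for prompt, text in paired.items():
    print(f"{prompt!r} -> {text!r}")
```

With a live pipeline, `outputs` would be the value returned by `generator(prompts)`.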

Any additional parameters for your task can also be passed when you call the pipeline(). For the text-generation task, these parameters are forwarded to the model’s generate() method, which has several parameters for controlling the output. For example, if you want to generate more than one output, set the num_return_sequences parameter:

In [6]:
generator("How do you want to drink your coffee", num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'How do you want to drink your coffee? I\'m going to go have coffee… and maybe then we could start a family…"\n\nA few years ago, I took a look at a book called "What I Do With Your Ears."'},
 {'generated_text': "How do you want to drink your coffee?\n\n(I'm just a non-professional) Why don't you have a cup of coffee and then get it cold.\n\nDo you need to have the mug?\n\nWhy don't"},
 {'generated_text': 'How do you want to drink your coffee, your rice, your water, or your tea?" "Oh, my!" said the woman beside Goh-Kong. "We want to know how good your drink is." "It\'s our daily'}]
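Each `generated_text` above starts with the prompt itself. If you only want the newly generated part, one option is to strip the prompt afterwards. The sketch below assumes a hypothetical helper named `continuation` and uses shortened copies of the output above in place of a live pipeline call.

```python
def continuation(prompt: str, generated_text: str) -> str:
    """Return only the newly generated part of a text-generation result."""
    # generated_text normally starts with the prompt; fall back to the
    # full text if it does not.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


prompt = "How do you want to drink your coffee"

# Shortened copies of two of the sequences printed above.
results = [
    {"generated_text": "How do you want to drink your coffee? I'm going to go have coffee"},
    {"generated_text": "How do you want to drink your coffee, your rice, your water, or your tea?"},
]

new_text = [continuation(prompt, r["generated_text"]) for r in results]
print(new_text)
```

The text-generation pipeline also accepts a `return_full_text=False` argument, which should give a similar result without post-processing.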

# Choose a model and tokenizer

The pipeline() accepts any model from the Hub. Tags on the Hub let you filter for a model suited to your task. Once you’ve picked an appropriate model, load it with the corresponding AutoModelFor* class and AutoTokenizer. For example, load the AutoModelForCausalLM class for a causal language modeling task:

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")


Create a pipeline() for your task, and specify the model and tokenizer you’ve loaded:

In [8]:
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

Pass your input text to the pipeline() to generate some text:

In [9]:
generator("How do you want to drink your coffee", num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'How do you want to drink your coffee or coffee everyday and have them become your friend and friend? Did you like what they say when you come face to face on a daily basis?\n\n\n\n\n\n\n\n\n\n\n\n\n'},
 {'generated_text': "How do you want to drink your coffee regularly?\n\n\n\n\n\nThere aren't so many options for drinking, especially during peak days, but even in a peak-day scenario you should choose your drink most carefully.\n\nIt's"},
 {'generated_text': "How do you want to drink your coffee while you're watching TV in the morning or watching TV in the afternoon? Because you've been drinking a whole lot lately so you wouldn't want to drink it while you're watching television in the morning? Because"}]
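The warning printed by the default text-generation pipeline earlier recommends specifying a model name and revision rather than relying on the task default. A minimal sketch of that follows, assuming the `gpt2` checkpoint and the `6c0e608` revision named in that warning. `PINNED` and `build_generator` are hypothetical names, and the import sits inside the function so that defining it does not trigger a download.

```python
# Model name and revision taken from the warning printed by the default
# text-generation pipeline earlier in this tutorial.
PINNED = {"task": "text-generation", "model": "gpt2", "revision": "6c0e608"}


def build_generator():
    """Create a text-generation pipeline pinned to a fixed model revision."""
    # Imported here so that defining the function needs no download.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_generator()` downloads the pinned checkpoint on first use and returns a pipeline that can be called like `generator` above.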

# Audio pipeline

The pipeline() also supports audio tasks like audio classification and automatic speech recognition.

For example, let’s classify the emotion in this audio clip:

In [10]:
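from datasets import load_dataset
import torch

# Fix the random seed so that any randomly initialized weights are reproducible
torch.manual_seed(42)
# Load a small LibriSpeech demo split and take the path of the first audio clip
ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
audio_file = ds[0]["audio"]["path"]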



Find an audio classification model on the Model Hub for emotion recognition and load it in the pipeline():

In [11]:
audio_classifier = pipeline(task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")


Some weights of the model checkpoint at ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition were not used when initializing Wav2Vec2ForSequenceClassification: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.output.bias', 'classifier.output.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition and are newly initialized: ['projector.weight', 'classifier.bias', 'p


Pass the audio file to the pipeline():

In [12]:
preds = audio_classifier(audio_file)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

[{'score': 0.1315, 'label': 'calm'},
 {'score': 0.1307, 'label': 'neutral'},
 {'score': 0.1274, 'label': 'sad'},
 {'score': 0.1261, 'label': 'fearful'},
 {'score': 0.1242, 'label': 'happy'}]
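The five scores above are nearly equal, so the top label is only weakly preferred. This may be related to the load-time warning that some classifier weights were newly initialized. A small sketch of checking how decisive a prediction is, using the scores copied from the output above; `margin` is an illustrative measure, not something the pipeline returns.

```python
# Scores copied from the audio classification output above.
preds = [
    {"score": 0.1315, "label": "calm"},
    {"score": 0.1307, "label": "neutral"},
    {"score": 0.1274, "label": "sad"},
    {"score": 0.1261, "label": "fearful"},
    {"score": 0.1242, "label": "happy"},
]

# Sort by score so the code does not rely on the pipeline's ordering.
ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
top, runner_up = ranked[0], ranked[1]

# Gap between the best and second-best score; a small gap means the
# model barely prefers one label over the other.
margin = round(top["score"] - runner_up["score"], 4)
print(top["label"], margin)
```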

# Vision pipeline

Using a pipeline() for vision tasks is practically identical.

Specify your task and pass your image to the classifier. The image can be a URL or a local path. For example, what species of cat is shown in the image linked below?

In [13]:
vision_classifier = pipeline(task="image-classification")
preds = vision_classifier(images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] 
preds

No model was supplied, defaulted to google/vit-base-patch16-224 and revision 5dca96d (https://huggingface.co/google/vit-base-patch16-224).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'score': 0.4403, 'label': 'lynx, catamount'},
 {'score': 0.0343,
  'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'},
 {'score': 0.0321, 'label': 'snow leopard, ounce, Panthera uncia'},
 {'score': 0.0235, 'label': 'Egyptian cat'},
 {'score': 0.023, 'label': 'tiger cat'}]
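Several labels above list comma-separated synonyms for a single class. A short sketch of reducing each label to one display name, assuming the first name in the list is the one you want; the predictions are copied from the output above.

```python
# Predictions copied from the image classification output above.
preds = [
    {"score": 0.4403, "label": "lynx, catamount"},
    {"score": 0.0343, "label": "cougar, puma, catamount, mountain lion, painter, panther, Felis concolor"},
    {"score": 0.0321, "label": "snow leopard, ounce, Panthera uncia"},
    {"score": 0.0235, "label": "Egyptian cat"},
    {"score": 0.023, "label": "tiger cat"},
]

# Keep only the first synonym of each label as a short display name.
primary = [p["label"].split(",")[0].strip() for p in preds]
print(primary)
```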

# Multimodal pipeline

The pipeline() supports more than one modality. For example, a visual question answering (VQA) task combines text and image. You can use any image you like and any question about it; the image can be a URL or a local path.

For example, if you use the same image from the vision pipeline above:

In [14]:
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
question = "where is the cat?"

Create a pipeline() for the vqa task and pass it the image and question:

In [15]:
vqa = pipeline("vqa")
preds = vqa(image=image, question=question)
preds = [{"score": round(pred["score"], 4), "answer": pred["answer"]} for pred in preds]
preds

No model was supplied, defaulted to dandelin/vilt-b32-finetuned-vqa and revision 4355f59 (https://huggingface.co/dandelin/vilt-b32-finetuned-vqa).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'score': 0.9112, 'answer': 'snow'},
 {'score': 0.8796, 'answer': 'in snow'},
 {'score': 0.6717, 'answer': 'outside'},
 {'score': 0.0291, 'answer': 'on ground'},
 {'score': 0.027, 'answer': 'ground'}]
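The scores above add up to more than 1, so they do not form a probability distribution over answers, and several answers can score highly at once. A minimal sketch of keeping every answer above a cutoff, using the output copied from above; the `THRESHOLD` value of 0.5 is an arbitrary assumption for illustration.

```python
# Predictions copied from the VQA output above.
preds = [
    {"score": 0.9112, "answer": "snow"},
    {"score": 0.8796, "answer": "in snow"},
    {"score": 0.6717, "answer": "outside"},
    {"score": 0.0291, "answer": "on ground"},
    {"score": 0.027, "answer": "ground"},
]

# Arbitrary cutoff chosen for illustration only.
THRESHOLD = 0.5

# Keep every answer whose score clears the cutoff.
confident = [p["answer"] for p in preds if p["score"] >= THRESHOLD]

# The scores sum to more than 1, so they are not mutually exclusive probabilities.
total = sum(p["score"] for p in preds)

print(confident)
print(total > 1.0)
```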