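The most basic object in the Transformers library is the pipeline() function. It connects a model with the preprocessing and postprocessing steps it needs, so we can pass in any text directly and get a readable result.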

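Passing text to a pipeline involves three main steps:
1. The text is preprocessed into a format the model can understand.
2. The preprocessed input is passed to the model.
3. The model's predictions are postprocessed into a final result that humans can understand.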
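
The three steps can be sketched with a toy example. This is not how the library is implemented: the vocabulary, weights, and function names below are hypothetical and exist only to show how preprocessing, the model call, and postprocessing fit together.

```python
import math

# Hypothetical toy vocabulary and weights, for illustration only.
VOCAB = {"love": 0, "great": 1, "hate": 2, "bad": 3}
# One [NEGATIVE, POSITIVE] logit contribution per token id.
WEIGHTS = [[-1.0, 1.0], [-0.5, 0.5], [1.0, -1.0], [0.5, -0.5]]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the toy model understands."""
    return [VOCAB[word] for word in text.lower().split() if word in VOCAB]


def model(token_ids):
    """Step 2: produce one raw score (logit) per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        for i in range(len(LABELS)):
            logits[i] += WEIGHTS[token_id][i]
    return logits


def postprocess(logits):
    """Step 3: convert logits into a label and a probability via softmax."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a real pipeline does behind one call."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this great class"))
```

The returned dictionary has the same `label` and `score` keys as the real sentiment pipeline output, which is the only part of this sketch taken from the library.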

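Some of the currently available pipelines are:
feature-extraction (get the vector representation of a text)
fill-mask (fill in the blanks)
ner (named entity recognition)
question-answering (question answering)
sentiment-analysis (sentiment analysis)
summarization (summarization)
text-generation (text generation)
translation (translation)
zero-shot-classification (zero-shot classification)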

In [4]:
%pip install sentencepiece



In [5]:
# Install Transformers together with its sentencepiece extra.
# The requirement is quoted so that zsh does not treat the square brackets as a glob pattern.
%pip install "transformers[sentencepiece]"


## sentiment-analysis

In [1]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for the class my whole life", "I do not like it"])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9945021867752075},
 {'label': 'NEGATIVE', 'score': 0.998218834400177}]
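
The log message above recommends naming the model and revision explicitly. A minimal sketch of doing so follows; the model name and revision are copied from that log, while the helper name `build_classifier` is hypothetical.

```python
# Model name and revision copied from the log message above.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    return pipeline(**PINNED_SENTIMENT)
```

Calling `build_classifier()` should give a classifier that behaves like the one above, without depending on whatever default the installed library version picks.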

## zero-shot-classification
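Zero-shot classification: we need to classify text that has no labels. This is a common scenario in real projects, because annotating text is usually time-consuming and requires domain expertise. The zero-shot-classification pipeline is very powerful for this task: it lets you specify the labels to classify against directly, so you do not have to rely on the labels of the pretrained model.
The example below classifies a sentence as education, politics, or business.
This pipeline is called zero-shot because you do not need to fine-tune the model on your data to use it. It directly returns probability scores for any list of labels you want.
You can also classify text with any other label set you like.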

In [2]:
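zero_shot_classifier = pipeline("zero-shot-classification")
# By default the labels are treated as mutually exclusive, so the scores below sum to 1.
# Pass multi_label=True to score each label independently instead.
zero_shot_classifier("This is a class about investment", candidate_labels=["education", "politics", "business"])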

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a class about investment',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.840237557888031, 0.10738636553287506, 0.052376046776771545]}
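
The scores above sum to roughly 1, which is what single-label scoring produces. The sketch below illustrates, as far as I understand the pipeline's postprocessing, how single-label and multi-label scoring differ. The logits are made up for illustration and do not come from the model.

```python
import math


def softmax(values):
    """Normalize a list of logits into probabilities that sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (contradiction, entailment) logits per candidate label.
nli_logits = {
    "business": (-2.0, 3.0),
    "education": (0.5, 1.0),
    "politics": (1.5, 0.2),
}
labels = list(nli_logits)

# Single-label: softmax over the entailment logits of all labels,
# so the labels compete and the scores sum to 1.
single = dict(zip(labels, softmax([nli_logits[label][1] for label in labels])))

# Multi-label: softmax over (contradiction, entailment) for each label on its own,
# so every label gets an independent score between 0 and 1.
multi = {label: softmax(list(nli_logits[label]))[1] for label in labels}

print(single)
print(multi)
```

Single-label scoring suits questions with exactly one right answer. Multi-label scoring suits texts that may belong to several categories at once.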

## text-generation 
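The main use of text generation is that you provide some text and the model completes it by generating the rest. This is similar to the predictive text feature on many phones. Text generation involves randomness, so it is normal if you do not get the same results as shown below.
The num_return_sequences argument controls how many candidate sequences are generated, and max_length controls the total length of the output text (in tokens, prompt included).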

In [6]:
generator = pipeline("text-generation")
generator("In the following 90 days, let's work together to", num_return_sequences=5, max_length=20)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In the following 90 days, let's work together to bring these characters back to life, and I"},
 {'generated_text': "In the following 90 days, let's work together to increase awareness and to expand our resources.\n"},
 {'generated_text': "In the following 90 days, let's work together to solve some difficult issues, such as how to"},
 {'generated_text': "In the following 90 days, let's work together to make our game more consistent.\n\n1"},
 {'generated_text': "In the following 90 days, let's work together to find every last piece of information on her life"}]
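
The run-to-run variation comes from sampling. The toy sketch below shows the idea with a made-up next-word distribution: the probabilities and the function name are hypothetical, and a real model samples token by token from a far larger vocabulary.

```python
import random

# Hypothetical next-word probabilities after a prompt, for illustration only.
NEXT_WORD_PROBS = {"solve": 0.4, "find": 0.3, "make": 0.2, "bring": 0.1}


def sample_continuations(num_return_sequences, seed=None):
    """Draw several next words at random, weighted by their probabilities."""
    rng = random.Random(seed)
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=num_return_sequences)


print(sample_continuations(5))          # may differ from run to run
print(sample_continuations(5, seed=0))  # identical on every run
```

Fixing the seed makes the draw repeatable. The library exposes a `set_seed` helper for the same purpose when you need reproducible generations.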

## fill-mask
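Fill-mask: fill in the blanks in a given text.
The top_k argument controls how many candidates are shown. Note that the model fills in the special <mask> word, usually called the mask token. Different mask-filling models may use different mask tokens, so verify the correct one when exploring other models. One way to check is to look at the mask token used in the widget for that model.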

In [8]:
unmasker = pipeline("fill-mask")
unmasker("Usually, the color of apples is red or <mask>.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.24010099470615387,
  'token': 5718,
  'token_str': ' yellow',
  'sequence': 'Usually, the color of apples is red or yellow.'},
 {'score': 0.18887124955654144,
  'token': 14327,
  'token_str': ' purple',
  'sequence': 'Usually, the color of apples is red or purple.'}]
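
Because the mask token depends on the model, it can help to keep it out of the prompt text. The sketch below is one way to do that. The helper name and the `{mask}` placeholder are hypothetical; `<mask>` is the token used above, and `[MASK]` is the BERT convention.

```python
# Mask tokens for two model families; other models may use something else.
KNOWN_MASK_TOKENS = {
    "distilroberta-base": "<mask>",
    "bert-base-uncased": "[MASK]",
}


def build_masked_prompt(template, mask_token):
    """Insert a model-specific mask token into a template containing {mask}."""
    if "{mask}" not in template:
        raise ValueError("template must contain a {mask} placeholder")
    return template.format(mask=mask_token)


template = "Usually, the color of apples is red or {mask}."
print(build_masked_prompt(template, KNOWN_MASK_TOKENS["distilroberta-base"]))
print(build_masked_prompt(template, KNOWN_MASK_TOKENS["bert-base-uncased"]))
```

With a loaded pipeline, `unmasker.tokenizer.mask_token` returns the token for that model, so the lookup table is only needed before the model is loaded.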

## ner (named entity recognition)

What does NER do?
It identifies and classifies key information in text, extracts information from unstructured text, sorts and ranks information by importance, helps identify private information, and helps determine whether certain subjects appear in a text.

What is NER used for?
Chatbots, sentiment analysis tools, search engines, healthcare, finance, human resources, customer support, higher education, social media analysis, and research.

What are named entities?
Names of individuals, names of organizations, locations, times, quantities, medical codes, monetary values, percentages, events, and products.


In [12]:
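# grouped_entities=True tells the pipeline to regroup the parts of the sentence that belong to the same entity.
# For example, "New" and "York" are merged into a single location even though the name spans several words.
ner = pipeline("ner", grouped_entities=True)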
ner("George's hometown is Crystal City and he works at Amazon in New York.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9993166,
  'word': 'George',
  'start': 0,
  'end': 6},
 {'entity_group': 'LOC',
  'score': 0.9538238,
  'word': 'Crystal City',
  'start': 21,
  'end': 33},
 {'entity_group': 'ORG',
  'score': 0.998565,
  'word': 'Amazon',
  'start': 50,
  'end': 56},
 {'entity_group': 'LOC',
  'score': 0.99916613,
  'word': 'New York',
  'start': 60,
  'end': 68}]
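
The `start` and `end` fields are character offsets into the input string. The sketch below copies the entities from the output above (scores omitted) and uses the offsets to collect the entity text by type.

```python
from collections import defaultdict

text = "George's hometown is Crystal City and he works at Amazon in New York."

# Entities copied from the output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "George", "start": 0, "end": 6},
    {"entity_group": "LOC", "word": "Crystal City", "start": 21, "end": 33},
    {"entity_group": "ORG", "word": "Amazon", "start": 50, "end": 56},
    {"entity_group": "LOC", "word": "New York", "start": 60, "end": 68},
]

by_type = defaultdict(list)
for entity in entities:
    # Slice the original text rather than trusting the "word" field,
    # which keeps the original casing and spacing.
    by_type[entity["entity_group"]].append(text[entity["start"]:entity["end"]])

print(dict(by_type))
```

Slicing by offset is also what you would use to highlight or redact entities in the original text.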

## question-answering
Answers a question using information from a given context.

In [13]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where does George work?",
    context="George's hometown is Crystal City and he works at Amazon in New York.",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.5575575232505798, 'start': 50, 'end': 56, 'answer': 'Amazon'}
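
The score here is noticeably lower than in the earlier examples, so an application may prefer to abstain when the model is unsure. The sketch below shows one possible approach, using the result above. The helper name and the cutoff values are hypothetical and would need tuning on real data.

```python
context = "George's hometown is Crystal City and he works at Amazon in New York."

# Result copied from the output above.
result = {"score": 0.5575575232505798, "start": 50, "end": 56, "answer": "Amazon"}


def answer_or_none(result, min_score):
    """Return the answer only if its score reaches min_score, else None."""
    return result["answer"] if result["score"] >= min_score else None


# The answer is an extracted span: start/end index into the context.
print(context[result["start"]:result["end"]])
print(answer_or_none(result, min_score=0.5))
print(answer_or_none(result, min_score=0.9))
```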

## summarization
Summarization is the task of reducing a text to a shorter one while keeping all (or most) of the important information.

In [19]:
summarizer = pipeline("summarization")
summarizer("""The Department of Government Efficiency, or DOGE, run by President Donald Trump's billionaire adviser and Tesla CEO Elon Musk, has gained access to sensitive Treasury data including Social Security and Medicare customer payment systems, according to two people familiar with the situation.
The move by DOGE, a Trump administration task force assigned to find ways to fire federal workers, cut programs and slash federal regulations, means it could have wide leeway to access important taxpayer data, among other things.
The New York Times first reported the news of the group's access of the massive federal payment system. The two people who spoke to The Associated Press spoke on condition of anonymity because they were not authorized to speak publicly.
The highest-ranking Democrat on the Senate Finance Committee, Ron Wyden of Oregon, on Friday sent a letter to Trump's Treasury Secretary Scott Bessent expressing concern that "officials associated with Musk may have intended to access these payment systems to illegally withhold payments to any number of programs."
"To put it bluntly, these payment systems simply cannot fail, and any politically motivated meddling in them risks severe damage to our country and the economy," Wyden said.
""", max_length=30, min_length=15, do_sample=False)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The Department of Government Efficiency, or DOGE, gained access to Social Security and Medicare payment systems . The move means it could have'}]
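
With max_length=30 the summary above stops in the middle of its second sentence. One option is to raise max_length. Another is to trim the unfinished fragment afterwards, as in the naive sketch below. The helper name is hypothetical, and the punctuation check would mishandle abbreviations.

```python
def keep_complete_sentences(summary):
    """Drop a trailing fragment that does not end with sentence punctuation."""
    text = summary.strip()
    last_end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    if last_end == -1:
        return text  # no complete sentence found; return the text unchanged
    # Also remove the space this model leaves before a period.
    return text[: last_end + 1].replace(" .", ".")


# Summary text copied from the output above.
summary = (
    " The Department of Government Efficiency, or DOGE, gained access to "
    "Social Security and Medicare payment systems . The move means it could have"
)
print(keep_complete_sentences(summary))
```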

## translation
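For translation, you can use a default model by including a language pair in the task name (for example, "translation_en_to_fr"), but the simplest way is to pick the model you want from the Model Hub.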

In [4]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face")

[{'translation_text': 'This course is produced by Hugging Face'}]
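
Both naming schemes mentioned in this section can be built from a language pair. The helpers below are a hypothetical sketch. The model id pattern is inferred from "Helsinki-NLP/opus-mt-fr-en", and not every language pair has a model on the Hub, so check that the id exists before using it.

```python
def translation_task_name(src, tgt):
    """Build a task name such as "translation_en_to_fr"."""
    return f"translation_{src}_to_{tgt}"


def opus_mt_model_id(src, tgt):
    """Guess a Helsinki-NLP model id from a language pair.

    Assumption: the id follows the "opus-mt-<src>-<tgt>" pattern seen in
    "Helsinki-NLP/opus-mt-fr-en". Not every pair exists on the Hub.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(translation_task_name("en", "fr"))
print(opus_mt_model_id("fr", "en"))
```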

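## Using other models from the Hub in a pipeline
The previous examples used default models, but you can also choose a specific model from the Hub and use it for a given task.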

In [7]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use R.P. In practice, there are two classes (1) where you will have to'},
 {'generated_text': "In this course, we will teach you how to control your energy flow through your body and your mind.\n\n\nLet's look at some of"}]