# CH1 Quick Start

In [1]:
from transformers import pipeline

# Sentiment analysis
classifier = pipeline(task="sentiment-analysis")
result = classifier("""we'd like to introduce our new goods to you! hope you like 
                    these""")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9998639822006226}]




result: 
```python
[{'label': 'POSITIVE', 'score': 0.9998639822006226}]
```
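`label` is the predicted sentiment class, and `score` is the model's confidence in it, i.e. the probability assigned to that class.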

In [3]:
# For multiple inputs, pass them to the pipeline as a list.
# The call then returns a list of dictionaries, one per input.
inputs = [
    "We are happy to show the transformers library.",
    "We hope you don't hate it.",
]
results = classifier(inputs)

for result in results:
    print(
        f"""label: {result['label']}, score: {round(result['score'], 4)}"""
    )
# round(number, ndigits=None) rounds number to ndigits decimal places.
# ndigits may be negative, which rounds to tens, hundreds, and so on.

label: POSITIVE, score: 0.9998
label: NEGATIVE, score: 0.5309

results: 

```python
label: POSITIVE, score: 0.9998
label: NEGATIVE, score: 0.5309
```

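The `pipeline()` function can also run a task over an entire dataset.

For larger inputs (such as speech or vision data), pass a generator to the pipeline rather than a list, so that the inputs do not all have to be loaded into memory at once.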
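The idea of feeding inputs lazily can be sketched without downloading a model. In the minimal example below, `stream_texts`, `stub_classifier`, and `run` are hypothetical names introduced only for illustration; `stub_classifier` is a stand-in that mimics how a pipeline consumes an iterable one item at a time, not the real library code.

```python
from typing import Callable, Iterable, Iterator


def stream_texts(n: int) -> Iterator[str]:
    """Yield inputs one at a time instead of building a full list in memory."""
    for i in range(n):
        yield f"My example sentence number {i}"


def stub_classifier(texts: Iterable[str]) -> Iterator[dict]:
    """Hypothetical stand-in for a pipeline: consumes inputs lazily."""
    for text in texts:
        # A fixed fake prediction; a real pipeline would run the model here.
        yield {"label": "POSITIVE", "score": 1.0, "n_chars": len(text)}


def run(classifier: Callable[[Iterable[str]], Iterable[dict]], n: int) -> list:
    """Collect the labels as the generator is consumed."""
    return [out["label"] for out in classifier(stream_texts(n))]


labels = run(stub_classifier, 3)
print(labels)
```

With Transformers installed, the same generator should be usable with the `classifier` pipeline created earlier, for example `for out in classifier(stream_texts(1000)): ...`, so that results arrive as the inputs are produced.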

## Loading a pretrained model and its tokenizer with AutoModelForSequenceClassification and AutoTokenizer (discussed in detail in the next section)

Example:
### PyTorch
```python
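from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multilingual sentiment checkpoint, the same one loaded later in this chapter
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)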
```

### TensorFlow
```python
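from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Multilingual sentiment checkpoint, the same one loaded later in this chapter
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)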
```

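Specifying the model and tokenizer when calling `pipeline()` lets it handle text in more languages:
`classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)`

In the code, the AutoModelForSequenceClassification and AutoTokenizer classes work together to supply the model and tokenizer that `pipeline()` uses.

An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to choose the AutoClass that fits your task, together with its associated preprocessing class.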

### AutoTokenizer

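A tokenizer preprocesses text into arrays of numbers that are fed to the model. Tokenization is governed by several rules, including how a word is split and at what level it is split (see the [tokenizer summary](tokenizer_summary.ipynb) for details). Most importantly, you must instantiate the tokenizer with the same model name the model was pretrained with, so that the tokenization rules are identical.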

In [4]:
# Load the tokenizer with AutoTokenizer
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [5]:
# Pass the text to the tokenizer
text = "We are very happy to show you the 🤗 Transformers library."
encoding = tokenizer(text)

print(encoding)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


result:
```python
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
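The tokenizer returns a dictionary containing `input_ids`, `token_type_ids`, and `attention_mask`:
- `input_ids`: the index of each token in the tokenizer's vocabulary
- `token_type_ids`: the segment each token belongs to (all 0 for a single-sentence input)
- `attention_mask`: whether each token should be attended to (1) or ignored (0, for example padding)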
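To make these fields concrete, here is a minimal sketch of a toy tokenizer. It is a deliberate simplification: the vocabulary and ids are made up, and it splits on whitespace, whereas the real tokenizer splits text into subword pieces. `vocab` and `toy_encode` are hypothetical names used only here.

```python
# Toy vocabulary; the ids are invented for illustration (not BERT's real ids).
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "we": 4, "hope": 5, "you": 6, "like": 7, "it": 8,
}


def toy_encode(text: str) -> dict:
    """Lowercase, split on whitespace, then look each token up in the vocabulary."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the unknown-token id.
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single segment
        "attention_mask": [1] * len(input_ids),  # all real tokens, no padding
    }


toy_encoding = toy_encode("We hope you enjoy it")
print(toy_encoding)
```

The word "enjoy" is not in the toy vocabulary, so it maps to the `[UNK]` id. The `100` in the real output above sits at the position of the emoji, which is presumably the same kind of unknown-token fallback.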

In [6]:
# The tokenizer also accepts a list of inputs, and can pad and truncate the text so that the batch has a uniform length
# PyTorch
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.",
     "We hope you don't hate it",],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
# padding=True: pad shorter sequences with the padding token up to the length of the longest sequence in the batch
# truncation=True: truncate sequences longer than max_length down to max_length
# max_length=512: maximum sequence length, measured in tokens
# return_tensors="pt": return the encoded batch as PyTorch tensors

In [7]:
# TensorFlow
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.",
     "We hope you don't hate it",],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf",
)


**Note**: For more details on tokenizers, see the [preprocessing tutorial](preprocess.ipynb).
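What padding and truncation do to a batch can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the ids are made up, and unlike a real tokenizer it does not preserve the closing special token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(toy_batch)
```

After the call, both rows have the same length, which is what allows the batch to be stacked into a single rectangular tensor.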

### AutoModel

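Transformers provides a simple, unified way to load pretrained instances. You load an AutoModel the same way you load an AutoTokenizer; the only difference is that you must select the correct AutoModel for the task.

For text classification, load AutoModelForSequenceClassification.

By default, weights are loaded in full precision (`torch.float32`) regardless of the data type they are stored in (for example `torch.float16`).

Setting `torch_dtype="auto"` loads the weights in the data type defined in the model's `config.json`, which automatically selects the most memory-efficient data type.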

In [9]:
# PyTorch
from transformers import AutoModelForSequenceClassification

pt_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
)

Now pass the preprocessed batch of inputs to the model. The batch is a dictionary, so unpack it with `**` when passing it.

In [11]:
pt_outputs = pt_model(**pt_batch)

pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6222, -2.7745, -0.8967,  2.0137,  3.3064],
        [-0.0182, -0.2979, -0.1277, -0.1001,  0.3065]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
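The `**` unpacking in the call above can be illustrated without a real model. In this minimal sketch, `toy_model` is a hypothetical stand-in for a model's forward call; it only shows that unpacking a dictionary is equivalent to passing each key as a keyword argument.

```python
def toy_model(input_ids, attention_mask, token_type_ids=None):
    """Hypothetical stand-in for a model call that takes keyword inputs."""
    # Count the real (non-padded) tokens in each sequence.
    return [sum(mask) for mask in attention_mask]


toy_inputs = {
    "input_ids": [[101, 7, 8, 102], [101, 7, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
}

# Unpacking the dictionary ...
unpacked = toy_model(**toy_inputs)
# ... is the same as spelling out each keyword argument.
explicit = toy_model(
    input_ids=toy_inputs["input_ids"],
    attention_mask=toy_inputs["attention_mask"],
)
print(unpacked, explicit)
```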

The model outputs its final activations in the `logits` attribute. Apply the softmax function to convert the logits into probabilities.

In [13]:
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
# dim=-1: apply softmax along the last dimension (the class dimension)
print(pt_predictions)

tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2017, 0.1525, 0.1808, 0.1859, 0.2791]], grad_fn=<SoftmaxBackward0>)
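As a cross-check, the same probabilities can be recomputed from the printed logits with a hand-written softmax. This is a minimal sketch in pure Python; the logits are copied from the rounded output shown earlier, so the results should agree with the tensor above only to about four decimal places.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output shown earlier.
logits = [
    [-2.6222, -2.7745, -0.8967, 2.0137, 3.3064],
    [-0.0182, -0.2979, -0.1277, -0.1001, 0.3065],
]
probs = [softmax(row) for row in logits]
# Index of the most probable class for each input.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]

for p in probs:
    print([round(x, 4) for x in p])
print(predicted)
```

Each row sums to 1, and the largest logit in a row always becomes the largest probability. To turn a class index into a readable label, the model configuration's `id2label` mapping (for example `pt_model.config.id2label`) can be consulted.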


Correspondingly, Transformers provides TFAutoModel classes for TensorFlow with the same interface as their PyTorch counterparts.

In [None]:
# TensorFlow
from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

In [None]:
tf_outputs = tf_model(tf_batch)

As with PyTorch, apply the softmax function to obtain the probabilities.

In [None]:
import tensorflow as tf

