<a href="https://colab.research.google.com/github/zcongfly/huggingface-nlp-learning-note/blob/main/06_Putting_it_all_together_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Putting it all together (PyTorch)

In [None]:
# Install the Transformers, Datasets, and Evaluate libraries to run this notebook.
!pip install datasets evaluate transformers[sentencepiece]

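In the last few sections, we've been doing most of the work by hand. We've explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.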

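However, as we saw in section 2, the Transformers API can handle all of this for us with a high-level function that we'll dive into here. When you call your tokenizer directly on a sentence, you get back inputs that are ready to pass through your model: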

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Chinese example sentences, kept in Chinese because the checkpoint is a Chinese model
sequences = [
    "我欲穿花寻路，误入白云深处。",
    "为中华之崛起而读书！",
]

model_inputs = tokenizer(sequences)
print(model_inputs)

{'input_ids': [[101, 2769, 3617, 4959, 5709, 2192, 6662, 8024, 6428, 1057, 4635, 756, 3918, 1905, 511, 102], [101, 711, 704, 1290, 722, 2307, 6629, 5445, 6438, 741, 8013, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs)

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


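Here, the model_inputs variable contains everything the model needs to run properly. For DistilBERT, that means the input IDs and the attention mask. For models that accept additional inputs, the tokenizer object returns those as well; BERT, for example, also receives token_type_ids, as the bert-base-chinese output above shows.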
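
To make the difference between the two checkpoints concrete, the sketch below compares abbreviated copies of the two outputs printed above using plain Python dictionaries, so it runs without downloading anything. The function forward_stub is a hypothetical stand-in for a model's forward pass, not part of Transformers; it only illustrates how the keys of the tokenizer output become keyword arguments.

```python
# Abbreviated copies of the two tokenizer outputs printed above.
bert_inputs = {
    "input_ids": [101, 2769, 3617, 102],
    "token_type_ids": [0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1],
}
distilbert_inputs = {
    "input_ids": [101, 1045, 1005, 102],
    "attention_mask": [1, 1, 1, 1],
}

# Keys that appear for BERT but not for DistilBERT.
extra_keys = sorted(set(bert_inputs) - set(distilbert_inputs))
print(extra_keys)


def forward_stub(input_ids, attention_mask, token_type_ids=None):
    """Hypothetical stand-in for a model call; reports which inputs it received."""
    received = ["input_ids", "attention_mask"]
    if token_type_ids is not None:
        received.append("token_type_ids")
    return received


# The ** operator turns each dictionary key into a keyword argument.
print(forward_stub(**bert_inputs))
print(forward_stub(**distilbert_inputs))
```

Because the tokenizer and the model come from the same checkpoint, the keys the tokenizer produces are the ones the model expects, which is why the same unpacking pattern works for both.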

As we'll see in the examples below, this method is very powerful. First, it can tokenize a single sequence:

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles multiple sequences at a time, with no change to the API:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

It can pad according to several objectives:

In [None]:
# Pads every sequence up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pads every sequence up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pads every sequence up to the specified max length; sequences that are
# already longer are left as they are unless truncation is also enabled
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
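
The padding objectives above can be imitated on plain ID lists. The following is a minimal sketch, not the library's implementation: pad_batch is a hypothetical helper, the pad ID of 0 is an assumption that matches the standard BERT vocabularies, and the second sequence is made up for illustration.

```python
PAD_ID = 0  # assumption: [PAD] has ID 0, as in the standard BERT vocabularies


def pad_batch(batch, target_length=None, pad_id=PAD_ID):
    """Pad each ID list on the right; return padded IDs and attention masks.

    With target_length=None, the longest sequence in the batch sets the length.
    Sequences already longer than the target are left unchanged, because
    padding only ever adds tokens.
    """
    if target_length is None:
        target_length = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(0, target_length - len(ids))
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    # The 16 IDs printed earlier for the English example sentence.
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    # A short, made-up sequence (placeholder IDs between [CLS] and [SEP]).
    [101, 7, 8, 9, 102],
]

# Comparable to padding="longest"
ids_longest, mask_longest = pad_batch(batch)
# Comparable to padding="max_length", max_length=8
ids_fixed, mask_fixed = pad_batch(batch, target_length=8)

print([len(ids) for ids in ids_longest])
print(mask_longest[1])
print([len(ids) for ids in ids_fixed])
```

In the second call the 16-token sequence keeps its full length, which is why a fixed max_length is usually combined with truncation when a rectangular batch is needed.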

It can also truncate sequences:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
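
To see what max_length=8 leaves of the first sentence, here is a rough sketch on plain ID lists. It assumes a BERT-style layout in which ID 101 opens and ID 102 closes every sequence (as in the outputs printed earlier) and in which both count toward the length limit. The helper truncate_with_specials is hypothetical and only mimics the effect; it is not how the library is implemented.

```python
def truncate_with_specials(content_ids, max_length, cls_id=101, sep_id=102):
    """Keep as many content tokens as fit once the two special tokens are counted."""
    budget = max(0, max_length - 2)  # two slots are reserved for the special tokens
    return [cls_id] + content_ids[:budget] + [sep_id]


# The 14 content IDs of the English example sentence, without special tokens.
content_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

truncated = truncate_with_specials(content_ids, max_length=8)
untouched = truncate_with_specials(content_ids, max_length=512)

print(truncated)
print(len(truncated), len(untouched))
```

With a limit of 8, only the first six content tokens survive; with the 512-token model limit the sentence is short enough that nothing is removed.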

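The tokenizer object can also handle the conversion to framework-specific tensors, which can then be sent directly to the model. For example, in the following code sample we ask the tokenizer to return tensors for different frameworks: "pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays: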

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
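
Each call above passes padding=True because a tensor has to be rectangular: every row needs the same number of columns. The sketch below illustrates that constraint with plain lists. The helper batch_shape is hypothetical and the IDs are made up; it stands in for the shape check a tensor library performs and is not taken from any of the three frameworks.

```python
def batch_shape(batch):
    """Return (rows, columns) if every row has the same length, else raise ValueError."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


ragged = [[101, 7, 8, 102], [101, 7, 102]]     # unpadded: rows of length 4 and 3
padded = [[101, 7, 8, 102], [101, 7, 102, 0]]  # shorter row padded with ID 0

try:
    batch_shape(ragged)
except ValueError as err:
    print("cannot build a tensor:", err)

print(batch_shape(padded))
```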

## Special tokens

If we take a look at the input IDs returned by the tokenizer, we'll see they are slightly different from what we had earlier:

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was added at the beginning and one at the end. Let's decode the two sequences of IDs above to see what this is about:

In [None]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


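The tokenizer added the special token [CLS] at the beginning and the special token [SEP] at the end. This is because the model was pretrained with them, so to get the same results at inference time we need to add them as well. Note that some models don't add special tokens, or add different ones; a model may also add them only at the beginning or only at the end. In any case, the tokenizer knows which ones are expected and will handle this for you.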
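
The variations mentioned above can be pictured as different wrapping layouts. The sketch below is illustrative only: the layout table is written from memory of how BERT, RoBERTa, T5, and GPT-2 tokenizers treat a single sequence by default and should be read as an assumption, the token list is made up, and add_special_tokens is a hypothetical helper. Real tokenizers for these models also split text differently, which this sketch ignores.

```python
# (prefix, suffix) pairs; None means nothing is added on that side.
SPECIAL_TOKEN_LAYOUTS = {
    "BERT-style": ("[CLS]", "[SEP]"),
    "RoBERTa-style": ("<s>", "</s>"),
    "T5-style": (None, "</s>"),
    "GPT-2-style": (None, None),
}


def add_special_tokens(tokens, layout):
    """Wrap a token list according to one of the layouts above."""
    prefix, suffix = SPECIAL_TOKEN_LAYOUTS[layout]
    wrapped = list(tokens)
    if prefix is not None:
        wrapped.insert(0, prefix)
    if suffix is not None:
        wrapped.append(suffix)
    return wrapped


tokens = ["so", "have", "i", "!"]  # made-up, already-tokenized input
for layout in SPECIAL_TOKEN_LAYOUTS:
    print(layout, add_special_tokens(tokens, layout))
```

In practice there is no need to maintain such a table by hand, since calling the tokenizer that belongs to the checkpoint applies the right layout.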

## Wrapping up: From tokenizer to model

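Now that we've seen all the individual steps the tokenizer object uses when applied to text, let's see one final time how it handles multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API: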

In [None]:
import torch
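from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# bert-base-chinese ships without a fine-tuned classification head, so the head
# is newly initialized and the scores below are not meaningful predictions.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "我欲穿花寻路，误入白云深处。",
    "为中华之崛起而读书！",
    "这件事让我感到无法接受。",
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

print(output.logits)

# Pass dim explicitly: relying on the implicit dimension is deprecated and
# triggers a UserWarning. dim=-1 normalizes across the class scores of each row.
results = torch.nn.functional.softmax(output.logits, dim=-1)
print(results)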

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly i

tensor([[-0.0392, -0.7188],
        [ 0.0293, -0.4774],
        [-0.7877, -0.4168]], grad_fn=<AddmmBackward0>)
tensor([[0.6637, 0.3363],
        [0.6240, 0.3760],
        [0.4083, 0.5917]], grad_fn=<SoftmaxBackward0>)
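
The second tensor is the softmax of the first, taken across the two scores in each row. As a quick check, the sketch below recomputes it in plain Python from the rounded logits printed above. The softmax function here is a hand-written helper for illustration, not the PyTorch one, and because the inputs are rounded to four decimals the last digit can differ slightly from the printed probabilities.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed output above.
logits = [
    [-0.0392, -0.7188],
    [0.0293, -0.4774],
    [-0.7877, -0.4168],
]

probabilities = [softmax(row) for row in logits]
for probs in probabilities:
    print([round(p, 4) for p in probs])
```

Each row sums to 1, so the two numbers can be read as the model's relative preference between its two output classes for that sentence.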




In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

print(output.logits)

tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>)
