<a href="https://colab.research.google.com/github/zcongfly/huggingface-nlp-learning-note/blob/main/09_%E4%B8%80%E4%B8%AA%E5%AE%8C%E6%95%B4%E7%9A%84%E8%AE%AD%E7%BB%83%E8%BF%87%E7%A8%8B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Full Training Loop

In [None]:
# Install the Transformers, Datasets, and Evaluate libraries to run this notebook.
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

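Now we'll see how to get the same results as in the previous section without using the `Trainer` class. Again, we assume you have done the data processing from section 2. Here is a short summary covering everything you need: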

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer

raw_datasets=load_dataset("glue","mrpc")

checkpoint = "bert-base-uncased"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)
model=AutoModelForSequenceClassification.from_pretrained(checkpoint)
data_collator=DataCollatorWithPadding(tokenizer=tokenizer)
training_args=TrainingArguments("test-trainer",save_strategy="steps", save_steps=5000)

def tokenized_function(example):
    return tokenizer(example["sentence1"],example["sentence2"],truncation=True)

tokenized_datasets=raw_datasets.map(tokenized_function, batched=True)
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5525        |
| 1000 | 0.3517        |


TrainOutput(global_step=1377, training_loss=0.3824670995242715, metrics={'train_runtime': 182.4405, 'train_samples_per_second': 60.316, 'train_steps_per_second': 7.548, 'total_flos': 406183858377360.0, 'train_loss': 0.3824670995242715, 'epoch': 3.0})

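## Preparing for Training

Before actually writing our training loop, we need to define a few objects. The first are the dataloaders we will use to iterate over batches. Before we can define them, we need to apply a bit of postprocessing to our tokenized_datasets, to take care of some things that Trainer did for us automatically. Specifically, we need to:

* Remove the columns corresponding to values the model does not expect (such as the sentence1 and sentence2 columns).
* Rename the column label to labels (because the model expects the argument to be named labels).
* Set the format of the datasets so they return PyTorch tensors instead of lists.

Our tokenized_datasets has one method for each of these steps. Afterwards, we can check that the result only contains columns the model will accept: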

In [8]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

With that done, we can easily define our dataloaders:

In [9]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check that there is no mistake in the data processing, we can inspect one batch like this:

In [10]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 61]),
 'token_type_ids': torch.Size([8, 61]),
 'attention_mask': torch.Size([8, 61])}

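Note that the actual shapes will probably be slightly different for you, since we set shuffle=True for the training dataloader and the data collator pads each batch to the longest sequence in that batch.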
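
To see why the second dimension changes from batch to batch, here is a minimal pure-Python sketch of dynamic padding. The helper `pad_batch` is hypothetical and only mimics the idea behind DataCollatorWithPadding; it assumes a padding id of 0, ignores token_type_ids and tensor conversion, and uses illustrative token ids.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches whose longest sequences differ end up with different widths.
batch_a = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
batch_b = pad_batch([[101, 102], [101, 7592, 102]])
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))  # prints: 5 3
```

Because shuffling changes which sentences land in the same batch, the longest sequence per batch, and therefore the padded width, varies from run to run.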

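Now that we're completely finished with data preprocessing (a satisfying yet elusive goal for any ML practitioner), let's turn to the model. We instantiate it exactly as we did in the previous section: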

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

To make sure everything will go smoothly during training, we pass our batch to this model:

In [12]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.8452, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


All Transformers models return the loss when labels are provided, and we also get the logits (two for each input in our batch, so a tensor of size 8 x 2).
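
As a rough illustration of where that loss value comes from: for a single-label classification head, the loss is the mean cross-entropy between the logits and the labels (the NllLossBackward0 in the output above points the same way). The sketch below recomputes it in plain Python; the function name and the logits are made up for illustration and are not the values from the batch above.

```python
import math


def mean_cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class under softmax(logits)."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp, with the max subtracted for numerical stability
        top = max(row)
        log_norm = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_norm - row[label]
    return total / len(labels)


# Equal logits amount to a 50/50 guess, so the loss is ln(2), about 0.693.
print(mean_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
```

With two classes, a freshly initialized head that guesses close to 50/50 gives a loss of that order, so a first-batch value somewhat below 1 is unsurprising.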

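We're almost ready to write our training loop! We're just missing two things: an optimizer and a learning rate scheduler. Since we are trying to replicate by hand what Trainer was doing, we will use the same defaults. The optimizer used by Trainer is AdamW, which is the same as Adam but with a twist for weight decay regularization (see "[Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)" by Ilya Loshchilov and Frank Hutter):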

In [13]:
# transformers.AdamW is deprecated; the PyTorch implementation is the recommended replacement.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
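
To make the difference between Adam and AdamW concrete, here is a single-parameter sketch contrasting Adam with L2 regularization against decoupled weight decay. It follows the update rule described in the paper cited above; the function names and hyperparameter values are illustrative assumptions, and this is not the library's implementation.

```python
import math


def new_state():
    """Fresh optimizer state for one scalar parameter."""
    return {"t": 0, "m": 0.0, "v": 0.0}


def adam_direction(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update the moment estimates in `state` and return the bias-corrected Adam step."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_step(param, grad, state, lr=5e-5, weight_decay=0.01):
    # L2 regularization: the decay term is folded into the gradient,
    # so it is rescaled by the adaptive denominator like everything else.
    return param - lr * adam_direction(grad + weight_decay * param, state)


def adamw_step(param, grad, state, lr=5e-5, weight_decay=0.01):
    # Decoupled weight decay: the parameter is shrunk directly,
    # independently of the adaptive gradient step.
    return param - lr * adam_direction(grad, state) - lr * weight_decay * param


print(adam_l2_step(1.0, 0.5, new_state()), adamw_step(1.0, 0.5, new_state()))
```

In the decoupled form, the amount of shrinkage depends only on the learning rate and the decay coefficient, not on the gradient history.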



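Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To define it properly, we need to know the number of training steps we will take, which is the number of epochs multiplied by the number of training batches (the length of our training dataloader). Trainer uses three epochs by default, so we will follow that: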

In [14]:
from transformers import get_scheduler

num_epochs = 3
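num_training_steps = num_epochs * len(train_dataloader)  # len(train_dataloader) is the number of training batches
# Create the learning rate scheduler.
# get_scheduler is a helper from the transformers library that builds a scheduler by name.
# The "linear" scheduler decreases the learning rate linearly to 0.
# optimizer is the optimizer whose learning rate will be scheduled;
# num_warmup_steps=0 means no warmup phase, and num_training_steps is the total number of training steps.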
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
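
The step count and the shape of the schedule can be checked by hand. The sketch below assumes the MRPC training split has 3,668 examples and uses the batch size of 8 from the dataloader; `linear_lr` is a hypothetical helper that mirrors the idea of a linear decay with optional warmup, not the library's code.

```python
import math


def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup (if any), then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples, batch_size, num_epochs = 3668, 8, 3
steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is smaller
total_steps = num_epochs * steps_per_epoch
print(total_steps)  # 1377, matching the cell output above

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```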


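## The Training Loop

One last thing: we will want to use a GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we define a device that points to the GPU when one is available, and we will put our model and our batches on it: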

In [15]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

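We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library: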

In [16]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
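        loss = outputs.loss  # loss for this batch
        loss.backward()  # backpropagate to compute gradients

        optimizer.step()  # update the model parameters
        lr_scheduler.step()  # advance the learning rate schedule
        optimizer.zero_grad()  # reset gradients for the next batch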
        progress_bar.update(1)

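You can see that the core of the training loop looks a lot like the one in the introduction. We didn't ask for any reporting, so this training loop tells us nothing about how the model is doing. We need to add an evaluation loop for that.

## The Evaluation Loop

As we did earlier, we will use a metric provided by the Evaluate library. We've already seen the metric.compute() method, but metrics can also accumulate batches for us as we go through the prediction loop with the add_batch() method. Once we have accumulated all the batches, we can get the final result with metric.compute(). Here's how to implement all of this in an evaluation loop: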

In [17]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8357843137254902, 'f1': 0.8873949579831933}

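Again, your results will be slightly different because of the randomness in the model head initialization and the data shuffling, but they should be in the same ballpark.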
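
Conceptually, add_batch() stores predictions and references until compute() is called. The class below is a hypothetical pure-Python stand-in that shows this accumulate-then-compute pattern for accuracy and binary F1 (treating label 1 as the positive class); it is a sketch of the idea, not the Evaluate library's implementation.

```python
class RunningAccuracyF1:
    """Accumulate predictions batch by batch, then compute accuracy and F1."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        correct = sum(p == r for p, r in pairs)
        tp = sum(p == 1 and r == 1 for p, r in pairs)
        fp = sum(p == 1 and r == 0 for p, r in pairs)
        fn = sum(p == 0 and r == 1 for p, r in pairs)
        # F1 = 2TP / (2TP + FP + FN); reported as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        return {"accuracy": correct / len(pairs), "f1": f1}


metric_sketch = RunningAccuracyF1()
metric_sketch.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
metric_sketch.add_batch(predictions=[0, 1], references=[1, 1])
print(metric_sketch.compute())
```

Computing the metric once over all accumulated pairs matters for F1 in particular, since averaging per-batch F1 scores would generally give a different number.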

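## Supercharge Your Training Loop with Accelerate

The training loop we defined earlier works fine on a single CPU or GPU. But with the Accelerate library, just a few adjustments let us enable distributed training on multiple GPUs or TPUs. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like: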

In [20]:
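# transformers.AdamW is deprecated; the PyTorch implementation is the recommended replacement.
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler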

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch={k:v.to(device) for k,v in batch.items()}
        outputs=model(**batch)
        loss=outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

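The first line to add is the import of Accelerator. The second line instantiates an Accelerator object, which looks at the environment and initializes the proper distributed setup. Accelerate handles device placement for you, so you can remove the line that puts the model on the device (or, if you prefer, change it to use accelerator.device instead of device).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to accelerator.prepare(). This wraps those objects in the proper containers to make sure your distributed training works as intended. The remaining changes are to remove the line that puts the batch on the device (again, if you want to keep it, you can change it to use accelerator.device) and to replace loss.backward() with accelerator.backward(loss).

To benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the `padding="max_length"` and `max_length` arguments of the tokenizer.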
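
A minimal sketch of what fixed-length padding could look like in the tokenization function. The function name `tokenize_fixed_length` and the default `max_length=128` are assumptions chosen for illustration; pick a length that fits your data. The tokenizer is passed in as an argument so the function does not depend on notebook globals.

```python
def tokenize_fixed_length(example, tokenizer, max_length=128):
    """Tokenize sentence pairs and pad or truncate every sample to max_length."""
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,       # cut pairs longer than max_length
        padding="max_length",  # pad shorter pairs up to max_length
        max_length=max_length,
    )


# Possible usage with the objects defined earlier in this notebook:
# tokenized_datasets = raw_datasets.map(
#     tokenize_fixed_length, batched=True, fn_kwargs={"tokenizer": tokenizer}
# )
```

With every sample padded to the same length, all batches share one shape, which is the property the recommendation is after.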

If you'd like to copy and paste it to try it out, here is the complete training loop with Accelerate:

In [21]:
from accelerate import Accelerator
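# transformers.AdamW is deprecated; the PyTorch implementation is the recommended replacement.
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler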

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

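Putting this in a train.py script makes it runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

```bash
accelerate config
```
which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

```bash
accelerate launch train.py
```
which will launch the distributed training.

If you want to try this in a notebook (for instance, to test it with TPUs on Colab), just paste the code in a training_function() and run a last cell with: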

In [22]:
from accelerate import notebook_launcher

notebook_launcher(training_function)

You can find more examples in the [Accelerate repo](https://github.com/huggingface/accelerate/tree/main/examples).