## A full training

Preprocess the data

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [2]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})


### Prepare for training

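Before training we need to instantiate a few objects. The first are the `dataloaders` we will use to iterate over batches. Before defining them, though, we need to apply some postprocessing to `tokenized_datasets`, to take care of things the `Trainer` previously did for us automatically. Specifically, we need to:
- Remove the columns the model does not expect (`sentence1`, `sentence2`, and `idx`)
- Rename the `label` column to `labels` (because the model expects its argument to be named `labels`)
- Set the format of the datasets so they return PyTorch tensors instead of lists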

In [3]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [4]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})


Now we can define the dataloaders:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    dataset=tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
# Note: the validation set is not shuffled (shuffle is left at its default)
eval_dataloader = DataLoader(
    dataset=tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check that nothing went wrong in the data processing, we can inspect a single batch like this:

In [8]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 69]),
 'token_type_ids': torch.Size([8, 69]),
 'attention_mask': torch.Size([8, 69])}
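
The sequence length of 69 here is just the length of the longest example that happened to land in this batch, so it will differ from batch to batch (the training dataloader shuffles). The idea behind this dynamic padding can be sketched in plain Python. This is a simplified illustration, not the actual `DataCollatorWithPadding` implementation: the helper `pad_batch` and the toy token ids are hypothetical, and a pad id of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (made-up token ids).
toy_batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 1037, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps the tensors as small as each batch allows.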

Now that data preprocessing is complete, we can define our model:

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make sure everything runs smoothly, we can pass the batch from earlier through the model:

In [12]:
outputs = model(**batch)
print(batch)
print(outputs.loss, outputs.logits.shape)

{'labels': tensor([1, 0, 0, 1, 0, 1, 1, 0]), 'input_ids': tensor([[  101,  1996,  2924,  2036,  2056,  2049,  3749,  2001,  3395,  2000,
          1996,  3820,  1997,  2852,  8528,  1005,  1055,  3026,  5085,  1010,
          3026,  5416, 13304,  1998,  2002,  2094,  4726,  5085,  2011,  2382,
          2244,  2494,  1012,   102,  1996,  3749,  2003,  2036,  3395,  2000,
         17765,  6608,  2019,  3820,  2007,  2852,  8528,  1005,  1055,  3026,
          5085,  1010,  3026,  5416, 13304,  1998,  2002,  2094,  4726,  5085,
          2011, 17419,  1012,  2382,  1010,  2009,  2056,  1012,   102],
        [  101,  2016,  2038,  1037,  6898,  1010,  2205,  1010,  4584,  2056,
          1024, 17001,  1012, 19469,  9530,  7913,  8180,  1010,  2040,  2938,
          2007,  2014,  2155,  9857,  1012,   102,  2016,  2036, 15583,  2014,
          6898,  1010, 17001,  1012, 19469,  9530,  7913,  8180,  1010,  2040,
          2001,  3564,  2279,  2000,  1996,  2754,  1012,   102,     0,     0,


All 🤗 Transformers models return the loss when `labels` are provided.
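
The loss returned for a sequence classification head with integer class labels is typically the cross-entropy between the logits and the labels, averaged over the batch. The plain-Python sketch below shows that computation on made-up logits. `cross_entropy` is a hypothetical helper written for illustration, not the library's code. One practical consequence: a freshly initialized two-class head that produces near-equal logits should start with a loss close to ln 2 (about 0.693).

```python
import math


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        # Subtract the max for numerical stability before exponentiating.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


# Equal logits: the model is maximally unsure between the two classes.
uniform_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [1, 0])
# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy([[0.0, 10.0]], [1])
print(uniform_loss, confident_loss)
```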

Two things remain before we can write the training loop:
- Define the optimizer and the learning rate scheduler
- Use the GPU

Define the optimizer and learning rate scheduler

In [14]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

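The learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To define it properly, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (the length of our training dataloader). The `Trainer` uses three epochs by default, so we will follow that: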

In [18]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)

1377
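
The figure 1377 follows from the dataset size: 3668 training examples in batches of 8 give ceil(3668 / 8) = 459 batches per epoch, and 3 × 459 = 1377. With no warmup, a linear schedule scales the base learning rate by the fraction of steps remaining. The sketch below reproduces both calculations in plain Python. `linear_lr` is a hypothetical helper for illustration that assumes zero warmup steps; it is not the scheduler object returned by `get_scheduler`.

```python
import math

num_examples, batch_size, num_epochs = 3668, 8, 3
# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, base_lr=5e-5, total_steps=num_training_steps):
    """Learning rate after `step` scheduler updates, decaying linearly to 0."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


print(steps_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```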


Use the GPU

We need to move two things to the GPU:
- the model
- the batches

In [17]:
import torch
# Python conditional expression: value_if_true if condition else value_if_false
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

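We are now ready to train! To get a sense of when training will finish, we add a progress bar over the number of training steps using the tqdm library: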

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move every tensor in the batch to the target device
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        # Reset the gradients so they do not accumulate into the next step
        optimizer.zero_grad()

        # Update the progress bar once per training batch
        progress_bar.update(1)

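As you can see, the core of the training loop looks a lot like the one in the introduction. We did not ask for any reporting, so this training loop tells us nothing about how the model performs. For that we need to add an evaluation loop.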
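
One detail of the training loop is worth a closer look before moving on: the `optimizer.zero_grad()` call matters because PyTorch adds new gradients to whatever is already stored in `.grad` instead of overwriting it. A minimal sketch with a single toy parameter (the tensor `w` is made up for illustration) shows the effect:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(2w)/dw = 2.
(2 * w).backward()
first_grad = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one.
(2 * w).backward()
accumulated_grad = w.grad.item()

# Resetting the gradient gives the next step a clean slate.
w.grad.zero_()
reset_grad = w.grad.item()

print(first_grad, accumulated_grad, reset_grad)
```

Without the reset, every optimizer step would be taken on the sum of all gradients seen so far rather than on the current batch's gradient.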

### The evaluation loop

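As before, we will use a metric provided by the 🤗 Evaluate library. We have already seen the `metric.compute()` method, but metrics can also accumulate batches for us as we go through the prediction loop, using the `add_batch()` method. Once all batches have been accumulated, we get the final result with `metric.compute()`. Here is how to implement all of this in an evaluation loop: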

In [None]:
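import evaluate

metric = evaluate.load("glue", "mrpc")
# Switch to evaluation mode so that dropout is disabled
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    # Gradients are not needed for evaluation
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    # Accumulate this batch's predictions and labels
    metric.add_batch(predictions=predictions, references=batch["labels"])
# Compute the metric over all accumulated batches
metric.compute()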

{'accuracy': 0.8455882352941176, 'f1': 0.8904347826086957}
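
To make the accumulate-then-compute pattern concrete, here is a toy stand-in written in plain Python. The class `ToyBinaryMetric` is hypothetical and only mimics the `add_batch()` / `compute()` interface. It assumes binary labels with 1 as the positive class, and it is not how 🤗 Evaluate implements the metric.

```python
class ToyBinaryMetric:
    """Accumulates predictions batch by batch, then reports accuracy and F1."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        # Only store the values; nothing is computed until compute() is called.
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        correct = sum(p == r for p, r in pairs)
        tp = sum(p == 1 and r == 1 for p, r in pairs)
        fp = sum(p == 1 and r == 0 for p, r in pairs)
        fn = sum(p == 0 and r == 1 for p, r in pairs)
        # F1 = 2TP / (2TP + FP + FN); reported as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        return {"accuracy": correct / len(pairs), "f1": f1}


toy_metric = ToyBinaryMetric()
toy_metric.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
toy_metric.add_batch(predictions=[0, 1], references=[1, 1])
toy_result = toy_metric.compute()
print(toy_result)
```

Accumulating first and computing once at the end matters for F1 in particular, since averaging per-batch F1 scores would generally not equal the F1 over the whole validation set.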

### Supercharge your training loop with 🤗 Accelerate