# <center>Transformers in One Lesson</center>

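## Section 1: Introduction and Preparation

### 1.1 About transformers and the goal of this tutorial

As pretrained models have taken over NLP, more and more of them have appeared: BERT, GPT, T5, ELECTRA, Longformer, MobileBERT, and others. Their authors may use TensorFlow or PyTorch, often with different environment versions, which makes the models harder to study, reproduce, and use in both academia and industry. Fortunately, the transformers library from Hugging Face addresses this problem.

Hugging Face is a company, and transformers[1][2][3] is the open-source library it develops, with 30k+ stars at the time of writing. The library is easy to use; supports both TF2 and PyTorch; supports many pretrained models such as BERT, GPT, ALBERT, T5, DialoGPT, and ELECTRA, and is actively maintained; and provides unified, standard Config, Model, Tokenizer, and Trainer interfaces together with a standardized way of building models, which makes reproduction, extension, and experimentation easier.

Several open-source projects are built on huggingface/transformers for classification and generation tasks, such as a Chinese text generation project[4]; the library is also used in research, for example DialoGPT[5].

There are two ways to use the library. The first is the pipeline API, which is highly integrated and works out of the box for tasks such as sentiment classification and NER tagging. The second is to take the standard models the library provides and train them yourself; this is the more conventional workflow and fits the tasks we meet in everyday study and work better.

This tutorial covers the second approach, using the PyTorch models as the example, and aims to get you started with transformers in a single lesson.

Note: the code here draws heavily on the official examples.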

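### 1.2 Structure of this tutorial

#### Section 1: Introduction and Preparation

Explains what huggingface/transformers is, the goal of this tutorial, its structure, and the preparation required.

#### Section 2: Quick Start

A quick-start example that introduces loading and saving the tokenizer and model, shows how the tokenizer turns text into model inputs, and how the model produces logits and the loss. These few steps are the standard PyTorch loss computation for one batch, which is essentially one complete training iteration.

#### Section 3: Concepts

Covers the library's target users and goals, and the design purpose of the Configuration, Tokenizer, Model, AutoModels, and Trainer classes, to give a high-level view of how the library is designed.

#### Section 4: Text classification example on GLUE/MRPC

Building on the above, a complete hands-on walkthrough of fine-tuning a text classifier and deploying it for online prediction, in three parts:

Part 1 trains a classifier with the code in the library's examples directory. This is the simplest example that runs.

Part 2 loads the model produced in Part 1 and runs prediction.

Part 3 walks through the code behind Part 1: how to load the data, model, and tokenizer, initialize the Trainer, and train.

#### Section 5: GPT2 training example




#### Section 6

#### Section 7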

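### 1.3 Preparation

#### 1.3.1 Download the pretrained models and datasets

The rest of this tutorial uses two pretrained models, bert-base-uncased and gpt2, along with the MRPC dataset from GLUE and the wikitext-2-raw dataset. They can be downloaded from the Baidu Cloud link below with the given password. Put the downloaded files in the same directory as this notebook for convenience.

Link:

Password:


#### 1.3.2 Clone the transformers repository

The examples shipped with the transformers repository are needed later, so clone the repository, preferably next to this notebook. Run the following in a shell: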

```
git clone git@github.com:huggingface/transformers.git
```


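#### 1.3.3 Set up the environment

This tutorial is based on PyTorch, so PyTorch must be installed; pytorch>=1.4.0 is recommended.

Upgrade transformers to the latest version as shown below. The outputs in this tutorial were produced with transformers 3.0.2, so later releases may behave differently.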

```
pip install --upgrade transformers
```

Other packages such as pandas, xlrd, and sklearn (installed as scikit-learn) may also be needed; install whatever is reported missing.


## Section 2: Quick Start

A simple example to show how quickly you can get from text to model output.

In [1]:
import torch
import transformers

print(f'torch: {torch.__version__}')
print(f'transformers: {transformers.__version__}')



torch: 1.4.0+cu100
transformers: 3.0.2


In [2]:
# Every upper-case name is a path you need to change to your own
BERT_MODEL_NAME_OR_PATH = 'transformers_data_and_model/bert-base-uncased'

In [3]:
# Show the contents of the directory
!tree {BERT_MODEL_NAME_OR_PATH}

transformers_data_and_model/bert-base-uncased
├── config.json
├── modelcard.json
├── pytorch_model.bin
└── vocab.txt

0 directories, 4 files


In [4]:
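# Initialize the model and tokenizer
# Every model and tokenizer is loaded with from_pretrained and saved with save_pretrained
# The first argument of from_pretrained can be a directory path, a file path, or a model short name; a directory is recommended here
# Models are in eval mode by default after loading; here we load the BERT tokenizer and the sequence classification model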
model = transformers.AutoModelForSequenceClassification.from_pretrained(BERT_MODEL_NAME_OR_PATH)
tokenizer = transformers.AutoTokenizer.from_pretrained(BERT_MODEL_NAME_OR_PATH)

Some weights of the model checkpoint at transformers_data_and_model/bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized 

In [6]:
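# The tokenizer turns raw text into the format the model accepts
# Its most important entry point is __call__, which converts text into the inputs the model needs
# It also has encode/decode, which convert text to input_ids and input_ids back to text
# encode/decode are not fundamentally different from __call__; __call__ just provides a unified interface
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
# input_ids are the vocabulary indices of the tokens; token_type_ids mark first/second sentence; attention_mask marks real tokens versus padding
# This is one sample with a single sentence; for several single-sentence samples, pass a list of strings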
print(inputs)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [9]:
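# One sample made of two sentences, passed as the first and second arguments
# For several two-sentence samples, pass the list of first sentences as the first argument and the list of second sentences as the second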
inputs2 = tokenizer('hello', 'good morning')
print(inputs2)

{'input_ids': [101, 7592, 102, 2204, 2851, 102], 'token_type_ids': [0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1]}
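The sentence-pair layout in the output above can be reproduced by hand. The sketch below is a simplified illustration, not the tokenizer's actual implementation: the helper name `build_pair_inputs` is hypothetical, and the ids 101 (`[CLS]`) and 102 (`[SEP]`) are read off the printed `input_ids`.

```python
from typing import Dict, List

CLS_ID = 101  # [CLS] id, as seen at the start of the printed input_ids
SEP_ID = 102  # [SEP] id, as seen after each sentence in the printed input_ids


def build_pair_inputs(ids_a: List[int], ids_b: List[int]) -> Dict[str, List[int]]:
    """Hypothetical helper: lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    first = [CLS_ID] + ids_a + [SEP_ID]   # segment 0: [CLS] + sentence A + [SEP]
    second = ids_b + [SEP_ID]             # segment 1: sentence B + [SEP]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is real
    }


# Word ids taken from the output above: 'hello' -> 7592, 'good morning' -> 2204, 2851
pair = build_pair_inputs([7592], [2204, 2851])
print(pair)
```

The first `[SEP]` belongs to segment 0 and the second to segment 1, which is why `token_type_ids` splits evenly into three zeros and three ones here.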


In [11]:
# Usually the model input needs to be a batch
# The first argument is a list of texts; padding=True and truncation=True enable padding and truncation
# return_tensors sets the return type, here PyTorch tensors
pt_batch = tokenizer(
     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
     padding=True,
     truncation=True,
     return_tensors="pt"
)
for key, value in pt_batch.items():
     print(f"{key}: {value.numpy().tolist()}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
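What `padding=True` does to this batch can be sketched in plain Python. This is only an illustration under stated assumptions: `pad_batch` is a hypothetical helper, and the pad id 0 is read off the trailing zeros in the printed `input_ids`.

```python
from typing import Dict, List

PAD_ID = 0  # pad id, as seen at the end of the second printed row


def pad_batch(sequences: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: right-pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids of the two sentences, copied from the output above
batch = pad_batch([
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
])
print(batch["attention_mask"][1])
```

The shorter sentence has 11 tokens and the longer one 14, so the second row gains three pad ids and three zeros in its mask.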


In [24]:
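# The tokenizer output is meant to be fed to the model for classification, prediction, and so on
# pt_batch above is one batch; unpack it with ** and pass it straight to the model
# In the transformers version used here (3.0.2), every model returns a tuple
# With only the batch inputs, the first element of the tuple is the logits
# If labels are also passed, the first element is the loss and the second is the logits
# Regression heads use mean-squared-error loss; classification heads use cross-entropy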
pt_outputs = model(**pt_batch)
print(f'logits:{pt_outputs[0]}\n')
# With the labels argument
pt_outputs = model(**pt_batch, labels=torch.LongTensor([1, 0]))
print(f'loss: {pt_outputs[0]}\nlogits: {pt_outputs[1]}')

logits:tensor([[0.2274, 0.1681],
        [0.1150, 0.2867]], grad_fn=<AddmmBackward>)

loss: 0.7529614567756653
logits: tensor([[0.2274, 0.1681],
        [0.1150, 0.2867]], grad_fn=<AddmmBackward>)


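That is the basic input flow of the PyTorch version of transformers:

Every model in transformers is a standard PyTorch `torch.nn.Module`. Calling the tokenizer converts text into model inputs; the model takes them and returns logits; you compute the loss, call backward, and take an optimizer step, and repeating this is training/fine-tuning. If you pass `labels` to the model, it returns the loss directly, and you backpropagate and iterate in the same way.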
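As a check on the numbers printed above, the loss can be recomputed by hand from the logits. The sketch below is a plain-Python version of mean cross-entropy, not the library's code; `cross_entropy` is a hypothetical helper, and the logits are the rounded values from the cell output, so the result should only be close to the reported 0.7529.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over a batch, computed from raw logits."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp, shifted by the row maximum for numerical stability
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        # negative log-probability of the correct class
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Logits and labels copied from the cell output above
logits = [[0.2274, 0.1681], [0.1150, 0.2867]]
labels = [1, 0]
loss = cross_entropy(logits, labels)
print(f"loss: {loss:.4f}")
```

Both samples are currently predicted slightly wrong (the larger logit sits on the other class), which is why the loss is a little above ln 2 ≈ 0.693, the value for a uniform guess over two classes.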

Besides the plain PyTorch approach, transformers also wraps this loop in a Trainer class that simplifies the PyTorch code; it is covered later.

In [25]:
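# After training/fine-tuning, save the tokenizer and model to the same directory
# A good habit with transformers: keep the model, tokenizer files, training arguments, and so on in one directory
# model/tokenizer/config are loaded with from_pretrained and saved with save_pretrained
# save_pretrained just takes the target directory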
SAVE_DIRECTORY = 'transformers_data_and_model/bert_save_example'
tokenizer.save_pretrained(SAVE_DIRECTORY)
model.save_pretrained(SAVE_DIRECTORY)

In [26]:
# Show what was saved
!tree {SAVE_DIRECTORY}

transformers_data_and_model/bert_save_example
├── config.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt

0 directories, 5 files


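### Summary:

1. transformers provides uniform loading and saving for model/tokenizer/config through from_pretrained/save_pretrained;
2. The tokenizer's `__call__` converts text into the format the model accepts; the model returns a tuple, and the loss is computed from its contents;
3. The tokenizers wrap different algorithms such as Byte-Pair Encoding, WordPiece, and SentencePiece;
4. The models wrap different architectures such as BERT, GPT2, and ALBERT behind standardized classes. More on this below.
5. For us, this makes reproduction, experimentation, and research easier.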

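## Section 3: Concepts

This section covers who transformers is for, its goals, its main concepts, AutoModels, and the Trainer class.

### 3.1 Who transformers is for:
- NLP researchers and educators who want to use, study, or extend large transformer models
- Practitioners who want to fine-tune models or serve them in production
- Engineers who just want to download a pretrained model and use it to solve an NLP task

### 3.2 Two strong goals of transformers:
- Be as easy and fast to use as possible
- Provide state-of-the-art models with performance as close as possible to the original models

### 3.3 Secondary goals of transformers:
- Expose the models' internals as consistently as possible
- Incorporate a subjective selection of promising tools for fine-tuning and investigating these models
- Switch easily between PyTorch and TensorFlow 2.0, so you can train with one framework and run inference with the other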

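### 3.4 Main concepts:

The three main classes are Model, Configuration, and Tokenizer.

Model classes, such as BertModel, inherit from PyTorch models (torch.nn.Module) or Keras models (tf.keras.Model) and work with pretrained weights.

Configuration classes, such as BertConfig, store all the parameters needed to build a model. You do not always instantiate them yourself: when you use a pretrained model without modification, the model handles its configuration automatically. You do need to create one yourself when you pretrain your own model with a different architecture.

Tokenizer classes, such as BertTokenizer, store the vocabulary for each model and encode/decode text to and from the format the model needs: the indices of the token embeddings.

All of these classes share two methods for instantiating from and saving to local storage:

`from_pretrained()` loads a pretrained model. Its first argument, model_name_or_path, can be a short name or a local model, given as a directory, a file, or a short name. For a directory, it looks for pytorch_model.bin inside by default.

`save_pretrained()` saves the model locally; its argument is a directory.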

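### 3.5 AutoModels

Take BERT as an example. Each model has one Config (BertConfig) and one or two tokenizers: a fast Rust-based tokenizer (BertTokenizerFast) and the original Python tokenizer (BertTokenizer); some models have no fast Rust tokenizer. Each model also has several Model classes with different heads:

- BertModel: the base model without a head
- BertForPreTraining: the MLM and NSP pretraining heads
- BertForMaskedLM: the MLM head
- BertForNextSentencePrediction: the NSP head
- BertForSequenceClassification: sentence classification
- BertForMultipleChoice: multiple choice
- BertForTokenClassification: token classification
- BertForQuestionAnswering: question answering

Details vary slightly between models, but every config class inherits from PretrainedConfig, every tokenizer from PreTrainedTokenizer or PreTrainedTokenizerFast, and every model from PreTrainedModel.

For convenience, AutoConfig, AutoTokenizer, AutoModel, AutoModelForPreTraining, AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering, AutoModelForTokenClassification, and similar classes look up the right concrete class automatically.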
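The idea behind the Auto classes can be sketched as a registry lookup keyed on the `model_type` field of `config.json`. This is a simplified illustration, not the library's implementation, whose lookup logic is more involved; `DemoBertConfig`, `DemoGpt2Config`, `CONFIG_REGISTRY`, and `auto_config_from_dir` are all hypothetical names.

```python
import json
import os
import tempfile
from typing import Any, Dict, Type


class DemoBertConfig:
    """Hypothetical stand-in for a BERT-style config class."""

    def __init__(self, **params: Any) -> None:
        self.params = params


class DemoGpt2Config:
    """Hypothetical stand-in for a GPT-2-style config class."""

    def __init__(self, **params: Any) -> None:
        self.params = params


# Hypothetical registry: "model_type" value in config.json -> concrete class
CONFIG_REGISTRY: Dict[str, Type] = {"bert": DemoBertConfig, "gpt2": DemoGpt2Config}


def auto_config_from_dir(model_dir: str):
    """Pick the concrete class from the model_type field, then build it."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as handle:
        params = json.load(handle)
    return CONFIG_REGISTRY[params["model_type"]](**params)


# Write a tiny config.json to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "config.json"), "w", encoding="utf-8") as handle:
        json.dump({"model_type": "bert", "hidden_size": 768}, handle)
    config = auto_config_from_dir(tmp_dir)

print(type(config).__name__, config.params["hidden_size"])
```

The caller never names the concrete class; switching to another architecture only requires pointing at a directory whose `config.json` declares a different `model_type`.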

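### 3.6 The Trainer class

Trainer provides a complete, standard training API; it currently supports language modeling, text classification, token classification (NER), and similar tasks. The config, tokenizer, and model classes above simplify writing the model. After that you would normally build a dataset and dataloader and train over every epoch and batch to get the final result. Trainer is optional if you write this loop yourself; what it simplifies is the training step.

A typical training procedure looks roughly like this (pseudocode):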

```
# Load the data
train_data, test_data = get_data()
# Convert to features and build the datasets
train_dataset = MyDataset(train_data, args)
test_dataset = MyDataset(test_data, args)

# Wrap in DataLoaders, which produce the batches
train_sampler, test_sampler = ...
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=batch_size, collate_fn=collate_fn)
# Initialize TensorBoard
tb_writer = SummaryWriter(log_dir=None)
# Create the optimizer
optimizer = ...
model.to(GPU)
# Start training
for epoch in range(epochs):
    for batch in train_dataloader:
        # Switch to training mode
        model.train()
        # Move the batch to the GPU
        batch = batch.to(GPU)
        # Compute the loss for this step, backpropagate, update the weights, then clear the gradients
        tr_loss = train_step(model, batch, optimizer)
        tr_loss.backward()
        optimizer.step()
        model.zero_grad()
for batch in test_dataloader:
    ...
model.save_pretrained(OUTPUT_PATH)
```

With Trainer, the training procedure becomes much simpler:

```
# Load the data
train_data, test_data = get_data()
# Convert to features and build the datasets
train_dataset = MyDataset(train_data, args)
test_dataset = MyDataset(test_data, args)
# Read the training arguments
training_args = ...


# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=build_compute_metrics_fn(data_args.task_name),
)
# Train
if training_args.do_train:
    trainer.train(
        model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    )
    # trainer.save_model() calls save_pretrained internally
    trainer.save_model()
    # Save the tokenizer to the same directory for convenience
    if trainer.is_world_master():
        tokenizer.save_pretrained(training_args.output_dir)
```

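As the code shows, what Trainer simplifies is the training procedure itself. It wraps the usual steps of a hand-written loop, so once a Trainer is initialized, training is straightforward. Models are saved with save_model, which internally calls the save_pretrained method described earlier. The following sections also train with Trainer; for details, see the Trainer API documentation[8] and the source code.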

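## Section 4: Text classification example on GLUE/MRPC

### 4.1 About this section

This section uses a simple example to show the fine-tuning process for text classification, the code for online deployment, and the fine-tuning process in detail. Section 4.2 (training the model) runs a fine-tuning task end to end with the training script from the transformers repository. Section 4.3 (online prediction) uses the model trained in 4.2 to make predictions. Section 4.4 (the training process in detail) takes the training in 4.2 apart so that you can adapt it to new classification tasks. Section 4.5 (summary) collects basic experience with transformers for text classification.

The model used in this section is bert-base-uncased, and the dataset is MRPC from the GLUE benchmark.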

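GLUE has 9 tasks: STS-B is a regression task, MNLI is a three-class task, and the remaining 7 are binary classification tasks. For more on GLUE, see [6].

MRPC (The Microsoft Research Paraphrase Corpus), one of the nine tasks, is a similarity and paraphrase task: a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the two sentences in each pair are semantically equivalent. The classes are imbalanced (68% positive), so following common practice both accuracy and F1 are reported.

- Size: 3,668 training examples, 408 development examples, 1,725 test examples
- Task: binary classification, paraphrase or not paraphrase
- Metrics: accuracy and F1

Each example consists of two sentences separated by a tab; label 1 marks a positive example, where the two sentences are paraphrases of each other.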

### 4.2 Training the model

This subsection fine-tunes a text classifier with the official classification example from the transformers repository; it is simple and runs as is.

In [1]:
# Location of run_glue.py inside the transformers repository: transformers/examples/text-classification/run_glue.py
RUN_GLUE_PY = 'transformers/examples/text-classification/run_glue.py'
# Location of the downloaded pretrained model
BERT_MODEL_NAME_OR_PATH = 'transformers_data_and_model/bert-base-uncased'
# Location of the MRPC dataset from the downloaded GLUE data
MRPC_DATA_DIR = 'transformers_data_and_model/MRPC'
# Where the fine-tuned model, tokenizer, and other outputs are stored
FINETUNED_MRPC = 'transformers_data_and_model/finetuned-mrpc'

In [2]:
# Run the official transformers script directly
!python {RUN_GLUE_PY} \
  --model_name_or_path {BERT_MODEL_NAME_OR_PATH} \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir {MRPC_DATA_DIR} \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 3e-5 \
  --num_train_epochs 3.0 \
  --output_dir {FINETUNED_MRPC} \
  --overwrite_cache \
  --overwrite_output_dir

07/27/2020 08:09:54 - INFO - transformers.training_args -   PyTorch: setting up devices
07/27/2020 08:09:54 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='transformers_data_and_model/finetuned-mrpc', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=3e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jul27_08-09-54_b268e5129060', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
07/27/2020 08:09:54 - INFO - transformers.configurat

In [3]:
# Inspect the fine-tuned model and other outputs
!tree {FINETUNED_MRPC}

transformers_data_and_model/finetuned-mrpc
├── config.json
├── eval_results_mrpc.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.txt

0 directories, 7 files


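MRPC is very small and the task is fairly easy, so 3 epochs with a learning rate of 3e-5 are enough to finish training: accuracy on the validation set reaches 81.1% and F1 reaches 0.84. The trained model and tokenizer are saved in the FINETUNED_MRPC directory.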
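For reference, the two reported metrics can be computed as follows. This is a minimal plain-Python sketch, not the evaluation code that run_glue.py uses; `acc_and_f1` is a hypothetical helper, and the predictions and labels are made up for illustration.

```python
from typing import Dict, List


def acc_and_f1(preds: List[int], labels: List[int]) -> Dict[str, float]:
    """Accuracy and F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    acc = sum(1 for p, y in pairs if p == y) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": acc, "f1": f1}


# Made-up predictions and gold labels, for illustration only
preds = [1, 1, 0, 1, 0, 1]
labels = [1, 0, 0, 1, 1, 1]
metrics = acc_and_f1(preds, labels)
print(metrics)
```

Because 68% of MRPC examples are positive, a model that always predicts 1 already gets 68% accuracy, which is why F1 is reported alongside it.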

### 4.3 Online prediction

With the model trained, it can be used for online prediction. Loading the trained model and predicting is also simple.

In [22]:
import torch
import transformers

print(f'torch: {torch.__version__}')
print(f'transformers: {transformers.__version__}')

torch: 1.4.0+cu100
transformers: 3.0.2


In [23]:
# Load the tokenizer and model
tokenizer = transformers.AutoTokenizer.from_pretrained(FINETUNED_MRPC)
model = transformers.AutoModelForSequenceClassification.from_pretrained(FINETUNED_MRPC)

# Class labels
classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
# A paraphrase of sequence_0
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# Use the tokenizer to convert the text into the format the model needs, returned as PyTorch tensors
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

with torch.no_grad():
    # The paraphrase example
    # The model returns a tuple; without the labels argument, its first element is the logits
    paraphrase_classification_logits = model(**paraphrase)[0]
    # Softmax over the logits gives the final probabilities
    paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]

# The result should be "is paraphrase"
for i in range(len(classes)):
     print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
print(f"classification result: {classes[paraphrase_results.index(max(paraphrase_results))]}")

not paraphrase: 38%
is paraphrase: 62%
classification result: is paraphrase


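Summary: the prediction code is short. Load the tokenizer and model, use the tokenizer to turn the text into model inputs, take the logits from the tuple the model returns, and apply softmax to get the probability of each label and the final prediction. Deploying a trained model online is just as straightforward.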
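The last step, turning logits into label probabilities and a predicted class, can be sketched without PyTorch. The `softmax` helper below is a plain-Python stand-in for `torch.softmax`, and the logits are made up for illustration rather than taken from the model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
example_logits = [-0.25, 0.25]  # made-up logits, for illustration only

probs = softmax(example_logits)
# The predicted class is the one with the highest probability
predicted = classes[probs.index(max(probs))]

for name, prob in zip(classes, probs):
    print(f"{name}: {int(round(prob * 100))}%")
print(f"classification result: {predicted}")
```

Softmax does not change which class has the largest score, so if only the predicted label is needed, taking the argmax of the logits gives the same answer.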

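For simplicity, the code above does not use the GPU. Supporting the GPU in deployment is easy: move the tokenizer output and the model to the GPU. The tokenizer output is a dictionary whose values are tensors, which can be moved to the GPU as in the example below.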

In [21]:
# Use the GPU if available, otherwise the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the tokenizer and model
tokenizer = transformers.AutoTokenizer.from_pretrained(FINETUNED_MRPC)
model = transformers.AutoModelForSequenceClassification.from_pretrained(FINETUNED_MRPC)
# Move the model to the device
model = model.to(device)

# Class labels
classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
# A paraphrase of sequence_0
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
# Use the tokenizer to convert the text into the format the model needs, returned as PyTorch tensors
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
# Move paraphrase to the device
# The values are tensors, so they can be moved to the GPU for acceleration
# Apart from moving the model and the inputs, nothing else changes
for k, v in paraphrase.items():
    if isinstance(v, torch.Tensor):
        paraphrase[k] = v.to(device)

with torch.no_grad():
    # The paraphrase example
    # The model returns a tuple; without the labels argument, its first element is the logits
    paraphrase_classification_logits = model(**paraphrase)[0]
    # Softmax over the logits gives the final probabilities
    paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]

# The result should be "is paraphrase"
for i in range(len(classes)):
     print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
print(f"classification result: {classes[paraphrase_results.index(max(paraphrase_results))]}")

not paraphrase: 38%
is paraphrase: 62%
classification result: is paraphrase


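That is all the code needed for online deployment. So what does the detailed code behind the training in Section 4.2 look like? The next subsection answers that.

### 4.4 The training process in detail

transformers provides the config, tokenizer, and model classes to simplify tokenization and modeling, and the Trainer class to simplify training. What does the training process look like in more detail? This section implements and explains text classification training step by step.

In short, the text first has to be loaded, which can be done by defining a processor.

This process is not only a way to classify MRPC. More importantly, it can serve as a standard workflow for text classification, and for PyTorch in general, that you can extend later along the same lines.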

In [25]:
import logging
import os
import sys
import enum
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, List, Union, NamedTuple

import filelock
import torch
import numpy as np
import transformers
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score

In [26]:
# Logger
logger = logging.getLogger(__name__)

In [27]:
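# Define the model arguments: model, config, tokenizer, cache_dir, and so on
# There are three kinds of arguments:
#   model arguments, defined below;
#   training arguments, see src/transformers/training_args.py in the transformers repository: common training settings such as epochs and batch_size, plus device settings
#   data arguments, which control data processing: which data to use, the task name, whether to overwrite the data cache, and the maximum length used when building features
# These are the model arguments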
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )
        

# Data arguments (used for training)
@dataclass
class GlueDataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.

    Using `HfArgumentParser` we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    """

    task_name: str = field(metadata={"help": "The name of the task to train on: " + ", ".join(transformers.glue_processors.keys())})
    data_dir: str = field(
        metadata={"help": "The input data dir. Should contain the .tsv files (or other data files) for the task."}
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )

    def __post_init__(self):
        self.task_name = self.task_name.lower()

In [28]:
# Location of the downloaded pretrained model
BERT_MODEL_NAME_OR_PATH = '/dfsdata2/yucc1_data/models/huggingface/bert-base-uncased'
# Location of the MRPC dataset from the downloaded GLUE data
MRPC_DATA_DIR = '/dfsdata2/yucc1_data/datasets/glue_data/MRPC'
# Where the fine-tuned model, tokenizer, and other outputs are stored
FINETUNED_MRPC = '/dfsdata2/yucc1_data/output/finetuned-mrpc'

input_args = ['--model_name_or_path', BERT_MODEL_NAME_OR_PATH,
             '--task_name', 'MRPC',
             '--do_train',
             '--do_eval',
             '--data_dir', MRPC_DATA_DIR,
             '--max_seq_length', '128',
             '--per_device_train_batch_size', '32',
             '--learning_rate', '3e-5',
             '--num_train_epochs', '3.0',
             '--output_dir', FINETUNED_MRPC,
             '--overwrite_cache',
             '--overwrite_output_dir']

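# transformers has an HfArgumentParser that parses arguments in the format above into regular Python objects
parser = transformers.HfArgumentParser((ModelArguments, transformers.GlueDataTrainingArguments, transformers.TrainingArguments))
# Check the parser's help output to see the available options
# Parse the three kinds of arguments into their own dataclasses:
# model arguments, data arguments, and training arguments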
model_args, data_args, training_args = parser.parse_args_into_dataclasses(input_args)

In [29]:
model_args

ModelArguments(model_name_or_path='/dfsdata2/yucc1_data/models/huggingface/bert-base-uncased', config_name=None, tokenizer_name=None, cache_dir=None)

In [30]:
data_args

GlueDataTrainingArguments(task_name='mrpc', data_dir='/dfsdata2/yucc1_data/datasets/glue_data/MRPC', max_seq_length=128, overwrite_cache=True)

In [31]:
training_args

TrainingArguments(output_dir='/dfsdata2/yucc1_data/output/finetuned-mrpc', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=3e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jul24_16-53-35_d585a65fe6e2', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
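The idea of turning dataclass fields into command-line options can be sketched with the standard library alone. This is a simplified stand-in for what HfArgumentParser does, not its implementation: `DemoModelArguments` and `parse_into_dataclass` are hypothetical, and the sketch only handles string-valued fields.

```python
import argparse
import dataclasses
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DemoModelArguments:
    """Hypothetical, trimmed-down version of ModelArguments."""

    model_name_or_path: str = field(metadata={"help": "Path to the pretrained model"})
    cache_dir: Optional[str] = field(default=None, metadata={"help": "Cache directory"})


def parse_into_dataclass(cls, argv: List[str]):
    """Build one --option per dataclass field, parse argv, and return an instance."""
    parser = argparse.ArgumentParser()
    for f in dataclasses.fields(cls):
        # A field without a default becomes a required option
        required = f.default is dataclasses.MISSING
        parser.add_argument(
            f"--{f.name}",
            type=str,  # this sketch only supports string fields
            required=required,
            default=None if required else f.default,
            help=f.metadata.get("help"),
        )
    namespace = parser.parse_args(argv)
    return cls(**vars(namespace))


demo_args = parse_into_dataclass(
    DemoModelArguments, ["--model_name_or_path", "bert-base-uncased"]
)
print(demo_args)
```

Keeping the arguments in dataclasses means the same definitions serve as documentation, command-line interface, and typed configuration object.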

In [32]:
# Make sure output_dir is usable
if (
    os.path.exists(training_args.output_dir)
    and os.listdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    raise ValueError(
        f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
    )

In [33]:
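# Set up the log format, log a few key parameters, and print the training arguments
# A key takeaway: logging intermediate values with a logger matters, in transformers as much as in everyday study and work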
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
    "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
    training_args.local_rank,
    training_args.device,
    training_args.n_gpu,
    bool(training_args.local_rank != -1),
    training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

# Set the random seed
transformers.set_seed(training_args.seed)

07/24/2020 16:53:41 - INFO - transformers.training_args -   PyTorch: setting up devices
07/24/2020 16:53:41 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/dfsdata2/yucc1_data/output/finetuned-mrpc', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=3e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jul24_16-53-35_d585a65fe6e2', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)


In [None]:
# Data flow: InputExample => InputFeatures

In [None]:
# Processor interface:
#   get_train_examples -> list of InputExample
#   get_dev_examples   -> list of InputExample
#   get_labels         -> list of label strings
In [34]:
glue_tasks_num_labels = {
    "cola": 2,
    "mnli": 3,
    "mrpc": 2,
    "sst-2": 2,
    "sts-b": 1,
    "qqp": 2,
    "qnli": 2,
    "rte": 2,
    "wnli": 2,
}


glue_output_modes = {
    "cola": "classification",
    "mnli": "classification",
    "mnli-mm": "classification",
    "mrpc": "classification",
    "sst-2": "classification",
    "sts-b": "regression",
    "qqp": "classification",
    "qnli": "classification",
    "rte": "classification",
    "wnli": "classification",
}


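# Get the number of labels
# and the output mode, either classification or regression
# When adapting to a new task, the dict format does not matter; what you need is these two values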
try:
    num_labels = glue_tasks_num_labels[data_args.task_name]
    output_mode = glue_output_modes[data_args.task_name]
except KeyError:
    raise ValueError("Task not found: %s" % (data_args.task_name))
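When adapting this to a new task, the two values can be derived from the task's label list instead of a lookup table. The sketch below is one possible way to do that, not something the library provides: `describe_task` is a hypothetical helper, and it assumes a regression task reports its labels as `[None]`.

```python
from typing import List, Optional, Tuple


def describe_task(label_list: List[Optional[str]]) -> Tuple[int, str]:
    """Hypothetical helper: derive (num_labels, output_mode) from a label list."""
    # Assumption: a regression task has no discrete classes and reports [None]
    if label_list == [None]:
        return 1, "regression"
    return len(label_list), "classification"


print(describe_task(["0", "1"]))                                  # a binary task such as MRPC
print(describe_task(["contradiction", "entailment", "neutral"]))  # a three-class task
print(describe_task([None]))                                      # a regression task
```

The results agree with the tables above: 2 labels and classification for MRPC, 3 for a three-class task, and 1 label with regression for a task like STS-B.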

In [35]:
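# Load the config, tokenizer, and model.
# The config holds model settings such as the number of layers, dropout, number of heads, and the fine-tuning task; it is loaded only for the model to use.
# The number of labels written into the config sets the output size of the classification layer at the end of the model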
config = transformers.AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    num_labels=num_labels,
    finetuning_task=data_args.task_name,
    cache_dir=model_args.cache_dir,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
)
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
)

07/24/2020 16:53:51 - INFO - transformers.configuration_utils -   loading configuration file /dfsdata2/yucc1_data/models/huggingface/bert-base-uncased/config.json
07/24/2020 16:53:51 - INFO - transformers.configuration_utils -   Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": "mrpc",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

07/24/2020 16:53:51 - INFO - transformers.configuration_utils -   loading configuration file /dfsdata2/yucc1_data/models/huggingface/bert-base-uncased/config.json
07/24/2020 16:53:51 - INFO - transformers.configuration_utils -   Model config BertCon

In [36]:
# transformers.DataProcessor is a base class; subclasses implement get_train_examples, get_dev_examples, get_test_examples, and get_labels,
# which return the collection (list) of InputExample objects and the collection of labels
class MrpcProcessor(transformers.DataProcessor):
    """Processor for the MRPC data set (GLUE version)."""

    def get_example_from_tensor_dict(self, tensor_dict):
        """See base class."""
        return transformers.InputExample(
            tensor_dict["idx"].numpy(),
            tensor_dict["sentence1"].numpy().decode("utf-8"),
            tensor_dict["sentence2"].numpy().decode("utf-8"),
            str(tensor_dict["label"].numpy()),
        )

    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            text_b = line[4]
            label = None if set_type == "test" else line[0]
            examples.append(transformers.InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
    

glue_processors = {
    # Uses the class defined above
    "mrpc": MrpcProcessor,
    # Use the processors from the official library
    "cola": transformers.glue_processors['cola'],
    "mnli": transformers.glue_processors['mnli'],
    "mnli-mm": transformers.glue_processors['mnli-mm'],
    "sst-2": transformers.glue_processors['sst-2'],
    "sts-b": transformers.glue_processors['sts-b'],
    "qqp": transformers.glue_processors['qqp'],
    "qnli": transformers.glue_processors['qnli'],
    "rte": transformers.glue_processors['rte'],
    "wnli": transformers.glue_processors['wnli'],
}
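To see what the processor's `_create_examples` does with the rows of an MRPC file, here is a self-contained sketch that parses an in-memory TSV string. `Example` is a hypothetical stand-in for `transformers.InputExample`, `read_examples` is a hypothetical helper, and the two data rows are made up; only the column layout (label in column 0, sentences in columns 3 and 4) follows the code above.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class Example(NamedTuple):
    """Hypothetical stand-in for transformers.InputExample."""

    guid: str
    text_a: str
    text_b: str
    label: Optional[str]


# Made-up rows in the MRPC column layout: Quality, #1 ID, #2 ID, #1 String, #2 String
SAMPLE_TSV = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t1\t2\tThe cat sat on the mat .\tA cat was sitting on the mat .\n"
    "0\t3\t4\tIt rained all day .\tThe match ended in a draw .\n"
)


def read_examples(tsv_text: str, set_type: str) -> List[Example]:
    """Parse TSV text into examples, skipping the header row."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE))
    examples = []
    for i, row in enumerate(rows):
        if i == 0:
            continue  # header row
        # The test split carries no gold label
        label = None if set_type == "test" else row[0]
        examples.append(Example(f"{set_type}-{i}", row[3], row[4], label))
    return examples


examples = read_examples(SAMPLE_TSV, "train")
print(examples[0])
```

A processor for a new dataset mostly comes down to changing which columns feed `text_a`, `text_b`, and `label`.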

In [37]:
# Convert examples to features
def glue_convert_examples_to_features(
    examples: List[transformers.InputExample],
    tokenizer: transformers.PreTrainedTokenizer,
    max_length: Optional[int] = None,
    task=None,
    label_list=None,
    output_mode=None,
):
    """
    Loads a data file into a list of ``InputFeatures``

    Args:
        examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.
        tokenizer: Instance of a tokenizer that will tokenize the examples
        max_length: Maximum example length. Defaults to the tokenizer's max_len
        task: GLUE task
        label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method
        output_mode: String indicating the output mode. Either ``regression`` or ``classification``

    Returns:
        If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``
        containing the task-specific features. If the input is a list of ``InputExamples``, will return
        a list of task-specific ``InputFeatures`` which can be fed to the model.

    """
    if max_length is None:
        max_length = tokenizer.max_len

    if task is not None:
        processor = glue_processors[task]()
        if label_list is None:
            label_list = processor.get_labels()
            logger.info("Using label list %s for task %s" % (label_list, task))
        if output_mode is None:
            output_mode = glue_output_modes[task]
            logger.info("Using output mode %s for task %s" % (output_mode, task))

    label_map = {label: i for i, label in enumerate(label_list)}

    def label_from_example(example: transformers.InputExample) -> Union[int, float, None]:
        if example.label is None:
            return None
        if output_mode == "classification":
            return label_map[example.label]
        elif output_mode == "regression":
            return float(example.label)
        raise KeyError(output_mode)

    labels = [label_from_example(example) for example in examples]

    batch_encoding = tokenizer(
        [(example.text_a, example.text_b) for example in examples],
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )

    features = []
    for i in range(len(examples)):
        inputs = {k: batch_encoding[k][i] for k in batch_encoding}

        feature = transformers.InputFeatures(**inputs, label=labels[i])
        features.append(feature)

    for i, example in enumerate(examples[:5]):
        logger.info("*** Example ***")
        logger.info("guid: %s" % (example.guid))
        logger.info("features: %s" % features[i])

    return features
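
To make the conversion above concrete, here is a minimal, framework-free sketch of what happens to one example: the label string is mapped to an index through `label_map`, and the token ids are truncated or right-padded to `max_length`. The token ids are made up and `pad_or_truncate` is a hypothetical helper. A real tokenizer also inserts special tokens and returns `token_type_ids`, which this sketch leaves out.

```python
from typing import Dict, List

def build_label_map(label_list: List[str]) -> Dict[str, int]:
    # Same construction as in glue_convert_examples_to_features above.
    return {label: i for i, label in enumerate(label_list)}

def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    # Hypothetical helper: cut to max_length, then right-pad with pad_id.
    ids = ids[:max_length]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }

label_map = build_label_map(["0", "1"])        # MRPC has two labels
short = pad_or_truncate([101, 2572, 3217, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)

print(label_map["1"])            # 1
print(short["input_ids"])        # [101, 2572, 3217, 102, 0, 0, 0, 0]
print(short["attention_mask"])   # [1, 1, 1, 1, 0, 0, 0, 0]
print(long["input_ids"])         # [1, 2, 3, 4, 5, 6, 7, 8]
```

The attention mask is what lets the model ignore the padded positions, which is why padding to a fixed length does not change the prediction.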

In [38]:
class Split(enum.Enum):
    train = "train"
    dev = "dev"
    test = "test"
    
    
class GlueDataset(torch.utils.data.dataset.Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """

    args: GlueDataTrainingArguments
    output_mode: str
    features: List[transformers.InputFeatures]

    def __init__(
        self,
        args: transformers.GlueDataTrainingArguments,
        tokenizer: transformers.PreTrainedTokenizer,
        limit_length: Optional[int] = None,
        mode: Union[str, Split] = Split.train,
        cache_dir: Optional[str] = None,
    ):
        self.args = args
        self.processor = glue_processors[args.task_name]()
        self.output_mode = glue_output_modes[args.task_name]
        if isinstance(mode, str):
            try:
                mode = Split[mode]
            except KeyError:
                raise KeyError("mode is not a valid split name")
        # Load data features from cache or dataset file
        cached_features_file = os.path.join(
            cache_dir if cache_dir is not None else args.data_dir,
            "cached_{}_{}_{}_{}".format(
                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
            ),
        )
        label_list = self.processor.get_labels()
        if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__ in (
            RobertaTokenizer,
            RobertaTokenizerFast,
            XLMRobertaTokenizer,
            BartTokenizer,
            BartTokenizerFast,
        ):
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        self.label_list = label_list

        # Make sure only the first process in distributed training processes the dataset,
        # and the others will use the cache.
        lock_path = cached_features_file + ".lock"
        with filelock.FileLock(lock_path):

            if os.path.exists(cached_features_file) and not args.overwrite_cache:
                start = time.time()
                self.features = torch.load(cached_features_file)
                logger.info(
                    f"Loading features from cached file {cached_features_file} [took %.3f s]", time.time() - start
                )
            else:
                logger.info(f"Creating features from dataset file at {args.data_dir}")

                if mode == Split.dev:
                    examples = self.processor.get_dev_examples(args.data_dir)
                elif mode == Split.test:
                    examples = self.processor.get_test_examples(args.data_dir)
                else:
                    examples = self.processor.get_train_examples(args.data_dir)
                if limit_length is not None:
                    examples = examples[:limit_length]
                self.features = glue_convert_examples_to_features(
                    examples,
                    tokenizer,
                    max_length=args.max_seq_length,
                    label_list=label_list,
                    output_mode=self.output_mode,
                )
                start = time.time()
                torch.save(self.features, cached_features_file)
                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.
                logger.info(
                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )

    def __len__(self):
        return len(self.features)

    def __getitem__(self, i) -> transformers.InputFeatures:
        return self.features[i]

    def get_labels(self):
        return self.label_list
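
The caching logic in `__init__` follows a common load-or-build pattern. The sketch below reproduces that pattern with the standard library only: `pickle` stands in for `torch.save` and `torch.load`, the file lock is omitted, and `load_or_build` and `expensive_build` are hypothetical names used for illustration.

```python
import os
import pickle
import tempfile
from typing import Callable, List

def load_or_build(cache_file: str, build_fn: Callable[[], List[int]], overwrite_cache: bool = False) -> List[int]:
    # Reuse the cached features when they exist and overwriting was not requested.
    if os.path.exists(cache_file) and not overwrite_cache:
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    # Otherwise build the features once and store them for the next run.
    features = build_fn()
    with open(cache_file, "wb") as f:
        pickle.dump(features, f)
    return features

calls = []

def expensive_build() -> List[int]:
    calls.append("built")           # record every real build
    return [1, 2, 3]

with tempfile.TemporaryDirectory() as cache_dir:
    # The file name mirrors the scheme used above: split, tokenizer class, max length, task.
    cache_file = os.path.join(cache_dir, "cached_{}_{}_{}_{}".format("train", "BertTokenizer", 128, "mrpc"))
    first = load_or_build(cache_file, expensive_build)
    second = load_or_build(cache_file, expensive_build)    # served from the cache
    third = load_or_build(cache_file, expensive_build, overwrite_cache=True)

print(len(calls))  # 2: the first call and the forced rebuild
```

Because the tokenizer class and the maximum length are part of the file name, changing either one produces a different cache file instead of silently reusing stale features.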

In [39]:
# Build the train / dev / test datasets
train_dataset = (
    GlueDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
)
eval_dataset = (
    GlueDataset(data_args, tokenizer=tokenizer, mode="dev", cache_dir=model_args.cache_dir)
    if training_args.do_eval
    else None
)
test_dataset = (
    GlueDataset(data_args, tokenizer=tokenizer, mode="test", cache_dir=model_args.cache_dir)
    if training_args.do_predict
    else None
)

07/24/2020 16:54:29 - INFO - filelock -   Lock 139787600255688 acquired on /dfsdata2/yucc1_data/datasets/glue_data/MRPC/cached_train_BertTokenizer_128_mrpc.lock
07/24/2020 16:54:29 - INFO - __main__ -   Creating features from dataset file at /dfsdata2/yucc1_data/datasets/glue_data/MRPC
07/24/2020 16:54:29 - INFO - __main__ -   LOOKING AT /dfsdata2/yucc1_data/datasets/glue_data/MRPC/train.tsv
07/24/2020 16:54:32 - INFO - __main__ -   *** Example ***
07/24/2020 16:54:32 - INFO - __main__ -   guid: train-1
07/24/2020 16:54:32 - INFO - __main__ -   features: InputFeatures(input_ids=[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [40]:
class EvalPrediction(NamedTuple):
    """
    Evaluation output (always contains labels), to be used to compute metrics.

    Parameters:
        predictions (:obj:`np.ndarray`): Predictions of the model.
        label_ids (:obj:`np.ndarray`): Targets to be matched.
    """

    predictions: np.ndarray
    label_ids: np.ndarray

In [41]:
def simple_accuracy(preds, labels):
    return (preds == labels).mean()

def acc_and_f1(preds, labels):
    acc = simple_accuracy(preds, labels)
    f1 = f1_score(y_true=labels, y_pred=preds)
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,
    }

def pearson_and_spearman(preds, labels):
    pearson_corr = pearsonr(preds, labels)[0]
    spearman_corr = spearmanr(preds, labels)[0]
    return {
        "pearson": pearson_corr,
        "spearmanr": spearman_corr,
        "corr": (pearson_corr + spearman_corr) / 2,
    }

def glue_compute_metrics(task_name, preds, labels):
    assert len(preds) == len(labels)
    if task_name == "cola":
        return {"mcc": matthews_corrcoef(labels, preds)}
    elif task_name == "sst-2":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mrpc":
        return acc_and_f1(preds, labels)
    elif task_name == "sts-b":
        return pearson_and_spearman(preds, labels)
    elif task_name == "qqp":
        return acc_and_f1(preds, labels)
    elif task_name == "mnli":
        return {"mnli/acc": simple_accuracy(preds, labels)}
    elif task_name == "mnli-mm":
        return {"mnli-mm/acc": simple_accuracy(preds, labels)}
    elif task_name == "qnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "rte":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "wnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "hans":
        return {"acc": simple_accuracy(preds, labels)}
    else:
        raise KeyError(task_name)

In [42]:
# Build the metrics function for a given task
def build_compute_metrics_fn(task_name: str) -> Callable[[EvalPrediction], Dict]:
    def compute_metrics_fn(p: EvalPrediction):
        if output_mode == "classification":
            preds = np.argmax(p.predictions, axis=1)
        elif output_mode == "regression":
            preds = np.squeeze(p.predictions)
        return glue_compute_metrics(task_name, preds, p.label_ids)

    return compute_metrics_fn
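
As a quick check on what the classification branch reports for MRPC, the sketch below takes the argmax of a few logits and computes accuracy and binary F1 by hand. It uses only the standard library, so `argmax_rows` and `f1_binary` are hypothetical stand-ins for `np.argmax(..., axis=1)` and for `f1_score` with positive label 1. The logits and labels are invented.

```python
from typing import List

def argmax_rows(logits: List[List[float]]) -> List[int]:
    # Pure-Python equivalent of np.argmax(logits, axis=1).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def f1_binary(preds: List[int], labels: List[int]) -> float:
    # F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN).
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented logits for five sentence pairs: column 0 = "not paraphrase", column 1 = "paraphrase".
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]]
labels = [1, 0, 0, 1, 1]

preds = argmax_rows(logits)
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
f1 = f1_binary(preds, labels)

print(preds)                      # [1, 0, 1, 0, 1]
print(acc)                        # 0.6
print(round(f1, 4))               # 0.6667
print(round((acc + f1) / 2, 4))   # 0.6333
```

Here two positives are found, one negative is wrongly flagged and one positive is missed, so F1 is 4 / 6 while accuracy is 3 / 5. The `acc_and_f1` value is simply their mean.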

In [43]:
# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=build_compute_metrics_fn(data_args.task_name),
)

07/24/2020 16:55:06 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


In [44]:
# Train
if training_args.do_train:
    trainer.train(
        model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    )
    trainer.save_model()
    # For convenience, we also re-save the tokenizer to the same directory,
    # so that you can share your model easily on huggingface.co/models =)
    if trainer.is_world_master():
        tokenizer.save_pretrained(training_args.output_dir)

07/24/2020 16:55:06 - INFO - transformers.trainer -   ***** Running training *****
07/24/2020 16:55:06 - INFO - transformers.trainer -     Num examples = 3668
07/24/2020 16:55:06 - INFO - transformers.trainer -     Num Epochs = 3
07/24/2020 16:55:06 - INFO - transformers.trainer -     Instantaneous batch size per device = 32
07/24/2020 16:55:06 - INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 64
07/24/2020 16:55:06 - INFO - transformers.trainer -     Gradient Accumulation steps = 1
07/24/2020 16:55:06 - INFO - transformers.trainer -     Total optimization steps = 174
07/24/2020 16:55:06 - INFO - transformers.trainer -     Starting fine-tuning.



07/24/2020 16:56:12 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)








07/24/2020 16:56:12 - INFO - transformers.trainer -   Saving model checkpoint to /dfsdata2/yucc1_data/output/finetuned-mrpc
07/24/2020 16:56:13 - INFO - transformers.configuration_utils -   Configuration saved in /dfsdata2/yucc1_data/output/finetuned-mrpc/config.json
07/24/2020 16:56:13 - INFO - transformers.modeling_utils -   Model weights saved in /dfsdata2/yucc1_data/output/finetuned-mrpc/pytorch_model.bin
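
The step count in the training log can be reproduced by hand. The sketch below assumes that one optimization step consumes one total batch and that a partial final batch still counts as a step. `steps_per_epoch` is a hypothetical helper, and the numbers are copied from the log above.

```python
import math

def steps_per_epoch(num_examples: int, total_batch_size: int, gradient_accumulation_steps: int = 1) -> int:
    # Batches per epoch, counting the final partial batch, divided by accumulation.
    batches = math.ceil(num_examples / total_batch_size)
    return batches // gradient_accumulation_steps

num_examples = 3668          # "Num examples" in the log
per_device_batch = 32        # "Instantaneous batch size per device"
total_batch = 64             # "Total train batch size"
num_epochs = 3               # "Num Epochs"

devices = total_batch // per_device_batch
per_epoch = steps_per_epoch(num_examples, total_batch)
total_steps = per_epoch * num_epochs

print(devices)       # 2
print(per_epoch)     # 58
print(total_steps)   # 174, matching "Total optimization steps"
```

The ratio between the total and per-device batch sizes suggests the run used two devices, and 58 steps per epoch over 3 epochs gives the 174 steps reported.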


In [45]:
# Evaluate
eval_results = {}
if training_args.do_eval:
    logger.info("*** Evaluate ***")

    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_datasets = [eval_dataset]
    if data_args.task_name == "mnli":
        mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
        eval_datasets.append(
            GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev", cache_dir=model_args.cache_dir)
        )

    for eval_dataset in eval_datasets:
        trainer.compute_metrics = build_compute_metrics_fn(eval_dataset.args.task_name)
        eval_result = trainer.evaluate(eval_dataset=eval_dataset)

        output_eval_file = os.path.join(
            training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
        )
        if trainer.is_world_master():
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
                for key, value in eval_result.items():
                    logger.info("  %s = %s", key, value)
                    writer.write("%s = %s\n" % (key, value))

        eval_results.update(eval_result)

07/24/2020 16:56:14 - INFO - __main__ -   *** Evaluate ***
07/24/2020 16:56:14 - INFO - transformers.trainer -   ***** Running Evaluation *****
07/24/2020 16:56:14 - INFO - transformers.trainer -     Num examples = 408
07/24/2020 16:56:14 - INFO - transformers.trainer -     Batch size = 16



07/24/2020 16:56:15 - INFO - transformers.trainer -   {'eval_loss': 0.4694334108095903, 'eval_acc': 0.8112745098039216, 'eval_f1': 0.8718801996672212, 'eval_acc_and_f1': 0.8415773547355714, 'epoch': 3.0, 'step': 174}
07/24/2020 16:56:15 - INFO - __main__ -   ***** Eval results mrpc *****
07/24/2020 16:56:15 - INFO - __main__ -     eval_loss = 0.4694334108095903
07/24/2020 16:56:15 - INFO - __main__ -     eval_acc = 0.8112745098039216
07/24/2020 16:56:15 - INFO - __main__ -     eval_f1 = 0.8718801996672212
07/24/2020 16:56:15 - INFO - __main__ -     eval_acc_and_f1 = 0.8415773547355714
07/24/2020 16:56:15 - INFO - __main__ -     epoch = 3.0





In [46]:
# Predict
if training_args.do_predict:
    logger.info("*** Test ***")
    test_datasets = [test_dataset]
    if data_args.task_name == "mnli":
        mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
        test_datasets.append(
            GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test", cache_dir=model_args.cache_dir)
        )

    for test_dataset in test_datasets:
        predictions = trainer.predict(test_dataset=test_dataset).predictions
        if output_mode == "classification":
            predictions = np.argmax(predictions, axis=1)

        output_test_file = os.path.join(
            training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
        )
        if trainer.is_world_master():
            with open(output_test_file, "w") as writer:
                logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
                writer.write("index\tprediction\n")
                for index, item in enumerate(predictions):
                    if output_mode == "regression":
                        writer.write("%d\t%3.3f\n" % (index, item))
                    else:
                        item = test_dataset.get_labels()[item]
                        writer.write("%d\t%s\n" % (index, item))

In [47]:
print(eval_results)

{'eval_loss': 0.4694334108095903, 'eval_acc': 0.8112745098039216, 'eval_f1': 0.8718801996672212, 'eval_acc_and_f1': 0.8415773547355714, 'epoch': 3.0}


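### 4.5 Summary of using transformers

With the transformers framework the code stays short, and the overall workflow has only a few steps.

1. Pass in the arguments. There are three kinds: model arguments, training arguments, and data arguments.
2. Process the data, from raw data to something the model can consume. The end result is a set of fixed-length, standardized batches that can be iterated over. This usually breaks down into the following steps, which also mirror the usual flow of PyTorch code:

Step one: set up a processor that loads the dataset into a collection (a list) of examples, where each example can be an InputExample object. At this stage the data is still text; text_a, text_b and label simply become attributes of the InputExample.

Step two: convert to a torch.utils.data.Dataset. PyTorch's Dataset is an abstract class, and we need to override its `__len__` and `__getitem__` methods. The goal is to turn the text into individual records the model can use. Here each InputExample is converted into a feature, and this is where the tokenizer does its work, including padding and truncation.

Step three: in transformers, the dataset from the previous step can be passed straight to Trainer, which handles this step automatically, so there is little need to worry about the details. In plain PyTorch, this step converts the dataset into a DataLoader, which means deciding on batch_size, whether to shuffle, the sampling method, and the collate_fn used to assemble a batch. In transformers, collate_fn can be left at its default or replaced by your own function, and batch_size is passed in through the arguments from item 1.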

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None)

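3. Define the metrics used to evaluate the predictions. If needed, write a collate_fn as well.
4. Pass the prepared train_dataset, dev_dataset and test_dataset, together with the model, tokenizer, metrics and args, to the Trainer, then train and save. Saving uses save_model, which simply calls the standard save_pretrained interface.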
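
The data path in item 2 can be sketched without any framework. The code below is a simplified stand-in, not the PyTorch implementation: `ToyDataset` plays the role of a `torch.utils.data.Dataset` by defining `__len__` and `__getitem__`, and the hypothetical `iter_batches` and `collate` functions imitate what a `DataLoader` does with `batch_size`, `shuffle=False` and a `collate_fn`.

```python
from typing import Dict, Iterator, List

Feature = Dict[str, object]

class ToyDataset:
    """Stand-in for torch.utils.data.Dataset: indexable and sized."""

    def __init__(self, features: List[Feature]):
        self.features = features

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, i: int) -> Feature:
        return self.features[i]

def collate(samples: List[Feature]) -> Dict[str, list]:
    # Turn a list of per-example dicts into one dict of lists (a "batch").
    return {key: [s[key] for s in samples] for key in samples[0]}

def iter_batches(dataset: ToyDataset, batch_size: int) -> Iterator[Dict[str, list]]:
    # Sequential sampling, keeping the last partial batch (drop_last=False).
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield collate([dataset[i] for i in range(start, stop)])

dataset = ToyDataset([{"input_ids": [101, i, 102], "label": i % 2} for i in range(5)])
batches = list(iter_batches(dataset, batch_size=2))

print(len(batches))               # 3
print(batches[0]["label"])        # [0, 1]
print(batches[-1]["input_ids"])   # [[101, 4, 102]]
```

In real code the collate step would also stack the lists into tensors. Trainer does this internally, which is why only the dataset has to be supplied.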

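## Section 5: GPT2 training and usage example

### 5.1 Pretraining a model from scratch and generating text

Using GPT2 as the example, this section shows how to pretrain a model from scratch and how to generate text with GPT2, and then walks through the pretraining code in more detail.

#### Pretraining directly from scratch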

In [1]:
RUN_LANGUAGE_MODEL_PY = ('/dfsdata2/yucc1_data/projects/transformers_study/'
    'transformers/examples/language-modeling/run_language_modeling.py')
# config, tokenizer, model
CONFIG_NAME = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
TOKENIZER_NAME = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
# Not needed when pretraining from scratch; to continue fine-tuning, point this at the original gpt2 model
GPT2_MODEL_NAME_OR_PATH = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
# Output directory for the newly trained model
GPT2_OUTPUT_DIR = '/dfsdata2/yucc1_data/output/gpt2-train-new-model'
# TRAIN_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.train.raw'
TRAIN_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.test.raw'
EVAL_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.test.raw'

The command below pretrains from scratch. To fine-tune instead, add the following line to it:

--model_name_or_path={GPT2_MODEL_NAME_OR_PATH} \

In [2]:
!python {RUN_LANGUAGE_MODEL_PY} \
--output_dir={GPT2_OUTPUT_DIR} \
--model_type=gpt2 \
--config_name={CONFIG_NAME} \
--tokenizer_name={TOKENIZER_NAME} \
--do_train \
--train_data_file={TRAIN_DATA_FILE} \
--do_eval \
--eval_data_file={EVAL_DATA_FILE} \
--block_size=510 \
--save_steps=5000 \
--num_train_epochs=2.0 \
--overwrite_cache \
--overwrite_output_dir

07/24/2020 17:34:26 - INFO - transformers.training_args -   PyTorch: setting up devices
07/24/2020 17:34:26 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/dfsdata2/yucc1_data/output/gpt2-train-new-model', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=2.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jul24_17-34-26_d585a65fe6e2', logging_first_step=False, logging_steps=500, save_steps=5000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
07/24/2020 17:34:26 - INFO - transformers.conf

### 5.2 Generating text with a pretrained model

In [3]:
import torch
import transformers

In [5]:
# config, tokenizer, model
CONFIG_NAME = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
TOKENIZER_NAME = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
# Not needed when pretraining from scratch; to continue fine-tuning, point this at the original gpt2 model
GPT2_MODEL_NAME_OR_PATH = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
# Output directory for the newly trained model
GPT2_OUTPUT_DIR = '/dfsdata2/yucc1_data/output/gpt2-train-new-model'
# TRAIN_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.train.raw'
TRAIN_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.test.raw'
EVAL_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.test.raw'

In [6]:
!tree {GPT2_OUTPUT_DIR}

/dfsdata2/yucc1_data/output/gpt2-train-new-model
├── config.json
├── eval_results_lm.txt
├── merges.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.json

0 directories, 8 files


In [7]:
# Use the model we just pretrained
# Load either this model or the one in the next cell; only one of the two is needed
model = transformers.AutoModelForCausalLM.from_pretrained(GPT2_OUTPUT_DIR)
tokenizer = transformers.AutoTokenizer.from_pretrained(GPT2_OUTPUT_DIR)

In [9]:
# Use the official pretrained GPT2 model
model = transformers.AutoModelForCausalLM.from_pretrained(GPT2_MODEL_NAME_OR_PATH)
tokenizer = transformers.AutoTokenizer.from_pretrained(GPT2_MODEL_NAME_OR_PATH)

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at /dfsdata2/yucc1_data/models/huggingface/gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


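##### Text generation

There are two families of text decoding strategies [7]:

- Argmax Decoding: mainly beam search, class-factored softmax, and similar methods
- Stochastic Decoding: mainly temperature sampling, top-k sampling, and similar methods

Most text generation tasks use Argmax Decoding directly, with beam search being the most common choice. When the vocabulary is large, say 50k or even 150k tokens, the softmax layer becomes very expensive, because every decoding step has to normalize over the whole vocabulary. Two more efficient alternatives are Class-factored Softmax and the Pointer-generator Network.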
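
To show what beam search buys over plain greedy decoding, here is a toy sketch on an invented four-word language. The probability table `NEXT` and the function `beam_search` are hypothetical and exist only for illustration. A real decoder scores the whole vocabulary with the model at every step and usually adds length normalization, which is left out here.

```python
import math
from typing import Dict, List, Tuple

# Invented next-token probabilities that depend only on the previous token.
NEXT: Dict[str, Dict[str, float]] = {
    "<s>": {"I": 0.6, "We": 0.4},
    "I": {"know": 0.55, "think": 0.45},
    "We": {"know": 0.9, "think": 0.1},
    "know": {"</s>": 1.0},
    "think": {"</s>": 1.0},
}

def beam_search(num_beams: int, max_steps: int) -> List[Tuple[float, List[str]]]:
    beams = [(0.0, ["<s>"])]                      # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":                 # finished hypotheses are carried over
                candidates.append((score, seq))
                continue
            for token, p in NEXT[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [token]))
        # Keep only the num_beams highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_score, greedy_seq = beam_search(num_beams=1, max_steps=3)[0]
beam_score, beam_seq = beam_search(num_beams=2, max_steps=3)[0]

print(" ".join(greedy_seq), round(math.exp(greedy_score), 2))   # <s> I know </s> 0.33
print(" ".join(beam_seq), round(math.exp(beam_score), 2))       # <s> We know </s> 0.36
```

With a single beam the search commits to "I" because it is the best first word, and it ends at probability 0.33. Keeping two beams lets "We" survive one more step and win with 0.36.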

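In practice, Argmax Decoding often makes the model generate repetitive sentences, such as "I don't know. I don't know. I don't know....". One workable remedy is to introduce randomness into the decoding process. However, the paper The Curious Case of Neural Text Degeneration points out that sampling from the full vocabulary distribution produces very incoherent sentences: when the vocabulary is very large, each word gets only a small probability, and the model becomes quite likely to sample a word from the tail of the distribution. Once it samples a tail word that is unrelated to the preceding text, the words that follow are likely to be affected, and the sentence drifts away from its original meaning. We therefore need to sample from a truncated vocabulary distribution. The most common algorithms are: (1) Temperature Sampling (which sharpens the distribution rather than truncating it), (2) Top-k Sampling, (3) Top-p Sampling.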
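
The three sampling strategies named above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the library's implementation: the logits are invented, and `softmax` and `top_k_top_p_filter` are hypothetical helpers that work on lists instead of tensors.

```python
import math
import random
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs: List[float], top_k: int = 0, top_p: float = 1.0) -> List[float]:
    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]                      # keep the k most probable tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:                    # smallest set reaching mass top_p
            break
    mass = sum(probs[i] for i in kept)
    # Zero out everything else and renormalize the survivors.
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]               # invented scores for a 5-token vocabulary
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)           # more mass on the top token
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.9)

rng = random.Random(0)                             # fixed seed so the draw is repeatable
token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

print([round(p, 3) for p in filtered])
print(token)
```

After filtering, the two least likely tokens have probability zero, so the sampler can no longer wander into the tail, while the remaining three tokens still leave room for variety.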

In [10]:
# Decoding example 1:
# Initial text
prompt = "Today the weather is really nice and I am planning on "
# Convert to a tensor
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
# Generate with beam search
beam_outputs = model.generate(input_ids=inputs,
           max_length=50+inputs.shape[-1],
           min_length=2+inputs.shape[-1],
           num_beams=10,
           num_return_sequences=5,)
for i in range(5):
    output_ids = beam_outputs[i].tolist()
    text = tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'decode {i}: {text}')

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


decode 0: Today the weather is really nice and I am planning on  going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going
decode 1: Today the weather is really nice and I am planning on  going to go out on a limb and go out on a limb and go out on a limb and go out on a limb and go out on a limb and go out on a limb and go out on a limb and go out on a limb
decode 2: Today the weather is really nice and I am planning on  going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb.

decode 3: Today the weather is really nice and I am planning on  going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out on a limb and going out o

In [11]:
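# Decoding example 2:
# Initial text
prompt = "Today the weather is really nice and I am planning on "
# Convert to a tensor
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
# Given the input tensor, generate a new tensor
# Note that max_length includes the length of the prompt, so if the prompt alone exceeds 250 tokens, nothing new can be generated
# top_p sorts tokens by probability in descending order and keeps the smallest set whose cumulative probability reaches 0.95; the rest are discarded.
# top_k keeps the 60 most probable tokens and discards the rest.
# Combined: sample from at most 60 tokens, further limited to those needed to reach a cumulative probability of 0.95.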
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = tokenizer.decode(outputs[0])
print(generated)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Today the weather is really nice and I am planning on  going to do a run in April and May this year  to play the Tibetan Cup games in Nairobi or just to play with me as a spectator for a year. I will post a few pictures shortly!
I will be going over a few of my favourite items that were last year, but there will be many more!  
These items are good stuff so i won't do them up here just to show you!  
There are two main things I would like to see in your purchases during this tour, and you will find them on my shop, right here.  For example, let's say you were to visit Nairobi in April, I was going to visit Nairobi and if you visited in May I would give you a discount on Nairobi tickets from the best book sellers here as well as a free lunch and drinks from Nairobi, which is great!  Now, you will also want to see my new 3 year anniversary event to watch the cricket team play in Nairobi during the World Cup 2015 in South Africa!
Here at Nairobi you will get everything that you need


### 5.3 The pretraining process in detail

In [1]:
import logging
import math
import os
import time
import pickle
from dataclasses import dataclass, field
from typing import Optional, List, Dict, Tuple

import filelock
import torch
import transformers



In [2]:
# Initialize the logger
logger = logging.getLogger(__name__)

In [3]:
MODEL_CONFIG_CLASSES = list(transformers.MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

In [4]:
# As in the training section above: model arguments first, data arguments below
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
    """

    model_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": "The model checkpoint for weights initialization. Leave None if you want to train a model from scratch."
        },
    )
    model_type: Optional[str] = field(
        default=None,
        metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )
    eval_data_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    line_by_line: bool = field(
        default=False,
        metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
    )

    mlm: bool = field(
        default=False, metadata={"help": "Train with masked-language modeling loss instead of language modeling."}
    )
    mlm_probability: float = field(
        default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
    )
    plm_probability: float = field(
        default=1 / 6,
        metadata={
            "help": "Ratio of length of a span of masked tokens to surrounding context length for permutation language modeling."
        },
    )
    max_span_length: int = field(
        default=5, metadata={"help": "Maximum length of a span of masked tokens for permutation language modeling."}
    )

    block_size: int = field(
        default=-1,
        metadata={
            "help": "Optional input sequence length after tokenization. "
            "The training dataset will be truncated in block of this size for training. "
            "Default to the model max input length for single sentence inputs (take into account special tokens)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )

In [5]:
RUN_LANGUAGE_MODEL_PY = ('/dfsdata2/yucc1_data/projects/transformers_study/'
    'transformers/examples/language-modeling/run_language_modeling.py')
# config, tokenizer, model
CONFIG_NAME = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
TOKENIZER_NAME = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
# Not needed when pretraining from scratch; to continue fine-tuning, point this at the original gpt2 model
GPT2_MODEL_NAME_OR_PATH = '/dfsdata2/yucc1_data/models/huggingface/gpt2'
# Output directory for the newly trained model
GPT2_OUTPUT_DIR = '/dfsdata2/yucc1_data/output/gpt2-train-new-model'
# TRAIN_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.train.raw'
TRAIN_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.test.raw'
EVAL_DATA_FILE = '/dfsdata2/yucc1_data/datasets/wikitext-2-raw/wiki.test.raw'

input_args = [
    '--output_dir', GPT2_OUTPUT_DIR,
    '--model_type', 'gpt2',
    '--config_name', CONFIG_NAME,
    '--tokenizer_name', TOKENIZER_NAME,
    '--do_train',
    '--train_data_file', TRAIN_DATA_FILE,
    '--do_eval',
    '--eval_data_file', EVAL_DATA_FILE,
    '--block_size', '510',
    '--save_steps', '5000',
    '--num_train_epochs', '2.0',
    '--overwrite_cache',
    '--overwrite_output_dir',
]

parser = transformers.HfArgumentParser((ModelArguments, DataTrainingArguments, transformers.TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses(input_args)
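
`HfArgumentParser` turns dataclass fields into command-line flags and parses an argument list back into dataclass instances. The sketch below imitates that idea with the standard library only; it is not the library's implementation. `ToyArguments` and `parse_into_dataclass` are hypothetical, and only `str`, `int`, `float` and `bool` fields are handled.

```python
import argparse
import dataclasses
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToyArguments:
    output_dir: str = field(default="out", metadata={"help": "Where to write the model."})
    block_size: int = field(default=-1, metadata={"help": "Sequence length after tokenization."})
    num_train_epochs: float = field(default=3.0, metadata={"help": "Number of epochs."})
    do_train: bool = field(default=False, metadata={"help": "Whether to run training."})

def parse_into_dataclass(cls, argv: List[str]):
    parser = argparse.ArgumentParser()
    for f in dataclasses.fields(cls):
        help_text = f.metadata.get("help")
        if f.type is bool:
            # Boolean fields become switches: present means True.
            parser.add_argument("--" + f.name, action="store_true", help=help_text)
        else:
            # The field's annotation doubles as the argparse converter.
            parser.add_argument("--" + f.name, type=f.type, default=f.default, help=help_text)
    return cls(**vars(parser.parse_args(argv)))

args = parse_into_dataclass(
    ToyArguments,
    ["--output_dir", "gpt2-out", "--block_size", "510", "--num_train_epochs", "2.0", "--do_train"],
)
print(args)
# ToyArguments(output_dir='gpt2-out', block_size=510, num_train_epochs=2.0, do_train=True)
```

Passing the argument list explicitly, as the notebook does with `input_args`, is what makes the script's command-line interface usable from inside a notebook cell.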

In [6]:
# Validate the input arguments:
# do_eval and eval_data_file must be consistent with each other,
# and the output directory must be usable
if data_args.eval_data_file is None and training_args.do_eval:
    raise ValueError(
        "Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
        "or remove the --do_eval argument."
    )

if (
    os.path.exists(training_args.output_dir)
    and os.listdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    raise ValueError(
        f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
    )


In [7]:
# Set up the logging format and record the key parameters of this run
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
    "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
    training_args.local_rank,
    training_args.device,
    training_args.n_gpu,
    bool(training_args.local_rank != -1),
    training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

# Set the seed
transformers.set_seed(training_args.seed)

07/26/2020 09:34:46 - INFO - transformers.training_args -   PyTorch: setting up devices
07/26/2020 09:34:46 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/dfsdata2/yucc1_data/output/gpt2-train-new-model', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=2.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jul26_09-34-44_b2275efd265b', logging_first_step=False, logging_steps=500, save_steps=5000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, dataloader_drop_last=False)


In [8]:
# Load the config
if model_args.config_name:
    config = transformers.AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
    config = transformers.AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
else:
    config = transformers.CONFIG_MAPPING[model_args.model_type]()
    logger.warning("You are instantiating a new config instance from scratch.")

# Load the tokenizer
if model_args.tokenizer_name:
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
else:
    raise ValueError(
        "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it, "
        "and load it from here, using --tokenizer_name"
    )

if model_args.model_name_or_path:
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
    )
else:
    logger.info("Training new model from scratch")
    model = transformers.AutoModelForCausalLM.from_config(config)

# If the tokenizer and the model have different numbers of embeddings, resize the model to match
model.resize_token_embeddings(len(tokenizer))

07/26/2020 09:34:48 - INFO - transformers.configuration_utils -   loading configuration file /dfsdata2/yucc1_data/models/huggingface/gpt2/config.json
07/26/2020 09:34:48 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}

07/26/2020 09:34:48 - INFO - transformers.configuration_utils -   loading configuration file 

Embedding(50257, 768)

In [9]:
# Validate the arguments and set block_size
if config.model_type in ["bert", "roberta", "distilbert", "camembert"] and not data_args.mlm:
    raise ValueError(
        "BERT and RoBERTa-like models do not have LM heads but masked LM heads. They must be run using the "
        "--mlm flag (masked language modeling)."
    )

if data_args.block_size <= 0:
    data_args.block_size = tokenizer.max_len
    # Our input block size will be the max possible for the model
else:
    data_args.block_size = min(data_args.block_size, tokenizer.max_len)
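
The block-size rule in the cell above can be read as a small pure function. The sketch below is only an illustration: `resolve_block_size` is a hypothetical helper name, and the limit of 1024 is taken from `n_positions` in the GPT-2 config printed earlier, on the assumption that the tokenizer reports the same maximum.

```python
def resolve_block_size(requested: int, tokenizer_max_len: int) -> int:
    """Mirror the cell above: a non-positive request means 'use the model maximum'."""
    if requested <= 0:
        return tokenizer_max_len
    # Never exceed what the model can accept.
    return min(requested, tokenizer_max_len)


print(resolve_block_size(-1, 1024))    # 1024
print(resolve_block_size(510, 1024))   # 510
print(resolve_block_size(4096, 1024))  # 1024
```

A request of 510 is kept as is, which is consistent with the `torch.Size([510])` examples this notebook produces.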

In [10]:
class TextDataset(torch.utils.data.dataset.Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """

    def __init__(
        self, tokenizer: transformers.PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,
    ):
        assert os.path.isfile(file_path)

        block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)

        directory, filename = os.path.split(file_path)
        cached_features_file = os.path.join(
            directory, "cached_lm_{}_{}_{}".format(tokenizer.__class__.__name__, str(block_size), filename,),
        )

        # Make sure only the first process in distributed training processes the dataset,
        # and the others will use the cache.
        lock_path = cached_features_file + ".lock"
        with filelock.FileLock(lock_path):

            if os.path.exists(cached_features_file) and not overwrite_cache:
                start = time.time()
                with open(cached_features_file, "rb") as handle:
                    self.examples = pickle.load(handle)
                logger.info(
                    "Loading features from cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )

            else:
                logger.info(f"Creating features from dataset file at {directory}")

                self.examples = []
                with open(file_path, encoding="utf-8") as f:
                    text = f.read()

                tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

                for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
                    self.examples.append(
                        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])
                    )
                # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
                # If your dataset is small, first you should look for a bigger one :-) and second you
                # can change this behavior by adding (model specific) padding.

                start = time.time()
                with open(cached_features_file, "wb") as handle:
                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
                logger.info(
                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)


class LineByLineTextDataset(torch.utils.data.dataset.Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """

    def __init__(self, tokenizer: transformers.PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)
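
The core of `TextDataset` is the loop that cuts one long token sequence into fixed-size blocks and drops the trailing remainder. A minimal stand-alone sketch of just that step, using plain integers instead of real token ids (`chunk_token_ids` is a hypothetical name, not part of the library):

```python
from typing import List


def chunk_token_ids(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split a flat id sequence into full blocks; the trailing partial block is dropped."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


ids = list(range(10))                   # stand-in for tokenized text
blocks = chunk_token_ids(ids, block_size=4)
print(blocks)                           # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The ids 8 and 9 do not fill a whole block, so they are discarded, which is the "no padding" simplification the comment in the class describes.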


In [11]:
@dataclass
class DataCollatorForLanguageModeling:
    """
    Data collator used for language modeling.
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """

    tokenizer: transformers.PreTrainedTokenizer
    mlm: bool = True
    mlm_probability: float = 0.15

    def __call__(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:
        batch = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[torch.Tensor]) -> torch.Tensor:
        length_of_first = examples[0].size(0)
        are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)
        if are_tensors_same_length:
            return torch.stack(examples, dim=0)
        else:
            if self.tokenizer._pad_token is None:
                raise ValueError(
                    "You are attempting to pad samples but the tokenizer you are using"
                    f" ({self.tokenizer.__class__.__name__}) does not have one."
                )
            return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)

    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """

        if self.tokenizer.mask_token is None:
            raise ValueError(
                "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
            )

        labels = inputs.clone()
        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        special_tokens_mask = [
            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
        ]
        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
        if self.tokenizer._pad_token is not None:
            padding_mask = labels.eq(self.tokenizer.pad_token_id)
            probability_matrix.masked_fill_(padding_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels
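
The `0.5` in the second Bernoulli draw of `mask_tokens` can look like it contradicts the "10% random" comment. It does not, because that draw only applies to the 20% of selected tokens that were not already replaced by the mask token. The arithmetic, worked with exact fractions as a sanity check:

```python
from fractions import Fraction

p_selected = Fraction(15, 100)   # mlm_probability: share of eligible tokens chosen for prediction
p_mask = Fraction(8, 10)         # first draw: replace the selected token with the mask token
p_second_draw = Fraction(1, 2)   # second draw, only effective on tokens not already masked

p_random = (1 - p_mask) * p_second_draw   # 1/5 * 1/2
p_unchanged = 1 - p_mask - p_random       # whatever is left

print(p_mask, p_random, p_unchanged)      # 4/5 1/10 1/10
print(p_selected * p_mask)                # 3/25
```

So among the selected tokens the split is 80% / 10% / 10%, and with the default `mlm_probability` about 12% of eligible (non-special, non-padding) tokens end up replaced by the mask token.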

In [12]:
# Build a dataset
def get_dataset(args: DataTrainingArguments, tokenizer: transformers.PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(
            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )

In [13]:
# Get the datasets and the data collator
train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
if config.model_type == "xlnet":
    data_collator = transformers.DataCollatorForPermutationLanguageModeling(
        tokenizer=tokenizer, plm_probability=data_args.plm_probability, max_span_length=data_args.max_span_length,
    )
else:
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
    )

07/26/2020 09:35:01 - INFO - filelock -   Lock 139977698655704 acquired on /dfsdata2/yucc1_data/datasets/wikitext-2-raw/cached_lm_GPT2Tokenizer_510_wiki.test.raw.lock
07/26/2020 09:35:01 - INFO - __main__ -   Creating features from dataset file at /dfsdata2/yucc1_data/datasets/wikitext-2-raw
07/26/2020 09:35:03 - INFO - __main__ -   Saving features into cached file /dfsdata2/yucc1_data/datasets/wikitext-2-raw/cached_lm_GPT2Tokenizer_510_wiki.test.raw [took 0.036 s]
07/26/2020 09:35:03 - INFO - filelock -   Lock 139977698655704 released on /dfsdata2/yucc1_data/datasets/wikitext-2-raw/cached_lm_GPT2Tokenizer_510_wiki.test.raw.lock
07/26/2020 09:35:03 - INFO - filelock -   Lock 139976122421544 acquired on /dfsdata2/yucc1_data/datasets/wikitext-2-raw/cached_lm_GPT2Tokenizer_510_wiki.test.raw.lock
07/26/2020 09:35:03 - INFO - __main__ -   Creating features from dataset file at /dfsdata2/yucc1_data/datasets/wikitext-2-raw
07/26/2020 09:35:06 - INFO - __main__ -   Saving features into cached 
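
For causal language modeling (`mlm=False`), the collator defined earlier does two things: it pads a batch to a common length when needed, and it sets the labels at padding positions to `-100` so the loss skips them. Below is a framework-free sketch of that behavior using plain lists instead of tensors. `collate_causal_lm` is a hypothetical name and `pad_token_id=0` is a made-up value for illustration.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value the cross-entropy loss ignores by default


def collate_causal_lm(examples: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad to the longest example and hide padding positions from the loss."""
    max_len = max(len(e) for e in examples)
    input_ids = [e + [pad_token_id] * (max_len - len(e)) for e in examples]
    # Like the collator above, every position equal to pad_token_id is ignored.
    labels = [[IGNORE_INDEX if t == pad_token_id else t for t in row] for row in input_ids]
    return {"input_ids": input_ids, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

In this notebook every `TextDataset` example has the same length, so the real collator takes the `torch.stack` path and no padding happens.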

In [15]:
train_dataset[0].shape

torch.Size([510])

In [76]:
# Initialize the Trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    prediction_loss_only=True,
)

07/21/2020 13:44:23 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


In [77]:
# Train
if training_args.do_train:
    model_path = (
        model_args.model_name_or_path
        if model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path)
        else None
    )
    trainer.train(model_path=model_path)
    trainer.save_model()
    # For convenience, we also re-save the tokenizer to the same directory,
    # so that you can share your model easily on huggingface.co/models =)
    if trainer.is_world_master():
        tokenizer.save_pretrained(training_args.output_dir)

07/21/2020 13:44:37 - INFO - transformers.trainer -   ***** Running training *****
07/21/2020 13:44:37 - INFO - transformers.trainer -     Num examples = 561
07/21/2020 13:44:37 - INFO - transformers.trainer -     Num Epochs = 2
07/21/2020 13:44:37 - INFO - transformers.trainer -     Instantaneous batch size per device = 8
07/21/2020 13:44:37 - INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 16
07/21/2020 13:44:37 - INFO - transformers.trainer -     Gradient Accumulation steps = 1
07/21/2020 13:44:37 - INFO - transformers.trainer -     Total optimization steps = 72



07/21/2020 13:45:26 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


07/21/2020 13:45:26 - INFO - transformers.trainer -   Saving model checkpoint to /dfsdata2/yucc1_data/output/gpt2-train-new-model
07/21/2020 13:45:26 - INFO - transformers.configuration_utils -   Configuration saved in /dfsdata2/yucc1_data/output/gpt2-train-new-model/config.json






07/21/2020 13:45:27 - INFO - transformers.modeling_utils -   Model weights saved in /dfsdata2/yucc1_data/output/gpt2-train-new-model/pytorch_model.bin
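
The numbers in the training log can be cross-checked by hand. The sketch below recomputes the step count from the logged values. Reading the total batch size of 16 as two devices with 8 examples each is an inference from the log, not something the notebook states.

```python
import math

num_examples = 561               # "Num examples" in the log
total_batch_size = 16            # "Total train batch size" in the log (8 per device)
gradient_accumulation_steps = 1
num_epochs = 2

batches_per_epoch = math.ceil(num_examples / total_batch_size)
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

print(batches_per_epoch, total_steps)  # 36 72
```

Both values agree with the log: 36 iterations per epoch and 72 total optimization steps.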


In [78]:
# Evaluate
results = {}
if training_args.do_eval:
    logger.info("*** Evaluate ***")

    eval_output = trainer.evaluate()

    perplexity = math.exp(eval_output["eval_loss"])
    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
    if trainer.is_world_master():
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    results.update(result)

07/21/2020 13:45:27 - INFO - __main__ -   *** Evaluate ***
07/21/2020 13:45:27 - INFO - transformers.trainer -   ***** Running Evaluation *****
07/21/2020 13:45:27 - INFO - transformers.trainer -     Num examples = 561
07/21/2020 13:45:27 - INFO - transformers.trainer -     Batch size = 16


07/21/2020 13:45:38 - INFO - transformers.trainer -   {'eval_loss': 8.00778447257148, 'epoch': 2.0, 'step': 72}
07/21/2020 13:45:38 - INFO - __main__ -   ***** Eval results *****
07/21/2020 13:45:38 - INFO - __main__ -     perplexity = 3004.2537276158396





In [79]:
print(results)

{'perplexity': 3004.2537276158396}
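
Perplexity here is simply the exponential of the mean cross-entropy loss, as in the evaluation cell. The sketch below reproduces the logged value from `eval_loss` and, as a point of comparison, computes the loss of a model that guesses uniformly over the 50257-token vocabulary from the config.

```python
import math

eval_loss = 8.00778447257148           # 'eval_loss' from the evaluation log
perplexity = math.exp(eval_loss)
print(perplexity)                      # about 3004.25, matching the logged perplexity

vocab_size = 50257                     # "vocab_size" in the GPT-2 config printed earlier
uniform_loss = math.log(vocab_size)    # loss of a uniform guess over the vocabulary
print(uniform_loss)                    # about 10.82
```

A uniform guesser would have a perplexity equal to the vocabulary size, so the reported loss is below that baseline.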


Q&A:
What exactly does the Trainer do?
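
As a rough illustration of the kind of loop a trainer object automates, here is a toy, framework-free sketch. It is not the library's implementation: `ToyTrainer` is a hypothetical class, the "model" is a single weight fitted to `y = 3x`, and the gradient is computed by hand. The real `Trainer` additionally handles devices, distributed training, checkpointing, and logging (see the Trainer documentation in the related links).

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y) pair


class ToyTrainer:
    """Hypothetical stand-in for the train/evaluate loop a trainer automates."""

    def __init__(self, train_dataset: Sequence[Example], batch_size: int = 4,
                 learning_rate: float = 0.5, num_epochs: int = 20, seed: int = 42):
        self.train_dataset = list(train_dataset)
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.rng = random.Random(seed)
        self.weight = 0.0  # the single model parameter: y_hat = weight * x

    def _batches(self) -> List[List[Example]]:
        # Shuffle once per epoch, then cut into mini-batches (the data loader's job).
        data = self.train_dataset[:]
        self.rng.shuffle(data)
        return [data[i:i + self.batch_size] for i in range(0, len(data), self.batch_size)]

    def train(self) -> float:
        loss = float("nan")
        for _ in range(self.num_epochs):
            for batch in self._batches():
                # Forward pass and mean squared error loss.
                errors = [self.weight * x - y for x, y in batch]
                loss = sum(e * e for e in errors) / len(batch)
                # Backward pass: d(loss)/d(weight), written out by hand.
                grad = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
                # Optimizer step (plain gradient descent).
                self.weight -= self.learning_rate * grad
        return loss

    def evaluate(self, eval_dataset: Sequence[Example]) -> Dict[str, float]:
        losses = [(self.weight * x - y) ** 2 for x, y in eval_dataset]
        return {"eval_loss": sum(losses) / len(losses)}


data = [(x / 10, 3.0 * x / 10) for x in range(1, 11)]  # points on y = 3x
trainer = ToyTrainer(train_dataset=data)
trainer.train()
print(round(trainer.weight, 3), trainer.evaluate(data))  # weight approaches 3.0
```

The shape is the same as in the cells above: build the object from a dataset and hyperparameters, call `train()`, then call `evaluate()` and read `eval_loss`.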

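## Section 6 Summary, Reflections, and Outlook

1. How open-source code and toolkits are laid out. `setup.py` is used to package the library for pip; `tests` holds the test cases; the source code lives in the folder that shares the package's name.
2. Every line of log output matters and is worth reading.
3. How to use `typing`, `dataclasses`, `classmethod`, and similar Python features.
4. How to learn an open-source library. Step 1: read the documentation and run the official examples. Step 2: modify that code and write your own by imitation. Step 3: repeat the first two steps. Step 4: read the source code and modify it.
5. I learned a great deal from reading the source code. I hope to understand it more deeply and use the library more fluently over time, and I hope this proves helpful to me and to everyone in work, study, and research.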
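
To make the point about `typing`, `dataclasses`, and `classmethod` concrete, here is a small sketch in the style of the argument classes used in this notebook. `ToyTrainingArguments` and its `from_dict` method are hypothetical and far smaller than the real `TrainingArguments`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyTrainingArguments:
    """Hypothetical, trimmed-down argument container."""

    output_dir: str
    learning_rate: float = 5e-5
    num_train_epochs: float = 2.0
    seed: int = 42
    tags: List[str] = field(default_factory=list)  # mutable defaults need default_factory
    logging_dir: Optional[str] = None

    @classmethod
    def from_dict(cls, values: Dict[str, object]) -> "ToyTrainingArguments":
        # Alternative constructor: keep only keys that are declared fields.
        known = {k: v for k, v in values.items() if k in cls.__dataclass_fields__}
        return cls(**known)


args = ToyTrainingArguments.from_dict({"output_dir": "out", "seed": 7, "unknown": 1})
print(args)
```

The decorator generates `__init__` and `__repr__` from the annotated fields, which is why the `TrainingArguments(...)` line in the logs prints every parameter with its value.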

## Section 7 Related Links
1. GitHub repository: https://github.com/huggingface/transformers
2. Documentation: https://huggingface.co/transformers/index.html
3. Model downloads: https://huggingface.co/models
4. Chinese chitchat generation built with transformers: https://github.com/yangjianxin1/GPT2-chitchat
5. DialoGPT: https://github.com/microsoft/DialoGPT
6. Introduction to the GLUE datasets: https://zhuanlan.zhihu.com/p/135283598
7. Text decoding strategies: https://zhuanlan.zhihu.com/p/68383015
8. Trainer documentation: https://huggingface.co/transformers/main_classes/trainer.html