<a href="https://colab.research.google.com/github/shwliyi/transformers/blob/main/transformers_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

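# 0. Environment setup
Install Transformers and datasets.

datasets is another package from Hugging Face. It makes it easy to load a wide range of datasets and to compute the metric associated with each one.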

In [None]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-no

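# 1. Using an already fine-tuned model directly on a downstream task

If you only want to apply a PLM (pre-trained language model) to a downstream task such as sentiment classification or question answering, you can call the `pipeline` function provided by transformers.

Here is a sentiment classification example.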

In [None]:
from transformers import pipeline
classifier_sent = pipeline('sentiment-analysis')
classifier_sent('I love you! ')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



[{'label': 'POSITIVE', 'score': 0.9998782873153687}]

In [None]:
classifier_sent('I hate you!')

[{'label': 'NEGATIVE', 'score': 0.9987472295761108}]

Alternatively, you can try question answering. Here the input is a dict with two keys, `context` and `question`.

In [None]:
classifier_qa = pipeline('question-answering')
classifier_qa({
	'question': 'What is the name of the repository ?',
	'context': 'Pipeline has been included in the huggingface / transformers repository'
})

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



{'answer': 'huggingface / transformers',
 'end': 60,
 'score': 0.30970141291618347,
 'start': 34}
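
The `start` and `end` fields in the output above appear to be character offsets into `context`. The following minimal check uses only the strings from the example and loads no model, so the offsets are simply copied from the output shown:

```python
# Context string and result values copied from the question-answering example above.
context = 'Pipeline has been included in the huggingface / transformers repository'
result = {'answer': 'huggingface / transformers', 'start': 34, 'end': 60}

# Slicing the context with the reported offsets recovers the answer text.
span = context[result['start']:result['end']]
print(span)
```

This is convenient when you want to highlight the answer inside the original passage instead of only displaying the answer string.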

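# 2. Fine-tuning on a downstream task

A pipeline does not support fine-tuning. If you want to load a pre-trained model and fine-tune it yourself, you need to write some extra code for loading the model and processing the data. Fortunately, Hugging Face simplifies much of this workflow.

We use **sentiment analysis with BERT on the SST-2 dataset from GLUE** as an example to show how to fine-tune a pre-trained model with Hugging Face (GLUE is the name of a benchmark that comprises several tasks; SST-2 is one of them).

## 2.1 Loading the dataset
The datasets package includes many of today's mainstream datasets. A single call to `load_dataset` downloads and loads a dataset, and `load_metric` loads the metric that goes with it. (`load_metric` is available in the datasets 2.3.2 release installed here; later releases moved metrics to the separate `evaluate` package.) In this example we load the SST-2 task from GLUE.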

In [None]:
from datasets import load_dataset, load_metric
dataset = load_dataset("glue", 'sst2')
metric = load_metric('glue', 'sst2')


Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...



Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.



A quick look at the dataset shows that it is split into train, validation, and test. Each split has three keys, corresponding to the text, the label, and the index.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [None]:
dataset['train'][0]

{'idx': 0,
 'label': 0,
 'sentence': 'hide new secretions from the parental units '}

Printing the metric directly shows its description. In this example, the metric for SST-2 is accuracy.

In [None]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

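Once we have the model's predictions and the ground-truth labels, calling `metric.compute` gives us the model's performance. We first generate some random data to demonstrate the usage; because the data is random, the accuracy differs from run to run.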

In [None]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))                  # randomly generate some predictions
fake_labels = np.random.randint(0, 2, size=(64,))                 # randomly generate some labels
metric.compute(predictions=fake_preds, references=fake_labels)    # pass both to metric.compute

{'accuracy': 0.40625}
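
For SST-2 the reported value is plain accuracy, that is, the fraction of predictions that equal their reference label (the 0.40625 above corresponds to 26 matches out of 64). The sketch below is a hand-written stand-in, not the library's implementation; the function name `accuracy` and the sample data are made up for illustration:

```python
def accuracy(predictions, references):
    """Return the fraction of positions where prediction equals reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

preds = [0, 1, 1, 0]    # hypothetical predictions
labels = [0, 1, 0, 0]   # hypothetical ground-truth labels
result = accuracy(preds, labels)
print(result)           # 3 of 4 positions match -> {'accuracy': 0.75}
```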

At this point we have downloaded and loaded the dataset and prepared its metric.

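## 2.2 Tokenization

A pre-trained model does not take raw text as input. Each pre-trained model has its own tokenization scheme and its own vocabulary, so to use a given model we need to:
1. Tokenize the data with that model's tokenization scheme.
2. Use that model's vocabulary to convert each resulting token into its id.

Besides token ids, pre-trained models need some other inputs. BERT, for example, also needs `token_type_ids` and `attention_mask`.

This looks tedious, but Hugging Face simplifies it as well. We only need to load the tokenizer of the model we want to use, and that tokenizer takes care of all of the above.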

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')


Passing text directly to `tokenizer` returns the model inputs, for example:

In [None]:
tokenizer("Tsinghua University is located in Beijing.")

{'input_ids': [101, 24529, 2075, 14691, 2118, 2003, 2284, 1999, 7211, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
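
To make the three fields concrete, here is a toy sketch of the two steps described earlier (split into sub-word tokens, then look up ids). It assumes a greedy longest-match-first WordPiece-style split and a tiny hypothetical vocabulary `TOY_VOCAB`; the ids are invented and do not match the real `bert-base-uncased` vocabulary, in which 101 and 102 are the ids of `[CLS]` and `[SEP]`.

```python
# Hypothetical toy vocabulary; the ids are invented for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "beijing": 4, "is": 5, "in": 6, "located": 7,
             "ts": 8, "##ing": 9, "##hua": 10, "university": 11, ".": 12}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # nothing matched: unknown word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(text, vocab):
    """Step 1: tokenize. Step 2: map tokens to ids and build the extra inputs."""
    words = text.lower().replace(".", " .").split()
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    input_ids = [vocab[token] for token in tokens]
    return {"tokens": tokens,
            "input_ids": input_ids,
            "token_type_ids": [0] * len(input_ids),   # single sentence: all segment 0
            "attention_mask": [1] * len(input_ids)}   # no padding: attend to everything

encoded = toy_encode("Tsinghua University is located in Beijing.", TOY_VOCAB)
print(encoded["tokens"])
print(encoded["input_ids"])
```

A word that is missing from the vocabulary is broken into smaller known pieces instead of being mapped to a single unknown token, which is why a sentence can yield more ids than it has words.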

We now use `tokenizer` to process the dataset. Because BERT can only handle sequences of up to 512 tokens, we set `truncation=True` together with `max_length=512`.

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True, max_length=512)
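
As a rough illustration of what truncation to a maximum length means for a single sentence, the sketch below keeps the special tokens and drops the ordinary tokens that do not fit. It is a simplified stand-in written for this tutorial, not the tokenizer's actual code; the helper name `encode_with_truncation` and the input ids are hypothetical.

```python
CLS_ID, SEP_ID = 101, 102   # special-token ids that appear in the bert-base-uncased output above

def encode_with_truncation(token_ids, max_length):
    """Wrap token_ids as [CLS] ... [SEP], dropping trailing tokens that do not fit."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    budget = max_length - 2            # positions left for ordinary tokens
    kept = token_ids[:budget]          # cut from the right-hand side
    return [CLS_ID] + kept + [SEP_ID]

long_input = list(range(1000, 1600))   # 600 hypothetical token ids
encoded = encode_with_truncation(long_input, max_length=512)
print(len(encoded))                    # the special tokens count toward the limit

short = encode_with_truncation([5342, 2047], max_length=512)
print(short)                           # short inputs pass through unchanged
```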

Check the result on the first five examples of the dataset. The result is a dict containing `input_ids`, `token_type_ids`, and `attention_mask`.

In [None]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102], [101, 2008, 7459, 2049, 3494, 1998, 10639, 2015, 2242, 2738, 3376, 2055, 2529, 3267, 102], [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102], [101, 2006, 1996, 5409, 7195, 1011, 1997, 1011, 1996, 1011, 11265, 17811, 18856, 17322, 2015, 1996, 16587, 2071, 2852, 24225, 2039, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

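We now process the whole dataset with `preprocess_function`. This is done with the `dataset.map` method, which applies our custom processing function to every example in the dataset. With `batched=True`, the function receives a batch of examples per call instead of a single example, which is considerably faster with a fast tokenizer. Note that this is not multi-threading; parallel processing across processes is controlled separately through the `num_proc` argument.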

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)
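
To illustrate what `batched=True` changes, the sketch below imitates a `map` over a column-oriented table in plain Python. `toy_map` and `toy_preprocess` are hypothetical helpers written for this explanation, not the datasets implementation; the default `batch_size=1000` mirrors the library's documented default.

```python
def toy_map(columns, function, batched=False, batch_size=1000):
    """Apply `function` to a dict of equal-length columns and merge in the new columns."""
    n = len(next(iter(columns.values())))
    new_columns = {}
    if batched:
        for start in range(0, n, batch_size):
            # The function sees a dict of lists (a slice of every column).
            batch = {k: v[start:start + batch_size] for k, v in columns.items()}
            for k, v in function(batch).items():
                new_columns.setdefault(k, []).extend(v)
    else:
        for i in range(n):
            # The function sees a dict of single values (one example).
            row = {k: v[i] for k, v in columns.items()}
            for k, v in function(row).items():
                new_columns.setdefault(k, []).append(v)
    return {**columns, **new_columns}

calls = []

def toy_preprocess(examples):
    calls.append(len(examples["sentence"]))          # record the size of each batch
    return {"n_words": [len(s.split()) for s in examples["sentence"]]}

data = {"sentence": ["a b c", "d e", "f", "g h i j", "k"], "label": [0, 1, 0, 1, 1]}
result = toy_map(data, toy_preprocess, batched=True, batch_size=2)
print(result["n_words"])   # one entry per example
print(calls)               # batch sizes seen by the function
```

The processing function is called once per batch, so a function that handles a list of sentences at once, as a tokenizer does, is invoked far fewer times.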




Looking at `encoded_dataset`, we can see that it has three more features than the original `dataset`: the three outputs of `tokenizer`.

In [None]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [None]:
encoded_dataset['train'][0]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'idx': 0,
 'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
 'label': 0,
 'sentence': 'hide new secretions from the parental units ',
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

At this point, all the data has been converted into the input format the model accepts (`input_ids`, `token_type_ids`, `attention_mask`).

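# 3. Fine-tuning the model

The dataset is ready, so we can finally start fine-tuning the model! First we use transformers to download and load the pre-trained model, which again takes a single line of code. Because SST-2 has only two label classes (positive and negative), we set `num_labels=2`.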

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

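The code prints messages that look like errors, but there is no need to worry. They appear because, in order to use BERT for sentiment classification, we discard the parameters BERT originally used for masked language modeling and next sentence prediction and replace them with a new, randomly initialized classification layer that still has to be trained.

Next, we fine-tune the model with the Trainer class provided by Hugging Face. First, we set the training arguments as follows.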

In [None]:
from transformers import TrainingArguments

batch_size=16
args = TrainingArguments(
    "bert-base-uncased-finetuned-sst2",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)

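What each argument means:
+ first argument (`output_dir`): the directory where checkpoints and other outputs are written; here it doubles as a name for this run
+ evaluation_strategy="epoch": evaluate the model on the validation set at the end of every epoch
+ save_strategy="epoch": save a checkpoint at the end of every epoch
+ learning_rate=2e-5: the learning rate of the optimizer
+ per_device_train_batch_size=batch_size: the training batch size on each device (for example, each GPU)
+ per_device_eval_batch_size=batch_size: the evaluation batch size on each device
+ num_train_epochs=5: train for 5 epochs
+ weight_decay=0.01: the weight decay used during optimization
+ load_best_model_at_end=True: after training finishes, load the best checkpoint found during training
+ metric_for_best_model="accuracy": use accuracy to decide which checkpoint is best

We also need to define a function that tells the Trainer how to compute the metric.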

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred                 # logits: [num_examples, num_labels], labels: [num_examples]
    predictions = np.argmax(logits, axis=1)    # predict the class with the largest logit (and therefore the highest probability)
    return metric.compute(predictions=predictions, references=labels)
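
The model outputs logits rather than probabilities, but softmax preserves ordering, so taking the argmax of the logits selects the same class as taking the argmax of the probabilities. A small pure-Python check with made-up logits and labels:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=lambda i: row[i])

logits = [[2.1, -0.3], [-1.0, 0.5], [0.2, 0.1]]   # hypothetical model outputs
labels = [0, 1, 1]                                 # hypothetical ground truth

pred_from_logits = [argmax(row) for row in logits]
pred_from_probs = [argmax(softmax(row)) for row in logits]
accuracy = sum(p == l for p, l in zip(pred_from_logits, labels)) / len(labels)

print(pred_from_logits, pred_from_probs)   # both give the same classes
print(accuracy)                            # 2 of 3 correct
```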

Now we can create the Trainer.

In [None]:
from transformers import Trainer
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
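
The encoded examples have different lengths, so they must be padded before they can be stacked into one batch. To our understanding, when a `tokenizer` is passed as above, the Trainer pads each batch to the length of its longest sequence by default (via `DataCollatorWithPadding`). The sketch below only imitates that idea in plain Python: `pad_batch` is a hypothetical helper, the id lists are shortened examples, and 0 is the id of `[PAD]` in `bert-base-uncased`.

```python
PAD_ID = 0   # id of [PAD] in the bert-base-uncased vocabulary

def pad_batch(features, pad_id=PAD_ID):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Padding positions get mask 0 so that attention ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

features = [
    {"input_ids": [101, 5342, 2047, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 3397, 2053, 15966, 1010, 102], "attention_mask": [1] * 6},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed 512 tokens keeps batches of short sentences small, which matters for a dataset like SST-2 where most sentences are short.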

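You can also define your own optimizer and scheduler and pass them to the Trainer. Here we use the defaults: the optimizer is AdamW, and the learning-rate schedule is linear, meaning the rate decays linearly to zero after an optional warmup phase (no warmup steps by default).

Next, we call the `train` method of `trainer` to start training.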
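
The default linear schedule can be written out as a small function: the learning rate rises linearly during warmup (if any) and then decays linearly to zero at the last step. This is a sketch of the formula, not the library code; `linear_schedule_lr` is a hypothetical name, and the step count 21050 is the total number of optimization steps reported in this tutorial's training log.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr, total_steps = 2e-5, 21050
lr_start = linear_schedule_lr(0, base_lr, total_steps)              # full learning rate
lr_middle = linear_schedule_lr(10525, base_lr, total_steps)         # about half of it
lr_end = linear_schedule_lr(total_steps, base_lr, total_steps)      # zero
lr_warm = linear_schedule_lr(50, base_lr, 1000, warmup_steps=100)   # halfway through a warmup
print(lr_start, lr_middle, lr_end, lr_warm)
```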

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 67349
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 21050


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1773 | 0.327943 | 0.916284 |
| 2 | 0.1218 | 0.339612 | 0.917431 |
| 3 | 0.0897 | 0.341416 | 0.918578 |
| 4 | 0.0539 | 0.441544 | 0.915138 |
| 5 | 0.0303 | 0.4644 | 0.91055 |
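

With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint with the highest validation accuracy should be the one restored after training. The sketch below picks it out from the per-epoch results above. It assumes that one checkpoint is saved per epoch and that checkpoints are named by global step, as the name `checkpoint-4210` in the log suggests for the end of epoch 1.

```python
# Per-epoch validation accuracy copied from the results above.
eval_accuracy = {1: 0.916284, 2: 0.917431, 3: 0.918578, 4: 0.915138, 5: 0.91055}
steps_per_epoch = 4210   # assumption: taken from "checkpoint-4210" in the log

# Higher accuracy is better, so the best epoch is the one with the maximum value.
best_epoch = max(eval_accuracy, key=eval_accuracy.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_checkpoint)
```

The training loss keeps falling through epoch 5 while validation loss rises and accuracy drops after epoch 3, a typical sign of overfitting; selecting the best checkpoint guards against it.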


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 16
Saving model checkpoint to bert-base-uncased-finetuned-sst2/checkpoint-4210
Configuration saved in bert-base-uncased-finetuned-sst2/checkpoint-4210/config.json
Model weights saved in bert-base-uncased-finetuned-sst2/checkpoint-4210/pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-sst2/checkpoint-4210/tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-sst2/checkpoint-4210/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `B

TrainOutput(global_step=21050, training_loss=0.10431196976727375, metrics={'train_runtime': 2929.8649, 'train_samples_per_second': 114.935, 'train_steps_per_second': 7.185, 'total_flos': 6090242903971080.0, 'train_loss': 0.10431196976727375, 'epoch': 5.0})
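
The step counts and throughput figures in the log and in `TrainOutput` can be reproduced from the dataset size and the training arguments. The calculation assumes a single device and no gradient accumulation, which is what the log reports (total train batch size 16, gradient accumulation steps 1).

```python
import math

num_examples = 67349        # size of the SST-2 training split
batch_size = 16             # per_device_train_batch_size, on one device
num_epochs = 5              # num_train_epochs
train_runtime = 2929.8649   # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)   # the last batch is smaller
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The result matches `Total optimization steps = 21050`, the `checkpoint-4210` saved after the first epoch, and the reported `train_steps_per_second` and `train_samples_per_second`.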