# Basic tutorial for the transformers lib

modified from https://github.com/ZeweiChu/transformers-tutorial (@ZeweiChu)

More information about the transformers library:

- https://github.com/huggingface/transformers

- https://huggingface.co/transformers/

## Data preprocessing

The main data preprocessing tool provided by the transformers library is the tokenizer.

In [1]:
import torch
from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenizer = AutoTokenizer.from_pretrained('/home/data/tmp/bert-base-uncased')

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 7592, 1010, 1045, 1005, 1049, 1037, 2309, 6251, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [2]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] hello, i'm a single sentence! [SEP]"

In [3]:
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 7592, 1045, 1005, 1049, 1037, 2309, 6251, 102], [101, 1998, 2178, 6251, 102], [101, 1998, 1996, 2200, 2200, 2197, 2028, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


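Note that the tokenizer returns three fields after encoding:
- input_ids: the ID that each token is mapped to after tokenization
- token_type_ids: BERT accepts two sequences as input; this field indicates whether each token belongs to the first or the second sequence
- attention_mask: indicates whether each position holds a real token (1) or only padding (0)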
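
To make input_ids concrete, here is a minimal sketch of the lookup step. It is not how the library is implemented: the real BERT tokenizer uses WordPiece over a vocabulary of 30522 entries, whereas `TOY_VOCAB` and `toy_encode` are hypothetical names, and the toy vocabulary holds only the IDs visible in the outputs of cells 1 and 3.

```python
import re

# Hypothetical toy vocabulary: only the IDs that appear in the outputs above.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "hello": 7592, ",": 1010, "i": 1045, "'": 1005, "m": 1049,
    "a": 1037, "single": 2309, "sentence": 6251, "!": 999,
}

def toy_encode(text):
    """Lowercase, split into words and punctuation marks, then look up each ID."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Like BERT, wrap the sequence in the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return [TOY_VOCAB[t] for t in tokens]

ids = toy_encode("Hello, I'm a single sentence!")
print(ids)
```

For this one sentence the sketch yields the same input_ids as cell 1; any word outside the toy vocabulary would raise a `KeyError`, which the real tokenizer avoids by falling back to subword pieces.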

Let's try Chinese.

In [4]:
# tokenizer_cn = AutoTokenizer.from_pretrained('bert-base-chinese')
tokenizer_cn = AutoTokenizer.from_pretrained('/home/data/tmp/bert-base-chinese')

encoded_input = tokenizer_cn("大家好，我是一句中文！")
print(encoded_input)

{'input_ids': [101, 1920, 2157, 1962, 8024, 2769, 3221, 671, 1368, 704, 3152, 8013, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [5]:
tokenizer_cn.decode(encoded_input["input_ids"])

'[CLS] 大 家 好 ， 我 是 一 句 中 文 ！ [SEP]'

In [6]:
encoded_input = tokenizer_cn("七月在线")
print(encoded_input)

{'input_ids': [101, 673, 3299, 1762, 5296, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}


In [7]:
batch = tokenizer(batch_sentences, padding=True, truncation=True, max_length=100, return_tensors="pt")
print(batch)

{'input_ids': tensor([[ 101, 7592, 1045, 1005, 1049, 1037, 2309, 6251,  102],
        [ 101, 1998, 2178, 6251,  102,    0,    0,    0,    0],
        [ 101, 1998, 1996, 2200, 2200, 2197, 2028,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
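
With `padding=True`, the shorter sentences are filled up with 0 and attention_mask records which positions are padding. The sketch below reproduces that right-padding logic in plain Python, assuming a pad ID of 0 as seen in the output above; `pad_batch` is a hypothetical helper, not a library function.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded IDs copied from the output of cell 3.
unpadded = [
    [101, 7592, 1045, 1005, 1049, 1037, 2309, 6251, 102],
    [101, 1998, 2178, 6251, 102],
    [101, 1998, 1996, 2200, 2200, 2197, 2028, 102],
]
padded = pad_batch(unpadded)
for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```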


BERT accepts two sequences as input, and the tokenizer supports this as well.

In [8]:
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)

{'input_ids': [101, 2129, 2214, 2024, 2017, 1029, 102, 1045, 1005, 1049, 1020, 2086, 2214, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Here token_type_ids contains both 0 and 1, indicating whether each token comes from the first or the second piece of text.
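
The layout of a sentence pair can be sketched in plain Python. This assumes the `[CLS] A [SEP] B [SEP]` arrangement visible in the output of cell 8, with 101 and 102 as the special token IDs; `build_pair_inputs` is a hypothetical helper for illustration.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token_type_ids."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Token IDs without special tokens, read off the output of cell 8.
how_old = [2129, 2214, 2024, 2017, 1029]       # "how old are you?"
im_six = [1045, 1005, 1049, 1020, 2086, 2214]  # "i'm 6 years old"

pair = build_pair_inputs(how_old, im_six)
print(pair["input_ids"])
print(pair["token_type_ids"])
```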

In [9]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] how old are you? [SEP] i'm 6 years old [SEP]"

In [10]:
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)

{'input_ids': [[101, 7592, 1045, 1005, 1049, 1037, 2309, 6251, 102, 1045, 1005, 1049, 1037, 6251, 2008, 3632, 2007, 1996, 2034, 6251, 102], [101, 1998, 2178, 6251, 102, 1998, 1045, 2323, 2022, 12359, 2007, 1996, 2117, 6251, 102], [101, 1998, 1996, 2200, 2200, 2197, 2028, 102, 1998, 1045, 2175, 2007, 1996, 2200, 2197, 2028, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [11]:
for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] hello i'm a single sentence [SEP] i'm a sentence that goes with the first sentence [SEP]
[CLS] and another sentence [SEP] and i should be encoded with the second sentence [SEP]
[CLS] and the very very last one [SEP] and i go with the very last one [SEP]


## Training a transformer model

In [12]:
from transformers import BertForSequenceClassification
# model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('/home/data/tmp/bert-base-uncased', return_dict=True)
model.train()

Some weights of the model checkpoint at /home/data/tmp/bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mode

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [13]:
# transformers.AdamW is deprecated in recent releases; torch.optim.AdamW is the usual replacement
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)

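We can also give different groups of parameters different optimizer settings, for example turning off weight decay for biases and LayerNorm weights.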

In [14]:
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
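
To see which parameters end up in which group, the same substring filter can be applied to a handful of parameter names. The names below are a small illustrative sample of what `model.named_parameters()` yields for this model, not the full list.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Illustrative sample of parameter names, not the complete model.
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

# A name goes to the no-decay group if it contains any of the substrings.
with_decay = [n for n in names if not any(nd in n for nd in no_decay)]
without_decay = [n for n in names if any(nd in n for nd in no_decay)]

print("weight_decay=0.01:", with_decay)
print("weight_decay=0.0: ", without_decay)
```

Only the weight matrices and embeddings keep weight decay; every bias and every LayerNorm weight is excluded.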

With the model and optimizer in place, we can load the tokenizer and the data.

In [15]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('/home/data/tmp/bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


We can then compute the loss and backpropagate it to update the model's parameters.

In [16]:
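labels = torch.tensor([1, 0])  # one class label per sentence, shape (batch_size,)
optimizer.zero_grad()  # clear any gradients left over from earlier steps
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss  # cross-entropy loss computed inside the model
loss.backward()
optimizer.step()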
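
A single step like the one above is normally repeated in a loop. The sketch below shows the usual PyTorch pattern on a toy stand-in so that it runs without a checkpoint: `toy_model`, `toy_optimizer`, and the synthetic data are assumptions made for illustration and have nothing to do with BERT itself.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

# Toy stand-in for the BERT classifier: 4 input features, 2 classes.
toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.05)

# Synthetic data: the label is 1 when the first feature is positive.
features = torch.randn(32, 4)
targets = (features[:, 0] > 0).long()

toy_model.train()
losses = []
for step in range(100):
    toy_optimizer.zero_grad()                # 1. clear old gradients
    logits = toy_model(features)             # 2. forward pass
    loss = F.cross_entropy(logits, targets)  # 3. compute the loss
    loss.backward()                          # 4. backpropagate
    toy_optimizer.step()                     # 5. update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Calling `zero_grad()` at the start of every iteration matters because PyTorch accumulates gradients across `backward()` calls by default.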

In [17]:
outputs[1]

tensor([[-0.1815, -0.2233],
        [-0.2254, -0.0510]], grad_fn=<AddmmBackward>)

In [18]:
outputs.loss

tensor(0.7492, grad_fn=<NllLossBackward>)

In [19]:
outputs

SequenceClassifierOutput(loss=tensor(0.7492, grad_fn=<NllLossBackward>), logits=tensor([[-0.1815, -0.2233],
        [-0.2254, -0.0510]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

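In the code above, we optimized the model with the loss that the transformers library computed for us. If we do not pass the labels argument to the model's forward function, the BertForSequenceClassification class does not return the cross-entropy loss used for classification, and the loss field of the output is None.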
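
The cross-entropy loss can be checked by hand from the logits printed in cell 17 and the labels from cell 16. This is a plain-Python sketch of the formula (mean negative log-softmax of the correct class), not the library's implementation; the logits are the rounded values shown above, so the result only agrees to about four decimal places.

```python
import math

# Logits copied from cell 17, labels from cell 16.
logits = [[-0.1815, -0.2233],
          [-0.2254, -0.0510]]
labels = [1, 0]

def cross_entropy(logits, labels):
    """Mean negative log-softmax probability of the correct class."""
    total = 0.0
    for row, label in zip(logits, labels):
        log_sum_exp = math.log(sum(math.exp(z) for z in row))
        total += log_sum_exp - row[label]
    return total / len(labels)

manual_loss = cross_entropy(logits, labels)
print(round(manual_loss, 4))
```

The printed value matches the 0.7492 reported by `outputs.loss` in cell 18.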

Let's take a moment to look at how the BertForSequenceClassification class is designed.
https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py

We can of course also compute the loss in our own way.

In [20]:
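from torch.nn import functional as F
labels = torch.tensor([1, 0])
optimizer.zero_grad()  # reset gradients accumulated by the previous backward pass
outputs = model(input_ids, attention_mask=attention_mask)  # no labels, so outputs.loss is None
loss = F.cross_entropy(outputs.logits, labels)
loss.backward()
optimizer.step()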

In [21]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.5709,  0.0413],
        [-0.4010, -0.1815]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

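If we want to freeze BERT itself so that it is no longer trained, we can set requires_grad to False on its parameters. Note that model.base_model is the whole BertModel (embeddings, encoder, and pooler), so only the classification head on top remains trainable.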

In [22]:
for param in model.base_model.parameters():
    param.requires_grad = False
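
The effect of freezing can be verified on a small stand-in model. In this sketch `ToyClassifier`, `base`, and `classifier` are hypothetical names: `base` plays the role of `model.base_model` and `classifier` the role of the classification head. It assumes the common practice of passing only the trainable parameters to the optimizer.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

class ToyClassifier(nn.Module):
    """Hypothetical stand-in: `base` mimics BERT, `classifier` mimics the head."""
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(torch.tanh(self.base(x)))

toy = ToyClassifier()

# Freeze the base in the same way as the loop over model.base_model.parameters().
for param in toy.base.parameters():
    param.requires_grad = False

trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
print(trainable)

# Hand only the trainable parameters to the optimizer.
opt = torch.optim.AdamW([p for p in toy.parameters() if p.requires_grad], lr=0.1)

base_before = toy.base.weight.detach().clone()
head_before = toy.classifier.weight.detach().clone()

x = torch.randn(6, 4)
y = torch.tensor([0, 1, 0, 1, 0, 1])
loss = F.cross_entropy(toy(x), y)
loss.backward()
opt.step()

# The frozen base keeps its weights; the head is updated.
base_unchanged = torch.equal(base_before, toy.base.weight)
head_changed = not torch.equal(head_before, toy.classifier.weight)
print(base_unchanged, head_changed)
```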