# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

# BERT
This article walks through BERT, based on the [paper](https://arxiv.org/pdf/1810.04805.pdf) by Jacob Devlin et al. and the Huggingface [code](https://github.com/huggingface/pytorch-transformers) on GitHub.


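BERT uses only the encoder part of the Transformer. Taking $BERT_{base}$ as an example: the encoder has $L=12$ layers, multi-head attention uses $A=12$ heads, the hidden size is $H=768$, and the feed-forward filter size (which can also be understood as the size of the intermediate hidden layer) is $4H=3072$.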
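
The sizes of $BERT_{base}$ are tied together by simple arithmetic: since $4H=3072$, the hidden size is $H=768$, and the paper gives 12 attention heads for this model. The sketch below spells this out, assuming (as is standard for Transformer encoders) that the hidden vector is split evenly across the heads. The dictionary keys are illustrative names, not a library API.

```python
# BERT-base hyperparameters (key names are illustrative only)
bert_base = {"num_layers": 12, "num_heads": 12, "hidden_size": 768}

# Assumption: the hidden vector is split evenly across the attention heads
head_dim = bert_base["hidden_size"] // bert_base["num_heads"]

# The feed-forward inner layer is 4x the hidden size
intermediate_size = 4 * bert_base["hidden_size"]

print(head_dim, intermediate_size)
```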

In [2]:
import torch
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

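## Language model pre-training task 1: Masked LM
In each sentence, 15% of the tokens are selected at random, and each selected token is handled in one of three ways:
- 80% of the time, it is replaced with [MASK], e.g. my dog is hairy $\rightarrow$ my dog is [MASK]
- 10% of the time, it is replaced with a random word, e.g. my dog is hairy $\rightarrow$ my dog is apple
- 10% of the time, it is left unchanged, e.g. my dog is hairy $\rightarrow$ my dog is hairy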
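
The 15% selection and the 80/10/10 split can be sketched in a few lines of plain Python. This is a simplified illustration rather than the original pre-processing code: it assumes each token is selected independently, does not treat special tokens separately, and the function name `mask_tokens` and the toy vocabulary are made up for this example.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing must be predicted."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected, left as-is and not predicted
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

tokens = "my dog is hairy".split()
vocab = ["apple", "dog", "hairy", "is", "my"]  # toy vocabulary (assumption)
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted, labels)
```

Note that a selected token still gets a label even when it is left unchanged, so the model cannot assume that every unmasked token is correct.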


## Language model pre-training task 2: Next Sentence Prediction
Decide whether sentence B actually follows sentence A:
- 50% of the time, B follows A: [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
- 50% of the time, B does not follow A: [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
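
A minimal sketch of how such sentence pairs could be assembled is shown below. It is an illustration under simplifying assumptions: negatives are drawn from the same short list of sentences (the paper samples them from the corpus), whitespace splitting stands in for WordPiece, and the helper names `make_nsp_example` and `to_input` are hypothetical.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one (sentence_a, sentence_b, is_next) triple from an ordered list of sentences."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        return sentence_a, sentences[index + 1], True  # the true next sentence
    # Negative example: any sentence other than the current one and the true next one
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False

def to_input(sentence_a, sentence_b):
    """Join a pair in the [CLS] A [SEP] B [SEP] layout and build the segment ids."""
    a, b = sentence_a.split(), sentence_b.split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids

sentences = ["the man went to the store", "he bought a gallon of milk",
             "penguins are flightless birds", "they live in the south"]
a, b, is_next = make_nsp_example(sentences, 0, random.Random(0))
tokens, segment_ids = to_input(a, b)
print(is_next, tokens, segment_ids)
```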

## Tokenizer
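BERT uses WordPiece embeddings with a vocabulary of about 30k tokens. A word that is not in the vocabulary is split into sub-word pieces, and every piece after the first is prefixed with '##'. The special tokens are defined as follows:
unk_token="[UNK]", sep_token="[SEP]", pad_token="[PAD]", cls_token="[CLS]", mask_token="[MASK]", where [CLS] marks the start of the input and [SEP] separates sentences.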
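
To make the '##' convention concrete, here is a small sketch of greedy longest-match-first splitting of a single word, which is the general idea behind WordPiece tokenization. The vocabulary is a toy set chosen for this example, and the function `wordpiece` is a simplified stand-in, not the `BertTokenizer` implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"puppet", "##eer", "who", "was"}  # toy vocabulary (assumption)
print(wordpiece("puppeteer", toy_vocab))  # ['puppet', '##eer']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```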

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']


### Tokenization and masking

In [4]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a',
                              'puppet', '##eer', '[SEP]']
print(tokenized_text)

['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']


### Converting tokens to ids

In [5]:
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
print(tokens_tensor)
print(segments_tensors)

tensor([[  101,  2040,  2001,  3958, 27227,  1029,   102,  3958,   103,  2001,
          1037, 13997, 11510,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])


## Model
Obtain the vector that the language model assigns to each token of the sentence

In [6]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.config.output_hidden_states=True
# Set the model in evaluation mode to deactivate the Dropout modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()
# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # PyTorch-Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

## Prediction
Predict the masked token

In [14]:
# Load pre-trained model (weights)
# Use a separate name so that `model` keeps referring to the BertModel used in the later cells
mlm_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
mlm_model.eval()
# Predict all tokens
with torch.no_grad():
    outputs = mlm_model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)
assert predicted_token == 'henson'

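# BertModel consists of three parts
- embedding: maps the sentence to a sequence of vectors of shape (batch, sequence_length, hidden_size); each vector encodes the token, its position, and the segment (sentence) index
- encoder: passes the sequence through the L encoder layers of the Transformer, producing L layers of hidden states
- pooler: takes the last-layer hidden state of the leading [CLS] token and applies a linear transformation followed by tanh (this is what the code does; the paper does not spell it out)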

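The skeleton of forward is as follows:

Inputs:
 - input_ids: vocabulary ids of the tokenized input
 - token_type_ids: the segment ids, distinguishing sentence A from sentence B
 - attention_mask: distinguishes real tokens from padding

Outputs:
 - sequence_output (batch*sequence_length*hidden_size): the last-layer hidden state of every token in the sequence
 - pooled_output: the hidden state of the first token after a further linear transformation (and tanh)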

In [41]:
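def forward(self, input_ids, token_type_ids=None, attention_mask=None,
        position_ids=None, head_mask=None):
    # Fill in defaults so the optional arguments can be omitted
    if attention_mask is None:
        attention_mask = torch.ones_like(input_ids)
    if token_type_ids is None:
        token_type_ids = torch.zeros_like(input_ids)
    if head_mask is None:
        head_mask = [None] * self.config.num_hidden_layers
    extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) # add the head and query dimensions
    extended_attention_mask = extended_attention_mask.to(
        dtype=next(self.parameters()).dtype) # fp16 compatibility
    extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0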
    
    # step-1
    embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
    # step-2
    encoder_outputs = self.encoder(embedding_output,
                                   extended_attention_mask,
                                   head_mask=head_mask)
    sequence_output = encoder_outputs[0]
    # step-3
    pooled_output = self.pooler(sequence_output)

    outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]  # add hidden_states and attentions if they are here
    return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)
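

The line `(1.0 - extended_attention_mask) * -10000.0` turns the 0/1 padding mask into an additive mask: real tokens get 0 and padding gets -10000, which is added to the attention scores before the softmax. The stdlib-only sketch below illustrates the effect for a single query; the scores are made-up numbers and `softmax` is a small helper written for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

attention_mask = [1, 1, 1, 0]       # the last position is padding
raw_scores = [2.0, 1.0, 0.5, 3.0]   # hypothetical attention scores for one query

# Same transformation as in forward(): 0 for real tokens, -10000 for padding
additive_mask = [(1.0 - m) * -10000.0 for m in attention_mask]
masked_scores = [s + a for s, a in zip(raw_scores, additive_mask)]
weights = softmax(masked_scores)
print(weights)
```

Even though the padding position had the largest raw score, its attention weight becomes effectively zero after masking.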
    

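## BertModel workflow

## Step-1 Embedding
The embedding in step-1 consists of three parts, which are summed:
- WordPiece embedding: a lookup table indexed by token id. (vocab_size*dim)
- Learned position embedding: a lookup table just like the word embedding, indexed by position. The maximum sequence length is 512. (512*dim)
- Segment embedding: sentence 1 or sentence 2. (2*dim)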
![](images/embedding.png)
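
The three lookups and their sum can be illustrated without any framework. The sketch below uses tiny random tables with toy sizes (not BERT's real dimensions). The names `make_table` and `embed` are invented for this example, and the LayerNorm and dropout that the library applies afterwards are left out.

```python
import random

rng = random.Random(0)
dim, vocab_size, max_positions, num_segments = 4, 10, 8, 2  # toy sizes, not BERT's

def make_table(rows, cols):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

word_table = make_table(vocab_size, dim)
position_table = make_table(max_positions, dim)
segment_table = make_table(num_segments, dim)

def embed(input_ids, segment_ids):
    """Sum the word, position, and segment vectors for every token."""
    output = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, segment_ids)):
        rows = zip(word_table[token_id], position_table[position], segment_table[segment_id])
        output.append([w + p + s for w, p, s in rows])
    return output

# The same token id (5) appears twice, at different positions and in different segments
vectors = embed([1, 5, 5], [0, 0, 1])
print(vectors[1] != vectors[2])
```

Because position and segment vectors are added in, the same token id yields different input vectors depending on where it occurs.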

In [48]:
embedding_output = model.embeddings(tokens_tensor, position_ids=None, 
                                   token_type_ids=segments_tensors)
assert tuple(embedding_output.shape) == (1, len(indexed_tokens), model.config.hidden_size)

Code summary of the embedding module (BertEmbeddings)
```python
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

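def forward(self, input_ids, token_type_ids=None, position_ids=None):
    # Summary: assumes position_ids and token_type_ids have already been given their defaults
    words_embeddings = self.word_embeddings(input_ids)
    position_embeddings = self.position_embeddings(position_ids)
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
    embeddings = words_embeddings + position_embeddings + token_type_embeddings
    return embeddings  # the library additionally applies LayerNorm and dropout before returning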
```

## Step-2 Encoder
The structure of the encoder is shown in the figure below. It contains L encoder layers; each layer consists of a multi-head self-attention sub-layer (BertAttention) and a feed-forward sub-layer (BertIntermediate+BertOutput). See the Transformer for details.
The structure of BertLayer (which is simply the encoder layer of the Transformer) is:
- BertAttention: LayerNorm(MultiHeadAttention(x)+x)
- BertIntermediate+BertOutput: LayerNorm(FF(x)+x)

![](images/encoder.png)
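
Both sub-layers follow the same pattern, LayerNorm(sublayer(x)+x). The sketch below shows that pattern for a single vector in plain Python. It is a simplification: the learned gain and bias of LayerNorm are omitted, and `toy_sublayer` is a made-up stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(sublayer(x) + x) for a single vector x."""
    return layer_norm([s + v for s, v in zip(sublayer(x), x)])

def toy_sublayer(x):
    # Hypothetical stand-in for MultiHeadAttention or the feed-forward network
    return [2.0 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```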

In [49]:
attention_mask = torch.ones_like(tokens_tensor)
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
extended_attention_mask = extended_attention_mask.to(dtype=next(model.parameters()).dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

head_mask = [None] * model.config.num_hidden_layers
model.encoder.output_hidden_states=True
encoder_outputs = model.encoder(embedding_output,
                                       extended_attention_mask,
                                       head_mask=head_mask)
# encoder_outputs: outputs, (hidden states), (attentions)

### BertEncoder code summary

In [None]:
class BertLayer(nn.Module):
    def __init__(self, config):
        super(BertLayer, self).__init__()
        self.attention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(self, hidden_states, attention_mask, head_mask=None):
        attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
        attention_output = attention_outputs[0]
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output, attention_output)
        outputs = (layer_output,) + attention_outputs[1:]  # add attentions if we output them
        return outputs

In [None]:
class BertEncoder(nn.Module):
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask, head_mask=None):
        all_hidden_states = ()
        all_attentions = ()
        for i, layer_module in enumerate(self.layer):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
            hidden_states = layer_outputs[0]

            if self.output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            outputs = outputs + (all_attentions,)
        return outputs  # outputs, (hidden states), (attentions)

## Step-3 BertPooler
Take the first token of each sequence and apply a linear transformation followed by tanh
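
As a framework-free illustration, pooling amounts to tanh(W·h + b) applied to the first token's hidden state only. In the sketch below the weight matrix is an identity matrix chosen purely for readability, and the function name `pool` is hypothetical.

```python
import math

def pool(sequence_output, weight, bias):
    """tanh(W @ h_cls + b), where h_cls is the first token's hidden state."""
    first_token = sequence_output[0]
    return [math.tanh(sum(w * h for w, h in zip(row, first_token)) + b)
            for row, b in zip(weight, bias)]

sequence_output = [[0.5, -1.0], [9.0, 9.0]]  # two tokens, hidden size 2
weight = [[1.0, 0.0], [0.0, 1.0]]             # identity, for illustration only
bias = [0.0, 0.0]
pooled = pool(sequence_output, weight, bias)
print(pooled)
```

The second token never enters the computation directly; it influences the pooled vector only through the attention layers that produced the first token's hidden state.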

In [55]:
sequence_output = encoder_outputs[0]
pooled_output = model.pooler(sequence_output)

In [None]:
class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

In [54]:
first_token_tensor = sequence_output[:,0]
assert tuple(first_token_tensor.shape) == (1,model.config.hidden_size)

# BertForMaskedLM
It contains two components:
- BertModel
- BertOnlyMLMHead (cls): essentially a feed-forward network with a single hidden layer
    - BertLMPredictionHead
        - BertPredictionHeadTransform (hidden_size->hidden_size, followed by the activation and LayerNorm)
        - decoder (linear map hidden_size->vocab_size)
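
The decoder step can be pictured as scoring every vocabulary entry by the dot product between the transformed hidden state and that entry's embedding, plus a per-token bias. The sketch below uses a three-word toy vocabulary with hand-picked 2-dimensional embeddings. All values and the function name `decode` are assumptions made for illustration.

```python
# Toy embedding table and output bias (hand-picked values, for illustration only)
embedding_table = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "milk": [-1.0, -1.0]}
output_bias = {"dog": 0.0, "cat": 0.0, "milk": 0.0}

def decode(hidden, table, bias):
    """Score every vocabulary entry as dot(hidden, embedding) + bias."""
    return {token: sum(h * e for h, e in zip(hidden, emb)) + bias[token]
            for token, emb in table.items()}

scores = decode([0.9, 0.2], embedding_table, output_bias)
predicted = max(scores, key=scores.get)  # argmax over the vocabulary
print(predicted)  # dog
```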

        

In [None]:
def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None,
            position_ids=None, head_mask=None):
    outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                        attention_mask=attention_mask, head_mask=head_mask)

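    sequence_output = outputs[0]
    prediction_scores = self.cls(sequence_output)
    # Summary: the loss computation for masked_lm_labels is omitted here
    return (prediction_scores,) + outputs[2:]  # prediction_scores, (hidden_states), (attentions)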

In [None]:
class BertPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super(BertPredictionHeadTransform, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)):
            self.transform_act_fn = ACT2FN[config.hidden_act]
        else:
            self.transform_act_fn = config.hidden_act
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states

In [None]:
class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super(BertLMPredictionHead, self).__init__()
        self.transform = BertPredictionHeadTransform(config)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size,
                                 config.vocab_size,
                                 bias=False)

        self.bias = nn.Parameter(torch.zeros(config.vocab_size))

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states) + self.bias
        return hidden_states


In [None]:
class BertOnlyMLMHead(nn.Module):
    def __init__(self, config):
        super(BertOnlyMLMHead, self).__init__()
        self.predictions = BertLMPredictionHead(config)

    def forward(self, sequence_output):
        prediction_scores = self.predictions(sequence_output)
        return prediction_scores


# BertForNextSentencePrediction workflow
- BertModel
- BertOnlyNSPHead(hidden_size->2)
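
The NSP head is a single linear layer that maps the pooled vector to two logits; a softmax then turns them into probabilities for the two classes. The sketch below uses a 3-dimensional pooled vector and hand-picked weights. These numbers and the function name `nsp_probabilities` are assumptions for illustration only.

```python
import math

def nsp_probabilities(pooled_output, weight, bias):
    """Linear map hidden_size -> 2, followed by a softmax over the two logits."""
    logits = [sum(w * p for w, p in zip(row, pooled_output)) + b
              for row, b in zip(weight, bias)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

pooled_output = [0.2, -0.4, 0.7]                # toy pooled vector
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]     # 2 x hidden_size, hand-picked
bias = [0.0, 0.0]
probs = nsp_probabilities(pooled_output, weight, bias)
print(probs)
```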


In [None]:
class BertOnlyNSPHead(nn.Module):
    def __init__(self, config):
        super(BertOnlyNSPHead, self).__init__()
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, pooled_output):
        seq_relationship_score = self.seq_relationship(pooled_output)
        return seq_relationship_score

In [None]:
# Excerpt from the body of BertForNextSentencePrediction.forward
outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
pooled_output = outputs[1]
seq_relationship_score = self.cls(pooled_output)