### Adapted from:
- https://blog.scaleway.com/2019/understanding-text-with-bert/
- https://www.kaggle.com/christofhenkel/loading-bert-using-pytorch-with-tokenizer-apex
- https://github.com/huggingface/transformers/blob/master/examples/run_squad.py

In [1]:
from __future__ import absolute_import, division, print_function

In [2]:
import argparse
import logging
import os
import random
import glob

In [3]:
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler

In [4]:
from torch.utils.tensorboard import SummaryWriter

In [5]:
from tqdm import tqdm, trange

In [6]:
from pytorch_pretrained_bert import BertConfig, BertForQuestionAnswering, BertTokenizer, BertModel

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


#### Architecture of BERT Encoder

The BERT model consists of an embedding layer, a BERT encoder, and a final pooler.

The Bert Embedding layer has the following components and dimensions:

- word embeddings: size of vocabulary by size of hidden state;

- position embeddings: max. length of sentence (512 tokens) by size of hidden state;

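- token_type_embeddings: 2 by size of hidden state (segment embeddings that mark whether a token belongs to the first or the second segment of the input, for example the question versus the passage in question answering);

- LayerNorm: layer normalization applied to the sum of the three embeddings, with a learned scale and bias, each of size of hidden state;

- Dropout: dropout (p = 0.1) applied to the normalized embeddings.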

The Bert Encoder is composed of 12 Bert Layers stacked on top of each other. Each Bert Layer is a replica of the same structure:

- Bert self attention: query (size of hidden state by size of hidden state); key (size of hidden state by size of hidden state); value (size of hidden state by size of hidden state); dropout;

- Bert self attention output: dense affine layer (size of hidden state by size of hidden state); LayerNorm; dropout;

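- Bert layer intermediate: dense affine layer (size of hidden state by 4 * size of hidden state) followed by a GELU activation; the factor 4 is a hyperparameter (`intermediate_size` = 3072 for `bert-base-uncased`) that sets the width of the position-wise feed-forward network, not a concatenation of the query, key, value and self attention output vectors;

- Bert layer output: dense affine layer (4 * size of hidden state by size of hidden state); dropout; LayerNorm applied after a residual connection with the input of the intermediate layer.

The pooler is a dense affine layer (size of hidden state by size of hidden state) with a tanh activation function, applied to the final hidden state of the first token.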
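As a sanity check on the dimensions listed above, the sketch below adds up the parameters they imply for `bert-base-uncased`. It assumes that every dense layer has a bias, that each LayerNorm carries a scale and a bias vector, and that the output sub-block of each Bert Layer also ends with a LayerNorm. The helper names `linear` and `layer_norm` are made up for this illustration.

```python
# Parameter count of BERT-base, derived from the layer shapes described above.
vocab_size = 30522          # rows of word_embeddings
max_positions = 512         # rows of position_embeddings
type_vocab_size = 2         # rows of token_type_embeddings
hidden = 768                # size of hidden state
intermediate = 4 * hidden   # width of the feed-forward layer (3072)
num_layers = 12


def linear(n_in, n_out):
    """Weights plus biases of a dense affine layer (hypothetical helper)."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Learned scale and bias of a LayerNorm (hypothetical helper)."""
    return 2 * size


embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm(hidden)

attention = 3 * linear(hidden, hidden)                          # query, key, value
attention_output = linear(hidden, hidden) + layer_norm(hidden)  # self attention output
feed_forward = linear(hidden, intermediate)                     # Bert layer intermediate
layer_output = linear(intermediate, hidden) + layer_norm(hidden)
per_layer = attention + attention_output + feed_forward + layer_output

pooler = linear(hidden, hidden)
total = embeddings + num_layers * per_layer + pooler

print('embeddings: {:,}'.format(embeddings))   # 23,837,184
print('per layer:  {:,}'.format(per_layer))    # 7,087,872
print('pooler:     {:,}'.format(pooler))       # 590,592
print('total:      {:,}'.format(total))        # 109,482,240
```

If these assumptions hold, the total should agree with `sum(p.numel() for p in bert_model.parameters())` once the model is loaded.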

In [7]:
bert_model = BertModel.from_pretrained("bert-base-uncased").cuda()
bert_model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features

#### Architecture of BERT for Question Answering: BERT Encoder + top dense affine layer
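The top dense affine layer maps every token's hidden state to two scores, one for "start of the answer span" and one for "end of the answer span". The sketch below illustrates only the shapes involved, using NumPy with random numbers in place of real BERT activations and trained weights. The names `W`, `b` and `sequence_output` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 1, 23, 768

# Stand-in for the output of the last encoder layer (random, not real activations).
sequence_output = rng.standard_normal((batch, seq_len, hidden))

# Top dense affine layer: hidden state -> 2 scores per token (random weights here).
W = rng.standard_normal((hidden, 2)) * 0.02
b = np.zeros(2)

logits = sequence_output @ W + b     # shape (batch, seq_len, 2)
start_logits = logits[..., 0]        # score of each token being the span start
end_logits = logits[..., 1]          # score of each token being the span end

# Naive decoding: independent argmax. Real decoders also enforce start <= end.
start_index = start_logits.argmax(axis=-1)
end_index = end_logits.argmax(axis=-1)

print(logits.shape, start_logits.shape, end_logits.shape)
```

With `bert-base-uncased` this head adds only 768 * 2 + 2 = 1,538 parameters on top of the encoder, and they are randomly initialised until the model is fine-tuned on a question answering dataset.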

In [8]:
bert_model_QA = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
bert_model_QA.eval()

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
         

The command below loads the tokenizer together with the vocabulary of a specific pre-trained model (in this case `bert-base-uncased`).

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The command below tokenizes one sample sentence.

In [10]:
text = 'Hi, my name is Luca! I do not live on the second floor, but on the 5th floor instead.'
tokens = tokenizer.tokenize(text)
print(tokens)
print('The number of tokens is: {}'.format(len(tokens)))

['hi', ',', 'my', 'name', 'is', 'luca', '!', 'i', 'do', 'not', 'live', 'on', 'the', 'second', 'floor', ',', 'but', 'on', 'the', '5th', 'floor', 'instead', '.']
The number of tokens is: 23
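The tokenizer works in two stages: a basic split into lowercased words and punctuation marks, then a WordPiece split of each word into vocabulary pieces. The sketch below is a simplified re-implementation for illustration only, not the library's code. `basic_tokenize`, `wordpiece` and `toy_vocab` are hypothetical names, and the real tokenizer also handles accents, control characters and other cases that are ignored here.

```python
import re


def basic_tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate   # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


text = 'Hi, my name is Luca! I do not live on the second floor, but on the 5th floor instead.'
words = basic_tokenize(text)
print(words)
print(len(words))

toy_vocab = {'floor', '##s', 'play', '##ing'}   # hypothetical, not the real BERT vocabulary
print(wordpiece('floors', toy_vocab))
print(wordpiece('playing', toy_vocab))
print(wordpiece('luca', toy_vocab))
```

In the sample sentence every word is already a single entry of the `bert-base-uncased` vocabulary, which is why the basic split alone yields the same 23 tokens as the tokenizer output above.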


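The command below converts the tokens of the sample sentence to their ids in the BERT vocabulary. The `[CLS]` and `[SEP]` special tokens are not added here (the line that would add them is commented out).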

In [11]:
#tokens = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids

[7632,
 1010,
 2026,
 2171,
 2003,
 15604,
 999,
 1045,
 2079,
 2025,
 2444,
 2006,
 1996,
 2117,
 2723,
 1010,
 2021,
 2006,
 1996,
 4833,
 2723,
 2612,
 1012]
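For comparison, a fully BERT-formatted input also carries the special tokens, a segment id per token and an attention mask. The sketch below builds them by hand from the ids printed above. In the `bert-base-uncased` vocabulary `[PAD]`, `[CLS]` and `[SEP]` have ids 0, 101 and 102. The padding length `max_seq_length = 32` is an arbitrary value chosen for this illustration.

```python
# Ids produced by convert_tokens_to_ids for the sample sentence above.
input_ids = [7632, 1010, 2026, 2171, 2003, 15604, 999, 1045, 2079, 2025, 2444, 2006,
             1996, 2117, 2723, 1010, 2021, 2006, 1996, 4833, 2723, 2612, 1012]

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # ids of [CLS], [SEP], [PAD] in bert-base-uncased
max_seq_length = 32                    # hypothetical padding length for this sketch

ids = [CLS_ID] + input_ids + [SEP_ID]
attention_mask = [1] * len(ids)        # 1 for real tokens, 0 for padding
segment_ids = [0] * len(ids)           # single sentence: every token is in segment 0

padding = max_seq_length - len(ids)
ids += [PAD_ID] * padding
attention_mask += [0] * padding
segment_ids += [0] * padding

print(len(ids), sum(attention_mask))   # 32 25
```

Wrapped in tensors, the three lists correspond to the `input_ids`, `token_type_ids` and `attention_mask` arguments of the model's forward pass.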

In [12]:
bert_output = bert_model(torch.tensor([input_ids]).cuda())

In [26]:
print('type: {} of length: {}.'.format(type(bert_output), len(bert_output)))

type: <class 'tuple'> of length: 2.


In [30]:
bert_output[1]

tensor([[-7.4687e-01, -3.5512e-01, -6.8854e-01,  7.8208e-01,  2.1600e-01,
         -1.3680e-01,  6.5701e-01,  2.6315e-01,  4.1703e-02, -9.9915e-01,
          3.2556e-01,  6.0464e-01,  9.7480e-01,  3.9086e-01,  9.3562e-01,
         -7.0522e-01, -2.5194e-01, -5.1356e-01,  1.9722e-01, -1.4919e-01,
          8.2281e-01,  9.9965e-01,  3.0079e-01,  3.1392e-01,  2.7452e-01,
          5.9897e-01, -5.4306e-01,  9.2878e-01,  9.0739e-01,  7.9554e-01,
         -6.9410e-01, -5.1478e-02, -9.9209e-01, -1.6266e-01, -9.5810e-01,
         -9.8643e-01,  3.2855e-01, -4.5272e-01, -1.5372e-01,  1.3056e-01,
         -9.1916e-01,  2.9005e-01,  9.9928e-01, -4.4838e-01,  6.1825e-01,
         -2.0422e-01, -9.9996e-01,  1.2594e-01, -8.9928e-01, -1.8152e-01,
          5.8565e-01, -3.0661e-01,  8.1745e-02,  3.5404e-01,  3.0300e-01,
         -3.5766e-01, -1.3039e-01,  7.4340e-02, -1.0266e-01, -4.4181e-01,
         -4.7261e-01,  4.7402e-01, -6.0517e-01, -8.1791e-01, -2.1128e-01,
          5.2509e-01, -2.6170e-01, -3.

In [33]:
print('there are {} elements in bert_output[0]'.format(len(bert_output[0])))
bert_output[0]

there are 12 elements in bert_output[0]


[tensor([[[ 0.0254, -0.1865, -0.0811,  ...,  0.0142, -0.5408, -0.1748],
          [ 0.0422,  0.1128, -0.1998,  ...,  0.5271,  0.5992,  0.1966],
          [ 0.2058,  0.5050, -0.3177,  ..., -1.3612,  0.3292, -0.2163],
          ...,
          [ 1.2883,  0.4178, -0.1631,  ...,  0.2341,  0.2116, -1.2128],
          [-1.1631,  0.6263, -1.3262,  ...,  0.5443,  0.5697,  0.0098],
          [ 0.1227,  0.3578, -0.2816,  ...,  0.3877,  0.4342,  0.2016]]],
        device='cuda:0', grad_fn=<AddBackward0>),
 tensor([[[-0.0496, -0.3631, -0.3137,  ...,  0.1796, -0.1739, -0.1072],
          [-0.0724,  0.0648, -0.1184,  ...,  0.7879,  0.3994,  0.0904],
          [ 0.3511,  0.4609,  0.1004,  ..., -1.0712,  0.4891, -0.5336],
          ...,
          [ 0.9920,  0.7816, -0.5033,  ...,  0.4102,  0.2895, -0.7581],
          [-1.3341,  0.9566, -1.5386,  ..., -0.0635,  1.1035,  0.1095],
          [ 0.1019,  0.6002,  0.1395,  ...,  0.2404,  0.3441,  0.2954]]],
        device='cuda:0', grad_fn=<AddBackward0>),
 t

In [50]:
print('the shape of bert_output[0][0] is {}'.format(bert_output[0][0].shape))
bert_output[0][0]

the shape of bert_output[0][0] is torch.Size([1, 23, 768])


tensor([[[ 0.0254, -0.1865, -0.0811,  ...,  0.0142, -0.5408, -0.1748],
         [ 0.0422,  0.1128, -0.1998,  ...,  0.5271,  0.5992,  0.1966],
         [ 0.2058,  0.5050, -0.3177,  ..., -1.3612,  0.3292, -0.2163],
         ...,
         [ 1.2883,  0.4178, -0.1631,  ...,  0.2341,  0.2116, -1.2128],
         [-1.1631,  0.6263, -1.3262,  ...,  0.5443,  0.5697,  0.0098],
         [ 0.1227,  0.3578, -0.2816,  ...,  0.3877,  0.4342,  0.2016]]],
       device='cuda:0', grad_fn=<AddBackward0>)

In [57]:
bert_output[0][11]

tensor([[[ 0.0500,  0.5176,  0.3170,  ..., -0.5147,  0.3445,  0.8903],
         [-0.2042,  0.7380,  0.4011,  ..., -0.4998,  0.5254,  0.8762],
         [-0.1559,  0.8274,  0.1780,  ..., -0.2038,  0.0221,  1.4586],
         ...,
         [-0.1945,  0.6219,  0.2148,  ..., -0.5717,  0.6080,  0.1845],
         [-0.1612,  0.4702,  0.1959,  ..., -0.5780,  0.9934,  0.9168],
         [-0.0887,  0.1486,  0.5231,  ..., -0.0932,  0.8672,  0.1894]]],
       device='cuda:0', grad_fn=<AddBackward0>)
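To summarize the outputs above: `bert_output` is a tuple of two elements. `bert_output[0]` is a list with one tensor per encoder layer (12 in total), each of shape `[1, 23, 768]`, and `bert_output[1]` is the pooled output. The sketch below mimics how the pooled output relates to the last layer, using NumPy with random stand-in values and random pooler weights; it is an illustration of the shapes, not the library's code. Since `[CLS]` was not added in this notebook, the first token here is `hi`.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, batch, seq_len, hidden = 12, 1, 23, 768

# Stand-ins for bert_output[0]: one random array per encoder layer, not real activations.
encoded_layers = [rng.standard_normal((batch, seq_len, hidden)) for _ in range(num_layers)]

# Pooler weights, randomly initialised for this sketch.
W = rng.standard_normal((hidden, hidden)) * 0.02
b = np.zeros(hidden)

last_layer = encoded_layers[-1]                 # plays the role of bert_output[0][11]
first_token = last_layer[:, 0, :]               # hidden state of the first token, shape (1, 768)
pooled_output = np.tanh(first_token @ W + b)    # plays the role of bert_output[1]

print(len(encoded_layers), last_layer.shape, pooled_output.shape)
```

The tanh keeps every component of the pooled output strictly between -1 and 1, which matches the range of the values printed for `bert_output[1]`.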