### Outline:  
In this tutorial, we walk through an implementation of the BERT model, proposed in November 2018.  
We will mainly focus on four aspects:  

1. What is BERT and word embedding?  
2. What is a tokenizer and what is contextual embedding?  
3. How to get word, sentence, and paragraph embeddings from token-level embeddings?  
4. How to fine-tune your own BERT model?  

### 1. What is BERT and word embedding?
"This week, we **open sourced** a new technique for NLP **pre-training** called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own **state-of-the-art** question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU." - [Google AI Blog, Nov.2.2018](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)  


<img src=https://4.bp.blogspot.com/-iQZIsE3lbVY/W9i8Tc-F7RI/AAAAAAAADfU/DrxjBoDfqrwe6GJUxENqWuzQ0IPlgT3TgCLcBGAs/s1600/image3.png width="600">


#### Motivations:
 - The first deeply bidirectional contextual embedding (vs. GloVe, which is non-contextual).
 - Data gap for Natural Language Processing tasks: labeled data is scarce, so pre-training on unlabeled text helps.
 - Easy to use and fine-tune for your own project to achieve state-of-the-art results.

### 2. What is a tokenizer and what is contextual embedding?
From word to embedding vector:  
word -> tokens -> ids -> tensors -> hidden state vectors -> embeddings  

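Contextual = 1 to many (the same word gets a different vector in each context)  
Non-contextual = 1 to 1 (each word always maps to the same vector)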


In [1]:
# ! pip install pytorch_pretrained_bert

In [2]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

print(torch.cuda.is_available())
# torch.cuda.get_device_name(0)

True


In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {  
    'bert-base-uncased': 512,  
    'bert-large-uncased': 512,  
    'bert-base-cased': 512,  
    'bert-large-cased': 512,  
    'bert-base-multilingual-uncased': 512,  
    'bert-base-multilingual-cased': 512,  
    'bert-base-chinese': 512,  
    'bert-base-german-cased': 512,  
    'bert-large-uncased-whole-word-masking': 512,  
    'bert-large-cased-whole-word-masking': 512,  
    'bert-large-uncased-whole-word-masking-finetuned-squad': 512,  
    'bert-large-cased-whole-word-masking-finetuned-squad': 512,  
    'bert-base-cased-finetuned-mrpc': 512,  
}  

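These are the maximum input lengths (in tokens, i.e. the number of position embeddings) for each pre-trained model, from https://github.com/huggingface/pytorch-transformers/blob/df9d6effae43e92761eb92540bc45fac846789ee/pytorch_transformers/tokenization_bert.py#L86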

In [4]:
# word -> tokens
raw_text_a = "Seahorses dreamed about word embeddings and artificial intelligence"
raw_text_b = "Seahorse is a smart animal"
text_a = "[CLS] " + raw_text_a + " [SEP]" # notice the space
text_b = raw_text_b + " [SEP]" # don't need [CLS] for the second sentence
tokens_a = tokenizer.tokenize(text_a)
tokens_b = tokenizer.tokenize(text_b)

wrong_tokens = tokenizer.tokenize("[CLS]" + raw_text_a + "[SEP]")
print("Wrong sentence:\n", "[CLS]" + raw_text_a + "[SEP]")
print(wrong_tokens, "\n")
print("Right sentence:\n", text_a)
print(tokens_a) # notice the word seahorses and embedding

Wrong sentence:
 [CLS]Seahorses dreamed about word embeddings and artificial intelligence[SEP]
['[', 'cl', '##s', ']', 'sea', '##horse', '##s', 'dreamed', 'about', 'word', 'em', '##bed', '##ding', '##s', 'and', 'artificial', 'intelligence', '[', 'sep', ']'] 

Right sentence:
 [CLS] Seahorses dreamed about word embeddings and artificial intelligence [SEP]
['[CLS]', 'sea', '##horse', '##s', 'dreamed', 'about', 'word', 'em', '##bed', '##ding', '##s', 'and', 'artificial', 'intelligence', '[SEP]']
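
The sub-word splits above (`sea`, `##horse`, `##s`) come from WordPiece's greedy longest-match-first lookup. The sketch below reproduces that idea on a tiny hand-made vocabulary. `TOY_VOCAB` and `wordpiece_split` are illustrative names, not part of `pytorch_pretrained_bert`, and the real tokenizer additionally lower-cases the text, splits punctuation, and uses a vocabulary of about 30,000 entries.

```python
# Toy vocabulary (an assumption for illustration; the real one is far larger).
TOY_VOCAB = {"sea", "##horse", "##s", "em", "##bed", "##ding", "word", "[UNK]"}


def wordpiece_split(word, vocab=TOY_VOCAB, unk_token="[UNK]"):
    """Greedily split one lower-cased word into the longest matching pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_split("seahorses"))   # ['sea', '##horse', '##s']
print(wordpiece_split("embeddings"))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_split("zebra"))       # ['[UNK]']
```

Because every continuation piece is marked with `##`, the original word boundaries can be recovered from the token list later on.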


In [5]:
print(list(tokenizer.vocab.keys())[:5])
print(list(tokenizer.vocab.keys())[6000:6005])

['[PAD]', '[unused0]', '[unused1]', '[unused2]', '[unused3]']
['peninsula', 'adults', 'novels', 'emerged', 'vienna']


In [6]:
tokens = []
input_type_ids = []
# segment ids (token type ids): 0 for the first sentence, 1 for the second sentence.
# use 0 everywhere if there's only one sentence.

for token in tokens_a:
    tokens.append(token)
    input_type_ids.append(0)

if tokens_b:
    for token in tokens_b:
        tokens.append(token)
        input_type_ids.append(1)
        
print("tokens:", tokens)   
print("type_ids:", input_type_ids)

tokens: ['[CLS]', 'sea', '##horse', '##s', 'dreamed', 'about', 'word', 'em', '##bed', '##ding', '##s', 'and', 'artificial', 'intelligence', '[SEP]', 'sea', '##horse', 'is', 'a', 'smart', 'animal', '[SEP]']
type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


In [7]:
# tokens -> ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
for pair in zip(tokens, input_ids):
    print(pair)
# notice the casing: the uncased model lower-cases the input ("Seahorses" -> "sea", "##horse", "##s")

('[CLS]', 101)
('sea', 2712)
('##horse', 23024)
('##s', 2015)
('dreamed', 13830)
('about', 2055)
('word', 2773)
('em', 7861)
('##bed', 8270)
('##ding', 4667)
('##s', 2015)
('and', 1998)
('artificial', 7976)
('intelligence', 4454)
('[SEP]', 102)
('sea', 2712)
('##horse', 23024)
('is', 2003)
('a', 1037)
('smart', 6047)
('animal', 4111)
('[SEP]', 102)


In [8]:
# padding
seq_length = 30 # padded length for each pair of sentences; BERT allows at most 512 tokens
input_mask = [1] * len(input_ids)
print(input_ids)
print(input_mask)
print(input_type_ids)
while len(input_ids) < seq_length:
    input_ids.append(0)
    input_mask.append(0)
    input_type_ids.append(0)
    
print()
print(input_ids)
print(input_mask)
print(input_type_ids)

[101, 2712, 23024, 2015, 13830, 2055, 2773, 7861, 8270, 4667, 2015, 1998, 7976, 4454, 102, 2712, 23024, 2003, 1037, 6047, 4111, 102]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

[101, 2712, 23024, 2015, 13830, 2055, 2773, 7861, 8270, 4667, 2015, 1998, 7976, 4454, 102, 2712, 23024, 2003, 1037, 6047, 4111, 102, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
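
The `while` loop above pads the three lists but never shortens an input that is longer than `seq_length`. A minimal sketch of a helper that does both is shown below. `pad_or_truncate` is a hypothetical name, and the truncation is deliberately naive: it cuts the tail and may drop the final `[SEP]`, which a real pipeline would want to handle.

```python
def pad_or_truncate(input_ids, input_type_ids, seq_length, pad_id=0):
    """Return (ids, mask, type_ids), each exactly seq_length long."""
    ids = list(input_ids[:seq_length])
    type_ids = list(input_type_ids[:seq_length])
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = seq_length - len(ids)     # 0 when the input was truncated
    ids += [pad_id] * n_pad           # id 0 is [PAD] in the BERT vocabulary
    mask += [0] * n_pad               # 0 marks padding the model should ignore
    type_ids += [0] * n_pad
    return ids, mask, type_ids


ids, mask, type_ids = pad_or_truncate([101, 2712, 102], [0, 0, 0], seq_length=5)
print(ids)       # [101, 2712, 102, 0, 0]
print(mask)      # [1, 1, 1, 0, 0]
print(type_ids)  # [0, 0, 0, 0, 0]
```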


In [9]:

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model = model.cuda()
# Put the model in "evaluation" mode: dropout is disabled, so the forward pass is deterministic.
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=

In [10]:
input_ids

[101,
 2712,
 23024,
 2015,
 13830,
 2055,
 2773,
 7861,
 8270,
 4667,
 2015,
 1998,
 7976,
 4454,
 102,
 2712,
 23024,
 2003,
 1037,
 6047,
 4111,
 102,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [11]:
# Predict hidden states features for each layer
with torch.no_grad():
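    # ids -> hidden state vectors
    # NOTE: BertModel expects inputs of shape [batch_size, seq_len]. view(-1, 1) gives
    # [30, 1], i.e. 30 sequences of one token each, so tokens cannot attend to each other;
    # view(1, -1) gives a single 30-token sequence (as used in the contextual example later).
    input_tensor = torch.LongTensor(input_ids).cuda().view(-1,1)
    input_mask = torch.LongTensor(input_mask).cuda().view(-1,1)
    input_type_ids = torch.LongTensor(input_type_ids).cuda().view(-1,1)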
    
    print(input_tensor.shape)
    print(input_mask.shape)  
    print(input_type_ids.shape)
    encoded_layers, _ = model(input_tensor, token_type_ids=input_type_ids, attention_mask=input_mask)

torch.Size([30, 1])
torch.Size([30, 1])
torch.Size([30, 1])


In [13]:
print(type(encoded_layers), len(encoded_layers))
print(type(encoded_layers[0]), encoded_layers[0].shape)
print(input_tensor.shape)

<class 'list'> 12
<class 'torch.Tensor'> torch.Size([30, 1, 768])
torch.Size([30, 1])


In [14]:
for i,layer in enumerate(encoded_layers):
    print(f"{i}th layer with shape = {layer.shape}")

0th layer with shape = torch.Size([30, 1, 768])
1th layer with shape = torch.Size([30, 1, 768])
2th layer with shape = torch.Size([30, 1, 768])
3th layer with shape = torch.Size([30, 1, 768])
4th layer with shape = torch.Size([30, 1, 768])
5th layer with shape = torch.Size([30, 1, 768])
6th layer with shape = torch.Size([30, 1, 768])
7th layer with shape = torch.Size([30, 1, 768])
8th layer with shape = torch.Size([30, 1, 768])
9th layer with shape = torch.Size([30, 1, 768])
10th layer with shape = torch.Size([30, 1, 768])
11th layer with shape = torch.Size([30, 1, 768])


#### Now, what do we do with these hidden states?

(Image from [Jay Alammar](http://jalammar.github.io/illustrated-bert/)'s blog)


![alt text](http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png)



In [16]:
# to get a token embedding vector, we can sum the last four hidden layers
print(text_a, text_b)
sum_last_four = torch.sum(torch.stack(encoded_layers[-4:]), dim=0)
print(sum_last_four.shape)

[CLS] Seahorses dreamed about word embeddings and artificial intelligence [SEP] Seahorse is a smart animal [SEP]
torch.Size([30, 1, 768])


In [17]:
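# torch.cat defaults to dim=0, so the four layers are stacked along the first axis;
# pass dim=-1 to concatenate the four 768-d feature vectors of each token instead
print(torch.cat(encoded_layers[-4:]).shape)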

torch.Size([120, 1, 768])


### 3. From tokens to everything.
Now that we have an embedding vector for each token, how do we combine them into an embedding vector for a word, a sentence, or a paragraph?  
 - sum
 - average
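
When averaging, one common choice is to skip the padding positions so they do not dilute the result. The sketch below shows the idea with plain Python lists standing in for tensor rows. `masked_mean` is an illustrative name, and the vectors are toy values rather than BERT outputs.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row plays the role of [PAD]
mask = [1, 1, 0]
print(masked_mean(vectors, mask))  # [2.0, 3.0]
```

Replacing the division by `len(kept)` with nothing gives the "sum" variant from the list above.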

In [18]:
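# token -> sentence
# NOTE: in this [30, 1, 768] layout, dim 1 has size 1, so this mean only drops that axis;
# averaging over the 30 token positions would be torch.mean(sum_last_four, 0).
print(torch.mean(sum_last_four, 1).shape)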

torch.Size([30, 768])


In [19]:
sum_last_four.squeeze()[0].shape

torch.Size([768])

In [26]:
# token -> words
raw_text = raw_text_a + " " + raw_text_b
text = text_a + " " + text_b
print(raw_text.split(" "))
print(tokens)


cur_sentence = text.split(" ")
step_output = torch.zeros([len(cur_sentence), 768]).cuda() # one row per whitespace-separated word, including [CLS] and [SEP]
print_tokens = []
j = 0

for idx, word in enumerate(cur_sentence):
    print(word)
    num_tokens = len(tokenizer.tokenize(word))
    tmp_tokens = []
    for _ in range(num_tokens):
        tmp_tokens.append(tokens[j])
        step_output[idx, :] += sum_last_four.squeeze()[j]
        j += 1
    print_tokens.append(tmp_tokens)
#     step_output[idx, :] /= num_tokens  # optional: average the sub-word embeddings instead of summing them

['Seahorses', 'dreamed', 'about', 'word', 'embeddings', 'and', 'artificial', 'intelligence', 'Seahorse', 'is', 'a', 'smart', 'animal']
['[CLS]', 'sea', '##horse', '##s', 'dreamed', 'about', 'word', 'em', '##bed', '##ding', '##s', 'and', 'artificial', 'intelligence', '[SEP]', 'sea', '##horse', 'is', 'a', 'smart', 'animal', '[SEP]']
[CLS]
Seahorses
dreamed
about
word
embeddings
and
artificial
intelligence
[SEP]
Seahorse
is
a
smart
animal
[SEP]


In [27]:
print(step_output.shape)
print(print_tokens)

torch.Size([16, 768])
[['[CLS]'], ['sea', '##horse', '##s'], ['dreamed'], ['about'], ['word'], ['em', '##bed', '##ding', '##s'], ['and'], ['artificial'], ['intelligence'], ['[SEP]'], ['sea', '##horse'], ['is'], ['a'], ['smart'], ['animal'], ['[SEP]']]
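
The same grouping can also be recovered from the token list alone, without re-tokenizing each word, by attaching every `##` piece to the group before it. This is a minimal sketch; `group_wordpieces` is a hypothetical helper name. One caveat: punctuation is emitted as separate tokens without a `##` prefix, so groups match whitespace-separated words only when the text contains no attached punctuation.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens so each group holds the pieces of one word."""
    groups = []
    for token in tokens:
        if token.startswith("##") and groups:
            groups[-1].append(token)   # continuation of the previous word
        else:
            groups.append([token])     # start of a new word (or a special token)
    return groups


tokens = ['[CLS]', 'sea', '##horse', '##s', 'dreamed', '[SEP]']
print(group_wordpieces(tokens))
# [['[CLS]'], ['sea', '##horse', '##s'], ['dreamed'], ['[SEP]']]
```

The length of each group is the number of rows to sum (or average) for that word.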


In [28]:
print("salads&pokè")
print(tokenizer.tokenize("salads&pokè"))

salads&pokè
['salad', '##s', '&', 'poke']


In [29]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("[PAD]"))

[0]

#### Confirming contextual feature

In [32]:
# BOOM! This cell fails: the model needs a 2-D [batch_size, seq_len] tensor.

max_length = 30 # 256/512
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
contextual_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("[CLS] " + text + " [SEP]"))

# padding
while len(contextual_ids) < max_length:
    contextual_ids.append(0)

contextual_tensor = torch.LongTensor(contextual_ids).cuda() # shape = [30] after padding (22 real tokens)

with torch.no_grad():
    encoded_layers, _ = model(contextual_tensor) # fails: the batch axis is missing, you need .view(1, -1)
    
contextual_encoded_layers = torch.sum(torch.stack(encoded_layers[-4:]), dim=0)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

In [33]:
max_length = 30 # 256/512
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
contextual_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("[CLS] " + text + " [SEP]"))

# padding
while len(contextual_ids) < max_length:
    contextual_ids.append(0)

contextual_tensor = torch.LongTensor(contextual_ids).cuda() # shape = [30] after padding (22 real tokens)

with torch.no_grad():
    encoded_layers, _ = model(contextual_tensor.view(1,-1)) # shape [1, 30]: one sequence of 30 tokens
    
contextual_encoded_layers = torch.sum(torch.stack(encoded_layers[-4:]), dim=0)    

In [34]:
contextual_encoded_layers.shape

torch.Size([1, 30, 768])

In [35]:
for i,x in enumerate(tokenizer.tokenize("[CLS] " + text + " [SEP]")):
    print(i,x)

0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]


In [36]:
print("First fifteen values of 'bank' as in 'bank vault':")
contextual_encoded_layers.squeeze()[6][:15]

First fifteen values of 'bank' as in 'bank vault':


tensor([ 2.1455, -3.6647, -0.8976, -0.1880,  1.2952,  0.2723, -2.5086,  1.3888,
        -0.9232, -2.2783, -0.8348, -1.5500, -0.5243,  1.6353, -3.8010],
       device='cuda:0')

In [37]:
print("First fifteen values of 'bank' as in 'bank robber':")
contextual_encoded_layers.squeeze()[10][:15]

First fifteen values of 'bank' as in 'bank robber':


tensor([ 1.0651, -2.8878, -0.0540, -0.6728,  0.9880,  1.5694, -2.5486, -0.1006,
         0.4420, -1.0535, -0.6334, -0.2961, -0.1209,  2.1677, -3.8774],
       device='cuda:0')

In [38]:
print("First fifteen values of 'bank' as in 'river bank':")
contextual_encoded_layers.squeeze()[19][:15]

First fifteen values of 'bank' as in 'river bank':


tensor([ 0.7650, -0.8561, -0.4355, -1.4705,  1.7509, -0.3871,  1.4671,  4.2738,
        -0.2546, -1.5692,  2.3346, -0.2058, -0.5221,  1.4032, -3.3482],
       device='cuda:0')

In [41]:
from sklearn.metrics.pairwise import cosine_similarity
# Compare "bank" as in "bank robber" to "bank" as in "river bank"
different_bank = cosine_similarity(
    contextual_encoded_layers.squeeze()[10].view(1,-1).cpu(), 
    contextual_encoded_layers.squeeze()[19].view(1,-1).cpu())[0][0]

# Compare "bank" as in "bank robber" to "bank" as in "bank vault" 
same_bank = cosine_similarity(
    contextual_encoded_layers.squeeze()[10].view(1,-1).cpu(),  
    contextual_encoded_layers.squeeze()[6].view(1,-1).cpu())[0][0]

print("Similarity of 'bank' as in 'bank robber' to 'bank' as in 'bank vault':",  same_bank)
print("Similarity of 'bank' as in 'bank robber' to 'bank' as in 'river bank':",  different_bank)

Similarity of 'bank' as in 'bank robber' to 'bank' as in 'bank vault': 0.9276867
Similarity of 'bank' as in 'bank robber' to 'bank' as in 'river bank': 0.73005724
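
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The plain-Python sketch below spells that out on toy vectors (not the BERT outputs above). `cosine` is an illustrative name, and scikit-learn's `cosine_similarity` remains the convenient choice for batches.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Values near 1 mean the two vectors point the same way, which is why the two financial senses of "bank" score higher with each other than with the river sense.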


### 4. Fine-tuning.
To make the embeddings even better, you can fine-tune the language model on your own corpus.  
https://github.com/huggingface/pytorch-transformers/tree/master/examples/lm_finetuning

#### - Input format
The scripts in this folder expect a single file as input, consisting of untokenized text, with one sentence per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a next sentence objective in which the model must predict whether two sequences of text are contiguous text from the same document or not, and to avoid making the task too easy, the split point between the sequences is always at the end of a sentence. The linebreaks in the file are therefore necessary to mark the points where the text can be split.
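
A minimal sketch of producing that layout is shown below, assuming the documents have already been split into sentences (the sentence splitting itself is out of scope here). `format_corpus` is a hypothetical helper, not part of the fine-tuning scripts.

```python
def format_corpus(documents):
    """Render documents (each a list of sentences) as one sentence per line,
    with one blank line between documents."""
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace so a sentence never spans several lines.
        lines = [" ".join(s.split()) for s in sentences]
        lines = [line for line in lines if line]
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


docs = [
    ["Seahorses dreamed about word embeddings.", "They are smart animals."],
    ["The bank robber was seen fishing."],
]
print(format_corpus(docs))
```

The returned string can be written to the text file that is passed to the data-generation script as the training corpus.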

In [None]:
!python3 pregenerate_training_data.py \
    --train_corpus receipts_corpus.txt \
    --bert_model bert-base-multilingual-uncased \
    --do_lower_case \
    --output_dir training/ \
    --epochs_to_generate 3 \
    --max_seq_len 256

!python3 finetune_on_pregenerated.py \
    --pregenerated_data training/ \
    --bert_model bert-base-multilingual-uncased \
    --do_lower_case \
    --output_dir finetuned_lm/ \
    --epochs 3 \
    --reduce_memory \
    --train_batch_size 4

