# Load Pre-Trained BERT

Install the PyTorch interface for BERT by Hugging Face.

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)

In [None]:
import torch
from transformers import BertTokenizer, BertModel

# For Logging
import logging
#logging.basicConfig(level=logging.INFO)

# For Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Load pre-trained model tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


## BERT Tokenizer


BERT provides its own tokenizer. All pre-trained models come with their own tokenizer.

The BERT tokenizer was created with a WordPiece model. This model greedily creates a fixed-size vocabulary of individual characters, subwords, and words that best fits our language data.

This vocabulary contains four things:

1. Whole words
2. Subwords occurring at the front of a word or in isolation
3. Subwords not at the front of a word, which are preceded by '##' to denote this case
4. Individual characters

To tokenize a word under this model, the tokenizer first checks if the whole word is in the vocabulary. If not, it tries to break the word into the largest possible subwords contained in the vocabulary, and as a last resort will decompose the word into individual characters. Because of this, we can always represent a word as, at the very least, the collection of its individual characters, so there is no need to assign out-of-vocabulary words to a catch-all token like 'OOV' or 'UNK'.
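
The lookup described above can be sketched as a greedy longest-match-first search. This is a minimal illustration and not the Hugging Face implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for this example. The fallback to `unk_token` is only reached when not even a single character of the word is in the vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that are not at the front of the word carry '##'.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Not even a single character matched (hypothetical fallback).
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"bank", "em", "##bed", "##ding", "##s", "e", "##m"}

print(wordpiece_tokenize("bank", toy_vocab))        # ['bank']
print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
```

Note that "em" is chosen over "e" for the first piece because the search always prefers the longest match available at the current position.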

Break the sentence into tokens with the BERT tokenizer.

In [None]:
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

# Print out the tokens.
print(tokenized_text)

['[CLS]', 'after', 'stealing', 'money', 'from', 'the', 'bank', 'vault', ',', 'the', 'bank', 'robber', 'was', 'seen', 'fishing', 'on', 'the', 'mississippi', 'river', 'bank', '.', '[SEP]']


After breaking the text into tokens, we then have to convert the sentence from a list of strings to a list of vocabulary indices.

In [None]:
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{} {}'.format(tup[0], tup[1]))

[CLS] 101
after 2044
stealing 11065
money 2769
from 2013
the 1996
bank 2924
vault 11632
, 1010
the 1996
bank 2924
robber 27307
was 2001
seen 2464
fishing 5645
on 2006
the 1996
mississippi 5900
river 2314
bank 2924
. 1012
[SEP] 102
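
Conceptually, this conversion is a dictionary lookup from token string to vocabulary index. The sketch below uses a tiny hand-written vocabulary whose IDs are copied from the output above; the helper `tokens_to_ids` and the `[UNK]` entry (index 100, assumed here to match `bert-base-uncased`) are illustrative and are not part of the `transformers` API.

```python
# Tiny illustrative vocabulary; IDs for [CLS], [SEP], 'the' and 'bank'
# are taken from the printed output above. The [UNK] index is an assumption.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "the": 1996, "bank": 2924}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token string to its index, falling back to the unknown token."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


ids = tokens_to_ids(["[CLS]", "the", "bank", "[SEP]"], toy_vocab)
print(ids)  # [101, 1996, 2924, 102]
```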


### Segment ID
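BERT is trained on, and expects, sentence pairs, using segment IDs of 0 and 1 to distinguish between the two sentences.

If we have two sentences, we assign a 0 to each token of the first sentence, including its '[SEP]' token, and a 1 to every token of the second sentence.

Single-sentence inputs need only one segment ID repeated for every token; this notebook uses a series of 1s. (The Hugging Face tokenizer itself assigns 0s to a single sentence by default.)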
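
For the sentence-pair case, the assignment rule can be sketched as follows. The helper `make_segment_ids` and the example tokens are hypothetical and only illustrate the rule stated above: every token up to and including the first '[SEP]' gets a 0, and everything after it gets a 1.

```python
def make_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 to the first sentence (including its [SEP]) and 1 to the rest."""
    segment_ids = []
    current = 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == sep_token:
            # Tokens after the first [SEP] belong to the second sentence.
            current = 1
    return segment_ids


pair_tokens = ["[CLS]", "the", "bank", "[SEP]", "the", "river", "[SEP]"]
pair_segment_ids = make_segment_ids(pair_tokens)
print(pair_segment_ids)  # [0, 0, 0, 0, 1, 1, 1]
```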



In [None]:
# Mark each of the tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)

print (segments_ids)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


# Extracting Embeddings using BERT

## Running BERT on our text

The BERT PyTorch interface requires that the data be in torch tensors rather than Python lists.

In [None]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Calling `from_pretrained` will fetch the model from the internet.

`model.eval()` puts our model in evaluation mode as opposed to training mode. This turns off dropout regularization, which is used only during training.

In [None]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

Evaluate BERT on our example text, and fetch the hidden states of the network.

`torch.no_grad` tells PyTorch not to construct the compute graph during this forward pass (since we won't be running backprop here). It just reduces memory consumption and speeds things up a little.

Note: Evaluating the model returns a different number of objects depending on how it was configured in the earlier `from_pretrained` call. In this case, because we set `output_hidden_states = True`, the third item will be the hidden states from all layers. (Ref: https://huggingface.co/transformers/model_doc/bert.html#bertmodel )

In [None]:
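# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers.
with torch.no_grad():
    # Note: in `transformers`, the second positional argument of
    # `BertModel.forward` is `attention_mask`, so `segments_tensors` (all 1s)
    # acts as the attention mask here and the token type IDs default to 0s.
    outputs = model(tokens_tensor, segments_tensors)

    # Because we set `output_hidden_states = True`, the third item holds
    # the hidden states from all layers.
    hidden_states = outputs[2]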

In [None]:
print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
layer_i = 0

print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0

print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0

print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

# 'hidden_states' is a Python tuple.
print('Type of hidden_states: ', type(hidden_states))

# Each layer in the list is a torch tensor.
print('Tensor shape for each layer: ', hidden_states[0].size())

Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768
Type of hidden_states:  <class 'tuple'>
Tensor shape for each layer:  torch.Size([1, 22, 768])


## Get Token Embeddings from hidden states

In [None]:
# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

token_embeddings.size()

torch.Size([13, 1, 22, 768])

In [None]:
# Remove dimension 1, the "batches" (not needed)
token_embeddings = torch.squeeze(token_embeddings, dim=1)

token_embeddings.size()

torch.Size([13, 22, 768])

In [None]:
# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)

token_embeddings.size()

torch.Size([22, 13, 768])
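
To see what the dimension swap does without any tensors, here is a small pure-Python sketch on nested lists. The sizes and the names `by_layer` and `by_token` are made up for illustration; the idea is only that data indexed as `[layer][token]` becomes indexed as `[token][layer]`, which is the effect of `permute(1,0,2)` above.

```python
# Toy sizes, chosen only to keep the example readable.
num_layers, num_tokens, num_features = 3, 2, 4

# by_layer[l][t] is the feature vector of token t at layer l.
by_layer = [
    [[f"L{l}T{t}F{f}" for f in range(num_features)] for t in range(num_tokens)]
    for l in range(num_layers)
]

# zip(*by_layer) groups together the vectors that belong to the same token,
# so by_token[t][l] is the feature vector of token t at layer l.
by_token = [list(layer_vectors) for layer_vectors in zip(*by_layer)]

print(len(by_token), len(by_token[0]), len(by_token[0][0]))  # 2 3 4
print(by_token[1][2] == by_layer[2][1])                      # True
```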

## Creating word and sentence vectors from hidden states

We want to get individual vectors for each of our tokens, or a single vector representation of the whole sentence.

As shown above, for each token of our input we have 13 separate vectors, each of length 768. To get a single vector per token, we need to combine some of the layer vectors.

But which layer or combination of layers provides the best representation? There's no single easy answer, so we can try a couple of reasonable approaches.

### Word Vectors



Approach A: **concatenate** 

Concatenate the last four (or n) layers, giving us a single word vector per token. Each vector will have length `4 x 768 = 3,072`. 

In [None]:
# Stores the token vectors, with shape [22 x 3,072]
token_vecs_cat = []

# 'token_embeddings' is a [22 x 13 x 768] tensor.

# For each token in the sentence...
for token in token_embeddings:

    # 'token' is a [13 x 768] tensor

    # Concatenate the vectors (append them together) from the last four layers.
    # Each layer vector is 768 values, so 'cat_vec' is length 3,072.
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    
    # Use 'cat_vec' to represent `token`.
    token_vecs_cat.append(cat_vec)

print ('Shape is: %d x %d' % (len(token_vecs_cat), len(token_vecs_cat[0])))

Shape is: 22 x 3072


Approach B: **summing** 

Create the word vectors by summing together the last four (or n) layers.

In [None]:
# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []

# 'token_embeddings' is a [22 x 13 x 768] tensor.

# For each token in the sentence...
for token in token_embeddings:

    # 'token' is a [13 x 768] tensor

    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)
    
    # Use 'sum_vec' to represent `token`.
    token_vecs_sum.append(sum_vec)

print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))

Shape is: 22 x 768


### Sentence Vectors



To get a single vector for our entire sentence we have multiple application-dependent strategies.

A simple approach is to average the second-to-last hidden layer over all tokens, producing a single vector of length 768.

In [None]:
# 'hidden_states' is a tuple of 13 tensors, each of shape [1 x 22 x 768]

# 'token_vecs' is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]

# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)

print ("Final sentence embedding vector has shape:", sentence_embedding.size())

Final sentence embedding vector has shape: torch.Size([768])
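
The averaging step itself is just a per-dimension mean over the token vectors. A minimal pure-Python sketch, with a hypothetical helper `mean_pool` and a made-up two-token, three-dimensional input, looks like this:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors, dimension by dimension."""
    n = len(token_vectors)
    # zip(*token_vectors) yields one column (one dimension) at a time.
    return [sum(column) / n for column in zip(*token_vectors)]


# Two toy "token vectors" of length 3 (the real ones have length 768).
toy_token_vecs = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(toy_token_vecs))  # [2.0, 3.0, 4.0]
```

The result has the same length as a single token vector, which is why the sentence embedding above has shape `[768]`.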


## Confirming contextually dependent vectors

To confirm that the values of these vectors are in fact contextually dependent, let's look at the different instances of the word "bank" in our example sentence:

"After stealing money from the **bank vault**, the **bank robber** was seen fishing on the Mississippi **river bank**."

In [None]:
for i, token_str in enumerate(tokenized_text):
  print (i, token_str)

0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]


The three instances of "bank" are at indices 6, 10, and 19.

For this analysis, let's use the word vectors that we created by summing the last four layers.

In [None]:
print('First 5 vector values for each instance of "bank".')
print('')
print("bank vault   ", str(token_vecs_sum[6][:5]))
print("bank robber  ", str(token_vecs_sum[10][:5]))
print("river bank   ", str(token_vecs_sum[19][:5]))

First 5 vector values for each instance of "bank".

bank vault    tensor([ 3.3596, -2.9805, -1.5421,  0.7065,  2.0031])
bank robber   tensor([ 2.7359, -2.5577, -1.3094,  0.6797,  1.6633])
river bank    tensor([ 1.5266, -0.8895, -0.5152, -0.9298,  2.8334])


The vector values in each case differ, but let's calculate the cosine similarity between the vectors to make a more precise comparison.

In [None]:
from scipy.spatial.distance import cosine

# Calculate the cosine similarity between the word bank
# in "bank robber" vs "bank vault" (same meaning).
same_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[6])
print('Vector similarity for  *similar*  meanings:  %.2f' % same_bank)

# Calculate the cosine similarity between the word bank 
# in "bank robber" vs "river bank" (different meanings).
diff_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[19])
print('Vector similarity for *different* meanings:  %.2f' % diff_bank)

Vector similarity for  *similar*  meanings:  0.94
Vector similarity for *different* meanings:  0.69
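
For reference, the quantity computed above can be written out by hand. SciPy's `cosine` returns a cosine *distance*, which is why the cell uses `1 - cosine(...)` to obtain a similarity. The sketch below is a plain-Python version of the similarity; the function name `cosine_similarity` and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing the same way have similarity 1; orthogonal vectors have 0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the measure depends only on direction and not on vector length, it is a convenient way to compare embeddings whose magnitudes differ.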
