# Introduction

In 2018, models such as [BERT](https://arxiv.org/abs/1810.04805), [OpenAI GPT](https://openai.com/blog/better-language-models/) and [ELMo](https://arxiv.org/abs/1802.05365) were released and performed well on many benchmark NLP tasks with minimal task-specific tuning. The publicly available pretrained models can either be used to extract high-quality language features from your text data, or be fine-tuned on a specific task (classification, entity recognition, etc.) with your data to produce [state-of-the-art predictions](https://gluebenchmark.com/leaderboard).

In this notebook we'll first walk you through transformers and BERT's architecture. Then we'll explore how to work with a pretrained BERT model in PyTorch, through the `transformers` library, and apply it to some toy text data.

#### This notebook has three sections:
I. **Transformers**:
    We explain the transformer architecture. The transformer is the building block of BERT. (Theoretical)

II. **Making BERT**:
    We discuss how BERT is built from transformers. We explain the details of how BERT is trained and used. (Theoretical)

III. **Using BERT**:
    We explore using a pretrained BERT model on text. (Coding)

## I. Transformers


The transformer is a relatively simple network architecture that was first presented in the paper [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf). The architecture gets rid of the forward/backward recurrent structure of BiRNNs and uses only self-attention. The emphasis on attention and lack of recurrent structure helped solve two of the main problems associated with recurrent models:

1. Due to the time dependencies, recurrent models aren't parallelizable within a layer.

2. Long term dependencies between words are often "forgotten" in recurrent models. 

In Attention is All You Need, the authors provide state-of-the-art results on machine translation tasks. Similar to older seq2seq architectures they use an encoder-decoder model, but their encoder and decoder are built from transformers rather than RNNs. In our goal to understand BERT we will mainly concern ourselves with the encoder, which takes in input word (or sub-word piece) embeddings and transforms them into contextual word embeddings. These output embeddings are distinct from [GloVe's](https://www-nlp.stanford.edu/pubs/glove.pdf) or [Word2Vec's](https://en.wikipedia.org/wiki/Word2vec) static embeddings in that they depend on the context of the word. We will see an example of this in Section III.

### High-level Transformer Architecture

*Images are taken from Jay Alammar's excellent blog post [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)*

For this part of the discussion assume that the transformer takes individual words as inputs, although it could operate at a sub-word piece level. We first present a high-level, visual overview of what a transformer block looks like:

<img src="https://i.imgur.com/bTzKBtO.png" width="500px" height="330px" align="center"/>


In the above image the transformer takes in a two-word sequence as input. The vectors $x_1$ and $x_2$ represent word embeddings of the two input words. These embeddings can be pre-trained or learned. Note that unlike an RNN the transformer takes in the sequence in its entirety rather than one word at a time. First we send all our word embeddings $x_i$ through a (for now mysterious) self-attention layer which outputs the $z_i$. These $z_i$ are then each individually put through an identical, standard feed-forward neural network. The resulting $r_i$ is word $i$'s hidden state. The $r_i$ are the transformer block's outputs.

Now that we have an overview of the architecture, let's take a closer look at the self-attention layer. 

### Query, Key, Value Self-Attention

To start off our self-attention we need to come up with three new embeddings for each of our $x_i$. We will treat $x_i$ as a row vector, supposing $x_i \in \mathbb{R}^{1 \times d}$ (our initial embeddings have dimension $d$). We define

$$q_i := x_iW_{Q}$$
$$k_i := x_iW_{K}$$
$$v_i := x_iW_{V}$$

where $W_{Q}, W_{K} \in \mathbb{R}^{d \times d_k}$ and $W_{V} \in \mathbb{R}^{d \times d_v}$ are trainable weights. For each $x_i$, the vectors $q_i$, $k_i$, $v_i$ represent query, key, and value embeddings respectively. To find $z_i$ we do the following: We use $q_i$ to "query" the "keys" $k_j$ of all the words in the input sequence (including the $i$th word). We do this by computing a similarity score $\alpha_j = q_ik_j^T/\sqrt{d_k}$ between the $i$th word and the queried $j$th word. The factor of $1/\sqrt{d_k}$ ensures that $\alpha_j$ doesn't get too large for large values of $d_k$. Treating $\alpha$ as a vector whose $j$th component is $\alpha_j$ we then have
$$z_i = \sum_j \text{Softmax}(\alpha)_j v_j$$
Namely, $z_i$ is a weighted sum of the value vectors $v_j$. To brush up on your linear algebra skills, you can confirm for yourself that if $X$ is a matrix with rows $x_i$ and

$$Q:= XW_{Q}$$
$$K := XW_{K}$$
$$V := XW_{V}$$

Then 

$$Z := \text{Softmax}(QK^T/\sqrt{d_k})V$$
gives a matrix with rows $z_i$. 
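
As a sanity check on the matrix form, here is a minimal sketch of one attention head in plain Python (lists of rows, no external libraries). The toy sizes, the random weights, and the helper names `matmul`, `softmax` and `self_attention` are our own choices for illustration; a real implementation would use tensors on a GPU.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Return (Z, weights) with Z = Softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_K[0])
    scores = matmul(Q, transpose(K))
    # The softmax is applied row by row: row i holds the scores for query i.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, d_k, d_v = 3, 4, 2, 5   # toy sizes chosen only for illustration
X = random_matrix(n, d, rng)
W_Q = random_matrix(d, d_k, rng)
W_K = random_matrix(d, d_k, rng)
W_V = random_matrix(d, d_v, rng)

Z, weights = self_attention(X, W_Q, W_K, W_V)
print(len(Z), len(Z[0]))  # n rows, each of dimension d_v
```

Each row of `weights` sums to 1, so row $i$ of `Z` is a weighted average of the value vectors, exactly as in the per-word formula for $z_i$.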


### Multi-headed Self-Attention

Our description of the self-attention layer is close to the real deal, but it's not quite there yet. In the paper, the authors implement something they call "multi-headed self-attention". What we've described so far represents one head. Generally, we want our self-attention layer to have $h$ heads. Thankfully, this simply corresponds to making $h$ different copies of our one attention head, each with its own weights. For $s = 1, \dots, h$ we define

$$Q^{(s)}:= XW_{Q}^{(s)}$$
$$K^{(s)} := XW_{K}^{(s)}$$
$$V^{(s)} := XW_{V}^{(s)}$$
and

$$Z^{(s)} := \text{Softmax}(Q^{(s)}(K^{(s)})^T/\sqrt{d_k})V^{(s)}$$
Then define
$$Z_{wide} = [Z^{(1)}, Z^{(2)}, \dots, Z^{(h)}]$$
to be the concatenation of all the $Z^{(s)}$. To reduce the dimensionality of $Z_{wide}$ we define another trainable weight matrix $W_{O} \in \mathbb{R}^{hd_v \times d}$ and let 
$$Z := Z_{wide}W_{O}$$ 
The rows $z_i$ of $Z$ are exactly the outputs of our self-attention layer! 
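
To make the concatenation and the role of $W_O$ concrete, here is a small plain-Python sketch of multi-headed self-attention. The sizes ($h = 2$, $d = 8$, $d_k = d_v = 4$), the random weights and the helper names are assumptions made for illustration only.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_Q, W_K, W_V):
    """One head: Softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
              for q_row in Q]
    return matmul([softmax(row) for row in scores], V)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    head_outputs = [attention_head(X, *weights) for weights in heads]
    # Row i of Z_wide is the concatenation [z_i^(1), ..., z_i^(h)].
    Z_wide = [[value for Z in head_outputs for value in Z[i]] for i in range(len(X))]
    return matmul(Z_wide, W_O)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, h, d_k, d_v = 3, 8, 2, 4, 4   # toy sizes chosen only for illustration
X = random_matrix(n, d, rng)
heads = [(random_matrix(d, d_k, rng), random_matrix(d, d_k, rng), random_matrix(d, d_v, rng))
         for _ in range(h)]
W_O = random_matrix(h * d_v, d, rng)

Z = multi_head_attention(X, heads, W_O)
print(len(Z), len(Z[0]))  # n rows, each back at the model dimension d
```

Because $W_O$ maps the concatenated width $hd_v$ back to $d$, the output has the same shape as the input, which is what lets us stack blocks later on.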

The idea behind multi-headed attention is that each head can learn to pay attention to different things. One attention head may learn to have words attend to their noun modifiers, while another may have direct objects attend to their verbs. There's been significant empirical evidence that this does in fact happen. We discuss it in Section III.

Note that this operation is easily parallelizable both across different heads and across words. Parallelizing matrix multiplications is exactly what GPUs are good at! In the RNN case we had to wait for the hidden state embedding from the $(i-1)$st word to compute the hidden state embedding for the $i$th word, hence no parallelization. This added efficiency allows us to build much larger and deeper networks and to train them faster. Also note that the model shouldn't have issues with long-term dependencies, as all word pairs have equal capability to attend to one another no matter how far apart they are.

### Final Details

Once we understand the transformer block it's really easy to make an encoder. We simply stack $N$ transformer blocks on top of one another. We use the output of the first transformer block as the input for the second, and so on. The outputs from the final transformer block are the outputs of our encoder.

To round out our discussion, there are a few more important final details. 

1. When working with transformers we're going to want to input minibatches of variable-length sequences. The transformer will be trained to take in sequences of some fixed `MAX_LENGTH`. Longer sequences are often truncated (or some middle part of the sequence may be cut out), but simply padding shorter sequences with 0 vectors isn't enough. Suppose that our sequence has length $\ell$ and we input 0 vectors for $x_i$, $i > \ell$. You can check that every padded position still gets a similarity score of $0$ from each query, and hence a non-zero softmax weight that dilutes the weights on the real words, and that the padded positions themselves have non-zero outputs. Both effects alter the model output downstream. To remedy this we pad the shorter sequences AND use attention masks to make sure the padding tokens aren't attended to.

2. You may have noticed that, currently, the model does not at all take into account the ordering of the input sequence. We could permute the sequence and get the same output by applying the inverse permutation to the output. Since language is heavily dependent on word ordering, this isn't desirable. To remedy this, the authors don't input $x_i$ into the model but rather $x_i + p(i)$, where $p(i)$ is a positional embedding vector that depends on $i$. For the specifics of $p(i)$ one can refer to section 3.5 of the [original paper](https://arxiv.org/pdf/1706.03762.pdf).

3. Lastly, since we are making our model deeper by stacking transformer blocks, the authors add a [residual connection](https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec) that jumps around each layer in the transformer block and [layer normalization](https://arxiv.org/pdf/1607.06450.pdf) after each of these residual connections. These adjustments have been shown to greatly improve the performance of deep models. The end result is the following finalized transformer architecture:

<img src="https://i.imgur.com/kkbCAc6.png" width="500px" height="420px" align="center"/>

Note that positional encodings are only added prior to the first transformer block. 
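
For the curious, the sinusoidal $p(i)$ from section 3.5 of the paper can be sketched in a few lines of plain Python. We assume positions are counted from 0 here, and the function names are our own. Keep in mind that this is the original transformer's fixed encoding; BERT instead learns its position embeddings, which is why the model printout later in this notebook shows a `position_embeddings` table.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position, following section 3.5 of the paper."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
        angle = position / (10000 ** ((k - k % 2) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(X):
    """Return x_i + p(i) for every row x_i of X."""
    d_model = len(X[0])
    return [[x + p for x, p in zip(row, positional_encoding(i, d_model))]
            for i, row in enumerate(X)]

X = [[0.0] * 4 for _ in range(3)]   # three all-zero toy "embeddings"
for row in add_positions(X):
    print([round(v, 3) for v in row])
```

Since the toy inputs are all zeros, the printed rows are the encodings themselves: identical inputs at different positions now look different to the model.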


## II. Making BERT

### BERT Architecture

[BERT](https://arxiv.org/pdf/1810.04805.pdf) is a large, transformer-based language model that has been trained on a very large corpus of text. The original paper trains two models, BERTbase and BERTlarge. Both are made of "multi-layer bidirectional transformers" whose implementation is "almost identical" to that found in the Attention is All You Need paper. Effectively, BERTbase is 12 stacked transformer blocks and BERTlarge is 24 stacked transformer blocks.

BERT operates at a subword level and takes in sequences of learned [WordPiece](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a) embeddings. Individual wordpieces are referred to as tokens. By operating at a subword level, BERT can still handle out-of-vocabulary words, since they are split into pieces it knows. BERT's token vocabulary consists of about 30,000 tokens. BERT is designed so that our input sequence can be either a single "sentence" or a pair of distinct "sentences", where in this case a "sentence" can be thought of loosely as a string of contiguous text. This way BERT easily generalizes to downstream tasks that involve interpreting two distinct blocks of text, such as question answering.

When inputting a token sequence into BERT, the first token is always a special [CLS] token. The output hidden state (the hidden state output of the final transformer block) corresponding to this token is used as the aggregate sequence representation for classification tasks. When inputting "sentence" pairs, the two "sentences" are separated by a [SEP] token. Additionally, along with positional embeddings, learned embeddings $E_A$ and $E_B$ are added to the token embeddings to indicate which "sentence" each token belongs to. The following diagram from BERT's paper displays how to compute the input representations:

<img src="https://i.imgur.com/aMPNkY6.png" width="750px" height="240px" style="display: block; margin: 0 auto;"/>
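
The diagram boils down to three table lookups and a sum. Here is a toy sketch in plain Python; the table sizes, the random values and the token ids are hypothetical stand-ins, not BERT's actual parameters.

```python
import random

rng = random.Random(0)
HIDDEN = 4  # toy hidden size; BERT-base uses 768

def table(num_rows):
    """A randomly initialised embedding table with one row per id."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(num_rows)]

token_table = table(10)     # one row per entry of a toy vocabulary
segment_table = table(2)    # rows play the roles of E_A and E_B
position_table = table(8)   # one row per position, up to a toy maximum length

def input_representation(token_ids, segment_ids):
    """Sum the token, segment and position embeddings for every position."""
    return [
        [t + s + p for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]

token_ids = [1, 5, 6, 2, 7, 2]     # hypothetical ids, think: [CLS] a b [SEP] c [SEP]
segment_ids = [0, 0, 0, 0, 1, 1]
E = input_representation(token_ids, segment_ids)
print(len(E), len(E[0]))  # one vector of size HIDDEN per input token
```

Note that the two [SEP] tokens share a token embedding but still end up with different input vectors, because their segment and position embeddings differ.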


### Training BERT


Training BERT consists of two steps: pre-training and fine-tuning. In pre-training the authors train BERT over a huge corpus consisting of [BookCorpus](https://arxiv.org/pdf/1506.06724.pdf) (800M words) and English Wikipedia (2,500M words). The training involves two tasks:

1. Masked Language Modeling - The first task is language modeling. We want BERT to be able to predict a token from the tokens around it. Sadly, our task can't be as simple as predicting token $i$ from the output hidden states of the tokens around it. These hidden states will have been formed while attending to token $i$'s embedding, making such a task trivial. To address this, the authors randomly select 15% of the tokens. They replace 80% of the selected tokens with special [MASK] tokens, leave 10% of them the same, and replace 10% of them with a random token. Suppose the $i$th token is one of those 15% randomly selected tokens. The model attempts to predict the original $i$th token by putting the $i$th hidden state output through a softmax layer. By masking the original token with a [MASK] token the model is not clued in to the original token for 80% of these predictions, and by leaving the original token or swapping it out for a random one the model is robust to not seeing [MASK] tokens (which will be the case during testing) and can identify when a token has been artificially altered.

2. Next Sentence Prediction - The second task is simpler. During training the model is always given sequences made up of two "sentences". Half the time "sentence" $B$ follows "sentence" $A$ in the corpus. Half the time "sentence" $B$ is selected randomly from the corpus. The output hidden state embedding of the [CLS] token is used to predict if sentence $B$ follows sentence $A$. Training on this task is meant to help the model understand relationships between two contiguous blocks of text.
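
The 80/10/10 corruption scheme of the masked language modeling task can be sketched as follows. This is a simplified illustration rather than the authors' preprocessing code: we select each token independently with probability 0.15 instead of picking a fixed 15% of the sequence, we assume [CLS] and [SEP] are never selected, and the function name `mask_tokens` and the toy vocabulary are our own.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where no prediction is made."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = token                      # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "bank", "river", "money", "vault", "robber"]   # toy vocabulary
tokens = ["[CLS]"] + [rng.choice(vocab) for _ in range(10000)] + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng)

selected = [i for i, label in enumerate(labels) if label is not None]
masked = sum(1 for i in selected if corrupted[i] == MASK)
print("fraction selected:", len(selected) / len(tokens))
print("fraction of selected replaced by [MASK]:", masked / len(selected))
```

Only the selected positions contribute to the loss; everywhere else `labels` is `None` and the input token is left untouched.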

Following the pre-training is the fine tuning process. The output of BERT is akin to the output of an encoder - we expect each hidden state output to be a contextual embedding for each token. Due to BERT's robust pre-training, we expect these contextual representations to carry a useful understanding of language generally and also a more specific understanding of how the token is functioning in its particular context. 

While fine-tuning we feed BERT's output into another model (or simply some additional layers) designed for a specific language task. We can either treat BERT's weights as trainable parameters or we can freeze them. Conceptually, in the first case we are building a huge model and simply initializing most of the weights to known good values. In the second case we are treating BERT purely as a feature extractor. The authors choose to do the former. With just one additional output layer they achieved state of the art on eleven benchmark NLP tasks!

## III. Using BERT

Finally, now that we have a good understanding of how BERT works, let's try it out! First, we need to install some necessary packages. 


In [1]:
# Install the required modules
!pip install numpy
!pip install torch
!pip install transformers


import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if not 'bertviz_repo' in sys.path:
  sys.path += ['bertviz_repo']
!pip install regex

FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo


### Tokenization

Documentation for anything done in this section can be found [here](https://huggingface.co/transformers/main_classes/tokenizer.html#). 

In [2]:
# We are loading a smaller 'bert-base-uncased' model for this notebook. For more information on pre-trained models, check this: https://github.com/google-research/bert#pre-trained-models
import torch
from transformers import BertModel, BertTokenizer

# model -> BertModel
# tokenizer -> BertTokenizer
# model name -> 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


As discussed above, BERT recognizes about 30,000 WordPieces, meaning many words in the English vocabulary are not in BERT's vocabulary. In the example below, we see how to use BERT's tokenizer to split "ephemeral" into tokens that are in BERT's vocabulary. Hash signs preceding these subwords are the tokenizer's way of denoting that this subword or character is part of a larger word and preceded by another subword. 

(For more information about WordPiece, see the [original paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf) and further discussion in Google's [Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf).)



In [3]:
# Using the BERT tokenizer
rare_text = "ephemeral"
print(tokenizer.tokenize(rare_text))

['ep', '##hem', '##eral']


As expected, BERT's vocabulary contains common words.

In [4]:
common_text = "actually"
print(tokenizer.tokenize(common_text))

['actually']
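
To build some intuition for how a word gets split, here is a minimal sketch of greedy longest-match-first splitting, the strategy BERT's WordPiece tokenizer applies to each word. The tiny vocabulary below is a made-up stand-in for BERT's roughly 30,000 entries, and `wordpiece_tokenize` is our own simplified function, not the library's.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                           # try a shorter piece
        if match is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"ep", "##hem", "##eral", "##e", "actually"}   # hypothetical vocabulary
print(wordpiece_tokenize("ephemeral", toy_vocab))   # ['ep', '##hem', '##eral']
print(wordpiece_tokenize("actually", toy_vocab))    # ['actually']
```

At every step the longest piece in the vocabulary wins, which is why `##eral` is chosen over the shorter `##e`.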


In [5]:
# Adding tokens [CLS] and [SEP] at start and end of a sentence
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
marked_text = "[CLS] " + text + " [SEP]"

print ("Before tokenization:\n", marked_text)

# Using BERT word piece tokenizer 
tokenized_text = tokenizer.tokenize(marked_text)
print ("\nAfter tokenization:\n",tokenized_text)

Before tokenization:
 [CLS] After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank. [SEP]

After tokenization:
 ['[CLS]', 'after', 'stealing', 'money', 'from', 'the', 'bank', 'vault', ',', 'the', 'bank', 'robber', 'was', 'seen', 'fishing', 'on', 'the', 'mississippi', 'river', 'bank', '.', '[SEP]']


Each token has a unique id, which we can retrieve as follows. When we want to run a forward pass on a sequence we will input the sequence's token ids into BERT.

In [6]:
# Retrieving token ids from tokens
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)

for tup in zip(tokenized_text, input_ids):
    print(tup)

('[CLS]', 101)
('after', 2044)
('stealing', 11065)
('money', 2769)
('from', 2013)
('the', 1996)
('bank', 2924)
('vault', 11632)
(',', 1010)
('the', 1996)
('bank', 2924)
('robber', 27307)
('was', 2001)
('seen', 2464)
('fishing', 5645)
('on', 2006)
('the', 1996)
('mississippi', 5900)
('river', 2314)
('bank', 2924)
('.', 1012)
('[SEP]', 102)


By using `tokenizer.encode` we can go from text to token ids in one line. The `add_special_tokens` (which defaults to `True`) flag toggles whether or not to add the [CLS] and [SEP] tokens. 

In [7]:
tokenizer.encode(text, add_special_tokens=True)

[101,
 2044,
 11065,
 2769,
 2013,
 1996,
 2924,
 11632,
 1010,
 1996,
 2924,
 27307,
 2001,
 2464,
 5645,
 2006,
 1996,
 5900,
 2314,
 2924,
 1012,
 102]

If we wanted our input to be a pair of sentences (such as in the next sentence prediction task) we can use the optional parameter `text_pair`.

In [8]:
text_pair = "To this day, he is still at large."
tokenizer.encode(text, text_pair=text_pair, add_special_tokens=True)

[101,
 2044,
 11065,
 2769,
 2013,
 1996,
 2924,
 11632,
 1010,
 1996,
 2924,
 27307,
 2001,
 2464,
 5645,
 2006,
 1996,
 5900,
 2314,
 2924,
 1012,
 102,
 2000,
 2023,
 2154,
 1010,
 2002,
 2003,
 2145,
 2012,
 2312,
 1012,
 102]

#### Token Type IDs

As discussed, BERT is trained on and expects "sentence" pairs. Again, we are using the term "sentence" loosely to refer to one contiguous body of text. At times it is useful to input "sentence pairs" rather than individual "sentences" (e.g. for question answering). When inputting a pair of sentences into our model, it is important to input a list of token type ids along with a list of token ids. The token type ids form a binary list which tells the model which sentence each token belongs to.

Here's an example of how we'd want to construct our token type ids. 

- **"<font color='green'>[CLS] The man was accused of robbing a bank. [SEP] The man was seen fishing by the river bank. [SEP]</font>"**
  - tokens in this case would be: ['[CLS]', 'the', 'man', 'was', 'accused', 'of', 'robb', '##ing', 'a', 'bank', '.', '[SEP]', 'the', 'man', 'was', 'seen', 'fishing', 'by', 'the', 'river', 'bank', '.', '[SEP]']
  - token type ids in this case would be: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  
Note that the first [SEP] token is assigned a 0. If we input each sentence individually then we would have:

- **"<font color='green'>[CLS] The man was accused of robbing a bank. [SEP]</font>"**
  - tokens in this case would be: ['[CLS]', 'the', 'man', 'was', 'accused', 'of', 'robb', '##ing', 'a', 'bank', '.', '[SEP]']
  - token type ids in this case would be: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

- **"<font color='green'>[CLS] The man was seen fishing by the river bank. [SEP]</font>"**
  - tokens in this case would be: ['[CLS]', 'the', 'man', 'was', 'seen', 'fishing', 'by', 'the', 'river', 'bank', '.', '[SEP]']
  - token type ids in this case would be: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In the case of inputting single sentences a token type id list is not needed.
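
The rule is simple enough to write down by hand. The helper below is a hypothetical illustration (the name `build_pair_inputs` is ours) that adds the special tokens and assigns 0 to everything up to and including the first [SEP], and 1 to the rest.

```python
def build_pair_inputs(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and return (tokens, token_type_ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)          # first segment, including its [SEP]
    if tokens_b is not None:
        segment_b = tokens_b + ["[SEP]"]
        tokens += segment_b
        token_type_ids += [1] * len(segment_b)  # second segment, including its [SEP]
    return tokens, token_type_ids

tokens_a = ['the', 'man', 'was', 'accused', 'of', 'robb', '##ing', 'a', 'bank', '.']
tokens_b = ['the', 'man', 'was', 'seen', 'fishing', 'by', 'the', 'river', 'bank', '.']

tokens, token_type_ids = build_pair_inputs(tokens_a, tokens_b)
print(tokens)
print(token_type_ids)
```

This reproduces the twelve 0s followed by eleven 1s from the sentence-pair example above.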

To avoid having to find token type ids ourselves, we can use the `encode_plus` function, which returns a dictionary containing our desired `input_ids` and `token_type_ids`. Similarly to the `encode` function we can toggle whether or not to add the [CLS] and [SEP] tokens with the `add_special_tokens` flag (defaults to `True`) and we can encode a pair of sentences using the `text_pair` parameter (defaults to `None`). We provide an example below.

In [9]:
encode_dict = tokenizer.encode_plus(text, text_pair=text_pair)
print(encode_dict['input_ids'])
print(encode_dict['token_type_ids'])

[101, 2044, 11065, 2769, 2013, 1996, 2924, 11632, 1010, 1996, 2924, 27307, 2001, 2464, 5645, 2006, 1996, 5900, 2314, 2924, 1012, 102, 2000, 2023, 2154, 1010, 2002, 2003, 2145, 2012, 2312, 1012, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


### Minibatches and Attention Masks

When working with real data we'll be using minibatches that contain sequences of varying size. Thus, we'll have to tokenize all of our inputs, add padding to shorter sequences, and get attention masks to make sure BERT doesn't attend to padding tokens during the forward pass. Thankfully `batch_encode_plus` does this all for us. As an example, we'll encode a batch of two inputs.

To illustrate the importance of padding/attention masking, we have the tokenizer return Torch tensors instead of lists. We can't do this unless all the inputs in the batch are the same length, so we either have to pad the shorter ones or truncate the longer ones. Note that the inputs to our BERT model have to be Torch tensors.

In [10]:
text1 = "This is the first sentence!"
text2 = "This is the second sentence! But I need it to be longer than the first."

batch_encode_dict = tokenizer.batch_encode_plus([text1, text2],
                                                max_length=None, 
                                                pad_to_max_length=True,
                                                return_tensors='pt',
                                                return_token_type_ids=True,
                                                return_attention_mask=True)  # the keyword is singular

print(batch_encode_dict['input_ids'])
print(batch_encode_dict['attention_mask'])


tensor([[ 101, 2023, 2003, 1996, 2034, 6251,  999,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0],
        [ 101, 2023, 2003, 1996, 2117, 6251,  999, 2021, 1045, 2342, 2009, 2000,
         2022, 2936, 2084, 1996, 2034, 1012,  102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
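
To see what the tokenizer did for us, and how a mask is then used, here is a plain-Python sketch. `pad_batch` mimics the padding step on the token ids printed above, and `masked_softmax` shows one common way an attention mask can be applied: scores at masked positions are set to $-\infty$ before the softmax so they receive zero weight. Both function names are our own and this is not the library's actual implementation.

```python
import math

PAD_ID = 0  # id used for padding in the output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

def masked_softmax(scores, mask):
    """Softmax over scores in which positions with mask == 0 get zero weight."""
    masked = [s if m else -math.inf for s, m in zip(scores, mask)]
    top = max(masked)                      # assumes at least one unmasked position
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

first = [101, 2023, 2003, 1996, 2034, 6251, 999, 102]
second = [101, 2023, 2003, 1996, 2117, 6251, 999, 2021, 1045, 2342, 2009, 2000,
          2022, 2936, 2084, 1996, 2034, 1012, 102]

input_ids, attention_mask = pad_batch([first, second])
print(attention_mask[0])

# Uniform toy scores: all attention weight lands on the 8 real tokens.
weights = masked_softmax([0.5] * len(attention_mask[0]), attention_mask[0])
print(sum(weights[8:]))  # 0.0
```

Without the mask, the eleven padding positions would have soaked up more than half of the attention weight in this toy example.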


### Configuring a Model

Now we're going to try configuring a pre-trained BERT model. When we run a forward pass on a minibatch of inputs, the model will return a tuple of at most size four:

`(last_hidden_state, pooler_output, hidden_states, attentions)`

Whether or not `hidden_states` and `attentions` are included in the output depends on if the `output_hidden_states` and `output_attentions` flags are `True` or not (respectively) when we are configuring the model. The following descriptions come directly from the source code of the model:

1. **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``. 
Sequence of hidden-states at the output of the last layer of the model.
2. **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``. Last layer hidden-state of the first token of the sequence ([CLS] token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during Bert pretraining. This output is usually *not* a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
3. **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``). List of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) of shape ``(batch_size, sequence_length, hidden_size)``. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
4. **attentions**: (`optional`, returned when ``config.output_attentions=True``) list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

With this in mind we can go ahead and configure our model. 

In [11]:
import torch

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)

# Put the model in evaluation mode
model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

Let's run a forward pass on our batch of two sentences. 

In [12]:
# Run a forward pass
# With one sentence inputs there's no real need to pass in token_type_ids, but we do it for the sake of example
with torch.no_grad():
    output = model(batch_encode_dict['input_ids'],
                   attention_mask=batch_encode_dict['attention_mask'],
                   token_type_ids=batch_encode_dict['token_type_ids'])


We confirm that the shapes of our outputs are as expected.

In [13]:
print("The length of each input sequence is: ", len(batch_encode_dict['input_ids'][0]))
print("last_hidden_state is a tensor of shape: ", output[0].shape)
print("pooler_output is a tensor of shape: ", output[1].shape)
print("hidden_states is a list of tensors of length: ", len(output[2]))
print("Each tensor in hidden_states is of shape: ", output[2][0].shape)
print("attentions is a list of tensors of length: ", len(output[3]))
print("Each tensor in attentions is of shape: ", output[3][0].shape)

The length of each input sequence is:  19
last_hidden_state is a tensor of shape:  torch.Size([2, 19, 768])
pooler_output is a tensor of shape:  torch.Size([2, 768])
hidden_states is a list of tensors of length:  13
Each tensor in hidden_states is of shape:  torch.Size([2, 19, 768])
attentions is a list of tensors of length:  12
Each tensor in attentions is of shape:  torch.Size([2, 12, 19, 19])
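
The description of `pooler_output` above suggests that averaging the hidden states is often a better sequence summary. When doing so on a padded batch, the padded positions should be left out of the average. Here is a minimal plain-Python sketch for a single sequence; the small matrix is a toy stand-in for one row of `last_hidden_state`, and `mean_pool` is our own helper name.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of real tokens, skipping padded positions."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Toy stand-in: 4 positions, hidden size 3.
hidden_states = [[1.0, 2.0, 3.0],
                 [3.0, 4.0, 5.0],
                 [9.0, 9.0, 9.0],   # padding position, should be ignored
                 [9.0, 9.0, 9.0]]   # padding position, should be ignored
attention_mask = [1, 1, 0, 0]

print(mean_pool(hidden_states, attention_mask))  # [2.0, 3.0, 4.0]
```

With real model outputs the same idea is usually written with tensor operations: multiply by the mask, sum over the sequence axis, and divide by the number of real tokens.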


### Exploring Contextual Word Vectors

To confirm that the values of these vectors are in fact contextually dependent, let's take a look at the output from the following sentence:

In [14]:
print (text)

After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.


We run a forward pass on the sentence. As each token's contextual embedding we'll use the sum of the hidden state outputs of the last four transformer layers.

In [15]:
encoded_input = tokenizer.encode(text, return_tensors='pt')
with torch.no_grad():
    output = model(encoded_input)

contextual_embeddings = torch.sum(torch.squeeze(torch.stack(output[2][-4:])), 0)

By finding which token id corresponds to bank we can isolate the contextual embeddings for each individual bank token.

In [16]:
print(encoded_input)
print(tokenizer.convert_tokens_to_ids('bank'))

tensor([[  101,  2044, 11065,  2769,  2013,  1996,  2924, 11632,  1010,  1996,
          2924, 27307,  2001,  2464,  5645,  2006,  1996,  5900,  2314,  2924,
          1012,   102]])
2924


In [17]:
from sklearn.metrics.pairwise import cosine_similarity

# Compare "bank" as in "bank robber" to "bank" as in "river bank"
different_bank = cosine_similarity(contextual_embeddings[10].reshape(1,-1), contextual_embeddings[19].reshape(1,-1))[0][0]

# Compare "bank" as in "bank robber" to "bank" as in "bank vault" 
same_bank = cosine_similarity(contextual_embeddings[10].reshape(1,-1), contextual_embeddings[6].reshape(1,-1))[0][0]

In [18]:
print ("Similarity of 'bank' as in 'bank robber' to 'bank' as in 'bank vault':",  same_bank)

Similarity of 'bank' as in 'bank robber' to 'bank' as in 'bank vault': 0.9386392


In [19]:
print ("Similarity of 'bank' as in 'bank robber' to 'bank' as in 'river bank':",  different_bank)

Similarity of 'bank' as in 'bank robber' to 'bank' as in 'river bank': 0.6932361
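
As a reminder of what is being measured, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal plain-Python version for two vectors (our own helper, equivalent in spirit to the scikit-learn call used above) looks like this:

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity_1d([1.0, 2.0], [2.0, 4.0]))   # same direction: close to 1
print(cosine_similarity_1d([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0
```

A value near 1 means the two embeddings point in nearly the same direction, which is what we see for the two financial uses of "bank".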


#### <font color='red'>Try for yourself</font>

Try to find more examples where the similarity of word embeddings for the same word in a similar context (such as 'bank' in 'bank robber' and 'bank' in 'bank vault') is greater than the similarity of word embeddings for the same word in a different context (such as 'bank' in 'bank robber' and 'bank' in 'river bank'). Or look for a word that may have the same meaning in two sentences, but different contextual embeddings in each sentence.

### Exploring Multi-headed Self-Attention

A lot of work has been done discussing the interpretability of multi-headed self-attention. By visualizing how tokens are attending to one another we can get an idea of what different layers and attention heads are trying to learn. The paper [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/pdf/1906.04341.pdf) studies exactly this phenomenon. Here are some interesting figures from the paper (the darkness of the lines indicates the strength of the attention weight):

<img src="https://i.imgur.com/J4vdut5.png" align="center"/>
<img src="https://i.imgur.com/eSq4YWS.png" align="center"/>



#### <font color='red'>Try for yourself</font>

The following [repo](https://github.com/jessevig/bertviz#attention-head-view) allows us to easily visualize attention and poke around on our own. Look into some of the existing literature on interpreting attention and see if you can uncover any patterns yourself. 

In [20]:
from bertviz import head_view

def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [21]:
sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
encoded_dict = tokenizer.encode_plus(sentence_a, 
                                      sentence_b, 
                                      return_tensors='pt', 
                                      add_special_tokens=True)
token_type_ids = encoded_dict['token_type_ids']
input_ids = encoded_dict['input_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()

head_view(attention, tokens)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>