# Masked Language Modeling

In this lab, we give an overview of the **masked language modeling** objective and of the **Transformer**, a popular model architecture used for large-scale masked language modeling.


In [None]:
%pylab inline
import os, sys, glob, json, math
import pandas as pd
from tqdm import tqdm
from pprint import pprint
from collections import defaultdict
import torch
import torch.nn as nn

%load_ext autoreload
%autoreload 2
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated in recent pandas)

Populating the interactive namespace from numpy and matplotlib




## Background

Recently, Devlin et al. published [BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).


**B**idirectional

**E**ncoder

**R**epresentations from

**T**ransformers


#### Goal: 
1. **pre-train** a model that produces language representations. (*this week's lab*)
2. **fine-tune** the model on a task. (*next week's lab*)
    


## Masked Language Model Objective

Randomly mask some of the input tokens, then predict the original vocabulary id of each masked token from the rest of the sequence.

- Given sequence $\mathbf{x} = x_1,\ldots,x_T$.

- Form **mask** $m_1,\ldots,m_T$ where $m_i\in \{0,1\}$.
    - E.g. $m_i=1$ with probability 0.15
    
- Form **masked sequence** $\tilde{x}_1,\ldots,\tilde{x}_T$.
    
    $$\tilde{x}_i = \begin{cases} x_i & m_i=0 \\ \texttt{[MASK]} & m_i=1 \end{cases}$$


$\mathcal{L}_{\text{MLM}}=-\sum_{\underbrace{i | m_i=1}_{\text{MASKED POSITIONS}}}\log p_{\theta}(\underbrace{x_i}_{\text{TRUE TOKEN}}|\underbrace{\tilde{x}_1,\ldots,\tilde{x}_T}_{\text{MASKED SEQUENCE}})$
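
The objective above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the BERT training code: the toy sizes, the random `logits` standing in for a model, and the use of `-100` as the ignored target value (the default `ignore_index` of `F.cross_entropy`) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, mask_id, seq_len = 8, 6, 10   # toy sizes; mask_id plays the role of [MASK]
x = torch.randint(0, 6, (seq_len,))       # original tokens x_1..x_T (ids 0..5)

# m_i = 1 with probability 0.15
m = torch.rand(seq_len) < 0.15
m[0] = True  # force at least one masked position so the toy loss is non-trivial

# masked sequence: x~_i = [MASK] if m_i = 1, else x_i
x_tilde = torch.where(m, torch.full_like(x, mask_id), x)

# stand-in for p_theta(. | x~): random logits of shape [seq_len, vocab_size]
logits = torch.randn(seq_len, vocab_size)

# targets are the true tokens at masked positions and -100 (ignored) elsewhere
targets = torch.where(m, x, torch.full_like(x, -100))
loss = F.cross_entropy(logits, targets, ignore_index=-100, reduction="sum")

# the same quantity written out as in the formula: -sum over masked i of log p(x_i | x~)
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[m].gather(1, x[m].unsqueeze(1)).sum()

print(float(loss), float(manual))
```

Only the masked positions contribute to the loss; unmasked positions are skipped through `ignore_index`.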


<!-- Below, we will discuss the exact form of $\tilde{x}_i$ that the BERT authors used. -->


<!-- #### Diagram of BERT Implementation -->
<!-- ![](bert_overview.png) -->

## Transformers

So far we have modeled a sequence by factorizing the joint distribution into conditionals, and **parameterizing each conditional with a recurrent network**:


#### $$p_{\theta}(x_1,\ldots,x_T)=\prod_{t=1}^T p_{\theta}(x_t | x_{<t})$$
\begin{align}
h_t &= \text{RNNCell}(x_{t}, h_{t-1}) \quad \text{We need $T$ sequential calls to process a sequence!}\\
p_{\theta}(x_t | x_{<t}) &=\text{softmax}\left(Wh_t+b\right),
\end{align}

where $\theta$ are the model parameters (RNN parameters, $W, b$, embedding matrix).


#### Alternative

An alternative proposed in [[Vaswani et al 2017](https://arxiv.org/pdf/1706.03762.pdf)] is to parameterize each conditional with a **particular feed-forward architecture** called the **Transformer**. With this model, it is possible to compute all conditionals with a **single feed-forward pass**:
\begin{align}
(h_1,\ldots,h_T) &= \text{Transformer}(x) \quad \text{No need to call the Transformer $T$ times!} \\
p_{\theta}(x_t | x_{<t}) &= \text{softmax}\left(Wh_t + b\right)
\end{align}

We will briefly discuss the key ideas, the overall Transformer architecture (encoder only), and how it is used in PyTorch.

### High-Level View

We can view the Transformer encoder as mapping a sequence to a sequence of vectors.

![Drawing](https://drive.google.com/uc?export=view&id=1ru2nIDNxZ-A4xEtUAJFs8h_IMiMydjjk)

Let's step through the key ideas of how this mapping is designed, and discuss some of its resulting properties.

### Key Idea 1: Position Embeddings

Unlike RNNs, which can learn positional information via the hidden state over time, the Transformer has no built-in notion of token order (**similar to the bag-of-words model you implemented**).

Thus we encode inputs with **position** as well as **token** embeddings:

![Drawing](https://drive.google.com/uc?export=view&id=1MXvjtVmTqaNImOaQqDBGru-HqXSjV8tm)

Positional embeddings usually have the same dimensionality as token embeddings. Here we learn positional embeddings in the same way as token embeddings, starting from random initialization. However, there are approaches that do not require training, e.g. [sinusoidal positional encodings](https://medium.com/nlp-trend-and-review-en/positional-embeddings-7b168da36605).
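
For comparison with the learned embeddings used in this lab, here is a minimal sketch of the fixed sinusoidal encoding from Vaswani et al. 2017, where even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2i/dim)`. The function name and the plain-Python list representation are choices made for this example.

```python
import math

def sinusoidal_position_encoding(max_len, dim):
    """Return a max_len x dim table with
    PE[pos, 2i] = sin(pos / 10000^(2i/dim)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))."""
    assert dim % 2 == 0, "this sketch assumes an even embedding dimension"
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):  # i is the even dimension index, i.e. 2i in the formula
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_position_encoding(max_len=10, dim=6)
print(pe[0])  # position 0: every sine is 0 and every cosine is 1
```

Such a table could be converted with `torch.tensor(pe)` and added to the token embeddings in place of a learned `nn.Embedding` lookup; nothing in it is trained.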



In [None]:
input_sequence = ['<s>', 'my', 'pet', '[M]', '<s>']

max_len = 10

vocab = {'<s>': 0, 'my': 1, 'pet': 2, 'dog': 3, 'cat': 4, 'lion': 5, '[M]': 6, '<pad>': 7}

pad_idx = 7
dim = 6

token_embed = nn.Embedding(len(vocab), embedding_dim=dim)
position_embed = nn.Embedding(max_len, embedding_dim=dim)

In [None]:
input_vector = torch.tensor([vocab[x] for x in input_sequence]).unsqueeze(1)

input_embeddings = token_embed(input_vector) + position_embed(torch.arange(len(input_vector)).unsqueeze(1))
input_embeddings.size()

torch.Size([5, 1, 6])

**Warning!!** By default, the PyTorch Transformer classes expect input shaped as `Length x Batch x Dim` (recent versions accept `Batch x Length x Dim` only if `batch_first=True` is set).

### Key Idea 2: Modularity
The Transformer (encoder) is composed of a stack of **N identical layers**.

![Drawing](https://drive.google.com/uc?export=view&id=1yO6VcQzhhk0iMK1bUu_ke_orXvFw7y3l)

In [None]:
import torch.nn as nn
nn.TransformerEncoder?

#### The `forward` passes the input through the N layers, then applies a final normalization module if one was given (`norm`, which is `None` by default):

**Warning!!** The forward function accepts input as `Length x Batch x Dim`

In [None]:
nn.TransformerEncoder.forward??

In [None]:
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2, dim_feedforward=64, dropout=0.1)

encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

In [None]:
outputs = encoder(input_embeddings)

print("input size: \t%s" % str(tuple(input_embeddings.shape)))
print("output size:\t%s" % str(tuple(outputs.shape)))
outputs

input size: 	(5, 1, 6)
output size:	(5, 1, 6)


tensor([[[ 1.5190, -0.1116,  0.4945, -1.4319,  0.5633, -1.0334]],

        [[ 0.3832, -1.5494, -0.7756, -0.3796,  1.2404,  1.0809]],

        [[-0.7320,  1.2635,  0.4798, -1.7963,  0.6191,  0.1658]],

        [[-1.7166,  1.3516,  0.4973,  0.6291, -0.7639,  0.0026]],

        [[ 1.3709,  0.3752,  0.6588, -1.5671,  0.1889, -1.0268]]],
       grad_fn=<NativeLayerNormBackward>)

#### Each layer has two parts, **self-attention** and a feed-forward transformation:

![Drawing](https://drive.google.com/uc?export=view&id=1yO6VcQzhhk0iMK1bUu_ke_orXvFw7y3l)

In [None]:
nn.TransformerEncoderLayer??

In [None]:
nn.TransformerEncoderLayer.forward??

### Key Idea 3: Self-Attention

In the RNN, the hidden state contains information about previous tokens.
The Transformer instead performs **attention** over all inputs at a given layer. 'Attention' computes an output vector by taking a weighted sum of input vectors. The weights are 'attention weights'. The Transformer uses **scaled dot-product attention**:
#### $$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, each with one row per position and obtained from the layer input through learned linear projections ($Q, K \in \mathbb{R}^{T \times d_k}$). 'Multi-head attention' refers to applying several of these operations in parallel.
The output vectors of the different heads are then concatenated and mapped back to $d_\text{model}$ dimensions by an additional linear layer.

![Drawing](https://drive.google.com/uc?export=view&id=1U8X3jQko0ydDe2PVHe7cxf4XqhfJ-M6I)

The image above is from the [Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/).

#### *Key Property*: Each output vector of a layer $n$ can use information from **all** inputs to the layer $n$.

Thus each **final output vector** can incorporate information from **all input words**.

(If we want to prevent information flow such as in left-to-right language modeling, we can use masking).
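
To make the formula concrete, here is a minimal single-head sketch of scaled dot-product attention written from scratch, with an optional boolean mask. It leaves out the learned projections and the multi-head split that `nn.MultiheadAttention` performs; the function name and toy sizes are assumptions for illustration.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: [T, d_k]; v: [T, d_v]; mask: optional bool [T, T] where True means 'do not attend'."""
    scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))  # [T, T] similarity of each query to each key
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))     # masked pairs get zero weight after softmax
    weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ v, weights                              # weighted sum of value vectors

torch.manual_seed(0)
T, d_k = 5, 3
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)

# unmasked: every position can use information from all positions
out, weights = scaled_dot_product_attention(q, k, v)

# causal mask: position i may only attend to positions <= i
causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
out_causal, weights_causal = scaled_dot_product_attention(q, k, v, mask=causal)
print(weights_causal)
```

With the causal mask, the first row of the weights puts all mass on the first position and the upper triangle is zero, which is the information-flow restriction mentioned above.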

In [None]:
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2, dropout=0.0)

attn_outputs, attn_weights = attn.forward(query=outputs, key=outputs, value=outputs)

print("input shape: %s" % (str(tuple(outputs.size()))))
print("output shape: %s" % (str(tuple(attn_outputs.size()))))
print(outputs)

print("\nattn weights shape: %s" % (str(tuple(attn_weights.size()))))
print(attn_weights)

input shape: (5, 1, 6)
output shape: (5, 1, 6)
tensor([[[ 1.5190, -0.1116,  0.4945, -1.4319,  0.5633, -1.0334]],

        [[ 0.3832, -1.5494, -0.7756, -0.3796,  1.2404,  1.0809]],

        [[-0.7320,  1.2635,  0.4798, -1.7963,  0.6191,  0.1658]],

        [[-1.7166,  1.3516,  0.4973,  0.6291, -0.7639,  0.0026]],

        [[ 1.3709,  0.3752,  0.6588, -1.5671,  0.1889, -1.0268]]],
       grad_fn=<NativeLayerNormBackward>)

attn weights shape: (1, 5, 5)
tensor([[[0.2475, 0.1575, 0.1896, 0.1661, 0.2392],
         [0.1866, 0.2408, 0.1360, 0.2386, 0.1981],
         [0.2551, 0.1334, 0.2261, 0.1380, 0.2475],
         [0.1880, 0.1689, 0.2610, 0.1980, 0.1841],
         [0.2521, 0.1464, 0.2026, 0.1576, 0.2414]]], grad_fn=<DivBackward0>)


### Attention Masks

PyTorch provides an interface for two kinds of masking:

1. Masking padding tokens (argument `key_padding_mask`): the model should never attend to padding tokens!
2. Masking arbitrary query-key pairs (argument `attn_mask`): this excludes chosen positions from attention, e.g. for left-to-right causal masking.

Check this doc for more details: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html?highlight=multiheadattention#torch.nn.MultiheadAttention
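
As a small illustration of the first kind of mask, the sketch below builds a `key_padding_mask` for a toy batch in which the second sequence ends with two `<pad>` tokens. The batch contents and the freshly initialized embedding and attention modules are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_idx = 7

# two sequences of length 4 in [seq_len, batch] layout; the second one ends with two <pad> tokens
tokens = torch.tensor([[0, 0],
                       [1, 2],
                       [2, pad_idx],
                       [3, pad_idx]])

embed = nn.Embedding(8, 6)
attn = nn.MultiheadAttention(embed_dim=6, num_heads=2, dropout=0.0)

x = embed(tokens)                                        # [seq_len, batch, dim]
key_padding_mask = tokens.eq(pad_idx).transpose(0, 1)    # [batch, seq_len], True = ignore this key

_, weights = attn(x, x, x, key_padding_mask=key_padding_mask)
print(weights[1])  # columns 2 and 3 (the <pad> keys) receive zero attention weight
```

Note that the padding mask is `Batch x Length` even though the inputs are `Length x Batch x Dim`.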


In [None]:
# adding causal mask
mask = (1-torch.tril(torch.ones(outputs.size(0), outputs.size(0)))).bool()
print("Attn mask:\n")
print(mask)
print()

attn_outputs, attn_weights = attn.forward(query=outputs, key=outputs, value=outputs, attn_mask=mask)

print("input shape: %s" % (str(tuple(outputs.size()))))
print("output shape: %s" % (str(tuple(attn_outputs.size()))))
print(outputs)

print("\nattn weights shape: %s" % (str(tuple(attn_weights.size()))))
print(attn_weights)

Attn mask:

tensor([[False,  True,  True,  True,  True],
        [False, False,  True,  True,  True],
        [False, False, False,  True,  True],
        [False, False, False, False,  True],
        [False, False, False, False, False]])

input shape: (5, 1, 6)
output shape: (5, 1, 6)
tensor([[[ 1.5190, -0.1116,  0.4945, -1.4319,  0.5633, -1.0334]],

        [[ 0.3832, -1.5494, -0.7756, -0.3796,  1.2404,  1.0809]],

        [[-0.7320,  1.2635,  0.4798, -1.7963,  0.6191,  0.1658]],

        [[-1.7166,  1.3516,  0.4973,  0.6291, -0.7639,  0.0026]],

        [[ 1.3709,  0.3752,  0.6588, -1.5671,  0.1889, -1.0268]]],
       grad_fn=<NativeLayerNormBackward>)

attn weights shape: (1, 5, 5)
tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4493, 0.5507, 0.0000, 0.0000, 0.0000],
         [0.4086, 0.2093, 0.3821, 0.0000, 0.0000],
         [0.2305, 0.2068, 0.3209, 0.2418, 0.0000],
         [0.2521, 0.1464, 0.2026, 0.1576, 0.2414]]], grad_fn=<DivBackward0>)


#### Summary

In [None]:
class Transformer(nn.Module):
    def __init__(self, vocab_size, max_len, pad_idx, dim=8, num_layers=4, nhead=2):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.position_embed = nn.Embedding(max_len, dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, dim_feedforward=64, dropout=0.0)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # projection transforms hidden representations into logits for each token in the vocabulary
        self.projection = nn.Linear(dim, vocab_size)
        self.pad_idx = pad_idx
    
    def features(self, token_indices):
        pos = torch.arange(len(token_indices), device=token_indices.device).unsqueeze(1) # shape: [seq_len x 1]
        x = self.token_embed(token_indices) + self.position_embed(pos) # shape: [seq_len x batch_size x hidden_size]

        # we use a padding mask to ignore <pad> tokens during processing:
        # it is True at positions corresponding to <pad> tokens and False otherwise
        attn_mask = token_indices.eq(self.pad_idx).transpose(0, 1) # shape: [batch_size x seq_len]

        x = self.encoder(x, src_key_padding_mask=attn_mask) # shape: [seq_len x batch_size x hidden_size]
        return x
    
    def forward(self, token_indices):
        x = self.features(token_indices) # shape: [seq_len x batch_size x hidden_size]
        x = self.projection(x) # shape: [seq_len x batch_size x vocab_size]
        return x

In [None]:
model = Transformer(len(vocab), max_len=100, pad_idx=pad_idx)

model.features(input_vector)

tensor([[[-0.3334,  1.6998,  0.0668, -1.5477, -0.9966, -0.4759,  1.0390,
           0.5480]],

        [[ 0.0634,  1.8216, -0.2919, -1.7535, -0.8763, -0.1555,  0.5098,
           0.6824]],

        [[-1.0470,  1.7331,  0.5718, -0.8491, -0.7864, -0.5778,  1.3273,
          -0.3717]],

        [[-0.2668,  1.2357,  0.1098, -0.8498, -1.1265, -0.7275,  1.9415,
          -0.3163]],

        [[-0.1503,  1.7762, -0.3093, -0.9540, -1.0494, -0.9841,  1.2543,
           0.4166]]], grad_fn=<NativeLayerNormBackward>)

## Back to Masked Language Modeling

Recall the **key property** of Transformers: due to self-attention, each output vector can incorporate information from *all* input tokens.

![Drawing](https://drive.google.com/uc?export=view&id=1RMGwljcEnedShwAjg811rPH50ZBnuFSo)

This is useful for masked language modeling, where we want to use information from the entire context when predicting the masked token(s).

## *BERT*

**B**idirectional

**E**ncoder

**R**epresentations from

**T**ransformers

#### - Masked Language Modeling at scale

#### - Learned representations are useful for downstream tasks

#### Great implementation in [transformers](https://github.com/huggingface/transformers):

    pip install transformers

In [None]:
! pip install transformers


In [None]:
import torch
from transformers import (
    BertForMaskedLM,
    BertTokenizer
)

### Details -- Model Variants

- $\text{BERT}_{\text{BASE}}$: 12 layers, hidden dimension 768, 12 attention heads (**110 million parameters**)
- $\text{BERT}_{\text{LARGE}}$: 24 layers, hidden dimension 1024, 16 attention heads (**340 million parameters**)
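
The parameter counts can be roughly reproduced with some arithmetic. The sketch below counts the weights of a BERT-style encoder under several assumptions that are not stated above: a 30,522-token vocabulary (as in the uncased checkpoints), 512 positions, 2 segment types, a feed-forward size of 4 times the hidden dimension, and a pooler layer on top of `[CLS]`. The MLM output head is not counted.

```python
def bert_encoder_param_count(vocab_size, hidden, num_layers, ffn_dim,
                             max_positions=512, type_vocab=2):
    layer_norm = 2 * hidden                                   # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V and output projections
    feed_forward = ((hidden * ffn_dim + ffn_dim)              # first linear layer
                    + (ffn_dim * hidden + hidden)             # second linear layer
                    + layer_norm)
    pooler = hidden * hidden + hidden                         # dense layer applied to [CLS]
    return embeddings + num_layers * (attention + feed_forward) + pooler

base = bert_encoder_param_count(vocab_size=30522, hidden=768, num_layers=12, ffn_dim=3072)
large = bert_encoder_param_count(vocab_size=30522, hidden=1024, num_layers=24, ffn_dim=4096)

print("BERT-base:  %.1fM" % (base / 1e6))
print("BERT-large: %.1fM" % (large / 1e6))
```

Under these assumptions the totals come out near 109M and 335M, close to the rounded 110M and 340M quoted above. Most of the parameters sit in the stacked layers, with the token embedding matrix as the next largest block.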

In the `transformers` library, each BERT model comes in two variants, `uncased` and `cased`: for the former, the training data was lowercased before tokenization, while for the latter it was kept intact.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
model = BertForMaskedLM.from_pretrained('bert-base-cased', output_attentions=True)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Details -- Input Implementation


- `[CLS]` token: starts each sequence. Used as aggregate sequence representation.
- `[SEP]` token: separates two segments (e.g. two sentences).
- **Segment embedding**: learned embedding for every token indicating whether it belongs
to sentence A or sentence B.
- **Position embedding**: learned.

![Drawing](https://drive.google.com/uc?export=view&id=1UDReqHfugh4Zef3f-IlNOR2BQ4CrZkFx)

**Exercise:** For which downstream tasks would an input made of two segments be useful?

### Tokenization

#### BERT represents text using **subword** tokens (WordPiece) with a 30k token vocabulary.

(more info on subword tokenization [here](https://github.com/google/sentencepiece) and in the papers mentioned there)

<!-- - **Token embedding**: WordPiece embeddings with 30k token vocabulary. -->

In [None]:
tokenizer.tokenize("Pretraining is cool.")

['Pre', '##tra', '##ining', 'is', 'cool', '.']

In [None]:
tokenizer.tokenize("BERT represents text using subwords.")

['B', '##ER', '##T', 'represents', 'text', 'using', 'sub', '##words', '.']
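
The `##` pieces in the outputs above come from splitting each word greedily into the longest matching vocabulary entries. Here is a minimal sketch of that greedy longest-match-first procedure, using a tiny hypothetical vocabulary rather than BERT's real one; the function name and the handling of unknown words are simplifications for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                     # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]                 # no piece matches: the whole word maps to [UNK]
        pieces.append(piece)
        start = end
    return pieces

# hypothetical toy vocabulary, chosen so that the example words can be segmented
toy_vocab = {"Pre", "##tra", "##ining", "sub", "##words", "is", "cool"}

print(wordpiece_tokenize("Pretraining", toy_vocab))  # ['Pre', '##tra', '##ining']
print(wordpiece_tokenize("subwords", toy_vocab))     # ['sub', '##words']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

Because any word can fall back to smaller pieces, a vocabulary of about 30k entries can cover open-ended text with few unknown tokens.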

## Sampling

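How do we **sample** from a masked language model? Recall how we sampled from a left-to-right autoregressive model, one token at a time conditioned on the prefix. That does not carry over directly, because a masked language model is trained to predict tokens from context on both sides rather than to factorize the sequence from left to right.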

This is an active area of research, but we consider a method proposed by [Wang & Cho 2019](https://arxiv.org/pdf/1902.04094.pdf).

#### Core Idea
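Start from a sequence of length $L$ filled with `[MASK]` tokens (after an optional seed prefix), then repeat for $t=1,\ldots,T$:
- Choose a location uniformly at random, $$\ell_t\sim \mathcal{U}(\{1,\ldots,L\}),$$ and replace its token with `[MASK]`
- Forward pass to obtain $$h_1,\ldots, h_L$$
- Sample a word for that location, $$w_t\sim \texttt{softmax(project(}h_{\ell_t})),$$ and write it into the sequence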


(based on the code from [Wang & Cho 2019](https://colab.research.google.com/drive/1MxKZGtQ9SSBjTK5ArsZ5LKhkztzg52RV#scrollTo=8BR0JVmlTvEQ&forceEdit=true&sandboxMode=true))

In [None]:
def tokenize_batch(batch):
    return [tokenizer.convert_tokens_to_ids(sent) for sent in batch]

def untokenize_batch(batch):
    return [tokenizer.convert_ids_to_tokens(sent) for sent in batch]

def detokenize(sent):
    """ Roughly detokenizes (mainly undoes wordpiece) """
    new_sent = []
    for tok in sent:
        # glue a continuation piece onto the previous token (guard against a leading "##" piece)
        if tok.startswith("##") and new_sent:
            new_sent[-1] += tok[2:]
        else:
            new_sent.append(tok)
    return new_sent

In [None]:
def generate_step(out, gen_idx, temperature=None, top_k=0, sample=False, return_list=True):
    """ Sample a word from from out[gen_idx]"""
    # out is a BERT output of shape [batch_size x seq_len x vocab_size]
    # gen_idx is the position for which we want to sample a token

    logits = out[:, gen_idx] # shape: [batch_size x vocab_size]

    # temperature < 1 makes the distribution after softmax more peaked
    # temperature > 1 makes the distribution after softmax closer to uniform
    if temperature is not None:
        logits = logits / temperature

    dist = torch.distributions.categorical.Categorical(logits=logits)
    idx = dist.sample() # shape: [batch_size]; no squeeze, so batch_size=1 still yields a list below
    return idx.tolist() if return_list else idx

def get_init_text(seed_text, max_len, batch_size, tokenizer):
    """ Get initial sentence by padding seed_text with masks to max_len """
    batch = [seed_text + [tokenizer.mask_token] * (max_len-len(seed_text)) + [tokenizer.sep_token] for _ in range(batch_size)]
    return tokenize_batch(batch)

def printer(tokens):
    sent = detokenize(tokens)
    return " ".join(sent)
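
The effect of the `temperature` argument in `generate_step` can be checked on a toy example. The sketch below applies a softmax to the same logits at three temperatures; the helper name and the logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)   # temperature < 1: more peaked
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)    # temperature > 1: closer to uniform

for name, probs in [("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)]:
    print(name, ["%.3f" % p for p in probs])
```

Lowering the temperature moves probability mass toward the highest-scoring token, so samples become less diverse.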

In [None]:
import math
import time

def uniform_sequential_generation(model, tokenizer, seed_text, batch_size=10, max_len=15, temperature=1.0, max_iter=300,
                                  device='cpu', print_every=20, verbose=True, temperature_decay=0.95):
    """ Generate for one uniformly-sampled position at a timestep"""
    seed_len = len(seed_text)
    batch = get_init_text(seed_text, max_len, batch_size, tokenizer)
    
    for iter_num in range(max_iter):
        # New permutation
        if iter_num % (max_len - seed_len) == 0:
            # applying temperature decay: we want more peaky distributions when we are close to the convergence point
            if iter_num > 0:
                temperature = temperature * temperature_decay
            positions = np.random.permutation(max_len - seed_len)
        
        position = positions[iter_num % len(positions)] # e.g., for permutation (3, 1, 2):  3, 1, 2, 3, 1, 2, ....
        # mask out the token for current position
        for batch_idx in range(batch_size):
            batch[batch_idx][seed_len + position] = tokenizer.mask_token_id
        
        # predict the distribution for the masked token
        inp = torch.tensor(batch, device=device)
        with torch.no_grad():  # inference only: avoid building the autograd graph
            out = model(inp)[0] # the first element of the BERT output holds the logits

        # sample from the predicted distribution
        idxs = generate_step(out, gen_idx=seed_len + position, temperature=temperature)
        
        # update the selected position with the sampled token
        for batch_idx in range(batch_size):
            batch[batch_idx][seed_len + position] = idxs[batch_idx]
            
        if iter_num == 0 or (verbose and np.mod(iter_num + 1, print_every) == 0):
            for_print = tokenizer.convert_ids_to_tokens(batch[0])
            print("iter %d  \t(temp %.2f)\t%s" % (iter_num + 1, temperature, printer(for_print)))
            
    return untokenize_batch(batch)

In [None]:
def generate(model, tokenizer, n_samples, seed_text="[CLS]", batch_size=10, max_len=25, 
             temperature=1.0, max_iter=500,
             device='cpu', print_every=20, verbose=True):
    sentences = []
    n_batches = math.ceil(n_samples / batch_size)
    start_time = time.time()
    for batch_n in range(n_batches):
        batch = uniform_sequential_generation(model, tokenizer, seed_text, batch_size=batch_size, max_len=max_len,
                                              temperature=temperature, max_iter=max_iter, 
                                              device=device, verbose=verbose)
        if (batch_n + 1) % print_every == 0:
            print("Finished batch %d in %.3fs" % (batch_n + 1, time.time() - start_time))
            start_time = time.time()
        
        sentences += batch
    return sentences

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)

n_samples = 5
batch_size = 5
max_len = 20
temperature = 0.7
max_iter = 300

# Choose the prefix context
seed_text = "[CLS]".split()
seed_text = "[CLS] My favorite class is".split()
bert_sents = generate(model, tokenizer, n_samples, seed_text=seed_text, batch_size=batch_size, max_len=max_len,
                      temperature=temperature, max_iter=max_iter,
                      device=device)

iter 1  	(temp 0.70)	[CLS] My favorite class is [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] t [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [SEP]
iter 20  	(temp 0.66)	[CLS] My favorite class is in the field . I don don ' t need it out for them . [SEP]
iter 40  	(temp 0.63)	[CLS] My favorite class is in the morning but I just don ' t like it because of it . [SEP]
iter 60  	(temp 0.60)	[CLS] My favorite class is in the library and I really don ' t want any part of it . [SEP]
iter 80  	(temp 0.54)	[CLS] My favorite class is in the library . I really don ' t want any part of it . [SEP]
iter 100  	(temp 0.51)	[CLS] My favorite class is in the library but I really don ' t want any part of it . [SEP]



In [None]:
for i, sent in enumerate(bert_sents):
    print("Sample %d: \t %s\n" % (i, printer(sent)))

Sample 0: 	 [CLS] My favorite class is Yoga , known locally as TAPA or Skully - Skully . [SEP]

Sample 1: 	 [CLS] My favorite class is : English , maths , the fine arts , and the performing arts . [SEP]

Sample 2: 	 [CLS] My favorite class is the most popular one I have . My classes were taught in high school . [SEP]

Sample 3: 	 [CLS] My favorite class is a crappy ice skating and rock climbing class for kids during the summer . [SEP]

Sample 4: 	 [CLS] My favorite class is World History . Would that be my favorite class ? No . Not really . [SEP]



### Examining Learned Conditionals (& Representations)

**Probing tasks** can be used to examine aspects of what the model has learned. 

Following [Petroni et al 2019](https://arxiv.org/pdf/1909.01066.pdf) we probe for '**knowledge**' that the model has learned by querying for masked out objects, e.g.:

![Drawing](https://drive.google.com/uc?export=view&id=1bKbTKxu8bp_mZRclpFuV8-ALhAV1Dhus)

(image from [Petroni et al 2019])

The task also illustrates some aspects of the **conditional distributions** and **contextualized representations** that the model has learned.


#### Probing Task

We use a dataset from [Petroni et al 2019](https://github.com/facebookresearch/LAMA).

In [None]:
! pip install jsonlines



In [None]:
import os
import subprocess
import jsonlines

def load_lama_squad():
    filename = os.path.join('data', 'Squad', 'test.jsonl')
    if not os.path.exists(filename):
        url = "https://dl.fbaipublicfiles.com/LAMA/data.zip"
        args = ['wget', url]
        subprocess.call(args)
        args = ['unzip', 'data.zip']
        subprocess.call(args)

    # read all examples and close the file afterwards
    with jsonlines.open(filename) as reader:
        data = list(reader)
    return data

In [None]:
data = load_lama_squad()
data[0]

{'id': '56be4db0acb8001400a502f0_0',
 'masked_sentences': ['To emphasize the 50th anniversary of the Super Bowl the [MASK] color was used.'],
 'obj_label': 'gold',
 'sub_label': 'Squad'}

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
results = []

model.eval()
for example in tqdm(data, total=len(data)):
    sentence, label = example['masked_sentences'][0], example['obj_label']
    inp = torch.tensor([tokenizer.encode(sentence)],
                      device=device)
    
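    # boolean mask marking the [MASK] position in the input
    mask = (inp == tokenizer.mask_token_id)

    with torch.no_grad():  # inference only: do not keep the autograd graph for every example
        dict_out = model(inp)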
    out, attn = dict_out['logits'], dict_out['attentions']
    
    probs, ids = out[mask].softmax(1).topk(10)
    probs = probs[0].tolist()
    ids = ids[0].tolist()

    tokens = [tokenizer.ids_to_tokens[i] for i in ids]

    results.append({
        'inp': inp,
        'sentence': sentence,
        'label': label,
        'top_tokens': tokens,
        'top_probs': probs,
        'correct@1': tokens[0] == label,
        'attn': attn
    })

print("correct@1: %.3f" % (len([r for r in results if r['correct@1']])/len(results)))

100%|██████████| 305/305 [00:38<00:00,  7.97it/s]

correct@1: 0.128
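
Top-1 accuracy is a strict measure, and each result already stores the ten highest-scoring tokens. A minimal sketch of a precision-at-k variant is shown below; it runs on made-up toy entries that only mimic the `label` and `top_tokens` fields of `results`, so the printed numbers say nothing about BERT.

```python
def precision_at_k(results, k):
    """Fraction of examples whose label appears among the k highest-scoring tokens."""
    hits = sum(1 for r in results if r["label"] in r["top_tokens"][:k])
    return hits / len(results)

# hypothetical entries with the same fields as the dictionaries stored in `results`
toy_results = [
    {"label": "gold", "top_tokens": ["red", "gold", "blue"]},
    {"label": "Paris", "top_tokens": ["Paris", "London", "Rome"]},
    {"label": "cat", "top_tokens": ["dog", "lion", "pet"]},
]

for k in (1, 2, 3):
    print("correct@%d: %.3f" % (k, precision_at_k(toy_results, k)))
```

Calling `precision_at_k(results, 10)` on the real results would give the fraction of examples where the label is anywhere in the stored top ten.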





In [None]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

correct = [r for r in results if r['correct@1']]
wrong = [r for r in results if not r['correct@1']]

def show(idx=0, attn_layer=0, is_correct=True):
    result = correct[idx] if is_correct else wrong[idx]

    top_str = '\n\t'.join([('\t%s\t(%.4f)' % (t, p)) for t, p in zip(result['top_tokens'], result['top_probs'])])
    print("""%s
    \tlabel:\t%s

    \ttop:\n%s
    """ % (result['sentence'], result['label'], top_str))

    print("Attention weights (12 heads) from layer %d:" % attn_layer)
    # --- visualize attention
    fig, axs = plt.subplots(3, 4, figsize=(18, 12))

    toks = ['[CLS]'] + tokenizer.tokenize(result['sentence']) + ['[SEP]']
    for i, ax in enumerate(axs.reshape(-1)):
        ax.matshow(result['attn'][attn_layer][0][i].data.cpu().numpy(), cmap='gray')

        ax.set_xticks(range(len(toks)))
        ax.set_xticklabels(toks, rotation=90, fontsize=13)
        ax.set_yticks(range(len(toks)))
        ax.set_yticklabels(toks, fontsize=13)
        
    plt.tight_layout()
    
interactive(show, idx=(0, min(len(correct), len(wrong))-1), attn_layer=range(12), is_correct=True)
