<a href="https://colab.research.google.com/github/sujitpal/nlp-deeplearning-ai-examples/blob/master/blog_tds_fd905cb22df7_bert_mlm_wsd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Impl: Language Models as Knowledge Bases?

An implementation of [Language Models as Knowledge Bases?](https://arxiv.org/pdf/1909.01066.pdf) (Petroni et al., 2019) using pre-trained models from the HuggingFace `transformers` library.

The method identifies the subject, predicate, and object in simple (cloze-style) sentences, masks out the predicate, and has the masked language model predict the missing word. This yields synonyms of the predicate.

We do not go that far here. Instead, we take one of the inputs provided in the GitHub repository referenced by the paper and obtain predictions from a masked language model based on `bert-base-cased`.

In [1]:
!pip install transformers


## Model and Tokenizer

The task is to predict masked words using BERT, so we use the [BertForMaskedLM](https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm) model and the [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer), both loaded from the pre-trained `bert-base-cased` checkpoint.

In [2]:
import json
import pandas as pd
import torch

from transformers import BertTokenizer, BertForMaskedLM

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForMaskedLM.from_pretrained('bert-base-cased', return_dict=True)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Functions

We are going to use the pre-trained BERT language model in inference mode only.

The tokenizer splits the input sequence into WordPiece tokens and wraps it in the special `[CLS]` and `[SEP]` tokens.

The model returns a `MaskedLMOutput`. Its `logits` field has shape (1, `number_of_tokens`, `vocab_size`), where the leading 1 is the batch dimension for our single input sentence. A `loss` field is also available, but it is only populated when `labels` are passed to the model, which we do not do here.

We will identify the logits corresponding to the position of our masked token, identify the top 5 vocabulary words predicted for that position, and return the softmax probabilities for each of the top 5 predicted words.
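To make the softmax and top-k step concrete before running it on the real model, here is a minimal sketch in plain Python. The four-word vocabulary and the logit values are made up for illustration, and the helper names `softmax` and `top_k` are not part of the notebook; the notebook itself uses `torch.nn.functional.softmax` and `torch.topk` on the full vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, vocab, k):
    """Return the k (token, percent) pairs with the highest probability."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [(tok, 100 * p) for tok, p in ranked[:k]]


# Hypothetical 4-word vocabulary and logits for the masked position.
toy_vocab = ["Paris", "Lyon", "bank", "river"]
toy_logits = [2.0, 1.0, 0.0, -1.0]

toy_probs = softmax(toy_logits)
toy_top2 = top_k(toy_probs, toy_vocab, 2)
print(toy_top2)
```

The percentages are taken over the whole vocabulary, so the top-k values reported later do not need to sum to 100.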


In [4]:
model = model.eval()

In [5]:
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)

In [6]:
tokenizer.convert_ids_to_tokens(inputs.input_ids[0])

['[CLS]', 'The', 'capital', 'of', 'France', 'is', '[MASK]', '.', '[SEP]']

In [7]:
outputs

MaskedLMOutput([('logits',
                 tensor([[[ -7.1545,  -6.9931,  -7.1826,  ...,  -5.9124,  -5.6733,  -5.9854],
                          [ -8.0190,  -8.1319,  -8.0509,  ...,  -6.5679,  -6.4058,  -6.8998],
                          [ -4.9772,  -6.1781,  -6.0669,  ...,  -5.6362,  -4.6603,  -5.1241],
                          ...,
                          [ -3.4420,  -3.2557,  -3.5733,  ...,  -2.4606,  -2.6495,  -3.1952],
                          [-10.5890, -10.4621, -11.7181,  ...,  -7.4646,  -9.9543,  -8.3927],
                          [-14.8900, -14.8873, -14.4569,  ..., -11.6588, -13.0151, -11.6073]]],
                        grad_fn=<AddBackward0>))])

In [8]:
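def get_mask_index(input_ids, tokenizer):
  # Work on the first (and only) sequence in the batch.
  x = input_ids[0]
  # Positions whose token id equals the [MASK] id.
  mask_idx = torch.nonzero(x == tokenizer.mask_token_id)
  # .item() requires exactly one [MASK] in the sentence.
  return mask_idx.item()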


mask_idx = get_mask_index(inputs.input_ids, tokenizer)
mask_idx

6
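`get_mask_index` assumes the sentence contains exactly one `[MASK]`, because `.item()` fails on anything other than a single value. As a hedged illustration of how that assumption could be made explicit, the sketch below works on a plain list of token ids. The helper names and the id values are placeholders chosen for this example, not values read from the tokenizer.

```python
def find_mask_positions(token_ids, mask_token_id):
    """Return the indices of every occurrence of mask_token_id."""
    return [i for i, tok in enumerate(token_ids) if tok == mask_token_id]


def single_mask_position(token_ids, mask_token_id):
    """Return the index of the only mask, or raise if there is not exactly one."""
    positions = find_mask_positions(token_ids, mask_token_id)
    if len(positions) != 1:
        raise ValueError(f"expected exactly one mask, found {len(positions)}")
    return positions[0]


# Placeholder ids for illustration only (not read from a tokenizer).
MASK_ID = 103
one_mask = [101, 11, 12, 13, 14, 15, MASK_ID, 16, 102]
two_masks = [101, MASK_ID, 12, MASK_ID, 102]

print(single_mask_position(one_mask, MASK_ID))
print(find_mask_positions(two_masks, MASK_ID))
```

With a real tokenizer, the same check could be applied to `inputs.input_ids[0].tolist()` and `tokenizer.mask_token_id`.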

In [9]:
def get_top_k_predictions(pred_logits, mask_idx, top_k):
  # Softmax over the vocabulary at the masked position only.
  probs = torch.nn.functional.softmax(pred_logits[0, mask_idx, :], dim=-1)
  top_k_weights, top_k_indices = torch.topk(probs, top_k, sorted=True)
  # Report probabilities as percentages.
  top_k_pct_weights = [100 * x.item() for x in top_k_weights]
  # Note: relies on the notebook-level `tokenizer` defined above.
  top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices.tolist())
  return list(zip(top_k_tokens, top_k_pct_weights))


get_top_k_predictions(outputs.logits, mask_idx, 5)

[('Paris', 44.46824491024017),
 ('Lyon', 9.396002441644669),
 ('Toulouse', 8.234526962041855),
 ('Lille', 7.515138387680054),
 ('Marseille', 5.692282319068909)]

## Test Sentences

We take a pair of sentences that use the word `bank` in two different senses, replace `bank` with `[MASK]` in each, and extract the top 20 predictions from the pre-trained BERT MLM.

As expected, the first set of predictions is dominated by `bank` (about 70%), followed by other places one might go to run an errand, whereas the second set predominantly points to geographical features around bodies of water.

In [10]:
sentences = [
  "Go to the [MASK] and deposit your pay check.",
  "Jim and Janet went down to the river [MASK] to admire the swans."
]

In [11]:
def get_predictions(sentence, tokenizer, model):
  inputs = tokenizer(sentence, return_tensors="pt")
  outputs = model(**inputs)
  mask_idx = get_mask_index(inputs.input_ids, tokenizer)
  top_preds = get_top_k_predictions(outputs.logits, mask_idx, 20)
  return top_preds


get_predictions(sentences[0], tokenizer, model)

[('bank', 70.31388282775879),
 ('office', 10.280602425336838),
 ('register', 1.7452064901590347),
 ('store', 1.6284791752696037),
 ('bathroom', 0.9394806809723377),
 ('library', 0.8934859186410904),
 ('desk', 0.8724371902644634),
 ('counter', 0.7977360859513283),
 ('hotel', 0.5163747351616621),
 ('lobby', 0.4956984892487526),
 ('kitchen', 0.3637084737420082),
 ('garage', 0.34799331333488226),
 ('door', 0.3412738908082247),
 ('car', 0.33113823737949133),
 ('house', 0.2649057889357209),
 ('airport', 0.2547034528106451),
 ('elevator', 0.24911428336054087),
 ('back', 0.24807683657854795),
 ('computer', 0.24019612465053797),
 ('banks', 0.23491475731134415)]

In [12]:
get_predictions(sentences[1], tokenizer, model)

[('##bank', 32.602083683013916),
 ('below', 13.032011687755585),
 ('bank', 11.94089725613594),
 (',', 5.626501142978668),
 ('##boat', 3.163905441761017),
 ('##front', 2.7332114055752754),
 ('basin', 1.6210488975048065),
 ('##bed', 1.2178435921669006),
 ('together', 1.1841697618365288),
 ('bed', 0.9657190181314945),
 ('again', 0.8369853720068932),
 ('deck', 0.8356167934834957),
 ('valley', 0.7271417416632175),
 ('mouth', 0.7227516267448664),
 ('boat', 0.7151090539991856),
 ('pier', 0.6493301596492529),
 ('house', 0.6301595363765955),
 ('banks', 0.5700606852769852),
 ('pool', 0.5345713812857866),
 ('Thames', 0.49955612048506737)]
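Several predictions for the second sentence start with `##`. In BERT's WordPiece vocabulary that prefix marks a piece that continues the preceding word, so `##bank` after `river` reads as `riverbank`. A small sketch of how the predictions could be turned back into readable text follows; `join_wordpiece` is a hypothetical helper written for this illustration, not a `transformers` function.

```python
def join_wordpiece(previous_word, predicted_token):
    """Attach a predicted WordPiece to the word that precedes the mask.

    Tokens starting with '##' continue the previous word; anything else
    is treated as a separate word.
    """
    if predicted_token.startswith("##"):
        return previous_word + predicted_token[2:]
    return previous_word + " " + predicted_token


# A few tokens copied from the second prediction list above.
predicted = ["##bank", "below", "bank", "##front", "basin"]
readable = [join_wordpiece("river", tok) for tok in predicted]
print(readable)
# ['riverbank', 'river below', 'river bank', 'riverfront', 'river basin']
```

Read this way, `##bank` and `bank` are two spellings of the same sense, so their probabilities could reasonably be considered together when judging how strongly the model prefers the river-side reading.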