# Masked Language Model Test

This notebook tests whether a BERT model can predict a masked Mongolian word, using the cased BERT-Base model. For other available Mongolian BERT models, see [tugstugi/mongolian-bert](https://github.com/tugstugi/mongolian-bert).


Download the model, install the required dependencies, and initialize the model:

In [1]:
import os
from os.path import exists, join, basename, splitext
import sys

def download_from_google_drive(file_id, file_name):
  # download a file from the Google Drive link
  !rm -f ./cookie
  !curl -c ./cookie -s -L "https://drive.google.com/uc?export=download&id={file_id}" > /dev/null
  confirm_text = !awk '/download/ {print $NF}' ./cookie
  confirm_text = confirm_text[0]
  !curl -Lb ./cookie "https://drive.google.com/uc?export=download&confirm={confirm_text}&id={file_id}" -o {file_name}
  
# download a pre-trained model
model_path = 'cased_bert_base_pytorch'
if not exists(model_path):
  download_from_google_drive('11Adpo6DorPgpE8z1lL6rvZAMHLEfnJwv', '%s.zip' % model_path)
  !unzip {model_path}.zip
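# make the bundled tokenization_sentencepiece module importable,
# even when the model directory already exists from an earlier run
sys.path.append(model_path)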
  
# only sentencepiece and pytorch-pretrained-BERT need to be installed; everything else is included in the downloaded model
!pip install -q pytorch-pretrained-BERT sentencepiece

# import needed modules
import torch
from tokenization_sentencepiece import FullTokenizer
import pytorch_pretrained_bert
from pytorch_pretrained_bert import BertModel, BertForMaskedLM, BertForNextSentencePrediction

# Load pre-trained model tokenizer
tokenizer = FullTokenizer(model_file=join(model_path, 'mn_cased.model'), vocab_file=join(model_path, 'mn_cased.vocab'), do_lower_case=False)
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained(model_path)
model = model.eval()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0   4674      0 --:--:-- --:--:-- --:--:--  4674
100  394M    0  394M    0     0  49.8M      0 --:--:--  0:00:07 --:--:-- 65.6M
Archive:  cased_bert_base_pytorch.zip
   creating: cased_bert_base_pytorch/
  inflating: cased_bert_base_pytorch/eval_results.txt  
  inflating: cased_bert_base_pytorch/mn_cased.model  
  inflating: cased_bert_base_pytorch/mn_cased.vocab  
  inflating: cased_bert_base_pytorch/tokenization_sentencepiece.py  
  inflating: cased_bert_base_pytorch/bert_config.json  
  inflating: cased_bert_base_pytorch/pytorch_model.bin  
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Loaded a trained SentencePiece model.


We will mask one word in the following Mongolian sentence and try to predict it:

In [0]:
TEXT = 'Орчин үеийн стандартын усан спортын бассейныг ирэх онд ашиглалтад оруулна.'

Tokenize the above text:

In [3]:
tokenized_text = tokenizer.tokenize(TEXT)
" ".join(tokenized_text)

'▁Орчин ▁үеийн ▁стандартын ▁усан ▁спортын ▁бассейн ыг ▁ирэх ▁онд ▁ашиглалтад ▁оруулна .'
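
In the output above, the `▁` prefix marks a piece that starts a new word, while a piece without it, such as `ыг`, continues the previous word. As a minimal sketch in plain Python, the pieces can be joined back into the original sentence. The helper name `pieces_to_text` is hypothetical and is not part of the downloaded tokenizer module.

```python
# Hypothetical helper for illustration; not part of tokenization_sentencepiece.
WORD_START = '▁'  # the marker SentencePiece puts in front of word-initial pieces

def pieces_to_text(pieces):
    """Join SentencePiece pieces back into plain text."""
    # Concatenate the pieces, then turn every word-start marker into a space.
    return ''.join(pieces).replace(WORD_START, ' ').strip()

# The pieces printed by the tokenizer cell above.
pieces = ['▁Орчин', '▁үеийн', '▁стандартын', '▁усан', '▁спортын', '▁бассейн', 'ыг',
          '▁ирэх', '▁онд', '▁ашиглалтад', '▁оруулна', '.']
restored = pieces_to_text(pieces)
print(restored)
```

This round trip is why the word `бассейныг` shows up as two pieces: the tokenizer splits it into a stem and a suffix, and only the stem carries the marker.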

Mask the token `▁ашиглалтад`:

In [4]:
masked_index = tokenized_text.index('▁ашиглалтад')
tokenized_text[masked_index] = '[MASK]'
" ".join(tokenized_text)

'▁Орчин ▁үеийн ▁стандартын ▁усан ▁спортын ▁бассейн ыг ▁ирэх ▁онд [MASK] ▁оруулна .'
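
The `list.index` lookup above works because `▁ашиглалтад` is a single piece. For a word that the tokenizer splits into several pieces, such as `бассейныг`, every piece would need its own `[MASK]`. The following is a rough sketch under that assumption. The helper `mask_word` is hypothetical, and it does not handle a word whose last piece is followed by punctuation without a marker (for example `оруулна` before the final `.`).

```python
MASK = '[MASK]'
WORD_START = '▁'  # SentencePiece word-start marker

def mask_word(pieces, word):
    """Return (masked copy of pieces, masked positions) for the first match of `word`."""
    target = WORD_START + word
    for start in range(len(pieces)):
        if not pieces[start].startswith(WORD_START):
            continue
        # Extend the span up to, but not including, the next word-initial piece.
        end = start + 1
        while end < len(pieces) and not pieces[end].startswith(WORD_START):
            end += 1
        if ''.join(pieces[start:end]) == target:
            masked = list(pieces)  # copy, so the input list stays unchanged
            positions = list(range(start, end))
            for i in positions:
                masked[i] = MASK
            return masked, positions
    raise ValueError('word not found: %r' % word)

pieces = ['▁Орчин', '▁үеийн', '▁стандартын', '▁усан', '▁спортын', '▁бассейн', 'ыг',
          '▁ирэх', '▁онд', '▁ашиглалтад', '▁оруулна', '.']
masked_single, positions_single = mask_word(pieces, 'ашиглалтад')
masked_multi, positions_multi = mask_word(pieces, 'бассейныг')
print(' '.join(masked_single))
print(' '.join(masked_multi))
```

Returning the positions alongside the masked list keeps the indices available for reading the prediction scores later.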

Now predict the masked token:

In [5]:
# token ids and segment ids (a single sentence, so every segment id is 0)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0] * len(indexed_tokens)
assert len(segments_ids) == len(tokenized_text)

# Predict all tokens
with torch.no_grad():
    predictions = model(torch.tensor([indexed_tokens]), torch.tensor([segments_ids]))

# take the highest-scoring token at the masked position and check that it matches the original word
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
predicted_token

'▁ашиглалтад'
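
The cell above keeps only the highest-scoring token. To see how the alternatives compare, the scores at the masked position can be turned into probabilities with a softmax and ranked. Below is a minimal sketch in plain Python, assuming the scores were first converted to a list (for example with `predictions[0, masked_index].tolist()`). The helper name `top_k` and the toy scores are hypothetical and are not model output.

```python
import math

def top_k(scores, k=3):
    """Return the k best (index, probability) pairs from a list of raw scores."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank the indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Toy scores for a five-entry vocabulary (made up for illustration, not model output).
toy_scores = [0.5, 2.0, -1.0, 4.0, 1.0]
best = top_k(toy_scores, k=2)
print(best)
```

With real scores, the returned indices could be passed to `tokenizer.convert_ids_to_tokens`, as in the prediction cell, to list the candidate tokens.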