# Data Generation with BERT Masked Language Modeling

We use BERT's masked language modeling (MLM) capability to generate adjectival predication data. Starting from sentences in which a verbal predication targets a certain semantic type, we insert an adjective before the noun. We expect the inserted adjective to target the same semantic type.

In [1]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
import spacy

In [7]:
model_name = "bert-large-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
nlp = spacy.load("en_core_web_trf")

Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


To insert an adjective, we first place a mask token immediately before the target noun, in the position the adjective will occupy.

In [8]:
def mask_sentence(sentence, word, mask):
    # Find the first occurrence of the target word and put the mask right before it
    mask_start = sentence.index(word)
    masked_sentence = sentence[:mask_start] + f"{mask} " + sentence[mask_start:]
    return masked_sentence

mask = "[MASK]"
sentence = "I ate the soup."  # verbal predication targeting the Food type
word = "soup"
masked_sentence = mask_sentence(sentence, word, mask)
masked_sentence

'I ate the [MASK] soup.'
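
Note that `str.index` matches the first occurrence of `word` as a substring, so a target such as "soup" would also match inside "soupy". A minimal sketch of a word-boundary variant follows. The name `mask_before_word` is hypothetical and not part of the original notebook, and the sketch assumes the target is a single word that should be matched as a whole.

```python
import re


def mask_before_word(sentence, word, mask="[MASK]"):
    """Insert `mask` before the first whole-word occurrence of `word`.

    Hypothetical helper: a word-boundary variant of mask_sentence.
    """
    # \b anchors keep "soup" from matching inside "soupy"
    match = re.search(rf"\b{re.escape(word)}\b", sentence)
    if match is None:
        raise ValueError(f"{word!r} not found as a whole word in {sentence!r}")
    start = match.start()
    return sentence[:start] + f"{mask} " + sentence[start:]


print(mask_before_word("I ate the soup.", "soup"))
print(mask_before_word("The soupy soup was cold.", "soup"))
```

For the example sentence used here, both versions produce the same masked sentence. The difference only shows up when the target string also occurs inside an earlier word.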

We then let the model predict words for the masked position. By default, the `fill-mask` pipeline returns its five highest-scoring candidates.

In [9]:
sentence_alternatives = unmasker(masked_sentence)
sentence_alternatives

[{'score': 0.16507339477539062,
  'token': 17690,
  'token_str': 'vegetable',
  'sequence': 'I ate the vegetable soup.'},
 {'score': 0.1070723757147789,
  'token': 2504,
  'token_str': 'cold',
  'sequence': 'I ate the cold soup.'},
 {'score': 0.10284987837076187,
  'token': 9323,
  'token_str': 'chicken',
  'sequence': 'I ate the chicken soup.'},
 {'score': 0.05587597191333771,
  'token': 2633,
  'token_str': 'hot',
  'sequence': 'I ate the hot soup.'},
 {'score': 0.042337991297245026,
  'token': 26422,
  'token_str': 'tomato',
  'sequence': 'I ate the tomato soup.'}]

We then parse each completed sentence with spaCy and keep it only if the predicted word is attached to the target noun as an adjectival modifier (`amod`). This discards noun modifiers such as "vegetable" or "chicken". Prediction quality can additionally be controlled by thresholding the confidence score (`score`) of each prediction.

In [17]:
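for alternative in sentence_alternatives:
    doc = nlp(alternative["sequence"])
    for token in doc:
        # Locate the target noun in the parsed sentence
        if token.text == word:
            for child in token.children:
                # Keep the sentence only if the predicted word modifies the noun as an adjective
                if child.dep_ == "amod" and child.text == alternative["token_str"]:
                    print(alternative["sequence"])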

I ate the cold soup.
I ate the hot soup.
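
The confidence check mentioned above can be sketched as a simple threshold on the `score` field. This is an illustration, not part of the original pipeline: `filter_by_score` and the threshold of 0.05 are hypothetical choices. The sample list reproduces the predictions shown earlier, with rounded scores, so that the block runs without loading the model.

```python
def filter_by_score(alternatives, min_score=0.05):
    """Keep only predictions whose score is at least `min_score`.

    Hypothetical helper; the default threshold is an arbitrary illustration.
    """
    return [alt for alt in alternatives if alt["score"] >= min_score]


# Predictions copied from the fill-mask output above (scores rounded)
sample_alternatives = [
    {"score": 0.1651, "token_str": "vegetable", "sequence": "I ate the vegetable soup."},
    {"score": 0.1071, "token_str": "cold", "sequence": "I ate the cold soup."},
    {"score": 0.1028, "token_str": "chicken", "sequence": "I ate the chicken soup."},
    {"score": 0.0559, "token_str": "hot", "sequence": "I ate the hot soup."},
    {"score": 0.0423, "token_str": "tomato", "sequence": "I ate the tomato soup."},
]

kept = filter_by_score(sample_alternatives)
for alt in kept:
    print(f'{alt["score"]:.4f}  {alt["sequence"]}')
```

In the notebook, the same function could be applied to `sentence_alternatives` before the parsing loop. A suitable threshold would need to be tuned on the actual data.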
