# Masked Language Modelling

**Masked language modelling** (MLM) is a self-supervised technique for training language models. Some of the tokens in the input text are masked at random, and the model is asked to predict them from the context provided by the surrounding tokens. For example:

> **Input:** "I have watched this [MASK] and it was awesome."

> **Output:** "I have watched this movie and it was awesome."

Because a masked token can sit anywhere in the sentence, the model learns from both left (preceding) and right (following) context, unlike traditional left-to-right language models, which condition only on the preceding tokens. This bidirectional context can improve the model's ability to capture long-range dependencies and semantic relationships.
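The masking step described above can be sketched in a few lines of plain Python. This is only an illustration, not how any particular library implements it: the helper name `mask_tokens`, the whitespace tokenization, and the default `mask_prob=0.15` are all assumptions made for the example, and real training pipelines mask subword tokens rather than whole words.

```python
import random


def mask_tokens(text, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace whitespace-separated words with a mask token.

    Returns the masked text and a dict mapping word position -> original
    word. The dict plays the role of the labels the model must predict.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    words = text.split()
    labels = {}
    for i, word in enumerate(words):
        if rng.random() < mask_prob:
            labels[i] = word      # remember what was hidden
            words[i] = mask_token  # hide it from the model
    return " ".join(words), labels


masked, labels = mask_tokens(
    "I have watched this movie and it was awesome.", mask_prob=0.3
)
print(masked)
print(labels)
```

During training, the model sees only the masked text and is scored on how well it recovers the words stored in `labels`.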


## Masked Language Modelling - Use Cases

- **Content creation**: masked language models can be used to help generate content for blogs, newsletters, and social media posts based on user input. This can save time and money on writing and marketing, and can increase online presence and engagement.

- **Education**: they can be used to help generate educational materials and resources for students and teachers based on user input. This can improve the quality and accessibility of education and support personalized, adaptive learning experiences.

- **Entertainment**: they can be used to help generate creative content such as stories, poems, songs, jokes, and games based on user input. This can provide engaging entertainment options and inspire and assist human creators.

In [1]:
!pip install transformers -qq

In [8]:
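from transformers import pipeline

# Load a fill-mask pipeline backed by the distilroberta-base checkpoint
mlm = pipeline(
    task="fill-mask",
    model="distilroberta-base",
)
# RoBERTa-style models use <mask> as the mask token (BERT-style models use [MASK])
mlm("He is not feeling <mask>")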

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.03095770627260208,
  'token': 3473,
  'token_str': ' comfortable',
  'sequence': 'He is not feeling comfortable'},
 {'score': 0.02819400653243065,
  'token': 1522,
  'token_str': ' safe',
  'sequence': 'He is not feeling safe'},
 {'score': 0.02772311493754387,
  'token': 157,
  'token_str': ' well',
  'sequence': 'He is not feeling well'},
 {'score': 0.023862358182668686,
  'token': 1372,
  'token_str': ' happy',
  'sequence': 'He is not feeling happy'},
 {'score': 0.02330631949007511,
  'token': 205,
  'token_str': ' good',
  'sequence': 'He is not feeling good'}]
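Each prediction is a dictionary with a `score`, a `token` id, the decoded `token_str` (note the leading space), and the completed `sequence`. A small helper like the one below can turn that list into a compact ranking. The function name `summarize` and the `min_score` parameter are hypothetical, introduced only for this sketch; the sample data is the first three entries of the output above.

```python
# First three predictions, copied from the pipeline output above
predictions = [
    {"score": 0.03095770627260208, "token": 3473,
     "token_str": " comfortable", "sequence": "He is not feeling comfortable"},
    {"score": 0.02819400653243065, "token": 1522,
     "token_str": " safe", "sequence": "He is not feeling safe"},
    {"score": 0.02772311493754387, "token": 157,
     "token_str": " well", "sequence": "He is not feeling well"},
]


def summarize(predictions, min_score=0.0):
    """Return (word, rounded score) pairs, highest score first.

    Hypothetical helper: strips the leading space from token_str and
    drops predictions whose score is below min_score.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        (p["token_str"].strip(), round(p["score"], 4))
        for p in ranked
        if p["score"] >= min_score
    ]


for word, score in summarize(predictions):
    print(f"{word:<12} {score}")
```

The scores here are all close to 0.03, which suggests the model has no strong preference for this short prompt: many words fit after "He is not feeling".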

## Masked Language Modelling in Python

In [10]:
!pip install transformers datasets -qq


In [15]:
from datasets import load_dataset

data = load_dataset("SetFit/bbc-news")

print(data)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1225
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1000
    })
})


In [19]:
train_df = data["train"].to_pandas()
train_df

| | text | label | label_text |
|---|---|---|---|
| 0 | wales want rugby league training wales could f... | 2 | sport |
| 1 | china aviation seeks rescue deal scandal-hit j... | 1 | business |
| 2 | rock band u2 break ticket record u2 have smash... | 3 | entertainment |
| 3 | markets signal brazilian recovery the brazilia... | 1 | business |
| 4 | tough rules for ringtone sellers firms that fl... | 0 | tech |
| ... | ... | ... | ... |
| 1220 | us economy shows solid gdp growth the us econo... | 1 | business |
| 1221 | microsoft releases bumper patches microsoft ha... | 0 | tech |
| 1222 | stuart joins norwich from addicks norwich have... | 2 | sport |
| 1223 | why few targets are better than many the econo... | 1 | business |


In [20]:
# See all the unique label_text values
set(train_df["label_text"])

{'business', 'entertainment', 'politics', 'sport', 'tech'}

In [24]:
# Pick a label
texts = train_df[train_df["label_text"] == "business"][["text"]]
texts.head()

| | text |
|---|---|
| 1 | china aviation seeks rescue deal scandal-hit j... |
| 3 | markets signal brazilian recovery the brazilia... |
| 9 | bbc poll indicates economic gloom citizens in ... |
| 18 | ask jeeves tips online ad revival ask jeeves h... |
| 21 | ad firm wpp s profits surge 15% uk advertising... |


In [28]:
article = texts.text.iloc[0]
article

'china aviation seeks rescue deal scandal-hit jet fuel supplier china aviation oil has offered to repay its creditors $220m (£117m) of the $550m it lost on trading in oil futures.  the firm said it hoped to pay $100m now and another $120m over eight years. with assets of $200m and liabilities totalling $648m  it needs creditors  backing for the offer to avoid going into bankruptcy. the trading scandal is the biggest to hit singapore since the $1.2bn collapse of barings bank in 1995. chen jiulin  chief executive of china aviation oil (cao)  was arrested by at changi airport by singapore police on 8 december. he was returning from china  where he had headed when cao announced its trading debacle in late-november. the firm had been betting heavily on a fall in the price of oil during october  but prices rose sharply instead.  among the creditors whose backing cao needs for its restructuring plan are banking giants such as barclay s capital and sumitomo mitsui  as well as south korean firm
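To query the model about this article, one word in its opening has to be swapped for the mask token. The sketch below does that by hand; `mask_word` is a hypothetical helper written for this example, and it assumes simple whitespace tokenization and that the target word appears in the text. The `<mask>` token matches what the `distilroberta-base` pipeline in this notebook expects.

```python
def mask_word(sentence, target, mask_token="<mask>"):
    """Replace the first occurrence of `target` with the mask token.

    Hypothetical helper: splits on whitespace, so `target` must match a
    whole whitespace-delimited word exactly.
    """
    words = sentence.split()
    if target not in words:
        raise ValueError(f"{target!r} not found in sentence")
    words[words.index(target)] = mask_token
    return " ".join(words)


# Opening of the article, trimmed to one sentence-like chunk
headline = (
    "china aviation seeks rescue deal scandal-hit jet fuel supplier "
    "china aviation oil has offered to repay."
)
prompt = mask_word(headline, "rescue")
print(prompt)
```

Hiding "rescue" makes it possible to compare the model's guesses against the word the journalist actually used.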

In [30]:
# Load the fill-mask pipeline (same checkpoint as before)
mlm = pipeline(
    task = "fill-mask",
    model = "distilroberta-base"
)

mlm("china aviation seeks <mask> deal scandal-hit jet fuel supplier china aviation oil has offered to repay.")



[{'score': 0.2776990234851837,
  'token': 11103,
  'token_str': ' bailout',
  'sequence': 'china aviation seeks bailout deal scandal-hit jet fuel supplier china aviation oil has offered to repay.'},
 {'score': 0.1183982715010643,
  'token': 8183,
  'token_str': ' restructuring',
  'sequence': 'china aviation seeks restructuring deal scandal-hit jet fuel supplier china aviation oil has offered to repay.'},
 {'score': 0.07595193386077881,
  'token': 12313,
  'token_str': ' swap',
  'sequence': 'china aviation seeks swap deal scandal-hit jet fuel supplier china aviation oil has offered to repay.'},
 {'score': 0.05177634209394455,
  'token': 4660,
  'token_str': ' compensation',
  'sequence': 'china aviation seeks compensation deal scandal-hit jet fuel supplier china aviation oil has offered to repay.'},
 {'score': 0.047498419880867004,
  'token': 2541,
  'token_str': ' loan',
  'sequence': 'china aviation seeks loan deal scandal-hit jet fuel supplier china aviation oil has offered to repa