The task is to explore masked language modeling with the Transformers fill-mask pipeline.

In [15]:
import numpy as np
import pandas as pd

from pprint import pprint
from transformers import pipeline

In [16]:
# download the BBC text classification dataset
# original dataset on Kaggle: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [17]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('bbc_text_cls.csv')

In [18]:
# check the dataset
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [19]:
# check labels
df['labels'].unique()

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)

In [20]:
# select business texts
business_texts = df[df['labels'] == 'business']['text']

In [21]:
# select a random text
np.random.seed(42)
i = np.random.choice(business_texts.shape[0])
doc = business_texts.iloc[i]

In [22]:
print(doc)

US insurer Marsh cuts 2,500 jobs

Up to 2,500 jobs are to go at US insurance broker Marsh & McLennan in a shake up following bigger-than-expected losses.

The insurer said the cuts were part of a cost-cutting drive, aimed at saving millions of dollars. Marsh posted a $676m (£352m) loss for the last three months of 2004, against a $375m (£195.3m) profit a year before. It blamed an $850m payout to settle a price-rigging lawsuit, brought by New York attorney general Elliot Spitzer. Under the settlement announced in January, Marsh took a pre-tax charge of $618m in the October-to-December quarter, on top of the $232m charge from the previous quarter. "Clearly 2004 was the most difficult year in MMC's financial history," Marsh chief executive Michael Cherkasky said.

An ongoing restructuring drive at the group also led to a $337m hit in the fourth quarter, the world's biggest insurer said.

Analysts expect its latest round of cuts to focus on its brokerage unit, which employs 40,000 staff. T

In [23]:
# create a fill-mask pipeline (no model is named, so the library falls back to its default)
mlm = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
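
The warning above recommends naming the model explicitly. A minimal sketch of doing so, assuming the model name and revision reported in the warning are the ones wanted. `build_fill_mask` is a hypothetical helper that is not part of the original notebook, and `top_k` is the fill-mask pipeline's setting for how many predictions to return.

```python
MODEL_NAME = 'distilroberta-base'  # name reported in the warning above
MODEL_REVISION = 'ec58a5b'         # revision reported in the warning above


def build_fill_mask(model_name=MODEL_NAME, revision=MODEL_REVISION, top_k=5):
    """Create a fill-mask pipeline with the model pinned (hypothetical helper)."""
    # imported here so that defining the helper does not load the library
    from transformers import pipeline
    return pipeline('fill-mask', model=model_name, revision=revision, top_k=top_k)


# usage (downloads the model on the first call):
# mlm = build_fill_mask()
```

Pinning both values keeps results reproducible if the library's default model changes in a later release.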


In [27]:
# mask one word in a sentence and let the model predict it
text = 'The insurer said the cuts were part of a cost-cutting drive, aimed at saving millions of dollars.'
text_masked = text.replace('cuts', '<mask>')
pprint(mlm(text_masked))

[{'score': 0.18367719650268555,
  'sequence': 'The insurer said the layoffs were part of a cost-cutting drive, '
              'aimed at saving millions of dollars.',
  'token': 22788,
  'token_str': ' layoffs'},
 {'score': 0.13297876715660095,
  'sequence': 'The insurer said the cuts were part of a cost-cutting drive, '
              'aimed at saving millions of dollars.',
  'token': 2599,
  'token_str': ' cuts'},
 {'score': 0.10709443688392639,
  'sequence': 'The insurer said the reductions were part of a cost-cutting '
              'drive, aimed at saving millions of dollars.',
  'token': 14138,
  'token_str': ' reductions'},
 {'score': 0.08525798469781876,
  'sequence': 'The insurer said the changes were part of a cost-cutting drive, '
              'aimed at saving millions of dollars.',
  'token': 1022,
  'token_str': ' changes'},
 {'score': 0.04716590419411659,
  'sequence': 'The insurer said the moves were part of a cost-cutting drive, '
              'aimed at saving millions
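
The `str.replace` call in the cell above swaps every occurrence of the substring, including occurrences inside longer words, and silently does nothing if the word is absent. A minimal sketch of a stricter alternative, assuming whole-word, first-occurrence masking is what is wanted. `mask_word` is a hypothetical helper, and `'<mask>'` is the mask token used in this notebook (other models may use a different one).

```python
import re


def mask_word(text, word, mask_token='<mask>'):
    """Replace the first whole-word occurrence of `word` with `mask_token`."""
    pattern = r'\b' + re.escape(word) + r'\b'
    # a function replacement keeps mask_token literal (no backslash expansion)
    masked, n = re.subn(pattern, lambda m: mask_token, text, count=1)
    if n == 0:
        raise ValueError(f'{word!r} not found as a whole word')
    return masked


text = 'The insurer said the cuts were part of a cost-cutting drive, aimed at saving millions of dollars.'
text_masked = mask_word(text, 'cuts')
print(text_masked)
```

The result can be passed to the pipeline in the same way as before, for example `pprint(mlm(text_masked))`.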

In [26]:
# repeat the experiment on a second sentence, masking 'guilty'
text = 'In January, a former senior vice president also pleaded guilty to criminal charges related to the investigation.'
text_masked = text.replace('guilty', '<mask>')
pprint(mlm(text_masked))

[{'score': 0.9997223019599915,
  'sequence': 'In January, a former senior vice president also pleaded guilty '
              'to criminal charges related to the investigation.',
  'token': 2181,
  'token_str': ' guilty'},
 {'score': 6.685675907647237e-05,
  'sequence': 'In January, a former senior vice president also pleaded contest '
              'to criminal charges related to the investigation.',
  'token': 3096,
  'token_str': ' contest'},
 {'score': 5.803355816169642e-05,
  'sequence': 'In January, a former senior vice president also pleaded up to '
              'criminal charges related to the investigation.',
  'token': 62,
  'token_str': ' up'},
 {'score': 2.9516017093556002e-05,
  'sequence': 'In January, a former senior vice president also pleaded '
              'innocent to criminal charges related to the investigation.',
  'token': 7850,
  'token_str': ' innocent'},
 {'score': 2.9337308660615236e-05,
  'sequence': 'In January, a former senior vice president also pleaded 
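
To compare the two results, it helps to check where the original word lands among the predictions. A minimal sketch, assuming the tokens and scores are copied by hand from the two outputs above (only the four complete entries of each, since the fifth is cut off). `rank_of` is a hypothetical helper.

```python
# token_str and score values copied from the two outputs above
preds_cuts = [
    {'token_str': ' layoffs', 'score': 0.18367719650268555},
    {'token_str': ' cuts', 'score': 0.13297876715660095},
    {'token_str': ' reductions', 'score': 0.10709443688392639},
    {'token_str': ' changes', 'score': 0.08525798469781876},
]
preds_guilty = [
    {'token_str': ' guilty', 'score': 0.9997223019599915},
    {'token_str': ' contest', 'score': 6.685675907647237e-05},
    {'token_str': ' up', 'score': 5.803355816169642e-05},
    {'token_str': ' innocent', 'score': 2.9516017093556002e-05},
]


def rank_of(preds, word):
    """Return the 1-based rank of `word` among the predictions, or None."""
    for rank, pred in enumerate(preds, start=1):
        # token_str carries a leading space, so strip it before comparing
        if pred['token_str'].strip() == word:
            return rank
    return None


for word, preds in [('cuts', preds_cuts), ('guilty', preds_guilty)]:
    top = preds[0]
    print(word, rank_of(preds, word), top['token_str'].strip(), round(top['score'], 4))
```

In the first sentence the original word ranks second behind a close synonym, with the probability spread over several plausible fillers. In the second, "pleaded ... to criminal charges" leaves almost no alternative and the original word takes nearly all of the probability.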