# 🧠 Masked Language Model Training with BERT

This notebook demonstrates how to train a Masked Language Model (MLM) using BERT. The key steps include:
- 📊 Dataset preparation
- 🧾 Tokenization and masking
- ⚙️ Configuration of training parameters
- 🏋️ Training the model
- 🔍 Inference using masked token prediction


In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

## 📊 Step 1: Dataset Preparation
We'll use the AG News training set, a publicly available collection of news titles and descriptions. You can replace it with your own text data.


In [None]:
!wget -q https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
import pandas as pd

df = pd.read_csv("train.csv", header=None, names=["Class Index", "Title", "Description"])
df["text"] = df["Title"] + " " + df["Description"]
texts = df["text"].tolist()

texts = texts[:5000]  # keep only the first 5,000 samples
# Display sample
print("Sample Text:\n", texts[0])


Sample Text:
 Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
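
If you swap in your own corpus, the only thing the rest of the notebook needs is a Python list of strings named `texts`. Here is a minimal sketch, assuming a plain-text file with one document per line. The helper name `load_texts` and the file name `my_corpus.txt` are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

def load_texts(path, limit=None):
    """Read one document per line, skipping blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    docs = [line.strip() for line in lines if line.strip()]
    return docs if limit is None else docs[:limit]

# Demo with a throwaway file so the sketch runs as-is; point `path` at your own corpus instead.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_corpus.txt"  # placeholder file name
    path.write_text("First document.\n\nSecond document.\n", encoding="utf-8")
    texts_demo = load_texts(path, limit=5000)

print(texts_demo)
```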


## 🧾 Step 2: Tokenization and Masking
We’ll use the BERT tokenizer and apply random masking (MLM-style) using Hugging Face's built-in `DataCollatorForLanguageModeling`.


In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

dataset = Dataset.from_dict({"text": texts})
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
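
With `padding="max_length"` and `truncation=True`, every example comes out exactly 64 token IDs long: longer sequences are cut, shorter ones are filled with the pad ID, and `attention_mask` marks which positions hold real tokens. The toy sketch below shows only that shape logic and is not the tokenizer's actual implementation (it ignores special tokens such as `[CLS]` and `[SEP]`). `pad_or_truncate` is a hypothetical helper, and the pad ID 0 is the `bert-base-uncased` value hard-coded as an assumption.

```python
def pad_or_truncate(ids, max_length=64, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(ids)[:max_length]                    # truncation: drop anything past max_length
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

short_ids, short_mask = pad_or_truncate(range(1, 11))    # 10 tokens, padded up to 64
long_ids, long_mask = pad_or_truncate(range(1, 101))     # 100 tokens, cut down to 64
print(len(short_ids), sum(short_mask), len(long_ids), sum(long_mask))
```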



In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15  # select 15% of tokens as targets; by default 80% become [MASK], 10% a random token, 10% stay unchanged
)
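
As a rough picture of what the collator does to each batch: about 15% of the non-special tokens are selected as prediction targets. Of those, roughly 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. Labels are set to -100 everywhere else so the loss ignores those positions. The following is a simplified pure-Python sketch of that rule, not the library's code. `mask_tokens` is a hypothetical helper, and the IDs (101, 102, 0, 103) and vocabulary size (30522) are the `bert-base-uncased` values hard-coded as assumptions.

```python
import random

CLS_ID, SEP_ID, PAD_ID, MASK_ID, VOCAB_SIZE = 101, 102, 0, 103, 30522
SPECIAL_IDS = {CLS_ID, SEP_ID, PAD_ID}

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (masked_ids, labels) following the 80/10/10 MLM rule."""
    masked, labels = list(input_ids), [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue                               # not selected: label stays -100 (ignored by the loss)
        labels[i] = tok                            # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

rng = random.Random(0)
ids = [CLS_ID] + list(range(2000, 2020)) + [SEP_ID] + [PAD_ID] * 4
masked_ids, labels = mask_tokens(ids, rng=rng)
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```

The random-token branch is why decoded samples can show an unrelated word piece where an original token used to be.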

## 🖨️ Example: Masked Training Samples
Let’s visualize how the `DataCollatorForLanguageModeling` randomly masks tokens in the input sequences.


In [None]:
# Select a few tokenized samples (reuses the data_collator defined in the previous cell)
sample_batch = tokenized_dataset.select(range(3))

# Convert the Dataset slice to a list of dictionaries
# This extracts each sample as a dictionary
samples_list = [sample_batch[i] for i in range(len(sample_batch))]

# Apply masking by passing the list of dictionaries to the data_collator
masked = data_collator(samples_list)

# Get the token IDs for CLS and PAD
cls_token_id = tokenizer.cls_token_id
pad_token_id = tokenizer.pad_token_id

# Decode original and masked inputs
print("📄 Original vs. Masked Samples:\n")
for i in range(3):
    # Access input_ids directly from the masked batch output
    input_ids = masked["input_ids"][i]
    # Original ids can be accessed from the selected dataset
    original_ids = sample_batch["input_ids"][i]

    print(f"Example {i+1}:")
    # Decode original (which are lists of integers)
    print("Original:", tokenizer.decode(original_ids, skip_special_tokens=True))

    # Decode masked output without skipping special tokens first
    masked_decoded = tokenizer.decode(input_ids, skip_special_tokens=False)

    # Strip [CLS] and [PAD] by string replacement so the [MASK] tokens stay visible.
    # This is fine for a quick visualization; a more robust approach would filter
    # the corresponding token IDs before decoding.
    cleaned_masked_decoded = masked_decoded.replace(tokenizer.cls_token, "").replace(tokenizer.pad_token, "")

    print("Masked  :", cleaned_masked_decoded.strip()) # strip to remove leading/trailing spaces from replacement
    print("-" * 80)

📄 Original vs. Masked Samples:

Example 1:
Original: wall st. bears claw back into the black ( reuters ) reuters - short - sellers, wall street ' s dwindling \ band of ultra - cynics, are seeing green again.
Masked  : wall st. [MASK] claw back into the black ( reuters ) reuters - short [MASK] [MASK], wall street ' s dwindling \ band of ultra - cynics, are seeing green [MASK]. [SEP]
--------------------------------------------------------------------------------
Example 2:
Original: carlyle looks toward commercial aerospace ( reuters ) reuters - private investment firm carlyle group, \ which has a reputation for making well - timed and occasionally \ controversial plays in the defense industry, has quietly placed \ its bets on another part of the market.
Masked  : [MASK]le [MASK] toward commercial aerospace ( reuters ) [MASK] - private investment firm carlyle group, [MASK] which [MASK] a reputation [MASK] makingiol - timed [MASK] occasionally \ controversial plays in the defense industr
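
The string replacement above is enough for a quick look. A more robust variant filters out the `[CLS]` and `[PAD]` IDs before decoding, so `[MASK]` and `[SEP]` are kept and nothing depends on how the decoded string is spelled. Here is a minimal sketch with a toy vocabulary standing in for the tokenizer. `drop_ids` and the `id_to_token` table are hypothetical; with the real tokenizer you would pass the filtered IDs to `tokenizer.decode`.

```python
def drop_ids(ids, unwanted):
    """Remove the unwanted token IDs, preserving the order of the rest."""
    unwanted = set(unwanted)
    return [i for i in ids if i not in unwanted]

# Toy stand-in for a tokenizer vocabulary. The special-token IDs follow bert-base-uncased;
# the word IDs (1, 2, 3) are made up for illustration.
id_to_token = {101: "[CLS]", 102: "[SEP]", 0: "[PAD]", 103: "[MASK]", 1: "wall", 2: "st", 3: "bears"}
ids = [101, 1, 2, 103, 102, 0, 0, 0]

# With the real tokenizer: unwanted=[tokenizer.cls_token_id, tokenizer.pad_token_id]
kept = drop_ids(ids, unwanted=[101, 0])
print(" ".join(id_to_token[i] for i in kept))
```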

## ⚙️ Step 3: Model and Training Configuration
We'll fine-tune `bert-base-uncased` using the prepared dataset and collator.


In [None]:
from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="./bert-mlm",
    # eval_strategy="no",  # spelled evaluation_strategy in older transformers releases; the old name raised a TypeError here
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=100,
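    save_steps=500,
    report_to="none",  # disable logging integrations such as W&B (replaces the deprecated WANDB_DISABLED variable)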
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
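
These settings determine how long training runs. With 5,000 examples and a per-device batch size of 8 on a single device, one epoch is 625 optimizer steps. Assuming the `Trainer` defaults of linear learning-rate decay with no warmup and no gradient accumulation, the learning rate falls from 2e-5 to 0 over those steps. The sketch below works through that arithmetic; `linear_lr` is a hypothetical helper, not a library function.

```python
import math

num_examples, batch_size, epochs = 5000, 8, 1
base_lr = 2e-5

total_steps = math.ceil(num_examples / batch_size) * epochs

def linear_lr(step, total_steps=total_steps, base_lr=base_lr):
    """Learning rate after `step` optimizer steps under linear decay, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

print(total_steps)  # 625
for step in (0, 100, 500, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```

With `logging_steps=100` the loss is logged at steps 100 through 600, and `save_steps=500` writes a checkpoint at step 500.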


## 🏋️ Step 4: Train the Model
Let’s now train our BERT model with the dataset we’ve prepared.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    processing_class=tokenizer,  # newer name for the deprecated `tokenizer=` argument
    data_collator=data_collator,
)

trainer.train()




| Step | Training Loss |
|------|---------------|
| 100  | 2.9671        |
| 200  | 2.625         |
| 300  | 2.5026        |
| 400  | 2.5508        |
| 500  | 2.4487        |
| 600  | 2.4627        |


TrainOutput(global_step=625, training_loss=2.5865390258789063, metrics={'train_runtime': 183.4579, 'train_samples_per_second': 27.254, 'train_steps_per_second': 3.407, 'total_flos': 164503008000000.0, 'train_loss': 2.5865390258789063, 'epoch': 1.0})
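
The reported loss is the mean cross-entropy (in nats) over the masked positions, so exponentiating it gives a perplexity-style number that is easier to interpret. For the overall training loss this comes out at roughly 13. The sketch below uses the values logged above. Note that these are computed on training batches, so they are only a rough indicator and not a held-out evaluation.

```python
import math

# Training loss values copied from the log table above
logged = {100: 2.9671, 200: 2.625, 300: 2.5026, 400: 2.5508, 500: 2.4487, 600: 2.4627}
train_loss = 2.5865390258789063  # training_loss from TrainOutput

for step, loss in logged.items():
    print(f"step {step}: loss {loss:.4f} -> exp(loss) {math.exp(loss):.2f}")

print(f"overall: exp({train_loss:.4f}) = {math.exp(train_loss):.2f}")
```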

## 🔍 Step 5: Inference — Predicting Masked Tokens
Now let’s use the trained BERT model to fill in `[MASK]` in a sentence.


In [None]:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Example: Predict the masked word
sentence = "The [MASK] sat on the mat."
results = fill_mask(sentence)

print("Predictions for [MASK]:")
for r in results:
    print(f"{r['token_str']:>10s} | score: {r['score']:.4f}")


Device set to use cuda:0


Predictions for [MASK]:
       man | score: 0.0851
      girl | score: 0.0399
       boy | score: 0.0367
       dog | score: 0.0341
     woman | score: 0.0209
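
Under the hood, the `fill-mask` pipeline takes the model's logits at the `[MASK]` position, applies a softmax over the vocabulary, and returns the highest-scoring tokens (five by default). Here is a toy sketch of that last step with made-up logits over a six-word vocabulary. The numbers are illustrative and not model output, and `top_k_softmax` is a hypothetical helper.

```python
import math

def top_k_softmax(logits, k=5):
    """Softmax over a {token: logit} dict, returning the k most probable tokens."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up logits for illustration only
toy_logits = {"man": 3.1, "girl": 2.3, "boy": 2.2, "dog": 2.1, "woman": 1.7, "cat": 1.5}
for token, score in top_k_softmax(toy_logits):
    print(f"{token:>10s} | score: {score:.4f}")
```

In the real output, the low absolute scores (0.0851 for the top candidate) indicate that the probability mass is spread over many plausible words in the 30,522-token vocabulary.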
