# Turning ModernBERT into an instruct-tuned Diffusion LLM

An experiment in adapting ModernBERT into a LLaDA-style diffusion LLM (dLLM) by fine-tuning it on instruction data with a variable masking ratio.

In [None]:
# !pip install -q transformers datasets accelerate bitsandbytes

In [None]:
# !pip install -U fsspec datasets

## Load the model + tokenizer

In [None]:
import os, random, itertools, math, torch
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer, AutoModelForMaskedLM,
    get_cosine_schedule_with_warmup
)
from torch.optim import AdamW
from datasets import load_dataset
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
model_id = "answerdotai/ModernBERT-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,          # <-- bfloat16 = cheap & safe
    device_map="auto",                   # or set to "cuda:0"
    low_cpu_mem_usage=True,
)
mask_id  = tokenizer.mask_token_id
cls_id   = tokenizer.cls_token_id
sep_id   = tokenizer.sep_token_id

print(f"{tokenizer.mask_token=}  {mask_id}")

tokenizer.mask_token='[MASK]'  50284


## Dataset

This step also pads each example to `max_len` and masks a randomly chosen fraction of the assistant tokens.
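
As a rough illustration of the masking step, here is a tokenizer-free sketch on a toy sequence. `MASK_ID`, `VOCAB_SIZE` and the `corrupt` helper are made-up names for this example only; the real cell below works on ModernBERT token ids. It picks a fraction of the response positions, stores the originals as labels, and applies 80/10/10 corruption (mask / random token / unchanged).

```python
import random

MASK_ID = 99       # hypothetical mask-token id (illustration only)
VOCAB_SIZE = 50    # hypothetical vocabulary size (illustration only)
IGNORE = -100      # label value the cross-entropy loss skips

def corrupt(ids, start, ratio, rng):
    """Corrupt `ratio` of the positions in ids[start:]; return (ids, labels)."""
    ids = list(ids)
    labels = [IGNORE] * len(ids)
    cand = list(range(start, len(ids)))          # response positions only
    n_mask = max(1, int(len(cand) * ratio))
    for idx in rng.sample(cand, n_mask):
        labels[idx] = ids[idx]                   # remember the ground truth
        dice = rng.random()
        if dice < 0.8:                           # 80 %: replace with mask token
            ids[idx] = MASK_ID
        elif dice < 0.9:                         # 10 %: replace with random token
            ids[idx] = rng.randrange(VOCAB_SIZE)
        # remaining 10 %: leave the token unchanged
    return ids, labels

rng = random.Random(0)
prompt, response = [1, 2, 3], [10, 11, 12, 13, 14, 15, 16, 17]
original = prompt + response
ids, labels = corrupt(original, start=len(prompt), ratio=0.5, rng=rng)
print(ids)
print(labels)
```

The prompt positions never receive a label, so the loss only covers the response.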

In [None]:
raw_ds = load_dataset(
    "allenai/tulu-3-sft-mixture-0225",
    split="train",           # full split; use "train[:1%]" for a quick demo
    cache_dir="./data"
)
print("raw len:", len(raw_ds))

raw len: 866137


In [None]:
max_len          = 512
mask_ratio_min   = 0.15          # Can tweak
mask_ratio_max   = 0.99

def join_dialogue(msgs):
    """
    [ {role, content}, ... ]  ->  one flat string with explicit [SEP] boundaries
    We expect first msg=user, second msg=assistant.
    """
    u = msgs[0]["content"].strip()
    a = msgs[1]["content"].strip()
    return f"User: {u} {tokenizer.sep_token} Assistant: {a}"

def apply_random_mask(example):
    text = join_dialogue(example["messages"])
    enc  = tokenizer(text,
                     truncation=True, max_length=max_len,
                     padding="max_length")
    ids  = enc["input_ids"]
    labels = [-100] * len(ids)     # -100 -> ignored by CE-loss

    # find assistant region (everything after first [SEP])
    if sep_id not in ids:
        return {**enc, "labels": labels}
    sep_pos = ids.index(sep_id)         # first [SEP]
    cand = [i for i in range(sep_pos+1, len(ids))
            if ids[i] not in (tokenizer.pad_token_id,
                              cls_id, sep_id)]
    if not cand:
        return {**enc, "labels": labels}

    # variable mask ratio
    m_ratio = random.uniform(mask_ratio_min, mask_ratio_max)
    n_mask  = max(1, int(len(cand) * m_ratio))
    chosen  = random.sample(cand, n_mask)

    for idx in chosen:
        labels[idx] = ids[idx]          # remember ground-truth
        dice = random.random()
        if dice < 0.8:                  # 80 %
            ids[idx] = mask_id
        elif dice < 0.9:                # 10 %
            ids[idx] = random.randint(0, tokenizer.vocab_size - 1)
        # else leave token unchanged (10 %)

    enc["input_ids"]    = ids
    enc["labels"]       = labels
    return enc

proc_ds = raw_ds.map(apply_random_mask, remove_columns=raw_ds.column_names, num_proc=32)
proc_ds.set_format(type="torch")
print(proc_ds[0]["input_ids"][:30], "\nlabels:", proc_ds[0]["labels"][:30])

In [None]:
# prompt: pick a random sample and decode it to visualize the data

sample = random.choice(proc_ds)
decoded_input = tokenizer.decode(sample["input_ids"], skip_special_tokens=False)
print("Decoded Input:")
print(decoded_input)

# Decode the labels to see the original tokens that were masked
labels_to_decode = [label for label in sample["labels"] if label != -100]
decoded_labels = tokenizer.decode(labels_to_decode, skip_special_tokens=False)
print("\nDecoded Labels (Original masked tokens):")
decoded_labels

Decoded Input:
[CLS]User: A successful businesswoman, Ms. Johnson, donates a portion of her annual earnings to a community fund. This year, she decided to allocate $100,000 of her earnings to create a scholarship program for underprivileged students. Ms. Johnson's earnings follow an exponential growth model due to her flourishing business.

1. Ms. Johnson's annual earnings can be modeled by the function \( E(t) = E_0 e^{kt} \), where \( E_0 \) is her initial earnings, \( k \) is the growth rate, and \( t \) is the number of years since she started her business. If Ms. Johnson started her business 5 years ago with initial earnings of $150,000, and her current earnings are $450,000, determine the growth rate \( k \).

2. Based on the growth rate \( k \) found in sub-problem 1, predict the amount Ms. Johnson will allocate to the scholarship program in 10 years if she continues to donate the same percentage of her earnings. [SEP] Assistant[MASK][MASK][MASK][MASK][MASK] problem[MASK][MASK][

': To solve the given, we need to perform the following steps:\n\n### Part 1: Determine the Growth Rate \\( k \\)\n\nThe earnings function is given:\n\n\\[ Et) E_0 e^{kt} \\]\n\nwhere:\n- \\( E_0 = 150, \\) (initial earnings)\n- \\( E(t) 450,000 \\) (current earnings after 5 years)\n- \\( t = 5 \\) years\nWe need to find the rate \\( k \\). Start by substituting the known values into the equation:\n\n 450,000 = 150,000 e^{5k} \\]\n\nDivide both sides by 150,000 to isolate the exponential\n\\[ e^{5k} \\frac{450,000}{150, \\]\n\\[ e^{5k} = 3 \\]\n\nTo for \\( k \\ take the natural logarithm of both sides:\n\n \\ln(e^{k}) = \\ln(3) \\]\n\nBy the properties logarithms, this simplifies to:\n\n\\[ 5k = \\ln3) \\]\n\nSolve for k \\) by both sides 5:\n\n\\[ = \\frac{\\ln3)}{5}]'

In [None]:
batch = proc_ds[0:4]
for k,v in batch.items(): print(k, v.shape)

input_ids torch.Size([4, 512])
attention_mask torch.Size([4, 512])
labels torch.Size([4, 512])


## Training

In [None]:
# # A minimal train loop
# loader = DataLoader(proc_ds, batch_size=32, shuffle=True)
# optim = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# num_epochs      = 1
# num_steps       = len(loader)*num_epochs
# warmup_steps    = int(0.06 * num_steps)
# sched = get_cosine_schedule_with_warmup(optim, warmup_steps, num_steps)
# model.train()
# for epoch in range(num_epochs):
#     for step, batch in enumerate(loader, 1):
#         batch = {k:v.to(device) for k,v in batch.items()}
#         out   = model(**batch)
#         out.loss.backward()
#         optim.step(); optim.zero_grad(); sched.step()
#         if step % 100 == 0 or step==1:
#             print(f"epoch {epoch} ‖ step {step:4} ‖ loss {out.loss.item():.4f}")

In [None]:
# ── shuffle once, then split 95 % / 5 % (masking is already applied) ───────────
shuffled = proc_ds.shuffle(seed=42)
n_train  = int(0.95 * len(shuffled))
train_ds = shuffled.select(range(n_train))
val_ds   = shuffled.select(range(n_train, len(shuffled)))   # disjoint from train_ds

batch_size = 32          # ↑ when you have more VRAM
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=batch_size, shuffle=True)

In [None]:
def accuracy_buckets(logits, labels, attn):
    """
    Accuracy on masked positions, overall and bucketed by per-sample mask ratio
    (masked tokens / all non-pad tokens, prompt included).

    Returns:
        global_acc   (float)
        bucket_acc   (list[4])  # ≤.25, .25-.5, .5-.75, >.75
    """
    # logits: [B,T,V]  labels: [B,T]
    with torch.no_grad():
        pred   = logits.argmax(-1)
        mask   = labels != -100                  # only masked positions count
        correct= (pred == labels) & mask

        # ---- global ----
        tot_masked = mask.sum().item()
        tot_corr   = correct.sum().item()
        global_acc = tot_corr / tot_masked if tot_masked else 0.0

        # ---- buckets by sample-level mask ratio ----
        bucket_corr  = [0,0,0,0]
        bucket_total = [0,0,0,0]
        edges = (0.25, 0.50, 0.75, 1.01)         # last edge slightly >1

        for b in range(labels.size(0)):
            n_mask = mask[b].sum().item()
            if n_mask == 0:                       # should be rare
                continue
            # denominator = real tokens (ignore pads)
            seq_len = attn[b].sum().item()
            ratio   = n_mask / seq_len
            # bucket index
            for i,edge in enumerate(edges):
                if ratio <= edge:
                    bucket_total[i]  += n_mask
                    bucket_corr[i]   += correct[b].sum().item()
                    break

        bucket_acc = [c/t if t else 0.0 for c,t in zip(bucket_corr, bucket_total)]
    return global_acc, bucket_acc
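
The bucket lookup in the inner loop can also be written with the standard library's `bisect`. This is only a sketch of an equivalent formulation; `bucket_index` is a hypothetical helper, not part of the notebook. `bisect_left` returns the index of the first edge that is greater than or equal to the ratio, which matches the `ratio <= edge` test above.

```python
from bisect import bisect_left

EDGES = (0.25, 0.50, 0.75, 1.01)   # same edges as in accuracy_buckets

def bucket_index(ratio, edges=EDGES):
    """Index of the first edge with ratio <= edge."""
    return bisect_left(edges, ratio)

for r in (0.10, 0.25, 0.40, 0.75, 0.99):
    print(r, "->", bucket_index(r))
```

Because the ratio is computed over all non-pad tokens, prompt included, samples with long prompts will tend to land in the lower buckets even when most of the response is masked.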


In [None]:
from itertools import islice

@torch.no_grad()
def evaluate(model, val_loader, batches=8):
    model.eval()
    tot_loss, tot_acc, bucket_hits = 0., 0., [0,0,0,0]
    n = 0                                   # batches actually seen
    for batch in islice(val_loader, batches):
        batch = {k:v.to(device) for k,v in batch.items()}
        out   = model(**batch)
        loss  = out.loss.item()
        tot_loss += loss
        acc, bucket_acc = accuracy_buckets(out.logits, batch["labels"], batch["attention_mask"])
        tot_acc += acc
        # note: a bucket with no samples in this batch adds 0.0 here
        bucket_hits = [h+a for h,a in zip(bucket_hits, bucket_acc)]
        n += 1
    n = max(n, 1)                           # val_loader may hold fewer than `batches`
    val_loss = tot_loss / n
    val_acc  = tot_acc  / n
    bucket_acc = [b/n for b in bucket_hits]
    return val_loss, val_acc, bucket_acc


In [None]:
from tqdm.auto import tqdm

optim = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_epochs           = 1
steps_per_epoch      = len(train_loader)
total_steps          = num_epochs * steps_per_epoch
warmup_steps         = int(0.06 * total_steps)
sched = get_cosine_schedule_with_warmup(optim, warmup_steps, total_steps)

log_every   = 200                 # validate every n updates
global_step = 0
losses, val_losses = [], []
accs, val_accs = [], []

model.train()
for epoch in range(num_epochs):
    pbar = tqdm(train_loader, total=steps_per_epoch,
                desc=f"Epoch {epoch}", leave=False, dynamic_ncols=True)

    running_loss = 0.0
    running_acc  = 0.0
    for step, batch in enumerate(pbar, 1):
        global_step += 1
        batch = {k:v.to(device) for k,v in batch.items()}
        out   = model(**batch)
        out.loss.backward()

        optim.step(); optim.zero_grad(); sched.step()

        # ── on-the-fly training accuracy ───────────────────────────────────────
        acc, _ = accuracy_buckets(out.logits.detach(), batch["labels"], batch["attention_mask"])
        running_loss += out.loss.item()
        running_acc  += acc

        losses.append(out.loss.item())
        accs.append(acc)

        # progress bar message
        if step % 20 == 0 or step == 1:
            pbar.set_postfix(loss = running_loss/step,
                             acc  = running_acc / step)

        # ── periodic validation ────────────────────────────────────────────────
        if global_step % log_every == 0:
            val_loss, val_acc, val_buckets = evaluate(model, val_loader)
            val_losses.append(val_loss)
            val_accs.append(val_acc)
            print(f"\n🧮 step {global_step:6d} | "
                  f"train_loss {running_loss/step:.4f}  "
                  f"train_acc {running_acc/step:.3f} | "
                  f"val_loss {val_loss:.4f}  val_acc {val_acc:.3f} | "
                  f"bucket_acc {['{:.3f}'.format(x) for x in val_buckets]}\n")
            model.train()               # back to train-mode


- Should the loss be scaled by the masking ratio?
- What is the length distribution in the dataset? Is `max_len = 512` too large?
- Can I use a larger batch size?
- Evaluate on a fixed batch (or ~10 fixed batches) with a fixed masking ratio to get a smoother validation-loss curve.
- Do a full-dataset run.
- Try a mask-ratio schedule that warms up to full masking?
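
On the first question, a small worked example may help. As far as I can tell, `out.loss` is a mean over every labelled token in the batch, so heavily masked samples contribute more terms than lightly masked ones. The numbers below are made up, and `token_mean` / `sample_mean` are hypothetical helpers for illustration. One possible alternative is to average within each sample first, which amounts to dividing each sample's summed loss by its masked-token count.

```python
# Per-token cross-entropy values at masked positions (made-up numbers).
sample_losses = {
    "low_ratio":  [1.0, 1.0],     # 2 masked tokens, easy to fill in
    "high_ratio": [3.0] * 8,      # 8 masked tokens, little context left
}

def token_mean(samples):
    """Mean over all masked tokens in the batch (every token weighs the same)."""
    flat = [x for losses in samples.values() for x in losses]
    return sum(flat) / len(flat)

def sample_mean(samples):
    """Mean of per-sample means (every sample weighs the same)."""
    per_sample = [sum(losses) / len(losses) for losses in samples.values()]
    return sum(per_sample) / len(per_sample)

print(token_mean(sample_losses))    # 2.6
print(sample_mean(sample_losses))   # 2.0
```

With the token-level mean the heavily masked sample dominates the batch loss; with the per-sample mean both samples count equally. Which behaviour is better here is an open question, not something this notebook has tested.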

In [None]:
save_dir = "modernbert-diffusion-finetuned"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

('modernbert-diffusion-finetuned/tokenizer_config.json',
 'modernbert-diffusion-finetuned/special_tokens_map.json',
 'modernbert-diffusion-finetuned/tokenizer.json')

In [None]:
# repo_id = "johnowhitaker/modernbert-diffusion"
# model.push_to_hub(
#     repo_id,
#     token="my_token",           # or pass your string literal
#     private=True,                          # drop this if the repo can be public
#     commit_message="Add diffusion-style fine-tuned weights",
# )
# tokenizer.push_to_hub(repo_id, token="my_token")

In [None]:
# prompt: plot the train and val losses and accuracies (two separate subplots)

import matplotlib.pyplot as plt

# Plotting the loss
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(losses, label='Train Loss')
# Since val_losses are recorded less frequently, we need to align the x-axis.
# We log every `log_every` steps.
val_x = [i * log_every for i in range(1, len(val_losses) + 1)]
plt.plot(val_x, val_losses, label='Validation Loss')
plt.xlabel('Training Steps')
plt.ylabel('Loss')
plt.title('Train and Validation Loss over Steps')
plt.legend()

# Plotting the accuracy
plt.subplot(1, 2, 2)
plt.plot(accs, label='Train Accuracy')
# Aligning val_accs with the corresponding steps
plt.plot(val_x, val_accs, label='Validation Accuracy')
plt.xlabel('Training Steps')
plt.ylabel('Accuracy')
plt.title('Train and Validation Accuracy over Steps')
plt.legend()

plt.tight_layout()
plt.show()