In [1]:

%%capture
! pip install transformers datasets accelerate evaluate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

# Masked language modeling

In [14]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Load DiffusionDB dataset

The kaggle dataset already exists in the HuggingFace hub and we will use that version to make our development faster.

In [2]:
from datasets import load_dataset
from datasets import Dataset, DatasetDict

dataset = load_dataset("poloclub/diffusiondb")
dataset["train"][100]

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512>,
 'prompt': 'baroque dmt trip ',
 'seed': 2601779167,
 'step': 50,
 'cfg': 7.0,
 'sampler': 'k_lms',
 'width': 512,
 'height': 512,
 'user_name': '18cdc4cf828242fcf88f15095dc4553170847a2baea96f87a63e3281d5d98df6',
 'timestamp': datetime.datetime(2022, 8, 12, 3, 49, tzinfo=<UTC>),
 'image_nsfw': 0.078245148062706,
 'prompt_nsfw': 0.0008729292312636971}

In [3]:
# Perform the train-test split
train_test_split = dataset['train'].train_test_split(test_size=0.2)

# Create a new DatasetDict to include the new train and test splits
dataset = DatasetDict({
    'train': train_test_split['train'],
    'test': train_test_split['test']
})

# Access the new train and test splits
train_dataset = dataset['train']
test_dataset = dataset['test']


Then take a look at an example:

In [4]:
dataset["train"][0]

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=768x640>,
 'prompt': 'a painting of king arthur standing beneath a tree by n. c. wyeth, highly detailed, oil on canvas ',
 'seed': 3339297161,
 'step': 100,
 'cfg': 7.0,
 'sampler': 'k_lms',
 'width': 768,
 'height': 640,
 'user_name': 'e91cb787bdf7fe7e53a2b4bd7eb1690c192b44d9ba8358345b78a5c5dc047bd3',
 'timestamp': datetime.datetime(2022, 8, 12, 4, 21, tzinfo=<UTC>),
 'image_nsfw': 0.12349627912044525,
 'prompt_nsfw': 0.0005137880216352642}

While this may look like a lot, you're only really interested in the `prompt` field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task) because the next word *is* the label.

## Preprocess

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to e
xtract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method:

In [6]:
dataset = dataset.flatten()
dataset["train"][0]

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=768x640>,
 'prompt': 'a painting of king arthur standing beneath a tree by n. c. wyeth, highly detailed, oil on canvas ',
 'seed': 3339297161,
 'step': 100,
 'cfg': 7.0,
 'sampler': 'k_lms',
 'width': 768,
 'height': 640,
 'user_name': 'e91cb787bdf7fe7e53a2b4bd7eb1690c192b44d9ba8358345b78a5c5dc047bd3',
 'timestamp': datetime.datetime(2022, 8, 12, 4, 21, tzinfo=<UTC>),
 'image_nsfw': 0.12349627912044525,
 'prompt_nsfw': 0.0005137880216352642}

In [7]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["prompt"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need:

In [8]:
tokenized_prompts= dataset.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/800 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (726 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (663 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (568 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (644 > 512). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/200 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (653 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (679 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (571 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (576 > 512). Running this sequence through the model will result in indexing errors


This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.

In [9]:
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Apply the `group_texts` function over the entire dataset:

In [10]:
lm_dataset = tokenized_prompts.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/800 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/200 [00:00<?, ? examples/s]

Now create a batch of examples using [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.


Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:

In [11]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:

In [20]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="pt")

## Train

In [21]:
from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [22]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
     

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): La

In [23]:
training_args = TrainingArguments(
    output_dir="diffusion_prompt",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()



Epoch,Training Loss,Validation Loss
1,No log,1.518225
2,No log,1.402045
3,1.641800,1.347369


TrainOutput(global_step=513, training_loss=1.6369536773503175, metrics={'train_runtime': 94.8624, 'train_samples_per_second': 43.041, 'train_steps_per_second': 5.408, 'total_flos': 135373709078784.0, 'train_loss': 1.6369536773503175, 'epoch': 3.0})

Once training is completed, use the [evaluate()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.evaluate) method to evaluate your model and get its perplexity:

# Evaluate the trained model on Perplexity metric

In [24]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 3.83


Then share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

# Save models weights and tokenizer to hub

In [25]:
trainer.push_to_hub()

model.safetensors:   0%|          | 0.00/329M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Shamima/diffusion_prompt/commit/9daea172b4232f82d1a498f36012645c51ba1d0b', commit_message='End of training', commit_description='', oid='9daea172b4232f82d1a498f36012645c51ba1d0b', pr_url=None, pr_revision=None, pr_num=None)

In [30]:
tokenizer.push_to_hub('Shamima/diffusion_prompt')

README.md:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Shamima/diffusion_prompt/commit/8c1abdeb98d0989445a5a823c3b3ad4572816b8c', commit_message='Upload tokenizer', commit_description='', oid='8c1abdeb98d0989445a5a823c3b3ad4572816b8c', pr_url=None, pr_revision=None, pr_num=None)

Then you can load DistilRoBERTa with [TFAutoModelForMaskedLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.TFAutoModelForMaskedLM):

## Inference

Now that we've finetuned the model, you can use it for inference!. Let's come up with some text we'd like the model to fill in the blank with, and use the special `<mask>` token to indicate the blank:

In [33]:
text = "A portrait of <mask>"

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for fill-mask with your model, and pass your text to it. If you like, you can use the `top_k` parameter to specify how many predictions to return:

In [34]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "Shamima/diffusion_prompt")
mask_filler(text, top_k=3)

[{'score': 0.023939495906233788,
  'token': 10,
  'token_str': ' a',
  'sequence': 'A portrait of a'},
 {'score': 0.019102510064840317,
  'token': 2864,
  'token_str': ' herself',
  'sequence': 'A portrait of herself'},
 {'score': 0.01888210140168667,
  'token': 16423,
  'token_str': ' Hitler',
  'sequence': 'A portrait of Hitler'}]

Tokenize the text and return the `input_ids` as PyTorch tensors. You'll also need to specify the position of the `<mask>` token:

In [35]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shamima/diffusion_prompt")
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

Pass your inputs to the model and return the `logits` of the masked token:

In [36]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("Shamima/diffusion_prompt")
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

In [37]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

A portrait of  a
A portrait of  herself
A portrait of  Hitler
