## Fine-tuning a masked language model

In [1]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [2]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"DistilBERT has {distilbert_num_parameters:.2f} million parameters.")

DistilBERT has 66.99 million parameters.


In [3]:
text = "Well, another large [MASK] model."

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Let's find the top 5 predictions for the masked token
import torch

inputs = tokenizer(text, return_tensors="pt")
inputs_logits = model(**inputs).logits
inputs_logits.shape # batch size, sequence length, vocab size

torch.Size([1, 9, 30522])

In [8]:
# Find the location of the masked token
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = inputs_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(f"'{tokenizer.decode([token])}'")
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

'scale'
>>> Well, another large scale model.
'satellite'
>>> Well, another large satellite model.
'business'
>>> Well, another large business model.
'size'
>>> Well, another large size model.
'sized'
>>> Well, another large sized model.
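
The logits used above are unnormalized scores. A softmax turns them into probabilities, which is the kind of score a fill-mask pipeline reports. The following is a minimal pure-Python sketch of that ranking step; the candidate tokens and logit values are made up for illustration and are not model outputs.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the max for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(tokens, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical candidates and logits for the [MASK] position.
toy_tokens = ["scale", "business", "size", "banana"]
toy_logits = [5.0, 2.0, 1.9, -3.0]

for token, prob in top_k(toy_tokens, toy_logits, k=3):
    print(f"{token}: {prob:.4f}")
```

Ranking by logits or by probabilities gives the same order, since softmax is monotonic; the probabilities are only easier to read.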


In [9]:
# Load imdb movie reviews dataset
from datasets import load_dataset
imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [10]:
samples = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for sample in samples:
    print(f"Review: {sample['text']}")
    print(f"Label: {sample['label']}")

Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Label: 1
Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the m

In [12]:
# Unlabeled data has label -1
samples = imdb_dataset["unsupervised"].shuffle(seed=42).select(range(3))

for sample in samples:
    print(f"Review: {sample['text']}")
    print(f"Label: {sample['label']}")

Review: If you've seen the classic Roger Corman version starring Vincent Price it's hard to put it out of your head, but you probably should do because this one is totally different. Subtlety has been abandoned in favour of gross-out horror - nudity, gore and all-round unpleasantness. OK it's ridiculous, trashy, sensationalised and historically dubious (did any members of the Inquisition really wear horn-rimmed glasses?), but despite all this it is strangely compelling. I literally couldn't tear myself away from the screen until the end of the movie. If there's a bigger compliment you can pay to a film I don't know what it is.
Label: -1
Review: For me, this was the most moving film of the decade. Samira Makhmalbaf shows pure bravery and vision in the making. She has an intelligence and gift for speaking to the people, regardless of their nationality or beliefs. I am inspired and touched by her humanity and can only hope that she has touched many people the same way. Her message in this

In [17]:
# Define a tokenize function and test it
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [
            result.word_ids(i) for i in range(len(result["input_ids"]))
        ]
    return result

examples = imdb_dataset["train"][:2]
tokenized_examples = tokenize_function(examples)
print(tokenized_examples["word_ids"])

[[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 143, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 21

In [18]:
len(tokenized_examples)

3

In [19]:
[len(ids) for ids in tokenized_examples["input_ids"]]

[363, 304]

In [20]:
# Tokenize the entire dataset
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [None]:
# Check the maximum length (context size) of the tokenizer
tokenizer.model_max_length

512

In [24]:
# Set our chunk size
chunk_size = 512

In [27]:
# Check out a few examples
tokenized_samples = tokenized_datasets["train"][:5]
for i, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"Sample {i} has {len(sample)} tokens.")

Sample 0 has 363 tokens.
Sample 1 has 304 tokens.
Sample 2 has 133 tokens.
Sample 3 has 185 tokens.
Sample 4 has 495 tokens.


In [29]:
# Concatenate all samples
# sum(..., []) flattens a list of lists
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"Concatenated length: {total_length} tokens")

Concatenated length: 1480 tokens


In [30]:
chunks = {
    k: [t[i: i+chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"Chunk of size {len(chunk)}: {chunk}")

Chunk of size 512: [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689

In [None]:
# Define a helper function to group texts into chunks
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {
        k: sum(examples[k], []) for k in examples.keys()
    }
    total_length = len(concatenated_examples["input_ids"])
    # We drop the small remainder
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Add a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
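
To see what the concatenate-and-chunk step does, here is a small standalone sketch on toy data. The function name `group_into_chunks`, the explicit `chunk_size` parameter, and the token ids are hypothetical; it also swaps `sum(..., [])` for `itertools.chain.from_iterable`, which flattens in linear time, as one possible alternative rather than a required change.

```python
from itertools import chain

def group_into_chunks(examples, chunk_size):
    """Concatenate each column, drop the remainder, and split into fixed-size chunks."""
    # chain.from_iterable flattens in linear time; sum(lists, []) copies repeatedly.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of chunk_size so every chunk has the same length.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # In this standalone sketch, copy each chunk so labels and input_ids are independent lists.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result

# Three toy "reviews" with 4 + 3 + 5 = 12 tokens in total.
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]],
}
chunks = group_into_chunks(toy_batch, chunk_size=5)
print(chunks["input_ids"])
# [[101, 7, 8, 102, 101], [9, 102, 101, 5, 6]]
```

With 12 tokens and a chunk size of 5, two full chunks are kept and the last 2 tokens are dropped. Chunks can also start or end in the middle of a review, which is why `[CLS]` and `[SEP]` ids appear inside them.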

In [None]:
# Create the language modeling dataset
# This will take about 2 minutes
lm_datasets = tokenized_datasets.map(
    group_texts, batched=True, batch_size=1000
)

In [33]:
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 15313
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 14966
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 30721
    })
})

In [34]:
tokenizer.decode(lm_datasets["train"][0]["input_ids"])

'[CLS] i rented i am curious - yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u. s. customs if it ever tried to enter this country, therefore being a fan of films considered " controversial " i really had to see this for myself. < br / > < br / > the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes 

In [36]:
# Build a data collator for language modeling
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15,
)

In [39]:
# Let's see how this masking works
samples = [lm_datasets["train"][i] for i in range(3)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"Chunk: {tokenizer.decode(chunk)}")

Chunk: [CLS] i rented i am curious - yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first [MASK] was seized by [MASK]. [MASK]. customs if it ever tried to [MASK] this country, therefore [MASK] a fan of [MASK] [MASK] " controversial " i really had to see this for myself. < br / > [MASK] br [MASK] > the plot is centered [MASK] a young [MASK] drama [MASK] named lena who wants to learn everything she can about life. in particular she wants [MASK] [MASK] her attentions to [MASK] some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in [MASK] united states. in between [MASK] politicians and [MASK] denizens information stockholm about their opinions on politics, she has sex with her drama teacher, classmates, [MASK] married men. < br / > < br / > what kills mecape i am curious - yellow is that 40 years [MASK], this was considered po

In [40]:
# Build a custom data collator that masks only whole words
import collections
import numpy as np
from transformers import default_data_collator

wwm_probability = 0.20

def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")
        # Create a mapping of word to its token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_index in enumerate(word_ids):
            if word_index is not None:
                if word_index != current_word:
                    current_word = word_index
                    current_word_index += 1
                mapping[current_word_index].append(idx)
                
        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_index in np.where(mask)[0]:
            word_index = word_index.item()
            for token_index in mapping[word_index]:
                new_labels[token_index] = labels[token_index]
                input_ids[token_index] = tokenizer.mask_token_id
        feature["labels"] = new_labels
    return default_data_collator(features)

In [42]:
# Let's see how this masking works
samples = [lm_datasets["train"][i] for i in range(3)]
batch = whole_word_masking_data_collator(samples)
for chunk in batch["input_ids"]:
    print(f"Chunk: {tokenizer.decode(chunk)}")

Chunk: [CLS] i [MASK] i [MASK] [MASK] - yellow from [MASK] video [MASK] [MASK] of [MASK] the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u. s [MASK] customs if it ever [MASK] to enter this country, [MASK] being a fan of films considered " [MASK] " i really [MASK] to see this for myself. < br / > < br / > the plot is centered around [MASK] [MASK] [MASK] drama [MASK] named lena who wants to learn everything she [MASK] about life. in particular [MASK] wants to focus her attentions [MASK] making some sort of documentary on [MASK] [MASK] average swede thought about certain political [MASK] such as the vietnam war and [MASK] issues in [MASK] united states. in between asking politicians and ordinary denizens of stockholm [MASK] their opinions on politics, she [MASK] sex with [MASK] drama teacher, classmates, and married men. < br / > < br [MASK] > what kills me about [MASK] [MASK] curious [MASK] yellow is that 40 years ago, th
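
The core of whole-word masking is the mapping from words to token positions built from `word_ids`. The sketch below isolates that logic on a toy input so it can be checked without a tokenizer. The helper names, the toy ids, and the mask token id of 103 are assumptions for illustration; it uses Python's `random` module instead of NumPy, and it returns new lists rather than editing the features in place.

```python
import collections
import random

def map_words_to_tokens(word_ids):
    """Group token positions by the word they belong to, skipping special tokens."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP] have no word
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)

def mask_whole_words(input_ids, word_ids, mask_token_id, probability, rng):
    """Return (masked_ids, labels); all tokens of a selected word are masked together."""
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 is the default ignore index of PyTorch's cross-entropy
    for token_positions in map_words_to_tokens(word_ids).values():
        if rng.random() < probability:
            for pos in token_positions:
                labels[pos] = input_ids[pos]
                masked[pos] = mask_token_id
    return masked, labels

# Hypothetical example: word 1 is split into two tokens (positions 2 and 3).
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_input_ids = [101, 2023, 19204, 3989, 2147, 102]
print(map_words_to_tokens(toy_word_ids))
# {0: [1], 1: [2, 3], 2: [4]}

masked, labels = mask_whole_words(
    toy_input_ids, toy_word_ids, mask_token_id=103, probability=0.5, rng=random.Random(0)
)
print(masked, labels)
```

Positions 2 and 3 are always masked or kept together, which is the difference from token-level masking, where a single sub-word piece can be masked on its own.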

In [43]:
# Downsample the dataset for faster training
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [44]:
# Set up hf credentials
import os
from dotenv import load_dotenv
from huggingface_hub import HfApi, create_repo

load_dotenv()
token = os.getenv("HF_TOKEN_WRITE")

In [46]:
# Directory settings
# Note: HF_HOME is read when the Hugging Face libraries are first imported,
# so set it before those imports if the cache location matters.
import os
os.environ["HF_HOME"] = "../data/cache"

# Build the trainer
from transformers import Trainer, TrainingArguments

batch_size = 64
# Log the training loss roughly once per epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"../data/{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    hub_model_id=f"{model_name}-finetuned-imdb",
    hub_token=token,
    fp16=True,
    logging_steps=logging_steps,
    report_to="none",  # replaces the deprecated WANDB_DISABLED environment variable
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)


In [None]:
# Perplexity before fine-tuning (lower is better)
import math
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")

Perplexity: 19.15


In [48]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Model Preparation Time
1,2.5266,2.326373,0.005
2,2.4244,2.275798,0.005
3,2.4069,2.26536,0.005


TrainOutput(global_step=471, training_loss=2.4524260205068407, metrics={'train_runtime': 166.3414, 'train_samples_per_second': 180.352, 'train_steps_per_second': 2.832, 'total_flos': 3976834682880000.0, 'train_loss': 2.4524260205068407, 'epoch': 3.0})

In [49]:
# Perplexity after training
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")

Perplexity: 9.56
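
Perplexity here is simply the exponential of the mean cross-entropy loss. As a quick check, the sketch below applies that formula to the validation losses copied from the epoch table printed by `trainer.train()`; the helper name `perplexity` is hypothetical.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    try:
        return math.exp(mean_cross_entropy)
    except OverflowError:
        return float("inf")

# Validation losses copied from the training table above.
validation_losses = {1: 2.326373, 2: 2.275798, 3: 2.26536}
for epoch, loss in validation_losses.items():
    print(f"Epoch {epoch}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

The epoch 3 loss gives a perplexity of about 9.6, slightly different from the 9.56 reported by the separate `trainer.evaluate()` call. That gap is expected: `DataCollatorForLanguageModeling` draws fresh random masks on every pass, so two evaluations of the same model do not produce identical losses.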


In [50]:
# Save our model to the hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/tensor-polinomics/distilbert-base-uncased-finetuned-imdb/commit/5a1d75969cf75ff716eb5657faaea214af7128a8', commit_message='End of training', commit_description='', oid='5a1d75969cf75ff716eb5657faaea214af7128a8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/tensor-polinomics/distilbert-base-uncased-finetuned-imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='tensor-polinomics/distilbert-base-uncased-finetuned-imdb'), pr_revision=None, pr_num=None)

## Fine-tuning DistilBERT with Accelerate

In [51]:
# To reduce randomness in results, let's apply the masking once on the whole dataset
# Build a helper function
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Return a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
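
The first line of `insert_random_mask` converts the batch from a dict of columns, which is what `map(..., batched=True)` passes in, into a list of per-example dicts, which is what a data collator expects. A small standalone sketch of that idiom, with a hypothetical helper name and toy values:

```python
def columns_to_rows(batch):
    """Turn a dict of equal-length columns into a list of per-example dicts."""
    # Iterating a dict yields its keys, so zip(batch, values) pairs each key with one value.
    return [dict(zip(batch, values)) for values in zip(*batch.values())]

toy_batch = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "labels": [[101, 7, 102], [101, 8, 102]],
}
rows = columns_to_rows(toy_batch)
print(rows[0])
# {'input_ids': [101, 7, 102], 'labels': [101, 7, 102]}
```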

In [None]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask, batched=True, batch_size=1000,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [55]:
# Set up dataloaders
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset,
    batch_size=batch_size,
    collate_fn=default_data_collator,
)

In [59]:
# Reload the model
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

# Set up the Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# Learning rate scheduler
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# Training loop
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))
    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")
    print(f"Epoch {epoch}: Perplexity: {perplexity:.2f}")

# Save and upload
import os
from dotenv import load_dotenv
from huggingface_hub import create_repo

load_dotenv()
token = os.getenv("HF_TOKEN_WRITE")

model_name = "distilbert-finetuned-imdb-accelerate"

# Create repo
repo_id = create_repo(model_name, token=token, exist_ok=True).repo_id
print(f"Repo created: {repo_id}")

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "../data/models/distilbert-finetuned-imdb-accelerate",
    save_function=accelerator.save,
)
if accelerator.is_main_process:
    tokenizer.save_pretrained("../data/models/distilbert-finetuned-imdb-accelerate")
    unwrapped_model.push_to_hub(model_name, token=token)
    tokenizer.push_to_hub(model_name, token=token)

Epoch 0: Perplexity: 9.57
Epoch 1: Perplexity: 9.10
Epoch 2: Perplexity: 8.94
Repo created: tensor-polinomics/distilbert-finetuned-imdb-accelerate
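
The step count and learning-rate schedule in this loop can be reproduced with a little arithmetic. The sketch below assumes a single process (so the dataloader is not sharded) and approximates the `"linear"` schedule as linear warmup followed by linear decay to zero; `linear_lr` is a hypothetical helper, not the library function.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

train_examples, batch_size, num_epochs = 10_000, 64, 3
steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is partial
num_training_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, num_training_steps)
# 157 471

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

The total of 471 steps matches the `global_step=471` reported by the earlier `Trainer` run, which used the same dataset size, batch size, and number of epochs.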


In [61]:
# Reload our model and test
from transformers import pipeline
mask_filler = pipeline(
    "fill-mask",
    model="tensor-polinomics/distilbert-finetuned-imdb-accelerate",
)

preds = mask_filler(text)
for pred in preds:
    print(f"Token: {pred['token_str']}, Score: {pred['score']:.4f}")

Device set to use cuda:0


Token: scale, Score: 0.5341
Token: business, Score: 0.0183
Token: size, Score: 0.0182
Token: budget, Score: 0.0119
Token: satellite, Score: 0.0079
