In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"

In [2]:
import math
import transformers
from datasets import load_dataset
from pathlib import Path
from transformers import AutoTokenizer, TrainingArguments, Trainer

# Fine-tuning a language model

## Preparing the dataset

As an example, we will use a dump of English Wikipedia (`wiki-20220301-en`) stored locally as Parquet files. You can load it very easily with the 🤗 Datasets library.

In [3]:
files = list(map(str, Path("data/wiki-20220301-en").glob("*.parquet")))
datasets = load_dataset("parquet", data_files=files, split="train")


In [4]:
datasets[10]

{'id': '63665707',
 'url': 'https://en.wikipedia.org/wiki/Sardiha%20railway%20station',
 'title': 'Sardiha railway station',
 'text': 'Sardiha railway station is a railway station on Howrah–Nagpur–Mumbai line under Kharagpur railway division of South Eastern Railway zone. It is situated at Dhadkinala in Jhargram district in the Indian state of West Bengal. It is  from Kharagpur Junction.\n\nReferences\n\nRailway stations in Jhargram district\nKharagpur railway division\nJhargram district'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [5]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(datasets)

Unnamed: 0,id,url,title,text
0,19514703,https://en.wikipedia.org/wiki/Vorab%20Glacier,Vorab Glacier,"The Vorab Glacier (, rarely Vorabfiren, Romansh: Glatscher dil Vorab) is a 2 km long glacier (as of 2005) situated in the Glarus Alps in the cantons of Glarus and Graubünden. It lies on the east side of the Vorab, between 2,600 and 3,000 metres above sea level. In 1973 it had an area of 2.17 km2., and is receding today.\n\nThe glacier is part of the Flims-Laax ski resort complex, and may be skied in the winter.\n\nSee also\nList of glaciers in Switzerland\nSwiss Alps\n\nReferences\n\nExternal links\nSwiss glacier monitoring network\n\nGlaciers of Switzerland\nGlaciers of the Alps\nGlarus–Graubünden border"
1,37115725,https://en.wikipedia.org/wiki/Kasbah%20of%20Tifoultoute,Kasbah of Tifoultoute,"The Kasbah of Tifoultoute is a kasbah in Ouarzazate Province, Morocco located west of the city of Ouarzazate.\n\nHistory \nThis fortress belonged to the family of Thami El Glaoui, Pasha of Marrakech from 1912 to 1956.\n\nGallery\n\nReferences \n\nTifoultoute\nForts in Morocco\nBuildings and structures in Drâa-Tafilalet"
2,48777199,https://en.wikipedia.org/wiki/Manifold%20regularization,Manifold regularization,"In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.\n\nManifold regularizer\n\nMotivation \n\nManifold regularization is a type of regularization, a family of techniques that reduces overfitting and ensures that a problem is well-posed by penalizing complex solutions. In particular, manifold regularization extends the technique of Tikhonov regularization as applied to Reproducing kernel Hilbert spaces (RKHSs). Under standard Tikhonov regularization on RKHSs, a learning algorithm attempts to learn a function from among a hypothesis space of functions . 
The hypothesis space is an RKHS, meaning that it is associated with a kernel , and so every candidate function has a norm , which represents the complexity of the candidate function in the hypothesis space. When the algorithm considers a candidate function, it takes its norm into account in order to penalize complex functions.\n\nFormally, given a set of labeled training data with and a loss function , a learning algorithm using Tikhonov regularization will attempt to solve the expression\n\n \n\nwhere is a hyperparameter that controls how much the algorithm will prefer simpler functions to functions that fit the data better.\n\nManifold regularization adds a second regularization term, the intrinsic regularizer, to the ambient regularizer used in standard Tikhonov regularization. Under the manifold assumption in machine learning, the data in question do not come from the entire input space , but instead from a nonlinear manifold . The geometry of this manifold, the intrinsic space, is used to determine the regularization norm.\n\nLaplacian norm \n\nThere are many possible choices for . Many natural choices involve the gradient on the manifold , which can provide a measure of how smooth a target function is. A smooth function should change slowly where the input data are dense; that is, the gradient should be small where the marginal probability density , the probability density of a randomly drawn data point appearing at , is large. This gives one appropriate choice for the intrinsic regularizer:\n\n \n\nIn practice, this norm cannot be computed directly because the marginal distribution is unknown, but it can be estimated from the provided data.\n\nGraph-based approach of the Laplacian norm \n\nWhen the distances between input points are interpreted as a graph, then the Laplacian matrix of the graph can help to estimate the marginal distribution. 
Suppose that the input data include labeled examples (pairs of an input and a label ) and unlabeled examples (inputs without associated labels). Define to be a matrix of edge weights for a graph, where is a distance measure between the data points and . Define to be a diagonal matrix with and to be the Laplacian matrix . Then, as the number of data points increases, converges to the Laplace–Beltrami operator , which is the divergence of the gradient . Then, if is a vector of the values of at the data, , the intrinsic norm can be estimated:\n\n \n\nAs the number of data points increases, this empirical definition of converges to the definition when is known.\n\nSolving the regularization problem with graph-based approach \n\nUsing the weights and for the ambient and intrinsic regularizers, the final expression to be solved becomes:\n\n \n\nAs with other kernel methods, may be an infinite-dimensional space, so if the regularization expression cannot be solved explicitly, it is impossible to search the entire space for a solution. Instead, a representer theorem shows that under certain conditions on the choice of the norm , the optimal solution must be a linear combination of the kernel centered at each of the input points: for some weights ,\n\n \n\nUsing this result, it is possible to search for the optimal solution by searching the finite-dimensional space defined by the possible choices of .\n\nFunctional approach of the Laplacian norm \n\nThe idea beyond graph-Laplacian is to use neighbors to estimate Laplacian. 
\nThis method is akin local averaging methods, that are known to scale poorly in high-dimensional problem.\nIndeed, graph Laplacian is known to suffer from the curse of dimensionality.\nLuckily, it is possible to leverage expected smoothness of the function to estimate thanks to more advanced functional analysis.\nThis method consists in estimating the Laplacian operator thanks to derivatives of the kernel reading where denotes the partial derivatives according to the j-th coordinate of the first variable.\nThis second approach of the Laplacian norm is to put in relation with meshfree methods, that constrast with the finite difference method in PDE.\n\nApplications \n\nManifold regularization can extend a variety of algorithms that can be expressed using Tikhonov regularization, by choosing an appropriate loss function and hypothesis space . Two commonly used examples are the families of support vector machines and regularized least squares algorithms. (Regularized least squares includes the ridge regression algorithm; the related algorithms of LASSO and elastic net regularization can be expressed as support vector machines.) The extended versions of these algorithms are called Laplacian Regularized Least Squares (abbreviated LapRLS) and Laplacian Support Vector Machines (LapSVM), respectively.\n\nLaplacian Regularized Least Squares (LapRLS) \n\nRegularized least squares (RLS) is a family of regression algorithms: algorithms that predict a value for its inputs , with the goal that the predicted values should be close to the true labels for the data. In particular, RLS is designed to minimize the mean squared error between the predicted values and the true labels, subject to regularization. Ridge regression is one form of RLS; in general, RLS is the same as ridge regression combined with the kernel method. 
The problem statement for RLS results from choosing the loss function in Tikhonov regularization to be the mean squared error:\n\n \n\nThanks to the representer theorem, the solution can be written as a weighted sum of the kernel evaluated at the data points:\n\n \n\nand solving for gives:\n\n \n\nwhere is defined to be the kernel matrix, with , and is the vector of data labels.\n\nAdding a Laplacian term for manifold regularization gives the Laplacian RLS statement:\n\n \n\nThe representer theorem for manifold regularization again gives\n\n \n\nand this yields an expression for the vector . Letting be the kernel matrix as above, be the vector of data labels, and be the block matrix :\n\n \n\nwith a solution of\n\n \n\nLapRLS has been applied to problems including sensor networks,\nmedical imaging,\nobject detection,\nspectroscopy,\ndocument classification,\ndrug-protein interactions,\nand compressing images and videos.\n\nLaplacian Support Vector Machines (LapSVM) \n\nSupport vector machines (SVMs) are a family of algorithms often used for classifying data into two or more groups, or classes. Intuitively, an SVM draws a boundary between classes so that the closest labeled examples to the boundary are as far away as possible. This can be directly expressed as a linear program, but it is also equivalent to Tikhonov regularization with the hinge loss function, :\n\n \n\nAdding the intrinsic regularization term to this expression gives the LapSVM problem statement:\n\n \n\nAgain, the representer theorem allows the solution to be expressed in terms of the kernel evaluated at the data points:\n\n \n\n can be found by writing the problem as a linear program and solving the dual problem. 
Again letting be the kernel matrix and be the block matrix , the solution can be shown to be\n\n \n\nwhere is the solution to the dual problem\n\nand is defined by\n\n \n\nLapSVM has been applied to problems including geographical imaging,\nmedical imaging,\nface recognition,\nmachine maintenance,\nand brain–computer interfaces.\n\nLimitations \n\n Manifold regularization assumes that data with different labels are not likely to be close together. This assumption is what allows the technique to draw information from unlabeled data, but it only applies to some problem domains. Depending on the structure of the data, it may be necessary to use a different semi-supervised or transductive learning algorithm.\n In some datasets, the intrinsic norm of a function can be very close to the ambient norm : for example, if the data consist of two classes that lie on perpendicular lines, the intrinsic norm will be equal to the ambient norm. In this case, unlabeled data have no effect on the solution learned by manifold regularization, even if the data fit the algorithm's assumption that the separator should be smooth. Approaches related to co-training have been proposed to address this limitation.\n If there are a very large number of unlabeled examples, the kernel matrix becomes very large, and a manifold regularization algorithm may become prohibitively slow to compute. Online algorithms and sparse approximations of the manifold may help in this case.\n\nSoftware \n The ManifoldLearn library and the Primal LapSVM library implement LapRLS and LapSVM in MATLAB.\n The Dlib library for C++ includes a linear manifold regularization function.\n\nSee also \n Manifold learning\n Manifold hypothesis\n Semi-supervised learning\n Transduction (machine learning)\n Spectral graph theory\n Reproducing kernel Hilbert space\n Tikhonov regularization\n Differential geometry\n\nReferences \n\nMachine learning"
3,11916618,https://en.wikipedia.org/wiki/Information%20rights%20management,Information rights management,"Information rights management (IRM) is a subset of digital rights management (DRM), technologies that protect sensitive information from unauthorized access. It is sometimes referred to as E-DRM or Enterprise Digital Rights Management. This can cause confusion, because digital rights management (DRM) technologies are typically associated with business-to-consumer systems designed to protect rich media such as music and video. IRM is a technology which allows for information (mostly in the form of documents) to be ‘remote controlled’.\n\nThis means that information and its control can now be separately created, viewed, edited and distributed. A true IRM system is typically used to protect information in a business-to-business model, such as financial data, intellectual property and executive communications. IRM currently applies mainly to documents and emails.\n\nFeatures\nIRM technologies typically have a number of features that allow an owner to control, manage and secure information from unwanted access.\n\nInformation encryption\nInformation rights management solutions use encryption to prevent unauthorized access. A key or password can be used to control access to the encrypted data.\n\nPermissions management\nOnce a document is encrypted against unauthorized users, an IRM user can apply certain access permissions that permit or deny a user from taking certain actions on a piece of information. Some of these standard permissions are included below. 
\n Strong in use protection, such as controlling copy & paste, preventing screenshots, printing, editing.\n A rights model/policy which allows for easy mapping of business classifications to information.\n Offline use allowing for users to create/access IRM sealed documents without needing network access for certain periods of time.\n Full auditing of both access to documents as well as changes to the rights/policy by business users.\n\nIt also allows users to change or revoke access permissions without sharing the document again.\n\nExamples \nAn example of IRM in use would be to secure a sensitive engineering document being distributed in an environment where the document's recipients could not necessarily be trusted.\n\nAlternatively, an e-mail could be secured with IRM. If an email is accidentally forwarded to an untrusted party, only authorized users can gain access. A well designed IRM system will not limit the ability for information to be shared. Rules are enforced only when people attempt to gain access. This is important as often people share sensitive information with users who should legitimately have access but don't. Technology must facilitate control over sensitive information in such a situation.\n\nIRM is far more secure than shared secret passwords. Key management is used to protect the information whilst it is at rest on a hard disk, network drive or other storage device. IRM continues to protect and control access to the document when it is in use. 
Functionality such as preventing screen shots, disallowing the copying of data from the secure document to an insecure environment and guarding the information from programmatic attack, are key elements of an effective IRM solution.\n\nNaming conventions\nInformation rights management is also known by the following names:\n Enterprise Rights Management\n Enterprise DRM or Enterprise Digital Rights Management\n Document Rights Management\n Intelligent Rights Management\n\nSee also\n Digital rights management\n Always-on DRM\n Copyright infringement\n Encryption\n Advanced Encryption Standard\n Rpmsg\n\nReferences\n\nDigital rights management"
4,15264516,https://en.wikipedia.org/wiki/Oddzar,Oddzar,"Oddzar is an American rock band formed in 1999 in Columbia, Maryland. Their original line-up consisted of high school friends Russ Eckell (vocals), David Nenner (guitar), Travis Lockhart (bass), and Blake Silvea (drums).\n\nThe four were heavily influenced by funk metal bands such as Red Hot Chili Peppers and Rage Against the Machine, as well as Pearl Jam. Shortly after their formation, Nenner left the group to form Truth Be Told and was replaced by Greg Jung in 2000.\n\nThe quartet reworked their style, drawing from influences such as Tool and Muse. ""We felt a need to avoid musical trends such as pop-punk, emo, and rap-metal,"" said Eckell. In 2002, the band was signed to DCide Records of Nothingface fame. The group began working on their self-titled debut, produced and engineered by Drew Mazurek (Linkin Park). Oddzar was released in 2004, four months after Greg Jung’s departure from the band in May. He was replaced by Greg Loman.\n\n“Abandoned Road” and ""Spell,"" songs from the first album were featured on MTV’s Road Rules in 2005.\n\nIn 2005, Travis Lockhart left Oddzar to pursue other interests. The band used Ellis Tinsley as a stand-in bassist before settling on University of Maryland student Trevor Olexy as their fourth member.\n\nOn November 19, 2007, Oddzar released a demo from their then untitled second album. The song was entitled “Ready the Chariot” and explored more progressive terrain than their earlier work.\n\nIn 2008 ""Until it Does"" a song from Oddzar's first album was used in the soundtrack of the eighth episode of The Real World: Hollywood, ""Arrival and Departure.""\n\nOddzar recorded their second album, tentatively entitled 'Rise' in late April 2008 at Mad Oak Studios in Allston, Massachusetts. The record was produced by Evan Anderson and engineered and mixed by Benny Grotto. The record was mastered in late September 2008 at Peerless Mastering (also in Boston, MA). 
It has been scheduled to be released on January 29, 2009 under the name Ready the Chariot.\n\nOn October 1, 2008, Oddzar released a track from their forthcoming records called D.O.D. (Dogs of Demikhov) on their Myspace page.\n\nAllmusic has said the band is “well worth keeping an eye on.”\n\nReferences\n\nHeavy metal musical groups from Maryland"
5,50816341,https://en.wikipedia.org/wiki/Kainz,Kainz,"Kainz is an Austrian and German surname. Notable people with the surname include:\n\n Florian Kainz (born 1992), Austrian football midfielder\n Howard P. Kainz (born 1933), American professor emeritus\n Tobias Kainz (born 1992), Austrian footballer\n Adolf Kainz (1903–1948), Austrian canoeist\n Josef Kainz (1858–1910), Austrian actor\n Wolfgang Kainz (born 1967), Austrian scientist"
6,53725691,https://en.wikipedia.org/wiki/Dafna%20%28given%20name%29,Dafna (given name),"Dafna (), is a feminine given name. Notable people with the given name include:\n\nDafna Dekel, Israeli singer and actress\nDafna Kaffeman, Israeli artist\nDafna Lemish, Israeli-American media researcher\nDafna Linzer, journalist\nDafna Rechter, Israeli actress and singer\n\nHebrew feminine given names"
7,35413031,https://en.wikipedia.org/wiki/Kashanak%2C%20Tehran,"Kashanak, Tehran","Kashanak (, also Romanized as Kāshānak) is a village in Khalazir Rural District, Aftab District, Tehran County, Tehran Province, Iran. At the 2006 census, its population was 421, in 110 families.\n\nReferences \n\nPopulated places in Tehran County"
8,56465613,https://en.wikipedia.org/wiki/Mark%20Stein%20%28American%20football%29,Mark Stein (American football),"Mark Stein is an American football coach. He is the head football coach at Martin Luther College in New Ulm, Minnesota, a position he has held since 2015. Inheriting a depleted roster during his first seasons, Stein has led a rebuilding of the MLC program, highlighted by being named Upper Midwest Athletic Conference Coach of the Year in 2017.\n\nHead coaching record\n\nCollege\n\nReferences\n\nExternal links\n Martin Luther profile\n\nYear of birth missing (living people)\nLiving people\nMartin Luther Knights football coaches\nHigh school football coaches in Wisconsin"
9,804176,https://en.wikipedia.org/wiki/San%20Francisco%20Botanical%20Garden,San Francisco Botanical Garden,"The San Francisco Botanical Garden at Strybing Arboretum (formerly Strybing Arboretum) is located in San Francisco's Golden Gate Park. Its 55 acres (22.3 ha) represents nearly 9,000 different kinds of plants from around the world, with particular focus on Magnolia species, high elevation palms, conifers, and cloud forest species from Central America, South America and Southeast Asia.\n\nSan Francisco's County Fair Building is located near the main entrance to the Garden.\n\nHistory\n\nPlans for the garden were originally laid out in the 1880s by park supervisor John McLaren, but funding was insufficient to begin construction until Helene Strybing left a major bequest in 1927. Planting was begun in 1937 with WPA funds supplemented by local donations, and the Arboretum officially opened in May 1940. As a part of Golden Gate Park, it is officially managed by the city of San Francisco, but the San Francisco Botanical Garden Society plays an important role in providing educational programs, managing volunteers, curatorial staff, and more. Formed in 1955, the San Francisco Botanical Garden Society (formerly the Strybing Arboretum Society) operates the Helen Crocker Russell Library of Horticulture, Garden Bookstore, and monthly plant sales, and offers a wide range of community education programs for children and adults. 
The Society also raises money for new projects and Garden renovations.\n\nIn 2004, Strybing Arboretum changed its name to San Francisco Botanical Garden at Strybing Arboretum, and the Arboretum Society followed suit, becoming San Francisco Botanical Garden Society at Strybing Arboretum.\n\nPlant collections\n\nThe gardens are organized into several specialized collections:\n Mediterranean\nCalifornia Native\n John Muir Nature Trail\n Redwood Grove\n Chile\n South Africa\n Australia\n Mediterranean Basin Region \n Mild-temperate climate \n New Zealand\n Moon-viewing Garden - a Japanese design\n Temperate Asia Garden\n Montane tropic \n Mesoamerican Cloud Forest\n Southeast Asian Cloud Forest (in development)\n Andean Cloud Forest (in development)\n Specialty collections\n Ancient Plant Garden \n Succulent garden \n Dwarf Conifer garden\n Exhibition Garden\n Garden of Fragrance\n Zellerbach Garden of Perennials\n Dry Mexico\n Rhododendron Garden\n Magnolias & Camellias (found in many collections)\n\nThe mild Mediterranean climate is ideal for plants from surprisingly many parts of the world; the arboretum does not include greenhouses for species requiring other climate types.\n\nSee also \n\n California native plants\n List of botanical gardens in the United States\n North American Plant Collections Consortium\n 49-Mile Scenic Drive\n\nReferences\n\nExternal links \n San Francisco Botanical Garden homepage\n""Pianos take over SF botanical gardens for 'Flower Piano' event"" KTVU, July 2018 \n""SF Botanical Garden digs its volunteers who get hands dirty"" SF Gate, June 2018\n""The Bay Area’s Largest Plant Sale Returns to Golden Gate Park"" SF Station, April 2018\n""How the Wealthy Stole 55 Acres of Golden Gate Park"" Medium, July 19, 2013\n\nArboreta in California\nBotanical gardens in California\nGolden Gate Park\nLandmarks in San Francisco\nGardens in San Francisco"


As we can see, each row is one Wikipedia article with `id`, `url`, `title`, and `text` fields. Some of the texts are long, multi-section articles while others are stubs of only a sentence or two.

## Masked language modeling

For masked language modeling (MLM), we tokenize the texts and group them into fixed-size blocks, with one additional step: we randomly mask some tokens (by replacing them with `[MASK]`), and the labels are adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).
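
To make the label adjustment concrete, here is a minimal pure-Python sketch of the idea. It is an illustration, not the library's implementation: the function name `mask_for_mlm` and the toy token ids are made up, and the only convention borrowed from PyTorch is that a label of `-100` is skipped by the cross-entropy loss.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_for_mlm(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)       # hide the token from the model
            labels.append(tok)           # and ask the model to recover it
        else:
            inputs.append(tok)           # token stays visible
            labels.append(IGNORE_INDEX)  # no loss is computed at this position
    return inputs, labels


# Hypothetical ids: 0 stands in for the [MASK] token, 10..19 for ordinary tokens.
toy_ids = list(range(10, 20))
toy_inputs, toy_labels = mask_for_mlm(toy_ids, mask_id=0, mask_prob=0.3, rng=random.Random(0))
print(toy_inputs)
print(toy_labels)
```

Every position ends up in exactly one of two states: masked in the input with the original id as its label, or untouched in the input with an ignored label.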

We will use a DeBERTa-v3-large checkpoint stored locally at `model/deberta-v3-large-hf-weights` for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead:

In [7]:
model_checkpoint = "model/deberta-v3-large-hf-weights"

The tokenization function simply runs the tokenizer over the `text` column; the tokenizer itself is loaded from the checkpoint we just picked:

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
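# Drop every original column (including "id") so that only the tokenizer outputs reach group_texts.
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=12, remove_columns=["id", "url", "title", "text"])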

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [None]:
for k in next(iter(tokenized_datasets)).keys():
    print(k)

In [None]:
block_size = 128

In [None]:
def group_texts(examples):
    """Regroup a batch of tokenized texts into fixed-size blocks suitable for training."""
    # Concatenate every tokenized field (input_ids, attention_mask, ...) across the batch.
    concatenated_examples = {k: [] for k in examples.keys()}
    for k in examples.keys():
        for tokens in examples[k]:
            concatenated_examples[k].extend(tokens)
    # Measure the length on input_ids explicitly rather than on whichever column comes first.
    total_length = len(concatenated_examples["input_ids"])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # The MLM data collator overwrites these labels when it applies the random masking.
    result["labels"] = result["input_ids"].copy()
    return result
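
A toy run may help to see what this grouping does. The sketch below is a standalone re-implementation with a hypothetical name (`group_into_blocks`) and made-up token ids, so it can be executed without the dataset or the tokenizer.

```python
def group_into_blocks(batch, block_size):
    """Concatenate each field of `batch`, then cut it into blocks of `block_size`."""
    flat = {key: [tok for seq in seqs for tok in seq] for key, seqs in batch.items()}
    usable = (len(flat["input_ids"]) // block_size) * block_size  # drop the remainder
    return {
        key: [toks[i : i + block_size] for i in range(0, usable, block_size)]
        for key, toks in flat.items()
    }


# Three made-up "documents" with 3 + 5 + 2 = 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Two things are worth noticing: a block can span the boundary between two documents, and the last two tokens (9 and 10) are dropped because they do not fill a whole block.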

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=12,
)

The rest follows the standard `Trainer` workflow, with two MLM-specific pieces. First, we use a model suitable for masked LM:

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We define our `TrainingArguments`:

In [None]:
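model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    # `output_dir` is the first positional parameter, so it must be given only once;
    # the descriptive name goes into `run_name` instead.
    output_dir="./save_checkpoints",
    run_name=f"{model_name}-finetuned-wiki-sci",
    evaluation_strategy="epoch",
    warmup_ratio=0.1,
    learning_rate=5e-6,
    weight_decay=0.01,
    bf16=True,
    save_total_limit=1,
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    auto_find_batch_size=True,  # starts at 512 per device and shrinks on out-of-memory errors
    num_train_epochs=20,
    load_best_model_at_end=True,
    per_device_train_batch_size=512,
)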
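
To get a feel for what `warmup_ratio=0.1` and `num_train_epochs=20` imply, the following back-of-the-envelope sketch estimates the number of optimizer steps. It rests on several assumptions that may not hold in your run: no gradient accumulation, both GPUs selected at the top of the notebook are used, the batch size is not reduced by `auto_find_batch_size`, and the number of training blocks (here 1,000,000) is purely hypothetical. The helper name `schedule_summary` is also made up.

```python
import math


def schedule_summary(num_examples, per_device_batch_size, num_devices, num_epochs, warmup_ratio):
    """Estimate optimizer steps and warmup steps, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Hypothetical numbers: 1,000,000 training blocks on two GPUs.
effective_batch, total_steps, warmup_steps = schedule_summary(1_000_000, 512, 2, 20, 0.1)
print(effective_batch, total_steps, warmup_steps)
```

With a learning rate as small as `5e-6`, the warmup phase covers roughly the first tenth of these steps before the rate starts to decay.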

Finally, we use a special `data_collator`. The `data_collator` is a function responsible for taking the samples and batching them into tensors. The default collator has nothing special to do beyond that, but here we want it to do the random masking. We could do the masking as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure the random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
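
By default, the collator does not always substitute the mask token: of the positions selected for prediction, about 80% are replaced with the mask token, about 10% with a random token, and about 10% are left unchanged. The pure-Python sketch below mimics that rule and shows that a fresh draw is made on every pass. It is only an approximation under stated assumptions: the function names, token ids, `mask_id=4`, and `vocab_size=1000` are hypothetical, and unlike the real collator it does not exclude special tokens from selection.

```python
import random


def corrupt_selected_token(token_id, mask_id, vocab_size, rng):
    """Apply the 80/10/10 rule to a token that was already selected for prediction."""
    roll = rng.random()
    if roll < 0.8:
        return mask_id                    # 80%: replace with the mask token
    if roll < 0.9:
        return rng.randrange(vocab_size)  # 10%: replace with a random token
    return token_id                       # 10%: leave the token unchanged


def dynamic_mask(token_ids, mask_id, vocab_size, mlm_probability, rng):
    """Draw a fresh mask for one pass over the data; returns (inputs, labels)."""
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(corrupt_selected_token(tok, mask_id, vocab_size, rng))
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(-100)  # ignored by the loss
    return inputs, labels


rng = random.Random(0)
sequence = list(range(100, 140))  # hypothetical token ids
for epoch in range(2):
    inputs, labels = dynamic_mask(sequence, mask_id=4, vocab_size=1000, mlm_probability=0.15, rng=rng)
    picked = [i for i, lab in enumerate(labels) if lab != -100]
    print(f"epoch {epoch}: positions selected for prediction = {picked}")
```

Because the random generator keeps advancing, the selected positions generally differ from one epoch to the next, which is exactly what masking at collation time buys us.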

Then we hold out 5% of the blocks for evaluation, pass everything to `Trainer`, and begin training:

In [None]:
train_testvalid = lm_datasets.train_test_split(test_size=0.05)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_testvalid["train"],
    eval_dataset=train_testvalid["test"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

Once training finishes, we can evaluate our model on the validation set. The perplexity is typically much lower than for a causal language modeling (CLM) objective because, for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
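
As a reminder of what this number means, perplexity is the exponential of the mean per-token cross-entropy, and for MLM that mean is taken over the masked positions only. The small sketch below (with a hypothetical helper name and made-up loss values) checks the formula on a case where the answer is known in advance.

```python
import math


def perplexity(token_losses):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))


# A model that guesses uniformly among 50 candidates has a loss of ln(50) per token,
# so its perplexity should come out at about 50.
uniform_losses = [math.log(50)] * 8
print(round(perplexity(uniform_losses), 2))
```

A perplexity of 50 can therefore be read as the model being, on average, as uncertain as if it were choosing uniformly among 50 tokens at each masked position.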

You can now upload the result of the training to the Hub by executing this instruction: