In [1]:
! pip install datasets transformers



In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [4]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [6]:
import transformers

print(transformers.__version__)

4.52.2


# Fine-tuning a language model

## Language Model Fine-tuning

This notebook details how to fine-tune a [🤗 Transformers](https://github.com/huggingface/transformers) model for language modeling.

### Task Types

*   **Causal Language Modeling (CLM):** Predicts the next token in a sequence. A causal attention mask prevents the model from attending to tokens that come later.

    ![Widget inference representing the causal language modeling task](https://github.com/huggingface/notebooks/blob/master/examples/images/causal_language_modeling.png?raw=1)

*   **Masked Language Modeling (MLM):** Predicts masked tokens using surrounding context.

    ![Widget inference representing the masked language modeling task](https://github.com/huggingface/notebooks/blob/master/examples/images/masked_language_modeling.png?raw=1)
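To make the difference between the two tasks concrete, here is a minimal sketch in plain Python of what the model is allowed to see in each case. It does not call 🤗 Transformers; the helper name `causal_mask`, the toy sentence, and the masked position are all made up for illustration.

```python
def causal_mask(seq_len):
    """Row i lists which positions token i may attend to (1 = visible)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

tokens = ["The", "cat", "sat", "down"]

# CLM: each token only sees itself and the tokens before it.
mask = causal_mask(len(tokens))
for token, row in zip(tokens, mask):
    print(f"{token:>5}: {row}")

# MLM: every position stays visible, but the token to predict is hidden.
masked_position = 2
mlm_input = [t if i != masked_position else "[MASK]" for i, t in enumerate(tokens)]
print(mlm_input)
```

The lower-triangular pattern is what keeps a causal model from looking ahead, while the masked input shows that an MLM model can use context on both sides of the hidden token.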

### Workflow

We'll demonstrate dataset loading, preprocessing, and using the `Trainer` API.

Find a runnable script for distributed/TPU use in our [examples folder](https://github.com/huggingface/transformers/tree/master/examples).

## Preparing the dataset

For each of those tasks, we will use the Wikitext 2 dataset as an example. You can load it with the 🤗 Datasets library.

In [2]:
# !pip install --upgrade datasets

In [1]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

In [None]:
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

In [3]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [4]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [5]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,† Marchandia \n
1,""" Joyful , Joyful "" is a song with a length of four minutes and twenty @-@ eight seconds . According to the sheet music published by Musicnotes.com , "" Joyful , Joyful "" is a CCM and alternative CCM set in common time in the key of F major with a tempo of 120 beats per minute . Mark Hall 's vocal range in the song spans from the low note of B ♭ 3 to the high note of F5 . The song has regarded as a re @-@ invention of "" Joyful , Joyful We Adore Thee "" and Beethoven 's Symphony No. 9 , the song alters the format of the former , rearranging the song 's overall structure while adding a chorus . "" Joyful , Joyful "" is led by a "" driving "" and "" pulsing "" string section that has been compared to Coldplay 's "" Viva la Vida "" . Mark Hall felt that the band 's arrangement brought out the message of one of the song 's final verses ( "" God our Father / Christ our brother / all who live in love are thine / teach us how to love each other / and fill us to the joy divine "" ) ; Hall described the message by saying "" God 's our father and Christ 's our brother , we have this connection with God . But if we can 't love each other , the joy isn 't completed . Its not real joy yet until we know how to love the people that are around us "" . \n"
2,"Scheduling the concert meant a financial loss of £ 500 @,@ 000 for the band , despite sponsorship from Coca @-@ Cola and GSM . Ticket prices were set at just DM 8 ( £ 8 , US $ 18 ) , because of the 50 percent unemployment rate in the city . Bono offered for the group to perform a benefit concert or small show in Sarajevo , but the city requested they hold the full PopMart show . Bono said , "" We offered to do a charity gig here , just turn up and do a scratch gig , but they wanted the whole fucking thing . They wanted the lemon ! "" McGuinness added , "" we felt it was important that we treat this as another city on the tour , to pay them that respect . To come here and not do the whole show would have been rude . "" According to news releases following the concert , the total net income for the show was US $ 13 @,@ 500 ; however , tour promoter John Giddings noted that price did not include the costs of the production or transportation . \n"
3,
4,"The couple , depicted in the centre , are accompanied by a host of divinities and other celestial beings . The god Vishnu and his wife Lakshmi are often pictured as giving away the bride to Shiva . The god Brahma is shown as the officiating priest . \n"
5,"Believing the glee club members are becoming complacent ahead of the forthcoming sectionals , director Will Schuester ( Matthew Morrison ) divides the club into boys against girls for a mash @-@ up competition . Cheerleading coach Sue Sylvester ( Jane Lynch ) observes that head cheerleader Quinn Fabray 's ( Dianna Agron ) performance standards are slipping . When Quinn blames her tiredness on her glee club participation , Sue renews her resolve to destroy the club , planning to sabotage Will 's personal life . \n"
6,"Biographer Weissweiler does not dismiss the possibility that Busch 's increasing alcohol dependence hindered self @-@ criticism . He refused invitations to parties , and publisher Otto Basserman sent him to Wiedensahl to keep his alcohol problem undetected from those around him . Busch was also a heavy smoker , resulting in symptoms of severe nicotine poisoning in 1874 . He began to illustrate drunkards more often . \n"
7,"Not Quite Hollywood : The Wild , Untold Story of Ozploitation ! is a 2008 Australian documentary film about the Australian New Wave of 1970s and ' 80s low @-@ budget cinema . The film was written and directed by Mark Hartley , who interviewed over eighty Australian , American and British actors , directors , screenwriters and producers , including Quentin Tarantino , Brian Trenchard @-@ Smith , Jamie Lee Curtis , Dennis Hopper , George Lazenby , George Miller , Barry Humphries , Stacy Keach and John Seale . \n"
8,"Toniná had a particularly active Early Classic presence , although the Early Classic remains lie entirely buried under later construction . Due to this , early texts are scarce and only offer a glimpse of the early history of the site . An 8th @-@ century text refers to a king ruling in AD 217 , although it only mentions his title , not his name . \n"
9,= = = Six Nations = = = \n


As we can see, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.

## Causal language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a fixed sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether they span over several of the original texts in the dataset or not. The labels will be the same as the inputs, shifted to the left.
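As a rough illustration of the chunking and of the input/label relationship, the sketch below uses made-up integer IDs in place of real tokens. The values, the `block_size` of 4, and the variable names are assumptions chosen for readability, not output from the tokenizer.

```python
# Toy token IDs standing in for two tokenized texts (values are made up).
text_1 = [11, 12, 13, 14, 15]
text_2 = [21, 22, 23]

# Concatenate everything into one stream, then cut it into fixed-size blocks.
stream = text_1 + text_2
block_size = 4
chunks = [
    stream[i : i + block_size]
    for i in range(0, len(stream) - block_size + 1, block_size)
]
print(chunks)  # [[11, 12, 13, 14], [15, 21, 22, 23]]

# Next-token prediction: the token at position t is used to predict token t + 1.
chunk = chunks[1]
pairs = list(zip(chunk[:-1], chunk[1:]))
print(pairs)  # [(15, 21), (21, 22), (22, 23)]
```

The first chunk lies entirely inside the first text, while the second one starts with the end of text 1 and continues into text 2, which matches the two cases described above.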

We will use the [`distilgpt2`](https://huggingface.co/distilgpt2) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead:

In [6]:
model_checkpoint = "distilgpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

We can now call the tokenizer on all our texts with the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

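Then we apply it to every split in our `datasets` object. Three options matter here:

1. `batched=True` passes the examples to the function in batches rather than one at a time.
2. `num_proc=4` distributes the work across 4 processes.
3. `remove_columns=["text"]` drops the raw text column, which we won't need afterward.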

In [9]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]
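The calling convention behind `batched=True` can be sketched without the library: the function receives a dictionary whose values are lists (one entry per example) and returns a dictionary of lists. In the sketch below, `toy_tokenize` and `map_batched` are hypothetical stand-ins written for illustration, and the "token IDs" are just word lengths.

```python
def toy_tokenize(examples):
    # With batching, examples["text"] is a list of strings, not a single string.
    return {
        "input_ids": [[len(word) for word in text.split()] for text in examples["text"]]
    }

def map_batched(columns, fn, batch_size):
    """Hypothetical stand-in for a batched map: call `fn` once per slice of rows."""
    out = {}
    for start in range(0, len(columns["text"]), batch_size):
        batch = {"text": columns["text"][start : start + batch_size]}
        for key, values in fn(batch).items():
            out.setdefault(key, []).extend(values)
    return out

data = {"text": ["a bb ccc", "", "dddd e"]}
print(map_batched(data, toy_tokenize, batch_size=2))
# {'input_ids': [[1, 2, 3], [], [4, 1]]}
```

Because this toy function treats every example independently, the result does not depend on the batch size. That stops being true for functions that combine examples within a batch, such as the grouping step later in this notebook.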

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [None]:
tokenized_datasets["train"][1]

The next step is to group the tokenized texts into fixed-size blocks:

* Concatenate the tokenized texts.
* Split the concatenated texts into chunks of `block_size`.
* Use the `map` method with `batched=True` for this process.
* Choose `block_size` based on the maximum length the model was pretrained with and on GPU memory constraints (here we use 128).

In [10]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that groups our texts:

In [11]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

* The inputs are duplicated as labels (for causal language modeling). The model shifts them by one position internally, so we don't do it manually.
* The `map` method processes data in batches (the default batch size is 1000).
* The code drops the remainder tokens of each batch so that the concatenated texts are a multiple of `block_size`.
* A larger `batch_size` drops fewer tokens overall, but each batch takes longer to process.
* Multiprocessing (`num_proc`) can speed up preprocessing.
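To see the remainder being dropped, here is a small sketch that runs the same grouping logic on one toy batch. The only changes from the notebook's function are that `block_size` is passed as an argument and that the token IDs are made up.

```python
def group_texts(examples, block_size):
    # Same logic as the notebook's function, with block_size passed explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size; the leftover tokens are dropped.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# One toy batch of three tokenized texts: 3 + 2 + 5 = 10 tokens in total.
batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"] == grouped["input_ids"])  # True
```

Tokens 9 and 10 do not fill a whole block, so they are discarded. Since this happens once per batch, processing the data in fewer, larger batches discards fewer tokens in total.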

In [12]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]

Verify the datasets contain chunks of block_size contiguous tokens.

In [13]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Oz'

Now that the data has been preprocessed, we're ready to instantiate our `Trainer`. We will need a model:

In [14]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

And some `TrainingArguments`:

In [15]:
from transformers import Trainer, TrainingArguments

In [16]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

* The push_to_hub=True argument sets up pushing the model to the Hugging Face Hub regularly during training.

* To save your model locally with a different name than the repository, or to push under an organization, use the hub_model_id argument.
* The hub_model_id needs to be the full name, including your namespace (e.g., "your-username/repo-name" or "organization-name/repo-name").
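The naming involved can be sketched in plain Python. The helper `default_output_dir` is hypothetical and simply wraps the expression used in the cell above; `"some-org/some-model"` and `"your-username"` are placeholders, and the assumption that the repository defaults to the output directory name should be checked against the `TrainingArguments` documentation.

```python
def default_output_dir(model_checkpoint):
    # Same expression as in the cell above: strip any "namespace/" prefix.
    model_name = model_checkpoint.split("/")[-1]
    return f"{model_name}-finetuned-wikitext2"

print(default_output_dir("distilgpt2"))
print(default_output_dir("some-org/some-model"))  # hypothetical checkpoint id

# A full hub_model_id has the form "<namespace>/<repo-name>".
# "your-username" is a placeholder, not a real account.
hub_model_id = f"your-username/{default_output_dir('distilgpt2')}"
print(hub_model_id)
```

A string of this form is what you would pass as `hub_model_id` if you wanted the Hub repository name to differ from the local output directory.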

We pass along all of those to the `Trainer` class:

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

And we can train our model:

In [None]:
trainer.train()





`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss


Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
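Perplexity is the exponential of the average cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The worked example below uses made-up probabilities (not results from this model) to show the arithmetic.

```python
import math

# Hypothetical probabilities the model assigned to the correct next token.
token_probs = [0.25, 0.5, 0.125, 0.5]

# Cross-entropy loss = mean negative log-likelihood (natural log).
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(loss)
print(f"loss={loss:.4f}  perplexity={perplexity:.2f}")
# loss=1.2130  perplexity=3.36
```

A perplexity of about 3.36 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 3.36 tokens at each step. Lower is better.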

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model: anyone can load it with the identifier `"your-username/the-name-you-picked"`, for instance:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sgugger/my-awesome-model")
```

## Masked language modeling

For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them by `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).

We will use the [`distilroberta-base`](https://huggingface.co/distilroberta-base) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead:

In [None]:
model_checkpoint = "distilroberta-base"

We can apply the same tokenization function as before; we just need to update our tokenizer to use the checkpoint we just picked:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

And like before, we group texts together and chunk them into samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest is very similar to what we had, with two exceptions. First we use a model suitable for masked LM:

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We redefine our `TrainingArguments`:

In [None]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

Like before, the last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-wikitext2"` or `"huggingface/bert-finetuned-wikitext2"`).

Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them into tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to do the random masking. We could do it as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
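The idea of masking at batching time can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `mask_tokens`, `MASK_ID`, and the toy sample are assumptions, and every selected token is replaced by the mask ID, whereas the library's collator also leaves some selected tokens unchanged or swaps them for random ones.

```python
import random

MASK_ID = 0          # hypothetical id for the mask token
IGNORE_INDEX = -100  # label value that the loss skips

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Simplified masking: every selected token becomes MASK_ID."""
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            masked.append(MASK_ID)
            labels.append(token)         # predict the original token here
        else:
            masked.append(token)
            labels.append(IGNORE_INDEX)  # no loss on unmasked positions
    return masked, labels

sample = list(range(1, 21))  # toy token ids; none of them equals MASK_ID
rng = random.Random(0)
for epoch in range(2):
    # A fresh random draw each pass, so the masked positions can differ.
    masked, labels = mask_tokens(sample, rng=rng)
    print(epoch, masked)
```

Because the random draw happens every time a batch is built, the same sample can be masked at different positions from one epoch to the next, which is the behavior a one-time preprocessing step could not provide.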

Then we just have to pass everything to `Trainer` and begin training:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

Like before, we can evaluate our model on the validation set. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sgugger/my-awesome-model")
```