<a href="https://colab.research.google.com/github/yotamnahum/Mamram-Language-Modelling-Workshop/blob/main/Train_a_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [None]:
%%capture
! pip install datasets transformers accelerate -U

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(transformers.__version__)
print(device)

4.33.1
cuda


# Fine-tuning a language model

## Preparing the dataset

We will use the Wikitext 2 dataset as an example. You can load it very easily with the 🤗 Datasets library.

You can choose another dataset [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending), or upload your own data.

In [None]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

To access an actual element, you need to select a split first, then give an index, for example `datasets["train"][10]`.

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw `num_examples` distinct row indices.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class ids with their human-readable names, if any.
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])

Unnamed: 0,text
0,"It takes its name from the statues of saintly royalty which form part of its decoration , and is the oldest work in churrigueresque style in Mexico , taking 19 years to complete . At the bottom , from left to right , are six female royal saints : Saint Margaret of Scotland , Helena of Constantinople , Elisabeth of Hungary , Isabel of Portugal , Empress Cunegunda and Edith of Wilton . In the middle of the altar are six canonized kings , four of whom are : Hermenegild a Visigoth martyr , Henry II , Holy Roman Emperor , Edward the Confessor and Casimir of Poland . Above these four are Saints Louis of France and Ferdinand III of Castile . In between these kings an oil painting of the Adoration of the Magi by Juan Rodriguez Juarez shows Jesus as the King of kings . The top portion features a painting of the Assumption of Mary as celestial queen flanked by oval bas reliefs , one of Saint Joseph carrying the infant Jesus and the other of Saint Teresa of Ávila with a quill in her hand and the Holy Spirit above her , inspiring her to write . Above this are figures of Jesus and Mary among sculptures of angels crowned with an image of God , the Father . \n"
1,"The U.S. eventually establishes safe zones west of the Rocky Mountains and spends much of the next decade eradicating zombies in that region . All aspects of civilian life are devoted to supporting the war effort against the pandemic . Much of it resembles total war strategies : rationing of fuel and food , cultivation of private gardens , and civilian neighborhood patrols . The U.S. government also initiates a "" Re @-@ education Act "" to train the civilian population for the war effort and restore order ; the people with skills such as carpentry and construction find themselves more valuable than people with managerial skills . \n"
2,= = = Gulf of Mexico = = = \n
3,Martin O 'Neil as Edgar Ellerbeck \n
4,
5,
6,"There had been an overwhelming Conservative @-@ Unionist majority in the Lords since the Liberal split in 1886 . With the Liberal Party attempting to push through significant welfare reforms with considerable popular support , this seemed certain to cause problems in the relationship between the Houses . Between 1906 and 1909 , several important measures were being considerably watered down or rejected outright : for example , Birrell introduced the Education Bill 1906 , which was intended to address nonconformist grievances arising from the Education Act 1902 , but which was amended by the Lords to such an extent that it was effectively a different bill , upon which the Commons dropped the bill . This led to the 26 June 1907 resolution in the House of Commons declaring that the Lords ' power should be curtailed , put forward by Liberal Prime Minister Henry Campbell @-@ Bannerman . In 1909 , hoping to force an election , the Lords rejected the financial bill based on the government budget ( the "" People 's Budget "" ) put forward by David Lloyd George , by 350 votes to 75 . This , according to the Commons , was "" a breach of the Constitution , and a usurpation of the rights of the Commons "" . The Lords suggested that the Commons justify its position as representing the will of the people : it did this through the January 1910 general election . The Liberal government lost heavily , but remained in majority with the help of a significant number of Irish Nationalist and Labour MPs . The Irish Nationalists saw the continued power of the Lords as detrimental to securing Irish Home Rule . Following the election , the Lords relented on the budget ( since reintroduced by the government ) , it passing the Lords on 28 April , a day after the Commons . \n"
7,2Xe \n
8,
9,"The Ace Attorney series launched in Japan with the Game Boy Advance game Phoenix Wright : Ace Attorney in 2001 , and has been published in the West since the release of a Nintendo DS port in 2005 . The series currently consists of six main series games and four spin @-@ offs . Additionally , two titles that collect the first three main series games have been released : Ace Attorney : Phoenix Wright Trilogy HD , which was released for iOS in 2012 in Japan and in 2013 in the West , and Phoenix Wright : Ace Attorney Trilogy , which was released for the Nintendo 3DS in 2014 . \n"


As we can see, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.

## Causal Language modeling

Choose a model to start from [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [None]:
model_checkpoint = "distilgpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [None]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

If we now look at an element of our datasets, we will see the texts have been replaced by the `input_ids` the model will need:

In [None]:
tokenized_datasets["train"][1]

{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now for the harder part: we need to concatenate all our texts together, then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

We could use the maximum length our model was pretrained with (`tokenizer.model_max_length`), but that might be a bit too big to fit in your GPU RAM, so here we use a much smaller value of just 128.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels internally, so that each position is trained to predict the next token, and we don't need to do it manually.
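As a rough illustration of that internal shift, the plain-Python sketch below pairs each position with the token it is trained to predict. The token ids are made up, and the variable names `context_positions` and `shifted_targets` are only illustrative; the real models do this on tensors inside their loss computation.

```python
# Toy token ids standing in for one chunk of `input_ids` (values are made up).
input_ids = [10, 11, 12, 13, 14]
labels = list(input_ids)  # same idea as the copy step in `group_texts`

# Sketch of the shift a causal LM loss applies:
# the prediction made at position i is scored against the token at position i + 1.
context_positions = input_ids[:-1]  # positions whose prediction is scored
shifted_targets = labels[1:]        # the token each of those positions must predict

for seen, target in zip(context_positions, shifted_targets):
    print(f"after token {seen} -> predict {target}")
```

The last token has no successor inside the chunk, so it contributes no prediction of its own.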

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. So here, we drop the remainder to make the concatenated tokenized texts a multiple of `block_size` once every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

We can check that our datasets have changed: the samples now contain chunks of `block_size` contiguous tokens, potentially spanning several of our original texts.

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Oz'
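To see the grouping logic on a small scale, here is a minimal sketch that runs the same `group_texts` function on a hand-made batch. The tiny `block_size` of 4 and the token ids are assumptions chosen for readability, not values from the notebook run.

```python
block_size = 4  # tiny value for illustration; the notebook uses 128

def group_texts(examples):
    # Same logic as the notebook's function: concatenate, trim, then chunk.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# A made-up batch of three tokenized "texts" with 3, 5, and 2 tokens (10 in total).
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(len(batch["input_ids"]), "->", len(grouped["input_ids"]))  # 3 -> 2
```

Three input examples become two output examples, and the last two tokens (9 and 10) are dropped because they do not fill a whole block. With `map`, this trimming happens once per batch.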

Now that the data has been prepared, we're ready to instantiate our `Trainer`. We will first need a model:

In [None]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint).to(device)

In [None]:
def generate_text(text, **kwargs):
    input_ids = tokenizer(text, return_tensors='pt').input_ids.to(device)
    output = model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, **kwargs)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Usage example:
# Assuming `tokenizer`, `model`, and `device` are already defined
text = "AI is".strip()

generated_text = generate_text(text, do_sample=True, max_length=64, top_p=0.95, top_k=0)
print(generated_text)

AI is unlocked from M9, M10 and M15 at 8:00 UTC. If you like it, please share us with your friends to decide who will be playing to play to unlock it on/off.




We are excited to announce that Mirage Day at E3 kicks off on September


In [None]:
generated_text = generate_text(text, do_sample=True, max_length=65, penalty_alpha=0.6, top_k=30)
print(generated_text)

AI is one of the most powerful and influential people in the world. It can be traced to several important things: (i) the role of the state, and (ii) the ability of state agencies to be trusted.



The use of authority for managing state agency functions is important.
However, in


And some `TrainingArguments`:

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    max_steps=100,  # for quick testing; use -1 (the default) to train for `num_train_epochs`
    num_train_epochs=1,
    # logging & evaluation strategies
    logging_strategy="steps",
    logging_steps=5,
    report_to="tensorboard",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
)

The `push_to_hub` argument controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training; it is set to `False` here, so nothing is uploaded. If you enable it and want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/gpt-finetuned-wikitext2"` or `"huggingface/gpt-finetuned-wikitext2"`).

We pass along all of those to the `Trainer` class:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

And we can train our model:

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
0,3.9812,3.90365


TrainOutput(global_step=100, training_loss=3.9745584678649903, metrics={'train_runtime': 10.8405, 'train_samples_per_second': 73.797, 'train_steps_per_second': 9.225, 'total_flos': 26129675059200.0, 'train_loss': 3.9745584678649903, 'epoch': 0.04})

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 52.84
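For intuition about what that number means, the sketch below computes perplexity by hand from per-token probabilities. The probabilities are hypothetical values chosen for easy arithmetic, not numbers from the run above.

```python
import math

# Hypothetical probabilities a model assigned to the correct next token
# at four positions (made-up numbers, not taken from the run above).
token_probs = [0.5, 0.25, 0.125, 0.5]

# Cross-entropy loss: the mean negative log-likelihood per token.
mean_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of that mean loss.
perplexity = math.exp(mean_nll)
print(f"loss = {mean_nll:.4f}, perplexity = {perplexity:.4f}")
# loss = 1.2130, perplexity = 3.3636
```

A perplexity of about 3.36 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 3.36 tokens at each step. Lower is better.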


In [None]:
text = "AI is".strip()

generated_text = generate_text(text, do_sample=True, max_length=64, top_p=0.95, top_k=0)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
IDF is awesome if it helps. The thread It was written was in the World What Women Say in 2014 and the highest tribute to women @ @irloy @lisp # Twitter follow + Google+. I have been @lisp shorthandically @@ to @all crew that @ the @ @@ @@ @@@ and @@ @@@ is social programming @ @ @@ @@@@@@ @@@@@@@ @@@@@@@ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
----------------------------------------------------------------------------------------------------


In [None]:
generated_text = generate_text(text, do_sample=True, max_length=65, penalty_alpha=0.6, top_k=30)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
IDF is a non-profit organization dedicated to the prevention of HIV/AIDS and HIV/AIDS in the United States and around the world. The organization is the only organization dedicated to HIV/AIDS prevention in the United States and around the world. The organization is the only organization dedicated to HIV/AIDS prevention in the United States and around the world. The organization is the only organization dedicated to HIV/AIDS prevention in the United States and around the world. The organization is the only organization dedicated to HIV/AIDS prevention in the United States and around the world. The organization is the only organization dedicated to HIV/AIDS prevention in the United
----------------------------------------------------------------------------------------------------
