<a href="https://colab.research.google.com/github/yotamnahum/Mamram-Language-Modelling-Workshop/blob/main/Train_a_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [1]:
%%capture
! pip install datasets transformers accelerate -U

Make sure your version of Transformers is at least 4.11.0, since the functionality used here was introduced in that version:

In [2]:
import transformers
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(transformers.__version__)
print(device)

4.33.1
cuda
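The version requirement can also be checked in code rather than by eye. The following is a minimal sketch, assuming plain `major.minor.patch` version strings; `version_tuple` is a hypothetical helper, not part of Transformers. Comparing tuples of integers avoids the trap of comparing version strings directly.

```python
def version_tuple(version: str) -> tuple:
    """Turn '4.33.1' into (4, 33, 1); hypothetical helper for plain numeric versions."""
    return tuple(int(part) for part in version.split(".")[:3])

installed = "4.33.1"  # the value printed by transformers.__version__ above
required = "4.11.0"

# Tuples compare element by element, so 33 > 11 is handled correctly.
assert version_tuple(installed) >= version_tuple(required)

# String comparison would get this wrong: "4.9.0" sorts after "4.11.0".
print("4.9.0" > "4.11.0")                                # True (lexicographic)
print(version_tuple("4.9.0") > version_tuple("4.11.0"))  # False (numeric)
```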


# Fine-tuning a language model

## Preparing the dataset

As an example, we will use the `Abirate/english_quotes` dataset, which you can load very easily with the 🤗 Datasets library. The commented-out lines in the cell below show how to load Wikitext 2 instead.

You can choose another dataset [here](https://huggingface.co/datasets), or upload your own data.

In [4]:
# from datasets import load_dataset
# datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

from datasets import load_dataset
datasets = load_dataset("Abirate/english_quotes", split='train')
datasets = datasets.train_test_split(test_size=0.1, shuffle=True, seed=253)
datasets

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 2257
    })
    test: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 251
    })
})

To access an actual element, you need to select a split first, then give an index, for example `datasets["train"][0]`.

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [5]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])

Unnamed: 0,quote,author,tags
0,“Fear cuts deeper than swords.”,"George R.R. Martin,","[bravery, fear]"
1,"“In heaven, all the interesting people are missing.”",Friedrich Nietzsche,"[heaven, religion]"
2,"“Atticus, he was real nice.""""Most people are, Scout, when you finally see them.”","Harper Lee,",[inspirational]
3,“What she had realized was that love was that moment when your heart was about to burst.”,"Stieg Larsson,","[adoration, infatuation, love]"
4,“How wonderful it is that nobody need wait a single moment before starting to improve the world.”,"Anne Frank,","[activism, life, optimism, philosophy, world]"
5,"“Maybe there were people who lived those lives. Maybe this girl was one of them. But what about the rest of us? What about the nobodies and the nothings, the invisible girls? We learn to hold our heads as if we wear crowns. We learn to wring magic from the ordinary. That was how you survived when you weren’t chosen, when there was no royal blood in your veins. When the world owed you nothing, you demanded something of it anyway.”","Leigh Bardugo,","[crooked-kingdom, grisha, inej-ghafa, six-of-crows]"
6,"“Damn, Claire. Warn a guy before you do a face-plant on the floor next time. I could have looked all heroic and caught you or something -Shane”","Rachel Caine,","[funny, morganvillevampires]"
7,“Great minds are always feared by lesser minds.”,"Dan Brown,",[knowledge]
8,“God can't give us peace and happiness apart from Himself because there is no such thing.”,C.S. Lewis,[god-religion-happiness]
9,"“Lord, what fools these mortals be!”","William Shakespeare,","[comedy, elizabethan, robin-goodfellow]"


As we can see, each example is a quote with its author and a list of tags. Some quotes are a single short sentence while others run to a full paragraph.
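The retry loop in `show_random_elements` picks distinct indices by drawing again whenever it hits a duplicate. A possible simplification, sketched here with a hypothetical helper `pick_unique_indices`, is to let `random.sample` draw without replacement:

```python
import random

def pick_unique_indices(num_rows: int, num_examples: int = 10) -> list:
    """Hypothetical helper: draw `num_examples` distinct row indices."""
    if num_examples > num_rows:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no duplicate check is needed.
    return random.sample(range(num_rows), num_examples)

picks = pick_unique_indices(2257)  # 2257 rows in the train split above
print(len(picks), len(set(picks)))  # 10 10
```

The resulting list could then be used as `dataset[picks]`, exactly as in the function above.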

## Causal Language modeling

Choose a model to start from [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [6]:
model_checkpoint = "distilgpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Downloading (…)lve/main/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["quote"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `quote`, `author`, and `tags` columns afterward, so we discard them.

In [9]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["quote", "author", "tags"])

Map (num_proc=4):   0%|          | 0/2257 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/251 [00:00<?, ? examples/s]
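With `batched=True`, the function passed to `map` receives a dictionary of columns, where each value is a list with one entry per row, and returns a dictionary of lists. The sketch below illustrates that shape with a toy stand-in for the tokenizer. `toy_tokenizer` and `toy_tokenize_function` are hypothetical names, and the fake ids are just word lengths, not real vocabulary ids.

```python
def toy_tokenizer(texts):
    """Hypothetical stand-in for the real tokenizer: one fake id per whitespace-separated word."""
    input_ids = [[len(word) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def toy_tokenize_function(examples):
    # `examples` is a dict of columns; each value is a list with one entry per row.
    return toy_tokenizer(examples["quote"])

batch = {
    "quote": [
        "Fear cuts deeper than swords.",
        "Great minds are always feared by lesser minds.",
    ]
}
out = toy_tokenize_function(batch)
print(out["input_ids"])  # [[4, 4, 6, 4, 7], [5, 5, 3, 6, 6, 2, 6, 6]]
```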

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [11]:
print(tokenized_datasets["train"][1])

{'input_ids': [447, 250, 1858, 743, 307, 1661, 618, 356, 389, 34209, 284, 2948, 21942, 11, 475, 612, 1276, 1239, 307, 257, 640, 618, 356, 2038, 284, 5402, 13, 447, 251], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a smaller value of just 128.

In [12]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [13]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels internally, so we don't need to do it manually.
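To make the shifting concrete: in causal language modeling, the prediction made at position `i` is scored against the token at position `i + 1`. The sketch below only illustrates that pairing on a few ids from the tokenized example printed earlier. `next_token_pairs` is a hypothetical helper, not the library's code.

```python
def next_token_pairs(input_ids, labels):
    """Pair each position's input with the label it is scored against.

    Illustrative only: the prediction made at position i is compared with
    labels[i + 1], so the last position has no target.
    """
    return list(zip(input_ids[:-1], labels[1:]))

input_ids = [447, 250, 1858, 743, 307]  # first ids of the tokenized example above
labels = input_ids.copy()               # labels are a plain copy, as in group_texts

print(next_token_pairs(input_ids, labels))
# [(447, 250), (250, 1858), (1858, 743), (743, 307)]
```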

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [14]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/2257 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/251 [00:00<?, ? examples/s]
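Because the remainder is dropped once per batch, the batch size affects how many tokens survive. The toy sketch below mimics the grouping on tiny lists to show the effect. `group_batch` and `group_in_batches` are hypothetical helpers written for this illustration, not part of the Datasets library.

```python
def group_batch(token_lists, block_size):
    """Concatenate token lists and cut them into full blocks, dropping the remainder."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]

def group_in_batches(token_lists, block_size, batch_size):
    """Apply group_batch to consecutive batches, as a batched map would."""
    blocks = []
    for start in range(0, len(token_lists), batch_size):
        blocks.extend(group_batch(token_lists[start : start + batch_size], block_size))
    return blocks

# Six toy "texts" of 3 tokens each: 18 tokens in total.
texts = [[i, i, i] for i in range(6)]

# Small batches: each batch of 6 tokens yields one block of 4 and drops 2 tokens.
print(len(group_in_batches(texts, block_size=4, batch_size=2)))  # 3
# One big batch: 18 tokens yield four blocks of 4 and drop only 2 tokens.
print(len(group_in_batches(texts, block_size=4, batch_size=6)))  # 4
```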

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [15]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' groups, parties, nations and epochs, it is the rule.”“Is it so bad, then, to be misunderstood? Pythagoras was misunderstood, and Socrates, and Jesus, and Luther, and Copernicus, and Galileo, and Newton, and every pure and wise spirit that ever took flesh. To be great is to be misunderstood.”“Try not to become a man of success. Rather become a man of value.”“When he died, all things soft and beautiful and bright would be buried with him.”“Things need not have happened to be true.'

Now that the data has been prepared, we're ready to instantiate our `Trainer`. We will need a model:

In [16]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint).to(device)

Downloading model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [17]:
def generate_text(text, **kwargs):
    input_ids = tokenizer(text, return_tensors='pt').input_ids.to(device)
    output = model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, **kwargs)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Usage example:
# Assuming `tokenizer`, `model`, and `device` are already defined
text = "AI is".strip()

generated_text = generate_text(text, do_sample=True, max_length=64, top_p=0.95, top_k=0)
print(generated_text)

AI is for the vehicle, not the data, and all are using your vehicle› or what was stolen. The satellite has to be installed in your vehicle in order to determine whether this vehicle has been stolen.



You can view this content at:
Watch How Riverfront Is Coming to Texas
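The call above samples with `top_p=0.95` and `top_k=0`, that is, nucleus sampling with top-k filtering switched off. The idea behind `top_p` is to keep only the smallest set of most likely tokens whose cumulative probability reaches the threshold, then sample from that set. The following is a simplified sketch of the idea on a toy distribution. `top_p_filter` and the probabilities are made up for illustration, and this is not the Transformers implementation.

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p.

    Simplified illustration, not the Transformers implementation.
    `probs` maps token -> probability and is assumed to sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: p / total for token, p in kept}

# Made-up next-token distribution for illustration.
toy_probs = {"the": 0.5, "a": 0.3, "powerful": 0.17, "zebra": 0.02, "qux": 0.01}
print(sorted(top_p_filter(toy_probs, top_p=0.95)))  # ['a', 'powerful', 'the']
```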



In [18]:
generated_text = generate_text(text, do_sample=True, max_length=65, penalty_alpha=0.6, top_k=30)
print(generated_text)

AI is the most advanced and fastest operating operating, and one of the most advanced and fastest operating Windows 10 operating systems available. It is also a great way to develop and develop our own Windows desktop based OS on the existing design and functionality of the most advanced and fastest operating systems available. In addition to working through the latest and


And some `TrainingArguments`:

In [19]:
from transformers import Trainer, TrainingArguments

In [22]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    max_steps=20,  # for testing; overrides num_train_epochs
    num_train_epochs=1,  # for testing
    # logging & evaluation strategies
    evaluation_strategy="steps",
    eval_steps=5,
    logging_strategy="steps",
    logging_steps=1,
    report_to="tensorboard",
    save_total_limit=2,
    load_best_model_at_end=True,
)

We pass along all of those to the `Trainer` class:

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
)

And we can train our model:

In [24]:
trainer.train()

Step,Training Loss,Validation Loss
5,3.6086,3.563999
10,3.7327,3.533816
15,3.7201,3.519659
20,3.4113,3.515366


TrainOutput(global_step=20, training_loss=3.723767566680908, metrics={'train_runtime': 1.954, 'train_samples_per_second': 81.881, 'train_steps_per_second': 10.235, 'total_flos': 5225935011840.0, 'train_loss': 3.723767566680908, 'epoch': 0.21})
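The numbers in this output can be sanity-checked with a little arithmetic. Since `max_steps` takes precedence over `num_train_epochs`, training stops after 20 steps. The sketch below assumes the default per-device batch size of 8, a single device, and no gradient accumulation; none of these are set explicitly in the cell above.

```python
max_steps = 20
per_device_batch_size = 8  # assumed: the TrainingArguments default, one device
samples_seen = max_steps * per_device_batch_size
print(samples_seen)  # 160

# Cross-check against the reported throughput and runtime.
train_runtime = 1.954
train_samples_per_second = 81.881
print(round(train_runtime * train_samples_per_second))  # 160

# The reported epoch of 0.21 then implies roughly this many steps per epoch.
reported_epoch = 0.21
print(round(max_steps / reported_epoch))  # 95
```

Under these assumptions, the run saw about a fifth of the training blocks, which is consistent with the reported `'epoch': 0.21`.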

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [25]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 33.63
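Perplexity is the exponential of the mean per-token cross-entropy loss, which is why the cell above simply applies `math.exp` to `eval_loss`. A small worked sketch follows; the `perplexity` helper and the uniform toy losses are illustrative assumptions.

```python
import math

def perplexity(token_losses):
    """Exponential of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model that assigns probability 1/4 to every correct token has perplexity 4.
uniform_losses = [-math.log(0.25)] * 10
print(round(perplexity(uniform_losses), 2))  # 4.0

# The final validation loss logged above reproduces the reported value.
print(f"{math.exp(3.515366):.2f}")  # 33.63
```

Intuitively, a perplexity of about 33.6 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 34 tokens at each position.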


In [26]:
text = "AI is".strip()

generated_text = generate_text(text, do_sample=True, max_length=64, top_p=0.95, top_k=0)
print(generated_text)

AI is undervalued and we feel badly for the way the future of Indian affairs is far from perfect. Politics, in this case, ought to be specific. A corporation is a corporation, and you know what that means and how you live, but you can't really give the best job to a great person who lived


In [27]:
generated_text = generate_text(text, do_sample=True, max_length=65, penalty_alpha=0.6, top_k=30)
print(generated_text)

AI is a powerful game of chess, and it's a game made of the same principle that has led us into the depths of war. It's a game full of secrets and the art of chess, and it's a game that you can't tell your friends what this stuff is," says the man who spent two years
