<a href="https://colab.research.google.com/github/yotamnahum/Mamram-Language-Modelling-Workshop/blob/main/Train_a_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [1]:
%%capture
! pip install datasets transformers accelerate -U

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [1]:
import transformers
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(transformers.__version__)
print(device)

4.33.1
cuda
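If you prefer the notebook to fail early on an old install rather than relying on reading the printed version, a minimal sketch of such a check follows. The helper `version_tuple` is hypothetical (it is not part of Transformers), and the hard-coded version string stands in for `transformers.__version__`.

```python
from itertools import takewhile


def version_tuple(version):
    """Turn a string like '4.33.1' into (4, 33, 1), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits, so '0rc1' contributes 0.
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


installed = "4.33.1"  # stand-in; in the notebook use transformers.__version__
minimum = (4, 11, 0)

if version_tuple(installed) < minimum:
    raise RuntimeError(f"Transformers {installed} is older than the required 4.11.0")
print("version ok:", version_tuple(installed))
```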


# Fine-tuning a language model

## Preparing the dataset

We will use the `english_quotes` dataset as an example. You can load it very easily with the 🤗 Datasets library.

You can choose another dataset [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending), or upload your own data.

In [2]:
from datasets import load_dataset
datasets = load_dataset("Abirate/english_quotes", split='train')
datasets = datasets.train_test_split(test_size=0.1, shuffle=True, seed=253)
datasets

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 2257
    })
    test: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 251
    })
})

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [3]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])

Unnamed: 0,quote,author,tags
0,"“I'm quite illiterate, but I read a lot. ”","J.D. Salinger,",[holden]
1,"“Kiss me, and you will see how important I am.”","Sylvia Plath,","[importance, kiss, kissing]"
2,“Make your own Bible. Select and collect all the words and sentences that in all your readings have been to you like the blast of a trumpet.”,Ralph Waldo Emerson,[spirituality]
3,“There is more than one way to burn a book. And the world is full of people running about with lit matches.”,Ray Bradbury,[censorship]
4,"“I am too intelligent, too demanding, and too resourceful for anyone to be able to take charge of me entirely. No one knows me or loves me completely. I have only myself”",Simone de Beauvoir,[feminism]
5,“Happiness in intelligent people is the rarest thing I know.”,"Ernest Hemingway,",[happiness]
6,"“People, generally, suck.”","Christopher Moore,",[humor]
7,"“And now I’m looking at you,” he said, “and you’re asking me if I still want you, as if I could stop loving you. As if I would want to give up the thing that makes me stronger than anything else ever has. I never dared give much of myself to anyone before – bits of myself to the Lightwoods, to Isabelle and Alec, but it took years to do it – but, Clary, since the first time I saw you, I have belonged to you completely. I still do. If you want me.”","Cassandra Clare,","[cassandra-clare, city-of-glass, clary, jace, love, mortal-instruments]"
8,“The scar had not pained Harry for nineteen years. All was well.”,"J.K. Rowling,",[harry-potter]
9,"“Ah! There is nothing like staying at home, for real comfort.”",Jane Austen,[relaxation]


As we can see, each example is a quote together with its author and a list of tags. Some quotes are a single short sentence while others run to a full paragraph.

## Causal Language modeling

Choose a model to start from [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [4]:
model_checkpoint = "distilgpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["quote"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `quote`, `author` and `tags` columns afterward, so we discard them.

In [7]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["quote", "author", "tags"])

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [8]:
print(tokenized_datasets["train"][1])

{'input_ids': [447, 250, 1858, 743, 307, 1661, 618, 356, 389, 34209, 284, 2948, 21942, 11, 475, 612, 1276, 1239, 307, 257, 640, 618, 356, 2038, 284, 5402, 13, 447, 251], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Now for the harder part: we need to concatenate all our texts together, then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
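As a rough illustration of the concatenate-then-chunk idea, here is a sketch on toy token lists. The batch contents and `toy_block_size` are made up for the example and are far smaller than anything used in real training.

```python
# Toy stand-in for a tokenized batch: three "texts" of different lengths.
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
toy_block_size = 4  # hypothetical value, much smaller than a real block size

# Concatenate every text in the batch into one long token stream.
stream = [tok for ids in batch["input_ids"] for tok in ids]

# Keep only as many tokens as fill whole blocks, then slice into blocks.
usable = (len(stream) // toy_block_size) * toy_block_size
chunks = [stream[i : i + toy_block_size] for i in range(0, usable, toy_block_size)]

print(len(batch["input_ids"]), "texts in ->", len(chunks), "chunks out")
print(chunks)
```

Three input texts become two output chunks, and the second chunk spans two of the original texts. The two leftover tokens are dropped.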

First, we pick a block size. The maximum length our model was pretrained with (`tokenizer.model_max_length`) might be a bit too big to fit in your GPU RAM, so here we take less, at just 128.

In [9]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [10]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead of dropping if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally, so we don't need to do it manually.
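To see what that internal shift amounts to, here is a small sketch in plain Python. It only mimics the alignment of predictions and targets; the token ids are invented, and the real models do this on tensors inside their loss computation.

```python
# Toy token ids for one chunk; labels start as an exact copy of the inputs.
input_ids = [10, 11, 12, 13, 14]
labels = list(input_ids)

# The prediction made at position i is scored against the token at position i + 1,
# so the last position has no target and the first token is never a target.
pairs = list(zip(input_ids[:-1], labels[1:]))

for position, (current, target) in enumerate(pairs):
    context = input_ids[: position + 1]
    print(f"context {context} (last token {current}) -> target {target}")
```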

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [11]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
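To make the effect of the batch size on the dropped remainder concrete, the sketch below counts how many tokens are lost for two batch sizes. The helper `dropped_tokens` and the corpus of 2,500 texts of 35 tokens each are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def dropped_tokens(lengths, batch_size, block_size):
    """Count tokens lost when each batch is truncated to a multiple of block_size."""
    dropped = 0
    for start in range(0, len(lengths), batch_size):
        batch_total = sum(lengths[start : start + batch_size])
        dropped += batch_total % block_size
    return dropped


# Hypothetical corpus: 2,500 texts of 35 tokens each.
lengths = [35] * 2500

small = dropped_tokens(lengths, batch_size=1000, block_size=128)
large = dropped_tokens(lengths, batch_size=2500, block_size=128)
print(f"batch_size=1000 drops {small} tokens, batch_size=2500 drops {large} tokens")
```

Each batch loses fewer than `block_size` tokens, so fewer, larger batches lose less in total. On a corpus this small the difference is negligible either way.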

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [13]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' groups, parties, nations and epochs, it is the rule.”“Is it so bad, then, to be misunderstood? Pythagoras was misunderstood, and Socrates, and Jesus, and Luther, and Copernicus, and Galileo, and Newton, and every pure and wise spirit that ever took flesh. To be great is to be misunderstood.”“Try not to become a man of success. Rather become a man of value.”“When he died, all things soft and beautiful and bright would be buried with him.”“Things need not have happened to be true.'

Now that the data has been prepared, we're ready to instantiate our `Trainer`. We will first need a model:

In [14]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint).to(device)

In [15]:
def generate_text(text, **kwargs):
    input_ids = tokenizer(text, return_tensors='pt').input_ids.to(device)
    output = model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, **kwargs)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Usage example:
# Assuming `tokenizer`, `model`, and `device` are already defined
text = "AI is".strip()

generated_text = generate_text(text, do_sample=True, max_length=64, top_p=0.95, top_k=0)
print(generated_text)

AI is solid, smart and challenging new technology that really makes it easy to manufacture something, does a little bit different, and are definitely worth buying.


In [16]:
generated_text = generate_text(text, do_sample=True, max_length=65, penalty_alpha=0.6, top_k=30)
print(generated_text)

AI is the one that is being tested, and we've got an important problem. That's why we have built this software for the mobile version of the Android platform. We've got our product in development.




That's a shame for us, though. Even though we do the right thing when it


And some `TrainingArguments`:

In [40]:
from transformers import Trainer, TrainingArguments
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned",
    learning_rate=4e-5,
    # weight_decay=0.01,
    push_to_hub=False,
    #max_steps=20, # for testing
    num_train_epochs=1,  # for testing
    # logging & evaluation strategies
    evaluation_strategy="steps",
    eval_steps=10,
    logging_strategy="steps",
    logging_steps=5,
    report_to="tensorboard",
    save_total_limit=2,
    load_best_model_at_end=True,
)

We pass along all of those to the `Trainer` class:

In [41]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
)

And we can train our model:

In [42]:
trainer.train()

Step,Training Loss,Validation Loss
10,2.7974,3.751385
20,2.8471,3.722462
30,2.8758,3.72787
40,2.9486,3.735564
50,3.0813,3.657136
60,3.1396,3.635551
70,3.4055,3.554279
80,3.3769,3.56923
90,3.5756,3.562459
100,2.9628,3.62785


TrainOutput(global_step=282, training_loss=2.760936219641503, metrics={'train_runtime': 19.864, 'train_samples_per_second': 112.968, 'train_steps_per_second': 14.197, 'total_flos': 73293738541056.0, 'train_loss': 2.760936219641503, 'epoch': 3.0})
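The reported `global_step` can be sanity-checked with a little arithmetic. The sketch below assumes about 748 training chunks, a per-device batch size of 8, a single device and no gradient accumulation. The chunk count is not stated in the notebook; it is inferred from the reported throughput multiplied by the runtime and divided by the reported `epoch: 3.0`. The batch size of 8 is the `TrainingArguments` default when none is passed.

```python
import math

# Assumed inputs (see the paragraph above): none of these are printed by the notebook
# except the epoch count, which comes from the TrainOutput line.
num_chunks = 748   # inferred: 112.968 samples/s * 19.864 s / 3 epochs
batch_size = 8     # default per-device train batch size
epochs = 3         # the "epoch" value reported in TrainOutput

# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_chunks / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```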

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [43]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 42.47
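Perplexity is simply the exponential of the average per-token loss. The sketch below shows the relation on a handful of made-up per-token losses, then inverts it to recover the loss that corresponds to the perplexity printed above.

```python
import math

# Hypothetical per-token negative log-likelihoods (in nats) for a short sequence.
token_nll = [3.2, 4.1, 2.9, 3.8]

# Average the losses, then exponentiate to get perplexity.
mean_nll = sum(token_nll) / len(token_nll)
perplexity = math.exp(mean_nll)
print(f"mean loss {mean_nll:.3f} -> perplexity {perplexity:.2f}")

# Inverting the relation: a perplexity of 42.47 corresponds to a loss of ln(42.47).
implied_loss = math.log(42.47)
print(f"perplexity 42.47 -> loss {implied_loss:.4f}")
```

A lower loss therefore always means a lower perplexity, and a perplexity of 1 would mean the model predicts every token with certainty.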


In [44]:
text = "AI is".strip()

generated_text = generate_text(text, do_sample=True, num_beams=3, max_length=64, top_p=0.95, top_k=0)
print(generated_text)

AI is a writer, not a writer.”“It is better to have a friend who is not your friend, than to have a friend who is not your friend, than to have a friend who is not your friend, than to have a friend who is not your friend, than to have a friend


In [45]:
generated_text = generate_text(text, do_sample=True, max_length=65, penalty_alpha=0.6, top_k=0)
print(generated_text)

AI is a woman. I am a woman. I am a rich woman. I am a poor woman. I am a beautiful, delicate, resourceful woman.I pour out my income and put out my retirement plans and everything. So I don't have to waste much time on my recent fortune. What am I going
