# Training a Language Model

In this notebook you will build an actual, although relatively small, language model using a dataset of your choosing,
and we will show you a "large" language model. We will also look at various LLM architectures.

## Finding a Dataset

The first task in building any machine learning application is to find a dataset to work with. There 
are many online repositories of datasets that you can use for this.

The most popular website for NLP-related datasets (and models) is [HuggingFace](https://huggingface.co).
For this task we recommend that you try to find a dataset in your language or a language you understand.

1. Go to https://huggingface.co/datasets
2. Under **Main** first filter by *Modalities*, select **Text**. 
   Under **Tasks**, select **Text Generation**. 
   Under **Libraries**, select **Datasets**.
3. Under **Languages** select your language. This should now give you a list of suitable datasets that we can use.
4. Choose any of the listed datasets that is not too big and check its description
   for how to load the dataset. This may differ from dataset to dataset.

<div style="display: flex;">
    <img width="30%" src="images/dataset-type.png">
    <img width="25%" src="images/dataset-languages.png">
</div>

Of course you can choose a dataset according to what you want to build.
Other interesting options include code, math, or song-lyrics datasets.
Finding them and getting them ready for use may be more work, but it would be a good exercise.
A good dataset has at least 100K samples of text, each at least roughly 100 to 250 words long.
Where the data was collected also matters: text from forum websites, where people use abbreviations or slang, may not be ideal for training a language model unless you also have samples of well-edited text such as news articles or Wikipedia pages.
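
The rule of thumb above can be checked before committing to a dataset. The following is a minimal sketch, assuming the texts are already loaded as a list of strings; the function name `summarize_corpus` and its default thresholds are illustrative choices based on the guideline above, not fixed requirements. Counting words by splitting on whitespace is only a rough measure and will not suit languages that are written without spaces.

```python
from statistics import median

def summarize_corpus(texts, min_samples=100_000, min_words=100):
    """Report basic size statistics for a list of text samples."""
    word_counts = [len(text.split()) for text in texts]
    long_enough = sum(1 for n in word_counts if n >= min_words)
    return {
        "num_samples": len(texts),
        "median_words": median(word_counts) if word_counts else 0,
        "share_long_enough": long_enough / len(texts) if texts else 0.0,
        "enough_samples": len(texts) >= min_samples,
    }

# Tiny made-up corpus, only to show the call; a real corpus would be far larger.
sample_texts = ["word " * 150, "short text", "word " * 300]
stats = summarize_corpus(sample_texts)
print(stats)
```

A low `share_long_enough` would suggest that many samples are too short to be useful as training sequences.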

Remember that a machine learning model is only as good as the dataset it uses.

## Running this Notebook

This is the most computationally intensive notebook so far. You will need to either provision a long-running Jupyter instance (at least 8-10 hours), 
or learn to [write a Slurm job](https://doc.zih.tu-dresden.de/jobs_and_resources/slurm/) to run it on the HPC system, although this may be more difficult for inexperienced users. 
You can use the [Slurm job file generator](https://doc.zih.tu-dresden.de/jobs_and_resources/slurm_generator/) to write your script once you have figured out how to use the HPC.

Here are the values we recommend you should use when creating a Jupyter instance.

| Name              | Value               |
| ----------------- | ------------------- |
| Cluster           | Alpha               |
| Generic Resources | gpu:1               |
| Runtime           | 12:00:00 (12 hours) |
| Project           | p_scads_llm_secrets |

In [None]:
!pip install transformers datasets tokenizers

## Prepare a Dataset

Once you find a suitable dataset, you should be able to read the dataset using the `datasets` library.
The example in this notebook will use the [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia) in Belarusian.
This will illustrate how you can use a non-standard character set for your model.

In [16]:
from datasets import load_dataset

LANG = "be"

dataset = load_dataset("wikimedia/wikipedia", f"20231101.{LANG}", cache_dir=".checkpoints/data")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 236165
    })
})

In [17]:
# Since we don't need any of the other columns, we keep only the text column
dataset = [ data["text"] for data in dataset["train"] ]

## Byte-Level Tokenization

We will build a Byte-Pair Encoding (BPE) tokenizer, but we will use the byte-level version of BPE tokenization.

**Byte-level BPE tokenization** considers a byte, rather than a character as in plain BPE, to be the smallest unit of text. 
This enables it to be applied to more complicated character sets where a character may span more than one byte.
It is used by many Transformer models, including GPT-2, RoBERTa, BART, and DeBERTa.
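
To make the idea concrete, the sketch below shows that Cyrillic characters occupy two bytes each in UTF-8 and then performs a single BPE merge step directly on bytes. It is a simplified illustration, not the training procedure of the `tokenizers` library: the toy word list and the helper names `most_frequent_pair` and `merge_pair` are made up for this example, and real BPE training repeats the merge step many times over word frequencies.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the adjacent symbol pair that occurs most often."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

words = ["мова", "мовы", "слова"]  # made-up toy corpus

# Start from single bytes: each Cyrillic letter takes two bytes in UTF-8.
sequences = [[bytes([b]) for b in word.encode("utf-8")] for word in words]
print(len(words[0]), "characters ->", len(sequences[0]), "bytes")

# One BPE step: find the most frequent adjacent pair and merge it everywhere.
pair = most_frequent_pair(sequences)
merged_sequences = [merge_pair(seq, pair) for seq in sequences]
print("merged pair:", pair)
print("first word after one merge:", merged_sequences[0])
```

Because the base alphabet is the 256 possible byte values, any text can be represented, even with characters that never appeared in the training data.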

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
VOCAB_SIZE = 50_000
tokenizer.train_from_iterator(
    dataset,
    vocab_size=VOCAB_SIZE,
    min_frequency=2, 
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ])

tokenizer.enable_truncation(max_length=512)
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")))

Since the tokenization is done at a byte-level, when using UTF-8 encoded characters, the tokens will not remain human readable.
Therefore we need to use a byte-level decoder to combine bytes back into UTF-8 encoded characters so that they are human readable again.

In [24]:
from tokenizers.decoders import ByteLevel

sentence = "Аԥсуаа жәытә-натә аахыс Пицунда амшын аԥшаҳәа Амзара ҳәа иашьҭан"
decoder = ByteLevel()

sentence = tokenizer.encode(sentence)

print("[Tokens]", list(map(lambda x: x.encode("utf-8").hex(), sentence.tokens)))
print("[Decode]", decoder.decode(sentence.tokens))

[Tokens] ['3c733e', 'c390c4b2', 'c394', 'c2a5', 'c391c4a3c391c4a5c390c2b0', 'c390c2b0', 'c4a0c390c2b6', 'c393c4bb', 'c391c4adc391c4a4', 'c393c4bb', '2d', 'c390c2bdc390c2b0c391c4a4', 'c393c4bb', 'c4a0c390c2b0', 'c390c2b0c391c4a7', 'c391c4ad', 'c391c4a3', 'c4a0c390c581', 'c390c2b8c391c4a8', 'c391c4a5c390c2bdc390c2b4c390c2b0', 'c4a0c390c2b0c390c2bc', 'c391c4aac391c4adc390c2bd', 'c4a0c390c2b0', 'c394', 'c2a5', 'c391c4aac390c2b0', 'c392', 'c2b3', 'c393c4bb', 'c390c2b0', 'c4a0c390c4b2c390c2bc', 'c390c2b7c390c2b0c391c4a2c390c2b0', 'c4a0c392', 'c2b3', 'c393c4bb', 'c390c2b0', 'c4a0c390c2b8', 'c390c2b0c391c4aa', 'c391c4ae', 'c392', 'c583', 'c390c2b0c390c2bd', '3c2f733e']
[Decode] <s>Аԥсуаа жәытә-натә аахыс Пицунда амшын аԥшаҳәа Амзара ҳәа иашьҭан</s>


Notice that the tokens are groups of bytes (every 2 hexadecimal characters are one byte) and after decoding this sequence we get back the original sentence.
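
The hex strings above do not show raw bytes directly, because byte-level tokenizers first map every byte to a printable Unicode character. The sketch below reproduces the byte-to-character table published with GPT-2, which, as far as we can tell, the `ByteLevel` components follow as well; treat it as an illustration rather than the library's actual implementation. The helper names `to_visible` and `from_visible` are made up for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes are shifted to code points from 256 upwards.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_visible(text):
    """UTF-8 encode `text`, then map every byte to its stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_visible(visible):
    """Invert `to_visible`: map characters back to bytes and decode UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")

visible = to_visible("А")  # Cyrillic capital A, bytes d0 90 in UTF-8
print(visible, visible.encode("utf-8").hex())
print(from_visible(to_visible("жәытә-натә")))
```

Under this mapping the letter "А" becomes the two stand-in characters whose UTF-8 hex is `c390c4b2`, which is the token that follows `<s>` in the output above.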

In [None]:
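import os

# Create the target directory first in case it does not exist yet
tokenizer_dir = f".checkpoints/{LANG}-roberta/tokenizer"
os.makedirs(tokenizer_dir, exist_ok=True)
tokenizer.save_model(tokenizer_dir)

['.checkpoints/be-roberta/tokenizer/vocab.json',
 '.checkpoints/be-roberta/tokenizer/merges.txt']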

## Training our Language Model

What is colloquially called a "language model" is trained on a task that experts call "causal language modeling", where the objective is to predict the next token given the tokens that precede it.
The task is "causal" in that each token is assumed to follow from the tokens before it, never from those after it.
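
The objective can be written down as a set of (context, target) pairs. The sketch below uses a hypothetical list of token ids to show which prediction the model is asked to make at each position; the function name `next_token_examples` is made up for this illustration.

```python
def next_token_examples(token_ids):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Hypothetical token ids for a five-token sentence.
token_ids = [0, 17, 42, 8, 2]

for context, target in next_token_examples(token_ids):
    print(f"context={context} -> target={target}")

# In practice all positions are handled in one pass: the labels are the
# inputs shifted left by one position.
inputs, labels = token_ids[:-1], token_ids[1:]
print(inputs, labels)
```

To our knowledge, the causal language modeling heads in `transformers` perform this shift internally when they receive labels, so the training data only needs to contain the token ids themselves.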

The architecture we will use is [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta#roberta). 
This is a version of [BERT](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/bert#bert), a bidirectional transformer pretrained using a combination of the masked language modeling objective and next sentence prediction. RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.
RoBERTa is normally pretrained as a masked language model; in this notebook it is instead configured as a decoder (`is_decoder=True`) and trained with the causal objective through `RobertaForCausalLM`.


In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(f".checkpoints/{LANG}-roberta/tokenizer", model_max_length=512)

In [None]:
from transformers import RobertaForCausalLM
from transformers import RobertaConfig

model = RobertaForCausalLM(
    RobertaConfig(
        vocab_size=VOCAB_SIZE,
        max_position_embeddings=514,
        num_attention_heads=12,
        num_hidden_layers=6,
        type_vocab_size=1,
        is_decoder=True
    ))

In [10]:
model.num_parameters()

81966416

Here we see that even such a relatively simple and small model has around 82 million parameters.
To put this in perspective, the largest GPT-2 model has 1.5 billion parameters and GPT-3 has 175 billion parameters.
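
The parameter count can be reproduced by hand, which shows where the parameters sit. The sketch below assumes the library defaults for the values not set in the configuration above (a hidden size of 768 and a feed-forward size of 3072) and assumes that the output projection shares its weights with the token embeddings; under these assumptions the sum matches the number reported by `model.num_parameters()`.

```python
# Values passed to RobertaConfig above.
vocab_size = 50_000
max_positions = 514
num_layers = 6
type_vocab_size = 1

# Assumed library defaults (not set explicitly in the notebook).
hidden = 768
intermediate = 3072

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm

attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = num_layers * (attention + feed_forward)

# The output projection is assumed to share weights with the token embeddings,
# so the head only adds a dense layer, a layer norm, and a per-token bias.
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + encoder + lm_head
print(f"embeddings: {embeddings:,}")
print(f"encoder:    {encoder:,}")
print(f"lm head:    {lm_head:,}")
print(f"total:      {total:,}")
```

Nearly half of the parameters belong to the token embedding table, so the vocabulary size has a large effect on the size of a small model like this one.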

In [11]:
train_dataset = tokenizer(dataset, add_special_tokens=True, truncation=True, max_length=512)["input_ids"]

In [None]:
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

training_args = TrainingArguments(
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_steps=1_000,
    weight_decay=0.1,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=64,
    num_train_epochs=5,
    output_dir=f".checkpoints/{LANG}-roberta",
    overwrite_output_dir=True,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

Start training! This cell will take a really long time.

You can try using different values for the learning rate and number of training epochs to see what works best.
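
When changing these values it helps to know roughly how many optimizer steps a run contains, because the warmup and checkpoint settings are expressed in steps. The estimate below assumes a single GPU, as in the resource table above, and one training example per article (each article is truncated to 512 tokens). The exact number reported by the trainer may differ slightly because of rounding.

```python
import math

num_examples = 236_165   # rows in the Belarusian Wikipedia split
per_device_batch = 64
grad_accumulation = 8
num_gpus = 1             # assumption: a single GPU
epochs = 5
warmup_steps = 1_000
save_steps = 10_000

# Sequences that contribute to one weight update.
effective_batch = per_device_batch * grad_accumulation * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print("sequences per optimizer step:", effective_batch)
print("approx. optimizer steps in total:", total_steps)
print("share of training spent in warmup:", round(warmup_steps / total_steps, 2))
print("periodic checkpoint reached:", save_steps <= total_steps)
```

With the values used here, the estimate suggests that warmup covers a large share of the run and that the run ends before the first periodic checkpoint. If that holds for your dataset, consider lowering `warmup_steps` and `save_steps` accordingly.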

In [None]:
trainer.train()

In [None]:
trainer.save_model(f".checkpoints/{LANG}-roberta")

## Evaluate

Run the model with the text generation pipeline. Pipelines automate the process of tokenization, running the model, and then decoding the output tokens.
Here we will take an input prompt in the language that the model was trained on and generate text from it.

In [None]:
from transformers import pipeline

prompt = input("Prompt: ")
generate = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cuda")
output = generate(prompt, max_new_tokens=250)[0]["generated_text"]
print(output)
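
The three stages that the pipeline automates can be sketched with toy stand-ins. Everything in the example below is hypothetical: a whitespace "tokenizer" and a lookup table that plays the role of the trained model. It only illustrates the encode, predict, and decode loop, not how `transformers` implements it.

```python
# Toy stand-ins, all hypothetical: a tiny vocabulary and a next-token table.
VOCAB = ["<s>", "the", "cat", "sat", "down", "</s>"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
NEXT_TOKEN = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "</s>"}

def encode(text):
    """Stage 1: turn text into token ids."""
    return [TOKEN_TO_ID["<s>"]] + [TOKEN_TO_ID[word] for word in text.split()]

def predict_next(token_ids):
    """Stage 2: the 'model' picks the next token id from the last one."""
    return TOKEN_TO_ID[NEXT_TOKEN[VOCAB[token_ids[-1]]]]

def decode(token_ids):
    """Stage 3: turn token ids back into text, dropping special tokens."""
    words = [VOCAB[i] for i in token_ids if VOCAB[i] not in ("<s>", "</s>")]
    return " ".join(words)

def generate_text(prompt, max_new_tokens=10):
    """Append one predicted token at a time until the limit or end token."""
    token_ids = encode(prompt)
    for _ in range(max_new_tokens):
        next_id = predict_next(token_ids)
        token_ids.append(next_id)
        if VOCAB[next_id] == "</s>":  # stop at the end-of-sequence token
            break
    return decode(token_ids)

print(generate_text("the cat"))
```

A real model returns a probability for every token in the vocabulary at each step, and the generation settings decide how the next token is chosen from that distribution.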