## Understanding LLMs and Pre-training

In [None]:
%pip install datasets transformers[sentencepiece]

In [1]:
import torch
from datasets import load_dataset, DatasetDict

from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    GPT2Tokenizer,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    AutoConfig,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)



#### Understanding Masked LMs

In [2]:
## The first model we will look at is BERT, which is trained with masked tokens. As an example,
## the text below masks the word "box" from a well-known movie quote.

text = "Life is like a [MASK] of chocolates."

In [3]:
## We'll now see how BERT is able to predict the missing word. We can use HuggingFace to load
## a copy of the pretrained model and tokenizer.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
%pip install hf_xet

In [4]:
## Next, we'll feed our example text into the tokenizer.

encoded_input = tokenizer(text, return_tensors='pt')
print('input_ids:', encoded_input['input_ids'])
print('attention_mask:', encoded_input['attention_mask'])
     

input_ids: tensor([[ 101, 2166, 2003, 2066, 1037,  103, 1997, 7967, 2015, 1012,  102]])
attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


In [5]:
## input_ids represents the tokenized output. Each integer can be mapped back to the corresponding string.

print(tokenizer.decode([7967]))

chocolate
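
The eleven ids printed earlier line up one-to-one with the pieces of the input sentence: the tokenizer wraps the text in `[CLS]` and `[SEP]` and splits "chocolates" into two word pieces. The sketch below illustrates that with a hand-built lookup table. Note the assumptions: `toy_vocab` and `decode_ids` are hypothetical helpers, not part of `transformers`, and every id-to-token pair other than 7967 is inferred by aligning the ids with the sentence rather than read from the real vocabulary file.

```python
# Hypothetical lookup table holding only the ids from the example above.
# 7967 -> "chocolate" comes from the notebook output; the other pairs are
# inferred by aligning the eleven ids with the input sentence.
toy_vocab = {
    101: "[CLS]", 2166: "life", 2003: "is", 2066: "like", 1037: "a",
    103: "[MASK]", 1997: "of", 7967: "chocolate", 2015: "##s",
    1012: ".", 102: "[SEP]",
}

def decode_ids(ids):
    """Map ids to tokens and glue '##' continuation pieces onto the previous token."""
    words = []
    for token in (toy_vocab[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

input_ids = [101, 2166, 2003, 2066, 1037, 103, 1997, 7967, 2015, 1012, 102]
print(decode_ids(input_ids))
# [CLS] life is like a [MASK] of chocolates . [SEP]
```

The real `tokenizer.decode` also handles spacing around punctuation; this sketch only shows how special tokens and `##` pieces fit together.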


In [6]:
## The model will then receive the output of the tokenizer. We can look at the BERT model to see exactly how
## it was constructed and what the outputs will be like.

model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

The model starts by embedding each of the 30,522 possible tokens into 768 dimensions. At this point each vector represents only the token itself (plus its position and segment), with no information about how it relates to the other tokens in the text. The 12 encoder attention blocks then update the embeddings so that each one encodes its token's role in the text and its interactions with the other tokens. Notably, this includes the masked token as well. The final stage is the language model head, which maps the embedding at every position back to 30,522 dimensions. Each entry of that vector is a logit, an unnormalized score for one vocabulary token; applying a softmax to the logits at the masked position gives the probability that each token is the correct choice to fill the mask.

In [7]:
model_output = model(**encoded_input)
output = model_output["logits"]

print(output.shape)

torch.Size([1, 11, 30522])


In [8]:
tokens = encoded_input['input_ids'][0].tolist()
masked_index = tokens.index(tokenizer.mask_token_id)
logits = output[0, masked_index, :]

print(logits.shape)

torch.Size([30522])


In [9]:
probs = logits.softmax(dim=-1)
values, predictions = probs.topk(5)
sequence = tokenizer.decode(predictions)

print('Top 5 predictions:', sequence)
print(values)

Top 5 predictions: box bag bowl jar cup
tensor([0.1764, 0.1688, 0.0419, 0.0336, 0.0262], grad_fn=<TopkBackward0>)


Printing the top 5 predictions and their probabilities, we see that BERT correctly ranks "box" as the most likely replacement for the mask token, narrowly ahead of "bag" (0.176 versus 0.169).
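
To make the logits-to-probabilities step concrete, here is a minimal pure-Python sketch of softmax followed by a top-k selection. The logits are made-up values over a five-word toy vocabulary, not BERT's real scores, and `softmax` and `top_k` are illustrative helpers rather than library functions.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits over a toy vocabulary (assumption, for illustration only).
labels = ["box", "bag", "bowl", "jar", "cup"]
logits = [4.0, 3.9, 2.5, 2.3, 2.0]

probs = softmax(logits)
for label, p in top_k(probs, labels, 3):
    print(f"{label}: {p:.3f}")
```

Softmax preserves the ordering of the logits, so the top-ranked token is the same before and after normalization; the normalization only makes the scores comparable as probabilities.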

#### Understanding Causal LMs

We now repeat a similar exercise with the causal LLM GPT-2. This model generates text following an input, instead of replacing a mask within the text.

In [10]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [11]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

The model begins by embedding each of the 50,257 tokens into a 768-dimensional space, initially representing only individual token identities and positions. Next, 12 transformer blocks apply self-attention and MLP layers, allowing tokens to interact and refine their contextual representations. These enriched embeddings capture dependencies across the sequence.
Finally, a linear head maps each embedding back to 50,257 logits. Applying a softmax to the logits at a given position yields the probability distribution over the vocabulary for the next token.
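
The layer shapes in the printout are enough to estimate the model's size by hand. The sketch below tallies parameters using only those shapes. It assumes each `Conv1D` layer has a bias vector and that `lm_head` shares its weight matrix with `wte` (so it adds no parameters of its own); `linear_params` is a helper introduced here for illustration.

```python
# Shapes copied from the printed GPT2LMHeadModel above.
vocab_size, n_positions, d_model, d_mlp, n_layers = 50257, 1024, 768, 3072, 12

def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias for one dense (Linear or Conv1D) layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * d_model  # one scale and one shift vector

embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe
block = (
    layer_norm                              # ln_1
    + linear_params(d_model, 3 * d_model)   # c_attn: fused query/key/value (nf=2304)
    + linear_params(d_model, d_model)       # attention c_proj
    + layer_norm                            # ln_2
    + linear_params(d_model, d_mlp)         # c_fc
    + linear_params(d_mlp, d_model)         # mlp c_proj
)
# Assumption: lm_head reuses the wte matrix, so it contributes nothing extra.
total = embeddings + n_layers * block + layer_norm  # final term is ln_f

print(f"per block: {block:,}")   # per block: 7,087,872
print(f"total:     {total:,}")   # total:     124,439,808
```

Under those assumptions the total comes to roughly 124 million parameters, about 31% of which sit in the token embedding table.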

In [12]:
## We'll use a different text example, since this model works by producing tokens sequentially
## rather than filling a mask.

text = "Swimming at the beach is"
model_inputs = tokenizer(text, return_tensors='pt')
model_inputs

{'input_ids': tensor([[10462, 27428,   379,   262, 10481,   318]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [13]:
## After applying the model, the information needed to predict the next token is represented by
## the last token. So we can access that vector by the index -1.

output = model(**model_inputs)
next_token_logits = output.logits[:, -1, :]
next_token = torch.argmax(next_token_logits, dim=-1)
print(next_token)

tensor([257])
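
The last position can stand in for the whole prompt because, in a causal model, each position attends only to itself and to earlier positions. The following pure-Python sketch prints that attention pattern for a six-token prompt; `causal_mask` is an illustrative helper, not a `transformers` API, and the real model applies the same pattern inside its attention layers.

```python
def causal_mask(seq_len):
    """mask[i][j] is 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(6)  # the prompt above has six tokens
for row in mask:
    print(row)
# Only the last row is all ones: the final position sees the entire prompt,
# which is why its logits are the ones used to pick the next token.
```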


In [14]:
## Now add the new token to the end of the text, and feed all of it back to the model to continue
## predicting more tokens.

model_inputs['input_ids'] = torch.cat([model_inputs['input_ids'], next_token[:, None]], dim=-1)
model_inputs["attention_mask"] = torch.cat([model_inputs['attention_mask'], torch.tensor([[1]])], dim=-1)
print(model_inputs)

{'input_ids': tensor([[10462, 27428,   379,   262, 10481,   318,   257]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


In [15]:
## Here's what we have so far. The model added the word 'a' to the input text.

print(tokenizer.decode(model_inputs['input_ids'][0]))

Swimming at the beach is a


In [16]:
## Repeating all the previous steps, we then add the word 'great'.

output = model(**model_inputs)
next_token_logits = output.logits[:, -1, :]
next_token = torch.argmax(next_token_logits, dim=-1)
model_inputs['input_ids'] = torch.cat([model_inputs['input_ids'], next_token[:, None]], dim=-1)
model_inputs["attention_mask"] = torch.cat([model_inputs['attention_mask'], torch.tensor([[1]])], dim=-1)
print(tokenizer.decode(model_inputs['input_ids'][0]))

Swimming at the beach is a great


In [17]:
## HuggingFace automates this iterative process. We'll use the quicker approach to finish our sentence.

output_generate = model.generate(**model_inputs, max_length=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_generate[0]))

Swimming at the beach is a great way to get a little extra energy.

The beach
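
The loop that `generate` automates here is greedy decoding: pick the highest-scoring next token, append it, and repeat until a length limit or an end-of-sequence token is reached. The sketch below shows that loop with a hypothetical stand-in for the model. `toy_scores` and `greedy_generate` are invented for illustration, and unlike GPT-2 the toy "model" looks only at the last token.

```python
# Hypothetical stand-in for a language model: maps the last token to scores
# for candidate next tokens. A real model conditions on the whole sequence.
toy_scores = {
    "is": {"a": 2.0, "the": 1.5},
    "a": {"great": 1.8, "good": 1.7},
    "great": {"way": 2.2, "day": 0.9},
    "way": {"<eos>": 3.0},
}

def greedy_generate(prompt, max_length):
    """Append the highest-scoring next token until max_length or <eos>."""
    tokens = list(prompt)
    while len(tokens) < max_length:  # max_length counts prompt tokens too
        scores = toy_scores.get(tokens[-1])
        if not scores:
            break
        next_token = max(scores, key=scores.get)  # argmax over candidates
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

prompt = ["Swimming", "at", "the", "beach", "is"]
print(" ".join(greedy_generate(prompt, max_length=20)))
# Swimming at the beach is a great way
```

As in the `generate` call above, `max_length` here bounds the total sequence length, prompt included, rather than the number of newly generated tokens.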


### Pre-training a GPT-2 model from scratch

Next we'll train a GPT-2 model from scratch using English Wikipedia data. Note that we're only using a tiny subset of the data to demonstrate that the model is capable of learning. The exact same approach could be followed on the full dataset to train a more functional model, but that would require a lot of compute.

In [2]:
dataset = load_dataset("wikipedia", "20220301.en")
ds_shuffle = dataset['train'].shuffle()

raw_datasets = DatasetDict(
    {
        "train": ds_shuffle.select(range(50)),
        "valid": ds_shuffle.select(range(50, 100))
    }
)

raw_datasets


DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 50
    })
    valid: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 50
    })
})

In [3]:
print(raw_datasets['train'][0]['text'][:200])

Gymnothorax angusticeps is a moray eel found in the southeast Pacific Ocean, around Peru. It was first named by Hildebrand and Barton in 1949. It is colloquially known as the wrinkled moray.

Referenc


In [4]:
## We'll tokenize the text, setting the context size to 128 and thus breaking each document into chunks of 128 tokens.

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("gpt2")

outputs = tokenizer(
    raw_datasets["train"][:2]["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 2
Input chunk lengths: [64, 17]
Chunk mapping: [0, 1]
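
Both sample documents are shorter than 128 tokens, so each produces a single chunk. For longer documents, the chunking and the sample mapping can be pictured with the simplified sketch below. It works on plain lists of made-up token ids and assumes no overlap between chunks; `chunk_documents` is a hypothetical helper, not the tokenizer's implementation.

```python
def chunk_documents(docs, context_length):
    """Split each token list into pieces of at most context_length tokens.

    Returns the chunks, their lengths, and for each chunk the index of the
    document it came from.
    """
    chunks, lengths, mapping = [], [], []
    for doc_index, token_ids in enumerate(docs):
        for start in range(0, len(token_ids), context_length):
            piece = token_ids[start:start + context_length]
            chunks.append(piece)
            lengths.append(len(piece))
            mapping.append(doc_index)
    return chunks, lengths, mapping

# Made-up token ids: one document of 10 tokens and one of 3, with a window of 4.
docs = [list(range(10)), list(range(100, 103))]
chunks, lengths, mapping = chunk_documents(docs, context_length=4)
print(lengths)  # [4, 4, 2, 3]
print(mapping)  # [0, 0, 0, 1]
```

The mapping plays the role of `overflow_to_sample_mapping`: it records which source document each chunk belongs to once one document can yield several rows.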


In [8]:
def tokenize(element):
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 697
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 304
    })
})
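
Because `tokenize` keeps only chunks that are exactly `context_length` tokens long, the trailing remainder of every document is discarded, and documents shorter than 128 tokens contribute nothing at all. A small sketch of that filter, using the chunk lengths printed earlier plus one hypothetical longer document (`keep_full_chunks` is an illustrative helper):

```python
context_length = 128

def keep_full_chunks(chunk_lengths, context_length):
    """Return the indices of chunks whose length equals context_length."""
    return [i for i, n in enumerate(chunk_lengths) if n == context_length]

# Lengths of the two sample documents printed earlier: both are too short.
print(keep_full_chunks([64, 17], context_length))  # []

# A hypothetical 300-token document yields chunks of 128, 128, and 44 tokens;
# only the first two survive, so 44 tokens are discarded.
print(keep_full_chunks([128, 128, 44], context_length))  # [0, 1]

# The 697 training rows reported above therefore hold 697 * 128 tokens.
print(697 * context_length)  # 89216
```

Dropping the short chunks means every row has the same length, so no padding is needed when batching, at the cost of losing some text.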

Now we can set up the HuggingFace Trainer as follows. Because all of the parameters are randomly initialized and the dataset is so small, we'll need many epochs for the model to make progress. By contrast, LLMs are typically trained for roughly one epoch over a much larger and more diverse corpus.

In [5]:
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

model = GPT2LMHeadModel(config)

In [6]:

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [11]:
args = TrainingArguments(
    output_dir="wiki-gpt2",
    eval_strategy="steps",
    num_train_epochs=100
)

trainer = Trainer(
    model=model,
    processing_class=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"]
)

In [12]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


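| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 3.7277 | 8.048859 |
| 1000 | 2.3347 | 8.357629 |
| 1500 | 1.6549 | 8.582697 |
| 2000 | 1.0569 | 8.8822 |
| 2500 | 0.5793 | 9.064018 |
| 3000 | 0.2667 | 9.28649 |
| 3500 | 0.1266 | 9.376274 |
| 4000 | 0.0766 | 9.436334 |
| 4500 | 0.0547 | 9.625267 |
| 5000 | 0.043 | 9.589741 |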


TrainOutput(global_step=8800, training_loss=0.5735442948341369, metrics={'train_runtime': 4021.6566, 'train_samples_per_second': 17.331, 'train_steps_per_second': 2.188, 'total_flos': 4553013657600000.0, 'train_loss': 0.5735442948341369, 'epoch': 100.0})
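
The `global_step=8800` figure can be reproduced from the dataset size. The arithmetic below assumes the Trainer's default per-device batch size of 8, a single device, and no gradient accumulation, since none of these are overridden in the `TrainingArguments` above.

```python
import math

train_rows, valid_rows = 697, 304   # from the tokenized dataset above
num_epochs = 100                    # num_train_epochs in TrainingArguments
batch_size = 8                      # assumed: default per-device batch size, one device

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(valid_rows / batch_size)

print(steps_per_epoch)  # 88
print(total_steps)      # 8800, matching global_step in TrainOutput
print(eval_steps)       # 38
```

The same numbers are consistent with the reported throughput: 100 passes over 697 rows in about 4,022 seconds is roughly 17.3 samples per second.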

In [13]:
trainer.evaluate()

{'eval_loss': 9.871642112731934,
 'eval_runtime': 4.4402,
 'eval_samples_per_second': 68.466,
 'eval_steps_per_second': 8.558,
 'epoch': 100.0}

The training loss is low by the end (about 0.04 at step 5000), which means the model should perform very well on training examples it has seen. It does not generalize to the validation set, of course: validation loss climbs from 8.05 at step 500 to 9.87 after the final epoch, because we deliberately overfit on a small training set.
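
One way to read these loss values is to convert them to perplexity and compare them with a model that guesses uniformly over the vocabulary. The sketch below assumes the reported losses are mean per-token cross-entropy in nats; the loss values are copied (rounded) from the outputs above.

```python
import math

vocab_size = 50257      # GPT-2 vocabulary size
train_loss = 0.043      # last logged training loss (step 5000)
eval_loss = 9.8716      # eval_loss from trainer.evaluate(), rounded

# Cross-entropy of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"train perplexity:   {perplexity(train_loss):.2f}")
print(f"eval perplexity:    {perplexity(eval_loss):,.0f}")
```

A training perplexity close to 1 means the model is almost certain of each next token in text it has memorized, while a validation loss of 9.87 is not far below the uniform-guess baseline of about 10.82.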

We can confirm with a couple of examples that were seen in training.

In [14]:
text = tokenizer.decode(tokenized_datasets["train"][0]['input_ids'][:16])
print(text)

David McGowan (born 27 April 1981) is an Australian high-performance ro


In [15]:
model_inputs = tokenizer(text, return_tensors='pt')
print(model_inputs['input_ids'].shape)

torch.Size([1, 16])


In [16]:

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model_inputs['input_ids'] = model_inputs['input_ids'].to(device)
model_inputs['attention_mask'] = model_inputs['attention_mask'].to(device)

output_generate = model.generate(**model_inputs, max_new_tokens=16)
output_generate

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[11006, 11130, 45197,   357,  6286,  2681,  3035, 14745,     8,   318,
           281,  6638,  1029,    12, 26585,   686,  5469,  3985,   290,  1966,
          8852,  5752,   263,    13,  1081,   257,  5752,   263,   339,   373,
           257, 13430]], device='cuda:0')

In [17]:
sequence = tokenizer.decode(output_generate[0])
print(sequence)

David McGowan (born 27 April 1981) is an Australian high-performance rowing coach and former representative rower. As a rower he was a junior


As expected, the model does well at reciting text it has seen so many times. This suggests that the tokenizer, model architecture, and training objective are capable of learning Wikipedia text. For comparison, we'll try this model on text from the validation set.

In [18]:
text = tokenizer.decode(tokenized_datasets["valid"][0]['input_ids'][:32])
print(text)

R (Coughlan) v North and East Devon Health Authority [1999] EWCA Civ 1871 is a UK enterprise law case, concerning health care in


In [19]:
model_inputs = tokenizer(text, return_tensors='pt')

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model_inputs['input_ids'] = model_inputs['input_ids'].to(device)
model_inputs['attention_mask'] = model_inputs['attention_mask'].to(device)

output_generate = model.generate(**model_inputs, max_new_tokens=16)
sequence = tokenizer.decode(output_generate[0])
print(sequence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


R (Coughlan) v North and East Devon Health Authority [1999] EWCA Civ 1871 is a UK enterprise law case, concerning health care in the fifth Discworld. He has stated that the resource recovery industry. He is


In [20]:
raw_datasets['valid'][0]['text']

'R (Coughlan) v North and East Devon Health Authority [1999] EWCA Civ 1871 is a UK enterprise law case, concerning health care in the UK.\n\nFacts \nMiss Coughlan claimed she should be able to remain at Mardon House, Exeter, purpose built for her and seven others with severe disabilities. After a 1971 road traffic accident, she became tetraplegic, needing constant care. Devon HA decided it should be closed in 1996, although she had been assured before it was a ‘home for life’. The Health Authority argued Mardon House had become ‘a prohibitively expensive white elephant’ which ‘left fewer resources available for other services’.\n\nJudgment\nThe Court of Appeal held there was a legitimate expectation to fair treatment, with a substantive benefit of the ‘home for life’. Frustrating the expectation would be an abuse of power. However the duty to promote health, in statute, was not a duty to ensure the service was comprehensive.\n\nLord Woolf MR said that the failure to keep the home open 

As expected, our model is completely confused this time. We'd need to train for much longer, and on much more diverse data, before we would have a model that can sensibly complete prompts it has never seen before. This is precisely why pre-training is such an important and powerful technique. If we had to train on all of Wikipedia for every NLP application to achieve optimal performance, it would be prohibitively expensive. But there's no need to do that when we can share and reuse existing pre-trained models as we did in the first part of this tutorial.