<a target="_blank" href="https://colab.research.google.com/github/raghavbali/mastering_llms_workshop_dhs2025/blob/main/docs/module_02_llm_building_blocks/03_training_language_models.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Training Language Models

## The 2-Step Training Paradigm

Transformers are complex models assembled like LEGO blocks from several specialized components:
- A vanilla transformer model consists of separate stacks of encoders and decoders.
- Each encoder block includes multi-head self-attention, enabling the model to capture relationships between tokens regardless of their positions.
- Residual connections help maintain gradient flow, mitigating the vanishing gradient problem.
- Layer normalization ensures training stability, and feed-forward layers introduce non-linearity and learn complex token interactions.
- Decoder blocks contain the same components but also include an encoder-decoder attention mechanism to incorporate context from the encoder.
- The model uses embedding layers to convert tokens into a continuous latent space for contextual learning, and positional encoding to preserve the order of tokens in the sequence.
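
To make the components above more concrete, here is a minimal, dependency-free sketch of one attention sub-layer: single-head scaled dot-product self-attention, a residual connection, and layer normalization. It is a simplification for illustration only: the function names are hypothetical, queries, keys, and values are taken to be the inputs themselves (a real block applies learned projections and multiple heads), and layer normalization omits its learned scale and shift.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention without learned projections."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def encoder_sublayer(x):
    """Attention, then residual add, then layer norm (post-norm ordering)."""
    attended = self_attention(x)
    return [layer_norm([a + b for a, b in zip(xi, ai)]) for xi, ai in zip(x, attended)]

# Three toy token embeddings of dimension 4.
tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.5], [0.5, 0.5, 0.0, 1.0]]
out = encoder_sublayer(tokens)
print(len(out), len(out[0]))  # 3 4
```

The output keeps the input shape (one vector per token), which is what allows such blocks to be stacked.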

<img src="../assets/02_training_setup_01.png">

> Source: Babcock, Bali et al.

The two-step **training paradigm** in transformer models is designed as follows:
- **Pretraining** on large raw datasets such as OpenWebText lets the model learn broad language patterns and concepts, forming a strong foundation for a wide range of NLP tasks.
- **Fine-tuning**, the second step, uses task-specific datasets to tailor the model to particular tasks or domains.
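
For decoder-style models like the GPT-2 used later in this notebook, pretraining usually means next-token prediction: at each position the model is scored on how much probability it assigns to the token that actually follows. The sketch below illustrates that objective on a toy example. The `next_token_loss` helper, the 4-token vocabulary, and the uniform "model" are all hypothetical stand-ins, not part of any library.

```python
import math

def next_token_loss(probs, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    probs[t] is the predicted distribution over the vocabulary after
    reading tokens 0..t; token_ids is the actual sequence.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]  # the label is the *next* token
        losses.append(-math.log(probs[t][target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 4 tokens and an untrained model that guesses uniformly.
sequence = [0, 2, 1, 3]
uniform = [[0.25] * 4 for _ in sequence]
loss = next_token_loss(uniform, sequence)
print(f"loss: {loss:.4f}")  # loss: 1.3863, i.e. log(4)
```

Training drives this number down; fine-tuning typically reuses the same kind of loss on a smaller, task-specific dataset.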

In [1]:
import transformers
transformers.__version__

'4.44.0'

In [None]:
# !pip install -U datasets==2.20.0 huggingface_hub==0.23.4 accelerate==0.33.0

## Imports and Utils

In [2]:
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset
from datasets import load_dataset

In [1]:
def any_keyword_in_string(string, keywords):
    for keyword in keywords:
        if keyword in string:
            return True
    return False

In [2]:
def filter_streaming_dataset(dataset, filters):
    """Keep samples whose "content" field contains any keyword in `filters`."""
    filtered_dict = defaultdict(list)
    total = 0
    for sample in tqdm(iter(dataset)):
        total += 1
        if any_keyword_in_string(sample["content"], filters):
            for k, v in sample.items():
                filtered_dict[k].append(v)
    # Guard against an empty dataset to avoid a division by zero.
    kept = len(filtered_dict["content"]) / total if total else 0.0
    print(f"{kept:.2%} of data after filtering.")
    return Dataset.from_dict(filtered_dict)
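
The helper above is not called later in this notebook, and it reads a `content` column, whereas the dataset loaded below exposes its source text under `code`. As a rough illustration of the filtering logic, here is a dependency-free stand-in that works on a plain list of dicts. The `filter_samples` function, its `text_key` parameter, and the sample records are hypothetical.

```python
from collections import defaultdict

def any_keyword_in_string(string, keywords):
    """True if at least one keyword occurs in the string."""
    return any(keyword in string for keyword in keywords)

def filter_samples(samples, filters, text_key="content"):
    """Keep samples whose text contains a keyword; return a column-wise dict."""
    kept = defaultdict(list)
    for sample in samples:
        if any_keyword_in_string(sample[text_key], filters):
            for k, v in sample.items():
                kept[k].append(v)
    return dict(kept)

samples = [
    {"content": "import pandas as pd", "path": "a.py"},
    {"content": "print('hello')", "path": "b.py"},
    {"content": "import numpy as np", "path": "c.py"},
]
kept = filter_samples(samples, ["pandas", "numpy"])
print(kept["path"])  # ['a.py', 'c.py']
```

The column-wise dict mirrors the structure that `Dataset.from_dict` expects in the original helper.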

## Dataset Preparation

In [5]:
from datasets import load_dataset, DatasetDict

# Optionally, try loading the original dataset instead:
# transformersbook/codeparrot
ds_train = load_dataset("theothertom/codeparrot-python-only", split="train")
ds_valid = load_dataset("theothertom/codeparrot-python-only", split="validation")


raw_datasets = DatasetDict(
    {
        "train": ds_train.shuffle().select(range(5000)),
        "valid": ds_valid.shuffle().select(range(500))
    }
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['code', 'repo_name', 'path', 'language', 'license', 'size'],
        num_rows: 5000
    })
    valid: Dataset({
        features: ['code', 'repo_name', 'path', 'language', 'license', 'size'],
        num_rows: 500
    })
})

In [6]:
raw_datasets['train'][0].keys()

dict_keys(['code', 'repo_name', 'path', 'language', 'license', 'size'])

In [7]:
raw_datasets['train'][0]['size']

1321

In [8]:
# "size" is an integer and cannot be sliced, so skip it
for key in raw_datasets["train"][0]:
    if key != "size":
        print(f"{key.upper()}: {raw_datasets['train'][0][key][:200]}")

CODE: from __future__ import absolute_import

from sentry.testutils import AcceptanceTestCase


class AuthTest(AcceptanceTestCase):
    def enter_auth(self, username, password):
        # disable captcha as
REPO_NAME: mitsuhiko/sentry
PATH: tests/acceptance/test_auth.py
LANGUAGE: Python
LICENSE: bsd-3-clause


## Tokenize

In [9]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:2]["code"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {outputs['length']}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 4
Input chunk lengths: [128, 128, 80, 82]
Chunk mapping: [0, 0, 0, 1]
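
The output above can be read as follows: the first sample was split into three chunks (two full ones and an 80-token remainder), and the second sample fit into a single 82-token chunk. The sketch below mimics that chunking on plain lists. It is a simplified model of `return_overflowing_tokens=True` with no stride and no special tokens, and the token counts (336 and 82) are assumptions chosen to reproduce the lengths printed above.

```python
def chunk_with_overflow(token_lists, max_length):
    """Split each token list into consecutive pieces of at most max_length tokens."""
    chunks, lengths, mapping = [], [], []
    for sample_idx, tokens in enumerate(token_lists):
        for start in range(0, len(tokens), max_length):
            piece = tokens[start:start + max_length]
            chunks.append(piece)
            lengths.append(len(piece))
            mapping.append(sample_idx)  # which sample this chunk came from
    return chunks, lengths, mapping

# Stand-in token ID lists: sample 0 has 336 tokens, sample 1 has 82.
fake_tokens = [list(range(336)), list(range(82))]
chunks, lengths, mapping = chunk_with_overflow(fake_tokens, max_length=128)
print(lengths)  # [128, 128, 80, 82]
print(mapping)  # [0, 0, 0, 1]
```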




In [10]:
def tokenize(element):
    outputs = tokenizer(
        element["code"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
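
Note that `tokenize` keeps only chunks that are exactly `context_length` tokens long, so trailing remainders and short files are discarded. As a quick worked example, the snippet below applies the same rule to the four chunk lengths printed for the two sample files earlier; the variable names are illustrative.

```python
context_length = 128
chunk_lengths = [128, 128, 80, 82]  # lengths printed for the two samples earlier

kept = [n for n in chunk_lengths if n == context_length]
dropped = [n for n in chunk_lengths if n != context_length]

total_tokens = sum(chunk_lengths)
dropped_share = sum(dropped) / total_tokens
print(f"kept {len(kept)} of {len(chunk_lengths)} chunks")
# kept 2 of 4 chunks
print(f"dropped {sum(dropped)} of {total_tokens} tokens ({dropped_share:.1%})")
# dropped 162 of 418 tokens (38.8%)
```

In this small example the second file contributes no training sequences at all. This trades some data for simplicity: every sequence has the same length, so no padding is needed.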

In [11]:
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 79806
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 9225
    })
})

## Load Model

In [12]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

In [13]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.2M parameters
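
The 124.2M figure can be reproduced by hand from the architecture. The sketch below assumes the default GPT-2 small configuration (12 layers, hidden size 768, 1,024 positions), tied input and output embeddings, and a tokenizer vocabulary of 50,000 entries. The vocabulary size is an assumption that is consistent with the number printed above, and `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Parameter count for a GPT-2 style decoder with tied embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    layer_norm = 2 * n_embd  # weight + bias
    # Fused QKV projection plus output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two-layer feed-forward network with a 4x wider hidden layer.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # The final term is the layer norm applied after the last block.
    return embeddings + n_layer * per_layer + layer_norm

total = gpt2_param_count(vocab_size=50_000)
print(f"{total / 1000**2:.1f}M parameters")  # 124.2M parameters
```

Roughly 38M of these parameters sit in the token embedding matrix, so the vocabulary size has a noticeable effect on the size of a small model.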


In [14]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [15]:
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

input_ids shape: torch.Size([5, 128])
attention_mask shape: torch.Size([5, 128])
labels shape: torch.Size([5, 128])
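
With `mlm=False`, the collator prepares labels for causal language modeling: the labels are a copy of `input_ids`, with padding positions replaced by `-100` so that the loss ignores them. The one-position shift between inputs and targets is applied inside the model, not by the collator. The sketch below illustrates both steps on plain lists; `make_clm_labels` and `shifted_pairs` are hypothetical helpers, and the token IDs are made up.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def make_clm_labels(input_ids, pad_token_id):
    """Copy the inputs and mask out padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def shifted_pairs(input_ids, labels):
    """(current token, target token) pairs after the shift done inside the model."""
    return [
        (inp, lab)
        for inp, lab in zip(input_ids[:-1], labels[1:])
        if lab != IGNORE_INDEX
    ]

input_ids = [5, 8, 3, 0, 0]  # made-up IDs; 0 plays the role of the pad token
labels = make_clm_labels(input_ids, pad_token_id=0)
print(labels)                            # [5, 8, 3, -100, -100]
print(shifted_pairs(input_ids, labels))  # [(5, 8), (8, 3)]
```

Because every chunk in this notebook is exactly 128 tokens long, no padding should actually occur here. Since the pad token is set to the EOS token, an EOS token appearing in a sequence would presumably be masked in the same way.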


In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [17]:
import os
os.environ["WANDB_DISABLED"] = "true"

## Training Setup

In [18]:
from transformers import Trainer, TrainingArguments

In [19]:
args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=False,
    report_to="none",  # the string "none" disables logging integrations such as W&B
    # use_mps_device=False,
    # use_cpu=True
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
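
A rough back-of-the-envelope check shows what these arguments imply for a run of this size. The sketch assumes a single device and the 79,806 training sequences shown earlier (the exact count varies with the unseeded shuffle). The `lr_at` helper is a hypothetical re-implementation of linear warmup followed by cosine decay, written for illustration rather than taken from the library.

```python
import math

num_sequences = 79_806   # rows in tokenized_datasets["train"] shown earlier
per_device_batch = 32
grad_accum = 8
warmup_steps = 1_000
peak_lr = 5e-4

micro_batches = math.ceil(num_sequences / per_device_batch)  # dataloader length
optimizer_steps = micro_batches // grad_accum                # updates in one epoch
effective_batch = per_device_batch * grad_accum              # sequences per update

def lr_at(step, total_steps):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

final_lr = lr_at(optimizer_steps, optimizer_steps)
print(effective_batch)   # 256
print(optimizer_steps)   # 311
print(final_lr)
```

Under these assumptions, `warmup_steps=1_000` exceeds the roughly 311 optimizer steps in the epoch, so the run would end while the learning rate is still warming up, at about 1.6e-4 rather than the configured 5e-4. Likewise, `logging_steps=5_000` is never reached, which would explain an empty loss table. For short runs like this one, smaller values for both are worth considering.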


In [20]:
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)

In [21]:
# Takes up to about 6 minutes on an A40 GPU
trainer.train()

Step,Training Loss


TrainOutput(global_step=311, training_loss=6.315891278135048, metrics={'train_runtime': 338.9317, 'train_samples_per_second': 235.463, 'train_steps_per_second': 0.918, 'total_flos': 5200756604928000.0, 'train_loss': 6.315891278135048, 'epoch': 0.9975942261427426})
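
To put the reported training loss in context, it helps to convert it to perplexity and compare it with the loss of a model that guesses uniformly over the vocabulary. The snippet below does that arithmetic. The vocabulary size of 50,000 is an assumption about the tokenizer. Note also that `training_loss` is an average over the whole run, so it includes the very high losses from the first steps.

```python
import math

training_loss = 6.315891278135048  # from the TrainOutput above
vocab_size = 50_000                # assumed tokenizer vocabulary size

perplexity = math.exp(training_loss)  # effective number of equally likely choices
uniform_loss = math.log(vocab_size)   # loss of a uniform random guesser

print(f"perplexity: {perplexity:.0f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

A perplexity in the hundreds is far better than random guessing but still high, which is consistent with a single short epoch on a small data sample.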

## Push to Hub

In [None]:
# trainer.push_to_hub()

## Let's Generate Some Code

In [22]:
import torch
from transformers import pipeline

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
pipe = pipeline(
    "text-generation", 
    model="raghavbali/codeparrot-ds", 
    tokenizer="huggingface-course/code-search-net-tokenizer",
    device=device,
    temperature=0.9
)

config.json:   0%|          | 0.00/871 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/497M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
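
The `temperature=0.9` argument rescales the model's logits before they are turned into probabilities: values below 1 sharpen the distribution, values above 1 flatten it. The sketch below shows the effect on a toy set of logits. The helper names and the logits are hypothetical, and this is a simplified picture of what happens during sampling, not the library's implementation.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # toy scores for a 3-token vocabulary
p_default = softmax_with_temperature(logits, 1.0)
p_sharper = softmax_with_temperature(logits, 0.9)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharper])

token = sample_token(logits, temperature=0.9, rng=random.Random(0))
print(token)
```

Temperature only has an effect when sampling is enabled (`do_sample=True`); whether that is the case here depends on the generation config shipped with the model.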



In [23]:
txt = """\
# return the sum of x & y
def sum(x,y):
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# return the sum of x & y
def sum(x,y):
 return (x)

# Create a list to search
print('M') for the number of word

# Find the frequency
sorted_idx =


In [24]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
y = np.add([1)))
y = tf.


---
## Recap
- **Training Paradigm**: The notebook introduces the two-step training paradigm for transformer models, involving pretraining on large datasets to learn general language patterns and fine-tuning on task-specific data.
- **Dataset Preparation and Tokenization**: It details how to prepare datasets for training: loading a Python code dataset, tokenizing it with a pretrained Hugging Face tokenizer, and splitting it into fixed-length chunks of 128 tokens for model input.
- **Model Training and Generation**: The notebook covers setting up the training environment, configuring the GPT-2 model, and running the training process. Additionally, it demonstrates code generation with a trained model through a text-generation pipeline at a temperature of 0.9.