# 📝 NOTE

Training a GPT-2 355M model on 3 billion tokens with a CPU is simply not a good use of time or resources. For this notebook, we'll use **GPU training**. It is _possible_ to complete this training on a single GPU within a reasonable amount of time.

If you don't have access to a GPU, there are a few good options out there such as:

* Lambda Cloud
* Fly.io
* Linode

For this notebook, I used my personal NVIDIA RTX 6000 Ada GPU to train. 
* For a batch size of 1, it required 17 GB of VRAM to train. 👍
* For a batch size of 4, it required 47 GB of VRAM to train. 😓 
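
To get a rough feel for what these two measurements imply, here is a minimal sketch that fits a straight line through them. It assumes VRAM use grows linearly with batch size, which is only an approximation (allocator caching and other overhead can make real usage deviate). The name `estimate_vram_gb` is made up for illustration.

```python
# Two measured points from this notebook: batch size -> VRAM in GB.
MEASURED = {1: 17.0, 4: 47.0}

# Assumption: VRAM ~= fixed_gb + per_sample_gb * batch_size.
per_sample_gb = (MEASURED[4] - MEASURED[1]) / (4 - 1)
fixed_gb = MEASURED[1] - per_sample_gb * 1


def estimate_vram_gb(batch_size: int) -> float:
    """Linear estimate of training VRAM (GB) for a given batch size."""
    return fixed_gb + per_sample_gb * batch_size


for bs in (1, 2, 3, 4):
    print(f"batch size {bs}: ~{estimate_vram_gb(bs):.0f} GB")
```

Under that assumption, each extra sample in the batch costs about 10 GB on top of roughly 7 GB of fixed overhead, which is why batch size 4 is about the limit for a 48 GB card.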


May 17, 2025 - 4:00 PM Step 0

May 17, 2025 - 8:30 PM Step 40950

May 17, 2025 - 9:05 PM Step 45350

May 18, 2025 - 4:25 AM Step 113650
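
The timestamps above can be turned into a throughput figure. This is a small sketch, assuming the four entries come from one uninterrupted run and were all recorded in the same timezone. The helper `steps_per_hour` is a name I made up for this example.

```python
from datetime import datetime

# (timestamp, step) pairs copied from the log above.
LOG = [
    ("2025-05-17 16:00", 0),
    ("2025-05-17 20:30", 40950),
    ("2025-05-17 21:05", 45350),
    ("2025-05-18 04:25", 113650),
]

points = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), step) for ts, step in LOG]


def steps_per_hour(start, end):
    """Average steps per hour between two (datetime, step) points."""
    hours = (end[0] - start[0]).total_seconds() / 3600
    return (end[1] - start[1]) / hours


# Rate for each interval between consecutive log entries.
for prev, cur in zip(points, points[1:]):
    rate = steps_per_hour(prev, cur)
    print(f"step {prev[1]:>6} -> {cur[1]:>6}: {rate:,.0f} steps/hour")

# Rate over the whole logged span.
overall = steps_per_hour(points[0], points[-1])
print(f"overall: {overall:,.0f} steps/hour")
```

Dividing the number of remaining steps by the overall rate gives a rough estimate of the hours left in the run.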

----
Batch Size = 4, RTX 6000 Ada

May 18, 2025 - 6:55 PM Step 0

## Cost of Cloud Training

I'll cut to the chase. This discussion may help you decide whether training on the cloud is worth it for you.

### Lambda Cloud

* I initially trained on Lambda Cloud with their H100 80GB PCIe offering, using a batch size of 1, for $2.49 an hour.
* Next, when an H100 80GB SXM became available, I trained with a batch size of 4 for $3.49 an hour.
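
One way to compare the two offerings is cost per token rather than cost per hour. The sketch below assumes both prices are hourly rates. Throughput numbers are not recorded here, so the code leaves them as parameters and only computes the break-even speedup. The function name `usd_per_million_tokens` is hypothetical.

```python
PCIE_USD_PER_HOUR = 2.49  # H100 80GB PCIe, batch size 1
SXM_USD_PER_HOUR = 3.49   # H100 80GB SXM, batch size 4 (assumed to be hourly too)


def usd_per_million_tokens(usd_per_hour: float, tokens_per_hour: float) -> float:
    """Cost of processing one million training tokens at a given throughput."""
    return usd_per_hour / tokens_per_hour * 1_000_000


# The pricier instance is cheaper per token only if its throughput gain
# exceeds the price ratio between the two instances.
break_even_speedup = SXM_USD_PER_HOUR / PCIE_USD_PER_HOUR
print(f"SXM must process > {break_even_speedup:.2f}x the tokens/hour to win on cost")
```

If you measure tokens per hour on each instance, plugging those numbers into `usd_per_million_tokens` gives a direct comparison.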

### Local

* I also trained on my own personal RTX 6000 Ada GPU. Batch size 1
* Then I trained again on RTX 6000 Ada GPU with batch size 4

### Linode CPU

Here's why you should _never_ attempt CPU training. 

# Pretraining 2 - GPT-2 355M - GPU Training

I've included the training script `model_train.py` in the same directory as this notebook. It is the script I actually used to produce the model.

## Loading the Input and Validation Tokens

We created the dataloaders in the previous notebook. Let's load them back up.

In [2]:
from scripts.preload_dataloaders import load_pickled_dataloader

train_loader = load_pickled_dataloader("data/fineweb-3b/train_loader.dl")
print("Loaded train_loader.")

val_loader = load_pickled_dataloader("data/fineweb-3b/val_loader.dl")
print("Loaded val_loader")

len(train_loader), len(val_loader)

Loaded train_loader.
Loaded val_loader


(617022, 108930)

Check the maximum token ID to make sure no token IDs are out of range. An out-of-range ID means the dataloader could be corrupted. I added this check because bad data in the previous notebook forced me to rebuild the loaders.

In [None]:
# Check the token ID range in the dataset
max_token = float('-inf')
for input_batch, _ in train_loader:
    max_token = max(max_token, input_batch.max().item())

print(f"Maximum token ID: {max_token}")

Maximum token ID: 50256
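
The check above boils down to every ID falling in the range `[0, vocab_size)`. Here is a minimal sketch of a stricter version that also looks at the lower bound and at the target batches. It uses plain Python lists in place of tensors so it runs without PyTorch, and `check_token_ids` is a hypothetical helper, not something from `scripts/`.

```python
def check_token_ids(batches, vocab_size):
    """Return (min_id, max_id) over all batches, raising if any ID is out of range.

    `batches` yields (inputs, targets) pairs; each is a list of token-ID lists.
    """
    lo, hi = None, None
    for inputs, targets in batches:
        for sequence in list(inputs) + list(targets):
            for token_id in sequence:
                lo = token_id if lo is None else min(lo, token_id)
                hi = token_id if hi is None else max(hi, token_id)
    if lo is None:
        raise ValueError("no token IDs found")
    if lo < 0 or hi >= vocab_size:
        raise ValueError(
            f"token IDs span [{lo}, {hi}], expected [0, {vocab_size - 1}]"
        )
    return lo, hi


# Toy data: 50256 is the largest valid ID for a 50257-token vocabulary.
toy_batches = [([[15496, 11, 995]], [[11, 995, 50256]])]
print(check_token_ids(toy_batches, vocab_size=50257))
```

With real tensors, calling `.min().item()` and `.max().item()` per batch is much faster than looping over individual IDs.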


In [None]:
from scripts.gpt2_model import GPTModel

# We'll use CUDA here
device = "cuda"

GPT_CONFIG_355M = {
  "vocab_size": 50257,   # Vocabulary size
  "context_length": 1024, # Context length
  "emb_dim": 1024,        # Embedding dimension (larger than 124M)
  "n_heads": 16,         # Number of attention heads (larger than 124M)
  "n_layers": 24,        # Number of layers (larger than 124M)
  "drop_rate": 0.0,      # Dropout rate
  "qkv_bias": False      # Query-key-value bias
}

model = GPTModel(GPT_CONFIG_355M)
model = model.to(device)
model

GPTModel(
  (tok_emb): Embedding(50257, 1024)
  (pos_emb): Embedding(1024, 1024)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=1024, out_features=1024, bias=False)
        (W_key): Linear(in_features=1024, out_features=1024, bias=False)
        (W_value): Linear(in_features=1024, out_features=1024, bias=False)
        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layer): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU()
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear
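
As a sanity check on the "355M" name, the parameter count can be worked out from the config alone. This is a sketch under a few assumptions that the truncated printout does not confirm: each `LayerNorm` holds a scale and a shift vector of size `emb_dim`, and the model ends with one final `LayerNorm` plus a bias-free output projection of shape `emb_dim x vocab_size` that is not tied to the token embedding. `count_params` is a made-up helper.

```python
GPT_CONFIG_355M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "drop_rate": 0.0,
    "qkv_bias": False,
}


def count_params(cfg):
    """Return (total, out_head) parameter counts derived from the config."""
    d, vocab, ctx = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    embeddings = vocab * d + ctx * d
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    out_proj = d * d + d
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * (2 * d)  # assumed: scale + shift per LayerNorm
    block = qkv + out_proj + feed_forward + layer_norms
    final_norm = 2 * d         # assumed, not visible in the printout
    out_head = d * vocab       # assumed untied and bias-free
    total = embeddings + cfg["n_layers"] * block + final_norm + out_head
    return total, out_head


total, out_head = count_params(GPT_CONFIG_355M)
print(f"total parameters: {total:,}")
print(f"excluding output head (as if weight-tied): {total - out_head:,}")
```

Under these assumptions, the count excluding the output head comes to roughly 355M, which lines up with the model's name.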

## Existing Training and Validation Loss

If you're reloading a model, it's useful to check its current training and validation loss. Warning: this performs a forward pass over all of your data, so it is still expensive.

In [8]:
import torch
from scripts.train import calc_loss_loader

torch.manual_seed(123)

train_loss = calc_loss_loader(train_loader, model)
val_loss = calc_loss_loader(val_loader, model)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

NameError: name 'train_loader' is not defined

## Training

Now it is time to train our 355M model. Running the training inside the notebook is not a good idea because it will take too long. I have included 2 scripts you can run within a `tmux` session so that training continues after you disconnect.

* `model_train.py`
* `model_inference.py`

In `train_model_simple`, I have modified the code to take a `device` argument to accommodate a GPU. This caused a ripple effect of adding the argument elsewhere.

Additionally, I included a new `max_iters` argument that stops training after a specific number of steps, since an entire epoch takes too long when you just need a quick test.
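
The idea behind a step cap like `max_iters` can be sketched in a few lines. This is not the code in `train_model_simple`, just a minimal illustration of the pattern. `run_training` and `train_step` are hypothetical names, and the "batches" here are plain integers.

```python
def run_training(batches, train_step, num_epochs=1, max_iters=None):
    """Call `train_step(batch)` per batch, stopping once `max_iters` steps have run.

    Returns the number of steps actually executed.
    """
    step = 0
    for _ in range(num_epochs):
        for batch in batches:
            # Stop early when a cap is set and has been reached.
            if max_iters is not None and step >= max_iters:
                return step
            train_step(batch)
            step += 1
    return step


# Smoke test: 1,000 fake batches, but stop after 25 steps.
seen = []
steps = run_training(range(1000), seen.append, max_iters=25)
print(steps, len(seen))
```

Leaving `max_iters=None` runs the full number of epochs, so the same function serves both quick tests and real runs.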

In [None]:
from scripts.perf_timer import PerfTimer
from scripts.train import train_model_simple
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

torch.manual_seed(123)

model = GPTModel(GPT_CONFIG_355M)
model = model.to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

# We have lots of data, so we can just train for a single epoch.
num_epochs = 1

timer = PerfTimer()

timer.start()
train_losses, val_losses = train_model_simple(
    model, train_loader, val_loader, optimizer,
    num_epochs=num_epochs, eval_freq=50, eval_iter=50, # eval less frequently
    start_context="Every effort moves you", tokenizer=tokenizer, device="cuda"
)
timer.stop()

print(f"Took this long to train: {timer.elapsed_ms()} ms")


Num batches: 50
Processing batch: 0
Num batches: 50
Processing batch: 0
Ep 1 (Step 000000): Train loss 9.876, Val loss 9.842
Num batches: 50
Processing batch: 0
Num batches: 50
Processing batch: 0
Ep 1 (Step 000050): Train loss 7.778, Val loss 7.751
Num batches: 50
Processing batch: 0
Num batches: 50
Processing batch: 0
Ep 1 (Step 000100): Train loss 7.550, Val loss 7.462
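
One way to read these numbers, assuming the losses are mean cross-entropy in nats (the default for PyTorch's `cross_entropy`): a model guessing uniformly over the vocabulary would score `ln(vocab_size)`, and `exp(loss)` gives the perplexity. A small sketch:

```python
import math

VOCAB_SIZE = 50257

# Loss of a uniform guess over the vocabulary: a useful baseline.
uniform_loss = math.log(VOCAB_SIZE)
print(f"uniform-guess loss: {uniform_loss:.3f}")

# Validation losses copied from the log above.
val_losses = {0: 9.842, 50: 7.751, 100: 7.462}
for step, loss in val_losses.items():
    perplexity = math.exp(loss)
    print(f"step {step:>3}: val loss {loss:.3f} -> perplexity {perplexity:,.0f}")
```

A loss well below the uniform baseline and still falling is a quick sign that training is heading in the right direction.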


## Save the model 

Oh, we should save our precious efforts! Let's not let all that waiting be for nothing!

In [None]:
torch.save(model.state_dict(), "models/gpt2-355M-model.pth")

## Reload the model 

In [None]:
import torch
from scripts.gpt2_model import GPTModel

model = GPTModel(GPT_CONFIG_355M)
model.load_state_dict(
  torch.load("models/gpt2-355M-model.pth", weights_only=True)
)

## Testing with inference

In [None]:
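from scripts.perf_timer import PerfTimer
from scripts.generate import generate_text_simple

# NOTE: `tokenizer`, `text_to_token_ids`, and `token_ids_to_text` must already be
# defined in the session; this cell does not import them.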

perf_timer = PerfTimer()

perf_timer.start()
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=50,
    context_size=GPT_CONFIG_355M["context_length"]
)
perf_timer.stop()

print("Generated tokens in", perf_timer.elapsed_ms(), "ms")
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))