# 📝 NOTE

Training a GPT-2 355M model on 3 billion tokens with a CPU is simply not a good use of time or resources. For this notebook, we'll leverage **GPU training**. It is _possible_ to complete this training on a single GPU within a reasonable amount of time.

If you don't have access to a GPU, there are a few good cloud options, such as:

* Lambda Cloud
* Fly.io
* Linode

For this notebook, I used my personal NVIDIA RTX 6000 Ada GPU to train. 
* For a batch size of 1, it required 17 GB of VRAM to train. 👍
* For a batch size of 4, it required 47 GB of VRAM to train. 😓 
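
The two measurements above can be turned into a back-of-the-envelope estimate for other batch sizes. This is only a rough sketch: it assumes VRAM grows linearly with batch size (a fixed cost for weights and optimizer state plus a per-sample activation cost), which real training only approximates. The function names `fit_linear_vram` and `estimate_vram_gb` are hypothetical helpers, not part of this repo.

```python
def fit_linear_vram(bs_a, gb_a, bs_b, gb_b):
    """Fit vram = fixed + per_sample * batch_size through two measurements."""
    per_sample = (gb_b - gb_a) / (bs_b - bs_a)
    fixed = gb_a - per_sample * bs_a
    return fixed, per_sample


def estimate_vram_gb(batch_size, fixed, per_sample):
    """Extrapolate VRAM use under the (assumed) linear model."""
    return fixed + per_sample * batch_size


# Measurements from this notebook: 17 GB at batch size 1, 47 GB at batch size 4.
fixed_gb, per_sample_gb = fit_linear_vram(1, 17.0, 4, 47.0)
print(f"fixed: {fixed_gb} GB, per sample: {per_sample_gb} GB")

for bs in (2, 8):
    print(f"batch size {bs}: ~{estimate_vram_gb(bs, fixed_gb, per_sample_gb)} GB")
```

Under that assumption, a batch size of 8 would not fit on a 48 GB card, which matches how close batch size 4 already is to the limit.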


May 17, 2025 - 4:00 PM Step 0

May 17, 2025 - 8:30 PM Step 40950

May 17, 2025 - 9:05 PM Step 45350

May 18, 2025 - 4:25 AM Step 113650
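
The timestamps above can be converted into a training throughput. This is a minimal sketch that assumes the four readings are wall-clock times from one continuous run; the `LOG` list and `steps_per_second` helper are hypothetical names introduced for illustration.

```python
from datetime import datetime

# (timestamp, step) pairs transcribed from the log lines above.
LOG = [
    ("2025-05-17 16:00", 0),
    ("2025-05-17 20:30", 40950),
    ("2025-05-17 21:05", 45350),
    ("2025-05-18 04:25", 113650),
]


def steps_per_second(log):
    """Return (per-interval rates, overall rate) in steps per second."""
    points = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), step) for ts, step in log]
    rates = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        rates.append((s1 - s0) / (t1 - t0).total_seconds())
    (t_first, s_first), (t_last, s_last) = points[0], points[-1]
    overall = (s_last - s_first) / (t_last - t_first).total_seconds()
    return rates, overall


interval_rates, overall_rate = steps_per_second(LOG)
for rate in interval_rates:
    print(f"interval: {rate:.2f} steps/s")
print(f"overall:  {overall_rate:.2f} steps/s")
```

A rate like this makes it easy to estimate how long a given number of remaining steps will take before committing to a run.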

----
Batch Size = 4, RTX 6000 Ada
May 18, 2025 - 6:55 PM Step 0

## Cost of Cloud Training

I'll cut to the chase. Maybe this discussion will help you decide whether training on the cloud is worth it for you.

### Lambda Cloud

* I initially trained on Lambda Cloud with their H100 80GB PCIe offering, using a batch size of 1, for $2.49 an hour.
* Next, when an H100 80GB SXM became available, I trained with a batch size of 4 for $3.49.

### Local

* I also trained on my own personal RTX 6000 Ada GPU with a batch size of 1.
* Then I trained again on the RTX 6000 Ada GPU with a batch size of 4.

### Linode CPU

Here's why you should _never_ attempt CPU training. 

# Pretraining 2 - GPT-2 355M - GPU Training

I've included the training script `model_train.py` alongside this notebook. It is the script I actually used to produce the model.

## Loading the Input and Validation Tokens

We created the dataloaders in the previous notebook. Let's load them back up.

In [1]:
from scripts.preload_dataloaders import load_pickled_dataloader

train_loader = load_pickled_dataloader("data/fineweb-3b/train_loader.dl")
print("Loaded train_loader.")

val_loader = load_pickled_dataloader("data/fineweb-3b/val_loader.dl")
print("Loaded val_loader")

len(train_loader), len(val_loader)

Loaded train_loader.
Loaded val_loader


(77127, 13616)

Check the maximum token ID to make sure that no token IDs are out of range. With a vocabulary size of 50257, valid IDs run from 0 to 50256. An out-of-range ID means that our dataloader could be corrupted. I ended up having to do this because bad data in the previous notebook forced me to rebuild the loaders.

In [2]:
# Check the token ID range in the dataset
max_token = float('-inf')
for input_batch, _ in train_loader:
    max_token = max(max_token, input_batch.max().item())

print(f"Maximum token ID: {max_token}")

Maximum token ID: 50256


In [2]:
from scripts.gpt2_model import GPTModel

# We'll use CUDA here
device = "cuda"

GPT_CONFIG_355M = {
  "vocab_size": 50257,   # Vocabulary size
  "context_length": 1024, # Context length
  "emb_dim": 1024,        # Embedding dimension (larger than 124M)
  "n_heads": 16,         # Number of attention heads (larger than 124M)
  "n_layers": 24,        # Number of layers (larger than 124M)
  "drop_rate": 0.0,      # Dropout rate
  "qkv_bias": False      # Query-key-value bias
}

model = GPTModel(GPT_CONFIG_355M)
model = model.to(device)
model

GPTModel(
  (tok_emb): Embedding(50257, 1024)
  (pos_emb): Embedding(1024, 1024)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=1024, out_features=1024, bias=False)
        (W_key): Linear(in_features=1024, out_features=1024, bias=False)
        (W_value): Linear(in_features=1024, out_features=1024, bias=False)
        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layer): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU()
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear
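
As a sanity check on the "355M" name, the parameter count can be worked out from `GPT_CONFIG_355M` and the layer shapes visible in the printout above. This is a sketch under stated assumptions: the printout is truncated, so the final `LayerNorm` and an untied output head (`emb_dim` x `vocab_size`, no bias) are assumed from the usual GPT-2 layout, and each `LayerNorm` is assumed to hold a scale and a shift vector. `count_params` is a hypothetical helper.

```python
GPT_CONFIG_355M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "drop_rate": 0.0,
    "qkv_bias": False,
}


def count_params(cfg):
    """Count parameters analytically from the config (no model needed)."""
    d = cfg["emb_dim"]
    vocab = cfg["vocab_size"]
    ctx = cfg["context_length"]
    qkv_bias = d if cfg["qkv_bias"] else 0

    embeddings = vocab * d + ctx * d                       # tok_emb + pos_emb
    attention = 3 * (d * d + qkv_bias) + (d * d + d)       # W_query/key/value + out_proj
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)   # 1024 -> 4096 -> 1024, with biases
    layer_norms = 2 * (2 * d)                              # norm1 + norm2 (assumed scale + shift)
    block = attention + feed_forward + layer_norms

    final_norm = 2 * d      # assumption: not visible in the truncated printout
    out_head = d * vocab    # assumption: untied output projection without bias

    body = embeddings + cfg["n_layers"] * block + final_norm
    return {"per_block": block, "without_out_head": body, "total": body + out_head}


counts = count_params(GPT_CONFIG_355M)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the count without the output head is what lands near 355M; the larger total includes the separate output projection.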

## Existing Training and Validation Loss

If you're reloading a model, it's useful to check its current training and validation loss. Warning: this performs a forward pass over all of your data, so it is still expensive.

In [3]:
import torch
from scripts.train import calc_loss_loader

torch.manual_seed(123)

with torch.no_grad():
  train_loss = calc_loss_loader(train_loader, model, device="cuda")
  val_loss = calc_loss_loader(val_loader, model, device="cuda")

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

env: CUDA_LAUNCH_BLOCKING=1


KeyboardInterrupt: 

## Training

Now it is time to train our 355M model. Running the training inside the notebook is actually not a good idea because it will take too long. I have included two scripts you can run within a `tmux` session, so that you can disconnect from your session while your training continues:

* `model_train.py`
* `model_inference.py`

I have modified `train_model_simple` to take a `device` argument to accommodate a GPU. This caused a ripple effect of adding the argument elsewhere.

Additionally, I included a new `max_iter` argument that trains the model only up to a specific number of steps, since an entire epoch would take too long when you just need to test.
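
The idea of capping the number of steps can be sketched independently of the real training code. The snippet below is not the actual `train_model_simple`; `run_steps` and its `train_step` callback are hypothetical stand-ins that only show the early-stop pattern, and the exact off-by-one behavior of the real script may differ.

```python
def run_steps(batches, train_step, max_iter=None):
    """Call train_step on each batch, stopping early after max_iter steps."""
    steps_done = 0
    for batch in batches:
        if max_iter is not None and steps_done >= max_iter:
            break
        train_step(batch)
        steps_done += 1
    return steps_done


# Stand-in "loader" with as many batches as train_loader; the step just records the batch.
seen = []
steps = run_steps(range(77127), seen.append, max_iter=500)
print(f"Ran {steps} steps")
```

Leaving `max_iter` as `None` runs through every batch, so the same function serves both quick tests and full epochs.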

In [3]:
import torch
from scripts.perf_timer import PerfTimer
from scripts.train import train_model_simple
import tiktoken

# Configure the device
# "high" matmul precision enables TF32, which needs compute capability 8.0+ (Ampere or newer)
capability = torch.cuda.get_device_capability()
if capability[0] >= 8:
  print("More modern NVIDIA GPU Found... using tensor cores")
  torch.set_float32_matmul_precision("high")
else:
  print("TF32 tensor cores not supported on this GPU.")

tokenizer = tiktoken.get_encoding("gpt2")

torch.manual_seed(123)

model = GPTModel(GPT_CONFIG_355M)
model = torch.compile(model)
model = model.to("cuda").to(torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1, fused=True)

# We have lots of data, so we can just train for a single epoch.
num_epochs = 1

timer = PerfTimer()

timer.start()
train_losses, val_losses = train_model_simple(
    model, train_loader, val_loader, optimizer,
    num_epochs=num_epochs, eval_freq=100, eval_iter=100, # eval less frequently
    start_context="Every effort moves you", tokenizer=tokenizer, device="cuda",
    max_iter=500
)
timer.stop()

print(f"Took this long to train: {timer.elapsed_ms()} ms")
print("Train losses\n")
print(train_losses)
print("Val losses\n")
print(val_losses)


More modern NVIDIA GPU Found... using tensor cores
Ep 1 (Step 000000 of 77127): Train loss 9.978, Val loss 9.985
Every effort moves you,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Ep 1 (Step 000100 of 77127): Train loss 6.963, Val loss 6.963
Every effort moves you can be a few. -to-to-to-to- - - - - - - - - - - - - - - - - - 
Ep 1 (Step 000200 of 77127): Train loss 6.708, Val loss 6.694
Every effort moves you can be a few you can be a lot of the best. " " " " - - - - - - - - - - - - - - 
Ep 1 (Step 000300 of 77127): Train loss 6.548, Val loss 6.542
Every effort moves you can be a great. The first time to the best to the best to the best to the best of the best to the most of the best of the most of the most of the most of the most of the most of the most of the
Ep 1 (Step 000400 of 77127): Train loss 6.443, Val loss 6.435
Every effort moves you can be a great way to be a good. " " " " " " " " " " " " " " " " " " " "
Ep 1 (Step 000500 of 77127): Train loss 6.385, Val loss 6.366
Ev

## Save the model 

Oh, we should save our precious efforts! Let's not let all that waiting be for nothing!

In [None]:
torch.save(model.state_dict(), "models/gpt2-355M-model.pth")

## Reload the model 

In [None]:
import torch
from scripts.gpt2_model import GPTModel

model = GPTModel(GPT_CONFIG_355M)
model.load_state_dict(
  torch.load("models/gpt2-355M-model.pth", weights_only=True)
)
model.eval()
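
One caveat, assuming the weights were saved from the `torch.compile`-wrapped model used in the training cell: in the PyTorch 2.x versions I'm aware of, a compiled module's `state_dict()` keys carry an `_orig_mod.` prefix, so loading them into a plain `GPTModel` can fail with unexpected-key errors. A minimal sketch of a workaround is to strip the prefix before loading. `strip_compile_prefix` is a hypothetical helper, and the example below uses stand-in values rather than real tensors.

```python
COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict, prefix=COMPILE_PREFIX):
    """Return a copy of state_dict with the torch.compile key prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Stand-in values; a real state dict maps these names to tensors.
example = {"_orig_mod.tok_emb.weight": 0, "_orig_mod.pos_emb.weight": 1}
cleaned = strip_compile_prefix(example)
print(sorted(cleaned))
```

With a real checkpoint, the result of `torch.load(...)` would be passed through `strip_compile_prefix` before `model.load_state_dict(...)`. Keys without the prefix pass through unchanged, so the helper is safe to apply either way.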

## Testing by inferencing

In [None]:
from scripts.perf_timer import PerfTimer
from scripts.generate import generate_text_simple

perf_timer = PerfTimer()

perf_timer.start()
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=50,
    context_size=GPT_CONFIG_355M["context_length"]
)
perf_timer.stop()

print("Generated tokens in", perf_timer.elapsed_ms(), "ms")
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

## Analyzing the Results

- Talk about the starting training and validation loss.
- Talk about the loss at the end of training.
- Chart the data after grabbing the entire loss arrays (I printed them out).
- Mention that one epoch over the 3B dataset is probably too little for the 355M model at the learning rate I was using, but it was good for testing our hypothesis.
- I didn't save the optimizer, so a second epoch has to start with a fresh optimizer that has no momentum.
- A second epoch is still worth doing in that state. I will increase the learning rate by 3x to get where I need to go faster, and my model is somewhat stable at this point. (Momentum is the additional update contributed by past gradients, on top of the current gradient's direction.)
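
The momentum mentioned in the last bullet can be illustrated with a toy version of the first-moment running average that Adam-style optimizers keep per parameter. This is a minimal sketch with scalar gradients, not the optimizer used in training; `adam_first_moment` is a hypothetical name, and `beta1=0.9` is the usual default. A fresh optimizer starts from `m = 0`, which is the state a second epoch begins from when the optimizer was not saved.

```python
def adam_first_moment(grads, beta1=0.9):
    """Yield (raw, bias_corrected) first-moment estimates for a gradient sequence."""
    m = 0.0  # a freshly created optimizer has no accumulated momentum
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        yield m, m / (1.0 - beta1 ** t)


# Three gradients pointing one way, then one pointing the other way.
history = list(adam_first_moment([1.0, 1.0, 1.0, -1.0]))
for step, (raw, corrected) in enumerate(history, start=1):
    print(f"step {step}: raw={raw:.4f} corrected={corrected:.4f}")
```

After the sign flip on the last step, the estimate still points in the earlier direction. That carried-over history is exactly what is lost when the optimizer state is thrown away.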

## Next steps to get a quality model
- Epoch 2: train again on the 3B dataset with an increased learning rate to account for not having saved the optimizer.
- Build a 10B dataset and train a version of the 355M model on it. (By the Chinchilla scaling law, roughly 7B tokens is compute-optimal for a model of this size.)
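
The 7B figure follows from the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter. The ratio is an approximation rather than an exact law, and `compute_optimal_tokens` is a hypothetical helper used only to show the arithmetic.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio; an approximation, not an exact constant


def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Estimate the compute-optimal number of training tokens."""
    return n_params * tokens_per_param


n_params = 355_000_000
optimal_tokens = compute_optimal_tokens(n_params)
print(f"{optimal_tokens / 1e9:.1f}B tokens")

# How many passes over the 3B-token dataset that would correspond to.
passes_over_3b = optimal_tokens / 3_000_000_000
print(f"{passes_over_3b:.2f} passes over the 3B dataset")
```

By this estimate a single epoch over 3B tokens covers well under half of the suggested budget, which is consistent with the model still improving at the end of the run.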

Lastly! Save the optimizer!
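
A minimal sketch of what saving the optimizer could look like, assuming a standard PyTorch checkpoint dictionary. The helper names `save_checkpoint` and `load_checkpoint` and the dictionary keys are my own choices here, not something taken from `model_train.py`.

```python
import torch


def save_checkpoint(path, model, optimizer, step):
    """Save model weights, optimizer state, and the current step together."""
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "step": step,
        },
        path,
    )


def load_checkpoint(path, model, optimizer, device="cpu"):
    """Restore model and optimizer in place and return the saved step."""
    checkpoint = torch.load(path, map_location=device, weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["step"]
```

Calling `save_checkpoint` periodically during training, for example at each evaluation step, means a later epoch can resume with the optimizer's momentum intact instead of starting cold.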