# Pretraining 2: GPT-2 355M 

**WARNING**

The data set we will be using is very large. Because I want to show you everything step by step, you will likely run out of memory somewhere in this notebook unless you have at least 64 GB of RAM (which is what my system has).

I configured my system with 128 GB of swap and traded performance for getting the project done. If you are concerned about your SSD's health, you should probably run this notebook in the cloud.
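
Before starting, it can help to check how much RAM and swap the machine actually has. The following is a minimal sketch, assuming a Linux system that exposes `/proc/meminfo`; the function names and the way the 64 GB threshold is applied are illustrative and not part of this notebook's scripts.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: size in GiB}."""
    sizes = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Memory fields look like "MemTotal:  65746264 kB"; skip anything else.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            sizes[name.strip()] = int(parts[0]) / (1024 ** 2)
    return sizes


def has_enough_memory(sizes, required_gib=64):
    """Return True if RAM plus swap reaches the required size."""
    total = sizes.get("MemTotal", 0.0) + sizes.get("SwapTotal", 0.0)
    return total >= required_gib


try:
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("RAM + swap is sufficient:", has_enough_memory(info))
except OSError:
    # Not a Linux system, or /proc is unavailable.
    print("Could not read /proc/meminfo on this system.")
```

Note that a machine sold with 64 GB of RAM usually reports slightly less than 64 GiB as `MemTotal`, which is one reason the check adds swap to the total.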


We are going to go big here with the following data set:

https://huggingface.co/datasets/PatrickHaller/fineweb-3B

This data set is over 8 GB, so hopefully your internet connection is fast enough.

In [None]:
%%bash
if [ ! -f data/fineweb-3b/README.md ]; then
    echo "Data set not yet downloaded. Downloading now..."
    git clone https://huggingface.co/datasets/PatrickHaller/fineweb-3B data/fineweb-3b
else
    echo "Data set is downloaded."
fi 

mkdir -p data/fineweb-3b/text

The data set is in Parquet format, so we need to write a conversion script that turns each Parquet file into CSV text.

In [None]:
!pip install pandas
!pip install pyarrow

In [None]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)

base_path = "data/fineweb-3b/data"

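# Keep only the parquet files, sorted so the output numbering is reproducible
all_files = sorted(f for f in os.listdir(base_path) if f.endswith(".parquet"))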

output_path = "data/fineweb-3b/text"

for i, filename in enumerate(all_files):
    print(filename)

In [None]:
for i, filename in enumerate(all_files):
    fullpath = f"{base_path}/{filename}"

    df = pd.read_parquet(fullpath)

    data = df["text"].to_csv(index=False)
    
    print(data)

    break


Now we write all files to the output directory. The order in which they are processed isn't really important, but expect the output to be large.

In [None]:

for i, filename in enumerate(all_files):
    fullpath = f"{base_path}/{filename}"

    df = pd.read_parquet(fullpath)

    data = df["text"].to_csv(index=False)

    with open(f"{output_path}/data-{i}.txt", "w") as f:
        # Skip the first line
        if data.startswith("text\n"):
            data = data[5:]
        
        f.write(data)

    print(f"Processed: {filename}")
    

In [None]:
with open(f"{output_path}/data-2.txt") as f:
  # The "text" header was stripped when writing, so these are the first two lines of data
  print(f.readline())
  print(f.readline())
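
One thing to keep in mind about these `.txt` files is that they are still CSV-encoded. With its default settings, `to_csv` quotes any value that contains a comma, a double quote, or a newline, so a single document can span several physical lines and `readline()` may return only part of it. The sketch below uses only the standard library `csv` module on made-up sample rows to illustrate the effect and one way to read whole records back; it is an illustration, not a step this notebook performs.

```python
import csv
import io

# Made-up sample documents: one plain, one with a comma and quotes, one with a newline.
docs = ["plain text", 'He said "hi", then left', "first line\nsecond line"]

# Write them with minimal quoting, which is the csv module's default.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for doc in docs:
    writer.writerow([doc])
encoded = buffer.getvalue()

# Three documents occupy four physical lines because of the embedded newline.
physical_lines = encoded.splitlines()
print("documents:", len(docs), "physical lines:", len(physical_lines))

# csv.reader reassembles quoted fields, so each record comes back whole.
decoded = [row[0] for row in csv.reader(io.StringIO(encoded))]
print("round trip matches:", decoded == docs)
```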

Now that we have all this data, we need to create batches for the training and validation sets, and the data is quite large. In principle, 64 GB is enough to hold this particular data set in memory, so the lazy approach would be to load everything and split it there.

In practice, I expect Jupyter to run out of memory, so we will do this differently. I wrote a utility in C to quickly concatenate the raw text. A second utility then splits the result into training and validation files.

In [None]:
%%bash
../text-builder/text-builder data/fineweb-3b/text data/fineweb-3b/raw_data.txt
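
The `text-builder` source is not shown here. As a rough illustration of what such a step does, the following is a minimal Python sketch that concatenates every file in a directory into one output file while streaming in fixed-size chunks, so memory use stays small regardless of the data size. The function name, the sorted file order, and the 16 MiB chunk size are assumptions made for this example.

```python
import os
import shutil


def concatenate_files(input_dir, output_path, chunk_size=16 * 1024 * 1024):
    """Stream every regular file in input_dir into output_path, in sorted name order."""
    names = sorted(
        name for name in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, name))
    )
    with open(output_path, "wb") as out:
        for name in names:
            with open(os.path.join(input_dir, name), "rb") as src:
                # copyfileobj reads and writes chunk_size bytes at a time.
                shutil.copyfileobj(src, out, chunk_size)
    return names
```

It would be called as `concatenate_files("data/fineweb-3b/text", "data/fineweb-3b/raw_data.txt")`. The output file should live outside the input directory so it is not picked up as an input.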

In [None]:
%%bash
../text-splitter/text-splitter data/fineweb-3b/raw_data.txt data/fineweb-3b/train_data.txt data/fineweb-3b/val_data.txt
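
The `text-splitter` source and its split ratio are not shown either. The following is a minimal Python sketch of a streaming split, assuming a 90/10 train/validation ratio measured in bytes and cutting only at line boundaries; both choices are assumptions for this example.

```python
import os


def split_file(input_path, train_path, val_path, train_ratio=0.9):
    """Stream input_path into a train file and a validation file.

    Lines go to the train file until it holds at least train_ratio of the
    input bytes; all remaining lines go to the validation file.
    """
    train_budget = os.path.getsize(input_path) * train_ratio
    written = 0
    with open(input_path, "rb") as src, \
         open(train_path, "wb") as train, \
         open(val_path, "wb") as val:
        for line in src:
            if written < train_budget:
                train.write(line)
                written += len(line)
            else:
                val.write(line)
    return written  # number of bytes that went to the train file
```

Because only one line is held in memory at a time, this works on files far larger than RAM. A cut at a line boundary can still land inside a quoted multi-line record, which is usually acceptable for pretraining text.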

## Create the tokenized data set

This one is going to take a while since the data set is so big. I also recommend increasing the size of your swap file, as the amount of data will certainly exceed 64 GB, which is the amount of RAM I have on my system.

I increased my swap file to 100 GB. You can do the same with the `increase_swap.sh` helper. 

In [None]:
from scripts.tokenize_data import tokenize

# Tokenize the train data
tokenize(
  "data/fineweb-3b/train_data.txt",
  "data/fineweb-3b/train_tokens.txt"
)

# tokenize the validation data
tokenize(
  "data/fineweb-3b/val_data.txt",
  "data/fineweb-3b/val_tokens.txt"
)

In [None]:
from scripts.load_token_data import load_token_data, save_tokens

# Load the token files into lists and save them
train_tokens_list = load_token_data("data/fineweb-3b/train_tokens.txt")
save_tokens(train_tokens_list, "data/fineweb-3b/train_tokens.lst")

val_tokens_list = load_token_data("data/fineweb-3b/val_tokens.txt")
save_tokens(val_tokens_list, "data/fineweb-3b/val_tokens.lst")
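
Much of the memory pressure mentioned earlier comes from how token IDs are held. A Python list stores each ID as a separate object plus a pointer, which costs tens of bytes per token, whereas GPT-2 IDs are below 50,257 and therefore fit in an unsigned 16-bit integer. The sketch below uses the standard library `array` module to store made-up IDs at 2 bytes each and round-trip them through a binary file. The helper names and the `.bin` file are hypothetical and unrelated to `load_token_data` or `save_tokens`, whose formats are not shown here.

```python
import os
import tempfile
from array import array


def save_token_ids(token_ids, path):
    """Write token IDs to path as raw unsigned 16-bit integers."""
    data = array("H", token_ids)  # "H" = unsigned short; values above 65535 raise OverflowError
    with open(path, "wb") as f:
        data.tofile(f)
    return len(data) * data.itemsize  # bytes written


def load_token_ids(path):
    """Read raw unsigned 16-bit integers back into an array."""
    data = array("H")
    count = os.path.getsize(path) // data.itemsize
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return data


# Made-up token IDs, including the largest GPT-2 ID (50256).
sample_ids = [15496, 11, 995, 50256]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_tokens.bin")
    bytes_written = save_token_ids(sample_ids, path)
    restored = load_token_ids(path)

print("bytes per token:", bytes_written // len(sample_ids))
```

At 2 bytes per token, a corpus of about three billion tokens (which the data set name suggests, though the notebook does not confirm it) would need roughly 6 GB.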

## GPT-2 355M Config


In [1]:
GPT_CONFIG_355M = {
  "vocab_size": 50257,   # Vocabulary size
  "context_length": 1024, # Context length
  "emb_dim": 1024,        # Embedding dimension (larger than 124M)
  "n_heads": 16,         # Number of attention heads (larger than 124M)
  "n_layers": 24,        # Number of layers (larger than 124M)
  "drop_rate": 0.0,      # Dropout rate
  "qkv_bias": False      # Query-key-value bias
}
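
To see where the "355M" comes from, the parameter count can be derived from the config. The sketch below assumes the standard GPT-2 layout: token and positional embeddings, then per layer a multi-head attention block, a 4x-wide feed-forward block, and two layer norms, followed by a final layer norm. `scripts.gpt2_model.GPTModel` is not shown here, so its actual count may differ, for example if its output head is not tied to the token embedding.

```python
def count_gpt2_params(cfg, tie_output_head=True):
    """Estimate the parameter count of a standard GPT-2 style model."""
    d = cfg["emb_dim"]
    vocab = cfg["vocab_size"]

    # Token embedding plus learned positional embedding.
    embeddings = vocab * d + cfg["context_length"] * d

    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * (2 * d)  # scale and shift for each of the two norms
    per_layer = qkv + attn_out + feed_forward + layer_norms

    final_norm = 2 * d
    head = 0 if tie_output_head else vocab * d  # assumes an untied head has no bias

    return embeddings + cfg["n_layers"] * per_layer + final_norm + head


# Same values as GPT_CONFIG_355M, repeated so the example is self-contained.
cfg_355m = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "drop_rate": 0.0,
    "qkv_bias": False,
}

tied = count_gpt2_params(cfg_355m)
untied = count_gpt2_params(cfg_355m, tie_output_head=False)
print(f"tied output head:   {tied:,}")
print(f"untied output head: {untied:,}")
```

Under these assumptions the tied count comes to about 355 million, and an untied output head adds roughly 51 million more.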

## Loading the Training and Validation Data Loaders

In [None]:
from scripts.preload_dataloaders import load_train_dataloader, load_val_dataloader

train_loader = load_train_dataloader("data/fineweb-3b/train_loader.dl")
print("Loaded train_loader.")

val_loader = load_val_dataloader("data/fineweb-3b/val_loader.dl")
print("Loaded val_loader")

In [None]:
from scripts.gpt2_model import GPTModel

model = GPTModel(GPT_CONFIG_355M)

In [None]:
import torch
from scripts.train import calc_loss_loader

torch.manual_seed(123)

# Gradients are not needed for a baseline loss measurement
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model)
    val_loss = calc_loss_loader(val_loader, model)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
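
A quick sanity check for these numbers: a model that spreads probability evenly over the whole vocabulary has a cross-entropy loss of ln(vocab_size), so an untrained model should start somewhere near that value. A minimal sketch of the calculation:

```python
import math

vocab_size = 50257  # same value as GPT_CONFIG_355M["vocab_size"]

# Cross-entropy of a uniform distribution over the vocabulary: -ln(1/V) = ln(V).
uniform_loss = math.log(vocab_size)
print(f"Expected initial loss: about {uniform_loss:.2f}")
```

Randomly initialized weights do not produce an exactly uniform distribution, so the measured loss is typically a little higher than this value. A starting loss far away from it is a hint to recheck the data loaders or the loss computation.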

Now it is time to train our 355M model. Here we go!

In [None]:
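import tiktoken
import torch

from scripts.perf_timer import PerfTimer
# Assumption: train_model_simple is defined in scripts.train next to calc_loss_loader
from scripts.train import train_model_simple

# Assumption: the data was tokenized with the GPT-2 BPE encoding (vocab size 50257)
tokenizer = tiktoken.get_encoding("gpt2")

torch.manual_seed(123)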
model = GPTModel(GPT_CONFIG_355M)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

# We have lots of data, so we can just train for a single epoch.
num_epochs = 1

timer = PerfTimer()

timer.start()
train_losses, val_losses = train_model_simple(
    model, train_loader, val_loader, optimizer,
    num_epochs=num_epochs, eval_freq=50, eval_iter=50, # eval less frequently
    start_context="Every effort moves you", tokenizer=tokenizer
)
timer.stop()

print(f"Took this long to train: {timer.elapsed_ms()} ms")
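
A training run of this size can take a long time, so a raw millisecond count is hard to read. A small helper can convert it to hours, minutes, and seconds. The helper is hypothetical and not part of `PerfTimer`, and it assumes `elapsed_ms()` returns a plain number of milliseconds.

```python
def format_duration(elapsed_ms):
    """Format a duration given in milliseconds as 'Hh MMm SSs'."""
    total_seconds = int(elapsed_ms // 1000)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


# 5,025,000 ms is 5,025 seconds, i.e. 1 hour, 23 minutes, 45 seconds.
print(format_duration(5_025_000))
```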


## Save the model 

In [None]:
torch.save(model.state_dict(), "models/gpt2-355M-model.pth")

## Reload the model 

In [None]:
import torch
from scripts.gpt2_model import GPTModel

model = GPTModel(GPT_CONFIG_355M)
model.load_state_dict(
  torch.load("models/gpt2-355M-model.pth", weights_only=True)
)

## Testing with inference

In [None]:
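from scripts.perf_timer import PerfTimer
from scripts.generate import generate_text_simple

# text_to_token_ids, token_ids_to_text, and tokenizer must be defined before this cell runs
model.eval()  # switch off training-only behavior such as dropout

perf_timer = PerfTimer()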

perf_timer.start()
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=50,
    context_size=GPT_CONFIG_355M["context_length"]
)
perf_timer.stop()

print("Generated tokens in", perf_timer.elapsed_ms(), "ms")
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
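
For readers who have not seen the body of `generate_text_simple`: in the usual minimal version of this loop, each step feeds the last `context_size` tokens to the model and appends the highest-scoring next token (greedy decoding). The sketch below shows that loop in plain Python with a toy stand-in for the model. It assumes `scripts.generate` follows this greedy scheme, which the notebook does not confirm.

```python
def greedy_generate(next_token_scores, token_ids, max_new_tokens, context_size):
    """Append max_new_tokens tokens, picking the highest-scoring token each step."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = token_ids[-context_size:]  # crop to the supported context length
        scores = next_token_scores(context)  # one score per vocabulary entry
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        token_ids.append(best)
    return token_ids


def toy_scores(context, vocab_size=5):
    """Toy stand-in for a model: favors (last token + 1) modulo vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_scores, [0], max_new_tokens=6, context_size=4)
print(generated)
```

Greedy decoding is deterministic, so the same prompt and weights always produce the same output, which makes it convenient for timing comparisons like the one in this cell.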

## TTNN

Now let's load the model weights and run inference. This time we run all the same benchmarks as in notebook 11.