# Pretraining 2: GPT-2 355M - Dataset Preparation 

This notebook is independent of the work that we have been doing so far. The goal of this notebook is to walk through basic dataset preparation for training a bigger GPT-2 model. In this case, we're going to try to train a GPT-2 355M model from scratch.

This will involve using a much, much larger dataset with about 3 billion tokens. If you're not interested in prepping the dataset, you can download it already prepared as data loaders here: [TODO]

### ⚠️⚠️ WARNING ️⚠️⚠️

The dataset we will be using is very large with about 3 billion tokens and totalling up to about 13 GB of raw uncompressed text.

When the data is tokenized and turned into tensors, you'll need about 50 GB of space to store these on disk.

In the spirit of showing you everything step by step, this notebook favors simplicity over memory efficiency. You **will likely** run **out of memory** 💣 running each step of the notebook if you **do not have at least** **64 GB of RAM** (which is what my system has).

To get around this memory limitation, I have configured my system to have **128 GB** of swap space. Relying on swap space trades performance for capacity, but it is necessary to get this project done if you don't have enough memory. If you are concerned about your SSD health, then you should probably run this notebook in the cloud. Personally, I don't think it is a big deal.

Overall, you should expect to use about 50 GB for the dataset and 128 GB for swap space. This means you will need _at least_ 180 GB of free SSD space to get this task done.
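
As a rough sanity check on the 50 GB figure, here is a back-of-the-envelope estimate. The token counts are the ones reported by the dataloader cells later in this notebook. The 8 bytes per token (int64) and the two copies (inputs and targets) are my assumptions about how the tensors are stored, not something the scripts confirm.

```python
# Token counts as reported by the dataloader cells later in this notebook.
TRAIN_TOKENS = 2_527_323_724
VAL_TOKENS = 446_181_160

# Assumptions (not confirmed by the scripts): token ids are stored as
# 8-byte int64 values, and inputs and targets are stored as separate tensors.
BYTES_PER_TOKEN = 8
COPIES = 2

SWAP_GB = 128  # swap space mentioned above

total_tokens = TRAIN_TOKENS + VAL_TOKENS
tensor_bytes = total_tokens * BYTES_PER_TOKEN * COPIES

print(f"Total tokens: {total_tokens:,}")
print(f"Estimated tensor storage: {tensor_bytes / 1e9:.1f} GB")
print(f"Tensors plus swap: {tensor_bytes / 1e9 + SWAP_GB:.0f} GB")
```

Under those assumptions the estimate lands a little under 50 GB, which is in line with the figure above. The raw text and the original download need extra space on top of that.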


## Acquiring the Dataset

Let's acquire the dataset. We'll be using `fineweb-3B` from HuggingFace. You can grab the dataset manually through a `git clone` from this location:

https://huggingface.co/datasets/PatrickHaller/fineweb-3B

If you're lazy, you can use this shell script I've written to trigger the download. The files will be placed in `data/fineweb-3b` within the project directory.

In [None]:
%%bash
if [ ! -f data/fineweb-3b/README.md ]; then
    echo "Data set not yet downloaded. Downloading now..."
    git clone https://huggingface.co/datasets/PatrickHaller/fineweb-3B data/fineweb-3b
else
    echo "Data set is downloaded."
fi 

mkdir -p data/fineweb-3b/text

Data set not yet downloaded. Downloading now...


Cloning into 'data/fineweb-3b'...
Filtering content: 100% (28/28), 7.72 GiB | 106.56 MiB/s, done.


The dataset is in `parquet` format, so we will need to write a conversion script that converts each `parquet` file to CSV text. This will need a lot of storage in addition to the original 8 GB dataset. Expect the converted files to total around 13 GB.

To do that, `pandas` and `pyarrow` must be installed.

In [2]:
!pip install pandas
!pip install pyarrow


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Let's check out what files are in the `fineweb-3b` data. We'll perform the conversion soon.

In [3]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)

base_path = "data/fineweb-3b/data"

all_files = os.listdir(base_path)

for i, filename in enumerate(all_files):
    print(filename)

train-00000-of-00028.parquet
train-00016-of-00028.parquet
train-00027-of-00028.parquet
train-00021-of-00028.parquet
train-00008-of-00028.parquet
train-00009-of-00028.parquet
train-00018-of-00028.parquet
train-00001-of-00028.parquet
train-00026-of-00028.parquet
train-00019-of-00028.parquet
train-00002-of-00028.parquet
train-00023-of-00028.parquet
train-00006-of-00028.parquet
train-00004-of-00028.parquet
train-00012-of-00028.parquet
train-00003-of-00028.parquet
train-00015-of-00028.parquet
train-00025-of-00028.parquet
train-00024-of-00028.parquet
train-00005-of-00028.parquet
train-00017-of-00028.parquet
train-00013-of-00028.parquet
train-00010-of-00028.parquet
train-00022-of-00028.parquet
train-00020-of-00028.parquet
train-00007-of-00028.parquet
train-00011-of-00028.parquet
train-00014-of-00028.parquet


What does a single file look like? Let's take a look at the first 100 characters.

In [None]:
output_path = "data/fineweb-3b/text"

for i, filename in enumerate(all_files):
    fullpath = f"{base_path}/{filename}"

    df = pd.read_parquet(fullpath)

    data = df["text"].to_csv(index=False)
    
    print(data[:100])

    break


text
"However it is not simply those who are traditions away option plans to help you wedding just w


Now let's write all files to the output directory. The order in which they are processed isn't really important, but expect the output to be large. When converting each file to text, we can also strip the first line, which is just the CSV header `text\n`. I don't think it is a big deal to leave it in, but it's easy to handle now, so we might as well.

In [None]:
for i, filename in enumerate(all_files):
    fullpath = f"{base_path}/{filename}"

    df = pd.read_parquet(fullpath)

    data = df["text"].to_csv(index=False)

    with open(f"{output_path}/data-{i}.txt", "w") as f:
        # Skip the first line
        if data.startswith("text\n"):
            data = data[5:]
        
        f.write(data)

    print(f"Processed: {filename}")
    

Processed: train-00000-of-00028.parquet
Processed: train-00016-of-00028.parquet
Processed: train-00027-of-00028.parquet
Processed: train-00021-of-00028.parquet
Processed: train-00008-of-00028.parquet
Processed: train-00009-of-00028.parquet
Processed: train-00018-of-00028.parquet
Processed: train-00001-of-00028.parquet
Processed: train-00026-of-00028.parquet
Processed: train-00019-of-00028.parquet
Processed: train-00002-of-00028.parquet
Processed: train-00023-of-00028.parquet
Processed: train-00006-of-00028.parquet
Processed: train-00004-of-00028.parquet
Processed: train-00012-of-00028.parquet
Processed: train-00003-of-00028.parquet
Processed: train-00015-of-00028.parquet
Processed: train-00025-of-00028.parquet
Processed: train-00024-of-00028.parquet
Processed: train-00005-of-00028.parquet
Processed: train-00017-of-00028.parquet
Processed: train-00013-of-00028.parquet
Processed: train-00010-of-00028.parquet
Processed: train-00022-of-00028.parquet
Processed: train-00020-of-00028.parquet


Let's check that the text has been converted and written correctly by printing the first 2 lines of one of the files.

In [None]:
with open(f"{output_path}/data-2.txt") as f:
  print(f.readline())
  print(f.readline())

"3M™ Neutral film is designed for use on the interior surface of windows. It’s metalised technology reflects the sun’s rays while allowing optical clarity to be maintained and rejects excess light to reduce glare. Also, depending on lighting conditions, rooms are potected against prying eyes from looking in.

3M™ Sun Control Window Films are an elegant way to manage light and heat. 3M technology can significantly reduce heat gain and create a comfortable environment, especially in warmer months, as well as helping to reduce the workload of air conditioners and save energy costs. 3M Window Films also reduce glare and block almost the entire amount of UVA and UVB rays which are the main cause of fading and skin damage."
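
Notice that the output starts and ends with a double quote. That comes from the CSV conversion: `to_csv` uses minimal quoting by default, so any field that contains a newline or a quote character is wrapped in quotes, and embedded quotes are doubled. The small sketch below reproduces that behavior with Python's standard `csv` module. The helper name `to_csv_field` and the sample text are made up for illustration.

```python
import csv
import io


def to_csv_field(text):
    """Write one field as a CSV row using minimal quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([text])
    return buffer.getvalue()


# A document with an embedded quote and two paragraphs (made-up sample).
sample = 'First paragraph with a "quote".\nSecond paragraph.'

encoded = to_csv_field(sample)
print(encoded)

# Reading the row back with a CSV reader recovers the original text.
decoded = next(csv.reader(io.StringIO(encoded)))[0]
print(decoded == sample)
```

So a single document can span several lines of the text file, and the extra quote characters end up in the training text. For this project that is a small amount of noise, but it is worth knowing where it comes from.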



## Create the Training and Validation Text

Now that we have all this data, we will need to create batches for the training and validation datasets. This can be quite large, but at about 13 GB of raw text, the dataset still fits in memory on this system with 64 GB of RAM.

We could write some fancy streaming/generator thing to reduce memory use, but I'm not really here for that right now.

I am lazy, and this time, loading everything into memory and doing a split works.

I wrote 2 C programs to do this:
* `text-builder` - Concatenates all `txt` files into a single `raw_data.txt` file.
* `text-splitter` - Splits `raw_data.txt` into separate `train_data.txt` and `val_data.txt` datasets.

The performance is really good. It is way faster than what I can do in Python. It was also worth the time to just get some more C skills.
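
If you don't have the C programs handy, the logic can be sketched in Python. This is only an illustration of the idea, not a port of the C code: the function names are hypothetical, the 85% train fraction is my reading of the split index reported in the output below, and moving the split to the next newline is an assumption.

```python
from pathlib import Path


def build_raw_text(text_dir, raw_path):
    """Concatenate every .txt file in text_dir, adding a newline after each."""
    files = sorted(Path(text_dir).glob("*.txt"))
    with open(raw_path, "wb") as out:
        for path in files:
            out.write(path.read_bytes())
            out.write(b"\n")  # one extra newline per file
    return len(files)


def split_text(raw_path, train_path, val_path, train_fraction=0.85):
    """Split the raw file into train and validation parts.

    Assumption: the split point is moved forward to just after the next
    newline so that no line is cut in half.
    """
    data = Path(raw_path).read_bytes()  # loads the whole file into memory
    split = int(len(data) * train_fraction)
    newline = data.find(b"\n", split)
    split = len(data) if newline == -1 else newline + 1

    Path(train_path).write_bytes(data[:split])
    Path(val_path).write_bytes(data[split:])
    return split
```

Slicing a 13 GB `bytes` object makes copies, so expect this version to be slower and hungrier for memory than the C programs.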

In [7]:
%%bash
../text-builder/text-builder data/fineweb-3b/text data/fineweb-3b/raw_data.txt

Total files: 28
Total size of files: 13794733675 bytes
Total file size read into memory: 13794733703, Number of additional characters for new line: 28


Split the data into 2 text files: `train_data.txt` and `val_data.txt`.

In [8]:
%%bash
../text-splitter/text-splitter data/fineweb-3b/raw_data.txt data/fineweb-3b/train_data.txt data/fineweb-3b/val_data.txt

Number of characters total: 13794733703, Split index: 11725523968


Next, create the tokenized dataset. This basically just calls `tiktoken`'s `encode` on the text. The tokens go into a list, which we can `pickle` for later use. In fact, I recommend doing so, as by now you're probably running out of memory/swap space to hold all this data.

Later, we will write code to unpickle the tokens and create `torch` tensors and a `DataLoader` from them.
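
The `tokenize` helper lives in `scripts/tokenize_data.py`. Here is a minimal sketch of what such a function could look like, based only on the description above and the progress messages in the output. The `encode` and `report_every` parameters are my additions for illustration, and the use of the `gpt2` encoding is an assumption.

```python
import pickle


def tokenize(text_path, tokens_path, encode=None, report_every=10_000):
    """Encode a text file line by line and pickle the resulting token list."""
    if encode is None:
        # Assumption: the GPT-2 BPE encoding from tiktoken (third-party package).
        import tiktoken
        encode = tiktoken.get_encoding("gpt2").encode

    print(f"Tokenizing {text_path}")
    tokens = []
    with open(text_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            # Each line keeps its trailing newline, so it gets encoded too.
            tokens.extend(encode(line))
            if line_number % report_every == 0:
                print(f"Lines Read: {line_number}")

    # Save the token list so it can be reloaded without re-tokenizing.
    with open(tokens_path, "wb") as f:
        pickle.dump(tokens, f)

    return tokens
```

One thing to watch for: by default, `tiktoken`'s `encode` raises an error if the text contains a special-token string such as `<|endoftext|>`. With web text, you may need to pass `allowed_special` or `disallowed_special=()` to decide how those are handled.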

In [None]:
from scripts.tokenize_data import tokenize

# Tokenize the train data
train_tokens = tokenize(
  "data/fineweb-3b/train_data.txt",
  "data/fineweb-3b/train_tokens.lst"
)

# Tokenize the validation data
val_tokens = tokenize(
  "data/fineweb-3b/val_data.txt",
  "data/fineweb-3b/val_tokens.lst"
)

Tokenizing data/fineweb-3b/train_data.txt
Lines Read: 10000
Lines Read: 20000
Lines Read: 30000
Lines Read: 40000
Lines Read: 50000
Lines Read: 60000
Lines Read: 70000
Lines Read: 80000
Lines Read: 90000
Lines Read: 100000
Lines Read: 110000
Lines Read: 120000
Lines Read: 130000
Lines Read: 140000
Lines Read: 150000
Lines Read: 160000
Lines Read: 170000
Lines Read: 180000
Lines Read: 190000
Lines Read: 200000
Lines Read: 210000
Lines Read: 220000
Lines Read: 230000
Lines Read: 240000
Lines Read: 250000
Lines Read: 260000
Lines Read: 270000
Lines Read: 280000
Lines Read: 290000
Lines Read: 300000
Lines Read: 310000
Lines Read: 320000
Lines Read: 330000
Lines Read: 340000
Lines Read: 350000
Lines Read: 360000
Lines Read: 370000
Lines Read: 380000
Lines Read: 390000
Lines Read: 400000
Lines Read: 410000
Lines Read: 420000
Lines Read: 430000
Lines Read: 440000
Lines Read: 450000
Lines Read: 460000
Lines Read: 470000
Lines Read: 480000
Lines Read: 490000
Lines Read: 500000
Lines Read: 51000

[273,
 1525,
 34919,
 15810,
 28834,
 1222,
 14699,
 10006,
 412,
 26483,
 12,
 10468,
 12,
 38604,
 7036,
 26,
 9362,
 364,
 838,
 32914,
 50,
 34764,
 56,
 13,
 198,
 49,
 451,
 15810,
 28834,
 290,
 14699,
 10006,
 12533,
 1525,
 34919,
 12,
 20,
 17464,
 1821,
 11414,
 777,
 5672,
 25,
 198,
 38,
 9655,
 3813,
 3201,
 2813,
 33834,
 2177,
 198,
 7376,
 85,
 33087,
 7889,
 259,
 1140,
 3050,
 33834,
 2177,
 198,
 34919,
 1525,
 34919,
 8504,
 12,
 2949,
 786,
 290,
 569,
 571,
 1358,
 15810,
 28834,
 290,
 14699,
 1081,
 4428,
 13508,
 389,
 257,
 3334,
 14156,
 27182,
 284,
 13745,
 22412,
 22349,
 13,
 198,
 4711,
 11801,
 1143,
 15810,
 14732,
 654,
 2345,
 391,
 20446,
 6932,
 393,
 40849,
 445,
 24945,
 28834,
 26632,
 1001,
 3021,
 351,
 3334,
 12,
 42492,
 40870,
 406,
 29812,
 341,
 13,
 198,
 34919,
 1525,
 34919,
 15810,
 28834,
 1222,
 16066,
 10006,
 389,
 17858,
 1522,
 284,
 21167,
 3406,
 23600,
 602,
 329,
 25048,
 11,
 5178,
 11,
 290,
 15553,
 11,
 9175,
 3406,
 21

## GPT-2 355M Config

Let's define the GPT-2 configuration. Compared to the 124M model, the embedding dimension, number of attention heads, and number of layers are all increased, which raises the overall number of trainable weights.

In [10]:
GPT_CONFIG_355M = {
  "vocab_size": 50257,   # Vocabulary size
  "context_length": 1024, # Context length
  "emb_dim": 1024,        # Embedding dimension (larger than 124M)
  "n_heads": 16,         # Number of attention heads (larger than 124M)
  "n_layers": 24,        # Number of layers (larger than 124M)
  "drop_rate": 0.0,      # Dropout rate
  "qkv_bias": False      # Query-key-value bias
}
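
To see where the "355M" comes from, we can count the weights implied by this config. The sketch below assumes a standard GPT-2 style block: an attention output projection with bias, a feed-forward layer that expands to 4x the embedding dimension, and LayerNorm with scale and shift. The function name `count_parameters` and the `tie_weights` flag are hypothetical and only used here.

```python
GPT_CONFIG_355M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "drop_rate": 0.0,
    "qkv_bias": False,
}


def count_parameters(cfg, tie_weights=True):
    """Estimate trainable weights for a standard GPT-2 style architecture."""
    d = cfg["emb_dim"]
    vocab = cfg["vocab_size"]

    # Token embeddings plus learned positional embeddings.
    embeddings = vocab * d + cfg["context_length"] * d

    # Attention: query/key/value projections plus the output projection.
    qkv_bias = 3 * d if cfg["qkv_bias"] else 0
    attention = 3 * d * d + qkv_bias + (d * d + d)

    # Feed-forward: expand to 4*d, then project back to d (both with bias).
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)

    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)

    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * d

    # With weight tying, the output head reuses the token embedding matrix.
    output_head = 0 if tie_weights else d * vocab

    return embeddings + cfg["n_layers"] * per_block + final_norm + output_head


print(f"Tied output head:   {count_parameters(GPT_CONFIG_355M):,}")
print(f"Untied output head: {count_parameters(GPT_CONFIG_355M, tie_weights=False):,}")
```

The 355M name matches the count with a tied output head. If the model keeps a separate output layer, the total is about 51 million weights higher.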

## Training and Validation Dataloaders

Using the config, let's now use the token lists to create the dataloaders and pickle them too. We want to pickle them so that we can reload everything later and won't have to go through the same pain of building this dataset as we just did.

Note that internally `create_train_dataloader` and `create_val_dataloader` operate on a batch size of `1`. 
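
Judging by the progress output below, each chunk covers 1024 tokens, which matches `context_length`, so the train set works out to roughly 2.47 million chunks. Here is a minimal sketch of how a token list can be cut into input/target pairs. It assumes non-overlapping chunks and targets that are the inputs shifted by one token. The function name `chunk_tokens` is hypothetical, and the real script wraps the result in `torch` tensors, which is left out here.

```python
def chunk_tokens(tokens, context_length, stride=None):
    """Yield (inputs, targets) pairs, where targets are shifted by one token."""
    if stride is None:
        stride = context_length  # assumption: chunks do not overlap

    # Stop early enough that every target has a full context_length tokens.
    for start in range(0, len(tokens) - context_length, stride):
        inputs = tokens[start:start + context_length]
        targets = tokens[start + 1:start + context_length + 1]
        yield inputs, targets


# Tiny example with made-up token ids and a context length of 4.
pairs = list(chunk_tokens(list(range(10)), context_length=4))
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Any leftover tokens at the end that don't fill a whole chunk are dropped in this version.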

In [None]:
from scripts.preload_dataloaders import create_train_dataloader, create_val_dataloader

create_train_dataloader(
  GPT_CONFIG_355M,
  "data/fineweb-3b/train_tokens.lst",
  "data/fineweb-3b/train_loader.dl"
)
print("Created train_loader.")


Processing chunk: 0. Token: 0 of 2527323724
Processing chunk: 100000. Token: 102400000 of 2527323724
Processing chunk: 200000. Token: 204800000 of 2527323724
Processing chunk: 300000. Token: 307200000 of 2527323724
Processing chunk: 400000. Token: 409600000 of 2527323724
Processing chunk: 500000. Token: 512000000 of 2527323724
Processing chunk: 600000. Token: 614400000 of 2527323724
Processing chunk: 700000. Token: 716800000 of 2527323724
Processing chunk: 800000. Token: 819200000 of 2527323724
Processing chunk: 900000. Token: 921600000 of 2527323724
Processing chunk: 1000000. Token: 1024000000 of 2527323724
Processing chunk: 1100000. Token: 1126400000 of 2527323724
Processing chunk: 1200000. Token: 1228800000 of 2527323724
Processing chunk: 1300000. Token: 1331200000 of 2527323724
Processing chunk: 1400000. Token: 1433600000 of 2527323724
Processing chunk: 1500000. Token: 1536000000 of 2527323724
Processing chunk: 1600000. Token: 1638400000 of 2527323724
Processing chunk: 1700000. Tok

In [None]:
create_val_dataloader(
  GPT_CONFIG_355M,
  "data/fineweb-3b/val_tokens.lst",
  "data/fineweb-3b/val_loader.dl"
)
print("Created val_loader.")

Processing chunk: 0. Token: 0 of 446181160
Processing chunk: 100000. Token: 102400000 of 446181160
Processing chunk: 200000. Token: 204800000 of 446181160
Processing chunk: 300000. Token: 307200000 of 446181160
Processing chunk: 400000. Token: 409600000 of 446181160
Created val_loader.
