# Pretraining 2: GPT-2 355M - Dataset Preparation 

This notebook is independent of the work that we have been doing so far. Its goal is to walk through basic dataset preparation for training a bigger GPT-2 model. In this case, we're going to try to train a GPT-2 355M model from scratch. 

There are 4 different dataset sizes we can prepare for our model: 100M, 1B, 3B and 10B tokens. Use whichever best suits your system. GPT-2 was trained on closer to 10B tokens, so that size should give the best results, but it requires a lot of system RAM to hold the dataset.

If you are just interested in learning, 100M is good enough to see the difference and produce something somewhat coherent for a 355M parameter model. This requires much less system memory.

### ⚠️⚠️ WARNING - FOR THOSE WHO WANT TO GO BEYOND 100M TOKENS ️⚠️⚠️

If you choose to use a dataset that is very large (ex. 1B+ tokens), this can use _gigabytes_ of memory.

When the data is tokenized and turned into tensors, you'll need a lot of headroom on disk and in memory to accommodate it.

In the spirit of showing you everything step by step, I should warn you that you **will likely** run **out of memory** 💣 running each step of the notebook if you **do not have at least** **64 GB of RAM** (which is what my system has). 

To get around this memory limitation, I have personally configured my system to have **128 GB** of swap space. Relying on swap trades performance for capacity, but it is necessary to get this project done if you don't have enough memory. If you are concerned about your SSD health, then you should probably run this notebook in the cloud. Personally, I don't think it is a big deal. 

Reminder, there is a 100M token dataset as an option if you're concerned about all this. I think that would work for most systems.
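
To get a feel for why the larger datasets are so demanding, here is a rough back-of-the-envelope estimate of how much space the token ids alone take in a few representations. This is only a sketch: the per-token costs assume 64-bit CPython, the "int64" row assumes tokens end up in 64-bit integer tensors, and the function name `estimate_gb` is made up for illustration. It ignores the raw text, intermediate copies, and the pickled files, all of which add to the total.

```python
import sys
from array import array

# Rough per-token storage cost for three representations (64-bit CPython assumed).
BYTES_PER_TOKEN = {
    "python_list": 8 + sys.getsizeof(50256),  # list slot pointer + one int object
    "int64": 8,                               # e.g. a 64-bit integer tensor
    "uint16": array("H").itemsize,            # GPT-2 ids (< 65536) fit in 2 bytes
}


def estimate_gb(n_tokens: int, bytes_per_token: int) -> float:
    """Return the approximate size in gigabytes (1 GB = 1e9 bytes)."""
    return n_tokens * bytes_per_token / 1e9


for label, n_tokens in [
    ("100M", 100_000_000),
    ("1B", 1_000_000_000),
    ("10B", 10_000_000_000),
]:
    sizes = ", ".join(
        f"{kind}: {estimate_gb(n_tokens, size):.1f} GB"
        for kind, size in BYTES_PER_TOKEN.items()
    )
    print(f"{label} tokens -> {sizes}")
```

The gap between a plain Python list and a compact integer array is the main reason the larger datasets spill into swap when every step is held in memory.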

## Acquiring the Dataset

Let's acquire the dataset. We'll be using the `fineweb` family of datasets from HuggingFace. You can grab the dataset manually through a `git clone` from these locations:

* 100M - https://huggingface.co/datasets/Butanium/fineweb-100m-sample-test-set
* 1B - https://huggingface.co/datasets/PatrickHaller/fineweb-1B
* 3B - https://huggingface.co/datasets/PatrickHaller/fineweb-3B
* 10B - https://huggingface.co/datasets/PatrickHaller/fineweb-10B


If you're lazy, you can use this shell script I've written here to trigger the download. The files will be placed in `data/fineweb-xx` within project directory. (`xx` is the number of tokens)

In [42]:
%%bash
# Possible values: fineweb-100m, fineweb-1b, fineweb-3b, fineweb-10b
DATASET_NAME="fineweb-100m" # change this to your dataset

if [ "$DATASET_NAME" == "fineweb-100m" ]; then
    DATASET_URL="https://huggingface.co/datasets/Butanium/fineweb-100m-sample-test-set"
elif [ "$DATASET_NAME" == "fineweb-1b" ]; then
    DATASET_URL="https://huggingface.co/datasets/PatrickHaller/fineweb-1B"
elif [ "$DATASET_NAME" == "fineweb-3b" ]; then
    DATASET_URL="https://huggingface.co/datasets/PatrickHaller/fineweb-3B"
elif [ "$DATASET_NAME" == "fineweb-10b" ]; then
    DATASET_URL="https://huggingface.co/datasets/PatrickHaller/fineweb-10B"
else
    DATASET_URL="https://huggingface.co/datasets/Butanium/fineweb-100m-sample-test-set"
fi


if [ ! -f "data/$DATASET_NAME/README.md" ]; then
    echo "Data set not yet downloaded. Downloading now..."
    git clone "$DATASET_URL" "data/$DATASET_NAME"
else
    echo "Data set is downloaded."
fi 

mkdir -p "data/$DATASET_NAME/text"

Data set is downloaded.
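
If you would rather stay in Python, the same name-to-URL lookup can be sketched with a dictionary. The helper names `resolve_dataset_url` and `clone_dataset` are hypothetical and not part of this project; the sketch assumes `git` is on your `PATH` and mirrors the shell script above, including the fallback to the 100M dataset URL for unknown names.

```python
import subprocess
from pathlib import Path

# Same URLs as the shell script above.
DATASET_URLS = {
    "fineweb-100m": "https://huggingface.co/datasets/Butanium/fineweb-100m-sample-test-set",
    "fineweb-1b": "https://huggingface.co/datasets/PatrickHaller/fineweb-1B",
    "fineweb-3b": "https://huggingface.co/datasets/PatrickHaller/fineweb-3B",
    "fineweb-10b": "https://huggingface.co/datasets/PatrickHaller/fineweb-10B",
}
DEFAULT_DATASET = "fineweb-100m"


def resolve_dataset_url(name: str) -> str:
    """Return the clone URL for a dataset name, falling back to the 100M set."""
    return DATASET_URLS.get(name, DATASET_URLS[DEFAULT_DATASET])


def clone_dataset(name: str, root: str = "data") -> Path:
    """Clone the dataset into root/name unless it is already there (hypothetical helper)."""
    target = Path(root) / name
    if not (target / "README.md").is_file():
        print("Data set not yet downloaded. Downloading now...")
        subprocess.run(["git", "clone", resolve_dataset_url(name), str(target)], check=True)
    else:
        print("Data set is downloaded.")
    # Output directory for the converted text files.
    (target / "text").mkdir(parents=True, exist_ok=True)
    return target
```

Calling `clone_dataset("fineweb-100m")` would then behave like the bash cell. Nothing is downloaded until the function is called.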


The data set is in `parquet` format, so we will need to write a conversion script that converts each `parquet` file to CSV text. 

To do that, `pandas` and `pyarrow` must be installed.

In [23]:
!pip install pandas
!pip install pyarrow



Let's check out what files are in your chosen `fineweb-xx` data. We'll perform the conversion soon.

First, specify the dataset you will want to be using for pre-training.

Possible options are:
```
fineweb-100m
fineweb-1b
fineweb-3b
fineweb-10b
```

In [43]:
# change this here
dataset_name = 'fineweb-100m'

In [26]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)

base_path = f"data/{dataset_name}/data"

all_files = os.listdir(base_path)

for i, filename in enumerate(all_files):
    print(filename)

train-00009-of-00010.parquet
train-00008-of-00010.parquet
validation-00000-of-00001.parquet
train-00001-of-00010.parquet
train-00004-of-00010.parquet
train-00006-of-00010.parquet
train-00002-of-00010.parquet
train-00003-of-00010.parquet
train-00005-of-00010.parquet
train-00007-of-00010.parquet
train-00000-of-00010.parquet
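
Note that `os.listdir` returns names in arbitrary order, which is why the listing above is shuffled. That is harmless here, but if you want the numbered output files to be reproducible between runs, a small sketch like the following sorts the names and keeps only parquet files. The helper name `list_parquet_files` is an assumption for illustration.

```python
import os


def list_parquet_files(base_path: str) -> list[str]:
    """Return the parquet file names in base_path in a stable, sorted order."""
    return sorted(
        name for name in os.listdir(base_path) if name.endswith(".parquet")
    )
```

With this, `all_files = list_parquet_files(base_path)` could stand in for the plain `os.listdir` call.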


What does a single file look like? Let's take a look at the first 100 characters.

In [27]:
output_path = f"data/{dataset_name}/text"

for i, filename in enumerate(all_files):
    fullpath = f"{base_path}/{filename}"

    df = pd.read_parquet(fullpath)

    data = df["text"].to_csv(index=False)
    
    print(data[:100])

    break


text
"Dating is demanding, especially if youare looking to impress. There are numerous methods to en


Now let's write all the files to the output directory. The order in which they are processed isn't really important, but expect the output to be large. When converting each file to text, we can also scrub away the first line, which is simply the `text\n` column header. I don't think it is a big deal to leave it in, but it's so easy to handle now that we might as well...

In [28]:
for i, filename in enumerate(all_files):
    fullpath = f"{base_path}/{filename}"

    df = pd.read_parquet(fullpath)

    data = df["text"].to_csv(index=False)

    # Drop the CSV header line ("text\n") before writing
    if data.startswith("text\n"):
        data = data[len("text\n"):]

    # Be explicit about the encoding so the output is the same on every platform
    with open(f"{output_path}/data-{i}.txt", "w", encoding="utf-8") as f:
        f.write(data)

    print(f"Processed: {filename}")
    

Processed: train-00009-of-00010.parquet
Processed: train-00008-of-00010.parquet
Processed: validation-00000-of-00001.parquet
Processed: train-00001-of-00010.parquet
Processed: train-00004-of-00010.parquet
Processed: train-00006-of-00010.parquet
Processed: train-00002-of-00010.parquet
Processed: train-00003-of-00010.parquet
Processed: train-00005-of-00010.parquet
Processed: train-00007-of-00010.parquet
Processed: train-00000-of-00010.parquet
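
The `text` header line and the opening `"` in the preview earlier both come from CSV conventions. By default `to_csv` uses minimal quoting, the same rule as Python's standard `csv` module: a field is wrapped in quotes only when it contains a comma, a quote, or a line break, and embedded quotes are doubled. The sketch below uses the standard `csv` module rather than pandas to show the effect on a few made-up rows; `to_csv_text` is a hypothetical helper.

```python
import csv
import io


def to_csv_text(rows: list[str]) -> str:
    """Write a one-column CSV with a 'text' header using minimal quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["text"])
    for row in rows:
        writer.writerow([row])
    return buffer.getvalue()


sample = to_csv_text(["plain words", "has, a comma", 'says "hi"', "two\nlines"])
print(sample)
```

This means the text files written above still contain CSV quoting characters. Web documents almost always contain commas or line breaks, so most of them end up wrapped in quotes.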


Test to see if the text has been converted and written correctly. Let's just output the first 2 lines. 

In [29]:
with open(f"{output_path}/data-0.txt") as f:
  print(f.readline())
  print(f.readline())

"Dating is demanding, especially if youare looking to impress. There are numerous methods to ensure you get started regarding the right foot…even befuck sluts for freee an initial big date occurs. It doesn’t just take much in order to make a lady feel truly special, especially if you reveal her you’re curious and you’re a person of word. Soon after are a handful of tactics to generate good feeling prior to the very first big date (as well as second or 3rd):

- Ask the woman away. Yes, you got that right. Cannot phone their to see if she really wants to “hang away” or “meet for a drink sometime”…call their and have her from a genuine time for a certain time and time. This one gesture goes a long way in revealing the girl that you’re curious and not simply finding a casual fling or friendship.



## Create the Training and Validation Text

Now that we have all this data, we will need to create batches for the training set and validation dataset.

We can decide to write some fancy streaming/generator thing to accommodate, but I'm not really here for that right now. 

I am lazy, and for this time, loading everything into memory and doing a split can work. 

I wrote 2 C programs for this:
* `text-builder` - Concatenate all `txt` files to a single `raw_data.txt` file.
* `text-splitter` - To split the `raw_data.txt` and create separate `train_data.txt` and `val_data.txt` datasets.

The performance is really good. It is way faster than what I can do in Python. It was also worth the time to just get some more C skills.

In [None]:
%%bash
DATASET_NAME="fineweb-100m" # Replace this here!!
../text-builder/text-builder "data/$DATASET_NAME/text" "data/$DATASET_NAME/raw_data.txt"

Total files: 11
Total size of files: 4653127311 bytes
Total file size read into memory: 4653127322, Number of additional characters for new line: 11
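
For readers who don't want to build the C tool, the concatenation step can be approximated in Python. This is only a sketch of the behavior suggested by the log above (one newline appended per input file), not a port of `text-builder`; the function name `build_raw_text` is an assumption. It streams in chunks, so it never holds the whole dataset in memory.

```python
import os
import shutil


def build_raw_text(input_dir: str, output_path: str) -> tuple[int, int]:
    """Concatenate every .txt file in input_dir, adding a newline after each one.

    Returns (number of files, total bytes written).
    """
    names = sorted(n for n in os.listdir(input_dir) if n.endswith(".txt"))
    total_bytes = 0
    with open(output_path, "wb") as out:
        for name in names:
            path = os.path.join(input_dir, name)
            with open(path, "rb") as src:
                # Copy in 16 MiB chunks instead of reading the whole file.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
            out.write(b"\n")
            total_bytes += os.path.getsize(path) + 1
    return len(names), total_bytes
```

The output file should live outside `input_dir` (as `raw_data.txt` does here), otherwise it could be picked up as one of its own inputs on a later run.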


Split the data into 2 text files. `train_data.txt` and `val_data.txt`.

In [None]:
%%bash
DATASET_NAME="fineweb-100m" # Replace this here!!
../text-splitter/text-splitter "data/$DATASET_NAME/raw_data.txt" "data/$DATASET_NAME/train_data.txt" "data/$DATASET_NAME/val_data.txt"

Number of characters total: 4653127322, Split index: 3955158272
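
The split index in the log is roughly 85% of the total, so the tool appears to use about an 85/15 train/validation split. Below is a Python sketch of that idea. The exact ratio, the choice to move the split point forward to the next line break, and the name `split_text_file` are all assumptions, since the C source is not shown here.

```python
import os
import shutil


def split_text_file(raw_path, train_path, val_path, train_ratio=0.85,
                    chunk_size=16 * 1024 * 1024):
    """Split raw_path into train/val files at roughly train_ratio of its bytes.

    Returns (total bytes, split index).
    """
    total = os.path.getsize(raw_path)
    split_index = int(total * train_ratio)
    with open(raw_path, "rb") as src:
        # Move the split point forward so it falls just after a line break.
        src.seek(split_index)
        split_index += len(src.readline())
        src.seek(0)
        with open(train_path, "wb") as train:
            remaining = split_index
            while remaining > 0:
                chunk = src.read(min(chunk_size, remaining))
                if not chunk:
                    break
                train.write(chunk)
                remaining -= len(chunk)
        with open(val_path, "wb") as val:
            # Everything after the split index becomes validation data.
            shutil.copyfileobj(src, val, chunk_size)
    return total, split_index
```

Splitting on a line boundary avoids cutting a word or a multi-byte character in half at the seam between the two files.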


Next, create the tokenized data set. This basically just calls `tiktoken`'s `encode` on the text. The tokens go into an array, which we can `pickle` for later use. In fact, I recommend doing so, as by now you're probably also running out of memory/swap space to hold all this data. 

Later, we will write code to unpickle the tokens and create the `torch` dataloaders from them.
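
The contents of `scripts/tokenize_data.py` are not shown in this notebook, so here is a minimal sketch of what a tokenizer like this could look like. The names `tokenize_file` and `load_tokens` are hypothetical. The `encode` argument lets you pass in any callable. When it is omitted, the sketch assumes `tiktoken` is installed and uses its `gpt2` encoding, with `disallowed_special=()` so that special-token strings appearing in the text are encoded as ordinary text instead of raising an error.

```python
import pickle


def tokenize_file(text_path, tokens_path, encode=None, report_every=10_000):
    """Encode a text file line by line and pickle the resulting token list."""
    if encode is None:
        import tiktoken  # imported lazily; only needed for the GPT-2 tokenizer

        encoder = tiktoken.get_encoding("gpt2")

        def encode(text):
            return encoder.encode(text, disallowed_special=())

    tokens = []
    with open(text_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            tokens.extend(encode(line))
            if line_number % report_every == 0:
                print(f"Lines Read: {line_number}")

    with open(tokens_path, "wb") as f:
        pickle.dump(tokens, f, protocol=pickle.HIGHEST_PROTOCOL)
    return tokens


def load_tokens(tokens_path):
    """Reload a token list previously written by tokenize_file."""
    with open(tokens_path, "rb") as f:
        return pickle.load(f)
```

Reading line by line keeps only one line of raw text in memory at a time, although the growing token list itself still has to fit in RAM (or swap).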

In [33]:
from scripts.tokenize_data import tokenize

# Tokenize the train data
train_tokens = tokenize(
  f"data/{dataset_name}/train_data.txt",
  f"data/{dataset_name}/train_tokens.lst"
)

# Tokenize the validation data
val_tokens = tokenize(
  f"data/{dataset_name}/val_data.txt",
  f"data/{dataset_name}/val_tokens.lst"
)

Tokenizing data/fineweb-1b/train_data.txt
Lines Read: 10000
Lines Read: 20000
Lines Read: 30000
Lines Read: 40000
Lines Read: 50000
Lines Read: 60000
Lines Read: 70000
Lines Read: 80000
Lines Read: 90000
Lines Read: 100000
Lines Read: 110000
Lines Read: 120000
Lines Read: 130000
Lines Read: 140000
Lines Read: 150000
Lines Read: 160000
Lines Read: 170000
Lines Read: 180000
Lines Read: 190000
Lines Read: 200000
Lines Read: 210000
Lines Read: 220000
Lines Read: 230000
Lines Read: 240000
Lines Read: 250000
Lines Read: 260000
Lines Read: 270000
Lines Read: 280000
Lines Read: 290000
Lines Read: 300000
Lines Read: 310000
Lines Read: 320000
Lines Read: 330000
Lines Read: 340000
Lines Read: 350000
Lines Read: 360000
Lines Read: 370000
Lines Read: 380000
Lines Read: 390000
Lines Read: 400000
Lines Read: 410000
Lines Read: 420000
Lines Read: 430000
Lines Read: 440000
Lines Read: 450000
Lines Read: 460000
Lines Read: 470000
Lines Read: 480000
Lines Read: 490000
Lines Read: 500000
Lines Read: 51000

## GPT-2 355M Config

Let's define the GPT-2 configuration. Compared to the 124M model, the embedding dimension, number of attention heads, and number of layers are increased to raise the overall number of trainable weights.

In [38]:
GPT_CONFIG_355M = {
  "vocab_size": 50257,   # Vocabulary size
  "context_length": 1024, # Context length
  "emb_dim": 1024,        # Embedding dimension (larger than 124M)
  "n_heads": 16,         # Number of attention heads (larger than 124M)
  "n_layers": 24,        # Number of layers (larger than 124M)
  "drop_rate": 0.0,      # Dropout rate
  "qkv_bias": False      # Query-key-value bias
}
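
As a sanity check that this configuration really lands near 355M parameters, the count can be worked out by hand. The sketch below assumes a standard GPT-2 style block (attention with an output projection, a 4x feed-forward layer, two layer norms with scale and shift) and an output head that shares weights with the token embedding. The model code is not shown in this notebook, so treat the layer layout and the helper name `count_gpt2_params` as assumptions.

```python
GPT_CONFIG_355M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "drop_rate": 0.0,
    "qkv_bias": False,
}


def count_gpt2_params(cfg, tie_embeddings=True):
    """Count trainable parameters for an assumed GPT-2 style architecture."""
    d = cfg["emb_dim"]
    token_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    # Attention: query/key/value projections plus the output projection.
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    # Feed-forward: expand to 4*d, then project back to d (both with bias).
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d
    block = qkv + attn_out + ffn + layer_norms

    final_norm = 2 * d
    out_head = 0 if tie_embeddings else cfg["vocab_size"] * d
    return token_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


print(f"Tied output head:   {count_gpt2_params(GPT_CONFIG_355M):,}")
print(f"Untied output head: {count_gpt2_params(GPT_CONFIG_355M, tie_embeddings=False):,}")
```

Note that `n_heads` does not change the count: the heads split the same `emb_dim`-wide projections into smaller pieces. If the output head does not share weights with the embedding, the total is larger by `vocab_size * emb_dim`.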

## Training and Validation Dataloaders

Using the config, let's now use the token lists to create the dataloaders and pickle them too. We want to pickle them so that we can reload everything later and won't have to go through the same pain of building this dataset as we did just now.

I'd also like to point out that internally `create_train_dataloader` and `create_val_dataloader` operate on a batch size of `4`. If you find that you need to use less memory during training, you will need to recreate these data loaders with a smaller batch size.

In [39]:
# change this to your desired batch size.
# Note: bigger batch size means more VRAM necessary.
# Try: 4, 8, 16, 32
batch_size = 32

In [40]:
from scripts.preload_dataloaders import create_dataloader_to_pickle

create_dataloader_to_pickle(
  GPT_CONFIG_355M,
  f"data/{dataset_name}/train_tokens.lst",
  f"data/{dataset_name}/train_loader.dl",
  batch_size=batch_size
)
print("Created train_loader.")


Processing chunk: 0. Token: 0 of 852689166
Processing chunk: 100000. Token: 102400000 of 852689166
Processing chunk: 200000. Token: 204800000 of 852689166
Processing chunk: 300000. Token: 307200000 of 852689166
Processing chunk: 400000. Token: 409600000 of 852689166
Processing chunk: 500000. Token: 512000000 of 852689166
Processing chunk: 600000. Token: 614400000 of 852689166
Processing chunk: 700000. Token: 716800000 of 852689166
Processing chunk: 800000. Token: 819200000 of 852689166
Created train_loader.
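
In the log above, the token position advances by 1024 for every chunk, which matches the context length. That suggests the loader cuts the token list into non-overlapping windows. The sketch below shows that chunking scheme with plain Python lists. It is an illustration under that assumption, not the code in `scripts/preload_dataloaders.py`, and `make_chunks` is a hypothetical name.

```python
def make_chunks(tokens, context_length):
    """Yield (inputs, targets) windows with a stride of one full context.

    The targets are the inputs shifted one position to the right, so the
    model learns to predict the next token at every position.
    """
    for start in range(0, len(tokens) - context_length, context_length):
        inputs = tokens[start:start + context_length]
        targets = tokens[start + 1:start + context_length + 1]
        yield inputs, targets


# Tiny demonstration with a context length of 4.
for inputs, targets in make_chunks(list(range(10)), context_length=4):
    print(inputs, "->", targets)
```

In a real loader each window would become a pair of `torch` tensors, and `batch_size` of them would be stacked per training step.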


Create the validation data loader. It's much smaller, and won't take as long. It is good practice to leave the validation set unshuffled. 

In [41]:
create_dataloader_to_pickle(
  GPT_CONFIG_355M,
  f"data/{dataset_name}/val_tokens.lst",
  f"data/{dataset_name}/val_loader.dl",
  batch_size=batch_size,
  shuffle=False
)
print("Created val_loader.")

Processing chunk: 0. Token: 0 of 150485876
Processing chunk: 100000. Token: 102400000 of 150485876
Created val_loader.
