Recreating GPT-2 on my own at first, and then pulling in optimizations from Andrej Karpathy's final YouTube video in the Zero to Hero Deep Learning series (from commit cc0a0c606d6c8de9a7cb4c0e7751d1d38c318563 onwards).
Set up a Python environment and install the requirements:
python3 -m venv .
source ./bin/activate
pip3 install -r requirements.txt
Then download and prepare the training and validation datasets:
python3 ./data/prepare.py
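Under the hood, preparation amounts to tokenizing every document with the GPT-2 BPE and writing the token ids to flat binary files. Below is a minimal sketch of that idea, not the actual `data/prepare.py`; the output file names and the train/val split ratio are assumptions.

```python
# Minimal sketch of OpenWebText-style preparation (assumed layout, not the real data/prepare.py):
# tokenize each document with the GPT-2 BPE and append the ids to flat uint16 binaries.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def write_split(docs, path):
    ids = []
    for doc in docs:
        ids.extend(enc.encode_ordinary(doc))    # encode_ordinary ignores special tokens
        ids.append(enc.eot_token)               # delimit documents with <|endoftext|>
    np.array(ids, dtype=np.uint16).tofile(path) # 50,257 < 2**16, so uint16 ids are enough

docs = ["First document...", "Second document..."]  # placeholder for the downloaded corpus
split = int(0.9995 * len(docs))                     # assumed train/val split
write_split(docs[:split], "data/train.bin")
write_split(docs[split:], "data/val.bin")
```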
You can train the model by calling:
python3 ./train.py
Or with DDP (if you have multiple GPUs, highly recommended):
# DDP on 4 GPUs on 1 node (for example)
torchrun --standalone --nproc_per_node=4 train.py
Note that this, by default, loads the training checkpoint located in `out/*.pt`. If there is no training checkpoint, it starts training the model from scratch.
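The resume-or-from-scratch behavior boils down to checking for a checkpoint before the training loop starts. Here is a minimal sketch of that logic, assuming the checkpoint is a dict holding model and optimizer state; the exact keys and layout in train.py may differ.

```python
# Sketch of the resume logic (assumed checkpoint format; train.py's actual keys may differ).
import glob
import torch

def load_or_init(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Load the latest checkpoint from out/ if one exists; return the iteration to resume from."""
    ckpts = sorted(glob.glob("out/*.pt"))
    if not ckpts:
        return 0                                          # no checkpoint: train from scratch
    checkpoint = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(checkpoint["model"])            # resume model weights
    optimizer.load_state_dict(checkpoint["optimizer"])    # and optimizer state
    return checkpoint.get("iter", 0)                      # and the step counter
```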
Sample from the model by calling:
# with no prompt and default max tokens
python3 ./sample.py
# with a prompt
python3 ./sample.py -p "Hello, I'm a language model,"
# with a prompt and setting the maximum tokens to 500
python3 ./sample.py -p "Hello, I'm a language model," -m 500
Note that this, by default, loads the committed checkpoint located in `checkpoint/*.pt`. If there is no committed checkpoint, it will sample from an untrained model.
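The flags above map onto a small argparse interface plus a plain autoregressive sampling loop. A rough sketch follows, where only the `-p` and `-m` flags come from the examples above; the long option names, defaults, and model loading are assumptions.

```python
# Rough sketch of sample.py's CLI and decoding loop (flag names match the examples above;
# the long option names, defaults, and model-loading details are assumptions).
import argparse
import tiktoken
import torch

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--prompt", default="\n", help="prompt to condition the model on")
parser.add_argument("-m", "--max-tokens", type=int, default=100, help="maximum new tokens to sample")
args = parser.parse_args()

enc = tiktoken.get_encoding("gpt2")
idx = torch.tensor([enc.encode(args.prompt)], dtype=torch.long)

@torch.no_grad()
def generate(model: torch.nn.Module, idx: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Sample autoregressively: append one sampled token at a time and feed the sequence back in."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]                # logits for the next token only
        probs = torch.softmax(logits, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
    return idx

# model = ...  # loaded from checkpoint/*.pt, or left untrained if no checkpoint exists
# print(enc.decode(generate(model, idx, args.max_tokens)[0].tolist()))
```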
- My iPad notes: https://www.icloud.com/notes/093qXDGpmFNV63nw9uSiGI6kA
- GPT-2 Paper: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- GPT-1 Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
GPT-2 was trained on the WebText dataset. That dataset is internal to OpenAI, so I will be using the OpenWebText dataset instead. You can find all data in the `data` directory.
Notably, the WebText dataset was scraped with the following constraints:
- All outbound links from Reddit posts with at least 3 karma
- All posts up until December 2017
- ~8 million documents total
- ~40GB of text
- Removal of all Wikipedia documents and links, since Wikipedia is "a common data source for other datasets and could complicate the analysis due to overlapping training data with test evaluation tasks".
OpenAI did not use Common Crawl, in order to avoid the significant data-quality issues they would have had to surmount. Their main aim was to show that unsupervised learning on a large corpus could lead to meta-learning on multiple tasks.
OpenAI leveraged BPE (byte pair encoding) over UTF-8 byte sequences (rather than Unicode code points) to represent the text data. Tokens are sub-word groupings, with a vocabulary size of 50,257. They also applied pre-processing rules that prevent BPE from merging across character categories for any byte sequence.
Since the aim of this project is just to recreate the core of GPT-2, I will leverage tiktoken instead of implementing and training the tokenizer from scratch. This should also allow me to download the open-source weights and know that my model can interoperate with whatever setup OpenAI used internally.
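For example, the pre-trained GPT-2 encoding can be pulled straight from tiktoken:

```python
# Using tiktoken's pre-trained GPT-2 byte-level BPE instead of training a tokenizer from scratch.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, I'm a language model,")
print(tokens)               # sub-word token ids
print(enc.decode(tokens))   # round-trips back to the original string
print(enc.n_vocab)          # 50257, matching GPT-2's vocabulary size
```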
GPT-2 largely follows the GPT-1 architecture (a minimal block sketch follows the lists below), which consists of:
- 12-layer decoder-only transformer
- Masked self-attention with 768-dimensional states and 12 attention heads
- Position-wise feed-forward networks with a 3072-dimensional inner state
- Adam optimizer with a learning rate of ~2.5e-4
- Dropout regularization with a rate of 0.1 on the residual, embedding, and attention layers
- A modified version of L2 regularization with w = 0.01 on all non-bias and non-gain weights
- GELU activation functions
With some modifications:
- LayerNorm was moved to the input of each sub-block
- An additional LayerNorm was added after the final self-attention block
- Modified initialization that accounts for accumulations on the residual path with model depth
- Weights of the residual layers scaled at initialization by a factor of 1/sqrt(N), where N is the number of residual layers
- Context size of 1024
- Batch size of 512
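To make the pre-norm layout concrete, here is a minimal sketch of a single decoder block with GPT-2's dimensions. It is a simplified illustration using `nn.MultiheadAttention`, not the exact modules in this repo.

```python
# Minimal sketch of a GPT-2 style pre-norm decoder block (simplified; not the exact code in this repo).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd: int = 768, n_head: int = 12, dropout: float = 0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)   # LayerNorm moved to the input of each sub-block
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(          # position-wise FFN with 4x inner dim (768 -> 3072 -> 768)
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # causal mask: True marks future positions that may not be attended to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)  # masked self-attention
        x = x + attn_out                   # residual path around attention
        x = x + self.mlp(self.ln_2(x))     # residual path around the FFN
        return x
```

GPT-2 additionally places a final LayerNorm after the last block and scales the initialization of the residual projections by 1/sqrt(N), which this single-block sketch omits.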