# Training a custom GPT-2 model

We will use nanoGPT by Andrej Karpathy.

For full source see https://github.com/karpathy/nanoGPT.git

In [6]:
# Download the nanoGPT source files from Andrej Karpathy's GitHub repository
import urllib.request

# No trailing slash here, so the URLs below do not end up with a double slash
base_url = "https://github.com/karpathy/nanoGPT/raw/master"
urllib.request.urlretrieve(f"{base_url}/model.py", "model.py")
urllib.request.urlretrieve(f"{base_url}/train.py", "train.py")
urllib.request.urlretrieve(f"{base_url}/configurator.py", "configurator.py")
urllib.request.urlretrieve(f"{base_url}/sample.py", "sample.py")

('sample.py', <http.client.HTTPMessage at 0x7f3e4b698400>)

The model configuration is in configs/azure_docs_training.py, but it mostly keeps the defaults (GPT-2 in its small 124M-parameter version).

max_iters is set to 3000 because we do not want to spend more GPU time in this demo.
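
A minimal sketch of how a config file like this can override defaults. It assumes a simplified mechanism in the spirit of configurator.py, where plain Python assignments are executed on top of default values. The `defaults` dictionary, the `apply_overrides` helper and the inline `config_source` string are hypothetical stand-ins for illustration, not nanoGPT code, and the default values shown are illustrative.

```python
# Illustrative defaults, standing in for the values a training script defines itself
defaults = {
    "batch_size": 12,
    "block_size": 1024,
    "max_iters": 600000,
    "init_from": "scratch",
}

# Hypothetical config file content: plain Python assignments
config_source = """
max_iters = 3000
dataset = 'azure_docs'
"""


def apply_overrides(defaults, source):
    """Run the assignments in `source` on top of a copy of `defaults`."""
    config = dict(defaults)
    # Assignments executed by exec() land in the `config` mapping
    exec(source, {}, config)
    return config


config = apply_overrides(defaults, config_source)
print(config["max_iters"])   # overridden by the config file
print(config["batch_size"])  # untouched, keeps its default
```

Because the config is ordinary Python, only the values that differ from the defaults need to be written down, which is why the file used here is short.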

In [12]:
!/bin/python3 train.py configs/azure_docs_training.py

# 180 minutes on NVIDIA A100 GPU

Overriding config with configs/azure_docs_training.py:
out_dir = 'azure_docs_out'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10

always_save_checkpoint = False

wandb_log = False
wandb_project = 'azure_docs'
wandb_run_name = 'nano-gpt-training'

dataset = 'azure_docs'
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8

max_iters = 3000

tokens per iteration will be: 491,520
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 123.59M
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 25, with 19,200 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 10.9024, val loss 10.9210
iter 0: loss 10.9571, time 29977.77ms, mfu -100.00%
iter 10: loss 8.9336, time 3398.18ms, mfu 39.63%
iter 20: loss 9.5884, time 3423.06ms, mfu 39.60%
iter 30: loss 9.0825, time 
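
The "tokens per iteration" figure in the log follows directly from the config values. A small sketch of the arithmetic, assuming a single-GPU run (`num_gpus` is a name introduced here for illustration):

```python
# Values taken from the config printed in the log above
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8
max_iters = 3000
num_gpus = 1  # assumption: the run uses one GPU

# Each optimizer step accumulates gradients over several micro-batches,
# and every micro-batch holds batch_size sequences of block_size tokens
tokens_per_iter = gradient_accumulation_steps * num_gpus * batch_size * block_size
total_tokens = tokens_per_iter * max_iters

print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"tokens processed over {max_iters} iterations: {total_tokens:,}")
```

This gives 491,520 tokens per iteration, matching the log, and roughly 1.47 billion tokens processed over the whole run (counting repeated passes over the same data).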

## Finetuning from gpt2 weights

In [1]:
# GPT2 124M
!/bin/python3 train.py configs/azure_docs_finetuning.py

# 3 minutes on NVIDIA A100 GPU

Overriding config with configs/azure_docs_finetuning.py:
out_dir = 'azure_docs_finetuning_out'
eval_interval = 5
eval_iters = 40

wandb_log = False

dataset = 'azure_docs'
init_from = 'gpt2'     # This is starting point, 124M pretrained GPT2

always_save_checkpoint = False

batch_size = 1
gradient_accumulation_steps = 32
max_iters = 50

# finetune at constant LR
learning_rate = 3e-5
decay_lr = False
tokens per iteration will be: 32,768
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 2.1447, val loss 2.5056
iter 0: loss 2.4903, time 44285.92ms, mfu -100.00%
iter 1: loss 2.6530, time 915.68ms, mfu -100.00%
iter 2: loss 1.9127, time
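
The parameter counts in this log differ slightly from the from-scratch run (123.65M here versus 123.59M there) even though both are the 124M model. A sketch of the arithmetic that reproduces the logged numbers, under these assumptions: the standard GPT-2 small shape (12 layers, width 768), an output layer that shares its weights with the token embedding, and a reported total that leaves out the position embeddings. The helper name `gpt2_param_counts` is hypothetical.

```python
def gpt2_param_counts(n_layer, n_embd, vocab_size, block_size, bias):
    """Return (decayed, non_decayed, reported) parameter counts."""
    wte = vocab_size * n_embd   # token embedding, assumed shared with the output layer
    wpe = block_size * n_embd   # position embedding
    # 2D weights per block: attention qkv, attention projection, MLP up, MLP down
    block_weights = (n_embd * 3 * n_embd + n_embd * n_embd
                     + n_embd * 4 * n_embd + 4 * n_embd * n_embd)
    decayed = wte + wpe + n_layer * block_weights

    # 1D tensors: two LayerNorms per block plus the final one
    layernorm = (2 * n_layer + 1) * n_embd
    non_decayed = layernorm
    if bias:
        non_decayed += layernorm  # LayerNorm biases
        # Biases of the four linear layers in each block
        non_decayed += n_layer * (3 * n_embd + n_embd + 4 * n_embd + n_embd)

    # Assumption: the logged "number of parameters" excludes position embeddings
    reported = decayed + non_decayed - wpe
    return decayed, non_decayed, reported


# From-scratch run: padded vocabulary, no biases
print(gpt2_param_counts(12, 768, 50304, 1024, bias=False))
# Finetuning run: original GPT-2 vocabulary, biases enabled
print(gpt2_param_counts(12, 768, 50257, 1024, bias=True))
```

Under these assumptions the difference comes from two settings visible in the logs: the pretrained checkpoint forces `vocab_size=50257` instead of the padded 50304, and `bias=True` adds the bias vectors, which also explains the jump from 25 to 98 non-decayed tensors.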

In [2]:
# GPT2 XL
!/bin/python3 train.py configs/azure_docs_finetuning_xl.py

# 9 minutes on NVIDIA A100 GPU

Overriding config with configs/azure_docs_finetuning_xl.py:
out_dir = 'azure_docs_finetuning_xl_out'
eval_interval = 5
eval_iters = 40

wandb_log = False

dataset = 'azure_docs'
init_from = 'gpt2-xl'     # This is the starting point: pretrained GPT-2 XL (about 1.5B parameters)

always_save_checkpoint = False

batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# finetune at constant LR
learning_rate = 3e-5
decay_lr = False
tokens per iteration will be: 32,768
Initializing from OpenAI GPT-2 weights: gpt2-xl
loading weights from pretrained gpt: gpt2-xl
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 1555.97M
Downloading (…)lve/main/config.json: 100%|█████| 689/689 [00:00<00:00, 6.00MB/s]
Downloading pytorch_model.bin: 100%|████████| 6.43G/6.43G [00:40<00:00, 157MB/s]
Downloading (…)neration_config.json: 100%|██████| 124/124 [00:00<00:00, 857kB/s]
num decayed parameter tensors: 194, with 1,556,609,600 parameters
num non-decayed parameter tensors:
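
A rough back-of-the-envelope estimate of the GPU memory this model needs during training, which gives some context for `batch_size = 1`. It assumes 32-bit storage for the weights, the gradients and the two AdamW moment buffers (16 bytes per parameter in total) and ignores activations and framework overhead, so the real footprint is higher. The variable names are introduced here for illustration.

```python
# Parameter count as reported in the log above (1555.97M)
num_params = 1555.97e6

# Assumption: fp32 weights, fp32 gradients, and two fp32 AdamW moment buffers
bytes_per_param = 4 + 4 + 4 + 4

training_state_gb = num_params * bytes_per_param / 1e9
print(f"approx. weights + gradients + optimizer state: {training_state_gb:.1f} GB")
```

Under these assumptions the training state alone comes to about 24.9 GB before any activations are stored.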