# `nanoGPT`: GPT-2 Medium (350M Params)

## Install / Setup

### First Time Running

We need to install `ngpt` and set up the Shakespeare dataset.

This only needs to be run the first time you run this notebook.

After running

```python
!python3 -m pip install nanoGPT
```

you will need to restart your runtime (Runtime -> Restart runtime).

After this, you should be able to import `ngpt`:

```python
>>> import ngpt
>>> ngpt.__file__
'/content/nanoGPT/src/ngpt/__init__.py'
```

In [1]:
%%bash

python3 -c 'import ngpt; print(ngpt.__file__)' 2> '/dev/null'

if [[ $? -eq 0 ]]; then
    echo "Has ngpt installed. Nothing to do."
else
    echo "Does not have ngpt installed. Installing..."
    git clone 'https://github.com/saforem2/nanoGPT'
    python3 nanoGPT/data/shakespeare_char/prepare.py
    python3 -m pip install -e nanoGPT -vvv
fi

/lus/grand/projects/datascience/foremans/locations/thetaGPU/projects/saforem2/nanoGPT/src/ngpt/__init__.py
Has ngpt installed. Nothing to do.
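The same check can be done from Python without spawning a subprocess. The sketch below uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of `ngpt`.

```python
import importlib.util


def is_installed(name: str) -> bool:
    """Return True if a top-level module called `name` can be found.

    `find_spec` locates the module without importing it, so a broken
    package will not raise here; it only answers "is it on the path?".
    """
    return importlib.util.find_spec(name) is not None


if is_installed("ngpt"):
    print("Has ngpt installed. Nothing to do.")
else:
    print("Does not have ngpt installed.")
```

Unlike the `%%bash` cell, this sketch only reports the result; cloning the repository and running `pip install -e` are still left to the shell commands above.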


## Post Install

If installed correctly, you should be able to import `ngpt` and see where it was installed:

```python
>>> import ngpt
>>> ngpt.__file__
'/path/to/nanoGPT/src/ngpt/__init__.py'
```

In [2]:
%load_ext autoreload
%autoreload 2

import ngpt
from rich import print
print(ngpt.__file__)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Build Trainer

Explicitly, we:

1. Call `setup_torch(...)`
2. Build `cfg: DictConfig = get_config(...)`
3. Instantiate `config: ExperimentConfig = instantiate(cfg)`
4. Build `trainer = Trainer(config)`
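The strings handed to `get_config(...)` in step 2 are `key=value` overrides. As a rough mental model (a plain-Python illustration, not Hydra's actual parser and not the `ngpt` implementation), entries without a dot pick a named preset for a whole config group, while dotted entries set a single field inside a group. The function `split_overrides` and the dot-based rule are assumptions made for this sketch.

```python
def split_overrides(overrides: list[str]) -> tuple[dict, dict]:
    """Separate preset selections from per-field overrides.

    Assumption for illustration: a key without a dot selects a preset,
    and a dotted key addresses one nested field.
    """
    groups: dict = {}
    values: dict = {}
    for item in overrides:
        key, _, value = item.partition("=")
        if "." not in key:
            groups[key] = value
            continue
        *parents, leaf = key.split(".")
        node = values
        for part in parents:
            # Walk (or create) the nested dictionaries down to the leaf.
            node = node.setdefault(part, {})
        node[leaf] = value  # values stay strings in this sketch
    return groups, values


groups, values = split_overrides(
    ["data=owt", "train=gpt2_medium", "train.dtype=bfloat16", "train.max_iters=1000"]
)
print(groups)
print(values)
```

Read this way, `train=gpt2_medium` chooses the baseline training settings and `train.max_iters=1000` then adjusts one field on top of them.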

In [3]:
import os
import numpy as np
from ezpz import setup_torch
from hydra.utils import instantiate
from ngpt.configs import get_config, PROJECT_ROOT
from ngpt.trainer import Trainer
from enrich.console import get_console

console = get_console()
HF_DATASETS_CACHE = PROJECT_ROOT.joinpath('.cache', 'huggingface')
HF_DATASETS_CACHE.mkdir(exist_ok=True, parents=True)

os.environ['MASTER_PORT'] = '5127'
os.environ['HF_DATASETS_CACHE'] = HF_DATASETS_CACHE.as_posix()

SEED = np.random.randint(2**32)
console.log(f'SEED: {SEED}')

rank = setup_torch('DDP', seed=SEED)  # use the seed that was just logged
cfg = get_config(
    [
        'data=owt',
        'model=gpt2_medium',
        'optimizer=gpt2_medium',
        'train=gpt2_medium',
        'train.dtype=bfloat16',
        'train.max_iters=1000',
        'train.log_interval=100',
        'train.init_from=gpt2-medium',
    ]
)
config = instantiate(cfg)
trainer = Trainer(config)


[2023-11-15 09:48:26][INFO][configs.py:263] - Rescaling GAS -> GAS // WORLD_SIZE = 1 // 1
[2023-11-15 09:48:26][INFO][configs.py:398] - Tokens per iteration: 4,096
[2023-11-15 09:48:26][INFO][configs.py:430] - Using <torch.amp.autocast_mode.autocast object at 0x7fcbf3a11930>
[2023-11-15 09:48:26][INFO][trainer.py:187] - Initializing from OpenAI GPT-2 Weights: gpt2-medium

[2023-11-15 09:48:49][INFO][model.py:225] - loading weights from pretrained gpt: gpt2-medium
[2023-11-15 09:48:49][INFO][model.py:234] - forcing vocab_size=50257, block_size=1024, bias=True
[2023-11-15 09:48:49][INFO][model.py:240] - overriding dropout rate to 0.0
[2023-11-15 09:48:55][INFO][model.py:160] - number of parameters: 353.77M


[2023-11-15 09:49:16][INFO][model.py:290] - num decayed parameter tensors: 98, with 354,501,632 parameters
[2023-11-15 09:49:16][INFO][model.py:291] - num non-decayed parameter tensors: 194, with 321,536 parameters
[2023-11-15 09:49:17][INFO][model.py:297] - using fused AdamW: True
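The three parameter counts in the log above can be cross-checked by hand. The sketch below assumes the standard GPT-2 Medium shape (24 layers, width 1024), biases on every linear layer and LayerNorm (`bias=True` in the log), an output head that shares its weights with the token embedding, and a headline count that leaves out the position embeddings, as upstream nanoGPT does. The `ngpt` source is not shown here, so treat these as assumptions; `GPTShape` and `count_params` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTShape:
    n_layer: int = 24        # GPT-2 Medium depth (assumed standard config)
    n_embd: int = 1024       # GPT-2 Medium width (assumed standard config)
    vocab_size: int = 50257  # from the log
    block_size: int = 1024   # from the log


def count_params(shape: GPTShape) -> dict[str, int]:
    d = shape.n_embd
    # 2-D weights per block: attention qkv (d x 3d), attention projection
    # (d x d), MLP up-projection (d x 4d), MLP down-projection (4d x d).
    block_matrices = 3 * d * d + d * d + 4 * d * d + 4 * d * d
    # 1-D tensors per block: two LayerNorms (weight + bias each) plus the
    # biases of the four linear layers (3d, d, 4d, d).
    block_vectors = 2 * (2 * d) + 3 * d + d + 4 * d + d
    # Token and position embeddings are 2-D, so they count as decayed.
    decayed = (shape.vocab_size + shape.block_size) * d + shape.n_layer * block_matrices
    # The final LayerNorm adds one more weight and bias.
    non_decayed = shape.n_layer * block_vectors + 2 * d
    total = decayed + non_decayed
    return {
        "decayed": decayed,
        "non_decayed": non_decayed,
        "total": total,
        # Headline figure: total minus the position-embedding table.
        "reported": total - shape.block_size * d,
    }


counts = count_params(GPTShape())
print(counts)
print(f"{counts['reported'] / 1e6:.2f}M")
```

Under these assumptions the decayed and non-decayed totals agree with the two optimizer log lines, and the headline figure rounds to the logged 353.77M.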


## Prompt (**prior** to training)

In [4]:
query = "What is a supercomputer? Explain like I'm a child, and speak clearly. Double check your logic.."
outputs = trainer.evaluate(query, num_samples=1, display=False)
console.print(fr'\[prompt]: "{query}"')
console.print("\\[response]:\n\n" + fr"{outputs['0']['raw']}")

## Train Model


|  **NAME**  |     **DESCRIPTION**          |
|:----------:|:----------------------------:|
|   `step`   | Current training step        |
|   `loss`   | Loss value                   |
|   `dt`     | Time per step (in **ms**)    |
|   `sps`    | Samples per second           |
|   `mtps`   | Tokens per sec (in millions) |
|   `mfu`    | Model FLOPs Utilization*     |

Table: Logging legend

*MFU is reported in units of A100 `bfloat16` peak FLOPS.
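To make the throughput columns concrete, here is a small worked example relating `dt` to tokens per second. It uses the "Tokens per iteration: 4,096" value printed while building the trainer and the `dt=387.530` value from the first training log line. The helper `throughput` is hypothetical; the trainer's own formulas are not shown in this notebook.

```python
def throughput(dt_ms: float, tokens_per_iter: int) -> dict[str, float]:
    """Convert a per-step time in milliseconds into rates per second."""
    dt_s = dt_ms / 1000.0
    return {
        "iters_per_sec": 1.0 / dt_s,
        # Millions of tokens processed per second.
        "mtps": tokens_per_iter / dt_s / 1e6,
    }


stats = throughput(dt_ms=387.530, tokens_per_iter=4096)
print(f"{stats['iters_per_sec']:.3f} it/s, {stats['mtps']:.3f} Mtok/s")
# prints "2.580 it/s, 0.011 Mtok/s"
```

In this run the logged `sps` equals `1000 / dt` and the logged `mtps` equals the value computed here. Whether `sps` counts samples or optimizer steps depends on how the trainer defines it.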

In [5]:
trainer.model.module.train()
trainer.train()

[2023-11-15 09:50:50][INFO][trainer.py:516] - step=100 loss=2.791 dt=387.530 sps=2.580 mtps=0.011 mfu=24.642 train_loss=2.837 val_loss=2.826
[2023-11-15 09:51:28][INFO][trainer.py:516] - step=200 loss=2.716 dt=375.216 sps=2.665 mtps=0.011 mfu=24.722 train_loss=2.837 val_loss=2.826
[2023-11-15 09:52:07][INFO][trainer.py:516

## Evaluate Model

In [6]:
query = "What is a supercomputer? Explain like I'm a child, and speak clearly. Double check your logic.."
outputs = trainer.evaluate(query, num_samples=1, display=False)
console.print(fr'\[prompt]: "{query}"')
console.print("\\[response]:\n\n" + fr"{outputs['0']['raw']}")

## Train a Bit More

In [7]:
trainer.model.module.train()
for i in range(10):  # `i` avoids shadowing the built-in `iter`
    console.rule(f'iter: {i}')
    trainer.train(train_iters=100)
    query = "What is a supercomputer?"
    outputs = trainer.evaluate(query, num_samples=1, display=False)
    console.print(fr'\[prompt]: "{query}"')
    console.print("\\[response]:\n\n" + fr"{outputs['0']['raw']}")
    console.rule()

[2023-11-15 09:58:35][INFO][trainer.py:516] - step=1100 loss=2.939 dt=356.182 sps=2.808 mtps=0.011 mfu=26.810 train_loss=0.000 val_loss=0.000
[2023-11-15 09:59:28][INFO][trainer.py:516] - step=1200 loss=2.907 dt=400.617 sps=2.496 mtps=0.010 mfu=23.837 train_loss=0.000 val_loss=0.000
[2023-11-15 10:00:22][INFO][trainer.py:516] - step=1300 loss=2.941 dt=423.726 sps=2.360 mtps=0.010 mfu=22.537 train_loss=0.000 val_loss=0.000
[2023-11-15 10:01:14][INFO][trainer.py:516] - step=1400 loss=2.948 dt=366.042 sps=2.732 mtps=0.011 mfu=26.088 train_loss=0.000 val_loss=0.000
[2023-11-15 10:02:07][INFO][trainer.py:516] - step=1500 loss=2.871 dt=341.233 sps=2.931 mtps=0.012 mfu=27.985 train_loss=0.000 val_loss=0.000
[2023-11-15 10:03:00][INFO][trainer.py:516] - step=1600 loss=2.997 dt=416.252 sps=2.402 mtps=0.010 mfu=22.941 train_loss=0.000 val_loss=0.000
[2023-11-15 10:03:53][INFO][trainer.py:516] - step=1700 loss=3.223 dt=392.194 sps=2.550 mtps=0.010 mfu=24.349 train_loss=0.000 val_loss=0.000
[2023-11-15 10:04:45][INFO][trainer.py:516] - step=1800 loss=3.046 dt=353.265 sps=2.831 mtps=0.012 mfu=27.032 train_loss=0.000 val_loss=0.000
[2023-11-15 10:05:37][INFO][trainer.py:516] - step=1900 loss=3.108 dt=413.097 sps=2.421 mtps=0.010 mfu=23.116 train_loss=0.000 val_loss=0.000
[2023-11-15 10:06:30][INFO][trainer.py:516] - step=2000 loss=3.133 dt=378.236 sps=2.644 mtps=0.011 mfu=25.247 train_loss=0.000 val_loss=0.000
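To compare runs, it helps to turn log lines like the ones above into numbers. The following is a minimal sketch that pulls the `key=value` metrics out of one line. It first strips terminal colour codes, with or without the leading escape byte, since copied notebook output often keeps them. `parse_metrics` is a hypothetical helper, not part of `ngpt`.

```python
import re

# Terminal colour codes such as "[35m" or "[0m"; the ESC byte is optional
# because it is frequently lost when output is copied as plain text.
ANSI = re.compile(r"\x1b?\[[0-9;]*m")
# A metric is a word, an equals sign, and a (possibly decimal) number.
METRIC = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def parse_metrics(line: str) -> dict[str, float]:
    """Return every `name=number` pair found in a single log line."""
    clean = ANSI.sub("", line)
    return {name: float(value) for name, value in METRIC.findall(clean)}


line = (
    "[2023-11-15 10:06:30][INFO][trainer.py:516] - step=2000 loss=3.133 "
    "dt=378.236 sps=2.644 mtps=0.011 mfu=25.247 train_loss=0.000 val_loss=0.000"
)
metrics = parse_metrics(line)
print(metrics["step"], metrics["loss"])
```

Applied to each line in turn, this gives a list of dictionaries that can be plotted or tabulated, for example to see how `loss` moves between step 1100 and step 2000.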
