## Train a character-level GPT on some text data

The inputs here are plain text files, which we split into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and train it to predict the next character.

In [1]:
# set up logging
import logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

In [2]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [3]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [4]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):
    """Serves random (input, target) windows of block_size characters from a text."""

    def __init__(self, data, block_size):
        chars = sorted(set(data))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))

        # lookup tables between characters and integer token ids
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __len__(self):
        # number of non-overlapping (block_size + 1)-character chunks; this defines one "epoch"
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat": ignore idx and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = [self.stoi[s] for s in chunk]
        # the target is the input shifted by one character
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
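
To make the input/target offset concrete, here is a minimal sketch in plain Python (no torch) of what a single `__getitem__` call produces. The toy string, the `block_size` of 4, and the fixed start index `i = 0` are assumptions chosen for illustration; the real dataset picks `i` at random.

```python
# Toy illustration of the x/y shift in CharDataset.__getitem__.
# The string, block_size, and start index are made up for this example.
data = "hello world"
block_size = 4

chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

i = 0                                 # fixed here; the dataset samples this at random
chunk = data[i:i + block_size + 1]    # block_size + 1 characters
dix = [stoi[s] for s in chunk]
x, y = dix[:-1], dix[1:]              # y is x shifted left by one position

# at every position t, the model sees x[:t+1] and is asked to predict y[t]
for t in range(block_size):
    context = ''.join(itos[k] for k in x[:t + 1])
    print(repr(context), '->', repr(itos[y[t]]))
```

Each chunk therefore yields `block_size` training examples at once, one per position.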


In [5]:
block_size = 128 # spatial extent of the model for its context

In [6]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 1115394 characters, 65 unique.
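
As a sanity check, the number of iterations per epoch follows from `__len__` and the batch size. This is a small arithmetic sketch using only numbers that appear in this notebook (1,115,394 characters, `block_size = 128`, and the `batch_size = 512` passed to the trainer), and it assumes the data loader keeps the final partial batch.

```python
import math

n_chars = 1115394     # from the dataset printout
block_size = 128
batch_size = 512      # from the TrainerConfig used in this notebook

# CharDataset.__len__: chunks of block_size + 1 characters, rounded up
samples_per_epoch = math.ceil(n_chars / (block_size + 1))
# assumption: the last, smaller batch is not dropped
iters_per_epoch = math.ceil(samples_per_epoch / batch_size)

print(samples_per_epoch, iters_per_epoch)  # 8647 17
```

The 17 iterations agree with the `17/17` shown in the training progress bars (the `iter 16` in the log is zero-based).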


In [7]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

08/17/2020 00:11:58 - INFO - mingpt.model -   number of parameters: 2.535219e+07


In [8]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=200, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=200*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 16: train loss 3.31022. lr 5.999637e-04: 100%|██████████| 17/17 [00:36<00:00,  2.18s/it]
epoch 2 iter 16: train loss 2.89320. lr 5.998533e-04: 100%|██████████| 17/17 [00:04<00:00,  3.78it/s]
epoch 3 iter 16: train loss 2.63845. lr 5.996690e-04: 100%|██████████| 17/17 [00:04<00:00,  3.74it/s]
epoch 4 iter 16: train loss 2.54588. lr 5.994107e-04: 100%|██████████| 17/17 [00:04<00:00,  3.87it/s]
epoch 5 iter 16: train loss 2.49512. lr 5.990785e-04: 100%|██████████| 17/17 [00:04<00:00,  3.98it/s]
epoch 6 iter 16: train loss 2.46732. lr 5.986726e-04: 100%|██████████| 17/17 [00:04<00:00,  3.96it/s]
epoch 7 iter 16: train loss 2.44716. lr 5.981929e-04: 100%|██████████| 17/17 [00:04<00:00,  3.95it/s]
epoch 8 iter 16: train loss 2.37363. lr 5.976397e-04: 100%|██████████| 17/17 [00:04<00:00,  3.93it/s]
epoch 9 iter 16: train loss 2.34669. lr 5.970130e-04: 100%|██████████| 17/17 [00:04<00:00,  3.96it/s]
epoch 10 iter 16: train loss 2.28792. lr 5.963130e-04: 100%|██████████| 17/17 [00:

epoch 156 iter 16: train loss 0.36846. lr 6.885214e-05: 100%|██████████| 17/17 [00:04<00:00,  3.97it/s]
epoch 157 iter 16: train loss 0.35783. lr 6.587674e-05: 100%|██████████| 17/17 [00:04<00:00,  3.96it/s]
epoch 158 iter 16: train loss 0.36345. lr 6.295911e-05: 100%|██████████| 17/17 [00:04<00:00,  4.01it/s]
epoch 159 iter 16: train loss 0.35740. lr 6.009997e-05: 100%|██████████| 17/17 [00:04<00:00,  3.98it/s]
epoch 160 iter 16: train loss 0.36017. lr 6.000000e-05: 100%|██████████| 17/17 [00:04<00:00,  4.00it/s]
epoch 161 iter 16: train loss 0.35203. lr 6.000000e-05: 100%|██████████| 17/17 [00:04<00:00,  3.98it/s]
epoch 162 iter 16: train loss 0.34658. lr 6.000000e-05: 100%|██████████| 17/17 [00:04<00:00,  3.98it/s]
epoch 163 iter 16: train loss 0.35008. lr 6.000000e-05: 100%|██████████| 17/17 [00:04<00:00,  3.93it/s]
epoch 164 iter 16: train loss 0.34701. lr 6.000000e-05: 100%|██████████| 17/17 [00:04<00:00,  4.01it/s]
epoch 165 iter 16: train loss 0.34820. lr 6.000000e-05: 100%|███
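
The `lr` column in the log is consistent with a linear warmup over `warmup_tokens` followed by a cosine decay towards `final_tokens`, floored at 10% of `learning_rate`. The sketch below is a reconstruction under that assumption, not the Trainer's actual code; `lr_at` and its `floor` parameter are hypothetical names introduced here.

```python
import math

def lr_at(tokens, learning_rate=6e-4, warmup_tokens=512 * 20,
          final_tokens=200 * 8647 * 128, floor=0.1):
    """Hypothetical helper: linear warmup, then cosine decay with a floor."""
    if tokens < warmup_tokens:
        mult = tokens / max(1, warmup_tokens)
    else:
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return learning_rate * mult

# tokens seen per epoch: len(train_dataset) * block_size = 8647 * 128
tokens_per_epoch = 8647 * 128
for epoch in (1, 2, 160):
    print(epoch, '%e' % lr_at(epoch * tokens_per_epoch))
```

Under these assumptions the values at the end of epochs 1 and 2 line up with the logged `5.999637e-04` and `5.998533e-04`, and the floor of `6e-05` is reached around epoch 160, which is where the logged learning rate stops changing.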

In [12]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! which is the business so harm!
Well, lords, and save yourselves; and no oath to be angry
That in their embraces: and, to brave the life
We have forgot and bandy as that time
Have told me and he bids me for this excellent,
Now I would say he looks on the banks
And give more strength than a wild and provide
A salt that with some friendly vow,
That from the reaches of the gain and stop the sleeves
Do scope that which He should hide for his guard
As miser made thee first way from his holy exercise.

BUCKINGHAM:
Go, rating to London, with all these woful chances
Misthink the king and not be satisfied!

Son:
Was ever son so rued a father's death?

Father:
The warn's idle buy and blows: and then to make a
fire, sir, I will keep my capss with stars out
And safely point of good content.
Signior Lucentio, let us hence; good gods rest ourselves:
We shall we show her own heaven and the king
In me resolved: I have seen a lady's nose
That has been blue, but not her eyebrows.

First Lad
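
The `temperature=0.9` and `top_k=5` arguments control how each next character is drawn. Here is a minimal standard-library sketch of temperature-scaled top-k sampling for a single step. It is not `mingpt.utils.sample`; the `sample_top_k` helper and the logits are made up for illustration.

```python
import math
import random

def sample_top_k(logits, temperature=0.9, top_k=5, rng=random):
    """Hypothetical helper: draw one index from temperature-scaled top-k logits."""
    scaled = [l / temperature for l in logits]
    # keep only the indices of the k largest logits
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # unnormalized softmax over the kept logits (subtract the max for stability)
    m = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - m) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0, -4.0]  # made-up scores for 8 "characters"
draws = [sample_top_k(logits, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Only the five highest-scoring indices can ever be drawn, and a temperature below 1 sharpens the distribution towards the most likely one.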

In [None]:
# well that was fun