## Train a character-level GPT on some text data

The inputs here are simple text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. The original example feeds it some Shakespeare; the run recorded here instead uses a GB18030-encoded Chinese text, which we'll get the model to predict one character at a time.

In [2]:
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [3]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [4]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [5]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) and targets Y (B, T), where B
        is the batch size and T is block_size, the network will during training
        be simultaneously learning to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the
        sequence along each batch dimension, and repeatedly feed that character
        back in to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
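The "hello" example from the docstring can be reproduced without any tensors. The block below is a minimal sketch: `make_example` is a hypothetical stand-in for `CharDataset.__getitem__` that returns plain lists, and the tiny vocabulary is built from the word itself.

```python
# Hypothetical stand-in for CharDataset.__getitem__, using lists instead of tensors.
def make_example(text, stoi, idx, block_size):
    # grab block_size + 1 characters and encode them as integers
    chunk = text[idx:idx + block_size + 1]
    dix = [stoi[ch] for ch in chunk]
    # inputs are everything but the last id, targets are shifted by one
    return dix[:-1], dix[1:]

text = "hello"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

x, y = make_example(text, stoi, idx=0, block_size=4)

# every prefix of x is paired with exactly one target in y
pairs = [("".join(itos[i] for i in x[:t + 1]), itos[y[t]]) for t in range(len(x))]
for context, target in pairs:
    print(f"given {context!r} predict {target!r}")
```

One chunk of `block_size + 1` characters therefore yields `block_size` training targets, which is the amortization the docstring describes.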


In [6]:
block_size = 128 # spatial extent of the model for its context

In [7]:
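import chardet

# read the raw bytes once so the encoding can be inspected before decoding
with open('input.txt', 'rb') as f:
    raw_data = f.read()

# chardet's guess is kept for reference only; decoding below is pinned to GB18030
encoding_info = chardet.detect(raw_data)
detected_encoding = encoding_info['encoding']

# this run uses a GB18030-encoded Chinese text saved as input.txt
with open('input.txt', 'r', encoding='gb18030') as f:
    text = f.read()
# the original Shakespeare data can be downloaded at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
#text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters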

data has 773198 characters, 4396 unique.
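If the encoding of a text file is not known in advance, one option is to try a short list of candidates in order and keep the first that decodes cleanly. The block below is only a sketch of that idea: `decode_with_fallback` and its candidate list are hypothetical and not part of this notebook or of `chardet`. The order is a heuristic, because some byte sequences are valid under more than one encoding.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw without errors."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# two ideographic spaces followed by a short phrase, encoded as GB18030
original = "\u3000\u3000诗曰："
sample_bytes = original.encode("gb18030")

decoded, used = decode_with_fallback(sample_bytes)
print(used, decoded == original)

# plain ASCII text is accepted by the first candidate
ascii_text, ascii_enc = decode_with_fallback("hello".encode("utf-8"))
```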


In [11]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=12, n_head=8, n_embd=512)
model = GPT(mconf)

03/08/2025 15:21:06 - INFO - mingpt.model -   number of parameters: 4.239667e+07
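The logged parameter count can be sanity-checked by hand. The sketch below assumes a GPT-2-style layout (four attention projections with biases, an MLP with a 4x hidden width, two LayerNorms per block, a learned positional embedding, a final LayerNorm, and an untied output head without bias). That layout is an assumption rather than something read from `mingpt.model`, and `gpt_param_count` is a hypothetical helper.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    # assumed layout: GPT-2-style blocks with a 4x MLP (see the note above)
    attn = 4 * (n_embd * n_embd + n_embd)        # query, key, value, output projections + biases
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * (2 * n_embd)                     # two LayerNorms per block (weight + bias)
    per_block = attn + mlp + norms
    tok_emb = vocab_size * n_embd                # token embedding table
    pos_emb = block_size * n_embd                # learned positional embedding
    final_norm = 2 * n_embd                      # LayerNorm before the head
    head = n_embd * vocab_size                   # output projection, assumed to have no bias
    return n_layer * per_block + tok_emb + pos_emb + final_norm + head

# values taken from this run: 4396 unique characters, block_size 128, 12 layers, 512-d embeddings
total = gpt_param_count(vocab_size=4396, block_size=128, n_layer=12, n_embd=512)
print(f"{total:e}")
```

Under these assumptions the total agrees with the figure in the log line. Note that `n_head` does not enter the count, since the heads only split the same projection matrices.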


In [12]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=2, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 1509: train loss 0.36099. lr 3.000244e-04: 100%|██████████| 1510/1510 [20:12<00:00,  1.25it/s]
epoch 2 iter 1509: train loss 0.19898. lr 6.000000e-05: 100%|██████████| 1510/1510 [20:16<00:00,  1.24it/s]
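The iteration count and the learning rates in the progress bars can be reproduced with a little arithmetic. The sketch below assumes a linear warmup followed by cosine decay with a floor at 10% of the base rate, driven by the number of tokens processed, and a DataLoader that keeps the final partial batch. `lr_at` is a hypothetical helper written for illustration, not the trainer's code.

```python
import math

def lr_at(tokens, base_lr, warmup_tokens, final_tokens):
    # assumed schedule: linear warmup, then cosine decay floored at 10% of base_lr
    if tokens < warmup_tokens:
        mult = tokens / max(1, warmup_tokens)
    else:
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

# values from this run
n_chars, block_size, batch_size, max_epochs = 773198, 128, 512, 2
n_examples = n_chars - block_size                      # CharDataset.__len__
iters_per_epoch = math.ceil(n_examples / batch_size)   # last batch is partial
tokens_per_epoch = n_examples * block_size
final_tokens = max_epochs * tokens_per_epoch
predictions_per_step = batch_size * block_size         # B*T targets per full batch

lr_epoch1 = lr_at(tokens_per_epoch, 6e-4, 512 * 20, final_tokens)
lr_epoch2 = lr_at(final_tokens, 6e-4, 512 * 20, final_tokens)
print(iters_per_epoch, f"{lr_epoch1:e}", f"{lr_epoch2:e}")
```

With these assumptions the sketch gives 1510 iterations per epoch, roughly half the base rate at the end of epoch 1, and the 10% floor at the end of epoch 2, in line with the progress bars.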


In [21]:
# alright, let's sample some character-level text from the trained model
from mingpt.utils import sample

#context = "My God !, O God! you can't do this thing!"
context = '我们都是憋老仔，脖子上喜欢挂玉佩。来财， 来， 来'
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 3000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

我们都是憋老仔，脖子上喜欢挂玉佩。来财， 来， 来旺儿挑担儿受私贿

    
　　诗曰：
　　簟展湘纹浪欲生，幽怀自感梦难成。
　　倚床剩觉添风味，开户羞将待月明。
　　拟倩蜂媒传密意，难将萤火照离情。
　　遥怜织女佳期近，时看银河几曲横。
　　话说一日，陈敬济听见薛嫂儿说知孙雪娥之事。这陈敬济乘着这个根由，就如此这般，使薛嫂儿往西门庆家对月娘说。薛嫂只得见月娘，说：“陈姑夫在外声言发话，说不要大姐，要写状子，巡抚、巡按处告示，说老爹在日，收着他父亲寄放的许多金银箱笼细软之物。”这月娘一来因孙雪娥被来旺儿盗财拐去，二者又是来安儿小厮走了，三者家人来兴媳妇惠秀又死了，刚打发出去，家中正七事八事，听见薛嫂儿来说此话，唬的慌了手脚，连忙雇轿子，打发大姐家去。但是大姐床奁箱厨陪嫁之物，交玳安雇人，都抬送到陈敬济家。敬济说：“这是他随身嫁我的床帐妆奁，还有我家寄放的细软金银箱笼，须索还我。”薛嫂道：“你大丈母说来，当初丈人在时，止收下这个床奁嫁妆，并没见你别的箱笼。”敬济又要使女元宵儿。薛嫂儿和玳安儿来对月娘说。月娘不肯把元宵与他，说：“这丫头是李娇儿房中使的，如今留着晚早看哥儿哩。”把中秋儿打发将来，说：“原是买了伏侍大姐的。”这敬济又不要中秋儿，两头来回只教薛嫂儿走。他娘张氏向玳安说：“哥哥，你到家拜上你大娘，你家姐儿们多，也不稀罕这个使女看守哥儿。既是与了大姐房里好一向，你姐夫已是收用过了他，你大娘只顾留怎的？”玳安一面到家，把此话对月娘说了。月娘无言可对，只得把元宵儿打发将来。敬济收下，满心欢喜，说道：“可怎的也打我这条道儿来？”正是：
　　饶你奸似鬼，吃我洗脚水。
　　按下一头。单说李知县儿子李衙内，自从清明郊外看见吴月娘、孟玉楼两人一般打扮，生的俱有姿色，知是西门庆妻小。衙内有心，爱孟玉楼生的长挑身材，瓜子面皮，模样儿风流俏丽。原来衙内丧偶，鳏居已久，一向着媒妇各处求亲，都不遂意。及见玉楼，便觉动心，但无门可入，未知嫁与不嫁，从违如何。不期雪娥缘事在官，已知是西门庆家出来的，周旋委曲，在伊父案前，将各犯用刑研审，追出赃物数目，望其来领。月娘害怕，又不使人见官。衙内失望，因此才将赃物入官，雪娥官卖。至是衙内谋之于廊吏何不韦，径使官媒婆陶妈妈来西门庆家访求亲事，许说成此门亲事，免县中打卯，还赏银五两。
　　这陶妈妈听了，喜欢的疾走如飞，一日到于西门庆门首。来昭正在门
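The `temperature` and `top_k` arguments passed to `sample` control how each next character is drawn. The block below is a pure-Python sketch of temperature scaling plus top-k filtering for a single step. It is an illustration of the general technique, not minGPT's implementation, and `sample_top_k` is a hypothetical helper.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    # lower temperature sharpens the distribution, higher flattens it
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # keep the k largest logits (ties at the cutoff are also kept), mask the rest
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    # softmax weights, shifted by the max for numerical stability
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # draw one index in proportion to its weight
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
draws = {sample_top_k(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(200)}
greedy = {sample_top_k(logits, top_k=1, rng=rng) for _ in range(20)}
print(draws, greedy)
```

Generating 3000 characters means repeating a step like this 3000 times, each time appending the drawn character to the context and running another forward pass.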

In [None]:
# well that was fun