## Train a character-level GPT on some text data

The inputs here are simple text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and train it to predict the text one character at a time.

<span style="color: blue;">

**Overview:** This demo trains a character-level GPT (Generative Pre-trained Transformer) on Shakespeare text. Unlike models that operate on word or subword tokens, a char-level model predicts the next *character* at each step. This keeps the vocabulary small (65 distinct characters in this dataset) but makes sequences much longer. The minGPT architecture follows the GPT-2 design: stacked decoder-only Transformer blocks with causal (masked) self-attention.

</span>

In [1]:
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

<span style="color: blue;">

**Logging setup:** Configures Python's `logging` module to print a timestamp, log level, logger name, and message for every record at `INFO` level or above. This is useful for tracking training progress and debugging; minGPT uses it, for example, to report the model's parameter count.

</span>

In [2]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

<span style="color: blue;">

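**Reproducibility:** `set_seed(42)` fixes the random seed for NumPy, PyTorch, and Python's random module. This makes training and sampling repeatable across runs on the same hardware and software stack, which is essential for reproducible experiments. Some GPU kernels are nondeterministic, so bit-identical results are not guaranteed on every setup.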

</span>
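<span style="color: blue;">

As a rough illustration of what a seeding helper of this kind does, the sketch below seeds the standard `random` module and, when they are installed, NumPy and PyTorch. The name `seed_everything` is hypothetical and this is not minGPT's actual `set_seed` source; the guarded imports are only there so the snippet runs on its own.

```python
import random

def seed_everything(seed):
    """Hypothetical helper: seed every random number generator we can find."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # CPU generator
        torch.cuda.manual_seed_all(seed)  # GPU generators (ignored without CUDA)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # the two draws are identical after re-seeding
```

</span>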

In [3]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

<span style="color: blue;">

**Imports:** Standard libraries for the demo: NumPy for arrays, PyTorch for tensors and neural networks, and `F` (functional) for operations like softmax used in attention.

</span>

In [4]:
%%writefile dataset.py

import math
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data))) # get all unique characters
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) } # {'a': 0, 'b': 1, 'c': 2, ...}
        self.itos = { i:ch for i,ch in enumerate(chars) } # {0: 'a', 1: 'b', 2: 'c', ...}
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        """
        Return the number of training samples.

        Each sample is a chunk of (block_size + 1) characters, so the start
        index of a sample can be 0, 1, 2, ..., len(data) - (block_size + 1).
        That gives len(data) - (block_size + 1) + 1 = len(data) - block_size samples.
        """
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1] # idx to idx + block_size + 1
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk] # block_size + 1 integer ids, one per character
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long) # input: chunk positions 0 .. block_size-1
        y = torch.tensor(dix[1:], dtype=torch.long) # target: chunk positions 1 .. block_size (the next character at each step)
        return x, y


Writing dataset.py


<span style="color: blue;">

**CharDataset:** A PyTorch `Dataset` that:
- Builds a character vocabulary (`stoi` = char→index, `itos` = index→char)
- Slices the text into overlapping chunks of length `block_size + 1`
- Returns `(x, y)` where `x` = first `block_size` chars, `y` = next `block_size` chars (shifted by 1). This implements *teacher forcing*: given chars 1..T, predict char 2..T+1
- Enables efficient batched training: one forward pass trains the model on many positions at once (B×T predictions per batch). At inference, we generate one token at a time autoregressively.

---

**A concrete “drawn” alignment (what one sample trains):**

Assume:
- `block_size = T = 4`
- A single text chunk is `"hello"` (length \(T+1=5\))
- So `x = "hell"` and `y = "ello"`

```
chunk (T+1):   h    e    l    l    o
index t:       0    1    2    3    4
               |----- x ------|           (x has length T)
                    |----- y ------|      (y has length T)

x (input):     h    e    l    l
y (target):    e    l    l    o
pos in x/y:    0    1    2    3
```

At each position `pos` (0..T-1), the model does a next-character classification:

```
pos=0: input prefix = "h"        -> predict y[0] = "e"
pos=1: input prefix = "he"       -> predict y[1] = "l"
pos=2: input prefix = "hel"      -> predict y[2] = "l"
pos=3: input prefix = "hell"     -> predict y[3] = "o"
```

Even though we feed the whole `x="hell"` in one shot, *causal masking* ensures that the computation at `pos=k` can only see `x[:k+1]` (no peeking at future characters).

---

**What changes with batching:**

Let batch size be `B = 2`, `T = 4`. Suppose we sampled two chunks:
- Sample 0 chunk: `"hello"` → `x0="hell"`, `y0="ello"`
- Sample 1 chunk: `"orld!"` → `x1="orld"`, `y1="rld!"`

Then the batch tensors look like:

```
X (B,T) =
  [ h  e  l  l ]   # sample 0
  [ o  r  l  d ]   # sample 1

Y (B,T) =
  [ e  l  l  o ]   # sample 0
  [ r  l  d  ! ]   # sample 1
```

Training computes `B*T` next-char predictions at once. You can think of it as this table of tasks:

```
                time position (pos)
              0            1             2              3
sample 0   "h" -> e    "he" -> l     "hel" -> l     "hell" -> o
sample 1   "o" -> r    "or" -> l     "orl" -> d     "orld" -> !
```

So one forward/backward pass trains on \(B\times T\) supervised targets simultaneously.

---

**Why test-time / generation is different:**

At generation time, you only have a prompt and the future characters do not exist yet, so you must generate them step-by-step:

```
start prompt: "he"

step 1:
  input  = "he"
  model  -> distribution for next char
  pick   -> "l"
  new sequence = "hel"

step 2:
  input  = "hel"
  model  -> distribution for next char
  pick   -> "l"
  new sequence = "hell"

step 3:
  input  = "hell"
  model  -> distribution for next char
  pick   -> "o"
  new sequence = "hello"
```

You cannot parallelize across these newly generated time steps because each step depends on the previous generated output (while training can parallelize across the *existing* positions of `x` using teacher forcing + causal masking).


</span>
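<span style="color: blue;">

The same alignment can be checked with a few lines of plain Python. This is a minimal sketch that mirrors the indexing of `CharDataset` on a toy string without PyTorch; `toy_text` and `get_pair` are illustrative names, not part of the notebook.

```python
toy_text = "hello world!"
block_size = 4

# Build the vocabulary the same way CharDataset does: sorted unique characters.
chars = sorted(set(toy_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def get_pair(idx):
    """Return (x, y) index lists for the window starting at idx."""
    chunk = toy_text[idx:idx + block_size + 1]  # block_size + 1 characters
    ids = [stoi[c] for c in chunk]
    return ids[:-1], ids[1:]  # y is x shifted left by one position

num_samples = len(toy_text) - block_size  # same formula as __len__
x, y = get_pair(0)
print(num_samples)
print("".join(itos[i] for i in x), "->", "".join(itos[i] for i in y))

# Each position in x yields one training example: a prefix and its next character.
for pos in range(block_size):
    prefix = "".join(itos[i] for i in x[:pos + 1])
    print(repr(prefix), "->", repr(itos[y[pos]]))
```

The last valid start index is `num_samples - 1`, whose window here is `"orld!"`; any later start would run off the end of the text.

</span>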

In [5]:
block_size = 128 # spatial extent of the model for its context

<span style="color: blue;">

**block_size (context length):** The maximum number of characters the model can "see" at once. With `block_size=128`, each input is 128 chars; causal masking means position `i` attends only to positions up to and including `i`. A larger `block_size` gives more context but costs more memory and compute, because self-attention scales quadratically with sequence length.

</span>
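<span style="color: blue;">

To make the masking rule concrete, here is a small sketch that builds a causal mask as a lower-triangular table in plain Python and counts the allowed query/key pairs. It illustrates the idea only and is not how minGPT stores its mask; `causal_mask` is an illustrative name, and positions are 0-indexed.

```python
def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

mask = causal_mask(4)
for row in mask:
    print(" ".join("1" if allowed else "." for allowed in row))

# The number of allowed (query, key) pairs is T * (T + 1) / 2, so the
# attention work grows quadratically with the context length T.
for T in (4, 128, 256):
    allowed = sum(sum(row) for row in causal_mask(T))
    print(T, allowed)
```

Doubling the context from 128 to 256 roughly quadruples the number of pairs, which is the cost the `block_size` choice trades against.

</span>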

In [6]:
!curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -o input.txt
# -o = output

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1089k  100 1089k    0     0  2566k      0 --:--:-- --:--:-- --:--:-- 2562k


<span style="color: blue;">

**Data download:** Fetches the Tiny Shakespeare dataset (~1M characters) from Karpathy's char-rnn repo. This is a subset of Shakespeare's works used for quick language modeling experiments.

</span>

In [7]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
from dataset import CharDataset
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 1115394 characters, 65 unique.


In [None]:
# my try: inspect the first training pair (indexing calls __getitem__)
x0, y0 = train_dataset[0]
print(x0)
print(y0)

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1])
tensor([47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44, 53,
        56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,  1,
        44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1, 57,
        54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,  6,
         1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47, 58,
        47, 64, 43, 52,

<span style="color: blue;">

**Loading the dataset:** Reads the raw text file and instantiates `CharDataset`. The output shows 1,115,394 characters and 65 unique characters (letters, punctuation, newlines, etc.). The dataset will produce `len(text) - block_size` training examples via overlapping windows.

</span>
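<span style="color: blue;">

As a sanity check on the printed tensors, the first few indices of `x0` and `y0` can be decoded by hand. The sketch below assumes the 65-character vocabulary consists of newline, space, the characters `!$&',-.3:;?`, and the upper- and lower-case letters. That list is an assumption inferred from the printed indices, not something shown in the notebook output.

```python
import string

# Assumed vocabulary (not printed in the notebook): 13 symbols + 52 letters = 65.
assumed_chars = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
itos = {i: ch for i, ch in enumerate(assumed_chars)}

# First 15 indices of x0 and y0, copied from the printed tensors.
x0_head = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0]
y0_head = [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14]

print(len(assumed_chars))
print(repr("".join(itos[i] for i in x0_head)))
print(repr("".join(itos[i] for i in y0_head)))
```

Under that assumption the input starts with the play's first speaker label, and the target is the same text shifted by one character.

</span>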

In [21]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

02/22/2026 17:06:44 - INFO - mingpt.model -   number of parameters: 2.535219e+07


<span style="color: blue;">

**Model architecture:** Creates a GPT model with `GPTConfig`:
- `vocab_size`, `block_size`: from the dataset
- `n_layer=8`, `n_head=8`, `n_embd=512`: 8 Transformer blocks, 8 attention heads, 512-dimensional embeddings. This yields ~25M parameters. The model uses learned positional embeddings and causal self-attention in each block.

</span>
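<span style="color: blue;">

The logged parameter count can be reproduced with some arithmetic. The sketch below assumes a GPT-2 style layout: separate key, query, value, and output projections with biases, a 4x-wide MLP, two LayerNorms per block plus a final one, a learned positional embedding, and an untied output head without bias. These layout details are assumptions about the model code rather than facts shown in this notebook, although the total they give agrees with the log line.

```python
vocab_size, block_size = 65, 128
n_layer, n_embd = 8, 512

def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd                       # scale and shift
attention = 4 * linear(n_embd, n_embd)        # key, query, value, output projection
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = 2 * layer_norm + attention + mlp

total = (
    vocab_size * n_embd                       # token embedding
    + block_size * n_embd                     # learned positional embedding
    + n_layer * block                         # the Transformer blocks
    + layer_norm                              # final LayerNorm
    + linear(n_embd, vocab_size, bias=False)  # output head
)
print("%e" % total)
```

Note that `n_head` does not appear: splitting the 512 dimensions across 8 heads changes how attention is computed, not how many weights there are.

</span>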

In [22]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=2, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 0: train loss 4.32059. lr 5.999999e-04:   0%|          | 1/2179 [04:23<159:14:34, 263.21s/it]


KeyboardInterrupt: 

<span style="color: blue;">

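**Training:** The `Trainer` handles the training loop. Key config:
- `max_epochs=2`: 2 full passes over the data
- `batch_size=512`: 512 sequences per batch, i.e. 512 × 128 = 65,536 characters per step
- `learning_rate=6e-4` with `lr_decay=True`: the peak learning rate, which is then decayed as training progresses
- `warmup_tokens`, `final_tokens`: the schedule is measured in tokens processed; the rate ramps up over the first `warmup_tokens` (512 × 20 = 10,240) and decays until `final_tokens` (two epochs' worth of characters)
- `num_workers=4`: data loading workers (set to 0 if multiprocessing fails locally)

The progress bar shows 2179 iterations per epoch, which is 1,115,266 samples divided by 512 and rounded up. In the run shown here the first iteration took about 263 seconds, and training was stopped with a `KeyboardInterrupt`.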

</span>
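<span style="color: blue;">

The numbers in the progress bar and the schedule arguments can be worked out directly. The sketch below computes the iterations per epoch and `final_tokens`, then evaluates a linear-warmup, cosine-decay learning-rate multiplier with a floor of 10%. The schedule function is an assumption about how `lr_decay=True` behaves (the name `lr_multiplier` and the 10% floor are not taken from this notebook), so treat it as an illustration rather than the trainer's exact code. Under these assumptions the value after the first batch agrees with the `lr 5.999999e-04` in the log.

```python
import math

num_samples = 1115394 - 128          # len(text) - block_size
batch_size, block_size = 512, 128
learning_rate = 6e-4
warmup_tokens = 512 * 20
final_tokens = 2 * num_samples * block_size

iters_per_epoch = math.ceil(num_samples / batch_size)
print(iters_per_epoch)               # matches the 2179 in the progress bar

def lr_multiplier(tokens_seen):
    """Assumed schedule: linear warmup, then cosine decay floored at 0.1."""
    if tokens_seen < warmup_tokens:
        return tokens_seen / max(1, warmup_tokens)
    progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))

# One batch already contains more tokens than the warmup, so the very first
# logged learning rate is on the cosine part of the curve, just below the peak.
tokens_after_first_batch = batch_size * block_size
print("%e" % (learning_rate * lr_multiplier(tokens_after_first_batch)))
print(learning_rate * lr_multiplier(final_tokens))  # rate at the end of the schedule
```

</span>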

In [10]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! that e'er this tongue of mine,
That laid the sentence of dread banishment
On yon proud man, should take it off again
With words of sooth! O that I were as great
As is my grief, or lesser than my name!
Or that I could forget what I have been,
Or not remember what I must be now!
Swell'st thou, proud heart? I'll give thee scope to beat,
Since foes have scope to beat both thee and me.

DUKE OF AUMERLE:
Northumberland comes back from Bolingbroke.

KING RICHARD II:
What must the king do now? must he submit?
The king shall do it: must he be deposed?
The king shall be contented: must he lose
The name of king? o' God's name, let it go:
I'll give my jewels for a set of beads,
My gorgeous palace for a hermitage,
My gay apparel for an almsman's gown,
My figured goblets for a dish of wood,
My sceptre for a palmer's walking staff,
My subjects for a pair of carved saints
And my large kingdom for a little grave,
A little little grave, an obscure grave;
Or I'll be buried in the king's hig

<span style="color: blue;">

**Sampling:** Autoregressive generation:
1. Encode the prompt "O God, O God!" to character indices and add a batch dimension (`[None, ...]`)
2. `sample()` repeatedly feeds the sequence to the model, takes the distribution over the next character, draws one index from it (`sample=True`), and appends it to the sequence; here it does this 2000 times
3. Decode the indices back to characters. `temperature` rescales the logits before the softmax (higher = more random); `top_k=10` restricts sampling to the 10 most likely characters, which trims unlikely continuations rather than adding diversity.

</span>
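<span style="color: blue;">

The loop described above can be sketched without any deep-learning library. In this minimal sketch the "model" is a hypothetical stand-in (`toy_logits`) that prefers the index following the last one; `sample_next` and `generate` are illustrative names and are not minGPT's `sample()` implementation. Cropping the context to `block_size` is assumed here, since the model cannot take inputs longer than its context length.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from softmax(logits / temperature), limited to the top_k logits."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # Mask everything below the k-th largest logit (ties at the cutoff are kept).
        scaled = [s if s >= cutoff else -math.inf for s in scaled]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # masked entries get weight 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for a model: favors (last index + 1) mod vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]

def generate(prompt, steps, vocab_size, block_size, **kwargs):
    seq = list(prompt)
    for _ in range(steps):           # one "forward pass" per generated index
        context = seq[-block_size:]  # keep only the most recent block_size indices
        seq.append(sample_next(toy_logits(context, vocab_size), **kwargs))
    return seq                       # the prompt followed by the generated indices

# With top_k=1 only the most likely index survives, so the draw is deterministic.
rng = random.Random(0)
print(generate([0, 1], steps=6, vocab_size=5, block_size=4, top_k=1, rng=rng))
```

Each generated index needs its own pass through the loop, which is the train/test asymmetry discussed in the `CharDataset` notes.

</span>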

In [11]:
# well that was fun