## Train a character-level GPT on some text data

The inputs here are plain text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and have it predict the text one character at a time.

In [2]:
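# set up logging
# logging is a standard-library module for recording log messages
# format: the layout of each log message, built from %(...)s placeholders
# datefmt: how the timestamp is displayed, here "month/day/year hour:minute:second"
# level: the logging threshold; logging.INFO records messages at INFO level and above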
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [3]:
# make deterministic
# seed the random number generators so that repeated runs are reproducible
from mingpt.utils import set_seed
set_seed(42)

In [4]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [5]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        # build two lookup tables
        # self.stoi (string to index): maps each character to a unique integer index
        # self.itos (index to string): maps each index back to its character
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size

        # vocab_size is the number of unique characters in chars
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    # return the training sample that starts at position idx
    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer index, stored in the list dix
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.

        """
        # dix[:-1] is every index except the last one, converted to a PyTorch long tensor
        x = torch.tensor(dix[:-1], dtype=torch.long)
        # dix[1:] is every index except the first one, i.e. the inputs shifted left by one position
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
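
As a rough illustration of the input/target arrangement described in the docstring, the sketch below rebuilds the "hello" example in plain Python, without torch. The helper name `shifted_pair` is hypothetical and is not part of minGPT.

```python
def shifted_pair(chunk):
    """Split a chunk of block_size + 1 characters into inputs x and targets y."""
    x = chunk[:-1]  # everything except the last character
    y = chunk[1:]   # everything except the first character
    return x, y

x, y = shifted_pair("hello")  # block_size = 4, so the chunk has 5 characters

# Each prefix of x is paired with the character that follows it.
examples = [(x[:i + 1], y[i]) for i in range(len(x))]
for prefix, target in examples:
    print(f"given {prefix!r}, predict {target!r}")
```

A single chunk therefore yields block_size training examples, which is the amortization the docstring refers to.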


In [6]:
block_size = 128 # spatial extent of the model for its context

In [7]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 1115394 characters, 65 unique.
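
To get a feel for the vocabulary bookkeeping, the sketch below repeats the `stoi`/`itos` construction on a tiny stand-in string and then works out how many training windows the reported sizes imply. The sample string is made up for illustration, and the batch count assumes the last partial batch is kept, which may not match the actual loader settings.

```python
import math

text = "to be or not to be"  # illustrative stand-in for input.txt
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encoding then decoding should give back the original text.
encoded = [stoi[ch] for ch in text]
decoded = "".join(itos[i] for i in encoded)
print(chars, decoded == text)

# Sizes reported in this notebook: one training window per start position.
data_size, block_size, batch_size = 1115394, 128, 512
num_windows = data_size - block_size                     # what __len__ returns
batches_per_epoch = math.ceil(num_windows / batch_size)  # assumes the last partial batch is kept
print(num_windows, batches_per_epoch)
```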


In [8]:
from mingpt.model import GPT, GPTConfig
# GPTConfig holds the hyperparameters of the GPT model
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

07/15/2024 22:48:26 - INFO - mingpt.model -   number of parameters: 2.535219e+07
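
The logged parameter count can be cross-checked by hand. The sketch below assumes a conventional GPT layout: learned token and position embeddings, two LayerNorms per block, four n_embd x n_embd attention projections with biases, a 4x-wide MLP, a final LayerNorm, and a bias-free output head. That layout is an assumption, since the notebook does not spell it out, and the helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def layernorm_params(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n

def count_gpt_params(vocab_size, block_size, n_layer, n_embd):
    per_block = (
        2 * layernorm_params(n_embd)           # one norm before attention, one before the MLP
        + 4 * linear_params(n_embd, n_embd)    # query, key, value, output projection
        + linear_params(n_embd, 4 * n_embd)    # MLP expansion
        + linear_params(4 * n_embd, n_embd)    # MLP projection back
    )
    embeddings = vocab_size * n_embd + block_size * n_embd
    head = layernorm_params(n_embd) + linear_params(n_embd, vocab_size, bias=False)
    return n_layer * per_block + embeddings + head

total = count_gpt_params(vocab_size=65, block_size=128, n_layer=8, n_embd=512)
print("%e" % total)
```

Under these assumptions the total is 25,352,192, which formats as 2.535219e+07 and agrees with the log line. Note that n_head does not enter the count, because the heads only partition the existing projections.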


In [9]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
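# TrainerConfig holds the training hyperparameters
# max_epochs=2: number of epochs, i.e. full passes over the training dataset
# batch_size=512: number of samples in each training batch
# learning_rate=6e-4: the learning rate, which controls the step size of weight updates
# lr_decay=True: enable learning-rate decay, gradually reducing the learning rate as training progresses
# warmup_tokens=512*20: number of tokens over which the learning rate is ramped up at the start of training
# final_tokens=2*len(train_dataset)*block_size: total number of tokens processed by the end of training (2 epochs)
# num_workers=4: number of parallel workers used to load data (worker processes in a PyTorch DataLoader)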
tconf = TrainerConfig(max_epochs=2, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()
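
One plausible reading of `warmup_tokens` and `final_tokens` is a linear warmup followed by a cosine decay, both measured in tokens processed. The function below is a sketch under that assumption, including the hypothetical floor of 10% of the base rate; the exact schedule inside `Trainer` is not shown in this notebook, and `lr_at` is a made-up name.

```python
import math

def lr_at(tokens, base_lr=6e-4, warmup_tokens=512 * 20,
          final_tokens=2 * 1115266 * 128, floor=0.1):
    """Learning rate after `tokens` tokens (assumed warmup + cosine decay)."""
    if tokens < warmup_tokens:
        # linear ramp from 0 up to base_lr
        mult = tokens / max(1, warmup_tokens)
    else:
        # cosine decay from base_lr down to floor * base_lr
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

# 1115266 is len(train_dataset): 1115394 characters minus block_size of 128.
for t in (0, 512 * 10, 512 * 20, 2 * 1115266 * 128):
    print(t, lr_at(t))
```

With these settings the warmup spans only 10,240 tokens, a tiny fraction of the roughly 285 million tokens in `final_tokens`, so almost all of training would run in the decay phase.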

In [10]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

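# the starting text for generation
context = "O God, O God!"

# torch.tensor(...) creates a 1-D tensor holding the index of each character in context
# indexing with [None, ...] inserts a new dimension at position 0 (the batch dimension);
# None marks where the new axis goes and ... keeps the remaining dimensions unchanged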
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)

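# call sample to generate text
# model is the trained GPT model
# x is the input context tensor
# 2000 is the number of characters to generate
# temperature=1.0 controls the randomness of sampling: higher values give more random text, lower values more conservative text
# sample=True draws the next character at random from the predicted distribution
# top_k=10 restricts each step to the 10 most probable characters
# [0] selects the first (and only) sequence in the batch; y holds the character indices of the context followed by the generated text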
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! that e'er this tongue of mine,
That laid the sentence of dread banishment
On yon proud man, should take it off again
With words of sooth! O that I were as great
As is my grief, or lesser than my name!
Or that I could forget
With Richmond, I'll tell you what I am,
The Lord Aumerle, .

CLAUDIO:
The prenzie Angelo!

ISABELLA:
O, 'tis the cunning livery of hell,
The damned'st body to invest and cover
In prenzie guards! Dost thou think, Claudio?
If I would yield him my virginity,
Thou mightst be freed.

CLAUDIO:
O heavens! it cannot be.

ISABELLA:
Yes, he would give't thee, from this rank offence,
So to offend him still. This night's the time
That I should do what I abhor to name,
Or else thou diest to-morrow.

CLAUDIO:
Thou shalt not do't.

ISABELLA:
O, were it but my life,
I'ld throw it down for your deliverance
As frankly as a pin.

CLAUDIO:
Thanks, dear Isabel.

ISABELLA:
Be ready, Claudio, for your death tomorrow.

CLAUDIO:
Yes. Has he affections
That profit us.

DUKE VIN
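
For intuition about `temperature` and `top_k`, here is a plain-Python sketch of drawing one token from a list of logits. It illustrates the general technique rather than the body of `mingpt.utils.sample`; the function name `sample_top_k` and the toy logits are made up for this example.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from softmax(logits / temperature), limited to the top_k logits."""
    scaled = [l / temperature for l in logits]
    # indices ordered from most to least likely
    indices = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # unnormalized softmax weights; subtracting the max keeps exp() stable
    m = max(scaled[i] for i in indices)
    weights = [math.exp(scaled[i] - m) for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

rng = random.Random(42)
logits = [2.0, 0.5, 1.5, -1.0, 0.0]
draws = [sample_top_k(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only the two highest-scoring indices can appear
```

In a full generation loop the chosen index would be appended to the context and the model run again, which is why producing 2000 characters takes 2000 forward passes.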

In [None]:
# well that was fun