# nanoGPT Code Review

- https://github.com/karpathy/nanoGPT

![](https://github.com/karpathy/nanoGPT/raw/master/assets/nanogpt.jpg)

In [None]:
!nvidia-smi

Mon Apr 24 17:40:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Setup

In [None]:
import math
import torch
import torch.nn as nn
from torch.nn import functional as F

from tqdm.auto import tqdm
from dataclasses import dataclass

torch.manual_seed(3655)


@dataclass
class Config:
    vocab_size: int = 65 # 50_304
    batch_size: int = 64
    block_size: int = 256 # what is the maximum context length for predictions?

    train_size: float = 0.8  # the remaining 0.2 is used for validation
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'

    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384
    dropout: float = 0.2
    bias: bool = True  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

## Downloading the data

In [None]:
# Download the dataset.
!wget -N https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-04-24 17:40:58--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


Last-modified header missing -- time-stamps turned off.
2023-04-24 17:40:59 (59.1 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(len(text))
print(text[:1_000])

1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for re

In [None]:
# Collect the unique characters in the text and sort them (by Unicode code point)
chars = sorted(set(text))
print("".join(chars))

# The vocabulary size is the number of unique characters
vocab_size = len(chars)
Config.vocab_size = vocab_size
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [None]:
# s2i maps each character to its integer index
s2i = {ch: i for i, ch in enumerate(chars)}

# i2s maps each integer index back to its character
i2s = {i: ch for i, ch in enumerate(chars)}

# encode: string -> list of integer indices
encode = lambda s: [s2i[c] for c in s]

# decode: list of integer indices -> string
decode = lambda l: ''.join([i2s[i] for i in l])
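
The character-level tokenizer above can be tried on a tiny corpus. This is a minimal sketch: `sample_text` and the `sample_*` names are hypothetical stand-ins so the notebook's own `s2i`, `i2s`, `encode` and `decode` are not overwritten.

```python
# Toy corpus; the notebook uses the full Tiny Shakespeare text instead.
sample_text = "hello world"

# Build the vocabulary the same way as above: unique characters, sorted.
sample_chars = sorted(set(sample_text))
sample_s2i = {ch: i for i, ch in enumerate(sample_chars)}
sample_i2s = {i: ch for i, ch in enumerate(sample_chars)}


def sample_encode(s):
    """Map a string to a list of integer indices."""
    return [sample_s2i[c] for c in s]


def sample_decode(ids):
    """Map a list of integer indices back to a string."""
    return "".join(sample_i2s[i] for i in ids)


ids = sample_encode("hello")
print(sample_chars)        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ids)                 # [3, 2, 4, 4, 5]
print(sample_decode(ids))  # hello
```

Encoding a character that is not in the vocabulary raises a `KeyError`, so prompts must only use characters that occur in the training text.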

In [None]:
# Encode the whole text as a tensor of integer indices
data = torch.tensor(encode(text), dtype=torch.long)

# Show the shape and dtype of the data
print(data.shape, data.dtype)

# Show the first 1,000 characters; this is how they look to GPT
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

## Train/validation split

In [None]:
# Split the data into training and validation sets
n = int(Config.train_size * len(data))

train_data = data[:n]
val_data = data[n:]

In [None]:
# x holds the first Config.block_size tokens of the training data
x = train_data[:Config.block_size]

# y holds the same window shifted one position to the right
y = train_data[1:Config.block_size+1]

# Show the first 10 (context, target) pairs
for t in range(10):

    # The context is the first t+1 tokens of x
    context = x[:t+1]

    # The target is the token that follows the context
    target = y[t]

    # Print the target that belongs to this context
    print(f"When the input is {context}, the target is {target}")

When the input is tensor([18]), the target is 47
When the input is tensor([18, 47]), the target is 56
When the input is tensor([18, 47, 56]), the target is 57
When the input is tensor([18, 47, 56, 57]), the target is 58
When the input is tensor([18, 47, 56, 57, 58]), the target is 1
When the input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
When the input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]), the target is 47
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47]), the target is 64


In [None]:
def get_batch(split):
    # Generate a small batch of inputs x and targets y.
    # Use the training data when split is 'train', otherwise the validation data.
    data = train_data if split == 'train' else val_data

    # Sample a random start index for each sequence in the batch
    idx = torch.randint(high=len(data) - Config.block_size, size=(Config.batch_size,))

    # x is a batch of sequences of length block_size
    x = torch.stack([data[i:i+Config.block_size] for i in idx])  # [batch_size, block_size]

    # y is the same as x, shifted by one position
    y = torch.stack([data[i+1:i+Config.block_size+1] for i in idx])  # [batch_size, block_size]
    
    x, y = x.to(Config.device), y.to(Config.device)
    return x, y

In [None]:
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

inputs:
torch.Size([64, 256])
tensor([[43,  1, 58,  ..., 50, 63,  6],
        [50,  1, 58,  ..., 43,  1, 41],
        [57,  1, 42,  ..., 47, 41, 46],
        ...,
        [46, 43, 57,  ...,  1, 57, 53],
        [56,  1, 46,  ...,  1, 58, 46],
        [ 0, 27, 56,  ..., 32, 10,  0]], device='cuda:0')
targets:
torch.Size([64, 256])
tensor([[ 1, 58, 46,  ..., 63,  6,  1],
        [ 1, 58, 46,  ...,  1, 41, 39],
        [ 1, 42, 47,  ..., 41, 46, 39],
        ...,
        [43, 57, 58,  ..., 57, 53,  1],
        [ 1, 46, 43,  ..., 58, 46, 43],
        [27, 56,  1,  ..., 10,  0, 20]], device='cuda:0')
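
The relation between inputs and targets is easier to see on a toy sequence. The sketch below repeats the sampling logic of `get_batch` with hypothetical names (`toy_data`, `toy_block_size`, `toy_batch_size`) and uses `torch.arange` as stand-in data, so each target is the input value plus one.

```python
import torch

torch.manual_seed(0)

toy_data = torch.arange(100)  # stand-in for the encoded text
toy_block_size = 8
toy_batch_size = 4

# randint's upper bound is exclusive, so the largest start index is
# len(toy_data) - toy_block_size - 1 and the shifted slice still fits.
starts = torch.randint(high=len(toy_data) - toy_block_size, size=(toy_batch_size,))
x = torch.stack([toy_data[i:i + toy_block_size] for i in starts])
y = torch.stack([toy_data[i + 1:i + toy_block_size + 1] for i in starts])

# Every target row is the input row shifted left by one position.
print(torch.equal(x[:, 1:], y[:, :-1]))
print(x.shape, y.shape)
```

Each row therefore provides `block_size` training examples at once: position `t` of `y` is the next token for the context `x[:t+1]`.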


## Implementing the nanoGPT model

In [None]:
def new_gelu(x):
    """
    Implementation of the GELU activation function currently in the Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))


class LayerNorm(nn.Module):
    """ LayerNorm with an optional bias; PyTorch's LayerNorm does not offer a simple bias=False option. """

    def __init__(self, ndim, bias):
        super().__init__()
        # learnable scale (weight) of the normalization
        self.weight = nn.Parameter(torch.ones(ndim))
        # optional bias
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        # apply layer normalization over the last dimension
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, eps=1e-5)


class CausalSelfAttention(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, computed in one batched Linear
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # Flash Attention speeds up attention on the GPU, but is only supported in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask that ensures attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))
    
    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding size (n_embd)

        # compute query, key, values for all heads in the batch and move the head dimension next to the batch dimension
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # [B, n_head, T, head_size]
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # [B, n_head, T, head_size]
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # [B, n_head, T, head_size]

        # causal self-attention; self-attend: [B, n_head, T, head_size] x [B, n_head, head_size, T] -> [B, n_head, T, T]
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            # (dropout_p is not tied to train/eval mode, so it is switched off explicitly outside training)
            y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0.0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v  # [B, n_head, T, T] x [B, n_head, T, head_size] -> [B, n_head, T, head_size]
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y


class MLP(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        # linear layer that expands the input to 4x its size
        self.c_fc = nn.Linear(config.n_embd, config.n_embd * 4, bias=config.bias)
        # linear layer that projects back to the input size
        self.c_proj = nn.Linear(config.n_embd * 4, config.n_embd, bias=config.bias)
        # GELU non-linearity
        self.act = new_gelu
        # dropout
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        # expand, then apply the GELU non-linearity
        h = self.act(self.c_fc(x))
        # project back to the original size
        h2 = self.c_proj(h)
        # apply dropout to the output
        return self.dropout(h2)


class Block(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        # Causal Self-Attention
        self.attn = CausalSelfAttention(config)
        # Layer Normalization
        self.ln_1 = LayerNorm(config.n_embd, config.bias)
        # MLP
        self.mlp = MLP(config)
        # Layer Normalization
        self.ln_2 = LayerNorm(config.n_embd, config.bias)
    
    def forward(self, x):
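        # apply multi-head attention with a residual connection
        x = x + self.attn(self.ln_1(x))  # as in the original nanoGPT code (LayerNorm before attention)
        # x = x + self.ln_1(self.attn(x))
        # apply the MLP with a residual connection
        x = x + self.mlp(self.ln_2(x))  # as in the original nanoGPT code (LayerNorm before the MLP)
        # x = x + self.ln_2(self.mlp(x))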
        return x


class GPT(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config
    
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, config.bias)
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)    
        self.transformer.wte.weight = self.lm_head.weight

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is {self.config.block_size}"
        
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)  # shape [1, t]

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape [b, t, n_embd]
        pos_emb = self.transformer.wpe(pos)  # positional embeddings of shape [1, t, n_embd]
        x = self.transformer.drop(tok_emb + pos_emb)  # [b, t, n_embd]
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)  # [b, t, n_embd]

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x) # [b, t, vocab_size]
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :])  # [b, 1, vocab_size]  note: using list [-1] to preserve the time dim
            loss = None
        
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=0.7, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long, we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)  # [b, 1, vocab_size], since only the last position is projected at inference
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution to get the next index
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat([idx, idx_next], dim=1)
        
        return idx
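
The two attention paths in `CausalSelfAttention` are meant to compute the same result. The sketch below checks this on small random tensors, assuming PyTorch >= 2.0 is installed; the sizes are arbitrary illustration values, and dropout is disabled so the outputs are comparable.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, head_size = 2, 3, 5, 4  # arbitrary toy sizes
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# Manual path: scaled scores, causal mask, softmax, weighted sum of values.
mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)
att = (q @ k.transpose(-2, -1)) / math.sqrt(head_size)
att = att.masked_fill(mask == 0, float("-inf"))
att = F.softmax(att, dim=-1)
y_manual = att @ v

# Fused path (requires PyTorch >= 2.0), with dropout switched off.
y_fused = F.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True
)

print(torch.allclose(y_manual, y_fused, atol=1e-5))
# Row t of the attention matrix puts zero weight on positions after t.
print(att[0, 0])
```

Masking with `-inf` before the softmax gives future positions exactly zero weight, while each row still sums to one.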

## Training

In [None]:
# Instantiate the model
model = GPT(config=Config)
model.to(device=Config.device)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(65, 384)
    (wpe): Embedding(256, 384)
    (drop): Dropout(p=0.2, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=384, out_features=1152, bias=True)
          (c_proj): Linear(in_features=384, out_features=384, bias=True)
          (attn_dropout): Dropout(p=0.2, inplace=False)
          (resid_dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_1): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=384, out_features=1536, bias=True)
          (c_proj): Linear(in_features=1536, out_features=384, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_2): LayerNorm()
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=384, out_features=65, bias=False)
)
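
The printed module lists `wte` and `lm_head` separately, but `GPT.__init__` assigns `lm_head.weight` to `wte.weight`, so both use one shared matrix. A minimal sketch of that weight tying, with hypothetical toy sizes (`toy_vocab`, `toy_embd`):

```python
import torch
import torch.nn as nn

toy_vocab, toy_embd = 10, 4
wte = nn.Embedding(toy_vocab, toy_embd)
lm_head = nn.Linear(toy_embd, toy_vocab, bias=False)

# Same assignment as in GPT.__init__: both modules now hold the same Parameter.
wte.weight = lm_head.weight

# parameters() yields a shared Parameter only once.
holder = nn.ModuleDict({"wte": wte, "lm_head": lm_head})
n_unique = sum(p.numel() for p in holder.parameters())

print(wte.weight is lm_head.weight)  # True
print(n_unique)                      # 40
```

The tying works because `nn.Embedding(vocab, embd)` and `nn.Linear(embd, vocab)` both store their weight with shape `[vocab, embd]`.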

In [None]:
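# Count the trainable parameters
num_params = 0

for parameter in model.parameters():
    # numel() counts every element; len() would only return the size of the first dimension
    num_params += parameter.numel()

print(num_params)

10770816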
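
The number of parameter elements can also be worked out by hand from the layer shapes in the printed module. The sketch below is plain arithmetic, assuming the `Config` values used in this notebook (`vocab_size=65`, `block_size=256`, `n_layer=6`, `n_embd=384`, `bias=True`) and counting the tied `wte`/`lm_head` matrix once; the `linear` helper is hypothetical.

```python
vocab_size, block_size, n_layer, n_embd = 65, 256, 6, 384


def linear(n_in, n_out, bias=True):
    """Number of elements in a Linear layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * n_embd  # weight + bias

per_block = (
    layer_norm                      # ln_1
    + linear(n_embd, 3 * n_embd)    # attn.c_attn
    + linear(n_embd, n_embd)        # attn.c_proj
    + layer_norm                    # ln_2
    + linear(n_embd, 4 * n_embd)    # mlp.c_fc
    + linear(4 * n_embd, n_embd)    # mlp.c_proj
)

total = (
    vocab_size * n_embd     # wte, shared with lm_head (counted once)
    + block_size * n_embd   # wpe
    + n_layer * per_block   # transformer blocks
    + layer_norm            # ln_f
)

print(per_block)  # 1774464
print(total)      # 10770816
```

Almost all of the elements sit in the six transformer blocks; the character-level embedding table is small because the vocabulary has only 65 entries.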


In [None]:
# Run a forward pass to get the logits and the loss
logits, loss = model(xb, yb)

# Show the shape of the logits and the loss
print(logits.shape)
print(loss)

# Show text generated by the still untrained model
generated_text = model.generate(idx=torch.zeros((1, 1,), dtype=torch.long, device=Config.device), max_new_tokens=100)[0].tolist()
decoded_text = decode(generated_text)
print(decoded_text)

torch.Size([64, 256, 65])
tensor(4.2784, device='cuda:0', grad_fn=<NllLossBackward0>)

X?zFZSSeksx'3 SS
NCF'sy?'Bn$LLWrSINosW!hZnF''w&LCjZ!X$HNXVVLlpn!vhKTh;3.'BbMjn3NYM:jRTl;jO-GXh!yxZbc
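
A useful reference point for the initial loss: a model that assigns equal probability to each of the 65 characters has a cross-entropy of ln(65). The sketch below computes that value; the loss of 4.2784 reported above is close to it, which is what one would expect from a randomly initialised model.

```python
import math

vocab_size = 65

# Cross-entropy of a uniform prediction: -log(1 / vocab_size) = log(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)

print(round(uniform_loss, 4))  # 4.1744
```

An initial loss far above this value would suggest a problem with the weight initialisation or the data pipeline.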


In [None]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Run 10,000 training steps (EPOCH counts optimizer steps here, not passes over the data)
EPOCH = 10_000
for steps in tqdm(range(EPOCH)):

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Reset the gradients
    optimizer.zero_grad(set_to_none=True)

    # Evaluate the loss
    logits, loss = model(xb, yb)

    # Backpropagate
    loss.backward()

    # Update the parameters
    optimizer.step()

# Show the loss of the last training batch
print(loss.item())

0.6047207117080688


In [None]:
# Save a checkpoint: model and optimizer state plus some additional information
PATH = "model.pth"

torch.save({
    'epoch': EPOCH,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),
}, PATH)

In [None]:
# checkpoint = torch.load(PATH)
# model.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
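
The save/load round trip can be tried on a small stand-in model. In this sketch `toy_model`, `toy_optimizer` and `toy_path` are hypothetical names, an `nn.Linear` replaces the GPT model, and a temporary directory keeps it from touching `model.pth`.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 2)  # hypothetical stand-in for the GPT model
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.pth")
    torch.save({
        "model_state_dict": toy_model.state_dict(),
        "optimizer_state_dict": toy_optimizer.state_dict(),
    }, toy_path)

    # map_location lets a checkpoint saved on a GPU load on a CPU-only machine.
    checkpoint = torch.load(toy_path, map_location="cpu")

# Restore into freshly constructed objects of the same shape.
restored_model = nn.Linear(4, 2)
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer = torch.optim.AdamW(restored_model.parameters(), lr=1e-3)
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

print(torch.equal(restored_model.weight, toy_model.weight))
```

A state dict stores only tensors and hyperparameters, so the model class has to be constructed with the same configuration before `load_state_dict` is called.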

## Text generation

In [None]:
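# Switch to evaluation mode so that dropout is disabled while sampling
model.eval()

# Show text generated by the model, starting from a single token with index 0 (the newline character)
generated_text = model.generate(idx=torch.zeros((1, 1), dtype=torch.long, device=Config.device), max_new_tokens=500)[0].tolist()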

decoded_text = decode(generated_text)
print(decoded_text)


PAULINA:
There is a passing meless, methinks, I am
Enforced to confess it. But, methinks,
For, I mean to yourself--

LEONTES:
Stay me lady you,
Put sound to the base is dishonour'd of my banishment.
I would deep were all king, but, that, if
You be envied, and I would not leave
The vaultice of making me: indeed, wereful beguns,
Of what I was your quarrel doing, I would not king;
Thy father had been high committed in you
Even too as egg recompense: 'tis his precise
By each one to be alight and wee
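
The `generate` method divides the logits by `temperature` and can optionally keep only the `top_k` largest logits before sampling. The sketch below shows both effects on a hand-picked logit vector; the values are illustrative assumptions, not model output.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])  # illustrative values

# A temperature below 1 sharpens the distribution; above 1 it flattens it.
probs_default = F.softmax(logits / 1.0, dim=-1)
probs_sharp = F.softmax(logits / 0.7, dim=-1)

# top_k=2: every logit below the 2nd-largest one is set to -inf before the softmax.
top_k = 2
filtered = logits.clone()
v, _ = torch.topk(filtered, min(top_k, filtered.size(-1)))
filtered[filtered < v[:, [-1]]] = -float("inf")
probs_top_k = F.softmax(filtered, dim=-1)

print(probs_default)
print(probs_sharp)
print(probs_top_k)
```

With the default `temperature=0.7` used in this notebook, sampling favours the most likely characters more strongly than the raw softmax would.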


In [None]:
# Show text generated by the model, starting from a prompt

prompt = "Hello, "
generated_text = model.generate(idx=torch.tensor([encode(prompt)], device=Config.device), max_new_tokens=500)[0].tolist()

decoded_text = decode(generated_text)
print(decoded_text)

Hello, jest thee this,
And be the wish'd and twenty before night.

STANLEY:
So, my lord, at out--

KING RICHARD II:
The true war from him: here's four out was now,
Which, like a bonny spurpose, not love.

HENRY BOLINGBROKE:
Hath his verges, and his words
Requite ith the this cause of your suit?

STrike Richard:
What noble prevail befell'd his fear?--and for it be prothen;
Nor never be stong by any cander regreeth:
'Tis shame can no leave but till I aim'd thee,
But the nothing done, not art a Montague.

