# GPT-2

## GPT 系列模型介绍

在 Transformer 模型中，使用了 Encoder-Decoder 架构来进行语言建模：本质上，是一种 有 Seq2Seq 有监督学习。 通过监督数据所训练的模型其难以在其他场景泛化。

有监督学习的缺陷在于：数据成本昂贵，语言数据标签通常需要人工编写或标注。这是所有深度学习模型都会面临的问题。

无监督学习正是从根本性上解决大规模模型对训练数据的需求。比如世界知识里，纯文本（如书本）是远多于（问答数据）的。从纯文本中进行预训练（Pre-trained），语言模型能学习到通用文本自身的表征（而不是从监督数据获取），语言模型所习得的文本处理能力能在**通用**下游任务泛化。

- GPT-1 提出了 Decoder-only 的 “预训练” 学习目标。并在其他阶段使用有监督学习（supervised learning, SFT).
- GPT-2 验证了通用文本预训练的模型，可以在不做SFT基础下，通过 Zero-Shot learning 解决 NLP 应用任务
- GPT-3 训练 175B 预训练模型，提出 “上下文学习（In-Context Learning, ICL）” 范式，在多个任务上取得良好表现。 GPT-3 是预训练模型发展最关键的转折，也是打败 Encoder-only 和 Encoder-Decoder Transformer 架构的关键工作。

GPT 系列模型的关键在于两种关键技术的配合：

1. Decoder-Transformer： 其模型架构中的 self-masked attention 天然可以**并行**做“自回归学习”
2. Casual Language Modeling： 因果语言建模是构造序列数据学习的方法，通过历史信息预测下一个时刻信息。
  
CLM 本质是 **构造一个简单的（无须人工label）通用的学习任务，将外在知识转化为内在参数**, 同时 Decoder-Transformer 参数能够 Scaling-up。需要注意到，多数模型都可以做 CLM 如：RNN、LSTM 和 Transformer。 特别是 Transformer 其本质是 **带了编码器表征** 的自回归模型。

在学习 GPT 系列模型的重要意义在于

1. GPT 验证了大规模预训练模型的可行性。（Ilya 尝试字符RNN预训练）
2. GPT 具有极强的泛化性，是其能够扩展众多下游任务的一个关键特性。
3. GPT 模型能够 Scaling-Up（参数规模提升，性能提升）


本 Notebook 实现以下关键技术：

1. Casual Language Modeling(CLM)
3. GPT2-Config
4. GPTModel
5. Dummy Block Dataset
6. Train
7. Inference

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
torch.manual_seed(42)

<torch._C.Generator at 0x10dc48db0>

## Casusal Language Modeling

- GPT 与 Transformer 中去除 encode-decode 的 Decoder 结构相同。
- Decoder 本身就是在做 因果语言建模。

In [2]:
vocab_size = 100
batch_size = 1
seq_len = 10 

x = torch.randint(vocab_size, (batch_size, seq_len))
print(x) # 给定序列数据

tensor([[42, 67, 76, 14, 26, 35, 20, 24, 50, 13]])


In [3]:
# 定义一个参数模型学习：

class CLMModel(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embd = nn.Embedding(vocab_size, dim)
        self.w = nn.Linear(dim, dim)
        self.LM_head = nn.Linear(dim, vocab_size)
        self.softmax = nn.Softmax(dim = -1)

    def forward(self, x):
        '''
            - x dim[batch,seq]
        '''
        h = self.embd(x)
        h = self.w(h)
        logits = self.LM_head(h)
        prob = self.softmax(logits)
        return logits, prob

model = CLMModel(vocab_size = vocab_size,
         dim = 64)

y_logits, y_prob = model(x)

print(x.shape) # # batch_size, seq_len
print(y_logits.shape) # batch_size, seq_len, vocab_size
print(y_prob.shape) # batch_size, seq_len, vocab_size

torch.Size([1, 10])
torch.Size([1, 10, 100])
torch.Size([1, 10, 100])


In [4]:
# 基于 输入序列 创建 next-token 标签。

print(x)
print(x[:,1:].tolist(), x[:,0].tolist()) # 默认循环左移

label = torch.zeros_like(x)
label[:, :-1] = x[:, 1:]
label[:, -1] = x[:, 0]
print(label)

tensor([[42, 67, 76, 14, 26, 35, 20, 24, 50, 13]])
[[67, 76, 14, 26, 35, 20, 24, 50, 13]] [42]
tensor([[67, 76, 14, 26, 35, 20, 24, 50, 13, 42]])


In [5]:
# 因果建模
for i in range(batch_size):
    for j in range(seq_len - 1):
        print('input seq \t\t', x[i, :j+1].tolist())
        print('next token label\t', x[0, j+1].tolist())

input seq 		 [42]
next token label	 67
input seq 		 [42, 67]
next token label	 76
input seq 		 [42, 67, 76]
next token label	 14
input seq 		 [42, 67, 76, 14]
next token label	 26
input seq 		 [42, 67, 76, 14, 26]
next token label	 35
input seq 		 [42, 67, 76, 14, 26, 35]
next token label	 20
input seq 		 [42, 67, 76, 14, 26, 35, 20]
next token label	 24
input seq 		 [42, 67, 76, 14, 26, 35, 20, 24]
next token label	 50
input seq 		 [42, 67, 76, 14, 26, 35, 20, 24, 50]
next token label	 13


In [6]:
# loss

y_logits, y_prob = model(x)
b, s, v = y_logits.shape
loss_fn = nn.CrossEntropyLoss()

# 有 bs * seq_len 个 next-token-prediction 预测任务
loss = loss_fn(y_logits.reshape( b*s, v), 
               label.reshape(b*s)) 
print(loss)
loss.backward()

tensor(4.5059, grad_fn=<NllLossBackward0>)


以上示例准确来说是：自回归下一个词元学习。那么这里明明有 label， 为什么仍然说 GPT-2 是无监督学习呢？

- 无监督是不需要标签， 自监督是通过本身的数据即是输入，也是标签。那么 “自监督” 是更准确的，自监督是无监督的一种特例
- 自回归是通过历史数据预测下一个时刻数据。自回归是自监督的一种具体学习模式。

讨论：

上一章节的机器翻译的Transformer 是否是无监督学习？

## GPT2-Config

In [7]:
from dataclasses import dataclass

@dataclass
class GPT2Config:
    learning_rate: float = 0.001
    # src_vocab_size: int = 100
    # trg_vocab_size: int = 200
    vocab_size: int = 64 # 各个语种都在一个词表里
    max_len: int = 512
    dim: int = 512
    heads: int = 8
    num_layers: int = 6
    position_encoding_base: float = 10000.0
    pad_token_id: int = 0
    # src_pad_token_id: int = 0
    # trg_pad_token_id: int = 0
    attention_bias: bool = False
    initializer_range: float = 0.02
    embd_pdrop: float = 0.1

In [8]:
config = GPT2Config(
    vocab_size = vocab_size
)
print(config)

GPT2Config(learning_rate=0.001, vocab_size=100, max_len=512, dim=512, heads=8, num_layers=6, position_encoding_base=10000.0, pad_token_id=0, attention_bias=False, initializer_range=0.02)


## GPTModel

1. input layer
2. Attention
3. decoder block
4. output layer
5. GPT-2 model

GPT-2 去除 model 中 `nn.Linear` 层的 `bias` 项, 并将 LayerNorm 前置

### Input Layer

In [9]:
class GPT2InputLayer(nn.Module):
    """
    词向量 + 位置编码
    """

    def __init__(self, vocab_size=100, dim=512, max_len=1024, base=10000.0, embd_pdrop=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.max_len = max_len

        theta_ids = torch.arange(0, dim, 2)  # 0, 2, 4, ..., 512
        theta = 1 / (base ** (theta_ids / dim))
        pe = torch.zeros(dim)  # 512, sin( theta_0 ),cos( theta_0), ...
        pe[theta_ids] = theta
        pe[theta_ids+1] = theta

        position_ids = torch.arange(0, max_len)  # 0, 1, 2, ..., 1024
        self.PE = torch.outer(position_ids, pe)  # 1024 x 512

        self.PE[:, theta_ids] = torch.sin(self.PE[:, theta_ids])
        self.PE[:, theta_ids+1] = torch.sin(self.PE[:, theta_ids+1])

        # self.embd_pdrop = embd_pdrop
        # self.drop = nn.Dropout(embd_pdrop)

    def forward(self, input_ids):
        """
        嵌入向量 + 绝对位置编码(标准实现)
        """
        bs, seq_len = input_ids.shape
        X = self.embedding(input_ids)
        PE = self.PE[:seq_len, :]
        X_ = X + PE
        # X_ = self.drop(X_)
        return X_

## Decoder

```
    |
    |------->|
layernorm_1  |
    |        |
attention    |
    |<-------|
    |------->|
layernorm_2  |
    |        |
feedforward  |
    |<-------|
    |
    V
```

## Attention

### Decoder Attention mask

In [10]:
## 设计 masked-self attention 的 mask
data = [
    [1, 2, 4, 5, ],
    [5, 2, 1, 8, 7, 10, 32],
]
mask = torch.zeros(2, 7) 
for i in range(2):
    mask[i, :len(data[i])] = 1
print(mask)

tensor([[1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1.]])


In [11]:
tril_mask = torch.tril(torch.ones(7, 7)) 
print(tril_mask)

tensor([[1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1.]])


In [12]:
def get_add_mask(mask, tril_mask, neg_inf = -100000.0):
    bs, seq_len = mask.shape
    batch_mask = torch.zeros(bs, seq_len, seq_len)
    for i in range(bs):
        batch_mask[i, ] = torch.outer( mask[i,:], mask[i,:])
        batch_mask[i, ] *= tril_mask
    add_mask = (1 - batch_mask ) * neg_inf
    return add_mask

add_mask = get_add_mask(mask, tril_mask)
print(add_mask)

tensor([[[     -0., -100000., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0.,      -0., -100000., -100000., -100000.],
         [-100000., -100000., -100000., -100000., -100000., -100000., -100000.],
         [-100000., -100000., -100000., -100000., -100000., -100000., -100000.],
         [-100000., -100000., -100000., -100000., -100000., -100000., -100000.]],

        [[     -0., -100000., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0.,      -0., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0.,      -0.,      -0., -100000., -100000.],
         [     -0.,      -

In [13]:
# 去除循环实现
def get_add_mask(mask, tril_mask, neg_inf = -100000.0):
    batch_mask = mask.unsqueeze(2) * mask.unsqueeze(1) * tril_mask
    add_mask = (1 - batch_mask ) * neg_inf
    return add_mask

add_mask = get_add_mask(mask, tril_mask)
print(add_mask)

tensor([[[     -0., -100000., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0.,      -0., -100000., -100000., -100000.],
         [-100000., -100000., -100000., -100000., -100000., -100000., -100000.],
         [-100000., -100000., -100000., -100000., -100000., -100000., -100000.],
         [-100000., -100000., -100000., -100000., -100000., -100000., -100000.]],

        [[     -0., -100000., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0., -100000., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0., -100000., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0.,      -0., -100000., -100000., -100000.],
         [     -0.,      -0.,      -0.,      -0.,      -0., -100000., -100000.],
         [     -0.,      -

In [14]:
# multi-head attention score
S = torch.randn(2, 8, 7, 7) # head 8
S += add_mask[:,None,:,:]
P = F.softmax(S, dim = -1)
print(P[0, 0, :, :])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.9884, 0.0116, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2721, 0.4983, 0.2296, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4810, 0.3868, 0.0911, 0.0411, 0.0000, 0.0000, 0.0000],
        [0.0611, 0.3105, 0.1586, 0.0232, 0.1378, 0.1346, 0.1742],
        [0.0464, 0.4581, 0.1123, 0.0241, 0.0945, 0.0676, 0.1970],
        [0.4624, 0.3804, 0.0484, 0.0103, 0.0126, 0.0307, 0.0552]])


## Multi-head attention

In [15]:
class MultiHeadScaleDotProductAttention(nn.Module):
    def __init__(self, dim_in, dim_out, heads=8):
        super().__init__()
        # self.WQ = nn.Linear(dim_in, dim_out)
        # self.WK = nn.Linear(dim_in, dim_out)
        # self.WV = nn.Linear(dim_in, dim_out)
        self.dim_out = dim_out
        self.WQKV = nn.Linear(dim_in, dim_out*3) # 存在 bias, 与 wq/wk/wv 独立变换不等价
        self.WO = nn.Linear(dim_in, dim_out)
        self.heads = heads
        self.head_dim = dim_out // self.heads

    def forward(self, X_Q, X_K, X_V, mask=None):
        bs, seq_len, dim = X_Q.shape
        # Q = self.WQ(X_Q)
        # K = self.WK(X_K)
        # V = self.WV(X_V)
        QKV = self.WQKV(X_Q)
        Q, K, V = QKV.split(dim = 2, split_size = self.dim_out)

        # 拆分维度
        Q_h = Q.reshape(bs, seq_len, self.heads, self.head_dim).transpose(1, 2)
        K_h = K.reshape(bs, seq_len, self.heads, self.head_dim).transpose(1, 2) 
        V_h = V.reshape(bs, seq_len, self.heads, self.head_dim).transpose(1, 2)

        # Scaled-dot product attention
        S = Q_h @ K_h.transpose(2, 3) / math.sqrt(self.head_dim)
        if mask is not None:
            S = S + mask[:, None, :, :]
        P = torch.softmax(S, dim=-1)  # 行 softmax
        Z = P @ V_h
        Z = Z.transpose(1, 2).reshape(bs, seq_len, dim)
        output = self.WO(Z)

        return output

## LayerNorm

In [16]:
class LayerNorm(nn.Module):
    def __init__(self, dim, ):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.epsilon = 1e-8

    def forward(self, X, ):
        mu = X.mean(dim=-1, keepdim=True)
        var = X.var(dim=-1, keepdim=True)
        X_hat = (X - mu) / torch.sqrt(var + self.epsilon)
        Y = X_hat * self.gamma + self.beta
        return Y

## FFN

### GELU

In [34]:
def GELU(x):
    # read lecture/lc3_gpt/GELU.ipynb 
    cdf = 0.5 * (1.0 + torch.tanh( math.sqrt(2.0 / torch.pi) 
                                  * (x + 0.044715 * torch.pow(x, 3))))
    return x * cdf

In [18]:
class FeedForwardNetwork(nn.Module):
    def __init__(self, dim, ):
        super().__init__()
        self.dim = dim
        self.W_up = nn.Linear(self.dim, 4 * self.dim)
        # self.ReLU = nn.ReLU()
        self.W_down = nn.Linear(4 * self.dim, self.dim)

    def forward(self, X):
        X_ = GELU(self.W_up(X))
        Y = self.W_down(X_)
        return Y

## Decoder blocks

In [19]:
class GPT2DecoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = MultiHeadScaleDotProductAttention(dim, dim, heads)
        self.ln1 = LayerNorm(dim)
        self.ffn = FeedForwardNetwork(dim)
        self.ln2 = LayerNorm(dim)

    def forward(self, X, mask=None):
        '''
        Pre-Normaliztion
        '''
        X_ln = self.ln1(X)
        X_attn = self.attn(X_ln, X_ln, X_ln, mask = mask)
        X = X + X_attn

        X_ln = self.ln2(X)
        X_ffn = self.ffn(X_ln)
        X = X + X_ffn
        return X
        
    def forward_postnorm(self, X, mask=None):
        '''
        Post-Normaliztion
        '''
        X_attn = self.attn(X, X, X, mask = mask)
        X_ln = self.ln1(X_attn)
        X = X + X_ln

        X_ffn = self.ffn(X)
        X_ln = self.ln2(X_ffn)
        X = X + X_ln
        return X

## Model

In [22]:
class GPT2Model(nn.Module):
    def __init__(self, config:GPT2Config=None):
        super().__init__()
        self.config = config
        self.embd = GPT2InputLayer(vocab_size=self.config.vocab_size, 
                                   dim=self.config.dim, 
                                   max_len=self.config.max_len)
        self.decoder = nn.ModuleList(
            [GPT2DecoderBlock(self.config.dim, 
                              self.config.heads) for _ in range(self.config.num_layers)]
        )
        self.ln = LayerNorm(self.config.dim)
        self.lm_head = nn.Linear(self.config.dim, 
                                 self.config.vocab_size,
                                 bias = False) # 不学习预训练数据分布偏置

        self.cache_mask = torch.tril(torch.ones(self.config.max_len, 
                                                self.config.max_len)) 

        self._init_weights(self) # 增加初始化

    def forward(self, x):
        bs, seq_len = x.shape
        add_mask = get_add_mask(x, self.cache_mask[:seq_len, :seq_len])
        
        X = self.embd(x)
        for block in self.decoder:
            X = block(X, mask = add_mask)
        X = self.ln(X)
        logits = self.lm_head(X)
        
        return logits
    
    def _init_weights(self, module):
        """
        ref: src/transformers/models/gpt2/modeling_gpt2.py
        """
        if isinstance(module, (nn.Linear)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

        for name, p in module.named_parameters():
            if name == "WO.weight" :
                # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
                p.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.num_layers)))

In [23]:
model = GPT2Model(config)
model(x).shape

torch.Size([1, 10, 100])

## Pretrained Dataset

In [24]:
import random
random.randrange(120)

3

In [25]:
## raw data

batch_size = 2
seq_len = 128
vocab_size = config.vocab_size

input_ids = torch.randint(low=1, high=config.vocab_size, size=(batch_size, seq_len))
print(-random.randrange(seq_len))

for i in range(batch_size):
    input_ids[i,-random.randrange(seq_len):] = config.pad_token_id
print(input_ids)

label = torch.ones_like(input_ids) * config.pad_token_id
label[:,:seq_len-1] = input_ids[:,1:]
print(label)

-20
tensor([[13, 65, 69, 42, 73, 68, 61, 31, 50, 61, 34,  5, 59, 60, 66, 41, 50, 99,
         39, 85, 30,  4, 18, 36, 86, 67,  7, 56, 38,  5, 24, 60, 89, 75, 32, 41,
         68, 10, 92, 60, 73, 38, 95, 72, 48,  2, 42, 32, 13, 65,  4, 77, 74, 27,
         70, 43, 95, 62, 35,  1, 82, 19,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0],
        [ 3, 10, 59, 36, 39,  6,  8, 32, 48, 47, 13, 66, 85, 25, 42, 21, 40, 60,
         38, 34, 79,  1, 49, 95, 45, 58, 71, 16, 17, 31, 85, 33, 27, 85, 65, 16,
         84,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
      

## Train

In [26]:
import torch.optim as optim

config = GPT2Config(
    vocab_size = vocab_size
)

print(config.vocab_size)
model.embd

model = GPT2Model(config)
optimizer = optim.Adam(model.parameters(), lr = config.learning_rate)
loss_fn = nn.CrossEntropyLoss(ignore_index = config.pad_token_id)

logits = model.embd(input_ids)

optimizer.zero_grad()
logits = model(input_ids)
loss = loss_fn(logits.reshape(batch_size * seq_len, vocab_size), 
               label.reshape(batch_size * seq_len))
loss.backward()
optimizer.step()

100


## Inference

1. 最基本的推理是“greedy search”， 那么 inference 的阶段，为什么要用“search”？
2. LLM 的本质是 概率生成模型，哪里可以体现生成过程的随机性？随机性是否合理？

## greedy search

In [27]:
def generation(
    model = None,
    input_ids: torch.tensor = None,
    max_new_token: int = 100,
):
    for i in range(max_new_token):
        model.eval()
        with torch.no_grad():
            logits = model(input_ids)
        logits = logits[:, -1, :] # zhe
        probs = F.softmax(logits, dim = -1)
        next_token_idx = torch.argmax(probs, dim=-1, keepdim=True)
        input_ids = torch.cat( [input_ids, next_token_idx], dim = -1 )
    return input_ids

# logits = model(input_ids)
result = generation(model, input_ids, max_new_token = 5)
print(result)

tensor([[13, 65, 69, 42, 73, 68, 61, 31, 50, 61, 34,  5, 59, 60, 66, 41, 50, 99,
         39, 85, 30,  4, 18, 36, 86, 67,  7, 56, 38,  5, 24, 60, 89, 75, 32, 41,
         68, 10, 92, 60, 73, 38, 95, 72, 48,  2, 42, 32, 13, 65,  4, 77, 74, 27,
         70, 43, 95, 62, 35,  1, 82, 19,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0, 85, 60, 38, 95, 60],
        [ 3, 10, 59, 36, 39,  6,  8, 32, 48, 47, 13, 66, 85, 25, 42, 21, 40, 60,
         38, 34, 79,  1, 49, 95, 45, 58, 71, 16, 17, 31, 85, 33, 27, 85, 65, 16,
         84,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0

## left padding

批量数据推理，用 left padding， 其生成时， 位置序列是连续的，符合预训练的建模。

批量数据：

- 训练时：左右 padding 都可以
- 推理时：左padding

非批量数据，不用考虑padding 

In [28]:
## raw data
batch_size = 2
seq_len = 24
vocab_size = config.vocab_size
input_ids = torch.randint(low=1, high=config.vocab_size, size=(batch_size, seq_len))

for i in range(batch_size):
    input_ids[i,: seq_len - random.randrange(seq_len)] = config.pad_token_id
print(input_ids)

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1]])


In [29]:
result = generation(model, input_ids, max_new_token = 5)
print(result)

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79, 60, 38, 95, 60, 38],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 49, 95, 45, 58, 71]])


## temperature

为什么要采样？

1. 语言模型本身是概率模型， 在预测过程，实际上是：**预测下一个词元概率**， 我们只是采用不同的采样方法根据**概率**选择一个**合理**的离散词元
2. 为什么要有温度，仅改变分布是否更加 smooth 或 sharpe
3. 如何理解输出分布，当温度$t\neq1$，输出分布发生了改变，只是改变的程度不同。

举例：

给定 10 个球，3个蓝球，2个黑球，5个红球。那么这三个球的概率为:

1. 蓝:0.3
2. 黑:0.2
3. 红:0.5

类比预测 next-token prediction，则从一个箱子取一个球，对应的颜色，即是 next-token。 那么词表大小可以类比球色。在实现上可以用：`torch.multinomial`

采样生成有什么优势？ 

1. greedy search 为什么是 “search”， 可以认为模型能够输出任何文本（如无限猴子理论），search 只是一种规则
2. search 规则可以是确定性的（greedy search）也可以是随机性的(do sample)
3. 我们如果以"联合概率和"最大为最优搜索， 我们难以搜索到全局最优生成文本

In [30]:
def generation(
    model = None,
    input_ids: torch.tensor = None,
    max_new_token: int = 100,
    temperature: float = 1.0, 
):
    for i in range(max_new_token):
        
        model.eval()
        with torch.no_grad():
            logits = model(input_ids)
            
        logits = logits[:, -1, :] / temperature 
        probs = F.softmax(logits, dim = -1)
        # next_token_idx = torch.argmax(probs, dim=-1, keepdim=True)
        next_token_idx = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat( [input_ids, next_token_idx], dim = -1 )
    return input_ids

# logits = model(input_ids)
for i in range(3):
    result = generation(model, input_ids, max_new_token = 5)
    print(f'loop({i}):', result)

loop(0): tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79, 79, 60, 13, 65,  4],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 33, 65,  5, 59, 79]])
loop(1): tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79, 47, 95, 38, 95, 22],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 42, 10, 21, 60, 10]])
loop(2): tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79, 19, 26, 27, 95, 56],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 37, 85, 30, 35, 84]])


## top-k

In [31]:
def generation(
    model = None,
    input_ids: torch.tensor = None,
    max_new_token: int = 100,
    temperature: float = 1.0,
    top_k: int = 100, 
):
    top_k = min(top_k, model.config.vocab_size)
    
    for i in range(max_new_token):
        
        model.eval()
        with torch.no_grad():
            logits = model(input_ids)
            
        logits = logits[:, -1, :] / temperature 
        probs = F.softmax(logits, dim = -1)
        value, idx = torch.topk(probs, k = top_k, dim = -1)

        new_logits = torch.ones_like(probs) * -100000.0 
        new_logits[:,idx] = logits[:,idx] 
        
        probs = F.softmax(new_logits, dim = -1)
        next_token_idx = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat( [input_ids, next_token_idx], dim = -1 )
    return input_ids

# logits = model(input_ids)
for i in range(3):
    result = generation(model, input_ids, max_new_token = 5, top_k = 10)
    print(f'loop({i}):', result)

loop(0): tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79, 95, 73, 38, 34, 32],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 85,  5, 59, 73, 68]])
loop(1): tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79,  1, 49, 60, 38, 13],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 82,  1, 31, 95, 45]])
loop(2): tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79, 60, 38,  5, 59, 74],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 85, 85, 60, 66, 31]])


## top-p

top-k 存在截断问题，且生成过程中每个时刻的分布差异大，故K难以确定。

如果用概率和就能避免上述问题。

关键实现在于：

1. 对 vocab prob 降序排序
2. 用 `cumsum` 累计概率和, 并且找到第一个超过阈值 p 的位置 m
3. 取 m~l 位置概率设置为 0， 并**映射**回原词表位置
4. 对概率归一化
5. 用 multinominal 采样

In [32]:
# debug
logits = torch.randn(2, 8)
probs = F.softmax(logits, dim = -1)

sorted_probs, sorted_indices = torch.sort(probs, descending=True)
topp_probs = sorted_probs.clone()

cumsum_probs = torch.cumsum(sorted_probs, dim=-1)
for i in range(2):
    for j in range(vocab_size):
        if cumsum_probs[i,j] > 0.95:
            sorted_probs[i, j:] = 0
            topp_probs[i, sorted_indices[i,:].tolist()] = sorted_probs[i, :]
            break
print(topp_probs)

tensor([[0.0841, 0.0000, 0.1266, 0.0845, 0.3021, 0.2462, 0.0783, 0.0000],
        [0.1198, 0.1051, 0.3100, 0.1333, 0.1548, 0.0000, 0.0904, 0.0000]])


In [33]:
def generation(
    model = None,
    input_ids: torch.tensor = None,
    max_new_token: int = 100,
    top_p: float = 0.95,
    temperature: float = 1.1,
):
    top_p = min(top_p, 1.0)
    bs, seq_len = input_ids.shape
    
    for i in range(max_new_token):
        
        model.eval()
        with torch.no_grad():
            logits = model(input_ids)
            
        logits = logits[:, -1, :] / temperature 
        probs = F.softmax(logits, dim = -1)

        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumsum_probs = torch.cumsum(sorted_probs, dim=-1)

        topp_probs = sorted_probs.clone()
        for i in range(bs):
            # for j in range(model.config.vocab_size):
            idx = torch.where(cumsum_probs[i,:] > top_p)
            sorted_probs[i, idx[0][0]:] = 0 # 首个累计和大于 p 的位置
            topp_probs[i, sorted_indices[i,:].tolist()] = sorted_probs[i, :]
            
        probs = topp_probs / topp_probs.sum(dim = -1).unsqueeze(dim = 1)
        next_token_idx = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat( [input_ids, next_token_idx], dim = -1 )
    return input_ids

result = generation(model, input_ids, max_new_token = 5, top_p = 0.95)
print(result)

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 30,
         87,  7, 21, 96, 39, 79,  9, 97, 31, 36,  7],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 88, 56, 56,  4,  4, 82, 98, 19, 34, 55,
         87, 46, 72, 46, 84,  1, 60, 25, 21, 43, 18]])


当 Top-K 和 Top-P 同时存在时，实现顺序是怎么样的？

1. 先 top-k 粗筛， 再 top-p 精筛。 top-p 需要排序和cumsum，top-k 能减少 vocab 候选规模。
2. softmax 发生在 top-k 之前。 

## 其他

1. batch 数量对推理效率的影响
2. inference 分什么阶段
3. inference 有哪些存在重复计算
4. 为什么说语言模型是生成模型
5. 语言模型为什么可以多种输出
6. llm 为什么会产生重复生成现象，有什么方法可以避免吗？
7. 什么是 KV-cache，如何实现，kv-cache 加速的本质是什么？

## Reference

[GPT2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

[GPT2 code](https://github.com/openai/gpt-2)

[GPT2 by Transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py)