# Assignment 3: Computing PPL with a Pretrained Language Model
Name: 薛翔元
Student ID: 521030910387

In [1]:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## Loading the Model and Tokenizer

In [3]:
model_path = "./gpt2"
model = GPT2LMHeadModel.from_pretrained(model_path).to(device)
model.eval()
tokenizer = GPT2TokenizerFast.from_pretrained(model_path)

### Tokenizer

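The following example shows how to use the tokenizer and the model. Understanding it may help with your course project.

The tokenizer splits a sentence into tokens and maps each token to an integer, which is that token's id in the vocabulary.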

In [4]:
inputs = tokenizer("""GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion.""", return_tensors="pt")
inputs

{'input_ids': tensor([[   38, 11571,    12,    17,   318,   257,  6121,   364,  2746,  2181,
         13363,   319,   257,   845,  1588, 35789,   286,  3594,  1366,   287,
           257,  2116,    12, 16668, 16149,  6977,    13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]])}

Token ids can be mapped back to their corresponding tokens:

In [5]:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)

['G', 'PT', '-', '2', 'Ġis', 'Ġa', 'Ġtransform', 'ers', 'Ġmodel', 'Ġpret', 'rained', 'Ġon', 'Ġa', 'Ġvery', 'Ġlarge', 'Ġcorpus', 'Ġof', 'ĠEnglish', 'Ġdata', 'Ġin', 'Ġa', 'Ġself', '-', 'super', 'vised', 'Ġfashion', '.']


The `decode` method converts token ids back into the original sentence:

In [6]:
decoded_string = tokenizer.decode([38, 11571, 12, 17, 318, 257, 6121, 364, 2746, 2181, 13363, 319, 257, 845, 1588, 35789, 286, 3594, 1366, 287, 257, 2116, 12, 16668, 16149, 6977, 13])
print(decoded_string)

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion.


### GPT2

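GPT2 is an autoregressive language model: it predicts the next token from the tokens that precede it.

Feeding the token ids above into GPT2 yields, at every position, a score for each vocabulary token as the next token.

The logits returned by GPT2 form a 3-D tensor: the first dimension is the batch size, the second is the number of tokens, and the third is the vocabulary size.

> Note: GPT2 outputs logits, which must be passed through softmax to obtain an actual probability distribution.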

In [7]:
input_ids = inputs.input_ids.to(device)
with torch.no_grad():
    logits = model(input_ids).logits
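print(logits.shape) # batch size, sequence length, vocabulary size
print(logits[0, 0, :]) # logits at the first position (they predict the second token); softmax turns them into a probability distribution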

torch.Size([1, 27, 50257])
tensor([-31.8240, -31.4345, -33.4860,  ..., -39.5280, -38.9087, -31.8361],
       device='cuda:0')
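
As a minimal sketch of the softmax step mentioned in the note above, the following pure-Python example converts a small vector of logits into probabilities. The four logit values are hypothetical stand-ins for a 4-token vocabulary, not real GPT2 outputs. In the notebook itself the same step would be `torch.softmax(logits, dim=-1)`.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 4-token vocabulary (assumption, for illustration only).
toy_logits = [-31.8, -31.4, -33.5, -39.5]
probs = softmax(toy_logits)
print(probs)       # one probability per vocabulary entry
print(sum(probs))  # sums to 1 up to floating-point error
```

Only the differences between logits matter: adding the same constant to every logit leaves the probabilities unchanged, which is why large negative values like those printed above are not a problem.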


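## Computing Perplexity (PPL)

Perplexity (PPL) is a standard evaluation metric for language models. It measures how well the model predicts a given text: the lower the PPL, the higher the probability the model assigns to that text.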

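For a text of $N$ tokens $w_1, \dots, w_N$, it is defined as:
$$
PPL = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1,w_2,\dots,w_{i-1})}}
$$
It is usually rewritten in logarithmic form:
$$
PPL = \exp\left(\frac{1}{N}\sum_{i=1}^{N}-\log P(w_i \mid w_1,w_2,\dots,w_{i-1})\right)
$$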
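
The two forms can be checked against each other numerically. The sketch below uses four hypothetical conditional probabilities (an assumption for illustration, not model output) and evaluates both expressions.

```python
import math

# Hypothetical values of P(w_i | w_1, ..., w_{i-1}) for a 4-token text.
token_probs = [0.5, 0.25, 0.125, 0.125]
n = len(token_probs)

# Product form: N-th root of the product of inverse probabilities.
ppl_product = math.prod(1.0 / p for p in token_probs) ** (1.0 / n)

# Log form: exponential of the mean negative log-probability.
ppl_log = math.exp(sum(-math.log(p) for p in token_probs) / n)

print(ppl_product, ppl_log)  # the two values agree up to floating-point error
```

The log form is usually preferred in practice because the product of many small probabilities underflows for long texts, while a sum of log-probabilities does not.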

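This section implements PPL computation for GPT2. Because the first token has no preceding context in the input, the implementation below averages over the remaining $N-1$ predicted tokens.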

In [8]:
from torch.nn import Softmax, CrossEntropyLoss


def calculate_ppl(model, text):
    ## TODO: first convert the text into input tokens (7 points)
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    # Get the model output
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids)
        logits = outputs.logits
        labels = input_ids.to(logits.device)
        # At each position GPT2 predicts the next token, so the labels are shifted left by one
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        ## TODO: compute the model's PPL on text from the logits and labels (8 points)
        ## Hint: apply Softmax to get the probabilities and follow the formula above
        ## Hint 2: CrossEntropyLoss gives an equivalent computation
        loss_fn = CrossEntropyLoss(reduction='mean')
        loss = loss_fn(shift_logits.squeeze(0), shift_labels.squeeze(0))
        ppl = torch.exp(loss).item()
        # A plain but equivalent loop-based method:
        # log_prob_sum = 0
        # probs = Softmax(dim=-1)(shift_logits)
        # for i in range(len(probs[0])):
        #     log_prob_sum += torch.log(probs[0, i, shift_labels[0, i]])
        # ppl = torch.exp(-log_prob_sum / len(probs[0])).item()
    return ppl
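
The shift-and-average logic in `calculate_ppl` can be sanity-checked without a model. The sketch below is a pure-Python stand-in (the helper `toy_ppl`, the 5-token vocabulary, and the token ids are all hypothetical) that mirrors `shift_logits` / `shift_labels` and the mean cross-entropy. With uniform logits every token has probability 1/5, so the PPL should come out at approximately the vocabulary size.

```python
import math

def toy_ppl(logits_per_position, token_ids):
    """PPL of token_ids, where logits_per_position[i] scores the token at position i + 1."""
    # Drop the last position's logits (nothing follows it) and the first label
    # (nothing precedes it), mirroring shift_logits / shift_labels.
    shift_logits = logits_per_position[:-1]
    shift_labels = token_ids[1:]
    nll = 0.0
    for logits, label in zip(shift_logits, shift_labels):
        # Cross-entropy with a one-hot target: log-sum-exp of the logits minus the target logit.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        nll += log_z - logits[label]
    # Mean over the predicted tokens, then exponentiate.
    return math.exp(nll / len(shift_labels))

# Uniform logits over a hypothetical 5-token vocabulary.
vocab_size = 5
token_ids = [3, 0, 4, 1]
uniform_logits = [[0.0] * vocab_size for _ in token_ids]
print(toy_ppl(uniform_logits, token_ids))  # approximately vocab_size
```

This gives a useful reference point: a PPL equal to the vocabulary size means the model is no better than guessing uniformly, and a PPL near 1 means it predicts almost every token with certainty.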

## Testing

In [9]:
text1 = "GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks."
text2 = "Until the rocket ship nearly imploded. On Nov. 17, OpenAI's nonprofit board of directors fired Altman, without warning or even much in the way of explanation. The surreal maneuvering that followed made the corporate dramas of Succession seem staid. Employees revolted. So did OpenAI's powerful investors; one even baselessly speculated that one of the directors who defenestrated Altman was a Chinese spy. The company's visionary chief scientist voted to oust his fellow co-founder, only to backtrack. Two interim CEOs came and went. The players postured via selfie, open letter, and heart emojis on social media. Meanwhile, the company's employees and its board of directors faced off in “a gigantic game of chicken,” says a person familiar with the discussions. At one point, OpenAI's whole staff threatened to quit if the board didn't resign and reinstall Altman within a few hours, three people involved in the standoff tell TIME. Then Altman looked set to decamp to Microsoft—with potentially hundreds of colleagues in tow. It seemed as if the company that catalyzed the AI boom might collapse overnight."

print(calculate_ppl(model, text1))
print(calculate_ppl(model, text2))

68.2030258178711
46.45439910888672
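
Since PPL is the exponential of the mean negative log-likelihood, the two printed values can be mapped back to an average loss per predicted token. The short sketch below does that conversion, in nats and in bits; the dictionary keys are just labels chosen here for the two test strings.

```python
import math

# PPL values printed by the test cell above.
ppl_values = {"text1": 68.2030258178711, "text2": 46.45439910888672}

for name, ppl in ppl_values.items():
    nll_nats = math.log(ppl)            # mean negative log-likelihood per predicted token
    nll_bits = nll_nats / math.log(2)   # the same quantity expressed in bits
    print(f"{name}: {nll_nats:.3f} nats/token, {nll_bits:.3f} bits/token")
```

The value in nats is the number that `CrossEntropyLoss` returns inside `calculate_ppl` before `torch.exp` is applied.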


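(TODO: Experiment summary)

> Starting from the PPL formula, we apply the following transformation
>
> $$\begin{aligned} \text{PPL} &= \exp \left( - \frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, w_2, \dots, w_{i-1}) \right) \\ &= \exp \left( - \frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp \left( \text{logits}(w_i) \right)}{\sum_{j=1}^M \exp \left( \text{logits}(v_j) \right)} \right) \\ &= \exp \left( \frac{1}{N} \sum_{i=1}^{N} - \log \text{Softmax} \left( \text{logits} (w_i) \right) \right) \\ &= \exp \left( \frac{1}{N} \sum_{i=1}^{N} \text{CrossEntropy} \left( \text{logits} (w_i) \mid \text{onehot}(w_i) \right) \right) \end{aligned}$$
>
> Here $M$ is the vocabulary size, $v_1, \dots, v_M$ are the vocabulary tokens, and $\text{logits}(\cdot)$ is the logit assigned to a token at the position that predicts $w_i$.
>
> The inner term can therefore be computed directly with `CrossEntropyLoss`, and taking the exponential of the mean gives the PPL.
>
> In addition, the commented-out code shows a naive loop-based implementation. Both methods give the same result, which validates the implementation.
>
> On the two test passages GPT2 reaches a relatively low PPL (about 68.2 and 46.5), which suggests that it models this kind of English text well.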