# Benchmark

LLM 具有解决通用任务的能力，其性能测评涵盖广泛如 NLP、知识、数学、代码、写作、长上下文能力和推理等任务。

需要区分任务和能力：

1. 任务：具体需要解决的场景，如文本分类
2. 能力：更加抽象的模型内在特性。如上下文理解、推理能力

能力的提升，会带来特定任务性能的表现。 对于 NLP 模型到 LLM 的能力测评差异，语言模型从具体任务测评 发展到 **通用** 能力测评。

能力是通过任务测评来给予一种客观的度量，如测试模型的数学推理能力， 则会使用一些奥数竞赛题作为测评。

这种客观度量模型能力的方式称为 “基准（benchmark）测试”。

在 GPT-3 预训练模型中，通过 few-shot prompting 在多任务上测评，发现语言模型的涌现能力。对于分类任务：

- {example问题1}{example答案1}{问题输入}，回答：

这种方式让模型 generation 出回答，再对回答进行匹配。翻译任务需要判别预测序列与标签结果，这种类型的测评方案其答案主观，benchmark 一般会把主观问题转化成客观检验形式。

## MMLU 知识测评

MMLU 是多个学科的选择题数据集，选择题形式便于检验答案，测评结果体现模型的“知识”能力。

[例如](https://huggingface.co/datasets/cais/mmlu/viewer/abstract_algebra/test?row=0) ：



| question     | Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. |
| ------------ | ------------------------------------------------------------ |
| choices      | [ "0", "4", "2", "6" ]                                       |
| answer       | B                                                            |
| **question** | **Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of `<p>` in S_5.** |
| choices      | ['8', '2', '24', '120']                                      |
| answer       | C                                                            |



对于 abstract algebra 学科，可以结构化来组织完整测评数据集。

LLM 如何测评？

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
torch.manual_seed(42)

<torch._C.Generator at 0x1217bcd70>

## Format

增加例子, 将目标问题序列化成开口问题

In [2]:
def template_mmlu( example, question, choices):
    task_example = 'Predict choice <example>' + example + '<\example>' 
    input_question = '<question>' + question + '<\question>'
    input_choices = '<choices> (A)' + choices[0] + ' (B)' + choices[1] + ' (C)' + choices[2]+ ' (D)' + choices[3] + '<\choices>'
    return task_example + input_question + input_choices + 'answer:' 

example = 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:B'

question = 'Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.'
choices = ['8', '2', '24', '120']

prompt = template_mmlu(example, question, choices)
print(prompt)

Predict choice <example>Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:B<\example><question>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.<\question><choices> (A)8 (B)2 (C)24 (D)120<\choices>answer:


## dummy tokenizer and model

In [3]:
# tokenizer
class SimplestTokenizer:
    def __init__(self, text):
        tokens = list(text)
        self.vocab = {}
        self.vocab_reverse = {}
        idx = 0
        for i in tokens:
            if i not in self.vocab:
                self.vocab[i] = idx
                self.vocab_reverse[idx] = i
                idx += 1
    def encode(self, text, return_pt = False):
        tokens = list(text)
        token_ids = [ self.vocab[token] for token in tokens]
        if return_pt:
            token_ids = torch.tensor(token_ids, dtype = torch.long).unsqueeze(0)
        return token_ids
    def decode(self, ids):
        token_list = [self.vocab_reverse[idx] for idx in ids]
        text = ''.join(token_list)
        return text

tokenizer = SimplestTokenizer(prompt)
input_ids = tokenizer.encode(prompt, return_pt=True)

In [4]:
class MyLLM(nn.Module):
    def __init__(self, vocab_size = 100):
        super().__init__()
        self.dim = 32
        self.embd = nn.Embedding(vocab_size, self.dim)
        self.w0 = nn.Linear(self.dim, self.dim)
        self.w1 = nn.Linear(self.dim, self.dim)
        self.lm_head = nn.Linear(self.dim, vocab_size)
    def forward(self, x):
        e = self.embd(x)
        e = torch.sin(self.w0(e).mean(dim = 1, keepdim=True)) + self.w1(e)
        logits = self.lm_head(e)
        return logits
        
vocab_size = len(tokenizer.vocab)
model = MyLLM(vocab_size = vocab_size)
model(input_ids)

tensor([[[ 0.1993, -0.4260,  0.3250,  ..., -0.6383, -0.2980,  0.2777],
         [-0.2143,  0.1726,  0.0864,  ..., -0.4307,  0.1291,  0.6233],
         [ 0.2139, -0.4257,  0.8965,  ...,  0.3269, -0.4121,  0.3931],
         ...,
         [ 0.2139, -0.4257,  0.8965,  ...,  0.3269, -0.4121,  0.3931],
         [-0.2143,  0.1726,  0.0864,  ..., -0.4307,  0.1291,  0.6233],
         [ 0.1097,  0.5253, -0.1027,  ..., -0.0012,  0.0870,  0.5525]]],
       grad_fn=<ViewBackward0>)

## 生成式测评

生成一个token作为选项，生成时可能产生非选项 ABCD 的答案

1. 直接判别生成 token id 与 label id 是否一致， 若不一致则为预测错误
2. 基于预测分布的 ABCD token 概率

### 直接判别

In [5]:
def generate(model, x, max_new_tokens = 20):
    for i in range(max_new_tokens):
        logits = model(x)[:, -1, :]
        new_token = torch.argmax(logits, dim = -1)
        x = torch.cat((x, new_token.unsqueeze(1)), dim = 1)
    return x, new_token, logits

_, new_token, logits = generate(model, input_ids, max_new_tokens=1) #只预测一个 token
print('predict argmax next-token id:',new_token)
print('predict answer:',tokenizer.decode(new_token.tolist()))
print('next-token prediction logits shape', logits.shape)
print('tokenizer vocab_size:', len(tokenizer.vocab))

label_id = tokenizer.encode(text = 'C')
print('label token id:', label_id)
print('prediction correctness:',new_token.tolist()[0] == label_id[0])

predict argmax next-token id: tensor([3])
predict answer: d
next-token prediction logits shape torch.Size([1, 49])
tokenizer vocab_size: 49
label token id: [37]
prediction correctness: False


### 概率判别

In [6]:
token_id_a = tokenizer.encode(text = 'A')[0]
token_id_b = tokenizer.encode(text = 'B')[0]
token_id_c = tokenizer.encode(text = 'C')[0]
token_id_d = tokenizer.encode(text = 'D')[0]

print(token_id_a, token_id_b, token_id_c, token_id_d)

p = F.softmax(logits[0, :], dim = 0)
print(p[token_id_a].item())
print(p[token_id_b].item())
print(p[token_id_c].item())
print(p[token_id_d].item())

# 答案为概率最大的 D

33 35 37 38
0.02757927030324936
0.010749058797955513
0.020917844027280807
0.03135909140110016


## PPL 判别

In [7]:
prompt_a = prompt + 'A'
prompt_b = prompt + 'B'
prompt_c = prompt + 'C'
prompt_d = prompt + 'D'

print(prompt_c)

Predict choice <example>Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:B<\example><question>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.<\question><choices> (A)8 (B)2 (C)24 (D)120<\choices>answer:C


In [19]:
a = torch.randn(2,3)
b = torch.tensor([[2],[1]], dtype = torch.long)
print(a)
a.gather(index = b, dim = 1)

tensor([[-1.0448, -0.0402, -0.3794],
        [ 0.9052,  0.3824,  1.4473]])


tensor([[-0.3794],
        [ 0.3824]])

In [8]:
input_ids = tokenizer.encode(prompt_c, return_pt=True)
# print(input_ids)
_, seq_len = input_ids.shape
      
logits = model(input_ids)
print(input_ids.shape)
print(logits.shape)

p = F.softmax(logits, dim = -1)
p_next_token = p[0,:-1,:].gather(index = input_ids[0, 1:, None], dim = 0)
print(p_next_token.shape)

PPL = -p_next_token.log().mean()
print(PPL)

torch.Size([1, 292])
torch.Size([1, 292, 49])
torch.Size([291, 1])
tensor(3.7690, grad_fn=<NegBackward0>)


In [9]:
def mmlu_ppl(prompt, tokenizer, model):
    input_ids = tokenizer.encode(prompt, return_pt=True)
    logits = model(input_ids)
    p = F.softmax(logits, dim = -1)
    p_next_token = p[0,:-1,:].gather(index = input_ids[0, 1:, None], dim = 0)
    PPL = -p_next_token.log().mean()
    return PPL.item()

print(mmlu_ppl(prompt_a, tokenizer, model))
print(mmlu_ppl(prompt_b, tokenizer, model))
print(mmlu_ppl(prompt_c, tokenizer, model))
print(mmlu_ppl(prompt_d, tokenizer, model))

# 最小PPL为答案

3.768345832824707
3.767855405807495
3.769028663635254
3.767888069152832


## PPL 答案

对与选项 PPL，本身model有预测偏置。 以选项内容作为答案，再计算 PPL， 能够减少预测偏置

In [10]:
# example 的答案 (B)-> 4
example = 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:4'

question = 'Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.'
choices = ['8', '2', '24', '120']

prompt = template_mmlu(example, question, choices)
# print(prompt)

prompt_list = []
for choice in choices:
    prompt_list.append( prompt + choice )
print(prompt_list)  

['Predict choice <example>Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:4<\\example><question>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.<\\question><choices> (A)8 (B)2 (C)24 (D)120<\\choices>answer:8', 'Predict choice <example>Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:4<\\example><question>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.<\\question><choices> (A)8 (B)2 (C)24 (D)120<\\choices>answer:2', 'Predict choice <example>Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. <choices> (A)0 (B)4 (C)2 (D)6 answer:4<\\example><question>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.<\\question><choices> (A)8 (B)2 (C)24 (D)120<\\choices>answer:24', 'Predict choice <example>Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) 

In [11]:
result = [ mmlu_ppl(prompt, tokenizer, model) for prompt in prompt_list ]
print(result)
idx = torch.argmin(torch.tensor(result)).item()
print(idx) # 0:A, 1:B, 2:C, 3:D

[3.770315647125244, 3.7718098163604736, 3.774035692214966, 3.7709386348724365]
0


## 指标描述

- pass@N 指回答采样（N>1时，需要sample，非greedy search生成）数量 N 次，至少有一次答对了。比如让模型写代码，写了100次，至少有一次通过了，也算成功。
- cot@N 给定的示例有 N 个， 即 few-shot learning 中 exmaple的数量
- Magority@N 投票，采样的 4 个答案中（A,B,A,C），有2个为 A即是最高出现频次，则答案为 A。投票结果也可结合概率计算（A:0.4,B:0.55,A:0.6,C:0.3）,此时 A平均为0.5，小于B。

最严格的是 zero-shot greedy-search 结果给出 pass@1, cot@1。 但语言模型是随机模型，因此采样 / few-shot leanring，仍然能体现出其"预测分布"的准确性

## 总结

1. benchmark 的数据集设置中，其答案是容易被 check 的。 对于主观类问题，如果能转化为 客观类 问题形式，则容易进行测评
2. 本文给出 gen选项、PPL选项、PPL答案三种方式，更难的方式是 gen 答案。
3. 对比模型性能时需要看清测评的条件：采样数量、cot数量、...
4. 另外常用的典型测评方式为：LLM-As-a-judge、数学答案规则判别、人工判别
5. pretrained模型其生成内容不稳定，但通常来说 pretrained 模型有较完整的知识分布，性能保留最好。 而微调操作，则“强化”一部分能力，必然产生“遗忘现象”（遗忘程度不一）导致其他能力下降（跷跷板现象）
6. pretrained内部能力全面但不显化（如你掌握动态规划编程技巧，但是面试时无法回答hard 难度代码），而各种微调、ICL技术，则能显化/引出模型的推导结果（ICL如面试官给你提示如迭代方程，你根据提示能编写完整代码）
7. 微调后的模型一般采用生成式的判别