# Estimating performance

In this notebook, we show how to obtain estimates for LLM performance by combining anchor points and IRT.

## Preparing data

Loading packages

In [1]:
import numpy as np
import pickle
from tqdm import tqdm
from irt import *
from utils import *

random_state = 42



The leaderboard dataset we will use is composed of six scenarios (sub-datasets):
1. TruthfulQA
1. GSM8K
1. Winogrande
1. ARC
1. HellaSwag
1. MMLU

MMLU is further divided into sub-scenarios (e.g., abstract algebra, anatomy, etc.). Let's check the scenarios and sub-scenarios:

In [2]:
scenarios

{'harness_truthfulqa_mc_0': ['harness_truthfulqa_mc_0'],
 'gsm8k': ['harness_gsm8k_5'],
 'winogrande': ['harness_winogrande_5'],
 'arc': ['harness_arc_challenge_25'],
 'hellaswag': ['harness_hellaswag_10'],
 'mmlu': ['harness_hendrycksTest_abstract_algebra_5',
  'harness_hendrycksTest_anatomy_5',
  'harness_hendrycksTest_astronomy_5',
  'harness_hendrycksTest_business_ethics_5',
  'harness_hendrycksTest_clinical_knowledge_5',
  'harness_hendrycksTest_college_biology_5',
  'harness_hendrycksTest_college_chemistry_5',
  'harness_hendrycksTest_college_computer_science_5',
  'harness_hendrycksTest_college_mathematics_5',
  'harness_hendrycksTest_college_medicine_5',
  'harness_hendrycksTest_college_physics_5',
  'harness_hendrycksTest_computer_security_5',
  'harness_hendrycksTest_conceptual_physics_5',
  'harness_hendrycksTest_econometrics_5',
  'harness_hendrycksTest_electrical_engineering_5',
  'harness_hendrycksTest_elementary_mathematics_5',
  'harness_hendrycksTest_formal_logic_5',
 

Loading leaderboard data:

In [4]:
with open('data/lb.pickle', 'rb') as handle:
    data = pickle.load(handle)

In [27]:
print(len(data['models']))
c_data = data['data']
categories = list(c_data.keys())
for c in categories:
    print(c_data[c]['correctness'].shape)
print(c_data['harness_hendrycksTest_abstract_algebra_5']['correctness'].shape)    

395
(100, 395)
(135, 395)
(152, 395)
(100, 395)
(265, 395)
(144, 395)
(100, 395)
(100, 395)
(100, 395)
(173, 395)
(102, 395)
(100, 395)
(235, 395)
(114, 395)
(145, 395)
(378, 395)
(126, 395)
(100, 395)
(310, 395)
(203, 395)
(100, 395)
(165, 395)
(198, 395)
(193, 395)
(390, 395)
(270, 395)
(238, 395)
(151, 395)
(545, 395)
(216, 395)
(204, 395)
(237, 395)
(223, 395)
(131, 395)
(121, 395)
(108, 395)
(163, 395)
(112, 395)
(103, 395)
(234, 395)
(100, 395)
(783, 395)
(346, 395)
(895, 395)
(306, 395)
(311, 395)
(324, 395)
(282, 395)
(1534, 395)
(272, 395)
(612, 395)
(110, 395)
(245, 395)
(201, 395)
(100, 395)
(166, 395)
(171, 395)
(1172, 395)
(10042, 395)
(817, 395)
(1267, 395)
(1319, 395)
(100, 395)


In this dataset, we have data from 395 models. Let's see the names of some of them below:

In [4]:
len(data['models']),data['models'][:10]

(395,
 ['open-llm-leaderboard/details_zhengr__MixTAO-7Bx2-MoE-DPO',
  'open-llm-leaderboard/details_alignment-handbook__zephyr-7b-sft-full',
  'open-llm-leaderboard/details_rombodawg__Leaderboard-killer-MoE_4x7b',
  'open-llm-leaderboard/details_FelixChao__ExtremeDolphin-MoE',
  'open-llm-leaderboard/details_LoSboccacc__orthogonal-2x7B-base',
  'open-llm-leaderboard/details_moreh__MoMo-70B-lora-1.8.6-DPO',
  'open-llm-leaderboard/details_deepseek-ai__deepseek-moe-16b-base',
  'open-llm-leaderboard/details_Swisslex__Mixtral-Orca-v0.1',
  'open-llm-leaderboard/details_wang7776__Mistral-7B-Instruct-v0.2-sparsity-20',
  'open-llm-leaderboard/details_nfaheem__Marcoroni-7b-DPO-Merge'])

Below, we will process the data so all correctness scores (for all scenarios) are stored in $Y$. The dictionaries `scenarios_position` and `subscenarios_position` give the position of scenarios/subscenarios correctness scores in $Y$.

In [5]:
scenarios_position, subscenarios_position = prepare_data(scenarios, data)
Y = create_responses(scenarios, data)
Y.shape

(395, 28659)

For example, below you can see the scores for MMLU:

In [6]:
Y[:,scenarios_position['mmlu']], Y[:,scenarios_position['mmlu']].shape

(array([[0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        ...,
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [1., 0., 1., ..., 1., 1., 0.]]),
 (395, 14042))

For scenarios that have multiple subscenarios, we usually want to give equal importance to each subscenario when computing the aggregated performance in that scenario. This is equivalent to using a weighted average over items. We will create `balance_weights`, a vector of weights that lets us compute those weighted averages. These weights differ from one only for MMLU, which is the only scenario with multiple subscenarios.

We will use this when choosing the IRT dimension.

In [7]:
balance_weights = np.ones(Y.shape[1])

N = len(scenarios_position['mmlu'])
n_sub = len(scenarios['mmlu'])
for sub in scenarios['mmlu']:
    n_i = len(subscenarios_position['mmlu'][sub])
    balance_weights[subscenarios_position['mmlu'][sub]] = N/(n_sub*n_i)  

We can see below that first averaging within subscenarios and then computing a simple average is equivalent to using a weighted average from the beginning:

In [8]:
accs1 = np.mean([Y[:,subscenarios_position['mmlu'][sub]].mean(axis=1) for sub in scenarios['mmlu']], axis=0)
accs2 = (balance_weights*Y)[:,scenarios_position['mmlu']].mean(axis=1)

np.abs(accs1 - accs2).mean()

2.322333605307685e-14
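
The equivalence also follows directly from the definition of the weights. As a short derivation (the notation is introduced here for illustration): write $N$ for the number of MMLU items, $S$ for the number of subscenarios (`n_sub` in the code), and $n_s$ for the number of items in subscenario $s$ (`n_i` in the code). Each item in $s$ gets the weight $N/(S\,n_s)$, so for model $j$:

$$
\frac{1}{N}\sum_{s=1}^{S}\sum_{i \in s}\frac{N}{S\,n_s}\,Y_{ji}
\;=\;
\frac{1}{S}\sum_{s=1}^{S}\frac{1}{n_s}\sum_{i \in s}Y_{ji}
$$

The left-hand side is the weighted average over all items (`accs2`) and the right-hand side is the simple average of per-subscenario accuracies (`accs1`). The remaining difference of order $10^{-14}$ is floating-point rounding.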

## Obtaining estimates

Let's split the data into train and test sets (recent models are placed in the test set). We will not use the training part in this notebook, since it was already used in `anchor_points.ipynb` and `training_irt.ipynb` to obtain the anchor points and train the IRT model. We will not discretize $Y$ at evaluation time, but that can be done if the user thinks it's needed.

In [9]:
Y_test = Y[:100]
Y_train = Y[100:]

### Using anchor points to estimate performance in the test set and reporting the average prediction error

Loading anchor points and their weights

In [10]:
with open('data/anchor.pickle', 'rb') as handle:
    anchor = pickle.load(handle)

anchor_points = anchor['anchor_points']
anchor_weights = anchor['anchor_weights']

In practice, `Y_test` would be filled with NaNs except at the indices given by `seen_items` below:

`seen_items = np.hstack([np.array(scenarios_position[scenario])[anchor_points[scenario]] for scenario in scenarios.keys()]).tolist()`
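
To make this concrete, here is a minimal sketch on toy data. All names prefixed with `toy_` and all values are hypothetical stand-ins, and we assume (as the expression above suggests) that `anchor_points[scenario]` holds positions within the scenario rather than columns of $Y$.

```python
import numpy as np

# Toy stand-ins (hypothetical values, not the leaderboard data)
Y_full = np.array([[1., 0., 1., 1., 0., 1.],
                   [0., 0., 1., 0., 1., 1.]])
toy_scenarios_position = {"a": [0, 1, 2], "b": [3, 4, 5]}
toy_anchor_points = {"a": [0, 2], "b": [1]}  # positions within each scenario

# Same construction as `seen_items`: map per-scenario positions to columns of Y
toy_seen_items = np.hstack([
    np.array(toy_scenarios_position[s])[toy_anchor_points[s]]
    for s in toy_scenarios_position
]).tolist()

# Only the anchor items are evaluated; every other entry is unknown (NaN)
Y_observed = np.full_like(Y_full, np.nan)
Y_observed[:, toy_seen_items] = Y_full[:, toy_seen_items]

print(toy_seen_items)
print(Y_observed)
```

In this notebook we keep the full `Y_test` so that the true performance is available for computing errors; only the columns in `seen_items` are used to produce the estimates.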

Computing estimates

In [11]:
preds = {}
for scenario in scenarios.keys():
    Y_anchor = Y_test[:,scenarios_position[scenario]][:,anchor_points[scenario]]
    preds[scenario] = (Y_anchor*anchor_weights[scenario]).sum(axis=1) # Predictions
    true = (balance_weights*Y_test)[:,scenarios_position[scenario]].mean(axis=1) # True performance

    print(f"scenario: {scenario}, avg. error: {np.abs(preds[scenario]-true).mean():.3f}")

scenario: harness_truthfulqa_mc_0, avg. error: 0.016
scenario: gsm8k, avg. error: 0.019
scenario: winogrande, avg. error: 0.024
scenario: arc, avg. error: 0.023
scenario: hellaswag, avg. error: 0.020
scenario: mmlu, avg. error: 0.028
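
The anchor-point estimate is just a weighted sum of the correctness scores on the anchor items. The toy example below illustrates this with made-up numbers; it assumes the weights are non-negative and sum to one, which we have not verified for the stored `anchor_weights`.

```python
import numpy as np

# Hypothetical correctness of 2 models on 3 anchor items
toy_Y_anchor = np.array([[1., 0., 1.],
                         [0., 0., 1.]])
# Hypothetical anchor weights (assumed non-negative, summing to one)
toy_weights = np.array([0.5, 0.3, 0.2])

# One estimate per model: weighted sum over the anchor items
toy_preds = (toy_Y_anchor * toy_weights).sum(axis=1)
print(toy_preds)
```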


### Combining anchor points with IRT to estimate performance in the test set and reporting the average prediction error

Loading IRT parameter estimates and recording the indices of all seen examples

In [12]:
A, B, _ = load_irt_parameters('data/irt_model/')
seen_items = np.hstack([np.array(scenarios_position[scenario])[anchor_points[scenario]] for scenario in scenarios.keys()]).tolist()
seen_set = set(seen_items)  # set lookup avoids scanning the list for every item
unseen_items = [i for i in range(Y_train.shape[1]) if i not in seen_set]

Estimating ability parameters for test LLMs

In [13]:
thetas = [estimate_ability_parameters(Y_test[j][seen_items], A[:, :, seen_items], B[:, :, seen_items]) for j in tqdm(range(Y_test.shape[0]))]

100%|███████████████████████████████████████████████| 95/95 [00:03<00:00, 25.97it/s]
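
The implementation of `estimate_ability_parameters` lives in `irt.py` and is not shown here. As a rough illustration of the idea only, the sketch below fits a single ability value by maximum likelihood. It assumes a one-dimensional logistic item curve $p = \sigma(a\theta - b)$ and uses a plain grid search; the names `toy_item_curve` and `toy_estimate_ability` are hypothetical, and the actual model in this notebook may be multidimensional and use a different optimizer.

```python
import numpy as np

def toy_item_curve(theta, a, b):
    """Probability of a correct answer under a 1-D logistic IRT model (assumed form)."""
    return 1.0 / (1.0 + np.exp(-(a * theta - b)))

def toy_estimate_ability(y_seen, a_seen, b_seen, grid=None):
    """Return the grid value of theta that maximizes the Bernoulli log-likelihood."""
    if grid is None:
        grid = np.linspace(-4, 4, 801)
    eps = 1e-12  # guards against log(0)
    # Rows index candidate thetas, columns index seen items
    p = toy_item_curve(grid[:, None], a_seen[None, :], b_seen[None, :])
    loglik = (y_seen * np.log(p + eps) + (1 - y_seen) * np.log(1 - p + eps)).sum(axis=1)
    return grid[np.argmax(loglik)]

# Hypothetical item parameters for four seen items
toy_a = np.array([1.0, 1.5, 0.8, 1.2])
toy_b = np.array([-0.5, 0.0, 0.5, 1.0])

# Noise-free (non-discretized) responses generated at a known ability
true_theta = 0.5
y = toy_item_curve(true_theta, toy_a, toy_b)

theta_hat = toy_estimate_ability(y, toy_a, toy_b)
print(theta_hat)
```

With noise-free responses the likelihood peaks at the ability that generated them, so the grid search should recover it up to the grid resolution. With real 0/1 responses on a small set of anchor items, the estimate is noisier.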


#### p-IRT

In [14]:
pirt_preds = {}
for scenario in scenarios.keys():

    ind_seen = [u for u in seen_items if u in scenarios_position[scenario]]
    ind_unseen = [u for u in unseen_items if u in scenarios_position[scenario]]
    pirt_lambd = len(ind_seen)/len(scenarios_position[scenario])  # fraction of this scenario's items that were seen

    pirt_pred = []
    
    for j in range(Y_test.shape[0]):
        data_part = (balance_weights*Y_test)[j,ind_seen].mean()
        irt_part = (balance_weights*item_curve(thetas[j], A, B))[0,ind_unseen].mean()
        pirt_pred.append(pirt_lambd*data_part + (1-pirt_lambd)*irt_part) 
        
    pirt_preds[scenario] = np.array(pirt_pred) # Predictions
    true = (balance_weights*Y_test)[:,scenarios_position[scenario]].mean(axis=1) # True performance
    
    print(f"scenario: {scenario}, avg. error: {np.abs(pirt_preds[scenario]-true).mean():.3f}")

scenario: harness_truthfulqa_mc_0, avg. error: 0.018
scenario: gsm8k, avg. error: 0.023
scenario: winogrande, avg. error: 0.015
scenario: arc, avg. error: 0.012
scenario: hellaswag, avg. error: 0.016
scenario: mmlu, avg. error: 0.025
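
The p-IRT estimate mixes the observed accuracy on the seen items with the IRT-predicted accuracy on the unseen ones, weighting each part by the fraction of the scenario's items it covers. A minimal single-model sketch follows, leaving out the balance weights. It assumes a one-dimensional logistic item curve; `toy_item_curve` and all the values are hypothetical and are not the `item_curve` imported from `irt.py`.

```python
import numpy as np

def toy_item_curve(theta, a, b):
    """1-D logistic item curve (assumed form, for illustration only)."""
    return 1.0 / (1.0 + np.exp(-(a * theta - b)))

# Hypothetical item parameters for a scenario with five items
toy_a = np.array([1.0, 1.5, 0.8, 1.2, 1.0])
toy_b = np.array([-0.5, 0.0, 0.5, 1.0, 0.0])

seen, unseen = [0, 1], [2, 3, 4]
y_seen = np.array([1.0, 0.0])  # observed correctness on the seen items
theta_hat = 0.0                # ability estimated from the seen items

lambd = len(seen) / (len(seen) + len(unseen))  # share of items actually observed
data_part = y_seen.mean()                      # observed accuracy
irt_part = toy_item_curve(theta_hat, toy_a[unseen], toy_b[unseen]).mean()  # predicted accuracy
pirt_estimate = lambd * data_part + (1 - lambd) * irt_part
print(pirt_estimate)
```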


#### gp-IRT

Loading lambdas

In [15]:
with open('data/lambds.pickle', 'rb') as handle:
    lambds = pickle.load(handle)

Computing estimates and their average errors

In [16]:
gpirt_preds = {}
for scenario in scenarios.keys():
    gpirt_preds[scenario] = lambds[scenario]*preds[scenario]  + (1-lambds[scenario])*pirt_preds[scenario]
    true = (balance_weights*Y_test)[:,scenarios_position[scenario]].mean(axis=1) # True performance
    
    print(f"scenario: {scenario}, avg. error: {np.abs(gpirt_preds[scenario]-true).mean():.3f}")

scenario: harness_truthfulqa_mc_0, avg. error: 0.013
scenario: gsm8k, avg. error: 0.018
scenario: winogrande, avg. error: 0.013
scenario: arc, avg. error: 0.012
scenario: hellaswag, avg. error: 0.014
scenario: mmlu, avg. error: 0.023
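
In symbols, the cell above computes the following for each scenario (the notation is introduced here for illustration):

$$
\hat{Z}^{\text{gp-IRT}} \;=\; \lambda\,\hat{Z}^{\text{anchor}} + (1-\lambda)\,\hat{Z}^{\text{p-IRT}}
$$

Here $\lambda$ is `lambds[scenario]`, $\hat{Z}^{\text{anchor}}$ is the anchor-point estimate `preds[scenario]`, and $\hat{Z}^{\text{p-IRT}}$ is `pirt_preds[scenario]`. When $\lambda$ lies between 0 and 1, this is a convex combination, so the gp-IRT estimate always falls between the two estimates it combines.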
