Adapted from a 2025 project with Su Bao at NTU: https://softly-undefined.github.io/files/cognitive_alignment.pdf

# Cleaning

This file contains the cleaning process for the English and Chinese versions of the MMLU dataset, located at:
- https://openaipublic.blob.core.windows.net/simple-evals/mmlu.csv
- https://openaipublic.blob.core.windows.net/simple-evals/mmlu_ZH-CN.csv

## Removing mismatched rows

Some rows have different answer keys in the English and Chinese versions. For simplicity, we remove these rows from both files.

In [1]:
import pandas as pd

en_df = pd.read_csv("mmlu_EN-US.csv")
zh_df = pd.read_csv("mmlu_ZH-CN.csv")

print("EN shape:", en_df.shape)
print("ZH shape:", zh_df.shape)

display(en_df.head())
display(zh_df.head())

print("EN Columns:", en_df.columns.tolist())
print("ZH Columns:", zh_df.columns.tolist())

row_alignment = (en_df.shape == zh_df.shape) and (en_df.index.equals(zh_df.index))
print("Row alignment:", row_alignment)

en_subjects = en_df['Subject'].value_counts()
zh_subjects = zh_df['Subject'].value_counts()
subject_diff = pd.concat([en_subjects, zh_subjects], axis=1, keys=['EN', 'ZH']).fillna(0)
display(subject_diff)

alignment_check = pd.DataFrame({
    'en_question': en_df['Question'].astype(str).str[:20],
    'zh_question': zh_df['Question'].astype(str).str[:20]
})
display(alignment_check.sample(10))

# Check whether choices and answers are structurally aligned
# (same missing-value pattern in each choice column, identical answer keys)
choices_aligned = all(
    en_df[letter].notnull().equals(zh_df[letter].notnull())
    for letter in ['A', 'B', 'C', 'D']
)
answers_aligned = en_df['Answer'].equals(zh_df['Answer'])

print("Choices structurally aligned:", choices_aligned)
print("Answer keys aligned:", answers_aligned)

EN shape: (14042, 8)
ZH shape: (14042, 8)


index,Unnamed: 0,Question,A,B,C,D,Answer,Subject
0,0,Find the degree for the given field extension ...,0,4,2,6,B,abstract_algebra
1,1,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",8,2,24,120,C,abstract_algebra
2,2,Find all zeros in the indicated finite field o...,0,1,01,04,D,abstract_algebra
3,3,Statement 1 | A factor group of a non-Abelian ...,"True, True","False, False","True, False","False, True",B,abstract_algebra
4,4,Find the product of the given polynomials in t...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract_algebra


index,Unnamed: 0,Question,A,B,C,D,Answer,Subject
0,0,"求 Q 上给定域扩张 Q(sqrt(2), sqrt(3), sqrt(18)) 的次数。",0,4,2,6,B,abstract_algebra
1,1,"设 S_5 中 p = (1, 2, 5, 4)(2, 3)。求 S_5 中 <p> 的幂。",8,2,24,120,C,abstract_algebra
2,2,求系数在该域中的给定多项式的所示有限域的所有零。Z_5 中：x^5 + 3x^3 + x^2...,0,1,01,04,D,abstract_algebra
3,3,陈述 1 | 非阿贝尔群的因子群是非阿贝尔的。陈述 2 | 若 K 是 H 的正规子群，H ...,真，真,假，假,真，假,假，真,B,abstract_algebra
4,4,求给定多项式环中给定多项式的乘积。Z_8[x] 中：f(x) = 4x - 5，g(x) =...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract_algebra


EN Columns: ['Unnamed: 0', 'Question', 'A', 'B', 'C', 'D', 'Answer', 'Subject']
ZH Columns: ['Unnamed: 0', 'Question', 'A', 'B', 'C', 'D', 'Answer', 'Subject']
Row alignment: True


Subject,EN,ZH
professional_law,1534,1534
moral_scenarios,895,895
miscellaneous,783,783
professional_psychology,612,612
high_school_psychology,545,545
high_school_macroeconomics,390,390
elementary_mathematics,378,378
moral_disputes,346,346
prehistory,324,324
philosophy,311,311


index,en_question,zh_question
5271,The superintendent o,一个大型学区的负责人请学校的心理医生预测
323,If you lived on Venu,如果您生活在金星上，您会看到地球的哪些相
2800,Which describes an A,以下哪项描述了一种非洲蝴蝶物种存在两种截
2009,Deflection method di,偏转法是目前应用最广泛的直接测量方法，因
11868,A woman from State A,一名来自 A 州的妇女在 B 州的州法院
842,Hybrids between some,某些亲缘植物之间的杂交种是不育的，因为亲
8200,Thomson discusses a,汤姆森讨论了一个不同版本的小提琴手的情况
12470,The correction for a,衰减校正公式用于衡量：
1386,One end of a horizon,水平无质量弹簧的一端固定在墙上。弹簧的另
5318,After dealing kindly,在和善地接待了几位对她很粗鲁的顾客之后，


Choices structurally aligned: False
Answer keys aligned: False
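
The `False` reported for the choice check only says that at least one of the columns A to D has a different missing-value pattern in the two files; it does not say which rows are affected. A minimal sketch of how one might list those rows, shown on made-up toy frames (the helper name `rows_with_null_mismatch` is hypothetical and not part of the original notebook):

```python
import pandas as pd

def rows_with_null_mismatch(left, right, columns=("A", "B", "C", "D")):
    """Return index labels where the missing-value pattern differs in any of `columns`.

    Assumes both frames share the same index, as en_df and zh_df do above.
    """
    cols = list(columns)
    differs = left[cols].isnull() != right[cols].isnull()
    return differs.index[differs.any(axis=1)].tolist()

# Toy data (invented for illustration): row 1 lacks A on one side, row 2 lacks C on the other.
toy_en = pd.DataFrame({"A": ["x", None, "z"], "B": ["1", "2", "3"],
                       "C": ["a", "b", "c"], "D": ["p", "q", "r"]})
toy_zh = pd.DataFrame({"A": ["x", "y", "z"], "B": ["1", "2", "3"],
                       "C": ["a", "b", None], "D": ["p", "q", "r"]})

null_mismatch_rows = rows_with_null_mismatch(toy_en, toy_zh)
print(null_mismatch_rows)
```

Applied to `en_df` and `zh_df`, the returned labels could be passed to `.loc` to inspect the affected questions side by side.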


In [2]:
# Look for mismatched answers
mismatches = en_df[en_df["Answer"] != zh_df["Answer"]]
print(f"Number of mismatches: {len(mismatches)}")
display(mismatches[['Question', 'Answer']].head())
display(zh_df.loc[mismatches.index][['Question', 'Answer']].head())


Number of mismatches: 176


index,Question,Answer
5853,This question refers to the following informat...,B
5855,This question refers to the following informat...,C
5856,This question refers to the following informat...,D
5857,This question refers to the following informat...,C
5858,This question refers to the following informat...,A


index,Question,Answer
5853,本问题涉及以下信息。\n你就是美国，\n未来的侵略者 \n要侵略印第安血统的天真的阿美利加\...,C
5855,本问题涉及以下信息。\n你就是美国，\n未来的侵略者 \n要侵略印第安血统的天真的阿美利加\...,B
5856,本问题涉及以下信息。\n火药武器：欧洲与中国\n在 12 世纪到 15 世纪的西欧，早期的大...,B
5857,本问题涉及以下信息。\n加纳市由两个城镇组成。其中一城镇由穆斯林居住，它有 12 座清真寺，...,D
5858,本问题涉及以下信息。\n没有比维护和平更紧迫的任务了。没有和平，我们的独立就没有什么意义。我...,C
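
The first mismatched rows shown above sit at neighbouring indices, so it may be useful to see whether the mismatches cluster in particular subjects. A minimal sketch on toy data (the helper name `mismatch_counts_by_subject` is hypothetical; it assumes the two frames are row-aligned and have `Answer` and `Subject` columns, as `en_df` and `zh_df` do):

```python
import pandas as pd

def mismatch_counts_by_subject(left, right):
    """Count rows whose Answer differs between the frames, grouped by Subject."""
    differs = left["Answer"] != right["Answer"]
    return left.loc[differs, "Subject"].value_counts()

# Toy data (invented for illustration): both history rows disagree, no math row does.
toy_en = pd.DataFrame({"Subject": ["history", "history", "math", "math"],
                       "Answer": ["A", "B", "C", "D"]})
toy_zh = pd.DataFrame({"Subject": ["history", "history", "math", "math"],
                       "Answer": ["B", "C", "C", "D"]})

mismatch_counts = mismatch_counts_by_subject(toy_en, toy_zh)
print(mismatch_counts)
```

Calling `mismatch_counts_by_subject(en_df, zh_df)` would give the same breakdown for the 176 mismatches found above.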


In [3]:
aligned_mask = en_df['Answer'] == zh_df['Answer']

aligned_en_df = en_df[aligned_mask].reset_index(drop=True)
aligned_zh_df = zh_df[aligned_mask].reset_index(drop=True)

print("Cleaned EN shape:", aligned_en_df.shape)
print("Cleaned ZH shape:", aligned_zh_df.shape)

aligned_en_df.to_csv("mmlu_EN-US_clean.csv", index=False)
aligned_zh_df.to_csv("mmlu_ZH-CN_clean.csv", index=False)

Cleaned EN shape: (13866, 8)
Cleaned ZH shape: (13866, 8)


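# Subcategorization and sampling 50 from each

Next, we group the subjects into the subcategories defined in https://github.com/hendrycks/test/blob/master/categories.py

We then sample 50 questions from each subcategory for our testing. Seven subcategories (politics, other, law, geography, engineering, culture, chemistry) are dropped after sampling, which leaves 10 subcategories.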
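
Because `Series.map` returns NaN for any key missing from the dictionary, it may be worth confirming that every subject has a subcategory before sampling. A minimal sketch on a toy mapping (the helper name `unmapped_subjects` and the subject `astrology` are invented for illustration):

```python
import pandas as pd

def unmapped_subjects(df, mapping):
    """Return the sorted Subject values that have no entry in `mapping`."""
    mapped = df["Subject"].map(mapping)
    return sorted(df.loc[mapped.isnull(), "Subject"].unique())

# Toy mapping and frame; "astrology" is deliberately absent from the mapping.
toy_mapping = {"abstract_algebra": "math", "anatomy": "health"}
toy_df = pd.DataFrame({"Subject": ["abstract_algebra", "anatomy", "astrology"]})

missing_subjects = unmapped_subjects(toy_df, toy_mapping)
print(missing_subjects)
```

An empty list for the real data would indicate that no row ends up with a missing `Subcategory`.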

In [4]:
import pandas as pd

# === Load Datasets ===
english_df = pd.read_csv("mmlu_EN-US_clean.csv")
chinese_df = pd.read_csv("mmlu_ZH-CN_clean.csv")

# Subject Mappings (taken from: https://github.com/hendrycks/test/blob/master/categories.py)
subject_to_subcategory = {
    "abstract_algebra": "math",
    "anatomy": "health",
    "astronomy": "physics",
    "business_ethics": "business",
    "clinical_knowledge": "health",
    "college_biology": "biology",
    "college_chemistry": "chemistry",
    "college_computer_science": "computer science",
    "college_mathematics": "math",
    "college_medicine": "health",
    "college_physics": "physics",
    "computer_security": "computer science",
    "conceptual_physics": "physics",
    "econometrics": "economics",
    "electrical_engineering": "engineering",
    "elementary_mathematics": "math",
    "formal_logic": "philosophy",
    "global_facts": "other",
    "high_school_biology": "biology",
    "high_school_chemistry": "chemistry",
    "high_school_computer_science": "computer science",
    "high_school_european_history": "history",
    "high_school_geography": "geography",
    "high_school_government_and_politics": "politics",
    "high_school_macroeconomics": "economics",
    "high_school_mathematics": "math",
    "high_school_microeconomics": "economics",
    "high_school_physics": "physics",
    "high_school_psychology": "psychology",
    "high_school_statistics": "math",
    "high_school_us_history": "history",
    "high_school_world_history": "history",
    "human_aging": "health",
    "human_sexuality": "culture",
    "international_law": "law",
    "jurisprudence": "law",
    "logical_fallacies": "philosophy",
    "machine_learning": "computer science",
    "management": "business",
    "marketing": "business",
    "medical_genetics": "health",
    "miscellaneous": "other",
    "moral_disputes": "philosophy",
    "moral_scenarios": "philosophy",
    "nutrition": "health",
    "philosophy": "philosophy",
    "prehistory": "history",
    "professional_accounting": "other",
    "professional_law": "law",
    "professional_medicine": "health",
    "professional_psychology": "psychology",
    "public_relations": "politics",
    "security_studies": "politics",
    "sociology": "culture",
    "us_foreign_policy": "politics",
    "virology": "health",
    "world_religions": "philosophy",
}

# === Assign Subcategories ===
english_df["Subcategory"] = english_df["Subject"].map(subject_to_subcategory)
chinese_df["Subcategory"] = chinese_df["Subject"].map(subject_to_subcategory)

# === Sample TARGET examples from each subcategory with at least TARGET rows ===
TARGET = 50
valid_subs = english_df["Subcategory"].value_counts()[lambda x: x >= TARGET].index.tolist()

# Sample from aligned data using the same indices
sampled_indices = (
    english_df[english_df["Subcategory"].isin(valid_subs)]
    .groupby("Subcategory", group_keys=False)
    .apply(lambda group: group.sample(n=TARGET, random_state=42))
    .index
)

# Drop sampled rows whose Subcategory is in the excluded list
excluded_subs = ["politics", "other", "law", "geography", "engineering", "culture", "chemistry"]
filtered_indices = sampled_indices[
    ~english_df.loc[sampled_indices, "Subcategory"].isin(excluded_subs)
]

# Use the exact same indices in both datasets
sampled_en_df = english_df.loc[filtered_indices].reset_index(drop=True)
sampled_zh_df = chinese_df.loc[filtered_indices].reset_index(drop=True)

# === Save Clean Balanced Files ===
sampled_en_df.to_csv("mmlu_EN-US_balanced.csv", index=False)
sampled_zh_df.to_csv("mmlu_ZH-CN_balanced.csv", index=False)

# === Summary ===
print(f"Final subcategory counts (should all be {TARGET}):")
print(sampled_en_df["Subcategory"].value_counts())
print(f"\nTotal examples in each file: {len(sampled_en_df)}")


Final subcategory counts (should all be 50):
Subcategory
biology             50
business            50
computer science    50
economics           50
health              50
history             50
math                50
philosophy          50
physics             50
psychology          50
Name: count, dtype: int64

Total examples in each file: 500



In [5]:
import pandas as pd

en = pd.read_csv("mmlu_EN-US_balanced.csv")
zh = pd.read_csv("mmlu_ZH-CN_balanced.csv")

en_answer_counts = en['Answer'].value_counts().sort_index()
zh_answer_counts = zh['Answer'].value_counts().sort_index()

answer_comparison = pd.concat([en_answer_counts, zh_answer_counts], axis=1, keys=['EN', 'ZH']).fillna(0).astype(int)

answer_comparison['Difference'] = answer_comparison['EN'] - answer_comparison['ZH']
answer_comparison['% Difference'] = 100 * answer_comparison['Difference'] / answer_comparison['EN']

print("Answer distribution comparison:")
display(answer_comparison)


Answer distribution comparison:


Answer,EN,ZH,Difference,% Difference
A,108,108,0,0.0
B,126,126,0,0.0
C,129,129,0,0.0
D,137,137,0,0.0


In [6]:
import pandas as pd

en = pd.read_csv("mmlu_EN-US_balanced.csv")
zh = pd.read_csv("mmlu_ZH-CN_balanced.csv")

en_subcounts = en["Subcategory"].value_counts().sort_index()
zh_subcounts = zh["Subcategory"].value_counts().sort_index()

subcat_comparison = pd.concat([en_subcounts, zh_subcounts], axis=1, keys=["EN", "ZH"]).fillna(0).astype(int)
subcat_comparison["Difference"] = subcat_comparison["EN"] - subcat_comparison["ZH"]

print("Subcategory comparison between EN and ZH:")
display(subcat_comparison)

if (subcat_comparison["Difference"] == 0).all():
    print("\nAll subcategories match in count.")
else:
    print("\nSubcategory count mismatch found.")


Subcategory comparison between EN and ZH:


Subcategory,EN,ZH,Difference
biology,50,50,0
business,50,50,0
computer science,50,50,0
economics,50,50,0
health,50,50,0
history,50,50,0
math,50,50,0
philosophy,50,50,0
physics,50,50,0
psychology,50,50,0



All subcategories match in count.
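
The two comparisons above check aggregate counts only, so they would still pass if the rows of one file were reordered. A minimal sketch of a row-by-row pairing check, on toy frames (the helper name `rows_are_paired` is hypothetical; it assumes `Subject` and `Answer` are language-independent columns, as in the files above):

```python
import pandas as pd

def rows_are_paired(left, right, columns=("Subject", "Answer")):
    """True if both frames have the same length and identical values in `columns`, row by row."""
    if len(left) != len(right):
        return False
    cols = list(columns)
    return left[cols].reset_index(drop=True).equals(right[cols].reset_index(drop=True))

# Toy data (invented for illustration).
toy_en = pd.DataFrame({"Subject": ["math", "physics", "math"], "Answer": ["A", "B", "C"]})
toy_zh = toy_en.copy()
toy_zh_reversed = toy_zh.iloc[::-1]  # same counts, different row order

print(rows_are_paired(toy_en, toy_zh))
print(rows_are_paired(toy_en, toy_zh_reversed))
```

Running `rows_are_paired(en, zh)` on the balanced files would confirm that each English question still lines up with its Chinese counterpart.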


# We've Done It!

The final datasets are saved as mmlu_EN-US_balanced.csv and mmlu_ZH-CN_balanced.csv. They are used next in the 2_order_bias directory.