Adding Pinyin to the Chinese MMLU dataset

Example of how the pinyin conversion works:

1. Still need to check exactly how pypinyin performs the conversion.

2. The conversion is somewhat context-dependent: some Chinese characters have different pinyin readings depending on context. In the example below, 行 is read as both "xing" and "hang".

In [13]:
from pypinyin import lazy_pinyin
text = "这办法行。他是银行行长"
print(" ".join(lazy_pinyin(text)))


zhe ban fa xing 。 ta shi yin hang hang zhang
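
To make the context dependence concrete, here is a toy sketch of one way such disambiguation can work: look up multi-character phrases first (longest match), and fall back to a default per-character reading. The dictionaries `PHRASES` and `SINGLE` and the function `toy_pinyin` are hypothetical stand-ins written for this note. This is not pypinyin's actual code, and whether pypinyin follows this exact strategy is still the open question in point 1.

```python
# Toy dictionaries: hypothetical, tiny stand-ins for a real pinyin lexicon.
PHRASES = {
    "银行": ["yin", "hang"],    # "bank": 行 read as "hang"
    "行长": ["hang", "zhang"],  # "bank president": 长 read as "zhang"
}
SINGLE = {
    "这": "zhe", "办": "ban", "法": "fa", "行": "xing",
    "他": "ta", "是": "shi", "银": "yin", "长": "chang",
}
MAX_PHRASE_LEN = max(len(p) for p in PHRASES)

def toy_pinyin(text):
    """Greedy longest-match lookup: phrases first, single characters second."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase starting at position i.
        for n in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.extend(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched: use the default reading, or keep the
            # character unchanged if it is not in the toy dictionary.
            ch = text[i]
            out.append(SINGLE.get(ch, ch))
            i += 1
    return out

print(" ".join(toy_pinyin("这办法行。他是银行行长")))
```

In this sketch 行 defaults to "xing" when it stands alone, but the phrase entries 银行 and 行长 override that default with "hang".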


Now we add pinyin to our dataset.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [30]:
# First, merge the English and Chinese datasets on keys that uniquely identify each row
# For each column add suffixes "_EN-US" and "_ZH-CN" to distinguish between the two datasets
chinese_df = pd.read_csv('../1_data_cleaning/mmlu_ZH-CN_balanced.csv')
english_df = pd.read_csv('../1_data_cleaning/mmlu_EN-US_balanced.csv')

id_col = 'Unnamed: 0'
subject_col = 'Subject'
merge_keys = [id_col, subject_col]

chinese_df_suffixed = chinese_df.rename(
    columns={c: f'{c}_ZH-CN' for c in chinese_df.columns if c not in merge_keys}
)
english_df_suffixed = english_df.rename(
    columns={c: f'{c}_EN-US' for c in english_df.columns if c not in merge_keys}
)

merged_df = chinese_df_suffixed.merge(
    english_df_suffixed,
    on=merge_keys,
    how='inner',
    validate='one_to_one'
)

print(f'Merged rows: {len(merged_df)}')
merged_df.head(2)

Merged rows: 850


,Unnamed: 0,Question_ZH-CN,A_ZH-CN,B_ZH-CN,C_ZH-CN,D_ZH-CN,Answer_ZH-CN,Subject,Subcategory_ZH-CN,Question_EN-US,A_EN-US,B_EN-US,C_EN-US,D_EN-US,Answer_EN-US,Subcategory_EN-US
0,157,人血红蛋白分子的氨基酸序列与黑猩猩血红蛋白的相似性比与狗血红蛋白的相似性更高。这种相似性表明，,人与狗的亲缘关系比人与黑猩猩的亲缘关系更近,人与大猩猩的亲缘关系比人与狗的亲缘关系更近,人与黑猩猩有亲缘关系，但与犬无亲缘关系,人和黑猩猩非常相似,B,high_school_biology,biology,The sequence of amino acids in hemoglobin mole...,humans and dogs are more closely related than ...,humans and chimpanzees are more closely relate...,humans are related to chimpanzees but not to dogs,humans and chimpanzees are closely analogous,B,biology
1,39,研究从受精卵阶段到胎儿阶段的脊椎动物发育的胚胎学家有理由得出以下哪项结论？,个体发育重演系统发育。,早期胚胎展示了其纲、目和种的相同特征。,早期人类胚胎与早期鱼类和鸟类胚胎有共同特征。,人类胚胎在发育过程中显示出成年鱼类和鸟类的特征。,A,college_biology,biology,An embryologist studying the development of a ...,Ontogeny recapitulates phylogeny.,Early embryos display identical features of th...,An early human embryo has features in common w...,A human embryo displays features of adult fish...,A,biology
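
The inner merge silently drops rows whose keys appear in only one file. A minimal sketch of how to audit that, assuming small made-up frames (`zh`, `en`) in place of the real CSVs: an outer merge with `indicator=True` adds a `_merge` column that marks each row as `both`, `left_only`, or `right_only`.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the two balanced CSVs.
zh = pd.DataFrame({
    "Unnamed: 0": [157, 39, 194],
    "Subject": ["high_school_biology", "college_biology", "high_school_biology"],
    "Question_ZH-CN": ["q1", "q2", "q3"],
})
en = pd.DataFrame({
    "Unnamed: 0": [157, 39, 11],
    "Subject": ["high_school_biology", "college_biology", "high_school_biology"],
    "Question_EN-US": ["q1", "q2", "q4"],
})
keys = ["Unnamed: 0", "Subject"]

# Outer merge keeps every row; `_merge` records which side(s) it came from.
audit = zh.merge(en, on=keys, how="outer", indicator=True, validate="one_to_one")

# Rows present in only one of the two frames.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[keys + ["_merge"]])
```

If `unmatched` is empty on the real data, the inner merge above lost nothing.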


In [31]:
merged_df['Subcategory_ZH-CN'].value_counts()

Subcategory_ZH-CN
biology             50
history             50
politics            50
physics             50
philosophy          50
other               50
math                50
law                 50
health              50
business            50
geography           50
engineering         50
economics           50
culture             50
computer science    50
chemistry           50
psychology          50
Name: count, dtype: int64

In [32]:
# Next use pypinyin lazy_pinyin to convert the Chinese questions to pinyin
# Should contain all columns that the other two languages have, with a suffix "_ZH-PY" to distinguish
from pypinyin import lazy_pinyin

def to_pinyin(text):
    if pd.isna(text):
        return text
    return ' '.join(lazy_pinyin(str(text)))

# Keep the merge keys unchanged so they still match merged_df exactly
pinyin_df = chinese_df[[id_col, subject_col]].copy()

for col in chinese_df.columns:
    if col != id_col:
        pinyin_df[f'{col}_ZH-PY'] = chinese_df[col].map(to_pinyin)

final_df = merged_df.merge(pinyin_df, on=merge_keys, how='inner', validate='one_to_one')

# Save completed table with EN-US, ZH-CN, and ZH-PY versions
final_df.to_csv('mmlu_EN-US_ZH-CN_ZH-PY.csv', index=False)

# Save another version with only ZH-PY data for simplicity
final_df_zh_py = final_df[[col for col in final_df.columns if col.endswith('_ZH-PY')]]
final_df_zh_py.to_csv('mmlu_ZH-PY.csv', index=False)

print(f'Final rows: {len(final_df)}')
print(f'Final columns: {len(final_df.columns)}')

Final rows: 850
Final columns: 24


In [33]:
# Confirm that this worked!
preview_cols = [
    id_col,
    'Question_ZH-CN',
    'Question_ZH-PY',
    'Question_EN-US',
    'Answer_ZH-CN',
    'Answer_ZH-PY',
    'Answer_EN-US'
]

final_df[preview_cols].head(5)

,Unnamed: 0,Question_ZH-CN,Question_ZH-PY,Question_EN-US,Answer_ZH-CN,Answer_ZH-PY,Answer_EN-US
0,157,人血红蛋白分子的氨基酸序列与黑猩猩血红蛋白的相似性比与狗血红蛋白的相似性更高。这种相似性表明，,ren xue hong dan bai fen zi de an ji suan xu l...,The sequence of amino acids in hemoglobin mole...,B,B,B
1,39,研究从受精卵阶段到胎儿阶段的脊椎动物发育的胚胎学家有理由得出以下哪项结论？,yan jiu cong shou jing luan jie duan dao tai e...,An embryologist studying the development of a ...,A,A,A
2,194,以下哪项不是防止种间繁殖的示例,yi xia na xiang bu shi fang zhi zhong jian fan...,All of the following are examples of events th...,D,D,D
3,266,细胞凋亡是程序性细胞死亡，是生物体的一个必要过程。关于细胞凋亡，以下哪项不正确？,xi bao diao wang shi cheng xu xing xi bao si w...,"Apoptosis, which is programmed cell death, is ...",A,A,A
4,11,导致野生猫和驯化猫不同斑纹的基因也会导致这些猫出现交叉眼（斜视），交叉眼略微适应不良。假设与...,dao zhi ye sheng mao he xun hua mao bu tong ba...,The same gene that causes various coat pattern...,B,B,B
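
Beyond eyeballing the preview, one hedged way to confirm the conversion is to look for Chinese characters left over in the pinyin text. The sketch below uses only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer extension-block characters would not be caught. The helper `has_hanzi` and the `samples` list are hypothetical.

```python
import re

# Basic CJK Unified Ideographs block only; extension blocks are not covered.
HANZI = re.compile(r"[\u4e00-\u9fff]")

def has_hanzi(text):
    """Return True if the text still contains at least one Chinese character."""
    return bool(HANZI.search(str(text)))

# Made-up samples: two clean pinyin strings and one with a leftover character.
samples = [
    "ren xue hong dan bai",
    "ge ti fa yu chong yan xi tong fa yu 。",  # full-width punctuation is not Hanzi
    "yin 行 hang zhang",
]
leftovers = [s for s in samples if has_hanzi(s)]
print(leftovers)
```

On the real table, something like `final_df['Question_ZH-PY'].map(has_hanzi).sum()` would count the rows that still contain Chinese characters.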


In [36]:
# Reload the ZH-PY-only file to double-check it
df_double_check = pd.read_csv('mmlu_ZH-PY.csv')
# Inspect example rows where Subcategory_ZH-PY is history
# df_double_check[df_double_check['Subcategory_ZH-PY'] == 'history'].head(5)
# Check which Subcategory_ZH-PY values we have and how many rows each has
print(df_double_check['Subcategory_ZH-PY'].value_counts())
print(len(df_double_check))

Subcategory_ZH-PY
biology             50
history             50
politics            50
physics             50
philosophy          50
other               50
math                50
law                 50
health              50
business            50
geography           50
engineering         50
economics           50
culture             50
computer science    50
chemistry           50
psychology          50
Name: count, dtype: int64
850


In [37]:
df_double_check.head(5)

,Question_ZH-PY,A_ZH-PY,B_ZH-PY,C_ZH-PY,D_ZH-PY,Answer_ZH-PY,Subject_ZH-PY,Subcategory_ZH-PY
0,ren xue hong dan bai fen zi de an ji suan xu l...,ren yu gou de qin yuan guan xi bi ren yu hei x...,ren yu da xing xing de qin yuan guan xi bi ren...,ren yu hei xing xing you qin yuan guan xi ， da...,ren he hei xing xing fei chang xiang si,B,high_school_biology,biology
1,yan jiu cong shou jing luan jie duan dao tai e...,ge ti fa yu chong yan xi tong fa yu 。,zao qi pei tai zhan shi le qi gang 、 mu he zho...,zao qi ren lei pei tai yu zao qi yu lei he nia...,ren lei pei tai zai fa yu guo cheng zhong xian...,A,college_biology,biology
2,yi xia na xiang bu shi fang zhi zhong jian fan...,qian zai pei ou chu xian di li ge li,qian zai pei ou chu xian xing wei ge li,qian zai pei ou qiu ai yi shi bu tong,qian zai pei ou fan zhi ji jie xiang si,D,high_school_biology,biology
3,xi bao diao wang shi cheng xu xing xi bao si w...,sui ji fa sheng 。,zai sheng wu ti bu zai xu yao shi ， te ding xi...,ji lei le tai duo tu bian shi ， xi bao hui dia...,zhi wu xi bao bei ji sheng chong gan ran shi ，...,A,high_school_biology,biology
4,dao zhi ye sheng mao he xun hua mao bu tong ba...,jin hua shi jian jin de ， qing xiang yu jin hu...,biao xing tong chang shi zhe zhong de jie guo 。,zi ran xuan ze hui sui shi jian jian shao zhon...,duo ji yin yi chuan yi ban shi ying bu liang ，...,B,high_school_biology,biology
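
Because the ZH-PY-only file keeps just the columns ending in `_ZH-PY`, the row id (`Unnamed: 0`) is not saved with it, and the first column in the preview above is only the DataFrame index. If the rows ever need to be joined back to the EN-US or ZH-CN data, one option is to keep the merge keys alongside the pinyin columns. A minimal sketch on a made-up frame (`final` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical two-row stand-in for the combined table.
final = pd.DataFrame({
    "Unnamed: 0": [157, 39],
    "Subject": ["high_school_biology", "college_biology"],
    "Question_ZH-CN": ["q1_zh", "q2_zh"],
    "Question_ZH-PY": ["q1_py", "q2_py"],
    "Answer_ZH-PY": ["B", "A"],
})
keys = ["Unnamed: 0", "Subject"]

# Select the merge keys first, then every pinyin column.
py_cols = [c for c in final.columns if c.endswith("_ZH-PY")]
zh_py_with_keys = final[keys + py_cols]
print(list(zh_py_with_keys.columns))
```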
