
Minerva MMLU-STEM Replication #7

Merged
merged 5 commits into master from mmlu-minerva on Jul 12, 2023

Conversation

haileyschoelkopf
Collaborator

Minerva evals on two variants/formats of MMLU-STEM:

  1. The same as the original MMLU code: this should use the version of hendrycksTest-* from upstream after PR EleutherAI/lm-evaluation-harness#497 (MMLU task fix).
  2. MMLU with chain-of-thought, with and without Maj@k (k=16).

The STEM subjects are:

Some subjects are prompted differently than others:

We use a multiple choice version of the MATH prompt (see Listing 5) for the subtopics which use equations: abstract_algebra, college_mathematics, college_physics, elementary_mathematics, high_school_mathematics, high_school_physics, high_school_statistics. We wrote a custom chain-of-thought for each of the remaining original prompts. Those prompts can be found in the supplementary materials.

(https://arxiv.org/pdf/2206.14858.pdf, Appendix G)

This PR adds this multiple-choice MATH prompt version of MMLU for the appropriate subjects. The other subjects are said to have "prompts in supplementary materials", but as far as I could tell these are not included in the supplementary material I downloaded from the linked zip.
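For context, here is a rough sketch of how a multiple-choice MATH-style prompt for these subjects might be assembled. The function name and exact formatting below are hypothetical; the actual wording follows Listing 5 / Appendix G of the paper:

```python
# Hypothetical sketch of a multiple-choice MATH-style prompt builder;
# the exact wording in the Minerva paper (Listing 5) may differ.
CHOICES = ["(A)", "(B)", "(C)", "(D)"]

def build_prompt(question, options):
    """Format an MMLU question and its four options into a
    Problem/Solution-style prompt, as used for MATH."""
    lines = ["Problem:", question.strip(), ""]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label} {option}")
    lines += ["", "Solution:"]
    return "\n".join(lines)
```

The trailing "Solution:" cue invites the model to produce a chain-of-thought before naming a choice.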

Tested with Pythia-6.9b, which achieves

  • 0.14 acc on abstract_algebra
  • 0.1074 acc on high_school_mathematics

@zhangir-azerbayev
Collaborator

Am I correct that the Pythia results are significantly worse than random?

@haileyschoelkopf
Collaborator Author

Yep, likely because the model doesn't always conform to the expected output format. I'll check what the score is when normalizing for that.
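To illustrate the failure mode: scoring depends on pulling a choice letter out of the free-form generation, and a sample that never states an answer in the expected format scores zero. A minimal sketch of such an extractor (the regex and sentinel string here are illustrative, not the harness's actual implementation):

```python
import re

def extract_choice(generation):
    """Pull the final answer letter out of a chain-of-thought
    generation; return a sentinel when the model does not
    conform to the expected 'Final Answer: (X)' format."""
    m = re.search(r"[Ff]inal [Aa]nswer.*?\(([A-D])\)", generation)
    return f"({m.group(1)})" if m else "[invalid answer]"
```

Normalizing for format conformance means measuring accuracy only over samples where extraction succeeds.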

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented Jul 11, 2023

Runtime is long here, but initial small-sample-size tests with pythia-1.4b on abstract_algebra with Maj@16 show better-than-random performance. I'm disallowing [invalid answer] from being selected as the majority vote, which seems principled to me.

@zhangir-azerbayev
Collaborator

Yep I'm fine with disallowing invalid answer from the majority vote.
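The agreed-upon voting rule can be sketched as follows; this is a minimal illustration, not the harness's actual code, and the sentinel string is assumed:

```python
from collections import Counter

INVALID = "[invalid answer]"

def majority_vote(answers):
    """Return the most common answer among the k sampled
    generations, excluding failed extractions from the vote."""
    valid = [a for a in answers if a != INVALID]
    if not valid:
        return INVALID  # every sample failed to parse
    return Counter(valid).most_common(1)[0][0]
```

Without the filter, a model that frequently breaks format could have [invalid answer] win the vote and drag accuracy below random, matching the Pythia-6.9b numbers above.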

@zhangir-azerbayev zhangir-azerbayev merged commit fef9d47 into master Jul 12, 2023
0 of 2 checks passed
@zhangir-azerbayev zhangir-azerbayev deleted the mmlu-minerva branch July 12, 2023 01:58