
Minerva MMLU-STEM Replication #7

Merged
merged 5 commits into master from mmlu-minerva on Jul 12, 2023

Conversation

haileyschoelkopf
Collaborator

Minerva evals on two variants/formats of MMLU-STEM:

  1. The same as the original MMLU code: this should use the version of hendrycksTest-* from upstream after PR EleutherAI/lm-evaluation-harness#497 (MMLU task fix).
  2. MMLU with chain-of-thought, with and without Maj@k (k=16).

The STEM subjects are:

Some subjects are prompted differently than others:

We use a multiple choice version of the MATH prompt (see Listing 5) for the subtopics which use equations: abstract_algebra, college_mathematics, college_physics, elementary_mathematics, high_school_mathematics, high_school_physics, high_school_statistics. We wrote a custom chain-of-thought for each of the remaining original prompts. Those prompts can be found in the supplementary materials.

(https://arxiv.org/pdf/2206.14858.pdf, Appendix G)

This PR adds this multiple-choice MATH prompt version of MMLU for the appropriate subjects. The other subjects are said to have "prompts in supplementary materials", but as far as I could tell these are not included in the supplementary material I downloaded from the linked zip.
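For context, here is a rough sketch of how a multiple-choice MATH-style prompt for these subjects might be assembled. The function name and exact formatting below are hypothetical; the actual wording follows Listing 5 / Appendix G of the paper:

```python
# Hypothetical sketch of a multiple-choice MATH-style prompt builder;
# the exact wording in the Minerva paper (Listing 5) may differ.
CHOICES = ["(A)", "(B)", "(C)", "(D)"]

def build_prompt(question, options):
    """Format an MMLU question and its four options into a
    Problem/Solution-style prompt, as used for MATH."""
    lines = ["Problem:", question.strip(), ""]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label} {option}")
    lines += ["", "Solution:"]
    return "\n".join(lines)
```

The trailing "Solution:" cue invites the model to produce a chain-of-thought before naming a choice.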

Tested with Pythia-6.9b, which achieves

  • 0.14 acc on abstract_algebra
  • 0.1074 acc on high_school_mathematics

@zhangir-azerbayev
Collaborator

Am I correct that the Pythia results are significantly worse than random?

@haileyschoelkopf
Collaborator Author

Yep, likely because the model doesn't always conform to the expected output format. I'll check what the score is when normalizing for that.
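To illustrate the failure mode: scoring depends on pulling a choice letter out of the free-form generation, and a sample that never states an answer in the expected format scores zero. A minimal sketch of such an extractor (the regex and sentinel string here are illustrative, not the harness's actual implementation):

```python
import re

def extract_choice(generation):
    """Pull the final answer letter out of a chain-of-thought
    generation; return a sentinel when the model does not
    conform to the expected 'Final Answer: (X)' format."""
    m = re.search(r"[Ff]inal [Aa]nswer.*?\(([A-D])\)", generation)
    return f"({m.group(1)})" if m else "[invalid answer]"
```

Normalizing for format conformance means measuring accuracy only over samples where extraction succeeds.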

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented Jul 11, 2023

Runtime is long here, but initial small-sample-size tests with pythia-1.4b on abstract_algebra with Maj@16 show better-than-random performance. I'm disallowing [invalid answer] from being selected as the majority vote, which seems principled to me.

@zhangir-azerbayev
Collaborator

Yep I'm fine with disallowing invalid answer from the majority vote.
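The agreed-upon voting rule can be sketched as follows; this is a minimal illustration, not the harness's actual code, and the sentinel string is assumed:

```python
from collections import Counter

INVALID = "[invalid answer]"

def majority_vote(answers):
    """Return the most common answer among the k sampled
    generations, excluding failed extractions from the vote."""
    valid = [a for a in answers if a != INVALID]
    if not valid:
        return INVALID  # every sample failed to parse
    return Counter(valid).most_common(1)[0][0]
```

Without the filter, a model that frequently breaks format could have [invalid answer] win the vote and drag accuracy below random, matching the Pythia-6.9b numbers above.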

@zhangir-azerbayev zhangir-azerbayev merged commit fef9d47 into master Jul 12, 2023
0 of 2 checks passed
@zhangir-azerbayev zhangir-azerbayev deleted the mmlu-minerva branch July 12, 2023 01:58