tongyx361/Awesome-LLM4Math
Awesome LLM4Math

Curation of resources for LLM mathematical reasoning.

License: Apache

🐱 GitHub | 🐦 Twitter

📢 If you have any suggestions, please don't hesitate to let us know.
The following resources are listed in (roughly) chronological order of publication.

Continual Pre-Training: Methods / Models / Corpora

  • Llemma & Proof-Pile-2: Open-sourced re-implementation of Minerva.
    • Open-sourced corpus Proof-Pile-2 comprising 51.9B tokens (by DeepSeek tokenizer).
    • Continually pre-trained from the CodeLLaMA models.
  • OpenWebMath:
    • 13.6B tokens (by DeepSeek tokenizer).
    • Used by Rho-1 to achieve performance comparable with DeepSeekMath.
  • MathPile:
    • 8.9B tokens (by DeepSeek tokenizer).
    • Mainly comprising arXiv papers.
    • Shown to be ineffective (on 7B models) by DeepSeekMath.
  • DeepSeekMath: Open-sourced SotA (as of 2024-04-18).
    • Continually pre-trained from DeepSeek-LLMs and DeepSeekCoder-7B.
  • Rho-1: Selects training tokens based on loss/perplexity, achieving performance comparable with DeepSeekMath while training only on the ~15B-token OpenWebMath corpus.
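Rho-1's token selection (Selective Language Modeling) can be illustrated with a minimal sketch. This is not the paper's implementation: the per-token losses below are toy numbers and `keep_ratio` is an assumed hyperparameter. The idea is to compute the loss only over tokens whose excess loss (training model minus reference model) is highest.

```python
# Sketch of Rho-1-style Selective Language Modeling (SLM), under the
# assumption that we already have per-token cross-entropy losses from
# both the training model and a reference model.

def select_tokens(train_losses, ref_losses, keep_ratio=0.6):
    """Return a 0/1 mask keeping the top `keep_ratio` tokens by excess loss."""
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    threshold = sorted(excess, reverse=True)[k - 1]
    return [1 if e >= threshold else 0 for e in excess]

def masked_mean_loss(train_losses, mask):
    """Average the training loss over selected tokens only."""
    kept = [loss for loss, m in zip(train_losses, mask) if m]
    return sum(kept) / len(kept)

train = [2.1, 0.3, 1.8, 0.2, 2.5]  # toy per-token losses (training model)
ref   = [0.4, 0.3, 0.5, 0.2, 0.6]  # toy per-token losses (reference model)
mask = select_tokens(train, ref)
print(mask)  # [1, 0, 1, 0, 1] -- noisy/easy tokens are masked out
```

In practice this masking is applied inside the loss of a standard pre-training loop, so the model simply skips gradient signal from unselected tokens.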

SFT: Methods / Models / Datasets

Natural language (only)

  • RFT: SFT on rejection-sampled model outputs is effective.
  • MetaMath: Constructing problems with known ground-truth answers (but not necessarily feasible problems) via self-verification.
    • Augmenting with GPT-3.5-Turbo.
  • AugGSM8k: Common data augmentation on GSM8k helps little in generalization to MATH.
  • MathScale: Scaling synthetic data to ~2M samples using GPT-3.5-Turbo with knowledge graph.
  • KPMath: Scaling synthetic data to 1.576M samples using GPT-4-Turbo with knowledge graph.
  • XWin-Math: Simply scaling synthetic data to 480k MATH + 960k GSM8k samples using GPT-4-Turbo.
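The rejection-sampling recipe behind RFT can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: `generate` is a hypothetical stand-in for sampling from a model, hard-wired so even-indexed samples hit the correct answer.

```python
# Sketch of rejection-sampling fine-tuning (RFT) data construction:
# sample several solutions per problem, keep only those whose final
# answer matches the gold answer, deduplicate, then SFT on the result.

def generate(problem, n_samples):
    """Hypothetical sampler stand-in, returning (reasoning, answer) pairs."""
    # Toy behavior: even-indexed samples reach the correct answer 4.
    return [(f"sampled reasoning #{i}", 4 if i % 2 == 0 else 5)
            for i in range(n_samples)]

def rejection_sample(problems_with_answers, n_samples=8):
    """Build an SFT dataset from correct, deduplicated sampled solutions."""
    sft_data = []
    for problem, gold in problems_with_answers:
        seen = set()
        for reasoning, answer in generate(problem, n_samples):
            if answer == gold and reasoning not in seen:  # correct + unique
                seen.add(reasoning)
                sft_data.append({"problem": problem, "solution": reasoning})
    return sft_data

data = rejection_sample([("What is 2 + 2?", 4)])
print(len(data))  # 4 of the 8 toy samples had the correct final answer
```

With a real model, `generate` would sample with nonzero temperature, and deduplication is typically done on the reasoning path (e.g. the equation sequence) rather than the raw string.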

Code integration

  • MAmmoTH: SFT on mixed CoT & PoT data is effective.
  • ToRA & MARIO: The first open-sourced model works to verify the effectiveness of SFT for tool-integrated reasoning.
  • OpenMathInstruct-1: Scaling synthetic data to 1.8M samples using Mixtral-8x7B.
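The core loop of tool-integrated (PoT-style) reasoning is: the model emits a program, the runtime executes it, and the computed value becomes the answer. A minimal sketch, with `model_output` as a hypothetical generation rather than real model output:

```python
# Sketch of program-of-thought (PoT) answer extraction: execute the
# model-written Python in a scratch namespace and read out `answer`.
# Real systems run this step in a sandbox with timeouts.

def execute_pot(code: str):
    """Execute model-generated code and return the value bound to `answer`."""
    namespace = {}
    exec(code, namespace)  # untrusted code: sandbox in production
    return namespace.get("answer")

model_output = """
# Model-written program for: "What is the sum of the first 100 integers?"
answer = sum(range(1, 101))
"""
print(execute_pot(model_output))  # 5050
```

Tool-integrated methods like ToRA interleave several such code blocks with natural-language reasoning, feeding each execution result back into the context before the model continues.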

RL: Methods / Models / Datasets

  • Math-Shepherd: Constructing step-correctness labels based on an MCTS-like method.
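Math-Shepherd's automatic step labeling can be illustrated with a toy sketch: from each step prefix, sample several completions and mark the step as good if any completion reaches the gold answer (the paper's "hard estimation" variant). Here `complete` is a hypothetical stand-in for the rollout model, hard-wired so that prefixes containing a wrong step never recover.

```python
# Sketch of Math-Shepherd-style process supervision: label each reasoning
# step by the outcomes of Monte Carlo rollouts continued from that step.

def complete(prefix_steps, n_rollouts):
    """Hypothetical rollout stand-in: final answers of sampled completions."""
    if any("WRONG" in step for step in prefix_steps):
        return [0] * n_rollouts   # toy: a wrong step derails every rollout
    return [42] * n_rollouts      # toy: clean prefixes reach the gold answer

def label_steps(steps, gold_answer, n_rollouts=4):
    """Hard estimation: a step is correct if any rollout from it succeeds."""
    labels = []
    for i in range(1, len(steps) + 1):
        answers = complete(steps[:i], n_rollouts)
        labels.append(int(any(a == gold_answer for a in answers)))
    return labels

steps = ["step 1: set up the equation",
         "step 2: WRONG simplification",
         "step 3: conclude"]
print(label_steps(steps, gold_answer=42))  # [1, 0, 0]
```

The resulting step labels are then used to train a process reward model without human annotation; a "soft" variant uses the fraction of successful rollouts instead of the any-success indicator.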

Evaluation: Benchmarks

Here we focus on several of the most important benchmarks.

Other benchmarks

  • miniF2F: “a formal mathematics benchmark (translated across multiple formal systems) consisting of exercise statements from olympiads (AMC, AIME, IMO) as well as high-school and undergraduate maths classes”.
  • OlympiadBench: “an Olympiad-level bilingual multimodal scientific benchmark”.
    • GPT-4V attains an average score of 17.23% on OlympiadBench, with a mere 11.28% in physics.

Curations, collections and surveys

Events

  • AIMO: “a new $10mn prize fund to spur the open development of AI models capable of performing as well as top human participants in the International Mathematical Olympiad (IMO)”.
