tongyx361/Awesome-LLM4Math
Awesome LLM4Math

Curation of resources for LLM mathematical reasoning.

License: Apache

🐱 GitHub | 🐦 Twitter

📢 If you have any suggestions, please don't hesitate to let us know.
The following resources are listed in (roughly) chronological order of publication.

Continual Pre-Training: Methods / Models / Corpora

  • Llemma & Proof-Pile-2: Open-sourced re-implementation of Minerva.
    • Open-sourced corpus Proof-Pile-2 comprising 51.9B tokens (by DeepSeek tokenizer).
    • Continually pre-trained from the CodeLLaMA models.
  • OpenWebMath:
    • 13.6B tokens (by DeepSeek tokenizer).
    • Used by Rho-1 to achieve performance comparable with DeepSeekMath.
  • MathPile:
    • 8.9B tokens (by DeepSeek tokenizer).
    • Mainly comprising arXiv papers.
    • Shown to be ineffective (on 7B models) by DeepSeekMath.
  • DeepSeekMath: Open-sourced SotA (as of 2024-04-18).
    • Continually pre-trained from DeepSeek-LLMs and DeepSeekCoder-7B.
  • Rho-1: Selects training tokens based on loss/perplexity, achieving performance comparable with DeepSeekMath while training only on the ~15B-token OpenWebMath corpus.
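Rho-1's token selection (Selective Language Modeling) can be illustrated with a minimal sketch. This is not the paper's implementation: the per-token losses below are toy numbers and `keep_ratio` is an assumed hyperparameter. The idea is to compute the loss only over tokens whose excess loss (training model minus reference model) is highest.

```python
# Sketch of Rho-1-style Selective Language Modeling (SLM), under the
# assumption that we already have per-token cross-entropy losses from
# both the training model and a reference model.

def select_tokens(train_losses, ref_losses, keep_ratio=0.6):
    """Return a 0/1 mask keeping the top `keep_ratio` tokens by excess loss."""
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    threshold = sorted(excess, reverse=True)[k - 1]
    return [1 if e >= threshold else 0 for e in excess]

def masked_mean_loss(train_losses, mask):
    """Average the training loss over selected tokens only."""
    kept = [loss for loss, m in zip(train_losses, mask) if m]
    return sum(kept) / len(kept)

train = [2.1, 0.3, 1.8, 0.2, 2.5]  # toy per-token losses (training model)
ref   = [0.4, 0.3, 0.5, 0.2, 0.6]  # toy per-token losses (reference model)
mask = select_tokens(train, ref)
print(mask)  # [1, 0, 1, 0, 1] -- noisy/easy tokens are masked out
```

In practice this masking is applied inside the loss of a standard pre-training loop, so the model simply skips gradient signal from unselected tokens.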

SFT: Methods / Models / Datasets

Natural language (only)

  • RFT: SFT on rejection-sampled model outputs is effective.
  • MetaMath: Constructing problems with known ground-truth answers (but not necessarily feasible problems) via self-verification.
    • Augmenting with GPT-3.5-Turbo.
  • AugGSM8k: Common data augmentation on GSM8k helps little in generalization to MATH.
  • MathScale: Scaling synthetic data to ~2M samples using GPT-3.5-Turbo with knowledge graph.
  • KPMath: Scaling synthetic data to 1.576M samples using GPT-4-Turbo with knowledge graph.
  • XWin-Math: Simply scaling synthetic data to 480k MATH + 960k GSM8k samples using GPT-4-Turbo.
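The rejection-sampling recipe behind RFT can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: `generate` is a hypothetical stand-in for sampling from a model, hard-wired so even-indexed samples hit the correct answer.

```python
# Sketch of rejection-sampling fine-tuning (RFT) data construction:
# sample several solutions per problem, keep only those whose final
# answer matches the gold answer, deduplicate, then SFT on the result.

def generate(problem, n_samples):
    """Hypothetical sampler stand-in, returning (reasoning, answer) pairs."""
    # Toy behavior: even-indexed samples reach the correct answer 4.
    return [(f"sampled reasoning #{i}", 4 if i % 2 == 0 else 5)
            for i in range(n_samples)]

def rejection_sample(problems_with_answers, n_samples=8):
    """Build an SFT dataset from correct, deduplicated sampled solutions."""
    sft_data = []
    for problem, gold in problems_with_answers:
        seen = set()
        for reasoning, answer in generate(problem, n_samples):
            if answer == gold and reasoning not in seen:  # correct + unique
                seen.add(reasoning)
                sft_data.append({"problem": problem, "solution": reasoning})
    return sft_data

data = rejection_sample([("What is 2 + 2?", 4)])
print(len(data))  # 4 of the 8 toy samples had the correct final answer
```

With a real model, `generate` would sample with nonzero temperature, and deduplication is typically done on the reasoning path (e.g. the equation sequence) rather than the raw string.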

Code integration

  • MAmmoTH: SFT on mixed CoT & PoT data is effective.
  • ToRA & MARIO: The first open-sourced model works to verify the effectiveness of SFT for tool-integrated reasoning.
  • OpenMathInstruct-1: Scaling synthetic data to 1.8M samples using Mixtral-8x7B.
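The core loop of tool-integrated (PoT-style) reasoning is: the model emits a program, the runtime executes it, and the computed value becomes the answer. A minimal sketch, with `model_output` as a hypothetical generation rather than real model output:

```python
# Sketch of program-of-thought (PoT) answer extraction: execute the
# model-written Python in a scratch namespace and read out `answer`.
# Real systems run this step in a sandbox with timeouts.

def execute_pot(code: str):
    """Execute model-generated code and return the value bound to `answer`."""
    namespace = {}
    exec(code, namespace)  # untrusted code: sandbox in production
    return namespace.get("answer")

model_output = """
# Model-written program for: "What is the sum of the first 100 integers?"
answer = sum(range(1, 101))
"""
print(execute_pot(model_output))  # 5050
```

Tool-integrated methods like ToRA interleave several such code blocks with natural-language reasoning, feeding each execution result back into the context before the model continues.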

RL: Methods / Models / Datasets

  • Math-Shepherd: Constructing step-correctness labels based on an MCTS-like method.
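Math-Shepherd's automatic step labeling can be illustrated with a toy sketch: from each step prefix, sample several completions and mark the step as good if any completion reaches the gold answer (the paper's "hard estimation" variant). Here `complete` is a hypothetical stand-in for the rollout model, hard-wired so that prefixes containing a wrong step never recover.

```python
# Sketch of Math-Shepherd-style process supervision: label each reasoning
# step by the outcomes of Monte Carlo rollouts continued from that step.

def complete(prefix_steps, n_rollouts):
    """Hypothetical rollout stand-in: final answers of sampled completions."""
    if any("WRONG" in step for step in prefix_steps):
        return [0] * n_rollouts   # toy: a wrong step derails every rollout
    return [42] * n_rollouts      # toy: clean prefixes reach the gold answer

def label_steps(steps, gold_answer, n_rollouts=4):
    """Hard estimation: a step is correct if any rollout from it succeeds."""
    labels = []
    for i in range(1, len(steps) + 1):
        answers = complete(steps[:i], n_rollouts)
        labels.append(int(any(a == gold_answer for a in answers)))
    return labels

steps = ["step 1: set up the equation",
         "step 2: WRONG simplification",
         "step 3: conclude"]
print(label_steps(steps, gold_answer=42))  # [1, 0, 0]
```

The resulting step labels are then used to train a process reward model without human annotation; a "soft" variant uses the fraction of successful rollouts instead of the any-success indicator.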

Evaluation: Benchmarks

Here we focus on several of the most important benchmarks.

Other benchmarks

  • miniF2F: “a formal mathematics benchmark (translated across multiple formal systems) consisting of exercise statements from olympiads (AMC, AIME, IMO) as well as high-school and undergraduate maths classes”.
  • OlympiadBench: “an Olympiad-level bilingual multimodal scientific benchmark”.
    • GPT-4V attains an average score of 17.23% on OlympiadBench, with a mere 11.28% in physics.

Curations, collections and surveys

Events

  • AIMO: “a new $10mn prize fund to spur the open development of AI models capable of performing as well as top human participants in the International Mathematical Olympiad (IMO)”.
