GSM-Plus: Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Dataset and Code for the Paper "A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers".

🌐 Project Page | 📚 Data | 📃 Paper

Introduction

Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there are increasing debates regarding whether these models truly understand and apply mathematical knowledge or merely rely on shortcuts for mathematical reasoning. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.

In response to these issues, we present GSM-Plus, an adversarial evaluation benchmark that is crafted for systematically evaluating the mathematical reasoning capability of LLMs. we identify 5 perspectives to guide the development of GSM-Plus:

Numerical Variation refers to altering the numerical data or its types, including 3 subcategories: numerical substitution, digit expansion, and integer-decimal-fraction conversion.
Arithmetic Variation refers to reversing or introducing additional operations (e.g., addition, subtraction, multiplication, and division) to math problems, including 2 subcategories: adding operation and reversing operation.
Problem Understanding refers to rephrasing the text description of the math problems.
Distractor Insertion refers to inserting topic-related but useless sentences to the problems.
Critical Thinking focuses on question or doubt ability when the question lacks necessary statements.

Based on the 1,319 test questions from GSM8K, we create eight variations for each question, the yielding GSM-Plus comprises 10,552 question variations.

What's New

[2024.02.25] 📣 Dataset is released.

TODO

Release a compact test set for quick evaluation

Dataset Example

Examples of eight adversarial questions based on a seed question:

Dataset Usage

Data Downloading

You can download GSM-Plus test set by the following command (make sure that you have installed Huggingface Datasets):

from datasets import load_dataset

dataset = load_dataset("qintongli/GSM-Plus")
# print the first example
print(dataset["test"][0])

The dataset is provided in .jsonl format and contains the following attributes:

{
    "question": the adversarial question,
    "solution": the solution chain for the adversarial question ,
    "answer": the gold answer of the adversarial question,
    "perturbation_type": the perturbation type,
    "seed_question": the seed question used to craft the adversarial question,
    "seed_solution": the solution chain for the seed question,
    "seed_answer": the gold answer of the seed question,
}

Evaluations on GSM-Plus

Usage Examples

We offer a few examples for evaluating LLMs on GSM-Plus, as follows:

GPT-4, GPT-Turbo-3.5

python scripts/openai_model_inference.py --model_name [MODEL_NAME] --output_file [OUTPUT_PATH]/[MODEL_NAME]_prediction.json --prompt_type [PROMPT]
# [MODEL_NAME] can be: gpt-4-0613, gpt-3.5-turbo-0613, etc.
# [PROMPT] can be: cot, pot, ltm, complex, contrastive, cot_sc

Mistral, LLaMA, CodeLlama

python scripts/general_model_inference.py \
--model_name [MODEL_NAME] \
--output_file [OUTPUT_PATH]/[MODEL_NAME]_prediction.json \
--model_dir [MODEL_PATH] \  # path to checkpoint files
--nshots 8 \
--prompt_type [PROMPT] \
--specify_your_gpus [GPUT_IDs]

# [PROMPT] can be: cot-nshot, pot-nshot, ltm-nshot, ltm-1shot, complex, contrastive

Compositional Prompting

Extract key premise from a math word problem:

python scripts/extract_key_premise.py --outut_file comp_test.json --start_idx 0 --end_idx 10

Iteratively generate each reasoning thought based on the problem and its premise:

python scripts/compositional_prompt.py --input_file comp_test.json --output_file comp_prediction.json --goal_mode greedy --cal_mode greedy

️Citation

If you find GSM-Plus useful, please consider giving star and citing our paper:

@misc{li2024gsmplus,
      title={GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers}, 
      author={Qintong Li and Leyang Cui and Xueliang Zhao and Lingpeng Kong and Wei Bi},
      year={2024},
      eprint={2402.19255},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
assets		assets
dataset		dataset
results		results
scripts		scripts
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assets

assets

dataset

dataset

results

results

scripts

scripts

.DS_Store

.DS_Store

.gitignore

.gitignore

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

GSM-Plus: Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Introduction

What's New

TODO

Dataset Example

Dataset Usage

Data Downloading

Evaluations on GSM-Plus

Usage Examples

Compositional Prompting

️Citation

About

Releases

Packages

Languages

qtli/GSM-Plus

Folders and files

Latest commit

History

Repository files navigation

GSM-Plus: Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Introduction

What's New

TODO

Dataset Example

Dataset Usage

Data Downloading

Evaluations on GSM-Plus

Usage Examples

Compositional Prompting

️Citation

About

Resources

Stars

Watchers

Forks

Languages