The data and code for the paper KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains. This research investigates the table-to-text capabilities of different LLMs using four datasets within two real-world information seeking scenarios. It demonstrates that high-performing LLMs, such as GPT-4, can effectively serve as table-to-text generators, evaluators, and feedback generators. KnowledgeMath is a knowledge-intensive dataset focused on mathematical reasoning within the domain of finance. It requires the model to comprehend specialized financial terminology and to interpret tabular data presented in the questions. KnowledgeMath includes 1200 QA examples across 7 key areas in finance. These examples were collected from financial experts and feature detailed solution annotations in Python format.
All the data examples were divided into two subsets: validation and test.
- validation: 200 examples used for model development, validation, or for those with limited computing resources.
- test: 1000 examples for standard evaluation. We will not publicly release the annotated solution and answer for the test set.
You can download this dataset by the following command:
from datasets import load_dataset
dataset = load_dataset("yale-nlp/KnowledgeMath")
# print the first example on the validation set
print(dataset["validation"][0])
# print the first example on the test set
print(dataset["test"][0])
The dataset is provided in json format and contains the following attributes:
{
"question_id": [string] The question id,
"question": [string] The question text,
"tables": [list] List of Markdown-format tables associated with the question,
"python_solution": [string] Python-format and executable solution by financial experts. The code is written in a clear and executable format, with well-named variables and a detailed explanation,
"ground_truth": [float] Executed result of `python solution`, rounded to three decimal places,
"topic": [string] The related financial area of the question,
"knowledge_terms": [list] List of knowledge terms in our constructed knowledge bank that is necessary to answer the given question. We will release this feature upon paper publication
}
The code is tested on the following environment:
- python 3.11.5
- CUDA 12.1, PyTorch 2.1.1
- run
pip install -r requirements.txt
to install all the required packages
We provide inference scripts for running various LLMs on KnowledgeMath:
scripts/run_openai.sh
for running GPT-* modelsscripts/run_gemini.sh
for running Gemini Proscripts/run_vllm.sh
for running all other open-sourced LLMs (e.g., Llama-2, Mistral, Falcon) that are reported in the paper and supported by the vLLM framework
The Chain-of-Thought (CoT) and Program-of-Thought (PoT) output from various LLMs on both the validation and test sets of KnowledgeMath can be found at the outputs
directory.
We develop a heuristic-based method to automatically evaluate the accuracy of CoT and PoT outputs:
scripts/evaluate_cot_output.sh
for evaluating CoT outputsscripts/evaluate_pot_output.sh
for evaluating PoT outputs
To get the results on the test set, please send your result json file to this email (see the leaderboard section below for more details).
The leaderboard is continuously being updated. To submit your results to leaderboard, please send your result json file on the test set to this email. Please follow the format of the sample submission file.
For any issues or questions, kindly email us at: Yilun Zhao (yilun.zhao@yale.edu).
If you use the KnowledgeMath dataset in your work, please kindly cite the paper:
@misc{zhao2023knowledgemath,
title={KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains},
author={Yilun Zhao and Hongjun Liu and Yitao Long and Rui Zhang and Chen Zhao and Arman Cohan},
year={2023},
eprint={2311.09797},
archivePrefix={arXiv},
primaryClass={cs.CL}
}