This dataset contains mathematical problems paired with both tool-integrated reasoning (TIR) traces and corresponding chain-of-thought (CoT) traces, distilled via agent-based pipelines. It is designed for fine-tuning large language models on step-by-step mathematical reasoning and tool-augmented problem solving.
We generate SFT data from multiple data sources to ensure diverse and challenging coverage across mathematical domains. The initial training set consolidates examples from several established benchmarks, as collected by ToRL (Li et al., 2025), including:
- NuminaMATH (Jia et al., 2024)
- MATH (Hendrycks et al., 2021)
- DeepScaleR (Luo et al., 2025)
To mitigate data leakage, we further clean the corpus by filtering out any training example whose question text shares a 10-gram with any question in our test sets. This deduplication step ensures a fair and reliable assessment of generalization performance. The resulting corpus contains 25,000 math problems in total. After the TIR-trace filtering process with the Solver Agent, we obtain 11.6k TIR traces, with an overall accuracy of around 46%.
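As a rough illustration, the sketch below shows one way such a 10-gram overlap filter can be implemented; the whitespace tokenization, lowercasing, and function names are assumptions for this example, not the exact pipeline used.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 10) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a question string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_leaked(train_questions: Iterable[str],
                  test_questions: Iterable[str],
                  n: int = 10) -> List[str]:
    """Drop training questions that share any n-gram with a test question."""
    test_grams: Set[Tuple[str, ...]] = set()
    for q in test_questions:
        test_grams |= ngrams(q, n)

    kept = []
    for q in train_questions:
        if ngrams(q, n).isdisjoint(test_grams):
            kept.append(q)
    return kept
```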
The dataset is split into two files: `train_part1.jsonl` and `train_part2.jsonl`. Each entry in these files is a JSON object with at least the following fields:
- `problem`: The text of the math problem statement.
- `TIR trace`: The tool-integrated reasoning (TIR) trace generated by the Solver Agent. This field contains the step-by-step, interleaved plan, tool calls, and intermediate reasoning as executed by the agent using external tools.
- `CoT trace`: The solution trace, rephrased from the corresponding TIR trace by the Rephrase Agent. This field provides clear, step-by-step reasoning and is suitable for use as the target in supervised fine-tuning.
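For reference, a minimal sketch of loading the two JSONL files and reading these fields is shown below; the local file paths are an assumption, and the field names follow the schema documented above.

```python
import json


def load_dataset(paths=("train_part1.jsonl", "train_part2.jsonl")):
    """Read every JSON object from the given JSONL files into a list of dicts."""
    examples = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    examples.append(json.loads(line))
    return examples


examples = load_dataset()
first = examples[0]
print(first["problem"])      # problem statement
print(first["CoT trace"])    # rephrased step-by-step solution
```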
- Supervised Fine-Tuning: The `problem` field can be used as the model input and `CoT trace` as the model output for training models to solve math problems with detailed reasoning.
- Evaluation: The dataset can also be used to benchmark the reasoning and solution-generation capabilities of language models.
```json
{
  "problem": "For each positive number x, let f(x) = ((x + 1/x)^6 - (x^6 + 1/x^6) - 2) / ((x + 1/x)^3 + (x^3 + 1/x^3)). Find the minimum value of f(x).",
  "TIR trace": "<message>\nStep-by-step plan:\n\n1. Simplify the expression: Use algebraic manipulation to rewrite the numerator and denominator in terms of a simpler variable, such as t = x + 1/x.\n2. Express powers in terms of t: Use the binomial theorem or known identities to express (x + 1/x)^n and x^n + 1/x^n in terms of t.\n3. Rewrite f(x) as a function of t: After substitution, f(x) becomes a rational function f(t).\n4. Find the minimum value of f(t): Use calculus (derivative) or algebraic methods to find critical points of f(t) in the domain t ≥ 2.\n\n...\n\nFinal answer: The minimum value is 6.\n</message>",
  "CoT trace": "Step 1: Introduce the substitution t = x + 1/x. Step 2: Express powers in terms of t using identities. Step 3: Rewrite f(x) in terms of t and simplify. Step 4: Analyze f(t) and find the minimum. Final answer: 6."
}
```
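Building on the record format above, here is a minimal sketch of mapping entries to input/output pairs for supervised fine-tuning; the prompt wording in `to_sft_pair` is only an illustrative assumption, not a prescribed template.

```python
def to_sft_pair(example: dict) -> dict:
    """Map one dataset record to an (input, output) pair for SFT.

    The instruction prefix below is an illustrative choice; any chat or
    instruction template used by the target model can be substituted.
    """
    return {
        "input": f"Solve the following problem step by step.\n\n{example['problem']}",
        "output": example["CoT trace"],
    }


# Example usage with a record loaded as shown earlier:
# pair = to_sft_pair(examples[0])
# print(pair["input"])
# print(pair["output"])
```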