📃 Paper | 🤗 QFFT-7B | 🤗 QFFT-32B | 📚 QFFT Datasets
The complete code is coming soon!
Welcome to the official repository for QFFT (Question-Free Fine-Tuning for Adaptive Reasoning)!
QFFT is a novel, efficient fine-tuning method that equips large language models with adaptive reasoning ability. Instead of training on (Question, Reasoning) pairs as in traditional Supervised Fine-Tuning (SFT), QFFT discards the question input and learns solely from the reasoning response, especially Long CoT outputs.
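To make the data difference concrete, here is a minimal, illustrative sketch of how an SFT example and a QFFT example differ. The field names and `<think>` delimiters are assumptions for illustration, not the repo's exact data format:

```python
# Minimal illustrative sketch (not the repo's exact data format): SFT
# supervises the response conditioned on the question, while QFFT drops
# the question and supervises the reasoning response alone. Field names
# and <think> delimiters are assumptions.
sft_example = {
    "instruction": "What is 17 * 24?",  # the question is part of the input
    "output": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think> The answer is 408.",
}

qfft_example = {
    "instruction": "",  # the question is removed entirely
    "output": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think> The answer is 408.",
}
```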
QFFT enables models to:
- Preserve Short CoT for simple tasks (efficiency)
- Trigger Long CoT only when needed (effectiveness)
- Reduce overthinking by minimizing unnecessary reasoning
- Improve robustness in noisy, low-resource, and out-of-domain scenarios
We open-source our models, data, and code in this repository.

To set up the environment:

```bash
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install vllm bitsandbytes flashinfer-python==0.2.2.post1
pip install latex2sympy2 word2number
```
| Model Name | Base LLM | Link |
|---|---|---|
| QFFT-S1-7B | Qwen2.5-7B-Instruct | HF Link |
| QFFT-S1-32B | Qwen2.5-32B-Instruct | HF Link |
| QFFT-LIMO-7B | Qwen2.5-7B-Instruct | HF Link |
| QFFT-LIMO-32B | Qwen2.5-32B-Instruct | HF Link |
QFFT trains on responses distilled from strong Long CoT models (e.g., DeepSeek-R1); during QFFT training, the input questions are removed entirely (see the sketch after the table below).
| Dataset | Size | Link |
|---|---|---|
| S1.1 | 1k | HF Link |
| LIMO | 871 | HF Link |
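As a concrete illustration of this construction, the following sketch converts a generic (question, response) dataset into question-free training data. The file name and field names are placeholders, not the released datasets' actual schema:

```python
# Illustrative sketch: building question-free training data from distilled
# Long CoT (question, response) pairs. The file name and field names are
# placeholders; check the released QFFT datasets for the actual schema.
from datasets import load_dataset

ds = load_dataset("json", data_files="distilled_long_cot.json", split="train")

def drop_question(example):
    # Keep only the reasoning response; discard the question entirely.
    return {"instruction": "", "output": example["response"]}

qfft_ds = ds.map(drop_question, remove_columns=ds.column_names)
```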
To train a model with QFFT, use `llamafactory-cli` and the provided YAML configs:

```bash
llamafactory-cli train examples/train_qfft/train_s1_qfft.yaml
llamafactory-cli train examples/train_qfft/train_limo_qfft.yaml
```
This codebase is built on LLaMA-Factory. Our key modification lies in the template system: we implement a new QFFT template in `/src/llamafactory/data/template.py` (see line 1569 for details).
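For orientation, a question-free template might be registered along the following lines. This is a hedged sketch using upstream LLaMA-Factory's `_register_template` and formatter conventions; the actual QFFT template at line 1569 may differ:

```python
# Hedged sketch, not the repo's actual code: a template whose user turn
# renders nothing, so the loss is computed only on the assistant's
# reasoning response. _register_template, EmptyFormatter, and
# StringFormatter follow upstream LLaMA-Factory's template.py conventions.
_register_template(
    name="qfft",
    format_user=EmptyFormatter(),  # the question contributes no tokens
    format_assistant=StringFormatter(slots=["{{content}}"]),
)
```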
You can evaluate QFFT models on benchmarks (e.g., GSM8K, MATH, AIME) with inference engines such as vLLM or SGLang.
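For example, a minimal vLLM generation loop might look like this. The model path is a placeholder and the sampling settings are assumptions, since the official evaluation code is not yet released:

```python
# Minimal vLLM sketch for sampling from a QFFT model. The model path is a
# placeholder and the sampling hyperparameters are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/QFFT-S1-7B")  # placeholder: point to the released checkpoint
params = SamplingParams(temperature=0.0, max_tokens=16384)
prompts = ["Janet has 3 apples and buys 2 more. How many apples does she have?"]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```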
We also propose a novel metric, RAK (Reasoning Adaptability Kappa), to evaluate reasoning adaptability.
The evaluation code is coming soon!
Here are the main results comparing SFT and QFFT on three mathematical reasoning benchmarks (GSM8K, MATH, AIME25):
**7B models (Qwen2.5-7B-Instruct)**

| Dataset | Method | GSM8K Acc | GSM8K Tokens | MATH Acc | MATH Tokens | AIME25 Acc | AIME25 Tokens | Avg Acc | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|
| S1.1 | SFT | 90.6 | 1.7K | 80.8 | 5.3K | 18.2 | 17.7K | 63.2 | 8.2K |
| | QFFT | 91.0 | 0.4K | 80.2 | 2.8K | 17.2 | 12.8K | 62.8 | 5.3K |
| | Δ | +0.4 | -76.5% | -0.6 | -47.2% | -1.0 | -27.7% | -0.4 | -50.5% |
| LIMO | SFT | 88.2 | 1.8K | 80.4 | 5.8K | 16.8 | 17.1K | 61.8 | 8.2K |
| | QFFT | 88.0 | 0.7K | 80.6 | 4.1K | 17.2 | 15.6K | 61.9 | 6.8K |
| | Δ | -0.2 | -61.1% | +0.2 | -29.3% | +0.4 | -8.8% | +0.1 | -33.1% |
**32B models (Qwen2.5-32B-Instruct)**

| Dataset | Method | GSM8K Acc | GSM8K Tokens | MATH Acc | MATH Tokens | AIME25 Acc | AIME25 Tokens | Avg Acc | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|
| S1.1 | SFT | 92.8 | 2.1K | 93.1 | 4.1K | 48.6 | 16.2K | 78.2 | 7.5K |
| | QFFT | 93.6 | 0.6K | 92.2 | 2.4K | 46.8 | 12.9K | 77.5 | 5.3K |
| | Δ | +0.8 | -71.4% | -0.9 | -41.5% | -1.8 | -20.4% | -0.6 | -44.4% |
| LIMO | SFT | 91.2 | 1.9K | 93.0 | 3.9K | 45.8 | 13.2K | 76.6 | 6.3K |
| | QFFT | 92.6 | 0.8K | 92.6 | 2.9K | 45.0 | 12.5K | 76.7 | 5.4K |
| | Δ | +1.4 | -57.9% | -0.4 | -25.6% | -0.8 | -5.3% | +0.1 | -29.6% |
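The Δ token columns report the relative change in average output length versus SFT, i.e. (QFFT − SFT) / SFT. A quick check:

```python
# Sanity check for the Δ token-reduction figures: relative change vs. SFT.
def rel_change(sft_tokens: float, qfft_tokens: float) -> str:
    return f"{(qfft_tokens - sft_tokens) / sft_tokens:+.1%}"

print(rel_change(1.7, 0.4))    # -76.5%  (S1.1, 7B, GSM8K)
print(rel_change(17.1, 15.6))  # -8.8%   (LIMO, 7B, AIME25)
```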
If you find QFFT useful, please cite:

```bibtex
@misc{liu2025qfft,
      title={QFFT, Question-Free Fine-Tuning for Adaptive Reasoning},
      author={Wanlong Liu and Junxiao Xu and Fei Yu and Yukang Lin and Ke Ji and Wenyu Chen and Yan Xu and Yasheng Wang and Lifeng Shang and Benyou Wang},
      year={2025},
      eprint={2506.12860},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.12860},
}
```