This repository contains the public reproduction package for Hallucination Detection from Structural Reasoning Model.
Top-level layout:
pipeline/          Core SRM pipeline: data I/O, DAG parsing, FLOW scoring, metrics, PR plots, prompts.
ablations/         Reproducible ablation workflows. These scripts compute tables from records/scores rather than printing static CSVs.
data/              Clean public logic records, public score tables, and generated outputs.
run.py             Main command-line entry point.
api.jsonl          API/local-model environment-variable template. No private keys or private paths are stored.
requirements.txt
Repository code is released under the MIT License. The released logic records
and score tables are derived from GSM8K, MATH, HumanEval, and HotpotQA; see
DATA_NOTICE.md for upstream dataset licenses, attribution, and citation
requirements. In particular, HotpotQA-derived records should be treated as
CC BY-SA 4.0 derived material.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The Qwen local-verifier ablation requires a CUDA-capable PyTorch environment and local/HuggingFace model weights. Set the path through environment variables instead of editing code:
export QWEN_JUDGE_MODEL_NAME=Qwen3.5-9B
export QWEN_JUDGE_MODEL_PATH=/path/to/Qwen3.5-9B
export QWEN_JUDGE_DEVICE_MAP=auto
export QWEN_JUDGE_TORCH_DTYPE=auto

Compute metrics from the released score table:
python run.py --stage metrics --dataset math --model gpt

If data/score_tables/ has been removed, regenerate all public score tables
from the released clean logic records:
python pipeline/score_tables.py

flow_plus is populated when a matching *__golden_dag.json record is released
for the same instance; otherwise that cell is left blank.
Regenerate a PR curve:
python run.py --stage plot --dataset gsm8k --model gpt

Compute FLOW from clean logic records:
python run.py --stage score --dataset math --model gpt --variant base

The released data/logic/**.json files already contain clean structured samples and step_labels.gpt, so the main reproduction commands above do not call any model API. If you want to resample structured reasoning traces or re-run step judging, first configure the OpenAI-compatible endpoint in api.jsonl.
Recommended setup keeps secrets in environment variables:
export OPENAI_BASE_URL=https://your-openai-compatible-endpoint/v1
export OPENAI_API_KEY=your_api_key
export OPENAI_MODEL=your-model-name

api.jsonl also contains optional local fields (base_url, api_key, model) for private local runs. Do not commit real API keys.
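The setup above can be illustrated with a minimal sketch of how a script might resolve the endpoint settings. The helper name resolve_endpoint and the precedence rule (environment variables first, then the optional local fields) are assumptions made for illustration; the released pipeline may resolve these values differently.

```python
import os
from typing import Mapping, Optional

# Hypothetical helper, not part of the released package.
# Maps each environment variable to the optional local field named in api.jsonl.
ENV_TO_FIELD = {
    "OPENAI_BASE_URL": "base_url",
    "OPENAI_API_KEY": "api_key",
    "OPENAI_MODEL": "model",
}


def resolve_endpoint(
    local_fields: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return base_url/api_key/model, preferring environment variables.

    The precedence order is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    local_fields = local_fields or {}
    resolved = {}
    for env_name, field in ENV_TO_FIELD.items():
        value = env.get(env_name) or local_fields.get(field)
        if not value:
            raise ValueError(f"Missing {env_name} (or local field '{field}')")
        resolved[field] = value
    return resolved
```

Passing env explicitly keeps the function easy to test without touching the real process environment, and failing early on a missing value avoids sending requests with an empty key.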
Resample structured records from existing question records:
python pipeline/sample.py \
  --input data/logic/gsm8k/gpt-4o-mini__base.json \
  --dataset gsm8k \
  --output-model my-sampled-model \
  --limit 20

By default this writes data/logic/<dataset>/<output-model>__<variant>.json. The output keeps the same clean schema as the released logic files, with conditions, cots, sources, golden_label, and an empty step_labels object.
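The naming convention above can be expressed as a short sketch. Both helper names are hypothetical and are not taken from pipeline/sample.py; only the path pattern data/logic/<dataset>/<output-model>__<variant>.json comes from this README.

```python
from pathlib import Path
from typing import Tuple

# Root directory for logic records, as used throughout this README.
LOGIC_ROOT = Path("data/logic")


def default_output_path(dataset: str, output_model: str, variant: str) -> Path:
    """Build data/logic/<dataset>/<output-model>__<variant>.json (hypothetical helper)."""
    return LOGIC_ROOT / dataset / f"{output_model}__{variant}.json"


def split_logic_filename(path: str) -> Tuple[str, str]:
    """Recover (model, variant) from a logic file name.

    Splits on the last double underscore, so model names that contain
    hyphens (e.g. gpt-4o-mini) are preserved.
    """
    model, _, variant = Path(path).stem.rpartition("__")
    if not model or not variant:
        raise ValueError(f"Not a <model>__<variant>.json name: {path}")
    return model, variant
```

For example, default_output_path("gsm8k", "my-sampled-model", "base") yields the same path that the judging command in this README takes as --input.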
Run API-based step judging on sampled records:
python pipeline/judge.py \
  --input data/logic/gsm8k/my-sampled-model__base.json \
  --label-name gpt \
  --overwrite \
  --save-run-metadata

pipeline/judge.py reads pipeline/prompts/judging/local_step_verifier.txt. Because the OpenAI-compatible API path does not compute token logits, it uses temperature 0.0 by default and parses the sampled final label from {0, 0.5, 1} as the step label.
Use --save-run-metadata for publishable reruns so raw judge outputs and parsed
labels are auditable.
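As an illustration of the label-parsing step described above, the following sketch extracts the last standalone 0, 0.5, or 1 from a judge response. It is an assumption about how such parsing could work, not the code in pipeline/judge.py; the function name parse_step_label and the "last match wins" rule are hypothetical.

```python
import re
from typing import Optional

# Matches a standalone 0, 0.0, 0.5, 1, or 1.0. The lookarounds reject digits
# that are part of a larger token such as "Step1", "10", or "0.75", while
# still accepting a label followed by sentence punctuation ("Label: 1.").
_LABEL = re.compile(r"(?<![\w.])(0\.5|0\.0|1\.0|0|1)(?!\.?\w)")


def parse_step_label(judge_output: str) -> Optional[float]:
    """Return the last allowed label in the text, or None if none is found."""
    matches = _LABEL.findall(judge_output)
    if not matches:
        return None
    return float(matches[-1])
```

Returning None rather than a default value lets the caller decide how to handle unparseable responses, which is the kind of case that saved run metadata helps audit.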
After judging, the generated file can be used by the normal FLOW scoring path:
python run.py --stage score --dataset gsm8k --model my-sampled-model --variant base

Aggregation table for Section 6.2:
python ablations/aggregation_comparison.py --datasets gsm8k math hotpotqa human_eval

DAG perturbation / graph-quality ablation:
python ablations/dag_perturbation.py --datasets gsm8k math

Length-bucket sensitivity:
python ablations/length_sensitivity.py --datasets gsm8k math hotpotqa human_eval

Node-sensitivity analysis:
python ablations/node_sensitivity.py --datasets gsm8k math hotpotqa human_eval --max-flips-per-bin 500

Cost diagnostics from released logic records:
python ablations/cost_summary.py --datasets gsm8k math

Structured DAG/Golden-DAG diagnostics from released records:
python ablations/structured_baselines.py --datasets gsm8k math hotpotqa human_eval

The paper also reports external/private baseline runs such as RACE, PARC, and GoV. Their per-instance artifacts are intentionally not included in this public package. The released code and data reproduce SRM/FLOW, inverse-flow, aggregation, graph-perturbation, length, node-sensitivity, cost diagnostics, and the released subset of FLOW+ from the public records.
Generated outputs are written under data/tmp_results/, data/results/, or
outputs/ depending on the script. These directories are intentionally excluded
from the release archive and from recursive external-baseline data scans.
Run local Qwen step verification over a clean logic file:
python ablations/qwen_local_judge.py \
  --input data/logic/math/gpt-4o-mini__base.json \
  --output data/qwen_judge/math/gpt-4o-mini__base.json \
  --label-name qwen \
  --mode cited

Then compute the Qwen verifier appendix table:
python ablations/qwen_verifier_table.py --label-name qwen

By default, qwen_local_judge.py writes a clean JSON array and only adds step_labels.qwen. Use --save-run-metadata only when you need raw model outputs/probabilities for audit logs.
The Qwen script reads the same default prompt file as the API judge: pipeline/prompts/judging/local_step_verifier.txt.
Released data/logic/**.json files are pretty-printed JSON arrays. Each record keeps the public clean fields:
{
  "id": "...",
  "dataset": "math",
  "model": "gpt-4o-mini",
  "variant": "base",
  "question": "...",
  "conditions": ["..."],
  "cots": ["..."],
  "sources": ["Condition1", "Condition1 & Step1"],
  "golden_label": 0,
  "step_labels": {"gpt": [0.0, 1.0]}
}

golden_label = 1 is the positive hallucination/error class. Step labels are error-oriented: 0.0 correct, 0.5 uncertain, 1.0 incorrect. The private baseline columns used in the paper's comparison table are not stored in this release.
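To make the record schema concrete, here is a minimal sketch that parses the sources strings and checks the label conventions stated above. The function names are hypothetical, and the sketch assumes that every sources entry is a list of ConditionN / StepN references joined by "&", as in the example record; the released DAG parser in pipeline/ may accept other forms.

```python
import re
from typing import Dict, List, Tuple

# One reference such as "Condition1" or "Step3" (format assumed from the example record).
_REF = re.compile(r"^(Condition|Step)(\d+)$")
ALLOWED_STEP_LABELS = {0.0, 0.5, 1.0}


def parse_sources(sources: List[str]) -> List[List[Tuple[str, int]]]:
    """Turn ["Condition1 & Step1", ...] into [[("Condition", 1), ("Step", 1)], ...]."""
    parsed = []
    for entry in sources:
        refs = []
        for token in entry.split("&"):
            match = _REF.match(token.strip())
            if match is None:
                raise ValueError(f"Unrecognized source reference: {token!r}")
            refs.append((match.group(1), int(match.group(2))))
        parsed.append(refs)
    return parsed


def check_record(record: Dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if record["golden_label"] not in (0, 1):
        problems.append("golden_label must be 0 or 1")
    for name, labels in record["step_labels"].items():
        bad = [x for x in labels if x not in ALLOWED_STEP_LABELS]
        if bad:
            problems.append(f"step_labels.{name} has unexpected values: {bad}")
    return problems
```

A typical use would be to load one released file with json.load, call check_record on each element of the array, and inspect any non-empty result before running the scoring stage.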