soncheinbok/FlowScore


Structural Reasoning Model Hallucination Detection

This repository contains the public reproduction package for Hallucination Detection from Structural Reasoning Model.

Top-level layout:

pipeline/      Core SRM pipeline: data I/O, DAG parsing, FLOW scoring, metrics, PR plots, prompts.
ablations/     Reproducible ablation workflows. These scripts compute tables from records/scores rather than printing static CSVs.
data/          Clean public logic records, public score tables, and generated outputs.
run.py         Main command-line entry point.
api.jsonl      API/local-model environment-variable template. No private keys or private paths are stored.
requirements.txt  Python dependencies installed in the Install step below.

License and data notice

Repository code is released under the MIT License. The released logic records and score tables are derived from GSM8K, MATH, HumanEval, and HotpotQA; see DATA_NOTICE.md for upstream dataset licenses, attribution, and citation requirements. In particular, HotpotQA-derived records should be treated as CC BY-SA 4.0 derived material.

Install

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The Qwen local-verifier ablation requires a CUDA-capable PyTorch environment and Qwen model weights, either stored locally or downloaded from Hugging Face. Set the model path through environment variables instead of editing code:

export QWEN_JUDGE_MODEL_NAME=Qwen3.5-9B
export QWEN_JUDGE_MODEL_PATH=/path/to/Qwen3.5-9B
export QWEN_JUDGE_DEVICE_MAP=auto
export QWEN_JUDGE_TORCH_DTYPE=auto

Main reproduction commands

Compute metrics from the released score table:

python run.py --stage metrics --dataset math --model gpt

If data/score_tables/ has been removed, regenerate all public score tables from the released clean logic records:

python pipeline/score_tables.py

flow_plus is populated when a matching *__golden_dag.json record is released for the same instance; otherwise that cell is left blank.

Regenerate a PR curve:

python run.py --stage plot --dataset gsm8k --model gpt
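
To sanity-check a regenerated curve by hand, the following is a minimal sketch of how precision-recall points can be computed from per-instance scores and golden_label values. It assumes that a higher score means a more likely hallucination (golden_label = 1 is the positive class in this release). The helper names pr_curve and average_precision are illustrative and are not taken from the repository, whose own metric code may differ.

```python
def pr_curve(scores, labels):
    """Return (precision, recall) points, sweeping the threshold from high to low scores."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct score so tied scores share a threshold.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((tp / (tp + fp), tp / total_pos))
    return points


def average_precision(points):
    """Step-wise area under the PR points: sum of precision * recall increment."""
    ap = 0.0
    prev_recall = 0.0
    for precision, recall in points:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


if __name__ == "__main__":
    pts = pr_curve([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
    print(pts, average_precision(pts))
```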

Compute FLOW from clean logic records:

python run.py --stage score --dataset math --model gpt --variant base

Regenerate samples and judge labels

The released data/logic/**.json files already contain clean structured samples and step_labels.gpt, so the main reproduction commands above do not call any model API. If you want to resample structured reasoning traces or re-run step judging, first configure the OpenAI-compatible endpoint in api.jsonl.

Recommended setup keeps secrets in environment variables:

export OPENAI_BASE_URL=https://your-openai-compatible-endpoint/v1
export OPENAI_API_KEY=your_api_key
export OPENAI_MODEL=your-model-name

api.jsonl also contains optional local fields (base_url, api_key, model) for private local runs. Do not commit real API keys.
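
As an illustration of the environment-variable setup, the sketch below collects the three variables into the base_url, api_key, and model fields and fails loudly if any is unset. The function name load_endpoint_config is hypothetical; the repository's own loader may resolve api.jsonl and the environment differently.

```python
import os

REQUIRED_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL")


def load_endpoint_config(env=None):
    """Read endpoint settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "base_url": env["OPENAI_BASE_URL"],
        "api_key": env["OPENAI_API_KEY"],
        "model": env["OPENAI_MODEL"],
    }
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.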

Resample structured records from existing question records:

python pipeline/sample.py \
  --input data/logic/gsm8k/gpt-4o-mini__base.json \
  --dataset gsm8k \
  --output-model my-sampled-model \
  --limit 20

By default this writes data/logic/<dataset>/<output-model>__<variant>.json. The output keeps the same clean schema as the released logic files, with conditions, cots, sources, golden_label, and an empty step_labels object.
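
The default naming pattern can be sketched as a small path builder. This is only an illustration of the data/logic/<dataset>/<output-model>__<variant>.json convention described above; default_output_path is a hypothetical name, not a function from pipeline/sample.py.

```python
from pathlib import PurePosixPath


def default_output_path(dataset, output_model, variant, root="data/logic"):
    """Build <root>/<dataset>/<output-model>__<variant>.json (hypothetical helper)."""
    return PurePosixPath(root) / dataset / f"{output_model}__{variant}.json"


if __name__ == "__main__":
    print(default_output_path("gsm8k", "my-sampled-model", "base"))
```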

Run API-based step judging on sampled records:

python pipeline/judge.py \
  --input data/logic/gsm8k/my-sampled-model__base.json \
  --label-name gpt \
  --overwrite \
  --save-run-metadata

pipeline/judge.py reads pipeline/prompts/judging/local_step_verifier.txt. Because the OpenAI-compatible API path does not compute token logits, it samples at temperature 0.0 by default and takes the final label in the sampled output, one of {0, 0.5, 1}, as the step label. Use --save-run-metadata for publishable reruns so that raw judge outputs and parsed labels remain auditable.
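
The label-parsing step can be sketched as follows. The exact output format requested by the prompt file is not shown here, so this sketch assumes the label is the last standalone occurrence of 0, 0.5, or 1 in the raw text. Both the regular expression and the name parse_final_label are illustrative, not the parser used by pipeline/judge.py.

```python
import re

# Matches 0, 0.5, 1 (optionally written 0.0 / 1.0) when not embedded in a
# longer number or word such as "10", "2.1", or "v1".
_LABEL = re.compile(r"(?<![\w.])(0\.5|[01](?:\.0+)?)(?![\w]|\.\d)")


def parse_final_label(raw_output):
    """Return the last label from {0.0, 0.5, 1.0} found in raw_output, or None."""
    matches = _LABEL.findall(raw_output)
    if not matches:
        return None
    return float(matches[-1])


if __name__ == "__main__":
    print(parse_final_label("The step misapplies the rule. Final label: 1"))
```

Returning None instead of guessing a default keeps unparseable judge outputs visible, which matters when reruns are audited.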

After judging, the generated file can be used by the normal FLOW scoring path:

python run.py --stage score --dataset gsm8k --model my-sampled-model --variant base

Ablation workflows

Aggregation table (Section 6.2 of the paper):

python ablations/aggregation_comparison.py --datasets gsm8k math hotpotqa human_eval

DAG perturbation / graph-quality ablation:

python ablations/dag_perturbation.py --datasets gsm8k math

Length-bucket sensitivity:

python ablations/length_sensitivity.py --datasets gsm8k math hotpotqa human_eval

Node-sensitivity analysis:

python ablations/node_sensitivity.py --datasets gsm8k math hotpotqa human_eval --max-flips-per-bin 500

Cost diagnostics from released logic records:

python ablations/cost_summary.py --datasets gsm8k math

Structured DAG/Golden-DAG diagnostics from released records:

python ablations/structured_baselines.py --datasets gsm8k math hotpotqa human_eval

The paper also reports external/private baseline runs such as RACE, PARC, and GoV. Their per-instance artifacts are intentionally not included in this public package. From the public records, the released code and data reproduce the SRM/FLOW, inverse-flow, aggregation, graph-perturbation, length-sensitivity, node-sensitivity, and cost-diagnostic results, plus the released subset of FLOW+.

Generated outputs are written under data/tmp_results/, data/results/, or outputs/ depending on the script. These directories are intentionally excluded from the release archive and from recursive external-baseline data scans.

Qwen3.5-9B as local judge

Run local Qwen step verification over a clean logic file:

python ablations/qwen_local_judge.py \
  --input data/logic/math/gpt-4o-mini__base.json \
  --output data/qwen_judge/math/gpt-4o-mini__base.json \
  --label-name qwen \
  --mode cited

Then compute the Qwen verifier appendix table:

python ablations/qwen_verifier_table.py --label-name qwen

By default, qwen_local_judge.py writes a clean JSON array and only adds step_labels.qwen. Use --save-run-metadata only when you need raw model outputs/probabilities for audit logs. The Qwen script reads the same default prompt file as the API judge: pipeline/prompts/judging/local_step_verifier.txt.
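
What "only adds step_labels.qwen" means for a record can be sketched as below: existing label sets such as step_labels.gpt are left untouched, and a new named list is attached next to them. The helper add_step_labels and its overwrite guard are assumptions made for illustration; the actual scripts may handle existing labels differently.

```python
def add_step_labels(record, label_name, labels, overwrite=False):
    """Attach a named list of step labels without touching other label sets."""
    step_labels = record.get("step_labels", {})
    if label_name in step_labels and not overwrite:
        raise ValueError(f"step_labels.{label_name} already exists")
    # Copy rather than mutate so the caller's record stays unchanged.
    updated = dict(record)
    updated["step_labels"] = {**step_labels, label_name: list(labels)}
    return updated


if __name__ == "__main__":
    record = {"id": "example", "step_labels": {"gpt": [0.0, 1.0]}}
    print(add_step_labels(record, "qwen", [0.0, 0.5]))
```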

Data format

Released data/logic/**.json files are pretty-printed JSON arrays. Each record keeps the public clean fields:

{
  "id": "...",
  "dataset": "math",
  "model": "gpt-4o-mini",
  "variant": "base",
  "question": "...",
  "conditions": ["..."],
  "cots": ["..."],
  "sources": ["Condition1", "Condition1 & Step1"],
  "golden_label": 0,
  "step_labels": {"gpt": [0.0, 1.0]}
}
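
The sources field in the record above can be read as a list of parent references for each reasoning step. The sketch below turns it into a parent map under two assumptions that are not confirmed by the release notes: the i-th sources entry belongs to the i-th step, and references are joined with "&". The function names are illustrative and are not the repository's DAG parser.

```python
def parse_sources(sources):
    """Map Step<i> to the list of node names it cites (assumed layout)."""
    parents = {}
    for index, entry in enumerate(sources, start=1):
        cited = [name.strip() for name in entry.split("&") if name.strip()]
        parents[f"Step{index}"] = cited
    return parents


def cites_only_earlier_steps(parents):
    """True if every step cites only conditions or strictly earlier steps."""
    for node, cited in parents.items():
        own_index = int(node[len("Step"):])
        for name in cited:
            suffix = name[len("Step"):]
            if name.startswith("Step") and suffix.isdigit() and int(suffix) >= own_index:
                return False
    return True


if __name__ == "__main__":
    graph = parse_sources(["Condition1", "Condition1 & Step1"])
    print(graph, cites_only_earlier_steps(graph))
```

Checking that steps cite only earlier nodes is a cheap way to confirm the parsed graph is acyclic before any scoring is attempted.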

golden_label = 1 is the positive hallucination/error class. Step labels are error-oriented: 0.0 correct, 0.5 uncertain, 1.0 incorrect. The private baseline columns used in the paper's comparison table are not stored in this release.
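
A quick consistency check for records in this format could look like the sketch below. It uses only the field names and label values listed above and assumes golden_label is binary (0 or 1). validate_record is a hypothetical helper, not part of the released pipeline.

```python
REQUIRED_FIELDS = (
    "id", "dataset", "model", "variant", "question",
    "conditions", "cots", "sources", "golden_label", "step_labels",
)
ALLOWED_STEP_LABELS = {0.0, 0.5, 1.0}  # 0.0 correct, 0.5 uncertain, 1.0 incorrect


def validate_record(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "golden_label" in record and record["golden_label"] not in (0, 1):
        problems.append("golden_label must be 0 or 1")
    for label_name, labels in record.get("step_labels", {}).items():
        bad = [value for value in labels if value not in ALLOWED_STEP_LABELS]
        if bad:
            problems.append(f"step_labels.{label_name} has unexpected values: {bad}")
    return problems


if __name__ == "__main__":
    example = {
        "id": "...", "dataset": "math", "model": "gpt-4o-mini", "variant": "base",
        "question": "...", "conditions": ["..."], "cots": ["..."],
        "sources": ["Condition1", "Condition1 & Step1"],
        "golden_label": 0, "step_labels": {"gpt": [0.0, 1.0]},
    }
    print(validate_record(example))
```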

About

Official code and data for Hallucination Detection from Structural Reasoning Model (ICML 2026).
