bash environment/setup_omanic_sft_env.sh
conda activate omanic_sft

bash environment/setup_omanic_rl_env.sh
conda activate omanic_rl

Download the raw dataset files OmanicSynth.jsonl and OmanicBench.jsonl.

python data/download_omanic.py

For SFT, convert the raw dataset files into OmanicSynth_sft.json and OmanicBench_sft.json.

python data/covert_to_sft.py

For RL, convert the raw dataset files into OmanicSynth_rl.json and OmanicBench_rl.json.
The converted RL data uses data_source="omanic", which routes reward computation to verl/utils/reward_score/omanic.py.
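The routing described above can be pictured as a lookup from the data_source field to a scoring function. The sketch below is only an illustration under stated assumptions: `omanic_score`, `SCORERS`, and `compute_score` are hypothetical names, and the exact-match rule is a stand-in, since the actual logic in verl/utils/reward_score/omanic.py is not shown here.

```python
from typing import Callable, Dict


def omanic_score(solution_str: str, ground_truth: str) -> float:
    """Toy reward (assumption): 1.0 on a case-insensitive exact match, else 0.0."""
    return float(solution_str.strip().lower() == ground_truth.strip().lower())


# Hypothetical registry keyed by the data_source value stored in each RL record.
SCORERS: Dict[str, Callable[[str, str], float]] = {"omanic": omanic_score}


def compute_score(data_source: str, solution_str: str, ground_truth: str) -> float:
    """Select the scorer registered for data_source and apply it."""
    scorer = SCORERS.get(data_source)
    if scorer is None:
        raise NotImplementedError(f"no reward function for data_source={data_source!r}")
    return scorer(solution_str, ground_truth)
```

With this layout, a record tagged data_source="omanic" is scored by the Omanic-specific function, and an unrecognized tag fails loudly instead of silently receiving a default reward.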
python data/convert_to_rl.py

conda activate omanic_sft
cd LlamaFactory

For Llama-3.3-70B:
nohup bash -c '
module load cuda/12.2.2
export CUDA_HOME=$CUDA_PATH
source ~/miniconda3/etc/profile.d/conda.sh
conda activate omanic_sft
export FORCE_TORCHRUN=1 NNODES=1 NPROC_PER_NODE=4
llamafactory-cli train examples/train_lora/llama70B_omanic.yaml
' > train_llama70B_omanic.log 2> train_llama70B_omanic.err < /dev/null &

For Qwen3-8B:
nohup bash -c '
module load cuda/12.2.2
export CUDA_HOME=$CUDA_PATH
source ~/miniconda3/etc/profile.d/conda.sh
conda activate omanic_sft
export FORCE_TORCHRUN=1 NNODES=1 NPROC_PER_NODE=4
llamafactory-cli train examples/train_full/qwen3_8B_oamnic.yaml
' > train_qwen3_8B_oamnic.log 2> train_qwen3_8B_oamnic.err < /dev/null &

conda activate omanic_rl
cd verl
module load cuda/12.2.2
export CUDA_HOME=$CUDA_PATH
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nohup bash examples/grpo_trainer/run_qwen3-8b_omanic.sh > grpo_train.log 2>&1 < /dev/null &

LoRA merge is handled in memory during evaluation.
python eval/local_eval.py \
--base-model meta-llama/Llama-3.3-70B-Instruct \
--mode direct \
--lora-path LlamaFactory/saves/llama3.3-70b/lora \
--input data/OmanicBench.jsonl \
--batch-size 256

Use --mode cot for chain-of-thought style evaluation.

The command below evaluates the fully fine-tuned Qwen3-8B model directly, with no LoRA merge.
python eval/local_eval.py \
--model-path LlamaFactory/saves/qwen3-8b/full \
--mode direct \
--input data/OmanicBench.jsonl \
--batch-size 256

Set your OpenRouter API key in the shell before running the script:

export OPENROUTER_API_KEY="your_openrouter_api_key"

The default input file is data/OmanicBench.jsonl, and results are written to eval/results.
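Before launching a long API evaluation, it can help to confirm that the key and the input file are in place. The following is a minimal sketch, not part of the repository: `preflight` is a hypothetical helper, and it only assumes the environment variable name and default input path stated above.

```python
import os
from pathlib import Path


def preflight(input_path: str = "data/OmanicBench.jsonl", env=os.environ) -> list:
    """Return a list of problems found; an empty list means the run can start."""
    problems = []
    # The script is said to read the key from OPENROUTER_API_KEY.
    if not env.get("OPENROUTER_API_KEY"):
        problems.append("OPENROUTER_API_KEY is not set")
    # The default benchmark file must exist before evaluation.
    if not Path(input_path).is_file():
        problems.append(f"input file not found: {input_path}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Passing `env` explicitly keeps the check easy to exercise without touching the real shell environment.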
Specify a single model with --model. You can use either the full OpenRouter model ID or a supported alias.
Examples:
python eval/open_eval.py \
--model openai/gpt-5.4 \
--mode direct
python eval/open_eval.py \
--model anthropic/claude-sonnet-4.6 \
--mode cot

Use --model all to evaluate every model listed in eval/open_eval.py:
python eval/open_eval.py \
--model all \
--mode direct

You can also override the default input path when needed:
python eval/open_eval.py \
--model GPT-4o \
--mode direct \
--input data/OmanicBench.jsonl

For any inquiries, please reach out at peettherapynoys@gmail.com.

If you find Omanic useful for your research and applications, please cite:
@article{gu2026omanic,
title={Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models},
author={Gu, Xiaojie and Tong, Sherry T and Feng, Aosong and Han, Sophia Simeng and Lu, Jinghui and Chen, Yingjian and Iwasawa, Yusuke and Matsuo, Yutaka and Park, Chanjun and Ying, Rex and Li, Irene},
journal={arXiv preprint arXiv:2603.16654},
year={2026}
}