Mihir Prabhudesai*,†, Aryan Satpathy*,†, Yangmin Li†, Zheyang Qin†,
Nikash Bhardwaj, Amir Zadehλ, Chuan Liλ, Katerina Fragkiadaki, Deepak Pathak
Carnegie Mellon University · λLambda
*Project co-leads & Equal contribution · †Core contributors
Teaser video: teaser.mp4
We present SIM2REASON: a method for turning physics simulators into scalable generators of question–answer pairs to improve LLM reasoning, removing the need for human annotation in the data-generation pipeline. The core idea is to structure the randomization with a domain-specific language (DSL) and use it to procedurally generate reasoning problems, as illustrated in the examples above. LLMs finetuned on this synthetic data achieve zero-shot improvements on real-world benchmarks such as the International Physics Olympiad.
We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question–answer (QA) pairs—a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question–answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
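To make the core idea concrete, here is a toy illustration in Python (not the paper's actual DSL; the template, parameter ranges, and field names are all invented for illustration): a parameterized problem template is sampled to produce a scene, the ground-truth answer is derived from the scene's physics, and the resulting QA pair can be used as training data with no human annotation:

import math
import random

def projectile_template(rng):
    # Sampled scene parameters -- the randomization a DSL would structure.
    v0 = rng.uniform(5.0, 30.0)      # launch speed in m/s
    theta = rng.uniform(20.0, 70.0)  # launch angle in degrees
    g = 9.81
    # The ground truth follows from the scene itself.
    t_flight = 2 * v0 * math.sin(math.radians(theta)) / g
    horizontal_range = v0 * math.cos(math.radians(theta)) * t_flight
    question = (
        f"A projectile is launched at {v0:.1f} m/s at {theta:.0f} degrees above "
        f"the horizontal. Ignoring air resistance, how far does it travel "
        f"horizontally before landing? (g = 9.81 m/s^2)"
    )
    return {"question": question, "answer": round(horizontal_range, 2)}

rng = random.Random(0)
print(projectile_template(rng))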
Create an environment for data generation with the following commands:
conda create -n pho_data python=3.11
conda activate pho_data
pip install bpy
pip install mujoco imageio
pip install ipdb scipy tabulate pandas matplotlib
pip install hydra-core omegaconf
pip install tqdm wandb
pip install sympy math_verify
pip install transformers pyarrow
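Optionally, verify that the simulation stack imports cleanly with a minimal Python check (it only prints versions):

import bpy      # Blender as a Python module
import mujoco   # MuJoCo physics engine
import imageio  # imported only to confirm installation

print("bpy", bpy.app.version_string, "| mujoco", mujoco.__version__, "| imageio", imageio.__version__)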
Create an environment for training with the following commands:
conda create -n pho_training python=3.12
conda activate pho_training
pip install vllm==0.8.5.post1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install math-verify==0.7.0
pip install latex2sympy2_extended==1.0.9
pip install antlr4-python3-runtime==4.11.0
pip install polars ipdb
cd verl_v4
pip install -e .
pip install transformers==4.52.3
pip install setuptools==69.5.1
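Optionally, confirm that the pinned versions resolved correctly with a minimal Python check:

from importlib.metadata import version

# Pinned packages from the install steps above.
for pkg, want in [("vllm", "0.8.5.post1"), ("transformers", "4.52.3"), ("math-verify", "0.7.0")]:
    got = version(pkg)
    print(f"{pkg}: {got}" + ("" if got == want else f" (expected {want})"))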
Set path to store data and checkpoints:
export PHO_DATA=<path to folder>
export PHO_CHECKPOINT_DIR=<path to folder>
export DATA_VERSION=v1

Generate synthetic scenes by running the following command:
python -m sim.scene_generator scene_generation.num_scenes=1000 scene_generation=all data_version=$DATA_VERSION

Generate QA pairs and filter shortcut questions by running the following commands:
python -m sim.qa_gen_rule data_version=$DATA_VERSION gpt_nlq.num_generations_per_problem=30 numerical=True symbolic=True reverse=False
python -m sim.create_child_scenes data_version=$DATA_VERSION
python -m sim.qa_gen_rule data_version=$DATA_VERSION numerical=True symbolic=True reverse=False gpt_nlq.build_child_scenes=True

Preprocess the generated QA pairs into a format suitable for verl by running the following commands:
python -m sim.write_json data_version=$DATA_VERSION numerical=True symbolic=False reverse=False
python -m sim.write_json data_version=$DATA_VERSION numerical=False symbolic=True reverse=False
python -m llm.preprocess_json_to_parquet --json_names numerical_problems_all_train_without_shortcut.json numerical_problems_all_test_without_shortcut.json symbolic_problems_all_train_without_shortcut.json symbolic_problems_all_test_without_shortcut.json --data_version "$DATA_VERSION" --no_extra_instruction

Upon successfully running the above, <PHO_DATA>/<DATA_VERSION> should be populated with two files: train_<DATA_VERSION>_rl.parquet and test_<DATA_VERSION>_rl.parquet.
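As a quick sanity check, both files can be inspected with pandas from the data-generation environment (a sketch; the exact column schema is set by the preprocessing script, so treat the printed columns as exploratory):

import os
import pandas as pd

data_version = os.environ["DATA_VERSION"]
root = os.path.join(os.environ["PHO_DATA"], data_version)
for split in ("train", "test"):
    df = pd.read_parquet(os.path.join(root, f"{split}_{data_version}_rl.parquet"))
    print(split, df.shape, list(df.columns))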
Train the model by running the following command:

python -m verl_v4.recipe.dapo.main_dapo exps="[dapo_32b,syn_data,q2.5_14b,gspo,use_kl]" trainer.n_gpus_per_node=8 trainer.val_before_train=False sp_size=8 gen_tp=8 trainer.total_epochs=10

This command finetunes Qwen2.5 14B Instruct on the generated synthetic data using the DAPO algorithm. Training info is logged to wandb by default under the project named verl, and checkpoints are saved at PHO_CHECKPOINT_DIR/<run_name>. By default, only the latest global-step checkpoint is kept; older checkpoints are deleted to save storage.
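Since only the latest global-step directory is kept, a small helper like the following can locate the actor checkpoint for the conversion step below (a sketch; <run_name> is a placeholder for your actual run name):

import os
import re

run_dir = os.path.join(os.environ["PHO_CHECKPOINT_DIR"], "<run_name>")  # placeholder
# Collect the numeric step of every global_step_* directory under the run.
steps = [int(m.group(1)) for d in os.listdir(run_dir)
         if (m := re.fullmatch(r"global_step_(\d+)", d))]
print(os.path.join(run_dir, f"global_step_{max(steps)}", "actor"))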
Pretrained Checkpoints: We provide our pretrained checkpoints for various Qwen models on HuggingFace.
Evaluation requires first converting the checkpoint to HuggingFace format by running the following command:
export ACTOR_DIR=$PHO_CHECKPOINT_DIR/<run_name>/global_step_<step>/actor
export TARGET_DIR=<path to save the merged HuggingFace checkpoint>
python verl_v4/scripts/model_merger.py merge \
--backend fsdp \
--local_dir "$ACTOR_DIR" \
--target_dir "$TARGET_DIR"
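After merging, the checkpoint in $TARGET_DIR should load with standard HuggingFace APIs. A lightweight sanity check (config and tokenizer only, to avoid loading the full weights):

import os
from transformers import AutoConfig, AutoTokenizer

target_dir = os.environ["TARGET_DIR"]
config = AutoConfig.from_pretrained(target_dir)
tokenizer = AutoTokenizer.from_pretrained(target_dir)
print(config.model_type, config.num_hidden_layers, type(tokenizer).__name__)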
To evaluate the model's performance on IPhO, download the validation set from the HF link to PHO_DATA/ipho/ipho_numeric_validation_no_instruct.parquet and run:

python -m verl_v4.recipe.dapo.simple_eval \
exps='[dapo_32b,log_all_reward,pass_at_n,simple_eval]' \
model=qwen2.5-3b-instruct data.val_batch_size=null actor_rollout_ref.rollout.val_kwargs.n=8 \
max_response_length=30000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.85 max_model_len=30000 \
data.val_files="[${PHO_DATA}/ipho/ipho_numeric_validation_no_instruct.parquet]"
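The val_kwargs.n=8 setting above draws 8 samples per problem for the pass_at_n metric. For reference, per-problem pass@k from n samples with c correct is typically computed with the standard unbiased estimator (a sketch for interpreting results, not necessarily the repo's exact implementation):

from math import comb

def pass_at_k(n, c, k):
    # n samples drawn, c of them correct, evaluation budget k.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=8, c=3, k=1))  # average this over problems for the benchmark score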
To evaluate the model's performance on JEEBench, first download JEEBench by running setup_jeebench.sh, then run:

python -m verl_v4.recipe.dapo.simple_eval \
exps='[dapo_32b,log_all_reward,pass_at_n,use_JEEBench]' \
data.val_files="[${PHO_DATA}/JEEBench/dataset.json]" \
model=qwen2.5-3b-instruct data.val_batch_size=1 actor_rollout_ref.rollout.val_kwargs.n=8 \
max_response_length=32000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.65 max_model_len=32000
To evaluate the model's performance on OlympiadBench, run:

python -m verl_v4.recipe.dapo.simple_eval \
exps='[dapo_32b,log_all_reward,pass_at_n,use_olympiad_bench]' \
data.val_files="[${PHO_DATA}/olympiad_bench/OlympiadBench_Dataset/data/OE_TO_physics_en_COMP.json,
${PHO_DATA}/olympiad_bench/OlympiadBench_Dataset/data/OE_TO_physics_zh_CEE.json]" \
model=qwen2.5-3b-instruct data.val_batch_size=1 actor_rollout_ref.rollout.val_kwargs.n=1 \
max_response_length=32000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.65 max_model_len=32000
To evaluate the model's performance on PHYSICS, download the validation set from the HF link to PHO_DATA/PHYSICS/test.jsonl and run:

python -m verl_v4.recipe.dapo.simple_eval \
exps='[dapo_32b,ipho_numeric_val,log_all_reward,pass_at_n,use_PHYSICS]' \
model=qwen2.5-3b-instruct data.val_batch_size=1 actor_rollout_ref.rollout.val_kwargs.n=8 \
max_response_length=32000 trainer.n_gpus_per_node=8 model_name=${MODEL_PATH} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.65 max_model_len=32000

Since the verifier used in PHYSICS is not open source, we use Gemini 2.5 Flash, which is a strong verifier. Set the Google API key in llm/utils/basic_utils.py (line 138) before running this evaluation.
If you find this work useful in your research, please cite: