# TRTP: a three-stage robust task planning framework for open worlds via visual-language models and digital twin simulation
Official Repository of Our Paper: “TRTP: a three-stage robust task planning framework for open worlds via visual-language models and digital twin simulation”
📄 Paper | 🔗 Project Page
TRTP introduces a three-stage framework for robust robot task planning. Traditional Vision-Language Models (VLMs) often struggle with complex spatial reasoning and lack a mechanism to verify the physical feasibility of their plans. Our framework addresses these limitations by creating a closed-loop system that integrates perception, planning, and simulation-based validation.
The core contributions of our framework are:
- Stage 1: Spatial Prompt Generation: We use a VLM to analyze a scene's keyframe and generate a rich, descriptive text about object relationships, which we call a "Spatial Prompt".
- Stage 2: Spatially-Aware Task Planning: This Spatial Prompt is then fed as a system-level context to another VLM, guiding it to produce a more accurate and spatially coherent task plan based on a video instruction.
- Stage 3: Digital Twin Simulation & Feedback: The generated plan is executed in a high-fidelity digital twin of the environment. If a step fails (e.g., collision, unreachable object), the system generates a structured "Error Prompt" that is fed back to the planner, enabling it to iteratively correct and refine its plan until a feasible solution is found.
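The three stages above form a plan, simulate, and refine loop. A minimal sketch of that control flow is below; `plan`, `simulate`, and `refine` are hypothetical stand-ins for the VLM planner and the UE5 digital twin, not functions from this repository.

```python
def closed_loop_planning(spatial_prompt, video_instruction,
                         plan, simulate, refine, max_iters=5):
    """Iteratively plan, validate in simulation, and re-plan on failure.

    `plan`, `simulate`, and `refine` are caller-supplied callables
    (illustrative stand-ins for the VLM planner and the digital twin).
    """
    # Stage 2: spatially-aware planning from the Spatial Prompt + instruction
    task_plan = plan(spatial_prompt, video_instruction)
    for _ in range(max_iters):
        # Stage 3: physics-based validation in the digital twin
        ok, error_prompt = simulate(task_plan)
        if ok:
            return task_plan  # physically feasible plan found
        # Stage 3 feedback: the structured Error Prompt drives re-planning
        task_plan = refine(task_plan, error_prompt)
    raise RuntimeError("no feasible plan found within the iteration budget")
```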
This repository contains all components to reproduce the system, including:
- VLM spatial relation extraction and fusion
- Digital twin environment setup and feedback loop
- Qwen2VL fine-tuning using `llama-factory`
- Depth-based mesh generation via `depth-anything-v2`
- UE5-based high-fidelity manipulation environment
## Repository Structure

```text
TRTP_1/
├── VLM_process_Code/    # Stage 1 & 2: Spatial Prompt Extraction, Fusion, and Task Planning
│   └── README.md        # VLM_process_Code's README file
├── DigitalTwinSimEnv/   # Stage 3 (Part 1): 3D Scene Reconstruction from a Single Image
│   └── README.md        # DigitalTwinSimEnv's README file
├── src/                 # Stage 2 (Fine-tuning): Qwen2VL Fine-tuning using llama-factory (LoRA)
├── DataProcess/         # Data processing code
│   └── README.md        # DataProcess's README file
├── assets/              # Images, diagrams, GIFs, and other visual assets
├── requirements.txt     # Core Python dependencies
└── README.md            # This file
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/trtp/TRTP_1.git
   cd TRTP_1
   ```

2. Install core dependencies (a virtual environment is highly recommended):

   ```bash
   pip install -r requirements.txt
   ```

3. Install specialized dependencies. Our framework leverages several powerful external tools; please follow their official installation guides:

   - For VLM fine-tuning: `llama-factory` is required for fine-tuning Qwen2VL.
   - For 3D reconstruction: Depth Anything V2 models and dependencies are needed. Download the checkpoints and place them in `DigitalTwinSimEnv/checkpoints/`.
   - For simulation: Unreal Engine 5 is necessary for running the interactive digital twin environment.
For more details, see the README file in each subfolder.
## VLM_process_Code

This module implements Stages 1 and 2 of our framework: generating spatial prompts and using them for task planning. It includes a complete pipeline for data processing, inference, and evaluation.
- Core Workflow:
  - Spatial Relation Extraction: A VLM (e.g., `InternVL2`, `Qwen2VL`) analyzes a keyframe to produce a `.txt` file describing object spatial relations (the "Spatial Prompt").
  - Dataset Construction: The generated text prompts are paired with their corresponding videos and task instructions to create a structured `.json` dataset.
  - Spatially-Aware Inference: This JSON dataset is fed to a VLM, which uses the spatial prompt as context to generate a precise task plan.
- Evaluation: Includes scripts to automatically evaluate the quality of the generated spatial prompts (on Precision, Completeness, Redundancy) and the final task plans (on Visual, Temporal, and Physical Consistency).
- Usage Example:

  ```bash
  cd VLM_process_Code/makeDatasets/llava-onevision-qwen2-7b-ov-hf/
  # Step 1: Generate spatial description texts
  python 1_llava-onevision-qwen2-7b-ov-hfInferImageAndSaveTxt.py
  # Step 2: Create a JSON dataset pairing texts with videos
  python 2_llava-onevision-qwen2-7b-ov-hfMakeDatesetWithSavedTextAndVideos.py
  # Step 3: Perform final task planning using the spatial prompts
  python 3_llava-onevision-qwen2-7b-ov-hfInferTaskByPromptAndVideo.py
  ```
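The evaluation metrics mentioned above can be illustrated with a simple triple-based scorer. This is a hedged sketch, not the repository's actual evaluation script: here Precision, Completeness, and Redundancy are computed over `(subject, relation, object)` triples, which is one plausible way to realize those metrics.

```python
def score_spatial_prompt(predicted, reference):
    """Score a spatial prompt against a reference annotation.

    predicted/reference: iterables of (subject, relation, object) triples.
    Illustrative metric definitions, assumed for this sketch:
      - precision:    fraction of predicted relations that are correct
      - completeness: fraction of reference relations that were recovered
      - redundancy:   fraction of predicted relations that are spurious
    """
    pred, ref = set(predicted), set(reference)
    correct = pred & ref
    precision = len(correct) / len(pred) if pred else 0.0
    completeness = len(correct) / len(ref) if ref else 0.0  # i.e., recall
    redundancy = 1.0 - precision
    return precision, completeness, redundancy
```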
## DigitalTwinSimEnv

This module implements the foundation of Stage 3: automatically reconstructing a 3D environment from a single image. This mesh serves as the geometric basis for our high-fidelity simulation.
- 3D Reconstruction Pipeline:
  - Depth Estimation: `run.py` uses Depth Anything V2 to generate a high-precision 16-bit depth map from an input image.
  - Point Cloud Generation: `imageTo3DPoint.py` converts the depth map into a dense 3D point cloud (`.ply`) using camera intrinsics.
  - Mesh Reconstruction: `Ply2Mesh.py` applies Poisson Surface Reconstruction to the point cloud to generate a clean, watertight 3D mesh (`.obj`).
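The point cloud generation step boils down to unprojecting each depth pixel through the pinhole camera model. The sketch below illustrates that core computation; the intrinsics and depth units are placeholder assumptions, not the repository's actual calibration.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Unproject an HxW metric depth map into an Nx3 point cloud.

    Pinhole model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    (fx, fy) are focal lengths in pixels; (cx, cy) is the principal point.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels
```

The resulting points could then be handed to a Poisson reconstruction backend (e.g., Open3D's `TriangleMesh.create_from_point_cloud_poisson`) to obtain the watertight `.obj` mesh described above.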
- Simulation Integration: The generated `.obj` mesh is imported into Unreal Engine 5, where it is combined with a high-fidelity robot model (e.g., FAB Manipulator) to perform physics-based validation. When a failure occurs, the UE5 environment generates a structured error prompt.
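To make the feedback mechanism concrete, here is one hypothetical shape such a structured error prompt could take; the exact fields and wording used by the UE5 environment in this repository may differ.

```python
def make_error_prompt(step_index, action, failure_type, detail):
    """Package a simulation failure as text a planner VLM can consume.

    Illustrative format only: failure_type might be e.g. "collision" or
    "unreachable object", mirroring the failure cases described above.
    """
    return (
        f"[SIMULATION ERROR] step {step_index} ('{action}') failed: "
        f"{failure_type}. {detail} Revise the plan to resolve this error."
    )
```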
## src (VLM Fine-tuning)

This module provides the pipeline for fine-tuning a VLM to better understand task failures and correction instructions.
- Framework: Built on the highly efficient llama-factory.
- Model: We fine-tune Qwen2VL using LoRA (specifically rsLoRA) adapters for parameter-efficient training.
- Dataset: The training data consists of pairs of `(video, initial_plan, error_prompt)` and the corresponding `corrected_plan`, teaching the model to re-plan based on simulation feedback.
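A single training record from such pairs might be assembled as follows. This is a sketch in a llama-factory-style multimodal chat format; the field names (`videos`, `messages`) and prompt wording are illustrative assumptions, not the repository's exact schema.

```python
import json

def build_replan_sample(video_path, initial_plan, error_prompt, corrected_plan):
    """Pair a failed plan plus simulator feedback with its corrected plan.

    Hypothetical record layout: the user turn carries the video, the
    initial plan, and the error prompt; the assistant turn is the target
    corrected plan the model should learn to produce.
    """
    return {
        "videos": [video_path],
        "messages": [
            {
                "role": "user",
                "content": (
                    f"<video>Initial plan:\n{initial_plan}\n"
                    f"Simulation feedback:\n{error_prompt}\n"
                    "Produce a corrected plan."
                ),
            },
            {"role": "assistant", "content": corrected_plan},
        ],
    }
```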
## Citation

```bibtex
@article{Qu2026TRTP,
  title     = {TRTP: A three-stage robust task planning framework for open worlds via visual-language models and digital twin simulation},
  author    = {Qu, Yuanjin and Hu, Xiangtao and Chen, Fei and Wei, Zhihong},
  journal   = {Applied Intelligence},
  year      = {2026},
  volume    = {56},
  number    = {4},
  month     = {February},
  doi       = {10.1007/s10489-026-07143-y},
  publisher = {Springer}
}
```

## License

This repository is licensed under the MIT License.
## Acknowledgements

Our work builds upon many fantastic open-source projects. We extend our sincere gratitude to their developers.