# TRTP: a three-stage robust task planning framework for open worlds via visual-language models and digital twin simulation
Official Repository of Our Paper: “TRTP: a three-stage robust task planning framework for open worlds via visual-language models and digital twin simulation”
📄 Paper | 🔗 Project Page
TRTP introduces a three-stage framework for robust robot task planning. Traditional Vision-Language Models (VLMs) often struggle with complex spatial reasoning and lack a mechanism to verify the physical feasibility of their plans. Our framework addresses these limitations by creating a closed-loop system that integrates perception, planning, and simulation-based validation.
The core contributions of our framework are:
- Stage 1: Spatial Prompt Generation: We use a VLM to analyze a scene's keyframe and generate a rich, descriptive text about object relationships, which we call a "Spatial Prompt".
- Stage 2: Spatially-Aware Task Planning: This Spatial Prompt is then fed as a system-level context to another VLM, guiding it to produce a more accurate and spatially coherent task plan based on a video instruction.
- Stage 3: Digital Twin Simulation & Feedback: The generated plan is executed in a high-fidelity digital twin of the environment. If a step fails (e.g., collision, unreachable object), the system generates a structured "Error Prompt" that is fed back to the planner, enabling it to iteratively correct and refine its plan until a feasible solution is found.
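The three stages above form a plan, simulate, and refine loop. A minimal sketch of that control flow is below; `plan`, `simulate`, and `refine` are hypothetical stand-ins for the VLM planner and the UE5 digital twin, not functions from this repository.

```python
def closed_loop_planning(spatial_prompt, video_instruction,
                         plan, simulate, refine, max_iters=5):
    """Iteratively plan, validate in simulation, and re-plan on failure.

    `plan`, `simulate`, and `refine` are caller-supplied callables
    (illustrative stand-ins for the VLM planner and the digital twin).
    """
    # Stage 2: spatially-aware planning from the Spatial Prompt + instruction
    task_plan = plan(spatial_prompt, video_instruction)
    for _ in range(max_iters):
        # Stage 3: physics-based validation in the digital twin
        ok, error_prompt = simulate(task_plan)
        if ok:
            return task_plan  # physically feasible plan found
        # Stage 3 feedback: the structured Error Prompt drives re-planning
        task_plan = refine(task_plan, error_prompt)
    raise RuntimeError("no feasible plan found within the iteration budget")
```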
This repository contains all components to reproduce the system, including:
- VLM spatial relation extraction and fusion
- Digital twin environment setup and feedback loop
- Qwen2VL fine-tuning using `llama-factory`
- Depth-based mesh generation via `depth-anything-v2`
- UE5-based high-fidelity manipulation environment
## Repository Structure

```text
TRTP_1/
├── VLM_process_Code/    # Stage 1 & 2: Spatial Prompt Extraction, Fusion, and Task Planning
│   └── README.md        # VLM_process_Code's README file
├── DigitalTwinSimEnv/   # Stage 3 (Part 1): 3D Scene Reconstruction from a Single Image
│   └── README.md        # DigitalTwinSimEnv's README file
├── src/                 # Stage 2 (Fine-tuning): Qwen2VL Fine-tuning using llama-factory (LoRA)
├── DataProcess/         # Data processing code
│   └── README.md        # DataProcess's README file
├── assets/              # Images, diagrams, GIFs, and other visual assets
├── requirements.txt     # Core Python dependencies
└── README.md            # This file
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/trtp/TRTP_1.git
   cd TRTP_1
   ```

2. Install core dependencies (a virtual environment is highly recommended):

   ```bash
   pip install -r requirements.txt
   ```

3. Install specialized dependencies. Our framework leverages several powerful external tools; please follow their official installation guides:

   - For VLM fine-tuning: `llama-factory` is required for fine-tuning Qwen2VL.
   - For 3D reconstruction: Depth Anything V2 models and dependencies are needed. Download the checkpoints and place them in `DigitalTwinSimEnv/checkpoints/`.
   - For simulation: Unreal Engine 5 is necessary for running the interactive digital twin environment.
For more details, see the README file in each subfolder.
## VLM_process_Code

This module implements Stages 1 and 2 of our framework: generating spatial prompts and using them for task planning. It includes a complete pipeline for data processing, inference, and evaluation.
- Core Workflow:
  - Spatial Relation Extraction: A VLM (e.g., `InternVL2`, `Qwen2VL`) analyzes a keyframe to produce a `.txt` file describing object spatial relations (the "Spatial Prompt").
  - Dataset Construction: The generated text prompts are paired with their corresponding videos and task instructions to create a structured `.json` dataset.
  - Spatially-Aware Inference: This JSON dataset is fed to a VLM, which uses the spatial prompt as context to generate a precise task plan.
- Evaluation: Includes scripts to automatically evaluate the quality of the generated spatial prompts (on Precision, Completeness, Redundancy) and the final task plans (on Visual, Temporal, and Physical Consistency).
- Usage Example:

  ```bash
  cd VLM_process_Code/makeDatasets/llava-onevision-qwen2-7b-ov-hf/
  # Step 1: Generate spatial description texts
  python 1_llava-onevision-qwen2-7b-ov-hfInferImageAndSaveTxt.py
  # Step 2: Create a JSON dataset pairing texts with videos
  python 2_llava-onevision-qwen2-7b-ov-hfMakeDatesetWithSavedTextAndVideos.py
  # Step 3: Perform final task planning using the spatial prompts
  python 3_llava-onevision-qwen2-7b-ov-hfInferTaskByPromptAndVideo.py
  ```
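The evaluation metrics mentioned above can be illustrated with a simple triple-based scorer. This is a hedged sketch, not the repository's actual evaluation script: here Precision, Completeness, and Redundancy are computed over `(subject, relation, object)` triples, which is one plausible way to realize those metrics.

```python
def score_spatial_prompt(predicted, reference):
    """Score a spatial prompt against a reference annotation.

    predicted/reference: iterables of (subject, relation, object) triples.
    Illustrative metric definitions, assumed for this sketch:
      - precision:    fraction of predicted relations that are correct
      - completeness: fraction of reference relations that were recovered
      - redundancy:   fraction of predicted relations that are spurious
    """
    pred, ref = set(predicted), set(reference)
    correct = pred & ref
    precision = len(correct) / len(pred) if pred else 0.0
    completeness = len(correct) / len(ref) if ref else 0.0  # i.e., recall
    redundancy = 1.0 - precision
    return precision, completeness, redundancy
```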
## DigitalTwinSimEnv

This module implements the foundation of Stage 3: automatically reconstructing a 3D environment from a single image. This mesh serves as the geometric basis for our high-fidelity simulation.
- 3D Reconstruction Pipeline:
  - Depth Estimation: `run.py` uses Depth Anything V2 to generate a high-precision 16-bit depth map from an input image.
  - Point Cloud Generation: `imageTo3DPoint.py` converts the depth map into a dense 3D point cloud (`.ply`) using camera intrinsics.
  - Mesh Reconstruction: `Ply2Mesh.py` applies Poisson Surface Reconstruction to the point cloud to generate a clean, watertight 3D mesh (`.obj`).
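The point cloud generation step boils down to unprojecting each depth pixel through the pinhole camera model. The sketch below illustrates that core computation; the intrinsics and depth units are placeholder assumptions, not the repository's actual calibration.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Unproject an HxW metric depth map into an Nx3 point cloud.

    Pinhole model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    (fx, fy) are focal lengths in pixels; (cx, cy) is the principal point.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels
```

The resulting points could then be handed to a Poisson reconstruction backend (e.g., Open3D's `TriangleMesh.create_from_point_cloud_poisson`) to obtain the watertight `.obj` mesh described above.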
- Simulation Integration: The generated `.obj` mesh is imported into Unreal Engine 5, where it is combined with a high-fidelity robot model (e.g., FAB Manipulator) to perform physics-based validation. When a failure occurs, the UE5 environment generates a structured error prompt.
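To make the feedback mechanism concrete, here is one hypothetical shape such a structured error prompt could take; the exact fields and wording used by the UE5 environment in this repository may differ.

```python
def make_error_prompt(step_index, action, failure_type, detail):
    """Package a simulation failure as text a planner VLM can consume.

    Illustrative format only: failure_type might be e.g. "collision" or
    "unreachable object", mirroring the failure cases described above.
    """
    return (
        f"[SIMULATION ERROR] step {step_index} ('{action}') failed: "
        f"{failure_type}. {detail} Revise the plan to resolve this error."
    )
```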
## src (VLM Fine-tuning)

This module provides the pipeline for fine-tuning a VLM to better understand task failures and correction instructions.
- Framework: Built on the highly efficient llama-factory.
- Model: We fine-tune Qwen2VL using LoRA (specifically rsLoRA) adapters for parameter-efficient training.
- Dataset: The training data consists of pairs of `(video, initial_plan, error_prompt)` and the corresponding `corrected_plan`, teaching the model to re-plan based on simulation feedback.
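A single training record from such pairs might be assembled as follows. This is a sketch in a llama-factory-style multimodal chat format; the field names (`videos`, `messages`) and prompt wording are illustrative assumptions, not the repository's exact schema.

```python
import json

def build_replan_sample(video_path, initial_plan, error_prompt, corrected_plan):
    """Pair a failed plan plus simulator feedback with its corrected plan.

    Hypothetical record layout: the user turn carries the video, the
    initial plan, and the error prompt; the assistant turn is the target
    corrected plan the model should learn to produce.
    """
    return {
        "videos": [video_path],
        "messages": [
            {
                "role": "user",
                "content": (
                    f"<video>Initial plan:\n{initial_plan}\n"
                    f"Simulation feedback:\n{error_prompt}\n"
                    "Produce a corrected plan."
                ),
            },
            {"role": "assistant", "content": corrected_plan},
        ],
    }
```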
## Citation

```bibtex
@article{Qu2026TRTP,
  title     = {TRTP: A three-stage robust task planning framework for open worlds via visual-language models and digital twin simulation},
  author    = {Qu, Yuanjin and Hu, Xiangtao and Chen, Fei and Wei, Zhihong},
  journal   = {Applied Intelligence},
  year      = {2026},
  volume    = {56},
  number    = {4},
  month     = {February},
  doi       = {10.1007/s10489-026-07143-y},
  publisher = {Springer}
}
```

## License

This repository is licensed under the MIT License.
## Acknowledgements

Our work builds upon many fantastic open-source projects. We extend our sincere gratitude to their developers.