- β¨ Key Features
- π¦ Dataset
- βοΈ Environment Setup
- π οΈ Tool Services
- π Inference
- ποΈ Training
- π Repository Structure
- π Citation
- π Acknowledgement
- π License
Skill-3D is a framework for agentic 3D spatial reasoning, where multimodal LLM agents solve spatial reasoning tasks through external perception and geometry tools. Existing tool-augmented agents often apply uniform tool-use strategies across heterogeneous 3D scenes, leading to mismatched evidence, redundant tool calls, and limited gains over non-agentic baselines.
Skill-3D addresses this by constructing a Scene Memory and evolving a Skill Library. Successful tool-use rollouts are distilled into reusable scene-aware workflows, while failed rollouts are retained as corrective lessons. At inference time, Skill-3D retrieves scene-task-relevant skills to guide tool planning, evidence acquisition, and answer grounding. We further use skill-guided trajectories for agentic SFT and GRPO, transferring scene-aware tool-use behavior into compact MLLM agents.
-
Scene-aware skill extraction. Successful rollouts become reusable workflows; failures become lessons.
-
Scene Memory and Skill Library. Unless otherwise specified, training splits from all benchmarks are pooled to construct one shared memory and skill library; test samples are never used for skill construction or post-training.
-
Improved tool utilization. Skill-3D improves effective tool usage from 39% to 78% on VSI-Bench.
We evaluate Skill-3D on four 3D spatial reasoning benchmarks:
- VSI-Bench: object counting, metric distance, object size, room size, relative direction, route planning, and appearance order.
- BLINK: multi-view spatial reasoning.
- CV-3D / CV-Bench: depth ordering and relative distance reasoning.
- MMSI-Bench: positional relationship reasoning.
A typical local data layout is:
dataset/
|-- skill3d_splits/
| |-- vsi_train.jsonl
| |-- vsi_test.jsonl
| |-- mmsi_pr_train.jsonl
| |-- mmsi_pr_test.jsonl
| |-- cv3d_train.jsonl
| |-- cv3d_test.jsonl
| |-- blink_multiview_train.jsonl
| `-- blink_multiview_test.jsonl
|-- VSI-Bench/
|-- MMSI-Bench/
|-- CV-3D/
`-- BLINK/
We release the skill3d_splits used for skill evolution, post-training, and
evaluation here.
Teacher-generated rollouts are collected only from training samples.
We test under the following environment:
- Python 3.11
- PyTorch with CUDA 12.8
- OpenAI-compatible API client
- vLLM for local inference
Clone the repository:
git clone https://github.com/skill-3d/Skill-3D.git
cd Skill-3DInstall dependencies:
conda create -n skill3d python=3.11
conda activate skill3d
pip install -r requirements.txt
pip install "httpx[socks]"External expert tools require their own checkpoints and dependencies. Install and launch only the tool services needed for your experiments.
All tool-augmented methods should use the same tool pool unless otherwise specified.
| Tool | Environment Variable | Default URL |
|---|---|---|
| Depth-Anything-3 | DEPTH_SERVER_URL |
http://127.0.0.1:20019 |
| SAM3 | SEGMENTATION_SERVER_URL |
http://127.0.0.1:20020 |
| GroundingDINO | DETECTION_SERVER_URL |
http://127.0.0.1:20022 |
| Pi3 | PI3_SERVER_URL |
http://127.0.0.1:20030 |
| SwinIR | SWINIR_SERVER_URL |
http://127.0.0.1:20032 |
| Orient-Anything v2 | ORIENT_ANYTHING_SERVER_URL |
http://127.0.0.1:20034 |
Start the needed services before running Skill-3D inference. The commands below
use the default ports expected by scripts/common_env.sh; choose different
CUDA_VISIBLE_DEVICES values if you distribute tools across GPUs.
# Metric depth
python skill3d/external_experts/Depth_AnythingV3/depth_server.py \
--model_dir checkpoints/depth_anything/DA3METRIC-LARGE \
--port 20019
# Segmentation
python skill3d/external_experts/SAM3/sam3_server.py \
--checkpoint_path checkpoints/sam3.1/sam3.1_multiplex.pt \
--sam3_repo third_party/sam3 \
--port 20020
# Open-vocabulary detection
python skill3d/external_experts/GroundingDINO/grounding_dino_server.py \
--checkpoint_path checkpoints/grounding_dino/groundingdino_swinb_cogcoor.pth \
--port 20022
# 3D reconstruction
python skill3d/external_experts/Pi3/pi3_server.py \
--checkpoint_path checkpoints/pi3/model.safetensors \
--port 20030
# Super-resolution / restoration
python skill3d/external_experts/SwinIR/swinir_server.py \
--repo_dir third_party/SwinIR \
--weights_dir checkpoints/swinir \
--port 20032
# Object orientation
python skill3d/external_experts/OrientAnything/orient_anything_server.py \
--repo_dir third_party/Orient-Anything \
--ckpt_path checkpoints/orient_anything/dino_weight.pt \
--port 20034Approximate peak GPU memory for single-request inference is listed below. These numbers are planning estimates and vary with image resolution, frame count, precision, tiling, and CUDA allocator state.
| Service | Typical Input | Approx. Peak VRAM |
|---|---|---|
| Depth Anything metric depth | one resized image, process_res=504 |
10-14 GB |
| SAM3 segmentation | one image with text/box prompt | 14-20 GB |
| GroundingDINO detection | one 640-1024px image | 3-5 GB |
| Pi3 reconstruction | multiple frames | 18-28 GB |
| SwinIR restoration | one image, tiled inference recommended | 2-6 GB |
| Orient-Anything v2 | one object crop or image | 3-6 GB |
A typical checkpoint layout follows the local expert-service convention:
checkpoints/
|-- depth_anything/
| |-- DA3METRIC-LARGE/
| | |-- config.json
| | `-- model.safetensors
| `-- depth_anything_v2_vitb.pth
|-- grounding_dino/
| `-- groundingdino_swinb_cogcoor.pth
|-- orient_anything/
| `-- dino_weight.pt
|-- pi3/
| `-- model.safetensors
|-- sam3.1/
| |-- config.json
| |-- processor_config.json
| |-- sam3.1_multiplex.pt
| |-- tokenizer.json
| |-- tokenizer_config.json
| |-- vocab.json
| `-- merges.txt
`-- swinir/
`-- 003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_PSNR-with-dict-keys-params-and-params_ema.pth
API inference with an OpenAI-compatible model:
export OPENAI_API_KEY=<your_key>
export OPENAI_BASE_URL=<your_base_url>
BENCHMARK=vsi \
USE_MODEL=gpt-5.4 \
bash scripts/run_skill3d_gpt54_inference.shCommon overrides:
DATA_PATH=dataset/skill3d_splits/vsi_test.jsonl
IMAGE_BASE_PATH=dataset
MAX_WORKERS=4
RETRIEVAL_TOP_K=6Start a local vLLM server:
MODEL_PATH=/path/to/Skill-3D-checkpoint \
SERVED_MODEL_NAME=Skill-3D-GRPO-4B \
GPU_DEVICE=0 \
PORT=8001 \
bash scripts/vllm_start.shRun inference with the local endpoint:
BENCHMARK=vsi \
USE_MODEL=Skill-3D-GRPO-4B \
LLM_BASE_URL=http://localhost:8001/v1 \
bash scripts/run_skill3d_qwen_sft_inference.shOutputs are stored under results/, while logs are written under logs/.
The repository includes SFT/GRPO shell templates and prompt files for reproducibility. Training requires local artifacts that are not tracked. We also release the SFT- and GRPO-trained 4B/8B checkpoints here.
- Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct checkpoints
- SFT and GRPO files
- generated Scene Memory and Skill Library
- running tool services
- optional training plugins
Agentic SFT:
MODEL_PATH=/path/to/Qwen3-VL \
DATASET_JSONL=/path/to/sft_messages.jsonl \
bash train/train_sft.shGRPO:
MODEL_PATH=/path/to/sft-checkpoint \
DATASET_PATH=/path/to/Skill-3D-GRPO-mixed1000.jsonl \
bash train/train_grpo.shSkill-3D/
|-- skill3d/ # Core package
| |-- core/ # Agent, memory, skills, tool registry
| |-- models/ # GPT, Qwen, and local vLLM wrappers
| |-- tools/ # Tool wrappers
| |-- external_experts/ # Client/server implementations
| `-- vllm_models/ # OpenAI-compatible backend helpers
|-- statics/skill3d_shared/ # Public skeleton for generated memory
|-- examples/evaluation/ # Minimal evaluation entry used by inference
|-- scripts/ # API inference, Qwen inference, vLLM startup
|-- train/ # Post-training prompt/shell templates
|-- third_party/ # Lightweight third-party source trees
`-- test/ # Unit and integration checks
By default, Skill-3D uses a shared memory root:
export MEMORY_ROOT=statics/skill3d_shared
export SKILL3D_SKILL_STORAGE_PATH=${MEMORY_ROOT}/learned_skills.json
export SKILL3D_HIERARCHICAL_MEMORY_DIR=${MEMORY_ROOT}/memory
export RETRIEVAL_TOP_K=6For held-out evaluation and post-training, the Skill Library should remain frozen. The provided inference scripts use --disable_skill_update by default.
If you find this repository useful, please cite:
@article{li2026skill3d,
title = {Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning},
author = {Li, Haoyuan and Hu, Zhengdong and Wang, Jun and Fan, Hehe and Yang, Yi},
journal = {arXiv preprint arXiv:2606.07436},
year = {2026}
}Skill-3D builds on the Think3D/SPAgent codebase and integrates open-source expert tools including GroundingDINO, SAM3, Depth-Anything-3, Pi3, SwinIR, Orient-Anything, ms-swift, and vLLM.
This project is released under the Apache-2.0 License. See LICENSE.

