Skip to content

wangpan-ustc/AtlasVA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

3 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

AtlasVA logo AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

arXiv Website

Welcome to the official repository for AtlasVA! ๐Ÿš€

AtlasVA is a teacher-free visual skill memory framework designed for Vision-Language Model (VLM) agents. Unlike traditional methods that compress spatial knowledge into lossy text and rely on proprietary LLMs for supervision, AtlasVA keeps experience visually grounded. It organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. By evolving danger and affinity atlases directly from trajectory statistics, AtlasVA provides dense, coordinate-aware guidance for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision, achieving strong performance on spatially intensive tasks like Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation! ๐Ÿ†

AtlasVA Architecture The overall architecture of AtlasVA. (a) The Three-Layer Visual Memory (VSM) stores spatial heatmaps, visual exemplars, and symbolic text skills. (b) Teacher-Free Visual Atlas Evolution bootstraps memory from raw interaction history. (c) Atlas-Grounded Dense Visual Reward Shaping provides coordinate-aware guidance for RL. (d) Policy Optimization unifies perception, memory, and optimization.

The code is organized to reproduce the reinforcement-learning runs and evaluation scripts used in the paper. The main package is atlasva, and the training backend is built on the vendored verl directory.

๐Ÿ“ฆ What Is Included

  • atlasva/: AtlasVA environments, agent loop, visual skill memory, reward shaping, and evaluation utilities.
  • atlasva/configs/: Hydra entry configs. The main training config is atlasva_multiturn.yaml.
  • scripts/: training and validation configs for each environment.
  • scripts/Giants/: API-based zero-shot large-model evaluation scripts.
  • verl/: the RL training backend used by the training scripts.

โš ๏ธ Note: Large model checkpoints and experiment outputs are not included in the source artifact. Training scripts write outputs to exps/.


๐Ÿ’ป Hardware and Software Assumptions

The full training scripts are configured for one node with 8 GPUs:

  • ๐Ÿง CUDA-capable Linux machine (CUDA 12.8).
  • ๐ŸŽฎ 8 GPUs for the default scripts.
  • ๐Ÿ Python 3.12. We used a Conda environment (PyTorch: 2.8.0+cu128, PyTorch CUDA build: 12.8).
  • ๐Ÿง  Qwen2.5-VL-3B-Instruct as the default policy model.
  • ๐Ÿ“Š Optional Weights & Biases logging. Disable or change trainer.logger in the scripts if W&B is not available.

For quick code checks, a CPU-only machine is enough. For full reproduction, reviewers should run on a CUDA machine with enough GPU memory for vLLM and FSDP.


๐Ÿ› ๏ธ Environment Setup

Create and activate a fresh environment:

conda create -n atlasva python=3.12 -y
conda activate atlasva

Install the vendored verl backend:

cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..

Install AtlasVA and runtime dependencies:

pip install -e .
pip install "trl==0.26.2"
pip install "vllm==0.11.0"
pip install "transformers==4.56.2"
pip install "setuptools==79.0.1"
pip install matplotlib einops pyzmq hydra-core fire ray wandb Pillow

If your CUDA/PyTorch setup requires FlashAttention explicitly, install it after PyTorch is available:

pip install "flash-attn==2.8.3" --no-build-isolation

PrimitiveSkill and Swap use ManiSkill/SAPIEN rendering. Install ManiSkill and download the assets before running those environments:

pip install "mani_skill==3.0.0b20"
python -m mani_skill.utils.download_asset PickSingleYCB-v1 -y
python -m mani_skill.utils.download_asset partnet_mobility_cabinet -y

On headless Linux machines, ManiSkill/SAPIEN also needs working EGL/Vulkan drivers and GLVND vendor files. If imports fail with messages such as /usr/share/glvnd/egl_vendor.d missing, install the host NVIDIA driver, Vulkan/GLVND runtime packages, and verify that the container or job exposes /usr/share/glvnd/egl_vendor.d and the NVIDIA ICD files.


๐Ÿง  Model Checkpoints

The pre-trained AtlasVA model weights are publicly available on Hugging Face: ๐Ÿค— wangpan-ustc/AtlasVA

For training, the scripts use Qwen/Qwen2.5-VL-3B-Instruct as the base policy model by default. To avoid repeated downloads, point the scripts to a local checkpoint:

export QWEN25_VL_3B_LOCAL_PATH=/path/to/Qwen2.5-VL-3B-Instruct

Each script checks this variable first. If config.json exists under that directory, the script switches Hugging Face and Transformers to offline mode. Otherwise, it falls back to the Hugging Face model ID.


๐Ÿงช Smoke Tests

Run these commands before launching expensive training:

python -m compileall -q atlasva scripts setup.py
python -c "import atlasva; import atlasva.envs.registry; import atlasva.envs_remote; print('imports ok')"
python setup.py --name

After installing Hydra/Ray/vLLM dependencies, check that the training entry can load:

python -m atlasva.main_ppo \
  --config-path="$(pwd)/atlasva/configs" \
  --config-name=atlasva_multiturn \
  --help

๐Ÿ“‚ Repository Layout

Path Purpose
atlasva/main_ppo.py PPO training entry point.
atlasva/ray_trainer.py AtlasVA training loop extensions.
atlasva/agent_loop/ Multi-turn environment-agent interaction loop.
atlasva/skills/ Text skills, visual skill memory, heatmaps, exemplars, and visual rewards.
atlasva/envs/ Sokoban, FrozenLake, Navigation, and PrimitiveSkill environments.
atlasva/envs_remote/ HTTP client/server wrapper for remote rendered environments.
scripts/*/*.yaml Dataset and environment specifications.
scripts/*/*.sh Reproduction scripts for main runs and baselines.

๐Ÿš€ Main Training Commands

Run all commands from the repository root. The scripts create experiment directories under exps/<project>/<experiment>/.

๐Ÿงฉ Sokoban

bash scripts/sokoban/train_ppo_qwen25vl3b_Base.sh
bash scripts/sokoban/train_ppo_qwen25vl3b_Skill.sh

๐ŸงŠ FrozenLake

bash scripts/frozenlake/train_ppo_qwen25vl3b_Skill.sh
bash scripts/frozenlake/train_grpo_qwen25vl3b_Skill.sh

๐Ÿ—บ๏ธ Navigation

Navigation uses a remote environment server. Start the server first:

python -m atlasva.envs.navigation.serve \
  --port=8036 \
  --devices='[0,1,2,3,4,5,6,7]' \
  --max_envs=512

For a local single-server run, use the _Local script and local YAML files:

bash scripts/navigation/train_ppo_qwen25vl3b_SkillCommon_Local.sh

For multi-server runs, edit base_urls in scripts/navigation/train_navigation_base_common.yaml and scripts/navigation/val_navigation_base_common.yaml, then run:

bash scripts/navigation/train_ppo_qwen25vl3b_BaseCommon.sh
bash scripts/navigation/train_ppo_qwen25vl3b_SkillCommon.sh

๐Ÿค– PrimitiveSkill

PrimitiveSkill also supports remote rendering. Start one or more servers:

python -m atlasva.envs.primitive_skill.serve --port=8037 --max_envs=512
python -m atlasva.envs.primitive_skill.serve --port=8038 --max_envs=512
python -m atlasva.envs.primitive_skill.serve --port=8039 --max_envs=512

Then edit the base_urls fields in scripts/primitive_skill/train_primitive_skill_vision_remote.yaml and scripts/primitive_skill/val_primitive_skill_vision_remote.yaml if your server addresses differ from the defaults. Launch:

bash scripts/primitive_skill/train_ppo_qwen25vl3b_Base.sh
bash scripts/primitive_skill/train_ppo_qwen25vl3b_Skill.sh

๐Ÿ“Š Outputs

Training scripts write:

  • ๐Ÿ’พ checkpoints to exps/<project>/<experiment>/verl_checkpoints/;
  • ๐Ÿ“ˆ validation traces to exps/<project>/<experiment>/validation/;
  • ๐ŸŽฅ rollout dumps to exps/<project>/<experiment>/rollout_data/;
  • ๐Ÿ“ logs to both the experiment directory and the repository root.

The default scripts use W&B plus console logging:

wandb login

To avoid W&B, change trainer.logger=['console','wandb'] to trainer.logger=['console'] in the target script.


๐Ÿค– API-Based Large-Model Evaluation

The scripts/Giants/ directory evaluates closed-source or hosted models through OpenRouter-compatible APIs.

pip install openai
export OPENROUTER_API_KEY="sk-or-..."

bash scripts/Giants/eval_gpt4o.sh
bash scripts/Giants/eval_gpt5.sh
bash scripts/Giants/run_all_sokoban.sh

The API evaluation writes summaries and episode dumps under exps/giants/.


๐ŸŒ Remote Environment Notes

Remote YAML files contain concrete base_urls used in our internal cluster, for example http://localhost:8036. Reviewers should replace these addresses with their own server addresses.

Health check:

curl -s http://localhost:8036/health

If a rendering server runs on a separate machine, expose the server port with SSH tunneling or the cluster networking tool available in your environment.


โ“ Common Issues

  • ModuleNotFoundError: hydra, fire, or ray: install hydra-core fire ray.
  • python -m atlasva.main_ppo cannot find the config: pass --config-path="$(pwd)/atlasva/configs" --config-name=atlasva_multiturn.
  • ModuleNotFoundError: PIL: install Pillow.
  • PrimitiveSkill or Swap fails during mani_skill/SAPIEN import: check that ManiSkill assets are downloaded and EGL/Vulkan/GLVND files are available on the host or inside the container.
  • Remote training hangs at environment creation: check base_urls, firewall rules, and curl http://<host>:<port>/health.
  • Hugging Face downloads are slow or unavailable: set QWEN25_VL_3B_LOCAL_PATH to a local model directory.
  • W&B is unavailable: change trainer.logger to console-only in the script.

โœ… Minimal Reviewer Workflow

For a lightweight artifact check:

  1. Install the environment.
  2. Run the smoke tests.
  3. Start one Navigation or PrimitiveSkill server.
  4. Edit the relevant YAML base_urls to localhost.
  5. Launch the corresponding _Local or remote training script with reduced trainer.total_training_steps and smaller n_envs.

For full reproduction, use the scripts listed above without reducing the training steps.

Acknowledgments

Our work builds upon VAGEN, and we sincerely appreciate the authors for their outstanding contributions.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages