AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Welcome to the official repository for AtlasVA! 🚀

AtlasVA is a teacher-free visual skill memory framework designed for Vision-Language Model (VLM) agents. Unlike traditional methods that compress spatial knowledge into lossy text and rely on proprietary LLMs for supervision, AtlasVA keeps experience visually grounded. It organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. By evolving danger and affinity atlases directly from trajectory statistics, AtlasVA provides dense, coordinate-aware guidance for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision, achieving strong performance on spatially intensive tasks like Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation! 🏆

The overall architecture of AtlasVA. (a) The three-layer Visual Skill Memory (VSM) stores spatial heatmaps, visual exemplars, and symbolic text skills. (b) Teacher-Free Visual Atlas Evolution bootstraps memory from raw interaction history. (c) Atlas-Grounded Dense Visual Reward Shaping provides coordinate-aware guidance for RL. (d) Policy Optimization unifies perception, memory, and optimization.
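To make the atlas idea in (b) and (c) more concrete, here is a toy sketch of how danger and affinity heatmaps could be accumulated from trajectory outcomes on a grid and turned into a dense shaping term. This is not the code in atlasva/skills/: the `Atlas` class, the per-cell success/failure ratios, the add-one smoothing, and the `alpha`/`beta` weights are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Atlas:
    """Hypothetical grid atlas built from episode outcomes (illustrative only)."""
    height: int
    width: int
    visits: list = field(init=False)
    failures: list = field(init=False)
    successes: list = field(init=False)

    def __post_init__(self):
        self.visits = [[0] * self.width for _ in range(self.height)]
        self.failures = [[0] * self.width for _ in range(self.height)]
        self.successes = [[0] * self.width for _ in range(self.height)]

    def update(self, trajectory, success):
        """Count each visited (row, col) cell once per episode."""
        for row, col in set(trajectory):
            self.visits[row][col] += 1
            if success:
                self.successes[row][col] += 1
            else:
                self.failures[row][col] += 1

    def danger(self, row, col):
        """Smoothed fraction of episodes through this cell that failed."""
        return (self.failures[row][col] + 1) / (self.visits[row][col] + 2)

    def affinity(self, row, col):
        """Smoothed fraction of episodes through this cell that succeeded."""
        return (self.successes[row][col] + 1) / (self.visits[row][col] + 2)


def shaped_reward(atlas, env_reward, cell, alpha=0.1, beta=0.1):
    """Add an assumed affinity bonus and danger penalty to the environment reward."""
    row, col = cell
    return env_reward + alpha * atlas.affinity(row, col) - beta * atlas.danger(row, col)


atlas = Atlas(height=4, width=4)
atlas.update([(0, 0), (0, 1), (1, 1)], success=False)  # a failed episode
atlas.update([(0, 0), (1, 0), (2, 0)], success=True)   # a successful episode
print(atlas.danger(1, 1), atlas.affinity(2, 0))
```

With this smoothing, a cell that has never been visited gets equal danger and affinity, so the shaping term is neutral there when `alpha` equals `beta`. No teacher model is involved; the statistics come only from the agent's own rollouts.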

The code is organized to reproduce the reinforcement-learning runs and evaluation scripts used in the paper. The main package is atlasva, and the training backend is built on the vendored verl directory.

📦 What Is Included

atlasva/: AtlasVA environments, agent loop, visual skill memory, reward shaping, and evaluation utilities.
atlasva/configs/: Hydra entry configs. The main training config is atlasva_multiturn.yaml.
scripts/: per-environment training and validation configs (YAML) and the launch scripts (shell) that use them.
scripts/Giants/: API-based zero-shot large-model evaluation scripts.
verl/: the RL training backend used by the training scripts.

⚠️ Note: Large model checkpoints and experiment outputs are not included in the source artifact. Training scripts write outputs to exps/.

💻 Hardware and Software Assumptions

The full training scripts are configured for one node with 8 GPUs:

🐧 CUDA-capable Linux machine (CUDA 12.8).
🎮 8 GPUs for the default scripts.
🐍 Python 3.12. We used a Conda environment (PyTorch: 2.8.0+cu128, PyTorch CUDA build: 12.8).
🧠 Qwen2.5-VL-3B-Instruct as the default policy model.
📊 Optional Weights & Biases logging. Disable or change trainer.logger in the scripts if W&B is not available.

For quick code checks, a CPU-only machine is enough. For full reproduction, reviewers should run on a CUDA machine with enough GPU memory for vLLM and FSDP.

🛠️ Environment Setup

Create and activate a fresh environment:

conda create -n atlasva python=3.12 -y
conda activate atlasva

Install the vendored verl backend:

cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..

Install AtlasVA and runtime dependencies:

pip install -e .
pip install "trl==0.26.2"
pip install "vllm==0.11.0"
pip install "transformers==4.56.2"
pip install "setuptools==79.0.1"
pip install matplotlib einops pyzmq hydra-core fire ray wandb Pillow

If your CUDA/PyTorch setup requires FlashAttention explicitly, install it after PyTorch is available:

pip install "flash-attn==2.8.3" --no-build-isolation

PrimitiveSkill and Swap use ManiSkill/SAPIEN rendering. Install ManiSkill and download the assets before running those environments:

pip install "mani_skill==3.0.0b20"
python -m mani_skill.utils.download_asset PickSingleYCB-v1 -y
python -m mani_skill.utils.download_asset partnet_mobility_cabinet -y

On headless Linux machines, ManiSkill/SAPIEN also needs working EGL/Vulkan drivers and GLVND vendor files. If imports fail with messages such as /usr/share/glvnd/egl_vendor.d missing, install the host NVIDIA driver and the Vulkan/GLVND runtime packages, then verify that the container or job exposes /usr/share/glvnd/egl_vendor.d and the NVIDIA ICD files.

🧠 Model Checkpoints

The pre-trained AtlasVA model weights are publicly available on Hugging Face: 🤗 wangpan-ustc/AtlasVA

For training, the scripts use Qwen/Qwen2.5-VL-3B-Instruct as the base policy model by default. To avoid repeated downloads, point the scripts to a local checkpoint:

export QWEN25_VL_3B_LOCAL_PATH=/path/to/Qwen2.5-VL-3B-Instruct

Each script checks this variable first. If config.json exists under that directory, the script switches Hugging Face and Transformers to offline mode. Otherwise, it falls back to the Hugging Face model ID.
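The fallback described above can be sketched in Python as follows. This is only an illustration of the logic, not the code in the shell scripts: the function name `resolve_policy_model` is hypothetical, and the use of `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` to enable offline mode is an assumption about how the scripts do it.

```python
import os
from pathlib import Path

DEFAULT_MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"


def resolve_policy_model(env=None):
    """Return (model_path_or_id, offline) following the fallback described above."""
    env = os.environ if env is None else env
    local_path = env.get("QWEN25_VL_3B_LOCAL_PATH", "")
    if local_path and (Path(local_path) / "config.json").is_file():
        # A usable local checkpoint: ask the Hugging Face libraries to stay offline.
        # (Assumed variable names; check the scripts for the exact ones they set.)
        env["HF_HUB_OFFLINE"] = "1"
        env["TRANSFORMERS_OFFLINE"] = "1"
        return local_path, True
    # No local checkpoint found: fall back to the Hugging Face model ID.
    return DEFAULT_MODEL_ID, False


if __name__ == "__main__":
    model, offline = resolve_policy_model()
    print(f"policy model: {model} (offline={offline})")
```

Running this before a training script is a quick way to confirm whether the local checkpoint will be picked up or a download will be attempted.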

🧪 Smoke Tests

Run these commands before launching expensive training:

python -m compileall -q atlasva scripts setup.py
python -c "import atlasva; import atlasva.envs.registry; import atlasva.envs_remote; print('imports ok')"
python setup.py --name

After installing Hydra/Ray/vLLM dependencies, check that the training entry can load:

python -m atlasva.main_ppo \
  --config-path="$(pwd)/atlasva/configs" \
  --config-name=atlasva_multiturn \
  --help

📂 Repository Layout

Path	Purpose
`atlasva/main_ppo.py`	PPO training entry point.
`atlasva/ray_trainer.py`	AtlasVA training loop extensions.
`atlasva/agent_loop/`	Multi-turn environment-agent interaction loop.
`atlasva/skills/`	Text skills, visual skill memory, heatmaps, exemplars, and visual rewards.
`atlasva/envs/`	Sokoban, FrozenLake, Navigation, and PrimitiveSkill environments.
`atlasva/envs_remote/`	HTTP client/server wrapper for remote rendered environments.
`scripts//.yaml`	Dataset and environment specifications.
`scripts//.sh`	Reproduction scripts for main runs and baselines.

🚀 Main Training Commands

Run all commands from the repository root. The scripts create experiment directories under exps/<project>/<experiment>/.

🧩 Sokoban

bash scripts/sokoban/train_ppo_qwen25vl3b_Base.sh
bash scripts/sokoban/train_ppo_qwen25vl3b_Skill.sh

🧊 FrozenLake

bash scripts/frozenlake/train_ppo_qwen25vl3b_Skill.sh
bash scripts/frozenlake/train_grpo_qwen25vl3b_Skill.sh

🗺️ Navigation

Navigation uses a remote environment server. Start the server first:

python -m atlasva.envs.navigation.serve \
  --port=8036 \
  --devices='[0,1,2,3,4,5,6,7]' \
  --max_envs=512

For a local single-server run, use the _Local script and local YAML files:

bash scripts/navigation/train_ppo_qwen25vl3b_SkillCommon_Local.sh

For multi-server runs, edit base_urls in scripts/navigation/train_navigation_base_common.yaml and scripts/navigation/val_navigation_base_common.yaml, then run:

bash scripts/navigation/train_ppo_qwen25vl3b_BaseCommon.sh
bash scripts/navigation/train_ppo_qwen25vl3b_SkillCommon.sh

🤖 PrimitiveSkill

PrimitiveSkill also supports remote rendering. Start one or more servers:

python -m atlasva.envs.primitive_skill.serve --port=8037 --max_envs=512
python -m atlasva.envs.primitive_skill.serve --port=8038 --max_envs=512
python -m atlasva.envs.primitive_skill.serve --port=8039 --max_envs=512

Then edit the base_urls fields in scripts/primitive_skill/train_primitive_skill_vision_remote.yaml and scripts/primitive_skill/val_primitive_skill_vision_remote.yaml if your server addresses differ from the defaults. Launch:

bash scripts/primitive_skill/train_ppo_qwen25vl3b_Base.sh
bash scripts/primitive_skill/train_ppo_qwen25vl3b_Skill.sh

📊 Outputs

Training scripts write:

💾 checkpoints to exps/<project>/<experiment>/verl_checkpoints/;
📈 validation traces to exps/<project>/<experiment>/validation/;
🎥 rollout dumps to exps/<project>/<experiment>/rollout_data/;
📝 logs to both the experiment directory and the repository root.
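For post-processing, a small helper like the one below can locate these directories. It is a convenience sketch rather than part of the repository: `experiment_dirs` and `latest_entry` are hypothetical names, and the project and experiment names passed in are whatever your script sets.

```python
from pathlib import Path


def experiment_dirs(project, experiment, root="exps"):
    """Map each output type to its directory under exps/<project>/<experiment>/."""
    base = Path(root) / project / experiment
    return {
        "checkpoints": base / "verl_checkpoints",
        "validation": base / "validation",
        "rollouts": base / "rollout_data",
    }


def latest_entry(directory):
    """Return the most recently modified entry, or None if the directory is empty or missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    entries = sorted(directory.iterdir(), key=lambda p: p.stat().st_mtime)
    return entries[-1] if entries else None
```

For example, `latest_entry(experiment_dirs("my_project", "my_run")["checkpoints"])` returns the newest checkpoint entry for a run, where `my_project` and `my_run` are placeholders.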

The default scripts log to both W&B and the console, so authenticate with W&B before launching:

wandb login

To avoid W&B, change trainer.logger=['console','wandb'] to trainer.logger=['console'] in the target script.

🤖 API-Based Large-Model Evaluation

The scripts/Giants/ directory evaluates closed-source or hosted models through OpenRouter-compatible APIs.

pip install openai
export OPENROUTER_API_KEY="sk-or-..."

bash scripts/Giants/eval_gpt4o.sh
bash scripts/Giants/eval_gpt5.sh
bash scripts/Giants/run_all_sokoban.sh

The API evaluation writes summaries and episode dumps under exps/giants/.

🌐 Remote Environment Notes

Remote YAML files contain concrete base_urls used in our internal cluster, for example http://localhost:8036. Reviewers should replace these addresses with their own server addresses.

Health check:

curl -s http://localhost:8036/health

If a rendering server runs on a separate machine, expose the server port with SSH tunneling or the cluster networking tool available in your environment.
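When several servers are listed in base_urls, the same health check can be scripted. The sketch below uses only the Python standard library and assumes that a healthy server answers GET /health with HTTP 200; the response body is not inspected, and the function names are hypothetical.

```python
import http.client
import time
import urllib.request


def is_healthy(base_url, timeout=5.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_for_servers(base_urls, retries=10, delay=3.0):
    """Poll the servers until all are healthy; return the ones still failing."""
    pending = list(base_urls)
    for attempt in range(retries):
        pending = [url for url in pending if not is_healthy(url)]
        if not pending:
            break
        if attempt < retries - 1:
            time.sleep(delay)
    return pending
```

Calling `wait_for_servers(["http://localhost:8037", "http://localhost:8038"])` before launching training returns an empty list once every server responds, or the addresses that never became reachable.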

❓ Common Issues

ModuleNotFoundError: hydra, fire, or ray: install hydra-core fire ray.
python -m atlasva.main_ppo cannot find the config: pass --config-path="$(pwd)/atlasva/configs" --config-name=atlasva_multiturn.
ModuleNotFoundError: PIL: install Pillow.
PrimitiveSkill or Swap fails during mani_skill/SAPIEN import: check that ManiSkill assets are downloaded and EGL/Vulkan/GLVND files are available on the host or inside the container.
Remote training hangs at environment creation: check base_urls, firewall rules, and curl http://<host>:<port>/health.
Hugging Face downloads are slow or unavailable: set QWEN25_VL_3B_LOCAL_PATH to a local model directory.
W&B is unavailable: change trainer.logger to console-only in the script.
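Several of the issues above are missing packages, which can be caught in one pass before launching anything. The following sketch checks whether the import names are discoverable without importing them; the mapping is taken from the install commands in this README, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import name -> pip package name, taken from the install commands in this README.
REQUIRED = {
    "hydra": "hydra-core",
    "fire": "fire",
    "ray": "ray",
    "PIL": "Pillow",
    "wandb": "wandb",
    "vllm": "vllm",
    "transformers": "transformers",
}


def missing_packages(required=REQUIRED):
    """Return pip package names whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```

This only confirms that the packages are installed, not that their versions match the pins in the setup section.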

✅ Minimal Reviewer Workflow

For a lightweight artifact check:

Install the environment.
Run the smoke tests.
Start one Navigation or PrimitiveSkill server.
Edit the relevant YAML base_urls to localhost.
Launch the corresponding _Local or remote training script with reduced trainer.total_training_steps and smaller n_envs.

For full reproduction, use the scripts listed above without reducing the training steps.

Acknowledgments

Our work builds upon VAGEN, and we sincerely thank its authors for their outstanding contributions.
