Official code for:
Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa
University of Toronto · UC Berkeley · Wayve · Vector Institute · Simon Fraser University
[Project page] | [arXiv]
A reinforcement learning agent that learns to explore 3D indoor scenes using only an RGB camera. The agent uses a sliding-window transformer policy (DINO visual encoder + KV-cached transformer backbone) and is trained with PPO, using real-time Gaussian Splatting (GSplat) reconstruction as its intrinsic reward signal.
- Task: Camera exploration in Habitat-Matterport 3D (HM3D) and Gibson scenes
- Policy: DINO ViT-B/8 + sliding-window transformer with KV cache
- Reward: GSplat reconstruction quality (MSE between predicted and ground-truth novel views)
- Downstream tasks: Apple picking, image-goal navigation (finetuned from the explore checkpoint)
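The reward bullet above names MSE between rendered and ground-truth novel views. The sketch below only illustrates that error term. It assumes images are float arrays in [0, 1] with shape (num_views, H, W, 3); the function name `novel_view_mse` is hypothetical, and how the repository turns this error into a reward is not specified here.

```python
import numpy as np

def novel_view_mse(rendered: np.ndarray, ground_truth: np.ndarray) -> np.ndarray:
    """Per-view mean squared error between rendered and ground-truth images.

    Assumption: both arrays have shape (num_views, H, W, 3), values in [0, 1].
    """
    if rendered.shape != ground_truth.shape:
        raise ValueError("rendered and ground_truth must have the same shape")
    diff = rendered.astype(np.float64) - ground_truth.astype(np.float64)
    # Average over height, width and channels, keeping one value per view.
    return (diff ** 2).mean(axis=(1, 2, 3))

# Illustrative data only: random "ground truth" and a brightened copy.
rng = np.random.default_rng(0)
gt = rng.random((2, 8, 8, 3))
degraded = np.clip(gt + 0.1, 0.0, 1.0)
print(novel_view_mse(gt, gt))        # zero error for a perfect reconstruction
print(novel_view_mse(degraded, gt))  # positive error for a degraded one
```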
modules/
agent/ Policy and transformer architecture
environment/ Habitat environment wrappers, GSplat, rendering
eval/ Checkpoint evaluation scripts
ppo/ PPO training scripts
ablations/ Ablation variants (RNN, ICM, context-window ablations, etc.)
scripts/ Data download and split generation utilities
data/
splits/ Validation episode splits (HM3D, Gibson)
main.py Dispatcher entrypoint
environment.yml Conda environment
requirements.txt Python dependencies
conda env create -f environment.yml
conda activate recuriosity

Install the wheel matching your CUDA driver. Example for CUDA 12.4:
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 \
--extra-index-url https://download.pytorch.org/whl/cu124

See https://pytorch.org/get-started/locally/ for other driver versions.
CUDA_HOME=$CONDA_PREFIX CUDA_ARCHITECTURES="75;80;89;90" \
pip install --no-build-isolation \
"git+https://github.com/rahul-goel/fused-ssim"

CUDA_HOME=$CONDA_PREFIX TORCH_CUDA_ARCH_LIST="7.5;8.0;8.9;9.0" MAX_JOBS=$(nproc) \
pip install --no-build-isolation -r requirements.txt

habitat-sim must be built from source with CUDA and headless rendering support.
pip install scikit-build-core pybind11
CUDA_HOME=$CONDA_PREFIX \
HABITAT_BUILD_GUI_VIEWERS=OFF \
HABITAT_WITH_BULLET=ON \
HABITAT_WITH_CUDA=ON \
MAX_JOBS=$(nproc) \
pip install --no-build-isolation -v \
"habitat-sim @ git+https://github.com/facebookresearch/habitat-sim.git"

This takes 10–60 minutes. See https://github.com/facebookresearch/habitat-sim/blob/main/BUILD_FROM_SOURCE.md for details.
Verify:
python -c "import habitat_sim; print('habitat_sim ok')"

- Register at https://aihabitat.org/datasets/hm3d/
- Download the HM3D train and validation splits (GLB + navmesh)
- Extract to a local directory.
Pass the path via --data_root in all training and eval commands. Expected layout:
<data_root>/hm3d/hm3d_glb/<scene_id>/<scene>.glb
<data_root>/hm3d/hm3d_nav/<scene_id>/<scene>.navmesh
<data_root>/hm3d-val/hm3d_glb/<scene_id>/<scene>.glb
<data_root>/hm3d-val/hm3d_nav/<scene_id>/<scene>.navmesh
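Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper names `find_scene_pairs` and `report` are hypothetical, and it assumes each navmesh shares its GLB's file stem, as the `<scene>` placeholder in the layout suggests.

```python
from pathlib import Path

def find_scene_pairs(data_root: str, split_dir: str = "hm3d") -> dict:
    """Map scene_id -> (glb_path, navmesh_path or None) for one split directory."""
    root = Path(data_root) / split_dir
    pairs = {}
    for glb in sorted((root / "hm3d_glb").glob("*/*.glb")):
        scene_id = glb.parent.name
        # Assumption: the navmesh has the same stem as the GLB.
        navmesh = root / "hm3d_nav" / scene_id / (glb.stem + ".navmesh")
        pairs[scene_id] = (glb, navmesh if navmesh.is_file() else None)
    return pairs

def report(data_root: str) -> None:
    """Print a per-split summary of GLB scenes and missing navmeshes."""
    for split_dir in ("hm3d", "hm3d-val"):
        pairs = find_scene_pairs(data_root, split_dir)
        missing = [s for s, (_, nav) in pairs.items() if nav is None]
        print(f"{split_dir}: {len(pairs)} GLB scenes, {len(missing)} without a navmesh")

# Usage (replace with your own directory):
# report("/path/to/data_root")
```

A split that reports zero GLB scenes usually means the path is wrong or the archives were extracted one level too deep.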
- Request access and download the Gibson dataset (Habitat-compatible GLBs) via the Gibson Dataset request form.
- Extract to a local directory and pass the path via --gibson_root in eval commands.
Gibson scenes follow the same GLB format as HM3D. The expected directory layout (matching the episode split paths) is:
$GIBSON_SCENE_ROOT/
Adrian.glb
Annawan.glb
...
Pre-generated splits are included under data/splits/:
| File | Task | Episodes |
|---|---|---|
| data/splits/hm3d/val/val.json.gz | Exploration | 200 (100 scenes × 2 starts) |
| data/splits/hm3d/val/val_apples.json | Apple picking | 200 |
| data/splits/hm3d/val/val_image_goal.json | Image-goal navigation | 200 |
| data/splits/gibson/val/val_gibson_bigisland.json.gz | Exploration (Gibson) | 86 (86 scenes × 1 start) |
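The split files are a mix of plain and gzip-compressed JSON. A small loader like the one below can read either kind for inspection. This is a sketch under assumptions: `load_split` and `describe` are hypothetical helpers, and the JSON schema is not documented here, so the code only reports the top-level shape.

```python
import gzip
import json
from pathlib import Path

def load_split(path: str):
    """Load a split file, handling both .json and .json.gz."""
    p = Path(path)
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", encoding="utf-8") as f:
        return json.load(f)

def describe(data) -> str:
    """Summarise the top-level structure without assuming a schema."""
    if isinstance(data, dict):
        return "dict with keys: " + ", ".join(sorted(map(str, data)))
    if isinstance(data, list):
        return f"list of length {len(data)}"
    return type(data).__name__

# Usage, run from the repository root:
# print(describe(load_split("data/splits/hm3d/val/val.json.gz")))
```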
Download the pretrained checkpoints from HuggingFace:
| File | Description |
|---|---|
| explorer.pt | Main exploration policy, trained on HM3D |
| apple_finetuned.pt | Apple-picking fine-tune |
| image_goal_finetuned.pt | Image-goal navigation fine-tune |
Pass the checkpoint path via --checkpoint-path (eval) or --base_ckpt / --weights_only_ckpt (fine-tuning). The commands below use checkpoints/explore.pt as an example path.
All training scripts are invoked via main.py. Multi-GPU training uses torchrun.
Pass the HM3D data directory via --data_root (the directory containing hm3d/ and hm3d-val/). When using Docker via make, this is handled automatically by the Makefile.
export WANDB_API_KEY=your_key_here
# To disable: export WANDB_MODE=disabled

To train, run the following command:
# Single GPU
python main.py --script explore_no_pose \
--data_root /path/to/hm3d \
--num_envs 4 \
--logdir runs \
--checkpoint_path checkpoints/explore.pt
# Multi-GPU (8× H100)
torchrun --standalone --nproc_per_node=8 main.py --script explore_no_pose \
--data_root /path/to/hm3d \
--num_envs 72 \
--logdir runs \
--checkpoint_path checkpoints/explore.pt

Training runs on 8× H100s; with --num_envs 72 it takes about 3 days on 80 GB GPUs. For lower GPU memory usage, set --num_envs 32 (about 6 days; this is the configuration used for the released checkpoint).
Resume from a checkpoint: --base_ckpt checkpoints/explore.pt
Key hyperparameters (all have sensible defaults):
| Flag | Default | Description |
|---|---|---|
| --num_envs | 4 | Number of parallel environments (use 32 or 72 for multi-GPU) |
| --roll_length | 1024 | Steps per rollout |
| --learning_rate | 1e-5 | Adam learning rate |
| --attn_window | 64 | Sliding attention window size (frames) |
| --nerf_iters | 10 | GSplat optimization steps per rollout step |
| --ent_coef_start | 0.1 | Initial entropy coefficient |
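To make the --attn_window setting concrete, the sketch below builds a causal sliding-window attention mask over frames. It illustrates the general technique only. Whether the repository's window includes the current frame, and how it interacts with the KV cache, are assumptions here, and `sliding_window_mask` is a hypothetical name.

```python
def sliding_window_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query frame i may attend to key frame j.

    Assumption: the window is causal and counts the current frame, so frame i
    sees frames i - window + 1 .. i.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]

# Small example: 5 frames, window of 3.
for row in sliding_window_mask(5, 3):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

With a mask of this shape, keys older than the window are never attended to again, which is what allows a KV cache to discard them.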
# RNN backbone
torchrun --standalone --nproc_per_node=8 main.py --script ablation_rnn ...
# RNN + ICM
torchrun --standalone --nproc_per_node=8 main.py --script ablation_rnn_icm ...
# Transformer + ICM
torchrun --standalone --nproc_per_node=8 main.py --script ablation_icm ...
# Context length ablation: --context_window caps how many past frame tokens the agent receives as input (1, 4, or 16)
torchrun --standalone --nproc_per_node=8 main.py --script ablation_ctx16 --context_window 1 ...
# Forgetful GSplat (sliding scene window)
torchrun --standalone --nproc_per_node=8 main.py --script ablation_forgetful ...

Finetune from the pretrained exploration checkpoint:
torchrun --standalone --nproc_per_node=8 main.py --script apples_no_pose \
--data_root /path/to/hm3d \
--weights_only_ckpt checkpoints/explore.pt \
--num_envs 72 \
--logdir runs \
--checkpoint_path checkpoints/apples.pt

torchrun --standalone --nproc_per_node=8 main.py --script image_goal_no_pose \
--data_root /path/to/hm3d \
--weights_only_ckpt checkpoints/explore.pt \
--learning_rate 1e-6 \
--num_envs 72 \
--logdir runs \
--checkpoint_path checkpoints/image_goal.pt

Evaluation runs the policy on fixed pre-generated episodes and computes surface coverage completeness at a 0.05 m threshold. Results are written to eval_outputs/ and logged to W&B.
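As an illustration of a completeness metric at a distance threshold, the sketch below uses one common definition: the fraction of ground-truth surface points that have a reconstructed point within the threshold. This is an assumption about the metric's form, not the repository's evaluation code. The function name `completeness` is hypothetical, and the brute-force search is only suitable for small point sets.

```python
import math

def completeness(gt_points, recon_points, threshold=0.05):
    """Fraction of ground-truth points with a reconstructed point within
    `threshold` metres (brute force, O(len(gt) * len(recon)))."""
    if not gt_points:
        raise ValueError("gt_points must not be empty")
    covered = sum(
        1
        for g in gt_points
        if any(math.dist(g, r) <= threshold for r in recon_points)
    )
    return covered / len(gt_points)

# Four ground-truth points spaced 1 m apart; only the first is reconstructed
# closely enough (1 cm away), the second is 20 cm off.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
recon = [(0.0, 0.0, 0.01), (1.0, 0.2, 0.0)]
print(completeness(gt, recon))  # 0.25
```

For meshes with millions of sampled points, a spatial index such as a k-d tree would replace the inner loop.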
The --eval-hole-fix and --eval-hole-fix-force-white flags patch scene meshes by adding a white material to backfaces in areas where the mesh has holes. Without this, Habitat disables backface rendering in those regions, leading to mismatched RGB rendering and collision detection in hole areas during evaluation.
python main.py --script eval --data_root /path/to/hm3d \
--ppo-module modules.ppo.train_ppo_explore_no_pose \
--checkpoint-path checkpoints/explore.pt \
--episodes-json data/splits/hm3d/val/val.json.gz \
--eval-hole-fix --eval-hole-fix-force-white \
--output-dir eval_outputs/

Multi-GPU (episodes are distributed across GPUs):
torchrun --standalone --nproc_per_node=8 main.py --script eval --data_root /path/to/hm3d \
--ppo-module modules.ppo.train_ppo_explore_no_pose \
--checkpoint-path checkpoints/explore.pt \
--episodes-json data/splits/hm3d/val/val.json.gz \
--eval-hole-fix --eval-hole-fix-force-white \
--output-dir eval_outputs/

python main.py --script eval \
--data_root /path/to/hm3d \
--gibson_root /path/to/gibson \
--ppo-module modules.ppo.train_ppo_explore_no_pose \
--checkpoint-path checkpoints/explore.pt \
--episodes-json data/splits/gibson/val/val_gibson_bigisland.json.gz \
--eval-hole-fix --eval-hole-fix-force-white \
--output-dir eval_outputs/

python main.py --script eval_apples --data_root /path/to/hm3d \
--ppo-module modules.ppo.train_ppo_apples_no_pose \
--checkpoint-path checkpoints/apples.pt \
--episodes-json data/splits/hm3d/val/val_apples.json \
--eval-hole-fix --eval-hole-fix-force-white \
--output-dir eval_outputs/

python main.py --script eval_image_goal --data_root /path/to/hm3d \
--ppo-module modules.ppo.train_ppo_image_goal_no_pose \
--checkpoint-path checkpoints/image_goal.pt \
--episodes-json data/splits/hm3d/val/val_image_goal.json \
--eval-hole-fix --eval-hole-fix-force-white \
--output-dir eval_outputs/

Code and instructions for running the active mapping baselines (ANS-RGB, ANS-Depth, OccAnt-RGB, OccAnt-RGBD) will be released in a future update.
| Issue | Cause / Fix |
|---|---|
| FileNotFoundError: No GLBs found | HM3D data missing or --data_root path incorrect |
| ModuleNotFoundError: habitat_sim | habitat-sim not installed; build from source (see above) |
| Unable to create windowless context / unable to find CUDA device 0 among EGL devices | Container needs NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics. The Makefile sets this automatically. Also requires --gpus all and the nvidia-container-toolkit EGL setup on the host. |
| torch.hub download failure (DINO) | DINOv2 weights are downloaded on first run; ensure internet access or pre-cache with torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14') |
| gsplat build fails | Ensure TORCH_CUDA_ARCH_LIST is set and matches your GPU |
| habitat-sim build fails | Ensure cmake >= 3.14, libegl1-mesa-dev, libgl1-mesa-dev are installed |
| OOM with many environments | Reduce --num_envs (32 is safe for 4× H100 80GB) |
@article{goli2026recuriosity,
title = {Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration},
author = {Goli, Lily and Kerr, Justin and Reda, Daniele and Jacobson, Alec and Tagliasacchi, Andrea and Kanazawa, Angjoo},
journal = {arXiv preprint arXiv:2605.22814},
year = {2026},
url = {https://arxiv.org/abs/2605.22814},
}