OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
Lening Wang\*, Wenzhao Zheng\*†, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu

\* Equal contribution
With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving.
## News

- [2024/05/31] Training, evaluation, and visualization code released.
- [2024/05/31] Paper released on arXiv.
Unlike most existing world models, which adopt an autoregressive framework to perform next-token prediction, we propose a diffusion-based 4D occupancy generation model, OccSora, to model long-term temporal evolution more efficiently. We employ a 4D scene tokenizer to obtain compact discrete spatio-temporal representations of the 4D occupancy input and achieve high-quality reconstruction of long-sequence occupancy videos. We then train a diffusion transformer on these spatio-temporal representations to generate 4D occupancy conditioned on a trajectory prompt. OccSora can generate 16-second videos with authentic 3D layouts and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
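The first stage described above is a discrete tokenizer (a VQVAE, per the training section below). As a minimal illustration of the vector-quantization idea only — with made-up shapes and a random codebook, not the paper's actual tokenizer — nearest-codebook-entry lookup can be sketched as:

```python
import numpy as np

# Hypothetical sizes: a tiny codebook and a handful of latent vectors.
# The real OccSora tokenizer operates on 4D occupancy features; this
# sketch only illustrates the nearest-neighbor quantization step.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 codes, 4-dim embeddings
latents = rng.normal(size=(5, 4))    # 5 latent vectors to quantize

# Nearest codebook entry per latent (squared Euclidean distance).
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)        # discrete token ids, shape (5,)
recon = codebook[tokens]             # de-quantized ("decoded") vectors

print(tokens.shape, recon.shape)
```

The diffusion transformer then operates on discrete ids like `tokens`, and the decoder maps them back to occupancy, analogous to the `recon` lookup here.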
## Installation

- Create a conda environment with Python 3.8.0.
- Install the packages listed in `environment.yaml`.
- Install mmdetection3d by following the official mmdetection3d installation documentation.
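A possible sequence for the environment steps above (the environment name `occsora` is illustrative, and `conda env update` assumes `environment.yaml` is a conda environment file — adjust to your setup):

```shell
# Illustrative setup: create a Python 3.8.0 env, activate it, then
# install the packages pinned in environment.yaml into it.
conda create -n occsora python=3.8.0 -y
conda activate occsora
conda env update -n occsora -f environment.yaml
```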
## Data Preparation

- Create a soft link from `data/nuscenes` to `your_nuscenes_path`.
- Prepare the ground-truth (gts) semantic occupancy introduced in [Occ3d].
- Download the generated train/val pickle files and put them in `data/`:
  - [nuscenes_infos_train_temporal_v3_scene.pkl]
  - [nuscenes_infos_val_temporal_v3_scene.pkl]
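The soft-link step above can be done with `ln -s` from the repository root (`your_nuscenes_path` is a placeholder for your local nuScenes root):

```shell
# Link your local nuScenes root into the repository's data/ directory.
# your_nuscenes_path is a placeholder -- substitute your own path.
mkdir -p data
ln -s your_nuscenes_path data/nuscenes
```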
The dataset should be organized as follows:

```
OccSora/data
├── nuscenes                  # downloaded from www.nuscenes.org
│   ├── lidarseg
│   ├── maps
│   ├── samples
│   ├── sweeps
│   └── v1.0-trainval
├── gts                       # downloaded from Occ3d
├── nuscenes_infos_train_temporal_v3_scene.pkl
└── nuscenes_infos_val_temporal_v3_scene.pkl
```
## Getting Started

### Training

Train the VQVAE (tested on an A100 with 80 GB of GPU memory):

```shell
python train_1.py --py-config config/train_vqvae.py --work-dir out/vqvae
```

Generate the training token data using the trained VQVAE:

```shell
python step02.py --py-config config/train_vqvae.py --work-dir out/vqvae
```

Train OccSora (A100, 80 GB):

```shell
torchrun --nnodes=1 --nproc_per_node=8 train_2.py --model DiT-XL/2 --data-path /path
```
### Evaluation

Evaluate the model (A100, 80 GB). The tokens are obtained by denoising the noise in `samples_array.npy`:

```shell
python sample.py --model DiT-XL/2 --image-size 256 --ckpt "/results/001-DiT-XL-2/checkpoints/1200000.pt"
```
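`samples_array.npy` holds the noise that gets denoised into tokens. A sketch of producing such a file with NumPy — the shape here is made up; the real token-latent shape depends on the trained VQVAE:

```python
import numpy as np

# Hypothetical shape (1, 4, 32, 32): the actual latent shape is set by
# the trained tokenizer. This only shows the .npy save/load round trip.
rng = np.random.default_rng(42)
noise = rng.standard_normal((1, 4, 32, 32)).astype(np.float32)

np.save("samples_array.npy", noise)   # the noise file the README describes
loaded = np.load("samples_array.npy")
print(loaded.shape, loaded.dtype)
```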
### Visualization

```shell
python visualize_demo.py --py-config config/train_vqvae.py --work-dir out/vqvae
```
## Acknowledgements

Our code is based on OccWorld and DiT.

Many thanks also to these excellent open-source repositories: TPVFormer, MagicDrive, and BEVFormer.
## Citation

If you find this project helpful, please consider citing the following paper:

```bibtex
@article{wang2024occsora,
    title={OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving},
    author={Wang, Lening and Zheng, Wenzhao and Ren, Yilong and Jiang, Han and Cui, Zhiyong and Yu, Haiyang and Lu, Jiwen},
    journal={arXiv preprint arXiv:2405.20337},
    year={2024}
}
```