🍾 POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Antonin Vobecky Oriane Siméoni David Hurych Spyros Gidaris Andrei Bursuc Patrick Pérez Josef Sivic
Welcome to the official implementation of POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images.
@inproceedings{vobecky2023POP3D,
title={POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images},
author={Antonin Vobecky and Oriane Siméoni and David Hurych and Spyros Gidaris and Andrei Bursuc and Patrick Pérez and Josef Sivic},
booktitle = {Advances in Neural Information Processing Systems},
volume = {37},
year = {2023}
}
Please make sure you have GCC 5 or higher installed.
Run the following script to prepare the pop3d conda environment:
conda env create -f conda_env.yaml
Download weights from this link and put them in the ./ckpts folder.
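A minimal sketch of this step (the URL placeholder is illustrative; use the actual link above):
mkdir -p ./ckpts
wget -P ./ckpts "<WEIGHTS_URL>"  # replace <WEIGHTS_URL> with the link above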
Step 0. Create a conda environment, activate it, and install the requirements:
cd MaskCLIP
conda create -n maskclip python=3.9
conda activate maskclip
pip install --no-cache-dir -r requirements.txt
pip install --no-cache-dir opencv-python
Step 1. Install PyTorch and Torchvision following the official instructions, e.g., for PyTorch 1.10 with CUDA 11.1:
pip install --no-cache-dir torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
Step 2. Install MMCV:
pip install --no-cache-dir mmcv-full==1.5.0
Step 3. Install CLIP.
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
Step 4. Install MaskCLIP.
pip install --no-cache-dir -v -e .
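As a quick sanity check (not part of the original instructions), you can verify that the key packages import correctly and that CUDA is visible:
python -c "import torch, mmcv, clip; print(torch.__version__, mmcv.__version__, torch.cuda.is_available())"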
Download and extract the nuScenes dataset (link) and place it in the ./data/nuscenes folder. This means downloading all the nuScenes files, including both the trainval and test splits.
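If the dataset already lives elsewhere on your machine, a symbolic link is a simple alternative to copying (the source path below is illustrative):
mkdir -p ./data
ln -s /path/to/your/nuscenes ./data/nuscenes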
We provide info files that simplify working with the nuScenes dataset; our dataloaders rely on them. Please put these files in the ./data folder (inside the POP3D folder). To do this, simply run:
bash scripts/download_info_files.sh
To download the data for our Open-vocabulary language-driven retrieval dataset, please run:
bash scripts/download_retrieval_benchmark.sh
To activate the environment, please run:
conda activate pop3d
Run the following script to prepare the projection files. The default path to the directory with the nuScenes dataset is set to ./data/nuscenes.
NUSC_ROOT=./data/nuscenes
PROJ_DIR=./data/nuscenes/features/projections
python3 generate_projections_nuscenes.py --nusc_root ${NUSC_ROOT} --proj-dir ${PROJ_DIR}
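A quick, optional way to confirm that the projection files were written (the exact file layout may differ):
find ${PROJ_DIR} -type f | wc -l   # should be non-zero after the script finishes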
Switch to the MaskCLIP directory in this project (cd MaskCLIP).
- Activate MaskCLIP environment:
conda activate maskclip
- Prepare the backbone weights:
mkdir -p ./pretrain
python tools/maskclip_utils/convert_clip_weights.py --model ViT16 --backbone
python tools/maskclip_utils/convert_clip_weights.py --model ViT16
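After the conversion, the converted weight files should appear under ./pretrain; a quick check (the exact file names depend on the conversion script):
ls -lh ./pretrain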
- Download the pre-trained weights from this link and save them as ckpts/maskclip_plus_vit16_deeplabv2_r101-d8_512x512_8k_coco-stuff164k.pth
- Run the feature extraction:
CFG_PATH=configs/maskclip_plus/anno_free/maskclip_plus_vit16_deeplabv2_r101-d8_512x512_8k_coco-stuff164k__nuscenes_trainvaltest.py
CKPT_PATH=ckpts/maskclip_plus_vit16_deeplabv2_r101-d8_512x512_8k_coco-stuff164k.pth
PROJ_DIR=../data/nuscenes/features/projections/data/nuscenes
OUT_DIR=../data/nuscenes/maskclip_features_projections
python tools/extract_features.py ${CFG_PATH} --save-dir ${OUT_DIR} --checkpoint ${CKPT_PATH} --projections-dir ${PROJ_DIR} --complete
This generates the target MaskCLIP+ features used for training our method.
Note: the process of preparing the targets from MaskCLIP+ can be slow, depending on the speed of your file system. If you want to parallelize, we provide the following skeleton for launching multiple jobs using SLURM:
NUM_GPUS=... # number of GPUs to use
ACCOUNT=... # name of your account, if any
HOURS_TOTAL=... # how long you expect the *WHOLE* extraction of features to last
MASKCLIP_DIR=/path/to/POP3D/MaskCLIP
bash generate_features_slurm.sh ${NUM_GPUS} ${HOURS_TOTAL} ${ACCOUNT} ${MASKCLIP_DIR}
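For example, with illustrative values filled in:
NUM_GPUS=8          # e.g., extract on 8 GPUs in parallel
ACCOUNT=my_account  # leave empty if your cluster does not use accounts
HOURS_TOTAL=24      # rough estimate for the whole extraction
MASKCLIP_DIR=/path/to/POP3D/MaskCLIP
bash generate_features_slurm.sh ${NUM_GPUS} ${HOURS_TOTAL} ${ACCOUNT} ${MASKCLIP_DIR}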
Note 2: It is expected to get a size mismatch for decode_head.text_embeddings: copying a param with shape torch.Size([171, 512]) from checkpoint, the shape in current model is torch.Size([28, 512]). These weights are not used during feature extraction.
Our model was trained on 8x NVIDIA A100 GPUs.
Please modify the following variables in the training script ``:
PARTITION="..." # name of the partition on your cluster, e.g., "gpu"
ACCOUNT="..." # name of your account, if it is set on your cluster
USERNAME="..." # your username on the cluster, used just for printing of running jobs
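For example (the values below are illustrative and cluster-specific):
PARTITION="gpu"
ACCOUNT="my_project"
USERNAME="$(whoami)"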
Script to run the training using SLURM: (NOT WORKING YET)
POP3D_DIR=/path/to/POP3D
bash scripts/train_slurm.sh ${POP3D_DIR}
The weights used for the results in the paper are available here, and the zero-shot weights are available here. Please put both files in the ${POP3D_DIR}/pretrained folder for easier use.
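Based on the paths used in the evaluation commands below, the expected layout is:
mkdir -p ${POP3D_DIR}/pretrained
# ${POP3D_DIR}/pretrained/pop3d_weights.pth      <- model weights
# ${POP3D_DIR}/pretrained/zeroshot_weights.pth   <- zero-shot text embeddings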
To obtain results from our paper, please run:
A) single-GPU (slow):
CFG=...
CKPT=...
ZEROSHOT_PTH=...
python3 eval.py --py-config ${CFG} --resume-from ${CKPT} --maskclip --no-wandb --text-embeddings-path ${ZEROSHOT_PTH}
If you followed the instructions above, you can run:
python3 eval.py --py-config config/pop3d_maskclip_12ep.py --resume-from ./pretrained/pop3d_weights.pth --maskclip --no-wandb --text-embeddings-path ./pretrained/zeroshot_weights.pth
B) multi-GPU using SLURM (faster), e.g.:
POP3D_DIR=`pwd`
CKPT="./pretrained/pop3d_weights.pth"
NUM_GPUS=8
HOURS=1
CFG="config/pop3d_maskclip_12ep.py"
EXTRA="--text-embeddings-path ./pretrained/zeroshot_weights.pth"
bash scripts/eval_zeroshot_slurm.sh ${POP3D_DIR} ${CKPT} ${NUM_GPUS} ${HOURS} ${CFG} ${EXTRA}
EXPECTED RESULTS:
val_miou_vox_clip_all (evaluated over the complete voxel space): 16.65827465887346
To obtain the retrieval benchmark results from our paper, please run:
python retrieval.py
Expected results:
+-------------------------------+
| train (42 samples) |
+----------+------+-------------+
| method | mAP | mAP visible |
+----------+------+-------------+
| POP3D | 15.3 | 15.6 |
| MaskCLIP | N/A | 13.5 |
+----------+------+-------------+
+-------------------------------+
| val (27 samples) |
+----------+------+-------------+
| method | mAP | mAP visible |
+----------+------+-------------+
| POP3D | 24.1 | 24.7 |
| MaskCLIP | N/A | 18.7 |
+----------+------+-------------+
+-------------------------------+
| test (36 samples) |
+----------+------+-------------+
| method | mAP | mAP visible |
+----------+------+-------------+
| POP3D | 12.6 | 13.6 |
| MaskCLIP | N/A | 12.0 |
+----------+------+-------------+
+-------------------------------+
| valtest (63 samples) |
+----------+------+-------------+
| method | mAP | mAP visible |
+----------+------+-------------+
| POP3D | 17.5 | 18.4 |
| MaskCLIP | N/A | 14.9 |
+----------+------+-------------+
Results will be written to ./results/results_${TIMESTAMP}.txt
and to ./results/results_tables_${TIMESTAMP}.txt
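To print the most recent results file, an optional convenience command is:
ls -t ./results/results_*.txt | head -n 1 | xargs cat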
Our code is based on TPVFormer and MaskCLIP. Many thanks to the authors!