3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding

Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu

Description

Official implementation of the paper "3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding". The paper was accepted at ICME-3DMM.

Setup

To set up the environment, run the following commands:

conda create -n 3dmit python==3.10.13
conda activate 3dmit

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Data

Training data

Source 3D scene point clouds come from two datasets.

ScanNet

You can download the processed 3D point cloud files from this link:

https://drive.google.com/file/d/1vTcOFmTK0jvbRpPqggWj2cWx4gA7ulrE/view?usp=sharing

3RScan

You can download the 3D point cloud files from its website:
```
https://github.com/WaldJohannaU/3RScan
```

The language instructions cover tasks including VQA, visual grounding (VG), multi-choice, detection, and conversations.

You can download them from this link:

https://drive.google.com/file/d/1s1ehz8Q6WX9bCghVNtL9sPeulb6886I0/view?usp=sharing

Place them in the following layout:

./datasets/data/3D_Instruct/meta_file
    ├── VQA_all_84w.json  # all data: 840K from 3RScan and ScanNet
    ├── VQA_all_75w.json  # ScanNet only: 740K 3D-text pairs
    └── VQA_3rscan.json   # 3RScan only: 9572 3D-text pairs
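
Before training, it can help to confirm that the downloaded meta files are in place and parse as JSON. The following is a minimal sketch and not part of the repository: the helper name `summarize_meta_files` is hypothetical, and it assumes the top-level value of each meta file is a list (or dict) of entries, which this README does not confirm.

```python
import json
from pathlib import Path

# File names taken from the layout above.
META_FILES = ("VQA_all_84w.json", "VQA_all_75w.json", "VQA_3rscan.json")


def summarize_meta_files(meta_dir, names=META_FILES):
    """Return {file name: entry count, or None if the file is missing}.

    Assumption: the top-level JSON value is a list or dict, so len() gives
    the number of entries. Adjust this if the actual schema differs.
    """
    summary = {}
    for name in names:
        path = Path(meta_dir) / name
        if not path.is_file():
            summary[name] = None
            continue
        with path.open("r", encoding="utf-8") as f:
            summary[name] = len(json.load(f))
    return summary


if __name__ == "__main__":
    result = summarize_meta_files("./datasets/data/3D_Instruct/meta_file")
    for name, count in result.items():
        print(f"{name}: {'missing' if count is None else count}")
```

Note that this loads each file fully into memory, so the larger files may take a while to parse.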

Eval data

The language instructions cover 4 evaluation tasks, which you can find in:

./datasets/data/3D_Benchmark/meta_file/
    ├── VQA
    ├── multi-choice
    ├── visual grounding
    │     ├── obj location prediction
    │     └── obj index prediction
    └── detection

3D features

For ScanNet:

scannet_attributes.json      
scannet_uni3d_feats_1024.pt
scannet_train_attributes.pt  
scannet_uni3d_feats.pt
scannet_ulip2_feats.pt

For 3RScan:

3rscan_attributes.json     
3rscan_ulip2_feats.pt       
3rscan_uni3d_feats_1024.pt

Download link:

https://drive.google.com/file/d/12kXvxn9iYI20l-5k6MEpyONr1sEGr2o2/view?usp=sharing
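
After unpacking the download, a quick check for missing feature files can save a failed run later. This sketch relies only on the file names listed above. The function name `missing_feature_files` and the directory `./datasets/features` are hypothetical placeholders, since this README does not state where the feature files should be placed.

```python
from pathlib import Path

# File names as listed in this section.
FEATURE_FILES = {
    "scannet": (
        "scannet_attributes.json",
        "scannet_uni3d_feats_1024.pt",
        "scannet_train_attributes.pt",
        "scannet_uni3d_feats.pt",
        "scannet_ulip2_feats.pt",
    ),
    "3rscan": (
        "3rscan_attributes.json",
        "3rscan_ulip2_feats.pt",
        "3rscan_uni3d_feats_1024.pt",
    ),
}


def missing_feature_files(feature_dir, dataset):
    """Return the expected files for `dataset` that are absent from `feature_dir`."""
    root = Path(feature_dir)
    return [name for name in FEATURE_FILES[dataset] if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical location; point this at wherever you unpacked the archive.
    for dataset in FEATURE_FILES:
        print(dataset, missing_feature_files("./datasets/features", dataset))
```

An empty list for a dataset means every file named in this section was found.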

Model_zoo

src/
├── model_zoo/
│   ├── epcl_ckpts/
│   │   └── epcl_scannet_vit-L-14_256tokens_latest.pth
│   ├── vicuna-7b-v0
│   ├── vicuna-13b-v0
│   ├── llava1.6-7b
│   └── llava1.6-13b

You can download the EPCL checkpoint from this link:

https://drive.google.com/file/d/177yY53BGMELlVFWlmHYArE0HlCFsCntW/view?usp=sharing

Ckpt & result

For the result of 3DMIT (Vicuna-7b) with ULIP-2:

https://drive.google.com/file/d/1Debdd_ZsjiAPhlmrnSgjotPO5XMBcm1T/view?usp=sharing

For the result of 3DMIT (Vicuna-7b) with Uni3D:

https://drive.google.com/file/d/1qTrEtpfG2L-luOcfX7L9JySHOeDjoAtr/view?usp=sharing

Model

Using scene information only:

./src/model/3dmit-onlyscene-512.py

Using scene and object information:

./src/model/3dmit.py

Using scene, object, and 2D image information:

./src/model/3dmit-scene+obj+img-512.py

Run

Train 3DMIT

bash ./src/scripts/3DMIT_training.sh

Evaluate 3DMIT on the VQA/description/caption tasks

bash ./src/scripts/3DMIT_3D_Evaluation_7b.sh

Evaluate 3DMIT on the visual grounding task

python ./src/vg_eval_script.py

Citation

If you find our work useful, please consider citing:

@article{li20243dmit,
  title={3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding},
  author={Li, Zeju and Zhang, Chao and Wang, Xiaoyan and Ren, Ruilong and Xu, Yifan and Ma, Ruifei and Liu, Xiangde},
  journal={arXiv preprint arXiv:2401.03201},
  year={2024}
}

Acknowledgements

Our code is based on the following repositories:

https://github.com/OpenGVLab/LAMM

https://github.com/Chat-3D/Chat-3D-v2

https://github.com/Chat-3D/Chat-3D

https://github.com/baaivision/Uni3D
