This is the official code for our paper "MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding", accepted at ICCV 2025.
<video width="100%" controls autoplay muted loop>
<source src="https://github.com/SixCorePeach/MCAM/raw/main/poster/the%20visual%20result%20of%20MCAM.mp4" type="video/mp4">
</video>

This work builds on Video Swin Transformer and ADAPT, and the CAM is inspired by LLCP. We thank the authors for their excellent work; the citations are as follows.
```bibtex
@article{liu2021video,
  title   = {Video Swin Transformer},
  author  = {Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal = {arXiv preprint arXiv:2106.13230},
  year    = {2021}
}

@article{jin2023adapt,
  title   = {ADAPT: Action-aware Driving Caption Transformer},
  author  = {Jin, Bu and Liu, Xinyu and Zheng, Yupeng and Li, Pengfei and Zhao, Hao and Zhang, Tong and Zheng, Yuhang and Zhou, Guyue and Liu, Jingjing},
  journal = {arXiv preprint arXiv:2302.00673},
  year    = {2023}
}

@inproceedings{chen2024llcp,
  title     = {LLCP: Learning Latent Causal Processes for Reasoning-based Video Question Answer},
  author    = {Chen, Guangyi and Li, Yuke and Liu, Xiao and Li, Zijian and Al Suradi, Eman and Wei, Donglai and Zhang, Kun},
  booktitle = {ICLR},
  year      = {2024}
}
```
Our environment setup is as follows. First, install Anaconda and create the environment:

```shell
conda create --name MCAM python=3.8
conda activate MCAM
```

Install PyTorch. The torch version can be adjusted to your own device, as long as it satisfies PyTorch's own architecture requirements:

```shell
pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
```

Install apex. Alternatively, you can manually download the apex zip archive and extract it into a folder of your choice:

```shell
#git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
```

Install mpi4py and the other required packages:

```shell
conda install -c conda-forge mpi4py openmpi
```

Install the remaining dependencies. If any dependency is still missing at runtime, install it with `pip install <package-name>`:

```shell
pip install -r requirements.txt
```

The directory layout is as follows:
```
${REPO_DIR}
|-- checkpoints
|-- datasets
|  |-- BDDX
|  |  |-- frame_tsv
|  |  |-- captions_BDDX.json
|  |  |-- training_32frames_caption_coco_format.json
|  |  |-- training_32frames.yaml
|  |  |-- training.caption.lineidx
|  |  |-- training.caption.lineidx.8b
|  |  |-- training.caption.linelist.tsv
|  |  |-- training.caption.tsv
|  |  |-- training.img.lineidx
|  |  |-- training.img.lineidx.8b
|  |  |-- training.img.tsv
|  |  |-- training.label.lineidx
|  |  |-- training.label.lineidx.8b
|  |  |-- training.label.tsv
|  |  |-- training.linelist.lineidx
|  |  |-- training.linelist.lineidx.8b
|  |  |-- training.linelist.tsv
|  |  |-- validation...
|  |  |-- ...
|  |  |-- validation...
|  |  |-- testing...
|  |  |-- ...
|  |  |-- testing...
|-- datasets_part
|-- docs
|-- models
|  |-- basemodel
|  |-- captioning
|  |-- video_swin_transformer
|-- scripts
|-- src
|-- README.md
|-- ...
|-- ...
```

Since the project also uses the CoVLA dataset, its file layout is given below as well; it is largely the same:
```
|-- dataset
|  |-- frame_tsv
|  |-- captions_BDDX.json
|  |-- training_32frames_caption_coco_format.json
|  |-- training_32frames.yaml
|  |-- training.caption.lineidx
|  |-- training.caption.lineidx.8b
|  |-- training.caption.linelist.tsv
|  |-- training.caption.tsv
|  |-- training.img.lineidx
|  |-- training.img.lineidx.8b
|  |-- training.img.tsv
|  |-- training.label.lineidx
|  |-- training.label.lineidx.8b
|  |-- training.label.tsv
|  |-- training.linelist.lineidx
|  |-- training.linelist.lineidx.8b
|  |-- training.linelist.tsv
|  |-- validation...
|  |-- ...
|  |-- validation...
|  |-- testing...
|  |-- ...
|  |-- testing...
|-- models
|-- output
|-- readme.md
|-- scripts
|  |-- CoVLA_adapt_caption.sh CoVLA_only_caption.sh other_scripts
|-- src
|  |-- configs
|  |-- datasets
|  |-- evalcap
|  |-- layers
|  |-- modeling
|  |-- prepro
|  |-- pytorch_grad_cam
|  |-- solver
|  |-- tags
|  |-- tasks
|  |-- timm
|  |-- utils
```
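The paired `.tsv`/`.lineidx` files above follow the convention used by SwinBERT/ADAPT-style data loaders: each line of a `.lineidx` file stores the starting byte offset of the corresponding row in its `.tsv` file, so a single row can be read with one seek instead of scanning the whole file. A minimal sketch of this pattern (helper names are ours, and the exact column layout of each `.tsv` in this repo may differ):

```python
def build_lineidx(tsv_path, lineidx_path):
    # Record the starting byte offset of every row in the .tsv file.
    offsets, pos = [], 0
    with open(tsv_path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    with open(lineidx_path, "w") as f:
        f.writelines(f"{o}\n" for o in offsets)

def load_lineidx(lineidx_path):
    # Each line holds one byte offset into the paired .tsv file.
    with open(lineidx_path) as f:
        return [int(line.strip()) for line in f if line.strip()]

def read_tsv_row(tsv_path, offsets, idx):
    # Seek straight to the requested row and split it into columns.
    with open(tsv_path, "rb") as f:
        f.seek(offsets[idx])
        row = f.readline().decode("utf-8").rstrip("\n")
    return row.split("\t")
```

This random-access layout is what lets the training loop sample arbitrary video clips from a multi-gigabyte `training.img.tsv` without loading it into memory.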
After preparing the files above, configure the contents of script/xxx.bash and run:

```shell
bash xxx.bash
```

The contents of xxx.bash can look like the following:
```shell
#CUDA_VISIBLE_DEVICES=4,5,6,7 \
#NCCL_P2P_DISABLE=1 \
#OMPI_COMM_WORLD_SIZE="4" \
#python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_port=45978 src/tasks/run_adapt.py \
CUDA_VISIBLE_DEVICES=0 \
python src/tasks/run_MCAM.py \
    --config src/configs/VidSwinBert/BDDX_multi_default.json \
    --train_yaml BDDX/training_32frames.yaml \
    --val_yaml BDDX/testing_32frames.yaml \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 16 \
    --num_train_epochs 40 \
    --learning_rate 0.0002 \
    --max_num_frames 32 \
    --pretrained_2d 0 \
    --backbone_coef_lr 0.05 \
    --mask_prob 0.5 \
    --max_masked_token 45 \
    --zero_opt_stage 1 \
    --mixed_precision_method deepspeed \
    --deepspeed_fp16 \
    --gradient_accumulation_steps 4 \
    --learn_mask_enabled \
    --loss_sparse_w 0.1 \
    --use_sep_cap \
    --multitask \
    --signal_types course speed \
    --loss_sensor_w 0.05 \
    --max_grad_norm 1 \
    --output_dir ./output/multitask/sensor_course_speed/MCAM/
```
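Note that with gradient accumulation, the effective batch size seen by each optimizer update is `per_gpu_train_batch_size × number_of_GPUs × gradient_accumulation_steps`. A quick sanity check for the commands above (the helper function is ours, not part of the repo):

```python
def effective_batch_size(per_gpu_batch, num_gpus, grad_accum_steps):
    # Samples contributing to each optimizer update.
    return per_gpu_batch * num_gpus * grad_accum_steps

# Single-GPU command above: batch 16, 1 GPU, 4 accumulation steps.
print(effective_batch_size(16, 1, 4))  # -> 64

# The commented-out 4-GPU distributed launch would quadruple it.
print(effective_batch_size(16, 4, 4))  # -> 256
```

Keep this in mind when changing `--per_gpu_train_batch_size` or the GPU count, since the learning rate was tuned for a fixed effective batch size.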
We will upload our code piece by piece, since the files cannot be moved and adjusted online all at once. If this work is helpful to you, please cite:
```bibtex
@InProceedings{Cheng_2025_ICCV,
  author    = {Cheng, Tongtong and Li, Rongzhen and Xiong, Yixin and Zhang, Tao and Wang, Jing and Liu, Kai},
  title     = {MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {5479-5489}
}
```