Provable Dynamic Fusion for Low-Quality Multimodal Data

This is the official implementation of Provable Dynamic Fusion for Low-Quality Multimodal Data (ICML 2023) by Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, and Xi Peng.

This paper provides a theoretical framework for understanding the criterion that makes dynamic multimodal fusion robust.
Based on this framework, it proposes Quality-aware Multimodal Fusion (QMF), a dynamic multimodal fusion method with provably better generalization ability.
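
To make the idea of dynamic fusion concrete, here is a minimal sketch of confidence-weighted late fusion for a single sample. It is not the QMF implementation from this repository. The log-sum-exp confidence score, the softmax normalization of the weights, and the names `energy_confidence` and `dynamic_late_fusion` are assumptions chosen for illustration only. See the paper and `train_qmf.py` for the actual method.

```python
import math


def energy_confidence(logits):
    """Log-sum-exp of the logits; assumed here as a simple confidence proxy."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def dynamic_late_fusion(logits_per_modality):
    """Fuse per-modality logits with per-sample weights derived from confidence."""
    confs = [energy_confidence(logits) for logits in logits_per_modality]
    # Softmax over the confidences so the weights are positive and sum to 1
    # (an illustrative choice, not taken from the paper).
    top = max(confs)
    exps = [math.exp(c - top) for c in confs]
    total = sum(exps)
    weights = [e / total for e in exps]
    n_classes = len(logits_per_modality[0])
    fused = [
        sum(w * logits[k] for w, logits in zip(weights, logits_per_modality))
        for k in range(n_classes)
    ]
    return fused, weights


# Hypothetical logits: a confident text branch and a near-uniform image branch.
text_logits = [4.0, 0.0, 0.0]
image_logits = [0.1, 0.0, 0.2]
fused, weights = dynamic_late_fusion([text_logits, image_logits])
print("weights:", [round(w, 3) for w in weights])
print("predicted class:", fused.index(max(fused)))
```

In this toy case the confident text branch receives the larger weight, so the fused prediction follows it. A static fusion rule would weight both branches equally regardless of the sample.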

Environment setup

pip install -r requirements.txt

Dataset preparation

Text-Image Classification:

Step 1: Download food101 and MVSA_Single and put them in the folder datasets.

Step 2: Prepare the train/dev/test split files in jsonl format. We follow the MMBT settings and provide them in the corresponding folders.

Step 3 (optional): To use GloVe embeddings with the bow model, download glove.840B.300d.txt and put it in the folder datasets/glove_embeds. For the BERT models, download bert-base-uncased (Google Drive link) and put it in the root folder bert-base-uncased/.

RGBD Scene Recognition:

Step 1: Download NYUD2 and SUNRGBD and put them in the folder datasets.

Feel free to use Baidu Netdisk for food101, MVSA_Single, NYUD2, and SUNRGBD.
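
Before training on the text-image tasks, it can help to sanity-check the jsonl split files from Step 2. The sketch below reads a JSON Lines file and counts records per label. The field names `text`, `img`, and `label` and the demo records are assumptions for illustration; inspect the provided split files for the actual keys.

```python
import json
import os
import tempfile
from collections import Counter


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def label_counts(records, label_key="label"):
    """Count records per label; `label_key` is an assumed field name."""
    return Counter(r[label_key] for r in records)


# Self-contained demo on a temporary file with made-up records.
demo_rows = [
    {"text": "fresh pasta", "img": "images/0.jpg", "label": "positive"},
    {"text": "cold soup", "img": "images/1.jpg", "label": "negative"},
    {"text": "great pizza", "img": "images/2.jpg", "label": "positive"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    records = read_jsonl(demo_path)
    counts = label_counts(records)

print(len(records), dict(counts))
```

To check a real split, call `read_jsonl` on the path of one of the provided files instead of the temporary demo file.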

Trained Model

We provide the trained models at Baidu Netdisk.

The pretrained BERT model is also available at Baidu Netdisk.

We use the official PyTorch pretrained resnet18 in the RGB-D classification tasks, which can be downloaded from this link.

Usage Example: Text-Image Classification

Note: Reference shell scripts are provided in the folder shells.

To run our method on benchmark datasets:

task="MVSA_Single"  # or "food101"
task_type="classification"
model="latefusion"
i=0  # run index, also passed as the random seed via --seed
name="${task}_${model}model_run_df${i}"

python train_qmf.py --batch_sz 16 --gradient_accumulation_steps 40  \
    --savedir ./saved/$task --name $name  --data_path ./datasets/ \
    --task $task --task_type $task_type  --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3   --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
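
Under the usual meaning of these flags, `--batch_sz 16` with `--gradient_accumulation_steps 40` gives an effective batch of 16 × 40 = 640 samples per optimizer step. The toy example below is a sketch of that arithmetic on a one-parameter regression, not the repository's training loop. It assumes each micro-batch gradient is scaled by the number of accumulation steps, which is the common convention; `mean_grad` and the synthetic data are hypothetical.

```python
import random


def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


batch_sz = 16      # --batch_sz
accum_steps = 40   # --gradient_accumulation_steps
effective_batch = batch_sz * accum_steps  # samples per optimizer step

# Synthetic data for exactly one optimizer step.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(effective_batch)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

# Accumulate micro-batch gradients, scaling each by 1 / accum_steps.
accumulated = 0.0
for step in range(accum_steps):
    lo, hi = step * batch_sz, (step + 1) * batch_sz
    accumulated += mean_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

# Gradient computed on all samples at once, for comparison.
full_batch = mean_grad(w, xs, ys)
print(effective_batch, abs(accumulated - full_batch) < 1e-9)
```

The accumulated gradient matches the full-batch gradient up to floating-point error, which is why a small `--batch_sz` can stand in for a large batch when GPU memory is limited.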

To run TMC:

python train_tmc.py --batch_sz 16 --gradient_accumulation_steps 40  \
    --savedir ./saved/$task --name $name  --data_path ./datasets/ \
    --task $task --task_type $task_type  --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3   --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0

To run the other baselines:

task="MVSA_Single"  # or "food101"
task_type="classification"
model="bow"  # or "bert", "img", "concatbert", "concatbow", "mmbt"
i=0  # run index, also passed as the random seed via --seed
name="${task}_${model}model_run${i}"

python train.py --batch_sz 16 --gradient_accumulation_steps 40  \
    --savedir ./saved/$task --name $name  --data_path ./datasets/ \
    --task $task --task_type $task_type  --model $model --num_image_embeds 3 \
    --freeze_txt 5 --freeze_img 3   --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
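
To sweep all baseline models over several seeds, the command above can be generated programmatically. The sketch below only builds and prints the argument lists, mirroring the flags and the run-name pattern shown above. The helper `build_command` and the seed list are assumptions for illustration, and nothing is launched unless you pass a command to `subprocess.run`.

```python
import shlex

BASELINES = ["bow", "bert", "img", "concatbert", "concatbow", "mmbt"]


def build_command(task, model, seed, script="train.py"):
    """Return the argv list for one run, mirroring the flags shown above."""
    name = f"{task}_{model}model_run{seed}"
    return [
        "python", script,
        "--batch_sz", "16", "--gradient_accumulation_steps", "40",
        "--savedir", f"./saved/{task}", "--name", name,
        "--data_path", "./datasets/",
        "--task", task, "--task_type", "classification",
        "--model", model, "--num_image_embeds", "3",
        "--freeze_txt", "5", "--freeze_img", "3",
        "--patience", "5", "--dropout", "0.1", "--lr", "5e-05",
        "--warmup", "0.1", "--max_epochs", "100",
        "--seed", str(seed), "--df", "true", "--noise", "0.0",
    ]


seeds = [0, 1, 2]  # assumed seed list; choose your own
commands = [build_command("MVSA_Single", m, s) for m in BASELINES for s in seeds]
for cmd in commands:
    # To launch a run, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the arguments as a list rather than a single string avoids shell quoting problems when a path or run name contains spaces.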

Citation

If QMF or the idea of dynamic multimodal fusion is helpful in your research, please consider citing our paper:

@inproceedings{zhang2023provable,
  title={Provable Dynamic Fusion for Low-Quality Multimodal Data},
  author={Zhang, Qingyang and Wu, Haitao and Zhang, Changqing and Hu, Qinghua and Fu, Huazhu and Zhou, Joey Tianyi and Peng, Xi},
  booktitle={International Conference on Machine Learning},
  year={2023}
}

Acknowledgement

The code is inspired by TMC: Trusted Multi-View Classification and Confidence-Aware Learning for Deep Neural Networks.

Related works

There are many interesting works related to this paper:

For any additional questions, feel free to email qingyangzhang@tju.edu.cn.
