
Masked Autoencoders enable strong Audio-Visual Early Fusion

Official codebase and pre-trained models for our DeepAVFusion framework, as described in the following paper:

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo, Pedro Morgado
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

DeepAVFusion Illustration

Setup

Environment

Our environment was created as follows:

conda create -n deepavfusion python=3.10
conda activate deepavfusion
conda install pytorch=2.0 torchvision=0.15 torchaudio=2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install submitit hydra-core av wandb tqdm scipy scikit-image scikit-learn timm mir_eval jupyter matplotlib

Alternatively, you can replicate this environment directly by running conda env create -f requirements.yml.
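
To verify the installation, a quick sanity check like the following (not part of the codebase) confirms that the expected PyTorch stack is in place:

import torch, torchvision, torchaudio

print("torch:", torch.__version__)              # expect 2.0.x
print("torchvision:", torchvision.__version__)  # expect 0.15.x
print("torchaudio:", torchaudio.__version__)    # expect 2.0.x
print("CUDA available:", torch.cuda.is_available())  # should be True on a GPU machine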

Datasets

In this work, we use a variety of datasets, including VGGSound, AudioSet, MUSIC and AVSBench. We assume that you have already downloaded them. The expected data format is briefly described in DATASETS.md. The commands below refer to the dataset locations through the following variables:

PATH2VGGSOUND="/path/to/vggsound"
PATH2AUDIOSET="/path/to/audioset"
PATH2MUSIC="/path/to/music"
PATH2AVSBENCH="/path/to/avsbench"

DeepAVFusion Pre-training

We release two models based on the ViT-Base architecture, trained on the VGGSound and AudioSet datasets, respectively. The models were trained with the following commands.

# Pre-training on VGGSound
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_vggsound_ep\${opt.epochs} \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  model.fusion.layers=all model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=1 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0
  
# Pre-training on AudioSet
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_as2m_ep\${opt.epochs} \
  data.dataset=audioset data.data_path=${PATH2AUDIOSET} \
  model.fusion.layers=all model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=4 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0 

The nearest neighbor training curve of the model trained on VGGSound can be seen below. The retrieval performance of fusion tokens is substantially better than uni-modal representations, suggesting that fusion tokens aggregate high-level semantics, while uni-modal representations encode the low-level details required for masked reconstruction.

DeepAVFusion training curve
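
For reference, the retrieval metric in this curve follows the usual nearest-neighbor recipe: pool token features, L2-normalize, and rank by cosine similarity. A minimal sketch of that evaluation (function and variable names are our own, not the repository's):

import torch
import torch.nn.functional as F

@torch.no_grad()
def nn_retrieval_accuracy(train_feats, train_labels, test_feats, test_labels):
    # Pooled features (e.g., averaged fusion, audio, or visual tokens), shape (N, D)
    train_feats = F.normalize(train_feats, dim=-1)
    test_feats = F.normalize(test_feats, dim=-1)
    sim = test_feats @ train_feats.T          # cosine similarity, shape (N_test, N_train)
    nearest = sim.argmax(dim=1)               # index of the nearest training clip
    return (train_labels[nearest] == test_labels).float().mean().item()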

The pre-trained models are available in the checkpoints/ directory.

Downstream tasks

We evaluate our model on a variety of downstream tasks. In each case, the pre-trained model is used for feature extraction (with or without fine-tuning, depending on the evaluation protocol), and a task-specific decoder is trained from scratch to carry out the task.
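
In code, the two protocols differ only in whether the pre-trained encoder receives gradients; the sketch below illustrates the idea with placeholder names (it is not the repository's actual evaluation code):

import torch.nn as nn

def build_eval_model(encoder, feat_dim, num_classes, finetune=False):
    """Attach a task head to a pre-trained encoder.

    Linear probe (finetune=False): the encoder is frozen and only the head is trained.
    Fine-tuning  (finetune=True):  both the encoder and the head are updated.
    """
    for p in encoder.parameters():
        p.requires_grad = finetune
    head = nn.Linear(feat_dim, num_classes)   # task-specific decoder trained from scratch
    return nn.Sequential(encoder, head)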

Audio Event Recognition

Dataset     Eval Protocol   Pre-trained Model   Top1 Acc
VGGSound    Linear Probe    VGGSound-200ep      53.08
VGGSound    Linear Probe    AudioSet2M-200ep    53.08
VGGSound    Fine-tuning     VGGSound-200ep      58.19
VGGSound    Fine-tuning     AudioSet2M-200ep    57.91

# Linear probe on VGGSound (VGGSound-200ep model)
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound pretrain_job_name=deepavfusion_vitb_vggsound_ep200 \
  model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 env.ngpu=4 env.world_size=1

# Linear probe on VGGSound (AudioSet2M-200ep model)
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound pretrain_job_name=deepavfusion_vitb_as2m_ep200 \
  model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 env.ngpu=4 env.world_size=1

# Fine-tuning on VGGSound (VGGSound-200ep model)
PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_vggsound pretrain_job_name=deepavfusion_vitb_vggsound_ep200 \
  model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1

# Fine-tuning on VGGSound (AudioSet2M-200ep model)
PYTHONPATH=. python launcher.py --config-name=finetune job_name=finetune_vggsound pretrain_job_name=deepavfusion_vitb_as2m_ep200 \
  model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1
Dataset        Eval Protocol   Pre-trained Model   Top1 AP
AudioSet-Bal   Linear Probe    VGGSound-200ep      53.08
AudioSet-Bal   Linear Probe    AudioSet2M-200ep    53.08
AudioSet-Bal   Fine-tuning     VGGSound-200ep      58.19
AudioSet-Bal   Fine-tuning     AudioSet2M-200ep    57.91

# Linear probe on AudioSet-Bal (VGGSound-200ep model)
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal pretrain_job_name=deepavfusion_vitb_vggsound_ep200 \
  model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 env.ngpu=2 env.world_size=1

# Linear probe on AudioSet-Bal (AudioSet2M-200ep model)
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal pretrain_job_name=deepavfusion_vitb_as2m_ep200 \
  model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 env.ngpu=2 env.world_size=1

# Fine-tuning on AudioSet-Bal (VGGSound-200ep model)
PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal pretrain_job_name=deepavfusion_vitb_vggsound_ep200 \
  model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1

# Fine-tuning on AudioSet-Bal (AudioSet2M-200ep model)
PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal pretrain_job_name=deepavfusion_vitb_as2m_ep200 \
  model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1

Visually Guided Source Separation

Dataset          Pre-training       SDR    SIR    SAR
VGGSound-Music   VGGSound-200ep     5.79   8.24   13.82
VGGSound-Music   AudioSet2M-200ep   6.93   9.93   13.49

# Source separation on VGGSound-Music (VGGSound-200ep model)
PYTHONPATH=. python launcher.py --config-name=avsrcsep job_name=eval_avsrcsep_vggsound_music pretrain_job_name=deepavfusion_vitb_vggsound_ep200 \
  model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound_music data.data_path=${PATH2VGGSOUND} \
  opt.epochs=300 opt.warmup_epochs=40 opt.batch_size=16 opt.accum_iter=8 opt.blr=3e-4 \
  avss.log_freq=True avss.weighted_loss=True avss.binary_mask=False avss.num_mixtures=2 env.ngpu=4 env.world_size=1

# Source separation on VGGSound-Music (AudioSet2M-200ep model)
PYTHONPATH=. python launcher.py --config-name=avsrcsep job_name=eval_avsrcsep_vggsound_music pretrain_job_name=deepavfusion_vitb_as2m_ep200 \
  model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound_music data.data_path=${PATH2VGGSOUND} \
  opt.epochs=300 opt.warmup_epochs=40 opt.batch_size=16 opt.accum_iter=8 opt.blr=3e-4 \
  avss.log_freq=True avss.weighted_loss=True avss.binary_mask=False avss.num_mixtures=2 env.ngpu=4 env.world_size=1
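
The SDR/SIR/SAR numbers can be recomputed from separated and reference waveforms with mir_eval (already included in the environment above). A minimal sketch, assuming both are numpy arrays of shape (num_sources, num_samples):

import numpy as np
import mir_eval

def separation_metrics(reference_sources, estimated_sources):
    # Standard BSS-Eval metrics; higher is better for all three.
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources))
    return float(sdr.mean()), float(sir.mean()), float(sar.mean())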

Audio Visual Semantic Segmentation

Dataset       Pre-training       mIoU    FScore
AVSBench-S4   VGGSound-200ep     89.94   92.34
AVSBench-S4   AudioSet2M-200ep   90.27   92.49

# Audio-visual segmentation on AVSBench-S4 (VGGSound-200ep model)
PYTHONPATH=. python launcher.py --config-name=avsegm job_name=eval_avsbench_s4 pretrain_job_name=deepavfusion_vitb_vggsound_ep200 \
  model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=avsbench_s4 data.data_path=${PATH2AVSBENCH} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=16 opt.accum_iter=8 opt.blr=2e-4 env.ngpu=4 env.world_size=1

# Audio-visual segmentation on AVSBench-S4 (AudioSet2M-200ep model)
PYTHONPATH=. python launcher.py --config-name=avsegm job_name=eval_avsbench_s4 pretrain_job_name=deepavfusion_vitb_as2m_ep200 \
  model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=avsbench_s4 data.data_path=${PATH2AVSBENCH} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=16 opt.accum_iter=8 opt.blr=2e-4 env.ngpu=4 env.world_size=1
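
For intuition, mIoU and F-score compare the predicted sounding-object mask against the ground-truth mask. Below is a generic per-image computation for binary masks (a rough sketch; the benchmark's exact definitions, e.g. its F-score weighting, may differ):

import numpy as np

def iou_and_fscore(pred_mask, gt_mask, eps=1e-6):
    # pred_mask, gt_mask: binary numpy arrays of the same spatial shape
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = (inter + eps) / (union + eps)
    precision = (inter + eps) / (pred.sum() + eps)
    recall = (inter + eps) / (gt.sum() + eps)
    fscore = 2 * precision * recall / (precision + recall)
    return iou, fscore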

Demonstrations

Demo 1

Original Video

demo1_orig.mp4

Localized Sources

Separated Source #1

demo1_sep1.mp4

Separated Source #2

demo1_sep2.mp4

Demo 2

Original Video

demo2_orig.mp4

Localized Sources

Separated Source #1

demo2_sep1.mp4

Separated Source #2

demo2_sep2.mp4

Demo 3

Original Video

demo3_orig.mp4

Localized Sources

Separated Source #1

demo3_sep1.mp4

Separated Source #2

demo3_sep2.mp4

Demo 4

Original Video

demo4_orig.mp4

Localized Sources

Separated Source #1

demo4_sep1.mp4

Separated Source #2

demo4_sep2.mp4

Citation

If you find this repository useful, please cite our paper:

@inproceedings{mo2024deepavfusion,
  title={Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling},
  author={Mo, Shentong and Morgado, Pedro},
  booktitle={Proceedings of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}