[CVPR'23] AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

📖 Paper: CVPR'23 and arXiv

Our paper (AdaMAE) has been accepted for presentation at CVPR'23.

💡 Contributions:

We propose AdaMAE, a novel, adaptive, and end-to-end trainable token sampling strategy for MAEs that takes into account the spatiotemporal properties of all input tokens to sample fewer but informative tokens.
We empirically show that AdaMAE samples more tokens from high spatiotemporal information regions of the input, resulting in learning meaningful representations for downstream tasks.
We demonstrate the efficiency of AdaMAE in terms of performance and GPU memory against random patch, tube, and frame sampling by conducting a thorough ablation study on the SSv2 dataset.
We show that AdaMAE outperforms the state of the art (SOTA) by 0.7% and 1.1% in top-1 accuracy on SSv2 and Kinetics-400, respectively.
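To make the idea of probability-driven token sampling concrete, here is a minimal sketch; it is not the repository's implementation. It assumes a sampling network has already produced one probability per token (below, a hand-written hypothetical list) and draws the visible tokens without replacement using the Gumbel-top-k trick. The function name `sample_visible_tokens`, the token count, and the scores are illustrative assumptions.

```python
import math
import random


def sample_visible_tokens(probs, num_visible, rng):
    """Draw `num_visible` distinct token indices, favoring high-probability tokens.

    Gumbel-top-k: add independent Gumbel noise to each log-probability and keep
    the k largest keys. This samples k indices without replacement from the
    categorical distribution defined by `probs`.
    """
    positive = [(i, p) for i, p in enumerate(probs) if p > 0.0]
    if not 0 < num_visible <= len(positive):
        raise ValueError("num_visible must be between 1 and the number of tokens with p > 0")
    keys = []
    for index, p in positive:
        u = max(rng.random(), 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        keys.append((math.log(p) + gumbel, index))
    keys.sort(reverse=True)
    return sorted(index for _, index in keys[:num_visible])


# Hypothetical scores: 20 tokens, the first 5 cover a "high-motion" region.
scores = [8.0] * 5 + [1.0] * 15
probs = [s / sum(scores) for s in scores]

masking_ratio = 0.8  # fraction of tokens hidden from the encoder
num_masked = round(masking_ratio * len(probs))
num_visible = len(probs) - num_masked

rng = random.Random(0)
visible = sample_visible_tokens(probs, num_visible, rng)
print("visible token indices:", visible)
```

Over many draws, tokens with larger probabilities are kept visible more often, which is the behavior the contributions above describe. In the actual method the probabilities come from a trainable network rather than a fixed list.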

Method

Adaptive mask visualizations from SSv2 (samples from the 50th epoch):

Video	Pred.	Error	CAT	Mask		Video	Pred.	Error	CAT	Mask

Adaptive mask visualizations from K400 (samples from the 50th epoch):

Video	Pred.	Error	CAT	Mask		Video	Pred.	Error	CAT	Mask

A comparison

Comparison of our adaptive masking with existing random patch, tube, and frame masking for a masking ratio of 80%. Our adaptive masking approach selects more tokens from regions with high spatiotemporal information and only a small number of tokens from the background.
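For readers unfamiliar with the three baseline strategies named above, the sketch below generates a random patch, tube, and frame mask over a small token grid at an 80% masking ratio. It is an illustration under assumptions, not code from this repository: the grid size (10 x 4 x 5 tokens) and all function names are hypothetical, and `True` marks a masked token.

```python
import random


def random_patch_mask(t, h, w, ratio, rng):
    """Mask a fraction `ratio` of all t*h*w tokens, chosen uniformly at random."""
    total = t * h * w
    masked = set(rng.sample(range(total), round(ratio * total)))
    return [[[(f * h * w + r * w + c) in masked for c in range(w)]
             for r in range(h)] for f in range(t)]


def tube_mask(t, h, w, ratio, rng):
    """Draw one random spatial mask and repeat it at every temporal index."""
    masked = set(rng.sample(range(h * w), round(ratio * h * w)))
    frame = [[(r * w + c) in masked for c in range(w)] for r in range(h)]
    return [[row[:] for row in frame] for _ in range(t)]


def frame_mask(t, h, w, ratio, rng):
    """Mask whole temporal slices: every token of a chosen frame is hidden."""
    masked = set(rng.sample(range(t), round(ratio * t)))
    return [[[f in masked] * w for _ in range(h)] for f in range(t)]


def masked_fraction(mask):
    """Fraction of tokens marked True (masked) in a t x h x w mask."""
    flat = [v for frame in mask for row in frame for v in row]
    return sum(flat) / len(flat)


rng = random.Random(0)
T, H, W, RATIO = 10, 4, 5, 0.8  # hypothetical token grid and masking ratio
masks = {
    "random patch": random_patch_mask(T, H, W, RATIO, rng),
    "tube": tube_mask(T, H, W, RATIO, rng),
    "frame": frame_mask(T, H, W, RATIO, rng),
}
for name, mask in masks.items():
    print(f"{name}: masked fraction = {masked_fraction(mask):.2f}")
```

All three baselines choose positions independently of the video content; they differ only in how the masked positions are arranged in space and time. Adaptive masking instead biases the choice toward informative regions.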

Ablation experiments on SSv2 dataset:

We use ViT-Base as the backbone for all experiments. MHA $(D=2, d=384)$ denotes our adaptive token sampling network with a depth of two and an embedding dimension of $384$. All pre-trained models are evaluated using the evaluation protocol described in Sec. 4 of the paper. The default choice for AdaMAE is highlighted in gray. GPU memory consumption is reported for a batch size of 16 on a single GPU.

Pre-training AdaMAE & fine-tuning:

We closely follow the VideoMAE pre-training recipe, but use our adaptive masking instead of tube masking. To pre-train AdaMAE, please follow the steps in DATASET.md and PRETRAIN.md.
To check the performance of pre-trained AdaMAE, please follow the steps in DATASET.md and FINETUNE.md.
To set up the conda environment, please refer to FINETUNE.md.

Pre-trained model weights

Download the pre-trained model weights for SSv2 and K400 datasets here.

Acknowledgement:

Our AdaMAE codebase is based on the VideoMAE implementation. We thank the VideoMAE authors for making their code publicly available.

Citation:

@InProceedings{Bandara_2023_CVPR,
    author    = {Bandara, Wele Gedara Chaminda and Patel, Naman and Gholami, Ali and Nikkhah, Mehdi and Agrawal, Motilal and Patel, Vishal M.},
    title     = {AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14507-14517}
}
