GitHub - whwu95/ATM: 【ICCV'2023】What Can Simple Arithmetic Operations Do for Temporal Modeling?

【ICCV'2023】What Can Simple Arithmetic Operations Do for Temporal Modeling?

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.

Wenhao Wu¹,², Yuxin Song², Zhun Sun², Jingdong Wang³, Chang Xu¹, Wanli Ouyang³,¹

¹The University of Sydney, ²Baidu, ³Shanghai AI Lab

This is the official implementation of our ATM (Arithmetic Temporal Module), which explores the potential of four simple arithmetic operations for temporal modeling.

Our best model achieves 89.4% Top-1 accuracy on Kinetics-400, 65.6% on Something-Something V1, and 74.6% on Something-Something V2!

🔥 I also have other recent video recognition projects that may interest you ✨.

Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Huanjin Yao, Wenhao Wu, Zhiheng Li

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Wenhao Wu, Zhun Sun, Wanli Ouyang

📣 News

Nov 29, 2023: The training code has been released.
July 14, 2023: 🎉 ATM has been accepted to ICCV 2023.

🌈 Overview

The key motivation behind ATM is to explore whether simple arithmetic operations can capture auxiliary temporal clues already embedded in existing video features, without relying on elaborate design. ATM can be integrated into both vanilla CNN backbones (e.g., ResNet) and Vision Transformers (e.g., ViT) for video action recognition.
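To make the idea concrete, here is a minimal, dependency-free sketch of applying arithmetic operations to the features of adjacent frames. It is not the repository's ATM implementation. It assumes the four operations are element-wise addition, subtraction, multiplication, and division, and the names `pairwise_arithmetic`, `temporal_cues`, and `EPS` are hypothetical, introduced only for illustration.

```python
# Illustrative sketch only: NOT the repository's ATM implementation.
# Frame features are plain lists of floats so the example needs no dependencies.

EPS = 1e-6  # hypothetical constant that guards the division against zero


def pairwise_arithmetic(prev_frame, next_frame):
    """Apply four element-wise arithmetic operations to two frame features."""
    pairs = list(zip(prev_frame, next_frame))
    return {
        "add": [a + b for a, b in pairs],
        "sub": [b - a for a, b in pairs],          # change from prev to next
        "mul": [a * b for a, b in pairs],
        "div": [b / (a + EPS) for a, b in pairs],  # assumes a is not close to -EPS
    }


def temporal_cues(frames):
    """Compute the four cues for every pair of adjacent frames in a clip."""
    return [pairwise_arithmetic(f0, f1) for f0, f1 in zip(frames, frames[1:])]


# Toy clip: 3 frames, each with a 2-dimensional feature vector.
clip = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
cues = temporal_cues(clip)
print(len(cues))       # one set of cues per adjacent frame pair
print(cues[0]["sub"])  # difference between frame 1 and frame 0
```

In a real model, cues like these would be computed on backbone feature maps and fused back into the network. How ATM does that is defined by the code in this repository, not by this sketch.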

🚀 Training & Testing

We provide training and testing scripts for Kinetics-400, Sth-Sth V1, and Sth-Sth V2. Please refer to the scripts folder for details. For example, you can run:

# Train the 8-frame ViT-B/32 model on Sth-Sth V1.
sh scripts/ssv1/train_base.sh

# Test the 8-frame ViT-B/32 model on Sth-Sth V1.
sh scripts/ssv1/test_base_f8.sh

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry😁.

@inproceedings{atm,
  title={What Can Simple Arithmetic Operations Do for Temporal Modeling?},
  author={Wu, Wenhao and Song, Yuxin and Sun, Zhun and Wang, Jingdong and Xu, Chang and Ouyang, Wanli},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year={2023}
}

🎗️ Acknowledgement

This repository is built upon portions of VideoMAE, CLIP, and EVA. Thanks to the contributors of these great codebases.

👫 Contact

If you have any questions, please file an issue or contact Wenhao Wu.
