This is the official repository of the paper "Temporal Cross-attention for Action Recognition", presented at the ACCV2022 Workshop on Vision Transformers: Theory and Applications (VTTA-ACCV2022).
@InProceedings{Hashiguchi_2022_ACCV,
    author    = {Hashiguchi, Ryota and Tamaki, Toru},
    title     = {Temporal Cross-attention for Action Recognition},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
    month     = {December},
    year      = {2022},
    pages     = {276-288}
}
We thank the authors of TokenShift: MSCA is built upon TokenShift.
Download the ImageNet-22k pretrained weights from Base16.
Prepare the Kinetics-400 dataset organized in the following structure. It is almost the same as TokenShift, with slight modifications; see the config file.
k400
|_ frames331_train
| |_ [category name 0]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |
| |_ [category name 1]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |_ ...
|
|_ frames331_val
| |_ [category name 0]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |
| |_ [category name 1]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |_ ...
|
|_ trainValTest
|_ train.txt
|_ val.txt
python main.py --tune_from pretrain/ViT-B_16_Img21.npz --cfg config/custom/kinetics400/k400_attentionshift_div4_8x32_base_224.yml