This is the official repository of the IV-ViT paper "Joint learning of images and videos with a single Vision Transformer".
@inproceedings{Shimizu_MVA2023_IV_VIT,
author = {Shuki Shimizu and Toru Tamaki},
title = {Joint learning of images and videos with a single Vision Transformer},
booktitle = {18th International Conference on Machine Vision and Applications,
{MVA} 2023, Hamamatsu, Japan, July 23-25, 2023},
pages = {1--6},
publisher = {{IEEE}},
year = {2023},
url = {https://doi.org/10.23919/MVA57639.2023.10215661},
doi = {10.23919/MVA57639.2023.10215661},
}
You can train our proposed IV-ViT in various settings with this code. You will need to make the following preparations:
- prepare datasets
- prepare pretrained weights
- prepare libraries
In this code, Tiny-ImageNet and CIFAR100 can be used as image datasets, and UCF101 and mini-Kinetics as video datasets.
You need to prepare the datasets under datasets/
with the following directory structure.
datasets/
├─Tiny-ImageNet/
│ └─tiny-imagenet/
│ ├─train/
│ │ ├─[category0]/
│ │ ├─[category1]/
│ │ ├─...
│ │
│ └─val/
│ ├─[category0]/
│ ├─[category1]/
│ ├─...
│
├─CIFAR100/
│ └─cifar-100-python/
│
├─UCF101/
│ └─ucfTrainTestlist/
│ ├─trainlist01.txt
│ └─testlist01.txt
│
└─Kinetics200/
├─train/
│ ├─[category0]/
│ ├─[category1]/
│ ├─...
│
└─val/
├─[category0]/
├─[category1]/
├─...
In this paper, we use multiple pretrained weights.
You will need to download the pretrained weights under pretrained_weight/
with the following directory structure.
pretrained_weight/
├─ImageNet21k/
│ └─video_model_imagenet21k_pretrained.pth
└─Kinetics400/
└─video_model_kinetics_pretrained.pth
You can download each pretrained weight with the following command.
sh download.sh
You can install all libraries required by this code with the following command.
pip install -r requirements.txt
You can do two things with this code: model training and hyperparameter search.
python main.py --mode train
python main.py --mode optuna
We use arguments to manage the experimental setup. The main arguments are described below (see args.py for details).
- i (int): number of training iterations.
- bsl (list[int]): batch size for each dataset; the order must match the datasets given by dn.
- dn (list[string]): datasets, chosen from [Tiny-ImageNet, CIFAR100, UCF101, Kinetics200].
- model (string): model, chosen from [IV-ViT, TokenShift, MSCA, polyvit].
- pretrain (string): pretrained weights, chosen from [Kinetics400, ImageNet-21k, ImageNet-1k, polyvit].
- use_comet (bool): logging to Comet is enabled when --use_comet is given.
- root_path (string): root path (required), e.g. ~/data_root/.
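The options above might be declared roughly as in the sketch below. This is an illustrative guess at the interface, not the actual contents of args.py; the authoritative definitions live there.

```python
import argparse

def build_parser():
    """Hypothetical sketch of the main CLI options (see args.py for the real ones)."""
    p = argparse.ArgumentParser()
    p.add_argument("-i", type=int, default=10000,
                   help="number of training iterations")
    p.add_argument("-bsl", type=int, nargs="+",
                   help="batch size per dataset, in the same order as -dn")
    p.add_argument("-dn", nargs="+",
                   choices=["Tiny-ImageNet", "CIFAR100", "UCF101", "Kinetics200"],
                   help="datasets to train on jointly")
    p.add_argument("--model",
                   choices=["IV-ViT", "TokenShift", "MSCA", "polyvit"],
                   default="IV-ViT")
    p.add_argument("--pretrain",
                   choices=["Kinetics400", "ImageNet-21k", "ImageNet-1k", "polyvit"])
    p.add_argument("--use_comet", action="store_true",
                   help="log to Comet when given")
    p.add_argument("--root_path", required=True)
    return p
```

Note that -bsl and -dn are paired positionally: the first batch size applies to the first dataset, and so on.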
For example, suppose you want to train with the following settings:
- iterations: 10000
- batch size
  - Tiny-ImageNet: 16
  - CIFAR100: 16
  - UCF101: 4
  - Kinetics200: 4
- model: IV-ViT
- pretrained weight: Kinetics400
- Comet: not used
- root path: ~/data_root/
Then, execute the following command (omit --use_comet, since Comet is not used in this example).
python main.py -i 10000 -bsl 16 16 4 4 -dn Tiny-ImageNet CIFAR100 UCF101 Kinetics200 --model IV-ViT --pretrain Kinetics400 --root_path ~/data_root/