✈️ avion

AVION is short for "A VIdeo model in ONe day". Fittingly, *avion* means "plane" in French and Spanish: AVION is fast.

Training a Large Video Model on a Single Machine in a Day
Yue Zhao, Philipp Krähenbühl
UT Austin
arXiv | BibTeX

Installation

See INSTALL.md to install this code.

Main results

  1. AVION enables video-language contrastive pre-training on Ego4D (original narratives) on a single node of 8× consumer-grade GPUs within a day. (A minimal sketch of the contrastive objective appears after this list.)

     | Method | Backbone | Batch size per GPU | GPU memory (GB) | Hardware | GPU×hour^ | EK100 MIR 0-shot Avg. mAP |
     |--------|----------|-------------------:|----------------:|----------|----------:|--------------------------:|
     | EgoVLP | TSF-B    | 16                 | 22              | 32× A100 | 1536      | 22.1                      |
     | Ours   | ViT-B    | 256                | 19              | 8× A5000 | 130       | 27.4                      |

     ^GPU×hours are not normalized across GPU generations. The cost for EgoVLP is taken from the original paper (Sec. 6.1).

  2. AVION speeds up LLM-augmented video-language contrastive pre-training (LaViLa) on Ego4D.

     a. Pre-training cost and performance.

     | Method | Backbone | Batch size per GPU | GPU memory (GB) | Hardware | GPU×hour^ | EK100 MIR 0-shot Avg. mAP |
     |--------|----------|-------------------:|----------------:|----------|----------:|--------------------------:|
     | LaViLa | TSF-B    | 32                 | 25              | 32× V100 | 1824      | 30.9                      |
     | Ours   | ViT-B    | 256                | 19              | 8× A5000 | 260       | 33.2                      |

     ^GPU×hours are not normalized across GPU generations.

     b. Downstream performance.

     | Method | Backbone | EK100 MIR Avg. mAP | EK100 MIR Avg. nDCG | EK100 CLS Action Top-1 |
     |--------|----------|-------------------:|--------------------:|-----------------------:|
     | LaViLa | TSF-B    | 50.5               | 65.0                | 46.9                   |
     | Ours   | ViT-B    | 51.7               | 66.8                | 49.5                   |
     | LaViLa | TSF-L    | 50.9               | 66.5                | 51.0                   |
     | Ours   | ViT-L    | 54.5               | 69.0                | 54.5                   |

     🏆 LaViLa+AVION helped us win the CVPR 2023 EPIC-Kitchens Challenges in both the Action Recognition and Multi-Instance Retrieval tasks by a significant margin.

  3. AVION speeds up VideoMAE pre-training. (A sketch of the tube-masking idea also appears after this list.)

     | Method   | Backbone | Epochs | GPU×hour^^ | Top-1/Top-5 (w/ fine-tuning) |
     |----------|----------|-------:|-----------:|-----------------------------:|
     | VideoMAE | ViT-B    | 800    | 995        | 80.0/94.4                    |
     | Ours     | ViT-B    | 800    | 583        | 80.1/94.5                    |

     ^^Both GPU×hour figures are measured in the same hardware environment (4× A5000 GPUs).

For more details, please refer to MODEL_ZOO.
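
To make the first two results concrete, here is a minimal sketch of the symmetric video-text contrastive (InfoNCE) objective that this family of pre-training optimizes. It is illustrative only: the function name, shapes, and fixed temperature are assumptions for exposition, not AVION's actual API.

```python
# Illustrative CLIP-style video-text contrastive loss (not AVION's code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (B, D) tensors; row i of each forms a positive pair.
    """
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # (B, B) similarity matrix; entry (i, j) compares video i with text j.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    # Each video must pick out its own narration, and vice versa.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Random embeddings standing in for video/text encoder outputs:
loss = clip_contrastive_loss(torch.randn(256, 512), torch.randn(256, 512))
```

Large per-GPU batches (256 in the tables above) matter here because InfoNCE contrasts each pair against all other pairs in the batch.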
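Likewise, for the VideoMAE result, the sketch below shows the "tube" masking pattern used by MAE-style video pre-training: the same spatial patches are hidden at every timestep, so the encoder only processes a small visible subset of tokens. The helper name and shapes are hypothetical; the 90% mask ratio follows the VideoMAE paper, but this is not the repository's actual data pipeline.

```python
# Minimal sketch of VideoMAE-style tube masking (illustrative only).
import torch

def tube_mask(batch_size: int, num_frames: int, patches_per_frame: int,
              mask_ratio: float = 0.9) -> torch.Tensor:
    """Boolean mask of shape (batch_size, num_frames * patches_per_frame);
    True marks a token hidden from the encoder."""
    num_masked = int(patches_per_frame * mask_ratio)
    # Rank random noise per spatial position and mask the positions with the
    # smallest noise values, i.e. a uniformly random subset per sample.
    noise = torch.rand(batch_size, patches_per_frame)
    ranks = noise.argsort(dim=1).argsort(dim=1)  # rank of each position
    spatial_mask = ranks < num_masked            # (batch_size, patches_per_frame)
    # Repeat the same spatial pattern across every frame: the "tube".
    full = spatial_mask.unsqueeze(1).expand(batch_size, num_frames,
                                            patches_per_frame)
    return full.reshape(batch_size, -1)

# 16 frames of a 224x224 clip with 16x16 patches -> 14*14 = 196 patches/frame.
mask = tube_mask(batch_size=4, num_frames=16, patches_per_frame=196)
print(mask.shape, (~mask[0]).sum().item(), "visible tokens per clip")
```

Because only ~10% of tokens are visible, the encoder's cost drops roughly in proportion, which is where much of the pre-training speedup in this setting comes from.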

License

MIT License.

Acknowledgements

  • The vision-language contrastive pre-training part is refactored from LaViLa.
  • The MAE-style self-supervised pre-training part is built upon VideoMAE.

Citing AVION

@article{zhao2023training,
  title={Training a large video model on a single machine in a day},
  author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2309.16669},
  year={2023}
}
@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}
