[Project page] [ArXiv] [Dataset(Google drive)] [Dataset(Baidu drive)] [Benchmark]
This repository contains the code for the CVPR 2023 paper "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline". The paper introduces the first Untrimmed Audio-Visual dataset (UnAV-100) and proposes to solve the audio-visual event localization problem in more realistic and challenging scenarios.
The implementation is based on PyTorch. Follow INSTALL.md to install the required dependencies.
The proposed UnAV-100 dataset can be downloaded from the [Project Page], including the YouTube links of the raw videos, annotations, and extracted features.
If you want to use your own choice of video features, you can download the raw videos from this link (Baidu Drive, pwd: qslx). A download script for the raw videos is also provided at scripts/video_download.py.
Note: after downloading the data, unpack the files under data/unav100. The folder structure should look like:
```
This folder
│   README.md
│   ...
└───data/
│   └───unav100/
│       └───annotations/
│           └───unav100_annotations.json
│       └───av_features/          # all features mixed together
│           └───__2MwJ2uHu0_flow.npy
│           └───__2MwJ2uHu0_rgb.npy
│           └───__2MwJ2uHu0_vggish.npy
│           ...
└───libs
│   ...
```
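Each video contributes three feature files (RGB, optical flow, and VGGish audio), all stored flat in av_features/. Below is a minimal sketch of how such per-video files could be loaded; the `load_av_features` helper and the feature dimensions used in the demo are illustrative assumptions, not the dataset's actual sizes or the repository's API.

```python
# Hypothetical loader sketch for the av_features/ layout shown above.
# Feature dimensions (1024 / 128) are placeholders, not the dataset's real shapes.
import os
import tempfile
import numpy as np

def load_av_features(feat_dir, video_id):
    """Load the RGB, optical-flow, and VGGish audio features for one video."""
    rgb = np.load(os.path.join(feat_dir, f"{video_id}_rgb.npy"))
    flow = np.load(os.path.join(feat_dir, f"{video_id}_flow.npy"))
    audio = np.load(os.path.join(feat_dir, f"{video_id}_vggish.npy"))
    return rgb, flow, audio

# Demo with synthetic arrays standing in for real feature files.
with tempfile.TemporaryDirectory() as d:
    vid = "__2MwJ2uHu0"
    for suffix, dim in [("rgb", 1024), ("flow", 1024), ("vggish", 128)]:
        np.save(os.path.join(d, f"{vid}_{suffix}.npy"),
                np.zeros((10, dim), dtype=np.float32))
    rgb, flow, audio = load_av_features(d, vid)
    print(rgb.shape, flow.shape, audio.shape)  # → (10, 1024) (10, 1024) (10, 128)
```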
Run train.py to train the model on the UnAV-100 dataset. This will create an experiment folder under ./ckpt that stores the training config, logs, and checkpoints.

```shell
python ./train.py ./configs/avel_unav100.yaml --output reproduce
```
Run eval.py to evaluate the trained model.

```shell
python ./eval.py ./configs/avel_unav100.yaml ./ckpt/avel_unav100_reproduce
```
[Optional] We also provide a pretrained model for UnAV-100, which can be downloaded from this link.
If you find our dataset and code useful for your research, please cite our paper:
```
@inproceedings{geng2023dense,
  title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
  author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22942--22951},
  year={2023}
}
```
The I3D RGB and flow video features and the VGGish audio features were extracted using video_features. Our baseline model is implemented based on ActionFormer. We thank the authors for sharing their code. If you use our code, please also consider citing their works.