[Project page] [ArXiv] [Dataset(Google drive)] [Dataset(Baidu drive)] [Benchmark]
This repository contains the code for the CVPR 2023 paper "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline". The paper introduces the first Untrimmed Audio-Visual dataset (UnAV-100) and proposes to solve the audio-visual event localization problem in more realistic and challenging scenarios.
The implementation is based on PyTorch. Follow INSTALL.md to install the required dependencies.
The proposed UnAV-100 dataset can be downloaded from the [Project Page], including the YouTube links of the raw videos, annotations, and extracted features.
If you want to use your own choice of video features, you can download the raw videos from this link (Baidu Drive, pwd: qslx). A download script for the raw videos is also provided at scripts/video_download.py.
Note: after downloading the data, unpack the files under data/unav100. The folder structure should look like:
```
This folder
│   README.md
│   ...
└───data/
│   └───unav100/
│       └───annotations/
│           └───unav100_annotations.json
│       └───av_features/          # all features mixed together
│           └───__2MwJ2uHu0_flow.npy
│           └───__2MwJ2uHu0_rgb.npy
│           └───__2MwJ2uHu0_vggish.npy
│           ...
└───libs
│   ...
```
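Each video contributes three feature files (RGB, optical flow, and VGGish audio), all stored flat in av_features/. Below is a minimal sketch of how such per-video files could be loaded; the `load_av_features` helper and the feature dimensions used in the demo are illustrative assumptions, not the dataset's actual sizes or the repository's API.

```python
# Hypothetical loader sketch for the av_features/ layout shown above.
# Feature dimensions (1024 / 128) are placeholders, not the dataset's real shapes.
import os
import tempfile
import numpy as np

def load_av_features(feat_dir, video_id):
    """Load the RGB, optical-flow, and VGGish audio features for one video."""
    rgb = np.load(os.path.join(feat_dir, f"{video_id}_rgb.npy"))
    flow = np.load(os.path.join(feat_dir, f"{video_id}_flow.npy"))
    audio = np.load(os.path.join(feat_dir, f"{video_id}_vggish.npy"))
    return rgb, flow, audio

# Demo with synthetic arrays standing in for real feature files.
with tempfile.TemporaryDirectory() as d:
    vid = "__2MwJ2uHu0"
    for suffix, dim in [("rgb", 1024), ("flow", 1024), ("vggish", 128)]:
        np.save(os.path.join(d, f"{vid}_{suffix}.npy"),
                np.zeros((10, dim), dtype=np.float32))
    rgb, flow, audio = load_av_features(d, vid)
    print(rgb.shape, flow.shape, audio.shape)  # → (10, 1024) (10, 1024) (10, 128)
```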
Run train.py to train the model on the UnAV-100 dataset. This will create an experiment folder under ./ckpt that stores the training config, logs, and checkpoints.

```shell
python ./train.py ./configs/avel_unav100.yaml --output reproduce
```
Run eval.py to evaluate the trained model.

```shell
python ./eval.py ./configs/avel_unav100.yaml ./ckpt/avel_unav100_reproduce
```
[Optional] We also provide a pretrained model for UnAV-100, which can be downloaded from this link.
If you find our dataset and code useful for your research, please cite our paper:
```
@inproceedings{geng2023dense,
  title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
  author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22942--22951},
  year={2023}
}
```
The I3D RGB and flow video features and the VGGish audio features were extracted using video_features. Our baseline model is implemented based on ActionFormer. We thank the authors for sharing their code. If you use our code, please also consider citing their works.