Unified Audio-Visual Perception for Multi-Task Video Localization


This repo is the official code of "Unified Audio-Visual Perception for Multi-Task Video Localization".

Introduction

This paper introduces the first unified framework to localize three kinds of instances in untrimmed videos: visual actions, sound events, and audio-visual events. All three contribute equally to a comprehensive understanding of video content.

Requirements

The implementation is based on PyTorch. Environment: Linux, GCC >= 4.9, CUDA >= 11.0, Python 3.9, PyTorch 1.11.0. Follow INSTALL.md to install the required dependencies.
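A quick way to confirm your environment matches these requirements (a minimal, generic check, not part of the repo):

# Minimal environment sanity check for the versions listed above.
import torch

print(torch.__version__)          # expected: 1.11.0
print(torch.version.cuda)         # expected: >= 11.0
print(torch.cuda.is_available())  # training and evaluation require a CUDA GPU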

Data preparation

  • Download ActivityNet 1.3 from this link (pwd: cjf7). For visual features: fps=16, sliding window size=16, stride=8. For audio features: sample rate=16 kHz, sliding window size=1 s, stride=0.5 s.
  • Download DESED from this link (pwd: 61le). For visual features: fps=16, sliding window size=16, stride=4. For audio features: sample rate=16 kHz, sliding window size=1 s, stride=0.25 s.
  • Download UnAV-100 from this link (pwd: zyfm). For visual features: fps=16, sliding window size=16, stride=4. For audio features: sample rate=16 kHz, sliding window size=1 s, stride=0.25 s.

Details: Each link includes the annotation files in JSON format as well as the audio and visual features. The audio and visual features are extracted with the audio and visual encoders of ONE-PEACE, respectively, where the visual encoder is fine-tuned on Kinetics-400.
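To make these settings concrete, below is a small illustration of how feature indices map to timestamps, assuming the visual window and stride above are measured in frames at the given fps (ActivityNet 1.3 settings shown). The helper names are ours, not part of the repo.

# Illustration of the ActivityNet 1.3 feature settings above:
# visual fps=16, window=16, stride=8 (frames); audio window=1 s, stride=0.5 s.
# Helper names are illustrative, not part of this repo.

def visual_feature_times(num_feats, fps=16, window=16, stride=8):
    # (start, end) time in seconds covered by each visual feature window
    return [(i * stride / fps, (i * stride + window) / fps) for i in range(num_feats)]

def audio_feature_times(num_feats, window_s=1.0, stride_s=0.5):
    # (start, end) time in seconds covered by each audio feature window
    return [(i * stride_s, i * stride_s + window_s) for i in range(num_feats)]

print(visual_feature_times(3))  # [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]
print(audio_feature_times(3))   # [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]

Under these settings the visual and audio feature sequences line up in time: one feature every 0.5 s for ActivityNet 1.3, and every 0.25 s for DESED and UnAV-100.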

After downloading, unpack the files under ./data. The folder structure should look like:

This folder
│   README.md
│   ...
└───data/
│   └───activitynet13/
│   │   └───annotations
│   │   └───av_features  # audio and visual features mixed together
│   └───desed/
│   │   └───annotations
│   │   └───av_features
│   └───unav100/
│       └───annotations
│       └───av_features
└───libs/
│   ...
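An optional check that the unpacked data matches this layout (folder names taken from the tree above):

# Verify the expected dataset layout under ./data (names from the tree above).
from pathlib import Path

for name in ("activitynet13", "desed", "unav100"):
    for sub in ("annotations", "av_features"):
        path = Path("data") / name / sub
        print(f"{path}: {'ok' if path.is_dir() else 'MISSING'}")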

Training

Run ./train.py to jointly train UniAV on the three tasks: temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL). We use distributed training here.

CUDA_VISIBLE_DEVICES={device_id} MASTER_ADDR={localhost} WORLD_SIZE={1} RANK={0} python -m torch.distributed.launch --master_port {port_id} --nproc_per_node={1} train.py ./configs/multi_task_anet_unav_dcase.yaml --output reproduce --tasks 1-2-3 --num_train_epochs 5
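For reference, here is a minimal sketch of a script compatible with this torch.distributed.launch invocation on PyTorch 1.11 (single process, NCCL backend); the actual setup inside train.py may differ.

# Minimal sketch of the distributed setup expected by the launch command above
# (PyTorch 1.11, single GPU). The real train.py may handle this differently.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank unless --use_env is set
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()

# MASTER_ADDR / MASTER_PORT / WORLD_SIZE / RANK come from the launcher and the
# environment variables set in the command above (default init_method="env://").
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")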

Evaluation

Run eval.py to evaluate the trained model. You can download our pre-trained model from this link (pwd: kfne).

CUDA_VISIBLE_DEVICES={device_id} MASTER_ADDR={localhost} WORLD_SIZE={1} RANK={0} python -m torch.distributed.launch --master_port {port_id} --nproc_per_node={1} eval.py ./configs/multi_task_anet_unav_dcase.yaml ./ckpt/multi_task_anet_unav_dcase_reproduce --tasks 1-2-3

Running UniAV on your own videos

Given an untrimmed video with audio, our model can localize all three kinds of instances occurring in the video in a single pass. The inference code and demo will be released soon.

Citation

If you find our data and code useful for your research, please consider citing our paper:

@article{geng2024uniav,
  title={UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization},
  author={Geng, Tiantian and Wang, Teng and Zhang, Yanfu and Duan, Jinming and Guan, Weili and Zheng, Feng},
  journal={arXiv preprint arXiv:2404.03179},
  year={2024}
}

Acknowledgement

The video and audio features were extracted using ONE-PEACE. Our baseline model was implemented based on ActionFormer and UnAV. We thank the authors for their efforts. If you use our code, please also consider citing their works.
