Recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner and, by design, excludes the temporal information present in videos. While effective on widely used benchmark datasets, these methods fall short in challenging scenarios such as urban traffic. This work introduces temporal context into state-of-the-art sound source localization methods for urban scenes, using optical flow to encode motion information.
Demo video: `result.video.mp4`
Install the dependencies with `pip install -r requirements.txt`.
We use the Urban Sound and Sight (Urbansas) dataset to train and evaluate our models. Evaluation on Urbansas can be run as follows:

- Prepare the data for evaluation (a minimal sketch of the flow computation is shown after these steps):

  ```
  python src/data/prepare_data.py -d PATH_TO_URBANSAS
  python src/data/calc_flow.py -d PATH_TO_URBANSAS
  ```

- Run the evaluation. Several different models can be used, and the model is passed as an argument:

  ```
  python evaluate.py -m MODEL
  ```
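For illustration, here is a minimal sketch of what the flow pre-computation step can look like, using OpenCV's Farneback estimator. This is not the actual implementation of `src/data/calc_flow.py`, which may use a different flow estimator, resolution, and storage format:

```python
# Illustrative only: dense optical flow between consecutive video frames,
# cached to disk. The repository's calc_flow.py may differ.
import cv2
import numpy as np

def compute_flow_for_video(video_path: str, out_path: str) -> None:
    """Compute Farneback dense optical flow for each consecutive frame pair."""
    cap = cv2.VideoCapture(video_path)
    flows = []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # (H, W, 2) array of per-pixel (dx, dy) displacements.
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
            )
            flows.append(flow.astype(np.float16))  # reduce storage size
        prev_gray = gray
    cap.release()
    if flows:
        np.save(out_path, np.stack(flows))
```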
The following model choices are available:

- `rcgrad` - pretrained model from *How to Listen? Rethinking Visual Sound Source Localization* (Paper) (Repo)
- `flow` - optical flow used as localization maps (see the sketch after this list)
- `flowgrad-H`, `flowgrad-EN`, `flowgrad-IC` - variants of FlowGrad (refer to the paper for details)
- `yolo_baseline`, `yolo_topline` - vision-only object detection models used as baselines. The topline adds motion-based filtering (stationary objects are discarded).
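As rough intuition for the `flow` and `flowgrad-*` options, the sketch below shows one way optical flow can serve as a localization map and be fused with an audio-visual map such as the one produced by RCGrad. This is an assumption-laden illustration only: the element-wise product stands in for the 'H' variant by analogy to a Hadamard product, and the actual FlowGrad formulations are described in the paper.

```python
# Illustrative sketch, not the repository's implementation: use dense optical
# flow magnitude as a localization map and fuse it with an audio-visual map.
import numpy as np

def flow_to_localization_map(flow: np.ndarray) -> np.ndarray:
    """Normalize per-pixel flow magnitude to [0, 1].

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude / (magnitude.max() + 1e-8)

def fuse_maps(av_map: np.ndarray, flow_map: np.ndarray) -> np.ndarray:
    """Element-wise (Hadamard) product of an audio-visual localization map and
    a flow-based map -- an assumed stand-in for the idea behind the 'H'
    variant, not its actual formulation."""
    fused = av_map * flow_map
    return fused / (fused.max() + 1e-8)

# Toy usage with random placeholder inputs:
rng = np.random.default_rng(0)
flow = rng.normal(size=(224, 224, 2))   # stand-in for a real flow field
av_map = rng.random((224, 224))         # stand-in for an RCGrad map
loc_map = fuse_maps(av_map, flow_to_localization_map(flow))
```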
Results on Urbansas:

| Model | IoU (τ = 0.5) | AUC |
|---|---|---|
| Vision-only+CF+TF (topline) | 0.68 | 0.51 |
| Optical flow (baseline) | 0.33 | 0.23 |
| RCGrad | 0.16 | 0.13 |
| FlowGrad-H | 0.50 | 0.30 |
| FlowGrad-IC | 0.26 | 0.18 |
| FlowGrad-EN | 0.37 | 0.23 |
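For context, IoU and AUC here are presumably computed along the lines of the standard sound source localization protocol, in which a localization map is binarized and scored against the annotated region with IoU, and AUC summarizes success rates over a sweep of IoU thresholds. The sketch below is a generic illustration of that protocol under those assumptions, not the exact code used to produce the table:

```python
# Generic illustration of IoU / AUC scoring for localization maps; the
# evaluation code in this repository may differ in details (e.g. how the
# binarization threshold is chosen and how AUC is integrated).
import numpy as np

def localization_iou(pred_map: np.ndarray, gt_mask: np.ndarray,
                     bin_thresh: float = 0.5) -> float:
    """IoU between a binarized localization map and a ground-truth mask."""
    pred = pred_map >= bin_thresh
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def success_auc(per_sample_ious: np.ndarray, num_thresholds: int = 21) -> float:
    """Average success rate (IoU >= tau) over a uniform sweep of tau in [0, 1],
    approximating the area under the success-vs-threshold curve."""
    taus = np.linspace(0.0, 1.0, num_thresholds)
    success_rates = [(per_sample_ious >= tau).mean() for tau in taus]
    return float(np.mean(success_rates))
```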
Planned features:

- Inference on custom videos
- Training
- Support for other benchmark datasets for Visual Sound Source Localization
If you have any ideas or feature requests, please feel free to raise an issue!
This work builds upon and borrows heavily from hohsiangwu/rethinking-visual-sound-localization.