
Flowgrad - Using Motion for Visual Sound Source Localization

rrrajjjj/flowgrad


Paper

Recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes the temporal information present in videos. While this proves effective on widely used benchmark datasets, it falls short in challenging scenarios such as urban traffic. This work introduces temporal context into state-of-the-art sound source localization methods for urban scenes, using optical flow to encode motion information.

Inference on a Video

result.video.mp4

Setup

Install requirements

pip install -r requirements.txt

Evaluation on Urbansas

We use the Urban Sound and Sight (Urbansas) dataset to train and evaluate our models. Evaluation on Urbansas can be run as follows:

  1. Prepare data for evaluation

     python src/data/prepare_data.py -d PATH_TO_URBANSAS
     python src/data/calc_flow.py -d PATH_TO_URBANSAS
    
  2. Run evaluation. Several different models can be used for evaluation; pass the model of choice as an argument:

     python evaluate.py -m MODEL
    

The following model choices are available:

  1. rcgrad - pretrained model from How to Listen? Rethinking Visual Sound Source Localization (Paper) (Repo)
  2. flow - optical flow magnitude used directly as the localization map
  3. flowgrad-H, flowgrad-EN, flowgrad-IC - variants of FlowGrad (refer to the paper for details)
  4. yolo_baseline, yolo_topline - vision-only object detection models used as baselines. The topline adds motion-based filtering (stationary objects are discarded).
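The flowgrad variants combine the audio-visual localization map with the optical-flow map. The sketch below shows one plausible fusion, an element-wise product of min-max normalised maps, so a pixel scores highly only where audio-visual evidence and motion agree. This is illustrative only; the actual flowgrad-H/-EN/-IC combinations are defined in the paper and differ from this toy fusion.

```python
import numpy as np

def fuse_maps(av_map: np.ndarray, flow_mag: np.ndarray,
              eps: float = 1e-8) -> np.ndarray:
    """Toy fusion of an audio-visual localization map with a flow map.

    Both maps are min-max normalised to [0, 1] and multiplied, so the
    fused map peaks only where both cues are strong. Hypothetical
    combination, not the paper's exact formulation.
    """
    def _norm(m: np.ndarray) -> np.ndarray:
        m = m.astype(np.float64)
        return (m - m.min()) / (m.max() - m.min() + eps)
    return _norm(av_map) * _norm(flow_mag)

# Toy maps: audio-visual evidence at the top-left, motion in the top half.
av = np.zeros((4, 4)); av[0, 0] = 1.0
fl = np.zeros((4, 4)); fl[:2, :] = 1.0
fused = fuse_maps(av, fl)  # peaks at (0, 0), zero in the static bottom half
```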

Performance

| Model                           | IoU (τ = 0.5) | AUC  |
| ------------------------------- | ------------- | ---- |
| Vision-only + CF + TF (topline) | 0.68          | 0.51 |
| Optical flow (baseline)         | 0.33          | 0.23 |
| RCGrad                          | 0.16          | 0.13 |
| FlowGrad-H                      | 0.50          | 0.30 |
| FlowGrad-IC                     | 0.26          | 0.18 |
| FlowGrad-EN                     | 0.37          | 0.23 |
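As a rough sketch of the IoU (τ = 0.5) metric reported above: the localization map is thresholded at τ and compared against a ground-truth mask. The exact Urbansas evaluation protocol lives in `evaluate.py` and may differ (for example in how maps are normalised before thresholding).

```python
import numpy as np

def iou_at_tau(pred_map: np.ndarray, gt_mask: np.ndarray,
               tau: float = 0.5) -> float:
    """IoU between a localization map thresholded at tau and a GT mask."""
    pred = pred_map >= tau                      # binarise the prediction
    inter = np.logical_and(pred, gt_mask).sum() # pixels both agree on
    union = np.logical_or(pred, gt_mask).sum()  # pixels either covers
    return float(inter) / float(union) if union else 0.0

# Toy 2x2 example: prediction covers the whole top row, GT only top-left.
pred = np.array([[0.9, 0.8], [0.1, 0.2]])
gt = np.array([[True, False], [False, False]])
score = iou_at_tau(pred, gt)  # intersection 1, union 2 -> 0.5
```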

(Figure: main result)

Upcoming!

  1. Inference on custom videos
  2. Training
  3. Support for other benchmark datasets for Visual Sound Source Localization

If you have any ideas or feature requests, please feel free to raise an issue!

Acknowledgements

This work is built upon, and borrows heavily from, hohsiangwu/rethinking-visual-sound-localization.
