Co-Separating Sounds of Visual Objects (ICCV 2019)

Co-Separating Sounds of Visual Objects

[Project Page] [arXiv] [Video]

Co-Separating Sounds of Visual Objects
Ruohan Gao¹ and Kristen Grauman¹,²
¹UT Austin, ²Facebook AI Research
In International Conference on Computer Vision (ICCV), 2019

If you find our code or project useful in your research, please cite:

   title = {Co-Separating Sounds of Visual Objects},
   author = {Gao, Ruohan and Grauman, Kristen},
   booktitle = {ICCV},
   year = {2019}

Generate noisy object detections

We use the public PyTorch implementation of Faster R-CNN to train an object detector with a ResNet-101 backbone. The object detector is trained on ∼30k images of 15 object categories from the Open Images dataset: Banjo, Cello, Drum, Guitar, Harp, Harmonica, Oboe, Piano, Saxophone, Trombone, Trumpet, Violin, Flute, Accordion, and Horn. The pre-trained detector is shared on Google Drive. Please refer to the Faster R-CNN implementation for instructions on how to use the pre-trained object detector or train a new detector on categories of interest. Use the pre-trained detector to generate object detections for both the training and testing sets, and save the object detection results of each video as one .npy file under /your_data_root/detection_results/. See the supplementary material for how we reduce the noise in the obtained detections.
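As an illustration of the one-.npy-file-per-video layout described above, here is a minimal sketch. The helper names and the row layout of the detection array (`[frame_index, class_id, score, x1, y1, x2, y2]`) are assumptions made for this example; the format actually expected by the data loader in this repository may differ.

```python
import os
import tempfile

import numpy as np


def save_detections(detection_root, video_name, detections):
    """Save all detections of one video as <detection_root>/<video_name>.npy."""
    os.makedirs(detection_root, exist_ok=True)
    out_path = os.path.join(detection_root, video_name + ".npy")
    np.save(out_path, np.asarray(detections, dtype=np.float32))
    return out_path


def load_detections(detection_root, video_name):
    """Load the detection array saved for one video."""
    return np.load(os.path.join(detection_root, video_name + ".npy"))


# Hypothetical row layout: [frame_index, class_id, score, x1, y1, x2, y2]
example = [
    [0, 3, 0.92, 10, 20, 110, 220],
    [8, 11, 0.75, 130, 40, 230, 240],
]

# A temporary directory stands in for /your_data_root/detection_results/.
root = os.path.join(tempfile.mkdtemp(), "detection_results")
path = save_detections(root, "video1_name", example)
loaded = load_detections(root, "video1_name")
print(path, loaded.shape)
```

Keeping one array per video means a single `np.load` call retrieves everything detected in that video.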

Co-Separation training

Use the following command to train your co-separation model:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python --name audioVisual --hdf5_path /your_root/hdf5/soloduet/ --scene_path /your_root/hdf5/ADE.h5 --gpu_ids 0,1,2,3,4,5,6,7 --batchSize 80 --nThreads 32 --display_freq 10 --save_latest_freq 500 --niter 1 --validation_freq 200 --validation_batches 20 --num_batch 35000 --lr_steps 15000 30000 --classifier_loss_weight 0.05 --coseparation_loss 1 --unet_num_layers 7 --lr_visual 0.00001 --lr_unet 0.0001 --lr_classifier 0.0001 --weighted_loss --visual_pool conv1x1 --optimizer adam --log_freq True --with_additional_scene_image --tensorboard True --validation_visualization True |& tee -a log.txt
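The command above sets three base learning rates (`--lr_visual`, `--lr_unet`, `--lr_classifier`) and two milestones (`--lr_steps 15000 30000`) within `--num_batch 35000` iterations. The sketch below shows one plausible reading of these flags as a step-decay schedule. The decay factor of 0.1 and the choice to decay exactly at the milestone iteration are assumptions, not values taken from this repository.

```python
import bisect


def stepped_lr(base_lr, iteration, lr_steps, decay=0.1):
    """Return the learning rate at `iteration` under step decay.

    `decay` is a hypothetical factor applied once per milestone reached.
    """
    # Count how many milestones are <= the current iteration.
    milestones_reached = bisect.bisect_right(sorted(lr_steps), iteration)
    return base_lr * decay ** milestones_reached


# Base learning rates and milestones copied from the training command.
base_lrs = {"visual": 0.00001, "unet": 0.0001, "classifier": 0.0001}
lr_steps = [15000, 30000]

for iteration in (0, 14999, 15000, 30000, 34999):
    rates = {name: stepped_lr(lr, iteration, lr_steps) for name, lr in base_lrs.items()}
    print(iteration, rates)
```

Under these assumptions the visual network keeps a learning rate ten times smaller than the U-Net and the classifier throughout training.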

Co-Separation testing

Use the following command to mix and separate two solo videos using your trained co-separation model or the shared model pre-trained on the MUSIC dataset:

python --video1_name video1_name --video2_name video2_name --visual_pool conv1x1 --unet_num_layers 7 --data_path /your_data_root/MUSIC_dataset/solo/ --weights_visual pretrained_models/audioVisual/visual_latest.pth --weights_unet pretrained_models/audioVisual/unet_latest.pth --weights_classifier pretrained_models/audioVisual/classifier_latest.pth  --num_of_object_detections_to_use 5 --with_additional_scene_image --scene_path /your_root/hdf5/ADE.h5 --output_dir_root results/
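To make the "mix" step concrete, here is a minimal sketch of combining the audio tracks of two solo videos into one mixture, so that the separated outputs can later be compared with the original tracks. The function name, the averaging of the two waveforms, and the sample rate are assumptions for illustration; the test script in this repository may mix and normalize the audio differently.

```python
import numpy as np


def mix_waveforms(audio1, audio2):
    """Trim two mono waveforms to a common length and average them."""
    n = min(len(audio1), len(audio2))
    a1 = np.asarray(audio1[:n], dtype=np.float32)
    a2 = np.asarray(audio2[:n], dtype=np.float32)
    mixture = (a1 + a2) / 2.0  # averaging keeps the mixture within [-1, 1]
    return mixture, a1, a2


# Two synthetic "solo" tracks standing in for the audio of video1 and video2.
sample_rate = 11025  # hypothetical sample rate
t1 = np.arange(sample_rate) / sample_rate
t2 = np.arange(8000) / sample_rate
audio1 = np.sin(2 * np.pi * 440.0 * t1)
audio2 = np.sin(2 * np.pi * 660.0 * t2)

mixture, a1, a2 = mix_waveforms(audio1, audio2)
print(mixture.shape, float(np.max(np.abs(mixture))))
```

A separation model receives only `mixture` (plus the visual detections) and is evaluated on how closely its outputs match `a1` and `a2`.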


Thanks to Dongguang You for help with the initial experimental setup. Portions of the code are adapted from the 2.5D Visual Sound implementation and the Sound-of-Pixels implementation. Please also refer to the original licenses of these projects.


The code in this repository is CC BY 4.0 licensed, as found in the LICENSE file.
