DCASE2019 task4: Sound event detection in domestic environments (DESED dataset and baseline)
Detailed information about the baseline can be found on the dedicated baseline page.
If you use the dataset or the baseline, please cite this paper.
6th March: [baseline] add baseline/Logger.py, update baseline/config.py and update README to send csv files.
2nd May: Removed duplicates in dataset/validation/test_dcase2018.csv and dataset/validation/validation.csv, which changes event-based results by 0.03%.
19th May: Updated the eval_dcase2018.csv and validation.csv. The problem was due to the annotation export: files listed with empty annotations actually had annotations.
28th May: Updated evaluation dataset 2019.
31st May: Update link to evaluation dataset (tar.gz) because of compression problem on some OS.
30th June: [baseline] Update get_predictions (+refactor) to get directly predictions in seconds.
Python >= 3.6, pytorch >= 1.0, cudatoolkit=9.0, pandas >= 0.24.1, scipy >= 1.2.1, pysoundfile >= 0.10.2, librosa >= 0.6.3, youtube-dl >= 2019.4.30, tqdm >= 4.31.1, ffmpeg >= 4.1, dcase_util >= 0.2.5, sed-eval >= 0.2.1
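The minimum versions above can be checked with a few lines of Python. This is a minimal sketch and not part of the baseline: the helper names (`version_tuple`, `parse_requirement`, `unmet_requirements`) are hypothetical, only a handful of the `>=` requirements are listed, and the installed versions are passed in as a plain dictionary that you fill in yourself (for instance from the output of `conda list`).

```python
import re

# A few of the ">=" requirements copied from the list above.
# The cudatoolkit pin uses "=" and is deliberately left out of this sketch.
REQUIREMENTS = [
    "pandas >= 0.24.1",
    "scipy >= 1.2.1",
    "librosa >= 0.6.3",
    "tqdm >= 4.31.1",
]


def version_tuple(version):
    """Turn '0.24.1' into (0, 24, 1); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def parse_requirement(requirement):
    """Split 'pandas >= 0.24.1' into ('pandas', (0, 24, 1))."""
    name, minimum = [part.strip() for part in requirement.split(">=")]
    return name, version_tuple(minimum)


def unmet_requirements(installed, requirements=REQUIREMENTS):
    """Return the requirement strings that are missing or too old.

    `installed` maps a package name to its installed version string.
    """
    problems = []
    for requirement in requirements:
        name, minimum = parse_requirement(requirement)
        version = installed.get(name)
        if version is None or version_tuple(version) < minimum:
            problems.append(requirement)
    return problems
```

Calling `unmet_requirements({"pandas": "0.25.0"})` would list every requirement except pandas, since the other packages are absent from the dictionary.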
A simplified installation procedure is provided below for a Python 3.6 based Anaconda distribution on a Linux system:
- install Anaconda
- launch conda_create_environment.sh
The baseline and download script have been tested with Python 3.6 on Linux (CentOS 7).
The Domestic Environment Sound Event Detection (DESED) dataset is composed of two subsets that can be downloaded independently:
- (Real recordings) launch
- (Synthetic clips) download at: synthetic_dataset.
- (Evaluation set) download at: evaluation dataset. There are 13190 files; the CSV is in dataset/metadata/eval/eval.csv. (Use tar -xzf eval.tar.gz to uncompress it.)
You are likely to run into download issues with the real recordings. Don't hesitate to relaunch download_data.py once or twice.
At the end of the download, please send an email with the CSV files created in the missing_files directory (preferably to Nicolas Turpault and Romain Serizel).
You should have a development set structured in the following manner:
dataset root
└───metadata                          (directories containing the annotations files)
│   │
│   └───train                         (annotations for the training sets)
│   │     weak.csv                    (weakly labeled training set list and annotations)
│   │     unlabel_in_domain.csv       (unlabeled in domain training set list)
│   │     synthetic.csv               (synthetic data training set list and annotations)
│   │
│   └───validation                    (annotations for the test set)
│         validation.csv              (validation set list with strong labels)
│         test_2018.csv               (test set list with strong labels - DCASE 2018)
│         eval_2018.csv               (eval set list with strong labels - DCASE 2018)
│
└───audio                             (directories where the audio files will be downloaded)
    └───train                         (audio files for the training sets)
    │   └───weak                      (weakly labeled training set)
    │   └───unlabel_in_domain         (unlabeled in domain training set)
    │   └───synthetic                 (synthetic data training set)
    │
    └───validation                    (validation set)
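To confirm that a download produced the layout above, something like the following sketch could be used. It is not part of the baseline: the function name `missing_entries` is hypothetical, and the path lists are simply transcribed from the tree above, so adjust them if your copy of the dataset differs.

```python
from pathlib import Path

# Directories and files transcribed from the tree above,
# relative to the dataset root.
EXPECTED_DIRS = [
    "metadata/train",
    "metadata/validation",
    "audio/train/weak",
    "audio/train/unlabel_in_domain",
    "audio/train/synthetic",
    "audio/validation",
]
EXPECTED_FILES = [
    "metadata/train/weak.csv",
    "metadata/train/unlabel_in_domain.csv",
    "metadata/train/synthetic.csv",
    "metadata/validation/validation.csv",
    "metadata/validation/test_2018.csv",
    "metadata/validation/eval_2018.csv",
]


def missing_entries(dataset_root):
    """Return the expected directories and files that are absent under dataset_root."""
    root = Path(dataset_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

An empty list means every expected directory and annotation file was found; it says nothing about whether all audio files were downloaded.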
Synthetic data (1.8Gb)
Freesound dataset [1,2]: A subset of FSD is used as foreground sound events for the synthetic subset of the DESED dataset. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology.
SINS dataset: The derivative of the SINS dataset used for DCASE2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.
The synthetic set is composed of 10-second audio clips generated with Scaper. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and checked whether the event onset and offset were present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class.
All sounds coming from FSD are released under Creative Commons licenses. Synthetic sounds can only be used for competition purposes until the full CC license list is made available at the end of the competition.
Real recordings (23.4Gb):
Subset of Audioset. Real recordings are extracted from Audioset, which consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
The download/extraction process can take approximately 4 hours. If you experience problems during the download of this subset please contact the task organizers.
The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated CSV file in the following format:
[filename (string)][tab][event_labels (strings)]
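A weak annotation file in this format could be read as in the sketch below. This is only an illustration, not the baseline's loader: the function name `parse_weak_annotations` is hypothetical, and two things are assumed rather than stated above, namely that the file starts with a header row whose first field is `filename`, and that several labels for one clip are separated by commas. The labels in the sample row are illustrative.

```python
import csv
import io


def parse_weak_annotations(lines):
    """Map each filename to its list of event labels.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumes a header row starting with 'filename' and comma-separated labels.
    """
    annotations = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0] == "filename":
            continue  # skip blank lines and the assumed header row
        filename = row[0]
        labels = row[1] if len(row) > 1 else ""
        annotations[filename] = [label for label in labels.split(",") if label]
    return annotations


# Illustrative input: one header row and one annotated clip.
sample = io.StringIO(
    "filename\tevent_labels\n"
    "YOTsn73eqbfc_10.000_20.000.wav\tAlarm_bell_ringing,Speech\n"
)
weak = parse_weak_annotations(sample)
```

With a real file, pass the open file object instead of the `io.StringIO` sample, for example `parse_weak_annotations(open(path, newline=""))`.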
Synthetic subset and validation set have strong annotations.
The minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event. The strong annotations are provided in a tab-separated CSV file in the following format:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event_label (strings)]
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
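The strong annotation format and the 150 ms merging rule described above can be sketched as follows. This illustrates the rule only; it is not the organizers' annotation tooling, and the function names `parse_strong_line` and `merge_close_events` are hypothetical. The sketch assumes fields are separated by tabs and does not handle a header row or the 250 ms minimum event length.

```python
from collections import defaultdict

MIN_PAUSE = 0.150  # seconds; minimum pause between two events of the same class


def parse_strong_line(line):
    """Split one tab-separated annotation line into (filename, onset, offset, label)."""
    filename, onset, offset, label = line.rstrip("\n").split("\t")
    return filename, float(onset), float(offset), label


def merge_close_events(events, min_pause=MIN_PAUSE):
    """Merge events of the same file and class separated by less than min_pause.

    `events` is a list of (filename, onset, offset, label) tuples.
    Returns a sorted list in the same format.
    """
    grouped = defaultdict(list)
    for filename, onset, offset, label in events:
        grouped[(filename, label)].append((onset, offset))

    merged = []
    for (filename, label), spans in grouped.items():
        spans.sort()
        current_onset, current_offset = spans[0]
        for onset, offset in spans[1:]:
            if onset - current_offset < min_pause:
                # Pause too short: extend the current event.
                current_offset = max(current_offset, offset)
            else:
                merged.append((filename, current_onset, current_offset, label))
                current_onset, current_offset = onset, offset
        merged.append((filename, current_onset, current_offset, label))
    return sorted(merged)
```

Events are grouped per file and per class before merging, so two overlapping events of different classes are never combined.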
This task is the follow-up to DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. The challenge remains to exploit a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance, but an additional training set with strongly annotated synthetic data is now provided. The labels in all the annotated subsets are verified and can be considered reliable. An additional scientific question this task aims to investigate is whether real but partially and weakly annotated data is really needed, whether synthetic data alone is sufficient, or whether both are needed.
Further information on dcase_website
Nicolas Turpault, Romain Serizel, Justin Salamon, Ankit Parag Shah, 2019 -- Present
[1] F. Font, G. Roma & X. Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
[2] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter & X. Serra. Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
[3] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings IEEE ICASSP 2017, New Orleans, LA, 2017.
[4] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: A library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.
[6] Romain Serizel, Nicolas Turpault. Sound Event Detection from Partially Annotated Data: Trends and Challenges. IcETRAN conference, Srebrno Jezero, Serbia, June 2019.