
DCASE 2018 Task 2 Audio Tagging of Freesound

DCASE 2018 Task 2 is a task to classify short audio clips into one of 41 audio classes, such as 'fireworks', 'cello' and 'barks'. We provide a convolutional neural network (CNN) baseline system implemented with PyTorch in this code base. More details about this challenge can be found


The dataset is downloadable from There are 41 audio classes. The duration of the audio samples ranges from 300ms to 30s.

                         Training    Testing
Manually verified        3710        ~1.6k
Not manually verified    5763        ~7.8k (Not used for evaluation)
Total                    9473        ~9.4k

Run the code

Prepare data. Download and unzip the data. The data looks like:

├── audio_train (9473 audios)
│     └── ...
├── audio_test (9400 audios)
│     └── ...
├── train.csv
└── sample_submission.csv
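
Before extracting features, it can help to confirm that the unzipped data matches the layout above. The following is a minimal sketch and not part of the baseline code; the helper name `check_dataset_layout` is hypothetical, and it simply counts the files in each audio folder and checks that the two CSV files exist.

```python
import os


def check_dataset_layout(dataset_dir):
    """Summarise the dataset folder: file counts and CSV presence.

    Hypothetical helper for illustration; not part of the baseline code.
    """
    summary = {}
    for folder in ("audio_train", "audio_test"):
        path = os.path.join(dataset_dir, folder)
        if os.path.isdir(path):
            # Count regular files only; sub-directories are ignored.
            summary[folder] = sum(
                1
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
        else:
            summary[folder] = None  # The folder is missing.
    for csv_name in ("train.csv", "sample_submission.csv"):
        summary[csv_name] = os.path.isfile(os.path.join(dataset_dir, csv_name))
    return summary
```

For the full dataset, `summary["audio_train"]` should be 9473 and `summary["audio_test"]` should be 9400, as listed in the tree above.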

1. (Optional) Install dependent packages. The code is implemented in Python 3. If you are using conda, simply run:

$ conda env create -f environment.yml

$ conda activate py3_dcase2018_task2

(We developed this system with Python 3. If you are using PyTorch as the backend, PyTorch 0.4.0 is required.)

2. Then simply run:

$ ./

Or run the commands line by line, including:

(1) Modify the paths of data and your workspace

(2) Extract features

(3) Train model

(4) Evaluation


We apply a convolutional neural network to the log mel spectrogram features to solve this task. Training takes around 300 ms per iteration on a GTX Titan X GPU for the VGGish model. Training nearly converges after 3000 iterations.

Loading data time: 1.708 s
Training audios number: 7104
Validation audios number: 2369
Training patches number: 18777
train acc: 0.020, train mapk: 0.032
valid acc: 0.024, validate mapk: 0.040
Iteration: 0, train time: 0.004 s, eval time: 2.829 s
train acc: 0.747, train mapk: 0.828
valid acc: 0.693, validate mapk: 0.775
Iteration: 2800, train time: 59.517 s, eval time: 3.286 s
train acc: 1.000, train mapk: 1.000
valid acc: 0.897, validate mapk: 0.930
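
The log reports two metrics, `acc` and `mapk`. How the baseline computes them is not shown here, so the following is only a sketch under two assumptions: that `mapk` is the mean average precision at 3 (mAP@3), and that each clip has exactly one true label. Under those assumptions a clip scores 1/rank if its true label appears at position `rank` within the top 3 predictions, and 0 otherwise. The function names are hypothetical.

```python
def accuracy(true_labels, ranked_predictions):
    """Fraction of clips whose top-ranked prediction is the true label."""
    correct = sum(
        1
        for true_label, ranking in zip(true_labels, ranked_predictions)
        if ranking and ranking[0] == true_label
    )
    return correct / len(true_labels)


def map_at_k(true_labels, ranked_predictions, k=3):
    """Mean average precision at k, assuming one true label per clip."""
    total = 0.0
    for true_label, ranking in zip(true_labels, ranked_predictions):
        for rank, predicted in enumerate(ranking[:k], start=1):
            if predicted == true_label:
                total += 1.0 / rank  # Credit decays with the rank of the hit.
                break
    return total / len(true_labels)


# Toy example with three clips (labels chosen for illustration only).
true_labels = ["cello", "fireworks", "applause"]
ranked_predictions = [
    ["cello", "fireworks", "applause"],   # hit at rank 1 -> 1.0
    ["cello", "fireworks", "applause"],   # hit at rank 2 -> 0.5
    ["cello", "fireworks", "squeak"],     # no hit in top 3 -> 0.0
]
print(accuracy(true_labels, ranked_predictions))   # 1 of 3 clips correct
print(map_at_k(true_labels, ranked_predictions))   # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Because mAP@3 gives partial credit for a correct label at rank 2 or 3, it is never lower than accuracy, which is consistent with the values in the log.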

Overall accuracy

We split the development data into 4 folds. The overall performance on the 4 folds is:

         accuracy    mAP@3
Total    0.895       0.928
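
The 4-fold split can be illustrated with a short sketch. How the baseline actually assigns clips to folds is not described here, so the code below assumes a seeded random assignment; `assign_folds` and `split_by_fold` are hypothetical names. With 9473 development clips, one fold receives 2369 clips and the other three receive 2368 each, so holding out the larger fold gives the 7104 training and 2369 validation audios shown in the training log.

```python
import random


def assign_folds(num_clips, num_folds=4, seed=0):
    """Assign each clip index to a fold after a seeded shuffle (assumed scheme)."""
    indices = list(range(num_clips))
    random.Random(seed).shuffle(indices)
    folds = [0] * num_clips
    for position, clip_index in enumerate(indices):
        folds[clip_index] = position % num_folds
    return folds


def split_by_fold(folds, holdout_fold):
    """Return (train_indices, validation_indices) for one held-out fold."""
    train = [i for i, fold in enumerate(folds) if fold != holdout_fold]
    valid = [i for i, fold in enumerate(folds) if fold == holdout_fold]
    return train, valid


folds = assign_folds(9473)
train_indices, valid_indices = split_by_fold(folds, holdout_fold=0)
print(len(train_indices), len(valid_indices))  # 7104 2369
```

Repeating the split with each fold held out in turn and pooling the validation predictions gives one overall score of the kind reported in the table above.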

Class-wise accuracy

The class-wise accuracy is shown below:

[Figure: class-wise accuracy]


This codebase provides a convolutional neural network (CNN) for DCASE 2018 Challenge Task 2. Some sound classes, such as 'applause' and 'bark', have high classification accuracy, while others, such as 'squeak' and 'telephone', have low classification accuracy.


Kong, Qiuqiang, Turab Iqbal, Yong Xu, Wenwu Wang, and Mark D. Plumbley. "DCASE 2018 Challenge baseline with convolutional neural networks." arXiv preprint arXiv:1808.00773 (2018).


If you run out of GPU memory, try reducing batch_size.

External link

The official baseline system implemented using Keras can be found in