
Acoustic scene classification example based on DCASE 2019 challenge




Acoustic scene classification

"Darling, where are you?" may sound a bit catchy but it describes well what acoustic scene classification is about. When interacting with mobile devices we expect relevant information to be presented with a minimum of input effort. What is relevant depends on the context in which we are acting. If we are requesting a route information while sitting at a bus stop we most probably are looking for directions for travel via a bus, while at a railway station we most probably are looking for a train connection. One possibility for a device to identify the context is via geolocation information. But this information may not be available inside buildings. An alternative approach is the analysis of ambient sound.

This project demonstrates how a convolutional neural network can be used for acoustic scene classification.

Recordings made available as part of the DCASE (Detection and Classification of Acoustic Scenes and Events) 2019 challenge are used as input data. These comprise 14400 audio files recorded in 10 settings in 10 different cities.

The recordings are converted to log-frequency short-time power spectrograms.
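The conversion script itself is not shown here, but the idea can be sketched with plain NumPy (the project lists librosa, whose `librosa.stft` and `librosa.amplitude_to_db` would normally be used for this); the function name and parameter values below are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def log_freq_power_spectrogram(signal, sr=48000, n_fft=2048, hop=1024, n_bins=128):
    """Illustrative sketch: short-time power spectrogram on log-spaced frequency bins."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)   # power spectrum of one frame
    power = np.array(frames).T                           # (freq_bins, time_frames)
    # map the linear FFT bins onto log-spaced frequency bands
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    edges = np.logspace(np.log10(20), np.log10(sr / 2), n_bins + 1)
    log_spec = np.array([power[(freqs >= lo) & (freqs < hi)].sum(axis=0)
                         for lo, hi in zip(edges[:-1], edges[1:])])
    return 10 * np.log10(log_spec + 1e-10)               # power in dB

# a one-second 440 Hz test tone instead of a real recording
t = np.linspace(0, 1, 48000, endpoint=False)
spec = log_freq_power_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

The resulting 2-D array can then be saved as a grayscale image, which is what the later steps feed to the network.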


Two different augmentation methods were used. While time warping proved beneficial, random masking of frequency bands showed a negative effect.
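Of the two augmentation methods, random frequency masking is the simpler to illustrate. A minimal NumPy sketch of SpecAugment-style band masking follows; the function name and default parameters are assumptions for illustration, not the project's actual code:

```python
import numpy as np

def mask_frequency_bands(spectrogram, max_width=8, n_masks=2, rng=None):
    """Zero out random horizontal bands (frequency rows) of a spectrogram.

    This is the kind of frequency masking that, per the experiments above,
    *hurt* accuracy on this task, while time warping helped.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spectrogram.copy()
    n_freq = out.shape[0]
    for _ in range(n_masks):
        width = rng.integers(1, max_width + 1)
        start = rng.integers(0, n_freq - width + 1)
        out[start:start + width, :] = 0.0   # silence the chosen band
    return out

spec = np.ones((128, 431))                  # fake spectrogram: 128 bands, 431 frames
masked = mask_frequency_bands(spec, rng=np.random.default_rng(0))
print(masked.shape, (masked == 0).any())
```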

After hyperparameter optimization, settings were identified that achieved an accuracy of 69.7 % on the validation data set.

Hence acoustic scene classification seems to be a feasible approach for providing context dependent information.


The main files to look at are the Jupyter notebook acoustic_scene_classification.ipynb and the blog in blog/index.html.

  • acoustic_scene_classification.ipynb - Jupyter notebook
  • acoustic_scene_classification.html - HTML export of the Jupyter notebook
  • blog/index.html - an overview article
  • data/ - a script to download the raw data used
  • - Python script to create log-frequency power spectrograms
  • - Python script to split the data into training, validation and test
  • - Python script to train the neural network
  • - Python script to make predictions based on the trained network


It is assumed that you are running in an Anaconda environment with the packages mentioned in the "Software used" chapter installed and are using a POSIX operating system, e.g. Debian Buster.

You will need at least 85 GiB free disk space for the data. Check it with

df .

You will need at least 6 GiB of video memory. Check it with

nvidia-smi
Download the raw data (34 GiB) and extract the zip files

cd data/ && ./ && cd ..

Convert the data to spectrograms

mkdir -p data/spectrograms/
python data/TAU-urban-acoustic-scenes-2019-development/audio/ \

Split the data into training, validation, and test sub-sets

mkdir -p data/splitted/
python data/spectrograms/ data/splitted/
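The split script's name was lost above, but the per-scene split it performs can be sketched as follows. Shuffling within each scene keeps the class balance across the sub-sets; the function name, split fractions, and directory-handling details are illustrative assumptions, not the project's actual code:

```python
import random

def split_files(files_by_scene, val_frac=0.2, test_frac=0.1, seed=42):
    """Split spectrogram files into train/validation/test per scene.

    Splitting within each scene keeps all three sub-sets class-balanced.
    The fractions here are illustrative, not the project's actual ratios.
    """
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for scene, files in files_by_scene.items():
        files = sorted(files)
        rng.shuffle(files)
        n_val = int(len(files) * val_frac)
        n_test = int(len(files) * test_frac)
        splits["validation"] += files[:n_val]
        splits["test"] += files[n_val:n_val + n_test]
        splits["train"] += files[n_val + n_test:]
    return splits

demo = {"bus": [f"bus_{i}.png" for i in range(10)],
        "metro": [f"metro_{i}.png" for i in range(10)]}
s = split_files(demo)
print({k: len(v) for k, v in s.items()})
```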

Now all prerequisites are set up and you can run the Jupyter notebook

jupyter notebook

Train the neural network. This will take several hours.

python --epochs 64 data/splitted

Or use the checkpoint created by running the Jupyter notebook.

After the network is trained, a checkpoint file '' is saved. This file is needed for prediction. By default the file is saved in the current directory.

The console output shows the training progress.

To show all available parameters of you can use

python --help

The checkpoint can now be used to make predictions by applying the trained model to a spectrogram:

python data/spectrograms/bus/milan/1115-42136-a.png

The console output shows the probabilities of the top five categories.
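Turning the network's raw outputs (logits) into the top-five probabilities works like this. The project would do this with PyTorch's softmax and `topk`; below is a NumPy sketch. The scene names are the ten labels of the DCASE 2019 Task 1A data set, but the function name is an assumption for illustration:

```python
import numpy as np

SCENES = ["airport", "bus", "metro", "metro_station", "park",
          "public_square", "shopping_mall", "street_pedestrian",
          "street_traffic", "tram"]          # the 10 DCASE 2019 scene labels

def top5(logits):
    """Convert raw network outputs into the five most probable scene labels."""
    z = np.asarray(logits, dtype=float)
    probs = np.exp(z - z.max())              # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1][:5]      # indices of the 5 largest probabilities
    return [(SCENES[i], float(probs[i])) for i in order]

# fake logits standing in for a real network output
for scene, p in top5([0.1, 4.0, 2.5, 0.3, -1.0, 0.0, 1.2, 0.2, 0.5, -0.3]):
    print(f"{scene:18s} {p:.3f}")
```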

To show all available parameters of you can use

python --help

Software used

These are the software versions that were used:

  • Debian Buster
  • CUDA 10.1
  • Python 3.7.4
  • conda 4.7.12
  • ipython 7.8.0
  • librosa 0.7.1
  • matplotlib 3.1.1
  • notebook 6.0.1
  • numpy 1.17.2
  • pandas 0.25.1
  • pillow 6.2.0
  • pytorch 1.3.0
  • scikit-image 0.15.0
  • scikit-learn 0.21.3
  • torchvision 0.4.1


The code is published under the [LGPL-v2] license.

The documentation is published under the [CC BY-SA 4.0] license.

