Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms

A TensorFlow+Keras implementation of "Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms", including a Jupyter notebook for excitation analysis.

[Image: ICASSP 2018 poster]

Table of contents

  • Citation
  • Prerequisites
  • Preparing MagnaTagATune (MTT) dataset
  • Preprocessing the MTT dataset
  • Training a model from scratch
  • Training a model with options
  • Downloading pre-trained models
  • Evaluating a model
  • Excitation Analysis
  • Issues & Questions

Citation

@inproceedings{kim2018sample,
  title={Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms},
  author={Kim, Taejun and Lee, Jongpil and Nam, Juhan},
  booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  organization={IEEE}
}

Prerequisites

  • Python 3.5 and the required packages
  • ffmpeg (required for madmom)

Installing required Python packages

pip install -r requirements.txt
pip install madmom

The madmom package has an install-time dependency, so it should be installed after the packages in requirements.txt.

Installing ffmpeg

ffmpeg is required for madmom.

macOS (with Homebrew):

brew install ffmpeg

Ubuntu:

add-apt-repository ppa:mc3man/trusty-media
apt-get update
apt-get dist-upgrade
apt-get install ffmpeg

CentOS:

yum install epel-release
rpm --import http://li.nux.ro/download/nux/RPM-GPG-KEY-nux.ro
rpm -Uvh http://li.nux.ro/download/nux/dextop/el ... noarch.rpm
yum install ffmpeg

Preparing MagnaTagATune (MTT) dataset

Download the audio data and tag annotations from the MagnaTagATune dataset page. You should then have three .zip files and one .csv file:

mp3.zip.001
mp3.zip.002
mp3.zip.003
annotations_final.csv

Because the archive is split into parts, concatenate them and unzip the merged file:

cat mp3.zip.* > mp3_all.zip
unzip mp3_all.zip

You should see 16 directories named 0 to f. Typically, 0 to b are used for training, c for validation, and d to f for testing.

To make your life easier, create a directory named dataset and place the files in it as shown below:

mkdir dataset

Your directory structure should be like:

dataset
├── annotations_final.csv
└── mp3
    ├── 0
    ├── 1
    ├── ...
    └── f

The MTT dataset preparation is now done!

Preprocessing the MTT dataset

This section describes the required preprocessing for the MTT dataset. Note that it needs about 48 GB of storage space and uses multiprocessing.

The preprocessing does the following (a rough sketch of the resampling and segmentation step follows the list):

  • Select top 50 tags in annotations_final.csv
  • Split dataset into training, validation, and test sets
  • Resample the raw audio files to 22050 Hz
  • Segment the resampled audio into 59049-sample excerpts
  • Convert the segments to TFRecord format
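
For intuition, the resampling and segmentation step can be sketched as follows. This is a simplified illustration, assuming librosa for audio loading and a placeholder file path; build_mtt.py performs these steps (plus tag selection, dataset splitting, and TFRecord writing) over the whole dataset with multiprocessing.

import numpy as np
import librosa

SAMPLE_RATE = 22050   # target sampling rate
SEGMENT_LEN = 59049   # samples per segment (about 2.7 s at 22050 Hz)

# Placeholder path; any MP3 under dataset/mp3/ would do.
audio, _ = librosa.load("dataset/mp3/0/example.mp3", sr=SAMPLE_RATE, mono=True)

# Drop the trailing remainder and cut the waveform into fixed-length segments.
num_segments = len(audio) // SEGMENT_LEN
segments = np.reshape(audio[:num_segments * SEGMENT_LEN], (num_segments, SEGMENT_LEN))
print(segments.shape)  # (num_segments, 59049)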

To run the preprocessing:

python build_mtt.py

After preprocessing, your dataset directory should look like this:

dataset
├── annotations_final.csv
├── mp3
│   ├── 0
│   ├── ...
│   └── f
└── tfrecord
    ├── test-0000-of-0043.seq.tfrecord
    ├── ...
    ├── test-0042-of-0043.seq.tfrecord
    ├── train-0000-of-0152.tfrecord
    ├── ...
    ├── train-0151-of-0152.tfrecord
    ├── val-0000-of-0015.tfrecord
    ├── ...
    └── val-0014-of-0015.tfrecord

18 directories, 211 files
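
To sanity-check the preprocessing output, you can count the serialized examples in the shards without knowing their feature layout. This short snippet is just a convenience check, not part of the repository:

import glob
import tensorflow as tf

# Count the serialized examples across the generated training shards.
files = sorted(glob.glob("dataset/tfrecord/train-*.tfrecord"))
num_examples = sum(1 for _ in tf.data.TFRecordDataset(files))
print(f"{len(files)} training shards, {num_examples} examples")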

Training a model from scratch

To train a model from scratch, run the code:

python train.py

The trained model and logs will be saved under the directory log.

Training a model with options

train.py trains a model using the SE block by default. To see the configurable options, run python train.py -h. You will see:

usage: train.py [-h] [--data-dir PATH] [--train-dir PATH]
                [--block {se,rese,res,basic}] [--no-multi] [--alpha A]
                [--batch-size N] [--momentum M] [--lr LR] [--lr-decay DC]
                [--dropout DO] [--weight-decay WD] [--initial-stage N]
                [--patience N] [--num-lr-decays N] [--num-audios-per-shard N]
                [--num-segments-per-audio N] [--num-read-threads N]

Sample-level CNN Architectures for Music Auto-tagging.

optional arguments:
  -h, --help            show this help message and exit
  --data-dir PATH
  --train-dir PATH      Directory where to write event logs and checkpoints.
  --block {se,rese,res,basic}
                        Block to build a model: {se|rese|res|basic} (default:
                        se).
  --no-multi            Disables multi-level feature aggregation.
  --alpha A             Amplifying ratio of SE block.
  --batch-size N        Mini-batch size.
  --momentum M          Momentum for SGD.
  --lr LR               Learning rate.
  --lr-decay DC         Learning rate decay rate.
  --dropout DO          Dropout rate.
  --weight-decay WD     Weight decay.
  --initial-stage N     Stage to start training.
  --patience N          Stop training stage after #patiences.
  --num-lr-decays N     Number of learning rate decays.
  --num-audios-per-shard N
                        Number of audios per shard.
  --num-segments-per-audio N
                        Number of segments per audio.
  --num-read-threads N  Number of TFRecord readers.

For example, to train a model with the Res block and without multi-level feature aggregation:

python train.py --block res --no-multi
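
For reference, the channel-wise gating that an SE block applies to 1-D convolutional feature maps can be sketched in Keras roughly as below. This is a generic illustration: the bottleneck here uses a plain reduction ratio, whereas train.py exposes an amplifying ratio (--alpha), and the actual block definitions in this repository may be wired differently.

import tensorflow as tf
from tensorflow.keras import layers

def se_gate(x, ratio=16):
    """Generic squeeze-and-excitation gate for (batch, time, channels) feature maps."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling1D()(x)                # squeeze over the time axis
    s = layers.Dense(max(1, channels // ratio), activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)   # per-channel gates in [0, 1]
    s = layers.Reshape((1, channels))(s)                  # broadcast back over time
    return layers.Multiply()([x, s])                      # excite: rescale each channel

Roughly speaking, the ReSE variant combines such a gate with a residual connection around the convolutional layers; see the paper for the exact layouts.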

Downloading pre-trained models

You can download the two best models from the paper:

  • SE+multi (AUC 0.9111): a model using SE blocks and multi-level feature aggregation
  • ReSE+multi (AUC 0.9113): a model using ReSE blocks and multi-level feature aggregation

To download them from command line:

# SE+multi
curl -L -o se-multi-auc_0.9111-tfrmodel.hdf5 https://www.dropbox.com/s/r8qlxbol2p4ods5/se-multi-auc_0.9111-tfrmodel.hdf5?dl=1

# ReSE+multi
curl -L -o rese-multi-auc_0.9113-tfrmodel.hdf5 https://www.dropbox.com/s/fr3y1o3hyha0n2m/rese-multi-auc_0.9113-tfrmodel.hdf5?dl=1

Evaluating a model

To evaluate a model, run:

python eval.py <MODEL_PATH>

For example, if you want to evaluate the downloaded SE+multi model:

python eval.py se-multi-auc_0.9111-tfrmodel.hdf5
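
eval.py takes care of the TFRecord input pipeline and metric computation. Conceptually, though, evaluation boils down to something like the sketch below; the random segments and labels are stand-ins for the real test set, and the segment-to-clip averaging and model-loading details may differ from what eval.py actually does.

import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score

# compile=False avoids restoring the training-time optimizer and loss.
model = tf.keras.models.load_model("se-multi-auc_0.9111-tfrmodel.hdf5", compile=False)

# Random stand-ins for the real test TFRecords (shapes taken from the model).
num_clips, segs_per_clip = 100, 2
segments = np.random.randn(num_clips * segs_per_clip, *model.input_shape[1:]).astype("float32")
labels = np.random.randint(0, 2, size=(num_clips, model.output_shape[-1]))

pred = model.predict(segments, batch_size=64)                         # per-segment tag scores
clip_pred = pred.reshape(num_clips, segs_per_clip, -1).mean(axis=1)   # average per clip

# Tag-wise ROC AUC averaged over the 50 tags.
print("Macro AUC:", roc_auc_score(labels, clip_pred, average="macro"))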

Excitation Analysis

  • If you just want to see the code and plots, open excitation_analysis.ipynb.
  • If you want to analyze the excitations yourself, follow the steps below.

1. Extract the excitations.

python extract_excitations.py <MODEL_PATH>

# For example, to extract excitations from the downloaded `SE+multi` model:
python extract_excitations.py se-multi-auc_0.9111-tfrmodel.hdf5

This will extract excitations from the model and save them as a Pandas DataFrame. The saved file name is excitations.pkl by default.
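
For example, the saved DataFrame can be loaded and inspected directly with pandas (the column layout depends on extract_excitations.py):

import pandas as pd

# excitations.pkl is the default output path of extract_excitations.py.
excitations = pd.read_pickle("excitations.pkl")
print(excitations.shape)
print(excitations.head())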

2. Analyze the extracted excitations.

Run Jupyter notebook:

jupyter notebook

Then open excitation_analysis.ipynb in the Jupyter interface and run it to explore the excitations yourself.

Issues & Questions

If you have any issues or questions, please post them on the issue tracker so that other people can benefit from them too :) Thanks!
