# Instrumental, Genre and Mood Detection in Music with Deep Learning

This tutorial shows how different Convolutional Neural Network architectures are used for:
* Instrumental vs. Vocal Detection:  detecting whether a piece of music is instrumental or contains vocals
* Genre Classification
* Mood Recognition

The data set used is the [MagnaTagATune Dataset](http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset), but a smaller subset of it, with only 1 sample excerpt of each of the original audio files.

It consists of 5405 files, each 30 seconds long. 

The annotations for this dataset contain a multitude of tags, including some that hint at whether the file is instrumental or vocal. (see [Create 2 classes from a list of tags](#Create-2-classes-from-a-list-of-tags) below)

### Requirements

* Python 3.5
* Keras >= 2.1.1
* Tensorflow
* scikit-learn >= 0.18
* Pandas
* Librosa
* MatplotLib

### Table of Contents

This tutorial contains:
* Loading and Preprocessing of Audio files
* Loading class files from CSV and using Label Encoder
* Audio Preprocessing: Generating log Mel spectrograms
* Standardization of Data
* Convolutional Neural Networks
* Train/Test set split

* Instrumental vs. Vocal Detection
* Genre Classification
* Mood Recognition

You can execute the following code blocks by pressing SHIFT+Enter consecutively.

### Download Data

The (subsampled) data set can be downloaded from [here](https://owncloud.tuwien.ac.at/index.php/s/hivOGXKoUQtacbo).

Please unzip it.

Set the path to the unpacked folder in the next box:

In [10]:
import os

DATA_PATH = 'C:/DATA/audio/MagnaTagATune'
AUDIO_PATH = os.path.join(DATA_PATH, 'audio')
META_PATH = os.path.join(DATA_PATH, 'metadata')

#NPZ_FILE = '/home/tlidy/data/mel_spectrogram_segments_96x1366.npz'

In [11]:
# SET GPUs to use:
os.environ["CUDA_VISIBLE_DEVICES"]="0" #"0,1,2,3" 

In [12]:
# General Imports

import argparse
import csv
import datetime
import glob
import math
import sys
import time
import numpy as np
import pandas as pd # Pandas for reading CSV files and easier Data handling in preparation

# Deep Learning

import keras
from keras import optimizers
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Input, Convolution2D, MaxPooling2D, Dense, Dropout, Activation, Flatten, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import ELU

# Machine Learning preprocessing and evaluation

from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

In [41]:
import librosa
import progressbar

## Load the Metadata

The tab-separated file contains pairs of filename TAB class.

In [14]:
csv_file = os.path.join(META_PATH,'annotations_final_subsample.csv')

# we select the last column (-1) as the index column (= filename)
metadata = pd.read_csv(csv_file, index_col=0, sep='\t')
metadata.head(10)

Unnamed: 0_level_0,clip_id,no voice,singer,duet,plucking,hard rock,world,bongos,harpsichord,female singing,...,female singer,rap,metal,hip hop,quick,water,baroque,women,fiddle,english
mp3_path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3,1119,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3,6021,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3,11847,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3,17119,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3,25118,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-06-cyead-378-407.mp3,26533,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-07-telekonology-117-146.mp3,33637,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0/american_bach_soloists-j_s__bach__cantatas_volume_v-01-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_i_sinfonia-117-146.mp3,29,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0/american_bach_soloists-j_s__bach__cantatas_volume_v-02-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_ii_recitative__gleichwie_der_regen_und_schnee-30-59.mp3,5864,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0/american_bach_soloists-j_s__bach__cantatas_volume_v-03-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_iii_recitative_and_litany__mein_gott_hier_wird_mein_herze_sein-88-117.mp3,11264,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# remove the unneeded column "clip_id"
cols = "clip_id"
metadata.drop(cols,axis=1,inplace=True)

metadata.head()

Unnamed: 0_level_0,no voice,singer,duet,plucking,hard rock,world,bongos,harpsichord,female singing,clasical,...,female singer,rap,metal,hip hop,quick,water,baroque,women,fiddle,english
mp3_path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 1) Instrumental vs. Vocal Detection

this is a binary classification task (output decision is between 0 and 1)

### Create 2 classes from a list of tags

There are plenty of "tags" in this data set which hint at wether a track is "vocal" or "instrumental". We group these tags and finally come up with 1 boolean column saying whether a track is "vocal" or "instrumental".

In [16]:
tags_vocal = ['singer', 'female singing', 'female opera', 'male vocal', 'vocals', 'men', 'female', 'female voice', 'voice', 'male voice', 'girl', 'chanting', 'talking', 'choral', 'male singer', 'man singing', 'male opera', 'chant', 'man', 'female vocal', 'male vocals', 'vocal', 'woman', 'woman singing', 'singing', 'female vocals', 'voices', 'choir', 'female singer', 'women', 'choir', 'women']

tags_instrumental = ['instrumental', 'no voice', 'no voices', 'no vocals', 'no vocal', 'no singing', 'no singer']

In [17]:
metadata[tags_vocal].head()

Unnamed: 0_level_0,singer,female singing,female opera,male vocal,vocals,men,female,female voice,voice,male voice,...,woman,woman singing,singing,female vocals,voices,choir,female singer,women,choir,women
mp3_path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# set vocal to True of any of the tags_vocal are 1
gt_vocal = metadata[tags_vocal].any(axis=1)
gt_vocal.head()

mp3_path
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3           False
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3    False
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3             False
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3     False
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3      False
dtype: bool

In [19]:
# set instrumental to True of any of the tags_instrumental are 1
gt_instrumental = metadata[tags_instrumental].any(axis=1)
gt_instrumental.head()

mp3_path
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3           False
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3    False
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3             False
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3     False
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3       True
dtype: bool

<b>We can only use the tag if EITHER instrumental OR vocal is True.</b><br>
If both of them are True or both of them are False, we cannot trust the groundtruth data. Ergo we have to remove these and retain only the others.

In [20]:
retain = np.logical_xor(gt_vocal,gt_instrumental)
retain.head()

mp3_path
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3           False
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3    False
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3             False
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3     False
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3       True
dtype: bool

In [21]:
n_orig = len(gt_vocal)

n_retain = sum(retain)

print("From originally", n_orig, "input examples, we can only retain",n_retain, "trusted ones in our groundtruth")

From originally 3023 input examples, we can only retain 959 trusted ones in our groundtruth


In the end we cut from gt_vocal only the exampls to retain. If they are True they are vocal, if they are False, they are instrumental:

In [22]:
gt_final = gt_vocal[retain]
gt_final.head(9)

mp3_path
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3                                                                                                                   False
0/american_bach_soloists-j_s__bach__cantatas_volume_v-03-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_iii_recitative_and_litany__mein_gott_hier_wird_mein_herze_sein-88-117.mp3     True
0/american_bach_soloists-j_s__bach__cantatas_volume_v-07-weinen_klagen_sorgen_zagen_bwv_12_ii_chorus__weinen_klagen_sorgen_zagen-262-291.mp3                                                  True
0/american_bach_soloists-j_s__bach__cantatas_volume_v-08-weinen_klagen_sorgen_zagen_bwv_12_iii_recitative__wir_mussen_durch_viel_truebsal-0-29.mp3                                            True
0/american_bach_soloists-j_s__bach__cantatas_volume_v-09-weinen_klagen_sorgen_zagen_bwv_12_iv_aria__kreuz_und_krone_sind_verbunden-59-88.mp3                                                  True
0/american_bach_

In [23]:
print(str(sum(gt_final)) + " vocal tracks")

671 vocal tracks


In [24]:
print(str(sum(np.logical_not(gt_final))) + " instrumental tracks")

288 instrumental tracks


<b>Create two lists: one with filenames and one with associated classes</b>

In [25]:
# index in list of strings
filelist = gt_final.index.tolist()
# convert boolean to int and store in other list
classes = (gt_final * 1).tolist()

In [26]:
filelist[0:5]

['d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3',
 '0/american_bach_soloists-j_s__bach__cantatas_volume_v-03-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_iii_recitative_and_litany__mein_gott_hier_wird_mein_herze_sein-88-117.mp3',
 '0/american_bach_soloists-j_s__bach__cantatas_volume_v-07-weinen_klagen_sorgen_zagen_bwv_12_ii_chorus__weinen_klagen_sorgen_zagen-262-291.mp3',
 '0/american_bach_soloists-j_s__bach__cantatas_volume_v-08-weinen_klagen_sorgen_zagen_bwv_12_iii_recitative__wir_mussen_durch_viel_truebsal-0-29.mp3',
 '0/american_bach_soloists-j_s__bach__cantatas_volume_v-09-weinen_klagen_sorgen_zagen_bwv_12_iv_aria__kreuz_und_krone_sind_verbunden-59-88.mp3']

In [27]:
classes[0:5]

[0, 1, 1, 1, 1]

In [28]:
# convert to Numpy array as needed by Keras
classes = np.array(classes)
classes[0:5]

array([0, 1, 1, 1, 1])

## Load the Audio Files

#### Function to analyze audio files and get small or large spectrogram excerpts

In [35]:
?librosa.load

In [42]:
def create_spectrograms(filelist, n_mel_bands=40, frames=80, return_example=False):

    list_spectrograms = [] # spectrograms are put into a list first

    # some FFT parameters
    fft_window_size=512
    fft_overlap = 0.5
    hop_size = int(fft_window_size*(1-fft_overlap))
    segment_size = fft_window_size + (frames-1) * hop_size # segment size for desired # frames

    print("Reading and processing", len(filelist), "audio files")
    
    pbar = progressbar.ProgressBar()
    
    for filename in pbar(filelist):
        
        filepath = os.path.join(AUDIO_PATH, filename)
        wavedata, samplerate = librosa.load(filepath, sr=44100)
        sample_length = wavedata.shape[0]

        # make Mono (in case of multiple channels / stereo)
        if wavedata.ndim > 1:
            wavedata = np.mean(wavedata, 1)

        # take only a segment; choose start position:
        #pos = 0 # beginning
        pos = int(wavedata.shape[0]/2 - segment_size/2) # center minus half segment size
        wav_segment = wavedata[pos:pos+segment_size]

        # 2) Transform to perceptual Mel scale (uses librosa.filters.mel)
        spectrogram = librosa.feature.melspectrogram(wavedata, 
                                                     sr=samplerate, 
                                                     n_fft=fft_window_size,
                                                     hop_length=hop_size,
                                                     n_mels=n_mel_bands)

        # 3) Log 10 transform
        spectrogram = librosa.amplitude_to_db(spectrogram)

        list_spectrograms.append(spectrogram)

    print("\nConverting to big data array...")
    # a list of many spectrograms is made into 1 big array with 3 dimensions
    # + convert the input data to the right data type used by Keras Deep Learning (GPU)
    data = np.array(list_spectrograms, dtype=K.floatx())

    # replace Inf values:
    # as in our preprocessing some files generated an Inf value in the log10 computation, we replace those by 0:

    data[np.isinf(data)] = 0

    print("done.")
    
    # just for illustration purposes, return the last wav file and its spectrogram and audio data
    if return_example:
        return data, wav_segment, spectrogram, segment_size, samplerate, samplewidth
    
    return data

#### Define desired output parameters

In [43]:
# small spectrograms
#n_mel_bands = 40   # y axis
#frames = 80        # x axis

# large  spectrograms
n_mel_bands = 96   # y axis
frames = 683        # x axis

# extra large  spectrograms
#n_mel_bands = 96   # y axis
#frames = 1366        # x axis

In [44]:
# if we saved the audio spectrograms before, we try to load them
load_features = True

# if not, we store audio features for faster reload the next time
save_features = True

FEAT_FILE = os.path.join(DATA_PATH, "spectrograms_instrumental.npz")

In [45]:
if load_features:
    if os.path.exists(FEAT_FILE):
        with np.load(FEAT_FILE) as npz:
            data = npz['data']
            filelist = npz['filenames']
            classes = npz['classes']
        print("Loaded features successfully: " + str(len(filelist)), "files, dimensions:", data.shape)
    else:
        load_features = False

In [None]:
if not load_features:
    #data = create_spectrograms(filelist, n_mel_bands, frames)

    # get some extra data for illustration
    data, wav_segment, spectrogram, segment_size, samplerate, samplewidth = create_spectrograms(filelist, n_mel_bands, frames, return_example=True)
    
    if save_features:
        np.savez(FEAT_FILE, data=data, filenames=filelist, classes=classes)
        print("Features stored to " + FEAT_FILE)

                                                                               N/A% (0 of 959) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--

Reading and processing 959 audio files


 30% (294 of 959) |######                | Elapsed Time: 0:18:48 ETA:   0:16:18

### Show Waveform and Spectrogram (just for illustration)

In [25]:
if not load_features:
    print(samplerate, samplewidth)
    print(spectrogram.shape)
    print(data.shape)
    print("An audio segment is", round(float(segment_size) / samplerate, 2), "seconds long")

In [26]:
if not load_features:
    print(wav_segment)

In [27]:
# you can skip this if you do not have matplotlib installed

if not load_features:
    import matplotlib.pyplot as plt
    %matplotlib inline 

    # show 1 sec wave segment
    plt.plot(wav_segment)

In [28]:
# show spectrogram

if not load_features:
    fig = plt.imshow(spectrogram, origin='lower', aspect='auto')
    fig.set_cmap('jet')
    fig.axes.get_xaxis().set_visible(False)
    fig.axes.get_yaxis().set_visible(False)

## Standardization

<b>Always standardize</b> the data before feeding it into the Neural Network!

We use <b>Zero-mean Unit-variance standardization</b> (also known as Z-score normalization).
Here, we use <b>attribute-wise standardization</b>, i.e. each pixel is standardized individually, as opposed to computing a single mean and single standard deviation of all values.

('Flat' standardization would also be possible, but we have seen benefits of attribut-wise standardization in our experiments).

We use the StandardScaler from the scikit-learn package for our purpose.
As it works typically on vector data, we have to vectorize (i.e. reshape) our matrices first.

In [29]:
def standardize(data):
    # vectorize before standardization (cause scaler can't do it in that format)
    N, ydim, xdim = data.shape
    data = data.reshape(N, xdim*ydim)

    # standardize
    scaler = preprocessing.StandardScaler()
    data = scaler.fit_transform(data)

    # reshape to original shape
    return data.reshape(N, ydim, xdim)

In [30]:
data = standardize(data)

# Convolutional Neural Networks

A Convolutional Neural Network (ConvNet or CNN) is a type of (deep) Neural Network that is well-suited for 2D axes data, such as images or spectrograms, as it is optimized for learning from spatial proximity. Its core elements are 2D filter kernels which essentially learn the weights of the Neural Network, and downscaling functions such as Max Pooling.

A CNN can have one or more Convolution layers, each of them having an arbitrary number of N filters (which define the depth of the CNN layer), following typically by a pooling step, which aggregates neighboring pixels together and thus reduces the image resolution by retaining only the maximum values of neighboring pixels.

## Preparing the Data

### Adding the channel

As CNNs were initially made for image data, we need to add a dimension for the color channel to the data. RGB images typically have a 3rd dimension with the color. 

<b>Spectrograms, however, are considered like greyscale images, as in the previous tutorial.
Likewise we need to add an extra dimension for compatibility with the CNN implementation.</b>

For greyscale images, we add the number 1 as the depth of the additional dimension of the input shape (for RGB color images, the number of channels is 3).

<i>Note on Tensorflow vs. Theano:</i>

In Theano, traditionally the color channel was the <b>first</b> dimension in the image shape. 
In Tensorflow, the color channel is the <b>last</b> dimension in the image shape. 

This can be configured in ~/.keras/keras.json: "image_dim_ordering": "th" or "tf" (for Theano or Tensorflow) *or* with "image_data_format" set to "channels_first" or "channels_last".

Tensorflow is now the default image ordering for Kears ("tf" and/or "channels_last").
To be on the safe side, we added the if statement below.

In [31]:
keras.backend.image_data_format()

'channels_last'

In [32]:
def add_channel(data, n_channels=1):
    # n_channels: 1 for grey-scale, 3 for RGB, but usually already present in the data
    
    N, ydim, xdim = data.shape

    if keras.backend.image_data_format() == 'channels_last':  # TENSORFLOW
        # Tensorflow ordering (~/.keras/keras.json: "image_dim_ordering": "tf")
        data = data.reshape(N, ydim, xdim, n_channels)
    else: # THEANO
        # Theano ordering (~/.keras/keras.json: "image_dim_ordering": "th")
        data = data.reshape(N, n_channels, ydim, xdim)
        
    return data

In [33]:
data = add_channel(data, n_channels=1)
data.shape

(959, 96, 683, 1)

In [34]:
# we store the new shape of the images in the 'input_shape' variable.
# take all dimensions except the 0th one (which is the number of files)
input_shape = data.shape[1:]  
input_shape

(96, 683, 1)

### Train & Test Set Split

We split the original full data set into two parts: Train Set (75%) and Test Set (25%).

Note: 
For demo purposes we use only 1 split here. A better way to do it is to use **Cross-Validation**, doing the split multiple times, iterating training and testing over the splits and averaging the results.

In [35]:
testset_size = 0.25 # % portion of whole data set to keep for testing, i.e. 75% is used for training

In [36]:
# Stratified Split retains the class balance in both sets

splitter = StratifiedShuffleSplit(n_splits=1, test_size=testset_size, random_state=0)
splits = splitter.split(data, classes)

for train_index, test_index in splits:
    print("TRAIN INDEX:", train_index)
    print("TEST INDEX:", test_index)
    train_set = data[train_index]
    test_set = data[test_index]
    train_classes = classes[train_index]
    test_classes = classes[test_index]
# Note: this for loop is only executed once if n_splits==1

TRAIN INDEX: [135 957 350 397 651 154 704 255 811 881 937 921 229 635 839 643   2 619
 299 225 388 400 717 444 736 673 675 548 731 850 401 112 608  71 791  76
 281 721 545 538 155 753 206  41 527 856 491 745 366 756 720 227 218 836
 103 711  72 878  42 144 447 349 589 710 362 203 376 864 782 226 302 220
 379 773 256 555 476 231  49  51 601 458 744 623 906 685 136 730 115 781
 747 628 869 441 148 728 264 934  37 479 599 237 134 493 557 265 687  88
 363 884 179 201 195 578 166 342 797 339 842 284 924 416 145 382  22  80
 851 927  60 952 912 292 378 634 287 263 181 316 954  53 141 928 278 348
 563 890 735 553  93 385 564 294 824 371 841 585 198 757 700 876   3 558
 953 748 678 903 544 920 784  16 832 466 448 883 190 187 200 483 650 654
 128 442  75 752 914 894 776 386 100  30 178   1 249 420 129 101 497 429
  61 888 450 396 480 172 923 885 694 693 916 373 909 273 624 621 331  55
  39 570 104 234 785 672  86 270 475  54  95 569 146  59 603 521 543 568
 403 726  47 702 577  94  12 216 267 7

In [37]:
print(train_set.shape)
print(test_set.shape)

(719, 96, 683, 1)
(240, 96, 683, 1)


In [38]:
print("Class Counts: Class 0:", sum(train_classes==0), "Class 1:", sum(train_classes))

Class Counts: Class 0: 216 Class 1: 503


# Creating CNN Models in Keras

## Compact CNN

This is a 5 layer Convolutional Neural Network inspired and adapted from Keunwoo Choi (https://github.com/keunwoochoi/music-auto_tagging-keras)

In [39]:
data.shape

(959, 96, 683, 1)

In [40]:
def CompactCNN(input_shape, nb_conv, nb_filters, n_mels, normalize, nb_hidden, dense_units, 
               output_shape, activation, dropout, multiple_segments=False, graph_model=False, input_tensor=None):
    
    melgram_input = Input(shape=input_shape)

    if n_mels >= 256:
        poolings = [(2, 4), (4, 4), (4, 5), (2, 4), (4, 4)]
    elif n_mels >= 128:
        poolings = [(2, 4), (4, 4), (2, 5), (2, 4), (4, 4)]
    elif n_mels >= 96:
        poolings = [(2, 4), (3, 4), (2, 5), (2, 4), (4, 4)]
    elif n_mels >= 72:
        poolings = [(2, 4), (3, 4), (2, 5), (2, 4), (3, 4)]
    elif n_mels >= 64:
        poolings = [(2, 4), (2, 4), (2, 5), (2, 4), (4, 4)]

    # Determine input axis
    if keras.backend.image_dim_ordering() == 'th':
        channel_axis = 1
        freq_axis = 2
        time_axis = 3
    else:
        channel_axis = 3
        freq_axis = 1
        time_axis = 2
            
    # Input block
    #x = BatchNormalization(axis=time_axis, name='bn_0_freq')(melgram_input)
        
    if normalize == 'batch':
        x = BatchNormalization(axis=freq_axis, name='bn_0_freq')(melgram_input)
    elif normalize in ('data_sample', 'time', 'freq', 'channel'):
        x = Normalization2D(normalize, name='nomalization')(melgram_input)
    elif normalize in ('no', 'False'):
        x = melgram_input

    # Conv block 1
    x = Convolution2D(nb_filters[0], (3, 3), padding='same')(x)
    x = BatchNormalization(axis=channel_axis, name='bn1')(x)
    x = ELU()(x)
    x = MaxPooling2D(pool_size=poolings[0], name='pool1')(x)
        
    # Conv block 2
    x = Convolution2D(nb_filters[1], (3, 3), padding='same')(x)
    x = BatchNormalization(axis=channel_axis, name='bn2')(x)
    x = ELU()(x)
    x = MaxPooling2D(pool_size=poolings[1], name='pool2')(x)
        
    # Conv block 3
    x = Convolution2D(nb_filters[2], (3, 3), padding='same')(x)
    x = BatchNormalization(axis=channel_axis, name='bn3')(x)
    x = ELU()(x)
    x = MaxPooling2D(pool_size=poolings[2], name='pool3')(x)
    
    # Conv block 4
    if nb_conv > 3:        
        x = Convolution2D(nb_filters[3], (3, 3), padding='same')(x)
        x = BatchNormalization(axis=channel_axis, name='bn4')(x)
        x = ELU()(x)   
        x = MaxPooling2D(pool_size=poolings[3], name='pool4')(x)
        
    # Conv block 5
    if nb_conv == 5:
        x = Convolution2D(nb_filters[4], (3, 3), padding='same')(x)
        x = BatchNormalization(axis=channel_axis, name='bn5')(x)
        x = ELU()(x)
        x = MaxPooling2D(pool_size=poolings[4], name='pool5')(x)

    # Flatten the outout of the last Conv Layer
    x = Flatten()(x)
      
    if nb_hidden == 1:
        x = Dropout(dropout)(x)
        x = Dense(dense_units, activation='relu')(x)
    elif nb_hidden == 2:
        x = Dropout(dropout)(x)
        x = Dense(dense_units[0], activation='relu')(x)
        x = Dropout(dropout)(x)
        x = Dense(dense_units[1], activation='relu')(x) 
    else:
        raise ValueError("More than 2 hidden units not supported at the moment.")
    
    # Output Layer
    x = Dense(output_shape, activation=activation, name = 'output')(x)
    
    # Create model
    model = Model(melgram_input, x)
    
    return model

### Set model parameters



In [41]:
# number of Convolutional Layers
nb_conv_layers = 4

# number of Filters in each layer
nb_filters = [64,64,64,128,128]

# number of hidden layers at the end of the model
nb_hidden = 1 # 2

# how many neurons in each hidden layer
dense_units = 128 #[128,56]

# how many output units
# IN A BINARY CLASSIFICATION TASK with 2 possible outputs, 1 single output unit is sufficent (deciding between 0 and 1)
output_shape = 1

# which activation function to use for OUTPUT layer
# IN A BINARY CLASSIFICATION TASK sigmoid activation is the right choice (activating betwee 0 and 1)
output_activation = 'sigmoid'

# which type of normalization
normalization = 'batch'

# droupout
dropout = 0.2

In [42]:
model = CompactCNN(input_shape, nb_conv = nb_conv_layers, nb_filters= nb_filters, n_mels = 96, 
                           normalize=normalization, 
                           nb_hidden = nb_hidden, dense_units = dense_units, 
                           output_shape = output_shape, activation = output_activation, 
                           dropout = dropout)

In [43]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 96, 683, 1)        0         
_________________________________________________________________
bn_0_freq (BatchNormalizatio (None, 96, 683, 1)        384       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 96, 683, 64)       640       
_________________________________________________________________
bn1 (BatchNormalization)     (None, 96, 683, 64)       256       
_________________________________________________________________
elu_1 (ELU)                  (None, 96, 683, 64)       0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 48, 170, 64)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 48, 170, 64)       36928     
__________

## Training Setup

In [44]:
# Loss

# the loss for a binary classification task is BINARY crossentropy
loss = 'binary_crossentropy' 

In [45]:
# Optimizers

# simple case:
# Stochastic Gradient Descent
#optimizer = 'sgd' 

# advanced:
sgd = optimizers.SGD(momentum=0.9, nesterov=True)
rmsprop = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.01)#lr=0.001 decay = 0.03
adagrad = optimizers.Adagrad(lr=0.01, epsilon=1e-08, decay=0.0)

# We use mostly ADAM
adam = optimizers.Adam(lr=0.003, beta_1=0.9, beta_2=0.999, epsilon=1e-07, decay=0.01)
nadam = optimizers.Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-07, schedule_decay=0.004)

# choose
optimizer = adam

In [46]:
# Metrics

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

metrics = ['accuracy', precision, recall]

In [47]:
# Other
batch_size = 32 

epochs = 30

validation_split=0.1 

#n_folds = 5
random_seed = 0

callbacks = None

### Tensorboard (optional)

In [48]:
from keras.callbacks import TensorBoard

In [49]:
#home_dir = os.getenv("HOME")

#TB_LOGDIR = os.path.join(home_dir, "./tensorboard")

TB_LOGDIR = "./tensorboard"

experiment_name = "instrumental"

tb_logdir_cur = os.path.join(TB_LOGDIR, experiment_name)

In [50]:
# OPTIONAL
# new tensorboard callback at each training
# tensorboard_run_id = "Vocal_magna_2seg_adam_compact_128fbis_128h"
# tb_logdir = "%s/%s_fold%d %s" %(tb_logdir, tensorboard_run_id, fold, strftime("%Y-%m-%d %H:%M:%S", localtime()))

In [51]:
print("Execute the following in a terminal:\n")
print("tensorboard --logdir=" + TB_LOGDIR)

Execute the following in a terminal:

tensorboard --logdir=/home/schindler/tensorboard


In [52]:
# initialize TensorBoard in Python
tensorboard = TensorBoard(log_dir = tb_logdir_cur)

# + add to callbacks
callbacks = [tensorboard]

Then open Tensorboard in browser:

http://localhost:6006

## Training

In [53]:
# Summary of Training options

print(loss)
print(optimizer)
print(metrics)
print("Batch size:", batch_size, "Epochs:", epochs)

binary_crossentropy
<keras.optimizers.Adam object at 0x7f6850fa47b8>
['accuracy', <function precision at 0x7f68f5674598>, <function recall at 0x7f68f5674510>]
Batch size: 32 Epochs: 30


In [54]:
# COMPILE MODEL

model.compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [55]:
# past_epochs is only for the case that we execute the next code box multiple times (so that Tensorboard is displaying properly)
past_epochs = 0

In [56]:
# START TRAINING

history = model.fit(train_set, train_classes, 
                     validation_split=validation_split,
                     #validation_data=(X_test,y_test), 
                     epochs=epochs, 
                     initial_epoch=past_epochs,
                     batch_size=batch_size, 
                     callbacks=callbacks
                     )

past_epochs += epochs

Train on 647 samples, validate on 72 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### Verifying Accuracy on Test Set

In [57]:
# compute probabilities for the classes (= get outputs of output layer)
test_pred_prob = model.predict(test_set)
test_pred_prob[0:10]

array([[1.0000000e+00],
       [9.2034358e-01],
       [8.6222380e-01],
       [9.9999821e-01],
       [8.7224616e-04],
       [1.5277641e-02],
       [9.9998999e-01],
       [9.9984491e-01],
       [9.9999988e-01],
       [5.1573163e-01]], dtype=float32)

In [58]:
# to get the predicted class we have to round 0 < 0.5 > 1
test_pred = np.round(test_pred_prob)
test_pred[0:10]

array([[1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.]], dtype=float32)

In [59]:
# get final Accuracy
accuracy_score(test_classes, test_pred)

0.8708333333333333

# 2) Genre Classification

this is a single-label / multi-class task (multiple categories, but decision needs to be for 1 of them)

## Prepare Metadata

we start with the original metadata

In [60]:
metadata.head()

Unnamed: 0_level_0,no voice,singer,duet,plucking,hard rock,world,bongos,harpsichord,female singing,clasical,...,female singer,rap,metal,hip hop,quick,water,baroque,women,fiddle,english
mp3_path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
# check which columns are available
tags_all = metadata.columns.tolist()
print(tags_all)

['no voice', 'singer', 'duet', 'plucking', 'hard rock', 'world', 'bongos', 'harpsichord', 'female singing', 'clasical', 'sitar', 'chorus', 'female opera', 'male vocal', 'vocals', 'clarinet', 'heavy', 'silence', 'beats', 'men', 'woodwind', 'funky', 'no strings', 'chimes', 'foreign', 'no piano', 'horns', 'classical', 'female', 'no voices', 'soft rock', 'eerie', 'spacey', 'jazz', 'guitar', 'quiet', 'no beat', 'banjo', 'electric', 'solo', 'violins', 'folk', 'female voice', 'wind', 'happy', 'ambient', 'new age', 'synth', 'funk', 'no singing', 'middle eastern', 'trumpet', 'percussion', 'drum', 'airy', 'voice', 'repetitive', 'birds', 'space', 'strings', 'bass', 'harpsicord', 'medieval', 'male voice', 'girl', 'keyboard', 'acoustic', 'loud', 'classic', 'string', 'drums', 'electronic', 'not classical', 'chanting', 'no violin', 'not rock', 'no guitar', 'organ', 'no vocal', 'talking', 'choral', 'weird', 'opera', 'soprano', 'fast', 'acoustic guitar', 'electric guitar', 'male singer', 'man singing',

In [62]:
len(tags_all)

188

In [63]:
genres = ['classical', 'rock', 'pop', 'jazz', 'techno'] # 'electronic', ## too little data: , 'reggae', 'metal', 'hip hop']

n_genres = len(genres)
n_genres

5

In [64]:
metadata[genres]

Unnamed: 0_level_0,classical,rock,pop,jazz,techno
mp3_path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
d/ambient_teknology-the_all_seeing_eye_project-01-cyclops-262-291.mp3,0,0,0,0,0
d/ambient_teknology-the_all_seeing_eye_project-02-all_seeing_eye-175-204.mp3,0,0,0,0,1
d/ambient_teknology-the_all_seeing_eye_project-03-black-175-204.mp3,0,0,0,0,1
d/ambient_teknology-the_all_seeing_eye_project-04-confusion_says-88-117.mp3,0,0,0,0,1
d/ambient_teknology-the_all_seeing_eye_project-05-the_beholder-291-320.mp3,0,0,0,0,1
d/ambient_teknology-the_all_seeing_eye_project-06-cyead-378-407.mp3,0,0,0,0,1
d/ambient_teknology-the_all_seeing_eye_project-07-telekonology-117-146.mp3,0,0,0,0,1
0/american_bach_soloists-j_s__bach__cantatas_volume_v-01-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_i_sinfonia-117-146.mp3,1,0,0,0,0
0/american_bach_soloists-j_s__bach__cantatas_volume_v-02-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_ii_recitative__gleichwie_der_regen_und_schnee-30-59.mp3,0,0,0,0,0
0/american_bach_soloists-j_s__bach__cantatas_volume_v-03-gleichwie_der_regen_und_schnee_vom_himmel_fallt_bwv_18_iii_recitative_and_litany__mein_gott_hier_wird_mein_herze_sein-88-117.mp3,0,0,0,0,0


In [65]:
metadata[genres].sum()

classical    608
rock         281
pop          125
jazz          64
techno       265
dtype: int64

In [66]:
metadata[genres].shape

(3023, 5)

In [67]:
# for the single-label genre task, we only retain tracks that have EXACTLY 1 genre assigned in groundtruth
idx = metadata[genres].sum(axis=1) == 1

In [68]:
genre_metadata = metadata.loc[idx,genres]
genre_metadata.shape

(1168, 5)

In [69]:
genre_metadata.sum()

classical    604
rock         218
pop           66
jazz          48
techno       232
dtype: int64

In [70]:
# classes needs to be a "1-hot encoded" numpy array (which our groundtruth already is! we just convert pandas to numpy)
classes = genre_metadata.values
classes

array([[0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       ...,
       [0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]])

In [71]:
filelist = genre_metadata.index.tolist()

## Load Audio Spectrograms

based on the new filelist needed for the genre task 

we keep n_mel_bands and frames the same as before

In [72]:
# if we saved the audio spectrograms before, we try to load them
load_features = True

# if not, we store audio features for faster reload the next time
save_features = True

FEAT_FILE = os.path.join(DATA_PATH, "spectrograms_genres.npz")

In [73]:
if load_features:
    if os.path.exists(FEAT_FILE):
        with np.load(FEAT_FILE) as npz:
            data = npz['data']
            filelist = npz['filenames']
            classes = npz['classes']
        print("Loaded features successfully: " + str(len(filelist)), "files, dimensions:", data.shape)
    else:
        load_features = False

In [74]:
if not load_features:
    data = create_spectrograms(filelist, n_mel_bands, frames)

    if save_features:
        np.savez(FEAT_FILE, data=data, filenames=filelist, classes=classes)
        print("Features stored to " + FEAT_FILE)

Reading and processing 1168 audio files
....................................................................................................................................................................................................................................................................................................................................................................................................................................



............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Converting to big data array...
done.
Features stored to /home/schindler/tutorials/mlprague2018/data//MagnaTagATune/spectrograms_genres.npz


In [75]:
data.shape

(1168, 96, 683)

In [76]:
# standardize the data (see above)
data = standardize(data)
data.shape

(1168, 96, 683)

In [77]:
# add color channel (see above)
data = add_channel(data, n_channels=1)
data.shape

(1168, 96, 683, 1)

In [78]:
# input_shape: we store the new shape of the images in the 'input_shape' variable.
# take all dimensions except the 0th one (which is the number of files)
input_shape = data.shape[1:]  
input_shape

(96, 683, 1)

### Train & Test Set Split

We split the original full data set into two parts: Train Set (75%) and Test Set (25%).

In [79]:
testset_size = 0.25 # % portion of whole data set to keep for testing, i.e. 75% is used for training

In [80]:
# Stratified Split retains the class balance in both sets

splitter = StratifiedShuffleSplit(n_splits=1, test_size=testset_size, random_state=0)
splits = splitter.split(data, classes)

for train_index, test_index in splits:
    train_set = data[train_index]
    test_set = data[test_index]
    train_classes = classes[train_index]
    test_classes = classes[test_index]
# Note: this for loop is only executed once if n_splits==1

In [81]:
print(train_set.shape)
print(test_set.shape)

(876, 96, 683, 1)
(292, 96, 683, 1)


## Model and Training Parameters

we use the same model as for Instrumental vs. Vocal above

with a few changes in the Training parameters

### Change #1: Loss

In [82]:
# the loss for a single label classification task is CATEGORICAL crossentropy
loss = 'categorical_crossentropy' 

### Change #2: Output units and activation

In [83]:
# how many output units
# IN A SINGLE LABEL MULTI-CLASS TASK with N classes, we need N output units
output_shape = n_genres

# which activation function to use for OUTPUT layer
# IN A SINGLE LABEL MULTI-CLASS TASK with N classes we use softmax activation to BALANCE best between the classes 
# and find the best decision for ONE class
output_activation = 'softmax'


### TensorBoard setup

In [84]:
experiment_name = "genres"

tb_logdir_cur = os.path.join(TB_LOGDIR, experiment_name)

# initialize TensorBoard in Python
tensorboard = TensorBoard(log_dir = tb_logdir_cur)

# + add to callbacks
callbacks = [tensorboard]

### Rest of Parameters

stay essentially the same (or similar)

In [85]:
# Optimizer
optimizer = adam

batch_size = 32 

epochs = 30

validation_split=0.1 

random_seed = 0

## Training

In [86]:
# Summary of Training options

print(loss)
print(optimizer)
print(metrics)
print("Batch size:", batch_size, "Epochs:", epochs)

categorical_crossentropy
<keras.optimizers.Adam object at 0x7f6850fa47b8>
['accuracy', <function precision at 0x7f68f5674598>, <function recall at 0x7f68f5674510>]
Batch size: 32 Epochs: 30


In [87]:
model = CompactCNN(input_shape, nb_conv = nb_conv_layers, nb_filters= nb_filters, n_mels = 96, 
                           normalize=normalization, 
                           nb_hidden = nb_hidden, dense_units = dense_units, 
                           output_shape = output_shape, activation = output_activation, 
                           dropout = dropout)

In [88]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 96, 683, 1)        0         
_________________________________________________________________
bn_0_freq (BatchNormalizatio (None, 96, 683, 1)        384       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 96, 683, 64)       640       
_________________________________________________________________
bn1 (BatchNormalization)     (None, 96, 683, 64)       256       
_________________________________________________________________
elu_5 (ELU)                  (None, 96, 683, 64)       0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 48, 170, 64)       0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 48, 170, 64)       36928     
__________

In [89]:
# COMPILE MODEL

model.compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [90]:
# past_epochs is only for the case that we execute the next code box multiple times (so that Tensorboard is displaying properly)
past_epochs = 0

In [91]:
# START TRAINING

history = model.fit(train_set, train_classes, 
                     validation_split=validation_split,
                     #validation_data=(X_test,y_test), 
                     epochs=epochs, 
                     initial_epoch=past_epochs,
                     batch_size=batch_size, 
                     callbacks=callbacks
                     )

past_epochs += epochs

Train on 788 samples, validate on 88 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### Verifying Accuracy on Test Set

In [92]:
# compute probabilities for the classes (= get outputs of output layer)
test_pred_prob = model.predict(test_set)
test_pred_prob

array([[9.9985695e-01, 1.4222712e-07, 4.9297563e-05, 8.0213540e-06,
        8.5580294e-05],
       [9.9996293e-01, 1.6717519e-06, 8.6889086e-06, 1.8344124e-05,
        8.4001213e-06],
       [1.0392587e-02, 7.2592814e-03, 9.6788621e-01, 1.1517332e-02,
        2.9445307e-03],
       ...,
       [8.5956073e-01, 2.1771318e-04, 3.1769183e-02, 1.0453298e-01,
        3.9193849e-03],
       [9.9992311e-01, 2.6107367e-07, 5.7327125e-05, 6.6696407e-06,
        1.2679926e-05],
       [9.9958521e-01, 1.8424938e-06, 3.3884449e-04, 4.8097525e-05,
        2.6122130e-05]], dtype=float32)

In [93]:
# to get the predicted class, we take the ARG MAX of the row vectors 
test_pred = np.argmax(test_pred_prob, axis=1)
test_pred

array([0, 0, 2, 1, 0, 0, 1, 4, 0, 0, 2, 1, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0,
       0, 0, 4, 4, 4, 4, 2, 4, 1, 0, 2, 0, 4, 0, 4, 4, 0, 0, 4, 1, 4, 1,
       4, 0, 0, 3, 4, 0, 4, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 4,
       4, 4, 0, 1, 1, 4, 4, 1, 0, 1, 1, 0, 0, 1, 0, 4, 2, 4, 4, 4, 0, 4,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1,
       4, 2, 1, 0, 1, 4, 0, 2, 0, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 1, 0, 4,
       1, 2, 0, 0, 4, 1, 0, 4, 0, 0, 1, 4, 0, 2, 0, 3, 0, 4, 0, 0, 4, 0,
       0, 0, 4, 1, 4, 0, 4, 1, 0, 0, 4, 0, 0, 1, 4, 4, 0, 4, 0, 4, 0, 4,
       0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 2, 4, 0, 1, 1, 1, 2, 0, 0, 1, 0,
       4, 0, 0, 4, 0, 1, 0, 4, 0, 4, 0, 0, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0,
       4, 0, 4, 0, 3, 4, 0, 4, 4, 0, 3, 0, 0, 1, 0, 0, 0, 1, 4, 0, 2, 0,
       4, 4, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 2, 4, 1, 4, 0, 0, 0, 4, 0,
       0, 0, 4, 4, 0, 0, 1, 0, 0, 4, 0, 0, 0, 1, 0, 4, 0, 0, 0, 0, 4, 0,
       4, 0, 3, 0, 0, 0])

In [94]:
# do the same for groundtruth
test_gt = np.argmax(test_classes, axis=1)
test_gt

array([0, 0, 2, 1, 0, 0, 1, 1, 0, 0, 2, 1, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0,
       0, 0, 4, 4, 4, 1, 2, 1, 1, 0, 1, 0, 4, 0, 1, 4, 0, 0, 4, 1, 4, 1,
       2, 0, 0, 2, 4, 0, 4, 1, 0, 0, 4, 0, 0, 3, 0, 0, 0, 1, 4, 0, 0, 2,
       2, 4, 3, 1, 1, 2, 4, 1, 0, 1, 1, 0, 0, 1, 0, 4, 2, 4, 4, 4, 0, 4,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1,
       4, 2, 1, 0, 1, 2, 0, 3, 0, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 1, 0, 4,
       2, 1, 0, 0, 4, 1, 0, 2, 0, 0, 1, 4, 0, 1, 0, 3, 0, 4, 0, 0, 4, 0,
       0, 0, 4, 1, 4, 0, 4, 1, 0, 0, 4, 0, 3, 1, 4, 4, 0, 4, 0, 4, 0, 3,
       0, 0, 1, 0, 4, 0, 0, 0, 0, 1, 0, 2, 4, 0, 1, 1, 1, 3, 0, 0, 1, 0,
       4, 0, 0, 4, 0, 1, 0, 4, 0, 4, 0, 0, 1, 0, 0, 0, 0, 1, 2, 2, 1, 0,
       4, 0, 4, 0, 3, 1, 0, 4, 4, 3, 3, 0, 0, 1, 0, 0, 0, 1, 4, 2, 1, 0,
       4, 3, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 4, 4, 1, 4, 0, 0, 0, 1, 0,
       0, 0, 4, 4, 0, 0, 1, 0, 0, 4, 0, 0, 0, 1, 0, 4, 0, 0, 0, 0, 4, 0,
       1, 0, 3, 4, 0, 0])

In [95]:
# get final Accuracy
accuracy_score(test_gt, test_pred)

0.8835616438356164

# 3) Mood Recognition

this is a multi-label classification task (multiple categories to detect, any of them can be 0 or 1)

In [96]:
a = metadata.sum()

In [97]:
pd.set_option('display.max_rows', len(a))
print(a)
pd.reset_option('display.max_rows')


no voice             71
singer               21
duet                  8
plucking              7
hard rock            21
world                 8
bongos                8
harpsichord         159
female singing        6
clasical              3
sitar                67
chorus               42
female opera         11
male vocal          133
vocals              168
clarinet              4
heavy                15
silence               6
beats                51
men                  17
woodwind              4
funky                22
no strings           10
chimes                7
foreign              32
no piano             37
horns                 7
classical           608
female              178
no voices             1
soft rock             7
eerie                 4
spacey                7
jazz                 64
guitar              683
quiet                86
no beat               5
banjo                31
electric             31
solo                130
violins              34
folk            

## Adapt Metadata

In [98]:
# we select 5 moods from the original list of tags 
moods = ['funky', 'quiet', 'mellow','calm', 'sad'] ## too little data: 'happy','scary']

In [99]:
# and check the data on it
#metadata[moods]

In [100]:
metadata[moods].sum()

funky     22
quiet     86
mellow    10
calm      13
sad       12
dtype: int64

In [101]:
# for the single-label genre task, we only retain tracks that have AT LEAST 1 of these moods assigned in groundtruth
idx = metadata[moods].sum(axis=1) >= 1

In [102]:
mood_metadata = metadata.loc[idx,moods]
mood_metadata.shape

(136, 5)

In [103]:
# double check
mood_metadata.sum()

funky     22
quiet     86
mellow    10
calm      13
sad       12
dtype: int64

In [104]:
mood_metadata

Unnamed: 0_level_0,funky,quiet,mellow,calm,sad
mp3_path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0/american_bach_soloists-j_s__bach__cantatas_volume_v-09-weinen_klagen_sorgen_zagen_bwv_12_iv_aria__kreuz_und_krone_sind_verbunden-59-88.mp3,0,1,0,0,0
7/american_bach_soloists-j_s__bach__mass_in_b_minor_cd2-05-crucifixus-88-117.mp3,0,1,0,0,0
7/american_bach_soloists-j_s__bach__mass_in_b_minor_cd2-07-et_in_spiritum_sanctum_dominum-291-320.mp3,0,1,0,0,0
7/american_bach_soloists-j_s__bach__mass_in_b_minor_cd2-14-agnus_dei-0-29.mp3,0,1,0,0,0
0/american_bach_soloists-j_s__bach__transcriptions_of_italian_music-08-psalm_51_tilge_hochster_meine_sunden_v_largo_verses_5_and_6-88-117.mp3,0,1,0,0,0
0/american_bach_soloists-joseph_haydn__masses-18-agnus_dei__adagio-30-59.mp3,0,1,0,0,0
f/american_baroque-the_four_seasons_by_vivaldi-04-concerto_no_2_in_g_minor_rv_315_summer__allegro_non_molto-204-233.mp3,0,0,0,0,1
a/asteria-soyes_loyal-10-of_a_rose_singe_we_lute_anon-88-117.mp3,0,1,0,0,0
8/beth_quist-lucidity-02-eli-0-29.mp3,0,1,0,0,0
a/bjorn_fogelberg-karooshi_porn-07-wave-204-233.mp3,0,0,0,1,0


In [105]:
# classes needs to be a MULTI-HOT encoded" numpy array 
# (which our groundtruth already is! we just convert pandas to numpy)
classes = mood_metadata.values
classes

array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0,

In [106]:
classes.sum(axis=0)

array([22, 86, 10, 13, 12])

In [107]:
filelist = mood_metadata.index.tolist()

## Load Audio Spectrograms

based on the new filelist needed for the mood task 

we keep n_mel_bands and frames the same as before

In [108]:
# if we saved the audio spectrograms before, we try to load them
load_features = True

# if not, we store audio features for faster reload the next time
save_features = True

FEAT_FILE = os.path.join(DATA_PATH, "spectrograms_moods.npz")

In [109]:
if load_features:
    if os.path.exists(FEAT_FILE):
        with np.load(FEAT_FILE) as npz:
            data = npz['data']
            filelist = npz['filenames']
            classes = npz['classes']
        print("Loaded features successfully: " + str(len(filelist)), "files, dimensions:", data.shape)
    else:
        load_features = False

In [110]:
if not load_features:
    data = create_spectrograms(filelist, n_mel_bands, frames)

    if save_features:
        np.savez(FEAT_FILE, data=data, filenames=filelist, classes=classes)
        print("Features stored to " + FEAT_FILE)

Reading and processing 136 audio files
........



................................................................................................................................
Converting to big data array...
done.
Features stored to /home/schindler/tutorials/mlprague2018/data//MagnaTagATune/spectrograms_moods.npz


In [111]:
data.shape

(136, 96, 683)

In [112]:
# standardize the data (see above)
data = standardize(data)
data.shape

(136, 96, 683)

In [113]:
# add color channel (see above)
data = add_channel(data, n_channels=1)
data.shape

(136, 96, 683, 1)

In [114]:
# input_shape: we store the new shape of the images in the 'input_shape' variable.
# take all dimensions except the 0th one (which is the number of files)
input_shape = data.shape[1:]  
input_shape

(96, 683, 1)

### Train & Test Set Split

We split the original full data set into two parts: Train Set (75%) and Test Set (25%).

### Change: We cannot use Stratified Split here as it does not make sense for a MULTI-LABEL TASK!

In [115]:
# use ShuffleSplit INSTEAD OF StratifiedShuffleSplit 

splitter = ShuffleSplit(n_splits=1, test_size=testset_size, random_state=0)
splits = splitter.split(data, classes)

for train_index, test_index in splits:
    train_set = data[train_index]
    test_set = data[test_index]
    train_classes = classes[train_index]
    test_classes = classes[test_index]
# Note: this for loop is only executed once if n_splits==1

In [116]:
print(train_set.shape)
print(test_set.shape)

(102, 96, 683, 1)
(34, 96, 683, 1)


## Model and Training Parameters

we use the same model as for Instrumental vs. Vocal and Genres above

with a few changes in the Training parameters

### Change #1: Loss

In [117]:
# the loss for a MULTI label classification task is BINARY crossentropy
loss = 'binary_crossentropy' 

### Change #2: Output units and activation

In [118]:
# how many output units
# IN A SINGLE-LABEL MULTI-CLASS or MULTI-LABEL TASK with N classes, we need N output units

output_shape = n_genres

# which activation function to use for OUTPUT layer
# IN A MULTI-LABEL TASK with N classes we use SIGMOID activation same as with a BINARY task
# as EACH of the classes can be 0 or 1 

output_activation = 'sigmoid'

### TensorBoard setup

In [119]:
experiment_name = "moods"

tb_logdir_cur = os.path.join(TB_LOGDIR, experiment_name)

# initialize TensorBoard in Python
tensorboard = TensorBoard(log_dir = tb_logdir_cur)

# + add to callbacks
callbacks = [tensorboard]

### Rest of Parameters

stay essentially the same (or similar)

In [120]:
# Optimizer
optimizer = adam

batch_size = 32 

epochs = 30

validation_split=0.1 

random_seed = 0

## Training

In [121]:
# Summary of Training options

print(loss)
print(optimizer)
print(metrics)
print("Batch size:", batch_size, "Epochs:", epochs)

binary_crossentropy
<keras.optimizers.Adam object at 0x7f6850fa47b8>
['accuracy', <function precision at 0x7f68f5674598>, <function recall at 0x7f68f5674510>]
Batch size: 32 Epochs: 30


In [122]:
# COMPILE MODEL

model.compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [123]:
# past_epochs is only for the case that we execute the next code box multiple times (so that Tensorboard is displaying properly)
past_epochs = 0

In [124]:
# START TRAINING

history = model.fit(train_set, train_classes, 
                     validation_split=validation_split,
                     #validation_data=(X_test,y_test), 
                     epochs=epochs, 
                     initial_epoch=past_epochs,
                     batch_size=batch_size, 
                     callbacks=callbacks
                     )

past_epochs += epochs

Train on 91 samples, validate on 11 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### Verifying Accuracy on Test Set

In [125]:
# compute probabilities for the classes (= get outputs of output layer)
test_pred_prob = model.predict(test_set)
test_pred_prob[0:10]

array([[7.5280885e-03, 4.4500372e-01, 1.7079780e-02, 4.9981843e-03,
        5.2539021e-01],
       [4.5398569e-06, 9.9443001e-01, 9.7266812e-04, 3.2460450e-03,
        1.3468076e-03],
       [8.9994806e-08, 9.7452253e-01, 4.4043329e-05, 1.3854733e-02,
        1.1578686e-02],
       [1.3572507e-07, 9.9890471e-01, 2.4608391e-05, 8.1000332e-04,
        2.6058347e-04],
       [3.9400297e-07, 9.9744248e-01, 3.6083722e-05, 1.9670506e-04,
        2.3243695e-03],
       [6.9213549e-08, 9.9819309e-01, 1.6649756e-05, 1.1169816e-03,
        6.7316071e-04],
       [1.0938060e-06, 9.9205339e-01, 1.4200664e-05, 7.5426246e-03,
        3.8871693e-04],
       [1.0661203e-06, 9.9406260e-01, 2.3826586e-04, 1.1620361e-03,
        4.5359670e-03],
       [2.2570396e-07, 9.9854505e-01, 7.3469142e-05, 2.3026942e-04,
        1.1510079e-03],
       [9.9998927e-01, 6.3664550e-07, 6.9976963e-06, 2.9439143e-06,
        2.1522482e-07]], dtype=float32)

In [126]:
# to get the predicted class we have to round 0 < 0.5 > 1
test_pred = np.round(test_pred_prob)
test_pred[0:10]

array([[0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.]], dtype=float32)

In [127]:
test_classes[0:10]

array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0]])

In [128]:
# get final Accuracy
accuracy_score(test_classes, test_pred)

0.6764705882352942