# Music vs. Speech Classification with Deep Learning

This tutorial shows how different Convolutional Neural Network architectures can be used to decide whether a piece of audio is music or speech (binary classification).

The data set used is the [Music Speech](http://marsyasweb.appspot.com/download/data_sets/) data set compiled by George Tzanetakis. It consists of 128 tracks, each 30 seconds long. Each class (music/speech) has 64 examples. All tracks are 22050 Hz, mono, 16-bit audio files in .wav format.

This tutorial contains:
* Loading and Preprocessing of Audio files
* Loading class files from CSV and using Label Encoder
* Generating Mel spectrograms
* Standardization of Data
* Convolutional Neural Networks: single, stacked, parallel
* ReLU Activation
* Dropout
* Train/Test set split
* (Cross-validation - TODO)

In [2]:
import os
device = 'gpu' # 'cpu' or 'gpu'
os.environ['THEANO_FLAGS']='mode=FAST_RUN,device=' + device + ',floatX=float32'

import argparse
import csv
import datetime
import glob
import math
import sys
import time
import numpy as np
import pandas as pd # Pandas for reading CSV files and easier Data handling in preparation

from sklearn import preprocessing
from sklearn.metrics import accuracy_score

from theano import config

import keras
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Dense, Dropout, Activation, Flatten
from keras.layers.normalization import BatchNormalization

# local
import rp_extract as rp
from audiofile_read import audiofile_read

Using gpu device 0: GeForce GTX 980 Ti (CNMeM is disabled, cuDNN 5105)
Using Theano backend.


In [17]:
from sklearn import __version__ as sklearn_version

In [25]:
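# Comparing against the exact strings '0.17' / '0.18' fails for versions such as
# '0.17.1' and would leave the names below undefined, so we try the imports instead.
try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import StratifiedShuffleSplit
except ImportError:
    # scikit-learn 0.17
    from sklearn.cross_validation import train_test_split
    from sklearn.cross_validation import StratifiedShuffleSplit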

## Load the Metadata

The tab-separated file contains pairs of filename TAB class.
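To make the file layout concrete, here is a minimal sketch that parses two made-up rows in the same "filename TAB class" format using only the standard library. The file names in `sample` are invented for illustration and are not part of the data set; the tutorial itself uses pandas for this step.

```python
import csv
import io

# Two invented rows in the "filename<TAB>class" layout (not the real file list).
sample = "speech/example1.wav\tspeech\nmusic/example2.wav\tmusic\n"

# csv.reader splits every line at the tab character
reader = csv.reader(io.StringIO(sample), delimiter="\t")
pairs = [(row[0], row[1]) for row in reader]

filenames = [name for name, _ in pairs]
labels = [label for _, label in pairs]

print(filenames)
print(labels)
```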

In [3]:
csv_file = 'data/Music_speech/filelist_wclasses.txt' 
metadata = pd.read_csv(csv_file, index_col=0, sep='\t', header=None)
metadata.head(10)

| 0 | 1 |
|---|---|
| speech/stupid.wav | speech |
| speech/teachers2.wav | speech |
| speech/danie1.wav | speech |
| speech/oneday.wav | speech |
| speech/jvoice.wav | speech |
| speech/relation.wav | speech |
| speech/geography.wav | speech |
| speech/pulp2.wav | speech |
| speech/greek1.wav | speech |
| speech/conversion.wav | speech |


In [4]:
# create list of filenames with associated classes
filelist = metadata.index.tolist()
classes = metadata[1].values.tolist()

## Encode Labels to Numbers

String labels need to be encoded as numbers. We use the LabelEncoder from the scikit-learn package.

In [5]:
classes[0:5]

['speech', 'speech', 'speech', 'speech', 'speech']

In [6]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
labelencoder.fit(classes)
print len(labelencoder.classes_), "classes:", ", ".join(list(labelencoder.classes_))

classes_num = labelencoder.transform(classes)
classes_num[0:5]

2 classes: music, speech


array([1, 1, 1, 1, 1])

Note: In order to correctly re-transform any predicted numbers into strings, we keep the labelencoder for later.
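With scikit-learn, the way back from numbers to strings is `labelencoder.inverse_transform(...)`. The sketch below only illustrates the idea with plain numpy on a made-up label list; it mimics the alphabetical numbering seen above (music = 0, speech = 1) and is not meant as scikit-learn's implementation.

```python
import numpy as np

# made-up label list for illustration
classes = ['speech', 'music', 'speech', 'music', 'music']

# np.unique returns the sorted class names and, for every item,
# its index into that sorted array (the numeric label)
class_names, classes_num = np.unique(classes, return_inverse=True)
print(class_names)
print(classes_num)

# mapping predicted numbers back to strings is a simple array lookup
predicted = np.array([0, 1, 1])
print(class_names[predicted])
```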

## Load the Audio Files

In [7]:
path = 'data/Music_speech'
list_spectrograms = [] # spectrograms are put into a list first

# desired output parameters
n_mel_bands = 40   # y axis
frames = 80        # x axis

# some FFT parameters
fft_window_size=512
fft_overlap = 0.5
hop_size = int(fft_window_size*(1-fft_overlap))
segment_size = fft_window_size + (frames-1) * hop_size # segment size for desired # frames

for filename in filelist:
    print ".", 
    filepath = os.path.join(path, filename)
    samplerate, samplewidth, wavedata = audiofile_read(filepath,verbose=False)
    sample_length = wavedata.shape[0]

    # make Mono (in case of multiple channels / stereo)
    if wavedata.ndim > 1:
        wavedata = np.mean(wavedata, 1)
        
    # take only a segment
    pos = 0 # start position
    wav_segment = wavedata[pos:pos+segment_size]

    # 1) FFT spectrogram 
    spectrogram = rp.calc_spectrogram(wav_segment,fft_window_size,fft_overlap)

    # 2) Transform to perceptual Mel scale (uses librosa.filters.mel)
    spectrogram = rp.transform2mel(spectrogram,samplerate,fft_window_size,n_mel_bands)
        
    # 3) Log 10 transform
    spectrogram = np.log10(spectrogram)
    
    list_spectrograms.append(spectrogram)
        
print "\nRead", len(filelist), "audio files"



. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Read 128 audio files


In [8]:
len(list_spectrograms)

128

In [9]:
spectrogram.shape

(40, 80)
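The 80 frames follow directly from the FFT parameters chosen in the loading cell. The short calculation below redoes that arithmetic, assuming the spectrogram function applies no padding (its internals are not shown in this tutorial). It also shows that the segment covers less than one second of each 30-second track.

```python
samplerate = 22050        # Hz, as stated for this data set
fft_window_size = 512
fft_overlap = 0.5
frames = 80

# consecutive windows start hop_size samples apart
hop_size = int(fft_window_size * (1 - fft_overlap))
segment_size = fft_window_size + (frames - 1) * hop_size

# number of full windows that fit into the segment (assumes no padding)
n_frames = 1 + (segment_size - fft_window_size) // hop_size
duration = segment_size / float(samplerate)

print("hop_size     = %d samples" % hop_size)
print("segment_size = %d samples" % segment_size)
print("n_frames     = %d" % n_frames)
print("duration     = %.3f seconds" % duration)
```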

## Convert the List of Spectrograms into One Big Array

In [10]:
# a list of many 40x80 spectrograms is made into 1 big array
# config.floatX comes from the Theano configuration and enforces float32 precision (needed for GPU computation)
data = np.array(list_spectrograms, dtype=config.floatX)
data.shape

(128, 40, 80)

## Standardization

<b>Always standardize</b> the data before feeding it into the Neural Network!

As in the Car image tutorial, we use <b>zero-mean unit-variance standardization</b> (also known as Z-score normalization).
However, this time we use <b>attribute-wise standardization</b>, i.e. each pixel position is standardized individually across all samples, as opposed to computing a single mean and a single standard deviation over all values.

('Flat' standardization would also be possible, but we have seen benefits of attribute-wise standardization in our experiments.)

This time, we use the StandardScaler from the scikit-learn package.
As it operates on vector data, we first have to vectorize (i.e. reshape) our matrices.
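The difference between the two variants can be illustrated on a small toy matrix. This is a sketch with random stand-in data (6 samples, 4 attributes on deliberately different scales), not the spectrogram data; it uses the population standard deviation, as StandardScaler does.

```python
import numpy as np

rng = np.random.RandomState(0)

# toy stand-in for the vectorized spectrograms: 6 samples, 4 attributes,
# each attribute with its own scale and offset
scales = np.array([1.0, 5.0, 0.1, 20.0])
offsets = np.array([0.0, -4.0, 2.0, 10.0])
X = rng.randn(6, 4) * scales + offsets

# attribute-wise: one mean and one standard deviation per column
X_attr = (X - X.mean(axis=0)) / X.std(axis=0)

# flat: a single mean and a single standard deviation over all values
X_flat = (X - X.mean()) / X.std()

print("attribute-wise, std per column:", X_attr.std(axis=0).round(3))
print("flat, std per column:          ", X_flat.std(axis=0).round(3))
```

After attribute-wise standardization every column has unit variance, whereas the flat variant only normalizes the matrix as a whole and leaves the columns on different scales.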

In [11]:
# vectorize
N, ydim, xdim = data.shape
data = data.reshape(N, xdim*ydim)
data.shape

(128, 3200)

In [12]:
# standardize
scaler = preprocessing.StandardScaler()
data = scaler.fit_transform(data)

In [13]:
# show mean and standard deviation: two vectors with same length as data.shape[1]
scaler.mean_, scaler.scale_

(array([-4.02511168, -4.0533061 , -4.02470922, -4.02415657, -4.02041769, -4.02912712, -4.05844784, -4.04009008, -4.05362654, -4.0890398 , ..., -8.21372128, -8.17889404, -8.18083572, -8.18607426,
        -8.18168163, -8.19005871, -8.17850971, -8.19392109, -8.19137573, -8.12756443], dtype=float32),
 array([ 0.92513525,  0.91504389,  0.93706119,  0.90523106,  0.90080869,  0.85922456,  0.93110597,  0.91019833,  0.94414949,  0.92181993, ...,  0.97755539,  1.03284562,  1.02812028,  1.07878053,
         1.06577301,  1.04868042,  1.02863121,  1.00643325,  0.99857688,  1.07620108], dtype=float32))

# Creating Train & Test Set 

In [21]:
testset_size = 0.25 # fraction of the whole data set to keep for testing, i.e. 75% is used for training

# normal (random) split of data set into 2 parts
#from sklearn.model_selection import train_test_split

train_set, test_set, train_classes, test_classes = train_test_split(data, classes_num, test_size=testset_size, random_state=0)

In [22]:
train_classes

array([0, 1, 1, 0, 0, 1, 1, 0, 1, 0, ..., 1, 0, 1, 0, 0, 0, 0, 0, 1, 1])

In [23]:
test_classes

array([1, 1, 0, 1, 1, 0, 0, 0, 0, 1, ..., 1, 0, 0, 0, 0, 1, 0, 0, 1, 0])

In [24]:
# The two classes may be unbalanced
print "Class Counts: Class 0:", sum(train_classes==0), "Class 1:", sum(train_classes)

Class Counts: Class 0: 49 Class 1: 47


In [32]:
# better: Stratified Split retains the class balance in both sets
#from sklearn.model_selection import StratifiedShuffleSplit

if sklearn_version.startswith('0.17'):
    splits = StratifiedShuffleSplit(classes_num, n_iter=1, test_size=testset_size, random_state=0)
else:
    # scikit-learn 0.18 and later
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=testset_size, random_state=0)
    splits = splitter.split(data, classes_num)

for train_index, test_index in splits:
    print "TRAIN INDEX:", train_index
    print "TEST INDEX:", test_index
    train_set = data[train_index]
    test_set = data[test_index]
    train_classes = classes_num[train_index]
    test_classes = classes_num[test_index]
# Note: this for loop is only executed once, if n_splits==1

print train_set.shape
print test_set.shape
# Note: we will reshape the data later back to matrix form 

TRAIN INDEX: [ 97 102   0  34  13  44  60  20  75  22 ...,  43  94  65  15  26  84  30 107  10  91]
TEST INDEX: [  1  27  64 117  88  85  35  18  46 100 ...,  38  73  56 113  67  39 114  42  41  62]
(96, 3200)
(32, 3200)


In [33]:
print "Class Counts: Class 0:", sum(train_classes==0), "Class 1:", sum(train_classes)

Class Counts: Class 0: 48 Class 1: 48
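What "stratified" means can be sketched in a few lines of numpy: the indices of each class are shuffled and split separately, so both parts keep the original class ratio. The helper `stratified_split` is a hypothetical name introduced for this illustration and is not how scikit-learn implements StratifiedShuffleSplit.

```python
import numpy as np

def stratified_split(labels, test_size, seed=0):
    """Return (train_index, test_index) with the same class ratio in both parts."""
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    train_index, test_index = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)   # positions of all samples of class c
        rng.shuffle(idx)
        n_test = int(round(test_size * len(idx)))
        test_index.extend(idx[:n_test])
        train_index.extend(idx[n_test:])
    return np.array(train_index), np.array(test_index)

# 64 examples per class, as in this data set
labels = np.array([0] * 64 + [1] * 64)
train_index, test_index = stratified_split(labels, test_size=0.25)

print("train counts per class:", np.bincount(labels[train_index]))
print("test counts per class: ", np.bincount(labels[test_index]))
```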


# Convolutional Neural Networks

A Convolutional Neural Network (ConvNet or CNN) is a type of (deep) Neural Network that is well suited for data with two axes, such as images or spectrograms, because it learns from spatial proximity. Its core elements are 2D filter kernels, whose weights are learned during training, and downscaling functions such as Max Pooling.

A CNN can have one or more convolution layers, each with an arbitrary number N of filters (which defines the depth of the CNN layer), typically followed by a pooling step. Max Pooling aggregates neighboring pixels and thus reduces the image resolution by retaining only the maximum value of each neighborhood.
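The two building blocks can be sketched in plain numpy. This is a didactic illustration with a fixed averaging kernel on a tiny 6x6 "image", not how Keras or Theano implement the layers. In a real CNN the kernel values are learned, and the kernel is applied here without flipping, which makes no difference for learned kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding ('valid' mode)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # weighted sum of the patch under the kernel
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0          # fixed 3x3 averaging kernel

feature_map = conv2d_valid(image, kernel)   # 6x6 -> 4x4
pooled = max_pool(feature_map)              # 4x4 -> 2x2

print(feature_map.shape)
print(pooled.shape)
print(pooled)
```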

## Preparing the Data

### Adding the channel

As in the Car image tutorial, we need to add a dimension for the color channel to the data. RGB images typically have a 3rd dimension holding the color.

<b>Spectrograms, however, are considered like greyscale images, as in the previous tutorial.
Likewise we need to add an extra dimension for compatibility with the CNN implementation.</b>

<i>Same as in the previous tutorial:</i>

In Theano, traditionally the color channel was the <b>first</b> dimension in the image shape. 
In Tensorflow, the color channel is the <b>last</b> dimension in the image shape. 

This can now be configured in ~/.keras/keras.json via "image_dim_ordering": "th" or "tf". Note that "tf" (Tensorflow ordering) is the default even if you use Theano as the backend. Depending on this setting, one of the two reshape variants below is used.

For greyscale images, we add the number 1 as the depth of the additional dimension of the input shape (for RGB color images, the number of channels is 3).
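The two orderings can be compared on a dummy array. The sketch below uses two artificial "spectrograms" filled with running numbers instead of real data. With a single channel, both reshapes contain exactly the same pixel values and differ only in where the channel axis sits.

```python
import numpy as np

ydim, xdim, n_channels = 40, 80, 1

# two dummy vectorized "spectrograms" filled with running numbers
flat = np.arange(2 * ydim * xdim, dtype='float32').reshape(2, ydim * xdim)

# Theano ordering: (samples, channels, height, width)
channels_first = flat.reshape(flat.shape[0], n_channels, ydim, xdim)

# Tensorflow ordering: (samples, height, width, channels)
channels_last = flat.reshape(flat.shape[0], ydim, xdim, n_channels)

print(channels_first.shape)
print(channels_last.shape)

# same pixel values either way
same_pixels = np.array_equal(channels_first[:, 0], channels_last[:, :, :, 0])
print(same_pixels)
```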

In [34]:
n_channels = 1 # for grey-scale, 3 for RGB, but usually already present in the data

if keras.backend.image_dim_ordering() == 'th':
    # Theano ordering (~/.keras/keras.json: "image_dim_ordering": "th")
    train_set = train_set.reshape(train_set.shape[0], n_channels, ydim, xdim)
    test_set = test_set.reshape(test_set.shape[0], n_channels, ydim, xdim)
else:
    # Tensorflow ordering (~/.keras/keras.json: "image_dim_ordering": "tf")
    train_set = train_set.reshape(train_set.shape[0], ydim, xdim, n_channels)
    test_set = test_set.reshape(test_set.shape[0], ydim, xdim, n_channels)

In [35]:
train_set.shape

(96, 1, 40, 80)

In [36]:
test_set.shape

(32, 1, 40, 80)

In [37]:
# we store the new shape of the images in the 'input_shape' variable.
# take all dimensions except the 0th one (which is the number of images)
input_shape = train_set.shape[1:]  
input_shape

(1, 40, 80)

# Creating Neural Network Models in Keras

## Sequential Models

In Keras, one can choose between a Sequential model and a Graph model (replaced by the functional API from Keras 1.0 onwards). Sequential models are the standard case. Graph models are for parallel networks.

## Creating a Single Layer and a Two Layer CNN

Try: (comment/uncomment code in the following code block)
* 1 Layer
* 2 Layer
* more conv_filters
* Dropout?

In [143]:
model = Sequential()

#conv_filters = 16   # number of convolution filters (= CNN depth)
conv_filters = 32   # number of convolution filters (= CNN depth)

# Layer 1
model.add(Convolution2D(conv_filters, 3, 3, border_mode='valid', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2))) 

# Layer 2
model.add(Convolution2D(conv_filters, 3, 3, border_mode='valid'))  # input_shape is only needed on the first layer
model.add(MaxPooling2D(pool_size=(2, 2))) 

# After Convolution, we have a conv_filters*x*y matrix output
# In order to feed this to a Full(Dense) layer, we need to flatten all data
# Note: Keras does automatic shape inference, i.e. it knows how many (flat) input units the next layer will need,
# so no parameter is needed for the Flatten() layer.
model.add(Flatten()) 

# Full layer
model.add(Dense(256)) 

# Output layer
# For binary/2-class problems use ONE sigmoid unit,
# for multi-class problems use n output units and activation='softmax'
# (for multi-label problems use n sigmoid units instead)
model.add(Dense(1,activation='sigmoid'))

If you get OverflowError: Range exceeds valid bounds in the cell above, check that the Theano vs. Tensorflow ordering used in the reshape cell before matches your keras.json configuration file.

In [144]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
convolution2d_36 (Convolution2D) (None, 32, 33, 78)    800         convolution2d_input_20[0][0]     
____________________________________________________________________________________________________
maxpooling2d_35 (MaxPooling2D)   (None, 32, 16, 39)    0           convolution2d_36[0][0]           
____________________________________________________________________________________________________
convolution2d_37 (Convolution2D) (None, 32, 14, 32)    24608       maxpooling2d_35[0][0]            
____________________________________________________________________________________________________
maxpooling2d_36 (MaxPooling2D)   (None, 32, 7, 16)     0           convolution2d_37[0][0]           
___________________________________________________________________________________________
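The output shapes and parameter counts of such a summary can be checked by hand: a 'valid' convolution shrinks each axis by (kernel size - 1), 2x2 pooling halves it (rounding down), and each filter has one weight per kernel element and input channel plus one bias. The sketch below applies this arithmetic twice, once for the 3x3 kernels used in the model cell above and once for 8x3 and 3x8 kernels. The figures printed in the summary above appear to match the second variant, so that summary presumably stems from a run with different kernel settings; this is an inference from the numbers, not something the notebook states. The helper names are hypothetical.

```python
def conv_valid(shape, n_filters, kernel):
    """Output shape and parameter count of a 'valid' convolution (channels first)."""
    channels, height, width = shape
    kh, kw = kernel
    out_shape = (n_filters, height - kh + 1, width - kw + 1)
    n_params = n_filters * (kh * kw * channels) + n_filters   # weights + biases
    return out_shape, n_params

def pool(shape, size=(2, 2)):
    """Output shape of non-overlapping max pooling (rounds down)."""
    channels, height, width = shape
    return (channels, height // size[0], width // size[1])

def trace(input_shape, kernels, n_filters=32):
    """List (output shape, parameter count) for conv + pool pairs."""
    shape = input_shape
    rows = []
    for kernel in kernels:
        shape, n_params = conv_valid(shape, n_filters, kernel)
        rows.append((shape, n_params))
        shape = pool(shape)
        rows.append((shape, 0))
    return rows

as_coded = trace((1, 40, 80), [(3, 3), (3, 3)])      # kernels from the model cell
as_printed = trace((1, 40, 80), [(8, 3), (3, 8)])    # kernels matching the summary

print("3x3 kernels:")
for row in as_coded:
    print(row)
print("8x3 and 3x8 kernels:")
for row in as_printed:
    print(row)
```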

## Training the CNN

In [145]:
# Define a loss function 
loss = 'binary_crossentropy'  # 'categorical_crossentropy' for multi-class problems

# Optimizer = Stochastic Gradient Descent
optimizer = 'sgd' 

# Compiling the model
model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
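To give an intuition for the loss chosen here, binary cross-entropy can be written out in numpy. This is a sketch of the textbook formula on made-up predictions; the clipping constant `eps` is an assumption added to avoid log(0) and is not taken from the Keras source.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])

confident = np.array([0.9, 0.1, 0.8, 0.2])   # mostly right and confident
unsure = np.array([0.5, 0.5, 0.5, 0.5])      # always undecided

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)

print("confident: %.4f" % loss_confident)
print("unsure:    %.4f" % loss_unsure)
```

Confident, correct predictions give a low loss, while always predicting 0.5 gives log(2), roughly 0.693, which is a useful reference value when reading the training log.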

In [149]:
# TRAINING the model
epochs = 15
history = model.fit(train_set, train_classes, batch_size=32, nb_epoch=epochs)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


#### Accuracy goes up pretty quickly for 1 layer on Train set! Also on Test set?

### Verifying Accuracy on Test Set

In [150]:
test_pred = model.predict_classes(test_set)



In [56]:
# 1 layer (Mac)
accuracy_score(test_classes, test_pred)

0.71875

In [51]:
# 2 layer
accuracy_score(test_classes, test_pred)

0.78125

In [57]:
# 2 layer + 32 convolution filters
accuracy_score(test_classes, test_pred)

0.8125

In [151]:
# 2 layer + 32 convolution filters + ...?
accuracy_score(test_classes, test_pred)

0.78125
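With a single sigmoid output, turning predictions into class labels amounts to thresholding the predicted probability, here assumed to be at 0.5, and accuracy is the fraction of matching labels. The sketch below shows this on made-up probabilities; `to_class_labels` and `accuracy` are hypothetical helper names. It also shows how coarse the measure is with only 32 test samples: each accuracy value above corresponds to a whole number of correct files.

```python
import numpy as np

def to_class_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to class 0 or 1 (threshold is an assumption)."""
    return (np.asarray(probabilities) > threshold).astype(int)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

probs = np.array([0.92, 0.13, 0.51, 0.49])   # made-up sigmoid outputs
y_true = np.array([1, 0, 0, 0])

y_pred = to_class_labels(probs)
acc = accuracy(y_true, y_pred)
print(y_pred)
print(acc)

# with 32 test samples, one file more or less changes accuracy by 1/32
n_test = 32
n_correct = [int(round(a * n_test)) for a in (0.71875, 0.78125, 0.8125)]
for a, n in zip((0.71875, 0.78125, 0.8125), n_correct):
    print("%.5f = %d / %d correct" % (a, n, n_test))
```

Differences of one or two files between the configurations are therefore small, and a cross-validation would give a more reliable comparison.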

## Additional Parameters & Techniques

Try: (comment/uncomment code blocks below)
* Adding ReLU activation
* Adding Batch normalization
* Adding Dropout

In [None]:
model = Sequential()

conv_filters = 16   # number of convolution filters (= CNN depth)

# Layer 1
model.add(Convolution2D(conv_filters, 3, 3, border_mode='valid', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2))) 

# Layer 2
model.add(Convolution2D(conv_filters, 3, 3, border_mode='valid'))  # input_shape is only needed on the first layer
#model.add(BatchNormalization())
#model.add(Activation('relu')) 
model.add(MaxPooling2D(pool_size=(2, 2))) 

# After Convolution, we have a 16*x*y matrix output
# In order to feed this to a Full(Dense) layer, we need to flatten all data
# Note: Keras does automatic shape inference, i.e. it knows how many (flat) input units the next layer will need,
# so no parameter is needed for the Flatten() layer.
model.add(Flatten()) 

# Full layer
model.add(Dense(256))  
#model.add(Activation('relu'))
#model.add(Dropout(0.1))

# Output layer
# For binary/2-class problems use ONE sigmoid unit,
# for multi-class problems use n output units and activation='softmax'
# (for multi-label problems use n sigmoid units instead)
model.add(Dense(1,activation='sigmoid'))
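What the commented-out ReLU and Dropout layers would do can be sketched in numpy. This is an illustration of the general techniques, not the Keras implementation: ReLU sets negative activations to zero, and dropout (shown here in its common "inverted" form, which is an assumption about the variant) zeroes a random fraction of units during training and rescales the rest so the expected activation stays the same. At prediction time dropout is switched off.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative values become 0."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the others."""
    keep = rng.uniform(size=x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.RandomState(0)

activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
rectified = relu(activations)
print(rectified)

hidden = np.ones(10000)                      # dummy layer output
dropped = dropout(hidden, rate=0.1, rng=rng)
zero_fraction = float(np.mean(dropped == 0))
print("fraction of zeroed units: %.3f" % zero_fraction)
print("mean activation after dropout: %.3f" % dropped.mean())
```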

## Parallel CNNs

In [None]:
# other optimizers:
#from keras.optimizers import SGD, RMSprop, Adagrad

  