# Improved music genre prediction using Keras


In previous lessons we used the multilayer perceptron (MLP) architecture to solve the music genre prediction problem, but this approach did not prove very successful on the more challenging task of distinguishing between rock and hip-hop music. In this final lesson we will upgrade our model and use some more advanced techniques to maximize its accuracy.

## Identifying the problem

The input to our classifier is a 3-second music excerpt sampled at 400 Hz, which gives us exactly 1200 audio samples. The multilayer perceptron treats each sample as an independent input value and tries to identify, during training, relations between these values that help to classify the excerpt. Indeed, if we fed the input samples in a random (but fixed) order, the performance of the model would be the same.
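
The order-independence claim can be checked on a single dense layer. The sketch below uses made-up random data and an arbitrary hidden size of 8 (both are assumptions for illustration, not values from the lessons): shuffling the inputs with a fixed permutation and shuffling the weight rows the same way gives the same output, so for every MLP trained on ordered samples there is an equally good one for shuffled samples.

```python
import numpy as np

rng = np.random.default_rng(0)

num_inputs, num_hidden = 1200, 8            # hidden size is an arbitrary choice
x = rng.normal(size=(4, num_inputs))        # 4 made-up "excerpts"
W = rng.normal(size=(num_inputs, num_hidden))
b = rng.normal(size=num_hidden)

def dense_relu(inputs, weights, bias):
    """One fully connected layer with ReLU activation."""
    return np.maximum(inputs @ weights + bias, 0.0)

# Shuffle the samples with one fixed permutation and shuffle the
# weight rows in the same way: the layer computes the same sums.
perm = rng.permutation(num_inputs)
original = dense_relu(x, W, b)
shuffled = dense_relu(x[:, perm], W[perm, :], b)

print(np.allclose(original, shuffled))
```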

But of course we do know the correct order of the samples, and this is crucial information that the classifier should exploit. The structure of our classifier should reflect the fact that our input is a continuous signal that changes over time in some particular way.

## Solution

One of the most common upgrades over a standard MLP when working with signals is the convolutional layer. Imagine that instead of creating neurons that are connected to all neurons in the previous layer, you create neurons that are connected only to N consecutive neurons in the previous layer (note that we are now assuming that the neurons in the previous layer have a fixed order). You can then move these new neurons across the input data, reusing the same weights, and get an output at each location you visit. This is essentially a convolutional layer.
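
The idea of moving one small neuron across the input can be sketched in a few lines of NumPy. The helper `slide_filter` and the toy signal are hypothetical names and values chosen for illustration; this version uses no padding and a single filter, so it is a simplification of what a real convolutional layer does.

```python
import numpy as np

def slide_filter(signal, kernel, stride=1):
    """Apply one filter at every `stride`-th position (no padding)."""
    n = len(kernel)
    positions = range(0, len(signal) - n + 1, stride)
    return np.array([np.dot(signal[p:p + n], kernel) for p in positions])

signal = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])   # N = 3 weights, shared by all positions

outputs = slide_filter(signal, kernel)
print(outputs)   # one output value per visited location
```

Because the same three weights are reused at every location, the number of parameters does not grow with the length of the signal.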

## Implementation

The beginning of our script looks very familiar, except for some new imports that will be described later.

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Reshape, Flatten
from keras.optimizers import Adam

IN_FILE = "./classical_vs_rock.npz"
SAMPLE_WIDTH = 1200

BATCH_SIZE = 16
LEARNING_RATE = 0.001

NUM_EPOCHS = 10

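data = np.load(IN_FILE)

# Labels are not stored in the file. The code relies on each split holding
# one genre in its first half (label 0.0) and the other in its second half (1.0).
x_train = data['train']
y_train = np.zeros(x_train.shape[0])
y_train[x_train.shape[0] // 2:] = 1.0
x_test = data['test']
y_test = np.zeros(x_test.shape[0])
y_test[x_test.shape[0] // 2:] = 1.0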



Now let's define our model:

In [2]:
model = Sequential()
model.add(Reshape((SAMPLE_WIDTH, 1), input_shape=(SAMPLE_WIDTH,)))

So far our data had a fixed shape:

```
batch_size x num_inputs
```

but now we need a third dimension to distinguish time from features (each neuron will produce an output at every location where it is placed over the input data).

The Reshape layer reshapes our matrix into a three-dimensional tensor of shape:

```
batch_size x time x num_inputs
```

The time axis now has a length of 1200 and num_inputs becomes one. Indeed, there is just one feature (air pressure), measured at many points in time.
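
The same change of shape can be reproduced directly in NumPy. This is only an illustration on a dummy batch of zeros (the batch size of 16 mirrors BATCH_SIZE, but any size would do); inside the model the Reshape layer does this work.

```python
import numpy as np

batch = np.zeros((16, 1200))              # batch_size x num_inputs
reshaped = batch.reshape((-1, 1200, 1))   # batch_size x time x num_inputs

print(batch.shape, "->", reshaped.shape)
```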

In [3]:
model.add(Conv1D(32, kernel_size=31, strides=4, padding='same', activation='relu'))
model.add(Conv1D(32, kernel_size=31, strides=3, padding='same', activation='relu'))
model.add(Conv1D(48, kernel_size=15, strides=2, padding='same', activation='relu'))
model.add(Conv1D(48, kernel_size=15, strides=2, padding='same', activation='relu'))

We add four convolutional layers stacked on top of one another. Let's go through them one by one and see what they do.

### First convolutional layer

The first convolutional layer accepts input data of the shape described above. It consists of 32 filters (neurons that are placed at different locations along the time axis of the input data). Each filter has a length of 31, which means that at any single location it is connected to 31 consecutive neurons from the previous layer. Strides is the length of the jump between consecutive locations on the time axis. Padding determines what happens at the edges of the input signal: the option 'same' tells the model to place the filters on the input data as long as the central element of the filter fits within the input data. Wherever part of the filter falls outside the input data, the missing input values are assumed to be zero.

The output from this layer will therefore have the following shape:

```
batch_size x 300 x 32
```

Because we used "same" padding, the length of the time axis is simply divided by the value of "strides" (1200 / 4 = 300).

### Second convolutional layer

The output from this layer will have the following shape:

```
batch_size x 100 x 32
```

### Third convolutional layer

The output from this layer will have the following shape:

```
batch_size x 50 x 48
```

### Fourth convolutional layer

The output from this layer will have the following shape:

```
batch_size x 25 x 48
```
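
The time-axis lengths of all four layers follow from the same rule: with "same" padding, the output length is the input length divided by the stride, rounded up. The helper below is a hypothetical function written for this check, not part of Keras.

```python
import math

def same_padding_length(input_length, stride):
    """Output length of a 'same'-padded 1D convolution."""
    return math.ceil(input_length / stride)

length = 1200
lengths = []
for stride in (4, 3, 2, 2):       # strides of the four Conv1D layers
    length = same_padding_length(length, stride)
    lengths.append(length)

print(lengths)   # [300, 100, 50, 25]
```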

In [4]:
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

After the final convolutional layer we have a three-dimensional feature map, but from now on we want to use standard dense layers, so we need to flatten the feature map. This operation means that the outputs of all 48 filters at each of the 25 locations in time become a single vector of 25 x 48 = 1200 independent values.
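
Flattening can be illustrated with NumPy on a dummy feature map (the batch of 2 and the counting values are made up for the example). Each excerpt's 25 x 48 map is laid out as one row of 1200 values, with the 48 filter outputs of the first time step first, then those of the second time step, and so on.

```python
import numpy as np

# Dummy feature map: batch_size x time x filters
feature_map = np.arange(2 * 25 * 48, dtype=np.float32).reshape((2, 25, 48))

# Keep the batch axis, merge time and filters into one axis.
flat = feature_map.reshape((feature_map.shape[0], -1))

print(flat.shape)   # (2, 1200)
```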

The rest of our model is straightforward.

The rest of the script looks the same as before, except that we use a more sophisticated optimizer (Adam instead of SGD). The basic principle of Adam is similar to SGD, but it is more advanced and also takes into account the updates that were made before the current one. Because of that change we also had to adjust the learning rate (its optimal value is highly dependent on the optimizer).
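
To make "takes previous updates into account" more concrete, here is a minimal sketch of a single Adam step for one scalar parameter. It follows the commonly published form of the algorithm; the function name `adam_step`, the toy objective, and the values of beta1, beta2 and eps are assumptions for illustration, and the Keras implementation may differ in details.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of past gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(w) = w**2 (gradient 2*w), starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, 2 * w, m, v, t)

print(w)
```

Plain SGD would instead compute `param - lr * grad` and keep no state between steps; the two running averages `m` and `v` are what carry information from earlier updates.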

In [5]:
model.compile(loss='mean_squared_error',
              optimizer=Adam(lr=LEARNING_RATE),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    verbose=2,
                    validation_data=(x_test, y_test))

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
 - 3s - loss: 0.1305 - acc: 0.8252 - val_loss: 0.1189 - val_acc: 0.8370
Epoch 2/10
 - 2s - loss: 0.1192 - acc: 0.8396 - val_loss: 0.1081 - val_acc: 0.8560
Epoch 3/10
 - 2s - loss: 0.1129 - acc: 0.8465 - val_loss: 0.1059 - val_acc: 0.8590
Epoch 4/10
 - 2s - loss: 0.1091 - acc: 0.8575 - val_loss: 0.0999 - val_acc: 0.8695
Epoch 5/10
 - 2s - loss: 0.1038 - acc: 0.8638 - val_loss: 0.0977 - val_acc: 0.8700
Epoch 6/10
 - 2s - loss: 0.1003 - acc: 0.8698 - val_loss: 0.0953 - val_acc: 0.8795
Epoch 7/10
 - 2s - loss: 0.0977 - acc: 0.8745 - val_loss: 0.0925 - val_acc: 0.8770
Epoch 8/10
 - 2s - loss: 0.0950 - acc: 0.8779 - val_loss: 0.0943 - val_acc: 0.8750
Epoch 9/10
 - 2s - loss: 0.0901 - acc: 0.8826 - val_loss: 0.0933 - val_acc: 0.8820
Epoch 10/10
 - 2s - loss: 0.0877 - acc: 0.8892 - val_loss: 0.0901 - val_acc: 0.8835


The effort was worthwhile: we have gained about 5 percentage points in accuracy compared to the previous solution.

In [6]:
for layer_id, layer in enumerate(model.layers):
    weights = layer.get_weights()
    for param_id, param in enumerate(weights):
        print("Layer: {} parameter: {} type: {} shape: {}".format(
                layer_id, param_id, param.dtype, param.shape))

Layer: 1 parameter: 0 type: float32 shape: (31, 1, 32)
Layer: 1 parameter: 1 type: float32 shape: (32,)
Layer: 2 parameter: 0 type: float32 shape: (31, 32, 32)
Layer: 2 parameter: 1 type: float32 shape: (32,)
Layer: 3 parameter: 0 type: float32 shape: (15, 32, 48)
Layer: 3 parameter: 1 type: float32 shape: (48,)
Layer: 4 parameter: 0 type: float32 shape: (15, 48, 48)
Layer: 4 parameter: 1 type: float32 shape: (48,)
Layer: 6 parameter: 0 type: float32 shape: (1200, 64)
Layer: 6 parameter: 1 type: float32 shape: (64,)
Layer: 7 parameter: 0 type: float32 shape: (64, 1)
Layer: 7 parameter: 1 type: float32 shape: (1,)


Please note that although we now have six trainable layers instead of three, the number of parameters is actually almost two times smaller (167425 now versus 323969 before).
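
The current total can be verified from the weight shapes printed above: each layer has one weight per connection plus one bias per output. The two helper functions are hypothetical names introduced only for this check.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size x in_channels x filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights (inputs x units) plus one bias per unit."""
    return inputs * units + units

total = (
    conv1d_params(31, 1, 32)
    + conv1d_params(31, 32, 32)
    + conv1d_params(15, 32, 48)
    + conv1d_params(15, 48, 48)
    + dense_params(25 * 48, 64)   # flattened feature map -> 64 units
    + dense_params(64, 1)
)

print(total)   # 167425
```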

## Rock VS Hip-Hop

Now for the final evaluation. How will our new model perform? I will leave you the pleasure of finding out for yourself :-)

Thank you for reading and have a nice day!