<a href="https://colab.research.google.com/github/szilaard/AIT_project/blob/main/AitProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIT Deep Learning Project - Music genre classification based on audio

Péter Czumbel, Szilárd Horváth


In [1]:
import tensorflow as tf
import librosa
import pandas as pd
from glob import glob
from IPython.display import display
from IPython.display import Audio
import numpy as np
import matplotlib.pyplot as plt
import math
from tensorflow.keras.utils import to_categorical
from random import shuffle

## 1. Exploring the data

We use the GTZAN dataset, which consists of 1000 audio tracks, each 30 seconds long. It contains 10 genres, each represented by 100 tracks. All tracks are 22050 Hz mono 16-bit audio files in .wav format. Downloading GTZAN from TensorFlow Datasets does not work because the URL times out
(see: https://github.com/tensorflow/datasets/issues/4090), so we use [this](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification) version of the dataset from Kaggle instead.<br>
To run the notebook yourself, download the dataset from Kaggle, extract the Data folder, and place it in the project's root directory.



### 1.1 Loading the dataset

To download the dataset from Kaggle, run the block below and upload your own Kaggle API key (kaggle.json).

In [None]:
! pip install -q kaggle
from google.colab import files
files.upload()
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download -d andradaolteanu/gtzan-dataset-music-genre-classification
! unzip gtzan-dataset-music-genre-classification.zip

First we read all the data from the directories:

In [2]:
audio_files = sorted(glob("Data/genres_original/*/*.wav"))   # glob does not guarantee an order, so we sort to group the tracks by genre

Setting some variables based on the dataset description:

In [3]:
sample_rate = 22050   # sampling frequency
duration = 30         # length of the tracks in seconds

### 1.2 Examples

There are 100 tracks of each genre, and our dataset is ordered, so if we check every 100th track, we can see all the different genres.<br>
Example track for each genre:

In [None]:
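for i in range(10):
    # the genre is the name of the file's parent directory (works with both "/" and "\\" separators)
    print(audio_files[i*100].replace("\\", "/").split("/")[-2])
    display(Audio(audio_files[i*100]))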

### 1.3 Plotting the waveforms

Plotting the waveforms of different music genres suggests that most genres could probably be classified from the waveform alone. However, some genres, like country and metal, can look quite similar.

In [None]:
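import librosa.display  # make sure the plotting submodule is loaded

fig = plt.figure(figsize=(20, 7))
rows = 2
columns = 5
for i in range(1, columns * rows + 1):
    fig.add_subplot(rows, columns, i)
    audio_file = audio_files[(i-1)*100]
    signal, sr = librosa.load(audio_file, sr=sample_rate)
    librosa.display.waveshow(signal, sr=sample_rate)
    # the genre is the name of the file's parent directory
    plt.title(audio_file.replace("\\", "/").split("/")[-2])
    plt.xlabel("")
fig.tight_layout()  # apply the layout once the subplots exist
plt.show()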


### 1.4 Plotting the MFCCs

Setting variables for calculating the MFCCs:

In [4]:
n_fft = 2048          # number of samples per fft - the size of the window when performing an fft
n_mfcc = 50           # number of extracted coefficients
hop_length = 512      # the amount we shift with each fft

Plotting the MFCCs of different genres yields representations that are easier to tell apart than the raw waveforms. We will use this version of the data to train our deep neural networks.

In [None]:
import librosa.display  # make sure the plotting submodule is loaded

fig = plt.figure(figsize=(20, 7))
rows = 2
columns = 5
for i in range(1, columns * rows + 1):
    fig.add_subplot(rows, columns, i)
    audio_file = audio_files[(i-1)*100]
    signal, sr = librosa.load(audio_file, sr=sample_rate)
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
    librosa.display.specshow(mfcc, sr=sample_rate, hop_length=hop_length)
    # the genre is the name of the file's parent directory
    plt.title(audio_file.replace("\\", "/").split("/")[-2])
fig.tight_layout()  # apply the layout once the subplots exist
plt.show()

## 2. Preprocessing the dataset

### 2.1 Splitting the tracks to segments and calculating MFCCs

We create a data structure for the genre mapping, the raw MFCC data and the labels. This way we can save the preprocessed data as a JSON file later.

In [5]:
data = {
    "mapping": [],  # mapping the names of the genres to indexes 0 to 9
    "mfcc": [],     # array containing the mfcc arrays of the track segments
    "labels": []    # array of the genre labels of the track segments
}   

We define these parameters so we can fine-tune them later if needed. They ensure that every segment produces an output of the same shape after sampling and transformation.

In [6]:
number_of_segments = 10      # the number of segments we want to split each track
samples_per_track = sample_rate * duration  # how many samples do we get from each track
samples_per_segment=int(samples_per_track/number_of_segments)    # how many samples are there in a segment
num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)   # this is to check if the output has the correct dimensions
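
As a quick sanity check of the arithmetic above, the following sketch recomputes the segment sizes from the values set earlier in this notebook (22050 Hz, 30 s, 10 segments, hop length 512). It assumes librosa's default centered framing, under which a signal of `n` samples yields `1 + n // hop_length` frames; for these values that agrees with the `math.ceil` expression used above.

```python
import math

# Values copied from the notebook cells above
sample_rate = 22050
duration = 30
number_of_segments = 10
hop_length = 512

samples_per_track = sample_rate * duration                      # samples in a full 30 s track
samples_per_segment = samples_per_track // number_of_segments   # samples in one segment

# Expected number of MFCC frames per segment, computed two ways
frames_ceil = math.ceil(samples_per_segment / hop_length)
frames_centered = 1 + samples_per_segment // hop_length          # assumption: centered framing

print(samples_per_track, samples_per_segment, frames_ceil, frames_centered)
```

Both frame counts come out to 130, which is the second dimension of the MFCC array printed later in the notebook.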

Next, we split the audio data into segments and compute the mel-frequency cepstral coefficients (MFCCs) of each segment. This transforms the data into a representation closer to how humans perceive music.

In [7]:
shuffle(audio_files)    # the audio files are ordered by genre; it's easier to shuffle here, while there is only one list to shuffle
for audio_file in audio_files:
    # the genre is the name of the file's parent directory (works with both "/" and "\\" separators)
    genre = audio_file.replace("\\", "/").split("/")[-2]
    # add the genre to the mapping if it's not already there
    if genre not in data["mapping"]:
        data["mapping"].append(genre)
    try:
        # read the signal, resampled to the sample rate we expect
        signal, sr = librosa.load(audio_file, sr=sample_rate)
    except Exception:
        # there are some corrupted/unreadable files, so we skip them
        continue

    # we don't have much data, so we split the tracks into segments to increase our training data
    for i in range(number_of_segments):
        # calculate the start and end index of the segment
        start = samples_per_segment * i
        end = start + samples_per_segment
        # calculate the MFCC of the segment
        mfcc = librosa.feature.mfcc(y=signal[start:end], sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
        mfcc = mfcc.T
        # some tracks are shorter than 30 seconds, so some segments have an incorrect length; we filter those out here
        if len(mfcc) == num_mfcc_vectors_per_segment:
            # add the MFCC and the label to our data
            data["mfcc"].append(mfcc)
            data["labels"].append(data["mapping"].index(genre))

  signal, sr = librosa.load(audio_file)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


We transform the created lists into numpy arrays, so they are easier to handle.

In [8]:
data["mfcc"] = np.array(data["mfcc"], dtype=np.float32)
data["labels"] = np.array(data["labels"], dtype=np.int32)   # class indexes are integers

### 2.2 Flattening the data

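We compute the size each sample would have if flattened from a 2D array into a one-dimensional array. The reshape itself is left commented out below, so the data keeps its (segments, frames, coefficients) shape.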

In [9]:
print(data["mfcc"].shape)
flattened_dim=np.prod(data["mfcc"].shape[1:])

(9986, 130, 50)


In [None]:
#data["mfcc"] = data["mfcc"].reshape(-1,flattened_dim)

#data["mfcc"] = data["mfcc"].astype(float)

### 2.3 Splitting training, testing and validation data

We separate our data into training, validation and test datasets. The ratios are defined as variables so we can fine-tune them later.

In [10]:
data_length = len(data["mfcc"])
train_ratio = 0.7
valid_ratio = 0.2
test_ratio = 0.1

train_size = int(train_ratio*data_length)
valid_size = int(valid_ratio*data_length)
test_size = int(test_ratio*data_length)

X_train = data["mfcc"][:train_size]
Y_train = data["labels"][:train_size]
X_valid = data["mfcc"][train_size:train_size+valid_size]
Y_valid = data["labels"][train_size:train_size+valid_size]
X_test = data["mfcc"][train_size+valid_size:]
Y_test = data["labels"][train_size+valid_size:]
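
To make the resulting split sizes concrete, here is a small sketch of the same arithmetic. The helper name `split_sizes` is hypothetical (it is not part of the notebook); the 9986 segments come from the shape printed in section 2.2. Because the test slice above takes everything after the validation slice, the test set receives the remainder rather than exactly `int(test_ratio * data_length)` items.

```python
def split_sizes(n_items, train_ratio, valid_ratio):
    """Return (train, valid, test) sizes; the test set gets whatever is left over."""
    train = int(train_ratio * n_items)
    valid = int(valid_ratio * n_items)
    test = n_items - train - valid
    return train, valid, test

# 9986 is the number of segments reported by data["mfcc"].shape in this notebook
sizes = split_sizes(9986, 0.7, 0.2)
print(sizes)
```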


### 2.4 Standardization

We calculate the mean and standard deviation of the training data, then use these values to standardize the whole dataset.

In [11]:
X_train=np.asarray(X_train)
Y_train=np.asarray(Y_train)
X_valid=np.asarray(X_valid)
Y_valid=np.asarray(Y_valid)
X_test=np.asarray(X_test)
Y_test=np.asarray(Y_test)

In [12]:
mean = np.mean(X_train, axis=0)
std  = np.std(X_train, axis=0, dtype=np.float32)

In [13]:
X_train = (X_train - mean) / std
X_valid = (X_valid - mean) / std
X_test  = (X_test - mean) / std

### 2.5 Encoding the labels and performing checks

We check if each data set has the same number of categories in the output.

In [14]:
nb_classes = len(np.unique(Y_train))
print("Validation data has the same number of classes, as the training data:", nb_classes == len(np.unique(Y_valid)))
print("Test data has the same number of classes, as the training data:", nb_classes == len(np.unique(Y_test)))

Validation data has the same number of classes, as the training data: True
Test data has the same number of classes, as the training data: True


We change the dense representation of the classes to one-hot encoding.

In [15]:
Y_train = to_categorical(Y_train)
Y_valid = to_categorical(Y_valid)
Y_test  = to_categorical(Y_test)
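
For intuition, one-hot encoding can be sketched in plain NumPy. This is only an illustration of what the encoding produces; the helper `one_hot` is a hypothetical name, not the implementation behind `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class indexes to rows of an identity matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)

# argmax along the class axis recovers the original indexes
decoded = np.argmax(encoded, axis=1)
print(decoded)
```

The `np.argmax(..., 1)` calls used for evaluation later in the notebook rely on the same inverse mapping.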

Final check if the data has the right shape, mean and standard deviation.

In [16]:
print("Shapes of the training, validation and test input data:", X_train.shape, X_valid.shape, X_test.shape)
print("Shapes of the training, validation and test output data:", Y_train.shape, Y_valid.shape, Y_test.shape)
print("Mean values of the training, validation and test input data:", X_train.mean(), X_valid.mean(), X_test.mean())
print("Standard deviation of the training, validation and test input data:", X_train.std(), X_valid.std(), X_test.std())

Shapes of the training, validation and test input data: (6990, 130, 50) (1997, 130, 50) (999, 130, 50)
Shapes of the training, validation and test output data: (6990, 10) (1997, 10) (999, 10)
Mean values of the training, validation and test input data: 8.086974e-09 -0.007472902 0.008522864
Standard deviation of the training, validation and test input data: 0.9999996 1.0180941 1.0498854
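
The training set has a mean of almost exactly 0 and a standard deviation of almost exactly 1, while the validation and test values are only close to those targets. This is expected, because all three splits are scaled with the training statistics. The sketch below reproduces the effect on synthetic data; the array shapes, the random data and the small `eps` guard against division by zero are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for MFCC arrays: (samples, frames, coefficients)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 4, 3)).astype(np.float32)
valid = rng.normal(loc=5.0, scale=2.0, size=(50, 4, 3)).astype(np.float32)

# Statistics are computed on the training split only, per (frame, coefficient) position
mean = train.mean(axis=0)
std = train.std(axis=0)
eps = 1e-8  # assumption: guards against a zero standard deviation

train_scaled = (train - mean) / (std + eps)
valid_scaled = (valid - mean) / (std + eps)

print(train_scaled.mean(), train_scaled.std())   # very close to 0 and 1
print(valid_scaled.mean(), valid_scaled.std())   # close to 0 and 1, but not exact
```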


In [17]:
X_train = np.array([np.array(val) for val in X_train])
Y_train = np.array([np.array(val) for val in Y_train])

X_train = tf.cast(X_train , dtype=tf.float32)
Y_train = tf.cast(Y_train , dtype=tf.float32)

## 3. Models
For models we decided to use...

### 3.1 LSTM

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import LSTM
#from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import EarlyStopping

We use early stopping for our models, with the same parameters each time. We monitor the validation accuracy with a patience of 5 and restore the best weights at the end of training.

In [None]:
es = EarlyStopping(
    monitor="val_accuracy",
    min_delta=0,
    patience=5,
    verbose=0,
    mode="auto",
    baseline=None,
    restore_best_weights=True,
    start_from_epoch=0,
)
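
The patience rule can be illustrated with a simplified, framework-free sketch. It only mimics the core idea (stop after `patience` consecutive epochs without a strict improvement and remember the best epoch); the real callback also handles `min_delta`, `baseline` and the actual weight restoration. The function name and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based, for a metric that should increase."""
    best = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            # strict improvement: remember it and reset the counter
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # training ran to completion without triggering the rule
    return len(val_scores) - 1, best_epoch

# Hypothetical validation accuracies, one per epoch
scores = [0.30, 0.45, 0.50, 0.49, 0.48, 0.50, 0.47, 0.46, 0.44]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=5)
print(stop_epoch, best_epoch)
```

In this example training stops at epoch index 7 and the weights from epoch index 2 would be the ones to restore; note that matching the previous best (0.50 again at index 5) does not count as an improvement.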

Our first model is an LSTM model. We chose LSTM because it is well suited to time series such as music. The model stacks two LSTM layers, with Dropout layers after the recurrent stack and after the dense layer to reduce overfitting.

In [None]:
lstm_model = Sequential()
lstm_model.add(LSTM(80, input_shape=(X_train.shape[-2], X_train.shape[-1]),return_sequences=True))
#lstm_model.add(Dropout(0.2))
#lstm_model.add(LSTM(80, input_shape=(X_train.shape[-2], X_train.shape[-1]),return_sequences=True))
lstm_model.add(LSTM(100, input_shape=(X_train.shape[-2], X_train.shape[-1])))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(35, activation='selu',kernel_initializer='he_normal'))
lstm_model.add(Dropout(0.35))
lstm_model.add(Dense(nb_classes))
lstm_model.add(Activation('softmax')) 


In [None]:
lstm_model.summary()

Because this is a multi-class classification task, we use categorical cross-entropy as the loss function. We tried different optimizers and we chose .....

In [None]:
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

In [None]:
lstm_history = lstm_model.fit(X_train, Y_train,
              batch_size=256,
              epochs=40,
              validation_data=(X_valid, Y_valid),
              verbose=1, 
              callbacks=[es])

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

In [None]:
lstm_preds = lstm_model.predict(X_test)

In [None]:
print(classification_report(np.argmax(Y_test,1),np.argmax(lstm_preds,1)))

In [None]:
conf=confusion_matrix(np.argmax(Y_test,1),np.argmax(lstm_preds,1))
sns.heatmap(conf, annot=True, fmt='d', vmax=100)
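
To read the heatmap, it helps to know how the counts are laid out. The sketch below builds a confusion matrix by hand with rows as true classes and columns as predicted classes, which is the layout `confusion_matrix` uses as well. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Overall accuracy is the diagonal sum divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy example with three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_counts(y_true, y_pred, num_classes=3)
print(matrix)
print(accuracy_from_matrix(matrix))
```

Off-diagonal cells show which genres get confused with each other: here one class-0 sample was predicted as class 1 and one class-2 sample as class 0.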

### 3.2 CNN

In [27]:
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization
from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import SparseCategoricalCrossentropy

In [None]:
cnn_model = Sequential()
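# each MFCC segment is treated as a one-channel image; X_train needs a trailing channel axis (see the tf.expand_dims cell)
cnn_model.add(Conv2D(32, (3, 3), activation='relu',kernel_initializer=HeNormal, input_shape=(X_train.shape[1], X_train.shape[2], 1)))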
cnn_model.add(BatchNormalization())
#cnn_model.add(Dropout(0.4))
cnn_model.add(MaxPool2D((2, 2)))
cnn_model.add(Conv2D(64, (3, 3), activation='relu',kernel_initializer=HeNormal))
cnn_model.add(BatchNormalization())
cnn_model.add(Dropout(0.4))
cnn_model.add(MaxPool2D((2, 2)))
cnn_model.add(Conv2D(64, (3, 3), activation='relu',kernel_initializer=HeNormal))
cnn_model.add(Dropout(0.5))
cnn_model.add(BatchNormalization())
cnn_model.add(Flatten())
cnn_model.add(Dense(64, activation='relu',kernel_initializer=HeNormal))
cnn_model.add(Dropout(0.3))
cnn_model.add(Dense(nb_classes,activation="softmax"))
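
As a rough check of the layer stack above, the following sketch traces the spatial size of the feature maps, assuming the network receives one MFCC segment of shape (130, 50, 1) and that the layers use their default settings (3x3 convolutions with `'valid'` padding and stride 1, 2x2 max pooling with stride 2). The helper names are hypothetical; batch normalization and dropout are omitted because they do not change the shape.

```python
def conv_valid(size, kernel=3):
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output length of non-overlapping pooling without padding."""
    return size // pool

def cnn_feature_map(height, width):
    """Trace (height, width) through conv -> pool -> conv -> pool -> conv."""
    shape = (height, width)
    for step in (conv_valid, max_pool, conv_valid, max_pool, conv_valid):
        shape = tuple(step(s) for s in shape)
    return shape

h, w = cnn_feature_map(130, 50)
flattened = h * w * 64   # the last convolution has 64 filters
print((h, w), flattened)
```

Under these assumptions the `Flatten` layer would pass 16704 values to the first dense layer, which is where most of the parameters in such a network would sit.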


In [None]:
cnn_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
cnn_model.summary()

In [None]:
cnn_history = cnn_model.fit(X_train,
                    Y_train,
                    epochs=40,
                    batch_size=64,  
                    validation_data=(X_valid, Y_valid),
                    verbose=1, 
                    callbacks=[es]
                    )

In [None]:
cnn_preds = cnn_model.predict(X_test)

In [None]:
print(classification_report(np.argmax(Y_test,1),np.argmax(cnn_preds,1)))

In [None]:
conf=confusion_matrix(np.argmax(Y_test,1),np.argmax(cnn_preds,1))
sns.heatmap(conf, annot=True, fmt='d', vmax=100)

### 3.3 CNN + LSTM

In [None]:
from tensorflow.keras.layers import Conv1D, MaxPooling1D   # not imported in the earlier cells

cnn2_model = Sequential()
cnn2_model.add(Conv1D(filters=20, kernel_size=48, activation='selu', kernel_initializer='he_normal', input_shape=(X_train.shape[-2],X_train.shape[-1]),padding='same'))
cnn2_model.add(MaxPooling1D())
cnn2_model.add(Dropout(0.4))  
cnn2_model.add(Conv1D(filters=20, kernel_size=48, activation='selu', kernel_initializer='he_normal'))
cnn2_model.add(Dropout(0.4))
cnn2_model.add(LSTM(25, input_shape=(X_train.shape[-2], X_train.shape[-1]),return_sequences=True))
cnn2_model.add(LSTM(20, input_shape=(X_train.shape[-2], X_train.shape[-1])))
cnn2_model.add(Dropout(0.4))
cnn2_model.add(Dense(35, activation='selu',kernel_initializer='he_normal'))
cnn2_model.add(Dropout(0.3))
cnn2_model.add(Dense(nb_classes, activation='softmax'))

In [None]:
cnn2_model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

In [None]:
cnn2_history = cnn2_model.fit(X_train,
                    Y_train,
                    epochs=80,
                    batch_size=128,  
                    validation_data=(X_valid, Y_valid),
                    verbose=1, 
                    callbacks=[es]
                    )

In [19]:
X_train.shape

TensorShape([6990, 130, 50])

In [93]:
class LSTM_Model(tf.keras.Model):
    def __init__(self, size, N, output_dim=10):
        super().__init__()
        self.size = size
        self.N = N

        self.lstm_layers = [tf.keras.layers.LSTM(self.size, return_sequences=True) for _ in range(self.N)]
        self.lstm_final = tf.keras.layers.LSTM(self.size, return_sequences=False)
        self.dense1 = tf.keras.layers.Dense(60, activation='relu',kernel_initializer='he_normal')
        self.dense2 = tf.keras.layers.Dense(output_dim, activation='softmax',kernel_initializer='he_normal')

    def call(self, inputs):
        x = inputs
        #for layer in self.lstm_layers:
        #    x = layer(x)
        x = self.lstm_final(x)
        x = tf.keras.layers.Flatten()(x)
        x = self.dense1(x)
        return self.dense2(x)

In [94]:
lstm = LSTM_Model(60, 2, nb_classes)

In [95]:
lstm.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

In [96]:
lstm.build(input_shape=(None, X_train.shape[-2], X_train.shape[-1]))

In [97]:
lstm.summary()

Model: "lstm__model_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_16 (LSTM)              multiple                  0 (unused)
                                                                 
 lstm_17 (LSTM)              multiple                  0 (unused)
                                                                 
 lstm_18 (LSTM)              multiple                  26640     
                                                                 
 dense_22 (Dense)            multiple                  3660      
                                                                 
 dense_23 (Dense)            multiple                  610       
                                                                 
Total params: 30,910
Trainable params: 30,910
Non-trainable params: 0
_________________________________________________________________
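
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer; the helper names are hypothetical. The input dimension of 50 is the number of MFCC coefficients, and 60 is the layer size passed to `LSTM_Model`.

```python
def lstm_params(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (units + input_dim) + units)

def dense_params(units, input_dim):
    """Weight matrix plus bias vector."""
    return units * input_dim + units

lstm_count = lstm_params(units=60, input_dim=50)     # the final LSTM layer
dense1_count = dense_params(units=60, input_dim=60)  # Dense(60) on top of 60 LSTM outputs
dense2_count = dense_params(units=10, input_dim=60)  # Dense(10) output layer
total = lstm_count + dense1_count + dense2_count
print(lstm_count, dense1_count, dense2_count, total)
```

The two LSTM layers marked as unused contribute nothing because they are never called in `call`, so only the final LSTM layer and the two dense layers hold parameters.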


In [30]:
es = EarlyStopping(monitor='val_accuracy', mode='auto', verbose=1, patience=5, restore_best_weights=True)

In [99]:
lstm.fit(X_train, Y_train, epochs=40, batch_size=16, validation_data=(X_valid, Y_valid), verbose=1, callbacks=[es])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 8: early stopping


<keras.callbacks.History at 0x1e818f15000>

In [25]:
X_train = tf.expand_dims(X_train, axis=-1)
X_valid = tf.expand_dims(X_valid, axis=-1)
X_test = tf.expand_dims(X_test, axis=-1)

In [50]:
class CONV_Model(tf.keras.Model):
    def __init__(self, output_dim=10):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(filters=256, kernel_size=3, activation='relu', input_shape=(X_train.shape[1],X_train.shape[2], 1),padding='valid')
        self.conv3 = tf.keras.layers.Conv2D(filters=256, kernel_size=3, activation='relu', padding='valid')
        self.conv2 = tf.keras.layers.Conv2D(filters=512, kernel_size=3, activation='relu', padding='valid')
        self.ap = tf.keras.layers.AveragePooling2D(pool_size=3, strides=2, padding='same')
        self.dense1 = tf.keras.layers.Dense(256, activation='relu',kernel_initializer='he_normal')
        self.dense2 = tf.keras.layers.Dense(128, activation='relu',kernel_initializer='he_normal')
        self.dense3 = tf.keras.layers.Dense(output_dim, activation='softmax', kernel_initializer='zeros')

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv3(x)
        x = self.ap(x)
        x = self.conv3(x)
        x = self.ap(x)
        x = self.conv2(x)
        x = tf.keras.layers.GlobalAveragePooling2D()(x)
        x = self.dense1(x)
        x = self.dense2(x)
        return self.dense3(x)

In [51]:
conv = CONV_Model(10)
conv.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
conv.build(input_shape=(None, X_train.shape[1], X_train.shape[2], 1))

UnboundLocalError: local variable 'x' referenced before assignment

In [36]:
conv.fit(X_train, Y_train, epochs=40, batch_size=16, validation_data=(X_valid, Y_valid), verbose=1, callbacks=[es])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40

KeyboardInterrupt: 