# ATML report
## Music genre recognition
Authors:
- Dorian Guyot
- Jonathan Péclat
- Thomas Schaller

Goal : Classify music samples by genre
## Dataset

We used the GTZAN dataset (http://marsyas.info/downloads/datasets.html) in our work. It consists of ten genres, each containing one hundred thirty-second samples. The genres are: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae and Rock.

## Approach description

### Data preparation: spectrograms

A music sample can be represented in multiple ways. Rather than directly feeding the raw bytes of the music file into a neural network, we chose a frequency-domain representation: mel-spectrograms computed with an FFT. Each sample thus becomes a 216x216 pixel monochromatic image.

On the generated image, the abscissa represents time and the ordinate represents frequency. The brighter a pixel is, the stronger that frequency is at that time. The y axis uses a log scale because humans perceive pitch logarithmically (going one octave higher doubles the frequency); since music is designed for human ears, it made sense to go down that road.
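
To make the log-scale argument concrete, the following sketch compares frequencies in Hz with their position on an octave (log2) axis and on the mel scale. It is an illustration only: the helper names are hypothetical, the 440 Hz reference is arbitrary, and the HTK-style mel formula is an assumption, since this report does not state which mel variant was used to generate the spectrograms.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def octaves_from_reference(freq_hz: float, reference_hz: float = 440.0) -> float:
    """Signed number of octaves between freq_hz and a reference frequency."""
    return math.log2(freq_hz / reference_hz)


# The same note over five octaves: the frequency doubles at every step.
notes_hz = [110.0, 220.0, 440.0, 880.0, 1760.0]

# On a log2 axis the notes are evenly spaced, one unit per octave.
octave_positions = [octaves_from_reference(f) for f in notes_hz]

# The mel scale also compresses high frequencies, although not exactly logarithmically.
mel_positions = [hz_to_mel(f) for f in notes_hz]

for f, o, m in zip(notes_hz, octave_positions, mel_positions):
    print(f"{f:7.1f} Hz  octave offset {o:+.1f}  {m:7.1f} mel")
```

On a linear axis the gap between the last two notes is eight times the gap between the first two, whereas on the octave axis all gaps are equal. This is the reason a log-like frequency axis spreads musical content more evenly over the image.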

**Remark**: It is worth noting that good results have also been reached with the raw data as inputs, but we will not use this approach.

### Data augmentation

The GTZAN dataset is pretty small and we are trying to train a (deep) neural network. We will therefore need to do some form of data augmentation to be able to reach interesting results and avoid overfitting.

We tried and used different methods to increase the number of images from a given sample.
  - Adding noise to the spectrograms to prevent the network from training on insignificant details and to increase its tolerance to low-quality samples
  - Splitting a sample into smaller samples without overlap to get more spectrograms
  - Splitting a sample in a "rolling window" manner (into samples that do overlap)
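
The three methods above can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the function names, the window and hop lengths, the noise level and the assumption that pixel values lie in [0, 1] are all hypothetical.

```python
import random
from typing import List


def window_starts(total_len: int, window_len: int, hop_len: int) -> List[int]:
    """Start indices of all full windows of window_len taken every hop_len steps."""
    if window_len <= 0 or hop_len <= 0:
        raise ValueError("window_len and hop_len must be positive")
    return list(range(0, total_len - window_len + 1, hop_len))


def split_columns(columns: list, window_len: int, hop_len: int) -> List[list]:
    """Cut a sequence of time columns into windows.

    hop_len == window_len gives slices without overlap;
    hop_len < window_len gives a rolling window with overlap.
    """
    starts = window_starts(len(columns), window_len, hop_len)
    return [columns[s:s + window_len] for s in starts]


def add_noise(image: List[List[float]], sigma: float, rng: random.Random) -> List[List[float]]:
    """Add Gaussian noise to every pixel and clamp the result to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


# Ten time columns stand in for a full spectrogram.
columns = list(range(10))
no_overlap = split_columns(columns, window_len=4, hop_len=4)
rolling = split_columns(columns, window_len=4, hop_len=2)
print(len(no_overlap), "slices without overlap,", len(rolling), "with a rolling window")

# A tiny 2x2 "image" with a seeded generator so the run is repeatable.
noisy = add_noise([[0.0, 0.5], [1.0, 0.25]], sigma=0.05, rng=random.Random(0))
```

With the same window length, halving the hop roughly doubles the number of slices obtained from one sample, at the cost of strongly correlated training images.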

#### Further work
Other ideas (mainly working on the raw audio) for more data augmentation have been considered but not implemented:
  - Using additional representations for the samples (not only spectrograms)
  - Pitch and tempo shifting
  
### Network architecture

Representing a song as a picture opens the problem up to all the tools used in image processing and, among other things, makes it easy to use a classic CNN. We tried out different types of networks and got different results.

#### The homebrew
A first approach was obviously to try it ourselves and create a network architecture from scratch.

The custom network has the following structure : \\
Input -> 1x216x216 \\
Conv2D -> 128x212x216 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 128x106x108 -> Dropout 0.5 \\
Conv2D -> 64x102x108  -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 64x51x54 -> Dropout 0.5 \\
Conv2D -> 64x48x54 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 64x24x27 -> Dropout 0.5 \\
Conv2D -> 64x24x27 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 64x10x27 -> Dropout 0.5 \\
Linear -> 364 -> LeakyRelu -> Dropout 0.5 \\
Linear -> 182 -> LeakyRelu -> Dropout 0.5 \\
Linear -> 10 \\
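
The feature-map sizes of the first three blocks can be checked with the usual output-size formulas for convolution and pooling. The sketch below traces only the first spatial dimension (216 -> 212 -> 106 -> ...). The kernel sizes 5, 5 and 4, the stride of 1, the absence of padding and the 2x pooling are not stated in this report; they are assumptions inferred from the sizes listed above, and the function names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size: int, kernel: int = 2) -> int:
    """Output length of a non-overlapping max pooling along one dimension."""
    return size // kernel


size = 216                          # first spatial dimension of the input image
inferred_kernel_sizes = [5, 5, 4]   # assumption: inferred from the listed shapes

trace = []
for kernel in inferred_kernel_sizes:
    after_conv = conv_output_size(size, kernel)
    size = pool_output_size(after_conv)
    trace.append((after_conv, size))

print(trace)  # [(212, 106), (102, 51), (48, 24)]
```

The second spatial dimension is only halved by the pooling layers (216 -> 108 -> 54 -> 27), which suggests that the convolutions leave it unchanged.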

The results obtained with this architecture were quite good (around 68% accuracy), given the difficult conditions (mainly the dataset size).

#### Finetuning with ResNet
Since the GTZAN dataset is really small, it was unthinkable to train a (very) deep network on it from scratch. The clear answer to this problem was to use a pre-trained network and transfer learning. We fine-tuned ResNet18 (pretrained on ImageNet) in this case, but many other options such as VGG were considered. A linear classifier was appended to ResNet to fit the ten categories. Unsurprisingly, this yielded very good results (an accuracy of over 95%).


## Folder structure
```
ATML19
│   README.md : Simple readme for github
│   report.ipynb : Project report (this file)
│   report.html : Same as report.ipynb but in HTML format
│   pres.pdf : The pdf of the presentation
│
└─── notebooks : Notebook files where tests were made
│   │   create_data_folder.ipynb : Create the data folder once the spectrograms are created
│   │   model_dorian.ipynb : Testing different models on data
│   │   model_thomas.ipynb : Testing different models on data
│   │   spectrogram.ipynb : Creating spectrograms from wav files
│   │   generate_experiments.ipynb : Generate a bar plot from the genre classification of external wav files.
│   │   src_python : Contains scripts used in generate_experiments (same as test_src)
│   
└─── test_src : Small app to test the final model
│   │   user_app.py : Main file of the app
│   └─── generate_data : Folder containing the file to create the spectrogram of the music
│   │    │ spectrogram.py : Class to create the spectrogram of the music
│   │
│   └─── models : Folder containing files for the model
│   │    │ best_model_resnet : State dict of the best model created with resnet
│   │    │ model.py : Model of the project. Allows predicting the genre of a piece of music
│   
└─── train_src : Small app to train the model
│   │   main.py : Main file of the app to train the model
│   └─── generate_data : Folder containing files to process the data and create the data directory.
│   │    │ data_folder.py : Create the data folder so that ImageFolder from pytorch can be used afterwards.
│   │    │ spectrogram.py : Create the spectrograms of all the music files contained in a folder.
│   │
│   └─── model : Folder containing files for training the model
│   │    │ dataloader.py : Create the dataloader with the data of the data directory
│   │    │ model.py : train the model with the dataloader and save the best one
```

# Results

There are 10 different genres of music to classify. An untrained network therefore has a 1 in 10 chance of guessing right, so 10% is the baseline.

Both networks were trained for 50 epochs with Adam, a cross-entropy loss and an initial learning rate of 0.001 that decreases over time.
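
To put the loss values reported below in perspective, the following sketch computes the cross-entropy of a classifier that assigns the same probability to all ten genres, and converts a mean loss back into a typical probability for the correct genre. It assumes the loss is the mean natural-log cross-entropy, which this report does not state explicitly; the function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probabilities: Sequence[float], true_index: int) -> float:
    """Cross-entropy (natural log) of one prediction given the true class index."""
    return -math.log(probabilities[true_index])


def typical_probability(mean_loss: float) -> float:
    """Geometric mean of the probability given to the correct class for a mean loss."""
    return math.exp(-mean_loss)


num_genres = 10
uniform_guess = [1.0 / num_genres] * num_genres

# A network that cannot tell genres apart scores ln(10), about 2.303.
chance_loss = cross_entropy(uniform_guess, true_index=0)
print(f"chance-level loss: {chance_loss:.3f}")

# Probabilities on the correct genre implied by the two reported losses.
for reported_loss in (0.92, 0.16):
    print(f"loss {reported_loss:.2f} -> {typical_probability(reported_loss):.2f}")
```

Under this assumption, any loss well below ln(10) indicates that the network has learned something beyond the 10% baseline.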

## The homebrew
With our custom architecture we were able to reach an accuracy of about 68% (loss of 0.92) on the test set. More complex models tended to overfit, and smaller models never yielded better results.

## Fine-tuned ResNet
As stated above, the result with ResNet is on another level: we reached 95% accuracy (loss of 0.16) on the test set.

### Going further
The custom network was only ever trained on the GTZAN dataset, which pretty much restricted the size of the network we could use. Pre-training its convolutional part on an unsupervised task that takes spectrograms as inputs, and then fine-tuning it on the exact task, would be a great way to get the best of both worlds (custom convolution kernels and transfer learning).


# Example of use

## Training the network yourself
If you want to train the network yourself you will need some additional data. You can either download the raw wav files and do the processing by yourself with our app, or directly download the already processed data.

The `data` folder contains the spectrograms of the wav files. Here is a link to download the two folders: https://www.dropbox.com/sh/dg1crj9yimefgpb/AADcOLk9fkLxFbaO7dn-rACDa?dl=0

## Classifying music with the app
If you want to use the trained network with no hassle to predict the genre of a song, you can simply use the small app we coded for that purpose. Just run the Python file named "user_app.py" in the folder "test_src". You then get a very basic graphical user interface that lets you select a file with a button at the top of the window and then hit the "classify" button. The output is the probability the network gives to each genre for the song you gave as input.

### How does the app classify an entire song ?
The app classifies a whole song into a category. Since the network only ever works on small slices of a few seconds, the app has to do some additional work. What it actually does is similar to the data preparation / data augmentation: it runs a sliding window over the whole song, classifies all the slices and finally sums up all the results.
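
The aggregation step can be sketched as follows. This is a hedged illustration rather than the app's actual code: the function names and the example probabilities are hypothetical, and the sum is divided by the number of slices (as in the notebook code further down) so that the result remains a probability distribution.

```python
from typing import List, Sequence


def average_predictions(slice_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-genre probabilities predicted for every slice of a song."""
    if not slice_probs:
        raise ValueError("at least one slice is required")
    num_slices = len(slice_probs)
    num_genres = len(slice_probs[0])
    return [sum(probs[g] for probs in slice_probs) / num_slices
            for g in range(num_genres)]


def top_genre(song_probs: Sequence[float], genre_names: Sequence[str]) -> str:
    """Name of the genre with the highest aggregated probability."""
    best_index = max(range(len(song_probs)), key=lambda g: song_probs[g])
    return genre_names[best_index]


# Made-up predictions for three slices of one song and three genres.
genre_names = ["blues", "classical", "country"]
slice_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
]

song_probs = average_predictions(slice_probs)
print(top_genre(song_probs, genre_names), [round(p, 3) for p in song_probs])
```

Averaging over many slices makes the final decision less sensitive to a single atypical passage, such as a quiet intro.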

## Classifying the music programmatically
If you don't want to use the app or want to classify the music in your own way, you can do so by using our code as follows:

In [0]:
import os
import numpy as np
from test_src.generate_data.spectrogram import Spectrogram
from test_src.models.model import Model

data_folder = 'test_data'
styles = ['blues','classical','country','disco','hiphop','jazz','metal','pop','reggae','rock']

# Load the trained classifier and the spectrogram generator
model_dict_path = "test_src/models/best_model_resnet"
model = Model()
model.load(model_dict_path)
spectrogram = Spectrogram()

# Pick the first song found under data_folder/<genre>/ (replace with any file you like)
genre = sorted(d for d in os.listdir(data_folder) if os.path.isdir(os.path.join(data_folder, d)))[0]
song = sorted(os.listdir(os.path.join(data_folder, genre)))[0]

# Classify every slice of the song, then average the per-genre scores (in percent)
imgs = spectrogram.sample(os.path.join(data_folder, genre, song))
results_sum = np.zeros(len(styles))
for img in imgs:
    results_sum += model.predict_image(img)
results = results_sum / len(imgs) * 100.0

## Generating experiments

We also ran our network on full-length, well-known songs that were not in the dataset. The code snippets below reproduce this experiment and help to understand how everything works together.

**Beware**

The code below can also be found in the notebook `generate_experiments`.

The folder that contains the music to be processed below is located at the root of the project under the name `test_data`.

To begin, we have to make the necessary imports.

In [0]:
import os
import matplotlib.pyplot as plt
import numpy as np
from test_src.generate_data.spectrogram import Spectrogram
from test_src.models.model import Model

Set up matplotlib to work in the notebook.

In [0]:
%matplotlib inline

Then, list the music in the designated folder.

In [0]:
experiments_folder = 'test_data'
styles = ['blues','classical','country','disco','hiphop','jazz','metal','pop','reggae','rock']

In [0]:
test_data = []
for file in os.listdir(experiments_folder):
    if os.path.isdir(os.path.join(experiments_folder, file)):
        for file2 in os.listdir(os.path.join(experiments_folder, file)):
            if os.path.isfile(os.path.join(experiments_folder,file,file2)):
                test_data.append([file, file2])

Now, we need to instantiate a class that will help us with the generation of the spectrograms, as well as the classifier (whose weights are stored at `test_src/models/best_model_resnet`).

In [0]:
model_dict_path = "test_src/models/best_model_resnet"
model = Model()
model.load(model_dict_path)
spectrogram = Spectrogram()
length_data = len(test_data)

Finally, for each song we generate its corresponding spectrograms and predict the genre using our model. Then, to better visualize the result, a bar plot is created with the percentage obtained for each of the genres. (Remember that we classify each slice and then sum the results up.)

In [0]:
for i, data in enumerate(test_data, start=1):
    print("Processing music ", i, "/", length_data)
    # data[0] is the genre folder, data[1] the file name
    imgs = spectrogram.sample(os.path.join(experiments_folder, data[0], data[1]))
    # Accumulate the per-genre scores of every slice, then average them (in percent)
    results_sum = np.zeros(len(styles))
    for img in imgs:
        results_sum += model.predict_image(img)
    results = results_sum / len(imgs) * 100.0
    # One bar per genre
    y_pos = np.arange(len(styles))
    plt.bar(y_pos, results, align='center', alpha=0.5)
    plt.xticks(y_pos, styles, rotation=45)
    plt.ylim([0, 100])
    plt.ylabel('Percent')
    plt.title('Title: ' + data[1] + ', True genre: ' + data[0])
    plt.show()