# ATML report
## Music genre recognition
Authors:
- Dorian Guyot
- Jonathan Péclat
- Thomas Schaller

Goal: Classify music samples by genre
## Dataset

We used the GTZAN dataset (http://marsyas.info/downloads/datasets.html) in our work. It consists of ten genres, each containing one hundred samples of thirty seconds. The genres are: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae and Rock.

## Approach description

### Data preparation: spectrograms

A music sample can be represented in multiple ways. Rather than directly feeding the raw bytes of the music file into a neural network, we chose a frequency-domain representation: spectrograms produced through an FFT. Each sample is thus described by a 216x216 pixel monochromatic image.

On the generated image, the abscissa represents time and the ordinate represents frequency. The brighter a pixel is, the stronger the corresponding frequency is at that time. The y axis uses a log scale because humans perceive pitch logarithmically (going one octave higher doubles the frequency), and since music is designed for human ears, it made sense to go down that road.
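As a rough illustration of the representation described above, the sketch below computes a log-frequency spectrogram with NumPy only. It is not the project's `Spectrogram` class: the function name `log_spectrogram` and the parameters `n_fft`, `hop`, `n_bins` and `f_min` are assumptions, since the report does not state the FFT settings. The time axis is left at its natural length, so producing exactly 216x216 pixels would still require cropping or resizing.

```python
import numpy as np

def log_spectrogram(signal, sample_rate, n_fft=1024, hop=512, n_bins=216, f_min=20.0):
    """Return a (n_bins, n_frames) magnitude spectrogram in dB on a log-frequency axis.

    All parameter values are illustrative assumptions, not the project's settings.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    # FFT magnitude of each frame: shape (n_frames, n_fft // 2 + 1)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    linear_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Log-spaced target frequencies from f_min up to the Nyquist frequency
    log_freqs = np.geomspace(f_min, sample_rate / 2, n_bins)
    # Interpolate each frame from the linear onto the log-spaced frequency axis
    resampled = np.stack([np.interp(log_freqs, linear_freqs, frame) for frame in magnitude])
    db = 20.0 * np.log10(resampled + 1e-10)
    return db.T  # rows: frequency (low to high), columns: time

# Usage: one second of a 440 Hz sine sampled at 22050 Hz
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t), sample_rate)
```

With these settings, the brightest row of `spec` lies close to 440 Hz, and each octave occupies the same number of rows.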

**Remark**: It is worth noting that good results have also been reached with the raw data as inputs, but we will not use this approach.

### Data augmentation

The GTZAN dataset is pretty small and we are trying to train a (deep) neural network. We will therefore need to do some form of data augmentation to be able to reach interesting results and avoid overfitting.

We tried and used different methods to increase the number of images from a given sample:
  - Adding noise to the spectrograms to prevent the network from training on insignificant details and to increase its tolerance to low-quality samples. This was a very significant element and increased the accuracy by about 5%.
  - Splitting a sample into smaller samples without overlap to get more spectrograms
  - Splitting a sample in a "rolling window" manner (into samples that do overlap)
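The three augmentation methods listed above can be sketched as follows. This is a minimal illustration, not the project's code: the helper names `split_windows` and `add_noise`, the Gaussian noise model, the noise level `std` and the assumption that spectrogram values are normalized to [0, 1] are all hypothetical.

```python
import numpy as np

def split_windows(spec, width, step):
    """Cut a (freq, time) spectrogram into windows of `width` columns.

    step == width gives windows without overlap; step < width gives a rolling window.
    """
    starts = range(0, spec.shape[1] - width + 1, step)
    return [spec[:, s:s + width] for s in starts]

def add_noise(spec, std=0.05, rng=None):
    """Add Gaussian noise and clip back to [0, 1] (assumes normalized pixel values)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = spec + rng.normal(0.0, std, size=spec.shape)
    return np.clip(noisy, 0.0, 1.0)

# Usage on a random stand-in for a long spectrogram (216 frequency rows, 1000 time columns)
rng = np.random.default_rng(0)
spec = rng.random((216, 1000))
disjoint = split_windows(spec, width=216, step=216)  # windows without overlap
rolling = split_windows(spec, width=216, step=108)   # windows overlapping by half
noisy = add_noise(disjoint[0], std=0.05, rng=rng)
```

Halving the step roughly doubles the number of training images obtained from the same sample, at the cost of strongly correlated images.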

#### Further work
Other ideas (mainly working on the raw audio) for more data augmentation have been considered but not implemented:
  - Using additional representations for the samples (not only spectrograms)
  - Pitch and tempo shifting
  
### Network architecture

Having a representation of a song in the form of a picture opens the problem up to all the tools used in image processing and, among other things, makes it easy to use a classic CNN. We tried out different types of networks and got different results.

#### The homebrew
A first approach was obviously to try it ourselves and create a network architecture from scratch.

The custom network has the following structure:

  - Input -> 1x216x216
  - Conv2D -> 128x212x216 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 128x106x108 -> Dropout 0.5
  - Conv2D -> 64x102x108 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 64x51x54 -> Dropout 0.5
  - Conv2D -> 64x48x54 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 64x24x27 -> Dropout 0.5
  - Conv2D -> 64x24x27 -> BatchNorm2D -> LeakyRelu 0.2 -> MaxPool2D -> 64x10x27 -> Dropout 0.5
  - Linear -> 364 -> LeakyRelu -> Dropout 0.5
  - Linear -> 182 -> LeakyRelu -> Dropout 0.5
  - Linear -> 10

The results obtained with this architecture were quite good (around 75% accuracy), given the difficult conditions (mainly the dataset size).
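The spatial sizes of the first three convolution blocks listed above can be checked with the standard output-size formula for convolutions. The kernel sizes used below, (5, 1), (5, 1) and (4, 1) with stride 1 and no padding, as well as the 2x2 pooling, are inferred from the listed shapes and are therefore assumptions; the report does not state them. The sketch stops after the third block.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2):
    """Output length of non-overlapping max pooling along one dimension."""
    return size // kernel

# (kernel_height, kernel_width) per block: assumed values inferred from the shapes
kernels = [(5, 1), (5, 1), (4, 1)]

height, width = 216, 216
shapes = []
for kernel_h, kernel_w in kernels:
    height, width = conv_out(height, kernel_h), conv_out(width, kernel_w)
    shapes.append(("conv", height, width))
    height, width = pool_out(height), pool_out(width)
    shapes.append(("pool", height, width))

for name, h, w in shapes:
    print(f"{name}: {h}x{w}")
```

Under these assumptions the kernels span several frequency rows but a single time column, which would match the unchanged width after each convolution.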

#### Finetuning with ResNet
Since the GTZAN dataset is really small, it was unthinkable to train a (very) deep network on it from scratch. The clear answer to this problem was to use a pre-trained network and transfer learning. We fine-tuned ResNet18 (pretrained on ImageNet) in this case, but many other options such as VGG were considered. A linear classifier was appended to ResNet to fit the ten categories. Unsurprisingly, this yielded very good results (with an accuracy of over 95%).


## Folder structure
```
ATML19
│   README.md : Simple readme for github
│   report.ipynb : Project report (this file)
│   report.html : Same as report.ipynb but in HTML format
│   pres.pdf : The pdf of the presentation
│
└─── notebooks : Notebooks in which the tests were made
│   │   create_data_folder.ipynb : Create the data folder once the spectrograms have been created
│   │   model_dorian.ipynb : Testing different models on the data
│   │   model_thomas.ipynb : Testing different models on the data
│   │   spectrogram.ipynb : Creating spectrograms from wav files
│   │   generate_experiments.ipynb : Generate bar plots of the genre classification for external wav files.
│   │   src_python : Contains the scripts used in generate_experiments (same as test_src)
│   
└─── test_src : Small app to test the final model
│   │   user_app.py : Main file of the app
│   └─── generate_data : Folder containing the file to create the spectrogram of the music
│   │    │ spectrogram.py : Class to create the spectrogram of the music
│   │
│   └─── models : Folder containing files for the model
│   │    │ best_model_resnet : State dict of the best model created with resnet
│   │    │ model.py : Model of the project. Allows predicting the genre of a song
│   
└─── train_src : Small app to train the model
│   │   main.py : Main file of the app to train the model
│   └─── generate_data : Folder containing files to process the data and create the data directory.
│   │    │ data_folder.py : Create the data folder so that ImageFolder from PyTorch can be used afterwards.
│   │    │ spectrogram.py : Create the spectrograms of all songs contained in a folder.
│   │
│   └─── model : Folder containing files for training the model
│   │    │ dataloader.py : Create the dataloader with the data of the data directory
│   │    │ model.py : train the model with the dataloader and save the best one
```

# Results

There are 10 different music genres to classify. An untrained network therefore has a 1 in 10 chance of guessing right, so an accuracy of 10% is the baseline.

Both networks were trained for 50 epochs with Adam, cross-entropy loss and an initial learning rate of 0.001 that decreases over time.
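As a reference point for the loss values reported in this section, a classifier that assigns the same probability to all ten genres has a cross-entropy of ln(10), roughly 2.30, whatever the true genre is. The short computation below illustrates this; the helper name `cross_entropy` is only for illustration.

```python
import math

num_classes = 10

def cross_entropy(probabilities, true_index):
    """Cross-entropy loss of one prediction: minus the log-probability of the true class."""
    return -math.log(probabilities[true_index])

# A classifier that has learned nothing: equal probability for every genre
uniform = [1.0 / num_classes] * num_classes

baseline_accuracy = 1.0 / num_classes
baseline_loss = cross_entropy(uniform, true_index=3)  # same value for any true_index

print(f"baseline accuracy: {baseline_accuracy:.0%}")
print(f"baseline loss: {baseline_loss:.2f}")
```

Any reported loss well below this value indicates that the network puts noticeably more probability on the correct genre than chance would.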

## The homebrew
We started with a very simple model and increased the complexity gradually until we reached a sweet spot just before overfitting. The best results we have been able to reach are an accuracy of about 75% (loss of 1.13) on the test set.

<img src="http://dragoo.ch/images/ATML/best_homebrew.png" width = 80%>

Any more complex model tended to overfit. The graph below is from the same network as above, but with one more convolutional layer. (The training was interrupted because the overfitting was clear.)

<img src="http://dragoo.ch/images/ATML/homebrew_overfit.png" width = 80%>


## Transfer learning with ResNet

Since the dataset is small and we are working on images, it made sense to consider transfer learning with one of the well-known image classification networks such as VGG, ResNet or AlexNet.

We used a ResNet pretrained on ImageNet. While it is true that spectrograms are not at all like objects from the ImageNet dataset, the low-level features learned on ImageNet may come in handy for the "understanding" that the network builds of spectrograms, especially when fine-tuning.


### Fixed features
With fixed features, we were able to reach an accuracy of about 68% (loss of 0.92), which is not bad considering the fact that spectrograms are not at all present in ImageNet, on which ResNet had been trained.
<img src="http://dragoo.ch/images/ATML/resnet_fixed_feature.png" width = 80%>

### Finetuning

Using a pretrained ResNet and fine-tuning it to adapt to spectrograms has led to the best results we were able to achieve with 95% accuracy (loss of 0.16) on the test set.

<img src="http://dragoo.ch/images/ATML/rewsnet_fine_tuning.png" width = 80%>

### Conclusion
As could be expected in our setting (small dataset), the best option was to use transfer learning and fine-tuning. We can see in both graphs using ResNet that the network learns really fast and then stagnates. The fast convergence is probably due to the depth of the network, and the plateau could be explained by the small size of the dataset: the network is able to learn its task really fast thanks to its capacity, exhausts all the information present in the dataset and is then unable to learn any further.

### Going further
The custom network was only ever trained on the GTZAN dataset, which pretty much restricted the size of the network we could train without making use of transfer learning. Pre-training its convolutional part on an unsupervised task that takes spectrograms as inputs, and then fine-tuning it on the exact task, would be a great way to get the best of both worlds (custom convolution kernels and transfer learning).

It would also be really beneficial to work on a bigger dataset. It would allow our own model to be more complex and provide more information for the ResNet models to learn from.


# Experiments

We have tested the classification on songs outside the dataset and have gotten different results:

**Very good results**

A significant proportion of the tested songs were classified correctly and some of them with great confidence. This was mostly the case for music genres that were also really distinguishable from the others, such as classical or reggae.

<img src="http://dragoo.ch/images/ATML/good_classification.PNG" img>

**Understandable bad results**

Since even for humans the classification is not always clear, we cannot expect the network to do a lot better. We had several cases of wrong classification that were "understandable" in the sense that the song was close to another genre, or a mix of both. In this case, one can argue that the rock, jazz and blues genres sometimes overlap.

<img src="http://dragoo.ch/images/ATML/understandable_classification.PNG" img>

**Bad results**

Of course, we also had songs that were classified completely wrongly, with no apparent similarity between the predicted and the true genre. In these cases the network is not very confident in its prediction, and the right genre still gets a low probability. An example can be seen below (the true genre is metal, but it gets classified as pop and the metal probability is nearly zero):

<img src="http://dragoo.ch/images/ATML/bad_classification.PNG" img>


# Conclusion

Music genre recognition is clearly feasible but should be interpreted in a fuzzy manner: a song can belong to multiple genres depending on the instruments present in the piece, the rhythm or even the lyrics. A single music genre covers a lot of sub-genres, and it is virtually impossible to classify everything correctly since even humans argue about this. It would be great to have a bigger dataset with multiple genres assigned to each song to enable better training and a more meaningful classification.

# Example of use

## Training the network yourself
If you want to train the network yourself you will need some additional data. You can either download the raw wav files and do the processing by yourself with our app, or directly download the already processed data.

The `data` folder contains the spectrograms of the wav files. Here is a link to download the two folders: https://www.dropbox.com/sh/dg1crj9yimefgpb/AADcOLk9fkLxFbaO7dn-rACDa?dl=0

## Classifying music with the app
If you want to use the trained network with no hassle to predict the genre of a song, you can simply use the small app we coded for that purpose. Just run the Python file named "user_app.py" in the folder "test_src". You then have a very basic graphical user interface that enables you to select a file with a button at the top of the window, and then hit the "classify" button. The output is the probability the network gives to each genre for the song you gave as input.

### How does the app classify an entire song?
The app classifies a whole song into a category. Since the network only ever works on small slices of a few seconds, the app has to do some additional work. What it actually does is similar to the data preparation / data augmentation: it runs a sliding window over the whole song, classifies all the slices and finally sums up all the results.

## Classifying the music programmatically
If you don't want to use the app or want to classify the music in your own way, you can do so by using our code as follows:

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
from test_src.generate_data.spectrogram import Spectrogram
from test_src.models.model import Model

data_folder = 'test_data'
styles = ['blues','classical','country','disco','hiphop','jazz','metal','pop','reggae','rock']

# Create the model
model_dict_path = "test_src/models/best_model_resnet"
model = Model()
model.load(model_dict_path)

# Generate all spectrograms
spectrogram = Spectrogram()
song_path = 'path/to/the/song'  # placeholder: replace with the path to your audio file
imgs = spectrogram.sample(song_path)

# Predict the style
results_sum = np.array([0.0] * len(styles))
for img in imgs:
    results_sum += model.predict_image(img)
results = results_sum / len(imgs) * 100.0
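To turn a `results` vector like the one computed above into a readable ranking, the per-genre percentages can be paired with the genre names and sorted. The helper `rank_genres` and the example percentages below are hypothetical and only serve as an illustration; they are not actual outputs of the network.

```python
styles = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def rank_genres(results, styles):
    """Pair each genre with its percentage and sort from most to least likely."""
    return sorted(zip(styles, results), key=lambda pair: pair[1], reverse=True)

# Made-up percentages standing in for the `results` vector
example_results = [2.0, 1.0, 3.0, 5.0, 4.0, 6.0, 1.5, 60.0, 7.5, 10.0]

ranking = rank_genres(example_results, styles)
for genre, percent in ranking[:3]:
    print(f"{genre}: {percent:.1f}%")
```

Showing the top few genres instead of only the first one fits the observation that neighbouring genres often overlap.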

## Generating experiments

We have also run our network on full-length, well-known songs that were not in the dataset. The code snippets below show how to do this and help to understand how everything works together.

**Remark**

The code below can also be found in the notebook `generate_experiments`.

The folder that contains the music to be processed below is located at the root of the project under the name `test_data`.

To begin, we have to make the necessary imports.

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
from test_src.generate_data.spectrogram import Spectrogram
from test_src.models.model import Model

Set up matplotlib to work in the notebook.

In [None]:
%matplotlib inline

Then, list the music in the designated folder.

In [None]:
experiments_folder = 'test_data'
styles = ['blues','classical','country','disco','hiphop','jazz','metal','pop','reggae','rock']

In [None]:
test_data = []
for file in os.listdir(experiments_folder):
    if os.path.isdir(os.path.join(experiments_folder, file)):
        for file2 in os.listdir(os.path.join(experiments_folder, file)):
            if os.path.isfile(os.path.join(experiments_folder,file,file2)):
                test_data.append([file, file2])

Now, we need to instantiate a class that will help us with the generation of the spectrograms, as well as the classifier (whose weights are stored at `test_src/models/best_model_resnet`).

In [None]:
model_dict_path = "test_src/models/best_model_resnet"
model = Model()
model.load(model_dict_path)
spectrogram = Spectrogram()
length_data = len(test_data)

Finally, for each song we generate its corresponding spectrograms and predict the genre using our model. Then, to better visualize the result, a bar plot is created with the percentage obtained for each of the genres. (Remember that we classify each slice and then sum them up.)

In [None]:
i = 1
for data in test_data:
    print("Processing music ",i,"/",length_data)
    imgs = spectrogram.sample(os.path.join(experiments_folder,data[0],data[1]))
    results_sum = np.array([0.0] * len(styles))
    for img in imgs:
        results_sum += model.predict_image(img)
    results = results_sum / len(imgs) * 100.0
    y_pos = np.arange(len(styles))
    plt.bar(y_pos, results, align='center', alpha=0.5)
    plt.xticks(y_pos, styles, rotation=45)
    plt.ylim([0,100])
    plt.ylabel('Percent')
    plt.title('Title: '+data[1]+', True genre: '+data[0])
    plt.show()
    i += 1