# Extracting the mel-spectrograms of each audio sample.

---

In this notebook, the audio files and their metadata are processed into mel-spectrograms to be fed into the convolutional model. The genre of each audio file is also extracted separately to serve as its class label.

The audio files and metadata are sourced from the Free Music Archive (FMA) dataset, which contains a 900+ GiB library of 106,574 tracks of Creative Commons-licensed audio from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. For this project, I used only a subset of the dataset. This subset, found as 'fma_small' in the repository, consists of 8000 tracks of 30s each, with 8 balanced genre classes.


### Contents
---

- [Requirements](#reminder)
- [Mel-Spectrograms](#melspec)
- [Pre-processing work](#preproc)
- [Spectrogram generation](#spec)
- [Generating class labels](#classlabels)
---

<a id='reminder'></a>
### Requirements
---

The audio data and metadata should be downloaded into the same directory as this notebook before running these cells. [The dataset and its related metadata can be found here.](https://github.com/mdeff/fma) The repository also documents the metadata, which is quite extensive.

Required files:
- fma_metadata (tracks.csv)
- fma_small (audio tracks)

Required packages:
- sklearn
- librosa
- pandas
- numpy
- ast, os and pickle (Python standard library)

[Librosa](https://librosa.github.io/librosa/) is a powerful audio-analysis package, used here to extract information from each audio track. It can compute many kinds of audio features, but the main one extracted in this notebook is the mel-spectrogram.



<a id='melspec'></a>
### Mel-Spectrograms

---

#### Spectrograms
A spectrogram is a visual representation of the spectrum of frequencies of sound (on the y-axis) as they vary with time (on the x-axis).
![](../Images/melspec.png)
The spectrogram is computed with a short-time Fourier transform (STFT): the floating-point time series of the audio file is split into short, overlapping windows, and a [Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) is applied to each window. This moves the analysis into the frequency domain, and because consecutive windows are hundreds of samples apart, it also greatly reduces the number of time steps compared with the raw waveform, which makes the audio much easier to analyse.
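As a rough illustration of the STFT described above, here is a minimal NumPy sketch, not librosa's implementation. It assumes a Hann window of `n_fft = 2048` samples and a hop of 512 samples (librosa's defaults) and skips the centre padding librosa applies; the helper name `power_spectrogram` is hypothetical.

```python
import numpy as np

def power_spectrogram(y, n_fft=2048, hop_length=512):
    """Simple STFT power spectrogram: Hann window, no centre padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    # Slice the signal into overlapping frames of n_fft samples, hop_length apart
    frames = np.stack([y[i * hop_length:i * hop_length + n_fft] for i in range(n_frames)])
    # FFT of each windowed frame; rfft keeps the n_fft // 2 + 1 non-negative frequency bins
    spectrum = np.fft.rfft(frames * window, axis=1)
    # Squared magnitude = power, transposed to (frequency bins, time frames)
    return np.abs(spectrum.T) ** 2

sr = 22050
t = np.arange(sr) / sr                 # one second of samples
y = np.sin(2 * np.pi * 440.0 * t)      # a 440 Hz sine wave
S = power_spectrogram(y)
peak_bin = S.mean(axis=1).argmax()     # strongest frequency bin on average
print(S.shape, peak_bin * sr / 2048)   # bins x frames, and the peak frequency near 440 Hz
```

One second of audio (22,050 samples) collapses to 40 time frames of 1025 frequency bins each, which is the reduction in time steps mentioned above.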

#### Why use the Mel-scale?

Typically, a spectrogram uses a linear scale on the frequency axis. This does not model human perception well, since we do not perceive frequencies linearly: our auditory system is more discriminative at lower frequencies and less discriminative at higher frequencies.
For example, we would not perceive the 'distance' between a _100 Hz_ (roughly __G2__ in [scientific pitch notation](https://en.wikipedia.org/wiki/Scientific_pitch_notation)) and a _200 Hz_ (roughly __G3__) sine wave to be the same as the distance between a _900 Hz_ (close to __A5__) and a _1000 Hz_ (close to __B5__) sine wave, even though both pairs are 100 Hz apart. The first pair spans a full octave, while the second spans only about two semitones.

You can try it out for yourself [here](http://www.szynalski.com/tone-generator/).

So, how can we scale the frequency domain to match it better to our own perception of sound? A [Mel scale](https://en.wikipedia.org/wiki/Mel_scale) can be used for mapping on the frequency domain instead. It is a scale which utilises a quasi-logarithmic spacing to roughly resemble the way our auditory system resolves and perceives sound. This is important in the context of speech and music audio analysis, as these sounds are made for our own auditory system to perceive, and we want the analysis to be conducted on our own (_biased_) perception of sound.
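To make the comparison concrete, here is a small sketch using the HTK-style mel formula `m = 2595 * log10(1 + f / 700)`. This is one common definition of the mel scale, chosen here as an assumption (other variants exist), and `hz_to_mel` is a hypothetical helper name.

```python
import math

def hz_to_mel(f):
    """HTK-style mel formula (one common variant): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

low_gap = hz_to_mel(200) - hz_to_mel(100)     # 100 Hz apart, one octave
high_gap = hz_to_mel(1000) - hz_to_mel(900)   # 100 Hz apart, under two semitones
print(f"100->200 Hz: {low_gap:.1f} mel, 900->1000 Hz: {high_gap:.1f} mel")
```

On this scale the low pair is roughly twice as far apart as the high pair (about 133 vs 68 mels). Note that librosa's default mel scale (`htk=False`, the Slaney variant) is linear below 1 kHz, so under librosa's own scale these two particular gaps map to the same mel distance; its compression only applies above 1 kHz.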

Therefore, for the spectrograms that were created in this notebook, a mel scale is used to scale the frequency domains.
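The way a mel scale is applied to a spectrogram can be sketched as a bank of triangular filters whose centres are evenly spaced in mels; multiplying the filterbank by a power spectrogram sums the FFT bins under each triangle into one mel band. This is a simplified, HTK-style sketch under my own assumptions (sr = 22050, n_fft = 2048, 128 bands, no normalisation), and all function names are hypothetical. librosa's own filters default to the Slaney mel scale with area normalisation, so its values will differ.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel formula (an assumed variant): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=2048, n_mels=128):
    """Triangular filters evenly spaced in mels, shape (n_mels, 1 + n_fft // 2)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, 1 + n_fft // 2)   # centre frequency of each FFT bin
    # n_mels + 2 edge points: filter i rises from point i, peaks at i + 1, falls to i + 2
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    weights = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        weights[i] = np.maximum(0.0, np.minimum(rising, falling))
    return weights

fb = mel_filterbank()
power = np.random.default_rng(0).random((1025, 10))  # stand-in power spectrogram, 10 frames
mel_spec = fb @ power                                # (128, 10): one row per mel band
print(fb.shape, mel_spec.shape)
```

Because the centres are evenly spaced in mels, the triangles are narrow at low frequencies and wide at high ones, so high-frequency detail is pooled more coarsely.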

---

<a id='preproc'></a>

### Pre-processing on the metadata
---
Firstly, some pre-processing work has to be done on the metadata to clean up _tracks.csv_, which contains metadata on the audio files.
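Some columns of tracks.csv, such as ('track', 'genres'), hold Python lists written out as text, and `pd.read_csv` returns them as plain strings. A minimal sketch of parsing them with `ast.literal_eval`, using a tiny made-up CSV rather than the real file. A genre name like 'Hip-Hop' is not a valid Python literal, so `ast.literal_eval` should only be applied to the list-valued columns.

```python
import ast
import io

import pandas as pd

# A tiny made-up CSV mimicking how list-valued columns are stored in tracks.csv
csv_text = "track_id,genres,genre_top\n2,[21],Hip-Hop\n20,\"[76, 103]\",\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(type(df.loc[2, "genres"]))                   # the list arrives as a str
df["genres"] = df["genres"].map(ast.literal_eval)  # parse the text into real lists
print(df.loc[20, "genres"])
```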


In [8]:
import ast
import os
import pickle

import librosa as lib
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import LabelBinarizer, MinMaxScaler


pd.set_option('display.max_columns', 500)

In [9]:
# Preprocessing to be able to read the tracks csv file

tracks = pd.read_csv('fma_metadata/tracks.csv', index_col=0, header=[0, 1])

# List-valued columns are stored as text (e.g. "[21]"), so parse them back into Python lists.
# ('track', 'genre_top') holds plain strings such as 'Hip-Hop' and is left untouched.
COLUMNS = [('track', 'tags'), ('album', 'tags'), ('artist', 'tags'),
           ('track', 'genres'), ('track', 'genres_all')]
for column in COLUMNS:
    tracks[column] = tracks[column].map(ast.literal_eval)
COLUMNS = [('track', 'date_created'), ('track', 'date_recorded'),
            ('album', 'date_created'), ('album', 'date_released'),
            ('artist', 'date_created'), ('artist', 'active_year_begin'),
            ('artist', 'active_year_end')]
for column in COLUMNS:
    tracks[column] = pd.to_datetime(tracks[column])

SUBSETS = ('small', 'medium', 'large')
categorical_type = CategoricalDtype(categories=SUBSETS, ordered=True)
tracks['set', 'subset'] = tracks['set', 'subset'].astype(categorical_type)

COLUMNS = [('track', 'license'), ('artist', 'bio'),
            ('album', 'type'), ('album', 'information')]
for column in COLUMNS:
    tracks[column] = tracks[column].astype('category')

This is what the track metadata file looks like when converted into a pandas dataframe: 

In [10]:
tracks.head()

track_id,album.comments,album.date_created,album.date_released,album.engineer,album.favorites,album.id,album.information,album.listens,album.producer,album.tags,album.title,album.tracks,album.type,artist.active_year_begin,artist.active_year_end,artist.associated_labels,artist.bio,artist.comments,artist.date_created,artist.favorites,artist.id,artist.latitude,artist.location,artist.longitude,artist.members,artist.name,artist.related_projects,artist.tags,artist.website,artist.wikipedia_page,set.split,set.subset,track.bit_rate,track.comments,track.composer,track.date_created,track.date_recorded,track.duration,track.favorites,track.genre_top,track.genres,track.genres_all,track.information,track.interest,track.language_code,track.license,track.listens,track.lyricist,track.number,track.publisher,track.tags,track.title
2,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
10,0,2008-11-26 01:45:08,2008-02-06,,4,6,,47632,,[],Constant Hitmaker,2,Album,NaT,NaT,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
20,0,2008-11-26 01:45:05,2009-01-06,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01,2011-01-01,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


Next, I extracted the subset of audio files to be analysed (8000 tracks, 30s samples, 8 balanced genres).

In [11]:
small = tracks[tracks['set']['subset'] == 'small']  # getting the tracks from the small database (fma_small)

In [12]:
small_genre = small['track']['genre_top'].reset_index()   # resetting index to get the trackid as a pd column

In [13]:
small_genre['track_id'] = small_genre['track_id'].apply(lambda x:'{0:0>6}'.format(x)) 
# formatting the track_id to fit the filename of the audiofile (eg: 2 -> 000002)

A new dataframe (small_genre) is then created with just the track_id and the top genre per track.
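The zero-padded ID matches how FMA names its audio files, which are grouped into folders named after the first three digits of the ID (e.g. fma_small/099/099134.mp3, as in the failed-file messages further down). A sketch of going from an ID to a file path, with `fma_audio_path` as a hypothetical helper:

```python
import os

def fma_audio_path(track_id, audio_dir="fma_small"):
    """Hypothetical helper: map a numeric track_id to its mp3 path in the FMA folder layout."""
    tid = "{0:0>6}".format(track_id)                       # zero-pad to six digits, e.g. 2 -> '000002'
    return os.path.join(audio_dir, tid[:3], tid + ".mp3")  # folder name = first three digits

print(fma_audio_path(2))
print(fma_audio_path(99134))
```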

In [14]:
small_genre.head()

Unnamed: 0,track_id,genre_top
0,2,Hip-Hop
1,5,Hip-Hop
2,10,Pop
3,140,Folk
4,141,Folk


In [15]:
small_genre.loc[small_genre['track_id'] == '073770', 'genre_top'].values[0]  # sanity check: look up one track's genre by its zero-padded ID

'International'

<a id='spec'></a>
### Generating the Mel-Spectrograms

---

For every track in the folder "fma_small", a mel-spectrogram is generated and saved to its own pickle file for later analysis.

>For the mel-spectrogram parameters, I used:
>- 128 mel bands (`n_mels=128`, which is also librosa's default)
>   - This gives the convolutional layers 128 rows of frequency information to work with.
>- A hop length of 512 samples between frames (librosa's default; the FFT window itself uses librosa's default size of 2048 samples)
>   - This shrinks the time dimension by a factor of roughly 512, which vastly reduces computation. At a sampling rate of 22050 Hz, each frame advances about 23 ms, which is still a fine time resolution for music.
>
>The time domain is also limited to the first 1024 frames (roughly 23.8 s of audio). A full 30 s clip provides more frames than this, and clips too short to fill 1024 frames are skipped by the shape check in the code below. The 128 × 1024 shape also gives an aspect ratio of 1:8.
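The frame arithmetic behind these numbers, as a quick sketch. It assumes librosa's defaults of a 22050 Hz sampling rate, a 512-sample hop and centre padding (which gives `1 + n_samples // hop_length` frames).

```python
sr = 22050           # librosa.load resamples to 22050 Hz by default
hop_length = 512     # librosa's default hop between frames
n_frames_kept = 1024

seconds_per_frame = hop_length / sr
kept_duration = n_frames_kept * seconds_per_frame
# With centre padding (librosa's default), n samples give 1 + n // hop_length frames
frames_in_30s = 1 + (30 * sr) // hop_length

print(f"{seconds_per_frame * 1000:.1f} ms per frame")
print(f"{kept_duration:.2f} s kept in 1024 frames")
print(f"{frames_in_30s} frames in a full 30 s clip")
```

A full 30 s clip yields about 1292 frames, so truncating to 1024 keeps roughly the first 23.8 s.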

>The spectrogram is also converted from power units (squared magnitude) to decibels (dB) relative to its loudest bin, which compresses the large dynamic range into a logarithmic scale closer to how we perceive loudness.
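The dB conversion can be sketched in NumPy as follows. This mirrors my understanding of what `librosa.power_to_db(S, ref=np.max)` computes with its default `amin=1e-10` and `top_db=80`: `10 * log10(S / max(S))`, with values more than 80 dB below the peak clipped. `power_to_db_max_ref` is a hypothetical name.

```python
import numpy as np

def power_to_db_max_ref(S, amin=1e-10, top_db=80.0):
    """Power -> dB relative to the maximum, floored at amin and clipped to top_db below the peak."""
    S = np.asarray(S, dtype=float)
    ref = np.max(S)
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # Anything more than top_db below the loudest value is raised to that floor
    return np.maximum(log_spec, log_spec.max() - top_db)

S = np.array([[1.0, 0.1], [0.01, 1e-12]])
print(power_to_db_max_ref(S))
```

The loudest bin maps to 0 dB and everything else is negative, so every spectrogram shares the same upper bound regardless of the track's overall volume.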


In [29]:
scaler = MinMaxScaler()   # rescales values to [0, 1]; it works column by column, i.e. per time frame

os.makedirs("pickles", exist_ok=True)

for folder in os.listdir('fma_small'):
    if folder.startswith(('0', '1')):      # audio folders are named 000, 001, ...; other files are skipped
        for filename in os.listdir(os.path.join('fma_small', folder)):
            if filename.endswith('.mp3'):
                out_path = os.path.join("pickles", "{}.pickle".format(filename[:-4]))
                if os.path.exists(out_path):   # already processed on an earlier run
                    continue

                path = 'fma_small/' + folder + '/' + filename
                try:
                    # Load the mp3 into a floating-point time series (resampled to 22050 Hz by default)
                    file, sr = lib.load(path)
                except Exception:
                    print(path, "failed")
                    continue

                # Mel-spectrogram with 128 mel bands, truncated to the first 1024 frames
                mel = lib.feature.melspectrogram(y=file, sr=sr, n_mels=128)[:, :1024]
                mel = lib.power_to_db(mel, ref=np.max)   # convert power to dB relative to the peak

                melscaled = scaler.fit_transform(mel)    # min-max scale each time frame to [0, 1]

                x = np.array(melscaled)

                if x.shape == (128, 1024):               # skip clips too short to fill 1024 frames
                    x.dump(out_path)                      # save the spectrogram as a pickle



fma_small/099/099134.mp3 failed




fma_small/108/108925.mp3 failed




fma_small/133/133297.mp3 failed
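One subtlety in the loop above: `MinMaxScaler` treats rows as samples and columns as features, so `fit_transform` on a (128, 1024) spectrogram rescales each of the 1024 time frames independently over its 128 mel bands, rather than scaling the whole spectrogram at once. The NumPy sketch below reproduces that per-column behaviour (the helper name is hypothetical), alongside a single global min-max for comparison, in case a per-spectrogram scale was the intent.

```python
import numpy as np

def minmax_per_column(X):
    """Scale each column of X to [0, 1] independently, like MinMaxScaler.fit_transform."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - col_min) / col_range

mel_db = np.random.default_rng(0).uniform(-80.0, 0.0, size=(128, 1024))  # stand-in dB spectrogram
scaled = minmax_per_column(mel_db)
print(scaled.min(axis=0)[:3], scaled.max(axis=0)[:3])  # every time frame spans [0, 1]

# A single global min-max over the whole spectrogram, for comparison
scaled_global = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
```

With per-frame scaling, quiet and loud frames end up with the same range, so relative loudness across time is lost; the global version keeps it.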


<a id='classlabels'></a>
### Generating class labels

---

The genres of the audio tracks are one-hot encoded into an array of class labels. For 8000 tracks of 8 genres, the class labels form an array of shape (8000, 8).

[SKLearn's LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) came in handy here to do a quick fit-transform into an array.
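What `LabelBinarizer` produces here amounts to one-hot encoding against the sorted list of classes. A NumPy sketch with toy labels (not the real genres) shows the idea. One caveat: with only two classes `LabelBinarizer` returns a single 0/1 column instead of two, which does not arise with 8 genres.

```python
import numpy as np

genres = np.array(["Rock", "Pop", "Rock", "Folk"])   # toy labels, not from the dataset
classes = np.unique(genres)                          # sorted unique class names
# Compare every label against every class: True where they match, then cast to 0/1
one_hot = (genres[:, None] == classes[None, :]).astype(int)
print(classes)
print(one_hot)
```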

In [30]:
ydf = small_genre.set_index('track_id')

In [31]:
lb = LabelBinarizer()
lb.fit(small_genre['genre_top'].values)      
labels = lb.transform(ydf['genre_top'].values)

A dictionary mapping each track_id to its one-hot label is created and saved as a pickle file for easy retrieval later.

In [32]:
y = dict(zip(ydf.index, labels))   # maps each zero-padded track_id to its one-hot label

In [33]:
classes = lb.classes_    # class labels

In [39]:
print(classes)

['Electronic' 'Experimental' 'Folk' 'Hip-Hop' 'Instrumental'
 'International' 'Pop' 'Rock']


In [40]:
with open('y.pickle', 'wb') as handle:
    pickle.dump(y, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [37]:
with open("classlabels.pickle", "wb") as handle:
    pickle.dump(classes, handle)
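When loading these files back for training, note that `ndarray.dump` writes a regular pickle, so reading it with `np.load` requires `allow_pickle=True` (or `pickle.load` can be used directly). A self-contained sketch that writes and reads stand-in files in a temporary directory; the dummy array and label are placeholders, not real data.

```python
import os
import pickle
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the files written above (contents are placeholders)
    spec_path = os.path.join(tmp, "000002.pickle")
    np.zeros((128, 1024)).dump(spec_path)   # same call as in the extraction loop
    with open(os.path.join(tmp, "y.pickle"), "wb") as handle:
        pickle.dump({"000002": np.eye(8, dtype=int)[3]}, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # ndarray.dump writes a pickle, so np.load needs allow_pickle=True to read it back
    spec = np.load(spec_path, allow_pickle=True)
    with open(os.path.join(tmp, "y.pickle"), "rb") as handle:
        y = pickle.load(handle)

print(spec.shape, y["000002"])
```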