In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy, matplotlib.pyplot as plt, IPython.display as ipd, sklearn
import os
import librosa
import librosa.display
from ggplot import *
import seaborn as sns

# Purpose

Our research question was the following: __Can we create a machine learning model that can classify songs into their respective genres with reasonable accuracy?__

The end goal of this project is to use a pre-trained machine learning model to classify the audio files directly, and either compare its performance to this model or combine the two into a hybrid model.

However, for the purpose of feature exploration, it will be more useful to explore the features in a tabular state.

# Examining data sources

## Data Source 1

For this project, we were given an initial data source from Kaggle organized by Andrada Olteanu from Romania.
The dataset, known as the [GTZAN Dataset](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification), is a collection of 30-second snippets from songs, with 100 songs for each of 10 genres, for a total of 1000 songs.

The dataset also contains features of audio analysis for each of the songs in .csv form that we will be using to explore the data.

While Andrada organized, uploaded, and generated features from the song snippets, the original data was collected by G. Tzanetakis and P. Cook and described in _IEEE Transactions on Audio and Speech Processing_ in 2002, a journal for the "science, technologies and applications that relate to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language".
The songs were all pulled from personal CDs, radio, and microphone recordings in order to represent a variety of recording conditions, and are all 22050Hz Mono 16-bit audio files in .wav format.

## Data Source 2

We also added our own dataset of songs downloaded from YouTube playlists by genre using a command line program. There are 60+ songs for each genre, all 22050Hz Mono 16-bit audio files in .wav format, each cut down to a 30-second snippet using the digital audio workstation _Ableton_.

Many of these songs were released around the same time; for example, most of the rap songs came from the same playlist and share a similar release period. To get a more well-rounded dataset, I will soon download more playlists with older or newer music from each genre. Our existing GTZAN dataset has mostly older music, as it is from 2002.

The features for this data will be generated by us later on in the exploration and incorporated into the GTZAN Dataset.

## Data Source 1

Let's begin by printing the head of our entire dataframe. It is wide so it'll have to be printed out 20 columns at a time.

In [None]:
df=pd.read_csv("../input/gtzan-dataset-music-genre-classification/Data/features_30_sec.csv")
df.iloc[:,:20].head()

In [None]:
df.iloc[:,20:40].head()

In [None]:
df.iloc[:,40:60].head()

That is a lot of columns!

Just doing a simple inspection of our data, we see it is 1000 rows by 60 columns. There are quite a lot of features that were generated for us by Andrada (58 in total and all numerical), so let's quickly move through them to get a better sense of our data.

If we want to incorporate files from our second data source, we should also know how each of these variables was calculated so we can generate them for our own songs.

In [None]:
df["chroma_stft_mean"].describe()

In music, the term [chroma feature][1] relates to the twelve different pitch classes. Chroma-based features can be used to categorize music with meaningful pitches (usually into 12 pitch classes) and whose tuning can be approximated by the equal-tempered scale.

[1]: https://en.wikipedia.org/wiki/Chroma_feature
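
To make the idea of a pitch class concrete, here is a minimal sketch that folds a frequency onto one of the twelve classes, assuming equal temperament with A4 tuned to 440 Hz. The function name `pitch_class` and the tuning reference are illustrative assumptions; this is not how librosa computes a chromagram internally, only the underlying mapping.

```python
import math

# The twelve pitch classes, starting from C (MIDI note numbers that are multiples of 12).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Return the equal-tempered pitch class closest to freq_hz (hypothetical helper)."""
    # MIDI note 69 is A4; each semitone is a factor of 2**(1/12) in frequency.
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    # Notes an octave apart share the same class, hence the modulo 12.
    return PITCH_CLASSES[midi_note % 12]

print(pitch_class(440.0))   # A4
print(pitch_class(880.0))   # A5, one octave higher, same pitch class
print(pitch_class(261.63))  # middle C
```

A chromagram applies the same folding to the energy in every frequency bin of a spectrogram, which is why it has exactly 12 rows.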


In our dataset, Andrada used the [librosa][2] package, a python package for music and audio analysis, to extract features from each of our song snippets.

[2]: https://librosa.org/doc/latest/index.html

Let's generate a Chromagram from one of our blues songs using [librosa][2].

In [None]:
y, sr = librosa.load("../input/gtzan-dataset-music-genre-classification/Data/genres_original/blues/blues.00000.wav")
ipd.Audio(y, rate=sr)

In [None]:
# Power spectrogram with a longer FFT window (the earlier default-window version was immediately overwritten)
S = np.abs(librosa.stft(y, n_fft=4096))**2
chroma = librosa.feature.chroma_stft(S=S, sr=sr)
fig, ax = plt.subplots()
img = librosa.display.specshow(chroma, y_axis='chroma', x_axis='time', ax=ax)
fig.colorbar(img, ax=ax)
ax.set(title='Chromagram')
plt.show()

For our generated features, the mean and variance were taken across the entire generated array.

In [None]:
np.mean(chroma)

In [None]:
df["rms_mean"].describe()

[RMS level][3] (root mean square) is fairly simple: it is the square root of the average signal power, so __it grows with the amount of energy in the signal over a period of time.__ This can be used to distinguish louder songs from quieter ones.

[3]: https://en.wikipedia.org/wiki/Audio_power#Continuous_power_and_%22RMS_power%22

This RMS value is generated using numpy and is a fairly simple calculation (again, root mean squared).
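
As a minimal sketch of that calculation, the snippet below computes the RMS of a whole signal with numpy. The helper name `rms_level` and the synthetic sine waves are assumptions for illustration; the GTZAN `rms_mean` column may have been produced with different framing.

```python
import numpy as np

def rms_level(signal):
    """Root mean square of a 1-D signal: sqrt(mean(x**2))."""
    signal = np.asarray(signal, dtype=float)
    return float(np.sqrt(np.mean(signal ** 2)))

# One second of a 440 Hz sine wave at the dataset's sample rate.
sr = 22050
t = np.arange(sr) / sr
sine = np.sin(2 * np.pi * 440 * t)

loud = rms_level(sine)         # full-amplitude tone
quiet = rms_level(0.5 * sine)  # same tone at half the amplitude
print(loud, quiet)
```

Halving the amplitude halves the RMS, which is what makes it a usable loudness proxy.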


In [None]:
df["spectral_centroid_mean"].describe()

The [spectral centroid](https://en.wikipedia.org/wiki/Spectral_centroid) is a measure used to characterise an audio spectrum by finding its center of mass. It is also connected to the brightness of a sound, which refers to the higher mid and treble parts of the frequency spectrum.
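
The "center of mass" idea can be sketched for a single frame with plain numpy: weight each frequency by its magnitude and take the weighted mean. The helper `spectral_centroid_frame` and the synthetic tones are assumptions for illustration, and the details (windowing, framing) differ from a full library implementation.

```python
import numpy as np

def spectral_centroid_frame(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of one audio frame."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))

sr, n = 22050, 2048
t = np.arange(n) / sr
bin_hz = sr / n  # spacing between FFT bins

# Tones placed exactly on FFT bins 100 and 300 to avoid spectral leakage.
low = np.sin(2 * np.pi * 100 * bin_hz * t)
high = np.sin(2 * np.pi * 300 * bin_hz * t)

centroid_low = spectral_centroid_frame(low, sr)
centroid_mix = spectral_centroid_frame(low + high, sr)
print(centroid_low, centroid_mix)
```

Adding the higher tone pulls the centroid upward, to the midpoint between the two equally strong tones, which matches the intuition that brighter sounds have higher centroids.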

We can use the librosa package again to compute this frequency that the energy of the spectrum centers upon.

Let's start by loading in one of our jazz songs.

In [None]:
x, sr = librosa.load('../input/gtzan-dataset-music-genre-classification/Data/genres_original/jazz/jazz.00016.wav')
ipd.Audio(x, rate=sr)

Next we will find the spectral centroid for each frame in the audio signal, compute the frame times for visualization, and then plot.

In [None]:
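import sklearn.preprocessing  # make sure the submodule is loaded

# Use x (the jazz clip loaded above), not y (the earlier blues clip)
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr)
def normalize(y_, axis=0):
    return sklearn.preprocessing.minmax_scale(y_, axis=axis)
# Note: librosa >= 0.10 removed waveplot; use librosa.display.waveshow there
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='r') # normalize for visualization purposes
plt.show()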


This gives us additional information about each of the songs in our data. The dataset's features are the mean and variance of these per-frame values.

[Spectral Bandwidth](https://musicinformationretrieval.com/spectral_features.html) is similar to the above: it measures how widely the spectrum is spread around the centroid, with an order parameter p that controls how strongly frequencies far from the centroid are weighted.
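
A minimal single-frame sketch of the order-p bandwidth, assuming the magnitudes are normalized to sum to 1 before weighting: take the weighted p-th power of each frequency's distance from the centroid, then the p-th root. The helper `spectral_bandwidth_frame` and the test tones are illustrative assumptions.

```python
import numpy as np

def spectral_bandwidth_frame(frame, sr, p=2):
    """Order-p spread (Hz) of one frame's spectrum around its centroid."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    weights = magnitudes / np.sum(magnitudes)  # normalize so the weights sum to 1
    centroid = np.sum(freqs * weights)
    return float(np.sum(weights * np.abs(freqs - centroid) ** p) ** (1.0 / p))

sr, n = 22050, 2048
t = np.arange(n) / sr
bin_hz = sr / n

def tone(bin_index):
    # Sine wave placed exactly on an FFT bin to avoid leakage.
    return np.sin(2 * np.pi * bin_index * bin_hz * t)

narrow = tone(100) + tone(300)  # two tones 200 bins apart
wide = tone(100) + tone(500)    # two tones 400 bins apart

bw_narrow = [spectral_bandwidth_frame(narrow, sr, p=p) for p in (2, 3, 4)]
bw_wide = spectral_bandwidth_frame(wide, sr, p=2)
print(bw_narrow, bw_wide)
```

For two equally strong tones, every order p gives the same value (half the distance between them); on real music the orders diverge because higher p emphasizes energy far from the centroid.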

In [None]:
# Pass the signal by keyword; recent librosa versions no longer accept it positionally
spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr)[0]
spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=3)[0]
spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=4)[0]
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_bandwidth_2), color='r')
plt.plot(t, normalize(spectral_bandwidth_3), color='g')
plt.plot(t, normalize(spectral_bandwidth_4), color='y')
plt.legend(('p = 2', 'p = 3', 'p = 4'))
plt.show()

The other features we will not use as there are no standardized ways of computing them, so we cannot recreate them for our new songs.

Let's check for missingness.

In [None]:
df=df[['filename', 'chroma_stft_mean', 'chroma_stft_var', 'rms_mean', 'rms_var', 'spectral_centroid_mean',
      'spectral_centroid_var', 'spectral_bandwidth_mean', 'spectral_bandwidth_var', 'label']]
df.isnull().any()

This looks good! No missing values for any of our columns of interest.

Next let's look at some tables and graphs.

In [None]:
df.describe()

From this chart we can note a couple of things --
* Our variables are definitely all on different scales. Our max values range from 0.027 to 3,036,843, with spectral variables being the ones with high values.
* There is a lot of variation for spectral centroid and not much for chroma stft or rms.
* There doesn't seem to be much skew in our data, as the mean and max aren't too far apart for any of our variables.
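
Comparing the mean against the max is only a rough check for skew. As a hedged sketch of a more direct check, the snippet below compares mean and median and calls pandas' `skew()` on a toy frame; the column names and values are made up for illustration and are not taken from the GTZAN data.

```python
import pandas as pd

# Toy data: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 1.0, 10.0],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # sample skewness; near 0 means roughly symmetric
})
print(summary)
```

On the real dataframe the same check would be `df.skew(numeric_only=True)`; values far from zero would flag features worth transforming.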

Let's do some visualizations.

In [None]:
p = ggplot(df, aes(x='label', y='chroma_stft_mean', color='label')) +\
    geom_point() +\
    scale_color_brewer(type='diverging', palette=4) +\
    xlab("Genre") + ylab("Chroma Stft Mean") + ggtitle("Song Chroma Mean by Genre")
p

So we can see there's definitely a difference across our Genres when comparing the Chroma mean. Metal has the highest while classical has the lowest.
Reggae has a few interesting outliers that go pretty far up as well. Let's look at those.

In [None]:
reggae_out=df[df['label']=='reggae']
reggae_out=reggae_out[reggae_out['chroma_stft_mean']>0.55]
reggae_out

The offending files are reggae.00051 and reggae.00086. Let's have a listen to them.

In [None]:
x, sr = librosa.load('../input/gtzan-dataset-music-genre-classification/Data/genres_original/reggae/reggae.00051.wav')
ipd.Audio(x, rate=sr)

So... this doesn't sound like reggae; it is a lot closer to electronic music. I am going to drop it from the dataset as it seems to be mislabeled.

In [None]:
x, sr = librosa.load('../input/gtzan-dataset-music-genre-classification/Data/genres_original/reggae/reggae.00086.wav')
ipd.Audio(x, rate=sr)

I won't play this file, as about 5 seconds in the recording glitches and stays very loud until the end -- no wonder it showed up so far away from our other values. I am going to remove this one as well.

In [None]:
df=df.drop(851)
df=df.drop(886)
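
Dropping by integer row label works as long as the index still matches the original file order. A sketch of an alternative that drops by filename instead is shown below; it uses a toy frame, and the filename strings are assumed to follow the `genre.NNNNN.wav` pattern seen in the audio paths above.

```python
import pandas as pd

# Toy stand-in for the features dataframe.
toy = pd.DataFrame({
    "filename": ["reggae.00050.wav", "reggae.00051.wav", "reggae.00086.wav"],
    "label": ["reggae", "reggae", "reggae"],
})

# Files judged to be mislabeled or corrupted after listening to them.
bad_files = ["reggae.00051.wav", "reggae.00086.wav"]

# Keep every row whose filename is not in the removal list.
cleaned = toy[~toy["filename"].isin(bad_files)]
print(cleaned)
```

This keeps working even if the dataframe has been re-sorted or re-indexed, and the list of removed files doubles as documentation of the cleaning step.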

Exploring these outliers was useful for reggae. Let's check out some others.

In [None]:
outlier=df[df['label']=='classical']
outlier=outlier[outlier['chroma_stft_mean']>0.4]
x, sr = librosa.load('../input/gtzan-dataset-music-genre-classification/Data/genres_original/classical/classical.00080.wav')
ipd.Audio(x, rate=sr)

This one just contains a lot of silence, but that is just how the song sounds. It shouldn't affect our label too much.

In [None]:
outlier=df[df['label']=='disco']
outlier=outlier[outlier['chroma_stft_mean']<0.3]
x, sr = librosa.load('../input/gtzan-dataset-music-genre-classification/Data/genres_original/disco/disco.00047.wav')
ipd.Audio(x, rate=sr)

I am not sure if that qualifies as disco -- I would call that more like a Disney song or something? I am going to remove it.

In [None]:
df=df.drop(347)

That concludes our outliers. Next let's view some graphs for the spectral centroid statistics.

In [None]:
ggplot(df, aes(x='spectral_centroid_mean', color='label')) +\
    geom_density(alpha=0.95) +\
    facet_wrap("label") +\
    xlab("Spectral Centroid Mean") + ggtitle("Song Spectral Centroid Mean by Genre")

This looks very similar to our plot of the Chroma Mean variable above, as the genres follow very similar distributions. Let's see what a scatterplot looks like with the two as the x and y values.

In [None]:
ggplot(df, aes(x='spectral_centroid_mean', y='chroma_stft_mean', color='label')) +\
    geom_point(alpha=0.80) +\
    xlab("Spectral Centroid Mean") + ylab("Chroma Stft Mean") + ggtitle("Scatterplot for Spectral Centroid vs. Chroma Stft Mean")

So -- our values aren't as linear as I thought. There is a loosely linear correlation between the two but it is not super strong.

Speaking of correlation, let's do a heatmap to see how correlated our variables are to each other.

In [None]:
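# .copy() avoids SettingWithCopyWarning when these columns are rescaled later
df_no_var=df[['filename', 'chroma_stft_mean', 'rms_mean', 'spectral_centroid_mean', 'spectral_bandwidth_mean', 'label']].copy()
# Correlate only the numeric columns (filename and label are strings)
corr = df_no_var.select_dtypes(include='number').corr()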
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

To make this heatmap simpler, I only visualized the means for each of our variables.
We see that all of our variables are fairly strongly correlated with each other, especially the two spectral features (centroid and bandwidth).
This could potentially cause problems for us in the future, and it might be better to use only one of them.

To put the variables on a common scale, we could divide each feature by its maximum, so that every value becomes a fraction of that feature's maximum.

In [None]:
df_no_var['chroma_stft_mean']=df_no_var['chroma_stft_mean']/df_no_var['chroma_stft_mean'].max()
df_no_var['rms_mean']=df_no_var['rms_mean']/df_no_var['rms_mean'].max()
df_no_var['spectral_centroid_mean']=df_no_var['spectral_centroid_mean']/df_no_var['spectral_centroid_mean'].max()
df_no_var['spectral_bandwidth_mean']=df_no_var['spectral_bandwidth_mean']/df_no_var['spectral_bandwidth_mean'].max()
df_no_var.describe()
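
The same max-scaling can be written once as a small helper rather than one line per column. This is a sketch on toy data; the helper name `scale_by_max` is hypothetical and the values are made up.

```python
import pandas as pd

def scale_by_max(frame, columns):
    """Return a copy of frame with each listed column divided by its own maximum."""
    scaled = frame.copy()
    scaled[columns] = scaled[columns] / scaled[columns].max()
    return scaled

toy = pd.DataFrame({
    "rms_mean": [0.1, 0.2, 0.4],
    "spectral_centroid_mean": [1000.0, 2000.0, 4000.0],
    "label": ["jazz", "rock", "pop"],
})

scaled = scale_by_max(toy, ["rms_mean", "spectral_centroid_mean"])
print(scaled)
```

Because the helper returns a copy, the original frame is left untouched, which makes it easy to compare scaled and unscaled values side by side.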

## Data Source 2

Now that we have established which features are of interest to us and possible to replicate from our first dataset, we can generate them from the songs we downloaded in our second dataset. I did this offline as uploading all of the songs to Kaggle would have taken too long.

The code for that is at the bottom of the page and commented out if you are interested.

For this case, we will quickly examine the second dataset source similarly to how we did the first, looking at descriptive statistics generated in the same way as the above features and looking for outliers to see if I made any mistakes or the playlists had any mislabeled songs.

In [None]:
df2=pd.read_csv("../input/additionaldata1/additional_df.csv")
print(df2.shape)
print(df2.isnull().any())
df2.head()

This dataset has 788 rows while our previous had 1000.

In [None]:
df2.groupby(['label']).size()

The main difference between the two datasets is class balance: country has only 66 songs while blues has 124. I am aiming to continue adding songs to this dataset as the semester goes on, so I will have this fixed soon.

In [None]:
p = ggplot(df2, aes(x='label', y='chroma_stft_mean', color='label')) +\
    geom_point() +\
    scale_color_brewer(type='diverging', palette=4) +\
    xlab("Genre") + ylab("Chroma Stft Mean") + ggtitle("Song Chroma Mean by Genre")
p

We see a surprisingly similar graph to what we saw before, with classical being the lowest and rock/metal being the highest.

In [None]:
rock=df2[df2['label']=='rock']
rock=rock[rock['chroma_stft_mean']<0.55]
rock

While the Metallica song is labeled correctly, the Helen O'Connel song belongs to the Jazz category and was sorted incorrectly.

In [None]:
df2=df2.drop(784)

The other outliers investigated proved to be correctly sorted.

In [None]:
df_no_var['dataset'] = "Source 1"
df2['dataset'] = "Source 2"

df2['chroma_stft_mean']=df2['chroma_stft_mean']/df2['chroma_stft_mean'].max()
df2['rms_mean']=df2['rms_mean']/df2['rms_mean'].max()
df2['spectral_centroid_mean']=df2['spectral_centroid_mean']/df2['spectral_centroid_mean'].max()
df2['spectral_bandwidth_mean']=df2['spectral_bandwidth_mean']/df2['spectral_bandwidth_mean'].max()

master_df = pd.concat([df_no_var, df2])
master_df

One last thing I wanted to examine is how these two Data sources compare side by side using a density plot.

In [None]:
ggplot(master_df, aes(x='spectral_centroid_mean', color='dataset')) +\
    geom_density(alpha=0.95) +\
    facet_wrap("label") +\
    xlab("Spectral Centroid Mean") + ggtitle("Comparing spectral centroid distributions for different data sources")

We see here that the majority of our genres follow a very similar distribution, with the main differences being in pop and rock -- most likely because the YouTube playlists I downloaded for these genres were mostly songs from 2020.
In order to fix this, I'll add other playlists with older songs to better match the distribution of the older dataset.

I also was unable to find any good Metal playlists, as most of them just contained the same songs as the rock playlist, so this is something I need to go back to and search for more thoroughly.

In [None]:
ggplot(master_df, aes(x='chroma_stft_mean', color='dataset')) +\
    geom_density(alpha=0.95) +\
    facet_wrap("label") +\
    xlab("Chroma Mean") + ggtitle("Comparing chroma means for different data sources")

The differences are even more pronounced here: for almost every genre, the Source 2 songs sit higher on the chroma mean scale. Possibly the audio quality from YouTube is better than that of the GTZAN dataset, as those songs were all pulled from old personal CDs, radio, and microphone recordings.
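
The gap between the two sources can also be put into numbers rather than read off the density plots. The sketch below pivots a toy frame into per-genre means for each source and takes their difference; the values are invented for illustration, and only the column names mirror `master_df`.

```python
import pandas as pd

# Toy stand-in for master_df with two genres and two sources.
toy = pd.DataFrame({
    "label": ["rock", "rock", "rock", "rock", "pop", "pop", "pop", "pop"],
    "dataset": ["Source 1", "Source 1", "Source 2", "Source 2"] * 2,
    "chroma_stft_mean": [0.50, 0.60, 0.70, 0.80, 0.40, 0.50, 0.45, 0.55],
})

# One row per genre, one column per source, each cell the mean chroma value.
means = toy.pivot_table(index="label", columns="dataset",
                        values="chroma_stft_mean", aggfunc="mean")
means["difference"] = means["Source 2"] - means["Source 1"]
print(means)
```

A positive difference for a genre means the Source 2 songs sit higher on the chroma mean scale for that genre.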

# Concluding Thoughts

* We have successfully identified our variables of interest for our data through research and exploration of our features.
* We cleaned our data by using visualizations to pick out outliers and mislabeled values, as well as normalizing our features.
* We incorporated our own data, merged the two dataframes together, and compared the differences between them.


Moving forward, this should lay the groundwork for a baseline model to compare against a pre-trained neural network and potentially to merge with it.

# Offline Audio Feature Extraction

In [None]:
# import pandas as pd
# import numpy as np
# import librosa
# import os
# import math

# list_=[]

# for dirname, _, filenames in os.walk('E:/code/470w30'):
#     for filename in filenames:
        
#         path_=(os.path.join(dirname, filename))
#         label=(dirname.split('\\')[1])
#         y, sr = librosa.load(path_)
#         S = np.abs(librosa.stft(y))
#         chroma = librosa.feature.chroma_stft(S=S, sr=sr)
#         chroma_mean = (np.mean(chroma))
        
#         rms=math.sqrt(np.mean(y*y))
        
#         spectral_centroids = librosa.feature.spectral_centroid(y, sr=sr)[0]
#         spectral_mean = (np.mean(spectral_centroids))
        
#         spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(y+0.01, sr=sr)[0]
#         spectral_bandwidth_2 = np.mean(spectral_bandwidth_2)
#         spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(y+0.01, sr=sr, p=3)[0]
#         spectral_bandwidth_3 = np.mean(spectral_bandwidth_3)
#         spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(y+0.01, sr=sr, p=4)[0]
#         spectral_bandwidth_4 = np.mean(spectral_bandwidth_4)
#         band_mean = np.mean(np.array([spectral_bandwidth_2, spectral_bandwidth_3, spectral_bandwidth_4]))
        
#         list_.append([filename, chroma_mean, rms, spectral_mean, band_mean, label])
        
# df_=pd.DataFrame(list_, columns=['filename', 'chroma_stft_mean', 'rms_mean', 'spectral_centroid_mean', 'spectral_bandwidth_mean', 'label'])

# df_.to_csv("additional_df.csv", index=False)
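
The label extraction in the commented-out code splits the directory name on a backslash, which only works on Windows-style paths. A hedged sketch of a more portable variant is shown below, assuming each genre lives in a folder directly under the root; the helper name `genre_from_dirname` and the example folder names are hypothetical.

```python
from pathlib import PureWindowsPath

def genre_from_dirname(dirname):
    """Return the last folder name of a path, used here as the genre label."""
    # PureWindowsPath treats both '/' and '\\' as separators, so this handles
    # paths produced by os.walk on Windows as well as POSIX-style paths.
    return PureWindowsPath(dirname).name

print(genre_from_dirname("E:/code/470w30\\rock"))  # mixed separators, as os.walk yields on Windows
print(genre_from_dirname("/data/470w30/jazz"))     # POSIX-style path
```

Unlike the index-based split, this takes the final path component, so it matches the original behaviour only when the genre folders sit one level below the root.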