# Working with audio data and building a recommender.

### Why?
Because I am a music producer and a mastering engineer who is learning how machines learn, and it would be a shame if I did not work on at least one music-related ML problem. Plus, it's fun, so why not? ヽ(‘ ∇‘ )ノ

Want to hear my music?
Click on the link here: https://www.youtube.com/c/FusionAssam

### First Step: Importing the basic libraries.

First, we will import the basic libraries that we use most often. They are like siblings to you: you may like them or hate them, love them or fight with them, but at the end of the day you end up needing them. Yes, I am also a self-made philosopher.

We will import machine learning libraries later on.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Second Step: Importing librosa.

Librosa is the fuel to the atom bomb we are going to work on today.

Know more about it here: https://librosa.org/doc/latest/index.html

In [None]:
import librosa
import librosa.display

# Importing other libraries just in case

import IPython.display as ipd
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
file_path = "../input/gtzan-dataset-music-genre-classification/Data"
print(os.listdir(f'{file_path}/genres_original/'))  # This will show us the 10 genres in the folder.

# I don't make music in any of the genres displayed below. That's a little heartbreaking.

### Third Step: Exploring audio files

Here we will look at the sequence of vibrations, which is stored as an array of numbers. Sound is analog but, inside a computer, it is just a sequence of numbers. We will also check the sample rate of the sound.

**What is a sample rate?**

We will talk about it a little later.

In [None]:
sound, sample_rate = librosa.load(f'{file_path}/genres_original/classical/classical.00003.wav')

In [None]:
print('Vibration sequence:', sound)  # Audio time series
array_len = sound.shape
print('\nSound shape:', array_len)
print('Sample Rate (Hz):', sample_rate)

# Length of the sound
print('Check Len of Audio:', array_len[0] / sample_rate)

As you can see above, the sample rate of the file is 22,050 Hz and the length of the array is 661794.

**The sample rate of an audio file is the number of samples recorded (or exported) by a device in one second.**

In simple terms, when we listen to music or sounds that do not come from an electronic device, like a person speaking, your friend playing a guitar in front of you without a mic, or wind and water, we are exposed to the full range of sound out there, of which frequencies from 20 Hz to 20,000 Hz are audible to human ears. In truth, most people cannot hear sound below about 50 Hz or above about 17,000 Hz, but they can feel the absence of those frequencies. On the whole, though, the range from 20 Hz to 20,000 Hz is taken as the audible range for humans.

Now come the electronic devices that record sound. They capture the analog sound of the source, i.e. the musical instrument or the human voice, and convert it into a digital signal. They record the sound as many small samples per second and string them together to recreate the music. It is like integral calculus, where we find the area under a curve (here, an analog signal) by stacking up thin slices under it and adding them to approximate the actual area. *The finer the slices, the more slices there are, and the more precise the area under the curve becomes.*

Similarly, in sound and music those slices are the samples. The more samples per second, the better the quality of the sound we hear from an electronic device, since more precise information about the real sound can be reproduced digitally.

This logic of sampling rate is formalized by the Nyquist–Shannon sampling theorem.

This theorem states that:
> If a system uniformly samples an analog signal at a rate that exceeds the signal’s highest frequency by at least a factor of two, the original analog signal can be perfectly recovered from the discrete values produced by sampling.

That means that if you want a recording to stay close to the sound coming from the source, you have to sample at least twice as fast as the highest frequency in the source signal.
Otherwise, content above half the sampling rate folds back into the lower frequencies as an alias, i.e. distortion in the waveform, which for music means inside the audible 20 Hz to 20,000 Hz band. What is an alias? [Click here](http://zone.ni.com/reference/en-XX/help/370524V-01/siggenhelp/fund_aliased_images/)

In short, humans hear sound up to 20,000 Hz, so the sampling rate should be at least 40,000 Hz or, in more practical digital terms, 44,100 Hz. Why 44,100 Hz? [Click here](http://en.wikipedia.org/wiki/44,100_Hz)
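
To make the aliasing idea concrete, here is a small NumPy-only sketch. The helper names `sampled_tone` and `dominant_frequency` and the chosen tone and sample rates are illustrative assumptions, not part of librosa or the dataset. It samples the same 7,000 Hz sine tone at two rates and looks for the strongest frequency in each result.

```python
import numpy as np


def sampled_tone(tone_hz, sample_rate, seconds=1.0):
    """Sample a pure sine tone of `tone_hz` at `sample_rate` samples per second."""
    t = np.arange(int(sample_rate * seconds)) / sample_rate
    return np.sin(2 * np.pi * tone_hz * t)


def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest bin in the signal's spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


# Sampled at 44,100 Hz, the Nyquist limit is 22,050 Hz, so 7,000 Hz is captured as is.
ok = dominant_frequency(sampled_tone(7000, 44100), 44100)

# Sampled at 10,000 Hz, the Nyquist limit is only 5,000 Hz, so the tone folds
# back and shows up at 10,000 - 7,000 = 3,000 Hz instead.
aliased = dominant_frequency(sampled_tone(7000, 10000), 10000)

print("Sampled fast enough:", ok, "Hz")
print("Sampled too slowly: ", aliased, "Hz")
```

The second tone is not lost, it is misreported as a lower frequency, which is why recorders filter out content above half the sampling rate before sampling.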

The sound samples in this dataset are at 22,050 Hz, which means anything above 11,025 Hz is gone. We are losing a lot of information in these audio files, which is not so cool in today's world but very cool for machine learning: the lower the sample rate, the lower the resolution of the sound and the less data there is to process, yet it is still enough for a machine to learn efficiently. Or maybe it is because Kaggle is not a music streaming service. I don't know; my own song also came out at 22,050 Hz here, although that is most likely because `librosa.load` resamples to 22,050 Hz by default unless you pass `sr=None`. For that you will have to scroll down a little.

The length of the array for this clip is 661794, which means the clip consists of 661794 samples (vibration values) collected at 22050 samples per second. That works out to 661794/22050 ≈ 30 seconds.
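
The same arithmetic as a tiny sketch (the function name `duration_seconds` is made up for illustration): the length of a clip is simply the number of samples divided by the samples per second.

```python
def duration_seconds(num_samples, sample_rate):
    """Length of a clip in seconds: number of samples / samples per second."""
    return num_samples / sample_rate


# Values printed by the cell above for classical.00003.wav
clip_seconds = duration_seconds(661794, 22050)
print("Clip length in seconds:", round(clip_seconds, 2))

# The same number of samples at CD rate would cover only half the time.
print("At 44,100 Hz:", round(duration_seconds(661794, 44100), 2))
```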

Time to code again.

All the sounds in this dataset are hard cut rather than faded in and out, so there are clicks at the beginning and end of the clips instead of silence. In more computer terms, there is information at both ends. So we do not need to trim them using `librosa.effects.trim()`. I may be wrong here, but this is how I feel.

In [None]:
# Let's look at the waveform of the sound.

plt.figure(figsize=(16, 6))
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(y=sound, sr=sample_rate, color="#2f7d92ff")
plt.title("Waveform of classical.00003.wav", fontsize=12)  # Classical music is highly dynamic.
plt.show()

In [None]:
# Just for fun, let's look at one of my own sound files.

my_sample, my_sample_rate = librosa.load('../input/fusiona-mandelbrot/FusionA - Mandelbrot Mixed File (22-7-2020) 1.mp3')
print('Vibration sequence:', my_sample)  # Audio time series
array_len = my_sample.shape
print('\nSound shape:', array_len)
print('Sample Rate (Hz):', my_sample_rate)

# Length of the sound
print('Check Len of Audio:', array_len[0] / my_sample_rate)

In [None]:
# Let's look at the waveform of my sound. (ﾟ▽^*)

plt.figure(figsize=(60, 15))
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(y=my_sample, sr=my_sample_rate, color="#d3a5a7ff")
plt.title("Waveform of FusionA - Mandelbrot", fontsize=60)
plt.xticks(fontsize=60)
plt.xlabel('Time', fontsize=60)
plt.yticks(fontsize=60)
plt.show()

# Wow ヽ(゜∇゜)ノ

Now, that's a wonderful waveform. Listen to my song "Mandelbrot", guys. (¬‿¬) [Click here](http://www.youtube.com/watch?v=pSDtn8PT-Mg) and enjoy, my fellow comrades.

### Time for some Fourier Transform

In mathematics, a Fourier transform (FT) is a mathematical transform that decomposes a function (often a function of time, or a signal) into its constituent frequencies, such as the expression of a musical chord in terms of the volumes and frequencies of its constituent notes. The term Fourier transform refers to both the frequency domain representation and the mathematical operation that associates the frequency domain representation to a function of time.

Directly copied from [Wikipedia](http://en.wikipedia.org/wiki/Fourier_transform).
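
To see the "chord into notes and volumes" idea in numbers, here is a minimal NumPy sketch. The three frequencies, their amplitudes and the 8,000 Hz sample rate are arbitrary choices for illustration. It mixes three sine tones and then recovers their frequencies and volumes with a Fourier transform.

```python
import numpy as np

sample_rate = 8000                          # samples per second (arbitrary for this sketch)
t = np.arange(sample_rate) / sample_rate    # exactly one second of time stamps

# A toy "chord": three sine tones with different volumes (amplitudes).
notes = {220.0: 1.0, 277.0: 0.6, 330.0: 0.3}    # frequency in Hz -> amplitude
chord = sum(amp * np.sin(2 * np.pi * freq * t) for freq, amp in notes.items())

# FFT for real input; dividing by N/2 rescales each bin to the sine's amplitude.
amplitudes = np.abs(np.fft.rfft(chord)) / (len(chord) / 2)
freqs = np.fft.rfftfreq(len(chord), d=1.0 / sample_rate)

# The three strongest bins should be the three notes we mixed in.
strongest = np.argsort(amplitudes)[-3:]
recovered = {float(freqs[i]): round(float(amplitudes[i]), 2) for i in sorted(strongest)}
print(recovered)
```

The time-domain array `chord` looks like one messy wiggle, while the frequency-domain view separates it back into its ingredients.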

In [None]:
# For this we will go back to the original sound sample from the dataset.
# In our case it is stored in the variable named 'sound'.

# Default FFT parameters
n_fft = 2048  # FFT window size
hop_length = 512  # number of audio samples between successive STFT columns (a good default)

# Short-time Fourier transform (STFT)
D = np.abs(librosa.stft(sound, n_fft=n_fft, hop_length=hop_length))

print('Shape of D object:', np.shape(D))
print('\nD:-\n', D)

In [None]:
plt.figure(figsize = (16, 6))
plt.plot(D)
plt.show()

### Time to look at the spectrogram

**What is a spectrogram?**

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams.

Directly copied from [you guessed it](http://en.wikipedia.org/wiki/Spectrogram).

In musical terms, a spectrogram is a detailed view of audio that represents time, frequency and amplitude all on one graph. It can visually reveal broadband, electrical or intermittent noise in audio.
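
As a rough picture of what sits behind a spectrogram, here is a hand-rolled NumPy sketch: cut the signal into overlapping windows, take one FFT per window, and convert magnitudes to decibels relative to the loudest cell. The function names `magnitude_stft` and `to_decibels` are made up for this example. Unlike `librosa.stft` with its defaults, this version does not pad the signal, so the number of columns will differ from librosa's.

```python
import numpy as np


def magnitude_stft(signal, n_fft=2048, hop_length=512):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # One FFT per frame, then transpose so frequency runs down the rows.
    return np.abs(np.fft.rfft(frames, axis=1)).T


def to_decibels(magnitude, floor=1e-10):
    """Decibels relative to the loudest cell, so the maximum maps to 0 dB."""
    reference = magnitude.max()
    return 20.0 * np.log10(np.maximum(magnitude, floor) / reference)


sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
# A test tone that jumps from 440 Hz to 880 Hz halfway through the second.
tone = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

spec = magnitude_stft(tone)
spec_db = to_decibels(spec)

bin_hz = sample_rate / 2048    # width of one frequency bin in Hz
print("Shape (frequency bins, frames):", spec.shape)
print("Loudest frequency in first frame:", np.argmax(spec[:, 0]) * bin_hz, "Hz")
print("Loudest frequency in last frame: ", np.argmax(spec[:, -1]) * bin_hz, "Hz")
```

Each column is one short moment in time and each row is one frequency band, which is exactly the time, frequency and amplitude picture described above.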

In [None]:
# Convert the amplitude spectrogram to a decibel-scaled spectrogram.
db = librosa.amplitude_to_db(D, ref=np.max)

# Creating the spectrogram
plt.figure(figsize = (16, 6))
# D comes from a plain STFT, so its rows are linear-frequency bins; y_axis='log' draws
# them on a log-frequency axis (y_axis='mel' is only right for a mel spectrogram).
librosa.display.specshow(db, sr=sample_rate, hop_length=hop_length, x_axis='time', y_axis='log',
                        cmap='gist_heat')
plt.colorbar();
# We are using the 'gist_heat' colour map because it looks similar to the iZotope RX spectrogram,
# which is very common in the world of music.

To know more about the spectrogram in iZotope RX, [click here](http://help.izotope.com/docs/rx/pages/userguide_spectrogramwaveformdisplay.htm).

### Time to look at the Zero-Crossing Rate:

The zero-crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to zero to negative or from negative to zero to positive. This feature has been used heavily in both speech recognition and music information retrieval, being a key feature to classify percussive sounds.

A voice signal oscillates slowly (for example, a 100 Hz tone crosses zero 100 times per second in each direction), whereas an unvoiced fricative can have 3,000 zero crossings per second.
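
Here is a minimal NumPy sketch of counting zero crossings, using two synthetic tones rather than the dataset. The function name `count_zero_crossings` and the tone frequencies are illustrative assumptions. Note that it counts crossings in both directions, so a pure tone gives roughly two crossings per cycle.

```python
import numpy as np


def count_zero_crossings(signal):
    """Count how many times consecutive samples change sign."""
    signs = np.signbit(signal)
    return int(np.count_nonzero(signs[1:] != signs[:-1]))


sample_rate = 22050
t = np.arange(sample_rate) / sample_rate    # one second of time stamps

# A small phase offset keeps samples from landing exactly on zero.
low_tone = np.sin(2 * np.pi * 100 * t + 0.1)     # slow oscillation
high_tone = np.sin(2 * np.pi * 3000 * t + 0.1)   # fast oscillation

low_count = count_zero_crossings(low_tone)
high_count = count_zero_crossings(high_tone)
print("100 Hz tone: ", low_count, "crossings in one second")
print("3000 Hz tone:", high_count, "crossings in one second")
```

Dividing such a count by the clip length in seconds turns it into a rate, which makes clips of different lengths comparable.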

In [None]:
zc = librosa.zero_crossings(sound, pad=False)  # Boolean mask: True wherever the sound sample crosses zero
sum(zc)  # Total number of zero crossings in the clip

In [None]:
zc2 = librosa.zero_crossings(my_sample, pad=False)  # Boolean mask: True wherever my song crosses zero
sum(zc2)  # Total number of zero crossings in my song

### Analyze Harmonics, Perceptual, Tempo and Pitch:

Harmonics: When a musical instrument is playing a note, what we are actually hearing is the fundamental pitch, which is the pitch being played by the instrument, accompanied by a series of frequencies that are usually heard as a single composite tone. Those frequencies that are integer multiples of the fundamental pitch's frequency are called harmonics. To know more [click here](http://study.com/academy/lesson/what-are-harmonics-definition-types-quiz.html).

Perceptual: Music involves the manipulation of sound. Our perception of music is thus influenced by how the auditory system encodes and retains acoustic information. To know more [click here](http://www.researchgate.net/publication/220723259_Perceptual_and_Cognitive_Applications_in_Music_Information_Retrieval) or [here](http://serious-science.org/perception-of-music-9396).

Tempo (Beats per minute): “Beats per minute” (or BPM) is self-explanatory: it indicates the number of beats in one minute. For instance, a tempo notated as 60 BPM would mean that a beat sounds exactly once per second. To know more [click here](http://www.masterclass.com/articles/music-101-what-is-tempo-how-is-tempo-used-in-music#what-is-beats-per-minute-bpm).

Pitch: You know what pitch is.
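
The tempo definition above boils down to simple arithmetic, sketched here in plain Python. The function names and the beat time stamps are hypothetical, and this is not how `librosa.beat.tempo` estimates tempo internally. It only shows how BPM relates to the time between beats.

```python
def seconds_per_beat(bpm):
    """Time between two consecutive beats at a given tempo."""
    return 60.0 / bpm


def bpm_from_beat_times(beat_times):
    """Estimate tempo from beat time stamps (in seconds) using the mean gap between beats."""
    gaps = [later - earlier for earlier, later in zip(beat_times, beat_times[1:])]
    return 60.0 / (sum(gaps) / len(gaps))


# At 60 BPM a beat sounds once per second.
print("Seconds per beat at 60 BPM:", seconds_per_beat(60))

# Hypothetical beats half a second apart correspond to 120 BPM.
print("BPM for beats 0.5 s apart:", bpm_from_beat_times([0.0, 0.5, 1.0, 1.5]))
```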

In [None]:
# Decompose an audio time series into harmonic and percussive components.

y_harm, y_perc = librosa.effects.hpss(sound)
plt.figure(figsize = (16, 6))
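# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(y_harm, sr=sample_rate, color="#6885a7ff", alpha=0.25);
librosa.display.waveshow(y_perc, sr=sample_rate, color='#cf27a7ff', alpha=0.5);
# Title the existing axes; plt.axes() could create a new, empty axes on top of the plot
plt.title('Harmonic + Percussive');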

In [None]:
# Detecting the tempo of the track

tempo = librosa.beat.tempo(y=sound, sr=sample_rate)
print(tempo)

In [None]:
tempo = librosa.beat.tempo(y=my_sample, sr=my_sample_rate)
print(tempo)  # This value is wrong. This is not the tempo of my song.

In [None]:
# Chromagram
# Recent librosa versions require the audio to be passed as the keyword argument y=
chromagram = librosa.feature.chroma_stft(y=sound, sr=sample_rate, hop_length=10000)
plt.figure(figsize=(16, 6))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=10000, cmap='coolwarm');

# Lower hop_length = finer cell blocks.

## Time to speak some big big words like Exploratory Data Analysis.

╰(◡‿◡✿╰)

### Fourth Step: Running Pandas for EDA and other stuff.

Finally, this library became useful for this problem.

In [None]:
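# Importing the 30-second CSV file.
# Use the full option name; the short 'max_columns' is ambiguous in newer pandas versions.
pd.set_option('display.max_columns', None)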
data = pd.read_csv(f'{file_path}/features_30_sec.csv')
data.head()

Here the mfcc columns are nothing but [Mel-Frequency Cepstral Coefficients](http://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd).

In [None]:
# Make a box plot to check the distribution of tempo (BPM) across the genres

x = data[["label", "tempo"]]

f, ax = plt.subplots(figsize=(16, 9));
sns.boxplot(x = "label", y = "tempo", data = x, palette = 'PuBuGn');

plt.title('BPM Boxplot for Genres', fontsize = 15)
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10);
plt.xlabel("Genre", fontsize = 10)
plt.ylabel("BPM", fontsize = 10)
plt.savefig("BPM Boxplot.jpg")

In [None]:
point = data.iloc[:, 2:]
point

In [None]:
from sklearn import preprocessing

data = data.iloc[0:, 1:]
y = data['label']
X = data.drop('label', axis=1)

cols = X.columns
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(X)
X = pd.DataFrame(np_scaled, columns = cols)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

finalDf = pd.concat([principalDf, y], axis = 1)

pca.explained_variance_ratio_

In [None]:
plt.figure(figsize = (16, 9))
sns.scatterplot(x = "principal component 1", y = "principal component 2", data = finalDf, hue = "label", alpha = 0.7,
               s = 100);

plt.title('PCA on Genres', fontsize = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12);
plt.xlabel("Principal Component 1", fontsize = 15)
plt.ylabel("Principal Component 2", fontsize = 15)
plt.savefig("PCA Scattert.jpg")

## Fifth Step: Machine Learning

Finally, time to see what we can do with the data.

In [None]:
'''
Here we will do the following task:-
    * We will first import the libraries we want for machine learning.
    * We will use features_3_sec.csv for building and testing.
    * We will use Random Forest, KNN, XGBoost and support vector machine.
'''

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve, precision_score, recall_score, f1_score
# Feature ranking with recursive feature elimination
from sklearn.feature_selection import RFE

In [None]:
data = pd.read_csv('../input/gtzan-dataset-music-genre-classification/Data/features_3_sec.csv')
data.head()

In [None]:
data['length'].nunique()

In [None]:
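# Splitting data
# We will remove the 'filename' column since every value in it is unique.
# We will remove the 'length' column since every value in it is the same.

df = data.iloc[0:, 2:]

# Encode the genre names as integers 0-9 (alphabetical order, matching the
# confusion-matrix labels below); recent XGBoost versions reject string class labels.
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['label'])
X = df.drop('label', axis=1)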

scale = MinMaxScaler()
scaled_data = scale.fit_transform(X)
X = pd.DataFrame(scaled_data, columns = X.columns).values

Since human beings are lazy, it is better to build a function for repeated tasks.

First, we will split the dataset using KFold to get a better understanding of our models while evaluating.

Second, we will wait for the kernel to perform its magic.
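
For intuition about what a K-fold split does, here is a NumPy-only sketch. The function name `k_fold_indices` is made up and this is not scikit-learn's implementation. The rows are shuffled once and cut into `n_splits` chunks, and each chunk takes a turn as the test set while the rest is used for training.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i, test_idx in enumerate(folds):
        # Everything that is not in the current test fold becomes training data.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Averaging the score over all folds, as the function in the next cell does with accuracy, gives a steadier estimate than a single train/test split.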

In [None]:
def model_build(model, kf, title = "Default"):
    accuracy_scores = []
    precision_scores = []
    recall_scores = []
    f1_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy_scores.append(accuracy_score(y_test, y_pred))
    print("Accuracy score of", title, "is:", round(np.mean(accuracy_scores), 2))
    # Let's see the confusion matrix of the last split for a little insight
    con_mat = confusion_matrix(y_test, y_pred)
    plt.figure(figsize = (16, 9))
    sns.heatmap(con_mat, cmap="Blues", annot=True, 
                xticklabels = ["blues", "classical", "country", "disco", 
                               "hiphop", "jazz", "metal", "pop", "reggae", "rock"], 
                yticklabels=["blues", "classical", "country", "disco", "hiphop", 
                             "jazz", "metal", "pop", "reggae", "rock"])
    plt.show()
    

# Leave 2 blank lines after a function definition

In [None]:
split = KFold(n_splits=5, shuffle=True)

### First, let us have a look at the shenanigans performed by the Random Forest Classifier, followed by KNN, XGBoost and finally SVM

In [None]:
# Random Forest Classifier

rfc = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=10)
model_build(rfc, split, 'Random Forest Classifier')

In [None]:
# K Nearest Neighbor

knn = KNeighborsClassifier(n_neighbors=10)
model_build(knn, split, 'K Nearest Neighbor')

In [None]:
# XG Boost

xgb = XGBClassifier()
model_build(xgb, split, 'XG Boost')

In [None]:
# Support Vector Machine

svm = SVC(decision_function_shape="ovo")
model_build(svm, split, 'Support Vector Machine')

From the scores above, we can see that XGBoost has the best chance of success on this problem, so let's tune it a little more. Grid Search CV would be the systematic way to do this; in the cell below we simply try a larger number of trees by hand.
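
For reference, an exhaustive grid search is nothing more than "try every combination and keep the best". Here is a plain-Python sketch of that loop. The function `grid_search` and the scoring function `toy_score` are hypothetical stand-ins: in a real run the scoring step would train the model and return a cross-validated accuracy, which is what scikit-learn's `GridSearchCV` automates.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid`; return the best one and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in for "mean cross-validated accuracy"; by construction it
    # peaks at 500 trees and a learning rate of 0.1.
    trees_penalty = abs(params["n_estimators"] - 500) / 1000
    rate_penalty = abs(params["learning_rate"] - 0.1)
    return 1.0 - trees_penalty - rate_penalty


grid = {"n_estimators": [100, 500, 1000], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
print("Best parameters:", best_params)
print("Best score:", best_score)
```

The cost grows with the product of the list lengths (here 3 x 3 = 9 evaluations), so with a slow model it pays to keep the grid small.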

### Sixth Step: Tuning the model.

In [None]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.25, random_state=1)
model = XGBClassifier(n_estimators=1000, learning_rate=0.3)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
print('Accuracy', ':', round(accuracy_score(y_test, y_pred), 5))

### Seventh Step: Feature importance
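
Before using eli5, here is a rough NumPy sketch of the idea behind permutation importance: shuffle one feature column at a time and measure how much the accuracy drops. The function name, the toy data and the toy "model" are all assumptions made for illustration, and this is not eli5's implementation.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled on its own."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - np.mean(predict(X_shuffled) == y))
        importances[col] = np.mean(drops)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is pure noise.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)


def toy_predict(X):
    # Hypothetical "trained model" that has learned the true rule.
    return (X[:, 0] > 0).astype(int)


importances = permutation_importance(toy_predict, X_toy, y_toy)
print("Importance of feature 0:", importances[0])
print("Importance of feature 1:", importances[1])
```

A feature the model relies on loses a lot of accuracy when scrambled, while a feature the model ignores loses none.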

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(estimator=model, random_state=1)
perm.fit(X_test, y_test)

In [None]:
columns = df.drop('label', axis=1).columns.tolist()
eli5.show_weights(estimator=perm, feature_names = columns)

### Eighth Step: Building recommender system

In [None]:
# First we will scale the data

import IPython.display as ipd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing

# Read data
data = pd.read_csv(f'{file_path}/features_30_sec.csv', index_col='filename')

# Extract labels
labels = data[['label']]

# Drop the 'length' and 'label' columns from the original dataframe
data = data.drop(columns=['length', 'label'])
data.head()

# Scale the data
data_scaled = preprocessing.scale(data)
print('Scaled data type:', type(data_scaled))

And then we will compute the [Cosine Similarity](http://www.sciencedirect.com/topics/computer-science/cosine-similarity#:~:text=Cosine%20similarity%20measures%20the%20similarity,document%20similarity%20in%20text%20analysis.) between every pair of songs. This will be the tool for our recommender system.
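
As a small illustration of the idea, here is a NumPy-only sketch with four made-up tracks and made-up feature values; the function names are hypothetical too. Cosine similarity compares the direction of two feature vectors, so tracks whose features point the same way score close to 1.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a 2-D feature array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit_rows = features / norms
    return unit_rows @ unit_rows.T


def top_matches(similarity, names, query, k=2):
    """Names of the k rows most similar to `query`, excluding the query itself."""
    row = similarity[names.index(query)]
    order = np.argsort(row)[::-1]    # most similar first
    return [names[i] for i in order if names[i] != query][:k]


# Hypothetical, already-scaled feature vectors for four made-up tracks.
names = ["track_a", "track_b", "track_c", "track_d"]
features = np.array([
    [1.0, 0.9, -0.2],
    [0.9, 1.0, -0.1],
    [-1.0, -0.8, 0.3],
    [0.1, -0.2, 1.0],
])

sim = cosine_similarity_matrix(features)
matches = top_matches(sim, names, "track_a")
print("Most similar to track_a:", matches)
```

Here `track_b` comes first because its features point almost the same way as those of `track_a`, while `track_c` points the opposite way and ranks last.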

In [None]:
# Cosine similarity
similarity = cosine_similarity(data_scaled)
print("Similarity shape:", similarity.shape)

# Convert into a dataframe and then set the row index and column names to the file names
sim_df_labels = pd.DataFrame(similarity)
sim_df_names = sim_df_labels.set_index(labels.index)
sim_df_names.columns = labels.index

sim_df_names.head()

### Now, we will define the recommender function.

In [None]:
def recommender(name):
    # Rank every song by its cosine similarity to `name`, most similar first
    series = sim_df_names[name].sort_values(ascending = False)

    # Remove cosine similarity == 1 (songs will always have the best match with themselves)
    series = series.drop(name)
    topfive = series.head(5)

    genres_dir = '../input/gtzan-dataset-music-genre-classification/Data/genres_original'
    songlist = []
    songnames = []
    for songname in topfive.index:
        # File names look like 'hiphop.00010.wav', so the genre folder is the first part
        songgenre = songname.split('.')[0]
        songlist.append(f'{genres_dir}/{songgenre}/{songname}')
        songnames.append(songname)
    return songlist, songnames

## Time for some experiments:

### Experiment no. 1:

First, we will check a hip hop song. I liked the beat of the song hiphop.00010.wav, so I am using it.

In [None]:
now_playing = 'hiphop.00010.wav'
playlist, songname = recommender(now_playing)
print('Now playing:', now_playing)
ipd.Audio('../input/gtzan-dataset-music-genre-classification/Data/genres_original/hiphop/hiphop.00010.wav')

Below is the top recommended song based on the song playing now.

In [None]:
print('Recommended songs:\n\n', pd.Series(songname))
ipd.Audio(playlist[0])

Both songs sound quite similar, so we can see that our recommender system is working well.

### Experiment no. 2:

Now, let's check a piece of classical music.

In [None]:
now_playing = 'classical.00003.wav'
playlist, songname = recommender(now_playing)
print('Now playing:', now_playing)
ipd.Audio('../input/gtzan-dataset-music-genre-classification/Data/genres_original/classical/classical.00003.wav')

In [None]:
print('Recommended songs:\n\n', pd.Series(songname))
ipd.Audio(playlist[0])

From the above, we can see that our recommender system is working quite well.

Hence, we can conclude that the recommender system we built does its job. I am so happy that I learned it.

I want to thank [Miss Andrada Olteanu](http://www.kaggle.com/andradaolteanu) for sharing her notebook on Kaggle, from which I learned this concept. And yes, I am very happy with what I learned and want to share it with you all. Have fun, and I hope I have shown you a few of my coding habits that you can learn from.

Thank you for your patience. 

### And, here is my song. I hope you will like it ヽ(^◇^*)/

For better quality [click here](http://www.youtube.com/watch?v=pSDtn8PT-Mg).

Thank you all. Have fun learning.

In [None]:
print("FusionA - Mandelbrot")
ipd.Audio('../input/fusiona-mandelbrot/FusionA - Mandelbrot Mixed File (22-7-2020) 1.mp3')