Required packages and modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.gaussian_process import GaussianProcessClassifier

Importing and cleaning data

In [None]:
df = pd.read_csv('../input/dataset-of-songs-in-spotify/genres_v2.csv')

In [None]:
df.columns

In [None]:
df.drop(columns = ['type', 'id', 'uri', 'track_href', 'analysis_url', 'time_signature', 'song_name', 
                   'Unnamed: 0', 'title'], inplace = True)

In [None]:
df.head()

In [None]:
df.genre.nunique()

In [None]:
df.genre.unique()

In [None]:
df_slim = df.loc[df['genre'].isin(['Underground Rap', 'Rap', 'RnB', 'Pop', 'Hiphop', 'dnb', 'Emo'])]

EDA and Preprocessing

In [None]:
# Draw Plot
plt.figure(figsize=(13,10), dpi= 80)
sns.violinplot(x='genre', y='duration_ms', data=df_slim, scale='width', inner='quartile')

# Decoration
plt.title('Violin Plots of Song Duration by Genre', fontsize=16)
plt.show()

In [None]:
df_slim.columns

In [None]:
X = df_slim[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms']]

X is a dataframe of predictors

In [None]:
y = df_slim[['genre']]

y holds the dependent/target variable, genre, as a single-column dataframe

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

We split the data into training and test sets at an 80/20 ratio
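If the genres are unevenly represented, a plain random split can leave the test set with slightly different genre proportions than the training set. As a minimal sketch, assuming a made-up two-genre toy frame (toy, X_toy and y_toy are hypothetical names, not part of this notebook), passing stratify to train_test_split keeps the class shares equal in both sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 "Pop" rows and 80 "Rap" rows (not the Spotify dataset)
toy = pd.DataFrame({"tempo": range(100), "genre": ["Pop"] * 20 + ["Rap"] * 80})
X_toy = toy[["tempo"]]
y_toy = toy["genre"]

# stratify=y_toy keeps the 20/80 genre ratio in both the training and the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Share of each genre in the test set
test_share = y_te.value_counts(normalize=True)
print(len(X_tr), len(X_te))
print(test_share)
```

Whether stratifying changes the results on the real data has not been checked here.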

Supervised ML algorithms - classifiers

1. Decision Tree Classification

First we train the model by feeding it our X and y training sets

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

Now we use our remaining X values in the test set to predict some y's

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

If the model had 100% classification accuracy, every off-diagonal entry in the confusion matrix would be zero.
Some good and expected results here:
dnb was classified with 96% accuracy and Emo with 62%.
Pop had the worst accuracy at 14%, perhaps because it has no clear pattern.
The remaining genres are all very similar to one another, so high accuracy was not to be expected for them.
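To make the link between the confusion matrix and per-genre figures concrete, here is a small sketch with made-up counts (toy_cm and per_class_recall are illustrative names, and the numbers are not the results above). In scikit-learn's confusion_matrix, rows are the true classes and columns the predicted ones, so the diagonal divided by the row totals gives the fraction of each true class that was classified correctly:

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each true class (row) predicted correctly: diagonal / row total."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Leave the result at 0 for classes with no samples to avoid dividing by zero
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
toy_cm = np.array([[8, 2, 0],
                   [1, 5, 4],
                   [0, 0, 10]])

recall = per_class_recall(toy_cm)
overall_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(recall)
print(overall_accuracy)
```

A class whose row has nonzero off-diagonal entries is being confused with the genres in those columns.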

2. K-Nearest Neighbours

First we need to find a good value for k.
The for loop below goes through a range of k values; for each one it fits KNN on the training set, predicts y from the X test set and records the accuracy.
We then plot accuracy against k to find the optimal k, i.e. the one with the highest accuracy.

In [None]:
range_k = range(1, 40)
scores = {}
scores_list = []

for k in range_k:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train.values.ravel())
    y_pred = classifier.predict(X_test)
    # Compute the test accuracy once and store it in both containers
    accuracy = metrics.accuracy_score(y_test, y_pred)
    scores[k] = accuracy
    scores_list.append(accuracy)

# Confusion matrix for the last k tried only (k = 39)
result = metrics.confusion_matrix(y_test, y_pred)

In [None]:
plt.figure(figsize=(12, 5), dpi=80)

plt.plot(range_k,scores_list)
plt.xlabel("Value of K")
plt.ylabel("Accuracy")

The accuracy reaches 40% at k = 7, then rises only slowly and flattens out at about 42% at most.
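Note that the loop above scores each k on the test set, so the chosen k has already seen the data used to judge it. One common alternative is to pick k by cross-validation on the training set only and keep the test set for a single final check. The following is a sketch under stated assumptions: it uses synthetic data from make_classification as a stand-in for the song features, and X_demo, y_demo, cv_scores and best_k are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the song features (not the Spotify data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Mean 5-fold cross-validated accuracy on the training set for each k
cv_scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# Choose the k with the best cross-validated accuracy, then evaluate once on the test set
best_k = max(cv_scores, key=cv_scores.get)
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)
print(best_k, test_accuracy)
```

On the real data the same pattern would use X_train and y_train.values.ravel() in place of the synthetic arrays.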

The next for loop measures the mean error rate for each k instead, which is simply 1 minus the accuracy

In [None]:
error = []

# Calculating the error rate for K values from 1 to 39

for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train.values.ravel())
    pred_i = knn.predict(X_test)
    # Compare against the flattened labels rather than hard-coding the test-set size
    error.append(np.mean(pred_i != y_test.values.ravel()))

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='cornflowerblue', linestyle='dashed', marker='o',
         markerfacecolor='cornflowerblue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')

This shows the same pattern as above.
So we go ahead and apply KNN with k = 7

In [None]:
classifier = KNeighborsClassifier(n_neighbors = 7)
classifier.fit(X_train, y_train.values.ravel())

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Poor accuracy all round: dnb and Underground Rap are the only two genres more likely to be classified correctly than not.
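One possible contributor is that KNN relies on distances between raw feature values, and duration_ms is measured in the hundreds of thousands while most other predictors lie between 0 and 1, so duration can dominate the distance. A minimal sketch of standardising the features before KNN follows. It assumes synthetic data with one artificially inflated column as a stand-in for duration_ms (X_demo, y_demo, raw_accuracy and scaled_accuracy are hypothetical names); it has not been run on the Spotify data, so whether it helps there is untested.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the song features (not the Spotify data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=1
)
# Inflate one column to mimic a feature on a much larger scale, like duration_ms
X_demo[:, 0] *= 100_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# KNN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
raw_accuracy = raw_knn.score(X_te, y_te)

# KNN on standardised features; the scaler is fitted on the training data only
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scaled_knn.fit(X_tr, y_tr)
scaled_accuracy = scaled_knn.score(X_te, y_te)

print(raw_accuracy, scaled_accuracy)
```

Wrapping the scaler and classifier in one pipeline means the test set is transformed with the training-set mean and standard deviation, which avoids leaking test information into the scaling step.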