# Predicting a song's genre using ML
* Problem definition - compare several classification models and aim for the best accuracy in predicting the genre of a song from the data provided
* Data - Data has been obtained from Kaggle: https://www.kaggle.com/mrmorj/dataset-of-songs-in-spotify
* Evaluation - try several models and reach the best possible accuracy
* Features - the data contains numerical features such as danceability, energy, key, loudness and mode, plus other categorical features
* Modelling - we will try K-Nearest Neighbors, Random Forest and Logistic Regression models for evaluation
* Experimentation - this section involves fine-tuning some of the hyperparameters to see if we can improve the accuracy

## Preparing the tools
Pandas, Matplotlib and NumPy for data analysis and manipulation

In [None]:
#Import EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
#Import models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
#Model evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score,f1_score
#plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
from sklearn.metrics import RocCurveDisplay

## Load data

In [None]:
data=pd.read_csv("../input/dataset-of-songs-in-spotify/genres_v2.csv")
data.head()

In [None]:
#Check the number of rows and columns
data.shape

In [None]:
#Check and view data in transposed form to view all columns
data.head().T

In [None]:
#Since we are going to predict the genre, let's check how many songs each class in the "genre" column has
data["genre"].value_counts()

In [None]:
#Let's plot the above counts from the "genre" column as a bar plot for better visualization
data["genre"].value_counts().plot(kind="barh",color=["lightblue"],title="Genres");

In [None]:
#Let's check and see if we have missing data
data.isna().sum()

As we can see above, missing values appear only in the last 3 columns. These columns are dropped before modelling, so they will not affect the genre prediction.
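
To judge how serious the gaps are, it can help to look at the share of missing values per column rather than the raw counts. The following is a minimal sketch on a hypothetical toy frame (the column values are made up, not taken from the Spotify data):

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "energy": [0.8, 0.6, 0.9, 0.7],
    "genre": ["Rap", "Pop", "Rap", "Emo"],
    "song_name": ["a", None, None, "d"],
    "title": [None, None, None, "mix"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
# Keep only the columns that actually contain gaps
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

The same two lines should work on `data` unchanged, since they only rely on `isna()`.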

In [None]:
#Finally we can check the data types in our data frame
data.dtypes
#Based on the output of this cell, we will use only the numerical columns for prediction

In [None]:
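#Correlation analysis using Seaborn heatmap for data analysis
#numeric_only=True (pandas >= 1.5) skips text columns, which would otherwise raise an error in pandas 2.x
corr_matrix=data.corr(numeric_only=True)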
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
              annot=True,
              linewidths=0.5,
              fmt=".2f",
              cmap="YlGnBu");

## Modelling

In [None]:
#Split the data into X and y
#Drop text and identifier columns; "genre" stays in num_data as the target
num_data=data.drop(["title","Unnamed: 0","song_name","analysis_url","track_href","uri","id","type"],axis=1)
X=num_data.drop("genre",axis=1)
y=num_data["genre"]
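
As an alternative to listing the dropped columns by hand, the numeric features could be selected by dtype. This is only a sketch on a hypothetical toy frame; note that a numeric index-like column (such as `Unnamed: 0`, if it is numeric) would survive this filter and would still need to be dropped by name:

```python
import pandas as pd

# Toy frame with a mix of numeric and text columns (hypothetical values)
toy = pd.DataFrame({
    "danceability": [0.83, 0.72, 0.85],
    "tempo": [156.0, 115.1, 218.0],
    "uri": ["u1", "u2", "u3"],
    "genre": ["Dark Trap", "Rap", "Dark Trap"],
})

target = "genre"
# Numeric feature columns are picked by dtype, not by name
X_toy = toy.drop(columns=[target]).select_dtypes(include="number")
y_toy = toy[target]
print(list(X_toy.columns))
```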

In [None]:
#Split into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
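
If the genre classes are imbalanced, a plain random split can leave the test set with a different class mix than the training set. One option, sketched here on made-up labels, is to pass `stratify` to `train_test_split` so that both parts keep the original proportions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 of class "a", 20 of class "b"
X_toy = pd.DataFrame({"f": np.arange(100)})
y_toy = pd.Series(["a"] * 80 + ["b"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
print(y_te.value_counts())
```

In the notebook this would amount to adding `stratify=y` to the existing call; whether it changes the scores here has not been checked.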

#### We will use 3 different models for this problem:
    1. Logistic Regression
    2. K-Nearest Neighbors
    3. Random Forest

In [None]:
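#For that we will create a function in order to evaluate and compare models easily
models={"LogReg":LogisticRegression(),
       "KNN":KNeighborsClassifier(),
       "Random Forest":RandomForestClassifier()}
def fit_and_score(models,X_train,X_test,y_train,y_test):
    """
    Fits and evaluates given machine learning models.
    models: dict mapping a name to an unfitted scikit-learn estimator
    Returns a dict mapping each name to its accuracy on the test set.
    """
    np.random.seed(1) #Fix the random seed for reproducible results
    model_scores={}
    for name,model in models.items():
        model.fit(X_train,y_train) #Fit on the training data
        model_scores[name]=model.score(X_test,y_test) #Mean accuracy on the test data
    return model_scores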

In [None]:
model_scores=fit_and_score(models=models,X_train=X_train,X_test=X_test,y_train=y_train,y_test=y_test)
model_scores
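
KNN and Logistic Regression are sensitive to feature scale, and the features here have quite different ranges (loudness against danceability, for example). Putting a `StandardScaler` in front of the model may therefore help. The sketch below uses synthetic data from `make_classification` with one column deliberately inflated; it illustrates the pattern only and makes no claim about the Spotify results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data; the first column is inflated to mimic a large-range feature
X_toy, y_toy = make_classification(n_samples=400, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=1)
X_toy[:, 0] *= 10_000
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

candidates = {
    "KNN raw": KNeighborsClassifier(),
    # The scaler is fitted on the training data only, inside the pipeline
    "KNN scaled": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
print(scores)
```

A pipeline like this can be placed in the `models` dict as is, because it exposes the same `fit` and `score` methods as a plain estimator.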

## Model comparison

In [None]:
model_compare=pd.DataFrame(model_scores,index=["Accuracy"])
model_compare.T.plot.barh(color=["lightblue"]);

#### Now that we have initial accuracy scores for the 3 models, let's look at the following:
* Hyperparameter tuning
* Feature importance
* Confusion matrix
* Cross validation
* Precision, Recall, F1 score
* Classification report
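
For the cross-validation item in the list above, a minimal sketch with `cross_val_score` could look like the following. It runs on synthetic data, and the estimator settings are illustrative assumptions rather than the tuned values from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for the song features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=1)

clf = RandomForestClassifier(n_estimators=50, random_state=1)
# 5-fold cross-validated accuracy: one score per fold
cv_scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean and the spread across folds gives a steadier estimate than a single train/test split.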

## Hyperparameter tuning
1. HP tuning by hand
2. HP tuning with RandomizedSearchCV

#### Hyperparameter tuning for KNN manually

In [None]:
#Tuning for KNN model
train_scores=[]
test_scores=[]
#Create list of different n-neighbors
neighbors = range(1,15)
#Setup KNN instance
knn=KNeighborsClassifier()
#Loop through different neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)
    knn.fit(X_train,y_train) #Fit the model
    train_scores.append(knn.score(X_train,y_train)) #Update the train score list
    test_scores.append(knn.score(X_test,y_test)) #Update test scores list

In [None]:
plt.plot(neighbors,train_scores,label="Train scores")
plt.plot(neighbors,test_scores,label="Test scores")
plt.xticks(np.arange(1,15,1))
plt.yticks(np.arange(0.1,1,0.1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model scores")
plt.legend();

The chart above shows that the best K value is *1*, which gives about 36% accuracy on the test set; this is still low.
Let's look at the other models and see whether we can improve the accuracy
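
Choosing K by looking at test-set scores lets the test set leak into model selection. A common alternative is to pick K by cross-validation on the training data and touch the test set only once at the end. The following is a sketch with `GridSearchCV` on synthetic data, not a rerun of the experiment above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the song features
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

# Same range of neighbors as the manual loop, scored by 5-fold CV on the training set
knn_search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={"n_neighbors": list(range(1, 15))},
                          cv=5)
knn_search.fit(X_tr, y_tr)

best_k = knn_search.best_params_["n_neighbors"]
# The test set is used once, for the final estimate
test_accuracy = knn_search.score(X_te, y_te)
print(best_k, test_accuracy)
```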

#### Hyperparameter tuning with RandomizedSearchCV for LogReg and Random Forest models
Let's tune LogReg and Random Forest models using RandomizedSearchCV

In [None]:
#Hyperparameter grid for Logistic Regression
log_reg_grid={"C": np.logspace(-4,4,20),
             "solver":["liblinear"]}
#Hyperparameter grid for Random Forest
rf_grid={"n_estimators": np.arange(10,1000,50),
         "max_depth":[None,3,5,10],
         "min_samples_split":np.arange(2,20,2),
         "min_samples_leaf":np.arange(1,20,2)}

In [None]:
#Tune LogReg model
np.random.seed(1)
rs_log_reg=RandomizedSearchCV(LogisticRegression(),
                             param_distributions=log_reg_grid,
                             cv=5,
                             n_iter=20,
                             verbose=True)
#Fit random hyperparameter search model to LogReg
rs_log_reg.fit(X_train,y_train)

In [None]:
#Let's check the best parameters
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test,y_test)

The score above is only a very small increase over the initial score. Now let's try the same with Random Forest and see how much it improves the score

In [None]:
#Tune Random Forest model
np.random.seed(42)
rs_rf=RandomizedSearchCV(RandomForestClassifier(),
                        param_distributions=rf_grid,
                        n_iter=20,
                        verbose=True)
rs_rf.fit(X_train,y_train)

In [None]:
rs_rf.best_params_

In [None]:
rs_rf.score(X_test,y_test)
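
It is worth keeping in mind how much of each grid a random search with `n_iter=20` actually covers. The sketch below rebuilds the two grids from the cells above and counts their combinations with `ParameterGrid`:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as defined earlier in the notebook
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# len() of a ParameterGrid is the number of distinct parameter combinations
n_log_reg = len(ParameterGrid(log_reg_grid))
n_rf = len(ParameterGrid(rf_grid))
print(n_log_reg, n_rf)  # 20 and 7200
```

With 20 iterations the Logistic Regression grid is searched exhaustively, while the Random Forest search samples roughly 0.3% of its 7200 combinations, so a larger `n_iter` might find better settings at the cost of run time.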

### Evaluating the tuned machine learning classifier beyond accuracy

* Comparison of real and predicted results
* Classification report
* Precision
* Recall
* F1 score

Since the tuned Random Forest model gives the best results, from here on we will evaluate only that model

In [None]:
#Make predictions with tuned model
y_preds=rs_rf.predict(X_test)
preds_df=pd.DataFrame(y_preds)
preds_df.head()

In [None]:
#Check actual vs predictions (y_test holds the true genres, y_preds the model output)
comparison=pd.DataFrame(data={"actual":y_test,"prediction":y_preds})
comparison.head()

In [None]:
#Calculate number of true and false predictions
comparison["result"]=comparison["actual"]==comparison["prediction"]
comparison["result"].value_counts().plot(kind="barh",color=["lightblue"],title="Comparison");
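
The share of `True` values in the `result` column is the accuracy itself. A small sketch on made-up labels shows that it matches `accuracy_score`:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted genres
y_true = pd.Series(["Rap", "Pop", "Emo", "Rap", "Pop"])
y_pred = ["Rap", "Emo", "Emo", "Rap", "Rap"]

comparison_toy = pd.DataFrame({"actual": y_true, "prediction": y_pred})
comparison_toy["result"] = comparison_toy["actual"] == comparison_toy["prediction"]

# The mean of a boolean column is the fraction of correct predictions
share_correct = comparison_toy["result"].mean()
print(share_correct, accuracy_score(y_true, y_pred))  # 0.6 0.6
```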

#### Classification report for Precision, F1 Score and Recall

In [None]:
#Classification report for precision, recall, f1score and accuracy
print(classification_report(y_test,y_preds))
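
A confusion matrix complements the report by showing which genres get mistaken for which. The following is a minimal sketch on made-up labels; rows are the true classes and columns the predicted ones:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted genres
y_true = ["Rap", "Pop", "Emo", "Rap", "Pop", "Emo"]
y_pred = ["Rap", "Emo", "Emo", "Rap", "Rap", "Emo"]
labels = ["Emo", "Pop", "Rap"]

# Passing labels fixes the row/column order of the matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Wrap in a DataFrame so rows (true) and columns (predicted) carry genre names
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```

In this toy case the "Pop" row has no entries on the diagonal: both Pop songs were predicted as another genre.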

#### Feature importance
Checking which features contributed most to the outcome

In [None]:
feat=RandomForestClassifier(n_estimators= 460,       #Using best params obtained earlier
                            min_samples_split= 6,
                            min_samples_leaf= 9,
                            max_depth= None)
feat.fit(X_train,y_train)

In [None]:
feat.feature_importances_

In [None]:
#Match feature importances to the feature columns (X excludes the "genre" target)
feature_dict=dict(zip(X.columns,list(feat.feature_importances_)))
#Visualize the feature importance
feature_data=pd.DataFrame(feature_dict,index=[0])
feature_data.T.plot.barh(title="Feature importance",legend=False,color="lightblue",grid=False);
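
The importances are easier to read when sorted. One way to do this, sketched here on synthetic data with hypothetical column names `f0` to `f4`, is to put them in a `pd.Series` indexed by feature name:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
cols = ["f0", "f1", "f2", "f3", "f4"]
X_arr, y_toy = make_classification(n_samples=300, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=1)
X_toy = pd.DataFrame(X_arr, columns=cols)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_toy, y_toy)

# One importance per feature column, sorted ascending so barh puts the largest on top
importances = pd.Series(forest.feature_importances_, index=X_toy.columns).sort_values()
print(importances)
```

Calling `importances.plot.barh()` on the result would then give a sorted version of the chart above.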