# MUSIC GENRE CLASSIFICATION USING MACHINE LEARNING 🎧

This notebook uses various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of classifying music by genre.

We are going to take the following approach:
1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modeling

## 1. Problem Definition
Music has been an important part of our lives since time immemorial. Every artist has a signature, making music a subjective art. We have scales/metrics to measure the quality of music. But, is it possible to train a machine learning model to predict the genre and quality of the music?

## 2. Data
The data is taken from the MachineHack competition:
https://machinehack.com/hackathons/music_genre_classification_weekend_hackathon_edition_2_the_last_hacker_standing/data

**Training dataset**: 17,996 rows with 17 columns

**Test dataset**: 7,713 rows with 16 columns

## 3. Evaluation
The submission will be evaluated using the log loss metric, which can be computed with sklearn.metrics.log_loss.
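
To make the metric concrete, the sketch below computes multiclass log loss by hand with NumPy. The labels and probabilities are made-up illustrative values, not competition data, and `multiclass_log_loss` is a hypothetical helper name. It also shows the score of a uniform guess over 11 classes (the number of distinct labels in the training data), which can serve as a rough reference point when reading a validation score.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, y_true])))

# Illustrative values only: 3 samples, 3 classes
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
loss = multiclass_log_loss(y_true, proba)

# Reference point: a uniform guess over 11 classes scores log(11)
uniform_proba = np.full((5, 11), 1 / 11)
uniform_loss = multiclass_log_loss(np.zeros(5, dtype=int), uniform_proba)

print(round(loss, 4), round(uniform_loss, 4))
```

Lower is better: a model that puts more probability on the correct genre gets a smaller loss, and confident wrong predictions are penalised heavily.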

## 4. Features
Column details: Artist Name, Track Name, Popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_in min/ms and time_signature.

Target variable: 'Class', with genres such as Rock, Indie, Alt, Pop, Metal, HipHop, Alt_Music, Blues, Acoustic/Folk, Instrumental, Country and Bollywood.

## Getting Started

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
data = pd.read_csv('train.csv')

In [3]:
data.head()

Unnamed: 0,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature,Class
0,Bruno Mars,That's What I Like (feat. Gucci Mane),60.0,0.854,0.564,1.0,-4.964,1,0.0485,0.0171,,0.0849,0.899,134.071,234596.0,4,5
1,Boston,Hitch a Ride,54.0,0.382,0.814,3.0,-7.23,1,0.0406,0.0011,0.00401,0.101,0.569,116.454,251733.0,4,10
2,The Raincoats,No Side to Fall In,35.0,0.434,0.614,6.0,-8.334,1,0.0525,0.486,0.000196,0.394,0.787,147.681,109667.0,4,6
3,Deno,Lingo (feat. J.I & Chunkz),66.0,0.853,0.597,10.0,-6.528,0,0.0555,0.0212,,0.122,0.569,107.033,173968.0,4,5
4,Red Hot Chili Peppers,Nobody Weird Like Me - Remastered,53.0,0.167,0.975,2.0,-4.279,1,0.216,0.000169,0.0161,0.172,0.0918,199.06,229960.0,4,10


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17996 entries, 0 to 17995
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Artist Name         17996 non-null  object 
 1   Track Name          17996 non-null  object 
 2   Popularity          17568 non-null  float64
 3   danceability        17996 non-null  float64
 4   energy              17996 non-null  float64
 5   key                 15982 non-null  float64
 6   loudness            17996 non-null  float64
 7   mode                17996 non-null  int64  
 8   speechiness         17996 non-null  float64
 9   acousticness        17996 non-null  float64
 10  instrumentalness    13619 non-null  float64
 11  liveness            17996 non-null  float64
 12  valence             17996 non-null  float64
 13  tempo               17996 non-null  float64
 14  duration_in min/ms  17996 non-null  float64
 15  time_signature      17996 non-null  int64  
 16  Clas

In [5]:
data.isnull().sum()

Artist Name              0
Track Name               0
Popularity             428
danceability             0
energy                   0
key                   2014
loudness                 0
mode                     0
speechiness              0
acousticness             0
instrumentalness      4377
liveness                 0
valence                  0
tempo                    0
duration_in min/ms       0
time_signature           0
Class                    0
dtype: int64

In [6]:
data['Class'].unique()

array([ 5, 10,  6,  2,  4,  8,  9,  3,  7,  1,  0], dtype=int64)

For modelling, we need to deal with the object-dtype columns and fill all missing values.

Let's start by filling the missing values.

### Filling Missing Values

In [7]:
data.head()

Unnamed: 0,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature,Class
0,Bruno Mars,That's What I Like (feat. Gucci Mane),60.0,0.854,0.564,1.0,-4.964,1,0.0485,0.0171,,0.0849,0.899,134.071,234596.0,4,5
1,Boston,Hitch a Ride,54.0,0.382,0.814,3.0,-7.23,1,0.0406,0.0011,0.00401,0.101,0.569,116.454,251733.0,4,10
2,The Raincoats,No Side to Fall In,35.0,0.434,0.614,6.0,-8.334,1,0.0525,0.486,0.000196,0.394,0.787,147.681,109667.0,4,6
3,Deno,Lingo (feat. J.I & Chunkz),66.0,0.853,0.597,10.0,-6.528,0,0.0555,0.0212,,0.122,0.569,107.033,173968.0,4,5
4,Red Hot Chili Peppers,Nobody Weird Like Me - Remastered,53.0,0.167,0.975,2.0,-4.279,1,0.216,0.000169,0.0161,0.172,0.0918,199.06,229960.0,4,10


In [8]:
data.isnull().sum()

Artist Name              0
Track Name               0
Popularity             428
danceability             0
energy                   0
key                   2014
loudness                 0
mode                     0
speechiness              0
acousticness             0
instrumentalness      4377
liveness                 0
valence                  0
tempo                    0
duration_in min/ms       0
time_signature           0
Class                    0
dtype: int64

We have missing values in the columns Popularity, key and instrumentalness.

Let's fill them with the column median, which is less sensitive to outliers than the mean.
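
As a small illustration of why the median is the more robust choice, the sketch below compares the two statistics on a made-up series containing one extreme value and one missing entry. The numbers are illustrative only.

```python
import pandas as pd

# Illustrative values only: one extreme value among otherwise similar numbers
s = pd.Series([50.0, 52.0, 55.0, 53.0, 1000.0, None])

mean_value = s.mean()      # pulled upwards by the outlier
median_value = s.median()  # stays close to the bulk of the data

# Fill the missing entry with the median, as the notebook does per column
filled = s.fillna(median_value)

print(mean_value, median_value, filled.iloc[-1])
```

Both statistics skip the missing entry, but only the mean is dragged far away from the typical values.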

In [9]:
#filling numerical missing values
for label,content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            data[label] = content.fillna(content.median())

In [10]:
data.isnull().sum()

Artist Name           0
Track Name            0
Popularity            0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
acousticness          0
instrumentalness      0
liveness              0
valence               0
tempo                 0
duration_in min/ms    0
time_signature        0
Class                 0
dtype: int64

In [11]:
df_tmp = data.copy()

Now that the missing values have been filled, let's deal with the object-dtype columns. Here they are dropped rather than encoded.

### Dropping Artist Name and Track Name

In [12]:
df_tmp.head()

Unnamed: 0,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature,Class
0,Bruno Mars,That's What I Like (feat. Gucci Mane),60.0,0.854,0.564,1.0,-4.964,1,0.0485,0.0171,0.00391,0.0849,0.899,134.071,234596.0,4,5
1,Boston,Hitch a Ride,54.0,0.382,0.814,3.0,-7.23,1,0.0406,0.0011,0.00401,0.101,0.569,116.454,251733.0,4,10
2,The Raincoats,No Side to Fall In,35.0,0.434,0.614,6.0,-8.334,1,0.0525,0.486,0.000196,0.394,0.787,147.681,109667.0,4,6
3,Deno,Lingo (feat. J.I & Chunkz),66.0,0.853,0.597,10.0,-6.528,0,0.0555,0.0212,0.00391,0.122,0.569,107.033,173968.0,4,5
4,Red Hot Chili Peppers,Nobody Weird Like Me - Remastered,53.0,0.167,0.975,2.0,-4.279,1,0.216,0.000169,0.0161,0.172,0.0918,199.06,229960.0,4,10


In [13]:
df_tmp.drop('Artist Name',inplace=True,axis=1)

In [14]:
df_tmp.drop('Track Name',inplace=True,axis=1)

In [15]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17996 entries, 0 to 17995
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Popularity          17996 non-null  float64
 1   danceability        17996 non-null  float64
 2   energy              17996 non-null  float64
 3   key                 17996 non-null  float64
 4   loudness            17996 non-null  float64
 5   mode                17996 non-null  int64  
 6   speechiness         17996 non-null  float64
 7   acousticness        17996 non-null  float64
 8   instrumentalness    17996 non-null  float64
 9   liveness            17996 non-null  float64
 10  valence             17996 non-null  float64
 11  tempo               17996 non-null  float64
 12  duration_in min/ms  17996 non-null  float64
 13  time_signature      17996 non-null  int64  
 14  Class               17996 non-null  int64  
dtypes: float64(12), int64(3)
memory usage: 2.1 MB


The data is now cleaned.

Let's split our training data and get to modelling.

### Modelling

In [16]:
from sklearn.model_selection import train_test_split

np.random.seed(42)

#getting our X and y variable
X = df_tmp.drop('Class',axis=1)
y = df_tmp['Class']

#splitting our training dataset into training and validation
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size = 0.2)


In [17]:
len(X_train), len(X_val), len(y_train), len(y_val)

(14396, 3600, 14396, 3600)
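
If the genres are unevenly represented, a plain random split can leave rare classes under-represented in the validation set. One option, shown here only as a sketch on synthetic labels, is to pass `stratify=y` so both parts keep the same class proportions. Passing `random_state` directly is also an alternative to seeding NumPy globally. The names `X_demo` and `y_demo` are stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows with imbalanced labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([0] * 800 + [1] * 150 + [2] * 50)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class counts in the validation part mirror the 80/15/5 proportions
val_counts = np.bincount(y_va)
print(val_counts)
```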

Now our data is fully ready for modelling.

The dataset is fairly large, so a quick first pass could train on a sample of 1,000 rows; the cells below fit each model on the full training split.
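
A 1,000-row subsample of the kind mentioned above could be drawn roughly as follows. This is a sketch on synthetic stand-ins (`X_train_demo`, `y_train_demo`) rather than the notebook's own frames; the key point is that the labels are selected by the sampled index so rows and targets stay aligned.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for X_train / y_train
rng = np.random.default_rng(0)
X_train_demo = pd.DataFrame(rng.normal(size=(5000, 3)), columns=["a", "b", "c"])
y_train_demo = pd.Series(rng.integers(0, 11, size=5000), index=X_train_demo.index)

# Draw 1,000 rows, then pick the matching labels by index
X_small = X_train_demo.sample(n=1000, random_state=42)
y_small = y_train_demo.loc[X_small.index]

print(X_small.shape, y_small.shape)
```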

In [18]:
#importing models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier


#creating dictionary 
models = {
        'Logistic regression': LogisticRegression(),
        'Decision tree classifier': DecisionTreeClassifier(),
        'Random Forest': RandomForestClassifier(),
        'AdaBoost': AdaBoostClassifier(),
        'GradientBoosting': GradientBoostingClassifier(),
        'SVC': SVC(),
        'Kneighbors': KNeighborsClassifier()}

In [19]:
#function to fit each model and score it on the held-out data

def fit_model(models,X_train,X_test,y_train,y_test):
    np.random.seed(42)
    for name,model in models.items():
        model.fit(X_train,y_train)
        #score on the evaluation data passed to the function
        score = model.score(X_test,y_test)
        print(f'Accuracy of {model} : {score}')

In [20]:
fit_model(
    models=models,
    X_train = X_train,
    X_test = X_val,
    y_train = y_train,
    y_test = y_val
)

Accuracy of LogisticRegression() : 0.29444444444444445
Accuracy of DecisionTreeClassifier() : 0.36277777777777775
Accuracy of RandomForestClassifier() : 0.5036111111111111
Accuracy of AdaBoostClassifier() : 0.4263888888888889
Accuracy of GradientBoostingClassifier() : 0.5425
Accuracy of SVC() : 0.2972222222222222
Accuracy of KNeighborsClassifier() : 0.23666666666666666
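
Logistic regression, SVC and k-nearest neighbours are sensitive to feature scale, and the columns here range from fractions (danceability) to hundreds of thousands (duration). The low scores for those three models may partly reflect that, so standardising the features inside a pipeline could be worth trying. The sketch below uses purely synthetic data to show the effect on k-NN; its numbers say nothing about this dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative small-scale feature, one large-scale noise feature
rng = np.random.default_rng(42)
signal = rng.normal(size=600)
noise = rng.normal(scale=100_000, size=600)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te = X_demo[:400], X_demo[400:]
y_tr, y_te = y_demo[:400], y_demo[400:]

# Without scaling, distances are dominated by the large-scale noise column
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
# With scaling, both columns contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print(raw_score, scaled_score)
```

Tree-based models such as random forests and gradient boosting split on thresholds, so they are largely unaffected by this kind of rescaling.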


RandomForestClassifier and GradientBoostingClassifier have the highest accuracy scores (about 0.50 and 0.54).

So let's try to improve these two models further with RandomizedSearchCV and GridSearchCV.

### Hyperparameter tuning

In [21]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
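
For classifiers, these search classes rank candidates by accuracy unless told otherwise. Since the competition metric is log loss, one option is to pass `scoring="neg_log_loss"`. The sketch below shows this on a small synthetic problem; `demo_grid` and the data are placeholders chosen so it runs in seconds, not the grids used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs quickly
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical, deliberately tiny search space
demo_grid = {"n_estimators": [10, 20, 40], "max_depth": [3, 5, None]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=demo_grid,
    n_iter=4,
    cv=3,
    scoring="neg_log_loss",  # scorers are maximised, so log loss is negated
    random_state=42,
)
search.fit(X_demo, y_demo)

# Flip the sign to read the best cross-validated log loss
best_log_loss = -search.best_score_
print(search.best_params_, best_log_loss)
```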

In [23]:
#making grids for RandomForestClassifier
rf_grids = {
    'bootstrap': [True,False],
    'max_depth': [10,20,40,60,80,90,100,None],
    #note: 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features' : ['auto','sqrt'],
    'min_samples_leaf' : [1,2,4],
    'min_samples_split' : [2,5,10],
    'n_estimators' : [50,100,200,400,600,800,1000,1400,2000]
}


#making grids for GradientBoostingClassifier
gb_grids = {
     "n_estimators":[5,50,250,500],
     "max_depth":[1,3,5,7,9],
     "learning_rate":[0.01,0.1,1,10,100]
}

In [26]:
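#RandomizedSearchCV for RandomForestClassifier
#note: a bare %time line times nothing (hence "Wall time: 0 ns" below); %%time as the first line of the cell would time the whole cell
%time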
rf_cv = RandomizedSearchCV(RandomForestClassifier(),param_distributions=rf_grids,cv=5,n_iter=10,random_state=42,n_jobs=-1,verbose=True)
rf_cv.fit(X_train,y_train)

Wall time: 0 ns
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  3.2min finished


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 40, 60, 80, 90,
                                                      100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [50, 100, 200, 400, 600,
                                                         800, 1000, 1400,
                                                         2000]},
                   random_state=42, verbose=True)

In [28]:
rf_cv.best_params_

{'n_estimators': 1400,
 'min_samples_split': 10,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': None,
 'bootstrap': True}

In [29]:
rf_cv.score(X_val,y_val)

0.5294444444444445

In [35]:
gb_gs = RandomizedSearchCV(
    GradientBoostingClassifier(),param_distributions=gb_grids,verbose=True,cv=5,n_iter=20,random_state=42,n_jobs=-1
)
gb_gs.fit(X_train,y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed: 12.8min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 29.8min finished


RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=20,
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.1, 1, 10,
                                                          100],
                                        'max_depth': [1, 3, 5, 7, 9],
                                        'n_estimators': [5, 50, 250, 500]},
                   random_state=42, verbose=True)

In [36]:
gb_gs.best_params_

{'n_estimators': 250, 'max_depth': 5, 'learning_rate': 0.01}

In [37]:
gb_gs.score(X_val,y_val)

0.5338888888888889

In [21]:
#evaluating our model with best params

model = GradientBoostingClassifier(
                    n_estimators=250,max_depth=5,learning_rate=0.01)

model.fit(X_train,y_train)
model.score(X_val,y_val)

0.5355555555555556

In [22]:
preds = model.predict(X_val)

preds

array([ 5,  8,  0, ...,  5,  6, 10], dtype=int64)

In [23]:
from sklearn.metrics import log_loss,classification_report, confusion_matrix, roc_auc_score
report = classification_report(y_val,preds)
cf_matrix = confusion_matrix(y_val,preds)


In [24]:
print(report)

              precision    recall  f1-score   support

           0       0.73      0.73      0.73       136
           1       0.33      0.01      0.03       286
           2       0.57      0.37      0.45       281
           3       0.70      0.76      0.73        78
           4       0.62      0.82      0.71        71
           5       0.67      0.69      0.68       262
           6       0.41      0.30      0.35       500
           7       0.90      0.89      0.90       103
           8       0.65      0.60      0.62       382
           9       0.52      0.53      0.53       531
          10       0.46      0.69      0.55       970

    accuracy                           0.54      3600
   macro avg       0.60      0.58      0.57      3600
weighted avg       0.53      0.54      0.51      3600
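
The confusion matrix is computed above but never displayed. One way to make such a matrix readable is to wrap it in a labelled DataFrame, sketched here on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's `y_val` and `preds`).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Illustrative labels only
y_true_demo = [0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0, 2]

labels = [0, 1, 2]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are true classes, columns are predicted classes
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{c}" for c in labels],
    columns=[f"pred_{c}" for c in labels],
)
print(cm_df)
```

Off-diagonal cells show which genres get mistaken for which, something the per-class precision and recall figures only hint at.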



In [25]:
pred_proba = model.predict_proba(X_val)

In [26]:
val_log_loss = log_loss(y_val,pred_proba)

In [27]:
val_log_loss

1.272040405154302

### Getting test_data


In [28]:
test_data = pd.read_csv('test.csv')


In [29]:
test_data.head()

Unnamed: 0,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature
0,David Bowie,Space Oddity - 2015 Remaster,73.0,0.31,0.403,,-13.664,1,0.0326,0.0726,9.3e-05,0.139,0.466,134.48,318027.0,4
1,Crimson Sun,Essence of Creation,34.0,0.511,0.955,1.0,-5.059,1,0.129,0.0004,9e-06,0.263,0.291,151.937,220413.0,4
2,P!nk,Raise Your Glass,78.0,0.7,0.709,7.0,-5.006,1,0.0839,0.0048,,0.0289,0.625,122.019,202960.0,4
3,Shawn Mendes,Wonder,80.0,0.333,0.637,1.0,-4.904,0,0.0581,0.131,1.8e-05,0.149,0.132,139.898,172693.0,4
4,Backstreet Boys,Helpless When She Smiles - Radio Version,48.0,0.393,0.849,11.0,-4.114,1,0.0459,0.00421,,0.162,0.222,74.028,4.093117,4
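
The last row above has a duration of 4.093117 while the others are in the hundreds of thousands, which suggests the duration_in min/ms column mixes minutes and milliseconds. A possible normalisation is sketched below using the five values shown. The cutoff of 30 is purely an assumption for illustration, and the same rule would have to be applied to the training data as well.

```python
import pandas as pd

# The five duration values from the rows shown above
duration = pd.Series([318027.0, 220413.0, 202960.0, 172693.0, 4.093117])

# Assumption: anything below 30 is a duration in minutes, not milliseconds
MINUTES_CUTOFF = 30
looks_like_minutes = duration < MINUTES_CUTOFF

# Keep millisecond values as they are; convert suspected minutes to milliseconds
duration_ms = duration.where(~looks_like_minutes, duration * 60_000)
print(duration_ms.tolist())
```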


In [30]:
test_data.isnull().sum()

Artist Name              0
Track Name               0
Popularity             227
danceability             0
energy                   0
key                    808
loudness                 0
mode                     0
speechiness              0
acousticness             0
instrumentalness      1909
liveness                 0
valence                  0
tempo                    0
duration_in min/ms       0
time_signature           0
dtype: int64

In [31]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7713 entries, 0 to 7712
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Artist Name         7713 non-null   object 
 1   Track Name          7713 non-null   object 
 2   Popularity          7486 non-null   float64
 3   danceability        7713 non-null   float64
 4   energy              7713 non-null   float64
 5   key                 6905 non-null   float64
 6   loudness            7713 non-null   float64
 7   mode                7713 non-null   int64  
 8   speechiness         7713 non-null   float64
 9   acousticness        7713 non-null   float64
 10  instrumentalness    5804 non-null   float64
 11  liveness            7713 non-null   float64
 12  valence             7713 non-null   float64
 13  tempo               7713 non-null   float64
 14  duration_in min/ms  7713 non-null   float64
 15  time_signature      7713 non-null   int64  
dtypes: flo

The test data also has missing values and object-dtype columns, so we apply the same preprocessing steps to it that we applied to the training data.

In [59]:
def preprocess(data):
    """Fill missing values in numeric columns with the column median (modifies data in place)."""
    for label,content in data.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                data[label] = content.fillna(content.median())

In [60]:
preprocess(test_data)

In [61]:
test_data.isnull().sum()

Artist Name           0
Track Name            0
Popularity            0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
acousticness          0
instrumentalness      0
liveness              0
valence               0
tempo                 0
duration_in min/ms    0
time_signature        0
dtype: int64
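
The preprocess function fills each test column with the test set's own median. An alternative that keeps both sets on exactly the same fill values is to learn the medians from the training data and reuse them. The sketch below does this with scikit-learn's SimpleImputer on tiny made-up frames (`train_demo`, `test_demo`); it is not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny stand-ins for the numeric train and test frames
train_demo = pd.DataFrame({"Popularity": [40.0, 50.0, 60.0, np.nan],
                           "key": [1.0, np.nan, 5.0, 9.0]})
test_demo = pd.DataFrame({"Popularity": [np.nan, 90.0],
                          "key": [2.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train_demo)  # medians are learned from the training rows only

# Apply the training medians to the test rows
test_filled = pd.DataFrame(
    imputer.transform(test_demo), columns=test_demo.columns, index=test_demo.index
)
print(test_filled)
```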

In [62]:
test_data.drop('Artist Name',axis=1,inplace=True)
test_data.drop('Track Name',axis=1,inplace=True)

In [63]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7713 entries, 0 to 7712
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Popularity          7713 non-null   float64
 1   danceability        7713 non-null   float64
 2   energy              7713 non-null   float64
 3   key                 7713 non-null   float64
 4   loudness            7713 non-null   float64
 5   mode                7713 non-null   int64  
 6   speechiness         7713 non-null   float64
 7   acousticness        7713 non-null   float64
 8   instrumentalness    7713 non-null   float64
 9   liveness            7713 non-null   float64
 10  valence             7713 non-null   float64
 11  tempo               7713 non-null   float64
 12  duration_in min/ms  7713 non-null   float64
 13  time_signature      7713 non-null   int64  
dtypes: float64(12), int64(2)
memory usage: 843.7 KB


### Let's predict on test data with our model

In [65]:
test_predictions = model.predict(test_data)

In [66]:
test_predictions

array([10,  8,  9, ...,  9,  6,  5], dtype=int64)

In [67]:
test_predictions_proba = model.predict_proba(test_data)

In [69]:
test_predictions_proba

array([[0.00241662, 0.03937233, 0.01986839, ..., 0.04650399, 0.05366218,
        0.75516241],
       [0.00173554, 0.04965883, 0.01335987, ..., 0.61139707, 0.02628616,
        0.22032691],
       [0.00296542, 0.06403276, 0.01414941, ..., 0.02142519, 0.52588717,
        0.22863867],
       ...,
       [0.12209525, 0.01784506, 0.01877069, ..., 0.02495268, 0.30800518,
        0.15383303],
       [0.00287235, 0.15396987, 0.03126341, ..., 0.03236635, 0.0672747 ,
        0.17895558],
       [0.00144126, 0.02127852, 0.01232037, ..., 0.00922863, 0.07812685,
        0.09802106]])

In [72]:
sub_temp = pd.read_csv("submission.csv")
test_predictions_proba = pd.DataFrame(test_predictions_proba)
test_predictions_proba.columns = sub_temp.columns
test_predictions_proba.to_csv("submit.csv",index=False)

### Saving our model

In [32]:
from joblib import dump,load
dump(model,filename='model_1.joblib')

['model_1.joblib']
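
A saved model can be read back with `load` and checked against the original. The sketch below does a round trip with a small stand-in classifier and a temporary file (the file name `demo_model.joblib` is hypothetical); the notebook's `model` would be handled the same way.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model trained on toy data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(clf, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```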

### Further hyperparameter tuning

In [37]:
gb_grids_2= {
     "n_estimators":[220,250,280],
     "max_depth":[3,5,7],
     "learning_rate":[0.01]
}
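
A grid search evaluates every combination in the grid, so the number of candidates is the product of the list lengths. As a quick check, scikit-learn's ParameterGrid can count them for the grid defined above; the fold count of 5 is taken from the cv setting used throughout this notebook.

```python
from sklearn.model_selection import ParameterGrid

gb_grids_2 = {
    "n_estimators": [220, 250, 280],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01],
}

n_candidates = len(ParameterGrid(gb_grids_2))  # 3 * 3 * 1 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold
print(n_candidates, n_fits)
```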

In [39]:
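from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
#GridSearchCV tries every combination in the grid, so it takes no n_iter argument
#note: the output recorded below shows a RandomizedSearchCV over a different grid,
#so it appears to come from an earlier version of this cell
gb_rscv = GridSearchCV(GradientBoostingClassifier(),param_grid=gb_grids_2,cv=5,verbose=True,n_jobs=-1)
gb_rscv.fit(X_train,y_train)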

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.4min finished


RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.05, 0.1, 0.2,
                                                          1, 100],
                                        'max_depth': array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
       14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
       27., 28., 29., 30., 31., 32.]),
                                        'min_samples_leaf': array([0.1, 0.2, 0.3, 0.4, 0.5]),
                                        'min_samples_split': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                                        'n_estimators': [5, 50, 100, 200, 250,
                                                         300]},
                   verbose=True)

In [40]:
gb_rscv.best_params_

{'n_estimators': 200,
 'min_samples_split': 0.7000000000000001,
 'min_samples_leaf': 0.30000000000000004,
 'max_depth': 14.0,
 'learning_rate': 0.1}

In [41]:
gb_rscv.score(X_val,y_val)

0.4577777777777778