# TIME SIGNATURE PREDICTION OF GIVEN SPOTIFY SONGS

In this repository, I'll try to predict the time signature of the Spotify songs in our dataset. <br>
**Time Signature** (also known as meter signature, metre signature, or measure signature) is a notational convention used in Western musical notation to specify how many beats (pulses) are to be contained in each bar and which note value is to be given one beat. <br>
In this dataset, we have 4 kinds of time signatures: 1, 3, 4 and 5. Since we'll classify each song into one of these 4 time signatures, this is a multiclass classification problem. <br>
We'll do this with Scikit-Learn. <br>
To do that, we first need to import the necessary libraries and our dataset, and then prepare the dataset for our ML models.

#### IMPORTING NECESSARY LIBRARIES

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

#### IMPORTING OUR DATASET

In [2]:
url1 = "C:\\Users\\talfi\\python\\TensorFlow\\1.Regression\\self\\spoti\\genres_v2.csv"
url2= "C:\\Users\\talfi\\python\\TensorFlow\\1.Regression\\self\\spoti\\playlists.csv"

In [3]:
spoti = pd.read_csv(url1, encoding='utf-8', quotechar='"')
spoti.head()

Unnamed: 0.1,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,id,uri,track_href,analysis_url,duration_ms,time_signature,genre,song_name,Unnamed: 0,title
0,0.831,0.814,2,-7.364,1,0.42,0.0598,0.0134,0.0556,0.389,...,2Vc6NJ9PW9gD9q343XFRKx,spotify:track:2Vc6NJ9PW9gD9q343XFRKx,https://api.spotify.com/v1/tracks/2Vc6NJ9PW9gD...,https://api.spotify.com/v1/audio-analysis/2Vc6...,124539,4,Dark Trap,Mercury: Retrograde,,
1,0.719,0.493,8,-7.23,1,0.0794,0.401,0.0,0.118,0.124,...,7pgJBLVz5VmnL7uGHmRj6p,spotify:track:7pgJBLVz5VmnL7uGHmRj6p,https://api.spotify.com/v1/tracks/7pgJBLVz5Vmn...,https://api.spotify.com/v1/audio-analysis/7pgJ...,224427,4,Dark Trap,Pathology,,
2,0.85,0.893,5,-4.783,1,0.0623,0.0138,4e-06,0.372,0.0391,...,0vSWgAlfpye0WCGeNmuNhy,spotify:track:0vSWgAlfpye0WCGeNmuNhy,https://api.spotify.com/v1/tracks/0vSWgAlfpye0...,https://api.spotify.com/v1/audio-analysis/0vSW...,98821,4,Dark Trap,Symbiote,,
3,0.476,0.781,0,-4.71,1,0.103,0.0237,0.0,0.114,0.175,...,0VSXnJqQkwuH2ei1nOQ1nu,spotify:track:0VSXnJqQkwuH2ei1nOQ1nu,https://api.spotify.com/v1/tracks/0VSXnJqQkwuH...,https://api.spotify.com/v1/audio-analysis/0VSX...,123661,3,Dark Trap,ProductOfDrugs (Prod. The Virus and Antidote),,
4,0.798,0.624,2,-7.668,1,0.293,0.217,0.0,0.166,0.591,...,4jCeguq9rMTlbMmPHuO7S3,spotify:track:4jCeguq9rMTlbMmPHuO7S3,https://api.spotify.com/v1/tracks/4jCeguq9rMTl...,https://api.spotify.com/v1/audio-analysis/4jCe...,123298,4,Dark Trap,Venom,,


#### FITTING OUR DATASET FOR ML MODEL

We can do this in two ways. The **first** one is the `pd.get_dummies()` function. Actually, I've tried this, but my computer's RAM couldn't handle it. Thus, I'll go with the **second** way: dropping the object columns. <br>
Normally, this option is not preferable because of the data loss, but as our df has a large number of rows and columns, I think we'll be fine.
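
As a small illustration of the second approach, the sketch below keeps only the numeric columns with `select_dtypes` instead of listing the object columns by hand. The toy frame and its values are made up for the example; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) mixing numeric and text columns
toy = pd.DataFrame({
    "danceability": [0.83, 0.72],
    "key": [2, 8],
    "genre": ["Dark Trap", "Dark Trap"],
    "song_name": ["Mercury: Retrograde", "Pathology"],
})

# Keep only numeric columns; text columns are dropped
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```

The explicit `drop(columns=[...])` used in this notebook does the same job and makes the removed columns visible at a glance.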

In [4]:
spoti.dtypes

danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
type                 object
id                   object
uri                  object
track_href           object
analysis_url         object
duration_ms           int64
time_signature        int64
genre                object
song_name            object
Unnamed: 0          float64
title                object
dtype: object

In [5]:
spoti = spoti.drop(columns= ["type", "id", "uri", "track_href", "analysis_url", "genre", "song_name", "title"])

In [6]:
spoti.dtypes

danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
time_signature        int64
Unnamed: 0          float64
dtype: object

In [7]:
spoti.shape

(42305, 14)

See, 42305 rows and 14 columns are enough to train and test our model with Scikit-Learn.

In [13]:
spoti["time_signature"].value_counts()

4    40427
3     1219
5      509
1      150
Name: time_signature, dtype: int64

Hmm, most songs have 4 beats (pulses) per bar, so the classes are heavily imbalanced.
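
To make the class distribution concrete, here is a minimal sketch that turns the counts above into shares. The counts are copied from the `value_counts()` output; the variable names are only illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({4: 40427, 3: 1219, 5: 509, 1: 150})

# Share of each class in the whole dataset
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy a model would get by always predicting the majority class
majority_baseline = shares.max()
print(round(majority_baseline, 4))
```

Any accuracy reported later is worth comparing against this majority-class share.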

Now, let's start with the Scikit-Learn.

# SCIKIT LEARN

In Scikit-Learn, we'll first look at our `classification_report`. We'll classify our dataset with Support Vector Classification, and we'll also use sklearn's `Pipeline`. <br>
Note: I am going to classify our dataset with as many different methods as possible to demonstrate my knowledge.

#### SVM
First, we'll look at our `classification_report` after classifying our data with SVM. <br>
* The classification report summarizes the key metrics of a classification problem: precision, recall, f1-score and support for each class we are trying to predict. Recall means "how many elements of this class we found, out of all elements of this class". <br>
Then, we'll classify our dataset with the Support Vector Classifier from `sklearn.svm`. <br>
* `SVC` (Support Vector Classifier) fits the data we provide and returns a "best fit" hyperplane that divides, or categorizes, our data. The idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data into classes. <br>
We'll also use the `Pipeline` class. <br>
* A pipeline sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
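
The recall and precision definitions can be checked by hand on a tiny example. This is only a sketch: the labels below are hypothetical and chosen to keep the arithmetic easy to follow.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels, for illustration only
y_true = [4, 4, 4, 3, 3, 5]
y_hat = [4, 4, 3, 3, 4, 4]

# Recall for class 3: correctly found 3s / all true 3s
true_3 = sum(1 for t in y_true if t == 3)
found_3 = sum(1 for t, p in zip(y_true, y_hat) if t == 3 and p == 3)
recall_3 = found_3 / true_3

# Precision for class 3: correctly found 3s / everything predicted as 3
pred_3 = sum(1 for p in y_hat if p == 3)
precision_3 = found_3 / pred_3

# The same quantities from scikit-learn, restricted to class 3
sk_recall_3 = recall_score(y_true, y_hat, labels=[3], average=None)[0]
sk_precision_3 = precision_score(y_true, y_hat, labels=[3], average=None)[0]
print(recall_3, precision_3, sk_recall_3, sk_precision_3)
```

`classification_report` computes these per-class values for every label and adds the f1-score and support columns.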

In [14]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

In [15]:
from sklearn.metrics import classification_report

In [16]:
from sklearn.model_selection import train_test_split
steps = [("imputation", SimpleImputer(missing_values=np.nan, strategy='most_frequent')),("SVM", SVC())]
pipeline = Pipeline(steps)
X_train, X_test, y_train ,y_test = train_test_split(
    spoti.drop(columns='time_signature'),
    spoti["time_signature"],
    test_size=0.25,
    random_state=42,
    stratify=spoti["time_signature"]
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        38
           3       0.00      0.00      0.00       305
           4       0.96      1.00      0.98     10107
           5       0.00      0.00      0.00       127

    accuracy                           0.96     10577
   macro avg       0.24      0.25      0.24     10577
weighted avg       0.91      0.96      0.93     10577





Here are our key metrics for this classification problem. Precision and recall are 0.00 for classes 1, 3 and 5, so the model is effectively predicting class 4 for every song. <br>
As you may have spotted, we used `SimpleImputer`. <br>
`SimpleImputer` is a scikit-learn class that helps handle missing data. It replaces the NaN values with a specified placeholder. In the pipeline above we used the most frequent value of each column; in the next cell we'll replace the NaN values with the mean of each column. <br>
We'll use the simple imputer one more time, this time without the pipeline. <br>
As `SimpleImputer` returns a plain array whose columns are identified only by their index numbers, we'll restore the column names afterwards.
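
The loss of column names is easy to reproduce on a toy frame. The sketch below uses made-up numbers and shows one way to restore the names by reusing the original frame's `columns` rather than typing them again.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value (hypothetical numbers)
toy = pd.DataFrame({"tempo": [120.0, np.nan, 100.0], "key": [2, 8, 5]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)  # plain NumPy array, column names are gone

# Reuse the original column order instead of retyping the names
restored = pd.DataFrame(filled, columns=toy.columns, index=toy.index)
print(restored)
print(restored.dtypes)
```

Because the result passes through a single float array, the integer `key` column also comes back as float64.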

In [17]:
imr = SimpleImputer(missing_values=np.nan, strategy='mean')

imr = imr.fit(spoti)

imputed_data = imr.transform(spoti)
# Reuse the original column order; a hand-typed index-to-name mapping is easy to shift by one.
# Only "Unnamed: 0" is renamed, to "Unnamed-0".
spoti = pd.DataFrame(imputed_data, columns=spoti.columns).rename(columns={"Unnamed: 0": "Unnamed-0"})
spoti.head(3)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,Unnamed-0
0,0.831,0.814,2.0,-7.364,1.0,0.42,0.0598,0.0134,0.0556,0.389,156.985,124539.0,4.0,10483.970645
1,0.719,0.493,8.0,-7.23,1.0,0.0794,0.401,0.0,0.118,0.124,115.08,224427.0,4.0,10483.970645
2,0.85,0.893,5.0,-4.783,1.0,0.0623,0.0138,4e-06,0.372,0.0391,218.05,98821.0,4.0,10483.970645


In [10]:
spoti.dtypes

danceability        float64
energy              float64
key                 float64
loudness            float64
mode                float64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms         float64
time_signature      float64
Unnamed-0           float64
dtype: object

SimpleImputer did one more thing: it converted the int64 columns of our dataset to float64. Now all dtypes are float64, and that is good for our models.

#### MULTICLASS ROC AUC SCORE WITH LOGISTIC REGRESSION

Now, it is time to implement the **ROC AUC score**. <br>
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. <br>
As we are dealing with multiclass classification, we'll define a new `roc_auc_score_multiclass` function that scores each class against all the others.

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
X_train, X_test, y_train ,y_test = train_test_split(
    spoti.drop(columns='time_signature'),
    spoti["time_signature"],
    test_size=0.25,
    random_state=42,
    stratify=spoti["time_signature"]
)
logreg.fit(X_train, y_train)
def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):

  #creating a set of all the unique classes using the actual class list
  unique_class = set(actual_class)
  roc_auc_dict = {}
  for per_class in unique_class:
    #creating a list of all the classes except the current class 
    other_class = [x for x in unique_class if x != per_class]

    #marking the current class as 1 and all other classes as 0
    new_actual_class = [0 if x in other_class else 1 for x in actual_class]
    new_pred_class = [0 if x in other_class else 1 for x in pred_class]

    #using the sklearn metrics method to calculate the roc_auc_score
    roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
    roc_auc_dict[per_class] = roc_auc

  return roc_auc_dict

In [19]:
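# Caution: roc_auc_score_multiclass expects predicted class labels;
# a single probability column makes every class score 0.5
y_pred_prob = logreg.predict_proba(X_test)[:,1]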
roc_auc_score_multiclass(y_test, y_pred_prob)

{1.0: 0.5, 3.0: 0.5, 4.0: 0.5, 5.0: 0.5}
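
A score of exactly 0.5 for every class is what the helper returns when its second argument is not made of class labels. It marks each prediction as 0 only if the value equals one of the other class labels, so a column of probabilities is mapped to all 1s and carries no ranking information. As an alternative, scikit-learn's `roc_auc_score` can take the full probability matrix directly with `multi_class="ovr"`. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features, so the printed number says nothing about the Spotify dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the real features (an assumption)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Full probability matrix: one column per class, in clf.classes_ order
proba = clf.predict_proba(X_te)

# One-vs-rest AUC, averaged over the classes
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(round(auc_ovr, 3))
```

Passing all probability columns lets the metric rank samples per class, which a single column or hard labels cannot do.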

#### AUC SCORE AND BEST PIPELINE STEPS WITH TPOT CLASSIFIER

Now, let's try to find the AUC score one more time, but this time with the TPOT classifier. <br>
TPOT is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Genetic Programming stochastic global search procedure to efficiently discover a top-performing model pipeline for a given dataset. <br>
* TPOT searches with `scoring='accuracy'` below, so after fitting we still compute the AUC ourselves with the `roc_auc_score_multiclass` helper defined above. <br>
* We'll also discover the best pipeline steps with TPOT. That is, TPOT will tell us which pipeline steps to use to reach maximum success.

In [20]:
from tpot import TPOTClassifier
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=3,
    scoring='accuracy',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# AUC score for tpot model
# The helper expects predicted class labels, so use predict() rather than a probability column
tpot_auc_score = roc_auc_score_multiclass(y_test, tpot.predict(X_test))
print('\nAUC score:', tpot_auc_score)

# Print best pipeline steps
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print idx and transform
    print(f'{idx}. {transform}')



19 operators have been imported by TPOT.



_pre_test decorator: _random_mutation_operator: num_test=0 Solver lbfgs supports only dual=False, got dual=True.
_pre_test decorator: _random_mutation_operator: num_test=0 Negative values in data passed to MultinomialNB (input X).
_pre_test decorator: _random_mutation_operator: num_test=1 Negative values in data passed to MultinomialNB (input X).
_pre_test decorator: _random_mutation_operator: num_test=0 Negative values in data passed to MultinomialNB (input X).

Generation 1 - Current Pareto front scores:

-1	0.9615166332756907	KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=73, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)
_pre_test decorator: _random_mutation_operator: num_test=0 Negative values in data passed to MultinomialNB (input X).

Generation 2 - Current Pareto front scores:

-1	0.9615166332756907	KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=73, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)

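It seems that the best classifier for this case is `KNeighborsClassifier(n_neighbors=92, p=1, weights='distance')` (the log excerpt above stops at generation 2, where the leading candidate used `n_neighbors=73`). <br>
The best percentile is 41 %. <br>
Well, let's go with these two then.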

In [22]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 92, p=1, weights='distance')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
knn.score(X_test,y_test)

0.9634111751914531

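96 % test set accuracy. Hmm... not bad at first sight, but class 4 alone makes up 10107 of the 10577 test samples (about 95.6 %), so this is only slightly better than always predicting 4.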
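
One way to put such an accuracy in context is to compare it with a trivial baseline and with a metric that weights all classes equally. The sketch below uses made-up, imbalanced labels (95 % class 4, 5 % class 3) rather than the real data, and `DummyClassifier` as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic imbalanced labels: 950 songs in class 4, 50 in class 3 (made-up proportions)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([4] * 950 + [3] * 50)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

# Plain accuracy rewards the majority class; balanced accuracy averages per-class recall
acc = accuracy_score(y_toy, y_base)
bal_acc = balanced_accuracy_score(y_toy, y_base)
print(acc, bal_acc)
```

Running the same two metrics on the `knn` predictions would show how much of the 96 % comes from the minority classes.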