## In this Notebook, we are going to try to predict what makes a song popular using:
### Vanilla Tree Based Models
### Bagged Tree Based Models
### Plus, Ensembling


In this analysis we will use a dataset of roughly 600 songs (603 rows) to determine what makes a song popular.

Dataset: found on Kaggle (https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year)

Before we get started, this is a great resource on what a Decision Tree is and how to use sklearn to build one in Python:

https://towardsdatascience.com/decision-tree-in-python-b433ae57fb93

In [180]:
# Libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
from IPython.display import Image 
import pydotplus
import numpy as np
from sklearn.model_selection import cross_val_predict

In [54]:
# Import Data
df = pd.read_csv("/Users/samlafell/Desktop/Desktop-Sam’s_MacBook_Pro/MSA/Kaggle/Top_50_Spotify/top10s.csv",encoding='iso-8859-1').drop(['Unnamed: 0'],axis=1)

In [55]:
df.head()

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


In [56]:
df.describe()

Unnamed: 0,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
count,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0
mean,2014.59204,118.545605,70.504146,64.379768,-5.578773,17.774461,52.225539,224.674959,14.3267,8.358209,66.52073
std,2.607057,24.795358,16.310664,13.378718,2.79802,13.102543,22.51302,34.130059,20.766165,7.483162,14.517746
min,2010.0,0.0,0.0,0.0,-60.0,0.0,0.0,134.0,0.0,0.0,0.0
25%,2013.0,100.0,61.0,57.0,-6.0,9.0,35.0,202.0,2.0,4.0,60.0
50%,2015.0,120.0,74.0,66.0,-5.0,12.0,52.0,221.0,6.0,5.0,69.0
75%,2017.0,129.0,82.0,73.0,-4.0,24.0,69.0,239.5,17.0,9.0,76.0
max,2019.0,206.0,98.0,97.0,-2.0,74.0,98.0,424.0,99.0,48.0,99.0


## Using the article as a guide, we are able to look at multiple levels of song popularity.

## Applying that to our specific problem, we are going to bin the 'pop' variable into four classes labeled 0-3, one per quartile. We will treat 0 as the least popular and 3 as the most popular.

In [57]:
# Use pd.qcut to cut on quantiles (q=4 gives quartiles); labels=False returns the bin index 0-3
df['pop_model'] = pd.qcut(df['pop'],
                              q=4,
                              labels=False)
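To see what `pd.qcut` produces, here is a minimal sketch on made-up scores (the values below are hypothetical, not taken from the dataset). It also returns the bin edges via `retbins=True` so the cut points can be inspected.

```python
import pandas as pd

# Hypothetical popularity scores, not taken from the Spotify dataset
scores = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# labels=False returns the integer bin index; retbins=True also returns the bin edges
bins, edges = pd.qcut(scores, q=4, labels=False, retbins=True)

# Count how many observations landed in each quartile bin
bin_counts = bins.value_counts().sort_index()
print(bin_counts)
print(edges)
```

With evenly spread values each bin gets the same number of rows. When many songs share the same score, the bins can come out uneven, so it is worth checking the counts on the real column.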

In [58]:
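# Create the Feature and Target dataframes
# Numeric columns, in order: year, bpm, nrgy, dnce, dB, live, val, dur, acous, spch, pop, pop_model
# Features: skip year (first column) and pop / pop_model (last two columns)
X = df.select_dtypes('number').iloc[:, 1:-2]
# Target: pop_model, the last numeric column
y = df.select_dtypes('number').iloc[:,-1]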

In [59]:
# One-hot encode the four levels of the target
y = pd.get_dummies(y)

In [60]:
# Split your data!
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [61]:
# Initialize the Classifier
dt = DecisionTreeClassifier()

# Fit the model
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [67]:
# Visualize
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=X_train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

# That is a deep tree!!
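When a tree is this deep, a text summary limited to the first few levels can be easier to read than the full image. The sketch below is one possible approach, using `sklearn.tree.export_text` with its `max_depth` argument. It uses the iris dataset as a stand-in (an assumption, so that the example is self-contained), not the Spotify data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): iris, not the Spotify dataset
iris = load_iris()
demo_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Print only the top levels of the tree; deeper branches are truncated
summary = export_text(demo_tree, feature_names=list(iris.feature_names), max_depth=2)
print(summary)

# get_depth() reports how deep the fitted tree actually is
print(demo_tree.get_depth())
```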

In [71]:
y_pred = dt.predict(X_test)

In [74]:
# Create the confusion matrix of this single decision tree
popularity = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
confusion_matrix(popularity, predictions)

array([[11,  8, 10, 13],
       [12, 10,  4, 10],
       [11, 11,  9, 10],
       [ 8,  8,  6, 10]])

In [86]:
# Accuracy = correctly classified observations (the diagonal) / all test observations
cm = confusion_matrix(popularity, predictions)
sum_correct = 0
for i in range(0,4):
    sum_correct += cm[i][i]

accuracy = sum_correct / len(X_test)

In [87]:
accuracy

0.26490066225165565

## Looking at this, our Decision Tree was only slightly better than random guessing. It correctly classified 26.5% of test observations across 4 possible buckets, where random guessing would be expected to get roughly 25%.
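The accuracy figure can be checked directly from the confusion matrix printed above: it is the sum of the diagonal divided by the total count. The sketch below also computes a majority-class baseline (always predicting the most common true class). That baseline is an addition for comparison and was not part of the original analysis.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [11,  8, 10, 13],
    [12, 10,  4, 10],
    [11, 11,  9, 10],
    [ 8,  8,  6, 10],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Baseline: always predict the class with the most true observations
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy, 4))           # 0.2649
print(round(majority_baseline, 4))  # 0.2781
```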

## How can we improve this? Bagging!

https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
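As a rough illustration of the idea behind bagging, the sketch below fits several trees on bootstrap samples and combines them by majority vote. It is a simplified, hand-rolled version on synthetic data (an assumption, generated with `make_classification`), not a reimplementation of scikit-learn's `BaggingClassifier`, which also supports feature subsampling and averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 9 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=4, random_state=0
)

rng = np.random.default_rng(0)
n_estimators = 10
sample_size = int(0.4 * len(X_demo))  # mirrors max_samples=0.4

all_preds = []
for _ in range(n_estimators):
    # Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=sample_size)
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    all_preds.append(tree.predict(X_demo))

all_preds = np.array(all_preds)  # shape: (n_estimators, n_samples)

# Majority vote across the trees for each sample
bagged_pred = np.array(
    [np.bincount(col, minlength=4).argmax() for col in all_preds.T]
)
print(bagged_pred[:10])
```

Each tree sees a different resample, so their individual errors differ, and the vote smooths some of that variance out.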

In [93]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

In [115]:
# cross_val_score needs y as a single column of class labels, so rebuild X and y without the dummy encoding
X = df.select_dtypes('number').iloc[:, 1:-2]
y = df.select_dtypes('number').iloc[:,-1]

In [146]:
# Initialize the Classifier and get one of the methods to work!
# For Cross-Validation, use all the data instead of splitting
np.random.seed(1234)
dt = DecisionTreeClassifier()
vanilla_scores = cross_val_score(dt, X, y, cv=10, n_jobs=-1)
bagging_clf = BaggingClassifier(dt, max_samples=0.4, max_features=9, random_state=1234)
bagging_scores = cross_val_score(bagging_clf, X, y, cv=10, n_jobs=-1)

print('The Mean Accuracy of the 10-fold Cross Validation on Decision Trees is', round(np.mean(vanilla_scores),4))
print('The Mean Accuracy of the 10-fold Cross Validation on Bagged Trees is', round(np.mean(bagging_scores),4))
print('\n')
print('Did Bagged Trees Perform Better?')
if np.mean(bagging_scores) > np.mean(vanilla_scores):
    print('Yes')
elif np.mean(bagging_scores) < np.mean(vanilla_scores):
    print('No')

The Mean Accuracy of the 10-fold Cross Validation on Decision Trees is 0.2968
The Mean Accuracy of the 10-fold Cross Validation on Bagged Trees is 0.3103


Did Bagged Trees Perform Better?
Yes
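One way to make this kind of comparison easier to reproduce is to pass an explicit, seeded splitter so both models are scored on identical folds. The sketch below assumes synthetic data from `make_classification` as a stand-in for the Spotify features; the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 9 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=4, random_state=0
)

# A seeded splitter gives both models exactly the same 10 folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)

demo_tree = DecisionTreeClassifier(random_state=1234)
demo_bagged = BaggingClassifier(demo_tree, max_samples=0.4, random_state=1234)

tree_scores = cross_val_score(demo_tree, X_demo, y_demo, cv=cv)
bagged_scores_demo = cross_val_score(demo_bagged, X_demo, y_demo, cv=cv)

# Fold-by-fold difference: positive values mean bagging did better on that fold
fold_diff = bagged_scores_demo - tree_scores
print(fold_diff.round(3))
print(fold_diff.mean().round(4))
```

Looking at the per-fold differences, rather than only the two means, gives a feel for how consistent any improvement is.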


In [157]:
# Get some classifiers to evaluate
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC


# Create classifiers
rf = RandomForestClassifier(n_estimators=50)
et = ExtraTreesClassifier(n_estimators=50)
knn = KNeighborsClassifier()
svc = SVC(gamma='auto')
rg = RidgeClassifier()

clf_array = [rf, et, knn, svc, rg]

for clf in clf_array:
    vanilla_scores = cross_val_score(clf, X, y, cv=10, n_jobs=-1)
    bagging_clf = BaggingClassifier(clf, max_samples=0.4, max_features=X.shape[1], random_state=1234)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=10, n_jobs=-1)
    
    print("Mean of: {1:.3f}, std: (+/-) {2:.3f} [{0}]"
    .format(clf.__class__.__name__,
            vanilla_scores.mean(),
            vanilla_scores.std()))
    print("Mean of: {1:.3f}, std: (+/-) {2:.3f} [Bagging {0}]\n"
          .format(clf.__class__.__name__,
                  bagging_scores.mean(),
                  bagging_scores.std()))

Mean of: 0.337, std: (+/-) 0.042 [RandomForestClassifier]
Mean of: 0.320, std: (+/-) 0.047 [Bagging RandomForestClassifier]

Mean of: 0.327, std: (+/-) 0.038 [ExtraTreesClassifier]
Mean of: 0.346, std: (+/-) 0.049 [Bagging ExtraTreesClassifier]

Mean of: 0.288, std: (+/-) 0.059 [KNeighborsClassifier]
Mean of: 0.285, std: (+/-) 0.023 [Bagging KNeighborsClassifier]

Mean of: 0.298, std: (+/-) 0.013 [SVC]
Mean of: 0.277, std: (+/-) 0.017 [Bagging SVC]

Mean of: 0.295, std: (+/-) 0.043 [RidgeClassifier]
Mean of: 0.292, std: (+/-) 0.047 [Bagging RidgeClassifier]



## So, bagging helped Decision Trees and Extra Trees, but it did not help any of the other methods we tried here.

## Random Forest performs the best among the unbagged models at 33.7% accuracy, and bagged Extra Trees is the best overall at 34.6%, up from our initial 26.5%.
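The bagged-versus-unbagged comparison can be tabulated directly from the means and standard deviations printed above. The sketch below copies those numbers and flags which differences are smaller than the larger of the two standard deviations. That threshold is a rough rule of thumb chosen here for illustration, not a formal significance test.

```python
# (vanilla mean, vanilla std, bagged mean, bagged std), copied from the output above
results = {
    "RandomForestClassifier": (0.337, 0.042, 0.320, 0.047),
    "ExtraTreesClassifier":   (0.327, 0.038, 0.346, 0.049),
    "KNeighborsClassifier":   (0.288, 0.059, 0.285, 0.023),
    "SVC":                    (0.298, 0.013, 0.277, 0.017),
    "RidgeClassifier":        (0.295, 0.043, 0.292, 0.047),
}

improved = []      # models where bagging raised the mean accuracy
within_noise = []  # models where the change is smaller than the fold-to-fold spread

for name, (v_mean, v_std, b_mean, b_std) in results.items():
    diff = b_mean - v_mean
    if diff > 0:
        improved.append(name)
    if abs(diff) < max(v_std, b_std):
        within_noise.append(name)
    print(f"{name}: change from bagging = {diff:+.3f}")

print("Improved by bagging:", improved)
print("Change smaller than one std:", within_noise)
```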

In [158]:
# Example of hard voting 
from sklearn.ensemble import VotingClassifier
clf = [rf, et, knn, svc, rg]
eclf = VotingClassifier(estimators=[('Random Forests', rf), ('Extra Trees', et), ('KNeighbors', knn), ('SVC', svc), ('Ridge Classifier', rg)], voting='hard')
for clf, label in zip([rf, et, knn, svc, rg, eclf], ['Random Forest', 'Extra Trees', 'KNeighbors', 'SVC', 'Ridge Classifier', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.37 (+/- 0.03) [Random Forest]
Accuracy: 0.37 (+/- 0.06) [Extra Trees]
Accuracy: 0.29 (+/- 0.06) [KNeighbors]
Accuracy: 0.30 (+/- 0.01) [SVC]
Accuracy: 0.30 (+/- 0.04) [Ridge Classifier]
Accuracy: 0.33 (+/- 0.06) [Ensemble]
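Hard voting simply takes the most common predicted class across the classifiers for each observation. A minimal sketch with made-up predictions (hypothetical values, three classifiers and five songs) looks like this:

```python
import numpy as np

# Hypothetical class predictions: one row per classifier, one column per song
preds = np.array([
    [0, 1, 2, 3, 1],
    [0, 2, 2, 3, 0],
    [1, 1, 2, 0, 0],
])

# For each song (column), pick the class predicted most often
voted = np.array([np.bincount(col, minlength=4).argmax() for col in preds.T])
print(voted)  # [0 1 2 3 0]
```

In this sketch a tie goes to the lowest class label, because `argmax` returns the first maximum.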


## Comparing these, Random Forest ties with Extra Trees for the best accuracy (0.37) and has the smaller spread. Let's keep it!

In [160]:
# Initialize the Random Forest and its bagged version
# For Cross-Validation, use all the data instead of splitting
np.random.seed(1234)
rf = RandomForestClassifier(n_estimators=50)
vanilla_scores = cross_val_score(rf, X, y, cv=10, n_jobs=-1)
bagging_clf = BaggingClassifier(rf, max_samples=0.4, max_features=9, random_state=1234)
bagging_scores = cross_val_score(bagging_clf, X, y, cv=10, n_jobs=-1)

print('The Mean Accuracy of the 10-fold Cross Validation on Random Forest is', round(np.mean(vanilla_scores),4))
print('The Mean Accuracy of the 10-fold Cross Validation on Bagged Random Forest is', round(np.mean(bagging_scores),4))
print('\n')
print('Did Bagged Forest Perform Better?')
if np.mean(bagging_scores) > np.mean(vanilla_scores):
    print('Yes')
elif np.mean(bagging_scores) < np.mean(vanilla_scores):
    print('No')

The Mean Accuracy of the 10-fold Cross Validation on Random Forest is 0.3616
The Mean Accuracy of the 10-fold Cross Validation on Bagged Random Forest is 0.3204


Did Bagged Forest Perform Better?
No


In [189]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
y_pred = cross_val_predict(rf, X, y, cv=10)

# Compare the cross-validated Random Forest predictions with the true labels
popularity = np.array(y)
predictions = np.array(y_pred)

sum_correct = 0
for i in range(0,4):
    sum_correct += confusion_matrix(popularity, predictions)[i][i]

accuracy = sum_correct / len(X)

accuracy

0.3714759535655058

## Random Forest performed better: 37.1% cross-validated accuracy, compared with 26.5% for the single Decision Tree.

In [195]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [235]:
feature_imp_df = pd.DataFrame(clf.feature_importances_).T
feature_imp_df.columns = X.columns
feature_imp_df.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
val,0.131089
nrgy,0.130343
dur,0.124311
dnce,0.124153
bpm,0.121921
acous,0.110939
live,0.105752
spch,0.087952
dB,0.063541
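To put these importances in context, it helps to compare them with the share each feature would get if all nine were equally important (1/9, about 0.111). The sketch below copies the values printed above into a `pd.Series`; on a fitted model the same Series could be built as `pd.Series(clf.feature_importances_, index=X.columns)`.

```python
import pandas as pd

# Feature importances copied from the output above
importances = pd.Series({
    "val": 0.131089, "nrgy": 0.130343, "dur": 0.124311,
    "dnce": 0.124153, "bpm": 0.121921, "acous": 0.110939,
    "live": 0.105752, "spch": 0.087952, "dB": 0.063541,
})

# Share each feature would have if all were equally important
uniform_share = 1 / len(importances)

# Gap between the top two features
gap_top_two = importances["val"] - importances["nrgy"]

# Ratio above 1 means more important than an equal share
ratio_to_uniform = (importances / uniform_share).round(2)

print(round(uniform_share, 3))
print(round(gap_top_two, 6))
print(ratio_to_uniform)
```

The top five features all sit close to the equal share and close to each other, which fits the reading that no single feature dominates.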


## In this example, we looked at how well a Decision Tree did, and then focused on how vanilla models compare with their bagged versions.

### There is a lot of literature out there about Bagging/Boosting/Stacking models. Each has its own advantages and disadvantages.

### This dataset seems to inherently have a lot of noise, as I've discovered through previous analysis. Thus, it's hard to draw firm conclusions.

### The Random Forest ranked Valence as the most important factor in determining the popularity of a song, though only narrowly ahead of energy (0.131 vs. 0.130).

### However, nothing in this analysis was particularly strong. Since Random Forest was our most accurate candidate model, we begin to get a sense that knowing how 'happy' a song is perceived to be might help us place it in terms of popularity.

### Again, this is a weak signal, and we would not by any means say that happier songs are more popular.