<p style="font-family: Georgia, serif; font-size:20pt; font-weight: bold">
    What features make a hit song?
</p>
<p style="font-family: Georgia, serif; font-size:15pt">
    An analysis of three decades of music using data from Spotify
</p>

<p style="font-family: Georgia, serif; font-size:12pt">
Spotify’s web API can be used to build a dataset of songs and their features, including each song's popularity.
Using data analytics and a dataset from Kaggle, I determined which song features can be used to predict “hits”.
Songs from three decades (the 1990’s, 2000’s, and 2010’s) were analyzed to determine which song features predict hits and whether the importance of these features changes over time.
For descriptions of Spotify's song features, see the API reference: https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-several-tracks.
</p>

<p style="font-family: Georgia, serif; font-size:11pt">
    Start by importing libraries and CSV files.
</p>

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import sklearn

import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.figsize']=(20,5)
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')


#pd.read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
nineties = pd.read_csv('../input/the-spotify-hit-predictor-dataset/dataset-of-90s.csv')  #songs from the 1990's
aughts = pd.read_csv('../input/the-spotify-hit-predictor-dataset/dataset-of-00s.csv')  #songs from the 2000's
tens = pd.read_csv('../input/the-spotify-hit-predictor-dataset/dataset-of-10s.csv')  #songs from the 2010's

<p style="font-family: Georgia, serif; font-size:15pt; font-weight: bold">
   Data Organizing and Cleaning
</p>

<p style="font-family: Georgia, serif; font-size:11pt">
    Add a decade column to each DataFrame before combining them, so every song keeps a record of which file it came from.
</p>

In [None]:
nineties['decade'] = 1990
aughts['decade'] = 2000
tens['decade'] = 2010

<p style="font-family: Georgia, serif; font-size:11pt">
    Combine the three decades into a single DataFrame, then save it as a CSV file.
</p> 

In [None]:
all_dfs = [nineties, aughts, tens]
all_songs = pd.concat(all_dfs, ignore_index=True) #combines all the decade dataframes; ignore_index avoids repeated row labels
print(all_songs['decade'].unique()) #check that the new dataframe has all the decades

In [None]:
all_songs.to_csv('all_songs.csv', index = False) #create a CSV file of the new dataframe
all_songs.head(3)

<p style="font-family: Georgia, serif; font-size:11pt">
    Check for nulls to determine if you need to clean the data.
</p> 

In [None]:
pd.isnull(all_songs).sum()
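
<p style="font-family: Georgia, serif; font-size:11pt">
    Nulls are only one kind of data-quality problem. Because three files were concatenated, it may also be worth checking for repeated songs. The sketch below is a minimal example on a made-up toy DataFrame; the helper name <code>quality_report</code> is hypothetical, and using <code>track</code> and <code>artist</code> as identifying columns is an assumption about the dataset.
</p>

```python
import pandas as pd

def quality_report(df, key_cols):
    """Summarize missing values and duplicate rows for a DataFrame."""
    return {
        "rows": len(df),
        "null_cells": int(df.isnull().sum().sum()),
        # rows that are identical in every column
        "duplicate_rows": int(df.duplicated().sum()),
        # rows that repeat an earlier row on the identifying columns only
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
    }

# Toy data (not from the real dataset): song "B" appears in two decades.
toy = pd.DataFrame({
    "track": ["A", "B", "B", "C"],
    "artist": ["x", "y", "y", "z"],
    "danceability": [0.5, 0.7, 0.7, None],
    "decade": [1990, 1990, 2000, 2010],
})

report = quality_report(toy, key_cols=["track", "artist"])
print(report)
```

<p style="font-family: Georgia, serif; font-size:11pt">
    On the real data, the same helper could be called as <code>quality_report(all_songs, ['track', 'artist'])</code>, assuming those columns exist.
</p>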

<p style="font-family: Georgia, serif; font-size:11pt">
    There are no nulls, so no missing-value cleaning is needed.
</p>    
<p style="font-family: Georgia, serif; font-size:11pt">    
    Check the columns of the dataframe to determine the names of the variables to be analyzed and create a list of variables.
</p>

In [None]:
all_songs.columns

In [None]:
#create two var lists, one with Spotify's features (spfeatures_var_list) and one with the song traits (song_traits_var_list)
spfeatures_var_list = ['danceability', 'energy', 'key', 'loudness','mode', 'speechiness', 'acousticness', 
                       'instrumentalness', 'liveness','valence']
song_traits_var_list = ['key', 'loudness','tempo', 'time_signature', 'chorus_hit','sections'] 
#duration_ms is left out because its values are orders of magnitude larger than those of the other variables

<p style="font-family: Georgia, serif; font-size:15pt; font-weight: bold">    
    Descriptive Statistics
</p>

In [None]:
all_songs.describe() #show the descriptive statistics of the variables of all the songs in all decades

<p style="font-family: Georgia, serif; font-size:11pt">    
    Compare the means for Spotify's song features for hit songs and flop songs.
</p>

In [None]:
#going to focus on the spotify features when comparing the hits and flops
all_songs_hits = all_songs[spfeatures_var_list].loc[all_songs['target'] == 1]
all_songs_flops = all_songs[spfeatures_var_list].loc[all_songs['target'] == 0]

In [None]:
#create a dataframe that includes the means for hits and flops
hits_means = pd.DataFrame(all_songs_hits.describe().loc['mean'])
flops_means = pd.DataFrame(all_songs_flops.describe().loc['mean'])
means_joined = pd.concat([hits_means,flops_means], axis = 1)
means_joined.columns = ['hit_mean', 'flop_mean']

means_joined

In [None]:
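#scale the dataframe to make the graph more readable
#note: StandardScaler standardizes each column (hit_mean, flop_mean) across the features,
#so the bars are z-scores relative to the other feature means, not raw means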
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
means_joined_scaled = pd.DataFrame(ss.fit_transform(means_joined),index= means_joined.index, columns = means_joined.columns)
means_joined_scaled


means_joined_scaled.plot(kind = 'bar', figsize=(10, 5), color = ('purple', 'grey'), title = 'Means of Hit Songs and Flop Songs for Song Features')
plt.legend(labels=['Hits', 'Flops'], loc='upper right')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">    
Judging by the differences in the means between hits and flops, there does appear to be a difference between the two types of songs. This means we could possibly create a model to predict hits and flops based on song features.   
</p>
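
<p style="font-family: Georgia, serif; font-size:11pt">
    One way to make the hit/flop comparison feature by feature is a standardized mean difference (a Cohen's d-style effect size): the gap between the hit mean and the flop mean, divided by a pooled standard deviation. The sketch below is not part of the original analysis. It runs on synthetic data, the function name <code>standardized_mean_diff</code> is hypothetical, and the pooled standard deviation uses the simple average of the two group variances.
</p>

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(df, features, target="target"):
    """Return (hit mean - flop mean) / pooled std for each feature."""
    hits = df.loc[df[target] == 1, features]
    flops = df.loc[df[target] == 0, features]
    # simple average of the two group variances, then square root
    pooled_std = np.sqrt((hits.var() + flops.var()) / 2)
    return ((hits.mean() - flops.mean()) / pooled_std).sort_values()

# Synthetic stand-in for all_songs: flops first (target 0), then hits (target 1).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "target": np.repeat([0, 1], n),
    "danceability": np.concatenate([rng.normal(0.5, 0.1, n), rng.normal(0.6, 0.1, n)]),
    "loudness": np.concatenate([rng.normal(-12, 4, n), rng.normal(-8, 4, n)]),
    "valence": rng.normal(0.5, 0.2, 2 * n),  # same distribution for both groups
})

smd = standardized_mean_diff(toy, ["danceability", "loudness", "valence"])
print(smd.round(2))
```

<p style="font-family: Georgia, serif; font-size:11pt">
    Because every feature is divided by its own spread, values are comparable across features regardless of their original units. On the real data the call would be <code>standardized_mean_diff(all_songs, spfeatures_var_list)</code>.
</p>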

<p style="font-family: Georgia, serif; font-size:15pt; font-weight: bold">    
Exploratory Data Analysis
</p>
<p style="font-family: Georgia, serif; font-size:11pt">    
More information could be gleaned from histograms and boxplots than just means.
</p>

<p style="font-family: Georgia, serif; font-size:12pt">    
Histograms
</p>

In [None]:
#create histograms of all the variables to see distributions
fig, ax = plt.subplots(5,3, figsize=(20,20))

def hist_plot(row, column, variable, binsnum, color):
    ax[row, column].hist(all_songs[variable], bins = binsnum, color = color)
    ax[row, column].set_title(variable + ' histogram')
    
hist_plot(0, 0, 'danceability', 10, 'purple')
hist_plot(0, 1, 'energy', 10, 'orchid')
hist_plot(0, 2, 'key', 10, 'plum')
hist_plot(1,0, 'loudness', 10, 'purple')
hist_plot(1,1, 'mode', 10, 'orchid')
hist_plot(1,2, 'speechiness', 10, 'plum')
hist_plot(2,0, 'acousticness', 10, 'purple')
hist_plot(2,1, 'instrumentalness', 10, 'orchid')
hist_plot(2,2, 'liveness', 10, 'plum')
hist_plot(3,0, 'valence', 10, 'purple')
hist_plot(3,1, 'tempo', 10, 'orchid')
hist_plot(3,2, 'duration_ms', 50, 'plum')
hist_plot(4,0, 'time_signature', 10, 'purple')
hist_plot(4,1, 'chorus_hit', 10, 'orchid')
hist_plot(4,2, 'sections', 50, 'plum')

plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">    
Some interesting patterns emerge here. Songs tend to be more danceable than less danceable and to have more energy than less energy. The key of C is the most popular key, and most songs are in major scales. Songs tend to be under 10 decibels. Most songs contain more music than speech, most songs are not live, most songs are not acoustic, and most songs contain music. There's a good mix of happy and sad songs, most songs are about 80-90 beats per minute, and most songs are in 4/4 time.
</p>

<p style="font-family: Georgia, serif; font-size:12pt">    
   Boxplots
</p>
<p style="font-family: Georgia, serif; font-size:11pt">    
   Let's create some boxplots to see the spread of the song features and any differences between hits and flops.
</p>


In [None]:
#to create more readable graphs, I created two boxplots, one with the first 10 song features
mpl.rcParams['figure.figsize']=(20,5)
all_songs[all_songs['target']==1].iloc[:, 0:13].plot(kind='box', title = 'Hits')
plt.show()
all_songs[all_songs['target']==0].iloc[:, 0:13].plot(kind='box', title = 'Flops')
plt.show()

In [None]:
#...and one with the last 5 song features
all_songs[all_songs['target']==1].iloc[:, 13:18].plot(kind='box', title = 'Hits')
plt.show()
all_songs[all_songs['target']==0].iloc[:, 13:18].plot(kind='box', title = 'Flops')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">    
Since I didn't scale the dataset, it is a bit hard to see the boxplots of some of the song features. Overall, however, there does appear to be a difference between hit and flop songs. This leads me to the next part of the analysis.   
</p>
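
<p style="font-family: Georgia, serif; font-size:11pt">
    The readability problem mentioned above could be addressed by z-scoring each feature before plotting, so that all boxes share one axis. The following is a minimal sketch under assumptions: the data is synthetic, the helper name <code>scaled_features</code> is hypothetical, and the three feature names are only examples of columns with very different ranges.
</p>

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scaled_features(df, features):
    """Z-score each feature across all songs (mean 0, standard deviation 1)."""
    scaled = StandardScaler().fit_transform(df[features])
    return pd.DataFrame(scaled, columns=features, index=df.index)

# Synthetic stand-in for all_songs with features on very different scales.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "tempo": rng.normal(120, 30, 200),
    "duration_ms": rng.normal(240_000, 60_000, 200),
    "sections": rng.integers(5, 15, 200).astype(float),
    "target": rng.integers(0, 2, 200),
})

features = ["tempo", "duration_ms", "sections"]
scaled = scaled_features(toy, features)
scaled["target"] = toy["target"]

# One panel per feature, with flops (0) and hits (1) side by side on a z-score axis.
axes = scaled.boxplot(column=features, by="target", layout=(1, 3), figsize=(12, 4))
```

<p style="font-family: Georgia, serif; font-size:11pt">
    Scaling is done once over all songs rather than separately for hits and flops, so differences between the two groups are preserved in the plot.
</p>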

<p style="font-family: Georgia, serif; font-size:15pt; font-weight: bold">    
   Inferential Statistics
</p>

<p style="font-family: Georgia, serif; font-size:15pt; font-weight: bold">    
Random Forest Classifier  
</p>
<p style="font-family: Georgia, serif; font-size:11pt">    
Fit a Random Forest Classifier to test whether hit songs can be predicted from song features for all three decades combined.  
</p>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import scale

indep_columns = ['danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit',
       'sections']

X = all_songs[indep_columns]
y = all_songs['target']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0) #use 75% of the data for training the model and 25% of the data for testing
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)
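
<p style="font-family: Georgia, serif; font-size:11pt">
    The cell above scores the model on a single 75/25 split, and the forest itself is not seeded, so results can vary slightly between runs. A hedged alternative is k-fold cross-validation with fixed seeds, which reports a mean and a spread. The sketch below uses synthetic data from <code>make_classification</code> as a stand-in for <code>X</code> and <code>y</code>; the choice of five folds and 100 trees is an assumption, not something taken from the original analysis.
</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the song features (X) and hit/flop labels (y).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)

# Fixed seeds make both the forest and the fold assignment reproducible.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

<p style="font-family: Georgia, serif; font-size:11pt">
    With the real data, <code>X_demo</code> and <code>y_demo</code> would be replaced by <code>X</code> and <code>y</code>.
</p>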

In [None]:
#create a confusion matrix to see the efficacy of the model
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
#create a figure/heatmap of the confusion matrix for a better visual
mpl.rcParams['figure.figsize']=(10,5)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="RdPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
The confusion matrix demonstrates that the model correctly identified hits and flops most of the time. 
</p>
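
<p style="font-family: Georgia, serif; font-size:11pt">
    Raw counts are hard to compare between models trained on different amounts of data, such as the per-decade models. Normalizing each row of the confusion matrix turns the counts into per-class rates. The sketch below uses a tiny hand-made set of labels (not model output) to show the idea.
</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 4 flops (0) and 6 hits (1).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
counts = confusion_matrix(y_true_demo, y_pred_demo)

# normalize="true" divides each row by the number of actual songs in that class.
rates = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(counts)
print(rates.round(2))
```

<p style="font-family: Georgia, serif; font-size:11pt">
    In the normalized matrix, the diagonal holds the share of flops and the share of hits that were classified correctly.
</p>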

In [None]:
#create a dataframe of the feature importances to determine which variables are the most important in determining a hit
all_songs_feat = RF.feature_importances_
df_indep_columns = pd.DataFrame(indep_columns)
df_all_songs_feat = pd.DataFrame(all_songs_feat)
all_songs_feat_vars = pd.concat([df_indep_columns, df_all_songs_feat], axis = 1)
all_songs_feat_vars.columns = ['Variable', 'Feature importance all decades']
all_songs_feat_vars = all_songs_feat_vars.set_index('Variable')
all_songs_feat_vars = all_songs_feat_vars.sort_values(by=['Feature importance all decades'], ascending = False)
all_songs_feat_vars.to_csv('all_songs_feat.csv') #create a CSV file of the new dataframe; the index is kept because it holds the variable names

In [None]:
all_songs_feat_vars

In [None]:
all_songs_feat_vars.plot(kind='bar', color = "purple", title = "Most important features for predicting hit and flop songs for all decades", legend = None)
plt.ylabel('Feature importance')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
For all decades, instrumentalness, danceability, acousticness, duration_ms, and loudness were the strongest predictors of whether a song was a hit.
</p>
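
<p style="font-family: Georgia, serif; font-size:11pt">
    <code>feature_importances_</code> is computed from impurity decreases on the training data. A common cross-check is permutation importance, which measures how much the test score drops when one feature is shuffled. The sketch below is not part of the original analysis: it uses synthetic data in which only the first three features carry signal, and the feature names are placeholders.
</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are informative, the last 3 are noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(6)]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the mean drop in accuracy.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame(
    {
        "impurity_importance": rf.feature_importances_,
        "permutation_importance": result.importances_mean,
    },
    index=names,
).sort_values("permutation_importance", ascending=False)
print(comparison.round(3))
```

<p style="font-family: Georgia, serif; font-size:11pt">
    If both columns rank the features in a similar order, the ranking is less likely to be an artifact of how the importance was measured.
</p>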

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

<p style="font-family: Georgia, serif; font-size:11pt">
    The model has high accuracy, precision, and recall on the held-out test set.
</p>
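
<p style="font-family: Georgia, serif; font-size:11pt">
    Whether an accuracy counts as high depends on the class balance, so it can help to compare against a trivial baseline that always predicts the most frequent class. The sketch below runs on synthetic, roughly balanced data (an assumption; the balance of the real dataset is not shown here) and is only meant to illustrate the comparison.
</p>

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the song features and hit/flop labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Baseline: ignore the features and always predict the majority class of the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Random forest accuracy: {forest_acc:.3f}")
```

<p style="font-family: Georgia, serif; font-size:11pt">
    The gap between the two numbers, rather than the forest's accuracy alone, indicates how much the song features contribute.
</p>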

<p style="font-family: Georgia, serif; font-size:12pt; font-weight: bold">
    Decade: The 1990's
</p>
<p style="font-family: Georgia, serif; font-size:11pt">
    Let's repeat the model but specifically for songs in the 1990's.
</p>

In [None]:
X = nineties[indep_columns]
y = nineties['target']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="RdPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
Once again, the confusion matrix demonstrates that the model is very good at predicting hits and flops. 
</p>

In [None]:
nineties_feat = RF.feature_importances_
df_indep_columns = pd.DataFrame(indep_columns)
df_nineties_feat = pd.DataFrame(nineties_feat)
nineties_feat_vars = pd.concat([df_indep_columns, df_nineties_feat], axis = 1)
nineties_feat_vars.columns = ['Variable', 'Feature importance 1990s']
nineties_feat_vars = nineties_feat_vars.set_index('Variable')
nineties_feat_vars = nineties_feat_vars.sort_values(by=['Feature importance 1990s'], ascending = False)
nineties_feat_vars.to_csv('nineties_feat_vars.csv') #create a CSV file of the new dataframe; the index is kept because it holds the variable names

In [None]:
nineties_feat_vars

In [None]:
nineties_feat_vars.plot(kind='bar', color = "purple", title = "Most important features for predicting hit and flop songs for the nineties", legend = None)
plt.ylabel('Feature importance')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
Hit and flop songs in the 90's were predicted mostly by instrumentalness, duration, danceability, acousticness and energy.
</p>

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

<p style="font-family: Georgia, serif; font-size:11pt">
The model also had high accuracy, precision, and recall for the 90's.
</p>

<p style="font-family: Georgia, serif; font-size:12pt; font-weight: bold">
Decade: The 2000's
</p>
<p style="font-family: Georgia, serif; font-size:11pt">
Repeat the model for songs from the 2000's.
</p>

In [None]:
X = aughts[indep_columns]
y = aughts['target']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="RdPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
Once again, the confusion matrix demonstrates that the model is very good at predicting hits and flops. 
</p>

In [None]:
aughts_feat = RF.feature_importances_
df_indep_columns = pd.DataFrame(indep_columns)
df_aughts_feat = pd.DataFrame(aughts_feat)
aughts_feat_vars = pd.concat([df_indep_columns, df_aughts_feat], axis = 1)
aughts_feat_vars.columns = ['Variable', 'Feature importance 2000s']
aughts_feat_vars = aughts_feat_vars.set_index('Variable')
aughts_feat_vars = aughts_feat_vars.sort_values(by=['Feature importance 2000s'], ascending = False)
aughts_feat_vars.to_csv('aughts_feat_vars.csv') #create a CSV file of the new dataframe; the index is kept because it holds the variable names

In [None]:
aughts_feat_vars

In [None]:
aughts_feat_vars.plot(kind='bar', color = "purple", title = "Most important features for predicting hit and flop songs for the 2000's", legend = None)
plt.ylabel('Feature importance')
plt.savefig('aughts_feature_importance_bar.jpg') #save after labelling so the y-axis label appears in the file
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
In the 2000's hit and flop songs were mostly predicted by instrumentalness, danceability, loudness, duration, and acousticness.
</p>

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

<p style="font-family: Georgia, serif; font-size:11pt">
The model has high accuracy, precision and recall, indicating it is good at classifying hit and flop songs.
</p>

<p style="font-family: Georgia, serif; font-size:12pt; font-weight: bold">
Decade: The 2010's
</p>
<p style="font-family: Georgia, serif; font-size:11pt">
Repeat the model for songs from the 2010's.
</p>

In [None]:
X = tens[indep_columns]
y = tens['target']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
RF.feature_importances_ #corresponds to the order of the variables

In [None]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="RdPu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
Once again, the confusion matrix demonstrates that the model is very good at predicting hits and flops. 
</p>

In [None]:
tens_feat = RF.feature_importances_
df_indep_columns = pd.DataFrame(indep_columns)
df_tens_feat = pd.DataFrame(tens_feat)
tens_feat_vars = pd.concat([df_indep_columns, df_tens_feat], axis = 1)
tens_feat_vars.columns = ['Variable', 'Feature importance 2010s']
tens_feat_vars = tens_feat_vars.set_index('Variable')
tens_feat_vars = tens_feat_vars.sort_values(by=['Feature importance 2010s'], ascending = False)
tens_feat_vars.to_csv('tens_feat_vars.csv') #create a CSV file of the new dataframe; the index is kept because it holds the variable names

In [None]:
tens_feat_vars

In [None]:
tens_feat_vars.plot(kind='bar', color = "purple", title = "Most important features for predicting hit and flop songs for the 2010's", legend = None)
plt.ylabel('Feature importance')
plt.show()


<p style="font-family: Georgia, serif; font-size:11pt">
Hit and flop songs of the 2010's were predicted mostly by instrumentalness, loudness, acousticness, danceability and energy.
</p>

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

<p style="font-family: Georgia, serif; font-size:11pt">
The model has high accuracy, precision and recall, indicating it is good at classifying hit and flop songs.
</p>

<p style="font-family: Georgia, serif; font-size: 15pt; font-weight: bold">
Has feature importance changed through time?
</p>
<p style="font-family: Georgia, serif; font-size:11pt">
From our previous models, it's apparent that feature importance has changed through time. Let's graph the results to show that visually.
</p>

In [None]:
#place the four importance tables side by side, aligned on the variable names
compare_feats_df = pd.concat([all_songs_feat_vars, nineties_feat_vars, aughts_feat_vars, tens_feat_vars], axis = 1)
compare_feats_df.to_csv('compare_feats_df.csv') #create a CSV file of the new dataframe

In [None]:
compare_feats_df

In [None]:
compare_feats_df.plot(kind='bar', color = ('purple','orchid','plum','thistle' ), figsize = (20,8))
plt.ylabel("Feature importance")
plt.show()

<p style="font-family: Georgia, serif; font-size:11pt">
Overall, the most important predictors of hit and flop songs are instrumentalness, danceability, acousticness, duration, energy, and loudness. However, their relative importance fluctuates across the decades, with some song features mattering more in particular decades.  
</p>
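
<p style="font-family: Georgia, serif; font-size:11pt">
    The per-decade models above repeat the same steps, so the comparison could also be produced by one helper that fits a forest per decade and returns a single table. The sketch below is a hypothetical refactoring on synthetic data: the function name <code>importances_by_group</code>, the placeholder feature names, and the seeded forests are assumptions. Ranks and a Spearman rank correlation are added as one possible way to quantify how much the ordering changes between decades.
</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def importances_by_group(df, features, target="target", group="decade", seed=0):
    """Fit one random forest per group and return a features x groups importance table."""
    columns = {}
    for name, part in df.groupby(group):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        rf.fit(part[features], part[target])
        columns[name] = pd.Series(rf.feature_importances_, index=features)
    return pd.DataFrame(columns)

# Synthetic stand-in for all_songs with three equally sized "decades".
features = [f"feature_{i}" for i in range(5)]
X_demo, y_demo = make_classification(
    n_samples=900, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
toy = pd.DataFrame(X_demo, columns=features)
toy["target"] = y_demo
toy["decade"] = [1990, 2000, 2010] * 300

table = importances_by_group(toy, features)
ranks = table.rank(ascending=False)            # 1 = most important within each decade
rank_corr = table.corr(method="spearman")      # agreement of rankings between decades

print(table.round(3))
print(ranks)
print(rank_corr.round(2))
```

<p style="font-family: Georgia, serif; font-size:11pt">
    On the real data the call would be <code>importances_by_group(all_songs, indep_columns)</code>. A rank correlation close to 1 between two decades would suggest the ordering of features barely changed, while lower values would support the fluctuation described above.
</p>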