
In [0]:
import pandas as pd
import numpy as np
import scipy
import statsmodels
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction import text
import category_encoders as ce

In [0]:
df_personality_sentiment = pd.read_csv(
    "/content/drive/My Drive/df_personality_sentiment.csv")
df_songs = pd.read_csv("/content/drive/My Drive/df_songs.csv")

**I am trying to predict the popularity of a song. Setting aside feature selection and engineering, model selection, and hyperparameter tuning, I have audio features and textual features for each song. Which of the two is better for predicting a song's popularity? Because I am comparing audio features with textual features, I will not use artist or album as categorical features. I also know that popularity is highly dependent on a song's release date, and my data isn't time-labelled, so my models will be limited.**

# Audio Features

In [3]:
df_songs.head()

Unnamed: 0,artist,album,name,explicit,key,mode,time_signature,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,genre,duration_s
0,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Black Panther,1,1,1,4,0.625,0.618,0.582,4e-06,0.265,-9.454,0.297,90.035,0.48,57,hiphop,130.613
1,Kendrick Lamar,Black Panther The Album Music From And Inspire...,All The Stars,1,8,1,4,0.0605,0.698,0.633,0.000194,0.0926,-4.946,0.0597,96.924,0.552,78,hiphop,232.186
2,Kendrick Lamar,Black Panther The Album Music From And Inspire...,X,1,2,1,4,0.0201,0.768,0.471,0.0,0.268,-8.406,0.259,131.023,0.405,69,hiphop,267.426
3,Kendrick Lamar,Black Panther The Album Music From And Inspire...,The Ways,1,11,0,4,0.0626,0.727,0.72,1e-06,0.176,-5.856,0.0488,140.08,0.589,65,hiphop,238.893
4,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Opps,1,1,1,4,0.152,0.706,0.775,3.3e-05,0.416,-6.819,0.335,127.929,0.847,59,hiphop,180.893


**I will mainly be choosing between Linear Regression, KNeighbors Regressor, and Random Forest Regressor. I start out with just Linear and KNeighbors because I haven't chosen features yet, and Random Forest won't work well on just one feature.**

In [12]:
pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

X_train = df_songs[["danceability"]]
y_train = df_songs["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)

np.sqrt(mean_squared_error(y_train, y_train_))

20.15372534935818

In [19]:
pipeline = make_pipeline(
    StandardScaler(),
    Lasso()
)

X_train = df_songs[["danceability"]]
y_train = df_songs["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)

np.sqrt(mean_squared_error(y_train, y_train_))

20.178519406967464

In [13]:
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

X_train = df_songs[["danceability"]]
y_train = df_songs["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)

np.sqrt(mean_squared_error(y_train, y_train_))

17.85280998687548

**KNeighbors works better so I stick with KNeighbors when choosing features. I am not using explicit as an audio feature because it is more accurate to describe it as a textual feature.**

In [27]:
rmse = []
feature = []
for features in [["danceability"],
                 ["key"],
                 ["mode"],
                 ["energy"],
                 ["instrumentalness"],
                 ["liveness"],
                 ["loudness"],
                 ["acousticness"],
                 ["speechiness"],
                 ["tempo"],
                 ["valence"],
                 ["time_signature"],
                 ["duration_s"]]:

  ct = make_column_transformer(
      (StandardScaler(), features),
      remainder = "drop"
  )

  pipeline = make_pipeline(
      ct,
      KNeighborsRegressor()
  )

  X_train = df_songs[features]
  y_train = df_songs["popularity"]

  pipeline.fit(X_train, y_train)
  y_train_ = pipeline.predict(X_train)
  feature.append(features)
  rmse.append(np.sqrt(mean_squared_error(y_train, y_train_)))

df = {}
df["feature"] = feature
df["rmse"] = rmse
df1 = pd.DataFrame(df)
df1

Unnamed: 0,feature,rmse
0,[danceability],17.85281
1,[key],22.126037
2,[mode],25.602111
3,[energy],17.48504
4,[instrumentalness],20.307815
5,[liveness],18.295904
6,[loudness],18.137016
7,[acousticness],18.16684
8,[speechiness],18.045152
9,[tempo],18.28568


**Since some features had RMSE values in the 20s, I want to run the same check with Linear Regression to make sure the differences between features aren't specific to KNeighbors.**

In [0]:
rmse = []
feature = []
for features in [["danceability"],
                 ["key"],
                 ["mode"],
                 ["energy"],
                 ["instrumentalness"],
                 ["liveness"],
                 ["loudness"],
                 ["acousticness"],
                 ["speechiness"],
                 ["tempo"],
                 ["valence"],
                 ["time_signature"],
                 ["duration_s"]]:

  pipeline = make_pipeline(
      StandardScaler(),
      LinearRegression()
  )

  X_train = df_songs[features]
  y_train = df_songs["popularity"]

  pipeline.fit(X_train, y_train)
  y_train_ = pipeline.predict(X_train)
  feature.append(features)
  rmse.append(np.sqrt(mean_squared_error(y_train, y_train_)))

df = {}
df["feature"] = feature
df["rmse"] = rmse
df1 = pd.DataFrame(df)
df1

**Linear Regression doesn't have anything under 20, so I'm going to stick with KNeighbors.**

**Danceability and energy are the best, with errors in the 17s. Seven other features are in the 18s. Since they are all quantitative, I will iteratively add the next-best predictor until the error stops decreasing.**

In [25]:
rmse = []
feature = []
for features in [["danceability", "energy"],
                 ["danceability", "energy", "speechiness"],
                 ["danceability", "energy", "speechiness", "loudness"],
                 ["danceability", "energy", "speechiness", "loudness", 
                  "acousticness"],
                 ["danceability", "energy", "speechiness", "loudness", 
                  "acousticness", "liveness"]]:

  pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsRegressor()
  )

  X_train = df_songs[features]
  y_train = df_songs["popularity"]

  pipeline.fit(X_train, y_train)
  y_train_ = pipeline.predict(X_train)
  feature.append(features)
  rmse.append(np.sqrt(mean_squared_error(y_train, y_train_)))

df = {}
df["feature"] = feature
df["rmse"] = rmse
df1 = pd.DataFrame(df)
df1

Unnamed: 0,feature,rmse
0,"[danceability, energy]",17.915887
1,"[danceability, energy, speechiness]",17.27472
2,"[danceability, energy, speechiness, loudness]",16.985452
3,"[danceability, energy, speechiness, loudness, ...",16.879977
4,"[danceability, energy, speechiness, loudness, ...",17.258549
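The greedy procedure used above can also be written as a loop. The following is only a minimal sketch: it assumes cross-validated RMSE as the stopping criterion (the cells above use training RMSE), and the helper names `cv_rmse` and `forward_select` and the synthetic data are hypothetical stand-ins, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_rmse(X, y, columns):
    """Mean 5-fold cross-validated RMSE of a scaled k-NN on the given columns."""
    pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
    scores = cross_val_score(pipeline, X[columns], y,
                             scoring="neg_root_mean_squared_error", cv=5)
    return -scores.mean()


def forward_select(X, y):
    """Greedily add the column that lowers CV RMSE most; stop when none does."""
    selected, best = [], np.inf
    remaining = list(X.columns)
    while remaining:
        trial = {c: cv_rmse(X, y, selected + [c]) for c in remaining}
        candidate = min(trial, key=trial.get)
        if trial[candidate] >= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = trial[candidate]
    return selected, best


# Synthetic stand-in for df_songs: only "a" and "b" carry signal.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(scale=0.5, size=300)

selected, best_rmse = forward_select(X_demo, y_demo)
print(selected, round(best_rmse, 3))
```

On the real data, `X_demo` would be the audio feature columns of `df_songs` and `y_demo` the `popularity` column.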


**So I choose danceability, energy, speechiness, loudness, acousticness. Now that I have features, I want to try RandomForest.**

In [36]:
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(max_features="sqrt")
)

X_train = df_songs[["danceability", "energy", "speechiness", "loudness",
                    "acousticness"]]
y_train = df_songs["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)
np.sqrt(mean_squared_error(y_train, y_train_))

7.394213328609897

**RandomForest does better, so I stick with it. I want to estimate test error now.**

In [45]:
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(max_features="sqrt")
)

X_train = df_songs[["danceability", "energy", "speechiness", "loudness",
                    "acousticness"]]
y_train = df_songs["popularity"]

cv_errs = -cross_val_score(pipeline, X_train, y_train,
             scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

19.915880785471284
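The training RMSE (about 7.4) and the cross-validated RMSE (about 19.9) measure different things. As a rough illustration on synthetic data (not the song data), a random forest can reach a low training error even when the target is pure noise, while its cross-validated error stays near the spread of the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic data: the target is pure noise, so no model can truly predict it.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = rng.normal(loc=50, scale=20, size=300)

forest = RandomForestRegressor(max_features="sqrt", random_state=0)

# Training RMSE: the forest has already seen these rows.
forest.fit(X_demo, y_demo)
train_rmse = np.sqrt(mean_squared_error(y_demo, forest.predict(X_demo)))

# Cross-validated RMSE: each fold is predicted by a model that never saw it.
cv_rmse = -cross_val_score(forest, X_demo, y_demo,
                           scoring="neg_root_mean_squared_error", cv=10).mean()

print(f"train RMSE: {train_rmse:.2f}, CV RMSE: {cv_rmse:.2f}")
```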

**It seems the model is overfitting to the data. I now want to tune the hyperparameters using RandomizedSearchCV.**

In [105]:
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(max_features = 'sqrt')
)

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)] + [None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'randomforestregressor__n_estimators': n_estimators,
               'randomforestregressor__max_depth': max_depth,
               'randomforestregressor__min_samples_split': min_samples_split,
               'randomforestregressor__min_samples_leaf': min_samples_leaf,
               'randomforestregressor__bootstrap': bootstrap}

rs = RandomizedSearchCV(
    pipeline, param_distributions=random_grid, n_iter=100, 
    scoring="neg_root_mean_squared_error", cv=10)

model = rs.fit(
    df_songs[["danceability", "energy", "speechiness", "loudness", 
              "acousticness"]],
    df_songs["popularity"])

model.best_params_

{'randomforestregressor__bootstrap': True,
 'randomforestregressor__max_depth': 10,
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 10,
 'randomforestregressor__n_estimators': 200}

In [106]:
model.best_score_

-19.54008627239365
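To judge whether an RMSE near 19.5 is good, it helps to compare it with a baseline that ignores the features entirely. A minimal sketch, assuming scikit-learn's `DummyRegressor` and a synthetic target in place of the real `popularity` column:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the features and the popularity column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = rng.normal(loc=60, scale=20, size=300)

# DummyRegressor ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean")
baseline_rmse = -cross_val_score(baseline, X_demo, y_demo,
                                 scoring="neg_root_mean_squared_error",
                                 cv=10).mean()
print(f"baseline CV RMSE: {baseline_rmse:.2f}")
```

Running the same baseline on `df_songs["popularity"]` would give the number that the tuned model's cross-validated RMSE has to beat.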

# Textual Features

In [39]:
df_personality_sentiment.head()

Unnamed: 0,artist,album,name,explicit,genre,chain_lyrics,lyrics,polarity,magnitude,artistic,emotion,imagination,assertive,cheeful,outgoing,modesty,sympathy,fiery,melancholy
0,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Black Panther,1,hiphop,why i go easy \n know why i go easy \n wait \n...,why i go easy know why i go easy wait king of ...,-1.0,1.2,0.999983,0.931594,0.998384,0.095272,0.387043,0.101374,0.421143,0.985013,0.386352,0.921542
1,Kendrick Lamar,Black Panther The Album Music From And Inspire...,All The Stars,1,hiphop,love lets talk about love \n is it anything an...,love lets talk about love is it anything and e...,-1.0,1.3,0.996511,0.986234,0.991554,0.747748,0.98604,0.705842,0.441873,0.842244,0.157679,0.626636
2,Kendrick Lamar,Black Panther The Album Music From And Inspire...,I Am,1,hiphop,everybody put three fingers in the air \n the ...,everybody put three fingers in the air the sky...,-1.0,2.2,0.906971,0.318395,0.99498,0.574279,0.469771,0.320161,0.02456,0.785487,0.582447,0.669157
3,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Big Shot,1,hiphop,wakanda welcome \n big shot hol up wait peanut...,wakanda welcome big shot hol up wait peanut bu...,-1.0,1.3,0.967339,0.229457,0.975867,0.382183,0.880943,0.477181,0.000747,0.270318,0.787414,0.197281
4,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Pray For Me,1,hiphop,im always ready for a war again \n go down tha...,im always ready for a war again go down that r...,-1.0,1.1,0.98011,0.889895,0.977369,0.307116,0.962168,0.668731,0.864149,0.875111,0.464351,0.871338


**I will create a new column that is polarity multiplied by magnitude. Magnitude won't be that meaningful by itself, but the two together might be better than polarity alone.**

In [66]:
df_personality_sentiment["sentiment"] = df_personality_sentiment["polarity"] * \
                                        df_personality_sentiment["magnitude"]
df_personality_sentiment

Unnamed: 0,artist,album,name,explicit,genre,chain_lyrics,lyrics,polarity,magnitude,artistic,emotion,imagination,assertive,cheeful,outgoing,modesty,sympathy,fiery,melancholy,popularity,sentiment
0,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Black Panther,1,hiphop,why i go easy \n know why i go easy \n wait \n...,why i go easy know why i go easy wait king of ...,-1.0,1.2,0.999983,0.931594,0.998384,0.095272,0.387043,0.101374,0.421143,0.985013,0.386352,0.921542,57,-1.20
1,Kendrick Lamar,Black Panther The Album Music From And Inspire...,All The Stars,1,hiphop,love lets talk about love \n is it anything an...,love lets talk about love is it anything and e...,-1.0,1.3,0.996511,0.986234,0.991554,0.747748,0.986040,0.705842,0.441873,0.842244,0.157679,0.626636,78,-1.30
2,Kendrick Lamar,Black Panther The Album Music From And Inspire...,I Am,1,hiphop,everybody put three fingers in the air \n the ...,everybody put three fingers in the air the sky...,-1.0,2.2,0.906971,0.318395,0.994980,0.574279,0.469771,0.320161,0.024560,0.785487,0.582447,0.669157,62,-2.20
3,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Big Shot,1,hiphop,wakanda welcome \n big shot hol up wait peanut...,wakanda welcome big shot hol up wait peanut bu...,-1.0,1.3,0.967339,0.229457,0.975867,0.382183,0.880943,0.477181,0.000747,0.270318,0.787414,0.197281,67,-1.30
4,Kendrick Lamar,Black Panther The Album Music From And Inspire...,Pray For Me,1,hiphop,im always ready for a war again \n go down tha...,im always ready for a war again go down that r...,-1.0,1.1,0.980110,0.889895,0.977369,0.307116,0.962168,0.668731,0.864149,0.875111,0.464351,0.871338,76,-1.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
822,Halsey,BADLANDS (Deluxe),Haunting,0,pop,keep on haunting \n keep on haunting me \n kee...,keep on haunting keep on haunting me keep on h...,-1.0,1.4,0.930841,0.915972,0.965041,0.019663,0.737102,0.104255,0.964988,0.768947,0.798722,0.982352,59,-1.40
823,Halsey,BADLANDS (Deluxe),Gasoline,1,pop,are you insane like me been in pain like me \n...,are you insane like me been in pain like me bo...,-1.0,1.0,0.988442,0.748622,0.999587,0.672225,0.720852,0.430028,0.083116,0.693261,0.935605,0.962687,72,-1.00
824,Halsey,BADLANDS (Deluxe),Control,0,pop,they send me away to find them a fortune \n a ...,they send me away to find them a fortune a che...,-1.0,1.7,0.995897,0.990308,0.999998,0.062743,0.096730,0.006704,0.654713,0.966235,0.990844,0.999999,69,-1.70
825,Halsey,BADLANDS (Deluxe),Young God,1,pop,he says oh baby girl you know were gonna be le...,he says oh baby girl you know were gonna be le...,0.2,1.1,0.974639,0.804268,0.934779,0.226985,0.578715,0.297378,0.678562,0.892613,0.147212,0.884749,60,0.22


**Testing between Linear Regression, KNeighbors, and Random Forest again. I randomly choose one feature to look at.**

In [53]:
pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

X_train = df_personality_sentiment[["emotion"]]
y_train = df_personality_sentiment["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)
np.sqrt(mean_squared_error(y_train, y_train_))

19.347280541425654

In [54]:
pipeline = make_pipeline(
    StandardScaler(),
    Lasso()
)

X_train = df_personality_sentiment[["emotion"]]
y_train = df_personality_sentiment["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)
np.sqrt(mean_squared_error(y_train, y_train_))

19.373106729397527

In [56]:
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

X_train = df_personality_sentiment[["emotion"]]
y_train = df_personality_sentiment["popularity"]

pipeline.fit(X_train, y_train)
y_train_ = pipeline.predict(X_train)
np.sqrt(mean_squared_error(y_train, y_train_))

17.338610798608833

**KNeighbors does the best again.**

In [71]:
rmse = []
feature = []
for features in [["polarity"],
                 ["magnitude"],
                 ["artistic"],
                 ["imagination"],
                 ["emotion"],
                 ["assertive"],
                 ["cheeful"],
                 ["outgoing"],
                 ["modesty"],
                 ["sympathy"],
                 ["fiery"],
                 ["melancholy"],
                 ["sentiment"]]:

  pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsRegressor()
  )
  X_train = df_personality_sentiment[features]
  y_train = df_personality_sentiment["popularity"]

  pipeline.fit(X_train, y_train)
  y_train_ = pipeline.predict(X_train)
  feature.append(features)
  rmse.append(np.sqrt(mean_squared_error(y_train, y_train_)))

df = {}
df["feature"] = feature
df["rmse"] = rmse
df1 = pd.DataFrame(df)
df1

Unnamed: 0,feature,rmse
0,[polarity],21.961073
1,[magnitude],20.569535
2,[artistic],17.711598
3,[imagination],17.09177
4,[emotion],17.338611
5,[assertive],17.398655
6,[cheeful],17.197266
7,[outgoing],16.81265
8,[modesty],17.062754
9,[sympathy],17.398008


In [72]:
rmse = []
feature = []
for features in [["outgoing"],
                 ["outgoing", "fiery"],
                 ["outgoing", "fiery", "modesty"],
                 ["outgoing", "fiery", "modesty", "imagination"]]:

  pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsRegressor()
  )
  X_train = df_personality_sentiment[features]
  y_train = df_personality_sentiment["popularity"]

  pipeline.fit(X_train, y_train)
  y_train_ = pipeline.predict(X_train)
  feature.append(features)
  rmse.append(np.sqrt(mean_squared_error(y_train, y_train_)))

df = {}
df["feature"] = feature
df["rmse"] = rmse
df1 = pd.DataFrame(df)
df1

Unnamed: 0,feature,rmse
0,[outgoing],16.81265
1,"[outgoing, fiery]",17.134215
2,"[outgoing, fiery, modesty]",17.032498
3,"[outgoing, fiery, modesty, imagination]",17.472943


**Outgoing by itself is best. However, I want to do the same check with RandomForest.**

In [94]:
rmse = []
feature = []
for features in [["outgoing"],
                 ["outgoing", "fiery"],
                 ["outgoing", "fiery", "modesty"],
                 ["outgoing", "fiery", "modesty", "imagination"],
                 ["outgoing", "fiery", "modesty", "imagination", "cheeful"],
                 ["outgoing", "fiery", "modesty", "imagination", "cheeful",
                  "emotion"]]:

  pipeline = make_pipeline(
      StandardScaler(),
      RandomForestRegressor(max_features="sqrt")
  )
  X_train = df_personality_sentiment[features]
  y_train = df_personality_sentiment["popularity"]

  pipeline.fit(X_train, y_train)
  y_train_ = pipeline.predict(X_train)
  feature.append(features)
  rmse.append(np.sqrt(mean_squared_error(y_train, y_train_)))

df = {}
df["feature"] = feature
df["rmse"] = rmse
df1 = pd.DataFrame(df)
df1

Unnamed: 0,feature,rmse
0,[outgoing],8.617519
1,"[outgoing, fiery]",7.906745
2,"[outgoing, fiery, modesty]",7.619998
3,"[outgoing, fiery, modesty, imagination]",7.628521
4,"[outgoing, fiery, modesty, imagination, cheeful]",7.484226
5,"[outgoing, fiery, modesty, imagination, cheefu...",7.614578


**I just want to check whether RandomForest is overfitting compared to KNeighbors, so I check the cross-validated error for both.**

In [79]:
pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsRegressor()
  )

X_train = df_personality_sentiment[["outgoing"]]
y_train = df_personality_sentiment["popularity"]
  
cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

21.257367268515193

In [95]:
pipeline = make_pipeline(
      StandardScaler(),
      RandomForestRegressor(max_features="sqrt")
  )

X_train = df_personality_sentiment[["outgoing", "fiery", "modesty", 
                                    "imagination", "cheeful"]]
y_train = df_personality_sentiment["popularity"]
  
cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

20.001239315263923

**RandomForest still performs a little better. Now, since I decided to treat explicit as a textual feature, I add it.**

In [96]:
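# Note: remainder defaults to "drop", so only the encoded `explicit`
# column reaches the model; remainder="passthrough" would keep the rest.
ct = make_column_transformer(
    (OneHotEncoder(), ["explicit"])
)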

pipeline = make_pipeline(
      ct,
      StandardScaler(),
      RandomForestRegressor(max_features="sqrt")
  )

X_train = df_personality_sentiment[["outgoing", "fiery", "modesty", 
                                    "imagination", "cheeful", "explicit"]]
y_train = df_personality_sentiment["popularity"]
  
cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

19.27278910671483
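In scikit-learn, `make_column_transformer` drops every column that is not listed unless `remainder="passthrough"` is given. If the intent is to add `explicit` to the personality features rather than replace them, the difference can be checked on a toy frame (the values below are made up for illustration):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the same kind of columns as the cell above.
X_demo = pd.DataFrame({
    "outgoing": [0.1, 0.7, 0.3, 0.9],
    "fiery": [0.4, 0.2, 0.8, 0.6],
    "explicit": [1, 0, 1, 0],
})

# Default remainder="drop": only the encoded `explicit` columns survive.
ct_drop = make_column_transformer((OneHotEncoder(), ["explicit"]))
dropped = ct_drop.fit_transform(X_demo)
print(dropped.shape)  # (4, 2)

# remainder="passthrough": the untouched columns are appended after them.
ct_keep = make_column_transformer((OneHotEncoder(), ["explicit"]),
                                  remainder="passthrough")
kept = ct_keep.fit_transform(X_demo)
print(kept.shape)  # (4, 4)
```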

**I want to try some categorical encodings from the category_encoders package to see if they work better than OneHotEncoder.**

In [98]:
ct = make_column_transformer(
    (ce.TargetEncoder(), ["explicit"])
)

pipeline = make_pipeline(
      ct,
      StandardScaler(),
      RandomForestRegressor(max_features="sqrt")
  )

X_train = df_personality_sentiment[["outgoing", "fiery", "modesty", 
                                    "imagination", "cheeful", "explicit"]]
y_train = df_personality_sentiment["popularity"]
  
cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

19.248401384120392

In [109]:
ct = make_column_transformer(
    (ce.CatBoostEncoder(), ["explicit"])
)

pipeline = make_pipeline(
      ct,
      StandardScaler(),
      RandomForestRegressor(max_features="sqrt")
  )

X_train = df_personality_sentiment[["outgoing", "fiery", "modesty", 
                                    "imagination", "cheeful", "explicit"]]
y_train = df_personality_sentiment["popularity"]
  
cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

20.132705721492357

**So TargetEncoder works the best.**

In [137]:
ct = make_column_transformer(
    (ce.TargetEncoder(), ["explicit"])
)

pipeline = make_pipeline(
    ct,
    StandardScaler(),
    RandomForestRegressor(max_features = 'sqrt')
)

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)] + [None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'randomforestregressor__n_estimators': n_estimators,
               'randomforestregressor__max_depth': max_depth,
               'randomforestregressor__min_samples_split': min_samples_split,
               'randomforestregressor__min_samples_leaf': min_samples_leaf,
               'randomforestregressor__bootstrap': bootstrap}

rs = RandomizedSearchCV(
    pipeline, param_distributions=random_grid, n_iter=100, 
    scoring="neg_root_mean_squared_error", cv=10)

model = rs.fit(
    df_personality_sentiment[["outgoing", "fiery", "modesty", 
                              "imagination", "cheeful", "explicit"]],
    df_personality_sentiment["popularity"]
    )

model.best_params_


{'randomforestregressor__bootstrap': True,
 'randomforestregressor__max_depth': 60,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__min_samples_split': 5,
 'randomforestregressor__n_estimators': 300}

In [142]:
-model.best_score_

19.249630112235714

**The best textual features model beat the best audio features model (19.25 < 19.54). However, they are very close. Both are likely weak because the data has no time labels.**

# Lyrical Model

**Now I try a model using TF-IDF on the lyrics. I use the stop words that I used for my n-gram bar chart.**

In [112]:
# list(...) because recent scikit-learn versions require stop_words to be a list
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
      ["oh", "yeah", "im", "hey"]))

pipeline = make_pipeline(
    TfidfVectorizer(stop_words=my_stop_words),
    RandomForestRegressor(max_features = 'sqrt')
)

cv_errs = -cross_val_score(pipeline,
             df_personality_sentiment["lyrics"],
             df_personality_sentiment["popularity"], 
             scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

19.45523398157722
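To make the input of the forest more concrete, here is a small sketch of what `TfidfVectorizer` produces with the custom stop words. The two-song corpus is hypothetical and only stands in for the `lyrics` column:

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-song corpus, standing in for the "lyrics" column.
toy_lyrics = [
    "oh yeah love lets talk about love",
    "im always ready for a war again hey",
]

# Passed as a list, which is the type recent scikit-learn versions expect.
stop_words = list(text.ENGLISH_STOP_WORDS.union(["oh", "yeah", "im", "hey"]))
vectorizer = TfidfVectorizer(stop_words=stop_words)
matrix = vectorizer.fit_transform(toy_lyrics)

# One row per song, one column per remaining vocabulary word.
vocabulary = sorted(vectorizer.vocabulary_)
print(matrix.shape, vocabulary)
```

Each song becomes one sparse row, so on the full dataset the number of columns equals the size of the lyric vocabulary after stop-word removal.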

**Hyperparameter tuning the lyric model.**

In [115]:
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
      ["oh", "yeah", "im", "hey"]))

pipeline = make_pipeline(
    TfidfVectorizer(stop_words=my_stop_words),
    RandomForestRegressor(max_features = 'sqrt')
)

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)] + [None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'randomforestregressor__n_estimators': n_estimators,
               'randomforestregressor__max_depth': max_depth,
               'randomforestregressor__min_samples_split': min_samples_split,
               'randomforestregressor__min_samples_leaf': min_samples_leaf,
               'randomforestregressor__bootstrap': bootstrap}

rs = RandomizedSearchCV(
    pipeline, param_distributions = random_grid, n_iter = 100, 
    scoring="neg_root_mean_squared_error", cv = 10)

model = rs.fit(
    df_personality_sentiment["lyrics"],
    df_personality_sentiment["popularity"])

model.best_params_

{'randomforestregressor__bootstrap': False,
 'randomforestregressor__max_depth': 20,
 'randomforestregressor__min_samples_leaf': 4,
 'randomforestregressor__min_samples_split': 5,
 'randomforestregressor__n_estimators': 100}

In [117]:
model.best_score_

-19.095938883814654

**It seems that predicting from the lyrics themselves beats both the audio and the textual features. A TF-IDF transformation produces so many dimensions that adding the textual features wouldn't really change the model at all.**

# XGBoost

**Now I want to try something new to me with XGBoost, on the lyric model and textual features model.**

In [0]:
from xgboost import XGBRegressor

In [27]:
X_train = df_personality_sentiment[["outgoing", "fiery", "modesty", 
                                    "imagination", "cheeful", "explicit"]]
y_train = df_personality_sentiment["popularity"]

ct = make_column_transformer(
    (ce.TargetEncoder(), ["explicit"])
)

# early_stopping_rounds and eval_set are omitted: eval_set is an argument
# of fit(), and early stopping needs a validation set held out from training.
pipeline = make_pipeline(
      ct,
      StandardScaler(),
      XGBRegressor(objective='reg:squarederror', n_estimators=1000,
                   learning_rate=0.05)
  )

cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

19.26089777901664

In [26]:
X_train = df_personality_sentiment["lyrics"]
y_train = df_personality_sentiment["popularity"]

my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
      ["oh", "yeah", "im", "hey"]))

# early_stopping_rounds and eval_set are omitted: eval_set is an argument
# of fit(), and early stopping needs a validation set held out from training.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words=my_stop_words),
    XGBRegressor(objective='reg:squarederror', n_estimators=500,
                 learning_rate=0.05)
)

cv_errs = -cross_val_score(pipeline, X_train, y_train, 
                          scoring="neg_root_mean_squared_error", cv=10)

cv_errs.mean()

20.756357166704266

**Neither was improved by using XGBoost.**
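Early stopping in gradient boosting is only informative when it monitors rows the model is not being trained on. The sketch below illustrates that idea with scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`. That swap is an assumption made to keep the example independent of the XGBoost version, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data standing in for the encoded features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=5, size=400)

# validation_fraction holds out 20% of the rows passed to fit();
# boosting stops once that held-out score fails to improve for 5 rounds.
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  validation_fraction=0.2,
                                  n_iter_no_change=5, random_state=0)
model.fit(X_demo, y_demo)

# n_estimators_ is the number of boosting rounds actually used.
print(model.n_estimators_)
```

Because the held-out rows are split off inside `fit()`, this estimator can be dropped into `cross_val_score` without leaking the test fold into the stopping decision.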

# Bigram Markov Chain Model

**Just for fun, I use a bigram Markov chain model to generate lyrics for each genre, and use them as a different way to approximate test error for something simple, like guessing genre from a song's lyrics. I won't have any audio or textual features for these generated lyrics. I could compute textual features like before, but not audio features.**

In [0]:
from sklearn.ensemble import RandomForestClassifier

In [37]:
X_train = df_personality_sentiment["lyrics"]
y_train = df_personality_sentiment["genre"]

my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
      ["oh", "yeah", "im", "hey"]))

pipeline = make_pipeline(
    TfidfVectorizer(stop_words=my_stop_words),
    RandomForestClassifier(max_features = 'sqrt')
)

cv_errs = cross_val_score(pipeline, X_train, y_train, 
                          scoring="f1_macro", cv=10)

cv_errs.mean()

0.8446075580593913

In [38]:
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
      ["oh", "yeah", "im", "hey"]))

pipeline = make_pipeline(
    TfidfVectorizer(stop_words=my_stop_words),
    RandomForestClassifier(max_features = 'sqrt')
)

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)] + [None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'randomforestclassifier__n_estimators': n_estimators,
               'randomforestclassifier__max_depth': max_depth,
               'randomforestclassifier__min_samples_split': min_samples_split,
               'randomforestclassifier__min_samples_leaf': min_samples_leaf,
               'randomforestclassifier__bootstrap': bootstrap}

rs = RandomizedSearchCV(
    pipeline, param_distributions = random_grid, n_iter = 100, 
    scoring="f1_macro", cv = 10)

model = rs.fit(
    df_personality_sentiment["lyrics"],
    df_personality_sentiment["genre"])

model.best_params_

{'randomforestclassifier__bootstrap': False,
 'randomforestclassifier__max_depth': 40,
 'randomforestclassifier__min_samples_leaf': 1,
 'randomforestclassifier__min_samples_split': 5,
 'randomforestclassifier__n_estimators': 100}

In [39]:
model.best_score_

0.8547687257625896

**This is the best model for predicting genre.**

In [42]:
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
      ["oh", "yeah", "im", "hey"]))

pipeline = make_pipeline(
    TfidfVectorizer(stop_words=my_stop_words),
    RandomForestClassifier(max_features = 'sqrt', n_estimators=100,
                           max_depth=40, min_samples_split=5,
                           min_samples_leaf=1, bootstrap=False)
)

X_train = df_personality_sentiment["lyrics"]
y_train = df_personality_sentiment["genre"]

cv_errs = cross_val_score(pipeline, X_train, y_train, 
                          scoring="f1_macro", cv=10)

cv_errs.mean()

0.8454914532964939

In [0]:
def get_bigrams(words):
    """Return consecutive word pairs (bigrams) from a list of tokens."""
    return zip(words, words[1:])

def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: an iterable of token lists, where each list holds
                the words of one song.
    
    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyric in lyrics:
      # Build a padded copy instead of mutating the caller's token list
      tokens = ["<START>"] + list(lyric) + ["<END>"]
      for first, second in get_bigrams(tokens):
        # Record `second` as a possible successor of `first`
        chain.setdefault(first, []).append(second)
        
    return chain

In [0]:
tokens_hiphop = df_personality_sentiment[
       df_personality_sentiment["genre"]=="hiphop"]["chain_lyrics"].str.split()
tokens_pop = df_personality_sentiment[
       df_personality_sentiment["genre"]=="pop"]["chain_lyrics"].str.split()
       
markov_hiphop = train_markov_chain(tokens_hiphop)
markov_pop = train_markov_chain(tokens_pop)

In [0]:
import random
def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain["<START>"]))
    i = 0
    while words[i] != "<END>":
      words.append(random.choice(chain[words[i]]))
      i += 1
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [49]:
new_pop = []
new_hiphop = []
for i in range(500):
  new_pop.append(generate_new_lyrics(markov_pop))
for i in range(500):
  new_hiphop.append(generate_new_lyrics(markov_hiphop))

df = {}
df["pop_lyrics"] = new_pop
df["hiphop_lyrics"] = new_hiphop
df1 = pd.DataFrame(df)
df1

Unnamed: 0,pop_lyrics,hiphop_lyrics
0,youve ever getting back i close handful of us ...,mass hallucination baby baby platinum on the d...
1,climb the way home and i lost in a different p...,right back i give it moving got some different...
2,i i was a sign i shouldve said i i can pick it...,you the trap trapmoneybenny this nothing for r...
3,center of something real bad idea forget the o...,ive got friends way heading up bitch dont shin...
4,its the sky does this way beneath this is an o...,cant photoshop me later we slang no my grip lo...
...,...,...
495,we see youre a deep cut deep as if i remember ...,unread texts i do and you only upsets me and g...
496,put your ways i been on me love the attitude y...,these niggas only if i know i got court drop d...
497,bada bing youll believe that you are where you...,new nigga pull in the left side yall niggas fo...
498,youve been with my baby come on a negative way...,hendrix ah do im in public blame the flow dumb...


In [55]:
pipeline.fit(X_train, y_train)
list(pipeline.predict(df1["pop_lyrics"])).count("pop")/500

0.936

In [58]:
list(pipeline.predict(df1["hiphop_lyrics"])).count("hiphop")/500

0.842

**If we can consider lyrics generated by a Markov chain model of hiphop lyrics to be truly hiphop, then these models could be a good way to estimate test error. The new lyrics are not independent of the old ones, but the inverse document frequencies should be different. This model is better at predicting pop lyrics than hiphop lyrics.**
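The dependence between generated and original lyrics can be seen on a toy example. The sketch below is self-contained, with hypothetical helper names and two made-up "songs". Every adjacent word pair it generates already occurs in the training lyrics, which is why the generated songs are not an independent test set.

```python
import random


def train_bigram_chain(songs):
    """Map each word to the list of words observed right after it."""
    chain = {}
    for tokens in songs:
        padded = ["<START>"] + tokens + ["<END>"]
        for first, second in zip(padded, padded[1:]):
            chain.setdefault(first, []).append(second)
    return chain


def generate(chain, rng):
    """Walk the chain from <START> until <END> is drawn."""
    words, current = [], "<START>"
    while True:
        current = rng.choice(chain[current])
        if current == "<END>":
            return words
        words.append(current)


# Two hypothetical "songs", already tokenized.
songs = [["keep", "on", "haunting", "me"], ["keep", "on", "running"]]
chain = train_bigram_chain(songs)
generated = generate(chain, random.Random(0))
print(" ".join(generated))
```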