# Make a Good Model for the Mood of the Song

In this notebook I do more NLP on the track lyrics and bring back columns that I discarded in the MVP.

Good model = more NLP on the Track Lyrics<BR />
Better model = also looking at number of streams and position<BR />
Best model = also looking at time on top list<BR />

Once we have the best model, we can use its predictions to estimate the mood of a country and the mood of an artist. Then we can say which artist suits which country.

## Import stuff

In [1]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import sklearn.metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline



## Load data

In [2]:
data = pd.read_csv('./data_top10c_more_lyrics.csv')

In [3]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,Position,Streams,Track Name,Artist,ID,Date,Year,Month,Day,Country,Region,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,0,177,40381,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,2017-10-05,2017,10,5,gb,eu,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,1,151,24132,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-23,2017,12,23,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
2,2,78,49766,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-24,2017,12,24,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756


## Fix a little bit with the data

**Drop rows that are duplicates and keep only one row for each song**

In [4]:
data_per_song = data.drop_duplicates(subset=['Track Name'], keep='first')
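Deduplicating on `Track Name` treats two different songs that share a title as one song. I have not checked whether that happens in this data set, so the rows below are made up; the helper `keep_first` is a hypothetical plain-Python stand-in for `drop_duplicates(..., keep='first')`. It only shows how the choice of key changes the result.

```python
# Made-up rows: one song charting on two days, plus a different song with the same title.
rows = [
    {'Track Name': 'Hello', 'Artist': 'Artist A', 'ID': 'id_1'},
    {'Track Name': 'Hello', 'Artist': 'Artist A', 'ID': 'id_1'},  # same song, another day
    {'Track Name': 'Hello', 'Artist': 'Artist B', 'ID': 'id_2'},  # different song, same title
]

def keep_first(rows, key):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

by_name = keep_first(rows, 'Track Name')
by_id = keep_first(rows, 'ID')
print(len(by_name), len(by_id))
```

If same-title songs turn out to matter, the pandas analogue would be `data.drop_duplicates(subset=['ID'], keep='first')`.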

**Drop all columns that might change per song**

In [5]:
nlp_data = data_per_song.drop(['Unnamed: 0', 'Position', 'Streams', 'Date', 'Year', 'Month', 'Day', 'Country', 'Region'], axis=1)

nlp_data.head(5)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
43,Douce Nuit,-M-,4EOJWkvkVDpkZrhC8iTDsI,,0.914,0.227,0.163,1.0,81.887,0.0498
44,Zomersessie,101Barz,3ypzzvHUfgwyqxhL9ym4fH,,0.00818,0.403,2.1e-05,1.0,155.748,0.365
47,Zomersessie (feat. 3robi),101Barz,2re4cLViiQw0NZZx5KUpV8,,0.00818,0.403,2.1e-05,1.0,155.748,0.365


**Look at missing values**

In [6]:
nlp_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6919 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          6919 non-null object
Artist              6919 non-null object
ID                  6919 non-null object
Lyrics              4190 non-null object
Acousticness        6918 non-null float64
Energy              6918 non-null float64
Instrumentalness    6918 non-null float64
Mode                6918 non-null float64
Tempo               6918 non-null float64
Valence             6918 non-null float64
dtypes: float64(6), object(4)
memory usage: 594.6+ KB


**Drop rows that have missing values in the Lyrics column**<BR />
We can use dropna to drop all rows that have missing values (mostly in the Lyrics column)

In [7]:
nlp_data_clean = nlp_data.dropna(axis=0, how='any')

nlp_data_clean.head()

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528
50,Vide,13 Block,69RclklKbEelwfQJCBzh0m,"13 Blo' gang, tu sais d'jà comment on opère mo...",0.505,0.682,6e-06,0.0,112.063,0.514
71,10 Dinger,187 Strassenbande,3ruUVcomUKxPlX8srBfMua,"Ich schwör' dir, wenn ich mal Kohle mache, dan...",0.0182,0.673,0.0,1.0,94.54,0.642


In [8]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
dtypes: float64(6), object(4)
memory usage: 360.1+ KB


### TextBlob

**Turn the lyrics in the Lyrics column into strings**

In [9]:
nlp_data_clean['Lyrics'] = nlp_data_clean['Lyrics'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
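The warning above appears because the frame being modified was itself derived from another frame. One common way to avoid it is to take an explicit copy before adding or overwriting columns. The sketch below uses a small made-up frame (`raw`, `clean` and `n_words` are illustrative names, not from this notebook) and assumes pandas is installed.

```python
import pandas as pd

# Made-up stand-in for the lyrics table
raw = pd.DataFrame({'Lyrics': ['la la la', None, 'oh baby'],
                    'Valence': [0.8, 0.2, 0.5]})

# .copy() makes the filtered frame independent of `raw`,
# so later column assignments clearly target `clean` only
clean = raw.dropna(axis=0, how='any').copy()

clean['Lyrics'] = clean['Lyrics'].astype(str)
clean['n_words'] = clean['Lyrics'].apply(lambda text: len(text.split()))

print(clean)
```

Applied to this notebook, that would mean writing `nlp_data.dropna(axis=0, how='any').copy()` when `nlp_data_clean` is created.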


**Make and run function for TextBlob on the Lyrics**

In [10]:
def sentiment_func(lyrics):
    try:
        return TextBlob(lyrics).sentiment
    except Exception:  # TextBlob failed on this text; leave the sentiment empty
        return None

nlp_data_clean['pol_sub'] = nlp_data_clean['Lyrics'].apply(sentiment_func)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


**Split the pol_sub column into 2 new columns (Polarity, Subjectivity)**

In [11]:
nlp_data_clean['pol_sub'][0][0]

nlp_data_clean['Polarity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[0])
nlp_data_clean['Subjectivity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[1])

nlp_data_clean.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,pol_sub,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,"(-0.044454619454619454, 0.5908017908017905)",-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,"(0.5831501831501833, 0.6706959706959708)",0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,"(0.1738095238095238, 0.5416666666666666)",0.17381,0.541667


**Drop the pol_sub column**

In [12]:
nlp_data_clean = nlp_data_clean.drop(['pol_sub'], axis=1)

nlp_data_clean.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,0.17381,0.541667


### Do a quick check of the entire data frame

In [13]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 12 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
Polarity            4190 non-null float64
Subjectivity        4190 non-null float64
dtypes: float64(8), object(4)
memory usage: 585.5+ KB


In [14]:
nlp_data_clean.describe()

Unnamed: 0,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
count,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0
mean,0.238886,0.660887,0.009918,0.54105,120.254388,0.48252,0.071806,0.453405
std,0.234458,0.165181,0.065771,0.498372,26.560422,0.221104,0.225631,0.228036
min,3e-06,0.0279,0.0,0.0,54.082,0.0371,-1.0,0.0
25%,0.051525,0.562,0.0,0.0,99.98425,0.31,-0.033272,0.35
50%,0.159,0.676,0.0,1.0,120.004,0.473,0.046612,0.4875
75%,0.368,0.78475,3.8e-05,1.0,136.04475,0.654,0.189943,0.591449
max,0.988,0.995,0.89,1.0,232.69,0.982,1.0,1.0


## Train/Test-split

Divide the data into a train and a test set (with a test set of 25%, which is also the default)

In [15]:
dep   = nlp_data_clean['Valence']
indep = nlp_data_clean

In [16]:
indep_train, indep_test, dep_train, dep_test = train_test_split(indep, dep, test_size = 0.25, random_state=24)
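To see where the row counts of the two parts come from, here is a plain-Python sketch of a shuffled split. It assumes the test share is rounded up, which matches the 3142 and 1048 rows reported further down. `split_sizes` and `shuffle_split` are hypothetical helpers, and the `random` seed does not reproduce the shuffle that scikit-learn makes with `random_state=24`.

```python
import math
import random

def split_sizes(n_samples, test_size=0.25):
    """Sizes of (train, test) when the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

def shuffle_split(items, test_size=0.25, seed=24):
    """Shuffle a copy of `items` and cut it into (train, test)."""
    n_train, _ = split_sizes(len(items), test_size)
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

print(split_sizes(4190))
train, test = shuffle_split(range(4190))
print(len(train), len(test))
```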

## NLP

### CountVectorizer (use in model 1)

Since music is an art form, like poetry, I might consider not removing stop words. (Maybe in a later try?!)

In [17]:
# instantiate the model
cvec = CountVectorizer(stop_words='english', max_features = 1000)
# eliminate English stop words and cap max_features, since there would be more than 60,000 features otherwise

In [18]:
# fit the count vectorizer with training data. 
cvec.fit(indep_train['Lyrics'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
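A rough plain-Python sketch of what the count vectorizer does with these settings: lowercase the text, keep tokens of two or more word characters, drop stop words, keep the most frequent tokens and count them per document. The stop list is a tiny stand-in, the lyrics are made up, and the tie-breaking between equally frequent tokens is my own choice. Note that an English stop list leaves function words from other languages, such as `und` and `das`, in the vocabulary.

```python
import re
from collections import Counter

TOKEN = re.compile(r'(?u)\b\w\w+\b')        # same token pattern as in the repr above
STOP_WORDS = {'the', 'and', 'you', 'me'}    # tiny stand-in for an English stop list

def tokenize(text):
    return [t for t in TOKEN.findall(text.lower()) if t not in STOP_WORDS]

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, returned in alphabetical order."""
    totals = Counter(tok for doc in docs for tok in tokenize(doc))
    top = sorted(totals, key=lambda tok: (-totals[tok], tok))[:max_features]
    return sorted(top)

def transform(docs, vocabulary):
    """One row of counts per document, one column per vocabulary token."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocabulary])
    return rows

lyrics = ['Baby baby, you and me', 'Ich und du und das Meer', 'Oh baby, the sea']
vocab = fit_vocabulary(lyrics, max_features=3)
matrix = transform(lyrics, vocab)
print(vocab)
print(matrix)
```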

In [19]:
# transform X_train
cvec_data = cvec.transform(indep_train['Lyrics'])

In [20]:
# Turn the features into a data frame
df  = pd.DataFrame(cvec_data.todense(),columns=cvec.get_feature_names())

df.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,4,0,0,0,0,0,0


In [21]:
len(indep_train)

3142

In [22]:
len(df)

3142

In [23]:
# Concat with big data frame and use for fitting the model
indep_train_cvec = pd.concat([indep_train.reset_index(drop=True), df], axis=1)

indep_train_cvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,Atemlos durch die Nacht,Helene Fischer,5fPGpdC4tmcVMmTuJV2HRg,Wir ziehen durch die Straßen und die Clubs die...,0.049,0.73,2e-06,1.0,128.041,0.866,...,0,0,0,0,3,0,0,0,0,0
1,La vita liquida,Brunori Sas,7ctWJ718cdHHqINJlRTuxF,Liquido è il mio corpo che si piega ad ogni co...,0.558,0.621,2.6e-05,0.0,88.065,0.585,...,0,0,0,0,0,0,0,0,0,0
2,Zum ersten Mal Nintendo,Philipp Poisel,2UcgmsztMXVyPo3VgqD5Bu,wie oft wollt' ich weg von hier? anders als di...,0.543,0.481,0.000299,1.0,98.983,0.503,...,0,0,1,4,0,0,0,0,0,0


In [24]:
len(indep_train_cvec)

3142

In [25]:
######################

In [26]:
# transform X_test
cvec_data2 = cvec.transform(indep_test['Lyrics'])

In [27]:
# Turn the features into a data frame
df2  = pd.DataFrame(cvec_data2.todense(),columns=cvec.get_feature_names())

df2.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
len(indep_test)

1048

In [29]:
len(df2)

1048

In [30]:
# Concat with big data frame and use for scoring the model
indep_test_cvec = pd.concat([indep_test.reset_index(drop=True), df2], axis=1)

indep_test_cvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,King Of The North,Bugzy Malone,4DixkYsZqImKOmjaIaYnCi,King! King! King! King! King! King!\nI'm King ...,0.0778,0.772,4.4e-05,0.0,139.886,0.376,...,0,0,0,0,0,0,0,0,0,0
1,Back On,Gucci Mane,0KA5Cc68h9qitLwTadHBpa,Zaytoven\nHah\nWop\nYeah\nIt's Gucci\nZay\nZig...,0.0087,0.639,4e-06,1.0,156.055,0.427,...,0,0,0,0,0,0,0,0,0,0
2,Magazine,Dark Polo Gang,71MGHgauMD6aapixtV6Chd,"Hey, hey\nSick Luke, Sick Luke\n\nLa mia facci...",0.396,0.437,0.0,1.0,136.111,0.525,...,0,0,0,0,0,0,0,0,0,0


In [31]:
len(indep_test_cvec)

1048

*If there is time in the future, consider stemming or lemmatization*

### TF-IDF (use in model 2)

In [32]:
# instantiate the model
tvec = TfidfVectorizer(stop_words='english', max_features = 1000)
# eliminate English stop words and cap max_features, since there would be more than 60,000 features otherwise

In [33]:
# fit the TF-IDF vectorizer with training data.
tvec.fit(indep_train['Lyrics'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
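As a sketch of how the weights differ from plain counts, the function below applies the settings shown in the repr (`use_idf=True`, `smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) to a small made-up count matrix. `tfidf_rows` is a hypothetical helper, not part of scikit-learn. A term that occurs in every document ends up with a lower weight than a rare term with the same count.

```python
import math

def tfidf_rows(count_rows):
    """Turn raw term counts into smoothed, L2-normalised TF-IDF weights."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in count_rows:
        raw = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw))
        weighted.append([v / norm if norm else 0.0 for v in raw])
    return weighted

# Made-up counts; column 0 occurs in every document, columns 1 and 2 are rare
counts = [[1, 1, 0],
          [1, 0, 1],
          [1, 0, 0]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 4) for v in row])
```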

In [34]:
# transform X_train
tvec_data3 = tvec.transform(indep_train['Lyrics'])

In [35]:
# Turn the features into a data frame
df3  = pd.DataFrame(tvec_data3.todense(), columns=tvec.get_feature_names())

df3.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.137169,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.031071,0.156462,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
len(indep_train)

3142

In [37]:
len(df3)

3142

In [38]:
# Concat with big data frame and use for fitting the model
indep_train_tvec = pd.concat([indep_train.reset_index(drop=True), df3], axis=1)

indep_train_tvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,Atemlos durch die Nacht,Helene Fischer,5fPGpdC4tmcVMmTuJV2HRg,Wir ziehen durch die Straßen und die Clubs die...,0.049,0.73,2e-06,1.0,128.041,0.866,...,0.0,0.0,0.0,0.0,0.137169,0.0,0.0,0.0,0.0,0.0
1,La vita liquida,Brunori Sas,7ctWJ718cdHHqINJlRTuxF,Liquido è il mio corpo che si piega ad ogni co...,0.558,0.621,2.6e-05,0.0,88.065,0.585,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Zum ersten Mal Nintendo,Philipp Poisel,2UcgmsztMXVyPo3VgqD5Bu,wie oft wollt' ich weg von hier? anders als di...,0.543,0.481,0.000299,1.0,98.983,0.503,...,0.0,0.0,0.031071,0.156462,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
len(indep_train_tvec)

3142

In [40]:
##########################

In [41]:
# transform X_test
tvec_data4 = tvec.transform(indep_test['Lyrics'])

In [42]:
# Turn the features into a data frame
df4  = pd.DataFrame(tvec_data4.todense(),columns=tvec.get_feature_names())

df4.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130963,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.189735,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
len(indep_test)

1048

In [44]:
len(df4)

1048

In [45]:
# Concat with big data frame and use for scoring the model
indep_test_tvec = pd.concat([indep_test.reset_index(drop=True), df4], axis=1)

indep_test_tvec.head(3) 

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,King Of The North,Bugzy Malone,4DixkYsZqImKOmjaIaYnCi,King! King! King! King! King! King!\nI'm King ...,0.0778,0.772,4.4e-05,0.0,139.886,0.376,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Back On,Gucci Mane,0KA5Cc68h9qitLwTadHBpa,Zaytoven\nHah\nWop\nYeah\nIt's Gucci\nZay\nZig...,0.0087,0.639,4e-06,1.0,156.055,0.427,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Magazine,Dark Polo Gang,71MGHgauMD6aapixtV6Chd,"Hey, hey\nSick Luke, Sick Luke\n\nLa mia facci...",0.396,0.437,0.0,1.0,136.111,0.525,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
len(indep_test_tvec)

1048

## Models (LinReg, Lasso and RF) - CountVec

### Linear Regression

In [47]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # standardize the predictors
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
    X_test_s = ss.transform(X_test)
    
    # fit
    model.fit(X_train_s, y_train)
    
    # Evaluate: predict and score
    y_pred = model.predict(X_test_s)
    y_true = y_test
    
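    # note: np.sqrt turns the MSE into the RMSE, so the value reported under 'MSE' is the root mean squared error
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))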
    
    # Evaluate: score
    score = model.score(X_test_s, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}
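The two numbers returned by the function can be written out by hand. The sketch below uses made-up valence values and the hypothetical helper names `mse`, `rmse` and `r2`. It shows that the square root of the MSE is the RMSE, and that R^2 becomes negative when the predictions are worse than always predicting the mean of the true values.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [0.2, 0.4, 0.6, 0.8]        # made-up valence values
good_pred = [0.3, 0.4, 0.5, 0.9]     # close to the truth
bad_pred = [0.8, 0.6, 0.4, 0.2]      # reversed order

print(mse(y_true, good_pred), rmse(y_true, good_pred), r2(y_true, good_pred))
print(r2(y_true, bad_pred))
```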

**LinReg #1 - CountVec + all coefs**

In [48]:
# define X and y
X_train = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train = indep_train['Valence'] 
X_test = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test = indep_test['Valence']

# choose model 
model = LinearRegression()

# call function
evaluate_model(model, X_train, X_test, y_train, y_test)

{'MSE': 0.24766785367015642, 'Score (R^2)': -0.20051192593378775}

**Importance of the coefficients**

In [49]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train.columns,model.coef_))).abs().sort_values(ascending=False).head(10)

Energy          0.091617
Acousticness    0.029776
und             0.022604
von             0.017408
nu              0.015495
baby            0.015344
nem             0.014615
sich            0.014229
das             0.013787
avant           0.013515
dtype: float64

**LinReg #2 - CountVec + top 3 coefs**

In [50]:
# define X and y
X_train2 = indep_train_cvec[['Energy', 'Acousticness', 'und']]
y_train2 = indep_train['Valence']
X_test2 = indep_test_cvec[['Energy', 'Acousticness', 'und']]
y_test2 = indep_test['Valence']

# choose model 
model2 = LinearRegression()

# call function
evaluate_model(model2, X_train2, X_test2, y_train2, y_test2)

{'MSE': 0.20734850323633328, 'Score (R^2)': 0.1585486584908018}

**LinReg #3 - CountVec + top 10 coefs**

In [51]:
# define X and y
X_train3 = indep_train_cvec[['Energy', 'Acousticness', 'und', 'von', 'nu', 'baby', 'nem', 'sich', 'das', 'avant']]
y_train3 = indep_train['Valence']
X_test3 = indep_test_cvec[['Energy', 'Acousticness', 'und', 'von', 'nu', 'baby', 'nem', 'sich', 'das', 'avant']]
y_test3 = indep_test['Valence']

# choose model 
model3 = LinearRegression()

# call function
evaluate_model(model3, X_train3, X_test3, y_train3, y_test3)

{'MSE': 0.20627792538890397, 'Score (R^2)': 0.1672153578309279}

*COMMENT: The best LinReg is nr 3*

This is a very inefficient way to find the best number of independent variables. I will run a Lasso model instead to get help with selecting variables.<BR />
An alternative would have been scikit-learn's `SelectKBest` transformer (keeps the k best-scoring features, where I choose k) or RFE (recursively eliminates the weakest variables).
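Why Lasso helps with variable selection can be sketched with soft-thresholding. In the idealised case of uncorrelated, standardised predictors, each Lasso coefficient is the least-squares coefficient pulled towards zero by alpha, and anything smaller than alpha is set to exactly zero. The word-count columns here are correlated, so this is only an intuition, and the coefficient values below are made up.

```python
def soft_threshold(value, alpha):
    """Shrink `value` towards zero by `alpha`; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

# Made-up least-squares coefficients for three standardised predictors
ols_coefs = {'Energy': 0.09, 'Acousticness': -0.03, 'und': 0.004}
alpha = 0.005

lasso_coefs = {name: soft_threshold(c, alpha) for name, c in ols_coefs.items()}
kept = [name for name, c in lasso_coefs.items() if c != 0.0]

print(lasso_coefs)
print(kept)
```

A larger alpha removes more variables, which is why alpha is the hyperparameter searched over in the cells that follow.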

### Lasso Regressor

**Lasso #1 - CountVec + all coefs**

In [52]:
# define X and y
X_train4 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train4 = indep_train['Valence']
X_test4 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test4 = indep_test['Valence']

In [53]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train4)
X_train4_s = ss.transform(X_train4)
X_test4_s = ss.transform(X_test4)

In [55]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train4_s, y_train4)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.1800685168502287
best_params: {'selection': 'cyclic', 'alpha': 0.004641588833612782}


In [56]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train4_s, y_train4)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.18171454688611116
best_params: {'alpha': 0.005336699231206312, 'selection': 'random'}
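The two searches look at the same space in different ways, which may explain why they land on neighbouring alphas with almost the same score. The sketch below rebuilds the space in plain Python. The list comprehension approximates `np.logspace(-3, 3, 100)`, and the `random` sample is only a stand-in for how a randomized search picks `n_iter=40` candidates, not the actual candidates tried above.

```python
import itertools
import random

# 100 values with exponents evenly spaced from -3 to 3, like np.logspace(-3, 3, 100)
alphas = [10 ** (-3 + 6 * i / 99) for i in range(100)]
selections = ('cyclic', 'random')

grid = list(itertools.product(alphas, selections))   # grid search: every combination
sampled = random.Random(24).sample(grid, 40)         # randomized search: 40 of them

print(len(grid), len(sampled))
print(alphas[11], alphas[12])   # the two neighbouring alphas found by the searches above
```

With 40 of 200 combinations, the randomized search covers a fifth of the grid, so it can miss the exact optimum while still getting close.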


In [57]:
# Lasso regression (best hyper params: input alpha and selection from above)
model4 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.0053, selection='random')              

# fit
model4.fit(X_train4_s, y_train4)

# Evaluate: predict
y_pred = model4.predict(X_test4_s)
y_true = y_test4
    
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model4.score(X_test4_s, y_test4)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.18087010491593836
MSE: 0.20457981864948047


In [58]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train4.columns,model4.coef_))).abs().sort_values(ascending=False).head(15)

Energy              0.082789
Acousticness        0.017239
baby                0.010654
Mode                0.009557
girl                0.008042
niggas              0.007130
el                  0.006408
pute                0.005941
putain              0.005607
Instrumentalness    0.005428
little              0.005357
bro                 0.005156
fast                0.004994
ville               0.004953
yo                  0.004788
dtype: float64

**Lasso #2 - CountVec + top 10 coefs**

In [59]:
# define X and y
X_train5 = indep_train_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl', 'niggas', 'el', 'pute', 'putain',
                             'Instrumentalness']]
y_train5 = indep_train['Valence']
X_test5 = indep_test_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl', 'niggas', 'el', 'pute', 'putain',
                             'Instrumentalness']]
y_test5 = indep_test['Valence']

In [60]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train5)
X_train5_s = ss.transform(X_train5)
X_test5_s = ss.transform(X_test5)

In [63]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train5_s, y_train5)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.17597402874421192
best_params: {'selection': 'random', 'alpha': 0.001}


In [64]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train5_s, y_train5)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.17597402874421192
best_params: {'alpha': 0.001, 'selection': 'random'}


In [65]:
# Lasso regression (best hyper params: input alpha and selection from above)
model5 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model5.fit(X_train5_s, y_train5)

# Evaluate: predict 
y_pred = model5.predict(X_test5_s)
y_true = y_test5
    
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model5.score(X_test5_s, y_test5)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.160808052014364
MSE: 0.20706993883989458


**Lasso #3 - CountVec + top 5 coefs**

In [66]:
# define X and y
X_train6 = indep_train_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl']]
y_train6 = indep_train['Valence']
X_test6 = indep_test_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl']]
y_test6 = indep_test['Valence']

In [67]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train6)
X_train6_s = ss.transform(X_train6)
X_test6_s = ss.transform(X_test6)

In [68]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train6_s, y_train6)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.16485354164376018
best_params: {'selection': 'cyclic', 'alpha': 0.002009233002565048}


In [69]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train6_s, y_train6)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.16526677632466533
best_params: {'alpha': 0.001, 'selection': 'random'}


In [70]:
# Lasso regression (best hyper params: input alpha and selection from above)
model6 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model6.fit(X_train6_s, y_train6)

# Evaluate: predict 
y_pred = model6.predict(X_test6_s)
y_true = y_test6
    
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model6.score(X_test6_s, y_test6)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.16772687310510903
MSE: 0.20621456536068097


*COMMENT: The best Lasso is nr 1*

### Random Forest Regressor

A Random Forest splits on thresholds of single features, so the predictors do not have to be scaled.
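A minimal sketch of why scaling is not needed: a single threshold split only depends on the order of the feature values, and standardising keeps that order. `best_split` is a hypothetical one-feature helper, not scikit-learn's implementation, and the tempo and valence values are made up.

```python
def best_split(xs, ys):
    """Return the set of row indices sent left by the best single threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float('inf'), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        sse = 0.0
        for side in (left, right):
            mean = sum(ys[i] for i in side) / len(side)
            sse += sum((ys[i] - mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left

tempo = [80.0, 95.0, 120.0, 140.0, 170.0]    # made-up feature values
valence = [0.2, 0.25, 0.6, 0.65, 0.7]        # made-up target values

# standardise the feature the same way StandardScaler would (mean 0, std 1)
mean = sum(tempo) / len(tempo)
std = (sum((t - mean) ** 2 for t in tempo) / len(tempo)) ** 0.5
tempo_scaled = [(t - mean) / std for t in tempo]

print(best_split(tempo, valence) == best_split(tempo_scaled, valence))
```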

In [75]:
def get_best_hype(model, params, X_train, y_train):  
    # Best Hyperparameters
    rs = RandomizedSearchCV(model, params, n_iter=40)
    
    # fit
    rs.fit(X_train, y_train)
     
    return {'best_score': rs.best_score_,'best_params': rs.best_params_} 

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # fit
    model.fit(X_train, y_train)
    
    # Evaluate: predict
    
    y_pred = model.predict(X_test)
    y_true = y_test
    
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score
    score = model.score(X_test, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}

**Random Forest #1 - CountVec + all features**

In [77]:
# Declare indep and dep
X_train7 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train7 = indep_train['Valence']
X_test7 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test7 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train7, y_train7)

{'best_params': {'bootstrap': True,
  'max_depth': 11,
  'max_features': 'auto',
  'n_estimators': 70,
  'verbose': 0},
 'best_score': 0.17289482631721254}

In [78]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
model7 = RandomForestRegressor(max_depth=11, max_features='auto', n_estimators=70, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model7, X_train7, X_test7, y_train7, y_test7)

{'MSE': 0.20451543764469096, 'Score (R^2)': 0.18138558203393873}

**Feature importance**

In [80]:
pd.Series(dict(zip(X_train7.columns,model7.feature_importances_))).abs().sort_values(ascending=False).head(15)

Energy              0.276219
Acousticness        0.042343
Instrumentalness    0.017891
Tempo               0.014927
baby                0.014496
Subjectivity        0.014376
Polarity            0.014055
oh                  0.008785
en                  0.006358
girl                0.006265
je                  0.006164
Mode                0.006061
ma                  0.005541
ben                 0.005226
che                 0.005105
dtype: float64

**Random Forest #2 - CountVec + top 7 features**

In [81]:
# define X and y
X_train8 = indep_train_cvec[['Energy', 'Acousticness', 'Instrumentalness', 'Tempo', 'baby', 'Subjectivity', 'Polarity']]
y_train8 = indep_train['Valence']
X_test8 = indep_test_cvec[['Energy', 'Acousticness', 'Instrumentalness', 'Tempo', 'baby', 'Subjectivity', 'Polarity']]
y_test8 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train8, y_train8)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'auto',
  'n_estimators': 80,
  'verbose': 0},
 'best_score': 0.1465060448554838}

In [82]:
# choose model and use the best hyperparameters (from RandomizedSearchCV in get_best_hype)
model8 = RandomForestRegressor(max_depth=5, max_features='auto', n_estimators=80, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model8, X_train8, X_test8, y_train8, y_test8)

{'MSE': 0.20688310071633145, 'Score (R^2)': 0.16232176581536006}

**Random Forest #3 - CountVec + top 3 features**

In [83]:
# define X and y
X_train9 = indep_train_cvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_train9 = indep_train['Valence']
X_test9 = indep_test_cvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_test9 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train9, y_train9)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'log2',
  'n_estimators': 40,
  'verbose': 0},
 'best_score': 0.13509606523375745}

In [84]:
# choose model and use the best hyperparameters (from RandomizedSearchCV in get_best_hype)
model9 = RandomForestRegressor(max_depth=5, max_features='log2', n_estimators=40, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model9, X_train9, X_test9, y_train9, y_test9)

{'MSE': 0.20945568443846793, 'Score (R^2)': 0.1413592411949619}

*COMMENT: The best RF is nr 1*

The overall best model was Random Forest nr 1, with an R2 score of 0.18138558203393873

## Models (LinReg, Lasso and RF) - TF-IDF

### Linear Regression

In [107]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # standardize the predictors
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
    X_test_s = ss.transform(X_test)
    
    # fit
    model.fit(X_train_s, y_train)
    
    # Evaluate: predict and score
    y_pred = model.predict(X_test_s)
    y_true = y_test
    
    # note: the square root makes this the RMSE, although it is reported under the 'MSE' key
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score
    score = model.score(X_test_s, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}
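
The helper above scales, fits and scores in separate steps, and the value it stores under `'MSE'` is the square root of the mean squared error. A minimal sketch of the same evaluation written as a scikit-learn `Pipeline` that reports MSE and RMSE separately is shown here. The function name `evaluate_with_pipeline` and the synthetic data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_with_pipeline(model, X_train, X_test, y_train, y_test):
    """Hypothetical variant of evaluate_model: scale, fit, report R^2, MSE and RMSE."""
    # the scaler is fit on the training data only, then reused for the test data
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {'Score (R^2)': r2_score(y_test, y_pred), 'MSE': mse, 'RMSE': np.sqrt(mse)}


# synthetic stand-in for the notebook's predictors and Valence target
rng = np.random.RandomState(24)
X = rng.normal(size=(200, 3))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=200)

result = evaluate_with_pipeline(LinearRegression(), X[:150], X[150:], y[:150], y[150:])
print(result)
```

Keeping the scaler inside the pipeline also means the same object can be passed to a cross-validated search without scaling the data beforehand.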

**LinReg #1 - TF-IDF + all coefs**

In [108]:
# define X and y
X_train10 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train10 = indep_train['Valence'] 
X_test10 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test10 = indep_test['Valence']

# choose model
model10 = LinearRegression()

# call function
evaluate_model(model10, X_train10, X_test10, y_train10, y_test10)

{'MSE': 0.23707443719666482, 'Score (R^2)': -0.1000100529864194}

**Importance of the coefficients**

In [109]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train10.columns,model10.coef_))).abs().sort_values(ascending=False).head(10)

Energy              0.092042
Acousticness        0.030842
baby                0.016421
und                 0.016024
von                 0.015729
bro                 0.013605
Mode                0.012768
Instrumentalness    0.012674
avant               0.012573
das                 0.012504
dtype: float64
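
The reduced models that follow are built by typing the top column names by hand. A minimal sketch of picking the top-k columns programmatically from the absolute coefficients, on synthetic data, could look like this. The helper name `top_k_features` and the toy data are assumptions; the column names are borrowed from the output above only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def top_k_features(model, columns, k):
    """Return the k column names with the largest absolute coefficients."""
    coefs = pd.Series(model.coef_, index=columns).abs()
    return coefs.sort_values(ascending=False).head(k).index.tolist()


# synthetic stand-in data; only the first three columns carry signal
rng = np.random.RandomState(24)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=['Energy', 'Acousticness', 'baby', 'Mode', 'Tempo'])
y = (0.9 * X['Energy'] - 0.3 * X['Acousticness'] + 0.1 * X['baby']
     + rng.normal(scale=0.05, size=300))

# fit on standardized predictors so the coefficients are comparable
model = LinearRegression().fit(StandardScaler().fit_transform(X), y)

top3 = top_k_features(model, X.columns, 3)
print(top3)

# the selected names can index the feature frame directly
X_top3 = X[top3]
```

The same list can then be used to index both the train and the test frame, which avoids the two lists drifting apart.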

**LinReg #2 - TF-IDF + top 3 coefs**

In [110]:
# define X and y
X_train11 = indep_train_tvec[['Energy', 'Acousticness', 'baby']]
y_train11 = indep_train['Valence']
X_test11 = indep_test_tvec[['Energy', 'Acousticness', 'baby']]
y_test11 = indep_test['Valence']

# choose model
model11 = LinearRegression()

# call function
evaluate_model(model11, X_train11, X_test11, y_train11, y_test11)

{'MSE': 0.20591294964191587, 'Score (R^2)': 0.1701597088004173}

**LinReg #3 - TF-IDF + top 10 coefs**

In [111]:
# define X and y
X_train12 = indep_train_tvec[['Energy', 'Acousticness', 'baby', 'und', 'von', 'bro', 'Mode', 'Instrumentalness',
                             'avant', 'das']]
y_train12 = indep_train['Valence']
X_test12 = indep_test_tvec[['Energy', 'Acousticness', 'baby', 'und', 'von', 'bro', 'Mode', 'Instrumentalness',
                             'avant', 'das']]
y_test12 = indep_test['Valence']

# choose model
model12 = LinearRegression()

# call function
evaluate_model(model12, X_train12, X_test12, y_train12, y_test12)

{'MSE': 0.20564608249825045, 'Score (R^2)': 0.17230929290631947}

*COMMENT: The best LinReg is nr 3*

### Lasso Regressor

**Lasso #1 - TF-IDF + all coefs**

In [112]:
# define X and y
X_train13 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train13 = indep_train['Valence']
X_test13 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test13 = indep_test['Valence']

In [113]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train13)
X_train13_s = ss.transform(X_train13)
X_test13_s = ss.transform(X_test13)

In [114]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# keep the intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so scikit-learn uses its default
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train13_s, y_train13)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.17587741186287675
best_params: {'selection': 'cyclic', 'alpha': 0.008111308307896872}


In [115]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# keep the intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so scikit-learn uses its default
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train13_s, y_train13)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.1804718117607225
best_params: {'alpha': 0.006135907273413176, 'selection': 'cyclic'}


In [116]:
# Lasso regression (best hyper params: input alpha and selection from above)
model13 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.0061, selection='cyclic')              

# fit
model13.fit(X_train13_s, y_train13)

# Evaluate: predict
y_pred = model13.predict(X_test13_s)
y_true = y_test13
    
# note: the square root makes this the RMSE, although it is printed as 'MSE'
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model13.score(X_test13_s, y_test13)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.18231891095186592
MSE: 0.20439881704399268


In [117]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train13.columns,model13.coef_))).abs().sort_values(ascending=False).head(15)

Energy              0.081251
Acousticness        0.015524
baby                0.012175
girl                0.009510
Mode                0.008787
niggas              0.007259
fun                 0.005929
bitch               0.005920
el                  0.005749
Instrumentalness    0.005213
elle                0.004706
putain              0.004406
long                0.004383
bro                 0.004156
pute                0.004139
dtype: float64
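
Because of its L1 penalty, a Lasso model sets some coefficients to exactly zero, so besides ranking the coefficients it can be useful to count how many features the model kept. The following is a minimal sketch on synthetic data. The feature names `f0` to `f19` and `alpha=0.05` are assumptions chosen for the toy example, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# synthetic data: 20 predictors, only f0, f1 and f2 influence the target
rng = np.random.RandomState(24)
columns = [f'f{i}' for i in range(20)]
X = pd.DataFrame(rng.normal(size=(500, 20)), columns=columns)
y = 0.5 * X['f0'] - 0.3 * X['f1'] + 0.2 * X['f2'] + rng.normal(scale=0.1, size=500)

# standardize, then fit a Lasso with an illustrative alpha
X_s = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.05, random_state=24, max_iter=1000).fit(X_s, y)

coefs = pd.Series(lasso.coef_, index=columns)
kept = coefs[coefs != 0].abs().sort_values(ascending=False)  # surviving features
n_zero = int((coefs == 0).sum())                             # features dropped by the penalty

print(kept)
print(f'{n_zero} of {len(coefs)} coefficients are exactly zero')
```

Applied to a fitted model such as `model13`, the same two lines on `coef_` would show how many of the TF-IDF columns actually contribute to the prediction.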

**Lasso #2 - TF-IDF + top 10 coefs**

In [118]:
# define X and y
X_train14 = indep_train_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode', 'niggas', 'fun', 'bitch', 'el',
                             'Instrumentalness']]
y_train14 = indep_train['Valence']
X_test14 = indep_test_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode', 'niggas', 'fun', 'bitch', 'el',
                             'Instrumentalness']]
y_test14 = indep_test['Valence']

In [119]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train14)
X_train14_s = ss.transform(X_train14)
X_test14_s = ss.transform(X_test14)

In [120]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# keep the intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so scikit-learn uses its default
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train14_s, y_train14)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.1799510512209579
best_params: {'selection': 'cyclic', 'alpha': 0.001}


In [121]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# keep the intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so scikit-learn uses its default
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train14_s, y_train14)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.17995132338800668
best_params: {'alpha': 0.001, 'selection': 'random'}


In [122]:
# Lasso regression (best hyper params: input alpha and selection from above)
model14 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model14.fit(X_train14_s, y_train14)

# Evaluate: predict 
y_pred = model14.predict(X_test14_s)
y_true = y_test14
    
# note: the square root makes this the RMSE, although it is printed as 'MSE'
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model14.score(X_test14_s, y_test14)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.16482342383120618
MSE: 0.20657394994100306


**Lasso #3 - TF-IDF + top 5 coefs**

In [123]:
# define X and y
X_train15 = indep_train_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode']]
y_train15 = indep_train['Valence']
X_test15 = indep_test_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode']]
y_test15 = indep_test['Valence']

In [124]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train15)
X_train15_s = ss.transform(X_train15)
X_test15_s = ss.transform(X_test15)

In [125]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# keep the intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so scikit-learn uses its default
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train15_s, y_train15)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.1676519870692203
best_params: {'selection': 'cyclic', 'alpha': 0.0013219411484660286}


In [126]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# keep the intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so scikit-learn uses its default
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train15_s, y_train15)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.16776773876394882
best_params: {'alpha': 0.001, 'selection': 'random'}


In [127]:
# Lasso regression (best hyper params: input alpha and selection from above)
model15 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model15.fit(X_train15_s, y_train15)

# Evaluate: predict 
y_pred = model15.predict(X_test15_s)
y_true = y_test15
    
# note: the square root makes this the RMSE, although it is printed as 'MSE'
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model15.score(X_test15_s, y_test15)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.16853585364051438
MSE: 0.20611431934618338


*COMMENT: The best Lasso is nr 1*

### Random Forest Regressor

Tree-based models split on thresholds of individual features, so the predictors do not have to be scaled for a Random Forest.
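
This can be checked empirically. A minimal sketch on synthetic data (not the notebook's data) fits the same forest once on raw and once on standardized predictors and compares the test R^2. The column scales and the target formula are assumptions made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(24)
Z = rng.normal(size=(400, 3))
y = Z[:, 0] ** 2 + Z[:, 1] + rng.normal(scale=0.1, size=400)
X = Z * np.array([1.0, 50.0, 0.01])  # put the columns on very different scales

X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

# same forest, once on raw and once on standardized predictors
ss = StandardScaler().fit(X_train)
rf_raw = RandomForestRegressor(n_estimators=50, random_state=24).fit(X_train, y_train)
rf_scaled = RandomForestRegressor(n_estimators=50, random_state=24).fit(
    ss.transform(X_train), y_train)

score_raw = rf_raw.score(X_test, y_test)
score_scaled = rf_scaled.score(ss.transform(X_test), y_test)
print(score_raw, score_scaled)
```

The two scores should agree up to small floating-point effects, which is why the helper functions in this section skip the `StandardScaler` step.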

In [75]:
def get_best_hype(model, params, X_train, y_train):  
    # Best Hyperparameters
    rs = RandomizedSearchCV(model, params, n_iter=40)
    
    # fit
    rs.fit(X_train, y_train)
     
    return {'best_score': rs.best_score_,'best_params': rs.best_params_} 

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # fit
    model.fit(X_train, y_train)
    
    # Evaluate: predict
    
    y_pred = model.predict(X_test)
    y_true = y_test
    
    # note: the square root makes this the RMSE, although it is reported under the 'MSE' key
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score
    score = model.score(X_test, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}
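
`RandomizedSearchCV` samples parameter combinations at random, so without a seed on the search itself the reported best parameters can change between runs. A minimal sketch of a seeded variant follows. The name `get_best_hype_seeded`, the `seed` argument, the small grid and the synthetic data are assumptions for illustration; `'auto'` is left out of `max_features` because newer scikit-learn versions no longer accept it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def get_best_hype_seeded(model, params, X_train, y_train, n_iter=5, seed=24):
    """Hypothetical seeded variant of get_best_hype."""
    # random_state fixes which parameter combinations are sampled
    rs = RandomizedSearchCV(model, params, n_iter=n_iter, cv=3, random_state=seed)
    rs.fit(X_train, y_train)
    return {'best_score': rs.best_score_, 'best_params': rs.best_params_}


# small synthetic regression problem
rng = np.random.RandomState(24)
X = rng.normal(size=(120, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

params = {'n_estimators': [10, 20, 30],
          'max_depth': [2, 5],
          'max_features': ('sqrt', 'log2')}

first = get_best_hype_seeded(RandomForestRegressor(random_state=24), params, X, y)
second = get_best_hype_seeded(RandomForestRegressor(random_state=24), params, X, y)
print(first['best_params'] == second['best_params'])
```

With both the estimator and the search seeded, repeated calls sample the same candidates and return the same result.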

**Random Forest #1 - TF-IDF + all coefs**

In [129]:
# Declare indep and dep
X_train16 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train16 = indep_train['Valence']
X_test16 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test16 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train16, y_train16)

{'best_params': {'bootstrap': True,
  'max_depth': 17,
  'max_features': 'auto',
  'n_estimators': 60,
  'verbose': 0},
 'best_score': 0.1783545403294368}

In [130]:
# choose model and use the best hyperparameters (from RandomizedSearchCV in get_best_hype)
model16 = RandomForestRegressor(max_depth=17, max_features='auto', n_estimators=60, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model16, X_train16, X_test16, y_train16, y_test16)

{'MSE': 0.2044690900624566, 'Score (R^2)': 0.18175657114256183}

**Feature importance**

In [131]:
pd.Series(dict(zip(X_train16.columns,model16.feature_importances_))).abs().sort_values(ascending=False).head(15)

Energy              0.209038
Acousticness        0.030350
Instrumentalness    0.014404
baby                0.013349
Subjectivity        0.013298
Tempo               0.012009
Polarity            0.011515
girl                0.008183
oh                  0.007433
che                 0.006749
je                  0.006355
en                  0.006313
ben                 0.005525
like                0.004684
ma                  0.004602
dtype: float64

**Random Forest #2 - TF-IDF + top 7 features**

In [132]:
# define X and y
X_train17 = indep_train_tvec[['Energy', 'Acousticness', 'Instrumentalness', 'baby', 'Subjectivity', 'Tempo', 'Polarity']]
y_train17 = indep_train['Valence']
X_test17 = indep_test_tvec[['Energy', 'Acousticness', 'Instrumentalness', 'baby', 'Subjectivity', 'Tempo', 'Polarity']]
y_test17 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train17, y_train17)

{'best_params': {'bootstrap': True,
  'max_depth': 8,
  'max_features': 'log2',
  'n_estimators': 90,
  'verbose': 0},
 'best_score': 0.13893006618660966}

In [133]:
# choose model and use the best hyperparameters (from RandomizedSearchCV in get_best_hype)
model17 = RandomForestRegressor(max_depth=8, max_features='log2', n_estimators=90, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model17, X_train17, X_test17, y_train17, y_test17)

{'MSE': 0.20767675911863986, 'Score (R^2)': 0.15588232634998844}

**Random Forest #3 - TF-IDF + top 3 features**

In [134]:
# define X and y
X_train18 = indep_train_tvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_train18 = indep_train['Valence']
X_test18 = indep_test_tvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_test18 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train18, y_train18)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'sqrt',
  'n_estimators': 40,
  'verbose': 0},
 'best_score': 0.13509606523375745}

In [136]:
# choose model and use the best hyperparameters (from RandomizedSearchCV in get_best_hype)
model18 = RandomForestRegressor(max_depth=5, max_features='sqrt', n_estimators=40, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model18, X_train18, X_test18, y_train18, y_test18)

{'MSE': 0.20945467380259286, 'Score (R^2)': 0.14136752715830003}

*COMMENT: The best RF is nr 1*

The overall best model was Lasso nr 1, with an R2 score of 0.18231891095186592

**The Lasso regressor on the TF-IDF features with all variables is the best-scoring model in this notebook.**
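
The test-set scores reported in the TF-IDF section can be collected in one table to make the comparison explicit. This is a small sketch that copies the R^2 values printed in the cells above; the short model labels are added here for readability.

```python
import pandas as pd

# test-set R^2 values copied from the TF-IDF cells above
results = pd.DataFrame(
    [('LinReg #1 (all coefs)', -0.1000100529864194),
     ('LinReg #2 (top 3 coefs)', 0.1701597088004173),
     ('LinReg #3 (top 10 coefs)', 0.17230929290631947),
     ('Lasso #1 (all coefs)', 0.18231891095186592),
     ('Lasso #2 (top 10 coefs)', 0.16482342383120618),
     ('Lasso #3 (top 5 coefs)', 0.16853585364051438),
     ('Random Forest #1 (all coefs)', 0.18175657114256183),
     ('Random Forest #2 (top 7 features)', 0.15588232634998844),
     ('Random Forest #3 (top 3 features)', 0.14136752715830003)],
    columns=['model', 'r2'])

# rank the models from best to worst
results = results.sort_values('r2', ascending=False).reset_index(drop=True)
print(results)
```

Lasso #1 and Random Forest #1 end up very close, so the ranking between them should be read with that small gap in mind.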