# Make a Good Model for the mood of the song

In this document I need to do more NLP on the track lyrics, use columns that I discarded in the MVP, and do some feature engineering.

Good model = more NLP on the Track Lyrics<BR />
Better model = also looking at number of streams and position<BR />
Best model = also looking at time on top list<BR />

Once we have the best model, we can use its predictions to determine the mood of a country and the mood of an artist, and then say which artist is suitable for which country.
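As a rough sketch of that last step, the mood of a country or an artist could be estimated by averaging predicted valence over its chart entries. The `Predicted Valence` column and the toy rows below are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for chart entries; 'Predicted Valence' is a hypothetical
# column that would hold the model's predictions.
charts = pd.DataFrame({
    'Country': ['gb', 'gb', 'it', 'it'],
    'Artist': ['A', 'B', 'A', 'C'],
    'Predicted Valence': [0.8, 0.4, 0.6, 0.2],
})

# Mood of a country / artist = mean predicted valence of its chart entries
country_mood = charts.groupby('Country')['Predicted Valence'].mean()
artist_mood = charts.groupby('Artist')['Predicted Valence'].mean()

# For each artist, pick the country whose mood is closest to the artist's mood
best_country = {
    artist: (country_mood - mood).abs().idxmin()
    for artist, mood in artist_mood.items()
}
print(best_country)
```

Matching on the smallest absolute difference is only one possible definition of "suitable"; other distance measures would work the same way.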

## Import stuff

In [140]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import sklearn.metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline

## Load data

In [29]:
data = pd.read_csv('./data_top10c_more_lyrics.csv')

In [30]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,Position,Streams,Track Name,Artist,ID,Date,Year,Month,Day,Country,Region,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,0,177,40381,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,2017-10-05,2017,10,5,gb,eu,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,1,151,24132,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-23,2017,12,23,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
2,2,78,49766,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-24,2017,12,24,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756


## Fix a little bit with the data

**Drop rows that are duplicates and keep only one row for each song**

In [31]:
data_per_song = data.drop_duplicates(subset=['Track Name'], keep='first')

**Drop all columns that might change per song**

In [32]:
nlp_data = data_per_song.drop(['Unnamed: 0', 'Position', 'Streams', 'Date', 'Year', 'Month', 'Day', 'Country', 'Region'], axis=1)

nlp_data.head(5)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
43,Douce Nuit,-M-,4EOJWkvkVDpkZrhC8iTDsI,,0.914,0.227,0.163,1.0,81.887,0.0498
44,Zomersessie,101Barz,3ypzzvHUfgwyqxhL9ym4fH,,0.00818,0.403,2.1e-05,1.0,155.748,0.365
47,Zomersessie (feat. 3robi),101Barz,2re4cLViiQw0NZZx5KUpV8,,0.00818,0.403,2.1e-05,1.0,155.748,0.365
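Deduplicating on `Track Name` alone would also merge different songs that happen to share a title. A minimal sketch on invented toy rows, showing how the result differs when deduplicating on the track `ID` instead:

```python
import pandas as pd

# Invented rows: two chart entries of one song, plus a different song with the same title
toy = pd.DataFrame({
    'Track Name': ['Hello', 'Hello', 'Hello'],
    'Artist': ['Artist A', 'Artist A', 'Artist B'],
    'ID': ['id1', 'id1', 'id2'],
})

by_name = toy.drop_duplicates(subset=['Track Name'], keep='first')
by_id = toy.drop_duplicates(subset=['ID'], keep='first')

print(len(by_name))  # the second artist's song is lost
print(len(by_id))    # one row per distinct track ID
```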


**Look at missing values**

In [33]:
nlp_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6919 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          6919 non-null object
Artist              6919 non-null object
ID                  6919 non-null object
Lyrics              4190 non-null object
Acousticness        6918 non-null float64
Energy              6918 non-null float64
Instrumentalness    6918 non-null float64
Mode                6918 non-null float64
Tempo               6918 non-null float64
Valence             6918 non-null float64
dtypes: float64(6), object(4)
memory usage: 594.6+ KB


**Drop rows that have missing values in the Lyrics column**<BR />
We can use dropna to drop all rows that have missing values (this should mostly be the Lyrics column)

In [34]:
nlp_data_clean = nlp_data.dropna(axis=0, how='any')

nlp_data_clean.head()

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528
50,Vide,13 Block,69RclklKbEelwfQJCBzh0m,"13 Blo' gang, tu sais d'jà comment on opère mo...",0.505,0.682,6e-06,0.0,112.063,0.514
71,10 Dinger,187 Strassenbande,3ruUVcomUKxPlX8srBfMua,"Ich schwör' dir, wenn ich mal Kohle mache, dan...",0.0182,0.673,0.0,1.0,94.54,0.642


In [35]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
dtypes: float64(6), object(4)
memory usage: 360.1+ KB


### TextBlob

**Turn the lyrics in the Lyrics column into strings**

In [36]:
nlp_data_clean['Lyrics'] = nlp_data_clean['Lyrics'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
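The warning above is pandas signalling that `nlp_data_clean` was derived from another frame, so it cannot be sure which object the assignment should affect. One common way to avoid it, sketched here on an invented toy frame, is to take an explicit `.copy()` when the cleaned frame is created:

```python
import pandas as pd

toy = pd.DataFrame({'Lyrics': ['la la', None, 42], 'Valence': [0.5, 0.1, 0.9]})

# .copy() makes the cleaned frame an independent object, so later column
# assignments are unambiguous
toy_clean = toy.dropna(axis=0, how='any').copy()
toy_clean['Lyrics'] = toy_clean['Lyrics'].astype(str)

print(toy_clean['Lyrics'].tolist())
```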


**Make and run function for TextBlob on the Lyrics**

In [37]:
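def sentiment_func(lyrics):
    # Return TextBlob's (polarity, subjectivity) pair, or None if the text cannot be analysed
    try:
        return TextBlob(lyrics).sentiment
    except Exception:
        return None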

nlp_data_clean['pol_sub'] = nlp_data_clean['Lyrics'].apply(sentiment_func)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


**Split the pol_sub column into 2 new columns (Polarity, Subjectivity)**

In [38]:
nlp_data_clean['pol_sub'][0][0]

nlp_data_clean['Polarity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[0])
nlp_data_clean['Subjectivity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[1])

nlp_data_clean.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,pol_sub,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,"(-0.044454619454619454, 0.5908017908017905)",-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,"(0.5831501831501833, 0.6706959706959708)",0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,"(0.1738095238095238, 0.5416666666666666)",0.17381,0.541667
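The two `apply` calls above could also be replaced by expanding the pairs in one step. A small sketch on invented values, assuming every row holds a pair (no `None` results from `sentiment_func`):

```python
import pandas as pd

# Invented (polarity, subjectivity) pairs with a non-contiguous index, as in the real frame
toy = pd.DataFrame({'pol_sub': [(-0.04, 0.59), (0.58, 0.67)]}, index=[0, 48])

# Build both columns at once; passing the original index keeps the rows aligned
parts = pd.DataFrame(toy['pol_sub'].tolist(),
                     columns=['Polarity', 'Subjectivity'],
                     index=toy.index)
toy = toy.join(parts)

print(toy[['Polarity', 'Subjectivity']])
```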


**Drop the pol_sub column**

In [39]:
nlp_data_clean = nlp_data_clean.drop(['pol_sub'], axis=1)

nlp_data_clean.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,0.17381,0.541667
50,Vide,13 Block,69RclklKbEelwfQJCBzh0m,"13 Blo' gang, tu sais d'jà comment on opère mo...",0.505,0.682,6e-06,0.0,112.063,0.514,-0.244444,0.777778
71,10 Dinger,187 Strassenbande,3ruUVcomUKxPlX8srBfMua,"Ich schwör' dir, wenn ich mal Kohle mache, dan...",0.0182,0.673,0.0,1.0,94.54,0.642,0.119167,0.555833


### Do a quick check of the entire data frame

In [40]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 12 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
Polarity            4190 non-null float64
Subjectivity        4190 non-null float64
dtypes: float64(8), object(4)
memory usage: 585.5+ KB


In [41]:
nlp_data_clean.describe()

Unnamed: 0,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
count,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0
mean,0.238886,0.660887,0.009918,0.54105,120.254388,0.48252,0.071806,0.453405
std,0.234458,0.165181,0.065771,0.498372,26.560422,0.221104,0.225631,0.228036
min,3e-06,0.0279,0.0,0.0,54.082,0.0371,-1.0,0.0
25%,0.051525,0.562,0.0,0.0,99.98425,0.31,-0.033272,0.35
50%,0.159,0.676,0.0,1.0,120.004,0.473,0.046612,0.4875
75%,0.368,0.78475,3.8e-05,1.0,136.04475,0.654,0.189943,0.591449
max,0.988,0.995,0.89,1.0,232.69,0.982,1.0,1.0


## Train/Test-split

Divide the data into a train and a test set (with a test set of 25%, which is also default)

In [105]:
dep   = nlp_data_clean['Valence']
indep = nlp_data_clean

In [106]:
indep_train, indep_test, dep_train, dep_test = train_test_split(indep, dep, test_size = 0.25, random_state=24)

## NLP

### CountVectorizer (use in model 1)

Since music is an art form, like poetry, I might consider keeping the stop words. (Maybe in a later try?!)

In [107]:
# instantiate the model
cvec = CountVectorizer(stop_words='english', max_features=1000)
# remove English stop words and cap the vocabulary with max_features, since there would be more than 60,000 features otherwise

In [108]:
# fit the count vectorizer with training data. 
cvec.fit(indep_train['Lyrics'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
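To make the effect of `stop_words` and `max_features` concrete, here is a minimal sketch on three invented lyric snippets. The names `toy_lyrics` and `toy_cvec` are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_lyrics = ['bye bye baby', 'baby come back', 'the night is young']

# max_features keeps only the most frequent terms across the whole corpus
toy_cvec = CountVectorizer(stop_words='english', max_features=2)
counts = toy_cvec.fit_transform(toy_lyrics)

# vocabulary_ maps each kept term to its column index
print(toy_cvec.vocabulary_)
# one row per document, one column per kept term
print(counts.toarray())
```

With only two features kept, the third snippet becomes an all-zero row, which is also what happens to songs whose words fall outside the 1,000 kept terms.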

In [109]:
# transform X_train
cvec_data = cvec.transform(indep_train['Lyrics'])

In [110]:
# Turn the features into a data frame
df  = pd.DataFrame(cvec_data.todense(),columns=cvec.get_feature_names())

df.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,4,0,0,0,0,0,0


In [111]:
len(indep_train)

3142

In [112]:
len(df)

3142

In [113]:
# Concat with big data frame and use for fitting the model
indep_train_cvec = pd.concat([indep_train.reset_index(drop=True), df], axis=1)

indep_train_cvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,Atemlos durch die Nacht,Helene Fischer,5fPGpdC4tmcVMmTuJV2HRg,Wir ziehen durch die Straßen und die Clubs die...,0.049,0.73,2e-06,1.0,128.041,0.866,...,0,0,0,0,3,0,0,0,0,0
1,La vita liquida,Brunori Sas,7ctWJ718cdHHqINJlRTuxF,Liquido è il mio corpo che si piega ad ogni co...,0.558,0.621,2.6e-05,0.0,88.065,0.585,...,0,0,0,0,0,0,0,0,0,0
2,Zum ersten Mal Nintendo,Philipp Poisel,2UcgmsztMXVyPo3VgqD5Bu,wie oft wollt' ich weg von hier? anders als di...,0.543,0.481,0.000299,1.0,98.983,0.503,...,0,0,1,4,0,0,0,0,0,0


In [114]:
len(indep_train_cvec)

3142

In [None]:
######################

In [115]:
# transform X_test
cvec_data2 = cvec.transform(indep_test['Lyrics'])

In [116]:
# Turn the features into a data frame
df2  = pd.DataFrame(cvec_data2.todense(),columns=cvec.get_feature_names())

df2.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [117]:
len(indep_test)

1048

In [118]:
len(df2)

1048

In [119]:
# Concat with big data frame and use for scoring the model
indep_test_cvec = pd.concat([indep_test.reset_index(drop=True), df2], axis=1)

indep_test_cvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,King Of The North,Bugzy Malone,4DixkYsZqImKOmjaIaYnCi,King! King! King! King! King! King!\nI'm King ...,0.0778,0.772,4.4e-05,0.0,139.886,0.376,...,0,0,0,0,0,0,0,0,0,0
1,Back On,Gucci Mane,0KA5Cc68h9qitLwTadHBpa,Zaytoven\nHah\nWop\nYeah\nIt's Gucci\nZay\nZig...,0.0087,0.639,4e-06,1.0,156.055,0.427,...,0,0,0,0,0,0,0,0,0,0
2,Magazine,Dark Polo Gang,71MGHgauMD6aapixtV6Chd,"Hey, hey\nSick Luke, Sick Luke\n\nLa mia facci...",0.396,0.437,0.0,1.0,136.111,0.525,...,0,0,0,0,0,0,0,0,0,0


In [120]:
len(indep_test_cvec)

1048
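The fit, transform, to-DataFrame and concat steps above have to be repeated identically for train and test. As an alternative sketch (not what this notebook does), scikit-learn's `ColumnTransformer` and `Pipeline` can bundle them so that the vocabulary is always learned from the training data only. The toy frames below are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

toy_train = pd.DataFrame({
    'Lyrics': ['bye bye baby', 'happy happy day', 'sad lonely night', 'baby come home'],
    'Energy': [0.9, 0.8, 0.2, 0.6],
    'Valence': [0.8, 0.9, 0.1, 0.5],
})
toy_test = pd.DataFrame({'Lyrics': ['happy baby'], 'Energy': [0.7]})

features = ColumnTransformer([
    # a text column is selected with a plain string, not a list
    ('lyrics', CountVectorizer(max_features=1000), 'Lyrics'),
    # numeric audio features are passed through unchanged
    ('audio', 'passthrough', ['Energy']),
])
pipe = Pipeline([('features', features), ('model', LinearRegression())])

pipe.fit(toy_train[['Lyrics', 'Energy']], toy_train['Valence'])
pred = pipe.predict(toy_test)
print(pred.shape)
```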

*If there is time in the future, consider stemming or lemmatization*

### TF-IDF (use in model 2)

In [121]:
# instantiate the model
tvec = TfidfVectorizer(stop_words='english', max_features=1000)
# remove English stop words and cap the vocabulary with max_features, since there would be more than 60,000 features otherwise

In [122]:
# fit the TF-IDF vectorizer with training data.
tvec.fit(indep_train['Lyrics'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
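With the settings shown above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), each raw term count is multiplied by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, and each row is then scaled to unit length. A small sketch on two invented documents that reproduces the first row by hand:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['bye bye baby', 'baby come back']

tv = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
tfidf = tv.fit_transform(docs).toarray()

n_docs = len(docs)
vocab = tv.vocabulary_                      # term -> column index

# raw term counts for the first document
tf = np.zeros(len(vocab))
for word in docs[0].split():
    tf[vocab[word]] += 1

# number of documents containing each term
df = np.zeros(len(vocab))
for doc in docs:
    for word in set(doc.split()):
        df[vocab[word]] += 1

idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
manual = tf * idf
manual = manual / np.linalg.norm(manual)    # l2 normalisation

print(np.allclose(manual, tfidf[0]))
```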

In [123]:
# transform X_train
tvec_data3 = tvec.transform(indep_train['Lyrics'])

In [124]:
# Turn the features into a data frame
df3  = pd.DataFrame(tvec_data3.todense(), columns=tvec.get_feature_names())

df3.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.137169,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.031071,0.156462,0.0,0.0,0.0,0.0,0.0,0.0


In [125]:
len(indep_train)

3142

In [126]:
len(df3)

3142

In [127]:
# Concat with big data frame and use for fitting the model
indep_train_tvec = pd.concat([indep_train.reset_index(drop=True), df3], axis=1)

indep_train_tvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,Atemlos durch die Nacht,Helene Fischer,5fPGpdC4tmcVMmTuJV2HRg,Wir ziehen durch die Straßen und die Clubs die...,0.049,0.73,2e-06,1.0,128.041,0.866,...,0.0,0.0,0.0,0.0,0.137169,0.0,0.0,0.0,0.0,0.0
1,La vita liquida,Brunori Sas,7ctWJ718cdHHqINJlRTuxF,Liquido è il mio corpo che si piega ad ogni co...,0.558,0.621,2.6e-05,0.0,88.065,0.585,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Zum ersten Mal Nintendo,Philipp Poisel,2UcgmsztMXVyPo3VgqD5Bu,wie oft wollt' ich weg von hier? anders als di...,0.543,0.481,0.000299,1.0,98.983,0.503,...,0.0,0.0,0.031071,0.156462,0.0,0.0,0.0,0.0,0.0,0.0


In [128]:
len(indep_train_tvec)

3142

In [None]:
##########################

In [129]:
# transform X_test
tvec_data4 = tvec.transform(indep_test['Lyrics'])

In [130]:
# Turn the features into a data frame
df4  = pd.DataFrame(tvec_data4.todense(), columns=tvec.get_feature_names())

df4.head(3)

Unnamed: 0,aan,ab,aber,act,adesso,ah,ahh,ai,aime,ain,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130963,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.189735,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [131]:
len(indep_test)

1048

In [132]:
len(df4)

1048

In [133]:
# Concat with big data frame and use for scoring the model
indep_test_tvec = pd.concat([indep_test.reset_index(drop=True), df4], axis=1)

indep_test_tvec.head(3) 

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zit,zo,zu,zum,zwei,ça,équipe,étais,était,être
0,King Of The North,Bugzy Malone,4DixkYsZqImKOmjaIaYnCi,King! King! King! King! King! King!\nI'm King ...,0.0778,0.772,4.4e-05,0.0,139.886,0.376,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Back On,Gucci Mane,0KA5Cc68h9qitLwTadHBpa,Zaytoven\nHah\nWop\nYeah\nIt's Gucci\nZay\nZig...,0.0087,0.639,4e-06,1.0,156.055,0.427,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Magazine,Dark Polo Gang,71MGHgauMD6aapixtV6Chd,"Hey, hey\nSick Luke, Sick Luke\n\nLa mia facci...",0.396,0.437,0.0,1.0,136.111,0.525,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [134]:
len(indep_test_tvec)

1048

## Models (LinReg and RF)

### Linear Regression

In [135]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # standardize the predictors (the scaler is fit on the training data only)
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
    X_test_s = ss.transform(X_test)
    
    # fit
    model.fit(X_train_s, y_train)
    
    # Evaluate: predict on the test set
    y_pred = model.predict(X_test_s)
    y_true = y_test
    
    # note: the square root makes this the RMSE, although the key below says 'MSE'
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: cross-validated R^2, computed on the test set
    score = cross_val_score(model, X_test_s, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}

**LinReg #1 - CountVec + all coefs**

In [136]:
# define X and y
X_train = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train = indep_train['Valence'] 
X_test = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test = indep_test['Valence']

# choose model
model = LinearRegression()

# call function
evaluate_model(model, X_train, X_test, y_train, y_test)

{'MSE': 0.24766785367015642, 'Score (R^2)': -19.29597933450187}
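An R^2 far below zero, as in the result above, means the model predicts worse than simply guessing the mean Valence for every song. The sketch below computes both metrics by hand on invented numbers to show how R^2 turns negative; `rmse` and `r2` are illustrative helper names.

```python
import numpy as np

y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_bad = np.array([0.9, 0.1, 0.9, 0.1])    # invented, deliberately poor predictions

def rmse(y, y_hat):
    # root of the mean squared error, in the same units as y
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)    # error of always predicting the mean
    return 1 - ss_res / ss_tot

mean_baseline = np.full_like(y_true, y_true.mean())

print(rmse(y_true, y_bad))
print(r2(y_true, y_bad))           # negative: worse than the mean baseline
print(r2(y_true, mean_baseline))   # the mean baseline itself scores zero
```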

**Importance of the coefficients**

In [137]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train.columns,model.coef_))).abs().sort_values(ascending=False).head(10)

Energy          0.091617
Acousticness    0.029776
und             0.022604
von             0.017408
nu              0.015495
baby            0.015344
nem             0.014615
sich            0.014229
das             0.013787
avant           0.013515
dtype: float64

**LinReg #2 - CountVec + top 3 coefs**

In [138]:
# define X and y
X_train2 = indep_train_cvec[['Energy', 'Acousticness', 'und']]
y_train2 = indep_train['Valence']
X_test2 = indep_test_cvec[['Energy', 'Acousticness', 'und']]
y_test2 = indep_test['Valence']

# choose model
model2 = LinearRegression()

# call function
evaluate_model(model2, X_train2, X_test2, y_train2, y_test2)

{'MSE': 0.20734850323633328, 'Score (R^2)': 0.14718737897447168}

**LinReg #3 - CountVec + top 10 coefs**

In [139]:
# define X and y
X_train3 = indep_train_cvec[['Energy', 'Acousticness', 'und', 'von', 'nu', 'baby', 'nem', 'sich', 'das', 'avant']]
y_train3 = indep_train['Valence']
X_test3 = indep_test_cvec[['Energy', 'Acousticness', 'und', 'von', 'nu', 'baby', 'nem', 'sich', 'das', 'avant']]
y_test3 = indep_test['Valence']

# choose model
model3 = LinearRegression()

# call function
evaluate_model(model3, X_train3, X_test3, y_train3, y_test3)

{'MSE': 0.20627792538890397, 'Score (R^2)': 0.1484921420481692}

This is a very inefficient way to find the best number of independent variables. I will run a Lasso model instead to help select the variables.<BR />
An alternative would have been one of the scikit-learn feature selectors: SelectKBest (keeps the k highest-scoring features, where I choose k) or RFE (recursively eliminates the weakest features).
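For reference, a minimal sketch of the two selectors mentioned above, run on synthetic data where only the first two of six features matter. The data, `k=2` and the choice of `f_regression` as scoring function are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of six features drive the target
rng = np.random.RandomState(24)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# SelectKBest: score each feature on its own and keep the k best
skb = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(skb.get_support(indices=True))

# RFE: repeatedly fit the model and drop the weakest feature
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(rfe.get_support(indices=True))
```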

### Lasso Regressor

In [None]:
# define X and y
X_train4 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train4 = indep_train['Valence']
X_test4 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test4 = indep_test['Valence']

In [None]:
# Gridsearch (for the random forest)
def get_best_hype(model, grid, X_train, y_train):
    # search the grid for the best hyperparameters
    gs = GridSearchCV(model, grid)
    gs.fit(X_train, y_train)
    return {'best_score': gs.best_score_, 'best_params': gs.best_params_}

# 'auto' is left out: for a regressor it means the same as None (all features)
grid = {'n_estimators': np.arange(1, 10),
        'max_depth': np.arange(1, 10),
        'max_features': ('sqrt', 'log2', None),
        'bootstrap': (True, False)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, grid, X_train4, y_train4)



In [None]:
# Lasso regression: the L1 penalty shrinks uninformative coefficients to exactly zero
lassoreg = Lasso(alpha=0.001, normalize=True)
lassoreg.fit(X_train4, y_train4)
print(lassoreg.coef_)
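The point of printing the coefficients is to see which ones Lasso has set to zero; the remaining features are the ones worth keeping. A minimal sketch on synthetic data, where the data, the `alpha` value and the explicit `StandardScaler` step are assumptions of this example rather than the notebook's settings:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of six features drive the target
rng = np.random.RandomState(24)
X = rng.normal(size=(200, 6))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.05, size=200)

# scale explicitly so the penalty treats all features alike
X_s = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.05).fit(X_s, y)

# indices of the features whose coefficient survived the L1 penalty
kept = np.flatnonzero(lasso.coef_)
print(lasso.coef_.round(3))
print(kept)
```

On the real data, the same `np.flatnonzero` call on the fitted coefficients, combined with the column names of the feature frame, would give the list of selected words and audio features.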

In [None]:
# standardize, fit and score the Lasso model with the function defined above
evaluate_model(Lasso(alpha=0.001), X_train4, X_test4, y_train4, y_test4)