# Make a Good Model for the mood of the song

In this document I need to work more with NLP of the track lyrics and use columns that I discarded in the MVP.

Good model = more NLP on the Track Lyrics<BR />
Better model = countries <BR />
Best model = also looking at number of streams and position, also looking at time on top list<BR />

Once we have the best model, we can use its predictions to estimate the mood of a country and the mood of an artist. Then we can say which artist suits which country.

## Import stuff

In [1]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import sklearn.metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline



## Load data

In [2]:
data = pd.read_csv('./data_top10c_more_lyrics.csv')

In [3]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,Position,Streams,Track Name,Artist,ID,Date,Year,Month,Day,Country,Region,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,0,177,40381,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,2017-10-05,2017,10,5,gb,eu,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,1,151,24132,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-23,2017,12,23,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
2,2,78,49766,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-24,2017,12,24,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756


## Clean up the data a little

**Drop rows that are duplicates and keep only one row for each song**

In [4]:
data_per_song = data.drop_duplicates(subset=['Track Name'], keep='first')
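Deduplicating on `Track Name` treats two different songs that happen to share a title as one song. A minimal sketch on made-up rows (the toy frame and all its values are assumptions, not taken from the dataset) shows how deduplicating on the `ID` column instead would keep both:

```python
import pandas as pd

# Toy chart rows (hypothetical values): two different songs share the title "Hello"
toy = pd.DataFrame({
    'Track Name': ['Hello', 'Hello', 'Hello'],
    'Artist':     ['Artist A', 'Artist A', 'Artist B'],
    'ID':         ['id_1', 'id_1', 'id_2'],
    'Streams':    [100, 120, 90],
})

by_name = toy.drop_duplicates(subset=['Track Name'], keep='first')
by_id = toy.drop_duplicates(subset=['ID'], keep='first')

print(len(by_name))  # rows left when deduplicating on the title
print(len(by_id))    # rows left when deduplicating on the track ID
```

Whether this matters for the real data depends on how many distinct tracks share a title, which is not checked here.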

**Drop all columns that can vary between chart entries of the same song**

In [5]:
nlp_data = data_per_song.drop(['Unnamed: 0', 'Position', 'Streams', 'Date', 'Year', 'Month', 'Day', 'Country',
                               'Region'], axis=1)

nlp_data.head(5)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
43,Douce Nuit,-M-,4EOJWkvkVDpkZrhC8iTDsI,,0.914,0.227,0.163,1.0,81.887,0.0498
44,Zomersessie,101Barz,3ypzzvHUfgwyqxhL9ym4fH,,0.00818,0.403,2.1e-05,1.0,155.748,0.365
47,Zomersessie (feat. 3robi),101Barz,2re4cLViiQw0NZZx5KUpV8,,0.00818,0.403,2.1e-05,1.0,155.748,0.365


**Look at missing values**

In [6]:
nlp_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6919 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          6919 non-null object
Artist              6919 non-null object
ID                  6919 non-null object
Lyrics              4190 non-null object
Acousticness        6918 non-null float64
Energy              6918 non-null float64
Instrumentalness    6918 non-null float64
Mode                6918 non-null float64
Tempo               6918 non-null float64
Valence             6918 non-null float64
dtypes: float64(6), object(4)
memory usage: 594.6+ KB


**Drop rows that have missing values in the Lyrics column**<BR />
We can use dropna to drop all rows that have missing values (this should mostly be the Lyrics column)

In [7]:
nlp_data_clean = nlp_data.dropna(axis=0, how='any')

nlp_data_clean.head()

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528
50,Vide,13 Block,69RclklKbEelwfQJCBzh0m,"13 Blo' gang, tu sais d'jà comment on opère mo...",0.505,0.682,6e-06,0.0,112.063,0.514
71,10 Dinger,187 Strassenbande,3ruUVcomUKxPlX8srBfMua,"Ich schwör' dir, wenn ich mal Kohle mache, dan...",0.0182,0.673,0.0,1.0,94.54,0.642


In [8]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
dtypes: float64(6), object(4)
memory usage: 360.1+ KB


### TextBlob

**Turn the lyrics in the Lyrics column into strings**

In [9]:
nlp_data_clean['Lyrics'] = nlp_data_clean['Lyrics'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
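
The warning above appears because `nlp_data_clean` was derived from another data frame, and pandas cannot tell whether the assignment is meant to affect the original. One common way to avoid it, sketched here on a toy frame (the frame and its values are made up), is to take an explicit `.copy()` right after filtering:

```python
import pandas as pd

toy = pd.DataFrame({'Lyrics': ['la la la', None, 'hey hey'],
                    'Valence': [0.8, 0.3, 0.6]})

# .copy() makes the filtered frame independent of `toy`,
# so later column assignments are unambiguous
toy_clean = toy.dropna(axis=0, how='any').copy()
toy_clean['Lyrics'] = toy_clean['Lyrics'].astype(str)

print(toy_clean.shape)
```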


**Make and run function for TextBlob on the Lyrics**

In [10]:
def sentiment_func(lyrics):
    try:
        return TextBlob(lyrics).sentiment
    except Exception:
        return None

nlp_data_clean['pol_sub'] = nlp_data_clean['Lyrics'].apply(sentiment_func)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


**Split the pol_sub column into 2 new columns (Polarity, Subjectivity)**

In [11]:
nlp_data_clean['pol_sub'][0][0]

nlp_data_clean['Polarity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[0])
nlp_data_clean['Subjectivity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[1])

nlp_data_clean.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,pol_sub,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,"(-0.044454619454619454, 0.5908017908017905)",-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,"(0.5831501831501833, 0.6706959706959708)",0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,"(0.1738095238095238, 0.5416666666666666)",0.17381,0.541667
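
The tuple column can also be split in a single step. A small sketch, assuming a column of `(polarity, subjectivity)` pairs like the one produced above (the numbers below are made up):

```python
import pandas as pd

toy = pd.DataFrame({'pol_sub': [(-0.04, 0.59), (0.58, 0.67), (0.17, 0.54)]})

# Build a two-column frame from the list of tuples, keeping the original index
split = pd.DataFrame(toy['pol_sub'].tolist(),
                     columns=['Polarity', 'Subjectivity'],
                     index=toy.index)

# Attach the new columns and drop the tuple column
toy = toy.drop(columns=['pol_sub']).join(split)
print(toy)
```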


**Drop the pol_sub column**

In [12]:
nlp_data_clean = nlp_data_clean.drop(['pol_sub'], axis=1)

nlp_data_clean.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,0.17381,0.541667


### Do a quick check of the entire data frame

In [13]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 12 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
Polarity            4190 non-null float64
Subjectivity        4190 non-null float64
dtypes: float64(8), object(4)
memory usage: 585.5+ KB


In [14]:
nlp_data_clean.describe()

Unnamed: 0,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
count,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0
mean,0.238886,0.660887,0.009918,0.54105,120.254388,0.48252,0.071806,0.453405
std,0.234458,0.165181,0.065771,0.498372,26.560422,0.221104,0.225631,0.228036
min,3e-06,0.0279,0.0,0.0,54.082,0.0371,-1.0,0.0
25%,0.051525,0.562,0.0,0.0,99.98425,0.31,-0.033272,0.35
50%,0.159,0.676,0.0,1.0,120.004,0.473,0.046612,0.4875
75%,0.368,0.78475,3.8e-05,1.0,136.04475,0.654,0.189943,0.591449
max,0.988,0.995,0.89,1.0,232.69,0.982,1.0,1.0


## Train/Test-split

Divide the data into a train and a test set (with a test set of 25%, which is also default)

In [15]:
dep   = nlp_data_clean['Valence']
indep = nlp_data_clean

In [16]:
indep_train, indep_test, dep_train, dep_test = train_test_split(indep, dep, test_size = 0.25, random_state=24)
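
As a sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to a stand-in array with as many elements as the cleaned data frame has rows (4190). The stand-in array is an assumption for illustration; only the lengths matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(4190)  # stand-in for the 4190 rows that have lyrics

train_rows, test_rows = train_test_split(rows, test_size=0.25, random_state=24)

# The test share is rounded up and the train share is rounded down
print(len(train_rows), len(test_rows))
```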

## NLP

### Make list with more stopwords (bring in other languages)

French, German and Spanish stopwords are taken from https://www.ranks.nl/stopwords

In [156]:
ENGLISH_STOP_WORDS = [
    'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against',
    'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always',
    'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another',
    'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are',
    'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become',
    'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
    'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both',
    'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con',
    'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done',
    'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else',
    'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone',
    'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill',
    'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty',
    'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go',
    'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter',
    'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his',
    'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed',
    'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter',
    'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me',
    'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly',
    'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither',
    'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone',
    'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on',
    'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our',
    'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps',
    'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed',
    'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side',
    'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone',
    'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such',
    'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them',
    'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
    'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin',
    'third', 'this', 'those', 'though', 'three', 'through', 'throughout',
    'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards',
    'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us',
    'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when',
    'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby',
    'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither',
    'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with',
    'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
    'yourselves']

In [157]:
# French
with open('french') as file:
    lines = file.readlines()
FRENCH_STOP_WORDS = []
for line in lines:
    FRENCH_STOP_WORDS += [line]

In [158]:
# remove the line break (\n) from each row in FRENCH_STOP_WORDS
FRENCH_STOP_WORDS = [s.replace('\n', '') for s in FRENCH_STOP_WORDS]

In [159]:
# German
with open('german') as file:
    lines = file.readlines()
GERMAN_STOP_WORDS = []
for line in lines:
    GERMAN_STOP_WORDS += [line]

In [160]:
# remove the line break (\n) from each row in GERMAN_STOP_WORDS
GERMAN_STOP_WORDS = [s.replace('\n', '') for s in GERMAN_STOP_WORDS]

In [161]:
# Spanish
with open('spanish') as file:
    lines = file.readlines()
SPANISH_STOP_WORDS = []
for line in lines:
    SPANISH_STOP_WORDS += [line]

In [162]:
# remove the line break (\n) from each row in SPANISH_STOP_WORDS
SPANISH_STOP_WORDS = [s.replace('\n', '') for s in SPANISH_STOP_WORDS]

In [163]:
# Put all 4 stop-word lists into one list
STOP_WORDS = ENGLISH_STOP_WORDS + FRENCH_STOP_WORDS + GERMAN_STOP_WORDS + SPANISH_STOP_WORDS
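
The three file-reading cells above follow the same pattern, so they could be folded into one helper. A minimal sketch, where `load_stop_words` is a hypothetical name and UTF-8 encoding of the stop-word files is an assumption:

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line, stripping whitespace and skipping blank lines."""
    with open(path, encoding='utf-8') as file:
        return [line.strip() for line in file if line.strip()]

# Demo with a temporary file standing in for the 'french' file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'french')
    with open(demo_path, 'w', encoding='utf-8') as file:
        file.write('alors\nau\n\nété\n')
    demo_words = load_stop_words(demo_path)

print(demo_words)
```

With such a helper, each language list would be a single call, for example `load_stop_words('german')`.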

In [164]:
# look at the list of the stop words for the 4 different languages
STOP_WORDS

['a',
 'about',
 'above',
 'across',
 'after',
 ...]

In [253]:
# Turn the list into a df to be able to save it as csv
stop_words = pd.DataFrame(STOP_WORDS, columns=["column"])
stop_words.to_csv('stop_words.csv', index=False)
# when you read in the csv, you will have to turn it into a list again...
# you can use df['column'].tolist()
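
A quick round-trip sketch for saving and restoring a word list. It uses an in-memory buffer instead of a file, and the column name `stop_word` and the toy list are assumptions. One detail worth knowing: by default `pd.read_csv` turns certain strings (such as `null`) into NaN, which would silently corrupt a word list, so the sketch passes `keep_default_na=False`.

```python
import io
import pandas as pd

words = ['a', 'about', 'null', 'été']  # toy list; 'null' is included on purpose

buffer = io.StringIO()
pd.DataFrame(words, columns=['stop_word']).to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False stops pandas from turning strings such as 'null' into NaN
restored = pd.read_csv(buffer, keep_default_na=False)['stop_word'].tolist()
print(restored == words)
```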

### CountVectorizer (use in model 1)

In [165]:
# instantiate the model
cvec = CountVectorizer(stop_words = STOP_WORDS, max_features = 1000) 
# eliminate stop words (the ones in the list) and use max_features, since there will be more than 60,000 features if I do not

In [166]:
# fit the count vectorizer with training data. 
cvec.fit(indep_train['Lyrics'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'ar...s', 'trabajais', 'trabajan', 'podria', 'podrias', 'podriamos', 'podrian', 'podriais', 'yo', 'aquel'],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [167]:
# transform X_train
cvec_data = cvec.transform(indep_train['Lyrics'])

In [168]:
# Turn the features into a data frame
df  = pd.DataFrame(cvec_data.todense(),columns=cvec.get_feature_names())

df.head(3)

Unnamed: 0,10,100,aan,ab,act,ad,adesso,ah,ahh,ai,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,3,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
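
To make the fit/transform steps concrete, here is a toy sketch with made-up documents and a two-word stop list. It shows that the vocabulary is learned from the training documents only, so words that appear only in the test documents are ignored. The column names are read from `vocabulary_` to keep the sketch independent of the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['baby baby love', 'the night is long']
test_docs = ['love the night unseen']

toy_cvec = CountVectorizer(stop_words=['the', 'is'], max_features=1000)
toy_cvec.fit(train_docs)  # vocabulary comes from the training docs only

# Column order follows the indices stored in vocabulary_
columns = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
train_counts = toy_cvec.transform(train_docs).toarray()
test_counts = toy_cvec.transform(test_docs).toarray()

print(columns)
print(train_counts)
print(test_counts)  # 'unseen' is not in the vocabulary, so it is ignored
```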


In [169]:
len(indep_train)

3142

In [170]:
len(df)

3142

In [171]:
# Concat with big data frame and use for fitting the model
indep_train_cvec = pd.concat([indep_train.reset_index(drop=True), df], axis=1)

indep_train_cvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,Atemlos durch die Nacht,Helene Fischer,5fPGpdC4tmcVMmTuJV2HRg,Wir ziehen durch die Straßen und die Clubs die...,0.049,0.73,2e-06,1.0,128.041,0.866,...,0,0,0,0,0,3,0,0,0,0
1,La vita liquida,Brunori Sas,7ctWJ718cdHHqINJlRTuxF,Liquido è il mio corpo che si piega ad ogni co...,0.558,0.621,2.6e-05,0.0,88.065,0.585,...,0,0,0,0,0,0,0,0,0,0
2,Zum ersten Mal Nintendo,Philipp Poisel,2UcgmsztMXVyPo3VgqD5Bu,wie oft wollt' ich weg von hier? anders als di...,0.543,0.481,0.000299,1.0,98.983,0.503,...,0,0,0,0,0,0,0,0,0,0


In [172]:
len(indep_train_cvec)

3142
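
The `reset_index(drop=True)` in the concat matters because `pd.concat` with `axis=1` aligns rows on the index, not on position. A toy sketch (made-up values) of what happens with and without it:

```python
import pandas as pd

# Left frame keeps index labels from the original data; right frame starts at 0
left = pd.DataFrame({'Energy': [0.9, 0.5]}, index=[48, 71])
right = pd.DataFrame({'baby': [2, 0]})

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape)  # extra rows filled with NaN
print(aligned.shape)     # one row per song, no NaN
```

Comparing the lengths before and after the concat, as done above, is a cheap way to catch this kind of misalignment.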

In [25]:
######################

In [173]:
# transform X_test
cvec_data2 = cvec.transform(indep_test['Lyrics'])

In [174]:
# Turn the features into a data frame
df2  = pd.DataFrame(cvec_data2.todense(),columns=cvec.get_feature_names())

df2.head(3)

Unnamed: 0,10,100,aan,ab,act,ad,adesso,ah,ahh,ai,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [175]:
len(indep_test)

1048

In [176]:
len(df2)

1048

In [177]:
# Concat with big data frame and use for scoring the model
indep_test_cvec = pd.concat([indep_test.reset_index(drop=True), df2], axis=1)

indep_test_cvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,King Of The North,Bugzy Malone,4DixkYsZqImKOmjaIaYnCi,King! King! King! King! King! King!\nI'm King ...,0.0778,0.772,4.4e-05,0.0,139.886,0.376,...,0,0,0,0,0,0,0,0,0,0
1,Back On,Gucci Mane,0KA5Cc68h9qitLwTadHBpa,Zaytoven\nHah\nWop\nYeah\nIt's Gucci\nZay\nZig...,0.0087,0.639,4e-06,1.0,156.055,0.427,...,0,0,0,0,0,0,0,0,0,0
2,Magazine,Dark Polo Gang,71MGHgauMD6aapixtV6Chd,"Hey, hey\nSick Luke, Sick Luke\n\nLa mia facci...",0.396,0.437,0.0,1.0,136.111,0.525,...,0,0,0,0,0,0,0,0,0,0


In [178]:
len(indep_test_cvec)

1048

*If there is time in the future, consider stemming or lemmatization*

### TF-IDF (use in model 2)

In [179]:
# instantiate the model
tvec = TfidfVectorizer(stop_words = STOP_WORDS, max_features = 1000) 
# eliminate stop words (the ones in the list) and use max_features, since there will be more than 60,000 features if I do not

In [180]:
# fit the TF-IDF vectorizer with training data.
tvec.fit(indep_train['Lyrics'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'ar...s', 'trabajais', 'trabajan', 'podria', 'podrias', 'podriamos', 'podrian', 'podriais', 'yo', 'aquel'],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [181]:
# transform X_train
tvec_data3 = tvec.transform(indep_train['Lyrics'])

In [182]:
# Turn the features into a data frame
df3  = pd.DataFrame(tvec_data3.todense(), columns=tvec.get_feature_names())

df3.head(3)

Unnamed: 0,10,100,aan,ab,act,ad,adesso,ah,ahh,ai,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.19568,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.118221,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
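
Unlike raw counts, the TF-IDF weights are scaled per document. A toy sketch with made-up documents, assuming the default `norm='l2'` shown in the fitted vectorizer above: every row has Euclidean length 1, and a word that is frequent in one document but rare across documents gets the larger weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['baby baby love', 'night love long', 'night night night']

toy_tvec = TfidfVectorizer()  # default norm='l2'
weights = toy_tvec.fit_transform(docs).toarray()

# Each row that contains at least one known term has Euclidean length 1
row_norms = np.linalg.norm(weights, axis=1)
print(row_norms)

# In the first document 'baby' (twice, and only in this document) outweighs 'love'
baby_col = toy_tvec.vocabulary_['baby']
love_col = toy_tvec.vocabulary_['love']
print(weights[0, baby_col] > weights[0, love_col])
```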


In [183]:
len(indep_train)

3142

In [184]:
len(df3)

3142

In [185]:
# Concat with big data frame and use for fitting the model
indep_train_tvec = pd.concat([indep_train.reset_index(drop=True), df3], axis=1)

indep_train_tvec.head(3)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,Atemlos durch die Nacht,Helene Fischer,5fPGpdC4tmcVMmTuJV2HRg,Wir ziehen durch die Straßen und die Clubs die...,0.049,0.73,2e-06,1.0,128.041,0.866,...,0.0,0.0,0.0,0.0,0.0,0.19568,0.0,0.0,0.0,0.0
1,La vita liquida,Brunori Sas,7ctWJ718cdHHqINJlRTuxF,Liquido è il mio corpo che si piega ad ogni co...,0.558,0.621,2.6e-05,0.0,88.065,0.585,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Zum ersten Mal Nintendo,Philipp Poisel,2UcgmsztMXVyPo3VgqD5Bu,wie oft wollt' ich weg von hier? anders als di...,0.543,0.481,0.000299,1.0,98.983,0.503,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [186]:
len(indep_train_tvec)

3142

In [187]:
##########################

In [188]:
# transform X_test
tvec_data4 = tvec.transform(indep_test['Lyrics'])

In [189]:
# Turn the features into a data frame
df4  = pd.DataFrame(tvec_data4.todense(),columns=tvec.get_feature_names())

df4.head(3)

Unnamed: 0,10,100,aan,ab,act,ad,adesso,ah,ahh,ai,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,0.039411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.214023,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [190]:
len(indep_test)

1048

In [191]:
len(df4)

1048

In [192]:
# Concat with big data frame and use for scoring the model
indep_test_tvec = pd.concat([indep_test.reset_index(drop=True), df4], axis=1)

indep_test_tvec.head(3) 

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,...,zij,zijn,zit,zo,zonder,zwei,écoute,équipe,étais,était
0,King Of The North,Bugzy Malone,4DixkYsZqImKOmjaIaYnCi,King! King! King! King! King! King!\nI'm King ...,0.0778,0.772,4.4e-05,0.0,139.886,0.376,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Back On,Gucci Mane,0KA5Cc68h9qitLwTadHBpa,Zaytoven\nHah\nWop\nYeah\nIt's Gucci\nZay\nZig...,0.0087,0.639,4e-06,1.0,156.055,0.427,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Magazine,Dark Polo Gang,71MGHgauMD6aapixtV6Chd,"Hey, hey\nSick Luke, Sick Luke\n\nLa mia facci...",0.396,0.437,0.0,1.0,136.111,0.525,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [193]:
len(indep_test_tvec)

1048

## Models (LinReg, Lasso and RF) - CountVec

### Linear Regression

In [194]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # standardize the predictors (the scaler is fitted on the training data only)
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
    X_test_s = ss.transform(X_test)
    
    # fit
    model.fit(X_train_s, y_train)
    
    # Evaluate: predict
    y_pred = model.predict(X_test_s)
    y_true = y_test
    
    # np.sqrt of the mean squared error is the root mean squared error (RMSE);
    # it is still reported under the key 'MSE' to match the outputs below
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score (R^2 on the test set)
    score = model.score(X_test_s, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}
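
Note that the value returned under `'MSE'` is the square root of the mean squared error, so it is on the same scale as Valence. A small worked example with made-up numbers shows the difference between the two quantities:

```python
import numpy as np

y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.1, 0.5, 0.6, 1.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean of the squared errors
rmse = np.sqrt(mse)                    # back on the scale of the target

print(mse, rmse)
```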

**LinReg #1 - CountVec + all coefs**

In [195]:
# define X and y
X_train = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train = indep_train['Valence'] 
X_test = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test = indep_test['Valence']

# choose model
model = LinearRegression()

# call function
evaluate_model(model, X_train, X_test, y_train, y_test)

{'MSE': 0.2587134551086917, 'Score (R^2)': -0.3099817028424565}
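
A negative R^2 on the test set means the model predicts worse than always guessing the mean of the test targets. With roughly a thousand features and about three thousand training rows, overfitting is a plausible cause, though that is not verified here. The sketch below computes R^2 by hand on made-up numbers to show how it can drop below zero:

```python
import numpy as np

def r_squared(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([0.2, 0.4, 0.6, 0.8])
mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = np.array([0.9, 0.1, 0.9, 0.1])        # predictions far off target

print(r_squared(y_true, mean_pred))
print(r_squared(y_true, bad_pred))
```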

**Importance of the coefficients**

In [196]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train.columns,model.coef_))).abs().sort_values(ascending=False).head(10)

Energy          0.093047
Acousticness    0.028629
guardo          0.017458
geht            0.017209
nog             0.016786
quando          0.016268
long            0.015991
di              0.015467
baby            0.015285
Mode            0.014997
dtype: float64
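
Taking `.abs()` before sorting hides whether a feature pushes Valence up or down. A possible variant, sketched on synthetic data (the feature names `a`, `b`, `c` and the data are made up), ranks by magnitude but keeps the sign:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])
y = 3 * X['a'] - 2 * X['b']  # 'c' has no effect on y

toy_model = LinearRegression().fit(X, y)

coefs = pd.Series(toy_model.coef_, index=X.columns)
# Order by absolute size, but show the signed coefficients
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```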

**LinReg #2 - CountVec + top 3 coefs**

In [197]:
# define X and y
X_train2 = indep_train_cvec[['Energy', 'Acousticness', 'guardo']]
y_train2 = indep_train['Valence']
X_test2 = indep_test_cvec[['Energy', 'Acousticness', 'guardo']]
y_test2 = indep_test['Valence']

# choose model
model2 = LinearRegression()

# call function
evaluate_model(model2, X_train2, X_test2, y_train2, y_test2)

{'MSE': 0.2076809645710392, 'Score (R^2)': 0.155848139254102}

**LinReg #3 - CountVec + top 10 coefs**

In [198]:
# define X and y
X_train3 = indep_train_cvec[['Energy', 'Acousticness', 'guardo', 'geht', 'nog', 'quando', 'long', 'di', 'baby',
                             'Mode']]
y_train3 = indep_train['Valence']
X_test3 = indep_test_cvec[['Energy', 'Acousticness', 'guardo', 'geht', 'nog', 'quando', 'long', 'di', 'baby',
                             'Mode']]
y_test3 = indep_test['Valence']

# choose model
model3 = LinearRegression()

# call function
evaluate_model(model3, X_train3, X_test3, y_train3, y_test3)

{'MSE': 0.2071522032148789, 'Score (R^2)': 0.16014113421210033}

*COMMENT: The best LinReg is nr 3*

This is a very inefficient way to find the best number of independent variables. I will run a Lasso model instead to get help with selecting variables.<BR />
An alternative would have been one of the scikit-learn feature-selection transformers: `SelectKBest` (keeps the k highest-scoring features, where I choose k) or RFE (recursive feature elimination, which repeatedly drops the weakest variables).
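
For reference, a minimal sketch of the two feature-selection alternatives on synthetic data, where only columns 1 and 3 actually drive the target. The data, the choice of `f_regression` as scoring function, and `k=2` are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 1] - 3 * X[:, 3] + 0.1 * rng.normal(size=200)

# SelectKBest scores every feature on its own and keeps the k best
kbest = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# RFE refits the model repeatedly, dropping the weakest feature each round
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)

print(kbest.get_support(indices=True))
print(rfe.get_support(indices=True))
```

Both selectors still require choosing the number of features, so they would not remove the need to compare several settings.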

### Lasso Regressor

**Lasso #1 - CountVec + all coefs**

In [256]:
# define X and y
X_train4 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train4 = indep_train['Valence']
X_test4 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test4 = indep_test['Valence']

In [200]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train4)
X_train4_s = ss.transform(X_train4)
X_test4_s = ss.transform(X_test4)

In [201]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train4_s, y_train4)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.18210361592445343
best_params: {'selection': 'cyclic', 'alpha': 0.004641588833612782}


In [202]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train4_s, y_train4)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.18357351049177206
best_params: {'alpha': 0.005336699231206312, 'selection': 'cyclic'}
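
In the searches above the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. One way to avoid that, sketched here on synthetic data, is to put the scaler and the Lasso in a `Pipeline` and search over that. The step names `ss` and `lasso`, the smaller alpha grid, and `cv=3` are assumptions for this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(24)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 1] - 3 * X[:, 3] + 0.1 * rng.normal(size=200)

pipe = Pipeline([('ss', StandardScaler()),
                 ('lasso', Lasso(random_state=24, max_iter=1000))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {'lasso__alpha': np.logspace(-3, 1, 5)}

# The scaler is refitted inside every cross-validation split
gs = GridSearchCV(pipe, grid, cv=3)
gs.fit(X, y)

print(gs.best_params_, gs.best_score_)
```

The fitted `gs` can then score raw, unscaled test data directly, because the scaling is part of the estimator.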


In [203]:
# Lasso regression (best hyper params: input alpha and selection from above)
model4 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.0053, selection='cyclic')              

# fit
model4.fit(X_train4_s, y_train4)

# Evaluate: predict
y_pred = model4.predict(X_test4_s)
y_true = y_test4
    
# np.sqrt of the MSE gives the RMSE (it is printed below under the label 'MSE')
mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
# Evaluate: score 
score = model4.score(X_test4_s, y_test4)
    
print(f'Score (R^2): {score.mean()}')
print(f'MSE: {mean_square_error}')

Score (R^2): 0.1821078591179739
MSE: 0.20442519405238987


In [204]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train4.columns,model4.coef_))).abs().sort_values(ascending=False).head(15)

Energy              0.082843
Acousticness        0.017023
baby                0.011144
x2                  0.010904
Mode                0.009772
girl                0.008657
niggas              0.006985
little              0.005442
geht                0.005153
Instrumentalness    0.005133
fast                0.005067
bro                 0.005027
viel                0.005011
putain              0.004989
ville               0.004974
dtype: float64
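
The reason Lasso helps with variable selection is that its L1 penalty sets the coefficients of unhelpful features to exactly zero. A sketch on synthetic data, where only columns 1 and 3 matter (the data and `alpha=0.1` are assumptions), counts the features that survive:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(24)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 1] - 3 * X[:, 3] + 0.1 * rng.normal(size=200)

toy_lasso = Lasso(alpha=0.1).fit(X, y)

# Indices of the features whose coefficient is not exactly zero
kept = np.flatnonzero(toy_lasso.coef_)
print(kept)
print(len(kept), 'of', X.shape[1], 'features kept')
```

The same `np.flatnonzero` call on a fitted model's `coef_` would show how many of the word features a Lasso model actually uses.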

In [None]:
# Do not use x2: it comes from the lyrics text and only marks that something is repeated

**Lasso #2 - CountVec + top 10 coefs**

In [205]:
# define X and y
X_train5 = indep_train_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl', 'niggas', 'little', 'geht',
                             'Instrumentalness', 'fast']]
y_train5 = indep_train['Valence']
X_test5 = indep_test_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl', 'niggas', 'little', 'geht',
                             'Instrumentalness', 'fast']]
y_test5 = indep_test['Valence']

In [206]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train5)
X_train5_s = ss.transform(X_train5)
X_test5_s = ss.transform(X_test5)

In [207]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train5_s, y_train5)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.17670844764498583
best_params: {'selection': 'cyclic', 'alpha': 0.0011497569953977356}


In [208]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# we want intercept, we do not want to normalize here, since we already have scaled with the StandardScaler
# precompute=auto the computer decides
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train5_s, y_train5)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.17681591962374835
best_params: {'alpha': 0.001, 'selection': 'random'}
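
Both searches draw alpha from `np.logspace(-3, 3, 100)`, and the grid search settles on alpha = 0.001, the smallest value in that range. When the selected value lies on the edge of the grid, a smaller alpha might score slightly better, so extending the range downward could be worth a try. A small sketch in plain Python of what the grid contains and how to check for an edge hit; `logspace` and `on_grid_edge` are hypothetical helpers written for this illustration:

```python
import math

def logspace(start_exp, stop_exp, num):
    """Plain-Python equivalent of np.logspace(start_exp, stop_exp, num) for base 10."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]

def on_grid_edge(value, grid):
    """True if value matches the smallest or largest grid point."""
    return math.isclose(value, grid[0]) or math.isclose(value, grid[-1])

alphas = logspace(-3, 3, 100)
print(alphas[0], alphas[1], alphas[-1])

# alpha=0.001 is the smallest value the search could pick
print(on_grid_edge(0.001, alphas))                    # True
print(on_grid_edge(0.0011497569953977356, alphas))    # False: second grid point
```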


In [209]:
# Lasso regression (best hyper params: input alpha and selection from above)
model5 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model5.fit(X_train5_s, y_train5)

# Evaluate: predict 
y_pred = model5.predict(X_test5_s)
y_true = y_test5
    
# root mean squared error (the square root of the MSE)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))

# Evaluate: score 
score = model5.score(X_test5_s, y_test5)

print(f'Score (R^2): {score}')
print(f'RMSE: {rmse}')

Score (R^2): 0.1676580825613684
RMSE: 0.2062230873941988


**Lasso #3 - CountVec + top 5 coefs**

In [210]:
# define X and y
X_train6 = indep_train_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl']]
y_train6 = indep_train['Valence']
X_test6 = indep_test_cvec[['Energy', 'Acousticness', 'baby', 'Mode', 'girl']]
y_test6 = indep_test['Valence']

In [211]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train6)
X_train6_s = ss.transform(X_train6)
X_test6_s = ss.transform(X_test6)

In [212]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train6_s, y_train6)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.16526677632466533
best_params: {'selection': 'random', 'alpha': 0.001}


In [213]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train6_s, y_train6)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.16526677632466533
best_params: {'alpha': 0.001, 'selection': 'random'}


In [214]:
# Lasso regression (best hyper params: input alpha and selection from above)
model6 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model6.fit(X_train6_s, y_train6)

# Evaluate: predict 
y_pred = model6.predict(X_test6_s)
y_true = y_test6
    
# root mean squared error (the square root of the MSE)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))

# Evaluate: score 
score = model6.score(X_test6_s, y_test6)

print(f'Score (R^2): {score}')
print(f'RMSE: {rmse}')

Score (R^2): 0.16772687310510903
RMSE: 0.20621456536068097


*COMMENT: The best Lasso is nr 1*

### Random Forest Regressor

You do not have to scale a Random Forest.
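
The reason is that trees split each feature at a threshold, and standardization is a monotone transformation that keeps the values in the same order, so the same rows end up on each side of a split. A minimal sketch with a single regression stump in plain Python (a stand-in for one split in one tree, not scikit-learn's implementation); `best_split` and the toy data are illustrative assumptions:

```python
def best_split(x, y):
    """Return the set of row indices sent left by the best single split on x.

    The best split minimizes the summed squared error around the mean
    of y on each side.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        sse = 0.0
        for side in (left, right):
            mean = sum(y[i] for i in side) / len(side)
            sse += sum((y[i] - mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left

# hypothetical Tempo values and Valence targets
tempo = [172.7, 105.0, 98.2, 140.3, 120.1, 88.5]
valence = [0.88, 0.76, 0.31, 0.80, 0.45, 0.20]

# standardize Tempo (mean 0, standard deviation 1)
mean = sum(tempo) / len(tempo)
std = (sum((t - mean) ** 2 for t in tempo) / len(tempo)) ** 0.5
tempo_s = [(t - mean) / std for t in tempo]

# the same rows go left before and after scaling
print(best_split(tempo, valence) == best_split(tempo_s, valence))  # True
```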

In [215]:
def get_best_hype(model, params, X_train, y_train):  
    # Best Hyperparameters
    rs = RandomizedSearchCV(model, params, n_iter=40)
    
    # fit
    rs.fit(X_train, y_train)
     
    return {'best_score': rs.best_score_,'best_params': rs.best_params_} 

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # fit
    model.fit(X_train, y_train)
    
    # Evaluate: predict
    
    y_pred = model.predict(X_test)
    y_true = y_test
    
    # despite the name and the 'MSE' key, this is the root mean squared error (RMSE)
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score
    score = model.score(X_test, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}
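
Note that `evaluate_model` wraps `mean_squared_error` in `np.sqrt`, so the number returned under the key `'MSE'` is a root mean squared error (RMSE), expressed in the same units as Valence. A small sketch of the difference in plain Python; the functions `mse` and `rmse` and the toy values are assumptions made for illustration:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the units of y."""
    return math.sqrt(mse(y_true, y_pred))

# hypothetical Valence values and predictions
y_true = [0.9, 0.5, 0.1, 0.7]
y_pred = [0.7, 0.5, 0.3, 0.5]

print(mse(y_true, y_pred))    # about 0.03
print(rmse(y_true, y_pred))   # about 0.173
```

An RMSE near 0.2, as reported in this notebook, means the predictions are off by roughly 0.2 on the Valence scale; the corresponding MSE would be about 0.04.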

**Random Forest #1 - CountVec + all coefs**

In [216]:
# Declare indep and dep
X_train7 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train7 = indep_train['Valence']
X_test7 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test7 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train7, y_train7)

{'best_params': {'bootstrap': True,
  'max_depth': 14,
  'max_features': 'auto',
  'n_estimators': 90,
  'verbose': 0},
 'best_score': 0.17154829518939638}

In [217]:
# choose the model and use the best hyperparameters (from RandomizedSearchCV)
model7 = RandomForestRegressor(max_depth=14, max_features='auto', n_estimators=90, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model7, X_train7, X_test7, y_train7, y_test7)

{'MSE': 0.20472005307235316, 'Score (R^2)': 0.17974673332238125}

**Feature importance**

In [218]:
pd.Series(dict(zip(X_train7.columns,model7.feature_importances_))).abs().sort_values(ascending=False).head(15)

Energy              0.244485
Acousticness        0.043630
Instrumentalness    0.019697
Tempo               0.019308
Subjectivity        0.017889
Polarity            0.016202
baby                0.014102
oh                  0.009026
je                  0.008125
Mode                0.006135
girl                0.006093
che                 0.005309
ben                 0.004927
x2                  0.004863
floor               0.004749
dtype: float64

In [None]:
# Do not use x2: it comes from the lyrics text and only marks that something is repeated

**Random Forest #2 - CountVec + top 7 features**

In [219]:
# define X and y
X_train8 = indep_train_cvec[['Energy', 'Acousticness', 'Instrumentalness', 'Tempo', 'Subjectivity', 'Polarity',
                             'baby']]
y_train8 = indep_train['Valence']
X_test8 = indep_test_cvec[['Energy', 'Acousticness', 'Instrumentalness', 'Tempo', 'Subjectivity', 'Polarity',
                             'baby']]
y_test8 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train8, y_train8)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'auto',
  'n_estimators': 90,
  'verbose': 0},
 'best_score': 0.14664531102380435}

In [220]:
# choose the model and use the best hyperparameters (from RandomizedSearchCV)
model8 = RandomForestRegressor(max_depth=5, max_features='auto', n_estimators=90, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model8, X_train8, X_test8, y_train8, y_test8)

{'MSE': 0.20702112538034922, 'Score (R^2)': 0.16120365780960677}

**Random Forest #3 - CountVec + top 3 features**

In [221]:
# define X and y
X_train9 = indep_train_cvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_train9 = indep_train['Valence']
X_test9 = indep_test_cvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_test9 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train9, y_train9)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'auto',
  'n_estimators': 70,
  'verbose': 0},
 'best_score': 0.13757342917216644}

In [222]:
# choose the model and use the best hyperparameters (from RandomizedSearchCV)
model9 = RandomForestRegressor(max_depth=5, max_features='auto', n_estimators=70, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model9, X_train9, X_test9, y_train9, y_test9)

{'MSE': 0.20845478597045353, 'Score (R^2)': 0.1495457828207792}

*COMMENT: The best RF is nr 1*

The overall best model with CountVec was Random Forest nr 1, with an R2 score of about 0.1797 on the test set.

## Models (LinReg, Lasso and RF) - TF-IDF

### Linear Regression

In [223]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # standardize the predictors
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
    X_test_s = ss.transform(X_test)
    
    # fit
    model.fit(X_train_s, y_train)
    
    # Evaluate: predict and score
    y_pred = model.predict(X_test_s)
    y_true = y_test
    
    # despite the name and the 'MSE' key, this is the root mean squared error (RMSE)
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score
    score = model.score(X_test_s, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}

**LinReg #1 - TF-IDF + all coefs**

In [224]:
# define X and y
X_train10 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train10 = indep_train['Valence'] 
X_test10 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test10 = indep_test['Valence']

# choose model
model10 = LinearRegression()

# call function
evaluate_model(model10, X_train10, X_test10, y_train10, y_test10)

{'MSE': 0.24485816941821856, 'Score (R^2)': -0.17342785785246928}
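
A negative score is possible because R^2 compares the model's squared error with the error of always predicting the mean of y: it is 0 when the model does no better than the mean and drops below 0 when it does worse. A plausible reading is that the unregularized regression overfits the many word columns, although this notebook does not test that directly. A sketch of the formula in plain Python; `r2_score` here is a hand-written illustration with toy values, not the scikit-learn function:

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [0.125, 0.375, 0.625, 0.875]     # hypothetical Valence values
mean_only = [0.5, 0.5, 0.5, 0.5]          # always predict the mean
reversed_pred = [0.875, 0.625, 0.375, 0.125]  # predictions that miss badly

print(r2_score(y_true, mean_only))        # 0.0
print(r2_score(y_true, reversed_pred))    # -3.0
```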

**Importance of the coefficients**

In [226]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train10.columns,model10.coef_))).abs().sort_values(ascending=False).head(15)

Energy          0.093520
Acousticness    0.030678
baby            0.014869
faire           0.014618
Mode            0.014488
x2              0.013533
bro             0.013386
sagt            0.013151
zit             0.012580
haben           0.012317
geht            0.012192
zonder          0.012138
damn            0.011880
hai             0.011576
girl            0.011395
dtype: float64

In [None]:
# Do not use x2: it comes from the lyrics text and only marks that something is repeated

**LinReg #2 - TF-IDF + top 3 coefs**

In [227]:
# define X and y
X_train11 = indep_train_tvec[['Energy', 'Acousticness', 'baby']]
y_train11 = indep_train['Valence']
X_test11 = indep_test_tvec[['Energy', 'Acousticness', 'baby']]
y_test11 = indep_test['Valence']

# choose model
model11 = LinearRegression()

# call function
evaluate_model(model11, X_train11, X_test11, y_train11, y_test11)

{'MSE': 0.20589582744539472, 'Score (R^2)': 0.17029770981324463}

**LinReg #3 - TF-IDF + top 10 coefs**

In [228]:
# define X and y
X_train12 = indep_train_tvec[['Energy', 'Acousticness', 'baby', 'faire', 'Mode', 'bro', 'sagt', 'zit',
                             'haben', 'geht']]
y_train12 = indep_train['Valence']
X_test12 = indep_test_tvec[['Energy', 'Acousticness', 'baby', 'faire', 'Mode', 'bro', 'sagt', 'zit',
                             'haben', 'geht']]
y_test12 = indep_test['Valence']

# choose model
model12 = LinearRegression()

# call function
evaluate_model(model12, X_train12, X_test12, y_train12, y_test12)

{'MSE': 0.20766685282589525, 'Score (R^2)': 0.15596285416045497}

*COMMENT: The best LinReg is nr 2*

### Lasso Regressor

**Lasso #1 - TF-IDF + all coefs**

In [229]:
# define X and y
X_train13 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train13 = indep_train['Valence']
X_test13 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test13 = indep_test['Valence']

In [230]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train13)
X_train13_s = ss.transform(X_train13)
X_test13_s = ss.transform(X_test13)

In [231]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train13_s, y_train13)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.18246974042157804
best_params: {'selection': 'random', 'alpha': 0.006135907273413176}


In [232]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train13_s, y_train13)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.18247129724569322
best_params: {'alpha': 0.006135907273413176, 'selection': 'cyclic'}


In [233]:
# Lasso regression (best hyper params: input alpha and selection from above)
model13 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.0061, selection='cyclic')              

# fit
model13.fit(X_train13_s, y_train13)

# Evaluate: predict
y_pred = model13.predict(X_test13_s)
y_true = y_test13
    
# root mean squared error (the square root of the MSE)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))

# Evaluate: score 
score = model13.score(X_test13_s, y_test13)

print(f'Score (R^2): {score}')
print(f'RMSE: {rmse}')

Score (R^2): 0.1798104941720825
RMSE: 0.2047120961529845


In [234]:
# Look at the feature importance with coef_
pd.Series(dict(zip(X_train13.columns,model13.coef_))).abs().sort_values(ascending=False).head(15)

Energy              0.081902
Acousticness        0.015608
baby                0.012307
x2                  0.011009
girl                0.009532
Mode                0.008984
niggas              0.006917
fun                 0.005355
Instrumentalness    0.005026
bitch               0.004998
sexy                0.004667
long                0.004602
track               0.004530
quartier            0.004406
bro                 0.004137
dtype: float64
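
The feature lists in the cells that follow are typed by hand from this ranking. A sketch of how the same selection could be derived in code while skipping `x2`; `top_features` is a hypothetical helper, and the dictionary holds only the leading values copied from the output above:

```python
def top_features(coefs, k, exclude=("x2",)):
    """Return the k feature names with the largest absolute coefficient."""
    ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
    return [name for name in ranked if name not in exclude][:k]

# leading absolute coefficients copied from the output above
coefs = {
    "Energy": 0.081902,
    "Acousticness": 0.015608,
    "baby": 0.012307,
    "x2": 0.011009,
    "girl": 0.009532,
    "Mode": 0.008984,
}

print(top_features(coefs, k=5))
# ['Energy', 'Acousticness', 'baby', 'girl', 'Mode']
```

With the fitted model, the dictionary could be built as `dict(zip(X_train13.columns, model13.coef_))`, the same expression the notebook already uses.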

In [None]:
# Do not use x2: it comes from the lyrics text and only marks that something is repeated

**Lasso #2 - TF-IDF + top 10 coefs**

In [235]:
# define X and y
X_train14 = indep_train_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode', 'niggas', 'fun', 'Instrumentalness',
                              'bitch', 'sexy']]
y_train14 = indep_train['Valence']
X_test14 = indep_test_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode', 'niggas', 'fun', 'Instrumentalness',
                            'bitch', 'sexy']]
y_test14 = indep_test['Valence']

In [236]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train14)
X_train14_s = ss.transform(X_train14)
X_test14_s = ss.transform(X_test14)

In [237]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train14_s, y_train14)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.17993339012403026
best_params: {'selection': 'random', 'alpha': 0.001}


In [238]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train14_s, y_train14)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.17993339012403026
best_params: {'alpha': 0.001, 'selection': 'random'}


In [239]:
# Lasso regression (best hyper params: input alpha and selection from above)
model14 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model14.fit(X_train14_s, y_train14)

# Evaluate: predict 
y_pred = model14.predict(X_test14_s)
y_true = y_test14
    
# root mean squared error (the square root of the MSE)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))

# Evaluate: score 
score = model14.score(X_test14_s, y_test14)

print(f'Score (R^2): {score}')
print(f'RMSE: {rmse}')

Score (R^2): 0.16848620764547673
RMSE: 0.20612047270696127


**Lasso #3 - TF-IDF + top 5 coefs**

In [240]:
# define X and y
X_train15 = indep_train_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode']]
y_train15 = indep_train['Valence']
X_test15 = indep_test_tvec[['Energy', 'Acousticness', 'baby', 'girl', 'Mode']]
y_test15 = indep_test['Valence']

In [241]:
# standardize the predictors
ss = StandardScaler()
ss.fit(X_train15)
X_train15_s = ss.transform(X_train15)
X_test15_s = ss.transform(X_test15)

In [242]:
# RandomizedSearch (best hyperparams) 
params = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
rs = RandomizedSearchCV(lassoreg, params, n_iter=40)
rs.fit(X_train15_s, y_train15)

print(f'best_score: {rs.best_score_}')
print(f'best_params: {rs.best_params_}')

best_score: 0.16727253217087912
best_params: {'selection': 'cyclic', 'alpha': 0.001}


In [243]:
# GridSearch (best hyperparams) 
grid = {'alpha': np.logspace(-3, 3, 100),
        'selection' : ('cyclic', 'random')}

# fit an intercept; do not normalize here, since the predictors are already scaled with the StandardScaler
# precompute is not set, so the default is used
lassoreg = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000)

# Best Hyperparameters and fit
gs = GridSearchCV(lassoreg, grid)
gs.fit(X_train15_s, y_train15)
    
print(f'best_score: {gs.best_score_}')
print(f'best_params: {gs.best_params_}') 

best_score: 0.16727315161904804
best_params: {'alpha': 0.001, 'selection': 'random'}


In [244]:
# Lasso regression (best hyper params: input alpha and selection from above)
model15 = Lasso(random_state=24, fit_intercept=True, normalize=False, max_iter=1000, alpha=0.001, selection='random')              

# fit
model15.fit(X_train15_s, y_train15)

# Evaluate: predict 
y_pred = model15.predict(X_test15_s)
y_true = y_test15
    
# root mean squared error (the square root of the MSE)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))

# Evaluate: score 
score = model15.score(X_test15_s, y_test15)

print(f'Score (R^2): {score}')
print(f'RMSE: {rmse}')

Score (R^2): 0.1687700698821054
RMSE: 0.2060852869930804


*COMMENT: The best Lasso is nr 1*

### Random Forest Regressor

You do not have to scale a Random Forest.

In [245]:
def get_best_hype(model, params, X_train, y_train):  
    # Best Hyperparameters
    rs = RandomizedSearchCV(model, params, n_iter=40)
    
    # fit
    rs.fit(X_train, y_train)
     
    return {'best_score': rs.best_score_,'best_params': rs.best_params_} 

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # fit
    model.fit(X_train, y_train)
    
    # Evaluate: predict
    
    y_pred = model.predict(X_test)
    y_true = y_test
    
    # despite the name and the 'MSE' key, this is the root mean squared error (RMSE)
    mean_square_error = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: score
    score = model.score(X_test, y_test)
    
    return {'Score (R^2)': score.mean(), 'MSE': mean_square_error}

**Random Forest #1 - TF-IDF + all coefs**

In [246]:
# Declare indep and dep
X_train16 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train16 = indep_train['Valence']
X_test16 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test16 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train16, y_train16)

{'best_params': {'bootstrap': True,
  'max_depth': 11,
  'max_features': 'auto',
  'n_estimators': 80,
  'verbose': 0},
 'best_score': 0.17423560058157853}

In [247]:
# choose the model and use the best hyperparameters (from RandomizedSearchCV)
model16 = RandomForestRegressor(max_depth=11, max_features='auto', n_estimators=80, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model16, X_train16, X_test16, y_train16, y_test16)

{'MSE': 0.20620783195905815, 'Score (R^2)': 0.1677812236571271}

**Feature importance**

In [248]:
pd.Series(dict(zip(X_train16.columns,model16.feature_importances_))).abs().sort_values(ascending=False).head(15)

Energy              0.279480
Acousticness        0.034036
Instrumentalness    0.015717
baby                0.014794
Subjectivity        0.012901
Polarity            0.011307
Tempo               0.010349
girl                0.008337
che                 0.007418
oh                  0.006701
ogni                0.006645
ben                 0.006076
je                  0.005815
mich                0.005648
Mode                0.005332
dtype: float64

**Random Forest #2 - TF-IDF + top 7 features**

In [249]:
# define X and y
X_train17 = indep_train_tvec[['Energy', 'Acousticness', 'Instrumentalness', 'baby', 'Subjectivity', 'Polarity',
                              'Tempo']]
y_train17 = indep_train['Valence']
X_test17 = indep_test_tvec[['Energy', 'Acousticness', 'Instrumentalness', 'baby', 'Subjectivity', 'Polarity',
                            'Tempo']]
y_test17 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train17, y_train17)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'auto',
  'n_estimators': 70,
  'verbose': 0},
 'best_score': 0.147190529747752}

In [250]:
# choose the model and use the best hyperparameters (from RandomizedSearchCV)
model17 = RandomForestRegressor(max_depth=5, max_features='auto', n_estimators=70, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model17, X_train17, X_test17, y_train17, y_test17)

{'MSE': 0.2067073825659824, 'Score (R^2)': 0.16374414162595438}

**Random Forest #3 - TF-IDF + top 3 features**

In [251]:
# define X and y
X_train18 = indep_train_tvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_train18 = indep_train['Valence']
X_test18 = indep_test_tvec[['Energy', 'Acousticness', 'Instrumentalness']]
y_test18 = indep_test['Valence']

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'bootstrap': (True, False),
        'verbose' : np.arange(0, 1)}

rfr = RandomForestRegressor(random_state=24)

get_best_hype(rfr, params, X_train18, y_train18)

{'best_params': {'bootstrap': True,
  'max_depth': 5,
  'max_features': 'auto',
  'n_estimators': 60,
  'verbose': 0},
 'best_score': 0.13714638717866467}

In [252]:
# choose the model and use the best hyperparameters (from RandomizedSearchCV)
model18 = RandomForestRegressor(max_depth=5, max_features='auto', n_estimators=60, verbose=0, bootstrap=True, 
                               random_state=24)

# call function
evaluate_model(model18, X_train18, X_test18, y_train18, y_test18)

{'MSE': 0.208516295100657, 'Score (R^2)': 0.1490438186500912}

*COMMENT: The best RF is nr 1*
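
Some of the test-set R^2 scores reported above can be put side by side to see how close the leading models are. A sketch in plain Python; the dictionary holds four scores copied from the outputs in this notebook, and the variable names are chosen for this illustration only:

```python
# test-set R^2 scores copied from the outputs above
scores = {
    ("CountVec", "Random Forest #1"): 0.17974673332238125,
    ("TF-IDF", "LinReg #2"): 0.17029770981324463,
    ("TF-IDF", "Lasso #1"): 0.1798104941720825,
    ("TF-IDF", "Random Forest #1"): 0.1677812236571271,
}

# model with the highest score
best = max(scores, key=scores.get)
print(best, scores[best])

# gap between the best score and the runner-up
ranked = sorted(scores.values(), reverse=True)
gap = ranked[0] - ranked[1]
print(gap)   # about 6e-05
```

The gap between the two best scores is well under 0.001, so the ranking could plausibly change with a different train/test split or random seed.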

The overall best model with TF-IDF was Lasso nr 1, with an R2 score of about 0.1798 on the test set.

**The TF-IDF Lasso regressor with all variables is the best-scoring model in this notebook.**