# Make a Good Model for the mood of the song

In this notebook I do more NLP on the track lyrics, bring back columns that I discarded in the MVP, and do some feature engineering.

Good model = more NLP on the Track Lyrics<BR />
Better model = also looking at number of streams and position<BR />
Best model = also looking at time on top list<BR />

Once we have the best model, we can use its predictions to estimate the mood of a country and the mood of an artist, and from that suggest which artists suit which countries.

## Import stuff

In [45]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer, CountVectorizer

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import sklearn.metrics
from sklearn.ensemble import RandomForestRegressor
# GridSearchCV lives in model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline

## Load data

In [29]:
data = pd.read_csv('./data_top10c_more_lyrics.csv')

In [30]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,Position,Streams,Track Name,Artist,ID,Date,Year,Month,Day,Country,Region,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,0,177,40381,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,2017-10-05,2017,10,5,gb,eu,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,1,151,24132,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-23,2017,12,23,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
2,2,78,49766,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,2017-12-24,2017,12,24,it,eu,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756


## Fix a little bit with the data

**Drop duplicate rows and keep only one row per track name**<BR />
Matching on `Track Name` alone means that different songs sharing a title are treated as one song.

In [31]:
data_per_song = data.drop_duplicates(subset=['Track Name'], keep='first')
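
Deduplicating on `Track Name` and deduplicating on the track `ID` can give different results. A minimal sketch of the difference, using a small made-up frame (the rows below are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy chart data (hypothetical rows): two different songs share the title "Hello"
toy = pd.DataFrame({
    'Track Name': ['Hello', 'Hello', 'Hello', 'Bye'],
    'Artist':     ['A',     'A',     'B',     'C'],
    'ID':         ['id1',   'id1',   'id2',   'id3'],
})

# Same call as in the notebook: one row per title
by_name = toy.drop_duplicates(subset=['Track Name'], keep='first')

# Alternative: one row per track ID
by_id = toy.drop_duplicates(subset=['ID'], keep='first')

print(len(by_name))  # 2 rows: artist B's "Hello" is dropped
print(len(by_id))    # 3 rows: one per distinct track ID
```

Whether the `ID` column is the better key for this dataset is an assumption worth checking, since the same recording can appear under several IDs.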

**Drop all columns that vary between chart entries for the same song**

In [32]:
nlp_data = data_per_song.drop(['Unnamed: 0', 'Position', 'Streams', 'Date', 'Year', 'Month', 'Day', 'Country', 'Region'], axis=1)

nlp_data.head(5)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
43,Douce Nuit,-M-,4EOJWkvkVDpkZrhC8iTDsI,,0.914,0.227,0.163,1.0,81.887,0.0498
44,Zomersessie,101Barz,3ypzzvHUfgwyqxhL9ym4fH,,0.00818,0.403,2.1e-05,1.0,155.748,0.365
47,Zomersessie (feat. 3robi),101Barz,2re4cLViiQw0NZZx5KUpV8,,0.00818,0.403,2.1e-05,1.0,155.748,0.365


**Look at missing values**

In [33]:
nlp_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6919 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          6919 non-null object
Artist              6919 non-null object
ID                  6919 non-null object
Lyrics              4190 non-null object
Acousticness        6918 non-null float64
Energy              6918 non-null float64
Instrumentalness    6918 non-null float64
Mode                6918 non-null float64
Tempo               6918 non-null float64
Valence             6918 non-null float64
dtypes: float64(6), object(4)
memory usage: 594.6+ KB
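
The non-null counts from `info()` can also be turned into missing-value counts per column. A small sketch on a made-up frame (hypothetical values), using `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both a text column and a numeric column
toy = pd.DataFrame({
    'Lyrics': ['la la', None, 'oh oh', None],
    'Energy': [0.9, 0.4, np.nan, 0.7],
})

# Number of missing values in each column
missing = toy.isnull().sum()
print(missing)
```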


**Drop rows that have missing values in the Lyrics column**<BR />
We can use `dropna` to drop every row that has a missing value. Almost all of them are missing only the Lyrics; the single row that is missing its audio features has no lyrics either.

In [34]:
nlp_data_clean = nlp_data.dropna(axis=0, how='any')

nlp_data_clean.head()

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528
50,Vide,13 Block,69RclklKbEelwfQJCBzh0m,"13 Blo' gang, tu sais d'jà comment on opère mo...",0.505,0.682,6e-06,0.0,112.063,0.514
71,10 Dinger,187 Strassenbande,3ruUVcomUKxPlX8srBfMua,"Ich schwör' dir, wenn ich mal Kohle mache, dan...",0.0182,0.673,0.0,1.0,94.54,0.642


In [35]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 10 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
dtypes: float64(6), object(4)
memory usage: 360.1+ KB


### TextBlob

**Turn the lyrics in the Lyrics column into strings**

In [36]:
nlp_data_clean['Lyrics'] = nlp_data_clean['Lyrics'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
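
The warning above appears because `nlp_data_clean` was derived from another frame, and pandas cannot tell whether the assignment is meant to reach the original. One common way to avoid it, sketched here on a toy frame (hypothetical data), is to take an explicit `.copy()` right after filtering:

```python
import pandas as pd

toy = pd.DataFrame({'Lyrics': ['la la', None, 42], 'Valence': [0.5, 0.1, 0.9]})

# An explicit copy makes the cleaned frame independent of the original
toy_clean = toy.dropna(axis=0, how='any').copy()

# Assigning a column on the copy no longer triggers the warning
toy_clean['Lyrics'] = toy_clean['Lyrics'].astype(str)

print(toy_clean['Lyrics'].tolist())  # ['la la', '42']
```

In this notebook, that would mean adding `.copy()` to the `dropna` call that creates `nlp_data_clean`.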


**Make and run function for TextBlob on the Lyrics**

In [37]:
def sentiment_func(lyrics):
    # Return TextBlob's (polarity, subjectivity) pair, or None if scoring fails
    try:
        return TextBlob(lyrics).sentiment
    except Exception:
        return None

nlp_data_clean['pol_sub'] = nlp_data_clean['Lyrics'].apply(sentiment_func)


**Split the pol_sub column into 2 new columns (Polarity, Subjectivity)**

In [38]:
# Peek at the polarity of the first song
nlp_data_clean['pol_sub'][0][0]

# Each pol_sub value is a (polarity, subjectivity) pair
nlp_data_clean['Polarity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[0])
nlp_data_clean['Subjectivity'] = nlp_data_clean['pol_sub'].apply(lambda x: x[1])

nlp_data_clean.head(3)


Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,pol_sub,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,"(-0.044454619454619454, 0.5908017908017905)",-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,"(0.5831501831501833, 0.6706959706959708)",0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,"(0.1738095238095238, 0.5416666666666666)",0.17381,0.541667
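
The split above uses two `apply` calls. As an alternative sketch, both columns can be built in one step from the list of pairs. The `Sentiment` namedtuple below is a stand-in for TextBlob's result (an assumption for illustration, so the example runs without TextBlob), and the values are made up:

```python
from collections import namedtuple

import pandas as pd

# Stand-in for TextBlob's sentiment result: a (polarity, subjectivity) pair
Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])

toy = pd.DataFrame({'pol_sub': [Sentiment(-0.04, 0.59), Sentiment(0.58, 0.67)]})

# Convert the pairs to plain tuples and expand them into two columns
pairs = [tuple(p) for p in toy['pol_sub']]
parts = pd.DataFrame(pairs, columns=['Polarity', 'Subjectivity'], index=toy.index)

# Attach the new columns and drop the intermediate one
toy = toy.join(parts).drop(columns=['pol_sub'])
print(toy)
```

Passing `index=toy.index` keeps the new columns aligned with the original rows, which matters here because the cleaned frame has a non-contiguous index.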


**Drop the pol_sub column**

In [39]:
nlp_data_clean = nlp_data_clean.drop(['pol_sub'], axis=1)

nlp_data_clean.head(5)

Unnamed: 0,Track Name,Artist,ID,Lyrics,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
0,Bye Bye Bye,*NSYNC,4r8lRYnoOGdEi6YyI5OC1o,"hey, hey bye bye bye, bye bye bye bye i'm doi...",0.0408,0.928,0.00104,0.0,172.656,0.879,-0.044455,0.590802
1,"Merry Christmas, Happy Holidays",*NSYNC,15coTBAzEN1bOeipoNDZAR,merry christmas and happy holidays merry chris...,0.103,0.939,0.0,1.0,105.003,0.756,0.58315,0.670696
48,Somme,13 Block,2xkxBVJHf9jQsq7g46UtQx,"J'ai fait l'aller, j'suis sur le retour\nLa ma...",0.494,0.678,0.00151,0.0,79.979,0.528,0.17381,0.541667
50,Vide,13 Block,69RclklKbEelwfQJCBzh0m,"13 Blo' gang, tu sais d'jà comment on opère mo...",0.505,0.682,6e-06,0.0,112.063,0.514,-0.244444,0.777778
71,10 Dinger,187 Strassenbande,3ruUVcomUKxPlX8srBfMua,"Ich schwör' dir, wenn ich mal Kohle mache, dan...",0.0182,0.673,0.0,1.0,94.54,0.642,0.119167,0.555833


### Do a quick check of the entire data frame

In [40]:
nlp_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4190 entries, 0 to 578929
Data columns (total 12 columns):
Track Name          4190 non-null object
Artist              4190 non-null object
ID                  4190 non-null object
Lyrics              4190 non-null object
Acousticness        4190 non-null float64
Energy              4190 non-null float64
Instrumentalness    4190 non-null float64
Mode                4190 non-null float64
Tempo               4190 non-null float64
Valence             4190 non-null float64
Polarity            4190 non-null float64
Subjectivity        4190 non-null float64
dtypes: float64(8), object(4)
memory usage: 585.5+ KB


In [41]:
nlp_data_clean.describe()

Unnamed: 0,Acousticness,Energy,Instrumentalness,Mode,Tempo,Valence,Polarity,Subjectivity
count,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0,4190.0
mean,0.238886,0.660887,0.009918,0.54105,120.254388,0.48252,0.071806,0.453405
std,0.234458,0.165181,0.065771,0.498372,26.560422,0.221104,0.225631,0.228036
min,3e-06,0.0279,0.0,0.0,54.082,0.0371,-1.0,0.0
25%,0.051525,0.562,0.0,0.0,99.98425,0.31,-0.033272,0.35
50%,0.159,0.676,0.0,1.0,120.004,0.473,0.046612,0.4875
75%,0.368,0.78475,3.8e-05,1.0,136.04475,0.654,0.189943,0.591449
max,0.988,0.995,0.89,1.0,232.69,0.982,1.0,1.0


## Train/Test-split

Divide the data into a training set and a test set. The test set is 25% of the rows, which is also the default.

In [42]:
dep   = nlp_data_clean['Valence']
# Note: indep still contains the Valence column; only Lyrics is used in the NLP steps below
indep = nlp_data_clean

In [43]:
indep_train, indep_test, dep_train, dep_test = train_test_split(indep, dep, test_size = 0.25, random_state=24)
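
Because `indep` above is the full frame, the target column travels along with the features. A minimal sketch of the same split with the target removed first, on a made-up eight-row frame (hypothetical values):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    'Lyrics':  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'Valence': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

dep = toy['Valence']
indep = toy.drop(columns=['Valence'])  # keep the target out of the features

indep_train, indep_test, dep_train, dep_test = train_test_split(
    indep, dep, test_size=0.25, random_state=24)

print(len(indep_train), len(indep_test))  # 6 2
```

The rows of `indep_train` and `dep_train` share the same index, so features and targets stay aligned after the shuffle.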

## NLP

### CountVectorizer

Since music is an art form, like poetry, I might consider keeping the stop words instead of removing them. (Maybe in a later try?!)

In [46]:
# instantiate the model
cvec = CountVectorizer(stop_words='english') # eliminate English stop words
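
To see what `stop_words='english'` removes, here is a small sketch on two made-up lyric lines (the strings are invented for illustration). The built-in list covers English only, so French or German lyrics keep their function words either way.

```python
from sklearn.feature_extraction.text import CountVectorizer

lyrics = ["you and i are the only ones", "bye bye bye i am done"]

# Vocabulary with every token kept
with_stop = CountVectorizer().fit(lyrics)

# Vocabulary after removing scikit-learn's built-in English stop words
without_stop = CountVectorizer(stop_words='english').fit(lyrics)

print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

Comparing the two vocabularies is a quick way to judge how much of a lyric survives the filter before deciding whether to keep it.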

In [47]:
# fit the count vectorizer to the data. 
cvec.fit(indep_train['Lyrics'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [48]:
# transform
cvec_data = cvec.transform(indep_train['Lyrics'])
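
The vectorizer is fitted on the training lyrics only, so the test lyrics later have to be transformed with that same fitted object. A minimal sketch on made-up strings (hypothetical data), where `fit_transform` combines the fit and transform steps:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_lyrics = ["bye bye bye", "merry christmas and happy holidays"]
test_lyrics = ["happy bye", "unseen words here"]

cvec = CountVectorizer()
train_counts = cvec.fit_transform(train_lyrics)  # learn the vocabulary and count
test_counts = cvec.transform(test_lyrics)        # reuse the training vocabulary

# Both matrices have one column per training-vocabulary word
print(train_counts.shape, test_counts.shape)
```

Words that appear only in the test lyrics are ignored, so a test row made entirely of unseen words becomes a row of zeros.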

In [50]:
# Turn the features into a (pandas) dataframe
# (on scikit-learn older than 1.0, use cvec.get_feature_names() instead)
df = pd.DataFrame(cvec_data.toarray(), columns=cvec.get_feature_names_out())

In [51]:
df.head(3)

Unnamed: 0,00,000,000e,030,04,040,0492,05,06,06er,...,조준,죽기엔,죽어,죽지,지겠지,질러,차트를,친구들아,트램펄린,함께라는
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


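**I need to drop the columns that contain Asian characters**<BR />
The vocabulary is sorted, so these tokens end up at the very end; here they are the last 81 columns.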

In [63]:
df1 = df.drop(df.columns[-81:], axis=1)

In [64]:
df1.head(3)

Unnamed: 0,00,000,000e,030,04,040,0492,05,06,06er,...,œuf,œufs,œuvre,œuvrer,œuvres,şurup,šaban,šaulić,živeli,ḥarām
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
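
Dropping a fixed number of trailing columns depends on the vocabulary not changing. As an alternative sketch, the columns can be selected by the characters in their names. The Unicode ranges below are an assumed, non-exhaustive set (Hangul, Japanese kana, and common CJK ideographs), and the toy frame is made up:

```python
import re

import pandas as pd

# Hangul jamo and syllables, Japanese kana, CJK ideographs (assumed, non-exhaustive)
cjk_pattern = re.compile(
    r'[\u1100-\u11ff\u3040-\u30ff\u3130-\u318f\u4e00-\u9fff\uac00-\ud7af]')

toy = pd.DataFrame([[1, 0, 2, 0]], columns=['bye', 'œuvre', '죽어', '함께라는'])

# Keep only the columns whose names contain none of those characters
keep = [col for col in toy.columns if not cjk_pattern.search(col)]
toy_latin = toy[keep]

print(list(toy_latin.columns))  # ['bye', 'œuvre']
```

Accented Latin tokens such as `œuvre` are kept, which matches the result of the slice above.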
