# Predicting the Decade of a Song

In this notebook, I explore a dataset of Billboard songs, downloaded from [here](https://www.kaggle.com/rakannimer/billboard-lyrics), to see which aspect of a song is best at predicting the decade it came from. For each song, the dataset includes the lyrics, the song title, and the artist.

I am going to create different models to see which combination of model and song aspect is best at predicting the decade a song came from. To start, I read in the data below and began exploring what it looks like.

In [1]:
import pandas as pd

music = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding='cp1252')

In [2]:
music.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3.0
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1.0
2,3,i cant get no satisfaction,the rolling stones,1965,,1.0
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1.0
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1.0


##### Here I noted that some values were missing from the Lyrics column. I had no reliable way to fill in the missing lyrics, and the affected rows made up only about 3.7% of the samples (187 of 5,100), so I decided it was best to drop them before creating my models.

In [3]:
music.isnull().sum()

Rank        0
Song        0
Artist      0
Year        0
Lyrics    187
Source    187
dtype: int64
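The share of rows affected follows from the counts above: 187 missing lyrics out of the dataset's 5,100 rows. A quick arithmetic check, with variable names that are only illustrative:

```python
# Counts taken from the isnull() output above and the dataset's 5,100 rows.
n_missing = 187
n_total = 5100

missing_share = n_missing / n_total
print(f"{missing_share:.1%} of rows have no lyrics")

# With the DataFrame loaded, music['Lyrics'].isna().mean() gives the same ratio.
```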

In [4]:
music.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5100 entries, 0 to 5099
Data columns (total 6 columns):
Rank      5100 non-null int64
Song      5100 non-null object
Artist    5100 non-null object
Year      5100 non-null int64
Lyrics    4913 non-null object
Source    4913 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 239.2+ KB


In [5]:
music['Year'].sort_values()

0       1965
72      1965
71      1965
70      1965
69      1965
        ... 
5027    2015
5026    2015
5024    2015
5035    2015
5099    2015
Name: Year, Length: 5100, dtype: int64

In [8]:
music.dropna(inplace = True)

##### After dropping those samples, I looked at the value counts and noticed that some years had more songs than others. I knew it would be best to stratify on my y value when splitting the data, so that the training and testing sets contain each y value in the same proportions.

In [9]:
music['Year'].value_counts()

2000    100
1987    100
1989    100
2011    100
2010     99
2014     99
1988     99
2002     99
1993     99
1985     99
1973     98
2013     98
2001     98
2012     98
1990     98
2004     98
2005     98
2015     98
1984     98
1992     98
1967     98
1983     98
1995     97
1976     97
2003     97
1999     97
2008     97
1970     97
2009     97
1991     97
2006     96
1978     96
1982     96
1986     96
1965     95
1966     95
1981     95
1998     95
1979     95
2007     95
1994     94
1980     94
1969     93
1974     93
1975     93
1968     93
1997     92
1977     91
1996     91
1972     91
1971     88
Name: Year, dtype: int64

##### During my exploratory data analysis, I realized I needed to change the Year column, because I wanted a model that predicts a decade. I therefore created the column below, which assigns each year to one of the decades the dataset covers. This was another reason to stratify my y value: some decades cover fewer years than others (the 1960s only 1965-1969 and the 2010s only 2010-2015).

In [13]:
# Encode each year as a decade code: 0 = 1965-1969, 1 = 1970s, ..., 5 = 2010-2015
song_year = []

for x in music["Year"]:
    if 1965 <= x <= 1969:
        song_year.append(0)
    elif 1970 <= x <= 1979:
        song_year.append(1)
    elif 1980 <= x <= 1989:
        song_year.append(2)
    elif 1990 <= x <= 1999:
        song_year.append(3)
    elif 2000 <= x <= 2009:
        song_year.append(4)
    elif 2010 <= x <= 2015:
        song_year.append(5)

music['song_year'] = song_year

music.head()
            

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,song_year
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3.0,0
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1.0,0
2,3,i cant get no satisfaction,the rolling stones,1965,,1.0,0
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1.0,0
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1.0,0
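The same year-to-decade mapping could also be written as a vectorized binning step. The sketch below uses `pd.cut` on a handful of sample years; the `years` series and the `edges` list are illustrative assumptions rather than part of the original notebook, and the bins are chosen to reproduce the 0-5 codes of the loop above.

```python
import pandas as pd

# A few sample years (illustrative, not the full dataset).
years = pd.Series([1965, 1969, 1970, 1985, 1999, 2004, 2010, 2015])

# Bin edges are right-inclusive: (1964, 1969], (1969, 1979], ..., (2009, 2015].
edges = [1964, 1969, 1979, 1989, 1999, 2009, 2015]

# labels=False returns the integer bin index, matching the 0-5 codes of the loop.
decade_codes = pd.cut(years, bins=edges, labels=False)
print(decade_codes.tolist())
```

On the real data this would be `pd.cut(music['Year'], bins=edges, labels=False)`. Years outside the edges would come back as missing values, which makes unexpected inputs easy to spot.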


In [144]:
music['song_year'].value_counts(normalize = True)

2    0.198453
4    0.198453
3    0.194993
1    0.191126
5    0.120497
0    0.096479
Name: song_year, dtype: float64
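These class shares also give a reference point for the models later in the notebook. A small sketch, using the shares printed above, of two common baselines: always predicting the most frequent decade, and guessing uniformly among the six decades. The dictionary and variable names are illustrative.

```python
# Class shares copied from the value_counts(normalize=True) output above.
decade_share = {0: 0.096479, 1: 0.191126, 2: 0.198453,
                3: 0.194993, 4: 0.198453, 5: 0.120497}

# Always predicting the most common decade is right this often.
majority_baseline = max(decade_share.values())

# Guessing uniformly at random among the six decades is right this often.
uniform_baseline = 1 / len(decade_share)

print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"uniform-guess baseline:  {uniform_baseline:.3f}")
```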

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

##### Here I count-vectorized the lyrics to see what the top two-word phrases in these songs are. The results below did not surprise me: most of the two-word phrases are filler, such as "oh oh", "yeah yeah", and "la la".

In [25]:
X = music['Lyrics']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
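To illustrate what `stratify` does in the split above, here is a minimal sketch on made-up labels. The 60/40 class mix and the `*_demo` names are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (an assumption for illustration): 60 samples of class 0, 40 of class 1.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Each split keeps the 60/40 class mix of the full data.
print("train share of class 1:", y_tr.mean())
print("test share of class 1: ", y_te.mean())
```

Stratifying does not reweight the classes. It only keeps their proportions the same in the training and testing sets.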

In [17]:
top_words = CountVectorizer(stop_words= 'english', ngram_range= (2,2))

In [18]:
top_words.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(2, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [20]:
top_phrases = pd.DataFrame(top_words.transform(X_train).todense(), columns = top_words.get_feature_names())

In [22]:
top_phrases.sum().sort_values(ascending = False).head(20)

oh oh          2389
dont know      1308
yeah yeah      1191
la la          1185
love love       944
im gonna        857
na na           755
dont want       626
dont wanna      621
just like       495
hey hey         495
baby baby       489
know im         459
ive got         441
oh yeah         413
feel like       393
youre gonna     392
cause im        387
im just         385
da da           383
dtype: int64
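Converting the full bigram matrix to a dense DataFrame can take a lot of memory. One possible alternative, sketched here on three made-up lyric strings, is to sum the columns of the sparse matrix directly and look up phrase names through the vectorizer's `vocabulary_` mapping. The `docs` list and variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy lyrics (illustrative only).
docs = ["oh oh oh baby", "baby baby oh oh", "la la la"]

cvec = CountVectorizer(stop_words="english", ngram_range=(2, 2))
counts = cvec.fit_transform(docs)  # sparse document-term matrix

# Sum each column without converting the matrix to a dense array.
totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps each phrase to its column index.
phrase_totals = pd.Series(
    {phrase: int(totals[col]) for phrase, col in cvec.vocabulary_.items()}
).sort_values(ascending=False)

print(phrase_totals.head())
```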

## Creating Models with Lyrics

First, I wanted to see how well a model could predict the decade of a song based on lyrics alone. Below I built five different kinds of models (a Random Forest Classifier, a Random Forest Regressor, a Bagging Classifier, an AdaBoost Classifier, and a Logistic Regression model) to see which would give the best score. The best was the Logistic Regression model: the baseline accuracy (always predicting the most common decade) is about .20, and the Logistic Regression scored .39 on the testing data, nearly double the baseline. Note that the Random Forest Regressor's score is R² rather than accuracy, so it is not directly comparable with the classifiers' scores. Every model except AdaBoost also scored far higher on the training data than on the testing data, a sign of overfitting.

In [99]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, RandomForestRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

In [142]:
# X = music[['Lyrics', 'Artist', 'Song']]
X = music['Lyrics']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

logr_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('logr', LogisticRegression(solver = 'lbfgs'))
])

logr_pipe.fit(X_train, y_train);



In [53]:
logr_pipe.score(X_train, y_train)

0.9856134636264929

In [54]:
logr_pipe.score(X_test, y_test)

0.38812042310821804

In [141]:
X = music['Lyrics']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

rfc_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english', max_features = 15000, max_df = .95)),
    ('rf', RandomForestClassifier())
])

rfc_pipe.fit(X_train, y_train);



In [87]:
rfc_pipe.score(X_train, y_train)

0.9777415852334419

In [88]:
rfc_pipe.score(X_test, y_test)

0.3360455655004068

In [140]:
X = music['Lyrics']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

rfr_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english', max_features = 15000, max_df = .95)),
    ('rfr', RandomForestRegressor())
])

rfr_pipe.fit(X_train, y_train);



In [102]:
rfr_pipe.score(X_train, y_train)

0.8713270359207359

In [103]:
rfr_pipe.score(X_test, y_test)

0.2840585953899428
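The score reported by a regressor's `.score()` method is R², not accuracy, so the numbers above cannot be compared directly with the classifiers' scores. One way to put a regressor on the same footing, sketched here with made-up predictions, is to round each prediction to the nearest decade code and then compute accuracy. The arrays below are illustrative and are not outputs of `rfr_pipe`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Toy values (illustrative): true decade codes and continuous regressor outputs.
y_true = np.array([0, 1, 2, 3, 4, 5])
y_pred = np.array([0.4, 1.6, 2.2, 2.9, 3.4, 4.8])

# What a regressor's .score() reports: R^2, not accuracy.
r2 = r2_score(y_true, y_pred)

# Round to the nearest decade code and keep it inside the valid 0-5 range.
y_pred_class = np.clip(np.rint(y_pred), 0, 5).astype(int)
acc = accuracy_score(y_true, y_pred_class)

print(f"R^2: {r2:.3f}, accuracy after rounding: {acc:.3f}")
```

With the fitted pipeline, the same idea would apply to `rfr_pipe.predict(X_test)` and `y_test`.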

In [139]:
X = music['Lyrics']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

bc_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('bc', BaggingClassifier())
])

bc_pipe.fit(X_train, y_train);

In [94]:
bc_pipe.score(X_train, y_train)

0.9782844733984799

In [95]:
bc_pipe.score(X_test, y_test)

0.34011391375101707

In [138]:
X = music['Lyrics']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

adbc_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english', max_features = 15000, max_df = .95)),
    ('adbc', AdaBoostClassifier())
])

adbc_pipe.fit(X_train, y_train);

In [97]:
adbc_pipe.score(X_train, y_train)

0.34527687296416937

In [98]:
adbc_pipe.score(X_test, y_test)

0.31000813669650124
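To compare the lyric classifiers side by side, the accuracy scores from the cells above can be collected and ranked. A small sketch with the values copied from this notebook and rounded to three places; the dictionary layout is only one possible way to organize them.

```python
# (train accuracy, test accuracy) copied from the cells above, rounded to 3 places.
# The Random Forest Regressor is left out because its score is R^2, not accuracy.
lyrics_scores = {
    "LogisticRegression": (0.986, 0.388),
    "RandomForestClassifier": (0.978, 0.336),
    "BaggingClassifier": (0.978, 0.340),
    "AdaBoostClassifier": (0.345, 0.310),
}

# Sort by test accuracy, best first, and show the train/test gap.
ranked = sorted(lyrics_scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, test_acc) in ranked:
    gap = train_acc - test_acc
    print(f"{name:<24} train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")

best_model = ranked[0][0]
```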

## Creating Models with Song Titles

Next, I tried predicting the decade a song came from using only the song title. I used the same models as for lyrics and tinkered with the hyperparameters to see which would give the best score on the testing data. The best model was again Logistic Regression, but lyrics are ultimately better predictors than song titles (.39 versus .26 testing accuracy), which makes sense because the full lyrics give a model much more text to work with than the title alone.
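Hyperparameter tinkering can also be automated with a cross-validated grid search, which chooses settings without looking at the testing data. The sketch below is not what the notebook did. It uses made-up titles and labels and an arbitrary small grid, and it reuses the `cvec` and `logr` step names only so that the parameter names look familiar.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy titles and labels (illustrative only): 0 = older, 1 = newer.
old_titles = ["twist shout baby", "baby love twist", "shout shout love", "twist baby baby"]
new_titles = ["club party tonight", "party party club", "tonight club dance", "dance party tonight"]
titles = old_titles * 3 + new_titles * 3
labels = [0] * 12 + [1] * 12

pipe = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("logr", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Step names prefix the parameter names: "<step>__<parameter>".
param_grid = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "logr__C": [0.1, 1.0],
}

# 3-fold cross-validation scores each combination on held-out folds
# of the data passed to fit().
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(titles, labels)

print(search.best_params_, round(search.best_score_, 3))
```

In the notebook's setting, `search.fit(X_train, y_train)` would keep `X_test` and `y_test` for a single final evaluation.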

In [137]:
X = music['Song']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

slogr_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('slogr', LogisticRegression(solver = 'lbfgs'))
])

slogr_pipe.fit(X_train, y_train);



In [136]:
slogr_pipe.score(X_train, y_train), slogr_pipe.score(X_test, y_test) 

(0.6628664495114006, 0.2611879576891782)

In [135]:
X = music['Song']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

srfr_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english', max_features = 15000, max_df = .95)),
    ('srfr', RandomForestRegressor())
])

srfr_pipe.fit(X_train, y_train);



In [134]:
srfr_pipe.score(X_train, y_train), srfr_pipe.score(X_test, y_test)

(0.6865253986863444, -0.11570444123636457)

In [143]:
X = music['Song']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

srfc_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('srf', RandomForestClassifier())
])

srfc_pipe.fit(X_train, y_train);



In [133]:
srfc_pipe.score(X_train, y_train), srfc_pipe.score(X_test, y_test)

(0.8284473398479913, 0.25386493083807976)

In [123]:
X = music['Song']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

sbc_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('sbc', BaggingClassifier())
])

sbc_pipe.fit(X_train, y_train);

In [132]:
sbc_pipe.score(X_train, y_train), sbc_pipe.score(X_test, y_test)

(0.8287187839305103, 0.24979658258746948)

In [129]:
X = music['Song']
y = music['song_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

sadbc_pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english', max_features = 15000, max_df = .95)),
    ('sadbc', AdaBoostClassifier())
])

sadbc_pipe.fit(X_train, y_train);

In [131]:
sadbc_pipe.score(X_train, y_train), sadbc_pipe.score(X_test, y_test)

(0.25841476655808904, 0.22050447518307567)
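The title cells above repeat the same split and the same pipeline shape for each model. One possible way to condense that pattern is a loop over a dictionary of estimators, sketched here on made-up titles. The data, the `models` dictionary, and the fixed `random_state` values are assumptions for illustration.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy titles and labels (illustrative only): 0 = older, 1 = newer.
old_titles = ["twist shout baby", "baby love twist", "shout shout love", "twist baby baby"]
new_titles = ["club party tonight", "party party club", "tonight club dance", "dance party tonight"]
titles = old_titles * 3 + new_titles * 3
labels = [0] * 12 + [1] * 12

# Split once and reuse the same split for every model.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, random_state=42, stratify=labels
)

models = {
    "logr": LogisticRegression(solver="lbfgs", max_iter=1000),
    "rf": RandomForestClassifier(random_state=42),
    "bc": BaggingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    pipe = Pipeline([("cvec", CountVectorizer(stop_words="english")), (name, model)])
    pipe.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each classifier.
    results[name] = (pipe.score(X_train, y_train), pipe.score(X_test, y_test))

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```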

## Conclusion

If I had to choose the best model and the best feature for predicting the decade a song was made, it would be the Logistic Regression model and the song lyrics. This combination produced a model whose testing accuracy is nearly double the baseline score: about 39%, compared with a 20% baseline.