# EDA and Modeling

This notebook explores our lyrics dataset and builds the model for our lyrics classifier. Analysis of model performance and expectations appears throughout.

In [1]:
import pandas as pd
import re
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import collections
import random
import time
import itertools
import nltk
import string
import ast 
import gensim
import plotly
import plotly.graph_objs as go
%matplotlib inline


from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from collections import Counter
from sklearn.ensemble import BaggingClassifier
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from nltk import pos_tag
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.stem import PorterStemmer
from pprint import pprint
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import plot_confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec, KeyedVectors
from sklearn.naive_bayes import MultinomialNB
from bokeh.models import ColumnDataSource
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.palettes import d3
import bokeh.models as bmo
from bokeh.io import save, output_file

Reloading our model

In [None]:
model_doc = Doc2Vec.load(r'C:\Users\Fib0nacci\Desktop\doc2vec.model') #Reloading our model

In [24]:
strt_time = time.time()

# infer_vector takes a list of string tokens. Note that this call passes the single
# literal token 'lyrics_cleaned', not the lyrics stored in that column.
vector = model_doc.infer_vector(['lyrics_cleaned'])

print(vector)
print("Time taken to create doc2vec vectors: " + str(time.time() - strt_time))

[-2.8193034e-03 -4.5900457e-03 -8.6133834e-03  6.7437263e-03
  1.4861921e-03 -3.8491181e-04 -3.0308836e-03 -3.7900792e-03
 -7.2661246e-04 -5.1474189e-03  6.4379019e-03  4.0894968e-04
 -8.7951319e-03  8.4061520e-03 -1.9734043e-03 -4.7886688e-03
 -3.5051818e-03  9.4960555e-03 -2.5042426e-04 -2.9409265e-03
  3.9216005e-03  6.4020315e-03 -5.1882993e-03 -7.6486203e-03
 -5.4829717e-03  2.8772005e-03  9.4751706e-03  5.5605960e-03
 -7.0150779e-03 -2.6032517e-03 -2.5021818e-03 -4.8330887e-03
 -6.5898560e-03  6.0743992e-03 -6.5941624e-03 -7.6210983e-03
 -3.7697426e-03 -4.5107184e-03  4.5187967e-03 -9.5646217e-05
  1.9093503e-03  3.5764819e-03 -2.8145763e-03 -5.5646449e-03
  7.9427762e-03  8.9559291e-04 -9.7134067e-03 -4.4014147e-03
 -5.7344878e-04  4.3869442e-03]
Time taken to create doc2vec vectors: 0.004987239837646484


In [301]:
docvec = model_doc.docvecs[3]  # Let's look at the words most similar to the vector of document 3.
similar_words =  model_doc.wv.most_similar(positive=[docvec])
print(similar_words)

[('layering', 0.5639477968215942), ('empanada', 0.5403931736946106), ('miendamon', 0.5383723974227905), ('ensaladas', 0.5382735729217529), ('federales', 0.536660373210907), ('uwasa', 0.5329006314277649), ('dieta', 0.5163031816482544), ('bakriyon', 0.5093613862991333), ('kaun', 0.5042240619659424), ('scampi', 0.49949026107788086)]
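
`most_similar` ranks vocabulary words by their cosine similarity to the query vector. As a rough illustration of that ranking, here is a minimal sketch on made-up three-dimensional vectors. The words, the vectors, and the helper names `cosine_similarity` and `top_similar` are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_similar(query, vectors, topn=2):
    """Return the topn (word, similarity) pairs, most similar first."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, only for illustration.
toy_vectors = {
    "love": [0.9, 0.1, 0.0],
    "heart": [0.8, 0.2, 0.1],
    "road": [0.0, 0.9, 0.4],
}
query = [1.0, 0.0, 0.0]
result = top_similar(query, toy_vectors)
print(result)
```

Because cosine similarity ignores vector length, two vectors pointing in the same direction score 1.0 regardless of their magnitudes.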


## Visualization

To visualize my doc2vec model, I will use t-SNE, which projects high-dimensional vectors into two or three dimensions so they can be plotted. I will be using the Bokeh library and Plotly's `graph_objs` module (imported as `go`). I want to see how these words are arranged in the embedding space, and the plots below visualize that.

In [1]:
tsne_model = TSNE(n_jobs=4, n_components=2, verbose=1, random_state=42, n_iter=300)

tsne = tsne_model.fit_transform(model_doc.docvecs.vectors_docs)

tsne_df = pd.DataFrame(data=tsne, columns=['x', 'y'])

tsne_df['artists'] = lyrics_df['artists'].values
tsne_df['popularity'] = lyrics_df['popularity'].values
tsne_df['name'] = lyrics_df['name'].values
tsne_df['genres'] = lyrics_df['genres'].values
tsne_df['lyrics_cleaned'] = lyrics_df['lyrics_cleaned'].values


# 2D Visualization
I have picked a few words that appear frequently in our documents. I want to view the words most similar to them in the embedding space. The closer two words are, the more similar they are.

In [206]:
#This cell collects the words most similar to each input word so they can be plotted with t-SNE.
#The visualization code below was adapted from Ruben Winastwan
def append_list(sim_words, words):
    
    list_of_words = []
    
    for i in range(len(sim_words)):
        
        sim_words_list = list(sim_words[i])
        sim_words_list.append(words)
        sim_words_tuple = tuple(sim_words_list)
        list_of_words.append(sim_words_tuple)
        
    return list_of_words

input_word = ['love', 'future', 'roll']
user_input = [x.strip() for x in input_word]
result_word = []
    
for words in user_input:
    
        sim_words = model_doc.wv.most_similar(words, topn = 5)
        sim_words = append_list(sim_words, words)
            
        result_word.extend(sim_words)
    
similar_word = [word[0] for word in result_word]
similarity = [word[1] for word in result_word] 
similar_word.extend(user_input)
labels = [word[2] for word in result_word]
label_dict = dict([(y,x+1) for x,y in enumerate(set(labels))])
color_map = [label_dict[x] for x in labels]
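
To make the bookkeeping in this cell concrete, here is a small sketch that swaps the `most_similar` calls for made-up results (the neighbour words and scores are invented). It uses a compact equivalent of `append_list`, and it builds `label_dict` with `dict.fromkeys` instead of `set`, on the assumption that a stable label numbering across runs is wanted, since set iteration order is not guaranteed.

```python
def append_list(sim_words, words):
    # Attach the query word to each (similar_word, score) pair.
    return [tuple(list(pair) + [words]) for pair in sim_words]

# Made-up stand-in for the output of most_similar(..., topn=2).
fake_results = {
    "love": [("heart", 0.81), ("kiss", 0.77)],
    "roll": [("rock", 0.74), ("ride", 0.70)],
}

result_word = []
for query_word, sim_words in fake_results.items():
    result_word.extend(append_list(sim_words, query_word))

labels = [item[2] for item in result_word]
# dict.fromkeys keeps first-seen order, so the numbering is the same on every run.
label_dict = {label: idx + 1 for idx, label in enumerate(dict.fromkeys(labels))}
color_map = [label_dict[label] for label in labels]

print(result_word[0])
print(color_map)
```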

In [207]:
#This plot will give us a t-SNE visualization of our words across our doc2vec model. We will be able to see what this model looks like in space.
#The visualization code below was adapted from Ruben Winastwan

def display_tsne_scatterplot_2D(model, user_input=None, words=None, label=None, color_map=None, topn=5, sample=10):

    # Fall back to a random sample (or the whole vocabulary) when no words are given.
    # model.wv.vocab assumes the gensim 3.x API used elsewhere in this notebook.
    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.wv.vocab.keys()), sample)
        else:
            words = [ word for word in model.wv.vocab ]
    
    word_vectors = np.array([model.wv[w] for w in words])
    
    two_dim = TSNE(n_components = 2, random_state=0).fit_transform(word_vectors)[:,:2]


    data = []
    count = 0
    
    for i in range (len(user_input)):

                trace = go.Scatter(
                    x = two_dim[count:count+topn,0], 
                    y = two_dim[count:count+topn,1],  
                    text = words[count:count+topn],
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    }
       
                )
               
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter(
                    x = two_dim[count:,0], 
                    y = two_dim[count:,1],  
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    }
                    )

            
    data.append(trace_input)
    
    # Configure the layout

    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=15,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 10),
        autosize = False,
        width = 600,
        height = 600
        )


    plot_figure = go.Figure(data = data, layout = layout)
    plot_figure.show()
    
display_tsne_scatterplot_2D(model_doc, user_input, similar_word, labels, color_map)

# 3D Visualization
Now I will map words that appear in my documents to see where they sit relative to each other in three dimensions. Because the corpus is large, I only visualize a handful of words chosen at random from the doc2vec corpus, together with their most similar words.

In [305]:
#This cell collects the words most similar to each input word so they can be plotted in 3D with t-SNE.
#The visualization code below was adapted from Ruben Winastwan


input_word_3d = ['spell', 'doctor', 'letters', 'goodbye', 'forgive']
user_input_3d = [x.strip() for x in input_word_3d]
result_word = []
    
for words in user_input_3d:
    
        sim_words = model_doc.wv.most_similar(words, topn = 3) #Takes top 3 most similar words in our corpus for each input word
        sim_words = append_list(sim_words, words)
            
        result_word.extend(sim_words)
    
similar_word = [word[0] for word in result_word]
similarity = [word[1] for word in result_word] 
similar_word.extend(user_input_3d)
labels = [word[2] for word in result_word]
label_dict = dict([(y,x+1) for x,y in enumerate(set(labels))])
color_map = [label_dict[x] for x in labels]

In [306]:
def display_tsne_scatterplot_3D(model, user_input=None, words=None, label=None, color_map=None, topn=5, sample=10):

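    # Fall back to a random sample (or the whole vocabulary) when no words are given.
    # model.wv.vocab assumes the gensim 3.x API used elsewhere in this notebook.
    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.wv.vocab.keys()), sample)
        else:
            words = [ word for word in model.wv.vocab ]
    
    word_vectors = np.array([model.wv[w] for w in words])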
    
    three_dim = TSNE(n_components = 3, random_state=0).fit_transform(word_vectors)[:,:3]


    data = []
    count = 0
    
    for i in range (len(user_input)):

                trace = go.Scatter3d(
                    x = three_dim[count:count+topn,0], 
                    y = three_dim[count:count+topn,1], 
                    z = three_dim[count:count+topn,2],
                    text = words[count:count+topn],
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    }
       
                )
               
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter3d(
                    x = three_dim[count:,0], 
                    y = three_dim[count:,1],
                    z = three_dim[count:,2],
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    }
                    )

            
    data.append(trace_input)
    
    # Configure the layout

    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=15,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 10),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)
    plot_figure.show()
    
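# topn must match the topn = 3 passed to most_similar above, otherwise the traces group the wrong words.
display_tsne_scatterplot_3D(model_doc, user_input_3d, similar_word, labels, color_map, topn=3)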
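
Both plotting functions assume that `words` holds `topn` similar words per input word, in order, followed by the input words themselves. A minimal sketch of that slicing, with placeholder strings instead of real vocabulary (`group_words` is a hypothetical helper that mirrors the loop, not part of the notebook), shows why `topn` has to match the value passed to `most_similar`:

```python
def group_words(words, user_input, topn):
    """Split `words` the same way the plotting loop slices it into traces."""
    groups = {}
    count = 0
    for query_word in user_input:
        groups[query_word] = words[count:count + topn]
        count += topn
    groups["input words"] = words[count:]
    return groups

user_input = ["spell", "doctor"]
# Three placeholder neighbours per input word, then the input words themselves.
words = ["s1", "s2", "s3", "d1", "d2", "d3"] + user_input

matching = group_words(words, user_input, topn=3)
mismatched = group_words(words, user_input, topn=5)
print(matching)
print(mismatched)
```

With a matching `topn`, each trace contains exactly the neighbours of its own input word. With a mismatched `topn`, neighbours spill into the wrong trace and the final "input words" trace can end up empty.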

In [75]:
train, test = train_test_split(lyrics_df, test_size=0.3, random_state=42)

In [61]:
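# Tag each lyric within its own split. This assumes every entry of 'lyrics_cleaned' is already a list of tokens.
tagged_tr = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(train['lyrics_cleaned'])]
tagged_test = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(test['lyrics_cleaned'])]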


In [91]:
def vec_for_learning(model, tagged_docs):
    # Infer a vector for every tagged document and return (tags, vectors) as two parallel tuples.
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in tagged_docs])
    return targets, regressors
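
`vec_for_learning` relies on `zip(*pairs)` to turn a list of `(tag, vector)` pairs into two parallel tuples. The sketch below shows that step with stand-ins only: `Doc` mimics the `words` and `tags` fields of `TaggedDocument`, and `ToyModel.infer_vector` is a hypothetical placeholder that returns two simple token statistics rather than a real doc2vec inference.

```python
from collections import namedtuple

# Stand-in with the same field names as gensim's TaggedDocument.
Doc = namedtuple("Doc", ["words", "tags"])

class ToyModel:
    """Hypothetical placeholder for a trained Doc2Vec model."""
    def infer_vector(self, words):
        # Not a real embedding: number of tokens and mean token length.
        return [len(words), sum(len(w) for w in words) / len(words)]

def vec_for_learning(model, tagged_docs):
    pairs = [(doc.tags[0], model.infer_vector(doc.words)) for doc in tagged_docs]
    targets, regressors = zip(*pairs)  # unzip into two parallel tuples
    return targets, regressors

docs = [
    Doc(words=["hold", "me", "close"], tags=["0"]),
    Doc(words=["roll", "on"], tags=["1"]),
]
targets, regressors = vec_for_learning(ToyModel(), docs)
print(targets)
print(regressors)
```

The two tuples stay aligned by position, so `regressors[i]` is always the vector for the document tagged `targets[i]`.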


In [88]:
train_dt, test_dt = train_test_split(data_for_training, test_size=0.3, random_state=42)

# Modeling
For our baseline model, we will use a decision tree classifier with only lightly tuned parameters, to see how the model scores first. I chose a decision tree because it performs a form of feature selection as it picks its splits, and it works on categorical as well as numeric data.

In [152]:
#Decision Tree modeling with tuned hyperparameters
dtc = DecisionTreeClassifier(random_state=42, max_leaf_nodes=150, max_depth=7, min_samples_split=150)
dtc.fit(train_vc, y_train) #With tuning of hyper parameters, our score is still very much the same for decision trees.

print(f'Testing Accuracy {dtc.score(test_vc, y_test)}')
print(f'Training Accuracy {dtc.score(train_vc, y_train)}')

Testing Accuracy 0.4844463036203225
Training Accuracy 0.4848998484280522


In [155]:
#We updated the hyperparameters of the decision tree: the criterion was changed to entropy, a max sample size was set, and bagging was added.
dc = DecisionTreeClassifier(criterion="entropy", max_depth=7)
bag = BaggingClassifier(base_estimator=dc, n_estimators=100, max_samples=0.8, bootstrap=True) #Using bootstrap and bagging to get a better score 
bag = bag.fit(train_vc, y_train)


print(f'Testing Accuracy {bag.score(test_vc, y_test)}')
print(f'Training Accuracy {bag.score(train_vc, y_train)}')

Testing Accuracy 0.5244904167934287
Training Accuracy 0.52737258992454


Our decision tree with a few tuned hyperparameters scored very low (about 48% test accuracy), though its training and testing scores are nearly identical, so there is little to no overfitting. The bagged decision tree scored higher (about 52%) and also shows little to no overfitting. I implemented bagging to combine many weak models into one ensemble because my scores were so low. Both models still scored much lower than anticipated.
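
Bagging trains each base model on a bootstrap sample (drawn with replacement) and predicts by majority vote. Here is a minimal sketch of those two steps. It uses a deliberately weak hypothetical base learner, `MajorityClassStump`, which ignores features and always predicts the most common label in its sample, so it only illustrates the resampling and voting, not a real decision tree.

```python
import random
from collections import Counter

class MajorityClassStump:
    """Toy base learner: always predicts the most common label it was trained on."""
    def fit(self, labels):
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self):
        return self.prediction

def bootstrap_sample(items, fraction, rng):
    """Draw int(len(items) * fraction) items with replacement."""
    size = max(1, int(len(items) * fraction))
    return [rng.choice(items) for _ in range(size)]

def bagged_predict(labels, n_estimators=100, max_samples=0.8, seed=42):
    """Fit one stump per bootstrap sample and return the majority vote."""
    rng = random.Random(seed)
    votes = [
        MajorityClassStump().fit(bootstrap_sample(labels, max_samples, rng)).predict()
        for _ in range(n_estimators)
    ]
    return Counter(votes).most_common(1)[0][0]

toy_labels = ["pop"] * 7 + ["rock"] * 3
prediction = bagged_predict(toy_labels)
print(prediction)
```

Individual stumps can disagree because each sees a different resample, but the vote across many of them is far more stable than any single one.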

### Modeling with Random Forest and Grid Search

I will now model with a random forest classifier tuned with a grid search. Since I used decision trees for my baseline model, the natural next step is a random forest: bagged decision trees in which each split considers only a random subset of the features. Because trees split on thresholds rather than on raw magnitudes, random forests also tend to be robust to outliers, which could improve our score. I am also comfortable using random forest because of the models trained in the audio modeling notebook.

In [146]:
#A portion of this code was adapted from https://stackoverflow.com/questions

from sklearn.model_selection import RepeatedStratifiedKFold

rfc = RandomForestClassifier() 
n_estimators = [10, 100, 1000]
max_features = ['sqrt', 'log2']

# define grid search
grid = dict(n_estimators=n_estimators,max_features=max_features)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=rfc, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(train_vc, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.723764 using {'max_features': 'sqrt', 'n_estimators': 1000}
0.707553 (0.005223) with: {'max_features': 'sqrt', 'n_estimators': 10}
0.721091 (0.005722) with: {'max_features': 'sqrt', 'n_estimators': 100}
0.723764 (0.005407) with: {'max_features': 'sqrt', 'n_estimators': 1000}
0.707309 (0.005658) with: {'max_features': 'log2', 'n_estimators': 10}
0.720532 (0.005369) with: {'max_features': 'log2', 'n_estimators': 100}
0.723422 (0.005396) with: {'max_features': 'log2', 'n_estimators': 1000}
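
The six result rows above come from the Cartesian product of the two parameter lists. As a quick check of how much work this search does, the sketch below enumerates the grid with `itertools.product` and counts the fits implied by `RepeatedStratifiedKFold(n_splits=10, n_repeats=3)`. The final `+ 1` assumes `GridSearchCV` keeps its default `refit=True`.

```python
import itertools

n_estimators = [10, 100, 1000]
max_features = ['sqrt', 'log2']

# Every (max_features, n_estimators) pair that the grid search evaluates.
combos = [
    {'max_features': mf, 'n_estimators': ne}
    for mf, ne in itertools.product(max_features, n_estimators)
]

folds_per_combo = 10 * 3                   # n_splits * n_repeats
cv_fits = len(combos) * folds_per_combo    # models trained during cross-validation
total_fits = cv_fits + 1                   # plus one refit on the full training set

print(len(combos), cv_fits, total_fits)
```

Since the scores for 100 and 1000 trees differ by less than one standard deviation, the smaller forest may be a reasonable trade-off when training time matters.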


In [150]:
print('Random Forest Train score', grid_result.score(train_vc, y_train)) #Accuracy of the best estimator found by the grid search
print('Random Forest Test score', grid_result.score(test_vc, y_test))

Random Forest Train score 0.868735433609857
Random Forest Test score 0.7290462427745664


Our score has improved immensely! Though it is still not as high as we might like, we reached a test score of roughly 73% and a training score of roughly 87%. That gap of about 14 percentage points is clear evidence of overfitting, so to reduce the variance we would need to explore tuning our hyperparameters further. Even so, the model scored decently compared with the decision tree models above.
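
The gap between training and testing accuracy can be computed directly from the scores printed in this notebook. The sketch below only copies those printed numbers into a dictionary and does not retrain anything.

```python
# (train accuracy, test accuracy) copied from the outputs above.
scores = {
    "decision tree": (0.4848998484280522, 0.4844463036203225),
    "bagged decision tree": (0.52737258992454, 0.5244904167934287),
    "random forest": (0.868735433609857, 0.7290462427745664),
}

# A large positive gap means the model fits the training data much better than unseen data.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap of {gap * 100:.1f} percentage points")
```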

In [159]:
preds_grid = grid_result.predict(test_vc)

In [162]:
#Let's print a classification report to look at our per-genre precision, recall and F1 values.
print(classification_report(y_test, preds_grid))

              precision    recall  f1-score   support

           0       0.84      0.71      0.77       770
           1       0.91      0.61      0.73       167
           2       0.82      0.69      0.75      1337
           3       0.62      0.52      0.56       572
           4       0.81      0.57      0.67       158
           5       0.78      0.80      0.79      3498
           6       0.71      0.57      0.63       292
           7       0.64      0.28      0.39       476
           8       0.76      0.70      0.73      6542
           9       0.70      0.60      0.64      4730
          10       0.69      0.87      0.77      7754

    accuracy                           0.73     26296
   macro avg       0.75      0.63      0.68     26296
weighted avg       0.73      0.73      0.72     26296
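
To show how the summary rows relate to the per-genre rows, the sketch below recomputes a few of them from the rounded values in the report. Because the inputs are already rounded to two decimals, the results are only expected to agree with the report after rounding.

```python
# (precision, recall, support) per genre, copied from the report above.
report = {
    0: (0.84, 0.71, 770),
    1: (0.91, 0.61, 167),
    2: (0.82, 0.69, 1337),
    3: (0.62, 0.52, 572),
    4: (0.81, 0.57, 158),
    5: (0.78, 0.80, 3498),
    6: (0.71, 0.57, 292),
    7: (0.64, 0.28, 476),
    8: (0.76, 0.70, 6542),
    9: (0.70, 0.60, 4730),
    10: (0.69, 0.87, 7754),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n_classes = len(report)
total_support = sum(s for _, _, s in report.values())

# Macro average: every genre counts equally, however few songs it has.
macro_precision = sum(p for p, _, _ in report.values()) / n_classes
macro_recall = sum(r for _, r, _ in report.values()) / n_classes

# Weighted average: each genre counts in proportion to its support.
# Support-weighted recall is the same quantity as overall accuracy.
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

print(round(macro_precision, 2), round(macro_recall, 2), round(weighted_recall, 2))
print(round(f1(*report[7][:2]), 2))
```

The macro recall (0.63) sits well below the weighted recall (0.73) because small genres with low recall, such as genre 7, pull the unweighted mean down while barely affecting the weighted one.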



# Conclusions and Next Steps
Our NLP analysis pointed out a few key things. The first is that it was a mistake not to omit Spanish or Latin stop words. These showed up in our plots and dominated our word count for the Latin genre. There was also word overlap between the classical and Latin genres. I suspect this was due to our previous genre classification method and how these songs were categorized. Revisiting the features and methods of classification is necessary to improve our scores.
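
One way to address the stop word issue is to filter tokens against a combined English and Spanish stop word set before counting words. The sketch below uses tiny hand-written sets purely for illustration. In practice the lists could come from NLTK's `stopwords.words('english')` and `stopwords.words('spanish')`, assuming the NLTK stop word corpus has been downloaded. `remove_stop_words` is a hypothetical helper, not something defined in this project.

```python
from collections import Counter

# Tiny illustrative sets; real stop word lists are much longer.
ENGLISH_STOPS = {"the", "and", "of", "to", "in"}
SPANISH_STOPS = {"el", "la", "de", "que", "y", "en"}
ALL_STOPS = ENGLISH_STOPS | SPANISH_STOPS

def remove_stop_words(tokens, stop_words=ALL_STOPS):
    """Drop any token whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "la vida de mi corazon and the rhythm of la noche".split()
cleaned = remove_stop_words(tokens)

print(cleaned)
print(Counter(cleaned).most_common(3))
```

With only English stop words removed, `la` would be the most frequent token in this toy lyric, which mirrors what happened in the Latin genre word counts.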

As concluded from our prior model in our audio notebook, our random forest model scored highest out of the three models we trained. Our lyrics classifier is still overfit, however, with a gap of roughly 14 percentage points between training (87%) and testing (73%) accuracy. A positive is that this classifier, trained on lyrics, scored higher on its testing set than our audio classifier did on its own, which suggests that our feature selection was better here. Our decision tree models scored roughly the same as each other and were still very low, which leads me to believe those models just weren't the best fit for this specific project. They were not nearly as overfit as our random forest model, though: the gap between their training and testing scores is next to none. I would need to explore why these models scored poorly if I wanted to revisit training them.
In our classification report, we see that precision and recall are closer in range for our random forest model. Genres 7 and 10 are the exceptions: genre 7 has a recall of only 0.28 against a precision of 0.64, while genre 10 has a recall of 0.87 against a precision of 0.69. Aside from those two, our model performed much better at identifying true positives and true negatives.

## Next Steps
1. Since we stripped our classifier of all audio attributes, we did not get a chance to run it in tandem with our audio features model. It would be beneficial to combine both datasets and train another model on a master dataset. This would be best done after the features are revisited and explored further, so as to optimize our model's score.

2. Cleaning up our stop words would give us a better corpus to train on. Adding the stop words from other languages would give us different results for our EDA.

3. Building an ensemble model would be another way of improving our score. Rather than combining our datasets, our models could be combined to produce another score. This can be done before feature exploration or now with the models we have trained.
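
As a sketch of the third idea, soft voting averages the class probabilities produced by each model and picks the class with the highest average. The probabilities below are invented and `soft_vote` is a hypothetical helper. scikit-learn's `VotingClassifier(voting='soft')` applies the same idea to estimators trained on one shared feature matrix, so combining a lyrics model with an audio model would likely need custom code along these lines.

```python
def soft_vote(probabilities_per_model, weights=None):
    """Average per-class probabilities across models and return (winner, averages)."""
    n_models = len(probabilities_per_model)
    weights = weights or [1.0] * n_models
    total_weight = sum(weights)
    classes = probabilities_per_model[0].keys()
    averaged = {
        c: sum(w * probs[c] for w, probs in zip(weights, probabilities_per_model)) / total_weight
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged

# Invented probabilities for one song from two hypothetical classifiers.
lyrics_probs = {"latin": 0.50, "pop": 0.30, "rock": 0.20}
audio_probs = {"latin": 0.20, "pop": 0.55, "rock": 0.25}

winner, averaged = soft_vote([lyrics_probs, audio_probs])
print(winner, averaged)
```

The optional `weights` argument lets the stronger model count for more, for example by weighting each model by its test accuracy.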

Ultimately, our lyrics classifier and our audio classifier are promising, but further feature selection and exploration are necessary if we want this to become a reliable way to predict the genres of songs.