# EDA and Modeling

This notebook explores our lyrics dataset and builds the model for our lyrics classifier. Analysis of model performance and expectations appears throughout.

In [1]:
import pandas as pd
import re
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import collections
import random
import time
import itertools
import nltk
import string
import ast 
import gensim
import plotly
import plotly.graph_objs as go
%matplotlib inline


from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from collections import Counter
from sklearn.ensemble import BaggingClassifier
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from nltk import pos_tag
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.stem import PorterStemmer
from pprint import pprint
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import plot_confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec, KeyedVectors
from sklearn.naive_bayes import MultinomialNB
from bokeh.models import ColumnDataSource
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.palettes import d3
import bokeh.models as bmo
from bokeh.io import save, output_file

Reloading our Word2Vec model

In [16]:
model_vec = Word2Vec.load(r'C:\Users\Fib0nacci\Desktop\ML_genres.model') #Reloading our model

In [17]:
model_bin = Word2Vec.load(r'C:\Users\Fib0nacci\Desktop\data\model.bin') #reloading our bin

In [18]:
#Testing our vector on the word "baby"
print(model_vec.wv.vocab["baby"].index)

In [19]:
#printing the size of our vector for one word "baby". We want to see what this looks like.
print(len(model_vec.wv['baby']))

In [20]:
#Total number of words.
print(len(model_vec.wv.vocab))

Let's repeat this process for a few more words that appeared in our top-10 lists above.

In [262]:
#This function will print the word index and vector length generated for each input.
def word_info(word):
    print(f"Index of the word {word}:")
    print(model_vec.wv.vocab[word].index)
    print("Length of the vector generated for a word")
    print(len(model_vec.wv[word]))


In [21]:
word_info('love')
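
The cells above read the vocabulary through `wv.vocab`, which is the gensim 3.x interface. In gensim 4.0 and later that attribute was removed in favour of `wv.key_to_index`. The sketch below shows one possible way to support both. The helper names `vocab_index` and `vocab_size` are hypothetical, and the `SimpleNamespace` objects are stand-ins for a real `KeyedVectors` instance so the example runs without a trained model.

```python
from types import SimpleNamespace


def vocab_index(wv, word):
    """Return the index of `word`, whichever gensim vocabulary layout `wv` uses."""
    if hasattr(wv, "key_to_index"):       # gensim >= 4.0
        return wv.key_to_index[word]
    return wv.vocab[word].index           # gensim 3.x


def vocab_size(wv):
    """Return the number of words known to `wv`."""
    if hasattr(wv, "key_to_index"):
        return len(wv.key_to_index)
    return len(wv.vocab)


# Stand-ins for the two layouts (hypothetical data, not our real model).
new_style = SimpleNamespace(key_to_index={"baby": 0, "love": 1})
old_style = SimpleNamespace(
    vocab={"baby": SimpleNamespace(index=0), "love": SimpleNamespace(index=1)}
)

print(vocab_index(new_style, "love"), vocab_index(old_style, "love"))
print(vocab_size(new_style), vocab_size(old_style))
```

With a real model, the call would be `vocab_index(model_vec.wv, 'love')`.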

Now, let's pull out some words that are strongly related and look at their similarity values. We start with a pair of opposites, but I also want to look at words that are closely related.

In [271]:
def most_similar(w1, w2):
    print(model_vec.wv.similarity(w1=w1, w2=w2))

In [6]:
most_similar('happy', 'sad') #opposites
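
The value returned by `wv.similarity` is the cosine similarity between the two word vectors. As a minimal sketch of that calculation, the example below computes it by hand. The three-dimensional vectors are made-up toy values, not taken from our model, and the function name `cosine_similarity` is only illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values chosen for illustration only).
happy = [0.9, 0.1, 0.3]
glad = [0.8, 0.2, 0.4]
table = [-0.2, 0.9, -0.5]

print(cosine_similarity(happy, glad))   # vectors pointing in a similar direction
print(cosine_similarity(happy, table))  # vectors pointing in different directions
```

The score ranges from -1 to 1. Opposites such as 'happy' and 'sad' can still score fairly high, because Word2Vec places words that occur in similar contexts close together.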

In [4]:
model_vec.wv.most_similar('love') #The scores for words similar to 'love' are generally in the higher range.

In [8]:
vocab = list(model_bin.wv.vocab) #The list of words in the word 2 vec bin.

In [9]:
len(vocab)

In [5]:
import time
start_time = time.time() #Starting a timer to see how long our model takes to train on our lyrics data.

model_vec.train(lyrics_df['lyrics_cleaned'], total_examples=model_vec.corpus_count, epochs=50) #Training for 50 epochs.

print(f"Training took {time.time() - start_time:.1f} seconds") #Reporting the elapsed time.

## Visualizing Word2vec Embeddings

I will use plotly to visualize the words in our word2vec corpus. This will allow me to analyze the positions of my word vectors and the relationships among them.

In [10]:
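#A portion of this code was adapted from Eric Klepplen on towardsdatascience.com
words = list(model_bin.wv.vocab) #The words to plot; also used to label each point.
X = model_vec.wv[words] #Looking up the vector for each word.
pca = PCA(n_components = 2) #Implementing PCA to reduce our vectors to 2 dimensions.

res = pca.fit_transform(X) #Fitting PCA and projecting our vectors.
#Creating a dataframe
pca_df = pd.DataFrame(res, columns = ['x', 'y'])

pca_df['words'] = words
pca_df.head()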
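
To make the projection step more concrete, here is a rough, dependency-free sketch of what a two-component PCA does: center the data, build the covariance matrix, find its two strongest directions, and project onto them. scikit-learn's `PCA` uses a singular value decomposition internally; this sketch swaps in power iteration with deflation purely for readability. The function `pca_2d` and the toy data are hypothetical and only meant for illustration.

```python
import math


def pca_2d(rows, iters=100):
    """Project rows onto their top two principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    components = []
    for _component in range(2):
        # Deterministic, non-uniform start vector, normalised to unit length.
        start = [1.0 / (j + 1) for j in range(d)]
        scale = math.sqrt(sum(x * x for x in start))
        v = [x / scale for x in start]
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigval = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflation: remove the direction just found so the next pass finds the second one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]

    # Coordinates of every centered row along the two components.
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Toy 3-D points: most spread along the first axis, less along the second, none along the third.
toy = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 1.0, 0.0]]
for point in pca_2d(toy):
    print([round(coord, 6) for coord in point])
```

The sign of each component is arbitrary, so a plot produced this way may be mirrored relative to one produced by another implementation.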

In [26]:
#A portion of this code was adapted from Eric Klepplen on towardsdatascience.com
N = len(pca_df) #One random color value per plotted word
words = list(model_bin.wv.vocab)
fig = go.Figure(data=go.Scattergl( #Using go to visualize our scatterplot
    x = pca_df['x'],
    y = pca_df['y'],
    mode='markers',
    marker=dict(
        color=np.random.randn(N), #Random values that are mapped onto the colorscale
        colorscale='plasma',
        line_width=1
    ),
    text=pca_df['words'], #The word shown when hovering over a point
    textposition="bottom center"
))
fig.show()

In [51]:
from tqdm import tqdm #This is a progress bar to help us see how far our vectorizer is in transforming our X_train and X_test.

In [27]:
w2v_words = set(model.wv.vocab) #Using a set so that each "word in w2v_words" check is fast.

In [53]:
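#A portion of this code was adapted from stackoverflow.com/
def vectorized(X_train):
    #Returns one averaged word vector per song in X_train['lyrics_cleaned'].
    vector = []
    for sent in tqdm(X_train['lyrics_cleaned'].values):
        sent_vec = np.zeros(model.wv.vector_size) #One slot per embedding dimension
        count = 0
        for word in sent:
            if word in w2v_words: #Skipping words the model has never seen
                vec = model.wv[word]
                sent_vec += vec
                count += 1
        if count != 0:
            sent_vec /= count #Averaging the vectors of the words we found
        vector.append(sent_vec)
    return vector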
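
The function above represents each song as the average of the vectors of its known words. The sketch below shows the same idea on toy data, without gensim or numpy. The name `average_vector` and the two-dimensional toy embeddings are hypothetical, not values from our model.

```python
def average_vector(tokens, vectors, size):
    """Average the embeddings of the tokens found in `vectors`; zeros if none are found."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:  # word not in the vocabulary, skip it
            continue
        total = [t + x for t, x in zip(total, vec)]
        count += 1
    if count:
        total = [t / count for t in total]
    return total


# Toy embeddings (hypothetical values chosen for illustration only).
toy_vectors = {"love": [1.0, 0.0], "baby": [0.0, 1.0]}

print(average_vector(["love", "baby", "zzz"], toy_vectors, 2))
print(average_vector(["zzz"], toy_vectors, 2))
```

Note that this approach expects each song to be a list of tokens. If a `lyrics_cleaned` entry were a plain string instead, the inner loop would walk over single characters rather than words.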


# Doc2Vec 

I will now go further by using doc2vec to create a vectorized representation of word groups taken collectively. I want to do this so we can better understand the relationships among our lyrics and our genres.

In [12]:
#We will begin by implementing a function to tag our document.
def tagged_doc(ls_of_ls_of_words):
    for i, list_of_words in enumerate(ls_of_ls_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_doc(lyrics_df['lyrics_cleaned']))
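
To see what the tagging step produces, the sketch below mirrors `tagged_doc` on two toy songs. gensim's `TaggedDocument` is a named tuple with the fields `words` and `tags`; the `TaggedDoc` type here is a hypothetical stand-in with the same fields so the example runs without gensim, and the toy lyrics are made up.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_documents(token_lists):
    """Yield one tagged document per token list, tagged with its position."""
    for i, tokens in enumerate(token_lists):
        yield TaggedDoc(words=tokens, tags=[i])


# Toy corpus (hypothetical data, not our real lyrics).
toy_lyrics = [["love", "me", "do"], ["hello", "goodbye"]]
tagged = list(tag_documents(toy_lyrics))

for doc in tagged:
    print(doc.tags, doc.words)
```

Because each tag is simply the row position, the vector that Doc2Vec learns for tag `i` corresponds to the `i`-th song in the input.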

In [28]:
model_doc = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=10)

I am using a vector size of 50 as suggested by the documentation at *https://radimrehurek.com/gensim/*. I chose 10 epochs because I have produced thousands of documents; since my dataset is large, I want to reduce the number of training iterations. I chose a minimum word count of 2 to discard words that have few occurrences, which was also a suggestion of the documentation stated above.
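
As a small illustration of the `min_count` setting described above, the sketch below counts word frequencies across a toy corpus and keeps only the words that occur at least `min_count` times in total. The helper `kept_vocabulary` and the toy lyrics are hypothetical; gensim performs this filtering itself when the vocabulary is built.

```python
from collections import Counter


def kept_vocabulary(token_lists, min_count=2):
    """Return the words whose total frequency across all documents is at least min_count."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}


# Toy corpus (hypothetical data, not our real lyrics).
toy_lyrics = [["love", "love", "baby"], ["baby", "yeah"]]

print(sorted(kept_vocabulary(toy_lyrics, min_count=2)))
print(sorted(kept_vocabulary(toy_lyrics, min_count=1)))
```

With `min_count=2`, the word "yeah" is dropped because it appears only once in the whole corpus.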

In [23]:
model_doc.build_vocab(data_for_training)

In [24]:
model_doc.train(data_for_training, total_examples=model_doc.corpus_count, epochs=model_doc.epochs)

In [25]:
model_doc.save(r'C:\Users\Fib0nacci\Desktop\data\doc2vec.model') #Saving our vectors 

# Part II
Part 2, which covers Doc2Vec and visualization, is in the second notebook, named Lyrics_Data_Modeling_Part_2.
