## Project Name: NLP Analysis of Lyrics 

### Predictive Model
This notebook contains the predictive model run on the dataset. First, we use a bag-of-words model to build our variables: a count of each word in the lyrics. Those counts are then run through a neural network that produces a multiclass output (one probability per artist) to predict the artist.

### Project Submission Group Members
- Group member 1
    - Name: Yash Gupta
    - Email: yg444@drexel.edu
- Group member 2
    - Name: Shubham Jadhav
    - Email: sj3237@drexel.edu

`The section of code below imports our first dataset and creates our total bag-of-words object. This object contains every word from every song. It is zeroed out and reused for each song to keep a consistent feature space for our neural network. The output of the cell below is the counter object containing all of the words across all of the songs.`

In [2]:
# import the dataset and develop our [target, bag of words]
# array for predictive model 1

from collections import Counter
import pandas as pd
import numpy as np
import spacy

df = pd.read_csv("data/Songs.csv") 

# collect the unique artists in order of first appearance
visited = list(dict.fromkeys(df['Artist']))
count = len(visited)

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 100000000

def count_words_foundation(paragraph, frequency, pos = True, lemma = True):
    sentences = nlp(paragraph)
    for word in sentences:
        if word.pos_ == "SPACE" or word.pos_ == "PUNCT":
            continue
        frequency[(word.lemma_, (word.pos_ if pos else ""))] += 1
    return frequency

def count_words_extra(paragraph, frequency, pos = True, lemma = True):
    sentences = nlp(paragraph)
    for word in sentences:
        if word.pos_ == "SPACE" or word.pos_ == "PUNCT":
            continue
        if (word.lemma_, (word.pos_ if pos else "")) in frequency.keys():
            frequency[(word.lemma_, (word.pos_ if pos else ""))] += 1
    return frequency

f = Counter()
bagOfWordsTotal = count_words_foundation(" ".join(df['Lyrics'].values), f)
print(bagOfWordsTotal)
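
As a small illustration of the shared-vocabulary idea described above, the sketch below builds a corpus-wide `Counter` and then counts each song against a zeroed copy of it. It assumes plain whitespace tokenization in place of the spaCy (lemma, POS) keys used in the notebook. The toy lyrics and the helper names `build_vocabulary` and `count_in_vocabulary` are hypothetical.

```python
from collections import Counter

def build_vocabulary(texts):
    """Count every token across all texts (lowercased, whitespace split)."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.lower().split())
    return vocab

def count_in_vocabulary(text, vocab):
    """Count the tokens of one text, keeping exactly the vocabulary's keys."""
    counts = Counter({token: 0 for token in vocab})  # zeroed copy of the vocabulary
    for token in text.lower().split():
        if token in counts:  # ignore tokens outside the vocabulary
            counts[token] += 1
    return counts

toy_lyrics = ["love me love you", "you and me"]  # toy data, not from Songs.csv
vocab = build_vocabulary(toy_lyrics)
per_song = [count_in_vocabulary(text, vocab) for text in toy_lyrics]
```

Every per-song counter ends up with the same keys in the same order, which is what lets each song become a fixed-length feature vector later on.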





`Now we will be constructing the array of counter objects and artist targets for each song. This constructed array will serve as the initial storage for the targets and variables associated with each song.`

In [3]:
bagOfWordsSongs = []
for ind in df.index:
    visit = visited.index(df['Artist'][ind])
    targetArray = []
    for i in range(count):
        if i == visit:
            targetArray.append(1)
        else:
            targetArray.append(0)
    counterZero = Counter()
    for item in bagOfWordsTotal:
        counterZero[item] = 0
    bagOfWordsSongs.append([targetArray, count_words_extra(df['Lyrics'][ind], counterZero)])
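
The target array built in the loop above is a one-hot encoding of the artist index. A compact way to express the same idea is sketched below. The helper name `one_hot` and the artist labels are hypothetical and only serve as an illustration.

```python
def one_hot(index, num_classes):
    """Return a list with a 1 at position `index` and 0 everywhere else."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(num_classes)]

artists = ["A", "B", "C"]             # hypothetical artist names
song_artists = ["B", "A", "C", "B"]   # hypothetical artist label for each song

# one target row per song, with a single 1 marking that song's artist
targets = [one_hot(artists.index(name), len(artists)) for name in song_artists]
```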

`Now we construct a better representation by converting each counter object into a NumPy array of its counts. We then split these counts and targets into their own variables, varData and tarData.`

In [4]:
nnData = []
for entry in bagOfWordsSongs:
    nnDataVars = []
    for item in entry[1]:
        nnDataVars.append(entry[1][item])
    nnData.append([np.array(entry[0], np.int32), np.array(nnDataVars, np.int32)])
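
The conversion above relies on every per-song counter iterating its keys in the same order. That holds here because each zeroed counter is filled by iterating over `bagOfWordsTotal`, and Python dictionaries preserve insertion order. A sketch of an alternative that does not depend on insertion order is to index each counter with an explicit vocabulary list. The vocabulary, the two toy counters, and the helper `to_vector` are hypothetical.

```python
from collections import Counter

vocab = ["love", "me", "you", "and"]  # hypothetical fixed vocabulary order

def to_vector(counts, vocab):
    """Read counts in vocabulary order; a missing token counts as 0."""
    return [counts[token] for token in vocab]

# counters whose keys were inserted in different orders
song_a = Counter({"you": 1, "love": 2})
song_b = Counter({"and": 1, "you": 1, "me": 1})

vec_a = to_vector(song_a, vocab)
vec_b = to_vector(song_b, vocab)
```

Both vectors have one entry per vocabulary word, so column `j` always refers to the same word regardless of how each counter was built.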

In [5]:
# change our [target, bag of words] array into
# variables/targets for predictive model 1

import math

nnData = np.asarray(nnData, dtype=object)
np.random.shuffle(nnData)
varData = np.array(nnData[:, 1])
tarData = np.array(nnData[:, 0])
newVarData = []
newTarData = []
for row in varData:
    newVarData.append(row)
newVarData = np.array(newVarData)
for row in tarData:
    newTarData.append(row)
newTarData = np.array(newTarData)
print(newVarData.shape)

(745, 10239)


`Below is our TensorFlow neural network. We have 2 dense layers with 100 and 50 units respectively, both using a ReLU activation function. Each of these layers is followed by a dropout layer with a 25% rate to help reduce overfitting, and we have a final dense layer with 21 outputs (since we have 21 artists) and a softmax activation. Our model uses a categorical cross-entropy loss function with Adam as our optimizer and a learning rate of 10^-4. Finally, we run our model for 200 epochs with a batch size of 70, and we use 25% of our data for validation.`

In [7]:
# run our tensorflow neural network for predictive model 1

import tensorflow as tf
from tensorflow import keras

m1 = keras.Sequential([
        keras.layers.Dense(units=100, activation='relu'),
        keras.layers.Dropout(.25),
        keras.layers.Dense(units=50, activation='relu'),
        keras.layers.Dropout(.25),
        keras.layers.Dense(units=21, activation='softmax')
    ])

m1.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), 
              loss=tf.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

history1 = m1.fit(newVarData, newTarData, epochs = 200, batch_size = 70, shuffle = True, validation_split = .25)

Epoch 1/200
...
Epoch 78
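
To get a feel for the size of the network described above, the arithmetic below counts the trainable parameters of the three dense layers, using the 10239 input features from the printed shape. It also estimates the number of training samples and batches per epoch. This assumes Keras keeps the first floor(n * 0.75) samples for training when `validation_split = .25`, which is an assumption about library behavior and not something the notebook output confirms.

```python
import math

def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 10239           # from the printed shape (745, 10239)
layer_units = [100, 50, 21]  # the three Dense layers of model 1

sizes = [n_features] + layer_units
total_params = sum(dense_params(i, o) for i, o in zip(sizes, sizes[1:]))

n_samples = 745
n_train = math.floor(n_samples * (1 - 0.25))  # assumed Keras split rule
steps_per_epoch = math.ceil(n_train / 70)     # batch size of 70

print(total_params, n_train, steps_per_epoch)
```

Almost all of the parameters sit in the first layer, because the bag-of-words input is far wider than any hidden layer.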

`Now we look at our second dataset. We have to scan this dataset differently since each artist is in a separate CSV file. Below is the number of songs in each CSV file. The artist with the fewest songs is Ed Sheeran with 296, so this is the number of songs per artist we will use. There are 12 artists in the directory, so we will have 12 targets for our neural network.`

In [8]:
# count the songs in each artist's CSV file
# for predictive model 2

import os
directory = 'data/DataSet2/'

for filename in os.scandir(directory):
    if filename.is_file():
        df = pd.read_csv(filename.path)
        print(str(len(df.index)) + " " + str(filename))

308 <DirEntry 'ArianaGrande.csv'>
406 <DirEntry 'Beyonce.csv'>
344 <DirEntry 'ColdPlay.csv'>
466 <DirEntry 'Drake.csv'>
296 <DirEntry 'EdSheeran.csv'>
521 <DirEntry 'Eminem.csv'>
348 <DirEntry 'JustinBieber.csv'>
325 <DirEntry 'KatyPerry.csv'>
402 <DirEntry 'LadyGaga.csv'>
323 <DirEntry 'NickiMinaj.csv'>
405 <DirEntry 'Rihanna.csv'>
479 <DirEntry 'TaylorSwift.csv'>
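
The choice of 296 songs per artist can be reproduced from the counts printed above. The sketch below copies those counts into a dictionary (the variable names are hypothetical) and finds the smallest one, which gives the size of the balanced dataset.

```python
# song counts copied from the output above
song_counts = {
    "ArianaGrande": 308, "Beyonce": 406, "ColdPlay": 344, "Drake": 466,
    "EdSheeran": 296, "Eminem": 521, "JustinBieber": 348, "KatyPerry": 325,
    "LadyGaga": 402, "NickiMinaj": 323, "Rihanna": 405, "TaylorSwift": 479,
}

smallest_artist = min(song_counts, key=song_counts.get)  # artist with fewest songs
max_songs = song_counts[smallest_artist]                 # songs kept per artist
total_songs = max_songs * len(song_counts)               # size of the balanced set

print(smallest_artist, max_songs, total_songs)
```

Truncating every artist to the same number of songs keeps the classes balanced, so a random guess is right 1 time in 12.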


`We build our variables the same way we did with the first dataset, using a counter object that spans all of the songs. We can see this counter object in the output below. Afterwards, we keep only the words that occur at least 5 times across all songs.`

In [9]:
# import the dataset and develop our [target, bag of words]
# array for predictive model 2

count = 12 # 12 artists
max_songs = 296 # Ed Sheeran has the fewest songs after cutting some artists

bagOfWordsTotal2 = Counter()
for filename in os.scandir(directory):
    if filename.is_file():
        df2 = pd.read_csv(filename.path).head(max_songs)
        print(str(len(df2.index)) + " " + str(filename))
        # join the lyrics themselves, casting each entry to str;
        # calling str() on the whole array would pass its printed repr to spaCy
        lyrics = " ".join(df2['Lyric'].astype(str))
        bagOfWordsTotal2 = count_words_foundation(lyrics, bagOfWordsTotal2)
print(bagOfWordsTotal2)
# keep only the words that appear at least 5 times across all songs
bagOfWordsTotal2 = Counter({el: n for el, n in bagOfWordsTotal2.items() if n >= 5})

296 <DirEntry 'ArianaGrande.csv'>
296 <DirEntry 'Beyonce.csv'>
296 <DirEntry 'ColdPlay.csv'>
296 <DirEntry 'Drake.csv'>
296 <DirEntry 'EdSheeran.csv'>
296 <DirEntry 'Eminem.csv'>
296 <DirEntry 'JustinBieber.csv'>
296 <DirEntry 'KatyPerry.csv'>
296 <DirEntry 'LadyGaga.csv'>
296 <DirEntry 'NickiMinaj.csv'>
296 <DirEntry 'Rihanna.csv'>
296 <DirEntry 'TaylorSwift.csv'>
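
The last line of the cell above prunes the vocabulary to words that occur at least 5 times. A minimal sketch of that filtering step on toy data follows. The helper name `prune_vocabulary` and the toy counts are hypothetical, and the keys mimic the (lemma, POS) tuples used in the notebook.

```python
from collections import Counter

def prune_vocabulary(vocab, min_count=5):
    """Keep only the entries whose total count is at least `min_count`."""
    return Counter({key: n for key, n in vocab.items() if n >= min_count})

toy_vocab = Counter({
    ("love", "NOUN"): 12,
    ("you", "PRON"): 5,
    ("zephyr", "NOUN"): 1,
    ("run", "VERB"): 4,
})

pruned = prune_vocabulary(toy_vocab)
print(pruned)
```

Dropping rare words shrinks the input layer and removes features that appear in too few songs to be informative.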


`Now we again build the array that pairs each song's target with its counter object. Then we change the array so that it holds only the counts from each counter object, stored as a NumPy array. Finally, we split it into 2 arrays of variables and targets.`

In [10]:
bagOfWordsSongs2 = []
for artist, filename in enumerate(os.scandir(directory)):
    if filename.is_file():
        df2 = pd.read_csv(filename.path).head(max_songs)
        print(str(len(df2.index)) + " " + str(filename))
        for ind in df2.index:
            targetArray = []
            for i in range(12):
                if i == artist:
                    targetArray.append(1)
                else:
                    targetArray.append(0)
            counterZero = Counter()
            for item in bagOfWordsTotal2:
                counterZero[item] = 0
            bagOfWordsSongs2.append([targetArray, count_words_extra(str(df2['Lyric'][ind]), counterZero)])

296 <DirEntry 'ArianaGrande.csv'>
296 <DirEntry 'Beyonce.csv'>
296 <DirEntry 'ColdPlay.csv'>
296 <DirEntry 'Drake.csv'>
296 <DirEntry 'EdSheeran.csv'>
296 <DirEntry 'Eminem.csv'>
296 <DirEntry 'JustinBieber.csv'>
296 <DirEntry 'KatyPerry.csv'>
296 <DirEntry 'LadyGaga.csv'>
296 <DirEntry 'NickiMinaj.csv'>
296 <DirEntry 'Rihanna.csv'>
296 <DirEntry 'TaylorSwift.csv'>


In [11]:
nnData2 = []
for entry in bagOfWordsSongs2:
    nnDataVars2 = []
    for item in entry[1]:
        nnDataVars2.append(entry[1][item])
    nnData2.append([np.array(entry[0], np.int32), np.array(nnDataVars2, np.int32)])

In [12]:
# change our [target, bag of words] array into testing/training
# variables/targets for predictive model 2

nnData2 = np.asarray(nnData2, dtype=object)
np.random.shuffle(nnData2)
varData2 = np.array(nnData2[:, 1])
tarData2 = np.array(nnData2[:, 0])
newVarData2 = []
newTarData2 = []
for row in varData2:
    newVarData2.append(row)
newVarData2 = np.array(newVarData2)
for row in tarData2:
    newTarData2.append(row)
newTarData2 = np.array(newTarData2)

`Below is our TensorFlow model for our second dataset. All of the hyperparameters are the same as in the first model, except that the output layer has 12 units instead of 21.`

In [13]:
# run our tensorflow neural network for predictive model 2

# use new names so the first model and its history are not overwritten
m2 = keras.Sequential([
        keras.layers.Dense(units=100, activation='relu'),
        keras.layers.Dropout(.25),
        keras.layers.Dense(units=50, activation='relu'),
        keras.layers.Dropout(.25),
        keras.layers.Dense(units=12, activation='softmax')
    ])

m2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=tf.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

history2 = m2.fit(newVarData2, newTarData2, epochs = 200, batch_size = 70, shuffle = True, validation_split = .25)

Epoch 1/200
...
Epoch 78
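
When reading the loss values of a softmax classifier trained with categorical cross-entropy, it helps to know the loss of a model that guesses uniformly. The sketch below computes that chance-level loss for 12 and 21 classes in plain Python. The helper names are hypothetical, and this is a worked illustration of the loss formula, not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for a single one-hot target."""
    return -math.log(probs[true_index])

# equal logits give a uniform prediction over the classes
uniform_12 = softmax([0.0] * 12)
chance_loss_12 = cross_entropy(uniform_12, 3)            # equals ln(12), about 2.48
chance_loss_21 = cross_entropy(softmax([0.0] * 21), 0)   # equals ln(21), about 3.04
```

A validation loss well below these values suggests the model has learned something beyond uniform guessing.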

In [14]:
firstModelBasePercent = 1/21
secondModelBasePercent = 1/12
firstModelResult = .29
secondModelResult = .66

print("First model score = " + str(firstModelResult/firstModelBasePercent))
print("Second model score = " + str(secondModelResult/secondModelBasePercent))

First model score = 6.09
Second model score = 7.920000000000001


The second model performed 7.92x better than randomly selecting an artist, while the first model performed 6.09x better. This can most likely be attributed to the larger amount of data the second model had to train on (296 songs for each of 12 artists, versus 745 songs across 21 artists). Another possibility is that the artists in the second dataset were simply more different from one another. For example, the second dataset contained rappers while the first dataset did not.
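
The score used in this comparison is the model's accuracy divided by the accuracy of a uniform random guess (1 / number of artists), which is the same as accuracy times the number of artists. A small helper makes the calculation reusable. The function name `lift_over_chance` is hypothetical, and the accuracies are the values entered in the cell above.

```python
def lift_over_chance(accuracy, num_classes):
    """How many times better than uniform random guessing a classifier is."""
    chance = 1 / num_classes
    return accuracy / chance

first_model_score = lift_over_chance(0.29, 21)
second_model_score = lift_over_chance(0.66, 12)

print(round(first_model_score, 2), round(second_model_score, 2))
```

Normalizing by chance makes the two models comparable even though they predict over a different number of artists.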