# Author: Aishwarya
The dataset is a music lyrics dataset obtained from Kaggle.


# Description

*   Import the oversampled train and test sets.
*   Calculate the parts of speech for each lyric and convert them into a dataframe with lyrics as rows and parts of speech as columns.
*   Concatenate the parts-of-speech dataframes with the train and test datasets respectively.
*   Store the resulting dataframes in CSV files for the later logistic regression and naive Bayes steps.

*Detailed descriptions accompany many parts of the code below; please read through them.

# Command to run the file

> Open the .ipynb notebook in Jupyter Lab, click 'Run' in the top menu bar, and select the 'Run All' option from the dropdown to run all the cells in the notebook.

# Input and Output

Input
> oversampling_train.csv - The oversampled distribution version of the dataset. It is used only as the train set.

> oversampling_test.csv - The oversampled distribution version of the dataset. It is used only as the test set.

Output

> posovrsamprockredtrain.csv - The POS tags as features concatenated to the input train set.

> posovrsamprockredtest.csv - The POS tags as features concatenated to the input test set.


*The input files must be in the same folder as the notebook.






Importing the necessary Python packages.

In [0]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk import pos_tag
# Download the tagger model used by pos_tag (a no-op if it is already present).
nltk.download('averaged_perceptron_tagger')
#from langdetect import detect
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
#import emoji
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The function below takes a list of (word, tag) pairs, as returned by `pos_tag`, and counts how often each part-of-speech tag occurs in a lyric. It returns a dictionary with POS tags as keys and their frequencies as values.

In [0]:

def count_tags(title_with_tags):
    tag_count = {}
    for word, tag in title_with_tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
    return(tag_count)
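
The same counting can be sketched with `collections.Counter` from the standard library. This is only an illustrative alternative, not the function used in this notebook; the name `count_tags_counter` and the sample pairs are hypothetical and hand-written rather than real `pos_tag` output.

```python
from collections import Counter


def count_tags_counter(tokens_with_tags):
    """Count how often each POS tag occurs in a list of (word, tag) pairs."""
    return dict(Counter(tag for _word, tag in tokens_with_tags))


# Hand-written (word, tag) pairs standing in for pos_tag output.
sample = [("the", "DT"), ("cold", "JJ"), ("chill", "NN"),
          ("the", "DT"), ("night", "NN")]
tag_counts = count_tags_counter(sample)
print(tag_counts)  # {'DT': 2, 'JJ': 1, 'NN': 2}
```

Both versions return a plain dictionary, so either can be passed to `Series.map` in the same way.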


#music_df=music_df.dropna(how='any',axis=0)

Importing the oversampled train file and removing the unnamed column that appears when the file is read.

In [0]:
music_df=pd.read_csv("oversampling_train.csv")
music_df.drop(music_df.columns[music_df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
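
An unnamed column of this kind usually appears when a CSV was written together with its index and then read back. The sketch below reproduces the effect on a tiny made-up frame, assuming the input files were saved with the default `to_csv` settings (the notebook does not confirm this), and shows that writing with `index=False` avoids it.

```python
import io

import pandas as pd

# Tiny made-up frame; the real data comes from oversampling_train.csv.
frame = pd.DataFrame({"lyrics": ["la la la", "oh oh"], "genre": ["Pop", "Rock"]})

# Default to_csv writes the index as an unlabeled first column,
# which read_csv then names 'Unnamed: 0'.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# Writing without the index keeps only the real columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(clean.columns))
```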

Finding the part of speech of each token in every train lyric. The result is a Series in which each entry is a list of (token, tag) pairs.

In [0]:
pos_lyrics = music_df['lyrics'].str.split().map(pos_tag)

Computing, for each lyric, a dictionary with POS tags as keys and their frequencies in that lyric as values, using the function `count_tags`. This is applied to the lyrics column of every record and yields a Series of dictionaries.

In [0]:
pos_lyrics=pos_lyrics.map(count_tags)

In [0]:
pos_lyrics.head()

0    {'DT': 3, 'NN': 139, 'JJ': 30, 'IN': 21, 'FW':...
1    {'NN': 22, 'VBP': 10, 'WRB': 1, 'VBD': 5, 'PRP...
2    {'PRP': 28, 'VBD': 10, 'DT': 21, 'NN': 31, 'VB...
3    {'NN': 2, 'WDT': 2, 'VB': 2, 'WRB': 3, 'PRP': ...
4    {'UH': 1, 'PRP$': 3, 'JJ': 13, 'NN': 30, 'RB':...
Name: lyrics, dtype: object

Importing the oversampled test data.

In [0]:
testdf=pd.read_csv("oversampling_test.csv")

In [0]:
testdf.head()

Unnamed: 0.1,Unnamed: 0,lyrics,genre
0,4009,when a cold chill begin to burn at your veri s...,R&B
1,179678,when the boy wa no more than a shaver hi old m...,Country
2,48436,mayb i can t live to love you as long as i wan...,Jazz
3,80888,artist erick sermon f al green album react son...,Hip-Hop
4,209771,in a simpl life dream die hard you never let e...,Country


Removing the unnamed column that appears when the file is read.

In [0]:
testdf.drop(testdf.columns[testdf.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)



In [0]:
testdf.head()

Unnamed: 0,lyrics,genre
0,when a cold chill begin to burn at your veri s...,R&B
1,when the boy wa no more than a shaver hi old m...,Country
2,mayb i can t live to love you as long as i wan...,Jazz
3,artist erick sermon f al green album react son...,Hip-Hop
4,in a simpl life dream die hard you never let e...,Country


Finding the part of speech of each token in every test lyric. The result is a Series in which each entry is a list of (token, tag) pairs.

In [0]:
tstpos_lyrics = testdf['lyrics'].str.split().map(pos_tag)

Computing, for each lyric, a dictionary with POS tags as keys and their frequencies in that lyric as values, using the function `count_tags`. This is applied to the lyrics column of every record and yields a Series containing one POS-frequency dictionary per lyric in the test dataset.

In [0]:
tstpos_lyrics=tstpos_lyrics.map(count_tags)

In [0]:
tstpos_lyrics.head()

0    {'WRB': 8, 'DT': 26, 'NN': 69, 'VB': 20, 'TO':...
1    {'WRB': 2, 'DT': 22, 'NN': 55, 'VBZ': 11, 'JJR...
2    {'NN': 40, 'MD': 8, 'VB': 17, 'JJ': 18, 'TO': ...
3    {'NN': 178, 'JJ': 78, 'VB': 30, 'DT': 45, 'IN'...
4    {'IN': 51, 'DT': 18, 'JJ': 25, 'NN': 63, 'VB':...
Name: lyrics, dtype: object

Transforming the train Series of POS-frequency dictionaries into a dataframe with POS tags as columns and lyrics as rows. Tags that do not occur in a lyric are filled with 0.

In [0]:
df = pd.DataFrame(pos_lyrics.tolist())
df=df.fillna(0)
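
As a small illustration of this step, the sketch below builds a dataframe from two made-up tag-count dictionaries. A tag missing from a lyric first shows up as `NaN` and is then replaced by 0. The names `tag_dicts` and `features` are hypothetical.

```python
import pandas as pd

# Two made-up POS-frequency dictionaries, one per lyric.
tag_dicts = pd.Series([{"DT": 3, "NN": 5}, {"NN": 2, "VB": 1}])

# Each dictionary becomes one row; the union of all keys becomes the columns.
features = pd.DataFrame(tag_dicts.tolist()).fillna(0)
print(features)
```

Because missing entries pass through `NaN`, columns that had gaps end up as floats, which matches the `3.0`, `139.0` style values seen in the outputs further down.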

In [0]:
df.columns

Index(['DT', 'NN', 'JJ', 'IN', 'FW', 'VBG', 'VBD', 'VBN', 'TO', 'VB', 'RB',
       'VBZ', '.', 'NNS', 'VBP', 'WRB', 'PRP', 'PRP$', 'WP', 'CC', 'CD', 'MD',
       'RP', 'WDT', 'UH', 'JJR', 'PDT', 'JJS', 'EX', 'RBR', 'RBS', 'NNP',
       'WP$', '''', '$', 'NNPS', 'SYM', 'POS', 'LS'],
      dtype='object')

In [0]:
df.shape

(265026, 38)

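Dropping the `''` column (the tag assigned to closing quotation marks), which sits at position 33 in the column listing above.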

In [0]:
df=df.drop(df.columns[33], axis=1)

Transforming the test Series of POS-frequency dictionaries into a dataframe with POS tags as columns and lyrics as rows. Tags that do not occur in a lyric are filled with 0.

In [0]:
tsdf = pd.DataFrame(tstpos_lyrics.tolist())
tsdf=tsdf.fillna(0)

In [0]:
tsdf.columns

Index(['WRB', 'DT', 'NN', 'VB', 'TO', 'IN', 'PRP$', 'JJ', 'WDT', 'VBZ', 'RB',
       'VBG', 'PRP', 'MD', 'VBP', 'RP', 'CC', 'VBD', 'JJS', 'CD', 'VBN', 'EX',
       'JJR', 'NNS', 'PDT', 'WP', '.', 'UH', 'FW', 'RBR', 'RBS', 'WP$', 'NNP',
       '''', 'POS', 'SYM', '$', 'NNPS'],
      dtype='object')
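
The train and test column listings are not identical: the train frame has an `LS` column that the test frame lacks, and the columns appear in a different order. If a downstream model needs the same feature columns in both sets, one possible approach (not part of this notebook) is to reindex the test frame to the train columns. The frames below are made-up stand-ins.

```python
import pandas as pd

# Made-up stand-ins for the train and test POS frames.
train_pos = pd.DataFrame({"NN": [5, 2], "LS": [1, 0]})
test_pos = pd.DataFrame({"NN": [4], "''": [2]})

# Keep exactly the train columns, in train order; tags absent from the
# test frame are added as 0 and extra test-only tags are discarded.
test_aligned = test_pos.reindex(columns=train_pos.columns, fill_value=0)
print(test_aligned)
```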

Concatenating the oversampled test set with the test parts-of-speech dataframe.

In [0]:
tdf_concat = pd.concat([testdf.reset_index(drop=True), tsdf.reset_index(drop=True)], axis=1)

Concatenating the oversampled train set with the train parts-of-speech dataframe.

In [0]:
df_concat = pd.concat([music_df.reset_index(drop=True), df.reset_index(drop=True)], axis=1)
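
The `reset_index(drop=True)` calls matter because `pd.concat` with `axis=1` aligns rows by index label rather than by position. A minimal sketch with made-up frames shows what happens when the labels differ:

```python
import pandas as pd

# Made-up frames whose index labels do not match.
lyrics = pd.DataFrame({"lyrics": ["a", "b"]}, index=[10, 11])
pos = pd.DataFrame({"NN": [1, 2]})  # default index 0, 1

# Without resetting, rows are matched on labels: four rows, half of them NaN.
misaligned = pd.concat([lyrics, pos], axis=1)

# After resetting both indexes, rows are paired by position.
aligned = pd.concat(
    [lyrics.reset_index(drop=True), pos.reset_index(drop=True)], axis=1
)
print(misaligned.shape, aligned.shape)  # (4, 2) (2, 2)
```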

In [0]:
df_concat.shape

(265026, 40)

In [0]:
tdf_concat.shape

(66257, 40)

In [0]:
df_concat.head()

Unnamed: 0,lyrics,genre,DT,NN,JJ,IN,FW,VBG,VBD,VBN,TO,VB,RB,VBZ,.,NNS,VBP,WRB,PRP,PRP$,WP,CC,CD,MD,RP,WDT,UH,JJR,PDT,JJS,EX,RBR,RBS,NNP,WP$,$,NNPS,SYM,POS,LS
0,a kay a kay jordan de shoe pairan vich kaali h...,Other,3.0,139.0,30.0,21.0,29.0,1.0,3.0,1.0,1.0,1.0,3.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,i remeb when i saw you at the diner s yet my h...,Rock,10.0,22.0,5.0,7.0,0.0,0.0,5.0,0.0,1.0,2.0,5.0,0.0,0.0,2.0,10.0,1.0,8.0,3.0,1.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,you left the water run when you left me here b...,Country,21.0,31.0,3.0,15.0,0.0,0.0,10.0,5.0,4.0,16.0,7.0,0.0,0.0,0.0,13.0,5.0,28.0,4.0,0.0,8.0,0.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,thing that happen when you fall asleep when yo...,Electronic,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,3.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,oh my god my heart s as heavi as a stone and w...,Indie,9.0,30.0,13.0,12.0,0.0,0.0,1.0,3.0,4.0,12.0,14.0,1.0,0.0,2.0,16.0,1.0,3.0,3.0,0.0,9.0,0.0,1.0,2.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
tdf_concat.head()

Unnamed: 0,lyrics,genre,WRB,DT,NN,VB,TO,IN,PRP$,JJ,WDT,VBZ,RB,VBG,PRP,MD,VBP,RP,CC,VBD,JJS,CD,VBN,EX,JJR,NNS,PDT,WP,.,UH,FW,RBR,RBS,WP$,NNP,'',POS,SYM,$,NNPS
0,when a cold chill begin to burn at your veri s...,R&B,8.0,26.0,69.0,20.0,11.0,36.0,10.0,17.0,5.0,9.0,19.0,2.0,30.0,2.0,18.0,2.0,9.0,15.0,1.0,1.0,3.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,when the boy wa no more than a shaver hi old m...,Country,2.0,22.0,55.0,8.0,4.0,15.0,1.0,29.0,4.0,11.0,9.0,1.0,14.0,1.0,2.0,1.0,8.0,7.0,0.0,0.0,1.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,mayb i can t live to love you as long as i wan...,Jazz,2.0,5.0,40.0,17.0,8.0,14.0,1.0,18.0,0.0,1.0,22.0,0.0,12.0,8.0,24.0,0.0,7.0,2.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,artist erick sermon f al green album react son...,Hip-Hop,9.0,45.0,178.0,30.0,6.0,48.0,16.0,78.0,2.0,33.0,19.0,0.0,26.0,6.0,33.0,5.0,22.0,12.0,1.0,2.0,4.0,2.0,3.0,9.0,0.0,10.0,1.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,in a simpl life dream die hard you never let e...,Country,1.0,18.0,63.0,30.0,12.0,51.0,7.0,25.0,1.0,4.0,18.0,0.0,16.0,14.0,38.0,1.0,5.0,5.0,0.0,2.0,2.0,0.0,1.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Storing the concatenated train and test dataframes in CSV files.

In [0]:
df_concat.to_csv('posovrsamprockredtrain.csv')
tdf_concat.to_csv('posovrsamprockredtest.csv')