# Author: Aishwarya Varala

# Description



> The dataset is a musical lyrics dataset obtained from Kaggle. In this notebook the lyrics are extracted and cleaned. The cleaning steps are as follows:


*   First, rows without a genre are removed, since genre is the target variable.
*   Non-English lyrics are detected with the 'langdetect' package and removed.
*   Emojis, numbers and special characters are removed from the text; only the '!' symbol is kept.
*   The remaining text is lowercased and stemmed with the Porter Stemmer.

The cleaned and preprocessed dataset is stored in a CSV file named processed_lyics.csv.
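
The digit and symbol removal described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own function: `clean_tokens` is a hypothetical name, and emoji removal and stemming are left out. It assumes the same pattern as the notebook (word characters or '!').

```python
import re

def clean_tokens(text):
    """Drop digits, lowercase, and keep only word tokens and '!' (hypothetical helper)."""
    no_digits = re.sub(r"[0-9]", "", text)
    # Every character that is neither a word character nor '!' acts as a separator.
    return re.findall(r"\w+|!", no_digits.lower())

tokens = clean_tokens("Hey!! It's 2 late, 4ever...")
print(tokens)  # ['hey', '!', '!', 'it', 's', 'late', 'ever']
```

Note that the apostrophe is treated as a separator, so "It's" becomes the two tokens "it" and "s".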


> Oversampling is then performed on the cleaned dataset: the Rock genre is downsampled to 50% of its records, and records of the other genres are duplicated to bring their counts up towards the reduced Rock count.

After oversampling, the dataset is shuffled a couple of times to avoid long consecutive runs of lyrics belonging to the same genre.

The resultant oversampled dataset is stored in cleaneddata_oversampled.csv.
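
The arithmetic behind such duplication factors can be sketched as follows. The counts below are invented for illustration and are not the real dataset figures. `replication_plan` is a hypothetical helper; the notebook's multipliers were chosen by hand. For each minority genre, the target count is split into a number of full copies plus a remainder of extra rows to sample.

```python
def replication_plan(counts, majority, keep_fraction=0.5):
    """Return (target, {genre: (full_copies, extra_rows)}) for every non-majority genre."""
    target = int(counts[majority] * keep_fraction)
    plan = {}
    for genre, n in counts.items():
        if genre == majority:
            continue
        if n >= target:
            plan[genre] = (1, 0)  # already at or above the target: keep as is
        else:
            # full_copies * n + extra_rows == target
            plan[genre] = divmod(target, n)
    return target, plan

# Hypothetical counts, NOT taken from lyrics.csv
counts = {"Rock": 1000, "Pop": 480, "Country": 250, "Folk": 30}
target, plan = replication_plan(counts, "Rock")
print(target)  # 500
print(plan)    # {'Pop': (1, 20), 'Country': (2, 0), 'Folk': (16, 20)}
```

A plan entry such as `(1, 20)` corresponds to keeping one full copy of the genre and adding 20 randomly sampled rows, the same shape as the "full copy plus sample" cells used for some genres in this notebook.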


# Command to run the file


> Open the ipynb notebook in Jupyter Lab and go to the menu bar on the top, click on 'Run' and from the dropdown select the 'Run All' option to run all the cells in the notebook. 


# Inputs and Outputs

> Input: The input is lyrics.csv, the main dataset, which contains all the relevant attributes. The notebook reads it with the read_csv() function.

> Output: processed_lyics.csv is generated as the first output and holds the pre-processed data without any resampling. cleaneddata_oversampled.csv is the second output and holds the resampled data, with the other genres oversampled and the Rock genre undersampled.



*The inputs to the program must be in the same folder as the script.

INPUT: The lyrics file with each lyric as a row
OUTPUT: Pre-processed lyrics of each lyric in the input file







In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:

# Importing the required Python packages for preprocessing
import re

import emoji
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from langdetect import detect
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import shuffle

nltk.download('averaged_perceptron_tagger')

# Removes emoji from the text. Accepts a single text record and returns the text with every emoji character dropped.
# Note: emoji.UNICODE_EMOJI was removed in emoji 2.0, so emoji.is_emoji() is used here; it needs a recent emoji release.
def text_has_emoji(text):
    return ''.join(character for character in text if not emoji.is_emoji(character))

# Accepts a text record, strips emoji with text_has_emoji, removes digits, lowercases the text and tokenizes it.
# Returns a list of tokens. Tokens are extracted with NLTK's RegexpTokenizer and a custom regular expression
# that keeps runs of word characters and the '!' symbol and drops every other symbol.
def tokenextractor(text):
    new_text = text_has_emoji(text)
    t = re.sub(r'[0-9]', '', new_text)
    # Raw string so that \w is not treated as a string escape
    tokenizer = RegexpTokenizer(r'\w+|!')
    tokens = tokenizer.tokenize(t.lower())
    return tokens
# Accepts a DataFrame row, reads its lyrics and calls tokenextractor to obtain the tokens.
# The tokens are stemmed with the Porter Stemmer and joined with spaces.
# Returns the cleaned and stemmed text as a single string.
def tokenize_and_stem(tok_df):
    tokens = tokenextractor(tok_df['lyrics'])
    ps = PorterStemmer()
    stems = [ps.stem(token) for token in tokens]
    # Join once, after stemming, so that an empty token list yields '' instead of an error
    return ' '.join(stems)
# Accepts a DataFrame row, reads its lyrics and flags whether they are English.
# langdetect's detect() returns a language code. The row gets is_eng = 1 when that code is 'en'
# and is_eng = 2 otherwise (including lyrics that contain no Latin letters at all).
def detect_lyric(en_df):
    lyrics_txt = str(en_df['lyrics'])
    # Only call detect() when the text has at least one Latin letter;
    # detect() raises an error on text with no usable features.
    has_letters = re.search(r'[a-z]', lyrics_txt, re.IGNORECASE) is not None
    if has_letters and detect(lyrics_txt) == 'en':
        en_df['is_eng'] = 1
    else:
        en_df['is_eng'] = 2
    return en_df


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
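
The language check can also be written as a small function that returns a boolean and takes the detector as a parameter, which makes it testable without the real library. This is a sketch under assumptions: `is_english` and `fake_detect` are hypothetical names, and `fake_detect` is a stand-in stub, not how langdetect works. In this notebook the detector passed in would be `langdetect.detect`.

```python
import re

def is_english(text, detect_fn):
    """Return True when detect_fn labels the text 'en'; False otherwise."""
    text = str(text)
    # Skip detection for text without any Latin letter (digits, symbols, empty strings).
    if re.search(r"[a-z]", text, re.IGNORECASE) is None:
        return False
    try:
        return detect_fn(text) == "en"
    except Exception:
        # A detector may raise on text it cannot classify; treat that as "not English".
        return False

# Stand-in detector for illustration only (an assumption, not langdetect's logic)
def fake_detect(text):
    return "en" if "the" in text.lower().split() else "xx"

print(is_english("Hold the line", fake_detect))   # True
print(is_english("La vie en rose", fake_detect))  # False
print(is_english("12345 !!!", fake_detect))       # False
```

langdetect's detection is not deterministic by default; its documentation describes setting `DetectorFactory.seed = 0` before calling `detect` to get repeatable results, which may be worth doing if the filtered row count needs to be reproducible.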


Importing the Lyrics dataset

In [0]:
music_df=pd.read_csv("lyrics.csv")
music_df=music_df.dropna(how='any',axis=0)

Removing the rows whose genre is 'Not Available'.
Detecting the language of each lyric and keeping only the English lyrics.

In [0]:
missing_music_df = music_df[music_df.genre == 'Not Available']
music_df=music_df[music_df.genre != 'Not Available']
music_df=music_df.apply(detect_lyric,axis=1)
music_df = music_df[music_df.is_eng == 1]

Checking the missing values in all columns 

In [0]:
music_df.isnull().sum()

artist    0
genre     0
index     0
is_eng    0
lyrics    0
song      0
year      0
dtype: int64

In [0]:
music_df.shape

(220650, 7)

Transforming the raw lyrics into cleaned text: numbers and special symbols (except '!') are removed, the whole text is lowercased, and stemming is applied.

In [0]:
music_df['lyrics']=music_df[["lyrics"]].apply(tokenize_and_stem,axis=1)

Exporting the cleaned and processed data to processed_lyics.csv file

In [0]:
music_df.to_csv('processed_lyics.csv')

In [0]:
df=pd.read_csv('processed_lyics.csv')

In [0]:
df.head()

,Unnamed: 0,lyrics,genre
0,173586,i wait for the pain it alway come again and i ...,Rock
1,192873,as i hear the mock bird i rememb the word when...,Country
2,196702,gab yeah yeah x lateef blackalici lateef the t...,Hip-Hop
3,34320,well the sky broke in two i found you danc alo...,Pop
4,77537,when madam pompadour wa on a ballroom floor sa...,Jazz


Dropping the unnamed column, which appears because the dataframe index was written to the CSV file and then read back as a regular column.

In [0]:
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
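
An alternative to dropping the column afterwards is to not write the index in the first place. The sketch below uses a toy dataframe and an in-memory buffer (both are assumptions for illustration) to show the effect of the `index=False` argument of `to_csv`.

```python
import io

import pandas as pd

# Toy data standing in for the processed lyrics
toy_df = pd.DataFrame({"lyrics": ["la la la", "yeah yeah"], "genre": ["Pop", "Rock"]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'lyrics', 'genre']
print(cols_without_index)  # ['lyrics', 'genre']
```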

In [0]:
df.head()

,lyrics,genre
0,i wait for the pain it alway come again and i ...,Rock
1,as i hear the mock bird i rememb the word when...,Country
2,gab yeah yeah x lateef blackalici lateef the t...,Hip-Hop
3,well the sky broke in two i found you danc alo...,Pop
4,when madam pompadour wa on a ballroom floor sa...,Jazz


In [0]:
df.isnull().sum()

lyrics    0
genre     0
dtype: int64

In [0]:
df_concat=df.copy()

Oversampling the data: the Rock genre is reduced to 50% of its records, and the other genres are increased to bring them closer to the reduced Rock count. The Pop genre is left unchanged.

For each genre we build a separate dataframe and duplicate it 'n' times, or append a random sample of its own rows, so that its size approaches that of the reduced Rock genre.

In [0]:
music_df_cou=df_concat[df_concat['genre']=='Country']
music_df_country= pd.concat([music_df_cou]*2, ignore_index=True)
#music_df_country.drop(["index"], axis = 1, inplace = True)

music_df_Electron=df_concat[df_concat['genre']=='Electronic']
music_df_Electronic= pd.concat([music_df_Electron]*4, ignore_index=True)
#music_df_Electronic.drop(["index"], axis = 1, inplace = True)

music_df_Folk=df_concat[df_concat['genre']=='Folk']
music_df_Fo= pd.concat([music_df_Folk]*16, ignore_index=True)
#music_df_Fo.drop(["index"], axis = 1, inplace = True)

music_df_Hip=df_concat[df_concat['genre']=='Hip-Hop']
samp_hip=df_concat[df_concat['genre']=='Hip-Hop'].sample(n=8000)
music_df_Hiphop= pd.concat([music_df_Hip,samp_hip], ignore_index=True)
#music_df_Hiphop.drop(["index"], axis = 1, inplace = True)

music_df_Ind=df_concat[df_concat['genre']=='Indie']
music_df_Indie= pd.concat([music_df_Ind]*9, ignore_index=True)
#music_df_Indie.drop(["index"], axis = 1, inplace = True)

music_df_Ja=df_concat[df_concat['genre']=='Jazz']
music_df_Jazz= pd.concat([music_df_Ja]*3, ignore_index=True)
#music_df_Jazz.drop(["index"], axis = 1, inplace = True)

music_df_Met=df_concat[df_concat['genre']=='Metal']
samp_met=df_concat[df_concat['genre']=='Metal'].sample(n=9000)
music_df_Metal= pd.concat([music_df_Met,samp_met], ignore_index=True)
#music_df_Metal.drop(["index"], axis = 1, inplace = True)

music_df_oth=df_concat[df_concat['genre']=='Other']
music_df_other= pd.concat([music_df_oth]*7, ignore_index=True)
#music_df_other.drop(["index"], axis = 1, inplace = True)

music_df_Pop=df_concat[df_concat['genre']=='Pop']

music_df_r=df_concat[df_concat['genre']=='R&B']
music_df_rb= pd.concat([music_df_r]*8, ignore_index=True)
#music_df_rb.drop(["index"], axis = 1, inplace = True)
music_df_rock=df_concat[df_concat['genre']=='Rock'].sample(frac =.5)
#music_df_rock.drop(["index"], axis = 1, inplace = True)
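
The per-genre cells above could be collapsed into one loop. The following is a minimal sketch of that idea, not the notebook's method: `resample_to_target` is a hypothetical helper, it samples with replacement instead of using hand-picked multipliers, and it also trims any genre above the target, which the notebook only does for Rock.

```python
import pandas as pd

def resample_to_target(df, label_col, target, seed=0):
    """Bring every class in label_col to exactly `target` rows."""
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) >= target:
            # Downsample without replacement
            parts.append(group.sample(n=target, random_state=seed))
        else:
            # Keep all original rows and add randomly repeated ones
            extra = group.sample(n=target - len(group), replace=True, random_state=seed)
            parts.append(pd.concat([group, extra]))
    return pd.concat(parts, ignore_index=True)

# Toy data, not the real lyrics
toy = pd.DataFrame({
    "genre": ["Rock"] * 8 + ["Pop"] * 4 + ["Jazz"],
    "lyrics": [f"song {i}" for i in range(13)],
})
balanced = resample_to_target(toy, "genre", target=4)
print(balanced["genre"].value_counts().to_dict())
```

In the toy data every genre ends up with 4 rows. Fixing `seed` makes the resampled file reproducible between runs.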





We will concatenate all the individual genre dataframes into one dataframe

In [0]:
processed_df=pd.concat([music_df_rb,music_df_Pop,music_df_other,music_df_Metal,music_df_Jazz,music_df_Indie,music_df_Hiphop,music_df_Fo,music_df_Electronic,music_df_country,music_df_rock],ignore_index=True)



After the step above, the concatenated dataframe contains long consecutive runs of rows from the same genre, so we shuffle the rows with the shuffle function from sklearn.utils.

The cell below may be run more than once, although a single call already returns a random permutation of the rows.

In [0]:
processed_df=shuffle(processed_df)
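
For a shuffle that gives the same row order on every run, a fixed seed can be used. The sketch below swaps in pandas' own `sample(frac=1)` as an alternative to `sklearn.utils.shuffle` (which also accepts a `random_state` argument). `shuffle_rows` and the toy data are hypothetical.

```python
import pandas as pd

def shuffle_rows(df, seed=42):
    """Return all rows of df in a random but reproducible order, with a fresh 0..n-1 index."""
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: rows start out grouped by genre, as after the concatenation step
toy_rows = pd.DataFrame({
    "genre": ["Rock"] * 3 + ["Pop"] * 3,
    "lyrics": ["a", "b", "c", "d", "e", "f"],
})
shuffled = shuffle_rows(toy_rows)
print(shuffled)
```

Resetting the index also means the old row positions are not carried into the exported CSV file.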

Exporting the cleaned oversampled data to a csv file named:
cleaneddata_oversampled.csv

In [0]:
processed_df.to_csv("cleaneddata_oversampled.csv")