# EDA and Pre-Processing

## Initial EDA

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('data/cleaned_lyrics_gender.csv')
df.head()

Unnamed: 0,artist,seq,GENDER,IS_BAND
0,Elizabeth Naccarato,"Oh, Danny boy, the pipes, the pipes are callin...",female,False
1,Ella Fitzgerald,I never feel a thing is real\r\nWhen I'm away ...,female,False
2,Ella Fitzgerald,"I really can't stay\r\nBut, baby, it's cold ou...",female,False
3,Ella Fitzgerald,All my life\r\nI've been waiting for you\r\nMy...,female,False
4,Ella Fitzgerald,I'll be down to get you in a taxi honey\r\nBet...,female,False


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21057 entries, 0 to 21056
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   artist   21057 non-null  object
 1   seq      21057 non-null  object
 2   GENDER   21057 non-null  object
 3   IS_BAND  21054 non-null  object
dtypes: object(4)
memory usage: 658.2+ KB


In [6]:
df = df[df['GENDER'].isin(['female', 'male'])]

In [7]:
len(df)

19721

In [8]:
gender_proportions = df['GENDER'].value_counts(normalize=True)
gender_proportions

male      0.777952
female    0.222048
Name: GENDER, dtype: float64

In [9]:
band_proportions = df['IS_BAND'].value_counts(normalize=True)
band_proportions

False    0.600771
True     0.399229
Name: IS_BAND, dtype: float64

Next, we look at the raw lyrics of the first song.

In [10]:
df['seq'].iloc[0]

"Oh, Danny boy, the pipes, the pipes are calling\r\nFrom glen to glen, and down the mountain side.\r\nThe summer's gone, and all the roses falling,\r\nIt's you, it's you must go and I must bide.\r\n\r\nBut come ye back when summer's in the meadow,\r\nOr when the valley's hushed and white with snow,\r\nIt's I'll be here in sunshine or in shadow,\r\nOh, Danny boy, oh Danny boy, I love you so!\r\n\r\nBut when ye come, and all the flowers are dying,\r\nIf I am dead, as dead I well may be,\r\nYou'll come and find the place where I am lying,\r\nAnd kneel and say an Ave there for me.\r\nAnd I shall hear, though soft you tread above me,\r\nAnd all my grave will warmer, sweeter be,\r\nFor you will bend and tell me that you love me,\r\nAnd I shall sleep in peace until you come to me!"

## Data Pre-Processing

### Loading Libraries

In [11]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [12]:
nltk.download('stopwords')
nltk.download('punkt') 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Cleaning Lyrics

Here, we use the `re` library to clean the lyrics: we keep only letters, digits, whitespace, `!` and `?`, convert the text to lowercase, and collapse runs of whitespace into single spaces.

In [14]:
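def clean_lyrics(text):
    # Keep only letters, digits, whitespace, '!' and '?'
    text = re.sub(r'[^a-zA-Z0-9\s!?]', '', text)
    text = text.lower()
    # Collapse whitespace runs (including \r\n) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text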

In [15]:
df['clean_lyrics'] = df['seq'].apply(clean_lyrics)

In [16]:
df.head()

Unnamed: 0,artist,seq,GENDER,IS_BAND,clean_lyrics
0,Elizabeth Naccarato,"Oh, Danny boy, the pipes, the pipes are callin...",female,False,oh danny boy the pipes the pipes are calling f...
1,Ella Fitzgerald,I never feel a thing is real\r\nWhen I'm away ...,female,False,i never feel a thing is real when im away from...
2,Ella Fitzgerald,"I really can't stay\r\nBut, baby, it's cold ou...",female,False,i really cant stay but baby its cold outside i...
3,Ella Fitzgerald,All my life\r\nI've been waiting for you\r\nMy...,female,False,all my life ive been waiting for you my wonder...
4,Ella Fitzgerald,I'll be down to get you in a taxi honey\r\nBet...,female,False,ill be down to get you in a taxi honey better ...
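
To make the effect of the two regular expressions concrete, the sketch below applies the same `clean_lyrics` logic to a short made-up string (the sample text is an assumption for illustration, not a row from the dataset). Note that apostrophes are deleted rather than replaced by a space, so contractions collapse into single words such as `dont` and `im`.

```python
import re

def clean_lyrics(text):
    # Drop everything except letters, digits, whitespace, '!' and '?'
    text = re.sub(r'[^a-zA-Z0-9\s!?]', '', text)
    text = text.lower()
    # Collapse whitespace runs (including \r\n) into single spaces
    return re.sub(r'\s+', ' ', text).strip()

sample = "Don't stop,\r\nI'm here -- are you?!"  # hypothetical input
cleaned = clean_lyrics(sample)
print(cleaned)  # dont stop im here are you?!
```

The `!` and `?` characters survive cleaning, which is why `'!'` later shows up as a token of its own.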


### Tokenize Lyrics

`word_tokenize` from `NLTK` is used to tokenize the cleaned lyrics.
Reference: https://www.nltk.org/api/nltk.tokenize.word_tokenize.html#nltk-tokenize-word-tokenize

In [17]:
def tokenize_lyrics(text):
    return word_tokenize(text)

df['tokenized_lyrics'] = df['clean_lyrics'].apply(tokenize_lyrics)

In [18]:
df.head()

Unnamed: 0,artist,seq,GENDER,IS_BAND,clean_lyrics,tokenized_lyrics
0,Elizabeth Naccarato,"Oh, Danny boy, the pipes, the pipes are callin...",female,False,oh danny boy the pipes the pipes are calling f...,"[oh, danny, boy, the, pipes, the, pipes, are, ..."
1,Ella Fitzgerald,I never feel a thing is real\r\nWhen I'm away ...,female,False,i never feel a thing is real when im away from...,"[i, never, feel, a, thing, is, real, when, im,..."
2,Ella Fitzgerald,"I really can't stay\r\nBut, baby, it's cold ou...",female,False,i really cant stay but baby its cold outside i...,"[i, really, cant, stay, but, baby, its, cold, ..."
3,Ella Fitzgerald,All my life\r\nI've been waiting for you\r\nMy...,female,False,all my life ive been waiting for you my wonder...,"[all, my, life, ive, been, waiting, for, you, ..."
4,Ella Fitzgerald,I'll be down to get you in a taxi honey\r\nBet...,female,False,ill be down to get you in a taxi honey better ...,"[ill, be, down, to, get, you, in, a, taxi, hon..."


### Stopword Removal

We check each token against the set of English stopwords and drop it if it is present.

In [19]:
stop_words = set(stopwords.words('english'))

In [20]:
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

df['filtered_lyrics'] = df['tokenized_lyrics'].apply(remove_stopwords)

In [21]:
df.head()

Unnamed: 0,artist,seq,GENDER,IS_BAND,clean_lyrics,tokenized_lyrics,filtered_lyrics
0,Elizabeth Naccarato,"Oh, Danny boy, the pipes, the pipes are callin...",female,False,oh danny boy the pipes the pipes are calling f...,"[oh, danny, boy, the, pipes, the, pipes, are, ...","[oh, danny, boy, pipes, pipes, calling, glen, ..."
1,Ella Fitzgerald,I never feel a thing is real\r\nWhen I'm away ...,female,False,i never feel a thing is real when im away from...,"[i, never, feel, a, thing, is, real, when, im,...","[never, feel, thing, real, im, away, embrace, ..."
2,Ella Fitzgerald,"I really can't stay\r\nBut, baby, it's cold ou...",female,False,i really cant stay but baby its cold outside i...,"[i, really, cant, stay, but, baby, its, cold, ...","[really, cant, stay, baby, cold, outside, got,..."
3,Ella Fitzgerald,All my life\r\nI've been waiting for you\r\nMy...,female,False,all my life ive been waiting for you my wonder...,"[all, my, life, ive, been, waiting, for, you, ...","[life, ive, waiting, wonderful, one, ive, begu..."
4,Ella Fitzgerald,I'll be down to get you in a taxi honey\r\nBet...,female,False,ill be down to get you in a taxi honey better ...,"[ill, be, down, to, get, you, in, a, taxi, hon...","[ill, get, taxi, honey, better, ready, bout, h..."
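
The output above still contains tokens such as `im`, `ive` and `cant`. A likely reason is that the cleaning step removed apostrophes, so these tokens no longer match apostrophe-bearing stopword entries such as `don't`. The sketch below illustrates the effect with a small hand-written stopword set (a hypothetical stand-in for `stopwords.words('english')`, which is much longer) and made-up tokens. The normalization at the end is one possible remedy, not something this notebook does.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
stop_words = {"i", "a", "the", "is", "when", "don't"}

def remove_stopwords(tokens):
    # Set membership is O(1) on average, so this is a single pass over the tokens
    return [word for word in tokens if word not in stop_words]

tokens = ["i", "dont", "feel", "a", "thing"]  # made-up tokens, already cleaned
filtered = remove_stopwords(tokens)
print(filtered)  # ['dont', 'feel', 'thing']

# Possible remedy (an assumption): strip apostrophes from the stopwords too,
# so that "don't" becomes "dont" and matches the cleaned tokens.
normalized_stop_words = {w.replace("'", "") for w in stop_words}
filtered_normalized = [w for w in tokens if w not in normalized_stop_words]
print(filtered_normalized)  # ['feel', 'thing']
```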


### Lemmatization

Now we lemmatize the remaining tokens with `WordNetLemmatizer`.
Source: https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet

In [22]:
wnl = WordNetLemmatizer()

In [23]:
def lemmatize_lyrics(tokens):
    return [wnl.lemmatize(word) for word in tokens]

df['lemmatized_lyrics'] = df['filtered_lyrics'].apply(lemmatize_lyrics)

In [24]:
df.head()

Unnamed: 0,artist,seq,GENDER,IS_BAND,clean_lyrics,tokenized_lyrics,filtered_lyrics,lemmatized_lyrics
0,Elizabeth Naccarato,"Oh, Danny boy, the pipes, the pipes are callin...",female,False,oh danny boy the pipes the pipes are calling f...,"[oh, danny, boy, the, pipes, the, pipes, are, ...","[oh, danny, boy, pipes, pipes, calling, glen, ...","[oh, danny, boy, pipe, pipe, calling, glen, gl..."
1,Ella Fitzgerald,I never feel a thing is real\r\nWhen I'm away ...,female,False,i never feel a thing is real when im away from...,"[i, never, feel, a, thing, is, real, when, im,...","[never, feel, thing, real, im, away, embrace, ...","[never, feel, thing, real, im, away, embrace, ..."
2,Ella Fitzgerald,"I really can't stay\r\nBut, baby, it's cold ou...",female,False,i really cant stay but baby its cold outside i...,"[i, really, cant, stay, but, baby, its, cold, ...","[really, cant, stay, baby, cold, outside, got,...","[really, cant, stay, baby, cold, outside, got,..."
3,Ella Fitzgerald,All my life\r\nI've been waiting for you\r\nMy...,female,False,all my life ive been waiting for you my wonder...,"[all, my, life, ive, been, waiting, for, you, ...","[life, ive, waiting, wonderful, one, ive, begu...","[life, ive, waiting, wonderful, one, ive, begu..."
4,Ella Fitzgerald,I'll be down to get you in a taxi honey\r\nBet...,female,False,ill be down to get you in a taxi honey better ...,"[ill, be, down, to, get, you, in, a, taxi, hon...","[ill, get, taxi, honey, better, ready, bout, h...","[ill, get, taxi, honey, better, ready, bout, h..."


### Updated Data Frame

In [25]:
df2 = df[["lemmatized_lyrics", "GENDER"]]
df2.head()

Unnamed: 0,lemmatized_lyrics,GENDER
0,"[oh, danny, boy, pipe, pipe, calling, glen, gl...",female
1,"[never, feel, thing, real, im, away, embrace, ...",female
2,"[really, cant, stay, baby, cold, outside, got,...",female
3,"[life, ive, waiting, wonderful, one, ive, begu...",female
4,"[ill, get, taxi, honey, better, ready, bout, h...",female


In [26]:
df2['lemmatized_lyrics'].iloc[0]

['oh',
 'danny',
 'boy',
 'pipe',
 'pipe',
 'calling',
 'glen',
 'glen',
 'mountain',
 'side',
 'summer',
 'gone',
 'rose',
 'falling',
 'must',
 'go',
 'must',
 'bide',
 'come',
 'ye',
 'back',
 'summer',
 'meadow',
 'valley',
 'hushed',
 'white',
 'snow',
 'ill',
 'sunshine',
 'shadow',
 'oh',
 'danny',
 'boy',
 'oh',
 'danny',
 'boy',
 'love',
 '!',
 'ye',
 'come',
 'flower',
 'dying',
 'dead',
 'dead',
 'well',
 'may',
 'youll',
 'come',
 'find',
 'place',
 'lying',
 'kneel',
 'say',
 'ave',
 'shall',
 'hear',
 'though',
 'soft',
 'tread',
 'grave',
 'warmer',
 'sweeter',
 'bend',
 'tell',
 'love',
 'shall',
 'sleep',
 'peace',
 'come',
 '!']

### Set Threshold for Minimum Token Count

Some songs have very few lyrics, which leaves little information to extract. For now, we set an arbitrary minimum threshold of `25` tokens and drop any song below it.

In [27]:
# assign() returns a new DataFrame, so no value is set on a slice of df
df2 = df2.assign(lemmatized_lyrics_length=df2['lemmatized_lyrics'].apply(len))


In [28]:
threshold = 25

In [29]:
df_length_count = df2[df2['lemmatized_lyrics_length'] >= threshold]
df3 = df_length_count[["lemmatized_lyrics", "GENDER"]]

In [30]:
len(df3)  # Decreased from 19721

19487

In [31]:
df3

Unnamed: 0,lemmatized_lyrics,GENDER
0,"[oh, danny, boy, pipe, pipe, calling, glen, gl...",female
1,"[never, feel, thing, real, im, away, embrace, ...",female
2,"[really, cant, stay, baby, cold, outside, got,...",female
3,"[life, ive, waiting, wonderful, one, ive, begu...",female
4,"[ill, get, taxi, honey, better, ready, bout, h...",female
...,...,...
21052,"[tomboy, hail, mary, never, need, dress, make,...",female
21053,"[throw, line, cant, reel, throw, dart, cant, m...",female
21054,"[mind, cluttered, kitchen, sink, heart, empty,...",female
21055,"[well, moment, heavy, im, ready, like, caged, ...",female


In [32]:
gender_proportions = df3['GENDER'].value_counts(normalize=True)  # The split is similar to before filtering
gender_proportions

male      0.777236
female    0.222764
Name: GENDER, dtype: float64
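
The length filter can also be written without an intermediate length column. The sketch below uses made-up rows and a hypothetical threshold of `3` (the notebook uses `25`) to show one way of doing it: `Series.str.len()` works on list-valued cells, and an explicit `.copy()` makes the result independent of the original frame.

```python
import pandas as pd

# Made-up rows standing in for the real lemmatized lyrics
toy = pd.DataFrame({
    "lemmatized_lyrics": [
        ["oh", "danny", "boy"],
        ["love"],
        ["never", "feel", "thing", "real"],
    ],
    "GENDER": ["female", "male", "female"],
})
threshold = 3  # hypothetical value for this toy example

# .str.len() returns the number of elements in each list
lengths = toy["lemmatized_lyrics"].str.len()
kept = toy.loc[lengths >= threshold].copy()

print(len(toy), "->", len(kept))
print(kept["GENDER"].value_counts(normalize=True))
```

Re-checking the class proportions after filtering, as done above, confirms whether the threshold removed one class disproportionately.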

In [34]:
df3.to_csv("data/cleaned_eda.csv", index=False)
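
One thing to keep in mind when reloading this file: CSV has no list type, so each token list is written as its string representation and comes back as a plain string. The sketch below demonstrates this on a one-row toy frame, using an in-memory buffer as a stand-in for `data/cleaned_eda.csv`. It shows one possible way to restore the lists, via `ast.literal_eval` passed as a `read_csv` converter; this is a suggestion, not part of the original notebook.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "lemmatized_lyrics": [["oh", "danny", "boy"]],  # made-up row
    "GENDER": ["female"],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)

# Plain read: the list column comes back as a string
buffer.seek(0)
raw = pd.read_csv(buffer)
print(type(raw["lemmatized_lyrics"].iloc[0]))

# Read with a converter: each cell is parsed back into a Python list
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"lemmatized_lyrics": ast.literal_eval})
print(restored["lemmatized_lyrics"].iloc[0])
```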