In [2]:
import os
import numpy as np
import pandas as pd
import re 

#pip install contractions
import string
import contractions
import nltk
from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer

#nltk.download('wordnet')
#nltk.download('punkt')
#nltk.download('stopwords')
from nltk.corpus import stopwords

In [3]:
raw_data=pd.read_csv("raw_data.csv")
raw_data.head()

Unnamed: 0,Title,Lyrics
0,Grenade,"[Verse 1]\nEasy come, easy go, that's just how..."
1,Just the Way You Are,"[Verse 1]\nOh, her eyes, her eyes\nMake the st..."
2,Our First Time,"[Refrain]\nDon't it feel good, babe?\nDon't it..."
3,Runaway Baby,"[Intro]\nOh yes\n\n[Verse 1]\nWell, looky here..."
4,Lazy Song,"[Chorus]\nToday, I don't feel like doing anyth..."


We begin data preprocessing by looking at how our raw data is formatted. The following lyrics are from *Grenade:*

In [4]:
raw_data["Lyrics"][0]

'[Verse 1]\nEasy come, easy go, that\'s just how you live, oh\nTake, take, take it all, but you never give\nShoulda known you was trouble from the first kiss\nHad your eyes wide open; why were they open? (Ooh)\n\n[Pre-Chorus]\nGave you all I had and you tossed it in the trash\nYou tossed it in the trash, you did\nTo give me all your love is all I ever ask\n\'Cause what you don\'t understand is\n\n[Chorus]\nI\'d catch a grenade for ya (Yeah, yeah, yeah)\nThrow my hand on a blade for ya (Yeah, yeah, yeah)\nI\'d jump in front of a train for ya (Yeah, yeah, yeah)\nYou know I\'d do anything for ya (Yeah, yeah, yeah)\nOh, oh-oh oh oh, I would go through all this pain\nTake a bullet straight through my brain\nYes, I would die for ya baby\nBut you won\'t do the same\n\n[Post-Chorus]\nNo, no, no, no, oh-oh oh...\n\n[Verse 2]\nBlack, black, black and blue, beat me \'til I\'m numb\nTell the devil I said "Hey" when you get back to where you\'re from\nMad woman, bad woman, that\'s just what you are

Each \n marks a line break, matching how the lyrics are laid out on Genius. Splitting the text on \n shows that the lyrics are organized by song structure, with bracketed tags marking sections such as the verse, chorus, and bridge:

In [5]:
raw_data["Lyrics"][0].split("\n")

['[Verse 1]',
 "Easy come, easy go, that's just how you live, oh",
 'Take, take, take it all, but you never give',
 'Shoulda known you was trouble from the first kiss',
 'Had your eyes wide open; why were they open? (Ooh)',
 '',
 '[Pre-Chorus]',
 'Gave you all I had and you tossed it in the trash',
 'You tossed it in the trash, you did',
 'To give me all your love is all I ever ask',
 "'Cause what you don't understand is",
 '',
 '[Chorus]',
 "I'd catch a grenade for ya (Yeah, yeah, yeah)",
 'Throw my hand on a blade for ya (Yeah, yeah, yeah)',
 "I'd jump in front of a train for ya (Yeah, yeah, yeah)",
 "You know I'd do anything for ya (Yeah, yeah, yeah)",
 'Oh, oh-oh oh oh, I would go through all this pain',
 'Take a bullet straight through my brain',
 'Yes, I would die for ya baby',
 "But you won't do the same",
 '',
 '[Post-Chorus]',
 'No, no, no, no, oh-oh oh...',
 '',
 '[Verse 2]',
 "Black, black, black and blue, beat me 'til I'm numb",
 'Tell the devil I said "Hey" when you get ba
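
As a rough illustration of how the bracketed tags could be used to group lines by section, here is a minimal sketch. The helper `split_sections` and the shortened sample string are hypothetical and not part of the original notebook; the sketch assumes each tag sits on its own line, as in the output above.

```python
import re
from collections import defaultdict

# Hypothetical helper for illustration only; not part of the original pipeline.
def split_sections(lyrics):
    """Group lyric lines under their bracketed section tag, e.g. 'Chorus'."""
    sections = defaultdict(list)
    current = "Untagged"  # bucket for lines that appear before any tag
    for line in lyrics.split("\n"):
        line = line.strip()
        tag = re.fullmatch(r"\[(.+?)\]", line)
        if tag:
            current = tag.group(1)          # start a new section
        elif line:
            sections[current].append(line)  # skip empty separator lines
    return dict(sections)

# Shortened sample taken from the Grenade lyrics shown above
sample = (
    "[Verse 1]\nEasy come, easy go, that's just how you live, oh\n\n"
    "[Chorus]\nI'd catch a grenade for ya (Yeah, yeah, yeah)\n"
    "Throw my hand on a blade for ya (Yeah, yeah, yeah)"
)
sections = split_sections(sample)
print(sections["Chorus"])
```

Because lines are appended per tag name, a chorus that repeats later in a song would accumulate under the same key.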

It would be interesting to look specifically at the chorus, since it is essentially the "heart" of any song. For now, though, I will not distinguish the text by song structure. In addition to removing the bracketed section tags (such as [Verse 1]) from my data, I will:

a. convert all characters to lowercase and expand contractions

b. remove stop words, punctuation, and whitespace

c. lemmatize words to their base (dictionary) forms

To streamline text preprocessing, I wrote a function that performs the steps above and applied it to all 28 songs in my data:

In [6]:
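# Translation table that maps every punctuation character to None (deletes it)
index = str.maketrans(dict.fromkeys(string.punctuation))
lemmatizer = WordNetLemmatizer()
stopWords = set(stopwords.words("english"))

def preprocess(text):
    """Return the cleaned, lemmatized lyrics as one space-separated string."""
    text = text.lower()
    # Drop section tags such as [verse 1] or [chorus]
    text = re.sub(r"\[.*?\]", "", text)
    # Expand contractions before stripping punctuation, while apostrophes still exist.
    # Note: this runs after lowercasing, so an expansion like "I would" brings back
    # a capital "I" that the lowercase stop word list does not match.
    text = contractions.fix(text)
    text = text.translate(index)
    # Tokenizing also discards newlines and extra whitespace
    tokens = [t for t in word_tokenize(text) if t not in stopWords]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(lemmas)

# One preprocessed string per song, in the same order as the rows of raw_data
Preprocessed = [preprocess(lyrics) for lyrics in raw_data["Lyrics"]]

# rename() returns a new DataFrame, so raw_data keeps its original column names
preprocessed_data = raw_data.rename(
    columns={raw_data.columns[0]: "Song Title", raw_data.columns[1]: "Raw Data"}
)
preprocessed_data["Preprocessed"] = Preprocessed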
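
To make two of these steps more concrete, the following standard-library sketch isolates the tag-removal regex and the punctuation translation table. The sample strings are illustrative, and the variable names `table`, `no_tags`, and `cleaned` are assumptions introduced here rather than names from the notebook.

```python
import re
import string

# A dict whose values are all None produces a table that deletes those characters
table = str.maketrans(dict.fromkeys(string.punctuation))

line = "[Chorus]\nI'd catch a grenade for ya (Yeah, yeah, yeah)"

# The non-greedy .*? stops at the first closing bracket, so each tag is removed on its own
no_tags = re.sub(r"\[.*?\]", "", line)

# Deleting punctuation fuses "I'd" into "Id", which is why contractions
# are better expanded before this step
cleaned = no_tags.translate(table).split()
print(cleaned)

# Hyphens are deleted too, so hyphenated interjections collapse into one token
print("oh-oh".translate(table))
```

The second print suggests where a token such as ohoh in the preprocessed lyrics comes from: the hyphen in oh-oh is simply deleted.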

After preprocessing, the lyrics from *Grenade* are now in this format:

In [7]:
preprocessed_data["Preprocessed"][0]

'easy come easy go live oh take take take never give shoulda known trouble first kiss eye wide open open ooh gave tossed trash tossed trash give love ever ask understand I would catch grenade ya yeah yeah yeah throw hand blade ya yeah yeah yeah I would jump front train ya yeah yeah yeah know I would anything ya yeah yeah yeah oh ohoh oh oh would go pain take bullet straight brain yes would die ya baby ohoh oh black black black blue beat til I numb tell devil said hey get back mad woman bad woman yeah smile face rip brake car gave tossed trash tossed trash yes give love ever ask understand I would catch grenade ya yeah yeah yeah throw hand blade ya yeah yeah yeah I would jump front train ya yeah yeah yeah know I would anything ya yeah yeah yeah oh ohoh oh oh would go pain take bullet straight brain yes would die ya baby body fire ooh would watch burn flame said loved liar never ever ever baby darlin I would still catch grenade ya yeah yeah yeah throw hand blade ya yeah yeah yeah I would

Some noise remains even after preprocessing. In *Grenade,* there are three variations of *oh:* oh, ohoh, and ooh. A capitalized *I* also survives: expanding contractions such as *I'd* yields *I would*, and the stop word list contains only the lowercase *i*.
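
If the *oh* variants needed to be merged, one possible approach is to map them to a single token with a regular expression. This is a sketch only and is not applied anywhere in this notebook; the pattern `OH_VARIANT` and the helper `normalize_oh` are hypothetical names, and the pattern is an assumption about which spellings count as variants.

```python
import re

# Hypothetical pattern: one or more repetitions of "o...h...", e.g. oh, ooh, ohoh
OH_VARIANT = re.compile(r"(?:o+h+)+")

def normalize_oh(text):
    """Replace every whole token that is an 'oh' variant with plain 'oh'."""
    return " ".join(
        "oh" if OH_VARIANT.fullmatch(tok) else tok
        for tok in text.split()
    )

# Tokens taken from the preprocessed Grenade lyrics
sample = "oh ohoh oh oh would go pain eye wide open open ooh"
normalized = normalize_oh(sample)
print(normalized)
```

Matching whole tokens with `fullmatch` keeps words that merely contain the letters, such as open, untouched.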

In [8]:
preprocessed_data.head()

Unnamed: 0,Song Title,Raw Data,Preprocessed
0,Grenade,"[Verse 1]\nEasy come, easy go, that's just how...",easy come easy go live oh take take take never...
1,Just the Way You Are,"[Verse 1]\nOh, her eyes, her eyes\nMake the st...",oh eye eye make star look like shinin hair hai...
2,Our First Time,"[Refrain]\nDon't it feel good, babe?\nDon't it...",feel good babe feel good baby brand new babe b...
3,Runaway Baby,"[Intro]\nOh yes\n\n[Verse 1]\nWell, looky here...",oh yes well looky looky ah another pretty thin...
4,Lazy Song,"[Chorus]\nToday, I don't feel like doing anyth...",today feel like anything want lay bed feel lik...


In [None]:
preprocessed_data.to_csv("preprocessed_data.csv",index=False)