# Data Preprocessing

## Environment

In [1]:
import pandas as pd
import pyarrow
import re

In [2]:
# mount google drive
from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive/MyDrive/Projects/semantic_song_search

Mounted at /content/gdrive
/content/gdrive/MyDrive/Projects/semantic_song_search


## Load Data

In [3]:
# load the full song set and drop rows with missing lyrics
df = pd.read_parquet("data/5mil_song_set.parquet")
df = df.dropna(subset=["lyrics"])
df = df.reset_index(drop=True)
print(df.shape)
df.info()

(5911279, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5911279 entries, 0 to 5911278
Data columns (total 7 columns):
 #   Column    Dtype   
---  ------    -----   
 0   title     object  
 1   tag       category
 2   artist    object  
 3   year      int16   
 4   views     int32   
 5   features  object  
 6   lyrics    object  
dtypes: category(1), int16(1), int32(1), object(4)
memory usage: 219.9+ MB


In [28]:
# drop the unused "features" column and any rows whose lyrics are an empty string
df = df.drop(columns=["features"])
blank = (df["lyrics"] == "")
df = df[~blank]

## Limit by "Good Music"
I started out by filtering on view counts, but popularity turned out to be a poor proxy for quality. Instead, I wrote out a list of the artists in my personal collection and kept only their songs.

In [4]:
good_artists = ["311", "1975", "Al Green", "Allah-Las", "Animal Collective", "Apples In Stereo", "Arctic Monkeys", "Atlas Sound",
                "Atoms For Peace", "Audioslave", "Autechre", "Average White Band", "Avey Tare", "Art d'Ecco", "A Tribe Called Quest",
                "Art Blakey & The Jazz Messengers", "Barenaked Ladies", "Battles", "Beach Boys", "Beach Fossils", "Bear In Heaven",
                "Beck", "Bert Jansch", "Best Coast", "Billy Idol", "Black Keys", "Black Moth Super Rainbow", "Boards Of Canada",
                "Bob Dylan", "Bodies Of Water", "Bon Iver", "The Books", "Boys Noize", "Brant Bjork", "Brendan Benson",
                "Brightblack Morning Light", "Broken Bells", "The Beatles", "Beastie Boys", "Bob Marley", "Burning Spear",
                "Booker T. & The M.G.'s", "Blur", "Cage The Elephant", "Canned Heat", "Car Seat Headrest", "Caribou", "Carole King",
                "Cat Power", "Cat Stevens", "Chromatics", "Coldplay", "Com Truise", "Counting Crows", "Cream", "Creedence Clearwater Revival",
                "Crystal Castles", "The Crystal Method", "CSS", "Cut Copy", "Childish Gambino", "Charles Mingus", "D.R. Hooker",
                "Dabrye", "Daedelus", "Daft Punk", "Damon Albarn", "Dan Auerbach", "The Dandy Warhols", "Darondo", "David Bowie",
                "The Dead Weather", "Deakin", "Decemberists", "Deep Sea Diver", "Deerhunter", "Department Of Eagles", "Digitalism",
                "Dirty Projectors", "Discovery", "Dispatch", "The Doors", "Dr. Dog", "Danger Mouse", "Das Racist", "Del Tha Funkee Homosapien",
                "Dolly Parton", "Dave Brubeck", "Dave Douglas", "Duke Ellington", "Deltron 3030", "DJ Shadow", "DJ Vadim",
                "Doseone", "Dr. Dre", "Dandy Livingstone", "Edward Sharpe & The Magnetic Zeros", "Eels", "Electric President", "Elliott Smith",
                "Elvis Presley", "Endo", "Eric Clapton", "Explosions In The Sky", "Elton John", "Eurythmics", "Eagles", "El-P", "Eminem",
                "Fatboy Slim", "Feist", "Fischerspooner", "Fleet Foxes", "Fleetwood Mac", "Four Tet", "Frank Sinatra", "Frankie Valli",
                "Franz Ferdinand", "Freddie King", "Fu Manchu", "Fugazi", "Future Shuttle", "Gang Gang Dance", "Garage Voice",
                "Gardens & Villa", "Gary Numan", "Gary Wright", "Gentleman Losers", "Gorillaz", "Green Day", "Greta Van Fleet", "Grizzly Bear",
                "The Guess Who", "George Harrison", "Geto Boys", "Gnarls Barkley", "GZA", "Gregory Isaacs", "Hercules And Love Affair", 
                "Here We Go Magic", "High Places", "Hit Factory", "Hives", "Hot Chip", "Hotel Pools", "Huey Lewis & The News", "Hank Williams",
                "Herbie Hancock", "Handsome Boy Modeling School", "Heems", "Hound Dog Taylor", "Incubus", "Iron & Wine", "Iggy Pop",
                "Idris Muhammad", "Ice Cube", "Jack White", "James Brown", "James Holden", "James Taylor", "Jamie Lidell", "Jane", "Janis Joplin",
                "Jars Of Clay", "Jefferson Airplane", "Jelly Roll Morton", "Jimi Hendrix", "Jon Spencer Blues Explosion", "Junior Boys",
                "Junk Culture", "Justice", "Justin Timberlake", "John Lennon", "Johnny Cash", "J Dilla", "Jurassic 5", "John Lee Hooker",
                "Kelley Polar", "Kills", "Kinks", "Kraftwerk", "Kurt Vile", "Kanye West", "Kid Cudi", "Killer Mike", "Kings of Leon",
                "Late BP Helium", "LCD Soundsystem", "Led Zeppelin", "Leon Bridges", "Les Claypool", "Limp Bizkit", "Lindstrøm", "Linkwood",
                "Loretta Lynn", "Louis Armstrong", "Lil Wayne", "Logic", "Lord T & Eloise", "La's", "Mac DeMarco", "Madonna", "Marcy Playground",
                "Marvin Gaye", "Massive Attack", "Medeski, Martin & Wood", "Memory Cassette", "Memory Tapes", "MGMT", "Michael Jackson", "Moby",
                "Modest Mouse", "MSTRKRFT", "Meters", "Miles Davis", "Mose Allison", "M.I.A", "Mac Miller", "MF DOOM", "Missy Elliott",
                "The Mamas & The Papas", "Neon Indian", "New Order", "Nikka Costa", "Nine Inch Nails", "Nirvana", "Neil Young", "of Montreal",
                "Otis Redding", "Panda Bear", "Pearl Harbor", "Phoenix", "Picastro", "Pinback", "Pink Floyd", "Pixies", "Polvo",
                "The Presidents of the United States of America", "Prince", "Prodigy", "Paul McCartney", "People Under The Stairs", "The Police",
                "Pointer Sisters", "Quarashi", "R.E.M", "Raconteurs", "Radiohead", "Rage Against The Machine", "Ratatat", "Ravi Shankar",
                "Red Hot Chili Peppers", "Royal Blood", "The Rolling Stones", "Ringo Starr", "Ramsey Lewis", "Roots Manuva", "Run The Jewels",
                "Robert Johnson", "Rick James", "Sensational Saints", "Sheryl Crow", "The Shins", "Simian Mobile Disco", "Sonic Youth", "Spoon",
                "St. Vincent", "Steve Miller Band", "Stevie Nicks", "Stevie Wonder", "Sting", "Stone Temple Pilots", "Strokes", "Sublime",
                "Sufjan Stevens", "Surfer Blood", "Sylvan Esso", "Sun Ra", "Shabazz Palaces", "Simon & Garfunkel", "Snail Mail", "Steely Dan", "The Smile", 
                "The Talking Heads", "Tamaryn", "Tame Impala", "Tegan and Sara", "Tenacious D", "Thievery Corporation", "Thom Yorke", "Three Dog Night",
                "Tim Carroll", "Timber Timbre", "Toadies", "TOBACCO", "Tom Petty", "Trentemøller", "The Turtles", "Tycho", "Themselves", "Three 6 Mafia",
                "Thelma Houston", "The Unicorns", "Vince Guaraldi Trio", "Washed Out", "Weeknd", "Weezer", "White Denim", "The White Stripes",
                "The Whitest Boy Alive", "Who", "Widowspeak", "Wolfmother", "Westside Connection", "Yeah Yeah Yeahs", "Yeasayer", "Zero 7", "Zoot Woman"]

In [7]:
print(df.shape)
df = df[df['artist'].isin(good_artists)].reset_index(drop=True)
print(df.shape)

(5911279, 7)
(55878, 7)
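
The filter above is an exact string match, so an entry such as "Beach Boys" only keeps rows whose `artist` value is spelled exactly that way. As a rough sketch of a more forgiving comparison, the snippet below lowercases names and drops a leading "The" before matching. The helper `normalize_artist`, the sample lists, and the assumption that the dataset's spellings differ from the list's are all hypothetical and not taken from the notebook.

```python
def normalize_artist(name: str) -> str:
    """Hypothetical helper: lowercase a name and drop a leading 'the '."""
    key = name.strip().casefold()
    if key.startswith("the "):
        key = key[4:]
    return key


# Made-up stand-ins for good_artists and for df['artist'].
good_artists_sample = ["Beach Boys", "The Beatles", "MF DOOM"]
dataset_artists = ["The Beach Boys", "Beatles", "MF Doom", "Nickelback"]

# Compare normalized keys instead of raw strings.
allowed = {normalize_artist(a) for a in good_artists_sample}
kept = [a for a in dataset_artists if normalize_artist(a) in allowed]
print(kept)  # ['The Beach Boys', 'Beatles', 'MF Doom']
```

On a DataFrame, the same idea would be `df['artist'].map(normalize_artist).isin(allowed)`. It trades precision for recall, so it is worth checking which extra artists get pulled in.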


## Profanity

In [175]:
# The Obscenity List - 1600+ popular English profanities and their variations
# https://github.com/surge-ai/profanity
nsfw = pd.read_csv("data/profanity_en.csv")
nsfw[nsfw["text"] == "dumbass"]

Unnamed: 0,text,canonical_form_1,canonical_form_2,canonical_form_3,category_1,category_2,category_3,severity_rating,severity_description
493,dumbass,ass,,,mental disability,other / general insult,,1.4,Mild


In [10]:
df["tag"].value_counts()

rock       21172
pop        20304
rap         7171
rb          3587
country     2841
misc         803
Name: tag, dtype: int64

In [11]:
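from tqdm.auto import tqdm
tqdm.pandas()

# Create a case-insensitive regular expression that matches any of the words in nsfw['text'].
# The flag has to be passed to re.compile: the second argument of a compiled
# pattern's search() is a start position, not a flag.
nsfw_pattern = re.compile('|'.join([re.escape(word) for word in nsfw['text']]), flags=re.IGNORECASE)

# mask[i] is True when lyric i contains at least one listed term
mask = []
for lyric in tqdm(df['lyrics'], desc='Processing Lyrics'):
    mask.append(nsfw_pattern.search(lyric) is not None)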


In [16]:
# Select only rows without NSFW words
print(df.shape)
mask = pd.Series(mask)
df = df[~mask]
df = df.reset_index(drop=True)
print(df.shape)

(55878, 7)
(28714, 7)
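
The pattern above matches listed terms anywhere in the text, including inside longer, harmless words, which may be part of why roughly half of the songs are dropped. A minimal sketch of a stricter variant using `\b` word boundaries follows. The two-word list and the sample lines are made up for illustration, and `\b` is not reliable for entries that begin or end with punctuation.

```python
import re

# Hypothetical mini word list standing in for nsfw["text"].
words = ["ass", "damn"]

# Substring matching: a term also matches inside a longer word.
substring_pattern = re.compile(
    "|".join(re.escape(w) for w in words), flags=re.IGNORECASE
)

# Word-boundary matching: a term must stand alone as a word.
# Longer terms go first so they are tried before their own prefixes.
ordered = sorted(words, key=len, reverse=True)
boundary_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b",
    flags=re.IGNORECASE,
)


def is_flagged(lyric, pattern):
    """Return True when the pattern matches anywhere in the lyric."""
    return pattern.search(lyric) is not None


clean_line = "I sat in class and passed the glass"
print(is_flagged(clean_line, substring_pattern))        # True
print(is_flagged(clean_line, boundary_pattern))         # False
print(is_flagged("Damn, that hurt", boundary_pattern))  # True
```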


In [17]:
df["tag"].value_counts()

pop        11942
rock       11269
rb          2195
country     1417
rap         1416
misc         475
Name: tag, dtype: int64

## Annotations

In [19]:
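# remove annotations (in brackets)
# limit the bracket contents to 100 characters so the regex doesn't stall out on unclosed brackets
# raw string avoids an invalid-escape warning; the lazy quantifier stops at the first closing bracket
df['lyrics'] = df['lyrics'].replace(r"\[.{0,100}?\]", "", regex=True)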
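
To see what the bracket stripping does, here is a small sketch on a made-up lyric. The lazy quantifier is an assumption of this sketch, chosen so that two tags on the same line are removed separately. Because `.` does not match a newline, an unclosed bracket is left alone rather than swallowing the following lines.

```python
import re

# Bracketed annotation with at most 100 characters inside, on a single line.
annotation = re.compile(r"\[.{0,100}?\]")

sample = "[Verse 1]\nHello [x2] world\n[Chorus]\nLa la"
cleaned = annotation.sub("", sample)
print(repr(cleaned))  # '\nHello  world\n\nLa la'

# An unclosed bracket is left untouched.
unclosed = "[oops never closed\nnext line"
print(annotation.sub("", unclosed) == unclosed)  # True
```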

## Reduce Count

In [35]:
# randomly keep desired_count songs
desired_count = 17000
df = df.sample(n=desired_count, replace=False)
df = df.reset_index(drop=True)

In [36]:
df.shape

(17000, 6)
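
The sample above changes on every run. If a repeatable subset is wanted, one option is to pass a seed to `DataFrame.sample`. This is a sketch on a toy frame, and the seed value 0 is an arbitrary assumption, not something the notebook uses.

```python
import pandas as pd

# Toy stand-in for the song DataFrame.
toy = pd.DataFrame({"title": [f"song {i}" for i in range(100)]})
desired_count = 17

# The same random_state gives the same rows on every call.
sample_a = toy.sample(n=desired_count, replace=False, random_state=0).reset_index(drop=True)
sample_b = toy.sample(n=desired_count, replace=False, random_state=0).reset_index(drop=True)

print(sample_a.shape)             # (17, 1)
print(sample_a.equals(sample_b))  # True
```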

## Create Normalized Version of Lyrics (for GloVe)

In [37]:
import nltk
import re
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# map a Penn Treebank POS tag to the matching WordNet POS constant (default: noun)
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


# build these once; recreating them for every song is needlessly slow
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text):
    # lowercase
    text = text.lower()
    # replace special characters and punctuation with spaces
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # tokenize
    tokens = word_tokenize(text)
    # remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # lemmatize, using each token's part-of-speech tag
    tagged = nltk.pos_tag(tokens)
    tokens = [LEMMATIZER.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
    # rejoin string
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [38]:
from tqdm.auto import tqdm
tqdm.pandas()

# preprocess text
df['normalized_lyrics'] = df['lyrics'].progress_apply(preprocess_text)


## Review and Save

In [39]:
pd.set_option('max_colwidth', 500)
df.sample(10).T

Unnamed: 0,9229,15687,16180,12137,8445,3490,7866,15658,4392,9954
title,I Need Some Money,Lift San Francisco 27 March 1996 MD111.2,Jelmore,Behind the Sun,Monday Morning live,Heartbreaker 1/4/71 Paris Theatre,Dig It 2021 Mix,The Epiphany,Until We Fall,Jump In For Gilkey Elementary School
tag,rb,rock,rock,rock,pop,rock,rock,rock,rock,pop
artist,John Lee Hooker,Radiohead,Bon Iver,Eric Clapton,Fleetwood Mac,Led Zeppelin,The Beatles,Eels,Audioslave,High Places
year,1960,2019,2019,1985,2015,1997,2021,2018,2006,2015
views,766,70,31547,772,175,134,228,1016,3767,39
lyrics,"\nThe best thing in life is free\nBut you can give it to the birds and bees\nI need some money, need some money, oh, yeah\nWhat I want\n\n\nYour love give me such a thrill\nBut your lovin' don't pay my bills\nI need money, need some money, oh, yeah\nWhat I want\n\n\nI need some money, honey\nI need some money right away\nI need some money so bad\nI need some money, oh, yeah\nWhat I want\n\n\nMoney don't get everything, it's true\nWhat it don't buy that I can't use\nI need some money, I need ...","\nThank you\nThis is a brand new song\nI hope you like it\n\n\nThis is the place\nSit down, you’re safe now\nYou’ve been stuck in a lift\nYou’ve been trying to reach me, Thom\nThis is the place\nIt won’t hurt ever again\n\n\nThe smell of air conditioning\nThe fish are belly up\nEmpty all your pockets\nBecause it’s time to come home\n\n\nThis is the place\nRemember me? I’m the face you always see\nYou’ve been stuck in a lift\nIn the belly of a whale at the bottom of the ocean\n\nThe smell of ...","\nWell, angel, morning Sivanna\nWell, ain't been gone too far\nBut heading out towards Ponoma\nWhere you won't be alone\n\n\nWhere there's thrift store manager in a poke camadee\nAnd a gas mask on his arm\nAnd one by one by one\nWe'll all be gone\n\n\nWe'll all be gone by the fall\nWe'll all be gone by the falling light\n\n\nBrick layer\nWith a hat down on his feet\nI'll say no more\nI won't lead no Calvary\n\nHow long\nWill you disregard the heat?\nHalf beat\nIt's no misnomer though\n\n\nI'...","My love has gone behind the sun\nSince she left, the darkness has begun\nThe smile that used to shine on me\nIs nothing more than a memory\n\nI see her face, I hear her voice\nShe made a move, I had no choice\nBut walk and cry, wipe tears with my hand\nThe one that carries a wedding band\n\nAnd the clouds hang low\nAnd the flowers that used to grow in my heart\nAre dying now\nDying now\nDying now","Monday morning you sure look fine\nFriday I got travelling on my mind\nFirst you 
love me then you fade away\nI can't go on believing this way\n\nI got nothing but love for you\nSo tell me what you really want to do\nFirst you love me then you get on down the line\nBut I don't mind, no, I don't mind\n\nI'll be there if you want me to\nNo one else that could ever do\nI got to get some peace in my mind\n\nMonday morning you sure look fine\nFriday I got travelling on my mind\nFirst you love me t...","Hey fellas, have you heard the news\nYou know that Annie's back in town\nIt won't take long just watch an' see\nHow the fellas lay their money down\nHer style is new but the face is same\nAs it was so long ago\nBut from her eyes, a different smile\nLike that of one who knows\n\nWell, it's been ten years and maybe more\nFirst set eyes on you\nThe best years of my life gone by\nHere I am alone and blue\nSome people cry and some people die\nBy the wicked ways of love\nBut I'll just keep on roll...","\nLike a rolling stone\nLike a rolling stone\nLike a rolling stone\nLike the FBI\nAnd the CIA\nAnd the BBC\nB.B. 
King\nAnd Doris Day\nMatt Busby, dig it, dig it\nDig it, dig it, dig it\n\n\nThat was ""Can You Dig It"" by Georgie Wood\nAnd now we'd like to do ""Hark the Angels Come""",\nMy old friend\nI just want to go back\nTo that simpler time\nWhen we were so happy and free\nMy old friend\nCan I just go back?\n\n\nMy old friend\nPlease take me back\nTo those halcyon days\nSo far away now\nCan I just go back?\n\n\nCan't go back\nBut I can make today a memory to last,\nI walk the streets without I regret\nI stood alone and watched myself fall apart\nAnswered the voices in my head\nSlipped through the chainlink of a broken heart\n\n\nAnd when I'm tired of feeling black\nSpread the wings upon your back\nTake us high above it all\nAnd stroke your feathers 'till we fall\nBack down again\n\n\nBought everything that sounded good\nI understand that I've been misunderstood\n\n\nAnd when I'm tired of feeling black\nSpread the wings upon your back\nTake us high ab...,"1...2...1,2...1,2,3\n\nIf you never take the first step\nYou cannot go too far\nI'm sure you know that you strike me as a smart kid\nAnd you've got big plans and big dreams and big big goals\nSo get a move on\nJump in\n\nHey, hey, hey, hey, hey ,hey, hey\nWe're jumpin' in\nToday, -day, -day, -day, -day, -day, -day\nIt begins\n\nAnd oooh, I'm so impressed\n'Cause you have done your best\nAnd we will get so much done\nBefore this day is done\n\n'Cause\nHey, hey, hey, hey, hey ,hey, hey\nWe're ..."
normalized_lyrics,best thing life free give bird bee need money need money oh yeah want love give thrill lovin pay bill need money need money oh yeah want need money honey need money right away need money bad need money oh yeah want money get everything true buy use need money need money yeah want need money honey need money bad bill behind need money right know lovin good baby need money oh yeah want need money need money want,thank brand new song hope like place sit safe stuck lift try reach thom place hurt ever smell air condition fish belly empty pocket time come home place remember face always see stuck lift belly whale bottom ocean smell air condition fish belly empty pocket time come home smell air condition fish belly ah let go today first day rest day lighten squirt,well angel morning sivanna well go far head towards ponoma alone thrift store manager poke camadee gas mask arm one one one go go fall go fall light brick layer hat foot say lead calvary long disregard heat half beat misnomer though feeling well go slide right door oh,love go behind sun since leave darkness begin smile use shine nothing memory see face hear voice make move choice walk cry wipe tear hand one carry wed band cloud hang low flower use grow heart die die die,monday morning sure look fine friday get travel mind first love fade away go believe way get nothing love tell really want first love get line mind mind want one else could ever get get peace mind monday morning sure look fine friday get travel mind first love say wrong go believe long know true want get first love get line mind mind want one else could ever get get peace mind,hey fella hear news know annie back town take long watch see fellas lay money style new face long ago eyes different smile like one know well ten year maybe first set eye best year life go alone blue people cry people die wicked way love keep rollin along grace lord people talkin around bout way leave flat care people say know jive one thing mind clarify 
please way call another guy name baby baby baby oh try make love mmm work hard unwind get money save abuse love thousand time however hard ...,like roll stone like roll stone like roll stone like fbi cia bbc b b king doris day matt busby dig dig dig dig dig dig georgie wood like hark angel come,old friend want go back simpler time happy free old friend go back old friend please take back halcyon day far away go back go back make today memory last,walk street without regret stand alone watch fall apart answer voice head slip chainlink broken heart tire feeling black spread wing upon back take u high stroke feather till fall back bought everything sound good understand misunderstood tire feeling black spread wing upon back take u high stroke feather till fall back feel think see blink battle dream stroke feather till scream tire feeling black spread wing upon back take u high stroke feather fall back,never take first step go far sure know strike smart kid get big plan big dream big big goal get move jump hey hey hey hey hey hey hey jumpin today day day day day day day begin oooh impressed cause do best get much do day do cause hey hey hey hey hey hey hey jumpin today day day day day day day begin never take first step go far let get move jump never take first step go far let get move jump


In [40]:
df.to_parquet("data/5mil_song_set_clean_v2_good_artists.parquet")