Before we use the dataset, we are going to filter out some songs, both to get the data down to a more manageable size and to exclude songs we don't want in our search engine (e.g. non-English songs).

In [None]:
import pandas as pd
import regex as re

Reading in the whole dataset takes about 9 minutes on my computer.

In [None]:
# nrows is only an upper bound: the file has fewer than 10 million rows,
# so this reads all of the dataset
whole_df = pd.read_csv("data/ds2.csv", nrows=10_000_000)
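
If the long load time becomes a problem, one option is to read the file in chunks and filter as we go, so the full table never has to sit in memory at once. The sketch below has not been timed against the real file. `read_filtered`, the chunk size, and the in-memory sample rows are assumptions for illustration; only the `views` column name comes from the dataset.

```python
import io
import pandas as pd

def read_filtered(source, min_views=10_000, chunksize=100_000):
    """Read a CSV in chunks, keeping only rows with more than min_views views."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["views"] > min_views])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for data/ds2.csv (hypothetical rows)
sample_csv = io.StringIO(
    "title,views,lyrics\n"
    "Song A,500,la la\n"
    "Song B,20000,na na\n"
    "Song C,1500000,oh oh\n"
)
small_df = read_filtered(sample_csv, chunksize=2)
print(small_df)
```

On the real data, `source` would be the path `"data/ds2.csv"`. Note that this skips the full-table statistics computed in the next cells, so it only makes sense once the view threshold is settled.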

In [None]:
len(whole_df)

5913411

In [None]:
gt_million = sum(whole_df['views'] > 1_000_000)
gt_hundred_thousand = sum(whole_df['views'] > 100_000)
gt_ten_thousand = sum(whole_df['views'] > 10_000)
gt_million, gt_hundred_thousand, gt_ten_thousand

(1396, 23698, 185791)

In [None]:
pcts = [n / len(whole_df) for n in (gt_million, gt_hundred_thousand, gt_ten_thousand)]
pcts = [100 * p for p in pcts]
pcts

[0.02360735622807209, 0.4007500916137911, 3.14185839611013]

It makes sense to filter songs based on the number of views, since songs with few views are more likely to pollute search results than to actually be searched for.

Keeping only songs with more than 1,000,000, 100,000, or 10,000 views leaves 1,396, 23,698, and 185,791 songs respectively, which is 0.02%, 0.40%, and 3.14% of the original dataset.
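
The counts and percentages above can also be collected into one small table. This is a minimal sketch on made-up view counts (the `views` Series here is hypothetical, not the real column); the mean of a boolean Series is the fraction of `True` values, which gives the percentage directly.

```python
import pandas as pd

# Hypothetical view counts standing in for whole_df['views']
views = pd.Series([500, 20_000, 150_000, 2_000_000, 9_000])

thresholds = [1_000_000, 100_000, 10_000]
summary = pd.DataFrame({
    "threshold": thresholds,
    # Number of songs above each threshold
    "count": [int((views > t).sum()) for t in thresholds],
    # Share of all songs above each threshold, in percent
    "percent": [100 * float((views > t).mean()) for t in thresholds],
})
print(summary)
```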

We are going to use only songs with more than 10,000 views.

In [None]:
df = whole_df[whole_df['views'] > 10_000].copy().reset_index()

Some of the songs in this table don't have lyrics, so we will drop those

In [None]:
df = df[~df['lyrics'].isna()].copy()

In [None]:
sum(df['lyrics'].isna())

0

The lyrics often contain bracketed annotations that we don't want to analyze, so we can remove them using regex:

In [None]:
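# limit the bracket contents to 100 characters so an unclosed bracket can't swallow a long stretch of lyrics
# (raw string so that \[ and \] are passed to the regex engine unchanged)
df['lyrics'] = df['lyrics'].replace(r"\[.{,100}\]", "", regex=True)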
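
To see what this pattern does, here is a small sketch on made-up strings using the standard library `re` module (the sample text and the lazy variant are assumptions for illustration, not part of the pipeline). Because `.` does not match a newline, a match never spans lines, but the greedy quantifier can remove everything between the first `[` and the last `]` on a single line.

```python
import re

greedy = re.compile(r"\[.{,100}\]")   # same pattern as in the cell above
lazy = re.compile(r"\[.{,100}?\]")    # hypothetical alternative: stop at the first ]

sample = "[Verse 1: Cam'ron]\nKilla Cam, Killa Cam\n[Chorus]\nKilla Cam"
cleaned = greedy.sub("", sample)
print(repr(cleaned))

# Two annotations on one line: greedy removes the text between them too
one_line = "[Intro] yeah [x2]"
print(repr(greedy.sub("", one_line)))
print(repr(lazy.sub("", one_line)))
```

The leftover newlines explain why the cleaned lyrics tend to start with `\n`.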

In [None]:
lyrics = df['lyrics']
lyrics

0         \nKilla Cam, Killa Cam, Cam\nKilla Cam, Killa ...
1         \n\n\nYeah, hah, yeah, Roc-A-Fella\nWe invite ...
2         \n\n\nUgh, Killa!\nBaby!\nKanye, this that 197...
3         \nSo they ask me\n"Young boy\nWhat you gon' do...
4         \nHaha\nUh-huh\nNo homo (Young Mula, baby!)\nI...
                                ...                        
185786                                                     
185787    \n\n\nНам пора прощаться (Пока), c'est la vie ...
185788    \nOoh, ooh\nYeah\n\n\nNega cham gunggeumhae ge...
185789    \n\n\n来年の暮れには、咲き誇る春\n真夏の夜の夢\n秋を感じ、そして冬の雪\n4つの季...
185790    \nI heard I was cancelled (Hmm)\nWell, let's s...
Name: lyrics, Length: 185769, dtype: object

We also want to remove foreign-language songs. The best we can do on this front is to remove any song that contains a character outside the Latin-1 range (U+0000 to U+00FF).

In [None]:
non_eng_pattern = '[^\x00-\xFF]'
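
A quick sketch of what this pattern does and does not flag, using the standard library `re` module and made-up sample strings. The helper `is_flagged` is hypothetical. Characters up to U+00FF (including accented letters such as é) pass, so a song written only with Latin-1 characters is not flagged even if it isn't English.

```python
import re

non_eng_pattern = r"[^\x00-\xFF]"

def is_flagged(text):
    """True if text contains at least one character above U+00FF."""
    return re.search(non_eng_pattern, text) is not None

examples = [
    "cafe",                       # plain ASCII
    "caf\u00e9",                  # é is U+00E9, inside Latin-1
    "c\u0153ur",                  # œ is U+0153, outside Latin-1
    "writ\u0113rs",               # ē is U+0113, outside Latin-1
    "\u041d\u0430\u043c",         # Cyrillic
    "it\u2019s",                  # curly apostrophe U+2019
]
for text in examples:
    print(f"{text!r}: flagged={is_flagged(text)}")
```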

In [None]:
def to_ascii(_lyrics, find_char, replace_char):
    """Replace find_char (a regex) with replace_char and report how many songs are still flagged."""
    _lyrics = _lyrics.replace(find_char, replace_char, regex=True)
    print(f"Changing {find_char} to {replace_char}.")
    new_sum = sum(_lyrics.str.contains(non_eng_pattern))
    print(f"reduced non-eng songs to {new_sum} \n")
    return _lyrics

In [None]:
len(df)

185769

In [None]:
sum(lyrics.str.contains(non_eng_pattern))

78475

In [None]:
sum(lyrics.str.contains(non_eng_pattern)) / len(df)

0.422433236977106

About 42% of our songs contain at least one character flagged by this pattern, but it turns out a lot of that is due to the curious use of lookalikes for what should be plain ASCII characters. We can again use regex to replace these lookalikes.

In [None]:
replacements = [("[—–‒−]", "-"), ("[‘’′ʼ]", "'"),("…","..."), ('[“”″]', '"'),("[\u2005\u205f\xa0\u200a]", " "),
                ("[\u200b\u2060\u200e\ufeff]", ""), ("е", "e"), ("‚",","), ("Ι", "I")]
for rep in replacements:
    lyrics = to_ascii(lyrics, rep[0], rep[1])
df['lyrics'] = lyrics

Changing [—–‒−] to -.
reduced non-eng songs to 73387 

Changing [‘’′ʼ] to '.
reduced non-eng songs to 52411 

Changing … to ....
reduced non-eng songs to 50382 

Changing [“”″] to ".
reduced non-eng songs to 47548 

Changing [    ] to  .
reduced non-eng songs to 33486 

Changing [​⁠‎﻿] to .
reduced non-eng songs to 32627 

Changing е to e.
reduced non-eng songs to 26993 

Changing ‚ to ,.
reduced non-eng songs to 26822 

Changing Ι to I.
reduced non-eng songs to 26820 



In [None]:
sum(lyrics.str.contains(non_eng_pattern)) / len(df)

0.1443728501526089

When we replace the lookalikes, the percentage of "non-English" songs goes down from 42% to 14%. In particular, a lot of obscure Unicode whitespace characters seem to be in use.
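
One way to find lookalikes like these is to count which flagged characters appear in the most songs and look up their Unicode names. The sketch below shows the idea on made-up lyrics; `top_offenders` and the sample list are hypothetical, and this is not necessarily how the replacement list above was built.

```python
import re
import unicodedata
from collections import Counter

non_latin1 = re.compile(r"[^\x00-\xFF]")

def top_offenders(lyrics_iter, n=10):
    """Return (char, code point, Unicode name, song count) for the most common flagged characters."""
    counts = Counter()
    for text in lyrics_iter:
        # Use a set so each song counts a character at most once
        counts.update(set(non_latin1.findall(text)))
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), k)
        for ch, k in counts.most_common(n)
    ]

# Hypothetical lyrics; on the real data this would be df['lyrics']
sample_lyrics = ["it\u2019s fine", "don\u2019t\u2026", "plain ascii", "a\u200bb"]
top = top_offenders(sample_lyrics, n=3)
for row in top:
    print(row)
```

Printing the code point and name matters because several of these characters are invisible or indistinguishable from ASCII on screen.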

Upon further inspection, we can confirm that our method separates English from non-English songs about as well as we can expect it to.

In [None]:
eng_df = df[~df['lyrics'].str.contains(non_eng_pattern)]
eng_df

Unnamed: 0,index,title,tag,artist,year,views,features,lyrics,id
0,0,Killa Cam,rap,Cam'ron,2004,173166,"{""Cam\\'ron"",""Opera Steve""}","\nKilla Cam, Killa Cam, Cam\nKilla Cam, Killa ...",1
1,1,Can I Live,rap,JAY-Z,1996,468624,{},"\n\n\nYeah, hah, yeah, Roc-A-Fella\nWe invite ...",3
2,3,Down and Out,rap,Cam'ron,2004,144404,"{""Cam\\'ron"",""Kanye West"",""Syleena Johnson""}","\n\n\nUgh, Killa!\nBaby!\nKanye, this that 197...",5
3,4,Fly In,rap,Lil Wayne,2005,78271,{},"\nSo they ask me\n""Young boy\nWhat you gon' do...",6
4,5,Lollipop Remix,rap,Lil Wayne,2008,580832,"{""Kanye West"",""Static Major""}","\nHaha\nUh-huh\nNo homo (Young Mula, baby!)\nI...",7
...,...,...,...,...,...,...,...,...,...
185784,5893992,Starting 5,rap,"Dreamville, Lute, Cozz & Omen",2022,17441,"{""Dreamville / Lute / Cozz & Omen""}","\n\n\nAyy, Lute\nCan you do me a favor?\nI nee...",7856933
185785,5894479,First Class,rap,Jack Harlow,2022,79607,{},"\nMm\nI been a G, throw up the L\nSex in the A...",7857574
185786,5896182,Halfway Home,pop,Louis Tomlinson,2022,12717,{},,7859681
185788,5901262,IVE - LOVE DIVE Romanized,pop,Genius Romanizations,2022,14383,{},"\nOoh, ooh\nYeah\n\n\nNega cham gunggeumhae ge...",7866684


In [None]:
non_eng_df = df[df['lyrics'].str.contains(non_eng_pattern)]
non_eng_df

Unnamed: 0,index,title,tag,artist,year,views,features,lyrics,id
33,46,Barry Bonds,rap,Kanye West,2007,280626,"{""Lil Wayne""}","\nIt's what you all been waiting for, ain't it...",38
440,533,Watch My Shoes,rap,Lil Wayne,2009,232712,{},"\nOK, No Ceilings, motherfucker, good morning\...",534
1070,1311,Gotta Go Hard,rap,Nicki Minaj,2009,127459,"{""Lil Wayne""}","\n(And we go by The Runners)\nYo, SB, haha (Go...",1281
1097,1352,Lord Lord Lord,rap,Kanye West,2010,84256,"{Raekwon,""Yasiin Bey"",""Swizz Beatz"",""Charlie W...","\n\n\nYeah, yeah, there it go, there it is\nTh...",1323
1101,1356,Heard Em Say,rap,Kanye West,2005,608685,"{""Adam Levine""}","\n\n\nWest-Mr. West-Mr. West\nUh, yeah, uh, ye...",1329
...,...,...,...,...,...,...,...,...,...
185773,5890175,ОКО EYE,rap,pyrokinesis,2022,44012,{​​pyrokinesis},"\n\n\n2022, 2022, 2022, 2022\n\n\nЗадeржи дыха...",7851935
185776,5891754,Kwaku the Traveller,rap,Black Sherif,2022,28608,{},\nYeahhh\nKwaku Killa don't lie when I say I d...,7853921
185778,5892473,Krive Karte,pop,Dino Merlin,2022,10759,{},\n\n\nBio sam nepravedan prema tebi\nOnako kak...,7855001
185787,5896315,Селяви Cest la vie,rap,MORGENSHTERN,2022,12736,{},"\n\n\nНам пора прощаться (Пока), c'est la vie ...",7859858


Still, a non-negligible portion of the songs we filtered out as non-English appear to be English. We can look at why this is the case:

In [None]:
def show_non_eng_lyric(lyric_str, half_width):
    """Print the first flagged character with half_width characters of context on each side."""
    non_eng = re.search(non_eng_pattern, lyric_str)
    start, end = non_eng.span()
    # Slicing past the end of a string is safe, so only the left bound needs clamping
    print(lyric_str[max(start - half_width, 0):end + half_width])

In [None]:
for i in range(len(non_eng_df)):
    print(f"\n{i+1}")
    print(f"{non_eng_df['title'].iloc[i]} by {non_eng_df['artist'].iloc[i]}")
    show_non_eng_lyric(non_eng_df['lyrics'].iloc[i], 20)


1
Barry Bonds by Kanye West
it
I don't need writērs, I might bounce i

2
Watch My Shoes by Lil Wayne
ke a fucking hors d'œuvre around this ho


3
Gotta Go Hard by Nicki Minaj
 call my gun "Gunović"
Weezy F Baby and t

4
Lord Lord Lord by Kanye West
of days, Yawm al-Qiyāmah
This tiny stone 

5
Heard Em Say by Kanye West
the minister say
Allāhu Akbar and throw i

6
Oodles of Os by De La Soul
at speaks the Guacomō
Kinfolk will play t

7
Muhammad Walks by Lupe Fiasco
 like Jesus sall Allāhu ʿalay-hi wa-salla

8
Nuttin to Do by Bad Meets Evil
d orders
Ate hors d'œuvres and hit the wa

9
Pourvu quelles maiment by Booba
rai pas ton fiancé
Cœur brisé, le cul cas

10
Lunatic by Booba
bre
La main sur le cœur, l'autre sur le c

11
My New Shit by Drake
And I would be like �I ain't getting serv

12
Trône by Mdine
en ai des hauts de cœur
Je suis fils de b

13
Différent by Orelsan
nt résidant dans l'cœur des schnecks
"Bob

14
Jigga That Nigga by JAY-Z

みなさんJay-Zのようになりたいよ
彼は

15
La France by Sniper
n

In [None]:
for i in range(len(non_eng_df)):
    print(re.search(non_eng_pattern, non_eng_df['lyrics'].iloc[i]))

<regex.Match object; span=(663, 664), match='ē'>
<regex.Match object; span=(974, 975), match='œ'>
<regex.Match object; span=(4337, 4338), match='ć'>
<regex.Match object; span=(917, 918), match='ā'>
<regex.Match object; span=(393, 394), match='ā'>
<regex.Match object; span=(1184, 1185), match='ō'>
<regex.Match object; span=(2049, 2050), match='ā'>
<regex.Match object; span=(363, 364), match='œ'>
<regex.Match object; span=(962, 963), match='œ'>
<regex.Match object; span=(1030, 1031), match='œ'>
<regex.Match object; span=(1594, 1595), match='�'>
<regex.Match object; span=(445, 446), match='œ'>
<regex.Match object; span=(277, 278), match='œ'>
<regex.Match object; span=(1, 2), match='み'>
<regex.Match object; span=(5457, 5458), match='œ'>
<regex.Match object; span=(1776, 1777), match='œ'>
<regex.Match object; span=(3209, 3210), match='œ'>
<regex.Match object; span=(2498, 2499), match='œ'>
<regex.Match object; span=(4, 5), match='誰'>
<regex.Match object; span=(2544, 2545), match='œ'>
<regex.M

A lot of the lyrics we wrongly filtered as non-English use accented or ligature characters (such as ē, ā, ć, and œ) that aren't on our character "whitelist".
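
One possible way to handle these cases is to decompose accented letters with Unicode normalization and drop the combining marks. This is only a sketch: `strip_accents` is a hypothetical helper, not part of the pipeline above. The ligature œ has no Unicode decomposition, so it is mapped by hand. Be aware that this also strips accents from Latin-1 letters (é becomes e), which changes the lyrics text.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after NFKD decomposition and expand the oe ligature."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # U+0153 / U+0152 (oe ligatures) do not decompose, so replace them explicitly
    return no_marks.replace("\u0153", "oe").replace("\u0152", "OE")

for word in ["writ\u0113rs", "Gunovi\u0107", "All\u0101hu", "hors d'\u0153uvre"]:
    print(f"{word} -> {strip_accents(word)}")
```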

In the future we can go back and refine our method for filtering non-English songs, but for now I consider this a reasonable error rate. (By inspection, about 15% of the filtered songs are English.)
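
As one direction for that refinement, we could flag a song only when flagged characters make up more than a small share of its text, instead of rejecting it for a single stray character. The sketch below is an untested idea, not something evaluated on the dataset: the helper names and the 1% threshold are arbitrary assumptions.

```python
import re

non_latin1 = re.compile(r"[^\x00-\xFF]")

def non_latin1_ratio(text):
    """Fraction of characters in text that fall outside the Latin-1 range."""
    if not text:
        return 0.0
    return len(non_latin1.findall(text)) / len(text)

def looks_english(text, max_ratio=0.01):
    """Hypothetical rule: tolerate up to max_ratio flagged characters."""
    return non_latin1_ratio(text) <= max_ratio

# Made-up examples: a long English lyric with one stray accent, and a Cyrillic line
english = "Killa Cam, Killa Cam\n" * 20 + "I don't need writ\u0113rs"
russian = "\u041d\u0430\u043c \u043f\u043e\u0440\u0430"
print(looks_english(english), looks_english(russian))
```

Like the original filter, this would still pass songs in other languages that use only Latin-1 characters.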

# Write out clean datasets

We are going to write two datasets: one with all the songs with more than 10,000 views, which we plan to use in our final implementation of the project, and one with songs with more than 1,000,000 views, which we can play around with a bit more easily.

In [None]:
# use eng_df so the non-English filter above is actually applied to the output
mini_df = eng_df[eng_df['views'] > 1_000_000]

In [None]:
mini_df.to_csv("data/small_dataset")

In [None]:
eng_df.to_csv("data/clean_dataset")
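
One thing to keep in mind when reading these files back: by default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column. A small sketch of the difference, using a throwaway DataFrame and a temporary file (both hypothetical):

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({"title": ["Song B"], "views": [20_000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_dataset.csv")

    demo.to_csv(path)                  # default: index is written as a column
    with_index = pd.read_csv(path)

    demo.to_csv(path, index=False)     # index left out
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to pass `index=False` here depends on whether later steps rely on the saved index.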