# Sync Link
### Part 1B: Starting the Dataset with Randomness

I've already scraped songs from [What-Song.com](https://what-song.com) which will essentially be the target (whether or not it appears on that site). Now I'll need to add additional information for each song and build out the dataset. Ideally, only about half of the final songs will have been synced so I need another source to help randomly pick songs and end up with that distribution. 

In searching for a giant list of songs, I realized a karaoke catalog might be the best thing I could find. Below I've downloaded the catalog from [KaraFun](https://karafun.com) and will use that list of songs to randomly built this dataset.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
import regex as re

In [2]:
kara = pd.read_csv('../data/karafuncatalog.csv', sep = ';')

In [3]:
kara.tail()

Unnamed: 0,Id,Title,Artist,Year,Duo,Explicit,Date Added,Styles,Languages
36458,30797,Steppin' Away,Stereomud,2001,0,0,2011-11-16,"Hard/Metal,Rock",English
36459,19236,Where Do We Go,Sandrine,2007,0,0,2013-07-23,Pop,English
36460,21128,Sata Vuotta,Herra Ylppö & Ihmiset,2008,0,0,2012-05-29,"Rock,Pop",Finnish
36461,47720,As Long As I Got You,Lily Allen,2014,0,0,2016-02-24,Pop,English
36462,23828,Cold,Jeremy McComb,2008,0,0,2013-11-19,Country,English


The combo of title and artist is what I'll use to see if the song has been synced before in the other DataFrame.

In [4]:
kara['title_artist'] = kara['Title'] + " - " + kara['Artist']

In [5]:
kara['title_artist']

0                   Shallow - A Star is Born
1        Tennessee Whiskey - Chris Stapleton
2                 Dance Monkey - Tones and I
3              Sweet Caroline - Neil Diamond
4          Someone You Loved - Lewis Capaldi
                        ...                 
36458              Steppin' Away - Stereomud
36459              Where Do We Go - Sandrine
36460    Sata Vuotta - Herra Ylppö & Ihmiset
36461      As Long As I Got You - Lily Allen
36462                   Cold - Jeremy McComb
Name: title_artist, Length: 36463, dtype: object

Loading in the data from WhatSong.

In [6]:
what = pd.read_csv('../data/ws_all_songs_cleaned.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [7]:
what.head()

Unnamed: 0,artist,song_title,show,episode,date,month_year,year,favorites,song_artist,type,avg_per_ep
0,Dusty Springfield,Girls It Ain't Easy,The Hunt,none,12 Mar 2020,Mar 2020,2020,0,"""Girls It Ain't Easy"" - Dusty Springfield",Movie,35.0
1,Dusty Springfield,Wishin' and Hopin',Sex Education,S2E8,16 Jan 2020,Jan 2020,2020,0,"""Wishin' and Hopin'"" - Dusty Springfield",TV,7.94
2,Dusty Springfield,Spooky,9-1-1,S3E6,27 Oct 2019,Oct 2019,2019,0,"""Spooky"" - Dusty Springfield",TV,4.11
3,Dusty Springfield,I Can't Make It Alone,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,"""I Can't Make It Alone"" - Dusty Springfield",TV,7.8
4,Dusty Springfield,No Easy Way Down,The Deuce,S3E4,29 Sep 2019,Sep 2019,2019,0,"""No Easy Way Down"" - Dusty Springfield",TV,7.8


In [8]:
what.loc['song_artist', :] = what['song_title'] + " - " + what["artist"]

In [9]:
what = what.dropna()

In [10]:
def remove_quote(string):
    return string.replace('\"', '')

In [11]:
what['song_artist'] = what['song_artist'].apply(remove_quote)

Finding the unique title/artist combinations and saving them as a set will make it quicker for my computer to search through.

In [12]:
synced = set(what['song_artist'].unique())

Also making everything lowercase helps the matches be more accurate.

In [13]:
def lower(string):
    return string.lower()

In [14]:
synced = str(set([lower(x) for x in synced]))

This removes songs that were on the soundtrack aka created for the film/show. These aren't helpful in predicting if a song is usable because it is made specifically for that project and no licensing is needed.

In [15]:
kara = kara.drop(index = kara[kara['Styles'].str.contains('soundtrack')].index)

The function below returns 1 if the song is on the WhatSong website and has been synced before.

In [16]:
def sync(song):
    if song in synced:
        return 1
    else:
        return 0

In [17]:
kara['title_artist'] = kara['title_artist'].apply(lower)

Applying the synced function to the karaoke songs:

In [18]:
kara['synced'] = kara['title_artist'].apply(sync)

Songs with artists of "Traditional" are in the public domain. I personally know many of these songs are used quite frequently, but there's a differnet artist attached, not "Traditional", so I'll go ahead and set these as synced.

In [19]:
kara.loc[kara[kara['Artist'].str.contains('Traditional')].index, 'synced'] = 1

In [20]:
kara['synced'].value_counts()

0    29372
1     4783
Name: synced, dtype: int64

In [21]:
kara = kara[kara['Languages'].str.contains('English')]

In [22]:
kara['synced'].value_counts()

0    21317
1     4691
Name: synced, dtype: int64

In [23]:
kara[kara['synced'] == 0]

Unnamed: 0,Id,Title,Artist,Year,Duo,Explicit,Date Added,Styles,Languages,title_artist,synced
22,60088,Blinding Lights,The Weeknd,2019,0,0,2020-01-13,"80s,Pop,Electro",English,blinding lights - the weeknd,0
24,59413,Memories,Maroon 5,2019,0,0,2019-11-03,Pop,English,memories - maroon 5,0
27,5089,Let It Be,The Beatles,1970,0,0,2011-10-28,Pop,English,let it be - the beatles,0
37,12913,Stand by Me,Ben E. King,1961,0,0,2008-07-25,Soul,English,stand by me - ben e. king,0
42,21608,Hallelujah,Alexandra Burke,2008,0,0,2011-05-30,Pop,English,hallelujah - alexandra burke,0
...,...,...,...,...,...,...,...,...,...,...,...
36457,29762,Breathe,Seven Channels,2001,0,0,2013-08-01,"Hard/Metal,Rock",English,breathe - seven channels,0
36458,30797,Steppin' Away,Stereomud,2001,0,0,2011-11-16,"Hard/Metal,Rock",English,steppin' away - stereomud,0
36459,19236,Where Do We Go,Sandrine,2007,0,0,2013-07-23,Pop,English,where do we go - sandrine,0
36461,47720,As Long As I Got You,Lily Allen,2014,0,0,2016-02-24,Pop,English,as long as i got you - lily allen,0


From my past experience, looking at these songs that were labeled "not synced" I know that some of them have been used just aren't recorded on the site. To fix the class imbalance and try to ensure accuracy, my final dataset will be around 10,000 songs including all in class 1, and a random sample of class 0 which I can manually check.

In [24]:
unsync = kara[kara['synced'] == 0].sample(10000, 
                                        replace = False, 
                                        random_state = 21)

In [25]:
sync_df = kara[kara['synced'] == 1]

In [26]:
sync_df = pd.concat([sync_df, unsync], axis = 0)

In [27]:
sync_df = sync_df.reset_index()

Getting rid of unnecessary columns.

In [28]:
sync_df = sync_df.drop(columns = ['index', 'Id', 'Duo', 'Date Added'])

In [29]:
sync_df.to_csv('../data/synced.csv', index = False)

In [30]:
sync_df

Unnamed: 0,Title,Artist,Year,Explicit,Styles,Languages,title_artist,synced
0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,1
1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,1
2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,1
3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,1
4,Amazing Grace,Traditional,1831,0,"Traditionnal,Gospel,Blues",English,amazing grace - traditional,1
...,...,...,...,...,...,...,...,...
14686,It Came Upon a Midnight Clear,Frank Sinatra,1948,0,"Christmas,Christian,Traditionnal",English,it came upon a midnight clear - frank sinatra,0
14687,Singing The Blues,The Kentucky Headhunters,1997,0,"Country,Rock,Rock 'n Roll",English,singing the blues - the kentucky headhunters,0
14688,Beautiful War,Kings of Leon,2013,0,"Alternative,Rock",English,beautiful war - kings of leon,0
14689,Midnight Blues,Gary Moore,1990,0,"Blues,Rock",English,midnight blues - gary moore,0
