# Sync Link
### Part 2: Cleaning

All the data is gathered so it's time to clean! First, I'm going to create a feautre that gauges accuracy. Since I was pulling for multiple sources, I wanted to make sure the results actually line up. Once I've verified that, I can get rid of duplicate columns like title, artist, release year, etc. and reorganize the columns.

In [1]:
import pandas as pd
import numpy as np
import regex as re
import requests

In [2]:
pt1 = pd.read_csv('./data/sync_spotify_final_1.csv')

In [3]:
pt2 = pd.read_csv('./data/sync_spotify_final_2.csv')

In [4]:
sync = pd.concat([pt1, pt2], axis = 0)

In [5]:
sync.duplicated().sum()

0

In [6]:
sync = sync.drop(columns = ['index', 'level_0'])

In [7]:
sync.reset_index(inplace = True)

In [8]:
sync.shape

(10845, 41)

Imputing Data:
Since this information is all easily found online, I'm going to save a few rows by manually reseraching and adding the info in.

Songs that have no lyric url are all Traditional meaning they are in the public domain and the *publishing* does not require clearance. 

In [9]:
sync[sync['lyric_url'].isna()].fillna(0, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,


In [10]:
sync.dropna(inplace = True)

#### Step 1: Assessing Accuracy
This dataset was built by using 4 different sources. To make sure they match, I'm going to create a function that scores the accuracy of each observation by making sure the title/artist from all four sources are the same and the years from the original set matches Deezer.

First, I'll need to reformat the title/artist from the sources to match the original (all lowercase with a dash inbetween).

In [11]:
#Deezer
sync['deezer_title_artist'] = sync['d_song'].str.lower().str.strip() + ' - ' + sync['d_artist'].str.lower().str.strip()

In [12]:
#Spotify
sync['spotify_title_artist'] = sync['s_track'].str.lower().str.strip() + ' - ' + sync['s_artist'].str.lower().str.strip()

In [13]:
#Lyric Freak has to be cleaned first
sync['l_title'] = sync['l_title'].apply(lambda x: re.sub('\\n', ' ', x))




In [14]:
def clean_up(song):
    song = song.replace('About', '').replace('Lyrics', '').replace('lyrics', '')
    try:
        song = song.split('–')[1]
    except:
        pass
    song = song.strip()
    return song

In [15]:
clean_up('Tones And I – Dance Monkey Lyrics')

'Dance Monkey'

In [16]:
sync['l_title'] = sync['l_title'].apply(clean_up)

In [17]:
sync['l_title']

0           Tennessee Whiskey
1                Dance Monkey
2              Sweet Caroline
3           Someone You Loved
6                       Creep
                 ...         
10840                       0
10841           Frank Sinatra
10842           Beautiful War
10843          Midnight Blues
10844    Smokin' And Drinkin'
Name: l_title, Length: 10817, dtype: object

In [18]:
sync.isna().sum();

In [19]:
sync.tail()

Unnamed: 0,index,title,artist,year,explicit,styles,languages,title_artist,synced,d_id,...,s_speech,s_acoustic,s_inst,s_live,s_valence,s_tempo,s_duration,s_time_sig,deezer_title_artist,spotify_title_artist
10840,4118,Fixing A Hole,The Beatles,1967,0,Rock,English,fixing a hole - the beatles,0,116348678,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,fixing a hole (remastered 2009) - the beatles,fixing a hole - remastered 2009 - the beatles
10841,4119,It Came Upon a Midnight Clear,Frank Sinatra,1948,0,"Christmas,Christian,Traditionnal",English,it came upon a midnight clear - frank sinatra,0,115007404,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,it came upon a midnight clear - bing crosby,it came upon a midnight clear - frank sinatra
10842,4120,Beautiful War,Kings of Leon,2013,0,"Alternative,Rock",English,beautiful war - kings of leon,0,70584821,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,beautiful war - kings of leon,beautiful war - kings of leon
10843,4121,Midnight Blues,Gary Moore,1990,0,"Blues,Rock",English,midnight blues - gary moore,0,3133096,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,midnight blues - gary moore,midnight blues - gary moore
10844,4122,Smokin' and Drinkin',Miranda Lambert,2014,0,"Pop,Country,Soft rock",English,smokin' and drinkin' - miranda lambert,0,78383556,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,smokin' and drinkin' (feat. little big town) -...,smokin' and drinkin' (feat. little big town) -...


In [20]:
sync.drop(columns='index', inplace=True)

In [21]:
sync['l_title_artist'] = sync['l_title'].str.lower().str.strip() + ' - ' + sync['l_artist'].str.lower().str.strip()

In [22]:
def get_year(string):
    return string[0:4]

In [23]:
sync['d_year'] = sync['d_release'].apply(get_year)

In [24]:
sync.head(1).T

Unnamed: 0,0
title,Tennessee Whiskey
artist,Chris Stapleton
year,2015
explicit,0
styles,"Blues,Rock,Country"
languages,English
title_artist,tennessee whiskey - chris stapleton
synced,1
d_id,98975170
d_song,Tennessee Whiskey


In [25]:
sync.reset_index(inplace = True)

In [26]:
sample = sync.head()

In [27]:
sample['score'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [28]:
sample.drop(columns= 'index', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [29]:
def fuzzy_score(df):
    
    #For every row in the dataframe
    for i in range(len(df)):
        #This column is what the others should match
        match = df.loc[i, 'title_artist']
        
        #This will score each column
        deezer_score = 0
        spotify_score = 0
        lyric_score = 0
        
        #This is what to compare the match to
        deezer = df.loc[i, 'deezer_title_artist']
        spotify = df.loc[i, 'spotify_title_artist']
        lyric = df.loc[i, 'spotify_title_artist']

        #Try looping through each letter and counting up the number of matches
        try:
            for j, m in enumerate(match):

                if deezer[j] == m:
                    deezer_score += 1
                else:
                    deezer_score = deezer_score


                if spotify[j] == m: 
                    spotify_score += 1   
                else:
                    spotify_score = spotify_score


                if lyric[j] == m: 
                     lyric_score += 1   
                else:
                    lyric_score = lyric_score
                final = ((deezer_score / len(match)) + (spotify_score / len(match)) + (lyric_score / len(match))) / 3
                final = round(final, 2)
        
        #If this doesn't work, it's likely because the lengths of the string are too different (i.e. missing or wrong info)
        except:
            final = 'Error: Diff Length'
        
    
        df.loc[i, 'score'] = final
    return df

In [30]:
fuzzy_score(sample)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,title,artist,year,explicit,styles,languages,title_artist,synced,d_id,d_song,...,s_live,s_valence,s_tempo,s_duration,s_time_sig,deezer_title_artist,spotify_title_artist,l_title_artist,d_year,score
0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,1,98975170,Tennessee Whiskey,...,0.0821,0.512,48.718,293293.0,4.0,tennessee whiskey - chris stapleton,tennessee whiskey - chris stapleton,tennessee whiskey - chris stapleton,2015,1.0
1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,1,739870792,Dance Monkey,...,0.149,0.513,98.027,209438.0,4.0,dance monkey - tones and i,dance monkey - tones and i,dance monkey - tones and i,2019,1.0
2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,1,145434430,Sweet Caroline,...,0.237,0.578,63.05,203573.0,4.0,sweet caroline - neil diamond,sweet caroline - neil diamond,sweet caroline - neil diamond,2017,1.0
3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,1,582143242,Someone You Loved,...,0.105,0.446,109.891,182161.0,4.0,someone you loved - lewis capaldi,someone you loved - lewis capaldi,someone you loved - lewis capaldi,2018,1.0
4,Creep,Radiohead,1992,1,"Rock,Alternative",English,creep - radiohead,1,138547415,Creep,...,0.129,0.104,91.841,238640.0,4.0,creep - radiohead,creep - radiohead,creep - radiohead,1993,1.0


In [31]:
sync['score'] = 0

In [32]:
sync = fuzzy_score(sync)

In [33]:
sync = sync[sync['score'] != 'Error: Diff Length']

In [34]:
sync = sync[sync['score'] > .5]

In [35]:
def count(string):
    return len(string.split(','))

In [36]:
sync['n_writers'] = sync['l_writer'].apply(count)

In [37]:
sync = sync[sync['l_pub'] != '0']

In [38]:
def clean_pub(string):
    try:
        string = string.split('©')[1]
        string = string.split('Lyrics')[0]
        if ', Inc.' in string:
            string = string.replace(', Inc.', '')
        if '\\n' in string:
            string = string.replace('\\n', '')  
    except:
        pass
    return string.strip()

In [39]:
clean_pub(sync['l_pub'][4])

'Sony/ATV Music Publishing LLC, Warner Chappell Music'

In [40]:
sync['l_pub'] = sync['l_pub'].apply(clean_pub)

In [41]:
sync['n_pub'] = sync['l_pub'].apply(count)

In [42]:
sync.columns

Index(['index', 'title', 'artist', 'year', 'explicit', 'styles', 'languages',
       'title_artist', 'synced', 'd_id', 'd_song', 'd_isrc', 'd_release',
       'd_explicit', 'd_bpm', 'd_artist', 'd_album_id', 'd_album', 'd_art',
       'lyric_url', 'l_title', 'l_artist', 'l_album', 'l_writer', 'l_pub',
       's_artist', 's_track', 's_uri', 's_dance', 's_energy', 's_key',
       's_loudness', 's_mode', 's_speech', 's_acoustic', 's_inst', 's_live',
       's_valence', 's_tempo', 's_duration', 's_time_sig',
       'deezer_title_artist', 'spotify_title_artist', 'l_title_artist',
       'd_year', 'score', 'n_writers', 'n_pub'],
      dtype='object')

In [43]:
sync = sync[['title', 'artist', 'year', 'explicit', 'styles',
       'languages', 'title_artist', 'd_id', 'd_isrc',
       'd_release', 'd_album_id', 'd_album',
       'd_art', 'lyric_url','l_writer', 'n_writers', 'n_pub',
       'l_pub', 's_uri', 's_dance', 's_energy', 's_key',
       's_loudness', 's_mode', 's_speech', 's_acoustic', 's_inst', 's_live',
       's_valence', 's_tempo', 's_duration', 's_time_sig', 'score',
       'synced']]

In [44]:
sync

Unnamed: 0,title,artist,year,explicit,styles,languages,title_artist,d_id,d_isrc,d_release,...,s_speech,s_acoustic,s_inst,s_live,s_valence,s_tempo,s_duration,s_time_sig,score,synced
0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,98975170,USUM71418088,2015-05-04,...,0.0298,0.2050,0.009600,0.0821,0.512,48.718,293293.0,4.0,1,1
1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,739870792,QZES71982312,2019-08-29,...,0.0924,0.6920,0.000104,0.1490,0.513,98.027,209438.0,4.0,1,1
2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,145434430,USMC16991138,2017-03-31,...,0.0274,0.6110,0.000109,0.2370,0.578,63.050,203573.0,4.0,1,1
3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,582143242,DEUM71807062,2018-11-08,...,0.0319,0.7510,0.000000,0.1050,0.446,109.891,182161.0,4.0,1,1
4,Creep,Radiohead,1992,1,"Rock,Alternative",English,creep - radiohead,138547415,GBAYE9200070,1993-02-22,...,0.0369,0.0102,0.000141,0.1290,0.104,91.841,238640.0,4.0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10810,Buck Rogers,Feeder,2001,0,"Rock,Alternative",English,buck rogers - feeder,131233178,GBBND0000727,1996-05-15,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,1,0
10811,I Cross My Heart,George Strait,1992,0,Country,English,i cross my heart - george strait,890375,USMC19238758,2004-10-05,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,0.73,0
10814,Beautiful War,Kings of Leon,2013,0,"Alternative,Rock",English,beautiful war - kings of leon,70584821,USRC11300783,2013-09-20,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,1,0
10815,Midnight Blues,Gary Moore,1990,0,"Blues,Rock",English,midnight blues - gary moore,3133096,GBAAA9000066,1995-01-24,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,1,0


In [45]:
sync = sync.reset_index()

In [46]:
sync.drop(columns = 'index', inplace = True)

In [47]:
sync

Unnamed: 0,title,artist,year,explicit,styles,languages,title_artist,d_id,d_isrc,d_release,...,s_speech,s_acoustic,s_inst,s_live,s_valence,s_tempo,s_duration,s_time_sig,score,synced
0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,98975170,USUM71418088,2015-05-04,...,0.0298,0.2050,0.009600,0.0821,0.512,48.718,293293.0,4.0,1,1
1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,739870792,QZES71982312,2019-08-29,...,0.0924,0.6920,0.000104,0.1490,0.513,98.027,209438.0,4.0,1,1
2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,145434430,USMC16991138,2017-03-31,...,0.0274,0.6110,0.000109,0.2370,0.578,63.050,203573.0,4.0,1,1
3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,582143242,DEUM71807062,2018-11-08,...,0.0319,0.7510,0.000000,0.1050,0.446,109.891,182161.0,4.0,1,1
4,Creep,Radiohead,1992,1,"Rock,Alternative",English,creep - radiohead,138547415,GBAYE9200070,1993-02-22,...,0.0369,0.0102,0.000141,0.1290,0.104,91.841,238640.0,4.0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8138,Buck Rogers,Feeder,2001,0,"Rock,Alternative",English,buck rogers - feeder,131233178,GBBND0000727,1996-05-15,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,1,0
8139,I Cross My Heart,George Strait,1992,0,Country,English,i cross my heart - george strait,890375,USMC19238758,2004-10-05,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,0.73,0
8140,Beautiful War,Kings of Leon,2013,0,"Alternative,Rock",English,beautiful war - kings of leon,70584821,USRC11300783,2013-09-20,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,1,0
8141,Midnight Blues,Gary Moore,1990,0,"Blues,Rock",English,midnight blues - gary moore,3133096,GBAAA9000066,1995-01-24,...,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,1,0


In [48]:
def split_genre(x):
    x = x.split(',')
    all_x = [a.strip() for a in x]
    return all_x

In [49]:
sync['styles'] = sync['styles'].apply(split_genre)

In [50]:
sync['styles']

0          [Blues, Rock, Country]
1                           [Pop]
2                           [Pop]
3                           [Pop]
4             [Rock, Alternative]
                  ...            
8138          [Rock, Alternative]
8139                    [Country]
8140          [Alternative, Rock]
8141                [Blues, Rock]
8142    [Pop, Country, Soft rock]
Name: styles, Length: 8143, dtype: object

In [51]:
all_styles = []
for style in sync['styles']:
    for s in style:
        if s.strip() in all_styles:
            pass
        else:
            all_styles.append(s.strip())

In [52]:
for style in all_styles:
    sync[style] = 0
    for row in range(0, len(sync['styles'])):
        if style in sync.loc[row, 'styles']:
            sync.loc[row, style] = 1
        else:
            sync.loc[row, style] = 0

In [53]:
sync.head()

Unnamed: 0,title,artist,year,explicit,styles,languages,title_artist,d_id,d_isrc,d_release,...,Latin,Ska,Musical,Christmas,Classical,Humour,French pop,World/Folk,Zouk/Creole,Schlager
0,Tennessee Whiskey,Chris Stapleton,2015,0,"[Blues, Rock, Country]",English,tennessee whiskey - chris stapleton,98975170,USUM71418088,2015-05-04,...,0,0,0,0,0,0,0,0,0,0
1,Dance Monkey,Tones and I,2019,0,[Pop],English,dance monkey - tones and i,739870792,QZES71982312,2019-08-29,...,0,0,0,0,0,0,0,0,0,0
2,Sweet Caroline,Neil Diamond,1969,0,[Pop],English,sweet caroline - neil diamond,145434430,USMC16991138,2017-03-31,...,0,0,0,0,0,0,0,0,0,0
3,Someone You Loved,Lewis Capaldi,2018,0,[Pop],English,someone you loved - lewis capaldi,582143242,DEUM71807062,2018-11-08,...,0,0,0,0,0,0,0,0,0,0
4,Creep,Radiohead,1992,1,"[Rock, Alternative]",English,creep - radiohead,138547415,GBAYE9200070,1993-02-22,...,0,0,0,0,0,0,0,0,0,0


In [54]:
sync.to_csv('./data/cleaned_sync.csv', index = False)