# Sync Link
### Part 2: Cleaning

All the data is gathered, so it's time to clean! First, I'm going to create a feature that gauges accuracy. Since I was pulling from multiple sources, I wanted to make sure the results actually line up. Once I've verified that, I can get rid of duplicate columns like title, artist, release year, etc. and reorganize the columns.

In [1]:
import pandas as pd
import numpy as np
import regex as re
import requests

In [2]:
pt1 = pd.read_csv('./data/sync_spotify_final_1.csv')

In [3]:
pt2 = pd.read_csv('./data/sync_spotify_final_2.csv')

In [4]:
sync = pd.concat([pt1, pt2], axis = 0)

In [5]:
sync.duplicated().sum()

0

In [6]:
sync = sync.drop(columns = ['index', 'level_0'])

In [7]:
sync.reset_index(inplace = True)

In [8]:
sync.shape

(10845, 41)

Imputing Data:
Since this information is all easily found online, I'm going to save a few rows by manually researching and adding the info in.

Songs that have no lyric URL are all Traditional, meaning they are in the public domain and the *publishing* does not require clearance.

In [9]:
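# Caution: boolean indexing returns a copy, so this fill never reaches `sync` (hence the warning below)
sync[sync['lyric_url'].isna()].fillna(0, inplace = True)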

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
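
The warning above means the fill was applied to a temporary copy rather than to `sync`. A minimal sketch of one way to make the fill stick is to assign back through `.loc`. The helper name `fill_rows_missing_url` and the demo rows are made up for illustration, and I'm assuming the intent is to fill every missing value in the rows that have no lyric URL.

```python
import numpy as np
import pandas as pd

def fill_rows_missing_url(df, url_col="lyric_url", value=0):
    """Fill NaNs with `value`, but only in rows where `url_col` is missing."""
    out = df.copy()                      # leave the caller's frame untouched
    mask = out[url_col].isna()           # rows with no lyric URL
    out.loc[mask] = out.loc[mask].fillna(value)
    return out

# Made-up rows; only the column names come from the dataset
demo = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],
        "lyric_url": ["https://example.com/a", np.nan, "https://example.com/c"],
        "l_writer": ["Writer A", np.nan, np.nan],
    },
    dtype=object,
)
filled = fill_rows_missing_url(demo)
print(filled)
```

Rows that do have a lyric URL keep their other missing values, so a later `dropna` would still remove them.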


In [10]:
sync.dropna(inplace = True)

#### Step 1: Assessing Accuracy
This dataset was built from 4 different sources. To make sure they match, I'm going to create a function that scores the accuracy of each observation by checking that the title/artist from all four sources are the same and that the year from the original set matches Deezer's.

First, I'll need to reformat the title/artist from the sources to match the original (all lowercase with a dash in between).

In [11]:
#Deezer
sync['deezer_title_artist'] = sync['d_song'].str.lower().str.strip() + ' - ' + sync['d_artist'].str.lower().str.strip()

In [12]:
#Spotify
sync['spotify_title_artist'] = sync['s_track'].str.lower().str.strip() + ' - ' + sync['s_artist'].str.lower().str.strip()
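
The Deezer and Spotify cells above follow the same pattern, so it could be wrapped in a small helper. This is only a sketch; `make_key` is a name I'm introducing here, not something from the original notebook.

```python
import pandas as pd

def make_key(title, artist):
    """Lowercase and trim both Series, then join them as 'title - artist'."""
    return title.str.lower().str.strip() + " - " + artist.str.lower().str.strip()

# Small check on made-up input with stray whitespace and mixed case
key = make_key(pd.Series(["  Creep "]), pd.Series(["Radiohead"]))
print(key.iloc[0])
```

With that in place, each source column would be one call, e.g. `make_key(sync['d_song'], sync['d_artist'])`.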

In [13]:
#Lyric Freak has to be cleaned first
sync['l_title'] = sync['l_title'].apply(lambda x: re.sub('\\n', ' ', x))




In [14]:
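def clean_up(song):
    # Strip the boilerplate words the lyric site wraps around the title
    song = song.replace('About', '').replace('Lyrics', '').replace('lyrics', '')
    # Titles look like 'Artist – Song'; keep the part after the dash when there is one
    parts = song.split('–')
    if len(parts) > 1:
        song = parts[1]
    return song.strip()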

In [15]:
clean_up('Tones And I – Dance Monkey Lyrics')

'Dance Monkey'

In [16]:
sync['l_title'] = sync['l_title'].apply(clean_up)

In [17]:
sync['l_title']

0           Tennessee Whiskey
1                Dance Monkey
2              Sweet Caroline
3           Someone You Loved
6                       Creep
                 ...         
10840                       0
10841           Frank Sinatra
10842           Beautiful War
10843          Midnight Blues
10844    Smokin' And Drinkin'
Name: l_title, Length: 10817, dtype: object

In [18]:
sync.isna().sum();

In [19]:
sync.tail()

Unnamed: 0,index,title,artist,year,explicit,styles,languages,title_artist,synced,d_id,...,s_speech,s_acoustic,s_inst,s_live,s_valence,s_tempo,s_duration,s_time_sig,deezer_title_artist,spotify_title_artist
10840,4118,Fixing A Hole,The Beatles,1967,0,Rock,English,fixing a hole - the beatles,0,116348678,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,fixing a hole (remastered 2009) - the beatles,fixing a hole - remastered 2009 - the beatles
10841,4119,It Came Upon a Midnight Clear,Frank Sinatra,1948,0,"Christmas,Christian,Traditionnal",English,it came upon a midnight clear - frank sinatra,0,115007404,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,it came upon a midnight clear - bing crosby,it came upon a midnight clear - frank sinatra
10842,4120,Beautiful War,Kings of Leon,2013,0,"Alternative,Rock",English,beautiful war - kings of leon,0,70584821,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,beautiful war - kings of leon,beautiful war - kings of leon
10843,4121,Midnight Blues,Gary Moore,1990,0,"Blues,Rock",English,midnight blues - gary moore,0,3133096,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,midnight blues - gary moore,midnight blues - gary moore
10844,4122,Smokin' and Drinkin',Miranda Lambert,2014,0,"Pop,Country,Soft rock",English,smokin' and drinkin' - miranda lambert,0,78383556,...,0.0356,0.706,0.0334,0.129,0.259,70.828,172133.0,4.0,smokin' and drinkin' (feat. little big town) -...,smokin' and drinkin' (feat. little big town) -...


In [20]:
sync['l_title_artist'] = sync['l_title'].str.lower().str.strip() + ' - ' + sync['l_artist'].str.lower().str.strip()

In [21]:
def get_year(string):
    return string[0:4]

In [22]:
sync['d_year'] = sync['d_release'].apply(get_year)
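
Note that `get_year` returns a string, while `year` in the original set is an integer, so comparing the two columns directly would never match. A rough sketch of a year check that parses the release date instead is below. `year_matches` is a hypothetical helper, and the three demo rows reuse values shown in this notebook's output.

```python
import pandas as pd

def year_matches(df, year_col="year", release_col="d_release"):
    """True where the original year equals the year of the Deezer release date."""
    release_year = pd.to_datetime(df[release_col], errors="coerce").dt.year
    original_year = pd.to_numeric(df[year_col], errors="coerce")
    return release_year.eq(original_year)   # unparseable dates compare as False

demo = pd.DataFrame({
    "title": ["Tennessee Whiskey", "Sweet Caroline", "Creep"],
    "year": [2015, 1969, 1992],
    "d_release": ["2015-05-04", "2017-03-31", "1993-02-22"],
})
demo["year_match"] = year_matches(demo)
print(demo)
```

Reissues (like the 2017 release of Sweet Caroline) fail this check, so an exact year match may be stricter than what is actually wanted.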

In [23]:
sync.head(1).T

Unnamed: 0,0
index,0
title,Tennessee Whiskey
artist,Chris Stapleton
year,2015
explicit,0
styles,"Blues,Rock,Country"
languages,English
title_artist,tennessee whiskey - chris stapleton
synced,1
d_id,98975170


In [24]:
sync.reset_index(inplace = True)

In [25]:
sample = sync.head()

In [26]:
sample['score'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [27]:
def get_score(df):
    # One point per source whose title/artist matches the original (max 3)
    for i in range(len(df)):
        score = 0
        if df.loc[i, 'title_artist'] == df.loc[i, 'deezer_title_artist']:
            score += 1
        if df.loc[i, 'title_artist'] == df.loc[i, 'spotify_title_artist']:
            score += 1
        if df.loc[i, 'title_artist'] == df.loc[i, 'l_title_artist']:
            score += 1
        # Store the score inside the loop so every row gets one
        df.loc[i, 'score'] = score
    return df
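
The same scoring idea can also be written without a Python loop. The sketch below counts, for each row, how many source columns equal `title_artist`. The function name `score_matches` is my own, and the second demo row is illustrative rather than copied from the data.

```python
import pandas as pd

SOURCE_COLS = ["deezer_title_artist", "spotify_title_artist", "l_title_artist"]

def score_matches(df, base_col="title_artist", source_cols=SOURCE_COLS):
    """Return, per row, how many source columns equal the base column."""
    # eq(..., axis=0) compares each source column against base_col row by row
    return df[source_cols].eq(df[base_col], axis=0).sum(axis=1)

demo = pd.DataFrame({
    "title_artist": ["creep - radiohead", "fixing a hole - the beatles"],
    "deezer_title_artist": ["creep - radiohead",
                            "fixing a hole (remastered 2009) - the beatles"],
    "spotify_title_artist": ["creep - radiohead",
                             "fixing a hole - remastered 2009 - the beatles"],
    "l_title_artist": ["creep - radiohead", "fixing a hole - the beatles"],
})
demo["score"] = score_matches(demo)
print(demo["score"].tolist())
```

Because it works on whole columns, this version does not depend on the index running from 0 to n-1 the way the loop does.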

In [28]:
sample

Unnamed: 0,level_0,index,title,artist,year,explicit,styles,languages,title_artist,synced,...,s_live,s_valence,s_tempo,s_duration,s_time_sig,deezer_title_artist,spotify_title_artist,l_title_artist,d_year,score
0,0,0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,1,...,0.0821,0.512,48.718,293293.0,4.0,tennessee whiskey - chris stapleton,tennessee whiskey - chris stapleton,tennessee whiskey - chris stapleton,2015,0
1,1,1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,1,...,0.149,0.513,98.027,209438.0,4.0,dance monkey - tones and i,dance monkey - tones and i,dance monkey - tones and i,2019,0
2,2,2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,1,...,0.237,0.578,63.05,203573.0,4.0,sweet caroline - neil diamond,sweet caroline - neil diamond,sweet caroline - neil diamond,2017,0
3,3,3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,1,...,0.105,0.446,109.891,182161.0,4.0,someone you loved - lewis capaldi,someone you loved - lewis capaldi,someone you loved - lewis capaldi,2018,0
4,6,6,Creep,Radiohead,1992,1,"Rock,Alternative",English,creep - radiohead,1,...,0.129,0.104,91.841,238640.0,4.0,creep - radiohead,creep - radiohead,creep - radiohead,1993,0


In [29]:
sample = get_score(sample)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [30]:
titles = sync[['title_artist', 'deezer_title_artist', 'spotify_title_artist', 'l_title_artist']]

In [31]:
titles.columns

Index(['title_artist', 'deezer_title_artist', 'spotify_title_artist',
       'l_title_artist'],
      dtype='object')

In [32]:
sync = sync[(sync['title_artist'] == sync['deezer_title_artist']) & (sync['title_artist'] == sync['spotify_title_artist'])& (sync['title_artist'] == sync['l_title_artist'])]



In [36]:
sync.columns

Index(['level_0', 'index', 'title', 'artist', 'year', 'explicit', 'styles',
       'languages', 'title_artist', 'synced', 'd_id', 'd_song', 'd_isrc',
       'd_release', 'd_explicit', 'd_bpm', 'd_artist', 'd_album_id', 'd_album',
       'd_art', 'lyric_url', 'l_title', 'l_artist', 'l_album', 'l_writer',
       'l_pub', 's_artist', 's_track', 's_uri', 's_dance', 's_energy', 's_key',
       's_loudness', 's_mode', 's_speech', 's_acoustic', 's_inst', 's_live',
       's_valence', 's_tempo', 's_duration', 's_time_sig',
       'deezer_title_artist', 'spotify_title_artist', 'l_title_artist',
       'd_year'],
      dtype='object')

In [37]:
sync.drop(columns=['level_0', 'index'], inplace=True)

In [38]:
sync = sync[['title', 'artist', 'year', 'explicit', 'styles',
       'languages', 'title_artist', 'd_id', 'd_isrc',
       'd_release', 'd_album_id', 'd_album',
       'd_art', 'lyric_url','l_writer',
       'l_pub', 's_uri', 's_dance', 's_energy', 's_key',
       's_loudness', 's_mode', 's_speech', 's_acoustic', 's_inst', 's_live',
       's_valence', 's_tempo', 's_duration', 's_time_sig',
       'synced']]

In [39]:
sync

Unnamed: 0,title,artist,year,explicit,styles,languages,title_artist,d_id,d_isrc,d_release,...,s_mode,s_speech,s_acoustic,s_inst,s_live,s_valence,s_tempo,s_duration,s_time_sig,synced
0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,98975170,USUM71418088,2015-05-04,...,1.0,0.0298,0.2050,0.009600,0.0821,0.512,48.718,293293.0,4.0,1
1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,739870792,QZES71982312,2019-08-29,...,0.0,0.0924,0.6920,0.000104,0.1490,0.513,98.027,209438.0,4.0,1
2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,145434430,USMC16991138,2017-03-31,...,1.0,0.0274,0.6110,0.000109,0.2370,0.578,63.050,203573.0,4.0,1
3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,582143242,DEUM71807062,2018-11-08,...,1.0,0.0319,0.7510,0.000000,0.1050,0.446,109.891,182161.0,4.0,1
4,Creep,Radiohead,1992,1,"Rock,Alternative",English,creep - radiohead,138547415,GBAYE9200070,1993-02-22,...,1.0,0.0369,0.0102,0.000141,0.1290,0.104,91.841,238640.0,4.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10806,Odio Por Amor,Juanes,2008,0,"Pop,Rock","Spanish,English",odio por amor - juanes,2559824,USUM70835975,2008-09-23,...,0.0,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,0
10809,A Woman Like You,Lee Brice,2011,0,Country,English,a woman like you - lee brice,75766206,USCRB1109727,2012-04-24,...,0.0,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,0
10810,Buck Rogers,Feeder,2001,0,"Rock,Alternative",English,buck rogers - feeder,131233178,GBBND0000727,1996-05-15,...,0.0,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,0
10814,Beautiful War,Kings of Leon,2013,0,"Alternative,Rock",English,beautiful war - kings of leon,70584821,USRC11300783,2013-09-20,...,0.0,0.0356,0.7060,0.033400,0.1290,0.259,70.828,172133.0,4.0,0
