# Spotify EDA

The overall goal of this personal project is to use Spotify listening data to practice organizing and cleaning data, running some exploratory analyses, and implementing some basic machine learning models. 

This notebook primarily deals with getting the audio features of songs (by querying the Spotify API via the [spotipy](https://spotipy.readthedocs.io/en/2.16.1/) library) and organizing the information into a dataframe for easier use. 

In [12]:
import pandas as pd

## Read in streaming data

In [67]:
# read in streaming data as pd dataframes
data_df1 = pd.read_json('MyData/StreamingHistory0.json')
data_df2 = pd.read_json('MyData/StreamingHistory1.json')

data = pd.concat([data_df1, data_df2]).reset_index() #concat into single df
data['index'] = data.index # so that 'index' numbering is continuous
data.rename({'index': 'orig_index'}, axis=1, inplace=True) # may be useful to keep track of this
data

Unnamed: 0,orig_index,endTime,artistName,trackName,msPlayed
0,0,2019-10-09 00:02,TWICE,Breakthrough,217650
1,1,2019-10-09 00:05,GOT7,너란 Girl Magnetic,203084
2,2,2019-10-09 00:09,SE7EN,Better Together,218120
3,3,2019-10-09 00:12,GOT7,Look,194535
4,4,2019-10-09 00:15,TWICE,KNOCK KNOCK,195753
...,...,...,...,...,...
17590,17590,2020-10-09 21:46,TAEYANG,"눈,코,입 (Eyes, Nose, Lips)",229989
17591,17591,2020-10-09 21:50,TWICE,CHEER UP,208853
17592,17592,2020-10-09 21:53,SEULGI,Wow Thing,171998
17593,17593,2020-10-09 21:56,Hyolyn,BAE,206566
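
The two-file read above can be generalized to however many history files an export contains. The following is a minimal sketch, not part of the original notebook: the helper name `load_streaming_history` is hypothetical, and it assumes every file is named `StreamingHistory<N>.json` with a trailing number, as in the two files used here.

```python
from pathlib import Path

import pandas as pd


def load_streaming_history(folder):
    """Concatenate every StreamingHistory<N>.json file in `folder` (hypothetical helper)."""
    # Sort by the trailing number so that, e.g., file 10 comes after file 2.
    # Assumes the file stem is exactly 'StreamingHistory' followed by digits.
    files = sorted(
        Path(folder).glob('StreamingHistory*.json'),
        key=lambda p: int(p.stem.replace('StreamingHistory', '')),
    )
    frames = [pd.read_json(f) for f in files]
    data = pd.concat(frames, ignore_index=True)  # continuous 0..n-1 index
    data.insert(0, 'orig_index', data.index)     # keep track of the original position
    return data
```

With `ignore_index=True` the index is already continuous after concatenation, so no separate renumbering step is needed.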


## Start with some very basic cleanup

In [68]:
# have some unknowns
is_unknown = data['trackName'] == 'Unknown Track'
print('{} out of {} plays are unknown'.format(is_unknown.sum(), len(data)))
data[is_unknown].head()

43 out of 17595 plays are unknown


Unnamed: 0,orig_index,endTime,artistName,trackName,msPlayed
16884,16884,2020-09-25 18:11,Unknown Artist,Unknown Track,2004626
16885,16885,2020-09-26 00:32,Unknown Artist,Unknown Track,1968807
16887,16887,2020-09-26 02:14,Unknown Artist,Unknown Track,2215833
17009,17009,2020-10-02 16:28,Unknown Artist,Unknown Track,172182
17010,17010,2020-10-02 16:31,Unknown Artist,Unknown Track,143343


In [69]:
# filter out unknowns
data = data[~is_unknown]
print(str(len(data)) + ' "known" song plays')

17552 "known" song plays


In [247]:
#check play count; filter out songs that were played only once
artist_track = data['artistName'] + '__' + data['trackName'] #assuming no artist/song has __ in its title; use to split later
play_counts = artist_track.value_counts()
num_played_once = (play_counts == 1).sum()

print('{}/{} songs were played only once in the past year.\n\
Going to ignore these from here on out.\n'.format(num_played_once, len(data)))

keep_unique_songs = play_counts[play_counts > 1].index #unique list of songs I've played more than once
data_idxs = [idx for idx,song in enumerate(artist_track.values) if song in keep_unique_songs] #corresponding idxs in data df
data_filt = data.iloc[data_idxs].reset_index(drop=True) #data_idxs are positional, so use iloc
data_filt

2136/17552 songs were played only once in the past year.
Going to ignore these from here on out.



Unnamed: 0,orig_index,endTime,artistName,trackName,msPlayed
0,0,2019-10-09 00:02,TWICE,Breakthrough,217650
1,1,2019-10-09 00:05,GOT7,너란 Girl Magnetic,203084
2,2,2019-10-09 00:09,SE7EN,Better Together,218120
3,3,2019-10-09 00:12,GOT7,Look,194535
4,4,2019-10-09 00:15,TWICE,KNOCK KNOCK,195753
...,...,...,...,...,...
15411,17590,2020-10-09 21:46,TAEYANG,"눈,코,입 (Eyes, Nose, Lips)",229989
15412,17591,2020-10-09 21:50,TWICE,CHEER UP,208853
15413,17592,2020-10-09 21:53,SEULGI,Wow Thing,171998
15414,17593,2020-10-09 21:56,Hyolyn,BAE,206566
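
The same filter can also be written without building a list of positions. This is a sketch of one alternative, assuming the `artistName` and `trackName` columns shown above; the function name `keep_repeated_songs` is made up for illustration.

```python
import pandas as pd


def keep_repeated_songs(df):
    """Return only the plays of songs that occur more than once in df (hypothetical helper)."""
    key = df['artistName'] + '__' + df['trackName']
    # Map every play to the total number of plays of its song.
    plays_per_song = key.map(key.value_counts())
    return df[plays_per_song > 1].reset_index(drop=True)
```

Because the mask is computed per row, it stays aligned with the dataframe even if the index is not a clean 0..n-1 range.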


## Search each song to get its Spotify ID
Each track's Spotify ID is needed in order to query its audio features; the main way to get a track's Spotify ID is to search for it.

In [153]:
# use spotipy library to interface with Spotify API
import spotipy 
from spotipy.oauth2 import SpotifyClientCredentials

from time import sleep

In [72]:
# fill in with your client ID & secret key
clientID = '<client ID here>'
secret = '<secret here>'

In [73]:
client_credentials_manager = SpotifyClientCredentials(client_id=clientID, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

`search_for_songs` uses spotipy to query the Spotify Search API for each song by artist and track name. 

A few notes on the function:
- Songs/artists with single or double quotation marks in the name are not searched correctly. Backslash escapes do not work, but removing the quotation marks is sufficient, so that is what `search_for_songs` does. (FWIW: the `&`, `-`, `(`, and `)` characters are fine to leave in; they don't prevent a match.)
- Searching with artist + track name does not always return the correct song first, so `search_for_songs` includes a user-side check: if the artist + track name of the first search result does not match the song of interest, the function asks the user Y/N whether a given search result is the correct match, cycling through the search results until a correct match is found. Note that the current version isn't particularly robust; e.g., there is no way to go back or recover if the user makes a mistake. 
- Mismatches are stored in a dictionary that is saved for future reference. The dictionary logs the query used and the index of the correct search result, which `search_for_songs` can use to resolve known mismatches automatically. Notably, this depends on Spotify returning search results in the same order for a given query, which isn't a safe assumption to make forever.

See [Spotify API documentation](https://developer.spotify.com/documentation/web-api/reference/search/search/#writing-a-query---guidelines) for more information on writing search queries
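
To make the quote handling concrete, here is a small sketch of how a query string is assembled from an `'<artist>__<track>'` key. The helper name `build_query` is hypothetical; the function below does the same thing inline. One deliberate difference: this sketch splits on the first `__` only, so a track title containing `__` would stay intact.

```python
def build_query(song):
    """Build a Spotify search query from an '<artist>__<track>' string (hypothetical helper)."""
    artist, track = song.split('__', 1)
    # Quotation marks break the search, so drop them rather than escape them.
    artist = artist.replace("'", '').replace('"', '')
    track = track.replace("'", '').replace('"', '')
    return 'artist:"{}" track:"{}"'.format(artist, track)
```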

In [154]:
# songs should be a list of '<artist>__<track>' strings representing individual songs
# mismatches is a dict, with '<artist>__<track>' strings as keys; each value is a dict holding the query used and the "match idx" - the idx for the correct search result when queried
# search_for_songs uses the artist + track names to query each song 
# prompts user check if the first result from the search query doesn't match the artist/track name
# stores correct search results, and keeps track of any mismatches and any songs that weren't found
def search_for_songs(songs, mismatches=None):
    search_results = {}
    no_match = []
    if mismatches is None:
        mismatches = {}
    
    for idx, song in enumerate(songs):       
        #get artist & track name, free of single/double quotes
        artist = song.split('__')[0].replace("'", '').replace('"', '') 
        track = song.split('__')[1].replace("'", '').replace('"', '') 
        
        #search
        query = 'artist:"{}" track:"{}"'.format(artist, track)
        matches = sp.search(q=query, type='track')['tracks']['items'] #search results are stored here

        sleep(.1)
        
        #if search returned a result
        if matches: 
            
            #grab first match
            match = matches[0] 
            matched_artist = match['artists'][0]['name']
            matched_track = match['name']
            
            #check search result 
            if (song.split('__')[0].lower() == matched_artist.lower() and
                song.split('__')[1].lower() == matched_track.lower()):
                search_results[song] = match #if artist & track name match, then store
                      
            else: #otherwise, if mismatch between artist/track name and search result ..
                if song in mismatches: #if this mismatch has been logged before, use stored info to resolve
                    match_idx = mismatches[song]['match_idx'] 
                    search_results[song] = matches[match_idx] #idx of correct search query result
                
                else: #otherwise, look through search results
                    print('\nMismatch for ' + song)
                    print('Search result: ')
                    print(matched_artist + ' - ' + matched_track)
                    correct = input('Is this correct? Y/N: \n')
                    
                    if correct == 'Y': #if first match actually is correct
                        mismatches[song] = {'query': query, 'match_idx': 0} #store 'mismatch' info
                        search_results[song] = match #store match
                    
                    elif correct == 'N': #if first match is incorrect
                        find_match = 'N'
                        if len(matches) > 1: #if there are more matches
                            for m_idx, m in enumerate(matches[1:], start=1): #cycle through each option (m_idx avoids shadowing the outer loop's idx)
                                print('How about... ')
                                print(m['artists'][0]['name'] + ' - ' + m['name'])
                                find_match = input('Is this correct? Y/N: \n')
                                
                                if find_match == 'Y': #store correct match when found
                                    search_results[song] = matches[m_idx]
                                    mismatches[song] = {'query': query, 'match_idx': m_idx}
                                    break
                            if find_match != 'Y': #if reached end of matches & didn't find one
                                no_match.append(song) #store in no_match
                        elif len(matches) == 1:  #if there was only a single (incorrect) match
                            no_match.append(song) #store in no_match
                    
                    elif correct == 'Z': #Z to quit
                        break

        elif not matches: #if no search result
            no_match.append(song)
            
    return search_results, no_match, mismatches
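
The "cycle through the candidates until the user accepts one" step can be isolated into a small function, which also makes it testable without typing answers by hand. This is only a sketch: `pick_match` and its `ask` parameter are hypothetical names, and the candidate dicts are assumed to have the same `artists`/`name` layout as the search results used above.

```python
def pick_match(candidates, ask=input):
    """Return (idx, candidate) for the first candidate the user accepts, or None (hypothetical helper)."""
    for idx, cand in enumerate(candidates):
        label = cand['artists'][0]['name'] + ' - ' + cand['name']
        # ask() is input() by default; any callable returning 'Y'/'N' can be swapped in.
        answer = ask('Is this correct? Y/N: ' + label + '\n')
        if answer.strip().upper() == 'Y':
            return idx, cand
    return None  # ran out of candidates without a 'Y'
```

Returning `None` when nothing is accepted gives the caller a single place to decide that a song belongs in `no_match`.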

In [232]:
artist_track_filt = data_filt['artistName'] + '__' + data_filt['trackName']
unique_songs = artist_track_filt.unique()

print('{} unique songs. Search for each and get its spotify ID info..'
      .format(len(unique_songs)))

try:
    mismatches = pd.read_pickle('mismatches.pkl').to_dict(orient='index')
except FileNotFoundError:
    mismatches = None

search_results, no_match, mismatches = search_for_songs(unique_songs, mismatches)

1505 unique songs. Search for each and get its spotify ID info..

Mismatch for chief.__orion
Search result: 
The Late Night Chiefs - Orion
Is this correct? Y/N: 
N

Mismatch for SEVENTEEN__Adore U - Vocal Team Ver.
Search result: 
SEVENTEEN - Adore U (Vocal Team Ver.)
Is this correct? Y/N: 
Y

Mismatch for LiSA__Gurenge
Search result: 
LiSA - Gurenge (Demon Slayer) - Remix
Is this correct? Y/N: 
N


In [251]:
# number of unique songs should equal number of search results + songs that weren't matched
print(len(unique_songs))
print(len(search_results) + len(no_match))

1505
1505


### Check songs that weren't found - do any of these need to be corrected?

In [233]:
no_match

['TWICE__OOH-AHH하게 Like OOH-AHH',
 'GOT7__Bibouroku',
 'DAY6__If 〜また逢えたら〜',
 'William Bolton__Some Love (feat. Jackson Breit)',
 'ASTRO__그림자',
 'chief.__orion',
 'DAY6__君なら',
 'eaJ’s Music__eaJ - Pinocchio',
 'eaJ’s Music__eaJ - Guess Not',
 'eaJ’s Music__eaJ - Otherside',
 'eaJ’s Music__eaJ - La Trains',
 'DAY6 Another Songs__Young one - Honesty',
 'DAY6 Another Songs__eaJ - LA TRAINS (Acoustic Ver.)',
 'DAY6 Another Songs__Young One - Face to Face',
 'DAY6 Another Songs__You Were Beautiful (acoustic)',
 'DAY6 Another Songs__Young One - Talk',
 'DAY6 Another Songs__Jae - Happy Ending',
 'DAY6 Another Songs__Jae - Rose',
 'Yumi Arai__ひこうき雲',
 'Korean Songs__eaJ - 50 Proof',
 'DAY6 Another Songs__YOU',
 'DAY6 Another Songs__eaJ - PINOCCHIO (Acoustic Ver.)',
 'DAY6 Another Songs__eaJ - Otherside',
 'DAY6 Solo Project__eaJ - 50 proof',
 'DAY6 Solo Project__eaJ - Truman',
 'Part Of Mind__しき の うた',
 'Bajune Tobeta__テディベアとパスワード ニコラ・コンテ re-mix',
 'Yoshihisa Hirano__暗殺一家の館',
 'Taylor Swift__Cl

In the above list of songs that were not found, a handful should have been matched. Some experimenting turned up alternate titles that match properly, so the next cell fixes these entries and updates the data accordingly. (The remaining "no matches" are either songs that are not well known, which are fine to exclude, or podcast episodes, which should not be included anyway.)

In [240]:
# these songs need to be renamed so that they can be searched properly
fix_these_songs = ['TWICE__OOH-AHH하게 Like OOH-AHH', 'ASTRO__그림자', 'DAY6__君なら', 'DAY6__If 〜また逢えたら〜']

# key = old name, value = new name
rename = {'OOH-AHH하게 Like OOH-AHH': 'Like Ooh-Ahh',
          '그림자': 'Shadow',
          '君なら': '君',
          'If 〜また逢えたら〜': 'If ~'}

data_copy = data_filt.copy()
fixed = []
for song in fix_these_songs:
    track_name = song.split('__')[1]
    
    fix_these = data_copy['trackName'] == track_name #idxes to change
    data_copy.loc[fix_these, 'trackName'] = rename[track_name] 
    
    fixed.append(song.split('__')[0] + '__' + rename[track_name]) #build list of new song names to use in search
    
data_filt = data_copy

In [255]:
# re-run search using newly renamed songs
search_fixed, _, _ = search_for_songs(fixed)

already_found = [key in search_results for key in search_fixed.keys()]
print(f'\nFYI, {sum(already_found)} renamed song(s) already found in original search_results')


Mismatch for DAY6__君
Search result: 
DAY6 - 君なら
Is this correct? Y/N: 
Y

Mismatch for DAY6__If ~
Search result: 
DAY6 - If 〜また逢えたら〜
Is this correct? Y/N: 
Y

FYI, 1 renamed song(s) already found in original search_results


In [270]:
# add new search results from corrected names to original search_results
search_results.update(search_fixed)
print(f'Added {len(search_fixed) - sum(already_found)} items to search results.\n')

print('Effectively {} unique songs to analyze ({} unique songs in dataset, but {} were not found).'.
     format(len(search_results), 
            len(unique_songs), 
            len(no_match) - len(search_fixed) + sum(already_found)))

Added 3 items to search results.

Effectively 1476 unique songs to analyze (1505 unique songs in dataset, but 29 were not found).


In [None]:
# save mismatches
mismatches_df = pd.DataFrame.from_dict(mismatches, orient='index')
mismatches_df.to_pickle('mismatches.pkl')
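
For reference, this is the shape of the `mismatches` dictionary and how it survives the dataframe round trip used for saving and loading. The sketch below skips the pickle file and only shows the dict-to-dataframe-to-dict conversion; the single entry mirrors one of the mismatches confirmed during the search above.

```python
import pandas as pd

mismatches = {
    'SEVENTEEN__Adore U - Vocal Team Ver.': {
        'query': 'artist:"SEVENTEEN" track:"Adore U - Vocal Team Ver."',
        'match_idx': 0,
    },
}

# dict -> dataframe: one row per song, one column per logged field
mismatches_df = pd.DataFrame.from_dict(mismatches, orient='index')

# dataframe -> dict: recovers the {song: {'query': ..., 'match_idx': ...}} layout
restored = mismatches_df.to_dict(orient='index')
```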

## Convert search results into dataframe with track, artist, and album info
Spotify search returned album and artist information for each track as nested dictionaries under the `album` and `artists` keys. The blocks below extract this information, convert it into album/artist dataframes, and then merge everything so that all of the metadata sits in a single dataframe.

In [None]:
# add artist_track_name; easier to use as unique identifier
artist_track = data_filt['artistName'] + '__' + data_filt['trackName']
data_filt['artist_track_name'] = artist_track

In [271]:
# convert track info to dataframe
track_info = pd.DataFrame.from_dict(search_results, orient='index').add_prefix('track_')
track_info = track_info.reset_index().rename({'index': 'artist_track_name'}, axis=1)

# pull out 'album' and 'artists' (first artist) dictionaries and convert to dataframe
album_info = pd.DataFrame([search_results[x]['album'] for x in search_results]).add_prefix('album_')
artist_info = pd.DataFrame([search_results[x]['artists'][0] for x in search_results]).add_prefix('artist_')
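
To see what these three frames look like without calling the API, the same steps can be run on a toy stand-in for `search_results`. Everything in `toy_results` is made up for illustration (fake IDs and names), and only a few of the fields that Spotify returns are included.

```python
import pandas as pd

# Toy stand-in for search_results: two fake tracks with made-up IDs.
toy_results = {
    'ArtistA__Song1': {'id': 't1', 'name': 'Song1',
                       'album': {'id': 'al1', 'name': 'Album1'},
                       'artists': [{'id': 'ar1', 'name': 'ArtistA'}]},
    'ArtistB__Song2': {'id': 't2', 'name': 'Song2',
                       'album': {'id': 'al2', 'name': 'Album2'},
                       'artists': [{'id': 'ar2', 'name': 'ArtistB'}]},
}

# One row per song; nested 'album'/'artists' values stay as objects in their columns.
toy_tracks = pd.DataFrame.from_dict(toy_results, orient='index').add_prefix('track_')
toy_tracks = toy_tracks.reset_index().rename({'index': 'artist_track_name'}, axis=1)

# Pull the nested dicts out into their own frames, in the same row order as toy_tracks.
toy_albums = pd.DataFrame([r['album'] for r in toy_results.values()]).add_prefix('album_')
toy_artists = pd.DataFrame([r['artists'][0] for r in toy_results.values()]).add_prefix('artist_')
```

All three frames are built by iterating over the same dictionary, so row `i` in each refers to the same song.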

Check track, album, and artists dataframes & decide if there are any columns that can be dropped

In [272]:
album_info.head()

Unnamed: 0,album_album_type,album_artists,album_available_markets,album_external_urls,album_href,album_id,album_images,album_name,album_release_date,album_release_date_precision,album_total_tracks,album_type,album_uri
0,single,[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",{'spotify': 'https://open.spotify.com/album/7L...,https://api.spotify.com/v1/albums/7LWfEiSeue9B...,7LWfEiSeue9BXPbUOH34q6,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",Breakthrough,2019-06-12,day,1,album,spotify:album:7LWfEiSeue9BXPbUOH34q6
1,album,[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",{'spotify': 'https://open.spotify.com/album/02...,https://api.spotify.com/v1/albums/02dvCQbuBKdU...,02dvCQbuBKdU1QHHGtrCHy,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",Identify,2014-11-28,day,11,album,spotify:album:02dvCQbuBKdU1QHHGtrCHy
2,album,[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",{'spotify': 'https://open.spotify.com/album/5W...,https://api.spotify.com/v1/albums/5WdFwkGiwtXM...,5WdFwkGiwtXM4DiXxl02CM,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",Digital Bounce,2010,year,7,album,spotify:album:5WdFwkGiwtXM4DiXxl02CM
3,single,[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",{'spotify': 'https://open.spotify.com/album/5a...,https://api.spotify.com/v1/albums/5aCJKuo69SGb...,5aCJKuo69SGbveapvkSYMW,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",Eyes On You,2018-03-12,day,6,album,spotify:album:5aCJKuo69SGbveapvkSYMW
4,album,[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",{'spotify': 'https://open.spotify.com/album/3O...,https://api.spotify.com/v1/albums/3O7HQtCnGoaX...,3O7HQtCnGoaXMxXLkspFL8,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",Twicecoaster: Lane 2,2017-02-20,day,9,album,spotify:album:3O7HQtCnGoaXMxXLkspFL8


In [138]:
# drop..
# 'album_artists' - already have track artists; not going to worry about album artists
# 'album_available_markets' - don't care about this
# 'album_type' - no useful info
# 
# + rename 'album_album_type' to 'album_type'
album_info.drop(['album_artists', 'album_available_markets', 'album_type'], axis=1, inplace=True)
album_info.rename(columns={'album_album_type': 'album_type'}, inplace=True)

In [139]:
artist_info.head()

Unnamed: 0,artist_external_urls,artist_href,artist_id,artist_name,artist_type,artist_uri
0,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,artist,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
1,{'spotify': 'https://open.spotify.com/artist/6...,https://api.spotify.com/v1/artists/6nfDaffa50m...,6nfDaffa50mKtEOwR8g4df,GOT7,artist,spotify:artist:6nfDaffa50mKtEOwR8g4df
2,{'spotify': 'https://open.spotify.com/artist/1...,https://api.spotify.com/v1/artists/14yLuCwlBqt...,14yLuCwlBqteUdBqx9soJV,SE7EN,artist,spotify:artist:14yLuCwlBqteUdBqx9soJV
3,{'spotify': 'https://open.spotify.com/artist/6...,https://api.spotify.com/v1/artists/6nfDaffa50m...,6nfDaffa50mKtEOwR8g4df,GOT7,artist,spotify:artist:6nfDaffa50mKtEOwR8g4df
4,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,artist,spotify:artist:7n2Ycct7Beij7Dj7meI4X0


In [273]:
# drop 'artist_type' - no useful info
artist_info.drop('artist_type', axis=1, inplace=True)

In [274]:
track_info.head()

Unnamed: 0,artist_track_name,track_album,track_artists,track_available_markets,track_disc_number,track_duration_ms,track_explicit,track_external_ids,track_external_urls,track_href,track_id,track_is_local,track_name,track_popularity,track_preview_url,track_track_number,track_type,track_uri
0,TWICE__Breakthrough,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",1,217650,False,{'isrc': 'JPWP01970489'},{'spotify': 'https://open.spotify.com/track/5C...,https://api.spotify.com/v1/tracks/5COO2JgOmHIJ...,5COO2JgOmHIJ2jsXFwflz8,False,Breakthrough,61,https://p.scdn.co/mp3-preview/cb9af41416965765...,1,track,spotify:track:5COO2JgOmHIJ2jsXFwflz8
1,GOT7__너란 Girl Magnetic,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",1,203084,False,{'isrc': 'US5TA1400113'},{'spotify': 'https://open.spotify.com/track/3q...,https://api.spotify.com/v1/tracks/3qLNtf8mMhji...,3qLNtf8mMhjivYM84iVmy8,False,너란 Girl Magnetic,45,https://p.scdn.co/mp3-preview/c3dd3873b1bcab20...,4,track,spotify:track:3qLNtf8mMhjivYM84iVmy8
2,SE7EN__Better Together,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",1,218120,False,{'isrc': 'KRA491000084'},{'spotify': 'https://open.spotify.com/track/6r...,https://api.spotify.com/v1/tracks/6r2RWys84mPO...,6r2RWys84mPOYBMHXbrPZN,False,Better Together,37,https://p.scdn.co/mp3-preview/8022dfb633858fa6...,3,track,spotify:track:6r2RWys84mPOYBMHXbrPZN
3,GOT7__Look,"{'album_type': 'single', 'artists': [{'externa...",[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",1,194535,False,{'isrc': 'US5TA1800023'},{'spotify': 'https://open.spotify.com/track/1Z...,https://api.spotify.com/v1/tracks/1ZFQugO7BqYJ...,1ZFQugO7BqYJjw8FVQHcze,False,Look,56,https://p.scdn.co/mp3-preview/0a45d1e590df117c...,2,track,spotify:track:1ZFQugO7BqYJjw8FVQHcze
4,TWICE__KNOCK KNOCK,"{'album_type': 'album', 'artists': [{'external...",[{'external_urls': {'spotify': 'https://open.s...,"[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...",1,195746,False,{'isrc': 'US5TA1700022'},{'spotify': 'https://open.spotify.com/track/2d...,https://api.spotify.com/v1/tracks/2dYl2SI3sLyU...,2dYl2SI3sLyUffzxVBaZBz,False,Knock Knock,62,,1,track,spotify:track:2dYl2SI3sLyUffzxVBaZBz


In [275]:
# drop..
# 'track_album' - data is in album_info
# 'track_artists' - data is in artist_info
# 'track_available_markets', 'track_disc_number', 'track_explicit', 'track_is_local' - don't care about these
# 'track_type' - no useful info
# 
# + rename 'track_track_number' to 'track_number'

track_info.drop(['track_album', 'track_artists', 'track_available_markets',
                'track_disc_number', 'track_explicit', 'track_is_local', 'track_type'], 
                axis=1, inplace=True)
track_info.rename(columns={'track_track_number': 'track_number'}, inplace=True)

Now track, album, and artist dataframes are ready to be concatenated together

In [276]:
metadata = pd.concat([track_info, album_info, artist_info], axis=1)
metadata

Unnamed: 0,artist_track_name,track_duration_ms,track_external_ids,track_external_urls,track_href,track_id,track_name,track_popularity,track_preview_url,track_number,...,album_release_date,album_release_date_precision,album_total_tracks,album_type,album_uri,artist_external_urls,artist_href,artist_id,artist_name,artist_uri
0,TWICE__Breakthrough,217650,{'isrc': 'JPWP01970489'},{'spotify': 'https://open.spotify.com/track/5C...,https://api.spotify.com/v1/tracks/5COO2JgOmHIJ...,5COO2JgOmHIJ2jsXFwflz8,Breakthrough,61,https://p.scdn.co/mp3-preview/cb9af41416965765...,1,...,2019-06-12,day,1,album,spotify:album:7LWfEiSeue9BXPbUOH34q6,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
1,GOT7__너란 Girl Magnetic,203084,{'isrc': 'US5TA1400113'},{'spotify': 'https://open.spotify.com/track/3q...,https://api.spotify.com/v1/tracks/3qLNtf8mMhji...,3qLNtf8mMhjivYM84iVmy8,너란 Girl Magnetic,45,https://p.scdn.co/mp3-preview/c3dd3873b1bcab20...,4,...,2014-11-28,day,11,album,spotify:album:02dvCQbuBKdU1QHHGtrCHy,{'spotify': 'https://open.spotify.com/artist/6...,https://api.spotify.com/v1/artists/6nfDaffa50m...,6nfDaffa50mKtEOwR8g4df,GOT7,spotify:artist:6nfDaffa50mKtEOwR8g4df
2,SE7EN__Better Together,218120,{'isrc': 'KRA491000084'},{'spotify': 'https://open.spotify.com/track/6r...,https://api.spotify.com/v1/tracks/6r2RWys84mPO...,6r2RWys84mPOYBMHXbrPZN,Better Together,37,https://p.scdn.co/mp3-preview/8022dfb633858fa6...,3,...,2010,year,7,album,spotify:album:5WdFwkGiwtXM4DiXxl02CM,{'spotify': 'https://open.spotify.com/artist/1...,https://api.spotify.com/v1/artists/14yLuCwlBqt...,14yLuCwlBqteUdBqx9soJV,SE7EN,spotify:artist:14yLuCwlBqteUdBqx9soJV
3,GOT7__Look,194535,{'isrc': 'US5TA1800023'},{'spotify': 'https://open.spotify.com/track/1Z...,https://api.spotify.com/v1/tracks/1ZFQugO7BqYJ...,1ZFQugO7BqYJjw8FVQHcze,Look,56,https://p.scdn.co/mp3-preview/0a45d1e590df117c...,2,...,2018-03-12,day,6,album,spotify:album:5aCJKuo69SGbveapvkSYMW,{'spotify': 'https://open.spotify.com/artist/6...,https://api.spotify.com/v1/artists/6nfDaffa50m...,6nfDaffa50mKtEOwR8g4df,GOT7,spotify:artist:6nfDaffa50mKtEOwR8g4df
4,TWICE__KNOCK KNOCK,195746,{'isrc': 'US5TA1700022'},{'spotify': 'https://open.spotify.com/track/2d...,https://api.spotify.com/v1/tracks/2dYl2SI3sLyU...,2dYl2SI3sLyUffzxVBaZBz,Knock Knock,62,,1,...,2017-02-20,day,9,album,spotify:album:3O7HQtCnGoaXMxXLkspFL8,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1471,Luvian__Forest,273307,{'isrc': 'USUS11600313'},{'spotify': 'https://open.spotify.com/track/2k...,https://api.spotify.com/v1/tracks/2koAp3QuiVbf...,2koAp3QuiVbfJ6tTu8ij2h,Forest,30,https://p.scdn.co/mp3-preview/45e1815b7a103e35...,4,...,2016-07-29,day,5,album,spotify:album:4KQ1om2HjggigU8wiKTGnL,{'spotify': 'https://open.spotify.com/artist/2...,https://api.spotify.com/v1/artists/25zpZzQN6Wn...,25zpZzQN6WnNMt8LFtfsJW,Luvian,spotify:artist:25zpZzQN6WnNMt8LFtfsJW
1472,BSS__Just do it,201757,{'isrc': 'KRA381704606'},{'spotify': 'https://open.spotify.com/track/57...,https://api.spotify.com/v1/tracks/57ITlzpnOMkS...,57ITlzpnOMkSE6oHGbvTqi,Just do it,54,https://p.scdn.co/mp3-preview/ea111e334ff61485...,1,...,2018-03-21,day,1,album,spotify:album:7cL58mAH0Cs8giOqRNEjQP,{'spotify': 'https://open.spotify.com/artist/1...,https://api.spotify.com/v1/artists/1uAT5bTSp6d...,1uAT5bTSp6dWbNmixIUP5t,BSS,spotify:artist:1uAT5bTSp6dWbNmixIUP5t
1473,ASTRO__Shadow,193609,{'isrc': 'KRA382050128'},{'spotify': 'https://open.spotify.com/track/5g...,https://api.spotify.com/v1/tracks/5gXS4UeCGKuf...,5gXS4UeCGKufZGOGKyvJcM,Shadow,32,https://p.scdn.co/mp3-preview/394c667c008f97f7...,1,...,2018-01-29,day,1,album,spotify:album:6Qhio1wu4IaAWKGH09Vmix,{'spotify': 'https://open.spotify.com/artist/4...,https://api.spotify.com/v1/artists/4pz4uzOMpJQ...,4pz4uzOMpJQyV8UTsDy4H8,ASTRO,spotify:artist:4pz4uzOMpJQyV8UTsDy4H8
1474,DAY6__君,192159,{'isrc': 'JPWP01971085'},{'spotify': 'https://open.spotify.com/track/71...,https://api.spotify.com/v1/tracks/71IelX1rq1Tq...,71IelX1rq1TqK35DmH56XJ,君なら,38,https://p.scdn.co/mp3-preview/fad5c74ec156de9c...,2,...,2019-12-04,day,4,album,spotify:album:1LDHf4VNXPeb7Q99QznE4h,{'spotify': 'https://open.spotify.com/artist/5...,https://api.spotify.com/v1/artists/5TnQc2N1iKl...,5TnQc2N1iKlFjYD7CPGvFc,DAY6,spotify:artist:5TnQc2N1iKlFjYD7CPGvFc
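
The column-wise `pd.concat` above relies on the three frames sharing the same row order. If that ever stops being true (for example after sorting or filtering one of them), a key-based merge is a safer way to combine them. The following is a sketch under that assumption; `combine_on_key` is a hypothetical helper, and in this notebook `keys` would be `search_results.keys()`.

```python
import pandas as pd


def combine_on_key(track_df, album_df, artist_df, keys):
    """Merge on the song key instead of on row position (hypothetical helper).

    `keys` must list the song key for each row of album_df/artist_df, in order.
    """
    album_df = album_df.assign(artist_track_name=list(keys))
    artist_df = artist_df.assign(artist_track_name=list(keys))
    return (track_df
            .merge(album_df, on='artist_track_name')
            .merge(artist_df, on='artist_track_name'))
```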


## Check for duplicates
It's possible (likely) that some of the "unique" song names in metadata are actually duplicates. This was difficult to check earlier with just artist/track names, since duplicates could result from any number of variations (e.g., changes in capitalization, punctuation, or abbreviation). Now that each song has a Spotify ID, duplicate IDs can be found and fixed accordingly.

In [277]:
id_value_counts = metadata['track_id'].value_counts()
double_ids = id_value_counts[id_value_counts > 1]
double_ids

4O4maU1Ki1PMaByGZqoC5g    2
2dYl2SI3sLyUffzxVBaZBz    2
1IX47gefluXmKX4PrTBCRM    2
5KawlOMHjWeUjQtnuRs22c    2
5HiSc2ZCGn8L3cH3qSwzBT    2
Name: track_id, dtype: int64

In [278]:
print('{} songs are duplicated.'.format(len(double_ids)))

5 songs are duplicated.
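
An alternative way to list the duplicates, shown here only as a sketch, is `DataFrame.duplicated` with `keep=False`, which flags every row whose ID occurs more than once. The helper name `find_duplicate_ids` and its default column names are assumptions based on the `metadata` columns above.

```python
import pandas as pd


def find_duplicate_ids(meta, id_col='track_id', name_col='artist_track_name'):
    """Map each duplicated ID to the list of song names that share it (hypothetical helper)."""
    dupes = meta[meta.duplicated(id_col, keep=False)]  # all rows with a repeated ID
    return dupes.groupby(id_col)[name_col].apply(list).to_dict()
```

The result makes it easy to see at a glance which name variants collapsed onto the same track.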


In [279]:
# get idxs of duplicate rows in metadata to delete
by_id = metadata.reset_index().set_index('track_id')
metadata_doubles = by_id.loc[double_ids.index, :]
metadata_doubles   

track_id,index,artist_track_name,track_duration_ms,track_external_ids,track_external_urls,track_href,track_name,track_popularity,track_preview_url,track_number,...,album_release_date,album_release_date_precision,album_total_tracks,album_type,album_uri,artist_external_urls,artist_href,artist_id,artist_name,artist_uri
4O4maU1Ki1PMaByGZqoC5g,920,Brasstracks__Melanin Man (feat. Masego),250521,{'isrc': 'TCACR1607043'},{'spotify': 'https://open.spotify.com/track/4O...,https://api.spotify.com/v1/tracks/4O4maU1Ki1PM...,Melanin Man (feat. Masego),41,https://p.scdn.co/mp3-preview/6bd3ee77296d7c4b...,6,...,2016-08-19,day,7,album,spotify:album:4dj2CRLWvdFgEzsVtc2qr6,{'spotify': 'https://open.spotify.com/artist/5...,https://api.spotify.com/v1/artists/5sKvgmG84C0...,5sKvgmG84C0bIMWeS2SRPr,Brasstracks,spotify:artist:5sKvgmG84C0bIMWeS2SRPr
4O4maU1Ki1PMaByGZqoC5g,935,Brasstracks__Melanin Man,250521,{'isrc': 'TCACR1607043'},{'spotify': 'https://open.spotify.com/track/4O...,https://api.spotify.com/v1/tracks/4O4maU1Ki1PM...,Melanin Man (feat. Masego),41,https://p.scdn.co/mp3-preview/6bd3ee77296d7c4b...,6,...,2016-08-19,day,7,album,spotify:album:4dj2CRLWvdFgEzsVtc2qr6,{'spotify': 'https://open.spotify.com/artist/5...,https://api.spotify.com/v1/artists/5sKvgmG84C0...,5sKvgmG84C0bIMWeS2SRPr,Brasstracks,spotify:artist:5sKvgmG84C0bIMWeS2SRPr
2dYl2SI3sLyUffzxVBaZBz,4,TWICE__KNOCK KNOCK,195746,{'isrc': 'US5TA1700022'},{'spotify': 'https://open.spotify.com/track/2d...,https://api.spotify.com/v1/tracks/2dYl2SI3sLyU...,Knock Knock,62,,1,...,2017-02-20,day,9,album,spotify:album:3O7HQtCnGoaXMxXLkspFL8,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
2dYl2SI3sLyUffzxVBaZBz,804,TWICE__Knock Knock,195746,{'isrc': 'US5TA1700022'},{'spotify': 'https://open.spotify.com/track/2d...,https://api.spotify.com/v1/tracks/2dYl2SI3sLyU...,Knock Knock,62,,1,...,2017-02-20,day,9,album,spotify:album:3O7HQtCnGoaXMxXLkspFL8,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
1IX47gefluXmKX4PrTBCRM,19,TWICE__What is Love?,208240,{'isrc': 'US5TA1800038'},{'spotify': 'https://open.spotify.com/track/1I...,https://api.spotify.com/v1/tracks/1IX47gefluXm...,What is Love,73,,4,...,2018-07-09,day,9,album,spotify:album:35LVzMbjGUCfYZYEP6YWyr,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
1IX47gefluXmKX4PrTBCRM,772,TWICE__What is Love,208240,{'isrc': 'US5TA1800038'},{'spotify': 'https://open.spotify.com/track/1I...,https://api.spotify.com/v1/tracks/1IX47gefluXm...,What is Love,73,,4,...,2018-07-09,day,9,album,spotify:album:35LVzMbjGUCfYZYEP6YWyr,{'spotify': 'https://open.spotify.com/artist/7...,https://api.spotify.com/v1/artists/7n2Ycct7Bei...,7n2Ycct7Beij7Dj7meI4X0,TWICE,spotify:artist:7n2Ycct7Beij7Dj7meI4X0
5KawlOMHjWeUjQtnuRs22c,5,BTS__Boy With Luv (feat. Halsey),229773,{'isrc': 'QM6MZ1917908'},{'spotify': 'https://open.spotify.com/track/5K...,https://api.spotify.com/v1/tracks/5KawlOMHjWeU...,Boy With Luv (feat. Halsey),84,https://p.scdn.co/mp3-preview/d16797fb391fb909...,2,...,2019-04-12,day,7,album,spotify:album:1AvXa8xFEXtR3hb4bgihIK,{'spotify': 'https://open.spotify.com/artist/3...,https://api.spotify.com/v1/artists/3Nrfpe0tUJi...,3Nrfpe0tUJi4K4DXYWgMUX,BTS,spotify:artist:3Nrfpe0tUJi4K4DXYWgMUX
5KawlOMHjWeUjQtnuRs22c,714,BTS__Boy With Luv (Feat. Halsey),229773,{'isrc': 'QM6MZ1917908'},{'spotify': 'https://open.spotify.com/track/5K...,https://api.spotify.com/v1/tracks/5KawlOMHjWeU...,Boy With Luv (feat. Halsey),84,https://p.scdn.co/mp3-preview/d16797fb391fb909...,2,...,2019-04-12,day,7,album,spotify:album:1AvXa8xFEXtR3hb4bgihIK,{'spotify': 'https://open.spotify.com/artist/3...,https://api.spotify.com/v1/artists/3Nrfpe0tUJi...,3Nrfpe0tUJi4K4DXYWgMUX,BTS,spotify:artist:3Nrfpe0tUJi4K4DXYWgMUX
5HiSc2ZCGn8L3cH3qSwzBT,41,Red Velvet__러시안 룰렛 Russian Roulette,211243,{'isrc': 'KRA301600315'},{'spotify': 'https://open.spotify.com/track/5H...,https://api.spotify.com/v1/tracks/5HiSc2ZCGn8L...,러시안 룰렛 Russian Roulette,63,https://p.scdn.co/mp3-preview/80c7cca2fe309ce9...,1,...,2016-09-07,day,7,album,spotify:album:6MNlcai3skKLKv5syzFwC3,{'spotify': 'https://open.spotify.com/artist/1...,https://api.spotify.com/v1/artists/1z4g3DjTBBZ...,1z4g3DjTBBZKhvAroFlhOM,Red Velvet,spotify:artist:1z4g3DjTBBZKhvAroFlhOM
5HiSc2ZCGn8L3cH3qSwzBT,455,Red Velvet__Russian Roulette,211243,{'isrc': 'KRA301600315'},{'spotify': 'https://open.spotify.com/track/5H...,https://api.spotify.com/v1/tracks/5HiSc2ZCGn8L...,러시안 룰렛 Russian Roulette,63,https://p.scdn.co/mp3-preview/80c7cca2fe309ce9...,1,...,2016-09-07,day,7,album,spotify:album:6MNlcai3skKLKv5syzFwC3,{'spotify': 'https://open.spotify.com/artist/1...,https://api.spotify.com/v1/artists/1z4g3DjTBBZ...,1z4g3DjTBBZKhvAroFlhOM,Red Velvet,spotify:artist:1z4g3DjTBBZKhvAroFlhOM


In [281]:
# fix data_filt dataframe - rows that share a track ID should all use the same name
# it doesn't matter which name is kept, as long as every duplicate is changed to that one
def rename_duplicates(data_df, keep_name, change_name):
    # split on the first '__' only, in case a track name itself contains '__'
    keep_artist, keep_track = keep_name.split('__', 1)
    fix_these = data_df['artist_track_name'] == change_name

    # can this be consolidated?
    data_df.loc[fix_these, 'artist_track_name'] = keep_name
    data_df.loc[fix_these, 'artistName'] = keep_artist
    data_df.loc[fix_these, 'trackName'] = keep_track

    return data_df

data_copy = data_filt.copy()
for dup_id in double_ids.index:
    keep, change = by_id.loc[dup_id]['artist_track_name'].values
    data_copy = rename_duplicates(data_copy, keep, change)
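
The three column assignments in `rename_duplicates` could probably be collapsed into a single `.loc` call. A minimal sketch on a made-up frame (`rename_duplicates_once` and the toy rows are hypothetical, only here for illustration):

```python
import pandas as pd

def rename_duplicates_once(data_df, keep_name, change_name):
    # split on the first '__' only, so a track name containing '__' stays intact
    keep_artist, keep_track = keep_name.split('__', 1)
    fix_these = data_df['artist_track_name'] == change_name
    cols = ['artist_track_name', 'artistName', 'trackName']
    # one assignment: the three scalars are written into the three columns of every matching row
    data_df.loc[fix_these, cols] = [keep_name, keep_artist, keep_track]
    return data_df

# toy rows, not real listening data
toy = pd.DataFrame({
    'artist_track_name': ['TWICE__What is Love?', 'TWICE__What is Love', 'TWICE__Breakthrough'],
    'artistName': ['TWICE', 'TWICE', 'TWICE'],
    'trackName': ['What is Love?', 'What is Love', 'Breakthrough'],
})
toy = rename_duplicates_once(toy, 'TWICE__What is Love?', 'TWICE__What is Love')
```

Only the row whose name matched `change_name` is touched; the other rows are left as they were.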

In [282]:
# fix metadata dataframe - delete duplicates
meta_copy = metadata.copy()
delete_these = metadata_doubles.iloc[1::2, :]  # keep first match, delete second (assumes rows come in adjacent pairs, two per ID)
delete_these = delete_these['index'].values
meta_copy = meta_copy.drop(delete_these, axis=0).reset_index(drop=True)
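
The `iloc[1::2]` slice only works if `metadata_doubles` holds exactly two rows per ID and each pair sits next to each other. If that ever stops being true, `drop_duplicates` gives a keep-first result without relying on the layout. A sketch on made-up rows (the IDs and names below are placeholders):

```python
import pandas as pd

# toy metadata: 'a1' and 'b2' each appear twice, not in adjacent pairs
toy_meta = pd.DataFrame({
    'track_id': ['a1', 'b2', 'a1', 'c3', 'b2'],
    'artist_track_name': ['X__one', 'Y__two', 'X__One', 'Z__three', 'Y__Two'],
})

# keep the first row seen for each track_id, then renumber the index from 0
deduped = toy_meta.drop_duplicates(subset='track_id', keep='first').reset_index(drop=True)
```

This keeps the first occurrence by row order, which may or may not be the same row the pairwise slice keeps, depending on how `metadata_doubles` was sorted.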

In [283]:
data_filt = data_copy
metadata = meta_copy

# check - no more duplicate IDs
id_value_counts = metadata['track_id'].value_counts()
recheck_doubles = id_value_counts[id_value_counts > 1]
recheck_doubles

Series([], Name: track_id, dtype: int64)

In [284]:
# save
metadata.to_pickle('unique_song_metadata.pkl')
data_filt.to_pickle('filtered_song_data.pkl')

## Use IDs in metadata df to get audio features for each unique song
Use artist_id to get each artist's genres (the album-level genre is often empty), and use track_id to get each track's audio features.

In [285]:
# sp.artists accepts at most 50 IDs per request, so query in batches of 50
artist_ids = list(metadata['artist_id'])
artist_search = []
for start_idx in range(0, len(artist_ids), 50):
    artist_search += sp.artists(artist_ids[start_idx:start_idx+50])['artists']

# get genres for each artist_search result    
artist_genres = [a['genres'] for a in artist_search]
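
Since many tracks share an artist, the loop above looks up the same artist ID many times. One possible way to cut down on requests is to query each unique ID once and map the genres back to every row. A rough sketch, where `fetch_genres` is a hypothetical helper and `fake_artists` is a stand-in for `sp.artists` so the example runs offline:

```python
def fetch_genres(artist_ids, fetch_artists, batch_size=50):
    """Return one genre list per entry of artist_ids, querying each unique ID once."""
    unique_ids = list(dict.fromkeys(artist_ids))  # de-duplicate, keeping first-seen order
    genres_by_id = {}
    for start in range(0, len(unique_ids), batch_size):
        batch = unique_ids[start:start + batch_size]
        for artist in fetch_artists(batch)['artists']:
            genres_by_id[artist['id']] = artist['genres']
    # expand back out so the result lines up with the original list
    return [genres_by_id[a] for a in artist_ids]

# hypothetical stand-in for sp.artists: same response shape, made-up genres
def fake_artists(ids):
    return {'artists': [{'id': i, 'genres': ['genre-' + i]} for i in ids]}

batch_sizes = []
def counting_fetch(ids):
    batch_sizes.append(len(ids))  # record how many IDs each request carried
    return fake_artists(ids)

genres = fetch_genres(['a', 'b', 'a', 'c', 'a'], counting_fetch, batch_size=2)
```

With a real client this would be called as `fetch_genres(artist_ids, sp.artists)`. The sketch assumes every ID comes back as a valid artist object.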

In [286]:
# sp.audio_features accepts at most 100 IDs per request, so query in batches of 100
track_ids = list(metadata['track_id'])
features_search = []
for start_idx in range(0, len(track_ids), 100):
    features_search += sp.audio_features(tracks=track_ids[start_idx:start_idx+100])
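
As far as I understand, the audio features endpoint can return `None` in place of a track it has no features for, so it may be worth checking `features_search` before turning it into a dataframe. A small sketch on made-up results (the `_toy` names and values are placeholders, not real API output):

```python
import pandas as pd

features_search_toy = [
    {'id': 't1', 'danceability': 0.8, 'energy': 0.7},
    None,  # hypothetical: a track with no features available
    {'id': 't3', 'danceability': 0.6, 'energy': 0.9},
]
track_ids_toy = ['t1', 't2', 't3']

# which tracks came back empty?
missing = [tid for tid, feat in zip(track_ids_toy, features_search_toy) if feat is None]

# swap None for an empty dict so every track still gets a row (filled with NaN)
features_df = pd.DataFrame([feat if feat is not None else {} for feat in features_search_toy])
```

Keeping one row per track, even an empty one, means the result still lines up with `metadata` row for row.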

In [287]:
feature_data = pd.concat([metadata['artist_track_name'],
                          pd.Series(artist_genres, name='artist_genres'),
                          pd.DataFrame(features_search)], axis=1)
feature_data

Unnamed: 0,artist_track_name,artist_genres,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,TWICE__Breakthrough,"[k-pop, k-pop girl group]",0.868,0.728,6,-3.338,0,0.1100,0.043900,0.000000,0.0975,0.616,112.006,audio_features,5COO2JgOmHIJ2jsXFwflz8,spotify:track:5COO2JgOmHIJ2jsXFwflz8,https://api.spotify.com/v1/tracks/5COO2JgOmHIJ...,https://api.spotify.com/v1/audio-analysis/5COO...,217651,4
1,GOT7__너란 Girl Magnetic,"[k-pop, k-pop boy group]",0.720,0.870,0,-4.129,0,0.0445,0.059600,0.000000,0.3460,0.842,109.960,audio_features,3qLNtf8mMhjivYM84iVmy8,spotify:track:3qLNtf8mMhjivYM84iVmy8,https://api.spotify.com/v1/tracks/3qLNtf8mMhji...,https://api.spotify.com/v1/audio-analysis/3qLN...,203084,4
2,SE7EN__Better Together,[k-pop],0.782,0.813,8,-3.788,1,0.0385,0.009710,0.000000,0.3440,0.788,119.998,audio_features,6r2RWys84mPOYBMHXbrPZN,spotify:track:6r2RWys84mPOYBMHXbrPZN,https://api.spotify.com/v1/tracks/6r2RWys84mPO...,https://api.spotify.com/v1/audio-analysis/6r2R...,218120,4
3,GOT7__Look,"[k-pop, k-pop boy group]",0.643,0.913,5,-1.724,0,0.1570,0.113000,0.000000,0.3830,0.734,119.938,audio_features,1ZFQugO7BqYJjw8FVQHcze,spotify:track:1ZFQugO7BqYJjw8FVQHcze,https://api.spotify.com/v1/tracks/1ZFQugO7BqYJ...,https://api.spotify.com/v1/audio-analysis/1ZFQ...,194535,4
4,TWICE__KNOCK KNOCK,"[k-pop, k-pop girl group]",0.673,0.968,1,-2.636,0,0.1310,0.024200,0.000000,0.0587,0.476,129.972,audio_features,2dYl2SI3sLyUffzxVBaZBz,spotify:track:2dYl2SI3sLyUffzxVBaZBz,https://api.spotify.com/v1/tracks/2dYl2SI3sLyU...,https://api.spotify.com/v1/audio-analysis/2dYl...,195747,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1466,Luvian__Forest,[],0.808,0.411,9,-11.420,0,0.0837,0.431000,0.760000,0.1200,0.152,103.020,audio_features,2koAp3QuiVbfJ6tTu8ij2h,spotify:track:2koAp3QuiVbfJ6tTu8ij2h,https://api.spotify.com/v1/tracks/2koAp3QuiVbf...,https://api.spotify.com/v1/audio-analysis/2koA...,273307,4
1467,BSS__Just do it,[k-pop],0.610,0.978,2,-2.432,1,0.1210,0.170000,0.000000,0.3920,0.815,150.047,audio_features,57ITlzpnOMkSE6oHGbvTqi,spotify:track:57ITlzpnOMkSE6oHGbvTqi,https://api.spotify.com/v1/tracks/57ITlzpnOMkS...,https://api.spotify.com/v1/audio-analysis/57IT...,201758,4
1468,ASTRO__Shadow,"[k-pop, k-pop boy group]",0.763,0.905,11,-2.493,0,0.1020,0.066800,0.000000,0.0594,0.811,118.015,audio_features,5gXS4UeCGKufZGOGKyvJcM,spotify:track:5gXS4UeCGKufZGOGKyvJcM,https://api.spotify.com/v1/tracks/5gXS4UeCGKuf...,https://api.spotify.com/v1/audio-analysis/5gXS...,193609,4
1469,DAY6__君,"[k-pop, k-pop boy group]",0.552,0.949,11,-2.580,0,0.0335,0.000581,0.000476,0.2900,0.615,131.969,audio_features,71IelX1rq1TqK35DmH56XJ,spotify:track:71IelX1rq1TqK35DmH56XJ,https://api.spotify.com/v1/tracks/71IelX1rq1Tq...,https://api.spotify.com/v1/audio-analysis/71Ie...,192159,4
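
One thing to keep in mind: `pd.concat(..., axis=1)` matches rows by index label, not by position. It works here because `metadata` was re-indexed from 0 after the duplicates were dropped. A toy illustration of what happens otherwise (all values made up):

```python
import pandas as pd

# a name column whose index has a gap, as it would after dropping a row without resetting
names = pd.Series(['A__x', 'B__y', 'C__z'], index=[0, 2, 3], name='artist_track_name')
features = pd.DataFrame({'tempo': [100.0, 110.0, 120.0]})  # default index 0, 1, 2

misaligned = pd.concat([names, features], axis=1)  # union of labels, NaN where one side is missing
aligned = pd.concat([names.reset_index(drop=True), features], axis=1)  # positions match again
```

In `misaligned` the labels 1 and 3 each exist on only one side, so the result has four rows with gaps instead of three complete ones.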


In [288]:
# keep feature info only; most of the other ID stuff is already in metadata
feature_data.drop(['type', 'id', 'uri', 'track_href', 'analysis_url'], axis=1, inplace=True)
feature_data.head()

Unnamed: 0,artist_track_name,artist_genres,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,TWICE__Breakthrough,"[k-pop, k-pop girl group]",0.868,0.728,6,-3.338,0,0.11,0.0439,0.0,0.0975,0.616,112.006,217651,4
1,GOT7__너란 Girl Magnetic,"[k-pop, k-pop boy group]",0.72,0.87,0,-4.129,0,0.0445,0.0596,0.0,0.346,0.842,109.96,203084,4
2,SE7EN__Better Together,[k-pop],0.782,0.813,8,-3.788,1,0.0385,0.00971,0.0,0.344,0.788,119.998,218120,4
3,GOT7__Look,"[k-pop, k-pop boy group]",0.643,0.913,5,-1.724,0,0.157,0.113,0.0,0.383,0.734,119.938,194535,4
4,TWICE__KNOCK KNOCK,"[k-pop, k-pop girl group]",0.673,0.968,1,-2.636,0,0.131,0.0242,0.0,0.0587,0.476,129.972,195747,4


In [289]:
# save
feature_data.to_pickle('song_feature_data.pkl')