# Sync Link
### Part 1E: Gathering Spotify Data (Audio Features)

The next layer of data comes from Spotify. As with the previous searches, we first grab the ID of each song via a search and then go back to grab the song's info. First, I'll clean up the results from the Deezer and Lyric Freak searches.

In [1]:
import pandas as pd
import numpy as np
import requests
import time

import warnings
warnings.filterwarnings('ignore')

In [2]:
sync = pd.read_csv('./data/synced_writer_pub.csv')

In [3]:
sync.head()

Unnamed: 0,title,artist,year,explicit,styles,languages,title_artist,synced,d_id,d_song,...,d_artist,d_album_id,d_album,d_art,lyric_url,l_title,l_artist,l_album,l_writer,l_pub
0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,1,98975170,Tennessee Whiskey,...,Chris Stapleton,10127538,Traveller,https://cdns-images.dzcdn.net/images/cover/1dd...,/c/chris+stapleton/tennessee+whiskey_21104107....,\nAbout\nTennessee Whiskey lyrics\n,Chris Stapleton,\n from album:\n ...,"Songwriters: Linda H Bartholomew, Dean Dillon",\n Tennessee Whiskey lyrics...
1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,1,739870792,Dance Monkey,...,Tones and I,108770322,The Kids Are Coming,https://cdns-images.dzcdn.net/images/cover/563...,/t/tones+and+i/dance+monkey_1711431.html,\nTones And I – Dance Monkey Lyrics,Tones And I,0,Songwriters: Toni Watson,\n Dance Monkey lyrics © Wa...
2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,1,145434430,Sweet Caroline,...,Neil Diamond,15802430,50th Anniversary Collection,https://cdns-images.dzcdn.net/images/cover/40b...,/n/neil+diamond/sweet+caroline_20098802.html,\nAbout\nSweet Caroline lyrics\n,Neil Diamond,\n from album:\n ...,Songwriters: Neil Diamond,\n Sweet Caroline lyrics © ...
3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,1,582143242,Someone You Loved,...,Lewis Capaldi,78031872,Breach,https://cdns-images.dzcdn.net/images/cover/9a4...,/l/lewis+capaldi/someone+you+loved_21585657.html,\nLewis Capaldi – Someone You Loved Lyrics,Lewis Capaldi,0,"Songwriters: Benjamin Kohn, Lewis Capaldi, Pet...",\n Someone You Loved lyrics...
4,Amazing Grace,Traditional,1831,0,"Traditionnal,Gospel,Blues",English,amazing grace - traditional,1,958704262,Amazing Grace,...,Traditional,147840742,8 Angels Blessing of Song,https://e-cdns-images.dzcdn.net/images/cover/a...,,0,0,0,0,0


In [4]:
sync.isna().sum()

title              0
artist             0
year               0
explicit           0
styles             0
languages          0
title_artist       0
synced             0
d_id               0
d_song             0
d_isrc             0
d_release          0
d_explicit         0
d_bpm              0
d_artist           0
d_album_id         0
d_album            0
d_art              1
lyric_url       3128
l_title            0
l_artist           0
l_album            0
l_writer           0
l_pub              0
dtype: int64

Here we can see that wherever 'lyric_url' is missing, we weren't able to retrieve lyrics from Lyric Freak because the song was not on the site. These rows will be removed.

In [5]:
sync[sync['lyric_url'].isna()]['synced'].value_counts()

0    2720
1     408
Name: synced, dtype: int64

Luckily, most of the rows being dropped are "unsynced", so removing them helps the class imbalance in the long run. This is why I started with a larger number of "unsynced" songs.

In [6]:
sync[sync['lyric_url'].notna()]['synced'].value_counts()

0    7280
1    4283
Name: synced, dtype: int64
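As a quick check on that claim, the two sets of counts above can be turned into class shares. This is only arithmetic on the `value_counts()` outputs; the `synced_share` helper is a name made up for this illustration. It shows the synced share moving from roughly 32% to roughly 37% once the rows without lyrics are dropped.

```python
# Counts copied from the two value_counts() outputs above.
missing = {0: 2720, 1: 408}    # rows without a lyric_url
present = {0: 7280, 1: 4283}   # rows with a lyric_url

def synced_share(counts):
    """Fraction of rows labeled synced (1)."""
    return counts[1] / (counts[0] + counts[1])

# Class counts before any rows are dropped
before = {label: missing[label] + present[label] for label in (0, 1)}

share_before = synced_share(before)
share_after = synced_share(present)

print(f'Synced share before dropping: {share_before:.1%}')
print(f'Synced share after dropping:  {share_after:.1%}')
```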

In [7]:
sync = sync[((sync['lyric_url'].notna()) & (sync['d_isrc'] != '0')) | (sync['artist'] == 'Traditional')]

In [8]:
sync = sync.reset_index()

The filter above also handles tracks we weren't able to find on Deezer, where 'd_isrc' holds '0' instead of a real code. We drop these, but keep songs labeled as "Traditional" since these are in the public domain and wouldn't have a specific ISRC.
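The filter relies on `&` binding more tightly than `|`, so "Traditional" songs are kept even when they have no lyrics or ISRC. Here is a small sketch of that behavior on a toy DataFrame. The rows and ISRC strings are made up for illustration and are not from the real dataset.

```python
import pandas as pd

# Toy rows (made up for illustration), not taken from the real dataset.
toy = pd.DataFrame({
    'artist': ['Artist A', 'Traditional', 'Artist B', 'Artist C'],
    'lyric_url': ['/a/a.html', None, None, '/c/c.html'],
    'd_isrc': ['FAKE-ISRC-1', '0', 'FAKE-ISRC-2', '0'],
})

has_lyrics = toy['lyric_url'].notna()
has_isrc = toy['d_isrc'] != '0'
is_traditional = toy['artist'] == 'Traditional'

# Without extra parentheses, & is evaluated before |
implicit = toy[has_lyrics & has_isrc | is_traditional]
# The same filter with the grouping spelled out
explicit = toy[(has_lyrics & has_isrc) | is_traditional]
# A different grouping, which would drop the Traditional row
other = toy[has_lyrics & (has_isrc | is_traditional)]

print(list(implicit.index), list(other.index))
```

Spelling out the grouping costs nothing and makes it harder to change the meaning by accident when another condition is added later.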

In [9]:
len(sync)

10845

Our dataset moving into this next search is just under 11,000 songs.

### Spotify API
Next, we'll go through the Spotify API in two steps. The first is a search to grab the Spotify ID for each track (stored in the 's_uri' column) and the second is to grab the specific data with that ID.

In [10]:
#Importing credentials
from credentials import s_key, ss_key

###### Step 0: Authorization
The first step in accessing the API is getting an access token via authorization:

In [11]:
#Spotify API help from https://github.com/kylepw/spotify-api-auth-examples/blob/master/client/app.py
def auth(key, sec):
    
    authorize = 'https://accounts.spotify.com/api/token'
    #requests form-encodes `data`, so the Content-Type header is set for us
    param = {
    'grant_type' : 'client_credentials'
    }
    #Use the credentials passed in rather than the global variables
    res = requests.post(authorize, auth = (key, sec), data = param)
    token = res.json()['access_token']
    
    return token

In [12]:
token = auth(s_key, ss_key)

In [13]:
token

'BQBkSJZanpveeeJtPCXzFhIrZwbkk3zEumiTPUtDY5dNfE2hrsP3QJQ-IHAM9719EGrz9_wXFKa3dNjW2pU'
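The token response also carries an `expires_in` field (the lifetime in seconds), so one option is to reuse a token until shortly before it expires instead of reauthenticating whenever a request fails. Below is a minimal sketch of that idea. `TokenCache`, its `fetch` argument, and the 60-second margin are all hypothetical, and it assumes a variant of `auth` that returns the whole JSON response rather than just the token string. The demo uses a fake fetch function and a fake clock, so no request is made.

```python
import time

class TokenCache:
    """Reuse a token until shortly before it expires (hypothetical helper)."""

    def __init__(self, fetch, clock=time.monotonic, margin=60):
        self._fetch = fetch      # callable returning the token endpoint's JSON as a dict
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._margin = margin    # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch()
            self._token = payload['access_token']
            # 'expires_in' is the lifetime in seconds; 3600 is an assumed fallback
            self._expires_at = now + payload.get('expires_in', 3600) - self._margin
        return self._token

# Demo with a fake fetch function and a fake clock (no network call).
calls = []

def fake_fetch():
    calls.append(1)
    return {'access_token': f'token-{len(calls)}', 'expires_in': 3600}

fake_now = [0.0]
cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])

first = cache.get()
fake_now[0] = 1000.0
second = cache.get()   # still valid, so the cached token is reused
fake_now[0] = 3600.0
third = cache.get()    # past the refresh point, so a new token is fetched
```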

In [14]:
spotify_search = 'https://api.spotify.com/v1/search'

###### Step 1: Get Spotify track URIs
Setting up the empty columns below that will hold the results of the search:

In [15]:
sync['s_artist'] = 0
sync['s_track'] = 0
sync['s_uri'] = 0

This function searches Spotify for each row and adds the top result to the DataFrame:

In [16]:
def get_uri(df):
    #Set up empty counter and total songs to get
    count = 0 
    total = len(df)
    token = auth(s_key, ss_key)

    #Search endpoint
    spotify_search = 'https://api.spotify.com/v1/search'
    
    for i in range(len(df)):
        count += 1
        song = df.loc[i, 'title_artist']

        #Default to nan so a failed search never reuses the previous row's values
        artist = np.nan
        track = np.nan
        uri = np.nan
        status = None
       
        try:
            
            #Search parameters with the song plugged in
            params = {
            'q' : song,
            'type': 'track',
            'limit' : 5
            }

            #Header for authorization
            header = {'Authorization' : f'Bearer {token}'}

            #Make the request
            res = requests.get(spotify_search, headers = header, params = params)
            status = res.status_code

            #If the request failed, reauthenticate and retry once with the new token
            if status != 200:
                token = auth(s_key, ss_key)
                header = {'Authorization' : f'Bearer {token}'}
                res = requests.get(spotify_search, headers = header, params = params)
                status = res.status_code
            
            #Continue if status code is 200:
            if status == 200:
            
                #Save the top result and grab its ID
                top = res.json()['tracks']['items'][0]

                artist = top['artists'][0]['name']
                track = top['name']
                uri = top['id']

            else:
                print(f'Could not retrieve {song}, status code{status}')

        except Exception:
            #If it doesn't work (e.g. no results), print out the song and leave the nan values
            print(f'Could not retrieve {song}, status code{status}')
            
        #Add the info to the DataFrame:
        df.loc[i, 's_artist'] = artist
        df.loc[i, 's_track'] = track
        df.loc[i, 's_uri'] = uri
        
        #Print updates for every 100
        if count % 100 == 0:
            print(f'{count} gathered out of {total}')
        
        #Take a break to not exceed the rate limit
        time.sleep(1)

    return df
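The search asks for five results but only the first one is used. If the top hit turns out to be a cover or karaoke version, one possible refinement is to score each returned item against the title and artist we searched for and keep the closest. The sketch below is not part of the original notebook: `similarity` and `pick_best_match` are hypothetical helpers, the scoring (an average of two `difflib` ratios) is an arbitrary choice, and the items are made up in the shape of `results['tracks']['items']`.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pick_best_match(items, title, artist):
    """Return the item whose track and artist names best match, or None if empty."""
    best, best_score = None, -1.0
    for item in items:
        score = (similarity(item['name'], title)
                 + similarity(item['artists'][0]['name'], artist)) / 2
        if score > best_score:
            best, best_score = item, score
    return best

# Made-up items shaped like the entries in results['tracks']['items'].
items = [
    {'name': 'Sweet Caroline (Karaoke Version)',
     'artists': [{'name': 'Party Band'}], 'id': 'fake-id-1'},
    {'name': 'Sweet Caroline',
     'artists': [{'name': 'Neil Diamond'}], 'id': 'fake-id-2'},
]

best = pick_best_match(items, 'Sweet Caroline', 'Neil Diamond')
print(best['id'])
```

An empty result list returns `None` here instead of raising, which also gives a cleaner way to detect searches that came back with a 200 status but no tracks.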

In [17]:
sync = get_uri(sync)

100 gathered out of 10845
Could not retrieve when the saints go marching in (big band version) - traditional, status code200
200 gathered out of 10845
300 gathered out of 10845
400 gathered out of 10845
500 gathered out of 10845
600 gathered out of 10845
700 gathered out of 10845
800 gathered out of 10845
900 gathered out of 10845
1000 gathered out of 10845
1100 gathered out of 10845
1200 gathered out of 10845
1300 gathered out of 10845
1400 gathered out of 10845
1500 gathered out of 10845
1600 gathered out of 10845
1700 gathered out of 10845
1800 gathered out of 10845
1900 gathered out of 10845
2000 gathered out of 10845
2100 gathered out of 10845
2200 gathered out of 10845
2300 gathered out of 10845
2400 gathered out of 10845
2500 gathered out of 10845
2600 gathered out of 10845
2700 gathered out of 10845
2800 gathered out of 10845
2900 gathered out of 10845
3000 gathered out of 10845
Could not retrieve i'd rather be with you - joshua radin, status code200
3100 gathered out of 10845
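The one-second pause in `get_uri` is a fixed guess at a safe pace. When Spotify does rate limit a client, it answers with status 429 and a `Retry-After` header giving the number of seconds to wait, so a request wrapper could honor that instead. This is a minimal sketch under those assumptions: `get_with_backoff` is a hypothetical helper, and the demo uses a fake response object and a fake `sleep` so nothing is sent over the network.

```python
import time

def get_with_backoff(get, url, max_tries=3, sleep=time.sleep, **kwargs):
    """Call get(url, **kwargs); on HTTP 429 wait Retry-After seconds and retry."""
    res = None
    for attempt in range(max_tries):
        res = get(url, **kwargs)
        if res.status_code != 429:
            return res
        # Retry-After is given in seconds; fall back to 1 second if it is absent
        wait = int(res.headers.get('Retry-After', 1))
        if attempt < max_tries - 1:
            sleep(wait)
    return res

# Demo with fake responses: the first call is rate limited, the second succeeds.
class FakeResponse:
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}

responses = [FakeResponse(429, {'Retry-After': '2'}), FakeResponse(200)]
waits = []

res = get_with_backoff(lambda url, **kw: responses.pop(0),
                       'https://api.spotify.com/v1/search',
                       sleep=waits.append)
print(res.status_code, waits)
```

In the notebook this would be called as `get_with_backoff(requests.get, spotify_search, headers = header, params = params)`.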


Saving this copy of the DataFrame with Spotify URIs:

In [18]:
sync.to_csv('./data/sync_spotify_uri.csv', index = False)
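At one request per second, 10,845 rows take about three hours, and the copy above is only written once the whole loop has finished. One way to protect a long run is to save a checkpoint every few hundred rows and skip finished rows on a rerun. The sketch below shows the idea on a toy DataFrame. `run_with_checkpoints`, the `fill_row` callback, and the 'done' column are all hypothetical names, not part of the original notebook.

```python
import os
import tempfile
import pandas as pd

def run_with_checkpoints(df, fill_row, path, every=500):
    """Apply fill_row(df, i) to each row, saving to `path` every `every` rows.

    Rows whose (hypothetical) 'done' flag is already True are skipped, so a
    rerun after a crash picks up from the last checkpoint.
    """
    if os.path.exists(path):
        df = pd.read_csv(path)
    if 'done' not in df.columns:
        df['done'] = False

    for i in range(len(df)):
        if df.loc[i, 'done']:
            continue
        fill_row(df, i)
        df.loc[i, 'done'] = True
        if (i + 1) % every == 0:
            df.to_csv(path, index=False)

    #Final save once every row has been processed
    df.to_csv(path, index=False)
    return df

# Toy stand-in for the per-row API lookup
def fill_row(df, i):
    df.loc[i, 'doubled'] = df.loc[i, 'value'] * 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.csv')
    toy = pd.DataFrame({'value': [1, 2, 3, 4, 5], 'doubled': 0})
    result = run_with_checkpoints(toy, fill_row, path, every=2)
    saved = pd.read_csv(path)
```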

###### Step 2: Using the URIs, go back to get the track features:
Setting up empty columns to hold the returned features:

In [19]:
sync['s_dance'] = 0
sync['s_energy'] = 0 
sync['s_key'] = 0 
sync['s_loudness'] = 0 
sync['s_mode'] = 0
sync['s_speech'] = 0
sync['s_acoustic'] = 0 
sync['s_inst'] = 0
sync['s_live'] = 0 
sync['s_valence'] = 0 
sync['s_tempo'] = 0 
sync['s_duration'] = 0
sync['s_time_sig'] = 0

This function looks up each song via the URI from the last search and gets the audio features for the track:

In [22]:
def get_features(df):
    #Set up empty counter and total songs to get
    count = 0 
    total = len(df)
    
    token = auth(s_key, ss_key)
    header = {'Authorization' : f'Bearer {token}'}

    #DataFrame columns and the Spotify audio feature each one holds
    features = {
        's_dance' : 'danceability',
        's_energy' : 'energy',
        's_key' : 'key',
        's_loudness' : 'loudness',
        's_mode' : 'mode',
        's_speech' : 'speechiness',
        's_acoustic' : 'acousticness',
        's_inst' : 'instrumentalness',
        's_live' : 'liveness',
        's_valence' : 'valence',
        's_tempo' : 'tempo',
        's_duration' : 'duration_ms',
        's_time_sig' : 'time_signature'
    }
    
    #For each URI
    for i in range(len(df)):
        count += 1
        uri = df.loc[i, 's_uri']

        #Start from NaNs so a failed lookup never reuses the previous row's values
        values = {col: np.nan for col in features}
        status = None

        #NaN is truthy, so check for a missing URI explicitly
        if pd.notna(uri) and uri != 0:
        
            try:
                url = f'https://api.spotify.com/v1/audio-features/{uri}'
                res = requests.get(url, headers = header) 
                status = res.status_code

                if status != 200:
                    #Try reauthenticating, then retry the same endpoint with the new token
                    token = auth(s_key, ss_key)
                    header = {'Authorization' : f'Bearer {token}'}
                    res = requests.get(url, headers = header)
                    status = res.status_code
                
                if status == 200:
                    results = res.json()

                    #These are all the features to save:
                    values = {col: results[name] for col, name in features.items()}
                else:
                    print(f'Could not gather row {i}, status code {status}')

            except Exception:
                #Leave the NaNs in place:
                print(f'Could not gather row {i}, status code {status}')
        else:
            #Leave the NaNs in place:
            print(f'No URI for row {i}')

        #Add the features to the DataFrame:
        for col, value in values.items():
            df.loc[i, col] = value
               
        if count % 100 == 0:
            print(f'{count} songs gathered out of {total}')
        time.sleep(1)
               
    return df

In [23]:
sync = get_features(sync)

100 songs gathered out of 10845
200 songs gathered out of 10845
300 songs gathered out of 10845
400 songs gathered out of 10845
500 songs gathered out of 10845
600 songs gathered out of 10845
700 songs gathered out of 10845
800 songs gathered out of 10845
900 songs gathered out of 10845
1000 songs gathered out of 10845
1100 songs gathered out of 10845
1200 songs gathered out of 10845
1300 songs gathered out of 10845
1400 songs gathered out of 10845
1500 songs gathered out of 10845
1600 songs gathered out of 10845
1700 songs gathered out of 10845
1800 songs gathered out of 10845
1900 songs gathered out of 10845
2000 songs gathered out of 10845
2100 songs gathered out of 10845
2200 songs gathered out of 10845
2300 songs gathered out of 10845
2400 songs gathered out of 10845
2500 songs gathered out of 10845
2600 songs gathered out of 10845
2700 songs gathered out of 10845
2800 songs gathered out of 10845
2900 songs gathered out of 10845
3000 songs gathered out of 10845
3100 songs gathered
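Fetching one track per request means 10,845 calls. The Spotify Web API also documents a batch form of this endpoint that takes up to 100 comma-separated IDs in an `ids` query parameter and returns a list under `audio_features`. That would cut the run to 109 requests, assuming the endpoint is available to your app. The sketch below is not the approach used above: `chunked` and `get_features_batched` are hypothetical helpers, and the demo passes in a fake `get` function with made-up IDs so it runs without credentials.

```python
FEATURES_URL = 'https://api.spotify.com/v1/audio-features'

def chunked(ids, size=100):
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def get_features_batched(ids, token, get):
    """Fetch audio features for many track IDs, up to 100 per request.

    `get` is any requests.get-style callable (pass requests.get in real use).
    """
    header = {'Authorization': f'Bearer {token}'}
    features = {}
    for batch in chunked(ids):
        res = get(FEATURES_URL, headers=header, params={'ids': ','.join(batch)})
        if res.status_code != 200:
            continue  # skip this batch; the caller can fill NaNs for missing IDs
        for track_id, item in zip(batch, res.json()['audio_features']):
            features[track_id] = item  # item may be None if no features exist
    return features

# Demo with a fake get function and made-up IDs (no network call).
class FakeResponse:
    status_code = 200

    def __init__(self, ids):
        self._ids = ids

    def json(self):
        return {'audio_features': [{'id': i, 'tempo': 120.0} for i in self._ids]}

calls = []

def fake_get(url, headers, params):
    calls.append(params['ids'])
    return FakeResponse(params['ids'].split(','))

ids = [f'id{n}' for n in range(250)]
features = get_features_batched(ids, 'fake-token', get=fake_get)
print(len(calls), len(features))
```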

In [24]:
sync.head()

Unnamed: 0,index,title,artist,year,explicit,styles,languages,title_artist,synced,d_id,...,s_loudness,s_mode,s_speech,s_acoustic,s_inst,s_live,s_valence,s_tempo,s_duration,s_time_sig
0,0,Tennessee Whiskey,Chris Stapleton,2015,0,"Blues,Rock,Country",English,tennessee whiskey - chris stapleton,1,98975170,...,-10.888,1.0,0.0298,0.205,0.0096,0.0821,0.512,48.718,293293.0,4.0
1,1,Dance Monkey,Tones and I,2019,0,Pop,English,dance monkey - tones and i,1,739870792,...,-6.4,0.0,0.0924,0.692,0.000104,0.149,0.513,98.027,209438.0,4.0
2,2,Sweet Caroline,Neil Diamond,1969,0,Pop,English,sweet caroline - neil diamond,1,145434430,...,-16.066,1.0,0.0274,0.611,0.000109,0.237,0.578,63.05,203573.0,4.0
3,3,Someone You Loved,Lewis Capaldi,2018,0,Pop,English,someone you loved - lewis capaldi,1,582143242,...,-5.679,1.0,0.0319,0.751,0.0,0.105,0.446,109.891,182161.0,4.0
4,4,Amazing Grace,Traditional,1831,0,"Traditionnal,Gospel,Blues",English,amazing grace - traditional,1,958704262,...,-10.39,1.0,0.0378,0.684,2.2e-05,0.11,0.0756,76.416,270267.0,1.0


Saving the final results to a csv:

In [25]:
sync.to_csv('./data/sync_spotify_final.csv', index = False)