# Data Collection Using Spotify Web API

## Spotify Web API
Spotify has a number of [API endpoints](https://developer.spotify.com/documentation/web-api/reference-beta/) available to access the Spoitfy data.  In this notebook, I use the following endpoints:

+ [search enpoint](https://developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs
+ [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.

## Purpose of Notebook
The purpose of this notebook is to show how to collect and store audio features data for tracks from the [official Spotify Web API](https://developer.spotify.com/documentation/web-api/) for futher exploratory data analysis and machine learning.  

# 1. Setup
The following code uses `spotipy` from the [Spotify](https://spotipy.readthedocs.io/en/latest/) library.  Spotipy is a python library for accessing the Spotify web API.   

In [50]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import timeit

In [83]:
cid ="c27d762046d144e48d9d7d929e9c2206" 
secret = "fd9583d5356249e1a32e262640b989dc"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


# Data Collection

Data collection is done in 3 parts: first we need the artist lineup, second the track IDs, and third the audio features for each track ID. 

In [88]:
df_lineup = pd.read_csv("data/anjunadeep-explorations-artists.txt", sep=",", header=0)
#df_lineup = pd.DataFrame({'artist_name':lineup})

In [89]:
print(df_lineup.shape)


(56, 3)


In [90]:
df_lineup.head()

Unnamed: 0,artist_name,artist_id,artist_uri
0,16BL,0u2qG4roqULELVVO9fMgSG,spotify:artist:0u2qG4roqULELVVO9fMgSG
1,Amber Stomp,,
2,Antic,7auS7PkaRQHTzo2ExUJIff,spotify:artist:7auS7PkaRQHTzo2ExUJIff
3,Baltra,2tEyBfwGBfQgLXeAJW0MgC,spotify:artist:2tEyBfwGBfQgLXeAJW0MgC
4,Ben Böhmer,5tDjiBYUsTqzd0RkTZxK7u,spotify:artist:5tDjiBYUsTqzd0RkTZxK7u


In [92]:
print(df_lineup['artist_uri'][2])
#print(df_lineup.at[0,'artist_id'])

spotify:artist:7auS7PkaRQHTzo2ExUJIff


In [99]:
#Pull all of the artist's albums
sp_albums = sp.artist_albums(df_lineup['artist_uri'][0], album_type='album')
sp_singles = sp.artist_albums(df_lineup['artist_uri'][0], album_type='single')
sp_appears_on = sp.artist_albums(df_lineup['artist_uri'][0], album_type='appears_on')
sp_compilations = sp.artist_albums(df_lineup['artist_uri'][0], album_type='compilation')


#Store artist's albums' names' and uris in separate lists
album_names = []
album_uris = []
for i in range(len(sp_albums['items'])):
    album_names.append(sp_albums['items'][i]['name'])
    album_uris.append(sp_albums['items'][i]['uri'])
    
single_names = []
single_uris = []
for i in range(len(sp_singles['items'])):
    single_names.append(sp_singles['items'][i]['name'])
    single_uris.append(sp_singles['items'][i]['uri'])

appears_on_names = []
appears_on_uris = []
for i in range(len(sp_appears_on['items'])):
    single_names.append(sp_appears_on['items'][i]['name'])
    single_uris.append(sp_appears_on['items'][i]['uri'])
    
compilation_names = []
compilation_uris = []
for i in range(len(sp_compilations['items'])):
    single_names.append(sp_compilations['items'][i]['name'])
    single_uris.append(sp_compilations['items'][i]['uri'])
    
print(df_lineup['artist_name'][0])

print('Albums:')
for i in range(len(album_names)):
    print(album_names[i], '-', album_uris[i])

print('\nSingles:')
for i in range(len(single_names)):
    print(single_names[i], '-', single_uris[i])
    
print('\nAppears On:')
for i in range(len(appears_on_names)):
    print(appears_on_names[i], '-', appears_on_uris[i])
    
print('\nCompilations:')
for i in range(len(compilation_names)):
    print(compilation_names[i], '-', compilation_uris[i])
#Keep names and uris in same order to keep track of duplicate albums

16BL
Albums:
Warung Brazil 2012 (Unmixed Edits) - spotify:album:6XADszDV1om9Ba5CPXQdD7
Warung Brazil 2012 (Presented by 16 Bit Lolitas) - spotify:album:76fWXHQ79NL0mTlzmgCd39
Warung Brazil 2012 - presented by 16 Bit Lolitas (Mixed Version) - spotify:album:1bIN4Ty4bZ506cjh9gTeKR
Supermarkt - spotify:album:7jRPozZUucjr0J5G1c9Bxx
Supermarkt (Mixed Version) - spotify:album:0QCiNmGREfkCxo4VYIR2Ny

Singles:
You Are High / Far and Wide - spotify:album:5YKoRwJxTa2JVHMAXlnahA
Vette / Leaving Home - spotify:album:4rDSAYMAsLVz7sJqxBCF1b
Lie Alone (16BL Remix) - spotify:album:0UQ17CZ8TnAxeUN8EnbE2S
Peninsula EP - spotify:album:409MstQ9mj43L7Kax48IfN
Stardust EP - spotify:album:532SRjzNyWbdpW8lmmfIpW
Not The Only One EP - spotify:album:3CSMwazZ5rBbAyOlIRit4o
Gravity - spotify:album:75pOASR4DUaD76TLqXDMF4
Deep In My Soul EP - spotify:album:5i7B8WeK7fyaIs09D9JGwT
Beat Organ EP - spotify:album:0lpgIFBoTBxs1drbXM6p1x
Chant A Tune / Deep Space Girls (Remixed) - spotify:album:3dQWNFlEd2oHf7bNgXqT6x
Chant

In [94]:
def albumTracks(uri):
    album = uri #assign album uri to a_name
    spotify_albums[album] = {} #Creates dictionary for that specific album
    #Create keys-values of empty lists inside nested dictionary for album
    spotify_albums[album]['album'] = [] #create empty list
    spotify_albums[album]['track_number'] = []
    spotify_albums[album]['id'] = []
    spotify_albums[album]['name'] = []
    spotify_albums[album]['uri'] = []

    tracks = sp.album_tracks(album) #pull data on album tracks
    for n in range(len(tracks['items'])): #for each song track
        spotify_albums[album]['album'].append(album_names[album_count]) #append album name tracked via album_count
        spotify_albums[album]['track_number'].append(tracks['items'][n]['track_number'])
        spotify_albums[album]['id'].append(tracks['items'][n]['id'])
        spotify_albums[album]['name'].append(tracks['items'][n]['name'])
        spotify_albums[album]['uri'].append(tracks['items'][n]['uri'])

In [95]:
spotify_albums = {}
album_count = 0
for i in album_uris: #each album
    albumTracks(i)
    print("Album " + str(album_names[album_count]) + " songs has been added to spotify_albums dictionary")
    album_count+=1 #Updates album count once all tracks have been added

Album Warung Brazil 2012 (Unmixed Edits) songs has been added to spotify_albums dictionary
Album Warung Brazil 2012 (Presented by 16 Bit Lolitas) songs has been added to spotify_albums dictionary
Album Warung Brazil 2012 - presented by 16 Bit Lolitas (Mixed Version) songs has been added to spotify_albums dictionary
Album Supermarkt songs has been added to spotify_albums dictionary
Album Supermarkt (Mixed Version) songs has been added to spotify_albums dictionary


In [39]:
start = timeit.default_timer()

# create empty lists where the results are going to be stored
artist_name = []
track_name = []
popularity = []
track_id = []

for k in range(len(df_lineup)):
    lineup = 'artist:' + df_lineup['artist_name'][k]
    for i in range(0,1000,50):
        track_results = sp.search(q=lineup, type='track', limit=50,offset=i)
        for i, t in enumerate(track_results['tracks']['items']):
            artist_name.append(t['artists'][0]['name'])
            track_name.append(t['name'])
            track_id.append(t['id'])
            popularity.append(t['popularity'])
    
    progress = '(' + str(k) + '/' + str(len(df_lineup)) + ')'
    print(progress, "Finished downloading songs for: ", df_lineup['artist_name'][k])
stop = timeit.default_timer()
print ('Time to run this code (in seconds):', stop - start)


(0/56) Finished downloading songs for:  16BL


KeyboardInterrupt: 

# 3. EDA
Now for some exploratory data analysis on the data we just collected.

Checking the `track_id` list:

In [7]:
print('number of elements in track_id list:', len(track_id))

number of elements in track_id list: 9885


Loading the lists into a dataframe

In [26]:
df_tracks = pd.DataFrame({'artist_name':artist_name, 
                          'track_name':track_name, 
                          'track_id':track_id, 
                          'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()

(9885, 4)


Unnamed: 0,artist_name,track_name,track_id,popularity
0,16BL,Vette,39Oui3EXtdMLKQqQIFjRpa,45
1,16BL,Deep In My Soul - Original Mix,5nmr0qCA3JXN2MRdEDUFi6,45
2,16BL,Leaving Home,2S3UsA5sFyCGdpdWqkusTe,42
3,16BL,Got This Feeling [Mix Cut] - Original Mix,497k4sQ2fYCqx7LvMzu7jI,40
4,16BL,You Are High - Mixed,1jIQineO80eKy3CDenpUEn,39


Let's view some information about the data frame

In [9]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9885 entries, 0 to 9884
Data columns (total 4 columns):
artist_name    9885 non-null object
track_name     9885 non-null object
track_id       9885 non-null object
popularity     9885 non-null int64
dtypes: int64(1), object(3)
memory usage: 309.0+ KB


## Checking Our Data
There can be duplicates of the same track under different `track_id`s.  This is caused by the track being released in singles albums and full albums. 

Let's check how many duplicates there are by checking the `artist_name` and `track_name`. 

In [10]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  1217


There are a bunch of duplicate tracks in the dataframe.  But which should be dropped and which should be kept?  We can sort by the track `popularity`.  

From the official [Spotify docs](https://developer.spotify.com/documentation/web-api/reference/search/search/):

> "The popularity of the track. The value will be between 0 and 100, with 100 being the most popular.
The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. 
Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time."

The `popularity` metric isn't based solely on 'number of plays' but also how recently those plays were.  

In [11]:
df_tracks.sort_values(by=['artist_name', 'track_name', 'popularity'], ascending=False, inplace=True)
df_tracks.drop_duplicates(subset=['artist_name', 'track_name'], keep='first', inplace=True)

Now we check again if there are duplicates:

In [12]:
# group the entries by artist_name and track_name and check for duplicates
duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )

Number of duplicate tracks:  0


Alternatively, we can check for duplicates via:

In [13]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

artist_name    0
track_name     0
track_id       0
popularity     0
dtype: int64

Check the number of tracks after dropping duplicates:

In [14]:
df_tracks.shape

(7887, 4)

# 4. Retrieve Audio Features Data
Using the [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) we can retrieve the audio features data for the tracks we have collected.

There is a 100 track ID limit per query for this endpoint.  We can use anested for loop to pull track IDs in batches of size 100. 

In [15]:
# Measuring time
start = timeit.default_timer()

rows = []
batchsize = 100
None_counter = 0

for i in range(0, len(df_tracks['track_id']), batchsize): 
    batch = df_tracks['track_id'][i:i+batchsize]
    feature_results = sp.audio_features(batch)
    
    for i, t in enumerate(feature_results):
        if t == None:
            None_counter = None_counter + 1
        else:
            rows.append(t)
            
print('Number of tracks where no audio features were available:', None_counter)

stop = timeit.default_timer()
print ('Code runtime (sec):', stop - start)

Number of tracks where no audio features were available: 102
Code runtime (sec): 20.385093240998685


# 5. EDA & Data Preparation

In [16]:
print('Number of elements in track_feature list:', len(rows))

Number of elements in track_feature list: 7785


Loading audio features into a dataframe:

In [17]:
df_audio_features = pd.DataFrame.from_dict(rows, orient='columns')
print("Shape of dataset:", df_audio_features.shape)
df_audio_features.head()

Shape of dataset: (7785, 18)


Unnamed: 0,acousticness,analysis_url,danceability,duration_ms,energy,id,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,track_href,type,uri,valence
0,0.0362,https://api.spotify.com/v1/audio-analysis/6tew...,0.612,222511,0.865,6tewlP12DteyK9l7N0obC6,4.7e-05,4,0.146,-7.974,1,0.0505,106.898,4,https://api.spotify.com/v1/tracks/6tewlP12Dtey...,audio_features,spotify:track:6tewlP12DteyK9l7N0obC6,0.856
1,0.3,https://api.spotify.com/v1/audio-analysis/5zd1...,0.693,236042,0.87,5zd1UPKRyOVFUpVCBwH2ot,0.0,7,0.201,-5.706,1,0.0765,140.007,4,https://api.spotify.com/v1/tracks/5zd1UPKRyOVF...,audio_features,spotify:track:5zd1UPKRyOVFUpVCBwH2ot,0.844
2,0.25,https://api.spotify.com/v1/audio-analysis/1Ul6...,0.532,231131,0.796,1Ul6aOajjwkm4Gpncei7N3,0.0,10,0.0939,-7.214,0,0.0411,169.929,4,https://api.spotify.com/v1/tracks/1Ul6aOajjwkm...,audio_features,spotify:track:1Ul6aOajjwkm4Gpncei7N3,0.917
3,0.0833,https://api.spotify.com/v1/audio-analysis/565F...,0.569,209450,0.746,565FReKHXs16ki0nlmJrsB,3e-06,9,0.0712,-7.578,0,0.044,135.036,4,https://api.spotify.com/v1/tracks/565FReKHXs16...,audio_features,spotify:track:565FReKHXs16ki0nlmJrsB,0.596
4,0.00272,https://api.spotify.com/v1/audio-analysis/7ati...,0.51,198165,0.727,7atiIWXBaeA5oYelLxqTZV,0.0,9,0.267,-7.851,0,0.0309,127.068,1,https://api.spotify.com/v1/tracks/7atiIWXBaeA5...,audio_features,spotify:track:7atiIWXBaeA5oYelLxqTZV,0.809


In [18]:
df_audio_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7785 entries, 0 to 7784
Data columns (total 18 columns):
acousticness        7785 non-null float64
analysis_url        7785 non-null object
danceability        7785 non-null float64
duration_ms         7785 non-null int64
energy              7785 non-null float64
id                  7785 non-null object
instrumentalness    7785 non-null float64
key                 7785 non-null int64
liveness            7785 non-null float64
loudness            7785 non-null float64
mode                7785 non-null int64
speechiness         7785 non-null float64
tempo               7785 non-null float64
time_signature      7785 non-null int64
track_href          7785 non-null object
type                7785 non-null object
uri                 7785 non-null object
valence             7785 non-null float64
dtypes: float64(9), int64(4), object(5)
memory usage: 1.1+ MB


Renaming `id` to `track_id` to match the `df_tracks` dataframe:

In [19]:
df_audio_features.rename(columns = {'id': 'track_id'}, inplace=True)
df_audio_features.shape # checking our progress

(7785, 18)

In [20]:
df_track_meta_data = df_audio_features[['track_id', 'analysis_url', 'track_href', 'type', 'uri']]
df_track_meta_data.to_csv('data/anjunadeep-explorations-track-meta-data.csv')
df_track_meta_data.head()

Unnamed: 0,track_id,analysis_url,track_href,type,uri
0,6tewlP12DteyK9l7N0obC6,https://api.spotify.com/v1/audio-analysis/6tew...,https://api.spotify.com/v1/tracks/6tewlP12Dtey...,audio_features,spotify:track:6tewlP12DteyK9l7N0obC6
1,5zd1UPKRyOVFUpVCBwH2ot,https://api.spotify.com/v1/audio-analysis/5zd1...,https://api.spotify.com/v1/tracks/5zd1UPKRyOVF...,audio_features,spotify:track:5zd1UPKRyOVFUpVCBwH2ot
2,1Ul6aOajjwkm4Gpncei7N3,https://api.spotify.com/v1/audio-analysis/1Ul6...,https://api.spotify.com/v1/tracks/1Ul6aOajjwkm...,audio_features,spotify:track:1Ul6aOajjwkm4Gpncei7N3
3,565FReKHXs16ki0nlmJrsB,https://api.spotify.com/v1/audio-analysis/565F...,https://api.spotify.com/v1/tracks/565FReKHXs16...,audio_features,spotify:track:565FReKHXs16ki0nlmJrsB
4,7atiIWXBaeA5oYelLxqTZV,https://api.spotify.com/v1/audio-analysis/7ati...,https://api.spotify.com/v1/tracks/7atiIWXBaeA5...,audio_features,spotify:track:7atiIWXBaeA5oYelLxqTZV


Dropping all variables (columns) not relevant to the analysis:

In [21]:
columns_to_drop = ['analysis_url', 'track_href', 'type', 'uri']
df_audio_features.drop(columns_to_drop, axis=1, inplace=True)

To combine the two dataframes we do an `inner` merge to only keep track IDs that are in both datasets.

In [22]:
df = pd.merge(df_tracks, df_audio_features, on='track_id', how='inner')
print("Shape of dataset:", df_audio_features.shape)
df.head()

Shape of dataset: (7785, 14)


Unnamed: 0,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Žika Antić,Uzice,6tewlP12DteyK9l7N0obC6,0,0.0362,0.612,222511,0.865,4.7e-05,4,0.146,-7.974,1,0.0505,106.898,4,0.856
1,Žika Antić,Sta ako sam je voleo,5zd1UPKRyOVFUpVCBwH2ot,0,0.3,0.693,236042,0.87,0.0,7,0.201,-5.706,1,0.0765,140.007,4,0.844
2,Žika Antić,Kise jesenje,1Ul6aOajjwkm4Gpncei7N3,0,0.25,0.532,231131,0.796,0.0,10,0.0939,-7.214,0,0.0411,169.929,4,0.917
3,Žika Antić,Kad te budu pitali,565FReKHXs16ki0nlmJrsB,0,0.0833,0.569,209450,0.746,3e-06,9,0.0712,-7.578,0,0.044,135.036,4,0.596
4,Žika Antić,Evo moje sudbine,7atiIWXBaeA5oYelLxqTZV,0,0.00272,0.51,198165,0.727,0.0,9,0.267,-7.851,0,0.0309,127.068,1,0.809


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7785 entries, 0 to 7784
Data columns (total 17 columns):
artist_name         7785 non-null object
track_name          7785 non-null object
track_id            7785 non-null object
popularity          7785 non-null int64
acousticness        7785 non-null float64
danceability        7785 non-null float64
duration_ms         7785 non-null int64
energy              7785 non-null float64
instrumentalness    7785 non-null float64
key                 7785 non-null int64
liveness            7785 non-null float64
loudness            7785 non-null float64
mode                7785 non-null int64
speechiness         7785 non-null float64
tempo               7785 non-null float64
time_signature      7785 non-null int64
valence             7785 non-null float64
dtypes: float64(9), int64(5), object(3)
memory usage: 1.1+ MB


Checking again for duplicate tracks:

In [24]:
df[df.duplicated(subset=['artist_name', 'track_name'], keep=False)]

Unnamed: 0,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence


In [30]:
#df.hist(by=df['artist_name'])
duplicates = df_tracks.groupby(['artist_name'], as_index=True).size()
print(duplicates)
#print("Number of duplicate artists: ", duplicates[duplicates > 1].count() )

artist_name
16BL                                                                                                                                                              411
4D : James Monro & Grant Collins                                                                                                                                    2
ANTIC                                                                                                                                                               2
Abafana Baka Molo                                                                                                                                                  34
Adam Zahran, Hisham Zahran                                                                                                                                          4
Adham Hisham, Hisham Zahran                                                                                                                                   

So far everything looks good so let's save the dataframe as a .csv file.

In [25]:
df.to_csv('data/anjunadeep-explorations-track-audio-features.csv')