# Collecting Data from the Spotify Web API using Spotipy

### About Spotipy:

From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): "Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


### About using the Spotify Web API:

Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) to access the Spotify data. In this notebook, I used the [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs and the [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features. 
The data was collected on April 23rd 2018.


### Goal of this notebook:

The goal is to collect audio features data for tracks from the [official Spotify Web API](https://beta.developer.spotify.com/documentation/web-api/) so that it can be used for further analysis and machine learning, which will be part of another notebook.

## Importing libraries

Disclaimer: installing Spotipy and setting up authorization are outside the scope of this notebook. Detailed information about the procedure is available in the [official docs](https://spotipy.readthedocs.io/en/latest/#installation).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util

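# placeholders: replace with your own credentials from the Spotify developer dashboard
cid = "xy"
secret = "xy"
username = "xy"
redirect_uri = "xy"

# client credentials flow: the sp client created here runs all queries in this notebook
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# user token for the given scopes (not used by the queries below)
scope = 'user-library-read playlist-read-private'
token = util.prompt_for_user_token(username, scope, client_id=cid, client_secret=secret, redirect_uri=redirect_uri)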

## Step 1: track IDs

The [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) used in this step had a few limitations:

- limit: a maximum of 50 results can be returned per query
- offset: this is the index of the first result to return. The maximum offset is 100,000.

My solution: using a nested for loop, I increased the offset by 50 until the maximum offset was reached. The outer loop ran the query for each offset, and the inner loop appended the returned results to the appropriate lists, which I used afterwards to create my dataframe.
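
The offset-based paging described above can be sketched without touching the network. In this sketch `fake_search` is a hypothetical stand-in for `sp.search`, and `collect_pages` is an illustrative helper name, not part of Spotipy. Only the response shape (`tracks` then `items`) mirrors what the notebook's code reads. The early exit on an empty page is an addition that the notebook's loop does not have.

```python
def fake_search(limit, offset, total=120):
    """Hypothetical stand-in for sp.search: returns one page of dummy tracks."""
    ids = range(offset, min(offset + limit, total))
    items = [{"id": f"id{n}", "name": f"track {n}"} for n in ids]
    return {"tracks": {"items": items}}


def collect_pages(search, max_offset, limit=50):
    """Request pages at offsets 0, limit, 2*limit, ... below max_offset."""
    track_ids = []
    for offset in range(0, max_offset, limit):
        page = search(limit=limit, offset=offset)
        items = page["tracks"]["items"]
        if not items:  # an empty page means there is nothing left to fetch
            break
        track_ids.extend(item["id"] for item in items)
    return track_ids


track_ids = collect_pages(fake_search, max_offset=1000)
print(len(track_ids))  # 120
```

Stopping at the first empty page avoids issuing further requests once the results run out, which may matter for queries that match fewer tracks than the maximum offset allows.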

In [2]:
# timeit library to measure the time needed to run this code

import timeit
start = timeit.default_timer()

# creating empty lists where the results are going to be stored

artist_name = []
track_name = []
popularity = []
track_id = []

for offset in range(0, 100000, 50):
    # one query per page of 50 results
    track_results = sp.search(q='year:2018', type='track', limit=50, offset=offset)
    for t in track_results['tracks']['items']:
        # only the first listed artist is stored
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])


stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)

Time to run this code (in seconds): 1726.960964133963


Almost half an hour!
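
As a rough back-of-the-envelope check, the timing above can be broken down per request. The numbers are taken from the loop and its printed output. The per-request figure is only an average and says nothing about where the time was spent (network, API, or local processing).

```python
total_results = 100_000               # offsets 0, 50, ..., 99950
page_size = 50                        # search endpoint limit per query
elapsed_seconds = 1726.960964133963   # timing printed above

n_requests = len(range(0, total_results, page_size))
seconds_per_request = elapsed_seconds / n_requests

print(n_requests)                      # 2000
print(round(seconds_per_request, 2))   # 0.86
print(round(elapsed_seconds / 60, 1))  # 28.8
```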

A quick check for the track_id list:

In [3]:
print('number of elements in the track_id list:', len(track_id))

number of elements in the track_id list: 100000


Looks good. I will load the lists in a dataframe now and do some basic analysis.

In [4]:
df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})
df_tracks.head()

| | artist_name | popularity | track_id | track_name |
|---|---|---|---|---|
| 0 | Drake | 97 | 2XW4DbS6NddZxRPm5rMCeY | God's Plan |
| 1 | Drake | 99 | 1cTZMwcBJT0Ka3UJPXOeeN | Nice For What |
| 2 | Post Malone | 95 | 65NwOZqoXny4JxqAPlfxRF | Psycho (feat. Ty Dolla $ign) |
| 3 | BlocBoy JB | 98 | 4qKcDkK6siZ7Jp1Jb4m0aL | Look Alive (feat. Drake) |
| 4 | XXXTENTACION | 97 | 3ee8Jmje8o58CHK66QrVC2 | SAD! |


In [5]:
df_tracks.shape

(100000, 4)

In [6]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
artist_name    100000 non-null object
popularity     100000 non-null int64
track_id       100000 non-null object
track_name     100000 non-null object
dtypes: int64(1), object(3)
memory usage: 3.1+ MB


Sometimes, the same track is returned under different track IDs (single, as part of an album etc.).

This needs to be checked for and corrected if needed.

In [7]:
# grouping the entries by artist_name and track_name and checking for duplicates

grouped = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped[grouped > 1]

artist_name             track_name                     
!!!                     Happiness Is A Warm Yes (It Is)    2
$tupid Young            Murder Scene (feat. Lil Durk)      2
                        Pray 4 Me (feat. KB)               2
03 Greedo               If I Wasn't Rappin'                2
                        Pop It                             2
                        Substance                          2
070 Shake               Lost In Love                       2
                        Somebody Like Me                   2
                        Stranger                           2
12th Street Pharmacist  Suicide                            2
16yrold                 Young Scooter                      2
2 Chainz                LAND OF THE FREAKS                 2
                        OK BITCH                           2
                        PROUD                              2
                        Proud                              2
3LAU                    On My

The grouping returns 4267 artist/track pairs that appear more than once. Their extra copies will be dropped in the next cell, keeping the first occurrence of each pair:

In [8]:
df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)

In [9]:
# doing the same grouping as before to verify the solution

grouped_after_dropping = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped_after_dropping[grouped_after_dropping > 1]

Series([], dtype: int64)

This time the results are empty. Another way of checking this:

In [10]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

artist_name    0
popularity     0
track_id       0
track_name     0
dtype: int64

Checking how many tracks are left now:

In [11]:
df_tracks.shape

(94515, 4)
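
The checks above count slightly different things, which a toy dataframe makes easier to see. The sketch below uses made-up rows (not data from the API) and compares the number of artist/track pairs that occur more than once, the number of rows flagged by `duplicated(keep=False)`, and the number of rows actually removed by `drop_duplicates`. This may explain why the count of duplicates quoted earlier differs from the number of rows removed.

```python
import pandas as pd

# made-up rows: (A, x) occurs three times, (B, y) twice, (C, z) once
toy = pd.DataFrame({
    "artist_name": ["A", "A", "A", "B", "B", "C"],
    "track_name":  ["x", "x", "x", "y", "y", "z"],
    "track_id":    ["1", "2", "3", "4", "5", "6"],
})
key = ["artist_name", "track_name"]

sizes = toy.groupby(key).size()
pairs_with_copies = int((sizes > 1).sum())                         # pairs occurring more than once
rows_flagged = int(toy.duplicated(subset=key, keep=False).sum())   # every row of such a pair
deduped = toy.drop_duplicates(subset=key)                          # keeps the first row of each pair
rows_removed = len(toy) - len(deduped)

print(pairs_with_copies, rows_flagged, rows_removed)  # 2 5 3
print(100000 - 94515)  # rows removed in the notebook: 5485
```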

## Step 2: audio features

With the [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) I will now get the audio features data for my 94515 track IDs.

The limitation for this endpoint was that a maximum of 100 track IDs can be submitted per query.

Again, I used a nested for loop. This time the outer loop pulled track IDs in batches of 100 and ran the query for each batch, while the inner loop appended the results to the rows list.

Additionally, I had to implement a check for track IDs that didn't return any audio features (i.e. None was returned), as this was causing issues.
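
The batching and the None check can be sketched in isolation. Here `fake_audio_features` is a hypothetical stand-in for `sp.audio_features` that returns None for some IDs, and `fetch_in_batches` is an illustrative helper name. Neither is part of Spotipy.

```python
def fake_audio_features(track_ids):
    """Hypothetical stand-in for sp.audio_features: None for IDs ending in '0'."""
    return [None if tid.endswith("0") else {"id": tid, "tempo": 120.0}
            for tid in track_ids]


def fetch_in_batches(track_ids, fetch, batch_size=100):
    """Call fetch with at most batch_size IDs at a time and skip None results."""
    rows, missing = [], 0
    for offset in range(0, len(track_ids), batch_size):
        batch = track_ids[offset:offset + batch_size]
        for features in fetch(batch):
            if features is None:
                missing += 1
            else:
                rows.append(features)
    return rows, missing


ids = [f"id{n}" for n in range(250)]
rows, missing = fetch_in_batches(ids, fake_audio_features)
print(len(rows), missing)  # 225 25
```

Counting the skipped entries instead of silently dropping them keeps the totals checkable: the number of rows plus the number of missing results should equal the number of IDs submitted.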

In [12]:
# again measuring the time with timeit

start = timeit.default_timer()

# setting up the empty list, batchsize and the counter for None results
rows = []
batchsize = 100
None_counter = 0

for offset in range(0, len(df_tracks), batchsize):
    # positional slice of up to 100 IDs, passed on as a plain list
    batch = df_tracks['track_id'].iloc[offset:offset + batchsize].tolist()
    feature_results = sp.audio_features(batch)
    for t in feature_results:
        if t is None:
            None_counter += 1
        else:
            rows.append(t)

print('Number of tracks where no audio features were available:', None_counter)

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)

Number of tracks where no audio features were available: 825
Time to run this code (in seconds): 269.73977800505236


This one was relatively fast - less than 5 minutes!

825 tracks had no audio features.
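
The figures printed above can be cross-checked with a little arithmetic. This is only a consistency check on numbers already reported in this notebook, and the time per request is an average.

```python
import math

n_track_ids = 94515                   # rows left after dropping duplicates
batch_size = 100                      # maximum IDs per audio features query
missing = 825                         # tracks without audio features
elapsed_seconds = 269.73977800505236  # timing printed above

n_requests = math.ceil(n_track_ids / batch_size)
print(n_requests)                              # 946
print(n_track_ids - missing)                   # 93690
print(round(elapsed_seconds / n_requests, 3))  # 0.285
```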

Checking what the rows list looks like:

In [13]:
print('number of elements in the rows list:', len(rows))

number of elements in the rows list: 93690


Finally, I will load the audio features in a dataframe, do some basic checks and merge it with the first one:

In [14]:
df_audio_features = pd.DataFrame.from_dict(rows,orient='columns')
df_audio_features.head()

| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0244 | https://api.spotify.com/v1/audio-analysis/2XW4... | 0.753 | 198960 | 0.454 | 2XW4DbS6NddZxRPm5rMCeY | 5.6e-05 | 7 | 0.498 | -9.488 | 1 | 0.0963 | 77.17 | 4 | https://api.spotify.com/v1/tracks/2XW4DbS6NddZ... | audio_features | spotify:track:2XW4DbS6NddZxRPm5rMCeY | 0.344 |
| 1 | 0.0934 | https://api.spotify.com/v1/audio-analysis/1cTZ... | 0.567 | 210926 | 0.913 | 1cTZMwcBJT0Ka3UJPXOeeN | 0.000124 | 8 | 0.114 | -6.471 | 1 | 0.0736 | 93.35 | 4 | https://api.spotify.com/v1/tracks/1cTZMwcBJT0K... | audio_features | spotify:track:1cTZMwcBJT0Ka3UJPXOeeN | 0.792 |
| 2 | 0.566 | https://api.spotify.com/v1/audio-analysis/65Nw... | 0.74 | 220880 | 0.558 | 65NwOZqoXny4JxqAPlfxRF | 0.0 | 8 | 0.112 | -8.115 | 1 | 0.102 | 140.057 | 4 | https://api.spotify.com/v1/tracks/65NwOZqoXny4... | audio_features | spotify:track:65NwOZqoXny4JxqAPlfxRF | 0.421 |
| 3 | 0.00104 | https://api.spotify.com/v1/audio-analysis/4qKc... | 0.922 | 181263 | 0.581 | 4qKcDkK6siZ7Jp1Jb4m0aL | 5.9e-05 | 10 | 0.105 | -7.495 | 1 | 0.27 | 140.022 | 4 | https://api.spotify.com/v1/tracks/4qKcDkK6siZ7... | audio_features | spotify:track:4qKcDkK6siZ7Jp1Jb4m0aL | 0.595 |
| 4 | 0.258 | https://api.spotify.com/v1/audio-analysis/3ee8... | 0.74 | 166606 | 0.613 | 3ee8Jmje8o58CHK66QrVC2 | 0.00372 | 8 | 0.123 | -4.88 | 1 | 0.145 | 75.023 | 4 | https://api.spotify.com/v1/tracks/3ee8Jmje8o58... | audio_features | spotify:track:3ee8Jmje8o58CHK66QrVC2 | 0.473 |


In [15]:
df_audio_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93690 entries, 0 to 93689
Data columns (total 18 columns):
acousticness        93690 non-null float64
analysis_url        93690 non-null object
danceability        93690 non-null float64
duration_ms         93690 non-null int64
energy              93690 non-null float64
id                  93690 non-null object
instrumentalness    93690 non-null float64
key                 93690 non-null int64
liveness            93690 non-null float64
loudness            93690 non-null float64
mode                93690 non-null int64
speechiness         93690 non-null float64
tempo               93690 non-null float64
time_signature      93690 non-null int64
track_href          93690 non-null object
type                93690 non-null object
uri                 93690 non-null object
valence             93690 non-null float64
dtypes: float64(9), int64(4), object(5)
memory usage: 12.9+ MB


Some columns contain URLs/URIs which are not needed for the analysis, so I will drop them along with the type column.

Also, the id column will be renamed to track_id so that it matches the column name from the first dataframe.

In [16]:
columns_to_drop = ['analysis_url','track_href','type','uri']

df_audio_features.drop(columns_to_drop, axis=1,inplace=True)

In [17]:
df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)

In [18]:
df_audio_features.head()

| | acousticness | danceability | duration_ms | energy | track_id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0244 | 0.753 | 198960 | 0.454 | 2XW4DbS6NddZxRPm5rMCeY | 5.6e-05 | 7 | 0.498 | -9.488 | 1 | 0.0963 | 77.17 | 4 | 0.344 |
| 1 | 0.0934 | 0.567 | 210926 | 0.913 | 1cTZMwcBJT0Ka3UJPXOeeN | 0.000124 | 8 | 0.114 | -6.471 | 1 | 0.0736 | 93.35 | 4 | 0.792 |
| 2 | 0.566 | 0.74 | 220880 | 0.558 | 65NwOZqoXny4JxqAPlfxRF | 0.0 | 8 | 0.112 | -8.115 | 1 | 0.102 | 140.057 | 4 | 0.421 |
| 3 | 0.00104 | 0.922 | 181263 | 0.581 | 4qKcDkK6siZ7Jp1Jb4m0aL | 5.9e-05 | 10 | 0.105 | -7.495 | 1 | 0.27 | 140.022 | 4 | 0.595 |
| 4 | 0.258 | 0.74 | 166606 | 0.613 | 3ee8Jmje8o58CHK66QrVC2 | 0.00372 | 8 | 0.123 | -4.88 | 1 | 0.145 | 75.023 | 4 | 0.473 |


In [19]:
df_audio_features.describe()

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 | 93690.0 |
| mean | 0.324905 | 0.586431 | 213644.7 | 0.580781 | 0.230713 | 5.242758 | 0.192953 | -9.779976 | 0.604141 | 0.112959 | 119.995312 | 3.886914 | 0.438479 |
| std | 0.334599 | 0.18738 | 126051.7 | 0.253634 | 0.362988 | 3.605844 | 0.16557 | 6.331046 | 0.489037 | 0.125132 | 30.133499 | 0.503241 | 0.260862 |
| min | 0.0 | 0.0 | 3203.0 | 0.0 | 0.0 | 0.0 | 0.0 | -60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0283 | 0.47 | 166217.0 | 0.415 | 0.0 | 2.0 | 0.0972 | -11.626 | 0.0 | 0.0388 | 97.0 | 4.0 | 0.222 |
| 50% | 0.188 | 0.611 | 202865.5 | 0.611 | 0.000214 | 5.0 | 0.123 | -7.982 | 1.0 | 0.0564 | 120.043 | 4.0 | 0.419 |
| 75% | 0.591 | 0.728 | 241083.2 | 0.781 | 0.498 | 8.0 | 0.233 | -5.729 | 1.0 | 0.131 | 139.914 | 4.0 | 0.637 |
| max | 0.996 | 0.996 | 5610020.0 | 1.0 | 1.0 | 11.0 | 0.996 | 1.806 | 1.0 | 0.964 | 249.983 | 5.0 | 1.0 |


In [20]:
df_audio_features.shape

(93690, 14)

In [21]:
# merging both dataframes

df = pd.merge(df_tracks,df_audio_features,on='track_id',how='inner')
df.head()

| | artist_name | popularity | track_id | track_name | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Drake | 97 | 2XW4DbS6NddZxRPm5rMCeY | God's Plan | 0.0244 | 0.753 | 198960 | 0.454 | 5.6e-05 | 7 | 0.498 | -9.488 | 1 | 0.0963 | 77.17 | 4 | 0.344 |
| 1 | Drake | 99 | 1cTZMwcBJT0Ka3UJPXOeeN | Nice For What | 0.0934 | 0.567 | 210926 | 0.913 | 0.000124 | 8 | 0.114 | -6.471 | 1 | 0.0736 | 93.35 | 4 | 0.792 |
| 2 | Post Malone | 95 | 65NwOZqoXny4JxqAPlfxRF | Psycho (feat. Ty Dolla $ign) | 0.566 | 0.74 | 220880 | 0.558 | 0.0 | 8 | 0.112 | -8.115 | 1 | 0.102 | 140.057 | 4 | 0.421 |
| 3 | BlocBoy JB | 98 | 4qKcDkK6siZ7Jp1Jb4m0aL | Look Alive (feat. Drake) | 0.00104 | 0.922 | 181263 | 0.581 | 5.9e-05 | 10 | 0.105 | -7.495 | 1 | 0.27 | 140.022 | 4 | 0.595 |
| 4 | XXXTENTACION | 97 | 3ee8Jmje8o58CHK66QrVC2 | SAD! | 0.258 | 0.74 | 166606 | 0.613 | 0.00372 | 8 | 0.123 | -4.88 | 1 | 0.145 | 75.023 | 4 | 0.473 |


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 93690 entries, 0 to 93689
Data columns (total 17 columns):
artist_name         93690 non-null object
popularity          93690 non-null int64
track_id            93690 non-null object
track_name          93690 non-null object
acousticness        93690 non-null float64
danceability        93690 non-null float64
duration_ms         93690 non-null int64
energy              93690 non-null float64
instrumentalness    93690 non-null float64
key                 93690 non-null int64
liveness            93690 non-null float64
loudness            93690 non-null float64
mode                93690 non-null int64
speechiness         93690 non-null float64
tempo               93690 non-null float64
time_signature      93690 non-null int64
valence             93690 non-null float64
dtypes: float64(9), int64(5), object(3)
memory usage: 12.9+ MB
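
An inner merge keeps only the track IDs present in both dataframes, which is why the merged dataframe has 93690 rows rather than 94515. The toy example below illustrates this with made-up rows. The `validate` and `indicator` arguments are optional `pd.merge` parameters that the notebook itself does not use. They are shown here only as one possible way to guard against repeated keys and to list the tracks that have no audio features.

```python
import pandas as pd

# made-up rows: track "b" has no audio features
tracks = pd.DataFrame({"track_id": ["a", "b", "c"],
                       "track_name": ["one", "two", "three"]})
features = pd.DataFrame({"track_id": ["a", "c"],
                         "tempo": [77.17, 93.35]})

# inner merge; validate raises an error if track_id is repeated on either side
merged = pd.merge(tracks, features, on="track_id", how="inner", validate="one_to_one")
print(merged.shape)  # (2, 3)

# left merge with an indicator column to find tracks without features
left = pd.merge(tracks, features, on="track_id", how="left", indicator=True)
no_features = left.loc[left["_merge"] == "left_only", "track_id"].tolist()
print(no_features)  # ['b']
```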


Just in case, checking for any duplicate tracks:

In [23]:
df[df.duplicated(subset=['artist_name','track_name'],keep=False)]

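| | artist_name | popularity | track_id | track_name | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|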


Everything seems to be fine so I will save the dataframe as a .csv file.

In [24]:
df.to_csv('SpotifyAudioFeatures260042018.csv')
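
With default settings, `to_csv` also writes the dataframe index, which pandas labels `Unnamed: 0` when the file is read back. The sketch below shows the difference on a made-up one-row dataframe, writing to an in-memory buffer instead of a file. Whether to keep the index is a choice for the follow-up notebook, and passing `index=False` is only one option.

```python
import io
import pandas as pd

toy = pd.DataFrame({"artist_name": ["Drake"], "popularity": [97]})

# default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)  # ['Unnamed: 0', 'artist_name', 'popularity']

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)  # ['artist_name', 'popularity']
```

Alternatively, reading the file with `pd.read_csv(..., index_col=0)` restores the saved index instead of creating an extra column.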