# Data Collection Using Spotify Web API

## Spotify Web API
Spotify provides a number of [API endpoints](https://developer.spotify.com/documentation/web-api/reference-beta/) for accessing Spotify data. In this notebook, I use the following endpoints:

+ [search endpoint](https://developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs
+ [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features

## Purpose of Notebook
The purpose of this notebook is to show how to collect and store audio features data for tracks from the [official Spotify Web API](https://developer.spotify.com/documentation/web-api/) for further exploratory data analysis and machine learning.

# 1. Setup
The following code uses [`spotipy`](https://spotipy.readthedocs.io/en/latest/), a Python library for accessing the Spotify Web API.

In [2]:
import os

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# read the app credentials from environment variables instead of hardcoding them in the notebook
cid = os.environ["SPOTIPY_CLIENT_ID"]
secret = os.environ["SPOTIPY_CLIENT_SECRET"]

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


# 2. Data Collection

Data collection is done in 2 parts: first the track IDs and then the audio features for each track ID. 

In [3]:
# timeit library to measure the time needed to run this code
import timeit
start = timeit.default_timer()

# create empty lists where the results are going to be stored
artist_name = []
track_name = []
popularity = []
track_id = []

# page through the search results 50 tracks at a time (offsets 0, 50, ..., 950)
for offset in range(0, 1000, 50):
    track_results = sp.search(q='year:2018', type='track', limit=50, offset=offset)
    for t in track_results['tracks']['items']:
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)



Time to run this code (in seconds): 17.1485118290002
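The offset-based paging used in the search loop can be sketched as a reusable helper. This is an illustrative sketch rather than part of the original notebook: `collect_tracks` and `fake_search` are hypothetical names, and `fake_search` is a stand-in that mimics only the shape of the `sp.search` response so the example runs without credentials.

```python
def collect_tracks(search, query, total=1000, page_size=50):
    """Page through search results and return one flat list of track dicts.

    `search` is any callable accepting the keyword arguments used with
    sp.search in this notebook (q, type, limit, offset).
    """
    tracks = []
    for offset in range(0, total, page_size):
        page = search(q=query, type='track', limit=page_size, offset=offset)
        items = page['tracks']['items']
        if not items:  # stop early once the results run out
            break
        tracks.extend(items)
    return tracks


def fake_search(q, type, limit, offset):
    """Hypothetical stand-in for sp.search that serves 120 made-up tracks."""
    ids = range(offset, min(offset + limit, 120))
    items = [{'id': f'id{n}', 'name': f'track {n}', 'popularity': 0,
              'artists': [{'name': 'artist'}]} for n in ids]
    return {'tracks': {'items': items}}


tracks = collect_tracks(fake_search, 'year:2018', total=1000, page_size=50)
print(len(tracks))
```

With the real client, the call would presumably look like `collect_tracks(sp.search, 'year:2018')`. Stopping on an empty page avoids issuing further requests when a query has fewer results than requested.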


# 3. EDA
Now for some exploratory data analysis on the data we just collected.

Checking the `track_id` list:

In [3]:
print('number of elements in track_id list:', len(track_id))

number of elements in track_id list: 1000


Loading the lists into a dataframe:

In [5]:
import pandas as pd

df_tracks = pd.DataFrame({'artist_name':artist_name, 
                          'track_name':track_name, 
                          'track_id':track_id, 
                          'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()

(1000, 4)


| | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Post Malone | Sunflower - Spider-Man: Into the Spider-Verse | 3KkXRkHbMCARz0aVfEt68P | 95 |
| 1 | A Boogie Wit da Hoodie | Swervin (feat. 6ix9ine) | 1wJRveJZLSb1rjhnUHQiv6 | 91 |
| 2 | Post Malone | Wow. | 6MWtB6iiXyIwun0YzU6DFP | 93 |
| 3 | Meek Mill | Going Bad (feat. Drake) | 2IRZnDFmlqMuOrYOLnZZyc | 90 |
| 4 | YNW Melly | Murder On My Mind | 7eBqSVxrzQZtK2mmgRG6lC | 89 |


Let's view some information about the dataframe:

In [6]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
artist_name    1000 non-null object
track_name     1000 non-null object
track_id       1000 non-null object
popularity     1000 non-null int64
dtypes: int64(1), object(3)
memory usage: 31.3+ KB


## Checking Our Data
The same track can appear more than once under different IDs. This happens when a track is released both as a single and as part of a full album.

Let's check how many duplicates there are by checking the `artist_name` and `track_name`. 

In [8]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  36
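For intuition, the same check can be expressed in plain Python. The sketch below uses made-up rows, not data from the notebook, and counts how many `(artist_name, track_name)` pairs occur more than once, which is what the `groupby(...).size()` expression reports. It also computes the number of surplus rows, which can differ from the pair count when a track appears three or more times.

```python
from collections import Counter

# Hypothetical sample rows: (artist_name, track_name, track_id)
sample = [
    ('Artist A', 'Song 1', 'id_single'),
    ('Artist A', 'Song 1', 'id_album'),
    ('Artist A', 'Song 1', 'id_deluxe'),
    ('Artist A', 'Song 2', 'id_2'),
    ('Artist B', 'Song 1', 'id_3'),
]

# Count occurrences of each (artist, track) pair, ignoring the track ID
counts = Counter((artist, track) for artist, track, _ in sample)

# Pairs that occur more than once (the quantity printed in the cell above)
duplicated_pairs = {pair: n for pair, n in counts.items() if n > 1}

# Rows that would be removed if only one copy of each pair were kept
surplus_rows = sum(n - 1 for n in duplicated_pairs.values())

print(len(duplicated_pairs), surplus_rows)
```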


To drop the duplicate tracks:

In [9]:
df_tracks.drop_duplicates(subset=['artist_name', 'track_name'], inplace=True)
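By default, `drop_duplicates` keeps the first occurrence of each duplicated pair, so which track ID survives depends on row order. If a specific version is preferred, one option (an assumption on my part, not something the notebook does) is to sort before deduplicating. The sketch below uses a made-up dataframe `toy` to contrast the two behaviours.

```python
import pandas as pd

# Hypothetical data: the same song released as a single and on an album
toy = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist A', 'Artist B'],
    'track_name': ['Song 1', 'Song 1', 'Song 2'],
    'track_id': ['id_single', 'id_album', 'id_other'],
    'popularity': [40, 75, 60],
})

# Default behaviour: the first occurrence of each pair is kept
kept_first = toy.drop_duplicates(subset=['artist_name', 'track_name'])

# Alternative: put the most popular version first, dedupe, then restore row order
kept_popular = (toy.sort_values('popularity', ascending=False)
                   .drop_duplicates(subset=['artist_name', 'track_name'])
                   .sort_index())

print(kept_first['track_id'].tolist())
print(kept_popular['track_id'].tolist())
```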

Now we check again if there are duplicates:

In [10]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  0


Alternatively, we can check for duplicates via:

In [11]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

artist_name    0
track_name     0
track_id       0
popularity     0
dtype: int64

Check the number of tracks after dropping duplicates:

In [12]:
df_tracks.shape

(964, 4)

# 4. Retrieve Audio Features Data
Using the [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) we can retrieve the audio features data for the tracks we have collected.

This endpoint accepts at most 100 track IDs per query, so we use a nested for loop to request the audio features in batches of 100.

In [14]:
# Measuring time
start = timeit.default_timer()

rows = []
batchsize = 100
None_counter = 0

for offset in range(0, len(df_tracks), batchsize):
    # take the next batch of up to 100 track IDs
    batch = df_tracks['track_id'].iloc[offset:offset + batchsize].tolist()
    feature_results = sp.audio_features(batch)

    for t in feature_results:
        # entries are None for tracks without audio features
        if t is None:
            None_counter += 1
        else:
            rows.append(t)

print('Number of tracks where no audio features were available:', None_counter)

stop = timeit.default_timer()
print('Code runtime (sec):', stop - start)

Number of tracks where no audio features were available: 0
Code runtime (sec): 1.094440887998644
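The batching step can be isolated in a small helper, which makes the batch sizes easy to verify without calling the API. This is a minimal sketch: `chunked` is a hypothetical helper name and the IDs are placeholders, chosen only to match the 964 tracks left after deduplication.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start_idx in range(0, len(items), size):
        yield items[start_idx:start_idx + size]


# Placeholder IDs standing in for df_tracks['track_id'].tolist()
ids = [f'id{n}' for n in range(964)]
batches = list(chunked(ids, 100))

# Number of requests needed and the size of the final, partial batch
print(len(batches), len(batches[-1]))
```

For 964 IDs this works out to ten requests, the last one carrying the remaining 64 IDs.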


# 5. EDA & Data Preparation

In [15]:
print('Number of elements in rows list:', len(rows))

Number of elements in rows list: 964


Loading audio features into a dataframe:

In [17]:
df_audio_features = pd.DataFrame.from_dict(rows, orient='columns')
print("Shape of dataset:", df_audio_features.shape)
df_audio_features.head()

Shape of dataset: (964, 18)


| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.556 | https://api.spotify.com/v1/audio-analysis/3KkX... | 0.76 | 158040 | 0.479 | 3KkXRkHbMCARz0aVfEt68P | 0.0 | 2 | 0.0703 | -5.574 | 1 | 0.0466 | 89.911 | 4 | https://api.spotify.com/v1/tracks/3KkXRkHbMCAR... | audio_features | spotify:track:3KkXRkHbMCARz0aVfEt68P | 0.913 |
| 1 | 0.0153 | https://api.spotify.com/v1/audio-analysis/1wJR... | 0.581 | 189487 | 0.662 | 1wJRveJZLSb1rjhnUHQiv6 | 0.0 | 9 | 0.111 | -5.239 | 1 | 0.303 | 93.023 | 4 | https://api.spotify.com/v1/tracks/1wJRveJZLSb1... | audio_features | spotify:track:1wJRveJZLSb1rjhnUHQiv6 | 0.434 |
| 2 | 0.163 | https://api.spotify.com/v1/audio-analysis/6MWt... | 0.833 | 149520 | 0.539 | 6MWtB6iiXyIwun0YzU6DFP | 2e-06 | 11 | 0.101 | -7.399 | 0 | 0.178 | 99.947 | 4 | https://api.spotify.com/v1/tracks/6MWtB6iiXyIw... | audio_features | spotify:track:6MWtB6iiXyIwun0YzU6DFP | 0.385 |
| 3 | 0.259 | https://api.spotify.com/v1/audio-analysis/2IRZ... | 0.889 | 180522 | 0.496 | 2IRZnDFmlqMuOrYOLnZZyc | 0.0 | 4 | 0.252 | -6.365 | 0 | 0.0905 | 86.003 | 4 | https://api.spotify.com/v1/tracks/2IRZnDFmlqMu... | audio_features | spotify:track:2IRZnDFmlqMuOrYOLnZZyc | 0.544 |
| 4 | 0.145 | https://api.spotify.com/v1/audio-analysis/7eBq... | 0.759 | 268434 | 0.73 | 7eBqSVxrzQZtK2mmgRG6lC | 3e-06 | 0 | 0.11 | -7.985 | 0 | 0.0516 | 115.007 | 4 | https://api.spotify.com/v1/tracks/7eBqSVxrzQZt... | audio_features | spotify:track:7eBqSVxrzQZtK2mmgRG6lC | 0.74 |


In [18]:
df_audio_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 964 entries, 0 to 963
Data columns (total 18 columns):
acousticness        964 non-null float64
analysis_url        964 non-null object
danceability        964 non-null float64
duration_ms         964 non-null int64
energy              964 non-null float64
id                  964 non-null object
instrumentalness    964 non-null float64
key                 964 non-null int64
liveness            964 non-null float64
loudness            964 non-null float64
mode                964 non-null int64
speechiness         964 non-null float64
tempo               964 non-null float64
time_signature      964 non-null int64
track_href          964 non-null object
type                964 non-null object
uri                 964 non-null object
valence             964 non-null float64
dtypes: float64(9), int64(4), object(5)
memory usage: 135.6+ KB


Dropping all variables (columns) not relevant to the analysis:

In [19]:
columns_to_drop = ['analysis_url', 'track_href', 'type', 'uri']
df_audio_features.drop(columns_to_drop, axis=1, inplace=True)

Renaming `id` to `track_id` to match the `df_tracks` dataframe:

In [23]:
df_audio_features.rename(columns = {'id': 'track_id'}, inplace=True)
df_audio_features.shape # checking our progress

(964, 14)

To combine the two dataframes we do an `inner` merge to only keep track IDs that are in both datasets.

In [25]:
df = pd.merge(df_tracks, df_audio_features, on='track_id', how='inner')
print("Shape of dataset:", df.shape)
df.head()

Shape of dataset: (964, 17)


| | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Post Malone | Sunflower - Spider-Man: Into the Spider-Verse | 3KkXRkHbMCARz0aVfEt68P | 95 | 0.556 | 0.76 | 158040 | 0.479 | 0.0 | 2 | 0.0703 | -5.574 | 1 | 0.0466 | 89.911 | 4 | 0.913 |
| 1 | A Boogie Wit da Hoodie | Swervin (feat. 6ix9ine) | 1wJRveJZLSb1rjhnUHQiv6 | 91 | 0.0153 | 0.581 | 189487 | 0.662 | 0.0 | 9 | 0.111 | -5.239 | 1 | 0.303 | 93.023 | 4 | 0.434 |
| 2 | Post Malone | Wow. | 6MWtB6iiXyIwun0YzU6DFP | 93 | 0.163 | 0.833 | 149520 | 0.539 | 2e-06 | 11 | 0.101 | -7.399 | 0 | 0.178 | 99.947 | 4 | 0.385 |
| 3 | Meek Mill | Going Bad (feat. Drake) | 2IRZnDFmlqMuOrYOLnZZyc | 90 | 0.259 | 0.889 | 180522 | 0.496 | 0.0 | 4 | 0.252 | -6.365 | 0 | 0.0905 | 86.003 | 4 | 0.544 |
| 4 | YNW Melly | Murder On My Mind | 7eBqSVxrzQZtK2mmgRG6lC | 89 | 0.145 | 0.759 | 268434 | 0.73 | 3e-06 | 0 | 0.11 | -7.985 | 0 | 0.0516 | 115.007 | 4 | 0.74 |
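An inner merge silently discards any track ID that is missing from either dataframe. One way to see what would be dropped, sketched here on made-up data rather than the notebook's dataframes, is an outer merge with `indicator=True`. The `validate='one_to_one'` argument is an optional extra check that raises an error if a `track_id` is repeated on either side.

```python
import pandas as pd

# Hypothetical stand-ins for df_tracks and df_audio_features
left = pd.DataFrame({'track_id': ['a', 'b', 'c'], 'popularity': [90, 80, 70]})
right = pd.DataFrame({'track_id': ['a', 'b', 'd'], 'energy': [0.5, 0.6, 0.7]})

# The outer merge keeps every ID; the _merge column records where each row came from
check = pd.merge(left, right, on='track_id', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['track_id', '_merge']]

# The inner merge keeps only IDs present in both frames
inner = pd.merge(left, right, on='track_id', how='inner', validate='one_to_one')

print(unmatched)
print(inner.shape)
```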


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 964 entries, 0 to 963
Data columns (total 17 columns):
artist_name         964 non-null object
track_name          964 non-null object
track_id            964 non-null object
popularity          964 non-null int64
acousticness        964 non-null float64
danceability        964 non-null float64
duration_ms         964 non-null int64
energy              964 non-null float64
instrumentalness    964 non-null float64
key                 964 non-null int64
liveness            964 non-null float64
loudness            964 non-null float64
mode                964 non-null int64
speechiness         964 non-null float64
tempo               964 non-null float64
time_signature      964 non-null int64
valence             964 non-null float64
dtypes: float64(9), int64(5), object(3)
memory usage: 135.6+ KB


Checking again for duplicate tracks:

In [27]:
df[df.duplicated(subset=['artist_name', 'track_name'], keep=False)]

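| | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|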


So far everything looks good, so let's save the dataframe as a .csv file.

In [28]:
df.to_csv('SpotifyAudioFeatures25052019.csv')
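Note that `to_csv` writes the dataframe index as an unnamed first column by default, and pandas labels that column `Unnamed: 0` when the file is read back. If the index carries no information, passing `index=False` avoids this. The sketch below demonstrates the difference on a made-up dataframe, writing to in-memory buffers instead of files.

```python
import io

import pandas as pd

# Hypothetical two-row dataframe standing in for df
toy = pd.DataFrame({'track_id': ['a', 'b'], 'popularity': [90, 80]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```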