# Data Collection Using Spotify Web API

## Purpose of Notebook
The purpose of this notebook is to show how to collect and store audio features data for tracks from the [official Spotify Web API](https://developer.spotify.com/documentation/web-api/) for further exploratory data analysis and machine learning.

## Related Notebooks

1. **[spotify-data-retrieval](https://github.com/rtedwards/spotify-data-visualizations/blob/master/spotify-data-visualizations/spotify-data-retrieval.ipynb)** walks through collecting liked tracks with Spotipy (a Python wrapper for the Spotify API), attaching audio features to each track, and storing the result in a dataframe.
2. [spotify-data-exploration](https://github.com/rtedwards/spotify-data-visualizations/blob/master/spotify-data-visualizations/spotify-data-exploration.ipynb) is an exploratory data analysis of my liked tracks.
3. [spotify-data-clustering](https://github.com/rtedwards/spotify-data-visualizations/blob/master/spotify-data-visualizations/spotify-data-clustering.ipynb) attempts to find genres in my liked tracks using K-Means clustering.

## Spotify Web API
Spotify has a number of [API endpoints](https://developer.spotify.com/documentation/web-api/reference-beta/) available to access Spotify data. In this notebook, I use the following endpoints:

+ [search endpoint](https://developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs
+ [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.

# 1. Setup
The following code uses [Spotipy](https://spotipy.readthedocs.io/en/latest/), a Python library for accessing the Spotify Web API.

In [25]:
'''
user-read-private \
user-read-email \
user-read-recently-played \
user-read-playback-state \
user-read-currently-playing \
user-library-read \
playlist-modify-public \
playlist-read-private \
user-follow-read \
user-top-read streaming
'''

import os  # for accessing environment variables
import time  # for execution times
import spotipy  # python library for Spotify API
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
import seaborn as sns
import pandas as pd

# 2. Data Collection

Data collection is done in 2 parts: first the track IDs and then the audio features for each track ID. 

In [26]:
# Retrieve client_id and client_secret from environment variables
client_id = os.getenv('SPOTIFY_CLIENT_ID')
client_secret = os.getenv('SPOTIFY_CLIENT_SECRET')
redirect_uri = "https://localhost/8888"
username = ""
scope = 'user-library-read \
        playlist-read-private' 

# Setting Spotify Client Credentials
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Retrieving API token
token = util.prompt_for_user_token(username, scope, client_id, client_secret, redirect_uri)
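
`os.getenv` returns `None` when a variable is missing, which only surfaces later as an authentication failure. The following is a minimal sketch of an earlier check, assuming a hypothetical helper named `require_env` (not part of Spotipy or this notebook) and a stand-in dictionary in place of real credentials:

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for illustration only.
    """
    # Default to the real environment; a plain dict can be passed for testing.
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Stand-in mapping instead of real credentials (assumption for the example)
fake_env = {"SPOTIFY_CLIENT_ID": "abc123"}
example_client_id = require_env("SPOTIFY_CLIENT_ID", fake_env)
```

In the notebook, `require_env('SPOTIFY_CLIENT_ID')` could replace the direct `os.getenv` call so that a missing variable fails immediately with a readable message.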


## I. Retrieving Saved Tracks

In [27]:
if token:
    # use the time module to measure how long this code takes to run
    start = time.time()
    
    # create empty lists where the results are going to be stored
    artist_name = []
    track_name = []
    popularity = []
    track_id = []
    num_tracks = 3500  # upper bound on the number of saved tracks to request
    sp = spotipy.Spotify(auth=token)
    
    for i in range(0,num_tracks,50):
        track_results = sp.current_user_saved_tracks(limit=50,offset=i)
        for item in track_results['items']:
            track = item['track']
            artist_name.append(track['artists'][0]['name'])
            track_name.append(track['name'])
            track_id.append(track['id'])
            popularity.append(track['popularity'])
        
    stop = time.time()
    print('Time to run this code (in seconds):', stop - start)
    
else:
    print("Can't get token for", username)
    

Time to run this code (in seconds): 18.95026397705078
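
The loop above relies on a hard-coded `num_tracks` that must be at least as large as the library. One possible alternative, sketched below, is to keep requesting pages until an empty one comes back. `fetch_all_saved_tracks` and `FakeClient` are hypothetical names introduced for illustration; `FakeClient` only mimics the shape of the `current_user_saved_tracks` response so the example runs without network access.

```python
def fetch_all_saved_tracks(client, page_size=50):
    """Collect saved-track items page by page until an empty page is returned."""
    items = []
    offset = 0
    while True:
        page = client.current_user_saved_tracks(limit=page_size, offset=offset)
        if not page["items"]:
            break  # no more saved tracks
        items.extend(page["items"])
        offset += len(page["items"])
    return items


class FakeClient:
    """Hypothetical stand-in for spotipy.Spotify, used only to illustrate paging."""

    def __init__(self, n_tracks):
        self._tracks = [{"track": {"id": f"id{i}"}} for i in range(n_tracks)]

    def current_user_saved_tracks(self, limit=20, offset=0):
        # Return one page of items, like the real endpoint's "items" field
        return {"items": self._tracks[offset:offset + limit]}


saved_items = fetch_all_saved_tracks(FakeClient(120))
```

With a real client, `fetch_all_saved_tracks(sp)` would return the same `item` dictionaries that the loop above iterates over.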


Checking `track_id` list

In [28]:
print('number of elements in track_id list:', len(track_id))

number of elements in track_id list: 3234


Loading data into a dataframe

In [29]:
df_tracks = pd.DataFrame({'artist_name':artist_name, 
                          'track_name':track_name, 
                          'track_id':track_id, 
                          'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()

(3234, 4)


| | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Metallica | Enter Sandman - Live with the SFSO | 0T3Pft6gyhtBCrGOsSrvlE | 36 |
| 1 | Metallica | Battery - Live with the SFSO | 2T9a3ToWrj8ijGUzCtlRNz | 34 |
| 2 | Metallica | One - Live with the SFSO | 1L3xLJzYMRCN6ydha1XF42 | 37 |
| 3 | Metallica | Master of Puppets (Remastered) | 54bm2e3tk8cliUz3VSdCPZ | 66 |
| 4 | Metallica | Sad But True - Live with the SFSO | 5snyl56jcL8JMOqR02HYCA | 35 |


Let's view some information about the dataframe:

In [30]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3234 entries, 0 to 3233
Data columns (total 4 columns):
artist_name    3234 non-null object
track_name     3234 non-null object
track_id       3234 non-null object
popularity     3234 non-null int64
dtypes: int64(1), object(3)
memory usage: 101.1+ KB


## Checking Our Data
The same track can appear more than once under different IDs. This happens when a track is released both as a single and on a full album.

In [31]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

artist_name    183
track_name     183
track_id       183
popularity     183
dtype: int64

In [32]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  90
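
The two counts above measure different things: 183 is the number of rows that belong to some duplicated (artist, track) pair, while 90 is the number of distinct pairs that occur more than once. Keeping one row per pair should therefore remove 183 - 90 = 93 rows. The sketch below reproduces the distinction on a made-up toy dataframe (the data is illustrative, not taken from the notebook):

```python
import pandas as pd

# Toy data: track "x" by "A" appears three times, "y" by "B" twice, "z" by "C" once
toy = pd.DataFrame({
    "artist_name": ["A", "A", "A", "B", "B", "C"],
    "track_name": ["x", "x", "x", "y", "y", "z"],
    "track_id": ["1", "2", "3", "4", "5", "6"],
})
keys = ["artist_name", "track_name"]

# Rows that belong to any duplicated (artist, track) pair
rows_in_duplicate_groups = int(toy.duplicated(subset=keys, keep=False).sum())

# Distinct (artist, track) pairs that occur more than once
group_sizes = toy.groupby(keys).size()
duplicated_groups = int((group_sizes > 1).sum())

# Rows that drop_duplicates would remove when keeping the first occurrence
rows_to_drop = int(toy.duplicated(subset=keys, keep="first").sum())
```

Here `rows_to_drop` equals `rows_in_duplicate_groups - duplicated_groups`, the same relationship as between the two counts printed above.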


Dropping duplicate tracks:

In [33]:
df_tracks.drop_duplicates(subset=['artist_name', 'track_name'], inplace=True)
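
`drop_duplicates` keeps the first occurrence of each pair by default, so which version of a track survives depends on row order. If a specific version is preferred, the rows can be sorted first. The sketch below keeps the most popular version of each track; this is an alternative shown on made-up toy data, not what the notebook does.

```python
import pandas as pd

# Toy data: the same track released as a single and on an album
toy = pd.DataFrame({
    "artist_name": ["A", "A", "B"],
    "track_name": ["x", "x", "y"],
    "track_id": ["single", "album", "only"],
    "popularity": [10, 55, 30],
})

deduped = (
    toy.sort_values("popularity", ascending=False)  # most popular version first
       .drop_duplicates(subset=["artist_name", "track_name"], keep="first")
       .sort_index()  # restore the original row order
)
kept_ids = deduped["track_id"].tolist()
```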

Check again for duplicates:

In [34]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  0


Now to check the shape of our data:

In [35]:
df_tracks.shape

(3141, 4)

## II. Retrieving Audio Features for Each Track
Using the [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) we can retrieve the audio features data for the tracks we have collected.

This endpoint accepts at most 100 track IDs per query, so we use a nested for loop to pull the track IDs in batches of 100.

In [36]:
# Measuring time
start = time.time()

rows = []
batchsize = 100
None_counter = 0

# Slice a plain list, since the dataframe index has gaps after dropping duplicates
track_ids = df_tracks['track_id'].tolist()
for i in range(0, len(track_ids), batchsize):
    batch = track_ids[i:i+batchsize]
    feature_results = sp.audio_features(batch)

    # An entry is None when no audio features are available for that track
    for t in feature_results:
        if t is None:
            None_counter = None_counter + 1
        else:
            rows.append(t)

print('Number of tracks where no audio features were available:', None_counter)

stop = time.time()
print('Code runtime (sec):', stop - start)

Number of tracks where no audio features were available: 0
Code runtime (sec): 4.689333200454712
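
The batching logic can also be factored into a small generator. This is a minimal sketch; `chunked` is a hypothetical helper name and the track IDs are placeholders.

```python
import math


def chunked(values, size):
    """Yield consecutive lists of at most `size` items from `values`."""
    values = list(values)
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Placeholder IDs standing in for real Spotify track IDs
track_ids = [f"id{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(track_ids, 100)]

# Number of requests needed for the 3141 tracks in this notebook
n_requests = math.ceil(3141 / 100)
```

The last batch is simply shorter than the rest, so no special handling is needed when the number of tracks is not a multiple of 100.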


In [37]:
print('Number of elements in audio_features list:', len(rows))

Number of elements in audio_features list: 3141


In [38]:
df_audio_features = pd.DataFrame.from_dict(rows, orient='columns')
print("Shape of dataset:", df_audio_features.shape)
df_audio_features.head()

Shape of dataset: (3141, 18)


| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000144 | https://api.spotify.com/v1/audio-analysis/0T3P... | 0.278 | 459133 | 0.97 | 0T3Pft6gyhtBCrGOsSrvlE | 0.0481 | 10 | 0.966 | -4.537 | 0 | 0.145 | 131.288 | 4 | https://api.spotify.com/v1/tracks/0T3Pft6gyhtB... | audio_features | spotify:track:0T3Pft6gyhtBCrGOsSrvlE | 0.152 |
| 1 | 0.000109 | https://api.spotify.com/v1/audio-analysis/2T9a... | 0.239 | 444667 | 0.976 | 2T9a3ToWrj8ijGUzCtlRNz | 0.362 | 11 | 0.954 | -5.29 | 1 | 0.344 | 96.485 | 4 | https://api.spotify.com/v1/tracks/2T9a3ToWrj8i... | audio_features | spotify:track:2T9a3ToWrj8ijGUzCtlRNz | 0.125 |
| 2 | 7e-06 | https://api.spotify.com/v1/audio-analysis/1L3x... | 0.292 | 473067 | 0.949 | 1L3xLJzYMRCN6ydha1XF42 | 0.279 | 10 | 0.992 | -5.735 | 0 | 0.134 | 124.283 | 4 | https://api.spotify.com/v1/tracks/1L3xLJzYMRCN... | audio_features | spotify:track:1L3xLJzYMRCN6ydha1XF42 | 0.238 |
| 3 | 0.00067 | https://api.spotify.com/v1/audio-analysis/54bm... | 0.539 | 515387 | 0.828 | 54bm2e3tk8cliUz3VSdCPZ | 0.421 | 4 | 0.154 | -9.108 | 0 | 0.035 | 105.25 | 4 | https://api.spotify.com/v1/tracks/54bm2e3tk8cl... | audio_features | spotify:track:54bm2e3tk8cliUz3VSdCPZ | 0.562 |
| 4 | 1.6e-05 | https://api.spotify.com/v1/audio-analysis/5sny... | 0.294 | 346267 | 0.977 | 5snyl56jcL8JMOqR02HYCA | 0.00209 | 7 | 0.928 | -3.732 | 1 | 0.0731 | 92.375 | 4 | https://api.spotify.com/v1/tracks/5snyl56jcL8J... | audio_features | spotify:track:5snyl56jcL8JMOqR02HYCA | 0.243 |


Renaming `id` to `track_id` to match the `df_tracks` dataframe:

In [39]:
df_audio_features.rename(columns = {'id': 'track_id'}, inplace=True)
df_audio_features.shape # checking our progress

(3141, 18)

To combine the two dataframes we do an inner merge to only keep track IDs that are in both datasets.

In [40]:
df = pd.merge(df_tracks, df_audio_features, on='track_id', how='inner')
print("Shape of dataset:", df.shape)
df.head()

Shape of dataset: (3141, 21)


| | artist_name | track_name | track_id | popularity | acousticness | analysis_url | danceability | duration_ms | energy | instrumentalness | ... | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Metallica | Enter Sandman - Live with the SFSO | 0T3Pft6gyhtBCrGOsSrvlE | 36 | 0.000144 | https://api.spotify.com/v1/audio-analysis/0T3P... | 0.278 | 459133 | 0.97 | 0.0481 | ... | 0.966 | -4.537 | 0 | 0.145 | 131.288 | 4 | https://api.spotify.com/v1/tracks/0T3Pft6gyhtB... | audio_features | spotify:track:0T3Pft6gyhtBCrGOsSrvlE | 0.152 |
| 1 | Metallica | Battery - Live with the SFSO | 2T9a3ToWrj8ijGUzCtlRNz | 34 | 0.000109 | https://api.spotify.com/v1/audio-analysis/2T9a... | 0.239 | 444667 | 0.976 | 0.362 | ... | 0.954 | -5.29 | 1 | 0.344 | 96.485 | 4 | https://api.spotify.com/v1/tracks/2T9a3ToWrj8i... | audio_features | spotify:track:2T9a3ToWrj8ijGUzCtlRNz | 0.125 |
| 2 | Metallica | One - Live with the SFSO | 1L3xLJzYMRCN6ydha1XF42 | 37 | 7e-06 | https://api.spotify.com/v1/audio-analysis/1L3x... | 0.292 | 473067 | 0.949 | 0.279 | ... | 0.992 | -5.735 | 0 | 0.134 | 124.283 | 4 | https://api.spotify.com/v1/tracks/1L3xLJzYMRCN... | audio_features | spotify:track:1L3xLJzYMRCN6ydha1XF42 | 0.238 |
| 3 | Metallica | Master of Puppets (Remastered) | 54bm2e3tk8cliUz3VSdCPZ | 66 | 0.00067 | https://api.spotify.com/v1/audio-analysis/54bm... | 0.539 | 515387 | 0.828 | 0.421 | ... | 0.154 | -9.108 | 0 | 0.035 | 105.25 | 4 | https://api.spotify.com/v1/tracks/54bm2e3tk8cl... | audio_features | spotify:track:54bm2e3tk8cliUz3VSdCPZ | 0.562 |
| 4 | Metallica | Sad But True - Live with the SFSO | 5snyl56jcL8JMOqR02HYCA | 35 | 1.6e-05 | https://api.spotify.com/v1/audio-analysis/5sny... | 0.294 | 346267 | 0.977 | 0.00209 | ... | 0.928 | -3.732 | 1 | 0.0731 | 92.375 | 4 | https://api.spotify.com/v1/tracks/5snyl56jcL8J... | audio_features | spotify:track:5snyl56jcL8JMOqR02HYCA | 0.243 |
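
An inner merge silently drops any track ID that is missing from either side. When it matters which tracks were dropped, a left merge with pandas' `indicator=True` option reports where each row came from, and `validate='one_to_one'` raises an error if a track ID is repeated on either side. The following is a sketch on made-up toy data, not the notebook's dataframes:

```python
import pandas as pd

# Toy stand-ins for df_tracks and df_audio_features; track "c" has no features
tracks = pd.DataFrame({"track_id": ["a", "b", "c"], "track_name": ["x", "y", "z"]})
features = pd.DataFrame({"track_id": ["a", "b"], "tempo": [120.0, 98.5]})

# Left merge keeps every track; the _merge column records where each row matched
checked = pd.merge(
    tracks, features, on="track_id", how="left",
    validate="one_to_one", indicator=True,
)

# Track IDs with no matching audio features
missing_ids = checked.loc[checked["_merge"] == "left_only", "track_id"].tolist()

# Equivalent of the inner merge: keep only rows present on both sides
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

In this notebook both dataframes have 3141 rows before and after the merge, so no track IDs were dropped.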


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3141 entries, 0 to 3140
Data columns (total 21 columns):
artist_name         3141 non-null object
track_name          3141 non-null object
track_id            3141 non-null object
popularity          3141 non-null int64
acousticness        3141 non-null float64
analysis_url        3141 non-null object
danceability        3141 non-null float64
duration_ms         3141 non-null int64
energy              3141 non-null float64
instrumentalness    3141 non-null float64
key                 3141 non-null int64
liveness            3141 non-null float64
loudness            3141 non-null float64
mode                3141 non-null int64
speechiness         3141 non-null float64
tempo               3141 non-null float64
time_signature      3141 non-null int64
track_href          3141 non-null object
type                3141 non-null object
uri                 3141 non-null object
valence             3141 non-null float64
dtypes: float64(9), int64(5), object(7)

In [42]:
df.to_csv('data/saved-songs.csv')
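
`to_csv` fails if the `data/` directory does not exist, and by default it writes the dataframe index as an unnamed first column, which `read_csv` later loads as `Unnamed: 0`. The sketch below shows one way to handle both, using a temporary directory and a toy dataframe so it runs anywhere. Passing `index=False` is an assumption about what downstream notebooks expect, not something this notebook does.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe standing in for the merged df
toy = pd.DataFrame({"track_id": ["a", "b"], "tempo": [120.0, 98.5]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "saved-songs.csv"
    # Create the output directory if it is missing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False omits the row index from the file
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

reloaded_columns = list(reloaded.columns)
```

If the index is written (the default), reading the file back with `pd.read_csv(path, index_col=0)` restores it as the index rather than as an extra column.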