# Data Collection Using Spotify Web API

## Purpose of Notebook
The purpose of this notebook is to show how to collect and store audio features for tracks from the [official Spotify Web API](https://developer.spotify.com/documentation/web-api/) for further exploratory data analysis and machine learning.

## Related Notebooks

1. **[Data Retrieval](https://nbviewer.jupyter.org/github/rtedwards/spotify-data-visualizations/blob/master/notebooks/spotify-data-exploration.ipynb)** walks through collecting liked tracks with Spotipy (a Python wrapper for the Spotify API), attaching audio features to each track, and storing the result in a dataframe.
2. [Data Exploration](https://nbviewer.jupyter.org/github/rtedwards/spotify-data-visualizations/blob/master/notebooks/spotify-data-visualizations/spotify-data-exploration.ipynb) is an exploratory data analysis of my liked tracks.
3. [Data Clustering](https://nbviewer.jupyter.org/github/rtedwards/spotify-data-visualizations/blob/master/notebooks/spotify-data-visualizations/spotify-data-clustering.ipynb) attempts to find genres in my liked tracks using K-Means clustering.

## Spotify Web API
Spotify has a number of [API endpoints](https://developer.spotify.com/documentation/web-api/reference-beta/) available for accessing Spotify data. In this notebook, I use the following endpoints:

+ the saved tracks endpoint (`GET /v1/me/tracks`, called through Spotipy's `current_user_saved_tracks`) to get the track IDs
+ [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.

# 1. Setup
The following code uses [Spotipy](https://spotipy.readthedocs.io/en/latest/), a Python library for accessing the Spotify Web API.

In [1]:
'''
user-read-private \
user-read-email \
user-read-recently-played \
user-read-playback-state \
user-read-currently-playing \
user-library-read \
playlist-modify-public \
playlist-read-private \
user-follow-read \
user-top-read streaming
'''

import os  # for accessing environment variables
import time  # for execution times
import spotipy  # python library for Spotify API
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
import seaborn as sns
import pandas as pd

# 2. Data Collection

Data collection is done in 2 parts: first the track IDs and then the audio features for each track ID. 

In [2]:
# Retrieve client_id and client_secret from environment variables
CLIENT_ID = os.getenv('SPOTIFY_CLIENT_ID')
CLIENT_SECRET = os.getenv('SPOTIFY_CLIENT_SECRET')
REDIRECT_URI = "http://localhost:2222/"
USERNAME = ""
# Space-separated list of authorization scopes to request
SCOPE = 'user-library-read playlist-read-private'

# Setting Spotify Client Credentials
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Retrieving API token
token = util.prompt_for_user_token(USERNAME, SCOPE, CLIENT_ID, CLIENT_SECRET, REDIRECT_URI)
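
`os.getenv` returns `None` when a variable is not set, and that would only show up later as an authentication error. Here is a minimal sketch of failing early instead. `require_env` is a hypothetical helper I made up for illustration, not part of Spotipy, and the credentials in the example are fake.

```python
import os


def require_env(names, environ=None):
    """Return the values of the given environment variables, or raise if any is missing.

    Hypothetical helper for illustration; it is not part of Spotipy.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [environ[name] for name in names]


# Example with an explicit mapping so the sketch runs without real credentials.
fake_env = {"SPOTIFY_CLIENT_ID": "my-id", "SPOTIFY_CLIENT_SECRET": "my-secret"}
client_id, client_secret = require_env(
    ["SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET"], environ=fake_env
)
print(client_id, client_secret)
```

Calling `require_env([...])` without the `environ` argument reads from the real environment.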


## I. Retrieving Saved Tracks

In [3]:
if token:
    # record the start time to measure how long this code takes to run
    start = time.time()
    
    # create empty lists where the results are going to be stored
    artist_name = []
    track_name = []
    popularity = []
    track_id = []
    num_tracks = 3500
    sp = spotipy.Spotify(auth=token)
    
    for i in range(0,num_tracks,50):
        track_results = sp.current_user_saved_tracks(limit=50, offset=i)
        for item in track_results['items']:
            track = item['track']
            artist_name.append(track['artists'][0]['name'])
            track_name.append(track['name'])
            track_id.append(track['id'])
            popularity.append(track['popularity'])
        
    stop = time.time()
    print ('Time to run this code (in seconds):', stop - start)
    
else:
    print("Can't get token for", USERNAME)
    

Time to run this code (in seconds): 15.997757911682129
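
The loop above relies on a hard-coded upper bound of 3500 tracks. As a rough alternative, here is a sketch that keeps paging until an empty page comes back. It assumes the response has the same shape the cell above reads (an `items` list whose entries hold a `track` dict). `collect_saved_tracks` and `fake_fetch_page` are names I made up, and the stub stands in for the Spotify client so the example runs offline.

```python
def collect_saved_tracks(fetch_page, page_size=50):
    """Page through results until an empty page is returned.

    `fetch_page(limit=..., offset=...)` is assumed to return a dict with an
    'items' list, like the response read in the cell above.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        items = page["items"]
        if not items:
            break  # no more saved tracks
        for item in items:
            track = item["track"]
            records.append({
                "artist_name": track["artists"][0]["name"],
                "track_name": track["name"],
                "track_id": track["id"],
                "popularity": track["popularity"],
            })
        offset += page_size
    return records


# Hypothetical stub standing in for the Spotify client: 120 fake tracks.
FAKE_LIBRARY = [
    {"track": {"artists": [{"name": "Artist"}], "name": f"Song {n}",
               "id": f"id{n}", "popularity": n % 100}}
    for n in range(120)
]


def fake_fetch_page(limit, offset):
    return {"items": FAKE_LIBRARY[offset:offset + limit]}


records = collect_saved_tracks(fake_fetch_page)
print(len(records))
```

With a real client, passing `sp.current_user_saved_tracks` as `fetch_page` should work the same way, and `pd.DataFrame(records)` gives the four columns used below.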


Checking `track_id` list

In [4]:
print('number of elements in track_id list:', len(track_id))

number of elements in track_id list: 3240


Loading data into a dataframe

In [5]:
df_tracks = pd.DataFrame({'artist_name':artist_name, 
                          'track_name':track_name, 
                          'track_id':track_id, 
                          'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()

(3240, 4)


| | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Cubicolor | Fictionalise - Lindstrom & Prins Thomas Extend... | 1m8BthDjfQS47iuULONgLi | 12 |
| 1 | Cubicolor | Fictionalise - Lindstrom & Prins Thomas Remix | 40X549muIHXIvWoj2S6MKb | 16 |
| 2 | Cubicolor | Dead End Thrills - Patrice Bäumel Remix | 3MEjDXwQWnDzqwRanBgFzm | 37 |
| 3 | Cubicolor | Counterpart | 2Jm5TBQziCdySQg2J7w0PN | 33 |
| 4 | Cubicolor | No Dancers | 6NH78lyZkS05PotKqg0ZKw | 45 |


Let's view some information about the dataframe:

In [64]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3240 entries, 0 to 3239
Data columns (total 4 columns):
artist_name    3240 non-null object
track_name     3240 non-null object
track_id       3240 non-null object
popularity     3240 non-null int64
dtypes: int64(1), object(3)
memory usage: 101.3+ KB


## Checking Our Data
The same track can appear more than once under different IDs, typically because it was released both as a single and on a full album.

In [65]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'], keep=False)].count()

artist_name    185
track_name     185
track_id       185
popularity     185
dtype: int64

In [66]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  91


Dropping duplicate tracks:

In [67]:
df_tracks.drop_duplicates(subset=['artist_name', 'track_name'], inplace=True)

Check again for duplicates:

In [68]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  0


Now to check the shape of our data:

In [69]:
df_tracks.shape

(3146, 4)
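
The counts in this section measure different things: 185 is the number of rows that belong to any duplicated (artist, track) pair, while 91 is the number of distinct pairs that occur more than once. A toy example with made-up data shows how the two relate to the number of rows that `drop_duplicates` removes:

```python
import pandas as pd

# Made-up data: "Song A" appears three times, "Song B" twice, "Song C" once.
toy = pd.DataFrame({
    "artist_name": ["X", "X", "X", "Y", "Y", "Z"],
    "track_name": ["Song A", "Song A", "Song A", "Song B", "Song B", "Song C"],
    "track_id": ["a1", "a2", "a3", "b1", "b2", "c1"],
})
keys = ["artist_name", "track_name"]

# Rows that belong to any duplicated (artist, track) pair.
rows_in_duplicate_groups = int(toy.duplicated(subset=keys, keep=False).sum())

# Distinct (artist, track) pairs that occur more than once.
group_sizes = toy.groupby(keys).size()
duplicated_pairs = int((group_sizes > 1).sum())

# Rows removed by drop_duplicates: every copy except the first in each group.
rows_removed = len(toy) - len(toy.drop_duplicates(subset=keys))

print(rows_in_duplicate_groups, duplicated_pairs, rows_removed)
```

The rows removed equal the rows in duplicate groups minus the number of groups. In this notebook's numbers, 185 - 91 = 94, which matches the drop from 3240 to 3146 rows.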

## II. Retrieve Audio Features for Each Track
Using the [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) we can retrieve the audio features data for the tracks we have collected.

There is a limit of 100 track IDs per query for this endpoint. We can use a nested for loop to pull the track IDs in batches of 100.

In [70]:
# Measuring time
start = time.time()

rows = []
batchsize = 100
None_counter = 0

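for i in range(0, len(df_tracks['track_id']), batchsize): 
    # positional slice: the next (up to) 100 track IDs
    batch = df_tracks['track_id'][i:i+batchsize]
    feature_results = sp.audio_features(batch)
    
    # use a separate name for the inner loop so the outer index `i` is not shadowed
    for features in feature_results:
        if features is None:
            None_counter = None_counter + 1
        else:
            rows.append(features)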
            
print('Number of tracks where no audio features were available:', None_counter)

stop = time.time()
print ('Code runtime (sec):', stop - start)

Number of tracks where no audio features were available: 0
Code runtime (sec): 3.805426836013794
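
The batching logic above can also be factored into small helpers, which makes it easy to test without calling the API. This is only a sketch: `chunked`, `collect_audio_features`, and the `fake_fetch_features` stub are hypothetical names, and the stub assumes (like the cell above) that the client returns `None` for tracks without audio features.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids` containing at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for offset in range(0, len(ids), size):
        yield ids[offset:offset + size]


def collect_audio_features(track_ids, fetch_features, batch_size=100):
    """Fetch features batch by batch and return (rows, number_of_missing).

    `fetch_features` is assumed to take a list of IDs and return one entry
    per ID, with None where no features are available.
    """
    rows, missing = [], 0
    for batch in chunked(list(track_ids), batch_size):
        for features in fetch_features(batch):
            if features is None:
                missing += 1
            else:
                rows.append(features)
    return rows, missing


# Hypothetical stub: every 7th ID has no features.
def fake_fetch_features(batch):
    return [None if int(tid) % 7 == 0 else {"id": tid, "tempo": 120.0}
            for tid in batch]


ids = [str(n) for n in range(1, 251)]  # 250 fake IDs -> batches of 100, 100, 50
feature_rows, missing_count = collect_audio_features(ids, fake_fetch_features)
print(len(feature_rows), missing_count)
```

With the real client, `collect_audio_features(df_tracks['track_id'], sp.audio_features)` would play the role of the loop above.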


In [71]:
print('Number of elements in audio_features list:', len(rows))

Number of elements in audio_features list: 3146


In [72]:
df_audio_features = pd.DataFrame.from_dict(rows, orient='columns')
print("Shape of dataset:", df_audio_features.shape)
df_audio_features.head()

Shape of dataset: (3146, 18)


| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.051 | https://api.spotify.com/v1/audio-analysis/1m8B... | 0.635 | 583263 | 0.829 | 1m8BthDjfQS47iuULONgLi | 0.742 | 1 | 0.113 | -8.771 | 0 | 0.0415 | 116.989 | 4 | https://api.spotify.com/v1/tracks/1m8BthDjfQS4... | audio_features | spotify:track:1m8BthDjfQS47iuULONgLi | 0.636 |
| 1 | 0.135 | https://api.spotify.com/v1/audio-analysis/40X5... | 0.616 | 312301 | 0.863 | 40X549muIHXIvWoj2S6MKb | 0.556 | 1 | 0.158 | -10.52 | 0 | 0.0377 | 116.978 | 4 | https://api.spotify.com/v1/tracks/40X549muIHXI... | audio_features | spotify:track:40X549muIHXIvWoj2S6MKb | 0.527 |
| 2 | 0.28 | https://api.spotify.com/v1/audio-analysis/3MEj... | 0.7 | 443661 | 0.673 | 3MEjDXwQWnDzqwRanBgFzm | 0.24 | 7 | 0.102 | -10.217 | 1 | 0.0374 | 122.008 | 3 | https://api.spotify.com/v1/tracks/3MEjDXwQWnDz... | audio_features | spotify:track:3MEjDXwQWnDzqwRanBgFzm | 0.143 |
| 3 | 0.42 | https://api.spotify.com/v1/audio-analysis/2Jm5... | 0.361 | 314043 | 0.471 | 2Jm5TBQziCdySQg2J7w0PN | 0.884 | 0 | 0.084 | -12.726 | 1 | 0.0441 | 122.845 | 4 | https://api.spotify.com/v1/tracks/2Jm5TBQziCdy... | audio_features | spotify:track:2Jm5TBQziCdySQg2J7w0PN | 0.0392 |
| 4 | 0.0146 | https://api.spotify.com/v1/audio-analysis/6NH7... | 0.577 | 347717 | 0.853 | 6NH78lyZkS05PotKqg0ZKw | 0.438 | 6 | 0.249 | -11.989 | 0 | 0.0384 | 119.988 | 4 | https://api.spotify.com/v1/tracks/6NH78lyZkS05... | audio_features | spotify:track:6NH78lyZkS05PotKqg0ZKw | 0.14 |


Renaming `id` to `track_id` to match the `df_tracks` dataframe:

In [73]:
df_audio_features.rename(columns = {'id': 'track_id'}, inplace=True)
df_audio_features.shape # checking our progress

(3146, 18)

To combine the two dataframes, we do an inner merge on `track_id`, which keeps only the track IDs that appear in both datasets.

In [74]:
df = pd.merge(df_tracks, df_audio_features, on='track_id', how='inner')
print("Shape of dataset:", df.shape)
df.head()

Shape of dataset: (3146, 21)


| | artist_name | track_name | track_id | popularity | acousticness | analysis_url | danceability | duration_ms | energy | instrumentalness | ... | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cubicolor | Fictionalise - Lindstrom & Prins Thomas Extend... | 1m8BthDjfQS47iuULONgLi | 12 | 0.051 | https://api.spotify.com/v1/audio-analysis/1m8B... | 0.635 | 583263 | 0.829 | 0.742 | ... | 0.113 | -8.771 | 0 | 0.0415 | 116.989 | 4 | https://api.spotify.com/v1/tracks/1m8BthDjfQS4... | audio_features | spotify:track:1m8BthDjfQS47iuULONgLi | 0.636 |
| 1 | Cubicolor | Fictionalise - Lindstrom & Prins Thomas Remix | 40X549muIHXIvWoj2S6MKb | 16 | 0.135 | https://api.spotify.com/v1/audio-analysis/40X5... | 0.616 | 312301 | 0.863 | 0.556 | ... | 0.158 | -10.52 | 0 | 0.0377 | 116.978 | 4 | https://api.spotify.com/v1/tracks/40X549muIHXI... | audio_features | spotify:track:40X549muIHXIvWoj2S6MKb | 0.527 |
| 2 | Cubicolor | Dead End Thrills - Patrice Bäumel Remix | 3MEjDXwQWnDzqwRanBgFzm | 37 | 0.28 | https://api.spotify.com/v1/audio-analysis/3MEj... | 0.7 | 443661 | 0.673 | 0.24 | ... | 0.102 | -10.217 | 1 | 0.0374 | 122.008 | 3 | https://api.spotify.com/v1/tracks/3MEjDXwQWnDz... | audio_features | spotify:track:3MEjDXwQWnDzqwRanBgFzm | 0.143 |
| 3 | Cubicolor | Counterpart | 2Jm5TBQziCdySQg2J7w0PN | 33 | 0.42 | https://api.spotify.com/v1/audio-analysis/2Jm5... | 0.361 | 314043 | 0.471 | 0.884 | ... | 0.084 | -12.726 | 1 | 0.0441 | 122.845 | 4 | https://api.spotify.com/v1/tracks/2Jm5TBQziCdy... | audio_features | spotify:track:2Jm5TBQziCdySQg2J7w0PN | 0.0392 |
| 4 | Cubicolor | No Dancers | 6NH78lyZkS05PotKqg0ZKw | 45 | 0.0146 | https://api.spotify.com/v1/audio-analysis/6NH7... | 0.577 | 347717 | 0.853 | 0.438 | ... | 0.249 | -11.989 | 0 | 0.0384 | 119.988 | 4 | https://api.spotify.com/v1/tracks/6NH78lyZkS05... | audio_features | spotify:track:6NH78lyZkS05PotKqg0ZKw | 0.14 |
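
An inner merge silently drops any track that is missing from either side. Here the row count stayed at 3146, so nothing was lost. If it had changed, one way to see which IDs fell out is an outer merge with `indicator=True`. The two small dataframes below are made-up stand-ins for `df_tracks` and `df_audio_features`:

```python
import pandas as pd

# Made-up stand-ins for df_tracks and df_audio_features.
tracks = pd.DataFrame({"track_id": ["a", "b", "c"],
                       "track_name": ["One", "Two", "Three"]})
features = pd.DataFrame({"track_id": ["a", "b"],
                         "tempo": [120.0, 98.5]})

# The inner merge drops "c" because it has no audio features.
inner = pd.merge(tracks, features, on="track_id", how="inner")

# The outer merge with indicator=True records which side each row came from.
audit = pd.merge(tracks, features, on="track_id", how="outer", indicator=True)
missing_features = audit.loc[audit["_merge"] == "left_only", "track_id"].tolist()

print(inner.shape)
print(missing_features)
```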


In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3146 entries, 0 to 3145
Data columns (total 21 columns):
artist_name         3146 non-null object
track_name          3146 non-null object
track_id            3146 non-null object
popularity          3146 non-null int64
acousticness        3146 non-null float64
analysis_url        3146 non-null object
danceability        3146 non-null float64
duration_ms         3146 non-null int64
energy              3146 non-null float64
instrumentalness    3146 non-null float64
key                 3146 non-null int64
liveness            3146 non-null float64
loudness            3146 non-null float64
mode                3146 non-null int64
speechiness         3146 non-null float64
tempo               3146 non-null float64
time_signature      3146 non-null int64
track_href          3146 non-null object
type                3146 non-null object
uri                 3146 non-null object
valence             3146 non-null float64
dtypes: float64(9), int64(5), object(7)

In [76]:
df.to_csv('data/saved-songs.csv')
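
One thing to be aware of when reading this file back: by default `to_csv` also writes the dataframe index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. A small sketch with made-up data, using an in-memory buffer instead of a file, shows the difference that `index=False` makes:

```python
import io

import pandas as pd

toy = pd.DataFrame({"track_id": ["a", "b"], "tempo": [120.0, 98.5]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Note also that `to_csv` does not create missing directories, so the `data/` folder has to exist before the cell above runs.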

# What's Next?

Now that we have collected our data (section 2 above), the next step is to spend some time exploring it.

1. [Data Retrieval](https://nbviewer.jupyter.org/github/rtedwards/spotify-data-visualizations/blob/master/notebooks/spotify-data-visualizations/spotify-data-retrieval.ipynb) walks through collecting liked tracks with Spotipy (a Python wrapper for the Spotify API), attaching audio features to each track, and storing the result in a dataframe.
2. **[Data Exploration](https://nbviewer.jupyter.org/github/rtedwards/spotify-data-visualizations/blob/master/notebooks/spotify-data-exploration.ipynb)** is an exploratory data analysis of my liked tracks.
3. [Data Clustering](https://nbviewer.jupyter.org/github/rtedwards/spotify-data-visualizations/blob/master/notebooks/spotify-data-visualizations/spotify-data-clustering.ipynb) attempts to find genres in my liked tracks using K-Means clustering.