# (Part 1) Data Collection Using Spotify Web API

## Spotify Web API
Spotify has a number of [API endpoints](https://developer.spotify.com/documentation/web-api/reference-beta/) available to access Spotify data. In this notebook, I use the following endpoints:

+ [search endpoint](https://developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs
+ [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.

## Purpose of Notebook
The purpose of this notebook is to show how to collect and store audio features data for tracks from the [official Spotify Web API](https://developer.spotify.com/documentation/web-api/) for further exploratory data analysis and machine learning.

## Related Notebooks

+ **[Part 1](https://github.com/rtedwards/spotify-data-visualizations/blob/master/spotify-data-visualizations/spotify-data-retrieval.ipynb)** walks through collecting liked tracks with Spotipy, a Python wrapper for the Spotify API, attaching audio features to each track, and storing the result in a dataframe.
+ [Part 2](https://github.com/rtedwards/spotify-data-visualizations/blob/master/spotify-data-visualizations/spotify-data-exploration.ipynb) is an exploratory data analysis of my liked tracks.
+ [Part 3](https://github.com/rtedwards/spotify-data-visualizations/blob/master/spotify-data-visualizations/spotify-data-clustering.ipynb) attempts to find genres in my liked tracks using K-Means clustering.


# 1. Setup
The following code uses [`spotipy`](https://spotipy.readthedocs.io/en/latest/), a Python library for accessing the Spotify Web API.

In [21]:
'''
user-read-private \
user-read-email \
user-read-recently-played \
user-read-playback-state \
user-read-currently-playing \
user-library-read \
playlist-modify-public \
playlist-read-private \
user-follow-read \
user-top-read streaming
'''

import os  # for accessing environment variables
import time  # for execution times
import spotipy  # python library for Spotify API
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
import seaborn as sns
import pandas as pd

# 2. Data Collection

Data collection is done in 2 parts: first the track IDs and then the audio features for each track ID. 

In [None]:
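# Retrieve client_id and client_secret from environment variables
client_id = os.getenv('SPOTIFY_CLIENT_ID')
client_secret = os.getenv('SPOTIFY_CLIENT_SECRET')
# Must match a redirect URI registered for the app in the Spotify developer dashboard
redirect_uri = "https://localhost/8888"
username = ""
# Scopes are passed as a single space-separated string
scope = 'user-library-read playlist-read-private'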

# Setting Spotify Client Credentials
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Retrieving API token
token = util.prompt_for_user_token(username, scope, client_id, client_secret, redirect_uri)
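
If either environment variable is unset, `os.getenv` returns `None` and the problem only surfaces later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of Spotipy):

```python
import os


def require_env(name):
    """Return the value of environment variable `name`.

    Hypothetical helper for this notebook (not a Spotipy function).
    Raises RuntimeError if the variable is unset or empty.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could replace the plain lookups, for example `client_id = require_env('SPOTIFY_CLIENT_ID')`.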


## I. Retrieving Saved Tracks

In [1]:
if token:
    # record the start time to measure how long this code takes to run
    start = time.time()
    
    # create empty lists where the results are going to be stored
    artist_name = []
    track_name = []
    popularity = []
    track_id = []
    num_tracks = 3500
    sp = spotipy.Spotify(auth=token)
    
    for i in range(0,num_tracks,50):
        track_results = sp.current_user_saved_tracks(limit=50,offset=i)
        for item in track_results['items']:
            track = item['track']
            artist_name.append(track['artists'][0]['name'])
            track_name.append(track['name'])
            track_id.append(track['id'])
            popularity.append(track['popularity'])
        
    stop = time.time()
    print('Time to run this code (in seconds):', stop - start)
    
else:
    print("Can't get token for", username)
    

Time to run this code (in seconds): 16.363475799560547
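
The loop above only collects every saved track if `num_tracks = 3500` is at least as large as the library. A sketch of an alternative that keeps requesting pages until one comes back empty, so no upper bound is needed. The names `iter_saved_tracks` and `fetch_page` are hypothetical; `fetch_page` stands in for any callable that accepts `limit` and `offset` and returns a dict with an `'items'` list, as `sp.current_user_saved_tracks` does.

```python
def iter_saved_tracks(fetch_page, page_size=50):
    """Yield saved-track items page by page until an empty page is returned.

    `fetch_page` is assumed to accept `limit` and `offset` keyword arguments
    and to return a dict containing an 'items' list.
    """
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        items = page['items']
        if not items:
            break  # no more saved tracks
        yield from items
        offset += len(items)
```

With an authenticated client this could be used as `for item in iter_saved_tracks(sp.current_user_saved_tracks): track = item['track']`.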


Checking `track_id` list

In [2]:
print('number of elements in track_id list:', len(track_id))

number of elements in track_id list: 3203


Loading data into a dataframe

In [7]:
df_tracks = pd.DataFrame({'artist_name':artist_name, 
                          'track_name':track_name, 
                          'track_id':track_id, 
                          'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()

(3203, 4)


| | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Lauer | Mirrors (feat. Jasnau) | 1tTwnpx4TvhnTAzBjFditv | 30 |
| 1 | Anderholm | Wonderland | 4wl0zWqB3FMgQvAJZx2Z9V | 20 |
| 2 | Ben Böhmer | Fliederregen | 7kltLoEOuwwSNsCZChIaj9 | 46 |
| 3 | Jai Piccone | Care | 7eT788aFqbLJNZbYf3QaJl | 40 |
| 4 | Daniel T. | Heat-Wave | 5RwQuZtvlmLq6rutrcugxF | 34 |


Let's view some information about the dataframe:

In [8]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3203 entries, 0 to 3202
Data columns (total 4 columns):
artist_name    3203 non-null object
track_name     3203 non-null object
track_id       3203 non-null object
popularity     3203 non-null int64
dtypes: int64(1), object(3)
memory usage: 100.2+ KB


## Checking Our Data
The same track can appear more than once under different IDs. This typically happens when a track is released both as a single and on a full album.

In [9]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

artist_name    183
track_name     183
track_id       183
popularity     183
dtype: int64

In [10]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  90
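
The two counts measure different things: 183 is the number of rows that belong to a duplicated (artist, track) pair, while 90 is the number of distinct pairs that occur more than once. Since one row survives per pair, dropping duplicates should remove 183 - 90 = 93 rows. A small sketch on made-up data (the toy dataframe is an assumption for illustration, not the real library):

```python
import pandas as pd

# Toy data: track "A" appears three times, "B" twice, "C" once
toy = pd.DataFrame({
    'artist_name': ['x', 'x', 'x', 'y', 'y', 'z'],
    'track_name':  ['A', 'A', 'A', 'B', 'B', 'C'],
})
keys = ['artist_name', 'track_name']

# Rows that are part of any duplicated pair (keep=False marks every copy)
rows_in_duplicate_groups = int(toy.duplicated(subset=keys, keep=False).sum())

# Distinct pairs that occur more than once
group_sizes = toy.groupby(keys).size()
duplicated_groups = int((group_sizes > 1).sum())

# Rows actually removed when only the first copy of each pair is kept
rows_removed = len(toy) - len(toy.drop_duplicates(subset=keys))

print(rows_in_duplicate_groups, duplicated_groups, rows_removed)  # 5 2 3
```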


Dropping duplicate tracks:

In [11]:
df_tracks.drop_duplicates(subset=['artist_name', 'track_name'], inplace=True)

Check again for duplicates:

In [12]:
# group the entries by artist_name and track_name and check for duplicates

duplicates = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print("Number of duplicate tracks: ", duplicates[duplicates > 1].count() )


Number of duplicate tracks:  0


Now to check the shape of our data:

In [13]:
df_tracks.shape

(3110, 4)

## II. Retrieving Audio Features for Each Track
Using the [audio features endpoint](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) we can retrieve the audio features data for the tracks we have collected.

This endpoint accepts at most 100 track IDs per request, so we loop over the track IDs in batches of 100. An inner loop then collects the audio features returned for each track in the batch.

In [14]:
# Measuring time
start = time.time()

rows = []
batchsize = 100
None_counter = 0

for i in range(0, len(df_tracks['track_id']), batchsize):
    # positional slice (the index has gaps after dropping duplicates), passed as a plain list
    batch = df_tracks['track_id'].iloc[i:i+batchsize].tolist()
    feature_results = sp.audio_features(batch)

    # the API returns None for tracks without audio features
    for features in feature_results:
        if features is None:
            None_counter += 1
        else:
            rows.append(features)

print('Number of tracks where no audio features were available:', None_counter)

stop = time.time()
print('Code runtime (sec):', stop - start)

Number of tracks where no audio features were available: 0
Code runtime (sec): 7.467611312866211
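
The batching step on its own, as a sketch. `chunked` is a hypothetical helper; its default of 100 reflects the per-request limit described above.

```python
def chunked(ids, size=100):
    """Split a sequence of track IDs into consecutive lists of at most `size` items."""
    ids = list(ids)
    return [ids[start:start + size] for start in range(0, len(ids), size)]


# 250 made-up IDs split into batches of 100, 100 and 50
batches = chunked([f"id{n}" for n in range(250)])
print([len(b) for b in batches])  # [100, 100, 50]
```

In the notebook this would read `for batch in chunked(df_tracks['track_id']): feature_results = sp.audio_features(batch)`.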


In [15]:
print('Number of elements in audio_features list:', len(rows))

Number of elements in audio_features list: 3110


In [16]:
df_audio_features = pd.DataFrame(rows)  # one row per track, one column per feature
print("Shape of dataset:", df_audio_features.shape)
df_audio_features.head()

Shape of dataset: (3110, 18)


| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.015 | https://api.spotify.com/v1/audio-analysis/1tTw... | 0.668 | 258067 | 0.878 | 1tTwnpx4TvhnTAzBjFditv | 0.554 | 5 | 0.146 | -7.538 | 1 | 0.0472 | 122.006 | 4 | https://api.spotify.com/v1/tracks/1tTwnpx4Tvhn... | audio_features | spotify:track:1tTwnpx4TvhnTAzBjFditv | 0.655 |
| 1 | 0.24 | https://api.spotify.com/v1/audio-analysis/4wl0... | 0.688 | 230620 | 0.908 | 4wl0zWqB3FMgQvAJZx2Z9V | 0.962 | 1 | 0.107 | -6.662 | 1 | 0.0345 | 120.0 | 4 | https://api.spotify.com/v1/tracks/4wl0zWqB3FMg... | audio_features | spotify:track:4wl0zWqB3FMgQvAJZx2Z9V | 0.497 |
| 2 | 0.0121 | https://api.spotify.com/v1/audio-analysis/7klt... | 0.672 | 265868 | 0.336 | 7kltLoEOuwwSNsCZChIaj9 | 0.851 | 5 | 0.14 | -9.384 | 0 | 0.0348 | 120.994 | 4 | https://api.spotify.com/v1/tracks/7kltLoEOuwwS... | audio_features | spotify:track:7kltLoEOuwwSNsCZChIaj9 | 0.311 |
| 3 | 0.514 | https://api.spotify.com/v1/audio-analysis/7eT7... | 0.756 | 327096 | 0.497 | 7eT788aFqbLJNZbYf3QaJl | 0.696 | 8 | 0.108 | -7.571 | 1 | 0.0419 | 120.994 | 4 | https://api.spotify.com/v1/tracks/7eT788aFqbLJ... | audio_features | spotify:track:7eT788aFqbLJNZbYf3QaJl | 0.646 |
| 4 | 0.011 | https://api.spotify.com/v1/audio-analysis/5RwQ... | 0.705 | 356224 | 0.656 | 5RwQuZtvlmLq6rutrcugxF | 0.54 | 1 | 0.178 | -8.549 | 0 | 0.0513 | 126.0 | 4 | https://api.spotify.com/v1/tracks/5RwQuZtvlmLq... | audio_features | spotify:track:5RwQuZtvlmLq6rutrcugxF | 0.397 |


Renaming `id` to `track_id` to match the `df_tracks` dataframe:

In [17]:
df_audio_features.rename(columns = {'id': 'track_id'}, inplace=True)
df_audio_features.shape # checking our progress

(3110, 18)

To combine the two dataframes we do an inner merge to only keep track IDs that are in both datasets.

In [22]:
df = pd.merge(df_tracks, df_audio_features, on='track_id', how='inner')
print("Shape of dataset:", df.shape)
df.head()

Shape of dataset: (3110, 21)


| | artist_name | track_name | track_id | popularity | acousticness | analysis_url | danceability | duration_ms | energy | instrumentalness | ... | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lauer | Mirrors (feat. Jasnau) | 1tTwnpx4TvhnTAzBjFditv | 30 | 0.015 | https://api.spotify.com/v1/audio-analysis/1tTw... | 0.668 | 258067 | 0.878 | 0.554 | ... | 0.146 | -7.538 | 1 | 0.0472 | 122.006 | 4 | https://api.spotify.com/v1/tracks/1tTwnpx4Tvhn... | audio_features | spotify:track:1tTwnpx4TvhnTAzBjFditv | 0.655 |
| 1 | Anderholm | Wonderland | 4wl0zWqB3FMgQvAJZx2Z9V | 20 | 0.24 | https://api.spotify.com/v1/audio-analysis/4wl0... | 0.688 | 230620 | 0.908 | 0.962 | ... | 0.107 | -6.662 | 1 | 0.0345 | 120.0 | 4 | https://api.spotify.com/v1/tracks/4wl0zWqB3FMg... | audio_features | spotify:track:4wl0zWqB3FMgQvAJZx2Z9V | 0.497 |
| 2 | Ben Böhmer | Fliederregen | 7kltLoEOuwwSNsCZChIaj9 | 46 | 0.0121 | https://api.spotify.com/v1/audio-analysis/7klt... | 0.672 | 265868 | 0.336 | 0.851 | ... | 0.14 | -9.384 | 0 | 0.0348 | 120.994 | 4 | https://api.spotify.com/v1/tracks/7kltLoEOuwwS... | audio_features | spotify:track:7kltLoEOuwwSNsCZChIaj9 | 0.311 |
| 3 | Jai Piccone | Care | 7eT788aFqbLJNZbYf3QaJl | 40 | 0.514 | https://api.spotify.com/v1/audio-analysis/7eT7... | 0.756 | 327096 | 0.497 | 0.696 | ... | 0.108 | -7.571 | 1 | 0.0419 | 120.994 | 4 | https://api.spotify.com/v1/tracks/7eT788aFqbLJ... | audio_features | spotify:track:7eT788aFqbLJNZbYf3QaJl | 0.646 |
| 4 | Daniel T. | Heat-Wave | 5RwQuZtvlmLq6rutrcugxF | 34 | 0.011 | https://api.spotify.com/v1/audio-analysis/5RwQ... | 0.705 | 356224 | 0.656 | 0.54 | ... | 0.178 | -8.549 | 0 | 0.0513 | 126.0 | 4 | https://api.spotify.com/v1/tracks/5RwQuZtvlmLq... | audio_features | spotify:track:5RwQuZtvlmLq6rutrcugxF | 0.397 |
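
An inner merge drops unmatched rows without warning. Here both frames have 3110 rows and so does the result, so nothing was lost. A sketch of how the same check could be made explicit with the `validate` and `indicator` options of `pd.merge`, shown on made-up frames (the toy data is an assumption for illustration):

```python
import pandas as pd

# Toy frames: track 'c' has no audio features
tracks = pd.DataFrame({'track_id': ['a', 'b', 'c'], 'popularity': [30, 20, 46]})
features = pd.DataFrame({'track_id': ['a', 'b'], 'tempo': [122.0, 120.0]})

# validate raises pandas.errors.MergeError if track_id is not unique on both sides;
# indicator adds a '_merge' column saying which frame(s) each row came from
checked = pd.merge(tracks, features, on='track_id', how='outer',
                   validate='one_to_one', indicator=True)

# Rows found in only one frame are the ones an inner merge would drop
unmatched = checked[checked['_merge'] != 'both']
print(unmatched['track_id'].tolist())  # ['c']

# Equivalent of the inner merge, once the unmatched rows have been inspected
merged = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```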


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3110 entries, 0 to 3109
Data columns (total 21 columns):
artist_name         3110 non-null object
track_name          3110 non-null object
track_id            3110 non-null object
popularity          3110 non-null int64
acousticness        3110 non-null float64
analysis_url        3110 non-null object
danceability        3110 non-null float64
duration_ms         3110 non-null int64
energy              3110 non-null float64
instrumentalness    3110 non-null float64
key                 3110 non-null int64
liveness            3110 non-null float64
loudness            3110 non-null float64
mode                3110 non-null int64
speechiness         3110 non-null float64
tempo               3110 non-null float64
time_signature      3110 non-null int64
track_href          3110 non-null object
type                3110 non-null object
uri                 3110 non-null object
valence             3110 non-null float64
dtypes: float64(9), int64(5), object(7)

In [20]:
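# Drop URL and metadata columns that are not needed for the analysis
columns_to_drop = ['analysis_url', 'track_href', 'type', 'uri']
df.drop(columns_to_drop, axis=1, inplace=True)

In [24]:
# Save the combined dataset; the data/ directory must already exist
df.to_csv('data/saved-songs.csv')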