# Extract and Clean Current Running Playlist

Here, we will extract and clean data on the playlist I currently use for running using the spotipy library for Python. I already have a playlist of 81 songs that I use for running. I specifically choose songs that are around 90 or 180 beats per minute because I like to run at around 180 steps per minute. The goal of gathering this data is to identify the qualities of the songs I already enjoy for running so that I can pick new songs to add to my rotation.

In [1]:
import pandas as pd
import config
from spotifyfuncs import authenticate_general, get_track_uris, get_track_data
from cleaningfuncs import percent_missing, remove_strings

pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

## I. Extract Playlist

We will use the spotipy library to extract data on my current running playlist from the Spotify API. Before we can access this data, we need keys for authentication. On the [Spotify developer website](https://developer.spotify.com/dashboard/applications), I registered an app and received a client ID and client secret to use to authenticate with the Spotify API. These are stored in a separate config file not included with this project.

Spotify allows authenticating either to access a specific user's information or generally without a specific user in mind [1]. Since my running playlist is public, we can authenticate here without a specific user in mind. We will also create a spotipy object. We will use a custom authenticate_general function to do both of these.

In [2]:
sp = authenticate_general(config.CLIENT_ID, config.CLIENT_SECRET)

In order to extract data from a specific playlist, we need the URI for that playlist. We can extract the URI of my running playlist from the URL for that playlist. We can then extract the URI for each track on this playlist using the playlist_tracks method on the spotipy object. We will do all of these with a custom get_track_uris function.

In [3]:
track_uris = get_track_uris(config.PLAYLIST_LINK, sp)

Now that we have the URI for each track, we can use this to extract data about each track. Spotify has data on various audio features of each song, such as danceability, energy, and valence (i.e., how "positive" or "negative" a song is) [2]. This will all be useful for analyzing the types of songs I like. We would also like data on track popularity, track name, artist name, artist ID, and artist genres. For the sake of simplicity, we will just use the main artist for each track.

Spotipy will extract audio features as nested JSON, so we can use the pandas json_normalize function to flatten this into a pandas DataFrame.

All of these steps are combined in a custom get_track_data function, which we will run on the playlist.

In [4]:
tracks = get_track_data(config.PLAYLIST_LINK, sp)
tracks

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,uri,track_href,analysis_url,duration_ms,time_signature,track_pop,name,artist,artist_id,artist_genres
0,0.65,0.95,6,-4.72,1,0.10,0.13,0.00,0.15,0.51,...,spotify:track:0dCr3qIupWh36ilLjRHi4P,https://api.spotify.com/v1/tracks/0dCr3qIupWh3...,https://api.spotify.com/v1/audio-analysis/0dCr...,228333,4,0,Virtual Insanity - Remastered,Jamiroquai,spotify:artist:6J7biCazzYhU3gM9j1wfid,[dance pop]
1,0.73,0.69,11,-7.16,0,0.04,0.04,0.00,0.10,0.81,...,spotify:track:4aKIs5t9TqP59btlCGPrgw,https://api.spotify.com/v1/tracks/4aKIs5t9TqP5...,https://api.spotify.com/v1/audio-analysis/4aKI...,271893,4,0,Maneater,Daryl Hall & John Oates,spotify:artist:77tT1kLj6mCWtFNqiOmP9H,"[album rock, classic rock, mellow gold, rock, ..."
2,0.52,0.55,6,-13.74,1,0.04,0.50,0.01,0.18,0.91,...,spotify:track:4gvea7UlDkAvsJBPZAd4oB,https://api.spotify.com/v1/tracks/4gvea7UlDkAv...,https://api.spotify.com/v1/audio-analysis/4gve...,288733,4,76,The Boys Of Summer,Don Henley,spotify:artist:5dbuFbrHa1SJlQhQX9OUJ2,"[album rock, art rock, classic rock, country r..."
3,0.58,0.83,1,-4.60,1,0.13,0.01,0.00,0.04,0.57,...,spotify:track:49FYlytm3dAAraYgpoJZux,https://api.spotify.com/v1/tracks/49FYlytm3dAA...,https://api.spotify.com/v1/audio-analysis/49FY...,275987,4,81,Umbrella,Rihanna,spotify:artist:5pKCCKE2ajJHZ9KAiaK11H,"[barbadian pop, dance pop, pop, urban contempo..."
4,0.54,0.81,0,-5.27,1,0.11,0.06,0.00,0.08,0.53,...,spotify:track:0oxxzbHXRNkBHOxayAh49N,https://api.spotify.com/v1/tracks/0oxxzbHXRNkB...,https://api.spotify.com/v1/audio-analysis/0oxx...,228333,4,0,Why Can't This Be Love,Van Halen,spotify:artist:2cnMpRsOVqtPMfq7YiFE6K,"[album rock, classic rock, hard rock, metal, r..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,0.65,0.79,11,-7.02,0,0.27,0.01,0.00,0.27,0.58,...,spotify:track:6Ah1jYLTaxl8EeRHP0L0tY,https://api.spotify.com/v1/tracks/6Ah1jYLTaxl8...,https://api.spotify.com/v1/audio-analysis/6Ah1...,243600,4,40,The Predator,Ice Cube,spotify:artist:3Mcii5XWf6E0lrY3Uky4cA,"[conscious hip hop, g funk, gangster rap, hip ..."
77,0.81,0.61,9,-5.91,1,0.19,0.03,0.00,0.04,0.48,...,spotify:track:76gJ7ATRkD3WZuYKPl84xm,https://api.spotify.com/v1/tracks/76gJ7ATRkD3W...,https://api.spotify.com/v1/audio-analysis/76gJ...,255627,4,47,Y'All Gone Miss Me,Snoop Dogg,spotify:artist:7hJcb9fa4alzcOq3EaNPoG,"[g funk, gangster rap, hip hop, rap, west coas..."
78,0.84,0.61,10,-10.68,1,0.22,0.01,0.00,0.56,0.72,...,spotify:track:5thts3213xwSroRd11fv5A,https://api.spotify.com/v1/tracks/5thts3213xwS...,https://api.spotify.com/v1/audio-analysis/5tht...,296333,4,56,People Everyday - Metamorphosis Mix,Arrested Development,spotify:artist:5Va9LuEmaZxnbk1gMnjMD7,"[atl hip hop, conscious hip hop, hip hop, old ..."
79,0.51,0.82,1,-8.82,0,0.37,0.25,0.00,0.34,0.56,...,spotify:track:5mOzvm41Pdh1WOtBKlCtZd,https://api.spotify.com/v1/tracks/5mOzvm41Pdh1...,https://api.spotify.com/v1/audio-analysis/5mOz...,272867,4,48,B.I.B.L.E. (Basic Instructions Before Leaving ...,GZA,spotify:artist:6ns6XAOsw4B0nDUIovAOUO,"[alternative hip hop, east coast hip hop, gang..."


Now that we have a DataFrame of track data from the playlist, we need to clean it.

## II. Clean Playlist Data

First, let's look at the data types for each column in the playlist data.

In [5]:
pd.set_option('display.max_rows', 25)
tracks.dtypes

danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
type                 object
id                   object
uri                  object
track_href           object
analysis_url         object
duration_ms           int64
time_signature        int64
track_pop             int64
name                 object
artist               object
artist_id            object
artist_genres        object
dtype: object

We have everything we need in the right data type for analysis. 

Next, we would like to create a list of the genres of all the artists featured on this playlist. These do not need to be distinct genres; in fact, we would like to use this list later to analyze the frequency of each genre in the playlist. We can do this by creating a DataFrame from the artist_genres column of the playlist data and then using the pandas explode method to break each individual genre onto its own row.

In [6]:
pd.set_option('display.max_rows', 10)

genres = pd.DataFrame(tracks['artist_genres'])
genres = genres['artist_genres'].explode(ignore_index=True)
genres

0             dance pop
1            album rock
2          classic rock
3           mellow gold
4                  rock
             ...       
433        gangster rap
434    hardcore hip hop
435             hip hop
436      queens hip hop
437                 rap
Name: artist_genres, Length: 438, dtype: object

We can export this genres list for now. We will use it later when we explore the playlist data.

In [7]:
genres.to_csv('./running_genres_clean.csv', index=False)

Next, we would like to drop the type, uri, track_href, analysis_url, and artist_genres columns from the playlist data. We will not need these anymore. We also need to trim the "spotify:artist:" string from the beginning of each artist_id. We will use a custom remove_strings function to help with that.

In [8]:
tracks.drop(['type', 'uri', 'track_href', 'analysis_url', 'artist_genres'], axis=1, inplace=True)
tracks['artist_id'] = [remove_strings(artist_id, 'spotify:artist:')
                       for artist_id in tracks['artist_id']]

And we should rename the id column to "track_id" to distinguish it from artist_id.

In [9]:
tracks.rename(columns={'id':'track_id'})

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,track_id,duration_ms,time_signature,track_pop,name,artist,artist_id
0,0.65,0.95,6,-4.72,1,0.10,0.13,0.00,0.15,0.51,91.91,0dCr3qIupWh36ilLjRHi4P,228333,4,0,Virtual Insanity - Remastered,Jamiroquai,6J7bCzzYhU3gM9j1wd
1,0.73,0.69,11,-7.16,0,0.04,0.04,0.00,0.10,0.81,88.75,4aKIs5t9TqP59btlCGPrgw,271893,4,0,Maneater,Daryl Hall & John Oates,77T1kLj6mCWFNqOmP9H
2,0.52,0.55,6,-13.74,1,0.04,0.50,0.01,0.18,0.91,176.94,4gvea7UlDkAvsJBPZAd4oB,288733,4,76,The Boys Of Summer,Don Henley,5dbuFbH1SJlQhQX9OUJ2
3,0.58,0.83,1,-4.60,1,0.13,0.01,0.00,0.04,0.57,174.03,49FYlytm3dAAraYgpoJZux,275987,4,81,Umbrella,Rihanna,5KCCKE2jJHZ9KAK11H
4,0.54,0.81,0,-5.27,1,0.11,0.06,0.00,0.08,0.53,88.41,0oxxzbHXRNkBHOxayAh49N,228333,4,0,Why Can't This Be Love,Van Halen,2cnMROVqPMq7YFE6K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,0.65,0.79,11,-7.02,0,0.27,0.01,0.00,0.27,0.58,175.29,6Ah1jYLTaxl8EeRHP0L0tY,243600,4,40,The Predator,Ice Cube,3Mc5XW6E0lY3Uk4cA
77,0.81,0.61,9,-5.91,1,0.19,0.03,0.00,0.04,0.48,91.06,76gJ7ATRkD3WZuYKPl84xm,255627,4,47,Y'All Gone Miss Me,Snoop Dogg,7hJcb94lzcOq3ENPG
78,0.84,0.61,10,-10.68,1,0.22,0.01,0.00,0.56,0.72,91.12,5thts3213xwSroRd11fv5A,296333,4,56,People Everyday - Metamorphosis Mix,Arrested Development,5V9LuEmZxnbk1gMnjMD7
79,0.51,0.82,1,-8.82,0,0.37,0.25,0.00,0.34,0.56,181.80,5mOzvm41Pdh1WOtBKlCtZd,272867,4,48,B.I.B.L.E. (Basic Instructions Before Leaving ...,GZA,6n6XAOw4B0nDUIvAOUO


Next, we should determine if any data is missing. We can use a custom function to compute the percentage of each column that is missing.

In [10]:
percent_missing(tracks)

danceability - 0.0%
energy - 0.0%
key - 0.0%
loudness - 0.0%
mode - 0.0%
speechiness - 0.0%
acousticness - 0.0%
instrumentalness - 0.0%
liveness - 0.0%
valence - 0.0%
tempo - 0.0%
id - 0.0%
duration_ms - 0.0%
time_signature - 0.0%
track_pop - 0.0%
name - 0.0%
artist - 0.0%
artist_id - 0.0%


There doesn't appear to be any data that is missing outright. Let's also look at a table of descriptive statistics for the data to determine if any data is out-of-range.

In [11]:
tracks.describe(include='all')

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature,track_pop,name,artist,artist_id
count,81.00,81.00,81.00,81.00,81.00,81.00,81.00,81.00,81.00,81.00,81.00,81,81.00,81.00,81.00,81,81,81
unique,,,,,,,,,,,,81,,,,81,47,47
top,,,,,,,,,,,,0dCr3qIupWh36ilLjRHi4P,,,,Virtual Insanity - Remastered,The Roots,78xUw6FkVZRAzFddu
freq,,,,,,,,,,,,1,,,,1,7,7
mean,0.63,0.70,5.31,-6.65,0.60,0.18,0.13,0.02,0.20,0.53,114.45,,260724.85,4.01,50.63,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
min,0.24,0.24,0.00,-13.95,0.00,0.03,0.00,0.00,0.04,0.10,81.47,,177120.00,4.00,0.00,,,
25%,0.52,0.61,1.00,-7.74,0.00,0.05,0.01,0.00,0.09,0.39,89.68,,229827.00,4.00,39.00,,,
50%,0.65,0.71,6.00,-6.19,1.00,0.14,0.05,0.00,0.14,0.54,92.08,,256347.00,4.00,55.00,,,
75%,0.75,0.82,9.00,-5.27,1.00,0.26,0.20,0.00,0.27,0.72,171.00,,283227.00,4.00,70.00,,,


The data all appears to be within a range that would make sense. Danceability, energy, speechiness, acousticness, instrumentalness, liveness, and valence all should have ranges from 0 to 1 inclusive [2]. Mode should be 0 or 1, key should be an integer from 0 to 11 inclusive, and time_signature should be an integer from 3 to 7 inclusive [2]. Loudness should be greater than -60 and tempo and duration_ms should be greater than 0 [2]. And track_pop and artist_pop should be integers from 0 to 100 inclusive [3].

## III. Conclusion

Now we can export the playlist data to csv. We will explore and analyze this data next with the genres data in the exploratory_analysis file.

In [12]:
tracks.to_csv('./running_tracks_clean.csv', index=False)

## IV. References

1. Watts C. December 17, 2021, 'Extracting Song Data From the Spotify API Using Python'. Towards Data Science. https://towardsdatascience.com/extracting-song-data-from-the-spotify-api-using-python-b1e79388d50.
2. 'Get Track's Audio Features'. Spotify for Developers: Web API Reference. https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features.
3. 'Get Track'. Spotify for Developers: Web API Reference. https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track.