<span style="font-family:Helvetica Light">
    
# Data Preparation

The goal of this notebook is to:
- import the streaming history requested from Spotify,
- connect to the Spotify API using the Spotipy library to extend the dataset with additional track and artist details, as well as audio features.

## User's Streaming History
Spotify allows its users to get a copy of one year's worth of their streaming history. To get your own, <a href="https://www.spotify.com/us/account/privacy/" target="_blank">click here</a> and follow the steps.

## Spotify API
Spotify grants developers access to its API, enabling retrieval of general information regarding music artists, albums, and tracks from its extensive catalog. Additionally, the Web API provides access to user-related data such as playlists and saved music in the user's library. However, access to user data is generally limited, with most functions imposing a result limit of 50.

In this notebook, the following endpoints were used:
* <a href="https://developer.spotify.com/documentation/web-api/reference/#/operations/search" target="_blank">search</a>
* <a href="https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features" target="_blank">audio features</a>
* <a href="https://developer.spotify.com/documentation/web-api/reference/#/operations/get-multiple-artists" target="_blank">artists</a>


The list of all available endpoint references can be found <a href="https://developer.spotify.com/documentation/web-api/reference/#/" target="_blank">here</a>.

</span>

In [37]:
import pandas as pd
import config
import os
import spotipy 
from spotipy.oauth2 import SpotifyClientCredentials 
from spotipy.oauth2 import SpotifyOAuth
from ratelimit import limits, sleep_and_retry

pd.set_option('display.max_rows', 500)

<span style="font-family:Helvetica Light">
    
## Loading Historical Music Data from Spotify
Loading the JSON file containing my one-year music streaming history, as requested and received from Spotify.
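
The `_0` suffix in the file name suggests that an export can be split into several numbered parts. As a minimal sketch, assuming every part follows the `StreamingHistory_music_<n>.json` pattern and holds a JSON array of play records, the parts could be combined with the standard library before building a DataFrame. The helper name `load_history_parts` and its arguments are hypothetical and not part of this notebook.

```python
import glob
import json
import os


def load_history_parts(directory, pattern="StreamingHistory_music_*.json"):
    """Read every matching export file and return one combined list of records."""
    records = []
    # sorted() gives a stable order; it is a plain string sort, so re-sort the
    # records by endTime afterwards if chronological order matters
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records
```

The combined list could then be passed to `pd.DataFrame(...)` in place of the single `pd.read_json` call.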

In [34]:
df = pd.read_json('../data/raw/StreamingHistory_music_0.json')
print(df.shape)
df.head()

(4940, 4)


Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2023-05-16 15:43,John Lennon,Isolation - Remastered 2010,142789
1,2023-05-17 04:43,John Lennon,Isolation - Remastered 2010,29448
2,2023-05-17 04:44,John Lennon,Love - Remastered 2010,65613
3,2023-05-17 04:46,Felipe Gordon,Inherently Deep,101308
4,2023-05-17 04:47,Golf Trip,L in Vain,92322


<span style="font-family:Helvetica Light">

## Spotify API Connector with Spotipy library

To connect to the Spotify API, a Client ID and a Client Secret are needed. To get these two keys, it is necessary to <a href="https://developer.spotify.com/documentation/general/guides/authorization/app-settings/" target="_blank">set up an App in the Spotify for Developers platform</a>.
    
Spotipy has two authentication methods: SpotifyClientCredentials and SpotifyOAuth. 
- Using *SpotifyClientCredentials*, we can fetch information from Spotify data that is not linked to a user, such as artists and albums. This method doesn’t require a Spotify login using the redirect URI since it’s general data. 
- On the other hand, *SpotifyOAuth* allows us to get information related to a specific user, such as saved tracks or recently played songs. We need to specify a <a href="https://developer.spotify.com/documentation/web-api/concepts/scopes" target="_blank">scope</a> for this method, and it will prompt a login page using the redirect URI.

Both methods work seamlessly with the os.environ step below, eliminating the need for re-specifying credentials since they can be fetched automatically from os.environ.
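
The `config` module imported at the top is not shown in this notebook. A minimal sketch of what it might contain, and of how its values map onto the three environment variables, could look like this. The dictionary values are placeholders and the helper `export_spotipy_env` is hypothetical.

```python
import os

# Hypothetical contents of config.py; replace the placeholders with the
# values from your own app in the Spotify for Developers dashboard.
spotify = {
    "client_id": "your-client-id",
    "client_secret": "your-client-secret",
    "redirect_uri": "http://localhost:8888/callback",
}


def export_spotipy_env(settings, environ=os.environ):
    """Copy the credentials into the environment variable names Spotipy reads."""
    mapping = {
        "SPOTIPY_CLIENT_ID": "client_id",
        "SPOTIPY_CLIENT_SECRET": "client_secret",
        "SPOTIPY_REDIRECT_URI": "redirect_uri",
    }
    for env_name, key in mapping.items():
        environ[env_name] = settings[key]
    return environ


# demonstration against a plain dict, so the real environment stays untouched
demo_env = export_spotipy_env(spotify, environ={})
```

Keeping such a file out of version control avoids publishing the client secret along with the notebook.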

</span>

In [38]:
os.environ['SPOTIPY_CLIENT_ID'] = config.spotify['client_id']
os.environ['SPOTIPY_CLIENT_SECRET'] = config.spotify['client_secret']
os.environ['SPOTIPY_REDIRECT_URI'] = config.spotify['redirect_uri']

In [39]:
# method to get general data 
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

In [40]:
# method to get user-specific data (this reassigns sp, replacing the client-credentials client above)
scope="user-top-read"
#scope="user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope))

<span style="font-family:Helvetica Light">

### Fetching Additional Track Details

First, we create the list of unique combinations of artistName and trackName from our historical streaming data. This list serves as the basis for fetching additional track details (i.e. artist id, track id, track popularity score) using the Spotify API's search endpoint. We use this information as a starting point to enrich our dataset and deepen our analysis of streaming data.
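
As a hedged illustration of this step, the sketch below builds a query with Spotify's `track:` and `artist:` field filters and flattens the first hit of a search response. It is an alternative to the plain free-text query used in this notebook, not a description of it. The function names and the `matched*` keys are hypothetical, and `fake_results` is a hand-written stand-in for a response, not real API output.

```python
def build_query(artist_name, track_name):
    """Combine both names using the track: and artist: search field filters."""
    return f"track:{track_name} artist:{artist_name}"


def to_record(artist_name, track_name, results):
    """Flatten the first hit of a track search response, or return None if empty."""
    items = results["tracks"]["items"]
    if not items:
        return None
    track = items[0]
    return {
        # names exactly as they appear in the streaming history
        "artistName": artist_name,
        "trackName": track_name,
        # names and ids as returned by the search
        "matchedArtistName": track["artists"][0]["name"],
        "matchedTrackName": track["name"],
        "artistId": track["artists"][0]["id"],
        "trackId": track["id"],
        "trackPopularity": track["popularity"],
    }


# hand-written stand-in with placeholder ids
fake_results = {
    "tracks": {
        "items": [
            {
                "name": "Love - Remastered 2010",
                "id": "track-id-placeholder",
                "popularity": 50,
                "artists": [{"name": "John Lennon", "id": "artist-id-placeholder"}],
            }
        ]
    }
}

query = build_query("John Lennon", "Love - Remastered 2010")
record = to_record("John Lennon", "Love - Remastered 2010", fake_results)
```

Keeping the original names next to the matched ones makes it possible to join the details back to the streaming history even when the search returns a slightly different spelling.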

In [6]:
# create a list of unique artist - track pairs to use for searching through the Spotify API
artist_track_list = df[['artistName', 'trackName']].drop_duplicates(ignore_index=True)
print(artist_track_list.shape)
artist_track_list.head()

(2097, 2)


Unnamed: 0,artistName,trackName
0,John Lennon,Isolation - Remastered 2010
1,John Lennon,Love - Remastered 2010
2,Felipe Gordon,Inherently Deep
3,Golf Trip,L in Vain
4,Loyle Carner,Ottolenghi


In [7]:
%%time 

track_details = []

# iterate over each row in the artist_track_list dataframe, perform a search request 
# to Spotify API and extract relevant details (artist id, track id and popularity score) 
# from the search results and append them to the track_details list
for __, row in artist_track_list.iterrows():
    query = f"{row.iloc[1]} {row.iloc[0]}"
    results = sp.search(q=query, type='track', market="HR", limit=1)
    for track in results['tracks']['items']:
        track_details.append({
            'artistName': track['artists'][0]['name'],
            'trackName': track['name'],
            'artistId': track['artists'][0]['id'],
            'trackId': track['id'],
            'trackPopularity': track['popularity']
        })

# convert the list of dictionaries into a DataFrame
track_details = pd.DataFrame(track_details)

print(track_details.shape)
track_details.head()

(2097, 5)
CPU times: user 11.3 s, sys: 2.43 s, total: 13.8 s
Wall time: 6min 35s


Unnamed: 0,artistName,trackName,artistId,trackId,trackPopularity
0,John Lennon,Isolation - Remastered 2010,4x1nvY2FN8jxqAFA0DA02H,3sRQJYlA7P4oIRUwy8Im9r,43
1,John Lennon,Love - Remastered 2010,4x1nvY2FN8jxqAFA0DA02H,0SEmf7XdvzCmmEjtpZKIKl,50
2,Felipe Gordon,Inherently Deep,7rQKvsWUOJgXmInx2JuaXj,7uvLegwoUsnra3oZzimE4a,20
3,Golf Trip,L in Vain,2cSZwherHAASXofK9ZFK2A,7GbiGMXMmtoAOznmgLDt4H,30
4,Loyle Carner,Ottolenghi,4oDjh8wNW5vDHyFRrDYC4k,64I9byMYBlS1ARsC3vtpgW,62


In [8]:
track_details[track_details.duplicated(keep=False)]

Unnamed: 0,artistName,trackName,artistId,trackId,trackPopularity
9,Buč Kesidi,Idemo do hodnika - uživo,0yujOFSHf3DlwirE8dsGuG,0wpXIDlxIs5Tj4IzphYInv,14
156,Buč Kesidi,Idemo do hodnika - uživo,0yujOFSHf3DlwirE8dsGuG,0wpXIDlxIs5Tj4IzphYInv,14
163,Moon Diagrams,Nightmoves,2MqjEhTz8CDRF4JUIaodjS,7MrYYUuNJRCMGerojDffO8,23
164,Moon Diagrams,Nightmoves,2MqjEhTz8CDRF4JUIaodjS,7MrYYUuNJRCMGerojDffO8,23
174,Jon Hopkins,Abandon Window - Remaster 2023,7yxi31szvlbwvKq9dYOmFI,5BKp7nLEzAtazOxYla2sBr,44
210,The Black Keys,Gold on the Ceiling,7mnBLXK823vNxN3UWB7Gfz,5lN1EH25gdiqT1SFALMAq1,76
219,The Black Keys,Gold on the Ceiling,7mnBLXK823vNxN3UWB7Gfz,5lN1EH25gdiqT1SFALMAq1,76
263,L'Impératrice,Everything Eventually Ends,4PwlsrN0t5mLN0C827cbEU,5LInOGDHqrgetnMlnvaDNq,59
546,George Harrison,Got My Mind Set on You,7FIoB5PHdrMZVC3q2HE5MS,4wswaG5vmNINMZcVBsAyBP,76
856,Sufjan Stevens,Futile Devices (Doveman Remix),4MXUO7sVCaFgFjoTI5ox5c,5vTSnZTmS1gMiWuA9kDE19,67


In [9]:
track_details = track_details.drop_duplicates(keep='first')

In [10]:
track_details.shape

(2087, 5)

<span style="font-family:Helvetica Light">

### Fetching Artists' Musical Genres
The <a href="https://developer.spotify.com/documentation/web-api/reference/get-multiple-artists" target="_blank">artists</a> endpoint was utilized to retrieve the genres of all the artists in my streaming history.

In [11]:
%%time

unique_artist_ids = list(set(track_details['artistId']))

artist_genres_list = []
batch_size = 50
none_counter = 0

for i in range(0, len(unique_artist_ids), batch_size):
    batch = unique_artist_ids[i:i + batch_size]
    artists = sp.artists(batch)
    for artist in artists['artists']:
        if artist is None:
            none_counter += 1
        else:
            artist_genres_list.append(artist)
            
print('Number of artists where no data were available:', none_counter)

artist_genres_df = pd.DataFrame(artist_genres_list)
artist_genres_df.head()

Number of artists where no data were available: 0
CPU times: user 240 ms, sys: 45 ms, total: 285 ms
Wall time: 6.93 s


Unnamed: 0,external_urls,followers,genres,href,id,images,name,popularity,type,uri
0,{'spotify': 'https://open.spotify.com/artist/5...,"{'href': None, 'total': 7990}",[synthpop],https://api.spotify.com/v1/artists/5De1ZtOD9ZU...,5De1ZtOD9ZUyckEjIKFXAi,[{'url': 'https://i.scdn.co/image/ab67616d0000...,Vicious Pink,22,artist,spotify:artist:5De1ZtOD9ZUyckEjIKFXAi
1,{'spotify': 'https://open.spotify.com/artist/5...,"{'href': None, 'total': 380049}","[edm, electro house, pop dance, progressive el...",https://api.spotify.com/v1/artists/5fahUm8t5c0...,5fahUm8t5c0GIdeTq0ZaG8,[{'url': 'https://i.scdn.co/image/ab6761610000...,Otto Knows,56,artist,spotify:artist:5fahUm8t5c0GIdeTq0ZaG8
2,{'spotify': 'https://open.spotify.com/artist/4...,"{'href': None, 'total': 27660}","[chill abstract hip hop, indie hip hop, jazz rap]",https://api.spotify.com/v1/artists/4YISTUJnoZt...,4YISTUJnoZtAy6LjgOpRL7,[{'url': 'https://i.scdn.co/image/ab6761610000...,Ovrkast.,39,artist,spotify:artist:4YISTUJnoZtAy6LjgOpRL7
3,{'spotify': 'https://open.spotify.com/artist/0...,"{'href': None, 'total': 114842}","[lo-fi indie, newfoundland indie, slacker rock...",https://api.spotify.com/v1/artists/04GCjO1r1hP...,04GCjO1r1hPelibCUq9S8H,[{'url': 'https://i.scdn.co/image/ab6761610000...,Fog Lake,52,artist,spotify:artist:04GCjO1r1hPelibCUq9S8H
4,{'spotify': 'https://open.spotify.com/artist/5...,"{'href': None, 'total': 461377}","[adult standards, british invasion, soul, voca...",https://api.spotify.com/v1/artists/5zaXYwewAXe...,5zaXYwewAXedKNCff45U5l,[{'url': 'https://i.scdn.co/image/8b2bbbe1dde1...,Dusty Springfield,64,artist,spotify:artist:5zaXYwewAXedKNCff45U5l


In [12]:
# extract number of followers for each artist
artist_genres_df['total_followers'] = artist_genres_df['followers'].apply(lambda x: x['total'])

In [13]:
columns_to_keep = ['id', 'name', 'popularity', 'total_followers', 'genres']
artist_genres_df = artist_genres_df[columns_to_keep]
artist_genres_df.columns = ['artistId', 'artistName', 'artistPopularity', 'noFollowers', 'artistGenres']
artist_genres_df.head()

Unnamed: 0,artistId,artistName,artistPopularity,noFollowers,artistGenres
0,5De1ZtOD9ZUyckEjIKFXAi,Vicious Pink,22,7990,[synthpop]
1,5fahUm8t5c0GIdeTq0ZaG8,Otto Knows,56,380049,"[edm, electro house, pop dance, progressive el..."
2,4YISTUJnoZtAy6LjgOpRL7,Ovrkast.,39,27660,"[chill abstract hip hop, indie hip hop, jazz rap]"
3,04GCjO1r1hPelibCUq9S8H,Fog Lake,52,114842,"[lo-fi indie, newfoundland indie, slacker rock..."
4,5zaXYwewAXedKNCff45U5l,Dusty Springfield,64,461377,"[adult standards, british invasion, soul, voca..."


In [14]:
track_details = pd.merge(track_details, artist_genres_df, on=['artistId', 'artistName'], how='left')
track_details.shape

(2087, 8)

In [16]:
track_details.head()

Unnamed: 0,artistName,trackName,artistId,trackId,trackPopularity,artistPopularity,noFollowers,artistGenres
0,John Lennon,Isolation - Remastered 2010,4x1nvY2FN8jxqAFA0DA02H,3sRQJYlA7P4oIRUwy8Im9r,43,66,5979523,"[classic rock, rock]"
1,John Lennon,Love - Remastered 2010,4x1nvY2FN8jxqAFA0DA02H,0SEmf7XdvzCmmEjtpZKIKl,50,66,5979523,"[classic rock, rock]"
2,Felipe Gordon,Inherently Deep,7rQKvsWUOJgXmInx2JuaXj,7uvLegwoUsnra3oZzimE4a,20,36,17379,"[jazz house, lo-fi house]"
3,Golf Trip,L in Vain,2cSZwherHAASXofK9ZFK2A,7GbiGMXMmtoAOznmgLDt4H,30,29,2729,[]
4,Loyle Carner,Ottolenghi,4oDjh8wNW5vDHyFRrDYC4k,64I9byMYBlS1ARsC3vtpgW,62,61,711297,"[indie soul, london rap]"


<span style="font-family:Helvetica Light">

### Fetching Track's Audio Features
When working with the Spotify API, it’s important to consider the rate limits imposed by Spotify. The rate limit is calculated based on the number of API calls your application makes within a rolling 30-second window. <a href="https://community.spotify.com/t5/Spotify-for-Developers/Web-API-ratelimit/td-p/5330410" target="_blank">Testing</a> has shown that Spotify allows approximately 180 requests per minute without returning a 429 error (Too Many Requests). Additionally, there may be daily rate limits for specific request types.

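To stay within this limit, requests are throttled on the client side with the ratelimit library: the *limits* decorator caps the number of calls per period, and *sleep_and_retry* sleeps until the current period ends and then retries the call instead of raising an exception. The wait therefore comes from a predefined period rather than from a Retry-After header, which the audio_features endpoint was not observed to return. This approach aims to avoid rate limit errors (HTTP 429) up front rather than to handle them after they occur.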
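
For completeness, a fixed-sleep retry for rate limit errors could be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this notebook: `RateLimitError` is a hypothetical stand-in for an HTTP 429 raised by an API client, and `call_with_retry` with its parameters is invented for the example.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""


def call_with_retry(func, *args, max_attempts=5, sleep_seconds=5.0, sleep=time.sleep):
    """Call func(*args); on RateLimitError wait a fixed time and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except RateLimitError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(sleep_seconds)  # predefined pause, no Retry-After header needed
```

With Spotipy, the equivalent check would presumably be to catch `spotipy.SpotifyException` and inspect its `http_status` attribute; treat that as an assumption to verify against the installed version.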

Given the rate limits, the track IDs are processed in batches of 100, which is the maximum number of IDs you can pass in a single request to <a href="https://spotipy.readthedocs.io/en/latest/#spotipy.client.Spotify.audio_features" target="_blank">sp.audio_features</a>. This batching approach allows efficient handling of large datasets while respecting Spotify’s rate limits, ensuring that all track IDs are processed without being blocked.
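
The batching itself can be sketched with a small generator. The helper name `chunked` is hypothetical, and the ids are generated placeholders; only the count of 2087 tracks comes from this notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# placeholder ids standing in for the 2087 track ids collected earlier
track_ids = [f"id{n}" for n in range(2087)]
batch_lengths = [len(batch) for batch in chunked(track_ids, 100)]
```

For 2087 tracks this gives 21 batches (20 full ones and a final batch of 87), so the whole run needs 21 requests, far fewer than the roughly 180 requests per minute mentioned above.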

In [17]:
%%time

audio_features_list = []
batch_size = 100
none_counter = 0
requests_per_minute = 180

# decorator to enforce rate limiting
@sleep_and_retry
@limits(calls=requests_per_minute, period=60)
def fetch_audio_features(track_ids):
    features = sp.audio_features(track_ids)
    return features

# process track IDs in batches
for i in range(0, len(track_details['trackId']), batch_size):
    batch = track_details['trackId'][i:i + batch_size]
    features = fetch_audio_features(batch)
    for feature in features:
        if feature is None:
            none_counter += 1
        else:
            audio_features_list.append(feature)

print('Number of tracks where no audio features were available:', none_counter)

audio_features_df = pd.DataFrame(audio_features_list)
audio_features_df.head()

Number of tracks where no audio features were available: 2
CPU times: user 272 ms, sys: 41.7 ms, total: 314 ms
Wall time: 3.45 s


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.571,0.0788,2,-15.284,1,0.0393,0.909,0.00921,0.0794,0.254,115.726,audio_features,3sRQJYlA7P4oIRUwy8Im9r,spotify:track:3sRQJYlA7P4oIRUwy8Im9r,https://api.spotify.com/v1/tracks/3sRQJYlA7P4o...,https://api.spotify.com/v1/audio-analysis/3sRQ...,172253,4
1,0.613,0.0609,2,-21.297,1,0.0391,0.887,0.0197,0.0525,0.156,80.988,audio_features,0SEmf7XdvzCmmEjtpZKIKl,spotify:track:0SEmf7XdvzCmmEjtpZKIKl,https://api.spotify.com/v1/tracks/0SEmf7XdvzCm...,https://api.spotify.com/v1/audio-analysis/0SEm...,202147,4
2,0.803,0.57,6,-8.236,0,0.0638,0.147,0.915,0.107,0.331,121.95,audio_features,7uvLegwoUsnra3oZzimE4a,spotify:track:7uvLegwoUsnra3oZzimE4a,https://api.spotify.com/v1/tracks/7uvLegwoUsnr...,https://api.spotify.com/v1/audio-analysis/7uvL...,395204,4
3,0.773,0.727,11,-6.767,0,0.0417,0.182,0.0358,0.113,0.95,108.003,audio_features,7GbiGMXMmtoAOznmgLDt4H,spotify:track:7GbiGMXMmtoAOznmgLDt4H,https://api.spotify.com/v1/tracks/7GbiGMXMmtoA...,https://api.spotify.com/v1/audio-analysis/7Gbi...,254619,4
4,0.776,0.593,7,-10.535,1,0.252,0.327,0.00616,0.186,0.247,94.97,audio_features,64I9byMYBlS1ARsC3vtpgW,spotify:track:64I9byMYBlS1ARsC3vtpgW,https://api.spotify.com/v1/tracks/64I9byMYBlS1...,https://api.spotify.com/v1/audio-analysis/64I9...,197601,4


In [18]:
print(audio_features_df.shape)
print(track_details.shape)

(2085, 18)
(2087, 8)


In [19]:
audio_features_df.time_signature.value_counts()

time_signature
4    1863
3     170
5      31
1      13
0       8
Name: count, dtype: int64

In [20]:
audio_features_df.type.value_counts()

type
audio_features    2085
Name: count, dtype: int64

In [21]:
# dropping columns we don't need
columns_to_drop = ['type', 'uri', 'track_href', 'analysis_url']
audio_features_df = audio_features_df.drop(columns_to_drop, axis=1)

audio_features_df = audio_features_df.rename(columns={'id': 'trackId'})

In [22]:
track_details = pd.merge(track_details, audio_features_df, on='trackId', how='left')
track_details.head()

Unnamed: 0,artistName,trackName,artistId,trackId,trackPopularity,artistPopularity,noFollowers,artistGenres,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,John Lennon,Isolation - Remastered 2010,4x1nvY2FN8jxqAFA0DA02H,3sRQJYlA7P4oIRUwy8Im9r,43,66,5979523,"[classic rock, rock]",0.571,0.0788,...,-15.284,1.0,0.0393,0.909,0.00921,0.0794,0.254,115.726,172253.0,4.0
1,John Lennon,Love - Remastered 2010,4x1nvY2FN8jxqAFA0DA02H,0SEmf7XdvzCmmEjtpZKIKl,50,66,5979523,"[classic rock, rock]",0.613,0.0609,...,-21.297,1.0,0.0391,0.887,0.0197,0.0525,0.156,80.988,202147.0,4.0
2,Felipe Gordon,Inherently Deep,7rQKvsWUOJgXmInx2JuaXj,7uvLegwoUsnra3oZzimE4a,20,36,17379,"[jazz house, lo-fi house]",0.803,0.57,...,-8.236,0.0,0.0638,0.147,0.915,0.107,0.331,121.95,395204.0,4.0
3,Golf Trip,L in Vain,2cSZwherHAASXofK9ZFK2A,7GbiGMXMmtoAOznmgLDt4H,30,29,2729,[],0.773,0.727,...,-6.767,0.0,0.0417,0.182,0.0358,0.113,0.95,108.003,254619.0,4.0
4,Loyle Carner,Ottolenghi,4oDjh8wNW5vDHyFRrDYC4k,64I9byMYBlS1ARsC3vtpgW,62,61,711297,"[indie soul, london rap]",0.776,0.593,...,-10.535,1.0,0.252,0.327,0.00616,0.186,0.247,94.97,197601.0,4.0


In [27]:
track_details.to_csv('../data/track_details.csv', index=False)

In [25]:
df = pd.merge(df, track_details, on=['artistName', 'trackName'], how='left')
print(df.shape)
df.head()

(4940, 23)

In [28]:
df.to_csv('../data/streaming_history_music_enriched.csv', index=False)

<span style="font-family:Helvetica Light">

## Loading Historical Podcast Data from Spotify
Loading the JSON file containing my one-year podcast streaming history, as requested and received from Spotify.

In [41]:
podcasts = pd.read_json('../data/raw/StreamingHistory_podcast_0.json')
print(podcasts.shape)
podcasts.head()

(102, 4)


Unnamed: 0,endTime,podcastName,episodeName,msPlayed
0,2023-05-17 16:00,Lex Fridman Podcast,"#367 – Sam Altman: OpenAI CEO on GPT-4, ChatGP...",1806958
1,2023-05-21 08:38,Meditation Mountain,10 Minute Guided Meditation for Mindfulness,434688
2,2023-06-04 08:04,This Past Weekend w/ Theo Von,E436 Caleb Pressley,2165418
3,2023-06-09 10:23,anything goes with emma chamberlain,a talk with mac demarco [video],401953
4,2023-06-09 11:41,anything goes with emma chamberlain,a talk with mac demarco [video],1805573


<span style="font-family:Helvetica Light">

### Fetching Additional Podcast Details

In [43]:
# create a list of unique podcast - episode pairs to use for searching through the Spotify API
podcast_episode_list = podcasts[['podcastName', 'episodeName']].drop_duplicates(ignore_index=True)
print(podcast_episode_list.shape)
podcast_episode_list.head()

(48, 2)


Unnamed: 0,podcastName,episodeName
0,Lex Fridman Podcast,"#367 – Sam Altman: OpenAI CEO on GPT-4, ChatGP..."
1,Meditation Mountain,10 Minute Guided Meditation for Mindfulness
2,This Past Weekend w/ Theo Von,E436 Caleb Pressley
3,anything goes with emma chamberlain,a talk with mac demarco [video]
4,This Past Weekend w/ Theo Von,E447 Trevor Wallace


In [44]:
%%time 

show_details = []

for __, row in podcast_episode_list.iterrows():
    query = f"{row.iloc[1]} {row.iloc[0]}"
    results = sp.search(q=query, type='show', market="HR", limit=1)
    for show in results['shows']['items']:
        show_details.append({
            'podcastName': show['name'],
            'podcastId': show['id'],
            'imageUrl': show['images'][0]['url']
        })

# convert the list of dictionaries into a DataFrame
show_details = pd.DataFrame(show_details)

print(show_details.shape)
show_details.head()

(48, 3)
CPU times: user 345 ms, sys: 79.7 ms, total: 425 ms
Wall time: 13.4 s


Unnamed: 0,podcastName,podcastId,imageUrl
0,Lex Fridman Podcast,2MAi0BvDc6GTFvKFPXnkCL,https://i.scdn.co/image/ab6765630000ba8a563ebb...
1,Meditation Mountain,6rmydpcCvLzN4744S1fCsW,https://i.scdn.co/image/ab6765630000ba8a4beb4a...
2,The Joe Rogan Experience,4rOoJ6Egrf8K2IrywzwOMk,https://i.scdn.co/image/ab6765630000ba8a20741e...
3,anything goes with emma chamberlain,5VzFvh1JlEhBMS6ZHZ8CNO,https://i.scdn.co/image/ab6765630000ba8a072189...
4,The Joe Rogan Experience,4rOoJ6Egrf8K2IrywzwOMk,https://i.scdn.co/image/ab6765630000ba8a20741e...
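
Comparing this output with the list of queried podcasts suggests that a search does not always return the requested show: rows 2 and 4 were queried as This Past Weekend w/ Theo Von but came back as The Joe Rogan Experience. A minimal sketch for flagging such cases is given below. It compares names by position, which is only valid under the assumption that every query returned exactly one result (the matching row counts of 48 are consistent with that). The helper names are hypothetical.

```python
import unicodedata


def normalize(name):
    """Unicode-normalize, case-fold, and strip a name for a lenient comparison."""
    return unicodedata.normalize("NFKC", name).casefold().strip()


def flag_mismatches(requested_names, returned_names):
    """Return (position, requested, returned) for every pair that differs."""
    return [
        (pos, requested, returned)
        for pos, (requested, returned) in enumerate(zip(requested_names, returned_names))
        if normalize(requested) != normalize(returned)
    ]


# first five rows of the two tables shown in this section
requested = [
    "Lex Fridman Podcast",
    "Meditation Mountain",
    "This Past Weekend w/ Theo Von",
    "anything goes with emma chamberlain",
    "This Past Weekend w/ Theo Von",
]
returned = [
    "Lex Fridman Podcast",
    "Meditation Mountain",
    "The Joe Rogan Experience",
    "anything goes with emma chamberlain",
    "The Joe Rogan Experience",
]

mismatches = flag_mismatches(requested, returned)
```

Rows flagged this way could be searched again with a narrower query, or reviewed by hand before the show details are used downstream.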


In [45]:
show_details.to_csv('../data/show_details.csv', index=False)