# Collecting Data from the Spotify Web API using Spotipy

## About the Spotipy Library:

From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): 
>"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


## About using the Spotify Web API:

Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) for accessing its data. In this notebook, I used the following:

- [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs.
- [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.
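
The search calls later in this notebook fetch results one page at a time by stepping an offset. A minimal sketch of that arithmetic, assuming the page size of 50 and the target of 2000 tracks used below; `page_offsets` is a hypothetical helper for illustration, not part of Spotipy:

```python
def page_offsets(total, page_size):
    """Return the offsets needed to cover `total` items in pages of `page_size`."""
    return list(range(0, total, page_size))

# 2000 tracks in pages of 50 (the values used in this notebook).
offsets = page_offsets(2000, 50)
print(len(offsets), offsets[0], offsets[-1])  # 40 0 1950
```

The last offset, 1950, matches the `offset=1950` visible in the final request URL printed further down.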



### Set Spotipy Credentials.

In [554]:
# pip install spotipy.
!pip install spotipy



In [1]:
# imports.
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# spotify credentials, read from environment variables rather than hard-coded in the notebook.
# SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are the variable names Spotipy itself looks for.
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

### Song Tracks Data.

In [2]:
# imports.
import timeit
import time

# timeit library to measure the time needed to run this code.
start = timeit.default_timer()

# empty lists where the results are going to be stored
artist_name = []
track_name = []
popularity = []
track_id = []

for i in range(0,2000,50):
    # search by year.
    #track_results = sp.search(q='year:2020', limit=50, offset=i, market='EU')
    # search by genre.
    #track_results = sp.search(q='genre:dance', limit=50, offset=i, market='US')
    # search year and genre.
    track_results = sp.search(q='year:2018 AND tag:hipster', limit=50, offset=i, market='US')

    #time.sleep(0.50)
    for t in track_results['tracks']['items']:
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])
        
stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
track_results['tracks']

Time to run this code (in seconds): 11.404927899999999


{'href': 'https://api.spotify.com/v1/search?query=year%3A2018+AND+tag%3Ahipster&type=track&market=US&offset=1950&limit=50',
 'items': [{'album': {'album_type': 'album',
    'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/7dJ05O9EOtFVROVfEEHBBc'},
      'href': 'https://api.spotify.com/v1/artists/7dJ05O9EOtFVROVfEEHBBc',
      'id': '7dJ05O9EOtFVROVfEEHBBc',
      'name': 'Lullabies for Deep Meditation',
      'type': 'artist',
      'uri': 'spotify:artist:7dJ05O9EOtFVROVfEEHBBc'},
     {'external_urls': {'spotify': 'https://open.spotify.com/artist/5xj0jN2EifogsCRT1f91Zy'},
      'href': 'https://api.spotify.com/v1/artists/5xj0jN2EifogsCRT1f91Zy',
      'id': '5xj0jN2EifogsCRT1f91Zy',
      'name': 'Zen Meditation and Natural White Noise and New Age Deep Massage',
      'type': 'artist',
      'uri': 'spotify:artist:5xj0jN2EifogsCRT1f91Zy'},
     {'external_urls': {'spotify': 'https://open.spotify.com/artist/3MSV8ibxF4Tn6eyDSUjFuY'},
      'href': 'https://api
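
The inner loop of the search cell reads four fields from each item in a page of results. A small sketch of that extraction as a standalone function, run on a hand-made stand-in for one page (not real API output); `extract_track_fields` and `fake_page` are hypothetical names used only for illustration:

```python
def extract_track_fields(page):
    """Pull artist name, track name, ID and popularity from one page of search results."""
    rows = []
    for t in page['tracks']['items']:
        rows.append({
            'artist_name': t['artists'][0]['name'],  # first listed artist only
            'track_name': t['name'],
            'track_id': t['id'],
            'popularity': t['popularity'],
        })
    return rows

# hand-made stand-in for one page of results, shaped like the fields the loop reads.
fake_page = {'tracks': {'items': [
    {'artists': [{'name': 'Artist A'}, {'name': 'Artist B'}],
     'name': 'Song 1', 'id': 'id1', 'popularity': 4},
]}}
print(extract_track_fields(fake_page))
```

Note that only the first entry of `artists` is kept, so a track credited to several artists is stored under the first one.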


In [611]:
# imports.
import pandas as pd

# create the tracks data frame.
df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})
# show the data frame shape.
print(df_tracks.shape)
# show the data frame with headers.
df_tracks.head()

(2000, 4)


| | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Luis Estrada | Vida Loca | 0t3hASUCmUhRHIRT6CUUfi | 4 |
| 1 | Sabrina Is Not In This Chat | This Innocent Fish | 2KsyQRQkUfqnl5X3VTDfpm | 4 |
| 2 | Luis Estrada | Cara de Mala | 7aAkCL7607QsEqWq6N696G | 4 |
| 3 | Leah Voysey | Poison | 7wng5BHHY4a1jXxTTVxN3x | 4 |
| 4 | Andrei Krylov | Ophelia Dancing Alone in the Castle | 00bljtxxcl13Bt2hL9sXUH | 4 |


In [612]:
# group entries by artist_name and track_name, check for duplicates.
grouped = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
print(grouped[grouped > 1].count())
df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)

46


In [613]:
# verify duplicates were dropped.
grouped_after_dropping = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped_after_dropping[grouped_after_dropping > 1].count()

0
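
`drop_duplicates` with a `subset` keeps the first row for each combination of the listed columns and discards later ones. A plain-Python sketch of the same idea on made-up rows; `drop_duplicate_tracks` is a hypothetical helper, not a pandas function:

```python
def drop_duplicate_tracks(rows):
    """Keep the first row seen for each (artist_name, track_name) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row['artist_name'], row['track_name'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# made-up rows: the first two describe the same song under different track IDs.
sample = [
    {'artist_name': 'A', 'track_name': 'X', 'track_id': '1'},
    {'artist_name': 'A', 'track_name': 'X', 'track_id': '2'},
    {'artist_name': 'B', 'track_name': 'X', 'track_id': '3'},
]
unique_rows = drop_duplicate_tracks(sample)
print(len(unique_rows))  # 2
```

In this notebook, 46 artist/track pairs occurred more than once and the data frame shrinks from 2000 to 1954 rows, which is consistent with each of those pairs appearing exactly twice.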

In [614]:
# show basic info for tracks data.
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1954 entries, 0 to 1999
Data columns (total 4 columns):
artist_name    1954 non-null object
track_name     1954 non-null object
track_id       1954 non-null object
popularity     1954 non-null int64
dtypes: int64(1), object(3)
memory usage: 76.3+ KB


### Track Audio Features Data.

In [615]:
# timeit library to measure the time needed to run this code.
start = timeit.default_timer()

# empty list, batchsize, counter for 'none' results.
rows = []
batchsize = 100
None_counter = 0

for i in range(0,len(df_tracks['track_id']),batchsize):
    batch = df_tracks['track_id'][i:i+batchsize]
    feature_results = sp.audio_features(batch)
    for t in feature_results:
        if t is None:
            None_counter += 1
        else:
            rows.append(t)
            
print('Number of tracks where no audio features were available:',None_counter)
stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)

Number of tracks where no audio features were available: 0
Time to run this code (in seconds): 3.0535056440003245
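
The audio features loop sends the track IDs in slices of `batchsize` rather than one request per track. A minimal sketch of that slicing, assuming the 1954 IDs and batch size of 100 from this notebook; `batches` is a hypothetical helper and the IDs are placeholders:

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = [f'id{n}' for n in range(1954)]  # placeholder IDs, not real track IDs
sizes = [len(b) for b in batches(ids, 100)]
print(len(sizes), sizes[-1])  # 20 54
```

So the 1954 tracks need 20 requests: 19 full batches of 100 and a final batch of 54.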


In [616]:
# create the audio features data frame.
df_audio_features = pd.DataFrame(rows)
# show the shape of the data frame.
print(df_audio_features.shape)
# show the data frame with headers.
df_audio_features.head()

(1954, 18)


| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.688 | 0.826 | 0 | -5.63 | 0 | 0.0626 | 0.165 | 0.0 | 0.182 | 0.89 | 100.033 | audio_features | 0t3hASUCmUhRHIRT6CUUfi | spotify:track:0t3hASUCmUhRHIRT6CUUfi | https://api.spotify.com/v1/tracks/0t3hASUCmUhR... | https://api.spotify.com/v1/audio-analysis/0t3h... | 213600 | 4 |
| 1 | 0.522 | 0.744 | 9 | -8.014 | 1 | 0.0411 | 0.119 | 0.0305 | 0.0697 | 0.535 | 119.695 | audio_features | 2KsyQRQkUfqnl5X3VTDfpm | spotify:track:2KsyQRQkUfqnl5X3VTDfpm | https://api.spotify.com/v1/tracks/2KsyQRQkUfqn... | https://api.spotify.com/v1/audio-analysis/2Ksy... | 236151 | 4 |
| 2 | 0.785 | 0.923 | 2 | -3.532 | 1 | 0.105 | 0.0252 | 0.0 | 0.33 | 0.663 | 100.003 | audio_features | 7aAkCL7607QsEqWq6N696G | spotify:track:7aAkCL7607QsEqWq6N696G | https://api.spotify.com/v1/tracks/7aAkCL7607Qs... | https://api.spotify.com/v1/audio-analysis/7aAk... | 236743 | 4 |
| 3 | 0.576 | 0.65 | 1 | -6.184 | 0 | 0.0346 | 0.14 | 0.0 | 0.0898 | 0.0812 | 87.024 | audio_features | 7wng5BHHY4a1jXxTTVxN3x | spotify:track:7wng5BHHY4a1jXxTTVxN3x | https://api.spotify.com/v1/tracks/7wng5BHHY4a1... | https://api.spotify.com/v1/audio-analysis/7wng... | 139605 | 4 |
| 4 | 0.528 | 0.15 | 5 | -21.634 | 0 | 0.0554 | 0.977 | 0.92 | 0.109 | 0.418 | 115.877 | audio_features | 00bljtxxcl13Bt2hL9sXUH | spotify:track:00bljtxxcl13Bt2hL9sXUH | https://api.spotify.com/v1/tracks/00bljtxxcl13... | https://api.spotify.com/v1/audio-analysis/00bl... | 102940 | 3 |


In [617]:
# show basic info for audio features data.
df_audio_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1954 entries, 0 to 1953
Data columns (total 18 columns):
danceability        1954 non-null float64
energy              1954 non-null float64
key                 1954 non-null int64
loudness            1954 non-null float64
mode                1954 non-null int64
speechiness         1954 non-null float64
acousticness        1954 non-null float64
instrumentalness    1954 non-null float64
liveness            1954 non-null float64
valence             1954 non-null float64
tempo               1954 non-null float64
type                1954 non-null object
id                  1954 non-null object
uri                 1954 non-null object
track_href          1954 non-null object
analysis_url        1954 non-null object
duration_ms         1954 non-null int64
time_signature      1954 non-null int64
dtypes: float64(9), int64(4), object(5)
memory usage: 274.9+ KB


In [None]:
# rename id column to merge with tracks data frame.
df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)

In [619]:
# drop useless columns.
columns_to_drop = ['analysis_url','track_href','type','uri']
df_audio_features.drop(columns_to_drop, axis=1,inplace=True)
# show the data frame shape.
print(df_audio_features.shape)

(1954, 14)


In [620]:
# merge both dataframes with inner method to keep track IDs present in both data frames.
df = pd.merge(df_tracks,df_audio_features,on='track_id',how='inner')
print(df.shape)
df.head()

(1954, 17)


| | artist_name | track_name | track_id | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Luis Estrada | Vida Loca | 0t3hASUCmUhRHIRT6CUUfi | 4 | 0.688 | 0.826 | 0 | -5.63 | 0 | 0.0626 | 0.165 | 0.0 | 0.182 | 0.89 | 100.033 | 213600 | 4 |
| 1 | Sabrina Is Not In This Chat | This Innocent Fish | 2KsyQRQkUfqnl5X3VTDfpm | 4 | 0.522 | 0.744 | 9 | -8.014 | 1 | 0.0411 | 0.119 | 0.0305 | 0.0697 | 0.535 | 119.695 | 236151 | 4 |
| 2 | Luis Estrada | Cara de Mala | 7aAkCL7607QsEqWq6N696G | 4 | 0.785 | 0.923 | 2 | -3.532 | 1 | 0.105 | 0.0252 | 0.0 | 0.33 | 0.663 | 100.003 | 236743 | 4 |
| 3 | Leah Voysey | Poison | 7wng5BHHY4a1jXxTTVxN3x | 4 | 0.576 | 0.65 | 1 | -6.184 | 0 | 0.0346 | 0.14 | 0.0 | 0.0898 | 0.0812 | 87.024 | 139605 | 4 |
| 4 | Andrei Krylov | Ophelia Dancing Alone in the Castle | 00bljtxxcl13Bt2hL9sXUH | 4 | 0.528 | 0.15 | 5 | -21.634 | 0 | 0.0554 | 0.977 | 0.92 | 0.109 | 0.418 | 115.877 | 102940 | 3 |
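
An inner merge on `track_id` keeps only the tracks that appear in both data frames and combines their columns. A plain-Python sketch of that behavior on made-up rows, assuming each `track_id` occurs at most once on the right-hand side; `inner_join` is a hypothetical helper, not the pandas implementation:

```python
def inner_join(left_rows, right_rows, key):
    """Combine rows from both lists that share the same value for `key`."""
    # assumes `key` values are unique within right_rows.
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

tracks = [{'track_id': 'a', 'track_name': 'X'}, {'track_id': 'b', 'track_name': 'Y'}]
features = [{'track_id': 'a', 'tempo': 100.0}]  # no features for track 'b'
merged = inner_join(tracks, features, 'track_id')
print(merged)  # [{'track_id': 'a', 'track_name': 'X', 'tempo': 100.0}]
```

The column count in the notebook follows the same logic: 4 track columns plus 14 audio feature columns, minus the shared `track_id`, gives the 17 columns reported by `df.shape`.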


### Final Data Frame.

In [621]:
# check for NA values.
df.isna().sum()

artist_name         0
track_name          0
track_id            0
popularity          0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64

In [622]:
#print(df.shape)
df.head()

| | artist_name | track_name | track_id | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Luis Estrada | Vida Loca | 0t3hASUCmUhRHIRT6CUUfi | 4 | 0.688 | 0.826 | 0 | -5.63 | 0 | 0.0626 | 0.165 | 0.0 | 0.182 | 0.89 | 100.033 | 213600 | 4 |
| 1 | Sabrina Is Not In This Chat | This Innocent Fish | 2KsyQRQkUfqnl5X3VTDfpm | 4 | 0.522 | 0.744 | 9 | -8.014 | 1 | 0.0411 | 0.119 | 0.0305 | 0.0697 | 0.535 | 119.695 | 236151 | 4 |
| 2 | Luis Estrada | Cara de Mala | 7aAkCL7607QsEqWq6N696G | 4 | 0.785 | 0.923 | 2 | -3.532 | 1 | 0.105 | 0.0252 | 0.0 | 0.33 | 0.663 | 100.003 | 236743 | 4 |
| 3 | Leah Voysey | Poison | 7wng5BHHY4a1jXxTTVxN3x | 4 | 0.576 | 0.65 | 1 | -6.184 | 0 | 0.0346 | 0.14 | 0.0 | 0.0898 | 0.0812 | 87.024 | 139605 | 4 |
| 4 | Andrei Krylov | Ophelia Dancing Alone in the Castle | 00bljtxxcl13Bt2hL9sXUH | 4 | 0.528 | 0.15 | 5 | -21.634 | 0 | 0.0554 | 0.977 | 0.92 | 0.109 | 0.418 | 115.877 | 102940 | 3 |


### Create a CSV file.

In [None]:
from google.colab import files
df.to_csv('SpotifyTracks_2018hipster.csv')
files.download('SpotifyTracks_2018hipster.csv')