<div class="alert alert-block alert-info">
This notebook pulls data from the Spotify Web API and builds the master dataset for our analysis of Joey Yung. <br>
The pipeline below can be reused for other artists because it starts from the artist name.
</div>

### Interesting features to look into:

1. `Album features`: album_type (single vs album), available_markets, genres, popularity, release_date, release_date_precision, total_tracks
2. `Track features`: available_markets, is_playable, duration_ms, popularity
3. `Audio features`: acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, valence

Artist-level features are not needed here because we are only looking at one artist. <br>
`Artist features`: name, followers, popularity, genres

**Reference:**<br>
[Web API Reference Beta](https://developer.spotify.com/documentation/web-api/reference-beta/#objects-index) <br>
[Audio Features Distributions](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) <br>
[Medium Article](https://medium.com/@RareLoot/extracting-spotify-data-on-your-favourite-artist-via-python-d58bc92a4330)

## Import libraries and set up modules/variables

In [1]:
import pandas as pd
import numpy as np
import sys
import time

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
np.set_printoptions(threshold=sys.maxsize)

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [2]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

CLIENT_ID = 'YOUR CLIENT ID' 
CLIENT_SECRET = 'YOUR CLIENT SECRET'

client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
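
Hard-coding the client ID and secret makes them easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the credentials are exported as the environment variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names spotipy looks for when no arguments are passed). The helper name `load_spotify_credentials` is hypothetical and not part of spotipy.

```python
import os


def load_spotify_credentials(environ=None):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper, not part of spotipy. It raises a clear error when a
    variable is missing instead of failing later during authentication.
    """
    environ = os.environ if environ is None else environ
    try:
        return environ["SPOTIPY_CLIENT_ID"], environ["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None
```

The returned pair could then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` in place of the string literals.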

## Get artist's albums and related info

In [3]:
fav_artist = 'Joey Yung' # Chosen artist

In [4]:
# TODO: artists with more than 50 albums need paging (offset) to retrieve everything
def get_artist_albums(artist_name, album_type='album', limit=50):
    """ Get the artist's albums (first page only).
    
        Parameters:
            - artist_name - string
            - album_type - 'album', 'single', 'appears_on', 'compilation'
            - limit  - the number of albums to return (maximum 50)
    """
    # Get the artist's URI from the top artist search hit
    # (a track search can return a collaborator as the first artist of the first track)
    result = sp.search(q=artist_name, type='artist', limit=1)
    artist_uri = result['artists']['items'][0]['uri']
    
    # Get the albums (a single page of at most `limit` items)
    sp_albums = sp.artist_albums(artist_uri, album_type=album_type, limit=limit)
    
    return sp_albums
    
# Save albums names and uris
album_names = []
album_uris = []
album_release_date = []
album_release_date_precision = []
album_total_tracks = []

sp_albums = get_artist_albums(fav_artist)

for item in sp_albums['items']:
    album_names.append(item['name'])
    album_uris.append(item['uri'])
    album_release_date.append(item['release_date'])
    album_release_date_precision.append(item['release_date_precision'])
    album_total_tracks.append(item['total_tracks'])
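
The TODO above (artists with more than 50 albums) could be handled by following the `next` link that Spotify includes in every paged response. This is a minimal sketch, not the notebook's method: `collect_all_items` and `FakeClient` are hypothetical names, and the fake client only imitates spotipy's `Spotify.next(page)` so the example runs without network access.

```python
def collect_all_items(sp, first_page):
    """Follow the `next` links of a Spotify paging object and return every item.

    `sp` is any client exposing `next(page)` the way spotipy's `Spotify.next`
    does: it returns the following page, or None when there is none.
    """
    items = list(first_page["items"])
    page = first_page
    while page and page.get("next"):
        page = sp.next(page)
        if page:
            items.extend(page["items"])
    return items


class FakeClient:
    """Stand-in for a spotipy client so the sketch runs offline."""

    def __init__(self, pages):
        self._pages = pages  # maps a `next` URL to the page it points at

    def next(self, page):
        return self._pages.get(page["next"])


# Two fake pages: the first links to the second, the second is the last one
second = {"items": [{"name": "album 51"}], "next": None}
first = {"items": [{"name": f"album {n}"} for n in range(1, 51)], "next": "page-2"}
all_albums = collect_all_items(FakeClient({"page-2": second}), first)
print(len(all_albums))
```

With the real client, the first page would come from `sp.artist_albums(artist_uri, album_type='album', limit=50)`. The same helper should also work for the pages returned by `sp.album_tracks`.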

In [5]:
df_album =  pd.DataFrame(
    {'album_names': album_names,
     'album_uris': album_uris,
     'album_release_date': album_release_date,
     'album_release_date_precision': album_release_date_precision,
     'album_total_trakcs': album_total_tracks
    })
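
The five parallel lists can be avoided by turning each album object into one flat record. The sketch below assumes hypothetical names (`ALBUM_FIELDS`, `album_rows`) and produces slightly different column names from the ones used in this notebook (`album_name` rather than `album_names`, for example).

```python
ALBUM_FIELDS = ("name", "uri", "release_date", "release_date_precision", "total_tracks")


def album_rows(items, fields=ALBUM_FIELDS):
    """Turn album objects into flat dicts, one per album, prefixing keys with `album_`."""
    return [{f"album_{field}": item[field] for field in fields} for item in items]


# Sample values copied from the first album returned in this notebook
sample_items = [{
    "name": "答案之書",
    "uri": "spotify:album:0zPpdDiDX6JnGZRHXFYuZt",
    "release_date": "2018-10-19",
    "release_date_precision": "day",
    "total_tracks": 10,
}]
rows = album_rows(sample_items)
print(rows[0]["album_total_tracks"])
```

Passing the result to `pd.DataFrame(album_rows(sp_albums['items']))` would then build the album table in one step.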

In [6]:
df_album.shape

(46, 5)

In [7]:
df_album.head()

| | album_names | album_uris | album_release_date | album_release_date_precision | album_total_trakcs |
|---|---|---|---|---|---|
| 0 | 答案之書 | spotify:album:0zPpdDiDX6JnGZRHXFYuZt | 2018-10-19 | day | 10 |
| 1 | Joey • My Secret • Live | spotify:album:2iRFzWNrrMfUI1X9f5Ozwa | 2017-08-25 | day | 41 |
| 2 | 一百個我 國語新曲+精選 | spotify:album:2IRcqVOTFzU5Mbn7grINjB | 2016-12-16 | day | 31 |
| 3 | J-POP | spotify:album:7iX1exROLsCg1Qt0wyKZPl | 2016-06-16 | day | 12 |
| 4 | Joey Yung X Hacken Lee Concert 2015 (Live) | spotify:album:3Dc8vyw8nSjhOyZkNzExqc | 2015-12-24 | day | 55 |


In [8]:
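# One album has more than 50 tracks, which exceeds the per-request limit of sp.album_tracks; find out which one
# (enumerate is used because list.index() returns the first match only and breaks when track counts repeat)
album_ix = [ix for ix, n_tracks in enumerate(album_total_tracks) if n_tracks > 50]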

for i in album_ix:
    print(f"Album '{album_names[i]}' has more than 50 songs.")

Album 'Joey Yung X Hacken Lee Concert 2015 (Live)' has more than 50 songs.


## Get the songs from each album

In [9]:
def get_songs_from_album(uri):
    """ Store the first page (up to 50 tracks) of an album in `spotify_albums`.

        Relies on the globals `spotify_albums`, `album_names` and `album_count`.
    """
    album = uri # Assign album uri to a name

    # Create a nested dictionary of empty lists for this album
    spotify_albums[album] = {
        'album': [],
        'track_number': [],
        'name': [],
#         'id': [],
        'uri': [],
        'available_markets': [],
        'duration_ms': [],
    }

    tracks = sp.album_tracks(album) # Pull data on album tracks (at most 50 per request)
    for track in tracks['items']: # For each song track
        spotify_albums[album]['album'].append(album_names[album_count]) # Album name tracked via album_count
        spotify_albums[album]['track_number'].append(track['track_number'])
        spotify_albums[album]['name'].append(track['name'])
#         spotify_albums[album]['id'].append(track['id'])
        spotify_albums[album]['uri'].append(track['uri'])
        spotify_albums[album]['available_markets'].append(track['available_markets'])
        spotify_albums[album]['duration_ms'].append(track['duration_ms'])

In [10]:
spotify_albums = {}
album_count = 0

for i in album_uris:
    get_songs_from_album(i)
    print(f"Album {album_names[album_count]} completed. This is album number {album_count+1}.")
    album_count += 1

Album 答案之書 completed. This is album number 1.
Album Joey • My Secret • Live completed. This is album number 2.
Album 一百個我 國語新曲+精選 completed. This is album number 3.
Album J-POP completed. This is album number 4.
Album Joey Yung X Hacken Lee Concert 2015 (Live) completed. This is album number 5.
Album 1314 容祖兒演唱會 completed. This is album number 6.
Album All Delicious Collection completed. This is album number 7.
Album Hopelessly Romantic Collection completed. This is album number 8.
Album 小日子 completed. This is album number 9.
Album Moment completed. This is album number 10.
Album Joey & Joey 新城容祖兒音樂會 completed. This is album number 11.
Album Joey & Joey completed. This is album number 12.
Album Joey Yung Concert Number 6 completed. This is album number 13.
Album Perfect 10 Live 2009 completed. This is album number 14.
Album A Time For Us completed. This is album number 15.
Album 新城唱好 容祖兒 黃耀明 祖戀明歌音樂會 completed. This is album number 16.
Album 很忙 completed. This is album number 17.
Album 

In [11]:
# Add the missing songs from album 'Joey Yung X Hacken Lee Concert 2015 (Live)'
# This is a manual step that only handles the first album in album_ix; a general fix would page through every album
long_album_uri = album_uris[album_ix[0]]
additional_tracks = sp.album_tracks(long_album_uri, offset=50) # 50 is the return limit set by Spotify

for track in additional_tracks['items']:
    spotify_albums[long_album_uri]['album'].append(album_names[album_ix[0]])
    spotify_albums[long_album_uri]['track_number'].append(track['track_number'])
    spotify_albums[long_album_uri]['name'].append(track['name'])
#     spotify_albums[long_album_uri]['id'].append(track['id'])
    spotify_albums[long_album_uri]['uri'].append(track['uri'])
    spotify_albums[long_album_uri]['available_markets'].append(track['available_markets'])
    spotify_albums[long_album_uri]['duration_ms'].append(track['duration_ms'])

In [14]:
assert len(spotify_albums[album_uris[album_ix[0]]]['name']) == album_total_tracks[album_ix[0]] # sanity check to match with total number of tracks

## Get audio features for each song

In [15]:
def get_audio_features(album):
    """ Add audio features and popularity for every track of an album in `spotify_albums`. """
    feature_names = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                     'liveness', 'loudness', 'speechiness', 'tempo', 'valence',
                     'key', 'mode', 'time_signature']

    # Add new key-values to store audio features and popularity
    for feature in feature_names + ['popularity']:
        spotify_albums[album][feature] = []

    for track in spotify_albums[album]['uri']:
        # Pull audio features per track (the API returns a list with one entry)
        features = sp.audio_features(track)[0]

        # Append to relevant key-value
        for feature in feature_names:
            spotify_albums[album][feature].append(features[feature])

        # Popularity is stored on the track object, not in the audio features
        pop = sp.track(track)
        spotify_albums[album]['popularity'].append(pop['popularity'])
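
The function above sends two requests per track. As a hedged alternative, the sketch below batches the calls, assuming the documented limits of 100 tracks per `audio_features` request and 50 per `tracks` request. `chunked`, `fetch_features_and_popularity` and `FakeClient` are hypothetical names; the fake client mimics only the shape of spotipy's responses.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` elements each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def fetch_features_and_popularity(sp, track_uris):
    """Return {uri: (audio_features_or_None, popularity)} using batched requests."""
    features, popularity = {}, {}
    for batch in chunked(track_uris, 100):  # audio_features: up to 100 tracks
        for uri, feat in zip(batch, sp.audio_features(batch)):
            features[uri] = feat  # None when Spotify has no analysis for the track
    for batch in chunked(track_uris, 50):  # tracks: up to 50 tracks
        for uri, track in zip(batch, sp.tracks(batch)["tracks"]):
            popularity[uri] = track["popularity"]
    return {uri: (features[uri], popularity[uri]) for uri in track_uris}


class FakeClient:
    """Stand-in for a spotipy client; counts how many requests were made."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        return [{"tempo": 120.0} for _ in tracks]

    def tracks(self, tracks):
        self.calls += 1
        return {"tracks": [{"popularity": 10} for _ in tracks]}


client = FakeClient()
uris = [f"spotify:track:{n}" for n in range(120)]
result = fetch_features_and_popularity(client, uris)
print(client.calls, "requests instead of", 2 * len(uris))
```

Batching also makes the `None` case explicit: a track without audio features would need to be skipped or filled with missing values before it is appended to the per-album lists.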

In [16]:
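# Pause for a random 2-5 seconds after every 5 albums to avoid sending too many requests to Spotify's API
# (the progress message says "playlists", but each item is an album)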
sleep_min = 2
sleep_max = 5
start_time = time.time()
request_count = 0

for i in spotify_albums:
    get_audio_features(i)
    request_count+=1
    if request_count % 5 == 0:
        print(str(request_count) + " playlists completed")
        time.sleep(np.random.uniform(sleep_min, sleep_max))
        print('Loop #: {}'.format(request_count))
        print('Elapsed Time: {} seconds'.format(time.time() - start_time))

5 playlists completed
Loop #: 5
Elapsed Time: 27.924896001815796 seconds
10 playlists completed
Loop #: 10
Elapsed Time: 53.188470125198364 seconds
15 playlists completed
Loop #: 15
Elapsed Time: 72.68922185897827 seconds
20 playlists completed
Loop #: 20
Elapsed Time: 86.9282329082489 seconds
25 playlists completed
Loop #: 25
Elapsed Time: 110.11194610595703 seconds
30 playlists completed
Loop #: 30
Elapsed Time: 128.70181393623352 seconds
35 playlists completed
Loop #: 35
Elapsed Time: 150.8106210231781 seconds
40 playlists completed
Loop #: 40
Elapsed Time: 165.01903891563416 seconds
45 playlists completed
Loop #: 45
Elapsed Time: 179.303946018219 seconds
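
The loop above pauses on a fixed schedule whether or not Spotify is actually throttling. Another option is to wait only when a request fails and to lengthen the wait on each retry (exponential backoff). The sketch below is generic: `call_with_backoff` and `RateLimited` are hypothetical names, and with spotipy the exception to retry on would presumably be `spotipy.exceptions.SpotifyException`.

```python
import random
import time


def call_with_backoff(func, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on an exception of type `retry_on`, wait and try again.

    The wait doubles after each failed attempt and gets up to 1 second of
    random jitter. The last failure is re-raised once `max_attempts` is reached.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))


class RateLimited(Exception):
    """Hypothetical error standing in for a 'too many requests' response."""


attempts = []
waits = []


def flaky_request():
    """Fail twice, then succeed, to imitate a temporarily throttled endpoint."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return "ok"


# Record the waits instead of sleeping so the demo finishes instantly
outcome = call_with_backoff(flaky_request, RateLimited, sleep=waits.append)
print(outcome, len(waits))
```

In the notebook, `func` could be something like `lambda: get_audio_features(album_uri)`, although a retry would restart that album from its first track.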


## Put everything into a dictionary and convert into dataframe

In [17]:
# Column order of the track-level dataframe
columns = ['album', 'track_number', 'name', 'uri', 'available_markets', 'duration_ms',
           'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness',
           'loudness', 'speechiness', 'tempo', 'valence', 'popularity', 'key', 'mode',
           'time_signature']

dic_df = {column: [] for column in columns}

for album in spotify_albums: 
    for feature in spotify_albums[album]:
        dic_df[feature].extend(spotify_albums[album][feature])
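
pandas refuses to build a dataframe from lists of unequal length, but its error does not say which column is short. A small check before the conversion can make that easier to debug. This is a sketch with hypothetical helper names (`column_lengths`, `assert_equal_lengths`) and toy data.

```python
def column_lengths(columns):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in columns.items()}


def assert_equal_lengths(columns):
    """Raise ValueError listing the lengths when the columns differ in size."""
    lengths = column_lengths(columns)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns have different lengths: {lengths}")


# Toy data: 'tempo' is one value short, as could happen if a request failed midway
toy = {"name": ["a", "b"], "uri": ["u1", "u2"], "tempo": [120.0]}
try:
    assert_equal_lengths(toy)
except ValueError as error:
    print(error)
```

Calling `assert_equal_lengths(dic_df)` right after the loop would point at the feature whose list came up short.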

In [18]:
df_track = pd.DataFrame.from_dict(dic_df)

In [22]:
df_final = df_track.merge(df_album, left_on='album', right_on='album_names', how='left') # Add album info as well
df_final.drop(columns=['album_names'], inplace=True)
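
Because the merge key is the album name rather than the album URI, two albums with the same title would duplicate track rows in `df_final`. A quick check for repeated names is sketched below with a hypothetical helper, `duplicated_values`, and made-up titles.

```python
from collections import Counter


def duplicated_values(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


# Made-up titles: "Album A" appears twice, as a reissue might
names = ["Album A", "Album B", "Album A"]
print(duplicated_values(names))
```

Running `duplicated_values(album_names)` before the merge should return an empty list. pandas can enforce the same condition by passing `validate='many_to_one'` to `merge`, which raises an error when the right-hand keys are not unique.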

In [35]:
df_final.head()

| | album | track_number | name | uri | available_markets | duration_ms | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | popularity | key | mode | time_signature | album_uris | album_release_date | album_release_date_precision | album_total_trakcs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 答案之書 | 1 | 優秀 | spotify:track:1ZBUhpscaXX1Q35RmMBR4v | [AR, AT, AU, BE, BG, BO, BR, CA, CH, CL, CO, C... | 217815 | 0.0443 | 0.581 | 0.849 | 8.3e-05 | 0.0642 | -5.835 | 0.0353 | 128.136 | 0.402 | 18 | 9 | 1 | 4 | spotify:album:0zPpdDiDX6JnGZRHXFYuZt | 2018-10-19 | day | 10 |
| 1 | 答案之書 | 2 | 亞亞亞 | spotify:track:2FKxsi0BDOZCYBSheFCb0f | [AR, AT, AU, BE, BG, BO, BR, CA, CH, CL, CO, C... | 183077 | 0.00637 | 0.524 | 0.829 | 8.7e-05 | 0.0627 | -5.081 | 0.076 | 171.975 | 0.391 | 15 | 10 | 0 | 4 | spotify:album:0zPpdDiDX6JnGZRHXFYuZt | 2018-10-19 | day | 10 |
| 2 | 答案之書 | 3 | 綁夢 | spotify:track:1fMPS7ewHOKFaMI0iDhntr | [AR, AT, AU, BE, BG, BO, BR, CA, CH, CL, CO, C... | 200887 | 0.229 | 0.411 | 0.606 | 0.0 | 0.227 | -7.09 | 0.233 | 74.411 | 0.51 | 15 | 1 | 1 | 5 | spotify:album:0zPpdDiDX6JnGZRHXFYuZt | 2018-10-19 | day | 10 |
| 3 | 答案之書 | 4 | 孤單喧嘩 | spotify:track:0iyG2K02EtnTYkJAurcyT5 | [AR, AT, AU, BE, BG, BO, BR, CA, CH, CL, CO, C... | 260706 | 0.307 | 0.528 | 0.451 | 0.0 | 0.101 | -8.345 | 0.0395 | 73.059 | 0.427 | 18 | 0 | 1 | 4 | spotify:album:0zPpdDiDX6JnGZRHXFYuZt | 2018-10-19 | day | 10 |
| 4 | 答案之書 | 5 | 容光 | spotify:track:5t0f2DIs0qMSjADpWxqDSB | [AR, AT, AU, BE, BG, BO, BR, CA, CH, CL, CO, C... | 272364 | 0.861 | 0.393 | 0.299 | 0.0 | 0.0986 | -11.124 | 0.0368 | 118.469 | 0.269 | 19 | 2 | 1 | 4 | spotify:album:0zPpdDiDX6JnGZRHXFYuZt | 2018-10-19 | day | 10 |


## Save the dataframe to local folder

In [39]:
df_final.to_csv('data/joey_yung_album_tracks_asof_20200510.csv', index=False)

In [43]:
df_final.to_parquet('data/joey_yung_album_tracks_asof_20200510.parquet')
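
Both calls fail if the `data/` folder does not exist, and `to_parquet` additionally needs a parquet engine such as `pyarrow` or `fastparquet` installed. The sketch below builds the dated file name and creates the folder first. `output_path` is a hypothetical helper, and the demo writes into a temporary directory so that it leaves nothing behind.

```python
import tempfile
from datetime import date
from pathlib import Path


def output_path(base_dir, artist_slug, as_of, suffix):
    """Create `base_dir` if needed and return the dated output path inside it.

    The file name follows the pattern used in this notebook:
    <artist>_album_tracks_asof_<YYYYMMDD><suffix>
    """
    folder = Path(base_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{artist_slug}_album_tracks_asof_{as_of:%Y%m%d}{suffix}"


with tempfile.TemporaryDirectory() as tmp:
    path = output_path(Path(tmp) / "data", "joey_yung", date(2020, 5, 10), ".csv")
    folder_created = path.parent.is_dir()
    print(path.name)
```

In the notebook this could be used as `df_final.to_csv(output_path('data', 'joey_yung', date.today(), '.csv'), index=False)`, which also keeps the as-of date in the file name in sync with the day the data was pulled.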