## Spotify API Exploratory Analysis
Credit to the [Spotipy package](https://spotipy.readthedocs.io/en/latest/) for making this analysis a breeze.

The aim of this notebook is to:
1. Get a feel for the Spotify API and its structure.
2. Do some exploratory analysis of my Top Tracks.
3. Visualize the data: [Spotify API Analysis on Tableau](https://public.tableau.com/profile/william8331#!/vizhome/SpotifyMyTracks/TopTracks?publish=yes)
4. Get some ideas for a potential app that could be built.



## Section 1: Authenticating and Getting the User's Top Tracks

The spotipy API wrappers we'll be using are listed below. Please read the [API docs](https://developer.spotify.com/documentation/web-api/reference/personalization/get-users-top-artists-and-tracks/) to get an understanding of the payload:
1. spotify.current_user_top_tracks: Gets the user's top 50 tracks.
2. spotify.tracks: Track-level information for the top 50 tracks.
3. spotify.artists: Artist and genre information.
4. spotify.audio_features: Audio Features for each Track.
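
One practical detail when combining these wrappers: the underlying endpoints cap how many IDs a single call may carry (50 for tracks and artists, 100 for audio features). The sketch below shows one way to split a list of IDs into batches. `chunked` is a hypothetical helper written for this example, not part of spotipy, and the IDs are placeholders.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs stand in for real Spotify track IDs.
track_ids = [f"track_{n}" for n in range(120)]

# Batches of 50 would suit spotify.tracks and spotify.artists.
batches = list(chunked(track_ids, 50))
print([len(batch) for batch in batches])
```

Each batch could then be passed to a wrapper such as `spotify.tracks(batch)` and the results concatenated.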

In [None]:
#Dependencies
import spotipy
import spotipy.util as util
from IPython.display import JSON
import pandas as pd
import os
import json
import ast
from pandas import json_normalize  #pandas.io.json.json_normalize is deprecated since pandas 1.0
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
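#If you want to use pre-cached results instead of calling the API.
#results_dir is only defined in the next cell, so the path is spelled out here.
df_tracks=pd.read_csv('results/SpotifyMyTracks.csv')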

In [None]:
#Client ID/Secret stored in the environment variables. 
#Don't want some random to start fiddling with your Spotify account!
CLIENT_ID=os.getenv('sp_client_id')
CLIENT_SECRET=os.getenv('sp_client_secret')
results_dir='results/'

In [None]:
#Authorize the user via getting a token and returning a spotify object to be used for querying the API. 
#Token Lasts for about an hour or so. Scopes need to be defined to grant your app access to various endpoints.
def sp_authorize():
    scope = 'user-library-read user-top-read user-read-playback-state user-read-recently-played'
    username='wjia26'
    token = util.prompt_for_user_token(username,scope,
                               client_id=CLIENT_ID,
                               client_secret=CLIENT_SECRET,
                               redirect_uri='https://google.com')
    spotify = spotipy.Spotify(auth=token)
    return spotify

In [None]:
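#Maps integer keys (pitch classes 0-11) to human-readable keys.
#Spotify uses -1 when no key was detected, which would otherwise silently index the last list entry ("B").
def int_to_key(key_int):
    key_list = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    if pd.isna(key_int) or not 0 <= int(key_int) < len(key_list):
        return None
    return key_list[int(key_int)]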

In [None]:
#Returns the keys that start with the given prefix, with that prefix removed.
def get_unprefixed_keys(tracks_dict,prefix=''):
    keys=[key[len(prefix):] for key in tracks_dict if key.lower().startswith(prefix)]
    return keys

In [None]:
def top_tracks_to_df(time_range='short_term'):
    '''
    Takes a time_range (short_term/medium_term/long_term) and returns a dataframe with the top 50 tracks for that time range.

    In hindsight I probably could've used json_normalize to do this with less code.
    The approach here is to store each value in one large dictionary of lists and convert the populated dict into a dataframe.

    I first hit the spotify.current_user_top_tracks API to grab all the track IDs. Then I use the other three wrappers to
    gather more data about each track.

    Each field is prefixed with the API it came from:
    1. track_ is from the spotify.tracks API
    2. artist_ is from the spotify.artists API
    3. ft_ is from the spotify.audio_features API
    '''
    spotify=sp_authorize()
    # list of track fields
    tracks_dict={
        'rank': [],
        'time_range': [],
        'track_id': [],
        'track_name': [],
        'track_popularity': [],
        'track_release_date': [],
        'artist_genre': [],
        'artist_name': [],
        'ft_danceability': [],
        'ft_energy': [],
        'ft_key': [],
        'ft_loudness': [],
        'ft_mode': [],
        'ft_speechiness': [],
        'ft_acousticness': [],
        'ft_instrumentalness': [],
        'ft_liveness': [],
        'ft_valence': [],
        'ft_tempo': []
    }
    #Fetch the user's top tracks and collect the track ids.
    #Only the first page (at most 50 tracks) is used: spotify.tracks and spotify.artists accept at most 50 ids per call.
    top_tracks_data = spotify.current_user_top_tracks(limit=50, offset=0,
                                              time_range=time_range)
    track_ids=[item['id'] for item in top_tracks_data['items']]

    #tracks data payload json
    album_ids=[]
    artist_ids=[]

    track_data=spotify.tracks(track_ids)
    rank=0
    
    for track in track_data['tracks']:
        rank=rank+1
        album_ids.append(track['album']['id'])  
        artist_ids.append(track['artists'][0]['id']) 
        #Just grab the first artist to get genre
        tracks_dict['rank'].append(rank)
        tracks_dict['track_id'].append(track['id'])    
        tracks_dict['track_name'].append(track['name'])
        tracks_dict['track_popularity'].append(track['popularity'])
        tracks_dict['track_release_date'].append(track['album']['release_date'])
        tracks_dict['time_range'].append(time_range)

    #get genres for each track through the artists
    artists_data=spotify.artists(artist_ids)
    for artist in artists_data['artists']:
        tracks_dict['artist_genre'].append(artist['genres'])
        tracks_dict['artist_name'].append(artist['name'])

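    #get Audio features for each track
    #audio_features can return None for a track without feature data, so fall back to None values.
    features_data=spotify.audio_features(track_ids)
    keys=get_unprefixed_keys(tracks_dict,prefix='ft_')
    for features in features_data:
        for key in keys:
            tracks_dict['ft_'+key].append(features[key] if features else None)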

    df = pd.DataFrame(tracks_dict)   
    
    return df
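
The docstring above mentions that `json_normalize` could do the flattening with less code. Here is a minimal sketch of that idea, run on a mock payload shaped like the `items` list of the top-tracks response. The track values are invented for illustration, and only the first artist is kept, as in the function above.

```python
import pandas as pd

# Mock payload; the field names follow the top-tracks response, the values are made up.
top_tracks_data = {
    "items": [
        {"id": "id_1", "name": "Song A", "popularity": 70,
         "album": {"id": "alb_1", "release_date": "2019-01-01"},
         "artists": [{"id": "art_1", "name": "Artist A"}]},
        {"id": "id_2", "name": "Song B", "popularity": 55,
         "album": {"id": "alb_2", "release_date": "2020-06-15"},
         "artists": [{"id": "art_2", "name": "Artist B"},
                     {"id": "art_3", "name": "Artist C"}]},
    ]
}

# Nested dicts become dotted columns (album.release_date); lists stay as list-valued cells.
df_flat = pd.json_normalize(top_tracks_data["items"])
df_flat = df_flat.rename(columns={
    "id": "track_id",
    "name": "track_name",
    "popularity": "track_popularity",
    "album.release_date": "track_release_date",
})
# The API returns tracks in rank order, so the row position gives the rank.
df_flat["rank"] = range(1, len(df_flat) + 1)
# Keep only the first artist per track.
df_flat["artist_name"] = df_flat["artists"].apply(lambda artists: artists[0]["name"])
df_flat = df_flat[["rank", "track_id", "track_name", "track_popularity",
                   "track_release_date", "artist_name"]]
print(df_flat)
```

Genres and audio features would still need their own calls, since they are not part of the top-tracks payload.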

In [None]:
#Concatenate as one dataframe
df1=top_tracks_to_df('short_term')
df2=top_tracks_to_df('medium_term')
df3=top_tracks_to_df('long_term')
df_tracks=pd.concat([df1,df2,df3])
#Convert all integer key signatures to human-readable key signatures.
df_tracks['key_note']=df_tracks['ft_key'].apply(int_to_key)

In [None]:
#Every track has a list of genres. This unnests the genres so there is one genre per row.
#Track ids do get duplicated, but we can handle that on the Tableau side. This is mainly so we can display at the genre level.
genre_dict={
    'rank': [],
    'time_range': [],
    'track_id': [],
    'track_name': [],
    'track_popularity': [],
    'track_release_date': [],
    'artist_genre': [],
    'artist_name': [],
    'key_note': [],
    'ft_danceability': [],
    'ft_energy': [],
    'ft_key': [],
    'ft_loudness': [],
    'ft_mode': [],
    'ft_speechiness': [],
    'ft_acousticness': [],
    'ft_instrumentalness': [],
    'ft_liveness': [],
    'ft_valence': [],
    'ft_tempo': [],
    'number_of_repeats': []
}

for index,track in df_tracks.iterrows():
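    #artist_genre is a list when df_tracks comes straight from the API and a string when read back from CSV.
    raw_genres=track['artist_genre']
    genre_list=raw_genres if isinstance(raw_genres,list) else ast.literal_eval(raw_genres)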
    for genre in genre_list:
        genre_dict['rank'].append(track['rank'])
        genre_dict["time_range"].append(track["time_range"])
        genre_dict['track_id'].append(track['track_id'])    
        genre_dict['track_name'].append(track['track_name'])
        genre_dict['track_popularity'].append(track['track_popularity'])
        genre_dict['track_release_date'].append(track['track_release_date'])
        genre_dict['artist_name'].append(track['artist_name'])
        genre_dict["ft_danceability"].append(track["ft_danceability"])
        genre_dict["ft_energy"].append(track["ft_energy"])
        genre_dict["ft_key"].append(track["ft_key"])
        genre_dict["ft_loudness"].append(track["ft_loudness"])
        genre_dict["ft_mode"].append(track["ft_mode"])
        genre_dict["ft_speechiness"].append(track["ft_speechiness"])
        genre_dict["ft_acousticness"].append(track["ft_acousticness"])
        genre_dict["ft_instrumentalness"].append(track["ft_instrumentalness"])
        genre_dict["ft_liveness"].append(track["ft_liveness"])
        genre_dict["ft_valence"].append(track["ft_valence"])
        genre_dict["ft_tempo"].append(track["ft_tempo"])
        genre_dict["key_note"].append(track["key_note"])
        genre_dict['artist_genre'].append(genre)
        genre_dict["number_of_repeats"].append(len(genre_list))
    
df_genre = pd.DataFrame(genre_dict)   
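
The unnesting loop above could probably be written more compactly with `DataFrame.explode`. The sketch below runs on a small made-up frame (`df_demo` and its values are assumptions for illustration) and drops tracks with no genres first, because the loop produces no rows for them.

```python
import pandas as pd

# Made-up stand-in for df_tracks with a list-valued artist_genre column.
df_demo = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "artist_genre": [["pop", "dance pop"], ["rock"], []],
})

# Count genres per track before unnesting, mirroring number_of_repeats.
df_demo["number_of_repeats"] = df_demo["artist_genre"].apply(len)

# Drop tracks without genres; explode would otherwise emit a NaN row for them.
df_demo = df_demo[df_demo["number_of_repeats"] > 0]

# One row per (track, genre) pair; all other columns are repeated.
df_genre_demo = df_demo.explode("artist_genre").reset_index(drop=True)
print(df_genre_demo)
```

All remaining columns are carried along automatically, so there is no need to list each field by hand.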


In [None]:
#Final Output to be used in the Tableau workbook
df_genre.to_csv('SpotifyMyTracks.csv')

## End of first section

## Section 2
### Audio Analysis Component (still under construction):
Let's look at the nitty-gritty musical analysis for each track. The output below isn't visualized in Tableau, as I haven't been
able to extract any interesting findings from it. In my experience looking at the data, the Audio Analysis generally isn't very accurate.
Uses the [Audio Analysis API](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/)
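
To give a feel for the payload before calling the API, here is a minimal sketch that flattens a mock audio-analysis response into a dataframe. The dict below mimics only the `track` and `sections` parts of the response and its numbers are invented; `share_of_track` is a hypothetical extra column added for illustration.

```python
import pandas as pd

# Mock audio-analysis payload with two sections; the values are made up.
analysis_data = {
    "track": {"duration": 200.0},
    "sections": [
        {"start": 0.0, "duration": 30.0, "key": 0, "loudness": -10.5, "tempo": 120.0},
        {"start": 30.0, "duration": 170.0, "key": 7, "loudness": -8.0, "tempo": 121.5},
    ],
}

# A list of flat dicts converts directly into one row per section.
df_sections = pd.DataFrame(analysis_data["sections"])
df_sections["track_duration"] = analysis_data["track"]["duration"]

# Fraction of the track covered by each section.
df_sections["share_of_track"] = df_sections["duration"] / df_sections["track_duration"]
print(df_sections)
```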

In [None]:
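def section_output_df(track_id,analysis_dict):
    spotify=sp_authorize()
    #Audio Analysis
    analysis_data=spotify.audio_analysis(track_id)

    #Build fresh lists for this track. Appending to analysis_dict directly would carry
    #sections over from previously processed tracks into every later dataframe.
    keys=get_unprefixed_keys(analysis_dict,prefix='')
    sections_dict={key: [] for key in keys}
    for section in analysis_data['sections']:
        for key in keys:
            sections_dict[key].append(section[key])

    df1 = pd.DataFrame(sections_dict)
    df1['track_duration']=analysis_data['track']['duration']

    return df1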

In [None]:
analysis_dict={
    'confidence': [],
    'duration': [],
    'key': [],
    'key_confidence': [],
    'loudness': [],
    'mode': [],
    'mode_confidence': [],
    'start': [],
    'tempo': [],
    'tempo_confidence': [],
    'time_signature': [],
    'time_signature_confidence': []
}

df_analysis = pd.DataFrame(analysis_dict)    

#Get all unique tracks. Some might be included in both short term and long term.
#drop_duplicates keeps the first row (and its time_range) for each track_id.
df_unique_tracks=df_tracks.drop_duplicates(subset='track_id')

for index,track in df_unique_tracks.iterrows():
    df2=section_output_df(track['track_id'],analysis_dict)
    df2['track_name']=track['track_name']
    df2['track_id']=track['track_id']
    df2['time_range']=track['time_range']
    df_analysis=pd.concat([df_analysis,df2])
    print(str(index) + ' DONE!!!!! ' + track['track_name'])

#Convert to human-readable key
df_analysis['key_note']=df_analysis['key'].apply(int_to_key)

In [None]:
df_analysis.to_csv('SpotifyMyTracksAnalysis.csv')

In [None]:
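#Exploratory stats about the sections of the music.
#df_analysis holds the section-level data; df1 is the short_term top-tracks frame and has no such columns.
print('Loudness: ')
print(df_analysis['loudness'].describe())
print('Tempo: ')
print(df_analysis['tempo'].describe())
print('Key: ')
print(df_analysis['key'].unique())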

In [None]:
segments_dict={
    "start": [],
    "duration": [],
    "confidence": [],
    "loudness_start": [],
    "loudness_max_time": [],
    "loudness_max": [],
    "pitches": [],
    "timbre": []
}
spotify=sp_authorize()
#Assumption: track_1 is never defined in this notebook, so the first track id in df_tracks is used here.
track_1=df_tracks['track_id'].iloc[0]
analysis_data=spotify.audio_analysis(track_1)
keys=get_unprefixed_keys(segments_dict,prefix='')
for segment in analysis_data['segments']:
    for key in keys:
        segments_dict[key].append(segment[key])

df2 = pd.DataFrame(segments_dict)

df2

Valid time_range values: long_term, medium_term (approximately last 6 months), short_term (approximately last 4 weeks). Default: medium_term.

In [None]:
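#The spotify client only exists inside the helper functions above, so create one for the cells below.
spotify=sp_authorize()
for time_range in ('short_term','medium_term','long_term'):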
    results = spotify.current_user_top_artists(limit=20, offset=0,
                                              time_range=time_range)
    print('\n' + time_range + '\n')
    for item in results['items']:
        print(item['name'], item['popularity'], item['genres'])
    
    print(results['total'])

In [None]:
results = spotify.artist_related_artists(artist_id='0fTav4sBLmYOAzKuJw0grL')
for item in results['artists']:
    print(item['name'], item['popularity'], item['genres'])