# My Music Preferences Based on Spotify Data
For my assignment I decided to explore the Spotify API and collect data about my own music preferences.
I first obtain the dataset, which contains all tracks from all artists whose tracks I liked. Then I rearrange that dataset from a fairly complex JSON structure into a plain data frame. Finally, I make a couple of charts based on the cleansed data.

In [None]:
#install needed packages
!pip install spotipy

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sys
import spotipy
import spotipy.util as util
import json

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
spotify_client_id = user_secrets.get_secret("spotify_client_id")
spotify_client_secret = user_secrets.get_secret("spotify_client_secret")

## Data Retrieval
The following two snippets of code obtain data from the Spotify API.
This code is not required for the rest of the notebook to work, as the dataset is saved as a JSON file, but I am leaving it here to demonstrate the logic behind data retrieval.
The first snippet will ***not*** work on Kaggle. It performs OAuth2 authentication of the Spotify user, which involves opening a browser window on the local machine. You can't do that from inside a Docker container; at least, I didn't find a way. Instead, I run the first snippet on my local machine, generate the token, and then paste the token into the second snippet.
The second snippet obtains the user's top 50 tracks and then gets all tracks from the artists featured in the top 50. The resulting data is stored as a JSON file.

In [None]:
# This script obtains the user authorization token. Run it on a local machine, then paste the resulting token into the following code block

#define necessary level of API permissions
scope = 'user-library-read user-top-read'

#define user id (this is mine and it is associated with my API credentials stored in secrets)
user_id='12175561893'

#obtain authorization token
token = util.prompt_for_user_token(user_id,
                           scope,
                           client_id=spotify_client_id,
                           client_secret=spotify_client_secret,
                           redirect_uri='http://localhost:8080')





In [None]:
# This code does the following:
# 1. Obtains 50 of my favorite tracks
# 2. Determines artists for each of those tracks
# 3. Obtains all albums of those artists
# 4. Obtains information about all songs from all those albums. The list is saved as a json file as the first half of the data set
# 5. Obtains audio features for each of these songs and saves them as the second half of the data set
#-------------
To prevent this code from running accidentally (it takes forever to execute), this line is left uncommented. Comment it out before running the code
#-------------
token = 'BQCr_uNdj4N_sxq8I-oVXT6yePDWUdRY2Fns8zEtAjoASfx7W13dgNjq-YzyYqnN_3B3zKKqJF6VYcBNl4sF-_aKg78iiYbc_YcC8bi-uRpvCfMTgRpbLJ5SR8kw2Vjn5FlJ46Q-dS4Nzsi6vlcSiHGyKNubqOJbvME'
#construct API handler
sp = spotipy.Spotify(auth=token)

#get my favorite tracks
fav_tracks = sp.current_user_top_tracks(limit=50, time_range='long_term')

#pull artists from my favorite tracks
artist_list = []
for track in fav_tracks['items']:
    for artist in track['artists']:
        artist_list.append(artist['id'])

#pull all albums by all artists from my favorite tracks
album_list=[]
for artist in artist_list:
    albums = sp.artist_albums(artist)
    for album in albums['items']:
        album_list.append(album['id'])

#Pull all tracks for all albums by all artists from my favorite tracks
#This is our dataset. Unlike the previous steps, we need more than just a list of IDs, so we are going to use a dictionary
final_track_list={}
for album in album_list:
    album_tracks = sp.album_tracks(album)
    for track in album_tracks['items']:
        final_track_list[track['id']]=track #this is general track information
        final_track_list[track['id']]['audio_features']=sp.audio_features(track['id']) #this is track analysis

#Finally, save the resulting data structure as json (the with block closes the file once writing is done)
with open('final_track_list.json','w') as fp:
    json.dump(final_track_list, fp)
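
The retrieval loop above reads only the first page of each result and requests audio features one track at a time. The sketch below shows one possible way to follow paged results and to batch the feature requests. It assumes that `Spotify.next()` returns `None` when there are no further pages and that `audio_features()` accepts up to 100 IDs and returns one entry per ID in request order. `iter_items`, `fetch_audio_features`, and `FakeClient` are hypothetical names introduced here for illustration; `FakeClient` is an offline stand-in, so no token is needed to try it.

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_items(client: Any, page: Optional[dict]) -> Iterator[dict]:
    """Yield items from a paged result, following 'next' links until exhausted."""
    while page:
        yield from page["items"]
        # Assumption: client.next(page) fetches the following page of results
        page = client.next(page) if page.get("next") else None


def fetch_audio_features(client: Any, track_ids: List[str],
                         batch_size: int = 100) -> Dict[str, Optional[dict]]:
    """Map each track ID to its audio features, requesting them in batches."""
    features: Dict[str, Optional[dict]] = {}
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        # Assumption: one entry per requested ID, in the same order
        for track_id, feature in zip(batch, client.audio_features(batch)):
            features[track_id] = feature
    return features


class FakeClient:
    """Offline stand-in for spotipy.Spotify, used only to exercise the helpers."""

    def __init__(self) -> None:
        self.feature_calls = 0
        self._second_page = {"items": [{"id": "album3"}], "next": None}

    def next(self, page: dict) -> Optional[dict]:
        return self._second_page if page.get("next") else None

    def audio_features(self, ids: List[str]) -> List[dict]:
        self.feature_calls += 1
        return [{"id": i, "tempo": 120.0} for i in ids]


fake = FakeClient()
first_page = {"items": [{"id": "album1"}, {"id": "album2"}], "next": "page-2"}
album_ids = [album["id"] for album in iter_items(fake, first_page)]

track_ids = [f"track{n}" for n in range(250)]
features = fetch_audio_features(fake, track_ids)
print(album_ids, len(features), fake.feature_calls)
```

With a real client, the first page would come from a call such as `sp.artist_albums(artist)`, and `sp` would take the place of `fake`.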

## Data Cleansing
We start by opening the file and flattening the structure. We also get rid of the columns that are not interesting to us.
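
To make the flattening step easier to follow, here is a minimal sketch on two made-up records shaped like the saved JSON (the values are illustrative, not taken from the real dataset). It uses `reset_index` after exploding instead of copying the index over, which is a small simplification of the approach used in this notebook.

```python
import pandas as pd

# Two made-up records: a list of artists and a one-element list of
# audio features per track (illustrative values only)
records = [
    {"id": "t1", "name": "Song A",
     "artists": [{"id": "a1", "name": "X"}, {"id": "a2", "name": "Y"}],
     "audio_features": [{"tempo": 120.0, "key": 9}]},
    {"id": "t2", "name": "Song B",
     "artists": [{"id": "a1", "name": "X"}],
     "audio_features": [{"tempo": 98.5, "key": 2}]},
]

flat = pd.json_normalize(records)
# explode() gives every list element its own row; reset_index() makes the
# row labels unique again so the pieces line up when concatenated
flat = flat.explode("artists").explode("audio_features").reset_index(drop=True)

# json_normalize() expands the per-row dicts into separate columns
artists = pd.json_normalize(flat["artists"].tolist()).add_prefix("artist.")
features = pd.json_normalize(flat["audio_features"].tolist()).add_prefix("audio.")

toy = pd.concat(
    [flat.drop(columns=["artists", "audio_features"]), artists, features],
    axis=1,
)
print(toy)
```

The track with two artists ends up as two rows, which is why the charts later group by track `id` again.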

In [None]:

#opening json file
ftl = pd.json_normalize(json.load(fp=open('/kaggle/input/final_track_list.json','r')).values())
#exploding internal structures to columns for artists and audio features
final_dataset=ftl.explode('artists').explode('audio_features')
#one more level of nesting
art = pd.json_normalize(final_dataset['artists'])
feat = pd.json_normalize(final_dataset['audio_features'])
art.index = final_dataset.index
feat.index = final_dataset.index
#adding new columns to final dataset and dropping the garbage
garbage=['href', 'uri', 'external_urls.spotify', 'artists', 'available_markets', 
         'audio_features', 'preview_url', 'type','audio.analysis_url', 
         'audio.duration_ms','artist.href','artist.type','artist.uri',
         'artist.external_urls.spotify', 'audio.type', 'audio.id', 'audio.uri',
         'audio.track_href', 'artist.id']
final_dataset=pd.concat([final_dataset,feat.add_prefix('audio.')], axis=1)
final_dataset=pd.concat([final_dataset,art.add_prefix('artist.')], axis=1).drop(columns=garbage)
final_dataset.head(10)

## Data Visualization

In [None]:
#let's summon the forces of matplot
import matplotlib.pyplot as plt

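#for visualization purposes we want to get rid of extra rows caused by exploding the lists (such as when there are multiple artists for a track)
#numeric_only=True skips text columns, which newer pandas versions refuse to average
charts = final_dataset.groupby(['id']).mean(numeric_only=True)
def audio_key(key):
    #Spotify uses pitch-class notation (0 = C ... 11 = B) and -1 when no key was detected
    all_keys=['C', 'C#', 'D','D#','E','F','F#','G','G#','A','A#','B']
    if pd.isna(key) or not 0 <= int(key) < len(all_keys): return 'Unknown'
    return all_keys[int(key)] #mean() returns floats, so cast back to int before indexing
def audio_meter(sig):
    #Spotify documents time_signature n (3 to 7) as n/4
    all_sigs=['Unknown','1/1','1/2','3/4','4/4','5/4']
    if pd.isna(sig) or not 0 <= int(sig) < len(all_sigs): return 'Unknown'
    return all_sigs[int(sig)]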

#on matplotlib 3.6 and later this style is called 'seaborn-v0_8-poster'
plt.style.use('seaborn-poster')
charts['audio.time_signature'].apply(audio_meter).value_counts().plot(kind='barh', title='Musical Meter of Tracks in The Dataset')
#plt.scatter(final_dataset['audio.key'],final_dataset['audio.tempo'], alpha=0.5, c='Purple') 

As we can see, the prevalent musical meter is 4/4.

In [None]:
charts['audio.key'].apply(audio_key).value_counts().sort_index().plot(kind='bar', use_index=True)

While there are no obvious favorite keys, some of them are certainly out of favor: D# is far less popular than G or A.

In [None]:
charts['audio.tempo'].hist(bins=200)

In [None]:
charts['audio.tempo'].value_counts(bins=200).nlargest(10)

The most popular tempo is 120 BPM, and the other popular tempos sit 10, 20, or 30 BPM below or above it. I would speculate that this is more common for electronic, or electronically produced, music, where you just turn a knob to change the BPM. "Natural" tempos usually align to x2 or x3. Only 128 BPM made it to the top 10.
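
One rough way to check the "round number" observation is to round each tempo to a whole BPM and count how many land on a multiple of 10. The sketch below uses a handful of made-up tempos purely for illustration; in this notebook the input would be `charts['audio.tempo']`.

```python
import pandas as pd

# Made-up tempos for illustration only (not taken from the real dataset)
tempos = pd.Series([119.98, 120.02, 120.0, 128.01, 99.97,
                    110.03, 93.4, 140.0, 87.6, 130.01])

# Round to whole BPM so near-identical estimates fall into the same bucket
rounded = tempos.round().astype(int)

# Most frequent whole-BPM values
top = rounded.value_counts().head(3)

# Share of tracks whose rounded tempo is a multiple of 10
share_multiple_of_10 = (rounded % 10 == 0).mean()

print(top)
print(share_multiple_of_10)
```

Rounding to whole BPM is a different bucketing than `value_counts(bins=200)`, so the two counts will not match exactly.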

In [None]:
charts.plot(kind='scatter',x='audio.tempo',y='audio.energy')

This is how track energy changes with tempo. Without doing much statistics, we can see that lower tempos may correspond to both low and high energy levels, while higher tempos almost always mean high-energy music.
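
For a slightly more quantitative look at the same pattern, one option is to split tracks into tempo bands and compare the spread of energy within each band. The sketch below uses made-up numbers and arbitrary band edges (100 and 150 BPM), both of which are assumptions for illustration; with the real data, `toy` would be replaced by `charts`.

```python
import pandas as pd

# Made-up tempo/energy pairs for illustration only
toy = pd.DataFrame({
    "audio.tempo":  [70, 75, 80, 85, 120, 125, 128, 170, 175, 180],
    "audio.energy": [0.10, 0.90, 0.35, 0.80, 0.40, 0.70, 0.95, 0.85, 0.90, 0.97],
})

# Arbitrary band edges chosen for the example
bands = pd.cut(toy["audio.tempo"], bins=[0, 100, 150, 250],
               labels=["slow", "medium", "fast"])

# A wide min-max range means the band mixes calm and energetic tracks
summary = toy.groupby(bands, observed=True)["audio.energy"].agg(["min", "median", "max"])
print(summary)

# Pearson correlation as a single-number summary of the overall trend
print(toy["audio.tempo"].corr(toy["audio.energy"]))
```

A wide energy range in the slow band together with a narrow, high range in the fast band would match the pattern described above.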

In [None]:
charts.plot(kind='scatter',x='audio.tempo',y='audio.danceability')

Finally, a track's suitability for dancing is associated with moderate tempos in the vicinity of 120 BPM. This is somewhat expected from the perspective of physics (the ability to keep up with the tempo decreases as body mass increases), but it could also be a reason for the popularity of 120 BPM.