# Part 1: Analyzing My Spotify Streaming History

Spotify's "Spotify.me" feature (available at https://spotify.me/en) provides a snapshot of your Spotify listening history. Under GDPR, Spotify also lets you export all of your streaming history (saved for as long as you've been a Spotify user). I downloaded my streaming history and analyzed when I listen to music, what I listen to, and how it fits in with the rest of my life.

Public Code: https://github.com/shomilj/Explore-Spotify

In [4]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
from loader import SpotifyAPI, HealthAPI
from dateutil.parser import parse
from pytz import timezone
from datetime import timedelta
import pytz
from datetime import datetime
from collections import defaultdict
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd, numpy as np
from tqdm import tqdm_notebook as tqdm
import plotly
from sklearn import preprocessing
plotly.offline.init_notebook_mode(connected=True)

In [21]:
ROOT = 'data/'

In [22]:
spotify = SpotifyAPI(ROOT)

In [25]:
spotify.help()


        Available Features:
        • load_searches (181 records)
        • load_streaming (40649 records)
        • load_tracks (1723 records)
        


In [38]:
spotify.load_streaming()

[{'endTime': '2020-03-11 23:54',
  'artistName': 'Leonell Cassio',
  'trackName': "Lying We're Fine",
  'msPlayed': 178834},
 {'endTime': '2020-03-12 00:00',
  'artistName': 'Hans Zimmer',
  'trackName': 'Themes (From "Pirates of the Caribbean")',
  'msPlayed': 378973},
 {'endTime': '2020-03-12 00:04',
  'artistName': 'Calvin Harris',
  'trackName': 'Promises (with Sam Smith)',
  'msPlayed': 213309},
 {'endTime': '2020-03-12 00:08',
  'artistName': 'Alesso',
  'trackName': "If It Wasn't For You",
  'msPlayed': 232480},
 {'endTime': '2020-03-12 00:10',
  'artistName': 'Steve Aoki',
  'trackName': 'Bella Ciao',
  'msPlayed': 122488},
 {'endTime': '2020-03-12 00:14',
  'artistName': 'The Script',
  'trackName': 'If You Ever Come Back',
  'msPlayed': 242066},
 {'endTime': '2020-03-12 00:17',
  'artistName': 'Kygo',
  'trackName': 'This Town (feat. Sasha Sloan)',
  'msPlayed': 202280},
 {'endTime': '2020-03-12 00:21',
  'artistName': 'Mumford & Sons',
  'trackName': 'Beloved',
  'msPlayed':
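
Each streaming record above carries four fields: `endTime`, `artistName`, `trackName`, and `msPlayed`. The `loader` module itself is not shown here, so the following is only a minimal sketch of how such records could be read from a Spotify data export. The function name `load_streaming_records` and the `StreamingHistory*.json` file pattern are assumptions for illustration, not the repository's actual implementation.

```python
import json
import tempfile
from pathlib import Path

def load_streaming_records(root):
    """Concatenate the records of every StreamingHistory*.json file under `root`.

    The file name pattern is an assumption about the export layout.
    """
    records = []
    for path in sorted(Path(root).glob('StreamingHistory*.json')):
        with open(path, encoding='utf-8') as f:
            records.extend(json.load(f))
    return records

# Demo with a temporary folder holding one record copied from the output above.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{
        'endTime': '2020-03-11 23:54',
        'artistName': 'Leonell Cassio',
        'trackName': "Lying We're Fine",
        'msPlayed': 178834,
    }]
    (Path(tmp) / 'StreamingHistory0.json').write_text(json.dumps(sample), encoding='utf-8')
    demo_records = load_streaming_records(tmp)

print(len(demo_records), demo_records[0]['trackName'])
```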

In [26]:
def range_axis(start_date, end_date):
    X = []
    delta = timedelta(days=1)
    while start_date <= end_date:
        ts = start_date.strftime('%Y-%m-%d')
        X.append(ts)
        start_date += delta
    return sorted(X)

def range_axis_months(start_date, end_date):
    r = range_axis(start_date, end_date)
    r = np.unique([x[:7] for x in r])  # keep only the 'YYYY-MM' prefix
    return r

def date_bucket(dt):
    return dt.strftime("%Y-%m-%d")

def day_axis():
    # One 'HH:MM' label per minute of the day (1440 labels, no duplicates).
    # The date itself is arbitrary; only the time of day is used.
    midnight = datetime(2000, 1, 1)
    return [(midnight + timedelta(minutes=m)).strftime('%H:%M') for m in range(24 * 60)]

def time_bucket(dt):
    return dt.strftime("%H:%M")

def plot(X, y, title, xaxis='', yaxis=''):
    fig = go.Figure(data=[go.Scatter(x=X, y=y, line_shape='linear')])
    fig.update_layout(
        title=title,
        yaxis_title=yaxis,
        xaxis_title=xaxis,
        font=dict(size=12)
    )
    fig.show()

### Extracting Time-Relevant Information

In [27]:
actions = []
seen = set()  # timestamps already recorded; an event with a repeated timestamp is skipped

for search in spotify.load_searches():
    # Interpret the timestamp as UTC, then convert it to US/Pacific.
    dt = parse(search.get('searchTime'), fuzzy=True, ignoretz=True)
    dt = pytz.utc.localize(dt)
    dt = dt.astimezone(timezone('US/Pacific'))
    if dt in seen:
        continue
    seen.add(dt)
    actions.append((dt, 'search', search))


for track in spotify.load_streaming():
    # Same conversion for the time at which each stream ended.
    dt = parse(track.get('endTime'), fuzzy=True, ignoretz=True)
    dt = pytz.utc.localize(dt)
    dt = dt.astimezone(timezone('US/Pacific'))
    if dt in seen:
        continue
    seen.add(dt)
    actions.append((dt, 'stream', track))

In [28]:
actions = sorted(actions, key=lambda a: a[0])

In [29]:
# # Filter to Time in the USA for Testing Purposes (we don't have DateTime accurate yet)
# actions = list(filter(lambda a : a[0].year == 2019 and a[0].month < 12 and a[0].month > 8, actions))

### Analyze Historical Usage
How has my Spotify streaming frequency changed over time?

In [30]:
data = defaultdict(int)

for action in actions:
    dt = action[0]
    bucket = date_bucket(dt)
    data[bucket] += 1
    
X = range_axis(actions[0][0], actions[-1][0])
y = [data[bucket] for bucket in X]

plot(X, y, title='Spotify Streaming over All Time', xaxis='Time', yaxis='Count')
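
The cell above counts events per day and reads the counts back along a continuous date axis, so days without any listening show up as zeros instead of being skipped. A minimal, standard-library sketch of that idea follows. The function name `daily_counts` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(event_dates):
    """Count events per calendar day, including zero-count days in between."""
    counts = Counter(d.isoformat() for d in event_dates)
    day, last = min(event_dates), max(event_dates)
    axis = []
    while day <= last:
        axis.append(day.isoformat())
        day += timedelta(days=1)
    # A Counter returns 0 for missing keys, which fills the gaps.
    return axis, [counts[label] for label in axis]

# Hypothetical events: two on March 11, none on March 12, one on March 13.
events = [date(2020, 3, 11), date(2020, 3, 11), date(2020, 3, 13)]
axis, y = daily_counts(events)
print(list(zip(axis, y)))
```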

### Analyze Daily Usage
When, during the day, do I listen to Spotify?

In [31]:
data = defaultdict(int)

for action in actions:
    data[time_bucket(action[0])] += 1
    
X = day_axis()
y = [data[bucket] for bucket in X]
plot(X, y, title='Spotify Streaming over Day', xaxis='Time', yaxis='Count')
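
Minute-level buckets can look noisy because each one holds only a few events. As an optional, coarser view, the same `HH:MM` labels can be collapsed into 24 hourly totals. This is a sketch under that assumption, not part of the original analysis; `hourly_counts` and the sample labels are made up for illustration.

```python
from collections import Counter

def hourly_counts(minute_labels):
    """Collapse 'HH:MM' labels into 24 hourly totals."""
    per_hour = Counter(label[:2] for label in minute_labels)
    hours = [f"{h:02d}" for h in range(24)]
    return hours, [per_hour[h] for h in hours]

# Hypothetical time-of-day labels for five events.
labels = ['23:54', '00:00', '00:04', '00:08', '16:30']
hours, totals = hourly_counts(labels)
print(dict(zip(hours, totals)))
```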

### Most Popular Tracks & Artists
What do I listen to the most?

In [32]:
df = pd.DataFrame.from_dict(spotify.load_streaming())

In [33]:
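# Sum only msPlayed; summing the string columns is not meaningful and fails or concatenates text on recent pandas versions.
favorite_tracks = df.groupby('trackName')[['msPlayed']].sum().sort_values('msPlayed', ascending=False)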
favorite_tracks.head(15)

| trackName | msPlayed |
|---|---|
| Another Place | 51822945 |
| The Funeral | 34425138 |
| Cough Syrup | 33760307 |
| Before I Go | 31315386 |
| Soldier | 29869061 |
| All I Want | 29209524 |
| drivers license | 27331711 |
| Home | 26704724 |
| Heat Waves | 25992261 |
| Out of the Old | 25291865 |


In [34]:
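# Sum only msPlayed; summing the string columns is not meaningful and fails or concatenates text on recent pandas versions.
favorite_artists = df.groupby('artistName')[['msPlayed']].sum().sort_values('msPlayed', ascending=False)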
favorite_artists.head(15)

| artistName | msPlayed |
|---|---|
| Bastille | 216966753 |
| OneRepublic | 146994137 |
| Lauv | 138000484 |
| Olivia Rodrigo | 131266447 |
| BANNERS | 112318209 |
| Glass Animals | 109223813 |
| Kodaline | 91326788 |
| Kygo | 90730663 |
| ILLENIUM | 86242380 |
| WALK THE MOON | 81851035 |


## Comparison to All Music

Let's look at these on a plot. The gap between the music I really enjoy and everything in the "general" category is striking: both curves fall off sharply.

In [35]:
plot(favorite_artists.index,favorite_artists['msPlayed'], title="My Favorite Artists", xaxis='Artist Name', yaxis='ms played')

In [36]:
plot(favorite_tracks.index,favorite_tracks['msPlayed'], title="My Favorite Tracks", yaxis='ms played')
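
One way to put a number on how sharp these curves are is the share of total listening time taken by the top few entries. The sketch below assumes a plain dictionary of milliseconds per name; `top_share` and the toy totals are hypothetical and not taken from the data above.

```python
def top_share(ms_by_name, n):
    """Fraction of total listening time taken by the n most-played names."""
    ranked = sorted(ms_by_name.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0

# Hypothetical totals in milliseconds.
toy = {'A': 600, 'B': 200, 'C': 100, 'D': 60, 'E': 40}
share_top1 = top_share(toy, 1)
share_top2 = top_share(toy, 2)
print(share_top1, share_top2)
```

A value close to 1 for a small `n` would indicate that a handful of names dominate the listening time.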

## Top Tracks Over Time

In [42]:
top = favorite_tracks.head(20).index.to_list()
top = df[df.trackName.isin(top)]
top = top.assign(endTime=top['endTime'].str[:10])  # keep only the 'YYYY-MM-DD' date
top.head()

| | endTime | artistName | trackName | msPlayed |
|---|---|---|---|---|
| 13 | 2020-03-12 | Bastille | Another Place | 211680 |
| 27 | 2020-03-12 | Young the Giant | Cough Syrup | 249520 |
| 53 | 2020-03-12 | Bastille | Another Place | 211680 |
| 67 | 2020-03-13 | Young the Giant | Cough Syrup | 249520 |
| 82 | 2020-03-13 | James TW | Soldier | 224720 |


In [43]:
def get_data(tracker):
    # Relies on the globals X (daily axis) and X_months (monthly axis) defined in a later cell.
    data_cumulative = []
    data_monthly = []
    for name, ms_by_day in tracker.items():
        # Running total of milliseconds played along the daily axis.
        y_cum = []
        cum_ms = 0
        for dt in X:
            cum_ms += ms_by_day.get(dt, 0)
            y_cum.append(cum_ms)

        # Total milliseconds played in each month ('YYYY-MM' prefix of the day key).
        y_months = [sum(v for k, v in ms_by_day.items() if k.startswith(month)) for month in X_months]
        data_cumulative.append(go.Scatter(x=X, y=y_cum, name=name, line_shape='spline'))
        data_monthly.append(go.Scatter(x=X_months, y=y_months, name=name, line_shape='spline'))

    return data_cumulative, data_monthly

def plot_tracker(data, title):
    fig = go.Figure(data=data)
    fig.update_layout(
        title=title,
        yaxis_title='Total Time Listened To',
        xaxis_title='Time',
        font=dict(size=12)
    )
    fig.show()

In [44]:
artist_tracker = defaultdict(dict)
track_tracker = defaultdict(dict)

for _, row in top.iterrows():
    dt = row.get('endTime')
    # Names are truncated to 18 characters; they become the trace names in the plots.
    artist = row.get('artistName')[:18]
    track = row.get('trackName')[:18]
    ms = int(row.get('msPlayed'))
    # Accumulate milliseconds played per name and day.
    artist_tracker[artist][dt] = artist_tracker[artist].get(dt, 0) + ms
    track_tracker[track][dt] = track_tracker[track].get(dt, 0) + ms
    
r = list(sorted(top['endTime']))
X = range_axis(parse(r[0]), parse(r[-1]))
X_months = range_axis_months(parse(r[0]), parse(r[-1]))

In [45]:
artists_cum, artists_mon = get_data(artist_tracker)
tracks_cum, tracks_mon = get_data(track_tracker)

plot_tracker(artists_cum, 'Artists over Time (Cumulative)')
plot_tracker(artists_mon, 'Artists over Time (Monthly)')
plot_tracker(tracks_cum, 'Tracks over Time (Cumulative)')
plot_tracker(tracks_mon, 'Tracks over Time (Monthly)')
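
The cumulative and monthly series plotted above both come from a per-day dictionary of milliseconds played. A compact way to express the same two computations with the standard library is sketched here. The helper names `cumulative_series` and `monthly_series` are hypothetical, and the two plays are illustrative values.

```python
from itertools import accumulate

def cumulative_series(axis, ms_by_day):
    """Running total of milliseconds along a date axis; missing days add 0."""
    return list(accumulate(ms_by_day.get(day, 0) for day in axis))

def monthly_series(months, ms_by_day):
    """Total milliseconds per 'YYYY-MM' month."""
    return [sum(ms for day, ms in ms_by_day.items() if day.startswith(m)) for m in months]

axis = ['2020-03-12', '2020-03-13', '2020-03-14']
plays = {'2020-03-12': 211680, '2020-03-14': 211680}

cum = cumulative_series(axis, plays)
monthly = monthly_series(['2020-03', '2020-04'], plays)
print(cum, monthly)
```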

## Normalized Monthly Charts

In [46]:
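def plot_normed(tracker, title):
    # One row per trace, one column per month.
    normalizer = pd.DataFrame([list(row.y) for row in tracker])
    x = normalizer.values  # NumPy array
    # MinMaxScaler scales each column to [0, 1], i.e. each month across traces.
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    normalizer = pd.DataFrame(x_scaled)

    # Overwrite each trace's y-values in place with the scaled values.
    for i, row in enumerate(tracker):
        row.y = normalizer.iloc[i]

    plot_tracker(tracker, title)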

In [47]:
plot_normed(artists_mon, 'Artists over Time (Monthly, Normalized)')
plot_normed(tracks_mon, 'Tracks over Time (Monthly, Normalized)')
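
Because `MinMaxScaler` works column by column, the normalization above rescales each month across all traces, not each trace across its months. The pure-Python sketch below shows the difference between the two choices on a toy matrix. The helper names and numbers are hypothetical, and `scale_columns` only mirrors the scaler's default behavior as I understand it.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def scale_columns(rows):
    """Scale each column (month) across rows (traces)."""
    scaled_cols = [min_max(col) for col in zip(*rows)]
    return [list(r) for r in zip(*scaled_cols)]

def scale_rows(rows):
    """Scale each row (trace) across its own columns (months)."""
    return [min_max(row) for row in rows]

# Two traces (rows) over two months (columns), toy values.
rows = [[10, 40], [30, 80]]
by_month = scale_columns(rows)
by_trace = scale_rows(rows)
print(by_month, by_trace)
```

Column-wise scaling answers "who led in each month", while row-wise scaling answers "when did each artist or track peak".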

## Top Songs from Each Month

In [48]:
df = pd.DataFrame.from_dict(spotify.load_streaming())
df['endTime'] = df['endTime'].str[:7]  # keep only 'YYYY-MM'
# Play count per month and track, ordered by month and then by descending count.
filtered = df.groupby(['endTime', 'trackName']).size().reset_index(name='count')
filtered = filtered.sort_values(['endTime', 'count'], ascending=[True, False])

In [49]:
for d in sorted(set(df['endTime']), reverse=True):
    print(f"TOP SONGS FOR {d}")
    # Total milliseconds played per track within this month.
    month_totals = df[df['endTime'] == d].groupby('trackName')['msPlayed'].sum()
    for name, ms in month_totals.sort_values(ascending=False).head(10).items():
        print(f"{ms} - {name}")
    print('---------------------------')

TOP SONGS FOR 2021-05
18096886 - Waves - Acoustic
15366099 - Home
7920399 - good 4 u
6284604 - Way down We Go
3427068 - Waves
3230701 - Elastic Heart - Piano Version
3121694 - Part of Me
2789835 - drivers license
2717395 - Angel By The Wings
2691049 - Give Me Love
---------------------------
TOP SONGS FOR 2021-04
13781518 - deja vu
12946604 - Way down We Go
12822702 - Gone Are The Days - Piano Jam 4
12789959 - Let Me Take You There
10263119 - Angel By The Wings
9887673 - Someday
9413020 - Gone Are The Days (feat. James Gillespie)
7955332 - IPlayYouListen - Live
6276871 - supercuts
6028270 - Blank Space
---------------------------
TOP SONGS FOR 2021-03
11173484 - drivers license
8282961 - Bulletproof
7340714 - supercuts
7241359 - Broken
7149911 - The Good Parts
7136445 - Sorry
6565907 - Entertainer
6212070 - Turning Page
6133946 - Superhero
5135059 - All I Want
---------------------------
TOP SONGS FOR 2021-02
14826115 - Broken
9743868 - Stubborn Love
7634947 - Geronimo
7522780 - Simple

In [50]:
for d in sorted(set(df['endTime']), reverse=True):
    print(f"TOP ARTISTS FOR {d}")
    # Total milliseconds played per artist within this month.
    month_totals = df[df['endTime'] == d].groupby('artistName')['msPlayed'].sum()
    for name, ms in month_totals.sort_values(ascending=False).head(10).items():
        print(f"{ms} - {name}")
    print('---------------------------')

TOP ARTISTS FOR 2021-05
22091206 - Dean Lewis
15664073 - Phillip Phillips
11944356 - Olivia Rodrigo
9730649 - Hans Zimmer
8822716 - Ed Sheeran
8283511 - Carlos Rafael Rivera
8138666 - Sia
6508447 - KALEO
5010951 - Blake Neely
4953665 - Lauv
---------------------------
TOP ARTISTS FOR 2021-04
32876795 - Kygo
23525156 - Taylor Swift
20869932 - Olivia Rodrigo
20387921 - Sia
20117404 - Plain White T's
18923166 - KALEO
11678065 - Jeremy Zucker
10481176 - ODESZA
8233202 - Frank Ocean
7447458 - Lauv
---------------------------
TOP ARTISTS FOR 2021-03
23994759 - Olivia Rodrigo
13492930 - Noah Kahan
13419576 - Lauv
13373036 - Joshua Bassett
10872623 - Bastille
10558472 - ZAYN
8547113 - Ben Platt
8290907 - Griffin Oskar
8205351 - Andy Grammer
8143544 - Jeremy Zucker
---------------------------
TOP ARTISTS FOR 2021-02
19795774 - Noah Kahan
14539635 - Jonah Kagen
13905737 - Vance Joy
12117780 - American Authors
11885912 - Sleeping At Last
10332992 - The Lumineers
9000517 - Bastille
8568953 - Dean 

# Part 2: How Does Music Affect My Heartbeat?
Or rather, what type of music do I listen to when my heart's pumping? (The relationship could run in either direction.)

In [51]:
health = HealthAPI(ROOT)

In [52]:
health.help()


Available Features:
• load_heartbeats()
        


In [55]:
hb_df = health.load_heartbeats()

In [56]:
# print(f"We have {len(hb_df)} heartbeat data points available!")

In [33]:
hb_df.head()

| | creationDate | startDate | endDate | value |
|---|---|---|---|---|
| 0 | 2019-11-13 17:43:25+00:00 | 2019-11-13 17:43:18+00:00 | 2019-11-13 17:43:18+00:00 | 103 |
| 1 | 2019-11-13 17:48:49+00:00 | 2019-11-13 17:39:14+00:00 | 2019-11-13 17:39:14+00:00 | 76 |
| 2 | 2019-11-13 17:53:32+00:00 | 2019-11-13 17:48:36+00:00 | 2019-11-13 17:48:36+00:00 | 75 |
| 3 | 2019-11-13 17:57:15+00:00 | 2019-11-13 17:53:21+00:00 | 2019-11-13 17:53:21+00:00 | 81 |
| 4 | 2019-11-13 18:00:38+00:00 | 2019-11-13 18:00:37+00:00 | 2019-11-13 18:00:37+00:00 | 92 |


In [34]:
sp_df = pd.DataFrame.from_dict(spotify.load_streaming())

In [35]:
sp_df.head()

| | endTime | artistName | trackName | msPlayed |
|---|---|---|---|---|
| 0 | 2019-03-16 16:51 | Cash Cash | Hero (feat. Christina Perri) - Deep Mix | 60669 |
| 1 | 2019-03-17 05:17 | Cash Cash | Hero (feat. Christina Perri) - Deep Mix | 76252 |
| 2 | 2019-03-17 05:18 | Cash Cash | Hero (feat. Christina Perri) - Deep Mix | 81775 |
| 3 | 2019-03-17 05:22 | Steve Void | Perfect Mess | 216046 |
| 4 | 2019-03-17 05:26 | San Holo | The Future - GOSLO Remix | 274480 |


In [36]:
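track_hb = defaultdict(list)
artist_hb = defaultdict(list)

for i, row in tqdm(sp_df.iterrows()):
    # endTime is parsed and localized as UTC to match the heartbeat timestamps.
    dt = parse(row.get('endTime'))
    dt = pytz.utc.localize(dt)

    # Keep heartbeat samples whose [startDate, endDate] interval contains the track's end time.
    # In the rows shown by hb_df.head(), both bounds are equal, so this is effectively an exact-timestamp match.
    filtered = hb_df[(hb_df["startDate"] <= dt) & (dt <= hb_df["endDate"])]
    for j, hb_row in filtered.iterrows():
        track_hb[row.get('trackName')].append(hb_row.get('value'))
        artist_hb[row.get('artistName')].append(hb_row.get('value'))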


This function will be removed in tqdm==5.0.0
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`



In [37]:
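def get_Xy(tracker):
    # Mean heartbeat per key (track or artist), sorted ascending by that mean.
    means = [(name, np.mean([float(v) for v in values])) for name, values in tracker.items()]
    X, y = zip(*sorted(means, key=lambda pair: pair[1]))
    return X, y

def plot_hb(tracker, title):
    X, y = get_Xy(tracker)
    # Mean heartbeat goes on the horizontal axis, names on the vertical axis.
    plot(y, list(X), title)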
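
The matching loop above pairs a track with a heartbeat sample only when the track's end time falls inside the sample's interval. Since `endTime` has minute resolution and the samples shown have identical start and end times, few pairs are likely to match. One possible alternative, sketched here as an assumption and not as the method used in this notebook, is to take the nearest sample within a tolerance. The function `nearest_sample` and the two-minute tolerance are hypothetical; the sample values come from the `hb_df.head()` output.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def nearest_sample(sample_times, sample_values, when, tolerance):
    """Return the value of the sample closest to `when`, or None if none is within `tolerance`.

    `sample_times` must be sorted ascending and aligned with `sample_values`.
    """
    i = bisect_left(sample_times, when)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sample_times)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(sample_times[j] - when))
    if abs(sample_times[best] - when) > tolerance:
        return None
    return sample_values[best]

# Three heartbeat samples from the table above, sorted by time.
times = [
    datetime(2019, 11, 13, 17, 39, 14),
    datetime(2019, 11, 13, 17, 43, 18),
    datetime(2019, 11, 13, 17, 48, 36),
]
values = [76, 103, 75]

# Hypothetical track end times.
close_match = nearest_sample(times, values, datetime(2019, 11, 13, 17, 44), timedelta(minutes=2))
no_match = nearest_sample(times, values, datetime(2019, 11, 13, 18, 30), timedelta(minutes=2))
print(close_match, no_match)
```

A wider tolerance yields more matches but ties each track to a heartbeat reading that may have been taken before or after the song played.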

# Results – The Final Heartbeat/Music Correlation
Do these make sense? Judge for yourself! The DataFrames at the bottom may provide a better visualization.

In [38]:
plot_hb(track_hb, 'Heartbeat by Tracks')
plot_hb(artist_hb, 'Heartbeat by Artists')

### Correlation of Tracks & Heartbeat

In [39]:
X, y = get_Xy(track_hb)
track_df = pd.DataFrame({'Track': X, 'Average Heartbeat': y})
track_df = track_df.sort_values('Average Heartbeat', ascending=False)
track_df

| | Track | Average Heartbeat |
|---|---|---|
| 429 | Everywhere - 2017 Remaster | 175.062 |
| 423 | Carry on My Wayward Son - Brass Version | 175.062 |
| 419 | SOS (feat. Aloe Blacc) | 175.062 |
| 420 | Just My Type | 175.062 |
| 421 | thank u, next | 175.062 |
| ... | ... | ... |
| 4 | FRIENDS | 62.000 |
| 2 | Dance Monkey | 61.000 |
| 1 | All You Need To Know (feat. Calle Lehmann) | 61.000 |
| 3 | Guiding Light | 61.000 |


### Correlation of Artists & Heartbeat

In [40]:
X, y = get_Xy(artist_hb)
artist_df = pd.DataFrame({'Artist': X, 'Average Heartbeat': y})
artist_df = artist_df.sort_values('Average Heartbeat', ascending=False)
artist_df

| | Artist | Average Heartbeat |
|---|---|---|
| 283 | BANNERS | 175.062000 |
| 282 | TOTO | 175.062000 |
| 281 | Lord Huron | 175.062000 |
| 280 | Ariana Grande | 173.010333 |
| 277 | Dimitri Vegas & Like Mike | 168.907000 |
| ... | ... | ... |
| 4 | Sonu Nigam | 67.000000 |
| 3 | The Blue Notes | 66.000000 |
| 2 | Paul Cesar | 64.000000 |
| 1 | Mumford & Sons | 61.000000 |
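
Several entries in these tables share exactly the same average (for example 175.062), which may indicate that they rest on a single heartbeat sample each. A simple guard, sketched here under that assumption, is to report averages only for names with a minimum number of readings. The function `reliable_means`, the threshold of 3, and the toy tracker are hypothetical.

```python
from statistics import mean

def reliable_means(tracker, min_samples=3):
    """Mean per key, keeping only keys with at least `min_samples` readings."""
    return {
        name: mean(values)
        for name, values in tracker.items()
        if len(values) >= min_samples
    }

# Toy tracker: only 'Track B' has enough readings to be kept.
toy_tracker = {
    'Track A': [175.062],
    'Track B': [80, 90, 100],
    'Track C': [61, 63],
}
kept = reliable_means(toy_tracker)
print(kept)
```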
