# Analyzing My Spotify Streaming History

Spotify's "Spotify.me" feature (available at https://spotify.me/en) provides a snapshot of your listening history. Under GDPR, Spotify also lets you export all of your streaming history (saved for as long as you've been a Spotify user). I downloaded mine and analyzed when I listen to music, what I listen to, and how it fits in with the rest of my life.

Download your data here: https://support.spotify.com/us/article/data-rights-and-privacy-settings/

View this code here: https://github.com/shomilj/Explore-Spotify

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from loader import SpotifyAPI
from dateutil.parser import parse
from pytz import timezone
from datetime import timedelta
import pytz
from datetime import datetime
from collections import defaultdict
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd, numpy as np
import plotly
plotly.offline.init_notebook_mode(connected=True)

In [3]:
spotify = SpotifyAPI('../Jarvis/data/')

In [4]:
spotify.help()


        Available Features:
        • load_searches (199 records)
        • load_streaming (24774 records)
        • load_tracks (1156 records)
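
The `loader` module itself is not shown in this notebook. As a rough idea of what `load_streaming` might do, here is a minimal sketch. It assumes the export folder contains one or more `StreamingHistory*.json` files, each holding a JSON list of play records; the function below is a hypothetical stand-in, not the actual `SpotifyAPI` implementation.

```python
import json
from pathlib import Path


def load_streaming(data_dir):
    """Hypothetical stand-in for SpotifyAPI.load_streaming()."""
    records = []
    # Assumption: history is split across files named StreamingHistory0.json,
    # StreamingHistory1.json, and so on, each containing a JSON list.
    for path in sorted(Path(data_dir).glob('StreamingHistory*.json')):
        with open(path, encoding='utf-8') as f:
            records.extend(json.load(f))
    return records
```

Under those assumptions, `load_streaming('../Jarvis/data/')` would return a list of dicts with the `endTime`, `artistName`, `trackName`, and `msPlayed` keys used in the rest of this notebook.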
        


In [5]:
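def range_axis(start_date, end_date):
    # Compare calendar dates only, so the final day is kept even when
    # end_date falls earlier in the day than start_date.
    day, last = start_date.date(), end_date.date()
    X = []
    while day <= last:
        X.append(day.strftime('%Y-%m-%d'))
        day += timedelta(days=1)
    return X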

def range_axis_months(start_date, end_date):
    r = range_axis(start_date, end_date)
    r = np.unique([x[:7] for x in r]) # keep 'YYYY-MM', dropping the '-DD' part
    return r

def date_bucket(dt):
    return dt.strftime("%Y-%m-%d")

def day_axis():
    # One 'HH:MM' label per minute of the day, from 00:00 through 23:59.
    # Starting at midnight avoids the duplicate label produced by stepping
    # a full day forward from the current time.
    midnight = datetime(2000, 1, 1)
    return [(midnight + timedelta(minutes=m)).strftime('%H:%M') for m in range(24 * 60)]

def time_bucket(dt):
    return dt.strftime("%H:%M")

def plot(X, y, title, xaxis='', yaxis=''):
    fig = go.Figure(data=[go.Scatter(x=X, y=y, line_shape='linear')])
    fig.update_layout(
        title=title,
        yaxis_title=yaxis,
        xaxis_title=xaxis,
        font=dict(size=12)
    )
    fig.show()

### Extracting Time-Relevant Information

In [6]:
actions = []

for search in spotify.load_searches():
    dt = parse(search.get('searchTime'), fuzzy=True, ignoretz=True)
    dt = pytz.utc.localize(dt)
    dt = dt.astimezone(timezone('US/Pacific'))
    actions.append((dt, 'search', search))
    
    
for track in spotify.load_streaming():
    dt = parse(track.get('endTime'), fuzzy=True, ignoretz=True)
    dt = pytz.utc.localize(dt)
    dt = dt.astimezone(timezone('US/Pacific'))
    actions.append((dt, 'stream', track))
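
The same conversion can also be done on a whole column at once with pandas. This is a sketch of an alternative, not the approach used above. It keeps the same assumption that the exported timestamps are in UTC, the sample rows are made up, and `America/Los_Angeles` is the canonical name for the zone that `US/Pacific` points to.

```python
import pandas as pd

# Made-up rows in the same shape as the streaming records.
sample = pd.DataFrame({'endTime': ['2019-03-19 03:15', '2019-07-04 18:30']})

# Treat the naive timestamps as UTC, then convert them to Pacific time.
sample['endTimeLocal'] = (
    pd.to_datetime(sample['endTime'])
    .dt.tz_localize('UTC')
    .dt.tz_convert('America/Los_Angeles')
)
print(sample)
```

Note that the first sample row moves to the previous calendar day after conversion, which is why the conversion has to happen before bucketing by date.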

In [7]:
actions = list(sorted(actions, key=lambda a : a[0]))

In [8]:
# # Filter to Time in the USA for Testing Purposes (we don't have DateTime accurate yet)
# actions = list(filter(lambda a : a[0].year == 2019 and a[0].month < 12 and a[0].month > 8, actions))

### Analyze Historical Usage
How has my Spotify streaming frequency changed over time?

In [9]:
data = defaultdict(int)

for action in actions:
    dt = action[0]
    bucket = date_bucket(dt)
    data[bucket] += 1
    
X = range_axis(actions[0][0], actions[-1][0])
y = [data[bucket] for bucket in X]

plot(X, y, title='Spotify Streaming over All Time', xaxis='Time', yaxis='Count')

### Analyze Daily Usage
When, during the day, do I listen to Spotify?

In [10]:
data = defaultdict(int)

for action in actions:
    data[time_bucket(action[0])] += 1
    
X = day_axis()
y = [data[bucket] for bucket in X]
plot(X, y, title='Spotify Streaming over Day', xaxis='Time', yaxis='Count')
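
Minute-level buckets can be noisy, so it may help to collapse them into hours before plotting. The helper below is a hypothetical addition (the name `hourly_counts` is not part of the original notebook) that takes a dict shaped like `data` above, with `'HH:MM'` keys, and returns 24 per-hour totals. The example counts are made up.

```python
from collections import Counter


def hourly_counts(minute_counts):
    """Collapse {'HH:MM': count} into a 24-item list of per-hour totals."""
    totals = Counter()
    for label, count in minute_counts.items():
        totals[int(label[:2])] += count
    return [totals[hour] for hour in range(24)]


# Made-up counts, only to show the shape of the input and output.
example = {'08:05': 2, '08:59': 1, '23:10': 4}
hours = hourly_counts(example)
print(hours[8], hours[23])
```

With the real data, something like `plot(list(range(24)), hourly_counts(data), title='Spotify Streaming by Hour')` should give a smoother version of the chart above.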

### Most Popular Tracks & Artists
What do I listen to the most?

In [11]:
df = pd.DataFrame.from_dict(spotify.load_streaming())

In [12]:
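# Sum only msPlayed; newer pandas versions would otherwise try to aggregate the string columns too.
favorite_tracks = df.groupby('trackName')[['msPlayed']].sum().sort_values('msPlayed', ascending=False)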
favorite_tracks.head(15)

| trackName | msPlayed |
| --- | --- |
| You & Me | 13954157 |
| Daylight | 11280494 |
| Breathe | 10615771 |
| Dreamer | 10161831 |
| Outnumbered | 9988893 |
| Speechless (Full) | 9951034 |
| Wake Me Up | 8766125 |
| Beautiful Creatures (feat. MAX) | 8254825 |
| No One Compares To You | 7972222 |
| Better | 7885472 |


In [13]:
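# Sum only msPlayed; newer pandas versions would otherwise try to aggregate the string columns too.
favorite_artists = df.groupby('artistName')[['msPlayed']].sum().sort_values('msPlayed', ascending=False)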
favorite_artists.head(15)

| artistName | msPlayed |
| --- | --- |
| Lauv | 78715733 |
| Alan Silvestri | 44219050 |
| Avicii | 41368114 |
| Coldplay | 38140870 |
| ILLENIUM | 36235610 |
| James TW | 35731886 |
| Hans Zimmer | 31516690 |
| Maroon 5 | 31172415 |
| Penn Masala | 30267778 |
| Imagine Dragons | 28719833 |
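
Raw milliseconds are hard to read at this scale, so converting to hours can make the totals easier to compare. A small sketch using the first two rows of the artist output above; the `hoursPlayed` column and the `artists_sample` name are illustrative additions, not part of the original notebook.

```python
import pandas as pd

MS_PER_HOUR = 1000 * 60 * 60

# The first two rows of the artist totals shown above.
artists_sample = pd.DataFrame(
    {'msPlayed': [78715733, 44219050]},
    index=pd.Index(['Lauv', 'Alan Silvestri'], name='artistName'),
)

# Convert milliseconds to hours, rounded to one decimal place.
artists_sample['hoursPlayed'] = (artists_sample['msPlayed'] / MS_PER_HOUR).round(1)
print(artists_sample)
```

The same one-line conversion should work on `favorite_artists` or `favorite_tracks` directly.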


## Comparison to All Music

Let's take a look at these on a plot. The difference between the songs and artists I really enjoy and those that fall into the "general" category is striking: both graphs show a sharp drop-off.

In [14]:
plot(favorite_artists.index,favorite_artists['msPlayed'], title="My Favorite Artists", xaxis='Artist Name', yaxis='ms played')

In [15]:
plot(favorite_tracks.index,favorite_tracks['msPlayed'], title="My Favorite Tracks", yaxis='ms played')
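
One way to put a number on how steep these curves are is to ask what fraction of total listening time the top few entries account for. The helper below is a hypothetical addition (`top_share` is not part of the original notebook), shown on made-up values.

```python
import pandas as pd


def top_share(ms_played, n):
    """Fraction of total listening time covered by the n largest entries."""
    ordered = ms_played.sort_values(ascending=False)
    return ordered.head(n).sum() / ordered.sum()


# Made-up totals, only to illustrate the calculation.
sample_ms = pd.Series({'A': 600, 'B': 250, 'C': 100, 'D': 50})
print(top_share(sample_ms, 1))
print(top_share(sample_ms, 2))
```

Applied to the real data, `top_share(favorite_artists['msPlayed'], 10)` would give the share of listening time that goes to the ten most-played artists.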

## Top Tracks Over Time

In [16]:
top = favorite_tracks.head(20).index.to_list()
top = df[df.trackName.isin(top)]
top = top.assign(endTime=lambda df: df['endTime'].apply(lambda x : x[:10]))
top.head()

| | artistName | endTime | msPlayed | trackName |
| --- | --- | --- | --- | --- |
| 47 | ILLENIUM | 2019-03-19 | 240508 | Beautiful Creatures (feat. MAX) |
| 108 | James TW | 2019-03-21 | 231653 | You & Me |
| 116 | Maroon 5 | 2019-03-21 | 225306 | Daylight |
| 148 | ILLENIUM | 2019-03-22 | 240508 | Beautiful Creatures (feat. MAX) |
| 164 | ILLENIUM | 2019-03-23 | 18668 | Beautiful Creatures (feat. MAX) |


In [17]:
def get_data(tracker):
    # Relies on the global X (daily labels) and X_months (monthly labels)
    # axes, which are defined in a later cell.
    data_cumulative = []
    data_monthly = []
    for name, ms_by_day in tracker.items():
        # Running total of milliseconds played, one point per day.
        y_cum = []
        cum_ms = 0
        for dt in X:
            cum_ms += ms_by_day.get(dt, 0)
            y_cum.append(cum_ms)

        # Total milliseconds per month; day keys look like 'YYYY-MM-DD'.
        y_months = [sum(v for k, v in ms_by_day.items() if k.startswith(month)) for month in X_months]
        data_cumulative.append(go.Scatter(x=X, y=y_cum, name=name, line_shape='spline'))
        data_monthly.append(go.Scatter(x=X_months, y=y_months, name=name, line_shape='spline'))

    return data_cumulative, data_monthly

# Note: this redefines the plot(X, y, ...) helper from earlier in the notebook.
def plot(data, title):
    fig = go.Figure(data=data)
    fig.update_layout(
        title=title,
        yaxis_title='Total Time Listened To',
        xaxis_title='Time',
        font=dict(size=12)
    )
    fig.show()

In [18]:
artist_tracker = defaultdict(dict)
track_tracker = defaultdict(dict)

for i, row in top.iterrows():
    dt = row.get('endTime')
    artist = row.get('artistName')[:18]
    track = row.get('trackName')[:18]
    ms = int(row.get('msPlayed'))
    artist_tracker[artist][dt] = artist_tracker[artist].setdefault(dt, 0) + ms
    track_tracker[track][dt] = track_tracker[track].setdefault(dt, 0) + ms
    
r = list(sorted(top['endTime']))
X = range_axis(parse(r[0]), parse(r[-1]))
X_months = range_axis_months(parse(r[0]), parse(r[-1]))

In [19]:
artists_cum, artists_mon = get_data(artist_tracker)
tracks_cum, tracks_mon = get_data(track_tracker)

plot(artists_cum, 'Artists over Time (Cumulative)')
plot(artists_mon, 'Artists over Time (Monthly)')
plot(tracks_cum, 'Tracks over Time (Cumulative)')
plot(tracks_mon, 'Tracks over Time (Monthly)')

## Playlist for Each Month

In [20]:
df = pd.DataFrame.from_dict(spotify.load_streaming())

In [21]:
df['endTime'] = df['endTime'].apply(lambda x : x[:7])

In [22]:
# Sort by month, then by play count within each month (a single sort keeps both orderings).
filtered = df.groupby(['endTime','trackName']).size().reset_index(name='count')
filtered = filtered.sort_values(['endTime', 'count'], ascending=[True, False])

In [31]:
for d in sorted(set(df['endTime']), reverse=True):
    print(f"TOP SONGS FOR {d}")
    # Sum only msPlayed so the string columns are not aggregated.
    monthly = df[df['endTime'] == d].groupby('trackName')['msPlayed'].sum()
    for name, ms in monthly.sort_values(ascending=False).head(10).items():
        print(f"{ms} - {name}")
    print('---------------------------')

TOP SONGS FOR 2020-03
5593142 - Mean It
5233106 - Feelings
4877459 - Outnumbered
4756445 - Another Place
4719177 - Breathe
4288678 - Phases
4095349 - Before You Go
3992897 - 8 Letters
3635483 - Better
3539838 - Circles
---------------------------
TOP SONGS FOR 2020-02
4732790 - You should be sad
3387883 - Family
2632121 - Dirty Paws
2612069 - What Am I
2522818 - Intentions
2406470 - Outnumbered
2264458 - Dil Na Jaaneya
2224499 - Blinding Lights
2179482 - Better
2175902 - 8 Letters
---------------------------
TOP SONGS FOR 2020-01
2354088 - Dil Na Jaaneya
1838332 - Lost Soul
1568558 - Family
1564690 - Into the Unknown
1429629 - Listen to Your Heart (feat. Cosette Smith)
1413676 - The Flood
1312870 - Castle on the Hill
1290489 - Finale
1272248 - Iduna's Scarf
1242361 - Reunion
---------------------------
TOP SONGS FOR 2019-12
4414124 - Show Yourself
2508086 - U & Us
2483183 - Good Things Fall Apart (with Jon Bellion)
2347044 - Don't Give Up On Me - (From "Five Feet Apart")
2308987 - Let'

In [32]:
for d in sorted(set(df['endTime']), reverse=True):
    print(f"TOP ARTISTS FOR {d}")
    # Sum only msPlayed so the string columns are not aggregated.
    monthly = df[df['endTime'] == d].groupby('artistName')['msPlayed'].sum()
    for name, ms in monthly.sort_values(ascending=False).head(10).items():
        print(f"{ms} - {name}")
    print('---------------------------')

TOP ARTISTS FOR 2020-03
28951577 - Lauv
14773543 - Bastille
13771479 - Lewis Capaldi
9926767 - Why Don't We
8830627 - Harry Styles
6546358 - Of Monsters and Men
6342854 - Jeremy Zucker
5940821 - James Arthur
5277182 - Dan + Shay
4877459 - Dermot Kennedy
---------------------------
TOP ARTISTS FOR 2020-02
9232843 - Wolf Colony
8991706 - The Chainsmokers
6235470 - Lauv
5950048 - Halsey
5161956 - Why Don't We
5146904 - Justin Bieber
4358131 - OneRepublic
3915139 - Avicii
3785877 - Of Monsters and Men
3316463 - Penn Masala
---------------------------
TOP ARTISTS FOR 2020-01
7897663 - Hans Zimmer
4850414 - John Williams
4641080 - Cinematic Pop
4630654 - Christophe Beck
3605672 - The Chainsmokers
3111929 - Ed Sheeran
2612322 - Idina Menzel
2354088 - Rochak Kohli
1961597 - Lauv
1867092 - Gryffin
---------------------------
TOP ARTISTS FOR 2019-12
10927998 - ILLENIUM
8945975 - Christophe Beck
6668953 - Idina Menzel
6057405 - Penn Masala
4695106 - Avicii
4304416 - OneRepublic
4184570 - Martin G