<a href="https://colab.research.google.com/github/sophia-duran/spotify-streaming-analytics/blob/main/Spotify_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎧 Spotify Streaming Analytics (2019–2025)

> This project uses my extended Spotify streaming history to analyze how I’ve listened to music from 2019 to 2025. It breaks down trends, routines, favorite artists, and personal patterns — and adds a little reflection along the way. Think of it as a data-driven look into what’s been playing in the background (and sometimes the foreground) of my life.


## 🛠️ Data Preparation
<details>
<summary> Click to view data loading & cleaning code</summary>

```python
from google.colab import drive
drive.mount('/content/drive')

import zipfile
import os

zip_path = '/content/drive/MyDrive/SpotifyData/my_spotify_data.zip'
extract_path = '/content/drive/MyDrive/SpotifyData/extracted/'

os.makedirs(extract_path, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

import json
import pandas as pd

nested_path = '/content/drive/MyDrive/SpotifyData/extracted/Spotify Extended Streaming History/'

data_frames = []
for file in os.listdir(nested_path):
    if file.endswith('.json'):
        with open(os.path.join(nested_path, file)) as f:
            data = json.load(f)
            data_frames.append(pd.DataFrame(data))

df = pd.concat(data_frames, ignore_index=True)

csv_path = '/content/drive/MyDrive/SpotifyData/listening_history.csv'
df.to_csv(csv_path, index=False)

df = pd.read_csv(csv_path)

df.rename(columns={
    'ts': 'timestamp',
    'ms_played': 'duration_ms',
    'master_metadata_track_name': 'track',
    'master_metadata_album_artist_name': 'artist',
    'master_metadata_album_album_name': 'album',
    'spotify_track_uri': 'track_uri',
    'shuffle': 'shuffled',
    'skipped': 'skipped',
    'platform': 'platform',
    'conn_country': 'country',
    'reason_start': 'start_reason',
    'reason_end': 'end_reason',
    'offline_timestamp': 'offline_ts',
    'incognito_mode': 'incognito'
}, inplace=True)

# errors='ignore' skips any listed column that is absent from this export
df.drop(columns=['ip_addr', 'spotify_episode_uri', 'audiobook_title',
                 'audiobook_uri', 'audiobook_chapter_uri',
                 'audiobook_chapter_title', 'offline', 'offline_ts'],
        inplace=True, errors='ignore')

df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.dropna(subset=['timestamp'])

df_filtered = df[
    (df['timestamp'] >= '2019-05-25') & (df['timestamp'] <= '2025-05-25')
].copy()

df_filtered['listening_hours'] = df_filtered['duration_ms'] / (1000 * 60 * 60)
```

</details>


Behind the scenes, I parsed and cleaned over six years of Spotify data from my extended streaming history. This project uses the six years from May 2019 to May 2025.

## 📆 Yearly Listening Overview

### 🕰️ What Year Did I Listen the Most?

In [None]:
import plotly.express as px

df_filtered['year'] = df_filtered['timestamp'].dt.year

yearly_totals = (
    df_filtered.groupby('year')['duration_ms']
    .sum()
    .reset_index()
)

yearly_totals['listening_hours'] = yearly_totals['duration_ms'] / (1000 * 60 * 60)

fig = px.pie(
    yearly_totals,
    values='listening_hours',
    names='year',
    title='🕰️ What Year Did I Listen the Most?',
    hole=0.4,
    template='plotly_dark'
)

fig.show()


This chart shows the rhythm of my years.

Which years were filled with music and which ones were quieter? Each slice of the pie represents a different year from 2019 to 2025, showing just how much I leaned on music in different seasons of life. Upon reflection, this makes quite a bit of sense. In middle school, I didn't play music constantly or in headphones; it was more for certain occasions. As I got older, I figured out which types of music I really wanted to listen to for myself, and I started using headphones and earbuds, which let me listen MORE.

### 📈 Listening Hours Per Year

In [None]:
yearly_totals = (
    df_filtered.groupby(df_filtered['timestamp'].dt.year)['duration_ms']
    .sum()
    .reset_index()
)
yearly_totals.columns = ['year', 'duration_ms']
yearly_totals['listening_hours'] = yearly_totals['duration_ms'] / (1000 * 60 * 60)

fig = px.line(
    yearly_totals,
    x='year',
    y='listening_hours',
    markers=True,
    title='📈 Listening Hours Per Year',
    labels={'year': 'Year', 'listening_hours': 'Hours Listened'},
    template='plotly_dark'
)
fig.show()


Over the years, my music listening has surged and settled.

This line chart visualizes how my total listening time grew or shrank each year, reflecting shifts in routines, moods, and moments. It gives a new perspective: not a percentage breakdown by year, but a continuous view of hours listened over time.
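One thing to keep in mind when reading the yearly totals: 2019 and 2025 are partial years in this date range, so their raw totals are not directly comparable to full years. A minimal sketch of normalizing by the days each year actually covers, using made-up rows (the toy timestamps and durations are hypothetical, not from my data):

```python
import pandas as pd

# Hypothetical plays: a partial 2019 and a full 2020
toy = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2019-06-01', '2019-12-31', '2020-01-01', '2020-12-31'], utc=True
    ),
    'duration_ms': [3_600_000, 3_600_000, 7_200_000, 7_200_000],
})
toy['year'] = toy['timestamp'].dt.year

# Total listening time plus the first and last play seen in each year
per_year = toy.groupby('year').agg(
    total_ms=('duration_ms', 'sum'),
    first_play=('timestamp', 'min'),
    last_play=('timestamp', 'max'),
)
per_year['hours'] = per_year['total_ms'] / (1000 * 60 * 60)

# Days between first and last play (inclusive), then hours per covered day
per_year['days_covered'] = (per_year['last_play'] - per_year['first_play']).dt.days + 1
per_year['hours_per_day'] = per_year['hours'] / per_year['days_covered']

print(per_year[['hours', 'days_covered', 'hours_per_day']])
```

Swapping `toy` for `df_filtered` should give a per-day figure for each year that is fairer to the two half years.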

## 👑 Artist Trends Over Time

### 👑 Who Defined My Music Taste Each Year?

In [None]:
df_filtered['year'] = df_filtered['timestamp'].dt.year

era_artist = (
    df_filtered.groupby(['year', 'artist'])['duration_ms']
    .sum()
    .reset_index()
)

era_artist['hours'] = era_artist['duration_ms'] / (1000 * 60 * 60)
top_artist_year = era_artist.sort_values(['year', 'hours'], ascending=[True, False]).drop_duplicates('year')

fig = px.bar(
    top_artist_year,
    x='year',
    y='hours',
    color='artist',
    title='🌀 Who Defined My Music Taste Each Year?',
    labels={'hours': 'Hours Listened'},
    template='plotly_dark'
)
fig.show()


Every year had a musical protagonist.

This chart reveals which artist stole the show annually. Whether it was a comfort artist or a fleeting obsession, they each left a mark on that chapter of my life. This is VERY accurate. I've had my fluctuations and steady favorites, but these artists definitely define these years.

### 🎚️ Top 10 Artists Rank Over Time

In [62]:
import pandas as pd
import plotly.express as px

# Drop the timezone explicitly so to_period() does not warn about losing it
df_filtered['month'] = df_filtered['timestamp'].dt.tz_localize(None).dt.to_period('M').astype(str)

monthly_totals = (
    df_filtered.groupby(['month', 'artist'])['duration_ms']
    .sum()
    .reset_index()
)

monthly_totals['listening_hours'] = monthly_totals['duration_ms'] / (1000 * 60 * 60)

# Overall top 10 artists by total hours across the whole period
top10_artists = (
    monthly_totals.groupby('artist')['listening_hours']
    .sum()
    .nlargest(10)
    .index
)

top10_df = monthly_totals[monthly_totals['artist'].isin(top10_artists)].copy()

top10_df['rank'] = (
    top10_df.groupby('month')['listening_hours']
    .rank(ascending=False, method='first')
)

top10_df.sort_values(['month', 'rank'], inplace=True)

fig = px.line(
    top10_df,
    x='month',
    y='rank',
    color='artist',
    markers=True,
    title='🎚️ Top 10 Artists Rank Over Time (May 2019 – May 2025)',
    labels={'month': 'Month', 'rank': 'Rank'},
    template='plotly_dark'
)

fig.update_yaxes(autorange='reversed', dtick=1)

fig.update_layout(
    xaxis=dict(tickangle=45),
    hovermode='x unified',
    height=600
)

fig.show()






The drama of artist loyalty.

This chart shows the monthly rankings of my top 10 most played artists. Who held the crown, who rose through the ranks, and who vanished after a summer fling? I love the detail on this chart. It really highlights how a top artist can come out of nowhere (like Cigarettes After Sex) and get boosted by a huge number of minutes listened, or be a steady favorite (like Lil Yachty).
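To make the ranking step concrete, here is a minimal sketch on made-up numbers (the artists `A`, `B`, `C` and their hours are placeholders, not my data). Within each month, the artist with the most hours gets rank 1, and `method='first'` breaks ties by row order.

```python
import pandas as pd

# Hypothetical monthly totals for three placeholder artists
toy = pd.DataFrame({
    'month': ['2023-01', '2023-01', '2023-01', '2023-02', '2023-02', '2023-02'],
    'artist': ['A', 'B', 'C', 'A', 'B', 'C'],
    'listening_hours': [5.0, 2.0, 9.0, 1.0, 4.0, 3.0],
})

# Rank within each month: most hours -> rank 1
toy['rank'] = (
    toy.groupby('month')['listening_hours']
    .rank(ascending=False, method='first')
    .astype(int)
)

print(toy.sort_values(['month', 'rank']))
```

In this toy data, `C` leads January and drops to second in February, which is exactly the kind of movement the line chart traces.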

### 🎧 Rolling Top 5 Artists

In [None]:
import plotly.express as px

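# Drop the timezone explicitly so to_period() does not warn about losing it
df_filtered['month'] = df_filtered['timestamp'].dt.tz_localize(None).dt.to_period('M').astype(str)

# Rank artists by total listening time and keep the top 5
top_artists = df_filtered.groupby('artist')['duration_ms'].sum().sort_values(ascending=False)
top5_artists = top_artists.head(5).index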
df_top5 = df_filtered[df_filtered['artist'].isin(top5_artists)].copy()

monthly_artist = (
    df_top5.groupby(['month', 'artist'])['duration_ms']
    .sum()
    .reset_index()
)

monthly_artist['listening_hours'] = monthly_artist['duration_ms'] / (1000 * 60 * 60)

monthly_artist = monthly_artist.sort_values('month')

fig = px.area(
    monthly_artist,
    x='month',
    y='listening_hours',
    color='artist',
    line_group='artist',
    title='🎧 Rolling Top 5 Artists (May 2019 – May 2025)',
    labels={'month': 'Month', 'listening_hours': 'Hours Listened'},
    template='plotly_dark'
)

fig.update_layout(
    xaxis=dict(tickangle=45, tickfont=dict(size=10)),
    yaxis=dict(title='Listening Time (Hours)'),
    legend_title='Artist',
    hovermode='x unified'
)

fig.show()




The core artists who soundtracked my life.

This smooth, layered graph shows how my top 5 artists shifted over time. You can see who stuck around, who faded, and who rose out of nowhere. These are the core 5 as of now: they've got the most minutes! It's interesting, though. So many artists have thousands of minutes, but this shows what consistency does and, on the flip side, that it's never too late to try something new and make it last!

## 🎶 Songs & Albums That Stuck


### 🏆 Top 15 Most Played Tracks

In [None]:
top_tracks = (
    df_filtered.groupby('track')['duration_ms']
    .sum()
    .sort_values(ascending=False)
    .head(15)
    .reset_index()
)

top_tracks['listening_hours'] = top_tracks['duration_ms'] / (1000 * 60 * 60)

fig = px.bar(
    top_tracks,
    x='listening_hours',
    y='track',
    orientation='h',
    title='🏆 Top 15 Most Played Tracks of All Time',
    labels={'listening_hours': 'Hours Listened', 'track': 'Track'},
    template='plotly_dark'
)

fig.update_layout(yaxis=dict(autorange='reversed'))
fig.show()


These are the tracks that defined my moods, eras, and daydreams.

Whether it's one I looped on long walks or danced to in my room, each of these tracks earned its place in the hall of fame. Most of these are simply songs I was (or am) OBSESSED with. I'm a huge song repeater, so it's no secret how they earned these spots.

### 💿 Most Listened Albums

In [None]:
top_albums = (
    df_filtered.groupby('album')['duration_ms']
    .sum()
    .sort_values(ascending=False)
    .head(15)
    .reset_index()
)

top_albums['listening_hours'] = top_albums['duration_ms'] / (1000 * 60 * 60)

fig = px.bar(
    top_albums,
    x='listening_hours',
    y='album',
    orientation='h',
    title='💿 Most Listened Albums',
    labels={'listening_hours': 'Hours Listened', 'album': 'Album'},
    template='plotly_dark'
)
fig.update_layout(yaxis=dict(autorange='reversed'))
fig.show()


I have favorites!

I'm not going to lie, I'm not a huge album listener. I only really go straight through when one of my favorite artists drops, or if I'm trying to really go through their discography. So a lot of this came from just consistent listening to song(s) out of the album or just LOVING the album, and that's ok with me! Still accurate.

## ⏰ My Listening Routine

### 📆 Daily Listening Over Time

In [None]:
df_filtered['date'] = df_filtered['timestamp'].dt.date

daily_totals = (
    df_filtered.groupby('date')['duration_ms']
    .sum()
    .reset_index()
)

daily_totals['listening_hours'] = daily_totals['duration_ms'] / (1000 * 60 * 60)
max_day = daily_totals.sort_values(by='listening_hours', ascending=False).head(1)

print("📅 Your top listening day was:", max_day['date'].values[0], "with", round(max_day['listening_hours'].values[0], 2), "hours.")


📅 Your top listening day was: 2023-12-07 with 20.56 hours.


In [None]:
fig = px.line(
    daily_totals,
    x='date',
    y='listening_hours',
    title='📆 Daily Listening Over Time',
    labels={'date': 'Date', 'listening_hours': 'Hours Listened'},
    template='plotly_dark'
)
fig.show()


This chart maps out every peak and dip in my listening history.

Finals week silence? Travel day binges? Emotional highs and lows? It's all here, song by song, day by day. I could look at this one all day; it really defines my life, and the minute counts make so much sense. The huge jump when I started college in 2023 (lots of alone time) and the dip when I started college in 2024 (not much alone time) bring a tear to my eye. Music can really show how much you've grown.

### ⏱️ Listening by Hour of Day

In [None]:
import plotly.express as px

# Use the date-filtered frame so this chart covers the same May 2019 to May 2025 window
df_filtered['hour'] = df_filtered['timestamp'].dt.hour

hourly_listening = (
    df_filtered.groupby('hour')['duration_ms']
    .sum()
    .reset_index()
    .sort_values('hour')
)

hourly_listening['listening_hours'] = hourly_listening['duration_ms'] / (1000 * 60 * 60)

fig = px.bar(
    hourly_listening,
    x='hour',
    y='listening_hours',
    labels={'hour': 'Hour of Day', 'listening_hours': 'Listening Time (Hours)'},
    title='🎧 Total Listening Hours by Hour of Day',
    template='plotly_dark'
)

fig.update_layout(xaxis=dict(dtick=1))
fig.show()


When do I listen the most?

This hourly breakdown reveals my listening patterns: whether I'm a morning motivator, a midday jammer, or a late-night thinker with my headphones in. I'm definitely a late-night listener while I work, so that spike makes sense, and I love an evening walk or workout with tons of music.
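One caveat: as far as I know, Spotify records `ts` in UTC, so the hours above are UTC hours rather than my local clock. A minimal sketch of shifting to local time before extracting the hour, assuming a single home time zone (`'America/New_York'` is only a placeholder, and the two timestamps are made up):

```python
import pandas as pd

LOCAL_TZ = 'America/New_York'  # assumption: swap in your own zone

# Hypothetical UTC timestamps, one in winter and one in summer
ts_utc = pd.Series(
    pd.to_datetime(['2023-12-07T03:30:00Z', '2023-07-01T18:00:00Z'], utc=True)
)

# Convert to local wall-clock time (handles daylight saving), then take the hour
local_hours = ts_utc.dt.tz_convert(LOCAL_TZ).dt.hour

print(local_hours.tolist())  # [22, 14]
```

Applying the same `tz_convert` to `df_filtered['timestamp']` before computing `hour` would line the bars up with my actual day. Travel across time zones would still blur things a little.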

### 🎛️ Weekly Listening Heatmap

In [65]:
import plotly.express as px

df_filtered['hour'] = df_filtered['timestamp'].dt.hour
df_filtered['day_of_week'] = df_filtered['timestamp'].dt.day_name()

heatmap_data = (
    df_filtered.groupby(['day_of_week', 'hour'])['duration_ms']
    .sum()
    .reset_index()
)

heatmap_data['listening_hours'] = heatmap_data['duration_ms'] / (1000 * 60 * 60)

days_order = ['Saturday', 'Friday', 'Thursday', 'Wednesday', 'Tuesday', 'Monday', 'Sunday']
heatmap_data['day_of_week'] = pd.Categorical(heatmap_data['day_of_week'], categories=days_order, ordered=True)
heatmap_data = heatmap_data.sort_values(['day_of_week', 'hour'])

fig = px.density_heatmap(
    heatmap_data,
    x='hour',
    y='day_of_week',
    z='listening_hours',
    color_continuous_scale='Viridis',
    title='🎶 When Do I Listen to Music? (Hourly vs. Day of Week)',
    labels={'hour': 'Hour of Day', 'day_of_week': 'Day of Week', 'listening_hours': 'Hours Listened'},
    nbinsx=24,
    template='plotly_dark'
)

fig.update_layout(yaxis_title='Day of Week', xaxis_title='Hour (24h)')
fig.show()


This heatmap is my musical fingerprint—an intimate look at how my listening habits shift by day and hour.

You’ll find everything from productive Monday mornings to lazy Saturday nights right here. I really like the detail in this heatmap, and it makes so much sense. It's also very satisfying to see that it's pretty consistent across days of the week too.

### 🎬 Top 10 “Main Character” Listening Days

In [None]:
df_filtered['date_only'] = df_filtered['timestamp'].dt.date

long_sessions = (
    df_filtered.groupby(['date_only'])['duration_ms']
    .sum()
    .reset_index()
)

long_sessions['hours'] = long_sessions['duration_ms'] / (1000 * 60 * 60)
top_sessions = long_sessions.sort_values('hours', ascending=False).head(10)

fig = px.bar(
    top_sessions,
    x='date_only',
    y='hours',
    title='🎬 Top 10 “Main Character” Listening Days',
    labels={'date_only': 'Date', 'hours': 'Hours Listened'},
    template='plotly_dark'
)
fig.show()


These were my movie montage moments.

Days when I had music playing for hours on end. Whether I was working hard, walking, or in my feels, these are the days I truly lived inside a soundtrack. It's cool to go back in my camera roll and completely understand why these days exist; I had my headphones GLUED to my head on December 7.

## 🌍 Music That Traveled With Me

### 📌 Where I Listened Most (Map)

In [None]:
!pip install pycountry




In [None]:
import plotly.express as px
import pycountry

# Function to convert Alpha-2 to Alpha-3 codes
def alpha2_to_alpha3(alpha2):
    try:
        return pycountry.countries.get(alpha_2=alpha2.upper()).alpha_3
    except (AttributeError, LookupError):  # unknown code or non-string value
        return None

# Filter and clean
country_data = df_filtered[df_filtered['country'].notnull()].copy()

# Group by 2-letter codes
country_totals = (
    country_data.groupby('country')['duration_ms']
    .sum()
    .reset_index()
)

# Add listening hours
country_totals['listening_hours'] = country_totals['duration_ms'] / (1000 * 60 * 60)

# Convert to ISO-3
country_totals['iso_alpha3'] = country_totals['country'].apply(alpha2_to_alpha3)
country_totals = country_totals[country_totals['iso_alpha3'].notnull()]  # drop unconvertible rows

# Plot
fig = px.choropleth(
    country_totals,
    locations='iso_alpha3',
    color='listening_hours',
    hover_name='country',
    color_continuous_scale='Turbo',
    title='🌍 Where I Listened Most (2019–2025)',
    labels={'listening_hours': 'Hours Listened'},
    template='plotly_dark'
)

fig.update_geos(showcountries=True, showcoastlines=True, showland=True, fitbounds="locations")
fig.update_layout(margin=dict(l=0, r=0, t=50, b=0))
fig.show()


Music travels with me.

This world map shows where in the world I pressed play the most. Whether I was on a plane, walking foreign streets, or at home, my music was always with me. Only the US, Mexico, and Canada show up for these 6 years, but it's still cool to see, and it motivates me to travel much more in the next 6 and grow the map!

## 🔁 Songs That Grew On Me

### 🔄 Skipped at First, Loved Later

In [None]:
regret_df = df_filtered[df_filtered['track'].notnull()]

track_stats = (
    regret_df.groupby('track')
    .agg(plays=('track', 'count'), skips=('skipped', 'sum'), duration=('duration_ms', 'sum'))
    .reset_index()
)

track_stats['skip_rate'] = track_stats['skips'] / track_stats['plays']
track_stats['hours'] = track_stats['duration'] / (1000 * 60 * 60)

# Pick songs with a high overall skip rate but still a lot of listening time
# (a proxy: this does not check that the skips happened before the replays)
redemption = track_stats[(track_stats['skip_rate'] > 0.4) & (track_stats['hours'] > 1)].sort_values('hours', ascending=False).head(10)

fig = px.bar(
    redemption,
    x='hours',
    y='track',
    orientation='h',
    title='🔁 Skipped at First, Loved Later',
    labels={'track': 'Track', 'hours': 'Hours Listened'},
    template='plotly_dark'
)
fig.update_layout(yaxis=dict(autorange='reversed'))
fig.show()


Redemption arcs!

These are the tracks I didn't vibe with at first but came back to and fell for. Proof that first impressions aren't everything, even in music. I LOVE that I was able to create this unique view from the Spotify data. I definitely wasn't an instant fan of these songs, but they circled back over the 6 years and genuinely became repeaters or favorites.
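The filter above uses the overall skip rate, so it can't tell whether the skips actually came first. A rough sketch of a stricter check compares the skip rate in a track's first few plays against its later plays. The helper name `early_vs_late_skip_rate`, the cutoff of 3 plays, and the toy rows are all assumptions for illustration:

```python
import pandas as pd

def early_vs_late_skip_rate(plays: pd.DataFrame, first_n: int = 3) -> pd.DataFrame:
    """Skip rate in each track's first `first_n` plays vs. all later plays."""
    plays = plays.sort_values('timestamp').copy()
    # Treat missing values as "not skipped"
    plays['skipped'] = plays['skipped'].eq(True)
    # 0-based play number within each track, in chronological order
    plays['play_no'] = plays.groupby('track').cumcount()
    plays['phase'] = plays['play_no'].lt(first_n).map({True: 'early', False: 'late'})
    # One row per track, with 'early' and 'late' skip-rate columns
    return plays.groupby(['track', 'phase'])['skipped'].mean().unstack('phase')

# Hypothetical history: X is skipped three times, then played through twice
toy = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2020-01-01', '2020-01-02', '2020-01-03', '2021-06-01', '2021-06-02',
        '2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04',
    ]),
    'track': ['X'] * 5 + ['Y'] * 4,
    'skipped': [True, True, True, False, False, False, False, False, False],
})

rates = early_vs_late_skip_rate(toy)
print(rates)
```

A true redemption arc would show a high `early` rate and a low `late` rate, like track `X` here.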