# Love Data Week 2025 : learn to make your own Spotify wrapped
Date Created : 2025-02-07  
Workshop delivered on : 2025-02-13

Jupyter notebook by : R. Antonio Muñoz Gómez (Cataloguing and Metadata Librarian, University of Waterloo)

Workshop facilitated by : Anneliese Eber (Research Data Management Librarian, University of Waterloo) and R. Antonio Muñoz Gómez (Cataloguing and Metadata Librarian, University of Waterloo)

# Table of contents

1. Prep work  
 1.1 Request your data from Spotify  
 1.2 Place your files in the same directory as the Jupyter notebook  
 1.3 Import and/or install libraries  
 1.4 Set a `file_name` variable for the files to be read  
 1.5 Create a pandas `DataFrame` and add the `.json` data to it  

2. Cleaning, evaluating, and analyzing the data  
 2.1 Inspect the `DataFrame` contents  
 2.2 Convert the `timestamp` column to `datetime` format with the local time zone  
 2.3 Extract the year from the `timestamp` onto a new `year` column  
 2.4 Calculate the total minutes played in the year  
 2.5 Count the distinct songs listened to in the year  
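 2.6 Obtain the top ten artists played in a given year  
 2.7 Obtain the top ten artists played in a given year broken down by month  
 2.8 Obtain the day of the year on which you listened to the most music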

3. Visualization  
 3.1 Word cloud with top 50 words in song titles  
 3.2 Bar chart with top ten artists listened to in a given year (2024)  
 3.3 Bar charts with top ten artists listened to in a given year (2024) broken down by month

4. Exporting files  
 4.1 Export the `DataFrame` as `.csv` for further exploration and analysis  
 4.2 Export the word cloud image to `.jpg` file  
 4.3 Export all the visualizations into a `.pdf` file and as separate `.jpg` files

5. References
 5.1 Artificial Intelligence Disclosure (AID) Statement

# 1. Prep work






## Step 1.1 : Request your data from Spotify

During the live workshop demo, we used the data requested and obtained by the facilitators.

To create your own Spotify wrapped, you will need to request your data directly from Spotify by going to [Spotify's Account privacy page](https://www.spotify.com/ca-en/account/privacy/) and requesting your **extended streaming history** data. (Note: website last accessed 2025-02-06).

You will receive an email to confirm your request, and a separate email to confirm that your data is ready for download.

The download will include various files, including a README file in `.pdf` format that explains the data.

The files that you will use will be in [JSON (JavaScript Object Notation)](https://www.json.org/json-en.html) format.

The files will include data for Audio and for Video streaming activity. For this workshop we will only be using the Audio streaming activity.
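
As a rough illustration of the JSON format, the sketch below parses a small hand-written sample with Python's built-in `json` module. The field names are the ones this notebook uses later; every value is invented, and the files you receive will likely contain additional fields that the README explains.

```python
import json

# Hypothetical sample: two records shaped like entries in a streaming history file.
# All values are invented for illustration.
sample = '''
[
  {"ts": "2024-03-01T14:05:09Z", "ms_played": 215000,
   "master_metadata_track_name": "Example Song",
   "master_metadata_album_artist_name": "Example Artist",
   "spotify_track_uri": "spotify:track:EXAMPLE1", "ip_addr": "0.0.0.0"},
  {"ts": "2024-03-01T14:08:44Z", "ms_played": 30000,
   "master_metadata_track_name": "Another Song",
   "master_metadata_album_artist_name": "Example Artist",
   "spotify_track_uri": "spotify:track:EXAMPLE2", "ip_addr": "0.0.0.0"}
]
'''

# A JSON array of objects becomes a Python list of dictionaries
records = json.loads(sample)

print(len(records))               # number of plays in the sample
print(sorted(records[0].keys()))  # field names of the first play
```

Each dictionary in the list corresponds to one play, which is why the data fits naturally into a table with one row per play.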

## Step 1.2 : Place your files in the same directory as the Jupyter notebook

The Python code in this notebook assumes that all your files are saved in the same directory as the Jupyter notebook itself.

## Step 1.3 : Import and/or install libraries

In programming, 'libraries' refer to code that someone else has written and made available for re-use. Rather than having to write all that code from scratch, you can use it by 'importing' the libraries into your own code.

To successfully import a library, first you must make sure that it has been installed.

The code below will try to import the following libraries (follow the links if you want to learn more about each of these libraries and what they are used for):
- [pandas](https://pandas.pydata.org/) (Python Data Analysis Library)
- [glob](https://en.wikipedia.org/wiki/Glob_(programming))
- [pytz](https://github.com/stub42/pytz/blob/master/src/README.rst)
- [Matplotlib](https://matplotlib.org/)
- [collections](https://docs.python.org/3/library/collections.html)
- [wordcloud](https://pypi.org/project/wordcloud/)
- [nltk](https://www.nltk.org/) (Natural Language Toolkit)

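If a third-party library cannot be found on the system, the code will install it and then import it. `glob` and `collections` are part of Python's standard library, so they are always available and never need to be installed.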

In [None]:
# glob and collections ship with Python's standard library, so they never need installing
import glob
from collections import Counter

# For each third-party library: try to import it, and install it only if the import fails
try:
    import pandas as pd
except ImportError:
    !pip install pandas
    import pandas as pd

try:
    import pytz
except ImportError:
    !pip install pytz
    import pytz

# The plotting functions used in this notebook live in matplotlib.pyplot
try:
    import matplotlib.pyplot as plt
except ImportError:
    !pip install matplotlib
    import matplotlib.pyplot as plt

try:
    from wordcloud import WordCloud
except ImportError:
    !pip install wordcloud
    from wordcloud import WordCloud

try:
    import nltk
    from nltk.corpus import stopwords
except ImportError:
    !pip install nltk
    import nltk
    from nltk.corpus import stopwords
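
If you would rather check what is installed before running the cell above, the following sketch uses the standard library's `importlib.util.find_spec`, which looks a module up without importing it. The helper name `missing_libraries` is made up for this example.

```python
import importlib.util

def missing_libraries(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The libraries listed in this step
required = ['pandas', 'glob', 'pytz', 'matplotlib', 'collections', 'wordcloud', 'nltk']
print('Not installed:', missing_libraries(required))
```

An empty list means every library in the step is already available.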

## Step 1.4 : Set a `file_name` variable for the file(s) to be read

The following code will look for all the files with the `.json` extension so that we can work with the data.

In [None]:
file_name = '*.json'
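
Note that `'*.json'` matches every JSON file in the directory, so any video history files saved there will be read as well. The sketch below shows how a narrower pattern behaves. The file names are assumptions made for illustration, so check the names in your own download before adapting the pattern.

```python
import glob
import os
import tempfile

# Hypothetical file names, used only to show how the patterns match
names = ['Streaming_History_Audio_2023-2024_0.json',
         'Streaming_History_Video_2023-2024.json',
         'ReadMeFirst.pdf']

with tempfile.TemporaryDirectory() as folder:
    # Create empty files with those names in a temporary folder
    for name in names:
        open(os.path.join(folder, name), 'w').close()

    # '*' matches any run of characters, so '*.json' matches both JSON files
    every_json = sorted(os.path.basename(p) for p in glob.glob(os.path.join(folder, '*.json')))

    # Requiring 'Audio' somewhere in the name leaves out the video file
    audio_only = sorted(os.path.basename(p) for p in glob.glob(os.path.join(folder, '*Audio*.json')))

print(every_json)
print(audio_only)
```

If your audio files follow a similar naming scheme, setting `file_name = '*Audio*.json'` would restrict the notebook to audio streaming activity.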

## Step 1.5 : Create a pandas `DataFrame` and add the `.json` data to it

In the pandas library, a `DataFrame` refers to a table with data.

- Create a main `DataFrame` called (arbitrarily) `df` onto which all the data from the `.json` files will be added.
- Iterate through each of the `.json` files performing the following tasks:
 - Read the `.json` file and temporarily store its contents on a `temp` `DataFrame`
 - Transfer the contents of `temp` onto the main `DataFrame`

After the above actions are completed, the code will print the total number of rows and columns in the `DataFrame`.

In [None]:
df = pd.DataFrame()

for file in glob.glob(file_name):
  temp = pd.read_json(file)
  df = pd.concat([df, temp])

print(df.shape)
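
One detail of `pd.concat` worth knowing: by default each file keeps its own row labels, so the combined table contains repeated labels. The sketch below, which uses two tiny stand-in tables instead of real files, shows the `ignore_index=True` option that renumbers the rows. This is an optional variation, not something the rest of the notebook depends on.

```python
import pandas as pd

# Two small stand-in tables; in the notebook each would come from pd.read_json(file)
first = pd.DataFrame({'ms_played': [1000, 2000]})
second = pd.DataFrame({'ms_played': [3000, 4000]})

# Default behaviour: row labels from each table are kept as they were
stacked = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0 in the combined table
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(renumbered.shape)
```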

# 2. Cleaning, evaluating, and analyzing the data

## Step 2.1 : Inspect the `DataFrame` contents

The following code will display a portion of the `DataFrame` contents. This is useful to start analyzing the contents and get ideas for what kinds of cleanup may be required, as well as which analysis may be possible.

**NOTE:** The following code excludes one column (`ip_addr`) from the display, because this notebook was designed for instruction and the instructors shared their screens while using their personal data. Every time this notebook displays data from the `DataFrame`, it uses the same `excludes` option. To see all the available columns, replace the three lines of code in the next cell with the single command `df`.

In [None]:
excludes = ['ip_addr']
df_excludes = df.drop(columns=excludes).head()
df_excludes

## Step 2.2 : Convert the timestamp column to `datetime` format with local time zone

The timestamp (`ts`) column as obtained from the `.json` files is not formatted as a `datetime` column, but as a text string. This impacts the way in which the column can be analyzed.

The following code will:
- Convert the data in the `ts` column from string to `datetime` format
- Change Spotify's default [UTC time zone](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) to Canadian Eastern time
- Sort the `DataFrame` by the `ts` column from oldest to newest
- Display the resulting changes (excluding the `ip_addr` column, as in Step 2.1 above)

In [None]:
df['ts'] = pd.to_datetime(df['ts'], format='%Y-%m-%dT%H:%M:%SZ', utc=True)

local_tz = pytz.timezone('America/Toronto')
df['ts'] = df['ts'].dt.tz_convert(local_tz)

df.sort_values(by='ts', inplace=True)

df_excludes = df.drop(columns=excludes).head()
df_excludes
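
To see why the time zone conversion matters, the sketch below converts two invented UTC timestamps to `America/Toronto`. Both fall shortly after midnight UTC, so in local time they belong to the previous day, and the first one even moves to the previous year.

```python
import pandas as pd

# Two invented timestamps in the same text format as the `ts` column
ts = pd.Series(['2024-01-01T03:30:00Z', '2024-07-01T03:30:00Z'])

# Parse the text as UTC, then convert to Canadian Eastern time
utc = pd.to_datetime(ts, format='%Y-%m-%dT%H:%M:%SZ', utc=True)
local = utc.dt.tz_convert('America/Toronto')

print(local)
print(local.dt.year.tolist())   # years as seen in local time
print(local.dt.month.tolist())  # months as seen in local time
```

The winter timestamp shifts by five hours and the summer one by four, because the conversion accounts for daylight saving time. Any per-year, per-month, or per-day total computed later therefore depends on doing this conversion first.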

## Step 2.3 : Extract the year from the timestamp onto a new `year` column

In [None]:
df['year'] = df['ts'].dt.year
df[['ts' , 'year']]

## Step 2.4 : Calculate the total minutes played in the year

Given a specific year:
- Add all the values from the `ms_played` column
- Convert from ms to min (dividing by 60000)
- Round the number to the nearest integer
- Print a message indicating the total number of minutes listened

**NOTE:** In the code below, the year is set to **2024**, but can be changed if/as needed by updating the code as in the following example.

### Before
`total_ms = df[df['year']==`**2024**`]['ms_played'].sum()`
### After
`total_ms = df[df['year']==`**2025**`]['ms_played'].sum()`

In [None]:
total_ms = df[df['year']==2024]['ms_played'].sum()
total_minutes = total_ms / 60000
total_minutes = round(total_minutes)
print('You listened for ' + str(total_minutes) + ' minutes this year')
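
The arithmetic can be checked by hand with a made-up total. The sketch below wraps the conversion in a small helper (the name `ms_to_minutes` is only for this example) and also splits the result into hours and minutes.

```python
def ms_to_minutes(total_ms):
    """Convert milliseconds to minutes, rounded to the nearest whole minute."""
    return round(total_ms / 60000)

# Invented total: 5,432,100 ms is 90.535 minutes before rounding
total_minutes = ms_to_minutes(5_432_100)

# divmod returns the whole hours and the leftover minutes in one step
hours, minutes = divmod(total_minutes, 60)

print(f'You listened for {total_minutes} minutes ({hours} h {minutes} min)')
```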

## Step 2.5 : Count the distinct songs listened to in the year

Given a specific year:
- Count the distinct values in the `spotify_track_uri` column

**NOTE:** In the code below, the year is set to **2024**, but can be changed if/as needed by updating the code as in the following example.

### Before
`distinct_songs = df[df['year']==`**2024**`]['spotify_track_uri'].nunique()`
### After
`distinct_songs = df[df['year']==`**2025**`]['spotify_track_uri'].nunique()`

In [None]:
distinct_songs = df[df['year']==2024]['spotify_track_uri'].nunique()
print('You listened to ' + str(distinct_songs) + ' songs this year')
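
The idea behind `nunique()` can be sketched in plain Python with a set, which keeps only one copy of each value. By default `nunique()` also leaves missing values out of the count, which the sketch imitates by skipping `None`. The URIs below are invented.

```python
# Invented track URIs; None stands in for a row that has no track URI
uris = ['spotify:track:A', 'spotify:track:B', 'spotify:track:A', None, 'spotify:track:B']

# A set discards repeats, so its length is the number of distinct tracks
distinct_songs = len({uri for uri in uris if uri is not None})

print(f'{len(uris)} plays, {distinct_songs} distinct songs')
```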

## Step 2.6 : Obtain the top ten artists played in a given year

This code does the following operations:
- Filters data by a given year
- Counts the number of times each artist name (column = `master_metadata_album_artist_name`) appears in the `DataFrame`
- Sorts the results in descending order
- Takes the top ten results

**NOTE:** The year can be changed based on your own data set characteristics. Simply change the year in the code below. You can also change the total number of results that you get by updating the code as follows:
### Before
`df[df['year']==`**2024**`].groupby('master_metadata_album_artist_name')['master_metadata_album_artist_name'].count().sort_values(ascending=False).head(`**10**`)`
### After
`df[df['year']==`**2025**`].groupby('master_metadata_album_artist_name')['master_metadata_album_artist_name'].count().sort_values(ascending=False).head(`**8**`)`

In [None]:
df[df['year']==2024].groupby('master_metadata_album_artist_name')['master_metadata_album_artist_name'].count().sort_values(ascending=False).head(10)
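
A shorter way to get a similar ranking is `value_counts()`, which counts each distinct value and sorts the counts in descending order in one step. The sketch below applies it to a small invented table. It should give the same counts as the `groupby` chain above, although artists with equal counts may appear in a different order.

```python
import pandas as pd

# Invented plays: one row per play; None stands in for a row without an artist name
plays = pd.DataFrame({
    'year': [2024, 2024, 2024, 2023, 2024],
    'master_metadata_album_artist_name': ['Artist A', 'Artist B', 'Artist A', 'Artist B', None],
})

# Select the artist column for 2024, count each name, and keep the top two
top = plays.loc[plays['year'] == 2024, 'master_metadata_album_artist_name'].value_counts().head(2)

print(top)
```

Rows without an artist name are left out of the count by default, as they are with `groupby`.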

## Step 2.7 : Obtain the top ten artists played in a given year broken down by month

The following code defines a function called `top_artists_by_month` which takes two arguments:
1) the `DataFrame`
2) the year you are interested in

This function will:

- Filter the data by a specified year (example, 2024)
- Extract the month from the timestamp
- Group the data by month and artist, then count the occurrences
- Sort by month and count in descending order
- Get the top 10 artists for each month

In [None]:
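def top_artists_by_month(df, year):
  # Keep only the rows for the requested year; .copy() makes an independent table,
  # which avoids pandas' SettingWithCopyWarning when the month column is added below
  df_year = df[df['year'] == year].copy()

  # Extract the month number (1 to 12) from the timestamp
  df_year['month'] = df_year['ts'].dt.month

  # Count the plays for each combination of month and artist
  top_artists = df_year.groupby(['month', 'master_metadata_album_artist_name'])['master_metadata_album_artist_name'].count().reset_index(name='count')

  # Order by month, and within each month from most to fewest plays
  top_artists = top_artists.sort_values(['month', 'count'], ascending=[True, False])

  # Build a dictionary that maps each month number to a list of its top ten artists
  result = {}
  for month in top_artists['month'].unique():
    result[month] = list(top_artists[top_artists['month'] == month]['master_metadata_album_artist_name'].head(10))

  return result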

The following cell runs the function above for the year 2024. You can change the `year` argument in the function as in the following example:

### Before
`top_artists_`**2024**` = top_artists_by_month(df, `**2024**`)`
### After
`top_artists_`**2025**` = top_artists_by_month(df, `**2025**`)`

In [None]:
top_artists_2024 = top_artists_by_month(df, 2024)
top_artists_2024
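
The result is a dictionary that maps each month number to a list of artist names, ordered from most to fewest plays. The plain-Python sketch below builds the same kind of structure from a handful of invented plays, which may help when reading the output above. It illustrates the shape of the result and is not the notebook's own method.

```python
from collections import Counter, defaultdict

# Invented (month, artist) pairs, one per play
plays = [(1, 'Artist A'), (1, 'Artist B'), (1, 'Artist A'),
         (2, 'Artist B'), (2, 'Artist B'), (2, 'Artist C')]

# One Counter per month, created automatically the first time a month is seen
counts_by_month = defaultdict(Counter)
for month, artist in plays:
    counts_by_month[month][artist] += 1

# Keep the most played artist(s) of each month; the notebook keeps ten
top_n = 1
result = {month: [artist for artist, _ in counter.most_common(top_n)]
          for month, counter in sorted(counts_by_month.items())}

print(result)
```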

## Step 2.8 : Obtain the day of the year on which you listened to the most music

Given a year, the following code defines a function called `calculate_highest_listening_day` which takes two arguments:
1) the `DataFrame`
2) the year

This function will:

- Calculate the amount of ms played each day
- Convert the amount from ms to minutes, rounding to the nearest integer
- Pick the day with the highest value
- Print the message 'Your biggest listening day was [date] with [total] minutes'.

**NOTE:** The year can be changed based on your own data set characteristics. Simply change the year in the code below. Please note you should **ONLY CHANGE THE ARGUMENT IN THE LAST LINE OF CODE**:
### Before
`calculate_highest_listening_day(df, `**2024**`)`
### After
`calculate_highest_listening_day(df, `**2025**`)`

In [None]:
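# Add a column holding only the calendar date (no time of day) of each play
df['date'] = df['ts'].dt.date

def calculate_highest_listening_day(df, year):
    # Keep only the rows for the requested year
    df_filtered = df[df['ts'].dt.year == year]

    # Add up the milliseconds played on each date
    total_ms_per_day = df_filtered.groupby('date')['ms_played'].sum().reset_index()

    # Convert to minutes, rounded to the nearest whole minute and stored as integers
    # so that the message prints '312' rather than '312.0'
    total_ms_per_day['minutes'] = (total_ms_per_day['ms_played'] / 60000).round().astype(int)

    # Select the row with the highest number of minutes
    max_minutes_day = total_ms_per_day.loc[total_ms_per_day['minutes'].idxmax()]

    print(f"Your biggest listening day in {year} was {max_minutes_day['date']} with {max_minutes_day['minutes']} minutes.")

calculate_highest_listening_day(df, 2024)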
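
The same steps (add up per day, convert, pick the maximum) can be followed by hand on a few invented plays. The sketch below uses only the standard library and is meant as a way to check the logic, not as a replacement for the function above.

```python
from collections import defaultdict
from datetime import date

# Invented (date, ms_played) pairs, one per play
plays = [(date(2024, 5, 1), 180000), (date(2024, 5, 1), 240000),
         (date(2024, 5, 2), 200000), (date(2024, 5, 2), 90000)]

# Add up the milliseconds played on each date
ms_per_day = defaultdict(int)
for day, ms in plays:
    ms_per_day[day] += ms

# Pick the date whose total is largest, then convert that total to minutes
best_day = max(ms_per_day, key=ms_per_day.get)
best_minutes = round(ms_per_day[best_day] / 60000)

print(f'Your biggest listening day was {best_day} with {best_minutes} minutes.')
```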

# 3. Visualization

## Example 3.1 : Word cloud with top 50 words of song titles

Before we count the top 50 words, some cleanup is required.

### Removing stopwords
- Stopwords are words that are excluded from the frequency analysis. Usually these are words that have high frequency in a language but may have little value for the purpose of analysis (for example, articles). The following code will download existing sets of common stopwords for English, Spanish, and French (the predominant languages present in the data set used for the workshop as an example).
### Normalizing text
- There are different ways to normalize text. In this example, all the words are converted to lowercase. Otherwise, they would be counted separately (e.g. 'happy' and 'Happy' would be counted as two different words)
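### Removing punctuation
- The code keeps only words made up entirely of letters and digits (`word.isalnum()`). Any word with punctuation attached (for example, 'love,' or '(remix)') is therefore left out of the count altogether, rather than being stripped of its punctuation.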

In [None]:
import matplotlib.pyplot as plt

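# Check that the stopword lists are available locally, and download them if they are not
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Combine the English, Spanish, and French stopwords into a single set
stop_words = set(stopwords.words('english'))
stop_words.update(stopwords.words('spanish'))
stop_words.update(stopwords.words('french'))

# Join all track titles into one string; dropna() skips rows without a title,
# which astype(str) would otherwise turn into the word 'nan'
text = ' '.join(df['master_metadata_track_name'].dropna().astype(str).tolist())

# Normalize to lowercase and split the text into words
words = text.lower().split()

# Keep words made up only of letters and digits that are not stopwords
words = [word for word in words if word.isalnum() and word not in stop_words]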
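
Because the `word.isalnum()` filter discards any word that has punctuation attached, some title words never reach the count. One possible alternative, sketched below, strips the punctuation and keeps the word. The helper name `clean_words`, the sample titles, and the one-word stopword set are all invented for this example.

```python
import string

def clean_words(titles, stop_words):
    """Lowercase the titles, strip ASCII punctuation, and drop stopwords."""
    # Translation table that deletes every ASCII punctuation character
    remove_punctuation = str.maketrans('', '', string.punctuation)
    words = []
    for title in titles:
        for word in title.lower().split():
            word = word.translate(remove_punctuation)
            # Skip words that were only punctuation, and skip stopwords
            if word and word not in stop_words:
                words.append(word)
    return words

titles = ['Love, Actually', 'The Love (Remix)', "Don't Stop"]
print(clean_words(titles, stop_words={'the'}))
```

A side effect of this approach is that contractions lose their apostrophe ("don't" becomes "dont"), and only ASCII punctuation is removed, so whether it suits your data is a judgment call.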

The following code will:

- Count all the distinct words in the `master_metadata_track_name` column (except stopwords)
- Identify the top 50
- Produce a word cloud

**NOTE:** The number of words in your wordcloud can be changed in the code as in the following example:
### Before
`most_common_words = word_counts.most_common(`**50**`)`
### After
`most_common_words = word_counts.most_common(`**100**`)`

In [None]:
word_counts = Counter(words)

most_common_words = word_counts.most_common(50)

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(most_common_words))

plt.figure(figsize=(10, 5), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
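
The counting step can be tried on its own with a short invented word list. `Counter.most_common(n)` returns the `n` most frequent words as (word, count) pairs, and `dict()` turns those pairs into the word-to-frequency mapping that is passed to `generate_from_frequencies`.

```python
from collections import Counter

# Invented word list standing in for the cleaned title words
words = ['love', 'night', 'love', 'remix', 'night', 'love']

# Count every word, then keep the two most frequent (the notebook keeps 50)
word_counts = Counter(words)
most_common_words = word_counts.most_common(2)

# Convert the (word, count) pairs into a dictionary of frequencies
frequencies = dict(most_common_words)

print(most_common_words)
print(frequencies)
```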

## Example 3.2 : Bar chart with top ten artists listened to in a given year (2024)

In [None]:
top_ten_artists = df[df['year']==2024].groupby('master_metadata_album_artist_name')['master_metadata_album_artist_name'].count().sort_values(ascending=False).head(10)

artists = top_ten_artists.index.tolist()
counts = top_ten_artists.values.tolist()

plt.figure(figsize=(10, 6))
plt.barh(artists, counts, color='skyblue')
plt.xlabel('Number of Times Listened')
plt.ylabel('Artist Name')
plt.title('Top Ten Artists of the Year')

for i, v in enumerate(counts):
    plt.text(v, i, str(v), color='black', va='center')

plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## Example 3.3 : Bar charts for top ten artists in a given year (2024) broken down by month


In [None]:
month_names = {
    1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June',
    7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'
}

for month in top_artists_2024:
    # Filter on the year as well as the month; the month alone would mix in plays from other years
    in_month = (df['year'] == 2024) & (df['ts'].dt.month == month)
    artist_counts = df[in_month].groupby('master_metadata_album_artist_name')['master_metadata_album_artist_name'].count().sort_values(ascending=False).head(10)

    # Month-specific names keep the yearly `artists` and `counts` from Example 3.2 intact
    month_artists = artist_counts.index.tolist()
    month_counts = artist_counts.values.tolist()

    plt.figure(figsize=(10, 6))
    plt.barh(month_artists, month_counts, color='skyblue')
    plt.xlabel('Number of Times Listened')
    plt.ylabel('Artist Name')
    plt.title(f'Top ten artists by month - {month_names[month]}')

    # Write each count at the end of its bar
    for i, v in enumerate(month_counts):
        plt.text(v, i, str(v), color='black', va='center')

    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

# 4. Exporting files

## Step 4.1 : Export the `DataFrame` as `.csv` for further exploration and analysis

In the code below, the column called `ip_addr` is excluded from the export. The data is saved into a `.csv` file with name `spotify-data.csv`

In [None]:
df.drop(columns=['ip_addr']).to_csv('spotify-data.csv', index=False)

## Step 4.2 : Export the word cloud image to `.jpg` file

The output will be a `.jpg` file with file name `spotify-wordcloud.jpg`

In [None]:
wordcloud.to_file('spotify-wordcloud.jpg')

## Step 4.3 : Export all the visualizations into a `.pdf` file and as separate `.jpg` files

In [None]:
from matplotlib.backends.backend_pdf import PdfPages

pdf_filename = 'spotify-analysis.pdf'

with PdfPages(pdf_filename) as pdf:

    fig = plt.figure(figsize=(10, 5), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    pdf.savefig(fig)
    plt.close(fig)

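    # Rebuild the yearly labels and values from top_ten_artists (Example 3.2)
    # so that this chart cannot pick up leftover monthly data
    artists = top_ten_artists.index.tolist()
    counts = top_ten_artists.values.tolist()

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.barh(artists, counts, color='skyblue')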
    ax.set_xlabel('Number of Times Listened')
    ax.set_ylabel('Artist Name')
    ax.set_title('Top Ten Artists of the Year')

    for i, v in enumerate(counts):
        ax.text(v, i, str(v), color='black', va='center')

    ax.invert_yaxis()
    plt.tight_layout()
    pdf.savefig(fig)
    plt.close(fig)

    plt.figure(figsize=(10, 6))
    plt.barh(artists, counts, color='skyblue')
    plt.xlabel('Number of Times Listened')
    plt.ylabel('Artist Name')
    plt.title('Top Ten Artists of the Year')
    for i, v in enumerate(counts):
        plt.text(v, i, str(v), color='black', va='center')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.savefig('spotify-top-ten-artists.jpg')
    plt.close()

    for month in top_artists_2024:
      # Filter on the year as well as the month, so plays from other years are not mixed in
      artist_counts = df[(df['year'] == 2024) & (df['ts'].dt.month == month)].groupby('master_metadata_album_artist_name')['master_metadata_album_artist_name'].count().sort_values(ascending=False).head(10)
      artists_month = artist_counts.index.tolist()
      counts_month = artist_counts.values.tolist()

      fig, ax = plt.subplots(figsize=(10, 6))
      ax.barh(artists_month, counts_month, color='skyblue')
      ax.set_xlabel('Number of Times Listened')
      ax.set_ylabel('Artist Name')
      ax.set_title(f'Top ten artists by month - {month_names[month]}')

      for i, v in enumerate(counts_month):
          ax.text(v, i, str(v), color='black', va='center')

      ax.invert_yaxis()
      plt.tight_layout()
      pdf.savefig(fig)
      plt.close(fig)

      plt.figure(figsize=(10, 6))
      plt.barh(artists_month, counts_month, color='skyblue')
      plt.xlabel('Number of Times Listened')
      plt.ylabel('Artist Name')
      plt.title(f'Top ten artists by month - {month_names[month]}')
      for i, v in enumerate(counts_month):
          plt.text(v, i, str(v), color='black', va='center')

      plt.gca().invert_yaxis()
      plt.tight_layout()
      plt.savefig(f'spotify-top-ten-artists-{month_names[month]}.jpg')
      plt.close()

print(f"PDF file '{pdf_filename}' and .jpg files for monthly top ten artists created successfully.")

# 5. References

Data Liam. (2024, October 4). _How To Analyze Your Own Spotify Streaming Data in Python_ [Video recording]. https://youtu.be/OLxqUuiwO_g?si=zWTOrbI96g5gGaeS

Khan, F. (2023). _Unveiling Sentiments: Analyzing BBC Interview Comments_. GitHub. https://github.com/furqaan12/Unveiling-Sentiments-Analyzing-BBC-Interview-Comments

## 5.1 Artificial Intelligence Disclosure [(AID) Statement](https://doi.org/10.48550/arXiv.2408.01904)

_Artificial Intelligence Tools_ : Gemini (no specified version) and Microsoft Copilot (University of Waterloo institutional instance). _Execution_ : Gemini and Microsoft Copilot were used to write and troubleshoot portions of the Python code in this Jupyter notebook.