# Analyzing my Spotify Streaming History

Author: Noah Stemen

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

The goal of this project is to explore aspects of my personal streaming history as provided by Spotify and to examine the variables provided by the "1 Million Tracks" dataset. After creating a new DataFrame containing only the tracks that appear in both, I will explore a few trends in the variables with charts and use linear regression to fit a prediction line between two columns of data.

## Creating & Filtering my "Streaming History" DataFrame

In [1]:
import pandas as pd
#first, let's import all of our files and use "sh" for "streaming history"
sh1 = pd.read_json("StreamingHistory0.json") #10,000 rows
sh2 = pd.read_json("StreamingHistory1.json") #10,000 rows
sh3 = pd.read_json("StreamingHistory2.json") #10,000 rows
sh4 = pd.read_json("StreamingHistory3.json") #10,000 rows
sh5 = pd.read_json("StreamingHistory4.json") #1,650 rows
#next, because each file covers a different period of my streaming history, let's combine them
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead)
sh = pd.concat([sh1, sh2, sh3, sh4, sh5])
#to check it is properly combined, let's check the shape of this new dataframe - it should be 41,650 rows
sh.shape

(41650, 4)
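The combining step can also be wrapped in a small helper that checks the row count automatically. This is only a sketch on toy data: `combine_history` is a hypothetical helper name, and the two small frames stand in for the real export files.

```python
import pandas as pd

def combine_history(frames):
    """Stack per-file frames into one frame and check that no rows were lost."""
    combined = pd.concat(frames, ignore_index=True)
    expected_rows = sum(len(f) for f in frames)
    assert len(combined) == expected_rows, "row count changed during concat"
    return combined

# Toy stand-ins for the real files (hypothetical values).
part1 = pd.DataFrame({"trackName": ["A", "B"], "msPlayed": [1000, 2000]})
part2 = pd.DataFrame({"trackName": ["C"], "msPlayed": [3000]})

toy_sh = combine_history([part1, part2])
print(toy_sh.shape)  # (3, 2)
```

Passing `ignore_index=True` gives the combined frame one continuous index. The notebook's own cell keeps each file's original index, which is why index values repeat in the outputs further down.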

In [2]:
sh
#pulling this up to get the column names

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2022-09-08 00:35,Jon Bellion,All Time Low,119820
1,2022-09-08 00:35,Jon Bellion,Eyes To The Sky,1560
2,2022-09-08 00:38,Lawrence,False Alarms (with Jon Bellion),84710
3,2022-09-08 00:38,Jon Bellion,While You Count Sheep,1250
4,2022-09-08 00:39,Jon Bellion,Blu,58630
...,...,...,...,...
1645,2023-09-08 23:37,Hozier,De Selby (Part 1),3940
1646,2023-09-08 23:37,half•alive,Nobody - Live,220990
1647,2023-09-08 23:37,half•alive,What's Wrong,36933
1648,2023-09-08 23:45,Hozier,Unknown / Nth,280106


Next, I want to filter out certain recorded tracks that I don't want included. More specifically, I frequently listen to white noise and rain sounds on Spotify when getting work done, as a means to help me keep focus. Because of how many minutes of this would be recorded and its irrelevance to an analysis of my taste in music, let's remove it.

In [3]:
# first, let's find the full name of the white noise track
sh[sh['trackName'].str.contains('Noise')]
# "White Noise 2 Ho...", "White Noise 3 Ho...", and "Sleep Sounds Rai..." are all white noise tracks
# "Street Noise" & "Turn Off The Noise" are songs we will include in our work

Unnamed: 0,endTime,artistName,trackName,msPlayed
1367,2022-09-25 03:41,Erik Eriksson,White Noise 2 Hour Long,4216
1383,2022-09-25 04:05,Erik Eriksson,White Noise 2 Hour Long,560
5253,2022-11-01 18:33,Erik Eriksson,White Noise 2 Hour Long,612690
5254,2022-11-01 18:33,Erik Eriksson,White Noise 2 Hour Long,9600
5256,2022-11-01 19:42,Erik Eriksson,White Noise 2 Hour Long,2822980
5257,2022-11-01 19:46,Erik Eriksson,White Noise 2 Hour Long,255920
5258,2022-11-01 20:43,Erik Eriksson,White Noise 2 Hour Long,1380030
5259,2022-11-01 21:21,Erik Eriksson,White Noise 2 Hour Long,666990
5367,2022-11-02 04:23,Erik Eriksson,White Noise 2 Hour Long,16362
1684,2023-01-10 18:39,Erik Eriksson,White Noise 2 Hour Long,2326260


In [4]:
sh = sh[~sh['trackName'].str.contains('White Noise|Sleep Sounds', case=False)]
# to test this worked, this should now only return the songs "Street Noise" and "Turn off the Noise"
sh[sh['trackName'].str.contains('Noise')]

Unnamed: 0,endTime,artistName,trackName,msPlayed
4503,2023-02-02 18:09,Thymes,Street Noise,1045
4504,2023-02-02 18:09,Thymes,Street Noise,30859
4505,2023-02-02 18:11,Thymes,Street Noise,83412
4758,2023-02-03 19:31,Thymes,Street Noise,114000
5827,2023-07-12 05:46,Peter McPoland,Turn Off The Noise,231561


Now, let's make sure there are no missing values in any of our columns or rows.

In [5]:
sh.isnull().sum()
# nope, we're all good to go :)

endTime       0
artistName    0
trackName     0
msPlayed      0
dtype: int64

We also need to convert the "endTime" column into usable dates instead of strings for later use. I will also create a new "minutesPlayed" column from the "msPlayed" column for the sake of legibility of units.

In [6]:
# the filtered "sh" from the earlier step is a subset of the original dataframe, so assigning
# new columns to it directly made pandas show a SettingWithCopyWarning;
# taking an explicit copy first avoids that warning
sh = sh.copy()
sh["endTime"] = pd.to_datetime(sh["endTime"])
sh["minutesPlayed"] = sh["msPlayed"]/60000


## Filtering the "1 Million Songs" DataFrame into a new sub-DataFrame

In [7]:
#let's use "ms" for "million songs"
ms = pd.read_csv("spotify_data.csv")
ms.isnull().sum()
#for a dataframe of 1 million tracks, miraculously there are no missing values

Unnamed: 0          0
artist_name         0
track_name          0
track_id            0
popularity          0
year                0
genre               0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64

In [8]:
df_ms = pd.merge(ms, sh, left_on=['track_name', 'artist_name'], right_on=['trackName', 'artistName'], how='inner')
#Now I do not need repeat columns of the same information (and one random column)
df_ms = df_ms.drop(['artistName', 'trackName', 'Unnamed: 0'], axis=1)
df_ms

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,...,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,endTime,msPlayed,minutesPlayed
0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68,2012,acoustic,0.483,0.303,4,-10.058,...,0.69400,0.000,0.115,0.1390,133.406,240166,3,2023-03-01 20:05:00,36608,0.610133
1,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68,2012,acoustic,0.483,0.303,4,-10.058,...,0.69400,0.000,0.115,0.1390,133.406,240166,3,2023-03-01 20:09:00,200447,3.340783
2,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,68,2012,acoustic,0.483,0.303,4,-10.058,...,0.69400,0.000,0.115,0.1390,133.406,240166,3,2023-03-01 23:54:00,6498,0.108300
3,Neon Trees,Everybody Talks,2iUmqdfGZcHIhS3b9E9EWq,77,2012,alt-rock,0.471,0.924,8,-3.906,...,0.00301,0.000,0.313,0.7250,154.961,177280,4,2022-09-10 06:39:00,177280,2.954667
4,Neon Trees,Everybody Talks,2iUmqdfGZcHIhS3b9E9EWq,77,2012,alt-rock,0.471,0.924,8,-3.906,...,0.00301,0.000,0.313,0.7250,154.961,177280,4,2022-09-12 04:58:00,177280,2.954667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27858,The Drums,I Don't Know How To Love,2YvWonOJesvP0yu9IFJY7S,61,2011,rock,0.411,0.890,11,-6.062,...,0.06170,0.127,0.227,0.0669,169.970,202054,4,2023-03-13 04:36:00,202053,3.367550
27859,The Drums,Days,6113aOfHIC0vbZVDZ6PpRV,44,2011,rock,0.586,0.721,2,-7.743,...,0.36900,0.160,0.141,0.6570,84.987,269082,4,2023-03-03 19:16:00,269081,4.484683
27860,The Drums,Days,6113aOfHIC0vbZVDZ6PpRV,44,2011,rock,0.586,0.721,2,-7.743,...,0.36900,0.160,0.141,0.6570,84.987,269082,4,2023-03-25 00:26:00,105290,1.754833
27861,The Drums,Days,6113aOfHIC0vbZVDZ6PpRV,44,2011,rock,0.586,0.721,2,-7.743,...,0.36900,0.160,0.141,0.6570,84.987,269082,4,2023-03-28 04:55:00,269081,4.484683
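One thing worth checking with an inner merge like this is whether the catalog lists the same artist/track pair more than once, since each extra listing would duplicate every matching stream. The sketch below uses made-up toy frames. Whether the real "1 Million Tracks" file contains such duplicates is not verified here.

```python
import pandas as pd

# Toy catalog in which one song is listed twice (hypothetical data).
catalog = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist B"],
    "track_name": ["Song 1", "Song 1", "Song 2"],
    "genre": ["pop", "indie-pop", "rock"],
})
plays = pd.DataFrame({
    "artistName": ["Artist A", "Artist B", "Artist B"],
    "trackName": ["Song 1", "Song 2", "Song 2"],
    "msPlayed": [1000, 2000, 3000],
})

keys = ["artist_name", "track_name"]

# Catalog rows whose (artist, track) key appears more than once.
dup_keys = catalog[catalog.duplicated(subset=keys, keep=False)]
print(len(dup_keys))  # 2

merged = pd.merge(
    catalog, plays,
    left_on=keys, right_on=["artistName", "trackName"],
    how="inner",
)
print(len(plays), len(merged))  # 3 4
```

If duplicates are a concern, passing `validate="one_to_many"` to `pd.merge` makes pandas raise an error when the left-hand keys are not unique.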


I'd also like to briefly look at the earliest and latest stream dates, to see how long a period my data covers and whether the merge changed the earliest or most recent streams included.

In [9]:
earliest_date = sh['endTime'].min()
latest_date = sh['endTime'].max()
earliest_date_kept = df_ms['endTime'].min()
latest_date_kept = df_ms['endTime'].max()

In [10]:
earliest_date

Timestamp('2022-09-08 00:35:00')

In [11]:
earliest_date_kept

Timestamp('2022-09-08 00:35:00')

In [12]:
latest_date

Timestamp('2023-09-08 23:58:00')

In [13]:
latest_date_kept

Timestamp('2023-09-08 23:37:00')

Out of the 41,650 initial streams recorded in my streaming history, we are left with 27,863 streams. From a random sample of 1 million songs, still being left with about 67% of what I have streamed is impressive. Unfortunately, that does mean that about 33% of what I have streamed will not be represented in the Altair charts or any work going forward. Also, I personally know that I downloaded Spotify back in 2021, so even the initial "sh" dataframe only contained streams going back one year and not my entire streaming history. With all of that being said, the rest of the project works with this sample of my overall streaming history.
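The share of streams that survive the merge (27,863 / 41,650 ≈ 0.669) can also be measured directly with a left merge and pandas' `indicator` option. This is a sketch on made-up toy frames, not the notebook's actual data.

```python
import pandas as pd

# Toy data: three plays, only two of which exist in the catalog (hypothetical).
plays = pd.DataFrame({
    "artistName": ["Artist A", "Artist B", "Artist C"],
    "trackName": ["Song 1", "Song 2", "Song 3"],
})
catalog = pd.DataFrame({
    "artist_name": ["Artist A", "Artist B"],
    "track_name": ["Song 1", "Song 2"],
})

# how="left" keeps every play; indicator=True adds a "_merge" column that says
# whether each play found a catalog match ("both") or not ("left_only").
checked = pd.merge(
    plays, catalog,
    left_on=["artistName", "trackName"],
    right_on=["artist_name", "track_name"],
    how="left", indicator=True,
)

match_rate = (checked["_merge"] == "both").mean()
print(round(match_rate, 3))  # 0.667
```

Filtering `checked` on `_merge == "left_only"` would list exactly which plays were dropped.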

## Determining my Most Streamed Songs and Artists

Next, instead of having one row for each of the 27,863 streams, let's combine the streams by song/artist and create two new columns: "total_times_streamed" and "total_minutes_streamed."

In [14]:
most_songs = df_ms.groupby(['artist_name', 'track_name']).agg(
    total_times_streamed=('minutesPlayed', 'count'),
    total_minutes_streamed=('minutesPlayed', 'sum')
).reset_index()
# attach the audio-feature columns back onto each song's totals
streams = pd.merge(most_songs, df_ms, on=['artist_name', 'track_name'], how='inner')
#I no longer need the endTime, msPlayed, or minutesPlayed columns,
#as they keep me from being able to drop duplicates and are already accounted for in
#the total number of streams and total time listened
streams = streams.drop(["endTime", "msPlayed", "minutesPlayed"], axis=1)
streams = streams.drop_duplicates()
streams

Unnamed: 0,artist_name,track_name,total_times_streamed,total_minutes_streamed,track_id,popularity,year,genre,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,$uicideboy$,$uicideboy$ Were Better In 2015,1,2.389100,6LoaYlv0bC1TyctuADqNFh,66,2022,hip-hop,0.883,0.8220,...,-4.029,0,0.1080,0.0301,0.000002,0.1110,0.3270,110.024,143347,4
1,$uicideboy$,1000 Blunts,1,2.924600,09riz9pAPJyYYDVynE5xxY,75,2022,hip-hop,0.830,0.6980,...,-6.517,0,0.0770,0.2240,0.000001,0.1910,0.5950,132.990,175476,4
2,$uicideboy$,Antarctica,2,0.030167,5UGAXwbA17bUC0K9uquGY2,77,2016,hip-hop,0.715,0.6330,...,-6.869,1,0.0804,0.5530,0.000004,0.0905,0.3190,105.945,126850,5
4,$uicideboy$,Avalon,1,0.266917,7CxFWAnQ8eqiRL4W12Xzb6,68,2021,hip-hop,0.877,0.6000,...,-4.577,1,0.0813,0.0210,0.000054,0.2440,0.1760,149.996,140859,4
5,$uicideboy$,For the Last Time,5,7.830050,240audWazVjwvwh7XwfSZE,74,2017,hip-hop,0.844,0.5330,...,-9.612,1,0.5520,0.0735,0.000003,0.0953,0.2300,140.078,156081,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27855,soho,At Peace,2,0.010433,7fJ1v1CninD1DsfNVbs4HU,34,2018,chill,0.809,0.3040,...,-12.764,1,0.2180,0.9630,0.898000,0.1080,0.5270,79.971,120000,4
27857,thuy,girls like me don't cry,1,0.097000,2DtUUBwYwEzKMTMDrc5EiO,64,2022,chill,0.871,0.3720,...,-9.077,0,0.0413,0.2530,0.000002,0.1040,0.6080,110.011,214387,4
27858,thuy,universe,1,0.091667,7B4UxdHwRKJYRhvXxmgZhM,62,2021,chill,0.636,0.4520,...,-8.298,1,0.0329,0.1360,0.000002,0.1040,0.0678,80.004,186627,4
27859,Ólafur Arnalds,Saudade (When We Are Born),3,7.500000,1ijwLR1iybtxaUbasUj7kJ,59,2021,ambient,0.289,0.0253,...,-31.435,1,0.0376,0.9940,0.919000,0.0837,0.1380,99.801,150000,4


Finally, I have the final form of my dataset, which represents my streaming history, the time spent listening to each song, and each attribute Spotify records for each unique song. To check that this final step was done correctly, the sum of "total_times_streamed" should equal the number of rows in df_ms (27,863).

In [15]:
streams["total_times_streamed"].sum()

27863
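An alternative to the merge-then-drop-duplicates approach is to carry the per-song attributes through the same `groupby` with a `"first"` aggregation. This is a sketch on toy data and assumes, as in the notebook, that a song's attributes are identical on every one of its rows. Only one attribute column (`energy`) is shown.

```python
import pandas as pd

# Toy per-stream data (hypothetical values).
toy_df = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist B"],
    "track_name": ["Song 1", "Song 1", "Song 2"],
    "minutesPlayed": [3.0, 1.5, 2.0],
    "energy": [0.8, 0.8, 0.4],
})

toy_streams = toy_df.groupby(["artist_name", "track_name"], as_index=False).agg(
    total_times_streamed=("minutesPlayed", "count"),
    total_minutes_streamed=("minutesPlayed", "sum"),
    energy=("energy", "first"),  # constant within a song, so "first" just carries it along
)
print(toy_streams)

# Same sanity check as in the notebook: the counts add up to the number of streams.
assert toy_streams["total_times_streamed"].sum() == len(toy_df)
```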

Now that we have this final form of my data, let's determine my most streamed artists and most streamed songs that exist within both my streaming history and the dataset of 1 million songs.

In [16]:
# five most streamed songs
top_tracks = streams.groupby('track_name')['total_minutes_streamed'].sum()
# Sort the tracks by total_minutes_streamed in descending order and select the top 5
# and round it so there aren't any endless decimals
top_tracks.sort_values(ascending=False).head(5).round(3)

track_name
Stories                                       795.002
Stick Season                                  715.914
One More Time                                 666.362
Ode to a Conversation Stuck in Your Throat    645.537
All My Love                                   607.431
Name: total_minutes_streamed, dtype: float64

In [17]:
# five most streamed artists
top_artists = streams.groupby('artist_name')['total_minutes_streamed'].sum()
# Sort the artists by total_minutes_streamed in descending order and select the top 5
# and round it so there aren't any endless decimals
top_artists.sort_values(ascending=False).head(5).round(3)

artist_name
Noah Kahan            5857.644
Paramore              3033.930
Taylor Swift          2595.682
Hippo Campus          1629.579
Tyler, The Creator    1564.350
Name: total_minutes_streamed, dtype: float64

According to my filtered dataset of "streams", my five most streamed songs of the past year are "Stories", "Stick Season", "One More Time", "Ode to a Conversation Stuck in Your Throat", and "All My Love", and my five most streamed artists are Noah Kahan, Paramore, Taylor Swift, Hippo Campus, and Tyler, The Creator. While a large number of streams were lost in filtering through the "1 Million Songs" dataset, I can tell you with confidence that these results are very representative of my taste in music.
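One caveat with the song ranking: grouping by `track_name` alone would combine different songs that happen to share a title. The toy example below (made-up artists and minutes) shows how the ranking can change when the artist is included in the grouping key. Whether any of the real top five are affected is not checked here.

```python
import pandas as pd

# Two different toy songs share the title "Stories" (hypothetical data).
toy_streams = pd.DataFrame({
    "artist_name": ["Artist A", "Artist B", "Artist C"],
    "track_name": ["Stories", "Stories", "Other Song"],
    "total_minutes_streamed": [100.0, 50.0, 120.0],
})

by_title = toy_streams.groupby("track_name")["total_minutes_streamed"].sum()
by_song = toy_streams.groupby(["artist_name", "track_name"])["total_minutes_streamed"].sum()

print(by_title.idxmax())  # Stories
print(by_song.idxmax())   # ('Artist C', 'Other Song')
```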

## Visualizing My Taste in Music

In [18]:
import altair as alt

This scatterplot shows the entirety of "streams" by popularity and energy, and the darker colors signify where my most streamed songs lie among the rest.

In [19]:
alt.Chart(streams).mark_circle().encode(
    x='popularity:Q',
    y='energy:Q',
    color=alt.Color("total_minutes_streamed:Q", scale=alt.Scale(scheme="goldorange")),
    tooltip=["artist_name","track_name","total_minutes_streamed", "total_times_streamed","genre"],
)

Unfortunately, the few most streamed songs dominate the color scale, which makes it hard to see the rest of my overall taste in music. So, let's exclude the most streamed songs (those with more than about 300 minutes) as outliers. I will also randomly sample half of the remaining songs so that hovering over an individual point is more accurate.

In [20]:
df_streams2 = streams[streams["total_minutes_streamed"] < 300]
df_streams = df_streams2.sample(frac=0.5, random_state=76) 

alt.Chart(df_streams).mark_circle().encode(
    x='popularity:Q',
    y='energy:Q',
    color=alt.Color("total_minutes_streamed:Q", scale=alt.Scale(scheme="goldorange")),
    tooltip=["artist_name","track_name","total_minutes_streamed", "total_times_streamed","genre"],
)

Now that I can see more of the higher ends of "total minutes streamed", I can see that the majority of my streams rest between 40 and 80 on the popularity scale but are far more spread out across energy. This tells me that while I am more likely to listen to songs that are "heard of but not too popular," I will listen to a wider range of energy levels, while still leaning towards higher-energy music. Next, I want to see how the year released plays into my taste in music.
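Before moving on to release year, the popularity reading above can be checked numerically rather than by eye. This sketch uses made-up toy values. On the real data, the same two lines would run on `streams`.

```python
import pandas as pd

# Toy per-song data (hypothetical values).
toy_streams = pd.DataFrame({
    "popularity": [35, 45, 60, 75, 90],
    "total_minutes_streamed": [10.0, 40.0, 30.0, 15.0, 5.0],
})

in_band = toy_streams["popularity"].between(40, 80)  # inclusive on both ends

# Share of distinct songs in the band versus share of listening time in the band.
share_of_songs = in_band.mean()
share_of_minutes = (
    toy_streams.loc[in_band, "total_minutes_streamed"].sum()
    / toy_streams["total_minutes_streamed"].sum()
)
print(share_of_songs, share_of_minutes)
```

Comparing the two shares distinguishes "most of my songs are in this band" from "most of my listening time is in this band", which need not agree.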

In [21]:
yearly_minutes = streams.groupby('year')['total_minutes_streamed'].sum().round(2).reset_index()

chart = alt.Chart(yearly_minutes).mark_bar().encode(
    x='year:N',
    y='total_minutes_streamed:Q',
    tooltip=['year:N', 'total_minutes_streamed:Q']
)
chart

This suggests to me that there is an issue with the dataset that I did not anticipate but now somewhat understand. The "year" variable certainly does not represent the original release year, as I had expected. It possibly represents the year that the song was added (or last updated/remastered) on Spotify, but even that does not fully make sense, as Spotify was created in 2006, yet the oldest year listed is 2000. Whatever the true meaning of "year", if not the actual release year, this shows my preference for listening to music from the last 6-7 years. The last chart I would like to examine shows which genres I listen to more than others.
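Before turning to genres, the recency observation above can be expressed as a single number: the share of listening time that falls on or after some cutoff year. The data and the cutoff of 2017 below are made up for illustration.

```python
import pandas as pd

# Toy per-song data (hypothetical values).
toy_streams = pd.DataFrame({
    "year": [2000, 2012, 2017, 2021, 2022],
    "total_minutes_streamed": [5.0, 15.0, 20.0, 30.0, 30.0],
})

yearly = toy_streams.groupby("year")["total_minutes_streamed"].sum()

cutoff = 2017  # hypothetical cutoff for "recent"
recent_share = yearly[yearly.index >= cutoff].sum() / yearly.sum()

print(yearly.index.min(), yearly.index.max())  # 2000 2022
print(recent_share)
```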

In [22]:
streams_df_genre = streams.sort_values(by='total_minutes_streamed', ascending=False)

chart = alt.Chart(streams_df_genre).mark_bar().encode(
    x=alt.X('genre:N', sort='y'), 
    y='total_minutes_streamed:Q',
    tooltip=['genre:N', 'total_minutes_streamed:Q']
)
chart

While I was not expecting "electro" to be my most streamed genre, it does make sense for it to be near the top. The rest also makes sense to me, as pop, rock, and indie-pop are broad enough genres to cover plenty of the songs I listen to.
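The genre totals shown in the bar chart can also be computed as a table, which makes the ranking explicit. This is a sketch on made-up toy values.

```python
import pandas as pd

# Toy per-song data (hypothetical values).
toy_streams = pd.DataFrame({
    "genre": ["pop", "rock", "pop", "electro"],
    "total_minutes_streamed": [10.0, 25.0, 20.0, 40.0],
})

genre_minutes = (
    toy_streams.groupby("genre", as_index=False)["total_minutes_streamed"]
    .sum()
    .sort_values("total_minutes_streamed", ascending=False)
)
print(genre_minutes)
```

A pre-aggregated frame like this has one row per genre, so a bar chart built from it shows one bar per genre without relying on the chart to stack individual songs.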

## Machine Learning

Can I use linear regression to predict a trend between two of the given variables? For example, in my selection of music, how do loudness and energy interact?

In [23]:
from sklearn.linear_model import LinearRegression

X = streams[['energy']]
y = streams['loudness']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
streams['predictions'] = predictions

scatter_plot = alt.Chart(streams).mark_circle().encode(
    x='energy:Q',
    y='loudness:Q',
    tooltip=['energy:Q', 'loudness:Q'],
)

best_fit_line = alt.Chart(streams).mark_line(color='red').encode(
    x='energy:Q',
    y='predictions:Q',
)

combined_chart = scatter_plot + best_fit_line
combined_chart

I used scikit-learn to create a LinearRegression model, fit it to the data, and calculate predictions to add as a new “predictions” column to my “streams” dataframe. I then created an Altair scatterplot with my streams and a best fit line with the predictions, combining the two into one chart. 

Unsurprisingly, there is a positive correlation between "loudness" and "energy." What is interesting, however, is that a straight line cannot capture the "dropoff" in energy as loudness decreases. The true relationship between "loudness" and "energy" would be closer to a square-root curve, so the best-fit line fails to properly predict songs with loudness values less than around -15.
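The mismatch described above can be made visible in numbers by looking at the residuals of a straight-line fit. The sketch below uses NumPy's `polyfit` instead of scikit-learn, and a made-up square-root-shaped curve rather than the real "streams" data, so it only illustrates the pattern.

```python
import numpy as np

# Made-up concave curve standing in for the loudness/energy relationship.
energy = np.linspace(0.01, 1.0, 100)
loudness = -40.0 + 36.0 * np.sqrt(energy)

# Straight-line least-squares fit: loudness ~ slope * energy + intercept.
slope, intercept = np.polyfit(energy, loudness, deg=1)
predicted = slope * energy + intercept
residuals = loudness - predicted

# R^2: the fraction of variance in loudness explained by the line.
r_squared = 1 - np.sum(residuals**2) / np.sum((loudness - loudness.mean()) ** 2)
print(round(r_squared, 3))

# Negative residuals at the lowest energies mean the line predicts too high there.
print(residuals[:10].mean())
```

A systematic run of same-signed residuals at one end of the range, as here, is a sign that a straight line is the wrong shape even when the overall R² looks high.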

Let's see if we can make a better fitting line with Polynomial Regression (to the second degree).

In [24]:
from sklearn.preprocessing import PolynomialFeatures

In [25]:
poly = PolynomialFeatures(degree=2)  # You can change the degree as needed
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
y_pred = poly_reg.predict(X_poly)
streams['predictions_poly'] = y_pred
chart = alt.Chart(streams).mark_circle().encode(
    x='energy:Q',
    y='loudness:Q',
    tooltip=['energy:Q', 'loudness:Q']
)

# Overlay the polynomial regression line on the chart
regression_line = alt.Chart(streams).mark_line(color='red').encode(
    x='energy:Q',
    y='predictions_poly:Q'
)

chart + regression_line

As you can see, changing the model to a second-degree polynomial better represents the curve formed as both energy and loudness decrease. If I were to add higher-degree terms, the curve could be forced away from the true nature of the relationship between loudness and energy and could result in overfitting. The argument can also be made that the issue has now reversed. Whereas the linear fit predicted values that were too high for songs with loudness below -15, the new polynomial regression curve fails to capture the more linear, upward relationship between the two somewhere around (energy = 0.2, loudness = -15).
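The overfitting concern can be checked by fitting on one part of the data and measuring the error on the rest. The sketch below uses NumPy's `polyfit` and made-up noisy data with a square-root shape. It is not the notebook's actual model or data, and the split sizes and degrees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a square-root-shaped trend plus noise.
energy = rng.uniform(0.01, 1.0, size=200)
loudness = -40.0 + 36.0 * np.sqrt(energy) + rng.normal(0.0, 2.0, size=200)

# Fit on the first 150 points, evaluate on the remaining 50.
is_train = np.arange(200) < 150
is_test = ~is_train

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 3, 5):
    coeffs = np.polyfit(energy[is_train], loudness[is_train], deg=degree)
    results[degree] = (
        mse(coeffs, energy[is_train], loudness[is_train]),
        mse(coeffs, energy[is_test], loudness[is_test]),
    )

for degree, (train_mse, test_mse) in results.items():
    print(degree, round(train_mse, 2), round(test_mse, 2))
```

Training error can only go down as the degree increases, so the held-out error is the column to watch: if it stops improving or rises while training error keeps falling, the extra terms are fitting noise.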

## Summary

First, I created a DataFrame of my own by merging two existing datasets and determined certain maximums and minimums within it (my most streamed songs and artists, and the earliest and latest stream dates). Then, I visualized different traits/columns and how they relate to one another. Finally, I used linear and polynomial regression to fit a best-fit line predicting the relationship between two of those traits/columns.

## References


* What is the source of your dataset(s)?
  * My "Streaming History" comes directly from Spotify.
  * The "1 Million Tracks" dataset: https://www.kaggle.com/datasets/amitanshjoshi/spotify-1million-tracks

* List any other references that you found helpful.
  * How I used the "merge" feature: https://www.w3schools.com/python/pandas/ref_df_merge.asp#:~:text=The%20merge()%20method%20updates,keep%20and%20which%20to%20replace.
  * Learning more about the ".agg" tool: https://stackoverflow.com/questions/38174155/group-dataframe-and-get-sum-and-count
  * Unless I am accidentally missing any other references, I spent most of my time looking over lecture code for references.
