## My Spotify Wrapped Trends, since 2018

This notebook performs some basic data processing on my own Spotify Wrapped playlists from 2018 onwards (2018 is when I started using Spotify).

*Note on Fetching Playlist Data:*

Fetching the data from Spotify's API is done using the `spotify` Python library. Because it is an asynchronous implementation of the API, it didn't play too nicely with Jupyter in my experience, so the fetching is implemented separately in `playlists.py`. By default, running the script produces `list.csv`, which is the data source for this notebook.
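
The rest of the notebook relies on four columns of `list.csv`: `name`, `artists`, `year`, and `index` (the track's rank within that year's playlist). A minimal sketch of a sanity check, assuming that layout; the helper `check_columns` and the sample rows are hypothetical and not part of `playlists.py`:

```python
import io

import pandas as pd

# Columns the notebook reads later on
REQUIRED_COLUMNS = {"name", "artists", "year", "index"}


def check_columns(frame: pd.DataFrame) -> None:
    """Raise a ValueError if any column the notebook relies on is missing."""
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"list.csv is missing columns: {sorted(missing)}")


# Made-up rows standing in for the real file
sample_csv = io.StringIO(
    "index,name,artists,year\n"
    "1,Song A,Artist X,2018\n"
    "2,Song B,Artist Y,2018\n"
)
sample = pd.read_csv(sample_csv)
check_columns(sample)  # passes silently when every column is present
```

Running the same check on the real DataFrame right after `pd.read_csv` would surface a schema change early, rather than as a `KeyError` several cells later.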


In [None]:
import pandas as pd
import hashlib
import matplotlib.pyplot as plt

In [None]:
# accessing csv (created by playlists.py script)

file = "list.csv"
df = pd.read_csv(file)
df.info()

#### Data Processing

##### Creating 'song_id' column:

The CSV is populated with data points fetched directly from the API, so we need to do some cleaning and processing for our purposes.

1. Creating a 'combined' column, and then dropping it: Some tracks in these playlists are the exact same songs and performances, but have different Spotify IDs<sup>[1]</sup>. So we need an ID of our own, created by concatenating the song name and artist names. Making this field all lowercase also avoids another edge case<sup>[2]</sup>. It is not the most precise implementation<sup>[3]</sup>, but it works for our use case.

2. The 'combined' column is then hashed to create 'song_id'. There is no strong reason to hash the data, and any other method of deriving an identifier would work; hashing is simply a straightforward way to get a fixed-length value that is the same for identical 'combined' strings (see [2]).


<sup><sub>[1] Possibly being Single releases vs. Album tracks, etc.</sub></sup>

<sup><sub>[2] Some tracks can have some words capitalised or not ("the" vs. "The"), despite being the same tracks.</sub></sup>

<sup><sub>[3] Re-released music with changed song names (yes, this is the "(Taylor's Version)" edge case) would require you to match the titles in the DataFrame.</sub></sup>
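
One way to handle the edge case in [3] would be to normalise titles before hashing. A rough sketch, assuming the only difference is a trailing "(X's Version)" suffix written with a straight apostrophe; `normalize_title`, `make_song_id`, and the toy rows are hypothetical and not part of this notebook's pipeline:

```python
import hashlib
import re

import pandas as pd

# Assumption: re-recordings differ only by a trailing "(...'s Version)" suffix
VERSION_SUFFIX = re.compile(r"\s*\([^()]*'s version\)\s*$", flags=re.IGNORECASE)


def normalize_title(title: str) -> str:
    """Lowercase a title and strip a trailing "(X's Version)" suffix."""
    return VERSION_SUFFIX.sub("", title).strip().lower()


def make_song_id(name: str, artists: str) -> str:
    """Hash the normalised title and artist names into a stable identifier."""
    combined = f"{normalize_title(name)} - {artists.lower()}"
    return hashlib.md5(combined.encode()).hexdigest()


# Toy rows for illustration only
toy = pd.DataFrame({
    "name": [
        "Love Story",
        "Love Story (Taylor's Version)",
        "Between the Bars",
        "Between The Bars",
    ],
    "artists": ["Taylor Swift", "Taylor Swift", "Elliott Smith", "Elliott Smith"],
})
toy["song_id"] = [make_song_id(n, a) for n, a in zip(toy["name"], toy["artists"])]
print(toy["song_id"].nunique())  # 2 distinct songs
```

Whether a re-recording should count as the same song is a judgment call; the cell below keeps the simpler name-plus-artist scheme.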

In [None]:
# Combine "name" and "artists" columns into a new column "combined"
df['combined'] = (df['name'] + ' - ' + df['artists']).str.lower() # addressing the "the","The" issue

# Hash the combined values to create a unique identifier - some songs have different spotify IDs but are the exact same track
df['song_id'] = df['combined'].apply(lambda x: hashlib.md5(x.encode()).hexdigest())

df = df.drop(columns=['combined'])
# might need some further wrangling for tv tracks later - keeping older tracks as it is
# also, Between The Bars for example has two entries, not merged ('the' and "The")

##### Pivoting the DataFrame:

This is where song_id comes in handy. A song that appears in more than one Wrapped list has the same ID each time, so pivoting with song_id as the index and the years as columns gives a more maneuverable table that we can use for all further processing. The result contains only the song_ids and the yearly ranks. A song that doesn't appear in a given year's list gets the value ``0`` for that year.

We add the song name (``name``) and artist name (``artists``) back in the next step, so that each distinct song ends up in a single row even if its entries have different Spotify IDs.
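
To make the reshaping concrete, here is a toy sketch of the same kind of `pivot_table` call on made-up rows (the `toy` frame and its values are invented for illustration):

```python
import pandas as pd

# Made-up long-format rows: one row per (song, year), with the rank in 'index'
toy = pd.DataFrame({
    "song_id": ["a", "a", "b", "c"],
    "year": [2018, 2019, 2018, 2019],
    "index": [5, 12, 40, 1],
})

# One row per song, one column per year; 0 marks "not in that year's list"
toy_pivot = (
    toy.pivot_table(index="song_id", columns="year", values="index", fill_value=0)
    .reset_index()
)
toy_pivot.columns.name = None  # drop the leftover 'year' label on the column axis
print(toy_pivot)
```

Song `a` keeps both of its ranks in a single row, while `b` and `c` get a `0` for the year they are missing from. Note that `pivot_table` aggregates with the mean by default, so if the same `song_id` appeared twice within one year, the two ranks would be averaged.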

In [None]:
# pivoting table, for years on song_id

years = df['year'].unique().tolist() # final, don't redefine
pivot_df = df.pivot_table(index='song_id', columns='year', values='index', fill_value=0)

# Reset the index to make 'song_id' a regular column
pivot_df = pivot_df.reset_index()

# convert floats to int to make data cleaner
pivot_df[years] = pivot_df[years].astype(int)


In [None]:
# adding back song names and artists on song_id values

# keep one row per song_id so the merge below doesn't duplicate songs
df = df.drop_duplicates(subset='song_id')

pivot_df = pd.merge(pivot_df, df[['name', 'artists', 'song_id']], on='song_id', how='left')
pivot_df # final working dataset - data cleaned

##### Creating 'list_appearances' column: How many yearly lists is each song in?

The ``'list_appearances'`` column shows at a glance how many yearly Wrapped lists a track has appeared in. It simply counts how many of the year columns in each row have a non-zero value.
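
A toy sketch of that count on made-up ranks (the `toy` frame and `year_cols` are illustrative names, not the notebook's variables):

```python
import pandas as pd

# Made-up wide-format ranks; 0 means the song was absent that year
toy = pd.DataFrame({
    "song_id": ["a", "b", "c"],
    2018: [5, 40, 0],
    2019: [12, 0, 1],
    2020: [30, 0, 0],
})
year_cols = [2018, 2019, 2020]

# Compare every year column against 0, then count the True cells per row
toy["list_appearances"] = (toy[year_cols] != 0).sum(axis=1)
print(toy["list_appearances"].tolist())  # [3, 1, 1]
```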

In [None]:
# counting number of appearances for each song in the lists
pivot_df['list_appearances'] = (pivot_df[years] != 0).sum(axis=1)

##### Finishing up Data Processing:

Just rearranging our data columns.

In [None]:
# rearranging df to have all year indices to the end
cols = ['name', 'artists', 'song_id', 'list_appearances'] + years
pivot_df = pivot_df[cols]
pivot_df.head()

#### Data Analysis

In [None]:
# song_artists = list of each unique artist and how many songs they have on the list
song_artists = pivot_df['artists'].value_counts()

print(song_artists.loc['Elliott Smith'])
# pivot_df[pivot_df['artists'] == 'Tame Impala']

In [None]:
year_to_refer = 2019
artist_to_refer = 'Daft Punk'
pivot_df[(pivot_df['artists'] == artist_to_refer) & (pivot_df[year_to_refer] > 0)][['name', 'artists', year_to_refer]]

In [None]:
# present in 3 or more lists

three_df = pivot_df[pivot_df['list_appearances'] >= 3]

three_df

In [None]:
# present in 2 or more lists

two_df = pivot_df[pivot_df['list_appearances'] >= 2]

two_df

In [None]:
# present in 4 or more lists

four_df = pivot_df[pivot_df['list_appearances'] >= 4]

four_df

In [None]:
# one timers

one_df = pivot_df[pivot_df['list_appearances'] == 1]

one_df

In [None]:
# generic line graph maker - for reference

sample_df = three_df
plot_years = sorted(years)  # separate name, so the global 'years' list is not redefined

for _, row in sample_df.iterrows():
    # keep only the years in which the song was ranked (0 = not in that year's list)
    ranked = [(year, row.loc[year]) for year in plot_years if row.loc[year] != 0]
    if ranked:
        years_ranked, ranks = zip(*ranked)
        plt.plot(years_ranked, ranks, marker='D', label=row['name'])

# Adding title and labels
plt.title('Line Graph for Songs Over Years')
plt.xlabel('Year')
plt.ylabel('Rank in Wrapped Playlist')

plt.xticks(range(int(min(years)), int(max(years)) + 1))

# Adding a legend to identify each line
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.gca().invert_yaxis()

# Display the graph
plt.show()

In [None]:
# one hit wonders, top 10 to nowhere in the following list
# starting with year-wise to-and-from


In [None]:
# "recovering" tracks - with >= 3 appearances, the ones that went up in rank from one year to the next

In [None]:
# How are the 2018 tracks doing in subsequent lists (if they made it in any lists afterwards,
# and the ones that didn't, where were they ranked?)

In [None]:
# artists' best years, most common appearances, ups-and-downs over the years
# an illustration: MGMT - 2021 super dense, fell off after

In [None]:
# 3 or more appearances, graphing them