# Spotify Data Project

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Initial EDA (Pre-Clustering Work)

The datasets we are using come from a Kaggle dataset that was built by querying song data from the Spotify API: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

In [3]:
df = pd.read_csv("archive/data.csv")
df_artists = pd.read_csv("archive/data_by_artist.csv")
df_genres = pd.read_csv("archive/data_by_genres.csv")
df_year = pd.read_csv("archive/data_by_year.csv")
df_w_genres = pd.read_csv("archive/data_w_genres.csv")

In [4]:
df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.995,['Carl Woitschach'],0.708,158648,0.195,0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,10,0.151,-12.428,1,Singende Bataillone 1. Teil,0,1928,0.0506,118.469,0.779,1928
1,0.994,"['Robert Schumann', 'Vladimir Horowitz']",0.379,282133,0.0135,0,6KuQTIu1KoTTkLXKrwlLPV,0.901,8,0.0763,-28.454,1,"Fantasiestücke, Op. 111: Più tosto lento",0,1928,0.0462,83.972,0.0767,1928
2,0.604,['Seweryn Goszczyński'],0.749,104300,0.22,0,6L63VW0PibdM1HDSBoqnoM,0.0,5,0.119,-19.924,0,Chapter 1.18 - Zamek kaniowski,0,1928,0.929,107.177,0.88,1928
3,0.995,['Francisco Canaro'],0.781,180760,0.13,0,6M94FkXd15sOAOQYRnWPN8,0.887,1,0.111,-14.734,0,Bebamos Juntos - Instrumental (Remasterizado),0,1928-09-25,0.0926,108.003,0.72,1928
4,0.99,"['Frédéric Chopin', 'Vladimir Horowitz']",0.21,687733,0.204,0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,11,0.098,-16.829,1,"Polonaise-Fantaisie in A-Flat Major, Op. 61",1,1928,0.0424,62.149,0.0693,1928


Most of the other datasets are aggregations of this one. The genre data is the only source of information not found in this dataset: it either aggregates the data at the genre level or lists the genres associated with each artist.

In [5]:
df_artists.head()

Unnamed: 0,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
0,"""Cats"" 1981 Original London Cast",0.575083,0.44275,247260.0,0.386336,0.022717,0.287708,-14.205417,0.180675,115.9835,0.334433,38.0,5,1,12
1,"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,33.076923,5,1,26
2,"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.285714,0,1,7
3,"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.444444,0,1,27
4,"""Joseph And The Amazing Technicolor Dreamcoat""...",0.605444,0.437333,232428.111111,0.429333,0.037534,0.216111,-11.447222,0.086,120.329667,0.458667,42.555556,11,1,9


In [None]:
df_genres.head()

In [None]:
df_w_genres.head()

In [None]:
df_year.head()

In [None]:
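# numeric_only=True skips string columns such as artists and name, which cannot be averaged
df.groupby("year").mean(numeric_only=True).head()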

The columns in this dataset mostly describe technical musical features. More detail can be found at this link: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

This link contains a detailed description of the popularity variable https://developer.spotify.com/documentation/web-api/reference/tracks/get-track/

#### EDA

Problem Statement Idea:

- Can we extract genre from the various musical features at the song level, using an unsupervised learning technique?
    - most likely learn genre via clustering, K-means or GMM?
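As a rough illustration of the two candidate techniques, the sketch below runs K-means and a Gaussian mixture model on synthetic data. It assumes scikit-learn is available; the feature values are made up and are not drawn from the Spotify data, and the choice of two clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for song-level audio features (NOT the Spotify data):
# two well-separated blobs over (acousticness, energy, loudness)-like columns.
rng = np.random.default_rng(0)
quiet = rng.normal(loc=[0.9, 0.2, -20.0], scale=[0.03, 0.03, 1.0], size=(50, 3))
loud = rng.normal(loc=[0.1, 0.8, -5.0], scale=[0.03, 0.03, 1.0], size=(50, 3))
features = np.vstack([quiet, loud])

# Scale first so a dB-valued column does not dominate the 0-1 columns.
scaled = StandardScaler().fit_transform(features)

# K-means gives each row exactly one cluster label.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# A Gaussian mixture also gives per-cluster membership probabilities.
gmm = GaussianMixture(n_components=2, random_state=0).fit(scaled)
gmm_labels = gmm.predict(scaled)
gmm_probs = gmm.predict_proba(scaled)
```

K-means forces a hard assignment, while the mixture model's probabilities may suit songs that sit between genres.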

In [None]:
df.isna().sum()

Null value check: isna() only catches true NaN values. This dataset may represent missing values in other ways (empty brackets, zero values, or certain text strings) that the check above does not account for.
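One way to widen the check is to count placeholder strings alongside real NaN values. The sketch below is an assumption-laden example: the helper name, the placeholder list, and the toy frame are all hypothetical, and zeros are deliberately not counted because 0 is a legitimate value for many of the numeric features.

```python
import pandas as pd

# Hypothetical placeholder strings that may stand in for missing values.
PLACEHOLDERS = ("", "[]", "nan", "None")

def count_placeholder_nulls(frame, placeholders=PLACEHOLDERS):
    """Count, per column, real NaNs plus text cells equal to a placeholder."""
    counts = {}
    for col in frame.columns:
        n_nan = int(frame[col].isna().sum())
        n_placeholder = 0
        # Only text-like columns can contain placeholder strings.
        if not pd.api.types.is_numeric_dtype(frame[col]):
            as_text = frame[col].dropna().astype(str).str.strip()
            n_placeholder = int(as_text.isin(list(placeholders)).sum())
        counts[col] = n_nan + n_placeholder
    return pd.Series(counts)

# Toy data for illustration only.
toy = pd.DataFrame({
    "genres": ["['rock']", "[]", "[]", None],
    "popularity": [10, 0, 55, 3],
})
null_counts = count_placeholder_nulls(toy)
```

On the real data, the same helper could be called as count_placeholder_nulls(df_w_genres) once the placeholder list has been checked against the actual columns.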

First off, how does Spotify define genre? Let's look at how many genres are defined in the aggregate genre dataset.

In [None]:
unique_genres = df_genres["genres"].unique()
print(len(unique_genres))
unique_genres[:20]

In [None]:
df_w_genres[df_w_genres["genres"] == "[]"].head()

In [None]:
df_w_genres[df_w_genres["genres"] == "[]"].loc[56]

There are 2,664 genres, which is a very large number, and several of them share a prefix such as "acid". We will likely cluster and assign our own intuitive genres to each cluster, or try to reduce this genre layer to something we could use for supervised learning.

We also see that an empty genre value is indicated by '[]', so null values in this dataset are not always represented as 'na'.

In trying to query our favorite artists from the song data, we noticed an interesting issue with how the artists are represented: the non-numeric variables are stored as strings, even though some of them look like lists (a finding from later EDA). This means we will have to do some preprocessing before we can query them with pandas functions.

In [None]:
print(type(df["artists"][0]))
df["artists"]

In [None]:
df["artists"] = df["artists"].apply(lambda x: x.replace("'", "").strip('][').split(', '))

In [None]:
type(df["artists"][0])
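The replace/strip/split conversion above removes every apostrophe and splits on every comma, so a name containing either character would be altered. A possible alternative, sketched here on made-up sample strings, is to parse each cell with ast.literal_eval from the standard library. This assumes the column still holds the raw strings and that each one is a valid Python list literal, which we have not verified for the full dataset.

```python
import ast
import pandas as pd

def parse_list_column(column):
    """Parse strings such as "['A', 'B']" into real Python lists."""
    return column.apply(ast.literal_eval)

# Made-up sample cells in the same style as the artists column.
raw = pd.Series([
    "['Robert Schumann', 'Vladimir Horowitz']",
    "[\"Guns N' Roses\"]",
    "[]",
])
parsed = parse_list_column(raw)
```

With this approach an empty '[]' cell becomes an empty list rather than a list holding one empty string.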

In [None]:
def query_artist(artist):
    # Boolean mask: True only for tracks credited to this artist alone
    return df["artists"].apply(lambda names: names == [artist])

In [None]:
df[query_artist("MGMT")].sort_values("popularity", ascending = False).head()
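Note that query_artist only matches tracks where the artist is the sole credit. If collaborations should be included as well, a membership test is one option. The helper name and toy frame below are hypothetical, and the sketch assumes the artists column already holds lists.

```python
import pandas as pd

def features_artist(frame, artist):
    """Boolean mask: True where `artist` appears anywhere in the row's artist list."""
    return frame["artists"].apply(lambda names: artist in names)

# Toy data for illustration only.
toy = pd.DataFrame({
    "name": ["solo track", "collaboration", "unrelated"],
    "artists": [["MGMT"], ["MGMT", "Someone Else"], ["Another Act"]],
})
solo_or_featured = toy[features_artist(toy, "MGMT")]
```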

In [None]:
# top10_artists was not defined in an earlier cell; assumed here to be the
# ten highest-popularity rows of the artist-level aggregate.
top10_artists = df_artists.sort_values("popularity", ascending=False).head(10)
top10_artists.plot.bar("artists", "popularity")
plt.xticks(rotation=45)
plt.title("Top 10 Most Popular Artists")

Looking at artist popularity in the dataset, we see that the top 10 artists are relatively unknown (at least to us). Why could that be?


In [None]:
top10_artists

We see that the count value for these artists is extremely low, so these artists are likely "one-hit wonders" or have only one or two very successful songs. Let's see how popularity measures for a universally loved artist like The Beatles.

In [None]:
df_artists[df_artists["artists"] == "The Beatles"]

Interestingly, The Beatles have a popularity score of 48.06, compared with scores of 86-95 for the artists above. How does Spotify measure popularity? Let's look at the API documentation:

The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

So The Beatles' score is likely an average over all their songs, which lowers it given their count of 823. It is also interesting that popularity depends on how recently a song has been played. Let's see how time affects popularity.
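The averaging effect can be reproduced on a small made-up example: an artist with one popular track outranks an artist with a large catalog once popularity is averaged per artist. The artist names, popularity values, and the MIN_TRACKS threshold below are hypothetical, and the sketch assumes the artists column holds lists.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "artists": [["One Hit"], ["Big Catalog"], ["Big Catalog"],
                ["Big Catalog"], ["Big Catalog"]],
    "popularity": [90, 85, 60, 30, 5],
})

# One row per (track, artist) pair, then aggregate per artist.
per_artist = (
    toy.explode("artists")
       .groupby("artists")
       .agg(mean_popularity=("popularity", "mean"),
            max_popularity=("popularity", "max"),
            count=("popularity", "size"))
       .sort_values("mean_popularity", ascending=False)
)

# Requiring a minimum track count is one way to filter out one-hit wonders.
MIN_TRACKS = 3  # hypothetical threshold
established = per_artist[per_artist["count"] >= MIN_TRACKS]
```

Here "Big Catalog" has a track nearly as popular as the single "One Hit" track, yet its mean is pulled down by the rest of its catalog.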

In [None]:
plt.plot(df_year['year'], df_year['popularity'])
plt.xlabel("Year")
plt.ylabel("Popularity")
plt.title("Popularity Over Time")

As we suspected, popularity increases over time, favoring more recent songs. This indicates that popularity is better defined as "current popularity": the variable does not indicate how popular a song was when it came out, but how popular it was when the data was queried, roughly October 11th, 2020. This means it may not be a reliable variable to use, or we must use it while acknowledging that it measures how popular an artist or song is currently, not historically.

Let's look at the most popular songs in the dataset as a sanity check

In [None]:
df.sort_values("popularity", ascending = False).head(20)

These look a lot more like familiar artists. This indicates we want to stick to the song level data as opposed to data aggregated at the artist level so we do not lose detail about the data through issues like the "one-hit wonder" inflation seen above.

It seems that we will want to focus our efforts on clustering the song level data using the technical music aspects to try and discern some innate pattern that we can abstract as genre. Let's look at some of this technical music data and observe.

In [None]:
df["acousticness"].hist()
plt.title("Distribution of Acousticness")
plt.xlabel("Acousticness")
plt.ylabel("Count")

In [None]:
df["danceability"].hist()
plt.title("Distribution of Danceability")
plt.xlabel("Danceability")
plt.ylabel("Count")

We see that acousticness is heavily concentrated in the bins nearest 0 and 1, while danceability is spread more evenly, with low concentration in the lowest and highest bins.

We'll use the genre-level data to look at trends in the technical music features, since it shows how these features vary across genres.

In [None]:
plt.scatter(df_genres["acousticness"], df_genres["energy"])
plt.title("Scatter Plot of Genres: Acousticness vs Energy")
plt.xlabel("Acousticness")
plt.ylabel("Energy")

In [None]:
plt.scatter(df_genres["acousticness"], df_genres["loudness"])
plt.title("Scatter Plot of Genres: Acousticness vs Loudness")
plt.xlabel("Acousticness")
plt.ylabel("Loudness")

Overall, our EDA led us to cluster for genres at the song level. The other datasets are aggregated, and we found that aggregation loses specificity, so we will use the raw song data. One way to measure our success is to compare our song clusters to the genres assigned to artists (though there is no genre variable in the song dataset).
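One possible way to make that comparison is a cross-tabulation of cluster labels against genre labels, summarized by cluster purity (the share of songs that fall in their cluster's most common genre). This is only a sketch: the helper name and the toy labels are hypothetical, and it assumes each song has already been given a single genre, for example by joining artist genres onto the song data.

```python
import pandas as pd

def cluster_purity(clusters, genres):
    """Cross-tabulate clusters against genres and return (table, purity)."""
    table = pd.crosstab(pd.Series(clusters, name="cluster"),
                        pd.Series(genres, name="genre"))
    # Purity: songs in each cluster's majority genre, divided by all songs.
    purity = table.max(axis=1).sum() / table.to_numpy().sum()
    return table, purity

# Toy labels for illustration only.
clusters = [0, 0, 0, 1, 1, 1]
genres = ["rock", "rock", "jazz", "jazz", "jazz", "jazz"]
table, purity = cluster_purity(clusters, genres)
```

Purity rises trivially as the number of clusters grows, so it is best read alongside the table itself rather than as a standalone score.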

In [None]:
plotting_cols = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "valence", "speechiness"]
def plot_song(song):
    song_df = df[df["name"] == song]
    song_df.iloc[0][plotting_cols].plot.bar()
    plt.xticks(rotation= 45)
    plt.title("Technical Values of " + song)
    plt.xlabel("Musical Features measured from 0-1")
    plt.ylabel("Value")

In [None]:
plot_song("Ymca")