# **Data 550 Mini Project 2**
Group: 15

Dataset: Spotify

Names: Val Veeramani and Sara Hall

Date: February, 2022

[Link to our Recording](https://youtu.be/cY_rl8B_sAs)

In [1]:
import pandas as pd
import altair as alt
# Save a vega-lite spec and a PNG blob for each plot in the notebook
#alt.renderers.enable('mimetype')
# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('data_server')

## **Exploratory Data Analysis**

### Dataset Description

This [dataset](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21) was provided on GitHub as the January 21, 2020 [Tidy Tuesday challenge](https://github.com/rfordatascience/tidytuesday). As a result, this dataset was provided with the purpose of learning how to wrangle and visualize data in R, and we are using it in a similar context to practice exploratory data analysis in Python. The data in the `csv` file were collected in January 2020 using the [`spotifyr`](https://www.rcharlie.com/spotifyr/) R package, which connects to the [Spotify Web API](https://developer.spotify.com/documentation/web-api/). The dataset contains information about around 30,000 songs available on Spotify. This includes several variables identifying the song (id, name, artist, release date, album id, and album name), along with information about the playlist on which it was found (name, id, genre, and subgenre). Finally, it includes several numerical variables about the songs that we are mainly interested in analyzing (popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration).

### Load the Dataset

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

### Dataset Exploration

We're going to start by looking at the first few rows of the data. 

In [3]:
data.head()

Unnamed: 0,track_id,track_name,track_artist,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_genre,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,6f807x0ima9a1j3VPbc7VN,I Don't Care (with Justin Bieber) - Loud Luxur...,Ed Sheeran,66,2oCs0DGTsRO98Gh5ZSl2Cx,I Don't Care (with Justin Bieber) [Loud Luxury...,2019-06-14,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,6,-2.634,1,0.0583,0.102,0.0,0.0653,0.518,122.036,194754
1,0r7CVbZTWZgbTCYdfa2P31,Memories - Dillon Francis Remix,Maroon 5,67,63rPSO264uRjW1X5E6cWv6,Memories (Dillon Francis Remix),2019-12-13,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,11,-4.969,1,0.0373,0.0724,0.00421,0.357,0.693,99.972,162600
2,1z1Hg7Vb0AhHDiEmnDE79l,All the Time - Don Diablo Remix,Zara Larsson,70,1HoSmj2eLcsrR0vE9gThr4,All the Time (Don Diablo Remix),2019-07-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-3.432,0,0.0742,0.0794,2.3e-05,0.11,0.613,124.008,176616
3,75FpbthrwQmzHlBJLuGdC7,Call You Mine - Keanu Silva Remix,The Chainsmokers,60,1nqYsOef1yKKuGOVchbsk6,Call You Mine - The Remixes,2019-07-19,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,7,-3.778,1,0.102,0.0287,9e-06,0.204,0.277,121.956,169093
4,1e8PAfcKUYoKkxPhrHqw4x,Someone You Loved - Future Humans Remix,Lewis Capaldi,69,7m7vv9wlQ4i0LFuJiE2zsQ,Someone You Loved (Future Humans Remix),2019-03-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-4.672,1,0.0359,0.0803,0.0,0.0833,0.725,123.976,189052


In [4]:
len(data)

32833

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudne

It looks like the dataset is missing 5 track names, 5 track artists, and 5 album names. This probably doesn't matter given we are focusing on the numerical variables, but we could consider getting rid of these rows since there are a lot of rows anyway.
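As a quick cross-check of the `info()` output, missing values can also be counted directly. The sketch below uses a small made-up frame (`toy`) in place of the Spotify data, so the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the Spotify data (values are made up).
toy = pd.DataFrame({
    "track_name": ["A", None, "C"],
    "track_artist": ["X", None, "Z"],
    "track_popularity": [10, 20, 30],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows with at least one missing value, i.e. the rows dropna() would remove.
rows_with_missing = toy[toy.isna().any(axis=1)]
```

Running the same two expressions on `data` would show whether the missing names, artists, and album names fall in the same rows.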

In [6]:
data.describe()

Unnamed: 0,track_popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
count,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0,32833.0
mean,42.477081,0.65485,0.698619,5.374471,-6.719499,0.565711,0.107068,0.175334,0.084747,0.190176,0.510561,120.881132,225799.811622
std,24.984074,0.145085,0.18091,3.611657,2.988436,0.495671,0.101314,0.219633,0.22423,0.154317,0.233146,26.903624,59834.006182
min,0.0,0.0,0.000175,0.0,-46.448,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4000.0
25%,24.0,0.563,0.581,2.0,-8.171,0.0,0.041,0.0151,0.0,0.0927,0.331,99.96,187819.0
50%,45.0,0.672,0.721,6.0,-6.166,1.0,0.0625,0.0804,1.6e-05,0.127,0.512,121.984,216000.0
75%,62.0,0.761,0.84,9.0,-4.645,1.0,0.132,0.255,0.00483,0.248,0.693,133.918,253585.0
max,100.0,0.983,1.0,11.0,1.275,1.0,0.918,0.994,0.994,0.996,0.991,239.44,517810.0


Given the playlist variables, we want to make sure that the dataset doesn't have songs repeated multiple times for different playlists.

In [7]:
len(data[['track_name','track_artist', 'track_album_id']].drop_duplicates())

28349

It looks like there are duplicate songs in the data set so we're going to have to do something in the wrangling stage in order to deal with this. 

In [8]:
sum(data[['track_name','track_artist', 'track_album_id']].duplicated())

4484
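The two counts above are complementary: 28,349 distinct songs plus 4,484 repeated rows add up to the 32,833 rows in the file. A minimal sketch of the same check on made-up data (the frame `toy` and its values are hypothetical) also shows how `keep=False` lists every copy of a repeated song:

```python
import pandas as pd

# Toy frame: "Song A" appears on three playlists (values are made up).
toy = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B", "Song A"],
    "track_artist": ["X", "X", "Y", "X"],
    "playlist_name": ["P1", "P2", "P1", "P3"],
})
key = ["track_name", "track_artist"]

n_unique = len(toy[key].drop_duplicates())    # distinct songs
n_repeats = int(toy[key].duplicated().sum())  # rows beyond the first copy of each song

# keep=False flags every copy, so we can see which playlists share a song.
all_copies = toy[toy.duplicated(subset=key, keep=False)]
```

By construction, `n_unique + n_repeats` equals the number of rows.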

### Initial Thoughts

It looks like although mode is technically numerical, it's actually a binary variable (0 means it's a minor key and 1 means it's a major key). This means it's essentially a categorical variable, so we may need to think carefully about using it in future analyses.
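One way to make the categorical nature of mode explicit is to map the codes to labels. This is only a sketch on made-up data; the column name `mode_label` and the label strings are our own choices, not part of the dataset.

```python
import pandas as pd

# Toy column of 0/1 mode codes (values are made up).
toy = pd.DataFrame({"mode": [0, 1, 1, 0, 1]})

# Map the codes to labels and store them as a categorical column.
# `mode_label` is a hypothetical column name used for illustration.
toy["mode_label"] = toy["mode"].map({0: "minor", 1: "major"}).astype("category")

# For a 0/1 column, the mean is simply the share of rows coded 1.
share_major = (toy["mode_label"] == "major").mean()
```

Read this way, the mean of about 0.566 that `describe()` reports for mode says that roughly 57% of the rows are in a major key.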

To deal with the duplicated songs, I'm thinking one thing we could do is just group by track name and track artist, then take the mean of the numerical values.

### Wrangling

Since we have decided to identify songs by the track name and track artist (and average the duplicates across albums and playlists), we're going to start by dropping the rows with missing values.

In [9]:
data = data.dropna()

Next, we'll average the values for duplicate track and artist names. This operation has the added bonus of dropping all the categorical variables that we don't care about. We'll keep a second data frame with the repeat tracks so that we can do some analysis about playlist genre. 

In [10]:
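# Keep an independent copy with the repeated tracks for the playlist genre analysis
spotify_songs = data.copy()
# numeric_only=True averages the numerical columns and drops the text columns
data = data.groupby(['track_name', 'track_artist']).mean(numeric_only=True).reset_index()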
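To make the averaging step concrete, here is a minimal sketch on made-up data (the frame `toy` is hypothetical). It assumes a pandas version that accepts `numeric_only=True` in `groupby(...).mean()`; recent versions need it spelled out when text columns are present.

```python
import pandas as pd

# Toy frame: "Song A" appears twice with different energy values (made up).
toy = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B"],
    "track_artist": ["X", "X", "Y"],
    "playlist_genre": ["pop", "edm", "rock"],
    "energy": [0.8, 0.6, 0.5],
})

# One row per (name, artist); numerical columns are averaged,
# and the text column playlist_genre is dropped.
averaged = (
    toy.groupby(["track_name", "track_artist"])
       .mean(numeric_only=True)
       .reset_index()
)
```

The two "Song A" rows collapse into one row whose energy is the average of 0.8 and 0.6.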

In [11]:
data.head()

Unnamed: 0,track_name,track_artist,track_popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,"""I TRIED FOR YEARS... NOBODY LISTENED""",Iceberg Black,18.0,0.914,0.408,10.0,-6.712,0.0,0.141,0.0268,0.00179,0.116,0.0944,140.026,150909.0
1,"""This Is Seagull….""",The Snake Corps,34.0,0.516,0.58,9.0,-13.288,0.0,0.0295,2e-06,0.857,0.11,0.235,135.903,238227.0
2,#1 Stunna,Big Tymers,24.0,0.552,0.8405,5.0,-4.9725,1.0,0.2845,0.0163,0.003655,0.258,0.565,89.0435,281960.0
3,#NAKAMA,XLII,26.0,0.797,0.97,3.0,-3.204,1.0,0.0545,0.385,0.000157,0.318,0.568,108.041,192094.0
4,#Natural,Paty Cantú,50.0,0.8,0.836,0.0,-3.535,0.0,0.0568,0.114,0.0,0.134,0.816,97.023,227013.0


Let's further drop mode and key since they are both discrete variables. 

In [12]:
data = data.drop(columns = ['mode', 'key'])

In [13]:
data.columns = ["Name", "Artist", "Popularity", "Danceability", 
               "Energy", "Loudness", "Speechiness", 
               "Acousticness", "Instrumentalness", "Liveness", "Valence", 
              "Tempo", "Duration (ms)"]

### Research Questions

1. What are the underlying distributions of the different numerical variables? 

2. Are there any correlations between the different variables?

3. How has music evolved over the years, in terms of both metrics and genre preference?

4. If we want to create a popular song based on metrics, what should that entail?

### Data Analysis and Visualizations

Starting out, it's important to get an idea of both the underlying distributions of our data, and how they might relate to each other. As a result, our first plot estimates the distributions. This will be important if we decide to run analyses like linear regression in the future.

In [14]:
cnames = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
dist = alt.Chart(data).mark_bar().encode(
        x = alt.X(alt.repeat('repeat'), type = 'quantitative', bin = alt.Bin(maxbins = 30)), 
        y = alt.Y('count()', title = "Number of Tracks")
        ).properties(
        height=80,
        width = 200
        ).repeat(
        repeat=cnames, 
        columns=3
    )
dist.title = "Distributions of Different Track Qualities"
dist

I chose histograms as a way to approximate the shapes of the underlying distributions of these different variables. Interestingly, we can see that a lot of them do not seem to follow normal distributions, which will be important in the future when running regressions. In particular, it's interesting that a large number of the popularity ratings are in the smallest bin (0-5). I'm wondering if there's something weird here where a lot of songs are just never listened to. The majority of tracks also seem to have lower scores of instrumentalness, acousticness, speechiness, and liveness. This would suggest that most tracks on Spotify contain vocals, are unlikely to be acoustic, don't contain much speech (unlike rap), and were not recorded live.

Danceability, loudness, and energy all appear to be negatively skewed, implying that most tracks on Spotify get higher ratings for these three metrics.

In contrast, valence appears to have a nice rounded curve, with the center right over 0.5. This indicates that most tracks don't fall on either extreme of being positive or negative. It is also interesting that there is a peak in the number of tracks with a tempo between 120 and 130. Finally, the duration distribution is centered between 3.5 and 4 minutes (the `describe()` output above gives a mean of about 226,000 ms and a median of 216,000 ms).
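The skew we read off the histograms could also be quantified with the sample skewness. The sketch below uses two made-up series rather than the Spotify columns: a negative value indicates a longer left tail, and a value near zero indicates a roughly symmetric shape.

```python
import pandas as pd

# Made-up values: most observations are high, with a tail toward low values.
left_tailed = pd.Series([0.1, 0.5, 0.7, 0.8, 0.85, 0.9, 0.9, 0.95])
# Made-up values that are symmetric around 0.5.
symmetric = pd.Series([0.1, 0.3, 0.5, 0.7, 0.9])

skew_left = left_tailed.skew()  # negative: long left tail
skew_sym = symmetric.skew()     # approximately zero
```

Applying `.skew()` to the numerical columns of `data` would give one number per metric to compare against the visual impression.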



Next, we wanted to check to see if there were any correlations between these different song metrics.

In [15]:
corr_df = data.select_dtypes('number').corr('spearman').stack().reset_index(name='corr')
corr_df.loc[corr_df['corr'] == 1, 'corr'] = 0
corr_df['abs'] = corr_df['corr'].abs()
alt.Chart(corr_df, title = "Weak Relationships Between Most Track Qualities").mark_square().encode(
    x= alt.X('level_0', title="Track Quality"),
    y= alt.Y('level_1', title="Track Quality"),
    size= alt.Size('abs', title = "Absolute Value of Correlation"),
    color=alt.Color('corr', scale=alt.Scale(scheme='redblue', reverse=True, domain = [-1,1]),
                       title = "Correlation")
).properties(height = 400, width = 400
).configure_axis(
    labelFontSize = 12,
    titleFontSize = 14
).configure_title(
    fontSize = 18
)

Here we can see that the correlations between most of these variables are quite small. Interestingly, popularity is very weakly correlated with pretty much all of the other variables. The two biggest correlations we see are between energy and acousticness, and energy and loudness. As a result, we decided initially to make scatterplots of energy against these other two variables. However, due to over-crowding with the large number of observations, we ended up choosing 2D histograms instead.
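The correlation plot uses Spearman rather than Pearson correlation. Spearman works on ranks, so it picks up any monotonic relationship, not only linear ones. A small sketch on made-up data (the frame `toy` is hypothetical) shows the difference:

```python
import pandas as pd

# y increases with x, but not linearly (made-up data).
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association; Spearman measures monotonic association.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
```

Here Spearman is exactly 1 because the ranks of `x` and `y` agree perfectly, while Pearson is lower because the relationship is curved. This is a reasonable choice for our data, given that several of the metrics are far from normally distributed.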

In [16]:
a = alt.Chart(data).mark_rect().encode(
    alt.Color('count()', scale=alt.Scale(type='log',scheme='reds')),
    x = alt.X('Loudness', bin=alt.Bin(maxbins=20)), 
    y = alt.Y('Energy', bin=alt.Bin(maxbins=20))
    ).properties(
    height=200,
    width =400
    )
b = alt.Chart(data).mark_rect().encode(
    alt.Color('count()', scale=alt.Scale(type='log',scheme='reds')),
    x = alt.X('Acousticness', bin=alt.Bin(maxbins=20)), 
    y = alt.Y('Energy', bin=alt.Bin(maxbins=20))
    ).properties(
    height=200,
    width = 400
    )

plot = a|b
plot.title = "Relationships Between Energy and both Loudness and Acousticness"
plot

Here we can see that as suggested by the correlation plot, there appears to be a positive relationship between energy and loudness, and a negative relationship between energy and acousticness. One slight issue might be the high number of songs with 0 confidence of acousticness.
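The 2D histograms replace individual points with a count per grid cell, which is what the color encodes. As a rough sketch of that binning, outside of Altair, NumPy's `histogram2d` produces the same kind of count grid. The data here are simulated and only loosely shaped like the loudness and energy columns.

```python
import numpy as np

# Simulated data: energy tends to rise with loudness (not the real columns).
rng = np.random.default_rng(0)
loudness = rng.normal(-7, 3, size=1000)
energy = np.clip(0.7 + 0.05 * (loudness + 7) + rng.normal(0, 0.1, size=1000), 0, 1)

# 20 x 20 grid of counts; every point falls into exactly one cell.
counts, x_edges, y_edges = np.histogram2d(loudness, energy, bins=20)
```

Because every observation lands in one cell, the counts sum to the number of observations, however crowded the scatterplot would have been.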

### Music Metrics Evolution over the years

In [17]:
spotify_songs["track_album_release_date"]=pd.to_datetime(spotify_songs["track_album_release_date"])
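# Rows with missing values were already dropped during wrangling, so a copy is enough here
spotify_song_monthly = spotify_songs.copy()
# Despite the variable name, 'Y' resamples to one mean per release year
spotify_song_monthly= spotify_song_monthly[["track_album_release_date","speechiness","tempo","track_popularity"]].resample('Y', on='track_album_release_date').mean()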
spotify_song_monthly.index = spotify_song_monthly.index.strftime("%Y")
spotify_song_monthly=spotify_song_monthly.dropna()
spotify_song_monthly=spotify_song_monthly.reset_index()

In [18]:
graph1 = alt.Chart(spotify_song_monthly).mark_line(color="orange",opacity=0.8).encode(
    y=alt.Y("speechiness", type='quantitative'),
    x=alt.X('track_album_release_date:O',title="Album Release Date",axis=alt.Axis()),
).properties(    
    width=800,
    height=100,
    title="Speechiness vs Album Release Date"

)
graph2 = alt.Chart(spotify_song_monthly).mark_line(color="blue",opacity=0.5).encode(
    y=alt.Y('tempo', type='quantitative'),
    x=alt.X('track_album_release_date:O',title="Album Release Date",axis=alt.Axis()),
).properties(
    width=800,
    height=100,
    title = "Tempo vs Album Release Date"
)

graph3 = alt.Chart(spotify_song_monthly).mark_line(color="green",opacity=0.3).encode(
    y=alt.Y('track_popularity', type='quantitative',title="Popularity",axis=alt.Axis()),
    x=alt.X('track_album_release_date:O',title="Album Release Date",axis=alt.Axis()),
).properties(
    width=800,
    height=100,
    title = "Popularity vs Release Date")
       
graph1 & graph2 & graph3

It seems that speechiness (the presence of spoken words in a track) has risen sharply since the mid 80s. This is also when the Pop genre gained in popularity, so there may be some link between the two.
Tempo has also been steadily decreasing over the years, though it seems to have settled at a sweet spot of around 120 now.
Interestingly, average track popularity has decreased gradually over the last 50 years, though it has started to pick back up since the mid 2010s.
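The yearly averages behind these line charts can also be computed by grouping on the release year directly. This is an alternative sketch on made-up data, not the cell we ran above; unlike `resample`, it never creates rows for years without releases, so no `dropna()` is needed afterwards.

```python
import pandas as pd

# Toy frame with two releases in each of two years (values are made up).
toy = pd.DataFrame({
    "track_album_release_date": pd.to_datetime(
        ["2018-03-01", "2018-11-20", "2019-06-14", "2019-12-13"]
    ),
    "tempo": [100.0, 120.0, 110.0, 130.0],
})

# Mean tempo per release year.
yearly = (
    toy.groupby(toy["track_album_release_date"].dt.year)["tempo"]
       .mean()
       .rename_axis("year")
       .reset_index()
)
```

The result has one row per year, with columns `year` and `tempo`.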

### Genre Popularity over the years

In [19]:
spotify_songs["year"] = spotify_songs['track_album_release_date'].dt.year.astype('int')

# Highest popularity score among the tracks released in each year
spotify_songs_track = spotify_songs.groupby(['year'])['track_popularity'].max().reset_index()

# Keep every track whose popularity equals the maximum for its release year (ties are kept)
year_max = spotify_songs.groupby('year')['track_popularity'].transform('max')
spotify_songs_track2 = spotify_songs[spotify_songs['track_popularity'] == year_max].sort_values('year')

In [20]:
track_all = alt.Chart(spotify_songs_track2).mark_bar().encode(
    y=alt.Y('track_popularity:Q',title="Popularity"),
    x=alt.X('year:O',title="year"),
    color=alt.Color('playlist_genre'),
    order=alt.Order('playlist_genre',
      sort='ascending'
    )
).properties(title="Music Genre evolution over the years (popularity) ", width=900).configure_title(fontSize=20)

track_all

Genre preference has evolved quite a bit over the last 50 years. It seems like R&B as well as Rock dominated the mainstream popularity till the mid 80s. Ever since then, Pop seems to have emerged as the dominant genre. The late 2000s till 2020 sees the most diverse variety in terms of musical interests.

### Popular music key metrics

In [21]:
graph1 = alt.Chart(spotify_songs).mark_bar().encode(
    x=alt.X("playlist_genre:O",sort="-y",title="Genre"),
    y=alt.Y('sum(track_popularity)',title="Popularity"),
    color=alt.Color('playlist_genre',title="Genre")
).properties(
    width=250,
    height=300,
    title="Genre vs Popularity"
)

graph2 = alt.Chart(spotify_songs).mark_circle(color="violet").encode(
    x=alt.X("loudness:Q",title="Loudness"),
    y=alt.Y('sum(track_popularity)',title="Popularity"),
).properties(
    width=250,
    height=300,
    title="Loudness vs Popularity"
)

graph4 = alt.Chart(spotify_songs).mark_bar().encode(
    x=alt.X("duration_ms:Q",bin=True,title="Song Duration (milliseconds)"),
    y=alt.Y('sum(track_popularity)',title="Popularity"),
).properties(
    width=250,
    height=300,
    title="Song Duration vs Popularity"
)
(graph1|graph2|graph4)

If we wanted to create a song that would get popular, we might have a good chance with one that follows the metrics highlighted above. It should be a pop song, with loudness around -5, and a duration between 200,000 and 300,000 milliseconds (roughly 3 min 20 s to 5 min).
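One caveat is that these charts plot `sum(track_popularity)`, so a tall bar can reflect a bin that simply contains many tracks rather than tracks that are individually popular. A possible check, sketched here on made-up data with hypothetical bin edges, is to compare the sum with the mean and the count per bin:

```python
import pandas as pd

# Toy frame: the middle duration bin has the most tracks (values are made up).
toy = pd.DataFrame({
    "duration_ms": [150000, 210000, 220000, 250000, 320000],
    "track_popularity": [80, 40, 50, 60, 70],
})

# Hypothetical bin edges, chosen only for illustration.
bins = [100000, 200000, 300000, 400000]
toy["duration_bin"] = pd.cut(toy["duration_ms"], bins=bins)

# Sum, mean, and count of popularity in each duration bin.
summary = (
    toy.groupby("duration_bin", observed=True)["track_popularity"]
       .agg(["sum", "mean", "count"])
)
```

In this toy example the middle bin has the largest sum but the lowest mean, which is the pattern we would want to rule out before drawing conclusions from the real data.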

### Summary and Conclusions

We saw that the different track metrics have very different distributions, which reveal some tendencies in the tracks that are available on Spotify. Notably, most songs are not very acoustic, don't contain much speech, and were not recorded live. Additionally, while most songs appear to have moderate valence and popularity, there is a higher density of songs with high levels of energy and danceability. We then saw that most of these metrics are quite weakly or not at all correlated with each other, with the exception of a positive correlation between loudness and energy, and a negative correlation between acousticness and energy. Interestingly, these both seem to be fairly intuitive relationships, and were further affirmed by the 2D histograms.

Several things that we expected to see were consistent with the data, but there were several surprises as well. I never thought that popularity would show a steady decrease over roughly the last 50 years. We like to think that the generation we were born in had the most popular music, but that doesn't seem to be the case when we view this objectively.

## **Follow-Up Research Questions**

1. Would there be clearer or different relationships between the metrics if we separated the songs into genres? The initial data set didn't have song genres, just playlist ones, so we would need another data set that included identifying information about the song (track name and artist name), along with genre.

2. While the correlations between individual metrics and song popularity are quite small, how much of the variation in song popularity can be explained by these metrics if we use all of them as predictors in a linear regression? 

3. What could have triggered the dramatic increase in popularity in the mid 2010s? Does it have to do with economic or political factors? We would like access to consumer sentiment data over those years to see whether a shift there coincided with a renewed interest in music.
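For the second question, the share of variation explained by all predictors together is the R² of a multiple linear regression. The sketch below shows the calculation on simulated data only; the predictors, coefficients, and noise level are invented, and it says nothing about what the Spotify metrics would give.

```python
import numpy as np

# Simulated data: three made-up predictors, two of which affect y.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# R^2 = 1 - (residual variance / total variance); valid here because the
# model includes an intercept, so the residuals have mean zero.
residuals = y - X1 @ coef
r_squared = 1.0 - residuals.var() / y.var()
```

With the real data, `X` would hold the track metrics and `y` the popularity column.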