## What's this notebook?
This notebook is a quick analysis, done over a couple of hours, of the most popular songs on Spotify in 2019.

## Why this notebook?
Tried Spotify for a couple weeks, wasn't extremely appealing. Then saw the data, had a few subjective opinions, and thought I'd confirm things that I thought I already knew.

## For whom is this notebook?
Well, anyone who's not looking for serious data analysis and very intense machine learning. Although, there is one track that's got some explicit terms in its title. So it's kinda PG+(?) :P

## What are assumptions made during this analysis?
* All analyses are based only on the numbers in the data. 
* Any opinions are based on the data, plus probably my own subjective thoughts. Well, they're mostly "duh!" moments.
* No major analyses done.
  * Except for that one pairplot visualization towards the end which shouldn't really be done, but it's easier than generating so many other graphs in separate plots and unnecessarily complicating this
* External Factors for the track (Sentiment for the song, release dates, lyrics, YouTube likes/dislikes/views/comments/whatever etc.) not considered.
* Correlation does not equal causation. So, Pop music doesn't mean the death of Metal (but would it kill to have a couple of Jazz or Metal or Rock or something besides Pop in the top 50?!) ¯\\_(ツ)_/¯

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# Read the data, the file has got an encoding that's not UTF-8
song_data = pd.read_csv("../input/top50spotify2019/top50.csv", encoding="ISO-8859-1")
# Read the 2018 data from another user's dataset; it is only used to compare popular artists with the previous year.
song_data_2018 = pd.read_csv("../input/top-spotify-tracks-of-2018/top2018.csv")

# There was an Unnamed column which was basically the index, which I'll be dropping.
song_data.drop("Unnamed: 0",axis=1,inplace=True)
# Format the columns into a more accessible format. Strip the dots and replace with underscores.
song_data.columns = [x.lower().replace("."," ").strip().replace("  "," ").replace(" ","_") for x in song_data.columns]
# Sort the data on the Popularity metric.
song_data.sort_values("popularity",inplace=True, ascending=False)
# Generate the new index after sorting.
song_data.reset_index(drop=True,inplace=True)
# Generate an additional column that's basically a combination of the Artist and the Track.
song_data["artist_track"] = song_data.apply(lambda x: "{0}, {1}".format(x["artist_name"],x["track_name"]),axis=1)
# A little preview of the data.
song_data.head()
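
To see what the encoding argument and the column clean-up above actually do, here is a minimal, self-contained sketch. It assumes a tiny in-memory CSV: the two rows and their values are made up for illustration, and only the dotted header style mirrors the real file. `clean_column` is a hypothetical helper name that wraps the same chain of string operations used in the cell above.

```python
import io
import pandas as pd

# Made-up sample in the same dotted-header style as the real file; values are illustrative only.
raw_text = (
    ",Track.Name,Artist.Name,Beats.Per.Minute,Loudness..dB..,Valence.\n"
    "1,Señorita,Artist A,117,-6,75\n"
    "2,Track B,Artist B,105,-4,61\n"
)
raw_bytes = raw_text.encode("ISO-8859-1")

# The ISO-8859-1 byte for "ñ" is not valid UTF-8, so decoding as UTF-8 fails.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the right encoding works; the empty first header becomes "Unnamed: 0".
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
sample = sample.drop(columns="Unnamed: 0")


def clean_column(name):
    # Hypothetical helper: lower-case, turn dots into spaces, trim, collapse doubles, then underscore.
    return name.lower().replace(".", " ").strip().replace("  ", " ").replace(" ", "_")


sample.columns = [clean_column(c) for c in sample.columns]
print(utf8_ok)
print(list(sample.columns))
```

The double dots in a header like `Loudness..dB..` are why the clean-up collapses double spaces before inserting underscores.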

#### Parent Genre
* There are variants of each genre. Who knew Pop had so many sub-genres in it?!
* So it's my genre experience (and Google & Wikipedia for validation) to the rescue!
* Anything with Pop is Pop, anything with Reggae is Reggae, and anything that I find Electronic-y is Electronic.

In [None]:
song_data["parent_genre"] = song_data.genre.apply(lambda x: 
{'canadian pop':"Pop",
 'reggaeton flow':"Reggae", 
 'dance pop':"Pop",
 'pop':"Pop",
 'dfw rap':"Hip Hop",
 'trap music':"Hip Hop",
 'country rap':"Country",
 'electropop':"Electronic",
 'reggaeton':"Reggae",
 'panamanian pop':"Pop",
 'canadian hip hop':"Hip Hop",
 'latin':"Pop",
 'escape room':"Escape Room",
 'pop house':"Pop",
 'australian pop':"Pop",
 'edm':"Electronic",
 'atl hip hop':"Hip Hop",
 'big room':"Electronic",
 'boy band':"Pop",
 'r&b en espanol':"R&B",
 'brostep':"Electronic"}[x]
)
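
A dictionary lookup like the one above can also be sketched with `Series.map`, which returns `NaN` for a genre missing from the dictionary instead of raising a `KeyError`. That makes it easy to list the genres that still need a parent. The sample genres and the name `parent_of` below are assumptions for illustration, not part of the notebook's data.

```python
import pandas as pd

# Illustrative genres only; "shoegaze" is deliberately left out of the mapping.
genres = pd.Series(["dance pop", "edm", "reggaeton", "shoegaze"])
parent_of = {"dance pop": "Pop", "edm": "Electronic", "reggaeton": "Reggae"}

# Unmapped genres become NaN rather than raising KeyError.
parents = genres.map(parent_of)

# List every genre that has no parent genre assigned yet.
unmapped = sorted(genres[parents.isna()].unique())
print(unmapped)
```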

#### Popular Tracks and Artists

In [None]:
# Canvas with two plots
fig,ax = plt.subplots(nrows=2,ncols=1,figsize=(8,16));

# Sort the data by popularity, then plot the 15 most popular tracks. Sorting ascending and taking the tail keeps the most popular track at the top of the horizontal bar chart.
song_data.sort_values("popularity").tail(15).plot(x="artist_track",y="popularity",kind="barh",ax=ax[0],title="Top 15 Popular Tracks in 2019");

# Get the number of tracks by artist for 2019, and compare it with 2018.
artist_counts = pd.concat(
    [song_data.artist_name.value_counts().rename("2019"),
     song_data_2018.artists.value_counts().rename("2018")],
    axis=1,sort=False)
artist_counts.sort_values("2019",ascending=False).head(15)[::-1].plot(kind="barh",title="Top 15 Popular Artists",ax=ax[1]);

* When I think Spotify, I don't think Ella Fitzgerald, Satchmo, or Hank Williams (I'm not considering YT or other music streams here).
* Pop artists on the top, I find.
* Interestingly, there's a track from a Netflix series in the list too! :D
* Whatever Post Malone did in 2018, he didn't do in 2019, so there's that drop.
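
The year-over-year comparison above relies on `pd.concat` lining up two `value_counts` results by artist name. A minimal sketch of that alignment, using hypothetical artist names and counts, shows what happens to an artist who appears in only one year. Filling the gaps with 0 is an addition of this sketch; the notebook's plot simply leaves those bars out.

```python
import pandas as pd

# Hypothetical artist names and counts, for illustration only.
artists_2019 = pd.Series(["Artist A", "Artist A", "Artist B"])
artists_2018 = pd.Series(["Artist B", "Artist B", "Artist B", "Artist C"])

# Count tracks per artist in each year and align the two counts on the artist name.
counts = pd.concat(
    [artists_2019.value_counts().rename("2019"),
     artists_2018.value_counts().rename("2018")],
    axis=1,
)

# Artists missing from one year come out as NaN; here they are filled with 0.
counts = counts.fillna(0).astype(int).sort_values("2019", ascending=False)
print(counts)
```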

#### Popular Genres

In [None]:
# Canvas with two plots
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(16,4));

# Plot the most popular Parent Genres in 2019.
song_data.parent_genre.value_counts().plot(kind="bar",ax=ax[0],title="Popular Parent Genres");

# Plot the most popular Genres in 2019
song_data.genre.value_counts().plot(kind="bar",ax=ax[1],title="Popular Genres");

* All those Pop genres on top, and so little of every other genre! :'(
* I can't expect anyone to dance to Metal or Shoegaze music. So could Spotify appeal to the dancer? (probably not)
* I've listened to almost every genre on the top, and almost none of the artists that I listened to made the list.
* Spotify even has a unique genre called Escape Room, for which I couldn't find a Wiki or a Google page, and listening to songs on that list, well it's (subjectively) "different". :|

#### The Pairplot
I was hooked the moment I saw a pairplot in a report. When I saw so many numbers in the music data, I thought: there are so many interesting things to look at and so many graphs to plot, this could do well with a pairplot. (WRONG!)

In [None]:
# Just a pairplot, although I really need to consider removing one part of the graphs. It's basically the axes swapped in half the cases.
sns.pairplot(song_data[["parent_genre","beats_per_minute","energy","danceability","loudness_db","liveness","valence","length","acousticness","speechiness","popularity"]],hue="parent_genre");

Look at all those pretty dots! But I digress.

So many graphs, all crammed into such a small space. I had to double click on the graph and then zoom into it. (facepalm) Ugh, I felt I'd be better off generating these graphs individually. But I'll leave it here as a reminder to self not to exhaustively analyze a dataset with only 50 records in it.
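
To put a number on the crowding, here is a rough count of the panels for the ten numeric columns passed to the pairplot above. This is only arithmetic on the column list, not a plot.

```python
from itertools import combinations

# The ten numeric columns passed to the pairplot (parent_genre is only used for colour).
numeric_columns = [
    "beats_per_minute", "energy", "danceability", "loudness_db", "liveness",
    "valence", "length", "acousticness", "speechiness", "popularity",
]

# The full grid has one panel per (row, column) combination.
total_panels = len(numeric_columns) ** 2

# Each pair of different columns counted once, ignoring which one is on which axis.
unique_pairs = list(combinations(numeric_columns, 2))
print(total_panels, len(unique_pairs))
```

So the grid has 100 panels, but only 45 of them show a distinct pair of attributes. If I recall the API correctly, `sns.pairplot` accepts `corner=True` (seaborn 0.11 and later) to draw just the lower triangle, which would be one way to drop the mirrored half.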

**Popular Song summaries without listening to them**

Can data give a general idea of the tracks on Spotify?

In [None]:
fig,ax = plt.subplots(nrows=3,ncols=3,figsize=(24,24));
plt.suptitle("Popular Tracks and their attributes");
# One 2D density plot of popularity against each audio attribute, filling the grid row by row.
# Newer seaborn versions take x=/y= as keywords and use fill= in place of the deprecated shade=.
attributes = [("beats_per_minute","Tempo"),("energy","Energy"),("danceability","Danceability"),
              ("loudness_db","Loudness"),("liveness","Liveness"),("valence","Valence"),
              ("length","Length"),("acousticness","Acousticness"),("speechiness","Speechiness")]
for axis,(column,label) in zip(ax.flat,attributes):
    sns.kdeplot(x=song_data.popularity,y=song_data[column],fill=True,ax=axis).set_title("Popularity to "+label);

#### What does the data tell about these tracks?
* They have a medium tempo.
* They tend to be not too energetic and below average in terms of liveliness. So, mellow-lively tracks.
* They're danceable.
* These songs are not too loud.
* They're not too lively.
* These songs last around 4 mins on average.
* They're less acoustic and less speechy.
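
The reading above comes from eyeballing density plots; a numeric cross-check could look like the sketch below. The values are made up for illustration (the real notebook would use `song_data`), and it assumes the `length` column is in seconds.

```python
import pandas as pd

# Made-up values for illustration; the real notebook would use song_data instead.
toy = pd.DataFrame({
    "beats_per_minute": [95, 120, 104, 150],
    "danceability": [70, 82, 64, 76],
    "length": [180, 210, 195, 215],  # assumed to be seconds
})

# Mean and median per attribute, as a compact stand-in for reading the plots.
summary = toy.agg(["mean", "median"])

# Convert the average length from seconds to minutes.
mean_length_minutes = toy["length"].mean() / 60
print(summary)
print(round(mean_length_minutes, 2))
```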

#### Conclusion
* So a popular track on Spotify is one that is lightly mellow, danceable, and less acoustic.
* I'd attribute the less acoustic nature to the prevalence of Pop and Electronic genres.


#### Closing thoughts (Notes to self)
* There's so much more I could add to this, but I'll keep it at this. A short notebook with a short and not-so-complete analysis.
* Turns out I can get a short analysis done without delving into military precision and detail.
* Use a pairplot for fewer parameters, and not for general analysis.
* Still need some practice at being objective about topics like music without spiraling into madness because of subjective opinions.
* Maybe Spotify isn't for me unless I'm looking for modern genres and not the older bands. (duh!)
