In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Overview
This is my first notebook on Kaggle. It is meant as an example for my students of how to apply Python in a Jupyter Notebook for the Data-Driven Marketing course. I want the notebook to guide them systematically, so that they can understand the idea behind every piece of code written here.

In [None]:
filename = '/kaggle/input/top50spotify2019/top50.csv' # path to the file in the available directory
df = pd.read_csv(filename, encoding = 'ISO-8859-1') # read the CSV with an explicit encoding; without it, the file cannot be read properly
df.head(10) # show the first 10 rows of the dataframe

In [None]:
#Now I see that there is a column 'Unnamed: 0', which I don't think is required. So, I need to remove this column.
del df['Unnamed: 0']

In [None]:
#show the updated dataframe, first 10 data
df.head(10)

In [None]:
#I want to know the shape, column details
print("Data shape:",df.shape)
print("Column Details:",df.columns)

In [None]:
#and understanding the non-null and the data type
df.info()

In [None]:
#check null values
df.isnull().sum()

In [None]:
df.describe().round(decimals=2) # overview of the descriptive statistics, rounded to 2 decimal places

In [None]:
#creating a bar plot to count frequency of 'Genre'
sns.catplot(y = "Genre", kind = "count",
            palette = "pastel", edgecolor = ".6",
            data = df)
plt.show()

In [None]:
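#creating a correlation heatmap from the numeric columns of the dataframe
#(text columns such as 'Genre' cannot be correlated and raise an error in recent pandas versions)
plt.figure(figsize=(15,15))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)
plt.show()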

# Initial Findings: 
The output shows that the pairs below are moderately correlated among the top 50 songs in the data.

* Energy and Loudness
* BPM and Speechiness
* Energy and Valence


Now, we want to look again at the details of the five parameters involved in these three pairs.
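
The pairs above were read off the heatmap by eye. As a rough cross-check, they can also be ranked programmatically. The sketch below uses a small invented table (not the real dataset) whose column names mirror the notebook's; the values and the `toy` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are invented for illustration.
toy = pd.DataFrame({
    "Energy": [55, 65, 80, 45, 70, 60],
    "Loudness..dB..": [-7, -5, -3, -9, -4, -6],
    "Valence.": [40, 70, 35, 50, 80, 45],
    "Genre": ["pop", "pop", "latin", "pop", "latin", "edm"],  # non-numeric column
})

# Correlate numeric columns only, so text columns such as Genre are ignored.
corr = toy.select_dtypes(include="number").corr()

# Keep the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute strength, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Ranking by absolute value keeps strong negative correlations near the top as well, which a quick look at the heatmap colors can miss.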

In [None]:
#looking at the data again, but this time focusing only on the five parameters that have relatively moderate correlation
df[['Energy','Loudness..dB..','Beats.Per.Minute','Speechiness.','Valence.']].describe().round(decimals=2)

# Initial Findings (cont.):
Based on the output, we can infer the following:
* the Energy is 64.06 on average
* the Loudness is in the middle of the range
* BPM is around 120 with standard deviation of 30 (so it's around 90 to 150)
* Speechiness is around 12 with stdev of 11 (which means it's not too speechy)
* Valence is around 54 with stdev 22.34
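
The "around 90 to 150" reading is the mean plus or minus one standard deviation. That band holds roughly two thirds of the values only when the data is approximately normal, so it may be worth checking how many tracks actually fall inside it. A minimal sketch with invented tempo values (not the real dataset):

```python
import pandas as pd

# Invented tempo values; not the real dataset.
bpm = pd.Series([85, 95, 100, 105, 120, 125, 135, 150, 176, 190],
                name="Beats.Per.Minute")

mean, std = bpm.mean(), bpm.std()   # pandas uses the sample std (ddof=1)
low, high = mean - std, mean + std

# Share of tracks that actually fall inside the one-standard-deviation band.
share_inside = bpm.between(low, high).mean()
print(f"mean={mean:.1f}, std={std:.1f}, "
      f"band=({low:.1f}, {high:.1f}), inside={share_inside:.0%}")
```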

## Commentary:
I have just realized that the parameters in this dataset use a different scale from the (I assume) 'official' one from [Spotify](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). For example, 'Energy' should range from 0 to 1, as described below:
> Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

So I will rescale the data to a common 0-to-1 range using min-max normalization.
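
Before rescaling, it may help to compare two options. Min-max normalization maps the smallest value in the sample to 0 and the largest to 1, so the result describes a track's position within these 50 songs rather than on Spotify's absolute scale. If the dataset stores each feature as a percentage (an assumption, suggested by values such as an Energy mean of 64), dividing by 100 would recover the documented 0.0 to 1.0 scale instead. The values below are invented for illustration.

```python
import pandas as pd

energy = pd.Series([32, 45, 64, 70, 88], name="Energy")  # invented 0-100 style scores

# Option 1: min-max normalization, relative to this sample only.
minmax = (energy - energy.min()) / (energy.max() - energy.min())

# Option 2: rescale to Spotify's documented 0.0-1.0 range,
# ASSUMING the dataset stores the feature as a percentage.
official_scale = energy / 100

print(pd.DataFrame({"raw": energy, "minmax": minmax, "div_100": official_scale}))
```

The two columns differ noticeably, which matters whenever a rescaled value is compared against a threshold quoted from Spotify's documentation.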

In [None]:
df1 = df[['Energy','Loudness..dB..','Beats.Per.Minute','Speechiness.','Valence.']] # create a new dataframe 'df1' that focuses only on the five parameters
df1.head()

In [None]:
#now I want to know the descriptive statistics of the original values
print(df1.describe())

print('=============================================================================')

#min-max normalization: rescale every column so that its minimum is 0 and its maximum is 1
dfnorm = (df1-df1.min())/(df1.max()-df1.min())
print(dfnorm.describe())

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True, figsize=(10,5))

axs[0].hist(df['Speechiness.'], bins=5) #for non-normalized data
axs[1].hist(dfnorm['Speechiness.'], bins=5) #for normalized data

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True, figsize=(10,5))

axs[0].hist(df['Beats.Per.Minute'], bins=5) #for non-normalized data
axs[1].hist(dfnorm['Beats.Per.Minute'], bins=5) #for normalized data

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True, figsize=(10,5))

axs[0].hist(df['Valence.'], bins=15) #for non-normalized data
axs[1].hist(dfnorm['Valence.'], bins=15) #for normalized data

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True, figsize=(10,5))

axs[0].hist(df['Loudness..dB..'], bins=10) #for non-normalized data
axs[1].hist(dfnorm['Loudness..dB..'], bins=10) #for normalized data

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True, figsize=(10,5))

axs[0].hist(df['Energy'], bins=10) #for non-normalized data
axs[1].hist(dfnorm['Energy'], bins=10) #for normalized data

# Discussion
## Speechiness
It appears that the standardized 'Speechiness' has a mean of 0.22 (std = 0.25). Now, if we refer to the description from Spotify:
> Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. *Values below 0.33 most likely represent music and other non-speech-like tracks*. 

We can infer that the most popular music in 2019 tends to be non-speech-like, according to the mean of the standardized values. Personally, I think songs such as 'Bad Guy' by Billie Eilish or 'You Need to Calm Down' by Taylor Swift support this analysis to a certain degree.
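
Rather than relying on the mean alone, the tracks can be counted in each of the bands that Spotify describes. The sketch below uses invented values and assumes that dividing the dataset's scores by 100 recovers Spotify's 0.0 to 1.0 scale; that assumption is not confirmed by the dataset itself.

```python
import pandas as pd

# Invented speechiness values on the dataset's apparent 0-100 scale.
speech = pd.Series([3, 5, 9, 12, 15, 29, 35, 40, 46, 70], name="Speechiness.")

# ASSUMPTION: dividing by 100 recovers Spotify's 0.0-1.0 scale.
speech01 = speech / 100

# Bands taken from the Spotify description quoted above.
bands = pd.cut(
    speech01,
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["music / non-speech", "music and speech", "mostly spoken"],
    include_lowest=True,
)
counts = bands.value_counts()
print(counts)
```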

## Beats Per Minute
Based on the analysis, I find that the BPM values are concentrated toward the lower end of the range (standardized mean = 0.3, standardized std = 0.2), with a tail toward faster tempos. Let us look again at Spotify's description before drawing a conclusion:
> The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. 

Although this description implies that BPM does not need to be standardized, we can still infer from the raw values that most songs fall within roughly 90-150 BPM (mean = 120, std = 30).
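
A numeric check can complement reading the histogram. By the usual statistical convention, a distribution whose values cluster at the low end with a long tail toward high values has positive (right) skew, and its mean sits above its median. A minimal sketch with invented tempo values (not the real dataset):

```python
import pandas as pd

# Invented tempo values with a few fast outliers; not the real dataset.
bpm = pd.Series([88, 92, 95, 98, 100, 102, 105, 110, 150, 176, 190])

skew = bpm.skew()                       # sample skewness
mean, median = bpm.mean(), bpm.median()

# Positive skew: long tail toward high values, mass concentrated at the low end.
# Negative skew: the mirror image.
if skew > 0:
    direction = "right (positive)"
elif skew < 0:
    direction = "left (negative)"
else:
    direction = "none"

print(f"skewness={skew:.2f}, mean={mean:.1f}, median={median:.1f}, skew: {direction}")
```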

## Valence
We can see from 'Valence' that the songs lean, on average, toward the higher end of the scale (standardized mean = 0.5). However, several songs still sit toward the lower end. By definition:

> A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Thus, we can infer that these 50 songs tend to sound 'happier', 'more cheerful', or 'more euphoric'.
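
One way to test this reading is to count how many tracks sit above the midpoint of Spotify's scale. The sketch below uses invented values and assumes the dataset stores valence as a percentage, so that dividing by 100 gives the 0.0 to 1.0 scale.

```python
import pandas as pd

valence = pd.Series([10, 25, 35, 44, 52, 58, 61, 70, 84, 95])  # invented 0-100 values
valence01 = valence / 100   # ASSUMPTION: the dataset stores valence as a percentage

# Share of tracks on the "positive" half of the scale.
share_positive = (valence01 > 0.5).mean()
print(f"{share_positive:.0%} of tracks have valence above 0.5 "
      f"(median = {valence01.median():.2f})")
```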

## Energy
The output shows that the standardized average score is 0.57. By definition:
> Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. 

On average, the top 50 songs tend to be 'in the middle', neither too 'energetic' nor too 'gloomy'. I think we can just say this music is 'enjoyable' in terms of 'energy'.

## Loudness
The analysis indicates that the average loudness is around -5.6 dB. By definition:
> The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.

There's not much I can say in terms of loudness, but when I listen to the tracks with the lowest and highest loudness scores in this data, the low-loudness ones sound 'fatter' or 'deeper', whereas the high-loudness ones sound 'brighter' or 'crisper'.
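
To find which tracks to listen to, the quietest and loudest rows can be looked up directly. The sketch below uses invented rows; `Track.Name` is a hypothetical column name used for illustration and should be replaced with the title column in the actual dataframe.

```python
import pandas as pd

# Invented rows; 'Track.Name' is a hypothetical column name used for illustration.
toy = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C", "Song D"],
    "Loudness..dB..": [-11, -6, -4, -2],
})

# idxmin / idxmax return the index label of the smallest / largest value.
quietest = toy.loc[toy["Loudness..dB.."].idxmin()]
loudest = toy.loc[toy["Loudness..dB.."].idxmax()]

print("Quietest:", quietest["Track.Name"], quietest["Loudness..dB.."], "dB")
print("Loudest: ", loudest["Track.Name"], loudest["Loudness..dB.."], "dB")
```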

# Conclusion and Limitation
This is my exploratory analysis, which I hope gives a glimpse of how data can be used to better understand the nature of an event, occurrence, or experience. From my analysis, it can be inferred that the top 50 songs on Spotify in 2019 share these five characteristics.

These findings are, of course, limited by the data provided in the dataset. We can agree that many other factors (beyond musical parameters) affect the popularity of a song in a given year.

I hope this can be useful. Please **upvote** if you learn something from this notebook!