# Looking at Data Distributions

When you see a dataset for the first time, one of the first things to do is plot the distributions of its columns.

What information can we make out of data distributions?

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# read and process the tracks dataset
tracks_df = pd.read_csv('data/spotify_daily_charts_tracks.csv')
tracks_df.head()

### 2. Get a grasp of all possible values in each column
1. Get the number of unique entries in each column with string type (dtype: object)
> Q: Why are there more track ids than track names? List all the possible reasons you can think of.
2. Confirm if the range (range = max - min) of the song metrics columns (danceability, energy, ...) matches what is declared in the documentation 
3. Using `describe`, generate a table of basic statistics for the song metrics columns
> Q: Give 3 insights based on the output of `describe`
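
For step 2, one way to compare observed ranges against documented ones is sketched below. It uses a small made-up table (`toy_df`) instead of `tracks_df`, and the `documented_bounds` dictionary is an assumption: it takes danceability, energy, and valence to be documented as running from 0 to 1. Replace both with the real data and the bounds you find in the documentation.

```python
import pandas as pd

# Toy stand-in for tracks_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    'danceability': [0.45, 0.71, 0.88, 0.62],
    'energy': [0.30, 0.95, 0.67, 0.51],
    'valence': [0.12, 0.80, 1.20, 0.55],  # 1.20 is deliberately out of bounds
})

# Assumed documented bounds (min, max) for each metric.
documented_bounds = {
    'danceability': (0.0, 1.0),
    'energy': (0.0, 1.0),
    'valence': (0.0, 1.0),
}

# Observed min, max, and range (max - min) for each column.
observed = toy_df.agg(['min', 'max']).T
observed['range'] = observed['max'] - observed['min']

# Put the documented bounds next to the observed values and flag violations.
bounds = pd.DataFrame.from_dict(
    documented_bounds, orient='index', columns=['doc_min', 'doc_max']
)
report = observed.join(bounds)
report['within_bounds'] = (
    (report['min'] >= report['doc_min']) & (report['max'] <= report['doc_max'])
)
print(report)
```

A column flagged as out of bounds is worth a closer look before any further analysis, since it may point to a data-entry or processing error.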

In [None]:
len(tracks_df['track_id'].unique())

In [None]:
len(tracks_df['track_name'].unique())

In [None]:
len(tracks_df['album_id'].unique())

In [None]:
tracks_df['key'].unique()

In [None]:
tracks_df['mode'].unique()

In [None]:
tracks_df[['popularity', 'danceability', 'energy',
       'loudness','speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo']].describe()

### 3. Histograms
We can now plot histograms of the dataset's columns.
Put simply, histograms are graphical representations of tallies.
Read more about histograms [here](https://statistics.laerd.com/statistical-guides/understanding-histograms.php).

These are very useful in exploratory data analysis (EDA) because, at a glance, you can already see how the data is spread over its range.
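
To make the "tallies" idea concrete, here is a minimal sketch that counts values per bin with `np.histogram` and prints a text version of the bars. The durations are made up for illustration and are not taken from the tracks dataset.

```python
import numpy as np

# Made-up track durations in minutes, for illustration only.
durations = np.array([2.1, 2.8, 3.0, 3.2, 3.3, 3.5, 3.6, 3.9, 4.2, 5.8])

# Tally how many values fall into each 1-minute bin between 2 and 6 minutes.
counts, bin_edges = np.histogram(durations, bins=[2, 3, 4, 5, 6])

# Each row of '#' marks is one histogram bar drawn sideways.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left}-{right} min: {'#' * int(count)} ({count})")
```

A plotted histogram shows the same per-bin counts as bar heights.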

In particular, you should look out for:
1. Skewness - Do the values peak around the mean, or is the peak at lower values with a long tail to the right (right-skewed), or at higher values with a long tail to the left (left-skewed)?
2. Mode - Does it have one peak (unimodal)? Two peaks (bimodal)? How many peaks?
3. Outliers - Are there a few data points that are substantially distant from the bulk of the values?

It is strongly advised that you look at histograms before you do any aggregations.

> Q: Modify the code below to plot histograms for all the numeric columns in `tracks_df`. For each histogram, create a markdown cell below it and write 1-3 sentences about what you observe in the plot.

In [None]:
tracks_df.columns

In [None]:
# convert duration from milliseconds to minutes
tracks_df['duration_mins'] = tracks_df['duration'] / 60000

In [None]:
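# distplot is deprecated in recent seaborn releases; histplot with kde=True draws
# a histogram of counts with a smooth density line on top
sns.histplot(tracks_df['duration_mins'], kde=True)
plt.title('Duration in Minutes')
plt.ylabel('Frequency')
plt.show()

# Sometimes the smooth line does not follow the histogram bars closely.
# The line is a Gaussian kernel density estimate (KDE), and we don't expect it to work well for non-continuous values.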

- Most tracks in the Top 200 tend to last around 3-4 mins. There are more songs that last longer than 4 mins than songs that are shorter than 2.5 mins.
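
The smooth line drawn over a histogram is a kernel density estimate, which assumes the underlying values are continuous. The sketch below shows why it can mislead on discrete columns: it fits `scipy.stats.gaussian_kde` to a made-up column that only takes the values 0, 1, and 2, then evaluates the estimate at points in between.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up values of a discrete column that only takes the integers 0, 1, and 2.
discrete_values = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1, 0, 2])

kde = gaussian_kde(discrete_values)

# The estimate is positive at 0.5 and 1.5 even though such values never occur.
for x in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"estimated density at {x}: {kde(x)[0]:.3f}")
```

For columns like this, the bars of the histogram are the more trustworthy part of the plot.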

In [None]:
for col in ['popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo']:
    sns.histplot(tracks_df[col], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.title(col)
    plt.ylabel('Frequency')
    plt.show()


## 4. Distribution Properties
### Skewness
[Skewness](https://towardsdatascience.com/testing-for-normality-using-skewness-and-kurtosis-afd61be860) measures how asymmetric a distribution is around its mean. The normal distribution is symmetric and has zero skewness, so a skewness far from zero is one sign that a distribution deviates from the normal shape.
   - Skew < 0 indicates that the tail is on the left side of the distribution, which extends towards more negative values (left-tailed).
   - Skew > 0 indicates that the tail is on the right side of the distribution, which extends towards more positive values (right-tailed).
   - Skew = 0 indicates that there is no skewness in the distribution. A perfectly symmetrical distribution has a skew of 0.
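
As a minimal sketch of where the number comes from, the sample skewness can be computed by hand as the third standardized moment and compared with `scipy.stats.skew` (using its default settings). The sample is made up for illustration.

```python
import numpy as np
from scipy.stats import skew

# Made-up sample with a long right tail, for illustration only.
sample = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])

# Third standardized moment: the mean cubed deviation divided by the
# mean squared deviation raised to the power 3/2.
deviations = sample - sample.mean()
manual_skew = np.mean(deviations ** 3) / np.mean(deviations ** 2) ** 1.5

print(f"manual: {manual_skew:.4f}, scipy: {skew(sample):.4f}")

# Mirroring the sample moves the tail to the left and flips the sign of the skew.
print(f"mirrored: {skew(-sample):.4f}")
```

The single large value on the right pulls the cubed deviations positive, which is why a right tail gives a positive skew.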

<div>
<img src="https://www.conversion-uplift.co.uk/wp-content/uploads/2020/06/Skewness-photo.png" width="500"/>
</div>

In [None]:
from scipy.stats import skew, kurtosis

In [None]:
def skew_type(skewval, skewthres):
    # label the tail direction only when |skew| exceeds the threshold
    if skewval > skewthres:
        return "right-tailed"
    elif skewval < -skewthres:
        return "left-tailed"
    else:
        return "approximately symmetric"
    


In [None]:
for col in ['popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo']:
    skewval = skew(tracks_df[col])  # compute once, reuse for the value and the label
    print("Skewness of variable %s : %0.2f (%s)" % (col, skewval, skew_type(skewval, 0.1)))


### Kurtosis
[Kurtosis](https://towardsdatascience.com/testing-for-normality-using-skewness-and-kurtosis-afd61be860) is a measure of how the tails of a distribution differ in shape from the tails of the normal distribution. While skewness focuses on asymmetry, kurtosis focuses on how heavy the tails are.

![Kurtosis](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fimg.tfd.com%2Fmk%2FK%2FX2604-K-11.png&f=1&nofb=1)

- The kurtosis of a normal distribution is 3. Note that `scipy.stats.kurtosis` subtracts 3 by default (excess kurtosis), so it returns a value close to 0 for normally distributed data.
- If kurtosis<3, it is said to be *platykurtic*, which means it tends to produce fewer and less extreme outliers than the normal distribution.
- If kurtosis>3, it is said to be *leptokurtic*, which means it tends to produce more outliers than the normal distribution.
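
The two conventions are easy to mix up, so here is a small sketch comparing them on simulated normal data. The random seed and the sample size are arbitrary choices made for this example.

```python
import numpy as np
from scipy.stats import kurtosis

# Simulated draws from a standard normal distribution.
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)

pearson = kurtosis(normal_sample, fisher=False)  # original definition, close to 3 here
excess = kurtosis(normal_sample)                 # default fisher=True, close to 0 here

print(f"Pearson kurtosis: {pearson:.3f}")
print(f"Excess kurtosis:  {excess:.3f}")
```

When comparing against the thresholds listed above, check which convention your numbers use: with the default scipy output, the reference value is 0 rather than 3.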

In [None]:
def kurtosis_type(kurtval, kurtthres):
    # kurtval is expected to be excess kurtosis: in scipy's default implementation,
    # 3 is subtracted from the original definition, so a normal distribution gives 0
    if kurtval > kurtthres:
        return "heavy-tailed"
    elif kurtval < -kurtthres:
        return "light-tailed"
    else:
        return "approximately normal"
    


In [None]:
for col in ['popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo']:
    kurtval = kurtosis(tracks_df[col])  # compute once, reuse for the value and the label
    print("Kurtosis of variable %s : %0.2f (%s)" % (col, kurtval, kurtosis_type(kurtval, 0.1)))


## Try it yourself!
Pick an artist and compare the distribution of each audio feature across their songs with its distribution across all charting tracks over the whole time period. What does this say about the artist?
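
One possible starting point is sketched below. It uses a made-up table (`toy_df`) rather than `tracks_df`, and the `artist_name` column is an assumption about how artists are labelled; adjust the column and feature names to match the real data.

```python
import pandas as pd

# Toy stand-in for tracks_df; artist names and values are made up.
toy_df = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist A', 'Artist A',
                    'Artist B', 'Artist B', 'Artist C'],
    'danceability': [0.80, 0.85, 0.90, 0.50, 0.55, 0.60],
    'energy': [0.40, 0.45, 0.50, 0.70, 0.75, 0.80],
})
features = ['danceability', 'energy']

# Keep only the chosen artist's tracks.
artist_df = toy_df[toy_df['artist_name'] == 'Artist A']

# Side-by-side summary: the artist's tracks versus every track in the table.
comparison = pd.concat(
    {
        'artist': artist_df[features].describe(),
        'all_tracks': toy_df[features].describe(),
    },
    axis=1,
)
print(comparison.loc[['mean', '50%', 'std']])
```

Summary statistics only go so far, so it is still worth plotting the artist's histogram next to the overall one for each feature.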

### Resources
More details on skewness and kurtosis are available [here](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa) and [here](https://brownmath.com/stat/shape.htm).

