<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Descriptive Statistics</a></span><ul class="toc-item"><li><span><a href="#Sample-Data" data-toc-modified-id="Sample-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Sample Data</a></span></li></ul></li><li><span><a href="#Measures-of-Center" data-toc-modified-id="Measures-of-Center-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Measures of Center</a></span><ul class="toc-item"><li><span><a href="#Mathematical-Properties" data-toc-modified-id="Mathematical-Properties-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Mathematical Properties</a></span></li></ul></li><li><span><a href="#Measures-of-Spread" data-toc-modified-id="Measures-of-Spread-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Measures of Spread</a></span><ul class="toc-item"><li><span><a href="#Min,-Max,-and-Range" data-toc-modified-id="Min,-Max,-and-Range-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Min, Max, and Range</a></span></li><li><span><a href="#Percentiles-and-IQR" data-toc-modified-id="Percentiles-and-IQR-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Percentiles and IQR</a></span></li><li><span><a href="#Standard-Deviation" data-toc-modified-id="Standard-Deviation-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Standard Deviation</a></span></li></ul></li><li><span><a href="#df.describe()" data-toc-modified-id="df.describe()-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>df.describe()</a></span></li><li><span><a href="#Visual-Description" data-toc-modified-id="Visual-Description-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Visual Description</a></span><ul class="toc-item"><li><span><a href="#Histograms" 
data-toc-modified-id="Histograms-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Histograms</a></span></li><li><span><a href="#Box-and-Whisker---Box-Plot" data-toc-modified-id="Box-and-Whisker---Box-Plot-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Box and Whisker - Box Plot</a></span></li></ul></li><li><span><a href="#Other-Shape-Descriptors" data-toc-modified-id="Other-Shape-Descriptors-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Other Shape Descriptors</a></span><ul class="toc-item"><li><span><a href="#Skew" data-toc-modified-id="Skew-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Skew</a></span></li><li><span><a href="#Kurtosis" data-toc-modified-id="Kurtosis-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Kurtosis</a></span><ul class="toc-item"><li><span><a href="#Statistical-Moments" data-toc-modified-id="Statistical-Moments-7.2.1"><span class="toc-item-num">7.2.1&nbsp;&nbsp;</span>Statistical Moments</a></span></li></ul></li><li><span><a href="#Modality" data-toc-modified-id="Modality-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Modality</a></span></li><li><span><a href="#Rock-Music-Data" data-toc-modified-id="Rock-Music-Data-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Rock Music Data</a></span></li></ul></li><li><span><a href="#Level-Up:-Rock-Artist-Analysis" data-toc-modified-id="Level-Up:-Rock-Artist-Analysis-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Level Up: Rock Artist Analysis</a></span></li><li><span><a href="#Level-Up:-Iris-Plots" data-toc-modified-id="Level-Up:-Iris-Plots-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Level Up: Iris Plots</a></span><ul class="toc-item"><li><span><a href="#Histograms-Across-Groups" data-toc-modified-id="Histograms-Across-Groups-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Histograms Across Groups</a></span></li><li><span><a href="#Categorical-Plots" data-toc-modified-id="Categorical-Plots-9.2"><span 
class="toc-item-num">9.2&nbsp;&nbsp;</span>Categorical Plots</a></span><ul class="toc-item"><li><span><a href="#Swarm-Plot" data-toc-modified-id="Swarm-Plot-9.2.1"><span class="toc-item-num">9.2.1&nbsp;&nbsp;</span>Swarm Plot</a></span></li><li><span><a href="#Violin-Plot" data-toc-modified-id="Violin-Plot-9.2.2"><span class="toc-item-num">9.2.2&nbsp;&nbsp;</span>Violin Plot</a></span></li></ul></li></ul></li></ul></div>

In [None]:
from scipy import stats
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline

# Objectives

- Use measures of center and spread to describe data
- Use histograms and box-and-whisker plots to describe data

# Descriptive Statistics

When you are trying to understand your data, looking at the raw values alone rarely yields much insight. We need ways to condense a large set of values into a few numbers that summarize it. These summaries make the data understandable both for you and for the people you work with. We call them **descriptive statistics**.

Each of our data table's columns has a bunch of values. We might have a set of body temperatures or house prices or birth rates or frog leg lengths. How, in general, can we characterize such a set of numbers?

## Sample Data

Let's build a simple dataset, based on a hypothetical survey of the number of pairs of shoes owned by 11 random people:

In [None]:
data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

This dataset has a bunch of individual observations in a range of values. These observations have an **empirical distribution** describing how the values are distributed across this range. We'll shorten this to just **distribution** for now. Everything that follows is our attempt to understand the distribution of our data.

# Measures of Center

One natural place to begin is to ask about where the **middle** of the data is. In other words, what is the value that is closest to our other values? 

There are three common measures used to describe the "middle":

- **Mean**: The sum of values / number of values
- **Median**: The value with as many values above it as below it
    - If the dataset has an even number of values, the median is the mean of the two middle numbers.
- **Mode**: The most frequent value(s)
    - A dataset can have multiple modes if multiple values are tied for the most frequent.

Let's see what we have for our example:

In [None]:
print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
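# stats.mode returns a scalar in recent SciPy releases and a length-1 array in older ones
print(f"Mode: {np.atleast_1d(stats.mode(data).mode)[0]}")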

In [None]:
## You can also find the mode(s) using np.unique()
counts = np.unique(data, return_counts=True)
counts
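The value/count pairs returned by `np.unique()` can be turned into the list of modes by keeping every value whose count equals the largest count. A minimal sketch (the names `values`, `value_counts`, and `modes` are illustrative, not part of the original notebook):

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# np.unique returns the sorted distinct values and how often each one occurs
values, value_counts = np.unique(data, return_counts=True)

# Keep every value tied for the highest count, so ties yield several modes
modes = values[value_counts == value_counts.max()]
print(f"Mode(s): {modes}")
```

Unlike picking a single element, this keeps all tied values, which matches the definition of mode given above.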

**Discussion**: If somebody asked you "How many pairs of shoes do people usually have?", how would you answer (based on these data)?

## Mathematical Properties

The mean $\bar{x}$ is the point that minimizes the *sum of squared differences* for a given set of data.

<details>
    <summary>
        Proof
    </summary>
    We want to find the point $k$ that minimizes $L(k) = \Sigma^n_{i=1}(x_i-k)^2$. Now, a calculus trick, which we'll see again: To find the minimum of a function, we'll set its derivative to 0. Taking the derivative, we have:

$L'(k) = -2\Sigma^n_{i=1}(x_i-k)$.

Now we solve $L'(k) = 0$ for $k$:

$-2\Sigma^n_{i=1}(x_i-k) = 0$, so <br/><br/>
$\Sigma^n_{i=1}(x_i-k) = 0$, so <br/><br/>
$\Sigma^n_{i=1}x_i = \Sigma^n_{i=1}k = nk$, so <br/><br/>
$k = \frac{\Sigma^n_{i=1}x_i}{n} = \bar{x}$.
    </details>
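The claim about the mean can also be checked numerically. The sketch below evaluates the sum of squared differences on a grid of candidate values of $k$ and compares the best candidate with `data.mean()`. The grid spacing of 0.01 is an arbitrary choice made for illustration.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Candidate values of k from the minimum to the maximum, in steps of 0.01
candidates = np.linspace(data.min(), data.max(), 701)

# Sum of squared differences L(k) for every candidate (one column per k)
squared_loss = ((data[:, None] - candidates[None, :]) ** 2).sum(axis=0)

best_k_squared = candidates[squared_loss.argmin()]
print(f"Grid minimizer: {best_k_squared:.2f}")
print(f"Mean:           {data.mean():.4f}")
```

The grid minimizer should agree with the mean up to the grid resolution.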


By contrast, the median is the point that minimizes the *sum of absolute differences*.

<details>
    <summary>
    Proof
    </summary>
    We want to find the point $k$ that minimizes $D(k) = \Sigma^n_{i=1}|x_i-k|$. Taking the derivative, we have:

$D'(k) = \Sigma^n_{i=1}\frac{k-x_i}{|k-x_i|}$.

Now we solve $D'(k) = 0$ for $k$:

Consider the sum $\Sigma^n_{i=1}\frac{k-x_i}{|k-x_i|} = 0$. Ignoring the case where $k = x_i$, each of the addends in this sum is $1$ if $k > x_i$ and $-1$ if $k < x_i$. To make this sum equal to 0, we thus want to choose $k$ such that there are the same number of $1$s and $-1$s, which means that we want to choose $k$ to be the middle number, i.e. the median.

Notes:
- if $n$ is odd, then the minimum of the function occurs not where its derivative is 0 but where it is *undefined*;
- if $n$ is even, then *any* number between the two middle numbers will minimize our function; by convention, the median is taken to be the mean of those two numbers.
    </details>
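The same kind of numerical check works for the median. This sketch evaluates the sum of absolute differences on a grid of candidate values of $k$ (the 0.01 spacing is an arbitrary illustrative choice) and compares the best candidate with `np.median(data)`.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Candidate values of k from the minimum to the maximum, in steps of 0.01
candidates = np.linspace(data.min(), data.max(), 701)

# Sum of absolute differences D(k) for every candidate (one column per k)
absolute_loss = np.abs(data[:, None] - candidates[None, :]).sum(axis=0)

best_k_absolute = candidates[absolute_loss.argmin()]
print(f"Grid minimizer: {best_k_absolute:.2f}")
print(f"Median:         {np.median(data)}")
```

Because this dataset has an odd number of values, the minimizer is a single point; with an even number of values the loss would be flat between the two middle values.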

# Measures of Spread

Another natural question is about the **spread** of the data. In other words, how wide a range of values do you have? And how close or far are they from the "middle"?

## Min, Max, and Range

The minimum and maximum values of a dataset tell you the full extent of the values of your dataset. The range of the dataset is the difference between those two values.

In [None]:
print(f"Min: {data.min()}")
print(f"Max: {data.max()}")
print(f"Range: {data.max() - data.min()}")

## Percentiles and IQR

You can also calculate values at various **percentiles** to understand the spread. An "Nth percentile" is a value below which roughly N% of the values fall. The 25th and 75th percentiles are commonly used to describe spread, and the **interquartile range (IQR)** is the difference between these two values.

> See [the docs](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html) for more specifics about how percentiles are calculated, which is surprisingly tricky.

In [None]:
print(f"25th Percentile: {np.percentile(data, 25)}")
print(f"75th Percentile: {np.percentile(data, 75)}")
print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25)}")
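To make the calculation less mysterious, here is a sketch of the linear-interpolation rule that `np.percentile` uses by default: sort the values, find the fractional position `(n - 1) * q / 100`, and interpolate between the two neighboring values. The helper name `linear_percentile` is hypothetical and exists only for this illustration.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])


def linear_percentile(values, q):
    """Percentile by linear interpolation between the two nearest ranks."""
    ordered = np.sort(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted values
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


for q in (25, 50, 75):
    print(f"{q}th percentile: {linear_percentile(data, q)} (NumPy: {np.percentile(data, q)})")
```

Other interpolation rules can give different answers on small datasets, which is part of why percentiles are trickier than they look.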

## Standard Deviation

The **standard deviation** is a measure of how far away values typically are from the mean. The population version, which is what NumPy computes by default, is calculated as:

$$\sqrt{\frac{\Sigma(x_i - \bar{x})^2}{n}}$$

In [None]:
print(f"Standard Deviation: {data.std()}")

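As with percentiles, there is more than one way to calculate the standard deviation. NumPy's `.std()` divides by $n$ by default (`ddof=0`), while the `pandas` `.std()` method divides by $n - 1$ by default (`ddof=1`), so the value below differs slightly from the one above.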

In [None]:
data_df = pd.DataFrame(data, columns=["Pairs of Shoes"])

In [None]:
data_df.std()
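The difference between the two results comes down to the `ddof` ("delta degrees of freedom") argument. The sketch below computes both versions side by side; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# NumPy default: divide the sum of squared deviations by n (ddof=0)
population_std = data.std()

# Divide by n - 1 instead (ddof=1), often called the sample standard deviation
sample_std = data.std(ddof=1)

# pandas uses ddof=1 by default, so it matches the sample version
pandas_std = pd.Series(data).std()

print(f"NumPy, ddof=0:  {population_std:.4f}")
print(f"NumPy, ddof=1:  {sample_std:.4f}")
print(f"pandas default: {pandas_std:.4f}")
```

Passing `ddof` explicitly is one way to make sure two libraries are computing the same quantity.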

**Discussion**: If somebody asked you "How much do people differ in the number of pairs of shoes they have?", how would you answer (based on these data)?

# df.describe()

You can get a whole set of descriptive statistics from any `pandas` DataFrame using the `.describe()` method. This should be one of the first things you do when exploring a new dataset.

In [None]:
data_df.describe()

# Visual Description

A picture is worth a thousand words - or numbers! Here we will show how to use histograms and box-and-whisker plots to describe your data.

## Histograms

One natural way of starting to understand a dataset is to construct a **histogram**, which is a bar chart showing the counts of the different values in the dataset.

There will usually be many distinct values in your dataset, and you will need to decide how many **bins** to use in the histogram. The bins define the ranges of values captured in each bar in your chart. 

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=14)
plt.title('Counts, 14 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=10)
plt.title('Counts, 10 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=5)
plt.title('Counts, 5 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=8)
plt.title('Counts, 8 Bins')
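Because the shoe counts are whole numbers, one option (a sketch, not the only reasonable choice) is to pass explicit bin edges so that each bin is centered on one integer. The example uses `np.histogram`, which computes the same counts that `ax.hist` would draw; the name `bin_edges` is illustrative.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Edges at 0.5, 1.5, ..., 8.5 give one bin per integer from 1 to 8
bin_edges = np.arange(data.min(), data.max() + 2) - 0.5

bin_counts, _ = np.histogram(data, bins=bin_edges)
for left_edge, count in zip(bin_edges[:-1], bin_counts):
    print(f"value {int(left_edge + 0.5)}: {count}")
```

The same edges can be passed to matplotlib with `ax.hist(data, bins=bin_edges)`, which avoids bars that lump neighboring integers together or leave misleading gaps.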

## Box and Whisker - Box Plot

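A box-and-whisker plot can also be useful for visually summarizing your data. The box spans the IQR (25th to 75th percentile) with a line at the median, and the whiskers show how far the rest of the data extends. By default, `matplotlib` draws each whisker out to the most extreme value within 1.5 × IQR of the box and plots anything beyond that as an individual outlier point.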

In [None]:
fig, ax = plt.subplots()
ax.boxplot(data)
plt.title('Counts of Pairs of Shoes')
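The numbers behind the plot can be reproduced by hand. This sketch assumes matplotlib's default whisker rule (`whis=1.5`), under which whiskers reach the most extreme data points within 1.5 × IQR of the box; the variable names are illustrative.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences mark the farthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences would be drawn as an outlier point
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Box: {q1} to {q3}, median {median}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

For this dataset nothing falls outside the fences, so the whiskers coincide with the min and max.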

# Other Shape Descriptors

Here are a few other ways that people describe the distributions of data.

## Skew

**Skewness**, also known as **skew**, is a measure of **asymmetry**. Specifically, it measures how much your distribution is weighted toward one side. 

**Right skew**, also known as **positive skew**, means that your distribution has more extreme values on the high end than on the low end.

**Left skew**, also known as **negative skew**, means that your distribution has more extreme values on the low end than on the high end.

![skew](https://upload.wikimedia.org/wikipedia/commons/c/cc/Relationship_between_mean_and_median_under_different_skewness.png)
Diva Jain, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

In [None]:
data_df.skew()

Skewness of 0 represents a distribution with equal weight on both sides. There are multiple ways to calculate skew, but values between -1 and 1 are usually considered low.
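As an illustration of those multiple ways, SciPy's `stats.skew` returns the uncorrected estimate by default, while `bias=False` applies a sample-size correction. To the best of our understanding, the corrected version is the one `pandas` reports from `.skew()`. The sketch below compares the three.

```python
import numpy as np
import pandas as pd
from scipy import stats

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# SciPy default (bias=True): no correction for sample size
biased_skew = stats.skew(data)

# bias=False applies a small-sample correction
corrected_skew = stats.skew(data, bias=False)

# pandas result, for comparison with the corrected SciPy value
pandas_skew = pd.Series(data).skew()

print(f"SciPy, bias=True:  {biased_skew:.4f}")
print(f"SciPy, bias=False: {corrected_skew:.4f}")
print(f"pandas:            {pandas_skew:.4f}")
```

On a dataset this small the two estimates differ noticeably; the gap shrinks as the number of observations grows.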

## Kurtosis

**Kurtosis** is a measure of how heavy the tails are. It is sometimes also described in terms of how pointy the peak is, but tail weight is the more reliable way to interpret it.

Usually kurtosis is calculated in reference to a **normal distribution**, which we'll learn more about later in the course. We usually calculate **excess kurtosis** to see if a distribution has more or less kurtosis than a normal distribution (in dots below). 

![kurt](images/positive-kurtosis.jpg)
[Simply Psychology](https://www.simplypsychology.org/kurtosis.html)

In [None]:
data_df.kurt()

Excess kurtosis of 0 means that the tails have similar weight to the tails of a normal distribution. Positive values mean heavier tails than normal, negative values mean lighter tails than normal.

### Statistical Moments

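The mean, variance (the square of the standard deviation), skewness, and kurtosis are all built from **moments** in statistics: the mean is the first moment, the variance is the second central moment, and skewness and kurtosis are the standardized third and fourth moments. If you're interested, you can learn more about these [here](https://www.statisticshowto.datasciencecentral.com/what-is-a-moment/) or [here](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics).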
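To make the connection to moments concrete, the sketch below computes central moments directly and derives variance, skewness, and excess kurtosis from them. The helper `central_moment` is a hypothetical name used only here, and the results are the uncorrected (population-style) versions, which is what SciPy returns by default.

```python
import numpy as np
from scipy import stats

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])


def central_moment(values, order):
    """Average of the deviations from the mean raised to the given power."""
    return np.mean((values - values.mean()) ** order)


variance = central_moment(data, 2)
skewness = central_moment(data, 3) / variance ** 1.5
excess_kurtosis = central_moment(data, 4) / variance ** 2 - 3  # 3 is the normal's kurtosis

print(f"Variance:        {variance:.4f} (NumPy: {data.var():.4f})")
print(f"Skewness:        {skewness:.4f} (SciPy: {stats.skew(data):.4f})")
print(f"Excess kurtosis: {excess_kurtosis:.4f} (SciPy: {stats.kurtosis(data):.4f})")
```

These values will not match `data_df.skew()` and `data_df.kurt()` exactly, because `pandas` applies sample-size corrections.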

## Modality

Distributions can have different **modality**. This relates to whether the distribution has distinct peaks and, if so, how many peaks there are and how strong they are.

In [None]:
X = np.linspace(0, 1, 40)
y = stats.uniform.pdf(X) + 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
plt.ylim(0.5, 1.5)
plt.title('Flat Distribution');

In [None]:
X = np.linspace(-5, 5, 40)
y = stats.norm.pdf(X, loc=0) \
+ 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
plt.title('Unimodal Distribution')

In [None]:
X = np.linspace(-5, 5, 40)
y = stats.norm.pdf(X, loc=-2) + stats.norm.pdf(X, loc=2)\
+ 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
plt.title('Bimodal Distribution')
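One rough way to count peaks programmatically is `scipy.signal.find_peaks`, which returns the indices of local maxima. The sketch below applies it to noise-free versions of the unimodal and bimodal curves above. This is an illustrative check rather than a general modality test: on noisy data you would likely need smoothing or the `prominence` argument to avoid counting small wiggles.

```python
import numpy as np
from scipy import stats
from scipy.signal import find_peaks

X = np.linspace(-5, 5, 401)

# Noise-free versions of the curves plotted above
unimodal = stats.norm.pdf(X, loc=0)
bimodal = stats.norm.pdf(X, loc=-2) + stats.norm.pdf(X, loc=2)

# find_peaks returns the indices of local maxima (plus a properties dict)
unimodal_peaks, _ = find_peaks(unimodal)
bimodal_peaks, _ = find_peaks(bimodal)

print(f"Unimodal peaks at x = {X[unimodal_peaks]}")
print(f"Bimodal peaks at x = {X[bimodal_peaks]}")
```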

## Rock Music Data

Let's see what stats or graphs we can pull out of this dataset about rock songs.

In [None]:
songs = pd.read_csv('classic-rock-song-list.csv')
songs.head()

**Activity**: Describe the `PlayCount` variable using...

- `.describe()`
- A histogram

Summarize the distribution of the data in 1-2 sentences.

<details>
    <summary>
        Answer Code
    </summary>
    
    songs['PlayCount'].describe()
    
    songs['PlayCount'].hist()
    
</details>

In [None]:
## Your Code Here

# Level Up: Rock Artist Analysis

We might also try grouping by artist and analyzing the counts of their songs in the dataset.

In [None]:
artist_song_counts = songs.groupby('ARTIST CLEAN')\
                          .count()['Song Clean']\
                          .sort_values(ascending=False)

artist_song_counts

In [None]:
fig, ax = plt.subplots(figsize=(20, 8))
ax.bar(artist_song_counts[:10].index, artist_song_counts[:10].values)

In [None]:
artist_song_counts.skew()

# Level Up: Iris Plots

In [None]:
iris_data = load_iris()

In [None]:
X = iris_data.data
y = iris_data.target

In [None]:
# Documentation for this dataset

# print(iris_data.DESCR)

In [None]:
iris_df = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]),
                  columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'spec'])

In [None]:
iris_df

In [None]:
iris_df['spec'] = iris_df['spec'].astype(int)

In [None]:
cypher = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
iris_df['spec'] = iris_df['spec'].map(cypher)

## Histograms Across Groups

Let's create histograms for each of our flower groups.

In [None]:
iris_df[iris_df['spec'] == 'setosa']['sepal_wid'].hist(bins=20)

In [None]:
iris_df[iris_df['spec'] == 'versicolor']['sepal_wid'].hist(bins=20)

In [None]:
iris_df[iris_df['spec'] == 'virginica']['sepal_wid'].hist(bins=20)

What if we want to visualize the three histograms together?

In [None]:
# WARNING: This is not a good visualization

iris_df.groupby('spec')['sepal_wid'].hist(bins=20)
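One possible improvement, sketched below, is to draw all three groups on the same axes with shared bin edges, some transparency, and a legend. The example is self-contained and assumes a scikit-learn version that supports `load_iris(as_frame=True)`, so its column names differ from the `iris_df` built above.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

# as_frame=True returns the data as a pandas DataFrame in the .frame attribute
iris = load_iris(as_frame=True)
frame = iris.frame

# Map the integer targets to species names
species = frame['target'].map(dict(enumerate(iris.target_names)))
widths = frame['sepal width (cm)']

# Shared edges keep the bars of the three groups aligned and comparable
bin_edges = np.histogram_bin_edges(widths, bins=20)

fig, ax = plt.subplots(figsize=(8, 5))
for name, group in widths.groupby(species):
    ax.hist(group, bins=bin_edges, alpha=0.5, label=name)

ax.set_xlabel('Sepal width (cm)')
ax.set_ylabel('Count')
ax.set_title('Sepal Width by Species')
ax.legend()
```

Overlapping bars can still be hard to read, which is part of the motivation for the categorical plots in the next section.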

## Categorical Plots

### Swarm Plot

In [None]:
iris_df.head()

In [None]:
ax=sns.violinplot(x='sepal_wid', y='spec', data=iris_df)
sns.swarmplot(y="spec", x="sepal_wid", data=iris_df, ax=ax, color='black', order=iris_df.spec.unique());

### Violin Plot

In [None]:
sns.catplot(x='sepal_wid', y='spec',
            kind='violin', data=iris_df);

In [None]:
g = sns.catplot(x='sepal_wid', y='spec',
            kind='violin', data=iris_df, order=iris_df.spec.unique());
g.map(sns.swarmplot, "sepal_wid", "spec",color='black', order=iris_df.spec.unique())