# Descriptive Statistics

When you are trying to understand a dataset, looking at the raw values alone rarely yields much insight. We need ways to condense the data into a small set of numbers that summarize it in an easily digestible form, both for you and for the people you work with. We call these summaries **descriptive statistics**.

# Objectives

- Use measures of center and spread to describe data
- Use histograms and box-and-whisker plots to describe data

In [None]:
from scipy import stats
from sklearn.datasets import make_blobs, make_regression, load_iris
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline

Each of our data table's columns has a bunch of values. We might have a set of body temperatures or house prices or birth rates or frog leg lengths. How, in general, can we characterize such a set of numbers?

## Sample Data

Let's build a simple dataset, based on a hypothetical survey of the number of pairs of shoes owned by 11 random people:

In [None]:
data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

This dataset has a bunch of individual observations in a range of values. These observations have an **empirical distribution** describing how the values are distributed across this range. We'll shorten this to just **distribution** for now. Everything that follows is our attempt to understand the distribution of our data.

# Measures of Center

One natural place to begin is to ask about where the **middle** of the data is. In other words, what is the value that is closest to our other values? 

There are three common measures used to describe the "middle":

- **Mean**: The sum of values / number of values
- **Median**: The value with as many values above it as below it
    - If the dataset has an even number of values, the median is the mean of the two middle numbers.
- **Mode**: The most frequent value(s)
    - A dataset can have multiple modes if multiple values are tied for the most frequent.

Let's see what we have for our example:

In [None]:
print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
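# Older SciPy returns the mode as an array, newer SciPy as a scalar;
# np.atleast_1d lets the same line work with either.
print(f"Mode: {np.atleast_1d(stats.mode(data).mode)[0]}")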

In [None]:
## You can also find the mode(s) using np.unique()
counts = np.unique(data, return_counts=True)
counts
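
The `np.unique()` output above still has to be turned into the mode(s). A minimal sketch of one way to do that, keeping every value tied for the highest count (the variable names `values` and `modes` are just illustrative):

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# np.unique returns the sorted distinct values and how often each one occurs
values, counts = np.unique(data, return_counts=True)

# Keep every value whose count ties the maximum count
modes = values[counts == counts.max()]
print(f"Mode(s): {modes}, each appearing {counts.max()} times")
```

Unlike `stats.mode()`, which reports a single value, this approach returns all tied modes if there is more than one.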

**Discussion**: If somebody asked you "How many pairs of shoes do people usually have?", how would you answer (based on these data)?

## Mathematical Properties

The mean $\bar{x}$ is the point that minimizes the *sum of squared differences* for a given set of data.

<details>
    <summary>
        Proof
    </summary>
    We want to find the point $k$ that minimizes $L(k) = \Sigma^n_{i=1}(x_i-k)^2$. Now, a calculus trick, which we'll see again: To find the minimum of a function, we'll set its derivative to 0. Taking the derivative, we have:

$L'(k) = -2\Sigma^n_{i=1}(x_i-k)$.

Now we solve $L'(k) = 0$ for $k$:

$-2\Sigma^n_{i=1}(x_i-k) = 0$, so <br/><br/>
$\Sigma^n_{i=1}(x_i-k) = 0$, so <br/><br/>
$\Sigma^n_{i=1}x_i = \Sigma^n_{i=1}k = nk$, so <br/><br/>
$k = \frac{\Sigma^n_{i=1}x_i}{n} = \bar{x}$.
    </details>
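
As a rough numerical check of this property (not a substitute for the proof), we can evaluate the sum of squared differences on a grid of candidate points and see where it is smallest. The grid resolution and the helper name `sum_squared_diffs` are arbitrary choices for this sketch:

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

def sum_squared_diffs(k):
    """L(k): sum of squared differences between the data and the point k."""
    return np.sum((data - k) ** 2)

# Evaluate L(k) on a fine grid between the min and the max (step 0.001)
candidates = np.linspace(data.min(), data.max(), 7001)
losses = np.array([sum_squared_diffs(k) for k in candidates])
best_k = candidates[losses.argmin()]

print(f"Grid minimizer: {best_k:.3f}")
print(f"Mean:           {data.mean():.3f}")
```

The grid minimizer should agree with the mean up to the grid spacing.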


By contrast, the median is the point that minimizes the *sum of absolute differences*.

<details>
    <summary>
    Proof
    </summary>
    We want to find the point $k$ that minimizes $D(k) = \Sigma^n_{i=1}|x_i-k|$. Taking the derivative, we have:

$D'(k) = \Sigma^n_{i=1}\frac{k-x_i}{|k-x_i|}$.

Now we solve $D'(k) = 0$ for $k$:

Consider the equation $\Sigma^n_{i=1}\frac{k-x_i}{|k-x_i|} = 0$. Ignoring the case where $k = x_i$ (where the derivative is undefined), each of the addends in this sum is $1$ if $k > x_i$ and $-1$ if $k < x_i$. To make this sum equal to 0, we thus want to choose $k$ such that there are the same number of $1$s and $-1$s, which means that we want to choose $k$ to be the middle number, i.e. the median.

Notes:
- if $n$ is odd, then the minimum of the function occurs not where its derivative is 0 but where it is *undefined*;
- if $n$ is even, then *any* number between the two middle numbers will minimize our function; by convention we take the median to be the mean of those two numbers.
    </details>
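
Here is a small sketch that checks both cases numerically on our shoe data. The even-sized sample is made by dropping the last observation, which is only a convenience for this illustration:

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

def sum_abs_diffs(values, k):
    """D(k): sum of absolute differences between the values and the point k."""
    return np.sum(np.abs(values - k))

# Odd n: the median does strictly better than nearby points
median = np.median(data)
for k in (median - 0.5, median, median + 0.5):
    print(f"n=11, k={k}: D(k) = {sum_abs_diffs(data, k)}")

# Even n: every point between the two middle values gives the same minimum
even_data = data[:-1]
low_mid, high_mid = np.sort(even_data)[[4, 5]]
for k in (low_mid, np.median(even_data), high_mid):
    print(f"n=10, k={k}: D(k) = {sum_abs_diffs(even_data, k)}")
```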

# Measures of Spread

Another natural question is about the **spread** of the data. In other words, how wide a range of values do you have? And how close or far are they from the "middle"?

## Min, Max, and Range

The minimum and maximum values of a dataset tell you the full extent of its values. The range of the dataset is the difference between those two values.

In [None]:
print(f"Min: {data.min()}")
print(f"Max: {data.max()}")
print(f"Range: {data.max() - data.min()}")

## Percentiles and IQR

You can also calculate values at various **percentiles** to understand the spread. An "Nth Percentile" value is a value below which roughly N% of the observations fall. The 25th and 75th percentiles are commonly used to describe spread, and the **interquartile range (IQR)** is the difference between these two values.

See [the docs](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html) for more specifics about how percentiles are calculated, which is surprisingly tricky.
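
To make the "tricky" part concrete, here is a sketch of the linear-interpolation rule that `np.percentile` uses by default: sort the data, find the fractional position of the percentile, and interpolate between the two neighboring values. The helper `percentile_linear` is a hypothetical name written for this illustration, not part of NumPy:

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

def percentile_linear(values, q):
    """q-th percentile via linear interpolation between sorted neighbors."""
    ordered = np.sort(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into sorted data
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

for q in (25, 50, 75):
    print(f"{q}th percentile: by hand = {percentile_linear(data, q)}, "
          f"numpy = {np.percentile(data, q)}")
```

Other conventions (for example, always picking an actual data point) can give different answers on small samples, which is why the docs list several methods.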

In [None]:
print(f"25th Percentile: {np.percentile(data, 25)}")
print(f"75th Percentile: {np.percentile(data, 75)}")
print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25)}")

## Standard Deviation

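The standard deviation measures how far values typically lie from the mean. It is calculated as $\sqrt{\frac{\Sigma(x_i - \bar{x})^2}{n}}$. This version divides by $n$ (the population standard deviation), which is what NumPy's `.std()` computes by default.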

In [None]:
print(f"Standard Deviation: {data.std()}")
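
To connect the formula to the code, here is a short sketch that computes the standard deviation step by step and compares it with `data.std()`. The intermediate variable names are just for illustration:

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Step 1: how far is each observation from the mean?
deviations = data - data.mean()

# Step 2: average the squared deviations (this is the variance)
variance = np.sum(deviations ** 2) / len(data)

# Step 3: take the square root to get back to the original units
std_by_hand = np.sqrt(variance)

print(f"By hand:    {std_by_hand}")
print(f"data.std(): {data.std()}")
```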

**Discussion**: If somebody asked you "How much do people differ in the number of pairs of shoes they have?", how would you answer (based on these data)?

# df.describe

You can get a whole set of descriptive statistics from any `pandas` DataFrame using the `.describe()` method. This should be one of the first things you do when exploring a new dataset.

In [None]:
pd.DataFrame(data, columns=["Pairs of Shoes"]).describe()
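
One detail worth knowing: the `std` row reported by `.describe()` will not match `data.std()` from NumPy. pandas divides by $n - 1$ (the sample standard deviation) by default, while NumPy divides by $n$. The sketch below checks this, and also that the quartile rows agree with `np.percentile`:

```python
import numpy as np
import pandas as pd

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

summary = pd.DataFrame(data, columns=["Pairs of Shoes"]).describe()
describe_std = summary.loc["std", "Pairs of Shoes"]

print(f"describe() std:        {describe_std}")
print(f"NumPy std (ddof=0):    {np.std(data)}")
print(f"NumPy std (ddof=1):    {np.std(data, ddof=1)}")
print(f"describe() 25% / 75%:  {summary.loc['25%', 'Pairs of Shoes']} / "
      f"{summary.loc['75%', 'Pairs of Shoes']}")
```

With only 11 observations the two versions of the standard deviation differ noticeably; on large datasets the difference becomes negligible.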

# Visual Description

A picture is worth a thousand words - or numbers! Here we will show how to use histograms and box-and-whisker plots to describe your data.

## Histograms

One natural way of starting to understand a dataset is to construct a **histogram**, which is a bar chart showing the counts of the different values in the dataset.

There will usually be many distinct values in your dataset, and you will need to decide how many **bins** to use in the histogram. The bins define the ranges of values captured in each bar in your chart. 
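
To see exactly what a choice of bins does, it can help to look at the counts and bin edges directly with `np.histogram`, which does the binning without drawing anything. This sketch compares 5 equal-width bins with bins centered on each integer; the variable names are illustrative:

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Five equal-width bins between the min and the max
counts_5, edges_5 = np.histogram(data, bins=5)
print("5 bins")
print("  edges: ", edges_5)
print("  counts:", counts_5)

# One bin per integer value: edges at 0.5, 1.5, ..., 8.5
integer_edges = np.arange(0.5, 9.0, 1.0)
counts_int, _ = np.histogram(data, bins=integer_edges)
print("integer-centered bins")
print("  edges: ", integer_edges)
print("  counts:", counts_int)
```

Because our data are small whole numbers, the integer-centered bins show one bar per distinct value, while other bin counts merge or split values in ways that can change the apparent shape.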

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=14)
plt.title('Counts, 14 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=10)
plt.title('Counts, 10 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=5)
plt.title('Counts, 5 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=8)
plt.title('Counts, 8 Bins')

## Box and Whisker

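A box-and-whisker plot can also be useful for visually summarizing your data. The box spans the IQR (25th to 75th percentile) with a line at the median. By default, matplotlib draws each whisker out to the most extreme data point within 1.5 × IQR of the box and plots anything beyond that as an individual outlier point, so for data without outliers the whiskers end at the min and max.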

In [None]:
fig, ax = plt.subplots()
ax.boxplot(data)
plt.title('Counts of Pairs of Shoes')
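
The numbers behind the box plot can be computed directly. This is a sketch assuming matplotlib's default whisker rule (1.5 × IQR beyond the box); the names `lower_fence` and `upper_fence` are just labels for this illustration:

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences would be drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Box: {q1} to {q3}, median line at {median}")
print(f"Whiskers: {inside.min()} to {inside.max()}")
print(f"Outliers: {outliers}")
```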

# Other Shape Descriptors

Here are a few other ways that people describe the distributions of data.

## Moments

The mean is related to $\Sigma x_i$, while the standard deviation is related to $\Sigma(x_i - \bar{x})^2$. We could consider higher exponents as well, of the form $\Sigma(x_i - \bar{x})^k$. For each integer exponent $k>0$, we can define a related statistical **moment**. For $k=3$, the (standardized) moment is called the **skewness**, which measures how asymmetric the distribution is; in skewed data the mean is typically pulled away from the median toward the longer tail. For $k=4$, the (standardized) moment is called the **kurtosis**, which is a measure of how heavy the tails are, i.e. how many values are relatively far from the mean.

There are a few different definitions of skewness and kurtosis that are commonly used, but the basic quantities are:

- $\frac{\Sigma(x_i - \bar{x})^3}{n\sigma^3}$ (for skewness)
- $\frac{\Sigma(x_i - \bar{x})^4}{n\sigma^4}$ (for kurtosis)
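
As a sketch, these two quantities can be computed directly from the formulas (with $\sigma$ the population standard deviation) and compared with SciPy. Note that `stats.kurtosis` reports *excess* kurtosis by default, i.e. the value above minus 3; passing `fisher=False` returns the basic quantity:

```python
import numpy as np
from scipy import stats

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

n = len(data)
deviations = data - data.mean()
sigma = data.std()  # population standard deviation (divides by n)

skewness = np.sum(deviations ** 3) / (n * sigma ** 3)
kurtosis = np.sum(deviations ** 4) / (n * sigma ** 4)

print(f"Skewness by hand: {skewness}")
print(f"stats.skew:       {stats.skew(data)}")
print(f"Kurtosis by hand: {kurtosis}")
print(f"stats.kurtosis (fisher=False): {stats.kurtosis(data, fisher=False)}")
print(f"Excess kurtosis (by hand - 3): {kurtosis - 3}")
```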

For more on statistical moments, see [here](https://www.statisticshowto.datasciencecentral.com/what-is-a-moment/) and [here](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics).

### Skewness
![skew](images/skew.png)

In [None]:
stats.skew(data)

### Kurtosis
![kurt](images/kurtosis.png)

In [None]:
# Excess kurtosis

stats.kurtosis(data)

## Symmetry

In [None]:
X = np.linspace(-3, 3, 40)
y = stats.norm.pdf(X) + 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
ax.vlines(x=0, ymin=0, ymax=0.5, colors = "black")
plt.title('Symmetric Distribution');

In [None]:
X = np.linspace(0, 1, 40)
y = stats.expon.pdf(X) + 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
plt.title('Asymmetric Distribution');

## Modality

In [None]:
X = np.linspace(0, 1, 40)
y = stats.uniform.pdf(X) + 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
plt.ylim(0.5, 1.5)
plt.title('Flat Distribution');

In [None]:
X = np.linspace(-5, 5, 40)
y = stats.norm.pdf(X, loc=-2) + stats.norm.pdf(X, loc=2)\
+ 0.05 * np.random.rand(40)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(X, y, lw=5)
plt.title('Bimodal Distribution');

# Blob Example

Let's generate a fake dataset with two variables to practice describing data. To do this, we'll use the `make_blobs()` function from sklearn, which you'll learn more about later. 

In [None]:
X, c = make_blobs(random_state=42)

In [None]:
x1, x2 = X[:, 0], X[:, 1]

In [None]:
fig, ax = plt.subplots()
ax.scatter(x1, x2, c=c)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Blob Distribution');

In [None]:
# Let's describe x1 and x2 in statistical terms!

print(f"The maximum of x1 is {x1.max()}.")
print(f"The average of x1 is {x1.mean()}.")
print(f"The minimum of x1 is {x1.min()}.")
print(f"The standard deviation of x1 is {x1.std()}.")
print(f"The interquartile range of x1 is \
{np.percentile(x1, q=75) - np.percentile(x1, q=25)}.")

In [None]:
print(f"The maximum of x2 is {x2.max()}.")
print(f"The average of x2 is {x2.mean()}.")
print(f"The minimum of x2 is {x2.min()}.")
print(f"The standard deviation of x2 is {x2.std()}.")
print(f"The interquartile range of x2 is \
{np.percentile(x2, q=75) - np.percentile(x2, q=25)}.")

Let's use the `.describe()` method to better understand our dataset.

In [None]:
df = pd.DataFrame(np.concatenate([X, c.reshape(-1, 1)], axis=1),
                  columns=['x1', 'x2', 'y'])
df.head()

In [None]:
df['y'] = df['y'].astype(int)

In [None]:
df.describe()

In [None]:
fig, ax = plt.subplots()
ax.hist(x1);

In [None]:
fig, ax = plt.subplots()
ax.hist(x2);

We can plot these side by side, but notice that the axes are different.

In [None]:
fig, ax = plt.subplots(1, 2)
ax[0].hist(x1)
ax[1].hist(x2)

# Rock Music Data

Let's see what stats or graphs we can pull out of this dataset about rock songs.

In [None]:
songs = pd.read_csv('classic-rock-song-list.csv')
songs.head()

**Activity**: Describe the `PlayCount` variable using...

- `.describe()`
- A histogram

Summarize the distribution of the data in 1-2 sentences.

<details>
    <summary>
        Answer Code
    </summary>
    
    songs['PlayCount'].describe()
    
    songs['PlayCount'].hist()
    
</details>

In [None]:
## Your Code Here

## Rock Artist Analysis

We might also try grouping by artist to count how many songs each artist has in the list.

In [None]:
# Count the songs per artist, sorted from most to fewest
nums_sorted = (
    songs.groupby('ARTIST CLEAN')['Song Clean']
    .count()
    .sort_values(ascending=False)
)
nums_sorted
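
For a single column, `value_counts()` is a common shortcut for this kind of group-and-count. The sketch below uses a tiny made-up table (the artist and song names are placeholders, not rows from the real CSV) with the same column names, and checks that both approaches give the same counts when there are no missing values:

```python
import pandas as pd

# Hypothetical toy rows, only to illustrate the two approaches
toy = pd.DataFrame({
    'ARTIST CLEAN': ['Artist A', 'Artist B', 'Artist A', 'Artist C', 'Artist A', 'Artist B'],
    'Song Clean': ['Song 1', 'Song 2', 'Song 3', 'Song 4', 'Song 5', 'Song 6'],
})

# Group by artist, then count the songs in each group
by_group = (
    toy.groupby('ARTIST CLEAN')['Song Clean']
    .count()
    .sort_values(ascending=False)
)

# value_counts() counts rows per artist and sorts in descending order
by_value_counts = toy['ARTIST CLEAN'].value_counts()

print(by_group)
print(by_value_counts)
```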

In [None]:
fig, ax = plt.subplots(figsize=(20, 8))
ax.scatter(nums_sorted[:10].index, nums_sorted[:10]);

In [None]:
nums_sorted.skew()

In [None]:
nums_sorted.kurt()

# Seaborn Iris Example

In [None]:
data = load_iris()

In [None]:
X = data.data
y = data.target

In [None]:
# print(data.DESCR)

In [None]:
df = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]),
                  columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'spec'])

In [None]:
df

In [None]:
df['spec'] = df['spec'].astype(int)

In [None]:
cypher = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
df['spec'] = df['spec'].map(cypher)

## Categorical Plots

### Swarm Plot

In [None]:
sns.catplot(x="spec", y="sepal_wid",
            kind='swarm', data=df);

### Violin Plot

In [None]:
sns.catplot(x='sepal_wid', y='spec',
            kind='violin', data=df);