# Measures of Spread in Pandas and NumPy
Common measures of spread include quantiles, standard deviation, and variance. They describe how much the values in your data vary. We can call `.quantile()`, `.std()`, and `.var()` on a DataFrame, Series, or GroupBy object to calculate these measurements.

In this notebook we use the `census_income_data.csv` dataset and these methods to answer a few questions. We'll use the `.groupby()` method again to compare groups.

In [None]:
import pandas as pd

In [None]:
# Load the dataset
df_census = pd.read_csv('census_income_data.csv')

## Using quantile on a DataFrame

#### What are the 25%, 50%, 75%, 90%, and 95% quantiles of capital gains and losses in our dataset?
Let's use the `.quantile()` method on our DataFrame to compute these quantiles across the whole dataset.

Wow! We had to get to the 95% quantile to show any `capital-gain`. This means at least 90% of the records in our dataset report 0 `capital-gain` and 0 `capital-loss`, which suggests that most people did not realize any gain or loss from selling assets.

In [None]:
df_census[["capital-gain", "capital-loss"]].quantile(q=[0.25, 0.5, 0.75, 0.9, 0.95])
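To see why the lower quantiles can all be 0, here is a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical, not taken from the census file). When most values in a column are 0, the quantiles stay at 0 until they reach the small share of non-zero records.

```python
import pandas as pd

# Hypothetical toy data: 21 records, only 2 of them with a non-zero capital gain
toy = pd.DataFrame({"capital-gain": [0] * 19 + [5000, 12000]})

# Quantiles stay at 0 until we reach the few non-zero records
quantiles = toy["capital-gain"].quantile(q=[0.5, 0.9, 0.95])
print(quantiles)

# Share of records with any capital gain at all
share_nonzero = (toy["capital-gain"] > 0).mean()
print(f"Share of non-zero records: {share_nonzero:.1%}")
```

Checking the share of non-zero values directly is a quick way to confirm what the quantiles suggest.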

## Using Quantile, Standard Deviation, and Variance on a Group

#### What are the different workclass types?

In [None]:
df_census["workclass"].value_counts()

#### If we group by 'workclass', what are some interesting questions and answers?
We'll use `.quantile()`, `.std()`, and `.var()` to see how each metric tells a different story for each 'workclass'. For quantile, since we already know that most of the capital gain and loss values are 0, we will only look at the 50%, 90%, and 95% quantiles. Let's focus only on "age", "capital-gain", "capital-loss", and "hours-per-week".

We finally see some differences between work classes for `capital-gain`. Notably, `Self-emp-inc`, `Self-emp-not-inc`, and `Without-pay` have a larger share of their group reporting gains: `capital-gain` is already above 0 at the 90% quantile, while the other groups do not show a non-zero value until the 95% quantile.

What can you say about the `age` and `hours-per-week` quantiles for the different groups?

In [None]:
df_census.groupby(by="workclass")[["age", "capital-gain", "capital-loss", "hours-per-week"]].quantile(q=[0.5, 0.9, 0.95])
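Passing a list of quantiles to a grouped object returns a result indexed by both group and quantile, which can be hard to scan. The sketch below, on hypothetical toy data, uses `.unstack()` to move the quantile level into columns so that each group sits on one row.

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({
    "workclass": ["Private"] * 5 + ["Self-emp-inc"] * 5,
    "hours-per-week": [35, 40, 40, 40, 45, 30, 40, 50, 60, 70],
})

# The result has a two-level index: (workclass, quantile)
grouped_q = toy.groupby(by="workclass")["hours-per-week"].quantile(q=[0.5, 0.9])

# Move the quantile level into columns: one row per group
wide = grouped_q.unstack()
print(wide)
```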

Standard deviation is useful when paired with the mean, as it tells you how widely your data is spread around the mean.

For "age", "capital-gain", "capital-loss", and "hours-per-week", you can see that `Self-emp-inc` and `Self-emp-not-inc` are widely spread across all four measurements.
An interesting observation is that the mean `hours-per-week` is similar for all three government groups, but their standard deviations are different. Federal positions have the lowest standard deviation of the three. How would you interpret that?

In [None]:
df_census.groupby(by="workclass")[["age", "capital-gain", "capital-loss", "hours-per-week"]].mean()

In [None]:
df_census.groupby(by="workclass")[["age", "capital-gain", "capital-loss", "hours-per-week"]].std()
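To compare mean and standard deviation side by side, one option is `.agg()` with both function names. This is a minimal sketch on hypothetical toy values (not the census data) in which two groups share the same mean but differ in spread.

```python
import pandas as pd

# Hypothetical toy data: both groups average 40 hours, with different spread
toy = pd.DataFrame({
    "workclass": ["Federal-gov"] * 4 + ["State-gov"] * 4,
    "hours-per-week": [38, 40, 40, 42, 20, 40, 40, 60],
})

# Compute mean and standard deviation in a single table
summary = toy.groupby(by="workclass")["hours-per-week"].agg(["mean", "std"])
print(summary)
```

A smaller standard deviation with the same mean indicates that the values in that group cluster more tightly around the mean.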

Because variance is the square of the standard deviation, values far from the mean contribute heavily to it, so it can help spot distributions with outliers. Look again at the three government groups. It appears the state government group is more widely spread than the other two. Maybe the outliers in the group are more obvious?

Some additional notes about variance:
 - While variance can be used to spot outliers, there are better methods available, such as interquartile range (IQR) or median absolute deviation (MAD).
 - Variance is important in other areas, like calculating confidence intervals, data normalization, testing hypotheses, and assessing the quality of statistical models.

In [None]:
df_census.groupby(by="workclass")[["age", "capital-gain", "capital-loss", "hours-per-week"]].var()
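The notes above can be sketched on a small hypothetical Series (the values are made up for illustration). The code checks that variance equals the squared standard deviation, then computes IQR and MAD by hand. The 1.5 × IQR fences used to flag outliers are a common rule of thumb and an assumption here, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical toy values with one obvious outlier (99)
values = pd.Series([38, 40, 40, 41, 42, 43, 99])

# Variance is the square of the standard deviation
variance = values.var()
std_squared = values.std() ** 2

# Interquartile range: distance between the 75% and 25% quantiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Median absolute deviation: median distance from the median
median = values.median()
mad = (values - median).abs().median()

# Rule-of-thumb outlier fences at 1.5 * IQR beyond the quartiles
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(variance, std_squared, iqr, mad)
print(outliers)
```

Unlike variance, IQR and MAD are built from quantiles, so a single extreme value barely changes them.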

Remember that we can use the `.describe()` method on groups to produce many of these measurements. We can also use the `percentiles` parameter to customize which quantiles we want to target.

In [None]:
df_census.groupby(by="workclass")[["capital-gain", "capital-loss"]].describe(percentiles=[0.5, 0.9, 0.95])
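As a small sketch on hypothetical toy data, the percentile columns produced by `.describe()` are labeled with strings such as `"90%"`, which is how you select them afterwards.

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({
    "workclass": ["Private"] * 5 + ["Self-emp-inc"] * 5,
    "hours-per-week": [35, 40, 40, 40, 45, 30, 40, 50, 60, 70],
})

described = toy.groupby(by="workclass")["hours-per-week"].describe(percentiles=[0.5, 0.9])

# Percentile columns are labeled as strings like "50%" and "90%"
print(described[["mean", "std", "50%", "90%"]])
```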