# Descriptive Stats

Let's
* Load some data.
* Plot to get an overview.
* Compute some descriptive statistics.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
#
# Uncomment the following lines for floating plot windows
# import matplotlib
# matplotlib.use('Qt5Agg')
# %matplotlib qt5

Public data from the annual CDC Behavioral Risk Factor Surveillance System ([BRFSS](https://www.cdc.gov/brfss/index.html)) telephone survey. The full dataset has 200+ columns; only four are included here.

In [None]:
df = pd.read_csv('BRFSS.csv')
df

Let's consider the heights. To get an overview, let's make a histogram. 

In [None]:
df['hght_cm'].hist()

Maybe a few more bins, and let's normalize the histogram to a probability density.

In [None]:
df['hght_cm'].hist(bins=50, density=True)
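
What `density=True` does can be checked numerically. The following is a minimal sketch on synthetic data (the `heights` array is made up for illustration and is not the BRFSS data): the bar heights are rescaled so that the bar areas, not the bar heights, sum to 1.

```python
import numpy as np

# Synthetic heights in cm (illustrative values, not BRFSS data)
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=1000)

# density=True rescales the counts so that the bar areas sum to 1
density, edges = np.histogram(heights, bins=50, density=True)
bin_widths = np.diff(edges)
total_area = float(np.sum(density * bin_widths))
print(total_area)
```

So the values on the vertical axis are densities per cm, not probabilities.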

## Task

From my histogram or a modified one,
* estimate the average.
* estimate the median.
* estimate the standard deviation.

Okay, let's use pandas functions to find all these statistics:

In [None]:
df['hght_cm'].mean()

In [None]:
df['hght_cm'].median()

In [None]:
df['hght_cm'].std()

## Task

Use pandas functions (not hard-coded numbers) to find Pearson's skewness of the heights.
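
The task does not say which of Pearson's coefficients is meant. For reference, the two common textbook definitions are the mode skewness and the median skewness, written here with $\mu$ for the mean and $\sigma$ for the standard deviation:

$\text{Sk}_1=\frac{\mu-\text{mode}}{\sigma}$

$\text{Sk}_2=\frac{3(\mu-\text{median})}{\sigma}$

Both are positive when the distribution has a longer tail to the right and negative when the tail is to the left.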

In [None]:
# 77th percentile
df['hght_cm'].quantile(0.77)

The `describe` function gives an overview of all numeric columns:

In [None]:
df.describe()

In [None]:
pattern=(df['hght_cm']<62.0)
#pattern=(df['wght_kg']<21.0)
df[pattern]

## Tasks

* Analyze the weights.
* Convert the metric columns to British imperial units: `df['hght_in']=df['hght_cm']*...` and repeat the analysis.
* Repeat the analysis with data from males or females only: 
  * `pattern=(df['sex']=='?')`
  * `df_fem=df[pattern]`

## $z$-scores

$z=\frac{x-\mu}{\sigma}$

The $z$-score assigns to each $x$ its deviation from the mean $\mu$ in units of the standard deviation $\sigma$.
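
A tiny worked example makes the formula concrete. This is a sketch with three made-up heights (the series `x` is hypothetical, not taken from the dataset): the mean is 170 and the sample standard deviation is 10, so the values sit exactly one standard deviation below, at, and above the mean.

```python
import pandas as pd

# Three made-up heights in cm (not from the BRFSS data)
x = pd.Series([160.0, 170.0, 180.0])

mu = x.mean()      # 170.0
sigma = x.std()    # 10.0; pandas uses the sample standard deviation (ddof=1)

z = (x - mu) / sigma
print(z.tolist())  # [-1.0, 0.0, 1.0]
```

Note that `std()` in pandas divides by $n-1$ by default, which matters for small samples but hardly at all for hundreds of thousands of rows.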

In [None]:
df['z-hght']=(df['hght_cm']-df['hght_cm'].mean())/df['hght_cm'].std()
df.describe()

In [None]:
df['z-hght'].hist(bins=50, density=True)
# after plotting, uncomment the next line and replot
# plt.xlim(-4,4)   

## Empirical rule and Chebyshev's theorem

This is a bit trickier. Computing the percentile rank for each element in a column is easy, but scrolling around in 400,000 rows is not.

So let's use a trick and find the number of elements by pattern matching. For example, to find the data within one standard deviation, we need everything with z-scores between $-1$ and $1$, right?
* Define a pattern accordingly.
* Grab all the rows fulfilling the pattern.
* Count the rows with `describe()`, `shape`, or `count()`.

In [None]:
# the blanks in the pattern are for readability only
pattern=( (df['z-hght']>=-1) & (df['z-hght']<=1) )
df[pattern].describe()

In [None]:
df[pattern].shape[0]

In [None]:
df[pattern]['z-hght'].count()

In [None]:
# total number of heights in the dataset
df['z-hght'].count()

In [None]:
# ratio as a percentage
df[pattern]['z-hght'].count() / df['z-hght'].count() * 100
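
For comparison, recall the two rules: the empirical rule says that bell-shaped data has roughly 68%, 95%, and 99.7% of its values within one, two, and three standard deviations, while Chebyshev's theorem guarantees at least $1-\frac{1}{k^2}$ within $k$ standard deviations for any distribution. Below is a minimal sketch that tabulates all three numbers. It runs on synthetic, normally distributed data (an assumption made for illustration, not the BRFSS data), and the names `empirical` and `summary` are made up here.

```python
import numpy as np
import pandas as pd

# Synthetic, roughly normal data (illustrative, not the BRFSS heights)
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(loc=170, scale=10, size=100_000))
z = (x - x.mean()) / x.std()

# Approximate percentages from the empirical rule (bell-shaped data only)
empirical = {1: 68.0, 2: 95.0, 3: 99.7}

rows = []
for k in (1, 2, 3):
    # The mean of a boolean pattern is the fraction of True entries
    observed = ((z >= -k) & (z <= k)).mean() * 100
    # Chebyshev's lower bound, valid for any distribution
    chebyshev = (1 - 1 / k**2) * 100
    rows.append({'k': k,
                 'observed_%': observed,
                 'empirical_%': empirical[k],
                 'chebyshev_min_%': chebyshev})

summary = pd.DataFrame(rows).set_index('k')
print(summary)
```

For $k=1$ Chebyshev's bound is 0%, so it only becomes informative from two standard deviations on.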

## Tasks

* Check the percentages within one, two, and three standard deviations for the heights and the weights.
* Do these datasets come close to the empirical rule, or do they only satisfy the weaker bounds of Chebyshev's theorem?