# Plotting with Seaborn

Most plotting in Python is done with `matplotlib`. It is a great library: very versatile, with a large variety of plots. Unfortunately, it often takes many lines of code to make a plot of production quality.

Apart from using `matplotlib` directly, we can use either `pandas`, as we have already done with `df['column'].hist()`, or `seaborn`.

`seaborn` is not quite as versatile, but the most common plot types are available, and the results are of high quality out of the box. Moreover, as `seaborn` is built on top of `matplotlib`, you can use `matplotlib` commands to enhance `seaborn` plots.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The data are public and come from the annual CDC Behavioral Risk Factor Surveillance System ([BRFSS](https://www.cdc.gov/brfss/index.html)) telephone survey. The full data set has 200+ columns; only four are included here.

In [None]:
df = pd.read_csv('BRFSS.csv')
# z-scores: distance from the mean in units of the standard deviation
df['z-hght'] = (df['hght_cm'] - df['hght_cm'].mean()) / df['hght_cm'].std()
df['z-wght'] = (df['wght_kg'] - df['wght_kg'].mean()) / df['wght_kg'].std()
df.describe()

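A box plot summarizes a distribution. The box spans the interquartile range (IQR), from the first to the third quartile, and the line inside the box marks the median. Points more than 1.5 times the IQR below the first quartile or above the third quartile are drawn as outliers, and the whiskers extend from the minimum to the maximum of the non-outlier points.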
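The quantities behind a box plot can be computed by hand. The following is a sketch on synthetic data (an assumption for illustration) using the 1.5-times-IQR rule described above. The exact quartile values may differ slightly from what a plotting library draws, depending on the interpolation it uses.

```python
import numpy as np

# Synthetic sample (an assumption), plus two deliberately extreme values
rng = np.random.default_rng(1)
sample = np.append(rng.normal(loc=170, scale=10, size=200), [120.0, 230.0])

# The box: first and third quartile
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Points beyond these fences count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers end at the most extreme points that are not outliers
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f'box: {q1:.1f} to {q3:.1f}')
print(f'whiskers: {whisker_low:.1f} to {whisker_high:.1f}')
print(f'number of outliers: {outliers.size}')
```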

### Task

Characterize the datasets from the boxplots.

In [None]:
sns.boxplot(data=df[['wght_kg','hght_cm']])

In [None]:
# Seaborn makes nice histograms
sns.histplot(data=df['wght_kg'], bins=61)

Try changing `stat='count'` to
* `stat='frequency'`
* `stat='probability'`
* `stat='percent'`
* `stat='density'`

In [None]:
sns.histplot(data=df['hght_cm'], stat='count', bins=61, color='orange')
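The `stat` options differ only in how the bin counts are normalized. The sketch below reproduces the normalizations with `numpy` on synthetic data (an assumption for illustration). It follows the definitions in the `seaborn` documentation but is not `seaborn`'s own code.

```python
import numpy as np

# Synthetic sample (an assumption, not the BRFSS data)
rng = np.random.default_rng(2)
sample = rng.normal(loc=170, scale=10, size=1000)

counts, edges = np.histogram(sample, bins=61)   # stat='count'
widths = np.diff(edges)

frequency = counts / widths                     # observations per unit of x
probability = counts / counts.sum()             # bar heights sum to 1
percent = 100 * probability                     # bar heights sum to 100
density = counts / (counts.sum() * widths)      # bar areas sum to 1

print(probability.sum(), percent.sum(), (density * widths).sum())
```

Only `density` takes the bin width into account such that the total area is 1, which is why it is the one to use when overlaying a PDF.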

You can think of a *probability density function* (PDF) as a smoothed version of the density histogram. One idealized model you may be familiar with is the normal or Gaussian distribution, also known as the bell curve. With `seaborn` we can create and plot a smooth curve from data with `kdeplot`, where `kde` stands for *kernel density estimate*, an empirical PDF approximated from data.

In [None]:
sns.histplot(data=df['wght_kg'], stat='density', color='royalblue', 
             label='density histogram', bins=101)
sns.kdeplot(data=df['wght_kg'], color='black', label='PDF')
plt.legend()
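To see what a kernel density estimate does, it can be built by hand: place a small Gaussian bump on every data point and average the bumps. This is a simplified sketch on synthetic data with a hypothetical fixed `bandwidth`. `kdeplot` chooses its bandwidth automatically, so its curve will generally not match this one exactly.

```python
import numpy as np

# Synthetic sample (an assumption, not the BRFSS data)
rng = np.random.default_rng(3)
sample = rng.normal(loc=80, scale=15, size=300)

# Hypothetical fixed bandwidth: the width of each Gaussian bump
bandwidth = 5.0
grid = np.linspace(sample.min() - 4 * bandwidth,
                   sample.max() + 4 * bandwidth, 500)

# One Gaussian bump per data point, evaluated on the grid
z = (grid[:, None] - sample[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The estimate is the average of all bumps
kde = bumps.mean(axis=1)

# Trapezoid rule: the area under the curve should be close to 1
area = np.sum((kde[1:] + kde[:-1]) / 2 * np.diff(grid))
print(f'area under the curve: {area:.3f}')
```

A larger `bandwidth` gives a smoother curve, a smaller one follows the data more closely.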

Now we can plot the heights and weights in a single plot without too much clutter.

In [None]:
sns.kdeplot(data=df['hght_cm'], color='orange', label='height')
sns.kdeplot(data=df['wght_kg'], color='blue', label='weight')
plt.legend()

In this plot the data share a common $y$-axis, density, which means each PDF integrates to $1$, or $100$%. But because the quantities on the $x$-axis are different (centimeters and kilograms), a comparison is not straightforward. This is where the $z$-scores come in. By switching to $z$-scores, all data share a common $x$-axis.

In [None]:
sns.kdeplot(data=df['z-hght'], color='orange', label='height')
sns.kdeplot(data=df['z-wght'], color='blue', label='weight')
plt.xlim(-5,5) # -5 to 5 standard deviations
plt.xlabel('z-score')
plt.legend()
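The reason $z$-scores put everything on a common $x$-axis is that every standardized column has mean $0$ and standard deviation $1$, whatever its original units. Here is a quick check on synthetic data. The column names mirror the ones above, but the values are made up and the helper `z_score` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two columns (an assumption, not the BRFSS data)
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    'hght_cm': rng.normal(loc=170, scale=10, size=500),
    'wght_kg': rng.normal(loc=80, scale=15, size=500),
})

def z_score(series):
    """Subtract the mean and divide by the (sample) standard deviation."""
    return (series - series.mean()) / series.std()

# Standardize every column, then verify mean 0 and standard deviation 1
z = demo.apply(z_score)
print(z.agg(['mean', 'std']).round(3))
```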