# 08 - Histograms

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../data/diamonds.csv')
df.shape

In [None]:
df.head(5)

## Matplotlib histograms

Now, let's look at a numeric variable in the diamonds data set: `x`, the length of each diamond in mm.

Matplotlib has `hist`, while seaborn's histogram function is `histplot`. Both work in similar ways. 

By default, matplotlib will split the data into 10 bins, which is usually too few. In addition, the bin boundaries are not aligned with the tick marks. This is confusing!

In [None]:
plt.hist(data=df, x='x');
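To see why the default bins rarely line up with tick marks, here is a minimal sketch using NumPy, which matplotlib's `hist` relies on for binning. The data below is synthetic (a made-up stand-in for `df['x']`), so the exact numbers are only illustrative. With an integer bin count, the edges are evenly spaced between the smallest and largest value, so they almost never fall on round numbers.

```python
import numpy as np

# Synthetic stand-in for df['x']; these values are made up for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(3.7, 10.7, size=500)

# With bins=10 (matplotlib's default), the edges split [min, max] into 10 equal parts.
edges = np.histogram_bin_edges(x, bins=10)
expected = np.linspace(x.min(), x.max(), 11)

print(np.allclose(edges, expected))  # True
print(np.round(edges, 3))            # edges are generally not round numbers
```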

To solve this, we can use the `bins` parameter. If we supply an integer value, that's the number of bins to use.

If I remove the trailing semicolon, the notebook also displays the values returned by `hist`: the counts, the bin edges, and the drawn patches.

In [None]:
plt.hist(data=df, x='x', bins=25)
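As a rough sketch of what those return values look like, the snippet below uses `np.histogram`, which returns the same counts and bin edges that `plt.hist` reports (without the patches). The data is synthetic and only stands in for `df['x']`.

```python
import numpy as np

# Synthetic stand-in for df['x']; these values are made up for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.7, scale=1.1, size=1000)

# 25 bins means 25 counts and 26 bin edges (one more edge than bins).
counts, edges = np.histogram(x, bins=25)

print(len(counts), len(edges))  # 25 26
print(counts.sum() == len(x))   # True: every value lands in exactly one bin
```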

### Specifying bin boundaries

You can also specify the bin boundaries explicitly by passing a list or array of bin edges to `bins`.

Here I build the edges with numpy's `arange` function: the first argument is the start value, the second is the stop value, and the third is the step size, which becomes the bin width.

I add 0.25 mm (one bin width) to the second argument. This is because the values generated by `arange` _do not_ include the stop value, so without the extra step the largest values could fall outside the last bin.

In [None]:
bins = np.arange(0, df['x'].max()+0.25, 0.25)
plt.hist(data=df, x='x', bins=bins);
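The effect of adding one step to the stop value can be checked directly. This is a small sketch; `x_max` is a hypothetical maximum chosen for illustration, not read from the data file.

```python
import numpy as np

step = 0.25
x_max = 10.74  # hypothetical maximum value, for illustration only

# Without padding, the last edge stops short of the maximum.
without_pad = np.arange(0, x_max, step)
# Padding the stop value by one step guarantees an edge at or beyond the maximum.
with_pad = np.arange(0, x_max + step, step)

print(float(without_pad[-1]))  # 10.5
print(float(with_pad[-1]))     # 10.75
```

With the padded edges, the largest value falls inside the final bin instead of being dropped from the plot.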

What happens if we try an extremely small bin size, such as 0.05 mm? On the one hand, this bin size is perhaps a bit too small, introducing a lot of noise in the plot.

On the other hand, it does a good job of showing that many diamond listings are probably rounded to some nearest value.

In [None]:
bins = np.arange(0, df['x'].max()+0.05, 0.05)
plt.hist(data=df, x='x', bins=bins);

## Seaborn histograms

Seaborn also has a function for creating histograms, `histplot`. It works much like matplotlib's `hist`.

By default, seaborn picks the number of bins automatically from the data, which usually yields more bins than matplotlib's default of 10. Also, you can normalize the plot with the parameter `stat='percent'`, so that the bar heights show percentages instead of counts.

You can also add a kernel density estimate (KDE) with `kde=True`. A KDE is a smoothed estimate of the data's distribution; as a density, the total area under its curve equals 1.

In [None]:
sns.histplot(data=df, x='x', stat='percent');
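To make the percent normalization concrete, here is a minimal sketch of the arithmetic, done by hand with NumPy on synthetic data. It is an illustration of the idea, not seaborn's internal code: each bin's count is divided by the total and multiplied by 100, so the bar heights add up to 100.

```python
import numpy as np

# Synthetic stand-in for df['x']; these values are made up for illustration.
rng = np.random.default_rng(2)
x = rng.normal(loc=5.7, scale=1.1, size=2000)

counts, edges = np.histogram(x, bins=30)

# Convert raw counts into the percentage of observations in each bin.
percent = counts / counts.sum() * 100

print(percent.sum())  # approximately 100, up to floating-point rounding
```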

In [None]:
sns.histplot(data=df, x='x', stat='percent', kde=True);
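For intuition about what a KDE computes, the sketch below builds a Gaussian KDE by hand with NumPy on synthetic data and checks that the area under the curve is close to 1. The bandwidth `h` is an arbitrary value assumed for illustration; seaborn chooses its own bandwidth, and this is not its implementation.

```python
import numpy as np

# Synthetic stand-in for df['x']; these values are made up for illustration.
rng = np.random.default_rng(3)
x = rng.normal(loc=5.7, scale=1.1, size=300)

h = 0.3  # assumed bandwidth, chosen arbitrarily for this sketch
grid = np.linspace(x.min() - 4 * h, x.max() + 4 * h, 1000)

# Place one Gaussian bump of width h on every data point and average them.
z = (grid[:, None] - x[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# Trapezoidal rule: the area under a density curve should be close to 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(float(area), 3))
```

A smaller `h` makes the curve follow the data more closely but look noisier, much like choosing a smaller bin width in a histogram.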