# Vis: Histograms

*Purpose*: *Histograms* are a key tool for exploratory data analysis (EDA): they are a powerful lens for understanding how a single variable is "distributed" in a dataset. In this exercise we'll introduce and interpret histograms and one variant, the frequency polygon.


## Setup


In [None]:
import grama as gr
DF = gr.Intention()
%matplotlib inline

We'll use the `mpg` dataset from `plotnine`: This is a dataset describing different automobiles, including their mileage (hence mpg).


In [None]:
from plotnine.data import mpg as df_mpg

# Introduction

## What is a Histogram?

As we saw in the previous exercise, a bar plot shows the count of observations within various categories:


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(gr.aes(x="class"))
    + gr.geom_bar()
)

However, if we try to visualize a *continuous* variable with a bar chart, we're likely to run into issues. The *exact same* values do occur multiple times in the `df_mpg` dataset, but only because the values have been rounded:


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(gr.aes(x="displ"))
    + gr.geom_bar()
)

If we didn't have rounding, it would be harder to visualize the data with a bar chart:


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.tf_mutate(
        ## Simulate a lack of rounding by "jittering" the displ values
        displ=DF.displ + gr.marg_mom("norm", mean=0, sd=0.1).r(df_mpg.shape[0])
    )
    >> gr.ggplot(gr.aes(x="displ"))
    + gr.geom_bar()
)

Rather than *assuming* repetitions of continuous values, we can instead *bin* the values into groups. For instance, the following bins the `displ` values by rounding each one down to a whole number (floor division by `1`):


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.tf_mutate(displ_group=DF.displ // 1)
    >> gr.tf_select(DF.displ, DF.displ_group, gr.everything())
)

We can use the binned `displ` to create groups and compute counts within each group.


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.tf_mutate(displ_group=DF.displ // 1)
    >> gr.tf_count(DF.displ_group)
)

Now we can visualize those counts `n` with a column chart:


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.tf_mutate(displ_group=DF.displ // 1)
    >> gr.tf_count(DF.displ_group)
    >> gr.ggplot(gr.aes(x="displ_group", y="n"))
    + gr.geom_col()
)
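
To make the grouping steps concrete, here is a minimal plain-Python sketch of the same idea: floor each value, then count how many values share each floor. The `displ_values` list is made up for illustration and is not drawn from `df_mpg`.

```python
from collections import Counter

# Hypothetical engine displacements (liters); NOT taken from the mpg dataset
displ_values = [1.8, 2.0, 2.8, 3.1, 3.3, 4.7, 5.3, 5.7, 6.2]

# Floor division by 1 rounds each value DOWN to a whole number,
# mirroring `DF.displ // 1` in the cells above (so 2.8 lands in group 2)
displ_groups = [value // 1 for value in displ_values]

# Count how many observations land in each group (the `n` column)
counts = Counter(displ_groups)

for group in sorted(counts):
    print(f"displ_group={group:.0f}: n={counts[group]}")
```

Note that floor division keeps `2.8` in group `2` rather than rounding it up to `3`; each group covers a half-open interval such as `[2, 3)`.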


## Enter `gr.geom_histogram()`

Rather than do all of that grouping manually, the geometry `gr.geom_histogram()` does this automatically. This geometry accepts a `bins` argument that allows us to change the **number** of groups; ggplot then figures out the bin widths automatically.


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(gr.aes(x="displ"))
    + gr.geom_histogram(bins=20)
)
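
One simple way to turn a *number* of bins into bin widths is to split the observed range into equal pieces. The sketch below assumes exactly that. The helper `bin_edges` and the sample values are hypothetical, and the plotting library's own rule for placing bin boundaries may differ in its details.

```python
def bin_edges(values, bins):
    """Return (width, edges) for `bins` equal-width bins spanning min..max.

    Assumption: bins evenly divide the observed range. This is a sketch,
    not the exact rule used by `gr.geom_histogram()`.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    return width, edges


# Hypothetical values spanning 2 to 6; NOT taken from the mpg dataset
sample_values = [2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0]

for n_bins in (4, 8):
    width, edges = bin_edges(sample_values, n_bins)
    print(f"bins={n_bins}: width={width}, edges={edges}")
```

Doubling the number of bins halves the width of each bin, which is why the `bins` argument controls how fine-grained the histogram looks.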

## What do Histograms tell us?

Histograms give us a visual sense of *frequency of values* in a dataset. This provides an *extremely* useful window into "what is going on" in any given continuous variable. From a histogram, we can tell:

- Which are more common values?
- Which are less common values?
- Do observations tend to cluster at particular values?
  - *Note*: A "bump" or "cluster" of points is sometimes called a *mode*. A dataset with multiple bumps is then called *multi-modal*.
- What range of values occur?

And so much more.

For example, let's interpret a histogram of `displ`, this time drawn with 10 bins.


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(gr.aes(x="displ"))
    + gr.geom_histogram(bins=10)
)

*Here are my observations*

- `displ` ranges between a little under `2` and a bit above `6`.
- Smaller values (around `2`) are more common.
- There is an additional "bump" around `displ == 5`.
- Values of `displ == 6` and above are quite rare.

Part of the value of such a plot is that it can help us to frame other questions, such as "Which vehicles *do* have a `displ >= 6`?"


## Rule #1 of Histograms

Note that we have to select a number of `bins` when we construct a histogram. This is an important parameter that makes a big difference. Therefore, we have `Rule #1 of Histograms`:

```{admonition} Rule #1 of Histograms
Experiment with different numbers of `bins` when plotting a histogram.
```

Different bin counts enable different observations about the same data. You'll practice this below, still using the `df_mpg` dataset.
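
As one illustration of why the bin count matters, the following plain-Python sketch counts the same values using three different numbers of bins. The helper `bin_counts` and the sample values are hypothetical (the values are made up to have two clusters), and equal-width bins spanning the data range are an assumption of this sketch.

```python
def bin_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        # Clamp the index so the maximum value falls in the last bin
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical values with one cluster near 2.5 and another near 5.25
sample_values = [
    2.0, 2.25, 2.25, 2.5, 2.5, 2.5, 2.75, 3.0, 3.5,
    4.0, 4.75, 5.0, 5.0, 5.25, 5.25, 5.5, 6.0,
]

for n_bins in (2, 4, 8):
    counts = bin_counts(sample_values, n_bins)
    print(f"bins={n_bins}")
    for count in counts:
        # One "#" per observation gives a rough text histogram
        print("  " + "#" * count)
```

With too few bins the two clusters blur into nearly equal bars; with more bins the gap between them becomes visible. With very many bins, individual bars start to reflect noise rather than structure.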


### __q1__ Change the number of bins

Re-construct the histogram of `displ` in `df_mpg`, and experiment with different numbers of bins. Answer the questions under *observations* below.


In [None]:
## TASK: Create a histogram of `displ`, experiment with the number of bins
(
    df_mpg

)

*Observations*

- What *additional* observations can you make, based on varying the `bins` argument?
  - (Your response here)


## Histogram Variant: Frequency Polygons

A useful variant of the histogram is the *frequency polygon*. This simply visualizes the counts as a line rather than with bars:


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(gr.aes(x="displ"))
    + gr.geom_freqpoly(bins=20)
)

This may seem like a silly distinction to make, but it is actually *extremely useful* when we start comparing multiple histograms on the same plot. While we would have to dodge (or stack---yuck) bars with multiple groups, lines can sit on top of one another:


In [None]:
## NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(gr.aes(x="displ", color="class"))
    + gr.geom_freqpoly(bins=10)
)

From this frequency polygon plot, we can now see all manner of new insights! For instance:

- The `2seater` class tends to have much higher `displ`
- The `suv` class tends to have higher `displ`, but some are near the low end
- The `midsize` class tends to have much lower `displ`
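
The idea behind a grouped frequency polygon can be sketched in plain Python: bin every group against the *same* bin edges, then treat each group's (bin center, count) pairs as the points of its line. The group names, values, and helper `binned_counts` below are hypothetical, and the plotting library may handle details differently (for example, how it treats the ends of each line).

```python
def binned_counts(values, lo, hi, bins):
    """Count values in `bins` equal-width bins spanning lo..hi."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        # Clamp the index so a value equal to `hi` falls in the last bin
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical values for two made-up groups; NOT taken from the mpg dataset
groups = {
    "group_a": [2.0, 2.25, 2.5, 2.5, 3.0, 3.5],
    "group_b": [4.0, 5.0, 5.25, 5.5, 5.5, 6.0],
}

# Shared bins across all groups, so the lines are directly comparable
all_values = [v for vals in groups.values() for v in vals]
lo, hi = min(all_values), max(all_values)
bins = 4
width = (hi - lo) / bins
centers = [lo + (i + 0.5) * width for i in range(bins)]

# Each polygon is a list of (bin center, count) points, one line per group
polygons = {
    name: list(zip(centers, binned_counts(vals, lo, hi, bins)))
    for name, vals in groups.items()
}

for name, points in polygons.items():
    print(name, points)
```

Because every group shares the same x positions, the lines can be drawn on top of one another without the dodging or stacking that bars would need.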


# Case Study: Diamonds

Now let's practice using histograms with the diamonds dataset.


In [None]:
from grama.data import df_diamonds
df_diamonds

This is a dataset of nearly 54,000 diamonds, including their sale price and characteristics (carat, cut, color, clarity) and geometry.

### __q2__ Study the `carat` distribution

Create a histogram of `carat`. Answer the questions under *observations* below.


In [None]:
## TASK: Create a histogram of `carat`
(
    df_diamonds

)

*Observations*

- What range of `carat` values occur in the dataset?
  - (Your response here)
- What values tend to be more common?
  - (Your response here)
- What values tend to be less common?
  - (Your response here)


### __q3__ Study the `carat` distribution, closer look

Inspect the following plot. Answer the questions under *observations* below.


In [None]:
## TASK: No need to edit; run and inspect
(
    df_diamonds
    >> gr.tf_filter(DF.carat < 2.0)
    >> gr.ggplot(gr.aes(x="carat"))
    + gr.geom_histogram(bins=60)
    + gr.scale_x_continuous(breaks=(0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.2, 1.5, 1.7))
)

*Observations*

- What do you notice about the most-common values of `carat`?
  - (Your response here)
- Note that `carat` is a quality that a jeweler can control, to some extent. When cutting a diamond, a jeweler can choose to take off less material in order to preserve `carat`, possibly at the cost of a lower-quality `cut`. What does the histogram above suggest about jeweler behavior?
  - (Your response here)


### __q4__ Bring in the `cut`

Re-create the plot from __q3__, but visualize the `cut` as well. Answer the questions under *observations* below.


In [None]:
## TASK: Re-create the plot from q3, but visualize the `cut` as well
(
    df_diamonds

)

*Observations*

- What is the most numerous `cut` at lower `carat` values? (Say `carat <= 0.5`)
  - (Your response here)
- What is the most numerous cut around `carat == 1.0`?
  - (Your response here)
