# Binning Numeric Columns

When grouping, numeric columns are often used as the aggregating column and not the grouping column. In this chapter, we'll learn how to bin numeric columns into specific groups using the `cut` and `qcut` **functions** (not methods). After binning, we'll be able to more easily use them for grouping. Let's begin with the housing dataset, which has a few numeric columns that make sense to bin.

In [None]:
import pandas as pd
usecols = ['Neighborhood', 'OverallQual', 'YearBuilt', 'Exterior1st', 
           'Foundation', 'GrLivArea', 'SalePrice']
df = pd.read_csv('../data/housing.csv', usecols=usecols)
df.head()

## Grouping with numeric columns

Any column, regardless of its data type, may be used as the grouping column. Although numeric columns are usually used as the aggregating column, there are cases where it is sensible to use them as the grouping column. Here, we find the mean price for each of the ten unique values of `OverallQual` and also report the number of houses in each group.

In [None]:
df.groupby('OverallQual')['SalePrice'].agg(['mean', 'size'])

The `GrLivArea` column is also numeric, but is a poor choice for grouping as there are many unique values. Let's perform the same operation as above.

In [None]:
df_temp = df.groupby('GrLivArea')['SalePrice'].agg(['mean', 'size'])
df_temp.head()

The first five unique values of `GrLivArea` all appear exactly once. Groups with a single observation are usually not very interesting. In fact, the average group has just 1.7 rows.

In [None]:
df_temp['size'].mean()

There are more than half as many groups as there are rows in the DataFrame.

In [None]:
print(f'There are {len(df_temp)} groups from {len(df)} total rows.')

## Binning with `pd.cut`

The `pd.cut` function provides the machinery for binning numeric columns into a specific number of bins. Pass a numeric Series as the first argument and the boundaries of the bins as the second.

In [None]:
s = pd.cut(df['GrLivArea'], bins=[0, 500, 1000, 1500, 2000, 3000, 10_000])
s.head()

An ordered categorical Series is returned with one fewer category than the number of boundaries given. Each category is an interval with two endpoints. The left endpoint is **exclusive**, while the right is **inclusive**. For instance, the interval `(1500, 2000]` does not include 1500 exactly, but does include 2000.
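
As a small illustration of the endpoint rule, the sketch below bins three made-up values (not taken from the housing data) that sit on or next to a boundary. It also shows the `right` parameter of `pd.cut`, which flips which endpoint is inclusive.

```python
import pandas as pd

# Made-up values chosen to sit on and next to a bin boundary
values = pd.Series([1500, 1501, 2000])
edges = [1000, 1500, 2000]

# Default: intervals are open on the left and closed on the right,
# so 1500 lands in (1000, 1500] and 2000 lands in (1500, 2000]
right_closed = pd.cut(values, bins=edges)
print(right_closed)

# right=False flips this: closed on the left, open on the right,
# so 1500 lands in [1500, 2000) and 2000 falls outside every bin
left_closed = pd.cut(values, bins=edges, right=False)
print(left_closed)
```

With `right=False`, the value 2000 is no longer covered by any bin and becomes missing.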

### Interval data type

While the resulting column is categorical, each individual value in the column is an **Interval** object, which is specific to pandas. The `cat` accessor is used to return all six of these Interval categories.

In [None]:
s.cat.categories

A single value may be retrieved using integer location.

In [None]:
s.cat.categories[2]
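
An Interval can also be built and inspected on its own. The following is a minimal sketch using a hand-made interval rather than one taken from the housing data. It shows a few of the attributes an Interval exposes and how the `in` operator respects the open and closed endpoints.

```python
import pandas as pd

# A hand-made interval, closed on the right like those produced by pd.cut
interval = pd.Interval(1500, 2000, closed='right')

# Endpoints and which side is closed
print(interval.left, interval.right, interval.closed)

# Width and midpoint of the interval
print(interval.length, interval.mid)

# Membership respects the endpoints: 2000 is included, 1500 is not
print(2000 in interval)
print(1500 in interval)
```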

### Must know minimum and maximum value

You must know both the minimum and maximum value of the column you are binning to make precise bins around the current data. In this case, 0 is lower than the minimum and 10,000 is much greater than the maximum `GrLivArea`, so all values will be placed within a bin. If any values fall outside the given boundaries (greater than the last boundary, or less than or equal to the first), they will be missing in the returned Series.
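
To see what happens when the boundaries do not cover the data, here is a short sketch with made-up values, two of which lie outside the bins.

```python
import pandas as pd

# Made-up values: -5 is below the first boundary, 150 is above the last
values = pd.Series([-5, 50, 100, 150])

binned = pd.cut(values, bins=[0, 50, 100])
print(binned)

# Values outside the boundaries are returned as missing
print(binned.isna().sum())
```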

Now that the data is binned, we can count the number of houses within each of these six categories. Notice how only three houses have `GrLivArea` less than 500.

In [None]:
s.value_counts(sort=False)

To get the precise lower and upper boundaries, use the minimum and maximum of the column. You'll also need to set the `include_lowest` parameter to `True` so the very first bin includes the lowest value.

In [None]:
area_min, area_max = df['GrLivArea'].agg(['min', 'max'])
s = pd.cut(df['GrLivArea'], bins=[area_min, 500, 1000, 1500, 2000, 3000, area_max],
           include_lowest=True)
s.value_counts(sort=False)
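
The effect of `include_lowest` is easiest to see on a tiny example. The sketch below uses made-up values whose minimum equals the first boundary and compares the result with and without the parameter.

```python
import pandas as pd

# Made-up values; the minimum (10) is exactly the first boundary
values = pd.Series([10, 15, 20, 30])
edges = [10, 20, 30]

# Without include_lowest, 10 is excluded from (10, 20] and becomes missing
without = pd.cut(values, bins=edges)
print(without)

# With include_lowest=True, the first bin is widened to contain 10
with_lowest = pd.cut(values, bins=edges, include_lowest=True)
print(with_lowest)
```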

### Cut into a specific number of bins

A second way to use `pd.cut` is to supply it a single integer for the number of bins to create. Each bin created will have equal width. Here, we create eight bins on the same column and immediately find the counts of each.

In [None]:
pd.cut(df['GrLivArea'], bins=8).value_counts(sort=False)
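
To check that an integer `bins` value really produces equal-width bins, the boundaries can be returned with `retbins=True` and differenced. This is a sketch on made-up values. Note that pandas extends the first boundary slightly below the minimum so that the minimum value is included, which makes the first bin a little wider than the rest.

```python
import numpy as np
import pandas as pd

# Made-up values ranging from 0 to 100
values = pd.Series([0, 10, 20, 100])

binned, edges = pd.cut(values, bins=4, retbins=True)

# Boundaries: the first one sits just below the minimum of the data
print(edges)

# Widths: all 25 except the first, which is slightly larger
widths = np.diff(edges)
print(widths)
```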

### Take care when setting the `precision` parameter

By default, pandas uses up to three digits of precision when storing and displaying the bin labels. You may use the `precision` parameter to set the decimal precision (just like rounding), though care must be taken, as it only rounds the boundary values shown in the labels after the cut has taken place. The real boundaries are still the same as above. To show this, we'll set `precision` to -3.

In [None]:
pd.cut(df['GrLivArea'], bins=8, precision=-3).value_counts(sort=False)

Setting precision to -3 (rounding to the nearest thousand) results in the exact same counts as above. It would appear that the same number of houses (740) have `GrLivArea` greater than 998 up to 1661 as those with `GrLivArea` greater than 1000 up to 1700.

The `between` method is used below to determine whether a house has a `GrLivArea` within a certain range. The resulting boolean Series is summed to find the count. Note how the true count below does not match the count produced from `pd.cut`, as setting the `precision` parameter only rounds the boundaries after the cut has been made.

In [None]:
df['GrLivArea'].between(999, 1661).sum()

In [None]:
df['GrLivArea'].between(1001, 1700).sum()
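
The same point can be checked without the housing data. The sketch below uses made-up values and cuts them twice, once with the default precision and once with `precision=0`. The boundaries returned by `retbins=True` and the bin each value lands in are identical in both cases. Only the displayed labels differ.

```python
import numpy as np
import pandas as pd

# Made-up values between 0 and 10
values = pd.Series([0.0, 1.2, 3.4, 5.0, 6.8, 7.5, 9.1, 10.0])

default_cut, default_edges = pd.cut(values, bins=3, retbins=True)
rounded_cut, rounded_edges = pd.cut(values, bins=3, precision=0, retbins=True)

# The true boundaries are the same regardless of precision
print(default_edges)
print(rounded_edges)

# The labels differ because precision rounds them for display
print(default_cut.cat.categories)
print(rounded_cut.cat.categories)

# Every value is assigned to the same bin position in both results
same_assignment = default_cut.cat.codes.equals(rounded_cut.cat.codes)
print(same_assignment)
```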

### Label the bins with string names

Each bin may be labeled with a string instead of the interval by setting the `labels` parameter to a list of strings, one for each bin. Here, we create three equal-width bins with three string labels. When using string labels, you won't know the endpoints for the bins unless you return them by setting `retbins` to `True`. Both the Series and the bin boundaries will be returned as a tuple, which we unpack into separate variable names.

In [None]:
s, bins = pd.cut(df['GrLivArea'], bins=3, 
                 labels=['small', 'medium', 'large'], retbins=True)
s.head()

The bin boundaries are displayed below.

In [None]:
bins
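
One way to keep track of what each label means is to pair the labels with the returned boundaries. The following sketch does this for made-up values. The `ranges` dictionary is a hypothetical helper for illustration, not something pandas provides.

```python
import pandas as pd

labels = ['small', 'medium', 'large']

# Made-up values ranging from 0 to 90
values = pd.Series([0, 10, 40, 55, 90])

binned, edges = pd.cut(values, bins=3, labels=labels, retbins=True)
print(binned)

# Pair each label with its (lower, upper) boundaries
ranges = {
    label: (low, high)
    for label, low, high in zip(labels, edges[:-1], edges[1:])
}
for label, (low, high) in ranges.items():
    print(f'{label}: greater than {low:.2f}, up to {high:.2f}')
```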

## Quantile binning with `pd.qcut`

When we cut our Series into eight equal-width bins, one of the categories had zero observations in it. Instead of using equal-width bins, you may wish to have an equal number of observations in each bin. The `pd.qcut` function bins according to quantiles. You may provide it a list of floats as the quantile boundaries or an integer to create that many bins all with (approximately) equal number of observations in each. Below, we attempt to create eight bins with the same number of observations in each. Because there are duplicate `GrLivArea` values, it may be impossible to create boundaries where each bin has an equal number of observations.

In [None]:
pd.qcut(df['GrLivArea'], 8, precision=0).value_counts(sort=False)
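
The difference between equal-width and quantile binning, and the effect of duplicate values, can be seen on small made-up Series. This is only a sketch. The first Series is skewed by one large value, and the second contains repeated values.

```python
import pandas as pd

# Made-up skewed data: one value is far larger than the rest
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width bins: almost everything falls in the first bin
width_counts = pd.cut(skewed, bins=5).value_counts().sort_index().tolist()
print(width_counts)

# Quantile bins: each bin holds the same number of observations
quantile_counts = pd.qcut(skewed, 5).value_counts().sort_index().tolist()
print(quantile_counts)

# Made-up data with repeated values: equal counts are not possible
tied = pd.Series([1, 2, 3, 3, 3, 3, 4, 5])
tied_counts = pd.qcut(tied, 4).value_counts().sort_index().tolist()
print(tied_counts)
```

In the last case the four identical values must all land in the same bin, so the bins cannot be balanced.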

Provide a list of quantiles as the second argument to create bins of a specific size. Here, three bins are created that hold 20%, 70%, and 10% of the data.

In [None]:
pd.qcut(df['GrLivArea'], [0, 0.2, 0.9, 1], precision=0).value_counts(sort=False)

We can use the `quantile` method to verify the bin edge values.

In [None]:
df['GrLivArea'].quantile([0, 0.2, 0.9, 1])
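
The same check can be run on made-up data where the answer is easy to reason about. In this sketch, the integers 1 through 100 are cut at the 20th and 90th percentiles, and the boundaries returned by `retbins=True` are compared with those from `quantile`.

```python
import numpy as np
import pandas as pd

# Made-up data: the integers 1 through 100
values = pd.Series(range(1, 101))
quantiles = [0, 0.2, 0.9, 1]

binned, edges = pd.qcut(values, quantiles, retbins=True)

# The bins hold 20%, 70%, and 10% of the data
counts = binned.value_counts().sort_index().tolist()
print(counts)

# The bin boundaries match the values returned by quantile
expected_edges = values.quantile(quantiles).to_numpy()
print(edges)
print(expected_edges)
```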

## Grouping with bins

Grouping is often more sensible after binning numeric columns that have many unique values. Let's create a new column, `AreaBin`, that cuts `GrLivArea` into five categories, each with approximately the same number of observations.

In [None]:
df['AreaBin'] = pd.qcut(df['GrLivArea'], 5)
df.head(3)

We can now use this column like any other grouping column. Below, we find the median price for houses in each bin.

In [None]:
df.groupby('AreaBin')['SalePrice'].median().round(-3)

Here, we create a pivot table of the median price by `Foundation` and `AreaBin`.

In [None]:
df.pivot_table(index='Foundation', columns='AreaBin', 
               values='SalePrice', aggfunc='median')
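
The bin-then-group pattern can be sketched end to end on a tiny made-up DataFrame. The column names `area` and `price` are hypothetical stand-ins for `GrLivArea` and `SalePrice`. The example also shows that the binned Series can be passed directly to `groupby` without first adding it as a column. Because the bins are categorical, `observed=True` is passed explicitly so that only bins present in the data appear in the result.

```python
import pandas as pd

# Made-up data standing in for the housing columns
df_small = pd.DataFrame({
    'area': [800, 950, 1200, 1500, 2100, 2600],
    'price': [100, 120, 150, 180, 250, 300],
})

# Option 1: store the bins in a new column, then group by it
df_small['area_bin'] = pd.qcut(df_small['area'], 2)
medians = df_small.groupby('area_bin', observed=True)['price'].median()
print(medians)

# Option 2: group by the binned Series directly
area_bins = pd.qcut(df_small['area'], 2)
medians_direct = df_small.groupby(area_bins, observed=True)['price'].median()
print(medians_direct)
```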

## Exercises

Use the `bikes` DataFrame for the following exercises.

In [None]:
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

### Exercise 1

<span style="color:green; font-size:16px">Find the number of rides between trip durations of 0 to 100, 101 to 1000, and 1001 and above.</span>

### Exercise 2

<span style="color:green; font-size:16px">Cut the trip duration into five bins where the width of each bin is the same size. Count the occurrence of each bin. Sort the resulting Series by the index. Does it make sense to use this type of binning?</span>

### Exercise 3

<span style="color:green; font-size:16px">Cut the trip duration into five bins where the number of observations in each bin is approximately the same. Count the occurrence of each bin. Sort the resulting Series by the index. Does it make sense to use this type of binning?</span>

### Exercise 4

<span style="color:green; font-size:16px">Quantile cut trip duration and temperature into five equal-sized bins and count the occurrences using `pd.crosstab`. Do you notice any patterns?</span>

### Exercise 5

<span style="color:green; font-size:16px">Create a pivot table containing the average trip duration by gender and temperature quantile cut into 10 equal-sized bins.</span>

### Exercise 6

<span style="color:green; font-size:16px">The temperature column has a single obviously wrong value. Replace this value with the numpy nan object and then cut the resulting Series into five bins, labeling them 'cold', 'cool', 'mild', 'warm', 'hot'. Choose the boundaries of the bins that make sense for these labels. Then count the occurrence of each label and include the missing values.</span>