![rmotr](https://i.imgur.com/jiPp4hj.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39119047-31551228-46eb-11e8-9cfb-d6ca05a05fe4.jpeg"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Cutting and Binning

Sometimes we need to divide a field with a continuous range of data into discrete categories. For example, you might group users' ages into `0-14`, `15-35`, `36-60`, and `+60`.

![separator2](https://i.imgur.com/4gX5WFr.png)

## Hands on! 

In [1]:
import pandas as pd
import numpy as np

In [2]:
ages = np.append(np.random.randint(0, 99, size=16), [14, 35, 60])

In [3]:
ages

array([96, 35, 29,  0, 26, 69, 44, 36, 90, 53, 38, 79, 74, 80, 62, 42, 14,
       35, 60])

The pandas function `cut` takes care of the job. Let's define our bins first:

In [11]:
# bins = [0, 14, 35, 60, 100]  # intervals are open on the left, so age 0 would fall outside every bin
bins = [-1, 14, 35, 60, 100]  # start below zero so that age 0 lands in the first bin

In [8]:
categories = pd.cut(ages, bins)

In [9]:
categories

[(60, 100], (14, 35], (14, 35], (-1, 14], (14, 35], ..., (60, 100], (35, 60], (-1, 14], (14, 35], (35, 60]]
Length: 19
Categories (4, interval[int64]): [(-1, 14] < (14, 35] < (35, 60] < (60, 100]]
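
Lowering the first edge to `-1` is one way to keep age `0`. As an alternative sketch, `pd.cut` also accepts `include_lowest=True`, which makes the first interval include its left edge. The sample values and variable names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample chosen to sit exactly on the bin edges
sample_ages = np.array([0, 14, 35, 60, 99])
zero_based_bins = [0, 14, 35, 60, 100]

# Default behaviour: (0, 14] excludes 0, so that age gets no category (code -1)
strict = pd.cut(sample_ages, zero_based_bins)

# include_lowest=True closes the first interval on the left as well
inclusive = pd.cut(sample_ages, zero_based_bins, include_lowest=True)

print(strict.codes)
print(inclusive.codes)
```

With this option pandas labels the first interval with a left edge slightly below zero, so the printed category differs from the `-1` version even though the same ages are captured.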

In [10]:
categories.codes

array([3, 1, 1, 0, 1, 3, 2, 2, 3, 2, 2, 3, 3, 3, 3, 2, 0, 1, 2],
      dtype=int8)

In [8]:
categories.categories

IntervalIndex([(0, 14], (14, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [9]:
categories.value_counts()

(0, 14]      4
(14, 35]     6
(35, 60]     3
(60, 100]    6
dtype: int64

In [10]:
np.sort(ages)

array([ 2,  2,  8, 14, 18, 18, 23, 25, 27, 35, 39, 57, 60, 71, 71, 80, 91,
       92, 98])

Mathematically speaking, the intervals are closed on the right: each bin includes its upper edge and excludes its lower one, so age `14` belongs to the first category rather than the second. The `right` parameter controls which side is inclusive. It defaults to `True`; passing `right=False` closes the intervals on the left instead.

In [11]:
lefty_cats = pd.cut(ages, bins, right=False)

In [12]:
lefty_cats.value_counts()

[0, 14)      3
[14, 35)     6
[35, 60)     3
[60, 100)    7
dtype: int64
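
To see the effect of `right` in isolation, here is a small sketch that bins only values sitting exactly on the edges. The names `edge_ages` and `edges` are illustrative and not part of the notebook above:

```python
import pandas as pd

edge_ages = [14, 35, 60]  # values that coincide with bin edges
edges = [0, 14, 35, 60, 100]

right_closed = pd.cut(edge_ages, edges)              # (0, 14], (14, 35], ...
left_closed = pd.cut(edge_ages, edges, right=False)  # [0, 14), [14, 35), ...

# Codes are bin positions: 0 is the first bin, 3 is the last
print(list(right_closed.codes))
print(list(left_closed.codes))
```

Each edge value moves up one bin when the intervals are closed on the left.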

You can also pass labels to give better names to your bins:

In [13]:
categories = pd.cut(ages, bins, labels=['Age 0-14', 'Age 15-35', 'Age 36-60', '+60'])

In [14]:
categories

[Age 0-14, Age 15-35, Age 36-60, +60, Age 0-14, ..., Age 15-35, Age 0-14, Age 0-14, Age 15-35, Age 36-60]
Length: 19
Categories (4, object): [Age 0-14 < Age 15-35 < Age 36-60 < +60]

In [15]:
categories.value_counts()

Age 0-14     4
Age 15-35    6
Age 36-60    3
+60          6
dtype: int64
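
If plain bin numbers are enough, `pd.cut` can also take `labels=False` and return integer positions instead of a categorical. A minimal sketch with made-up ages:

```python
import numpy as np
import pandas as pd

sample_ages = np.array([5, 20, 40, 70])  # hypothetical values, one per bin
bins = [-1, 14, 35, 60, 100]

# labels=False returns the position of each value's bin
bin_numbers = pd.cut(sample_ages, bins, labels=False)
print(bin_numbers)
```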

But what if you don't know where the bin edges should go and just want bins that each hold roughly the same number of observations? The `qcut` function takes the number of quantiles you want and derives the edges from the distribution of the data:

In [16]:
pd.qcut(ages, 4).value_counts()

(1.999, 18.0]    6
(18.0, 35.0]     4
(35.0, 71.0]     5
(71.0, 98.0]     4
dtype: int64

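In this case, `qcut` has chosen the bin edges for us, based on the distribution of the data. The counts are only roughly equal because 19 values cannot be split evenly into four bins and tied values always share a bin.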
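
The difference between `cut` and `qcut` is easiest to see on skewed data. The sketch below uses a made-up array with one large outlier and asks both functions for two bins; `retbins=True` additionally returns the edges that `qcut` picked:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample: nine small values and one outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# cut with an integer splits the value range into equal-width bins
by_width = pd.cut(values, 2)

# qcut splits at quantiles, so each bin holds about the same number of values
by_quantile, quantile_edges = pd.qcut(values, 2, retbins=True)

print(by_width.value_counts())
print(by_quantile.value_counts())
print(quantile_edges)
```

Equal-width bins put almost everything in the first bin, while the quantile-based bins stay balanced.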

![separator2](https://i.imgur.com/4gX5WFr.png)