# Subdividing and categorising data

Continuous data is often divided into discrete intervals (bins) or otherwise grouped for analysis.

Suppose you have data on a group of people in a study that you want to divide into discrete age groups. To illustrate this, we generate a DataFrame with 250 random integer ages from `0` to `98`; the upper bound `99` passed to `np.random.randint` is exclusive:

In [1]:
import pandas as pd
import numpy as np

ages = np.random.randint(0, 99, 250)
df = pd.DataFrame({"Age": ages})

df

|     | Age |
| --- | --- |
| 0   | 29  |
| 1   | 91  |
| 2   | 45  |
| 3   | 16  |
| 4   | 22  |
| ... | ... |
| 245 | 87  |
| 246 | 59  |
| 247 | 93  |
| 248 | 13  |
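Because the ages are drawn at random, every run produces different numbers. The following is a minimal sketch of a repeatable variant; the seed `42` and the names `rng`, `ages_seeded` and `df_seeded` are assumptions made for illustration and are not used elsewhere in this section.

```python
import numpy as np
import pandas as pd

# Assumption: the seed 42 is arbitrary and only makes the draw repeatable.
rng = np.random.default_rng(42)

# Like np.random.randint, Generator.integers excludes the upper bound
# by default, so this draws whole numbers from 0 to 98.
ages_seeded = rng.integers(0, 99, size=250)
df_seeded = pd.DataFrame({"Age": ages_seeded})

print(df_seeded["Age"].min(), df_seeded["Age"].max(), len(df_seeded))
```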


pandas then offers a simple way to divide the values into ten equal-width ranges with [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). To display the interval edges rounded to whole years, we additionally set `precision=0`:

In [2]:
cats = pd.cut(ages, 10, precision=0)

cats

[(20.0, 29.0], (88.0, 98.0], (39.0, 49.0], (10.0, 20.0], (20.0, 29.0], ..., (78.0, 88.0], (59.0, 69.0], (88.0, 98.0], (10.0, 20.0], (49.0, 59.0]]
Length: 250
Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]

With [pandas.Categorical.categories](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html) you can display the categories:

In [3]:
cats.categories

IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0], (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0], (78.0, 88.0], (88.0, 98.0]], dtype='interval[float64, right]')

… or even just a single category:

In [4]:
cats.categories[0]

Interval(-0.1, 10.0, closed='right')
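Each category is a `pandas.Interval` object whose edges and membership can be inspected. As a small sketch, the interval from the output above is recreated by hand here; the name `first` is only illustrative.

```python
import pandas as pd

# Recreate the first category from the output above by hand.
first = pd.Interval(-0.1, 10.0, closed="right")

print(first.left, first.right)  # the two edges of the interval
print(10 in first)              # the right edge belongs to the interval
print(-0.1 in first)            # the left edge does not
```

With `closed='right'`, an age of exactly `10` therefore still belongs to the first range.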

With [pandas.Categorical.codes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.codes.html) you can display an array that contains, for each value, the integer code of its category, i.e. the position of that category in `cats.categories`:

In [5]:
cats.codes

array([2, 9, 4, 1, 2, 9, 7, 2, 7, 6, 0, 6, 0, 3, 3, 9, 2, 4, 0, 2, 0, 6,
       4, 5, 5, 0, 7, 9, 5, 7, 1, 1, 4, 6, 0, 7, 9, 9, 3, 4, 9, 2, 5, 2,
       4, 9, 8, 9, 2, 0, 9, 5, 4, 9, 3, 4, 3, 3, 8, 7, 2, 5, 3, 2, 2, 9,
       3, 7, 9, 2, 8, 8, 8, 4, 6, 1, 4, 3, 4, 3, 4, 6, 8, 1, 7, 2, 0, 4,
       1, 8, 3, 5, 2, 5, 9, 9, 1, 4, 7, 6, 5, 2, 9, 9, 5, 3, 1, 1, 7, 3,
       3, 7, 8, 4, 6, 3, 0, 2, 6, 8, 4, 6, 6, 5, 0, 4, 3, 5, 4, 0, 1, 6,
       4, 1, 4, 5, 3, 4, 9, 6, 6, 9, 9, 8, 4, 7, 1, 1, 4, 6, 9, 7, 5, 3,
       3, 4, 2, 6, 5, 4, 3, 7, 7, 8, 5, 5, 4, 9, 1, 3, 5, 3, 3, 9, 0, 8,
       7, 8, 9, 1, 5, 7, 8, 6, 0, 1, 4, 2, 2, 8, 7, 4, 2, 0, 6, 2, 7, 4,
       1, 4, 9, 1, 7, 6, 0, 7, 7, 0, 9, 6, 8, 3, 4, 4, 2, 3, 5, 2, 6, 3,
       6, 2, 2, 9, 8, 9, 7, 1, 9, 9, 0, 5, 0, 2, 0, 5, 2, 5, 5, 3, 9, 7,
       3, 0, 0, 8, 6, 9, 1, 5], dtype=int8)
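The relationship between codes and categories can be sketched on a small example. The values in `sample_ages` and the explicit bin edges are assumptions chosen to keep the result easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values and explicit bin edges.
sample_ages = np.array([3, 17, 42, 42, 98])
sample_cats = pd.cut(sample_ages, [0, 20, 40, 60, 80, 100])

# Each code is the position of the value's category in .categories ...
codes = sample_cats.codes

# ... so indexing the categories with the codes recovers the intervals.
recovered = sample_cats.categories[codes]

print(codes)
print(recovered)
```

Values that fall outside all bins are treated as missing and receive the code `-1`, so this lookup is only meaningful when every value lies inside a bin.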

With `value_counts` we can now look at how many values fall into each range:

In [6]:
pd.Series(cats).value_counts()

(39.0, 49.0]    32
(88.0, 98.0]    32
(29.0, 39.0]    28
(20.0, 29.0]    27
(49.0, 59.0]    25
(69.0, 78.0]    24
(59.0, 69.0]    23
(-0.1, 10.0]    21
(10.0, 20.0]    20
(78.0, 88.0]    18
dtype: int64
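The counts above are ordered by frequency. If the order of the intervals is more useful, the result can be sorted by its index, as in this sketch with hypothetical sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values and explicit bin edges.
sample_ages = np.array([5, 15, 15, 25, 25, 25])
sample_cats = pd.cut(sample_ages, [0, 10, 20, 30])

counts = pd.Series(sample_cats).value_counts()  # most frequent range first
by_interval = counts.sort_index()               # ranges in ascending order

print(counts)
print(by_interval)
```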

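It is striking that the displayed age ranges do not all appear to span the same number of years: `(20.0, 29.0]` and `(69.0, 78.0]` seem to cover only 9 years. This is because the ages only extend from `0` to `98`, so `pandas.cut` divides this span into ten bins that are each 9.8 years wide, and `precision=0` rounds the displayed edges to whole numbers. We can check the minimum and maximum: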

In [7]:
df.min()

Age    0
dtype: int64

In [8]:
df.max()

Age    98
dtype: int64
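The unrounded bin edges can be inspected with the `retbins=True` option of `pandas.cut`. This sketch assumes sample data that simply covers every age from `0` to `98` once; only the minimum and maximum matter for the edges.

```python
import numpy as np
import pandas as pd

# Assumption: any data with minimum 0 and maximum 98 gives the same edges.
sample_ages = np.arange(0, 99)

# retbins=True additionally returns the array of computed bin edges.
_, edges = pd.cut(sample_ages, 10, retbins=True)
widths = np.diff(edges)

print(edges)
print(widths)
```

All bins except the first are 9.8 wide; the lowest edge is moved slightly below the minimum so that the smallest value is included in the first bin.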

With [pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html), on the other hand, the values are divided at the sample quantiles, so that each range contains approximately the same number of values:

In [9]:
cats = pd.qcut(ages, 10, precision=0)

In [10]:
pd.Series(cats).value_counts()

(12.0, 23.0]    29
(32.0, 41.0]    28
(78.0, 91.0]    28
(56.0, 68.0]    26
(-1.0, 12.0]    25
(48.0, 56.0]    24
(68.0, 78.0]    24
(41.0, 48.0]    23
(91.0, 98.0]    22
(23.0, 32.0]    21
dtype: int64
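Instead of a number of bins, `pandas.qcut` also accepts a list of quantiles, optionally together with labels. The following sketch uses hypothetical data (each age from 1 to 100 exactly once) and illustrative labels `Q1` to `Q4`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the ages 1 to 100, each occurring once.
sample_ages = np.arange(1, 101)

# Split at the quartiles and name the four resulting groups.
quartiles = pd.qcut(
    sample_ages,
    [0, 0.25, 0.5, 0.75, 1.0],
    labels=["Q1", "Q2", "Q3", "Q4"],
)

# sort=False keeps the groups in category order instead of by frequency.
counts = pd.Series(quartiles).value_counts(sort=False)
print(counts)
```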

If we want to ensure that each age group actually includes exactly ten years, we can specify this directly with [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html):

In [11]:
age_groups = [f"{i} - {i + 9}" for i in range(0, 99, 10)]
cats = pd.Categorical(age_groups)

cats.categories

Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',
       '60 - 69', '70 - 79', '80 - 89', '90 - 99'],
      dtype='object')

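For grouping we can now use [pandas.cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) with explicit bin edges. The number of labels must be one less than the number of edges, and `right=False` makes each range include its left edge and exclude its right edge, so that for example an age of `10` falls into `10 - 19`: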

In [12]:
df["Age group"] = pd.cut(df.Age, range(0, 101, 10), right=False, labels=cats)

df

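|     | Age | Age group |
| --- | --- | --------- |
| 0   | 29  | 20 - 29   |
| 1   | 91  | 90 - 99   |
| 2   | 45  | 40 - 49   |
| 3   | 16  | 10 - 19   |
| 4   | 22  | 20 - 29   |
| ... | ... | ...       |
| 245 | 87  | 80 - 89   |
| 246 | 59  | 50 - 59   |
| 247 | 93  | 90 - 99   |
| 248 | 13  | 10 - 19   |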
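Once the age groups are stored as a column, they can be used like any other categorical key, for example to count people per group. The sketch below assumes a small hypothetical DataFrame named `sample`, with ages chosen to sit on the bin boundaries:

```python
import pandas as pd

# Hypothetical ages on and around the bin boundaries.
sample = pd.DataFrame({"Age": [9, 10, 19, 20, 98]})

edges = range(0, 101, 10)                               # 11 edges
labels = [f"{i} - {i + 9}" for i in range(0, 100, 10)]  # 10 labels

# right=False closes each bin on the left: [0, 10), [10, 20), ...
sample["Age group"] = pd.cut(sample["Age"], edges, right=False, labels=labels)

# observed=False keeps age groups without any members in the result.
per_group = sample.groupby("Age group", observed=False)["Age"].count()

print(sample)
print(per_group)
```

Because the bins are closed on the left, `10` and `20` each open a new group rather than closing the previous one.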
