# Categorical data

`Categorical` is a pandas data type that corresponds to categorical variables in statistics.

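Unlike categorical variables in statistics, a pandas `Categorical` may have an order (for example, 'strongly agree' vs. 'agree', or 'primary discoverer' vs. 'secondary discoverer'), but numerical operations on it are not possible.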
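The distinction between "ordered" and "numeric" can be illustrated with a minimal sketch. The survey labels and variable names below are hypothetical and chosen only for illustration. The sketch assumes an ordered `CategoricalDtype`, which allows comparisons and `min()`/`max()` while arithmetic still fails.

```python
import pandas as pd

# Hypothetical survey answers; the category names are illustrative only.
levels = ["disagree", "agree", "strongly agree"]
answers = pd.Series(
    ["agree", "strongly agree", "disagree", "agree"],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True),
)

# Order-based operations follow the declared category order.
lowest = answers.min()
highest = answers.max()
above_agree = (answers > "agree").tolist()

# Arithmetic is rejected: the categories are labels, not numbers.
try:
    answers + 1
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True

print(lowest, highest, above_agree, arithmetic_failed)
```

Without `ordered=True`, the comparison and `min()`/`max()` calls would also raise, since an unordered categorical defines no ranking among its labels.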

## Object creation

### Series creation

In [8]:
import pandas as pd
import numpy as np

In [2]:
# By specifying `dtype='category'` when constructing a Series
s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
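To see what the `category` dtype stores, the `.cat` accessor can be used to inspect the result. The following is a small sketch rather than part of the original notebook: it rebuilds the same Series and reads back the distinct labels and the integer code assigned to each row.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct labels, in category order.
categories = list(s.cat.categories)

# One integer per row: the position of that row's label in `categories`.
codes = s.cat.codes.tolist()

print(categories, codes)
```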

In [6]:
# By converting an existing Series or column to a category dtype
df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
df['B'] = df['A'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A       4 non-null      object  
 1   B       4 non-null      category
dtypes: category(1), object(1)
memory usage: 268.0+ bytes
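A common motivation for this conversion is memory: a categorical column stores each distinct label once, plus a small integer code per row. The sketch below uses a hypothetical 10,000-row column with made-up labels to compare the two columns. The exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 10,000 rows drawn from only three distinct labels.
df = pd.DataFrame({"A": ["alpha", "beta", "gamma", "alpha"] * 2500})
df["B"] = df["A"].astype("category")

# deep=True asks pandas to include the string contents of object columns.
usage = df.memory_usage(deep=True)
plain_bytes = usage["A"]
category_bytes = usage["B"]

print(plain_bytes, category_bytes)
```

The saving shrinks as the number of distinct labels approaches the number of rows, so the conversion pays off mainly for columns with many repeated values.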


In [21]:
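# By using special functions, such as `cut()`, which bins numeric values into categories
# `value` is random, so the binned rows differ between runs; only the dtypes are stable
df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
# Ten labels ("0 - 9", ..., "90 - 99") for the ten half-open bins [0, 10), ..., [90, 100)
labels = [f"{i} - {i + 9}" for i in range(0, 100, 10)]
df['group'] = pd.cut(df['value'], range(0, 105, 10), right=False, labels=labels)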
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   value   20 non-null     int64   
 1   group   20 non-null     category
dtypes: category(1), int64(1)
memory usage: 708.0 bytes
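Because the cell above uses random input, it is hard to see where boundary values land. The sketch below reuses the same edges and labels on a few hand-picked values (chosen here for illustration) to show the effect of `right=False`, which makes each bin closed on the left and open on the right.

```python
import pandas as pd

# Hand-picked boundary values instead of random data.
values = pd.Series([0, 9, 10, 99])
labels = [f"{i} - {i + 9}" for i in range(0, 100, 10)]

# right=False gives bins [0, 10), [10, 20), ..., [90, 100).
groups = pd.cut(values, range(0, 105, 10), right=False, labels=labels)

result = groups.astype(str).tolist()
print(result)  # ['0 - 9', '0 - 9', '10 - 19', '90 - 99']
```

With the default `right=True`, the value 10 would fall in the first bin and 0 would fall outside every bin, which is why the labels above are paired with `right=False`.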


In [22]:
# Passing an integer instead of bin edges splits the data range into that many equal-width bins
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)

[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
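The lower edge `0.994` in the output above comes from `cut()` widening the range slightly so that the minimum value is included in the first bin. As a small sketch, `retbins=True` returns the computed edges alongside the result, which makes this easier to inspect. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([1, 7, 5, 4, 6, 3])

# retbins=True also returns the bin edges that cut() computed.
binned, edges = pd.cut(data, 3, retbins=True)

# Position of each value's interval among the three bins.
codes = binned.codes.tolist()

print(edges)
print(codes)
```

The data span 1 to 7, so each bin is 2 wide, and the first edge is moved down by 0.1% of the range (0.006) to 0.994.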