- Author: 东哥起飞
- WeChat official account: Python数据科学
- [pandas data cleaning](https://mp.weixin.qq.com/mp/appmsgalbum?action=getalbum&__biz=MzUzODYwMDAzNA==&scene=1&album_id=2217035551810125830&count=3#wechat_redirect)

In [54]:
import pandas as pd
import numpy as np

## 1. Introduction to categorical data

### Creating categorical data

In [4]:
s = pd.Series(['a','b','c'],dtype='category')
s

0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

### Automatic conversion to categorical data

In [10]:
pd.Series(pd.cut(range(1,10,2),3))

0    (0.992, 3.667]
1    (0.992, 3.667]
2    (3.667, 6.333]
3      (6.333, 9.0]
4      (6.333, 9.0]
dtype: category
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
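`pd.cut` can also produce named categories instead of interval labels. The following is a minimal sketch, assuming hypothetical age data and bin edges that are not part of the original notebook; it passes `labels=` so that each bin gets a readable name.

```python
import pandas as pd

# Hypothetical sample data and bin edges, chosen only for illustration.
ages = pd.Series([5, 17, 30, 70])

# Bins are right-closed by default: (0, 18], (18, 65], (65, 120].
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

print(age_group.dtype)     # a categorical dtype with three named categories
print(age_group.tolist())
```

The result is an ordered categorical, so the bins keep their natural order when sorted or compared.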

### Converting to the categorical dtype

In [11]:
s = pd.Series(['a','b','c'])
s

0    a
1    b
2    c
dtype: object

In [12]:
s.astype('category')

0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

### Custom categorical dtypes

In [19]:
from pandas.api.types import CategoricalDtype
# Custom, ordered categorical dtype; values outside the categories (here 'd') become NaN
c = CategoricalDtype(categories=['a','b','c'], ordered=True)
pd.Series(list('abcabd'),dtype=c)

0      a
1      b
2      c
3      a
4      b
5    NaN
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
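An ordered dtype is what makes comparisons and `min`/`max` meaningful. The sketch below assumes a hypothetical `size_dtype` with made-up size labels; it is an illustration, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordered dtype: small < medium < large.
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "small"], dtype=size_dtype)

# min/max follow the declared category order, not alphabetical order.
smallest = sizes.min()
largest = sizes.max()

# Comparing against a scalar that is one of the categories also uses that order.
at_least_medium = sizes >= "medium"
print(smallest, largest, at_least_medium.tolist())
```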

## 2. Working with categorical data

### Renaming categories

In [42]:
s = pd.Series(['a','b','c'],dtype='category')
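# Rename the categories to x, y, z (assigning to s.cat.categories directly is not supported in recent pandas versions)
s = s.cat.rename_categories(['x','y','z'])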
s

0    x
1    y
2    z
dtype: category
Categories (3, object): ['x', 'y', 'z']

In [26]:
s.cat.rename_categories(['m','n','o'])

0    m
1    n
2    o
dtype: category
Categories (3, object): ['m', 'n', 'o']

In [29]:
s.cat.rename_categories({'x':'m','y':'n','z':'o'})

0    m
1    n
2    o
dtype: category
Categories (3, object): ['m', 'n', 'o']

### Adding new categories

In [43]:
s = s.cat.add_categories(['r','t'])
s

0    x
1    y
2    z
dtype: category
Categories (5, object): ['x', 'y', 'z', 'r', 't']
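Adding a category matters because a categorical Series only accepts values that are already among its categories. A minimal sketch of that behaviour follows; the exact exception type may differ between pandas versions, so the example catches both `TypeError` and `ValueError`.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

# Assigning a value that is not an existing category is rejected.
try:
    s.iloc[0] = "d"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Register the new category first; then the assignment succeeds.
s = s.cat.add_categories(["d"])
s.iloc[0] = "d"
print(rejected, s.tolist(), list(s.cat.categories))
```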

### Removing categories

In [36]:
s.cat.remove_categories(['r','t'])

0    x
1    y
2    z
dtype: category
Categories (3, object): ['x', 'y', 'z']

In [48]:
s = s.cat.remove_unused_categories()

### Ordering

In [51]:
s.cat.as_ordered()

0    x
1    y
2    z
dtype: category
Categories (3, object): ['x' < 'y' < 'z']

In [52]:
s.cat.reorder_categories(['y','x','z'], ordered=True)

0    x
1    y
2    z
dtype: category
Categories (3, object): ['y' < 'x' < 'z']
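The category order also drives sorting. The sketch below uses hypothetical priority labels (not from the original notebook) to show that `sort_values` follows the category order rather than the alphabetical order of the values.

```python
import pandas as pd

s = pd.Series(["low", "high", "medium", "low"], dtype="category")

# Inferred categories are sorted lexically: high, low, medium.
alphabetical = s.sort_values().tolist()

# Declare the logical order explicitly; sorting now follows it.
s_ordered = s.cat.reorder_categories(["low", "medium", "high"], ordered=True)
logical = s_ordered.sort_values().tolist()

print(alphabetical)
print(logical)
```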

## 3. Why use the category dtype?

In [56]:
df_size = 100000
df1 = pd.DataFrame(
    {
        "float_1": np.random.rand(df_size),
        "species": np.random.choice(["cat", "dog", "ape", "gorilla"], size=df_size),
    }
)
df1_cat = df1.astype({"species": "category"})

In [58]:
df1.memory_usage(deep=True)

Index          128
float_1     800000
species    6100932
dtype: int64

In [59]:
df1_cat.memory_usage(deep=True)

Index         128
float_1    800000
species    100404
dtype: int64
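The saving comes from how a categorical column is stored: each distinct value is kept once in `categories`, and every row holds only a small integer code. The following sketch, which builds its own sample Series with a seeded generator (an assumption for reproducibility), makes those two parts visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
species = pd.Series(rng.choice(["cat", "dog", "ape", "gorilla"], size=100_000))
species_cat = species.astype("category")

# One small integer per row, plus one copy of each distinct string.
codes = species_cat.cat.codes
print(codes.dtype)
print(list(species_cat.cat.categories))

plain_bytes = species.memory_usage(deep=True)
category_bytes = species_cat.memory_usage(deep=True)
print(plain_bytes, category_bytes)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so the dtype suits low-cardinality columns best.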

## Pitfalls when using category

### 1. Operations on category columns

In [60]:
# object (str) column
%timeit df1["species"].str.upper()

37.4 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [61]:
# category column
%timeit df1_cat["species"].str.upper()

3.23 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [62]:
# The category dtype is lost: the result uses as much memory as the object column
df1_cat["species"].str.upper().memory_usage(deep=True)

6101060

In [63]:
# The right way: transform the categories themselves
%timeit df1_cat["species"].cat.rename_categories(str.upper)

331 µs ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
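The dtype difference between the two approaches can be checked directly. A minimal sketch on a small hypothetical Series:

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat", "ape"], dtype="category")

# The .str accessor returns a plain string column: the categorical dtype is gone.
via_str = s.str.upper()

# rename_categories applies the function once per category and keeps the dtype.
via_rename = s.cat.rename_categories(str.upper)

print(via_str.dtype)
print(via_rename.dtype)
```

One caveat, as far as I know: the renamed categories must stay unique, so a function that maps two categories to the same value would not work with `rename_categories`.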


### 2. Merging with category columns

In [64]:
# A second dataset
df2 = pd.DataFrame(
    {
        "species": ["cat", "dog", "ape", "gorilla", "snake"],
        "habitat": ["house", "house", "jungle", "jungle", "jungle"],
    }
)
df2_cat = df2.astype({"species": "category", "habitat": "category"})

In [65]:
# Merge an object column with a category column
df1.merge(df2_cat, on="species").dtypes

float_1     float64
species      object
habitat    category
dtype: object

In [66]:
# Merge two category columns
df1_cat.merge(df2_cat, on="species").dtypes

float_1     float64
species      object
habitat    category
dtype: object

- `category1 + category2 = object`

- `category1 + category1 = category1`

**The fix: make the two category dtypes exactly the same by casting one column to the other's dtype.**

In [68]:
df1_cat.astype({"species": df2_cat["species"].dtype}).merge(
    df2_cat, on="species"
).dtypes

float_1     float64
species    category
habitat    category
dtype: object
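When neither side's categories cover the other's, one option is to build a shared dtype from the union of both value sets and cast both columns to it before merging. This is a sketch under that assumption, using small hypothetical frames (`left`, `right`) rather than the notebook's data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical frames for illustration.
left = pd.DataFrame({"species": ["cat", "dog", "cat"], "weight": [4.0, 20.0, 5.0]})
right = pd.DataFrame(
    {"species": ["cat", "dog", "snake"], "habitat": ["house", "house", "jungle"]}
)

# One dtype covering every value seen on either side.
all_species = sorted(set(left["species"]) | set(right["species"]))
shared = CategoricalDtype(categories=all_species)

merged = left.astype({"species": shared}).merge(
    right.astype({"species": shared}), on="species"
)

# For comparison: each side inferring its own categories loses the dtype.
mismatched = left.astype({"species": "category"}).merge(
    right.astype({"species": "category"}), on="species"
)

print(merged["species"].dtype)
print(mismatched["species"].dtype)
```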

### 3. Grouping by category columns

In [69]:
habitat_df = (
    df1_cat.astype({"species": df2_cat["species"].dtype})
           .merge(df2_cat, on="species")
)
house_animals_df = habitat_df.loc[habitat_df["habitat"] == "house"]

In [71]:
house_animals_df.groupby("species")["float_1"].mean()

species
ape             NaN
cat        0.501222
dog        0.500626
gorilla         NaN
snake           NaN
Name: float_1, dtype: float64

**The fix**: pass `observed=True` to `groupby`, which ensures the result contains only the groups that actually occur in the data.
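The effect of `observed` is easiest to see on a tiny frame. The sketch below uses hypothetical data in which the `snake` category never occurs, and passes `observed` explicitly in both calls so the result does not depend on the default of the installed pandas version.

```python
import pandas as pd

# Hypothetical data: 'snake' is a declared category but has no rows.
df = pd.DataFrame(
    {
        "species": pd.Categorical(
            ["cat", "dog", "cat"], categories=["cat", "dog", "snake"]
        ),
        "score": [1.0, 2.0, 3.0],
    }
)

# One row per declared category; the unobserved one gets NaN.
all_groups = df.groupby("species", observed=False)["score"].mean()

# Only the categories that actually appear in the data.
seen_groups = df.groupby("species", observed=True)["score"].mean()

print(all_groups)
print(seen_groups)
```

Calling `.cat.remove_unused_categories()` on the column before grouping is another way to get the same groups.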

### 4. Indexing with category columns

In [72]:
species_df = habitat_df.groupby(["habitat", "species"], observed=True)["float_1"].mean().unstack()
species_df

species      ape       cat       dog   gorilla
habitat
jungle   0.50225       NaN       NaN  0.500874
house        NaN  0.501222  0.500626       NaN


In [73]:
species_df["new_col"] = 1

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

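**Why it fails**: `species` and `habitat` are both `category` dtype at this point. `.unstack()` moves the `species` index level into the column index (similar to a `pivot`), so the columns become a `CategoricalIndex`, and a column name that is not one of its categories cannot be inserted.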
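One possible workaround, sketched here on a small hypothetical frame rather than the notebook's data, is to convert the column index to a plain object index before adding the new column. Whether the original error is raised at all may depend on the pandas version; the conversion is harmless either way.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "habitat": pd.Categorical(["house", "house", "jungle"]),
        "species": pd.Categorical(["cat", "dog", "ape"]),
        "score": [1.0, 2.0, 3.0],
    }
)

wide = df.groupby(["habitat", "species"], observed=True)["score"].mean().unstack()

# Replace the categorical column index with a plain object index,
# so any new label can be added as a column.
wide.columns = wide.columns.astype(object)
wide["new_col"] = 1

print(wide)
```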

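## Summary

- **Transforming category columns**: operate on the categories themselves rather than on the values. This preserves the categorical dtype and improves performance.

- **Merging category columns**: to keep the `category` dtype after a merge, the categorical dtypes of the merge columns must match exactly in every `DataFrame`.

- **Grouping by category columns**: by default, you get a result for every category in the dtype, even those that do not occur in the data. Set `observed=True` to change this.

- **Indexing with category columns**: when an index has `category` dtype, watch out for unexpected interactions with the categorical data.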