## Categorical Data

In [1]:
import numpy as np
import pandas as pd


### The cat object
pandas provides the category dtype for working with categorical variables. An ordinary Series can be converted to a categorical one with the astype method.

In [3]:
data=pd.read_csv("C:/Users/gen'ch/pandas学习/joyful-pandas-master/data/learn_pandas.csv",
                usecols=['Grade', 'Name', 'Gender', 'Height', 'Weight'])

In [5]:
s = data.Grade.astype('category')

In [6]:
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

In [7]:
s.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x000001EB83238D60>

In [8]:
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

In [9]:
s.cat.codes.head()

0    0
1    0
2    2
3    3
4    3
dtype: int8
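The relationship between categories and codes can be sketched without the CSV file. The small `grades` Series below is a hypothetical stand-in for the Grade column, so the categories differ from the ones shown above.

```python
import pandas as pd

# Hypothetical stand-in for the Grade column of learn_pandas.csv
grades = pd.Series(['Freshman', 'Freshman', 'Senior', 'Sophomore', 'Sophomore'])

s_demo = grades.astype('category')

# Categories are inferred from the unique values and sorted lexically
categories = list(s_demo.cat.categories)

# Each element is stored as an integer code indexing into the categories
codes = s_demo.cat.codes.tolist()

# Mapping the codes back through the categories recovers the original values
recovered = [categories[c] for c in codes]

print(categories)  # ['Freshman', 'Senior', 'Sophomore']
print(codes)       # [0, 0, 1, 2, 2]
```

Storing small integer codes instead of repeated strings is what makes the category dtype compact for columns with few distinct values.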

### Adding, removing, and modifying categories


In [10]:
s = s.cat.add_categories('Graduate')

In [11]:
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

In [12]:
s = s.cat.remove_categories('Freshman')

In [13]:
s.cat.categories

Index(['Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

In [14]:
s.head()

0          NaN
1          NaN
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Junior', 'Senior', 'Sophomore', 'Graduate']

In [15]:
s = s.cat.set_categories(['Sophomore','PhD'])

In [16]:
s.cat.categories

Index(['Sophomore', 'PhD'], dtype='object')

In [17]:
s.head()

0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'PhD']

In [18]:
s = s.cat.remove_unused_categories()

In [19]:
s.cat.categories

Index(['Sophomore'], dtype='object')

In [20]:
s = s.cat.rename_categories({'Sophomore':'本科二年级学生'})

In [21]:
s.head()

0        NaN
1        NaN
2        NaN
3    本科二年级学生
4    本科二年级学生
Name: Grade, dtype: category
Categories (1, object): ['本科二年级学生']
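The same sequence of operations can be tried on a tiny Series to see how each one affects the categories and the values. This is a minimal sketch on hypothetical data; the label 'Second-year' is made up for illustration.

```python
import pandas as pd

s_demo = pd.Series(['Freshman', 'Senior', 'Sophomore']).astype('category')

# add_categories appends a new category that no value uses yet
s_demo = s_demo.cat.add_categories('Graduate')
after_add = list(s_demo.cat.categories)

# remove_categories drops a category; values that used it become NaN
s_demo = s_demo.cat.remove_categories('Freshman')
n_missing = int(s_demo.isna().sum())

# remove_unused_categories discards categories that no value refers to
s_demo = s_demo.cat.remove_unused_categories()
after_cleanup = list(s_demo.cat.categories)

# rename_categories with a dict renames only the listed categories
s_demo = s_demo.cat.rename_categories({'Sophomore': 'Second-year'})
after_rename = list(s_demo.cat.categories)
```

Each method returns a new Series, which is why the result is assigned back to `s_demo` at every step.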

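## Ordered categories
Ordered and unordered categories can be converted into each other with as_unordered and reorder_categories. Note that the list passed to reorder_categories must consist of exactly the current categories of the series: it can neither add new categories nor omit existing ones. The argument ordered=True must also be specified; otherwise the categories are rearranged but the series remains unordered.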

In [23]:
s = data.Grade.astype('category')

In [24]:
s = s.cat.reorder_categories(['Freshman', 'Sophomore',
                              'Junior', 'Senior'],ordered=True)

In [25]:
s.head()


0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

In [26]:
s.cat.as_unordered().head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']
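The constraints on reorder_categories described in this section can be checked with a short sketch on hypothetical data: it contrasts a call with and without ordered=True and shows that an incomplete category list is rejected.

```python
import pandas as pd

s_demo = pd.Series(['Freshman', 'Senior', 'Sophomore', 'Junior']).astype('category')
order = ['Freshman', 'Sophomore', 'Junior', 'Senior']

# Without ordered=True the categories are rearranged but stay unordered
rearranged = s_demo.cat.reorder_categories(order)

# With ordered=True the series becomes an ordered categorical
ordered = s_demo.cat.reorder_categories(order, ordered=True)

# as_unordered drops the ordering while keeping the category sequence
unordered_again = ordered.cat.as_unordered()

# A list that omits existing categories is rejected
try:
    s_demo.cat.reorder_categories(['Freshman', 'Senior'], ordered=True)
    raised = False
except ValueError:
    raised = True
```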

### Sorting and comparison
Once a column is converted to the category dtype and given an ordering, sort_index and sort_values sort it by that ordering.

In [28]:
data.Grade = data.Grade.astype('category')

In [30]:
data.Grade = data.Grade.cat.reorder_categories(['Freshman',
                                            'Sophomore',
                                            'Junior',
                                            'Senior'],ordered=True)

In [31]:
data.sort_values('Grade').head()

        Grade           Name  Gender  Height  Weight
0    Freshman   Gaopeng Yang  Female   158.9    46.0
105  Freshman      Qiang Shi  Female   164.5    52.0
96   Freshman  Changmei Feng  Female   163.8    56.0
88   Freshman   Xiaopeng Han  Female   164.1    53.0
81   Freshman    Yanli Zhang  Female   165.1    52.0


In [32]:
data.set_index('Grade').sort_index().head()

                   Name  Gender  Height  Weight
Grade
Freshman   Gaopeng Yang  Female   158.9    46.0
Freshman      Qiang Shi  Female   164.5    52.0
Freshman  Changmei Feng  Female   163.8    56.0
Freshman   Xiaopeng Han  Female   164.1    53.0
Freshman    Yanli Zhang  Female   165.1    52.0


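Comparisons on categorical variables fall into two groups. The first is equality comparison with == or !=, where the other operand can be a scalar or a Series (or list) of the same length. The second is order comparison with >, >=, <, and <=. The operands are similar to the first case, but every element being compared must belong to the categories of the original series, and a Series operand must have the same index as the original series.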

In [33]:
res1 = data.Grade == 'Sophomore'

In [34]:
res1.head()

0    False
1    False
2    False
3     True
4     True
Name: Grade, dtype: bool

In [36]:
res2 = data.Grade == ['PhD']*data.shape[0]

In [37]:
res2.head()

0    False
1    False
2    False
3    False
4    False
Name: Grade, dtype: bool

In [39]:
res3 = data.Grade <= 'Sophomore'

In [40]:
res3.head()

0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool

In [41]:
res4 = data.Grade <= data.Grade.sample(frac=1).reset_index(drop=True) # compare against a shuffled copy

In [42]:
res4.head()

0     True
1     True
2    False
3    False
4     True
Name: Grade, dtype: bool
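The difference between the two kinds of comparison shows up most clearly with a value that is not among the categories. The sketch below uses a small hypothetical Series rather than the CSV data: equality against an unknown value is simply False everywhere, while an order comparison against it raises a TypeError.

```python
import pandas as pd

order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
grades = pd.Series(['Senior', 'Freshman', 'Junior', 'Sophomore']).astype('category')
grades = grades.cat.reorder_categories(order, ordered=True)

# Sorting follows the category order, not the alphabetical order
sorted_values = grades.sort_values().tolist()

# Order comparison against a scalar that is one of the categories
is_lower = (grades <= 'Sophomore').tolist()

# Equality against a value outside the categories is False everywhere
is_phd = (grades == 'PhD').tolist()

# Order comparison against a value outside the categories is an error
try:
    grades <= 'PhD'
    raised = False
except TypeError:
    raised = True
```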

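## Interval categories
### Constructing intervals with cut and qcut
In practical data analysis, interval-valued series are usually constructed with cut and qcut. Both functions bin the numeric values of the original series, replacing each value with the interval it falls into.

### Common usage of cut
The most important parameter is bins. If an integer n is passed, the range between the minimum and maximum of the input is split into n equal-width intervals. Because intervals are open on the left and closed on the right by default, the minimum would otherwise be excluded, so pandas subtracts 0.001*(max-min) from the left endpoint of the lowest interval. For example, splitting the series [1,2] into 2 bins gives (0.999,1.5] for the first bin and (1.5,2] for the second. To get intervals closed on the left and open on the right, set right to False; the corresponding adjustment adds 0.001*(max-min) to the right endpoint of the highest interval.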

In [43]:
s = pd.Series([1,2])

In [44]:
pd.cut(s, bins=2)

0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

In [45]:
pd.cut(s, bins=2, right=False)

0      [1.0, 1.5)
1    [1.5, 2.001)
dtype: category
Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)]

In [46]:
pd.cut(s, bins=[-np.inf, 1.2, 1.8, 2.2, np.inf])

0    (-inf, 1.2]
1     (1.8, 2.2]
dtype: category
Categories (4, interval[float64]): [(-inf, 1.2] < (1.2, 1.8] < (1.8, 2.2] < (2.2, inf]]

In [47]:
s = pd.Series([1,2])

In [48]:
res = pd.cut(s, bins=2, labels=['small', 'big'], retbins=True)

In [49]:
res[0]

0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

In [50]:
res[1]

array([0.999, 1.5  , 2.   ])
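The endpoint adjustment can be verified directly from the edges returned by retbins=True. The following is a minimal sketch on the same two-element series; the explicit bin edges [0, 1.5, 3] in the last call are chosen only for illustration.

```python
import pandas as pd

s_demo = pd.Series([1, 2])

# Default right=True: the lowest edge is lowered by 0.001 * (max - min)
binned, edges = pd.cut(s_demo, bins=2, retbins=True)
left_edge = round(float(edges[0]), 3)

# right=False: the highest edge is raised by the same amount instead
binned_left, edges_left = pd.cut(s_demo, bins=2, right=False, retbins=True)
right_edge = round(float(edges_left[-1]), 3)

# Explicit edges with labels replace the intervals by names
labelled = pd.cut(s_demo, bins=[0, 1.5, 3], labels=['small', 'big']).tolist()

print(left_edge, right_edge, labelled)
```

With explicit edges no adjustment is applied, so values equal to the lowest edge fall outside every bin unless include_lowest=True is passed.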

In [51]:
s = data.Weight

In [52]:
pd.qcut(s, q=3).head()

0    (33.999, 48.0]
1      (55.0, 89.0]
2      (55.0, 89.0]
3    (33.999, 48.0]
4      (55.0, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 48.0] < (48.0, 55.0] < (55.0, 89.0]]

In [53]:
pd.qcut(s, q=[0,0.2,0.8,1]).head()

0      (44.0, 69.4]
1      (69.4, 89.0]
2      (69.4, 89.0]
3    (33.999, 44.0]
4      (69.4, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 44.0] < (44.0, 69.4] < (69.4, 89.0]]
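Unlike cut, qcut places the bin edges at quantiles, so the bins hold equal shares of the data rather than having equal width. A small sketch on hypothetical values 1 to 9 makes the counts easy to check by hand; the labels 'low', 'mid', and 'high' are made up for illustration.

```python
import pandas as pd

values = pd.Series(range(1, 10))  # hypothetical data: 1, 2, ..., 9

# An integer q splits the data into q bins with equal counts
thirds = pd.qcut(values, q=3)
counts = thirds.value_counts().tolist()

# A list of quantiles sets the cut points explicitly: bottom 20%,
# middle 60%, and top 20% of the values
tails = pd.qcut(values, q=[0, 0.2, 0.8, 1], labels=['low', 'mid', 'high'])
tail_counts = tails.value_counts().to_dict()

print(counts)       # [3, 3, 3]
print(tail_counts)
```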

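### Constructing general intervals
A single interval is defined by three elements: its left endpoint, its right endpoint, and which endpoints are closed. The closed state is one of right, left, both, or neither.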

In [54]:
my_interval = pd.Interval(0, 1, 'right')

In [55]:
my_interval

Interval(0, 1, closed='right')

In [56]:
0.5 in my_interval

True

In [57]:
my_interval_2 = pd.Interval(0.5, 1.5, 'left')

In [58]:
my_interval.overlaps(my_interval_2)

True
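The closed state matters mostly at the endpoints. The sketch below checks membership at both ends of a right-closed interval and shows that two intervals meeting only at an open endpoint do not overlap; the variable names are illustrative.

```python
import pandas as pd

right_closed = pd.Interval(0, 1, closed='right')    # (0, 1]
left_closed = pd.Interval(0.5, 1.5, closed='left')  # [0.5, 1.5)

# The open left endpoint is excluded, the closed right endpoint is included
contains_left_end = 0 in right_closed
contains_right_end = 1 in right_closed

# The two intervals share the points in [0.5, 1]
shares_points = right_closed.overlaps(left_closed)

# (0, 1] and (1, 2) meet only at 1, which (1, 2) excludes
touching = pd.Interval(1, 2, closed='neither')
touches_only = right_closed.overlaps(touching)
```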

In [59]:
pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')

IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')

In [60]:
pd.IntervalIndex.from_arrays(left = [1,3,6,10],
                             right = [5,4,9,11],
                             closed = 'neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

In [61]:
pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)],closed='neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

In [62]:
pd.interval_range(start=1,end=5,periods=8)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

In [63]:
pd.interval_range(end=5,periods=8,freq=0.5)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

In [64]:
pd.IntervalIndex([my_interval, my_interval_2], closed='left')

IntervalIndex([[0.0, 1.0), [0.5, 1.5)],
              closed='left',
              dtype='interval[float64]')
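Several of these constructors describe the same index in different ways. As a rough sketch, the code below builds one index from breaks and from separate endpoint arrays, and one range from a period count and from a step size, then checks that each pair agrees.

```python
import pandas as pd

# Consecutive breaks versus explicit left and right endpoint arrays
from_breaks = pd.IntervalIndex.from_breaks([1, 3, 6, 10], closed='both')
from_arrays = pd.IntervalIndex.from_arrays([1, 3, 6], [3, 6, 10], closed='both')
same_index = from_breaks.equals(from_arrays)

# Eight equal intervals on [1, 5] versus a fixed step of 0.5
by_periods = pd.interval_range(start=1, end=5, periods=8)
by_freq = pd.interval_range(start=1, end=5, freq=0.5)
same_range = by_periods.equals(by_freq)
```

interval_range takes any three of start, end, periods, and freq and derives the fourth, which is why both calls produce the same eight intervals.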

### Interval attributes and methods
IntervalIndex also defines some useful attributes and methods. To analyze the result of cut or qcut with them, first convert it to this index type.


In [65]:
id_interval = pd.IntervalIndex(pd.cut(s, 3))

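Like a single Interval, an IntervalIndex has several commonly used attributes: left, right, mid, and length, which give the left and right endpoints, the midpoint of the two endpoints, and the interval length.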

In [66]:
id_demo = id_interval[:5]

In [67]:
id_demo

IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0], (33.945, 52.333], (70.667, 89.0]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')

In [69]:
id_demo.left

Float64Index([33.945, 52.333, 70.667, 33.945, 70.667], dtype='float64')

In [70]:
id_demo.right

Float64Index([52.333, 70.667, 89.0, 52.333, 89.0], dtype='float64')

In [71]:
id_demo.mid

Float64Index([43.138999999999996, 61.5, 79.8335, 43.138999999999996, 79.8335], dtype='float64')

In [72]:
id_demo.length

Float64Index([18.387999999999998, 18.334000000000003, 18.333,
              18.387999999999998, 18.333],
             dtype='float64')

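IntervalIndex also has two commonly used methods, contains and overlaps, which check element-wise whether each interval contains a given value and whether it intersects a given pd.Interval object.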

In [73]:
id_demo.contains(4)

array([False, False, False, False, False])

In [74]:
id_demo.overlaps(pd.Interval(40,60))

array([ True,  True, False,  True, False])
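With round numbers the attributes and methods are easier to verify by hand than with the weight bins above. This is a minimal sketch on a hypothetical index built from the breaks 0, 10, 30, and 60.

```python
import pandas as pd

# Default closed='right': (0, 10], (10, 30], (30, 60]
idx = pd.IntervalIndex.from_breaks([0, 10, 30, 60])

lefts = idx.left.tolist()
rights = idx.right.tolist()
mids = idx.mid.tolist()
lengths = idx.length.tolist()

# 10 is the closed right endpoint of the first interval only
contains_10 = idx.contains(10).tolist()

# (5, 20] intersects the first two intervals but not the third
overlaps_query = idx.overlaps(pd.Interval(5, 20)).tolist()

print(mids)            # [5.0, 20.0, 45.0]
print(contains_10)     # [True, False, False]
print(overlaps_query)  # [True, True, False]
```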