## Data Preprocessing

### Agenda

- Merging
    - Vertical concatenation
    - Appending
    - Horizontal merging
- Grouping
- Binning
- Sorting
    - Sorting by index
    - Sorting by value

In [38]:
import pandas as pd
import numpy as np

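### Merging

#### Vertical concatenation

The **concat()** function stacks pandas objects vertically (row-wise).

*Setting the axis parameter of concat also allows horizontal merging, but for clarity the examples below use only merge for that.*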

In [39]:
# Create test data
df_1 = pd.DataFrame(np.random.randn(1, 4))
df_2 = pd.DataFrame(np.random.randn(2, 4))
df_3 = pd.DataFrame(np.random.randn(3, 4))

df = pd.concat([df_1, df_2, df_3])
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.184008 | -2.562406 | 0.297049 | -0.505108 |
| 0 | -0.410246 | -0.952204 | 0.851348 | -0.409731 |
| 1 | -0.727076 | -1.362547 | 0.451162 | 0.07199 |
| 0 | -0.607709 | -0.419632 | 0.693083 | 0.221071 |
| 1 | 0.42614 | 1.081252 | -1.279698 | 0.967855 |
| 2 | -0.056695 | -0.102915 | -0.295255 | -0.033893 |
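
Note that the result above keeps each input's own row labels, so the index contains duplicates (0, 0, 1, 0, 1, 2). The sketch below shows two common ways to handle this. The names `parts`, `renumbered`, and `labelled` are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same shapes as df_1, df_2, df_3
parts = [pd.DataFrame(np.random.randn(n, 4)) for n in (1, 2, 3)]

# ignore_index=True discards the original labels and renumbers rows 0..n-1
renumbered = pd.concat(parts, ignore_index=True)

# keys=... keeps the original labels and adds an outer index level
# that records which input each row came from
labelled = pd.concat(parts, keys=['df_1', 'df_2', 'df_3'])

print(renumbered.index.tolist())
print(labelled.index.tolist())
```

With `keys`, a single input can be recovered later with `labelled.loc['df_2']`.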


#### Appending

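Use **concat()** with ignore_index=True to append rows to a DataFrame. The **append()** method used in older tutorials was deprecated in pandas 1.4 and removed in pandas 2.0.

In [40]:
df = pd.DataFrame(np.random.randn(1, 4))
df_1 = pd.DataFrame(np.random.randn(2, 4))

# DataFrame.append() is not available in pandas 2.0+; concat() gives the same result
df = pd.concat([df, df_1], ignore_index=True)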
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.752335 | -0.526978 | 0.480962 | -0.615776 |
| 1 | -1.562264 | 0.535142 | 0.075934 | 1.762854 |
| 2 | 0.10171 | 2.249128 | 0.687457 | 0.976735 |
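
When rows arrive piece by piece, for example inside a loop, each concatenation copies the data accumulated so far. A common pattern, sketched here with the hypothetical names `chunks` and `combined`, is to collect the pieces in a plain Python list and concatenate once at the end.

```python
import numpy as np
import pandas as pd

chunks = []
for n_rows in (1, 2, 3):
    # This is list.append, which only stores a reference to the new piece
    chunks.append(pd.DataFrame(np.random.randn(n_rows, 4)))

# A single concat at the end copies the data only once
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)
```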


#### Horizontal merging

The **merge()** function joins pandas objects horizontally (column-wise) on one or more key columns.

In [41]:
# Create test data
df_left = pd.DataFrame({'key': ['key1', 'key2'], 'val_1': [1, 2]})
df_right = pd.DataFrame({'key': ['key2', 'key3'], 'val_2': [1, 2]})

# Inner join: keep only the keys present in both frames
df_inner = pd.merge(df_left, df_right, on='key', how='inner')
df_inner

|   | key | val_1 | val_2 |
|---|---|---|---|
| 0 | key2 | 2 | 1 |


In [42]:
# Outer join: keep the keys from both frames; values missing on one side become NaN
df_outer = pd.merge(df_left, df_right, on='key', how='outer')
df_outer

|   | key | val_1 | val_2 |
|---|---|---|---|
| 0 | key1 | 1.0 | NaN |
| 1 | key2 | 2.0 | 1.0 |
| 2 | key3 | NaN | 2.0 |
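
Besides inner and outer, the how parameter also accepts 'left' and 'right'. The sketch below reuses the same test data to show a left join, and uses the indicator parameter of merge to record where each row came from. The names `df_left_join` and `df_audit` are illustrative only.

```python
import pandas as pd

df_left = pd.DataFrame({'key': ['key1', 'key2'], 'val_1': [1, 2]})
df_right = pd.DataFrame({'key': ['key2', 'key3'], 'val_2': [1, 2]})

# Left join: keep every row of df_left; rows without a match get NaN in val_2
df_left_join = pd.merge(df_left, df_right, on='key', how='left')

# indicator=True adds a _merge column with the values
# 'left_only', 'right_only', or 'both'
df_audit = pd.merge(df_left, df_right, on='key', how='outer', indicator=True)

print(df_left_join)
print(df_audit[['key', '_merge']])
```

The `_merge` column is handy for checking how many rows failed to match before deciding which join type to keep.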


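### Grouping

**groupby** refers to a process that involves one or more of the following steps:

- Split: divide the data into groups according to some criterion;
- Apply: apply a function to each group independently;
- Combine: assemble the results into a single data structure.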
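
The three steps can be written out by hand to make them concrete. This is only a sketch on hypothetical toy data (`df_demo`); in practice the built-in aggregation on the last line does the same work in one call.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's test data
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'C': [1.0, 2.0, 3.0, 4.0]})

# Split: iterating over a GroupBy yields (group key, sub-DataFrame) pairs
# Apply: compute one value per group
pieces = {key: group['C'].sum() for key, group in df_demo.groupby('A')}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(pieces, name='C')

# The built-in aggregation performs all three steps at once
builtin = df_demo.groupby('A')['C'].sum()

print(manual)
print(builtin)
```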

In [43]:
# Create test data
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                        'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                        'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | 0.046221 | 0.315153 |
| 1 | bar | one | -0.09937 | 1.184435 |
| 2 | foo | two | 0.446561 | 0.28671 |
| 3 | bar | three | -0.35003 | 0.247183 |
| 4 | foo | two | -0.289497 | -0.373293 |
| 5 | bar | two | 0.360452 | 0.620708 |
| 6 | foo | one | -0.731888 | -0.582871 |
| 7 | foo | three | 1.556382 | 0.085973 |


Group the rows first, then call sum() to aggregate each group. Other common aggregations include max, min, mean, and std.

In [44]:
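# numeric_only=True restricts the sum to the numeric columns C and D;
# without it, pandas 2.0+ would also concatenate the strings in column B
df.groupby('A').sum(numeric_only=True)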

| A | C | D |
|---|---|---|
| bar | -0.088948 | 2.052326 |
| foo | 1.027779 | -0.268328 |


Grouping by multiple columns produces a hierarchical index (MultiIndex).

In [45]:
df.groupby(['A', 'B']).sum()

| A | B | C | D |
|---|---|---|---|
| bar | one | -0.09937 | 1.184435 |
| bar | three | -0.35003 | 0.247183 |
| bar | two | 0.360452 | 0.620708 |
| foo | one | -0.685667 | -0.267718 |
| foo | three | 1.556382 | 0.085973 |
| foo | two | 0.157064 | -0.086583 |
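
Several aggregations can be computed in one pass with agg(), and the resulting MultiIndex can be flattened back into ordinary columns with reset_index(). The sketch below uses hypothetical data (`df_demo`) with fixed values so the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data with fixed values
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'B': ['one', 'one', 'two', 'one'],
                        'C': [1.0, 2.0, 3.0, 4.0]})

grouped = df_demo.groupby(['A', 'B'])['C']

# agg() accepts a list of function names and returns one column per function
stats = grouped.agg(['sum', 'mean', 'max'])

# reset_index() turns the index levels A and B back into regular columns
flat = stats.reset_index()

print(stats)
print(flat)
```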


### Binning

Binning is commonly used to categorize and label the values of a one-dimensional array.

In [46]:
# Create test data
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D',
                        'E', 'F', 'G', 'H'],
                    'B': [1, 55, 100, 150,
                        180, 1200, 1500, 1600]})

df

|   | A | B |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 55 |
| 2 | C | 100 |
| 3 | D | 150 |
| 4 | E | 180 |
| 5 | F | 1200 |
| 6 | G | 1500 |
| 7 | H | 1600 |


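Pass the bin edges to the bins parameter. With right=False, each interval includes its left edge and excludes its right edge.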

In [47]:
pd.cut(x=df['B'], bins=[0, 100, 1000, 10000], right=False)

0         [0, 100)
1         [0, 100)
2      [100, 1000)
3      [100, 1000)
4      [100, 1000)
5    [1000, 10000)
6    [1000, 10000)
7    [1000, 10000)
Name: B, dtype: category
Categories (3, interval[int64]): [[0, 100) < [100, 1000) < [1000, 10000)]
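
The effect of the right parameter is easiest to see on values that sit exactly on a bin edge. The following sketch uses hypothetical values (`values`, `edges`) chosen to hit the edges, plus one value outside all bins.

```python
import pandas as pd

values = pd.Series([1, 100, 1000, 20000])
edges = [0, 100, 1000, 10000]

# Default right=True: intervals look like (0, 100], so 100 falls in the first bin
right_closed = pd.cut(values, bins=edges)

# right=False: intervals look like [0, 100), so 100 falls in the second bin
left_closed = pd.cut(values, bins=edges, right=False)

# Values outside every bin (here 20000) become NaN instead of raising an error
print(pd.DataFrame({'value': values,
                    'right=True': right_closed.astype(str),
                    'right=False': left_closed.astype(str)}))
```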

Pass a list of labels to the labels parameter to tag each bin.

In [48]:
pd.cut(x=df['B'], bins=[0, 100, 1000, 10000], right=False, labels=['L1', 'L2', 'L3'])

0    L1
1    L1
2    L2
3    L2
4    L2
5    L3
6    L3
7    L3
Name: B, dtype: category
Categories (3, object): [L1 < L2 < L3]

### Sorting

#### Sorting by index

The **sort_index()** method sorts a pandas object by its index labels.

In [49]:
# Create test data
df = pd.DataFrame({'A': np.random.randn(4),
                   'B': np.random.randn(4),
                   'C': np.random.randn(4)}, index=[1, 3, 2, 4])

df

|   | A | B | C |
|---|---|---|---|
| 1 | 1.273076 | -0.563741 | -0.117965 |
| 3 | -1.12142 | -0.389332 | -0.178751 |
| 2 | -2.848915 | 1.837701 | 0.014148 |
| 4 | 0.640997 | 0.159279 | 0.588423 |


In [50]:
df.sort_index()

|   | A | B | C |
|---|---|---|---|
| 1 | 1.273076 | -0.563741 | -0.117965 |
| 2 | -2.848915 | 1.837701 | 0.014148 |
| 3 | -1.12142 | -0.389332 | -0.178751 |
| 4 | 0.640997 | 0.159279 | 0.588423 |
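
sort_index() returns a new object and sorts in ascending order by default. The sketch below, on a hypothetical frame `df_demo`, shows two further options: descending order, and sorting the column labels instead of the row labels.

```python
import pandas as pd

# Hypothetical frame with unsorted row and column labels
df_demo = pd.DataFrame({'B': [1, 2, 3], 'A': [4, 5, 6]}, index=[1, 3, 2])

# ascending=False sorts the row labels from largest to smallest
descending = df_demo.sort_index(ascending=False)

# axis=1 sorts the column labels rather than the row labels
by_columns = df_demo.sort_index(axis=1)

print(descending)
print(by_columns)
```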


#### Sorting by value

For a DataFrame, the by parameter of **sort_values()** specifies what to sort on. It accepts a single column name or a list of column names.

In [51]:
df.sort_values(by='C')

|   | A | B | C |
|---|---|---|---|
| 3 | -1.12142 | -0.389332 | -0.178751 |
| 1 | 1.273076 | -0.563741 | -0.117965 |
| 2 | -2.848915 | 1.837701 | 0.014148 |
| 4 | 0.640997 | 0.159279 | 0.588423 |
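
When by is a list, rows are ordered by the first column and ties are broken by the next one. The ascending parameter can likewise be a list, one flag per column. The sketch below uses a hypothetical frame (`df_demo`) with fixed values.

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                        'score': [1, 3, 2, 4]})

# Sort by group ascending, then by score descending within each group
ordered = df_demo.sort_values(by=['group', 'score'], ascending=[True, False])

print(ordered)
```

The original row labels travel with the rows; chain `.reset_index(drop=True)` if a fresh 0..n-1 index is preferred.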
