# Chapter 9: Data Aggregation and Group Operations

# Contents
# 3. Group-wise Operations and Transformations
- 3.1 The transform method
- 3.2 The apply method


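Yesterday's topic, data aggregation, reduces a one-dimensional array to a scalar value. It is really a kind of data transformation and only one of many possible group operations. The transform and apply methods covered today can perform a much wider range of group operations.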

# 3.1 The transform method

**Problem: for the DataFrame df below, compute the within-group means grouped by key1, then add them to the DataFrame as new columns.**


In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})
df

Unnamed: 0,data1,data2,key1,key2
0,0.965937,-0.754136,a,one
1,0.94375,-0.823292,a,two
2,1.315446,-1.741548,b,one
3,0.661197,0.231921,b,two
4,0.623395,-0.130419,a,one


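Two approaches are shown here:

**The first groups the data and computes the means, then merges them back into the original DataFrame. This approach is less flexible:**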

In [8]:
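# Select the numeric columns explicitly so the string column key2 is not averaged
k1_means = df.groupby('key1')[['data1', 'data2']].mean().add_prefix('mean_')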
k1_means

key1,mean_data1,mean_data2
a,0.844361,-0.569282
b,0.988321,-0.754814


In [9]:
pd.merge(df, k1_means, left_on='key1', right_index=True)

Unnamed: 0,data1,data2,key1,key2,mean_data1,mean_data2
0,0.965937,-0.754136,a,one,0.844361,-0.569282
1,0.94375,-0.823292,a,two,0.844361,-0.569282
4,0.623395,-0.130419,a,one,0.844361,-0.569282
2,1.315446,-1.741548,b,one,0.988321,-0.754814
3,0.661197,0.231921,b,two,0.988321,-0.754814


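**The second approach uses the transform method:**  
transform applies a function to each group and places the results at the matching positions of the original index. If each group produces a scalar, that value is broadcast to every row of the group.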

In [13]:
# Select the numeric columns so that key2 (strings) is not passed to mean
df.groupby('key1')[['data1', 'data2']].transform('mean').add_prefix('mean_')

Unnamed: 0,mean_data1,mean_data2
0,0.844361,-0.569282
1,0.844361,-0.569282
2,0.988321,-0.754814
3,0.988321,-0.754814
4,0.844361,-0.569282
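
The problem asks for the group means as columns on df itself, while the call above returns them as a separate frame. Because the transform result keeps the original row index, it can be joined straight back. The sketch below uses small made-up numbers (not the random values above) so the result is easy to check by hand; `df_with_means` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical data with the same layout as df above, but fixed values
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# One mean per row, aligned with df's index (group 'a' -> 3.0, group 'b' -> 4.0)
group_means = (df.groupby('key1')[['data1', 'data2']]
                 .transform('mean')
                 .add_prefix('mean_'))

# Attach the broadcast means as new columns; row order is unchanged
df_with_means = df.join(group_means)
print(df_with_means)
```

Unlike the merge approach, the rows stay in their original order, since no re-sorting by key happens.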


You can also pass a custom function to transform:

In [15]:
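def demean(arr):
    # Subtract the group mean from every value in the group
    return arr - arr.mean()
# Group by key2 and demean the numeric columns only
demeaned = df.groupby('key2')[['data1', 'data2']].transform(demean)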
demeaned

Unnamed: 0,data1,data2
0,-0.002323,0.121232
1,0.141277,-0.527606
2,0.347187,-0.86618
3,-0.141277,0.527606
4,-0.344864,0.744948


# 3.2 The apply method

**Problem: group the tips dataset below by smoker and select the five highest tip_pct values in each group.**


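Like aggregate, transform is a specialized function with strict requirements: the function passed in must produce either a scalar, which is then broadcast, or an array of the same size as the group. The more general method is apply.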
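
The difference between the three methods shows up in the shape of the result. The following is a minimal sketch on a made-up five-row frame (the names `key` and `value` are hypothetical): an aggregation gives one value per group, transform gives one value per original row, and apply accepts a function whose output has any length.

```python
import pandas as pd

# Hypothetical toy data: group 'a' has 2 rows, group 'b' has 3 rows
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1.0, 3.0, 2.0, 4.0, 6.0]})
grouped = df.groupby('key')['value']

# Aggregation: one scalar per group (2 rows)
aggregated = grouped.agg('mean')

# transform with a scalar-producing function: broadcast back to all 5 rows
broadcast = grouped.transform('mean')

# transform with a same-size result: still 5 rows
demeaned = grouped.transform(lambda s: s - s.mean())

# apply: the function may return a different number of rows per group
top2 = grouped.apply(lambda s: s.nlargest(2))

print(len(aggregated), len(broadcast), len(demeaned), len(top2))
```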

First write a function that selects the rows with the largest values in a given column, then pass that function to apply.

In [16]:
tips = pd.read_csv('data/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808
5,25.29,4.71,Male,No,Sun,Dinner,4,0.18624


In [22]:
def top(df, n=5, column='tip_pct'):
    # Sort ascending by the column and keep the last n rows (the n largest)
    return df.sort_values(by=column)[-n:]

Next, group by smoker and call apply with the top function.

In [23]:
tips.groupby('smoker').apply(top)


smoker,Unnamed: 1,total_bill,tip,sex,smoker,day,time,size,tip_pct
No,88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
No,185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
No,51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
No,149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
No,232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
Yes,109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
Yes,183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
Yes,67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
Yes,178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
Yes,172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


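If the function passed to apply accepts additional positional or keyword arguments, pass them after the function. For example, the top function above can be called like this: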

In [24]:
tips.groupby('smoker').apply(top, n=6, column='total_bill')


smoker,Unnamed: 1,total_bill,tip,sex,smoker,day,time,size,tip_pct
No,112,38.07,4.0,Male,No,Sun,Dinner,3,0.10507
No,23,39.42,7.58,Male,No,Sat,Dinner,4,0.192288
No,142,41.19,5.0,Male,No,Thur,Lunch,5,0.121389
No,156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
No,59,48.27,6.73,Male,No,Sat,Dinner,4,0.139424
No,212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
Yes,95,40.17,4.73,Male,Yes,Fri,Dinner,4,0.11775
Yes,184,40.55,3.0,Male,Yes,Sun,Dinner,2,0.073983
Yes,197,43.11,5.0,Female,Yes,Thur,Lunch,4,0.115982
Yes,102,44.3,2.5,Female,Yes,Sat,Dinner,3,0.056433
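
The results above carry a two-level index: the group key plus the original row label. The sketch below reproduces that on a tiny made-up table (the values are hypothetical, not taken from tips.csv) and shows that passing `group_keys=False` to groupby keeps only the original row labels. The column selection before apply is a choice made here so that the function only receives the non-key columns.

```python
import pandas as pd

# Hypothetical miniature of the tips data
tips_small = pd.DataFrame({'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
                           'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
                           'tip_pct': [0.10, 0.30, 0.20, 0.50, 0.40, 0.60]})

def top(df, n=5, column='tip_pct'):
    # Sort ascending by the column and keep the last n rows (the n largest)
    return df.sort_values(by=column)[-n:]

cols = ['total_bill', 'tip_pct']

# Default: the group key becomes the outer index level
nested = tips_small.groupby('smoker')[cols].apply(top, n=2)

# group_keys=False: only the original row labels remain
flat = tips_small.groupby('smoker', group_keys=False)[cols].apply(top, n=2)

print(nested)
print(flat)
```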


# Example 1: Slightly more complex group statistics

**Problem: split the dataset below into intervals by data1, then compute group statistics for data2 within each interval.**

In [31]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})
frame.head()

Unnamed: 0,data1,data2
0,0.487111,-0.994468
1,0.283072,-0.740833
2,0.942838,-0.472109
3,1.097927,-0.722959
4,0.073501,-0.023723


In [33]:
# Cut the data1 column of frame into four equal-width intervals.
factor = pd.cut(frame.data1, 4)
factor[:10]

0     (0.0376, 1.424]
1     (0.0376, 1.424]
2     (0.0376, 1.424]
3     (0.0376, 1.424]
4     (0.0376, 1.424]
5    (-2.741, -1.349]
6     (0.0376, 1.424]
7     (0.0376, 1.424]
8    (-1.349, 0.0376]
9     (0.0376, 1.424]
Name: data1, dtype: category
Categories (4, object): [(-2.741, -1.349] < (-1.349, 0.0376] < (0.0376, 1.424] < (1.424, 2.811]]
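
With random input the bin edges change on every run, which makes the behavior of pd.cut hard to verify by eye. As a small illustration on fixed, made-up values, the sketch below cuts the integers 0 through 8 into four equal-width intervals; each interval is closed on the right, so a value that sits exactly on an edge falls into the lower bin.

```python
import pandas as pd

# Hypothetical fixed input: the range 0..8 splits into 4 bins of width 2
values = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8])
bins = pd.cut(values, 4)

# Integer code of the bin each value landed in (0 = lowest interval)
codes = bins.cat.codes.tolist()
print(bins.cat.categories)
print(codes)
```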

In [34]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
"(-2.741, -1.349]",80.0,2.65404,-0.114024,-3.454047
"(-1.349, 0.0376]",408.0,2.582453,-0.044754,-2.773375
"(0.0376, 1.424]",454.0,2.912389,-0.086018,-2.840679
"(1.424, 2.811]",58.0,2.163389,0.058213,-2.295776

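Since every statistic in get_stats is a built-in reduction, one possible alternative is to pass a list of function names to agg, which returns the table directly without the unstack step. This is a sketch under that assumption, not the approach used above; the seeded generator and `observed=False` are choices made here so that the run is repeatable and all four intervals appear in the result.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (an assumption of this sketch)
rng = np.random.default_rng(0)
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': rng.standard_normal(1000)})

# Four equal-width intervals over data1
factor = pd.cut(frame['data1'], 4)

# One row per interval, one column per statistic
stats = (frame['data2']
         .groupby(factor, observed=False)
         .agg(['min', 'max', 'count', 'mean']))
print(stats)
```
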

# Example 2: Filling missing values with group-specific values

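For cleaning missing data, we previously covered the dropna and fillna methods. Sometimes different groups should be filled with different values. Take the following data for several US states as an example. The states are divided into East and West, and we want to fill each missing value with the mean of its own region.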

In [39]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

Ohio         -0.919351
New York     -0.502679
Vermont            NaN
Florida      -0.843296
Oregon       -0.849466
Nevada             NaN
California   -0.688357
Idaho              NaN
dtype: float64

In [40]:
data.groupby(group_key).mean()

East   -0.755109
West   -0.768912
dtype: float64

In [41]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

Ohio         -0.919351
New York     -0.502679
Vermont      -0.755109
Florida      -0.843296
Oregon       -0.849466
Nevada       -0.768912
California   -0.688357
Idaho        -0.768912
dtype: float64
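
As a possible alternative to apply, the same group-wise fill can be written with transform, which returns a result aligned to the original index. Some pandas versions may prepend the group key to the index when apply is used this way, so transform can be the more predictable choice. The values below are made up so the fill values are easy to check by hand (East mean 2.0, West mean 15.0).

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4

# Hypothetical values with three missing entries
data = pd.Series([1.0, 2.0, np.nan, 3.0, 10.0, np.nan, 20.0, np.nan],
                 index=states)

# Fill each NaN with the mean of its own region; the index stays unchanged
filled = data.groupby(group_key).transform(lambda g: g.fillna(g.mean()))
print(filled)
```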