## Data Aggregation and Group Operations

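- Split a pandas object into pieces using one or more keys
- Compute group summary statistics such as count, mean, standard deviation, or a user-defined function
- Apply a variety of functions to the columns of a DataFrame
- Apply within-group transformations or other operations, such as normalization, linear regression, ranking, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other group analyses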

### GroupBy Mechanics

- split-apply-combine: split the data into groups by key, apply a function to each group, then combine the results into one object
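The three steps can be spelled out by hand and compared with a single `groupby` call. This is only a minimal sketch: the `toy` frame and its numbers are hypothetical, chosen so the group means are easy to verify.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook), easy to check by hand.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one sub-Series of values per distinct key.
pieces = {key: toy.loc[toy['key'] == key, 'value']
          for key in toy['key'].unique()}

# Apply: reduce each piece to a single number.
applied = {key: piece.mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied).sort_index()

# groupby performs the same three steps in one call.
auto = toy.groupby('key')['value'].mean()

print(manual)
print(auto)
```

Both results hold one mean per key; `groupby` simply hides the bookkeeping.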

In [45]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [3]:
df = DataFrame(
    {
    'key1' : list('aabba'),
    'key2' : ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5),
    }
)
df

      data1     data2 key1 key2
0 -0.666295 -1.839074    a  one
1  1.287059 -0.505699    a  two
2 -0.608066  0.139088    b  one
3  1.575067  0.954528    b  two
4  0.270708 -1.176871    a  one


In [7]:
grouped = df['data1'].groupby(df['key1'])
print(grouped)
print(grouped.mean())


<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f251a3e9ba8>
key1
a    0.297157
b    0.483500
Name: data1, dtype: float64


In [10]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one    -0.197793
      two     1.287059
b     one    -0.608066
      two     1.575067
Name: data1, dtype: float64

In [12]:
means.unstack()

key2       one       two
key1
a    -0.197793  1.287059
b    -0.608066  1.575067


In [14]:
states = np.array(['Ohio', 'California',
                   'California', 'Ohio', 'Ohio'
                  ])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states, years]).mean()

California  2005    1.287059
            2006   -0.608066
Ohio        2005    0.454386
            2006    0.270708
Name: data1, dtype: float64

In [15]:
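# numeric_only=True skips the string column key2 (needed in pandas 2.0+, where it is no longer dropped silently)
df.groupby('key1').mean(numeric_only=True)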

         data1     data2
key1
a     0.297157 -1.173881
b     0.483500  0.546808


In [16]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one  -0.197793 -1.507972
     two   1.287059 -0.505699
b    one  -0.608066  0.139088
     two   1.575067  0.954528


In [17]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

### Iterating over Groups

In [20]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.666295 -1.839074    a  one
1  1.287059 -0.505699    a  two
4  0.270708 -1.176871    a  one
b
      data1     data2 key1 key2
2 -0.608066  0.139088    b  one
3  1.575067  0.954528    b  two


In [21]:
for (k1,k2), group in df.groupby(['key1', 'key2']):
    print(k1,k2)
    print(group)

a one
      data1     data2 key1 key2
0 -0.666295 -1.839074    a  one
4  0.270708 -1.176871    a  one
a two
      data1     data2 key1 key2
1  1.287059 -0.505699    a  two
b one
      data1     data2 key1 key2
2 -0.608066  0.139088    b  one
b two
      data1     data2 key1 key2
3  1.575067  0.954528    b  two


In [22]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

      data1     data2 key1 key2
2 -0.608066  0.139088    b  one
3  1.575067  0.954528    b  two


In [23]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [24]:
grouped = df.groupby(df.dtypes,axis=1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0 -0.666295 -1.839074
 1  1.287059 -0.505699
 2 -0.608066  0.139088
 3  1.575067  0.954528
 4  0.270708 -1.176871, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
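The `axis=1` argument to `groupby` has been deprecated since pandas 2.1, so the cell above may warn or fail on newer versions. The following sketch reaches a similar dictionary of per-dtype pieces with ordinary column selection. The `frame` values are hypothetical stand-ins for the random numbers in `df`.

```python
import numpy as np
import pandas as pd

# Same layout as df above, but with fixed hypothetical numbers.
frame = pd.DataFrame({
    'key1': list('aabba'),
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [0.1, 0.2, 0.3, 0.4, 0.5],
    'data2': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Collect the column names that share each dtype ...
cols_by_dtype = {}
for col, dtype in frame.dtypes.items():
    cols_by_dtype.setdefault(dtype, []).append(col)

# ... then slice the frame once per dtype.
pieces = {dtype: frame[cols] for dtype, cols in cols_by_dtype.items()}

for dtype, piece in pieces.items():
    print(dtype, list(piece.columns))
```

The dtype reported for the string columns depends on the pandas version, so the sketch does not assume it is `object`.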

### Selecting a Column or Subset of Columns

In [25]:
g1 = df.groupby('key1')['data1']
g2 = df.groupby('key1')['data2']

g3 = df['data1'].groupby(df['key1'])
g4 = df['data2'].groupby(df['key1'])

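# g1 is syntactic sugar for g3, and g2 for g4:
# indexing a GroupBy object with a column name selects that column before aggregating.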
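A quick check of that equivalence, as a sketch on a small hypothetical frame (the values are not the notebook's random data):

```python
import pandas as pd

# Hypothetical values, used only to compare the two spellings.
frame = pd.DataFrame({'key1': list('aabba'),
                      'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Column selected on the GroupBy object.
by_indexing = frame.groupby('key1')['data1'].mean()

# Column selected first, then grouped by the key Series.
by_series = frame['data1'].groupby(frame['key1']).mean()

same = by_indexing.equals(by_series)
print(same)
```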

In [34]:
df.groupby(['key1','key2'])[['data2']].mean()

              data2
key1 key2
a    one  -1.507972
     two  -0.505699
b    one   0.139088
     two   0.954528


In [35]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped.mean()

key1  key2
a     one    -1.507972
      two    -0.505699
b     one     0.139088
      two     0.954528
Name: data2, dtype: float64

#### Grouping with Dicts and Series

In [36]:
people = DataFrame(
    np.random.randn(5,5),
    columns=list('abcde'),
    index=['Joe', 'Steve','Wes','Jim','Travis']
)

people

               a         b         c         d         e
Joe     1.309125 -0.336549  0.357680  0.340857 -0.731133
Steve   0.094670 -0.100437  0.111081 -0.871055 -1.035041
Wes     0.611979 -0.004069 -0.426570  1.273945  1.452730
Jim     1.418557  0.920510 -0.694461  0.313608  2.616405
Travis  1.036586 -0.465943 -0.141507 -0.059626 -0.729437


In [37]:
mapping = {
    'a' : 'red',
    'b' : 'red',
    'c' : 'blue',
    'd' : 'blue',
    'e' : 'red',
    'f' : 'orange'
}
by_column = people.groupby(mapping,axis=1)
by_column.sum()

            blue       red
Joe     0.698537  0.241443
Steve  -0.759975 -1.040809
Wes     0.847374  2.060640
Jim    -0.380853  4.955472
Travis -0.201133 -0.158794


In [38]:
map_series = Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [41]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        2    3
Jim        2    3
Travis     2    3
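Because `axis=1` in `groupby` is deprecated from pandas 2.1 onward, one possible way to group columns by a mapping on newer versions is to transpose, group the rows, and transpose back. The sketch below uses a hypothetical two-row frame with integer values so the sums are easy to verify.

```python
import pandas as pd

# Hypothetical integer data with the same column labels as people.
small = pd.DataFrame([[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
                     columns=list('abcde'), index=['Joe', 'Steve'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# After transposing, the column labels become row labels,
# so the mapping is applied along the default axis.
by_color = small.T.groupby(mapping).sum().T
print(by_color)
```

The unused key `'f'` is ignored, just as in the cells above.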


#### Grouping with Functions

In [42]:
people.groupby(len).sum()

          a         b         c         d         e
3  3.339662  0.579891 -0.763351  1.928410  3.338002
5  0.094670 -0.100437  0.111081 -0.871055 -1.035041
6  1.036586 -0.465943 -0.141507 -0.059626 -0.729437


In [43]:
key_list = ['one','one','one','two','two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one  0.611979 -0.336549 -0.426570  0.340857 -0.731133
  two  1.418557  0.920510 -0.694461  0.313608  2.616405
5 one  0.094670 -0.100437  0.111081 -0.871055 -1.035041
6 two  1.036586 -0.465943 -0.141507 -0.059626 -0.729437
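A function passed as a group key is called once per index label, and its return values become the group names. The sketch below makes this explicit by computing the name lengths up front. The frame `small` and its column `x` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame with the same index as people.
small = pd.DataFrame({'x': [1, 2, 3, 4, 5]},
                     index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Grouping by the function: len is applied to each index label.
via_function = small.groupby(len).sum()

# Grouping by the precomputed lengths gives the same groups.
lengths = np.array([len(name) for name in small.index])
via_keys = small.groupby(lengths).sum()

print(via_function)
print(via_keys)
```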


#### Grouping by Index Levels

In [46]:
columns = pd.MultiIndex.from_arrays(
    [
        ['US', 'US', 'US', 'JP','JP'],
        [1,3,5,1,3]
    ],
    names = ['city', 'tenor']
)
hier_df = DataFrame(np.random.randn(4,5),
                    columns=columns
                   )
hier_df

city         US                            JP
tenor         1         3         5         1         3
0     -1.270309  1.178106  0.643615  0.146907  1.261704
1     -0.601004 -0.465886  1.173897  0.750467 -0.442832
2      0.521606  0.563679 -0.696802  2.556884 -0.647566
3     -0.460349  0.992480 -0.145026  0.518585  0.362764
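The cell above only builds the hierarchically indexed frame. Grouping by one of its levels could look like the following sketch, which passes the level name through the `level` keyword. It transposes first so that it does not rely on the deprecated `axis=1` argument, and it uses hypothetical integer data in place of the random values.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['city', 'tenor'])

# Hypothetical values; only the column structure matters here.
frame = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Transpose so the (city, tenor) levels index the rows,
# group by the 'city' level, count, then transpose back.
counts = frame.T.groupby(level='city').count().T
print(counts)
```

Each row of `counts` reports how many columns fall under each city.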


### Data Aggregation

In [48]:
df

      data1     data2 key1 key2
0 -0.666295 -1.839074    a  one
1  1.287059 -0.505699    a  two
2 -0.608066  0.139088    b  one
3  1.575067  0.954528    b  two
4  0.270708 -1.176871    a  one


**`quantile` computes sample quantiles of a Series or of the columns of a DataFrame**

In [49]:
grouped =df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    1.083789
b    1.356753
Name: data1, dtype: float64
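By default `quantile` interpolates linearly between the two nearest order statistics. The sketch below reproduces the value for group `a` by hand from the three `data1` values shown in the table above. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# data1 values of group 'a', copied from the table above.
values = pd.Series([-0.666295, 1.287059, 0.270708])
q = 0.9

ordered = np.sort(values.to_numpy())
pos = q * (len(ordered) - 1)            # fractional position in the sorted data
lower = int(np.floor(pos))
upper = min(lower + 1, len(ordered) - 1)

# Linear interpolation between the two neighbouring sorted values.
manual = ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

print(manual, values.quantile(q))
```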

**To use your own aggregation function, pass it to the `aggregate` or `agg` method**

In [50]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

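# Restrict to the numeric columns: key2 holds strings, which cannot be subtracted.
grouped[['data1', 'data2']].agg(peak_to_peak)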

         data1     data2
key1
a     1.953354  1.333374
b     2.183133  0.815441
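`agg` also accepts a list that mixes built-in method names with custom functions, producing one result column per entry. A minimal sketch on hypothetical data:

```python
import pandas as pd

def peak_to_peak(arr):
    # Range of the group: largest value minus smallest value.
    return arr.max() - arr.min()

# Hypothetical values, chosen so the results are easy to verify.
frame = pd.DataFrame({'key1': list('aabba'),
                      'data1': [1.0, 4.0, 2.0, 7.0, 3.0]})

# One output column per aggregation; a custom function is labelled by its name.
summary = frame.groupby('key1')['data1'].agg(['mean', peak_to_peak])
print(summary)
```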


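**Optimized groupby methods**
- count: number of non-NA values in the group
- sum: sum of non-NA values
- mean: mean of non-NA values
- median: arithmetic median of non-NA values
- std, var: unbiased (n-1 denominator) standard deviation and variance
- min, max: minimum and maximum of non-NA values
- prod: product of non-NA values
- first, last: first and last non-NA values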
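The sketch below illustrates how these methods treat missing values, using a small hypothetical frame with one `NaN`. It contrasts `count` with `size`, which counts every row including NA.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'a'.
frame = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                      'value': [1.0, np.nan, 3.0, 4.0, 6.0]})
grouped = frame.groupby('key')['value']

counts = grouped.count()   # non-NA values only
sizes = grouped.size()     # all rows, NA included
means = grouped.mean()     # NA is skipped
stds = grouped.std()       # sample standard deviation (n-1 denominator)
firsts = grouped.first()   # first non-NA value per group

print(pd.DataFrame({'count': counts, 'size': sizes, 'mean': means,
                    'std': stds, 'first': firsts}))
```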


In [52]:
tips = pd.read_csv('../pydata/ch08/tips.csv')

In [55]:
tips['tip_pct']=tips['tip']/tips['total_bill']

### Group-wise Operations and Transformations

### Pivot Tables and Cross-Tabulation

### Example: 2012 Federal Election Commission Database