## Data Aggregation and Group Operations

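- Split a pandas object into pieces using one or more keys
- Compute group summary statistics such as count, mean, standard deviation, or a user-defined function
- Apply a variety of functions to the columns of a DataFrame
- Apply within-group transformations or other manipulations, such as normalization, linear regression, ranking, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other group analyses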

### GroupBy Mechanics

- split-apply-combine: split the object into groups by key, apply a function to each group, then combine the results
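
The three steps can be written out by hand to see what a `groupby` call does. This is only an illustrative sketch: the frame `demo` and its values are made up for the example and are not the notebook's random data.

```python
import pandas as pd

# Hypothetical toy data (not the notebook's random frame)
demo = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# split: one sub-frame per distinct key
pieces = {k: demo[demo["key"] == k] for k in demo["key"].unique()}

# apply: reduce each piece to a single number
applied = {k: piece["value"].mean() for k, piece in pieces.items()}

# combine: assemble the per-group results into one Series
manual = pd.Series(applied, name="value").rename_axis("key")

# groupby performs the same three steps in a single call
auto = demo.groupby("key")["value"].mean()
```

Both `manual` and `auto` hold one mean per key; the `groupby` version is shorter and avoids the explicit Python loop.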

In [None]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [None]:
df = DataFrame(
    {
    'key1' : list('aabba'),
    'key2' : ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5),
    }
)
df

In [None]:
grouped = df['data1'].groupby(df['key1'])
print(grouped)
print(grouped.mean())


In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

In [None]:
means.unstack()

In [None]:
states = np.array(['Ohio', 'California',
                   'California', 'Ohio', 'Ohio'
                  ])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states, years]).mean()

In [None]:
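# key2 holds strings, so restrict the mean to the numeric columns
# (pandas 2.0+ raises a TypeError otherwise)
df.groupby('key1').mean(numeric_only=True)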

In [None]:
df.groupby(['key1', 'key2']).mean()

In [None]:
df.groupby(['key1', 'key2']).size()

### Iterating Over Groups

In [None]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

In [None]:
for (k1,k2), group in df.groupby(['key1', 'key2']):
    print(k1,k2)
    print(group)

In [None]:
pieces = dict(list(df.groupby('key1')))
pieces['b']
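
Because the notebook's frame is random, the output of the loops above changes on every run. The sketch below uses a small fixed frame (hypothetical values) to show what iteration yields: a `(name, group)` pair per group, where `name` is a tuple when grouping by several keys.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Each iteration yields (group name, sub-DataFrame)
row_counts = {name: len(group) for name, group in demo.groupby("key1")}

# With several keys, the group name is a tuple of key values
pair_sums = {
    (k1, k2): group["data1"].sum()
    for (k1, k2), group in demo.groupby(["key1", "key2"])
}

# dict(list(...)) collects the pieces for lookup by group name
pieces = dict(list(demo.groupby("key1")))
```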

In [None]:
df.dtypes

In [None]:
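# Group the columns by dtype. axis=1 is deprecated since pandas 2.1;
# on newer versions, group the transpose instead: df.T.groupby(df.dtypes)
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))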

### Selecting a Column or Subset of Columns

In [None]:
g1 = df.groupby('key1')['data1']
g2 = df.groupby('key1')['data2']

g3 = df['data1'].groupby(df['key1'])
g4 = df['data2'].groupby(df['key1'])

# g1 is equivalent to g3, and g2 to g4:
# indexing a GroupBy with a column name is shorthand for grouping that column

In [None]:
df.groupby(['key1','key2'])[['data2']].mean()

In [None]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped.mean()
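
To make the equivalence concrete, the following sketch (fixed, made-up values) checks that both spellings give the same result, and that a single column name yields a Series while a list of names yields a DataFrame.

```python
import pandas as pd

# Hypothetical fixed data so the comparison is reproducible
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Two spellings of the same grouping
via_indexing = demo.groupby("key1")["data1"].mean()
via_series = demo["data1"].groupby(demo["key1"]).mean()

# A scalar column name gives a Series, a list of names gives a DataFrame
as_series = demo.groupby("key1")["data2"].mean()
as_frame = demo.groupby("key1")[["data2"]].mean()
```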

#### Grouping with Dicts and Series

In [None]:
people = DataFrame(
    np.random.randn(5,5),
    columns=list('abcde'),
    index=['Joe', 'Steve','Wes','Jim','Travis']
)

people

In [None]:
mapping = {
    'a' : 'red',
    'b' : 'red',
    'c' : 'blue',
    'd' : 'blue',
    'e' : 'red',
    'f' : 'orange'
}
# 'f' matches no column and is simply ignored (axis=1 is deprecated since pandas 2.1)
by_column = people.groupby(mapping, axis=1)
by_column.sum()

In [None]:
map_series = Series(mapping)
map_series

In [None]:
people.groupby(map_series, axis=1).count()
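
Since `axis=1` is deprecated in recent pandas releases, a version-independent way to group columns by a mapping is to group the transposed frame and transpose back. A minimal sketch with made-up integer data (the frame `demo` is hypothetical):

```python
import pandas as pd

# Hypothetical fixed data in place of the notebook's random frame
demo = pd.DataFrame(
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
    columns=list("abcde"),
    index=["Joe", "Steve"],
)
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "orange"}

# After transposing, the column labels are row labels, so a plain groupby applies.
# The unused key 'f' produces no group.
by_color = demo.T.groupby(mapping).sum().T
```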

#### Grouping with Functions

In [None]:
people.groupby(len).sum()

In [None]:
key_list = ['one','one','one','two','two']
people.groupby([len, key_list]).min()
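
A function passed as a group key is called once per index label, and its return values become the group names. The sketch below repeats the two cells above on a small fixed frame (hypothetical values) so the grouping is easy to verify by hand.

```python
import pandas as pd

# Hypothetical fixed data with the same index as `people`
demo = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# len is called on each index label, so rows are grouped by name length
by_len = demo.groupby(len).sum()

# Functions can be mixed with lists or arrays that match the index length
key_list = ["one", "one", "one", "two", "two"]
mixed = demo.groupby([len, key_list]).min()
```

Here Joe, Wes, and Jim fall into the length-3 group, Steve into length 5, and Travis into length 6.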

#### Grouping by Index Levels

In [None]:
columns = pd.MultiIndex.from_arrays(
    [
        ['US', 'US', 'US', 'JP','JP'],
        [1,3,5,1,3]
    ],
    names = ['city', 'tenor']
)
hier_df = DataFrame(np.random.randn(4, 5),
                    columns=columns
                   )
hier_df
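
With a hierarchical index, the `level` keyword selects which level to group by, either by name or by position. The sketch below is an assumption about how the cell above would be used: it builds a frame with the same column MultiIndex but fixed values, and groups the transposed frame so that it does not rely on the deprecated `axis=1`.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["city", "tenor"],
)
# Hypothetical fixed values in place of random numbers
demo = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Count the columns under each top-level label, row by row
counts = demo.T.groupby(level="city").count().T
```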

### Data Aggregation

In [None]:
df

**`quantile` computes sample quantiles of a Series or of the columns of a DataFrame**

In [None]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
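
When the requested quantile falls between two observations, the default behavior is linear interpolation between them. A small sketch with made-up values where the result can be checked by hand:

```python
import pandas as pd

# Hypothetical fixed data: two observations per group
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "data1": [0.0, 10.0, 100.0, 200.0],
})

# The 0.9 quantile lies 90% of the way from the smaller to the larger value
q90 = demo.groupby("key1")["data1"].quantile(0.9)
```

For group `a` this gives a value 90% of the way from 0 to 10, and for group `b` 90% of the way from 100 to 200.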

**To use your own aggregation function, pass it to the `aggregate` or `agg` method**

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

# Select the numeric columns: subtraction is undefined for the strings in key2
grouped[['data1', 'data2']].agg(peak_to_peak)
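
The custom function receives each column of each group as a Series and must return a single value. A reproducible sketch with fixed, made-up data:

```python
import pandas as pd

def peak_to_peak(arr):
    """Return the range (max minus min) of one group's values."""
    return arr.max() - arr.min()

# Hypothetical fixed data with the same layout as `df`
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 4.0, 2.0, 7.0, 3.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Restrict to numeric columns, because key2 holds strings
ranges = demo.groupby("key1")[["data1", "data2"]].agg(peak_to_peak)
```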

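**Optimized groupby methods**
- `count`: number of non-NA values in the group
- `sum`: sum of non-NA values
- `mean`: mean of non-NA values
- `median`: arithmetic median of non-NA values
- `std`, `var`: unbiased (n - 1 denominator) standard deviation and variance
- `min`, `max`: minimum and maximum of non-NA values
- `prod`: product of non-NA values
- `first`, `last`: first and last non-NA values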
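
The NA handling described in the list can be seen on a small frame with missing values. The data below is made up for illustration; `size` is included for contrast because it counts all rows, including those with NA.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each group
demo = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "value": [np.nan, 2.0, 4.0, 5.0, np.nan],
})
g = demo.groupby("key")["value"]

n_valid = g.count()   # non-NA values only
n_rows = g.size()     # all rows, NA included
means = g.mean()      # NA values are skipped
firsts = g.first()    # first non-NA value in each group
stds = g.std()        # n - 1 denominator; NaN when only one value remains
```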


In [None]:
tips = pd.read_csv('../pydata/ch08/tips.csv')

In [None]:
# Tip as a fraction of the total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
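
Once `tip_pct` exists, several aggregations can be computed at once by passing a list of function names to `agg`. The sketch below does not read `tips.csv`: the rows are invented, and the `smoker` column is an assumption about the file's layout, so treat it as an illustration only.

```python
import pandas as pd

# Hypothetical rows standing in for tips.csv; 'smoker' is an assumed column
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [1.0, 4.0, 4.0, 10.0],
    "smoker": ["No", "No", "Yes", "Yes"],
})
demo["tip_pct"] = demo["tip"] / demo["total_bill"]

# One output column per requested aggregation
summary = demo.groupby("smoker")["tip_pct"].agg(["mean", "max"])
```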

### Group-wise Operations and Transformations
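
Unlike an aggregation, a group-wise transformation returns a result aligned with the original rows rather than one row per group. A minimal sketch with made-up data, using `transform` to subtract each group's mean from its members:

```python
import pandas as pd

# Hypothetical fixed data
demo = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 3.0, 10.0, 20.0]})

# transform broadcasts each group's mean back onto that group's rows
group_means = demo.groupby("key")["value"].transform("mean")

# Subtracting it centers every group at zero
demeaned = demo["value"] - group_means
```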

### Pivot Tables and Cross-Tabulation
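
A pivot table aggregates one column over a grid formed by two sets of keys, and a cross-tabulation is the special case that counts occurrences. A minimal sketch on a fixed frame with made-up values, laid out like the `df` used earlier:

```python
import pandas as pd

# Hypothetical fixed data
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# pivot_table aggregates with the mean unless aggfunc says otherwise
table = demo.pivot_table(values="data1", index="key1", columns="key2")

# crosstab counts how often each key combination occurs
counts = pd.crosstab(demo["key1"], demo["key2"])
```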

### Example: 2012 Federal Election Commission Database