# Group-wise statistics with pandas groupby

This is similar to the following SQL:
select city, max(temperature) from city_weather group by city;
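
As a rough illustration, the SQL above could be written in pandas as follows. The `city_weather` DataFrame and its rows are made up for this sketch; only the column names come from the query.

```python
import pandas as pd

# Hypothetical stand-in for the city_weather table in the SQL above
city_weather = pd.DataFrame({
    'city': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou'],
    'temperature': [31, 35, 33, 29, 36],
})

# select city, max(temperature) from city_weather group by city;
max_temp = city_weather.groupby('city')['temperature'].max()
print(max_temp)
```

The result is a Series indexed by `city`, with one row per group, sorted by the group key.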

groupby first splits the data into groups, then applies an aggregation function or a transformation function to each group.
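
The difference between the two kinds of function can be sketched on a small made-up DataFrame (`df_demo` and its column names are assumptions for this example). Aggregation returns one row per group, while transformation returns one value per original row.

```python
import pandas as pd

# Hypothetical data, used only to contrast aggregation with transformation
df_demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# Aggregation: the result has one row per group
group_sum = df_demo.groupby('key')['value'].sum()

# Transformation: the result has one value per original row,
# aligned to the original index, so it can be stored as a new column
df_demo['group_mean'] = df_demo.groupby('key')['value'].transform('mean')

print(group_sum)
print(df_demo)
```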

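This walkthrough covers:
1. Computing statistics by applying aggregation functions to groups
2. Iterating over groupby results to understand the execution flow
3. A worked example: exploring weather data by group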

In [119]:
import pandas as pd
import numpy as np
from pandas import DataFrame
# Add this line so that matplotlib charts render inline in Jupyter Notebook
%matplotlib inline

In [120]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

,A,B,C,D
0,foo,one,-1.756825,-1.277181
1,bar,one,-0.143414,0.613374
2,foo,two,-2.046537,1.067192
3,bar,three,-0.483003,1.201565
4,foo,two,-0.450126,1.186188
5,bar,two,0.43769,1.900416
6,foo,one,1.435011,-1.373841
7,foo,three,-0.425954,-1.087321


## I. Computing statistics by applying aggregation functions to groups
**1. Group by a single column and aggregate all data columns**

In [121]:
df.groupby('A').sum()

A,B,C,D
bar,onethreetwo,-0.188727,3.715355
foo,onetwotwoonethree,-3.244432,-1.484964


**2. Group by multiple columns and aggregate all data columns**

In [122]:
df.groupby(['A', 'B']).mean()

A,B,C,D
bar,one,-0.143414,0.613374
bar,three,-0.483003,1.201565
bar,two,0.43769,1.900416
foo,one,-0.160907,-1.325511
foo,three,-0.425954,-1.087321
foo,two,-1.248332,1.12669


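Notice that the ('A', 'B') pairs have become a two-level index. Passing as_index=False keeps them as regular columns instead: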

In [123]:
df.groupby(['A', 'B'], as_index=False).mean()

,A,B,C,D
0,bar,one,-0.143414,0.613374
1,bar,three,-0.483003,1.201565
2,bar,two,0.43769,1.900416
3,foo,one,-0.160907,-1.325511
4,foo,three,-0.425954,-1.087321
5,foo,two,-1.248332,1.12669
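
The two layouts hold the same numbers. A minimal sketch, assuming a small made-up `df_demo` with the same column names, shows that `reset_index()` on the indexed result gives the `as_index=False` layout, and that a row of the two-level index is addressed with a tuple.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data with the same column layout as df
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'B': ['one', 'one', 'two', 'two'],
                        'C': rng.standard_normal(4),
                        'D': rng.standard_normal(4)})

indexed = df_demo.groupby(['A', 'B']).mean()                 # (A, B) becomes a MultiIndex
flat = df_demo.groupby(['A', 'B'], as_index=False).mean()    # A and B stay as columns

# Moving the index levels back into columns reproduces the flat layout
pd.testing.assert_frame_equal(indexed.reset_index(), flat)

# A single row of the MultiIndex result is selected with a tuple
print(indexed.loc[('foo', 'one')])
```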


**3. Compute several statistics at once**

In [124]:
# Drop the string column B so that only the numeric columns C and D are aggregated
exclude_b_df = df.loc[:, df.columns != 'B']
exclude_b_df.groupby('A').agg(['sum', 'mean', 'std'])

,C,C,C,D,D,D
A,sum,mean,std,sum,mean,std
bar,-0.188727,-0.062909,0.465596,3.715355,1.238452,0.644314
foo,-3.244432,-0.648886,1.379564,-1.484964,-0.296993,1.304398


Notice that the columns have become a multi-level index.
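
A short sketch of how such multi-level columns can be addressed, and one way to avoid them. The `df_demo` data and the output names `c_sum` and `d_mean` are made up for this example; named aggregation is a standard pandas feature, but it is an alternative to the approach used in this notebook, not part of it.

```python
import pandas as pd

# Hypothetical data with one key column and two numeric columns
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'C': [1.0, 2.0, 3.0, 4.0],
                        'D': [10.0, 20.0, 30.0, 40.0]})

multi = df_demo.groupby('A').agg(['sum', 'mean'])
print(multi.columns.nlevels)      # the columns form a two-level index

# One column of the result is selected with a (column, function) tuple
c_sum = multi[('C', 'sum')]

# Named aggregation: keyword=(column, function) produces flat column names
flat = df_demo.groupby('A').agg(c_sum=('C', 'sum'), d_mean=('D', 'mean'))
print(flat)
```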

**4. Statistics for a single column**

In [125]:
# Method 1: select the column before aggregating, which performs better
exclude_b_df.groupby('A')['C'].agg(['sum', 'mean', 'std'])

A,sum,mean,std
bar,-0.188727,-0.062909,0.465596
foo,-3.244432,-0.648886,1.379564


In [126]:
# Method 2: aggregate every column, then select C from the result
exclude_b_df.groupby('A').agg(['sum', 'mean', 'std'])['C']

A,sum,mean,std
bar,-0.188727,-0.062909,0.465596
foo,-3.244432,-0.648886,1.379564
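
The two methods return the same numbers; they differ only in how much work is done. The check below is a sketch on made-up random data (`df_demo` is an assumption). Method 2 also aggregates column D and then discards it, which is why selecting first is usually cheaper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 1000 rows, one key column, two numeric columns
df_demo = pd.DataFrame({'A': rng.choice(['foo', 'bar'], size=1000),
                        'C': rng.standard_normal(1000),
                        'D': rng.standard_normal(1000)})

funcs = ['sum', 'mean', 'std']
method1 = df_demo.groupby('A')['C'].agg(funcs)   # aggregates column C only
method2 = df_demo.groupby('A').agg(funcs)['C']   # aggregates C and D, then drops D

# Both results contain the same values
print(np.allclose(method1.to_numpy(), method2.to_numpy()))
```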


**5. Apply different aggregation functions to different columns**

In [127]:
exclude_b_df.groupby('A').agg({'C': 'sum', 'D': ['mean', 'std']})

,C,D,D
A,sum,mean,std
bar,-0.188727,1.238452,0.644314
foo,-3.244432,-0.296993,1.304398
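
If the two-level column header is inconvenient for later steps, one common option is to join each (column, function) pair into a single name. This is a sketch on made-up data; `df_demo` and the `_` separator are assumptions.

```python
import pandas as pd

# Hypothetical data with one key column and two numeric columns
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                        'C': [1.0, 2.0, 3.0, 4.0],
                        'D': [10.0, 20.0, 30.0, 40.0]})

# A dict maps each column to its own function or list of functions
result = df_demo.groupby('A').agg({'C': 'sum', 'D': ['mean', 'max']})

# Join each (column, function) pair into one flat name such as 'C_sum'
result.columns = ['_'.join(pair) for pair in result.columns]
print(result)
```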


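## II. Iterating over groupby results to understand the execution flow
A for loop can iterate over the groups directly; each iteration yields the group name and the rows that belong to that group.

**1. Iterating over groups with a single key**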

In [128]:
g = exclude_b_df.groupby('A')
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021061ED8C70>

In [129]:
for name, group in g:
    print(name)
    print(group)
    print()

bar
     A         C         D
1  bar -0.143414  0.613374
3  bar -0.483003  1.201565
5  bar  0.437690  1.900416

foo
     A         C         D
0  foo -1.756825 -1.277181
2  foo -2.046537  1.067192
4  foo -0.450126  1.186188
6  foo  1.435011 -1.373841
7  foo -0.425954 -1.087321



**You can retrieve the data of a single group**

In [130]:
g.get_group('bar')

,A,C,D
1,bar,-0.143414,0.613374
3,bar,-0.483003,1.201565
5,bar,0.43769,1.900416
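
Besides `get_group`, a GroupBy object exposes a few attributes that help when inspecting how the rows were split. The sketch below uses a small made-up DataFrame (`df_demo` and `g_demo` are assumptions for this example).

```python
import pandas as pd

# Hypothetical data: five rows, two groups
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                        'C': [1, 2, 3, 4, 5]})
g_demo = df_demo.groupby('A')

print(g_demo.ngroups)    # number of groups
print(g_demo.size())     # number of rows in each group
print(g_demo.groups)     # mapping from group name to row labels

# get_group returns the rows of one group, keeping their original index labels
bar_rows = g_demo.get_group('bar')
print(bar_rows)
```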


**2. Iterating over groups with multiple keys**

In [131]:
g1 = df.groupby(['A', 'B'])

In [132]:
for name, group in g1:
    print(name)
    print(group)
    print()

('bar', 'one')
     A    B         C         D
1  bar  one -0.143414  0.613374

('bar', 'three')
     A      B         C         D
3  bar  three -0.483003  1.201565

('bar', 'two')
     A    B        C         D
5  bar  two  0.43769  1.900416

('foo', 'one')
     A    B         C         D
0  foo  one -1.756825 -1.277181
6  foo  one  1.435011 -1.373841

('foo', 'three')
     A      B         C         D
7  foo  three -0.425954 -1.087321

('foo', 'two')
     A    B         C         D
2  foo  two -2.046537  1.067192
4  foo  two -0.450126  1.186188



As shown, name is a 2-element tuple holding the values of the two grouping columns, A and B.

In [133]:
g1.get_group(('foo', 'one'))

,A,B,C,D
0,foo,one,-1.756825,-1.277181
6,foo,one,1.435011,-1.373841


**You can select one or more columns from the grouped object directly; each group then yields a Series or a sub-DataFrame**

In [134]:
g1['C']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000021061ED8430>

In [135]:
for name, group in g1['C']:
    print(name)
    print(group)
    print(type(group))
    print()

('bar', 'one')
1   -0.143414
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('bar', 'three')
3   -0.483003
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('bar', 'two')
5    0.43769
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('foo', 'one')
0   -1.756825
6    1.435011
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('foo', 'three')
7   -0.425954
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('foo', 'two')
2   -2.046537
4   -0.450126
Name: C, dtype: float64
<class 'pandas.core.series.Series'>



In fact, every aggregation is computed on these per-group DataFrame and Series objects.
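
To make this concrete, the sketch below reproduces a group sum by hand: it loops over the groups, calls `sum()` on each per-group Series, and compares the result with the built-in aggregation. The `df_demo` data is made up for this example, and the loop only illustrates the idea; it is not how pandas computes the result internally.

```python
import pandas as pd

# Hypothetical data: five rows, two groups
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                        'C': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Manual version: each group is an ordinary Series, so sum() can be called on it
manual = {name: group.sum() for name, group in df_demo.groupby('A')['C']}

# Built-in version
builtin = df_demo.groupby('A')['C'].sum()

print(manual)
print(builtin)
```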

## III. Worked example: exploring weather data by group

In [136]:
df_weather = pd.read_excel('./data/weather/weater_beijing.xlsx')
# Strip the degree sign, treat empty or missing readings as 0, then convert to integers
df_weather['最高温'] = (df_weather['最高温'].str.replace('°', '', regex=False)
                        .replace('', '0').fillna('0').astype('int32'))

In [137]:
df_weather

,日期,最高温,最低温,天气,风力风向,空气质量指数
0,2011-01-01 周六,-2,-7°,多云~阴,无持续风向微风,
1,2011-01-02 周日,-2,-7°,多云,无持续风向微风,
2,2011-01-03 周一,-2,-6°,多云~阴,西北风~北风3-4级~4-5级,
3,2011-01-04 周二,-2,-9°,晴,北风5-6级,
4,2011-01-05 周三,-2,-10°,晴,北风~无持续风向3-4级~微风,
...,...,...,...,...,...,...
4001,2021-12-27 周一,6,-8°,晴,西北风1级,56 良
4002,2021-12-28 周二,6,-5°,多云~晴,西北风1级,64 良
4003,2021-12-29 周三,5,-5°,晴,西北风3级,43 优
4004,2021-12-30 周四,6,-7°,晴,西北风3级,38 优


In [138]:
# Strip the degree sign, treat empty or missing readings as 0, then convert to integers
df_weather['最低温'] = (df_weather['最低温'].str.replace('°', '', regex=False)
                        .replace('', '0').fillna('0').astype('int32'))
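
Since the daily high and daily low are cleaned the same way, the steps could be collected in a small helper. This is a sketch: `clean_temperature` and the sample rows in `raw` are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def clean_temperature(col: pd.Series) -> pd.Series:
    """Strip the degree sign, map empty or missing readings to 0, return int32."""
    return (col.str.replace('°', '', regex=False)
               .replace('', '0')
               .fillna('0')
               .astype('int32'))


# Hypothetical sample in the same format as the raw temperature columns
raw = pd.DataFrame({'最高温': ['-2°', '6°', None],
                    '最低温': ['-7°', '', '-5°']})

for name in ['最高温', '最低温']:
    raw[name] = clean_temperature(raw[name])

print(raw)
print(raw.dtypes)
```

Note that mapping missing readings to 0 follows the notebook's choice; 0 is also a valid temperature, so a real analysis might prefer to keep such rows as missing.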

In [139]:
df_weather

,日期,最高温,最低温,天气,风力风向,空气质量指数
0,2011-01-01 周六,-2,-7,多云~阴,无持续风向微风,
1,2011-01-02 周日,-2,-7,多云,无持续风向微风,
2,2011-01-03 周一,-2,-6,多云~阴,西北风~北风3-4级~4-5级,
3,2011-01-04 周二,-2,-9,晴,北风5-6级,
4,2011-01-05 周三,-2,-10,晴,北风~无持续风向3-4级~微风,
...,...,...,...,...,...,...
4001,2021-12-27 周一,6,-8,晴,西北风1级,56 良
4002,2021-12-28 周二,6,-5,多云~晴,西北风1级,64 良
4003,2021-12-29 周三,5,-5,晴,西北风3级,43 优
4004,2021-12-30 周四,6,-7,晴,西北风3级,38 优


In [140]:
# Days without a reading are marked '未统计' (not recorded)
df_weather.fillna({'空气质量指数': '未统计'}, inplace=True)
# '56 良' splits into ['56', '良']; '未统计' stays a one-element list
parts = df_weather['空气质量指数'].str.split(' ')

aqi_values = []  # numeric index such as '56', or -1 when not recorded
aqi_grades = []  # grade such as '良', or '未统计' when not recorded

for d in parts:
    if len(d) > 1:
        aqi_values.append(d[0])
        aqi_grades.append(d[1])
    else:
        aqi_values.append(-1)
        aqi_grades.append(d[0])

df_weather['空气质量指数'] = aqi_values
df_weather['空气质量'] = aqi_grades
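
The same split can also be sketched without an explicit loop by using `str.split(..., expand=True)`. The sample Series `aqi` below is made up. Unlike the cell above, this variant converts the index values to integers, which is an assumption about what later analysis would want.

```python
import pandas as pd

# Hypothetical sample: '56 良' is "index grade"; '未统计' marks a missing reading
aqi = pd.Series(['未统计', '56 良', '43 优'], name='空气质量指数')

# Column 0 holds the first token, column 1 the second token (missing if absent)
parts = aqi.str.split(' ', expand=True)
has_reading = parts[1].notna()

# Rows without a reading get -1 as the index and keep '未统计' as the grade
index_value = parts[0].where(has_reading, '-1').astype('int64')
grade = parts[1].where(has_reading, parts[0])

print(index_value)
print(grade)
```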

In [141]:
df_weather

,日期,最高温,最低温,天气,风力风向,空气质量指数,空气质量
0,2011-01-01 周六,-2,-7,多云~阴,无持续风向微风,-1,未统计
1,2011-01-02 周日,-2,-7,多云,无持续风向微风,-1,未统计
2,2011-01-03 周一,-2,-6,多云~阴,西北风~北风3-4级~4-5级,-1,未统计
3,2011-01-04 周二,-2,-9,晴,北风5-6级,-1,未统计
4,2011-01-05 周三,-2,-10,晴,北风~无持续风向3-4级~微风,-1,未统计
...,...,...,...,...,...,...,...
4001,2021-12-27 周一,6,-8,晴,西北风1级,56,良
4002,2021-12-28 周二,6,-5,多云~晴,西北风1级,64,良
4003,2021-12-29 周三,5,-5,晴,西北风3级,43,优
4004,2021-12-30 周四,6,-7,晴,西北风3级,38,优
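
With the columns cleaned, the data can be grouped in the same way as in the earlier sections. The sketch below uses a few made-up rows in the cleaned layout rather than the real file; the `weather` DataFrame, the derived `月份` (year-month) column, and the output column names are all assumptions for this example.

```python
import pandas as pd

# Hypothetical rows in the cleaned layout; the values are not real readings
weather = pd.DataFrame({
    '日期': ['2021-01-01 周五', '2021-01-02 周六', '2021-02-01 周一', '2021-02-02 周二'],
    '最高温': [2, 4, 8, 6],
    '最低温': [-8, -6, -3, -4],
    '空气质量': ['优', '良', '良', '良'],
})

# Derive a year-month key from the leading 'YYYY-MM' of the date string
weather['月份'] = weather['日期'].str[:7]

# Monthly extremes and mean daily high, using named aggregation
monthly = weather.groupby('月份').agg(
    max_high=('最高温', 'max'),
    min_low=('最低温', 'min'),
    mean_high=('最高温', 'mean'),
)

# Number of days per air-quality grade
days_per_grade = weather.groupby('空气质量').size()

print(monthly)
print(days_per_grade)
```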


In [None]:
s = pd.Series([1, 2, 3, 4], name='a')
s1 = pd.Series([5], name='a')
d = pd.DataFrame({'data': ['a', 'b', 'c', 'd']})
pd.concat([d, s], axis=1)