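The basic logic of a group operation is:
- Split: partition the data into groups along the row or column axis
- Apply: apply a function to each group, producing a new value
- Combine: merge all the results into a single result object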

![Group aggregation](https://raw.githubusercontent.com/samsun277/image/main/20220418183304.png)
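
The three steps can be written out by hand to see what `groupby` takes care of. The following is a minimal sketch, assuming a small made-up frame `toy` with hypothetical columns `key` and `value`; it illustrates the logic only and is not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1.0, 2.0, 3.0, 4.0, 8.0]})

# Split: one Series of values per distinct key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# Apply: reduce each piece to a single number
means = {k: piece.mean() for k, piece in pieces.items()}

# Combine: collect the per-group results into one object
manual = pd.Series(means).sort_index()

# The same computation expressed with groupby
builtin = toy.groupby("key")["value"].mean()

print(manual)
print(builtin)
```

Both results should hold the same per-key means; `groupby` simply performs the split, apply, and combine steps in one call.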

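Group keys can take several forms:
- A list or array of values with the same length as the axis being grouped
- A column name (or a list of column names) of the DataFrame
- A dict or Series that maps the labels on the grouped axis to group names
- A function that is called on each label of the axis index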

In [2]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1': ['a','a','b','b','a'], 'key2': ['one','two','one','two','one'], 'data1':np.random.randn(5), 'data2':np.random.randn(5)})
df

  key1 key2     data1     data2
0    a  one -1.363865  0.898007
1    a  two -0.705061  1.170926
2    b  one -0.482974  0.045901
3    b  two -1.167965 -0.214658
4    a  one -0.126037  0.282766


In [3]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001C7790A4460>

In [4]:
grouped.mean()

key1
a   -0.731654
b   -0.825470
Name: data1, dtype: float64

In [7]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one    -0.744951
      two    -0.705061
b     one    -0.482974
      two    -1.167965
Name: data1, dtype: float64

In [8]:
means.unstack()

key2       one       two
key1
a    -0.744951 -0.705061
b    -0.482974 -1.167965


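The group keys do not have to be columns of df. Any arrays of the right length work: df contains neither of the two key arrays used below, yet they can still be used to group its data, and the computation is then carried out on those groups.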

In [11]:
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

California  2005   -0.705061
            2006   -0.482974
Ohio        2005   -1.265915
            2006   -0.126037
Name: data1, dtype: float64

In [12]:
df

  key1 key2     data1     data2
0    a  one -1.363865  0.898007
1    a  two -0.705061  1.170926
2    b  one -0.482974  0.045901
3    b  two -1.167965 -0.214658
4    a  one -0.126037  0.282766


Using df's own columns as group keys

In [13]:
# numeric_only=True skips the string column key2 (required in pandas >= 2.0)
df.groupby('key1').mean(numeric_only=True)

         data1     data2
key1
a    -0.731654  0.783900
b    -0.825470 -0.084379


In [43]:
# df.groupby(['key1', 'key2']).count()
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

The usage above produces a **count per category** (the number of rows in each group), which is very useful in practice.
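
The commented-out `count()` and the `size()` call above are not interchangeable. As a small sketch, assuming a variant of the frame in which one `data1` value is missing (the values here are made up for illustration): `size()` counts rows per group, while `count()` counts only non-null values.

```python
import numpy as np
import pandas as pd

# Same keys as in the example above, but with made-up data and one missing value
df_nan = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# size(): number of rows in each group, missing values included
sizes = df_nan.groupby(["key1", "key2"]).size()

# count(): number of non-null data1 values in each group
counts = df_nan.groupby(["key1", "key2"])["data1"].count()

print(sizes)
print(counts)
```

The group `('a', 'two')` has one row, but that row's `data1` is missing, so the two methods disagree for it.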

In [15]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -1.363865  0.898007
1    a  two -0.705061  1.170926
4    a  one -0.126037  0.282766
b
  key1 key2     data1     data2
2    b  one -0.482974  0.045901
3    b  two -1.167965 -0.214658


In [17]:
for (k1,k2),group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2     data1     data2
0    a  one -1.363865  0.898007
4    a  one -0.126037  0.282766
('a', 'two')
  key1 key2     data1     data2
1    a  two -0.705061  1.170926
('b', 'one')
  key1 key2     data1     data2
2    b  one -0.482974  0.045901
('b', 'two')
  key1 key2     data1     data2
3    b  two -1.167965 -0.214658


In [18]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

  key1 key2     data1     data2
2    b  one -0.482974  0.045901
3    b  two -1.167965 -0.214658


In [19]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [21]:
# Group the columns by dtype (note: axis=1 in groupby is deprecated since pandas 2.1)
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0 -1.363865  0.898007
1 -0.705061  1.170926
2 -0.482974  0.045901
3 -1.167965 -0.214658
4 -0.126037  0.282766
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


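Indexing a GroupBy object with a list of column names gives a grouped DataFrame, so the aggregated result is a DataFrame; indexing with a single column name (a scalar) gives a grouped Series, so the result is a Series.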

In [30]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one   0.590387
     two   1.170926
b    one   0.045901
     two  -0.214658


In [31]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001C77B9C3370>

In [33]:
s_grouped.mean()

key1  key2
a     one     0.590387
      two     1.170926
b     one     0.045901
      two    -0.214658
Name: data2, dtype: float64
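
Column selection on a GroupBy object can be read as shorthand for grouping the selected column directly. The sketch below checks this on a frame with the same layout as df; a seeded generator is assumed so the example is reproducible, so the numbers differ from the outputs shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility of this sketch
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

keys = ['key1', 'key2']

# List index -> DataFrame result; scalar index -> Series result
as_frame = df.groupby(keys)[['data2']].mean()
as_series = df.groupby(keys)['data2'].mean()

# The longer spelling: group the data2 Series by the two key columns
long_form = df['data2'].groupby([df['key1'], df['key2']]).mean()

print(type(as_frame).__name__, type(as_series).__name__)
print(np.allclose(as_series.to_numpy(), long_form.to_numpy()))
```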

In [34]:
people = pd.DataFrame(np.random.randn(5,5), columns = ['a','b','c','d','e'], index = ['Joe','Steve','Wes','Jim','Travis'])
people.iloc[2:3, [1,2]] = np.nan
people

               a         b         c         d         e
Joe     0.569979 -1.049844  1.267992 -0.226412 -0.594340
Steve   0.811779 -0.351165  0.670186  0.338084 -0.518417
Wes     0.275247       NaN       NaN -0.519166  0.106682
Jim    -0.521017 -0.600170 -1.520989 -1.524391 -0.311494
Travis -1.256812  2.006732  1.110352  0.063008 -0.662888


In [35]:
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}

In [36]:
# Group the columns by the colors in mapping; unused keys such as 'f' are ignored
by_column = people.groupby(mapping, axis=1)

In [37]:
by_column.sum()

            blue       red
Joe     1.041579 -1.074206
Steve   1.008270 -0.057802
Wes    -0.519166  0.381929
Jim    -3.045380 -1.432681
Travis  1.173360  0.087033
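
The column-wise sum by mapping can also be written without the `axis=1` argument, by transposing, grouping the rows, and transposing back. This is a sketch of that alternative, assuming a seeded random frame in place of the `people` values shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility of this sketch
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the columns become rows, group those rows by color,
# sum within each color, then transpose back to the original orientation
by_color = people.T.groupby(mapping).sum().T

print(by_color)
```

Missing values are skipped by `sum()`, so Wes's `blue` total is just his `d` value.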


In [38]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [41]:
people.groupby(map_series, axis = 1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


In [44]:
people.groupby(len).sum()

          a         b         c         d         e
3  0.324209 -1.650014 -0.252997 -2.269969 -0.799153
5  0.811779 -0.351165  0.670186  0.338084 -0.518417
6 -1.256812  2.006732  1.110352  0.063008 -0.662888
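
In the `groupby(len)` call above, the function is applied to each index label (here, the length of each name), and its return value becomes the group name. Functions can also be combined with other kinds of keys in one list. The sketch below assumes a seeded random `people` frame and a hypothetical `key_list` introduced only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility of this sketch
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Hypothetical second key, one entry per row
key_list = ['one', 'one', 'one', 'two', 'two']

# Group by name length and by key_list at the same time
result = people.groupby([len, key_list]).min()

print(result)
```

The result is indexed by pairs such as `(3, 'one')`, which covers Joe and Wes.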
