## 1. GroupBy
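`groupby()` takes one or more grouping keys and splits the data into groups.  
- Grouping keys can take several forms, and keys used together need not share a type:
    - A list or array whose length matches the axis being grouped.
    - A value naming a column of the DataFrame.
    - A dict or Series that maps the values on the grouped axis to group names.
    - A function that is applied to the axis index or to each label in the index.
    - When several columns are passed as grouping keys at once, the result has a hierarchical index (MultiIndex).
- `size()` returns the size of each group.
- Non-numeric columns ("nuisance columns") were silently dropped from aggregations in older pandas versions. From pandas 2.0 on they are no longer dropped automatically, so pass `numeric_only=True` or select the columns to aggregate.
- Missing values in a grouping key are excluded from the result by default (`dropna=True`).
- The GroupBy object returned by grouping supports iteration and yields 2-tuples of group name and data chunk. When grouping by several columns, the group name is a tuple of key values (not column names), in the form `(('key1 value', 'key2 value'), data chunk)`.
- A GroupBy object supports column selection by indexing. Passing a list or array of column names gives a grouped DataFrame, whose aggregation returns a DataFrame; passing a single column name gives a grouped Series, whose aggregation returns a Series.
- When the data has a hierarchical index, grouping can be done on one level of that index.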
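
The handling of missing values in a grouping key can be checked with a small sketch. The frame `toy` and its values are made up for illustration, and the `dropna` argument assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has a missing grouping key.
toy = pd.DataFrame({'key': ['A', 'B', np.nan, 'A'],
                    'value': [1, 2, 3, 4]})

# Default (dropna=True): the row whose key is NaN is left out of the result.
default_sum = toy.groupby('key')['value'].sum()

# dropna=False keeps NaN as a group of its own.
with_nan_sum = toy.groupby('key', dropna=False)['value'].sum()

print(default_sum)
print(with_nan_sum)
```

With the default setting the result has two groups; with `dropna=False` it has three, the extra one holding the row whose key is missing.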

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'key1': list('ABBCBCAA'),
                   'key2': list('YZXYZXYZ'),
                   'data1': np.random.randint(100, size=8),
                   'data2': np.random.randint(10, size=8)})
df

  key1 key2  data1  data2
0    A    Y     64      7
1    B    Z     57      5
2    B    X     47      1
3    C    Y     50      6
4    B    Z     82      7
5    C    X     59      7
6    A    Y     40      4
7    A    Z     77      8


In [3]:
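# Group the data1 column by the key1 and key2 columns and sum each group
# Several columns are passed as grouping keys at once, so the result has a hierarchical index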
g1 = df['data1'].groupby([df['key1'], df['key2']])
g1.sum()

key1  key2
A     Y       104
      Z        77
B     X        47
      Z       139
C     X        59
      Y        50
Name: data1, dtype: int32

In [4]:
g1.size()

key1  key2
A     Y       2
      Z       1
B     X       1
      Z       2
C     X       1
      Y       1
Name: data1, dtype: int64

In [5]:
g1.sum().unstack()

key2     X      Y      Z
key1
A      NaN  104.0   77.0
B     47.0    NaN  139.0
C     59.0   50.0    NaN


In [6]:
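# Group the data by the key1 column name and sum
# key2 is not a numeric column, so numeric_only=True leaves it out of the result
# (older pandas versions dropped such columns automatically)
g2 = df.groupby('key1')
g2.sum(numeric_only=True)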

      data1  data2
key1
A       181     19
B       186     13
C       109     13


In [7]:
# size() returns the size of each group
g2.size()

key1
A    3
B    3
C    2
dtype: int64

In [8]:
# The GroupBy object is iterable and yields (group name, data chunk) pairs
# When grouping by several columns, the group name is a tuple of key values
for x, y in df.groupby(['key1', 'key2']):
    print(x)
    print(y)

('A', 'Y')
  key1 key2  data1  data2
0    A    Y     64      7
6    A    Y     40      4
('A', 'Z')
  key1 key2  data1  data2
7    A    Z     77      8
('B', 'X')
  key1 key2  data1  data2
2    B    X     47      1
('B', 'Z')
  key1 key2  data1  data2
1    B    Z     57      5
4    B    Z     82      7
('C', 'X')
  key1 key2  data1  data2
5    C    X     59      7
('C', 'Y')
  key1 key2  data1  data2
3    C    Y     50      6
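
Because iteration yields (group name, data chunk) pairs, the chunks can be collected into a dict keyed by group name. The following is a minimal sketch under that assumption; the frame `demo` and the name `pieces` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with fixed values so the groups are easy to follow.
demo = pd.DataFrame({'key1': list('ABBA'),
                     'key2': list('YZZY'),
                     'data1': [1, 2, 3, 4]})

# Each iteration yields (name, chunk); with two keys the name is a tuple of key values.
pieces = {name: chunk for name, chunk in demo.groupby(['key1', 'key2'])}

print(list(pieces))        # the group names
print(pieces[('B', 'Z')])  # the rows belonging to one group
```

Each value in `pieces` is an ordinary DataFrame, so a single group can be inspected or processed on its own.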


In [9]:
# A GroupBy object supports column selection by indexing
df.groupby(['key1', 'key2'])[['data2']].sum()

           data2
key1 key2
A    Y        11
     Z         8
B    X         1
     Z        12
C    X         7
     Y         6
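
The difference between selecting with a list and selecting with a single column name shows up in the type of the aggregated result. A small sketch, with a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical frame with fixed values.
demo = pd.DataFrame({'key1': list('ABBA'),
                     'data1': [1, 2, 3, 4],
                     'data2': [10, 20, 30, 40]})

grouped = demo.groupby('key1')

# A list of column names: the aggregation returns a DataFrame.
as_frame = grouped[['data2']].sum()

# A single column name: the aggregation returns a Series.
as_series = grouped['data2'].sum()

print(type(as_frame).__name__)
print(type(as_series).__name__)
```

Both results hold the same numbers; only the container differs, which matters when the result is passed on to code that expects one or the other.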


In [10]:
# Group data that has a hierarchical index
df_index = pd.MultiIndex.from_arrays([list('XYXXYZ'), list('aabbcc')],
                                     names=['level0', 'level1'])
d = pd.DataFrame(np.random.randn(4, 6), columns=df_index)
d

level0         X         Y         X         X         Y         Z
level1         a         a         b         b         c         c
0       0.938185 -0.961999 -0.184465 -0.137115 -0.222926  1.686604
1       0.465761  -0.05494  0.577564 -0.550124 -0.523234 -0.447774
2      -1.023086 -0.436056  1.759591 -1.098638 -0.404784  0.102822
3       0.185688  -1.90521  0.987399 -0.143751  -2.34031  1.148509


In [11]:
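# Group on one level of the hierarchical index, given by level name or level number
# Note: axis=1 in groupby is deprecated since pandas 2.1; transposing first avoids it
d.groupby(level='level0', axis=1).sum()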

level0         X         Y         Z
0       0.616606 -1.184926  1.686604
1       0.493201 -0.578174 -0.447774
2      -0.362133  -0.84084  0.102822
3       1.029336  -4.24552  1.148509
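
Grouping on a column level can also be written without the `axis` argument by transposing, grouping on the row level, and transposing back. The sketch below assumes that approach and also shows that a level can be given by position instead of by name; the frame `wide` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two column levels and fixed values.
cols = pd.MultiIndex.from_arrays([list('XYXZ'), list('aabb')],
                                 names=['level0', 'level1'])
wide = pd.DataFrame(np.arange(8).reshape(2, 4), columns=cols)

# Transpose so the column levels become row levels, group, then transpose back.
by_name = wide.T.groupby(level='level0').sum().T

# The same grouping with the level given by its position.
by_position = wide.T.groupby(level=0).sum().T

print(by_name)
```

The two results are identical, and the transposed form does not rely on the `axis` argument of `groupby`.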


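## 2. Grouping with arrays, dicts, Series, or functions
- The grouping key can be any array of the right length. Its values, rather than the index labels, determine the groups and become the index of the result.
- When grouping with a dict, the dict keys are matched against the labels on the grouped axis, the data is grouped by the corresponding dict values, and those values become the index of the result.
- Grouping with a Series that holds a mapping works like grouping with a dict: the Series index plays the role of the dict keys and the Series values play the role of the dict values.
- When grouping with a function, each index label is passed to the function, the data is grouped by the return values, and those return values become the index of the result.
- Keys of several types can be used together.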

In [12]:
df = pd.DataFrame(np.random.randn(5, 5),
                  columns=['a', 'b', 'c', 'd', 'e'],
                  index=[1, 3, 5, 7, 9])
df

          a         b         c         d         e
1 -0.432671   -0.5575  0.816826 -0.030766  1.009882
3 -0.720239 -0.319944  -0.86453 -0.595198   0.65695
5 -0.358353  1.737323  -0.93858  0.216987  0.300043
7 -0.894122 -1.150279  0.374781  0.474971  0.060911
9 -0.790254   0.16273  0.788457  0.440188   0.01945


In [13]:
# Any array of the right length can serve as the grouping key
df.groupby(list('AABBA')).sum()

          a         b         c         d         e
A -1.943164 -0.714714  0.740753 -0.185776  1.686283
B -1.252476  0.587043 -0.563799  0.691958  0.360954


In [14]:
# Group using the mapping in this dict; the key 'f' does not occur in the data, which does no harm
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

In [15]:
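# axis=1: group along the columns
# (axis=1 in groupby is deprecated since pandas 2.1; df.T.groupby(mapping).sum().T is equivalent)
df.groupby(mapping, axis=1).sum()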

       blue       red
1   0.78606  0.019712
3 -1.459729 -0.383233
5 -0.721592  1.679012
7  0.849752  -1.98349
9  1.228645 -0.608074


In [16]:
# Group with a Series that holds the mapping
map_series = pd.Series(mapping)
df.groupby(map_series, axis=1).sum()

       blue       red
1   0.78606  0.019712
3 -1.459729 -0.383233
5 -0.721592  1.679012
7  0.849752  -1.98349
9  1.228645 -0.608074
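
The same dict and Series mappings also work on the row index, which needs no `axis` argument. The sketch below uses a made-up frame `scores` and mapping `row_mapping`; it also shows that a label missing from the mapping is left out of the result, while a mapping key that does not occur in the data is ignored.

```python
import pandas as pd

# Hypothetical data indexed by label.
scores = pd.DataFrame({'points': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# 'z' does not occur in the index and is simply unused;
# 'd' has no entry in the mapping, so its row is left out of the result.
row_mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'z': 'orange'}

by_dict = scores.groupby(row_mapping).sum()
by_series = scores.groupby(pd.Series(row_mapping)).sum()

print(by_dict)
print(by_series)
```

Both calls give the same two groups, `blue` and `red`, which matches the description of a Series as a dict whose keys are its index and whose values are its data.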


In [17]:
# Group with a function
def judgeIndex(x):
    if x > 5:
        return '>5'
    elif x < 5:
        return '<5'
    elif x == 5:
        return '=5'
# Each index label is passed to the function, and the data is grouped by the return values
df.groupby(judgeIndex).sum()

           a         b         c         d         e
<5  -1.15291 -0.877444 -0.047704 -0.625964  1.666832
=5 -0.358353  1.737323  -0.93858  0.216987  0.300043
>5 -1.684376 -0.987549  1.163238  0.915159  0.080361


In [18]:
# Group with keys of several types at once
df.groupby([judgeIndex, list('AABBA')]).sum()

             a         b         c         d         e
<5 A  -1.15291 -0.877444 -0.047704 -0.625964  1.666832
=5 B -0.358353  1.737323  -0.93858  0.216987  0.300043
>5 A -0.790254   0.16273  0.788457  0.440188   0.01945
   B -0.894122 -1.150279  0.374781  0.474971  0.060911
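
Any callable that accepts an index label can serve as the grouping function, including built-ins. As a sketch with made-up data (the frame `people` and its labels are hypothetical), `len` groups string labels by their length and can be combined with an array key:

```python
import pandas as pd

# Hypothetical data indexed by name.
people = pd.DataFrame({'score': [1, 2, 3, 4]},
                      index=['Joe', 'Steve', 'Wes', 'Jim'])

# The function is called once per index label; its return value becomes the group key.
by_length = people.groupby(len).sum()

# Mixed keys: the function result plus an array as long as the index.
by_length_and_flag = people.groupby([len, ['x', 'x', 'y', 'y']]).sum()

print(by_length)
print(by_length_and_flag)
```

The first result is indexed by label length, the second by a two-level index of label length and array value.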
