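##### Any groupby operation involves one of the following operations on the original object:
- Splitting the object
- Applying a function
- Combining the results

##### The data is split into groups and a function is applied to each subset. The apply step can perform one of the following operations:
- Aggregation: compute a summary statistic
- Transformation: perform a group-specific operation
- Filtration: discard data under some condition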
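
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with a single `groupby` call. The toy frame and the column names `key` and `value` are assumptions made for illustration; they are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's dataset
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece['value'].sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='value').rename_axis('key').sort_index()

# groupby performs all three steps in a single call
auto = df.groupby('key')['value'].sum()

print(manual)
print(auto)
```

Both Series should hold the same per-key totals; the `groupby` version is simply shorter and lets pandas handle the bookkeeping.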




In [19]:
import pandas as pd
import numpy as np

In [2]:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690


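#### Splitting data into groups
A pandas object can be split along any of its axes. There are several ways to split an object:
- obj.groupby('key')
- obj.groupby(['key1', 'key2'])
- obj.groupby(key, axis=1) (note: the axis argument is deprecated in pandas 2.1 and later)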
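
The forms listed above can be sketched as follows. The four-row frame is a cut-down, assumed stand-in for the notebook's data, and grouping by a computed Series (`df['Year'] % 2`) is an extra illustration that the key does not have to be a column name.

```python
import pandas as pd

# Small stand-in frame (assumed for illustration)
df = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Riders', 'Devils'],
    'Year': [2014, 2014, 2015, 2015],
    'Points': [876, 863, 789, 673],
})

by_team = df.groupby('Team')                 # one column name
by_team_year = df.groupby(['Team', 'Year'])  # a list of column names
by_parity = df.groupby(df['Year'] % 2)       # a Series aligned with the rows

print(by_team.ngroups)       # number of distinct teams
print(by_team_year.ngroups)  # number of distinct (team, year) pairs
print(by_parity.ngroups)     # even years vs. odd years
```

`ngroups` is a quick way to check that a key splits the data the way you expect before applying any function.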

In [3]:
print(df.groupby('Team'))

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000002294CDA2908>


#### Viewing groups

In [4]:
print(df.groupby('Team').groups)

{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
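
Note that `'Kings'` and `'kings'` come out as separate groups, because group keys are compared exactly. If the lowercase spelling is a data-entry slip (an assumption; the notebook does not say), one possible approach is to group on a normalized copy of the key, as in this sketch with a reduced, made-up frame:

```python
import pandas as pd

# Reduced frame (assumed for illustration) with inconsistent capitalization
df = pd.DataFrame({
    'Team': ['Kings', 'kings', 'Kings', 'Riders'],
    'Points': [741, 812, 756, 876],
})

# Group on a title-cased copy of the key; the original column is left untouched
normalized = df['Team'].str.title()
totals = df.groupby(normalized)['Points'].sum()
print(totals)
```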


In [6]:
# Group by multiple columns
for i in df.groupby(['Team','Year']).groups:
    print(i)

('Devils', 2014)
('Devils', 2015)
('Kings', 2014)
('Kings', 2016)
('Kings', 2017)
('Riders', 2014)
('Riders', 2015)
('Riders', 2016)
('Riders', 2017)
('Royals', 2014)
('Royals', 2015)
('kings', 2015)


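#### Iterating through groups
A groupby object can be iterated over, much like the result of itertools.groupby: each step yields the group name and the rows that belong to that group.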

In [11]:
grouped = df.groupby('Year')
for name,group in grouped:
    print(name)
    print(group)

2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804
2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690
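
For comparison, here is a rough sketch of the same idea using the standard library's `itertools.groupby` on plain tuples. The `records` list is made up for illustration. One difference worth noting is that `itertools.groupby` only merges adjacent items, so the input has to be sorted by the key first, whereas the pandas version needs no pre-sorting.

```python
from itertools import groupby

# Hypothetical (team, year) records in no particular order
records = [('Riders', 2015), ('Devils', 2014), ('Riders', 2014), ('Devils', 2015)]

# Sort by year so that equal keys sit next to each other
records.sort(key=lambda r: r[1])

by_year = {year: [team for team, _ in rows]
           for year, rows in groupby(records, key=lambda r: r[1])}
print(by_year)
```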


#### Selecting a group
The get_group() method selects a single group.

In [13]:
grouped.groups

{2014: Int64Index([0, 2, 4, 9], dtype='int64'),
 2015: Int64Index([1, 3, 5, 10], dtype='int64'),
 2016: Int64Index([6, 8], dtype='int64'),
 2017: Int64Index([7, 11], dtype='int64')}

In [15]:
print(grouped.get_group(2016))

     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
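
When the grouping uses several keys, a group is named by a tuple. The sketch below, on an assumed four-row frame, shows that case and what happens when the requested name does not exist.

```python
import pandas as pd

# Small stand-in frame (assumed for illustration)
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Kings', 'Kings'],
    'Year': [2014, 2015, 2014, 2016],
    'Points': [876, 789, 741, 756],
})

grouped = df.groupby(['Team', 'Year'])

# With several grouping keys, the group name is a tuple
kings_2016 = grouped.get_group(('Kings', 2016))
print(kings_2016)

# A name that is not among the groups raises KeyError
try:
    grouped.get_group(('Kings', 2015))
    missing = False
except KeyError:
    missing = True
    print('no such group')
```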


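#### Aggregation
An aggregation function returns a single aggregated value for each group. Once the groupby object has been created, several aggregation operations can be performed on the grouped data. A common choice is the aggregate method or its equivalent, agg.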

In [20]:
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))

Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64


In [23]:
# One way to see the size of each group is to aggregate with size
print(grouped['Points'].agg(np.size))

Year
2014    4
2015    4
2016    2
2017    2
Name: Points, dtype: int64
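
A related detail is the difference between a group's size and its count of non-missing values. The sketch below uses an assumed frame with one missing score to show it, and also uses the string alias `'mean'`, which pandas accepts in place of `np.mean`.

```python
import numpy as np
import pandas as pd

# Stand-in frame (assumed for illustration) with one missing value
df = pd.DataFrame({
    'Year': [2014, 2014, 2015, 2015],
    'Points': [876.0, np.nan, 789.0, 673.0],
})
grouped = df.groupby('Year')

sizes = grouped.size()                 # rows per group, missing values included
counts = grouped['Points'].count()     # non-missing values per group
means = grouped['Points'].agg('mean')  # string alias; NaN is skipped

print(sizes)
print(counts)
print(means)
```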


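#### Applying multiple aggregation functions
With a grouped Series, you can also pass a list or dict of functions to aggregate with, producing a DataFrame as output.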

In [25]:
grouped = df.groupby('Team')
print(grouped.groups)
agg = grouped['Points'].agg([np.sum,np.mean,np.std])
print(agg)

{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
         sum        mean         std
Team                                
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
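
The example above passes a list. As a sketch of the dict form mentioned in the text, the code below aggregates different columns with different functions on an assumed four-row frame. It also shows named aggregation, where the output column names `total` and `best_rank` are chosen here purely for illustration.

```python
import pandas as pd

# Small stand-in frame (assumed for illustration)
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Rank': [1, 2, 2, 3],
    'Points': [876, 789, 863, 673],
})
grouped = df.groupby('Team')

# Dict form: one aggregation per column
per_column = grouped.agg({'Points': 'sum', 'Rank': 'min'})

# Named aggregation: output name = (input column, function)
named = grouped.agg(total=('Points', 'sum'), best_rank=('Rank', 'min'))

print(per_column)
print(named)
```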


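#### Transformation
A transformation returns an object indexed like the original, so the result for each group has the same size as the group itself.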


In [31]:
grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))

         Rank       Year     Points
0  -15.000000 -11.618950  12.843272
1    5.000000  -3.872983   3.020286
2   -7.071068  -7.071068   7.071068
3    7.071068   7.071068  -7.071068
4   11.547005 -10.910895  -8.608621
5         NaN        NaN        NaN
6   -5.773503   2.182179  -2.360428
7   -5.773503   8.728716  10.969049
8    5.000000   3.872983  -7.705963
9    7.071068  -7.071068  -7.071068
10  -7.071068   7.071068   7.071068
11   5.000000  11.618950  -8.157595
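
The row of NaN values at index 5 comes from the single-row `kings` group, whose sample standard deviation is undefined. A smaller sketch of the same mechanism, on an assumed five-row frame, shows how `transform` broadcasts each group's statistic back onto the original rows:

```python
import pandas as pd

# Stand-in frame (assumed for illustration); 'kings' has a single row
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'kings'],
    'Points': [876, 789, 863, 673, 812],
})
grouped = df.groupby('Team')['Points']

# Each row receives the statistic of the group it belongs to
team_mean = grouped.transform('mean')
team_std = grouped.transform('std')   # NaN for the single-row group

result = df.assign(TeamMean=team_mean, TeamStd=team_std)
print(result)
```

Because the output lines up with the input row by row, it can be attached to the original frame directly, as done with `assign` here.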


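#### Filtration
Filtration keeps or drops data according to a defined criterion and returns a subset of the data. The filter() function evaluates a True/False condition for each group and keeps only the rows of the groups for which it is True.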

In [33]:
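# Keep only teams that appear at least three times
# (named 'filtered' to avoid shadowing the built-in filter)
filtered = df.groupby('Team').filter(lambda x: len(x) >= 3)
filtered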

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
11  Riders     2  2017     690
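
The condition does not have to be about group size. As a sketch, the code below keeps only teams whose average score exceeds a threshold; the frame and the threshold of 800 are assumptions chosen for illustration.

```python
import pandas as pd

# Small stand-in frame (assumed for illustration)
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'kings'],
    'Points': [876, 789, 863, 673, 812],
})

# The lambda is called once per group and must return a single True/False
kept = df.groupby('Team').filter(lambda g: g['Points'].mean() > 800)
print(kept)
```

Rows from groups that pass keep their original index labels, so the result can be matched back to the source frame.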
