# Grouping

In [1]:
import numpy as np
import pandas as pd

### I. Grouping patterns and the GroupBy object

#### 1. General pattern: **<font color = red>df.groupby(grouping key)[data source].operation</font>**

In [3]:
df = pd.read_csv('data/learn_pandas.csv')
df.head()

,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N,1,2019/9/4,0:04:20
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N,2,2019/9/12,0:05:22
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N,2,2020/1/3,0:04:08
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N,2,2019/11/6,0:05:22


In [7]:
df.groupby(['School','Gender'])[['Height','Weight']].mean()

School,Gender,Height,Weight
Fudan University,Female,158.776923,47.9
Fudan University,Male,174.2125,72.3
Peking University,Female,158.666667,46.65
Peking University,Male,172.03,73.7
Shanghai Jiao Tong University,Female,159.1225,48.513514
Shanghai Jiao Tong University,Male,176.76,76.0
Tsinghua University,Female,159.753333,48.0
Tsinghua University,Male,171.638889,69.947368


#### 2. Complex logic: **<font color = red>df.groupby(boolean condition)[data source].operation</font>**

In [9]:
df.groupby((df.Weight > df.Weight.mean()) & (df.Height > df.Height.mean()))[['Height','Weight']].mean()

,Height,Weight
False,159.089922,48.392593
True,173.07963,71.574074


In [15]:
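def func(x):
    # NOTE: a callable passed to groupby is called on each index label,
    # so x here is the row index (0, 1, 2, ...), not a Weight value.
    # To group by weight level instead, pass df['Weight'].map(func).
    if x < df['Weight'].quantile(0.25):
        return 'low'
    elif x > df['Weight'].quantile(0.75):
        return 'high'
    else:
        return 'normal'

df.groupby(func)['Height'].mean()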

high      162.945902
low       164.509524
normal    162.110526
Name: Height, dtype: float64
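
When a function is passed to `groupby`, pandas calls it on the index labels rather than on a column. The sketch below contrasts that behaviour with mapping the function over a column first. The frame `toy`, the helper `weight_level`, and all values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data; the index is deliberately non-numeric
toy = pd.DataFrame(
    {"Weight": [40.0, 50.0, 60.0, 70.0, 80.0],
     "Height": [150.0, 160.0, 170.0, 180.0, 190.0]},
    index=list("abcde"),
)

# Quartiles computed once, outside the function
low, high = toy["Weight"].quantile([0.25, 0.75])

def weight_level(w):
    """Classify one weight value against the quartiles."""
    if w < low:
        return "low"
    if w > high:
        return "high"
    return "normal"

# A callable given to groupby receives index labels ('a', 'b', ...)
by_label = toy.groupby(lambda label: label in ("a", "b"))["Height"].mean()

# Mapping the function over the column groups rows by their Weight values
levels = toy["Weight"].map(weight_level)
by_weight = toy.groupby(levels)["Height"].mean()

print(by_label)
print(by_weight)
```

Computing the quartiles once also avoids re-evaluating `quantile` for every row.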

#### 3. The GroupBy object: provides the attributes ngroups and groups and the methods size and get_group, among others
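
A minimal sketch of these members on a small hypothetical frame (the name `toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "School": ["A", "A", "B", "B", "B"],
    "Gender": ["F", "M", "F", "F", "M"],
    "Height": [160.0, 175.0, 158.0, 162.0, 180.0],
})

gb = toy.groupby(["School", "Gender"])

n_groups = gb.ngroups                 # number of distinct (School, Gender) pairs
group_rows = gb.groups                # mapping: group key -> row labels
group_sizes = gb.size()               # Series: number of rows per group
b_female = gb.get_group(("B", "F"))   # the rows belonging to one group

print(n_groups)
print(group_sizes)
print(b_female)
```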

#### 4. The three main group operations: aggregation (agg), transformation (transform), and filtering (filter)

### II. Aggregation functions

#### 1. Built-in aggregation functions
##### max/min/mean/median/count/all/any/idxmax/idxmin/mad/nunique/skew/quantile/sum/std/var/sem/size/prod

In [18]:
gb = df.groupby('Gender')['Height']

In [19]:
# Returns True if all values in the group are truthy, otherwise False
gb.all()

Gender
Female    True
Male      True
Name: Height, dtype: bool

In [21]:
# Returns True if any value in the group is truthy, otherwise False
gb.any()

Gender
Female    True
Male      True
Name: Height, dtype: bool

In [22]:
# Returns the mean absolute deviation (mad was deprecated in pandas 1.5 and removed in 2.0)
gb.mad()

Gender
Female    4.088108
Male      5.394617
Name: Height, dtype: float64
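
On pandas 2.0 and later, where `mad` is no longer available, the same quantity can be computed with a custom aggregation. This is a sketch on hypothetical data; `mean_abs_dev` and `toy` are illustrative names, not part of pandas.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M"],
    "Height": [150.0, 160.0, 170.0, 170.0, 190.0],
})

def mean_abs_dev(s):
    """Mean absolute deviation around the mean; NaN values are skipped."""
    return (s - s.mean()).abs().mean()

mad_by_gender = toy.groupby("Gender")["Height"].agg(mean_abs_dev)
print(mad_by_gender)
```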

In [23]:
# Returns the skewness
gb.skew()

Gender
Female   -0.219253
Male      0.437535
Name: Height, dtype: float64

In [24]:
# Returns the standard error of the group mean, excluding missing values
gb.sem()

Gender
Female    0.439893
Male      0.986985
Name: Height, dtype: float64

In [25]:
# Returns the product of the values in each group
gb.prod()

Gender
Female    4.232080e+290
Male      1.594210e+114
Name: Height, dtype: float64

#### 2. The agg method

##### a. Using multiple functions: df.groupby(grouping key)[data source].agg([fun1,fun2,...])

In [26]:
df.groupby('Gender')[['Height', 'Weight']].agg(['sum', 'idxmax', 'skew'])

,Height,Height,Height,Weight,Weight,Weight
Gender,sum,idxmax,skew,sum,idxmax,skew
Female,21014.0,28,-0.219253,6469.0,28,-0.268482
Male,8854.9,193,0.437535,3929.0,2,-0.332393


##### b. Applying specific aggregation functions to specific columns: df.groupby(grouping key)[data source].agg({'col1':[fun1,fun2...],'col2':[fun3,fun4,fun5]})

In [27]:
df.groupby('Gender')[['Height', 'Weight']].agg({'Height':['mean','max'], 'Weight':'count'})

,Height,Height,Weight
Gender,mean,max,count
Female,159.19697,170.2,135
Male,173.62549,193.9,54


##### c. Using custom functions

In [28]:
df.groupby('Gender')[['Height', 'Weight']].agg(lambda x: x.max()-x.min())

Gender,Height,Weight
Female,24.8,29.0
Male,38.2,38.0


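##### Because each function receives a Series, any Series method or attribute can be used inside it, as long as the return value is a scalar. The example below returns High if a group's mean for a given column exceeds the overall mean of that column, and Low otherwise.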

In [30]:
def my_func(s):
    res = 'High'
    if s.mean() <= df[s.name].mean():
        res = 'Low'
    return res

df.groupby('Gender')[['Height', 'Weight']].agg(my_func)

Gender,Height,Weight
Female,Low,Low
Male,High,High


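##### d. Renaming aggregation results
##### To rename the columns of an aggregation result, replace each function with a tuple whose first element is the new name and whose second element is the original function, which may be either an aggregation string or a custom function.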
##### .agg([('new_name', old_name/func), ('new_name', old_name/func)])

In [31]:
df.groupby('Gender')[['Height', 'Weight']].agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')])

,Height,Height,Weight,Weight
Gender,range,my_sum,range,my_sum
Female,24.8,21014.0,29.0,6469.0
Male,38.2,8854.9,38.0,3929.0


In [32]:
df.groupby('Gender')[['Height', 'Weight']].agg({'Height': [('my_func', my_func), 'sum'],'Weight': lambda x:x.max()})

,Height,Height,Weight
Gender,my_func,sum,<lambda>
Female,Low,21014.0,63.0
Male,High,8854.9,89.0
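
pandas also offers keyword-based named aggregation, which gives flat column names instead of a two-level header and avoids labels such as `<lambda>`. The sketch below is an alternative to the tuple style used above, not the approach taken in these notes; `toy` and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "Height": [150.0, 170.0, 165.0, 185.0],
    "Weight": [45.0, 55.0, 60.0, 80.0],
})

# Each keyword is an output column name; each value is (source column, function)
summary = toy.groupby("Gender").agg(
    height_range=("Height", lambda s: s.max() - s.min()),
    height_sum=("Height", "sum"),
    weight_max=("Weight", "max"),
)
print(summary)
```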


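### III. Transformation and filtering

#### 1. Transformation functions and the transform method
##### A transformation function returns a sequence of the same length as its input. The most common built-in transformation functions are the cumulative ones: cumcount/cumsum/cumprod/cummax/cummin. They are used in the same way as aggregation functions, except that they perform cumulative operations within each group.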

In [35]:
df.groupby('Gender')[['Height', 'Weight']].cumsum()

,Height,Weight
0,158.9,46.0
1,166.5,70.0
2,355.4,159.0
3,,87.0
4,529.4,233.0
...,...,...
195,20699.2,6374.0
196,20860.1,6424.0
197,21014.0,6469.0
198,8699.2,3878.0


### Note: I have not yet figured out how transform is used
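
As a starting point, here is a minimal sketch of `transform` on hypothetical data (`toy` is a made-up frame). The function is applied to each group, and the result is aligned back to the original rows, so the output has the same length and index as the input.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "Height": [150.0, 170.0, 165.0, 185.0],
})

gb = toy.groupby("Gender")["Height"]

# A scalar result per group is broadcast to every row of that group
group_mean = gb.transform("mean")

# A same-length result per group is placed back row by row
centered = gb.transform(lambda s: s - s.mean())

print(group_mean.tolist())
print(centered.tolist())
```

By contrast, `agg` with `"mean"` would return one row per group rather than one row per original record.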

#### 2. Group indexing and filtering: similar to HAVING after GROUP BY in SQL

In [37]:
df.groupby('Gender')[['Height', 'Weight']].filter(lambda x: x.shape[0] > 100).head()

,Height,Weight
0,158.9,46.0
3,,41.0
5,158.0,51.0
6,162.5,52.0
7,161.9,50.0


### IV. Cross-column group operations

#### Using apply

In [38]:
def BMI(x):
    Height = x['Height']/100
    Weight = x['Weight']
    BMI_value = Weight/Height**2
    return BMI_value.mean()

df.groupby('Gender')[['Height', 'Weight']].apply(BMI)

Gender
Female    18.860930
Male      24.318654
dtype: float64

### V. Exercises

#### Ex1: Car dataset

In [40]:
df = pd.read_csv('data/car.csv')
df.head(3)

,Brand,Price,Country,Reliability,Mileage,Type,Weight,Disp.,HP
0,Eagle Summit 4,8895,USA,4.0,33,Small,2560,97,113
1,Ford Escort 4,7402,USA,2.0,33,Small,2345,114,90
2,Ford Festiva 4,6319,Korea,4.0,37,Small,1845,81,63


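##### 1. First keep only the cars whose Country occurs more than 2 times, i.e. drop a car if its Country appears no more than 2 times in the full dataset. Then group by Country and compute the mean price, the coefficient of variation of the price, and the number of cars in that Country. The coefficient of variation is the standard deviation divided by the mean; rename it to CoV in the result.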
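
One possible approach is sketched below on a small hypothetical frame instead of `data/car.csv`. The name `cars` and all of its values are invented for illustration, so the numbers do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the car data
cars = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Japan", "Japan", "Japan", "Korea"],
    "Price": [8000, 10000, 12000, 9000, 9000, 12000, 6000],
})

# Step 1: keep only rows whose Country appears more than 2 times
kept = cars.groupby("Country").filter(lambda g: g.shape[0] > 2)

# Step 2: mean, coefficient of variation (renamed via a tuple), and count
result = kept.groupby("Country")["Price"].agg(
    ["mean", ("CoV", lambda s: s.std() / s.mean()), "count"]
)
print(result)
```

Here `std` is the sample standard deviation (pandas uses `ddof=1` by default); whether the exercise intends the sample or population version is an assumption.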

In [43]:
df.groupby('Brand')['Country'].count()

Brand
Acura Legend V6                  1
Audi 80 4                        1
Buick Century 4                  1
Buick Le Sabre V6                1
Buick Skylark 4                  1
Chevrolet Beretta 4              1
Chevrolet Camaro V8              1
Chevrolet Caprice V8             1
Chevrolet Lumina APV V6          1
Chrysler Le Baron Coupe          1
Chrysler Le Baron V6             1
Chrysler New Yorker V6           1
Dodge Daytona                    1
Dodge Grand Caravan V6           1
Eagle Premier V6                 1
Eagle Summit 4                   1
Ford Aerostar V6                 1
Ford Escort   4                  1
Ford Festiva 4                   1
Ford LTD Crown Victoria V8       1
Ford Mustang V8                  1
Ford Probe                       1
Ford Taurus V6                   1
Ford Tempo 4                     1
Ford Thunderbird V6              1
Honda Accord 4                   1
Honda Civic 4                    1
Honda Civic CRX Si 4             1
Honda Prelude 