# How to apply a function to each group in Pandas with apply

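Background: Pandas groupby follows the split-apply-combine pattern.

Here, the split step is performed by pandas' groupby, the apply step is a function we write ourselves, and pandas then combines the values that function returns for each group into the final result.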

**GroupBy.apply(function)**  
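* The first argument of function is a DataFrame (the rows belonging to one group).
* The return value of function can be a DataFrame, a Series, or a single value; it may even be completely unrelated to the input DataFrame.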
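
The three return types can be sketched on a small made-up frame. The column names `team` and `score` and the three helper functions are hypothetical and are not part of the datasets used in the examples. The numeric column is selected with `[["score"]]` before calling `apply`, so each call receives a one-column DataFrame; this avoids depending on how a given pandas version treats the grouping column.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1, 3, 2, 4, 6],
})

# split: one sub-DataFrame per team, containing only the "score" column
groups = df.groupby("team")[["score"]]

def score_spread(g):
    # Returns a single value per group
    return g["score"].max() - g["score"].min()

def score_min_max(g):
    # Returns a Series per group; its labels become columns after combine
    return pd.Series({"min": g["score"].min(), "max": g["score"].max()})

def best_row(g):
    # Returns a DataFrame per group (here: the row with the highest score)
    return g.nlargest(1, "score")

spread = groups.apply(score_spread)    # Series indexed by team
min_max = groups.apply(score_min_max)  # DataFrame: one row per team
top1 = groups.apply(best_row)          # DataFrame indexed by (team, original row label)

print(spread)
print(min_max)
print(top1)
```

In all three cases the function only sees one group at a time; the shape of the combined result depends on what the function returns.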

**Examples in this walkthrough**  
 1. How to normalize a numeric column within each group?  
 2. How to take the top N rows of each group?

## Example 1: How to normalize a numeric column within each group?  

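Normalization maps numeric columns with different ranges onto the [0, 1] interval:  
* It makes columns easier to compare side by side, e.g. a price column ranging from hundreds to thousands versus a growth column ranging from 0 to 100.
* Machine learning models often train faster and perform better on normalized inputs.  

The normalization formula:  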
    
    X_normalized = (X-X_minimum)/(X_maximum - X_minimum)
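
As a worked illustration of the formula, the sketch below scales two made-up columns with very different ranges. The values and the helper name `min_max_normalize` are assumptions chosen for this example only.

```python
import pandas as pd

def min_max_normalize(s):
    # (X - X_minimum) / (X_maximum - X_minimum), applied element-wise
    return (s - s.min()) / (s.max() - s.min())

# Made-up values: prices in the hundreds to thousands, growth between 0 and 100
price = pd.Series([200, 500, 1200, 3000])
growth = pd.Series([0, 25, 50, 100])

price_norm = min_max_normalize(price)
growth_norm = min_max_normalize(growth)

print(price_norm.tolist())
print(growth_norm.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, the smallest value of each column is 0 and the largest is 1, so the two columns can be compared on the same scale.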

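**Demo: normalizing users' movie ratings**

Users rate on different scales: optimistic users tend to give high scores and pessimistic users low ones, so we normalize the ratings per user.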

In [1]:
import pandas as pd

In [2]:
ratings = pd.read_csv("./datas/movielens-1m/ratings.dat", 
                      sep="::", 
                      engine='python', 
                      names=['userid','movieid','rating','timestamp'])
ratings.head()

,userid,movieid,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [3]:
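def rating_norm(df):
    # df holds the rows of a single user
    min_rate = df['rating'].min()
    max_rate = df['rating'].max()
    # Vectorized min-max scaling; the result is NaN when a user gave only one distinct rating
    df['rating_norm'] = (df['rating'] - min_rate) / (max_rate - min_rate)
    return df
# group_keys=False keeps the original flat index in the combined result
ratings = ratings.groupby('userid', group_keys=False).apply(rating_norm)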

In [4]:
ratings[ratings["userid"] == 1].head()

,userid,movieid,rating,timestamp,rating_norm
0,1,1193,5,978300760,1.0
1,1,661,3,978302109,0.0
2,1,914,3,978301968,0.0
3,1,3408,4,978300275,0.5
4,1,2355,5,978824291,1.0


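For the user with userid == 1, rating = 3 is the lowest score they gave, which marks them as an optimistic rater; that lowest score is normalized to 0.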
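
When the function only derives a new column from an existing one, the same per-user scaling can also be written with `GroupBy.transform`, which returns values aligned to the original index. The sketch below uses a small made-up frame (`ratings_demo`) with the same column names as `ratings`; the helper name `min_max` is hypothetical.

```python
import pandas as pd

# Made-up ratings that reuse the column names of the MovieLens frame
ratings_demo = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2],
    "movieid": [10, 11, 12, 10, 11, 12],
    "rating": [5, 3, 4, 1, 2, 3],
})

def min_max(s):
    # s is the "rating" Series of one user
    return (s - s.min()) / (s.max() - s.min())

# transform returns one value per input row, in the original row order
ratings_demo["rating_norm"] = (
    ratings_demo.groupby("userid")["rating"].transform(min_max)
)

print(ratings_demo)
```

Here user 1's lowest rating (3) and user 2's lowest rating (1) both map to 0, even though the raw scores differ.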

## Example 2: How to take the top N rows of each group?

Get the 2 days with the highest temperature in each month of 2018.

In [5]:
fpath='./datas/beijing_tianqi/beijing_tianqi_2018.csv'
df = pd.read_csv(fpath)
df.head()

,ymd,bWendu,yWendu,tianqi,fengxiang,fengli,aqi,aqiInfo,aqiLevel
0,2018-01-01,3℃,-6℃,晴~多云,东北风,1-2级,59,良,2
1,2018-01-02,2℃,-5℃,阴~多云,东北风,1-2级,49,优,1
2,2018-01-03,2℃,-5℃,多云,北风,1-2级,28,优,1
3,2018-01-04,0℃,-8℃,阴,东北风,1-2级,28,优,1
4,2018-01-05,3℃,-6℃,多云~晴,西北风,1-2级,50,优,1


In [6]:
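# Strip the ℃ suffix and convert both temperature columns to integers
df['bWendu'] = df['bWendu'].str.replace("℃","").astype('int32')
df['yWendu'] = df['yWendu'].str.replace("℃","").astype('int32')
# Use the YYYY-MM prefix of the date string as the month key
df['month']=df['ymd'].str[:7]
df.head()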

,ymd,bWendu,yWendu,tianqi,fengxiang,fengli,aqi,aqiInfo,aqiLevel,month
0,2018-01-01,3,-6,晴~多云,东北风,1-2级,59,良,2,2018-01
1,2018-01-02,2,-5,阴~多云,东北风,1-2级,49,优,1,2018-01
2,2018-01-03,2,-5,多云,北风,1-2级,28,优,1,2018-01
3,2018-01-04,0,-8,阴,东北风,1-2级,28,优,1,2018-01
4,2018-01-05,3,-6,多云~晴,西北风,1-2级,50,优,1,2018-01


In [11]:
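def get_topN_tmp(df, n):
    # Sort ascending by bWendu, then keep the last n rows (the n highest values)
    return df.sort_values(by='bWendu')[-n:]
# Extra keyword arguments to apply (here n=2) are forwarded to the function
df.groupby('month').apply(get_topN_tmp, n=2).head(31)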

month,,ymd,bWendu,yWendu,tianqi,fengxiang,fengli,aqi,aqiInfo,aqiLevel,month
2018-01,13,2018-01-14,6,-5,晴~多云,西北风,1-2级,187,中度污染,4,2018-01
2018-01,18,2018-01-19,7,-4,晴,南风,1-2级,115,轻度污染,3,2018-01
2018-02,53,2018-02-23,10,-4,多云,东北风,1-2级,45,优,1,2018-02
2018-02,56,2018-02-26,12,-1,晴~多云,西南风,1-2级,157,中度污染,4,2018-02
2018-03,86,2018-03-28,25,9,多云~晴,东风,1-2级,387,严重污染,6,2018-03
2018-03,85,2018-03-27,27,11,晴,南风,1-2级,243,重度污染,5,2018-03
2018-04,109,2018-04-20,28,14,多云~小雨,南风,4-5级,164,中度污染,4,2018-04
2018-04,118,2018-04-29,30,16,多云,南风,3-4级,193,中度污染,4,2018-04
2018-05,133,2018-05-14,34,22,晴~多云,南风,3-4级,158,中度污染,4,2018-05
2018-05,150,2018-05-31,35,19,晴,南风,1-2级,79,良,2,2018-05


As we can see, the DataFrame returned by the function passed to groupby's apply can be completely different from the original DataFrame.
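
For comparison, a top-N selection can also be sketched without `apply`, by sorting once and calling `GroupBy.head`. The frame `weather_demo` below is made up and only mirrors the column names `ymd`, `bWendu` and `month`; it is not the Beijing dataset.

```python
import pandas as pd

# Made-up temperatures; column names mirror the weather frame
weather_demo = pd.DataFrame({
    "ymd": ["2018-01-01", "2018-01-02", "2018-01-03",
            "2018-02-01", "2018-02-02", "2018-02-03"],
    "bWendu": [3, 7, 6, 10, 12, 9],
})
weather_demo["month"] = weather_demo["ymd"].str[:7]

n = 2
top_n = (
    weather_demo.sort_values("bWendu", ascending=False)  # hottest rows first
    .groupby("month")
    .head(n)                                             # first n rows of every month
    .sort_values(["month", "bWendu"], ascending=[True, False])
)

print(top_n)
```

Unlike the `apply` version, this result keeps the original flat row index rather than a (month, row) MultiIndex, which may or may not be what a given analysis needs.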