# How to apply a function to each group in pandas
**Key point: pandas groupby follows the split, apply, combine pattern**

<img src='./image/groupby_model.png'>

Here, split refers to pandas' groupby, apply is the function we implement ourselves, and pandas then combines the values returned by apply into the final result.
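The three stages can also be spelled out by hand. The following is a minimal sketch on a hypothetical toy DataFrame (the column names `key`, `value`, and `centered` are made up for illustration); it only mimics what groupby and apply do together and is not how pandas implements them internally.

```python
import pandas as pd

# Hypothetical toy data, not part of the tutorial's dataset
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1, 2, 10, 20],
})

pieces = []
# split: iterating over a GroupBy yields (group key, sub-DataFrame) pairs
for key, group in df.groupby("key"):
    # apply: any per-group computation; here, subtract the group's mean
    group = group.copy()
    group["centered"] = group["value"] - group["value"].mean()
    pieces.append(group)

# combine: stitch the per-group results back into one DataFrame
manual = pd.concat(pieces)
print(manual)
```

In everyday code, the loop and the `pd.concat` call are replaced by a single `GroupBy.apply` call.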

**`GroupBy.apply(function)`**
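- The first argument of function is a DataFrame (the rows of one group)
- The return value of function can be a DataFrame, a Series, or a single value; it does not have to be related to the input DataFrame in any way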
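As a rough illustration of the three return types, the sketch below uses a hypothetical `team`/`score` table. The column selection `[["score"]]` is an assumption made so that the function only ever sees the non-key column, which sidesteps differences between pandas versions in how the grouping column is passed to the function.

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "score": [3, 5, 2, 8, 4],
})
grouped = df.groupby("team")[["score"]]

# Single value per group: the combined result is a Series indexed by team
score_range = grouped.apply(lambda g: g["score"].max() - g["score"].min())

# Series per group: the combined result is a DataFrame with one row per team
summary = grouped.apply(
    lambda g: pd.Series({"low": g["score"].min(), "high": g["score"].max()})
)

# DataFrame per group: the per-group frames are stacked into one DataFrame
top1 = grouped.apply(lambda g: g.nlargest(1, "score"))

print(score_range)
print(summary)
print(top1)
```

The function decides the shape of each piece; pandas only decides how to glue the pieces together.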

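**Examples in this tutorial:**
1. How to group by a column and then normalize a numeric column within each group
2. How to take the top N rows of each group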

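## Example 1: How to normalize a numeric column within each group
Normalization rescales numeric columns that have different ranges onto the [0, 1] interval:
- It makes side-by-side comparison easier, for example when a price column ranges from a few hundred to a few thousand while a growth-rate column ranges from 0 to 100
- Machine learning models often train faster and perform better on normalized inputs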

The normalization formula:
<img src="./image/Normalization_Formula.png">
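Assuming the formula in the image is standard min-max scaling, `(x - min) / (max - min)`, which is what the demonstration code in this example computes, a small worked example in plain Python looks like this. The helper name `min_max_normalize` is hypothetical, and the sketch assumes the values are not all equal (otherwise the denominator is zero).

```python
def min_max_normalize(values):
    """Map each value to (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo; identical values would cause a division by zero
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([1.0, 3.0, 4.0, 5.0])
print(normalized)  # [0.0, 0.5, 0.75, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so columns with very different scales end up directly comparable.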

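**Demo: normalizing users' movie ratings**
Users rate differently: optimistic users tend to give high scores and pessimistic users low ones, so we normalize the ratings per user.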

In [1]:
import pandas as pd

In [2]:
ratings = pd.read_csv('./data/movies/ratings.csv')

In [3]:
ratings.head()

(index),userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
from pandas import DataFrame


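# Group by user ID, then min-max normalize the rating column within each group
def ratings_norm(df: DataFrame) -> DataFrame:
    df = df.copy()  # work on a copy so the group passed in by pandas is not mutated
    min_value = df['rating'].min()
    max_value = df['rating'].max()
    # Vectorized min-max scaling; the result is NaN if all of a user's ratings are equal
    df['rating_norm'] = (df['rating'] - min_value) / (max_value - min_value)
    return df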

ratings.groupby(by='userId').apply(ratings_norm)

userId (index),(row index),userId,movieId,rating,timestamp,rating_norm
1,0,1,1,4.0,964982703,0.750000
1,1,1,3,4.0,964981247,0.750000
1,2,1,6,4.0,964982224,0.750000
1,3,1,47,5.0,964983815,1.000000
1,4,1,50,5.0,964982931,1.000000
...,...,...,...,...,...,...
610,100831,610,166534,4.0,1493848402,0.777778
610,100832,610,168248,5.0,1493850091,1.000000
610,100833,610,168250,5.0,1494273047,1.000000
610,100834,610,168252,5.0,1493846352,1.000000
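To try the same idea without the CSV file, here is a minimal self-contained sketch on a hypothetical stand-in table that reuses the `userId`, `movieId`, and `rating` column names. Two choices here are assumptions rather than part of the original notebook: `group_keys=True` is passed explicitly so that `userId` becomes an index level regardless of the pandas version's default, and the non-key columns are selected before `apply`, so `userId` appears only in the index and not as a duplicated column.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv, using the same column names
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 40],
    "rating": [1.0, 4.0, 5.0, 2.0, 3.0],
})


def rating_norm(group: pd.DataFrame) -> pd.DataFrame:
    group = group.copy()
    lo, hi = group["rating"].min(), group["rating"].max()
    group["rating_norm"] = (group["rating"] - lo) / (hi - lo)
    return group


result = (
    toy.groupby("userId", group_keys=True)[["movieId", "rating"]]
    .apply(rating_norm)
)

# Sanity check: each user's normalized ratings should span exactly [0, 1]
per_user = result.groupby(level="userId")["rating_norm"].agg(["min", "max"])
print(result)
print(per_user)
```

A check like `per_user` is a cheap way to confirm that normalization happened per user rather than across the whole table.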
