# Normalization && How do you apply a function to each group in Pandas?

**Background: Pandas GroupBy follows the split-apply-combine pattern**

![img](https://nbviewer.jupyter.org/github/peiss/ant-learn-pandas/blob/master/other_files/pandas-split-apply-combine.png)

Here, split is what pandas `groupby` does. We write the function passed to apply ourselves, and pandas combines the values that function returns into the final result.

**GroupBy.apply(function)**

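- The first argument of `function` is a DataFrame (the rows of one group)
- `function` may return a DataFrame, a Series, or a single value; the result does not even have to be related to the input DataFrame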
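
As a rough illustration of the two points above, here is a minimal sketch on made-up data. The `toy` frame, its `city` and `temp` columns, and both helper functions are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data; column names are made up for illustration.
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "temp": [10.0, 14.0, 20.0, 26.0, 23.0],
})

def temp_range(group: pd.DataFrame) -> float:
    # `group` holds the rows of one city; return a single value.
    return group["temp"].max() - group["temp"].min()

def top_two(group: pd.DataFrame) -> pd.DataFrame:
    # Return a DataFrame: the two hottest rows of the group.
    return group.sort_values("temp", ascending=False).head(2)

# Selecting columns after groupby keeps the grouping column out of `group`.
grouped = toy.groupby("city")[["temp"]]
ranges = grouped.apply(temp_range)   # one scalar per group -> Series
top2 = grouped.apply(top_two)        # one DataFrame per group -> DataFrame
```

`ranges` ends up with one value per city, while `top2` stacks the per-group frames. Selecting `[["temp"]]` before `apply` is a design choice here so that the function never sees the grouping column.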

**Examples covered here:**

1. How do you normalize numeric columns within each group?
2. How do you take the top N rows of each group?

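## Example 1: How do you normalize numeric columns within each group?

Normalization maps numeric columns with different ranges onto the interval [0, 1]:

- It makes columns easier to compare side by side, for example a price column that runs from hundreds to thousands against a growth-rate column that runs from 0 to 100
- Machine learning models often train faster and perform better on normalized inputs

### Formula for linear (min-max) normalization: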

![img](https://nbviewer.jupyter.org/github/peiss/ant-learn-pandas/blob/master/other_files/Normalization-Formula.jpg)
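
In case the image does not load, the formula can be written out as follows. This assumes the figure shows the usual min-max form, which is what the code in this section computes. The numbers in the worked case are made up for illustration.

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad \text{e.g. } \frac{500 - 200}{2000 - 200} = \frac{300}{1800} \approx 0.167
$$

The smallest value in a column maps to 0 and the largest to 1. A constant column makes the denominator zero, so it needs separate handling.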

### Demo: normalizing max temperature, min temperature, and the air quality index (aqi)


In [133]:
# Compute the daily temperature difference (max minus min)
df['diff'] = df['max_temperature'] - df['min_temperature']

In [134]:
df.head()

| date (index) | date | week | max_temperature | min_temperature | day_status | wind | aqi | aqi_status | 中文月份1 | 中文月份2 | 中文月份3 | 中文月份4 | diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-01 | 2017-01-01 | 周日 | 9.0 | 2.0 | 多云 | 无持续风向微风 | 372 | 严重 | 一月 | 一月 | 一月 | 一月 | 7.0 |
| 2017-01-02 | 2017-01-02 | 周一 | 9.0 | 2.0 | 霾 | 无持续风向微风 | 361 | 严重 | 一月 | 一月 | 一月 | 一月 | 7.0 |
| 2017-01-03 | 2017-01-03 | 周二 | 9.0 | 2.0 | 霾~雾 | 无持续风向微风 | 280 | 重度 | 一月 | 一月 | 一月 | 一月 | 7.0 |
| 2017-01-04 | 2017-01-04 | 周三 | 9.0 | 2.0 | 小雨 | 无持续风向微风 | 193 | 中度 | 一月 | 一月 | 一月 | 一月 | 7.0 |
| 2017-01-05 | 2017-01-05 | 周四 | 5.0 | 1.0 | 小雨 | 无持续风向微风 | 216 | 重度 | 一月 | 一月 | 一月 | 一月 | 4.0 |


In [135]:
# Check for outliers
print(df['max_temperature'].max())
print(df['min_temperature'].max())
print(df['aqi'].max())

39.0
29.0
422


In [136]:
# Method 1
df['max1'] = (df['max_temperature'] - df['max_temperature'].min()) / \
    (df['max_temperature'].max() - df['max_temperature'].min())
df['max1'].head()

date
2017-01-01    0.268293
2017-01-02    0.268293
2017-01-03    0.268293
2017-01-04    0.268293
2017-01-05    0.170732
Name: max1, dtype: float64

In [137]:
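# Attempt with Series.map: the lambda runs once per element, so x is a scalar
# and (x - x) / (x - x) evaluates to 0 / 0, which is NaN for every row
df['max_temperature'].map(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))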

date
2017-01-01   NaN
2017-01-02   NaN
2017-01-03   NaN
2017-01-04   NaN
2017-01-05   NaN
              ..
2019-12-27   NaN
2019-12-28   NaN
2019-12-29   NaN
2019-12-30   NaN
2019-12-31   NaN
Name: max_temperature, Length: 1095, dtype: float64

In [138]:
df['max1'] = df[['max_temperature']].apply(
    lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
df['max1'].head()

date
2017-01-01    0.268293
2017-01-02    0.268293
2017-01-03    0.268293
2017-01-04    0.268293
2017-01-05    0.170732
Name: max1, dtype: float64

In [139]:
type(df['max_temperature'])

pandas.core.series.Series

In [140]:
type(df[['max_temperature']])

pandas.core.frame.DataFrame
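
The two `type(...)` checks above point at why the `map` call returned only NaN while the `apply` call worked: `Series.map` calls the function once per element, whereas `DataFrame.apply` calls it once per column. The following is a small sketch of that difference on a hypothetical four-row column (`toy` and `min_max` are illustrative names, not from the original notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['max_temperature'].
toy = pd.DataFrame({"max_temperature": [9.0, 5.0, 39.0, -2.0]})

def min_max(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Series.map: x is a single number, so min and max both equal x
# and the expression is 0 / 0, which NumPy evaluates to NaN.
with np.errstate(invalid="ignore"):
    per_element = toy["max_temperature"].map(min_max)

# DataFrame.apply: x is a whole column (a Series), so min and max
# are taken over all rows.
per_column = toy[["max_temperature"]].apply(min_max)["max_temperature"]

# Calling the function on the Series directly gives the same result.
direct = min_max(toy["max_temperature"])
```

Here `per_element` is all NaN, while `per_column` and `direct` hold the scaled values.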

In [141]:
# Apply the same min-max scaling to min_temperature, aqi, and diff
df['min1'] = df[['min_temperature']].apply(
    lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
df['aqi1'] = df[['aqi']].apply(lambda x: (
    x - np.min(x)) / (np.max(x) - np.min(x)))
df['diff1'] = df[['diff']].apply(lambda x: (
    x - np.min(x)) / (np.max(x) - np.min(x)))
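
The cells above scale each column over the whole table. To scale within groups instead, as the title of this example suggests, one possible approach is sketched below. It uses `GroupBy.transform` rather than `GroupBy.apply`, because `transform` returns a result aligned with the original rows. The `month` column and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: a made-up `month` column is used as the grouping key.
toy = pd.DataFrame({
    "month": ["2017-01", "2017-01", "2017-01", "2017-07", "2017-07", "2017-07"],
    "max_temperature": [9.0, 5.0, 1.0, 30.0, 39.0, 33.0],
})

def min_max(col: pd.Series) -> pd.Series:
    # Scale one group's values to [0, 1] using that group's own min and max.
    return (col - col.min()) / (col.max() - col.min())

# transform calls min_max once per month and aligns the result to toy's index.
toy["max_by_month"] = toy.groupby("month")["max_temperature"].transform(min_max)
```

With this, 9.0 in January and 39.0 in July both map to 1.0, since each is the maximum of its own month.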

In [142]:
df.cov()

| | max_temperature | min_temperature | aqi | diff | max1 | min1 | aqi1 | diff1 |
|---|---|---|---|---|---|---|---|---|
| max_temperature | 110.751668 | 101.611776 | -243.277537 | 9.139892 | 2.70126 | 2.60543 | -0.605168 | 0.507772 |
| min_temperature | 101.611776 | 102.135375 | -273.362354 | -0.523599 | 2.478336 | 2.618856 | -0.680006 | -0.029089 |
| aqi | -243.277537 | -273.362354 | 3624.097009 | 30.084817 | -5.933598 | -7.009291 | 9.015167 | 1.671379 |
| diff | 9.139892 | -0.523599 | 30.084817 | 9.663491 | 0.222924 | -0.013426 | 0.074838 | 0.536861 |
| max1 | 2.70126 | 2.478336 | -5.933598 | 0.222924 | 0.065884 | 0.063547 | -0.01476 | 0.012385 |
| min1 | 2.60543 | 2.618856 | -7.009291 | -0.013426 | 0.063547 | 0.06715 | -0.017436 | -0.000746 |
| aqi1 | -0.605168 | -0.680006 | 9.015167 | 0.074838 | -0.01476 | -0.017436 | 0.022426 | 0.004158 |
| diff1 | 0.507772 | -0.029089 | 1.671379 | 0.536861 | 0.012385 | -0.000746 | 0.004158 | 0.029826 |


In [143]:
df.corr()

| | max_temperature | min_temperature | aqi | diff | max1 | min1 | aqi1 | diff1 |
|---|---|---|---|---|---|---|---|---|
| max_temperature | 1.0 | 0.95539 | -0.383996 | 0.279382 | 1.0 | 0.95539 | -0.383996 | 0.279382 |
| min_temperature | 0.95539 | 1.0 | -0.449315 | -0.016666 | 0.95539 | 1.0 | -0.449315 | -0.016666 |
| aqi | -0.383996 | -0.449315 | 1.0 | 0.160761 | -0.383996 | -0.449315 | 1.0 | 0.160761 |
| diff | 0.279382 | -0.016666 | 0.160761 | 1.0 | 0.279382 | -0.016666 | 0.160761 | 1.0 |
| max1 | 1.0 | 0.95539 | -0.383996 | 0.279382 | 1.0 | 0.95539 | -0.383996 | 0.279382 |
| min1 | 0.95539 | 1.0 | -0.449315 | -0.016666 | 0.95539 | 1.0 | -0.449315 | -0.016666 |
| aqi1 | -0.383996 | -0.449315 | 1.0 | 0.160761 | -0.383996 | -0.449315 | 1.0 | 0.160761 |
| diff1 | 0.279382 | -0.016666 | 0.160761 | 1.0 | 0.279382 | -0.016666 | 0.160761 | 1.0 |
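
The two outputs above suggest a pattern: the covariances change after scaling, but the correlations of the scaled columns match those of the originals. Min-max scaling is a linear transformation with a positive slope, and Pearson correlation is unchanged by such transformations. A small check on made-up numbers (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    "max_temperature": [9.0, 5.0, 20.0, 39.0, 31.0],
    "aqi": [372.0, 216.0, 90.0, 45.0, 60.0],
})

# Min-max scale every column at once.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Correlation is unaffected by the rescaling; covariance is not.
corr_same = np.allclose(toy.corr(), scaled.corr())
cov_same = np.allclose(toy.cov(), scaled.cov())
```

On this data `corr_same` is `True` and `cov_same` is `False`.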


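### Z-score standardization

   This method standardizes the data using the mean and standard deviation of the original data. The processed data has a mean of 0 and a standard deviation of 1; it follows the standard normal distribution only when the original data is itself normally distributed. The transformation is:
   $$
   \hat x=\frac{x-\mu}{\sigma}
   $$
   where μ is the mean of all sample data and σ is the standard deviation of all sample data. This method works best when the original data is approximately Gaussian; otherwise the results degrade.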
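
The formula can be sketched in pandas as follows. The helper name `z_score`, its `ddof` parameter default, and the sample values are assumptions for illustration; the population standard deviation (`ddof=0`) is used here as one reading of "all sample data".

```python
import pandas as pd

def z_score(col: pd.Series, ddof: int = 0) -> pd.Series:
    # Subtract the mean, then divide by the standard deviation.
    return (col - col.mean()) / col.std(ddof=ddof)

# Hypothetical temperatures.
temps = pd.Series([9.0, 5.0, 20.0, 39.0, 31.0])
z = z_score(temps)

z_mean = z.mean()        # close to 0, up to floating-point error
z_std = z.std(ddof=0)    # close to 1 when the same ddof is used
```

Note that the standardized values have mean 0 and standard deviation 1 regardless of the shape of the original distribution; only the shape itself is left unchanged.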


In [144]:
# Sample vs. population standard deviation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html
# ddof: int, default 1
# The divisor is N - ddof

In [145]:
# df.std(ddof=1)
# Sample standard deviation
df['max_temperature'].std(ddof=1)

10.523861852688613

In [146]:
# Population standard deviation
df['max_temperature'].std(ddof=0)

10.51905533868924
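
To make the `N - ddof` divisor concrete, here is a minimal sketch that computes both variants by hand and compares them with the library results. The helper `std_by_hand` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures.
temps = pd.Series([9.0, 5.0, 20.0, 39.0, 31.0])

def std_by_hand(values: pd.Series, ddof: int) -> float:
    # Sum of squared deviations from the mean, divided by N - ddof.
    n = len(values)
    squared_dev = ((values - values.mean()) ** 2).sum()
    return float(np.sqrt(squared_dev / (n - ddof)))

sample_std = std_by_hand(temps, ddof=1)       # divisor N - 1
population_std = std_by_hand(temps, ddof=0)   # divisor N

# pandas defaults to ddof=1, while NumPy's np.std defaults to ddof=0.
matches_pandas = np.isclose(sample_std, temps.std())
matches_numpy = np.isclose(population_std, np.std(temps.to_numpy()))
```

The two results differ by a factor of sqrt(N / (N - 1)). For the outputs above, the ratio 10.5239 / 10.5191 is consistent with sqrt(1095 / 1094), given the 1095 rows reported earlier and assuming the column has no missing values.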