<img width=200 src="https://camo.githubusercontent.com/903f3cc51db134b8c9faed2ba2b18ffedff67ff2aafe75259cbde477b27d9b4f/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f652f65642f50616e6461735f6c6f676f2e7376672f3132303070782d50616e6461735f6c6f676f2e7376672e706e673f7261773d74727565"></img>

# Day-16 Pandas Aggregation (Split-Apply-Combine Strategy)

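* Goals:
  1. Use the `groupby` method to implement the Split-Apply-Combine strategy from data science
* Key points:
  1. Groupby: group by several columns at once and run computations within each group
  2. Split: break a large dataset into smaller datasets that can be computed independently
  3. Apply: run the computation on each small dataset independently
  4. Combine: merge the results from the small datasets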

## Import packages

In [None]:
# Load the NumPy and pandas packages
import numpy as np
import pandas as pd

# Check that they loaded correctly and print their versions
print(np)
print(np.__version__)
print(pd)
print(pd.__version__)

<module 'numpy' from 'D:\\anaconda3\\lib\\site-packages\\numpy\\__init__.py'>
1.19.2
<module 'pandas' from 'D:\\anaconda3\\lib\\site-packages\\pandas\\__init__.py'>
1.1.3


## Mean

In [None]:
score_df = pd.DataFrame([[1,50,80,70,'boy'], 
              [2,60,45,50,'boy'],
              [3,98,43,55,'boy'],
              [4,70,69,89,'boy'],
              [5,56,79,60,'girl'],
              [6,60,68,55,'girl'],
              [7,45,70,77,'girl'],
              [8,55,77,76,'girl'],
              [9,25,57,60,'girl'],
              [10,88,40,43,'girl']],columns=['student_id','math_score','english_score','chinese_score','sex'])
score_df = score_df.set_index('student_id')
score_df

| student_id | math_score | english_score | chinese_score | sex |
|---|---|---|---|---|
| 1 | 50 | 80 | 70 | boy |
| 2 | 60 | 45 | 50 | boy |
| 3 | 98 | 43 | 55 | boy |
| 4 | 70 | 69 | 89 | boy |
| 5 | 56 | 79 | 60 | girl |
| 6 | 60 | 68 | 55 | girl |
| 7 | 45 | 70 | 77 | girl |
| 8 | 55 | 77 | 76 | girl |
| 9 | 25 | 57 | 60 | girl |
| 10 | 88 | 40 | 43 | girl |


### Method 1: Split the data with boolean indexing

In [None]:
boy_score_df = score_df.loc[score_df.sex=='boy']
girl_score_df = score_df.loc[score_df.sex=='girl']
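# numeric_only=True skips the non-numeric 'sex' column (pandas 2.0+ raises a TypeError without it)
print(boy_score_df.mean(numeric_only=True))
print(girl_score_df.mean(numeric_only=True))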

math_score       69.50
english_score    59.25
chinese_score    66.00
dtype: float64
math_score       54.833333
english_score    65.166667
chinese_score    61.833333
dtype: float64


### Method 2: Use the groupby method

In [None]:
score_df.groupby('sex').mean()

| sex | math_score | english_score | chinese_score |
|---|---|---|---|
| boy | 69.5 | 59.25 | 66.0 |
| girl | 54.833333 | 65.166667 | 61.833333 |
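To make the Split step visible, here is a minimal sketch that inspects the groups `groupby` creates before any aggregation runs. The `scores` frame is a trimmed copy of the data above (math scores only) so the snippet runs on its own; the names `scores`, `grouped`, and `boys` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Trimmed copy of the score table above, rebuilt so the snippet is self-contained.
scores = pd.DataFrame(
    {
        "math_score": [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
        "sex": ["boy"] * 4 + ["girl"] * 6,
    },
    index=pd.Index(range(1, 11), name="student_id"),
)

grouped = scores.groupby("sex")

# Split: iterating over the groupby object yields one (key, sub-DataFrame) pair per group.
for sex, group in grouped:
    print(sex, len(group), group["math_score"].mean())

# get_group picks out one group; it holds the same students as the
# boolean-index selection scores.loc[scores.sex == "boy"].
boys = grouped.get_group("boy")
print(list(boys.index) == list(scores.loc[scores.sex == "boy"].index))
```

Calling `.mean()` on the groupby object then performs the Apply and Combine steps in one go.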


## Understanding the Split-Apply-Combine strategy

* Add a `class` column

In [None]:
score_df['class'] = [1,2,1,2,1,2,1,2,1,2]
score_df

| student_id | math_score | english_score | chinese_score | sex | class |
|---|---|---|---|---|---|
| 1 | 50 | 80 | 70 | boy | 1 |
| 2 | 60 | 45 | 50 | boy | 2 |
| 3 | 98 | 43 | 55 | boy | 1 |
| 4 | 70 | 69 | 89 | boy | 2 |
| 5 | 56 | 79 | 60 | girl | 1 |
| 6 | 60 | 68 | 55 | girl | 2 |
| 7 | 45 | 70 | 77 | girl | 1 |
| 8 | 55 | 77 | 76 | girl | 2 |
| 9 | 25 | 57 | 60 | girl | 1 |
| 10 | 88 | 40 | 43 | girl | 2 |


### [Group By](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

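#### Grouping by multiple columns

* Syntax: `your_dataframe.groupby(['group_column', 'more_group_columns']).aggregation_function()`
  * Split: break the large dataset into smaller datasets that can be computed independently, e.g. separate the boys' and girls' rows
  * Apply: run the computation on each small dataset independently, e.g. take the mean of the scores
  * Combine: merge the results from the small datasets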

In [None]:
score_df.groupby(['sex','class']).mean()

| sex | class | math_score | english_score | chinese_score |
|---|---|---|---|---|
| boy | 1 | 74.0 | 61.5 | 62.5 |
| boy | 2 | 65.0 | 57.0 | 69.5 |
| girl | 1 | 42.0 | 68.666667 | 65.666667 |
| girl | 2 | 67.666667 | 61.666667 | 58.0 |
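The three steps can also be written out by hand, which may help show what the one-line `groupby` call is doing. This is only an illustrative sketch, not how pandas implements it internally; `scores`, `pieces`, `means`, `manual`, and `builtin` are names made up for this example, and the frame keeps just two score columns for brevity.

```python
import numpy as np
import pandas as pd

# Subset of the score table above, rebuilt so the snippet is self-contained.
scores = pd.DataFrame(
    {
        "math_score": [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
        "english_score": [80, 45, 43, 69, 79, 68, 70, 77, 57, 40],
        "sex": ["boy"] * 4 + ["girl"] * 6,
        "class": [1, 2] * 5,
    }
)
score_cols = ["math_score", "english_score"]

# Split: one sub-DataFrame per (sex, class) pair, selected with boolean masks.
pieces = {}
for sex in sorted(scores["sex"].unique()):
    for cls in sorted(scores["class"].unique()):
        mask = (scores["sex"] == sex) & (scores["class"] == cls)
        pieces[(sex, cls)] = scores.loc[mask, score_cols]

# Apply: compute the column means of each piece independently.
means = {key: part.mean() for key, part in pieces.items()}

# Combine: stack the per-group results into one table indexed by the group keys.
manual = pd.DataFrame(
    list(means.values()),
    index=pd.MultiIndex.from_tuples(list(means.keys()), names=["sex", "class"]),
)

# The built-in groupby call gives the same numbers.
builtin = scores.groupby(["sex", "class"])[score_cols].mean()
print(np.allclose(manual.to_numpy(), builtin.to_numpy()))
```

The hand-written version needs a nested loop per grouping column, which is the bookkeeping `groupby` takes care of.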


#### Applying multiple aggregations to one grouping column

* Syntax: `your_dataframe.groupby(['group_column']).agg(['function_name', 'more_function_names'])`

In [None]:
score_df.groupby(['sex']).agg(['mean','std'])

| sex | math_score mean | math_score std | english_score mean | english_score std | chinese_score mean | chinese_score std | class mean | class std |
|---|---|---|---|---|---|---|---|---|
| boy | 69.5 | 20.680103 | 59.25 | 18.191115 | 66.0 | 17.530925 | 1.5 | 0.57735 |
| girl | 54.833333 | 20.566153 | 65.166667 | 14.579666 | 61.833333 | 12.952477 | 1.5 | 0.547723 |
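The result above also aggregates `class`, whose mean and standard deviation carry little meaning because it is a label rather than a score. One way to avoid that, sketched here, is to tell `agg` which functions to run on which columns, either with a dict or with named aggregation. The frame `scores` and the output names `math_mean`, `math_std`, and `chinese_max` are invented for this illustration.

```python
import pandas as pd

# Subset of the score table above, rebuilt so the snippet is self-contained.
scores = pd.DataFrame(
    {
        "math_score": [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
        "chinese_score": [70, 50, 55, 89, 60, 55, 77, 76, 60, 43],
        "sex": ["boy"] * 4 + ["girl"] * 6,
        "class": [1, 2] * 5,
    }
)

# Dict form: choose the functions per column; 'class' is simply left out.
per_column = scores.groupby("sex").agg(
    {"math_score": ["mean", "std"], "chinese_score": ["max"]}
)
print(per_column)

# Named aggregation: output_name=(input_column, function) yields flat column names.
named = scores.groupby("sex").agg(
    math_mean=("math_score", "mean"),
    math_std=("math_score", "std"),
    chinese_max=("chinese_score", "max"),
)
print(named)
```

The dict form keeps the two-level column header, while named aggregation returns ordinary single-level columns that are easier to reference later.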


#### Applying multiple aggregations when grouping by multiple columns

* Syntax: `your_dataframe.groupby(['group_column', 'more_group_columns']).agg(['function_name', 'more_function_names'])`

In [None]:
score_df.groupby(['sex','class']).agg(['mean','max'])

| sex | class | math_score mean | math_score max | english_score mean | english_score max | chinese_score mean | chinese_score max |
|---|---|---|---|---|---|---|---|
| boy | 1 | 74.0 | 98 | 61.5 | 80 | 62.5 | 70 |
| boy | 2 | 65.0 | 70 | 57.0 | 69 | 69.5 | 89 |
| girl | 1 | 42.0 | 56 | 68.666667 | 79 | 65.666667 | 77 |
| girl | 2 | 67.666667 | 88 | 61.666667 | 77 | 58.0 | 76 |
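A result like this has a MultiIndex on both axes: `(sex, class)` on the rows and `(column, function)` on the columns. The sketch below shows one possible way to read a single cell and to flatten the table back into ordinary columns. The frame `scores` keeps only the math scores to stay short, and the names `result`, `flat`, and `boy1_math_max` are illustrative.

```python
import pandas as pd

# Subset of the score table above, rebuilt so the snippet is self-contained.
scores = pd.DataFrame(
    {
        "math_score": [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
        "sex": ["boy"] * 4 + ["girl"] * 6,
        "class": [1, 2] * 5,
    }
)

result = scores.groupby(["sex", "class"]).agg(["mean", "max"])

# Select one cell with a row tuple and a column tuple.
boy1_math_max = result.loc[("boy", 1), ("math_score", "max")]
print(boy1_math_max)

# Join the two column levels into single names such as "math_score_mean".
flat = result.copy()
flat.columns = ["_".join(col) for col in flat.columns]

# Turn the group keys back into regular columns.
flat = flat.reset_index()
print(flat)
```

Flattening is optional; it mainly helps when the table is exported to CSV or merged with other frames.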


## References

* [groupby, aggregation, and group-level operations](https://blog.csdn.net/youngbit007/article/details/54288603)
* [GroupBy](https://www.yiibai.com/pandas/python_pandas_groupby.html)
* [Pandas groupby syntax](http://justimchung.blogspot.com/2019/09/pandas-groupby.html)
* [Split-Apply-Combine Strategy for Data Mining](https://medium.com/analytics-vidhya/split-apply-combine-strategy-for-data-mining-4fd6e2a0cc99)
