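## Pivot Table

A pivot table summarizes a numeric variable across one or more dimensions. Each dimension is a categorical variable, and the aggregation can be a sum, mean, variance, and so on.

In pandas, pivot tables are built with the `df.pivot_table()` method.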
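
As a minimal sketch of the call, the example below pivots a small toy DataFrame. The data and the names `sales`, `region`, `product`, and `amount` are hypothetical and are not part of the Titanic dataset used in the rest of this section.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
sales = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "product": ["a", "a", "b", "a", "b", "b"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 70.0],
})

# One row per region, one column per product;
# each cell holds the mean of `amount` (the default aggregation)
table = sales.pivot_table(values="amount", index="region", columns="product")
print(table)
```

Each cell aggregates all rows that share the same `region` and `product`, so the two `north`/`a` rows collapse into a single mean.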

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
titanic = sns.load_dataset("titanic")

In [3]:
titanic.head()

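| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |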


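Points to consider when building a pivot table:

1. Which numeric variable should be summarized?
2. Which dimensions should the numeric variable be split by?
3. Which aggregation method should be used?
4. Should marginal totals be added?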
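
The four questions above map onto four arguments of `pivot_table`: `values`, `index`/`columns`, `aggfunc`, and `margins`. The sketch below shows that mapping on hypothetical toy data; the names `df`, `group`, `kind`, and `score` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "kind": ["p", "q", "p", "q"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

table = df.pivot_table(
    values="score",   # 1. the numeric variable to summarize
    index="group",    # 2. dimension used for the rows
    columns="kind",   #    dimension used for the columns
    aggfunc="sum",    # 3. the aggregation method (the default is the mean)
    margins=True,     # 4. add marginal totals, labeled "All" by default
)
print(table)
```

With `margins=True`, the extra `All` row and column apply the same aggregation to each whole row, each whole column, and the full data.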

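### 1.1 Dimensions

Split the data by sex and passenger class and compute the mean of `survived`, which is the survival rate of each group. The mean is the default aggregation.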

In [6]:
titanic.pivot_table(
    values="survived",
    index="sex",  # index and columns set the dimensions used to split the data
    columns="class"
)

| sex \ class | First | Second | Third |
|---|---|---|---|
| female | 0.968085 | 0.921053 | 0.5 |
| male | 0.368852 | 0.157407 | 0.135447 |
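
One way to read a table like this: with the default mean aggregation, each cell is the mean of `survived` within one (`sex`, `class`) group, so the same numbers can be obtained with `groupby` followed by `unstack`. The sketch below checks this on a few hypothetical rows rather than on the real Titanic data.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not the real Titanic data)
df = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "class": ["First", "Third", "First", "Third", "Third"],
    "survived": [1, 0, 0, 1, 0],
})

# Pivot table with the default aggregation (mean)
pivot = df.pivot_table(values="survived", index="sex", columns="class")

# Same summary: group by both dimensions, take the mean,
# then move the `class` level from the row index into the columns
grouped = df.groupby(["sex", "class"])["survived"].mean().unstack()

print(pivot)
print(grouped)
```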


More than two dimensions.

In [10]:
age = pd.cut(titanic["age"], bins=[0, 18, 80])  # numeric variable ==> categorical variable

titanic.pivot_table(
    values="survived",
    index=["sex", age],
    columns=["class", "embark_town"]
)

| sex | age | First, Cherbourg | First, Queenstown | First, Southampton | Second, Cherbourg | Second, Queenstown | Second, Southampton | Third, Cherbourg | Third, Queenstown | Third, Southampton |
|---|---|---|---|---|---|---|---|---|---|---|
| female | (0, 18] | 1.0 | NaN | 0.857143 | 1.0 | NaN | 1.0 | 0.692308 | 0.75 | 0.384615 |
| female | (18, 80] | 0.970588 | 1.0 | 0.972973 | 1.0 | 1.0 | 0.890909 | 0.666667 | 0.333333 | 0.42 |
| male | (0, 18] | 0.5 | NaN | 1.0 | 1.0 | NaN | 0.571429 | 0.4 | 0.0 | 0.214286 |
| male | (18, 80] | 0.441176 | 0.0 | 0.344262 | 0.0 | 0.0 | 0.078947 | 0.25 | 0.1 | 0.122093 |
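
The age dimension above comes from `pd.cut`, which turns a numeric variable into interval categories. The sketch below shows how `bins=[0, 18, 80]` assigns values to intervals; the ages are hypothetical and chosen to hit the bin edges.

```python
import pandas as pd

# Hypothetical ages, chosen to land on and around the bin edges
ages = pd.Series([5, 18, 19, 80, None])

# By default the intervals are closed on the right: (0, 18] and (18, 80]
binned = pd.cut(ages, bins=[0, 18, 80])
print(binned)

# Count how many values fall into each interval (missing values are not counted)
print(binned.value_counts())
```

Because the intervals are closed on the right, an age of exactly 18 falls into `(0, 18]`, while a value of exactly 0 would fall outside every bin unless `include_lowest=True` is passed. Missing ages stay missing after binning.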
