# Reshaping and pivoting tables

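- `pivot`: reshapes a `DataFrame` by choosing which of its columns become the row labels (index) and which become the column labels (columns).
- `stack`: reshapes a `DataFrame` by moving one level of its column labels (columns) into the row labels (index).
- `unstack`: reshapes a `DataFrame` by moving one level of its row labels (index) into the column labels (columns).
- `melt`: related to `pivot`, but instead of producing new row labels (index) and column labels (columns) it reshapes the data itself, unpivoting it from wide to long format.
- `pivot_table`: like `pivot`, except that it groups and aggregates numeric data directly; the default aggregation is the mean (`np.mean`).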

## Reshaping by pivoting DataFrame objects

In [1]:
import pandas as pd
import numpy as np
import pandas._testing as tm

Create the dataset with `makeTimeDataFrame` from the `pandas` testing helpers (`pandas._testing`).

In [2]:
df_ = tm.makeTimeDataFrame(3)
df_

Unnamed: 0,A,B,C,D
2000-01-03,0.214866,1.181816,-1.186362,2.025151
2000-01-04,-1.553503,-1.044137,2.172232,0.263494
2000-01-05,0.555798,-0.52117,1.364,-0.47866
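
`pandas._testing` is a private module, so `makeTimeDataFrame` may not be available in every pandas release. As a minimal sketch, assuming only the public API, a frame with the same shape can be built as follows. The name `frame` and the seed are illustrative choices, and the random values will differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tm.makeTimeDataFrame(3): three business days
# starting on 2000-01-03 and four float columns named A to D.
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
frame = pd.DataFrame(
    rng.standard_normal((3, 4)),
    index=pd.bdate_range("2000-01-03", periods=3),
    columns=list("ABCD"),
)
print(frame)
```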


In [3]:
rows, cols = df_.shape
rows, cols

(3, 4)

In [4]:
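data = {
    # Flatten df_ column by column (Fortran order) into a single column of 12 values
    'value': df_.to_numpy().ravel('F'),
    # Turn the column names of df_ into a column of 12 values;
    # repeating each name once per row of df_ guarantees a length of 12
    'variable': np.asarray(df_.columns).repeat(rows),
    # Turn the index of df_ into a column of 12 values;
    # tiling the whole index once per column of df_ guarantees a length of 12
    'date': np.tile(np.asarray(df_.index), cols)
}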

> The difference between `repeat` and `tile` in `numpy`: `repeat` repeats each element of the array, while `tile` repeats the array as a whole.
>
> np.asarray(df_.columns).repeat(rows) -> array(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'], dtype=object)
>
> np.tile(np.asarray(df_.columns), rows) -> array(['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'], dtype=object)
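
The way `repeat` and `tile` pair up to enumerate every (column, index) combination can be sketched on a smaller case. The arrays `columns` and `dates` below are made-up stand-ins, not data from this notebook.

```python
import numpy as np

columns = np.array(["A", "B"])         # stand-in for the column labels
dates = np.array(["d1", "d2", "d3"])   # stand-in for the index labels

# Each column label is repeated once per row: A A A B B B
variable = np.repeat(columns, len(dates))
# The whole index is repeated once per column: d1 d2 d3 d1 d2 d3
date = np.tile(dates, len(columns))

# Zipping the two gives every (column, row) pair in column-major order,
# which matches the order produced by ravel('F') on the values.
pairs = list(zip(variable.tolist(), date.tolist()))
print(pairs)
```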

In [5]:
df = pd.DataFrame(data=data, columns=['date', 'variable', 'value'])
df

Unnamed: 0,date,variable,value
0,2000-01-03,A,0.214866
1,2000-01-04,A,-1.553503
2,2000-01-05,A,0.555798
3,2000-01-03,B,1.181816
4,2000-01-04,B,-1.044137
5,2000-01-05,B,-0.52117
6,2000-01-03,C,-1.186362
7,2000-01-04,C,2.172232
8,2000-01-05,C,1.364
9,2000-01-03,D,2.025151


In [6]:
df.pivot(index='date', columns='variable', values='value')

variable,A,B,C,D
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-03,0.214866,1.181816,-1.186362,2.025151
2000-01-04,-1.553503,-1.044137,2.172232,0.263494
2000-01-05,0.555798,-0.52117,1.364,-0.47866


If `pivot` is called without the `values` argument, every column that is used as neither `index` nor `columns` appears in the result under hierarchical columns.

In [7]:
df['value2'] = df['value'] * 2
df

Unnamed: 0,date,variable,value,value2
0,2000-01-03,A,0.214866,0.429732
1,2000-01-04,A,-1.553503,-3.107006
2,2000-01-05,A,0.555798,1.111597
3,2000-01-03,B,1.181816,2.363632
4,2000-01-04,B,-1.044137,-2.088274
5,2000-01-05,B,-0.52117,-1.042341
6,2000-01-03,C,-1.186362,-2.372724
7,2000-01-04,C,2.172232,4.344464
8,2000-01-05,C,1.364,2.728001
9,2000-01-03,D,2.025151,4.050301


In [8]:
pivoted = df.pivot(index='date', columns='variable')
# The following command produces the same result
# df.pivot(index='date', columns='variable', values=['value', 'value2'])
pivoted

Unnamed: 0_level_0,value,value,value,value,value2,value2,value2,value2
variable,A,B,C,D,A,B,C,D
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
2000-01-03,0.214866,1.181816,-1.186362,2.025151,0.429732,2.363632,-2.372724,4.050301
2000-01-04,-1.553503,-1.044137,2.172232,0.263494,-3.107006,-2.088274,4.344464,0.526988
2000-01-05,0.555798,-0.52117,1.364,-0.47866,1.111597,-1.042341,2.728001,-0.957321


In [9]:
# Select a subset
pivoted['value2']

variable,A,B,C,D
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-03,0.429732,2.363632,-2.372724,4.050301
2000-01-04,-3.107006,-2.088274,4.344464,0.526988
2000-01-05,1.111597,-1.042341,2.728001,-0.957321


> `pivot()` raises 'ValueError: Index contains duplicate entries, cannot reshape' when an index/column pair occurs more than once. In that case, consider using `pivot_table()` instead.
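
The note above can be reproduced on a small frame. This is a sketch with made-up data (`dup` is a hypothetical name): the pair (`2000-01-03`, `A`) occurs twice, so `pivot` fails, while `pivot_table` aggregates the duplicates.

```python
import pandas as pd

# Made-up long-format data with a duplicated (date, variable) pair
dup = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04"],
    "variable": ["A", "A", "A"],
    "value": [1.0, 3.0, 5.0],
})

try:
    dup.pivot(index="date", columns="variable", values="value")
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# pivot_table resolves the duplicates by aggregating them (here with the mean)
table = dup.pivot_table(index="date", columns="variable", values="value", aggfunc="mean")
print(table)
```

Whether the mean is the right way to combine the duplicates depends on the data; `aggfunc` can be set to another aggregation where that fits better.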

## Reshaping by stacking and unstacking

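Closely related to `pivot` are `stack` and `unstack`, which mainly work together with `MultiIndex` objects.

- `stack`: column labels to row labels. Moves one level of the column labels into the innermost level of the DataFrame's row labels (index labels).
- `unstack`: row labels to column labels (the inverse of `stack`). Moves one level of the row labels (index labels) into the innermost level of the DataFrame's column labels.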

In [10]:
index = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'tow']], names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.705106,1.720253
bar,tow,-0.662458,-0.548312
baz,one,-0.983422,-0.766601
baz,tow,1.686339,1.646547
foo,one,1.2987,-0.49754
foo,tow,-0.261602,-0.034739
qux,one,1.155758,-0.491048
qux,tow,-0.506273,0.16478


In [11]:
df2 = df[:4]
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.705106,1.720253
bar,tow,-0.662458,-0.548312
baz,one,-0.983422,-0.766601
baz,tow,1.686339,1.646547


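`stack` "compresses" one level of the `DataFrame` columns and produces one of two results:

- a `Series`, when the columns are a plain index (that is, not hierarchical);
- a `DataFrame`, when the columns are hierarchical (a `MultiIndex`).

If the columns are hierarchical (a `MultiIndex`), you can choose which level to stack.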
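
The two cases can be checked on small frames. This is a sketch with hypothetical names (`flat`, `nested`) and made-up values, not the data used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Plain (single-level) columns: stacking yields a Series
flat = pd.DataFrame(np.arange(4).reshape(2, 2), index=["r0", "r1"], columns=["A", "B"])
flat_stacked = flat.stack()

# Hierarchical columns: stacking one level yields a DataFrame
multi_cols = pd.MultiIndex.from_product([["x", "y"], ["A", "B"]], names=["outer", "inner"])
nested = pd.DataFrame(np.arange(8).reshape(2, 4), index=["r0", "r1"], columns=multi_cols)
inner_stacked = nested.stack()               # innermost level ("inner") by default
outer_stacked = nested.stack(level="outer")  # a specific level, chosen by name

print(type(flat_stacked).__name__)
print(inner_stacked)
print(outer_stacked)
```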

In [12]:
stacked = df2.stack()
stacked

first  second   
bar    one     A   -0.705106
               B    1.720253
       tow     A   -0.662458
               B   -0.548312
baz    one     A   -0.983422
               B   -0.766601
       tow     A    1.686339
               B    1.646547
dtype: float64

In [13]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.705106,1.720253
bar,tow,-0.662458,-0.548312
baz,one,-0.983422,-0.766601
baz,tow,1.686339,1.646547


In [14]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,tow
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,-0.705106,-0.662458
bar,B,1.720253,-0.548312
baz,A,-0.983422,1.686339
baz,B,-0.766601,1.646547


In [15]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.705106,-0.983422
one,B,1.720253,-0.766601
tow,A,-0.662458,1.686339
tow,B,-0.548312,1.646547


### Multiple levels

`stack` and `unstack` can also operate on several levels at once.

In [16]:
columns = pd.MultiIndex.from_tuples(
    [
        ('A', 'cat', 'long'),
        ('B', 'cat', 'long'),
        ('A', 'dog', 'short'),
        ('B', 'dog', 'short')
    ],
    names=['exp', 'animal', 'hair_length']
)
df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
df

exp,A,B,A,B
animal,cat,cat,dog,dog
hair_length,long,long,short,short
0,0.935946,-0.914117,0.327497,-1.512599
1,-0.778146,-0.017815,-1.144527,0.114393
2,1.686533,-0.432218,0.678431,0.641062
3,1.620364,0.174208,0.646346,-1.32237


The columns of the `DataFrame` above are not sorted, so they are displayed as `A, B, A, B`; calling `sort_index` sorts the columns.

In [17]:
df.sort_index(axis=1)

exp,A,A,B,B
animal,cat,dog,cat,dog
hair_length,long,short,long,short
0,0.935946,0.327497,-0.914117,-1.512599
1,-0.778146,-1.144527,-0.017815,0.114393
2,1.686533,0.678431,-0.432218,0.641062
3,1.620364,0.646346,0.174208,-1.32237


In [18]:
df.stack(level=['animal', 'hair_length'])

Unnamed: 0_level_0,Unnamed: 1_level_0,exp,A,B
Unnamed: 0_level_1,animal,hair_length,Unnamed: 3_level_1,Unnamed: 4_level_1
0,cat,long,0.935946,-0.914117
0,dog,short,0.327497,-1.512599
1,cat,long,-0.778146,-0.017815
1,dog,short,-1.144527,0.114393
2,cat,long,1.686533,-0.432218
2,dog,short,0.678431,0.641062
3,cat,long,1.620364,0.174208
3,dog,short,0.646346,-1.32237


### Missing data

When stacking or unstacking, the `MultiIndex` does not always contain every combination of labels; where combinations are absent, missing values can appear in the result.

In [21]:
columns = pd.MultiIndex.from_tuples(
    [
        ('A', 'cat'),
        ('B', 'dog'),
        ('B', 'cat'),
        ('A', 'dog'),
    ],
    names=['exp', 'animal']
)
index = pd.MultiIndex.from_product(
    [('bar', 'baz', 'foo', 'qux'), ('one', 'two')], names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
df

Unnamed: 0_level_0,exp,A,B,B,A
Unnamed: 0_level_1,animal,cat,dog,cat,dog
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
bar,one,1.233449,-0.849435,0.336231,-0.848637
bar,two,0.143593,1.378358,0.418334,1.415829
baz,one,-1.520797,-0.11449,-1.17622,-0.339953
baz,two,-1.319538,-0.802295,0.410794,-0.68087
foo,one,1.014305,-1.936273,0.116974,-1.693696
foo,two,0.016347,-0.648568,-1.143361,-0.861903
qux,one,0.802772,-0.633525,-0.559954,0.684997
qux,two,1.122281,0.009798,1.06363,1.580826


In [20]:
df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
df2

Unnamed: 0_level_0,exp,A,B,B,A
Unnamed: 0_level_1,animal,cat,dog,cat,dog
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
bar,one,0.93626,0.511923,1.434649,-0.861668
bar,two,0.023994,-2.058922,-1.331732,-0.385546
baz,one,-0.16109,-0.931336,-0.216378,1.217661
foo,one,-0.91349,-1.285155,-0.539674,-0.705287
foo,two,-1.641164,1.824304,1.95681,-0.737759
qux,two,-1.073595,-0.435275,-1.105778,-0.9993


In [42]:
df2.stack('animal')

Unnamed: 0_level_0,Unnamed: 1_level_0,exp,A,B
first,second,animal,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,cat,0.93626,1.434649
bar,one,dog,-0.861668,0.511923
bar,two,cat,0.023994,-1.331732
bar,two,dog,-0.385546,-2.058922
baz,one,cat,-0.16109,-0.216378
baz,one,dog,1.217661,-0.931336
foo,one,cat,-0.91349,-0.539674
foo,one,dog,-0.705287,-1.285155
foo,two,cat,-1.641164,1.95681
foo,two,dog,-0.737759,1.824304


In [39]:
df3 = df.iloc[[0, 1, 4, 7], [1, 2]]
df3

Unnamed: 0_level_0,exp,B,B
Unnamed: 0_level_1,animal,dog,cat
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2
bar,one,-0.849435,0.336231
bar,two,1.378358,0.418334
foo,one,-1.936273,0.116974
qux,two,0.009798,1.06363


In [40]:
df3.unstack()

exp,B,B,B,B
animal,dog,dog,cat,cat
second,one,two,one,two
first,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3
bar,-0.849435,1.378358,0.336231,0.418334
foo,-1.936273,,0.116974,
qux,,0.009798,,1.06363
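
The `NaN` entries above come from label combinations that are absent from the index. As a small sketch on made-up data, `unstack` also accepts a `fill_value` argument (a pandas parameter not used elsewhere in these notes) that replaces those gaps.

```python
import pandas as pd

# ("foo", "two") is deliberately absent from the index
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("foo", "one")],
    names=["first", "second"],
)
s = pd.Series([1.0, 2.0, 3.0], index=index)

with_nan = s.unstack()              # the missing combination becomes NaN
filled = s.unstack(fill_value=0.0)  # the missing combination becomes 0.0

print(with_nan)
print(filled)
```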


### With a MultiIndex

In [45]:
df[:3]

Unnamed: 0_level_0,exp,A,B,B,A
Unnamed: 0_level_1,animal,cat,dog,cat,dog
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
bar,one,1.233449,-0.849435,0.336231,-0.848637
bar,two,0.143593,1.378358,0.418334,1.415829
baz,one,-1.520797,-0.11449,-1.17622,-0.339953


In [44]:
df[:3].unstack(0)

exp,A,A,B,B,B,B,A,A
animal,cat,cat,dog,dog,cat,cat,dog,dog
first,bar,baz,bar,baz,bar,baz,bar,baz
second,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
one,1.233449,-1.520797,-0.849435,-0.11449,0.336231,-1.17622,-0.848637,-0.339953
two,0.143593,,1.378358,,0.418334,,1.415829,


In [47]:
df2.unstack(1)

exp,A,A,B,B,B,B,A,A
animal,cat,cat,dog,dog,cat,cat,dog,dog
second,one,two,one,two,one,two,one,two
first,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
bar,0.93626,0.023994,0.511923,-2.058922,1.434649,-1.331732,-0.861668,-0.385546
baz,-0.16109,,-0.931336,,-0.216378,,1.217661,
foo,-0.91349,-1.641164,-1.285155,1.824304,-0.539674,1.95681,-0.705287,-0.737759
qux,,-1.073595,,-0.435275,,-1.105778,,-0.9993


## Reshaping by melt

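`melt` is closely related to `pivot`, except that it does not produce new row labels (index) or column labels (columns); instead it adds two columns, `variable` and `value`, whose names can be customized with the `var_name` and `value_name` parameters.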

In [48]:
cheese = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)
cheese

Unnamed: 0,first,last,height,weight
0,John,Doe,5.5,130
1,Mary,Bo,6.0,150


In [49]:
cheese.melt(id_vars=['first', 'last'])

Unnamed: 0,first,last,variable,value
0,John,Doe,height,5.5
1,Mary,Bo,height,6.0
2,John,Doe,weight,130.0
3,Mary,Bo,weight,150.0


In [50]:
cheese.melt(id_vars=['first', 'last'], var_name='quantity')

Unnamed: 0,first,last,quantity,value
0,John,Doe,height,5.5
1,Mary,Bo,height,6.0
2,John,Doe,weight,130.0
3,Mary,Bo,weight,150.0
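
To connect `melt` back to `pivot`, here is a minimal sketch on a made-up frame (`wide` is a hypothetical name). It sets both `var_name` and `value_name` and then pivots the long result back to the wide layout.

```python
import pandas as pd

wide = pd.DataFrame({
    "name": ["John", "Mary"],
    "height": [5.5, 6.0],
    "weight": [130.0, 150.0],
})

# Wide to long: one row per (name, quantity) pair
long = wide.melt(id_vars="name", var_name="quantity", value_name="measurement")

# Long back to wide: the melted columns become column labels again
back = long.pivot(index="name", columns="quantity", values="measurement")

print(long)
print(back)
```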


How `melt` treats the original index is controlled by the `ignore_index` parameter.

In [53]:
index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
cheese2 = cheese.set_index(index)
cheese2

Unnamed: 0,Unnamed: 1,first,last,height,weight
person,A,John,Doe,5.5,130
person,B,Mary,Bo,6.0,150


In [54]:
cheese2.melt(id_vars=['first', 'last'])

Unnamed: 0,first,last,variable,value
0,John,Doe,height,5.5
1,Mary,Bo,height,6.0
2,John,Doe,weight,130.0
3,Mary,Bo,weight,150.0


In [56]:
cheese2.melt(id_vars=['first', 'last'], ignore_index=False)

Unnamed: 0,Unnamed: 1,first,last,variable,value
person,A,John,Doe,height,5.5
person,B,Mary,Bo,height,6.0
person,A,John,Doe,weight,130.0
person,B,Mary,Bo,weight,150.0


## Combining with stats and GroupBy

In [57]:
df

Unnamed: 0_level_0,exp,A,B,B,A
Unnamed: 0_level_1,animal,cat,dog,cat,dog
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
bar,one,1.233449,-0.849435,0.336231,-0.848637
bar,two,0.143593,1.378358,0.418334,1.415829
baz,one,-1.520797,-0.11449,-1.17622,-0.339953
baz,two,-1.319538,-0.802295,0.410794,-0.68087
foo,one,1.014305,-1.936273,0.116974,-1.693696
foo,two,0.016347,-0.648568,-1.143361,-0.861903
qux,one,0.802772,-0.633525,-0.559954,0.684997
qux,two,1.122281,0.009798,1.06363,1.580826


In [68]:
df.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,exp,A,B
first,second,animal,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,cat,1.233449,0.336231
bar,one,dog,-0.848637,-0.849435
bar,two,cat,0.143593,0.418334
bar,two,dog,1.415829,1.378358
baz,one,cat,-1.520797,-1.17622
baz,one,dog,-0.339953,-0.11449
baz,two,cat,-1.319538,0.410794
baz,two,dog,-0.68087,-0.802295
foo,one,cat,1.014305,0.116974
foo,one,dog,-1.693696,-1.936273


In [63]:
# Each value in the result below is the row-wise mean of A and B, for example:
# (df.stack().loc[('bar', 'one', 'cat'), 'A'] + df.stack().loc[('bar', 'one', 'cat'), 'B']) / 2
df.stack().mean(axis=1)

first  second  animal
bar    one     cat       0.784840
               dog      -0.849036
       two     cat       0.280963
               dog       1.397094
baz    one     cat      -1.348509
               dog      -0.227222
       two     cat      -0.454372
               dog      -0.741583
foo    one     cat       0.565640
               dog      -1.814984
       two     cat      -0.563507
               dog      -0.755236
qux    one     cat       0.121409
               dog       0.025736
       two     cat       1.092955
               dog       0.795312
dtype: float64

In [64]:
df.stack().mean(axis=1).unstack()

Unnamed: 0_level_0,animal,cat,dog
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.78484,-0.849036
bar,two,0.280963,1.397094
baz,one,-1.348509,-0.227222
baz,two,-0.454372,-0.741583
foo,one,0.56564,-1.814984
foo,two,-0.563507,-0.755236
qux,one,0.121409,0.025736
qux,two,1.092955,0.795312


In [65]:
# The same result obtained with groupby
df.groupby(level=1, axis=1).mean()

Unnamed: 0_level_0,animal,cat,dog
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.78484,-0.849036
bar,two,0.280963,1.397094
baz,one,-1.348509,-0.227222
baz,two,-0.454372,-0.741583
foo,one,0.56564,-1.814984
foo,two,-0.563507,-0.755236
qux,one,0.121409,0.025736
qux,two,1.092955,0.795312
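
To my understanding, `groupby(..., axis=1)` is deprecated in recent pandas releases. A possible alternative, sketched here on made-up data (`frame` is a hypothetical name), is to transpose, group on the row level, and transpose back; it gives the same values as the `stack`/`mean`/`unstack` route.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("A", "cat"), ("B", "dog"), ("B", "cat"), ("A", "dog")],
    names=["exp", "animal"],
)
frame = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=columns)

# Transpose so the column levels become row levels, group on "animal",
# take the mean, then transpose back to the original orientation.
by_animal = frame.T.groupby(level="animal").mean().T

# Same computation via stack / mean / unstack, for comparison
via_stack = frame.stack("animal").mean(axis=1).unstack("animal")

print(by_animal)
```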


In [69]:
df.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,exp,A,B
first,second,animal,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,cat,1.233449,0.336231
bar,one,dog,-0.848637,-0.849435
bar,two,cat,0.143593,0.418334
bar,two,dog,1.415829,1.378358
baz,one,cat,-1.520797,-1.17622
baz,one,dog,-0.339953,-0.11449
baz,two,cat,-1.319538,0.410794
baz,two,dog,-0.68087,-0.802295
foo,one,cat,1.014305,0.116974
foo,one,dog,-1.693696,-1.936273


In [67]:
df.stack().groupby(level=1).mean()

exp,A,B
second,Unnamed: 1_level_1,Unnamed: 2_level_1
one,-0.083445,-0.602087
two,0.177071,0.085836


In [70]:
df.stack().groupby(['second', 'animal']).mean()

Unnamed: 0_level_0,exp,A,B
second,animal,Unnamed: 2_level_1,Unnamed: 3_level_1
one,cat,0.382433,-0.320742
one,dog,-0.549322,-0.883431
two,cat,-0.009329,0.187349
two,dog,0.363471,-0.015677


In [73]:
df.mean()

exp  animal
A    cat       0.186552
B    dog      -0.449554
     cat      -0.066697
A    dog      -0.092926
dtype: float64

In [76]:
df.mean().unstack(0)

exp,A,B
animal,Unnamed: 1_level_1,Unnamed: 2_level_1
cat,0.186552,-0.066697
dog,-0.092926,-0.449554


## Pivot tables

`pivot` works with many data types, whereas `pivot_table` is meant for aggregating numeric data.

In [87]:
import datetime

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 6,
        "B": ["A", "B", "C"] * 8,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
        "D": np.random.randn(24),
        "E": np.random.randn(24),
        "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
        + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
    }
)
df

Unnamed: 0,A,B,C,D,E,F
0,one,A,foo,0.755414,0.704228,2013-01-01
1,one,B,foo,0.215269,0.523508,2013-02-01
2,two,C,foo,0.841009,-0.926254,2013-03-01
3,three,A,bar,-1.44581,2.007843,2013-04-01
4,one,B,bar,-1.401973,0.226963,2013-05-01
5,one,C,bar,-0.100918,-1.152659,2013-06-01
6,two,A,foo,-0.548242,0.631979,2013-07-01
7,three,B,foo,-0.14462,0.039513,2013-08-01
8,one,C,foo,0.35402,0.464392,2013-09-01
9,one,A,bar,-0.035513,-3.563517,2013-10-01


In [95]:
df.loc[(df['A'] == 'one') & (df['B'] == 'A') & (df['C'] == 'bar')].groupby(['A', 'B', 'C']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,D,E
A,B,C,Unnamed: 3_level_1,Unnamed: 4_level_1
one,A,bar,0.334604,-2.074117


In [88]:
# Note that pivot_table has a default aggregation function: the mean (np.mean)
pd.pivot_table(df, values='D', index=['A', 'B'], columns='C')

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.334604,-0.109411
one,B,-0.184086,0.072462
one,C,-1.250686,0.282952
three,A,-0.827154,
three,B,,-0.643625
three,C,1.003859,
two,A,,0.741181
two,B,-0.109848,
two,C,,0.574489
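
One way to read the default behaviour is as a `groupby` on the index and column keys, followed by a mean and an `unstack`. The following sketch uses made-up data (`sales` is a hypothetical name) to compare the two.

```python
import pandas as pd

sales = pd.DataFrame({
    "A": ["one", "one", "one", "two"],
    "B": ["x", "x", "y", "x"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation: the two ("one", "x", "foo") rows are averaged
table = pd.pivot_table(sales, values="D", index=["A", "B"], columns="C")

# The same numbers computed by hand with groupby + mean + unstack
manual = sales.groupby(["A", "B", "C"])["D"].mean().unstack("C")

print(table)
print(manual)
```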


In [96]:
pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)

A,one,one,three,three,two,two
C,bar,foo,bar,foo,bar,foo
B,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0.669208,-0.218822,-1.654309,,,1.482361
B,-0.368173,0.144924,,-1.287251,-0.219697,
C,-2.501372,0.565904,2.007719,,,1.148978


In [97]:
pd.pivot_table(df, values=['D', 'E'], index=['B'], columns=['A', 'C'], aggfunc=np.sum)

Unnamed: 0_level_0,D,D,D,D,D,D,E,E,E,E,E,E
A,one,one,three,three,two,two,one,one,three,three,two,two
C,bar,foo,bar,foo,bar,foo,bar,foo,bar,foo,bar,foo
B,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3
A,0.669208,-0.218822,-1.654309,,,1.482361,-4.148235,0.868758,2.992763,,,0.711822
B,-0.368173,0.144924,,-1.287251,-0.219697,,0.497798,0.093412,,-0.360452,2.1377,
C,-2.501372,0.565904,2.007719,,,1.148978,0.239327,-0.563458,0.070684,,,-0.158886


If the `values` argument is omitted, `pivot_table` includes every column that can take part in the aggregation, as shown below.

In [98]:
pd.pivot_table(df, index=['B'], columns=['A', 'C'], aggfunc=np.sum)

Unnamed: 0_level_0,D,D,D,D,D,D,E,E,E,E,E,E
A,one,one,three,three,two,two,one,one,three,three,two,two
C,bar,foo,bar,foo,bar,foo,bar,foo,bar,foo,bar,foo
B,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3
A,0.669208,-0.218822,-1.654309,,,1.482361,-4.148235,0.868758,2.992763,,,0.711822
B,-0.368173,0.144924,,-1.287251,-0.219697,,0.497798,0.093412,,-0.360452,2.1377,
C,-2.501372,0.565904,2.007719,,,1.148978,0.239327,-0.563458,0.070684,,,-0.158886


In [100]:
# with Grouper object
pd.pivot_table(df, values='D', index=pd.Grouper(key='F', freq='M'), columns='C')

C,bar,foo
F,Unnamed: 1_level_1,Unnamed: 2_level_1
2013-01-31,,-0.109411
2013-02-28,,0.072462
2013-03-31,,0.574489
2013-04-30,-0.827154,
2013-05-31,-0.184086,
2013-06-30,-1.250686,
2013-07-31,,0.741181
2013-08-31,,-0.643625
2013-09-30,,0.282952
2013-10-31,0.334604,


### Adding margins

Setting `margins=True` adds row and column totals: an extra 'All' row and 'All' column are appended automatically.

In [102]:
df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,D,D,D,E,E,E
Unnamed: 0_level_1,C,bar,foo,All,bar,foo,All
A,B,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
one,A,0.669208,-0.218822,0.450385,-4.148235,0.868758,-3.279477
one,B,-0.368173,0.144924,-0.223249,0.497798,0.093412,0.591211
one,C,-2.501372,0.565904,-1.935468,0.239327,-0.563458,-0.324131
three,A,-1.654309,,-1.654309,2.992763,,2.992763
three,B,,-1.287251,-1.287251,,-0.360452,-0.360452
three,C,2.007719,,2.007719,0.070684,,0.070684
two,A,,1.482361,1.482361,,0.711822,0.711822
two,B,-0.219697,,-0.219697,2.1377,,2.1377
two,C,,1.148978,1.148978,,-0.158886,-0.158886
All,,-2.066624,1.836093,-0.230531,1.790036,0.591196,2.381233
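
With random data the totals are hard to verify by eye. The same idea on a small made-up frame (`orders` is a hypothetical name) makes it easy to check that the 'All' entries are the sums along each row and column.

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["x", "y", "x", "x"],
    "amount": [1.0, 2.0, 3.0, 4.0],
})

table = orders.pivot_table(
    values="amount",
    index="region",
    columns="product",
    aggfunc="sum",
    margins=True,  # appends the 'All' row and the 'All' column
)
print(table)
```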


## Cross tabulations

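Use `crosstab()` to compute a cross-tabulation of two or more factors.

By default `crosstab()` computes a frequency table of the factors, unless an array of values and an aggregation function are passed.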

In [105]:
foo, bar, dull, shiny, one, two = "foo", "bar", "dull", "shiny", "one", "two"
a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
b = np.array([one, one, two, one, two, one], dtype=object)
c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])

b,one,one,two,two
c,dull,shiny,dull,shiny
a,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
bar,1,0,0,1
foo,2,1,1,0


The values in the result above are occurrence counts, for example:

- index='foo', columns=['one', 'dull'] occurs 2 times
- index='bar', columns=['one', 'shiny'] occurs 0 times

If `crosstab()` receives only two `Series`, it returns a frequency table.

In [106]:
df = pd.DataFrame({
    "A": [1, 2, 2, 2, 2],
    "B": [3, 3, 4, 4, 4],
    "C": [1, 1, np.nan, 1, 1]
})
df

Unnamed: 0,A,B,C
0,1,3,1.0
1,2,3,1.0
2,2,4,
3,2,4,1.0
4,2,4,1.0


In [107]:
pd.crosstab(df['A'], df['B'])

B,3,4
A,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,0
2,1,3


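`crosstab()` also supports frequency analysis of `Categorical` data.

It supports normalization (the `normalize` parameter), which expresses the frequencies as proportions.

It supports adding margins (the `margins` parameter).

Use case: given a table with a 'sex' column and a 'blood type' column, `crosstab()` can be used to analyze how the combinations of sex and blood type are distributed.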
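
The use case can be sketched as follows. The frame `people` and its contents are invented for illustration; `margins=True` adds the totals and `normalize="index"` turns each row into proportions.

```python
import pandas as pd

# Invented sample: one row per person
people = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "blood_type": ["A", "O", "A", "A", "B"],
})

# Counts with an extra 'All' row and column
counts = pd.crosstab(people["sex"], people["blood_type"], margins=True)

# Proportions within each sex (every row sums to 1)
shares = pd.crosstab(people["sex"], people["blood_type"], normalize="index")

print(counts)
print(shares)
```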

## Tiling

Tiling, also known as binning, is done mainly with the `cut()` function.
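
A minimal sketch of binning with `cut()`, using made-up ages, bin edges, and labels. By default each bin includes its right edge, so 18 would still fall in the first bin.

```python
import pandas as pd

ages = pd.Series([6, 17, 25, 40, 70])
bins = [0, 18, 35, 60, 120]                            # edges: (0, 18], (18, 35], ...
labels = ["child", "young adult", "adult", "senior"]   # one label per bin

groups = pd.cut(ages, bins=bins, labels=labels)
print(groups)
```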

## Computing indicator / dummy variables

In [129]:
df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [130]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0
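
In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so the 0/1 output above may appear as `False`/`True`. As a sketch, passing `dtype=int` keeps integer indicators; `prefix` and the `join` back onto the data column are optional extras shown for illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})

# Integer 0/1 indicators with column names key_a, key_b, key_c
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)

# Attach the indicator columns to the remaining data
combined = df[["data1"]].join(dummies)
print(combined)
```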


## Example

In [133]:
np.random.seed([3, 1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
df = cols + pd.DataFrame(
    (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str),
    columns=cols
)
df

Unnamed: 0,key,row,item,col
0,key0,row3,item1,col3
1,key1,row2,item1,col2
2,key1,row0,item1,col0
3,key0,row4,item0,col2
4,key1,row0,item2,col1
5,key1,row2,item2,col4
6,key2,row4,item1,col3
7,key1,row4,item1,col1
8,key1,row0,item2,col4
9,key1,row2,item0,col2
