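# Hierarchical indexing

As the name suggests, hierarchical indexing means that the index has a composite structure, so a single axis can represent data along several dimensions.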

In [1]:
import numpy as np
import pandas as pd

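## 1. Creating a Series and DataFrame with a multi-level index

### 1.1 Building a MultiIndex from nested lists

When creating a Series or DataFrame, pass a nested list as `index`. Each sublist represents one level, and the sublists are combined position by position, so that each resulting tuple of labels indexes a single value.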

In [2]:
# The first element is the outermost level; the last element is the innermost level
idx = [
    ["a", "a", "b", "b"],
    [1, 2, 1, 2]
]
ser = pd.Series(np.random.randint(1, 100, 4), index=idx)

ser

a  1    57
   2    36
b  1    36
   2    31
dtype: int64
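
The position-by-position pairing described above can be checked directly. The following is a minimal sketch with made-up values (the variable names `outer`, `inner`, and `ser_demo` are illustrative, not part of the original notebook); it compares the index labels against `zip` of the two sublists.

```python
import pandas as pd

# Illustrative toy data; the labels mirror the example above.
outer = ["a", "a", "b", "b"]
inner = [1, 2, 1, 2]
ser_demo = pd.Series([10, 20, 30, 40], index=[outer, inner])

# Each index entry is a tuple built from the labels at the same position.
pairs = list(ser_demo.index)
paired_by_position = pairs == list(zip(outer, inner))

print(paired_by_position)      # True
print(ser_demo.index.nlevels)  # 2
```

The sublists are zipped, not crossed: four labels per level give four index entries, not sixteen.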

In [3]:
arr = np.random.randint(1, 100, (4,2))
cols = ["A", "B"]
df = pd.DataFrame(arr, index=idx, columns=cols)

df

      A   B
a 1  77  28
  2  40  73
b 1   9  71
  2  84  28


In [4]:
type(df.index)

pandas.core.indexes.multi.MultiIndex

A DataFrame is often built from a dictionary. If the dictionary keys are tuples, the columns of the resulting DataFrame form a MultiIndex.

In [5]:
data = {
    ("a", 1): [1, 2],
    ("a", 2): [3, 4],
    ("b", 1): [5, 6],
    ("b", 2): [7, 8]
}

df2 = pd.DataFrame(data)

df2

   a     b
   1  2  1  2
0  1  3  5  7
1  2  4  6  8


In [6]:
type(df2.columns)

pandas.core.indexes.multi.MultiIndex
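
Once the columns form a MultiIndex, a column can be addressed by its outer label alone or by the full tuple. A short sketch, reusing the same dictionary under the illustrative name `df_demo`:

```python
import pandas as pd

data = {
    ("a", 1): [1, 2],
    ("a", 2): [3, 4],
    ("b", 1): [5, 6],
    ("b", 2): [7, 8],
}
df_demo = pd.DataFrame(data)

# Outer label only: returns a DataFrame whose columns are the inner labels.
sub = df_demo["a"]

# Full tuple: returns a single column as a Series.
col = df_demo[("a", 2)]

print(list(sub.columns))  # [1, 2]
print(col.tolist())       # [3, 4]
```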

### 1.2 Creating MultiIndex objects

In [7]:
# Pass a nested list again: one array of labels per level
midx = pd.MultiIndex.from_arrays([
    ["a", "a", "b", "b"],
    [1, 2, 1, 2]
])

midx

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [8]:
# Pass a list of tuples: one tuple per index entry
midx2 = pd.MultiIndex.from_tuples([
    ("a", 1), ("a", 2),
    ("b", 1), ("b", 2)
])

midx2

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [9]:
# Use the Cartesian product of the level values
midx3 = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

midx3

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

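When the index contains every combination of the level values, building it from the Cartesian product is the most concise of the three approaches and the easiest to read.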
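
As a quick check, the three constructors can be compared with `MultiIndex.equals`. This sketch uses illustrative variable names and the same labels as the cells above.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
from_product = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

# All three describe the same four (outer, inner) pairs in the same order.
all_equal = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(all_equal)  # True
```

`from_product` only applies when every combination is present; for an index with missing combinations, `from_arrays` or `from_tuples` is the appropriate constructor.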

Next, simply pass the MultiIndex object to a Series or DataFrame.

In [10]:
ser = pd.Series(np.random.randint(1, 100, 4), index=midx3)

df = pd.DataFrame(np.random.randint(1, 100, (4, 2)), index=midx3, columns=["A", "B"])

print(ser)
print(df)

a  1    30
   2    10
b  1    50
   2    87
dtype: int64
      A   B
a 1  21  69
  2  53   2
b 1  68  19
  2   3  36


A DataFrame can have a MultiIndex on both its rows and its columns.

In [11]:
idx = pd.MultiIndex.from_product([
    ["2019-11-19 13:00:00", "2019-11-19 13:00:05"],
    ["bid", "ask"]
], names=["time", "quote"])

cols = pd.MultiIndex.from_product([
    ["Binance", "Huobi", "Okex"],
    ["BTC/USDT", "ETH/USDT"]
], names=["exchange", "symbol"])

arr = np.random.randint(1, 100, (4, 6))

df = pd.DataFrame(arr, index=idx, columns=cols)

df

exchange                   Binance             Huobi              Okex
symbol                    BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
time                quote
2019-11-19 13:00:00 bid         32       99       95       76       89       25
                    ask         66       71       79       63       12       84
2019-11-19 13:00:05 bid         17       34       24       99        4       53
                    ask         96       38       42       72       93       82


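Each index level is one dimension. The table above contains four dimensions: time (`time`), quote type (`quote`), exchange (`exchange`), and currency pair (`symbol`), yet all the data is stored in a single two-dimensional table. This keeps the data easy to read and makes operations such as indexing and slicing more convenient, which is the main strength of a MultiIndex.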
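
To make the four dimensions concrete, the sketch below rebuilds the same index structure with deterministic values (`np.arange` instead of random integers, an assumption made so the result is reproducible) and looks up one cell by giving a label for every level.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2019-11-19 13:00:00", "2019-11-19 13:00:05"], ["bid", "ask"]],
    names=["time", "quote"],
)
cols = pd.MultiIndex.from_product(
    [["Binance", "Huobi", "Okex"], ["BTC/USDT", "ETH/USDT"]],
    names=["exchange", "symbol"],
)
# Deterministic stand-in for the random data: cell (i, j) holds 6 * i + j.
df_demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=idx, columns=cols)

row_dims = list(df_demo.index.names)    # two row dimensions
col_dims = list(df_demo.columns.names)  # two column dimensions

# One label per dimension pins down a single value (row 3, column 3).
value = df_demo.loc[("2019-11-19 13:00:05", "ask"), ("Huobi", "ETH/USDT")]
print(row_dims, col_dims, value)
```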

## 2. Slicing and indexing

### 2.1 Indexing a Series with a MultiIndex

In [12]:
# Create a Series with a MultiIndex
idx = pd.MultiIndex.from_product([
    ["California", "New York", "Texas"],  # outer level
    [2000, 2010]  # inner level
], names=["state", "year"])

ser = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561], index=idx)

ser

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Indexing on several levels at once

In [13]:
ser["New York", 2010]

19378102

Indexing on a single level

In [14]:
ser["Texas"]

year
2000    20851820
2010    25145561
dtype: int64

In [15]:
ser[:, 2010]

state
California    37253956
New York      19378102
Texas         25145561
dtype: int64

`.loc` and `.iloc` still work as usual

In [16]:
ser.loc[["New York", "Texas"], 2010]

state     year
New York  2010    19378102
Texas     2010    25145561
dtype: int64
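
Selecting on a single level can also be written with `Series.xs`, which names the level explicitly instead of relying on position. A minimal sketch with the same population figures (the name `pop` is illustrative):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["California", "New York", "Texas"], [2000, 2010]],
    names=["state", "year"],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561], index=idx
)

# Cross-section on the inner level; the "year" level is dropped from the result.
by_year = pop.xs(2010, level="year")

# Cross-section on the outer level; the "state" level is dropped from the result.
by_state = pop.xs("Texas", level="state")

print(by_year.tolist())   # [37253956, 19378102, 25145561]
print(by_state.tolist())  # [20851820, 25145561]
```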

### 2.2 Indexing a DataFrame with a MultiIndex

In [17]:
# Create a DataFrame with a MultiIndex on both axes
idx = pd.MultiIndex.from_product([
    ["2019-11-19 13:00:00", "2019-11-19 13:00:05"],
    ["bid", "ask"]
], names=["time", "quote"])

cols = pd.MultiIndex.from_product([
    ["Binance", "Huobi", "Okex"],
    ["BTC/USDT", "ETH/USDT"]
], names=["exchange", "symbol"])

arr = np.random.randint(1, 100, (4, 6))

df = pd.DataFrame(arr, index=idx, columns=cols)

df

exchange                   Binance             Huobi              Okex
symbol                    BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
time                quote
2019-11-19 13:00:00 bid         89        7       19       27       59       57
                    ask         86       76       67       84       24       78
2019-11-19 13:00:05 bid         48       97       57       84       60       94
                    ask         29       34       10       42       85       65


Selecting rows

In [18]:
df.loc["2019-11-19 13:00:00", :]  # select by the outermost row level

exchange  Binance             Huobi              Okex
symbol   BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
quote
bid            89        7       19       27       59       57
ask            86       76       67       84       24       78


To select rows by an inner level with `.loc`, use a `pd.IndexSlice` object, because a bare `:` is not valid syntax inside a tuple.

In [19]:
# df.loc[(:, "bid")] is a SyntaxError

idx_slice = pd.IndexSlice

df.loc[idx_slice[:, "bid"], :]

exchange                   Binance             Huobi              Okex
symbol                    BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
time                quote
2019-11-19 13:00:00 bid         89        7       19       27       59       57
2019-11-19 13:00:05 bid         48       97       57       84       60       94


Selecting columns

In [20]:
df["Binance"]

symbol                    BTC/USDT ETH/USDT
time                quote
2019-11-19 13:00:00 bid         89        7
                    ask         86       76
2019-11-19 13:00:05 bid         48       97
                    ask         29       34


In [21]:
df.loc[:, "Binance"]

symbol                    BTC/USDT ETH/USDT
time                quote
2019-11-19 13:00:00 bid         89        7
                    ask         86       76
2019-11-19 13:00:05 bid         48       97
                    ask         29       34


As with rows, selecting columns by an inner level with `.loc` is done through a `pd.IndexSlice` object.

In [22]:
# df.loc[:, "BTC/USDT"] raises a KeyError, because "BTC/USDT" is not a label of the outer column level

idx_slice = pd.IndexSlice

df.loc[:, idx_slice[:, "BTC/USDT"]]

exchange                   Binance    Huobi     Okex
symbol                    BTC/USDT BTC/USDT BTC/USDT
time                quote
2019-11-19 13:00:00 bid         89       19       59
                    ask         86       67       24
2019-11-19 13:00:05 bid         48       57       60
                    ask         29       10       85


In [23]:
df.loc[idx_slice[:, "bid"], idx_slice[:, "BTC/USDT"]]

exchange                   Binance    Huobi     Okex
symbol                    BTC/USDT BTC/USDT BTC/USDT
time                quote
2019-11-19 13:00:00 bid         89       19       59
2019-11-19 13:00:05 bid         48       57       60
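
The inner-level selections above can also be expressed with `DataFrame.xs`, or with an explicit `slice(None)` in place of `:`. The sketch below uses deterministic stand-in data (`np.arange`, an assumption for reproducibility) and illustrative variable names.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2019-11-19 13:00:00", "2019-11-19 13:00:05"], ["bid", "ask"]],
    names=["time", "quote"],
)
cols = pd.MultiIndex.from_product(
    [["Binance", "Huobi", "Okex"], ["BTC/USDT", "ETH/USDT"]],
    names=["exchange", "symbol"],
)
df_demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=idx, columns=cols)

# Rows where quote == "bid"; xs drops the selected level by default.
bid_rows = df_demo.xs("bid", level="quote")

# Columns where symbol == "BTC/USDT"; axis=1 applies xs to the columns.
btc_cols = df_demo.xs("BTC/USDT", level="symbol", axis=1)

# Keep the level in the result, matching what .loc with IndexSlice returns.
bid_kept = df_demo.xs("bid", level="quote", drop_level=False)

# pd.IndexSlice[:, "bid"] is shorthand for this tuple.
bid_loc = df_demo.loc[(slice(None), "bid"), :]

print(bid_rows.index.tolist())
print(btc_cols.columns.tolist())
```

`xs` is read-only selection by level name; `.loc` with `pd.IndexSlice` remains the choice when several labels or ranges per level are needed.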


## 3. Reshaping a MultiIndex

Reshaping (rearranging) means changing the structure of a Series or DataFrame.

### 3.1 stack, unstack

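* stack: adds an index level by turning column labels into a row index level
* unstack: removes an index level by turning a row index level into column labels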

In [29]:
# Create a Series with a MultiIndex
idx = pd.MultiIndex.from_product([["a", "b", "c"], [1, 2, 3]])
ser = pd.Series(np.random.randint(1, 100, 9), index=idx)
ser.index.names = ["outer", "inner"]
ser

outer  inner
a      1        67
       2        22
       3        53
b      1        99
       2        43
       3        87
c      1        21
       2        84
       3        76
dtype: int64

In [32]:
# Turn the outer row index level into columns
ser.unstack(level=0)  # the level name 'outer' can be passed instead

outer   a   b   c
inner
1      67  99  21
2      22  43  84
3      53  87  76


In [33]:
# Turn the inner index level into columns
ser.unstack(level=1)  # equivalent to level='inner'

inner   1   2   3
outer
a      67  22  53
b      99  43  87
c      21  84  76


In [35]:
# stack is the inverse of unstack
ser.unstack(level=1).stack()

outer  inner
a      1        67
       2        22
       3        53
b      1        99
       2        43
       3        87
c      1        21
       2        84
       3        76
dtype: int64
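
The round trip can be verified programmatically. This is a small sketch with deterministic values (`range(9)` instead of random integers, chosen so the check is reproducible); `ser_demo`, `wide`, and `restored` are illustrative names.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3]], names=["outer", "inner"]
)
ser_demo = pd.Series(range(9), index=idx)

# unstack: the "inner" level becomes the columns of a 3 x 3 DataFrame.
wide = ser_demo.unstack(level="inner")

# stack: the columns go back to being the innermost row index level.
restored = wide.stack()

print(wide.shape)                 # (3, 3)
print(restored.equals(ser_demo))
```

The round trip is lossless here because every (outer, inner) combination exists; with missing combinations, `unstack` fills the gaps with NaN.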

### 3.2 reset_index, set_index

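`stack` and `unstack` move levels between the row index and the columns, while `reset_index` and `set_index` move data between the index and ordinary columns. `unstack` and `reset_index` are available on both Series and DataFrame; `stack` and `set_index` are DataFrame methods.

* reset_index: turns the row index levels into ordinary columns
* set_index: turns ordinary columns into the row index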

In [37]:
df = ser.reset_index()
df

  outer  inner   0
0     a      1  67
1     a      2  22
2     a      3  53
3     b      1  99
4     b      2  43
5     b      3  87
6     c      1  21
7     c      2  84
8     c      3  76


In [38]:
df.set_index(["outer"])

       inner   0
outer
a          1  67
a          2  22
a          3  53
b          1  99
b          2  43
b          3  87
c          1  21
c          2  84
c          3  76


In [39]:
df.set_index(["outer", "inner"])

              0
outer inner
a     1      67
      2      22
      3      53
b     1      99
      2      43
      3      87
c     1      21
      2      84
      3      76
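
In the outputs above the value column is labeled `0` because the Series has no name. One way to get a readable label is the `name` argument of `Series.reset_index`; the sketch below (deterministic values, illustrative names) also shows the round trip back to a Series.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3]], names=["outer", "inner"]
)
ser_demo = pd.Series(range(9), index=idx)

# Index levels become columns; the values go into a column called "value".
flat = ser_demo.reset_index(name="value")

# set_index rebuilds the two-level index; selecting "value" returns a Series.
back = flat.set_index(["outer", "inner"])["value"]

print(list(flat.columns))       # ['outer', 'inner', 'value']
print(list(back.index.names))   # ['outer', 'inner']
```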


## 4. Aggregating data

In [46]:
# Create a DataFrame with a MultiIndex on both axes
idx = pd.MultiIndex.from_product([
    ["2019-11-19 13:00:00", "2019-11-19 13:00:05"],
    ["bid", "ask"]
], names=["time", "quote"])

cols = pd.MultiIndex.from_product([
    ["Binance", "Huobi", "Okex"],
    ["BTC/USDT", "ETH/USDT"]
], names=["exchange", "symbol"])

arr = np.random.randint(1, 100, (4, 6))

df = pd.DataFrame(arr, index=idx, columns=cols)

df

exchange                   Binance             Huobi              Okex
symbol                    BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
time                quote
2019-11-19 13:00:00 bid         54       98       57       38       63       28
                    ask         19       40       18       45       66        1
2019-11-19 13:00:05 bid         56       61       83       43       93       54
                    ask         43       38       64       46       76       24


In [42]:
df.groupby(level="time").mean()  # average bid and ask within each timestamp

exchange             Binance             Huobi              Okex
symbol              BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
time
2019-11-19 13:00:00     36.5     69.0     37.5     41.5     64.5     14.5
2019-11-19 13:00:05     49.5     49.5     73.5     44.5     84.5     39.0


In [43]:
df.groupby(level="quote", sort=False).mean()  # average over time for each quote type

exchange  Binance             Huobi              Okex
symbol   BTC/USDT ETH/USDT BTC/USDT ETH/USDT BTC/USDT ETH/USDT
quote
bid          55.0     79.5     70.0     40.5     78.0     41.0
ask          31.0     39.0     41.0     45.5     71.0     12.5


In [44]:
df.T.groupby(level="exchange").mean().T  # aggregate along a column level

exchange                   Binance    Huobi     Okex
time                quote
2019-11-19 13:00:00 bid       76.0     47.5     45.5
                    ask       29.5     31.5     33.5
2019-11-19 13:00:05 bid       58.5     63.0     73.5
                    ask       40.5     55.0     50.0


In [45]:
df.T.groupby(level="symbol").mean().T

symbol                      BTC/USDT   ETH/USDT
time                quote
2019-11-19 13:00:00 bid    58.000000  54.666667
                    ask    34.333333  28.666667
2019-11-19 13:00:05 bid    77.333333  52.666667
                    ask    61.000000  36.000000
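
Level-wise aggregation is easiest to verify on a small frame with known values. The sketch below is an illustration, not the notebook's data: the labels `t0`, `t1`, `X`, `Y`, `BTC`, `ETH` are hypothetical. It groups by an index level with `groupby(level=...)`; for a column level it transposes, groups, and transposes back, which is one of several possible approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical labels; cell (i, j) holds 4 * i + j.
idx = pd.MultiIndex.from_product(
    [["t0", "t1"], ["bid", "ask"]], names=["time", "quote"]
)
cols = pd.MultiIndex.from_product(
    [["X", "Y"], ["BTC", "ETH"]], names=["exchange", "symbol"]
)
df_demo = pd.DataFrame(np.arange(16).reshape(4, 4), index=idx, columns=cols)

# Mean over rows that share a "quote" label; sort=False keeps bid before ask.
row_means = df_demo.groupby(level="quote", sort=False).mean()

# Mean over columns that share an "exchange" label, via a transpose.
col_means = df_demo.T.groupby(level="exchange").mean().T

print(row_means.loc["bid"].tolist())  # [4.0, 5.0, 6.0, 7.0]
print(col_means["X"].tolist())        # [0.5, 4.5, 8.5, 12.5]
```

The `bid` rows are `[0, 1, 2, 3]` and `[8, 9, 10, 11]`, so their column-wise mean is `[4, 5, 6, 7]`; exchange `X` covers the first two columns, so its row-wise means are `0.5, 4.5, 8.5, 12.5`.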
