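Hierarchical indexing (also known as multi-indexing)

Representing higher-dimensional data compactly in a one-dimensional Series and a two-dimensional DataFrame

This section covers how to:
(1) create MultiIndex objects directly,
(2) index, slice, and compute statistics on multiply indexed data, and
(3) convert between simple and hierarchically indexed representations of the data.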

In [38]:
import pandas as pd
import numpy as np

A multiply indexed Series = representing two-dimensional data in a one-dimensional Series

(The bad way) Using Python tuples as keys in a pandas index

In [39]:
index = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
index

[('California', 2000),
 ('California', 2010),
 ('New York', 2000),
 ('New York', 2010),
 ('Texas', 2000),
 ('Texas', 2010)]

In [40]:
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [41]:
# Slice the Series using the tuple keys
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

In [42]:
# Selecting every value from 2010 is clumsy with tuple keys
pop[[i for i in pop.index if i[1]==2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

Pandas MultiIndex = the more efficient way

Creating a MultiIndex from tuples

Signature: pd.MultiIndex.from_tuples(tuples, sortorder=None, names=None)
Docstring:
Convert list of tuples to MultiIndex

Parameters
----------
tuples : list / sequence of tuple-likes
    Each tuple is the index of one row/column.
sortorder : int or None
    Level of sortedness (must be lexicographically sorted by that
    level)

Returns
-------
index : MultiIndex

Examples
--------
>>> tuples = [(1, u'red'), (1, u'blue'),
              (2, u'red'), (2, u'blue')]
>>> MultiIndex.from_tuples(tuples, names=('number', 'color'))

See Also
--------
MultiIndex.from_arrays : Convert list of arrays to MultiIndex
MultiIndex.from_product : Make a MultiIndex from cartesian product
                          of iterables
File:      c:\users\jsong\anaconda3\lib\site-packages\pandas\core\indexes\multi.py
Type:      method

In [43]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [44]:
# Reindexing the Series with the MultiIndex (reindex()) shows the hierarchical representation of the data

pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [45]:
# Select every entry whose second index level is 2010

pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
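
The same selection can also be written with the xs() method, which names the level being matched. The following is a minimal sketch that rebuilds the population Series so it runs on its own; the variable name pop_2010 is only illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Cross-section: keep rows whose second level (position 1) equals 2010.
# The matched level is dropped, leaving a Series indexed by state.
pop_2010 = pop.xs(2010, level=1)
print(pop_2010)
```

Compared with the slice syntax, xs() makes it explicit which level the value refers to, which may be easier to read when an index has many levels.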

In [46]:
# unstack() converts a multiply indexed Series into a conventionally indexed DataFrame

pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [47]:
# stack() does the opposite

pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [48]:
# Add a column alongside the multiply indexed Series by building a DataFrame

pop_df = pd.DataFrame({'total': pop,
                      'under_18': [9267089, 9284094,
                                  4687374, 4318033,
                                  5906301, 6879014]})
pop_df

                    total  under_18
California 2000  33871648   9267089
           2010  37253956   9284094
New York   2000  18976457   4687374
           2010  19378102   4318033
Texas      2000  20851820   5906301
           2010  25145561   6879014


In [49]:
# Compute the fraction of the population under 18

f_u18 = pop_df['under_18']/pop_df['total']
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [50]:
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


Methods of MultiIndex creation (implicit construction)

The simplest way is to pass a list of two or more index arrays as the index argument

In [51]:
df = pd.DataFrame(np.random.rand(4,2),
                 index=[['a', 'a', 'b', 'b'], [1,2,1,2]],
                 columns=['data1', 'data2'])
df


        data1     data2
a 1  0.104790  0.218963
  2  0.070319  0.471946
b 1  0.571494  0.206484
  2  0.337700  0.712958


In [52]:
# Given a dictionary with tuples as keys, pandas builds a MultiIndex automatically

data = {('California', 2000):    33871648,
        ('California', 2010):    37253956,
        ('New York', 2000):      18976457,
        ('New York', 2010):      19378102,
        ('Texas', 2000):         20851820,
        ('Texas', 2010):         25145561}

pd.Series(data)

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Explicit MultiIndex constructors

Use the class-method constructors of pd.MultiIndex

In [53]:
# Build a MultiIndex from a list of arrays

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1,2,1,2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [54]:
# Build a MultiIndex from a list of tuples

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [55]:
# Build a MultiIndex from a Cartesian product of the level values

pd.MultiIndex.from_product([['a', 'b'], [1,2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [56]:
# Pass levels and labels directly (pandas 0.24 and later name the labels argument `codes`)

pd.MultiIndex(levels=[['a', 'b'], [1,2]],
             labels=[[0,0,1,1], [0,1,0,1]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
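
All four constructors above describe the same index. The sketch below checks this, assuming a pandas release (0.24 or newer) in which the integer positions are passed as codes rather than labels; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Assumption: pandas >= 0.24, where `codes` replaces the older `labels` keyword.
# Each code is a position into the corresponding entry of `levels`.
direct = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                       codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

all_equal = (from_arrays.equals(from_tuples)
             and from_tuples.equals(from_product)
             and from_product.equals(direct))
print(all_equal)
```

from_product is usually the shortest to write when every combination of level values is present; from_tuples and from_arrays suit irregular combinations.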

MultiIndex level names

(1) Pass the names keyword argument to any of the MultiIndex constructors, or
(2) set the names attribute of the index after it has been created

In [57]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
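
Option (1), naming the levels at construction time, could look like the following sketch. It rebuilds the population Series from scratch so that it is self-contained; pop_2010 is an illustrative name.

```python
import pandas as pd

# Name the levels while building the index instead of assigning them afterwards
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

print(pop.index.names)

# Once levels are named, they can be referred to by name rather than position
pop_2010 = pop.xs(2010, level='year')
print(pop_2010)
```

Named levels tend to pay off later, since methods such as unstack() and xs() accept the level name in place of its integer position.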

MultiIndex for columns

In [58]:
# Hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1,2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                    names=['subject', 'type'])

In [64]:
# Mock some data
data = np.round(np.random.randn(4,6), 1)
data

array([[-1.4,  0.1,  0.9, -0.2, -1.5,  1.4],
       [ 0.6, -0.6, -0.7,  1.6, -0.7, -1.4],
       [-0.9,  2. ,  0.1, -0.8, -0.1, -0.1],
       [ 0.8, -1. ,  0.4,  0.4,  0.6,  0.2]])

In [65]:
data[:, ::2] *= 10  # scale every other column (columns 0, 2, 4: the HR columns) by 10
data += 37
data

array([[ 23. ,  37.1,  46. ,  36.8,  22. ,  38.4],
       [ 43. ,  36.4,  30. ,  38.6,  30. ,  35.6],
       [ 28. ,  39. ,  38. ,  36.2,  36. ,  36.9],
       [ 45. ,  36. ,  41. ,  37.4,  43. ,  37.2]])

In [66]:
health_data = pd.DataFrame(data, index=index, columns=columns)  # effectively four-dimensional data
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      23.0  37.1  46.0  36.8  22.0  38.4
     2      43.0  36.4  30.0  38.6  30.0  35.6
2014 1      28.0  39.0  38.0  36.2  36.0  36.9
     2      45.0  36.0  41.0  37.4  43.0  37.2


In [67]:
health_data['Guido']  # index the top column level by person to get that person's DataFrame

type          HR  Temp
year visit
2013 1      46.0  36.8
     2      30.0  38.6
2014 1      38.0  36.2
     2      41.0  37.4


Multiply indexed Series

In [68]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [69]:
pop['California', 2000]  # index with multiple terms to access a single element

33871648

In [70]:
pop['California']  # partial indexing on the first level returns a Series

year
2000    33871648
2010    37253956
dtype: int64

In [73]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

In [74]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

In [75]:
pop[pop > 22000000]  # boolean mask

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [76]:
pop[['California', 'Texas']]  # fancy indexing

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

Multiply indexed DataFrames

In [77]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      23.0  37.1  46.0  36.8  22.0  38.4
     2      43.0  36.4  30.0  38.6  30.0  35.6
2014 1      28.0  39.0  38.0  36.2  36.0  36.9
     2      45.0  36.0  41.0  37.4  43.0  37.2


In [78]:
health_data['Guido', 'HR']

year  visit
2013  1        46.0
      2        30.0
2014  1        38.0
      2        41.0
Name: (Guido, HR), dtype: float64

In [79]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      23.0  37.1
     2      43.0  36.4


In [80]:
# With loc, each individual index can be given as a tuple of MultiIndex labels (iloc takes integer positions instead)

health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        23.0
      2        43.0
2014  1        28.0
      2        45.0
Name: (Bob, HR), dtype: float64

In [81]:
# Writing a slice inside a tuple is a syntax error, so pd.IndexSlice must be used instead

health_data.loc[(:, 1), (:, 'HR')]

SyntaxError: invalid syntax (<ipython-input-81-0fef5a9a5433>, line 3)

In [82]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      23.0  46.0  22.0
2014 1      28.0  38.0  36.0
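
As a further illustration of pd.IndexSlice, the sketch below selects the second visit of every year for a single subject. It rebuilds health_data from the fixed values printed earlier rather than from random numbers, so the result is reproducible; guido_visit2 is an illustrative name.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.array([[23.0, 37.1, 46.0, 36.8, 22.0, 38.4],
                   [43.0, 36.4, 30.0, 38.6, 30.0, 35.6],
                   [28.0, 39.0, 38.0, 36.2, 36.0, 36.9],
                   [45.0, 36.0, 41.0, 37.4, 43.0, 37.2]])
health_data = pd.DataFrame(values, index=index, columns=columns)

idx = pd.IndexSlice
# Rows: any year, visit 2. Columns: subject 'Guido', any measurement type.
guido_visit2 = health_data.loc[idx[:, 2], idx['Guido', :]]
print(guido_visit2)
```

idx[:, 2] simply builds the tuple (slice(None), 2), which is why it can stand in for the slice syntax that Python rejects inside a plain tuple.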


Rearranging a MultiIndex

Slicing a MultiIndex only works when the index is lexicographically sorted.
If it is not sorted, sort it first with sort_index() (or sortlevel(), which is deprecated in newer pandas releases) and then slice.

In [85]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1,2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.715316
      2      0.826650
c     1      0.029798
      2      0.744899
b     1      0.610969
      2      0.389377
dtype: float64

In [86]:
data['a':'b']

UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

In [87]:
data = data.sort_index()
data

char  int
a     1      0.715316
      2      0.826650
b     1      0.610969
      2      0.389377
c     1      0.029798
      2      0.744899
dtype: float64

In [88]:
data['a':'b']

char  int
a     1      0.715316
      2      0.826650
b     1      0.610969
      2      0.389377
dtype: float64
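
A defensive pattern is to check whether the index is already sorted and sort only when needed. The sketch below assumes fixed values instead of random ones so the result is predictable; subset is an illustrative name.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=index)

# The entries appear in the order a, c, b, so the index is not sorted yet
if not data.index.is_monotonic_increasing:
    data = data.sort_index()

# Label slicing on the first level is now safe
subset = data.loc['a':'b']
print(subset)
```

Sorting once up front is usually cheaper than catching UnsortedIndexError at every slice.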

In [89]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Stacking and unstacking indices: stack() and unstack()

Signature: pop.unstack(level=-1, fill_value=None)
Docstring:
Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.
The level involved will automatically get sorted.

Parameters
----------
level : int, string, or list of these, default last level
    Level(s) to unstack, can pass level name
fill_value : replace NaN with this value if the unstack produces
    missing values

    .. versionadded: 0.18.0

Examples
--------
>>> s = pd.Series([1, 2, 3, 4],
...     index=pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']]))
>>> s
one  a    1
     b    2
two  a    3
     b    4
dtype: int64

>>> s.unstack(level=-1)
     a  b
one  1  2
two  3  4

>>> s.unstack(level=0)
   one  two
a    1    3
b    2    4

Returns
-------
unstacked : DataFrame
File:      c:\users\jsong\anaconda3\lib\site-packages\pandas\core\series.py
Type:      method

In [92]:
pop.unstack(level=0)

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


In [93]:
pop.unstack(level=1)

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [94]:
pop.unstack(level=-1)

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [95]:
pop.unstack()

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [97]:
pop.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
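
Because the levels of pop are named, the level argument can also be given as a name instead of an integer position. A minimal sketch, rebuilding the Series so that it runs on its own (by_name, by_position, and same are illustrative names):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Move the 'state' level from the row index into the columns
by_name = pop.unstack(level='state')
by_position = pop.unstack(level=0)

# Both spellings refer to the same level, so the results should match
same = by_name.equals(by_position)
print(by_name)
print(same)
```

Using the name keeps the call correct even if the order of the levels changes later.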

Index setting and resetting

reset_index() turns the index labels into columns; its optional name argument names the data column for clarity (the result is a DataFrame)

In [98]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [99]:
pop_flat = pop.reset_index(name='population')  # this produces a DataFrame
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


In [100]:
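# set_index() rebuilds the MultiIndex from the columns; the result is still a DataFrame,
# so select its 'population' column to get a Series back
pop_flat.set_index(['state', 'year'])  # returns a multiply indexed DataFrame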

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561
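
Note that set_index() returns a DataFrame, not a Series. To recover a multiply indexed Series, one more step is needed: selecting the data column. The sketch below starts from a flat DataFrame built by hand so that it is self-contained; pop_indexed and pop_series are illustrative names.

```python
import pandas as pd

pop_flat = pd.DataFrame({
    'state': ['California', 'California', 'New York',
              'New York', 'Texas', 'Texas'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [33871648, 37253956, 18976457,
                   19378102, 20851820, 25145561]})

# Step 1: build the MultiIndex from two columns (still a DataFrame)
pop_indexed = pop_flat.set_index(['state', 'year'])

# Step 2: pick the remaining column to obtain a Series
pop_series = pop_indexed['population']
print(type(pop_indexed).__name__, type(pop_series).__name__)
print(pop_series)
```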


Data aggregation on a MultiIndex (for a DataFrame, use the level and axis keyword arguments)

In [101]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      23.0  37.1  46.0  36.8  22.0  38.4
     2      43.0  36.4  30.0  38.6  30.0  35.6
2014 1      28.0  39.0  38.0  36.2  36.0  36.9
     2      45.0  36.0  41.0  37.4  43.0  37.2


In [102]:
data_mean = health_data.mean(level='year')
data_mean

subject   Bob        Guido         Sue
type       HR   Temp    HR  Temp    HR   Temp
year
2013     33.0  36.75  38.0  37.7  26.0  37.00
2014     36.5  37.50  39.5  36.8  39.5  37.05


In [103]:
data_mean.mean(axis=1, level='type')

type         HR       Temp
year
2013  32.333333  37.150000
2014  38.500000  37.116667


In [104]:
data_mean1 = health_data.mean(axis=1, level='type')
data_mean1

type               HR       Temp
year visit
2013 1      30.333333  37.433333
     2      34.333333  36.866667
2014 1      34.000000  37.366667
     2      43.000000  36.866667
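
The level argument of mean() used in this section was deprecated in pandas 1.3 and is no longer accepted from pandas 2.0 onward. On such releases the same aggregations can be written with groupby(level=...). The sketch below is one possible equivalent, not the notebook's original code; it rebuilds health_data from the fixed values shown above, and type_mean is an illustrative name.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.array([[23.0, 37.1, 46.0, 36.8, 22.0, 38.4],
                   [43.0, 36.4, 30.0, 38.6, 30.0, 35.6],
                   [28.0, 39.0, 38.0, 36.2, 36.0, 36.9],
                   [45.0, 36.0, 41.0, 37.4, 43.0, 37.2]])
health_data = pd.DataFrame(values, index=index, columns=columns)

# Average the visits within each year (group rows by the 'year' level)
data_mean = health_data.groupby(level='year').mean()

# Average over subjects for each measurement type. Transposing first lets
# the column level be grouped as a row level; the final .T restores the layout.
type_mean = data_mean.T.groupby(level='type').mean().T

print(data_mean)
print(type_mean)
```

Grouping on the transposed frame is a deliberate choice here: it avoids column-wise groupby, which newer pandas releases also discourage.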
