# Pandas Data Merging and Reshaping
- Merging datasets
    - Database-style DataFrame merges
        - pd.merge(DataFrameObj1, DataFrameObj2)
        - Uses the columns whose names appear in both DataFrames as the merge key; an inner join by default
        - pd.merge(DataFrameObj1, DataFrameObj2, on='column_name')
            - Specifies the key column name explicitly
        - pd.merge(DataFrameObj1, DataFrameObj2, left_on='column_name1', right_on='column_name2')
        - pd.merge(DataFrameObj1, DataFrameObj2, how='outer'): outer join
        - pd.merge(DataFrameObj1, DataFrameObj2, on='column_name', how='left'): left outer join
        - pd.merge(DataFrameObj1, DataFrameObj2, how='inner')
        - Table: Different join types with the how argument
        - Table: merge function arguments
    - Merging on index
        - Merging a column on one side with the index on the other
            - pd.merge(DataFrameObj1, DataFrameObj2, left_on='column_name', right_index=True)
            - pd.merge(DataFrameObj1, DataFrameObj2, left_on='column_name', right_index=True, how='outer')
            - pd.merge(DataFrameObj1, DataFrameObj2, left_on=['column_name1', 'column_name2'], right_index=True)
        - Merging index with index
            - pd.merge(DataFrameObj1, DataFrameObj2, how='outer', left_index=True, right_index=True)
            - Equivalent to: DataFrameObj1.join(DataFrameObj2, how='outer')
        - Joining several DataFrames
            - DataFrameObj1.join([DataFrameObj2, DataFrameObj3])
    - Concatenating along an axis
        - concat
            - pd.concat([Series1, Series2, Series3]): concatenates into a single Series (one column)
            - pd.concat([Series1, Series2, Series3], axis=1): produces a DataFrame with one column per Series
            - pd.concat([Series1, Series2], axis=1, join='inner'): the default is a full outer join; here an inner join is specified explicitly
            - pd.concat([Series1, Series2], axis=1, join_axes=[['index1', 'index2', 'index3', 'index4']]): specifies `join_axes`
            - result = pd.concat([Series1, Series1, Series2], keys=['key1', 'key2', 'key3']): the three keys become a new outer index level, producing a hierarchical index
                - result.unstack(): pivots the inner level into columns
            - pd.concat([DataFrameObj1, DataFrameObj2], ignore_index=True)
    - Combining data with overlap
        - np.where(pd.isnull(Series1), Series2, Series1)
        - Series2[:-2].combine_first(Series1[2:])
- Hierarchical indexing
    - Introduction to hierarchical indexing
        - Setting a MultiIndex (a `hierarchically-indexed` object supports so-called `partial indexing`)
        - unstack(): converts a multi-level index into a single-level index
        - unstack().stack(): converts back to a multi-level index
    - Reordering and sorting levels
        - swaplevel('index_name1', 'index_name2')
        - sortlevel(index_position)
        - swaplevel(index_position0, index_position1).sortlevel(index_position0)
    - Summary statistics by level
        - sum(level='index_name')
        - sum(level='index_name', axis=1)
    - Using a DataFrame's columns as the index
        - set_index()
        - reset_index()
    - Integer indexing
        - loc (for labels)
        - iloc (for integers)
- Reshaping and pivoting
    - Reshaping with hierarchical indexing
        - DataFrameObj.stack(): compresses a DataFrame into a Series with a hierarchical (multi-level) index
            - SeriesObj.unstack()
            - SeriesObj.unstack(0): level number
            - SeriesObj.unstack('level_name')
        - stack(dropna=False): stack() filters out missing values by default
    - Pivoting "long" format to "wide" format
        - Pivoting: stack() and pivot()

In [1]:

# coding:utf-8
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## Merging datasets
### Database-style DataFrame merges

In [2]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [3]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling `merge` with these objects we obtain:

In [4]:
pd.merge(df1, df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


If not specified, `merge` uses the overlapping column names as the keys. It is good practice to specify the key explicitly, though:

In [5]:
pd.merge(df1, df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


If the column names are different in each object, you can specify them separately:

In [6]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a


By default merge does an `'inner'` join. Other possible options are `'left'`, `'right'`, and `'outer'`.

In [7]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0
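
To compare the join types side by side, the following sketch rebuilds the two frames from above and counts the rows each `how` option returns. The `indicator=True` argument is not used elsewhere in these notes; it is included here, as an illustration, to show which side each row of the outer join came from.

```python
import pandas as pd

# Same frames as in the example above, rebuilt so the block runs on its own.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Number of rows produced by each join type.
row_counts = {how: len(pd.merge(df1, df2, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# indicator=True adds a `_merge` column recording where each row came from.
tagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
origin_counts = tagged['_merge'].value_counts().to_dict()
print(origin_counts)
```

The `'c'` row exists only on the left and the `'d'` row only on the right, which is why the outer join has two more rows than the inner join.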


Many-to-many merges have well-defined though not necessarily intuitive behavior. Here’s an example:

In [8]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [9]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,a
3,3,b
4,4,d


In [10]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,3.0
2,1,b,1.0
3,1,b,3.0
4,2,a,0.0
5,2,a,2.0
6,3,c,
7,4,a,0.0
8,4,a,2.0
9,5,b,1.0
10,5,b,3.0


In [11]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,data1,key,data2
0,0,b,1
1,0,b,3
2,1,b,1
3,1,b,3
4,5,b,1
5,5,b,3
6,2,a,0
7,2,a,2
8,4,a,0
9,4,a,2
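
A many-to-many join forms the Cartesian product of the matching rows for each key: three `'b'` rows on the left and two on the right yield six `'b'` rows in the result. The sketch below checks that rule against the frames above; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Multiply the per-key counts; keys missing on either side become NaN and drop out.
expected_rows = int((left_counts * right_counts).dropna().sum())
actual_rows = len(pd.merge(df1, df2, on='key', how='inner'))
print(expected_rows, actual_rows)
```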


To merge with multiple keys, pass a list of column names:

In [12]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


`merge` has a `suffixes` option for specifying strings to append to overlapping names in the left and right DataFrame objects:

In [13]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [14]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


Table 8-1. Different join types with how argument

Option | Behavior
-------|---------
'inner' | Use only the key combinations observed in both tables.
'left' | Use all key combinations found in the left table.
'right' | Use all key combinations found in the right table.
'outer' | Use all key combinations observed in both tables together

Table 8-2. merge function arguments

Argument | Description
---------|------------
left | DataFrame to be merged on the left side
right | DataFrame to be merged on the right side
how | One of 'inner', 'outer', 'left' or 'right'. 'inner' by default
on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys
left_on | Columns in left DataFrame to use as join keys
right_on | Analogous to left_on for the right DataFrame
left_index | Use row index in left as its join key (or keys, if a MultiIndex)
right_index | Analogous to left_index
sort | Sort merged data lexicographically by join keys. The outputs above are not key-sorted, which matches the `sort=False` default of current pandas; leaving it disabled also gives better performance in some cases on large datasets
suffixes | Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y'). For example, if 'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result
copy | If False, avoid copying data into resulting data structure in some exceptional cases. By default always copies
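
As a small illustration of two arguments from the table, the sketch below uses hypothetical frames `left` and `right` that share a non-key column named `data`. It passes `sort=True` to order the result by key and `suffixes` to rename the overlapping column.

```python
import pandas as pd

# Hypothetical frames: both have a key column and an overlapping `data` column.
left = pd.DataFrame({'key': ['b', 'a', 'c'], 'data': [1, 2, 3]})
right = pd.DataFrame({'key': ['c', 'b', 'a'], 'data': [10, 20, 30]})

# sort=True orders the result by the join key;
# suffixes renames the overlapping non-key column.
merged = pd.merge(left, right, on='key', sort=True,
                  suffixes=('_left', '_right'))
print(merged)
```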

### Merging on index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass `left_index=True` or `right_index=True` (or both) to indicate that the index should be used as the merge key:

In [15]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [16]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [17]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [18]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


With hierarchically-indexed data, things are more complicated:

In [19]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
lefth

Unnamed: 0,data,key1,key2
0,0.0,Ohio,2000
1,1.0,Ohio,2001
2,2.0,Ohio,2002
3,3.0,Nevada,2001
4,4.0,Nevada,2002


In [20]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [21]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,data,key1,key2,event1,event2
0,0.0,Ohio,2000,4,5
0,0.0,Ohio,2000,6,7
1,1.0,Ohio,2001,8,9
2,2.0,Ohio,2002,10,11
3,3.0,Nevada,2001,0,1


In [22]:
pd.merge(lefth, righth, left_on=['key1', 'key2'],
         right_index=True, how='outer')

Unnamed: 0,data,key1,key2,event1,event2
0,0.0,Ohio,2000.0,4.0,5.0
0,0.0,Ohio,2000.0,6.0,7.0
1,1.0,Ohio,2001.0,8.0,9.0
2,2.0,Ohio,2002.0,10.0,11.0
3,3.0,Nevada,2001.0,0.0,1.0
4,4.0,Nevada,2002.0,,
4,,Nevada,2000.0,2.0,3.0


Using the indexes of both sides of the merge is also possible:

In [23]:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [24]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [25]:
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


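DataFrame has a convenient `join` instance method for merging by index. It performs a left join by default and also accepts an `on` argument naming a column of the calling DataFrame to match against the index of the other: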

In [26]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [27]:
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
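
The call `left1.join(right1, on='key')` appears to give the same rows as a left `merge` of the `key` column against the right-hand index. The sketch below compares the two; it resets the row index before comparing so that only the contents are checked.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

joined = left1.join(right1, on='key')
merged = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

# Compare contents only, ignoring whatever row labels each call produced.
same = joined.reset_index(drop=True).equals(merged.reset_index(drop=True))
print(same)
```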


In [28]:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [29]:
left2.join([right2, another])

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


In [30]:
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
b,,,7.0,8.0,,
c,3.0,4.0,9.0,10.0,9.0,10.0
d,,,11.0,12.0,,
e,5.0,6.0,13.0,14.0,11.0,12.0
f,,,,,16.0,17.0


### Concatenating along an axis

In [31]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

Calling `concat` with these objects in a list glues together the values and indexes:

In [32]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along `axis=0`, producing another Series. If you pass `axis=1`, the result will instead be a DataFrame (`axis=1` is the columns):

In [33]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In this case there is no overlap on the other axis, which, as you can see, is the sorted union (the `'outer'` join) of the indexes. You can instead intersect them by passing `join='inner'`:

In [35]:
s4 = pd.concat([s1 * 5, s3])
s4

a    0
b    5
f    5
g    6
dtype: int64

In [36]:
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,5
f,,5
g,,6


In [37]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,5


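You can even specify the index to use on the other axis with `join_axes` (this argument was deprecated in pandas 0.25 and removed in 1.0, so the call below only runs on older versions):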

In [38]:
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])

Unnamed: 0,0,1
a,0.0,0.0
c,,
b,1.0,5.0
e,,
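
On pandas versions without `join_axes`, one way to get the same table is to concatenate with the default outer join and then call `reindex` with the desired labels. This is a sketch of that substitute, not the call used in the cell above.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

# Concatenate with the default outer join, then pick the row labels explicitly.
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result)
```

Labels that appear in neither Series (`'c'` and `'e'` here) come back as rows of missing values, as in the output above.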


A potential issue is that the concatenated pieces are not identifiable in the result. Suppose instead you wanted to create a hierarchical index on the concatenation axis. To do this, use the `keys` argument:

In [40]:
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [41]:
result.unstack()

Unnamed: 0,a,b,f,g
one,0.0,1.0,,
two,0.0,1.0,,
three,,,5.0,6.0


In the case of combining Series along `axis=1`, the keys become the DataFrame column headers:

In [42]:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


The same logic extends to DataFrame objects:

In [43]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [44]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [45]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


If you pass a dict of objects instead of a list, the dict’s keys will be used for the `keys` option:

In [46]:
pd.concat({'level1': df1, 'level2': df2}, axis=1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


We can name the created axis levels with the `names` argument:

In [47]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
          names=['upper', 'lower'])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


A last consideration concerns DataFrames in which the row index does not contain any relevant data:

In [48]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1

Unnamed: 0,a,b,c,d
0,1.776944,-0.462237,3.057548,-0.332489
1,2.534262,1.308292,1.561051,0.586242
2,0.760955,-0.448366,-0.024375,0.974236


In [49]:
df2

Unnamed: 0,b,d,a
0,1.445224,0.098105,-0.925441
1,-1.758772,-0.565375,-1.338649


In this case, you can pass `ignore_index=True`:

In [51]:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,1.776944,-0.462237,3.057548,-0.332489
1,2.534262,1.308292,1.561051,0.586242
2,0.760955,-0.448366,-0.024375,0.974236
3,-0.925441,1.445224,,0.098105
4,-1.338649,-1.758772,,-0.565375


Table 8-3. concat function arguments

Argument | Description
---------|------------
objs | List or dict of pandas objects to be concatenated. The only required argument
axis | Axis to concatenate along; defaults to 0
join | One of 'inner', 'outer', defaulting to 'outer'; whether to intersect (inner) or union (outer) the indexes along the other axes
join_axes | Specific indexes to use for the other n-1 axes instead of performing union/intersection logic
keys | Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis. Can either be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple level arrays passed in levels)
levels | Specific indexes to use as hierarchical index level or levels if keys passed
names | Names for created hierarchical levels if keys and / or levels passed
verify_integrity | Check new axis in concatenated object for duplicates and raise exception if so. By default (False) allows duplicates
ignore_index | Do not preserve indexes along concatenation axis, instead producing a new range(total_length) index
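
The sketch below exercises two arguments from the table on a small Series. `verify_integrity=True` rejects a concatenation whose new index would contain duplicates, while `ignore_index=True` sidesteps the problem by renumbering the rows.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])

# Both pieces carry the labels 'a' and 'b', so the integrity check fails.
try:
    pd.concat([s1, s1], verify_integrity=True)
    duplicate_detected = False
except ValueError as exc:
    duplicate_detected = True
    print('concat refused:', exc)

# ignore_index=True discards the labels and numbers the rows from 0.
renumbered = pd.concat([s1, s1], ignore_index=True)
print(renumbered)
```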

### Combining data with overlap
As a motivating example, consider NumPy’s `where` function, which performs the array-oriented equivalent of an if-else expression:

In [52]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [53]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    5.0
dtype: float64

In [55]:
b.iloc[-1] = np.nan  # set the last element by position
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [56]:
np.where(pd.isnull(a), b, a)

array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

Series has a `combine_first` method, which performs the equivalent of this operation along with pandas’s usual data alignment logic:

In [57]:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
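
To make the relationship between the two approaches concrete, the sketch below patches `a` with `b` both ways when the two Series share the same labels. `np.where` returns a bare array, so the labels have to be reattached by hand, whereas `combine_first` keeps them.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0., 1., 2., 3., 4., np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])

# np.where works on raw values, so the index is reattached manually.
manual = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

# combine_first aligns on labels and fills the gaps in `a` from `b`.
patched = a.combine_first(b)

# Sort both so the comparison does not depend on the order of the result.
print(patched.sort_index().equals(manual.sort_index()))
```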

With DataFrames, `combine_first` does the same thing column by column, so you can think of it as “patching” missing data in the calling object with data from the object you pass:

In [58]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [59]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [60]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Hierarchical indexing
### Introduction to hierarchical indexing

In [95]:
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1    1.443597
   2   -0.909521
   3   -0.968799
b  1    0.255042
   3   -1.125145
c  1   -0.691838
   2    1.016297
d  2    0.833615
   3   -0.468632
dtype: float64

What you’re seeing is a prettified view of a Series with a `MultiIndex` as its index.

In [96]:
data.index

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

With a hierarchically-indexed object, so-called `partial indexing` is possible, enabling you to concisely select subsets of the data:

In [97]:
data['b']

1    0.255042
3   -1.125145
dtype: float64

In [98]:
data['b':'c']

b  1    0.255042
   3   -1.125145
c  1   -0.691838
   2    1.016297
dtype: float64

In [99]:
data.loc[['b', 'd']]

b  1    0.255042
   3   -1.125145
d  2    0.833615
   3   -0.468632
dtype: float64

Selection is even possible in some cases from an “inner” level:

In [100]:
data.loc[:, 2]

a   -0.909521
c    1.016297
d    0.833615
dtype: float64
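
Selecting from an inner level can also be written with the `xs` (cross-section) method, which names the level explicitly. The sketch below uses fixed values instead of random ones so that the result is reproducible; `xs` is offered here as an alternative spelling, not something the original cell uses.

```python
import numpy as np
import pandas as pd

# Fixed values 0..8 in place of the random data used above.
data = pd.Series(np.arange(9.),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

inner_via_loc = data.loc[:, 2]       # every outer label, inner label 2
inner_via_xs = data.xs(2, level=1)   # same selection, naming the level
print(inner_via_xs)
```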

For example, this data could be rearranged into a DataFrame using its `unstack` method:

In [101]:
data.unstack()

Unnamed: 0,1,2,3
a,1.443597,-0.909521,-0.968799
b,0.255042,,-1.125145
c,-0.691838,1.016297,
d,,0.833615,-0.468632


The inverse operation of unstack is `stack`:

In [103]:
data.unstack().stack()

a  1    1.443597
   2   -0.909521
   3   -0.968799
b  1    0.255042
   3   -1.125145
c  1   -0.691838
   2    1.016297
d  2    0.833615
   3   -0.468632
dtype: float64

With a DataFrame, either axis can have a hierarchical index:

In [104]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


The hierarchical levels can have `names` (as strings or any Python objects). If so, these will show up in the console output (don’t confuse the index names with the axis labels!):

In [105]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


With partial column indexing you can similarly select groups of columns:

In [106]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


A `MultiIndex` can be created by itself and then reused; the columns in the above DataFrame with level names could be created like this:

In [108]:
from pandas import MultiIndex
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color'])

MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=[u'state', u'color'])
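
As a brief sketch of the reuse mentioned above, one `MultiIndex` object can label the columns of more than one DataFrame. The frame names `frame_a` and `frame_b` are illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# The same MultiIndex labels the columns of two different frames.
frame_a = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
frame_b = pd.DataFrame(np.ones((2, 3)), columns=columns)

# Partial indexing on the outer level works on either frame.
print(frame_a['Ohio'])
```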

### Reordering and sorting levels
The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [109]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


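`sortlevel`, on the other hand, sorts the data (stably) using only the values in a single level. Newer pandas releases deprecated and later removed it in favor of `sort_index(level=...)`, so the calls below only run on older versions.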

In [110]:
frame.sortlevel(1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [111]:
frame.swaplevel(0, 1).sortlevel(0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11
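
On pandas releases where `sortlevel` is no longer available, `sort_index(level=...)` can stand in for it. The sketch below repeats the two calls above in that form; note that `sort_index` also sorts the remaining levels by default.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']

# Sort rows by the second index level (key2).
by_key2 = frame.sort_index(level=1)

# Swap the levels first, then sort by the new outer level.
swapped = frame.swaplevel(0, 1).sort_index(level=0)
print(swapped)
```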


### Summary statistics by level

In [112]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [113]:
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
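
The `level` argument of `sum` was removed in pandas 2.0. A sketch of the same aggregations with `groupby(level=...)` follows; for the column-wise case it transposes, groups the rows, and transposes back, which is one possible workaround and not the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share a value in the `key2` index level.
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share a `color`: transpose, group the rows, transpose back.
by_color = frame.T.groupby(level='color').sum().T
print(by_key2)
print(by_color)
```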


### Using a DataFrame's columns as the index

In [114]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame’s `set_index` function will create a new DataFrame using one or more of its columns as the index:

In [115]:
frame2 = frame.set_index(['c', 'd'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default the columns are removed from the DataFrame, though you can leave them in:

In [116]:
frame.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


`reset_index`, on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:

In [117]:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


### Integer indexing
If an axis index contains integers, selecting a single value with `[]` is label-oriented, whereas slicing with `[]` is position-oriented, as `ser[:1]` below shows. For more precise handling, use `loc` (for labels) or `iloc` (for integer positions):

In [118]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [119]:
ser[:1]

0    0.0
dtype: float64

In [120]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [121]:
ser.iloc[:1]

0    0.0
dtype: float64
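
The difference is easier to see when labels and positions disagree. The sketch below uses a hypothetical Series whose integer labels run in reverse order, then repeats the slice comparison from above on a default index.

```python
import pandas as pd

# Integer labels deliberately reversed, so label and position disagree.
ser = pd.Series([0., 1., 2.], index=[2, 1, 0])

by_label = ser.loc[0]       # the element whose label is 0
by_position = ser.iloc[0]   # the first element

# Label slices include the end point; position slices exclude it.
default = pd.Series([0., 1., 2.])
label_slice = default.loc[:1]
position_slice = default.iloc[:1]
print(by_label, by_position, len(label_slice), len(position_slice))
```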

## Reshaping and pivoting
### Reshaping with hierarchical indexing

In [61]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


Using the `stack` method on this data pivots the columns into the rows, producing a Series:

In [62]:
result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

From a hierarchically-indexed Series, you can rearrange the data back into a DataFrame with `unstack`:

In [63]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


By default the innermost level is unstacked (same with stack). You can unstack a different level by passing a level number or name:

In [64]:
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [65]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5
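
When no values are missing, `stack` and `unstack` undo each other, and unstacking the outer level swaps the roles of rows and columns. A minimal sketch with the same small frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

stacked = data.stack()           # columns pivot into an inner row level
round_trip = stacked.unstack()   # innermost level pivots back to columns

# Unstacking the outer level puts the states in the columns instead.
by_state = stacked.unstack('state')
print(by_state)
```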


Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:

In [66]:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s1

a    0
b    1
c    2
d    3
dtype: int64

In [67]:
s2

c    4
d    5
e    6
dtype: int64

In [68]:
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [69]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


Stacking filters out missing data by default, so the operation is more easily invertible:

In [70]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [71]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When unstacking in a DataFrame, the level unstacked becomes the lowest level in the result:

In [72]:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'],
                                   name='side'))
df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [73]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


In [74]:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Ohio,Colorado
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,0,3
one,right,5,8
two,left,1,4
two,right,6,9
three,left,2,5
three,right,7,10


### Pivoting "long" format to "wide" format
A common way to store multiple time series in databases and CSV files is the so-called long or stacked format. First, let’s load some example data and do a small amount of time series wrangling and other data cleaning:

In [89]:
data = pd.read_csv('data/macrodata.csv')
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [92]:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='int64', name=u'date', length=203, freq='Q-DEC')

In [93]:
data = pd.DataFrame(data.to_records(),
                    columns=pd.Index(['realgdp', 'infl', 'unemp'], name='item'),
                    index=periods.to_timestamp('D', 'end'))
data.head()

item,realgdp,infl,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,2710.349,0.0,5.8
1959-06-30,2778.801,2.34,5.1
1959-09-30,2775.488,2.74,5.3
1959-12-31,2785.204,0.27,5.6
1960-03-31,2847.699,2.31,5.2


In [83]:
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata[:10]

Unnamed: 0,date,item,value
0,1959-03-31,realgdp,2710.349
1,1959-03-31,infl,0.0
2,1959-03-31,unemp,5.8
3,1959-06-30,realgdp,2778.801
4,1959-06-30,infl,2.34
5,1959-06-30,unemp,5.1
6,1959-09-30,realgdp,2775.488
7,1959-09-30,infl,2.74
8,1959-09-30,unemp,5.3
9,1959-12-31,realgdp,2785.204


This is the so-called `long format` for multiple time series, or other observational data with two or more keys (here, our keys are date and item). Each row in the table represents a single observation.

Data is frequently stored this way in relational databases like MySQL, because a fixed schema (column names and data types) allows the number of distinct values in the `item` column to change as data is added to the table. In the above example, `date` and `item` would usually be the primary keys (in relational database parlance), offering both relational integrity and easier joins.

In some cases, however, the data may be more difficult to work with in this format; you might prefer a DataFrame containing one column per distinct `item` value, indexed by the timestamps in the `date` column. DataFrame’s `pivot` method performs exactly this transformation:

In [94]:
pivoted = ldata.pivot('date', 'item', 'value')
pivoted.head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:

In [85]:
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]

Unnamed: 0,date,item,value,value2
0,1959-03-31,realgdp,2710.349,-1.455693
1,1959-03-31,infl,0.0,0.442
2,1959-03-31,unemp,5.8,-0.267403
3,1959-06-30,realgdp,2778.801,0.220088
4,1959-06-30,infl,2.34,-0.339147
5,1959-06-30,unemp,5.1,0.536942
6,1959-09-30,realgdp,2775.488,1.689787
7,1959-09-30,infl,2.74,-1.254615
8,1959-09-30,unemp,5.3,-0.562661
9,1959-12-31,realgdp,2785.204,2.407114


By omitting the last argument, you obtain a DataFrame with hierarchical columns:

In [86]:
pivoted = ldata.pivot('date', 'item')
pivoted[:5]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,0.442,-1.455693,-0.267403
1959-06-30,2.34,2778.801,5.1,-0.339147,0.220088,0.536942
1959-09-30,2.74,2775.488,5.3,-1.254615,1.689787,-0.562661
1959-12-31,0.27,2785.204,5.6,-0.172535,2.407114,-0.133247
1960-03-31,2.31,2847.699,5.2,1.181717,-0.343066,1.152249


In [87]:
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


Note that `pivot` is equivalent to creating a hierarchical index using `set_index` followed by a call to `unstack`:

In [88]:
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:7]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,0.442,-1.455693,-0.267403
1959-06-30,2.34,2778.801,5.1,-0.339147,0.220088,0.536942
1959-09-30,2.74,2775.488,5.3,-1.254615,1.689787,-0.562661
1959-12-31,0.27,2785.204,5.6,-0.172535,2.407114,-0.133247
1960-03-31,2.31,2847.699,5.2,1.181717,-0.343066,1.152249
1960-06-30,0.14,2834.39,5.2,0.367568,-1.249692,-0.874178
1960-09-30,2.7,2839.022,5.6,-0.82483,-0.063725,1.097307
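
The equivalence can be checked on a small table. The sketch below builds a hypothetical six-row stand-in for `ldata` from the first two dates shown above and calls `pivot` with keyword arguments, which recent pandas releases require in place of positional ones.

```python
import pandas as pd

# Small long-format table standing in for `ldata` (first two dates only).
ldata = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 3 + ['1959-06-30'] * 3),
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
})

# Keyword form of pivot: rows from `date`, columns from `item`.
pivoted = ldata.pivot(index='date', columns='item', values='value')

# The same reshaping spelled out with set_index and unstack.
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')
print(pivoted)
```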
