# 8.1 Hierarchical Indexing
Let's start with a simple example: create a Series whose index is a list of lists (or arrays):

In [1]:
import pandas as pd
import numpy as np
data=pd.Series(np.random.randn(9),index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1   -0.442638
   2    0.660272
   3    0.734740
b  1   -0.085272
   3    0.935795
c  1    0.646283
   2   -0.174291
d  2    0.205645
   3    0.597408
dtype: float64

With a hierarchically indexed object, so-called partial indexing lets you select subsets of the data concisely:

In [2]:
data['b']

1    0.027781
3    0.751170
dtype: float64

In [3]:
data['b':'c']

b  1    0.027781
   3    0.751170
c  1   -1.225588
   2   -0.984319
dtype: float64

In [6]:
data[['b','c']]

b  1    0.027781
   3    0.751170
c  1   -1.225588
   2   -0.984319
dtype: float64

Selection is even possible on an "inner" level:

In [8]:
data[:,2]

a    0.265055
c   -0.984319
d    2.585767
dtype: float64
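
The selections above are easier to check on fixed values than on random ones. The following is a minimal sketch and not part of the original notebook; the name `demo` and its values are illustrative assumptions. It uses `.loc` to make the label-based intent explicit and shows `xs` as an alternative way to select on the inner level.

```python
import pandas as pd

# Hypothetical Series with the same two-level index as above, but fixed values.
demo = pd.Series(
    range(9),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

outer_b = demo.loc['b']            # partial indexing on the outer level
inner_2 = demo.loc[:, 2]           # every outer key that has inner label 2
inner_2_xs = demo.xs(2, level=1)   # the same selection expressed with xs

print(outer_b)
print(inner_2)
```

Only the outer keys that actually contain the inner label 2 (here a, c and d) appear in the result.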

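Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building pivot tables. For example, the unstack method rearranges this data into a DataFrame: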

In [9]:
data.unstack()

          1         2         3
a -0.379219  0.265055 -1.680629
b  0.027781       NaN  0.751170
c -1.225588 -0.984319       NaN
d       NaN  2.585767 -0.837858


The inverse operation of unstack is stack:

In [11]:
data.unstack().stack()

a  1   -0.379219
   2    0.265055
   3   -1.680629
b  1    0.027781
   3    0.751170
c  1   -1.225588
   2   -0.984319
d  2    2.585767
   3   -0.837858
dtype: float64

With a DataFrame, either axis can have a hierarchical index:

In [12]:
frame=pd.DataFrame(np.arange(12).reshape((4,3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


Each level can have a name (a string or any other Python object). If names are set, they appear in the console output:

In [13]:
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


Partial indexing works on the columns as well, so selecting a group of columns is just as easy:

In [14]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


In [15]:
frame.loc['a','Ohio']

color  Green  Red
key2
1          0    1
2          3    4


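## Reordering and Sorting Levels
swaplevel takes two level numbers or names and returns a new object with the levels interchanged (the data itself is unchanged):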

In [17]:
frame.swaplevel('key1','key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


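sort_index, on the other hand, sorts the data by the values in a single level. It is common to call sort_index after swapping levels so that the result is sorted lexicographically by the chosen level: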

In [18]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [19]:
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [20]:
frame.swaplevel('key1','key2').sort_index(level=0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
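
Whether an index is lexicographically sorted can be checked directly. The sketch below is an illustration rather than part of the original notebook; the single-level column labels `x`, `y`, `z` are simplified, hypothetical stand-ins for the two-level columns used above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=['x', 'y', 'z'],   # hypothetical, simplified column labels
)

swapped = frame.swaplevel('key1', 'key2')
sorted_swapped = swapped.sort_index()   # sorts by all levels, outermost first

print(swapped.index.is_monotonic_increasing)         # False
print(sorted_swapped.index.is_monotonic_increasing)  # True
print(sorted_swapped)
```

swaplevel only relabels the rows, so the row order is unchanged until sort_index is called.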


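## Summary Statistics by Level
Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level to aggregate by on a particular axis. Using the DataFrame above, we can sum by level on either the rows or the columns: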

In [21]:
frame.sum(level='key2')

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16


In [23]:
frame.sum(level='color',axis=1)

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
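
The `level` argument of reductions such as `sum` was deprecated and, in pandas 2.0 and later, is no longer accepted. The sketch below shows an equivalent formulation with `groupby(level=...)`; it is one possible replacement under that assumption, not the notebook's original code. The column-wise case transposes first so that no axis argument to `groupby` is needed.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 label.
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color label: transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```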


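## Indexing with a DataFrame's Columns
It is common to want to use one or more columns of a DataFrame as the row index, or conversely to move the row index into the DataFrame's columns. Take this DataFrame as an example: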

In [24]:
frame=pd.DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one', 'one', 'one', 'two', 'two','two', 'two'],'d':[0, 1, 2, 0, 1, 2, 3]})
frame

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


DataFrame's set_index method creates a new DataFrame that uses one or more of its columns as the row index:

In [29]:
frame2=frame.set_index(['c','d'])

By default those columns are removed from the DataFrame, but you can keep them by passing drop=False:

In [28]:
frame.set_index(['c','d'],drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


reset_index does the opposite of set_index: the levels of the hierarchical index are moved into the columns:

In [30]:
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
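
The round trip through set_index and reset_index can be sketched as follows. This is an illustrative check, not part of the original notebook; the variable name `restored` is an assumption. Note that the former index levels come back as the leading columns, so the column order differs from the original.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame.set_index(['c', 'd'])   # c and d become index levels
restored = frame2.reset_index()        # and come back as the first columns

print(list(frame2.columns))     # ['a', 'b']
print(list(restored.columns))   # ['c', 'd', 'a', 'b']

# Reorder the columns to the original layout before comparing.
print(restored[list(frame.columns)])
```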


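# 8.2 Combining and Merging Datasets
## Database-Style DataFrame Joins
Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational (SQL-based) databases. The merge function in pandas is the main entry point for applying these algorithms to your data.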

Let's start with a simple example:

In [37]:
df1=pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1':range(7)})
df1

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b


In [38]:
df2=pd.DataFrame({'key':['a', 'b', 'd'],'data2':range(3)})
df2

   data2 key
0      0   a
1      1   b
2      2   d


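Merging these two frames is a many-to-one join: df1 has multiple rows labeled a and b, whereas each value in the key column of df2 corresponds to exactly one row.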

If no key is specified, merge uses the overlapping column names as the keys. It is good practice to specify them explicitly, though:

In [39]:
pd.merge(df1,df2,on='key')

   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
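
If the many-to-one assumption matters, merge can check it through its `validate` argument. The sketch below is illustrative and not part of the original notebook; `df2_dup` is a hypothetical frame constructed to violate the assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# validate='many_to_one' checks that the keys are unique in the right frame.
merged = pd.merge(df1, df2, on='key', validate='many_to_one')
print(len(merged))   # 6

# Hypothetical right frame with a duplicated key: the check now fails.
df2_dup = pd.DataFrame({'key': ['a', 'a', 'b'], 'data2': range(3)})
try:
    pd.merge(df1, df2_dup, on='key', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)   # True
```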


If the key columns have different names in the two objects, you can specify them separately:

In [41]:
df3=pd.DataFrame({'lkey':['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1':range(7)})
df4=pd.DataFrame({'rkey':['a', 'b', 'd'],'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')

   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a


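By default, merge performs an "inner" join; the keys in the result are the intersection of the keys found in both tables. The other options are "left", "right", and "outer".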

The outer join takes the union of the keys, combining the effect of applying both left and right joins:

In [42]:
pd.merge(df1,df2,how='outer')

   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0
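
To compare the four join types side by side, the sketch below counts the rows each one produces on the same inputs and uses `indicator=True` to show where the unmatched rows of the outer join come from. The variable names `row_counts`, `outer` and `unmatched` are illustrative assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Row counts for each join type on the same inputs.
row_counts = {how: len(pd.merge(df1, df2, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)   # {'inner': 6, 'left': 7, 'right': 7, 'outer': 8}

# indicator=True adds a '_merge' column recording the origin of each row.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
unmatched = outer.loc[outer['_merge'] != 'both', ['key', '_merge']]
print(unmatched)
```

Key c exists only in the left frame and key d only in the right one, which is why the outer join has two rows more than the inner join.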


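Many-to-many merges are less intuitive. A many-to-many join forms the Cartesian product of the matching rows: in the example below, the left DataFrame has three "b" rows and the right one has two, so the result contains six "b" rows.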

In [45]:
df1=pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],'data1':range(6)})
df2=pd.DataFrame({'key':['a', 'b', 'a', 'b', 'd'],'data2':range(5)})
pd.merge(df1,df2,how='left')

    data1 key  data2
0       0   b    1.0
1       0   b    3.0
2       1   b    1.0
3       1   b    3.0
4       2   a    0.0
5       2   a    2.0
6       3   c    NaN
7       4   a    0.0
8       4   a    2.0
9       5   b    1.0
10      5   b    3.0
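
The Cartesian-product rule can be checked by counting. The sketch below, which is illustrative and not part of the original notebook, uses an inner join so that unmatched keys drop out and compares the per-key row counts of the result with the product of the counts in the two inputs.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

left_counts = df1['key'].value_counts()    # b: 3, a: 2, c: 1
right_counts = df2['key'].value_counts()   # a: 2, b: 2, d: 1

merged = pd.merge(df1, df2, on='key', how='inner')
merged_counts = merged['key'].value_counts()

# For an inner join, each key appears (left count) * (right count) times;
# keys missing on either side produce NaN here and are dropped.
expected = (left_counts * right_counts).dropna().astype(int)

print(merged_counts.sort_index())
print(expected.sort_index())
```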


To merge on multiple keys, pass a list of column names:

In [46]:
left=pd.DataFrame({'key1':['foo', 'foo', 'bar'],'key2':['one', 'two', 'one'],'data1':[1,2,3]})
right=pd.DataFrame({'key1':['foo', 'foo', 'bar', 'bar'],'key2':['one', 'one', 'one', 'two'],'data2':[4,5,6,7]})
pd.merge(left,right,on=['key1','key2'],how='outer')

   data1 key1 key2  data2
0    1.0  foo  one    4.0
1    1.0  foo  one    5.0
2    2.0  foo  two    NaN
3    3.0  bar  one    6.0
4    NaN  bar  two    7.0


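When the two DataFrame objects have overlapping column names that are not merge keys, the suffixes option specifies the strings to append to those names on the left and right sides: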

In [47]:
pd.merge(left,right,on='key1',suffixes=('_left','_right'))

   data1 key1 key2_left  data2 key2_right
0      1  foo       one      4        one
1      1  foo       one      5        one
2      2  foo       two      4        one
3      2  foo       two      5        one
4      3  bar       one      6        one
5      3  bar       one      7        two


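## Merging on Index
Sometimes the merge key in a DataFrame is found in its index. In that case, pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key: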

In [49]:
left1=pd.DataFrame({'key':['a', 'b', 'a', 'a', 'b', 'c'],'value':range(6)})
right1=pd.DataFrame({'group_val':[3.5,7]},index=['a','b'])
pd.merge(left1,right1,left_on='key',right_index=True)

  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
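
DataFrame also has a join method that covers this case: with `on='key'` it matches a column of the calling frame against the index of the other frame. The sketch below is illustrative; note that join performs a left join by default, so the unmatched c row is kept unless `how='inner'` is passed.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

merged = pd.merge(left1, right1, left_on='key', right_index=True)

joined = left1.join(right1, on='key')                     # left join: keeps 'c'
joined_inner = left1.join(right1, on='key', how='inner')  # same rows as merged

print(joined)
print(len(merged), len(joined), len(joined_inner))   # 5 6 5
```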


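## Concatenating Along an Axis
Suppose we have three Series with no index overlap. Calling concat on these objects glues together the values and indexes: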

In [3]:
s1=pd.Series([0,1],index=['a','b'])
s2=pd.Series([2,3,4],index=['c','d','e'])
s3=pd.Series([5,6],index=['f','g'])
pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along axis=0, producing another Series. If you pass axis=1, the result is instead a DataFrame (axis=1 is the columns):

In [5]:
pd.concat([s1,s2,s3],axis=1)

     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0


In [6]:
s4=pd.concat([s1,s3])
s4

a    0
b    1
f    5
g    6
dtype: int64

In [7]:
pd.concat([s1,s4],axis=1)

     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6


In [8]:
pd.concat([s1,s4],axis=1,join='inner')

   0  1
a  0  0
b  1  1


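One potential issue is that the concatenated pieces are not identifiable in the result. Suppose instead you wanted to create a hierarchical index on the concatenation axis. The keys argument does this: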

In [11]:
result=pd.concat([s1,s1,s3],keys=['one','two','three'])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [12]:
result.unstack()

         a    b    f    g
one    0.0  1.0  NaN  NaN
two    0.0  1.0  NaN  NaN
three  NaN  NaN  5.0  6.0
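
The levels created by keys can also be named through concat's `names` argument, which makes later selection and unstacking easier to read. In this sketch the level names `source` and `label` are hypothetical choices.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# names labels the levels of the hierarchical index created by keys.
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'],
                   names=['source', 'label'])

print(list(result.index.names))   # ['source', 'label']
print(result.loc['three'])        # the piece that came from s3
```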


When combining Series along axis=1, the keys become the DataFrame column headers:

In [14]:
pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'])

   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0


The same logic extends to DataFrame objects:

In [17]:
df1=pd.DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
df1

   one  two
a    0    1
b    2    3
c    4    5


In [18]:
df2=pd.DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
df2

   three  four
a      5     6
c      7     8


In [19]:
pd.concat([df1,df2],axis=1,keys=['level1','level2'])

  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0


A last consideration concerns DataFrames whose row index does not contain any relevant data:

In [20]:
df1=pd.DataFrame(np.random.randn(3,4),columns=['a', 'b', 'c', 'd'])
df1

          a         b         c         d
0 -0.655267  1.465735 -0.179952 -0.138597
1 -1.594736 -0.434394  0.282781 -0.799191
2 -0.504914 -1.911239  0.596504  0.333906


In [21]:
df2=pd.DataFrame(np.random.randn(2,3),columns=['b', 'd', 'a'])
df2

          b         d         a
0  2.732979  1.242908  0.257329
1 -0.952996  0.705629  0.105751


In [25]:
pd.concat([df1,df2])

          a         b         c         d
0 -0.655267  1.465735 -0.179952 -0.138597
1 -1.594736 -0.434394  0.282781 -0.799191
2 -0.504914 -1.911239  0.596504  0.333906
0  0.257329  2.732979       NaN  1.242908
1  0.105751 -0.952996       NaN  0.705629


To discard the original row labels and produce a fresh index instead, pass ignore_index=True:

In [23]:
pd.concat([df1,df2],ignore_index=True)

          a         b         c         d
0 -0.655267  1.465735 -0.179952 -0.138597
1 -1.594736 -0.434394  0.282781 -0.799191
2 -0.504914 -1.911239  0.596504  0.333906
3  0.257329  2.732979       NaN  1.242908
4  0.105751 -0.952996       NaN  0.705629


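# 8.3 Reshaping and Pivoting
## Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack: "rotates" (pivots) the columns of the data into the rows.

unstack: pivots the rows into the columns.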

In [30]:
data=pd.DataFrame(np.arange(6).reshape(2,3),index=['Ohio','Colorado'],columns=['one', 'two', 'three'])
data.index.names=['state']
data.columns.names=['number']
data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


In [33]:
result=data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

In [34]:
result.unstack()

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


By default the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name:

In [35]:
result.unstack(0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5


In [36]:
result.unstack('state')

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5


Unstacking may introduce missing data if not all of the values in the level are found in each subgroup:

In [37]:
s1=pd.Series([0,1,2,3],index=['a', 'b', 'c', 'd'])
s2=pd.Series([4,5,6],index=['c', 'd', 'e'])
data2=pd.concat([s1,s2],keys=['one','two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [38]:
data2.unstack()

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0


stack filters out missing data by default, so the operation is invertible; passing dropna=False keeps the missing values:

In [39]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [40]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
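
Because the default treatment of missing values in stack may differ between pandas versions (newer releases introduce a revised implementation), a version-independent way to get the round trip is to drop the missing values explicitly. The sketch below is an illustration under that assumption and not part of the original notebook.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

wide = data2.unstack()                     # 2 x 5 frame with missing cells
n_missing = int(wide.isna().sum().sum())   # number of cells filled with NaN

# Drop missing values explicitly instead of relying on stack's default,
# then restore the integer dtype lost when NaN forced a float conversion.
restored = wide.stack().dropna().astype('int64')

print(n_missing)   # 3
print(restored)
```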

When you unstack in a DataFrame, the level that is unstacked becomes the lowest level of the columns in the result:

In [46]:
df=pd.DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10


In [45]:
df.unstack('state')

side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10


When calling stack, we can indicate the name of the level to stack:

In [47]:
df.unstack('state').stack('side')

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7


## Pivoting "Long" to "Wide" Format
First, we load some example data and do a small amount of time series wrangling and data cleaning:

In [49]:
data=pd.read_csv(r'E:\python\data\macrodata.csv')
data.head()

     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19


In [50]:
periods=pd.PeriodIndex(year=data.year,quarter=data.quarter,name='date')
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')

In [51]:
columns=pd.Index(['realgdp', 'infl', 'unemp'],name='item')
columns

Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')

In [52]:
data=data.reindex(columns=columns)
data.head()

item   realgdp  infl  unemp
0     2710.349  0.00    5.8
1     2778.801  2.34    5.1
2     2775.488  2.74    5.3
3     2785.204  0.27    5.6
4     2847.699  2.31    5.2


In [53]:
data.index=periods.to_timestamp('D','end')
data.head()

item         realgdp  infl  unemp
date
1959-03-31  2710.349  0.00    5.8
1959-06-30  2778.801  2.34    5.1
1959-09-30  2775.488  2.74    5.3
1959-12-31  2785.204  0.27    5.6
1960-03-31  2847.699  2.31    5.2


In [59]:
ldata=data.stack()
ldata.head(6)

date        item   
1959-03-31  realgdp    2710.349
            infl          0.000
            unemp         5.800
1959-06-30  realgdp    2778.801
            infl          2.340
            unemp         5.100
dtype: float64

In [60]:
ldata=ldata.reset_index()
ldata.head(6)

        date     item         0
0 1959-03-31  realgdp  2710.349
1 1959-03-31     infl     0.000
2 1959-03-31    unemp     5.800
3 1959-06-30  realgdp  2778.801
4 1959-06-30     infl     2.340
5 1959-06-30    unemp     5.100


In [61]:
ldata=ldata.rename(columns={0:'value'})
ldata.head()

        date     item     value
0 1959-03-31  realgdp  2710.349
1 1959-03-31     infl     0.000
2 1959-03-31    unemp     5.800
3 1959-06-30  realgdp  2778.801
4 1959-06-30     infl     2.340


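This is the so-called "long" format: each row holds a single observation. DataFrame's pivot method turns these rows back into columns: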

In [63]:
pivoted=ldata.pivot(index='date',columns='item',values='value')
pivoted.head()

item        infl   realgdp  unemp
date
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2


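The first two values, index and columns, name the columns to use as the row and column index, respectively; the last one, values, is optional and names the column that fills the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously: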

In [67]:
ldata['value2']=np.random.randn(len(ldata))
ldata.head()

        date     item     value    value2
0 1959-03-31  realgdp  2710.349  1.716234
1 1959-03-31     infl     0.000 -0.601489
2 1959-03-31    unemp     5.800  1.466799
3 1959-06-30  realgdp  2778.801 -0.440601
4 1959-06-30     infl     2.340 -0.095323


By omitting the last argument, you obtain a DataFrame with hierarchical columns:

In [74]:
pivoted=ldata.pivot(index='date',columns='item')
pivoted.head()

           value                    value2
item        infl   realgdp unemp      infl   realgdp     unemp
date
1959-03-31  0.00  2710.349   5.8 -0.601489  1.716234  1.466799
1959-06-30  2.34  2778.801   5.1 -0.095323 -0.440601 -1.108721
1959-09-30  2.74  2775.488   5.3  1.051572  0.131663  1.732953
1959-12-31  0.27  2785.204   5.6 -0.717810  0.322066 -0.173960
1960-03-31  2.31  2847.699   5.2  1.971314  0.113417  0.178633


Note that pivot is equivalent to creating a hierarchical index with set_index and then calling unstack:

In [75]:
unstacked=ldata.set_index(['date','item']).unstack()
unstacked.head()

           value                    value2
item        infl   realgdp unemp      infl   realgdp     unemp
date
1959-03-31  0.00  2710.349   5.8 -0.601489  1.716234  1.466799
1959-06-30  2.34  2778.801   5.1 -0.095323 -0.440601 -1.108721
1959-09-30  2.74  2775.488   5.3  1.051572  0.131663  1.732953
1959-12-31  0.27  2785.204   5.6 -0.717810  0.322066 -0.173960
1960-03-31  2.31  2847.699   5.2  1.971314  0.113417  0.178633
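
The equivalence can be checked on a small, fixed dataset. The sketch below does not use macrodata.csv; the four-row `ldata` frame is a hypothetical stand-in with the same column layout, and the dates are kept as plain strings for simplicity.

```python
import pandas as pd

# Hypothetical long-format data: two dates, two items, two value columns.
ldata = pd.DataFrame({
    'date': ['2020-03-31', '2020-03-31', '2020-06-30', '2020-06-30'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [0.0, 5.8, 2.34, 5.1],
    'value2': [1.0, 2.0, 3.0, 4.0],
})

pivoted = ldata.pivot(index='date', columns='item')
unstacked = ldata.set_index(['date', 'item']).unstack()

same = pivoted.equals(unstacked)   # compares labels and values of both frames
print(same)
print(pivoted['value'])            # selecting one top-level column group
```

Both approaches require each (date, item) pair to occur only once; duplicated pairs cannot be placed in a single cell.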


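## Pivoting "Wide" to "Long" Format
An inverse operation to pivot for DataFrames is pandas.melt. Rather than spreading one column into many columns of a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input.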

In [77]:
df=pd.DataFrame({'key':['foo', 'bar', 'baz'],
                'A':[1, 2, 3],
               'B':[4,5,6],
               'C':[7,8,9]})
df

   A  B  C  key
0  1  4  7  foo
1  2  5  8  bar
2  3  6  9  baz


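The key column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Here key is the only group indicator: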

In [79]:
melted=pd.melt(df,['key'])
melted

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9


Using pivot, we can reshape back to the original layout:

In [81]:
reshaped=melted.pivot(index='key',columns='variable',values='value')
reshaped

variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7


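Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column: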

In [82]:
reshaped.reset_index()

variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7


You can also specify a subset of columns to use as value columns:

In [83]:
pd.melt(df,id_vars=['key'],value_vars=['A','B'])

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6


pandas.melt can be used without any group indicators, too:

In [84]:
pd.melt(df,value_vars=['A','B','C'])

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9
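
The default column names variable and value can be replaced through melt's `var_name` and `value_name` arguments. The sketch below also runs the full round trip back to the wide layout; the names `letter` and `number` are hypothetical choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# var_name and value_name rename the default 'variable' and 'value' columns.
melted = pd.melt(df, id_vars=['key'], var_name='letter', value_name='number')

wide = (melted.pivot(index='key', columns='letter', values='number')
              .reset_index())
wide.columns.name = None   # drop the leftover 'letter' label on the columns

# pivot sorts the keys, so sort the original the same way for comparison.
expected = df.sort_values('key').reset_index(drop=True)

print(melted.head(3))
print(wide)
print(expected)
```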
