# Introduction to Pandas

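As a general-purpose language, Python does not ship with built-in data-management facilities the way Stata or R does, so an additional package is usually needed. Among these, Pandas has become the most popular data-management package in Python thanks to its strong performance and ease of use. Here we give a brief introduction to the basic features of Pandas. The official Pandas documentation is at: https://pandas.pydata.org/docs/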

Pandas is in fact built on top of NumPy, and as we will see, many Pandas operations closely resemble their NumPy counterparts. To use Pandas, we first import it:

In [1]:
import pandas as pd

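Pandas can read data from many kinds of sources, including CSV files, SPSS files, SAS files, databases (via SQL), Excel files, and more. For example, the following code reads a .csv file: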

In [2]:
kfc=pd.read_csv("csv/kfc.csv")
kfc.head()

Unnamed: 0,id,lon,lat,province,city,name,address,brand,category,birth,all_day,wifi,store_view,basketball,self_service,attractions,train,breakfast
0,15487,116.28094,39.95174,北京市,北京市,板井路,远大路远大居住区二期世纪金源大酒店一层东南角,KFC,餐饮,True,False,False,True,True,True,False,False,True
1,15205,116.362701,39.79407,北京市,北京市,大兴新宫DT,西红门路丙13号一层加二层,KFC,餐饮,True,False,False,True,False,True,False,False,True
2,15184,116.209496,39.75268,北京市,北京市,长阳,长阳镇起步区五号地商业综合体中粮万科半岛商业广场首层L1017A及二层L2030A??,KFC,餐饮,False,False,False,False,False,True,False,False,False
3,15186,116.157956,39.723521,北京市,北京市,良乡长虹东路,拱辰街道东羊庄村18号一层,KFC,餐饮,True,False,False,True,True,True,False,False,True
4,15183,116.220176,40.222182,北京市,北京市,昌平京客隆,西关路20号三号楼京客隆超市一层二层西侧3号商铺,KFC,餐饮,True,False,False,True,True,True,False,False,True
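
When a CSV file was saved together with its row index, read_csv() reads that first column back as an ordinary column named `Unnamed: 0`. The following is a minimal sketch of how `index_col` can handle this; the CSV text is a small made-up example held in memory, not the kfc.csv file used above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk. The first header
# cell is empty because that column was the row index when it was saved.
csv_text = """,id,city,brand
0,101,Beijing,KFC
1,102,Shanghai,KFC
2,103,Hangzhou,KFC
"""

# Without index_col, the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col=0, the first column is used as the row index instead.
stores = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(stores.columns.tolist())
print(stores.shape)
```

The same `index_col` argument also accepts a column name, which is often more readable when the index column has a header.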


In addition, many data-provider packages return their data in Pandas format already, for example:

In [3]:
import tushare as tus
gdp=tus.get_gdp_year()
gdp.head()

Unnamed: 0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy
0,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0
1,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8
2,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2
3,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8
4,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4


# Series

The most basic object in Pandas is the **Series**, which can be thought of simply as a NumPy array with an **Index** attached. For example, we can create a simple Series as follows:

In [4]:
a=pd.Series([1,2,3])
a

0    1
1    2
2    3
dtype: int64

Or we can create one from a NumPy array:

In [5]:
import numpy as np
b=pd.Series(np.random.random(10))
b

0    0.807440
1    0.726257
2    0.380989
3    0.665009
4    0.659779
5    0.957779
6    0.220684
7    0.579278
8    0.459168
9    0.976048
dtype: float64

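Note that, compared with a plain array, the Series object has an extra column of "row numbers". These "row numbers" are in fact the Pandas index, and the index is the key difference between a Series and an array. We will cover how to use the index shortly.

Apart from the index, a Series behaves essentially like an array in most respects. For example, we can do arithmetic on it: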

In [6]:
log_b=np.log(b)
print(log_b)
print("type of log_b:",type(log_b))
print("sum of b=", b.sum())
blogb=b*log_b
print("b*log_b=\n",blogb)

0   -0.213887
1   -0.319851
2   -0.964986
3   -0.407955
4   -0.415851
5   -0.043138
6   -1.511024
7   -0.545973
8   -0.778339
9   -0.024244
dtype: float64
type of log_b: <class 'pandas.core.series.Series'>
sum of b= 6.432429922956976
b*log_b=
 0   -0.172701
1   -0.232294
2   -0.367649
3   -0.271293
4   -0.274370
5   -0.041317
6   -0.333459
7   -0.316270
8   -0.357388
9   -0.023663
dtype: float64


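Note that the result of a computation on a Series is still a Series, and the index values stay aligned with the data.

Handling missing values is also fairly convenient in Pandas. A missing value can be represented either by numpy.nan (NaN, "not a number") or by Python's native None. In most situations the two are treated the same way, so there is usually no need to worry about which one is used. For example: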

In [7]:
log_b1=np.log(b-0.5)
log_b1

  result = getattr(ufunc, method)(*inputs, **kwargs)


0   -1.179477
1   -1.486084
2         NaN
3   -1.801754
4   -1.833966
5   -0.781368
6         NaN
7   -2.534799
8         NaN
9   -0.742237
dtype: float64

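As we can see, every value that could not be computed (here, the logarithm of a negative number) has been replaced by NaN. The isnull() and notnull() functions test whether a value is missing, for example: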

In [8]:
pd.notnull(log_b1)

0     True
1     True
2    False
3     True
4     True
5     True
6    False
7     True
8    False
9     True
dtype: bool

To discard the missing values, use dropna():

In [9]:
log_b1.dropna()

0   -1.179477
1   -1.486084
3   -1.801754
4   -1.833966
5   -0.781368
7   -2.534799
9   -0.742237
dtype: float64
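
How None and NaN are treated can be illustrated with a short sketch. The values below are made up for illustration; the object-dtype Series is requested explicitly so that None is kept as it is.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN on construction.
x = pd.Series([1.0, None, np.nan, 4.0])
print(x.isnull().tolist())

# In an object Series, None is kept as None,
# but isnull() still reports it as missing.
y = pd.Series(['a', None, 'c'], dtype=object)
print(y.isnull().tolist())

# Reductions such as sum() skip missing values by default (skipna=True).
print(x.sum())
```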

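## String Series

Unlike a NumPy array, which usually holds numeric data, a Pandas Series is routinely used with non-numeric data types as well. For example, we can create a Series of strings: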

In [10]:
s=pd.Series(['Messi-10','Xavi-6','Iniesta-8','Puyol-5'])
s

0     Messi-10
1       Xavi-6
2    Iniesta-8
3      Puyol-5
dtype: object

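NumPy has only limited support for vectorized string operations (the np.char module), whereas Pandas provides the commonly used string functions in vectorized form through the .str accessor. Most of the methods that Python strings support, such as len(), split(), index(), lower(), islower(), upper(), isupper(), isdigit() and strip(), have a vectorized counterpart. For example: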

In [11]:
len_s=s.str.len()
print("string lengths=\n",len_s)
split_s=s.str.split('-')
print("split strings:\n",split_s)

string lengths=
 0    8
1    6
2    9
3    7
dtype: int64
split strings:
 0     [Messi, 10]
1       [Xavi, 6]
2    [Iniesta, 8]
3      [Puyol, 5]
dtype: object


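Note that these operations preserve the index as well.

Regular expressions can also be applied in a vectorized way. Pandas supports the following regular-expression methods:

* match(): calls re.match() on each element
* extract(): calls re.search() on each element and returns the matched groups
* findall(): calls re.findall() on each element
* replace(): replaces occurrences of a pattern with another string
* contains(): calls re.search() on each element
* count(): counts the occurrences of the pattern in each element
* split(), rsplit(): split each string

For example: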

In [12]:
code=s.str.findall(r'\d+')
name=s.str.findall(r'[a-zA-Z]+')
print(code)
print(name)

0    [10]
1     [6]
2     [8]
3     [5]
dtype: object
0      [Messi]
1       [Xavi]
2    [Iniesta]
3      [Puyol]
dtype: object


Note that the results of split() and of the regular-expression methods above are lists. Their elements can be retrieved with str[], for example:

In [13]:
code1=code.str[0].astype('int_')
print(code1)
name1=name.str[0]
print(name1)

0    10
1     6
2     8
3     5
dtype: int64
0      Messi
1       Xavi
2    Iniesta
3      Puyol
dtype: object


Note that for code we additionally used astype() to convert the numeric strings to integers.

Many more string methods are available; see: https://pandas.pydata.org/docs/user_guide/text.html
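
As an alternative to findall() followed by str[0], the extract() method can pull out both parts in one step. This is a minimal sketch on the same player strings; the group names `name` and `number` are chosen here for illustration.

```python
import pandas as pd

s = pd.Series(['Messi-10', 'Xavi-6', 'Iniesta-8', 'Puyol-5'])

# Each named group (?P<...>) becomes a column of the resulting DataFrame.
parts = s.str.extract(r'(?P<name>[a-zA-Z]+)-(?P<number>\d+)')

# The captured digits are still strings, so convert them explicitly.
parts['number'] = parts['number'].astype(int)
print(parts)
```

Because extract() returns a DataFrame with one column per group, no list unpacking is needed afterwards.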

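## Time Series

The "time series" in this heading means a Pandas Series whose values are dates and times. Time is essential in time-series data and panel data, but Python's native datetime type is comparatively slow and cannot easily be used in vectorized operations, so Pandas builds on NumPy's date-time type. In NumPy, for example: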

In [14]:
begin=np.array('2020-01-01', dtype=np.datetime64)
print(begin)
delta=np.arange(5)
nextdays=begin+delta
nextdays

2020-01-01


array(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
       '2020-01-05'], dtype='datetime64[D]')

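As shown above, one advantage of NumPy's date-time type is that vectorized operations are easy.

Pandas introduces the Timestamp type, and strings can easily be converted to timestamps, for example: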

In [15]:
import datetime
dt=pd.Timestamp(datetime.datetime(2020, 3, 13)) ## convert a Python datetime to a pandas Timestamp
print(dt)
dt1=pd.Timestamp('2020-1-13') ## convert a string to a pandas Timestamp
print(dt1)
dt2=pd.Timestamp(2020, 3, 13)
print(dt2)

2020-03-13 00:00:00
2020-01-13 00:00:00
2020-03-13 00:00:00


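Note that the objects above represent a point in time (that is what "timestamp" means). Sometimes we want to represent a span of time instead: by 2020-3-13 we may mean the whole day rather than 00:00:00 on that day. In that case we need a time span, which Pandas provides as the Period type, for example: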

In [16]:
dt=pd.Period('2020-01')
print(dt)
dt1=pd.Period('2020-01-30', freq='D')
print(dt1)
dt2=pd.Period('2020-01', freq='M')
print(dt2)
dt3=pd.Period('2020-04', freq='Q')
print(dt3)
dt4=pd.Period('2020q4', freq='Q')
print(dt4)

2020-01
2020-01-30
2020-01
2020Q2
2020Q4


For the full set of date and time conversion functions, see: https://pandas.pydata.org/docs/user_guide/timeseries.html
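
The difference between a point in time and a span of time can be made concrete with a small sketch. It uses the same date as the text; the variable names are illustrative only.

```python
import pandas as pd

ts = pd.Timestamp('2020-03-13')          # a single instant (midnight)
day = pd.Period('2020-03-13', freq='D')  # the whole day

# A Period has a start and an end; noon on that day lies between them.
noon = pd.Timestamp('2020-03-13 12:00')
print(day.start_time, day.end_time)
print(day.start_time <= noon <= day.end_time)

# Conversion works in both directions.
print(ts.to_period('D') == day)
print(day.to_timestamp() == ts)
```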

Naturally, these date-time objects all support vectorized arithmetic, for example:

In [17]:
dt1=pd.Period('2020-04', freq='M')
a=pd.Series(np.arange(10))
dta=dt1+a
dt2=pd.Period('2020-01-30', freq='D')
dta2=dt2+a
print("dta=\n",dta)
print("-----------")
print("dta2=\n",dta2)

dta=
 0    2020-04
1    2020-05
2    2020-06
3    2020-07
4    2020-08
5    2020-09
6    2020-10
7    2020-11
8    2020-12
9    2021-01
dtype: period[M]
-----------
dta2=
 0    2020-01-30
1    2020-01-31
2    2020-02-01
3    2020-02-02
4    2020-02-03
5    2020-02-04
6    2020-02-05
7    2020-02-06
8    2020-02-07
9    2020-02-08
dtype: period[D]


To create an index of timestamps, use the pd.date_range() function, for example:

In [18]:
ts=pd.date_range('2020-1-1 15:00:00', periods=5, freq='M')
ts

DatetimeIndex(['2020-01-31 15:00:00', '2020-02-29 15:00:00',
               '2020-03-31 15:00:00', '2020-04-30 15:00:00',
               '2020-05-31 15:00:00'],
              dtype='datetime64[ns]', freq='M')

Timestamps also support arithmetic, such as computing the number of days between them:

In [19]:
ts1=pd.date_range('2019-1-1 15:00:00', periods=5, freq='D')
ts-ts1

TimedeltaIndex(['395 days', '423 days', '453 days', '482 days', '512 days'], dtype='timedelta64[ns]', freq=None)

Note that the result above is a TimedeltaIndex, which represents time differences (time deltas).
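
For a single pair of timestamps, the difference is a pd.Timedelta. A short sketch, using the first pair of dates from the two ranges above:

```python
import pandas as pd

start = pd.Timestamp('2019-01-01 15:00:00')
end = pd.Timestamp('2020-01-31 15:00:00')

gap = end - start   # a pd.Timedelta
print(gap.days)     # number of whole days in the gap

# A Timedelta can be added back to a Timestamp.
later = start + pd.Timedelta(days=30)
print(later)
```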

Similarly, pd.period_range() creates an index of periods:

In [20]:
pd.period_range('2000','2010', freq='Y')

PeriodIndex(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
             '2008', '2009', '2010'],
            dtype='period[A-DEC]', freq='A-DEC')

In addition, shift() moves the values of a Series along its index, which gives lags and leads in time, for example:

In [21]:
print("dta.shift(1):\n",dta.shift(1))
print("-----------")
print("dta2.shift(-1):\n",dta2.shift(-1))

dta.shift(1):
 0        NaT
1    2020-04
2    2020-05
3    2020-06
4    2020-07
5    2020-08
6    2020-09
7    2020-10
8    2020-11
9    2020-12
dtype: period[M]
-----------
dta2.shift(-1):
 0    2020-01-31
1    2020-02-01
2    2020-02-02
3    2020-02-03
4    2020-02-04
5    2020-02-05
6    2020-02-06
7    2020-02-07
8    2020-02-08
9           NaT
dtype: period[D]


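# Indexes

As we have seen, the biggest difference between Pandas data and NumPy data is the index: a Pandas Series can be thought of simply as an array with an index.

The index plays a crucial role in Pandas. What makes it especially convenient is that many kinds of data (strings, tuples, timestamps, periods, and so on) can serve as the index, not just row numbers. For example: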

In [22]:
a=pd.Series([1.,2,3])

print("default index:\n",a)
print('---------')
b=pd.Series([1.,2,3], index=[1,5,3])
print("integer index:\n",b)
print('---------')
c=pd.Series([1.,2,3], index=['小明','小强','小红'])
print("string index:\n",c)
print('---------')
dt1=pd.Period('2020-01', freq='M')
dt=dt1+np.arange(3)
d=pd.Series([1.,2,3], index=dt)
print("Period index:\n",d)

default index:
 0    1.0
1    2.0
2    3.0
dtype: float64
---------
integer index:
 1    1.0
5    2.0
3    3.0
dtype: float64
---------
string index:
 小明    1.0
小强    2.0
小红    3.0
dtype: float64
---------
Period index:
 2020-01    1.0
2020-02    2.0
2020-03    3.0
Freq: M, dtype: float64


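Since a Series has an index, we can easily use it for slicing. In fact, the slicing and masking operations covered earlier for NumPy all work here as well, for example: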

In [23]:
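print("first two elements of a:\n",a[0:2])
print("element of b with index label 5:",b[5])
print("first two elements of b:\n",b[0:2])
print("element of c for 小红:",c['小红'])
print("element of d for February:",d['2020-02'])
print("element of d for March:",d[pd.Period('2020-03', freq='M')])

first two elements of a:
 0    1.0
1    2.0
dtype: float64
element of b with index label 5: 2.0
first two elements of b:
 1    1.0
5    2.0
dtype: float64
element of c for 小红: 3.0
element of d for February: 2.0
element of d for March: 3.0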


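The two lines involving b are potentially confusing: b[5] appears to select the element whose index label is 5, whereas b[0:2] appears to select the first two rows.

In fact, when b[] is used to select a single element, it does look up the index label, which is called **explicit indexing**. When a slice is used, it defaults to **implicit indexing**, that is, it selects by row position.

To avoid this confusion, when writing code it is best either to avoid integer indexes or, better, to use an indexer so that the intent is explicit. The commonly used indexers are:

* loc[]: **explicit indexing** (by label)
* iloc[]: **implicit indexing** (by position)

For example: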

In [24]:
print("explicit indexing (loc):\n",b.loc[1:3])
print("implicit indexing (iloc):\n",b.iloc[1:3])

explicit indexing (loc):
 1    1.0
5    2.0
3    3.0
dtype: float64
implicit indexing (iloc):
 5    2.0
3    3.0
dtype: float64
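
One further difference between the two indexers is how slice endpoints are treated. The sketch below reuses the Series b from above; the slices chosen are illustrative.

```python
import pandas as pd

b = pd.Series([1., 2, 3], index=[1, 5, 3])

# loc works on labels; a label slice includes BOTH endpoints.
by_label = b.loc[5:3]
# iloc works on positions; the end position is excluded, as with Python lists.
by_position = b.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())

# Single elements: label 3 versus position 0.
print(b.loc[3], b.iloc[0])
```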


Masks can also be used, for example:

In [25]:
print(d[d>1])

2020-02    2.0
2020-03    3.0
Freq: M, dtype: float64


Thanks to the index, besides viewing a Series as an array with an index, we can also view it as an ordered dictionary, for example:

In [26]:
print(c['小红'])

3.0


In fact, we can also create a Series from a dictionary, for example:

In [27]:
score={'小明':99,
      '小红':88,
      '小青':66}
score_pd=pd.Series(score)
score_pd

小明    99
小红    88
小青    66
dtype: int64

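We can also read a value with attribute syntax, as in "score_pd.小红". Although this is convenient, it can easily be confused with the attributes and methods of the Series object, so it is best avoided.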
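
The dictionary view extends to several other dict-style operations. A brief sketch on the same scores; the extra name 小刚 and its score are made up for illustration.

```python
import pandas as pd

score_pd = pd.Series({'小明': 99, '小红': 88, '小青': 66})

# Membership tests look at the index (the "keys"), not at the values.
print('小红' in score_pd)
print(88 in score_pd)

# keys() returns the index; items() yields (label, value) pairs.
print(list(score_pd.keys()))
for name, value in score_pd.items():
    print(name, value)

# Assigning to a new label adds an entry, as with a dict.
score_pd['小刚'] = 77
print(len(score_pd))
```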

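For time indexes, we have already seen how shift() moves the values through time. When time is the index, we can instead shift the index itself. Older versions of Pandas provided tshift() for this; it was deprecated in pandas 1.1 and removed in pandas 2.0, and shift() with the freq argument now does the same job, for example:

In [28]:
print('d=\n',d)
e=d.shift(1, freq=d.index.freq) ## shift the index by one period; in older Pandas: d.tshift(1)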
print('e=\n',e)

d=
 2020-01    1.0
2020-02    2.0
2020-03    3.0
Freq: M, dtype: float64
e=
 2020-02    1.0
2020-03    2.0
2020-04    3.0
Freq: M, dtype: float64
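
The difference between moving the values and moving the index labels can be sketched side by side. This example rebuilds d with period_range() and moves the labels by adding an integer to the PeriodIndex, which is one of several possible ways to do it.

```python
import pandas as pd

idx = pd.period_range('2020-01', periods=3, freq='M')
d = pd.Series([1., 2, 3], index=idx)

# shift(1): the index stays put and the values move down one row,
# so the first row becomes missing.
moved_values = d.shift(1)

# Adding an integer to a PeriodIndex moves every label forward by that
# many periods; each value stays with its row.
moved_index = d.copy()
moved_index.index = d.index + 1

print(moved_values)
print(moved_index)
```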


## Index Objects

In Pandas the index is itself an object. For example, we can use:

In [29]:
score_pd.index

Index(['小明', '小红', '小青'], dtype='object')

to view the values of the index, and use:

In [30]:
score_pd.index[0]

'小明'

to access an individual element of the index. An index can also be given a name:

In [31]:
score_pd.index.names=['name']
score_pd

name
小明    99
小红    88
小青    66
dtype: int64

Note that the index object is also iterable:

In [32]:
for n in score_pd.index:
    print(n,'->',score_pd[n])

小明 -> 99
小红 -> 88
小青 -> 66


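## Hierarchical Indexes

Sometimes the index of the data is not one-dimensional but has several dimensions. A typical example is panel data, which has both an individual dimension and a time dimension; both are needed to identify an observation uniquely. In such cases we use a hierarchical index.

To create a multi-level index in Pandas, simply pass a multi-dimensional index when creating the data, for example: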

In [33]:
data=pd.Series(np.arange(6), index=[['SH','SH','BJ','BJ','HZ','HZ'],[2000,2001,2000,2001,2000,2001]])
data

SH  2000    0
    2001    1
BJ  2000    2
    2001    3
HZ  2000    4
    2001    5
dtype: int64

The index of the Series above now has two levels. We can also name the index levels, for example:

In [34]:
data.index.names=['city','year']
data

city  year
SH    2000    0
      2001    1
BJ    2000    2
      2001    3
HZ    2000    4
      2001    5
dtype: int64

Inspecting the index shows that it is a MultiIndex object:

In [35]:
data.index

MultiIndex([('SH', 2000),
            ('SH', 2001),
            ('BJ', 2000),
            ('BJ', 2001),
            ('HZ', 2000),
            ('HZ', 2001)],
           names=['city', 'year'])

We can check its number of levels:

In [36]:
data.index.nlevels

2

We can also list all the index values along one level:

In [37]:
data.index.get_level_values(0) #data.index.get_level_values('city')

Index(['SH', 'SH', 'BJ', 'BJ', 'HZ', 'HZ'], dtype='object', name='city')

Or the unique values of a level:

In [38]:
data.index.unique(1) #data.index.unique('year')

Int64Index([2000, 2001], dtype='int64', name='year')

With a multi-level index, slicing and similar operations work as before. For example, to get a single element:

In [39]:
data['SH',2001]

1

Or the data for one city:

In [40]:
data['SH']

year
2000    0
2001    1
dtype: int64

Or the data for one year:

In [41]:
data[:,2001]

city
SH    1
BJ    3
HZ    5
dtype: int64

As well as masks:

In [42]:
data[data>=3]

city  year
BJ    2001    3
HZ    2000    4
      2001    5
dtype: int64
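
The same kind of city/year index can also be built explicitly, and a named level can be used for selection. A minimal sketch, assuming the same toy data as above:

```python
import numpy as np
import pandas as pd

# Build the city/year index from the product of two lists.
index = pd.MultiIndex.from_product([['SH', 'BJ', 'HZ'], [2000, 2001]],
                                   names=['city', 'year'])
data = pd.Series(np.arange(6), index=index)

# xs() takes a cross-section on a named level.
year_2001 = data.xs(2001, level='year')
print(year_2001)

# loc with a tuple selects one (city, year) observation.
bj_2000 = data.loc[('BJ', 2000)]
print(bj_2000)
```

Selecting by level name with xs() avoids having to remember the position of each level, which helps once an index has more than two levels.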

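# DataFrames

Convenient as the Series object is, it is still very basic: each Series is just a one-dimensional vector, whereas our data usually contain many variables. Putting several Series together gives a **DataFrame**.

A DataFrame can be created from several Series, for example: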

In [43]:
goals=pd.Series([35,10,3], index=['Messi','Suarez','Pique'])
passes=pd.Series([55,40,60], index=['Messi','Suarez','Xavi'])

data=pd.DataFrame({'goals':goals, 'passes':passes})
data.index.name='name'
data.head()

        goals  passes
name
Messi    35.0    55.0
Pique     3.0     NaN
Suarez   10.0    40.0
Xavi      NaN    60.0


As we can see, Pandas takes the union of the two Series' indexes and automatically fills the entries that have no data with NaN.
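
The same index alignment applies to arithmetic between Series. A short sketch using the goals and passes Series from above; treating a missing entry as 0 through `fill_value` is an illustrative choice, not something the data requires.

```python
import pandas as pd

goals = pd.Series([35, 10, 3], index=['Messi', 'Suarez', 'Pique'])
passes = pd.Series([55, 40, 60], index=['Messi', 'Suarez', 'Xavi'])

# Plain addition aligns on the index; a label present on only one side gives NaN.
total = goals + passes

# add() with fill_value treats a label missing on one side as 0.
total_filled = goals.add(passes, fill_value=0)

print(total)
print(total_filled)
```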

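There are many ways to create a DataFrame. For example, we can also create one from dictionaries, except that now we need a list of dictionaries: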

In [44]:
players=[{'name':'Messi','goals':35,'passes':55},
         {'name':'Pique','goals':3},
        {'name':'Suarez','goals':10,'passes':40},
        {'name':'Xavi','passes':60}]
data=pd.DataFrame(players)
data.head()

     name  goals  passes
0   Messi   35.0    55.0
1   Pique    3.0     NaN
2  Suarez   10.0    40.0
3    Xavi    NaN    60.0


Above, name ended up as an ordinary data column. We can use the set_index() method to make name the index:

In [45]:
data=data.set_index('name')
print(data.index)
data.head()

Index(['Messi', 'Pique', 'Suarez', 'Xavi'], dtype='object', name='name')


        goals  passes
name
Messi    35.0    55.0
Pique     3.0     NaN
Suarez   10.0    40.0
Xavi      NaN    60.0


Besides data.index for the row index, data.columns gives the column labels (the column index):

In [46]:
data.columns

Index(['goals', 'passes'], dtype='object')

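So the column labels are in fact an index too.

A DataFrame is very convenient to work with because it can be viewed as a blend of several data structures. For example:

**A DataFrame can be viewed as a special kind of dictionary**, one whose keys are the columns, for example: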

In [47]:
data['goals']

name
Messi     35.0
Pique      3.0
Suarez    10.0
Xavi       NaN
Name: goals, dtype: float64

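Note the difference here: for a Series, an expression such as d[0] selects by **row**, whereas for a DataFrame, data['goals'] selects a column.

**A DataFrame can also be viewed as a NumPy matrix**. For example, we can transpose it: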

In [48]:
data.transpose()

name    Messi  Pique  Suarez  Xavi
goals    35.0    3.0    10.0   NaN
passes   55.0    NaN    40.0  60.0


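Here the rows and columns have been swapped completely. This is also a reminder that in a DataFrame both the rows and the columns carry an index.

For the same reason, a DataFrame supports the operations that NumPy allows, for example: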

In [49]:
log_data=np.log(data)
log_data

           goals    passes
name
Messi   3.555348  4.007333
Pique   1.098612       NaN
Suarez  2.302585  3.688879
Xavi         NaN  4.094345


Of course, arithmetic between two DataFrames is also possible:

In [50]:
data+log_data

            goals     passes
name
Messi   38.555348  59.007333
Pique    4.098612        NaN
Suarez  12.302585  43.688879
Xavi          NaN  64.094345


More often, though, we simply add a new column to the existing DataFrame:

In [51]:
data['log_goals']=np.log(data['goals'])
data

        goals  passes  log_goals
name
Messi    35.0    55.0   3.555348
Pique     3.0     NaN   1.098612
Suarez   10.0    40.0   2.302585
Xavi      NaN    60.0        NaN


The index is unchanged by these operations.

For missing data, isnull() and notnull() are of course also available:

In [52]:
data.isnull()

        goals  passes  log_goals
name
Messi   False   False      False
Pique   False    True      False
Suarez  False   False      False
Xavi     True   False       True


Note that dropna(), by default, deletes every **row** that contains **any** missing value:

In [53]:
data_copy=data.copy() ## plain assignment would only create a second name for the same object
data_copy.dropna()

        goals  passes  log_goals
name
Messi    35.0    55.0   3.555348
Suarez   10.0    40.0   2.302585


We can also drop every **column** that contains **any** missing value:

In [54]:
data_copy=data.copy() ## plain assignment would only create a second name for the same object
data_copy.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [Messi, Pique, Suarez, Xavi]


In addition, Pandas provides fillna() for filling in missing values, for example replacing all missing values with 0:

In [55]:
data1=data.fillna(0)
data1

        goals  passes  log_goals
name
Messi    35.0    55.0   3.555348
Pique     3.0     0.0   1.098612
Suarez   10.0    40.0   2.302585
Xavi      0.0    60.0   0.000000


fillna() has other uses as well, which we do not cover here.
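
As a brief pointer to those other uses, the sketch below fills each column with its own mean and drops rows based on a single column. The table is rebuilt here so that the example is self-contained, and filling with the mean is only an illustrative choice.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'goals': [35, 3, 10, np.nan],
                     'passes': [55, np.nan, 40, 60]},
                    index=['Messi', 'Pique', 'Suarez', 'Xavi'])

# Passing a Series to fillna() fills each column with the matching value,
# here the column's own mean.
filled = data.fillna(data.mean())

# dropna(subset=...) drops a row only if the listed column is missing.
has_goals = data.dropna(subset=['goals'])

print(filled)
print(has_goals)
```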

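## Selecting Data

Selecting data from a DataFrame works much as it does for a Series. Since a DataFrame can be viewed as a dictionary of Series, we can access the data in a similar way, for example: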

In [56]:
data['goals']['Messi']

35.0

To select several columns, pass a list:

In [57]:
data[['goals','passes']]

        goals  passes
name
Messi    35.0    55.0
Pique     3.0     NaN
Suarez   10.0    40.0
Xavi      NaN    60.0


With a slice, however, the default is to slice **rows**:

In [58]:
data['Messi':'Pique']

       goals  passes  log_goals
name
Messi   35.0    55.0   3.555348
Pique    3.0     NaN   1.098612


Likewise, **masks operate on rows**:

In [59]:
data[data['goals']>=10]

        goals  passes  log_goals
name
Messi    35.0    55.0   3.555348
Suarez   10.0    40.0   2.302585


To use explicit or implicit indexing on both dimensions, use the loc[] and iloc[] indexers. loc[] is the explicit (label-based) indexer:

In [60]:
data.loc['Messi':'Suarez','goals':'passes']

        goals  passes
name
Messi    35.0    55.0
Pique     3.0     NaN
Suarez   10.0    40.0


iloc[] is the implicit (position-based) indexer:

In [61]:
data.iloc[:2,1:]

       passes  log_goals
name
Messi    55.0   3.555348
Pique     NaN   1.098612
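
A row mask and a column label can also be combined in a single loc[] call, which works for assignment too. A minimal sketch on a rebuilt copy of the table; setting the missing passes to 0 is purely illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'goals': [35, 3, 10, np.nan],
                     'passes': [55, np.nan, 40, 60]},
                    index=['Messi', 'Pique', 'Suarez', 'Xavi'])

# Rows where goals >= 10, restricted to the 'passes' column.
top_passes = data.loc[data['goals'] >= 10, 'passes']
print(top_passes)

# loc[] also supports assignment to the selected cells.
data.loc[data['passes'].isnull(), 'passes'] = 0
print(data)
```

Writing the row and column selection in one loc[] call, rather than chaining two [] operations, makes it unambiguous that the assignment targets the original DataFrame.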


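## Time Indexes in DataFrames

DataFrames can of course also use time indexes and multi-level indexes.

Time indexes are handled much as before. For example, we can load the annual GDP data from tushare again: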

In [62]:
import tushare as tus
gdp=tus.get_gdp_year()
gdp.head()

Unnamed: 0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy
0,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0
1,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8
2,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2
3,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8
4,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4


Note that the year variable above is not yet a date-time object. We can first use pd.to_datetime() to convert it to timestamps:

In [63]:
tstamp=pd.to_datetime(gdp['year'], format='%Y')
tstamp

0    2019-01-01
1    2018-01-01
2    2017-01-01
3    2016-01-01
4    2015-01-01
        ...    
63   1956-01-01
64   1955-01-01
65   1954-01-01
66   1953-01-01
67   1952-01-01
Name: year, Length: 68, dtype: datetime64[ns]

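Here format="%Y" specifies the date format of the data; the function converts numbers, strings, and similar values into Pandas timestamps.

Next we can use these timestamps as the index of the data: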

In [64]:
gdp_new=gdp.set_index(tstamp)
gdp_new

Unnamed: 0_level_0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2019-01-01,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0
2018-01-01,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8
2017-01-01,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2
2016-01-01,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8
2015-01-01,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4
...,...,...,...,...,...,...,...,...,...,...,...
1956-01-01,1956,1028.0,165.0,1028.0,443.9,280.7,224.7,56.0,303.4,46.0,131.4
1955-01-01,1955,910.0,150.0,910.0,421.0,222.2,191.2,31.0,266.8,39.0,119.8
1954-01-01,1954,859.0,144.0,859.0,392.0,211.7,184.7,27.0,255.3,38.0,120.3
1953-01-01,1953,824.0,142.0,824.0,378.0,192.5,163.5,29.0,253.5,35.0,115.5


The to_period() method then converts the index to Periods:

In [65]:
gdp_new.index.to_period('Y')

PeriodIndex(['2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012',
             '2011', '2010', '2009', '2008', '2007', '2006', '2005', '2004',
             '2003', '2002', '2001', '2000', '1999', '1998', '1997', '1996',
             '1995', '1994', '1993', '1992', '1991', '1990', '1989', '1988',
             '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980',
             '1979', '1978', '1977', '1976', '1975', '1974', '1973', '1972',
             '1971', '1970', '1969', '1968', '1967', '1966', '1965', '1964',
             '1963', '1962', '1961', '1960', '1959', '1958', '1957', '1956',
             '1955', '1954', '1953', '1952'],
            dtype='period[A-DEC]', name='year', freq='A-DEC')

To use the Periods as the index, call set_index() again:

In [66]:
gdp_final=gdp.set_index(gdp_new.index.to_period('Y'))
gdp_final

Unnamed: 0_level_0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2019,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0
2018,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8
2017,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2
2016,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8
2015,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4
...,...,...,...,...,...,...,...,...,...,...,...
1956,1956,1028.0,165.0,1028.0,443.9,280.7,224.7,56.0,303.4,46.0,131.4
1955,1955,910.0,150.0,910.0,421.0,222.2,191.2,31.0,266.8,39.0,119.8
1954,1954,859.0,144.0,859.0,392.0,211.7,184.7,27.0,255.3,38.0,120.3
1953,1953,824.0,142.0,824.0,378.0,192.5,163.5,29.0,253.5,35.0,115.5


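With a time index in place, we can use shift(), including its freq argument (which replaces the tshift() method removed in pandas 2.0):

In [67]:
lag_gdp=gdp_final['gdp'].shift(1, freq=gdp_final.index.freq) ## in older Pandas: gdp_final['gdp'].tshift(1)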
lag_gdp

year
2020    990865.0
2019    919281.0
2018    820754.3
2017    740060.8
2016    685992.9
          ...   
1957      1028.0
1956       910.0
1955       859.0
1954       824.0
1953       679.0
Freq: A-DEC, Name: gdp, Length: 68, dtype: float64

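The result can be added straight to the data; the assignment aligns on the index, so each year receives the previous year's GDP: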

In [68]:
gdp_final['lag_gdp']=lag_gdp
gdp_final

Unnamed: 0_level_0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,lag_gdp
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2019,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0,919281.0
2018,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,820754.3
2017,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,740060.8
2016,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,685992.9
2015,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,641280.6
...,...,...,...,...,...,...,...,...,...,...,...,...
1956,1956,1028.0,165.0,1028.0,443.9,280.7,224.7,56.0,303.4,46.0,131.4,910.0
1955,1955,910.0,150.0,910.0,421.0,222.2,191.2,31.0,266.8,39.0,119.8,859.0
1954,1954,859.0,144.0,859.0,392.0,211.7,184.7,27.0,255.3,38.0,120.3,824.0
1953,1953,824.0,142.0,824.0,378.0,192.5,163.5,29.0,253.5,35.0,115.5,679.0


We can then compute the GDP growth rate:

In [69]:
gdp_final['gdp_growth']=(gdp_final['gdp']-gdp_final['lag_gdp'])/gdp_final['lag_gdp']
del gdp_final['year']
gdp_final

Unnamed: 0_level_0,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,lag_gdp,gdp_growth
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0,919281.0,0.077870
2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,820754.3,0.120044
2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,740060.8,0.109036
2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,685992.9,0.078817
2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,641280.6,0.069723
...,...,...,...,...,...,...,...,...,...,...,...,...
1956,1028.0,165.0,1028.0,443.9,280.7,224.7,56.0,303.4,46.0,131.4,910.0,0.129670
1955,910.0,150.0,910.0,421.0,222.2,191.2,31.0,266.8,39.0,119.8,859.0,0.059371
1954,859.0,144.0,859.0,392.0,211.7,184.7,27.0,255.3,38.0,120.3,824.0,0.042476
1953,824.0,142.0,824.0,378.0,192.5,163.5,29.0,253.5,35.0,115.5,679.0,0.213549
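
The same growth rate can also be obtained with pct_change(), provided the rows are in chronological order. The sketch below uses made-up numbers rather than the tushare data, and the name gdp_toy is hypothetical.

```python
import pandas as pd

# Made-up numbers for illustration, not the tushare data above.
gdp_toy = pd.Series([100.0, 110.0, 121.0],
                    index=pd.period_range('2000', periods=3, freq='Y'),
                    name='gdp')

# Sort chronologically first so that "previous row" means "previous year".
gdp_toy = gdp_toy.sort_index()

# pct_change() computes the change relative to the previous row.
growth = gdp_toy.pct_change()

# The same quantity written out with shift().
manual = (gdp_toy - gdp_toy.shift(1)) / gdp_toy.shift(1)

print(growth)
print(manual)
```

Unlike the index-based lag used above, pct_change() works on row order, so the sort_index() step matters when the data are stored newest-first.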


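## Multi-level Indexes

DataFrames also support multi-level indexes, and they are created in the same way as for a Series.

Here we focus on how to read data from a file and build a multi-level index.

In the first case, much data is stored in "wide format", as in the following file: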

In [70]:
hcw=pd.read_csv("csv/hcw.csv")
hcw.head()

Unnamed: 0,time,HongKong,Australia,Austria,Canada,Denmark,Finland,France,Germany,Italy,...,Switzerland,UnitedKingdom,UnitedStates,Singapore,Philippines,Indonesia,Malaysia,Thailand,Taiwan,China
0,1993q1,0.062,0.040489,-0.013084,0.010064,-0.012292,-0.028357,-0.015177,-0.01968,-0.023383,...,-0.032865,0.015124,0.022959,0.087145,-0.004381,0.064024,0.085938,0.08,0.064902,0.143
1,1993q2,0.059,0.037857,-0.007581,0.021264,-0.003093,-0.023397,-0.014549,-0.015441,-0.018116,...,-0.019818,0.014795,0.018936,0.118075,0.016636,0.066068,0.131189,0.08,0.065123,0.141
2,1993q3,0.058,0.022509,0.000543,0.018919,-0.007764,-0.006018,-0.016704,-0.012701,-0.016875,...,-0.004587,0.029149,0.01799,0.11113,0.031504,0.057959,0.109666,0.08,0.067379,0.135
3,1993q4,0.062,0.028747,0.001181,0.025317,-0.004049,-0.004774,-0.007476,-0.011667,-0.004963,...,0.013651,0.036581,0.020683,0.125324,0.034007,0.062365,0.075801,0.08,0.069164,0.135
4,1994q1,0.079,0.03399,0.025511,0.043567,0.031094,0.012886,0.003748,0.02295,-0.002249,...,0.026644,0.030078,0.029918,0.130709,0.049344,0.049743,0.049147,0.112509,0.069451,0.125


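To convert it to a "long format" Series, we can use the DataFrame stack() method. Before doing so, though, it is worth converting the time variable to the proper type first: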

In [71]:
hcw['quarter']=pd.to_datetime(hcw['time'])
hcw=hcw.set_index(hcw['quarter'])
hcw=hcw.set_index(hcw.index.to_period('Q'))
del hcw['time']
del hcw['quarter']
hcw

Unnamed: 0_level_0,HongKong,Australia,Austria,Canada,Denmark,Finland,France,Germany,Italy,Japan,...,Switzerland,UnitedKingdom,UnitedStates,Singapore,Philippines,Indonesia,Malaysia,Thailand,Taiwan,China
quarter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1993Q1,0.062,0.040489,-0.013084,0.010064,-0.012292,-0.028357,-0.015177,-0.019680,-0.023383,0.012683,...,-0.032865,0.015124,0.022959,0.087145,-0.004381,0.064024,0.085938,0.080000,0.064902,0.1430
1993Q2,0.059,0.037857,-0.007581,0.021264,-0.003093,-0.023397,-0.014549,-0.015441,-0.018116,-0.005571,...,-0.019818,0.014795,0.018936,0.118075,0.016636,0.066068,0.131189,0.080000,0.065123,0.1410
1993Q3,0.058,0.022509,0.000543,0.018919,-0.007764,-0.006018,-0.016704,-0.012701,-0.016875,-0.017558,...,-0.004587,0.029149,0.017990,0.111130,0.031504,0.057959,0.109666,0.080000,0.067379,0.1350
1993Q4,0.062,0.028747,0.001181,0.025317,-0.004049,-0.004774,-0.007476,-0.011667,-0.004963,-0.010101,...,0.013651,0.036581,0.020683,0.125324,0.034007,0.062365,0.075801,0.080000,0.069164,0.1350
1994Q1,0.079,0.033990,0.025511,0.043567,0.031094,0.012886,0.003748,0.022950,-0.002249,-0.022503,...,0.026644,0.030078,0.029918,0.130709,0.049344,0.049743,0.049147,0.112509,0.069451,0.1250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2007Q1,0.055,0.058013,0.036198,0.030712,0.033134,0.047340,0.032755,0.035225,0.025621,0.023959,...,0.046517,0.027946,0.017708,0.063092,0.065573,0.095297,0.039884,0.049846,0.041019,0.1110
2007Q2,0.062,0.059519,0.032570,0.039827,-0.007169,0.046808,0.030355,0.023897,0.017251,0.014193,...,0.046339,0.039619,0.018756,0.101377,0.081891,0.110900,0.080276,0.051197,0.051073,0.1167
2007Q3,0.068,0.056649,0.031558,0.034742,0.013517,0.045647,0.036748,0.020773,0.023338,0.012907,...,0.040958,0.038682,0.028225,0.105562,0.064359,0.110100,0.093361,0.050126,0.066369,0.1002
2007Q4,0.069,0.045825,0.019095,0.038128,0.023794,0.044177,0.021745,0.005865,0.005081,-0.004768,...,0.037792,0.033362,0.009288,0.069739,0.065470,0.110700,0.151739,0.079587,0.062929,0.1017


Now we can use stack() to convert to long format:

In [72]:
hcw_new=hcw.stack()
hcw_new.index.names=['quarter','city']
hcw_new.head()

quarter  city     
1993Q1   HongKong     0.062000
         Australia    0.040489
         Austria     -0.013084
         Canada       0.010064
         Denmark     -0.012292
dtype: float64

In [73]:
hcw_new.index

MultiIndex([('1993Q1',      'HongKong'),
            ('1993Q1',     'Australia'),
            ('1993Q1',       'Austria'),
            ('1993Q1',        'Canada'),
            ('1993Q1',       'Denmark'),
            ('1993Q1',       'Finland'),
            ('1993Q1',        'France'),
            ('1993Q1',       'Germany'),
            ('1993Q1',         'Italy'),
            ('1993Q1',         'Japan'),
            ...
            ('2008Q1',   'Switzerland'),
            ('2008Q1', 'UnitedKingdom'),
            ('2008Q1',  'UnitedStates'),
            ('2008Q1',     'Singapore'),
            ('2008Q1',   'Philippines'),
            ('2008Q1',     'Indonesia'),
            ('2008Q1',      'Malaysia'),
            ('2008Q1',      'Thailand'),
            ('2008Q1',        'Taiwan'),
            ('2008Q1',         'China')],
           names=['quarter', 'city'], length=1525)

To convert the data back to wide format, use unstack():

In [74]:
hcw_new.unstack()

city,HongKong,Australia,Austria,Canada,Denmark,Finland,France,Germany,Italy,Japan,...,Switzerland,UnitedKingdom,UnitedStates,Singapore,Philippines,Indonesia,Malaysia,Thailand,Taiwan,China
quarter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1993Q1,0.062,0.040489,-0.013084,0.010064,-0.012292,-0.028357,-0.015177,-0.019680,-0.023383,0.012683,...,-0.032865,0.015124,0.022959,0.087145,-0.004381,0.064024,0.085938,0.080000,0.064902,0.1430
1993Q2,0.059,0.037857,-0.007581,0.021264,-0.003093,-0.023397,-0.014549,-0.015441,-0.018116,-0.005571,...,-0.019818,0.014795,0.018936,0.118075,0.016636,0.066068,0.131189,0.080000,0.065123,0.1410
1993Q3,0.058,0.022509,0.000543,0.018919,-0.007764,-0.006018,-0.016704,-0.012701,-0.016875,-0.017558,...,-0.004587,0.029149,0.017990,0.111130,0.031504,0.057959,0.109666,0.080000,0.067379,0.1350
1993Q4,0.062,0.028747,0.001181,0.025317,-0.004049,-0.004774,-0.007476,-0.011667,-0.004963,-0.010101,...,0.013651,0.036581,0.020683,0.125324,0.034007,0.062365,0.075801,0.080000,0.069164,0.1350
1994Q1,0.079,0.033990,0.025511,0.043567,0.031094,0.012886,0.003748,0.022950,-0.002249,-0.022503,...,0.026644,0.030078,0.029918,0.130709,0.049344,0.049743,0.049147,0.112509,0.069451,0.1250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2007Q1,0.055,0.058013,0.036198,0.030712,0.033134,0.047340,0.032755,0.035225,0.025621,0.023959,...,0.046517,0.027946,0.017708,0.063092,0.065573,0.095297,0.039884,0.049846,0.041019,0.1110
2007Q2,0.062,0.059519,0.032570,0.039827,-0.007169,0.046808,0.030355,0.023897,0.017251,0.014193,...,0.046339,0.039619,0.018756,0.101377,0.081891,0.110900,0.080276,0.051197,0.051073,0.1167
2007Q3,0.068,0.056649,0.031558,0.034742,0.013517,0.045647,0.036748,0.020773,0.023338,0.012907,...,0.040958,0.038682,0.028225,0.105562,0.064359,0.110100,0.093361,0.050126,0.066369,0.1002
2007Q4,0.069,0.045825,0.019095,0.038128,0.023794,0.044177,0.021745,0.005865,0.005081,-0.004768,...,0.037792,0.033362,0.009288,0.069739,0.065470,0.110700,0.151739,0.079587,0.062929,0.1017


In [75]:
hcw_new.unstack(0) ## choose which level of the hierarchical index becomes the columns

quarter,1993Q1,1993Q2,1993Q3,1993Q4,1994Q1,1994Q2,1994Q3,1994Q4,1995Q1,1995Q2,...,2005Q4,2006Q1,2006Q2,2006Q3,2006Q4,2007Q1,2007Q2,2007Q3,2007Q4,2008Q1
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
HongKong,0.062,0.059,0.058,0.062,0.079,0.068,0.046,0.052,0.037,0.029,...,0.069,0.09,0.062,0.064,0.066,0.055,0.062,0.068,0.069,0.073
Australia,0.040489,0.037857,0.022509,0.028747,0.03399,0.037919,0.052289,0.031071,0.008696,0.006774,...,0.054983,0.048067,0.026982,0.032731,0.038575,0.058013,0.059519,0.056649,0.045825,0.027523
Austria,-0.013084,-0.007581,0.000543,0.001181,0.025511,0.019941,0.017088,0.023035,0.025293,0.02185,...,0.032616,0.03832,0.035104,0.03722,0.038982,0.036198,0.03257,0.031558,0.019095,0.017431
Canada,0.010064,0.021264,0.018919,0.025317,0.043567,0.050225,0.065122,0.067331,0.050921,0.031525,...,0.050334,0.049476,0.041199,0.031677,0.020005,0.030712,0.039827,0.034742,0.038128,0.029217
Denmark,-0.012292,-0.003093,-0.007764,-0.004049,0.031094,0.06428,0.045955,0.055166,0.048057,0.011954,...,0.028752,0.049316,0.038801,0.041836,0.029809,0.033134,-0.007169,0.013517,0.023794,-0.0052
Finland,-0.028357,-0.023397,-0.006018,-0.004774,0.012886,0.03509,0.035247,0.057251,0.068382,0.079265,...,0.027939,0.043949,0.050649,0.044582,0.046806,0.04734,0.046808,0.045647,0.044177,0.023732
France,-0.015177,-0.014549,-0.016704,-0.007476,0.003748,0.016165,0.023915,0.029711,0.027446,0.021708,...,0.023968,0.027028,0.033463,0.030261,0.034915,0.032755,0.030355,0.036748,0.021745,0.018386
Germany,-0.01968,-0.015441,-0.012701,-0.011667,0.02295,0.02107,0.020662,0.028744,0.016826,0.028715,...,0.007969,0.008848,0.016911,0.023511,0.033108,0.035225,0.023897,0.020773,0.005865,0.00902
Italy,-0.023383,-0.018116,-0.016875,-0.004963,-0.002249,0.011635,0.026412,0.034283,0.025394,0.022612,...,0.015251,0.017682,0.019596,0.011813,0.012561,0.025621,0.017251,0.023338,0.005081,-0.005581
Japan,0.012683,-0.005571,-0.017558,-0.010101,-0.022503,-0.005157,0.014087,0.005427,0.003919,0.015349,...,0.018941,0.015203,0.01,0.005606,0.01546,0.023959,0.014193,0.012907,-0.004768,-0.013647
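
The effect of the level argument is easier to see on a tiny example. The sketch below uses a made-up four-row series (the name `long_df` and its rounded values are illustrative assumptions, not the `hcw_new` data) and shows that a level can be given either by position or by name:

```python
import pandas as pd

# Toy long-format series: one value per (quarter, city) pair.
long_df = pd.DataFrame({
    "quarter": ["1993Q1", "1993Q1", "1993Q2", "1993Q2"],
    "city": ["HongKong", "Japan", "HongKong", "Japan"],
    "growth": [0.062, 0.013, 0.059, -0.006],
}).set_index(["quarter", "city"])["growth"]

by_quarter = long_df.unstack(0)     # level 0 (quarter) becomes the columns
by_city = long_df.unstack("city")   # a level can also be referred to by name

print(by_quarter)
print(by_city)
```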


Of course, things are much more convenient when the data are already in long format, for example:

In [76]:
city=pd.read_csv('csv/city.csv')
city.head()

Unnamed: 0,CityCode,City,Year,population,gdp
0,110000,北京市,2003,1148.8199,36631000.0
1,120000,天津市,2003,926.0,24476600.0
2,130100,石家庄市,2003,910.51001,13779438.0
3,130200,唐山市,2003,706.28003,12953220.0
4,130300,秦皇岛市,2003,273.29001,3870301.0


A hierarchical index (MultiIndex) can then be set, for example:

In [77]:
city=city.set_index(['CityCode','Year'])
city

Unnamed: 0_level_0,Unnamed: 1_level_0,City,population,gdp
CityCode,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
110000,2003,北京市,1148.81990,36631000.0
120000,2003,天津市,926.00000,24476600.0
130100,2003,石家庄市,910.51001,13779438.0
130200,2003,唐山市,706.28003,12953220.0
130300,2003,秦皇岛市,273.29001,3870301.0
...,...,...,...,...
640300,2006,吴忠市,127.66000,1196379.0
640400,2006,固原市,151.23000,519279.0
640500,2006,中卫市,104.19000,750050.0
650100,2006,乌鲁木齐市,201.84000,6543019.0


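This approach has two problems, however. The first is that slicing on an unsorted hierarchical index may raise an error, so the index usually needs to be sorted first: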

In [78]:
city=city.sort_index()
city

Unnamed: 0_level_0,Unnamed: 1_level_0,City,population,gdp
CityCode,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
110000,2000,北京市,1107.530000,24787600.0
110000,2001,北京市,1122.300000,28456500.0
110000,2002,北京市,1136.000000,32127100.0
110000,2003,北京市,1148.819900,36631000.0
110000,2004,北京市,1162.890000,42833100.0
...,...,...,...,...
650200,2007,克拉玛依市,35.340000,5151297.0
650200,2008,克拉玛依市,38.619999,6612062.0
650200,2009,克拉玛依市,39.349998,4802909.0
650200,2010,克拉玛依市,37.509998,7113531.0
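
To make the sorting issue concrete, here is a minimal sketch on a hypothetical four-row frame (`toy` is an assumption for illustration; the GDP figures are copied from the tables in this section). Slicing the unsorted index is expected to fail with an `UnsortedIndexError`, which is a subclass of `KeyError`, while the same slice works after `sort_index()`:

```python
import pandas as pd

# Deliberately unsorted (CityCode, Year) index.
toy = pd.DataFrame({
    "CityCode": [310000, 110000, 310000, 110000],
    "Year": [2001, 2001, 2000, 2000],
    "gdp": [49508400.0, 28456500.0, 45511500.0, 24787600.0],
}).set_index(["CityCode", "Year"])

print(toy.index.is_monotonic_increasing)  # tells us whether the index is sorted

try:
    toy.loc[(110000, 2000):(110000, 2001)]
except KeyError as err:  # UnsortedIndexError derives from KeyError
    print("slicing failed:", err)

# After sorting, the same slice is well defined.
toy_sorted = toy.sort_index()
beijing = toy_sorted.loc[(110000, 2000):(110000, 2001)]
print(beijing)
```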


After sorting, slicing works:

In [79]:
city.loc[310000,:]

Unnamed: 0_level_0,City,population,gdp
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,上海市,1321.63,45511500.0
2001,上海市,1327.14,49508400.0
2002,上海市,1334.0,54087600.0
2003,上海市,1341.77,62508100.0
2004,上海市,1352.39,74502700.0
2005,上海市,1360.26,91541800.0
2006,上海市,1368.08,103663700.0
2007,上海市,1378.86,121888500.0
2008,上海市,1391.04,136981500.0
2009,上海市,1400.7,150464500.0


In [80]:
city[city['gdp']>160000000]

Unnamed: 0_level_0,Unnamed: 1_level_0,City,population,gdp
CityCode,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
110000,2011,北京市,1277.9,162519300.0
310000,2010,上海市,1412.3199,171659800.0
310000,2011,上海市,1419.4,191956900.0


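Note, however, that slicing now involves three dimensions: two belong to the row index and one to the column index. By default, both iloc and loc treat the second dimension as the columns, so an **IndexSlice object** is needed to express the slice unambiguously: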

In [81]:
ids=pd.IndexSlice
city.loc[ids[:,2001:2003],['population', 'gdp']]

Unnamed: 0_level_0,Unnamed: 1_level_0,population,gdp
CityCode,Year,Unnamed: 2_level_1,Unnamed: 3_level_1
110000,2001,1122.30000,28456500.0
110000,2002,1136.00000,32127100.0
110000,2003,1148.81990,36631000.0
120000,2001,913.97998,18401000.0
120000,2002,919.00000,20511600.0
...,...,...,...
650100,2002,176.00000,3544426.0
650100,2003,181.53000,4085834.0
650200,2001,28.43000,1675526.0
650200,2002,29.00000,1704788.0
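
As a small self-contained illustration, the sketch below builds a hypothetical six-row frame (`toy`, with figures taken from the Beijing and Shanghai rows shown earlier) and writes the same selection twice: once with `pd.IndexSlice` and once with plain `slice` objects, which is what `IndexSlice` produces behind the scenes.

```python
import pandas as pd

toy = pd.DataFrame({
    "CityCode": [110000] * 3 + [310000] * 3,
    "Year": [2000, 2001, 2002] * 2,
    "population": [1107.53, 1122.30, 1136.00, 1321.63, 1327.14, 1334.00],
    "gdp": [24787600.0, 28456500.0, 32127100.0,
            45511500.0, 49508400.0, 54087600.0],
}).set_index(["CityCode", "Year"]).sort_index()

ids = pd.IndexSlice

# All cities, years 2001 to 2002 (both ends included), gdp column only.
sub = toy.loc[ids[:, 2001:2002], ["gdp"]]

# The same selection with explicit slice objects.
sub2 = toy.loc[(slice(None), slice(2001, 2002)), ["gdp"]]

print(sub)
```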


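The second problem is that lagged terms are hard to generate in this layout. One way to do it is to unstack the data into wide format, convert the index to a time series, take the lag, and then stack again and merge the result back: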

In [82]:
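# wide format: rows are years, columns are (variable, CityCode) pairs
city_unstack=city.unstack(0)
# convert the integer years into a yearly PeriodIndex
city_unstack.index=pd.to_datetime(city_unstack.index,format="%Y").to_period('Y')
# shift down by one row, so each year holds the previous year's gdp
lag_gdp=city_unstack['gdp'].shift(1)
lag_gdp=lag_gdp.stack()
# back to long format, now indexed by (Year, CityCode)
city_new=city_unstack.stack()
city_new['lag_gdp']=lag_gdp
city_new.loc[ids[:,110000],:]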

Unnamed: 0_level_0,Unnamed: 1_level_0,City,population,gdp,lag_gdp
Year,CityCode,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000,110000,北京市,1107.53,24787600.0,
2001,110000,北京市,1122.3,28456500.0,24787600.0
2002,110000,北京市,1136.0,32127100.0,28456500.0
2003,110000,北京市,1148.8199,36631000.0,32127100.0
2004,110000,北京市,1162.89,42833100.0,36631000.0
2005,110000,北京市,1180.7,68863101.0,42833100.0
2006,110000,北京市,1197.6,78702835.0,68863101.0
2007,110000,北京市,1213.26,93533200.0,78702835.0
2008,110000,北京市,1299.85,104880500.0,93533200.0
2009,110000,北京市,1245.83,121530000.0,104880500.0
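
An alternative that avoids the round trip through the wide format is to shift within each city using `groupby`. This is not the method used above, only a sketch of another common approach; it assumes the frame is sorted by year within each city and has no missing years, because `shift(1)` moves values by one row rather than by one calendar year. The frame `toy` is a hypothetical subset of the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "CityCode": [110000] * 3 + [310000] * 3,
    "Year": [2000, 2001, 2002] * 2,
    "gdp": [24787600.0, 28456500.0, 32127100.0,
            45511500.0, 49508400.0, 54087600.0],
}).set_index(["CityCode", "Year"]).sort_index()

# Shift gdp down by one row inside each CityCode group,
# so the first year of every city has no lag.
toy["lag_gdp"] = toy.groupby(level="CityCode")["gdp"].shift(1)

print(toy)
```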


## Dummy variables and combining datasets

Pandas provides a dedicated function for generating dummy variables, pd.get_dummies(), for example:

In [83]:
city_new.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,City,population,gdp,lag_gdp
Year,CityCode,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000,110000,北京市,1107.53,24787600.0,
2000,120000,天津市,912.0,16393600.0,
2000,130100,石家庄市,889.79999,10031119.0,
2000,130200,唐山市,699.78998,9150473.0,
2000,130300,秦皇岛市,266.29001,2853937.0,


In [84]:
city_dummy=pd.get_dummies(city_new['City'], prefix="city")
city_dummy

Unnamed: 0_level_0,Unnamed: 1_level_0,city_七台河市,city_三亚市,city_三明市,city_三门峡市,city_上海市,city_上饶市,city_东莞市,city_东营市,city_中卫市,city_中山市,...,city_鸡西市,city_鹤壁市,city_鹤岗市,city_鹰潭市,city_黄冈市,city_黄山市,city_黄石市,city_黑河市,city_齐齐哈尔市,city_龙岩市
Year,CityCode,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2000,110000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,120000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,130100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,130200,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,130300,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2011,640300,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2011,640400,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2011,640500,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2011,650100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


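The prefix option can be omitted, but it is best to include it so that these columns are not confused with the dummies of other variables.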
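
Two further options of `pd.get_dummies()` are often useful. Recent pandas versions may return `True`/`False` columns rather than the 0/1 shown above, and `dtype=int` requests integers explicitly; `drop_first=True` omits the first category, which is commonly done before a regression with an intercept. The three-row series below is a made-up example.

```python
import pandas as pd

cities = pd.Series(["北京市", "天津市", "北京市"], name="City")

# One 0/1 column per distinct city.
dummies = pd.get_dummies(cities, prefix="city", dtype=int)

# Same, but without the column for the first category.
dummies_k1 = pd.get_dummies(cities, prefix="city", drop_first=True, dtype=int)

print(dummies)
print(dummies_k1)
```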

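There is one small catch: the dummies are returned as a separate data frame, so adding them back to the original data frame requires combining the two. There are many ways to do this; one simple way is: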

In [85]:
city_with_dummy=pd.concat([city_new,city_dummy],axis=1)
city_with_dummy.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,City,population,gdp,lag_gdp,city_七台河市,city_三亚市,city_三明市,city_三门峡市,city_上海市,city_上饶市,...,city_鸡西市,city_鹤壁市,city_鹤岗市,city_鹰潭市,city_黄冈市,city_黄山市,city_黄石市,city_黑河市,city_齐齐哈尔市,city_龙岩市
Year,CityCode,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2000,110000,北京市,1107.53,24787600.0,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,120000,天津市,912.0,16393600.0,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,130100,石家庄市,889.79999,10031119.0,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,130200,唐山市,699.78998,9150473.0,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2000,130300,秦皇岛市,266.29001,2853937.0,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


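Here axis=1 means that the data frames are combined column-wise.

## Combining with concat

The use of pd.concat() was shown above, with axis=1 indicating column-wise concatenation. The function works much like concatenate() in NumPy, except that it aligns the row and column indexes while combining, so as long as the data are indexed, the order of the rows does not matter. For example: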

In [86]:
players1=[{'name':'Messi','goals':35,'passes':55},
         {'name':'Pique','goals':3},
        {'name':'Suarez','goals':10,'passes':40},
        {'name':'Xavi','passes':60}]
data1=pd.DataFrame(players1)
data1=data1.set_index('name')
data1.head()

Unnamed: 0_level_0,goals,passes
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Messi,35.0,55.0
Pique,3.0,
Suarez,10.0,40.0
Xavi,,60.0


In [87]:
players2=[{'name':'Messi','attempts':70},
        {'name':'Suarez','attempts':30},
        {'name':'Pique','attempts':10}]
data2=pd.DataFrame(players2)
data2=data2.set_index('name')
data2.head()

Unnamed: 0_level_0,attempts
name,Unnamed: 1_level_1
Messi,70
Suarez,30
Pique,10


In [88]:
data=pd.concat([data1,data2],axis=1)
data

Unnamed: 0,goals,passes,attempts
Messi,35.0,55.0,70.0
Pique,3.0,,10.0
Suarez,10.0,40.0,30.0
Xavi,,60.0,


Besides column-wise concatenation, the default is of course to stack the data frames vertically, row by row. For example:

In [89]:
players1=[{'name':'Messi','goals':35,'passes':55},
         {'name':'Pique','goals':3},
        {'name':'Suarez','goals':10,'passes':40},
        {'name':'Xavi','passes':60}]
data1=pd.DataFrame(players1)
data1=data1.set_index('name')
data1.head()

Unnamed: 0_level_0,goals,passes
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Messi,35.0,55.0
Pique,3.0,
Suarez,10.0,40.0
Xavi,,60.0


In [90]:
players2=[{'name':'ter Stegen','goals':0,'save':5},
         {'name':'Iniesta','goals':1,'passes':80},
        {'name':'Xavi','passes':60}]
data2=pd.DataFrame(players2)
data2=data2.set_index('name')
data2.head()

Unnamed: 0_level_0,goals,save,passes
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ter Stegen,0.0,5.0,
Iniesta,1.0,,80.0
Xavi,,,60.0


In [91]:
data=pd.concat([data1,data2])
data

Unnamed: 0_level_0,goals,passes,save
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Messi,35.0,55.0,
Pique,3.0,,
Suarez,10.0,40.0,
Xavi,,60.0,
ter Stegen,0.0,,5.0
Iniesta,1.0,80.0,
Xavi,,60.0,


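The code above performed the vertical concatenation, but note that Xavi now appears twice. Duplicate labels are normally unwanted in an index, yet Pandas does not check for them automatically.

To deal with this, we can first use the verify_integrity option to force a check for duplicates: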

In [92]:
try:
    data=pd.concat([data1,data2], verify_integrity=True)
except Exception as e:
    print(e)

Indexes have overlapping values: Index(['Xavi'], dtype='object', name='name')


The error message reports the duplicated index labels, which then have to be resolved by hand.
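
One possible manual fix, sketched below under the assumption that the first occurrence of each player is the one to keep, is to locate the repeated labels with `Index.duplicated()` and filter them out. Whether keeping the first row is appropriate depends on the data.

```python
import pandas as pd

data1 = pd.DataFrame([
    {"name": "Messi", "goals": 35, "passes": 55},
    {"name": "Pique", "goals": 3},
    {"name": "Suarez", "goals": 10, "passes": 40},
    {"name": "Xavi", "passes": 60},
]).set_index("name")

data2 = pd.DataFrame([
    {"name": "ter Stegen", "goals": 0, "save": 5},
    {"name": "Iniesta", "goals": 1, "passes": 80},
    {"name": "Xavi", "passes": 60},
]).set_index("name")

combined = pd.concat([data1, data2])

# True for every label that has already appeared earlier in the index.
dup_mask = combined.index.duplicated(keep="first")
print(combined.index[dup_mask])

# Keep only the first occurrence of each label.
deduped = combined[~dup_mask]
print(deduped)
```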

Alternatively, we can add the keys option so that the result gets a hierarchical index:

In [93]:
data=pd.concat([data1,data2], keys=['data1','data2'])
data

Unnamed: 0_level_0,Unnamed: 1_level_0,goals,passes,save
Unnamed: 0_level_1,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
data1,Messi,35.0,55.0,
data1,Pique,3.0,,
data1,Suarez,10.0,40.0,
data1,Xavi,,60.0,
data2,ter Stegen,0.0,,5.0
data2,Iniesta,1.0,80.0,
data2,Xavi,,60.0,


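As shown above, the data now carry a two-level index, so the duplication problem disappears.

Which approach to choose, of course, depends on the application.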
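
Once the source is stored as an index level, each original data frame can be recovered by label. The sketch below also passes `names=` to label the levels; the level name `source` is an arbitrary choice made for this illustration.

```python
import pandas as pd

data1 = pd.DataFrame([
    {"name": "Messi", "goals": 35, "passes": 55},
    {"name": "Xavi", "passes": 60},
]).set_index("name")

data2 = pd.DataFrame([
    {"name": "Iniesta", "goals": 1, "passes": 80},
    {"name": "Xavi", "passes": 60},
]).set_index("name")

data = pd.concat([data1, data2], keys=["data1", "data2"],
                 names=["source", "name"])

second = data.loc["data2"]            # rows that came from data2
xavi = data.xs("Xavi", level="name")  # both Xavi rows, indexed by source

print(second)
print(xavi)
```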

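## Combining with merge

Like Stata, Pandas has a merge function.

As in Stata, pd.merge() supports one-to-one (1:1) and many-to-one (m:1) merges. The simplest case is a one-to-one merge; unless told otherwise, merge matches on the column names that the two data frames have in common, for example: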

In [94]:
gdp=tus.get_gdp_year()
gdp.head()

Unnamed: 0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy
0,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0
1,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8
2,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2
3,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8
4,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4


In [95]:
monetary=tus.get_money_supply_bal()
monetary['year']=monetary['year'].astype('int_')  # data quirk: year is not stored as a number here
monetary.head()

Unnamed: 0,year,m2,m1,m0,cd,qm,ftd,sd,rests
0,2018,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
1,2017,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
2,2016,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
3,2015,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5
4,2014,1228374.8,348056.4,60259.5,287796.9,880318.4,264055.7,508878.1,107384.6


In [96]:
merged=pd.merge(gdp,monetary)
merged.head()

Unnamed: 0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,m2,m1,m0,cd,qm,ftd,sd,rests
0,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
1,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
2,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
3,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5
4,2014,641280.6,47005.0,,55626.3,277571.8,233856.4,44880.5,308082.5,28500.9,73582.0,1228374.8,348056.4,60259.5,287796.9,880318.4,264055.7,508878.1,107384.6


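Both datasets contain a year column, so rows with the same year were matched. Note, however, that gdp has data for 2019 while monetary does not, and 2019 has disappeared after the merge. This is because merge keeps the **intersection** of the two sets of years by default; to keep the union, use the how='outer' option: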

In [97]:
merged=pd.merge(gdp,monetary, how='outer')
merged.head()

Unnamed: 0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,m2,m1,m0,cd,qm,ftd,sd,rests
0,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0,,,,,,,,
1,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
2,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
3,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
4,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5


The how option also accepts left (keep the observations of the first data frame) and right (keep the observations of the second data frame).
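
The sketch below contrasts `how='left'` with `how='outer'` on two three-row frames (a hypothetical subset of the gdp and monetary data, with values copied from the tables above). It also uses the `indicator=True` option of `pd.merge()`, which adds a `_merge` column recording whether each row was found in the left frame, the right frame, or both, similar in spirit to Stata's `_merge` variable.

```python
import pandas as pd

gdp = pd.DataFrame({"year": [2019, 2018, 2017],
                    "gdp": [990865.0, 919281.0, 820754.3]})
monetary = pd.DataFrame({"year": [2018, 2017, 2016],
                         "m2": [1826744.2, 1690235.3, 1550066.7]})

# Keep every year of gdp; m2 is missing where monetary has no match.
left = pd.merge(gdp, monetary, how="left")

# Keep every year from either side and record where each row came from.
checked = pd.merge(gdp, monetary, how="outer", indicator=True)

print(left)
print(checked)
```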

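In addition, if the two data frames share column names that are not used for matching, the suffixes parameter appends a suffix to each of the clashing names.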
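
A minimal sketch of the suffixes parameter, using two made-up frames that both contain a `value` column; the suffix strings `_gdp` and `_m2` are arbitrary choices for the illustration:

```python
import pandas as pd

left = pd.DataFrame({"year": [2017, 2018], "value": [1.0, 2.0]})
right = pd.DataFrame({"year": [2017, 2018], "value": [10.0, 20.0]})

# Without suffixes, pandas falls back to "_x" and "_y".
default = pd.merge(left, right, on="year")

# With suffixes, the clashing columns get descriptive names.
named = pd.merge(left, right, on="year", suffixes=("_gdp", "_m2"))

print(default.columns.tolist())
print(named.columns.tolist())
```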

Sometimes two datasets share several column names, and we may need to specify the matching column by hand with the on option, for example:

In [98]:
merged=pd.merge(gdp,monetary, on='year' ,how='outer')
merged.head()

Unnamed: 0,year,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,m2,m1,m0,cd,qm,ftd,sd,rests
0,2019,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0,,,,,,,,
1,2018,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
2,2017,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
3,2016,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
4,2015,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5


To match on several columns, pass a list of column names, for example on=['city','year'].
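
A short sketch of matching on two columns at once. The frames `pop` and `gdp_city` are hypothetical, with values taken from the city data shown earlier; a row is matched only when both city and year agree.

```python
import pandas as pd

pop = pd.DataFrame({
    "city": ["北京市", "北京市", "上海市"],
    "year": [2000, 2001, 2000],
    "population": [1107.53, 1122.30, 1321.63],
})
gdp_city = pd.DataFrame({
    "city": ["北京市", "上海市", "上海市"],
    "year": [2000, 2000, 2001],
    "gdp": [24787600.0, 45511500.0, 49508400.0],
})

# Default inner merge: only (city, year) pairs present in both frames survive.
both = pd.merge(pop, gdp_city, on=["city", "year"])

print(both)
```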

Sometimes the matching variable has different names in the two data frames; in that case use left_on and right_on, for example:

In [99]:
gdp['年份']=gdp['year']
del gdp['year']
merged=pd.merge(gdp,monetary, left_on='年份', right_on='year' ,how='outer')
merged.head()

Unnamed: 0,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,年份,year,m2,m1,m0,cd,qm,ftd,sd,rests
0,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0,2019,,,,,,,,,
1,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,2018,2018.0,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
2,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,2017,2017.0,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
3,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,2016,2016.0,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
4,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,2015,2015.0,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5


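To merge on an index, set left_index and right_index: use left_index=True if the first data frame should be matched on its index, and right_index=True if the second one should.

left_on, right_on, left_index and right_index can be combined freely, which is very convenient, for example: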

In [100]:
monetary1=monetary.set_index('year')
monetary1.head()

Unnamed: 0_level_0,m2,m1,m0,cd,qm,ftd,sd,rests
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2018,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
2017,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
2016,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
2015,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5
2014,1228374.8,348056.4,60259.5,287796.9,880318.4,264055.7,508878.1,107384.6


In [101]:
merged=pd.merge(gdp,monetary1, left_on='年份', right_index=True ,how='outer')
merged.head()

Unnamed: 0,gdp,pc_gdp,gnp,pi,si,industry,cons_industry,ti,trans_industry,lbdy,年份,m2,m1,m0,cd,qm,ftd,sd,rests
0,990865.0,70892.0,,70467.0,386165.0,317109.0,70904.0,534233.0,42802.0,113886.0,2019,,,,,,,,
1,919281.0,64644.0,,64745.0,364835.0,305160.2,61808.0,489701.0,40550.2,100223.8,2018,1826744.2,551685.9,73208.4,478477.5,1275058.3,340178.9,721688.6,213190.8
2,820754.3,59201.0,,62099.5,332742.7,278328.2,55313.8,425912.1,37172.6,92348.2,2017,1690235.3,543790.1,70645.6,473144.5,1146445.2,320196.2,649341.5,176907.4
3,740060.8,53680.0,,60139.2,296547.7,247877.7,49702.9,383373.9,33058.8,84648.8,2016,1550066.7,486557.2,68303.9,418253.4,1063509.4,307989.6,603504.2,152015.6
4,685992.9,50028.0,,57774.6,282040.3,236506.3,46626.7,346178.0,30487.8,78340.4,2015,1392278.1,400953.4,63216.6,337736.9,991324.7,288240.7,552073.5,151010.5


A many-to-one merge is not fundamentally different from a one-to-one merge, so it is not discussed further here.
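
For completeness, here is a small many-to-one sketch. The frames `city_year` and `names` are made up for illustration. The `validate='m:1'` option of `pd.merge()` asks pandas to check that the keys are unique in the right-hand frame, and raises a `MergeError` otherwise, which plays a role similar to declaring `m:1` in Stata.

```python
import pandas as pd

# Many rows per city on the left, exactly one row per city on the right.
city_year = pd.DataFrame({
    "CityCode": [110000, 110000, 310000],
    "Year": [2000, 2001, 2000],
    "gdp": [24787600.0, 28456500.0, 45511500.0],
})
names = pd.DataFrame({"CityCode": [110000, 310000],
                      "City": ["北京市", "上海市"]})

merged = pd.merge(city_year, names, on="CityCode", how="left", validate="m:1")
print(merged)

# If the right-hand keys are not unique, validation fails.
bad_names = pd.concat([names, names.iloc[[0]]], ignore_index=True)
try:
    pd.merge(city_year, bad_names, on="CityCode", validate="m:1")
    validation_failed = False
except pd.errors.MergeError as err:
    validation_failed = True
    print("validation failed:", err)
```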