# Chapter 10: Time Series

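Time series are an important form of structured data. They usually come in one of the following forms:  
- timestamp: a specific instant in time
- period: a fixed span such as January 2001
- interval: a span defined by a start and an end timestamp  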

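pandas provides a standard set of time series tools and data algorithms, so you can easily slice, dice, and aggregate time series, and resample both regular and irregular series.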
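
The three forms listed above can be sketched with pandas objects. This is only an illustration: the specific dates and the variable names (`instant`, `month`, `window`) are assumptions chosen for the example, and `pd.Interval` is just one possible way to represent a start/end pair.

```python
import pandas as pd

# A timestamp: one specific instant.
instant = pd.Timestamp("2001-01-15 09:30")

# A period: a whole span identified by a frequency (here, the month of January 2001).
month = pd.Period("2001-01", freq="M")

# An interval: defined by explicit start and end timestamps.
window = pd.Interval(pd.Timestamp("2001-01-01"),
                     pd.Timestamp("2001-02-01"),
                     closed="left")

# A period knows where it starts and ends; an interval supports membership tests.
period_start = month.start_time
period_end = month.end_time
instant_in_window = instant in window
```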

# 1. Date and Time Data Types and Tools

The Python standard library includes data types for date and time data. The modules used most often here are datetime, time, and calendar.

**Getting the current time**

In [1]:
from datetime import datetime
now = datetime.now()
now

datetime.datetime(2018, 4, 11, 12, 36, 12, 703191)

In [2]:
now.year, now.month, now.day

(2018, 4, 11)

datetime stores both the date and the time down to the microsecond. datetime.timedelta represents the difference between two datetime objects.

In [3]:
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

datetime.timedelta(926, 56700)

In [4]:
delta.days

926

In [5]:
delta.seconds

56700
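
Note that `seconds` is only the part of the difference left over after whole days are removed, not the full duration. A minimal sketch of the distinction, reusing the same two dates as above and the standard-library `total_seconds()` method:

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# .days and .seconds are separate components of the same difference:
# .seconds always lies in the range 0..86399.
leftover_seconds = delta.seconds

# total_seconds() converts the whole difference into seconds (as a float).
total = delta.total_seconds()

# The two views agree: 86400 seconds per day plus the leftover part.
rebuilt = delta.days * 86400 + delta.seconds
```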

You can add a timedelta, or a multiple of one, to a datetime object (or subtract it), which yields a new, shifted object.

In [6]:
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

In [7]:
start - 2 * timedelta(12)

datetime.datetime(2010, 12, 14, 0, 0)

In [24]:
str(timedelta(12))

'12 days, 0:00:00'

# 1.1 Converting Between Strings and datetime

**datetime to string**  
Use str to format a datetime object as a string.

In [8]:
stamp = datetime(2011, 1, 3)

In [9]:
str(stamp)

'2011-01-03 00:00:00'

In [10]:
stamp

datetime.datetime(2011, 1, 3, 0, 0)

strftime formats a datetime according to a format specification.

In [15]:
stamp.strftime('%Y-%m-%d')
# format with a specific pattern

'2011-01-03'

The available datetime format codes are listed below.

In [17]:
from IPython.display import Image

In [19]:
Image(url = '1.png', width = 500, height = 400)

In [21]:
Image(url = '2.png', width = 600, height = 500)
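
The contents of the two images are not reproduced here. As a rough sketch, the following shows a handful of widely used standard-library format codes; the sample datetime and the `formatted` dictionary are assumptions made for illustration, not a copy of the tables.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3, 14, 5, 9)

# Map each format code to the text it produces for `stamp`.
formatted = {
    "%Y": stamp.strftime("%Y"),  # four-digit year
    "%y": stamp.strftime("%y"),  # two-digit year
    "%m": stamp.strftime("%m"),  # zero-padded month
    "%d": stamp.strftime("%d"),  # zero-padded day of month
    "%H": stamp.strftime("%H"),  # hour, 24-hour clock
    "%M": stamp.strftime("%M"),  # minute
    "%S": stamp.strftime("%S"),  # second
    "%j": stamp.strftime("%j"),  # day of the year
}

# Codes can be combined freely with literal characters.
full = stamp.strftime("%Y-%m-%d %H:%M:%S")
```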

**String to datetime**

In [12]:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')

datetime.datetime(2011, 1, 3, 0, 0)

In [13]:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
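
strptime raises ValueError when the string does not match the format. A minimal sketch of guarding against that; `try_parse` is a hypothetical helper introduced only for this example:

```python
from datetime import datetime

def try_parse(text, fmt):
    """Hypothetical helper: return a datetime, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

matched = try_parse('7/6/2011', '%m/%d/%Y')       # matches the format
mismatched = try_parse('2011-07-06', '%m/%d/%Y')  # wrong layout, so None
```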

# 1.2 Another Way to Parse Dates

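datetime.strptime is the best way to parse a date whose format is known, but writing a format specification every time is tedious, especially for common date formats. In those cases you can use the **parser.parse** method from the third-party dateutil package: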

In [30]:
from dateutil.parser import parse
import pandas as pd
parse('2011-01-03')

datetime.datetime(2011, 1, 3, 0, 0)

In [26]:
parse('Jan 31, 1997 10:45 PM')

datetime.datetime(1997, 1, 31, 22, 45)

In [27]:
parse('6/12/2011', dayfirst=True)

datetime.datetime(2011, 12, 6, 0, 0)
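
The `dayfirst` flag matters because a string such as '6/12/2011' is ambiguous. A short sketch comparing the default interpretation with `dayfirst=True` (variable names are illustrative):

```python
from datetime import datetime
from dateutil.parser import parse

# By default dateutil reads an ambiguous string as month/day/year.
month_first = parse('6/12/2011')

# With dayfirst=True the same string is read as day/month/year.
day_first = parse('6/12/2011', dayfirst=True)
```

When the source of the data is known to use one convention, passing the flag explicitly avoids silently swapping day and month.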

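pandas is generally oriented toward working with arrays of dates, whether they serve as an axis index or as a column in a DataFrame.  
The **to_datetime** method parses many different kinds of date representations.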

In [28]:
datestrs

['7/6/2011', '8/6/2011']

In [32]:
pd.to_datetime(datestrs)
# the '00:00:00' time part is no longer displayed

DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None)

It also handles values that should be treated as missing.

In [33]:
idx = pd.to_datetime(datestrs + [None])
idx

DatetimeIndex(['2011-07-06', '2011-08-06', 'NaT'], dtype='datetime64[ns]', freq=None)

In [35]:
idx[2]
# a missing value: NaT (Not a Time)

NaT

In [36]:
pd.isnull(idx)

array([False, False,  True], dtype=bool)

In [40]:
idx.fillna('Missing')

Index([2011-07-06 00:00:00, 2011-08-06 00:00:00, 'Missing'], dtype='object')
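
When the layout of the strings is known, to_datetime also accepts an explicit `format`, and `errors='coerce'` turns unparseable entries into NaT instead of raising. A minimal sketch; the `raw` list with its deliberately bad entry is an assumption for illustration:

```python
import pandas as pd

raw = ['7/6/2011', '8/6/2011', 'not a date']

# Parse with an explicit month/day/year format; bad entries become NaT.
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Boolean mask marking which entries failed to parse.
missing = parsed.isna()
```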

# 1.3 Time Series Basics

The most basic kind of time series in pandas is a Series indexed by timestamps.

In [42]:
from datetime import datetime
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02    0.723262
2011-01-05   -0.040427
2011-01-07   -1.278755
2011-01-08   -0.403372
2011-01-10    0.096309
2011-01-12   -1.270615
dtype: float64

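ts is now a time series: as the type check below shows, it is an ordinary Series, and the datetime objects have been placed in a DatetimeIndex.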

In [43]:
type(ts)

pandas.core.series.Series

In [44]:
ts + ts[::2]

2011-01-02    1.446524
2011-01-05         NaN
2011-01-07   -2.557511
2011-01-08         NaN
2011-01-10    0.192618
2011-01-12         NaN
dtype: float64

In [48]:
ts[::2]

2011-01-02    0.723262
2011-01-07   -1.278755
2011-01-10    0.096309
dtype: float64
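
As with any Series, arithmetic between differently indexed time series aligns on the dates, which is why the dates missing from `ts[::2]` come out as NaN above. A small sketch with fixed values instead of random ones (an assumption, so the result is predictable), plus `Series.add` with `fill_value` as one possible way to avoid the NaN:

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# Every second observation, selected by position.
every_other = ts.iloc[::2]

# Addition aligns on the index; dates absent from one side give NaN.
aligned = ts + every_other

# Treat a value missing on one side as 0 instead of propagating NaN.
filled = ts.add(every_other, fill_value=0)
```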

# Indexing, Selecting, and Subsetting Time Series

A time series is an ordinary Series with a DatetimeIndex, so indexing and selecting data work the same way as for any other Series.

In [49]:
stamp = ts.index[2]
ts[stamp]

-1.278755317944988

You can also pass a string that can be interpreted as a date.

In [50]:
ts['1/10/2011']

0.096309199644055224

For longer time series, passing just a year, or a year and month, is enough to select a slice of the data.

In [52]:
longer_ts = pd.Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01   -0.230527
2000-01-02    0.550285
2000-01-03    0.134177
2000-01-04   -1.702856
2000-01-05   -1.759721
2000-01-06    1.159631
2000-01-07   -0.376138
2000-01-08    0.247216
2000-01-09    1.346568
2000-01-10   -0.486979
2000-01-11    0.281456
2000-01-12    0.633256
2000-01-13   -0.112693
2000-01-14    0.131824
2000-01-15    0.245397
2000-01-16   -0.116126
2000-01-17    1.667103
2000-01-18    1.188882
2000-01-19    2.538886
2000-01-20   -0.479999
2000-01-21    0.442562
2000-01-22   -0.285166
2000-01-23   -0.508150
2000-01-24    2.093633
2000-01-25    0.205784
2000-01-26   -0.623640
2000-01-27   -0.172223
2000-01-28   -1.333404
2000-01-29    1.260948
2000-01-30   -2.070936
                ...   
2002-08-28    1.082957
2002-08-29    0.998291
2002-08-30    0.810699
2002-08-31    0.214862
2002-09-01   -2.136139
2002-09-02   -0.439619
2002-09-03   -0.275766
2002-09-04   -0.127271
2002-09-05   -2.916603
2002-09-06   -0.606639
2002-09-07   -0.394880
2002-09-08   -0.954748
2002-09-09 

In [53]:
longer_ts['2001']

2001-01-01   -0.234237
2001-01-02   -0.821772
2001-01-03    0.251823
2001-01-04   -1.151684
2001-01-05    0.561426
2001-01-06   -0.011441
2001-01-07   -0.096660
2001-01-08    1.748163
2001-01-09   -2.462276
2001-01-10   -0.181436
2001-01-11   -0.551715
2001-01-12    0.112277
2001-01-13    0.364381
2001-01-14   -1.701154
2001-01-15    0.855690
2001-01-16   -0.129744
2001-01-17    0.892121
2001-01-18   -0.433653
2001-01-19   -0.061193
2001-01-20   -0.123583
2001-01-21   -0.875426
2001-01-22   -0.537524
2001-01-23    1.755290
2001-01-24    0.587609
2001-01-25   -2.051877
2001-01-26    0.414889
2001-01-27   -1.042092
2001-01-28    0.967143
2001-01-29    0.470026
2001-01-30    0.215321
                ...   
2001-12-02    1.910626
2001-12-03   -0.186834
2001-12-04   -0.424152
2001-12-05    1.386123
2001-12-06   -0.697632
2001-12-07   -1.561521
2001-12-08   -0.929383
2001-12-09    0.923648
2001-12-10    1.201735
2001-12-11   -0.548249
2001-12-12   -0.483855
2001-12-13   -0.302389
2001-12-14 

In [54]:
longer_ts['2001-05']

2001-05-01   -0.345164
2001-05-02    0.420989
2001-05-03   -0.549764
2001-05-04    0.020090
2001-05-05    0.340157
2001-05-06    0.013183
2001-05-07   -0.786318
2001-05-08   -1.500468
2001-05-09   -1.188088
2001-05-10    0.054422
2001-05-11    0.494057
2001-05-12   -0.378727
2001-05-13   -1.537717
2001-05-14    0.339951
2001-05-15   -0.025708
2001-05-16    1.196398
2001-05-17    0.135183
2001-05-18   -0.478639
2001-05-19    0.193404
2001-05-20   -1.402856
2001-05-21   -2.122873
2001-05-22   -0.732601
2001-05-23    0.060487
2001-05-24    0.058874
2001-05-25    0.026984
2001-05-26   -0.176191
2001-05-27    0.427269
2001-05-28    1.623517
2001-05-29    0.857508
2001-05-30   -1.958664
2001-05-31    0.827976
Freq: D, dtype: float64
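
The same partial-string selection can be written with `.loc`, which makes it explicit that the lookup is label-based. A sketch using sequential integers rather than random values (an assumption, so the counts are easy to check):

```python
import numpy as np
import pandas as pd

longer_ts = pd.Series(np.arange(1000),
                      index=pd.date_range('1/1/2000', periods=1000))

# A year string selects every row in that year.
year_2001 = longer_ts.loc['2001']

# A year-month string selects every row in that month.
may_2001 = longer_ts.loc['2001-05']

n_year = len(year_2001)
n_month = len(may_2001)
```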

Slicing with dates works just as it does for a regular Series, even when the observations are irregularly spaced, as in ts.

In [55]:
ts[datetime(2011, 1, 7):]

2011-01-07   -1.278755
2011-01-08   -0.403372
2011-01-10    0.096309
2011-01-12   -1.270615
dtype: float64
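
Because the index is sorted chronologically, the slice bounds do not have to be timestamps that actually occur in the series. A minimal sketch with fixed values (an assumption for predictability); `truncate` is shown as one alternative way to cut a series at a date:

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# Neither bound is in the index; the slice keeps everything in between.
window = ts.loc['2011-01-06':'2011-01-11']

# truncate drops everything after the given date.
head = ts.truncate(after='2011-01-09')
```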

# Time Series with Duplicate Indices

Sometimes several observations fall on the same timestamp, for example:

In [58]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

The is_unique property of the index tells you whether its values are unique.

In [59]:
dup_ts.index.is_unique

False

Indexing into this time series produces either a scalar or a slice, depending on whether the selected timestamp is duplicated.

In [60]:
dup_ts['1/3/2000']
# not duplicated

4

In [61]:
dup_ts['1/2/2000']
# duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

To aggregate data that has non-unique timestamps, use groupby and pass level=0.

In [62]:
grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

In [64]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
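
Aggregating is not the only option. If only one observation per timestamp should be kept, `Index.duplicated` gives a mask of the repeats. A sketch under that assumption, rebuilding the same series as above; keeping the first observation is an arbitrary choice made for the example:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                        '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)

# True for every repeat of a timestamp that has already been seen.
repeat_mask = dup_ts.index.duplicated(keep='first')

# Keep only the first observation for each timestamp.
first_only = dup_ts[~repeat_mask]

# For comparison: one aggregated value per timestamp.
means = dup_ts.groupby(level=0).mean()
```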