# 11.2 Time Series Basics

In pandas, a basic time series object is a Series indexed by timestamps. Outside pandas, those timestamps are usually represented as Python strings or datetime objects:

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8), 
         datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [3]:
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02    0.025435
2011-01-05   -1.114448
2011-01-07   -1.088135
2011-01-08    0.767476
2011-01-10    1.324203
2011-01-12   -0.131522
dtype: float64

Under the hood, these datetime objects have been put into a DatetimeIndex:

In [4]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
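As a small sketch of the same idea, the index can be built either from datetime objects or from date strings converted with `pd.to_datetime`. The sample values below are hypothetical; only the dates mirror the ones used above.

```python
from datetime import datetime

import pandas as pd

# Hypothetical values; the dates match the first three used above.
from_datetimes = pd.Series(
    [1.0, 2.0, 3.0],
    index=[datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7)],
)
from_strings = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07']),
)

# Both routes end up with a DatetimeIndex holding the same timestamps.
index_type = type(from_datetimes.index).__name__
same_dates = bool((from_datetimes.index == from_strings.index).all())
print(index_type, same_dates)
```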

As with any other Series, arithmetic between differently indexed time series automatically aligns on the dates:

In [5]:
ts[::2]

2011-01-02    0.025435
2011-01-07   -1.088135
2011-01-10    1.324203
dtype: float64

In [6]:
ts + ts[::2]

2011-01-02    0.050871
2011-01-05         NaN
2011-01-07   -2.176269
2011-01-08         NaN
2011-01-10    2.648406
2011-01-12         NaN
dtype: float64

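ts[::2] selects every second element of ts, starting with the first. Dates that appear in only one operand produce NaN in the result.

pandas stores timestamps using NumPy's datetime64 data type, here at nanosecond resolution: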
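The alignment can be checked with fixed numbers instead of random ones. This is a minimal sketch with hypothetical values; `Series.add` with `fill_value=0` is shown as one possible way to avoid the NaN entries when a missing date should count as zero.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

every_other = s[::2]      # keeps positions 0 and 2
total = s + every_other   # aligned on dates; unmatched dates become NaN

# Alternative: treat a date missing from one operand as 0 instead of NaN.
filled = s.add(every_other, fill_value=0)

print(total)
print(filled)
```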

In [7]:
ts.index.dtype

dtype('<M8[ns]')

Scalar values taken from a DatetimeIndex are pandas Timestamp objects:

In [8]:
stamp = ts.index[0]
stamp

Timestamp('2011-01-02 00:00:00')

A Timestamp can be used anywhere a datetime object is expected.

# 1 Indexing, Selection, Subsetting

When indexing and selecting data by label, a time series behaves like any other pandas.Series:
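The interchangeability follows from `Timestamp` being a subclass of `datetime`. The short check below is only an illustration of that relationship.

```python
from datetime import datetime, timedelta

import pandas as pd

stamp = pd.Timestamp('2011-01-02')

# Timestamp subclasses datetime, so isinstance checks and comparisons work.
is_datetime = isinstance(stamp, datetime)
equal_to_datetime = stamp == datetime(2011, 1, 2)

# Standard-library timedelta arithmetic works as well.
later = stamp + timedelta(days=3)
print(is_datetime, equal_to_datetime, later)
```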

In [9]:
ts

2011-01-02    0.025435
2011-01-05   -1.114448
2011-01-07   -1.088135
2011-01-08    0.767476
2011-01-10    1.324203
2011-01-12   -0.131522
dtype: float64

In [10]:
stamp = ts.index[2]

In [11]:
ts[stamp]

-1.0881347493871827

As a convenience, you can also pass a string that can be interpreted as a date:

In [12]:
ts['1/10/2011']

1.324202900659826

In [13]:
ts['20110110']

1.324202900659826
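Several string spellings of the same date resolve to the same element. The sketch below uses hypothetical integer values so the result is easy to verify; note that `'1/10/2011'` is read month-first by default, that is, as January 10.

```python
import pandas as pd

s = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(['2011-01-08', '2011-01-10', '2011-01-12']),
)

a = s['1/10/2011']       # month/day/year
b = s['20110110']        # compact form
c = s['2011-01-10']      # ISO 8601 form
d = s.loc['2011-01-10']  # explicit label-based lookup
print(a, b, c, d)
```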

For longer time series, you can pass just a year, or a year and month, to select a slice of the data:

In [14]:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01   -0.379845
2000-01-02   -1.298660
2000-01-03   -0.893993
2000-01-04   -0.964604
2000-01-05   -0.954546
2000-01-06   -0.598283
2000-01-07    1.145359
2000-01-08    0.942578
2000-01-09    0.958636
2000-01-10    1.947841
2000-01-11   -0.227763
2000-01-12    0.383774
2000-01-13   -1.508759
2000-01-14   -0.418396
2000-01-15    2.072546
2000-01-16   -0.946572
2000-01-17    0.313601
2000-01-18   -0.258638
2000-01-19   -1.231985
2000-01-20   -1.495832
2000-01-21   -1.498773
2000-01-22    0.602056
2000-01-23    1.889076
2000-01-24    0.832943
2000-01-25   -0.429162
2000-01-26    1.374411
2000-01-27    0.122448
2000-01-28    0.415741
2000-01-29   -1.614026
2000-01-30    0.005310
                ...   
2002-08-28    0.137185
2002-08-29    0.357340
2002-08-30   -0.385203
2002-08-31   -0.538060
2002-09-01   -0.528917
2002-09-02   -1.434767
2002-09-03   -1.158592
2002-09-04   -0.386925
2002-09-05    0.176856
2002-09-06    0.017353
2002-09-07    0.157153
2002-09-08   -0.041555
2002-09-09 

In [15]:
longer_ts['2001']

2001-01-01   -0.998390
2001-01-02   -1.687655
2001-01-03    1.064079
2001-01-04   -1.324456
2001-01-05   -0.158792
2001-01-06   -0.910289
2001-01-07    1.122474
2001-01-08    0.591065
2001-01-09   -0.171404
2001-01-10    1.022657
2001-01-11    0.456984
2001-01-12   -1.312784
2001-01-13    0.737419
2001-01-14    1.309602
2001-01-15    2.117640
2001-01-16    0.216930
2001-01-17    1.187940
2001-01-18    1.206902
2001-01-19   -0.053970
2001-01-20   -0.172110
2001-01-21    1.548288
2001-01-22    2.101209
2001-01-23    0.152660
2001-01-24    1.323908
2001-01-25    0.394168
2001-01-26   -0.145509
2001-01-27   -1.141089
2001-01-28    0.517972
2001-01-29   -0.077372
2001-01-30   -1.032074
                ...   
2001-12-02   -0.980038
2001-12-03   -0.557246
2001-12-04   -0.673834
2001-12-05   -2.216628
2001-12-06   -0.525384
2001-12-07    0.658106
2001-12-08    1.304198
2001-12-09   -1.103472
2001-12-10    0.812964
2001-12-11    0.382879
2001-12-12    1.519063
2001-12-13   -0.521908
2001-12-14 

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify a month:

In [16]:
longer_ts['2001-05']

2001-05-01    0.218134
2001-05-02   -0.310297
2001-05-03    0.331231
2001-05-04   -0.087658
2001-05-05    0.436195
2001-05-06   -0.390576
2001-05-07    0.017918
2001-05-08    0.891836
2001-05-09   -0.500486
2001-05-10    0.096207
2001-05-11    0.479083
2001-05-12   -0.497008
2001-05-13   -0.524715
2001-05-14   -0.701695
2001-05-15   -0.908180
2001-05-16   -0.348267
2001-05-17    0.517769
2001-05-18    1.863125
2001-05-19    0.617462
2001-05-20    0.422309
2001-05-21   -1.443255
2001-05-22   -0.236599
2001-05-23    0.346252
2001-05-24    0.618586
2001-05-25   -0.524429
2001-05-26    0.637954
2001-05-27    0.970810
2001-05-28    0.222774
2001-05-29    1.400135
2001-05-30    2.104010
2001-05-31    0.910560
Freq: D, dtype: float64
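To make the effect of partial-string selection easy to verify, the sketch below uses a counter instead of random values and goes through `.loc`, the explicit label-based accessor. The name `longer` is hypothetical.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting on 2000-01-01, as in the example above.
longer = pd.Series(np.arange(1000),
                   index=pd.date_range('2000-01-01', periods=1000))

year_2001 = longer.loc['2001']     # every day in 2001
may_2001 = longer.loc['2001-05']   # every day in May 2001

print(len(year_2001), len(may_2001))
print(year_2001.index[0], year_2001.index[-1])
```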

Indexing with a datetime object works as well:

In [17]:
ts[datetime(2011, 1, 7)]

-1.0881347493871827

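Because most time series data is ordered chronologically, you can slice with timestamps to select a range, even when the endpoints do not appear in the index: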

In [18]:
ts

2011-01-02    0.025435
2011-01-05   -1.114448
2011-01-07   -1.088135
2011-01-08    0.767476
2011-01-10    1.324203
2011-01-12   -0.131522
dtype: float64

In [19]:
ts['1/6/2011':'1/11/2011']

2011-01-07   -1.088135
2011-01-08    0.767476
2011-01-10    1.324203
dtype: float64
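Two details of date-range slicing are worth checking: the endpoints need not exist in the index, and both endpoints are included when they do exist. A minimal sketch with hypothetical integer values:

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(range(6), index=dates)

# Endpoints that are not in the index still define a valid range.
absent_endpoints = s.loc['2011-01-06':'2011-01-11']

# Endpoints that are in the index are both included.
inclusive = s.loc['2011-01-07':'2011-01-10']

# An open-ended slice runs to the end of the series.
open_ended = s.loc['2011-01-08':]

print(absent_endpoints.tolist(), inclusive.tolist(), open_ended.tolist())
```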

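Remember that slicing in this manner produces a view on the source time series, so no data is copied and modifications to the slice are reflected in the original data. This depends on the pandas version and settings: with Copy-on-Write enabled, modifying the slice leaves the original unchanged.

There is an equivalent instance method, truncate, that slices a Series between two dates: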

In [20]:
ts.truncate(after='1/9/2011')

2011-01-02    0.025435
2011-01-05   -1.114448
2011-01-07   -1.088135
2011-01-08    0.767476
dtype: float64
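`truncate` also accepts a `before` argument, so both ends of the range can be given at once. The sketch below, using hypothetical integer values, compares it with the equivalent label slice; `truncate` expects the index to be sorted.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(range(6), index=dates)

# Keep only the rows between the two dates (inclusive).
middle = s.truncate(before='2011-01-06', after='2011-01-11')

# The same selection written as a label slice.
sliced = s.loc['2011-01-06':'2011-01-11']

print(middle.equals(sliced))
```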

All of this holds true for a DataFrame as well, where the indexing applies to its rows:

In [21]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

In [22]:
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])

In [23]:
long_df.loc['5-2001']

            Colorado     Texas  New York      Ohio
2001-05-02 -1.456652 -1.324286 -0.279348  1.045905
2001-05-09 -0.850585  0.513576 -0.082321  1.410633
2001-05-16 -2.381799  0.711073  1.436283  0.679508
2001-05-23 -2.487292  0.709146  0.393318  0.931982
2001-05-30  1.137026  1.035017  0.044617 -1.368798
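For a DataFrame, row selection by date goes through `.loc`, since plain brackets select columns. The sketch below uses a counter in place of random values (the name `df` is hypothetical) and also shows how to combine a date selection with a column label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
df = pd.DataFrame(np.arange(400).reshape(100, 4),
                  index=dates,
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])

may = df.loc['2001-05']                # all rows in May 2001
texas_may = df.loc['2001-05', 'Texas'] # same rows, one column

print(may.index.tolist())
print(texas_may.tolist())
```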


# 2 Time Series with Duplicate Indices

In some data sets, multiple observations may fall on the same timestamp:

In [24]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', 
                          '1/2/2000', '1/3/2000'])

In [25]:
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

We can tell whether the index is unique by checking its is_unique property:

In [26]:
dup_ts.index.is_unique

False

Indexing into this time series now produces either a scalar value or a slice, depending on whether the timestamp is duplicated:

In [27]:
dup_ts['1/3/2000'] # not duplicated

4

In [28]:
dup_ts['1/2/2000'] # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
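Because the return type depends on the data, code that consumes the result may need to handle both cases. One option, sketched here as an assumption rather than a recommendation from the text, is to select with a one-date slice, which always returns a Series.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup = pd.Series(np.arange(5), index=dates)

unique_hit = dup['2000-01-03']      # scalar: the date occurs once
duplicate_hit = dup['2000-01-02']   # Series: the date occurs three times

# A slice over a single date returns a Series in both cases.
always_series = dup.loc['2000-01-03':'2000-01-03']

print(unique_hit, duplicate_hit.tolist(), always_series.tolist())
```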

Suppose you want to aggregate the data that share a timestamp. One way is to use groupby and pass level=0:

In [29]:
grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

In [30]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
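The aggregation yields a series with one row per timestamp. When aggregation is not wanted, another option (an addition here, not something the text above covers) is to keep only the first observation per timestamp with `Index.duplicated`. In current pandas versions, `mean` returns floating-point values.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup = pd.Series(np.arange(5), index=dates)

grouped = dup.groupby(level=0)   # group rows by their index value
means = grouped.mean()
counts = grouped.count()

# Alternative: drop repeats, keeping the first row for each timestamp.
first_only = dup[~dup.index.duplicated(keep='first')]

print(means.tolist(), counts.tolist(), first_only.tolist())
```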