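A time series is a sequence of data points along a time axis, and it plays an important role in many settings: traders use historical stock prices to calculate risk; weather forecasts are based on time series generated by sensors that measure temperature, humidity, and air pressure; digital marketing departments rely on time series generated by web pages to draw the conclusions their campaigns need.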

In pandas, a DataFrame can carry several kinds of time-based index. DatetimeIndex is the most common one and represents an index of timestamps.


## 1. DatetimeIndex

This section covers:
1. How to construct a DatetimeIndex
2. How to select the index entries that fall within a given time range
3. How to work with time zones

### 1.1 Creating a DatetimeIndex

pandas provides the date_range function for constructing a DatetimeIndex. It accepts a start date, a frequency, and either a number of periods or an end date.

In [1]:
import pandas as pd
import numpy as np


In [3]:
pd.options.plotting.backend = 'plotly'

In [4]:
daily_index = pd.date_range("2020-02-28", periods=4, freq="D")
daily_index

DatetimeIndex(['2020-02-28', '2020-02-29', '2020-03-01', '2020-03-02'], dtype='datetime64[ns]', freq='D')

In [6]:
weekly_index = pd.date_range("2020-01-01", "2020-01-31", freq="W-SUN")
weekly_index

DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26'], dtype='datetime64[ns]', freq='W-SUN')

In [7]:
# Build a table of dates and visitor counts, indexed by the weekly dates
pd.DataFrame(data=[21, 15, 33, 34],
             columns=["visitors"], index=weekly_index)

            visitors
2020-01-05        21
2020-01-12        15
2020-01-19        33
2020-01-26        34


In [18]:
msft = pd.read_csv('../ori_writer/csv/MSFT.csv')
msft.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8622 entries, 0 to 8621
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       8622 non-null   object 
 1   Open       8622 non-null   float64
 2   High       8622 non-null   float64
 3   Low        8622 non-null   float64
 4   Close      8622 non-null   float64
 5   Adj Close  8622 non-null   float64
 6   Volume     8622 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 471.6+ KB


The output above shows that **by default pandas reads the timestamps as strings (dtype `object`)**. There are two ways to fix this.

In [19]:
msft.loc[:, "Date"] = pd.to_datetime(msft["Date"])

In [20]:
msft.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

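The approach above did not work: `Date` is still `object`. Assigning through `.loc[:, "Date"]` writes into the existing column and, depending on the pandas version, can keep that column's original dtype.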
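
For comparison, here is a minimal sketch of the same conversion done by assigning to the column name instead of going through `.loc`. The small `prices` frame and its values are made up for illustration and are not the MSFT data.

```python
import pandas as pd

# Hypothetical toy data standing in for the CSV: the dates are plain strings
prices = pd.DataFrame({"Date": ["2020-01-02", "2020-01-03"],
                       "Close": [1.0, 2.0]})

# Assigning to the column name replaces the whole column, dtype included
prices["Date"] = pd.to_datetime(prices["Date"])

print(prices.dtypes)
```

After this assignment the `Date` column should report a datetime dtype rather than `object`.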

In [22]:
# Second approach: parse the dates while reading and use them as the index
msft = pd.read_csv('../ori_writer/csv/MSFT.csv', index_col='Date', parse_dates=['Date'])
msft.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8622 entries, 1986-03-13 to 2020-05-27
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       8622 non-null   float64
 1   High       8622 non-null   float64
 2   Low        8622 non-null   float64
 3   Close      8622 non-null   float64
 4   Adj Close  8622 non-null   float64
 5   Volume     8622 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 471.5 KB


In [23]:
# With time series data, make sure the index is sorted before any analysis
msft = msft.sort_index()
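
Whether an index is already in order can be checked with the `is_monotonic_increasing` attribute. The sketch below uses a made-up `visits` series with deliberately shuffled dates, since the ordering of the CSV is not known here.

```python
import pandas as pd

# Hypothetical toy series whose dates are deliberately out of order
visits = pd.Series(
    [3, 1, 2],
    index=pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-02"]),
)

before = visits.index.is_monotonic_increasing  # dates are shuffled here
visits = visits.sort_index()                   # returns a sorted copy
after = visits.index.is_monotonic_increasing   # index is now in order

print(before, after)
```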

In [24]:
# Access only the date part of each timestamp, without the time
msft.index.date

array([datetime.date(1986, 3, 13), datetime.date(1986, 3, 14),
       datetime.date(1986, 3, 17), ..., datetime.date(2020, 5, 22),
       datetime.date(2020, 5, 26), datetime.date(2020, 5, 27)],
      dtype=object)
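
Other components of the timestamps are exposed in the same way. As a small sketch, reusing the daily index from section 1.1 rather than the MSFT data:

```python
import pandas as pd

idx = pd.date_range("2020-02-28", periods=4, freq="D")

months = idx.month          # integer month of each timestamp
weekdays = idx.day_name()   # weekday names (English by default)

print(list(months))
print(list(weekdays))
```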

### 1.2 Filtering a DatetimeIndex

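If a DataFrame has a DatetimeIndex, passing `loc` a string in the format YYYY-MM-DD HH:MM:SS makes it easy to select the rows that belong to a given time period. The string may be partial (for example, just the year): pandas converts it into a slice that covers the whole period.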


In [25]:

msft.loc['2019', 'Adj Close']

Date
2019-01-02     99.099190
2019-01-03     95.453529
2019-01-04     99.893005
2019-01-07    100.020401
2019-01-08    100.745613
                 ...    
2019-12-24    156.515396
2019-12-26    157.798309
2019-12-27    158.086731
2019-12-30    156.724243
2019-12-31    156.833633
Name: Adj Close, Length: 252, dtype: float64

In [26]:
msft.loc['2019-06':'2020-05', 'Adj Close'].plot()
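
When two partial strings are used as a slice, both endpoints are included, so the end label covers its whole period. A minimal sketch on made-up daily data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one row per day for the first quarter of 2020
days = pd.date_range("2020-01-01", "2020-03-31", freq="D")
toy = pd.DataFrame({"value": range(len(days))}, index=days)

january = toy.loc["2020-01"]            # every row in January
jan_feb = toy.loc["2020-01":"2020-02"]  # January through the end of February

print(len(january), len(jan_feb))
print(jan_feb.index.max())
```

Here the slice ends on the last day of February, not on its first.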


### 1.3 Working with time zones

In [28]:
msft_close = msft.loc[:, ['Adj Close']].copy()
msft_close.index = msft_close.index + pd.DateOffset(hours=16)
msft_close.head(2)


                     Adj Close
Date
1986-03-13 16:00:00   0.062205
1986-03-14 16:00:00   0.064427


In [29]:
msft_close = msft_close.tz_localize('America/New_York')
msft_close.head(2)

                           Adj Close
Date
1986-03-13 16:00:00-05:00   0.062205
1986-03-14 16:00:00-05:00   0.064427


In [30]:
# Convert the time zone to UTC
msft_close = msft_close.tz_convert('UTC')


In [31]:
msft_close

                            Adj Close
Date
1986-03-13 21:00:00+00:00    0.062205
1986-03-14 21:00:00+00:00    0.064427
1986-03-17 21:00:00+00:00    0.065537
1986-03-18 21:00:00+00:00    0.063871
1986-03-19 21:00:00+00:00    0.062760
...                               ...
2020-05-20 20:00:00+00:00  185.660004
2020-05-21 20:00:00+00:00  183.429993
2020-05-22 20:00:00+00:00  183.509995
2020-05-26 20:00:00+00:00  181.570007
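
The UTC hour differs between the March rows (21:00) and the May rows (20:00) because New York observes daylight saving time, so its offset from UTC changes during the year. A minimal sketch with two made-up 16:00 timestamps, one in winter and one in summer:

```python
import pandas as pd

# Two hypothetical 16:00 New York times, one in winter and one in summer
closes = pd.DatetimeIndex(["2020-01-02 16:00", "2020-05-27 16:00"])

# tz_localize attaches a zone to naive timestamps; the wall-clock time is unchanged
local = closes.tz_localize("America/New_York")

# tz_convert expresses the same instants in another zone
utc = local.tz_convert("UTC")

print(local)
print(utc)
```

`tz_localize` only labels the existing times, while `tz_convert` shifts the displayed time to the new zone.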


In [32]:
msft_close.loc['2020-01-02', 'Adj Close']

Date
2020-01-02 21:00:00+00:00    159.737595
Name: Adj Close, dtype: float64