# Python Time Series Analysis (ARIMA Regression, Decision Tree Binary Classification)

* Ref: [https://blog.csdn.net/weixin_43746433/article/details/95673475](https://blog.csdn.net/weixin_43746433/article/details/95673475)
* Data:

In [6]:
import pandas as pd
import datetime as dt
import numpy as np

## 1. Time Series

### 1.1 `date_range()`
* timestamp
* period
* interval

You can specify the start time and the frequency:
* H: hour
* D: day
* M: month

In [3]:
rng = pd.date_range('2020-10-01', # start date
                    periods=10,   # generate 10 periods (including the start date)
                    freq='3D')    # one period every three days
rng

DatetimeIndex(['2020-10-01', '2020-10-04', '2020-10-07', '2020-10-10',
               '2020-10-13', '2020-10-16', '2020-10-19', '2020-10-22',
               '2020-10-25', '2020-10-28'],
              dtype='datetime64[ns]', freq='3D')

In [12]:
# Generate 20 random numbers from a standard normal distribution
# and index them by 20 consecutive days starting at 2020-10-01
time = pd.Series(np.random.randn(20),
                 index=pd.date_range(dt.datetime(2020, 10, 1), periods=20))
time

2020-10-01   -0.909143
2020-10-02    0.845100
2020-10-03    0.595816
2020-10-04   -1.653899
2020-10-05   -0.905150
2020-10-06    0.440458
2020-10-07   -1.466339
2020-10-08   -0.124193
2020-10-09   -2.453975
2020-10-10   -1.047542
2020-10-11   -0.572579
2020-10-12   -0.880458
2020-10-13   -1.120524
2020-10-14   -1.289865
2020-10-15    0.922374
2020-10-16    0.651188
2020-10-17   -1.704083
2020-10-18   -0.777089
2020-10-19   -0.183997
2020-10-20   -0.782617
Freq: D, dtype: float64

In [13]:
# Keep only the entries on or after the given date;
# everything earlier than `before` is truncated
time.truncate(before='2020-10-10')

2020-10-10   -1.047542
2020-10-11   -0.572579
2020-10-12   -0.880458
2020-10-13   -1.120524
2020-10-14   -1.289865
2020-10-15    0.922374
2020-10-16    0.651188
2020-10-17   -1.704083
2020-10-18   -0.777089
2020-10-19   -0.183997
2020-10-20   -0.782617
Freq: D, dtype: float64

In [15]:
# Keep only the entries on or before the given date;
# everything later than `after` is truncated
time.truncate(after='2020-10-10')

2020-10-01   -0.909143
2020-10-02    0.845100
2020-10-03    0.595816
2020-10-04   -1.653899
2020-10-05   -0.905150
2020-10-06    0.440458
2020-10-07   -1.466339
2020-10-08   -0.124193
2020-10-09   -2.453975
2020-10-10   -1.047542
Freq: D, dtype: float64

In [16]:
time['2020-10-15':'2020-10-20'] # both start and end are included

2020-10-15    0.922374
2020-10-16    0.651188
2020-10-17   -1.704083
2020-10-18   -0.777089
2020-10-19   -0.183997
2020-10-20   -0.782617
Freq: D, dtype: float64

In [17]:
data = pd.date_range('2020-10-01', '2021-10-01', freq='M')
data

DatetimeIndex(['2020-10-31', '2020-11-30', '2020-12-31', '2021-01-31',
               '2021-02-28', '2021-03-31', '2021-04-30', '2021-05-31',
               '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30'],
              dtype='datetime64[ns]', freq='M')
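
Note that `'M'` anchors each entry at the month end, which is why the index above starts at 2020-10-31 rather than 2020-10-01. As a small sketch (the name `month_starts` is illustrative and not from the original notebook), the `'MS'` alias anchors at the month start instead. Recent pandas releases also spell the month-end alias `'ME'` and warn about `'M'` in `date_range`.

```python
import pandas as pd

# 'MS' = month start: one timestamp on the first day of each month
month_starts = pd.date_range('2020-10-01', '2021-01-01', freq='MS')
print(month_starts)
# Entries: 2020-10-01, 2020-11-01, 2020-12-01, 2021-01-01 (the end date is included)
```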

### 1.2 Timestamp, Period, Timedelta

* Timestamp
  * Can be specified down to a date, or down to a specific point in time

In [18]:
pd.Timestamp('2020-10-01')

Timestamp('2020-10-01 00:00:00')

In [19]:
pd.Timestamp('2020-10-01 10')

Timestamp('2020-10-01 10:00:00')

In [20]:
pd.Timestamp('2020-10-01 10:15')

Timestamp('2020-10-01 10:15:00')

* Period
  * The object records its frequency

In [21]:
pd.Period('2020-10')

Period('2020-10', 'M')

In [22]:
pd.Period('2020-10-01')

Period('2020-10-01', 'D')

* Timedelta

In [23]:
pd.Timedelta('1 day')

Timedelta('1 days 00:00:00')

### 1.3 Time Conversion

* Use Timedelta to add or subtract time

In [24]:
pd.Period('2020-10-01 10:10') + pd.Timedelta('1 day')

Period('2020-10-02 10:10', 'T')

In [26]:
pd.Timestamp('2020-10-01 10:10') + pd.Timedelta('1 day')

Timestamp('2020-10-02 10:10:00')

In [27]:
pd.Timestamp('2020-10-01 10:10') + pd.Timedelta('15 ns')

Timestamp('2020-10-01 10:10:00.000000015')
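
The cells above only add time. As a minimal sketch of the subtraction side (the names `start`, `end`, `elapsed`, and `earlier` are illustrative), subtracting a Timedelta shifts a Timestamp backwards, and subtracting two Timestamps yields a Timedelta:

```python
import pandas as pd

start = pd.Timestamp('2020-10-01 10:10')
end = pd.Timestamp('2020-10-03 12:40')

elapsed = end - start                    # Timestamp - Timestamp gives a Timedelta
earlier = end - pd.Timedelta(hours=3)    # Timestamp - Timedelta gives a Timestamp

print(elapsed)                  # 2 days 02:30:00
print(elapsed.total_seconds())  # 181800.0
print(earlier)                  # 2020-10-03 09:40:00
```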

### 1.4 `period_range()`

In [28]:
p1 = pd.period_range('2020-10-01 10:10', freq='25H', periods=10)

In [29]:
p2 = pd.period_range('2020-10-01 10:10', freq='1D1H', periods=10)

In [30]:
p1

PeriodIndex(['2020-10-01 10:00', '2020-10-02 11:00', '2020-10-03 12:00',
             '2020-10-04 13:00', '2020-10-05 14:00', '2020-10-06 15:00',
             '2020-10-07 16:00', '2020-10-08 17:00', '2020-10-09 18:00',
             '2020-10-10 19:00'],
            dtype='period[25H]', freq='25H')

In [31]:
p2

PeriodIndex(['2020-10-01 10:00', '2020-10-02 11:00', '2020-10-03 12:00',
             '2020-10-04 13:00', '2020-10-05 14:00', '2020-10-06 15:00',
             '2020-10-07 16:00', '2020-10-08 17:00', '2020-10-09 18:00',
             '2020-10-10 19:00'],
            dtype='period[25H]', freq='25H')

### 1.5 Setting the Index

In [32]:
rng = pd.date_range('2020 Oct 1', periods=10, freq='D')
rng

DatetimeIndex(['2020-10-01', '2020-10-02', '2020-10-03', '2020-10-04',
               '2020-10-05', '2020-10-06', '2020-10-07', '2020-10-08',
               '2020-10-09', '2020-10-10'],
              dtype='datetime64[ns]', freq='D')

In [33]:
pd.Series(range(len(rng)), index=rng)

2020-10-01    0
2020-10-02    1
2020-10-03    2
2020-10-04    3
2020-10-05    4
2020-10-06    5
2020-10-07    6
2020-10-08    7
2020-10-09    8
2020-10-10    9
Freq: D, dtype: int64

In [36]:
periods = [pd.Period('2020-10'), pd.Period('2020-11'), pd.Period('2020-12')]
periods

[Period('2020-10', 'M'), Period('2020-11', 'M'), Period('2020-12', 'M')]

In [37]:
ts = pd.Series(np.random.randn(len(periods)), index=periods)
ts

2020-10    0.991335
2020-11   -0.516955
2020-12   -0.569943
Freq: M, dtype: float64

In [38]:
type(ts.index)

pandas.core.indexes.period.PeriodIndex

### 1.6 Converting Between Timestamp and Period

In [40]:
ts = pd.Series(range(10),
               pd.date_range('2020-10-01 8:00', periods=10, freq='H'))
ts

2020-10-01 08:00:00    0
2020-10-01 09:00:00    1
2020-10-01 10:00:00    2
2020-10-01 11:00:00    3
2020-10-01 12:00:00    4
2020-10-01 13:00:00    5
2020-10-01 14:00:00    6
2020-10-01 15:00:00    7
2020-10-01 16:00:00    8
2020-10-01 17:00:00    9
Freq: H, dtype: int64

In [41]:
ts_period = ts.to_period()
ts_period

2020-10-01 08:00    0
2020-10-01 09:00    1
2020-10-01 10:00    2
2020-10-01 11:00    3
2020-10-01 12:00    4
2020-10-01 13:00    5
2020-10-01 14:00    6
2020-10-01 15:00    7
2020-10-01 16:00    8
2020-10-01 17:00    9
Freq: H, dtype: int64
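
The conversion also works in the other direction. As a small sketch (the names `ts_daily`, `as_period`, and `back` are illustrative and not from the original notebook), `to_timestamp()` turns a PeriodIndex back into a DatetimeIndex, by default anchored at the start of each period:

```python
import pandas as pd

ts_daily = pd.Series(range(3),
                     index=pd.date_range('2020-10-01', periods=3, freq='D'))

as_period = ts_daily.to_period()   # DatetimeIndex -> PeriodIndex (frequency taken from the index)
back = as_period.to_timestamp()    # PeriodIndex -> DatetimeIndex (start of each period)

print(type(as_period.index).__name__)  # PeriodIndex
print(type(back.index).__name__)       # DatetimeIndex
```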

### 1.7 Filling Missing Values in a Time Series

* `asfreq()`
* `ffill(n)`
* `bfill(n)`
* `interpolate("linear")`

In [46]:
ts = pd.Series(range(20),
               pd.date_range('2020-10-01 8:00', periods=20, freq='D'))
ts

2020-10-01 08:00:00     0
2020-10-02 08:00:00     1
2020-10-03 08:00:00     2
2020-10-04 08:00:00     3
2020-10-05 08:00:00     4
2020-10-06 08:00:00     5
2020-10-07 08:00:00     6
2020-10-08 08:00:00     7
2020-10-09 08:00:00     8
2020-10-10 08:00:00     9
2020-10-11 08:00:00    10
2020-10-12 08:00:00    11
2020-10-13 08:00:00    12
2020-10-14 08:00:00    13
2020-10-15 08:00:00    14
2020-10-16 08:00:00    15
2020-10-17 08:00:00    16
2020-10-18 08:00:00    17
2020-10-19 08:00:00    18
2020-10-20 08:00:00    19
Freq: D, dtype: int64

In [47]:
day3Ts = ts.resample('3D').mean() # average each 3-day bin and index it by the bin's first day
day3Ts

2020-10-01     1.0
2020-10-04     4.0
2020-10-07     7.0
2020-10-10    10.0
2020-10-13    13.0
2020-10-16    16.0
2020-10-19    18.5
Freq: 3D, dtype: float64

In [48]:
# asfreq() inserts the dates in between,
# which leaves a time series with missing values
day3Ts.resample('D').asfreq()

2020-10-01     1.0
2020-10-02     NaN
2020-10-03     NaN
2020-10-04     4.0
2020-10-05     NaN
2020-10-06     NaN
2020-10-07     7.0
2020-10-08     NaN
2020-10-09     NaN
2020-10-10    10.0
2020-10-11     NaN
2020-10-12     NaN
2020-10-13    13.0
2020-10-14     NaN
2020-10-15     NaN
2020-10-16    16.0
2020-10-17     NaN
2020-10-18     NaN
2020-10-19    18.5
Freq: D, dtype: float64

In [49]:
day3Ts.resample('D').ffill(1) # ffill(n) is forward fill: carry the last non-missing value forward, filling at most n consecutive NaNs

2020-10-01     1.0
2020-10-02     1.0
2020-10-03     NaN
2020-10-04     4.0
2020-10-05     4.0
2020-10-06     NaN
2020-10-07     7.0
2020-10-08     7.0
2020-10-09     NaN
2020-10-10    10.0
2020-10-11    10.0
2020-10-12     NaN
2020-10-13    13.0
2020-10-14    13.0
2020-10-15     NaN
2020-10-16    16.0
2020-10-17    16.0
2020-10-18     NaN
2020-10-19    18.5
Freq: D, dtype: float64

In [50]:
day3Ts.resample('D').bfill(1) # bfill(n) is backward fill: carry the next non-missing value backward, filling at most n consecutive NaNs

2020-10-01     1.0
2020-10-02     NaN
2020-10-03     4.0
2020-10-04     4.0
2020-10-05     NaN
2020-10-06     7.0
2020-10-07     7.0
2020-10-08     NaN
2020-10-09    10.0
2020-10-10    10.0
2020-10-11     NaN
2020-10-12    13.0
2020-10-13    13.0
2020-10-14     NaN
2020-10-15    16.0
2020-10-16    16.0
2020-10-17     NaN
2020-10-18    18.5
2020-10-19    18.5
Freq: D, dtype: float64

In [51]:
day3Ts.resample('D').interpolate('linear') # fill missing values by linear interpolation

2020-10-01     1.000000
2020-10-02     2.000000
2020-10-03     3.000000
2020-10-04     4.000000
2020-10-05     5.000000
2020-10-06     6.000000
2020-10-07     7.000000
2020-10-08     8.000000
2020-10-09     9.000000
2020-10-10    10.000000
2020-10-11    11.000000
2020-10-12    12.000000
2020-10-13    13.000000
2020-10-14    14.000000
2020-10-15    15.000000
2020-10-16    16.000000
2020-10-17    16.833333
2020-10-18    17.666667
2020-10-19    18.500000
Freq: D, dtype: float64

## 2. Stock Price Prediction

### 2.1 Load data

In [57]:
import pandas as pd
# import pandas_datareader
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
# statsmodels >= 0.12; older releases exposed this as statsmodels.tsa.arima_model.ARIMA
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [55]:
style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

In [None]:
stock = 'data/T10yr.csv'
df = pd.read_csv(stock,
                 index_col=0,
                 parse_dates=[0])
df.head(10)

### 2.2 Take the Weekly Average and Split Train/Test Samples

In [None]:
# W-MON means weekly frequency anchored on Monday: each bin ends on, and is labeled by, a Monday
weekly = df['Close'].resample('W-MON').mean()
train = weekly['2000':'2015']
train.head()
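
The cell above only builds the training window. A minimal sketch of a complementary split follows, using a synthetic stand-in for `df['Close']` because the CSV is not included here. Treating everything from 2016 onward as the test set is an assumption, and the names `close`, `weekly_demo`, `train_demo`, and `test_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily "Close" series standing in for the real data
days = pd.date_range('2014-01-01', '2017-12-31', freq='D')
close = pd.Series(np.linspace(1.0, 3.0, len(days)), index=days, name='Close')

weekly_demo = close.resample('W-MON').mean()  # weekly bins labeled by their closing Monday

train_demo = weekly_demo.loc[:'2015']   # every week labeled up to the end of 2015
test_demo = weekly_demo.loc['2016':]    # assumed hold-out: every week from 2016 on

print(len(train_demo), len(test_demo))
```

Slicing by year strings keeps the two windows disjoint, so no week appears in both sets.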

### 2.3 Plotting

In [None]:
train.plot(figsize=(15, 5))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title('Stock Close Price')
sns.despine()

### 2.4 Differencing

In [None]:
diff = train.diff()   # first-order difference: x(t) - x(t-1)
diff = diff.dropna()  # drop the leading NaN produced by diff()

In [None]:
plt.figure()
plt.plot(diff)
plt.title('First-Order Difference')
plt.show()
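
To make the differencing step concrete, here is a tiny worked example on made-up numbers (the series `prices` is illustrative, not taken from the data set):

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 11.0, 15.0],
                   index=pd.date_range('2020-10-01', periods=4, freq='D'))

first_diff = prices.diff()        # x(t) - x(t-1); the first entry has no predecessor, so it is NaN
first_diff = first_diff.dropna()  # drop that leading NaN

print(first_diff.tolist())  # [2.0, -1.0, 4.0]
```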

### 2.5 ACF, PACF

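* ACF (autocorrelation function): compares an ordered sequence of random variables with itself
  * ACF expresses the correlation between values of the same series at different points in time
  * It includes the influence of the other (intermediate) variables
* PACF (partial autocorrelation function)
  * For a stationary AR(p) model, the lag-$k$ autocorrelation coefficient $\rho(k)$ does not measure the pure correlation between $x(t)$ and $x(t-k)$
    * $x(t)$ is also influenced by $x(t-1)$, $x(t-2)$, $\cdots$, $x(t-k+1)$, and these $k-1$ random variables are in turn related to $x(t-k)$
  * PACF removes the interference of these $k-1$ variables and measures only how strongly $x(t-k)$ is correlated with $x(t)$
    * Only the direct relationship between these two variables remains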
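
The difference between the two can be sketched numerically. The example below simulates an AR(1) series and computes both quantities with plain NumPy. It is an illustration under stated assumptions, not the method used by `plot_acf`/`plot_pacf`: the helper names `sample_acf` and `sample_pacf` are hypothetical, and the PACF is estimated as the last coefficient of a least-squares regression of $x(t)$ on its first $k$ lags.

```python
import numpy as np

gen = np.random.default_rng(0)

# Simulate an AR(1) process: x(t) = 0.7 * x(t-1) + noise
n, phi = 5000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + gen.standard_normal()

def sample_acf(series, k):
    """Lag-k sample autocorrelation."""
    centered = series - series.mean()
    return float(np.dot(centered[k:], centered[:-k]) / np.dot(centered, centered))

def sample_pacf(series, k):
    """Lag-k partial autocorrelation: the coefficient on x(t-k) when
    x(t) is regressed on x(t-1), ..., x(t-k)."""
    centered = series - series.mean()
    target = centered[k:]
    # Column j holds the series shifted back by j steps, aligned with target
    lags = np.column_stack([centered[k - j:len(centered) - j] for j in range(1, k + 1)])
    coefs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return float(coefs[-1])

for k in (1, 2, 3):
    print(k, round(sample_acf(x, k), 3), round(sample_pacf(x, k), 3))
```

For this process the ACF should decay roughly like $0.7^k$, while the PACF should be close to 0.7 at lag 1 and close to zero beyond it, because $x(t-2)$ affects $x(t)$ only through $x(t-1)$.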

In [58]:
acf = plot_acf(diff, lags=20)
plt.title('ACF')
acf.show()

### 2.6 ARIMA Model

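* ARIMA(p,d,q)
  * Describes the relationship between the current value and past values, using the variable's own history to predict itself
  * The series being modeled must be stationary; differencing is how a non-stationary series is brought closer to that
  * AR: autoregression; $p$ is the number of autoregressive terms
  * I: integration; $d$ is the number of times the series is differenced
  * MA: moving average; $q$ is the number of moving-average terms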
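
For reference, one common way of writing the model uses the lag operator $L$, defined by $L\,x(t) = x(t-1)$. The symbols $\phi_i$, $\theta_j$, $c$, and $\varepsilon(t)$ are standard textbook notation and do not appear in the original notes:

$$
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d \, x(t) = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon(t)
$$

As a worked case, ARIMA(1,1,0) reduces to $\Delta x(t) = c + \phi_1 \, \Delta x(t-1) + \varepsilon(t)$ with $\Delta x(t) = x(t) - x(t-1)$, that is, an AR(1) model fitted to the first-order difference.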