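## Today's agenda

Predictive analysis of time-series data (stock price data) using deep learning

1. Review of the previous agenda

2. LSTM (a model that solves problems of the RNN, the deep learning model that can handle time-series data)

3. GRU (a model that addresses the drawbacks of the LSTM above)

Note: both LSTM and GRU are widely used for time-series forecasting and speech recognition.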

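## Previous agenda

We walked through the standard analysis steps for time-series data.

1. Seasonal decomposition

2. Stationarity test

3. Autocorrelation (ACF) and partial autocorrelation (PACF)

4. Forecasting models

4.1 ARIMA (base model)

4.2 SARIMAX (model that accounts for seasonality, plus exogenous variables)

4.3 State-space model (the ARIMAX regression coefficients vary over time instead of being constant)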



In [None]:
import warnings 
warnings.filterwarnings('ignore')
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
import statsmodels.api as sm

color = sns.color_palette()
sns.set_style('darkgrid')

In [None]:
from subprocess import check_output
print(check_output(['ls', '../input']).decode('utf-8'))

In [None]:
train = pd.read_csv('../input/demand-forecasting-kernels-only/train.csv')
train['date'] = pd.to_datetime(train['date'], format="%Y-%m-%d")
train.head()

In [None]:
# per 1 store, 1 item
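# Build both masks on the same frame; .copy() lets us add columns later without chained-assignment warnings
train_df = train[(train['store'] == 1) & (train['item'] == 1)].copy()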

In [None]:
print(train_df.shape)

In [None]:
sns.lineplot(x="date", y="sales",legend = 'full' , data=train_df)

In [None]:
# train_df = train_df.set_index('date')
train_df['year'] = train_df['date'].dt.year
train_df['month'] = train_df['date'].dt.month
train_df['day'] = train_df['date'].dt.dayofyear  # day of the year (1-366), not day of the month
train_df['weekday'] = train_df['date'].dt.weekday
train_df.head()

In [None]:
sns.boxplot(x="weekday", y="sales", data=train_df)

Monday=0, Sunday=6.  

# ARIMA

AR term: AR(p)
$$y_t = \beta_{0}+\beta_{1}y_{t-1}+ \cdots +\beta_{p}y_{t-p}+e_{t}$$
MA term: MA(q)
$$y_t = e_{t}+\theta_{1}e_{t-1}+ \cdots +\theta_{q}e_{t-q}$$
ARIMA(p,d,q)
$$y_t = \beta_{0}+\beta_{1}y_{t-1}+ \cdots +\beta_{p}y_{t-p}+e_{t}+\theta_{1}e_{t-1}+ \cdots +\theta_{q}e_{t-q}$$
I term: I(d)
$$y_t = \Delta^{d}Y_{t}$$
<br>
$Y_{t}$ is the original data and d is the number of times the series is differenced (for d = 1, $\Delta Y_{t}=Y_{t}-Y_{t-1}$). The ARIMA(p,d,q) equation is therefore an ARMA(p,q) model for the differenced series $y_t$.
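
As a small illustration of the I(d) step, the sketch below differences a toy series once and twice with `np.diff`. The series `Y` is a made-up example, not the sales data.

```python
import numpy as np

# Hypothetical toy series, not the sales data
Y = np.array([3.0, 5.0, 8.0, 12.0, 17.0])

d1 = np.diff(Y, n=1)   # d = 1: Y_t - Y_{t-1}
d2 = np.diff(Y, n=2)   # d = 2: difference of the first difference

print(d1)  # [2. 3. 4. 5.]
print(d2)  # [1. 1. 1.]
```

Each round of differencing shortens the series by one observation, which is why the first value is dropped after differencing.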

## Time series decomposition
Decompose the series into seasonal, trend, and residual components to examine the structure of the data.
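
To make the three components concrete, here is a minimal sketch of a classical additive decomposition written with NumPy only. It is an illustration under simplifying assumptions (odd period, synthetic data, hypothetical variable names) and is not meant to reproduce `seasonal_decompose` exactly.

```python
import numpy as np

period = 7
n = 10 * period
t = np.arange(n)

# Synthetic series (illustrative only): linear trend + zero-mean weekly pattern
pattern = np.array([3.0, -1.0, -2.0, 0.0, 1.0, -3.0, 2.0])  # sums to 0
y = 0.5 * t + np.tile(pattern, n // period)

# Trend: centred moving average over one full period (period is odd here)
half = period // 2
trend = np.full(n, np.nan)
trend[half:n - half] = np.convolve(y, np.ones(period) / period, mode="valid")

# Seasonal: average of the detrended values at each position in the cycle
detrended = y - trend
seasonal_means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
seasonal_means -= seasonal_means.mean()   # force the pattern to sum to zero
seasonal = np.tile(seasonal_means, n // period)

# Residual: whatever the trend and seasonal parts do not explain
resid = y - trend - seasonal

print(np.round(seasonal_means, 2))
```

Because the synthetic series contains no noise, the recovered seasonal pattern matches `pattern` and the residual is zero wherever the trend is defined. The first and last `half` points have no trend estimate.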

In [None]:
train_df = train_df.set_index('date')
train_df['sales'] = train_df['sales'].astype(float)
train_df.head()

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
# `period` is the current name of the argument formerly called `freq`
result = seasonal_decompose(train_df['sales'], model='additive', period=365)

fig = result.plot()  # result.plot() creates its own figure
fig.set_size_inches(15, 12)

### Stationary vs. non-stationary:

To forecast a time series accurately, the series needs to be stationary. If it is non-stationary, check whether differencing makes it stationary.

![alt text](https://imgur.com/LjtBXwf.png)


![alt text](https://imgur.com/v2Uye7X.png)


![Imgur](https://i.imgur.com/6HVlvg2.png)  

In practice, stationarity cannot be judged reliably by eye, so we run a statistical test.
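
Before running the formal test, the difference between the two cases can be seen on synthetic data. The sketch below compares white noise (stationary) with a random walk (non-stationary) by checking how much the mean drifts between consecutive blocks. The helper `block_means` and the seed are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)   # white noise: stationary
walk = np.cumsum(noise)         # random walk: non-stationary

def block_means(x, size=100):
    """Mean of each consecutive block of `size` observations."""
    return x.reshape(-1, size).mean(axis=1)

# For a stationary series the block means stay close together;
# for the random walk they drift from block to block.
print("spread of block means, noise:", block_means(noise).std())
print("spread of block means, walk :", block_means(walk).std())

# Differencing the walk recovers the noise (apart from the first value)
print(np.allclose(np.diff(walk), noise[1:]))
```

This is only a visual-style heuristic. The ADF test remains the tool for the actual decision.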


In [None]:
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries, window = 12, cutoff = 0.01):

    # Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    #Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    
    #Perform Dickey-Fuller test:
    print('Results of Augmented Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag = 20 )
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    
    print(dfoutput)


In [None]:
test_stationarity(train_df['sales'])

Compute the first difference ($ \Delta Y_t = Y_t-Y_{t-1}$) of the original data ($Y_t$) and evaluate its stationarity.

In [None]:
# First difference Y_t - Y_{t-1}; the first row has no predecessor and is dropped
first_diff = train_df.sales.diff().dropna()
test_stationarity(first_diff, window = 12)

The sales data becomes stationary after differencing once.


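## Autocorrelation (ACF) and partial autocorrelation (PACF)


The autocorrelation is the correlation between the data at time t and at time t-k (k=1,2,...).

**corr(y<sub>t</sub>, y<sub>t-k</sub>)**

The partial autocorrelation is the correlation between the data at time t and at time t-k after removing the linear effect of the observations in between (k=1,2,...).

**corr(y<sub>t</sub>, y<sub>t-k</sub> | y<sub>t-1</sub>, ..., y<sub>t-k+1</sub>)**


They are used to choose p and q in ARIMA(p, d, q).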
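
The autocorrelation can also be computed by hand, which shows what the ACF plot is measuring. The function `sample_acf` below is a hypothetical helper written for this illustration, and the input is a made-up series with an exact 7-step cycle. It covers the ACF only, not the PACF.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[k:] * dev[:len(y) - k]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical series that repeats exactly every 7 steps
y = np.tile([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0], 20)
acf = sample_acf(y, 14)
print(np.round(acf, 2))
```

For this series the autocorrelation peaks at lags 7 and 14, which is the kind of recurring spike that signals a weekly pattern in an ACF plot.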

In [None]:
import statsmodels.api as sm

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(train_df.sales, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(train_df.sales, lags=40, ax=ax2)

Use the differenced data instead of the original data.

In [None]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(first_diff, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(first_diff, lags=40, ax=ax2)

# Here we can see that the ACF and PACF both have a recurring pattern every 7 periods, indicating a weekly pattern.
# Any time you see a regular pattern like that in one of these plots, you should suspect that there is some
# significant seasonality going on. We should then consider SARIMA to take seasonality into account.

The large positive ACF values at lags k = 7, 14, 21, ... indicate that weekly seasonality has to be taken into account.

## ARIMA(p,d,q) modeling

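### How to choose p, d, and q

(1) Use the ACF and PACF<br>
(2) Define a set of candidate (p, d, q) combinations, fit an ARIMA model for each, and use the combination with the smallest AIC<br>

Here we use (1).<br><br>
The series became stationary after differencing once, so d follows immediately: **d = 1**

For the AR term, the PACF is significant up to lag 6: **p = 6**

The MA term is usually read off the ACF, but here it is set to 0 so that a badly chosen q does not hurt the analysis (when the AR order is chosen well, an MA term is often unnecessary): **q = 0**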


In [None]:
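# fit() is called without `disp`: ARIMA.fit() in recent statsmodels releases does not take that argument
arima_mod6 = sm.tsa.ARIMA(train_df.sales, order=(6, 1, 0)).fit()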
print(arima_mod6.summary())

In [None]:
import itertools

p = range(0, 7)   # AR order: 0..6
d = range(1, 2)   # differencing order: fixed at 1
q = range(0, 3)   # MA order: 0..2
pdq = list(itertools.product(p, d, q))  # all candidate (p, d, q) combinations

aic_by_order = {}
for order in pdq:
    try:
        model = sm.tsa.ARIMA(train_df.sales, order=order).fit()
        aic_by_order[order] = model.aic
    except Exception:
        # skip orders for which estimation fails
        continue

# order with the smallest AIC
best_order = min(aic_by_order, key=aic_by_order.get)
print(best_order)

For reference, method (2), minimizing the AIC, gives p=6, d=1, q=2 as the best combination.

In [None]:
from scipy import stats
from scipy.stats import normaltest

resid = arima_mod6.resid
print(normaltest(resid))
# returns a 2-tuple of the chi-squared statistic, and the associated p-value. the p-value is very small, meaning
# the residual is not a normal distribution

fig = plt.figure(figsize=(12,8))
ax0 = fig.add_subplot(111)

sns.distplot(resid ,fit = stats.norm, ax = ax0) # need to import scipy.stats

# Get the fitted parameters used by the function
(mu, sigma) = stats.norm.fit(resid)

# Label the fitted normal curve with its parameters (raw string keeps the LaTeX backslashes intact)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('Residual distribution')


# ACF and PACF
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(arima_mod6.resid, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(arima_mod6.resid, lags=40, ax=ax2)

The residual ACF and PACF still show significant autocorrelation, so we move to a SARIMA model that accounts for seasonality.

### SARIMA

In a SARIMA(p,d,q,p2,d2,q2,s) model, the seasonal orders (p2,d2,q2) and the period s have to be chosen in addition to the ARIMA(p,d,q) orders. SARIMA(p,d,q,0,0,0,1) is the same as ARIMA(p,d,q).
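
The role of the seasonal differencing order d2 and the period s can be seen on a toy example: with d2 = 1 and s = 7, each value is compared with the value one week earlier. The series below is synthetic and the variable names are chosen for this sketch only.

```python
import numpy as np

s = 7
t = np.arange(6 * s)
weekly = np.tile([5.0, 1.0, 0.0, 2.0, 3.0, 8.0, 9.0], 6)  # hypothetical weekly pattern
y = 0.1 * t + weekly                                       # slow trend + weekly pattern

# Seasonal difference (d2 = 1, s = 7): y_t - y_{t-7}
seasonal_diff = y[s:] - y[:-s]
print(np.round(seasonal_diff, 2))

# Adding an ordinary difference (d = 1) on top removes the remaining constant
both = np.diff(seasonal_diff)
print(np.round(both, 2))
```

The seasonal difference cancels the weekly pattern and leaves only the weekly increase of the trend (0.7 here). The additional ordinary difference reduces that to zero.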

In [None]:
arima_mod6 = sm.tsa.statespace.SARIMAX(train_df.sales, trend='n', order=(6,1,0)).fit()
print(arima_mod6.summary())

In [None]:
resid = arima_mod6.resid
print(normaltest(resid))

fig = plt.figure(figsize=(12,8))
ax0 = fig.add_subplot(111)

sns.distplot(resid ,fit = stats.norm, ax = ax0) # need to import scipy.stats

# Get the fitted parameters used by the function
(mu, sigma) = stats.norm.fit(resid)

# Label the fitted normal curve with its parameters (raw string keeps the LaTeX backslashes intact)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('Residual distribution')


# ACF and PACF
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(arima_mod6.resid, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(arima_mod6.resid, lags=40, ax=ax2)

## Forecasting and validation

Forecast the last three months of the data and validate the result against the actual values.

In [None]:
start_index = '2017-10-01'
end_index = '2017-12-31'
train_df['forecast'] = arima_mod6.predict(start = start_index, end= end_index, dynamic= True)  
train_df[start_index:end_index][['sales', 'forecast']].plot(figsize=(12, 8))

In [None]:
def smape_kun(y_true, y_pred):
    # Symmetric MAPE in percent; 0/0 terms (actual and forecast both zero) are counted as 0.
    # y_true and y_pred are expected to be pandas Series (fillna is used).
    smape = np.mean((np.abs(y_pred - y_true) * 200 / (np.abs(y_pred) + np.abs(y_true))).fillna(0))
    print('SMAPE: %.2f' % smape, "%")

In [None]:
smape_kun(train_df[start_index:end_index]['sales'],train_df[start_index:end_index]['forecast'])

## SARIMAX: adding external variables 

In [None]:
# per 1 store, 1 item
storeid = 1
itemid = 1
train_df = train[train['store']==storeid]
train_df = train_df[train_df['item']==itemid]

# train_df = train_df.set_index('date')
train_df['year'] = train_df['date'].dt.year - 2012
train_df['month'] = train_df['date'].dt.month
train_df['day'] = train_df['date'].dt.dayofyear
train_df['weekday'] = train_df['date'].dt.weekday

train_df.head(10)

In [None]:
# dtype=float keeps the dummy columns numeric (newer pandas returns booleans by default)
train_df = pd.get_dummies(train_df, columns=['month', 'weekday'], prefix=['month', 'weekday'], dtype=float)
train_df.head()

In [None]:
train_df = train_df.set_index('date')
#train_df = train_df.reset_index()
start_index = '2017-10-01'
end_index = '2017-12-31'
train_df.head()

In [None]:
ext_var_list = ['year', 'day', 
       'month_1', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6',
       'month_7', 'month_8', 'month_9', 'month_10', 'month_11', 'month_12',
       'weekday_0','weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5',
       'weekday_6']

In [None]:
exog_data = train_df[ext_var_list]
#exog_data = exog_data.set_index('date')
exog_data.head(10)

In [None]:
%%time
sarimax_mod6 = sm.tsa.statespace.SARIMAX(endog = train_df.sales[:start_index],
                                        exog = exog_data[:start_index],  
                                        trend='n', order=(6,1,0), seasonal_order=(0,1,1,7)).fit()
print(sarimax_mod6.summary())

In [None]:
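start_index = '2017-10-01'
# [:start_index] is label-inclusive, so the out-of-sample steps are 2017-10-02..2017-12-31 (91 days).
# end_index stops one day early so exog_data[start_index:end_index] also has 91 rows;
# those rows are dated one day before the steps they feed.
end_index = '2017-12-30'
end_index1 = '2017-12-31'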

In [None]:
exog_data[start_index:end_index]

In [None]:
train_df['forecast'] = sarimax_mod6.predict(start = pd.to_datetime(start_index), end= pd.to_datetime(end_index1),
                                            exog = exog_data[start_index:end_index], 
                                            dynamic= True)  

train_df[start_index:end_index][['sales', 'forecast']].plot(figsize=(12, 8))

In [None]:
smape_kun(train_df[start_index:end_index]['sales'],train_df[start_index:end_index]['forecast'])

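### State-space model

Here ARIMAX(6,1,0) is cast as a state-space model. The ARIMAX regression coefficients were constants so far; this formulation lets them vary over time. For example, if shoppers' behavior changed with COVID-19, letting the coefficients differ before and after the change can improve forecast accuracy.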
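
The idea of a coefficient that changes over time can be sketched with a scalar Kalman filter, assuming the coefficient follows a random walk: $y_t = x_t \beta_t + \varepsilon_t$, $\beta_t = \beta_{t-1} + \eta_t$. The function `filter_time_varying_coef`, the noise variances, and the synthetic regime change are assumptions for this illustration. It is not the estimation that `SARIMAX` performs internally.

```python
import numpy as np

def filter_time_varying_coef(y, x, obs_var, state_var, beta0=0.0, p0=1e6):
    """Scalar Kalman filter for y_t = x_t * beta_t + noise, with beta_t a random walk."""
    beta, p = beta0, p0
    betas = np.empty(len(y))
    for t in range(len(y)):
        p = p + state_var                          # predict: the coefficient may drift
        s = x[t] * p * x[t] + obs_var              # variance of the one-step forecast error
        gain = p * x[t] / s                        # Kalman gain
        beta = beta + gain * (y[t] - x[t] * beta)  # update with the forecast error
        p = (1.0 - gain * x[t]) * p                # updated uncertainty of the coefficient
        betas[t] = beta
    return betas

# Synthetic data: the true coefficient jumps from 1 to 3 at t = 100
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 2.0, size=n)
true_beta = np.where(np.arange(n) < 100, 1.0, 3.0)
y = true_beta * x + rng.normal(scale=0.1, size=n)

betas = filter_time_varying_coef(y, x, obs_var=0.01, state_var=0.01)
print(betas[90:100].mean(), betas[190:200].mean())
```

A constant-coefficient regression would settle on a single compromise value, while the filtered estimate follows the coefficient before and after the change. A larger `state_var` makes the estimate react faster but also makes it noisier.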


In [None]:
%%time
sarimax_tvc_mod = sm.tsa.statespace.SARIMAX(endog = train_df.sales[:start_index],
                                        exog = exog_data[:start_index],trend='n', order=(6,1,0),
                                        time_varying_regression=True, mle_regression=False).fit()
print(sarimax_tvc_mod.summary())

In [None]:
train_df['forecast'] = sarimax_tvc_mod.predict(start = pd.to_datetime(start_index), end= pd.to_datetime(end_index1),
                                            exog = exog_data[start_index:end_index], 
                                            dynamic= True)  

train_df[start_index:end_index][['sales', 'forecast']].plot(figsize=(12, 8))

In [None]:
smape_kun(train_df[start_index:end_index]['sales'],train_df[start_index:end_index]['forecast'])

The SMAPE is 28.58% for ARIMA(6,1,0), 22.32% for SARIMAX(6,1,0,0,1,1,7), and 27.72% for the state-space ARIMAX(6,1,0). SARIMAX(6,1,0,0,1,1,7), which has the smallest forecast error, is the best of the three models.
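
To make the metric behind this comparison easy to check by hand, here is a plain-Python sketch of the same SMAPE definition as `smape_kun`, applied to three made-up points. The function name `smape` and the numbers are illustrative only.

```python
def smape(y_true, y_pred):
    """Symmetric MAPE in percent; terms where both values are 0 count as 0."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denom = abs(actual) + abs(predicted)
        if denom > 0:
            total += 200.0 * abs(predicted - actual) / denom
    return total / len(y_true)

# Hypothetical values, not taken from the sales data
actual = [10, 20, 0]
predicted = [12, 20, 0]
print(round(smape(actual, predicted), 2))  # 6.06
```

Only the first point contributes an error, 200 * 2 / 22 ≈ 18.18, and the average over the three points is about 6.06%.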