## Machine Learning for Forecasting: Supervised Learning with Multivariate Time Series
- https://towardsdatascience.com/machine-learning-for-forecasting-supervised-learning-with-multivariate-time-series-b5b5044fe068
- Second post in the ML-based time series forecasting series

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023.6.25</div>
<div style="text-align: right"> Last update: 2023.6.25</div>

In [1]:
import re
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings; warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
# print(plt.style.available)

# Options for pandas
pd.options.display.max_columns = 30

- transforming time series from a sequence into a tabular format;
- adding new features based on summary statistics.


In [2]:
def time_delay_embedding(
    series: pd.Series,
    n_lags: int,
    horizon: int
):
    """
    Time delay embedding
    Time series for supervised learning
    :param series: time series as pd.Series
    :param n_lags: number of past values to use as explanatory variables
    :param horizon: how many values to forecast
    :return: pd.DataFrame with reconstructed time series
    """
    assert isinstance(series, pd.Series)
    
    if series.name is None:
        name = 'Series'
    else:
        name = series.name
        
    n_lags_iter = list(range(n_lags, -horizon, -1))
    
    X = [series.shift(i) for i in n_lags_iter]
    X = pd.concat(X, axis = 1).dropna()
    # a shift of j > 0 is labelled (t-(j-1)); a shift of j <= 0 is labelled (t+(|j|+1))
    X.columns = [f'{name}(t-{j - 1})'
                 if j > 0 else f'{name}(t+{np.abs(j) + 1})'
                 for j in n_lags_iter]
    
    return X
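
To see what the embedding produces, here is a minimal sketch on a toy series. The helper `embed` is a compact, hypothetical restatement of the same shifting logic (not a replacement for `time_delay_embedding`), and the series `y` and the settings `n_lags=3`, `horizon=2` are made up for illustration.

```python
import pandas as pd

def embed(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    # Hypothetical compact restatement of the shifting logic shown above.
    shifts = list(range(n_lags, -horizon, -1))
    frame = pd.concat([series.shift(i) for i in shifts], axis=1).dropna()
    # Same naming convention: shift i > 0 -> (t-(i-1)), shift i <= 0 -> (t+(|i|+1)).
    frame.columns = [
        f"{series.name}(t-{i - 1})" if i > 0 else f"{series.name}(t+{abs(i) + 1})"
        for i in shifts
    ]
    return frame

# Toy series 1, 2, ..., 10 (illustrative data only).
y = pd.Series(range(1, 11), name="y")
toy = embed(y, n_lags=3, horizon=2)

print(toy.columns.tolist())   # column labels in lag-to-lead order
print(toy.shape)              # rows with incomplete windows are dropped
print(toy.iloc[0].tolist())   # first complete window
```

Each row holds three consecutive past values followed by two later values. Note that under this naming convention the column labelled `(t+1)` is the unshifted series, i.e. the value at the row's own timestamp, and `(t-0)` is the value one step before it.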
        

In [3]:
wine = pd.read_csv("data/wine_sales.csv", parse_dates=['date'])
wine.head()

,Fortified,Drywhite,Sweetwhite,Red,Rose,Sparkling,date
0,2585,1954,85,464,112.0,1686,1980-01-01
1,3368,2302,89,675,118.0,1591,1980-02-01
2,3210,3054,109,703,129.0,2304,1980-03-01
3,3111,2414,95,887,99.0,1712,1980-04-01
4,3756,2226,91,1139,116.0,1471,1980-05-01


In [4]:
wine = wine.set_index("date")
wine.head()

date,Fortified,Drywhite,Sweetwhite,Red,Rose,Sparkling
1980-01-01,2585,1954,85,464,112.0,1686
1980-02-01,3368,2302,89,675,118.0,1591
1980-03-01,3210,3054,109,703,129.0,2304
1980-04-01,3111,2414,95,887,99.0,1712
1980-05-01,3756,2226,91,1139,116.0,1471


In [5]:
wine_ds = []
for col in wine.columns:
    col_df = time_delay_embedding(wine[col], n_lags = 12, horizon=6)
    wine_ds.append(col_df)

In [6]:
wine_ds

[            Fortified(t-11)  Fortified(t-10)  Fortified(t-9)  Fortified(t-8)  \
 date                                                                           
 1981-01-01           2585.0           3368.0          3210.0          3111.0   
 1981-02-01           3368.0           3210.0          3111.0          3756.0   
 1981-03-01           3210.0           3111.0          3756.0          4216.0   
 1981-04-01           3111.0           3756.0          4216.0          5225.0   
 1981-05-01           3756.0           4216.0          5225.0          4426.0   
 ...                     ...              ...             ...             ...   
 1994-10-01           1772.0           2526.0          2755.0          1154.0   
 1994-11-01           2526.0           2755.0          1154.0          1568.0   
 1994-12-01           2755.0           1154.0          1568.0          1965.0   
 1995-01-01           1154.0           1568.0          1965.0          2659.0   
 1995-02-01           1568.0

In [7]:
wine_df = pd.concat(wine_ds, axis=1).dropna()
wine_df.head()

date,Fortified(t-11),Fortified(t-10),Fortified(t-9),Fortified(t-8),Fortified(t-7),Fortified(t-6),Fortified(t-5),Fortified(t-4),Fortified(t-3),Fortified(t-2),Fortified(t-1),Fortified(t-0),Fortified(t+1),Fortified(t+2),Fortified(t+3),...,Sparkling(t-8),Sparkling(t-7),Sparkling(t-6),Sparkling(t-5),Sparkling(t-4),Sparkling(t-3),Sparkling(t-2),Sparkling(t-1),Sparkling(t-0),Sparkling(t+1),Sparkling(t+2),Sparkling(t+3),Sparkling(t+4),Sparkling(t+5),Sparkling(t+6)
1981-01-01,2585.0,3368.0,3210.0,3111.0,3756.0,4216.0,5225.0,4426.0,3932.0,3816.0,3661.0,3795.0,2285,2934.0,2985.0,...,1712.0,1471.0,1377.0,1966.0,2453.0,1984.0,2596.0,4087.0,5179.0,1530,1523.0,1633.0,1976.0,1170.0,1480.0
1981-02-01,3368.0,3210.0,3111.0,3756.0,4216.0,5225.0,4426.0,3932.0,3816.0,3661.0,3795.0,2285.0,2934,2985.0,3646.0,...,1471.0,1377.0,1966.0,2453.0,1984.0,2596.0,4087.0,5179.0,1530.0,1523,1633.0,1976.0,1170.0,1480.0,1781.0
1981-03-01,3210.0,3111.0,3756.0,4216.0,5225.0,4426.0,3932.0,3816.0,3661.0,3795.0,2285.0,2934.0,2985,3646.0,4198.0,...,1377.0,1966.0,2453.0,1984.0,2596.0,4087.0,5179.0,1530.0,1523.0,1633,1976.0,1170.0,1480.0,1781.0,2472.0
1981-04-01,3111.0,3756.0,4216.0,5225.0,4426.0,3932.0,3816.0,3661.0,3795.0,2285.0,2934.0,2985.0,3646,4198.0,4935.0,...,1966.0,2453.0,1984.0,2596.0,4087.0,5179.0,1530.0,1523.0,1633.0,1976,1170.0,1480.0,1781.0,2472.0,1981.0
1981-05-01,3756.0,4216.0,5225.0,4426.0,3932.0,3816.0,3661.0,3795.0,2285.0,2934.0,2985.0,3646.0,4198,4935.0,5618.0,...,2453.0,1984.0,2596.0,4087.0,5179.0,1530.0,1523.0,1633.0,1976.0,1170,1480.0,1781.0,2472.0,1981.0,2273.0


In [8]:
# defining target (Y) and explanatory variables (X)
# raw strings avoid invalid-escape warnings for the regex backslashes
predictor_variables = wine_df.columns.str.contains(r'\(t\-')
target_variables = wine_df.columns.str.contains(r'Sparkling\(t\+')

X = wine_df.iloc[:, predictor_variables]
y = wine_df.iloc[:, target_variables]
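
The column selection relies on the naming convention from the embedding step. A minimal sketch with a made-up column index (the names below are illustrative, not the full set from `wine_df`) shows which columns each pattern matches:

```python
import pandas as pd

# Illustrative column names following the `name(t-k)` / `name(t+k)` convention.
cols = pd.Index(["Red(t-1)", "Red(t-0)", "Red(t+1)",
                 "Sparkling(t-0)", "Sparkling(t+1)", "Sparkling(t+2)"])

# Raw strings keep the backslash escapes intact for the regex engine.
is_predictor = cols.str.contains(r"\(t-")
is_target = cols.str.contains(r"Sparkling\(t\+")

print(cols[is_predictor].tolist())  # lagged values of every series
print(cols[is_target].tolist())     # future values of Sparkling only
```

The lags of all series act as predictors, while only the `Sparkling(t+...)` columns are targets, so the other wine types help forecast sparkling wine sales.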

Building the model

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestRegressor

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

model = RandomForestRegressor()
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(mae(y_test, preds))

305.24500000000006
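
Because `y` contains several target columns (one per forecasting step), the single number above is an average over all horizons. Below is a hedged sketch, using made-up arrays rather than the wine data, of how scikit-learn's `mean_absolute_error` can report the error per horizon through its `multioutput` argument:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actuals and forecasts: 4 samples, 3 forecast horizons.
y_true = np.array([[10., 20., 30.],
                   [12., 22., 32.],
                   [14., 24., 34.],
                   [16., 26., 36.]])
y_pred = y_true + np.array([1., 2., 3.])  # error grows with the horizon

# One MAE per target column (i.e. per horizon).
per_horizon = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
# Default behaviour: uniform average over the columns.
overall = mean_absolute_error(y_true, y_pred)

print(per_horizon)
print(overall)
```

Looking at the per-horizon values can reveal whether the error is concentrated in the later forecasting steps, which the averaged score hides.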


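We used 12 lags of each variable as explanatory variables. This is set by the `n_lags` parameter of the `time_delay_embedding` function.

How should the value of this parameter be chosen?

It is hard to say in advance how many lags to include. It depends on the input data and on the specific variable.

A simple way to approach this is feature selection. Start with a generous number of lags, then reduce that number according to importance scores or forecasting performance.

Below is a simplified version of this process. The top 10 features are selected according to the random forest's importance scores, and then the train-and-test cycle is repeated.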

In [11]:
fi = pd.Series(dict(zip(X_train.columns, model.feature_importances_)))
fi

Fortified(t-11)    0.004212
Fortified(t-10)    0.009965
Fortified(t-9)     0.008443
Fortified(t-8)     0.012716
Fortified(t-7)     0.022947
                     ...   
Sparkling(t-4)     0.005109
Sparkling(t-3)     0.002783
Sparkling(t-2)     0.008126
Sparkling(t-1)     0.005132
Sparkling(t-0)     0.002413
Length: 72, dtype: float64

In [12]:
top_10_features = fi.sort_values(ascending=False)[:10]
top_10_features_nm = top_10_features.index
top_10_features_nm

Index(['Sparkling(t-11)', 'Sparkling(t-7)', 'Sparkling(t-9)',
       'Sparkling(t-10)', 'Sparkling(t-8)', 'Sparkling(t-6)', 'Sparkling(t-5)',
       'Fortified(t-7)', 'Drywhite(t-10)', 'Drywhite(t-9)'],
      dtype='object')

In [13]:
X_train_top = X_train[top_10_features_nm]
X_test_top = X_test[top_10_features_nm]

# re-fitting the model
model_top_features = RandomForestRegressor()
model_top_features.fit(X_train_top, y_train)

# getting forecasts for the test set
preds_topf = model_top_features.predict(X_test_top)

# computing MAE error
print(mae(y_test, preds_topf))

261.8669791666666


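- Performance improved: the test MAE dropped from about 305.2 with all 72 features to about 261.9 with the top 10 (exact values vary between runs because the random forest is not seeded).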
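
Feature importance is one route; the other one mentioned earlier is to compare forecasting performance directly across candidate lag counts. The sketch below does this on a synthetic seasonal series. The data, the candidate values, and the helper `lagged_frame` are assumptions for illustration, not the wine dataset. It uses a chronological split and a one-step-ahead target, and scores every candidate on the same final time steps so that the errors are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
t = np.arange(240)
# Synthetic monthly-like series: yearly seasonality plus noise (illustrative only).
series = pd.Series(100 + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, t.size))

def lagged_frame(s: pd.Series, n_lags: int) -> pd.DataFrame:
    # Hypothetical helper: columns lag_1..lag_n plus the value to predict.
    data = {f"lag_{k}": s.shift(k) for k in range(1, n_lags + 1)}
    data["target"] = s
    return pd.DataFrame(data).dropna()

n_test = 48  # hold out the last 48 time steps for every candidate
scores = {}
for n_lags in (3, 6, 12, 24):
    frame = lagged_frame(series, n_lags)
    train, test = frame.iloc[:-n_test], frame.iloc[-n_test:]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train.drop(columns="target"), train["target"])
    preds = model.predict(test.drop(columns="target"))
    scores[n_lags] = mean_absolute_error(test["target"], preds)

best = min(scores, key=scores.get)
print(scores)
print("best n_lags:", best)
```

In practice this comparison is better done on a validation split that is separate from the final test set, so that the chosen `n_lags` is not tuned on the data used for the final evaluation.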