## Machine Learning for Forecasting: Transformations and Feature Extraction
- https://towardsdatascience.com/machine-learning-for-forecasting-transformations-and-feature-extraction-bbbea9de0ac2
- First post in the ML-based time series forecasting series

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023.6.25</div>
<div style="text-align: right"> Last update: 2023.6.25</div>

In [1]:
import re
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings; warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
# print(plt.style.available)

# Options for pandas
pd.options.display.max_columns = 30

This notebook covers two steps:

- transforming a time series from a sequence into a tabular format;
- adding new features based on summary statistics.


In [2]:
def time_delay_embedding(
    series: pd.Series,
    n_lags: int,
    horizon: int
):
    """
    Time delay embedding
    Time series for supervised learning
    :param series: time series as pd.Series
    :param n_lags: number of past values to use as explanatory variables
    :param horizon: how many values to forecast
    :return: pd.DataFrame with reconstructed time series
    """
    assert isinstance(series, pd.Series)
    
    if series.name is None:
        name = 'Series'
    else:
        name = series.name
        
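    # Shifts run from n_lags down to -(horizon - 1):
    # positive shifts are past values, the rest are forecast targets
    n_lags_iter = list(range(n_lags, -horizon, -1))

    X = [series.shift(i) for i in n_lags_iter]
    # Drop the rows at both ends that lack a full window
    X = pd.concat(X, axis=1).dropna()
    # Naming convention: shift 1 is labeled t-0 (latest observed value),
    # shift 0 is labeled t+1 (first value to forecast)
    X.columns = [f'{name}(t-{j - 1})'
                 if j > 0 else f'{name}(t+{np.abs(j) + 1})'
                 for j in n_lags_iter]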
    
    return X
        

In [3]:
test = pd.Series([i for i in range(10)])
test

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [4]:
time_delay_embedding(test, 4, 1)

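|   | Series(t-3) | Series(t-2) | Series(t-1) | Series(t-0) | Series(t+1) |
|---|---|---|---|---|---|
| 4 | 0.0 | 1.0 | 2.0 | 3.0 | 4 |
| 5 | 1.0 | 2.0 | 3.0 | 4.0 | 5 |
| 6 | 2.0 | 3.0 | 4.0 | 5.0 | 6 |
| 7 | 3.0 | 4.0 | 5.0 | 6.0 | 7 |
| 8 | 4.0 | 5.0 | 6.0 | 7.0 | 8 |
| 9 | 5.0 | 6.0 | 7.0 | 8.0 | 9 |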
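The same embedding extends to multi-step forecasting by raising `horizon`. The sketch below is illustrative only: `embed` is a hypothetical, compact restatement of `time_delay_embedding`, repeated here so the block runs on its own, and the toy series is the same 0 to 9 range used above.

```python
import pandas as pd

def embed(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    """Compact restatement of time_delay_embedding (hypothetical helper name)."""
    name = series.name or "Series"
    shifts = list(range(n_lags, -horizon, -1))
    frame = pd.concat([series.shift(i) for i in shifts], axis=1).dropna()
    frame.columns = [
        f"{name}(t-{j - 1})" if j > 0 else f"{name}(t+{abs(j) + 1})"
        for j in shifts
    ]
    return frame

toy = pd.Series(range(10))

# Three lags as inputs, two future values as targets
two_step = embed(toy, n_lags=3, horizon=2)
print(two_step)
```

With `horizon=2` the frame gains a `Series(t+2)` column, and one more row is lost at the end of the series because the last observation has no value two steps ahead.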


Trying it with an actual model

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from pmdarima.datasets import load_sunspots

In [6]:
series = load_sunspots(as_series=True).diff()  # take first differences to stabilize the mean

In [7]:
series.head()

Jan 1749     NaN
Feb 1749     4.6
Mar 1749     7.4
Apr 1749   -14.3
May 1749    29.3
dtype: float64
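First differencing replaces each value with its change from the previous one, which is why the first entry is `NaN`. The numbers below are made up for illustration and are not sunspot values; the sketch also shows that the original levels can be recovered from the differences with a cumulative sum.

```python
import pandas as pd

# Made-up levels, not real sunspot counts
levels = pd.Series([10, 15, 22, 8])

# Change from the previous value; the first entry has no predecessor
changes = levels.diff()

# Undo the differencing: start value plus the running sum of changes
restored = levels.iloc[0] + changes.fillna(0).cumsum()

print(changes.tolist())
print(restored.tolist())
```

A model trained on the differenced series predicts changes, so a forecast of the level itself would need this kind of inversion.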

In [8]:
# using 3 lags (n_lags=3) to predict the next value (horizon=1)
ts = time_delay_embedding(series=series, n_lags=3, horizon=1)
ts

|   | Series(t-2) | Series(t-1) | Series(t-0) | Series(t+1) |
|---|---|---|---|---|
| May 1749 | 4.6 | 7.4 | -14.3 | 29.3 |
| Jun 1749 | 7.4 | -14.3 | 29.3 | -1.5 |
| Jul 1749 | -14.3 | 29.3 | -1.5 | 11.3 |
| Aug 1749 | 29.3 | -1.5 | 11.3 | -28.5 |
| Sep 1749 | -1.5 | 11.3 | -28.5 | 9.6 |
| ... | ... | ... | ... | ... |
| Aug 1983 | 18.5 | -8.1 | -8.9 | -10.4 |
| Sep 1983 | -8.1 | -8.9 | -10.4 | -21.5 |
| Oct 1983 | -8.9 | -10.4 | -21.5 | 5.5 |
| Nov 1983 | -10.4 | -21.5 | 5.5 | -22.5 |


In [9]:
# Target columns are the ones labeled t+k; a raw string avoids an invalid-escape warning
target_columns = ts.columns.str.contains(r'\+')
X = ts.iloc[:, ~target_columns]
y = ts.iloc[:, target_columns]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)

In [11]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

pred = model.predict(X_test)
mae(y_test, pred)

13.657167898702033

Adding a feature

In [12]:
series = load_sunspots(as_series=True).diff()  # take first differences to stabilize the mean
# using 3 lags (n_lags=3) to predict the next value (horizon=1)
ts = time_delay_embedding(series=series, n_lags=3, horizon=1)

target_columns = ts.columns.str.contains(r'\+')
X = ts.iloc[:, ~target_columns]
y = ts.iloc[:, target_columns]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)

# Add the row-wise mean of the lags as a feature
# (assign returns a new frame, so the split output is not modified in place)
X_train = X_train.assign(mean=X_train.mean(axis=1))
X_test = X_test.assign(mean=X_test.mean(axis=1))


model = RandomForestRegressor()
model.fit(X_train, y_train)

pred = model.predict(X_test)
mae(y_test, pred)

13.233553931046432

- Performance improved slightly: MAE fell from 13.66 to 13.23.
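The row-wise mean is only one possible summary statistic. As a sketch, the hypothetical helper `add_summary_features` below appends several at once; the choice of statistics and the synthetic lag matrix are assumptions for illustration, not part of the original post. Each statistic is computed from the original lag columns only, so a feature added earlier (such as `mean`) never feeds into a later one.

```python
import numpy as np
import pandas as pd

def add_summary_features(lags: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `lags` with row-wise summary statistics appended."""
    out = lags.copy()
    # Always read from `lags`, never from `out`, so that new columns
    # are not mixed into the statistics that follow them
    out["mean"] = lags.mean(axis=1)
    out["std"] = lags.std(axis=1)
    out["min"] = lags.min(axis=1)
    out["max"] = lags.max(axis=1)
    return out

# Synthetic stand-in for the lag matrix X (random values, not sunspot data)
rng = np.random.default_rng(0)
lags = pd.DataFrame(
    rng.normal(size=(5, 3)),
    columns=["Series(t-2)", "Series(t-1)", "Series(t-0)"],
)

features = add_summary_features(lags)
print(features.round(2))
```

Applying the same function to `X_train` and `X_test` separately keeps the features row-local, so no information crosses the train/test boundary. Whether the extra columns lower the error would have to be checked on the sunspot data.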