---
title: Creating lag and step columns for time series data
date: 2025-02-05
categories: [ml]
---

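I don't know much about ML, so as a first step I tried forecasting time series data. Creating the lag and step columns gave me trouble, so I'm leaving notes here.  
I still don't really understand DataFrame MultiIndex.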

## Data

I use the data from Kaggle's time series practice competition, [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting).

These notes record the process of building multi-step data from it.

## Code

In [27]:
# Borrowed from https://www.kaggle.com/code/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide
# BASE
# ------------------------------------------------------
import numpy as np
import pandas as pd
from pathlib import Path
import os
import gc
import warnings

# DATA VISUALIZATION
# ------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


# CONFIGURATIONS
# ------------------------------------------------------
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format
warnings.filterwarnings('ignore')

In [2]:
DATA_DIR=Path("../input/data/")

In [3]:
# Import
train = pd.read_csv(DATA_DIR/"train.csv")
test = pd.read_csv(DATA_DIR/"test.csv")
stores = pd.read_csv(DATA_DIR/"stores.csv")
#sub = pd.read_csv(DATA_DIR/"sample_submission.csv")   
transactions = pd.read_csv(DATA_DIR/"transactions.csv").sort_values(["store_nbr", "date"])

# Datetime
train["date"] = pd.to_datetime(train.date)
test["date"] = pd.to_datetime(test.date)
transactions["date"] = pd.to_datetime(transactions.date)

# Data types
train.onpromotion = train.onpromotion.astype("float16")
train.sales = train.sales.astype("float32")
stores.cluster = stores.cluster.astype("int8")

train

,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.00,0.00
1,1,2013-01-01,1,BABY CARE,0.00,0.00
2,2,2013-01-01,1,BEAUTY,0.00,0.00
3,3,2013-01-01,1,BEVERAGES,0.00,0.00
4,4,2013-01-01,1,BOOKS,0.00,0.00
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.13,0.00
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.55,1.00
3000885,3000885,2017-08-15,9,PRODUCE,2419.73,148.00
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.00,8.00


### Data preparation

In [4]:
train.sort_values(['date','family','store_nbr'],inplace=True)
train.reset_index(drop=True,inplace=True)

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train.family)


In [6]:
train["family"]=le.transform(train.family)

In [7]:
# Narrow the data down for debugging
train=train[train['store_nbr'].isin([1,2]) & train['family'].isin([1,2])]

In [8]:
train.head()

,id,date,store_nbr,family,sales,onpromotion
54,1,2013-01-01,1,1,0.0,0.0
55,364,2013-01-01,2,1,0.0,0.0
108,2,2013-01-01,1,2,0.0,0.0
109,365,2013-01-01,2,2,0.0,0.0
1836,1783,2013-01-02,1,1,0.0,0.0


In [9]:
KEY=["date","store_nbr","family"]
train_step=train[KEY+["sales"]]

In [10]:
# From https://github.com/Kaggle/learntools/blob/master/learntools/time_series/utils.py
def make_multistep_target(ts, steps, reverse=False):

    shifts = reversed(range(steps)) if reverse else range(steps)
    return pd.concat({f'y_step_{i + 1}': ts.shift(-i) for i in shifts}, axis=1)
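
To check what this helper returns, here is a minimal sketch on a toy series. The series `toy` and its values are made up for illustration; only `make_multistep_target` comes from the cell above. `y_step_1` is the current row's value (a shift of 0), and each later step looks one more row ahead, so the tail rows pick up NaN.

```python
import pandas as pd

def make_multistep_target(ts, steps, reverse=False):
    # Same helper as above: column y_step_k holds the value k-1 rows ahead.
    shifts = reversed(range(steps)) if reverse else range(steps)
    return pd.concat({f"y_step_{i + 1}": ts.shift(-i) for i in shifts}, axis=1)

# Toy daily series (values are made up for illustration).
toy = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.date_range("2013-01-01", periods=4, freq="D"),
    name="sales",
)

toy_target = make_multistep_target(toy, steps=3)
print(toy_target)

# Rows near the end have no future values, so they contain NaN.
print(toy_target.dropna())
```

With `steps=3`, the last two rows are incomplete, which is why a `dropna()` follows later in the notebook.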

In [11]:
# HACK: is there a way to get MultiIndex columns in one shot, without unstack?
train_step=train_step.set_index(['family', 'store_nbr','date', ])
train_step=train_step.unstack(['family', 'store_nbr'])
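
On the HACK question in this cell: one option that seems to fit is `DataFrame.pivot`, which accepts a list for `columns` (in pandas 1.1 or later, as far as I know) and builds the MultiIndex columns in a single call. A minimal sketch on a made-up frame (`long_df` and its values are assumptions, not the competition data):

```python
import pandas as pd

# Toy long-format frame with the same column names as train_step (values made up).
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"]),
    "store_nbr": [1, 2, 1, 2],
    "family": [1, 1, 1, 1],
    "sales": [0.0, 1.0, 2.0, 3.0],
})

# Route taken in this post: set_index followed by unstack.
wide_unstack = (
    long_df.set_index(["family", "store_nbr", "date"])
    .unstack(["family", "store_nbr"])
)

# One-call alternative: pivot with a list of column keys.
wide_pivot = long_df.pivot(
    index="date", columns=["family", "store_nbr"], values="sales"
)

print(wide_pivot)
```

Because `values` is a single column name here, the result has no top-level `sales` in its columns, so it corresponds to `wide_unstack["sales"]` rather than to `wide_unstack` itself. Like `unstack`, `pivot` raises an error if a (date, family, store_nbr) combination appears more than once.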

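As shown below, the goal is a layout where each `{family, store_nbr}` pair gets its own column, with `sales` running down the rows in date order.  
Shifting this frame then produces lags or multistep targets for each pair separately.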

In [12]:
train_step.head()

,sales,sales,sales,sales
family,1,1,2,2
store_nbr,1,2,1,2
date,,,,
2013-01-01,0.0,0.0,0.0,0.0
2013-01-02,0.0,0.0,2.0,3.0
2013-01-03,0.0,0.0,0.0,2.0
2013-01-04,0.0,0.0,3.0,3.0
2013-01-05,0.0,0.0,3.0,9.0
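
The same wide layout also works for lags, which the cells in this post don't build. A minimal sketch, assuming a hypothetical helper `make_lag_features` written in the style of `make_multistep_target` (the helper name, the `y_lag_` prefix, and the toy values are illustrative assumptions):

```python
import pandas as pd

def make_lag_features(ts, lags):
    # Hypothetical helper: column y_lag_k holds the value k rows earlier.
    return pd.concat({f"y_lag_{i}": ts.shift(i) for i in range(1, lags + 1)}, axis=1)

# Toy wide frame: one column per (family, store_nbr) pair, rows in date order.
columns = pd.MultiIndex.from_tuples([(1, 1), (1, 2)], names=["family", "store_nbr"])
wide = pd.DataFrame(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
    index=pd.date_range("2013-01-01", periods=3, freq="D", name="date"),
    columns=columns,
)

# shift() works column by column, so values never leak between pairs.
lags = make_lag_features(wide, lags=2)

# Back to long format; rows without a full lag history are dropped.
lags_long = lags.stack(["family", "store_nbr"]).dropna()
print(lags_long)
```

Each column is shifted on its own, so a lag for store 2 is never filled with a value from store 1. That is the reason for going through the wide layout instead of shifting the long frame directly.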


In [13]:
train_step=make_multistep_target(train_step["sales"], 3)

In [14]:
train_step=train_step.stack(['family', 'store_nbr']) # move family and store_nbr back into the index

In [15]:
train_step=train_step.dropna()

In [16]:
train_step.head()

,,,y_step_1,y_step_2,y_step_3
date,family,store_nbr,,,
2013-01-01,1,1,0.0,0.0,0.0
2013-01-01,1,2,0.0,0.0,0.0
2013-01-01,2,1,0.0,2.0,0.0
2013-01-01,2,2,0.0,3.0,2.0
2013-01-02,1,1,0.0,0.0,0.0


In [17]:
FEATURES=["onpromotion",]

train_feature_df=train[KEY+FEATURES]

In [18]:
train_feature_df=train_feature_df.set_index(['family', 'store_nbr','date', ])

In [19]:
train_with_step_df=train_feature_df.join(train_step, how='inner')
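
Note that the two frames carry the same index levels in different orders: `train_feature_df` uses (family, store_nbr, date), while `train_step` comes out of `stack()` as (date, family, store_nbr). The join here relies on pandas matching the levels by name. A minimal sketch that makes the alignment explicit with `reorder_levels` before joining (toy frames with made-up values):

```python
import pandas as pd

dates = pd.to_datetime(["2013-01-01", "2013-01-02"])

# Features indexed as (family, store_nbr, date), like train_feature_df.
features = pd.DataFrame(
    {"onpromotion": [0.0, 1.0]},
    index=pd.MultiIndex.from_arrays(
        [[1, 1], [1, 1], dates], names=["family", "store_nbr", "date"]
    ),
)

# Targets indexed as (date, family, store_nbr), like train_step after stack().
targets = pd.DataFrame(
    {"y_step_1": [5.0, 6.0]},
    index=pd.MultiIndex.from_arrays(
        [dates, [1, 1], [1, 1]], names=["date", "family", "store_nbr"]
    ),
)

# Put the target levels in the same order as the feature levels, then join.
targets_aligned = targets.reorder_levels(list(features.index.names))
joined = features.join(targets_aligned, how="inner")
print(joined)
```

With `how="inner"`, rows that were removed from the targets by `dropna()` also disappear from the features, so `X` and `y` end up with the same rows.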

In [20]:
X=train_with_step_df[FEATURES]
y=train_with_step_df[[c for c in train_with_step_df.columns if c.startswith("y_step")]]

In [21]:
X

,,,onpromotion
family,store_nbr,date,
1,1,2013-01-01,0.00
1,2,2013-01-01,0.00
2,1,2013-01-01,0.00
2,2,2013-01-01,0.00
1,1,2013-01-02,0.00
...,...,...,...
2,2,2017-08-12,1.00
1,1,2017-08-13,0.00
1,2,2017-08-13,0.00
2,1,2017-08-13,0.00


In [22]:
y

,,,y_step_1,y_step_2,y_step_3
family,store_nbr,date,,,
1,1,2013-01-01,0.00,0.00,0.00
1,2,2013-01-01,0.00,0.00,0.00
2,1,2013-01-01,0.00,2.00,0.00
2,2,2013-01-01,0.00,3.00,2.00
1,1,2013-01-02,0.00,0.00,0.00
...,...,...,...,...,...
2,2,2017-08-12,7.00,10.00,7.00
1,1,2017-08-13,0.00,0.00,0.00
1,2,2017-08-13,0.00,0.00,0.00
2,1,2017-08-13,1.00,6.00,4.00


### Training a model
Let's try training on the data prepared above.

In [23]:
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LinearRegression
# from xgboost import XGBRegressor

model = RegressorChain(base_estimator=LinearRegression())

In [24]:
model.fit(X, y)

In [25]:
pred=model.predict(X)

In [26]:
pred.shape

(6728, 3)
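
`model.predict` returns a plain NumPy array, so the (family, store_nbr, date) index and the step names are lost. A minimal sketch of putting them back, on toy stand-ins for `X` and `y` (`X_toy`, `y_toy`, and their values are made up; the estimator is passed positionally rather than by keyword):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

# Toy stand-ins for X and y; the real frames carry a
# (family, store_nbr, date) MultiIndex instead of plain dates.
idx = pd.date_range("2013-01-01", periods=6, freq="D", name="date")
X_toy = pd.DataFrame({"onpromotion": [0.0, 1.0, 0.0, 2.0, 1.0, 0.0]}, index=idx)
y_toy = pd.DataFrame(
    {
        "y_step_1": [1.0, 3.0, 1.0, 5.0, 3.0, 1.0],
        "y_step_2": [3.0, 1.0, 5.0, 3.0, 1.0, 2.0],
        "y_step_3": [1.0, 5.0, 3.0, 1.0, 2.0, 4.0],
    },
    index=idx,
)

# Each model in the chain sees the features plus the earlier steps' targets.
chain = RegressorChain(LinearRegression())
chain.fit(X_toy, y_toy)

pred_toy = chain.predict(X_toy)  # NumPy array of shape (n_rows, n_steps)

# Re-attach the index and the step names so each row can be traced back.
pred_df = pd.DataFrame(pred_toy, index=y_toy.index, columns=y_toy.columns)
print(pred_df)
```

Applied to the real data, the same pattern would be `pd.DataFrame(pred, index=y.index, columns=y.columns)`.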