# Notebook overview

- Tabular Playground Series, September 2022.
- First, take a look at the data.
- Then, work out an approach.

References
- [4 Strategies for Multi-Step Time Series Forecasting | Machine Learning Mastery](https://machinelearningmastery.com/multi-step-time-series-forecasting/)
- Tabular Playground Series Jan 2022 competition: builds features from the date alone (no time-series features), trains 10 models with TimeSeriesSplit, and ensembles them. [Catboost Baseline | Kaggle](https://www.kaggle.com/code/junhyeok99/catboost-baseline/notebook)
- Multi-step prediction with XGBoost. [tps-jan22-quick-eda-xgboost | Kaggle](https://www.kaggle.com/code/cv13j0/tps-jan22-quick-eda-xgboost)
- Hyperparameter tuning of XGBoost with optuna. [Kaggle merchandise EDA with XGBoost | Kaggle](https://www.kaggle.com/code/lucamassaron/kaggle-merchandise-eda-with-xgboost)
- Multi-step prediction with an LSTM. [【日本語】Starter Data Exploration と LSTM | Kaggle](https://www.kaggle.com/code/takahiro1127/starter-data-exploration-lstm/notebook)
- Multi-step prediction with an encoder-decoder. [Encoder-Decoder Model for Multistep Time Series Forecasting Using PyTorch | Towards Data Science](https://towardsdatascience.com/encoder-decoder-model-for-multistep-time-series-forecasting-using-pytorch-5d54c6af6e60)
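
The date-only approach mentioned in the Catboost reference can be sketched roughly as follows. This is a minimal illustration, not the referenced notebook's code: the helper `make_date_features`, the chosen feature set, and `n_splits=4` are assumptions made here, and the dates are a synthetic stand-in for a single series.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def make_date_features(dates: pd.Series) -> pd.DataFrame:
    """Derive calendar features from a datetime column (no lagged targets)."""
    return pd.DataFrame(
        {
            "year": dates.dt.year,
            "month": dates.dt.month,
            "day_of_week": dates.dt.dayofweek,  # 0 = Monday
            "day_of_year": dates.dt.dayofyear,
            "is_weekend": (dates.dt.dayofweek >= 5).astype(int),
        },
        index=dates.index,
    )


# Synthetic stand-in: four years of daily dates for one series.
dates = pd.Series(pd.date_range("2017-01-01", "2020-12-31", freq="D"))
features = make_date_features(dates)

# Each fold trains on an initial stretch of rows and validates on the block after it.
splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(features))
for fold, (train_idx, valid_idx) in enumerate(folds):
    print(
        fold,
        "train until", dates.iloc[train_idx].max().date(),
        "| validate", dates.iloc[valid_idx].min().date(),
        "to", dates.iloc[valid_idx].max().date(),
    )
```

`TimeSeriesSplit` splits by row position, so with the full table (many rows per date) the rows would need to be sorted by date first, and a fold boundary may still fall in the middle of a day.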

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tabular-playground-series-sep-2022/sample_submission.csv
/kaggle/input/tabular-playground-series-sep-2022/train.csv
/kaggle/input/tabular-playground-series-sep-2022/test.csv


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

%matplotlib inline
sns.set(style='darkgrid')

In [3]:
TRAIN_PATH = '../input/tabular-playground-series-sep-2022/train.csv'
TEST_PATH = '../input/tabular-playground-series-sep-2022/test.csv'
SUBMISSION_PATH = '../input/tabular-playground-series-sep-2022/sample_submission.csv'

In [4]:
train_df = pd.read_csv(TRAIN_PATH, parse_dates=['date'])
train_df

Unnamed: 0,row_id,date,country,store,product,num_sold
0,0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663
1,1,2017-01-01,Belgium,KaggleMart,Kaggle Getting Started,615
2,2,2017-01-01,Belgium,KaggleMart,Kaggle Recipe Book,480
3,3,2017-01-01,Belgium,KaggleMart,Kaggle for Kids: One Smart Goose,710
4,4,2017-01-01,Belgium,KaggleRama,Kaggle Advanced Techniques,240
...,...,...,...,...,...,...
70123,70123,2020-12-31,Spain,KaggleMart,Kaggle for Kids: One Smart Goose,614
70124,70124,2020-12-31,Spain,KaggleRama,Kaggle Advanced Techniques,215
70125,70125,2020-12-31,Spain,KaggleRama,Kaggle Getting Started,158
70126,70126,2020-12-31,Spain,KaggleRama,Kaggle Recipe Book,135


In [5]:
test_df = pd.read_csv(TEST_PATH, parse_dates=['date'])
test_df

Unnamed: 0,row_id,date,country,store,product
0,70128,2021-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques
1,70129,2021-01-01,Belgium,KaggleMart,Kaggle Getting Started
2,70130,2021-01-01,Belgium,KaggleMart,Kaggle Recipe Book
3,70131,2021-01-01,Belgium,KaggleMart,Kaggle for Kids: One Smart Goose
4,70132,2021-01-01,Belgium,KaggleRama,Kaggle Advanced Techniques
...,...,...,...,...,...
17515,87643,2021-12-31,Spain,KaggleMart,Kaggle for Kids: One Smart Goose
17516,87644,2021-12-31,Spain,KaggleRama,Kaggle Advanced Techniques
17517,87645,2021-12-31,Spain,KaggleRama,Kaggle Getting Started
17518,87646,2021-12-31,Spain,KaggleRama,Kaggle Recipe Book


In [6]:
from pandas_profiling import ProfileReport

# Quick report: minimal=True
profile = ProfileReport(train_df, minimal=True)
profile.to_file(output_file="output_train_profiling_minimal.html")

# Full report: minimal=False
profile = ProfileReport(train_df, minimal=False)
profile.to_file(output_file="output_train_profiling.html")


In [7]:
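# Task: predict the number of units sold for each date, country, store, and product
# Train: data from 2017 to 2020
# Test: data from 2021
# The pandas_profiling report shows no missing values and no anomalies in the column distributions
# -> This looks like a contest of crafting time-series features and validation

### TODO
# EDA of the time-series data
# Build a weekday x week grid from the dates and plot units sold as a heatmap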
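
Since the test set covers one full year after the training period, one way to set up validation is to hold out whole years. The sketch below is an assumption about how this could be done, not something the notebook implements; `yearly_folds` is a hypothetical helper, and the dates are synthetic.

```python
import pandas as pd


def yearly_folds(dates: pd.Series, first_valid_year: int):
    """Yield (year, train_mask, valid_mask): train on earlier years, validate on one year."""
    years = dates.dt.year
    for year in range(first_valid_year, int(years.max()) + 1):
        yield year, years < year, years == year


# Synthetic stand-in for the training dates of one series.
dates = pd.Series(pd.date_range("2017-01-01", "2020-12-31", freq="D"))

# Number of training and validation days per fold.
fold_sizes = {
    year: (int(train_mask.sum()), int(valid_mask.sum()))
    for year, train_mask, valid_mask in yearly_folds(dates, first_valid_year=2018)
}
print(fold_sizes)
```

The boolean masks can be applied directly to `train_df` when its `date` column is passed in, which keeps every series of a given day in the same fold.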

In [8]:
train_df.groupby(['country', 'store', 'product']).count()

country,store,product,row_id,date,num_sold
Belgium,KaggleMart,Kaggle Advanced Techniques,1461,1461,1461
Belgium,KaggleMart,Kaggle Getting Started,1461,1461,1461
Belgium,KaggleMart,Kaggle Recipe Book,1461,1461,1461
Belgium,KaggleMart,Kaggle for Kids: One Smart Goose,1461,1461,1461
Belgium,KaggleRama,Kaggle Advanced Techniques,1461,1461,1461
Belgium,KaggleRama,Kaggle Getting Started,1461,1461,1461
Belgium,KaggleRama,Kaggle Recipe Book,1461,1461,1461
Belgium,KaggleRama,Kaggle for Kids: One Smart Goose,1461,1461,1461
France,KaggleMart,Kaggle Advanced Techniques,1461,1461,1461
France,KaggleMart,Kaggle Getting Started,1461,1461,1461


In [9]:
test_df.groupby(['country', 'store', 'product']).count()

country,store,product,row_id,date
Belgium,KaggleMart,Kaggle Advanced Techniques,365,365
Belgium,KaggleMart,Kaggle Getting Started,365,365
Belgium,KaggleMart,Kaggle Recipe Book,365,365
Belgium,KaggleMart,Kaggle for Kids: One Smart Goose,365,365
Belgium,KaggleRama,Kaggle Advanced Techniques,365,365
Belgium,KaggleRama,Kaggle Getting Started,365,365
Belgium,KaggleRama,Kaggle Recipe Book,365,365
Belgium,KaggleRama,Kaggle for Kids: One Smart Goose,365,365
France,KaggleMart,Kaggle Advanced Techniques,365,365
France,KaggleMart,Kaggle Getting Started,365,365
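
The identical counts above (1461 rows per series in train, 365 in test) suggest that every series has exactly one row per day. A sketch of checking this programmatically follows; `check_complete` is a hypothetical helper, and the toy frame only mimics the column names of `train_df`.

```python
import pandas as pd


def check_complete(df: pd.DataFrame, keys=("country", "store", "product")) -> bool:
    """Return True if every key combination has exactly one row per calendar day."""
    all_days = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
    per_series = df.groupby(list(keys))["date"].agg(["count", "nunique"])
    complete = (per_series["count"] == len(all_days)) & (per_series["nunique"] == len(all_days))
    return bool(complete.all())


# Toy data: 10 days x 2 countries x 1 store x 2 products (names are made up).
toy = pd.MultiIndex.from_product(
    [
        pd.date_range("2017-01-01", periods=10, freq="D"),
        ["Belgium", "France"],
        ["KaggleMart"],
        ["Book A", "Book B"],
    ],
    names=["date", "country", "store", "product"],
).to_frame(index=False)

print(check_complete(toy))                # complete grid
print(check_complete(toy.drop(index=0)))  # one day missing for one series
```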


In [10]:
plt.rcParams['font.size'] = 8
fig, axes = plt.subplots(nrows=1, ncols=3, tight_layout=True, figsize=(16, 4))

sns.boxplot(data=train_df, x='num_sold', y='country', ax=axes[0])
sns.boxplot(data=train_df, x='num_sold', y='store', ax=axes[1])
sns.boxplot(data=train_df, x='num_sold', y='product', ax=axes[2])

# Rotate the x tick labels of the product panel
axes[2].tick_params(axis='x', labelrotation=30)
plt.show()



In [11]:
train_df_pivot = train_df.pivot(index='date', columns=['country', 'store', 'product'], values='num_sold')
train_df_pivot

country,Belgium,Belgium,Belgium,Belgium,Belgium,Belgium,Belgium,Belgium,France,France,...,Poland,Poland,Spain,Spain,Spain,Spain,Spain,Spain,Spain,Spain
store,KaggleMart,KaggleMart,KaggleMart,KaggleMart,KaggleRama,KaggleRama,KaggleRama,KaggleRama,KaggleMart,KaggleMart,...,KaggleRama,KaggleRama,KaggleMart,KaggleMart,KaggleMart,KaggleMart,KaggleRama,KaggleRama,KaggleRama,KaggleRama
product,Kaggle Advanced Techniques,Kaggle Getting Started,Kaggle Recipe Book,Kaggle for Kids: One Smart Goose,Kaggle Advanced Techniques,Kaggle Getting Started,Kaggle Recipe Book,Kaggle for Kids: One Smart Goose,Kaggle Advanced Techniques,Kaggle Getting Started,...,Kaggle Recipe Book,Kaggle for Kids: One Smart Goose,Kaggle Advanced Techniques,Kaggle Getting Started,Kaggle Recipe Book,Kaggle for Kids: One Smart Goose,Kaggle Advanced Techniques,Kaggle Getting Started,Kaggle Recipe Book,Kaggle for Kids: One Smart Goose
date,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
2017-01-01,663,615,480,710,240,187,158,267,610,463,...,50,92,447,364,313,451,159,123,113,181
2017-01-02,514,408,342,601,187,158,119,196,455,364,...,40,67,339,266,236,379,124,104,74,123
2017-01-03,549,425,334,515,172,131,120,188,465,362,...,35,61,320,271,211,335,113,87,69,125
2017-01-04,477,384,328,517,177,134,115,169,465,311,...,35,55,302,260,201,316,114,84,74,110
2017-01-05,447,371,268,480,150,129,101,169,385,323,...,32,53,285,248,184,271,99,72,67,104
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-12-27,574,482,399,609,231,160,129,228,552,450,...,139,223,587,472,368,629,199,160,133,204
2020-12-28,625,445,387,608,203,168,132,200,574,384,...,153,238,576,472,384,655,198,168,127,212
2020-12-29,597,546,427,684,227,192,146,248,615,479,...,163,251,616,510,423,711,234,182,157,242
2020-12-30,632,492,438,649,221,183,156,237,621,568,...,168,250,601,509,459,655,232,170,153,239


In [12]:
fig, axes = plt.subplots(nrows=6, ncols=1, tight_layout=True, figsize=(20, 24))

# Plot one panel per country
for i, c in enumerate(train_df_pivot.columns.get_level_values(0).unique()):
    train_df_pivot[c].plot(linewidth=1, fontsize=8, ax=axes[i])
    axes[i].set_title(c)
    axes[i].set_xlabel('date')
    axes[i].legend(fontsize=8, loc='upper left')

plt.show()

In [13]:
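# There is a day-of-week periodicity
# There is a burst around the year-end and New Year holidays
# Italy, Poland, Spain, and others show different trends before and after 2020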
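
The three observations above were read off the plots; they could also be put into numbers. The sketch below uses a synthetic series with a built-in weekend uplift as a stand-in for one column of `train_df_pivot`, so the printed values say nothing about the real data. The variable names and the December 25 cutoff for "year end" are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", "2020-12-31", freq="D")

# Synthetic daily sales: weekend uplift plus noise.
base = np.where(dates.dayofweek >= 5, 130.0, 100.0)
sales = pd.Series(base + rng.normal(0, 5, len(dates)), index=dates)

# Mean sales per weekday relative to the overall mean (1.0 = average day).
weekday_profile = sales.groupby(sales.index.dayofweek).mean() / sales.mean()

# Mean of the last week of each year relative to the overall mean.
is_year_end = (sales.index.month == 12) & (sales.index.day >= 25)
year_end_ratio = sales[is_year_end].mean() / sales.mean()

# Mean per calendar year, to expose level shifts such as the one around 2020.
yearly_level = sales.groupby(sales.index.year).mean()

print(weekday_profile.round(2))
print(round(year_end_ratio, 2))
print(yearly_level.round(1))
```

Applied to `train_df_pivot`, the same `groupby` calls work column-wise, giving one profile per country, store, and product.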

In [14]:
train_df_pivot_corr = train_df_pivot.corr(method='spearman')

plt.figure(figsize=(20, 20))

sns.heatmap(train_df_pivot_corr,
            annot=True,
            linewidths=0.4,
            annot_kws={"size": 6}
)

plt.xticks(rotation=90)
plt.yticks(rotation=0) 
plt.show()

In [15]:
# ----------------------------------------------------------------------------
# Author:  Nicolas P. Rougier
# License: BSD
# ----------------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from datetime import datetime
from dateutil.relativedelta import relativedelta


def calmap(ax, year, data):
    ax.tick_params('x', length=0, labelsize="medium", which='major')
    ax.tick_params('y', length=0, labelsize="x-small", which='major')

    # Month borders
    xticks, labels = [], []
    start = datetime(year,1,1).weekday()
    for month in range(1,13):
        first = datetime(year, month, 1)
        last = first + relativedelta(months=1, days=-1)

        y0 = first.weekday()
        y1 = last.weekday()
        x0 = (int(first.strftime("%j"))+start-1)//7
        x1 = (int(last.strftime("%j"))+start-1)//7

        P = [ (x0,   y0), (x0,    7),  (x1,   7),
              (x1,   y1+1), (x1+1,  y1+1), (x1+1, 0),
              (x0+1,  0), (x0+1,  y0) ]
        xticks.append(x0 +(x1-x0+1)/2)
        labels.append(first.strftime("%b"))
        poly = Polygon(P, edgecolor="black", facecolor="None",
                       linewidth=1, zorder=20, clip_on=False)
        ax.add_artist(poly)
    
    ax.set_xticks(xticks)
    ax.set_xticklabels(labels)
    ax.set_yticks(0.5 + np.arange(7))
    ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
    ax.set_title("{}".format(year), weight="semibold")
    
    # Clearing first and last day from the data
    valid = datetime(year, 1, 1).weekday()
    data[:valid,0] = np.nan
    valid = datetime(year, 12, 31).weekday()
    # data[:,x1+1:] = np.nan
    data[valid+1:,x1] = np.nan

    # Showing data
    ax.imshow(data, extent=[0,53,0,7], zorder=10, vmin=-1, vmax=1,
              cmap="RdYlBu_r", origin="lower", alpha=.75)

In [16]:
# Calendar heatmaps of standardized sales for the first five series.
# Each window holds 53*7 = 371 days so that it reshapes into a 7 x 53 grid;
# the 2020 window starts on 2019-12-27 so that it ends on 2020-12-31, the last training day.
# Note: each grid starts on the first day of its window, so its rows match the
# Mon-Sun labels drawn by calmap only when that day is a Monday (as for 2018).
window_starts = {2017: '2017-01-01', 2018: '2018-01-01', 2019: '2019-01-01', 2020: '2019-12-27'}

for i, c in enumerate(train_df_pivot.columns):
    fig, axes = plt.subplots(4, 1, figsize=(8, 8))

    for ax, (year, start) in zip(axes, window_starts.items()):
        dates = pd.date_range(start=start, periods=53*7, freq='D')
        # Standardize within the window; calmap uses a fixed color range (vmin=-1, vmax=1)
        vals = StandardScaler().fit_transform(train_df_pivot[c].loc[dates].values.reshape(-1, 1))
        calmap(ax, year, vals.reshape(53, 7).T)

    title = '_'.join([''.join(t.split(' ')) for t in c])
    plt.suptitle(title, fontsize=15, x=0.3, y=0.98)
    plt.tight_layout()
    plt.show()

    if i > 3:
        break
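
The loop above reshapes 371 consecutive days starting at the first day of each window, whereas `calmap` labels the rows Mon to Sun and blanks the cells before January 1. A sketch of a helper that places each day by its actual weekday and week is shown below. `to_week_grid` is a hypothetical name, and the 53-column limit is an assumption that holds for 2017 to 2020 but not for every year (a leap year starting on a Sunday needs 54 columns).

```python
import numpy as np
import pandas as pd


def to_week_grid(series: pd.Series, year: int, n_weeks: int = 53) -> np.ndarray:
    """Return a (7, n_weeks) array: rows are Mon..Sun, columns are weeks
    counted from the week that contains January 1. Empty cells are NaN."""
    days = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D")
    offset = days[0].dayofweek                 # weekday of Jan 1 (0 = Monday)
    slots = np.arange(len(days)) + offset      # position of each day in the grid
    if slots[-1] >= 7 * n_weeks:
        raise ValueError(f"{year} does not fit into {n_weeks} week columns")
    grid = np.full((7, n_weeks), np.nan)
    grid[slots % 7, slots // 7] = series.reindex(days).to_numpy(dtype=float)
    return grid


# Demo: fill the grid with each day's weekday number, so every row should
# contain only its own row index wherever it is not NaN.
days_2017 = pd.date_range("2017-01-01", "2017-12-31", freq="D")
demo = pd.Series(days_2017.dayofweek, index=days_2017)
grid = to_week_grid(demo, 2017)
print(grid[:, :3])
```

A standardized column of `train_df_pivot` could be passed in place of `demo`, and the resulting grid handed to `calmap(ax, year, grid)` without the `reshape(53, 7).T` step.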

In [17]:
submission_df = pd.read_csv(SUBMISSION_PATH)
submission_df

Unnamed: 0,row_id,num_sold
0,70128,100
1,70129,100
2,70130,100
3,70131,100
4,70132,100
...,...,...
17515,87643,100
17516,87644,100
17517,87645,100
17518,87646,100
