I updated this notebook on 29 March to modify the WRMSSE function.  
The new WRMSSE function for the LGBM metric calculates WRMSSE only over the last 28 days, so that it considers the non-zero demand period.  
Please refer to the comment section, where I have described the fix in detail.  
(Note: I have also removed some variables to reduce the run time and changed 'objective' in LGBM to 'poisson'.)

This kernel is:  
- Based on [Very fst Model](https://www.kaggle.com/ragnar123/very-fst-model). Thanks [@ragnar123](https://www.kaggle.com/ragnar123).  
- Based on [m5-baseline](https://www.kaggle.com/harupy/m5-baseline). Thanks [@harupy](https://www.kaggle.com/harupy).  

It walks through these great notebooks in detail, originally in Japanese, especially for beginners.  

Additionally, I have added a relatively efficient WRMSSE evaluation, usable as an LGBM metric, to this kernel.
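
As a reminder of what the competition metric measures, here is a minimal sketch of RMSSE for a single series and of the weighted sum over series. The function name `rmsse`, the toy numbers, and the weights are illustrative assumptions and are not taken from this kernel. The sketch also leaves out two parts of the official definition: the scale is computed only from the first non-zero demand onward, and the score is aggregated over all 12 hierarchy levels.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Root mean squared scaled error for one series (simplified sketch)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared error of the one-step naive forecast on the training period
    scale = np.mean(np.diff(train) ** 2)
    return np.sqrt(np.mean((actual - forecast) ** 2) / scale)

# Two toy series with hypothetical weights that sum to 1
score_a = rmsse(train=[2, 0, 2, 0], actual=[1, 1], forecast=[3, 3])
score_b = rmsse(train=[1, 2, 3, 4], actual=[5, 5], forecast=[5, 6])
weights = np.array([0.75, 0.25])
wrmsse = float(np.sum(weights * np.array([score_a, score_b])))
print(score_a, score_b, wrmsse)
```

Because each series is divided by its own scale, a series with volatile history is penalised less for the same absolute error than a stable one, which is why the result can rank models differently from plain RMSE.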

## module import

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import dask.dataframe as dd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import lightgbm as lgb
#import dask_xgboost as xgb
#import dask.dataframe as dd
from sklearn import preprocessing, metrics
from sklearn.preprocessing import LabelEncoder
import gc
import os
from tqdm import tqdm
from scipy.sparse import csr_matrix

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Function definitions

`reduce_mem_usage` is a function that reduces a DataFrame's memory usage by changing column data types.  
See https://qiita.com/hiroyuki_kageyama/items/02865616811022f79754 for details.
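
The core idea can be sketched on a toy frame: look at each integer column's minimum and maximum and pick the narrowest dtype that can hold them. The helper name `smallest_int_dtype` and the toy values are assumptions for illustration, and the bounds check here is inclusive, whereas the function in this notebook uses strict inequalities.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that can hold the series."""
    lo, hi = series.min(), series.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    return series.dtype

toy = pd.DataFrame({
    'demand': np.array([0, 3, 120], dtype=np.int64),
    'wm_yr_wk': np.array([11101, 11102, 11621], dtype=np.int64),
})
before = toy.memory_usage(index=False).sum()
for col in toy.columns:
    toy[col] = toy[col].astype(smallest_int_dtype(toy[col]))
after = toy.memory_usage(index=False).sum()
print(toy.dtypes.to_dict(), before, after)

# Caution for the float branch: float16 keeps only about 3 significant
# decimal digits, so a value such as 2049.0 is stored as 2048.0.
lossy = float(np.float16(2049.0))
print(lossy)
```

The float16 caveat is worth keeping in mind for price or rolling-mean columns, where the rounding may be large enough to matter.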

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:  # process each column
        col_type = df[col].dtypes
        if col_type in numerics:  # numeric columns only: pick a smaller dtype based on the column's min and max
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df


A helper that displays pandas DataFrames neatly (by default only the head of each one).

In [None]:
import IPython

def display(*dfs, head=True):
    for df in dfs:
        IPython.display.display(df.head() if head else df)

## Train/test split and model fitting

Load the pickle file created in the 02_make_features notebook, which already contains the added feature columns.
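
The feature notebook itself is not shown here. As a rough sketch only, lag and rolling-mean columns of the kind used later (for example `demand_lag_1`) could be built per series as follows. The toy data, the 2-day window, and the column name `rolling_demand_mean_2` are assumptions and do not reproduce the actual feature code.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['A'] * 4 + ['B'] * 4,
    'date': list(pd.date_range('2016-01-01', periods=4)) * 2,
    'demand': [1, 2, 3, 4, 10, 20, 30, 40],
}).sort_values(['id', 'date'])

grouped = toy.groupby('id')['demand']
# Previous day's demand within each series (NaN on each series' first day)
toy['demand_lag_1'] = grouped.shift(1)
# Mean of the two preceding days; shifting first keeps the current day out of its own feature
toy['rolling_demand_mean_2'] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())
print(toy)
```

Grouping by `id` before shifting prevents values from one series leaking into the first rows of the next.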

In [None]:
data = pd.read_pickle('../input/02-make-features/m5_add_features_data_sample.pickle')
data.head(3)

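Split the data: rows dated up to and including 2016-03-27 are used for training, and the following 28 days (2016-03-28 to 2016-04-24) are used for validation  
(this is the set LightGBM's early stopping monitors).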
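
A quick way to sanity-check the boundaries is to apply the same date filters to a toy calendar and count the days in each part. The toy frame and variable names below are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({'date': pd.date_range('2016-03-20', '2016-05-22', freq='D')})
toy['demand'] = range(len(toy))

train_end = pd.Timestamp('2016-03-27')
val_end = pd.Timestamp('2016-04-24')

train_part = toy[toy['date'] <= train_end]
val_part = toy[(toy['date'] > train_end) & (toy['date'] <= val_end)]
test_part = toy[toy['date'] > val_end]

print(len(train_part), len(val_part), len(test_part))
print(val_part['date'].min().date(), val_part['date'].max().date())
```

The three filters are mutually exclusive and cover every row, so no date is used for both training and validation.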

In [None]:
# Build the training set
x_train = data[data['date'] <= '2016-03-27']
y_train = x_train['demand']

# Build the validation set
x_val = data[(data['date'] > '2016-03-27') & (data['date'] <= '2016-04-24')]
y_val = x_val['demand']

# Build the set used for the submission
# (copy so that writing predictions into it later does not act on a view of `data`)
test = data[(data['date'] > '2016-04-24')].copy()

# Delete data (to free memory)
#del data
#gc.collect()

Fit the model with LightGBM.  
* Early stopping uses the overall RMSE as its metric, which differs from WRMSSE, the competition metric.
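
If a score closer to WRMSSE is wanted for early stopping, LightGBM accepts a custom metric through the `feval` argument of `lgb.train`: a function that takes `(preds, eval_data)` and returns `(name, value, is_higher_better)`. The sketch below covers only the item level of the hierarchy. `make_wrmsse_feval`, `series_idx`, `weights`, `scales`, and the stand-in `ToyDataset` class are all hypothetical names introduced for illustration, and the per-series weights and scales are assumed to have been computed beforehand.

```python
import numpy as np

def make_wrmsse_feval(series_idx, weights, scales):
    """Build a LightGBM-style custom metric (item level only).

    series_idx: series index (0..n_series-1) of every validation row
    weights:    per-series weights, assumed to sum to 1
    scales:     per-series mean squared naive one-step error on the training period
    """
    series_idx = np.asarray(series_idx)
    weights = np.asarray(weights, dtype=float)
    scales = np.asarray(scales, dtype=float)
    n_series = len(weights)
    rows_per_series = np.bincount(series_idx, minlength=n_series)

    def feval(preds, eval_data):
        y_true = eval_data.get_label()
        sq_err = (y_true - preds) ** 2
        # Sum squared errors per series in one vectorised pass, then average
        mse = np.bincount(series_idx, weights=sq_err, minlength=n_series) / rows_per_series
        score = float(np.sum(weights * np.sqrt(mse / scales)))
        return 'wrmsse_item_level', score, False  # lower is better

    return feval

class ToyDataset:
    """Stand-in for lgb.Dataset that exposes only get_label()."""
    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label

feval = make_wrmsse_feval(series_idx=[0, 0, 1, 1], weights=[0.5, 0.5], scales=[1.0, 4.0])
name, score, higher_is_better = feval(np.array([1.0, 3.0, 0.0, 4.0]), ToyDataset([2.0, 2.0, 2.0, 2.0]))
print(name, score, higher_is_better)
```

With a real `lgb.Dataset`, the returned function would be passed as `feval=...` to `lgb.train`. The row order of `series_idx` must match the row order of the validation set.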

In [None]:
# List of features fed to the model
features = [
    "cat_id",
    "dept_id",
    "item_id",
    "state_id",
    "store_id",
    "event_name_1",
    "event_name_2",
    "snap_CA",
    "snap_TX",
    "snap_WI",
    
    # Sales (demand) features
    "rolling_demand_mean_7",
    "last_year_rolling_demand_mean",
    "demand_lag_1",
    "demand_lag_2",

    # Price features
    "sell_price",
    "price_volatility_w1",

    # Time features
    "year",
    "month",
    "dayofweek",
    "wm_yr_wk",
]

# Set the LightGBM hyperparameters
params = {
    'boosting_type': 'gbdt',
    'metric': 'rmse',
    'objective': 'regression',
    'n_jobs': -1,
    'seed': 236,
    'learning_rate': 0.1,
    'bagging_fraction': 0.75,
    'bagging_freq': 10, 
    'colsample_bytree': 0.75}

# Convert the training and validation sets into LightGBM datasets
train_set = lgb.Dataset(x_train[features], y_train)
val_set = lgb.Dataset(x_val[features], y_val)

# Delete variables that are no longer needed
del x_train, y_train
gc.collect()

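# Train the model. Training stops early (early stopping) if the validation RMSE does not improve for 50 consecutive rounds.
# Note: LightGBM 4.0+ takes these settings via callbacks (lgb.early_stopping, lgb.log_evaluation) instead of keyword arguments.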
model = lgb.train(params, train_set, num_boost_round = 2500, early_stopping_rounds = 50, valid_sets = [train_set, val_set], verbose_eval = 100)

# Use the trained model to predict demand on the validation set
val_pred = model.predict(x_val[features])
# Compute the root mean squared error (RMSE) between actual and predicted values to check accuracy
val_score = np.sqrt(metrics.mean_squared_error(y_val, val_pred))
print(f'Our val rmse score is {val_score}')

# Likewise, use the trained model to predict demand on the submission set
y_pred = model.predict(test[features])
# Store the predictions in the demand column
test['demand'] = y_pred


## Writing the submission file

In [None]:
# Load the sample submission file
submission = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sample_submission.csv')

In [None]:
# From the submission set, whose demand column now holds the model's predictions, keep only the required columns
predictions = test[['id', 'date', 'demand']]
# Pivot to wide format and rename the per-day demand columns to F1..F28
predictions = pd.pivot(predictions, index = 'id', columns = 'date', values = 'demand').reset_index()
predictions.columns = ['id'] + ['F' + str(i + 1) for i in range(28)]

# Keep only the records whose id contains 'evaluation'
evaluation_rows = [row for row in submission['id'] if 'evaluation' in row] 
evaluation = submission[submission['id'].isin(evaluation_rows)]

# Join the ids from submission.csv with the predictions built above, using id as the key
validation = submission[['id']].merge(predictions, on = 'id')

# Stack the validation-period and evaluation-period rows vertically
final = pd.concat([validation, evaluation])
# Write final out as submission.csv
final.to_csv('submission.csv', index = False)
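
The long-to-wide step above can be checked on a toy frame. The ids, dates, and values below are made up for illustration, and only two forecast days are used instead of 28.

```python
import pandas as pd

long_df = pd.DataFrame({
    'id': ['A_validation', 'A_validation', 'B_validation', 'B_validation'],
    'date': pd.to_datetime(['2016-04-25', '2016-04-26', '2016-04-25', '2016-04-26']),
    'demand': [1.0, 2.0, 3.0, 4.0],
})

# One row per id, one column per date (columns come out sorted by date)
wide = long_df.pivot(index='id', columns='date', values='demand').reset_index()
# Rename the date columns to F1, F2, ... in chronological order
wide.columns = ['id'] + [f'F{i + 1}' for i in range(wide.shape[1] - 1)]
print(wide)
```

Deriving the number of `F` columns from the pivoted frame, rather than hard-coding 28, makes a mismatch in the number of forecast days easier to spot.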


In [None]:
# Inspect the submission dataset
final.head(5)