# Recruit Restaurant Visitor Forecasting

Load the `data_all_moto` data frame produced by the [preprocessing code](https://www.kaggle.com/mnthasi/210516-recruit-pre-japanese) and run model training only.<br>
Processing time: about 2 minutes of model training × 6 models.
<br><br>
[Model training method]
* Train a separate model for each week of the forecast period.
* The week-2 model, for example, predicts without using the previous week's visitor counts.
* Lag features are dropped as needed so that validation stays realistic.

The code is split across three notebooks:<br>
https://www.kaggle.com/mnthasi/recruit-restaurant-01-pre
https://www.kaggle.com/mnthasi/recruit-restaurant-02-lightgbm-japanese
https://www.kaggle.com/mnthasi/recruit-restaurant-03-ensemble-japanese

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

# Import the preprocessed data frame
`visit_date` should keep its date type, so it is parsed with the `parse_dates` option on load.
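
As a small illustration of what `parse_dates` changes, the sketch below reads a made-up two-row CSV with and without the option. The CSV text and variable names are assumptions for the example, not the real preprocessed file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real preprocessed file
csv_text = "visit_date,visitors\n2017-04-22,12\n2017-04-23,30\n"

# Without parse_dates the column is read as plain strings
raw = pd.read_csv(io.StringIO(csv_text))
# With parse_dates it becomes a datetime column, so date comparisons work directly
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["visit_date"])

print(raw["visit_date"].dtype)
print(parsed["visit_date"].dtype)

# Date-based masks like the ones used for the train/va/test split
mask = parsed["visit_date"] <= pd.Timestamp(2017, 4, 22)
print(mask.tolist())
```

With the column parsed as dates, the comparisons against `dt.datetime(...)` later in the notebook behave as date comparisons rather than string comparisons.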

In [None]:
data_all_moto = pd.read_csv("../input/data-all-11/data_all (11).csv", parse_dates=['visit_date'])

In [None]:
data_all_moto

# Drop features

In [None]:
data_all_moto = data_all_moto.drop(["lag70","lag63"], axis=1)
data_all_moto = data_all_moto.drop(["index"], axis=1)
data_all_moto = data_all_moto.drop(["lag_mean_7-21","lag_mean_7-28","lag_mean_7-14"], axis=1)
# data_all_moto = data_all_moto.loc[(data_all_moto['air_genre_name'] == 13)]

In [None]:
data_all_moto.visit_date

In [None]:
# Work on a copy so that the original data frame is left untouched
data_all = data_all_moto.copy()

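# Define set_va, the function that sets the validation data range
Pay attention to the order of the periods: each later assignment overwrites the earlier ones.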

In [None]:
def set_va():
    # Later assignments overwrite earlier ones, so the order matters.
    # Mark everything up to the end of the validation period (4/22) as 'va'
    data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,4,22)), 'set'] = 'va'
    # Then relabel everything up to 4/15 as 'train', leaving 4/16-4/22 as validation
    data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,4,15)), 'set'] = 'train'
    print("set_va done")
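
The same labelling can also be written without relying on overwrite order. The sketch below is one possible alternative on toy data; `label_sets` and the `'other'` label are names introduced here for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def label_sets(dates, va_start, va_end):
    """Return 'train'/'va'/'other' labels for a datetime Series (hypothetical helper)."""
    conditions = [dates < va_start, dates <= va_end]
    # np.select uses the first condition that matches, so dates before
    # va_start become 'train' and the rest up to va_end become 'va'
    labels = np.select(conditions, ["train", "va"], default="other")
    return pd.Series(labels, index=dates.index)

toy = pd.DataFrame({"visit_date": pd.to_datetime(
    ["2017-04-15", "2017-04-16", "2017-04-22", "2017-04-23"])})
toy["set"] = label_sets(toy["visit_date"],
                        pd.Timestamp(2017, 4, 16), pd.Timestamp(2017, 4, 22))
print(toy)
```

Passing the validation dates as arguments keeps the range visible at the call site, whereas `set_va` has it hard-coded and modifies the global `data_all`.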

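# -------------------- Week 1 starts here -------------------
Build the model that predicts 4/23-4/29.<br>
Train on data through 4/15 and validate on 4/16-4/22 (the ranges set in `set_va`),<br>
then predict 4/23-4/29.

* ①　4/23-4/29: no lag features dropped
* ②　4/30-5/6: drop lag7 (nothing from the last 7 days is known)
* ③　5/7-5/13: drop lag7,14 (nothing from the last 14 days is known)
* ④　5/14-5/20: drop lag7,14,21 (nothing from the last 21 days is known)
* ⑤　5/21-5/27: drop lag7,14,21,28 (nothing from the last 28 days is known)
* ⑥　5/28-5/31: drop lag7,14,21,28,35 (nothing from the last 35 days is known)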
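
The schedule in the list above follows a regular pattern, so it can be generated rather than typed by hand. The sketch below is only an illustration: the `weeks` list and its dictionary keys are assumptions, and the notebook itself hard-codes these values in each weekly cell.

```python
import datetime as dt

first_day = dt.date(2017, 4, 23)   # first day to predict
last_day = dt.date(2017, 5, 31)    # last day to predict

weeks = []
start = first_day
week_no = 1
while start <= last_day:
    # Each block covers 7 days, except the last one, which is cut at last_day
    end = min(start + dt.timedelta(days=6), last_day)
    # Week n drops lag7 ... lag(7*(n-1)): those visitor counts are not known yet
    drops = [f"lag{7 * k}" for k in range(1, week_no)]
    weeks.append({"week": week_no, "start": start, "end": end, "drops": drops})
    start = end + dt.timedelta(days=1)
    week_no += 1

for w in weeks:
    print(w["week"], w["start"], w["end"], w["drops"])
```

This yields six blocks, with the final one shortened to 5/28-5/31.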

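# [Important] Split into train/va/test by period
When building the data for the actual submission, extend the train period through 4/22
(otherwise the most recent training data is left out, which hurts the predictions).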

In [None]:
# Mark everything through 2017/4/29 as test, then everything before 4/23 as not-test #【YOU NEED TO CHANGE】
data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,4,29)), 'set'] = 'test'
data_all.loc[(data_all['visit_date'] < dt.datetime(2017,4,23)), 'set'] = 'not-test'

# Set the validation data range
set_va()

# Drop visit_date
data_all = data_all.drop(columns="visit_date")

# Drop visitors (the target variable) and id (no predictive value) from the training features
columns_drop2 = ["visitors","id"]

# Split into train, test, and va
train_fit = pd.DataFrame()
train_fit = data_all.loc[data_all.set == "train"]
train_fit_y = pd.DataFrame()
train_fit_y["visitors"] = train_fit.visitors
train_fit_x = train_fit.drop(columns=columns_drop2)
train_fit_x = train_fit_x.drop(columns="set")

va_fit = pd.DataFrame()
va_fit = data_all.loc[data_all.set == "va"]
va_fit_y = pd.DataFrame()
va_fit_y["visitors"] = va_fit.visitors
va_fit_x = va_fit.drop(columns=columns_drop2)
va_fit_x = va_fit_x.drop(columns="set")

test_fit = pd.DataFrame()
test_fit = data_all.loc[data_all.set == "test"]
test_fit_x = test_fit.drop(columns="visitors")
test_fit_x = test_fit_x.drop(columns="set")
# Some date/store combinations in the test period are not scored, and their id ends up as 0
# Remove the rows with id=0
test_fit_x = test_fit_x[test_fit_x['id'] != 0]

test_id = test_fit_x.id
test_fit_x = test_fit_x.drop(columns="id")



# Train LightGBM
Training on all the data takes about 1 minute

In [None]:
%%time
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

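# Parameter settings
params = {
    "objective" : "regression",
    "metric" : "rmse",
    "num_leaves" : 40,
    "learning_rate" : 0.01,
    "bagging_fraction" : 0.8,
    "feature_fraction" : 0.4, # use only 40% of the features for each tree
    "bagging_freq" : 6, # LightGBM's name for this parameter; "bagging_frequency" is not recognized
    "bagging_seed" : 42,
    "verbosity" : -1, # controls how much training progress is logged
    "seed": 42
}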

train_lgb = lgb.Dataset(train_fit_x, label=train_fit_y)
va_lgb = lgb.Dataset(va_fit_x, label=va_fit_y)

evals_result = {}

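# Early stopping, logging, and metric recording are passed as callbacks (LightGBM 3.3+)
model_lgb1 = lgb.train(params, train_set = train_lgb,
                  num_boost_round = 10000,
                  valid_sets=[train_lgb, va_lgb],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=1000),
                             lgb.record_evaluation(evals_result)])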

# Predict on the validation data

In [None]:
va_visitors = data_all.loc[(data_all.set == "va"),"visitors"]

pred = model_lgb1.predict(va_fit_x, num_iteration=model_lgb1.best_iteration)

columns = ["pred_visitors"]
pred_d = pd.DataFrame(data=pred, columns = columns)

# Negative predictions are replaced by their absolute value
pred_d["pred_visitors"] = pred_d["pred_visitors"].abs()

# Compute the RMSLE score
score1 = np.sqrt(mean_squared_log_error(va_visitors, pred_d["pred_visitors"]))
print("---------------RMSLE-score----------------")
score1
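
For reference, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + actual). The sketch below writes that formula out in NumPy; the `rmsle` helper is a name introduced here for illustration, and it is meant to agree with the `np.sqrt(mean_squared_log_error(...))` expression used above.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero visitor counts well defined
    diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))

print(rmsle([10, 0, 25], [10, 0, 25]))  # identical inputs give 0.0
print(rmsle([0], [np.e - 1]))
```

The logarithm is only defined for values above -1, which is why negative predictions have to be handled before scoring.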

# Visualization

In [None]:
columns = ["visitors"]
va_fit_d = pd.DataFrame(data=va_visitors, columns = columns)
va_fit_d = va_fit_d.reset_index(drop=True)

sns.scatterplot(x=va_fit_d['visitors'], y=pred_d['pred_visitors'])
sns.scatterplot(x=va_fit_d['visitors'], y=va_fit_d['visitors']) #perfect fitting line

# Predict on the test data

In [None]:
# Predict with the model
pred = model_lgb1.predict(test_fit_x)

# Build the submission data
columns_ck = ["visitors"]
submission_ck = pd.DataFrame(data=pred, columns = columns_ck)
test_id_df = pd.DataFrame(data=test_id)
test_id_df = test_id_df.reset_index(drop=True)
test_id_df.columns = ["id"]

test_reindex = test_fit_x.reset_index(drop=True)

submission = pd.DataFrame()
submission["id"] = test_id_df.id
submission["visitors"] = submission_ck.visitors

# Write to a csv file
submission.to_csv("submission_week1.csv", index=False)

In [None]:
# Restore data_all from the original
data_all = data_all_moto.copy()

# ------------------ Week 2 starts here -----------------------
The following need to change, each flagged with #【YOU NEED TO CHANGE】:
* the features to drop
* the date range of the test set
* the name of the lgb1 model
* the name of submission.csv
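
The per-week changes listed above could also be passed as arguments to a single function instead of being edited in copied cells. The sketch below shows that idea on toy data for the splitting step only; `split_week` and its parameters are hypothetical and do not appear in the notebook, which keeps one cell per week.

```python
import pandas as pd

def split_week(df, test_start, test_end, va_start, va_end, drops):
    """Hypothetical helper: return (train, va, test) frames for one forecast week."""
    df = df.drop(columns=drops)
    dates = df["visit_date"]
    train = df[dates < va_start]
    va = df[(dates >= va_start) & (dates <= va_end)]
    test = df[(dates >= test_start) & (dates <= test_end)]
    # visit_date is only needed for splitting, not as a model feature
    return tuple(part.drop(columns="visit_date") for part in (train, va, test))

toy = pd.DataFrame({
    "visit_date": pd.to_datetime(
        ["2017-04-10", "2017-04-20", "2017-05-02", "2017-05-10"]),
    "lag7": [1.0, 2.0, 3.0, 4.0],
    "lag14": [5.0, 6.0, 7.0, 8.0],
    "visitors": [11, 12, 0, 0],
})

# Week 2 settings: predict 4/30-5/6 and drop lag7
train, va, test = split_week(
    toy,
    test_start=pd.Timestamp(2017, 4, 30), test_end=pd.Timestamp(2017, 5, 6),
    va_start=pd.Timestamp(2017, 4, 16), va_end=pd.Timestamp(2017, 4, 22),
    drops=["lag7"],
)
print(len(train), len(va), len(test), list(test.columns))
```

Because the function never modifies its input, the original data frame would not need to be restored between weeks.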

In [None]:
%%time
#Drop features
drops = ["lag7"]                                                              #【YOU NEED TO CHANGE】
data_all = data_all.drop(columns = drops)

# Mark everything through 2017/5/6 as test, then everything before 4/30 as not-test #【YOU NEED TO CHANGE】
data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,5,6)), 'set'] = 'test'
data_all.loc[(data_all['visit_date'] < dt.datetime(2017,4,30)), 'set'] = 'not-test'

# Set the validation data range
set_va()

# Drop visit_date
data_all = data_all.drop(columns="visit_date")

# Drop visitors (the target variable) and id (no predictive value) from the training features
columns_drop2 = ["visitors","id"]

# Split into train, test, and va
train_fit = pd.DataFrame()
train_fit = data_all.loc[data_all.set == "train"]
train_fit_y = pd.DataFrame()
train_fit_y["visitors"] = train_fit.visitors
train_fit_x = train_fit.drop(columns=columns_drop2)
train_fit_x = train_fit_x.drop(columns="set")

va_fit = pd.DataFrame()
va_fit = data_all.loc[data_all.set == "va"]
va_fit_y = pd.DataFrame()
va_fit_y["visitors"] = va_fit.visitors
va_fit_x = va_fit.drop(columns=columns_drop2)
va_fit_x = va_fit_x.drop(columns="set")

test_fit = pd.DataFrame()
test_fit = data_all.loc[data_all.set == "test"]
test_fit_x = test_fit.drop(columns="visitors")
test_fit_x = test_fit_x.drop(columns="set")
# Some date/store combinations in the test period are not scored, and their id ends up as 0
# Remove the rows with id=0
test_fit_x = test_fit_x[test_fit_x['id'] != 0]

test_id = test_fit_x.id
test_fit_x = test_fit_x.drop(columns="id")

# Train the LightGBM model
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error


train_lgb = lgb.Dataset(train_fit_x, label=train_fit_y)
va_lgb = lgb.Dataset(va_fit_x, label=va_fit_y)

evals_result = {}

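# Early stopping, logging, and metric recording are passed as callbacks (LightGBM 3.3+)
model_lgb2 = lgb.train(params, train_set = train_lgb,                          #【YOU NEED TO CHANGE】
                  num_boost_round = 10000,
                  valid_sets=[train_lgb, va_lgb],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=1000),
                             lgb.record_evaluation(evals_result)])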

va_visitors = data_all.loc[(data_all.set == "va"),"visitors"]

pred = model_lgb2.predict(va_fit_x, num_iteration=model_lgb2.best_iteration)  #【YOU NEED TO CHANGE】*2

columns = ["pred_visitors"]
pred_d = pd.DataFrame(data=pred, columns = columns)

# Negative predictions are replaced by their absolute value
pred_d["pred_visitors"] = pred_d["pred_visitors"].abs()

# Compute the RMSLE score
score2 = np.sqrt(mean_squared_log_error(va_visitors, pred_d["pred_visitors"])) #【YOU NEED TO CHANGE】
print("---------------RMSLE-score----------------")
print(score2)                                                                #【YOU NEED TO CHANGE】

# Predict with the model
pred = model_lgb2.predict(test_fit_x)                                   #【YOU NEED TO CHANGE】

# Build the submission data
columns_ck = ["visitors"]
submission_ck = pd.DataFrame(data=pred, columns = columns_ck)
test_id_df = pd.DataFrame(data=test_id)
test_id_df = test_id_df.reset_index(drop=True)
test_id_df.columns = ["id"]

test_reindex = test_fit_x.reset_index(drop=True)

submission = pd.DataFrame()
submission["id"] = test_id_df.id
submission["visitors"] = submission_ck.visitors

# Write to a csv file
submission.to_csv("submission_week2.csv", index=False)                #【YOU NEED TO CHANGE】

# Restore data_all from the original
data_all = data_all_moto.copy()

# ------------------ Week 3 starts here -----------------------

In [None]:
data_all = data_all_moto.copy()

In [None]:
%%time
#Drop features
drops = ["lag7","lag14"]                                                  #【YOU NEED TO CHANGE】
data_all = data_all.drop(columns = drops)

# Mark everything through 2017/5/13 as test, then everything before 5/7 as not-test #【YOU NEED TO CHANGE】
data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,5,13)), 'set'] = 'test'
data_all.loc[(data_all['visit_date'] < dt.datetime(2017,5,7)), 'set'] = 'not-test'

# Set the validation data range
set_va()

# Drop visit_date
data_all = data_all.drop(columns="visit_date")

# Drop visitors (the target variable) and id (no predictive value) from the training features
columns_drop2 = ["visitors","id"]

# Split into train, test, and va
train_fit = pd.DataFrame()
train_fit = data_all.loc[data_all.set == "train"]
train_fit_y = pd.DataFrame()
train_fit_y["visitors"] = train_fit.visitors
train_fit_x = train_fit.drop(columns=columns_drop2)
train_fit_x = train_fit_x.drop(columns="set")

va_fit = pd.DataFrame()
va_fit = data_all.loc[data_all.set == "va"]
va_fit_y = pd.DataFrame()
va_fit_y["visitors"] = va_fit.visitors
va_fit_x = va_fit.drop(columns=columns_drop2)
va_fit_x = va_fit_x.drop(columns="set")

test_fit = pd.DataFrame()
test_fit = data_all.loc[data_all.set == "test"]
test_fit_x = test_fit.drop(columns="visitors")
test_fit_x = test_fit_x.drop(columns="set")
# Some date/store combinations in the test period are not scored, and their id ends up as 0
# Remove the rows with id=0
test_fit_x = test_fit_x[test_fit_x['id'] != 0]

test_id = test_fit_x.id
test_fit_x = test_fit_x.drop(columns="id")

# Train the LightGBM model
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

train_lgb = lgb.Dataset(train_fit_x, label=train_fit_y)
va_lgb = lgb.Dataset(va_fit_x, label=va_fit_y)

evals_result = {}

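# Early stopping, logging, and metric recording are passed as callbacks (LightGBM 3.3+)
model_lgb3 = lgb.train(params, train_set = train_lgb,                          #【YOU NEED TO CHANGE】
                  num_boost_round = 10000,
                  valid_sets=[train_lgb, va_lgb],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=1000),
                             lgb.record_evaluation(evals_result)])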

va_visitors = data_all.loc[(data_all.set == "va"),"visitors"]

pred = model_lgb3.predict(va_fit_x, num_iteration=model_lgb3.best_iteration)  #【YOU NEED TO CHANGE】*2

columns = ["pred_visitors"]
pred_d = pd.DataFrame(data=pred, columns = columns)

# Negative predictions are replaced by their absolute value
pred_d["pred_visitors"] = pred_d["pred_visitors"].abs()

# Compute the RMSLE score
score3 = np.sqrt(mean_squared_log_error(va_visitors, pred_d["pred_visitors"]))  #【YOU NEED TO CHANGE】
print("---------------RMSLE-score----------------")
print(score3)                                                                #【YOU NEED TO CHANGE】
print()

# Predict with the model
pred = model_lgb3.predict(test_fit_x)                                   #【YOU NEED TO CHANGE】

# Build the submission data
columns_ck = ["visitors"]
submission_ck = pd.DataFrame(data=pred, columns = columns_ck)
test_id_df = pd.DataFrame(data=test_id)
test_id_df = test_id_df.reset_index(drop=True)
test_id_df.columns = ["id"]

test_reindex = test_fit_x.reset_index(drop=True)

submission = pd.DataFrame()
submission["id"] = test_id_df.id
submission["visitors"] = submission_ck.visitors

# Write to a csv file
submission.to_csv("submission_week3.csv", index=False)                #【YOU NEED TO CHANGE】

# Restore data_all from the original
data_all = data_all_moto.copy()

# ------------------ Week 4 starts here -----------------------

In [None]:
data_all = data_all_moto.copy()

In [None]:
%%time
#Drop features
drops = ["lag7","lag14","lag21"]                                 #【YOU NEED TO CHANGE】
data_all = data_all.drop(columns = drops)

# Mark everything through 2017/5/20 as test, then everything before 5/14 as not-test #【YOU NEED TO CHANGE】
data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,5,20)), 'set'] = 'test'
data_all.loc[(data_all['visit_date'] < dt.datetime(2017,5,14)), 'set'] = 'not-test'

# Set the validation data range
set_va()

# Drop visit_date
data_all = data_all.drop(columns="visit_date")

# Drop visitors (the target variable) and id (no predictive value) from the training features
columns_drop2 = ["visitors","id"]

# Split into train, test, and va
train_fit = pd.DataFrame()
train_fit = data_all.loc[data_all.set == "train"]
train_fit_y = pd.DataFrame()
train_fit_y["visitors"] = train_fit.visitors
train_fit_x = train_fit.drop(columns=columns_drop2)
train_fit_x = train_fit_x.drop(columns="set")

va_fit = pd.DataFrame()
va_fit = data_all.loc[data_all.set == "va"]
va_fit_y = pd.DataFrame()
va_fit_y["visitors"] = va_fit.visitors
va_fit_x = va_fit.drop(columns=columns_drop2)
va_fit_x = va_fit_x.drop(columns="set")

test_fit = pd.DataFrame()
test_fit = data_all.loc[data_all.set == "test"]
test_fit_x = test_fit.drop(columns="visitors")
test_fit_x = test_fit_x.drop(columns="set")
# Some date/store combinations in the test period are not scored, and their id ends up as 0
# Remove the rows with id=0
test_fit_x = test_fit_x[test_fit_x['id'] != 0]

test_id = test_fit_x.id
test_fit_x = test_fit_x.drop(columns="id")

# Train the LightGBM model
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

train_lgb = lgb.Dataset(train_fit_x, label=train_fit_y)
va_lgb = lgb.Dataset(va_fit_x, label=va_fit_y)

evals_result = {}

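# Early stopping, logging, and metric recording are passed as callbacks (LightGBM 3.3+)
model_lgb4 = lgb.train(params, train_set = train_lgb,                          #【YOU NEED TO CHANGE】
                  num_boost_round = 100000,
                  valid_sets=[train_lgb, va_lgb],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=1000),
                             lgb.record_evaluation(evals_result)])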

va_visitors = data_all.loc[(data_all.set == "va"),"visitors"]

pred = model_lgb4.predict(va_fit_x, num_iteration=model_lgb4.best_iteration) #【YOU NEED TO CHANGE】*2

columns = ["pred_visitors"]
pred_d = pd.DataFrame(data=pred, columns = columns)

# Negative predictions are replaced by their absolute value
pred_d["pred_visitors"] = pred_d["pred_visitors"].abs()

# Compute the RMSLE score
score4 = np.sqrt(mean_squared_log_error(va_visitors, pred_d["pred_visitors"]))  #【YOU NEED TO CHANGE】
print("---------------RMSLE-score----------------")
print(score4)                                                                #【YOU NEED TO CHANGE】

# Predict with the model
pred = model_lgb4.predict(test_fit_x)                                   #【YOU NEED TO CHANGE】

# Build the submission data
columns_ck = ["visitors"]
submission_ck = pd.DataFrame(data=pred, columns = columns_ck)
test_id_df = pd.DataFrame(data=test_id)
test_id_df = test_id_df.reset_index(drop=True)
test_id_df.columns = ["id"]

test_reindex = test_fit_x.reset_index(drop=True)

submission = pd.DataFrame()
submission["id"] = test_id_df.id
submission["visitors"] = submission_ck.visitors

# Write to a csv file
submission.to_csv("submission_week4.csv", index=False)                #【YOU NEED TO CHANGE】

# Restore data_all from the original
data_all = data_all_moto.copy()

# ------------------ Week 5 starts here -----------------------

In [None]:
%%time
#Drop features
drops = ["lag7","lag14","lag21","lag28"]                #【YOU NEED TO CHANGE】
data_all = data_all.drop(columns = drops)

# Mark everything through 2017/5/27 as test, then everything before 5/21 as not-test #【YOU NEED TO CHANGE】
data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,5,27)), 'set'] = 'test'
data_all.loc[(data_all['visit_date'] < dt.datetime(2017,5,21)), 'set'] = 'not-test'

# Set the validation data range
set_va()

# Drop visit_date
data_all = data_all.drop(columns="visit_date")

# Drop visitors (the target variable) and id (no predictive value) from the training features
columns_drop2 = ["visitors","id"]

# Split into train, test, and va
train_fit = pd.DataFrame()
train_fit = data_all.loc[data_all.set == "train"]
train_fit_y = pd.DataFrame()
train_fit_y["visitors"] = train_fit.visitors
train_fit_x = train_fit.drop(columns=columns_drop2)
train_fit_x = train_fit_x.drop(columns="set")

va_fit = pd.DataFrame()
va_fit = data_all.loc[data_all.set == "va"]
va_fit_y = pd.DataFrame()
va_fit_y["visitors"] = va_fit.visitors
va_fit_x = va_fit.drop(columns=columns_drop2)
va_fit_x = va_fit_x.drop(columns="set")

test_fit = pd.DataFrame()
test_fit = data_all.loc[data_all.set == "test"]
test_fit_x = test_fit.drop(columns="visitors")
test_fit_x = test_fit_x.drop(columns="set")
# Some date/store combinations in the test period are not scored, and their id ends up as 0
# Remove the rows with id=0
test_fit_x = test_fit_x[test_fit_x['id'] != 0]

test_id = test_fit_x.id
test_fit_x = test_fit_x.drop(columns="id")

# Train the LightGBM model
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

train_lgb = lgb.Dataset(train_fit_x, label=train_fit_y)
va_lgb = lgb.Dataset(va_fit_x, label=va_fit_y)

evals_result = {}

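# Early stopping, logging, and metric recording are passed as callbacks (LightGBM 3.3+)
model_lgb5 = lgb.train(params, train_set = train_lgb,                          #【YOU NEED TO CHANGE】
                  num_boost_round = 100000,
                  valid_sets=[train_lgb, va_lgb],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=1000),
                             lgb.record_evaluation(evals_result)])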

va_visitors = data_all.loc[(data_all.set == "va"),"visitors"]

pred = model_lgb5.predict(va_fit_x, num_iteration=model_lgb5.best_iteration) #【YOU NEED TO CHANGE】*2

columns = ["pred_visitors"]
pred_d = pd.DataFrame(data=pred, columns = columns)

# Negative predictions are replaced by their absolute value
pred_d["pred_visitors"] = pred_d["pred_visitors"].abs()

# Compute the RMSLE score
score5 = np.sqrt(mean_squared_log_error(va_visitors, pred_d["pred_visitors"]))  #【YOU NEED TO CHANGE】
print("---------------RMSLE-score----------------")
print(score5)                                                                #【YOU NEED TO CHANGE】

# Predict with the model
pred = model_lgb5.predict(test_fit_x)                                   #【YOU NEED TO CHANGE】

# Build the submission data
columns_ck = ["visitors"]
submission_ck = pd.DataFrame(data=pred, columns = columns_ck)
test_id_df = pd.DataFrame(data=test_id)
test_id_df = test_id_df.reset_index(drop=True)
test_id_df.columns = ["id"]

test_reindex = test_fit_x.reset_index(drop=True)

submission = pd.DataFrame()
submission["id"] = test_id_df.id
submission["visitors"] = submission_ck.visitors

# Write to a csv file
submission.to_csv("submission_week5.csv", index=False)                #【YOU NEED TO CHANGE】

# Restore data_all from the original
data_all = data_all_moto.copy()

# ------------------ Week 6 starts here -----------------------

In [None]:
%%time
#Drop features
drops = ["lag7","lag14","lag21","lag28","lag35"]                #【YOU NEED TO CHANGE】
data_all = data_all.drop(columns = drops)

# Mark everything through 2017/5/31 as test, then everything before 5/28 as not-test #【YOU NEED TO CHANGE】
data_all.loc[(data_all['visit_date'] <= dt.datetime(2017,5,31)), 'set'] = 'test'
data_all.loc[(data_all['visit_date'] < dt.datetime(2017,5,28)), 'set'] = 'not-test'

# Set the validation data range
set_va()

# Drop visit_date
data_all = data_all.drop(columns="visit_date")

# Drop visitors (the target variable) and id (no predictive value) from the training features
columns_drop2 = ["visitors","id"]

# Split into train, test, and va
train_fit = pd.DataFrame()
train_fit = data_all.loc[data_all.set == "train"]
train_fit_y = pd.DataFrame()
train_fit_y["visitors"] = train_fit.visitors
train_fit_x = train_fit.drop(columns=columns_drop2)
train_fit_x = train_fit_x.drop(columns="set")

va_fit = pd.DataFrame()
va_fit = data_all.loc[data_all.set == "va"]
va_fit_y = pd.DataFrame()
va_fit_y["visitors"] = va_fit.visitors
va_fit_x = va_fit.drop(columns=columns_drop2)
va_fit_x = va_fit_x.drop(columns="set")

test_fit = pd.DataFrame()
test_fit = data_all.loc[data_all.set == "test"]
test_fit_x = test_fit.drop(columns="visitors")
test_fit_x = test_fit_x.drop(columns="set")
# Some date/store combinations in the test period are not scored, and their id ends up as 0
# Remove the rows with id=0
test_fit_x = test_fit_x[test_fit_x['id'] != 0]

test_id = test_fit_x.id
test_fit_x = test_fit_x.drop(columns="id")

# Train the LightGBM model
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

train_lgb = lgb.Dataset(train_fit_x, label=train_fit_y)
va_lgb = lgb.Dataset(va_fit_x, label=va_fit_y)

evals_result = {}

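# Early stopping, logging, and metric recording are passed as callbacks (LightGBM 3.3+)
model_lgb6 = lgb.train(params, train_set = train_lgb,                          #【YOU NEED TO CHANGE】
                  num_boost_round = 100000,
                  valid_sets=[train_lgb, va_lgb],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=1000),
                             lgb.record_evaluation(evals_result)])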

va_visitors = data_all.loc[(data_all.set == "va"),"visitors"]

pred = model_lgb6.predict(va_fit_x, num_iteration=model_lgb6.best_iteration) #【YOU NEED TO CHANGE】*2

columns = ["pred_visitors"]
pred_d = pd.DataFrame(data=pred, columns = columns)

# Negative predictions are replaced by their absolute value
pred_d["pred_visitors"] = pred_d["pred_visitors"].abs()

# Compute the RMSLE score
score6 = np.sqrt(mean_squared_log_error(va_visitors, pred_d["pred_visitors"]))  #【YOU NEED TO CHANGE】
print("---------------RMSLE-score----------------")
print(score6)                                                                #【YOU NEED TO CHANGE】

# Predict with the model
pred = model_lgb6.predict(test_fit_x)                                   #【YOU NEED TO CHANGE】

# Build the submission data
columns_ck = ["visitors"]
submission_ck = pd.DataFrame(data=pred, columns = columns_ck)
test_id_df = pd.DataFrame(data=test_id)
test_id_df = test_id_df.reset_index(drop=True)
test_id_df.columns = ["id"]

test_reindex = test_fit_x.reset_index(drop=True)

submission = pd.DataFrame()
submission["id"] = test_id_df.id
submission["visitors"] = submission_ck.visitors

# Write to a csv file
submission.to_csv("submission_week6.csv", index=False)                #【YOU NEED TO CHANGE】

# Restore data_all from the original
data_all = data_all_moto.copy()

In [None]:
submission1 = pd.read_csv("./submission_week1.csv")
submission2 = pd.read_csv("./submission_week2.csv")
submission3 = pd.read_csv("./submission_week3.csv")
submission4 = pd.read_csv("./submission_week4.csv")
submission5 = pd.read_csv("./submission_week5.csv")
submission6 = pd.read_csv("./submission_week6.csv")

submission_all = pd.concat([submission1, submission2,submission3,submission4,submission5,submission6])
# submission_all = submission1

In [None]:
submission_all

In [None]:
submission_all.sort_values(by='id', ascending=True).head(315)

In [None]:
zeros = submission_all.loc[submission_all['id'] == "0"]
submission_all = submission_all.loc[submission_all["id"] != "0"]
submission_all

In [None]:
zeros.describe()

In [None]:
submission_all.sort_values(by='id', ascending=True).head(315)

In [None]:
submission_all.duplicated().sum()

In [None]:
submission_all

In [None]:
submission_test = submission_all.sort_values(by='id', ascending=True)
# Remove rows with duplicate ids, keeping the first occurrence
# submission_test = submission_test[~submission_test.duplicated()]
submission_test = submission_test.drop_duplicates(subset=["id"],keep="first")
submission_test
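
The combine-and-deduplicate step above can be tried on toy data. In the sketch below the two small frames and their ids are made up for illustration. It also assumes a stable sort (`kind="mergesort"`), which the notebook does not specify, so that rows sharing an id stay in concatenation order and `keep="first"` reliably keeps the earliest week's row.

```python
import pandas as pd

# Hypothetical weekly outputs; the second frame repeats one id from the first
week1 = pd.DataFrame({"id": ["a_2017-04-23", "b_2017-04-23"],
                      "visitors": [10.0, 20.0]})
week2 = pd.DataFrame({"id": ["a_2017-04-30", "b_2017-04-23"],
                      "visitors": [12.0, 99.0]})

combined = pd.concat([week1, week2], ignore_index=True)
# mergesort is stable: rows with equal ids keep their concat order
combined = combined.sort_values(by="id", kind="mergesort")
combined = combined.drop_duplicates(subset=["id"], keep="first")
combined = combined.reset_index(drop=True)
print(combined)
```

Here the duplicated id keeps the value from `week1`, the frame that came first in the concatenation.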

# Write the submission data

In [None]:
submission_test.to_csv("F_submission_week_all.csv", index=False)

In [None]:
print(score1)
print(score2)
print(score3)
print(score4)
print(score5)
print(score6)

# Compute feature contributions
Show the most influential features for each model.<br>
Change the number in `model_lgbN` to inspect the model built for each period.
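
The normalize-and-sort logic can be shown on its own with made-up numbers. In the sketch below the feature names and gain values are assumptions for illustration only; in the notebook they come from the trained model.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and raw gain values
names = ["lag7", "lag14", "dow"]
gain = np.array([30.0, 50.0, 20.0])

# Normalize so the importances sum to 1, then sort in descending order
importance = pd.DataFrame({"feature": names, "importance": gain / gain.sum()})
importance = importance.sort_values("importance", ascending=False)
importance = importance.reset_index(drop=True)
print(importance)
```

Using named columns avoids the `0_features` / `0_importance` labels that result from joining two unnamed single-column frames.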

In [None]:
# Compute feature importance (as a data frame)

# Model to inspect; change the number to look at another week's model
model = model_lgb6
# Feature names as stored in the model, so they always match its importances
cols = model.feature_name()
cols_df = pd.DataFrame(cols)

# importance_type='gain' (recommended) measures the reduction in training loss from each feature's splits
f_importance = np.array(model.feature_importance(importance_type='gain'))
# Normalize (comment out if not needed)
f_importance = f_importance / np.sum(f_importance)
f_importance_df = pd.DataFrame(f_importance)
df_importance = cols_df.join(f_importance_df,lsuffix='_features', rsuffix='_importance')
# Sort in descending order
df_importance = df_importance.sort_values('0_importance', ascending=False)
df_importance