# Kaggle Recruit

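## Introduction

Running a thriving local restaurant isn't always as charming as first impressions suggest. All sorts of unexpected troubles pop up that can hurt business.

One common predicament is that restaurants need to know how many customers to expect each day in order to purchase ingredients and schedule staff effectively. This forecast isn't easy to make, because many unpredictable factors affect restaurant attendance, such as weather and local competition. It is even harder for newer restaurants with little historical data.

Recruit Holdings has unique access to key datasets that could make automated future customer prediction possible. Specifically, Recruit Holdings owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point-of-sale service), and Restaurant Board (reservation log management software).

In this competition, you are challenged to use reservation and visit data to predict the total number of visitors to a restaurant on future dates. This information will help restaurants be much more efficient and let them focus on creating an enjoyable dining experience for their customers.

**Note that the test set intentionally spans a holiday week in Japan called "Golden Week."**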

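## File descriptions
This is a relational dataset from two systems. Each file name is prefixed with its source (air_ or hpg_) to indicate its origin. Each restaurant has a unique air_store_id and hpg_store_id. Note that not all restaurants are covered by both systems, and that data is provided for restaurants beyond the ones you must forecast. Latitudes and longitudes are not exact, to discourage identification of individual restaurants.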

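**air_reserve.csv**  
This file contains reservations made in the Air system. Note that reserve_datetime is the time when the reservation was created, whereas visit_datetime is the future time when the visit will take place.  
air_store_id - the restaurant's ID in the Air system  
visit_datetime - the time of the reservation  
reserve_datetime - the time the reservation was made  
reserve_visitors - the number of visitors for that reservation  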

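**hpg_reserve.csv**  
This file contains reservations made in the hpg system.  
hpg_store_id - the restaurant's ID in the hpg system  
visit_datetime - the time of the reservation  
reserve_datetime - the time the reservation was made  
reserve_visitors - the number of visitors for that reservation  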

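**air_store_info.csv**  
This file contains information about selected Air restaurants. The column names and contents are self-explanatory.  
air_store_id  
air_genre_name  
air_area_name  
latitude  
longitude  
Note: latitude and longitude are those of the area the store belongs to, not of the store itself.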

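**hpg_store_info.csv**  
This file contains information about selected hpg restaurants. The column names and contents are self-explanatory.  
hpg_store_id  
hpg_genre_name  
hpg_area_name  
latitude  
longitude  
Note: latitude and longitude are those of the area the store belongs to, not of the store itself.  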

**store_id_relation.csv**  
This file lets you join the selected restaurants that appear in both the Air and hpg systems.  
hpg_store_id  
air_store_id  

**air_visit_data.csv**  
This file contains historical visit data for the Air restaurants.  
air_store_id  
visit_date - the date  
visitors - the number of visitors to the restaurant on that date  

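**sample_submission.csv**  
This file shows a submission in the correct format, including the days for which you must forecast.  
id - formed by concatenating air_store_id and visit_date with an underscore  
visitors - the number of visitors forecast for the store and date combination  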

**date_info.csv**  
This file gives basic information about the calendar dates in the dataset.  
calendar_date  
day_of_week  
holiday_flg - whether the day is a holiday in Japan  

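# Background knowledge

The data in this competition is time-series data.  
Time-series data is data that keeps growing every day (or every week, month, or year), such as daily sales figures, daily temperatures, or monthly airline passenger counts.  
Time-series data often shows relationships such as "today's sales resemble yesterday's sales."  
Used well, yesterday's sales data may therefore let us predict future sales.  

One caveat when handling time-series data: simply lining the rows up in time order and training on them is very risky.  
Time-series data often contains nothing but a plain date, yet behavior differs completely depending on the day of the week and the season.  
So when working with time-series data, it is important to first align these conditions (the trend) between the periods being compared.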
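As a minimal sketch of what aligning conditions can look like in practice, the snippet below derives day-of-week and month columns from a date column so that like can be compared with like. The toy frame, its values, and the names `toy` and `weekday_mean` are illustrative assumptions, not part of the competition files.

```python
import pandas as pd

# Toy daily series; the visitor counts are made up for illustration only.
toy = pd.DataFrame({
    "visit_date": ["2017-04-21", "2017-04-22", "2017-04-23", "2017-04-24"],
    "visitors": [30, 52, 47, 12],
})

# Parse the strings into timestamps so calendar attributes become available.
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

# Calendar features: weekday name and month number.
toy["day_of_week"] = toy["visit_date"].dt.day_name()
toy["month"] = toy["visit_date"].dt.month

# Average visitors per weekday instead of treating all days alike.
weekday_mean = toy.groupby("day_of_week")["visitors"].mean()
print(weekday_mean)
```

Grouping by such calendar columns is one simple way to check whether weekdays and weekends behave differently before any model is trained.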



# Importing the required libraries

In [88]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Loading the data

In [89]:
air_reserve = pd.read_csv("air_reserve.csv")
hpg_reserve = pd.read_csv("hpg_reserve.csv")

air_store_info = pd.read_csv("air_store_info.csv")
hpg_store_info = pd.read_csv("hpg_store_info.csv")

air_visit_data = pd.read_csv("air_visit_data.csv")

store_id_relation = pd.read_csv("store_id_relation.csv")

sample_submission = pd.read_csv("sample_submission.csv")
date_info = pd.read_csv("date_info.csv")

To get a feel for the data, let's first look through each table.

In [90]:
air_reserve.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,air_877f79706adbfb06,2016-01-01 19:00:00,2016-01-01 16:00:00,1
1,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,3
2,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,6
3,air_877f79706adbfb06,2016-01-01 20:00:00,2016-01-01 16:00:00,2
4,air_db80363d35f10926,2016-01-01 20:00:00,2016-01-01 01:00:00,5


In [91]:
hpg_reserve.head()

Unnamed: 0,hpg_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,hpg_c63f6f42e088e50f,2016-01-01 11:00:00,2016-01-01 09:00:00,1
1,hpg_dac72789163a3f47,2016-01-01 13:00:00,2016-01-01 06:00:00,3
2,hpg_c8e24dcf51ca1eb5,2016-01-01 16:00:00,2016-01-01 14:00:00,2
3,hpg_24bb207e5fd49d4a,2016-01-01 17:00:00,2016-01-01 11:00:00,5
4,hpg_25291c542ebb3bc2,2016-01-01 17:00:00,2016-01-01 03:00:00,13


In [92]:
air_store_info.head()

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599


In [93]:
hpg_store_info.head()

Unnamed: 0,hpg_store_id,hpg_genre_name,hpg_area_name,latitude,longitude
0,hpg_6622b62385aec8bf,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
1,hpg_e9e068dd49c5fa00,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
2,hpg_2976f7acb4b3a3bc,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
3,hpg_e51a522e098f024c,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
4,hpg_e3d0e1519894f275,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221


In [94]:
store_id_relation.head()

Unnamed: 0,air_store_id,hpg_store_id
0,air_63b13c56b7201bd9,hpg_4bc649e72e2a239a
1,air_a24bf50c3e90d583,hpg_c34b496d0305a809
2,air_c7f78b4f3cba33ff,hpg_cd8ae0d9bbd58ff9
3,air_947eb2cae4f3e8f2,hpg_de24ea49dc25d6b8
4,air_965b2e0cf4119003,hpg_653238a84804d8e7


In [95]:
air_visit_data.head()

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6


In [96]:
sample_submission.head()

Unnamed: 0,id,visitors
0,air_00a91d42b08b08d9_2017-04-23,0
1,air_00a91d42b08b08d9_2017-04-24,0
2,air_00a91d42b08b08d9_2017-04-25,0
3,air_00a91d42b08b08d9_2017-04-26,0
4,air_00a91d42b08b08d9_2017-04-27,0


In [97]:
date_info.head()

Unnamed: 0,calendar_date,day_of_week,holiday_flg
0,2016-01-01,Friday,1
1,2016-01-02,Saturday,1
2,2016-01-03,Sunday,1
3,2016-01-04,Monday,0
4,2016-01-05,Tuesday,0


We can see that the data is spread across quite a few tables.  
Let's start by reshaping the test data we need to predict, and then organize the rest.

In [98]:
sample_test = sample_submission["id"].str.split('_', expand=True)
sample_test["id"] = sample_test[0]+"_"+sample_test[1]
sample = sample_submission.copy() 

In [99]:
sample["id"] = sample_test["id"]
sample["visit_date"] = sample_test[2]
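Because the store ID itself contains an underscore, the cells above split on every underscore and then glue the first two pieces back together. A slightly shorter alternative, sketched here under the assumption that every id ends in `_<date>`, is to split once from the right with `str.rsplit`. The frame `sub` and the variable `parts` are illustrative names.

```python
import pandas as pd

# Two ids in the same format as sample_submission.csv.
sub = pd.DataFrame({
    "id": ["air_00a91d42b08b08d9_2017-04-23", "air_00a91d42b08b08d9_2017-04-24"],
    "visitors": [0, 0],
})

# Split once from the right: everything before the last underscore is the
# store id, everything after it is the visit date.
parts = sub["id"].str.rsplit("_", n=1, expand=True)
sub["air_store_id"] = parts[0]
sub["visit_date"] = parts[1]

print(sub[["air_store_id", "visit_date"]])
```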

Replace the visitor counts with the placeholder "pred" so that the TARGET rows can be separated out later.

In [100]:
sample["visitors"] = "pred"

In [101]:
sample.head()

Unnamed: 0,id,visitors,visit_date
0,air_00a91d42b08b08d9,pred,2017-04-23
1,air_00a91d42b08b08d9,pred,2017-04-24
2,air_00a91d42b08b08d9,pred,2017-04-25
3,air_00a91d42b08b08d9,pred,2017-04-26
4,air_00a91d42b08b08d9,pred,2017-04-27


For now, the sample data to predict is in shape.  
The two reservation systems use different IDs, so let's align them next.  

In [102]:
store_id_relation.head()

Unnamed: 0,air_store_id,hpg_store_id
0,air_63b13c56b7201bd9,hpg_4bc649e72e2a239a
1,air_a24bf50c3e90d583,hpg_c34b496d0305a809
2,air_c7f78b4f3cba33ff,hpg_cd8ae0d9bbd58ff9
3,air_947eb2cae4f3e8f2,hpg_de24ea49dc25d6b8
4,air_965b2e0cf4119003,hpg_653238a84804d8e7


### Building a dictionary of IDs

In [103]:
store_id = store_id_relation.set_index('hpg_store_id')['air_store_id']

In [104]:
store_dict = store_id.to_dict()

### Merging the IDs and dropping rows without a matching ID

In [105]:
hpg_reserves = pd.merge(hpg_reserve,store_id_relation, on = "hpg_store_id",how = "left")

In [106]:
hpg_reserves = hpg_reserves.dropna()
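The left merge followed by `dropna` keeps only the reservations whose hpg ID has an Air counterpart. The sketch below, using made-up IDs (`hpg_a`, `hpg_b`, `air_x`), illustrates that on such data an inner merge gives the same rows in one step.

```python
import pandas as pd

# Toy reservation rows; only hpg_a appears in the relation table below.
hpg = pd.DataFrame({
    "hpg_store_id": ["hpg_a", "hpg_b", "hpg_a"],
    "reserve_visitors": [2, 5, 3],
})
relation = pd.DataFrame({
    "hpg_store_id": ["hpg_a"],
    "air_store_id": ["air_x"],
})

# Left merge keeps every reservation; unmatched rows get NaN in air_store_id.
left = pd.merge(hpg, relation, on="hpg_store_id", how="left")
left_dropped = left.dropna(subset=["air_store_id"])

# Inner merge keeps only the reservations whose id has a match.
inner = pd.merge(hpg, relation, on="hpg_store_id", how="inner")

print(len(left), len(left_dropped), len(inner))
```

Note that a bare `dropna()` as used in the notebook drops a row if any column is missing, whereas `subset=` restricts the check to the merged key.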

In [107]:
hpg_reserves.head()

Unnamed: 0,hpg_store_id,visit_datetime,reserve_datetime,reserve_visitors,air_store_id
103,hpg_878cc70b1abc76f7,2016-01-01 19:00:00,2016-01-01 15:00:00,4,air_db80363d35f10926
121,hpg_dc639640420bde5f,2016-01-01 19:00:00,2016-01-01 16:00:00,2,air_08cb3c4ee6cd6a22
272,hpg_babe2c3d962d7bb6,2016-01-02 17:00:00,2016-01-01 22:00:00,3,air_6b15edd1b4fbb96a
348,hpg_2e10e1956528199a,2016-01-02 18:00:00,2016-01-02 17:00:00,2,air_37189c92b6c761ec
349,hpg_2e10e1956528199a,2016-01-02 18:00:00,2016-01-01 20:00:00,2,air_37189c92b6c761ec


In [108]:
air_reserves = air_reserve

#### Stripping the time from the datetime values

In [109]:
# Keep only the date part (the text before the space); [0] selects that column explicitly.
hpg_reserves["visit_datetime"] = hpg_reserves["visit_datetime"].str.split(' ', expand=True)[0]
air_reserves["visit_datetime"] = air_reserves["visit_datetime"].str.split(' ', expand=True)[0]
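For reference, here is a small sketch of two common ways to drop the time of day: keeping the values as text, or parsing them into timestamps and truncating to midnight. The series `ts` is a made-up example, and which option fits depends on whether later merges expect strings or datetimes.

```python
import pandas as pd

ts = pd.Series(["2016-01-01 19:00:00", "2016-01-02 17:00:00"])

# Option 1: stay with strings and take the part before the space.
as_text = ts.str.split(" ").str[0]

# Option 2: parse to timestamps and truncate each one to midnight.
as_date = pd.to_datetime(ts).dt.normalize()

print(as_text.tolist())
print(as_date.tolist())
```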

In [110]:
hpg_reserves.head()

Unnamed: 0,hpg_store_id,visit_datetime,reserve_datetime,reserve_visitors,air_store_id
103,hpg_878cc70b1abc76f7,2016-01-01,2016-01-01 15:00:00,4,air_db80363d35f10926
121,hpg_dc639640420bde5f,2016-01-01,2016-01-01 16:00:00,2,air_08cb3c4ee6cd6a22
272,hpg_babe2c3d962d7bb6,2016-01-02,2016-01-01 22:00:00,3,air_6b15edd1b4fbb96a
348,hpg_2e10e1956528199a,2016-01-02,2016-01-02 17:00:00,2,air_37189c92b6c761ec
349,hpg_2e10e1956528199a,2016-01-02,2016-01-01 20:00:00,2,air_37189c92b6c761ec


In [111]:
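# Sum only the numeric reservation counts; text columns such as reserve_datetime are left out.
hpg_reserves = hpg_reserves.groupby(['hpg_store_id', 'visit_datetime'])['reserve_visitors'].sum().reset_index()
air_reserves = air_reserves.groupby(['air_store_id', 'visit_datetime'])['reserve_visitors'].sum().reset_index()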
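The aggregation step collapses several reservations for the same store and day into one row. A minimal sketch on made-up data (`toy_reserve` and the IDs `air_x`, `air_y` are illustrative):

```python
import pandas as pd

toy_reserve = pd.DataFrame({
    "air_store_id": ["air_x", "air_x", "air_y"],
    "visit_datetime": ["2016-01-01", "2016-01-01", "2016-01-01"],
    "reserve_visitors": [3, 6, 5],
})

# One row per (store, day), with the reserved head count summed.
daily = (
    toy_reserve
    .groupby(["air_store_id", "visit_datetime"], as_index=False)["reserve_visitors"]
    .sum()
)
print(daily)
```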

In [112]:
hpg_reserves["hpg_store_id"] = hpg_reserves["hpg_store_id"].map(store_dict)

In [113]:
hpg_reserves = hpg_reserves.rename(columns={'hpg_store_id': 'air_store_id'})

In [114]:
hpg_reserves.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_visitors
0,air_cbe867adcf44e14f,2016-01-09,2
1,air_cbe867adcf44e14f,2016-01-11,8
2,air_cbe867adcf44e14f,2016-01-14,11
3,air_cbe867adcf44e14f,2016-01-15,8
4,air_cbe867adcf44e14f,2016-01-18,9


In [115]:
air_reserves.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_visitors
0,air_00a91d42b08b08d9,2016-10-31,2
1,air_00a91d42b08b08d9,2016-12-05,9
2,air_00a91d42b08b08d9,2016-12-14,18
3,air_00a91d42b08b08d9,2016-12-17,2
4,air_00a91d42b08b08d9,2016-12-20,4


In [116]:
air_visit_data.head()

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6


In [117]:
air_visit_data2= air_visit_data.rename(columns={'visit_date': 'visit_datetime'})

In [118]:
air_reserves = pd.merge(air_visit_data2,air_reserves,on= ['air_store_id', 'visit_datetime']
                       ,how="left")

In [119]:
air_reserves.isnull().sum()

air_store_id             0
visit_datetime           0
visitors                 0
reserve_visitors    224044
dtype: int64
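The missing `reserve_visitors` values come from the left merge: they are store-days with visit records but no reservation rows. One possible way to handle them, sketched on made-up data, is to record that the value was missing and then fill it with zero. The column name `has_reservation` is a hypothetical addition, not something the notebook creates.

```python
import pandas as pd

visits = pd.DataFrame({
    "air_store_id": ["air_x", "air_x"],
    "visit_datetime": ["2016-01-13", "2016-01-14"],
    "visitors": [25, 32],
})
reservations = pd.DataFrame({
    "air_store_id": ["air_x"],
    "visit_datetime": ["2016-01-14"],
    "reserve_visitors": [4],
})

merged = pd.merge(visits, reservations,
                  on=["air_store_id", "visit_datetime"], how="left")

# Remember which rows had a reservation record before filling the gaps,
# so the "missing" information is not lost.
merged["has_reservation"] = merged["reserve_visitors"].notna()
merged["reserve_visitors"] = merged["reserve_visitors"].fillna(0)
print(merged)
```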

In [120]:
air_reserve

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,air_877f79706adbfb06,2016-01-01,2016-01-01 16:00:00,1
1,air_db4b38ebe7a7ceff,2016-01-01,2016-01-01 19:00:00,3
2,air_db4b38ebe7a7ceff,2016-01-01,2016-01-01 19:00:00,6
3,air_877f79706adbfb06,2016-01-01,2016-01-01 16:00:00,2
4,air_db80363d35f10926,2016-01-01,2016-01-01 01:00:00,5
5,air_db80363d35f10926,2016-01-02,2016-01-01 16:00:00,2
6,air_db80363d35f10926,2016-01-02,2016-01-01 15:00:00,4
7,air_3bb99a1fe0583897,2016-01-02,2016-01-02 14:00:00,2
8,air_3bb99a1fe0583897,2016-01-02,2016-01-01 20:00:00,2
9,air_2b8b29ddfd35018e,2016-01-02,2016-01-02 17:00:00,2


For now, the per-system reservation processing is done.  
Next, let's merge the two.

In [121]:
air_reserves.sort_values(by = "visitors")[::-1].head(30)

Unnamed: 0,air_store_id,visit_datetime,visitors,reserve_visitors
85314,air_cfdeb326418194ff,2017-03-08,877,
214825,air_8c3175aa5e4fc569,2017-04-18,777,
72836,air_f2985de32bb792e0,2016-07-10,675,
172123,air_eca5e0064dc9314a,2016-08-30,627,
143894,air_43d577e0c9460e64,2016-01-24,514,
167504,air_9828505fefc77d75,2016-11-19,409,
147739,air_e42bdc3377d1eee7,2016-12-14,372,
151243,air_cb083b4789a8d3a2,2016-01-14,369,
141539,air_07bb665f9cdfbdfb,2016-08-07,351,
200611,air_c6aa2efba0ffc8eb,2017-01-23,348,


In [122]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead.
# sort=True keeps the alphabetical column order that append produced by default here.
reserve = pd.concat([air_reserves, hpg_reserves], sort=True)


In [123]:
reserve = reserve.groupby(['air_store_id', 'visit_datetime']).sum().reset_index()
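To make the stacking and re-aggregation concrete, the sketch below combines one Air row and one hpg row for the same store and day. The frames `air_side` and `hpg_side` and their values are made up for illustration.

```python
import pandas as pd

air_side = pd.DataFrame({
    "air_store_id": ["air_x"],
    "visit_datetime": ["2016-01-14"],
    "visitors": [32],
    "reserve_visitors": [4.0],
})
hpg_side = pd.DataFrame({
    "air_store_id": ["air_x"],
    "visit_datetime": ["2016-01-14"],
    "reserve_visitors": [2.0],
})

# Stack the rows; a column missing on one side (visitors here) becomes NaN.
stacked = pd.concat([air_side, hpg_side], ignore_index=True, sort=False)

# Sum per store and day; sum skips NaN, so visitors is unaffected by the hpg row.
combined = stacked.groupby(["air_store_id", "visit_datetime"], as_index=False).sum()
print(combined)
```

One consequence worth keeping in mind: because `sum` skips NaN, a store-day that exists only on the hpg side ends up with `visitors` equal to 0 rather than missing.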

In [124]:
reserve.sort_values(by = "visitors")[::-1].head(150)

Unnamed: 0,air_store_id,visit_datetime,reserve_visitors,visitors
209888,air_cfdeb326418194ff,2017-03-08,0.0,877.0
138847,air_8c3175aa5e4fc569,2017-04-18,0.0,777.0
245833,air_f2985de32bb792e0,2016-07-10,0.0,675.0
237886,air_eca5e0064dc9314a,2016-08-30,0.0,627.0
63086,air_43d577e0c9460e64,2016-01-24,0.0,514.0
154552,air_9828505fefc77d75,2016-11-19,0.0,409.0
229120,air_e42bdc3377d1eee7,2016-12-14,0.0,372.0
205239,air_cb083b4789a8d3a2,2016-01-14,0.0,369.0
6468,air_07bb665f9cdfbdfb,2016-08-07,0.0,351.0
199299,air_c6aa2efba0ffc8eb,2017-01-23,0.0,348.0


In [125]:
reserve.shape

(257178, 4)

That completes the reservation processing.

In [126]:
air_store_info

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
5,air_99c3eae84130c1cb,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
6,air_f183a514cb8ff4fa,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
7,air_6b9fa44a9cf504a1,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
8,air_0919d54f0c9a24b8,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
9,air_2c6c79d597e48096,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599


In [127]:
hpg_store_info

Unnamed: 0,hpg_store_id,hpg_genre_name,hpg_area_name,latitude,longitude
0,hpg_6622b62385aec8bf,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
1,hpg_e9e068dd49c5fa00,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
2,hpg_2976f7acb4b3a3bc,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
3,hpg_e51a522e098f024c,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
4,hpg_e3d0e1519894f275,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
5,hpg_530cd91db13b938e,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
6,hpg_02457b318e186fa4,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
7,hpg_0cb3c2c490020a29,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
8,hpg_3efe9b08c887fe9a,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221
9,hpg_765e8d3ba261dc1c,Japanese style,Tōkyō-to Setagaya-ku Taishidō,35.643675,139.668221


In [128]:
hpg_store = hpg_store_info.copy()
air_store = air_store_info.copy()

In [129]:
hpg_store["hpg_store_id"] = hpg_store["hpg_store_id"].map(store_dict)

In [130]:
hpg_store = hpg_store.dropna()

In [131]:
hpg_store = hpg_store.rename(columns={'hpg_store_id': 'air_store_id',"hpg_area_name":"air_area_name"})

In [132]:
hpg_store

Unnamed: 0,air_store_id,hpg_genre_name,air_area_name,latitude,longitude
98,air_2aab19554f91ff82,Japanese style,Tōkyō-to Chūō-ku Ginza,35.668600,139.763043
150,air_258ad2619d7bff9a,Japanese style,Tōkyō-to Sumida-ku Tachibana,35.704960,139.828642
178,air_c47aa7493b15f297,Japanese style,Hiroshima-ken Hiroshima-shi Hondōri,34.392106,132.461914
216,air_96005f79124e12bf,Japanese style,Ōsaka-fu Ōsaka-shi Shinsaibashisuji,34.669514,135.501425
351,air_f2c5a1f24279c531,Japanese style,Tōkyō-to Taitō-ku None,35.711353,139.782684
374,air_1033310359ceeac1,Japanese style,Tōkyō-to Kōtō-ku Minamisuna,35.670728,139.824576
682,air_640cf4835f0d9ba3,Japanese style,Kanagawa-ken Yokohama-shi Nagatsutachō,35.512762,139.495733
777,air_a38f25e3399d1b25,Japanese style,Tōkyō-to Chiyoda-ku None,35.695780,139.768453
818,air_96743eee94114261,Japanese style,Niigata-ken Niigata-shi Higashiōdōri,37.914180,139.060024
820,air_de88770300008624,Japanese style,Niigata-ken Niigata-shi Higashiōdōri,37.914180,139.060024


In [133]:
hpg_store2 = hpg_store.drop(["air_area_name","latitude","longitude"],axis = 1)

In [134]:
store = pd.merge(air_store,hpg_store2,on ="air_store_id", how = "left")

In [135]:
store.head()

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude,hpg_genre_name
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


That completes the store information processing.  
Let's join reserve and store.

In [136]:
data_1 = pd.merge(reserve,store,on ="air_store_id", how = "left")

In [137]:
date = pd

In [138]:
date_infos = date_info.rename(columns={'calendar_date': 'visit_datetime'})

In [139]:
data = pd.merge(data_1,date_infos,on = "visit_datetime",how = "left")

In [140]:
data.isnull().sum()

air_store_id             0
visit_datetime           0
reserve_visitors         0
visitors                 0
air_genre_name           0
air_area_name            0
latitude                 0
longitude                0
hpg_genre_name      234770
day_of_week              0
holiday_flg              0
dtype: int64

In [141]:
data.to_csv("data.csv")

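# Handling the time-series data
Because this is time-series data, we first try matching the conditions of the training data to those of the TARGET period.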

In [142]:
data1 = data.copy()

In [143]:
data1.isnull().sum()

air_store_id             0
visit_datetime           0
reserve_visitors         0
visitors                 0
air_genre_name           0
air_area_name            0
latitude                 0
longitude                0
hpg_genre_name      234770
day_of_week              0
holiday_flg              0
dtype: int64

In [144]:
data1['date'] = pd.to_datetime(data['visit_datetime'])

In [145]:
data1 = data1.set_index('date')

In [146]:
data1["dates"] = data1.index

In [147]:
data1.shape

(257178, 12)

In [148]:
data1.dtypes

air_store_id                object
visit_datetime              object
reserve_visitors           float64
visitors                   float64
air_genre_name              object
air_area_name               object
latitude                   float64
longitude                  float64
hpg_genre_name              object
day_of_week                 object
holiday_flg                  int64
dates               datetime64[ns]
dtype: object

Merge the TARGET once.

In [149]:
data1.head(100)

date,air_store_id,visit_datetime,reserve_visitors,visitors,air_genre_name,air_area_name,latitude,longitude,hpg_genre_name,day_of_week,holiday_flg,dates
2016-01-14,air_00a91d42b08b08d9,2016-01-14,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Thursday,0,2016-01-14
2016-01-15,air_00a91d42b08b08d9,2016-01-15,4.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-01-15
2016-01-16,air_00a91d42b08b08d9,2016-01-16,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Saturday,0,2016-01-16
2016-01-22,air_00a91d42b08b08d9,2016-01-22,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-01-22
2016-01-29,air_00a91d42b08b08d9,2016-01-29,5.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-01-29
2016-02-05,air_00a91d42b08b08d9,2016-02-05,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-02-05
2016-03-08,air_00a91d42b08b08d9,2016-03-08,3.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Tuesday,0,2016-03-08
2016-04-04,air_00a91d42b08b08d9,2016-04-04,1.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Monday,0,2016-04-04
2016-04-07,air_00a91d42b08b08d9,2016-04-07,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Thursday,0,2016-04-07
2016-04-08,air_00a91d42b08b08d9,2016-04-08,7.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-04-08


In [150]:
data1.shape

(257178, 12)

In [151]:
train_data = data1[data1.index.month == 4]

In [152]:
train_data = train_data[train_data.index.day >= 23]

In [66]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent here.
train_data = pd.concat([train_data, data1[data1.index.month == 5]])

In [67]:
train_data.head()

date,air_store_id,visit_datetime,reserve_visitors,visitors,air_genre_name,air_area_name,latitude,longitude,hpg_genre_name,day_of_week,holiday_flg,dates
2016-04-28,air_00a91d42b08b08d9,2016-04-28,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Thursday,0,2016-04-28
2016-04-23,air_0241aa3964b7f861,2016-04-23,0.0,18.0,Izakaya,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,,Saturday,0,2016-04-23
2016-04-24,air_0241aa3964b7f861,2016-04-24,0.0,23.0,Izakaya,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,,Sunday,0,2016-04-24
2016-04-25,air_0241aa3964b7f861,2016-04-25,0.0,6.0,Izakaya,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,,Monday,0,2016-04-25
2016-04-26,air_0241aa3964b7f861,2016-04-26,0.0,17.0,Izakaya,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,,Tuesday,0,2016-04-26


In [68]:
train_data.shape

(11513, 12)
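The three cells above select April 23 onward plus all of May. The same window can be expressed as a single boolean mask on a `DatetimeIndex`, as in this sketch on made-up data (`toy`, `in_window`, and `window` are illustrative names):

```python
import pandas as pd

idx = pd.to_datetime(["2016-04-22", "2016-04-23", "2016-05-10", "2016-06-01"])
toy = pd.DataFrame({"visitors": [10, 18, 23, 6]}, index=idx)

# Rows from April 23 through the end of May, in any year.
in_window = ((toy.index.month == 4) & (toy.index.day >= 23)) | (toy.index.month == 5)
window = toy[in_window]
print(window)
```

Because the mask ignores the year, it picks up the same seasonal window in every year present in the data.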

In [153]:
data1.drop(["visit_datetime","air_store_id"],axis = 1)

date,reserve_visitors,visitors,air_genre_name,air_area_name,latitude,longitude,hpg_genre_name,day_of_week,holiday_flg,dates
2016-01-14,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Thursday,0,2016-01-14
2016-01-15,4.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-01-15
2016-01-16,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Saturday,0,2016-01-16
2016-01-22,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-01-22
2016-01-29,5.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-01-29
2016-02-05,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-02-05
2016-03-08,3.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Tuesday,0,2016-03-08
2016-04-04,1.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Monday,0,2016-04-04
2016-04-07,2.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Thursday,0,2016-04-07
2016-04-08,7.0,0.0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,Friday,0,2016-04-08


In [154]:
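from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
def les(data):
    """Label-encode two-valued text columns, then one-hot encode the rest."""
    for col in data:
        if data[col].dtype == "object":
            # Text columns with at most two distinct values get integer labels.
            # Note: this writes back into the DataFrame that was passed in.
            if len(list(data[col].unique())) <= 2:
                le.fit(data[col])
                data[col] = le.transform(data[col])

    # Every remaining text column becomes one indicator column per distinct value.
    data = pd.get_dummies(data)
    return data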
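To show what this two-stage encoding produces, here is a small pandas-only sketch. It substitutes `astype("category").cat.codes` for scikit-learn's `LabelEncoder` (both assign integer codes in sorted label order), and the frame `toy` with its `weekend` column is a made-up example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

toy = pd.DataFrame({
    "weekend": ["no", "yes", "yes"],                  # two distinct values
    "day_of_week": ["Friday", "Saturday", "Sunday"],  # three distinct values
    "visitors": [25, 32, 29],
})

encoded = toy.copy()
for col in encoded.columns:
    # Two-valued text columns become a single 0/1 integer column.
    if not is_numeric_dtype(encoded[col]) and encoded[col].nunique() <= 2:
        encoded[col] = encoded[col].astype("category").cat.codes

# Remaining text columns become one indicator column per distinct value.
encoded = pd.get_dummies(encoded)
print(encoded.columns.tolist())
```

One-hot encoding a high-cardinality column such as a store ID or a date adds one column per distinct value, which is why the encoded frame in this notebook becomes very wide.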

In [155]:
def ch_category(data):
    
    for col in data:
        if data[col].dtype == "object":
            data[col] = data[col].astype('category')
            
    return data

In [72]:
#train = ch_category(train_data)

In [156]:
data1["dates"] = data1["dates"].astype("object")

In [157]:
train = les(data1)

In [158]:
train.shape

(257178, 1990)

In [159]:
train.isnull().sum()

reserve_visitors                     0
visitors                             0
latitude                             0
longitude                            0
holiday_flg                          0
air_store_id_air_00a91d42b08b08d9    0
air_store_id_air_0164b9927d20bcc3    0
air_store_id_air_0241aa3964b7f861    0
air_store_id_air_0328696196e46f18    0
air_store_id_air_034a3d5b40d5b1b1    0
air_store_id_air_036d4f1ee7285390    0
air_store_id_air_0382c794b73b51ad    0
air_store_id_air_03963426c9312048    0
air_store_id_air_04341b588bde96cd    0
air_store_id_air_049f6d5b402a31b2    0
air_store_id_air_04cae7c1bc9b2a0b    0
air_store_id_air_0585011fa179bcce    0
air_store_id_air_05c325d315cc17f5    0
air_store_id_air_0647f17b4dc041c8    0
air_store_id_air_064e203265ee5753    0
air_store_id_air_066f0221b8a4d533    0
air_store_id_air_06f95ac5c33aca10    0
air_store_id_air_0728814bd98f7367    0
air_store_id_air_0768ab3910f7967f    0
air_store_id_air_07b314d83059c4d2    0
air_store_id_air_07bb665f

In [160]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split

In [161]:
target = train.visitors
test_train = train.drop(["visitors"],axis = 1)

In [162]:
x_train,x_test,y_train,y_test = train_test_split(test_train,target,test_size = 0.2)
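`train_test_split` shuffles rows at random, so validation rows can fall earlier in time than training rows. Given the caveat about time-series data earlier in this notebook, a chronological hold-out is a common alternative. The sketch below shows the idea on made-up data; the cutoff date and the names `train_part` and `valid_part` are assumptions for illustration, not the split used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {"visitors": [25, 32, 29, 22, 6]},
    index=pd.to_datetime(
        ["2016-01-13", "2016-01-14", "2016-01-15", "2016-01-16", "2016-01-18"]
    ),
).sort_index()

# Hold out everything on or after the cutoff date for validation.
cutoff = pd.Timestamp("2016-01-16")
train_part = toy[toy.index < cutoff]
valid_part = toy[toy.index >= cutoff]
print(len(train_part), len(valid_part))
```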

In [163]:
def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(y0), 2)))
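A few hand-checkable cases help confirm what the metric measures. The sketch below uses a hypothetical variant, `rmsle_clipped`, that clips predictions at zero first, since `log1p` is undefined for values below -1 and a regression model can produce negative outputs.

```python
import numpy as np

def rmsle_clipped(y_pred, y_true):
    """RMSLE after clipping predictions at zero (hypothetical helper)."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    y_true = np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A perfect prediction has zero error.
print(rmsle_clipped([10, 20], [10, 20]))

# Predicting e - 1 when the truth is 0 gives a log difference of 1.
print(rmsle_clipped([np.e - 1], [0]))

# A negative prediction is clipped to 0 instead of producing NaN.
print(rmsle_clipped([-3.0], [0]))
```

Because the error is taken on a log scale, being off by a factor matters more than being off by an absolute count, which suits stores with very different visitor volumes.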

In [166]:
d_trainL1 = lgb.Dataset(x_train, label=y_train)
d_validL1 = lgb.Dataset(x_test, label=y_test)
watchlistL1 = [d_trainL1, d_validL1]
paramsL1 = {
        'learning_rate': 0.55,
        'application': 'regression',
        'max_depth': 5,
        'num_leaves': 60,
        'verbosity': -1,
        'metric': 'RMSE',
        'data_random_seed': 1,
        'bagging_fraction': 0.5,
        'nthread': 5
    }
modelL1 = lgb.train(paramsL1, train_set=d_trainL1, num_boost_round=8000, valid_sets=watchlistL1, \
early_stopping_rounds=1000, verbose_eval=500)

Training until validation scores don't improve for 1000 rounds.
[500]	training's rmse: 11.0585	valid_1's rmse: 11.1091
[1000]	training's rmse: 10.4455	valid_1's rmse: 11.007
[1500]	training's rmse: 10.1112	valid_1's rmse: 10.9759
[2000]	training's rmse: 9.89201	valid_1's rmse: 10.9698
[2500]	training's rmse: 9.6713	valid_1's rmse: 10.9689
[3000]	training's rmse: 9.5483	valid_1's rmse: 10.9873
Early stopping, best iteration is:
[2296]	training's rmse: 9.75898	valid_1's rmse: 10.9559


In [167]:
pred = modelL1.predict(x_test)
rmsleL2 = rmsle(pred, y_test)
print(rmsleL2)

0.6573761420282467




In [None]:
pd.DataFrame(pred)

In [None]:
y_test

In [None]:
sample.head()

In [None]:
sample1 = sample.drop("visitors",axis = 1)

In [None]:
sample1 = sample1.rename(columns={'id': 'air_store_id',"visit_date":"dates"})

In [None]:
sample1["dates"] = pd.to_datetime(sample1['dates'])

In [None]:
sample1

In [None]:
sample3 = pd.merge(sample1,data1,on = ["air_store_id"],how = "left")

In [None]:
sample3.shape

In [None]:
sample3

In [None]:
sample1.shape

In [None]:
x_train.head()

In [None]:
air_visit_data

In [None]:
data1.sort_values(by = "reserve_visitors")

In [None]:
x_train