# Kaggleで磨く 機械学習の実践力
# Chapter 8: Regression Competition (MLB Player Digital Engagement Forecasting)

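### Note: before submitting
- In the right-hand menu, open "Settings" and turn **"Internet" off**. You cannot submit while internet access is enabled.
- **The mlb library can be run only once per session. To run it again, restart the kernel.**
- This notebook calls the mlb library in three places. **Comment out every such cell except the one you want to run.**
    - Script 8-37: 8.3 Building a Baseline
    - Script 8-47: 8.4 Feature Engineering
    - Script 8-62: 8.5 Model Tuning
- To run all three, make three copies of the notebook, uncomment a different one of the three cells above in each copy, then run and submit each copy.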

# 8.3 Building a Baseline
## 8.3.2 Data Preprocessing

### ● Loading and processing train_updated.csv
#### Script 8-1: Importing libraries

In [1]:
import numpy as np
import pandas as pd
import gc
import pickle
import os
import datetime as dt

# plot
import matplotlib.pyplot as plt

# LightGBM
import lightgbm as lgb

from sklearn.metrics import mean_absolute_error

import warnings
warnings.simplefilter("ignore")

# Set the display format for floats
pd.options.display.float_format = '{:10.4f}'.format

#### Script 8-2: Loading train_updated.csv

In [2]:
train = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/train_updated.csv")
print(train.shape)
train.head()

(1308, 12)


Unnamed: 0,date,nextDayPlayerEngagement,games,rosters,playerBoxScores,teamBoxScores,transactions,standings,awards,events,playerTwitterFollowers,teamTwitterFollowers
0,20180101,"[{""engagementMetricsDate"":""2018-01-02"",""player...",,"[{""playerId"":400121,""gameDate"":""2018-01-01"",""t...",,,"[{""transactionId"":340732,""playerId"":547348,""pl...",,,,"[{""date"":""2018-01-01"",""playerId"":545361,""playe...","[{""date"":""2018-01-01"",""teamId"":147,""teamName"":..."
1,20180102,"[{""engagementMetricsDate"":""2018-01-03"",""player...",,"[{""playerId"":134181,""gameDate"":""2018-01-02"",""t...",,,"[{""transactionId"":339458,""playerId"":621173,""pl...",,,,,
2,20180103,"[{""engagementMetricsDate"":""2018-01-04"",""player...",,"[{""playerId"":425492,""gameDate"":""2018-01-03"",""t...",,,"[{""transactionId"":347527,""playerId"":572389,""pl...",,,,,
3,20180104,"[{""engagementMetricsDate"":""2018-01-05"",""player...",,"[{""playerId"":282332,""gameDate"":""2018-01-04"",""t...",,,"[{""transactionId"":339549,""playerId"":545343,""pl...",,,,,
4,20180105,"[{""engagementMetricsDate"":""2018-01-06"",""player...",,"[{""playerId"":282332,""gameDate"":""2018-01-05"",""t...",,,"[{""transactionId"":341195,""playerId"":628336,""pl...",,,,,


#### Script 8-3: Narrowing the data to speed up processing

In [3]:
train = train.loc[train["date"]>=20200401, :].reset_index(drop=True)
print(train.shape)

(487, 12)


#### Script 8-4: Creating conversion functions specific to train_updated.csv

In [4]:
def unpack_json(json_str):
    return np.nan if pd.isna(json_str) else pd.read_json(json_str)

def extract_data(input_df, col="events", show=False):
    output_df = pd.DataFrame()
    for i in np.arange(len(input_df)):
        if show: print("\r{}/{}".format(i+1, len(input_df)), end="")
        try:
            output_df = pd.concat([
                output_df,
                unpack_json(input_df[col].iloc[i])
            ], axis=0, ignore_index=True)
        except Exception:
            # Skip cells that are missing (NaN) or cannot be parsed as JSON
            pass
    if show: print("")
    if show: print(output_df.shape)
    if show: display(output_df.head())
    return output_df
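
The `extract_data` function above grows `output_df` with one `pd.concat` call per row, which copies the accumulated data each time. A minimal sketch of an alternative, assuming every non-missing cell holds a JSON array of records: collect the per-row frames in a list and concatenate once. The name `extract_data_fast` and the toy `sample` frame are hypothetical, and `json.loads` does not apply the dtype inference that `pd.read_json` performs, so column dtypes may differ from the original function.

```python
import json

import numpy as np
import pandas as pd


def extract_data_fast(input_df, col="events"):
    """Unpack a column of JSON strings into one DataFrame (hypothetical helper)."""
    frames = []
    for json_str in input_df[col]:
        # Missing cells carry no records, so skip them
        if pd.isna(json_str):
            continue
        frames.append(pd.DataFrame(json.loads(json_str)))
    if not frames:
        return pd.DataFrame()
    # A single concat avoids repeatedly copying the accumulated rows
    return pd.concat(frames, axis=0, ignore_index=True)


# Toy input: two populated cells and one missing cell
sample = pd.DataFrame({"events": [
    '[{"playerId": 1, "target1": 0.5}]',
    np.nan,
    '[{"playerId": 2, "target1": 1.5}, {"playerId": 3, "target1": 2.0}]',
]})
result = extract_data_fast(sample)
print(result)
```

The return value has the same shape as what `extract_data` would build for the same cells; only the accumulation strategy differs.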

#### Script 8-5: Extracting "nextDayPlayerEngagement" from train_updated.csv and converting it to tabular form

In [5]:
df_engagement = extract_data(train, col="nextDayPlayerEngagement", show=True)

487/487
(1003707, 6)


Unnamed: 0,engagementMetricsDate,playerId,target1,target2,target3,target4
0,2020-04-02,425794,5.1249,9.434,0.1179,6.1947
1,2020-04-02,571704,0.0389,8.1761,0.0105,2.1304
2,2020-04-02,506702,0.0106,5.0314,0.0082,0.885
3,2020-04-02,607231,0.0247,2.8302,0.0222,0.59
4,2020-04-02,543193,0.0071,1.1006,0.0012,0.1967


#### Script 8-6: Creating the join key date_playerId

In [6]:
df_engagement["date_playerId"] = df_engagement["engagementMetricsDate"].str.replace("-", "") + "_" + df_engagement["playerId"].astype(str)
df_engagement.head()

Unnamed: 0,engagementMetricsDate,playerId,target1,target2,target3,target4,date_playerId
0,2020-04-02,425794,5.1249,9.434,0.1179,6.1947,20200402_425794
1,2020-04-02,571704,0.0389,8.1761,0.0105,2.1304,20200402_571704
2,2020-04-02,506702,0.0106,5.0314,0.0082,0.885,20200402_506702
3,2020-04-02,607231,0.0247,2.8302,0.0222,0.59,20200402_607231
4,2020-04-02,543193,0.0071,1.1006,0.0012,0.1967,20200402_543193
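
The `date_playerId` key is built here by string concatenation and is later taken apart again by position when the inference dataset is created. As a small illustration of that round trip, the sketch below builds and splits the key for a single row; the helper names `make_date_player_id` and `split_date_player_id` are hypothetical and not part of the notebook.

```python
def make_date_player_id(engagement_date, player_id):
    """Build a key such as '20200402_425794' from '2020-04-02' and 425794."""
    return engagement_date.replace("-", "") + "_" + str(player_id)


def split_date_player_id(key):
    """Split the key back into its (YYYYMMDD string, integer playerId) parts."""
    date_part, player_part = key.split("_", 1)
    return date_part, int(player_part)


key = make_date_player_id("2020-04-02", 425794)
print(key)
print(split_date_player_id(key))
```

Splitting on the underscore instead of slicing at fixed positions keeps the parsing independent of the length of the date part.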


#### Script 8-7: Creating simple features from the date

In [7]:
# Create the prediction-date column (prediction date = the day before the target date)
df_engagement["date"] = pd.to_datetime(df_engagement["engagementMetricsDate"], format="%Y-%m-%d") + dt.timedelta(days=-1)

# Create "day of week" and "year-month" features from the prediction date
df_engagement["dayofweek"] = df_engagement["date"].dt.dayofweek
df_engagement["yearmonth"] = df_engagement["date"].astype(str).apply(lambda x: x[:7])
df_engagement.head()

Unnamed: 0,engagementMetricsDate,playerId,target1,target2,target3,target4,date_playerId,date,dayofweek,yearmonth
0,2020-04-02,425794,5.1249,9.434,0.1179,6.1947,20200402_425794,2020-04-01,2,2020-04
1,2020-04-02,571704,0.0389,8.1761,0.0105,2.1304,20200402_571704,2020-04-01,2,2020-04
2,2020-04-02,506702,0.0106,5.0314,0.0082,0.885,20200402_506702,2020-04-01,2,2020-04
3,2020-04-02,607231,0.0247,2.8302,0.0222,0.59,20200402_607231,2020-04-01,2,2020-04
4,2020-04-02,543193,0.0071,1.1006,0.0012,0.1967,20200402_543193,2020-04-01,2,2020-04


### ● Loading and processing players.csv
#### Script 8-8: Loading players.csv

In [8]:
df_players = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/players.csv")
print(df_players.shape)
print(df_players["playerId"].agg("nunique"))
df_players.head()

(2061, 12)
2061


Unnamed: 0,playerId,playerName,DOB,mlbDebutDate,birthCity,birthStateProvince,birthCountry,heightInches,weight,primaryPositionCode,primaryPositionName,playerForTestSetAndFuturePreds
0,665482,Gilberto Celestino,1999-02-13,2021-06-02,Santo Domingo,,Dominican Republic,72,170,8,Outfielder,False
1,593590,Webster Rivas,1990-08-08,2021-05-28,Nagua,,Dominican Republic,73,219,3,First Base,True
2,661269,Vladimir Gutierrez,1995-09-18,2021-05-28,Havana,,Cuba,73,190,1,Pitcher,True
3,669212,Eli Morgan,1996-05-13,2021-05-28,Rancho Palos Verdes,CA,USA,70,190,1,Pitcher,True
4,666201,Alek Manoah,1998-01-09,2021-05-27,Homestead,FL,USA,78,260,1,Pitcher,True


#### Script 8-9: Counting the players subject to evaluation

In [9]:
df_players["playerForTestSetAndFuturePreds"] = np.where(df_players["playerForTestSetAndFuturePreds"]==True, 1, 0)
print(df_players["playerForTestSetAndFuturePreds"].sum())
print(df_players["playerForTestSetAndFuturePreds"].mean())

1187
0.5759340126152354


## 8.3.3 Building the Dataset
#### Script 8-10: Joining the tables

In [10]:
df_train = pd.merge(df_engagement, df_players, on=["playerId"], how="left")
print(df_train.shape)

(1003707, 21)


#### Script 8-11: Building the training dataset

In [11]:
x_train = df_train[[
    "playerId", "dayofweek",
    "birthCity", "birthStateProvince", "birthCountry", "heightInches", "weight", 
    "primaryPositionCode", "primaryPositionName", "playerForTestSetAndFuturePreds"]]
y_train = df_train[["target1","target2","target3","target4"]]
id_train = df_train[["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"]]
print(x_train.shape, y_train.shape, id_train.shape)
x_train.head()

(1003707, 10) (1003707, 4) (1003707, 6)


Unnamed: 0,playerId,dayofweek,birthCity,birthStateProvince,birthCountry,heightInches,weight,primaryPositionCode,primaryPositionName,playerForTestSetAndFuturePreds
0,425794,2,Brunswick,GA,USA,79,230,1,Pitcher,1
1,571704,2,Albuquerque,NM,USA,75,210,1,Pitcher,0
2,506702,2,Maracaibo,,Venezuela,70,235,2,Catcher,1
3,607231,2,Savannah,GA,USA,76,200,1,Pitcher,1
4,543193,2,Columbia,CA,USA,76,215,1,Pitcher,0


#### Script 8-12: Converting categorical variables to the category dtype

In [12]:
for col in ["playerId", "dayofweek", "birthCity", "birthStateProvince", "birthCountry", "primaryPositionCode", "primaryPositionName"]:
    x_train[col] = x_train[col].astype("category")

## 8.3.4 Validation Design

#### Script 8-13: Setting the training and validation periods

In [13]:
list_cv_month = [
    [["2020-05","2020-06","2020-07","2020-08","2020-09","2020-10","2020-11","2020-12","2021-01","2021-02","2021-03","2021-04"], ["2021-05"]],
    [["2020-06","2020-07","2020-08","2020-09","2020-10","2020-11","2020-12","2021-01","2021-02","2021-03","2021-04","2021-05"], ["2021-06"]],
    [["2020-07","2020-08","2020-09","2020-10","2020-11","2020-12","2021-01","2021-02","2021-03","2021-04","2021-05","2021-06"], ["2021-07"]],
]

#### Script 8-14: Building the index lists for training and validation data

In [14]:
cv = []
for month_tr, month_va in list_cv_month:
    cv.append([
        id_train.index[id_train["yearmonth"].isin(month_tr)],
        id_train.index[id_train["yearmonth"].isin(month_va) & (id_train["playerForTestSetAndFuturePreds"]==1)],
    ])
# Index lists for fold 0
cv[0]

[Int64Index([ 61830,  61831,  61832,  61833,  61834,  61835,  61836,  61837,
              61838,  61839,
             ...
             814085, 814086, 814087, 814088, 814089, 814090, 814091, 814092,
             814093, 814094],
            dtype='int64', length=752265),
 Int64Index([814095, 814096, 814100, 814101, 814102, 814104, 814105, 814106,
             814107, 814109,
             ...
             877931, 877934, 877950, 877951, 877957, 877958, 877969, 877972,
             877974, 877975],
            dtype='int64', length=36797)]
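
The three folds in `list_cv_month` follow a regular pattern: twelve consecutive training months followed by one validation month, with the window sliding forward by one month per fold. A minimal sketch of generating the same structure programmatically is shown below; the function name `make_rolling_month_folds` and its parameters are hypothetical, introduced only for illustration.

```python
import pandas as pd


def make_rolling_month_folds(first_valid_month, n_folds, n_train_months=12):
    """Return [[train_months, [valid_month]], ...] as 'YYYY-MM' strings."""
    folds = []
    first = pd.Period(first_valid_month, freq="M")
    for k in range(n_folds):
        valid = first + k  # slide the validation month forward by k months
        # The n_train_months months immediately before the validation month
        train_months = [str(valid - j) for j in range(n_train_months, 0, -1)]
        folds.append([train_months, [str(valid)]])
    return folds


folds = make_rolling_month_folds("2021-05", n_folds=3)
for train_months, valid_month in folds:
    print(train_months[0], "to", train_months[-1], "->", valid_month)
```

Called with `"2021-05"` and three folds, this reproduces the month lists written out by hand in Script 8-13, which makes it easier to change the window length or the number of folds later.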

## 8.3.5 Model Training
#### Script 8-15: Splitting into training and validation data

In [15]:
# Use "target1" as the target variable and fold 0 as the fold
target = "target1"
nfold = 0

# Get the train and valid indices
idx_tr, idx_va = cv[nfold][0], cv[nfold][1]

# Split into train and valid data
x_tr, y_tr, id_tr = x_train.loc[idx_tr, :], y_train.loc[idx_tr, target], id_train.loc[idx_tr, :]
x_va, y_va, id_va = x_train.loc[idx_va, :], y_train.loc[idx_va, target], id_train.loc[idx_va, :]
print(x_tr.shape, y_tr.shape, id_tr.shape)
print(x_va.shape, y_va.shape, id_va.shape)

(752265, 10) (752265,) (752265, 6)
(36797, 10) (36797,) (36797, 6)


#### Script 8-16: Model training

In [16]:
# Set the hyperparameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1', 
    'metric': 'mean_absolute_error',
    'learning_rate': 0.05,
    'num_leaves': 32,
    'subsample': 0.7,
    'subsample_freq': 1,
    'feature_fraction': 0.8,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 50,
    'n_estimators': 1000,
    "random_state": 123,
    "importance_type": "gain",
}

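# Train the model
# Note: the early_stopping_rounds and verbose arguments of fit() were removed in LightGBM 4.0.
# On newer versions, pass callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)] instead.
model = lgb.LGBMRegressor(**params)
model.fit(x_tr,
          y_tr,
          eval_set=[(x_tr,y_tr), (x_va,y_va)],
          early_stopping_rounds=50,
          verbose=100,
         )

# Save the model (the file is a pickle, despite the .h5 extension)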
with open("model_lgb_target1_fold0.h5", "wb") as f:
    pickle.dump(model, f, protocol=4)

[100]	training's l1: 0.508316	valid_1's l1: 1.29781
[200]	training's l1: 0.508247	valid_1's l1: 1.29772
[300]	training's l1: 0.508185	valid_1's l1: 1.29768
[400]	training's l1: 0.508154	valid_1's l1: 1.29768


#### Script 8-17: Model evaluation

In [17]:
# Get predictions for the valid data
y_va_pred = model.predict(x_va)

# Create a variable to hold the predictions for all targets/folds
df_valid_pred = pd.DataFrame()

# Store the predictions
tmp_pred = pd.concat([
    id_va,
    pd.DataFrame({"target": target, "nfold": nfold, "true": y_va, "pred": y_va_pred}),
], axis=1)
df_valid_pred = pd.concat([df_valid_pred, tmp_pred], axis=0, ignore_index=True)

# Create a variable to hold the metrics for all targets/folds
metrics = []

# Compute the metric
metric_va = mean_absolute_error(y_va, y_va_pred)
# Store the metric
metrics.append([target, nfold, metric_va])
metrics

[['target1', 0, 1.297671433371246]]

#### Script 8-18: Getting feature importances

In [18]:
# Get the importances
tmp_imp = pd.DataFrame({"col":x_tr.columns, "imp":model.feature_importances_, "target":"target1", "nfold":nfold})
# Check the importances, sorted in descending order
display(tmp_imp.sort_values("imp", ascending=False))

# Create a DataFrame to hold the importances for all targets/folds
df_imp = pd.DataFrame()
# Append tmp_imp to df_imp
df_imp = pd.concat([df_imp, tmp_imp], axis=0, ignore_index=True)

Unnamed: 0,col,imp,target,nfold
0,playerId,19860714.6846,target1,0
2,birthCity,3294792.5985,target1,0
9,playerForTestSetAndFuturePreds,2401641.995,target1,0
7,primaryPositionCode,637437.8781,target1,0
1,dayofweek,303999.4478,target1,0
8,primaryPositionName,117352.6663,target1,0
6,weight,39797.095,target1,0
3,birthStateProvince,33818.563,target1,0
5,heightInches,23002.5407,target1,0
4,birthCountry,4671.998,target1,0


#### Script 8-19: Model evaluation (summary over all targets/folds)

In [19]:
# Convert the list to a DataFrame
df_metrics = pd.DataFrame(metrics, columns=["target", "nfold", "mae"])
display(df_metrics.head())

# Metric
print("MCMAE: {:.4f}".format(df_metrics["mae"].mean()))

display(pd.pivot_table(df_metrics, index="nfold", columns="target", values="mae", aggfunc=np.mean, margins=True))

Unnamed: 0,target,nfold,mae
0,target1,0,1.2977


MCMAE: 1.2977


nfold,target1,All
0,1.2977,1.2977
All,1.2977,1.2977
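
The MCMAE printed above is the mean of the per-target mean absolute errors (here averaged over targets and folds). As a minimal sketch of the underlying arithmetic, assuming the true and predicted values are arranged with one column per target, the function below computes the MAE of each column and then averages those values. The name `mcmae` and the toy arrays are hypothetical.

```python
import numpy as np


def mcmae(y_true, y_pred):
    """Mean column-wise mean absolute error for 2-D inputs (rows x targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # MAE of each target column
    mae_per_column = np.abs(y_true - y_pred).mean(axis=0)
    # Average the per-column MAEs into a single score
    return float(mae_per_column.mean())


# Toy example with two rows and two targets
y_true = [[1.0, 2.0], [3.0, 4.0]]
y_pred = [[1.0, 3.0], [5.0, 4.0]]
score = mcmae(y_true, y_pred)
print(score)  # column MAEs are 1.0 and 0.5, so the score is 0.75
```

Because every target gets equal weight in the final average, a target with a larger error scale contributes more to the score in absolute terms.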


#### Script 8-20: Reshaping the validation predictions (summary over all targets/folds)

In [20]:
df_valid_pred_all = pd.pivot_table(df_valid_pred, index=["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"], columns=["target",  "nfold"], values=["true", "pred"], aggfunc=np.sum)
df_valid_pred_all.columns = ["{}_fold{}_{}".format(j,k,i) for i,j,k in df_valid_pred_all.columns]
df_valid_pred_all = df_valid_pred_all.reset_index(drop=False)
df_valid_pred_all.head()

Unnamed: 0,engagementMetricsDate,playerId,date_playerId,date,yearmonth,playerForTestSetAndFuturePreds,target1_fold0_pred,target1_fold0_true
0,2021-05-02,405395,20210502_405395,2021-05-01,2021-05,1,0.6213,0.1518
1,2021-05-02,408234,20210502_408234,2021-05-01,2021-05,1,0.3264,0.2365
2,2021-05-02,424144,20210502_424144,2021-05-01,2021-05,1,0.0018,0.0016
3,2021-05-02,425772,20210502_425772,2021-05-01,2021-05,1,0.0066,0.0035
4,2021-05-02,425784,20210502_425784,2021-05-01,2021-05,1,0.0007,0.0001


#### Script 8-21: Getting feature importances (summary over all targets/folds)

In [21]:
df_imp.groupby(["col"])["imp"].agg(["mean", "std"]).sort_values("mean", ascending=False)

col,mean,std
playerId,19860714.6846,
birthCity,3294792.5985,
playerForTestSetAndFuturePreds,2401641.995,
primaryPositionCode,637437.8781,
dayofweek,303999.4478,
primaryPositionName,117352.6663,
weight,39797.095,
birthStateProvince,33818.563,
heightInches,23002.5407,
birthCountry,4671.998,


#### Script 8-22: Creating the training function

In [22]:
def train_lgb(input_x,
              input_y,
              input_id,
              params,
              list_nfold=[0,1,2],
              mode_train="train",
             ):
    # Variable to hold the predictions
    df_valid_pred = pd.DataFrame()
    # Variable to hold the metrics
    metrics = []
    # DataFrame to hold the feature importances
    df_imp = pd.DataFrame()

    # validation
    cv = []
    for month_tr, month_va in list_cv_month:
        cv.append([
            input_id.index[input_id["yearmonth"].isin(month_tr)],
            input_id.index[input_id["yearmonth"].isin(month_va) & (input_id["playerForTestSetAndFuturePreds"]==1)],
        ])
    
    # Model training (one model per target/fold)
    for nfold in list_nfold:
        for i, target in enumerate(["target1", "target2", "target3", "target4"]):
            print("-"*20, target, ", fold:", nfold, "-"*20)
            # Split the function arguments (not the global variables) into train and valid
            idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
            x_tr, y_tr, id_tr = input_x.loc[idx_tr, :], input_y.loc[idx_tr, target], input_id.loc[idx_tr, :]
            x_va, y_va, id_va = input_x.loc[idx_va, :], input_y.loc[idx_va, target], input_id.loc[idx_va, :]
            print(x_tr.shape, y_tr.shape, id_tr.shape)
            print(x_va.shape, y_va.shape, id_va.shape)

            # File name for the saved model
            filepath = "model_lgb_{}_fold{}.h5".format(target, nfold)

            if mode_train=="train":
                print("training start.")
                model = lgb.LGBMRegressor(**params)
                model.fit(x_tr,
                          y_tr,
                          eval_set=[(x_tr,y_tr), (x_va,y_va)],
                          early_stopping_rounds=50,
                          verbose=100,
                         )
                with open(filepath, "wb") as f:
                    pickle.dump(model, f, protocol=4)
            else:
                print("model load.")
                with open(filepath, "rb") as f:
                    model = pickle.load(f)
                print("Done.")
                
            # Get predictions for valid
            y_va_pred = model.predict(x_va)
            tmp_pred = pd.concat([
                id_va,
                pd.DataFrame({"target": target, "nfold": nfold, "true": y_va, "pred": y_va_pred}),
            ], axis=1)
            df_valid_pred = pd.concat([df_valid_pred, tmp_pred], axis=0, ignore_index=True)

            # Compute the metric
            metric_va = mean_absolute_error(y_va, y_va_pred)
            metrics.append([target, nfold, metric_va])

            # Get the feature importances
            tmp_imp = pd.DataFrame({"col":x_tr.columns, "imp":model.feature_importances_, "target":target, "nfold":nfold})
            df_imp = pd.concat([df_imp, tmp_imp], axis=0, ignore_index=True)
        
    print("-"*10, "result", "-"*10)
    # Metric
    df_metrics = pd.DataFrame(metrics, columns=["target", "nfold", "mae"])
    print("MCMAE: {:.4f}".format(df_metrics["mae"].mean()))

    # Predictions for valid
    df_valid_pred_all = pd.pivot_table(df_valid_pred, index=["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"], columns=["target",  "nfold"], values=["true", "pred"], aggfunc=np.sum)
    df_valid_pred_all.columns = ["{}_fold{}_{}".format(j,k,i) for i,j,k in df_valid_pred_all.columns]
    df_valid_pred_all = df_valid_pred_all.reset_index(drop=False)

    return df_valid_pred_all, df_metrics, df_imp

#### Script 8-23: Model training

In [23]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1', 
    'metric': 'mean_absolute_error',
    'learning_rate': 0.05,
    'num_leaves': 32,
    'subsample': 0.7,
    'subsample_freq': 1,
    'feature_fraction': 0.8,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 50,
    'n_estimators': 1000,
    "random_state": 123,
    "importance_type": "gain",
}

df_valid_pred, df_metrics, df_imp = train_lgb(x_train,
                                              y_train,
                                              id_train,
                                              params,
                                              list_nfold=[0,1,2],
                                              mode_train="train",
                                             )

-------------------- target1 , fold: 0 --------------------
(752265, 10) (752265,) (752265, 6)
(36797, 10) (36797,) (36797, 6)
training start.
[100]	training's l1: 0.508316	valid_1's l1: 1.29781
[200]	training's l1: 0.508247	valid_1's l1: 1.29772
[300]	training's l1: 0.508185	valid_1's l1: 1.29768
[400]	training's l1: 0.508154	valid_1's l1: 1.29768
-------------------- target2 , fold: 0 --------------------
(752265, 10) (752265,) (752265, 6)
(36797, 10) (36797,) (36797, 6)
training start.
[100]	training's l1: 1.82802	valid_1's l1: 2.44748
-------------------- target3 , fold: 0 --------------------
(752265, 10) (752265,) (752265, 6)
(36797, 10) (36797,) (36797, 6)
training start.
-------------------- target4 , fold: 0 --------------------
(752265, 10) (752265,) (752265, 6)
(36797, 10) (36797,) (36797, 6)
training start.
[100]	training's l1: 0.794924	valid_1's l1: 1.25542
[200]	training's l1: 0.790874	valid_1's l1: 1.24914
[300]	training's l1: 0.790258	valid_1's l1: 1.24761
[400]	trainin

#### Script 8-24: Checking the metric (MCMAE)

In [24]:
print("MCMAE: {:.4f}".format(df_metrics["mae"].mean()))
display(pd.pivot_table(df_metrics, index="nfold", columns="target", values="mae", aggfunc=np.mean, margins=True))

MCMAE: 1.3503


nfold,target1,target2,target3,target4,All
0,1.2977,2.4445,0.878,1.2449,1.4662
1,1.1951,2.1536,0.8316,1.641,1.4553
2,1.1132,1.7897,0.7605,0.8534,1.1292
All,1.202,2.1293,0.8234,1.2464,1.3503


#### Script 8-25: Checking feature importances

In [25]:
df_imp.groupby(["col"])["imp"].agg(["mean", "std"]).sort_values("mean", ascending=False)

col,mean,std
playerId,7729847.2801,13640878.7873
playerForTestSetAndFuturePreds,1281685.2354,1606340.8381
birthCity,1235480.7889,2426099.1
dayofweek,143082.819,209384.4154
primaryPositionCode,135093.7701,222089.2313
primaryPositionName,37740.4615,43340.0821
weight,28607.5458,44706.078
heightInches,26661.7844,43729.0979
birthStateProvince,14551.1016,34710.664
birthCountry,5324.3698,13346.4883


## 8.3.6 Model Inference
### **Part 1: Building the inference dataset**

#### Script 8-26: Checking the format of the data received at inference time, part 1 (comment out when submitting)

In [26]:
# import mlb

# env = mlb.make_env()
# iter_test = env.iter_test()

# for (test_df, prediction_df) in iter_test:
#     # Check the data received in the for loop
#     display(test_df.head())
#     display(prediction_df.head())
#     break

#### Script 8-27: Checking the format of the data received at inference time, part 2 (comment out when submitting)

In [27]:
# # Sample of the test_df received in the for loop
# test_df = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/example_test.csv")
# display(test_df.head())

# # Sample of the prediction_df received in the for loop
# prediction_df = pd.read_csv("../input/mlb-player-digital-engagement-forecasting/example_sample_submission.csv")
# display(prediction_df.head())

#### Script 8-28: Simulating the data received at inference time (for 2021/4/26)

In [28]:
# Simulate test_df (assuming the data received on 4/26)
test_df = train.loc[train["date"]==20210426, :]
display(test_df.head())

# Simulate prediction_df (assuming the data received on 4/26)
prediction_df = df_engagement.loc[df_engagement["date"]=="2021-04-26", ["date","date_playerId"]].reset_index(drop=True)
prediction_df["date"] = prediction_df["date"].apply(lambda x: int(str(x).replace("-","")[:8]))
for col in ["target1","target2","target3","target4"]:
    prediction_df[col] = 0
display(prediction_df.head())

Unnamed: 0,date,nextDayPlayerEngagement,games,rosters,playerBoxScores,teamBoxScores,transactions,standings,awards,events,playerTwitterFollowers,teamTwitterFollowers
390,20210426,"[{""engagementMetricsDate"":""2021-04-27"",""player...","[{""gamePk"":634374,""gameType"":""R"",""season"":2021...","[{""playerId"":405395,""gameDate"":""2021-04-26"",""t...","[{""home"":1,""gamePk"":634377,""gameDate"":""2021-04...","[{""home"":1,""teamId"":139,""gamePk"":634343,""gameD...","[{""transactionId"":480386,""playerId"":543685,""pl...","[{""season"":2021,""gameDate"":""2021-04-26"",""divis...",,"[{""gamePk"":634433,""gameDate"":""2021-04-26"",""gam...",,


Unnamed: 0,date,date_playerId,target1,target2,target3,target4
0,20210426,20210427_656669,0,0,0,0
1,20210426,20210427_543475,0,0,0,0
2,20210426,20210427_623465,0,0,0,0
3,20210426,20210427_595032,0,0,0,0
4,20210426,20210427_592866,0,0,0,0


#### Script 8-29: Function for building the inference dataset

In [29]:
def makedataset_for_predict(input_test, input_prediction):
    test = input_test.copy()
    prediction = input_prediction.copy()

    # Convert date to a datetime
    prediction["date"] = pd.to_datetime(prediction["date"], format="%Y%m%d")
    # Create the target-date (engagementMetricsDate) and player ID (playerId) columns
    prediction["engagementMetricsDate"] = prediction["date_playerId"].apply(lambda x: x[:8])
    prediction["engagementMetricsDate"] = pd.to_datetime(prediction["engagementMetricsDate"], format="%Y%m%d")
    prediction["playerId"] = prediction["date_playerId"].apply(lambda x: int(x[9:]))

    # Create day of week and year-month from the date
    prediction["dayofweek"] = prediction["date"].dt.dayofweek
    prediction["yearmonth"] = prediction["date"].astype(str).apply(lambda x: x[:7])

    # Join the tables
    df_test = pd.merge(prediction, df_players, on=["playerId"], how="left")

    # Build the features
    x_test = df_test[[
        "playerId", "dayofweek",
        "birthCity", "birthStateProvince", "birthCountry", "heightInches", "weight",
        "primaryPositionCode", "primaryPositionName", "playerForTestSetAndFuturePreds"]]
    id_test = df_test[["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"]]

    # Convert categorical variables to the category dtype
    for col in ["playerId", "dayofweek", "birthCity", "birthStateProvince", "birthCountry", "primaryPositionCode", "primaryPositionName"]:
        x_test[col] = x_test[col].astype("category")

    return x_test, id_test
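
`makedataset_for_predict` converts the categorical columns to the category dtype using only the rows received at inference time, so the set of categories can differ from the one seen in training (for example, a single day contains only one `dayofweek` value). One way to make the encoding explicit is to reuse the training categories when building the test columns. The sketch below shows that idea on toy data; the helper `align_categories` is hypothetical, and whether LightGBM needs this step depends on the version and API in use, so treat it as an illustration, not a required fix.

```python
import pandas as pd


def align_categories(train_df, test_df, cols):
    """Re-encode test columns with the categories of the training columns."""
    out = test_df.copy()
    for col in cols:
        # Values not seen in training become NaN (category code -1)
        out[col] = pd.Categorical(out[col], categories=train_df[col].cat.categories)
    return out


# Toy data: the test frame contains one unseen value ("w")
toy_train = pd.DataFrame({"birthCountry": pd.Series(["x", "y", "z"], dtype="category")})
toy_test = pd.DataFrame({"birthCountry": ["z", "x", "w"]})
aligned = align_categories(toy_train, toy_test, ["birthCountry"])
print(aligned["birthCountry"].cat.codes.tolist())
```

With aligned categories, the same value always maps to the same integer code in training and inference data, and unseen values are marked as missing.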

#### Script 8-30: Building the inference dataset

In [30]:
x_test, id_test = makedataset_for_predict(test_df, prediction_df)
display(x_test.head())
display(id_test.head())

Unnamed: 0,playerId,dayofweek,birthCity,birthStateProvince,birthCountry,heightInches,weight,primaryPositionCode,primaryPositionName,playerForTestSetAndFuturePreds
0,656669,0,Visalia,CA,USA,73,195,8,Outfielder,1
1,543475,0,Hartsville,SC,USA,77,230,1,Pitcher,1
2,623465,0,Salisbury,MD,USA,74,215,1,Pitcher,0
3,595032,0,Ranburne,AL,USA,76,220,1,Pitcher,0
4,592866,0,San Diego,CA,USA,75,235,1,Pitcher,1


Unnamed: 0,engagementMetricsDate,playerId,date_playerId,date,yearmonth,playerForTestSetAndFuturePreds
0,2021-04-27,656669,20210427_656669,2021-04-26,2021-04,1
1,2021-04-27,543475,20210427_543475,2021-04-26,2021-04,1
2,2021-04-27,623465,20210427_623465,2021-04-26,2021-04,0
3,2021-04-27,595032,20210427_595032,2021-04-26,2021-04,0
4,2021-04-27,592866,20210427_592866,2021-04-26,2021-04,1


### **Part 2: Model inference**
#### Script 8-31: Loading the model

In [31]:
with open("model_lgb_target1_fold0.h5", "rb") as f:
    model = pickle.load(f)

#### Script 8-32: Inference with the model

In [32]:
pred = model.predict(x_test)

df_test_pred = id_test.copy()
df_test_pred["target1_fold0"] = pred

#### Script 8-33: Computing the final predictions

In [33]:
# Prediction for target1: the mean over folds
df_test_pred["target1"] = df_test_pred[df_test_pred.columns[df_test_pred.columns.str.contains("target1")]].mean(axis=1)
# target2, target3, and target4 are computed the same way (omitted here).

print(df_test_pred.shape)
df_test_pred.head()

(2061, 8)


Unnamed: 0,engagementMetricsDate,playerId,date_playerId,date,yearmonth,playerForTestSetAndFuturePreds,target1_fold0,target1
0,2021-04-27,656669,20210427_656669,2021-04-26,2021-04,1,0.0292,0.0292
1,2021-04-27,543475,20210427_543475,2021-04-26,2021-04,1,0.0034,0.0034
2,2021-04-27,623465,20210427_623465,2021-04-26,2021-04,0,0.0001,0.0001
3,2021-04-27,595032,20210427_595032,2021-04-26,2021-04,0,0.0,0.0
4,2021-04-27,592866,20210427_592866,2021-04-26,2021-04,1,0.0466,0.0466
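
The final prediction for each target is the mean of the per-fold predictions. The cell above selects the fold columns with `str.contains("target1")`, which would also match a column named `target1` if the cell were run a second time. A minimal sketch that names the fold columns explicitly is shown below; the helper `average_folds` and the toy frame are hypothetical.

```python
import pandas as pd


def average_folds(df_pred, targets, list_nfold):
    """Add one column per target holding the mean of its per-fold predictions."""
    out = df_pred.copy()
    for target in targets:
        # Build the exact column names instead of matching by substring
        fold_cols = ["{}_fold{}".format(target, n) for n in list_nfold]
        out[target] = out[fold_cols].mean(axis=1)
    return out


toy = pd.DataFrame({
    "target1_fold0": [0.0, 1.0],
    "target1_fold1": [2.0, 3.0],
    "target2_fold0": [4.0, 4.0],
    "target2_fold1": [6.0, 8.0],
})
averaged = average_folds(toy, ["target1", "target2"], [0, 1])
print(averaged[["target1", "target2"]])
```

Because the column names are constructed from `targets` and `list_nfold`, re-running the function on its own output gives the same averages.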


#### Script 8-34: Inference function

In [34]:
def predict_lgb(input_test,
                input_id,
                list_nfold=[0,1,2],
               ):
    # Start from the ID columns passed in as an argument (not the global id_test)
    df_test_pred = input_id.copy()

    for target in ["target1","target2","target3","target4"]:
        for nfold in list_nfold:
            # Load the model
            with open("model_lgb_{}_fold{}.h5".format(target, nfold), "rb") as f:
                model = pickle.load(f)

            # Inference
            pred = model.predict(input_test)
            # Store the predictions
            df_test_pred["{}_fold{}".format(target, nfold)] = pred

    # Final prediction: the mean over folds
    for target in ["target1","target2","target3","target4"]:
        df_test_pred[target] = df_test_pred[df_test_pred.columns[df_test_pred.columns.str.contains(target)]].mean(axis=1)
    
    return df_test_pred

#### Script 8-35: Running model inference

In [35]:
df_test_pred = predict_lgb(x_test, id_test)
df_test_pred.head()

Unnamed: 0,engagementMetricsDate,playerId,date_playerId,date,yearmonth,playerForTestSetAndFuturePreds,target1_fold0,target1_fold1,target1_fold2,target2_fold0,...,target3_fold0,target3_fold1,target3_fold2,target4_fold0,target4_fold1,target4_fold2,target1,target2,target3,target4
0,2021-04-27,656669,20210427_656669,2021-04-26,2021-04,1,0.0292,0.0379,0.0197,1.3384,...,0.0044,0.0045,0.006,0.194,0.2257,0.3232,0.029,1.1602,0.005,0.2476
1,2021-04-27,543475,20210427_543475,2021-04-26,2021-04,1,0.0034,0.0031,0.0035,1.0874,...,0.0048,0.0048,0.0053,0.2203,0.2479,0.2974,0.0033,0.9486,0.005,0.2552
2,2021-04-27,623465,20210427_623465,2021-04-26,2021-04,0,0.0001,0.0,0.0001,0.3045,...,0.004,0.0033,0.0025,0.1047,0.1194,0.1579,0.0001,0.2651,0.0033,0.1273
3,2021-04-27,595032,20210427_595032,2021-04-26,2021-04,0,0.0,-0.0,0.0001,0.0311,...,0.0008,0.0007,0.0004,0.0795,0.0838,0.1264,0.0,0.0796,0.0007,0.0965
4,2021-04-27,592866,20210427_592866,2021-04-26,2021-04,1,0.0466,0.0377,0.0181,1.459,...,0.0081,0.0079,0.0116,0.5418,0.6198,0.5293,0.0341,1.2132,0.0092,0.5636


### **Part 3: Converting to the submission format**
#### Script 8-36: Converting to the submission format

In [36]:
df_submit = df_test_pred[["date_playerId", "target1","target2","target3","target4"]]
df_submit.head()

Unnamed: 0,date_playerId,target1,target2,target3,target4
0,20210427_656669,0.029,1.1602,0.005,0.2476
1,20210427_543475,0.0033,0.9486,0.005,0.2552
2,20210427_623465,0.0001,0.2651,0.0033,0.1273
3,20210427_595032,0.0,0.0796,0.0007,0.0965
4,20210427_592866,0.0341,1.2132,0.0092,0.5636


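#### Script 8-37: Running the inference process
- **The mlb library can be run only once per session. To run it again, restart the kernel.**
- This notebook calls the mlb library in three places. **Comment out every such cell except the one you want to run.**
    - Script 8-37: 8.3 Building a Baseline
    - Script 8-47: 8.4 Feature Engineering
    - Script 8-62: 8.5 Model Tuning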

In [37]:
# import mlb

# env = mlb.make_env()
# iter_test = env.iter_test()

# for (test_df, prediction_df) in iter_test:
#     test = test_df.copy()
#     prediction = prediction_df.copy()
#     prediction = prediction.reset_index(drop=False)
    
#     print("date:", prediction["date"][0])
    
#     # Build the dataset
#     x_test, id_test = makedataset_for_predict(test, prediction)

#     # Inference
#     df_test_pred = predict_lgb(x_test, id_test)

#     # Build the submission data
#     df_submit = df_test_pred[["date_playerId", "target1","target2","target3","target4"]]

#     # Post-processing: fill missing values and clip values outside the 0-100 range
#     for i,col in enumerate(["target1","target2","target3","target4"]):
#         df_submit[col] = df_submit[col].fillna(0.)
#         df_submit[col] = df_submit[col].clip(0, 100)

#     # Submit the predictions
#     env.predict(df_submit)
# print("Done.")
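
The post-processing step in the loop above fills missing predictions with 0 and clips every target to the 0-100 range. A minimal sketch of the same step as a standalone function, working on a copy so that the input frame is left unchanged, could look like this. The function name `postprocess_submission` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

TARGETS = ["target1", "target2", "target3", "target4"]


def postprocess_submission(df_pred, targets=TARGETS, lower=0.0, upper=100.0):
    """Select the submission columns, fill NaN with 0, and clip to [lower, upper]."""
    out = df_pred[["date_playerId"] + list(targets)].copy()
    out[targets] = out[targets].fillna(0.0).clip(lower, upper)
    return out


toy_pred = pd.DataFrame({
    "date_playerId": ["20210427_1", "20210427_2", "20210427_3"],
    "target1": [-0.5, np.nan, 120.0],
    "target2": [1.0, 2.0, 3.0],
    "target3": [0.0, 50.0, 100.0],
    "target4": [np.nan, np.nan, 99.9],
})
submit = postprocess_submission(toy_pred)
print(submit)
```

Taking an explicit copy also avoids assigning into a slice of `df_test_pred`, which pandas may flag with a SettingWithCopyWarning when warnings are not suppressed.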

# 8.4 Feature Engineering
## 8.4.1 Data Preprocessing

#### Script 8-38: Extracting the rosters column from train_updated.csv

In [38]:
df_rosters = extract_data(train, col="rosters", show=True)

487/487
(598950, 5)


Unnamed: 0,playerId,gameDate,teamId,statusCode,status
0,430935,2020-04-01,144,A,Active
1,435062,2020-04-01,120,A,Active
2,444489,2020-04-01,158,A,Active
3,445276,2020-04-01,119,A,Active
4,446308,2020-04-01,138,A,Active


#### Script 8-39: Preprocessing the rosters data

In [39]:
# Create and convert the date column
df_rosters = df_rosters.rename(columns={"gameDate":"date"})
df_rosters["date"] = pd.to_datetime(df_rosters["date"], format="%Y-%m-%d")

# List of columns to add (date and playerId are the join keys)
col_rosters = ["teamId","statusCode","status"]

df_rosters.head()

Unnamed: 0,playerId,date,teamId,statusCode,status
0,430935,2020-04-01,144,A,Active
1,435062,2020-04-01,120,A,Active
2,444489,2020-04-01,158,A,Active
3,445276,2020-04-01,119,A,Active
4,446308,2020-04-01,138,A,Active


#### Script 8-40: Computing aggregate features of the targets

In [40]:
df_agg_target = df_train.groupby(["yearmonth", "playerId"])[["target1", "target2", "target3", "target4"]].agg(["mean", "median", "std", "min", "max"])
df_agg_target.columns = ["{}_{}".format(i,j) for i,j in df_agg_target.columns]
df_agg_target = df_agg_target.reset_index(drop=False)
df_agg_target.head()

Unnamed: 0,yearmonth,playerId,target1_mean,target1_median,target1_std,target1_min,target1_max,target2_mean,target2_median,target2_std,...,target3_mean,target3_median,target3_std,target3_min,target3_max,target4_mean,target4_median,target4_std,target4_min,target4_max
0,2020-04,112526,0.8834,0.0647,2.9618,0.0224,15.978,10.811,10.4352,5.3041,...,0.2894,0.1752,0.3478,0.0216,1.6761,21.1961,20.7913,12.6768,0.6305,51.3299
1,2020-04,134181,2.9999,0.2175,10.9845,0.0645,58.4642,14.7861,11.9902,13.2362,...,10.6877,0.9546,24.8149,0.0348,100.0,12.0298,11.6739,6.2926,0.5478,24.3902
2,2020-04,279571,0.0003,0.0,0.0006,0.0,0.0016,0.397,0.3435,0.2787,...,0.0004,0.0,0.0013,0.0,0.006,0.2895,0.2481,0.1986,0.0097,0.7
3,2020-04,282332,0.1413,0.0748,0.1702,0.0223,0.7391,7.8652,7.7711,4.0453,...,0.3794,0.3382,0.2484,0.0501,0.9882,11.354,10.0147,6.1022,0.5633,23.4455
4,2020-04,400085,1.9515,0.6949,3.3399,0.0947,17.0843,30.0941,27.2808,16.4382,...,13.3777,1.8486,26.4342,0.2183,100.0,50.7711,47.0509,29.4601,2.5769,100.0


#### Script 8-41: Creating lag features

In [41]:
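# Sort by year-month (shifting gives wrong results unless rows are in chronological order)
df_agg_target = df_agg_target.sort_values("yearmonth").reset_index(drop=True)

# Relabel each row with the player's next yearmonth, so that a month's aggregates become features of the following month
df_agg_target["yearmonth"] = df_agg_target.groupby(["playerId"])["yearmonth"].shift(-1)
# Fill missing yearmonth values (each player's last row) with "2021-08"
df_agg_target["yearmonth"] = df_agg_target["yearmonth"].fillna("2021-08")

# Rename the aggregates so they are recognizable as lag features
df_agg_target.columns = [col+"_lag1month" if col not in ["playerId","yearmonth"] else col for col in df_agg_target.columns ]

# List of the added columns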
col_agg_target = list(df_agg_target.columns[df_agg_target.columns.str.contains("lag1month")])

df_agg_target.head()

Unnamed: 0,yearmonth,playerId,target1_mean_lag1month,target1_median_lag1month,target1_std_lag1month,target1_min_lag1month,target1_max_lag1month,target2_mean_lag1month,target2_median_lag1month,target2_std_lag1month,...,target3_mean_lag1month,target3_median_lag1month,target3_std_lag1month,target3_min_lag1month,target3_max_lag1month,target4_mean_lag1month,target4_median_lag1month,target4_std_lag1month,target4_min_lag1month,target4_max_lag1month
0,2020-05,112526,0.8834,0.0647,2.9618,0.0224,15.978,10.811,10.4352,5.3041,...,0.2894,0.1752,0.3478,0.0216,1.6761,21.1961,20.7913,12.6768,0.6305,51.3299
1,2020-05,628318,0.0003,0.0,0.0016,0.0,0.0088,0.3717,0.3519,0.2857,...,0.0,0.0,0.0,0.0,0.0,0.4519,0.4173,0.2852,0.0126,1.176
2,2020-05,628317,0.0747,0.0327,0.1005,0.0139,0.4201,10.7568,9.6495,4.7834,...,0.0816,0.0746,0.0462,0.0116,0.1811,3.2524,2.9701,1.861,0.1119,6.8816
3,2020-05,627894,0.0004,0.0,0.0008,0.0,0.0037,1.2347,1.1066,0.6663,...,0.002,0.0,0.0035,0.0,0.0157,0.3802,0.3303,0.2352,0.0165,0.9146
4,2020-05,627500,0.0004,0.0,0.0019,0.0,0.0104,0.294,0.1969,0.3396,...,0.0,0.0,0.0001,0.0,0.0005,0.2036,0.1609,0.1362,0.0117,0.5662
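
The key shift above can be checked on a small example. The sketch below uses made-up data and shows that `shift(-1)` within each player relabels a row with the next month that player appears in, which differs from the next calendar month when a month is missing. The `key_calendar` column is an alternative added here for comparison only; it is an assumption and not what the notebook does.

```python
import pandas as pd

# Toy monthly aggregates; player 2 has no row for 2020-05 (made-up data).
agg = pd.DataFrame({
    "yearmonth": ["2020-04", "2020-05", "2020-06", "2020-04", "2020-06"],
    "playerId": [1, 1, 1, 2, 2],
    "target1_mean": [0.1, 0.2, 0.3, 1.0, 3.0],
})
agg = agg.sort_values("yearmonth").reset_index(drop=True)

# Approach used above: relabel each row with the player's next observed month.
agg["key_shift"] = agg.groupby("playerId")["yearmonth"].shift(-1)

# Alternative for comparison (an assumption): add exactly one calendar month.
agg["key_calendar"] = agg["yearmonth"].apply(
    lambda s: str(pd.Period(s, freq="M") + 1)
)

print(agg.sort_values(["playerId", "yearmonth"]))
```

For player 2, the 2020-04 aggregates are keyed to 2020-06 by the shift and to 2020-05 by the calendar version. Each player's last row has no next month, so the shift leaves it missing; that is the value the notebook fills with `"2021-08"`.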


## 8.4.2 Creating the dataset
#### Script 8-42: Creating the training dataset

In [42]:
# Merge the data
df_train = pd.merge(df_engagement, df_players, on=["playerId"], how="left")
df_train = pd.merge(df_train, df_rosters, on=["date", "playerId"], how="left")
df_train = pd.merge(df_train, df_agg_target, on=["playerId", "yearmonth"], how="left")

# Create the explanatory variables and the target variables
x_train = df_train[[
    "playerId", "dayofweek",
    "birthCity", "birthStateProvince", "birthCountry", "heightInches", "weight", 
    "primaryPositionCode", "primaryPositionName", "playerForTestSetAndFuturePreds"
] + col_rosters + col_agg_target]
y_train = df_train[["target1","target2","target3","target4"]]
id_train = df_train[["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"]]

# Convert categorical variables to the category type
for col in ["playerId", "dayofweek", "birthCity", "birthStateProvince", "birthCountry", "primaryPositionCode", "primaryPositionName"] + col_rosters:
    x_train[col] = x_train[col].astype("category")

print(x_train.shape, y_train.shape, id_train.shape)
x_train.head()

(1003707, 33) (1003707, 4) (1003707, 6)


Unnamed: 0,playerId,dayofweek,birthCity,birthStateProvince,birthCountry,heightInches,weight,primaryPositionCode,primaryPositionName,playerForTestSetAndFuturePreds,...,target3_mean_lag1month,target3_median_lag1month,target3_std_lag1month,target3_min_lag1month,target3_max_lag1month,target4_mean_lag1month,target4_median_lag1month,target4_std_lag1month,target4_min_lag1month,target4_max_lag1month
0,425794,2,Brunswick,GA,USA,79,230,1,Pitcher,1,...,,,,,,,,,,
1,571704,2,Albuquerque,NM,USA,75,210,1,Pitcher,0,...,,,,,,,,,,
2,506702,2,Maracaibo,,Venezuela,70,235,2,Catcher,1,...,,,,,,,,,,
3,607231,2,Savannah,GA,USA,76,200,1,Pitcher,1,...,,,,,,,,,,
4,543193,2,Columbia,CA,USA,76,215,1,Pitcher,0,...,,,,,,,,,,
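
The lag columns are missing in the first rows above because a left merge keeps every row of the left table and fills unmatched keys with NaN. Here is a small sketch of that behaviour on made-up tables; the `validate="many_to_one"` argument is an extra safety check added in this sketch and is not part of the notebook.

```python
import pandas as pd

# Left table: one row per (player, month) to be predicted (made-up data).
left = pd.DataFrame({
    "playerId": [1, 1, 2, 3],
    "yearmonth": ["2020-05", "2020-06", "2020-05", "2020-05"],
})

# Right table: lag features, available only for some keys.
right = pd.DataFrame({
    "playerId": [1, 2],
    "yearmonth": ["2020-05", "2020-05"],
    "target1_mean_lag1month": [0.1, 1.0],
})

# validate raises an error if the right table has duplicate keys,
# which would otherwise silently multiply rows.
merged = pd.merge(left, right, on=["playerId", "yearmonth"],
                  how="left", validate="many_to_one")

print(merged)
print(len(left), len(merged))
```

Comparing the row counts before and after each merge is a cheap way to confirm that no rows were duplicated.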


## 8.4.3 Model training
#### Script 8-43: Running model training

In [43]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1', 
    'metric': 'mean_absolute_error',
    'learning_rate': 0.05,
    'num_leaves': 32,
    'subsample': 0.7,
    'subsample_freq': 1,
    'feature_fraction': 0.8,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 50,
    'n_estimators': 10000,
    "random_state": 123,
    "importance_type": "gain",
}

df_valid_pred, df_metrics, df_imp = train_lgb(x_train,
                                              y_train,
                                              id_train,
                                              params,
                                              list_nfold=[0,1,2],
                                              mode_train="train",
                                             )

-------------------- target1 , fold: 0 --------------------
(752265, 33) (752265,) (752265, 6)
(36797, 33) (36797,) (36797, 6)
training start.
[100]	training's l1: 0.504554	valid_1's l1: 1.28795
[200]	training's l1: 0.504185	valid_1's l1: 1.28709
-------------------- target2 , fold: 0 --------------------
(752265, 33) (752265,) (752265, 6)
(36797, 33) (36797,) (36797, 6)
training start.
[100]	training's l1: 1.57166	valid_1's l1: 2.19039
-------------------- target3 , fold: 0 --------------------
(752265, 33) (752265,) (752265, 6)
(36797, 33) (36797,) (36797, 6)
training start.
-------------------- target4 , fold: 0 --------------------
(752265, 33) (752265,) (752265, 6)
(36797, 33) (36797,) (36797, 6)
training start.
[100]	training's l1: 0.754202	valid_1's l1: 1.21511
[200]	training's l1: 0.733777	valid_1's l1: 1.20404
-------------------- target1 , fold: 1 --------------------
(752265, 33) (752265,) (752265, 6)
(35610, 33) (35610,) (35610, 6)
training start.
[100]	training's l1: 0.536

#### Script 8-44: Obtaining the evaluation scores

In [44]:
print("MCMAE: {:.4f}".format(df_metrics["mae"].mean()))
display(pd.pivot_table(df_metrics, index="nfold", columns="target", values="mae", aggfunc=np.mean, margins=True))

MCMAE: 1.2762


target,target1,target2,target3,target4,All
nfold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,1.2871,2.1877,0.8728,1.2038,1.3878
1,1.1817,1.895,0.8248,1.5521,1.3634
2,1.1002,1.5747,0.7525,0.882,1.0774
All,1.1897,1.8858,0.8167,1.2126,1.2762
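
The MCMAE shown above is the mean of the per-target MAE values. As a worked illustration with made-up numbers (the arrays below are assumptions, not notebook outputs):

```python
import numpy as np

# Rows are samples, columns are target1..target4 (made-up values).
y_true = np.array([[1.0, 10.0, 0.0, 5.0],
                   [3.0, 20.0, 2.0, 5.0]])
y_pred = np.array([[2.0, 10.0, 1.0, 3.0],
                   [3.0, 16.0, 1.0, 7.0]])

# MAE of each target column.
mae_per_target = np.mean(np.abs(y_true - y_pred), axis=0)

# MCMAE: the average of the column-wise MAEs.
mcmae = mae_per_target.mean()

print(mae_per_target, mcmae)
```

In the notebook the same averaging is additionally taken over the folds, because `df_metrics` holds one MAE per target and fold.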


#### Script 8-45: Checking feature importance

In [45]:
df_imp.groupby(["col"])["imp"].agg(["mean", "std"]).sort_values("mean", ascending=False)[:10]

Unnamed: 0_level_0,mean,std
col,Unnamed: 1_level_1,Unnamed: 2_level_1
target3_mean_lag1month,1688014.9521,3455273.6685
playerId,1400102.7888,856757.8037
target1_mean_lag1month,1065649.4352,1980489.1468
target1_median_lag1month,892846.6924,1333219.0329
target3_std_lag1month,431693.7797,949798.904
target4_mean_lag1month,391389.7266,832807.9512
target2_std_lag1month,376575.3955,661457.004
target2_mean_lag1month,321696.2312,533681.4111
birthCity,299509.1416,188395.7614
target4_median_lag1month,292456.7648,622413.6875


## 8.4.4 Model inference
#### Script 8-46: Function for building the inference dataset

In [46]:
def makedataset_for_predict(input_x, input_prediction):
    test = input_x.copy()
    prediction = input_prediction.copy()
    
    # Convert to datetime
    prediction["date"] = pd.to_datetime(prediction["date"], format="%Y%m%d")
    # Extract engagementMetricsDate and playerId
    prediction["engagementMetricsDate"] = prediction["date_playerId"].apply(lambda x: x[:8])
    prediction["engagementMetricsDate"] = pd.to_datetime(prediction["engagementMetricsDate"], format="%Y%m%d") 
    prediction["playerId"] = prediction["date_playerId"].apply(lambda x: int(x[9:]))
    
    # Create features from date
    prediction["dayofweek"] = prediction["date"].dt.dayofweek
    prediction["yearmonth"] = prediction["date"].astype(str).apply(lambda x: x[:7])
    
    # Extract rosters, then create and format the date column
    df_rosters = extract_data(test, col="rosters")
    df_rosters = df_rosters.rename(columns={"gameDate":"date"})
    df_rosters["date"] = pd.to_datetime(df_rosters["date"], format="%Y-%m-%d")
    
    # Join the tables
    df_test = pd.merge(prediction, df_players, on=["playerId"], how="left")
    df_test = pd.merge(df_test, df_rosters, on=["date", "playerId"], how="left")
    df_test = pd.merge(df_test, df_agg_target, on=["playerId", "yearmonth"], how="left")
    
    # Create the explanatory variables
    x_test = df_test[[
        "playerId", "dayofweek",
        "birthCity", "birthStateProvince", "birthCountry", "heightInches", "weight", 
        "primaryPositionCode", "primaryPositionName", "playerForTestSetAndFuturePreds"
    ] + col_rosters + col_agg_target]
    id_test = df_test[["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"]]

    # Convert categorical variables to the category type
    for col in ["playerId", "dayofweek", "birthCity", "birthStateProvince", "birthCountry", "primaryPositionCode", "primaryPositionName"] + col_rosters:
        x_test[col] = x_test[col].astype("category")

    return x_test, id_test
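
The function above derives `engagementMetricsDate` and `playerId` from the `date_playerId` string by slicing fixed positions. The sketch below shows the same parsing on two made-up rows, splitting on the underscore instead; the row values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows in the shape of sample_prediction_df after reset_index.
prediction = pd.DataFrame({
    "date": [20210501, 20210501],
    "date_playerId": ["20210502_545361", "20210502_400121"],
})

prediction["date"] = pd.to_datetime(prediction["date"], format="%Y%m%d")

# Split "YYYYMMDD_playerId" on the underscore rather than slicing by position.
parts = prediction["date_playerId"].str.split("_", expand=True)
prediction["engagementMetricsDate"] = pd.to_datetime(parts[0], format="%Y%m%d")
prediction["playerId"] = parts[1].astype(int)

# Date-derived features: Monday is 0, and yearmonth looks like "2021-05".
prediction["dayofweek"] = prediction["date"].dt.dayofweek
prediction["yearmonth"] = prediction["date"].dt.strftime("%Y-%m")

print(prediction)
```

Both approaches give the same result as long as the date part is always eight characters; splitting simply avoids hard-coding the offsets.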

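#### Script 8-47: Running inference (same as the baseline)
- **The mlb library can be run only once. To run it again, you must restart the kernel.**
- This notebook runs the mlb library in three places. **Comment out every cell except the one you want to run before executing.**
    - Script 8-37 :  8.3 Creating a baseline
    - Script 8-47 :  8.4 Feature engineering
    - Script 8-62 :  8.5 Model tuning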

In [47]:
# import mlb

# env = mlb.make_env()
# iter_test = env.iter_test()

# for (test_df, sample_prediction_df) in iter_test:
#     test = test_df.copy()
#     prediction = sample_prediction_df.copy()
#     prediction = prediction.reset_index(drop=False)
    
#     print("date:", prediction["date"][0])
    
#     # Build the dataset
#     x_test, id_test = makedataset_for_predict(test, prediction)
    
#     # Inference
#     df_test_pred = predict_lgb(x_test, id_test)
    
#     # Build the submission data
#     df_submit = df_test_pred[["date_playerId", "target1","target2","target3","target4"]]
    
#     # Post-processing: fill missing values and clip values outside the 0-100 range
#     for i,col in enumerate(["target1","target2","target3","target4"]):
#         df_submit[col] = df_submit[col].fillna(0.)
#         df_submit[col] = df_submit[col].clip(0, 100)
    
#     # Submit the predictions
#     env.predict(df_submit)
# print("Done.")
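
The post-processing step in the loop above fills missing predictions with 0 and clips everything to the 0 to 100 range. A minimal sketch of that step on a made-up frame, written in vectorized form over all target columns at once (the values and the two-column layout are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up predictions containing a negative value, a NaN and a value above 100.
df_submit = pd.DataFrame({
    "date_playerId": ["20210502_1", "20210502_2", "20210502_3"],
    "target1": [-0.5, np.nan, 120.0],
    "target2": [50.0, 3.2, np.nan],
})

cols = ["target1", "target2"]

# Fill missing values with 0, then clip to the valid range of the targets.
df_submit[cols] = df_submit[cols].fillna(0.0).clip(0, 100)

print(df_submit)
```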

# 8.5 Model tuning
#### Script 8-48: Computing correlation coefficients between the target variables

In [48]:
df_engagement[["target1", "target2", "target3", "target4"]].corr()

Unnamed: 0,target1,target2,target3,target4
target1,1.0,0.3529,0.3833,0.3252
target2,0.3529,1.0,0.366,0.4988
target3,0.3833,0.366,1.0,0.3229
target4,0.3252,0.4988,0.3229,1.0


#### Script 8-49: Importing libraries

In [49]:
from sklearn.preprocessing import LabelEncoder

# import tensorflow
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization, Activation, Concatenate
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.layers import Embedding, Flatten

#### Script: Setting seeds for reproducibility (repeated from Script 6-15)

In [50]:
def seed_everything(seed):
    import random
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(
        intra_op_parallelism_threads=1,
        inter_op_parallelism_threads=1
    )
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)

## 8.5.1 Creating the dataset
#### Script 8-50: Creating the training dataset

In [51]:
x_train = df_train[[
    "playerId", "dayofweek",
    "birthCity", "birthStateProvince", "birthCountry", "heightInches", "weight", 
    "primaryPositionCode", "primaryPositionName", "playerForTestSetAndFuturePreds"
] + col_rosters + col_agg_target]
y_train = df_train[["target1","target2","target3","target4"]]
id_train = df_train[["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"]]

print(x_train.shape, y_train.shape, id_train.shape)

(1003707, 33) (1003707, 4) (1003707, 6)


#### Script 8-51: Creating column lists for numeric and categorical variables

In [52]:
col_num = ["heightInches", "weight","playerForTestSetAndFuturePreds"] + col_agg_target
col_cat = ["playerId", "dayofweek", "birthCity", "birthStateProvince", "birthCountry", "primaryPositionCode", "primaryPositionName"] + col_rosters
print(len(col_num), len(col_cat))

23 10


#### Script 8-52: Missing-value imputation and normalization of numeric data

In [53]:
dict_num = {}
for col in col_num:
    print(col)
#     # Imputation: fill with the mean
#     value_fillna = x_train[col].mean()
    # Imputation: fill with 0
    value_fillna = 0
    x_train[col] = x_train[col].fillna(value_fillna)
    
    # Normalize (rescale to the 0-1 range)
    value_min = x_train[col].min()
    value_max = x_train[col].max()
    x_train[col] = (x_train[col] - value_min) / (value_max - value_min)
    
    # Save the parameters so the same transform can be applied to the test data
    dict_num[col] = {}
    dict_num[col]["fillna"] = value_fillna
    dict_num[col]["min"] = value_min
    dict_num[col]["max"] = value_max
    
print("Done.")

heightInches
weight
playerForTestSetAndFuturePreds
target1_mean_lag1month
target1_median_lag1month
target1_std_lag1month
target1_min_lag1month
target1_max_lag1month
target2_mean_lag1month
target2_median_lag1month
target2_std_lag1month
target2_min_lag1month
target2_max_lag1month
target3_mean_lag1month
target3_median_lag1month
target3_std_lag1month
target3_min_lag1month
target3_max_lag1month
target4_mean_lag1month
target4_median_lag1month
target4_std_lag1month
target4_min_lag1month
target4_max_lag1month
Done.
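
The loop above fills missing values with 0 and then applies min-max scaling with the minimum and maximum of the training column. The sketch below replays this on one made-up column and reuses the stored parameters on made-up test values; all numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up training and test values for one numeric column.
train_col = pd.Series([60.0, np.nan, 80.0, 70.0])
test_col = pd.Series([90.0, np.nan])

# Fit: fill with 0, then record min and max of the filled training column.
fill = 0
train_filled = train_col.fillna(fill)
vmin, vmax = train_filled.min(), train_filled.max()

# Transform the training data.
# Note: a constant column would make (vmax - vmin) zero and needs a guard.
train_scaled = (train_filled - vmin) / (vmax - vmin)

# Transform the test data with the parameters learned from training.
test_scaled = (test_col.fillna(fill) - vmin) / (vmax - vmin)

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Two effects are visible here. Filling with 0 before taking the minimum pulls the minimum down to 0 whenever the column has missing values, and test values above the training maximum end up outside the 0 to 1 range.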


#### Script 8-53: Missing-value imputation and label encoding of categorical variables

In [54]:
dict_cat = {}
for col in col_cat:
    print(col)
    # Imputation: fill with "unknown"
    value_fillna = "unknown"
    x_train[col] = x_train[col].fillna(value_fillna)
    
    # Convert to str
    x_train[col] = x_train[col].astype(str)
    
    # Label encoding: convert to integers starting from 0
    le = LabelEncoder()
    le.fit(x_train[col])
    # Reserve an "unknown" label so that unseen values can be handled at inference time.
    list_label = sorted(list(set(le.classes_) | set(["unknown"])))
    map_label = {j:i for i,j in enumerate(list_label)}
    x_train[col] = x_train[col].map(map_label)
    
    # Save the mapping so the same transform can be applied to the test data
    dict_cat[col] = {}
    dict_cat[col]["fillna"] = value_fillna
    dict_cat[col]["map_label"] = map_label
    dict_cat[col]["num_label"] = len(list_label)
    
print("Done.")

playerId
dayofweek
birthCity
birthStateProvince
birthCountry
primaryPositionCode
primaryPositionName
teamId
statusCode
status
Done.
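
The encoding above always reserves an `"unknown"` label so that values never seen in training can still be mapped at inference time. A minimal sketch of the idea on made-up position codes (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up training and test values for one categorical column.
train_col = pd.Series(["SS", None, "P", "C"])
test_col = pd.Series(["P", "DH", None])

# Fit: fill missing values, cast to str, and build the label mapping.
train_str = train_col.fillna("unknown").astype(str)
labels = sorted(set(train_str) | {"unknown"})
map_label = {label: i for i, label in enumerate(labels)}
train_enc = train_str.map(map_label)

# Transform test data: unseen values ("DH") become NaN after map,
# and are then replaced with the id of "unknown".
test_enc = (
    test_col.fillna("unknown").astype(str)
    .map(map_label)
    .fillna(map_label["unknown"])
    .astype(int)
)

print(map_label)
print(train_enc.tolist(), test_enc.tolist())
```

The size of the mapping, `len(labels)`, is what the notebook stores as `num_label` and later uses as the input dimension of the embedding layer.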


#### Script 8-54: Wrapping imputation and normalization in a function (for inference)

In [55]:
def transform_data(input_x):
    output_x = input_x.copy()
    
    # Imputation and normalization of numeric data
    for col in col_num:
        # Imputation: fill with the value saved at training time
        value_fillna = dict_num[col]["fillna"]
        output_x[col] = output_x[col].fillna(value_fillna)
        
        # Normalize with the min and max saved at training time
        value_min = dict_num[col]["min"]
        value_max = dict_num[col]["max"]
        output_x[col] = (output_x[col] - value_min) / (value_max - value_min)
    
    # Imputation and label encoding of categorical variables
    for col in col_cat:
        # Imputation: fill with "unknown"
        value_fillna = dict_cat[col]["fillna"]
        output_x[col] = output_x[col].fillna(value_fillna)
        
        # Convert to str
        output_x[col] = output_x[col].astype(str)
        
        # Label encoding: convert to integers starting from 0
        map_label = dict_cat[col]["map_label"]
        output_x[col] = output_x[col].map(map_label)
        # Values with no matching label are mapped to the "unknown" label
        output_x[col] = output_x[col].fillna(map_label["unknown"])

    return output_x

## 8.5.2 Model training
#### Script 8-55: Defining the neural network model

In [56]:
def create_model(col_num=["heightInches", "weight"],
                 col_cat=["playerId", "teamId", "dayofweek"], 
                 show=False,
                ):
    input_num = Input(shape=(len(col_num),))
    input_cat = Input(shape=(len(col_cat),))
    
    # numeric
    x_num = input_num #Dense(30, activation="relu")(input_num)
    
    # category
    for i,col in enumerate(col_cat):
        tmp_cat = input_cat[:, i]
        input_dim = dict_cat[col]["num_label"]
        output_dim = int(input_dim/2)
        tmp_cat = Embedding(input_dim=input_dim, output_dim=output_dim)(tmp_cat)
        tmp_cat = Dropout(0.2)(tmp_cat)
        tmp_cat = Flatten()(tmp_cat)
        if i==0:
            x_cat = tmp_cat
        else:
            x_cat = Concatenate()([x_cat, tmp_cat])

    # concat
    x = Concatenate()([x_num, x_cat])
    
    x = Dense(128, activation="relu")(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    output = Dense(4, activation="linear")(x)
    
    model = Model(inputs=[input_num, input_cat], outputs=output)
    model.compile(optimizer="Adam", loss="mae", metrics=["mae"])
    
    if show:
        print(model.summary())
    else:
        return model
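
As a rough picture of the data flow in `create_model`, the NumPy-only sketch below mimics what happens before the dense layers: each categorical id selects one row of a per-column lookup table, and the resulting vectors are concatenated with the numeric inputs. This is not the Keras implementation; the vocabulary sizes, the random tables and the input values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes per categorical column (num_label in the notebook).
num_label = {"playerId": 6, "dayofweek": 7}

# One lookup table per column: num_label rows, num_label // 2 columns,
# mirroring output_dim = int(input_dim / 2).
tables = {col: rng.normal(size=(n, n // 2)) for col, n in num_label.items()}

x_num = np.array([[0.5, 0.2], [0.1, 0.9]])  # numeric inputs, already scaled
x_cat = np.array([[3, 0], [5, 6]])          # label-encoded categorical ids

# Embedding lookup: pick the table row for each id, column by column.
x_cat_vec = np.concatenate(
    [tables[col][x_cat[:, i]] for i, col in enumerate(num_label)], axis=1
)

# Concatenate numeric and embedded categorical parts.
x = np.concatenate([x_num, x_cat_vec], axis=1)
print(x.shape)
```

In the real model the tables are trainable weights, and dropout is applied to each embedded vector before concatenation.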

#### Script 8-56: Checking the model structure

In [57]:
create_model(col_num=col_num,
             col_cat=col_cat,
             show=True)

2022-05-01 09:04:41.594450: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 10)]         0                                            
__________________________________________________________________________________________________
tf.__operators__.getitem (Slici (None,)              0           input_2[0][0]                    
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None,)              0           input_2[0][0]                    
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1031)         2125922     tf.__operators__.getitem[0][0]   
______________________________________________________________________________________________

#### Script 8-57: Customizing the training function for the neural network

In [58]:
def train_tf(input_x,
             input_y,
             input_id,
             list_nfold=[0,1,2],
             mode_train="train",
             batch_size=1024,
             epochs=100,
            ):
    # DataFrame to store the validation predictions
    df_valid_pred = pd.DataFrame()
    # List to store the evaluation scores
    metrics = []
    
    # validation
    cv = []
    for month_tr, month_va in list_cv_month:
        cv.append([
            input_id.index[input_id["yearmonth"].isin(month_tr)],
            input_id.index[input_id["yearmonth"].isin(month_va) & (input_id["playerForTestSetAndFuturePreds"]==1)],
        ])
    
    # Model training (one model per fold)
    for nfold in list_nfold:
        print("-"*20, "fold:", nfold, "-"*20)
        idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
        
        x_num_tr, x_cat_tr, y_tr = input_x.loc[idx_tr, col_num].values, input_x.loc[idx_tr, col_cat].values, input_y.loc[idx_tr, :].values
        x_num_va, x_cat_va, y_va = input_x.loc[idx_va, col_num].values, input_x.loc[idx_va, col_cat].values, input_y.loc[idx_va, :].values
        print(x_num_tr.shape, x_cat_tr.shape, y_tr.shape)
        print(x_num_va.shape, x_cat_va.shape, y_va.shape)
        
        filepath = "model_tf_fold{}.h5".format(nfold)
        
        if mode_train=="train":
            print("training start.")
            seed_everything(seed=123)
            model = create_model(col_num=col_num, col_cat=col_cat, show=False)
            model.fit(x=[x_num_tr, x_cat_tr],
                      y=y_tr,
                      validation_data=([x_num_va, x_cat_va], y_va),
                      batch_size=batch_size,
                      epochs=epochs,
                      callbacks=[
                          ModelCheckpoint(filepath= filepath, monitor="val_loss", mode="min", verbose=1, save_best_only=True, save_weights_only=True),
                          EarlyStopping(monitor="val_loss", mode="min", min_delta=0, patience=10, verbose=1, restore_best_weights=True),
                          ReduceLROnPlateau(monitor="val_loss", mode="min", factor=0.1, patience=5, verbose=1),
                      ],
                      verbose=1,
                     )
        else:
            print("model load.")
            model = create_model(col_num=col_num, col_cat=col_cat, show=False)
            model.load_weights(filepath)
            print("Done.")
    
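        # Get predictions for the validation data
        y_va_pred = model.predict([x_num_va, x_cat_va])
        # id_va is not defined elsewhere in this function, so take the validation rows
        # of input_id here and reset the index so that concat(axis=1) aligns rows by position
        id_va = input_id.loc[idx_va, :].reset_index(drop=True)
        tmp_pred = pd.concat([
            id_va,
            pd.DataFrame(y_va, columns=["target1_true","target2_true","target3_true","target4_true"]),
            pd.DataFrame(y_va_pred, columns=["target1_pred","target2_pred","target3_pred","target4_pred"]),
        ], axis=1)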
        tmp_pred["nfold"] = nfold
        df_valid_pred = pd.concat([df_valid_pred, tmp_pred], axis=0, ignore_index=True)
        
        # Compute the evaluation scores
        metrics.append(["target1", nfold, np.mean(np.abs(y_va[:,0] - y_va_pred[:,0]))])
        metrics.append(["target2", nfold, np.mean(np.abs(y_va[:,1] - y_va_pred[:,1]))])
        metrics.append(["target3", nfold, np.mean(np.abs(y_va[:,2] - y_va_pred[:,2]))])
        metrics.append(["target4", nfold, np.mean(np.abs(y_va[:,3] - y_va_pred[:,3]))])
    
    print("-"*10, "result", "-"*10)
    # Evaluation scores
    df_metrics = pd.DataFrame(metrics, columns=["target", "nfold", "mae"])
    print("MCMAE: {:.4f}".format(df_metrics["mae"].mean()))
    
    # Validation predictions
    df_valid_pred_all = pd.pivot_table(df_valid_pred,
                                       index=["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"],
                                       columns=["nfold"], values=list(df_valid_pred.columns[df_valid_pred.columns.str.contains("target")]), aggfunc=np.sum)
    df_valid_pred_all.columns = ["{}_fold{}_{}".format(i.split("_")[0], j,i.split("_")[1]) for i,j in df_valid_pred_all.columns]
    df_valid_pred_all = df_valid_pred_all.reset_index(drop=False)
    
    return df_valid_pred_all, df_metrics
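
The validation split in `train_tf` selects rows by month, and validation rows are further restricted to players flagged with `playerForTestSetAndFuturePreds == 1`. The sketch below shows that selection on made-up data. `list_cv_month` is defined earlier in the notebook, outside this section, so `list_cv_month_toy` and `id_toy` are hypothetical stand-ins.

```python
import pandas as pd

# Made-up id table with the two columns the split depends on.
id_toy = pd.DataFrame({
    "yearmonth": ["2021-03", "2021-04", "2021-04", "2021-05", "2021-05"],
    "playerForTestSetAndFuturePreds": [1, 1, 0, 1, 0],
})

# Hypothetical fold definition: (training months, validation months).
list_cv_month_toy = [
    [["2021-03"], ["2021-04"]],
    [["2021-03", "2021-04"], ["2021-05"]],
]

cv = []
for month_tr, month_va in list_cv_month_toy:
    # Training rows: every player in the training months.
    idx_tr = id_toy.index[id_toy["yearmonth"].isin(month_tr)]
    # Validation rows: only flagged players in the validation months.
    idx_va = id_toy.index[
        id_toy["yearmonth"].isin(month_va)
        & (id_toy["playerForTestSetAndFuturePreds"] == 1)
    ]
    cv.append([idx_tr, idx_va])

for nfold, (idx_tr, idx_va) in enumerate(cv):
    print(nfold, list(idx_tr), list(idx_va))
```

Restricting validation to flagged players keeps the validation score focused on the players that are scored in the test period, while training still uses all rows.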

#### Script 8-58: Running training

In [59]:
df_valid_pred, df_metrics = train_tf(x_train,
                                     y_train,
                                     id_train,
                                     list_nfold=[0,1,2],
                                     mode_train="train",
                                     batch_size=1024,
                                     epochs=1000,
                                    )

-------------------- fold: 0 --------------------
(752265, 23) (752265, 10) (752265, 4)
(36797, 23) (36797, 10) (36797, 4)
training start.


2022-05-01 09:04:46.865882: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/1000

Epoch 00001: val_loss improved from inf to 1.39767, saving model to model_tf_fold0.h5
Epoch 2/1000

Epoch 00002: val_loss did not improve from 1.39767
Epoch 3/1000

Epoch 00003: val_loss improved from 1.39767 to 1.39519, saving model to model_tf_fold0.h5
Epoch 4/1000

Epoch 00004: val_loss improved from 1.39519 to 1.39262, saving model to model_tf_fold0.h5
Epoch 5/1000

Epoch 00005: val_loss did not improve from 1.39262
Epoch 6/1000

Epoch 00006: val_loss improved from 1.39262 to 1.39116, saving model to model_tf_fold0.h5
Epoch 7/1000

Epoch 00007: val_loss did not improve from 1.39116
Epoch 8/1000

Epoch 00008: val_loss did not improve from 1.39116
Epoch 9/1000

Epoch 00009: val_loss did not improve from 1.39116
Epoch 10/1000

Epoch 00010: val_loss did not improve from 1.39116
Epoch 11/1000

Epoch 00011: val_loss did not improve from 1.39116

Epoch 00011: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 12/1000

Epoch 00012: val_loss did not impr

#### Script 8-59: Checking the evaluation scores

In [60]:
print("MCMAE: {:.4f}".format(df_metrics["mae"].mean()))
display(pd.pivot_table(df_metrics, index="nfold", columns="target", values="mae", aggfunc=np.mean, margins=True))

MCMAE: 1.2749


target,target1,target2,target3,target4,All
nfold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,1.2842,2.1993,0.8792,1.2019,1.3912
1,1.1757,1.8948,0.8224,1.5313,1.3561
2,1.0874,1.6056,0.7466,0.8701,1.0774
All,1.1824,1.8999,0.8161,1.2011,1.2749


## 8.5.3 Model inference
#### Script 8-60: Customizing the dataset-creation function for the neural network

In [61]:
def makedataset_for_predict(input_x, input_prediction):
    test = input_x.copy()
    prediction = input_prediction.copy()
    
    # Convert to datetime
    prediction["date"] = pd.to_datetime(prediction["date"], format="%Y%m%d")
    # Extract engagementMetricsDate and playerId
    prediction["engagementMetricsDate"] = prediction["date_playerId"].apply(lambda x: x[:8])
    prediction["engagementMetricsDate"] = pd.to_datetime(prediction["engagementMetricsDate"], format="%Y%m%d") 
    prediction["playerId"] = prediction["date_playerId"].apply(lambda x: int(x[9:]))
    
    # Create features from date
    prediction["dayofweek"] = prediction["date"].dt.dayofweek
    prediction["yearmonth"] = prediction["date"].astype(str).apply(lambda x: x[:7])
    
    # Extract rosters, then create and format the date column
    df_rosters = extract_data(test, col="rosters")
    df_rosters = df_rosters.rename(columns={"gameDate":"date"})
    df_rosters["date"] = pd.to_datetime(df_rosters["date"], format="%Y-%m-%d")
    
    # Join the tables
    df_test = pd.merge(prediction, df_players, on=["playerId"], how="left")
    df_test = pd.merge(df_test, df_rosters, on=["date", "playerId"], how="left")
    df_test = pd.merge(df_test, df_agg_target, on=["playerId", "yearmonth"], how="left")
    
    # Create the explanatory variables
    x_test = df_test[[
        "playerId", "dayofweek",
        "birthCity", "birthStateProvince", "birthCountry", "heightInches", "weight", 
        "primaryPositionCode", "primaryPositionName", "playerForTestSetAndFuturePreds"
    ] + col_rosters + col_agg_target]
    id_test = df_test[["engagementMetricsDate","playerId","date_playerId","date","yearmonth","playerForTestSetAndFuturePreds"]]

#     # Convert categorical variables to the category type
#     for col in ["playerId", "dayofweek", "birthCity", "birthStateProvince", "birthCountry", "primaryPositionCode", "primaryPositionName"] + col_rosters:
#         x_test[col] = x_test [col].astype("category")

    return x_test, id_test

#### Script 8-61: Customizing the inference function for the neural network

In [62]:
def predict_tf(input_x,
               input_id,
               list_nfold=[0,1,2],
              ):
    # Array to accumulate the predictions
    test_pred = np.zeros((len(input_x), 4))
    
    # Split into numeric and categorical variables
    x_num_test, x_cat_test = input_x[col_num], input_x[col_cat]
    
    for nfold in list_nfold:
        # Load the model
        filepath = "model_tf_fold{}.h5".format(nfold)
        model = create_model(col_num=col_num, col_cat=col_cat, show=False)
        model.load_weights(filepath)
        
        # Get predictions for the test data and add this fold's share of the average
        pred = model.predict([x_num_test, x_cat_test], batch_size=512, verbose=0)
        test_pred += pred / len(list_nfold)
    
    # Store the predictions
    df_test_pred = pd.concat([
        input_id,
        pd.DataFrame(test_pred, columns=["target1","target2","target3","target4"]),
    ], axis=1)
    
    return df_test_pred
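
`predict_tf` averages the fold models by adding `pred / len(list_nfold)` for each fold. The sketch below checks, on made-up arrays standing in for model outputs, that this running sum equals the plain mean over folds.

```python
import numpy as np

# Made-up predictions from three fold models: 2 samples, 2 targets each.
fold_preds = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    np.array([[2.0, 2.0], [3.0, 7.0]]),
    np.array([[3.0, 5.0], [0.0, 1.0]]),
]

# Accumulate each fold's share, as in predict_tf.
test_pred = np.zeros((2, 2))
for pred in fold_preds:
    test_pred += pred / len(fold_preds)

# The result matches the element-wise mean across folds.
mean_pred = np.mean(fold_preds, axis=0)
print(test_pred)
```

Accumulating in place avoids holding all fold predictions in memory at once, which matters more as the number of folds or rows grows.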

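#### Script 8-62: Running inference
- **The mlb library can be run only once. To run it again, you must restart the kernel.**
- This notebook runs the mlb library in three places. **Comment out every cell except the one you want to run before executing.**
    - Script 8-37 :  8.3 Creating a baseline
    - Script 8-47 :  8.4 Feature engineering
    - Script 8-62 :  8.5 Model tuning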

In [63]:
# import mlb

# env = mlb.make_env()
# iter_test = env.iter_test()

# for (test_df, sample_prediction_df) in iter_test:
#     test = test_df.copy()
#     prediction = sample_prediction_df.copy()
#     prediction = prediction.reset_index(drop=False)
    
#     print("date:", prediction["date"][0])
    
#     # Build the dataset
#     x_test, id_test = makedataset_for_predict(test, prediction)
    
#     # Imputation and normalization
#     x_test = transform_data(x_test)
        
#     # Inference
#     df_test_pred = predict_tf(x_test, id_test)
    
#     # Build the submission data
#     df_submit = df_test_pred[["date_playerId", "target1","target2","target3","target4"]]
    
#     # Post-processing: fill missing values and clip values outside the 0-100 range
#     for i,col in enumerate(["target1","target2","target3","target4"]):
#         df_submit[col] = df_submit[col].fillna(0.)
#         df_submit[col] = df_submit[col].clip(0, 100)
    
#     # Submit the predictions
#     env.predict(df_submit)
# print("Done.")