This kernel is a Kaggle tutorial originally written in Japanese.

# 2019 Data Science Bowl
<img src="https://i.ytimg.com/vi/8g4zZN7UJKs/maxresdefault.jpg" alt="drawing" width="600"/>

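This notebook is a walkthrough of the 2019 Data Science Bowl (DSB) for Kaggle beginners.  
It is written for people who have worked through Titanic and House Prices but have not yet touched data from a real competition.  
It collects what I thought about while taking part in this competition, together with ideas from notebooks by stronger competitors.  
This was my first competition, so this is a beginner's guide written by a beginner.  
I tried to write the notebook I wish I could have read two weeks ago.  

I finished with a bronze medal. In the notebook I also discuss how my model differs from the ones that scored better.  
**A list of the notebooks and other material I referred to is at the end of this notebook.**  
I recommend reading through them.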


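# Competition overview

The task is to predict, from users' activity logs in a children's game, how they perform on the five in-game `Assessment`s that measure their ability.  
The crux is that each user's activity log is a time series with sparse, irregular time intervals.  
The length of the log varies enormously from user to user.  
The question is how to summarize this log into per-user features.  

The label to predict, the performance, falls into four groups, and these groups are ordered.  
The evaluation metric, quadratic weighted kappa (described below), also reflects this ordering.  
We also need to think about how to build this ordering into the model.

The data comes from a real game. **I think you should play it yourself and get a general feel for what it is like.**  
https://measureup.pbskids.org/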


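# Model overview

I used a regression model based on LightGBM.  
As mentioned above, the labels to predict are ordered.  
To reflect the continuous quantity behind the labels, I used regression rather than a classifier.  
I trained five regression models and computed a weighted average of their predictions.  
The five models use different hyperparameters, each obtained with BayesianOptimization.  
The average is then rounded to an integer, which gives the final predicted label.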
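
A minimal sketch of the blending step described above, assuming the raw predictions of the models are stacked in one array. The name `blend_and_round`, the toy numbers, and the equal weights are illustrative assumptions, and plain rounding to the nearest integer is used as the simplest reading of "rounding".

```python
import numpy as np

def blend_and_round(predictions, weights, min_label=0, max_label=3):
    """Weighted-average several models' regression outputs and round to labels."""
    predictions = np.asarray(predictions, dtype=float)  # shape: (n_models, n_samples)
    blended = np.average(predictions, axis=0, weights=weights)
    # Round to the nearest integer and keep the result inside the label range.
    return np.clip(np.rint(blended), min_label, max_label).astype(int)

# Toy example: two models, three samples (made-up numbers).
toy_predictions = [[0.2, 1.4, 2.6],
                   [0.4, 1.8, 3.2]]
labels = blend_and_round(toy_predictions, weights=[0.5, 0.5])
print(labels)
```

The clipping matters because a regressor can output values below 0 or above 3, which would otherwise round to labels that do not exist.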


### Parameters

Control parameters that keep code changes to a minimum when trying out different experimental settings.

In [None]:
debug = False

skip_bo = True # hyperparameter search
skip_xgb = True # ensemble with xgb

ajust_mean = False

drop_outliers = False

train_rounder = False

time_decay = 0 # 0.01

preprocessed = False
dirname = '/kaggle/input/datav2/'


## Importing libraries

pandas is handy for manipulating the data. The official tutorial is well organized.  
https://www.kaggle.com/learn/pandas

For visualization I recommend the seaborn library. There is a tutorial on Kaggle that works with data on how dangerous US cities are.  
https://www.kaggle.com/kanncaa1/seaborn-tutorial-for-beginners

scikit-learn (sklearn) is used to work with machine-learning models. It is convenient because it covers not only models but also preprocessing and cross-validation.  
https://scikit-learn.org/stable/modules/cross_validation.html

LightGBM is used as the learning model.  
https://lightgbm.readthedocs.io/

I used BayesianOptimization for the hyperparameter search.  
https://github.com/fmfn/BayesianOptimization

In [None]:
import numpy as np
import pandas as pd

import random
random.seed(1029)
np.random.seed(1029)

import os
import copy
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR, SVR
from sklearn.metrics import mean_absolute_error
pd.options.display.precision = 15
from collections import defaultdict
import lightgbm as lgb
import xgboost as xgb
import catboost as cat
import time
from collections import Counter
import datetime
from catboost import CatBoostRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, GroupKFold, GridSearchCV, train_test_split, TimeSeriesSplit, RepeatedStratifiedKFold
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
import gc
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from bayes_opt import BayesianOptimization
import eli5
import shap
from IPython.display import HTML
import json
import altair as alt
from category_encoders.ordinal import OrdinalEncoder
import networkx as nx
from typing import Dict, List

from numba import jit

from functools import partial
import scipy as sp

from tqdm import tqdm, tqdm_notebook

from typing import Any
from itertools import product
pd.set_option('display.max_rows', 500)
import re
from joblib import Parallel, delayed

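## Evaluation metric

https://www.kaggle.com/c/data-science-bowl-2019/overview/evaluation

The competition data is the activity log of an educational game for children.  
The task is to predict, from each user's activity log, that user's performance in the five assessment games.  
The performance measure `accuracy_group` takes four values, ranked 3 > 2 > 1 > 0 from best to worst.  
So **these categories are ordered**.

The evaluation metric for this competition is quadratic weighted kappa (QWK).  
Roughly speaking, when the true label is 0, QWK penalizes a prediction of 2 or 3 four or nine times as heavily, respectively, as a prediction of 1.  
So the metric takes the ordering of the labels into account, although one may wonder whether the gap between 0 and 3 is really nine times the gap between 0 and 1.  

To see what values QWK takes, I suggest feeding various arrays into the `qwk` function below.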
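
For experimenting, here is a plain NumPy sketch of QWK written from the standard definition: one minus the ratio of the quadratically weighted observed disagreement to the disagreement expected from the two label histograms. The names `quadratic_weights` and `qwk_reference` are hypothetical and this is not the competition's own code; it is only meant to make the 1, 4, 9 penalties visible.

```python
import numpy as np

def quadratic_weights(n_classes=4):
    """Penalty matrix w[i, j] = (i - j)**2."""
    idx = np.arange(n_classes)
    return (idx[:, None] - idx[None, :]) ** 2

def qwk_reference(y_true, y_pred, n_classes=4):
    """Quadratic weighted kappa for integer labels in 0..n_classes-1."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Observed confusion matrix.
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Confusion matrix expected if the two label sets were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)
    weights = quadratic_weights(n_classes)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

print(quadratic_weights()[0])                         # penalties when the true label is 0
print(qwk_reference([0, 1, 2, 3], [0, 1, 2, 3]))      # perfect agreement
print(qwk_reference([0, 1, 2, 3], [3, 2, 1, 0]))      # labels fully reversed
```

Note that the score is undefined when the expected disagreement is zero, for example when both arrays contain a single identical label.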

The code for computing QWK quickly comes from the following post.  
https://www.kaggle.com/c/data-science-bowl-2019/discussion/114133#latest-660168

To understand QWK I referred to the following notebook.  
The metric appears to have been used in several competitions.  
https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps

In [None]:
@jit
def qwk(a1, a2):
    """
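    Fast quadratic weighted kappa for integer labels in 0..3.

    Source: https://www.kaggle.com/c/data-science-bowl-2019/discussion/114133#latest-660168

    :param a1: array-like of integer labels (e.g. the true labels)
    :param a2: array-like of integer labels (e.g. the predictions)
    :return: the QWK score; 1 means perfect agreement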
    """
    max_rat = 3
    a1 = np.asarray(a1, dtype=int)
    a2 = np.asarray(a2, dtype=int)

    hist1 = np.zeros((max_rat + 1, ))
    hist2 = np.zeros((max_rat + 1, ))

    o = 0
    for k in range(a1.shape[0]):
        i, j = a1[k], a2[k]
        hist1[i] += 1
        hist2[j] += 1
        o +=  (i - j) * (i - j)

    e = 0
    for i in range(max_rat + 1):
        for j in range(max_rat + 1):
            e += hist1[i] * hist2[j] * (i - j) * (i - j)

    e = e / a1.shape[0]

    return 1 - o / e

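## Loading the data

First, take a careful look at what is in the data.

There is an official description of the data, but it is quite brief, so you should explore the data yourself.  
https://www.kaggle.com/c/data-science-bowl-2019/data

In particular, specs.csv describes what each event is.  
You need to read it to come up with useful features.  
For example, there are events that correspond to correct and incorrect answers in the games, so aggregating them should give good features.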
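
As a rough illustration of aggregating correct and incorrect answers, the sketch below counts attempts from JSON-encoded `event_data` strings and maps them to a group using the same thresholds as the feature code in this notebook. The `"correct"` field name and the toy rows are assumptions made for the example; check specs.csv for the fields an event actually carries.

```python
import json

def count_attempts(event_data_rows):
    """Count correct and incorrect attempts from JSON-encoded event_data strings."""
    correct = incorrect = 0
    for raw in event_data_rows:
        payload = json.loads(raw)
        if "correct" not in payload:  # assumed field name; skip events without it
            continue
        if payload["correct"]:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect

def to_accuracy_group(correct, incorrect):
    """Map attempt counts to a group: accuracy 0 -> 0, 1 -> 3, 0.5 -> 2, otherwise 1."""
    total = correct + incorrect
    accuracy = correct / total if total else 0
    if accuracy == 0:
        return 0
    if accuracy == 1:
        return 3
    if accuracy == 0.5:
        return 2
    return 1

# Toy rows (made up): one failed attempt, one success, one unrelated event.
toy_rows = ['{"correct": false}', '{"correct": true}', '{"round": 1}']
n_correct, n_incorrect = count_attempts(toy_rows)
group = to_accuracy_group(n_correct, n_incorrect)
print(n_correct, n_incorrect, group)
```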

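The data comes from a real game. **I think you should play it yourself and get a general feel for what it is like.**  
https://measureup.pbskids.org/

The following notebooks do a very good job of showing what to look for in the data.  
https://www.kaggle.com/jaseziv83/dsb-a-rrrare-r-notebook-and-baseline-model  
https://www.kaggle.com/gpreda/2019-data-science-bowl-eda

For example, check things such as:

1. Are there features whose distributions differ greatly between train and test?
2. Are there features with outliers or skewed distributions?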
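
One simple way to run both checks is to put the summary statistics of a column side by side for train and test and add the skewness. This is only a sketch: `summarize_feature`, the column name, and the toy frames are made up for illustration.

```python
import pandas as pd

def summarize_feature(train_df, test_df, column):
    """Side-by-side summary statistics plus skewness for one numeric column."""
    summary = pd.DataFrame({
        "train": train_df[column].describe(),
        "test": test_df[column].describe(),
    })
    # A large gap between the two columns hints at a train/test shift;
    # a large absolute skew hints at outliers or a long tail.
    summary.loc["skew"] = [train_df[column].skew(), test_df[column].skew()]
    return summary

toy_train = pd.DataFrame({"event_count": [1, 2, 3, 4, 100]})  # one outlier
toy_test = pd.DataFrame({"event_count": [1, 2, 3, 4, 5]})
summary = summarize_feature(toy_train, toy_test, "event_count")
print(summary)
```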

In [None]:
# Based on the code in https://www.kaggle.com/braquino/890-features

def read_data():
    print('Reading train.csv file....')
    train = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv')
    print('Training.csv file have {} rows and {} columns'.format(train.shape[0], train.shape[1]))

    print('Reading test.csv file....')
    test = pd.read_csv('/kaggle/input/data-science-bowl-2019/test.csv')
    print('Test.csv file have {} rows and {} columns'.format(test.shape[0], test.shape[1]))

    print('Reading train_labels.csv file....')
    train_labels = pd.read_csv('/kaggle/input/data-science-bowl-2019/train_labels.csv')
    print('Train_labels.csv file have {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))

    print('Reading specs.csv file....')
    specs = pd.read_csv('/kaggle/input/data-science-bowl-2019/specs.csv')
    print('Specs.csv file have {} rows and {} columns'.format(specs.shape[0], specs.shape[1]))

    print('Reading sample_submission.csv file....')
    sample_submission = pd.read_csv('/kaggle/input/data-science-bowl-2019/sample_submission.csv')
    print('Sample_submission.csv file have {} rows and {} columns'.format(sample_submission.shape[0], sample_submission.shape[1]))
    return train, test, train_labels, specs, sample_submission


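## Building features

The dataset for this competition is time-series data.  
For each user (`installation_id`), we predict the result of a particular game (an `Assessment`) from the activity log that precedes it.  
Two questions need to be considered here.

1. Event-level features: what features should be extracted from each log entry?
2. User-level features: how should a user's log be aggregated into a single feature vector?

The second seems more critical, but I explain the first one first because that is the order of the code.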

### 1. Event-level features

Each event is uniquely identified by its `title` and `event_code`, so the number of times each event occurred under each `title` looks like a good feature.  
In fact many notebooks used this feature.  
https://www.kaggle.com/braquino/890-features
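
A small sketch of this counting idea, using `pandas.crosstab` on a made-up log. The titles and codes are toy values. Unlike the feature code in this notebook, which counts only events that happened before each assessment, this example counts over each user's whole log.

```python
import pandas as pd

# Toy log (made-up values): one row per event.
log = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b"],
    "title": ["Game A", "Game A", "Activity B", "Game A"],
    "event_code": [2000, 4020, 2000, 2000],
})

# Combine title and event_code into a single key, as encode_title does.
log["title_event_code"] = log["title"] + "_" + log["event_code"].astype(str)

# One row per user, one column per (title, event_code) pair, values are counts.
counts = pd.crosstab(log["installation_id"], log["title_event_code"])
print(counts)
```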

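In addition, the number of events per `type` and per `world` also looks important.  
As you can see by playing the game, each world teaches a different theme (size, length, weight).  
So the activity log per `world` should also give good features.  
Adding these features did improve the accuracy of my model.  

Time spent playing also looks like a good feature. Time-related features seem to have been used often in past competitions.  
For example, `game_duration` and `clip_duration` measure the time a user spends on each `type`.  
These features also improved accuracy.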
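
A minimal sketch of a duration feature, assuming the duration of a session is the gap between its first and last event. The toy log is made up. In this notebook's code, clip durations come from a fixed lookup table (`clip_time`) rather than from timestamps.

```python
import pandas as pd

# Toy log (made-up values): two Game sessions and one Activity session.
log = pd.DataFrame({
    "game_session": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "type": ["Game", "Game", "Game", "Game", "Activity", "Activity"],
    "timestamp": pd.to_datetime([
        "2019-09-01 10:00:00", "2019-09-01 10:01:00",
        "2019-09-01 11:00:00", "2019-09-01 11:03:00",
        "2019-09-02 09:00:00", "2019-09-02 09:00:30",
    ]),
})

# Duration of each session: last event time minus first event time.
per_session = log.groupby(["game_session", "type"])["timestamp"].agg(["min", "max"])
per_session["duration_sec"] = (per_session["max"] - per_session["min"]).dt.total_seconds()

# Average session duration for each type.
mean_by_type = per_session.groupby(level="type")["duration_sec"].mean()
print(mean_by_type)
```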

Some impressive people build more than 6,000 features.  
Of course not all of them are useful, and they need to be pruned sensibly.  
https://www.kaggle.com/keremt/fastai-feature-engineering-part1-6160-features  
https://www.kaggle.com/keremt/dsbowl2019-feng-part1

As the notebook linked below points out, specs.csv contains the details of each event.  
Reading it reveals several features that look quite useful.  
event_code 4020 and, for Cauldron_Filler, event_code 4025 are events that check at each step whether the player is doing the right thing.  
https://www.kaggle.com/bhavikapanara/2019-data-science-bowl-some-interesting-features

In [None]:
def encode_title(train, test, train_labels):
    # encode title
    train['title_event_code'] = list(map(lambda x, y: str(x) + '_' + str(y), train['title'], train['event_code']))
    test['title_event_code'] = list(map(lambda x, y: str(x) + '_' + str(y), test['title'], test['event_code']))
    all_title_event_code = list(set(train["title_event_code"].unique()).union(test["title_event_code"].unique()))
    
    train['type_world'] = list(map(lambda x, y: str(x) + '_' + str(y), train['type'], train['world']))
    test['type_world'] = list(map(lambda x, y: str(x) + '_' + str(y), test['type'], test['world']))
    all_type_world = list(set(train["type_world"].unique()).union(test["type_world"].unique()))
    
    # make a list with all the unique 'titles' from the train and test set
    list_of_user_activities = list(set(train['title'].unique()).union(set(test['title'].unique())))
    # make a list with all the unique 'event_code' from the train and test set
    list_of_event_code = list(set(train['event_code'].unique()).union(set(test['event_code'].unique())))
    list_of_event_id = list(set(train['event_id'].unique()).union(set(test['event_id'].unique())))
    # make a list with all the unique worlds from the train and test set
    list_of_worlds = list(set(train['world'].unique()).union(set(test['world'].unique())))
    # create a dictionary numerating the titles
    activities_map = dict(zip(list_of_user_activities, np.arange(len(list_of_user_activities))))
    activities_labels = dict(zip(np.arange(len(list_of_user_activities)), list_of_user_activities))
    activities_world = dict(zip(list_of_worlds, np.arange(len(list_of_worlds))))
    assess_titles = list(set(train[train['type'] == 'Assessment']['title'].value_counts().index).union(
        set(test[test['type'] == 'Assessment']['title'].value_counts().index)))
    # replace the text titles with the number titles from the dict
    train['title'] = train['title'].map(activities_map)
    test['title'] = test['title'].map(activities_map)
    train['world'] = train['world'].map(activities_world)
    test['world'] = test['world'].map(activities_world)
    train_labels['title'] = train_labels['title'].map(activities_map)
    
    win_code = dict(zip(activities_map.values(), (4100*np.ones(len(activities_map))).astype('int')))
    # then, it set one element, the 'Bird Measurer (Assessment)' as 4110, 10 more than the rest
    win_code[activities_map['Bird Measurer (Assessment)']] = 4110
    # convert text into datetime
    train['timestamp'] = pd.to_datetime(train['timestamp'])
    test['timestamp'] = pd.to_datetime(test['timestamp'])
    
    train['hour'] = train['timestamp'].dt.hour
    test['hour'] = test['timestamp'].dt.hour
    train['weekday'] = train['timestamp'].dt.weekday
    test['weekday'] = test['timestamp'].dt.weekday
    
    return train, test, train_labels, win_code, list_of_user_activities, list_of_event_code, activities_labels, assess_titles, list_of_event_id, all_title_event_code, activities_map, all_type_world


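### 2. User-level features

A user's activity log is a time series with irregular intervals.  
Time-series analysis is a deep field with many textbooks of its own, and I do not know it well!  
In this task the features are time series but the prediction is for a single point in time, so detailed knowledge may not be necessary.  
I referred to the following notebook on time-series analysis.  
https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python

I tried an exponential smoothing approach, which discounts older data before averaging.  
https://en.wikipedia.org/wiki/Exponential_smoothing

However, in my experiments the simple average gave better accuracy.  
In the end I used simple sums, means, and standard deviations of the event-level features.  
~~The data in this competition covers only about three months, so perhaps few users change much in such a short period and simple averages are sufficient.~~  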
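
The `time_decay` parameter in this notebook implements the discounting by multiplying the running counters by `(1 - time_decay)` after every session. The sketch below isolates that update for a single counter; `decayed_count` and the toy numbers are illustrative only.

```python
def decayed_count(events_per_session, time_decay):
    """Running event count where older sessions are discounted geometrically."""
    total = 0.0
    for n_events in events_per_session:
        # Add this session's events, then discount everything seen so far.
        total = (total + n_events) * (1.0 - time_decay)
    return total

sessions = [10, 10, 10]  # made-up event counts, oldest session first
plain_sum = decayed_count(sessions, time_decay=0.0)
discounted = decayed_count(sessions, time_decay=0.5)
print(plain_sum, discounted)
```

With `time_decay=0` the counter is the ordinary sum, which is the setting used for the final model.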

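→ Some of the top models used an RNN as one component.  
This is hard to achieve with simple exponential smoothing, but a model that properly accounts for timesteps apparently reaches quite good accuracy.  
https://www.kaggle.com/c/data-science-bowl-2019/discussion/127210


Relatedly, **some notebooks were using information from the future.**  
As pointed out in this discussion, the test data naturally contains no future information, so train and test end up using different information.  
https://www.kaggle.com/c/data-science-bowl-2019/discussion/117724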
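
To make the leakage concrete, here is a generic toy example (not taken from the linked discussion). A per-user mean over the whole history includes later sessions, whereas an expanding mean shifted by one row uses only sessions that came before the one being predicted.

```python
import pandas as pd

# Toy history of one user (made-up values), ordered in time.
history = pd.DataFrame({"session": [1, 2, 3, 4], "correct": [1, 0, 1, 1]})

# Leaky: every row sees the mean over all sessions, including future ones.
history["leaky_mean"] = history["correct"].mean()

# Safe: each row sees only the mean of the sessions strictly before it.
history["past_mean"] = history["correct"].expanding().mean().shift(1)
print(history)
```

The first row of `past_mean` is missing because nothing precedes it, which is exactly the situation of a user's first assessment.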

→ Some of the top models used TF-IDF. That makes sense!  
https://www.kaggle.com/c/data-science-bowl-2019/discussion/127210

In [None]:
clip_time = {'Welcome to Lost Lagoon!':19,'Tree Top City - Level 1':17,'Ordering Spheres':61, 'Costume Box':61,
        '12 Monkeys':109,'Tree Top City - Level 2':25, 'Pirate\'s Tale':80, 'Treasure Map':156,'Tree Top City - Level 3':26,
        'Rulers':126, 'Magma Peak - Level 1':20, 'Slop Problem':60, 'Magma Peak - Level 2':22, 'Crystal Caves - Level 1':18,
        'Balancing Act':72, 'Lifting Heavy Things':118,'Crystal Caves - Level 2':24, 'Honey Cake':142, 'Crystal Caves - Level 3':19,
        'Heavy, Heavier, Heaviest':61}

def get_data(user_sample, test_set=False):
    '''
    user_sample is a DataFrame from train or test, filtered to a single
    installation_id.
    The test_set parameter controls the label processing, which is only required
    if test_set=False
    '''
    # This is where we reduce the information per installation_id.
    
    # Constants and parameters declaration
    last_activity = 0
    
    if time_decay == 0:
        user_activities_count = {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
    else:
        user_activities_count = {'Clip':0.0, 'Activity': 0.0, 'Assessment': 0.0, 'Game':0.0}
    
    assess_4020_acc_dict = {'Cauldron Filler (Assessment)_4020_accuracy':0,
                                'Mushroom Sorter (Assessment)_4020_accuracy':0,
                                'Bird Measurer (Assessment)_4020_accuracy':0,
                                'Chest Sorter (Assessment)_4020_accuracy':0 }
    
    # new features: time spent in each activity
    last_session_time_sec = 0
    accuracy_groups = {0:0, 1:0, 2:0, 3:0}
    all_assessments = []
    accumulated_accuracy_group = 0
    accumulated_accuracy = 0
    accumulated_correct_attempts = 0 
    accumulated_uncorrect_attempts = 0
    accumulated_actions = 0
    
    accumulated_game_miss = 0
    mean_game_level = 0
    
    Cauldron_Filler_4025 = 0
    chest_assessment_uncorrect_sum = 0 # incorrect id = df4fe8b6
    
    counter = 0
    time_first_activity = float(user_sample['timestamp'].values[0])    
    durations = []    
    last_accuracy_title = {'acc_' + title: -1 for title in assess_titles}    
    
    def cnt_miss(df):
        cnt = 0
        for e in range(len(df)):
            x = df['event_data'].iloc[e]
            y = json.loads(x)['misses']
            cnt += y
        return cnt

    def get_4020_acc(df,counter_dict):

        for e in ['Cauldron Filler (Assessment)','Bird Measurer (Assessment)','Mushroom Sorter (Assessment)','Chest Sorter (Assessment)']:

            Assess_4020 = df[(df.event_code == 4020) & (df.title==activities_map[e])]   
            true_attempts_ = Assess_4020['event_data'].str.contains('true').sum()
            false_attempts_ = Assess_4020['event_data'].str.contains('false').sum()

            measure_assess_accuracy_ = true_attempts_/(true_attempts_+false_attempts_) if (true_attempts_+false_attempts_) != 0 else 0
            counter_dict[e+"_4020_accuracy"] += (counter_dict[e+"_4020_accuracy"] + measure_assess_accuracy_) / 2.0

        return counter_dict
    
    if time_decay == 0:
        event_code_count: Dict[str, float] = {ev: 0.0 for ev in list_of_event_code}
        event_id_count: Dict[str, float] = {eve: 0.0 for eve in list_of_event_id}
        title_count: Dict[str, float] = {eve: 0.0 for eve in activities_labels.values()} 
        title_event_code_count: Dict[str, float] = {t_eve: 0.0 for t_eve in all_title_event_code}
        type_world_count: Dict[str, float] = {eve: 0.0 for eve in all_type_world}
    else:
        event_code_count: Dict[str, int] = {ev: 0 for ev in list_of_event_code}
        event_id_count: Dict[str, int] = {eve: 0 for eve in list_of_event_id}
        title_count: Dict[str, int] = {eve: 0 for eve in activities_labels.values()} 
        title_event_code_count: Dict[str, int] = {t_eve: 0 for t_eve in all_title_event_code}
        type_world_count: Dict[str, int] = {eve: 0 for eve in all_type_world}

    # Features for each type.
    clip_durations = []
    activity_durations = []
    game_durations = []
    activity_sum_event_count = 0
    game_sum_event_count = 0
    activity_event_code_count: Dict[str, int] = {str(eve) + '_a': 0 for eve in list_of_event_code}
    game_event_code_count: Dict[str, int] = {str(eve) + '_g': 0 for eve in list_of_event_code}
    
    # iterates through each session of one installation_id
    for i, session in user_sample.groupby('game_session', sort=False):
        # i = game_session_id
        # session is a DataFrame that contain only one game_session
        
        # get some sessions information
        session_type = session['type'].iloc[0]
        session_title = session['title'].iloc[0]
        session_title_text = activities_labels[session_title]
        
        if session_type == 'Clip':
            clip_durations.append((clip_time[activities_labels[session_title]]))
            
        if session_type == 'Activity':
            activity_sum_event_count += session['event_count'].iloc[-1]
            activity_durations.append((session.iloc[-1, 2] - session.iloc[0, 2]).seconds)
            def update_counters(counter: dict, col: str):
                num_of_session_count = Counter(session[col])
                for k in num_of_session_count.keys():
                    x = str(k) + '_a'
                    if col == 'title':
                        x = activities_labels[k]
                    counter[x] += num_of_session_count[k]
                return counter
            activity_event_code_count = update_counters(activity_event_code_count, "event_code")
            
        if session_type == 'Game':
            game_sum_event_count += session['event_count'].iloc[-1]
            game_durations.append((session.iloc[-1, 2] - session.iloc[0, 2]).seconds)
            def update_counters(counter: dict, col: str):
                num_of_session_count = Counter(session[col])
                for k in num_of_session_count.keys():
                    x = str(k) + '_g'
                    if col == 'title':
                        x = activities_labels[k]
                    counter[x] += num_of_session_count[k]
                return counter
            game_event_code_count = update_counters(game_event_code_count, "event_code")        
        
            game_s = session[session.event_code == 2030]
            misses_cnt = cnt_miss(game_s)
            accumulated_game_miss += misses_cnt
            
            try:
                game_level = json.loads(session['event_data'].iloc[-1])['level']
                mean_game_level = (mean_game_level + game_level) / 2.0
            except:
                pass
        
        if (session_type == 'Assessment') & (test_set or len(session)>1):
            # search for event_code 4100, that represents the assessments trial
            all_attempts = session.query(f'event_code == {win_code[session_title]}')
            # then, check the numbers of wins and the number of losses
            true_attempts = all_attempts['event_data'].str.contains('true').sum()
            false_attempts = all_attempts['event_data'].str.contains('false').sum()
            # copy a dict to use as feature template, it's initialized with some itens: 
            # {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
            features = user_activities_count.copy()
            features.update(last_accuracy_title.copy())
            features.update(event_code_count.copy())
            features.update(event_id_count.copy())
            features.update(title_count.copy())
            features.update(title_event_code_count.copy())
            features.update(last_accuracy_title.copy())
            
            features.update(assess_4020_acc_dict.copy())
            features.update(type_world_count.copy())

            features['hour'] = session['hour'].iloc[-1]
            features['weekday'] = session['weekday'].iloc[-1]            
            
            # get installation_id for aggregated features
            features['installation_id'] = session['installation_id'].iloc[-1]
            # add title as feature, remembering that title represents the name of the game
            features['session_title'] = session['title'].iloc[0]
            # the 4 lines below add the feature of the history of the trials of this player
            # this is based on the all time attempts *so far*, at the moment of this assessment
            features['accumulated_correct_attempts'] = accumulated_correct_attempts
            features['accumulated_uncorrect_attempts'] = accumulated_uncorrect_attempts
            accumulated_correct_attempts += true_attempts 
            accumulated_uncorrect_attempts += false_attempts
            # the time spent in the app so far
            if durations == []:
                features['duration_mean'] = 0
            else:
                features['duration_mean'] = np.mean(durations)
            durations.append((session.iloc[-1, 2] - session.iloc[0, 2] ).seconds)
            # the accurace is the all time wins divided by the all time attempts
            features['accumulated_accuracy'] = accumulated_accuracy/counter if counter > 0 else 0
            accuracy = true_attempts/(true_attempts+false_attempts) if (true_attempts+false_attempts) != 0 else 0
            accumulated_accuracy += accuracy
            last_accuracy_title['acc_' + session_title_text] = accuracy
            # a feature of the current accuracy categorized
            # it is a counter of how many times this player was in each accuracy group
            if accuracy == 0:
                features['accuracy_group'] = 0
            elif accuracy == 1:
                features['accuracy_group'] = 3
            elif accuracy == 0.5:
                features['accuracy_group'] = 2
            else:
                features['accuracy_group'] = 1
            features.update(accuracy_groups)
            accuracy_groups[features['accuracy_group']] += 1
            # mean of the all accuracy groups of this player
            features['accumulated_accuracy_group'] = accumulated_accuracy_group/counter if counter > 0 else 0
            accumulated_accuracy_group += features['accuracy_group']
            # how many actions the player has done so far, it is initialized as 0 and updated some lines below
            features['accumulated_actions'] = accumulated_actions
            
            # Features for each types
            if clip_durations == []:
                # it never happens, but in case.
                features['clip_duration_mean'] = 0
                features['clip_duration_std'] = 0
            else:
                features['clip_duration_mean'] = np.mean(clip_durations)
                features['clip_duration_std'] = np.std(clip_durations)

            if activity_durations == []:
                # it never happens, but in case.
                features['activity_duration_mean'] = 0
                features['activity_duration_std'] = 0
            else:
                features['activity_duration_mean'] = np.mean(activity_durations)
                features['activity_duration_std'] = np.std(activity_durations)

            if game_durations == []:
                # it never happens, but in case.
                features['game_duration_mean'] = 0
                features['game_duration_std'] = 0
            else:
                features['game_duration_mean'] = np.mean(game_durations)
                features['game_duration_std'] = np.std(game_durations)
                
            features['activitiy_sum_event_count'] = activity_sum_event_count
            features['game_sum_event_count'] = game_sum_event_count

            features['accumulated_game_miss'] = accumulated_game_miss
            features['mean_game_level'] = mean_game_level
            features['chest_assessment_uncorrect_sum'] = chest_assessment_uncorrect_sum
                
            features.update(game_event_code_count.copy())
            features.update(activity_event_code_count.copy())
            
            variety_features = [('var_event_code', event_code_count),
                              ('var_event_id', event_id_count),
                               ('var_title', title_count),
                               ('var_title_event_code', title_event_code_count),
                                ('var_type_world', type_world_count)]
            
            for name, dict_counts in variety_features:
                arr = np.array(list(dict_counts.values()))
                features[name] = np.count_nonzero(arr)
                
            features['Cauldron_Filler_4025'] = Cauldron_Filler_4025/counter if counter > 0 else 0
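            ####################
            # `title` was mapped to integer codes in encode_title, so compare against the encoded title
            Assess_4025 = session[(session.event_code == 4025) & (session.title == activities_map['Cauldron Filler (Assessment)'])]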
            true_attempts_ = Assess_4025['event_data'].str.contains('true').sum()
            false_attempts_ = Assess_4025['event_data'].str.contains('false').sum()

            cau_assess_accuracy_ = true_attempts_/(true_attempts_+false_attempts_) if (true_attempts_+false_attempts_) != 0 else 0
            Cauldron_Filler_4025 += cau_assess_accuracy_
            
            chest_assessment_uncorrect_sum += len(session[session.event_id=="df4fe8b6"])
                
            # there are some conditions for inserting these features into the dataset:
            # if it's the test set, all sessions belong to the final dataset;
            # if it's the train set, the session must pass this clause: session.query(f'event_code == {win_code[session_title]}'),
            # that is, an event_code 4100 or 4110 must exist
            if test_set:
                all_assessments.append(features)
            elif true_attempts+false_attempts > 0:
                all_assessments.append(features)
                
            counter += 1
        
        # this piece counts how many actions was made in each event_code so far
        def update_counters(counter: dict, col: str):
                num_of_session_count = Counter(session[col])
                for k in num_of_session_count.keys():
                    x = k
                    if col == 'title':
                        x = activities_labels[k]
                    counter[x] += num_of_session_count[k]
                return counter
        
        event_code_count = update_counters(event_code_count, "event_code")
        event_id_count = update_counters(event_id_count, "event_id")
        title_count = update_counters(title_count, 'title')
        title_event_code_count = update_counters(title_event_code_count, 'title_event_code')
        type_world_count = update_counters(type_world_count, 'type_world')
        
        assess_4020_acc_dict = get_4020_acc(session, assess_4020_acc_dict)
        
        # counts how many actions the player has done so far, used in the feature of the same name
        accumulated_actions += len(session)
        if last_activity != session_type:
            user_activities_count[session_type] += 1
            last_activity = session_type

        if not (time_decay == 0):
            user_activities_count = {x:y*(1.0 - time_decay) for (x, y) in user_activities_count.items()}
            event_code_count = {x:y*(1.0 - time_decay) for (x, y) in event_code_count.items()}
            event_id_count = {x:y*(1.0 - time_decay) for (x, y) in event_id_count.items()}
            title_count = {x:y*(1.0 - time_decay) for (x, y) in title_count.items()}
            title_event_code_count = {x:y*(1.0 - time_decay) for (x, y) in title_event_code_count.items()}
            accumulated_actions = accumulated_actions * (1.0 - time_decay)
                        
    # if it's the test_set, only the last assessment must be predicted; the previous ones are discarded
    if test_set:
        return all_assessments[-1]
    # in the train_set, all assessments goes to the dataset
    return all_assessments

def get_train_and_test(train, test):
    compiled_train = []
    compiled_test = []
    # We don't need to take all the users because some of the users didn't take assessment.
    for i, (ins_id, user_sample) in tqdm(enumerate(train.groupby('installation_id', sort = False)), total = 17000):
        compiled_train += get_data(user_sample)
    for ins_id, user_sample in tqdm(test.groupby('installation_id', sort = False), total = 1000):
        test_data = get_data(user_sample, test_set = True)
        compiled_test.append(test_data)
    reduce_train = pd.DataFrame(compiled_train)
    reduce_test = pd.DataFrame(compiled_test)
    categoricals = ['session_title']
    return reduce_train, reduce_test, categoricals



In [None]:
if not preprocessed:
    # read data
    train, test, train_labels, specs, sample_submission = read_data()
    # get useful dicts with the encoding maps
    train, test, train_labels, win_code, list_of_user_activities, list_of_event_code, activities_labels, assess_titles, list_of_event_id, all_title_event_code, activities_map, all_type_world = encode_title(train, test, train_labels)
    # transform function to get the train and test sets
    reduce_train, reduce_test, categoricals = get_train_and_test(train, test)
else:
    # Read the sample_submission file
    print('Reading sample_submission.csv file....')
    sample_submission = pd.read_csv('/kaggle/input/data-science-bowl-2019/sample_submission.csv')
    print('Sample_submission.csv file have {} rows and {} columns'.format(sample_submission.shape[0], sample_submission.shape[1]))


In [None]:
def preprocess(reduce_train, reduce_test):
    for df in [reduce_train, reduce_test]:
        df['installation_session_count'] = df.groupby(['installation_id'])['Clip'].transform('count')
        df['installation_duration_mean'] = df.groupby(['installation_id'])['duration_mean'].transform('mean')
        #df['installation_duration_std'] = df.groupby(['installation_id'])['duration_mean'].transform('std')
        df['installation_title_nunique'] = df.groupby(['installation_id'])['session_title'].transform('nunique')
        
        df['sum_event_code_count'] = df[[2050, 4100, 4230, 5000, 4235, 2060, 4110, 5010, 2070, 2075, 2080, 2081, 2083, 3110, 4010, 3120, 3121, 4020, 4021, 
                                        4022, 4025, 4030, 4031, 3010, 4035, 4040, 3020, 3021, 4045, 2000, 4050, 2010, 2020, 4070, 2025, 2030, 4080, 2035, 
                                        2040, 4090, 4220, 4095]].sum(axis = 1)
        
        df['installation_event_code_count_mean'] = df.groupby(['installation_id'])['sum_event_code_count'].transform('mean')
        #df['installation_event_code_count_std'] = df.groupby(['installation_id'])['sum_event_code_count'].transform('std')
        
    features = reduce_train.loc[(reduce_train.sum(axis=1) != 0), (reduce_train.sum(axis=0) != 0)].columns # delete useless columns
    features = [x for x in features if x not in ['accuracy_group', 'installation_id']] + ['acc_' + title for title in assess_titles]
   
    return reduce_train, reduce_test, features


In [None]:
# Save data
if time_decay == 0:
    d = ""    
else:
    d = '_' + str(time_decay)

if preprocessed:
    reduce_train = pd.read_csv(dirname + 'reduce_train' + d + '.csv', index_col=0)
    reduce_test = pd.read_csv(dirname + 'reduce_test' + d + '.csv', index_col=0)
    # features = read_csv(dirname + 'features.csv')

    codes = [0, 1, 2050, 2, 4100, 3, 4230, 5000, 4235, 2060, 4110, 5010, 2070, 2075, 2080, 2081, 2083, 3110, 4010, 3120, 3121, 4020, 4021, 4022, 4025, 4030, 4031, 3010, 4035, 4040, 3020, 3021, 4045, 2000, 4050, 2010, 2020, 4070, 2025, 2030, 4080, 2035, 2040, 4090, 4220, 4095]
    enc = {}
    for i in codes:
        enc[str(i)] = i
    reduce_train.rename(columns=enc, inplace=True)
    reduce_test.rename(columns=enc, inplace=True)

else:
    reduce_train, reduce_test, _ = preprocess(reduce_train, reduce_test)
        
    reduce_train.to_csv('reduce_train' + d + '.csv')
    reduce_test.to_csv('reduce_test' + d + '.csv')
    # print("features=", features)


In [None]:
reduce_train.head()

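## Missing values

Most competition datasets contain missing values or values that need preprocessing.  
Missing values can also appear during feature generation.  
In this case, taking the mean or standard deviation of a quantity with zero data points yields NA.  
I set the std of such quantities to 0, which is also reasonable in terms of meaning.  

Here it is fine to simply fill with 0, but sometimes the fill needs more care (when the fact that a value is missing carries information in itself).  
I read through the following material.  
http://www.stat.columbia.edu/~gelman/arm/missing.pdf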
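
As a rough illustration of the point above, here is a minimal sketch on made-up toy data (the table and the column names `duration` and `std_was_missing` are assumptions for this example, not part of the competition data). It shows how NA appears when aggregating a group with too few data points, and one possible way to keep the "was missing" information before filling with 0.

```python
import numpy as np
import pandas as pd

# Toy session table: user "b" has a single session, user "c" has no duration at all.
sessions = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b", "c"],
    "duration": [10.0, 20.0, 30.0, 5.0, np.nan],
})

# std is NaN for "b" (one data point) and for "c" (no data points);
# mean is NaN for "c" only.
agg = sessions.groupby("installation_id")["duration"].agg(["mean", "std"])

# Keep the fact that the value was missing as its own feature before filling.
agg["std_was_missing"] = agg["std"].isna().astype(int)

# Simple fill: treat the spread of an empty or single-element group as 0.
agg_filled = agg.fillna(0)
print(agg_filled)
```

Whether the extra indicator column helps depends on the data; for this competition the plain fill with 0 was enough.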

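## Handling skewed features

Many models have trouble learning from skewed (asymmetric) features.  
Applying a Box-Cox transformation to such features can often reduce the absolute value of their skewness.  
https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.boxcox1p.html  
http://onlinestatbook.com/2/transformations/box-cox.html

(I searched quite a bit but could not find a theoretical account of exactly why skewed features make learning harder.  
My intuition is that, whatever regression you fit, the influence of individual data points on the loss function differs greatly.  
It does not seem to be covered in ESL or in Murphy's book. Where can I find it?)

The final model used only LightGBM, so this step may not have mattered much.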
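
To make the transformation concrete, here is a small sketch of the formula behind `scipy.special.boxcox1p`, written with NumPy only. The helper name `boxcox1p_np` and the synthetic log-normal feature are assumptions for this illustration; the Notebook itself calls `boxcox1p` from SciPy.

```python
import numpy as np
import pandas as pd

def boxcox1p_np(x, lmbda):
    """Box-Cox transform of 1 + x: ((1 + x)**lmbda - 1) / lmbda, or log1p(x) when lmbda == 0."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log1p(x)
    # expm1(lmbda * log1p(x)) equals (1 + x)**lmbda - 1
    return np.expm1(lmbda * np.log1p(x)) / lmbda

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, e.g. a duration in milliseconds.
raw = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))
transformed = pd.Series(boxcox1p_np(raw, 0.15))

print("skew before:", raw.skew())
print("skew after :", transformed.skew())
```

The value 0.15 is the same lambda used in the cells of this Notebook; other values may suit other features better.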

In [None]:
########################################
# Data exploration
########################################
# Interestingly, features that are skewed in the training set are
# not always skewed in the test set, and vice versa.
real_values = reduce_train.dtypes[reduce_train.dtypes == 'float64'].index
skewness_train = reduce_train[real_values].skew(axis=0, skipna=True).sort_values(ascending=False)

skewness_test = reduce_test[real_values].skew(axis=0, skipna=True).sort_values(ascending=False)

print(real_values)

In [None]:
skewed_features_train = skewness_train[abs(skewness_train) > 0.8].index
skewed_features_test = skewness_test[abs(skewness_test) > 0.8].index

print(skewed_features_train.shape)
print(skewed_features_test.shape)

In [None]:
from scipy.special import boxcox1p
from scipy import stats
from scipy.stats import norm

feature = 'duration_mean'
print('feature=', feature)
dset = reduce_train

print('before rescaling', dset[feature].skew(axis=0, skipna=True))

if dset[feature].skew(axis=0, skipna=True) > 0.0:
    sns.distplot(dset[feature], fit=norm)
    fig = plt.figure()
    res = stats.probplot(dset[feature], plot=plt)

In [None]:
rescaled = boxcox1p(dset[feature], 0.15)

print('after rescaling', rescaled.skew(axis=0, skipna=True))

sns.distplot(rescaled, fit=norm)
fig = plt.figure()
res = stats.probplot(rescaled, plot=plt)

rescaled = boxcox1p(rescaled, 0.15)


In [None]:
# Preprocessing features with high skewness
skewed_cols = ['installation_event_code_count_mean', 'game_duration_mean',
               'installation_duration_mean', 'sum_event_code_count',
               'accumulated_game_miss', 'duration_mean']

# Apply the same Box-Cox transform (lambda = 0.15) to train and test
for col in skewed_cols:
    reduce_train[col] = boxcox1p(reduce_train[col], 0.15)
    reduce_test[col] = boxcox1p(reduce_test[col], 0.15)



In [None]:
# Seaborn doesn't recognize indices in dtype=int.
# So we temporarily set it back to string for plotting.
codes = [0, 1, 2050, 2, 4100, 3, 4230, 5000, 4235, 2060, 4110, 5010, 2070, 2075, 2080, 2081, 2083, 3110, 4010, 3120, 3121, 4020, 4021, 4022, 4025, 4030, 4031, 3010, 4035, 4040, 3020, 3021, 4045, 2000, 4050, 2010, 2020, 4070, 2025, 2030, 4080, 2035, 2040, 4090, 4220, 4095]
enc = {}
for i in codes:
    enc[i] = str(i)
reduce_train.rename(columns=enc, inplace=True)
reduce_test.rename(columns=enc, inplace=True)


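## Outliers

Data points whose features differ greatly from the rest of the data are called outliers.  
Outliers can have a large effect on the loss function, and fitting them can degrade accuracy on typical data.  

An example of removing outliers can be found in this Notebook.  
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

As a machine learning engineer, my instinct is to use a loss that is robust to outliers (e.g. Huber loss),  
but in data analysis, where the data is given explicitly, simply removing the outliers is quicker.


The final model used only LightGBM, so in the end I did not apply any outlier handling.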
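
To show why a robust loss is less sensitive to outliers, the following sketch compares the squared loss with the Huber loss on a few made-up residuals. The function names and the residual values are assumptions for this illustration and are not used elsewhere in the Notebook.

```python
import numpy as np

def squared_loss(residual):
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * abs_r ** 2, delta * (abs_r - 0.5 * delta))

residuals = np.array([0.1, -0.5, 1.0, 20.0])  # the last one plays the outlier

sq = squared_loss(residuals)
hu = huber_loss(residuals)
print("squared:", sq)
print("huber  :", hu)

# Share of the total loss contributed by the outlier under each loss
print("outlier share (squared):", sq[-1] / sq.sum())
print("outlier share (huber)  :", hu[-1] / hu.sum())
```

Under the squared loss the single outlier accounts for almost all of the total, while the Huber loss grows only linearly beyond `delta`, so typical points keep more influence on the fit.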

In [None]:
if drop_outliers:
    covs = reduce_train.cov()['accuracy_group']
    covs_sorted = covs.abs().sort_values(ascending=False)
    print(covs_sorted.head(10))

    # change `feature` to visualize each feature
    feature = '4070'
    sns.regplot(x=feature, y="accuracy_group", data=reduce_train, y_jitter=0.1)

    # Before removing outlier

In [None]:
if drop_outliers:
    outliers_ = reduce_train[reduce_train[feature] > 3000].index
    removed_ = reduce_train.drop(outliers_)

    # reduce_train.plot(kind='scatter', x=covs_sorted.index[i], y="accuracy_group")
    sns.regplot(x=feature, y="accuracy_group", data=removed_, y_jitter=0.1)
    print(removed_.shape)

    # After removing outlier

In [None]:
if drop_outliers:
    outliers = [('4070', 3000), ('duration_mean', 50000), ('installation_duration_mean', 30000),
               ('Cauldron Filler (Assessment)', 1200), ('3020', 400), ('3120', 400),
               ('Cart Balancer (Assessment)', 1000), ('Crystals Rule', 2000), ('4035', 400),
               ('Bubble Bath', 2000), ('1325467d', 500)]

    for o in outliers:
        outlier = reduce_train[reduce_train[o[0]] > o[1]].index
        reduce_train.drop(outlier, inplace=True)

    print(reduce_train.shape)


In [None]:
codes = [0, 1, 2050, 2, 4100, 3, 4230, 5000, 4235, 2060, 4110, 5010, 2070, 2075, 2080, 2081, 2083, 3110, 4010, 3120, 3121, 4020, 4021, 4022, 4025, 4030, 4031, 3010, 4035, 4040, 3020, 3021, 4045, 2000, 4050, 2010, 2020, 4070, 2025, 2030, 4080, 2035, 2040, 4090, 4220, 4095]
enc = {}
for i in codes:
    enc[str(i)] = i
reduce_train.rename(columns=enc, inplace=True)
reduce_test.rename(columns=enc, inplace=True)

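## Scaling the test data

In the following Notebook, when the mean of a feature differs greatly between train and test, the test values are rescaled by the ratio of the means as a preprocessing step.  
https://www.kaggle.com/khoongweihao/top-5-pub-to-top-53-priv-convert-disaster

I decided not to adopt this scaling,  
because the meaning of many of the features used here changes when their scale is changed.  

→ Reportedly, many final submissions used this scaling and it ended up hurting their scores badly.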
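
For reference, here is a minimal sketch of what scaling by the ratio of means looks like. This is my own reading of the idea on made-up data, not the code of the Notebook linked above; the function `scale_test_to_train_mean` and the toy columns are assumptions.

```python
import pandas as pd

train = pd.DataFrame({"session_count": [2, 4, 6, 8], "accuracy": [0.2, 0.4, 0.6, 0.8]})
test = pd.DataFrame({"session_count": [1, 2, 3], "accuracy": [0.3, 0.5, 0.7]})

def scale_test_to_train_mean(train_df, test_df, cols):
    """Multiply each test column by train_mean / test_mean (skipped when a mean is 0)."""
    scaled = test_df.copy()
    for col in cols:
        train_mean, test_mean = train_df[col].mean(), test_df[col].mean()
        if train_mean == 0 or test_mean == 0:
            continue
        scaled[col] = test_df[col] * (train_mean / test_mean)
    return scaled

scaled_test = scale_test_to_train_mean(train, test, ["session_count"])
print(scaled_test)
```

In this toy case a user with 1 session ends up with 2.5 sessions after scaling, which is the kind of change in meaning that made me avoid the approach for count-like features.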

In [None]:
# We don't want to use features with their distribution very different in two data sets.

if ajust_mean:
    to_drop = []
    for f in reduce_train.select_dtypes(include=np.number).columns.tolist():
        train_mean = reduce_train[f].mean()
        test_mean = reduce_test[f].mean()
        if train_mean == 0.0 or test_mean == 0.0:
            continue
        else:
            mean_ratio = train_mean / test_mean
            if mean_ratio > 10.0 or mean_ratio < 1.0 / 10.0:
                print('### drop feature', f, 'as it is very different in train and test: ', mean_ratio)
                to_drop.append(f)

    reduce_train.drop(to_drop, axis=1, inplace=True, errors='ignore')
    reduce_test.drop(to_drop, axis=1, inplace=True, errors='ignore')

    print(reduce_train.shape)

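## Handling correlated features

Many of the features created so far are strongly correlated with each other.  
Removing them makes it easier to train a good model, so I drop them.  
Extra care is needed with numerical models, where SVD is a concern.

For how to handle correlated features I referred to this Notebook.  
https://www.kaggle.com/reisel/how-to-handle-correlated-features  

My implementation is simple: skip any feature that is strongly correlated with a feature that has already been added.  
There may be better methods, but since the model here is decision-tree based, I did not think it mattered much.


→ The 7th place finisher reportedly checked every feature one by one to see whether it was noise.  
So I should have examined more carefully which features to remove.  
https://www.kaggle.com/c/data-science-bowl-2019/discussion/127213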
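
The "skip features correlated with already-added features" idea can be sketched as a greedy loop. The helper `select_uncorrelated` and the toy columns are assumptions for this illustration; the cells in this Notebook use the upper triangle of the correlation matrix instead, which serves the same purpose.

```python
import numpy as np
import pandas as pd

def select_uncorrelated(df, threshold=0.99):
    """Greedily keep a column only if |corr| with every already-kept column is <= threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_times_2": 2 * a,   # perfectly correlated with "a"
    "b": b,
    "neg_b": -b,          # perfectly anti-correlated with "b"
})

kept_columns = select_uncorrelated(toy)
print(kept_columns)
```

Note that the result depends on column order: the first column of each correlated group is the one that survives.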


In [None]:
# Correlation of features
# We want to avoid using features with high correlation in test set.
corr_matrix = reduce_test.corr().abs()

upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# A threshold of 0.95 may be too aggressive, so 0.99 is used here.
to_drop = [column for column in upper.columns if any(upper[column] > 0.99)]

In [None]:
# Drop features with high correlation to other features
reduce_train.drop(to_drop, axis=1, inplace=True, errors='ignore')
reduce_test.drop(to_drop, axis=1, inplace=True, errors='ignore')

print(len(to_drop))
print(reduce_train.shape)

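# Model

I chose LightGBM as the model,  
because many of the Notebooks that did well in past competitions use LightGBM.  

I model the task as a regression problem.  
The labels in this task are ordered, and the evaluation metric QWK also takes that order into account,  
so I expected regression to capture the relationship underlying the labels better than classification.  

→ Indeed, most of the top Notebooks use regression.  
→ Interestingly, many of the top Notebooks include at least one NN-based model.  
Intuitively this looked like a task where GBDT would be strong, but normalizing the data well and feeding it to an NN seems to work well.

The model wrapper comes from the following Notebook.  
It visualizes training progress nicely, so problems with the current model (underfitting or overfitting) are immediately visible.  
https://www.kaggle.com/artgor/quick-and-dirty-regression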
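
To illustrate how a regression output becomes a label and how QWK scores it, here is a self-contained sketch. The function names `quadratic_weighted_kappa` and `to_accuracy_group` and the toy scores are assumptions for this example; the thresholds are the fixed ones used in `eval_qwk_lgb_regr` in this Notebook, and the Notebook's own `qwk` helper is defined elsewhere.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """QWK = 1 - sum(W * O) / sum(W * E), with W[i, j] = (i - j)**2 / (n_classes - 1)**2."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Observed confusion matrix
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts if true and predicted labels were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

def to_accuracy_group(scores, thresholds=(1.12232214, 1.73925866, 2.22506454)):
    """Cut continuous regression outputs into classes 0..3 (a score equal to a threshold falls in the lower class)."""
    return np.digitize(scores, thresholds, right=True)

y_true = np.array([0, 1, 2, 3, 3, 0])
raw_scores = np.array([0.4, 1.5, 2.0, 2.9, 2.1, 1.2])
y_pred = to_accuracy_group(raw_scores)

print(y_pred)
print(quadratic_weighted_kappa(y_true, y_pred))
```

Because the weight grows with the squared distance between labels, predicting 2 for a true 3 costs much less than predicting 0 for a true 3, which is why a regression target fits this metric naturally.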


In [None]:
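def eval_qwk_lgb_regr(y_true, y_pred):
    """
    Fast cappa (quadratic weighted kappa) eval function for lgb.
    Regression outputs are cut into the four accuracy groups with fixed thresholds.
    """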
    y_pred[y_pred <= 1.12232214] = 0
    y_pred[np.where(np.logical_and(y_pred > 1.12232214, y_pred <= 1.73925866))] = 1
    y_pred[np.where(np.logical_and(y_pred > 1.73925866, y_pred <= 2.22506454))] = 2
    y_pred[y_pred > 2.22506454] = 3

    # y_pred = y_pred.reshape(len(np.unique(y_true)), -1).argmax(axis=0)

    return 'cappa', qwk(y_true, y_pred), True


class LGBWrapper_regr(object):
    """
    A wrapper for lightgbm model so that we will have a single api for various models.
    """

    def __init__(self):
        self.model = lgb.LGBMRegressor()

    def fit(self, X_train, y_train, X_valid=None, y_valid=None, X_holdout=None, y_holdout=None, params=None):
        if params['objective'] == 'regression':
            eval_metric = eval_qwk_lgb_regr
        else:
            eval_metric = 'auc'

        eval_set = [(X_train, y_train)]
        eval_names = ['train']
        self.model = self.model.set_params(**params)

        if X_valid is not None:
            eval_set.append((X_valid, y_valid))
            eval_names.append('valid')

        if X_holdout is not None:
            eval_set.append((X_holdout, y_holdout))
            eval_names.append('holdout')

        if 'cat_cols' in params.keys():
            cat_cols = [col for col in params['cat_cols'] if col in X_train.columns]
            if len(cat_cols) > 0:
                categorical_columns = cat_cols
            else:
                categorical_columns = 'auto'
        else:
            categorical_columns = 'auto'

        self.model.fit(X=X_train, y=y_train,
                       eval_set=eval_set, eval_names=eval_names, eval_metric=eval_metric,
                       verbose=params['verbose'], early_stopping_rounds=params['early_stopping_rounds'],
                       categorical_feature=categorical_columns)

        self.best_score_ = self.model.best_score_
        self.feature_importances_ = self.model.feature_importances_

    def predict(self, X_test):
        return self.model.predict(X_test, num_iteration=self.model.best_iteration_)


def eval_qwk_xgb(y_pred, y_true):
    """
    Fast cappa eval function for xgb.
    """
    # print('y_true', y_true)
    # print('y_pred', y_pred)
    y_true = y_true.get_label()
    y_pred = y_pred.argmax(axis=1)
    return 'cappa', -qwk(y_true, y_pred)


class MainTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, convert_cyclical: bool = False, create_interactions: bool = False, n_interactions: int = 20):
        """
        Main transformer for the data. Can be used for processing on the whole data.

        :param convert_cyclical: convert cyclical features into continuous
        :param create_interactions: create interactions between features
        """

        self.convert_cyclical = convert_cyclical
        self.create_interactions = create_interactions
        self.feats_for_interaction = None
        self.n_interactions = n_interactions

    def fit(self, X, y=None):

        if self.create_interactions:
            self.feats_for_interaction = [col for col in X.columns if 'sum' in col
                                          or 'mean' in col or 'max' in col or 'std' in col
                                          or 'attempt' in col]
            self.feats_for_interaction1 = np.random.choice(self.feats_for_interaction, self.n_interactions)
            self.feats_for_interaction2 = np.random.choice(self.feats_for_interaction, self.n_interactions)

        return self

    def transform(self, X, y=None):
        data = copy.deepcopy(X)
        if self.create_interactions:
            for col1 in self.feats_for_interaction1:
                for col2 in self.feats_for_interaction2:
                    data[f'{col1}_int_{col2}'] = data[col1] * data[col2]

        if self.convert_cyclical:
            data['timestampHour'] = np.sin(2 * np.pi * data['timestampHour'] / 23.0)
            data['timestampMonth'] = np.sin(2 * np.pi * data['timestampMonth'] / 23.0)
            data['timestampWeek'] = np.sin(2 * np.pi * data['timestampWeek'] / 23.0)
            data['timestampMinute'] = np.sin(2 * np.pi * data['timestampMinute'] / 23.0)

        return data

    def fit_transform(self, X, y=None, **fit_params):
        data = copy.deepcopy(X)
        self.fit(data)
        return self.transform(data)


class FeatureTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, main_cat_features: list = None, num_cols: list = None):
        """

        :param main_cat_features:
        :param num_cols:
        """
        self.main_cat_features = main_cat_features
        self.num_cols = num_cols

    def fit(self, X, y=None):

#         self.num_cols = [col for col in X.columns if 'sum' in col or 'mean' in col or 'max' in col or 'std' in col
#                          or 'attempt' in col]
        

        return self

    def transform(self, X, y=None):
        data = copy.deepcopy(X)
#         for col in self.num_cols:
#             data[f'{col}_to_mean'] = data[col] / data.groupby('installation_id')[col].transform('mean')
#             data[f'{col}_to_std'] = data[col] / data.groupby('installation_id')[col].transform('std')

        return data

    def fit_transform(self, X, y=None, **fit_params):
        data = copy.deepcopy(X)
        self.fit(data)
        return self.transform(data)
    
    
class RegressorModel(object):
    """
    A wrapper class for regression models.
    It can be used for training and prediction.
    Can plot feature importance and training progress (if relevant for model).

    """

    def __init__(self, columns: list = None, model_wrapper=None):
        """

        :param columns:
        :param model_wrapper:
        """
        self.columns = columns
        self.model_wrapper = model_wrapper
        self.result_dict = {}
        self.train_one_fold = False
        self.preprocesser = None

    def fit(self, X: pd.DataFrame, y,
            X_holdout: pd.DataFrame = None, y_holdout=None,
            folds=None,
            params: dict = None,
            eval_metric='rmse',
            cols_to_drop: list = None,
            preprocesser=None,
            transformers: dict = None,
            adversarial: bool = False,
            plot: bool = True):
        """
        Training the model.

        :param X: training data
        :param y: training target
        :param X_holdout: holdout data
        :param y_holdout: holdout target
        :param folds: folds to split the data. If not defined, then model will be trained on the whole X
        :param params: training parameters
        :param eval_metric: metric for validation
        :param cols_to_drop: list of columns to drop (for example ID)
        :param preprocesser: preprocesser class
        :param transformers: transformer to use on folds
        :param adversarial: if True, run adversarial validation (train rows vs. validation/holdout rows)
        :return:
        """

        if folds is None:
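            # random_state only takes effect with shuffle=True, and recent
            # scikit-learn versions raise an error if it is set without shuffling.
            folds = KFold(n_splits=3)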
            self.train_one_fold = True

        self.columns = X.columns if self.columns is None else self.columns
        self.feature_importances = pd.DataFrame(columns=['feature', 'importance'])
        self.trained_transformers = {k: [] for k in transformers}
        self.transformers = transformers
        self.models = []
        self.folds_dict = {}
        self.eval_metric = eval_metric
        n_target = 1
        self.oof = np.zeros((len(X), n_target))
        self.n_target = n_target

        X = X[self.columns]
        if X_holdout is not None:
            X_holdout = X_holdout[self.columns]

        if preprocesser is not None:
            self.preprocesser = preprocesser
            self.preprocesser.fit(X, y)
            X = self.preprocesser.transform(X, y)
            self.columns = X.columns.tolist()
            if X_holdout is not None:
                X_holdout = self.preprocesser.transform(X_holdout)

        # Grouping all the data with the same installation_id
        for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y, X['installation_id'])):

            if X_holdout is not None:
                X_hold = X_holdout.copy()
            else:
                X_hold = None
            self.folds_dict[fold_n] = {}
            if params['verbose']:
                print(f'Fold {fold_n + 1} started at {time.ctime()}')

            X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
            if self.train_one_fold:
                X_train = X[self.columns]
                y_train = y
                X_valid = None
                y_valid = None

            datasets = {'X_train': X_train, 'X_valid': X_valid, 'X_holdout': X_hold, 'y_train': y_train}
            X_train, X_valid, X_hold = self.transform_(datasets, cols_to_drop)

            self.folds_dict[fold_n]['columns'] = X_train.columns.tolist()

            model = copy.deepcopy(self.model_wrapper)

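            # Adversarial validation: label training rows 0 and validation (or holdout) rows 1,
            # then train the model to tell the two sets apart.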
            if adversarial:
                X_new1 = X_train.copy()
                if X_valid is not None:
                    X_new2 = X_valid.copy()
                elif X_holdout is not None:
                    X_new2 = X_holdout.copy()
                X_new = pd.concat([X_new1, X_new2], axis=0)
                y_new = np.hstack((np.zeros((X_new1.shape[0])), np.ones((X_new2.shape[0]))))
                X_train, X_valid, y_train, y_valid = train_test_split(X_new, y_new)

            model.fit(X_train, y_train, X_valid, y_valid, X_hold, y_holdout, params=params)

            self.folds_dict[fold_n]['scores'] = model.best_score_
            if self.oof.shape[0] != len(X):
                self.oof = np.zeros((X.shape[0], self.oof.shape[1]))
            if not adversarial:
                self.oof[valid_index] = model.predict(X_valid).reshape(-1, n_target)

            fold_importance = pd.DataFrame(list(zip(X_train.columns, model.feature_importances_)),
                                           columns=['feature', 'importance'])
            # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
            self.feature_importances = pd.concat([self.feature_importances, fold_importance])
            self.models.append(model)

        self.feature_importances['importance'] = self.feature_importances['importance'].astype(int)

        # if params['verbose']:
        self.calc_scores_()

        if plot:
            # print(classification_report(y, self.oof.argmax(1)))
            fig, ax = plt.subplots(figsize=(16, 12))
            plt.subplot(2, 2, 1)
            self.plot_feature_importance(top_n=20)
            plt.subplot(2, 2, 2)
            self.plot_metric()
            plt.subplot(2, 2, 3)
            plt.hist(y.values.reshape(-1, 1) - self.oof)
            plt.title('Distribution of errors')
            plt.subplot(2, 2, 4)
            plt.hist(self.oof)
            plt.title('Distribution of oof predictions');

    def transform_(self, datasets, cols_to_drop):
        for name, transformer in self.transformers.items():
            transformer.fit(datasets['X_train'], datasets['y_train'])
            datasets['X_train'] = transformer.transform(datasets['X_train'])
            if datasets['X_valid'] is not None:
                datasets['X_valid'] = transformer.transform(datasets['X_valid'])
            if datasets['X_holdout'] is not None:
                datasets['X_holdout'] = transformer.transform(datasets['X_holdout'])
            self.trained_transformers[name].append(transformer)
        if cols_to_drop is not None:
            cols_to_drop = [col for col in cols_to_drop if col in datasets['X_train'].columns]

            datasets['X_train'] = datasets['X_train'].drop(cols_to_drop, axis=1)
            if datasets['X_valid'] is not None:
                datasets['X_valid'] = datasets['X_valid'].drop(cols_to_drop, axis=1)
            if datasets['X_holdout'] is not None:
                datasets['X_holdout'] = datasets['X_holdout'].drop(cols_to_drop, axis=1)
        self.cols_to_drop = cols_to_drop

        return datasets['X_train'], datasets['X_valid'], datasets['X_holdout']

    def calc_scores_(self):
#         print()
        datasets = [k for k, v in [v['scores'] for k, v in self.folds_dict.items()][0].items() if len(v) > 0]
        self.scores = {}
        for d in datasets:
            scores = [v['scores'][d][self.eval_metric] for k, v in self.folds_dict.items()]
#             print(f"CV mean score on {d}: {np.mean(scores):.4f} +/- {np.std(scores):.4f} std.")
            self.scores[d] = np.mean(scores)

    def predict(self, X_test, averaging: str = 'usual'):
        """
        Make prediction

        :param X_test:
        :param averaging: method of averaging
        :return:
        """
        full_prediction = np.zeros((X_test.shape[0], self.oof.shape[1]))
        if self.preprocesser is not None:
            X_test = self.preprocesser.transform(X_test)
        for i in range(len(self.models)):
            X_t = X_test.copy()
            for name, transformers in self.trained_transformers.items():
                X_t = transformers[i].transform(X_t)

            if self.cols_to_drop is not None:
                cols_to_drop = [col for col in self.cols_to_drop if col in X_t.columns]
                X_t = X_t.drop(cols_to_drop, axis=1)
            y_pred = self.models[i].predict(X_t[self.folds_dict[i]['columns']]).reshape(-1, full_prediction.shape[1])

            # in case transformation changes the number of the rows
            if full_prediction.shape[0] != len(y_pred):
                full_prediction = np.zeros((y_pred.shape[0], self.oof.shape[1]))

            if averaging == 'usual':
                full_prediction += y_pred
            elif averaging == 'rank':
                # rank a flattened copy, then restore the (n_rows, n_target) shape
                full_prediction += pd.Series(y_pred.ravel()).rank().values.reshape(y_pred.shape)

        return full_prediction / len(self.models)

    def plot_feature_importance(self, drop_null_importance: bool = True, top_n: int = 10):
        """
        Plot default feature importance.

        :param drop_null_importance: drop columns with null feature importance
        :param top_n: show top n columns
        :return:
        """

        top_feats = self.get_top_features(drop_null_importance, top_n)
        feature_importances = self.feature_importances.loc[self.feature_importances['feature'].isin(top_feats)].copy()
        feature_importances['feature'] = feature_importances['feature'].astype(str)
        top_feats = [str(i) for i in top_feats]
        sns.barplot(data=feature_importances, x='importance', y='feature', orient='h', order=top_feats)
        plt.title('Feature importances')

    def get_top_features(self, drop_null_importance: bool = True, top_n: int = 10):
        """
        Get top features by importance.

        :param drop_null_importance:
        :param top_n:
        :return:
        """
        grouped_feats = self.feature_importances.groupby(['feature'])['importance'].mean()
        if drop_null_importance:
            grouped_feats = grouped_feats[grouped_feats != 0]
        return list(grouped_feats.sort_values(ascending=False).index)[:top_n]

    def plot_metric(self):
        """
        Plot training progress.
        Inspired by `plot_metric` from https://lightgbm.readthedocs.io/en/latest/_modules/lightgbm/plotting.html

        :return:
        """
        full_evals_results = pd.DataFrame()
        for model in self.models:
            evals_result = pd.DataFrame()
            for k in model.model.evals_result_.keys():
                evals_result[k] = model.model.evals_result_[k][self.eval_metric]
            evals_result = evals_result.reset_index().rename(columns={'index': 'iteration'})
            full_evals_results = pd.concat([full_evals_results, evals_result])

        full_evals_results = full_evals_results.melt(id_vars=['iteration']).rename(columns={'value': self.eval_metric,
                                                                                            'variable': 'dataset'})
        sns.lineplot(data=full_evals_results, x='iteration', y=self.eval_metric, hue='dataset')
        plt.title('Training progress')

In [None]:
y = reduce_train['accuracy_group']

cols_to_drop = ['game_session', 'installation_id', 'timestamp', 'accuracy_group', 'timestampDate']

n_fold = 5
folds = GroupKFold(n_splits=n_fold)

In [None]:
if debug:
    # lgb
    n_estimators = 100
    early_stopping_rounds = 100
    
    # bayes_opt
    init_points = 2
    n_iter = 2
else:
    # lgb
    n_estimators = 2000
    early_stopping_rounds = 100
    
    # bayes_opt
    init_points = 16
    n_iter = 16

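# Hyperparameter tuning

Hyperparameter tuning is essential for good accuracy.  
The official documentation describes the critical hyperparameters of LightGBM and how to tune them.  
https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

At first my model overfit the train data: as iterations went on, the train score kept rising while the val score fell.  
According to "Deal with Over-fitting" in the official documentation, max_bin, num_leaves, min_data_in_leaf, max_depth, lambda_l1 and lambda_l2 look like the important ones.  
Many of the features in this model are discrete, so I expected changing num_leaves / min_data_in_leaf / max_depth to matter more than max_bin, and searched over those hyperparameters.  

I used the code from the following Notebook as is and ran the search with Bayesian optimization.  
https://www.kaggle.com/hengzheng/bayesian-optimization-seed-blending

Other options for parameter tuning include sklearn's GridSearch, optuna and hyperopt.  
There is a paper comparing parameter search algorithms (Bergstra J.S. et al. 2011), but my impression is that you just have to try them.  
https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf  

In experiments of about one or two hours, bayes_opt gave the best results, so I adopted it. However...

Something that was completely off my radar until near the end of the contest: the parameter search does not have to run inside the Notebook you submit.  
This competition only accepted Notebooks that run within 7 hours, so I had assumed that a large parameter search was not possible.  
But all I had to do was run the search beforehand and paste the resulting parameters into the Notebook.  
A longer search might have found better parameters.  

Hyperopt: https://github.com/hyperopt/hyperopt  
Optuna: https://github.com/optuna/optuna

→ Indeed, the top teams' models appear to have their hyperparameters hard-coded.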
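
As a sketch of the "search offline, paste the result" workflow, the snippet below turns a pasted search result into a parameter dictionary. The helper `to_lgb_params` and the constant `INT_PARAMS` are assumptions for this illustration; the searched values are copied from one row of the search log kept as a comment in this Notebook, and the optimizer returns every value as a float, so integer-valued parameters are truncated with `int()` in the same way as the cells here do.

```python
# Parameters that LightGBM expects as integers
INT_PARAMS = ("num_leaves", "min_data_in_leaf", "max_depth", "bagging_freq")

def to_lgb_params(searched, fixed=None):
    """Merge fixed settings with searched values, truncating integer-valued ones with int()."""
    params = dict(fixed or {})
    for name, value in searched.items():
        params[name] = int(value) if name in INT_PARAMS else float(value)
    return params

# Result of an earlier search, pasted by hand instead of re-running the search
searched = {
    "bagging_fraction": 0.5536,
    "bagging_freq": 2.446,
    "colsample_bytree": 0.5529,
    "lambda_l1": 0.104,
    "lambda_l2": 0.6761,
    "learning_rate": 0.05581,
    "max_depth": 7.549,
    "min_data_in_leaf": 30.14,
    "num_leaves": 28.85,
}
fixed = {"boosting_type": "gbdt", "objective": "regression", "metric": "rmse", "seed": 42}

params = to_lgb_params(searched, fixed)
print(params)
```

Keeping the pasted dictionary separate from the fixed settings makes it easy to swap in the result of a longer search later.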

In [None]:
def bayes_base(boosting_type,
                 num_leaves,
                 min_data_in_leaf,
                 max_depth,
                 lambda_l1,
                 lambda_l2,
                 bagging_fraction,
                 bagging_freq,
                 colsample_bytree,
                 learning_rate):
    
    params = {
        'boosting_type': boosting_type,
        'metric': 'rmse',
        'objective': 'regression',
        'eval_metric': 'cappa',
        'n_jobs': -1,
        'seed': 42,
        'early_stopping_rounds': early_stopping_rounds,
        'n_estimators': n_estimators,
        'learning_rate': learning_rate,
        'num_leaves': int(num_leaves),
        'min_data_in_leaf': int(min_data_in_leaf),
        'max_depth': int(max_depth),
        'lambda_l1': lambda_l1,
        'lambda_l2': lambda_l2,
        'bagging_fraction': bagging_fraction,
        'bagging_freq': int(bagging_freq),
        'colsample_bytree': colsample_bytree,
        'verbose': 0
    }
    
    mt = MainTransformer()
    ft = FeatureTransformer()
    transformers = {'ft': ft}
    model = RegressorModel(model_wrapper=LGBWrapper_regr())
    model.fit(X=reduce_train, 
              y=y, 
              folds=folds, 
              params=params, 
              preprocesser=mt, 
              transformers=transformers,
              eval_metric='cappa', 
              cols_to_drop=cols_to_drop,
              plot=False)
    
    return model.scores['valid']

def lgb_bayesian_(boosting_type):
    def h(num_leaves,
                 min_data_in_leaf,
                 max_depth,
                 lambda_l1,
                 lambda_l2,
                 bagging_fraction,
                 bagging_freq,
                 colsample_bytree,
                 learning_rate):
        return bayes_base(boosting_type,
                 num_leaves,
                 min_data_in_leaf,
                 max_depth,
                 lambda_l1,
                 lambda_l2,
                 bagging_fraction,
                 bagging_freq,
                 colsample_bytree,
                 learning_rate)
    return h


In [None]:
gc.collect()

In [None]:
def run_bo(random_state, boosting_type):
    bounds_LGB = {
        'num_leaves': (10, 50),
        'min_data_in_leaf': (10, 40),
        'max_depth': (6, 11),
        'lambda_l1': (0, 5),
        'lambda_l2': (0, 5),
        'bagging_fraction': (0.4, 0.6),
        'bagging_freq': (1, 10),
        'colsample_bytree': (0.4, 0.6),
        'learning_rate': (0.05, 0.1)
    }
    LGB_BO = BayesianOptimization(lgb_bayesian_(boosting_type), bounds_LGB, random_state=random_state)

    with warnings.catch_warnings():
        warnings.filterwarnings('ignore')
        LGB_BO.maximize(init_points=init_points, n_iter=n_iter, acq='ucb', xi=0.0, alpha=1e-6)
        
    del bounds_LGB
    
    return LGB_BO

def predict_with_final_model(LGB_BO):
    params = {
        'boosting_type': 'gbdt',
        'metric': 'rmse',
        'objective': 'regression',
        'eval_metric': 'cappa',
        'n_jobs': -1,
        'seed': 42,
        'early_stopping_rounds': early_stopping_rounds,
        'n_estimators': n_estimators,
        'learning_rate': LGB_BO.max['params']['learning_rate'],
        'num_leaves': int(LGB_BO.max['params']['num_leaves']),
        'min_data_in_leaf': int(LGB_BO.max['params']['min_data_in_leaf']),
        'max_depth': int(LGB_BO.max['params']['max_depth']),
        'lambda_l1': LGB_BO.max['params']['lambda_l1'],
        'lambda_l2': LGB_BO.max['params']['lambda_l2'],
        'bagging_fraction': LGB_BO.max['params']['bagging_fraction'],
        'bagging_freq': int(LGB_BO.max['params']['bagging_freq']),
        'colsample_bytree': LGB_BO.max['params']['colsample_bytree'],
        'verbose': 100
    }

    mt = MainTransformer()
    ft = FeatureTransformer()
    transformers = {'ft': ft}
    regressor_model = RegressorModel(model_wrapper=LGBWrapper_regr())
    regressor_model.fit(X=reduce_train, 
                        y=y, 
                        folds=folds, 
                        params=params, 
                        preprocesser=mt, 
                        transformers=transformers,
                        eval_metric='cappa', 
                        cols_to_drop=cols_to_drop)

    preds = regressor_model.predict(reduce_test)
    w = LGB_BO.max['target']
    
    del LGB_BO, params
    gc.collect()
    
    return preds, w, regressor_model

class mock():
    def __init__(self, d):
        self.max = d
        
# |   iter    |  target   | baggin... | baggin... | colsam... | lambda_l1 | lambda_l2 | learni... | max_depth | min_da... | num_le... |
# |  1        |  0.5937   |  0.4834   |  7.483    |  0.4      |  1.512    |  0.7338   |  0.05462  |  6.931    |  20.37    |  25.87    |
# |  3        |  0.5945   |  0.5536   |  2.446    |  0.5529   |  0.104    |  0.6761   |  0.05581  |  7.549    |  30.14    |  28.85    |
# |  30       |  0.5909   |  0.5993   |  1.225    |  0.4252   |  4.035    |  0.3233   |  0.05728  |  9.927    |  39.61    |  11.49    |
# |  2        |  0.5918   |  0.5752   |  4.22     |  0.5002   |  3.417    |  3.564    |  0.06851  |  8.806    |  25.09    |  10.55    |
# |  10       |  0.5922   |  0.5066   |  3.272    |  0.5442   |  1.837    |  2.493    |  0.06133  |  7.768    |  29.53    |  22.52    |

def bo_lgb(random_state=0):
    if skip_bo:
        # These parameters are from the run of V1.
            
        if random_state == 1:
            LGB_BO = mock({'target': 0.6053, 'params': {'num_leaves': 31,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.5999, 
                                           'bagging_freq': 3.71, 
                                           'colsample_bytree': 0.4492,
                                           'lambda_l1': 4.742,
                                           'lambda_l2': 0.4251,
                                           'learning_rate': 0.0522,
                                           'max_depth': 8.751}})
        elif random_state == 12:
            LGB_BO = mock({'target': 0.6042, 'params': {'num_leaves': 31,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.9994, 
                                           'bagging_freq': 4.42, 
                                           'colsample_bytree': 0.8754,
                                           'lambda_l1': 9.963,
                                           'lambda_l2': 2.676,
                                           'learning_rate': 0.1029,
                                           'max_depth': 13.99}})
        elif random_state == 123:
            LGB_BO = mock({'target': 0.6048, 'params': {'num_leaves': 31,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.7994, 
                                           'bagging_freq': 4.12, 
                                           'colsample_bytree': 0.6754,
                                           'lambda_l1': 7.963,
                                           'lambda_l2': 1.676,
                                           'learning_rate': 0.0829,
                                           'max_depth': 10.99}})
            
        elif random_state == 1234:
            LGB_BO = mock({'target': 0.5932, 'params': {'num_leaves': 31,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.6, 
                                           'bagging_freq': 1.0, 
                                           'colsample_bytree': 0.4,
                                           'lambda_l1': 0.0,
                                           'lambda_l2': 2.636,
                                           'learning_rate': 0.05,
                                           'max_depth': 8.0}})
        elif random_state == 12345:
            LGB_BO = mock({'target': 0.5939, 'params': {'num_leaves': 31,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.4735, 
                                           'bagging_freq': 5.488, 
                                           'colsample_bytree': 0.4453,
                                           'lambda_l1': 1.768,
                                           'lambda_l2': 3.254,
                                           'learning_rate': 0.06565,
                                           'max_depth': 10.31}})
        elif random_state == 2:
            LGB_BO = mock({'target': 0.5922, 'params': {'bagging_fraction': 0.5066, 
                                           'bagging_freq': 3.272, 
                                           'colsample_bytree': 0.5442,
                                           'lambda_l1': 1.837,
                                           'lambda_l2': 2.493,
                                           'learning_rate': 0.06133,
                                           'max_depth': 7.768,
                                           'min_data_in_leaf': 29.53,
                                           'num_leaves': 22.52}})

        elif random_state == 23:
            LGB_BO = mock({'target': 0.5918, 'params': {'bagging_fraction': 0.5752, 
                                           'bagging_freq': 4.22, 
                                           'colsample_bytree': 0.5002,
                                           'lambda_l1': 3.417,
                                           'lambda_l2': 3.564,
                                           'learning_rate': 0.06851,
                                           'max_depth': 8.806,
                                           'min_data_in_leaf': 25.09,
                                           'num_leaves': 10.55}})
        elif random_state == 234:
            LGB_BO = mock({'target': 0.5909, 'params': {'bagging_fraction': 0.5536, 
                                           'bagging_freq': 1.225, 
                                           'colsample_bytree': 0.4252,
                                           'lambda_l1': 4.035,
                                           'lambda_l2': 0.3233,
                                           'learning_rate': 0.05728,
                                           'max_depth': 9.927,
                                           'min_data_in_leaf': 39.61,
                                           'num_leaves': 11.49}})
        elif random_state == 2345:
            LGB_BO = mock({'target': 0.5945, 'params': {'bagging_fraction': 0.5536, 
                                           'bagging_freq': 2.446, 
                                           'colsample_bytree': 0.5529,
                                           'lambda_l1': 0.104,
                                           'lambda_l2': 0.6761,
                                           'learning_rate': 0.05581,
                                           'max_depth': 7.549,
                                           'min_data_in_leaf': 30.14,
                                           'num_leaves': 28.85}})
        elif random_state == 23456:
            LGB_BO = mock({'target': 0.5937, 'params': {'bagging_fraction': 0.4834, 
                                           'bagging_freq': 7.483, 
                                           'colsample_bytree': 0.4,
                                           'lambda_l1': 1.512,
                                           'lambda_l2': 0.7338,
                                           'learning_rate': 0.05462,
                                           'max_depth': 6.931,
                                           'min_data_in_leaf': 20.37,
                                           'num_leaves': 25.87}})
        elif random_state == 3:
            LGB_BO = mock({'target': 0.6053, 'params': {'num_leaves': 13,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.5999, 
                                           'bagging_freq': 3.71, 
                                           'colsample_bytree': 0.4492,
                                           'lambda_l1': 4.742,
                                           'lambda_l2': 0.4251,
                                           'learning_rate': 0.0422,
                                           'max_depth': 8.751}})
        elif random_state == 34:
            LGB_BO = mock({'target': 0.6042, 'params': {'num_leaves': 13,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.9994, 
                                           'bagging_freq': 4.42, 
                                           'colsample_bytree': 0.8754,
                                           'lambda_l1': 9.963,
                                           'lambda_l2': 2.676,
                                           'learning_rate': 0.1029,
                                           'max_depth': 13.99}})
        elif random_state == 345:
            LGB_BO = mock({'target': 0.6048, 'params': {'num_leaves': 13,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.7994, 
                                           'bagging_freq': 4.12, 
                                           'colsample_bytree': 0.6754,
                                           'lambda_l1': 7.963,
                                           'lambda_l2': 1.676,
                                           'learning_rate': 0.0829,
                                           'max_depth': 10.99}})
        elif random_state == 4:
            LGB_BO = mock({'target': 0.6053, 'params': {'num_leaves': 21,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.5999, 
                                           'bagging_freq': 3.71, 
                                           'colsample_bytree': 0.4492,
                                           'lambda_l1': 4.742,
                                           'lambda_l2': 0.4251,
                                           'learning_rate': 0.0222,
                                           'max_depth': 8.751}})
        elif random_state == 45:
            LGB_BO = mock({'target': 0.6042, 'params': {'num_leaves': 21,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.9994, 
                                           'bagging_freq': 4.42, 
                                           'colsample_bytree': 0.8754,
                                           'lambda_l1': 9.963,
                                           'lambda_l2': 2.676,
                                           'learning_rate': 0.0529,
                                           'max_depth': 13.99}})
        elif random_state == 456:
            LGB_BO = mock({'target': 0.6048, 'params': {'num_leaves': 21,
                                           'min_data_in_leaf': 20,
                                           'bagging_fraction': 0.7994, 
                                           'bagging_freq': 4.12, 
                                           'colsample_bytree': 0.6754,
                                           'lambda_l1': 7.963,
                                           'lambda_l2': 1.676,
                                           'learning_rate': 0.0429,
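                                           'max_depth': 10.99}})
        else:
            # No stored result for this seed; fail loudly instead of hitting an unbound LGB_BO below.
            raise ValueError('no stored hyperparameters for random_state=%s' % random_state)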
    else:
        LGB_BO = run_bo(random_state, 'gbdt')
    
    
    
    return predict_with_final_model(LGB_BO)

def bo_dart(random_state=0):
    LGB_BO = run_bo(random_state, 'dart')
    
    return predict_with_final_model(LGB_BO)
    

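# Model Ensemble

Ensemble: https://scikit-learn.org/stable/modules/ensemble.html

LightGBM produces a different model for each random seed.  
The model I submitted trains several models from hyperparameters found by Bayesian optimization,  
lets each of them make a prediction, and averages those predictions.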

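Ensembling is not just a heuristic; it has a solid theoretical background (see Chapter 16 of The Elements of Statistical Learning).  

Intuitively, every model ends up with weak spots in some part of the feature space. If we assume those weak spots fall in different places for different models,  
then letting several models predict and taking the mean or the mode of their predictions covers the weak spots.  
It is a way to raise overall accuracy while keeping each individual model simple.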
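
The intuition above can be checked with a small simulation. This is only a sketch under a strong assumption: each "model" is the true value plus its own independent noise, which is an idealized version of "weak spots in different places". The data and the noise level are made up and have nothing to do with the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 3, size=1000)  # stand-in for the true target

# Five hypothetical models: each one is the truth plus its own independent error.
model_preds = [truth + rng.normal(0, 0.5, size=truth.shape) for _ in range(5)]

def rmse(pred, target):
    """Root mean squared error between a prediction and the target."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

single_rmse = [rmse(p, truth) for p in model_preds]
ensemble_rmse = rmse(np.mean(model_preds, axis=0), truth)

print('single models:', np.round(single_rmse, 3))
print('ensemble     :', round(ensemble_rmse, 3))
```

With independent errors the averaged prediction has a clearly lower RMSE than any single model. Real models trained on the same data have correlated errors, so the gain is smaller in practice.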

I average the models' predictions before rounding, but it is also possible to round first and then take the mode of the models' predictions.  
The former gave better accuracy, so I used it.
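
To make the two options concrete, here is a minimal sketch on made-up numbers. The array `raw` and the helper `to_label` are hypothetical; `to_label` uses plain rounding rather than the tuned thresholds used later in this notebook.

```python
import numpy as np

# Hypothetical raw regression outputs: 3 models (rows) x 4 samples (columns).
raw = np.array([
    [0.4, 1.6, 2.4, 2.9],
    [0.3, 1.4, 2.4, 3.1],
    [1.7, 1.4, 2.8, 0.4],
])

def to_label(x):
    """Round to the nearest label and clip to the valid range 0..3."""
    return np.clip(np.rint(x), 0, 3).astype(int)

# Option 1: average the raw predictions, then round once.
mean_then_round = to_label(raw.mean(axis=0))

# Option 2: round every model first, then take the most frequent label per sample.
labels = to_label(raw)
mode_after_round = np.array(
    [np.bincount(col, minlength=4).argmax() for col in labels.T]
)

print(mean_then_round)   # [1 1 3 2]
print(mode_after_round)  # [0 1 2 3]
```

The two options disagree on three of the four samples here. Averaging first keeps how far each raw prediction is from a boundary, while the mode only counts votes.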

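→ Many of the top-ranked solutions seem to use stacking for their ensembles.  
Here I used a simple weighted average of the models' predictions as the ensemble prediction.  
Stacking, by contrast, places a meta-model one level up that takes each model's predictions as input and predicts the label.  
With my models there was little difference in performance between averaging and stacking, so I used the simpler average.  
The top solutions apparently got stronger models by preparing more diverse base models, such as NNs and GBDTs, and stacking on top of them.  
https://www.kaggle.com/c/data-science-bowl-2019/discussion/127210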


I found these explanations of stacking easy to follow.  
https://books.google.com/books?id=nwQZCwAAQBAJ&lpg=PA500&dq=stacking%20classifier%20subsets&pg=PA499#v=onepage&q&f=false  
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
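
As a rough illustration of stacking, the sketch below uses scikit-learn's `StackingRegressor`. This is not what this notebook or the top solutions actually ran: the synthetic data, the two base models, and the `Ridge` meta-model are all stand-ins chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, only to have something to fit.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[
        # Level-0 models: deliberately different kinds of learners.
        ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
        ('knn', KNeighborsRegressor(n_neighbors=5)),
    ],
    # Level-1 meta-model: learns how to combine the base predictions.
    final_estimator=Ridge(),
    # The meta-model is trained on out-of-fold predictions from 5-fold CV.
    cv=5,
)
stack.fit(X_train, y_train)
stacked_pred = stack.predict(X_test)
print(stacked_pred[:5])
```

Training the meta-model on out-of-fold predictions matters: if it saw predictions made on the base models' own training rows, it would learn to trust overfitted outputs.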

In [None]:
if debug:
    seeds = [1, 12]
else:
    seeds = [1, 12, 123, 1234, 3, 34, 345]

preds = []
models = []
for s in seeds:
    p, w, m = bo_lgb(s)
    preds.append((p, w))
    models.append(m)

# for s in seeds:
#     p, w, m = bo_dart(s)  # bo_dart also returns (preds, weight, model)
#     preds.append((p, w))

In [None]:
w_total = sum([pred[1] for pred in preds])

lgb_preds = sum([pred[0] * pred[1] / w_total for pred in preds])


In [None]:
# # Add xgboost!
def train_xgboost():
    xgb_params = {
            'colsample_bytree': 0.8,
            'learning_rate': 0.05,
            'max_depth': 7,
            'subsample': 1,
            'objective':'multi:softprob',
            'num_class':4,
            'eval_metric':'merror',
            'min_child_weight':10,
            'gamma':0.25,
            'n_estimators':500,
            'nthread': 6,
            'verbose': 1000,
            'early_stopping_rounds': 100,
        }
    mt = MainTransformer()
    ft = FeatureTransformer()
    transformers = {'ft': ft}
    xgb_model = ClassifierModel(model_wrapper=XGBWrapper())
    xgb_model.fit(X=reduce_train, y=y, folds=folds, params=xgb_params, preprocesser=mt, transformers=transformers,
                  eval_metric='cappa', cols_to_drop=cols_to_drop)
    return xgb_model.predict(reduce_test)
    
# XGBoost gets a weight of 0.2 in the final blend (LightGBM gets 0.8).
if not skip_xgb:
    xgb_pred = train_xgboost()
    print(xgb_pred)
    

# Rounding

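Convert the real values obtained from the regression into labels.  
The question is where to place the cut points for rounding, and these thresholds can also be learned.  
https://www.kaggle.com/naveenasaithambi/optimizedrounder-improved

In the end, fixed hand-picked values gave better accuracy.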
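
The following sketch shows the mechanics of threshold-based rounding and how a choice of thresholds is scored with QWK. `raw_preds` and `y_true` are made-up values, `apply_thresholds` is a hypothetical helper, and QWK is computed here with scikit-learn's `cohen_kappa_score(weights='quadratic')` rather than the `qwk` function used in this notebook. The second set of thresholds is the fixed set used further down.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def apply_thresholds(raw_preds, thresholds):
    """Map real-valued predictions to labels 0..3 using three cut points."""
    # right=True sends a value equal to a threshold to the lower label.
    return np.digitize(raw_preds, np.sort(thresholds), right=True)

raw_preds = np.array([0.3, 1.2, 1.8, 2.1, 2.6, 1.0])  # hypothetical regression outputs
y_true = np.array([0, 1, 2, 2, 3, 0])                # hypothetical true labels

naive = apply_thresholds(raw_preds, [0.5, 1.5, 2.5])
fixed = apply_thresholds(raw_preds, [1.12232214, 1.73925866, 2.22506454])

qwk_naive = cohen_kappa_score(y_true, naive, weights='quadratic')
qwk_fixed = cohen_kappa_score(y_true, fixed, weights='quadratic')
print(naive, qwk_naive)
print(fixed, qwk_fixed)
```

The toy data only demonstrates how moving a threshold changes the labels and therefore the score; it says nothing about which thresholds are better on the real data.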

In [None]:
if skip_xgb:
    preds = lgb_preds
else:
    preds = 0.8 * lgb_preds + 0.2 * xgb_pred

In [None]:
class OptimizedRounder(object):
    """
    An optimizer for rounding thresholds
    to maximize Quadratic Weighted Kappa (QWK) score
    # https://www.kaggle.com/naveenasaithambi/optimizedrounder-improved
    """
    def __init__(self):
        self.coef_ = 0

    def _kappa_loss(self, coef, X, y):
        """
        Get loss according to
        using current coefficients
        
        :param coef: A list of coefficients that will be used for rounding
        :param X: The raw predictions
        :param y: The ground truth labels
        """
        X_p = pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3])

        return -qwk(y, X_p)

    def fit(self, X, y):
        """
        Optimize rounding thresholds
        
        :param X: The raw predictions
        :param y: The ground truth labels
        """
        loss_partial = partial(self._kappa_loss, X=X, y=y)
        initial_coef = [0.5, 1.5, 2.5]
        self.coef_ = sp.optimize.minimize(loss_partial, initial_coef, method='nelder-mead')

    def predict(self, X, coef):
        """
        Make predictions with specified thresholds
        
        :param X: The raw predictions
        :param coef: A list of coefficients that will be used for rounding
        """
        return pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3])


    def coefficients(self):
        """
        Return the optimized coefficients
        """
        return self.coef_['x']

In [None]:
# Discretize the predictions
def cv_predict(models, X):
    return np.mean([model.predict(X) for model in models], axis=0)

if train_rounder:
    # TODO: Train rounder using the xgb model too?
    regressed = cv_predict(models, reduce_train)
    rounder = OptimizedRounder()
    rounder.fit(regressed.reshape(-1,), y)
    coefficients = rounder.coefficients()

In [None]:
# Fixed Rounding
if not train_rounder:
    coefficients = [1.12232214, 1.73925866, 2.22506454]

In [None]:
# Map each regression output to a label 0..3 in one pass.
# right=True keeps the original "<= threshold" boundaries, and values are never
# re-compared after being overwritten, so any learned thresholds are safe to use.
preds = np.digitize(preds, np.sort(coefficients), right=True)

In [None]:
sample_submission['accuracy_group'] = preds.astype(int)
sample_submission.to_csv('submission.csv', index=False)

In [None]:
sample_submission['accuracy_group'].value_counts(normalize=True)

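# How to Learn Data Science

Notes to myself.

1. Stand on the shoulders of giants

When you join a competition, first read through the most-voted notebooks.  
Especially if you join in the second half of a contest, most features and models you can think of from scratch have already been thought of by someone else.  
I think you should first check what your predecessors came up with and only then look for your own ideas.  
To stand on a giant's shoulders rather than merely cling to the giant, you of course need to properly understand the giant's thinking.  

→ In this competition, a very high-scoring public kernel included the idea of scaling the features.  
I was skeptical because scaling changes the meaning of the features, so I did not adopt it.  
Rumor has it that many models that adopted this scaling ended up with very poor accuracy.  
Trying to understand the giant saved me from this.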

2. Read notebooks from past competitions

Features and models that worked well in similar past competitions may work well in this one too.  
Someone has compiled the top notebooks from past competitions.  

https://www.kaggle.com/sudalairajkumar/winning-solutions-of-kaggle-competitions/data

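This list shows that LightGBM, XGBoost, CatBoost, NNs, and ensembles of them dominate.  

3. Most improvements you think of will fail

Most new features and model improvements you come up with do not work (probably even for strong competitors).  
I did research in graduate school, so I understand that this is how it goes; raising accuracy takes unglamorous trial and error.  
Data science, too, probably requires being ready to get your hands dirty.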


## Notebooks I Referred To

QWK explained: https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps  
Computing QWK: https://www.kaggle.com/c/data-science-bowl-2019/discussion/114133#latest-660168  
Fast QWK computation: https://www.kaggle.com/cpmpml/ultra-fast-qwk-calc-method  

Exploratory Data Analysis 1: https://www.kaggle.com/jaseziv83/dsb-a-rrrare-r-notebook-and-baseline-model  
Exploratory Data Analysis 2: https://www.kaggle.com/gpreda/2019-data-science-bowl-eda  
Time series analysis: https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python

Truncated features: https://www.kaggle.com/braquino/890-features  
A notebook that builds many features: https://www.kaggle.com/keremt/fastai-feature-engineering-part1-6160-features  
Features based on specs.csv: https://www.kaggle.com/bhavikapanara/2019-data-science-bowl-some-interesting-features  
Scaling the data: https://www.kaggle.com/khoongweihao/top-5-pub-to-top-53-priv-convert-disaster (an idea that did not work)  
Correlation between features: https://www.kaggle.com/reisel/how-to-handle-correlated-features

Handling outliers: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

Bayesian optimization: https://www.kaggle.com/hengzheng/bayesian-optimization-seed-blending

## Discussions I Referred To

Features that use future information: https://www.kaggle.com/c/data-science-bowl-2019/discussion/117724  

2nd place, decayed features: https://www.kaggle.com/c/data-science-bowl-2019/discussion/127388  
7th place, checked whether each of 150 features was noise: https://www.kaggle.com/c/data-science-bowl-2019/discussion/127213  
8th place, NN-only model with normalized features: https://www.kaggle.com/c/data-science-bowl-2019/discussion/127285

4th place, model using an RNN and stacking: https://www.kaggle.com/c/data-science-bowl-2019/discussion/127210


## Helpful Tutorials

Pandas Tutorial: https://www.kaggle.com/learn/pandas

Seaborn Tutorial: https://www.kaggle.com/kanncaa1/seaborn-tutorial-for-beginners

What to do next after signing up for Kaggle: 10 introductory kernels that take you beyond Titanic (in Japanese): https://qiita.com/upura/items/3c10ff6fed4e7c3d70f0


## References

PBS Kids Measure Up: https://measureup.pbskids.org/

LightGBM Document: https://lightgbm.readthedocs.io/

Bayesian Optimization: https://github.com/fmfn/BayesianOptimization

Box-Cox Transformation: http://onlinestatbook.com/2/transformations/box-cox.html

Ensemble Methods: https://scikit-learn.org/stable/modules/ensemble.html

Stacking Ensemble: http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/  
Stacking Ensemble: https://books.google.com/books?id=nwQZCwAAQBAJ&lpg=PA500&dq=stacking%20classifier%20subsets&pg=PA499#v=onepage&q&f=false

Ke, Guolin, et al. "Lightgbm: A highly efficient gradient boosting decision tree." Advances in neural information processing systems. 2017.  
https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf

Chen, Tianqi, and Carlos Guestrin. "Xgboost: A scalable tree boosting system." Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016.  
https://dl.acm.org/doi/pdf/10.1145/2939672.2939785?download=true

Bergstra, James S., et al. "Algorithms for hyper-parameter optimization." Advances in neural information processing systems. 2011.  
https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf


Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. New York: Springer, 2001.

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.

