# About this notebook

You might have noticed that the train dataset contains over 11M data points, but there are only 17k training labels and 1,000 test labels to predict. The reason is that each `installation_id` has thousands of entries, each representing an `event`. This notebook gathers all the events into groups, one per labelled `installation_id`. It then aggregates each group (using sums, counts, mean, std, etc.), which yields a dataset of summary statistics for each `installation_id`. Finally, it fits a model on that dataset.
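
As a minimal sketch of the group-then-aggregate idea, here is a toy event log (the values are made up, not taken from the competition data) reduced to one row of summary statistics per `installation_id`:

```python
import pandas as pd

# Toy event log: several events per installation_id (made-up values).
events = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b", "b"],
    "game_time": [0, 10, 30, 0, 8],
    "event_count": [1, 2, 3, 1, 2],
})

# One row per installation_id, one column per (feature, statistic) pair.
summary = events.groupby("installation_id").agg(["sum", "mean", "max"])
print(summary)
```

The real notebook does the same thing in two stages (per game session first, then per installation), but the resulting shape is the same: many event rows in, one feature row out.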

## Updates

V20:
* Updated variable names for clarity.

V17:
* Removed statistics on event codes, since that created a lot of columns and LGBM seems to overfit on that information.

V16:
* Added mode of title `accuracy_group` (retrieved from training set) as a feature

V10:
* Fixed labelling problem. Before that, I was blindly predicting the target without even knowing which title was being assessed 🤦. I fixed that by using the "title" column from `train_labels.csv`, and by using the last row of each installation_id from `test.csv` to construct a `test_labels` dataframe.

V8: 
* Added `cv_train`, a function that trains k models, one per split of a k-fold CV. You can then pass the list of models to `cv_predict` to predict an output (and blend the results).
* Added more summary statistics for `event_code` and `game_time`, including skewness of the distribution.

## References

* CV idea inspired by [this kernel](https://www.kaggle.com/tanreinama/ds-bowl-2019-simple-lgbm-aggregated-data-with-cv). Thank you!
* Adding mode as a feature: https://www.kaggle.com/mhviraf/a-baseline-for-dsb-2019

In [2]:
import os

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import lightgbm as lgb  # needed by cv_train and plot_importance below
import scipy as sp
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from tqdm import tqdm

In [3]:
tqdm.pandas()

# Load Data

In [5]:
%%time
# Columns needed downstream. Note: keep_cols is not passed to read_csv below,
# so all columns are loaded; pass usecols=keep_cols to save memory.
keep_cols = ['event_id', 'game_session', 'installation_id', 'event_count', 'event_code', 'title', 'game_time', 'type', 'world']

train = pd.read_csv(r'D:\Artificial Intelligence\Kaggle\2019 Data Science Bowl\Data\train.csv')
train_labels = pd.read_csv(r'D:\Artificial Intelligence\Kaggle\2019 Data Science Bowl\Data\train_labels.csv')
spec = pd.read_csv(r'D:\Artificial Intelligence\Kaggle\2019 Data Science Bowl\Data\specs.csv')
    
test = pd.read_csv(r'D:\Artificial Intelligence\Kaggle\2019 Data Science Bowl\Data\test.csv')
submission = pd.read_csv(r'D:\Artificial Intelligence\Kaggle\2019 Data Science Bowl\Data\sample_submission.csv')

Wall time: 48.2 s
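
The `keep_cols` list in the cell above is defined but never handed to `read_csv`. A minimal sketch of how `usecols` restricts what gets loaded, using a small in-memory CSV with made-up rows in place of the competition files:

```python
import io
import pandas as pd

# Stand-in for train.csv: a few columns, two made-up rows.
csv_text = (
    "event_id,timestamp,event_data,installation_id,game_time\n"
    "27253bdc,2019-09-13T00:30:24.242Z,{},0006c192,0\n"
    "f93fc684,2019-09-13T00:31:22.262Z,{},0006c192,2157\n"
)

# Only the listed columns are parsed; timestamp and event_data are skipped.
keep_cols = ["event_id", "installation_id", "game_time"]
small = pd.read_csv(io.StringIO(csv_text), usecols=keep_cols)
print(small.columns.tolist())
```

Skipping `event_data` is likely where most of the savings would come from, since it holds a JSON string per event.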


In [6]:
test_assess = test[test.type == 'Assessment'].copy()
test_labels = submission.copy()
test_labels['title'] = test_labels.installation_id.progress_apply(
    lambda install_id: test_assess[test_assess.installation_id == install_id].iloc[-1].title
)

100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:06<00:00, 157.09it/s]
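
The `progress_apply` above filters `test_assess` once per installation. A possible vectorized alternative is sketched below on made-up data; it assumes, as the `iloc[-1]` lookup does, that rows are already in chronological order and that `title` is never null:

```python
import pandas as pd

# Made-up stand-ins for test_assess and the submission frame.
test_assess = pd.DataFrame({
    "installation_id": ["x1", "x1", "x2"],
    "title": [
        "Bird Measurer (Assessment)",
        "Chest Sorter (Assessment)",
        "Cart Balancer (Assessment)",
    ],
})
test_labels = pd.DataFrame({
    "installation_id": ["x1", "x2"],
    "accuracy_group": [3, 3],
})

# Last assessment title per installation, following the row order of test_assess.
last_title = test_assess.groupby("installation_id")["title"].last()
test_labels["title"] = test_labels["installation_id"].map(last_title)
print(test_labels)
```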


In [18]:
test_labels

Unnamed: 0,installation_id,accuracy_group,title,title_mode
0,00abaee7,3,Cauldron Filler (Assessment),3
1,01242218,3,Cart Balancer (Assessment),3
2,017c5718,3,Mushroom Sorter (Assessment),3
3,01a44906,3,Mushroom Sorter (Assessment),3
4,01bc6cb6,3,Cart Balancer (Assessment),3
5,02256298,3,Cart Balancer (Assessment),3
6,0267757a,3,Mushroom Sorter (Assessment),3
7,027e7ce5,3,Bird Measurer (Assessment),0
8,02a29f99,3,Chest Sorter (Assessment),0
9,0300c576,3,Cart Balancer (Assessment),3


# Group and Reduce

In [7]:
def compute_game_time_stats(group, col):
    return group[
        ['installation_id', col, 'event_count', 'game_time']
    ].groupby(['installation_id', col]).agg(
        [np.mean, np.sum, np.std]
    ).reset_index().pivot(
        columns=col,
        index='installation_id'
    )
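
As a rough illustration of the layout `compute_game_time_stats` is meant to produce, the sketch below aggregates a made-up session table per (`installation_id`, `world`) pair and spreads the `world` values into columns. It uses `unstack` instead of the `reset_index().pivot(...)` chain above, and string aggregator names instead of the NumPy functions; both are substitutions on my part, assumed to give the same three-level column layout.

```python
import pandas as pd

# Made-up per-session records (one row per game session).
sessions = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b"],
    "world": ["CRYSTALCAVES", "CRYSTALCAVES", "NONE", "CRYSTALCAVES"],
    "event_count": [10, 20, 5, 7],
    "game_time": [100, 300, 50, 70],
})

# Aggregate per (installation_id, world), then move world into the columns.
wide = (
    sessions.groupby(["installation_id", "world"])[["event_count", "game_time"]]
    .agg(["mean", "sum"])
    .unstack("world")
)
print(wide.columns.nlevels)  # columns are (feature, statistic, world) triples
```

Installations that never visited a given world get `NaN` in those columns, which is why `group_and_reduce` ends with `fillna(0)`.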

In [8]:
def group_and_reduce(df, df_labels):
    """
    Author: https://www.kaggle.com/xhlulu/
    Source: https://www.kaggle.com/xhlulu/ds-bowl-2019-simple-lgbm-using-aggregated-data
    """
    
    # First only filter the useful part of the df
    df = df[df.installation_id.isin(df_labels.installation_id.unique())]
    
    # group_game_time is an intermediary "game session" group,
    # reduced to one record per game session. It takes
    # the max value of game_time (final game time in a session) and
    # of event_count (total number of events in the session).
    group_game_time = df.drop(columns=['event_id', 'event_code']).groupby(
        ['game_session', 'installation_id', 'title', 'type', 'world']
    ).max().reset_index()

    # title_group and event_game_time_group are grouped by installation_id
    # and reduced using summation and other summary stats
    title_group = (
        pd.get_dummies(
            group_game_time.drop(columns=['game_session', 'event_count', 'game_time']),
            columns=['title', 'type', 'world'])
        .groupby(['installation_id'])
        .sum()
    )

    event_game_time_group = (
        group_game_time[['installation_id', 'event_count', 'game_time']]
        .groupby(['installation_id'])
        .agg([np.sum, np.mean, np.std, np.min, np.max])
    )
    
    # Additional stats on group_game_time
    world_time_stats = compute_game_time_stats(group_game_time, 'world')
    type_time_stats = compute_game_time_stats(group_game_time, 'type')
    
    return (
        title_group.join(event_game_time_group)
        .join(world_time_stats)
        .join(type_time_stats)
        .fillna(0)
    )

In [9]:
%%time
train_small = group_and_reduce(train, train_labels)
test_small = group_and_reduce(test, test_labels)

print(train_small.shape)
train_small.head()



(3614, 110)
Wall time: 1min 37s


installation_id,title_12 Monkeys,title_Air Show,title_All Star Sorting,title_Balancing Act,title_Bird Measurer (Assessment),title_Bottle Filler (Activity),title_Bubble Bath,title_Bug Measurer (Activity),title_Cart Balancer (Assessment),title_Cauldron Filler (Assessment),...,"(game_time, mean, Clip)","(game_time, mean, Game)","(game_time, sum, Activity)","(game_time, sum, Assessment)","(game_time, sum, Clip)","(game_time, sum, Game)","(game_time, std, Activity)","(game_time, std, Assessment)","(game_time, std, Clip)","(game_time, std, Game)"
0006a69f,2.0,2.0,4.0,0.0,2.0,2.0,2.0,2.0,0.0,0.0,...,0.0,106966.45,3199695.0,236429.0,0.0,2139329.0,350054.566401,28330.303185,0.0,58189.254197
0006c192,1.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,0.0,2.0,...,0.0,88345.5,1210530.0,323061.0,0.0,530073.0,127422.7825,98940.202632,0.0,62500.291205
00129856,0.0,0.0,0.0,1.0,1.0,2.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1021179.0,39742.0,0.0,0.0,130499.803239,28043.854942,0.0,0.0
001d0ed0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,...,0.0,158426.166667,92282.0,201941.0,0.0,950557.0,24694.997226,17737.374861,0.0,123969.846618
00225f67,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,129984.75,294517.0,35637.0,0.0,519939.0,49028.831364,12301.536672,0.0,65432.543128


## Adding mode as feature

In [10]:
def create_title_mode(train_labels):
    titles = train_labels.title.unique()
    title2mode = {}

    for title in titles:
        mode = (
            train_labels[train_labels.title == title]
            .accuracy_group
            .value_counts()
            .index[0]
        )
        title2mode[title] = mode
    return title2mode

def add_title_mode(labels, title2mode):
    labels['title_mode'] = labels.title.apply(lambda title: title2mode[title])
    return labels

In [11]:
title2mode = create_title_mode(train_labels)
train_labels = add_title_mode(train_labels, title2mode)
test_labels = add_title_mode(test_labels, title2mode)

In [17]:
train_labels

Unnamed: 0,game_session,installation_id,title,num_correct,num_incorrect,accuracy,accuracy_group,title_mode
0,6bdf9623adc94d89,0006a69f,Mushroom Sorter (Assessment),1,0,1.000000,3,3
1,77b8ee947eb84b4e,0006a69f,Bird Measurer (Assessment),0,11,0.000000,0,0
2,901acc108f55a5a1,0006a69f,Mushroom Sorter (Assessment),1,0,1.000000,3,3
3,9501794defd84e4d,0006a69f,Mushroom Sorter (Assessment),1,1,0.500000,2,3
4,a9ef3ecb3d1acc6a,0006a69f,Bird Measurer (Assessment),1,0,1.000000,3,0
5,197a373a77101924,0006c192,Cauldron Filler (Assessment),1,0,1.000000,3,3
6,957406a905d59afd,0006c192,Bird Measurer (Assessment),1,1,0.500000,2,0
7,b2297d292892745a,0006c192,Mushroom Sorter (Assessment),0,4,0.000000,0,3
8,ae691ec5ad5652cf,00129856,Bird Measurer (Assessment),1,0,1.000000,3,0
9,7b536271e99518f0,001d0ed0,Bird Measurer (Assessment),0,5,0.000000,0,0


In [12]:
title2mode

{'Mushroom Sorter (Assessment)': 3,
 'Bird Measurer (Assessment)': 0,
 'Cauldron Filler (Assessment)': 3,
 'Chest Sorter (Assessment)': 0,
 'Cart Balancer (Assessment)': 3}
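
The same title-to-mode mapping could also be built without an explicit loop. This is a sketch on made-up labels, not a drop-in replacement I have run against `train_labels.csv`; ties are resolved by whatever `value_counts` lists first, as in `create_title_mode`:

```python
import pandas as pd

# Made-up labels: title "A" is mostly group 3, title "B" is mostly group 0.
labels = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "B"],
    "accuracy_group": [3, 3, 0, 0, 0, 2],
})

# Most frequent accuracy_group per title, collected into a plain dict.
title2mode = (
    labels.groupby("title")["accuracy_group"]
    .agg(lambda s: s.value_counts().index[0])
    .to_dict()
)
print(title2mode)
```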

## Combine train/test labels with summary stats

In [15]:
def preprocess_train(train_labels, last_records_only=True):
    """
    last_records_only (bool): Use only the last record of each user.
    """
    final_train = pd.get_dummies(
        (
            train_labels.set_index('installation_id')
            .drop(columns=['num_correct', 'num_incorrect', 'accuracy', 'game_session'])
            .join(train_small)
        ), 
        columns=['title']
    )
    
    if last_records_only:
        final_train = (
            final_train
            .reset_index()
            .groupby('installation_id')
            .apply(lambda x: x.iloc[-1])
            .drop(columns='installation_id')
        )
    
    return final_train

def preprocess_test(test_labels, test_small):
    return pd.get_dummies(
        test_labels.set_index('installation_id').join(test_small), columns=['title']
    )
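
The `last_records_only` branch keeps the final label row of each installation. A small sketch of that reduction on made-up labels, using `groupby(...).tail(1)` as an assumed equivalent of the `apply(lambda x: x.iloc[-1])` call above:

```python
import pandas as pd

# Made-up labels: installation "a" has two assessments, "b" has one.
labels = pd.DataFrame({
    "installation_id": ["a", "a", "b"],
    "accuracy_group": [3, 0, 2],
})

# Keep only the last row of each installation, preserving row order.
last = labels.groupby("installation_id").tail(1).set_index("installation_id")
print(last)
```

This leaves one training row per installation, which matches the test set, where only the last assessment of each installation is predicted.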

In [16]:
final_train = preprocess_train(train_labels)
print(final_train.shape)
final_train.head()

(3614, 117)


installation_id,accuracy_group,title_mode,title_12 Monkeys,title_Air Show,title_All Star Sorting,title_Balancing Act,title_Bird Measurer (Assessment),title_Bottle Filler (Activity),title_Bubble Bath,title_Bug Measurer (Activity),...,"(game_time, sum, Game)","(game_time, std, Activity)","(game_time, std, Assessment)","(game_time, std, Clip)","(game_time, std, Game)",title_Bird Measurer (Assessment),title_Cart Balancer (Assessment),title_Cauldron Filler (Assessment),title_Chest Sorter (Assessment),title_Mushroom Sorter (Assessment)
0006a69f,3,0,2.0,2.0,4.0,0.0,2.0,2.0,2.0,2.0,...,2139329.0,350054.566401,28330.303185,0.0,58189.254197,1,0,0,0,0
0006c192,0,3,1.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,...,530073.0,127422.7825,98940.202632,0.0,62500.291205,0,0,0,0,1
00129856,3,0,0.0,0.0,0.0,1.0,1.0,2.0,0.0,2.0,...,0.0,130499.803239,28043.854942,0.0,0.0,1,0,0,0,0
001d0ed0,3,3,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,...,950557.0,24694.997226,17737.374861,0.0,123969.846618,0,0,0,0,1
00225f67,0,0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,519939.0,49028.831364,12301.536672,0.0,65432.543128,1,0,0,0,0


In [19]:
final_test = preprocess_test(test_labels, test_small)
print(final_test.shape)
final_test.head()

(1000, 117)


installation_id,accuracy_group,title_mode,title_12 Monkeys,title_Air Show,title_All Star Sorting,title_Balancing Act,title_Bird Measurer (Assessment),title_Bottle Filler (Activity),title_Bubble Bath,title_Bug Measurer (Activity),...,"(game_time, sum, Game)","(game_time, std, Activity)","(game_time, std, Assessment)","(game_time, std, Clip)","(game_time, std, Game)",title_Bird Measurer (Assessment),title_Cart Balancer (Assessment),title_Cauldron Filler (Assessment),title_Chest Sorter (Assessment),title_Mushroom Sorter (Assessment)
00abaee7,3,3,2.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,...,2285229.0,36886.664956,21240.073493,0.0,1038605.0,0,0,1,0,0
01242218,3,3,1.0,1.0,1.0,3.0,1.0,2.0,1.0,1.0,...,1420909.0,98521.245018,32761.743006,0.0,37797.81,0,1,0,0,0
017c5718,3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6389.416875,0.0,0.0,0.0,0,0,0,0,1
01a44906,3,3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,77204.0,43064.217188,0.0,0.0,0.0,0,0,0,0,1
01bc6cb6,3,3,0.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,...,984880.0,0.0,0.0,0.0,178042.6,0,1,0,0,0


# Training model

In [20]:
def cv_train(X, y, cv, **kwargs):
    """
    Author: https://www.kaggle.com/xhlulu/
    Source: https://www.kaggle.com/xhlulu/ds-bowl-2019-simple-lgbm-using-aggregated-data
    """
    models = []
    
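    # shuffle=True is needed for random_state to take effect
    # (recent scikit-learn versions raise an error otherwise).
    kf = KFold(n_splits=cv, shuffle=True, random_state=2019)
    
    # Index names chosen so they do not shadow the train/test dataframes.
    for train_idx, val_idx in kf.split(X):
        x_train, x_val, y_train, y_val = X[train_idx], X[val_idx], y[train_idx], y[val_idx]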
        
        train_set = lgb.Dataset(x_train, y_train)
        val_set = lgb.Dataset(x_val, y_val)
        
        model = lgb.train(train_set=train_set, valid_sets=[train_set, val_set], **kwargs)
        models.append(model)
        
        if kwargs.get("verbose_eval"):
            print("\n" + "="*50 + "\n")
    
    return models

def cv_predict(models, X):
    return np.mean([model.predict(X) for model in models], axis=0)
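
To make the blending in `cv_predict` concrete, here is a small sketch with hypothetical per-fold class probabilities (two folds, two samples, four classes; the numbers are invented). Averaging over `axis=0` collapses the fold dimension, and `argmax(axis=1)`, as used in the submission cell, picks the most likely class per sample:

```python
import numpy as np

# Hypothetical outputs of model.predict(X) for two fold models.
fold_preds = [
    np.array([[0.1, 0.2, 0.3, 0.4], [0.7, 0.1, 0.1, 0.1]]),
    np.array([[0.1, 0.1, 0.2, 0.6], [0.5, 0.3, 0.1, 0.1]]),
]

# Mean over folds keeps the (n_samples, n_classes) shape.
blended = np.mean(fold_preds, axis=0)

# Most likely class per sample.
labels = blended.argmax(axis=1)
print(blended.shape, labels)
```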

In [25]:
train[train['installation_id'].isin(['0006c192']) ]

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
5339,27253bdc,3e3ac29e618b6f0a,2019-09-13T00:30:24.242Z,"{""event_code"": 2000, ""event_count"": 1}",0006c192,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE
5340,27253bdc,2be846a18a653f7c,2019-09-13T00:30:51.223Z,"{""event_code"": 2000, ""event_count"": 1}",0006c192,1,2000,0,Crystal Caves - Level 1,Clip,CRYSTALCAVES
5341,7d093bf9,e6a6a262a8243ff7,2019-09-13T00:31:20.165Z,"{""version"":""1.0"",""round"":0,""event_count"":1,""ga...",0006c192,1,2000,0,Chow Time,Game,CRYSTALCAVES
5342,f93fc684,e6a6a262a8243ff7,2019-09-13T00:31:22.262Z,"{""coordinates"":{""x"":452,""y"":680,""stage_width"":...",0006c192,2,4010,2157,Chow Time,Game,CRYSTALCAVES
5343,7ec0c298,e6a6a262a8243ff7,2019-09-13T00:31:24.467Z,"{""description"":""It's Chow Time! We have some V...",0006c192,3,3010,4351,Chow Time,Game,CRYSTALCAVES
5344,0d1da71f,e6a6a262a8243ff7,2019-09-13T00:31:30.572Z,"{""description"":""It's Chow Time! We have some V...",0006c192,4,3110,10434,Chow Time,Game,CRYSTALCAVES
5345,63f13dd7,e6a6a262a8243ff7,2019-09-13T00:31:30.573Z,"{""dinosaur"":""buddy"",""diet"":""carnivore"",""target...",0006c192,5,2020,10434,Chow Time,Game,CRYSTALCAVES
5346,7372e1a5,e6a6a262a8243ff7,2019-09-13T00:31:30.680Z,"{""coordinates"":{""x"":970,""y"":360,""stage_width"":...",0006c192,6,4070,10568,Chow Time,Game,CRYSTALCAVES
5347,7372e1a5,e6a6a262a8243ff7,2019-09-13T00:31:30.982Z,"{""coordinates"":{""x"":984,""y"":371,""stage_width"":...",0006c192,7,4070,10868,Chow Time,Game,CRYSTALCAVES
5348,7372e1a5,e6a6a262a8243ff7,2019-09-13T00:31:32.216Z,"{""coordinates"":{""x"":319,""y"":306,""stage_width"":...",0006c192,8,4070,12103,Chow Time,Game,CRYSTALCAVES


In [26]:
train_labels[train_labels['installation_id'].isin(['0006c192']) ]

Unnamed: 0,game_session,installation_id,title,num_correct,num_incorrect,accuracy,accuracy_group,title_mode
5,197a373a77101924,0006c192,Cauldron Filler (Assessment),1,0,1.0,3,3
6,957406a905d59afd,0006c192,Bird Measurer (Assessment),1,1,0.5,2,0
7,b2297d292892745a,0006c192,Mushroom Sorter (Assessment),0,4,0.0,0,3


In [22]:
X_old = final_train


final_train['accuracy_group']

installation_id
0006a69f    3
0006c192    0
00129856    3
001d0ed0    3
00225f67    0
00279ac5    0
002db7e3    0
003372b0    3
004c2091    3
00634433    0
00667b88    3
00691033    0
00a0dbeb    0
00a53963    3
00ad158e    1
00b9d8e6    2
00cef781    3
00e17272    3
00e536bf    3
00fa8681    3
00fc65b6    1
010bc1d5    3
01120f12    0
0153c957    0
0155dd86    0
015776b4    0
01582211    0
0160e7c5    1
01825124    1
01bdd720    0
           ..
fd6e3ad6    3
fd8ee8db    3
fd97b5b2    0
fd997268    3
fd99b7b3    3
fdd082f9    0
fddf4b1e    3
fdf4eb95    3
fe191c4a    2
fe1a1d3f    3
fe488283    1
fe4a63a7    3
fe4d880a    0
fe5f0699    0
fe73bf4b    1
fe769df4    1
fe9f9b60    0
fed331e8    3
ff00d909    3
ff107709    2
ff24ea49    3
ff3e1e35    0
ff7fb595    3
ff882868    3
ff90db99    3
ff9305d7    0
ff9715db    3
ffc90c32    3
ffd2871d    3
ffeb0b1b    1
Name: accuracy_group, Length: 3614, dtype: int64

In [None]:
X = final_train.drop(columns='accuracy_group').values
y = final_train['accuracy_group'].values

In [None]:

params = {
    'learning_rate': 0.01,
    'bagging_fraction': 0.9,
    'feature_fraction': 0.2,
    'max_depth': 3,  # LightGBM's tree-depth parameter ('max_height' is not recognized)
    'lambda_l1': 10,
    'lambda_l2': 10,
    'metric': 'multiclass',
    'objective': 'multiclass',
    'num_classes': 4,
    'random_state': 2019
}

models = cv_train(X, y, cv=20, params=params, num_boost_round=1000,
                  early_stopping_rounds=100, verbose_eval=500)

# Submission

In [None]:
# Use a NumPy array, matching how X was built for training
X_test = final_test.drop(columns=['accuracy_group']).values
test_pred = cv_predict(models=models, X=X_test).argmax(axis=1)

final_test['accuracy_group'] = test_pred
final_test[['accuracy_group']].to_csv('submission.csv')

# Visualize Model

In [None]:
for model in models:
    lgb.plot_importance(model, max_num_features=15, height=0.3)

In [None]:
plt.hist(test_pred)