## About

This is the kernel that was used in the Data Science Bowl 2019 competition. It finished in the top 2% of participants, in 35th place out of 3,497.

The [competition's](https://www.kaggle.com/c/data-science-bowl-2019) goal was to predict the accuracy group of children in assessments taken in a game-like iOS app. The app is designed to teach children aged 3-5 basic mathematical concepts (e.g. sorting objects). It offers a series of activities that children may undertake in any order as they travel across a virtual island. The activities consist of clips, activities, games and assessments. As children take part, the app logs every action that occurs within each activity. Along the way children attempt one or more assessments, which may take them one, two, three or more attempts to pass (if they pass at all). Based on the number of attempts taken to pass an assessment, children are assigned to one of four accuracy groups. The goal is then to use the event log to predict the accuracy group of a child's last assessment in a given time period.
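
To make the target concrete, here is a minimal sketch of how the attempts in one assessment map to an accuracy group. It mirrors the rule used later in this kernel's feature engineering code (accuracy of 1 gives group 3, 0.5 gives group 2, 0 gives group 0, anything else gives group 1). The function name `accuracy_group` is mine and is used for illustration only.

```python
def accuracy_group(correct_attempts, incorrect_attempts):
    """Map the attempt counts of one assessment to an accuracy group (0-3)."""
    total = correct_attempts + incorrect_attempts
    accuracy = correct_attempts / total if total != 0 else 0
    if accuracy == 0:
        return 0  # never solved
    if accuracy == 1:
        return 3  # solved on the first attempt
    if accuracy == 0.5:
        return 2  # solved on the second attempt
    return 1      # solved after three or more attempts


print(accuracy_group(1, 0))  # 3
print(accuracy_group(1, 1))  # 2
print(accuracy_group(1, 2))  # 1
print(accuracy_group(0, 3))  # 0
```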

**Key ideas:**

**Pre-processing**
* Extensive feature engineering: more than 1000 features generated
* Using data points from the test set in addition to the train set to train models
* Normalisation of features in train and test using the cube root
* Exclusion of highly correlated features using the full test set, as opposed to the truncated version used for actual scoring (see the pre-processing section for details)
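
The exclusion of highly correlated features can be sketched roughly as below. This is an illustration under assumptions, not the kernel's actual code: the helper names `pearson` and `drop_correlated`, the greedy keep-the-first strategy and the 0.99 threshold are all hypothetical. It is written in plain Python so it runs on its own; on the real data frames pandas' `DataFrame.corr` would be the natural tool.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.99):
    """Keep a feature only if it is not highly correlated with one already kept.

    columns: dict mapping feature name -> list of values.
    threshold: hypothetical cut-off on the absolute correlation.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


features = {
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with 'a', so it is dropped
    'c': [1.0, 0.0, 1.0, 0.0],
}
kept = drop_correlated(features)
print(kept)  # ['a', 'c']
```

Which rows the correlations are measured on matters: the more rows are used, the more stable the estimate of which features are redundant.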

**Modelling**
* Higher-complexity models: more parameters in neural networks and deeper decision trees

**Post-processing**
* No Nelder-Mead optimisation. Instead, rounding based on both the distribution of the response variable in the train set and the actual model predictions (see the post-processing section for details).


**Acknowledgements**
This kernel is a long-evolved version of Bruno Aquino's public kernel.

## Pre-processing and modelling config

MIX_UP - I experimented with mixing samples and their labels in different proportions to obtain new pseudo training samples. This did not help in the short time I spent on it, so I turned it off.
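
As a rough illustration of the idea, the sketch below blends two samples and their labels in a given proportion. The function name `mix_up` and the fixed proportion `lam` are assumptions made for this example; the way the proportions were actually chosen in my experiments is not shown here.

```python
def mix_up(x_a, y_a, x_b, y_b, lam):
    """Blend two samples: a share `lam` of sample a and `1 - lam` of sample b."""
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = lam * y_a + (1 - lam) * y_b
    return x_mix, y_mix


# sample a has label 3, sample b has label 1; take 75% of a and 25% of b
x_mix, y_mix = mix_up([1.0, 4.0], 3, [3.0, 0.0], 1, lam=0.75)
print(x_mix, y_mix)  # [1.5, 3.0] 2.5
```

Note that the mixed label is generally not an integer. This is why the evaluation function in this kernel keeps only samples with integer labels when `MIX_UP` is on.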

USE_FORGETTING_CURVE_DISCOUNT - each event recorded while a child interacts with the app carries a timestamp. I used these timestamps to measure how much time passed between events, and then discounted older data points where a lot of time had passed between relevant events. The idea comes from psychological studies showing that people tend to forget information quickly if they do not use it: the recall rate, plotted against the time since initial learning, is best described by an exponential curve.

MEMORY_STABILITY - the parameter of the exponential curve mentioned above. The larger it is, the better the hypothetical retention in a child's memory. In its current version the code has to be re-run each time this parameter changes, and a re-run takes a long time (~3-4 hours), so I did not estimate the parameter iteratively but hand-picked it after some trial and error.
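
The curve has the form exp(-t / S), where S is the memory stability and t is the elapsed time in the same units as S. The sketch below shows how such a retention rate scales accumulated counters. The helper names `retention` and `discount_counters` and the sample counters are made up for the example.

```python
import math


def retention(t, stability):
    """Exponential forgetting curve: share of a past value retained after time t."""
    return math.exp(-t / stability)


def discount_counters(store, rate):
    """Scale every accumulated counter by the retention rate."""
    return {key: rate * value for key, value in store.items()}


stability = 100000                    # same value as MEMORY_STABILITY below
half_life = stability * math.log(2)   # elapsed time at which retention drops to 0.5

counts = {'clip': 4.0, 'game': 2.0}
discounted = discount_counters(counts, retention(half_life, stability))
print(round(half_life, 1))
print({k: round(v, 6) for k, v in discounted.items()})
```

A larger stability stretches the curve: doubling S doubles the time needed for the counters to lose half of their weight.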

USE_DIFFERENT_BOUNDS - as described in the last section of this kernel, I used the distribution of the response variable in the training set to round my models' predictions (the response variable had to be an integer in the 0-3 range). This parameter let me easily experiment with modified distributions, such as the distribution of the response variable over only certain training samples.
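
A simplified sketch of distribution-based rounding follows. The cut points are placed on the sorted predictions so that the share of predictions in each group matches the share of that group in the training labels. The kernel's own eval function uses `np.percentile` for this; the version below uses a plain nearest-rank cut so that it runs without dependencies, and the names `distribution_bounds` and `classify` are illustrative.

```python
from collections import Counter


def distribution_bounds(train_labels, predictions):
    """Cut points on the predictions matching the cumulative shares of groups 0-2."""
    share = Counter(train_labels)
    ordered = sorted(predictions)
    bounds, cumulative = [], 0
    for group in range(3):
        cumulative += share[group]
        # index of the last prediction that should still fall into this group
        idx = cumulative * len(ordered) // len(train_labels) - 1
        bounds.append(ordered[idx] if idx >= 0 else float('-inf'))
    return bounds


def classify(x, bounds):
    """Return the first group whose upper bound is not exceeded, else group 3."""
    for group, bound in enumerate(bounds):
        if x <= bound:
            return group
    return 3


train_labels = [0, 0, 1, 2, 3, 3, 3, 3]  # 25% / 12.5% / 12.5% / 50%
predictions = [2.4, 0.2, 1.8, 1.1, 2.6, 0.9, 1.4, 2.1]

bounds = distribution_bounds(train_labels, predictions)
rounded = [classify(p, bounds) for p in predictions]
print(bounds)   # [0.9, 1.1, 1.4]
print(rounded)  # [3, 0, 3, 1, 3, 0, 2, 3]
```

Swapping `train_labels` for a different reference sample (for example, only certain training rows) changes the cut points without touching the model, which is what `USE_DIFFERENT_BOUNDS` made easy to try.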

RETURN_PREDS_LO - we have to predict the accuracy group of the last assessment that a child undertakes, so assessments can be categorised as last and not last. Last assessments were always different, because by then children had accumulated more experience than they had at any earlier assessment. I therefore thought it might make sense to train my models only on the last assessment in each sequence. This, however, produced very small training sets and did not work well. In retrospect, any assessment can be the last in a sequence, so conceptually it should not matter.

POWER - the power used to normalise features, many of which were highly skewed. The cube root seemed to work best after some trial and error.
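
The effect of the power transform on a skewed feature can be seen in the small sketch below. The sign handling is an assumption added for the example so that negative values stay real; `signed_power` is a hypothetical helper, not a function from this kernel.

```python
import math

POWER = 1 / 3


def signed_power(x, power=POWER):
    """Raise |x| to `power` and restore the sign, so negative inputs stay real."""
    return math.copysign(abs(x) ** power, x)


skewed = [0.0, 1.0, 8.0, 1000.0, -1.0]
normalised = [round(signed_power(v), 6) for v in skewed]
print(normalised)  # [0.0, 1.0, 2.0, 10.0, -1.0]
```

The largest value shrinks from 1000 to 10 while small values barely move, which compresses the long right tail.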

In [None]:
MIX_UP = False

USE_FORGETTING_CURVE_DISCOUNT = True
MEMORY_STABILITY = 100000

USE_DIFFERENT_BOUNDS = False
RETURN_PREDS_LO = False

POWER = 1/3

## Import libraries & set configurations

### Imports and library configs

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor
from xgboost import plot_importance
from catboost import CatBoostRegressor
import tensorflow as tf
import lightgbm as lgb
from sklearn.metrics import cohen_kappa_score, mean_squared_error, confusion_matrix
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

import os
from time import time
from collections import Counter
from scipy import stats
import gc
import json
from functools import partial
import pickle
from tqdm import tqdm
import itertools
import matplotlib

In [None]:
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

COLOR = 'black' # used 'white' when editing in interactive mode with dark theme ON
matplotlib.rcParams['text.color'] = COLOR
matplotlib.rcParams['axes.labelcolor'] = COLOR
matplotlib.rcParams['xtick.color'] = COLOR
matplotlib.rcParams['ytick.color'] = COLOR

## Data pre-processing and modelling support functions

### Support functions

In [None]:
def eval_qwk_lgb_regr(y_true, y_pred, is_last=False):
    """
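    Fast quadratic weighted kappa (QWK) eval function for LightGBM regressors.
    Rounds predictions using the accuracy_group distribution in reduce_train.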
    """
    if MIX_UP:
        idx_int = y_true % 1 == 0
        y_true = y_true[idx_int]
        y_pred = y_pred[idx_int]
    
    dist_source = reduce_train.loc[reduce_train.is_last==1, 'accuracy_group'].reset_index(drop=True) if is_last else reduce_train['accuracy_group']
    dist = Counter(dist_source)
    for k in dist:
        dist[k] /= len(dist_source)
    dist_source.hist()
    
    acum = 0
    bound = {}
    for i in range(3):
        acum += dist[i]
        bound[i] = np.percentile(y_pred, acum * 100)

    def classify(x):
        if x <= bound[0]:
            return 0
        elif x <= bound[1]:
            return 1
        elif x <= bound[2]:
            return 2
        else:
            return 3

    y_pred = np.array(list(map(classify, y_pred))).reshape(y_true.shape)

    return 'cappa', cohen_kappa_score(y_true, y_pred, weights='quadratic'), True

In [None]:
def cohenkappa(ypred, y):
    # custom eval metric: y carries the true labels, ypred the flattened per-class scores
    y = y.get_label().astype("int")
    ypred = ypred.reshape((4, -1)).argmax(axis = 0)
    loss = cohen_kappa_score(y, ypred, weights = 'quadratic')
    return "cappa", loss, True

def get_event_data(x, key, prefix):
    x = json.loads(x)
    try: res = f'{prefix}_{x[key]}'
    except: res = f'{prefix}_None'
    return res

def get_correct(x):
    x = json.loads(x)
    try: res = 1 if x['correct'] else 0
    except: res = -1
    return res

In [None]:
def read_data():
    # loads the competition files plus the external media sequence table
    print('Reading train.csv file....')
    train = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv')
    print('train.csv file has {} rows and {} columns'.format(train.shape[0], train.shape[1]))

    print('Reading test.csv file....')
    test = pd.read_csv('/kaggle/input/data-science-bowl-2019/test.csv')
    print('test.csv file has {} rows and {} columns'.format(test.shape[0], test.shape[1]))

    print('Reading train_labels.csv file....')
    train_labels = pd.read_csv('/kaggle/input/data-science-bowl-2019/train_labels.csv')
    print('train_labels.csv file has {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))

    print('Reading specs.csv file....')
    specs = pd.read_csv('/kaggle/input/data-science-bowl-2019/specs.csv')
    print('specs.csv file has {} rows and {} columns'.format(specs.shape[0], specs.shape[1]))

    print('Reading sample_submission.csv file....')
    sample_submission = pd.read_csv('/kaggle/input/data-science-bowl-2019/sample_submission.csv')
    print('sample_submission.csv file has {} rows and {} columns'.format(sample_submission.shape[0], sample_submission.shape[1]))
    
    media_durations = pd.read_csv('/kaggle/input/basedon460/media_sequence.csv')
    return train, test, train_labels, specs, sample_submission, media_durations

In [None]:
def encode_categoricals(train, test, train_labels):
    # concat title and event_code
    train['title_event_code'] = list(map(lambda x, y: str(x) + '_' + str(y), train['title'], train['event_code']))
    test['title_event_code'] = list(map(lambda x, y: str(x) + '_' + str(y), test['title'], test['event_code']))
    
    # unique lists
    unique_title_event_codes = list(set(train["title_event_code"].unique()).union(test["title_event_code"].unique()))
    unique_titles = list(set(train['title'].unique()).union(set(test['title'].unique())))
    unique_event_codes = list(set(train['event_code'].unique()).union(set(test['event_code'].unique())))
    unique_event_ids = list(set(train['event_id'].unique()).union(set(test['event_id'].unique())))
    unique_worlds = list(set(train['world'].unique()).union(set(test['world'].unique())))
    unique_assess_titles = list(set(train[train['type'] == 'Assessment']['title'].value_counts().index).union(set(test[test['type'] == 'Assessment']['title'].value_counts().index)))
    unique_game_titles = list(set(train[train['type'] == 'Game']['title'].value_counts().index).union(set(test[test['type'] == 'Game']['title'].value_counts().index)))
    
    # maps
    titles_map = dict(zip(unique_titles, np.arange(len(unique_titles))))
    titles_inverse_map = dict(zip(np.arange(len(unique_titles)), unique_titles))
    worlds_map = dict(zip(unique_worlds, np.arange(len(unique_worlds))))
    
    # apply maps
    train['title'] = train['title'].map(titles_map)
    test['title'] = test['title'].map(titles_map)
    train['world'] = train['world'].map(worlds_map)
    test['world'] = test['world'].map(worlds_map)
    
    train_labels['title'] = train_labels['title'].map(titles_map)
    
    # win codes: the event_code that marks an assessment attempt for each title (4100 by default, 4110 for Bird Measurer)
    win_code = dict(zip(titles_map.values(), (4100*np.ones(len(titles_map))).astype('int')))
    win_code[titles_map['Bird Measurer (Assessment)']] = 4110
    
    # format dates
    train['timestamp'] = pd.to_datetime(train['timestamp'])
    test['timestamp'] = pd.to_datetime(test['timestamp'])
    
    train['game_cor'] = train.loc[:,'event_data'].apply(get_correct)
    test['game_cor'] = test.loc[:,'event_data'].apply(get_correct)
    
    return train, test, train_labels, win_code, unique_titles, unique_event_codes, titles_inverse_map, unique_assess_titles, unique_event_ids, unique_title_event_codes, unique_game_titles

In [None]:
# counts how many actions were made so far for each value of `col` (e.g. each event_code); disc_rate and cur_rate are unused
def update_counters(counter: dict, col: str, session_group, disc_rate, cur_rate):
    num_of_session_count = Counter(session_group[col])
    for k in num_of_session_count.keys():
        x = k
        if col == 'title':
            x = titles_inverse_map[k]
        counter[x] = counter[x] + num_of_session_count[k]
    return counter

def discount(store_dict, disc_rate):
    for key in store_dict:
        store_dict[key] = disc_rate*store_dict[key]
    return store_dict

def get_dr(t):
    DR = np.exp(-(t/MEMORY_STABILITY)) if USE_FORGETTING_CURVE_DISCOUNT else 1
    return DR

def get_basic_stats(arr, prefix):
    result = {}
    result[f'{prefix}_mean'] = np.mean(arr) if len(arr) != 0 else 0
    result[f'{prefix}_med'] = np.median(arr) if len(arr) != 0 else 0
    result[f'{prefix}_mode'] = stats.mode(arr)[0][0] if len(arr) != 0 else 0
    result[f'{prefix}_sd'] = np.std(arr) if len(arr) != 0 else 0
    result[f'{prefix}_sum'] = np.sum(arr) if len(arr) != 0 else 0
    result[f'{prefix}_max'] = np.max(arr) if len(arr) != 0 else 0
    result[f'{prefix}_min'] = np.min(arr) if len(arr) != 0 else 0
    return result

In [None]:
aei = {}
aemt = {}

### Example of exponential discounting based on time-off between assessment at t and t-1

In [None]:
days = 14
t = np.arange(0,(60*60*24*days+1))
plt.plot(t,np.exp(-(t/400000)))
plt.plot(t,np.exp(-(t/200000)))
plt.plot(t,np.exp(-(t/100000)))
plt.xticks(np.arange(0,(60*60*24*days+1),60*60*24), labels = np.arange(0,days+1))
plt.legend(labels=['S=400k','S=200k','S=100k'], facecolor='Black')
plt.show()

### Key feature engineering function

In [None]:
def get_data(user_sample, disc_rate, cur_rate, is_test=False):
    """ 
    Extracts prediction items and features from the event series of a single user.
  
    This function contains most of the feature engineering code for this kernel. It started as a smaller function but grew over time.
  
    Parameters: 
    user_sample (pd.DataFrame): pandas dataframe containing the events of a single user
    disc_rate (float): deprecated, was used for constant-rate discounting of older data points (aka momentum)
    cur_rate (float): deprecated, was used for discounting present features
    is_test (bool): indicates that the dataset is the test set, which is processed slightly differently. This is because, to prevent data leaks, the test set contains no events from the last assessment other than its first one.
  
    Returns: 
    list: list of dictionaries, one per assessment, holding the features engineered from the user's events
  
    """
    all_assessments = []
    durations = []
    durations_all = []
    durations_preass = [] 
    
    last_session_time_sec = 0
    time_first_activity = float(user_sample['timestamp'].values[0])
    
    accuracy_groups = {0:0, 1:0, 2:0, 3:0}
    accumulated_accuracy_group = 0
    accumulated_accuracy = 0
    accumulated_correct_attempts = 0 
    accumulated_incorrect_attempts = 0
    
    user_activities_count = {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
    user_activities_time = {'Clip_time':0, 'Activity_time': 0, 'Assessment_time': 0, 'Game_time':0}
    accumulated_actions = 0 
    last_activity = 0
    counter = 0
    
    last_accuracy_title = {f'acc_{i}': -1 for i in unique_assess_titles}
    time_spent_each_title = {f'{i}_time': 0 for i in unique_titles}
        
    event_code_count = {i: 0 for i in unique_event_codes}
    event_id_count = {i: 0 for i in unique_event_ids}
    title_count = {i: 0 for i in unique_titles} 
    title_event_code_count = {i: 0 for i in unique_title_event_codes}
    
    accumulated_game_correctness = {f'game_{i}_cor':0 for i in unique_game_titles}
    accumulated_game_incorrectness = {f'game_{i}_incor':0 for i in unique_game_titles}
    accumulated_incomplete_games = {f'game_{i}_incomplete':0 for i in unique_game_titles}
    accumulated_per_ass_corr = {f'{i}_cor': 0 for i in unique_assess_titles}
    accumulated_per_ass_incorr = {f'{i}_incor': 0 for i in unique_assess_titles}
    
    last_ass_timestamp = 0
    prev_session_timestamp = 0
    pre_ass_between_sessions_time = []
    
    first_session_time_stamp = 0

    accumulated_event_identifiers = {i:0 for i in aei}
    accumulated_event_media_types = {i:0 for i in aemt}
    
    incomplete_assessments = {f'{i}_incomplete': 0 for i in unique_assess_titles}
    
    session_count = 0
    session_count_discounted = 0
    prev_ses_last_timestamp = 0
    prev_ses_title = 0
    orderliness = 0
    orderliness_neg = 0
    orderliness_disc = 0
    prev_ses_count = 0 
    
    type_speed_accum = {'Game_speed_accum':[], 'Activity_speed_accum':[]}
    type_speed_preass = {'Game_speed_preass':[], 'Activity_speed_preass':[]}
    
    time_offs_all = []
    time_off_before_ass_all = []
    intra_ses_durations_latest = []
    intra_ses_durations_all = []
    intra_ses_durations_preass = []
    
    # --- SESSIONS LOOP START --- #
    for session_id, session_group in user_sample.groupby('game_session', sort=False):
        session_type = session_group['type'].iloc[0]
        session_installation_id = session_group['installation_id'].iloc[-1]
        session_title = session_group['title'].iloc[0]
        session_title_text = titles_inverse_map[session_title]
        session_timestamp = session_group['timestamp'].iloc[0]
        time_spent = int(session_group["game_time"].iloc[-1] / 1000)
        
        first_session_time_stamp = session_timestamp if first_session_time_stamp == 0 else first_session_time_stamp
        
        t = np.round(((session_timestamp - prev_ses_last_timestamp).seconds / 60), 3) if prev_ses_last_timestamp != 0 else 0
        DR = get_dr(t)
        time_offs_all.append(t)
        
        # discounting
        event_code_count = discount(event_code_count, DR)
        event_id_count = discount(event_id_count, DR)
        title_count = discount(title_count, DR)
        title_event_code_count = discount(title_event_code_count, DR)
        time_spent_each_title = discount(time_spent_each_title, DR)
        accumulated_event_identifiers = discount(accumulated_event_identifiers, DR)
        accumulated_event_media_types = discount(accumulated_event_media_types, DR)
        user_activities_count = discount(user_activities_count, DR)
        user_activities_time = discount(user_activities_time, DR)
        accumulated_game_correctness = discount(accumulated_game_correctness, DR)
        accumulated_game_incorrectness = discount(accumulated_game_incorrectness, DR)
        accumulated_incomplete_games = discount(accumulated_incomplete_games, DR)
        accumulated_per_ass_corr = discount(accumulated_per_ass_corr, DR)
        accumulated_per_ass_incorr = discount(accumulated_per_ass_incorr, DR)
        accuracy_groups = discount(accuracy_groups, DR)
        incomplete_assessments = discount(incomplete_assessments, DR)
        
        accumulated_accuracy *= DR
        accumulated_correct_attempts *= DR
        accumulated_incorrect_attempts  *= DR
        accumulated_accuracy_group *= DR
        accumulated_actions *= DR
        session_count_discounted *= DR
        orderliness_disc *= DR
        
        # updating orderliness
        if (prev_ses_title == media_durations.loc[media_durations.title==session_title_text, 'prev_title'].values[0]) or (media_durations.loc[media_durations.title==session_title_text, 'prev_title'].values[0] == 0):
            orderliness += 1
            orderliness_neg += 1
            orderliness_disc += 1
        else:
            orderliness_neg -= 1
            
        orderliness_prop = orderliness / (session_count + 1)
        orderliness_prop_neg = orderliness_neg / (session_count + 1)
        orderliness_prop_disc = orderliness_disc / session_count_discounted
        
        if session_type == 'Game': #DO NOT BLINDLY COMBINE THIS ONE WITH IF BELOW! WOULD BREAK LOGIC
            game_cor = len(session_group.loc[session_group.game_cor == 1, 'game_cor'])
            accumulated_game_correctness[f'game_{titles_inverse_map[session_title]}_cor'] = accumulated_game_correctness[f'game_{titles_inverse_map[session_title]}_cor'] + game_cor
            game_incor = len(session_group.loc[session_group.game_cor == 0, 'game_cor'])
            accumulated_game_incorrectness[f'game_{titles_inverse_map[session_title]}_incor'] = accumulated_game_incorrectness[f'game_{titles_inverse_map[session_title]}_incor'] + game_incor
            if game_cor + game_incor == 0:
                accumulated_incomplete_games[f'game_{titles_inverse_map[session_title]}_incomplete'] = accumulated_incomplete_games[f'game_{titles_inverse_map[session_title]}_incomplete'] + 1
        if session_type == 'Game' or session_type == 'Activity':
            ts = session_group.shape[0] / time_spent if time_spent != 0 else 0
            type_speed_accum[f'{session_type}_speed_accum'].append(ts)
            type_speed_preass[f'{session_type}_speed_preass'].append(ts)
        
        if session_type != 'Assessment':
            time_spent_each_title[f'{titles_inverse_map[session_title]}_time'] = time_spent_each_title[f'{titles_inverse_map[session_title]}_time'] + time_spent
            durations_all.append(time_spent)
            durations_preass.append(time_spent)
            intra_ses_durations_latest = ((session_group.loc[:,'timestamp'].iloc[1:].values - session_group.loc[:,'timestamp'].iloc[0:-1].values) / np.timedelta64(1, 's')).tolist()
            intra_ses_durations_all += intra_ses_durations_latest
            intra_ses_durations_preass += intra_ses_durations_latest
            
            for i in session_group['event_data']:
                identifier = get_event_data(i, key='identifier', prefix=session_type)
                media_type = get_event_data(i, key='media_type', prefix=session_type)
                try:
                    accumulated_event_identifiers[identifier] = accumulated_event_identifiers[identifier] + 1
                except:
                    if is_test != True: accumulated_event_identifiers[identifier] = 0
                try:
                    accumulated_event_media_types[media_type] = accumulated_event_media_types[media_type] + 1
                except:
                    if is_test != True: accumulated_event_media_types[media_type] = 0
            
            # updating counters without discounting as between assessments
            event_code_count = update_counters(event_code_count, "event_code", session_group, 1, 1)
            event_id_count = update_counters(event_id_count, "event_id", session_group, 1, 1)
            title_count = update_counters(title_count, 'title', session_group, 1, 1)
            title_event_code_count = update_counters(title_event_code_count, 'title_event_code', session_group, 1, 1)
            
            time_diff = np.round(((session_timestamp - prev_session_timestamp).seconds / 60), 3) if prev_session_timestamp != 0 else 0
            prev_session_timestamp = session_timestamp
            pre_ass_between_sessions_time.append(time_diff)

            # counts how many actions the player has done so far, used in the feature of the same name
            accumulated_actions = accumulated_actions + len(session_group)

            user_activities_count[session_type] = user_activities_count[session_type] + 1
            user_activities_time[f'{session_type}_time'] = user_activities_time[f'{session_type}_time'] + time_spent
            last_activity = session_type 
            
            session_count += 1
            session_count_discounted += 1
        
        elif (session_type == 'Assessment') & (is_test or len(session_group)>1):
            all_attempts = session_group.query(f'event_code == {win_code[session_title]}')
            
            session_correct_attempts = all_attempts['event_data'].str.contains('true').sum()
            session_incorrect_attempts = all_attempts['event_data'].str.contains('false').sum()
            
            # adding features
            if (session_correct_attempts + session_incorrect_attempts == 0) & (is_test==False):
                incomplete_assessments[f'{titles_inverse_map[session_title]}_incomplete'] += 1
                continue
            elif (is_test==True):
                if (session_correct_attempts + session_incorrect_attempts == 0) & (session_id != last_ass_session.loc[last_ass_session.installation_id==session_installation_id,'game_session'].values[0]):
                    incomplete_assessments[f'{titles_inverse_map[session_title]}_incomplete'] += 1
                    continue
            
            time_since_last_ass = np.round(((session_timestamp - last_ass_timestamp).seconds / 60), 3) if last_ass_timestamp != 0 else 0
            last_ass_timestamp = session_timestamp
        
            features = user_activities_count.copy()
            features['time_since_last_ass'] = time_since_last_ass
            features.update(get_basic_stats(type_speed_accum['Activity_speed_accum'], 'Activity_speed_accum'))
            features.update(get_basic_stats(type_speed_accum['Game_speed_accum'], 'Game_speed_accum'))
            features.update(get_basic_stats(type_speed_preass['Activity_speed_preass'], 'Activity_speed_preass'))
            features.update(get_basic_stats(type_speed_preass['Game_speed_preass'], 'Game_speed_preass'))
            type_speed_preass['Activity_speed_preass'], type_speed_preass['Game_speed_preass'] = [], []
            features.update(last_accuracy_title.copy())
            features.update(event_code_count.copy())
            features.update(event_id_count.copy())
            features.update(title_count.copy())
            features.update(title_event_code_count.copy())
            features.update(time_spent_each_title.copy())
            features.update(accuracy_groups.copy())
            features.update(accumulated_game_correctness.copy())
            features.update(accumulated_game_incorrectness.copy())
            features.update(accumulated_incomplete_games.copy())
            features.update(get_basic_stats([accumulated_game_correctness[i] for i in accumulated_game_correctness], 'game_cor'))
            features.update(get_basic_stats([accumulated_game_incorrectness[i] for i in accumulated_game_incorrectness], 'game_incor'))
            features.update(get_basic_stats([accumulated_incomplete_games[i] for i in accumulated_incomplete_games], 'game_incomp'))
            features.update(accumulated_per_ass_corr.copy())
            features.update(accumulated_per_ass_incorr.copy())
            features.update(user_activities_time.copy())
            features.update(accumulated_event_identifiers.copy())
            features.update(accumulated_event_media_types.copy())
            features.update(incomplete_assessments.copy())
            features['sum_ass_incomp'] = np.sum([incomplete_assessments[i] for i in incomplete_assessments])
            features['accumulated_session_count'] = session_count
            features['discounted_session_count'] = session_count_discounted
            features['orderliness'] = orderliness 
            features['orderliness_prop'] = orderliness_prop
            features['orderliness_prop_neg'] = orderliness_prop_neg
            features['orderliness_prop_disc'] = orderliness_prop_disc
            time_off_before_ass = np.round(((prev_ses_last_timestamp - session_timestamp).seconds / 60), 3) if prev_ses_last_timestamp != 0 else 1000
            features['time_off_before_ass'] = time_off_before_ass
            time_off_before_ass_all.append(time_off_before_ass)
            features.update(get_basic_stats(time_off_before_ass_all, 'time_off_before_ass_all'))
            features.update(get_basic_stats(time_offs_all, 'time_offs_sum'))
            features.update(get_basic_stats(durations_all, 'durations_all'))
            features.update(get_basic_stats(durations_preass, 'durations_preass'))
            durations_preass = []
            features.update(get_basic_stats(intra_ses_durations_latest, 'intra_ses_durations_latest'))
            features.update(get_basic_stats(intra_ses_durations_all, 'intra_ses_durations_all'))
            features.update(get_basic_stats(intra_ses_durations_preass, 'intra_ses_durations_preass'))
            intra_ses_durations_preass = []
            
            variety_features = [('var_event_code', event_code_count),
                              ('var_event_id', event_id_count),
                               ('var_title', title_count),
                               ('var_title_event_code', title_event_code_count)]
            
            for name, dict_counts in variety_features:
                arr = np.array(list(dict_counts.values()))
                features[name] = np.count_nonzero(arr)
            
            features[f'{titles_inverse_map[session_title]}_cor'] = accumulated_per_ass_corr[f'{titles_inverse_map[session_title]}_cor']
            accumulated_per_ass_corr[f'{titles_inverse_map[session_title]}_cor'] = features[f'{titles_inverse_map[session_title]}_cor'] + session_correct_attempts
            features[f'{titles_inverse_map[session_title]}_incor'] = accumulated_per_ass_incorr[f'{titles_inverse_map[session_title]}_incor']
            accumulated_per_ass_incorr[f'{titles_inverse_map[session_title]}_incor'] = features[f'{titles_inverse_map[session_title]}_incor'] + session_incorrect_attempts
            
            features['installation_id'] = session_installation_id
            features['session_title'] = session_title
            
            features['accumulated_correct_attempts'] = accumulated_correct_attempts
            features['accumulated_incorrect_attempts'] = accumulated_incorrect_attempts
            accumulated_correct_attempts = accumulated_correct_attempts + session_correct_attempts 
            accumulated_incorrect_attempts = accumulated_incorrect_attempts + session_incorrect_attempts
            
            features.update(get_basic_stats(durations, 'durations'))
            durations.append((session_group.iloc[-1, 2] - session_group.iloc[0, 2] ).seconds)
            
            features['accumulated_accuracy'] = accumulated_accuracy/counter if counter > 0 else 0
            accuracy = session_correct_attempts/(session_correct_attempts+session_incorrect_attempts) if (session_correct_attempts+session_incorrect_attempts) != 0 else 0
            accumulated_accuracy = accumulated_accuracy + accuracy
            last_accuracy_title['acc_' + session_title_text] = accuracy
            
            if accuracy == 0: features['accuracy_group'] = 0
            elif accuracy == 1: features['accuracy_group'] = 3
            elif accuracy == 0.5: features['accuracy_group'] = 2
            else: features['accuracy_group'] = 1
                
            accuracy_groups[features['accuracy_group']] = accuracy_groups[features['accuracy_group']] + 1
            
            features['accumulated_accuracy_group'] = accumulated_accuracy_group/counter if counter > 0 else 0
            accumulated_accuracy_group = accumulated_accuracy_group + features['accuracy_group']
            
            features['accumulated_actions'] = accumulated_actions
            
            features['pre_ass_ses_count'] = session_count - prev_ses_count
            
            features.update(get_basic_stats(pre_ass_between_sessions_time, 'pre_ass_between_sessions_time'))
            
            pre_ass_between_sessions_time = []
            prev_session_timestamp = 0
            
            features['time_since_installation'] = np.round(((session_timestamp - first_session_time_stamp).seconds / 60), 3)
            
            for i in session_group['event_data']:
                identifier = get_event_data(i, key='identifier', prefix='Ass')
                media_type = get_event_data(i, key='media_type', prefix='Ass')
                try:
                    accumulated_event_identifiers[identifier] = accumulated_event_identifiers[identifier] + 1
                except:
                    if is_test != True: accumulated_event_identifiers[identifier] = 0
                try:
                    accumulated_event_media_types[media_type] = accumulated_event_media_types[media_type] + 1
                except:
                    if is_test != True: accumulated_event_media_types[media_type] = 0
            
            # updating counters           
            event_code_count = update_counters(event_code_count, "event_code", session_group, 1, 1)
            event_id_count = update_counters(event_id_count, "event_id", session_group, 1, 1)
            title_count = update_counters(title_count, 'title', session_group, 1, 1)
            title_event_code_count = update_counters(title_event_code_count, 'title_event_code', session_group, 1, 1) 
            
            user_activities_count[session_type] = user_activities_count[session_type] + 1
            user_activities_time[f'{session_type}_time'] = user_activities_time[f'{session_type}_time'] + time_spent
            last_activitiy = session_type 
            
            counter += 1
            session_count += 1
            session_count_discounted += 1
            prev_ses_count = session_count
            
            all_assessments.append(features)
        
        prev_ses_last_timestamp = session_group['timestamp'].iloc[-1]
        prev_ses_title = session_title_text
    
    if is_test != True:
        aei.update(accumulated_event_identifiers)
        aemt.update(accumulated_event_media_types)
            
    # --- SESSIONS LOOP END --- # 
    return all_assessments

In [None]:
def parse_all_installation_ids(df, **kwargs):
    compiled_df = []
    for _, user_sample in tqdm(df.groupby('installation_id', sort = False), total = df.installation_id.nunique()):
        compiled_df += get_data(user_sample, **kwargs)
    reduced_df = pd.DataFrame(compiled_df)
    return reduced_df

In [None]:
def get_installation_aggr_features(reduce_train, reduce_test):
    for df in [reduce_train, reduce_test]:
        df['installation_session_count'] = df.groupby(['installation_id'])['Clip'].transform('count')
        df['installation_duration_mean'] = df.groupby(['installation_id'])['durations_mean'].transform('mean')
        df['installation_duration_std'] = df.groupby(['installation_id'])['durations_mean'].transform('std')
        df['installation_title_nunique'] = df.groupby(['installation_id'])['session_title'].transform('nunique')
        
        df['sum_event_code_count'] = df.loc[:,[2050, 4100, 4230, 5000, 4235, 2060, 4110, 5010, 2070, 2075, 2080, 2081, 2083, 3110, 4010, 3120, 3121, 4020, 4021, 
                                        4022, 4025, 4030, 4031, 3010, 4035, 4040, 3020, 3021, 4045, 2000, 4050, 2010, 2020, 4070, 2025, 2030, 4080, 2035, 
                                        2040, 4090, 4220, 4095]].sum(axis = 1)
        
        df['installation_event_code_count_mean'] = df.groupby(['installation_id'])['sum_event_code_count'].transform('mean')
        df['installation_event_code_count_std'] = df.groupby(['installation_id'])['sum_event_code_count'].transform('std')
        
    return reduce_train, reduce_test
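
As a rough illustration of what `groupby(...).transform(...)` does in the function above, here is a minimal sketch on a toy frame. The values are made up; only the column names `installation_id` and `durations_mean` come from the kernel. Each per-installation statistic is broadcast back to every row of that installation.

```python
import pandas as pd

# Toy frame: one row per assessment, hypothetical values.
toy = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b"],
    "durations_mean": [10.0, 20.0, 30.0, 5.0],
})

grouped = toy.groupby("installation_id")["durations_mean"]
toy["installation_duration_mean"] = grouped.transform("mean")
toy["installation_duration_std"] = grouped.transform("std")
toy["installation_session_count"] = grouped.transform("count")

print(toy)
```

Note that the standard deviation is NaN for installations with a single row; in this kernel such values are later replaced with 0 by `fillna`.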

## Models' code

In [None]:
class Base_Model(object):
    """
    Parent model class, contains functions universal for all models used. Initial implementation credits to Bruno Aquino.
    """
    def __init__(self, train_df, test_df, features, categoricals=[], n_splits=5, verbose=True, target='accuracy_group'):
        self.train_df = train_df
        self.test_df = test_df
        self.is_last = train_df.is_last
        self.features = features
        self.n_splits = n_splits
        self.categoricals = categoricals
        self.target = target
        self.cv = self.get_cv()
        self.verbose = verbose
        self.params = self.get_params()
        self.y_pred, self.score, self.model, self.val_ys, self.val_preds = self.fit()
        
    def train_model(self, train_set, val_set):
        raise NotImplementedError
        
    def get_cv(self):
        if MIX_UP:
            cv = KFold(n_splits=self.n_splits, shuffle=True, random_state=42)
        else:
            cv = StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=42)
        return cv.split(self.train_df, self.train_df[self.target])
    
    def get_params(self):
        raise NotImplementedError
        
    def convert_dataset(self, x_train, y_train, x_val, y_val):
        raise NotImplementedError
        
    def convert_x(self, x):
        return x
        
    def fit(self):
#         oof_pred = np.zeros((len(unseen_valid), )) if MIX_UP else np.zeros((len(self.train_df), ))
        oof_pred = []
        oof_ys_all = []
        oof_pred_lastonly = []
        oof_ys_lastonly = []
        y_pred = np.zeros((len(self.test_df), ))
        for fold, (train_idx, val_idx) in enumerate(self.cv):
            if MIX_UP:
                x_train, x_val = self.train_df[self.features].iloc[train_idx], unseen_valid[self.features]
                y_train, y_val = self.train_df[self.target].iloc[train_idx], unseen_valid[self.target]
                self.is_last = unseen_valid.is_last
                cur_is_last = unseen_valid.is_last
            else:
                x_train, x_val = self.train_df[self.features].iloc[train_idx], self.train_df[self.features].iloc[val_idx]
                y_train, y_val = self.train_df[self.target].iloc[train_idx], self.train_df[self.target].iloc[val_idx]
                cur_is_last = self.is_last.iloc[val_idx]
            train_set, val_set = self.convert_dataset(x_train, y_train, x_val, y_val)
            model = self.train_model(train_set, val_set)
            conv_x_val = self.convert_x(x_val.reset_index(drop=True))
            conv_x_val_lastonly = self.convert_x(x_val.loc[cur_is_last==1,:].reset_index(drop=True))
#             oof_pred[val_idx] = model.predict(conv_x_val).reshape(oof_pred[val_idx].shape)
            preds_all = model.predict(conv_x_val)
            preds_lastonly = model.predict(conv_x_val_lastonly)
            oof_pred_lastonly.append(preds_lastonly)
            oof_pred.append(preds_all)
            x_test = self.convert_x(self.test_df[self.features])
            y_pred += model.predict(x_test).reshape(y_pred.shape) / self.n_splits
#             print('Partial score (all) of fold {} is: {}'.format(fold, eval_qwk_lgb_regr(y_val, oof_pred[val_idx])[1]))
            print('Partial score (all) of fold {} is: {}'.format(fold, eval_qwk_lgb_regr(y_val, np.array(preds_all).reshape(-1,1))[1]))
            print('Partial score (lastonly) of fold {} is: {}'.format(fold, eval_qwk_lgb_regr(y_val.loc[cur_is_last==1].reset_index(drop=True), np.array(preds_lastonly).reshape(-1,1), is_last=True)[1]))
            oof_ys_lastonly.append([i for i in y_val.loc[cur_is_last==1].reset_index(drop=True).values])
            oof_ys_all.append(y_val.reset_index(drop=True).values)
#         _, loss_score, _ = eval_qwk_lgb_regr(self.train_df[self.target], oof_pred)
        oof_ys_all = np.concatenate(oof_ys_all).reshape(-1,1)
        oof_pred = np.concatenate(oof_pred).reshape(-1,1)
        _, loss_score, _ = eval_qwk_lgb_regr(oof_ys_all, oof_pred)
        try:
            oof_ys_lastonly = np.concatenate(oof_ys_lastonly).reshape(-1,1)
            oof_pred_lastonly = np.concatenate(oof_pred_lastonly).reshape(-1,1)
            _, loss_score_lastonly, _ = eval_qwk_lgb_regr(oof_ys_lastonly, oof_pred_lastonly, is_last=True)
        except:
            loss_score_lastonly,oof_ys_lastonly,oof_pred_lastonly = None,None,None
        if self.verbose:
            print('Our oof cohen kappa score (all) is: ', loss_score)
            print('Our oof cohen kappa score (lastonly) is: ', loss_score_lastonly)
        if RETURN_PREDS_LO:
            return y_pred, loss_score, model, oof_ys_lastonly, oof_pred_lastonly
        else:
            # return the labels in fold order so that they line up with oof_pred
            return y_pred, loss_score, model, oof_ys_all, oof_pred

In [None]:
class Lgb_Model(Base_Model):
    def train_model(self, train_set, val_set):
        verbosity = 100 if self.verbose else 0
        return lgb.train(self.params, train_set, valid_sets=[train_set, val_set], verbose_eval=verbosity)
        
    def convert_dataset(self, x_train, y_train, x_val, y_val):
        train_set = lgb.Dataset(x_train, y_train, categorical_feature=self.categoricals)
        val_set = lgb.Dataset(x_val, y_val, categorical_feature=self.categoricals)
        return train_set, val_set
        
    def get_params(self):
        params = {'n_estimators':6000,
                    'boosting_type': 'gbdt',
                    'objective': 'regression',
                    'metric': 'rmse',
                    'subsample': 0.75,
                    'subsample_freq': 1,
                    'learning_rate': 0.01,
                    'feature_fraction': 0.8,
                    'max_depth': 150, # was 15
                  'num_leaves': 50,
                    'lambda_l1': 0.1,  
                    'lambda_l2': 0.1,
                    'early_stopping_rounds': 300,
                  'min_data_in_leaf': 1,
                          'min_gain_to_split': 0.01,
                          'max_bin': 400
                    }
        return params

In [None]:
class Xgb_Model(Base_Model):
    def train_model(self, train_set, val_set):
        verbosity = 100 if self.verbose else 0
        return xgb.train(self.params, train_set, 
                         num_boost_round=5000, evals=[(train_set, 'train'), (val_set, 'val')], 
                         verbose_eval=verbosity, early_stopping_rounds=100)
        
    def convert_dataset(self, x_train, y_train, x_val, y_val):
        train_set = xgb.DMatrix(x_train, y_train)
        val_set = xgb.DMatrix(x_val, y_val)
        return train_set, val_set
    
    def convert_x(self, x):
        return xgb.DMatrix(x)
        
    def get_params(self):
        params = {'colsample_bytree': 0.8,                 
            'learning_rate': 0.01,
            'max_depth': 25,
            'subsample': 1,
            'objective':'reg:squarederror',
            #'eval_metric':'rmse',
            'min_child_weight':3,
            'gamma':0.25,
            'n_estimators':5000}

        return params

In [None]:
class Catb_Model(Base_Model):
    def train_model(self, train_set, val_set):
        verbosity = 100 if self.verbose else 0
        clf = CatBoostRegressor(**self.params)
        clf.fit(train_set['X'], 
                train_set['y'], 
                eval_set=(val_set['X'], val_set['y']),
                verbose=verbosity, 
                cat_features=self.categoricals)
        return clf
        
    def convert_dataset(self, x_train, y_train, x_val, y_val):
        train_set = {'X': x_train, 'y': y_train}
        val_set = {'X': x_val, 'y': y_val}
        return train_set, val_set
        
    def get_params(self):
        params = {'loss_function': 'RMSE',
                   'task_type': "CPU",
                   'iterations': 5000,
                   'od_type': "Iter",
                    'depth': 10,
                  'colsample_bylevel': 0.5, 
                   'early_stopping_rounds': 300,
                    'l2_leaf_reg': 18,
                   'random_seed': 421,
                    'use_best_model': True
                    }
        return params

In [None]:
class Nn_Model(Base_Model):
    def __init__(self, train_df, test_df, features, categoricals=[], n_splits=5, verbose=True):
        features = features.copy()
        if len(categoricals) > 0:
            for cat in categoricals:
                enc = OneHotEncoder()
                train_cats = enc.fit_transform(train_df[[cat]])
                test_cats = enc.transform(test_df[[cat]])
                # `active_features_` was removed in scikit-learn 0.22; `categories_` holds the encoded values
                cat_cols = ['{}_{}'.format(cat, str(col)) for col in enc.categories_[0]]
                features += cat_cols
                train_cats = pd.DataFrame(train_cats.toarray(), columns=cat_cols)
                test_cats = pd.DataFrame(test_cats.toarray(), columns=cat_cols)
                train_df = pd.concat([train_df, train_cats], axis=1)
                test_df = pd.concat([test_df, test_cats], axis=1)
        scalar = MinMaxScaler()
        train_df[features] = scalar.fit_transform(train_df[features])
        test_df[features] = scalar.transform(test_df[features])
        print(train_df[features].shape)
        super().__init__(train_df, test_df, features, categoricals, n_splits, verbose)
        
    def train_model(self, train_set, val_set):
        verbosity = 100 if self.verbose else 0
        model = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(train_set['X'].shape[1],)),
            tf.keras.layers.Dense(400, activation='relu'),
            tf.keras.layers.LayerNormalization(),
            tf.keras.layers.Dropout(0.25),
            tf.keras.layers.Dense(200, activation='relu'),
            tf.keras.layers.LayerNormalization(),
            tf.keras.layers.Dropout(0.20),
            tf.keras.layers.Dense(100, activation='relu'),
            tf.keras.layers.LayerNormalization(),
            tf.keras.layers.Dropout(0.15),
            tf.keras.layers.Dense(50, activation='relu'),
            tf.keras.layers.LayerNormalization(),
            tf.keras.layers.Dropout(0.1),
            tf.keras.layers.Dense(1, activation='relu')
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=4e-4), loss='mse')
        print(model.summary())
        save_best = tf.keras.callbacks.ModelCheckpoint('nn_model.w8', save_weights_only=True, save_best_only=True, verbose=1)
        early_stop = tf.keras.callbacks.EarlyStopping(patience=20)
        model.fit(train_set['X'], 
                train_set['y'], 
                validation_data=(val_set['X'], val_set['y']),
                epochs=100,
                 callbacks=[save_best, early_stop])
        model.load_weights('nn_model.w8')
        return model
        
    def convert_dataset(self, x_train, y_train, x_val, y_val):
        train_set = {'X': x_train, 'y': y_train}
        val_set = {'X': x_val, 'y': y_val}
        return train_set, val_set
        
    def get_params(self):
        return None

In [None]:
from random import choice

class Cnn_Model(Base_Model):
    
    def __init__(self, train_df, test_df, features, categoricals=[], n_splits=5, verbose=True):
        features = features.copy()
        if len(categoricals) > 0:
            for cat in categoricals:
                enc = OneHotEncoder()
                train_cats = enc.fit_transform(train_df[[cat]])
                test_cats = enc.transform(test_df[[cat]])
                # `active_features_` was removed in scikit-learn 0.22; `categories_` holds the encoded values
                cat_cols = ['{}_{}'.format(cat, str(col)) for col in enc.categories_[0]]
                features += cat_cols
                train_cats = pd.DataFrame(train_cats.toarray(), columns=cat_cols)
                test_cats = pd.DataFrame(test_cats.toarray(), columns=cat_cols)
                train_df = pd.concat([train_df, train_cats], axis=1)
                test_df = pd.concat([test_df, test_cats], axis=1)
        scalar = MinMaxScaler()
        train_df[features] = scalar.fit_transform(train_df[features])
        test_df[features] = scalar.transform(test_df[features])
        self.create_feat_2d(features)
        super().__init__(train_df, test_df, features, categoricals, n_splits, verbose)
        
    def create_feat_2d(self, features, n_feats_repeat=50):
        self.n_feats = len(features)
        self.n_feats_repeat = n_feats_repeat
        self.mask = np.zeros((self.n_feats_repeat, self.n_feats), dtype=np.int32)
        for i in range(self.n_feats_repeat):
            l = list(range(self.n_feats))
            for j in range(self.n_feats):
                c = l.pop(choice(range(len(l))))
                self.mask[i, j] = c
        self.mask = tf.convert_to_tensor(self.mask)
        print(self.mask.shape)
       
        
    
    def train_model(self, train_set, val_set):
        verbosity = 100 if self.verbose else 0

        inp = tf.keras.layers.Input(shape=(self.n_feats,))
        x = tf.keras.layers.Lambda(lambda x: tf.gather(x, self.mask, axis=1))(inp)
        x = tf.keras.layers.Reshape((self.n_feats_repeat, self.n_feats, 1))(x)
        x = tf.keras.layers.Conv2D(18, (50, 50), strides=50, activation='relu')(x)
        x = tf.keras.layers.Flatten()(x)
        x = tf.keras.layers.Dense(200, activation='relu')(x)
        x = tf.keras.layers.LayerNormalization()(x)
        x = tf.keras.layers.Dropout(0.2)(x)
        x = tf.keras.layers.Dense(100, activation='relu')(x)
        x = tf.keras.layers.LayerNormalization()(x)
        x = tf.keras.layers.Dropout(0.15)(x)
        x = tf.keras.layers.Dense(50, activation='relu')(x)
        x = tf.keras.layers.LayerNormalization()(x)
        x = tf.keras.layers.Dropout(0.1)(x)
        out = tf.keras.layers.Dense(1)(x)
        
        model = tf.keras.Model(inp, out)
    
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss='mse')
        print(model.summary())
        save_best = tf.keras.callbacks.ModelCheckpoint('nn_model.w8', save_weights_only=True, save_best_only=True, verbose=1)
        early_stop = tf.keras.callbacks.EarlyStopping(patience=20)
        model.fit(train_set['X'], 
                train_set['y'], 
                validation_data=(val_set['X'], val_set['y']),
                epochs=100,
                 callbacks=[save_best, early_stop])
        model.load_weights('nn_model.w8')
        return model
        
    def convert_dataset(self, x_train, y_train, x_val, y_val):
        train_set = {'X': x_train, 'y': y_train}
        val_set = {'X': x_val, 'y': y_val}
        return train_set, val_set
        
    def get_params(self):
        return None
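
The reshuffling done by `create_feat_2d` and the `tf.gather` layer can be sketched with NumPy alone. This is only an illustration under assumed toy sizes (6 features repeated 4 times, whereas the kernel repeats them 50 times): every row of the mask is an independent random permutation of the feature indices, so each sample becomes a 2D "image" whose rows are differently ordered copies of its feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n_feats, n_feats_repeat = 6, 4  # toy sizes, not the kernel's values

# Each row of the mask is an independent random permutation of the feature indices.
mask = np.stack([rng.permutation(n_feats) for _ in range(n_feats_repeat)])

# A batch of 2 samples with 6 features each.
x = np.arange(12, dtype=np.float32).reshape(2, n_feats)

# NumPy equivalent of tf.gather(x, mask, axis=1):
# the result has shape (batch, n_feats_repeat, n_feats).
x_2d = x[:, mask]

print(mask)
print(x_2d.shape)
```

The convolution then sees a different set of neighbouring features in each row, which is presumably the motivation for the random ordering.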

## Data pre-processing

Read the data and determine the last assessment session id for each user in the test dataset. In the test dataset, the last assessment session of every user is truncated to its first event to prevent data leakage. However, non-last assessment sessions may also lack some of the expected events, simply because the child did not finish the assessment. We therefore need to distinguish genuinely incomplete assessments from those truncated by the organisers. The cell below collects the ids of all last assessment sessions in the test dataset; get_data() uses them to parse the last assessments in the test set correctly even though they are truncated.

In [None]:
# read data
train, test, train_labels, specs, sample_submission, media_durations = read_data()

# get last game sessions in test
last_ass_session = test.loc[:,['installation_id','type','game_session'], ].groupby(['installation_id','type'], sort=False,as_index=False).last()
last_ass_session = last_ass_session.loc[last_ass_session.type=='Assessment',['installation_id', 'game_session']]
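
To make the selection above easier to follow, here is a minimal sketch on a toy event log. The ids are hypothetical and the rows are assumed to be in chronological order within each installation. `groupby(...).last()` keeps the last session of every (installation, type) pair, and the filter then keeps only the assessments.

```python
import pandas as pd

# Toy event log (hypothetical ids), in chronological order per installation.
toy_test = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b", "b"],
    "type": ["Assessment", "Game", "Assessment", "Clip", "Assessment"],
    "game_session": ["s1", "s2", "s3", "s4", "s5"],
})

last_by_type = (toy_test[["installation_id", "type", "game_session"]]
                .groupby(["installation_id", "type"], sort=False, as_index=False)
                .last())
last_assessment = last_by_type.loc[last_by_type["type"] == "Assessment",
                                   ["installation_id", "game_session"]]

print(last_assessment)
```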

The initial competition datasets did not contain the durations of clips. The organisers later released a separate file, "media durations", that contained the durations of all the clips in the game. The cell below adds clip durations to the main dataset.

In [None]:
train = train.merge(media_durations, on=['title'], how='left')
train.loc[train.type_x == 'Clip','game_time'] = train.loc[train.type_x == 'Clip','duration'] 
train.drop(['type_y','duration'], axis=1, inplace=True)
train.rename({'type_x':'type'}, axis=1, inplace=True)

test = test.merge(media_durations, on=['title'], how='left')
test.loc[test.type_x == 'Clip','game_time'] = test.loc[test.type_x == 'Clip','duration'] 
test.drop(['type_y','duration'], axis=1, inplace=True)
test.rename({'type_x':'type'}, axis=1, inplace=True)

Children may play the game in any order. However, there is a default order that is suggested by the developers.

The initial datasets gave no indication of title order. The "media durations" file mentioned above lists all titles in the intended order. I used this information to obtain the previous intended title for each title. get_data() uses it to create "orderliness" features that indicate whether a child tends to interact with the app in the intended order.

In [None]:
media_durations['prev_title'] = 0
for i in range(media_durations.shape[0]):
    media_durations.iloc[i, 3] =  media_durations.iloc[i-1, 0] if media_durations.iloc[i-1, 1] != 'Assessment' else 0
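
The same "previous title" logic can be sketched with `shift`, which avoids hard-coded column positions. This is an illustration on a made-up table that assumes the columns are called `title`, `type` and `duration`. Unlike the loop above, which reads the last row when `i` is 0 because `iloc[-1]` wraps around, this sketch gives the first row no predecessor.

```python
import pandas as pd

# Toy stand-in for media_durations; titles and durations are made up.
toy_media = pd.DataFrame({
    "title": ["Clip A", "Game B", "Assessment C", "Clip D"],
    "type": ["Clip", "Game", "Assessment", "Clip"],
    "duration": [30, 0, 0, 45],
})

prev_title = toy_media["title"].shift(1)
prev_type = toy_media["type"].shift(1)

# The first row has no predecessor, and a title that follows an assessment
# gets no previous title either; both cases are marked with 0.
has_prev = prev_type.notna() & (prev_type != "Assessment")
toy_media["prev_title"] = prev_title.where(has_prev, 0)

print(toy_media)
```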

Here I encode the categorical features as numbers and collect the lookup dictionaries.

In [None]:
# get useful dicts with the encoding maps
train, test, train_labels, win_code, list_of_user_activities, list_of_event_code, activities_labels, assess_titles, list_of_event_id, all_title_event_code,  unique_game_titles = encode_categoricals(train, test, train_labels)

unique_titles = list_of_user_activities
unique_event_codes = list_of_event_code
titles_inverse_map = activities_labels
unique_assess_titles = assess_titles
unique_event_ids = list_of_event_id
unique_title_event_codes = all_title_event_code

Running the parsing functions (they take a while because the code is unoptimised; execution speed was not a priority in this competition).

In [None]:
# transform function to get the train and test set
PARSE_ATTEMPTED = True

reduce_train = parse_all_installation_ids(train, disc_rate = 1, cur_rate = 1, is_test = False)
reduce_test = parse_all_installation_ids(test, disc_rate = 1, cur_rate = 1, is_test = True)

categoricals = ['session_title']

Adding a few installation-level aggregate features

In [None]:
reduce_train, reduce_test = get_installation_aggr_features(reduce_train, reduce_test)

In [None]:
for title in unique_titles:
   reduce_train.loc[:,f'{title}_speed'] = reduce_train.loc[:,f'{title}_time']/reduce_train.loc[:,title]
   reduce_test.loc[:,f'{title}_speed'] = reduce_test.loc[:,f'{title}_time']/reduce_test.loc[:,title]
    
    
reduce_train.replace([np.inf, -np.inf], np.nan, inplace=True)
reduce_train.fillna(value=0, inplace=True)

reduce_test.replace([np.inf, -np.inf], np.nan, inplace=True)
reduce_test.fillna(value=0, inplace=True)

In [None]:
titles_numbers = [i for i in titles_inverse_map]
titles_titles = [titles_inverse_map[i] for i in titles_inverse_map]
titles_map = {titles_titles[i]:titles_numbers[i] for i in range(len(titles_titles))}

In [None]:
ass_features = [i for i in reduce_train.columns if i in unique_assess_titles]

Saving the parsed datasets, or loading existing ones if parsing failed. This allowed testing model fitting at an early stage, when the parsing functions were still too buggy.

In [None]:
import pickle

try:
    if PARSE_ATTEMPTED:
        pickle.dump(reduce_train, open("/kaggle/working/reduce_train.p", "wb"))
        pickle.dump(reduce_test, open("/kaggle/working/reduce_test.p", "wb"))
        pickle.dump(titles_map, open("/kaggle/working/titles_map.p", "wb"))
        pickle.dump(ass_features, open("/kaggle/working/ass_features.p", "wb"))
        print('Saved data')
except:
    reduce_train = pickle.load(open("../input/basedon460/reduce_train.p", "rb"))
    reduce_test = pickle.load(open("../input/basedon460/reduce_test.p", "rb"))
    titles_map = pickle.load(open("../input/basedon460/titles_map.p", "rb"))
    ass_features = pickle.load(open("../input/basedon460/ass_features.p", "rb"))
    print('Loaded data')

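This is one of the key components of the competition result: a lot of data from the test set can be used for training. Only the last assessment of each test user is truncated, so the earlier assessments in the test set can be parsed and labelled like train data. The code below adds such rows from the test set to the train set.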

In [None]:
# add is_last and non-last-assessment rows from test to train

use_rows_from_test = True

reduce_train.loc[:,'index'] = reduce_train.index
last_idx = reduce_train.loc[:,['index','installation_id'], ].groupby(['installation_id'], sort=False).last()['index']
reduce_train.loc[:,'is_last'] = 0
reduce_train.loc[last_idx,'is_last'] = 1
reduce_train.drop(['index'], axis=1, inplace=True)

reduce_test.loc[:,'index'] = reduce_test.index
last_idx = reduce_test.loc[:,['index','installation_id'], ].groupby(['installation_id'], sort=False).last()['index']
reduce_test.loc[:,'is_last'] = 0
reduce_test.loc[last_idx,'is_last'] = 1
reduce_test.drop(['index'], axis=1, inplace=True)

if use_rows_from_test:
    reduce_train = pd.concat([reduce_train, reduce_test.loc[reduce_test.is_last==0,:]], ignore_index=True)
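
The flagging and concatenation above can be sketched on toy frames as follows. The values are hypothetical, and `duplicated(keep="last")` is used here as an alternative to the index-based approach in the cell: a row is the last one of its installation exactly when no later row has the same id.

```python
import pandas as pd

toy_train = pd.DataFrame({"installation_id": ["a", "a", "b"],
                          "accuracy_group": [3, 0, 2]})
toy_test = pd.DataFrame({"installation_id": ["x", "x", "x", "y"],
                         "accuracy_group": [1, 3, 0, 0]})

for df in (toy_train, toy_test):
    # The last row of each installation is the one not followed by a duplicate id.
    df["is_last"] = (~df.duplicated("installation_id", keep="last")).astype(int)

# Only non-last test rows are moved to train; the last ones are what must be predicted.
extended_train = pd.concat([toy_train, toy_test[toy_test["is_last"] == 0]],
                           ignore_index=True)

print(extended_train)
```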

The functions below were used to experiment with oversampling and mix-up. Oversampling was attempted for under-represented samples (e.g. last assessments vs non-last). "Mix-up" is an augmentation idea from image recognition, where two images are overlaid on each other with less than 100% transparency. The same "overlaying" is applied to the labels: the new label is a weighted combination of the initial labels.

Neither idea worked in the short time I spent on them. I believe both could have worked if done right, but I had no time to experiment with them further.

In [None]:
def balance(dim_to_balance, df, shuffle=True):
    np.random.seed(2103)
    counts = df.loc[:,[dim_to_balance, 'installation_id']].groupby([dim_to_balance], as_index=False).count().sort_values(['installation_id'], ascending=False).reset_index(drop=True)
    print('initial dist: '); print(counts)
    balanced_df = df
    max_len = counts.loc[0,'installation_id']
    for i in counts.loc[1:,dim_to_balance]:
        cur_df = df.loc[df.loc[:,dim_to_balance]==i,]
        cur_len = cur_df.shape[0]
        copy_size = max_len - cur_len
        idx = np.random.randint(0, cur_len, copy_size)
        copy_df = cur_df.iloc[idx, :]
        balanced_df = pd.concat([balanced_df, copy_df], ignore_index = True)
    if shuffle:   
        idx = np.random.permutation(balanced_df.shape[0])
        balanced_df = balanced_df.iloc[idx,].reset_index(drop=True)
    print('corrected dist: '); print(balanced_df.loc[:,[dim_to_balance, 'installation_id']].groupby([dim_to_balance], as_index=False).count())
    return balanced_df

def mix_up(dim_to_balance, df, shuffle=True):
    np.random.seed(2103)
    counts = df.loc[:,[dim_to_balance, 'installation_id']].groupby([dim_to_balance], as_index=False).count().sort_values(['installation_id'], ascending=False).reset_index(drop=True)
    print('initial dist: '); print(counts)
    balanced_df = df
    max_len = counts.loc[0,'installation_id']
    cols_to_mix = [i for i in df.columns if i not in ['installation_id', 'is_last']]
    for i in counts.loc[1:,dim_to_balance]:
        cur_df = df.loc[df.loc[:,dim_to_balance]==i,]
        cur_len = cur_df.shape[0]
        copy_size = max_len - cur_len
        idx_1 = np.random.randint(0, cur_len, copy_size)
        idx_2 = np.random.randint(0, cur_len, copy_size)
        m_1 = np.random.beta(0.5,0.5,copy_size)
        m_2 = 1-m_1
        mixed_rows = []
        for q in tqdm(range(copy_size)):
            mixed_rows += [m_1[q]*cur_df.loc[:,cols_to_mix].iloc[idx_1[q],:] + m_2[q]*cur_df.loc[:,cols_to_mix].iloc[idx_2[q],:]]
        copy_df = pd.DataFrame(mixed_rows)
        balanced_df = pd.concat([balanced_df, copy_df], ignore_index = True)
    if shuffle:   
        idx = np.random.permutation(balanced_df.shape[0])
        balanced_df = balanced_df.iloc[idx,].reset_index(drop=True)
    print('corrected dist: '); print(balanced_df.loc[:,[dim_to_balance, 'installation_id']].groupby([dim_to_balance], as_index=False).count())
    return balanced_df
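
For a single pair of samples, the mix-up step inside `mix_up()` reduces to the following sketch. The two samples are hypothetical; the Beta(0.5, 0.5) weight matches the one drawn in the function above. Features and label are blended with the same weight.

```python
import numpy as np

rng = np.random.default_rng(2103)

# Two hypothetical samples: feature vectors and their accuracy-group labels.
x1, y1 = np.array([1.0, 0.0, 4.0]), 3.0
x2, y2 = np.array([3.0, 2.0, 0.0]), 0.0

lam = rng.beta(0.5, 0.5)  # mixing weight between 0 and 1
x_mix = lam * x1 + (1 - lam) * x2
y_mix = lam * y1 + (1 - lam) * y2

print(lam, x_mix, y_mix)
```

A Beta(0.5, 0.5) distribution puts most of its mass near 0 and 1, so most pseudo-samples stay close to one of the two originals.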

In [None]:
# if MIX_UP:
#     np.random.seed(1997)
#     idx = np.random.permutation(reduce_train.shape[0])[0:round(0.2*reduce_train.shape[0])]
#     reduce_train_unbalanced = reduce_train.copy()
#     unseen_valid = reduce_train_unbalanced.iloc[idx,:]
#     reduce_train = mix_up('is_last', reduce_train.iloc[[i for i in range(reduce_train.shape[0]) if i not in idx],:])

In [None]:
if MIX_UP:
    np.random.seed(1997)
    idx = np.random.permutation(reduce_train.shape[0])[0:round(0.2*reduce_train.shape[0])]
#     reduce_train_unbalanced = reduce_train.copy()
    unseen_valid = reduce_train.iloc[idx,:].copy()
    reduce_train = balance('is_last', reduce_train.iloc[[i for i in range(reduce_train.shape[0]) if i not in idx],:])
    
    reduce_train.loc[reduce_train.is_last.isna(),'is_last'] = 1
    print(reduce_train.is_last.value_counts())
    
    reduce_train.accuracy_group.hist(bins=30)

Transforming features: dividing train and test by the train set's standard deviation and applying a cubic root transform. I noticed that many variables have skewed distributions and hypothesised that some normalisation might help. I tried other powers, but the cubic root seemed to work best.

In [None]:
no_transform = ['accuracy_group', 'installation_id', 'session_title', 'is_last'] + [f'{ass}_onehot' for ass in ass_features]

for i in tqdm([c for c in reduce_train.columns if c not in no_transform]):
#     mean = reduce_train[i].mean()
    sd = reduce_train[i].std()
    reduce_train.loc[:,i] = np.power((reduce_train.loc[:,i])/(sd+0.01), POWER)
    reduce_test.loc[:,i] = np.power((reduce_test.loc[:,i])/(sd+0.01), POWER)
    if MIX_UP: unseen_valid.loc[:,i] = np.power((unseen_valid.loc[:,i])/(sd+0.01), POWER)
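
A minimal sketch of the transform on one toy column, assuming `POWER = 1/3` (the actual value is set in the kernel's config section, which is not shown here). It also shows one caveat: `np.power` with a fractional exponent returns NaN for negative inputs, whereas `np.cbrt` keeps the sign.

```python
import numpy as np
import pandas as pd

POWER = 1 / 3  # assumed value for illustration

toy = pd.DataFrame({"feat": [0.0, 1.0, 8.0, 1000.0]})  # made-up skewed feature
sd = toy["feat"].std()

scaled = toy["feat"] / (sd + 0.01)
toy["feat_power"] = np.power(scaled, POWER)  # what the loop above does
toy["feat_cbrt"] = np.cbrt(scaled)           # same result for non-negative inputs

# Negative base with a fractional exponent: NaN from np.power, signed root from np.cbrt.
negative_power = np.power(np.array([-8.0]), POWER)
negative_cbrt = np.cbrt(np.array([-8.0]))

print(toy)
print(negative_power, negative_cbrt)
```

In this kernel any NaN produced this way is replaced with 0 in the cleaning cells that follow.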

Cleaning the data of infinities and nulls: some features may contain such values because of divisions of very large numbers by very small ones, or by zero.

In [None]:
reduce_train.replace([np.inf, -np.inf], np.nan, inplace=True)
reduce_test.replace([np.inf, -np.inf], np.nan, inplace=True)
if MIX_UP: unseen_valid.replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
reduce_train.fillna(value=0, inplace=True)
reduce_test.fillna(value=0, inplace=True)
if MIX_UP: unseen_valid.fillna(value=0, inplace=True)

In [None]:
reduce_train.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in reduce_train.columns]
reduce_test.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in reduce_test.columns]
if MIX_UP: unseen_valid.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in unseen_valid.columns]

full_reduce_test = reduce_test.copy()
reduce_test = reduce_test.loc[reduce_test.is_last==1,:].reset_index(drop=True)

## Feature selection

In [None]:
# call feature engineering function
features = reduce_train.columns
features = [x for x in features if x not in ['accuracy_group', 'installation_id']]

Drop one feature from each pair of features whose correlation is above 0.995

In [None]:
counter = 0
to_remove = []
for feat_a in features:
    for feat_b in features:
        if feat_a != feat_b and feat_a not in to_remove and feat_b not in to_remove:
            c = np.corrcoef(reduce_train[feat_a], reduce_train[feat_b])[0][1]
            if c > 0.995:
                counter += 1
                to_remove.append(feat_b)
                print('{}: FEAT_A: {} FEAT_B: {} - Correlation: {}'.format(counter, feat_a, feat_b, c))
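
The nested loop above recomputes a correlation for every pair of features. A possible faster variant, sketched below on toy data, computes the correlation matrix once and then applies the same greedy rule. The helper name `drop_correlated` is hypothetical. As in the original loop, only strongly positive correlations are caught.

```python
import pandas as pd

def drop_correlated(df, features, threshold=0.995):
    """Greedy filter: keep a feature, drop every other survivor correlated above threshold."""
    corr = df[features].corr().to_numpy()  # Pearson correlation, computed once
    removed = set()
    for a in range(len(features)):
        if a in removed:
            continue
        for b in range(len(features)):
            if b != a and b not in removed and corr[a, b] > threshold:
                removed.add(b)
    return [features[i] for i in sorted(removed)]

toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0]})
toy["f2"] = toy["f1"] * 2 + 1      # perfectly correlated with f1
toy["f3"] = [4.0, 1.0, 3.0, 2.0]   # weakly (negatively) correlated with f1

to_drop = drop_correlated(toy, ["f1", "f2", "f3"])
print(to_drop)
```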

Drop features whose mean differs too much between the train set and the full test set (ratio of means above 5 or below 0.2)

In [None]:
differences = []
adjusted_counter = 0
unadjusted_counter = 0

to_exclude = [] 
ajusted_test = reduce_test.copy()
for feature in ajusted_test.columns:
    if feature not in ['accuracy_group', 'installation_id', 'session_title', 'is_last'] + [f'{ass}_onehot' for ass in ass_features] + to_remove:
        data = reduce_train[feature]
        train_mean = data.mean()
        data = full_reduce_test[feature] 
        test_mean = data.mean()

        adjust_factor = (train_mean + 0.01) / (test_mean + 0.01)
        
        differences.append((feature, adjust_factor))

        if adjust_factor > 5 or adjust_factor < 0.2:
            to_exclude.append(feature)
            print(feature, train_mean, test_mean)
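
The drift check above boils down to a smoothed ratio of means. A small sketch with made-up columns follows; `mean_ratio` is a hypothetical helper, and the 0.01 term and the 5 / 0.2 cut-offs are the ones used in the cell.

```python
import pandas as pd

def mean_ratio(train_col, test_col, eps=0.01):
    # eps is the smoothing term that avoids division by zero
    return (train_col.mean() + eps) / (test_col.mean() + eps)

toy_train = pd.DataFrame({"stable": [1.0, 2.0, 3.0], "drifting": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"stable": [2.0, 2.0, 2.5], "drifting": [1.0, 2.0, 3.0]})

ratios = {c: mean_ratio(toy_train[c], toy_test[c]) for c in toy_train.columns}
excluded = [c for c, r in ratios.items() if r > 5 or r < 0.2]

print(ratios)
print(excluded)
```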

In [None]:
features = [x for x in features if x not in (to_exclude + to_remove)]
reduce_train[features].shape

## Fitting models

In [None]:
lgb_model = Lgb_Model(reduce_train, ajusted_test, features, categoricals=categoricals)

In [None]:
# XGB model did not seem to perform well and was very slow

# xgb_model = Xgb_Model(reduce_train, ajusted_test, features, categoricals=categoricals)

In [None]:
nn_model = Nn_Model(reduce_train, ajusted_test, features, categoricals=categoricals)

In [None]:
cat_model = Catb_Model(reduce_train, ajusted_test, features, categoricals=categoricals) #if MIX_UP != True else None

In [None]:
cnn_model = Cnn_Model(reduce_train, ajusted_test, features, categoricals=categoricals)

## Post-processing and submission

### Mixing models together and saving submission file

Mix the models' predictions together using hand-picked weights. In retrospect, I could have weighted them by validation scores.

In [None]:
try:
    # Hand-picked blend weights for all four models
    weights = {'lgb': 0.20, 'nn': 0.30, 'cat': 0.30, 'cnn': 0.20}
    final_pred = (lgb_model.y_pred * weights['lgb']) + (nn_model.y_pred * weights['nn']) + (cat_model.y_pred * weights['cat']) + (cnn_model.y_pred * weights['cnn'])
    final_val_ys = (lgb_model.val_ys * weights['lgb']) + (nn_model.val_ys * weights['nn']) + (cat_model.val_ys * weights['cat']) + (cnn_model.val_ys * weights['cnn'])
    final_val_preds = (lgb_model.val_preds * weights['lgb']) + (nn_model.val_preds * weights['nn']) + (cat_model.val_preds * weights['cat']) + (cnn_model.val_preds * weights['cnn'])
    print('Used all models')
except Exception:
    # Fall back to a blend without CatBoost if the four-model blend fails
    weights = {'lgb': 0.20, 'nn': 0.40, 'cnn': 0.40}
    final_pred = (lgb_model.y_pred * weights['lgb']) + (nn_model.y_pred * weights['nn']) + (cnn_model.y_pred * weights['cnn'])
    final_val_ys = (lgb_model.val_ys * weights['lgb']) + (nn_model.val_ys * weights['nn']) + (cnn_model.val_ys * weights['cnn'])
    final_val_preds = (lgb_model.val_preds * weights['lgb']) + (nn_model.val_preds * weights['nn']) + (cnn_model.val_preds * weights['cnn'])
    print('Used only 3 models')

print(final_pred.shape)

Rounding accuracy groups. The goal of the competition was to predict a child's accuracy group, which can only be 0, 1, 2 or 3. Simple rounding yielded poor results, so most participants used Nelder-Mead optimisation to find rounding thresholds. However, that approach led to a very high QWK on validation (~0.8) that did not translate into a good public LB score (~0.6 at most). So I decided to derive the thresholds from the train distribution of accuracy groups. For example, if 10% of children were in group 0, I took the 10th percentile of the models' predictions as the threshold for group 0.
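
The idea can be illustrated on toy data. In the sketch below, `class_share` and `preds` are made-up stand-ins (not the competition's actual distribution or model outputs), and `np.digitize` is used as a vectorised way to apply the thresholds.

```python
import numpy as np

# Toy class shares for groups 0..3 (hypothetical, not the competition's actual distribution).
class_share = np.array([0.25, 0.15, 0.10, 0.50])

# Toy continuous model outputs standing in for the blended predictions.
rng = np.random.default_rng(0)
preds = rng.uniform(0, 3, size=1000)

# Cumulative shares of the first three groups give the percentiles to cut at.
cut_percentiles = np.cumsum(class_share)[:3] * 100
bounds = np.percentile(preds, cut_percentiles)

# right=True maps x <= bounds[0] to group 0, bounds[0] < x <= bounds[1] to group 1, and so on.
groups = np.digitize(preds, bounds, right=True)

# Share of each predicted group; it should closely follow class_share.
shares = np.bincount(groups, minlength=4) / len(groups)

print(bounds)
print(shares)
```

By construction, the predicted groups reproduce the target distribution regardless of how the raw predictions are scaled, which is what makes this approach less prone to overfitting thresholds on validation data.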

In [None]:
dist_source = reduce_train.loc[reduce_train.is_last==1,'accuracy_group'].reset_index(drop=True) if USE_DIFFERENT_BOUNDS else reduce_train['accuracy_group']

dist = Counter(dist_source)
for k in dist:
    dist[k] /= len(dist_source)
dist_source.hist()

acum = 0
bound = {}
for i in range(3):
    acum += dist[i]
    bound[i] = np.percentile(final_val_preds, acum * 100)
print(bound)

In [None]:
def classify(x):
    if x <= bound[0]:
        return 0
    elif x <= bound[1]:
        return 1
    elif x <= bound[2]:
        return 2
    else:
        return 3
    
final_pred = np.array(list(map(classify, final_pred)))

sample_submission['accuracy_group'] = final_pred.astype(int)
sample_submission.to_csv('submission.csv', index=False)
sample_submission['accuracy_group'].value_counts(normalize=True)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
ax1.hist(final_pred)
ax1.set_title('Accuracy group in final predictions')
ax2.hist(reduce_train.loc[reduce_train.is_last==1,'accuracy_group'])
ax2.set_title('Accuracy group in train')
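
The rounding discussion above refers to QWK (quadratic weighted kappa), the competition metric. As a rough sketch of how it can be computed for integer labels, the hypothetical helper `quadratic_weighted_kappa` below implements the standard formula with NumPy; it is not code from this kernel, and the labels are toy values.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """Quadratic weighted kappa between two integer label arrays."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix (rows: true label, columns: predicted label).
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)

    # Expected matrix under independence, scaled to the same total count.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic disagreement weights: 0 on the diagonal, growing with label distance.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy labels for illustration only.
y_true = [0, 1, 2, 3, 3, 2, 1, 0]
y_close = [0, 1, 2, 3, 2, 2, 1, 1]

print(quadratic_weighted_kappa(y_true, y_true))
print(quadratic_weighted_kappa(y_true, y_close))
```

The same function could be applied to thresholded validation predictions and the validation labels to compare rounding strategies. Note that the sketch does not guard against the degenerate case where both arrays contain a single identical class.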

### Looking at variable importance from LGB model

In [None]:
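# Pair each feature name with its importance and sort in descending order of importance
fi = list(zip(lgb_model.model.feature_name(), lgb_model.model.feature_importance()))
fi = sorted(fi, key=lambda x: x[1], reverse=True)
# Features above the 10th percentile of importance are treated as important
cutoff_trh = np.percentile([importance for _, importance in fi], 10)
print(cutoff_trh)
important_features = [name for name, importance in fi if importance > cutoff_trh]
for name, importance in fi:
    print((name, importance))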

In [None]:
lgb.plot_importance(lgb_model.model, max_num_features = 50)