## Brief description

Our approach is based on a single LightGBM (LGBM) model. This model gives 79.9 AUC on validation, 79.7 on the public leaderboard, and 79.9 on the private leaderboard.

We tried stacking the model with an XGBoost model trained on the same features, but this gave no significant boost. Unfortunately, we didn't succeed in building a strong SAKT or SAINT+ model :(

This notebook contains only the training part of our solution. We will upload the inference part later.

In the description below, we use the term "accuracy" instead of "answered correctly".

Features that we use in the model can be divided into groups: 


* *Raw features:*

    1) prior_question_elapsed_time;
    
    2) prior_question_had_explanation.


* *Question statistics without update*:

    1) count/average/std of accuracy for the current question;
    
    2) average/std of elapsed time;
    
    3) average/std of "had_explanation" flag;
    
    4) 15 question "embeddings", obtained through ALS factorization of (user_id - content_id) matrix.
   
    To get "elapsed time" and "had explanation" for the current question, we shifted the "prior_question_elapsed_time" and "prior_question_had_explanation" fields back by one position within each user_id.
    
    For ALS factorization, we used only the train part of the data to prevent overfitting.
    
    
* *User statistics with update:*

    All statistics are calculated over the user's entire history, from the beginning (*timestamp = 0*) up to the current moment.

    1) count/avg of accuracy for the current user;
    
    2) avg of *(accuracy - question_accuracy_mean)*;
    
    3) avg of *(1 - accuracy) * question_accuracy_mean*;
    
    4) avg of "prior_question_elapsed_time" and "prior_question_had_explanation".
    
    The main idea of items 2 and 3 is to estimate user ability by taking into account not only the rate of correct answers, but also the difficulty of the questions the user did or did not answer correctly.
    

* *Timedelta features:*

    1) Time since previous question for the current user;
    
    2) Time since previous question with correct/incorrect answer;
    
    3) Time since 2 previous questions/3 previous questions.
    
    
* *User-session statistics with update:*

    The idea is to calculate user statistics only for the user's last session. A new session starts for a user when the timedelta between the current question and the previous question reaches a fixed period of time (in the final solution we use an hour and a week as periods).

    1) count/avg of accuracy, avg of *(accuracy - question_accuracy_mean)* for user within 1 hour session;
    
    2) count/avg of accuracy, avg of *(accuracy - question_accuracy_mean)* for user within 1 week session;
    
    
* *User-window statistics with update:*
    
    The idea is the same as in the previous item, but instead of sessions here we used simple rolling windows:
    
    1) sum of *(accuracy - question_accuracy_mean)* for the user within windows of depth 5/10/15.
    
    
* *User-part statistics with update:*

    The statistics here are the same as user statistics, but instead of grouping by "user_id" field, we grouped by "user_id" and "part" fields.
    
    
* *User-als statistics with update:*

    For this group of features, we calculated a vector for each question through ALS factorization of the (user_id - content_id) matrix (the same way as for the question "embeddings" above, but with dim=500 instead of dim=15). Then we clustered these vectors with K-means to get a cluster for each question's content_id (the "als_cluster" field). Finally, we calculated some statistics grouping by the "user_id" and "als_cluster" fields:
    
    1) count/avg of *(accuracy - question_accuracy_mean)*;
    
    2) time since previous question for the "user_id" - "als_cluster" aggregation.
    
    
* *User-question statistics with update:*
    
    The statistics here are the same as in the previous item, but instead of grouping by the "user_id" and "als_cluster" fields, we simply grouped by the "user_id" and "content_id" fields. This helps a lot when a user is asked a question they have already answered.

    
* *Other features:*

    1) avg of accuracy for the current part;
    
    2) avg of accuracy for the current als cluster.
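
As an illustration of the user-session statistics described above, the following minimal sketch resets a per-session counter whenever the gap since the previous question reaches a threshold. The function name `session_counts` and the toy timestamps are hypothetical, and the threshold assumes timestamps in units of 0.1 s, as in the code later in this notebook.

```python
def session_counts(timestamps, gap):
    """For each question, return how many earlier questions fall in the same session.

    A new session starts when the time since the previous question is >= gap.
    """
    counts = []
    in_session = 0          # questions already seen in the current session
    prev_ts = None
    for ts in timestamps:
        if prev_ts is not None and ts - prev_ts >= gap:
            in_session = 0  # gap too long: start a new session
        counts.append(in_session)   # the feature is read before the update
        in_session += 1
        prev_ts = ts
    return counts


HOUR = 3600 * 10  # one hour when timestamps are in units of 0.1 s
toy_timestamps = [0, 100, 200, 50_000, 50_100]
print(session_counts(toy_timestamps, HOUR))  # [0, 1, 2, 0, 1]
```

The same reset logic extends to sums of accuracy or of *(accuracy - question_accuracy_mean)* by keeping one more accumulator per session length.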

Thanks to the organizers for this interesting competition and thanks to [@tito](https://www.kaggle.com/its7171) for his brilliant notebooks with a baseline solution and validation strategy!

In [None]:
import pandas as pd
import numpy as np
import gc
from sklearn.metrics import roc_auc_score, roc_curve
from tqdm import tqdm
import lightgbm as lgb
import pickle
import sys
import importlib
import matplotlib.pyplot as plt
from collections import defaultdict

## Functions

Here we declare some basic functions for generating features with updates.
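
The core pattern in these functions is to read a user's running totals first, emit features from them, and only then fold in the current answer, so that a row never sees its own target. A minimal sketch of that pattern for a single statistic, the average of *(accuracy - question_accuracy_mean)*, could look as follows; `running_user_features` and the toy rows are hypothetical and not part of the solution.

```python
from collections import defaultdict


def running_user_features(rows):
    """rows: iterable of (user_id, answered_correctly, question_acc_mean).

    Returns, per row, (count so far, mean of accuracy - question mean so far),
    both computed only from the user's earlier answers.
    """
    state = defaultdict(lambda: (0, 0.0))   # user_id -> (count, sum of differences)
    features = []
    for user_id, correct, q_mean in rows:
        count, diff_sum = state[user_id]
        # 1) read: features use history only (None when there is no history)
        features.append((count, diff_sum / count if count else None))
        # 2) update: fold the current answer into the state
        state[user_id] = (count + 1, diff_sum + (correct - q_mean))
    return features


toy_rows = [
    (1, 1, 0.25),   # user 1 answers a hard question correctly
    (1, 0, 0.75),   # user 1 misses an easy question
    (2, 1, 0.75),
    (1, 1, 0.5),
]
for row, feats in zip(toy_rows, running_user_features(toy_rows)):
    print(row, "->", feats)
```

In the full implementation the state is a longer tuple and the sums are stored as integers scaled by 1000, but the read-then-update order is the same.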

In [None]:
def calc_arrays(df, dict_agg_user,
                dict_agg_user_part,
                dict_agg_user_als,
                dict_agg_uq,
                df_type='train',
                max_td=6048000, 
                windows_list=[5, 10, 15]):
    
    if df_type == 'train':
        shape = df[df.segment == 1].shape[0]
    else:
        shape = df.shape[0]
        
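    ## const (timestamps are pre-divided by 100, so one unit is 0.1 s)
    HOUR = 3600*10          # one hour in 0.1 s units
    WEEK = 24 * 7 * HOUR    # one week in 0.1 s units
    USER_DIM = max(windows_list)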
    
    ## user
    user_array = np.zeros((shape, 11), dtype='int32')
    user_sess_array = np.zeros((shape, 6), dtype='int32')
    user_window_array = np.zeros((shape, len(windows_list)), dtype='int16')
    
    ## user-part/als/question
    user_part_array = np.zeros((shape, 6), dtype='int32')
    user_als_array = np.zeros((shape, 3), dtype='int32')
    uq_array = np.zeros((shape, 3), dtype='int32')
    
    i = -1
    for row in tqdm(df[['user_id', 'answered_correctly', 'timestamp', 'content_id', 
                        'part', 'als_cluster', 'question_acc_mean', 
                        'prior_question_elapsed_time', 
                        'prior_question_had_explanation', 'segment']].values):
        user_id = int(row[0])
        answered_correctly = int(row[1])
        timestamp = int(row[2])
        content_id = int(row[3])
        part = row[4]
        als_cluster = row[5]
        q_level = row[6]
        elaps_time = row[7]
        had_expl = row[8]
        segment = row[9]
        
        acc_diff_q = int(answered_correctly * 1000 - q_level * 1000)
        neg_acc_mul_q = int((1 - answered_correctly) * q_level * 1000)
        
        if (df_type == 'train') and (segment == 1):
            i += 1
        elif df_type == 'valid':
            i += 1
        
        ## user_id
        user_row = list(dict_agg_user[user_id])
        
        # window features
        if (df_type == 'valid') or (segment == 1):
            for k, j in enumerate(windows_list):
                user_window_array[i, k] = user_row[7] - user_row[16+j] 
        
        for j in range(USER_DIM - 1, 0, -1):
            user_row[17 + j] = user_row[16 + j]
        user_row[17] = user_row[7]
        
        # main features
        user_row[9] += elaps_time
        user_row[10] += had_expl
        timedelta = timestamp - user_row[1]
        timedelta = timedelta if timedelta <= max_td else max_td
        timedelta_neg = timestamp - user_row[2] if user_row[2] >= 0 else 0
        timedelta_neg = timedelta_neg if timedelta_neg <= max_td else max_td
        timedelta_pos = timestamp - user_row[3] if user_row[3] >= 0 else 0
        timedelta_pos = timedelta_pos if timedelta_pos <= max_td else max_td
        timedelta_2 = timestamp - user_row[4]
        timedelta_2 = timedelta_2 if timedelta_2 <= max_td else max_td
        timedelta_3 = timestamp - user_row[5]
        timedelta_3 = timedelta_3 if timedelta_3 <= max_td else max_td
        
        if (df_type == 'valid') or (segment == 1):
            user_array[i, 0] = user_row[0]
            user_array[i, 1] = timedelta
            user_array[i, 2] = timedelta_neg
            user_array[i, 3] = timedelta_pos
            user_array[i, 4] = timedelta_2
            user_array[i, 5] = timedelta_3
            for j in range(5):
                user_array[i, j+6] = user_row[j+6]
        
        user_row[0] += 1
        user_row[5] = user_row[4]
        user_row[4] = user_row[1]
        user_row[1] = timestamp
        user_row[2] = timestamp if answered_correctly == 0 else user_row[2]
        user_row[3] = timestamp if answered_correctly == 1 else user_row[3]
        user_row[6] += int(answered_correctly)
        user_row[7] += acc_diff_q
        user_row[8] += neg_acc_mul_q
        
        # session features
        if timedelta >= HOUR:
            user_row[11] = 0
            user_row[13] = 0
            user_row[15] = 0
        if timedelta >= WEEK:
            user_row[12] = 0
            user_row[14] = 0 
            user_row[16] = 0
            
        if (df_type == 'valid') or (segment == 1):
            for j in range(6):
                user_sess_array[i, j] = user_row[11+j]
         
        for j in range(2):
            user_row[11+j] += 1
            user_row[13+j] += int(answered_correctly)
            user_row[15+j] += acc_diff_q 
        
        dict_agg_user[user_id] = tuple(user_row)
        
        ## user-part
        user_id_part = int(user_id*10 + part)
        user_part_row = list(dict_agg_user_part[user_id_part])
        user_part_row[4] += elaps_time
        user_part_row[5] += had_expl
        
        if (df_type == 'valid') or (segment == 1):
            user_part_array[i, 0] = user_part_row[0]
            for j in range(4):
                user_part_array[i, j+1] = user_part_row[j+1]
        
        user_part_row[0] += 1
        user_part_row[1] += int(answered_correctly)
        user_part_row[2] += acc_diff_q
        user_part_row[3] += neg_acc_mul_q
        dict_agg_user_part[user_id_part] = tuple(user_part_row)
        
        ## user-als
        user_id_als = int(user_id*1000 + als_cluster)
        user_als_row = list(dict_agg_user_als[user_id_als])
        timedelta = timestamp - user_als_row[1]
        timedelta = timedelta if timedelta <= max_td else max_td
        if (df_type == 'valid') or (segment == 1):
            user_als_array[i, 0] = user_als_row[0]
            user_als_array[i, 1] = timedelta
            user_als_array[i, 2] = user_als_row[2]
            
        user_als_row[0] += 1
        user_als_row[1] = timestamp
        user_als_row[2] += acc_diff_q
        dict_agg_user_als[user_id_als] = tuple(user_als_row)
        
        ## user-question
        user_content_id = int(user_id*100000 + content_id)
        uq_row = list(dict_agg_uq[user_content_id])
        timedelta = timestamp - uq_row[1]
        timedelta = timedelta if timedelta <= max_td else max_td
        if (df_type == 'valid') or (segment == 1):
            uq_array[i, 0] = uq_row[0]
            uq_array[i, 1] = timedelta
            uq_array[i, 2] = uq_row[2]
            
        uq_row[0] += 1
        uq_row[1] = timestamp
        uq_row[2] += acc_diff_q
        dict_agg_uq[user_content_id] = tuple(uq_row)
        
    return user_array, user_sess_array, user_window_array, user_part_array, user_als_array, uq_array
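
The window features in `calc_arrays` are obtained without storing individual answers: the function keeps the last few values of a cumulative sum and subtracts an old value from the current one. A simplified sketch of that idea using a `deque` is shown here; the helper name `window_sums` and the toy values are hypothetical.

```python
from collections import deque


def window_sums(values, windows=(5, 10, 15)):
    """For each position, sum of the previous w values for every w in windows.

    Only cumulative sums are stored: sum of last w values = cum_now - cum_w_steps_ago.
    """
    depth = max(windows)
    # history[0] is the cumulative sum before the most recent value,
    # history[w - 1] is the cumulative sum w steps ago (0 before any data).
    history = deque([0] * depth, maxlen=depth)
    cum = 0
    out = []
    for v in values:
        out.append(tuple(cum - history[w - 1] for w in windows))
        history.appendleft(cum)   # the oldest entry drops off the right end
        cum += v
    return out


print(window_sums([1, 2, 3, 4], windows=(1, 2, 3)))
# [(0, 0, 0), (1, 1, 1), (2, 3, 3), (3, 5, 6)]
```

When fewer than `w` values exist, the window simply covers the whole history, which matches the zero-initialised history slots in the user tuple.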

In [None]:
def add_user_features(df_, user_array, user_sess_array, 
                      user_window_array, user_part_array, 
                      user_als_array, uq_array, 
                      windows_list=[5, 10, 15]):    
    target_names = ['acc', 'acc_diff_q_level', 'neg_acc_mul_q_level']
    
    # user
    df_['user_id_cumcount'] = user_array[:, 0]
    df_['user_id_time_since_last_ans'] = user_array[:, 1]
    df_['user_id_time_since_last_incorr_ans'] = user_array[:, 2]
    df_['user_id_time_since_last_corr_ans'] = user_array[:, 3]
    df_['user_id_time_since_last_ans_2'] = user_array[:, 4]
    df_['user_id_time_since_last_ans_3'] = user_array[:, 5]
    df_['user_id_elapsed_time_mean'] = (user_array[:, 9]/(1+df_['user_id_cumcount'])).astype('float32')
    df_['user_id_had_expl_mean'] = (user_array[:, 10]/(1+df_['user_id_cumcount'])).astype('float32')
    for j, name in enumerate(target_names):
        df_['user_id_' + name + '_mean'] = (user_array[:, j+6]/df_['user_id_cumcount']).astype('float32')
    
    # user-window
    for k, j in enumerate(windows_list):
        df_['user_id_acc_diff_q_level_window_'+str(j)] = user_window_array[:, k]
    
    # user-session
    for j, p in enumerate(['hour', 'week']): 
        df_['user_session_cumcount_'+p] = user_sess_array[:, j]
        df_['user_session_acc_mean_'+p] = (user_sess_array[:, j+2]/df_['user_session_cumcount_'+p]).astype('float32')
        df_['user_session_acc_diff_q_level_mean_'+p] = (user_sess_array[:, j+4]/df_['user_session_cumcount_'+p]).astype('float32')
    
    # user-part
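    df_['user_id_part_cumcount'] = user_part_array[:, 0]
    # NOTE: despite its name, this column holds the running count of correct answers
    # (it is not divided by user_id_part_cumcount)
    df_['user_id_part_acc_mean'] = user_part_array[:, 1]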
    df_['user_id_part_elapsed_time_mean'] = (user_part_array[:, 4]/(1+df_['user_id_part_cumcount'])).astype('float32')
    df_['user_id_part_had_expl_mean'] = (user_part_array[:, 5]/(1+df_['user_id_part_cumcount'])).astype('float32')
    for j, name in enumerate(target_names[1:]):
        df_['user_id_part_' + name + '_mean'] = (user_part_array[:, j+2]/df_['user_id_part_cumcount']).astype('float32')
    
    # user-als
    df_['user_id_als_cumcount'] = user_als_array[:, 0]
    df_['user_id_als_timedelta'] = user_als_array[:, 1]
    df_['user_id_als_acc_diff_q_level_mean'] = (user_als_array[:, 2]/df_['user_id_als_cumcount']).astype('float32')
    
    # user-content
    df_['user_content_id_cumcount'] = uq_array[:, 0]
    df_['user_content_id_timedelta'] = uq_array[:, 1]
    df_['user_content_id_acc_diff_q_level_mean'] = (uq_array[:, 2]/df_['user_content_id_cumcount']).astype('float32')
    return df_

## Read & preprocess data

Here we use the train and valid dataframes generated by the following notebook by [@tito](https://www.kaggle.com/its7171): https://www.kaggle.com/its7171/cv-strategy.

Note that the notebook requires a significant amount of memory (we had 64 GB of RAM on our machine), so the full run cannot be reproduced in the Kaggle environment. To run all cells there, set the "IS_KAGGLE" param to True; the train data is then truncated to its first 1,000,000 rows.

In [None]:
## paths
input_path = '../input/riiid-test-answer-prediction'
cv_path = '../input/riiid-cross-validation-files'
external_path = '../input/external-data'

train_pickle = f'{cv_path}/cv1_train.pickle'
valid_pickle = f'{cv_path}/cv1_valid.pickle'

question_file = f'{input_path}/questions.csv'

embeddings_file = f'{external_path}/df_content_emb.csv'
als_clusters_file = f'{external_path}/als_clusters.csv'
feld_needed = ['row_id', 'user_id', 'content_id', 'content_type_id', 'answered_correctly', 
               'prior_question_elapsed_time', 'prior_question_had_explanation', 'timestamp']

In [None]:
## params

# Kaggle environment
IS_KAGGLE = True

# dimension of question "embeddings"
DIM = 15

# maximum value of timedelta for clipping and threshold to determine outdated "user-question" pairs
MAX_TD = 6048000

# subsample of train data, actually used for training
TRAIN_FRAC = 0.75

# rolling windows for user statistics
WINDOWS_LIST = [5, 10, 15]

In [None]:
df_train = pd.read_pickle(train_pickle)[feld_needed]
df_valid = pd.read_pickle(valid_pickle)[feld_needed]

if IS_KAGGLE:
    df_train = df_train.iloc[:1000000].reset_index(drop=True)

## timestamps are in milliseconds; divide by 100 (units of 0.1 s) to save memory for dictionaries
df_train['timestamp'] = (df_train.timestamp/1e2).astype('int64')
df_valid['timestamp'] = (df_valid.timestamp/1e2).astype('int64')

df_train = df_train.loc[df_train.content_type_id == False].reset_index(drop=True)
df_valid = df_valid.loc[df_valid.content_type_id == False].reset_index(drop=True)
df_train.shape, df_valid.shape
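
Because the raw `timestamp` column is in milliseconds, dividing it by 100 leaves values in units of 0.1 s, and every time constant in this notebook has to be read in those units. A small sanity check of the constants, written as a sketch with hypothetical helper names (`UNITS_PER_SECOND`, `clip_timedelta`):

```python
UNITS_PER_SECOND = 10            # timestamps are milliseconds / 100, i.e. 0.1 s steps

HOUR = 3600 * UNITS_PER_SECOND   # same value as HOUR inside calc_arrays
WEEK = 24 * 7 * HOUR             # same value as WEEK inside calc_arrays
MAX_TD = 6048000                 # clipping threshold from the params cell


def clip_timedelta(current_ts, previous_ts, max_td=MAX_TD):
    """Time since the previous event, capped at max_td (both in 0.1 s units)."""
    return min(current_ts - previous_ts, max_td)


print(HOUR, WEEK, MAX_TD == WEEK)      # 36000 6048000 True
print(clip_timedelta(10_000_000, 0))   # a gap longer than a week is capped: 6048000
```

In other words, `MAX_TD` corresponds to exactly one week, which is also the longer of the two session periods.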

In [None]:
df_train['prior_question_had_explanation'] = df_train['prior_question_had_explanation'].fillna(False).astype('int8')
df_valid['prior_question_had_explanation'] = df_valid['prior_question_had_explanation'].fillna(False).astype('int8')

Due to memory issues, we couldn't train our model on the full *df_train*. So, we decided to train it on only 75% of the rows (segment == 1); the remaining rows (segment == 0) are still used to update the user dictionaries:

In [None]:
## segments
train_ids = df_train[['row_id']].sample(frac=TRAIN_FRAC, random_state=0).row_id.unique()
df_train['segment'] = np.where(df_train.row_id.isin(train_ids), 1, 0).astype('int8')
df_valid['segment'] = np.ones(df_valid.shape[0], dtype='int8')
df_train.segment.value_counts()

## Generate question features

In [None]:
df_train['elapsed_time'] = df_train.groupby('user_id').prior_question_elapsed_time.shift(periods=-1)
df_train['had_explanation'] = df_train.groupby('user_id').prior_question_had_explanation.shift(periods=-1)
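
The two `shift(periods=-1)` calls move each `prior_question_*` value one row up within a user, so that it sits on the question it actually describes. A pure-Python sketch of the same alignment for one user's column follows; `shift_back` is a hypothetical helper, not a pandas function.

```python
def shift_back(prior_values):
    """Align 'prior question' values with the question they describe.

    prior_values[i] describes question i - 1, so question i receives
    prior_values[i + 1]; the user's last question has no value yet (None).
    """
    if not prior_values:
        return []
    return list(prior_values[1:]) + [None]


# prior_question_elapsed_time for one user, in order of answering;
# the first entry is None because there is no earlier question.
prior_elapsed = [None, 21000, 18000, 30000]
print(shift_back(prior_elapsed))  # [21000, 18000, 30000, None]
```

The trailing `None` corresponds to the NaN that pandas leaves on each user's last row, which is ignored when the per-question mean and std are aggregated.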

#### Adding main question statistics:

In [None]:
df_question_mean = (
    df_train[['content_id', 'answered_correctly', 
              'elapsed_time', 'had_explanation']]
    .groupby(['content_id'])
    .agg({'answered_correctly': ['count', 'mean', 'std'], 
          'elapsed_time': ['mean', 'std'], 
          'had_explanation': ['mean', 'std']})
    .reset_index()
)
df_train.drop(columns=['elapsed_time', 'had_explanation'], inplace=True)

columns = ['content_id']
columns += ['question_acc_count', 'question_acc_mean', 'question_acc_std']
columns += ['question_elapsed_time_mean', 'question_elapsed_time_std']
columns += ['question_had_expl_mean', 'question_had_expl_std']
df_question_mean.columns = columns
for col in columns:
    df_question_mean[col] = df_question_mean[col].astype('float32')
    
df_train = pd.merge(df_train, df_question_mean, on='content_id', how="left")
df_valid = pd.merge(df_valid, df_question_mean, on='content_id', how="left")
df_train.fillna(0, inplace=True)
df_valid.fillna(0, inplace=True)

df_train.shape, df_valid.shape

#### Adding question "embeddings":

In [None]:
df_emb = pd.read_csv(embeddings_file).rename(columns={'dim_'+str(i): 'content_id_dim_'+str(i) for i in range(DIM)})
df_emb['content_id'] = df_emb.content_id.astype('int32')
for f in list(df_emb.columns)[1:]:
    df_emb[f] = df_emb[f].astype('float16')
print(df_emb.shape)

df_train = pd.merge(df_train, df_emb, on='content_id', how="left")
df_valid = pd.merge(df_valid, df_emb, on='content_id', how="left")

df_train.shape, df_valid.shape

#### Adding "part" and "als_cluster" fields:

In [None]:
df_part = pd.read_csv(question_file)[['question_id', 'part']]
df_part.columns = ['content_id', 'part']
df_train = pd.merge(df_train, df_part, on = 'content_id', how = 'left')
df_valid = pd.merge(df_valid, df_part, on = 'content_id', how = 'left')

df_als = pd.read_csv(als_clusters_file)[['question_id', 'cluster']]
df_als.columns = ['content_id', 'als_cluster']
df_train = pd.merge(df_train, df_als, on = 'content_id', how = 'left')
df_valid = pd.merge(df_valid, df_als, on = 'content_id', how = 'left')

df_train.shape, df_valid.shape

#### Adding "part" and "als_cluster" statistics:

In [None]:
df_part_mean = df_train[['part', 'answered_correctly']].groupby(['part']).agg(['mean']).reset_index()
df_part_mean.columns = ['part', 'part_acc_mean']
df_train = pd.merge(df_train, df_part_mean, on='part', how="left")
df_valid = pd.merge(df_valid, df_part_mean, on='part', how="left")
df_train.shape, df_valid.shape

In [None]:
df_als_mean = df_train[['als_cluster', 'answered_correctly']].groupby(['als_cluster']).agg(['mean']).reset_index()
df_als_mean.columns = ['als_cluster', 'als_acc_mean']
df_train = pd.merge(df_train, df_als_mean, on='als_cluster', how="left")
df_valid = pd.merge(df_valid, df_als_mean, on='als_cluster', how="left")
df_train.shape, df_valid.shape

#### Saving all necessary files (doesn't make sense in the Kaggle environment):

In [None]:
if not IS_KAGGLE:
    df_part_als = df_part.merge(df_als, on='content_id')
    df_part_als.to_csv(f'{external_path}/df_part_als.csv', index=False)
    df_question_mean.to_csv(f'{external_path}/df_question_mean.csv', index=False)
    df_part_mean.to_csv(f'{external_path}/df_part_mean.csv', index=False)
    df_als_mean.to_csv(f'{external_path}/df_als_mean.csv', index=False)

    del df_part_als, df_question_mean, df_part_mean, df_als_mean
    gc.collect()

In [None]:
df_train.memory_usage(deep=True).sum()/2**30

## Generate user features

### 1) train

To update features consistently on train/valid/inference, we use simple defaultdicts from the "collections" library:

In [None]:
dict_agg_user = defaultdict(lambda: tuple([0, 0, -1, -1] + [0]*(13+max(WINDOWS_LIST))))
dict_agg_user_part = defaultdict(lambda: (0, 0, 0, 0, 0, 0))
dict_agg_user_als = defaultdict(lambda: (0, 0, 0))
dict_agg_uq = defaultdict(lambda: (0, 0, 0))
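
The default tuple for `dict_agg_user` is easier to follow with its slots named. The layout below is read off the indices used in `calc_arrays`; the slot descriptions and the name `USER_SLOTS` are illustrative labels rather than identifiers from the solution.

```python
WINDOWS_LIST = [5, 10, 15]
depth = max(WINDOWS_LIST)

# index -> meaning of each slot in a dict_agg_user tuple (labels are illustrative)
USER_SLOTS = {
    0: "count of answered questions",
    1: "timestamp of the last question",
    2: "timestamp of the last incorrect answer (-1 if none)",
    3: "timestamp of the last correct answer (-1 if none)",
    4: "timestamp of the question before the last one",
    5: "timestamp of the question two before the last one",
    6: "sum of answered_correctly",
    7: "sum of 1000 * (accuracy - question_acc_mean)",
    8: "sum of 1000 * (1 - accuracy) * question_acc_mean",
    9: "sum of prior_question_elapsed_time",
    10: "sum of prior_question_had_explanation",
    11: "session count (hour)",
    12: "session count (week)",
    13: "session sum of answered_correctly (hour)",
    14: "session sum of answered_correctly (week)",
    15: "session sum of accuracy difference (hour)",
    16: "session sum of accuracy difference (week)",
}
# slots 17 .. 17 + depth - 1 hold earlier values of slot 7 for the rolling windows
for k in range(depth):
    USER_SLOTS[17 + k] = f"value of slot 7 as it was {k + 1} question(s) ago"

default_user = tuple([0, 0, -1, -1] + [0] * (13 + depth))
assert len(default_user) == len(USER_SLOTS)   # 32 slots for depth 15
print(len(default_user))
```

Storing plain tuples keeps the per-user memory footprint small, at the cost of positional indexing in the update loop.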

Calculating arrays with features for train data with *segment == 1* and updating dictionaries:

In [None]:
%%time
user_array, user_sess_array, \
user_window_array, user_part_array, \
user_als_array, uq_array = calc_arrays(df_train, 
                                       dict_agg_user,
                                       dict_agg_user_part,
                                       dict_agg_user_als,
                                       dict_agg_uq, 
                                       df_type='train', 
                                       windows_list=WINDOWS_LIST)

del dict_agg_uq; gc.collect()

In [None]:
df_train_ = df_train[df_train.segment == 1].reset_index(drop=True)
df_train_.memory_usage(deep=True).sum()/2**30

One of the dictionaries (dict_agg_uq) is used to create "user-question" statistics. This dictionary is very large, so we delete it and build a new one without outdated "user-question" pairs.

We consider a "user-question" pair outdated if the following condition is met:

    max(user_timestamp) - user_timestamp >= MAX_TD
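
As a toy illustration of this rule (not the code used in the solution), the following sketch drops interactions that are at least `max_td` older than the user's latest timestamp and then aggregates what remains per (user, question) pair. The function name `rebuild_uq_state` and the sample data are hypothetical.

```python
from collections import defaultdict


def rebuild_uq_state(interactions, max_td):
    """interactions: list of (user_id, content_id, timestamp).

    Keeps only rows with max(user timestamp) - timestamp < max_td and returns
    {(user_id, content_id): (count, last timestamp)} for the surviving rows.
    """
    last_seen = defaultdict(int)
    for user_id, _, ts in interactions:
        last_seen[user_id] = max(last_seen[user_id], ts)

    state = {}
    for user_id, content_id, ts in interactions:
        if last_seen[user_id] - ts >= max_td:
            continue                      # outdated pair: do not keep it in memory
        count, last_ts = state.get((user_id, content_id), (0, 0))
        state[(user_id, content_id)] = (count + 1, max(last_ts, ts))
    return state


toy = [(1, 10, 0), (1, 11, 900), (1, 10, 1000), (2, 10, 5)]
print(rebuild_uq_state(toy, max_td=500))
# {(1, 11): (1, 900), (1, 10): (1, 1000), (2, 10): (1, 5)}
```

The first interaction of user 1 is dropped because it lies 1000 units before that user's latest activity, so only one occurrence of the pair (1, 10) is counted.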

In [None]:
## correct user-content dict
df_train = df_train[['user_id', 'content_id', 'timestamp', 'answered_correctly', 'question_acc_mean']].reset_index(drop=True)
df_train['max_user_tmp'] = df_train.groupby('user_id').timestamp.transform('max')
df_train = df_train.query(f"max_user_tmp - timestamp  < {MAX_TD}").drop(columns=['max_user_tmp']).reset_index(drop=True)
df_train['user_content_id'] = (df_train.user_id*100000 + df_train.content_id).astype('int64')
df_train['acc_diff_q_level'] = (1000 * df_train.answered_correctly - 1000 * df_train.question_acc_mean).astype('int32')
        
df_agg_uq = df_train.groupby('user_content_id').agg({'user_id': 'count', 'timestamp': 'max', 'acc_diff_q_level': 'sum'})
del df_train; gc.collect()
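df_agg_uq.columns = ['count', 'timestamp', 'acc_diff_q_level_mean'] 
# NOTE: the third slot is stored as a mean here, whereas calc_arrays
# accumulates the same slot as a running sum of acc_diff_q
df_agg_uq['acc_diff_q_level_mean'] /= df_agg_uq['count']
dict_agg_uq = defaultdict(lambda: (0, 0, 0), df_agg_uq.apply(tuple, axis=1).to_dict())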
del df_agg_uq; gc.collect()

Finally, generating features for train data with *segment == 1*:

In [None]:
df_train_ = add_user_features(df_train_, user_array, user_sess_array, 
                              user_window_array, user_part_array, 
                              user_als_array, uq_array, 
                              windows_list=WINDOWS_LIST)

del user_array, user_sess_array, user_window_array, user_part_array, user_als_array, uq_array; gc.collect()

#### Saving dictionaries for debug (doesn't make sense in the Kaggle environment):

In [None]:
if not IS_KAGGLE:
    ## user dict
    with open(f"{external_path}/user_dict_debug.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_user), f)
    print("User dict saved!", len(dict_agg_user))

    ## user part dict
    with open(f"{external_path}/user_part_dict_debug.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_user_part), f)
    print("User part dict saved!", len(dict_agg_user_part))

    ## user als dict
    with open(f"{external_path}/user_als_dict_debug.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_user_als), f)
    print("User als dict saved!", len(dict_agg_user_als))

    ## user-question dict
    with open(f"{external_path}/uq_dict_debug.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_uq), f)
    print("User-questions dict saved!", len(dict_agg_uq))

### 2) valid

Calculating arrays with features for valid data and updating dictionaries:

In [None]:
%%time
user_array, user_sess_array, \
user_window_array, user_part_array, \
user_als_array, uq_array = calc_arrays(df_valid, 
                                       dict_agg_user,
                                       dict_agg_user_part,
                                       dict_agg_user_als,
                                       dict_agg_uq, 
                                       df_type='valid', 
                                       windows_list=WINDOWS_LIST)
df_valid.shape

Generating features for valid data:

In [None]:
df_valid = add_user_features(df_valid, user_array, user_sess_array, 
                             user_window_array, user_part_array, 
                             user_als_array, uq_array, 
                             windows_list=WINDOWS_LIST)

del user_array, user_sess_array, user_window_array, user_part_array, user_als_array, uq_array; gc.collect()

#### Saving final dictionaries (doesn't make sense in the Kaggle environment):

In [None]:
if not IS_KAGGLE:
    ## user dict
    with open(f"{external_path}/user_dict.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_user), f)
    print("User dict saved!", len(dict_agg_user))

    ## user part dict
    with open(f"{external_path}/user_part_dict.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_user_part), f)
    print("User part dict saved!", len(dict_agg_user_part))

    ## user als dict
    with open(f"{external_path}/user_als_dict.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_user_als), f)
    print("User als dict saved!", len(dict_agg_user_als))

    ## user-question dict
    with open(f"{external_path}/uq_dict.pickle.dat", "wb") as f:
        pickle.dump(dict(dict_agg_uq), f)
    print("User-questions dict saved!", len(dict_agg_uq))

In [None]:
del dict_agg_uq, dict_agg_user, dict_agg_user_part, dict_agg_user_als
gc.collect()

## Modelling

List of used features:

In [None]:
selected_features = [
 ## raw features
 'prior_question_elapsed_time',
 'prior_question_had_explanation',
 
 ## question statistics
 'question_acc_count',
 'question_acc_mean',
 'question_acc_std',  
 'question_elapsed_time_mean', 
 'question_elapsed_time_std',
 'question_had_expl_mean', 
 'question_had_expl_std',

 ## user statistics
 'user_id_cumcount',  
 'user_id_acc_mean',
 'user_id_acc_diff_q_level_mean',
 'user_id_neg_acc_mul_q_level_mean',
 'user_id_elapsed_time_mean',
 'user_id_had_expl_mean',
    
 ## timedelta features
 'user_id_time_since_last_ans',
 'user_id_time_since_last_corr_ans',
 'user_id_time_since_last_incorr_ans',
 'user_id_time_since_last_ans_2',
 'user_id_time_since_last_ans_3', 

 ## user-session statistics
 'user_session_cumcount_hour',
 'user_session_acc_mean_hour',
 'user_session_acc_diff_q_level_mean_hour',   
 'user_session_cumcount_week',
 'user_session_acc_mean_week',
 'user_session_acc_diff_q_level_mean_week',
  
 ## user-part statistics
 'user_id_part_cumcount',
 'user_id_part_acc_mean',
 'user_id_part_acc_diff_q_level_mean',
 'user_id_part_neg_acc_mul_q_level_mean',
 'user_id_part_elapsed_time_mean',

 ## user-als statistics
 'user_id_als_cumcount',
 'user_id_als_timedelta', 
 'user_id_als_acc_diff_q_level_mean', 
    
 ## user-question statistics
 'user_content_id_cumcount',
 'user_content_id_timedelta', 
 'user_content_id_acc_diff_q_level_mean',
    
 ## other features
 'part_acc_mean',
 'als_acc_mean'
] 

## question "embeddings"
selected_features += ['content_id_dim_' + str(i) for i in range(DIM)]

## user-window statistics
selected_features += [f'user_id_acc_diff_q_level_window_{j}' for j in WINDOWS_LIST]

len(selected_features)

#### Writing train and valid data to csv (doesn't make sense in the Kaggle environment):

In [None]:
if not IS_KAGGLE:
    df_train_.to_csv(f'{external_path}/df_train.csv', index=False)
    df_valid.to_csv(f'{external_path}/df_valid.csv', index=False)

In [None]:
if not IS_KAGGLE:
    get_type = lambda f: 'float32' if 'dim' not in f else 'float16'

    df_train_ = pd.read_csv(f'{external_path}/df_train.csv',
                            usecols=['row_id', 'answered_correctly'] + selected_features,
                            dtype={f: get_type(f) for f in selected_features})
    df_valid = pd.read_csv(f'{external_path}/df_valid.csv',
                           usecols=['row_id', 'answered_correctly'] + selected_features,
                           dtype={f: get_type(f) for f in selected_features})

    df_train_.memory_usage(deep=True).sum()/2**30

In [None]:
%%time

lgb_train = lgb.Dataset(df_train_[selected_features].values.astype('float32'), df_train_.answered_correctly)
lgb_valid = lgb.Dataset(df_valid[selected_features].values.astype('float32'), df_valid.answered_correctly)
#del df_train_; gc.collect()

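# NOTE: the verbose_eval and early_stopping_rounds keyword arguments work with
# LightGBM 3.x; LightGBM 4.0 removed them in favour of the
# lgb.log_evaluation(...) and lgb.early_stopping(...) callbacks.
model = lgb.train(
    {'objective': 'binary'},
    lgb_train,
    valid_sets=[lgb_train, lgb_valid],
    verbose_eval=50,
    num_boost_round=15000,
    early_stopping_rounds=40
)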

In [None]:
%%time
if not IS_KAGGLE:
    df_valid['score'] = model.predict(df_valid[selected_features])
    print(roc_auc_score(df_valid.answered_correctly, df_valid.score))

#### Saving the scored validation sample (csv) and the model (pickle) (doesn't make sense in the Kaggle environment):

In [None]:
if not IS_KAGGLE:
    df_valid.to_csv(f"{external_path}/df_valid_scored.csv", index=False)
    with open(f"{external_path}/model_v21.pickle.dat", "wb") as f:
        pickle.dump(model, f)

Feature importances table (note that with flag *IS_KAGGLE = True* this table will be different from the result of training on all data):

In [None]:
df_importances = pd.DataFrame(selected_features, columns=['feature_name'])
df_importances['gain'] = model.feature_importance('gain')
df_importances = df_importances.sort_values('gain', ascending=False).reset_index(drop=True)
df_importances.head(50)