# Welcome!

This is the final version of [this](https://www.kaggle.com/markwijkhuizen/riiid-training-and-prediction-using-a-state) notebook, which I shared at the start of this competition.

Several features have been added since then and some RAM efficiency improvements have been made.

I learned a lot during this competition from the public notebooks and discussions. References to the notebooks and discussions I got valuable ideas from are mentioned throughout this notebook.

If you have any questions, just leave a comment and I will answer it soon.

In [None]:
import numpy as np
import pandas as pd

import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import roc_auc_score
from multiprocessing import cpu_count
from tqdm.notebook import tqdm
from memory_profiler import memory_usage
from cairosvg import svg2png
from PIL import Image
from io import BytesIO
from collections import deque
from glob import glob
from statistics import mean

import gc
import os
import sys
import re
import pickle
import math

In [None]:
VERSION = 'V1G'
NUM_BOOST_ROUND = 5000
VERBOSE_EVAL = 10
METRICS = ['auc']
N_ROWS = 99271300

# used for setting the index of a new DataFrame
def get_index_np():
    return np.arange(N_ROWS)

In [None]:
# Loads a feature array from disk
# Optionally pass a list of indices to select specific training/validation rows
# and a data type (e.g. np.float32) to cast the result
def load_feature(feature, idxs=None, dtype=None):
    file_path = f'/kaggle/input/riiid-answer-correctness-prediction-features-temp/FEATURES_{VERSION}/{feature}.npz'
    values = np.load(file_path, allow_pickle=True)['v']
    # previously the function returned None when only one of idxs/dtype was given
    if idxs is not None:
        values = values[idxs]
    if dtype is not None:
        values = values.astype(dtype)
    return values

In [None]:
# This simply shows all features (file name without directory and .npz extension)
for file_path in glob(f'/kaggle/input/riiid-answer-correctness-prediction-features-temp/FEATURES_{VERSION}/*.npz'):
    print(os.path.splitext(os.path.basename(file_path))[0])

# All features and a short description

# Given features

**prior_question_elapsed_time:** given feature

**prior_question_had_explanation:** given feature

**bundle_id:** given feature retrieved by merging questions df with train df, categorical feature

# User features

**mean_user_accuracy:** expanding mean user accuracy

**mean_accuracy_diff:** expanding mean of the difference between the mean user accuracy and mean content accuracy

**mean_user_content_accuracy:** expanding mean of the mean content accuracy of questions a user answered

**answered_correctly_user:** number of questions a user answered correctly

**answered_user:** number of questions a user answered
    
**mean_user_accuracy_r10:** rolling mean user accuracy with window size 10

**mean_user_accuracy_r25:** rolling mean user accuracy with window size 25
    
**mean_user_accuracy_r100:** rolling mean user accuracy with window size 100 
    
# Content features
    
**mean_content_accuracy:** mean content accuracy given retry and prior_question_had_explanation (explained later)

**content_id_pos_diff:** difference between the user's actual position when answering the question and the mean position at which the question is asked.
If a question is on average asked as the 100th question and a user gets it as their 10th question, this feature is 10-100=-90

**content_count:** number of times a question is asked

# Part features
    
**part:** part the question belongs to, categorical feature

**mean_user_part_accuracy:** expanding mean accuracy for a user for the part the question belongs to

**answered_part_user:** number of questions a user answered in the part the question belongs to
    
# Tag features
    
**tag_1:** first tag of the question, categorical feature
    
**tag_2:** second tag of the question, categorical feature, -1 if the question has no second tag

# User content features
    
**hmean_user_content_accuracy:** harmonic mean of the mean_user_accuracy and mean_content_accuracy, based on [this](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189569) discussion
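To illustrate why a harmonic mean can be a useful combination here, the sketch below shows that it is pulled towards the smaller of the two accuracies. The helper name `harmonic_mean` and the guard for a zero denominator are assumptions made for this example; the prediction code in this notebook computes the same formula directly on DataFrame columns.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two accuracies; returns 0.0 when both are zero."""
    total = a + b
    if total == 0:
        # assumption for this sketch: avoid dividing by zero
        return 0.0
    return 2 * a * b / total


# a strong user (0.9) on a hard question (0.1) ends up close to the lower value
strong_user_hard_question = harmonic_mean(0.9, 0.1)
# two equal accuracies give exactly that accuracy back
average_case = harmonic_mean(0.5, 0.5)

print(strong_user_hard_question, average_case)
```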

# Last interaction elapsed time

All three features are based on [this](https://www.kaggle.com/ragnar123/riiid-model-lgbm) notebook

**last_interaction_elapsed_time_l1:** timestamp difference with last interaction

**last_interaction_elapsed_time_l2:** timestamp difference with two interactions earlier


**last_interaction_elapsed_time_l3:** timestamp difference with three interactions earlier


# Other

**attempt:** number of times a user has attempted the question, based on [this](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/194266) discussion

**lectures_seen:** number of lectures a user has seen


# Lastly added features

**last_question_mean_acc_diff:** difference in mean content accuracy between the current and the last question

**user_part_answered_consecutive:** number of consecutive questions answered in the current part.

**mean_user_mean_content_acc_diff:** absolute difference between the mean_user_accuracy and mean_content_accuracy

* The next two features are based on [this](https://www.kaggle.com/ragnar123/riiid-model-lgbm) notebook

**last_correct_time_elapsed:** timestamp difference between current question and last correctly answered question

**last_incorrect_time_elapsed:** timestamp difference between current question and last incorrectly answered question
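The features in this notebook are loaded precomputed from disk, so the code that produced them is not shown. As a rough idea of what "expanding" and "rolling" mean here, the following sketch computes lagged user features on a toy DataFrame. The toy data, the window size of 2 and the column name `mean_user_accuracy_r2` are assumptions for illustration only, not the original feature code.

```python
import pandas as pd

# toy interactions, already sorted by time within each user
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'answered_correctly': [1, 0, 1, 0, 1],
})

grouped = toy.groupby('user_id')['answered_correctly']

# shift by one row per user so a row only sees strictly earlier answers
prev = grouped.shift(1)

# number of questions answered before the current one
toy['answered_user'] = grouped.cumcount()
# number of correct answers before the current one
toy['answered_correctly_user'] = prev.fillna(0).groupby(toy['user_id']).cumsum()
# expanding mean accuracy; NaN for a user's first question (0 / 0)
toy['mean_user_accuracy'] = toy['answered_correctly_user'] / toy['answered_user']
# rolling mean accuracy over the previous 2 answers
toy['mean_user_accuracy_r2'] = (
    prev.groupby(toy['user_id'])
    .rolling(2, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)

print(toy)
```

The shift is what keeps the target of the current row out of its own features.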

In [None]:
# these are given features, bundle_id is retrieved by merging the questions df with the train df
given_features = [
    'prior_question_elapsed_time',
    'prior_question_had_explanation',
    'bundle_id',
]

deduced_features = [
    # user features
    'mean_user_accuracy',
    'mean_accuracy_diff',
    'mean_user_content_accuracy',
    'answered_correctly_user',
    'answered_user',
    'mean_user_accuracy_r10',
    'mean_user_accuracy_r25',
    'mean_user_accuracy_r100',
    # content features
    'mean_content_accuracy',
    'content_id_pos_diff',
    'content_count',
    # part features
    'part',
    'mean_user_part_accuracy',
    'answered_part_user',
    # tag features
    'tag_1',
    'tag_2',
    # user content features
    'hmean_user_content_accuracy',
    # last interaction elapsed time
    'last_interaction_elapsed_time_l1',
    'last_interaction_elapsed_time_l2',
    'last_interaction_elapsed_time_l3',
    # other
    'attempt',
    'lectures_seen',
    
    # lastly added features
    'last_question_mean_acc_diff',
    'user_part_answered_consecutive',
    'mean_user_mean_content_acc_diff',
    'last_correct_time_elapsed',
    'last_incorrect_time_elapsed',
]

features = given_features + deduced_features

features_df_cols = [
    'user_id', 'content_id', 'part', 'tags', # merge keys
    'tags_label', 'answered_user', # deduced data
    'answered_correctly_user', 'mean_user_accuracy', 'mean_content_accuracy', # deduced features
]

target = 'answered_correctly'

# add indices of categorical features as data will be fed as numpy array to LightGBM and not as DataFrame
# in a DataFrame all columns with a categorical dtype are automatically interpreted as categorical features
# This is not the case with a numpy array, as it is a simple matrix
# We thus need to specify the indices of the columns with categorical features to LightGBM
categorical_feature = ['part', 'retry', 'prior_question_had_explanation', 'bundle_id', 'tag_1', 'tag_2']
categorical_feature_idxs = []
for v in categorical_feature:
    try:
        categorical_feature_idxs.append(features.index(v))
    except ValueError:
        # feature is not used in this version (e.g. 'retry'), skip it
        pass

# Make train and validation datasets

In [None]:
def get_train_val_idxs(TRAIN_SIZE, VAL_SIZE):
    train_idxs = []
    val_idxs = []
    NEW_USER_FRAC = 1/4 # fraction of new users, 25% of validation rows are new users
    np.random.seed(42)
    
    # create df with user_ids and indices
    df = pd.DataFrame(index=get_index_np())
    for feature in ['user_id']:
        df[feature] = load_feature(feature)

    df['index'] = df.index.values.astype(np.uint32)
    user_id_index = df.groupby('user_id')['index'].apply(np.array)
    
    # iterate over users in random order
    for indices in user_id_index.sample(user_id_index.size, random_state=42):
        if len(train_idxs) > TRAIN_SIZE:
            break

        # fill validation data
        if len(val_idxs) < VAL_SIZE:
            # add new user
            if np.random.rand() < NEW_USER_FRAC:
                val_idxs += list(indices)
            # randomly split user between train and val
            else:
                offset = np.random.randint(0, indices.size)
                train_idxs += list(indices[:offset])
                val_idxs += list(indices[offset:])
        else:
            train_idxs += list(indices)
    
    train_idxs = np.array(train_idxs, dtype=np.uint32)
    val_idxs = np.array(val_idxs, dtype=np.uint32)
    
    return train_idxs, val_idxs

train_idxs, val_idxs = get_train_val_idxs(int(96.5e6), 2.5e6)
print(f'len train_idxs: {len(train_idxs)}, len validation_idxs: {len(val_idxs)}')

In [None]:
def make_x_y(train_idxs, val_idxs):
    # create numpy arrays
    X_train = np.ndarray(shape=(len(train_idxs), len(features)), dtype=np.float32)
    X_val = np.ndarray(shape=(len(val_idxs), len(features)), dtype=np.float32)
    
    # now fill them up column wise to reduce memory usage
    # features are loaded from disk as npz files (compressed numpy arrays)
    for idx, feature in enumerate(tqdm(features)):
        X_train[:,idx] = load_feature(feature, train_idxs, np.float32)
        X_val[:,idx] = load_feature(feature, val_idxs, np.float32)
    
    y_train = load_feature(target, train_idxs, np.int8)
    y_val = load_feature(target, val_idxs, np.int8)
                         
    return X_train, y_train, X_val, y_val
    
X_train, y_train, X_val, y_val = make_x_y(train_idxs, val_idxs)

In [None]:
print(f'X_train.shape: {X_train.shape}\t y_train.shape: {y_train.shape}')
print(f'X_val.shape: {X_val.shape}\t y_val.shape: {y_val.shape}')

The next DataFrame gives an overview of the features. Note that all features are of type float32, so LightGBM can use the data without copying it. When the features are fed as a DataFrame, LightGBM converts the DataFrame to float64, causing a massive RAM usage spike.

Passing the features as a numpy array of type float32 prevents LightGBM from copying the data, and numpy arrays are by default much more memory efficient than DataFrames.
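The size difference between float32 and float64 can be checked directly with numpy. The sketch below fills a preallocated float32 matrix column by column, similar in spirit to `make_x_y`, and compares its size to a float64 copy. The toy dimensions and the `np.arange` stand-in for a loaded feature are assumptions; the sketch only measures array sizes and says nothing about LightGBM internals.

```python
import numpy as np

n_rows, n_features = 1000, 30  # toy sizes, far smaller than the real data

# preallocate once, then fill one column at a time
X32 = np.empty((n_rows, n_features), dtype=np.float32)
for col in range(n_features):
    # stand-in for a feature loaded from disk
    column = np.arange(n_rows, dtype=np.int64)
    # values are cast to float32 on assignment
    X32[:, col] = column

# a float64 copy of the same matrix needs twice the memory
X64 = X32.astype(np.float64)

print(f'float32: {X32.nbytes} bytes, float64: {X64.nbytes} bytes')
```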

In [None]:
# show train features
display(pd.DataFrame(X_train[:25], columns=features))

In [None]:
# Check the target (answered correctly) as sanity check
display(y_train[:25])

In [None]:
# make train and validation dataset
# Do not specify the categorical feature here, but only for training
# If we do specify it here LightGBM will throw a warning for overriding the categorical features
train_data = lgb.Dataset(
    data = X_train,
    label = y_train,
    categorical_feature = None,
)

val_data = lgb.Dataset(
    data = X_val,
    label = y_val,
    categorical_feature = None,
)

In [None]:
# Free up RAM
del X_train, y_train, X_val, y_val, train_idxs, val_idxs
gc.collect()

# Training

In [None]:
# Simple LightGBM parameters
# There is some room for improvement here, any suggestions are welcome!
lgbm_params = {
    'objective': 'binary',
    'metric': METRICS,
    'num_leaves': 200,
    'learning_rate': 0.1,
}

In [None]:
%%time
def train():
    evals_result = {}
    model = lgb.train(
        params = lgbm_params,
        train_set = train_data,
        valid_sets = [val_data],
        num_boost_round = NUM_BOOST_ROUND,
        verbose_eval = VERBOSE_EVAL,
        evals_result = evals_result,
        early_stopping_rounds = 10,
        categorical_feature = categorical_feature_idxs,
        feature_name = features,
    )

    # save model
    model.save_model(f'model_{VERSION}_{NUM_BOOST_ROUND}.lgb')
    
    return model, evals_result
    
model, evals_result = train()

# Training History

In [None]:
def plot_history(evals_result):
    for metric in METRICS:
        plt.figure(figsize=(20,8))
        
        for key in evals_result.keys():
            history_len = len(evals_result.get(key)[metric])
            history = evals_result.get(key)[metric]
            x_axis = np.arange(1, history_len + 1)
            plt.plot(x_axis, history, label=key)
        
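        # guard against a tick step of 0 when training stopped before 100 rounds
        tick_step = max(history_len // 100 * 10, 1)
        x_ticks = list(filter(lambda e: (e % tick_step == 0) or e == 1, x_axis))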
        plt.xticks(x_ticks, fontsize=12)
        plt.yticks(fontsize=12)

        plt.title(f'{metric.upper()} History of training', fontsize=18);
        plt.xlabel('EPOCH', fontsize=16)
        plt.ylabel(metric.upper(), fontsize=16)
        
        if metric in ['auc']:
            plt.legend(loc='upper left', fontsize=14)
        else:
            plt.legend(loc='upper right', fontsize=14)
        plt.grid()
        plt.show()

plot_history(evals_result)

In [None]:
def show_feature_importances(model, importance_type, max_num_features=10**10):
    feature_importances = pd.DataFrame()
    feature_importances['feature'] = features
    feature_importances['value'] = pd.DataFrame(model.feature_importance(importance_type))
    feature_importances = feature_importances.sort_values(by='value', ascending=False) # sort feature importance
    feature_importances.to_csv(f'feature_importances_{importance_type}.csv') # write feature importance to csv
    feature_importances = feature_importances[:max_num_features] # only show max_num_features
    
    plt.figure(figsize=(18, 8))
    plt.xlim([0, feature_importances.value.max()*1.1])
    plt.title(f'Feature {importance_type}', fontsize=18);
    sns.barplot(data=feature_importances, x='value', y='feature', palette='rocket');
    for idx, v in enumerate(feature_importances.value):
        plt.text(v, idx, "  {:.2e}".format(v))

show_feature_importances(model, 'gain')
show_feature_importances(model, 'split')

In [None]:
# show tree and save as png
def save_tree_diagraph(model):
    tree_digraph = lgb.create_tree_digraph(model, show_info=['split_gain', 'internal_count'])

    tree_png = svg2png(tree_digraph._repr_svg_(), output_width=3840)
    tree_png = Image.open(BytesIO(tree_png))

    tree_png.save('create_tree_digraph.png')

    display(tree_png)
    
save_tree_diagraph(model)

In [None]:
# remove train and validation data to free memory before prediction phase
del train_data, val_data
gc.collect()

# Prediction Preparation

The next DataFrame is merged with the test_df to add the content, part and tag features that do not depend on the user.

In [None]:
def get_features_questions_df():
    # create DataFrame of features
    features_questions_df = pd.DataFrame(index=get_index_np())
    cols = [
        'content_id',
        'part',
        'tag_1',
        'tag_2',
        'content_count',
        'mean_content_id_pos',
        'bundle_id',
    ]
    for feature in tqdm(cols):
        features_questions_df[feature] = load_feature(feature)

    # content features
    features_questions_df.drop_duplicates(subset='content_id', inplace=True)
    features_questions_df.sort_values('content_id', inplace=True)
    features_questions_df.reset_index(drop=True, inplace=True)
    
    return features_questions_df
    
features_questions_df = get_features_questions_df()
print(f'features_questions_df, rows: {features_questions_df.shape[0]}')
display(features_questions_df.head())

# STATE

This next function is the beating heart of my prediction phase: a massive dictionary that keeps track of all features of all users and is updated with every interaction.

I agree, the code is somewhat unreadable, but the basic idea is as follows:

Compute the user features over all user data (mean_user_accuracy, answered_user, answered_correctly_user), 
as the stored versions of these features have a lag of 1

Get the last data point for the other features (lectures_seen, mean_content_accuracy, etc.)

Create a dictionary in which all features are kept track of for each user. An example of a user is shown below the function.
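Stripped down to a few features, the idea can be sketched as follows. The function names, the short key `recent` and the window size of 3 are assumptions for this example; the default accuracy of 0.68 is the value this notebook uses for new users.

```python
from collections import deque


def new_user_state():
    # per-user counters plus a bounded history of recent answers
    return {'answered_user': 0, 'answered_correctly_user': 0, 'recent': deque(maxlen=3)}


def read_features(user_state, default_accuracy=0.68):
    # features are read BEFORE the current answer is known
    n = user_state['answered_user']
    recent = user_state['recent']
    return {
        'answered_user': n,
        'mean_user_accuracy': user_state['answered_correctly_user'] / n if n else default_accuracy,
        'mean_user_accuracy_r3': sum(recent) / len(recent) if recent else default_accuracy,
    }


def update_state(user_state, answered_correctly):
    # called once the true answer becomes available
    user_state['answered_user'] += 1
    user_state['answered_correctly_user'] += answered_correctly
    # newest answer goes to the left; the oldest drops off the right when full
    user_state['recent'].appendleft(answered_correctly)


state_sketch = {}
for user_id, answer in [(7, 1), (7, 0), (7, 1), (7, 1)]:
    if user_id not in state_sketch:
        state_sketch[user_id] = new_user_state()
    feats = read_features(state_sketch[user_id])  # would be fed to the model
    update_state(state_sketch[user_id], answer)

final_feats = read_features(state_sketch[7])
print(final_feats)
```

Keeping sums and counts instead of full histories is what lets every update run in constant time per interaction.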

In [None]:
def get_state():
    # create DataFrame of features
    features_df = pd.DataFrame(index=get_index_np())
    cols = [
        'user_id', 'content_id', 'answered_correctly','lectures_seen',
        'mean_user_accuracy', 'mean_content_accuracy',
        'last_correct_time_elapsed', 'last_incorrect_time_elapsed', 'part', 'user_part_answered_consecutive',
        'timestamp', 'last_interaction_elapsed_time_l1', 'last_interaction_elapsed_time_l2', 'last_interaction_elapsed_time_l3',
    ]
    for f in tqdm(cols):
        features_df[f] = load_feature(f)
        
    # get last features
    last_features = features_df.groupby('user_id')[[
        'timestamp', 'lectures_seen', 'mean_content_accuracy', 'part', 'user_part_answered_consecutive',
        'last_correct_time_elapsed', 'last_incorrect_time_elapsed',
        'last_interaction_elapsed_time_l1', 'last_interaction_elapsed_time_l2', 'last_interaction_elapsed_time_l3',
    ]].last()
    
    # last correct/incorrect time elapsed
    last_correct_features = features_df.groupby(['user_id', 'answered_correctly'])['timestamp'].last()
    
    # drop features only used for last feature computation
    features_df.drop([
        'timestamp', 'last_interaction_elapsed_time_l1', 'last_interaction_elapsed_time_l2', 'last_interaction_elapsed_time_l3',
        'last_correct_time_elapsed', 'last_incorrect_time_elapsed', 'user_part_answered_consecutive',
    ], axis=1, inplace=True)
        
    # compute user features over all train data
    features_df_grouped_by_user = features_df[['user_id', 'answered_correctly']].groupby('user_id')['answered_correctly']
    mean_user_accuracy = features_df_grouped_by_user.mean().values.astype(np.float32)
    answered_correctly_user = features_df_grouped_by_user.sum().values.astype(np.uint16)
    answered_user = features_df_grouped_by_user.count().values.astype(np.uint16)
    last_questions_answered_100 = features_df.groupby('user_id')[['user_id', 'answered_correctly']].tail(100).groupby('user_id')['answered_correctly'].apply(np.array).apply(np.flip)
    # user_mean_content_accuracy_sum for computing mean_user_content_accuracy
    mean_content_accuracy_sum = features_df.groupby('user_id')['mean_content_accuracy'].sum().values
    # mean_accuracy_diff
    mean_accuracy_diff_sum = (features_df.groupby('user_id')['mean_user_accuracy'].sum() - features_df.groupby('user_id')['mean_content_accuracy'].sum()).values
    # user part features (count, sum)
    user_part_features = features_df.groupby(['user_id', 'part'])['answered_correctly'].agg(['count', 'sum']).reset_index().astype({'user_id': int, 'part': int})
    
    del features_df_grouped_by_user, features_df
    gc.collect()
    
    # get state with precomputed attempts
    with open('/kaggle/input/riiid-answer-correctness-prediction-features/state.pkl', 'rb') as state_pickle_file:
         state = pickle.load(state_pickle_file)
    
    # add all features to state
    for idx, user_id in tqdm(enumerate(state.keys()), total=len(state)):
        state[user_id]['mean_user_accuracy'] = mean_user_accuracy[idx]
        state[user_id]['answered_correctly_user'] = answered_correctly_user[idx]
        state[user_id]['answered_user'] = answered_user[idx]
        state[user_id]['mean_content_accuracy_sum'] = mean_content_accuracy_sum[idx]
        state[user_id]['mean_accuracy_diff_sum'] = mean_accuracy_diff_sum[idx]
        # last features
        state[user_id]['lectures_seen'] = last_features.loc[user_id, 'lectures_seen']
        state[user_id]['timestamp'] = last_features.loc[user_id, 'timestamp']
        state[user_id]['last_mean_content_accuracy'] = last_features.loc[user_id, 'mean_content_accuracy']
        state[user_id]['last_part'] = last_features.loc[user_id, 'part']
        state[user_id]['user_part_answered_consecutive'] = last_features.loc[user_id, 'user_part_answered_consecutive']
        state[user_id]['last_correct_timestamp'] = last_correct_features.loc[user_id, True] if (user_id, True) in last_correct_features else np.nan
        state[user_id]['last_incorrect_timestamp'] = last_correct_features.loc[user_id, False] if (user_id, False) in last_correct_features else np.nan
        state[user_id]['last_interaction_elapsed_time_l1'] = last_features.loc[user_id, 'last_interaction_elapsed_time_l1']
        state[user_id]['last_interaction_elapsed_time_l2'] = last_features.loc[user_id, 'last_interaction_elapsed_time_l2']
        state[user_id]['last_interaction_elapsed_time_l3'] = last_features.loc[user_id, 'last_interaction_elapsed_time_l3']
        state[user_id]['mean_user_accuracy_r10'] = deque(last_questions_answered_100[user_id][:10], maxlen=10)
        state[user_id]['mean_user_accuracy_r25'] = deque(last_questions_answered_100[user_id][:25], maxlen=25)
        state[user_id]['mean_user_accuracy_r100'] = deque(last_questions_answered_100[user_id], maxlen=100)
            
    # add user part features
    for idx, (user_id, part, part_count, part_sum) in tqdm(user_part_features.iterrows(), total=user_part_features.index.size):
        state[user_id][f'part_{part}_count'] = part_count
        state[user_id][f'part_{part}_sum'] = part_sum
                
    return state

state = get_state()
gc.collect()

In [None]:
# Example of the state for the famous user 115
display(state[115])

In [None]:
# Gives the state for a new user with all default values
def get_new_user(row):
    return {
        'mean_user_accuracy': 0.680,
        'answered_correctly_user': 0,
        'answered_user': 0,
        'user_content_attempts': dict(),
        'lectures_seen': 0,
        'timestamp': row['timestamp'],
        'last_mean_content_accuracy': 0,
        'last_part': int(row['part']),
        'user_part_answered_consecutive': 0,
        'last_correct_timestamp': np.nan,
        'last_incorrect_timestamp': np.nan,
        'last_interaction_elapsed_time_l1': 0,
        'last_interaction_elapsed_time_l2': 0,
        'last_interaction_elapsed_time_l3': 0,
        'mean_user_accuracy_r10': deque([], maxlen=10),
        'mean_user_accuracy_r25': deque([], maxlen=25),
        'mean_user_accuracy_r100': deque([], maxlen=100),
        'mean_accuracy_diff_sum': 0,
        'mean_content_accuracy_sum': 0,
    }

# returns a dictionary with a list for all user features
def get_user_data_dict():
    return {
        'mean_user_accuracy': [],
        'answered_correctly_user': [],
        'answered_user': [],
        'lectures_seen': [],
        'last_interaction_elapsed_time_l1': [],
        'last_interaction_elapsed_time_l2': [],
        'last_interaction_elapsed_time_l3': [],
        'mean_user_accuracy_r10': [],
        'mean_user_accuracy_r25': [],
        'mean_user_accuracy_r100': [],
        'mean_user_part_accuracy': [],
        'answered_part_user': [],
        'mean_accuracy_diff': [],
        'mean_user_content_accuracy': [],
        'last_question_mean_acc_diff': [],
        'user_part_answered_consecutive': [],
        'mean_user_mean_content_acc_diff': [],
        'last_correct_time_elapsed': [],
        'last_incorrect_time_elapsed': [],
    }

The next function retrieves the features for all users from the state and returns a dictionary with all features as shown above.
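One part of that function that is easy to miss is how the three elapsed-time lags are shifted. The sketch below mirrors that logic on a minimal state; the short keys `l1`, `l2`, `l3` and the function name are assumptions made to keep the example compact.

```python
def update_elapsed_times(user_state, timestamp):
    # shift the lags: l3 <- l2 <- l1
    user_state['l3'] = user_state['l2']
    user_state['l2'] = user_state['l1']
    # questions in the same bundle share a timestamp, so l1 only changes
    # when the timestamp actually differs from the last one seen
    if timestamp != user_state['timestamp']:
        user_state['l1'] = timestamp - user_state['timestamp']
        user_state['timestamp'] = timestamp
    return user_state['l1'], user_state['l2'], user_state['l3']


user_state = {'timestamp': 0, 'l1': 0, 'l2': 0, 'l3': 0}

first = update_elapsed_times(user_state, 1000)
second = update_elapsed_times(user_state, 1500)
# same timestamp as before: l2 and l3 still shift, l1 stays the same
third = update_elapsed_times(user_state, 1500)

print(first, second, third)
```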

In [None]:
# dictionary which converts a content_id to a tag
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv', dtype={ 'tag': np.uint8 })[['lecture_id', 'tag']]
content_id2tag = pd.Series(lectures['tag'].values, index=lectures['lecture_id'].values)

def get_user_data(state, test_df):
    # updated data
    user_data = get_user_data_dict()
    
    # mean first question part accuracies
    part_first_question_mean_accuracy_dict = {1: 0.75, 2: 0.60, 3: 0.49, 4: 0.41, 5: 0.52, 6: 0.51, 7: 0.47}
    cols = ['user_id', 'content_id', 'content_type_id', 'timestamp', 'part', 'mean_content_accuracy']
    
    for idx, (user_id, content_id, is_lecture, timestamp, part, mean_content_accuracy) in test_df[cols].iterrows():
        # LECTURE
        if is_lecture:
            state[user_id]['lectures_seen'] += 1
            
            # fill user data with dummy value
            for key in user_data.keys():
                user_data[key].append(0)
            
        # QUESTION
        else:
            part = int(part)
            
            # update last interaction elapsed time
            state[user_id]['last_interaction_elapsed_time_l3'] = state[user_id]['last_interaction_elapsed_time_l2']
            state[user_id]['last_interaction_elapsed_time_l2'] = state[user_id]['last_interaction_elapsed_time_l1']
            if timestamp != state[user_id]['timestamp']:
                state[user_id]['last_interaction_elapsed_time_l1'] = timestamp - state[user_id]['timestamp']
                state[user_id]['timestamp'] = timestamp
            
            # add various features
            cols = [
                'mean_user_accuracy', 'answered_correctly_user', 'answered_user', 'lectures_seen',
                'last_interaction_elapsed_time_l1', 'last_interaction_elapsed_time_l2', 'last_interaction_elapsed_time_l3',
            ]
            for feature in cols:
                user_data[feature].append(state[user_id][feature])
            
            # branch on first question
            if state[user_id]['answered_user'] > 0:
                user_data['mean_accuracy_diff'].append(state[user_id]['mean_accuracy_diff_sum'] / state[user_id]['answered_user'])
                user_data['mean_user_content_accuracy'].append(state[user_id]['mean_content_accuracy_sum'] / state[user_id]['answered_user'])
                user_data['mean_user_accuracy_r10'].append(np.array(state[user_id]['mean_user_accuracy_r10']).mean())
                user_data['mean_user_accuracy_r25'].append(np.array(state[user_id]['mean_user_accuracy_r25']).mean())
                user_data['mean_user_accuracy_r100'].append(np.array(state[user_id]['mean_user_accuracy_r100']).mean())
            else:
                user_data['mean_accuracy_diff'].append(0)
                user_data['mean_user_content_accuracy'].append(0)
                user_data['mean_user_accuracy_r10'].append(0.68)
                user_data['mean_user_accuracy_r25'].append(0.68)
                user_data['mean_user_accuracy_r100'].append(0.68)
            
            # last question mean acc diff
            if state[user_id]['last_mean_content_accuracy'] != 0:
                user_data['last_question_mean_acc_diff'].append(mean_content_accuracy - state[user_id]['last_mean_content_accuracy'])
            else:
                user_data['last_question_mean_acc_diff'].append(0)
            
            state[user_id]['last_mean_content_accuracy'] = mean_content_accuracy
            
            # user part features
            if f'part_{part}_count' in state[user_id]:
                if state[user_id][f'part_{part}_sum'] == 0:
                    user_data['mean_user_part_accuracy'].append(0)
                else:
                    user_data['mean_user_part_accuracy'].append(state[user_id][f'part_{part}_sum'] / state[user_id][f'part_{part}_count'])
                
                user_data['answered_part_user'].append(state[user_id][f'part_{part}_count'])
            else:
                state[user_id][f'part_{part}_sum'] = 0
                state[user_id][f'part_{part}_count'] = 0
                user_data['mean_user_part_accuracy'].append(part_first_question_mean_accuracy_dict[part])
                user_data['answered_part_user'].append(0)
            
            # user_part_answered_consecutive
            if part == state[user_id]['last_part']:
                state[user_id]['user_part_answered_consecutive'] += 1
            else:
                state[user_id]['last_part'] = part
                state[user_id]['user_part_answered_consecutive'] = 0
                
            user_data['user_part_answered_consecutive'].append(state[user_id]['user_part_answered_consecutive'])
            
            # mean_user_mean_content_acc_diff
            user_data['mean_user_mean_content_acc_diff'].append(mean_content_accuracy - state[user_id]['mean_user_accuracy'])
            
            # last correct/incorrect time elapsed
            user_data['last_correct_time_elapsed'].append(timestamp - state[user_id]['last_correct_timestamp'])
            user_data['last_incorrect_time_elapsed'].append(timestamp - state[user_id]['last_incorrect_timestamp'])
    
    return user_data

In [None]:
# adds all new users to the state with default values
def add_new_users(test_df):
    for idx, row in test_df.iterrows():
        # check if user exists
        if row['user_id'] not in state:
            state[row['user_id']] = get_new_user(row)

In [None]:
# adds the attempt and retry feature to the test_df
def add_attempt_retry(test_df):
    attempt = []
    for user_id, content_id in test_df[['user_id', 'content_id']].itertuples(name=None, index=False):
        if content_id in state[user_id]['user_content_attempts']:
            state[user_id]['user_content_attempts'][content_id] += 1
        else:
            state[user_id]['user_content_attempts'][content_id] = 0

        attempt.append(state[user_id]['user_content_attempts'][content_id])
    
    test_df['attempt'] = attempt
    test_df['retry'] = test_df['attempt'] > 0

This next function adds the mean_content_accuracy, taking the prior_question_had_explanation and retry features into account. An example of the effect of prior_question_had_explanation and retry is given below.
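The lookup boils down to a dictionary keyed by a `(content_id, prior_question_had_explanation, retry)` tuple, where retry follows from a per-user attempt counter. In the sketch below the accuracy values are made up for illustration and the helper names are assumptions; only the key layout, the zero-based attempt count and the fallback of 0 follow this notebook.

```python
def count_attempt(user_content_attempts, content_id):
    # first time a user sees a question -> 0, every later time -> +1
    user_content_attempts[content_id] = user_content_attempts.get(content_id, -1) + 1
    return user_content_attempts[content_id]


def lookup_mean_content_accuracy(cases, content_id, had_explanation, retry):
    # unseen combinations fall back to 0
    return cases.get((content_id, had_explanation, retry), 0)


# hypothetical accuracies, NOT the real values from the pickle file
cases = {
    (6116, False, False): 0.40,
    (6116, True, True): 0.70,
}

attempts = {}

first_attempt = count_attempt(attempts, 6116)
first_acc = lookup_mean_content_accuracy(cases, 6116, False, first_attempt > 0)

second_attempt = count_attempt(attempts, 6116)
second_acc = lookup_mean_content_accuracy(cases, 6116, True, second_attempt > 0)

unknown_acc = lookup_mean_content_accuracy(cases, 1, False, False)

print(first_attempt, first_acc, second_attempt, second_acc, unknown_acc)
```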

In [None]:
with open('/kaggle/input/riiid-answer-correctness-prediction-features/mean_content_accuracy_cases_dict.pickle', 'rb') as f:
    mean_content_accuracy_cases_dict = pickle.load(f)

def add_mean_content_accuracy(test_df):
    mean_content_accuracy = []
    for key in test_df[['content_id', 'prior_question_had_explanation', 'retry']].itertuples(name=None, index=False):
        # get mean content accuracy
        if key in mean_content_accuracy_cases_dict:
            mean_content_accuracy.append(mean_content_accuracy_cases_dict[key])
        else:
            mean_content_accuracy.append(0)
        
    test_df['mean_content_accuracy'] = mean_content_accuracy

In [None]:
# Example of the mean_content_accuracy for question 6116, the most answered question
for (content_id, prior_question_had_explanation, retry), mean_content_accuracy in mean_content_accuracy_cases_dict.items():
    if content_id == 6116:
        print(f'content_id {content_id}, prior_question_had_explanation: {prior_question_had_explanation}, retry: {retry}, mean_content_accuracy: {mean_content_accuracy}')

After each prediction iteration the user features are updated
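The true answers of a batch only arrive with the next batch, in its `prior_group_answers_correct` field, so the update always runs one iteration behind the prediction. The following sketch shows that pattern with plain dictionaries instead of the competition API; the batch contents and variable names are made up for illustration.

```python
import ast

# made-up batches: each one carries the answers of the PREVIOUS batch as a string
batches = [
    {'user_ids': [1, 2], 'prior_group_answers_correct': '[]'},
    {'user_ids': [1], 'prior_group_answers_correct': '[1, 0]'},
    {'user_ids': [2], 'prior_group_answers_correct': '[1]'},
]

answered = {}
correct = {}
prev_user_ids = None

for batch in batches:
    # from the 2nd iteration on, the answers of the previous batch are known
    if prev_user_ids is not None:
        answers = ast.literal_eval(batch['prior_group_answers_correct'])
        for user_id, answer in zip(prev_user_ids, answers):
            answered[user_id] = answered.get(user_id, 0) + 1
            correct[user_id] = correct.get(user_id, 0) + answer

    # ... features would be built from the state and predictions made here ...

    prev_user_ids = batch['user_ids']

print(answered, correct)
```

Note that the answers to the last batch are never revealed, so the state never includes them.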

In [None]:
def update_user_data(state, prev_test_df):
    for idx, row in prev_test_df.iterrows():
        if not row['content_type_id']:
            answered_correctly = row['answered_correctly']
            user_id = row['user_id']
            part = int(row['part'])
            # update user features
            state[user_id]['answered_correctly_user'] += answered_correctly
            state[user_id]['answered_user'] += 1
            state[user_id]['mean_user_accuracy'] = state[user_id]['answered_correctly_user'] / state[user_id]['answered_user']
            state[user_id]['mean_user_accuracy_r10'].appendleft(answered_correctly)
            state[user_id]['mean_user_accuracy_r25'].appendleft(answered_correctly)
            state[user_id]['mean_user_accuracy_r100'].appendleft(answered_correctly)
            # add user part features, initialize if no part answered yet
            state[user_id][f'part_{part}_sum'] += answered_correctly
            state[user_id][f'part_{part}_count'] += 1
            # update other user features
            state[user_id]['mean_accuracy_diff_sum'] += row['mean_user_accuracy'] - row['mean_content_accuracy']
            state[user_id]['mean_content_accuracy_sum'] += row['mean_content_accuracy']
            # last correct/incorrect time elapsed
            if answered_correctly:
                state[user_id]['last_correct_timestamp'] = row['timestamp']
            else:
                state[user_id]['last_incorrect_timestamp'] = row['timestamp']

# Make actual prediction

In [None]:
import riiideducation

env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
import ast

pd.options.display.max_columns = 99
prev_test_df = None

for idx, (test_df, _) in tqdm(enumerate(iter_test)):
    # from 2nd iteration, update user data
    if prev_test_df is not None:
        # prior_group_answers_correct holds a list as a string; parse it safely instead of using eval
        prev_test_df['answered_correctly'] = ast.literal_eval(test_df['prior_group_answers_correct'].iloc[0])
        update_user_data(state, prev_test_df)
        if idx < 4:
            display(test_df)
            display(prev_test_df)
            
    # fill missing prior_question_had_explanation and prior_question_elapsed_time
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
    test_df['prior_question_elapsed_time'] = test_df['prior_question_elapsed_time'].fillna(23916)
            
    # merge with all features
    test_df = features_questions_df.merge(test_df, how='right', on='content_id')
    
    # add new users to state
    add_new_users(test_df)
    
    # add attempt, retry and mean_content_accuracy
    add_attempt_retry(test_df)
    add_mean_content_accuracy(test_df)
    
    # get user data from state and update attempt
    user_data = get_user_data(state, test_df)
    for feature, values in user_data.items():
        test_df[feature] = values
    
    # add harmonic mean
    test_df['hmean_user_content_accuracy'] = 2 * (
        (test_df['mean_user_accuracy'] * test_df['mean_content_accuracy']) /
        (test_df['mean_user_accuracy'] + test_df['mean_content_accuracy'])
    )

    # add content_id position difference
    test_df['content_id_pos_diff'] = (test_df['answered_user'] - test_df['mean_content_id_pos']).values.astype(np.int16)

    test_df['answered_correctly'] = model.predict(test_df[features])

    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])

    # set previous test_df
    prev_test_df = test_df.copy()

In [None]:
submission = pd.read_csv('./submission.csv')

In [None]:
submission.info()

In [None]:
# show the first 5 predictions
submission.head()

That was all of it, hope you enjoyed this notebook and thanks for taking a look at it!

As I am quite new to the machine learning community, any tips and tricks are welcome!