**Welcome!** This notebook walks through a baseline model for the Riiid challenge:

In [None]:
import riiideducation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
env = riiideducation.make_env()

The training dataset exceeds the available RAM unless you use Google Cloud Storage. The test dataset, on the other hand, cannot be accessed directly; instead, the organisers of this competition provide a module that serves the data in batches. It is explained in this [Notebook](https://www.kaggle.com/sohier/competition-api-detailed-introduction). There are more efficient ways to load and store the training data than reading the csv into pandas (see this [Notebook](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid)). Still, we simply read the csv into pandas, restricting the columns and using narrow dtypes to save memory. Each row of this dataset describes one specific answer given by a user to a question. Unfortunately, there are users in the test set for which we do not have data in this dataset:

In [None]:
train = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                   usecols=[1, 2, 3, 4, 5, 7, 8, 9],
                   dtype={'timestamp': 'int64',
                          'user_id': 'int32',
                          'content_id': 'int16',
                          'content_type_id': 'int8',
                          'task_container_id': 'int16',
                          'answered_correctly':'int8',
                          'prior_question_elapsed_time': 'float32',
                          'prior_question_had_explanation': 'boolean'}
                   )
train.info()
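
To illustrate why the explicit `dtype` mapping helps, here is a small sketch on a synthetic frame. The values are invented and only the column names follow the cell above. It compares the memory used by 64-bit integers with the narrower types:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of train.csv; the values are invented.
n = 100_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'user_id': rng.integers(0, 2_000_000, size=n),
    'content_id': rng.integers(0, 13_000, size=n),
    'content_type_id': rng.integers(0, 2, size=n),
    'answered_correctly': rng.integers(0, 2, size=n),
})  # rng.integers returns int64 values by default

# Same narrow types as in the read_csv call above.
small = toy.astype({
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'answered_correctly': 'int8',
})

bytes_default = toy.memory_usage(index=False).sum()
bytes_small = small.memory_usage(index=False).sum()
print(bytes_default, bytes_small)
```

For these four columns the narrow types need a quarter of the memory, and the values are unchanged as long as they fit into the chosen ranges.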

Let's take a look at the dataset:

In [None]:
train

Just as in the test dataset, each row in the training set corresponds to a user's answer to a question. From these rows we can derive how often each question is answered correctly in general. We therefore also load the file that describes each question, hoping to gain valuable information from it. It contains the complete list of questions that appear in the datasets:

In [None]:
#reading in question df
questions_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv',                         
                            usecols=[0, 3],
                            dtype={'question_id': 'int16',
                              'part': 'int8'}
                          )
questions_df

We remove data on lectures. They only represent about 2 percent of this training dataset and they will not be present in the test dataset:

In [None]:
train = train[train.content_type_id == False].sort_values('timestamp').reset_index(drop = True)

# Features
We compute the mean time that elapsed while the user answered the previous question:

In [None]:
elapsed_mean = train.prior_question_elapsed_time.mean()
elapsed_mean

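We compute, for each task container, the average number of questions per user, and accumulate this number over the containers as an estimate of how many questions a user has seen so far: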

In [None]:
group1 = train.loc[(train.content_type_id == False), ['task_container_id', 'user_id']].groupby(['task_container_id']).agg(['count'])
group1.columns = ['avg_questions']
group2 = train.loc[(train.content_type_id == False), ['task_container_id', 'user_id']].groupby(['task_container_id']).agg(['nunique'])
group2.columns = ['avg_questions']
group3 = group1 / group2
group3['avg_questions_seen'] = group3.avg_questions.cumsum()
print('The amount of questions seen by the average user:')
group3.iloc[0].avg_questions_seen
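
The count / nunique / cumsum logic above can be followed on a tiny example. This is a sketch on a made-up log with three users and two task containers, not the competition data:

```python
import pandas as pd

# Hypothetical toy log: three users, task containers 0 and 1.
toy = pd.DataFrame({
    'user_id':           [1, 1, 1, 2, 2, 3],
    'task_container_id': [0, 1, 1, 0, 1, 0],
})

per_container = toy.groupby('task_container_id')['user_id'].agg(['count', 'nunique'])
# Rows per distinct user in each container.
per_container['avg_questions'] = per_container['count'] / per_container['nunique']
# Running total over containers, in container order.
per_container['avg_questions_seen'] = per_container['avg_questions'].cumsum()
print(per_container)
```

Note that the first row of the running total only reflects the first container; later rows accumulate the averages of all containers up to that point.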

We compute the mean accuracy for each user:

In [None]:
results_u_final = train.loc[train.content_type_id == False, ['user_id','answered_correctly']].groupby(['user_id']).agg(['mean', 'count'])
results_u_final.columns = ['answered_correctly_user','answered_user']
results_u_final.answered_correctly_user.describe()


We compute the fraction of prior questions that had an explanation for each user:

In [None]:
results_u2_final = train.loc[train.content_type_id == False, ['user_id','prior_question_had_explanation']].groupby(['user_id']).agg(['mean'])
results_u2_final.columns = ['explanation_mean_user']
results_u2_final.explanation_mean_user.describe()

We merge the training and question datasets:

In [None]:
train = pd.merge(train, questions_df, left_on = 'content_id', right_on = 'question_id', how = 'left')

We compute the fraction of correct answers for each question:

In [None]:
results_q_final = train.loc[train.content_type_id == False, ['question_id','answered_correctly']].groupby(['question_id']).agg(['mean'])
results_q_final.columns = ['quest_pct']
results_q_final.quest_pct.describe()

We compute how often each question was asked:

In [None]:
results_q2_final = train.loc[train.content_type_id == False, ['question_id','part']].groupby(['question_id']).agg(['count'])
results_q2_final.columns = ['count']

We merge the data from the questions.csv and the new question features:

In [None]:
question2 = pd.merge(questions_df, results_q_final, left_on = 'question_id', right_on = 'question_id', how = 'left')
question2 = pd.merge(question2, results_q2_final, left_on = 'question_id', right_on = 'question_id', how = 'left')
question2.quest_pct = round(question2.quest_pct,5)
question2

# EDA

We plot the fraction of each user's answers that are correct against the number of questions that the respective user answered:

In [None]:
figure=plt.subplots(figsize=(20,20))
plt.scatter(x = results_u_final.answered_user, y=results_u_final.answered_correctly_user)
plt.axhline(train['answered_correctly'].mean(), color='k', linestyle='dashed', linewidth=3)

plt.title("Fraction of the user's answers that are correct vs. Number of questions answered by the user", weight='bold')
plt.text(15000, 0.64, 'Fraction of answers that are correct: {:.2f}'.format(train['answered_correctly'].mean()))
plt.show()

The fraction of first answers that were correct:

In [None]:
train.loc[(train.timestamp == 0)].answered_correctly.mean()

The fraction of subsequent answers that were correct:

In [None]:
train.loc[(train.timestamp != 0)].answered_correctly.mean()

The likelihood that the average user had an explanation provided with the previous question:

In [None]:
prior_mean_user = results_u2_final.explanation_mean_user.mean()
prior_mean_user

We drop the timestamp, the content type ID, the question ID and the part from the dataset:

In [None]:
train.drop(['timestamp', 'content_type_id', 'question_id', 'part'], axis=1, inplace=True)

# Realistic validation 
We use the ten most recent answers of each user as the validation set. After all, that's what we would want to predict if we stopped data collection a bit earlier.

In [None]:
print('The old length of the training set:')
print(len(train))
validation = train.groupby('user_id').tail(10)
train = train[~train.index.isin(validation.index)]
print('The length of the training set plus the length of the validation set:')
print(len(train) + len(validation))
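
The split above can be sketched on a toy frame. The rows are assumed to be in chronological order already, and `n_last` is a hypothetical name for the number of answers held out per user:

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':            [1, 1, 1, 1, 2, 2, 3],
    'answered_correctly': [1, 0, 1, 1, 0, 1, 1],
})  # rows are assumed to be sorted by time

n_last = 2  # hypothetical; the cell above uses a larger number
toy_val = toy.groupby('user_id').tail(n_last)
toy_train = toy[~toy.index.isin(toy_val.index)]

print(toy_train)
print(toy_val)
```

Users with at most `n_last` answers end up entirely in the validation part, so they have no history in the remaining training rows. That resembles the situation of unseen users in the test set.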

We again compute the mean accuracy and the fraction of prior questions that had an explanation for each user, but this time without the validation set:

In [None]:
results_u_val = train[['user_id','answered_correctly']].groupby(['user_id']).agg(['mean'])
results_u_val.columns = ['answered_correctly_user']

results_u2_val = train[['user_id','prior_question_had_explanation']].groupby(['user_id']).agg(['mean'])
results_u2_val.columns = ['explanation_mean_user']

We reduce the size of the training set by removing the older answers:

In [None]:
X = train.groupby('user_id').tail(30)
train = train[~train.index.isin(X.index)]
print('The length of the training set plus the length of the validation set plus the length of the set to be discarded:')
print(len(X) + len(validation)+ len(train))

We again compute the mean accuracy and the fraction of prior questions that had an explanation for each user, this time using only the older answers that remain after the most recent answers have been split off as the new training set:

In [None]:
results_u_X = train[['user_id','answered_correctly']].groupby(['user_id']).agg(['mean'])
results_u_X.columns = ['answered_correctly_user']

results_u2_X = train[['user_id','prior_question_had_explanation']].groupby(['user_id']).agg(['mean'])
results_u2_X.columns = ['explanation_mean_user']

# Cleaning
We remove the oldest part:

In [None]:
del(train)

We merge the training set with the features that we computed:

In [None]:
X = pd.merge(X, group3, left_on=['task_container_id'], right_index= True, how="left")
X = pd.merge(X, results_u_X, on=['user_id'], how="left")
X = pd.merge(X, results_u2_X, on=['user_id'], how="left")

We merge the validation set in the same way:

In [None]:
validation = pd.merge(validation, group3, left_on=['task_container_id'], right_index= True, how="left")
validation = pd.merge(validation, results_u_val, on=['user_id'], how="left")
validation = pd.merge(validation, results_u2_val, on=['user_id'], how="left")
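
A left merge keeps every row of the left frame, so users without any history receive missing values for the per-user features. The following sketch shows this on invented data; `user_feats` and `batch` are hypothetical names:

```python
import pandas as pd

# Hypothetical per-user feature table, indexed by user_id.
user_feats = pd.DataFrame(
    {'answered_correctly_user': [0.5, 0.8]},
    index=pd.Index([1, 2], name='user_id'),
)
# User 3 has no history and therefore no row in user_feats.
batch = pd.DataFrame({'user_id': [1, 3, 2, 3]})

# `on` may refer to an index level name on the right-hand side.
merged = pd.merge(batch, user_feats, on=['user_id'], how='left')
print(merged)
```

The rows of user 3 come back with `NaN`, which is why the missing values have to be filled before training and prediction.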

We replace missing booleans by False. Then, we use an encoder to replace the boolean variables:

In [None]:
from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()

X.prior_question_had_explanation.fillna(False, inplace = True)
validation.prior_question_had_explanation.fillna(False, inplace = True)

validation["prior_question_had_explanation_enc"] = lb_make.fit_transform(validation["prior_question_had_explanation"])
X["prior_question_had_explanation_enc"] = lb_make.fit_transform(X["prior_question_had_explanation"])

The mean of the list of fractions of correct answers for questions:

In [None]:
content_mean = question2.quest_pct.mean()

content_mean

Many questions seem to have been asked few times and answered with an accuracy above average! Let's try to correct the accuracies for questions that have been asked very few times:

In [None]:
#filling questions with no info with a new value
question2.quest_pct = question2.quest_pct.mask((question2['count'] < 3), .65)


#filling very hard new questions with a more reasonable value
question2.quest_pct = question2.quest_pct.mask((question2.quest_pct < .2) & (question2['count'] < 21), .2)

#filling very easy new questions with a more reasonable value
question2.quest_pct = question2.quest_pct.mask((question2.quest_pct > .95) & (question2['count'] < 21), .95)
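
To see what the three `mask` calls do, here is a sketch on a handful of invented question statistics, using the same thresholds as the cell above:

```python
import pandas as pd

# Hypothetical question statistics.
q = pd.DataFrame({
    'quest_pct': [1.00, 0.10, 0.99, 0.05, 0.70],
    'count':     [1, 10, 15, 500, 40],
})

pct = q['quest_pct']
pct = pct.mask(q['count'] < 3, 0.65)                    # hardly any answers: use a fixed value
pct = pct.mask((pct < 0.2) & (q['count'] < 21), 0.2)    # rarely asked and very hard
pct = pct.mask((pct > 0.95) & (q['count'] < 21), 0.95)  # rarely asked and very easy
q['quest_pct_smoothed'] = pct
print(q)
```

Only questions with few answers are changed; the question with 500 answers keeps its low accuracy, because there is enough evidence for it.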

Let's merge these new features with the training and validation datasets.

In [None]:
X = pd.merge(X, question2, left_on = 'content_id', right_on = 'question_id', how = 'left')
validation = pd.merge(validation, question2, left_on = 'content_id', right_on = 'question_id', how = 'left')

We define the target and the features for the training and validation:

In [None]:
y = X['answered_correctly']
X = X.drop(['answered_correctly'], axis=1)
X.head()

y_val = validation['answered_correctly']
X_val = validation.drop(['answered_correctly'], axis=1)

We reduce the number of features that we use:

In [None]:
X = X[['answered_correctly_user', 'explanation_mean_user', 'quest_pct', 'avg_questions_seen',
       'prior_question_elapsed_time','prior_question_had_explanation_enc', 'part']]
X_val = X_val[['answered_correctly_user', 'explanation_mean_user', 'quest_pct', 'avg_questions_seen',
       'prior_question_elapsed_time','prior_question_had_explanation_enc', 'part']]

We replace missing data in the average accuracy of individual users by the rounded overall accuracy (0.65). We replace missing data on the user's fraction of prior questions with an explanation by the overall likelihood that such an explanation was provided. We replace the missing mean accuracy of questions by the mean over all questions. We replace the missing part numbers by the middle part (4). We replace the missing numbers of questions that a user has seen by 1. We replace the missing elapsed time data by the mean elapsed time for previous questions. Finally, we replace missing values of the encoded flag that tells whether an explanation was provided for the previous question by 0, that is, no explanation.

In [None]:
X['answered_correctly_user'].fillna(0.65,  inplace=True)
X['explanation_mean_user'].fillna(prior_mean_user,  inplace=True)
X['quest_pct'].fillna(content_mean, inplace=True)

X['part'].fillna(4, inplace = True)
X['avg_questions_seen'].fillna(1, inplace = True)
X['prior_question_elapsed_time'].fillna(elapsed_mean, inplace = True)
X['prior_question_had_explanation_enc'].fillna(0, inplace = True)

We do the same for the validation dataset:

In [None]:
X_val['answered_correctly_user'].fillna(0.65,  inplace=True)
X_val['explanation_mean_user'].fillna(prior_mean_user,  inplace=True)
X_val['quest_pct'].fillna(content_mean,  inplace=True)

X_val['part'].fillna(4, inplace = True)
X_val['avg_questions_seen'].fillna(1, inplace = True)
X_val['prior_question_elapsed_time'].fillna(elapsed_mean, inplace = True)
X_val['prior_question_had_explanation_enc'].fillna(0, inplace = True)
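
Since the same fill values are applied to several frames, one option is to keep them in a single dictionary and apply them through a small helper. This is only a sketch: `fill_missing` and the toy frames are hypothetical, and the dictionary lists just three of the columns:

```python
import numpy as np
import pandas as pd

def fill_missing(df, fill_values):
    """Return a copy of df with NaNs replaced column by column."""
    return df.fillna(value=fill_values)

# Hypothetical fill values; in the notebook they come from earlier cells.
fill_values = {
    'answered_correctly_user': 0.65,
    'part': 4,
    'avg_questions_seen': 1,
}

toy_train = pd.DataFrame({'answered_correctly_user': [np.nan, 0.9],
                          'part': [2.0, np.nan],
                          'avg_questions_seen': [np.nan, 3.0]})
toy_val = pd.DataFrame({'answered_correctly_user': [0.4, np.nan],
                        'part': [np.nan, 5.0],
                        'avg_questions_seen': [np.nan, np.nan]})

toy_train = fill_missing(toy_train, fill_values)
toy_val = fill_missing(toy_val, fill_values)
print(toy_val)
```

With one helper, the training, validation and test frames are guaranteed to be treated identically, which is harder to get right with copied lines.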

# Baseline
We import the model and define the datasets:

In [None]:
import lightgbm as lgb

lgb_train = lgb.Dataset(X, y, categorical_feature = ['part', 'prior_question_had_explanation_enc'],free_raw_data=False)
lgb_eval = lgb.Dataset(X_val, y_val, categorical_feature = ['part', 'prior_question_had_explanation_enc'], reference=lgb_train, free_raw_data=False)

We define the objective function and the constraints. Then, we train the model:

In [None]:
# Hyperparameters for LightGBM (lightgbm was imported above as lgb)
params = {
        'num_leaves': 161,
        'boosting_type': 'gbdt',
        'max_bin': 890,
        'objective': 'binary',
        'metric': 'auc',
        'max_depth': 12,
        'min_child_weight': 11,
        'feature_fraction': 0.6903098140467137,
        'bagging_fraction': 0.9267405716419829,
        'bagging_freq': 7,
        'min_child_samples': 77,
        'lambda_l1': 0.02267578846472961,
        'lambda_l2': 9.722845458292198e-08,
        'early_stopping_rounds': 10
        }
# lgb_train and lgb_eval were already defined in the previous cell
model = lgb.train(
    params, lgb_train,
    valid_sets=[lgb_train, lgb_eval],
    verbose_eval=1000,
    num_boost_round=2000
)
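
The `auc` metric used above can be read as the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one, with ties counted as one half. The sketch below computes this directly with NumPy on made-up scores. `pairwise_auc` is a hypothetical helper for illustration; it compares all pairs and is therefore only suitable for small arrays:

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Score difference for every positive/negative pair.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Invented labels and scores.
auc_example = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

For real validation data, `roc_auc_score` from scikit-learn (imported at the top of this notebook) is the practical choice.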

# Importance
We check how relevant the features are in the model:

In [None]:
# matplotlib was already imported above as plt
lgb.plot_importance(model)
plt.show()

# Predict

We create an iterator over the test set using the function provided by the competition organisers. For each batch in this iterator, we do the following: (1) we cap `task_container_id` at 9999 and add the features that we computed, (2) we replace missing data in the same way as in the training set, (3) we predict the target, and (4) we submit the predictions with the function that is provided by the competition organisers:

In [None]:
iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    test_df['task_container_id'] = test_df.task_container_id.mask(test_df.task_container_id > 9999, 9999)
    test_df = pd.merge(test_df, group3, left_on=['task_container_id'], right_index= True, how="left")
    test_df = pd.merge(test_df, question2, left_on = 'content_id', right_on = 'question_id', how = 'left')
    test_df = pd.merge(test_df, results_u_final, on=['user_id'],  how="left")
    test_df = pd.merge(test_df, results_u2_final, on=['user_id'],  how="left")
    test_df['answered_correctly_user'].fillna(0.65,  inplace=True)
    test_df['explanation_mean_user'].fillna(prior_mean_user,  inplace=True)
    test_df['quest_pct'].fillna(content_mean,  inplace=True)

    test_df['part'].fillna(4, inplace = True)
    test_df['avg_questions_seen'].fillna(1, inplace = True)
    test_df['prior_question_elapsed_time'].fillna(elapsed_mean, inplace = True)
    test_df['prior_question_had_explanation'].fillna(False, inplace=True)
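    # Cast with a fixed mapping (False -> 0, True -> 1); an encoder re-fitted on each batch
    # would map True to 0 whenever a batch contains no False value
    test_df["prior_question_had_explanation_enc"] = test_df["prior_question_had_explanation"].astype(bool).astype('int8')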
    
    test_df['answered_correctly'] =  model.predict(test_df[['answered_correctly_user', 'explanation_mean_user', 'quest_pct', 'avg_questions_seen',
                                                            'prior_question_elapsed_time','prior_question_had_explanation_enc', 'part']])
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
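
One detail of the loop deserves a closer look: the boolean flag is encoded separately for every batch. A mapping that is derived from the values observed in a batch can shift when the batch happens to contain only one of the two values, whereas a fixed cast cannot. The sketch below illustrates this. `encode_per_batch` is a stand-in written with `np.unique` that assigns codes by position in the sorted unique values; it is an assumption for illustration, not the encoder used in this notebook:

```python
import numpy as np
import pandas as pd

def encode_per_batch(values):
    # Codes are positions in the sorted unique values of this batch only.
    _, codes = np.unique(np.asarray(values, dtype=bool), return_inverse=True)
    return codes

def encode_fixed(values):
    # Fixed mapping: False -> 0, True -> 1, independent of the batch.
    return pd.Series(values).astype(bool).astype('int8').to_numpy()

mixed_batch = [True, False, True]
only_true_batch = [True, True]

print(encode_per_batch(mixed_batch), encode_fixed(mixed_batch))
print(encode_per_batch(only_true_batch), encode_fixed(only_true_batch))
```

On the mixed batch both approaches agree. On the batch that contains only `True`, the per-batch codes become 0, which the model would interpret as "no explanation".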

# Acknowledgement
I am grateful to Takamotoki and Mohammed Abdullah Al Mamun for inspiring me with these notebooks: 
https://www.kaggle.com/takamotoki/lgbm-iii-part2
https://www.kaggle.com/mamun18/riiid-lgbm-lii-hyperparameter-tuning-optuna