<h1>Riiid AIEd Challenge 2020</h1>

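First contact with the competition and the <code>riiideducation</code> package: a look at the files and at the test prediction iteration method used to submit. Instead of a dummy prediction (all predictions 0.5), this version submits the output of a simple LightGBM model trained on a few aggregate features.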

In [None]:
import os
import riiideducation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import gc

In [None]:
DATA_DIR = '/kaggle/input/riiid-test-answer-prediction'
TRAIN_PICKLE = '/kaggle/input/riiid-train/train.pkl.gzip'

The train data is huge (over 101 million rows). Loading it into memory with a plain <code>pd.read_csv</code> crashes the kernel. To avoid this, we set a compact data type for each column and read the data in chunks (thanks to Sirish for this <a href='https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/188908'>hint</a>). Because the load takes more than 9 minutes, after reading the train set the first time I save it as a pickle, which loads in just a few seconds, and convert the following cell to markdown. I've also created a <a href='https://www.kaggle.com/jcesquiveld/riiid-train'>dataset</a> with the pickle file and added it to the data for this notebook.

In [None]:
%%time

types = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'boolean',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',
    'prior_question_had_explanation': 'boolean'
}

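# Load the train dataset in chunks, then concatenate once at the end
# (calling pd.concat inside the loop re-copies the growing frame on every chunk)
chunks = []
for chunk in pd.read_csv(os.path.join(DATA_DIR, 'train.csv'), chunksize=1000000, low_memory=False, dtype=types):
    chunks.append(chunk)
train = pd.concat(chunks, ignore_index=True)
del chunks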


In [None]:
train.head()
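
To give a feel for why the compact data types matter, here is a minimal sketch on synthetic data. The row count and value ranges are assumptions chosen only so that the values fit the smaller types; this is not the real train set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # tiny stand-in for the ~101 million rows of train.csv

# Synthetic columns (assumption: ranges picked to fit the compact types below)
wide = pd.DataFrame({
    'user_id': rng.integers(0, 2_000_000_000, n),            # int64 by default
    'content_id': rng.integers(0, 30_000, n),                # int64 by default
    'task_container_id': rng.integers(0, 10_000, n),         # int64 by default
    'answered_correctly': rng.integers(0, 2, n),             # int64 by default
    'prior_question_elapsed_time': rng.random(n) * 300_000,  # float64 by default
})

# Same data, cast to the types used in the notebook
compact = wide.astype({
    'user_id': 'int32',
    'content_id': 'int16',
    'task_container_id': 'int16',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',
})

wide_bytes = wide.memory_usage(deep=True).sum()
compact_bytes = compact.memory_usage(deep=True).sum()
print(f'default dtypes: {wide_bytes / 1e6:.2f} MB, compact dtypes: {compact_bytes / 1e6:.2f} MB')
```

On these five columns the compact frame needs less than a third of the memory, because each row drops from 40 bytes to 13.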

In [None]:
WORKING_DIR="/kaggle/working"
train.to_pickle(os.path.join(WORKING_DIR, 'train.pkl.gzip'))


In [None]:
%%time
WORKING_DIR="/kaggle/working"
TRAIN_PICKLE=os.path.join(WORKING_DIR, 'train.pkl.gzip')
# Load the train data set
train_all = pd.read_pickle(TRAIN_PICKLE)
train_all.head()
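
The "read the CSV once, then reuse the pickle" routine can be wrapped in a small helper. This is only a sketch: <code>load_cached</code> and the toy file names are hypothetical, not part of the notebook or of pandas.

```python
import os
import tempfile

import pandas as pd


def load_cached(csv_path, pickle_path, **read_csv_kwargs):
    """Return the frame stored at pickle_path if it exists; otherwise parse csv_path and cache it."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    df = pd.read_csv(csv_path, **read_csv_kwargs)
    df.to_pickle(pickle_path)
    return df


# Demo in a throwaway directory with a three-row toy file
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pkl_path = os.path.join(tmp, 'toy.pkl')
    pd.DataFrame({'row_id': [0, 1, 2], 'answered_correctly': [1, 0, 1]}).to_csv(csv_path, index=False)

    # First call parses the CSV and writes the pickle
    first = load_cached(csv_path, pkl_path, dtype={'row_id': 'int64', 'answered_correctly': 'int8'})
    # Second call skips the CSV and reads the pickle
    second = load_cached(csv_path, pkl_path)

    same = first.equals(second)
    kept_dtype = str(second['answered_correctly'].dtype)
```

A side benefit of the pickle is that it keeps the column data types, so the <code>dtype</code> mapping is only needed on the first read.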

<h2>Data preparation and feature engineering</h2>

In [None]:
# Keep only useful columns for this version

TARGET = 'answered_correctly'
columns = ['user_id', 'content_id', 'prior_question_elapsed_time', 'prior_question_had_explanation']
train = train_all.loc[train_all.content_type_id == False, columns + [TARGET]]
del train_all
gc.collect()

In [None]:
train.info()

In [None]:
%%time

# Calculate user_performance features
user_performance = train.groupby('user_id')['answered_correctly'].agg(['sum', 'count'])
user_performance['user_percent_correct'] = user_performance['sum'] / user_performance['count']
user_performance.drop(columns=['sum'], inplace=True)
user_performance.head()

In [None]:
%%time

# Calculate question_performance features
question_performance = train.groupby('content_id')['answered_correctly'].agg(['sum', 'count'])
question_performance['question_percent_correct'] = question_performance['sum'] / question_performance['count']
question_performance.drop(columns=['sum', 'count'], inplace=True)
question_performance.head()

In [None]:
%%time

prior_question_elapsed_time_mean = train.prior_question_elapsed_time.mean()

In [None]:
# We keep only 10% of data for training
data = train.sample(frac=0.1)
data.reset_index(drop=True, inplace=True)

del train
_ = gc.collect()

data.head()

In [None]:
# Add user features and question features

data = data.join(user_performance, on='user_id')
data = data.join(question_performance, on='content_id')
data.reset_index(drop=True, inplace=True)
data.prior_question_had_explanation = data.prior_question_had_explanation.fillna(False).astype(np.int8)
data.head()
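
To see what the aggregation and the two joins produce, here is the same logic on a five-row toy frame (the values are made up). The question feature uses <code>mean()</code>, which is equivalent to sum / count for a 0/1 target.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':            [1, 1, 1, 2, 2],
    'content_id':         [10, 11, 10, 10, 12],
    'answered_correctly': [1, 0, 1, 0, 0],
})

# Per-user answer count and fraction of correct answers
user_perf = toy.groupby('user_id')['answered_correctly'].agg(['sum', 'count'])
user_perf['user_percent_correct'] = user_perf['sum'] / user_perf['count']
user_perf = user_perf.drop(columns=['sum'])

# Per-question fraction of correct answers
question_perf = (
    toy.groupby('content_id')['answered_correctly']
       .mean()
       .rename('question_percent_correct')
       .to_frame()
)

# Join the lookup tables onto new rows; user 3 and question 99 were never seen
new_rows = pd.DataFrame({'user_id': [1, 3], 'content_id': [10, 99]})
joined = new_rows.join(user_perf, on='user_id').join(question_perf, on='content_id')
print(joined)
```

The second row comes back with NaN in every feature column, which is why unseen users and questions need a default value at prediction time.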


In [None]:
# Split into training and validation sets

features = ['user_percent_correct', 'count', 'question_percent_correct','prior_question_elapsed_time', 
            'prior_question_had_explanation']
data_train, data_val = train_test_split(data, test_size=0.20)

_ = gc.collect()

<h2>Training</h2>

In [None]:
params = {
    'objective': 'binary',
    'seed': 42,
    'metric': 'auc',
    'learning_rate': 0.05,
    'max_bin': 800,
    'num_leaves': 80
}

lgb_train = lgb.Dataset(data_train[features], data_train['answered_correctly'])
lgb_val = lgb.Dataset(data_val[features], data_val['answered_correctly'])

_ = gc.collect()

In [None]:
# Train classifier

model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_train, lgb_val],
    verbose_eval=10,
    num_boost_round=10,
    early_stopping_rounds=1
)

In [None]:
# Let's plot feature importance

lgb.plot_importance(model)

In [None]:
columns = ['user_id', 'content_id', 'prior_question_elapsed_time', 'prior_question_had_explanation']

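def prepare_test(test):
    # Build the same feature columns used in training for one test batch
    df = test[columns]
    # Look up the aggregate features computed on the train set
    df = df.join(user_performance, on='user_id')
    df = df.join(question_performance, on='content_id')
    df.prior_question_had_explanation = df.prior_question_had_explanation.fillna(False).astype(np.int8)
    df.prior_question_elapsed_time = df.prior_question_elapsed_time.fillna(prior_question_elapsed_time_mean)
    # Users or questions not seen in train have no aggregates; default everything left to 0.5
    # (note that this also sets 'count' to 0.5 for unseen users)
    df.fillna(0.5, inplace=True)
    return df[features]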

<h2>Prediction phase</h2>

Once we have trained our model(s), we're ready to make predictions. For this, we have to use the <code>riiideducation</code> API.

In [None]:
# This has to be called once and only once in a notebook. If called twice by mistake, restart session. 
env = riiideducation.make_env()

# This is the prediction workflow

iter_test = env.iter_test()
for (test_df, prediction_df) in iter_test:
    test_df = test_df.loc[test_df.content_type_id == 0].reset_index(drop=True)
    test = prepare_test(test_df)
    test_df['answered_correctly'] = model.predict(test)   
    env.predict(test_df[['row_id', 'answered_correctly']])
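
Since <code>make_env()</code> can only be called once per session, it can help to rehearse the loop against a stand-in first. The <code>MockEnv</code> class below is hypothetical: it only imitates the shape of the workflow above (batches out, predictions in) and does not reproduce any checks the real package performs.

```python
import pandas as pd


class MockEnv:
    """Hypothetical stand-in for the object returned by riiideducation.make_env()."""

    def __init__(self, batches):
        self._batches = batches
        self.submitted = []

    def iter_test(self):
        # Yield (test_df, sample_prediction_df) pairs, one per batch
        for batch in self._batches:
            sample = batch.loc[batch.content_type_id == 0, ['row_id']].copy()
            sample['answered_correctly'] = 0.5
            yield batch, sample

    def predict(self, prediction_df):
        # Just collect what the loop sends back
        self.submitted.append(prediction_df)


# Two made-up batches; the row with content_type_id == 1 is a lecture
batches = [
    pd.DataFrame({'row_id': [0, 1, 2], 'content_type_id': [0, 1, 0]}),
    pd.DataFrame({'row_id': [3, 4], 'content_type_id': [0, 0]}),
]

mock_env = MockEnv(batches)
for test_df, prediction_df in mock_env.iter_test():
    test_df = test_df.loc[test_df.content_type_id == 0].reset_index(drop=True)
    test_df['answered_correctly'] = 0.5  # dummy value in place of model.predict(...)
    mock_env.predict(test_df[['row_id', 'answered_correctly']])

submission = pd.concat(mock_env.submitted, ignore_index=True)
print(submission)
```

Only question rows reach <code>predict</code>, so the lecture row is absent from the collected output.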

In [None]:
env.predictions[0].to_csv("/kaggle/working/submission.csv",index=False)
env.predictions[0].to_csv("submission.csv",index=False)

That's all, folks!