> <h1>Riiid AIEd Challenge 2020</h1>

A first contact with the competition and the <code>riiideducation</code> package. We just have a look at the files and at the test prediction iteration method used to submit a dummy prediction (all predictions 0.5).

In [None]:
import os
import riiideducation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc

In [None]:
DATA_DIR = '/kaggle/input/riiid-test-answer-prediction'
TRAIN_PICKLE = '/kaggle/input/riiid-train/train.pkl.gzip'
WORKING_DIR = '/kaggle/working'

The train data is huge (over 101 million rows). Trying to load it into memory with a plain <code>pd.read_csv</code> crashes the kernel. To avoid this, we customize the data type used for each column and read the data in chunks (thanks to Sirish for this <a href='https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/188908'>hint</a>). Also, since loading takes more than 9 minutes, after reading the train set the first time I saved it as a pickle object, which is much quicker to load (just a few seconds), and converted the following cell to markdown. After that, I created a <a href='https://www.kaggle.com/jcesquiveld/riiid-train'>dataset</a> with the pickle file and added it to the data for this notebook.

```python
%%time

types = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'boolean',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',
    'prior_question_had_explanation': 'boolean'
}

# Load train dataset by chunks and concatenate once at the end
# (concatenating inside the loop re-copies the growing frame on every chunk)
chunks = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'), chunksize=1000000, low_memory=False, dtype=types)
train = pd.concat(chunks, ignore_index=True)

# Save train dataset as pickle object, much quicker to load
train.to_pickle(os.path.join(WORKING_DIR, 'train.pkl.gzip'))
```
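
To give a feel for why the narrower dtypes matter, here is a minimal sketch on synthetic data (the values are made up and only three of the columns are mimicked). It compares the memory footprint of default 64-bit integers with the widths used in the `types` dict above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of train.csv (values are invented for illustration)
n = 100_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'user_id': rng.integers(0, 2_000_000_000, size=n),
    'content_id': rng.integers(0, 13_000, size=n),
    'answered_correctly': rng.integers(0, 2, size=n),
})

# Default integer columns are 64-bit: 8 bytes per value
default_bytes = toy.astype('int64').memory_usage(index=False).sum()

# Narrower dtypes, matching the widths in the `types` dict
compact = toy.astype({
    'user_id': 'int32',
    'content_id': 'int16',
    'answered_correctly': 'int8',
})
compact_bytes = compact.memory_usage(index=False).sum()

print(default_bytes, compact_bytes)
```

Per value, these three columns go from 8 bytes each to 4, 2 and 1 bytes, and the same reasoning applies to the remaining columns of the real file.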

In [None]:
%%time

# Load the train data set
train_all = pd.read_pickle(TRAIN_PICKLE)
train_all.head()

In [None]:
train_all.to_pickle(os.path.join(WORKING_DIR, 'train.pkl.gzip'))

<h2>Data preparation and feature engineering</h2>

In [None]:
# Keep only useful columns for this version

TARGET = 'answered_correctly'
columns = ['user_id', 'content_id', 'prior_question_elapsed_time', 'prior_question_had_explanation']
train = train_all.loc[train_all.content_type_id == False, columns + [TARGET]]
del train_all
gc.collect()

In [None]:
train.info()

In [None]:
%%time

# Calculate user_performance features
user_performance = train.groupby('user_id')['answered_correctly'].agg(['sum', 'count'])
user_performance['user_percent_correct'] = user_performance['sum'] / user_performance['count']
user_performance.drop(columns=['sum'], inplace=True)
user_performance.head()

In [None]:
%%time

# Calculate question_performance features
question_performance = train.groupby('content_id')['answered_correctly'].agg(['sum', 'count'])
question_performance['question_percent_correct'] = question_performance['sum'] / question_performance['count']
question_performance.drop(columns=['sum', 'count'], inplace=True)
question_performance.head()

In [None]:
%%time

prior_question_elapsed_time_mean = train.prior_question_elapsed_time.mean()

In [None]:
# Use only 1/20 of users for training/validation

np.random.seed(45)
users_ids = train.user_id.unique()
data_users_ids = np.random.choice(users_ids, users_ids.shape[0] // 20, replace=False)

data = train.loc[train.user_id.isin(data_users_ids)]

del train

_ = gc.collect()

data.shape

In [None]:
# Expand data with performance features

data = data.join(user_performance, on='user_id', how='left')
data = data.join(question_performance, on='content_id', how='left')
data.reset_index(drop=True, inplace=True)
data.prior_question_had_explanation = data.prior_question_had_explanation.fillna(False).astype(np.int8)
data.head()

<h2>Train/val split</h2>

In [None]:
%%time

# For validation we use the tail (the last `threshold` interactions) of half of those users
# The threshold is chosen so that the train/val proportion is approximately 80/20
# This way, users with fewer interactions than the threshold still remain in the train set

half_data_users_ids = np.random.choice(data_users_ids, data_users_ids.shape[0] // 2, replace=False)
data_val = data.loc[data.user_id.isin(half_data_users_ids)].groupby('user_id').tail(370)
data_train = data.drop(data_val.index)
print('validation set proportion', data_val.shape[0] / (data_train.shape[0] + data_val.shape[0]))

del data

_ = gc.collect()
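
The tail-based split above can be illustrated on a toy frame. This is only a sketch: `val_users` and `tail_size` are hypothetical stand-ins for `half_data_users_ids` and the 370 threshold, and it assumes rows are already in chronological order within each user.

```python
import pandas as pd

# Toy interaction log: user 1 has 5 rows, user 2 has 3, user 3 has 2
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
    'answered_correctly': [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
})

val_users = [1, 3]   # hypothetical subset of users chosen for validation
tail_size = 2        # plays the role of the 370 threshold

# The last `tail_size` rows of each selected user go to validation
toy_val = toy.loc[toy.user_id.isin(val_users)].groupby('user_id').tail(tail_size)

# Everything else stays in train
toy_train = toy.drop(toy_val.index)

print(toy_val.index.tolist())
print(toy_train.user_id.unique().tolist())
```

User 3 has no more rows than the threshold and was selected, so all of its rows land in validation. User 2 is just as short but was not selected, so it stays entirely in train.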

In [None]:
data_train.columns.values

<h2>Training</h2>

In [None]:
features = [
    'prior_question_elapsed_time',
    'prior_question_had_explanation',
    'count',
    'user_percent_correct',
    'question_percent_correct'
]

In [None]:
params = {
    'objective': 'binary',
    'seed': 42,
    'metric': 'auc',
    'learning_rate': 0.05,
    'max_bin': 800,
    'num_leaves': 75
}

lgb_train = lgb.Dataset(data_train[features], data_train['answered_correctly'])
lgb_val = lgb.Dataset(data_val[features], data_val['answered_correctly'])

_ = gc.collect()

In [None]:
# Train classifier

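# Note: LightGBM >= 4.0 takes logging and early stopping as callbacks;
# older versions used the verbose_eval / early_stopping_rounds arguments instead
model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_train, lgb_val],
    num_boost_round=5000,
    callbacks=[
        lgb.log_evaluation(period=100),
        lgb.early_stopping(stopping_rounds=10)
    ]
)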

In [None]:
# Let's plot feature importance

lgb.plot_importance(model)

In [None]:
features

In [None]:
columns = ['user_id', 'content_id', 'prior_question_elapsed_time', 'prior_question_had_explanation']

# Create features for user performance as a dict
user_performance_dict = {}
for key in user_performance.index.values:
    user_performance_dict[key] = user_performance.loc[key].to_numpy()
    
# Create features for question performance as a dict
question_performance_dict = {}
for key in question_performance.index.values:
    question_performance_dict[key] = question_performance.loc[key].to_numpy()

In [None]:
# Prepares a batch for prediction using numpy arrays and python dictionaries (no merge)
def prepare_test(test_df):
    
    test_np = test_df[columns].to_numpy()
    x_test = np.zeros((len(test_np), len(features)))
    for i in range(len(test_np)):
        x_test[i,0:2] = test_np[i,2:]
        x_test[i,2:4] = user_performance_dict.get(test_np[i][0], [0,0])
        x_test[i,4:] = question_performance_dict.get(test_np[i][1])
        
    
    return x_test
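
The dictionary lookup pattern used in `prepare_test` can be seen in isolation on a tiny example. This is a sketch with invented tables (`toy_user_stats`, `toy_question_stats`) and only three feature columns. The explicit NaN default for unseen questions is an assumption of this sketch, not something the function above spells out.

```python
import numpy as np

# Hypothetical lookup tables:
#   user_id    -> [count, user_percent_correct]
#   content_id -> [question_percent_correct]
toy_user_stats = {10: np.array([4.0, 0.75])}
toy_question_stats = {100: np.array([0.6])}

# Each row is (user_id, content_id); the second row has an unseen user and question
toy_batch = [(10, 100), (99, 999)]

x = np.zeros((len(toy_batch), 3))
for i, (user_id, content_id) in enumerate(toy_batch):
    # Unseen users fall back to zeros, as in prepare_test
    x[i, 0:2] = toy_user_stats.get(user_id, [0, 0])
    # Unseen questions get an explicit NaN here (assumption of this sketch)
    x[i, 2:] = toy_question_stats.get(content_id, [np.nan])

print(x)
```

Plain dictionary lookups avoid building and merging DataFrames for every small test batch, which is the point of the "no merge" approach.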

In [None]:
%%time

# Sanity check - To check we get the same result preprocessing with the prepare_test method

y_val_pred = model.predict(prepare_test(data_val))
y_val = data_val['answered_correctly']
roc_auc_score(y_val, y_val_pred)

In [None]:
test_df = pd.read_csv(os.path.join(DATA_DIR, 'example_test.csv'))
test_df.head()

In [None]:
%%timeit
x_test = prepare_test(test_df)

<h2>Prediction phase</h2>

Once we have trained our model(s), we're ready to make predictions. For this, we have to use the <code>riiideducation</code> API.

In [None]:
# To avoid running before submitting
pd.DataFrame().to_csv('submission.csv')

In [None]:
# This has to be called once and only once in a notebook. If called twice by mistake, restart session. 
env = riiideducation.make_env()

# This is the prediction workflow

iter_test = env.iter_test()
for (test_df, prediction_df) in iter_test:
    test_df = test_df.loc[test_df.content_type_id == 0].reset_index(drop=True)
    x_test = prepare_test(test_df)
    test_df['answered_correctly'] = model.predict(x_test)   
    env.predict(test_df[['row_id', 'answered_correctly']])
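
Since <code>riiideducation</code> is only available inside the competition environment, the shape of the loop can be rehearsed locally with a stand-in. The sketch below uses a hypothetical `MockEnv` class: it is not the real API, and it only mimics the two calls used above (`iter_test` and `predict`). A constant 0.5 replaces `model.predict`.

```python
import pandas as pd

class MockEnv:
    """Hypothetical stand-in, NOT the riiideducation API: mimics only iter_test/predict."""

    def __init__(self, batches):
        self._batches = batches
        self.submitted = []

    def iter_test(self):
        # Yield (test batch, sample prediction frame) pairs, one batch at a time
        for batch in self._batches:
            sample = batch.loc[batch.content_type_id == 0, ['row_id']].assign(answered_correctly=0.5)
            yield batch, sample

    def predict(self, prediction_df):
        # Collect the predictions instead of sending them anywhere
        self.submitted.append(prediction_df)

# Two invented batches; row 1 is a lecture (content_type_id == 1) and must be skipped
batches = [
    pd.DataFrame({'row_id': [0, 1], 'content_type_id': [0, 1]}),
    pd.DataFrame({'row_id': [2], 'content_type_id': [0]}),
]

mock_env = MockEnv(batches)
for batch_df, sample_df in mock_env.iter_test():
    batch_df = batch_df.loc[batch_df.content_type_id == 0].reset_index(drop=True)
    batch_df['answered_correctly'] = 0.5   # constant placeholder instead of model.predict
    mock_env.predict(batch_df[['row_id', 'answered_correctly']])

result = pd.concat(mock_env.submitted, ignore_index=True)
print(result)
```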

That's all folks