# Introduction

Riiid Labs provides educational solutions, including an AI tutor based on deep-learning algorithms. This is my knowledge-tracing submission to the _Riiid AIEd Challenge 2020_. The purpose of this notebook is to: 
* Present a thorough exploratory data analysis of the student-question interaction dataset. 
* Predict whether a student answers a question correctly. 

This submission is written in Python. The first step is to import the necessary libraries and data. 

In [None]:
# Import libraries

import riiideducation

import numpy as np 
import pandas as pd 

from pandas_profiling import ProfileReport

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import label_binarize

import lightgbm as lgb
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from sklearn.ensemble import StackingClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

from sklearn.metrics import mean_absolute_error

from matplotlib import pyplot as plt 
%matplotlib inline

In [None]:
# Import data

# Adapted from https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/188908
chunks = []
reader = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', chunksize=1000000, low_memory=False)
for counter, chunk in enumerate(reader, start=1):
    print('Reading chunk {}'.format(counter))
    # Keep a 10% sample of each chunk, as the full file is too big for memory
    chunks.append(chunk.sample(frac=0.1, random_state=1))

# Concatenate once at the end rather than growing the data frame inside the loop
train_df = pd.concat(chunks, ignore_index=True)

In [None]:
# Import other data sets, which are small enough

test_df = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv')
# questions_df = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
# lectures_df = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')

# EDA

Let's look at the features in each data frame and see what they mean. 

Here is the data dictionary of `train.csv`, copied from the introduction. 

* `row_id`: (int64) ID code for the row.
* `timestamp`: (int64) the time between this user interaction and the first event completion from that user.
* `user_id`: (int32) ID code for the user.
* `content_id`: (int16) ID code for the user interaction
* `content_type_id`: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
* `task_container_id`: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.
* `user_answer`: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.
* `answered_correctly`: (int8) if the user responded correctly. Read -1 as null, for lectures.
* `prior_question_elapsed_time`: (float32) The average time it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.
* `prior_question_had_explanation`: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
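The dtypes listed in the data dictionary can be passed to `pd.read_csv` so that each column is stored in its smallest suitable type, which helps with a file of this size. The following is a minimal sketch on a tiny made-up sample rather than the real `train.csv`; the name `TRAIN_DTYPES` and the choice of the nullable `"boolean"` dtype for the last column are assumptions of this example.

```python
import io
import pandas as pd

# Dtypes taken from the data dictionary above.
# prior_question_had_explanation can be null, so this sketch reads it
# with pandas' nullable "boolean" dtype (an assumption, not the author's choice).
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

# Tiny made-up sample standing in for train.csv
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)

sample_df = pd.read_csv(sample_csv, dtype=TRAIN_DTYPES)
print(sample_df.dtypes)
```

The same `dtype=` argument can be combined with `chunksize=` when reading the full file.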

Let's look at the first few rows to see what the data frame looks like. 

In [None]:
train_df.head(5)

The data frame is too large to generate a profiling report, so we inspect it manually instead. The `.info()` method below shows the dtype of each column and, where pandas reports it, the non-null count. 

In [None]:
train_df.info()

In [None]:
f'There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in train_df. '

In [None]:
print(train_df['answered_correctly'].isna().sum())
print(train_df.shape[0])
print('{:.2%} of the values in answered_correctly are missing. '.format(train_df['answered_correctly'].isna().sum()/train_df.shape[0]))

In [None]:
print(train_df['prior_question_elapsed_time'].isna().sum())
print(train_df.shape[0])
print('{:.2%} of the values in prior_question_elapsed_time are missing. '.format(train_df['prior_question_elapsed_time'].isna().sum()/train_df.shape[0]))

So how many records are missing in `'prior_question_had_explanation'`? 

In [None]:
print(train_df['prior_question_had_explanation'].isna().sum())
print(train_df.shape[0])
print('{:.2%} of the values in prior_question_had_explanation are missing. '.format(train_df['prior_question_had_explanation'].isna().sum()/train_df.shape[0]))

Since so few values are missing, it might be okay to remove them. 
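The three checks above can be folded into one helper that reports every column at once. This is a minimal sketch on a made-up data frame; `missing_summary` and `toy_df` are hypothetical names. Note that, per the data dictionary, `answered_correctly` marks lectures with -1 rather than a null, so `isna()` does not count those rows.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = 100 * counts / len(df)
    return pd.DataFrame({"missing": counts, "percent": percent.round(2)})

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "answered_correctly": [1, 0, -1, 1],
    "prior_question_elapsed_time": [np.nan, 21000.0, 18000.0, 25000.0],
    "prior_question_had_explanation": [None, True, None, False],
})

summary = missing_summary(toy_df)
print(summary)
```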

`test_df` is loaded from `example_test.csv`, a small sample of the test set. Besides the features described above, it has two extra columns, which carry the answers and targets of the previous group. They are 
* `prior_group_responses` (string) provides all of the `user_answer` entries for previous group in a string representation of a list in the first row of the group. All other rows in each group are null. If you are using Python, you will likely want to call `eval` on the non-null rows. Some rows may be null, or empty lists.

* `prior_group_answers_correct` (string) provides all the `answered_correctly` field for previous group, with the same format and caveats as `prior_group_responses`. Some rows may be null, or empty lists.
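The two columns above store lists as strings. As a safer alternative to `eval`, the standard library's `ast.literal_eval` only parses Python literals. The following is a small sketch on made-up values; `parse_group_list` and `toy_test` are hypothetical names, and treating null rows as empty lists is an assumption of this example.

```python
import ast
import pandas as pd

def parse_group_list(value):
    """Parse a string such as '[1, 0]' into a list; nulls become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)

# Made-up values: a filled row, a null row and an empty list
toy_test = pd.DataFrame({"prior_group_answers_correct": ["[1, 0, 1]", None, "[]"]})

parsed = toy_test["prior_group_answers_correct"].apply(parse_group_list)
print(parsed.tolist())
```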

In [None]:
# ProfileReport(test_df, title="`test_df` Profiling Report", progress_bar=False)

In [None]:
# ProfileReport(questions_df, title="`questions_df` Profiling Report", progress_bar=False)

Here's the data dictionary of `questions_df`: 
* `question_id`: foreign key for the train/test content_id column, when the content type is question (0).
* `bundle_id`: code for which questions are served together.
* `correct_answer`: the answer to the question. Can be compared with the train user_answer column to check if the user was right.
* `part`: the relevant section of the TOEIC test.
* `tags`: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
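As the dictionary notes, `question_id` joins to `content_id` and `correct_answer` can be compared with `user_answer`. A minimal sketch of that join on made-up rows follows; the values are invented, and the space-separated format of `tags` is an assumption of this example.

```python
import pandas as pd

# Made-up question metadata
questions = pd.DataFrame({
    "question_id": [0, 1],
    "bundle_id": [0, 1],
    "correct_answer": [0, 1],
    "part": [1, 1],
    "tags": ["51 131 162 38", "131 36 81"],  # assumed space-separated
})

# Made-up question interactions (content_type_id == 0)
interactions = pd.DataFrame({
    "content_id": [0, 1, 1],
    "content_type_id": [0, 0, 0],
    "user_answer": [0, 3, 1],
})

# Attach the question metadata to each interaction
merged = interactions.merge(
    questions, left_on="content_id", right_on="question_id", how="left"
)

# Recompute correctness by comparing the user's answer with the key
merged["is_correct"] = (merged["user_answer"] == merged["correct_answer"]).astype("int8")

# Split the tag string into a list of tag codes
merged["tag_list"] = merged["tags"].str.split()
print(merged[["content_id", "user_answer", "correct_answer", "is_correct", "tag_list"]])
```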

In [None]:
# ProfileReport(lectures_df, title="`lectures_df` Profiling Report", progress_bar=False)

Participants may also watch lectures in addition to answering questions. The lectures are listed in `lectures_df`, and the details are below: 
* `lecture_id`: foreign key for the train/test content_id column, when the content type is lecture (1).
* `part`: top level category code for the lecture.
* `tag`: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.
* `type_of`: brief description of the core purpose of the lecture

## Missing Data

We now handle the missing data in `train_df` and `test_df`. From the EDA, about 2.3% and 0.3% of the values are missing in the last two columns of `train_df`, `prior_question_elapsed_time` and `prior_question_had_explanation`. 

In [None]:
# Code from https://www.kaggle.com/dmikar/baseline-for-riiid-lightgbm
mean_prior = train_df['prior_question_elapsed_time'].astype("float64").mean()
print(f'{mean_prior} is filled for the missing data in prior_question_elapsed_time. ')

# Assign the result back instead of calling fillna(inplace=True) on a column selection
train_df['prior_question_elapsed_time'] = train_df['prior_question_elapsed_time'].fillna(mean_prior)
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].fillna(False)

In [None]:
# Fill the test set with the mean computed on the training data, so that
# train and test rows are imputed with the same value
print(f'{mean_prior} is filled for the missing data in prior_question_elapsed_time. ')

test_df['prior_question_elapsed_time'] = test_df['prior_question_elapsed_time'].fillna(mean_prior)
test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(False)

## Correct the data types

The LightGBM model only accepts `int`, `float` or `bool` dtypes. Since `prior_question_had_explanation` in `test_df` is not stored as a plain `bool` column, we need to convert it. 

In [None]:
test_df['prior_question_had_explanation'].dtype

In [None]:
test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].astype('bool')

In [None]:
test_df['prior_question_had_explanation'].dtype
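The order of the two steps matters: missing values should be filled before casting, because a cast to `bool` cannot represent a null. A minimal sketch on a made-up column stored with pandas' nullable `"boolean"` dtype (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up column with one missing value, stored as nullable "boolean"
toy = pd.DataFrame(
    {"prior_question_had_explanation": pd.array([True, None, False], dtype="boolean")}
)

col = toy["prior_question_had_explanation"]
print(col.dtype)

# Fill the nulls first, then cast to NumPy bool
filled = col.fillna(False).astype("bool")
print(filled.dtype)
print(filled.tolist())
```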

# Model

In [None]:
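# Train on question rows only: lectures carry answered_correctly == -1, which is not a real label
train_df = train_df[train_df['content_type_id'] == 0]

y = train_df['answered_correctly'].to_numpy()
X = train_df[['user_id', 'content_id', 'content_type_id', 'task_container_id', 'prior_question_elapsed_time', 'prior_question_had_explanation']].to_numpy()

# shuffle=False keeps the row order, so the last 30% of the rows form the validation set
X_train_df, X_val_df, y_train_df, y_val_df = train_test_split(X, y, test_size=0.3, shuffle=False)

# Free the memory held by the full data frame
del train_df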

## LightGBM Model

In [None]:
lgb_train = lgb.Dataset(X_train_df, y_train_df, categorical_feature = ['prior_question_had_explanation'], free_raw_data=False)
lgb_eval = lgb.Dataset(X_val_df, y_val_df, categorical_feature = ['prior_question_had_explanation'], free_raw_data=False)

We now train the model. 

In [None]:
# # param values c.f. https://www.kaggle.com/zephyrwang666/riiid-lgbm-bagging2
# param = {'num_leaves': sp_randint(150, 400), 'max_bin':sp_randint(300, 800), 'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4], 
#          'feature_fraction': sp_uniform(0, 1), 'bagging_fraction': sp_uniform(0, 1), 
#          'objective': ['binary'], 'max_depth': [-1], 
#          'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09], "boosting_type": ["gbdt"], "bagging_seed": [47], 
#          'eval_metric': ['logloss'], "verbosity": [-1], 
#          'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100], 'reg_lambda': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100], 
#          'random_state': [47]}

# m1 = lgb.LGBMClassifier(valid_sets = [lgb_train, lgb_eval], verbose_eval = 30, num_boost_round = 10000, early_stopping_rounds = 10, n_jobs=4, n_estimators=3000)

# '''
# Hyperparameter optimisation
# '''
# # Code from https://www.kaggle.com/rtatman/lightgbm-hyperparameter-optimisation-lb-0-761#Model-fitting-with-HyperParameter-optimisation
# #This parameter defines the number of hyperparameter points to be tested
# n_HP_points_to_test = 5

# gsLGBM = RandomizedSearchCV(
#     estimator=m1, param_distributions=param, 
#     n_iter=n_HP_points_to_test,
#     cv=3,
#     refit=True,
#     random_state=47,
#     verbose=True)

In [None]:
# gsLGBM.fit(X_train_df, y_train_df, eval_set = (X_val_df, y_val_df))
# print('Best score reached: {} with params: {} '.format(gsLGBM.best_score_, gsLGBM.best_params_))
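`RandomizedSearchCV` draws each candidate by calling `.rvs()` on the scipy distributions in the parameter grid. The sketch below shows what such draws look like. One caveat, stated as my understanding rather than a certainty: `sp_uniform(0, 1)` can return values arbitrarily close to 0, while LightGBM expects `feature_fraction` and `bagging_fraction` to be strictly positive, so a lower bound such as 0.1 (an arbitrary choice here) may be safer.

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# sp_randint(low, high) draws integers in [low, high)
num_leaves_dist = sp_randint(150, 400)

# sp_uniform(loc, scale) draws floats in [loc, loc + scale];
# loc=0.1 is an assumed lower bound that keeps the fraction away from 0
fraction_dist = sp_uniform(loc=0.1, scale=0.9)

leaves = num_leaves_dist.rvs(size=5, random_state=47)
fractions = fraction_dist.rvs(size=5, random_state=47)

print(leaves)
print(fractions)
```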

In [None]:
# Best parameters found by the random search, recorded here so the search need not be rerun. 
# Score: 0.6788
opt_parameters_LGBM = {'bagging_fraction': 0.11348847189364952, 'bagging_seed': 47, 'boosting_type': 'gbdt', 
 'eval_metric': 'logloss', 'feature_fraction': 0.9744830944364566, 'learning_rate': 0.09, 
 'max_bin': 479, 'max_depth': -1, 'min_child_weight': 1e-05, 'num_leaves': 173, 
 'objective': 'binary', 'random_state': 47, 'reg_alpha': 0, 'reg_lambda': 50, 'verbosity': -1}

In [None]:
# The validation data is passed to fit() through eval_set;
# training stops once the validation loss has not improved for 10 rounds
m1 = lgb.LGBMClassifier(num_boost_round = 10000, early_stopping_rounds = 10, n_jobs=4, n_estimators=3000, **opt_parameters_LGBM)
m1.fit(X_train_df, y_train_df, eval_set = [(X_val_df, y_val_df)])

In [None]:
# print(f'The mean absolute error of the model is {mean_absolute_error(y_val_df, gsLGBM.predict(X_val_df))}. ')
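The competition ranks submissions by the area under the ROC curve (AUC), so it may be more informative to validate with that metric than with the mean absolute error. Below is a minimal sketch of `sklearn.metrics.roc_auc_score` on made-up labels and scores; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is the fraction of (negative, positive) pairs ranked correctly:
# 3 of the 4 pairs here have the positive scored higher
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

With a fitted classifier, the scores would come from the column of `m1.predict_proba(X_val_df)` that corresponds to class 1; checking `m1.classes_` first confirms which column that is.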

## AdaBoost

In [None]:
# m2 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, min_impurity_decrease=10, random_state=47), random_state=47)

In [None]:
# param = {'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], 'n_estimators': sp_randint(5, 50)}

# '''
# Hyperparameter optimisation
# '''
# # Code from https://www.kaggle.com/rtatman/lightgbm-hyperparameter-optimisation-lb-0-761#Model-fitting-with-HyperParameter-optimisation
# #This parameter defines the number of HP points to be tested
# n_HP_points_to_test = 3

# gsADA = RandomizedSearchCV(
#     estimator=m2, param_distributions=param, 
#     n_iter=n_HP_points_to_test,
#     cv=3,
#     refit=True,
#     random_state=47,
#     verbose=True)

In [None]:
# gsADA.fit(X_train_df, y_train_df)
# print('Best score reached: {} with params: {} '.format(gsADA.best_score_, gsADA.best_params_))

In [None]:
# Best parameters found by the random search, recorded here so the search need not be rerun. 
# Score: 0.64453
# opt_parameters_ADA = {'learning_rate': 0.08, 'n_estimators': 11}

## Ensembling the Models

In [None]:
# Final models from LGBM and ADABoost
# estimators = [
#     ('lgbm', lgb.LGBMClassifier(verbose_eval = 30, num_boost_round = 10000, early_stopping_rounds = 10, valid = [lgb_train, lgb_eval],
#                                 n_jobs=4, n_estimators=3000, metric='multi_logloss', **gsLGBM.best_params_)),
#     ('ab', AdaBoostClassifier(DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, min_impurity_decrease=10, random_state=47), **gsADA.best_params_))
# ]

# If anything goes wrong with the search objects, uncomment the following instead: 
# estimators = [
#     ('lgbm', lgb.LGBMClassifier(verbose_eval = 30, num_boost_round = 10000, 
#                                 n_jobs=4, n_estimators=3000, **opt_parameters_LGBM)),
#     ('ab', AdaBoostClassifier(DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, min_impurity_decrease=10, random_state=47), **opt_parameters_ADA))
# ]

# del gsLGBM
# del gsADA

In [None]:
# Code from https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python#Second-Level-Predictions-from-the-First-level-Output
# gbm = xgb.XGBClassifier(
#  learning_rate = 0.02,
#  n_estimators= 5,
#  max_depth= 4,
#  min_child_weight= 2,
#  gamma=0.9,                        
#  subsample=0.8,
#  colsample_bytree=0.8,
#  objective= 'binary:logistic',
#  nthread= -1,
#  verbosity=2,
#  scale_pos_weight=1)

# clf = StackingClassifier(
#     estimators=estimators, final_estimator=gbm
# )

In [None]:
# clf.fit(X_train_df, y_train_df)

# Submission

In [None]:
# Environment for the competition. 

env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    # Each batch arrives as a fresh data frame, so the preprocessing applied to the example test set is repeated here. 
    x_columns = ['user_id', 'content_id', 'content_type_id', 'task_container_id', 'prior_question_elapsed_time', 'prior_question_had_explanation']
    test_df['prior_question_elapsed_time'] = test_df['prior_question_elapsed_time'].fillna(mean_prior)
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(False).astype('bool')
    test_df['answered_correctly'] = m1.predict(test_df[x_columns])
    # Only question rows (content_type_id == 0) are submitted
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
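Because every batch from the iterator needs the same preprocessing, the steps can be collected in one function and reused. This is a sketch under stated assumptions: `prepare_batch`, `FEATURES`, `toy_batch` and the fill value 25000.0 are hypothetical, and the batch is copied rather than modified in place.

```python
import numpy as np
import pandas as pd

FEATURES = ['user_id', 'content_id', 'content_type_id', 'task_container_id',
            'prior_question_elapsed_time', 'prior_question_had_explanation']

def prepare_batch(batch, fill_time):
    """Return the model features of one test batch with nulls filled."""
    out = batch.copy()
    out['prior_question_elapsed_time'] = out['prior_question_elapsed_time'].fillna(fill_time)
    # Go through the nullable "boolean" dtype so nulls are filled before the cast to bool
    out['prior_question_had_explanation'] = (
        out['prior_question_had_explanation'].astype('boolean').fillna(False).astype('bool')
    )
    return out[FEATURES]

# Made-up batch with one null in each of the two columns
toy_batch = pd.DataFrame({
    'row_id': [0, 1],
    'user_id': [115, 115],
    'content_id': [5692, 5716],
    'content_type_id': [0, 0],
    'task_container_id': [1, 2],
    'prior_question_elapsed_time': [np.nan, 37000.0],
    'prior_question_had_explanation': [None, True],
})

features = prepare_batch(toy_batch, fill_time=25000.0)
print(features)
```

Inside the loop, the call would take the place of the three preprocessing lines, for example `m1.predict(prepare_batch(test_df, mean_prior))`.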