This notebook augments the Riiid input dataset with two new columns: `this_question_had_explanation` and `this_question_elapsed_time`. These are essentially shifted and ffilled versions of `prior_question_had_explanation` and `prior_question_elapsed_time` from the original dataset.
They track whether **this** question provided feedback to the user and how long **this** question took, rather than the previous one.

I'm using alternative input/output formats based on info in [this notebook](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets/data).

# Now less wrong!

I've switched from using Pandas to datatable, because Pandas can't do what needs to be done without running out of memory.

This enabled me to fix two bugs:

- Group by user before shifting `prior_*` to make `this_*`
- Actually join the `this_*` information properly so that it applies to each row, not just the first row per (user_id, task_container_id) combination.
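
To make the second fix concrete, here is a toy pure-Python sketch (not the datatable code used in this notebook) of the difference between attaching a per-bundle value to only the first row of a bundle and attaching it to every row. The rows, values, and the names `this_elapsed_by_bundle` and `bundle_key` are made up for illustration.

```python
# Toy rows: (user_id, task_container_id). Container 1 holds two questions.
rows = [(1, 0), (1, 1), (1, 1), (1, 2)]

# One computed value per bundle, keyed as "<user_id>_<task_container_id>".
# These numbers are invented for the example.
this_elapsed_by_bundle = {"1_0": 21000.0, "1_1": 18000.0, "1_2": None}


def bundle_key(user_id, task_container_id):
    """Build the string id that identifies one bundle."""
    return f"{user_id}_{task_container_id}"


# Buggy variant: only the first row of each bundle receives the value.
seen = set()
first_row_only = []
for user_id, tc in rows:
    key = bundle_key(user_id, tc)
    first_row_only.append(this_elapsed_by_bundle[key] if key not in seen else None)
    seen.add(key)

# Fixed variant: a keyed lookup, like a left join, fills every row of the bundle.
every_row = [this_elapsed_by_bundle[bundle_key(u, tc)] for u, tc in rows]

print(first_row_only)
print(every_row)
```

In the buggy variant the second question in container 1 ends up with no value, while the keyed lookup gives both questions in that container the same value.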

## Read in training data from .jay file to datatable



In [None]:
!pip install datatable==0.11.0 > /dev/null


In [None]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import gc
import pyarrow.parquet as pq
import pyarrow
import datatable as dt
from datatable import f


In [None]:
%%time

dt_data = dt.fread("../input/riiid-train-data-multiple-formats/riiid_train.jay")
dt_data.shape

## Compute question_elapsed_time and question_had_explanation for *this* question 


These values are the same within each bundle of questions (the set of questions asked of a given user with a given `task_container_id`). So the whole bundle either got explanations or didn't, and the elapsed time is averaged over the bundle. To get the explanation flag and elapsed time for *this* bundle, we take the applicable bundles (those that represent groups of questions, not lectures) in the right order, then shift `prior_*` backwards by one bundle.
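
As a rough illustration of that idea, the following pure-Python sketch collapses toy rows to one row per bundle and then shifts the prior value backwards within each user. It assumes the rows are already sorted by user and time, and all values are invented; the actual computation in this notebook is done with datatable.

```python
from itertools import groupby

# Toy rows in time order: (user_id, task_container_id, prior_question_elapsed_time).
rows = [
    (1, 0, None),
    (1, 1, 21000.0),  # container 1 has two questions sharing the same prior value
    (1, 1, 21000.0),
    (1, 2, 18000.0),
    (2, 0, None),
    (2, 1, 30000.0),
]

# Keep the first row of each run of identical (user_id, task_container_id) pairs.
bundles = [next(group) for _, group in groupby(rows, key=lambda r: (r[0], r[1]))]

# Within each user, bundle i takes the prior value stored on bundle i + 1.
this_elapsed = {}
for user_id, user_bundles in groupby(bundles, key=lambda r: r[0]):
    user_bundles = list(user_bundles)
    priors = [prior for _, _, prior in user_bundles]
    shifted = priors[1:] + [None]  # the last bundle has no successor
    for (_, tc, _), value in zip(user_bundles, shifted):
        this_elapsed[(user_id, tc)] = value

print(this_elapsed)
```

Each user's last bundle is left as `None`, because no later row carries its value.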

Then, for `had_explanation`, ffill the NaNs. (The only NaNs that need filling are in each user's last question bundle, because there is no later `prior_*` value to take the data from. Forward-filling makes sense for `had_explanation`, since the only time users generally *don't* get an explanation is at the beginning, when they are being asked diagnostic questions.)
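
For readers unfamiliar with forward-filling, here is a minimal pure-Python sketch of what it does to one user's sequence of values. The helper name `ffill` and the sample values are illustrative assumptions; in practice the fill would be applied per user so that values do not carry over from one user to the next.

```python
def ffill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# this_question_had_explanation for one user's bundles, in time order.
# The final None is the user's last bundle, which has no later row to read from.
had_explanation = [False, False, True, True, None]
filled = ffill(had_explanation)
print(filled)
```

A leading `None` stays `None`, since there is no earlier value to copy.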


Judging by user 115 (for whom the orderings by `timestamp` and by `task_container_id` do not match), it is `timestamp` that determines what the "previous" task container was: the first task container should have `prior_question_had_explanation == None`, and this is true for `timestamp` 0, not `task_container_id` 0.
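
The check described above can be sketched on toy data as follows. The tuples are invented for illustration and are not user 115's actual rows; the point is only that the bundle with a missing prior value is the first one by `timestamp`, not the first one by `task_container_id`.

```python
# Toy bundles for one user:
# (timestamp, task_container_id, prior_question_had_explanation)
bundles = [
    (0, 1, None),      # earliest interaction, but not the lowest container id
    (5000, 0, False),
    (9000, 2, True),
]

first_by_timestamp = min(bundles, key=lambda b: b[0])
first_by_container = min(bundles, key=lambda b: b[1])

# The bundle with no prior value is the earliest by timestamp.
print(first_by_timestamp[2] is None)
print(first_by_container[2] is None)
```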

In [None]:
# make a single column which contains a unique id for each [user_id, task_container_id] combination
dt_data['tc_id'] = dt_data[:,dt.str32(f['user_id'])+'_'+f['task_container_id']]

In [None]:
%%time

# make a separate frame with just questions, and one row per tc_id
questions = dt_data[f["content_type_id"]==0,:]
q_task_containers = questions[
    (f['tc_id']!=dt.shift(f['tc_id'])) , :]
q_task_containers.shape  # expected shape: (76483597, 11)

In [None]:
# this_* is prior_* shifted back by one bundle (tc_id), within each user
q_task_containers['this_question_elapsed_time'] = q_task_containers[
    :, dt.shift(f['prior_question_elapsed_time'], -1), dt.by(f['user_id'])
]['prior_question_elapsed_time']
q_task_containers['this_question_had_explanation'] = q_task_containers[
    :, dt.shift(f['prior_question_had_explanation'], -1), dt.by(f['user_id'])
]['prior_question_had_explanation']

In [None]:
# sanity check - this_* values are null in the last row for a user
q_task_containers[f['user_id']==115, :]

## Put computed columns back into training data

In [None]:
# keep only tc_id (the join key) and the new this_* columns that follow it
q_task_containers = q_task_containers[:,'tc_id':]

In [None]:
%%time

# key by tc_id for joining
q_task_containers.key='tc_id'

In [None]:
%%time

# left outer join into original data table
dt_data = dt_data[:,:,dt.join(q_task_containers)]

In [None]:
# Sanity check: last 3 task_container_id bundles for a particular user
dt_data[(f['user_id']==2147012157) & (f['task_container_id'] > 4675),:]

## Write back out 

In [None]:
# Make room in RAM
del q_task_containers
gc.collect()


In [None]:
%%time

dt_data.to_csv('riiid_train_with_qdata.csv')

In [None]:
%%time

dt_data.to_jay('riiid_train_with_qdata.jay')