# About this notebook

We assume that when a notebook is submitted for this competition, it runs on the whole private test dataset (20% of which is used to calculate the public LB score).

In this notebook, we try to answer the following questions by comparing what we see when we commit a notebook with what we see when we submit it:

* whether `questions.csv` contains the same `question_id` values.
* whether `lectures.csv` contains the same `lecture_id` values.
* whether all the question ids in the private test dataset are seen in `train.csv` and `questions.csv`.
* whether all the lecture ids in the private test dataset are seen in `train.csv` and `lectures.csv`.
* whether a batch from the private test dataset has timestamps greater than or equal to the timestamps of the corresponding users in `train.csv`.
* whether the batches from the private test dataset have monotonically non-decreasing timestamps.
* whether there is at most one question bundle for a single user in a single test batch (this is true according to the competition Data page).
* whether each question bundle forms a consecutive block in a test batch (even though the row ids may jump).
* whether the question bundle for a single user in a test batch is always at the end of that user's sequence.

The code could be used to build a full user history (training time + the previous batches in test time) for prediction. However, it is not optimized, since the current version is only for verifying some assumptions.
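
As a rough illustration of the timestamp questions above, the sketch below checks on plain Python lists that a user's timestamps never decrease once a new batch is appended to the known history. The function name `is_non_decreasing` and the toy values are illustrative assumptions, not part of the competition API.

```python
def is_non_decreasing(values):
    """Return True if each value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy history for one user (training time) and one new test batch.
train_timestamps = [0, 1500, 1500, 9000]
batch_timestamps = [9000, 12000]

# Equal timestamps are allowed, so the check is "non-decreasing",
# not "strictly increasing".
full_history = train_timestamps + batch_timestamps
print(is_non_decreasing(full_history))
```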

In [None]:
# installation with internet
# !pip install datatable==0.11.0

# installation without internet
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
import pandas as pd
import datatable as dt
import json
from collections import defaultdict

## Reading data in jay format
The dataset can be saved in binary format and read back using datatable in less than a second!

See original notebook [RIIID with blazing fast RID](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid)

In [None]:
dt.fread("/kaggle/input/riiid-test-answer-prediction/train.csv").to_jay("train.jay")
train_dt = dt.fread('train.jay')

# If `train.jay` already exists, we don't need the above conversion - saving time / memory
# train_dt = dt.fread('/kaggle/input/r3id-trainjay/train.jay')

In [None]:
train_dt

## Get unique questions and lectures from training data

In [None]:
unique_question_id_train = dt.unique(train_dt[dt.f.content_type_id == 0, 'content_id']).to_list()[0]
with open('unique_question_id_train.json', 'w', encoding='UTF-8') as fp:
    json.dump(unique_question_id_train, fp, ensure_ascii=False)
unique_question_id_train = set(unique_question_id_train)

unique_lecture_id_train = dt.unique(train_dt[dt.f.content_type_id == 1, 'content_id']).to_list()[0]
with open('unique_lecture_id_train.json', 'w', encoding='UTF-8') as fp:
    json.dump(unique_lecture_id_train, fp, ensure_ascii=False)
unique_lecture_id_train = set(unique_lecture_id_train)

## Collect questions / lectures from csv files

In [None]:
question_df_from_public = pd.read_csv('/kaggle/input/r3id-info/questions.csv')
lecture_df_from_public = pd.read_csv('/kaggle/input/r3id-info/lectures.csv')

question_df_from_private = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
lecture_df_from_private = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')

questions_from_public = set(question_df_from_public['question_id'].values.tolist())
lectures_from_public = set(lecture_df_from_public['lecture_id'].values.tolist())

questions_from_private = set(question_df_from_private['question_id'].values.tolist())
lectures_from_private = set(lecture_df_from_private['lecture_id'].values.tolist())

In [None]:
print(len(unique_question_id_train))
print(len(questions_from_public))
print(len(questions_from_private))

In [None]:
print(len(unique_lecture_id_train))
print(len(lectures_from_public))
print(len(lectures_from_private))

In [None]:
assert_ok = True

try:
    
    assert questions_from_private == questions_from_public
    assert unique_question_id_train == questions_from_public
    
    assert lectures_from_private == lectures_from_public
    assert unique_lecture_id_train.issubset(lectures_from_public)   
            
except AssertionError:
        
    assert_ok = False
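
The cell above only records whether every check passed. When a check fails, it can help to know which ids are responsible. A minimal sketch on toy sets follows; the helper `describe_set_difference` is a hypothetical name, not something used elsewhere in this notebook.

```python
def describe_set_difference(name_a, set_a, name_b, set_b):
    """Return a short report of the ids present in only one of the two sets."""
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return {
        f"only_in_{name_a}": only_a,
        f"only_in_{name_b}": only_b,
        "equal": not only_a and not only_b,
    }


# Toy ids standing in for the real question / lecture id sets.
ids_train = {0, 1, 2}
ids_csv = {0, 1, 2, 3}

report = describe_set_difference("train", ids_train, "csv", ids_csv)
print(report)
```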

## Read precomputed user_id <-> indices map

This is a precomputed dictionary that records, for each user in `train.csv`, the first and last row index (both inclusive) of that user's rows. The format looks like this:


    {
        "115": [
            0,
            45
        ],
        "124": [
            46,
            75
        ],
        "2746": [
            76,
            95
        ],
        ...
    }
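
The file itself is precomputed, so how it was built is not shown here. One way such a map could be derived, assuming the rows of each user are contiguous in `train.csv` (which is what a single `[start, end]` pair per user implies), is sketched below on a toy `user_id` column. `build_user_row_ranges` is a hypothetical helper.

```python
from itertools import groupby


def build_user_row_ranges(user_ids):
    """Map each user id to [first_row, last_row] (both inclusive).

    Assumes all rows of a user are contiguous in `user_ids`.
    """
    ranges = {}
    row = 0
    for user_id, group in groupby(user_ids):
        count = sum(1 for _ in group)
        # A user showing up in two separate runs would break the assumption.
        assert user_id not in ranges, f"rows of user {user_id} are not contiguous"
        ranges[user_id] = [row, row + count - 1]
        row += count
    return ranges


# Toy column: user 115 owns rows 0-2, user 124 owns rows 3-4.
toy_user_ids = [115, 115, 115, 124, 124]
print(build_user_row_ranges(toy_user_ids))
```

Because the end index is inclusive, slicing the rows of a user needs `start:end + 1`.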


In [None]:
with open('/kaggle/input/r3id-info/user_id_to_row_id_train.json', 'r', encoding='UTF-8') as fp:
    _user_id_to_row_id_train = json.load(fp)
    
# Don't forget to change the key to type int!
user_id_to_row_id_train = {int(k): v for k, v in _user_id_to_row_id_train.items()}

In [None]:
def get_user_dt(user_id, _dt, user_id_to_row_id):
    """Get the partial `datatable.Frame` in `_dt` containing only `user_id`.
    
    Args:
        user_id: `int`. It must be in `user_id_to_row_id`, which should be precomputed from `_dt`.
        _dt: `datatable.Frame`.
        user_id_to_row_id: `dict`. See the above markdown cell for the format. 
    """

    assert user_id in user_id_to_row_id
    
    (start, end) = user_id_to_row_id[user_id]
    user_dt = _dt[start:end + 1, :]

    return user_dt

In [None]:
attrs = [
    'user_id',
    'row_id',
    'timestamp',
    'content_id',
    'content_type_id',
    'task_container_id',
    'user_answer',
    'answered_correctly',
    'prior_question_elapsed_time',
    'prior_question_had_explanation'
]

class User_Record:
    """
    Sequences (per attribute) of records for a single user.
    """

    def __init__(self, user_dt=None, user_id=None):
        """Exactly one argument should be `None`.
        
        Args:
            user_dt: A `datatable.Frame` object for a single user.
            user_id: int.
        """

        assert (user_dt is not None and user_id is None) or (user_id is not None and user_dt is None)
        
        user_ids = user_dt[:, 'user_id'].to_list()[0] if user_dt is not None else []
        
        # single value - Each record is for a single user.
        if user_dt is not None:
            assert len(set(user_ids)) == 1
        
        self.user_id = user_ids[0] if user_dt is not None else user_id
    
        self.row_id = user_dt[:, 'row_id'].to_list()[0] if user_dt is not None else []
        self.timestamp = user_dt[:, 'timestamp'].to_list()[0] if user_dt is not None else []
        self.content_id = user_dt[:, 'content_id'].to_list()[0] if user_dt is not None else []
        self.content_type_id = user_dt[:, 'content_type_id'].to_list()[0] if user_dt is not None else []
        self.task_container_id = user_dt[:, 'task_container_id'].to_list()[0] if user_dt is not None else []
        self.user_answer = user_dt[:, 'user_answer'].to_list()[0] if user_dt is not None else []
        self.answered_correctly = user_dt[:, 'answered_correctly'].to_list()[0] if user_dt is not None else []
        self.prior_question_elapsed_time = user_dt[:, 'prior_question_elapsed_time'].to_list()[0] if user_dt is not None else []
        self.prior_question_had_explanation = user_dt[:, 'prior_question_had_explanation'].to_list()[0] if user_dt is not None else []

        # make sure the timestamp is always in order.
        assert self.timestamp == sorted(self.timestamp)
            
    def extend(self, other):
        """
        Add the content of another record `other` to the record `self`.
        """        
        
        assert (self.user_id == other.user_id)
        
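        # The `timestamp` should be in order while adding new entries to existing record.
        # Skip the check when either side is empty (e.g. a user without training history).
        if len(self.timestamp) > 0 and len(other.timestamp) > 0:
            assert self.timestamp[-1] <= other.timestamp[0]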
                
        for k in attrs:
            if k != 'user_id':
                getattr(self, k).extend(getattr(other, k))

    def update_answer_results(self, prior_correctnesses, prior_answers):
        """
        Update the answers and their correctnesses in a record which was previously unknown in the last test batch.
        """
        
        # sanity check
        assert len(prior_correctnesses) == len(prior_answers)
        
        assert len(self.answered_correctly) >= len(prior_correctnesses)
        assert len(self.user_answer) >= len(prior_answers)
        
        # the places to be updated should contain only -1 (i.e. unknown results)
        assert set(self.answered_correctly[-len(prior_correctnesses):]) == {-1}
        assert set(self.user_answer[-len(prior_answers):]) == {-1}
                
        self.answered_correctly = self.answered_correctly[:-len(prior_correctnesses)] + prior_correctnesses
        self.user_answer = self.user_answer[:-len(prior_answers)] + prior_answers

    def toJSON(self):
        
        return json.dumps(self, default=lambda o: o.__dict__, sort_keys=False, indent=4)
    
    def __str__(self):
        
        return self.toJSON()   

class Record_Buffer:
    """
    A dictionary like buffer to store and manage (i.e. updating) records.
    """
    
    def __init__(self, record_dict=None):
        """
        `record_dict`: A `dict` mapping user ids (`int`) to their records (`User_Record`).
        """
        
        if record_dict is None:
            self.buffer = {}
        else:
            self.buffer = record_dict
           
    def __contains__(self, x):
        
        return x in self.buffer
        
    def __getitem__(self, x):
        
        if x not in self.buffer:
            raise KeyError(str(x))
        
        return self.buffer[x]
    
    def __len__(self):
        
        return len(self.buffer)
    
    def __del__(self):
        
        del self.buffer
                
    def update(self, record):
        """
        Add a record to the buffer. If its user_id already exists, find and update the existing record.
        """

        if record.user_id not in self.buffer:
            self.buffer[record.user_id] = User_Record(user_id=record.user_id)
            
        self.buffer[record.user_id].extend(record)

    def update_answer_results(self, user_id, prior_correctnesses, prior_answers):
        """
        Update the answers and their correctnesses for a single user which was previously unknown in the last test batch.
        """

        assert user_id in self.buffer        
        record = self.buffer[user_id]

        record.update_answer_results(prior_correctnesses, prior_answers)
                
    def toJSON(self):
        
        return json.dumps(self, default=lambda o: o.__dict__, sort_keys=False, indent=4)
    
    def __str__(self):
        
        return self.toJSON()
        
        
def convert_dt(_dt, test=False):
    """
    Change column type, deal with NaN value. If it is a `datatable.Frame` from the test dataset,
    change it to a format suitable for prediction.
    
    Args:
        _dt: `datatable.Frame`, representing a block of the training dataset or a test batch given by `env.iter_test`.
    """

    _dt[dt.f.prior_question_elapsed_time] = dt.float32
    _dt[dt.f.prior_question_elapsed_time == None, 'prior_question_elapsed_time'] = -1.0
    _dt[dt.f.prior_question_had_explanation] = dt.int8
    _dt[dt.f.prior_question_had_explanation == None, 'prior_question_had_explanation'] = -1
    _dt[dt.f.content_type_id] = dt.int8

    if test:
        
        _dt['answered_correctly'] = -1
        _dt['user_answer'] = -1

        del _dt['prior_group_answers_correct']
        del _dt['prior_group_responses']
        # del _dt['group_num']
        
        user_ids = _dt['user_id'].to_list()[0]
                
        # All test questions must have been seen in training time.
        assert set(_dt[dt.f.content_type_id == 0, 'content_id'].to_list()[0]).issubset(unique_question_id_train)

        # All test lectures must have been seen in training time.
        assert set(_dt[dt.f.content_type_id == 1, 'content_id'].to_list()[0]).issubset(unique_lecture_id_train)
        
def convert_df_to_dt(df, test=False):
    """Convert a `pandas.DataFrame` to `datatable.Frame` with some extra processing.
    
    Args:
        df: `pandas.DataFrame`, representing a block of the training dataset or a test batch given by `env.iter_test`.    
    """
    
    _dt = dt.Frame(df.astype({"prior_question_had_explanation": float}))

    if test:
                
        prior_group_answers_correct = eval(_dt[0, 'prior_group_answers_correct'])
        prior_group_responses = eval(_dt[0, 'prior_group_responses'])
        
        if prior_group_answers_correct is None:
            prior_group_answers_correct = [-1] * _dt.nrows
        if prior_group_responses is None:
            prior_group_responses = [-1] * _dt.nrows

        assert isinstance(prior_group_answers_correct, list)
        assert isinstance(prior_group_responses, list)

    convert_dt(_dt, test=test)
    
    if test:
        return _dt, prior_group_answers_correct, prior_group_responses
    else:
        return _dt

    
def combine_user_record(record_1, record_2):
    """
    Returns:
        A new record that combines the two `User_Record` objects `record_1` and  `record_2`.
        The arguments are not modified.
    """

    assert record_1.user_id == record_2.user_id

    record = User_Record(user_id=record_1.user_id)

    record.extend(record_1)
    record.extend(record_2)

    return record    
    

class Prediction_Manager:
    """
    See `update()` for the description.
    """
    
    def __init__(self, train_dt, user_id_to_row_id_train, max_train_buffer_size=30000):
        
        self.train_dt = train_dt
        self.user_id_to_row_id_train = user_id_to_row_id_train
        
        self.train_record_buffer = Record_Buffer()
        self.test_record_buffer = Record_Buffer()
        
        self.current_batch_row_ids = None        
        self.current_batch_users = None
        
        # Used to avoid memory error - not sure if it is necessary.
        self.max_train_buffer_size = max_train_buffer_size
        
    def reset_train_record_buffer(self):
        """
        Drop all cached training records to keep memory usage bounded.
        """
        
        del self.train_record_buffer
        self.train_record_buffer = Record_Buffer()
        
    def update_batch_users(self, user_ids):
        """
        Store the user ids in the current test batch.
        """
        
        self.current_batch_users = user_ids

    def update_answer_results(self, prev_batch_users, prior_group_answers_correct, prior_group_responses):
        """
        When we get a new test batch, we also get the `prior_group_answers_correct` and `prior_group_responses`.
        We use this information to update the `answered_correctly` and `user_answer` fields of the records of
        the users in the previous test batch.
        """
        
        # sanity check: try to make sure the answer results are for `prev_batch_users` by verifying their lengths.
        assert len(prev_batch_users) == len(prior_group_answers_correct)
        assert len(prior_group_answers_correct) == len(prior_group_responses)
        
        d1 = defaultdict(list)
        d2 = defaultdict(list)
        for user_id, prior_ans_correct, prior_ans in zip(prev_batch_users, prior_group_answers_correct, prior_group_responses):
            d1[user_id].append(prior_ans_correct)
            d2[user_id].append(prior_ans)
            
        for user_id in d1:
            self.test_record_buffer.update_answer_results(user_id, d1[user_id], d2[user_id])

    def update(self, test_df):
        """
        For a test batch `test_df` (`pandas.DataFrame`) given by `env.iter_test`, this method performs:
            1. update the `answered_correctly` and `user_answer` information in the previous test batch
            2. update the user records (only in the test time) by appending the information in the current batch
            3. get the user records in the training time
            4. combine the user records in the training time and test time - so we have a full history and the current batch to predict
        """
        
        # conversion
        test_dt, prior_group_answers_correct, prior_group_responses = convert_df_to_dt(test_df, test=True)
        
        user_ids = test_dt['user_id'].to_list()[0]
        prev_batch_users = self.current_batch_users
        if prev_batch_users is not None:
            self.update_answer_results(prev_batch_users, prior_group_answers_correct, prior_group_responses)
        self.update_batch_users(user_ids)

        row_ids = test_dt['row_id'].to_list()[0]
        prev_batch_row_ids = self.current_batch_row_ids
        
        debug_test_batch_row_ids(row_ids, prev_batch_row_ids)
        
        self.current_batch_row_ids = row_ids
        
        # To `User_Record`. Here, each record contains exactly one user interaction.
        # But the same user might appear several times in the list.
        record_batch = [User_Record(user_dt=test_dt[idx, :]) for idx in range(test_dt.nrows)]
        debug_record_batch(record_batch)
                        
        # update test buffer - add info about the new test batch
        for record in record_batch:
            self.test_record_buffer.update(record)

        # get the updated record from test buffer for each user in `user_ids`.
        # The same user might appear several times in the list, however, they get the same user history (test time) sequence.
        test_record_batch = [self.test_record_buffer[x] for x in user_ids]
        
        # get the record from train_dt or train record buffer for each user in `user_ids`
        training_record_batch = [self.get_training_record(x) for x in user_ids]
        
        # obtain the full history (in training time + the previous batches in test time) and the current batch to predict
        predict_record_batch = [combine_user_record(x, y) for x, y in zip(training_record_batch, test_record_batch)]
    
        if len(self.train_record_buffer) >= self.max_train_buffer_size:
            self.reset_train_record_buffer()

    def get_training_record(self, user_id):
        """
        Get the training time history of the user with `user_id`.
        """
                
        if user_id in self.train_record_buffer:
            return self.train_record_buffer[user_id]

        if user_id in self.user_id_to_row_id_train:
            
            user_dt = get_user_dt(user_id, self.train_dt, self.user_id_to_row_id_train)
                    
            # `len()` of a `datatable.Frame` counts columns, so check the row count explicitly.
            assert user_dt.nrows > 0
            
            convert_dt(user_dt, test=False)
            
            user_record = User_Record(user_dt=user_dt)
            self.train_record_buffer.update(user_record)
            
            return self.train_record_buffer[user_id]

        else:

            user_record = User_Record(user_id=user_id)
            self.train_record_buffer.update(user_record)
            
            return self.train_record_buffer[user_id]
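
Step 1 of `update()` is the least obvious part: rows of the previous batch are stored with `-1` placeholders, and the true values only arrive with the next batch. A minimal sketch of that backfill on plain lists follows, with made-up values and a hypothetical helper `backfill_previous_batch`.

```python
from collections import defaultdict


def backfill_previous_batch(history, prev_batch_users, prior_correct):
    """Replace the trailing -1 placeholders of each user with the revealed results.

    history: dict mapping user id -> list of answered_correctly values.
    prev_batch_users: user id of every row in the previous batch, in row order.
    prior_correct: revealed results for those rows, in the same order.
    """
    assert len(prev_batch_users) == len(prior_correct)

    # Group the revealed results per user, keeping row order.
    per_user = defaultdict(list)
    for user_id, value in zip(prev_batch_users, prior_correct):
        per_user[user_id].append(value)

    for user_id, values in per_user.items():
        n = len(values)
        # Only unknown (-1) entries may be overwritten.
        assert history[user_id][-n:] == [-1] * n
        history[user_id][-n:] = values
    return history


# User 7 had two rows in the previous batch, user 9 had one.
toy_history = {7: [1, 0, -1, -1], 9: [-1]}
backfill_previous_batch(toy_history, [7, 9, 7], [1, 0, 0])
print(toy_history)
```

The backfill has to run before the rows of the new batch are appended. Otherwise the placeholders are no longer at the end of each user's sequence.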

In [None]:
pm = Prediction_Manager(train_dt=train_dt, user_id_to_row_id_train=user_id_to_row_id_train)
pm

In [None]:
def debug_test_batch_row_ids(row_ids, prev_batch_row_ids):
    """Verify the properties of row ids within a single test batch and across consecutive test batches.
    
    Args:
        row_ids: The row ids in a batch during the test time given by `env.iter_test()`
        prev_batch_row_ids: The row ids in the batch just before the batch of `row_ids` during the test time given by `env.iter_test()`
    """

    # sanity check
    # row ids must be distinct
    assert len(set(row_ids)) == len(row_ids)
    
    # row ids must be in sorted order in a batch
    assert row_ids == sorted(row_ids)
    
    # row ids must be in sorted order across all batches during the test time.
    if prev_batch_row_ids is not None:
        assert row_ids[0] > prev_batch_row_ids[-1]


def debug_record_batch(record_batch):
    """Verify the properties of the question bundle for a single user in a test batch.
    
    Args:
        record_batch: A list. Each element (`User_Record`) should contain only one record in a single timestamp.
    """

    _tmp = defaultdict(list)
    for record in record_batch:
        # each record here contains only one entry.
        assert len(record.row_id) == 1
        _tmp[record.user_id].append(record)

    for user_id, records in _tmp.items():

        task_container_ids_for_questions = [x.task_container_id[0] for x in records if x.content_type_id[0] == 0]

        # If there is any question for a user
        if len(task_container_ids_for_questions) > 0:

            # There must be exactly one question bundle.
            # This is `True`.
            assert len(set(task_container_ids_for_questions)) == 1

            row_ids_for_questions = [x.row_id[0] for x in records if x.content_type_id[0] == 0]
            
            # This is `False`: the question bundle must be in a consecutive block with continuous row ids.
            # assert row_ids_for_questions == list(range(row_ids_for_questions[0], row_ids_for_questions[-1] + 1))

            records_in_between = [x for x in records if row_ids_for_questions[0] <= x.row_id[0] <= row_ids_for_questions[-1]]
            row_ids_in_between = [x.row_id[0] for x in records_in_between]
            
            # This is `True`: the question bundle must be in a consecutive block (but the row ids may jump).
            assert row_ids_for_questions == row_ids_in_between

            # The question bundle must be at the end of the sequence (for a single user) in a test batch.
            assert row_ids_for_questions == [x.row_id[0] for x in records][-len(row_ids_for_questions):]

In [None]:
assert_ok

In [None]:
# If some assertion fails, no submission.csv is generated, and we get a submission scoring error.
if assert_ok:

    import riiideducation
    env = riiideducation.make_env()
    iter_test = env.iter_test()

In [None]:
test_history = []

n_test_batch = 0
for test_df, _ in iter_test:
    
    n_test_batch += 1
        
    try:
        
        # If some assertion fails, no submission.csv is generated, and we get a submission scoring error.
        # assert 1 == 0
        
        _question_ids = set(test_df[test_df['content_type_id'] == 0]['content_id'].values.tolist())
        _lecture_ids = set(test_df[test_df['content_type_id'] == 1]['content_id'].values.tolist())        
        
        # All test questions must have been seen in training time.
        assert _question_ids.issubset(unique_question_id_train)
        
        # All test lectures must have been seen in training time.
        assert _lecture_ids.issubset(unique_lecture_id_train)
        
        pm.update(test_df)
        
        # Save the buffer status for the first few batches to check that things are as expected.
        if n_test_batch <= 4:
            test_history.append(str(pm.test_record_buffer))
        
    except AssertionError as e:
        print('some assertions are wrong, breaking the loop')
        break
    
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
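
`convert_df_to_dt` parses the `prior_group_answers_correct` and `prior_group_responses` strings with `eval`. Since those strings are expected to hold plain list literals, `ast.literal_eval` from the standard library is a possible stricter alternative. The helper below is a sketch with a hypothetical name, and it treats any non-string value as "nothing revealed", which is an assumption about how missing values appear.

```python
import ast


def parse_prior_group(value):
    """Parse a string such as "[1, 0, 1]" into a list; return None for non-strings."""
    if not isinstance(value, str):
        # e.g. NaN or None when nothing was revealed (assumed behaviour)
        return None
    parsed = ast.literal_eval(value)
    assert isinstance(parsed, list)
    return parsed


print(parse_prior_group("[1, 0, 1]"))
print(parse_prior_group("[]"))
print(parse_prior_group(float("nan")))
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so a malformed or unexpected string raises an exception instead of being executed.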

## Save some status to outputs - so you can verify

In [None]:
# Write only the batches that were actually recorded (there may be fewer than 4).
for idx, status in enumerate(test_history):
    with open(f'test_batch_{idx}.json', 'w', encoding='UTF-8') as fp:
        fp.write(status)

In [None]:
if n_test_batch <= 4:
    with open('train_record_buffer.json', 'w', encoding='UTF-8') as fp:
        fp.write(str(pm.train_record_buffer))

## Conclusion

* It seems that the answers to the questions mentioned in the top cell of this notebook are all `Yes`.
* I hope there is no missing part or logical error in this notebook.
* I encourage you to verify by yourself.
* Any feedback is appreciated.
* I don't take any responsibility for any error in this notebook.