## Investigating some properties of the test set
Inspired by [this](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/188899) discussion.
* Are there any new questions in test? (NO)
* Are there any new users in test? (YES: new users start at timestamp 0)
* What is the timeframe of the test set? (It FOLLOWS TRAIN for any given user)

In [None]:
import riiideducation
import pandas as pd

env = riiideducation.make_env()

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Training data is in the competition dataset as usual
It's larger than will fit in memory with default settings, so we'll specify more efficient datatypes and only load a subset of the data for now.
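
To illustrate why the narrower datatypes matter, here is a minimal sketch on synthetic data. The toy frame and its random values are assumptions made up for illustration; only the column names and target dtypes mirror the ones used for `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few train.csv columns (values are random, not real data).
n = 100_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "user_id": rng.integers(0, 2**31 - 1, size=n),
    "content_id": rng.integers(0, 13_523, size=n),
    "content_type_id": rng.integers(0, 2, size=n),
})

# Same values stored in narrower integer types.
compact = toy.astype({"user_id": "int32", "content_id": "int16", "content_type_id": "int8"})

default_bytes = toy.memory_usage(index=False).sum()
compact_bytes = compact.memory_usage(index=False).sum()
print(f"default: {default_bytes:,} bytes, compact: {compact_bytes:,} bytes")
```

The values are unchanged by the cast because every column fits within the range of its narrower type.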

In [None]:
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', low_memory=False, nrows=10**5, 
                       dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
                      )
train_df

In [None]:
# Read user_id and timestamp for the entire train set; this takes a few minutes
users = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', usecols=['user_id', 'timestamp'],
                    dtype={'user_id': 'int32', 'timestamp': 'int64'})

In [None]:
users_with_latest_ts = users.groupby('user_id')['timestamp'].max()
#get the latest timestamp for all train users

In [None]:
# create a set of train users for comparison with the test set
user_set = set(users.user_id.unique())
len(user_set)

There are 393656 unique users in the train set. Later we will check whether the test API returns any users that are not already present in train.
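
The comparison boils down to a set difference. A minimal sketch with made-up user IDs (the values and the names `train_users`, `test_users` and `unseen_users` are hypothetical, not taken from the competition data):

```python
import pandas as pd

# Hypothetical toy user IDs for illustration only.
train_users = pd.Series([1, 2, 3, 3, 4], name="user_id")
test_users = pd.Series([3, 4, 5, 6], name="user_id")

train_user_set = set(train_users.unique())
test_user_set = set(test_users.unique())

# Users that appear in test but never in train.
unseen_users = test_user_set - train_user_set
print(sorted(unseen_users))
```

An empty result would mean every test user had already been seen in train.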

In [None]:
questions_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')

In [None]:
# Sanity check: the largest question_id plus one should equal the number of rows
print(questions_df.question_id.max() + 1 == questions_df.shape[0])
questions_df.shape[0]

There are 13523 unique questions in questions.csv. Later we will check whether the test API returns any new questions.

## Iterate through the example test set

Following the example notebook, we get the example test set from the API.

In [None]:
iter_test = env.iter_test()

Let's get the data for the first test batch and check it out.

In [None]:
(test_df, sample_prediction_df) = next(iter_test)
test_df

In [None]:
#get users and timestamps
test_users_and_ts = test_df[['user_id','timestamp']]
test_users_and_ts.shape

In [None]:
# Collect the unique questions and users returned by the test API in sets
# content_type_id == 0 marks questions, so lecture rows are left out of question_ids
question_ids = set(test_df.loc[test_df['content_type_id'] == 0, 'content_id'].unique())
user_ids = set(test_df.user_id.unique())

In [None]:
env.predict(sample_prediction_df)

## Main Loop
Let's loop through all the remaining batches in the test set generator and make the default prediction for each. 

We store all users, timestamps and content_id values (questions) so that we can check them for novelty afterwards.
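
The collected timestamps also feed the ordering check further down. A minimal sketch of that check on toy data, assuming made-up IDs and timestamps; the column names `last_train_ts` and `first_test_ts` are hypothetical and chosen only for readability:

```python
import pandas as pd

# Toy interactions; IDs and timestamps are invented for illustration.
train = pd.DataFrame({"user_id": [1, 1, 2, 2], "timestamp": [0, 500, 0, 900]})
test = pd.DataFrame({"user_id": [1, 2, 3], "timestamp": [700, 950, 0]})

# Latest train timestamp and earliest test timestamp per user.
last_train_ts = train.groupby("user_id")["timestamp"].max().rename("last_train_ts")
first_test_ts = test.groupby("user_id")["timestamp"].min().rename("first_test_ts")

# Inner join on the user_id index keeps only users present in both sets.
both = pd.concat([last_train_ts, first_test_ts], axis=1, join="inner")

# True if any user's first test interaction precedes their last train interaction.
time_mixup = bool((both["first_test_ts"] < both["last_train_ts"]).any())

# Users that appear only in test.
new_users = set(first_test_ts.index) - set(last_train_ts.index)
print(time_mixup, new_users)
```

Naming the two timestamp columns explicitly avoids relying on the `_x` and `_y` suffixes that a merge would generate.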

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    # content_type_id == 0 marks questions; lecture rows are skipped
    new_ids = set(test_df.loc[test_df['content_type_id'] == 0, 'content_id'].unique())
    question_ids = question_ids.union(new_ids)
    
    new_users = set(test_df.user_id.unique())
    user_ids = user_ids.union(new_users)
    
    print("Length of test batch {}, unique users {}".format(len(test_df), len(new_users)))
    
    test_users_and_ts_i = test_df[['user_id','timestamp']]
    test_users_and_ts = pd.concat([test_users_and_ts,test_users_and_ts_i])
    #print(test_users_and_ts.shape)
    
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])

In [None]:
test_set_min_ts = test_users_and_ts.groupby('user_id')['timestamp'].min().reset_index()

In [None]:
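# Inner join on user_id: timestamp_x is the latest train timestamp, timestamp_y the earliest test timestamp
df = pd.merge(users_with_latest_ts, test_set_min_ts, on='user_id')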

In [None]:
if (df['timestamp_y'] < df['timestamp_x']).any():
    print("USER INTERACTION IN TEST SET HAS HAPPENED _BEFORE_ THE LATEST INTERACTION IN TRAIN SET. TIME MIXUP DETECTED")
else:
    print("ALL CLEAR, TEST SET ACTIONS FOLLOWED TRAIN SET ACTIONS FOR ANY GIVEN USER WHO WAS PRESENT IN BOTH")

In [None]:
print("New users in test:", user_ids - user_set)

In [None]:
print("New questions in test:", question_ids - set(questions_df.question_id))

In [None]:
new_users_are_really_new = test_users_and_ts[test_users_and_ts.user_id.isin(user_ids - user_set)].groupby('user_id')['timestamp'].min().reset_index()
if new_users_are_really_new.timestamp.max() > 0:
    print("new user detected in test who is not really new! (timestamp is not 0)")
    print(new_users_are_really_new[new_users_are_really_new.timestamp>0])
else:
    print("ALL CLEAR. NEW USERS IN TEST ARE INDEED NEW - test contains their first interaction and possibly more")

In [None]:
new_users_are_really_new