This notebook is an EDA of the data from the [Riiid! Answer Correctness Prediction](https://www.kaggle.com/c/riiid-test-answer-prediction) competition.

> **Credit:** This notebook is forked and edited from [this kernel](https://www.kaggle.com/erikbruin/riiid-comprehensive-eda-baseline) by Erik Bruin.

In [None]:
# import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# import matplotlib.style as style
# style.use('fivethirtyeight')
# import seaborn as sns
# import os
# from matplotlib.ticker import FuncFormatter
# import gc  # garbage collection

# Get the data

## train

The data for this competition is relatively large, and it takes a long time to load. In [this kernel](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets/) Rohan Rao explores various file formats for storing and accessing the data efficiently. The resulting files were added to this notebook (all available [here](https://www.kaggle.com/rohanrao/riiid-train-data-multiple-formats)), and one of them, the gzipped pickle, is used.

In [None]:
%%time
train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")
print("Train size:", train.shape)

In [None]:
train.head()

Directly from the data description in the competition:
* `row_id`: (int64) ID code for the row.
* `timestamp`: (int64) the time in milliseconds between this user interaction and the first event completion from that user.
* `user_id`: (int32) ID code for the user.
* `content_id`: (int16) ID code for the user interaction
* `content_type_id`: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
* `task_container_id`: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.
* `user_answer`: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.
* `answered_correctly`: (int8) if the user responded correctly. Read -1 as null, for lectures.
* `prior_question_elapsed_time`: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.
* `prior_question_had_explanation`: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback

The train dataset is ordered by ascending user_id and ascending timestamp.

Memory analysis of the data (using `memory_usage(deep=True)`) reveals that `prior_question_had_explanation` is an object, and we cast it to Boolean.

In [None]:
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')
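To illustrate why the cast matters, here is a minimal sketch on a toy column (not the real data) that compares the deep memory usage of an object-typed flag with its nullable `boolean` counterpart. The `flags` series is a made-up stand-in for `prior_question_had_explanation`.

```python
import pandas as pd

# Toy column mimicking an object-typed flag (True / False / missing)
flags = pd.Series([True, False, None] * 1000, dtype="object")

# deep=True also counts the Python objects the column points to
as_object = flags.memory_usage(deep=True)
as_boolean = flags.astype("boolean").memory_usage(deep=True)

print(f"object: {as_object} bytes, boolean: {as_boolean} bytes")
```

The nullable `boolean` dtype stores a value and a mask per row instead of a pointer to a Python object, so it should be considerably smaller.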

In [None]:
train.info()

In [None]:
train.head()

### Sampling the data

For the sampled data to represent the original data, we must sample whole users: for each sampled user, all of that user's interactions are kept.

In [None]:
user_interactions = train.user_id.value_counts()
SAMPLE_SIZE = 100000
sampled_users, n_sampled = [], 0
# Shuffle once and walk through the users, so that no user is drawn twice
# (drawing with sample(1) in a loop could repeat users and overcount n_sampled)
for user_id, n_interactions in user_interactions.sample(frac=1).items():
    if n_sampled >= SAMPLE_SIZE:
        break
    sampled_users.append(user_id)
    n_sampled += n_interactions


In [None]:
print(sampled_users, n_sampled)

In [None]:
train = train.loc[train.user_id.isin(sampled_users)]
train.shape
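As a sanity check of the user-level sampling idea, the following sketch uses a toy interaction log (the `toy` frame and its values are made up) and verifies that every chosen user keeps all of their rows.

```python
import pandas as pd

# Toy interaction log: user 1 has 3 rows, user 2 has 2 rows, user 3 has 4 rows
toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
                    "content_id": range(9)})

counts = toy.user_id.value_counts()
# Pick 2 users at random, without replacement
chosen = counts.sample(n=2, random_state=0).index
sample = toy.loc[toy.user_id.isin(chosen)]

# Every chosen user keeps all of their interactions
kept = sample.user_id.value_counts()
print((kept == counts.loc[kept.index]).all())
```

Sampling rows instead of users would cut user histories in the middle, which would distort any per-user statistic computed later.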

## General statistics

How many users do we have?

In [None]:
train.user_id.nunique()

What is the distribution of the number of interactions per user?

> Who are the users with thousands of interactions?

In [None]:
train.user_id.value_counts().plot.hist(bins=1000, xlim=[0, 1000])

What is the balance between questions and lectures?

In [None]:
train.groupby('user_id')['content_type_id'].value_counts().unstack().fillna(0).median()

It is a big question whether the lectures have any influence.

Etc...

# User exploration

This step is **IMPORTANT** and should be repeated with many users!

In [None]:
my_user = 2136150087

> List of representative users:
> * 1822813285 - 195+3 interactions, intensive 14 days + short session after 40 days, repeated 7 questions
> * 453360579 - 438 + 4 interactions, 5 days in a row, repeated 91 questions twice and 17 questions three times. Closer look shows that this user answered questions 3363 & 3365 3 times on 3 different containers, and twice he repeated the same wrong answer.
> * 1835864303 - 32 interactions, one of the containers had 12 interactions (usually 1-3), interactions span over a year.

In [None]:
# Copy the slice so that later column assignments do not target a view of `train`
user_df = train.loc[train.user_id == my_user].copy()
user_df.shape

In [None]:
user_df = user_df.sort_values(by='timestamp')
user_df.head()

## Timeline

Converting `timestamp` to days:

In [None]:
ms_per_day = 24 * 60 * 60 * 1000
print(ms_per_day)

In [None]:
user_df.timestamp = user_df.timestamp / ms_per_day
user_df.head()

The activity comes in bursts: many interactions within a few days, separated by long idle periods.

In [None]:
user_df.timestamp.plot.hist(bins=20, xlabel='Days', density=True);
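One possible way to make such bursts explicit is to start a new session whenever the gap to the previous event exceeds a threshold. This is only a sketch on toy timestamps: the names `days`, `session` and the one-day `GAP_DAYS` threshold are assumptions, not something taken from the competition.

```python
import pandas as pd

# Toy timestamps, in days since the user's first event
days = pd.Series([0.0, 0.01, 0.02, 3.0, 3.05, 40.0])

GAP_DAYS = 1.0  # assumed threshold; tune as needed
# A new session starts wherever the gap to the previous event exceeds the threshold
session = (days.diff() > GAP_DAYS).cumsum()

print(session.tolist())  # [0, 0, 0, 1, 1, 2]
print(days.groupby(session).agg(["min", "max", "size"]))
```

The per-session summary gives the start, end and number of events of each burst, which could later serve as a feature.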

> **Note:** Some users have durations of more than a year...

How are the activities divided between questions and lectures?

In [None]:
fig = plt.figure(figsize=(10, 4))
ax = fig.gca()
ax.plot(user_df.timestamp, 
        user_df.content_type_id, 
        '.--', ms=15, lw=1)
ax.set_xlabel('Days');

In general there are very few lectures.

### Questions vs. lectures

How important is it to see the relevant lectures (based on tags)?
* Lectures are relatively sparse, so it is very common to answer a question without watching any relevant lecture.
* We define `ratio` to be the ratio of tags that have been watched out of the question's tags. It is not clear that `ratio` has any value. This may indicate the seriousness of a student. Perhaps the existence of lectures is more important than the actual tags.
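A worked example of `ratio` for a single question, using made-up tag codes (the sets below are hypothetical and only illustrate the arithmetic):

```python
# Hypothetical tag codes, for illustration only
question_tags = {10, 25, 131}   # tags of the current question
watched_tags = {25, 62}         # tags of the lectures watched before it

# Share of the question's tags that were covered by earlier lectures
ratio = len(watched_tags & question_tags) / len(question_tags)
print(round(ratio, 3))  # 0.333
```

Only tag 25 is shared, so one of the three question tags is covered.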

In [None]:
answered_content = train.groupby(['content_id', 'answered_correctly']).size().unstack()
answered_questions = answered_content.loc[answered_content.loc[:, -1].isnull(), [0, 1]].fillna(0)
correct_ratio = answered_questions.iloc[:, 1].divide(answered_questions.sum(axis=1))
correct_ratio.plot.hist(bins=100)

> **Note:** Problematic questions
> * questions that no one answers correctly
> * interactions that took too little or too much time to complete relative to other interactions with the same question bundle

> **Note:** With these statistics it makes sense to assign most questions an automatic Correct/Incorrect prediction based on the question's history. Questions with a ratio of exactly 0 or 1 should be considered noise...
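The idea of predicting from the question history can be sketched on toy data as follows. The `toy` frame is made up; the column names follow the train data.

```python
import pandas as pd

# Toy answers: question 7 is always correct, 8 is mixed, 9 is never correct
toy = pd.DataFrame({"content_id":         [7, 7, 7, 8, 8, 9, 9],
                    "answered_correctly": [1, 1, 1, 1, 0, 0, 0]})

# The mean of a 0/1 column is the fraction of correct answers per question
question_ratio = toy.groupby("content_id")["answered_correctly"].mean()

# Questions whose outcome is fully determined by their history
is_extreme = (question_ratio == 0) | (question_ratio == 1)
extreme = question_ratio[is_extreme].index.tolist()
print(extreme)  # [7, 9]
```

`question_ratio` itself can be used as a per-question prediction, while the `extreme` questions carry no information that distinguishes between users.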

In [None]:
user_df.content_type_id.value_counts()

In [None]:
user_df.content_id.value_counts().value_counts()

In [None]:
user_df.answered_correctly.value_counts()

> **Note:** Something is strange in the repetition of questions. user_id 453360579 answered 91 questions twice and 17 questions three times. If every repetition followed a mistake, this would indicate 91 + 17\*2 = 125 mistakes. However, this user has 144 incorrect answers...
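One possible explanation (an assumption, not verified here) is that some mistakes happen on questions that are never repeated. The sketch below, on a made-up history of one user, separates repeated attempts from mistakes made on questions seen only once.

```python
import pandas as pd

# Toy history for one user (questions only)
toy = pd.DataFrame({"content_id":         [1, 1, 2, 2, 2, 3, 4],
                    "answered_correctly": [0, 1, 0, 0, 1, 0, 1]})

per_question = toy.groupby("content_id")["answered_correctly"].agg(
    attempts="size", correct="sum")
per_question["mistakes"] = per_question.attempts - per_question.correct

# Extra attempts beyond the first one, summed over all questions
repeats = int((per_question.attempts - 1).sum())
total_mistakes = int(per_question.mistakes.sum())
# Mistakes on questions that were never repeated
seen_once = per_question.attempts == 1
mistakes_once = int(per_question.loc[seen_once, "mistakes"].sum())
print(repeats, total_mistakes, mistakes_once)  # 3 4 1
```

In this toy case there are 3 repeated attempts but 4 mistakes, because one mistake was made on a question the user never saw again.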

> **Note:** Are there "confusing" questions, which are more often than others successfully solved in the second trial?

> **Note:** What about repeating questions? After some trials, the user will probably hit the right answer...
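To look into both notes above, one could number each user's attempts per question and compare accuracy across attempt numbers. A minimal sketch for a single user, on made-up rows assumed to be in time order:

```python
import pandas as pd

# Toy question rows for one user, in time order
toy = pd.DataFrame({"content_id":         [1, 2, 1, 3, 2, 1],
                    "answered_correctly": [0, 0, 1, 1, 0, 1]})

# 1 for the first time the user sees a question, 2 for the second, and so on
toy["attempt"] = toy.groupby("content_id").cumcount() + 1
accuracy_by_attempt = toy.groupby("attempt")["answered_correctly"].mean()
print(accuracy_by_attempt)
```

For the full data the grouping key would be `['user_id', 'content_id']`. Questions whose accuracy jumps sharply between the first and second attempt would be candidates for the "confusing" label.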

## Questions and lectures

In [None]:
questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv', index_col='question_id')
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv', index_col='lecture_id')

In [None]:
questions.head()

What are the so-called "Parts"? When following the link provided in the data description we find out that this relates to a test.

> The TOEIC L&R uses an optically-scanned answer sheet. There are 200 questions to answer in two hours in Listening (approximately 45 minutes, 100 questions) and Reading (75 minutes, 100 questions). 

The listening section consists of Parts 1-4 (approximately 45 minutes, 100 questions).

The reading section consists of Parts 5-7 (75 minutes, 100 questions).

> **Note:** One of the questions has no related tag so we remove it for now. 

In [None]:
questions = questions.loc[questions.tags.notnull()]

In [None]:
lectures.head()

Does watching a specific lecture help in answering the related questions (based on tags)?

For each question we evaluate the ratio of tags that were represented in the lectures history of the user.

Metadata for the lectures watched by users as they progress in their education.
* `lecture_id`: foreign key for the train/test content_id column, when the content type is lecture (1).
* `part`: top level category code for the lecture.
* `tag`: one tag codes for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.
* `type_of`: brief description of the core purpose of the lecture


In [None]:
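# Compare explicitly so the filter works for both boolean and int8 columns
# (`~` on an int8 column is a bitwise NOT, not a logical negation)
user_questions = user_df.loc[user_df.content_type_id == 0]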

In [None]:
def get_prev_lectures_tags(df, current_timestamp):
    prev_df = df.loc[df.timestamp < current_timestamp]
    # Explicit comparison works for both boolean and int8 content_type_id
    lectures_ids = prev_df.loc[prev_df.content_type_id == 1, 'content_id'].values
    lectures_tags = lectures.loc[lectures_ids, 'tag'].values
    return lectures_tags

In [None]:
get_prev_lectures_tags(user_df, 0.01)

In [None]:
def did_watch_tag_lecture(df):
    did_he = []
    for i in range(len(df)):
        current_timestamp = df['timestamp'].iloc[i]
        if df['content_type_id'].iloc[i]:  # lecture
            did_he.append(None)
        else:  # question
            lectures_tags = set(get_prev_lectures_tags(df, current_timestamp))
            question_tags = set(map(int, questions.loc[df['content_id'].iloc[i]].tags.split()))
#             print(i, lectures_tags, question_tags)
            did_he_ratio = len(lectures_tags & question_tags) / len(question_tags)
            did_he.append(did_he_ratio)
    return did_he

> **Note:** The history of successful answers can be a good predictor for solving questions even if relevant tagged lectures are not present.

In [None]:
ratios = did_watch_tag_lecture(user_df)

In [None]:
user_df['watch_ratio'] = ratios
user_df.head()

In [None]:
user_df.groupby(['watch_ratio', 'answered_correctly']).size()
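The function above rescans all earlier rows for every row, which becomes slow for users with long histories. A possible alternative is a single pass that maintains a running set of watched tags. This is a sketch with hypothetical lookups (`lecture_tag`, `question_tags`, `events`) standing in for the real tables. Note one difference: it counts any lecture that appears earlier in the row order, whereas the function above only counts lectures with a strictly smaller timestamp.

```python
# Hypothetical lookups standing in for the `lectures` and `questions` tables
lecture_tag = {100: 5, 101: 9}                  # lecture_id -> tag
question_tags = {1: {5, 7}, 2: {9}, 3: {5, 9}}  # question_id -> set of tags

# (content_id, is_lecture) rows for one user, already sorted by timestamp
events = [(1, False), (100, True), (1, False),
          (101, True), (3, False), (2, False)]

seen_tags, ratios = set(), []
for content_id, is_lecture in events:
    if is_lecture:
        seen_tags.add(lecture_tag[content_id])  # remember the watched tag
        ratios.append(None)
    else:
        tags = question_tags[content_id]
        ratios.append(len(seen_tags & tags) / len(tags))

print(ratios)  # [0.0, None, 0.5, None, 1.0, 1.0]
```

Each row is visited once, so the cost grows linearly with the length of the user's history.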

## Test data

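What is `env.iter_test()`? It is the iterator of the competition's `riiideducation` API: it yields the test set in batches of `(test_df, sample_prediction_df)`, and each batch is answered with `env.predict` (see the Submission section).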

# Example test

In [None]:
example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')

In [None]:
example_test.head()

# Thoughts about feature engineering


What would we like to know for the prediction?
* How well does the user do in this part?
* How has the user been doing so far?
* Does the user try a question many times?
* Does the user repeat questions?
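The question of how the user has been doing so far can be turned into a running feature that only uses answers given before the current row, so the target does not leak into its own feature. A minimal sketch on toy data; the feature names `n_prior`, `n_prior_correct` and `prior_ratio` are made up.

```python
import pandas as pd

# Toy question log, sorted by user and time
toy = pd.DataFrame({"user_id":            [1, 1, 1, 1, 2, 2],
                    "answered_correctly": [1, 0, 1, 1, 0, 1]})

grouped = toy.groupby("user_id")["answered_correctly"]
# Number of questions the user answered before the current row
toy["n_prior"] = grouped.cumcount()
# Correct answers before the current row (cumulative sum minus the row itself)
toy["n_prior_correct"] = grouped.cumsum() - toy["answered_correctly"]
# Prior accuracy; NaN for a user's first question
toy["prior_ratio"] = toy["n_prior_correct"] / toy["n_prior"].where(toy["n_prior"] > 0)
print(toy)
```

The same pattern extends to other groupings, for example per user and part, once the `part` column is merged in from `questions`.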

In [None]:
cols = ['user_id', 'answered_correctly', 'prior_question_had_explanation']
train = train.loc[:, cols]
train.head()

In [None]:
train = train[train.answered_correctly != -1]
train.shape

# Baseline model

In [None]:
#this clears everything loaded in RAM, including the libraries
%reset -f

In [None]:
import numpy as np
import pandas as pd
import riiideducation
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
import seaborn as sns
import os
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import gc
import sys
pd.set_option('display.max_rows', None)

In [None]:
%%time
cols_to_load = ['row_id', 'user_id', 'answered_correctly', 'content_id']
train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")[cols_to_load]
# train['user_id'] = train['user_id'].astype('int64')
# train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')

print("Train size:", train.shape)

In [None]:
train.head()

In [None]:
# %%time

# questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
# lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
# example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')
# example_sample_submission = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_sample_submission.csv')

In [None]:
train.shape

Dropping lecture rows

In [None]:
train = train[train.answered_correctly != -1]
train.shape

Dropping questions with an extreme correct ratio (here, the very easy ones).

In [None]:
answered_questions = train.groupby(['content_id', 'answered_correctly']).size().unstack()
correct_ratio = answered_questions.iloc[:, 1].divide(answered_questions.sum(axis=1))
correct_ratio.plot.hist(bins=100)

In [None]:
easy_question_th = 0.95
normal_questions = correct_ratio.loc[correct_ratio < easy_question_th].index
train = train.loc[train.content_id.isin(normal_questions)]
train.shape

In [None]:
train_train = train.iloc[:1000000]
train_test = train.iloc[1000000:1200000]

In [None]:
train_train.head()

In [None]:
total_q = train_train.groupby('user_id').size()
n_correct = train_train.groupby('user_id')['answered_correctly'].sum()
ratio_q = n_correct.divide(total_q)
current_user_data = pd.DataFrame({'n_questions': total_q, 'ratio_q': ratio_q})
current_user_data.head()

In [None]:
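# Work on a copy so the assignment does not target a slice of `train`
train_test = train_test.copy()
# Keep the predictions in their own column so the true labels stay available
# (`.loc['answered_correctly'] = 0.5` would have added a row, not set a column).
# TBD: update current_user_data, inc. n_questions & ratio_q
# Use the user's historical correct ratio; fall back to 0.5 for unseen users
train_test['pred'] = train_test.user_id.map(current_user_data.ratio_q).fillna(0.5)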

In [None]:
train_test
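Since the notebook imports `roc_auc_score`, a hold-out evaluation of this kind of baseline could look like the sketch below. The `holdout` frame and `user_ratio` series are toy stand-ins, not the real data.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy stand-ins: true answers and per-user historical correct ratios
holdout = pd.DataFrame({"user_id":            [1, 1, 2, 3],
                        "answered_correctly": [1, 0, 1, 0]})
user_ratio = pd.Series({1: 0.6, 2: 0.9})  # user 3 was never seen

# Map each row to its user's ratio; unseen users fall back to 0.5
pred = holdout.user_id.map(user_ratio).fillna(0.5)
auc = roc_auc_score(holdout.answered_correctly, pred)
print(auc)
```

AUC only depends on how the predictions are ranked, so a constant prediction of 0.5 for every row would score 0.5.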

# Submission

In [None]:
env = riiideducation.make_env()

In [None]:
iter_test = env.iter_test()

In [None]:
for i, (test_df, sample_prediction_df) in enumerate(iter_test):
    # Create the target column, initialised to 0.5
    test_df['answered_correctly'] = 0.5
    
    # Making predictions
    for idx, row in test_df.iterrows():
        if row.user_id in current_user_data.index:
            pred = current_user_data.loc[row.user_id, 'ratio_q']
        else:
            pred = 0.5
        test_df.loc[idx, 'answered_correctly'] = pred    
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])

    # Updating knowledge based on latest batch
    if i > 0:
        prev_group_answers_correct = test_df.prior_group_answers_correct.iloc[0]
        if isinstance(prev_group_answers_correct, str):
            # Tolerate both "1 0 1" and list-like "[1, 0, 1]" strings
            tokens = prev_group_answers_correct.strip('[]').replace(',', ' ').split()
            answers = pd.DataFrame(list(map(int, tokens)), index=prev_test_df.index)
    prev_test_df = test_df.copy()
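The loop above recovers the answers of the previous batch but does not yet use them. One possible way to fold them into the per-user statistics is sketched below on toy data; the `stats`, `batch` and `update` names are hypothetical. The sketch keeps counts rather than the ratio, since counts can be added incrementally and the ratio recomputed afterwards.

```python
import pandas as pd

# Per-user running totals (toy values)
stats = pd.DataFrame({"n_questions": [10, 4], "n_correct": [7, 1]},
                     index=pd.Index([1, 2], name="user_id"))

# Hypothetical previous batch with its now-known answers (questions only)
batch = pd.DataFrame({"user_id": [1, 3, 3], "answered_correctly": [1, 0, 1]})

update = batch.groupby("user_id")["answered_correctly"].agg(
    n_questions="size", n_correct="sum")
# Add the batch counts; users unseen so far start from zero
stats = stats.add(update, fill_value=0).astype(int)
stats["ratio_q"] = stats.n_correct / stats.n_questions
print(stats)
```

Lecture rows (where the answer is -1) would need to be filtered out before the update.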