<h1><center>Riiid! Answer Correctness Prediction. Data Analysis and visualization.</center></h1>

<center><img src="https://www.riiid.co/assets/main_hero_aied.png"></center>

**Reading this notebook, you will find,**

1. Incomplete rows (with null values) which might be removed or complemented
2. Strange task_container_id jumps which might be revised
3. Analysis of features (answer correctness by features, etc)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0' role="tab" aria-controls="home"><center>Quick navigation</center></h3>

* [1. train.csv](#1)
    * [1.1 row_id, timestamp, user_id, user_answer](#11)
    * [1.2 content_type_id, questions / lectures](#12)
    * [1.3 prior_question_had_explanation](#13)
    * [1.4 the incomplete 392,506 rows](#14)
    * [1.5 prior_question_elapsed_time](#15)
    * [1.6 task_container_id](#16)
    * [1.7 user_id](#17)
    * [1.8 content_id](#18)
    
* [2. questions.csv](#2)
* [3. work in progress](#3)

Using .jay format, we can read the full train.csv very fast. You can see the details in [Vopani's fantastic notebook](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid)

In [None]:
# datatable library installation
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
import time
start_time = time.time()

import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

import sys
import gc
import datatable as dt

<a id="1"></a>
# <center>1. train.csv</center>

In [None]:
# saving the dataset in .jay (binary format)
dt.fread("../input/riiid-test-answer-prediction/train.csv").to_jay("train.jay")

In [None]:
# reading the dataset from .jay format
import datatable as dt
train = dt.fread("train.jay")

In [None]:
# transforming to pandas format (reuse the frame loaded above instead of reading train.jay again)
train_df = train.to_pandas()
print(train_df.shape)

# <center>1.1 row_id, timestamp</center>

<a id="11"></a>
* Format  
<font color="Magenta">**Feature column name**:</font> (Data format) *Description by the host*  
**My findings**
  
<font color="Magenta">**row_id**:</font> (int64) *ID code for the row.*  
**This database contains 101,230,332 rows (users' interactions).**  

[Data Description](https://www.kaggle.com/c/riiid-test-answer-prediction/data) says, "<font color="Red">*Expect to see roughly 2.5 million questions in the hidden test set.*</font>" **<- I think *'the 2.5 million questions'* means interactions (rows).**

<font color="Magenta">**timestamp**:</font> (int64) *the time in milliseconds between this user interaction and the first event completion from that user.*  
**Max value is 87,425,770,000. 1 year is 365 \* 24 \* 60 \* 60 \* 1000 = 31,536,000,000. The longest span of interactions is 2.77 years.**
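
The conversion above can be reproduced with a few lines of arithmetic. This is only a minimal sketch: `max_timestamp_ms` is the maximum value quoted above, hard-coded rather than read from the data, and a year is assumed to be 365 days.

```python
# Convert the largest observed timestamp (milliseconds) into years.
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # ignores leap years (assumption)

max_timestamp_ms = 87_425_770_000  # maximum 'timestamp' quoted above
span_years = max_timestamp_ms / MS_PER_YEAR

print(f"{MS_PER_YEAR:,} ms per year")
print(f"longest span: {span_years:.2f} years")
```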

In [None]:
# transform 'content_type_id' from bool to int
train_df['content_type_id'] = train_df['content_type_id'] * 1

In [None]:
pd.options.display.float_format = '{:.11g}'.format
train_df.describe()

# <center>1.2 content_type_id</center>
<a id="12"></a>
<font color="Magenta">**content_type_id**:</font> (int8) *0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.*  
  
**Lectures make up 1.935% of all rows, so there is one lecture per 51.7 interactions (1 / 0.01935), or about 50.7 questions per lecture.  
All lecture rows have <font color="Red">"-1"</font> in <font color="Magenta">'answered_correctly'</font> and <font color="Red">null</font> in <font color="Magenta">'prior_question_elapsed_time'</font>. But despite the description, they have <font color="Red">'False'</font> value instead of 'null' in <font color="Magenta">'prior_question_had_explanation'</font>.**

[Data Description](https://www.kaggle.com/c/riiid-test-answer-prediction/data) says,
"<font color="Red">*The lecture rows in test_df should not be submitted.*</font>"
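
The lecture ratio can be double-checked from the row counts quoted in this notebook. A minimal sketch, assuming the totals reported in sections 1.1 and 1.5 (101,230,332 rows, 1,959,032 of them lectures); the variable names are only for illustration.

```python
total_rows = 101_230_332   # all interactions (section 1.1)
lecture_rows = 1_959_032   # lecture rows (section 1.5)
question_rows = total_rows - lecture_rows

lecture_share = lecture_rows / total_rows           # mean of content_type_id
rows_per_lecture = total_rows / lecture_rows        # interactions per lecture
questions_per_lecture = question_rows / lecture_rows

print(f"lecture share:         {lecture_share:.5f}")
print(f"rows per lecture:      {rows_per_lecture:.1f}")
print(f"questions per lecture: {questions_per_lecture:.1f}")
```

The difference between the last two numbers is exactly one, because each lecture row is itself counted in the interactions.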

In [None]:
train_df_lecture = train_df[train_df['content_type_id'] == 1]
train_df_lecture.shape

In [None]:
train_df_lecture['answered_correctly'].value_counts()

In [None]:
train_df[train_df['answered_correctly'] == -1]['content_type_id'].value_counts()

In [None]:
# .sum() counts the null entries; .count() would just return the number of rows
train_df_lecture['prior_question_elapsed_time'].isnull().sum()

In [None]:
train_df_lecture['prior_question_had_explanation'].value_counts()

In [None]:
del train_df_lecture
gc.collect()

In [None]:
print('Number of missing values for every column')
print(train_df.isnull().sum())

<a id="13"></a>
# <center>1.3 prior_question_had_explanation</center>
  
<font color="Magenta">**prior_question_had_explanation**:</font> (bool) *Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of <font color="Chocolate">an onboarding diagnostic test</font> where they did not get any feedback.*  
  
**90.7% of the questions were answered after reading the explanation, and 67.3% of them were correct.  
On the other hand, 9.3% of the questions were answered without explanation, and only 50% of them were correct.**

In [None]:
train_df_questions_only = train_df[train_df['content_type_id'] == 0]
# normalize=True divides by the number of non-null rows (nulls are dropped by default)
train_df_questions_only['prior_question_had_explanation'].value_counts(normalize=True)

In [None]:
train_df_questions_only.groupby('prior_question_had_explanation')['answered_correctly'].mean()

<a id="14"></a>
# <center>1.4 the incomplete 392,506 rows</center>
  
**I found 392,506 null values in the <font color="Magenta">'prior_question_had_explanation'</font> column. 
Of those <font color="Red">'incomplete 392,506 rows'</font>, 392,441 rows have a 'timestamp' of zero, meaning they belong to the user's first question bundle. The remaining 65 rows also have low task_container_ids, which suggests <font color="Chocolate">'the onboarding diagnostic test'</font>. Answer correctness is not so bad (68% and 65%, respectively).**  
  
**I may remove these rows when modeling or fill the <font color="Magenta">'prior_question_had_explanation'</font> columns with 'False' because users were novices and in <font color="Chocolate">'the onboarding diagnostic test'</font>.**
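
Both options can be sketched on a tiny made-up frame. This is only an illustration under my own assumptions: the `toy` data is invented, and going through the nullable `"boolean"` dtype is just one way to do the fill.

```python
import pandas as pd

# Toy stand-in for train_df: the first bundle has no value yet.
toy = pd.DataFrame({
    "timestamp": [0, 0, 25_000, 60_000],
    "prior_question_had_explanation": [None, None, False, True],
    "answered_correctly": [1, 0, 1, 1],
})
col = "prior_question_had_explanation"

# Option 1: remove the incomplete rows.
dropped = toy.dropna(subset=[col])

# Option 2: treat a missing value as "no explanation seen".
filled = toy.copy()
filled[col] = filled[col].astype("boolean").fillna(False).astype(bool)

print(dropped)
print(filled)
```

Dropping loses each user's first interactions, while filling keeps them but mixes onboarding rows with real "no explanation" rows, so a separate flag column might be worth keeping.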

In [None]:
train_df_explanation_null = train_df[train_df['prior_question_had_explanation'].isnull()]
train_df_explanation_null.shape

In [None]:
train_df_explanation_null[train_df_explanation_null['timestamp'] == 0].describe()

In [None]:
train_df_explanation_null[train_df_explanation_null['timestamp'] != 0].describe()

In [None]:
del train_df_explanation_null
gc.collect()

<a id="15"></a>
# <center>1.5 prior_question_elapsed_time</center>
  
<font color="Magenta">**prior_question_elapsed_time**:</font> (float32) *The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.*  
  
**This column contains 2,351,538 null values. 1,959,032 of them are lectures. The remaining rows are** <font color="Red">'the incomplete 392,506 rows'</font>
  
**The last two digits of this column do not seem to be significant. The max value is 300,000, so I guess the time limit for one bundle is 5 minutes.**

In [None]:
train_df_elapsed_null = train_df[train_df['prior_question_elapsed_time'].isnull()]
train_df_elapsed_null = train_df_elapsed_null[train_df_elapsed_null['content_type_id'] == 0]
train_df_elapsed_null.shape

**Most of the 'timestamp == 0' rows have null values in <font color="Magenta">'prior_question_had_explanation'</font>.  
But <font color="Red">the remaining 3,976 rows</font> have 'False' or 'True' value.  I may remove these rows, too.**

In [None]:
train_df_timestamp0 = train_df[train_df['timestamp'] == 0]
train_df_timestamp0.shape

In [None]:
train_df_timestamp0['prior_question_had_explanation'].value_counts()

In [None]:
train_df_elapsed_check = train_df[train_df['prior_question_elapsed_time'] < 10000]
train_df_elapsed_check = train_df_elapsed_check.sort_values('prior_question_elapsed_time')
train_df_elapsed_check['prior_question_elapsed_time'].unique()

In [None]:
del train_df_elapsed_null, train_df_timestamp0, train_df_elapsed_check
gc.collect()

<a id="16"></a>
# <center>1.6 task_container_id</center>
  
<font color="Magenta">**task_container_id**:</font> (int16) *Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.*  

**The max value of 9999 seems artificial, and I suspect that the system can only count up to 9999 for task_container_id. I checked the behavior after 9999 and found strange counting.**  
(In the first case, the id went from 9999 back to 1396 and stopped. In the second, it stopped exactly at 9999. In the third, it went back to 9506. In the fourth, it went back to 9989. In the fifth, it went back to 9953, then to 9122, 9841, and 3296, and stopped. It looks like a bug to me.)
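
Backward jumps like these can also be searched for per user, so that the boundary between two users is never counted as a jump. A minimal sketch on invented data; the `-1000` threshold is the same arbitrary cut-off used in this notebook, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Toy data: user 1 wraps around after 9999, user 2 starts normally at 0.
toy = pd.DataFrame({
    "user_id":           [1, 1, 1, 1, 2, 2, 2],
    "task_container_id": [9998, 9999, 1396, 1397, 0, 1, 2],
})

# diff() inside each user, so the first row of every user is NaN, not a jump.
step = toy.groupby("user_id")["task_container_id"].diff()

backward_jumps = toy[step < -1000]
print(backward_jumps)
```

With a plain `diff()` over the whole column, the first row of user 2 would look like a drop of 1397 as well, which is why a global check needs the extra `task_container_id != 0` filter.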

In [None]:
train_df_questions_only.reset_index(inplace=True, drop=True)
train_df_questions_only

In [None]:
train_df_doubtful_task_id = train_df_questions_only[(train_df_questions_only['task_container_id'].diff() < -1000) & (train_df_questions_only['task_container_id'] != 0)]
train_df_doubtful_task_id

In [None]:
train_df_doubtful_container_id_bins = pd.cut(train_df_doubtful_task_id['task_container_id'], [0, 1, 9, 99, 999, 9999, 20000]).value_counts().reset_index()
train_df_doubtful_container_id_bins.columns = ['bins', 'count']
train_df_doubtful_container_id_bins = train_df_doubtful_container_id_bins.sort_values(by=['bins'])
train_df_doubtful_container_id_bins

In [None]:
del train_df_doubtful_task_id, train_df_doubtful_container_id_bins
gc.collect()

In [None]:
train_df['task_container_id'].value_counts()

In [None]:
train_df[train_df['task_container_id'] == 9999]

In [None]:
train_df[1896755:1896762]

In [None]:
train_df[1929138:1929144]

In [None]:
train_df[3101565:3101572]

In [None]:
train_df[4301504:4301511]

In [None]:
train_df[5295816:5295826]

**I also found other kinds of jumps in <font color="Magenta">'task_container_id'</font>. Many (but not all) lecture rows show these jumps. And I have no idea why the id went from 4556 to 2786 at row_id 21234106.**

In [None]:
train_df[21234101:21234110]

**The graph below seems like a learning curve to me. Although <font color="Magenta">'task_container_id'</font> contains some noise, I may use it as an indicator of <font color="Red">users' experience</font>.**

In [None]:
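# use question rows only: lecture rows carry answered_correctly == -1 and would drag the mean down
task_accuracy = train_df_questions_only.groupby('task_container_id')['answered_correctly'].mean()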
task_accuracy = task_accuracy.rolling(50).mean()

fig = px.line(
    task_accuracy, 
    title='Answer correctness by task_container_id', 
    height=600, 
    width=800
)

fig.show()

In [None]:
del task_accuracy
gc.collect()

<a id="17"></a>
# <center>1.7 user_id</center>

<font color="Magenta">**user_id**:</font> (int32) *ID code for the user.*  
  
**The database seems to be sorted by this column. When I split the data into four, all user_ids in the first set are smaller than any of those in the latter sets, and so forth.**
  
**393,656 users are in the train.csv.  
68.2% of users interacted fewer than 100 times. 87 of them interacted only once.  
346 users interacted 10,000 times or more. The most active user (user_id == '801103753') interacted 17,917 times.**

**In [this Discussion](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/191106),** Kaggle staff said,
"<font color="Red">*the hidden test set contains new users but not new questions.*</font>"
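
The "sorted by user_id" observation above can be checked directly instead of splitting the data into four. A minimal sketch on made-up ids; on the real data the same property would be read from `train_df['user_id']`.

```python
import pandas as pd

# Toy user_id columns (invented values).
sorted_ids = pd.Series([115, 115, 124, 2746, 2746, 5382], name="user_id")
shuffled_ids = pd.Series([124, 115, 2746, 115, 5382, 2746], name="user_id")

# True when every id is >= the previous one, i.e. the rows are ordered by user.
print(sorted_ids.is_monotonic_increasing)
print(shuffled_ids.is_monotonic_increasing)
```

If the column is sorted, each user's rows are contiguous, so splitting by row position never cuts the user list into overlapping groups (only the user at the boundary can be split in two).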

In [None]:
train_df_user_activeness = train_df['user_id'].value_counts()
train_df_user_activeness

In [None]:
train_df_user_activeness_bins = pd.cut(train_df_user_activeness, [0, 1, 9, 99, 999, 9999, 20000]).value_counts().reset_index()
train_df_user_activeness_bins.columns = ['bins', 'count']
train_df_user_activeness_bins = train_df_user_activeness_bins.sort_values(by=['bins'])
train_df_user_activeness_bins

In [None]:
user_accuracy = train_df_questions_only.groupby('user_id')['answered_correctly'].mean()

fig = px.histogram(
    user_accuracy, 
    x="answered_correctly",
    nbins=100,
    width=700,
    height=500,
    title='Answer correctness by user'
)

fig.show()

In [None]:
del train_df_user_activeness, train_df_user_activeness_bins, user_accuracy
gc.collect()

<a id="18"></a>
# <center>1.8 content_id</center>

<font color="Magenta">**content_id**:</font> (int16) *ID code for the user interaction*  
**The content_id column mixes 13,523 question_ids (0 to 13522) and 415 lecture_ids (89 to 32736).  
The 36 most popular questions were each answered more than 100,000 times. On the other hand, the 243 least popular questions were answered fewer than 100 times, and 9 of them were answered only once. (dummy questions?)**


  


<font color="Magenta">**user_answer**:</font> (int8) *the user's answer to the question, if any. Read -1 as null, for lectures.*  
**Each question has one correct answer among 4 options (0-3). *questions.csv* shows that option 2 is the correct answer less often than the other three, and users accordingly chose option 2 less often.**
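
When reading popularity bins like the ones above, it helps to remember that `pd.cut` closes intervals on the right by default, so a count of exactly 100 falls into `(10, 100]`. A minimal sketch with invented counts; the bin edges here are shortened for illustration.

```python
import pandas as pd

# Invented answer counts per question.
counts = pd.Series([1, 9, 10, 99, 100, 101])

# Default: right-closed bins, e.g. (10, 100] contains 100.
binned = pd.cut(counts, [0, 1, 10, 100, 1000])

# right=False gives left-closed bins, e.g. [100, 1000) contains 100.
binned_left = pd.cut(counts, [1, 10, 100, 1000], right=False)

print(pd.DataFrame({"count": counts, "default": binned, "right=False": binned_left}))
```

So "fewer than 100 times" matches the bins exactly only when the edges are 99-based or `right=False` is used.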

In [None]:
train_df['content_id'].value_counts()

In [None]:
train_df_questions_id_check = train_df_questions_only['content_id'].value_counts()
train_df_questions_id_check

In [None]:
train_df_lectures_only = train_df[train_df['content_type_id'] == 1]
train_df_lectures_id_check = train_df_lectures_only['content_id'].value_counts()
train_df_lectures_id_check

In [None]:
train_df_popularity_of_questions = pd.cut(train_df_questions_id_check, [0, 1, 10, 100, 1000, 10000, 100000, 210000]).value_counts().reset_index()
train_df_popularity_of_questions.columns = ['bins', 'count']
train_df_popularity_of_questions = train_df_popularity_of_questions.sort_values(by=['bins'])
train_df_popularity_of_questions

In [None]:
del train_df_popularity_of_questions, train_df_lectures_only, train_df_lectures_id_check
gc.collect()

In [None]:
content_accuracy = train_df_questions_only.groupby('content_id')['answered_correctly'].mean()

fig = px.histogram(
    content_accuracy, 
    x="answered_correctly",
    nbins=100,
    width=700,
    height=500,
    title='Answer correctness by content'
)

fig.show()

In [None]:
del train_df_questions_only
gc.collect()

<a id="2"></a>
# <center>2. questions.csv</center>

**questions.csv**: metadata for the questions posed to users.

* question_id: foreign key for the train/test content_id column, when the content type is question (0).

* bundle_id: code for which questions are served together.

* correct_answer: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

* part: the relevant section of the TOEIC test.

* tags: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
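
The comparison mentioned for correct_answer can be sketched as a merge followed by an equality check. The two toy frames below are invented and only carry the columns needed; the `recomputed` column name is my own.

```python
import pandas as pd

# Toy slices of train.csv (questions only) and questions.csv.
toy_train = pd.DataFrame({
    "content_id":         [0, 1, 0, 2],
    "user_answer":        [0, 1, 3, 2],
    "answered_correctly": [1, 0, 0, 1],
})
toy_questions = pd.DataFrame({
    "question_id":    [0, 1, 2],
    "correct_answer": [0, 3, 2],
})

# question_id is the foreign key for content_id when the row is a question.
merged = toy_train.merge(
    toy_questions, left_on="content_id", right_on="question_id", how="left"
)
merged["recomputed"] = (merged["user_answer"] == merged["correct_answer"]).astype(int)

all_match = bool((merged["recomputed"] == merged["answered_correctly"]).all())
print(merged)
print("recomputed labels agree:", all_match)
```

On the real data, lecture rows would have to be filtered out first, because their content_id refers to lectures rather than questions.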

In [None]:
questions_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
questions_df.rename(columns={'question_id': 'content_id'}, inplace=True)
questions_df

# <center>2.1 part</center>

<font color="Magenta">**part**:</font> top level category code for the question.  
  
**First of all, Riiid is for TOEIC. You can read about [TOEIC here](https://www.iibc-global.org/english.html). 'part' refers to the [question types](https://www.iibc-global.org/english/toeic/test/lr/about/format.html).**

* Listening Section  
Part1:  Photographs - 6 questions  
Part2:  Question-Response - 25 questions  
Part3:  Conversations - 39 questions  
Part4:  Talks - 30 questions  

* Reading Section  
Part5:  Incomplete Sentences - 30 questions  
Part6:  Text Completion - 16 questions  
Part7:  Single Passages - 29 questions / Multiple Passages - 25 questions

**Part1 is the easiest: 81.5% of its questions were answered correctly. Part5 is the hardest, with an answer correctness of 66.6%.**
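
Note that the per-part numbers in this notebook are means of per-question accuracies, so every question counts equally no matter how often it was answered. A minimal sketch on invented data showing how that can differ from the accuracy over all interactions:

```python
import pandas as pd

# Toy interactions: question 10 is popular, question 11 was answered once.
toy = pd.DataFrame({
    "content_id":         [10, 10, 10, 10, 11, 20, 20],
    "part":               [1, 1, 1, 1, 1, 5, 5],
    "answered_correctly": [1, 1, 1, 0, 0, 1, 0],
})

# Mean of per-question accuracies: every question has the same weight.
per_question = toy.groupby(["part", "content_id"])["answered_correctly"].mean()
unweighted = per_question.groupby(level="part").mean()

# Accuracy over all interactions: popular questions weigh more.
weighted = toy.groupby("part")["answered_correctly"].mean()

print(pd.DataFrame({"per_question_mean": unweighted, "per_interaction": weighted}))
```

Which of the two is more useful depends on the question being asked: "how hard are the questions in this part" versus "how often do users get this part right".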

In [None]:
questions_df = pd.merge(content_accuracy, questions_df, on='content_id')
questions_df

In [None]:
# groupby(...).mean() returns a Series, so give it a name instead of assigning .columns
part_accuracy = questions_df.groupby('part')['answered_correctly'].mean().rename('answer_correctness')
part_accuracy

In [None]:
fig = px.bar(
    part_accuracy, 
    title='accuracy by part'
)

fig.show()

In [None]:
run_time = time.time() - start_time
print(run_time)

**Thank you for reading my notebook. This notebook is an additional analysis of [Isaienkov's great notebook](https://www.kaggle.com/isaienkov/riiid-answer-correctness-prediction-eda-modeling) from a different angle. Read his if you haven't. I also reuse some of his code. Thanks, [Isaienkov](https://www.kaggle.com/isaienkov).  
This is my first public notebook. I would appreciate it if you**  
# comment and/or upvote.