![Riiid! App](https://venturebeat.com/wp-content/uploads/2020/07/1-1-e1595515659939.jpeg?w=1200&strip=all)


The goal of this notebook is to give an explanatory, exploratory, storytelling-style walkthrough of this data. I'll try to highlight all the relevant parts and update it regularly with important information. Reading this notebook from top to bottom should give you the big picture of this data.

### Overview of the data:
This data is simply a record of users' interactions with an educational app. Each user has a unique id, and each interaction is either watching a lecture or answering a question. You can think of the users as the data-generating distribution: they are given questions as input and produce answers as output. Or you can think of the app as the data-generating distribution: given a user and historical information about them as input, it outputs whether their answer is correct.

### Goal of the modeling:
Our goal in this competition is to build a model that predicts whether a user's answer is correct, given the question and historical data about the user.

# 1. Importing Data:

The "train.csv" dataframe has more than 100 million records, which makes it impossible to load fully in a Kaggle notebook with 16 GB of RAM (the file is about 5 GB on disk, but pandas needs roughly 5x that in RAM). There are many ways around this; the one I have chosen is @rohanrao's, which uses a package called datatable (check out his notebook [here](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid)).
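
Another common way to reduce memory is to give pandas compact dtypes up front. The sketch below is not what this notebook uses; the dtype choices are my assumptions about the value ranges, and it runs on a tiny synthetic CSV instead of the real file, so widen any dtype that turns out not to fit.

```python
import io

import pandas as pd

# Assumed compact dtypes for the train.csv columns (not verified against
# the full file); widen any of them if a column does not fit.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",  # nullable boolean
}

# Tiny synthetic stand-in for train.csv so the sketch runs anywhere.
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
    "2,118363,115,128,0,0,0,1,55000.0,False\n"
)

sample = pd.read_csv(sample_csv, dtype=TRAIN_DTYPES)
print(sample.dtypes)
print(f"{sample.memory_usage(deep=True).sum()} bytes")
```

On the real file you would pass the path instead of `sample_csv`. Whether the result fits in 16 GB is something to check rather than assume.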

In [None]:
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

In [None]:
import gc

import numpy as np
import pandas as pd
import datatable as dt


import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

from sklearn.metrics import roc_auc_score

plt.style.use('ggplot')

In [None]:
train = dt.fread('../input/riiid-test-answer-prediction/train.csv').to_pandas()
questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')

# 2. Train.csv:

We start by describing and analyzing the train.csv columns one after the other:

In [None]:
train.head()

## 2.1 timestamp:

This column tells us the time between the user's first interaction and the current interaction, in milliseconds.
It gives us an idea of how active users were over time. Let's see how it is distributed.

In [None]:
timedelta = pd.to_timedelta(train['timestamp'], unit='ms')
print(f"\
Average: {timedelta.mean().floor('s')}\n\
Median:  {timedelta.median().floor('s')}\n\
Min:     {timedelta.min().floor('s')}\n\
Max:     {timedelta.max().floor('s')}")
del timedelta
_ = gc.collect()

fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)

ax.hist((train['timestamp'] / (1000 * 60 * 60 * 24)), bins=100, rwidth=0.8)

format_yticks = matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ','))

ax.set_xticks(range(0, int(train['timestamp'].max() / (1000 * 60 * 60 * 24)), 20))
plt.xticks(rotation=70)
ax.set_xlabel('Days')
ax.yaxis.set_major_formatter(format_yticks)

I plotted the variable in days; it would be better to look more closely, using more meaningful time spans...

In [None]:
hours = range(0, 1000 * 60 * 60 * 24, 1000 * 60 * 60 * 12)
days = range(1000 * 60 * 60 * 24, 1000 * 60 * 60 * 24 * 7, 1000 * 60 * 60 * 24)
weeks = range(1000 * 60 * 60 * 24 * 7, 1000 * 60 * 60 * 24 * 7 * 4, 1000 * 60 * 60 * 24 * 7)
months = range(1000 * 60 * 60 * 24 * 30, 1000 * 60 * 60 * 24 * 30 * 12, 1000 * 60 * 60 * 24 * 30 * 2)
rest = [1000 * 60 * 60 * 24 * 30 * 12, train['timestamp'].max()]

bins = list(hours) + list(days) + list(weeks) + list(months) + rest

h,e = np.histogram(train['timestamp'], bins=bins)

plt.figure(figsize=(15, 4))

plt.xticks(np.arange(-0.5, len(h) + 0.5),
           ["0", "12 hours", "1 day", "2 days", "3 days", "4 days", "5 days", "6 days", "1 week", "2 weeks", "3 weeks", "1 month", "3 months", "5 months", "7 months", "9 months", "11 months", "1 year", "max"],
           fontsize=14, rotation=60)

format_yticks = FuncFormatter(lambda x, p: format(int(x), ','))

plt.bar(range(len(bins)-1), h, width=0.96,
        color = ["tab:blue"] * len(hours) + ["tab:orange"] * len(days) + ["tab:green"] * len(weeks) + ["tab:red"] * len(months) + ["tab:purple"] * len(rest))
plt.ylabel('Number of Interactions')
plt.gcf().axes[0].yaxis.set_major_formatter(format_yticks)

We can see that many interactions happen in the first 12 hours, and also during the first and second month.
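
The same kind of bucketing can also be done with `pd.cut`, which attaches a readable label to every row. This is a minimal sketch on made-up timestamps; the bucket edges and labels are illustrative choices of mine, coarser than the bins in the plot above.

```python
import numpy as np
import pandas as pd

MS_PER_HOUR = 1000 * 60 * 60
MS_PER_DAY = MS_PER_HOUR * 24

# Hypothetical timestamps (ms since each user's first interaction).
timestamps = pd.Series(
    [0, 5 * MS_PER_HOUR, 20 * MS_PER_HOUR, 3 * MS_PER_DAY,
     10 * MS_PER_DAY, 45 * MS_PER_DAY, 400 * MS_PER_DAY]
)

# Illustrative bucket edges and labels (assumptions, not the plot's bins).
edges = [0, 12 * MS_PER_HOUR, MS_PER_DAY, 7 * MS_PER_DAY,
         30 * MS_PER_DAY, 365 * MS_PER_DAY, np.inf]
labels = ["<12 hours", "12-24 hours", "1-7 days",
          "7-30 days", "30-365 days", ">365 days"]

# right=False makes each bucket [left, right), so timestamp 0 is counted.
buckets = pd.cut(timestamps, bins=edges, labels=labels, right=False)

# sort=False keeps the buckets in chronological (category) order.
bucket_counts = buckets.value_counts(sort=False)
print(bucket_counts)
```

Having the bucket as a column makes it easy to group other variables, such as answer correctness, by time since first interaction.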

## 2.2 user_id:

Each user has their own id, and the hidden test set will also contain new user ids. Let's see how many users we have in this dataset:

In [None]:
print(f"Total number of users: {train['user_id'].nunique():,}\n\
Number of interactions / Number of users: {len(train['user_id']) / train['user_id'].nunique():.2f}")

We plot the number of interactions per user to check whether there are clusters of very active users and others with less activity:

In [None]:
user_counts = train['user_id'].value_counts()

print(f"\
Average: {user_counts.mean():.02f}\n\
Median:  {user_counts.median():>6}\n\
Min:     {user_counts.min():>6}\n\
Max:     {user_counts.max():,}")

fig = plt.figure(figsize=(15,4))
ax = fig.add_subplot(111)

ax.hist(user_counts, bins=100, rwidth=0.8)

ax.set_xlabel('Number of Interactions')
ax.set_ylabel('Users counts')
ax.yaxis.set_major_formatter(format_yticks)

This distribution is heavily right-skewed: most users have few interactions, with a long tail of very active users. Let's re-plot it with log-spaced bins:

In [None]:
# `endpoint` is a boolean flag, and `np.int` was removed in NumPy 1.24 (use the builtin int)
bins = np.logspace(0, 8, base=6, num=50, endpoint=True, dtype=int)

h,e = np.histogram(user_counts, bins=bins)

fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)

plt.sca(ax)
plt.xticks(np.arange(-0.5, len(h) + 0.5), bins)

ax.bar(range(len(bins)-1), h, width=0.9)

plt.xticks(rotation=70)
ax.set_xlabel('Number of Interactions')
ax.set_ylabel('Users counts')
ax.yaxis.set_major_formatter(format_yticks)

It seems that we have a mixture of two distributions: the left one (the dominant one) is users with few interactions, and the right one is users with orders of magnitude more interactions.
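
A variation on the log-binned histogram is to derive the edges from the data with `np.geomspace`, so the last bin ends right at the maximum. This is only a sketch: `log_binned_counts` is a hypothetical helper, and the counts are synthetic draws with a heavy right tail, not the real `user_counts`.

```python
import numpy as np


def log_binned_counts(values, num_bins=20):
    """Histogram positive counts using logarithmically spaced bin edges."""
    values = np.asarray(values)
    # Edges run from 1 to just past the maximum so every value is covered.
    edges = np.geomspace(1, values.max() + 1, num=num_bins + 1)
    hist, edges = np.histogram(values, bins=edges)
    return hist, edges


# Synthetic stand-in for user_counts (assumption: lognormal-like tail).
rng = np.random.default_rng(0)
fake_user_counts = np.ceil(
    rng.lognormal(mean=4.0, sigma=1.5, size=10_000)
).astype(int)

hist, edges = log_binned_counts(fake_user_counts, num_bins=20)
for left, right, n in zip(edges[:-1], edges[1:], hist):
    print(f"[{left:>10.1f}, {right:>10.1f}): {n:,}")
```

Passing the same edges to `ax.hist` and calling `ax.set_xscale('log')` would draw the bars on a true logarithmic axis instead of evenly spaced positions.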

## 2.3 content_id:

This column holds the id of a question or a lecture (the same ids also appear in the questions.csv and lectures.csv datasets). We can verify that this column only contains ids from the other dataframes using the code below:

In [None]:
content_id_in_dfs = np.isin(train['content_id'].unique(), np.concatenate([questions.question_id, lectures.lecture_id]))
print(f"Are all content ids in question and lecture dataframes?: {np.all(content_id_in_dfs)}")

But do these ids identify each question or lecture uniquely?...

In [None]:
ids_intersect = np.intersect1d(questions.question_id, lectures.lecture_id)
print(f"Questions ids and lectures ids not intersecting?: {ids_intersect.size == 0}")

This is important: if the two id sets overlap, content_id alone cannot tell a question from a lecture, so we also need the content_type_id column to distinguish them.
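
In practice this means filtering on content_type_id before joining with the metadata. Here is a minimal sketch on tiny made-up frames (the `*_sample` names and values are hypothetical), where id 10 is deliberately both a question and a lecture:

```python
import pandas as pd

# Tiny synthetic frames; only columns that appear in this notebook are used.
train_sample = pd.DataFrame({
    "content_id":      [10, 10, 11, 12],
    "content_type_id": [0, 1, 0, 1],  # 0: question, 1: lecture
})
questions_sample = pd.DataFrame(
    {"question_id": [10, 11], "tags": ["51 131", "131"]}
)
lectures_sample = pd.DataFrame({"lecture_id": [10, 12]})

# Filter on content_type_id first, then join on the matching id column.
question_rows = train_sample[train_sample["content_type_id"] == 0].merge(
    questions_sample, left_on="content_id", right_on="question_id", how="left"
)
lecture_rows = train_sample[train_sample["content_type_id"] == 1].merge(
    lectures_sample, left_on="content_id", right_on="lecture_id", how="left"
)

print(question_rows)
print(lecture_rows)
```

Merging without the filter would attach question metadata to the lecture row with content_id 10 as well, which is exactly the mix-up the type column prevents.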

We need to answer another question now: are all lectures and questions present in our training dataset?

In [None]:
questions_not_in_train = np.isin(questions.question_id, train['content_id'][train['content_type_id'] == 0].unique(), invert=True)
lectures_not_in_train = np.isin(lectures.lecture_id, train['content_id'][train['content_type_id'] == 1].unique(), invert=True)
print(f"\
Questions not in train.csv: {np.count_nonzero(questions_not_in_train)}\n\
Lectures not in train.csv:  {np.count_nonzero(lectures_not_in_train)}")

Only 3 lectures are not present in the training dataset.

We will analyze content_id further in later sections, along with the corresponding dataframes.

## 2.4 content_type_id:

This indicates whether the content is a lecture or a question (0: question, 1: lecture).

In [None]:
print(f"\
Unique values:            {train['content_type_id'].nunique()}\n\
Questions:       {np.count_nonzero(train['content_type_id'] == 0):,}\n\
Lectures:         {np.count_nonzero(train['content_type_id'] == 1):,}")

## 2.5 task_container_id:

Questions and lectures are bundled together and given to users in batches; this column tells us how the different contents are grouped.

Let's get an idea of how they are contained...

In [None]:
is_same_type_container = train['content_type_id'].groupby(train['task_container_id']).apply(lambda x: np.all(x == x.iloc[0]))
print(f"\
Total number of containers:                            {train['task_container_id'].nunique():,}\n\
Any lectures and questions in the same container?:     {np.all(is_same_type_container) == False}\n\
How many containers with mixed content type:           {np.count_nonzero(is_same_type_container == False)}\n\
How many containers with one content type:             {np.count_nonzero(is_same_type_container)}")
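
The check above groups by task_container_id across all users. If container ids are reused from one user to the next (an assumption on my part), the size of a bundle is better measured per user. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical interactions: user 1 sees one single question and then a
# bundle of three; user 2 sees a bundle of two.
interactions = pd.DataFrame({
    "user_id":           [1, 1, 1, 1, 2, 2],
    "task_container_id": [0, 1, 1, 1, 0, 0],
    "content_type_id":   [0, 0, 0, 0, 0, 0],
})

# Count the questions in each (user, container) pair.
bundle_sizes = (
    interactions[interactions["content_type_id"] == 0]
    .groupby(["user_id", "task_container_id"])
    .size()
    .rename("n_questions")
)

print(bundle_sizes)
# How often each bundle size occurs.
print(bundle_sizes.value_counts().sort_index())
```

On the full dataset the second print would show how common multi-question bundles are compared with single questions.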

## 2.6 user_answer:

The user's answer to the question, if any. Lectures have no answer, so they are recorded as -1.

In [None]:
print(f"Unique values: {train['user_answer'].nunique()}")

In [None]:
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)

ax.hist(train['user_answer'])

ax.set_xticks(train['user_answer'].unique())

ax.set_xlabel('Answer Number')
ax.set_ylabel('Counts')

ax.yaxis.set_major_formatter(format_yticks)

## 2.7 answered_correctly:

Whether the user answered correctly; -1 for lectures. This is the target: the only column we need to predict.

In [None]:
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)

ax.hist(train['answered_correctly'])

ax.set_xticks(train['answered_correctly'].unique())

ax.set_xlabel('Answer Correctness')
ax.set_ylabel('Counts')
ax.yaxis.set_major_formatter(format_yticks)

## 2.8 prior_question_elapsed_time:

The average time it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture.

In [None]:
total_na = train['prior_question_elapsed_time'].isna().sum()
timedelta = pd.to_timedelta(train['prior_question_elapsed_time'], unit='ms')

print(f"\
Total null values:      {total_na:,}\n\
Total non-null values:  {train.shape[0] - total_na:,}\n\
Average:                {timedelta.mean().total_seconds():>12} s\n\
Median:                 {timedelta.median().total_seconds():>12} s\n\
Min:                    {timedelta.min().total_seconds():>12} s\n\
Max:                    {timedelta.max().total_seconds():>12} s")

fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)

ax.hist(train['prior_question_elapsed_time'][train['prior_question_elapsed_time'].isna() == False] / 1000, bins = 50, rwidth=0.9)

ax.set_xlabel('Seconds')
ax.set_ylabel('Number of Interactions')
ax.yaxis.set_major_formatter(format_yticks)

## 2.9 prior_question_had_explanation:

Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle.

In [None]:
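had_explanation = train['prior_question_had_explanation']
# Count each value explicitly: positions in value_counts() depend on frequency, not on True/False
print(f"\
True:  {(had_explanation == True).sum():,}\n\
False: {(had_explanation == False).sum():,}")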

# 3. questions.csv:

This dataframe gives us information about the questions. The same questions will be used in the hidden test set.

In [None]:
questions.head()

In [None]:
tags = questions.tags.apply(lambda x: [int(t) if t.isdigit() else -1 for t in str(x).split(' ')])
tags = np.concatenate(tags.tolist())

In [None]:
print(f"\
Total questions: {questions.question_id.nunique()}")

# 4. lectures.csv:

In [None]:
lectures.head()

In [None]:
print(f'Total number of lectures: {lectures.lecture_id.nunique()}')

# 5. Looking for patterns:
As mentioned before, our goal in this competition is to predict whether the user's answer is correct. In this section we try to find patterns that describe our target, using the different predictors available in the dataset and some statistics.

## 5.1 Question difficulty:
First, let's see whether questions differ in difficulty. We do that by calculating the ratio of correct answers for each question and plotting it:

In [None]:
questions_ratios = train.query('content_type_id == 0').groupby('content_id')['answered_correctly'].mean()
plt.figure(figsize=(15, 4))
questions_ratios.hist(bins=100, rwidth=0.8)
plt.xlabel('Correctness Ratio')
_ = plt.ylabel('Number of Questions')

Here we have a roughly bell-shaped distribution with most of its mass on the right (high correctness ratios). But why a bell shape rather than a uniform distribution? It could be that each question follows a Bernoulli distribution with its own parameter, meaning each question has its own probability of being answered correctly.
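
One way to get a feel for the Bernoulli idea is a small simulation. This sketch does not use the real data: the Beta(5, 2) distribution for the per-question probabilities and the sample sizes are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_questions = 1_000
answers_per_question = 500

# Assumption: each question's true probability of a correct answer is
# drawn from a Beta(5, 2) distribution (illustrative choice only).
true_p = rng.beta(a=5.0, b=2.0, size=n_questions)

# Each answer is a Bernoulli(p_q) draw, so the number of correct answers
# per question is Binomial(answers_per_question, p_q).
correct_counts = rng.binomial(n=answers_per_question, p=true_p)
observed_ratio = correct_counts / answers_per_question

print(f"Mean observed ratio:   {observed_ratio.mean():.3f}")
print(f"Std of observed ratio: {observed_ratio.std():.3f}")
print(f"Max |observed - true|: {np.abs(observed_ratio - true_p).max():.3f}")
```

With enough answers per question the observed ratio stays close to each question's own parameter, so the shape of the histogram mostly reflects how those parameters are spread across questions.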

Let's try to use answer correctness mean as a prediction, and calculate the score:

In [None]:
preds = questions_ratios[train.query('content_type_id == 0').content_id].values
score = roc_auc_score(train.query('content_type_id == 0').answered_correctly.values, preds)
print(f'Score: {score}')

That's a good score given that we used only a single feature, which shows how important this feature is.
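
Note that the score above is computed on the same rows that produced the means, so it is optimistic. A fairer check fits the means on earlier rows and applies them to later ones. The sketch below shows the idea on made-up rows; the 75% split and the fallback to the global mean for unseen questions are my assumptions.

```python
import pandas as pd

# Synthetic question rows, assumed to be in chronological order.
rows = pd.DataFrame({
    "content_id":         [1, 1, 2, 2, 1, 2, 3, 1],
    "answered_correctly": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Fit on the first 75% of rows, evaluate on the rest.
split = int(len(rows) * 0.75)
fit_rows, holdout_rows = rows.iloc[:split], rows.iloc[split:]

question_means = fit_rows.groupby("content_id")["answered_correctly"].mean()
global_mean = fit_rows["answered_correctly"].mean()

# Questions never seen while fitting (content_id 3 here) fall back to
# the global mean.
holdout_preds = (
    holdout_rows["content_id"].map(question_means).fillna(global_mean)
)
print(holdout_preds.tolist())
```

On the real data, `holdout_preds` and the holdout labels can be passed to `roc_auc_score` exactly as in the cell above.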

## 5.2 User's Performance:
Users may also differ in how well they answer questions. We do the same as in the previous section: group by user_id and take the mean of the answer correctness column:

In [None]:
users_ratios = train.query('content_type_id == 0').groupby('user_id')['answered_correctly'].mean()
plt.figure(figsize=(15, 4))
users_ratios.hist(bins=100, rwidth=0.8)
plt.xlabel('Correctness Ratio')
_ = plt.ylabel('Number of Users')

Let's try this as a prediction...

In [None]:
preds = users_ratios[train.query('content_type_id == 0').user_id].values
score = roc_auc_score(train.query('content_type_id == 0').answered_correctly.values, preds)
print(f'Score: {score}')

## 5.3 Container difficulty:

In [None]:
containers_ratios = train.query('content_type_id == 0').groupby('task_container_id')['answered_correctly'].mean()
plt.figure(figsize=(15, 4))
containers_ratios.hist(bins=100, rwidth=0.8)
plt.xlabel('Correctness Ratio')
_ = plt.ylabel('Number of Containers')

This distribution has less variance than the other two, which suggests that containers don't differ much in difficulty. Container id probably won't be a good predictor of answer correctness.

In [None]:
preds = containers_ratios[train.query('content_type_id == 0').task_container_id].values
score = roc_auc_score(train.query('content_type_id == 0').answered_correctly.values, preds)
print(f'Score: {score}')

As expected, it gave us a lower score than the other predictors.

# 6. Submission process:
Submissions should be done through notebooks, using the time series API.

The submission process is iterative: your model receives data in batches. For each batch, your model has to send its predictions for the current batch to the API before receiving the next one.

Along with each batch you receive from the API, the labels of the preceding batch are revealed, so it is useful for your model to learn and predict at the same time.
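
To make the learn-then-predict rhythm concrete, here is a simulation of that loop. It does not call the competition API: `RunningMeanModel` and `simulated_batches` are hypothetical stand-ins, and the model is just a running correctness mean per question.

```python
from collections import defaultdict


class RunningMeanModel:
    """Running correctness mean per question id (hypothetical helper)."""

    def __init__(self, prior=0.5):
        self.prior = prior  # prediction for questions with no history
        self.correct = defaultdict(int)
        self.seen = defaultdict(int)

    def predict(self, content_ids):
        return [
            self.correct[c] / self.seen[c] if self.seen[c] else self.prior
            for c in content_ids
        ]

    def update(self, content_ids, labels):
        for c, y in zip(content_ids, labels):
            self.correct[c] += y
            self.seen[c] += 1


# Simulated stream: (question ids in this batch, labels of the PREVIOUS batch).
simulated_batches = [
    ([7, 8], []),
    ([7, 9], [1, 0]),
    ([8, 7], [1, 1]),
]

model = RunningMeanModel()
previous_ids = []
all_predictions = []
for content_ids, previous_labels in simulated_batches:
    # 1. Learn from the labels just revealed for the previous batch.
    model.update(previous_ids, previous_labels)
    # 2. Predict the current batch before the next one is released.
    all_predictions.append(model.predict(content_ids))
    previous_ids = content_ids

print(all_predictions)
```

The important part is the order inside the loop: update with the revealed labels first, then predict. The real API delivers batches and labels in its own format, which the linked demonstration covers.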

[API Demonstration](https://www.kaggle.com/sohier/competition-api-detailed-introduction).

# 7. Private Dataset and Proportions:
Lastly, one important aspect of every competition is how different the private test set is from the public one.

One thing worth highlighting is that the private dataset will have new users and will contain about 2.5 million questions.