# Riiid! Answer Correctness Prediction
## Track knowledge states of 1M+ students in the wild


![](https://www.riiid.co/assets/opengraph.png)

In this competition, the challenge is to create algorithms for *Knowledge Tracing*, the modeling of student knowledge over time.

The goal is to accurately predict how students will perform on future interactions: for each student, we predict whether they will answer their next question correctly.


## Presentation of the solution:
This is a simple baseline that you can start from; it gave good ROC AUC results.

This heuristic model rests on two assumptions:

1. Questions differ in difficulty: some questions are easier than others.
2. Checking the correctness of one's answer (and its explanation) after responding to a question influences performance on the next question.

## Modeling of the solution:

We measure question difficulty by calculating the percentage of students who answered it correctly.

We will apply **target encoding** to the combination of the question (`content_id`) and whether the student looked at the explanation of the prior question (`prior_question_had_explanation`).
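As a minimal sketch of what target encoding means here, the toy example below replaces each (question, explanation) pair with the mean of `answered_correctly` observed for that pair. The rows are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy interactions (hypothetical values, not from the competition data).
toy = pd.DataFrame({
    "content_id": [10, 10, 10, 10, 20, 20],
    "prior_question_had_explanation": [True, True, False, False, True, True],
    "answered_correctly": [1, 0, 0, 0, 1, 1],
})

# Target encoding: each (question, explanation) pair is represented by the
# mean of the target observed for that pair.
encoding = (
    toy.groupby(["content_id", "prior_question_had_explanation"])["answered_correctly"]
    .mean()
    .rename("content_explanation_acc")
    .reset_index()
)
print(encoding)
```

Pairs that never occur in the data (here, question 20 without an explanation) simply get no row, so a fallback value is needed at prediction time.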





# **Load Data** (whole Dataset)

I am only going to use content_id and prior_question_had_explanation as features for this heuristic model.

I am loading the whole dataset using datatable. (Many thanks to @Vopani; check his notebook [here](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid).)


**{**  The only issue with this is that I would rather read only the three columns from the beginning, instead of loading the whole table and then extracting the three columns. I'm not sure whether datatable in Python allows this. Can anyone help me with this? (pandas allows it with the `usecols` argument of `read_csv`.)    **}**
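For reference, here is a small sketch of the pandas approach mentioned above, run on a tiny in-memory CSV with made-up rows rather than the real `train.csv`. datatable's `fread` also documents a `columns` parameter that may serve the same purpose, but that is not tested here.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for train.csv (hypothetical rows).
csv_text = (
    "row_id,user_id,content_id,answered_correctly,prior_question_had_explanation\n"
    "0,115,5692,1,\n"
    "1,115,5716,1,False\n"
    "2,124,128,0,True\n"
)

wanted = ["content_id", "prior_question_had_explanation", "answered_correctly"]

# usecols restricts parsing to the listed columns; the others are skipped.
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset.columns.tolist())
print(subset.shape)
```

Note that `usecols` keeps the column order of the file, not the order of the list passed in.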

In [None]:
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
import riiideducation
import pandas as pd
import numpy as np
import datatable as dt

In [None]:
# saving the dataset in .jay (binary format)
dt.fread("../input/riiid-test-answer-prediction/train.csv").to_jay("train.jay")

In [None]:
%%time

# reading the dataset from .jay format
import datatable as dt

train = dt.fread("train.jay")

print(train.shape)

In [None]:
%%time

train = train.to_pandas()[['content_id','prior_question_had_explanation','answered_correctly']]

In [None]:
train.head()

In [None]:
train.shape

We don't need the lecture rows; we only need the question rows.

In [None]:
## answer_df is the train data restricted to answers, i.e. without the lecture rows (answered_correctly == -1)
answer_df = train.query('answered_correctly != -1')
del train

In [None]:
answer_df.describe()

On average, 65% of questions are answered correctly.


# Verification of the assumptions.

In [None]:
answer_df[['content_id','answered_correctly']].groupby(['content_id']).agg('mean').plot()

The percentage of students answering correctly varies from one question to another.

Note that there are two questions with very low scores, where fewer than 10% of students answered correctly!

TO DO:

   Look into those questions.
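One possible way to start on this, sketched on a toy frame that stands in for `answer_df` (the values are made up): compute the accuracy and the number of answers per question, then keep the questions below the 0.1 threshold. The count matters because a low score based on a handful of answers is much weaker evidence than one based on thousands.

```python
import pandas as pd

# Toy stand-in for answer_df (hypothetical values).
toy_answers = pd.DataFrame({
    "content_id": [1, 1, 1, 2, 2, 3] + [4] * 12,
    "answered_correctly": [1, 1, 0, 0, 0, 1] + [0] * 11 + [1],
})

# Accuracy and number of answers for each question.
per_question = toy_answers.groupby("content_id")["answered_correctly"].agg(
    accuracy="mean", n_answers="count"
)

# Keep questions below the 0.1 threshold; n_answers shows whether the low
# score rests on many answers or only a few.
hardest = per_question[per_question["accuracy"] < 0.1].sort_values("accuracy")
print(hardest)
```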

In [None]:
answer_df[['prior_question_had_explanation','answered_correctly']].groupby(['prior_question_had_explanation']).agg('mean').plot(kind='bar')

The bar plot above shows that, on average, students perform better when they have checked the explanation of their previous answer, which is consistent with the second assumption.

# Heuristic model

In [None]:
# Overall fraction of correct answers (same value as the 'mean' row of describe())
average_student_performance = answer_df['answered_correctly'].mean()
average_student_performance

On average, 65.72% of questions are answered correctly.

We will use this statistic as the fallback prediction for test rows whose combination of `content_id` and `prior_question_had_explanation` was never seen in the training data.
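The fallback can be illustrated on a toy batch. The lookup table and the batch below are hypothetical; only the left merge followed by `fillna` mirrors what this notebook does.

```python
import pandas as pd

# Hypothetical lookup table with per-(question, explanation) accuracy.
lookup = pd.DataFrame({
    "content_id": [10, 10],
    "prior_question_had_explanation": [False, True],
    "content_explanation_acc": [0.40, 0.55],
})
global_mean = 0.6572  # overall accuracy, used as the fallback

# Hypothetical test batch; content_id 99 never appeared in training.
batch = pd.DataFrame({
    "row_id": [0, 1, 2],
    "content_id": [10, 99, 10],
    "prior_question_had_explanation": [True, True, False],
})

# A left merge keeps every batch row; unseen pairs get NaN, which fillna
# replaces with the global mean.
merged = batch.merge(
    lookup, how="left", on=["content_id", "prior_question_had_explanation"]
)
merged["answered_correctly"] = merged["content_explanation_acc"].fillna(global_mean)
print(merged[["row_id", "answered_correctly"]])
```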

In [None]:
%%time
## Calculate accuracy by content and explanation
c_exp_acc = answer_df[['content_id','prior_question_had_explanation','answered_correctly']].groupby(['content_id','prior_question_had_explanation']).agg('mean').reset_index()
c_exp_acc.columns = ['content_id','prior_question_had_explanation', 'content_explanation_acc']
## Free memory first; head() stays last so the notebook displays it
del answer_df
c_exp_acc.head()

In [None]:
# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

In [None]:
iter_test = env.iter_test()

In [None]:
for (test_df, sample_prediction_df) in iter_test:

    ## Look up the accuracy computed for each (content, explanation) pair,
    ## then fill pairs missing from the lookup with the overall average accuracy.
    test_df = test_df.merge(c_exp_acc, how = 'left', on = ['content_id','prior_question_had_explanation'])
    test_df['answered_correctly'] = test_df['content_explanation_acc'].fillna(average_student_performance)
    

    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])