# Riiid! Answer Correctness Prediction
## Introduction
In this competition you will predict which questions each student is able to answer correctly. You will loop through a series of batches of questions. Once you have submitted your predictions for a batch, you can move on to the next batch.

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Notebooks
* You must use our custom **`riiideducation`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions.  If you do not use this module properly, your code may fail.

In this Starter Notebook, we'll show how to use the **`riiideducation`** module to get the test features and make predictions.

## TL;DR: End-to-End Usage Example
```
import pandas as pd
import riiideducation

# You can only call make_env() once per Notebook session
env = riiideducation.make_env()
iter_test = env.iter_test()

# Training data is in the competition dataset as usual
train_df = pd.read_csv('/kaggle/input/riiideducation/train.csv', low_memory=False)
train_my_model(train_df)

for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
```
Note that `train_my_model` is a function you need to write for the above example to work.
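
As an illustration only, here is one way `train_my_model` could look: a minimal baseline sketch that records the mean correctness per question. The returned dictionary layout and the toy data are assumptions made for this example, not part of the competition API, and it assumes lecture rows are marked with `content_type_id == 1` and carry no usable label.

```python
import pandas as pd


def train_my_model(train_df: pd.DataFrame) -> dict:
    """Hypothetical baseline: mean correctness overall and per question."""
    # Keep question rows only; lecture rows carry no usable answer label.
    questions = train_df[train_df['content_type_id'] == 0]
    per_question = questions.groupby('content_id')['answered_correctly'].mean()
    return {
        'global_mean': float(questions['answered_correctly'].mean()),
        'per_question': per_question.to_dict(),
    }


# Tiny synthetic frame standing in for train.csv (values are made up).
toy_train = pd.DataFrame({
    'content_id': [1, 1, 2, 9],
    'content_type_id': [0, 0, 0, 1],
    'answered_correctly': [1, 0, 1, -1],
})
model = train_my_model(toy_train)
print(model['per_question'])
```

A real model would use far more signal (user history, timing, explanations), but a per-question mean is a quick sanity check for the pipeline.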

## In-depth Introduction
First let's import the module and create an environment.

In [1]:
import riiideducation
import pandas as pd

# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

In [2]:
folder = '/kaggle/input'

In [3]:
import os
for dirname, _, filenames in os.walk(folder):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/media/mourao/BACKUP/kaggle.com/c/riiid-test-answer-prediction/data/train.csv
/media/mourao/BACKUP/kaggle.com/c/riiid-test-answer-prediction/data/lectures.csv
/media/mourao/BACKUP/kaggle.com/c/riiid-test-answer-prediction/data/example_test.csv
/media/mourao/BACKUP/kaggle.com/c/riiid-test-answer-prediction/data/example_sample_submission.csv
/media/mourao/BACKUP/kaggle.com/c/riiid-test-answer-prediction/data/questions.csv


### Training data is in the competition dataset as usual
It's larger than will fit in memory with default settings, so we'll specify more efficient datatypes and only load a subset of the data for now.

In [6]:
train_df = pd.read_csv(os.path.join(folder, 'train.csv'), low_memory=False, nrows=10**5, 
                       dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             }
                      )
train_df

,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,0,1,3,1,,
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False
...,...,...,...,...,...,...,...,...,...,...
99995,99995,153647401,2078569,4334,0,275,3,0,6000.0,True
99996,99996,153692472,2078569,6436,0,276,3,0,9000.0,True
99997,99997,153722998,2078569,6446,0,277,2,1,21000.0,True
99998,99998,153759775,2078569,3715,0,278,3,0,12000.0,True
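
To see why the narrower datatypes matter, the sketch below compares the memory footprint of a small synthetic DataFrame before and after downcasting. The data is made up for illustration; only the column names and target dtypes mirror the `read_csv` call above.

```python
import numpy as np
import pandas as pd

n = 1000
# pandas typically defaults integer columns to int64 (8 bytes per value).
default_df = pd.DataFrame({
    'user_id': np.arange(n, dtype='int64'),
    'content_id': np.arange(n, dtype='int64') % 100,
    'answered_correctly': np.arange(n, dtype='int64') % 2,
})

# Downcast to the narrower dtypes used when loading train.csv.
compact_df = default_df.astype({
    'user_id': 'int32',
    'content_id': 'int16',
    'answered_correctly': 'int8',
})

default_bytes = int(default_df.memory_usage(index=False).sum())
compact_bytes = int(compact_df.memory_usage(index=False).sum())
print(default_bytes, compact_bytes)  # 24000 7000
```

Passing `dtype=` directly to `pd.read_csv`, as done above, avoids ever materializing the wider columns.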


### **`iter_test`** function

Generator which loops through each batch of questions in the test set. You have direct access to the example test rows for your convenience, but your code will only be able to get rows from the real test set via the API. Once you call **`predict`** you can continue on to the next batch.

Yields:
* While there are more batch(es) and `predict` was called successfully since the last yield, yields a tuple of:
    * `test_df`: DataFrame with the test features for the next batch, and user responses for the previous batch.
    * `sample_prediction_df`: DataFrame with an example prediction.  Intended to be filled in and passed back to the `predict` function.
* If `predict` has not been called successfully since the last yield, prints an error and yields `None`.

In [4]:
# You can only iterate through a result from `env.iter_test()` once
# so be careful not to lose it once you start iterating.
iter_test = env.iter_test()

Let's get the data for the first test batch and check it out.

In [5]:
(test_df, sample_prediction_df) = next(iter_test)
test_df

group_num,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,prior_question_elapsed_time,prior_question_had_explanation,prior_group_answers_correct,prior_group_responses
0,0,0,275030867,5729,0,0,,,[],[]
0,1,13309898705,554169193,12010,0,4427,19000.0,True,,
0,2,4213672059,1720860329,457,0,240,17000.0,True,,
0,3,62798072960,288641214,13262,0,266,23000.0,True,,
0,4,10585422061,1728340777,6119,0,162,72400.0,True,,
0,5,18020362258,1364159702,12023,0,4424,18000.0,True,,
0,6,2325432079,1521618396,574,0,1367,18000.0,True,,
0,7,39456940781,1317245193,12043,0,5314,17000.0,True,,
0,8,3460555189,1700555100,7910,0,532,21000.0,True,,
0,9,2214770464,998511398,7908,0,393,21000.0,True,,
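
The `prior_group_answers_correct` and `prior_group_responses` columns carry the previous batch's outcomes. The sketch below shows one way to turn them into Python lists, assuming the first row of each batch holds a string representation of a list (as the `[]` in the output above suggests) and the remaining rows are empty. The helper name `parse_prior_group` and the toy batch are hypothetical.

```python
import ast

import pandas as pd


def parse_prior_group(test_df: pd.DataFrame):
    """Return (answers_correct, responses) lists from the first row of a batch."""
    first_row = test_df.iloc[0]

    def _parse(cell):
        # Missing or non-string cells are treated as "no prior information".
        if not isinstance(cell, str):
            return []
        return list(ast.literal_eval(cell))

    return (
        _parse(first_row['prior_group_answers_correct']),
        _parse(first_row['prior_group_responses']),
    )


# Toy batch with made-up values, shaped like the output above.
toy_batch = pd.DataFrame({
    'row_id': [10, 11],
    'prior_group_answers_correct': ['[1, 0, 1]', None],
    'prior_group_responses': ['[3, 2, 0]', None],
})
answers, responses = parse_prior_group(toy_batch)
print(answers, responses)
```

These lists can then be matched against the question rows of the previous batch to update per-user or per-question statistics between batches.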


In [6]:
sample_prediction_df

group_num,row_id,answered_correctly
0,0,0.5
0,1,0.5
0,2,0.5
0,3,0.5
0,4,0.5
0,5,0.5
0,6,0.5
0,7,0.5
0,8,0.5
0,9,0.5


Note that we'll get an error if we try to continue on to the next batch without making our predictions for the current batch.

In [7]:
next(iter_test)

You must call `predict()` successfully before you can continue with `iter_test()`


### **`predict`** function
Stores your predictions for the current batch.  Expects the same format as you saw in `sample_prediction_df` returned from the `iter_test` generator.

Args:
* `predictions_df`: DataFrame which must have the same format as `sample_prediction_df`.

This function will raise an Exception if not called after a successful iteration of the `iter_test` generator.

Let's make a dummy prediction using the sample provided by `iter_test`.

In [8]:
env.predict(sample_prediction_df)
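
For local experiments outside Kaggle, it can help to have a stand-in that enforces the same "predict before the next batch" contract. The `MockEnv` class below is a hypothetical sketch written for this purpose. It is not the real `riiideducation` implementation and only imitates the behaviour described above.

```python
import pandas as pd


class MockEnv:
    """Hypothetical local stand-in for the competition environment."""

    def __init__(self, batches):
        self._batches = batches
        self._awaiting_predict = False
        self.predictions = []

    def iter_test(self):
        for test_df in self._batches:
            sample_prediction_df = test_df[['row_id']].assign(answered_correctly=0.5)
            self._awaiting_predict = True
            yield test_df, sample_prediction_df
            # Refuse to advance until predict() has been called for this batch.
            while self._awaiting_predict:
                print('You must call `predict()` successfully before '
                      'you can continue with `iter_test()`')
                yield None

    def predict(self, predictions_df):
        if not self._awaiting_predict:
            raise Exception('Call `iter_test()` for a new batch before `predict()`')
        self.predictions.append(predictions_df)
        self._awaiting_predict = False


mock_env = MockEnv([pd.DataFrame({'row_id': [0, 1]}), pd.DataFrame({'row_id': [2]})])
mock_iter = mock_env.iter_test()

test_df, sample_prediction_df = next(mock_iter)
skipped = next(mock_iter)            # no predict() yet: prints the error, yields None
mock_env.predict(sample_prediction_df)

for test_df, sample_prediction_df in mock_iter:
    mock_env.predict(sample_prediction_df)

submission = pd.concat(mock_env.predictions, ignore_index=True)
```

Swapping `mock_env` for the real `env` should then require no changes to the prediction loop itself.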

## Main Loop
Let's loop through all the remaining batches in the test set generator and make the default prediction for each.  The `iter_test` generator will simply stop returning values once you've reached the end.

When writing your own Notebooks, be sure to write robust code that makes as few assumptions about the `iter_test`/`predict` loop as possible.  For example, the test set contains question IDs that have not been previously observed in train.

You may assume that the structure of `sample_prediction_df` will not change in this competition.

**The lecture rows in `test_df` should not be submitted.**

In [9]:
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
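
A slightly more robust batch prediction might look like the sketch below. It drops lecture rows and falls back to a global mean for question IDs it has not seen. The function `predict_batch`, the lookup dictionary, and the numbers are all hypothetical and only illustrate the pattern.

```python
import pandas as pd


def predict_batch(test_df, per_question_mean, global_mean):
    """Hypothetical helper: build a prediction frame for one batch."""
    # Keep only question rows; lecture rows must not be submitted.
    is_question = test_df['content_type_id'] == 0
    questions = test_df.loc[is_question, ['row_id', 'content_id']].copy()
    # Unseen question IDs map to NaN, then fall back to the global mean.
    questions['answered_correctly'] = (
        questions['content_id'].map(per_question_mean).fillna(global_mean)
    )
    return questions[['row_id', 'answered_correctly']]


# Toy batch: one known question, one unseen question, one lecture row.
toy_test = pd.DataFrame({
    'row_id': [0, 1, 2],
    'content_id': [1, 999, 5],
    'content_type_id': [0, 0, 1],
})
preds = predict_batch(toy_test, per_question_mean={1: 0.8}, global_mean=0.65)
print(preds)
```

Inside the real loop, the result of such a helper would be passed to `env.predict(...)` in place of the constant 0.5 prediction.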

## Restart the Notebook to run your code again
In order to combat cheating, you are only allowed to call `make_env` or iterate through `iter_test` once per Notebook run.  However, while you're iterating on your model it's reasonable to try something out, change the model a bit, and try it again.  Unfortunately, if you simply re-run the code, or even refresh the browser page, you'll still be running in the same Notebook execution session as before, and the `riiideducation` module will still throw errors.  To get around this, you need to explicitly restart your Notebook execution session, which you can do by **clicking "Run"->"Restart Session"** in the Notebook Editor's menu bar at the top.