In [None]:
import pandas as pd

# Competition data directory on Kaggle
path = '../input/riiid-test-answer-prediction/'
question = pd.read_csv(path + 'questions.csv')

Since the organizer of this competition has announced the source of the dataset (EdNet), I'm going to discuss some of the domain-specific characteristics of the data.

According to the [EdNet paper](https://arxiv.org/pdf/1912.03072.pdf), EdNet consists of data collected by Santa, an AI tutoring service designed to help students prepare for the TOEIC Listening and Reading test. TOEIC is a well-known test in South Korea for assessing students' English proficiency.

TOEIC consists of 7 parts, which is why there are 7 unique values in the `part` column of `questions.csv`.

In [None]:
question['part'].unique()

On TOEIC, each part can be described as follows.

**Listening** - test takers have to listen to spoken English and answer questions

- Part 1: Photographs - select a spoken sentence that best describes the given picture
- Part 2: Question-Response - select the appropriate response to the spoken question
- Part 3: Conversations - select the appropriate answers to questions related to a spoken conversation between 2 or 3 people
- Part 4: Talks - select the appropriate answers to questions related to a spoken short talk

**Reading**  - test takers have to read a sentence or a passage to answer questions

- Part 5: Incomplete Sentences - select the correct word to fill the blank in a sentence so that it becomes syntactically or semantically complete
- Part 6: Text Completion - select the correct answer to fill the blank in a passage so that it becomes syntactically and semantically complete
- Part 7: Reading Comprehension - select the correct answers to questions related to a given passage
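
The part list above can be turned into a small lookup table. This is only a sketch: it assumes the values in the `part` column match the TOEIC part numbers, and the `TOEIC_PARTS` dict, the `toy` frame, and the `section` / `part_name` columns are hypothetical names with made-up rows, not part of the competition data.

```python
import pandas as pd

# Hypothetical lookup: part number -> (section, part name), taken from the list above.
TOEIC_PARTS = {
    1: ("Listening", "Photographs"),
    2: ("Listening", "Question-Response"),
    3: ("Listening", "Conversations"),
    4: ("Listening", "Talks"),
    5: ("Reading", "Incomplete Sentences"),
    6: ("Reading", "Text Completion"),
    7: ("Reading", "Reading Comprehension"),
}

# Toy stand-in for questions.csv (rows invented for illustration only).
toy = pd.DataFrame({"question_id": [0, 1, 2, 3], "part": [1, 5, 7, 3]})

# Attach the section and the part name to every question.
toy["section"] = toy["part"].map(lambda p: TOEIC_PARTS[p][0])
toy["part_name"] = toy["part"].map(lambda p: TOEIC_PARTS[p][1])
print(toy)
```

The same two `map` calls should work on the real `question` frame, provided the assumption about part numbering holds.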

I'm also assuming that the numbers in the `part` column correspond one-to-one to the TOEIC part numbers, because they show the same characteristics.

Reason 1: On TOEIC, Part 2 has only three answer options. Part 2 in `questions.csv` also has only three distinct values in the `correct_answer` column, namely 0, 1, and 3.

In [None]:
question.groupby('part')['correct_answer'].unique()
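
If only the number of distinct answers matters, `nunique()` gives a more compact view than `unique()`. The sketch below runs on a toy frame with invented rows (not the real `questions.csv`), just to show the shape of the result; `toy` and `n_options` are illustrative names.

```python
import pandas as pd

# Toy stand-in for questions.csv (rows invented for illustration only).
toy = pd.DataFrame({
    "part":           [1, 1, 1, 1, 2, 2, 2],
    "correct_answer": [0, 1, 2, 3, 0, 1, 3],
})

# Number of distinct correct answers observed in each part.
n_options = toy.groupby("part")["correct_answer"].nunique()
print(n_options)
```

On the real data, a part whose count is 3 rather than 4 would be the candidate for TOEIC Part 2.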

Reason 2: On TOEIC, each question in Parts 1, 2, and 5 stands alone, whereas the questions in Parts 3, 4, 6, and 7 come in bundles. In `questions.csv`, the part numbers show exactly the same pattern.

In [None]:
# 1 question for each bundle
question.query('part == 1').groupby('bundle_id').count().head()

In [None]:
# 1 question for each bundle
question.query('part == 2').groupby('bundle_id').count().head()

In [None]:
# many questions for each bundle
question.query('part == 3').groupby('bundle_id').count().head()

In [None]:
# many questions for each bundle
question.query('part == 4').groupby('bundle_id').count().head()

In [None]:
# 1 question for each bundle
question.query('part == 5').groupby('bundle_id').count().head()

In [None]:
# many questions for each bundle
question.query('part == 6').groupby('bundle_id').count().head()

In [None]:
# many questions for each bundle
question.query('part == 7').groupby('bundle_id').count().head()
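
The seven cells above can also be condensed into one summary of bundle sizes per part. This is a sketch on a toy frame with invented rows; it assumes the `question_id`, `bundle_id`, and `part` columns of `questions.csv`, and `bundle_size` and `summary` are illustrative names.

```python
import pandas as pd

# Toy stand-in for questions.csv (rows invented for illustration only).
toy = pd.DataFrame({
    "question_id": [0, 1, 2, 3, 4, 5, 6],
    "bundle_id":   [0, 1, 2, 2, 2, 5, 5],
    "part":        [1, 2, 3, 3, 3, 6, 6],
})

# Number of questions in each bundle, keyed by (part, bundle_id).
bundle_size = toy.groupby(["part", "bundle_id"])["question_id"].count()

# Smallest and largest bundle within each part.
summary = bundle_size.groupby(level="part").agg(["min", "max"])
print(summary)
```

A part whose `max` is 1 contains only standalone questions; a `max` above 1 means the part contains bundled questions.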

Some of my thoughts are listed below:
- Not sure if it is possible to use these TOEIC-specific characteristics to build a domain-specific model.
- Not sure whether such a model would perform well.
- Not sure whether the organizers and sponsors would like it, because they are probably expecting a domain-agnostic model.
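
One possible way to expose these characteristics to a model is as per-question features. The sketch below is only an illustration under the assumption that `part` follows the TOEIC numbering; `add_toeic_features` and the `is_listening`, `bundle_size`, and `is_bundled` columns are hypothetical names, and whether such features help is untested here.

```python
import pandas as pd

def add_toeic_features(questions: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `questions` with hypothetical TOEIC-derived features."""
    out = questions.copy()
    # Parts 1-4 are Listening, Parts 5-7 are Reading (assumes TOEIC numbering).
    out["is_listening"] = out["part"].le(4)
    # Number of questions sharing the same bundle_id.
    out["bundle_size"] = out.groupby("bundle_id")["question_id"].transform("count")
    # True when the question is presented together with other questions.
    out["is_bundled"] = out["bundle_size"] > 1
    return out

# Toy stand-in for questions.csv (rows invented for illustration only).
toy = pd.DataFrame({
    "question_id": [0, 1, 2, 3, 4, 5, 6],
    "bundle_id":   [0, 1, 2, 2, 2, 5, 5],
    "part":        [1, 2, 3, 3, 3, 6, 6],
})
features = add_toeic_features(toy)
print(features)
```

The resulting columns could then be joined to the interaction data on the question id.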

You can find more information on the TOEIC website:

- [TOEIC website](https://www.ets.org/toeic)
- [TOEIC sample test by ETS](https://www.ets.org/s/toeic/pdf/toeic-listening-reading-sample-test-updated.pdf)
