# Riiid! Answer Correctness Prediction

![Riiid](https://www.riiid.co/assets/opengraph.png)

A discussion thread with a good introduction to the competition: https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189409  
The questions come from the 7 parts of the TOEIC test (Test of English for International Communication).

### 7 Parts of a TOEIC test
https://www.ets.org/toeic/organizations/listening-reading/about/content-format/

### Listening

- Part 1: Photographs
- Part 2: Question-Response
- Part 3: Conversations
- Part 4: Short Talks

### Reading

- Part 5: Incomplete Sentences
- Part 6: Error Recognition or Text Completion
- Part 7: Reading Comprehension


# Data description

Modified with changes from the competition hosts.

### **train.csv**

- `row_id`: (int64) ID code for the row.
- `timestamp`: (int64) the time in milliseconds between this user interaction and the first event completion from that user. This is the timestamp of when the user started answering the question.
- `user_id`: (int32) ID code for the user.
- `content_id`: (int16) ID code for the user interaction
- `content_type_id`: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
- `task_container_id`: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.
- `user_answer`: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.
- `answered_correctly`: (int8) if the user responded correctly. Read -1 as null, for lectures.
- `prior_question_elapsed_time`: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture.
   CORRECTION from the hosts: the value is the average time per question in the previous bundle, not the total time for the whole bundle.
   
> **Suppose a user spent 60 seconds answering their first question at 10:00. The timestamp for that row would read 10:00 (we'll skip the normalization for simplicity), and the prior_question_elapsed_time would be null since it's the first question. If they took 30 seconds to answer their next question at 11:00 the second row's timestamp would be 11:00 and the prior_question_elapsed_time would be 60 seconds.**
   
- `prior_question_had_explanation`: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
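
The quoted example above can be reproduced on a toy frame. This is a minimal sketch that assumes single-question bundles; the `elapsed_ms` column is hypothetical (train.csv does not store it directly) and all numbers are made up to match the 10:00 / 11:00 example.

```python
import pandas as pd

# Hypothetical toy log: one user, two single-question bundles.
toy = pd.DataFrame({
    "user_id": [1, 1],
    "timestamp": [0, 3_600_000],    # ms since the user's first event (10:00 and 11:00)
    "elapsed_ms": [60_000, 30_000], # hypothetical: time spent on each question
})

# The value reported on a row is the time spent on the *previous* bundle,
# so it is the per-user elapsed time shifted down by one row.
toy["prior_question_elapsed_time"] = toy.groupby("user_id")["elapsed_ms"].shift(1)
print(toy)
```

The first row has no previous bundle, so its value is null; the second row carries the 60 seconds spent on the first question.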

### **questions.csv: metadata for the questions posed to users.**

- `question_id`: foreign key for the train/test content_id column, when the content type is question (0).
- `bundle_id`: code for which questions are served together.
- `correct_answer`: the answer to the question. Can be compared with the train user_answer column to check if the user was right.
- `part`: the relevant section of the TOEIC test.
     - **What is the TOEIC test?** The Test of English for International Communication is an international standardized test of English language proficiency for non-native speakers. Its 7 parts are listed above.
- `tags`: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.

### **lectures.csv: metadata for the lectures watched by users as they progress in their education.**

- `lecture_id`: foreign key for the train/test content_id column, when the content type is lecture (1).
- `part`: top level category code for the lecture.
- `tag`: a single tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.
- `type_of`: brief description of the core purpose of the lecture

### **example_test_rows.csv** 

Three sample groups of the test set data as it will be delivered by the time-series API. The format is largely the same as train.csv. There are two different columns that mirror what information the AI tutor actually has available at any given time, but with the user interactions grouped together for the sake of API performance rather than strictly showing information for a single user at a time. Some questions will appear in the hidden test set that have NOT been presented in the train set, emulating the challenge of quickly adapting to modeling newly introduced questions. Their metadata is still in questions.csv as usual.

`prior_group_responses` (string) provides all of the user_answer entries for the previous group, as a string representation of a list, in the first row of the group. All other rows in each group are null. If you are using Python, you will likely want to call `eval` (or the safer `ast.literal_eval`) on the non-null rows. Some rows may be null, or empty lists.

`prior_group_answers_correct` (string) provides all of the answered_correctly values for the previous group, with the same format and caveats as prior_group_responses. Some rows may be null, or empty lists.
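
Parsing these two columns could look like the sketch below. It uses `ast.literal_eval` rather than `eval`, and the `group` frame and the `parse_list` helper are hypothetical stand-ins for one batch returned by the API.

```python
import ast
import pandas as pd

# Hypothetical batch: only the first row of a group carries the lists.
group = pd.DataFrame({
    "row_id": [10, 11, 12],
    "prior_group_responses": ["[3, 0, 1]", None, None],
    "prior_group_answers_correct": ["[1, 0, 1]", None, None],
})

def parse_list(value):
    """Turn the string form of a list into a Python list; null becomes []."""
    if pd.isna(value):
        return []
    return list(ast.literal_eval(value))

prior_responses = parse_list(group["prior_group_responses"].iloc[0])
prior_correct = parse_list(group["prior_group_answers_correct"].iloc[0])
print(prior_responses, prior_correct)
```

Returning an empty list for null rows keeps downstream code free of special cases for the very first group.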

# EDA - Q/A index

I find it easier to do EDA when I break it down into specific questions.  
The answers to a few specific questions will help in discovering features.

## **train.csv**

- [Basic EDA](#train_eda)

- [How many users' data are we given?](#q1)
- [How do users interact with different content-types?](#qd)
- [How many questions did each user answer?](#q2)
- [How many lectures did each student see?](#q3)
- [What's the correlation between questions count and number of lectures?](#q4)
- [How long do users use the App?](#qtt)

## **Questions.csv**

- [How many questions are there in train data, and in questions meta-data?](#q5)
- [How are questions split into each part of the TOEIC test?](#q6)
- [How are questions split into bundles?](#q7)
- [What is the probability of getting a question right?](#q8)
- [What is the probability of a user getting any question right?](#q9)
- [Should we predict answers for the same students in train? Will new students be added in test data?](#q10)
- [Do students reattempt questions?](#q11)
- [How are questions split into tags?](#q12)

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
# from https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # a column's precision is fixed by its dtype, so read it once instead of per element
                c_prec = np.finfo(df[col].dtype).precision
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max and c_prec == np.finfo(np.float16).precision:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max and c_prec == np.finfo(np.float32).precision:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
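
An alternative to downcasting after loading is to pass the dtypes from the data description straight to the reader. The sketch below assumes plain `pd.read_csv` rather than datatable, and the two CSV rows are made-up sample data; `"boolean"` is pandas' nullable boolean dtype, used here because `prior_question_had_explanation` contains nulls.

```python
import io
import pandas as pd

# dtypes as listed in the data description above
train_dtypes = {
    "row_id": "int64", "timestamp": "int64", "user_id": "int32",
    "content_id": "int16", "content_type_id": "int8",
    "task_container_id": "int16", "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

# Made-up two-row sample standing in for train.csv
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)
sample = pd.read_csv(sample_csv, dtype=train_dtypes)
print(sample.dtypes)
```

Declaring the dtypes up front avoids ever holding the wider default columns in memory.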

In [None]:
# plotly helpers

# plots a histogram and a box plot of a column in dataframe
def distribution_plot(df, column, min_quantile=0, max_quantile=1):
    display(pd.DataFrame(df[column].describe(percentiles=np.arange(.1, 1, .1))).T)
    min_value = df[column].quantile(min_quantile)
    max_value = df[column].quantile(max_quantile)
    df = df[(df[column] >= min_value) & (df[column] <= max_value)]
    fig = make_subplots(rows=1, cols=2)
    fig.add_trace(go.Histogram(x=df[column], nbinsx=100), row=1, col=1)
    fig.add_trace(go.Box(y=df[column], orientation='v', name=column), row=1, col=2)
    fig.update_layout(title_text=f'{column} Distribution | min quantile {min_quantile} | max quantile {max_quantile}', showlegend=False)
    fig.show()
    
def p_line(y, x=None, title=None):
    if x is None:
        if hasattr(y, 'index'):
            x = y.index
        else:
            x = list(range(0, len(y)))
    fig = go.Figure(data=go.Scatter(x=x, y=y))
    if title is not None:
        fig.update_layout(title_text=title)
    fig.show()

In [None]:
data_dir = Path('../input/riiid-test-answer-prediction')

In [None]:
# import data files
import datatable as dt

train = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()
print(train.shape)
train = reduce_mem_usage(train)
questions = pd.read_csv(data_dir/'questions.csv')
lectures = pd.read_csv(data_dir/'lectures.csv')

In [None]:
example_test = pd.read_csv(data_dir/'example_test.csv')

In [None]:
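# question ids in the example test set that are missing from the questions metadata (should be empty)
set(example_test.loc[example_test['content_type_id'] == 0, 'content_id']) - set(questions['question_id'])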


<a id='train_eda'/>



# Train.csv
## Basic EDA

In [None]:
train.head()

In [None]:
pd.DataFrame(train.isnull().sum(), columns=['null_count'])

<a id='q1'/>

## How many users' data are we given?


In [None]:
print(f"Number of users in train - {train['user_id'].nunique()}")

<a id='qd'/>

## How do users interact with different content-types?

- 0: Questions
- 1: Lectures

In [None]:
content_count = pd.DataFrame(train['content_type_id'].value_counts()).reset_index()
content_count.columns = ['type', 'count']
content_count['type'] = content_count.type.replace({0: 'questions', 1: 'lectures'})
px.pie(content_count, values='count', names='type')

<a id='q2'/>

## How many questions did each user answer?

In [None]:
user_interactions_count = train[['row_id','user_id', 'content_type_id']].groupby(['user_id', 'content_type_id'], as_index=False).count()
user_interactions_count = user_interactions_count.rename(columns={'row_id': 'count'})
user_interactions_count = user_interactions_count.pivot(index='user_id', columns='content_type_id', values=['count'])
user_interactions_count = user_interactions_count.fillna(0)
user_interactions_count.columns = ['questions_count', 'lectures_count']

In [None]:
distribution_plot(user_interactions_count,'questions_count', min_quantile=0, max_quantile=.9)

#### What are the 10 most common numbers of answered questions per user?

In [None]:
qc_count = user_interactions_count['questions_count'].value_counts(normalize=True)
qc_count.index = pd.Series(qc_count.index).apply(lambda x: f'count_{x}')
p_line(qc_count.head(10), title='percentage of users vs questions count')

- 15% of the users have answered exactly 30 questions
- There are spikes in the number of questions. This is worth exploring more. There might be specific usage patterns in the app

<a id='q3'/>

## How many lectures did each student see?

In [None]:
distribution_plot(user_interactions_count,'lectures_count', min_quantile=0, max_quantile=.95)

#### Lectures count cumulative sum

In [None]:
p_line(user_interactions_count['lectures_count'].value_counts(normalize=True).cumsum(), title='User percentage cumsum vs number of lectures')

- This is unexpected: 62% of the users saw no lectures. So a large number of users are here for the questions.

<a id='q4'/>

## What's the correlation between the number of questions each user has answered and the number of lectures they have seen?

In [None]:
user_interactions_count.corr()

In [None]:
user_interactions_count['questions_lectures'] = user_interactions_count['questions_count'].astype(str) + '_' + user_interactions_count['lectures_count'].astype(str)

#### Top 20 combinations of questions count and lectures count

In [None]:
p_line(user_interactions_count['questions_lectures'].value_counts(normalize=True).head(20), title='Percentage of users vs questions_lectures combo count')

In [None]:
user_interactions_count = user_interactions_count.sort_values('questions_count')
user_interactions_count['user_index'] = range(0, len(user_interactions_count))

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
    go.Scatter(x=user_interactions_count['user_index'], y=user_interactions_count['questions_count'], name="questions_count"),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(x=user_interactions_count['user_index'], y=user_interactions_count['lectures_count'], name="lectures_count", opacity=.5),
    secondary_y=True,
)
fig.update_xaxes(title_text="user index")
fig.update_yaxes(title_text="number of Questions", secondary_y=False)
fig.update_yaxes(title_text="number of Lectures", secondary_y=True)
fig.show()

In [None]:
user_interactions_count = user_interactions_count.sort_values('lectures_count')
user_interactions_count['user_index'] = range(0, len(user_interactions_count))

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
    go.Scatter(x=user_interactions_count['user_index'], y=user_interactions_count['questions_count'], name="questions_count", opacity=.5),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(x=user_interactions_count['user_index'], y=user_interactions_count['lectures_count'], name="lectures_count"),
    secondary_y=True,
)
fig.update_xaxes(title_text="User index")
fig.update_yaxes(title_text="number of Questions", secondary_y=False)
fig.update_yaxes(title_text="number of Lectures", secondary_y=True)
fig.show()

- The correlation between the number of questions answered and the number of lectures seen by a user is about 0.8
- There is also huge variability: there are users with 14K questions but no lectures, and users with 150 lectures but only 1000 questions  
  This can be taken as the variability in user behaviour in the app: some want the app for questions, some want it for the lectures.
  But in general, a person who has watched more lectures should have attempted more questions

<a id='qtt'/>

## How long do users use the App?

#### Checking if each user's timestamps are monotonically increasing

In [None]:
pd.Series.is_monotonic_increasing

In [None]:
is_increasing = train[['timestamp', 'user_id']].groupby('user_id').agg(lambda x: x.is_monotonic_increasing)
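
The result of the check is not summarised above. A minimal sketch of how it could be summarised, and how usage length could be derived from the last timestamp, is shown here on a made-up `toy` frame; the conversion to days assumes timestamps are in milliseconds.

```python
import pandas as pd

# Made-up interaction log for two users
toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "timestamp": [0, 10, 10, 0, 172_800_000],
})

# Same check as above: is_monotonic_increasing allows repeated values
is_increasing = toy.groupby("user_id")["timestamp"].agg(lambda s: s.is_monotonic_increasing)
n_violations = int((~is_increasing).sum())
print(f"users with out-of-order timestamps: {n_violations}")

# Timestamps start at 0 for every user, so the max is the span of usage
MS_PER_DAY = 1000 * 60 * 60 * 24
usage_days = toy.groupby("user_id")["timestamp"].max() / MS_PER_DAY
print(usage_days)
```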


# Questions.csv

## Basic EDA

In [None]:
questions.head()

In [None]:
pd.DataFrame(questions.isnull().sum(), columns=['null_count'])

#### One question without any tag :P

<a id='q5' />

## How many questions are there in train data, and in questions meta-data?

In [None]:
print(f'Number of unique questions in train data - {train.loc[train["content_type_id"] == 0, "content_id"].nunique()}')
print(f'Number of unique questions in questions metadata - {questions["question_id"].nunique()}')

In [None]:
q_not_in_train_data = set(questions['question_id']) - set(train.loc[train['content_type_id'] == 0, 'content_id'])
print(f'Questions in Metadata, but not in train data - {q_not_in_train_data}')

q_not_in_metadata = set(train.loc[train['content_type_id'] == 0, 'content_id']) - set(questions['question_id'])
print(f'Questions in train, but not in metadata - {q_not_in_metadata}')

 - That's weird, from the description I got that the test set might have a few questions that are in questions meta-data but not in train data.
 - But all the questions in meta-data are in train data
 - TODO - this will impact feature engineering, get this clarified

## 6. How are questions split into each part of the TOEIC test?
<a id='q6'/>

In [None]:
questions.head()

In [None]:
questions['part'].unique()

In [None]:
parts = questions[['question_id', 'part']].groupby('part', as_index=False).count().rename(columns={'question_id': 'number_of_questions'})

In [None]:
px.bar(parts, x='part', y='number_of_questions')

- Part 5: Incomplete Sentences has a large number of unique questions

## 7. How are questions split into bundles?
<a id='q7'/>

In [None]:
questions['bundle_id'].nunique()

In [None]:
bundle_count = questions[['question_id', 'bundle_id']].groupby('bundle_id', as_index=False).count().rename(columns={'question_id': 'number_of_questions'})

In [None]:
distribution_plot(bundle_count, 'number_of_questions')

- Most questions are served on their own, in a bundle of size 1

## 8. What is the probability of getting a question right?
<a id='q8'/>

In [None]:
question_answers = train.loc[train['content_type_id'] == 0, ['content_id', 'answered_correctly']].groupby('content_id', as_index=False).mean()
distribution_plot(question_answers, 'answered_correctly')

- Roughly bell-shaped, with the bulk shifted to the right (towards high accuracy): smart students, or easy questions

## 9. What is the probability of a user getting any question right?
<a id='q9'/>

In [None]:
user_answers = train.loc[train['content_type_id'] == 0, ['user_id', 'answered_correctly']].groupby('user_id', as_index=False).mean()
distribution_plot(user_answers, 'answered_correctly')

In [None]:
px.histogram(user_answers, 'answered_correctly', nbins=100)

- Roughly bell-shaped, with the bulk shifted to the right (towards high accuracy)
- Does the peak have any meaning? I don't think so

## 10. Should we predict answers for the same students in train? Will new students be added in test data?
<a id='q10'/>

In [None]:
test = pd.read_csv(data_dir/'example_test.csv')
test.shape

In [None]:
set(test['user_id']) - set(train['user_id'])

- There is one user in the example test set who is not in train, so I'm assuming the test data will contain new users.
- TODO: check if this is clearly stated somewhere. Done: there will be new students

## 11. Do students reattempt questions? 
<a id='q11'/>

In [None]:
user_question_count = train.loc[train['content_type_id'] == 0, ['row_id', 'user_id', 'content_id']].groupby(['user_id', 'content_id']).count()

In [None]:
p_line(user_question_count['row_id'].value_counts(normalize=True))

- 89% of the (user, question) pairs are attempted only once
- A good number of questions are reattempted up to 4 times
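
The share of pairs by number of attempts can be checked on a small example. This is a sketch on a made-up `toy` frame; it uses `groupby(...).size()` instead of counting `row_id`, which gives the same counts.

```python
import pandas as pd

# Made-up question attempts: user 1 answers question 10 twice
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "content_id": [10, 10, 11, 10, 12],
})

# Number of attempts per (user, question) pair
attempts = toy.groupby(["user_id", "content_id"]).size()

# Fraction of pairs with 1 attempt, 2 attempts, ...
share_by_attempts = attempts.value_counts(normalize=True).sort_index()
print(share_by_attempts)
```

A per-user, per-question attempt counter of this kind is also a natural candidate feature, since a reattempted question is probably easier the second time.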

## 12. How are questions split into tags?
<a id='q12'/>

In [None]:
questions.isnull().sum()

In [None]:
print(f'{questions["tags"].nunique()} unique combinations')
questions['tags'].value_counts()

In [None]:
questions['n_tags'] = questions['tags'].fillna("").apply(lambda x: len(x.split()))

In [None]:
all_tags = pd.Series(np.concatenate(questions['tags'].fillna("").apply(lambda x: x.split()).values)).apply(lambda x: f'tag_{x}')

In [None]:
all_tags.nunique()

In [None]:
# this counts questions in the questions metadata, not interactions in train data
p_line(all_tags.value_counts(), all_tags.value_counts().index, title='Unique question tag vs question count')

- 40% of the questions have a single tag associated with them
- A question can have a maximum of 6 tags
- There are 188 unique tags, arranged into 1519 combinations
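
The tag counts behind these numbers can also be computed with pandas string methods and `explode`. The sketch below runs on a made-up `toy_questions` frame with space-separated tags, the same layout as the `tags` column.

```python
import pandas as pd

# Made-up metadata: one question has no tag at all
toy_questions = pd.DataFrame({
    "question_id": [0, 1, 2, 3],
    "tags": ["51 131 162", "131", None, "51 38"],
})

# Split each tag string into a list; missing tags become an empty list
tag_lists = toy_questions["tags"].fillna("").str.split()

# Number of tags per question
n_tags = tag_lists.str.len()

# One row per (question, tag), then count questions per tag
tag_counts = tag_lists.explode().dropna().value_counts()
print(n_tags.tolist())
print(tag_counts)
```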

## 13. How much time does the user take to answer a question?

## 14. How much time does a user spend on the App?

# EDA - To Be Continued.....

# Baseline submission

In [None]:
question_answers = question_answers.rename(columns={'answered_correctly': 'question_score'})
user_answers = user_answers.rename(columns={'answered_correctly': 'user_score'})

In [None]:
question_answers['content_type_id'] = 0
user_answers['content_type_id'] = 0

In [None]:
import riiideducation
env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
def harmonic_mean(a, b):
    return (2 * a * b) / (a + b + 1e-6)

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    # attach the per-question and per-user mean accuracy computed on train
    test_df = test_df.merge(question_answers, on=['content_id', 'content_type_id'], how='left')
    test_df = test_df.merge(user_answers, on=['user_id', 'content_type_id'], how='left')
    # unseen questions and users (and lecture rows) fall back to a neutral 0.5
    test_df['question_score'] = test_df['question_score'].fillna(.5)
    test_df['user_score'] = test_df['user_score'].fillna(.5)
    # harmonic_mean works element-wise on Series, so no row-wise apply is needed
    test_df['answered_correctly'] = harmonic_mean(test_df['question_score'], test_df['user_score'])
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
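
The `riiideducation` API only runs inside the competition environment, so the scoring logic can be sanity-checked offline first. This is a sketch on made-up frames (`question_scores`, `user_scores`, and `batch` are hypothetical stand-ins for the train aggregates and one API batch).

```python
import pandas as pd

def harmonic_mean(a, b):
    # the small constant avoids division by zero when both scores are 0
    return (2 * a * b) / (a + b + 1e-6)

# Made-up aggregates and one made-up test batch
question_scores = pd.DataFrame({"content_id": [10, 11], "question_score": [0.8, 0.4]})
user_scores = pd.DataFrame({"user_id": [1], "user_score": [0.6]})
batch = pd.DataFrame({"row_id": [0, 1, 2], "user_id": [1, 1, 2], "content_id": [10, 11, 99]})

batch = (
    batch.merge(question_scores, on="content_id", how="left")
         .merge(user_scores, on="user_id", how="left")
)
# unseen user 2 and unseen question 99 get the neutral 0.5
score_cols = ["question_score", "user_score"]
batch[score_cols] = batch[score_cols].fillna(0.5)
batch["answered_correctly"] = harmonic_mean(batch["question_score"], batch["user_score"])
print(batch[["row_id", "answered_correctly"]])
```

The harmonic mean is dominated by the smaller of the two scores, so a weak user on an easy question (or a strong user on a hard one) still gets a cautious prediction.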