# Riiid Challenge

# Introduction

![](https://media-exp1.licdn.com/dms/image/C511BAQFdNFjlEdfIVw/company-background_10000/0?e=2159024400&v=beta&t=1nlr0BJH9o8ihnnW7a5Gee0v3IM08hgUrLPNoUp0Ko8)

### About Riiid:
Riiid is a leading global provider of AI tutoring solutions, delivering creative disruption to the education market through its cutting-edge AI technology.

### Problem:

In 2018, 260 million children weren't attending school. At the same time, more than half of these young students didn't meet minimum reading and math standards. Education was already in a tough place when COVID-19 forced most countries to temporarily close schools. This further delayed learning opportunities and intellectual development. The equity gaps in every country could grow wider. We need to re-think the current education system in terms of attendance, engagement, and individualized attention.

### Solution:

Riiid Labs, an AI solutions provider delivering creative disruption to the education market, empowers global education players to rethink traditional ways of learning leveraging AI. With a strong belief in equal opportunity in education, Riiid launched an AI tutor based on deep-learning algorithms in 2017 that attracted more than one million South Korean students.

### Translating the Problem Into Machine Learning

### Goals:

- In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.

#### *Since the target is whether a question is answered correctly, this is a binary classification problem.*

### Limitations:

- The dataset is large, and Kaggle notebooks impose computational limits such as memory.
- Loading and processing the data takes a long time for larger batches, so we must stay within the given run-time limits.
- Traditional validation methods might not suit the form of the data, which consists of per-user interaction sequences ordered in time.

### Performance Metric:

- Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
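
To make the metric concrete, here is a minimal sketch of ROC AUC read as a ranking statistic: the share of (correct, incorrect) pairs in which the correctly answered interaction received the higher predicted probability, with ties counted as half. `pairwise_auc` is a hypothetical helper written only for illustration and is not part of the competition tooling; in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# toy example: 1 = answered correctly, 0 = answered incorrectly
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, probs)
print(auc)  # 0.75
```

Because only the ordering of the predictions matters, the metric does not change if all probabilities are rescaled monotonically.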

## Getting Things Ready

Here we install and load our libraries and set some default settings for future use.

In [None]:
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

In [None]:
# importing basic libraries for eda

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# fast csv reading

import datatable as dt

# memory management and warnings

import gc
import warnings
warnings.filterwarnings('ignore')

In [None]:
# styling settings

plt.rcParams['figure.figsize'] = [18,10]
plt.style.use('ggplot')

# Loading the Data

The data we are given is huge, and loading all of it into RAM can easily run out of memory. To reduce memory usage we set a dtype for each column. By default numeric columns are loaded as 64-bit types (int64/float64), but we can choose smaller dtypes manually based on the range of values in each column. The selection is based on:

>int8 / uint8 : 1 byte of memory, range -128 to 127 or 0 to 255,

>bool : 1 byte, true or false,

>float16 / int16 / uint16 : 2 bytes of memory, integer range -32768 to 32767 or 0 to 65535,

>float32 / int32 / uint32 : 4 bytes of memory, integer range -2147483648 to 2147483647 or 0 to 4294967295,

>float64 / int64 / uint64 : 8 bytes of memory.

[Source](https://medium.com/@vincentteyssier/optimizing-the-size-of-a-pandas-dataframe-for-low-memory-environment-5f07db3d72e)
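
As a rough illustration of the idea, the sketch below lets pandas pick the smallest dtype that still holds every value in a column, using `pd.to_numeric` with its `downcast` argument. The helper name `downcast_numeric` and the toy values are assumptions made for this example; the notebook itself uses a handpicked dtype dictionary instead.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with numeric columns cast to the smallest dtype that fits."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # pandas never goes below float32 when downcasting floats
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out


# toy frame with the default 64-bit dtypes
demo = pd.DataFrame({
    'content_type_id': np.array([0, 1, 0, 0], dtype='int64'),
    'content_id': np.array([5692, 32736, 128, 7900], dtype='int64'),
    'elapsed': np.array([1000.0, 25000.0, 37000.0, 19000.0], dtype='float64'),
})

small = downcast_numeric(demo)
print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), '->', small.memory_usage(deep=True).sum())
```

Automatic downcasting only looks at the values currently in the frame, so a handpicked dictionary is safer when later batches may contain larger values.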

We're also going to use **datatable** for faster loading. You can find a deeper explanation [here](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets).



In [None]:
# dict for dtypes

data_types = {
    'row_id': 'int32',
    'timestamp': 'int64',
    'user_id': 'int64',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',
    'prior_question_had_explanation': 'boolean'
}

In [None]:
# loading data with datatable and converting it to pandas df.

train_df = dt.fread('../input/riiid-test-answer-prediction/train.csv').to_pandas()

# randomly selecting portion of the data for faster processing.

train_df = train_df.sample(len(train_df)//5,random_state=42)


# setting dtypes for each column

for column, d_type in data_types.items():
    train_df[column] = train_df[column].astype(d_type) 

In [None]:
# loading other data files

questions_df = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
lectures_df = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
test_df = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv')

#### Using handpicked dtypes for each column decreases memory usage, as the summary below shows.

In [None]:
# dtypes and memory usage

train_df.info()

In [None]:
# previewing the first rows

train_df.head()

# Timestamp 

#### "timestamp": The time in milliseconds between this user interaction and the first event completion from that user.

#### Here we can see that interactions are concentrated at small timestamps, i.e. early in each user's history, which is expected since every user starts at zero and fewer users stay for long. We can also notice that some users have stayed active for quite a long time!
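
Since `timestamp` counts milliseconds from each user's first event, dividing by 3,600,000 converts it to hours, and the largest timestamp per user gives the span between that user's first and last event. A minimal sketch on made-up rows (the toy values and the `MS_PER_HOUR` name are assumptions for illustration):

```python
import pandas as pd

MS_PER_HOUR = 1000 * 3600

toy = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'timestamp': [0, 1_800_000, 7_200_000, 0, 900_000],
})

# elapsed time since each user's first event, expressed in hours
toy['hours_since_start'] = toy['timestamp'] / MS_PER_HOUR

# the largest timestamp per user is the span between first and last event
span_hours = toy.groupby('user_id')['timestamp'].max() / MS_PER_HOUR
print(span_hours)
```

Note that this span is elapsed time on the platform, not time actively spent studying.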

In [None]:
# plotting timestamp related graphs

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(32,14))

sns.distplot(train_df.timestamp, kde=False,hist_kws={
                 'rwidth': 0.85,
                 'edgecolor': 'black',
                 'alpha': 0.8}, bins=100, ax=ax[0])

ax[0].set_xlabel('Time in Milliseconds')
ax[0].set_ylabel('Count')
ax[0].set_title('Timestamp Distribution', weight='bold')


sns.distplot(train_df.groupby('user_id').agg({'timestamp': 'mean'}), kde=False, hist_kws={
                 'rwidth': 0.85,
                 'edgecolor': 'black',
                 'alpha': 0.8}, bins=50,ax=ax[1])

ax[1].set_xlabel('Time in Milliseconds*')
ax[1].set_ylabel('Count')
ax[1].set_title('Mean Timestamp Per User Distribution', weight='bold')
plt.show()

In [None]:
train_df.sort_values("timestamp").head()


In [None]:
train_df.head()


In [None]:
train_df["lecture"] = (train_df["answered_correctly"]==-1)

In [None]:
train_df[train_df.lecture]

In [None]:
# plotting the performance of students based on counts of questions and lectures that they have seen
min_lectures = [i*5 for i in range(11)]
fig, ax = plt.subplots(ncols=1, nrows=10, figsize=(32,140))
train_df['counts']=1
train_df["lecture"] = (train_df["answered_correctly"]==-1)
train_df['lecture'] = train_df['lecture'].astype(np.int8)
train_lectures = train_df.groupby('user_id').agg({'lecture' : 'sum'})
train_df_serious_students = train_df[train_df.answered_correctly >= 0].groupby('user_id').agg({'counts': 'sum','answered_correctly': 'mean'})
train_df_serious_students = train_df_serious_students.join(train_lectures, on = 'user_id', how ='left')

for i in range(10) : 
    plot_df = train_df_serious_students[(train_df_serious_students.counts >= 10) & (train_df_serious_students.lecture >=min_lectures[i])  & (train_df_serious_students.lecture <min_lectures[i+1])]
    sns.distplot(plot_df['answered_correctly'], kde=True, hist_kws={
                     'rwidth': 0.85,
                     'edgecolor': 'black',
                     'alpha': 0.8}, bins=50,ax=ax[i])
    ax[i].set_xlabel('Average accuracy')
    ax[i].set_ylabel('Counts')
    ax[i].set_title('Average accuracy for students that have watched {} to {} lectures \n and answered at least 10 questions'.format(min_lectures[i], min_lectures[i+1]-1), weight='bold')
plt.show()


In [None]:
# converting a timestamp of 47,774,654 ms into hours (roughly 13.3)

47774654/(1000*3600)

# Users and Contents

#### Each user has a unique ID. Counting the interactions per user shows some very active users: out of ~300k unique users, almost all of the top 25 made more than 3k interactions.

#### Content IDs behave similarly. Among the most popular content, #6116 is the clear favourite, followed by #6173 and #4120, at around 400k interactions.


In [None]:
# plotting user and content related graphs

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(32,14))

sns.countplot(y='user_id', data=train_df, order=train_df.user_id.value_counts().index[:25], palette='autumn',ax = ax[0])
ax[0].set_title('Top 25 Active Users', weight='bold')


sns.countplot(y='content_id', data=train_df, order=train_df.content_id.value_counts().index[:25], palette='autumn',ax = ax[1])
ax[1].set_title('Top 25 Content', weight='bold')


plt.show()

# Content Types

#### "content_type_id": 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

#### Here we can see that 98% of our samples are questions and ~2% are lectures.
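
The split between questions and lectures can also be read as plain numbers with `value_counts(normalize=True)`. A small sketch on made-up rows (the toy column values are an assumption, so the percentages differ from the real data):

```python
import pandas as pd

# 0 = question, 1 = lecture
toy = pd.DataFrame({'content_type_id': [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

# share of each content type in percent
share = toy['content_type_id'].value_counts(normalize=True).sort_index() * 100
print(share.round(2))
```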

In [None]:
# plotting content types

g=sns.countplot(train_df.content_type_id, palette='autumn')

# adding percentages

total = float(len(train_df['content_type_id']))

for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2.,
            height + 2,
            '{:1.2f}%'.format((height / total) * 100),
            ha='center')

plt.ylabel('Count*10^7')    
plt.title('Content Types - 0: Question, 1: Lecture', weight='bold')
plt.show()

# Task Containers

#### "task_container_id": Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

#### We can observe that task containers with smaller IDs are much more common than those with larger IDs, and the most popular one is #14.
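
To illustrate how a shared `task_container_id` marks a bundle, the sketch below counts the rows per user and container on made-up data. The toy rows and the `questions_in_bundle` column name are assumptions for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':           [7, 7, 7, 7, 7, 9],
    'task_container_id': [0, 1, 1, 1, 2, 0],
    'content_id':        [10, 21, 22, 23, 30, 10],
})

# rows sharing user_id and task_container_id belong to the same bundle
bundle_size = (toy.groupby(['user_id', 'task_container_id'])
                  .size()
                  .rename('questions_in_bundle')
                  .reset_index())
print(bundle_size)
```

Here user 7 saw three questions together in container 1, while every other container holds a single question.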

In [None]:
# plotting task containers

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(32,14))

sns.distplot(train_df.task_container_id, kde=False,hist_kws={
                 'rwidth': 0.85,
                 'edgecolor': 'black',
                 'alpha': 0.8}, ax=ax[0])


ax[0].set_ylabel('Frequency')
ax[0].set_title('Task Container ID Distribution', weight='bold')


sns.countplot(y='task_container_id', data=train_df, order=train_df.task_container_id.value_counts().index[:25], palette='autumn', ax=ax[1])
ax[1].set_title('Top 25 Tasks', weight='bold')


plt.show()

# User Answers

#### "user_answer": The user's answer to the question, if any. Read -1 as null, for lectures.

#### Here we can see that answer option #2 is less common than the other three. It seems users/instructors don't like option #2 that much :)

In [None]:
# plotting user answers

g=sns.countplot(train_df.user_answer, hue=train_df.answered_correctly, palette='autumn', order=train_df.user_answer.value_counts().index)

# adding percentages

total = float(len(train_df['user_answer']))
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2.,
            height + 2,
            '{:1.2f}%'.format((height / total) * 100),
            ha='center')

plt.title('False/Correct per User Answer  (-1 for Lectures)', weight='bold')

plt.show()

# Prior Questions


#### "prior_question_elapsed_time": The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture.

#### "prior_question_had_explanation": Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.

In [None]:
# plotting prior questions related stuff

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(32,14))


sns.distplot(train_df.prior_question_elapsed_time.dropna(), kde=False, hist_kws={
                 'rwidth': 0.85,
                 'edgecolor': 'black',
                 'alpha': 0.8}, ax=ax[0])

ax[0].set_ylabel('Count')
ax[0].set_title('Prior Question Elapsed Time Distribution', weight='bold')


g=sns.countplot(train_df.prior_question_had_explanation.dropna(), palette='autumn', ax=ax[1])

# adding percentages to plot

total = float(len(train_df.prior_question_had_explanation.dropna()))
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2.,
            height + 2,
            '{:1.2f}%'.format((height / total) * 100),
            ha='center')

ax[1].set_title('Prior Question had Explanation?', weight='bold')
    
plt.show()

In [None]:
# Dropping lectures from the dataframe

train_df = train_df.loc[train_df['answered_correctly'] != -1].reset_index(drop=True)

# Questions

#### Here we add the extra data we have been given, which might provide some insightful hints...

In [None]:
# merging question data with train data

train_df = pd.merge(train_df,questions_df[['question_id','part']], how='left', left_on='content_id', right_on='question_id').sort_values('row_id')
train_df['part'] = train_df['part'].astype('int8')


# Question Parts

#### part: The relevant section of the TOEIC test.

#### We've merged the question parts via their question IDs. These give us the relevant section of the TOEIC test, explained here:

![](https://i.imgur.com/2wqNAJ1.png)
![](https://i.imgur.com/4B3AQyL.png)

### Here we observe that Part 5 questions are the most popular; in these you complete a sentence by choosing one of four given options.

In [None]:
g=sns.countplot(train_df.part, hue=train_df.answered_correctly, palette='autumn')

# adding percentages to plot

total = float(len(train_df.part))
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2.,
            height + 2,
            '{:1.2f}%'.format((height / total) * 100),
            ha='center')

plt.title('False/Correct per Question Part', weight='bold')
plt.show()

In [None]:
# grouping by user id and getting mean, sum, counts

usr_ans = train_df.groupby('user_id').agg({ 'answered_correctly': ['mean','sum', 'count']})
usr_ans.columns = ['avg_correct_answer','num_of_correct', 'total_answers']

# changing dtype for reducing memory (default = 64)

usr_ans['num_of_correct'] = usr_ans['num_of_correct'].astype('int16')
usr_ans['total_answers'] = usr_ans['total_answers'].astype('int16')


train_df = pd.merge(train_df, usr_ans, how='left', on = 'user_id')

# Correct Answer Accuracy

#### We've filtered users who have answered more than 100 questions and sorted them by their correct answer ratio. We got some very accurate users!

In [None]:
#  plotting top 25 Accurate Users

sns.barplot(x='avg_correct_answer',y='user_id', orient='h', data=usr_ans[usr_ans['total_answers']>100].sort_values('avg_correct_answer', ascending=False).reset_index().iloc[:25],
            palette='autumn', order=usr_ans[usr_ans['total_answers']>100].sort_values('avg_correct_answer', ascending=False).reset_index().user_id.iloc[:25])
plt.title('Top 25 Accurate Users', weight='bold')
plt.show()

# Answer Accuracy vs. Total Questions Answered

#### Here we can observe that the correct answer ratio increases with the total number of questions answered by the user. Practice makes perfect!

In [None]:
# plotting total answers vs avg accuracy

sns.regplot(data=usr_ans[usr_ans['total_answers']> 100], y='avg_correct_answer', x='total_answers', ci=False, scatter_kws={'alpha':0.5}, line_kws={"color": "orange"})
plt.axhline(train_df.avg_correct_answer.mean(), color='k', linestyle='dashed', linewidth=3)
plt.axvline(train_df.total_answers.mean(), color='k', linestyle='dashed', linewidth=3)

min_ylim, max_ylim = plt.ylim()
plt.text(train_df.total_answers.mean()+25, max_ylim*0.20, 'Average Questions Solved {:.2f}'.format(train_df.total_answers.mean()))
plt.text(train_df.total_answers.mean()+2400, max_ylim*0.6, 'Average Correct Answer: {:.2f}'.format(train_df.avg_correct_answer.mean()))

plt.title('Average Correct Answer Ratio vs. Total Questions Answered per User', weight='bold')
plt.show()

## Answer Accuracy - Time Relations

#### Here we took the maximum timestamp of each user as the time spent and filtered out users with less than one hour. Accuracy increases slightly with time spent, but the effect seems insignificant.

In [None]:
# plotting answer accuracy vs time

total_time = train_df.groupby('user_id')["timestamp"].max()
total_time = pd.merge(total_time.reset_index(), usr_ans.reset_index(), how='left', on = 'user_id')

sns.regplot(data=total_time[total_time['timestamp']> 3.6e+6], y='avg_correct_answer', x='timestamp', ci=False, scatter_kws={'alpha':0.5}, line_kws={"color": "orange"})

# reading the y-limits of this plot to place the label

min_ylim, max_ylim = plt.ylim()

plt.axvline(total_time.timestamp.mean(), color='k', linestyle='dashed', linewidth=3)
plt.text(total_time.timestamp.mean()+total_time.timestamp.mean()*0.1, max_ylim*0.03, 'Average Time {:.2f}'.format(total_time.timestamp.mean()))

plt.title('Average Correct Answer Ratio vs. Time Spent', weight='bold')
plt.show()

# deleting some variables to save memory:

del total_time

gc.collect()

# Answer Accuracy - Content Relations

#### When we look at answer accuracy per content, we can see that more popular/generic content has a lower correct answer ratio, while less popular/specific questions have a higher one.

In [None]:
# creating new feature based on content id

cnt_ans = train_df.groupby('content_id').agg({ 'answered_correctly': ['mean','sum', 'count']})
cnt_ans.columns = ['avg_correct_answer_c','num_of_correct_c', 'total_answers_c']

# changing dtype for reducing memory (default = 64)

cnt_ans['num_of_correct_c'] = cnt_ans['num_of_correct_c'].astype('int32')
cnt_ans['total_answers_c'] = cnt_ans['total_answers_c'].astype('int32')

train_df = pd.merge(train_df, cnt_ans, how='left', on = 'content_id')

In [None]:
# plotting contents vs. answer accuracies

sns.regplot(data=cnt_ans[cnt_ans['total_answers_c']> 100], y='avg_correct_answer_c', x='total_answers_c', ci=False, scatter_kws={'alpha':0.5}, line_kws={"color": "orange"})

# plotting mean lines

plt.axhline(train_df.avg_correct_answer_c.mean(), color='k', linestyle='dashed', linewidth=3)
plt.axvline(train_df.total_answers_c.mean(), color='k', linestyle='dashed', linewidth=3)


min_ylim, max_ylim = plt.ylim()
plt.text(35000, max_ylim*0.65, 'Average Correct Answer: {:.2f}'.format(train_df.avg_correct_answer_c.mean()))
plt.text(5500, max_ylim*0.10, 'Average Questions Solved per Content: {:.2f}'.format(train_df.total_answers_c.mean()))

plt.title('Average Correct Answer Ratio vs. Total Questions Answered per Content', weight='bold')
plt.show()

In [None]:
train_df.head()

# Baseline Model

#### Here we build a simple baseline model as a benchmark for future models. We start by filling missing values in our data and then split it into X and y for modelling. We need to predict whether a given user answers a specific question correctly.

In [None]:
# creating x variable for training

X = train_df.copy()

In [None]:
# filling na values

X['prior_question_elapsed_time'].fillna(0,  inplace=True)
X['prior_question_had_explanation'] = X['prior_question_had_explanation'].fillna(value = False).astype(bool)

In [None]:
del train_df

gc.collect()

In [None]:
# setting x and y for training

X=X.sort_values(['user_id'])
y = X["answered_correctly"]  # 1-d target, as expected by scikit-learn
X = X.drop(["answered_correctly"], axis=1)

In [None]:
from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
X["prior_question_had_explanation_enc"] = lb_make.fit_transform(X["prior_question_had_explanation"])
X['prior_question_had_explanation_enc'] = X['prior_question_had_explanation_enc'].astype('int8')
X.head()




In [None]:
# selecting features for training

X = X[['avg_correct_answer','num_of_correct','total_answers', 'avg_correct_answer_c', 'prior_question_elapsed_time','prior_question_had_explanation_enc','part']] 

In [None]:
X.shape

### Here we have a basic LightGBM classifier with default parameters for training.

In [None]:
# loading model(s) for testing

from sklearn.model_selection import StratifiedKFold, cross_validate
import lightgbm as lgb

light = lgb.LGBMClassifier(
)



# Validation

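#### We will stratify on and shuffle our target (I'm not sure this is right, since the data has a timeline, but we'll do it this way for the baseline) and validate using 3 folds. Note that the per-user and per-content accuracy features were computed from the target over the whole training set, so the cross-validated scores are likely optimistic.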
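
One possible alternative to a shuffled split, sketched here under assumptions rather than used in this notebook, is to hold out each user's most recent interactions so that validation rows always come after the training rows of the same user. The function name `last_n_per_user_split` and the toy data are hypothetical.

```python
import pandas as pd


def last_n_per_user_split(df, n_valid=2):
    """Hold out each user's last `n_valid` interactions (by timestamp) for validation."""
    ordered = df.sort_values(['user_id', 'timestamp'])
    # position counted from the end of each user's history: 0 = most recent
    from_end = ordered.groupby('user_id').cumcount(ascending=False)
    valid = ordered[from_end < n_valid]
    train = ordered[from_end >= n_valid]
    return train, valid


toy = pd.DataFrame({
    'user_id':            [1, 1, 1, 1, 2, 2, 2],
    'timestamp':          [0, 10, 20, 30, 0, 5, 9],
    'answered_correctly': [1, 0, 1, 1, 0, 1, 0],
})

train_part, valid_part = last_n_per_user_split(toy, n_valid=1)
print(valid_part)
```

Because timestamps are relative to each user's first event rather than wall-clock time, this only approximates the ordering of the hidden test set.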

In [None]:
# setting stratified kfold for validation

kf = StratifiedKFold(3, shuffle=True, random_state=42)
classifiers = [light]

In [None]:
def model_check(X, y, classifiers, cv):
    
    ''' A function for testing multiple classifiers and return several metrics. '''
    
    model_table = pd.DataFrame()

    row_index = 0
    for cls in classifiers:

        MLA_name = cls.__class__.__name__
        model_table.loc[row_index, 'Model Name'] = MLA_name
        
        cv_results = cross_validate(
            cls,
            X,
            y,
            cv=cv,
            scoring=('accuracy','f1','roc_auc'),
            return_train_score=True,
            n_jobs=-1
        )
        model_table.loc[row_index, 'Train Roc/AUC Mean'] = cv_results[
            'train_roc_auc'].mean()
        model_table.loc[row_index, 'Test Roc/AUC Mean'] = cv_results[
            'test_roc_auc'].mean()
        model_table.loc[row_index, 'Test Roc/AUC Std'] = cv_results['test_roc_auc'].std()

        model_table.loc[row_index, 'Time'] = cv_results['fit_time'].mean()

        row_index += 1        

    model_table.sort_values(by=['Test Roc/AUC Mean'],
                            ascending=False,
                            inplace=True)

    return model_table

# Results

#### Alright. Looks like our baseline did OK! 

In [None]:
# displaying default model results

raw_models = model_check(X, y, classifiers, kf)
display(raw_models)
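
When reading these scores, keep in mind that `avg_correct_answer` and the other aggregate features were built from the target over all rows before cross-validation, so each validation row's own label contributes to its features. A minimal sketch of a leakage-free variant computes the statistic on the training part only and attaches it to the validation part. The helper `add_user_accuracy` and the toy frames are assumptions for illustration; the 0.5 fallback mirrors the value this notebook uses for unseen users at prediction time.

```python
import pandas as pd


def add_user_accuracy(train_part, valid_part, prior=0.5):
    """Attach per-user accuracy computed on train_part only; unseen users get `prior`."""
    stats = (train_part.groupby('user_id')['answered_correctly']
                       .mean()
                       .rename('avg_correct_answer')
                       .reset_index())
    out = valid_part.merge(stats, how='left', on='user_id')
    out['avg_correct_answer'] = out['avg_correct_answer'].fillna(prior)
    return out


train_part = pd.DataFrame({'user_id': [1, 1, 1, 2], 'answered_correctly': [1, 0, 1, 0]})
valid_part = pd.DataFrame({'user_id': [1, 2, 3], 'answered_correctly': [1, 1, 0]})

valid_feats = add_user_accuracy(train_part, valid_part)
print(valid_feats)
```

User 3 never appears in the training part, so it receives the neutral prior instead of a statistic derived from its own validation label.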

# Prediction

#### Here we use the competition-specific prediction environment: we predict the test samples batch by batch and submit them using the 'riiideducation' package.

In [None]:
# importing riiideducation package

import riiideducation

env = riiideducation.make_env()

iter_test = env.iter_test()

In [None]:
# fitting the model

light.fit(X, y)

# resetting indexes for pred merging

cnt_ans=cnt_ans.reset_index()
usr_ans=usr_ans.reset_index()

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    test_df = test_df.merge(usr_ans, how = 'left', on = 'user_id')
    test_df = test_df.merge(cnt_ans, how = 'left', on = 'content_id')
    test_df = pd.merge_ordered(test_df,questions_df[['question_id','part']], how='left', left_on='content_id', right_on='question_id', fill_method='ffill')
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(value = False).astype(bool)
    test_df['prior_question_elapsed_time'].fillna(0,  inplace=True)
    test_df['avg_correct_answer'].fillna(0.5, inplace=True)
    test_df['avg_correct_answer_c'].fillna(0.5, inplace=True)
    test_df.fillna(value = -1, inplace = True)
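    # casting directly keeps False -> 0 and True -> 1; refitting the encoder on each batch could change the mapping
    test_df["prior_question_had_explanation_enc"] = test_df["prior_question_had_explanation"].astype('int8')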
    
    
    y_pred = light.predict_proba(test_df[['avg_correct_answer','num_of_correct','total_answers', 'avg_correct_answer_c', 'prior_question_elapsed_time','prior_question_had_explanation_enc','part']])[:,1]
    test_df['answered_correctly'] = y_pred
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
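
One pitfall worth knowing when encoding features inside a prediction loop: an encoder that is refit on every batch assigns codes from whatever values happen to be present in that batch. The small sketch below, on a made-up batch, shows how a `LabelEncoder` refit on all-`True` values encodes `True` as 0, while a plain integer cast always keeps `False` as 0 and `True` as 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# a batch where every row had an explanation
batch = pd.Series([True, True, True])

# refitting on the batch learns a single class, so True is encoded as 0
refit = LabelEncoder().fit_transform(batch)

# a fixed cast keeps False -> 0 and True -> 1 regardless of batch contents
fixed = batch.astype('int8')

print(refit.tolist(), fixed.tolist())
```

A model trained with `True` encoded as 1 would read the refit batch as if no row had an explanation.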


# Last Words

### Well, that's it then... I hope you enjoyed reading this notebook and found it useful! Please don't forget to comment/upvote if you liked it!

### This is an early version of the notebook and I'll try to update it when I have time. Thanks again for reading, happy coding!

## Work in Progress