# RIIID : Answer Correctness Prediction

![](https://res-2.cloudinary.com/crunchbase-production/image/upload/c_lpad,f_auto,q_auto:eco/zcyhpowwzhmv9zvynidc)


I came across this interesting competition on Kaggle about a month ago, and it's been a great learning experience so far, as I had to work through a lot of challenges with the size of the data and the full test dataset not being available to us. While working through this competition, I got a chance to learn about GPUs and about the data processing and modelling libraries that use GPUs to process data faster. 

Through some of the great notebooks written so far, I came across the RAPIDS framework created by NVIDIA, which provides GPU-based acceleration for analytics workflows. It also let me use some of my weekly GPU quota on Kaggle for the first time. Please read through this kernel and share your feedback in the comments. 



# Table of Contents
* [Overview of RAPIDS](#rapids)
* [Importing raw datasets](#import)
* [Exploratory Analysis](#eda)
* [Feature Engineering](#feature)
* [Model Development & Feature Importance](#model)
* [Predictions](#preds)



### Questions answered in the EDA 
* How does the average correct answer rate vary across questions in the training dataset?
* How many questions does an average user answer over their learning journey? What is the total span of data, in days, for an average user?
* What is the distribution of the number of appearances of a question in the training dataset?
* Does a student's likelihood of answering a question correctly improve over time?
* What is the average size of a question bundle?
* Does watching lectures improve the chances of users answering a question correctly?
* What is the average time spent by users in reading the explanation for a prior question bundle?
* How does the correct answer rate vary across different tags? Does the count of total tags in a question have an impact on the correct answer rate for a question?


# Overview of RAPIDS <a name="rapids"></a>

[RAPIDS](https://rapids.ai/) is a suite of open-source software libraries and APIs for executing data science pipelines entirely on GPUs. Some of the advantages are - 
* **Faster Execution Time** - RAPIDS leverages NVIDIA CUDA® under the hood to accelerate your workflows by running the entire data science training pipeline on GPUs. This reduces training time and the frequency of model deployment from days to minutes.
* **Use the Same Tools** - By hiding the complexities of working with the GPU and even the behind-the-scenes communication protocols within the data center architecture, RAPIDS creates a simple way to get data science done.

I initially started working on this problem in pandas, but my code was quite slow and I wasn't able to handle the datasets properly. I therefore switched to RAPIDS, and since then the overall runtime of my notebook has come down to ~2 mins (excluding the load time for the cudf packages).


To use RAPIDS in our notebook, we need to add RAPIDS package files to our notebook and then load the package. 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
from plotly.subplots import make_subplots
from matplotlib import pyplot
import plotly.graph_objects as go
import gc





In [None]:
%%time

import sys
!cp ../input/rapids/rapids.0.15.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/



In [None]:
import cudf #Cudf is a library for processing dataframes
import cupy # CuPy is an open-source array library accelerated with NVIDIA CUDA.


# Importing raw datasets <a name="import"></a>


In this competition we've been given access to a log of student activity from a learning app. Student activity includes watching lectures on different topics and answering questions. Each question and lecture has some metadata associated with it, providing more detail about a student's learning journey. 

There are three primary datasets available in this competition. Below is the description provided to us - 

**train.csv**

* ```row_id```: (int64) ID code for the row.

* ```timestamp```: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

* ```user_id```: (int32) ID code for the user.

* ```content_id```: (int16) ID code for the user interaction

* ```content_type_id```: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

* ```task_container_id```: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

* ```user_answer```: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

* ```answered_correctly```: (int8) if the user responded correctly. Read -1 as null, for lectures.

* ```prior_question_elapsed_time```: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.

* ```prior_question_had_explanation```: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.

**questions.csv**: metadata for the questions posed to users.

* ```question_id```: foreign key for the train/test content_id column, when the content type is question (0).

* ```bundle_id```: code for which questions are served together.

* ```correct_answer```: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

* ```part```: the relevant section of the TOEIC test.

* ```tags```: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.

**lectures.csv**: metadata for the lectures watched by users as they progress in their education.

* ```lecture_id```: foreign key for the train/test content_id column, when the content type is lecture (1).

* ```part```: top level category code for the lecture.

* ```tag```: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

* ```type_of```: brief description of the core purpose of the lecture

### Reading data using cudf

Similar to pandas, we use the ```read_csv``` function in the cudf package to read the input data provided to us. 

This is the description of cuDF on the RAPIDS website - Built based on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF provides a pandas-like API, so users can use it to easily accelerate their workflows without going into the details of CUDA programming.
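
To make the pandas-like API concrete, here is a minimal sketch of reading a CSV with explicit dtypes and a row limit. It uses pandas as a CPU stand-in so that it runs without a GPU; the three-row CSV text is made up for illustration and only covers a subset of the train.csv columns. The ```dtype``` and ```nrows``` arguments are the same ones passed to ```cudf.read_csv``` in the cell below.

```python
import io

import pandas as pd

# Hypothetical three-row extract with a subset of the train.csv columns.
csv_text = """row_id,timestamp,user_id,content_id,content_type_id,answered_correctly
0,0,115,5692,0,1
1,56943,115,5716,0,1
2,118363,115,128,0,1
"""

# Narrow integer types, following the dtypes listed in the data description.
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "answered_correctly": "int8",
}

# nrows limits how many rows are parsed, just like the 20M-row limit in the notebook.
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, nrows=2)

# Bytes needed per row for these six columns: 8 + 8 + 4 + 2 + 1 + 1
bytes_per_row = sum(sample[col].dtype.itemsize for col in sample.columns)
print(sample.dtypes)
print(bytes_per_row)
```

Declaring the dtypes up front avoids the default 64-bit parsing, which matters a lot once the row count is in the tens of millions.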

In [None]:
#Importing raw datasets. We will only be importing the first 20M rows of the training data.
train_dtypes={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8',
              'prior_question_elapsed_time': 'float32', 'prior_question_had_explanation': 'boolean'}
train=cudf.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',low_memory=False,nrows=2*(10**7),dtype=train_dtypes)


lectures=cudf.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
questions=cudf.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')


In [None]:
train

In [None]:
lectures

In [None]:
questions

Some observations from having a peek at the data above - 
* timestamp values are in milliseconds and start at zero for every user
* As expected from the data description, prior_question_had_explanation and prior_question_elapsed_time values are null for the first question bundle answered by a user
* **content_id is the foreign key used to join the train dataset with the questions and lectures datasets**. A content id can represent either a lecture or a question, depending on content_type_id
* Lectures can be about concepts or about solving questions. Every lecture has a tag associated with it. 
* A single question can have multiple tags associated with it. Tags help us understand the topic of a question or lecture. However, we don't have a mapping file from tag_id to tag_name
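
Because content_id covers both questions and lectures, a join against the questions table can also pick up lecture rows whose content_id happens to equal a question_id. The sketch below shows this on a toy frame (all values are made up) and uses pandas as a stand-in for cudf. Filtering on content_type_id before the merge is one way to keep only question rows; it is not what the cells further down do, since they rely on the -1 rows to identify lecture watchers.

```python
import pandas as pd

# Toy data: row 1 is a lecture whose content_id (20) collides with a question_id.
train = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "user_id": [1, 1, 2, 2],
    "content_id": [10, 20, 10, 20],
    "content_type_id": [0, 1, 0, 0],
    "answered_correctly": [1, -1, 0, 1],
})
questions = pd.DataFrame({"question_id": [10, 20], "part": [5, 2]})

# Joining without a filter keeps the lecture row, because its id matches question 20.
unfiltered = train.merge(questions, left_on="content_id", right_on="question_id", how="inner")

# Keeping only question rows (content_type_id == 0) before the join avoids that.
question_rows = train.loc[train["content_type_id"] == 0]
filtered = question_rows.merge(questions, left_on="content_id", right_on="question_id", how="inner")

print(len(unfiltered), len(filtered))
```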

Also taking a look at the number of records in each of the datasets - 
* **The train dataset has more than 100 million records**; we've only imported around 20 million here
* **There are 418 lectures** in the lectures dataset
* **There are a total of 13523 questions in our data**

The train data is too large to handle comfortably in the memory available in a Kaggle notebook, so you will see me walking a tightrope with RAM and frequently deleting datasets after use to free up memory. 
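
A rough back-of-the-envelope estimate shows why the dtypes and the row limit matter. The byte sizes below follow from the dtypes in the data description; the round figure of 100 million rows and the "everything as 64-bit" comparison are my assumptions for illustration. The estimate covers the raw columns only, so merged frames, pandas copies made for plotting and null masks all come on top of it.

```python
# Bytes per value for each train.csv column, from the dtypes in the data description.
column_bytes = {
    "row_id": 8,                          # int64
    "timestamp": 8,                       # int64
    "user_id": 4,                         # int32
    "content_id": 2,                      # int16
    "content_type_id": 1,                 # int8
    "task_container_id": 2,               # int16
    "user_answer": 1,                     # int8
    "answered_correctly": 1,              # int8
    "prior_question_elapsed_time": 4,     # float32
    "prior_question_had_explanation": 1,  # bool
}

compact_bytes_per_row = sum(column_bytes.values())
wide_bytes_per_row = 8 * len(column_bytes)  # assumption: every column parsed as 64-bit


def estimated_gb(n_rows, bytes_per_row):
    """Raw column storage in gigabytes (10**9 bytes), ignoring any overhead."""
    return n_rows * bytes_per_row / 1e9


full_compact_gb = estimated_gb(100_000_000, compact_bytes_per_row)
full_wide_gb = estimated_gb(100_000_000, wide_bytes_per_row)
sample_compact_gb = estimated_gb(20_000_000, compact_bytes_per_row)

print(compact_bytes_per_row, wide_bytes_per_row)
print(full_compact_gb, full_wide_gb, sample_compact_gb)
```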

# Exploratory Analysis <a name="eda"></a>

Having taken a quick glance at the data, let's dive deeper into the datasets provided to us. This will help us understand the distribution of individual variables as well as the relationships between the independent variables and the dependent variable. It will also allow us to create meaningful features, which in turn help us build a good model. We will be looking to answer the questions listed at the start of the notebook.

In [None]:
#Count of records by distinct value of answered_correctly column
train.groupby('answered_correctly').agg({'row_id': ['count']}).reset_index()

In [None]:
#Count of records by content type
train.groupby('content_type_id').agg({'row_id': ['count']}).reset_index()

* ```answered_correctly``` has three distinct values. The value -1 marks rows where the user watched a lecture rather than answering a question
* In our imported sample of the training data, the majority of rows are questions answered. Less than 2% of the rows are lectures watched by students. 
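
Since lecture rows carry -1 in ```answered_correctly```, any mean taken over a mix of question and lecture rows is pulled down. A small worked example with made-up values shows the size of the effect:

```python
# Hypothetical user history: four answered questions and one lecture (-1).
answered_correctly = [1, 0, 1, 1, -1]

# Averaging everything treats the lecture as a "negative" answer: (1+0+1+1-1) / 5
naive_rate = sum(answered_correctly) / len(answered_correctly)

# Dropping the -1 rows first gives the actual correct answer rate: 3 / 4
question_only = [value for value in answered_correctly if value != -1]
correct_rate = sum(question_only) / len(question_only)

print(naive_rate, correct_rate)
```

With lectures making up less than 2% of the rows, the distortion on overall averages is small, but it can be noticeable for individual users who watch many lectures.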

## Question and User Level summaries

In [None]:
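#Merge the train and questions datasets on content_id = question_id.
#Note: train is not filtered on content_type_id here, so lecture rows whose content_id happens to match a question_id are kept too (answered_correctly = -1).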
train_questions=train.merge(questions,left_on='content_id',right_on='question_id',how='inner')
train_questions_gp=train_questions.loc[train_questions['content_type_id']==0].groupby('question_id').agg({'answered_correctly':'sum','row_id':'count'}).reset_index()
train_questions_gp['percent_correct']=train_questions_gp['answered_correctly']/train_questions_gp['row_id']
train_questions_lim=train_questions_gp.loc[train_questions_gp['row_id']>=10]
train_questions_lim_pandas=train_questions_lim.to_pandas()

fig=px.histogram(train_questions_lim_pandas,x='percent_correct',title='Distribution of percentage correct answers across questions',template="simple_white")
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()

question_cnt=train_questions.groupby('content_id').agg({'row_id':'count'}).reset_index()
question_cnt_pd=question_cnt.to_pandas()
fig1=px.histogram(question_cnt_pd,x='row_id',title='Count of Question Occurrences',template="simple_white")
fig1.update_traces(marker=dict(color='cadetblue'))
fig1.show()

del question_cnt_pd
#del question_cnt


In [None]:
#Group by training data on user_ids
train_user_gp=train_questions.groupby('user_id').agg({'answered_correctly':'mean','row_id':'count'}).reset_index()
train_user_gp.columns=['user_id','user_answer_mean','user_answer_count']
train_user_gp_pandas=train_user_gp.to_pandas()
train_user_gp_pandas=train_user_gp_pandas.loc[train_user_gp_pandas['user_answer_count']<1000]
fig=px.histogram(train_user_gp_pandas,x='user_answer_count',title='Distribution of total Questions Answered by users',template="simple_white")
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()


train_user_gp=train_questions.groupby('user_id').agg({'answered_correctly':'mean','row_id':'count'}).reset_index()
train_user_gp_pandas=train_user_gp.to_pandas()
fig1=px.violin(train_user_gp_pandas,y='answered_correctly',box=True,title='Distribution of Percentage of Answered Correctly across users',template="simple_white")
fig1.update_traces(marker=dict(color='cadetblue'))
fig1.show()

fig2=px.scatter(train_user_gp_pandas,x='row_id',y='answered_correctly')
fig2.update_layout(title='Percentage Answered Correctly vs Total Questions Answered by User',xaxis_title='Total Questions Answered',yaxis_title='Percentage Answered Correctly',template="simple_white")
fig2.update_traces(marker=dict(color='cadetblue'))
fig2.show()
del train_user_gp_pandas

A few things are visible from the charts above - 
* The mean correct answer rate across questions is quite high; **the distribution of percentage correct across questions peaks between 0.7 and 0.8**
* The majority of questions in the dataset appear fewer than 5k times, but a small number of questions appear far more often, sometimes more than 10k times
* The distribution of percentage correct across users is skewed towards lower values, with a median of 0.56, Q1 at 0.42 and Q3 at 0.66
* **Average scores are lower and more variable for users who have answered few questions; as the number of questions answered increases, average scores rise, show lower variance and settle between 0.6 and 0.8**
* Beyond 4k questions answered, there is no further increase in average user scores
* There is no linear relationship between total questions answered and average user scores.


### Lectures

In [None]:
#Identify users who have attended a lecture. Lecture rows have answered_correctly = -1,
#so a user whose minimum answered_correctly is -1 has watched at least one lecture.
train_user_gp_new=train_questions.groupby('user_id').agg({'answered_correctly':'sum','row_id':'count'}).reset_index()

train_user_gp_lec=train_questions.groupby('user_id').agg({'answered_correctly':'min'}).reset_index()
lec_users=train_user_gp_lec.loc[train_user_gp_lec['answered_correctly']==-1]
lec_users['lec_flag']=lec_users['answered_correctly']*-1
del lec_users['answered_correctly']
#lec_users

train_user_gp_lec=train_user_gp_new.merge(lec_users,left_on='user_id',right_on='user_id',how='left')
train_user_gp_lec['lec_flag'].fillna(0,inplace=True)
#The original dict listed 'row_id' twice ('mean' then 'sum'); only the last entry takes effect, so 'sum' is kept
train_user_gp_lec_gp=train_user_gp_lec.groupby('lec_flag').agg({'answered_correctly':'sum','row_id':'sum'}).reset_index()
train_user_gp_lec_gp['percent_correct']=train_user_gp_lec_gp['answered_correctly']/train_user_gp_lec_gp['row_id']
train_user_gp_lec_gp_pd=train_user_gp_lec_gp.to_pandas()

fig=px.bar(train_user_gp_lec_gp_pd,x='lec_flag',y='percent_correct',title='Student Performance variation with Lecture Viewing',template='simple_white')
fig.update_xaxes(type='category')
fig.update_traces(marker=dict(color='cadetblue'))

fig.show()



#Creating Flags for number of lectures & whether a student has watched a lecture or not
lec_watchers=train_questions.loc[train_questions['answered_correctly']==-1]

train_user_gp_lec=lec_watchers.groupby('user_id').agg({'answered_correctly':'sum'}).reset_index()
#Each lecture row contributes -1, so the negated sum is the number of lectures watched
train_user_gp_lec['num_lec']=train_user_gp_lec['answered_correctly']*-1
train_user_gp_lec['lec_flag']=1
del train_user_gp_lec['answered_correctly']

del train_user_gp_lec_gp_pd

In [None]:
train_lec=train.merge(train_user_gp_lec,left_on='user_id',right_on='user_id',how='left')
train_lec['num_lec'].fillna(0,inplace=True)
train_lec['lec_flag'].fillna(0,inplace=True)

train_lec_gp=train_lec.groupby('num_lec').agg({'answered_correctly':'mean'}).reset_index()
del train_lec

train_lec_gp_pandas=train_lec_gp.to_pandas()
train_lec_gp_pandas.answered_correctly.rolling(5).mean()
fig=px.line(train_lec_gp_pandas,x='num_lec',y='answered_correctly',title='Variation in Correct Answer rate with number of lectures watched',template='simple_white')

fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del train_lec_gp_pandas

As we can see above, students who have watched lectures generally do better than students who haven't. When we dive deeper into the number of lectures watched vs correct answer rate, we see that for users with more than 100 lectures the correct answer rate is well above 70%. However, such instances could be outliers, skewed by a few users, because the data volume there is low. 
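
The cell above computes a 5-point rolling mean of the correct answer rate but does not store or plot the result. If you want to smooth out the noisy tail of the line chart, one option is to keep the rolling mean as its own column. Below is a minimal pandas sketch on made-up values; the column name ```smoothed``` and the window of 3 are my choices for the toy data, not something from the notebook.

```python
import pandas as pd

# Hypothetical correct answer rate by number of lectures watched.
by_lectures = pd.DataFrame({
    "num_lec": [0, 1, 2, 3, 4, 5],
    "answered_correctly": [0.60, 0.66, 0.62, 0.70, 0.64, 0.72],
})

# Store the rolling mean in a new column; min_periods=1 avoids NaN at the start.
by_lectures["smoothed"] = (
    by_lectures["answered_correctly"].rolling(window=3, min_periods=1).mean()
)

print(by_lectures)
```

The ```smoothed``` column can then be passed as ```y``` to ```px.line``` in place of the raw rate.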

## Relationship between Task Containers, Bundle & Part and dependent variable

In [None]:
#Count the number of questions per task container
task_container=train.groupby('task_container_id').agg({'row_id': ['count']}).reset_index()
task_container.columns=['task_container_id','count']
task_container_pandas=task_container.to_pandas()

fig = px.histogram(task_container_pandas, x="count",nbins=50,labels={'x':'task_container_id', 'y':'count'},template='simple_white')
fig.update_layout(title='Distribution of size of Task Containers',yaxis_title='Count of Task Containers')
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del task_container_pandas

In [None]:
# train_sample=train.sample(frac=0.25)
# train_prior=train_sample.loc[~train_sample['prior_question_elapsed_time'].isna(),:]
# #train_prior

# train_prior_pandas=train_prior.to_pandas()
# px.violin(train_prior_pandas,y='prior_question_elapsed_time',x='prior_question_had_explanation',box=True,title='Variation of Prior Question Elapsed time with Prior Question Explanation')


In [None]:
# # Deleting data frames to clear up some space
# del train_prior
# del train_prior_pandas
# del train_sample

In [None]:
train_bundle_gp=train_questions.loc[train_questions['content_type_id']==0].groupby('bundle_id').agg({'question_id':'count'}).reset_index()
train_bundle_gp_lim=train_bundle_gp.loc[train_bundle_gp['question_id']<=5000]

train_bundle_gp_lim=train_bundle_gp_lim.to_pandas()

fig=px.histogram(train_bundle_gp_lim,x='question_id',title='Distribution of question counts across bundle_ids',template='simple_white')
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del train_bundle_gp_lim

In [None]:
train_bundle_ans_gp=train_questions.loc[train_questions['content_type_id']==0].groupby('bundle_id').agg({'answered_correctly':'mean'}).reset_index()

train_bundle_ans_gp_pandas=train_bundle_ans_gp.to_pandas()
fig=px.histogram(train_bundle_ans_gp_pandas,x='answered_correctly',title='Distribution of Correct Answers by Question Bundle',template='simple_white')
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del train_bundle_ans_gp_pandas


In [None]:
#Create flags for bundles based on percentage answered correct for each bundle in the training dataset
#This will be used as a feature in the dataset
train_bundle_ans_gp['bundle_flag']=0

def bundle_flag_fun(x):
    #Map a correct answer rate to a quartile-style flag from 1 (lowest) to 4 (highest)
    if (x>=0.75) :
        return 4
    elif (x>=0.5) :
        return 3
    elif (x>=0.25) :
        return 2
    else :
        return 1
    
train_bundle_ans_gp['bundle_flag']=train_bundle_ans_gp["answered_correctly"].applymap(bundle_flag_fun)

del train_bundle_ans_gp['answered_correctly']

In [None]:
train_part_gp=train_questions.groupby('part').agg({'answered_correctly':'mean','row_id':'count'}).reset_index()
train_part_lim=train_part_gp.loc[train_part_gp['row_id']>=10]

train_part_lim_pandas=train_part_lim.to_pandas()

fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Bar(x=train_part_lim_pandas['part'], y=train_part_lim_pandas['answered_correctly'], name="Percentage Answered Correctly",marker=dict(color='cadetblue')),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=train_part_lim_pandas['part'], y=train_part_lim_pandas['row_id'], name="Count of Questions",marker=dict(color='grey')),
    secondary_y=True,
)


# Add figure title
fig.update_layout(
    title_text="Question Part - Percentage Answered Correctly & Count of Questions"
)
#fig.update_traces(marker=dict(color='cadetblue'))

# Set x-axis title
#fig.update_xaxes(title_text="Number of Tags in a Question")
fig.show()
del train_part_gp['row_id']
train_part_gp.columns=['part','part_percent_correct']
del train_part_lim_pandas




Next we look at three attributes associated with a question - task container, bundle_id and part - 
* **The majority of task containers are very small**
* The distribution of total question count across bundles is uneven, with multiple peaks; the biggest peaks are in the bins 0-99, 200-299 and 900-999
* The distribution of correct answer rate across bundles looks very similar to the distribution across individual questions
* In our training dataset, **40% of questions belong to Part 5**. **Part 7 is the least common of all. The correct answer rate across all parts ranges from 0.6 to 0.7**
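
The bundle flag created in the cells above maps a correct answer rate to one of four bins with cut points at 0.25, 0.5 and 0.75. The same mapping can be written as a threshold lookup, which makes the bin edges explicit. This is a sketch in plain Python; ```rate_flag``` is a hypothetical helper name, and ```bundle_flag_fun``` is repeated here only so the two can be compared.

```python
from bisect import bisect_right

# Cut points taken from bundle_flag_fun in the notebook.
THRESHOLDS = [0.25, 0.5, 0.75]


def rate_flag(rate):
    """Return 1 to 4: the number of thresholds that rate meets or exceeds, plus one."""
    return bisect_right(THRESHOLDS, rate) + 1


def bundle_flag_fun(x):
    """The notebook's version, kept for comparison."""
    if x >= 0.75:
        return 4
    elif x >= 0.5:
        return 3
    elif x >= 0.25:
        return 2
    else:
        return 1


rates = [0.0, 0.1, 0.25, 0.49, 0.5, 0.74, 0.75, 1.0]
flags = [rate_flag(r) for r in rates]
print(flags)
```

Values that sit exactly on a cut point fall into the higher bin in both versions, because the comparisons use ```>=```.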

## Prior Question Elapsed Time and Prior Question had Explanation

In [None]:
train_priorquestions_gp=train_questions.groupby('prior_question_had_explanation').agg({'answered_correctly':'mean','row_id':'count'}).reset_index()
train_priorquestions_gp_pandas=train_priorquestions_gp.to_pandas()


fig = make_subplots(rows=1, cols=2)

# Add traces
fig.add_trace(
    go.Bar(x=train_priorquestions_gp_pandas['prior_question_had_explanation'], y=train_priorquestions_gp_pandas['answered_correctly'], name="Percentage Answered Correctly",marker=dict(color='LightSkyBlue')),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=train_priorquestions_gp_pandas['prior_question_had_explanation'], y=train_priorquestions_gp_pandas['row_id'], name="Total Question Count",marker=dict(color='lightseagreen')),
    row=1, col=2
)
fig.update_layout(title='Question Count & Percentage Correct Answers by Prior Question Had Explanation')
fig.update_layout(template='simple_white')

fig.show()

del train_priorquestions_gp_pandas

In [None]:

train_priorquestions_gp=train_questions.groupby('prior_question_had_explanation').agg({'prior_question_elapsed_time':'mean'}).reset_index()
train_priorquestions_gp_pandas=train_priorquestions_gp.to_pandas()

fig1=px.bar(train_priorquestions_gp_pandas,x='prior_question_had_explanation',y='prior_question_elapsed_time',title='Mean of prior_question_elapsed_time by prior_question_had_explanation',template='simple_white')
fig1.update_traces(marker=dict(color='cadetblue'))
fig1.show()
del train_priorquestions_gp_pandas

In the data description above, ```prior_question_had_explanation``` is defined as - Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between.

We notice that whenever users saw the explanation for the prior question bundle, their correct answer rate improved substantially. Also, for more than 90% of the question bundles, users saw the explanation after answering the questions.

Also, ```prior_question_elapsed_time```, i.e. the average time users spent per question on the previous bundle, does not appear to affect the likelihood of users seeing the explanation for that bundle.

## Question Tags


In [None]:
# train_questions_nolec=train_questions_nolec.reset_index(drop=True)
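k=train_questions['tags']
#Tags are space-separated, so the tag count is the number of spaces plus one
train_questions['tag_count']=k.str.count(' ')+1
#First tag of each question. This assumes str.split returns one column per token in this cudf version, so [0] selects the first-tag column
train_questions['tag1']= train_questions['tags'].str.split(' ')[0]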

train_tag_gp=train_questions.groupby('tag_count').agg({'row_id':'count','answered_correctly':'mean'}).reset_index()
del k

#train_tag_gp


In [None]:
# Create figure with secondary y-axis
train_tag_gp_pandas=train_tag_gp.to_pandas()
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=train_tag_gp_pandas['tag_count'], y=train_tag_gp_pandas['answered_correctly'], name="Percentage Answered Correctly",marker=dict(color='LightSkyBlue')),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=train_tag_gp_pandas['tag_count'], y=train_tag_gp_pandas['row_id'], name="Count of Answers",marker=dict(color='grey')),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Distribution of Correct Answers & Total Questions with number of Tags"
)

# Set x-axis title
fig.update_xaxes(title_text="Number of Tags in a Question")

# Set y-axes titles
fig.update_yaxes(title_text="Percentage Questions Answered Correctly", secondary_y=False)
fig.update_yaxes(title_text="Total Questions Answered", secondary_y=True)
fig.show()
del train_tag_gp_pandas

In [None]:
tag_gp=train_questions.loc[train_questions['content_type_id']==0].groupby('tag1').agg({'answered_correctly':'mean','row_id':'count'}).reset_index()
tag_gp_pandas=tag_gp.to_pandas()

fig = px.scatter(tag_gp_pandas, x="tag1", y="answered_correctly",
    size="row_id",
                 hover_name="tag1", size_max=60)
fig.update_layout(title='Correct Answer Rate & Question counts across values of Tag1',template='simple_white')
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del tag_gp_pandas

I also took a look at the tag values associated with different questions. In the two visualizations above -
1. I extracted the number of tags associated with each question and plotted the correct answer rate against it, to understand whether having more tags affects the correct answer rate for a question
2. I also extracted the first tag of each question and compared the total count of questions and the correct answer rate across these tags

Here are some findings - 
* A very high percentage of questions have only a single tag
* **Multi-tag questions have a higher correct answer rate than single-tag questions**
* Across tag counts, the number of questions with a given tag count appears to be inversely related to the correct answer rate for that tag count
* **Most tags have a correct answer rate in the range of 0.4-0.8**
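
For reference, the two tag features used above (number of tags and first tag) boil down to splitting the space-separated ```tags``` string. Here is a plain-Python sketch of that logic for a single question; ```tag_features``` is a hypothetical helper, and the tag strings are made up.

```python
def tag_features(tags):
    """Return (tag_count, first_tag) for a space-separated tag string."""
    parts = tags.split(" ")
    # Number of tokens equals number of spaces plus one, as in the notebook.
    return len(parts), parts[0]


multi_tag = tag_features("131 36 81")
single_tag = tag_features("51")

print(multi_tag, single_tag)
```

The first tag is kept as a string here, mirroring the notebook, where ```tag1``` comes straight out of the string split.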


As students, when we practice our learning material we tend to get better and better with time. We will check for the same in our dataset, looking at whether a user's correct answer rate improves with two factors - **the total number of questions answered and the total elapsed time (timestamp)**

### User Activity

In [None]:
#Create flags for different tag values
tag_gp['tag_flag']=0

def tag_flag_fun(x):
    #Same 1 to 4 binning of correct answer rate as used for bundles
    if (x>=0.75) :
        return 4
    elif (x>=0.5) :
        return 3
    elif (x>=0.25) :
        return 2
    else :
        return 1
    
tag_gp['tag_flag']=tag_gp["answered_correctly"].applymap(tag_flag_fun)
del tag_gp['answered_correctly']
del tag_gp['row_id']


In [None]:
train_user_gp_sum=train_questions.groupby('user_id').agg({'answered_correctly':'sum','row_id':'count'}).reset_index()
train_user_gp_sum.columns=['user_id','user_answer_total','user_answer_count']

train_user_gp_sum['user_answer_count_flag']=0

def user_answer_count_flag_fun(x):
    #Bucket users by total rows: 0 (<=100), 1 (<=500), 2 (<=1000), 3 (<=2000), 4 (<=3000), 5 (more)
    if (x<=100) :
        return 0
    elif (x<=500) :
        return 1
    elif (x<=1000) :
        return 2
    elif (x<=2000) :
        return 3
    elif (x<=3000) :
        return 4
    else :
        return 5
    
  
train_user_gp_sum['user_answer_count_flag']=train_user_gp_sum["user_answer_count"].applymap(user_answer_count_flag_fun)
gp=train_user_gp_sum.groupby('user_answer_count_flag').agg({'user_answer_total':'sum','user_answer_count':'sum'}).reset_index()
gp['percent_answer_correct']=gp['user_answer_total']/gp['user_answer_count']
gp=gp.to_pandas()

fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Bar(x=gp['user_answer_count_flag'], y=gp['user_answer_count'], name="Count of Questions",marker=dict(color='LightSkyBlue')),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=gp['user_answer_count_flag'], y=gp['percent_answer_correct'], name="Percentage Answered Correctly",marker=dict(color='grey')),
    secondary_y=True,
)


# Add figure title
fig.update_layout(
    title_text="Improvement in Correct Answer Rate with total questions answered"
)

# Set x-axis title
fig.update_xaxes(title_text="Total Question Count Flag")
fig.show()

del gp

In [None]:
#Converting timestamps to hours and days for readability

train['timestamp_hours']=train['timestamp']/(3600*1000)
train['timestamp_days']=train['timestamp_hours']/(24)

timestamp_series=cupy.asnumpy(train['timestamp_days'])

timestamp_series=np.random.choice(timestamp_series,size=10000)
timestamp_series
fig = go.Figure(data=[go.Histogram(x=timestamp_series)])
fig.update_layout(title='Total Activity across number of days',xaxis_title='number of days',yaxis_title='total questions answered')
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del timestamp_series

In [None]:
train_sample=train.sample(n=10000, replace=True, random_state=1)
train_sample['timestamp_flag']=0
#train_sample=cudf.DataFrame.to_pandas(train_sample)
train_sample=train_sample.reset_index(drop=True)

#train_sample.loc[i,'timestamp_days']
def timestamp_flag(x):
    if (x<=50) :
        return 0
    elif (x<=100) :
        return 1
    elif (x<=365) :
        return 2
    else :
        return 3
    
    
train_sample['timestamp_flag']=train_sample["timestamp_days"].applymap(timestamp_flag)

train_sp_gp=train_sample.groupby('timestamp_flag').agg({'answered_correctly':'mean'}).reset_index()
train_sp_gp_pandas=train_sp_gp.to_pandas()

fig=px.bar(train_sp_gp_pandas,x='timestamp_flag',y='answered_correctly',title='Variation in Percentage of Correct Answer Rate with time')
fig.update_xaxes(type='category')
fig.update_traces(marker=dict(color='cadetblue'))
fig.show()
del train_sample
del train_sp_gp_pandas


In the cells above we created flag variables for the timestamp at which questions are answered as well as for the total number of questions answered by users. We then plotted users' activity across these flag variables. 

We also see that the majority of users have a learning journey of less than 100 days; a smaller portion of users remain active on the app after that period. 

We can also see that a **user's correct answer rate improves massively as they answer more questions; however, this is not reflected to the same extent in the correct answer rate over total time**. There is a slight improvement in correct answer rate with time, but it's not a substantial difference. This may be because a user's learning is often not proportional to elapsed time: there may be long gaps between periods when the user is active. Therefore a simple increase in timestamp does not by itself imply that the user has learned more. 
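
The point about gaps is easy to see with a small example. The sketch below converts millisecond timestamps to days and applies the same cut points as ```timestamp_flag``` in the notebook; the user history is made up, and ```to_days``` is a hypothetical helper.

```python
MS_PER_DAY = 24 * 3600 * 1000


def to_days(timestamp_ms):
    """Convert a millisecond timestamp to days since the user's first event."""
    return timestamp_ms / MS_PER_DAY


def timestamp_flag(days):
    """Same cut points as the notebook: <=50, <=100, <=365, more."""
    if days <= 50:
        return 0
    elif days <= 100:
        return 1
    elif days <= 365:
        return 2
    else:
        return 3


# Hypothetical user: two answers in the first minute, then one after a 200-day break.
timestamps_ms = [0, 60_000, 200 * MS_PER_DAY]
flags = [timestamp_flag(to_days(t)) for t in timestamps_ms]

print(flags)
```

The third answer lands in the "100 to 365 days" bucket even though the user has only answered three questions in total, which is why the count of questions answered tracks learning better than elapsed time.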


## EDA Summary - Answers to Questions
We've been able to answer some of the questions above through the EDA. Here are some of the key findings -
* There are more than 100M records in the training dataset. 
* The mean correct answer rate across questions is quite high; the distribution of percentage correct across questions peaks between 0.7 and 0.8. The distribution is also quite spread out, indicating big differences in correct answer rate across questions
* Students who saw the explanation for the previous question bundle have a higher correct answer rate on the current bundle. 
* A high percentage of questions belong to part 5
* Students get better at answering questions with more and more practice. 
* A high share of students have a learning journey of less than 100 days
* Students who watch lectures are more likely to answer questions correctly through their learning journey
* Questions with a higher number of tags have a higher likelihood of being answered correctly

We will now proceed to feature engineering and model development where we will be using this information obtained above.


In [None]:
#Removing outliers for model development
exclusion_questions=question_cnt.loc[question_cnt['row_id']<20,'content_id']
exclusion_user=train_user_gp_sum.loc[train_user_gp_sum['user_answer_count']<20,'user_id']
exclusion_user=exclusion_user.unique()

def outlier_analysis(df):
    df=df.loc[~df.user_id.isin(exclusion_user)]
    df=df.loc[~df.content_id.isin(exclusion_questions)]
    return df

# Feature Engineering<a name="feature"></a>

After getting a good understanding of the data, our next step is to generate features on the training data and build a model on them to generate predictions. Many of the flag variables created above translate directly into features. In addition to those, here are some of the features I've come up with - 

**User Based Features** - 
* Percentage Answered Correctly
* Total Questions Answered
* Std Deviation Answered Correctly
* Variance Answered Correctly
* User min value answered_correctly
* User max value answered_correctly

**Question Based Features** - 
* Percentage Correct
* Total count
* Question answered_correctly Variance
* Question answered_correctly Standard Deviation

**Other Features** - 
* Bundle Flag
* Tag Flag 
* Part Flag
* Tag Count
* Prior Question Had Explanation
* Prior Question Elapsed Time
* Lecture Flag (Based on whether a student has seen a lecture)
* Number of lectures seen by a student

The feature generation process is generally iterative: we start from intuition with an initial set of features, remove the ones that don't improve model performance, and then come up with newer features that might improve prediction accuracy. 
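
As a small illustration of the user-based features listed above, the sketch below computes them with a single grouped aggregation. It uses pandas on a made-up six-row frame so that it runs without a GPU; the notebook does the equivalent with cuDF on the real training data, and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy interaction log (made-up values, not competition data).
toy = pd.DataFrame({
    "user_id":            [1, 1, 1, 1, 2, 2],
    "answered_correctly": [1, 0, 1, 1, 0, 1],
})

# One row per user: mean, count, min, max, sample std and sample variance.
user_feats = (
    toy.groupby("user_id")["answered_correctly"]
       .agg(["mean", "count", "min", "max", "std", "var"])
       .reset_index()
)
user_feats.columns = ["user_id", "user_answer_mean", "user_answer_count",
                      "user_min", "user_max", "user_std", "user_var"]
print(user_feats)
```

Since the variance is just the square of the standard deviation, a tree-based model is unlikely to gain much from keeping both; dropping one of them is a cheap first candidate when pruning features.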

In [None]:
del train
del train_questions
#del exclusion_questions
gc.collect()

In [None]:
import sys
def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera,  https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key= lambda x: -x[1])[:20]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

In [None]:
del _5

In [None]:
#Function for Feature Engineering
def feature_engg(df):

    #Timestamp
   # df['timestamp'].fillna(0,inplace=True)
   # df['timestamp_hours']=df['timestamp']/(3600*1000)
   # df['timestamp_days']=df['timestamp_hours']/(24)

    #MultiTag
    questions=cudf.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
    questions=questions[['question_id','part','tags','bundle_id']]
    
    df = df.merge(questions, how = 'left', left_on = 'content_id',right_on = 'question_id')
    del df['question_id']

    df['tags'].fillna('',inplace=True)
    df['tag_count']=df['tags'].str.count(' ')+1
    df['tag1']=df["tags"].str.split(" ", n = 1, expand = True)[[0]]
    df['tag_count'].fillna(1,inplace=True)
    del df['tags']

    df=df.merge(tag_gp,on='tag1',how='left')
    df['tag_flag'].fillna(2,inplace=True)

    #NA Values of Prior Question had Explanation with False
    df['prior_question_had_explanation'].fillna(False, inplace=True)
    
    #Prior Question Elapsed Time MVT
    df['prior_question_elapsed_time'].fillna(25302, inplace=True)
        
    df=df.merge(train_questions_gp,left_on='content_id',right_on='question_id',how='left')
    df['q_correct'].fillna(0.69,inplace=True)
    df['q_count'].fillna(74,inplace=True)
    df['q_var'].fillna(0.18,inplace=True)
    df['q_std'].fillna(0.41,inplace=True)
    
    #Part correct answer rate
    df['part'].fillna(5,inplace=True)
    df=df.merge(train_part_gp,on=['part'],how='left')
    df['part_percent_correct'].fillna(0.69,inplace=True)
    
    #Bundle Flagging
    df=df.merge(train_bundle_ans_gp,on=['bundle_id'],how='left')
    df['bundle_flag'].fillna(2,inplace=True)
    
    df=df.merge(user_flags,on=['user_id'],how='left')
    df['user_answer_mean'].fillna(0.53,inplace=True)
    df['user_answer_count'].fillna(286,inplace=True)
    df['user_min'].fillna(0,inplace=True)
    df['user_max'].fillna(1,inplace=True)
    df['user_std'].fillna(0.47,inplace=True)
    df['user_var'].fillna(0.22,inplace=True)
    
    #Lec Flag
    df=df.merge(train_user_gp_lec,left_on='user_id',right_on='user_id',how='left')
    df['num_lec'].fillna(0,inplace=True)
    df['lec_flag'].fillna(0,inplace=True)
    

    return df
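
The pattern that `feature_engg` repeats for every lookup table is a left merge followed by a fallback fill for keys that were never seen in training. A minimal pandas sketch of that pattern is shown here; the frames are toy data, and the fallback constants are simply copied from the function above.

```python
import pandas as pd

# Lookup table built from training data (toy values).
user_flags_toy = pd.DataFrame({
    "user_id": [1, 2],
    "user_answer_mean": [0.75, 0.50],
    "user_answer_count": [4, 2],
})

# Test batch containing one user (id 3) never seen in training.
batch = pd.DataFrame({"row_id": [10, 11, 12], "user_id": [1, 3, 2]})

# Left merge keeps every batch row; unseen users get NaN in the feature columns.
batch = batch.merge(user_flags_toy, on="user_id", how="left")

# Replace the NaNs with fallback values (same constants as in feature_engg).
fallbacks = {"user_answer_mean": 0.53, "user_answer_count": 286}
batch = batch.fillna(value=fallbacks)
print(batch)
```

Assigning the result of `fillna` back to the frame, instead of calling `fillna(..., inplace=True)` on a selected column, avoids depending on in-place modification of a column selection, which recent pandas versions warn about.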

In [None]:
train_user_gp_lec=train_user_gp_lec.astype('int32')

In [None]:
train=cudf.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',low_memory=False,nrows=4*(10**7),dtype={'row_id': 'int64', 'timestamp': 'int64', 'user_id': 'int32', 'content_id': 'int16', 'content_type_id': 'int8',
                              'task_container_id': 'int16', 'user_answer': 'int8', 'answered_correctly': 'int8', 'prior_question_elapsed_time': 'float32', 
                             'prior_question_had_explanation': 'boolean',
                             })                                                                                                               
del train['timestamp']  
del train['row_id']
del train['task_container_id']

train = train[train.content_type_id == False]

del train['content_type_id']
gc.collect()

#train.drop(['timestamp','content_type_id'], axis=1,   inplace=True)

#Creating User level flags -
user_flags=train.groupby('user_id').agg({'answered_correctly':['mean','count','min','max','std','var']}).reset_index()
user_flags.columns=['user_id','user_answer_mean','user_answer_count','user_min','user_max','user_std','user_var']
user_flags

#Creating content level flags
train_questions=train.merge(questions,left_on='content_id',right_on='question_id',how='left')

train_questions_gp=train_questions.groupby('question_id').agg({'answered_correctly':['mean','count','var','std']}).reset_index()
train_questions_gp.columns=['question_id','q_correct','q_count','q_var','q_std']
del train_questions
gc.collect()

print(train.shape[0])
train=outlier_analysis(train)
train=feature_engg(train)
print(train.shape[0])

#del train
gc.collect()

train=train[['answered_correctly',
       'prior_question_elapsed_time', 'prior_question_had_explanation',
       'tag_count','tag_flag',
       'q_correct', 'q_count', 'q_var', 'q_std', 'part_percent_correct',
       'bundle_flag', 'user_answer_mean', 'user_answer_count', 'user_min',
       'user_max', 'user_std', 'user_var','num_lec','lec_flag']]
train['prior_question_had_explanation'] =train['prior_question_had_explanation'].astype(int)
train=train.astype('float32')

In [None]:
train.isnull().any()

# Random Forest & XGBoost Models<a name="model"></a>

This competition has been unique for me because we don't have visibility into the entire test dataset at once. We have to call an API to fetch batches of test data, iterate over them in a loop and make predictions for each batch. Because we can't see the test data, any errors in the code become visible only once we make a submission, which sometimes takes hours to run. Running multiple iterations and experimenting has therefore been a challenge in this competition.
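
One way to reduce the cost of a failed submission is to rehearse the inference loop locally before submitting. The sketch below is an assumption about how that could be done and is not part of the competition API: `mock_iter_test` is a hypothetical generator that replays held-out rows in small batches, and `predict_batch` is a placeholder for the real feature engineering and model call.

```python
def mock_iter_test(rows, batch_size):
    """Yield (batch, sample_prediction) pairs, mimicking a batch-serving API.

    Hypothetical helper for local testing; not the real riiideducation env.
    """
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sample = [{"row_id": r["row_id"], "answered_correctly": 0.5} for r in batch]
        yield batch, sample


def predict_batch(batch):
    """Placeholder model: predicts a constant probability for every row."""
    return [0.65 for _ in batch]


# Held-out rows (toy data).
held_out = [{"row_id": i, "user_id": i % 3, "content_id": 100 + i} for i in range(7)]

submitted = []
n_batches = 0
for batch, sample in mock_iter_test(held_out, batch_size=3):
    preds = predict_batch(batch)
    # Sanity check: exactly one prediction per served row.
    assert len(preds) == len(sample)
    submitted.extend(zip((r["row_id"] for r in batch), preds))
    n_batches += 1

print(f"{len(submitted)} predictions over {n_batches} batches")
```

Running the full loop on a few thousand held-out rows this way can surface shape, dtype and missing-column errors in seconds rather than after a submission.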

### Evaluation Metric : ROC AUC
This [article](https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69) is a good read on the evaluation metric for this competition, the area under the ROC curve (AUC).

#### What is the ROC - AUC Curve?
The AUC - ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a curve traced out over classification thresholds, and the AUC measures separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at ranking 1s above 0s. The ROC curve plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis.


![](https://miro.medium.com/max/542/1*pk05QGzoWhCgRiiFbz-oKQ.png)
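
To make the definition concrete, the sketch below computes AUC directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The helper name `auc_by_pairs` and the toy labels and scores are made up for illustration; in practice `sklearn.metrics.roc_auc_score` does this job. The example also suggests why AUC is better computed on predicted probabilities than on hard 0/1 labels: thresholding throws away the ranking information that AUC measures.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0]
proba = [0.9, 0.8, 0.6, 0.55, 0.2]               # predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]   # same predictions thresholded at 0.5

auc_proba = auc_by_pairs(y_true, proba)
auc_labels = auc_by_pairs(y_true, labels)
print(auc_proba, auc_labels)  # 1.0 0.75
```

The probabilities rank every positive above every negative, so the AUC is 1.0, while the thresholded labels tie three positive examples with one negative and the score drops to 0.75.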

In [None]:
  ######cuML######                
  #Random Forest
from sklearn.model_selection import train_test_split #Splitting data for model training
from sklearn.metrics import roc_auc_score
import cupy as cp

#from sklearn.ensemble import RandomForestClassifier

#train_input=train
#del model_data_input['answered_correctly']

#X=model_data.loc[:, model_data.columns != 'answered_correctly']#Selecting feature variables
X=train.loc[:,[ 'prior_question_elapsed_time',
       'prior_question_had_explanation',
        'tag_count','tag_flag', 'q_correct', 'q_count', 'q_var',
       'q_std', 'part_percent_correct', 'bundle_flag', 'user_answer_mean',
       'user_answer_count', 'user_min', 'user_max', 'user_std', 'user_var','num_lec','lec_flag']]#Selecting feature variables

Y=train['answered_correctly'] #Selecting the output columns
feature_list=X.columns

X_train,X_test,Y_train,Y_test=train_test_split(X, Y,test_size=0.3,random_state=1)
import cuml
from cuml import RandomForestClassifier as cuRF                
  # cuml Random Forest params     
cu_rf_params = {'n_estimators': 100,          
     'max_depth': 8,             
     'max_features':7,
        'min_rows_per_node':5,
     'rows_sample':0.7
               }            
                                  
cu_rf = cuRF(**cu_rf_params)     
cu_rf.fit(X_train, Y_train)      
pred=cu_rf.predict(X_test)
#acc_score = cu_rf.score(pred, Y_test)
cu_score = cuml.metrics.accuracy_score( Y_test, pred )
#sk_score = accuracy_score( asnumpy( Y_test ), asnumpy( pred ) )

#print( " cuml accuracy: ", cu_score )

print('ROC Score for Random Forest')
Y_test=cp.asnumpy(Y_test)
pred=cp.asnumpy(pred)
roc=roc_auc_score(Y_test, pred)
print(" - ROC: {:.5}".format(roc))
#print()


In [None]:
#Largest objects in memory, reusing sizeof_fmt from the earlier memory check

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key= lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

In [None]:
del pred
del train
del Y
del lec_watchers
del X
gc.collect()

In [None]:
import xgboost
from xgboost import XGBClassifier
from xgboost import plot_importance

#Parameters for XGBoost 
params1 = {
    'max_depth' : 7,
   # 'max_leaves' : 2**4,
    'alpha':0.1, 
   # 'lambda' : 0.2,
    'min_child_weight':2,
    'subsample':0.7,
    'tree_method' : 'gpu_hist',
    'learning_rate': 0.1, #default = 0.3,
    'colsample_bytree':0.7,
    'eval_metric':'auc', 
    'objective' : 'binary:logistic',
    'grow_policy' : 'lossguide',
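}

train_matrix = xgboost.DMatrix(data = X_train, label = Y_train)
test_matrix=xgboost.DMatrix(data = X_test)
#Number of boosting rounds is passed to xgboost.train directly; 'n_estimators' is only read by the sklearn wrapper
xgb = xgboost.train(params1, dtrain = train_matrix, num_boost_round = 200)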

predicts = xgb.predict(test_matrix)
roc = roc_auc_score(Y_test.astype('int32'), predicts)
print('ROC for XGBoost model')
print(roc)

In [None]:
from matplotlib import pyplot

#Gain-based importance scores per feature
print(xgb.get_score(importance_type='gain'))
plot_importance(xgb)
pyplot.show()

Features related to the mean and standard deviation of responses for users and questions are the most important for the model. Some of the flag features around tags, bundles and lectures don't do so well. It's likely that the information in these features has already been captured by the features above. 

We see that our XGBoost model performs much better than Random Forest on the validation dataset. Therefore, we will use the XGBoost model for the final predictions. 

# Final Predictions<a name="preds"></a>

In [None]:
model=xgb

In [None]:
#Importing the competition package
import riiideducation
env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
#Read the test data in batches (iter_test was created in the previous cell)
for test_df, sample_prediction_df in iter_test:
    test_df = cudf.from_pandas(test_df)
    test_df=test_df.loc[test_df['content_type_id'] == 0]
    test_df = feature_engg(test_df)

    row_ids=test_df['row_id']
    
    test_df=test_df[['prior_question_elapsed_time',
       'prior_question_had_explanation',
       'tag_count','tag_flag', 'q_correct', 'q_count', 'q_var',
       'q_std', 'part_percent_correct', 'bundle_flag', 'user_answer_mean',
       'user_answer_count', 'user_min', 'user_max', 'user_std', 'user_var','num_lec','lec_flag']]
    
    test_df['prior_question_had_explanation'] =test_df['prior_question_had_explanation'].astype('float32')
    test_df=test_df.astype('float32')
    
    test_matrix=xgboost.DMatrix(data = test_df)
#    pred=model.predict_proba(test_df)[:,1]
    pred=model.predict(test_matrix)
#     rf_pred = model.predict_proba(test_df)[:,1]
#     lg_pred = lgb_model.predict(test_df)
   # row_ids=test_df['row_id']
    test_df['row_id']=row_ids
    test_df['answered_correctly'] =pred
    test_df = test_df.to_pandas()
    #print(test_df)
    env.predict(test_df[['row_id', 'answered_correctly']])


In [None]:
test_df

### References 
I learned about RAPIDS from this great notebook. Please take a look if you want to learn more about RAPIDS - https://www.kaggle.com/andradaolteanu/answer-correctness-rapids-crazy-fast

### Next Steps - 
* Use a higher volume of training data
* Continue working on feature engineering
* Experiment with model ensembling and hyperparameter optimization for XGBoost
* Create a running lookup table of user performance that is updated as batches of the test dataset are processed, and use it to make predictions
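
As a rough sketch of the last idea, a running lookup table can be as simple as a dictionary from `user_id` to a pair of counters that is updated whenever the true answers for a batch become known. The class name `RunningUserStats` and the fallback value are hypothetical; how and when the true labels arrive depends on the competition API and is not modelled here.

```python
from collections import defaultdict


class RunningUserStats:
    """Incrementally tracks each user's answer count and correct-answer rate."""

    def __init__(self, default_mean=0.53):
        # default_mean is the fallback for unseen users (assumed value,
        # borrowed from the user_answer_mean fill in feature_engg).
        self.default_mean = default_mean
        # user_id -> [n_correct, n_answered]
        self._stats = defaultdict(lambda: [0, 0])

    def update(self, user_ids, answered_correctly):
        """Fold one batch of (user, outcome) pairs into the table."""
        for uid, correct in zip(user_ids, answered_correctly):
            entry = self._stats[uid]
            entry[0] += int(correct)
            entry[1] += 1

    def features(self, user_id):
        """Return (mean, count) for a user, falling back for unseen users."""
        if user_id not in self._stats:
            return self.default_mean, 0
        n_correct, n_answered = self._stats[user_id]
        return n_correct / n_answered, n_answered


stats = RunningUserStats()
stats.update([1, 1, 2], [1, 0, 1])   # labels from a first batch
stats.update([1], [1])               # labels from a second batch
print(stats.features(1), stats.features(2), stats.features(99))
```

Seeding the table with the counts from `user_flags` before the test loop starts would let the training history and the new test-time answers contribute to the same running average.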