# Note:
**I noticed a lag when loading the entire notebook. Please scroll down to the very bottom and then scroll back up; the entire notebook, including the figures, will load after that.**
<br>
**Please let me know in the comments if there are any issues. It will help me fix them.**

# INTRODUCTION
This notebook is for people who don't have a fancy workstation or a cloud compute budget. In this notebook I demonstrate how to do data preprocessing, aggregation, merging, etc. on the **complete data** without running into memory issues or having to wait a long time for results.
I am using the GPU-accelerated **cudf** package by RAPIDS, which has an API very similar to pandas (almost mirrored), for data preprocessing, and **Plotly** for visualization.


<br>Content:
1. [About Rapids](#1)
    1. [Installation on Kaggle](#2)
1. [Load Data](#3)
1. [Explore Individual Features](#4)
    1. [train.csv](#5)
    2. [questions.csv](#6)
    3. [lectures.csv](#7)
1. [Explore Feature Groups](#8)
    * In Progress..
1. Reference
    1. EdNet Paper: https://arxiv.org/pdf/1912.03072.pdf
    1. Rapids ai: https://rapids.ai/about.html
    1. Plotly Express: https://plotly.com/python/plotly-express/

## About Rapids
The RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.<br><br>
RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar dataframe API that integrates with a variety of machine learning algorithms for end-to-end pipeline accelerations without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.<br><br>
Some RAPIDS projects include cuDF, a pandas-like dataframe manipulation library; cuML, a collection of machine learning libraries that will provide GPU versions of algorithms available in scikit-learn; cuGraph, a NetworkX-like accelerated graph analytics library.


<a id="1"></a> <br>
### Rapids Installation on Kaggle
* Add the [rapids dataset](https://www.kaggle.com/cdeotte/rapids) to your notebook
* Turn on the GPU
* Run the following code snippet, which will unzip the data and add the necessary packages to the sys path

In [None]:
import gc
gc.collect()

In [None]:
import sys
!cp ../input/rapids/rapids.0.15.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/
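
After the installation cell has run, it can help to confirm that the RAPIDS packages are actually discoverable on the modified `sys.path` before importing them. The sketch below uses only the standard library; `check_packages` is a hypothetical helper name, not part of RAPIDS, and it only tests whether an import spec can be found, not whether the GPU libraries load correctly.

```python
import importlib.util


def check_packages(names):
    """Return {package name: True if Python can locate the package}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# cudf, cuml and cupy are the RAPIDS packages imported later in this notebook.
for name, found in check_packages(["cudf", "cuml", "cupy"]).items():
    print(f"{name}: {'found' if found else 'NOT found'}")
```

If any package is reported as not found, the `sys.path` entries above are the first thing to re-check.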

### Load Data

In [None]:
import numpy as np
import dask.dataframe as dd
import pandas as pd
from time import time
from contextlib import contextmanager

# rapids libs
import cuml
import cudf
import cupy

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px

In [None]:
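# Drop dataframes left over from a previous run to free GPU memory.
# Popping each name separately means one missing name does not skip the others.
for name in ("train", "question", "lectures"):
    globals().pop(name, None)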

In [None]:
@contextmanager
def timer(name):
    t0 = time()
    yield
    print(f'[{name}] done in {time() - t0:.2f} s')



with timer("Data Loading Time"):
    train = cudf.read_csv("/kaggle/input/riiid-test-answer-prediction/train.csv",)
    question = cudf.read_csv("/kaggle/input/riiid-test-answer-prediction/questions.csv",)
    lectures = cudf.read_csv("/kaggle/input/riiid-test-answer-prediction/lectures.csv",)
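
The `timer` helper above only prints the elapsed time. A possible variation, sketched here with the standard library only, also records the elapsed seconds so that load times can be compared later. The `results` argument is an assumption added for illustration and is not part of the original helper; `sleep` stands in for a slow call such as `cudf.read_csv`.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timer(name, results=None):
    """Time the enclosed block; optionally store the elapsed seconds in `results`."""
    t0 = perf_counter()
    try:
        yield
    finally:
        # The finally clause reports the time even if the block raises.
        elapsed = perf_counter() - t0
        if results is not None:
            results[name] = elapsed
        print(f"[{name}] done in {elapsed:.2f} s")


timings = {}
with timer("sleep 0.05 s", timings):
    sleep(0.05)  # stand-in for a slow call such as reading a large CSV
```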
    

<a id="4"></a>
## Explore Individual Features

<a id="5"></a>
## train.csv

* ```row_id```: (int64) ID code for the row.

* ```timestamp```: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

* ```user_id```: (int32) ID code for the user.

* ```content_id```: (int16) ID code for the user interaction

* ```content_type_id```: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

* ```task_container_id```: (int16) ID code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

* ```user_answer```: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

* ```answered_correctly```: (int8) if the user responded correctly. Read -1 as null, for lectures.

* ```prior_question_elapsed_time```: (float32) the average time a user took to solve each question in the previous bundle. [refer](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/189768)

* ```prior_question_had_explanation```: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
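
The dtypes listed above determine how much memory each row needs, which matters when the whole table has to fit on one GPU. The sketch below adds up the bytes per row from those dtypes. `DTYPE_BYTES`, `TRAIN_DTYPES` and the two helper functions are hypothetical names, the row count in the example is arbitrary, and the estimate ignores null masks and any library overhead, so treat it as a rough lower bound.

```python
# Bytes per value for each dtype named in the column list above.
DTYPE_BYTES = {"int64": 8, "int32": 4, "int16": 2, "int8": 1, "float32": 4, "bool": 1}

# Column -> dtype, copied from the column descriptions above.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "bool",
}


def bytes_per_row(dtypes):
    """Sum the storage size of one value from every column."""
    return sum(DTYPE_BYTES[dtype] for dtype in dtypes.values())


def estimate_megabytes(n_rows, dtypes=TRAIN_DTYPES):
    """Rough size in MB (10**6 bytes) for n_rows rows, ignoring overhead."""
    return n_rows * bytes_per_row(dtypes) / 1e6


print(bytes_per_row(TRAIN_DTYPES))    # 32
print(estimate_megabytes(1_000_000))  # 32.0
```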

### timestamp

In [None]:
print("summary statistics of the feature")
train.timestamp.describe().to_frame()

The timestamp values are in milliseconds and are hard to interpret. Converting them into days and looking at the statistics will be more intuitive.
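
The conversion is plain arithmetic: one day has 24 × 60 × 60 × 1000 milliseconds. A minimal sketch of that factor on single values follows; `MS_PER_DAY` and `ms_to_days` are illustrative names, not taken from the notebook.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def ms_to_days(ms):
    """Convert a duration in milliseconds to days."""
    return ms / MS_PER_DAY


print(MS_PER_DAY)              # 86400000
print(ms_to_days(86_400_000))  # 1.0
print(ms_to_days(43_200_000))  # 0.5
```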

In [None]:
to_days_factor = 24*60*60*1000.
(train.timestamp/to_days_factor).describe().round(3).to_frame().rename({'timestamp': 'timestamp_days'}, axis=1)

On average a user spends 89.16 days, or nearly three months, on TOEIC exam preparation. That value looks a bit too high!

In [None]:
df = train.sample(frac=0.01).to_pandas()

import plotly.express as px
fig = px.histogram(df, x="timestamp", title="Histogram of timestamp")
fig.show()

df['timestamp_in_days'] = np.log10(((df['timestamp']+1)/to_days_factor))

fig = px.histogram(df,x='timestamp_in_days', range_x=[-5, 3], 
                  title="Histogram of timestamp in days (x axis log scale)")
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = [i for i in range(-5,4)],
        ticktext = [str(10**i) + " days" for i in range(-5,4)]
    )
)

fig.show()
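
The tick labels on the log-scale axis above are powers of ten in days, which are hard to read at the small end. As an aid to interpretation, the sketch below turns an exponent into a rough duration in more familiar units. `describe_log10_days` is a hypothetical helper and the unit boundaries are just a convenient choice.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def describe_log10_days(exponent):
    """Turn a tick at 10**exponent days into a rough human-readable duration."""
    seconds = (10.0 ** exponent) * SECONDS_PER_DAY
    if seconds < 60:
        return f"{seconds:.4g} seconds"
    if seconds < 3600:
        return f"{seconds / 60:.4g} minutes"
    if seconds < SECONDS_PER_DAY:
        return f"{seconds / 3600:.4g} hours"
    return f"{seconds / SECONDS_PER_DAY:.4g} days"


# Same exponent range as the tick values used in the histogram.
for i in range(-5, 4):
    print(i, "->", describe_log10_days(i))
```

Read this way, the far left of the axis corresponds to under a second since the user's first event, and the far right to roughly three years.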

But the timestamp feature above is recorded whenever there is a user interaction. So in the following analysis I will find the maximum timestamp at the user level and compute the summary statistics and histogram again.

In [None]:
df = train.groupby(["user_id"]).timestamp.max().to_frame().to_pandas()
df['user_max_timestamp_in_days'] = (df['timestamp']+1)/(to_days_factor)
df['user_max_timestamp_in_days_log'] = np.log10(df['user_max_timestamp_in_days'])

df['user_max_timestamp_in_days'].describe().astype(float).round(3).to_frame()

The average duration for which an arbitrary user uses the app is 60 days, or nearly two months, which is less than the mean value of the timestamp feature.
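
One plausible reason for the gap (an assumption, not something verified on the data here) is that the row-level mean weights each user by the number of interactions, so long-running heavy users dominate it. The toy sketch below, with invented values and standard-library code only, shows how the two averages can differ.

```python
from collections import defaultdict
from statistics import mean

# Toy (user_id, timestamp_in_days) rows; the values are made up for illustration.
rows = [
    ("casual", 1),
    ("heavy", 100), ("heavy", 150), ("heavy", 200),
]

# Mean over every interaction: the heavy user contributes three of the four rows.
row_level_mean = mean(ts for _, ts in rows)

# Mean over users of each user's maximum timestamp: every user counts once.
max_per_user = defaultdict(int)
for user, ts in rows:
    max_per_user[user] = max(max_per_user[user], ts)
user_level_mean = mean(max_per_user.values())

print(row_level_mean)   # 112.75
print(user_level_mean)  # 100.5
```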

In [None]:
fig = px.histogram(df, x='user_max_timestamp_in_days_log', range_x=[-5, 3], 
                  title="Histogram of maximum user timestamp in days (x axis log scale)",)
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = [i for i in range(-5,4)],
        ticktext = [str(10**i) + " days" for i in range(-5,4)]
    )
)

fig.update_layout(
    showlegend=False,
    annotations=[
        dict(
            x=-2,
            y=8000,
            xref="x",
            yref="y",
            text="Could be Dropouts",
            showarrow=True,
            arrowhead=7,
            ax=-200,
            ay=-50
        ),
        dict(
            x=1.6,
            y=4000,
            xref="x",
            yref="y",
            text="Serious Students",
            showarrow=True,
            arrowhead=7,
            ax=-200,
            ay=-100
        ),
    ]
)


fig.show()

The bimodality of the above histogram is interesting. This figure suggests that there are broadly two kinds of users in the Santa app. The first category just comes in, checks out the app, and churns away. I call them dropouts; their distribution peaks on the left. The other category consists of students who spend significant time in the app. They could be serious students; their distribution peaks on the right side of the histogram.<br><br>
But this categorization has the following drawbacks:
* Users who had just joined the app when this dataset was sampled, and who are very serious about cracking TOEIC, could be misinterpreted as dropouts because we don't have the details of their future interactions in the training dataset.
* The train data may have been sampled in such a way that some users end up with very few interactions, artificially introducing some imbalance so that the model is robust enough to predict the behaviour of newly joined users. In that case, too, my hypothesis would categorize them as dropouts, which is not actually true.

### user_id

In [None]:
df = train.groupby("user_id")['row_id'].count(). \
to_frame().sort_values("row_id", ascending=False).rename({'row_id': 'number_of_user_interactions'}, axis=1).reset_index().to_pandas()
df['user_id'] =  df.user_id.astype(str) + "-" 


fig = px.bar(df.head(50), 
             y='user_id', 
             x='number_of_user_interactions', 
             title="Bar Plot of top 50 users",
            orientation='h',
            )
fig.show()

In [None]:
fig = px.bar(df.tail(50), 
             y='user_id', 
             x='number_of_user_interactions', 
             title="Bar Plot of bottom 50 users",
            orientation='h')
fig.show()

### content_id and content_type_id

In [None]:
df = train.groupby(["content_id", 'content_type_id'])['row_id'].count(). \
to_frame().sort_values("row_id", ascending=False).rename({'row_id': 'number_of_interactions_for_content_id'}, axis=1).reset_index().to_pandas()


df['content_type_id'] = df['content_type_id'].map({0:'questions(0)', 1: 'lectures(1)'})

fig = px.scatter(df, 
             x='content_id', 
             y='number_of_interactions_for_content_id', 
             title="content ids vs number of interactions in log scale",
             log_x = True,
             log_y=True,
                color='content_type_id',
            )
fig.show()

In [None]:
fig = px.box(df, x="content_type_id", y="number_of_interactions_for_content_id",
             log_y=True,
            labels={
                     "number_of_interactions_for_content_id": "number_of_interactions_per_content_id"
                 },
                title="Box Plot of number of interactions per content id",
            )
fig.show()

### task_container_id and content_type_id

In [None]:
df = train.groupby(["task_container_id", "content_type_id"]).agg({"row_id": "count", "content_id": 'nunique'})
df.shape

train.groupby(['content_type_id']).content_id.nunique().to_frame().rename({"content_id": "number_of_unique_content_ids"}, axis=1)

### task_container_id

In [None]:
df = train.groupby("task_container_id")['row_id'].count(). \
to_frame().sort_values("row_id", ascending=False).rename({"row_id": "interactions_per_task_container_id"}, axis=1).reset_index().to_pandas()


fig = px.bar(df.head(100), x="task_container_id", y="interactions_per_task_container_id",
             log_y=True,
                title="Bar Plot of number of interactions per task container id: Top 100",
            )
fig.show()

In [None]:
fig = px.scatter(df, x="task_container_id", y="interactions_per_task_container_id",
             log_y=True,
                title="Scatter Plot of number of interactions per task container id",
            )
fig.show()

### user_answer

In [None]:
df = train.groupby("user_answer")['row_id'].count(). \
to_frame().sort_values("row_id", ascending=False).rename({"row_id": "count"}, axis=1).reset_index().to_pandas()


df['user_answer'] = df['user_answer'].map({-1: "lectures(-1)", 0: "choice-0", 1: "choice-1", 2: "choice-2", 3: "choice-3", })
fig = px.bar(df, x="user_answer", y="count",
             log_y=True,
                title="Bar Plot of user answer category wise count",
             color = "user_answer",
            )
fig.update_xaxes(categoryorder='array', categoryarray= ["choice-0","choice-1","choice-2","choice-3", "lectures(-1)"])

fig.show()

In [None]:
df['percentage'] = ((df['count'])/df['count'].sum()).round(4)*100

fig = px.pie(df, values='percentage', names='user_answer', title='Pie chart of User answer')
fig.update_layout(legend_title_text='user_answer')
fig.show()

### prior_question_elapsed_time

In [None]:
train.prior_question_elapsed_time.describe().to_frame().astype(int)

train.prior_question_elapsed_time.describe().to_frame().astype(int)/1000.

print("percentage of missing values=",((train.prior_question_elapsed_time.isna().sum()*100)/train.shape[0]).round(2))

In [None]:
df = train.sample(frac=0.02).to_pandas()
df['prior_question_elapsed_time(seconds)'] = df['prior_question_elapsed_time']/1000.
df = df.dropna()

fig = px.histogram(df, x='prior_question_elapsed_time(seconds)', 
                  title="Histogram of prior_question_elapsed_time in seconds",
                   log_y=False, histnorm='probability',
                   color='prior_question_had_explanation',
                   range_x = [0, 75]
                  )


fig.show()

In [None]:
fig = px.box(df, x="prior_question_had_explanation", y="prior_question_elapsed_time(seconds)",
             log_y=True,
            
                title="Box Plot of prior_question_elapsed_time by prior_question_had_explanation",
            )
fig.show()

### prior_question_had_explanation

In [None]:
df = train.groupby(['prior_question_had_explanation', 'content_type_id']).agg({"row_id": "count"}).reset_index().rename({"row_id": "count"}, axis=1).to_pandas()
df['content_type_id'] = df['content_type_id'].map({0: "question", 1: "lecture"})

fig = px.histogram(df, x="prior_question_had_explanation", y="count",
             log_y=False,
                title="Bar Plot of prior_question_had_explanation count by content type",
             color = "content_type_id",
                   # histnorm="percent",
            )

fig.show()

In [None]:
del df

<a id="6"></a>
## questions.csv

Metadata for the questions posed to users.

* ```question_id```: foreign key for the train/test content_id column, when the content type is question (0).

* ```bundle_id```: code for which questions are served together.

* ```correct_answer```: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

* ```part```: the relevant section of the TOEIC test.

* ```tags```: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
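
As the `correct_answer` description says, comparing it with `user_answer` tells whether the user was right. The sketch below shows that lookup on a few invented rows using plain Python dictionaries. The toy values and the `was_correct` helper are assumptions for illustration; on the real data the same idea would be a merge on `content_id` / `question_id` restricted to question rows.

```python
# Toy stand-ins for questions.csv and train.csv rows; all values are invented.
questions = [
    {"question_id": 0, "correct_answer": 2},
    {"question_id": 1, "correct_answer": 0},
]
interactions = [
    {"content_id": 0, "content_type_id": 0, "user_answer": 2},
    {"content_id": 1, "content_type_id": 0, "user_answer": 3},
    {"content_id": 1, "content_type_id": 1, "user_answer": -1},  # a lecture row
]

# question_id is the foreign key for content_id when content_type_id == 0.
correct_by_question = {q["question_id"]: q["correct_answer"] for q in questions}


def was_correct(row):
    """Return 1 or 0 for question rows and -1 (null) for lecture rows."""
    if row["content_type_id"] != 0:
        return -1
    return int(row["user_answer"] == correct_by_question[row["content_id"]])


print([was_correct(row) for row in interactions])  # [1, 0, -1]
```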

In [None]:
question.describe()

In [None]:
question.head()

### question_id

In [None]:
fig = px.scatter(question.to_pandas(), y = "question_id",
                title="Scatter Plot of question_id vs index",
            )
fig.show()

### bundle_id

In [None]:
df = question.groupby(['bundle_id'], as_index=False).agg({'question_id': 'nunique'}). \
sort_values('question_id', ascending=False).rename({'question_id': "questions_per_bundle"}, axis=1).to_pandas()

fig = px.scatter(df, x = "bundle_id", y = "questions_per_bundle",
                title="Scatter Plot of number of questions per bundle vs bundle_id",
            )
fig.show()

### correct_answer

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

temp = question.to_pandas().replace({"correct_answer":{0:"choice-0", 1:"choice-1",2:"choice-2",3:"choice-3"}},)

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Histogram(x= temp.correct_answer, name="count"),
    secondary_y=False,
)

fig.add_trace(
    go.Histogram(x=temp.correct_answer, name="percentage", histnorm='percent'),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="correct answer histogram"
)

# Set x-axis title
fig.update_xaxes(title_text="correct answer")

# Set y-axes titles
fig.update_yaxes(title_text="count", secondary_y=False)
fig.update_yaxes(title_text="percentage", secondary_y=True)
fig.update_layout(showlegend=False)
fig.update_traces(marker_color='green',)

fig.show()

### part

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

temp = question.to_pandas()
temp = temp.replace({"part": {i: "part_" +str(i) for i in temp.part.unique()}},)

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Histogram(x= temp.part, name="count"),
    secondary_y=False,
)

fig.add_trace(
    go.Histogram(x=temp.part, name="percentage", histnorm='percent'),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="histogram of question part"
)

# Set x-axis title
fig.update_xaxes(title_text="part")

# Set y-axes titles
fig.update_yaxes(title_text="count", secondary_y=False)
fig.update_yaxes(title_text="percentage", secondary_y=True)
fig.update_layout(showlegend=False)
fig.update_traces(marker_color='green',)

fig.show()

In [None]:
question.groupby("bundle_id").question_id.count().to_frame().sort_values("question_id", ascending=False).head()

In [None]:
question[question.bundle_id==6940]

### tags

In [None]:
def encode_tags(x):
    # A missing `tags` value is not a string, so .split() raises AttributeError
    try:
        tags = x['tags'].split(" ")
        return {i:1 for i in tags}
    except AttributeError:
        return {}

question_pd = question.to_pandas()
question_pd['tag_encoded_dict'] = question_pd.apply(lambda x: encode_tags(x), axis=1)

tmp = pd.DataFrame(question_pd.tag_encoded_dict.tolist()).fillna(0)

tmp = tmp[tmp.columns.astype(int).sort_values().astype(str)]
from sklearn.preprocessing import StandardScaler

std = StandardScaler()
tmp = std.fit_transform(tmp)

#tags_encoded_df = cudf.DataFrame.from_pandas(tmp)


tsne = cuml.TSNE(n_components=2, perplexity=30, random_state=0, learning_rate=400)
ret = pd.DataFrame(tsne.fit_transform(tmp), index=question_pd.index, columns=["component_1", "component_2"])

ret["part"] = question_pd["part"]

ret['part'] = "part_" + ret["part"].astype(str)

fig = px.scatter(ret, x="component_1", y="component_2",
             log_y=False, log_x=False,
                title="TSNE Plot of tags clustering",
                 color="part",
            )
fig.show()
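
The cell above first turns each space-separated `tags` string into a row of 0/1 indicators (one column per tag) before scaling and running t-SNE. A minimal standard-library sketch of just that encoding step, on made-up tag strings, is shown here; `multi_hot` is a hypothetical helper and not the notebook's own function.

```python
def multi_hot(tag_strings):
    """Encode space-separated tag strings as 0/1 rows over the sorted tag vocabulary."""
    # Missing values (anything that is not a string) become an empty tag set.
    tag_sets = [set(s.split()) if isinstance(s, str) else set() for s in tag_strings]
    # Sort tags numerically so column order matches the integer tag codes.
    vocabulary = sorted(set().union(*tag_sets), key=int)
    matrix = [[1 if tag in tags else 0 for tag in vocabulary] for tags in tag_sets]
    return vocabulary, matrix


vocabulary, matrix = multi_hot(["51 131", "131", None])
print(vocabulary)  # ['51', '131']
print(matrix)      # [[1, 1], [0, 1], [0, 0]]
```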

<a id="7"></a>
## lectures.csv

Metadata for the lectures watched by users as they progress in their education.

* ```lecture_id```: foreign key for the train/test content_id column, when the content type is lecture (1).

* ```part```: top level category code for the lecture.

* ```tag```: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

* ```type_of```: brief description of the core purpose of the lecture

In [None]:
lectures.head()

In [None]:
lectures.info()

### lecture_id

In [None]:
fig = px.scatter(lectures.to_pandas(), y = "lecture_id",
                title="Scatter Plot of lecture_id vs index", log_y=True,
            )
fig.show()

### part

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

temp = lectures.to_pandas()
temp = temp.replace({"part": {i: "part_" +str(i) for i in temp.part.unique()}},)

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Histogram(x= temp.part, name="count"),
    secondary_y=False,
)

fig.add_trace(
    go.Histogram(x=temp.part, name="percentage", histnorm='percent'),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="lectures part histogram"
)

# Set x-axis title
fig.update_xaxes(title_text="part")

# TOEIC parts run from 1 to 7, so include part_7 in the explicit category order
fig.update_xaxes(categoryorder='array', categoryarray= ["part_"+str(i) for i in range(1,8)])

# Set y-axes titles
fig.update_yaxes(title_text="count", secondary_y=False)
fig.update_yaxes(title_text="percentage", secondary_y=True)
fig.update_layout(showlegend=False)
fig.update_traces(marker_color='green',)

fig.show()

### type_of

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

temp = lectures.to_pandas()
# temp = temp.replace({"part": {i: "part_" +str(i) for i in temp.part.unique()}},)

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Histogram(x= temp.type_of, name="count"),
    secondary_y=False,
)

fig.add_trace(
    go.Histogram(x=temp.type_of, name="percentage", histnorm='percent'),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="lectures type_of histogram"
)

# Set x-axis title
fig.update_xaxes(title_text="type_of")

# Set y-axes titles
fig.update_yaxes(title_text="count", secondary_y=False)
fig.update_yaxes(title_text="percentage", secondary_y=True)
fig.update_layout(showlegend=False)
fig.update_traces(marker_color='green',)

fig.show()

### tag

In [None]:
df = lectures.to_pandas().groupby("tag").lecture_id.count(). \
to_frame().sort_values("lecture_id").rename({"lecture_id":"count"}, axis=1).reset_index()

fig = px.bar(df, x = "tag",y="count",
                title="Bar Plot of tag vs count", log_y=False,
            )
fig.show()