# Contents


1. Adding data set
2. About train data 
    1. Data cleaning
    2. Data exploring
        1. Basic data exploration
        2. Advanced data exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
import seaborn as sns
import datetime

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Adding data set
* riiid-test-answer-prediction
* riiid-parquet-files

>Since the train dataset is huge (about 5 GB), I added the riiid-parquet-files dataset [here](https://www.kaggle.com/ryati131457/riiid-parquet-files) to this kernel and use its train.parquet file to load the train data. This reduced the loading time from about 10 minutes to about 10 seconds. Thanks, Ryati!

In [None]:
%%time
train = pd.read_parquet('../input/riiid-parquet-files/train.parquet')

print('Train size:', train.shape)

In [None]:
# import datatable as dt

# # reading data from csv using datatable and converting to pandas
# train_data = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()

# # writing dataset as pickle
# train_data.to_pickle("riiid_train.pkl.gzip")

# # load pickled train data
# train_data = pd.read_pickle("../input/riiid_train.pkl.gzip")

# print("Train size:", train_data.shape)

In [None]:
# you can only call make_env() once, so don't lose it!
# how to check the env?
# env = riiideducation.make_env()

# 2. About train data

* row_id: (int64) ID code for the row.

* timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.

* user_id: (int32) ID code for the user.

* content_id: (int16) ID code for the user interaction

* content_type_id: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

* task_container_id: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

* user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

* answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.

* prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.

* prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
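
The dtypes listed above can be written down as a mapping and passed to `pd.read_csv`. The following is only a minimal sketch on two made-up rows, not the loading code used in this notebook. It assumes pandas' nullable `"boolean"` dtype for `prior_question_had_explanation`, because a plain `bool` column cannot hold nulls.

```python
import io
import pandas as pd

# Column dtypes as listed in the description above.
# "boolean" (nullable) is an assumption made here so that nulls can be kept.
train_dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

# Two made-up rows, used only to show the mapping being applied.
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)
sample = pd.read_csv(sample_csv, dtype=train_dtypes)
print(sample.dtypes)
```

Smaller integer and float types matter here mainly because of the size of the full train file.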

**Problems from exploring the train data file:**

* Data types are not appropriate.
* Some columns have missing values.

## 2.1 Data cleaning

### 2.1.1 Convert data types
According to the description of the train data from Riiid Education, we need to convert the columns to the corresponding data types. But first, let's look at all the dtypes in train.

In [None]:
train.info()

It seems the data types are already converted in the parquet version of the train data, so let's leave them as they are and move on to the next step.

In [None]:
# it seems most data types are already converted in the parquet file; we only need to
#  convert prior_question_had_explanation (the line below casts it to int)
# train['prior_question_had_explanation'] = train.prior_question_had_explanation.astype('int')
# train.info()

### 2.1.2 Deal with missing data
Only the two columns below have missing values:
* NaN values in prior_question_elapsed_time: fill NaN values with 0, representing the starting time.
* NaN values in prior_question_had_explanation: these correspond to lectures in between and should be ignored. Fill NaN values with -1 to indicate lectures.
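
The two fill rules above can be sketched on a toy frame. This is an illustration under assumptions, not the cleaning applied to `train`: the values are made up, and the explanation flag is cast to `int8` so that -1 can serve as the marker.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first row of a user and a lecture row carry NaN.
toy = pd.DataFrame({
    "content_type_id": [0, 0, 1, 0],
    "prior_question_elapsed_time": [np.nan, 24000.0, np.nan, 18000.0],
    "prior_question_had_explanation": [np.nan, False, np.nan, True],
})

# Count the missing values per column.
print(toy.isna().sum())

filled = toy.copy()
# Rule 1: missing elapsed time becomes 0.
filled["prior_question_elapsed_time"] = (
    filled["prior_question_elapsed_time"].fillna(0)
)
# Rule 2: missing explanation flag becomes -1 (True -> 1, False -> 0).
filled["prior_question_had_explanation"] = (
    filled["prior_question_had_explanation"]
    .astype("float")
    .fillna(-1)
    .astype("int8")
)
print(filled)
```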

In [None]:
# check all nan values
train.isna().sum()

According to the data description, both columns are null for a user's first question bundle or lecture.

In [None]:
# fill NaN values in prior_question_had_explanation with 0 (i.e. False)
# train.prior_question_had_explanation.fillna(0, inplace=True)

In [None]:
train.head()

## 2.2 Data exploring

### 2.2.1 Basic data exploration
* timestamp: the time (in days) between this user interaction and the first event completion from that user.
* user_id: find the number of unique users.
* content_id: find the total number of different content ids and the percentage of questions in the total content ids.
* content_type_id: find the number of questions and lectures in the train data respectively. (0 represents a question, 1 represents a lecture)
* task_container_id: find the number of unique id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.
* user_answer: find user's answer to the question, -1 for lectures.
* answered_correctly: find the correct responses, -1 for lectures.

timestamp: ms to days

In [None]:

# 1 day = 86400000 ms
MS_PER_DAY = 86400000
# vectorized division gives the same result as .apply, but is much faster on a large column
time_spent_dis = train.timestamp / MS_PER_DAY
fig = plt.figure(figsize=(6,4))
time_spent_dis.plot.hist(bins=50)
# mark the median with a dashed red line
plt.axvline(time_spent_dis.median(), color = 'r',linestyle = '--', linewidth=1)
plt.title('Histogram of Timestamp')
plt.xlabel('Days between this user interaction and the first event completion from that user')
plt.show()

In [None]:
print('From the histogram above, we can see that most users were not active on the app for very long, and the median time since the first event is about {} days.'.format(
round(time_spent_dis.median())))

user_id:

In [None]:
# user_id
print('Number of unique users in train dataset: {}'.format(train.user_id.nunique()))


content_id:

In [None]:
# content_id
# value_counts() is already sorted by count (descending); the ids are in its index
top_10_content_list = train.content_id.value_counts().index[:10].tolist()
print('There are {} unique content ids in the train set. The 10 most frequently used content ids are {}.'.format(
train.content_id.nunique(), top_10_content_list))
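
A side note on `value_counts()`: the distinct values end up in the index of the result, while the Series values are the counts. A minimal sketch on a made-up Series (the ids are illustrative only):

```python
import pandas as pd

# Made-up content ids: 7 appears three times, 3 twice, 9 once.
content_id = pd.Series([7, 7, 7, 3, 3, 9], name="content_id")

# value_counts() sorts by count in descending order by default.
counts = content_id.value_counts()

most_frequent_ids = counts.index[:2].tolist()  # the ids themselves
their_counts = counts.iloc[:2].tolist()        # how often each id occurs

print(most_frequent_ids)
print(their_counts)
```

Calling `list(...)` directly on the result returns the counts, so the index is the place to look when the ids are wanted.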

content_type_id

In [None]:
# content_type_id
# note: the unpacking assumes questions are the more frequent type, since value_counts() sorts by count
question, lecture = train.content_type_id.value_counts()
print('There are {} questions and {} lectures in the train dataset, and questions account for {}% of the total content.'.format(
question, lecture, round(question/(question + lecture)*100,1)))

task_container_id: number of unique batches

In [None]:
# task_container_id: number of unique batches
print('The number of unique batches of questions or lectures: {}'.format(
train.task_container_id.nunique()))


user_answer:

In [None]:
# user_answer
print('0-3 are the answers to questions, -1 is no-answer (lecture).')
train.user_answer.value_counts()

answered_correctly: the correct response to questions

In [None]:
# answered_correctly: 1 = correct, 0 = incorrect, -1 = lecture (no answer)
# note: the unpacking assumes the counts are ordered correct > incorrect > lecture
correct_question, notcorrect_question, noanswer_lecture = train.answered_correctly.value_counts()
print('There are {} answered questions in total in the train data. '
      '{} questions were answered correctly and {} were answered incorrectly. '
      'The correct answer rate is about {}%.'.format(
    correct_question + notcorrect_question, correct_question,
    notcorrect_question, round(correct_question/(correct_question + notcorrect_question)*100,1)))
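
The same rate can also be computed without depending on the order in which `value_counts()` returns its entries, by first dropping the lecture rows (-1). A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up answers: 1 = correct, 0 = incorrect, -1 = lecture (no answer).
answered_correctly = pd.Series([1, 0, 1, -1, 1, 0, -1], name="answered_correctly")

# Keep question rows only.
questions_only = answered_correctly[answered_correctly != -1]

n_answered = len(questions_only)
n_correct = int((questions_only == 1).sum())
correct_rate = round(n_correct / n_answered * 100, 1)

print(n_answered, n_correct, correct_rate)
```

Since the remaining values are only 0 and 1, `questions_only.mean()` gives the same rate as a fraction.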

### 2.2.2 Advanced data exploration

**Exploring possible features**
1. total time(ms) spent on the APP
2. average time(ms) spent on a question
3. the average rate of questions answered correctly for each user
4. the average rate of whether a user saw an explanation after the prior question

get all users' question data, and explore 4 possible features:

In [None]:
# get all users' question data(False means question,True for lecture)
users = train[train['content_type_id'] == False].groupby('user_id')

In [None]:
# get #, % of correct questions
user_answers = users['answered_correctly'].agg(correct_mean = 'mean', 
                correct_count = 'sum', answers_count = 'count')
user_answers['correct_count'] = user_answers.correct_count.astype('int64')

# get the total spent time(ms)
user_time = users['timestamp'].agg(time_total = 'max')

# concatenate the two dataframes
user_correct = pd.concat([user_answers,user_time], axis = 1)
# add mean spent time column
user_correct['time_mean'] = user_correct.time_total / user_correct.answers_count

# get the average of prior_question_had_explanation
user_had_explanation = users['prior_question_had_explanation'].agg(had_explanation_mean = 'mean') 

# concatenate user_had_explanation with  user_correct
# here can make multiple vis (might need normalization)
user_correct_info = pd.concat([user_correct, user_had_explanation], axis = 1)

# sort values first by answers_count, then by the mean correct rate,
# and show the top 10 results
user_correct_info.sort_values(by=["answers_count", 'correct_mean'], ascending=False)[:10]
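
To make the four per-user features easier to check by hand, here is a small worked sketch on a made-up frame with two users. It follows the same definitions as the cell above, but collects them in a single named aggregation; the data and the variable names `toy`, `questions` and `features` are illustrative only.

```python
import pandas as pd

# Made-up interactions for two users; user 1 also has one lecture row.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "content_type_id": [0, 0, 1, 0, 0],
    "timestamp": [0, 20000, 30000, 0, 50000],
    "answered_correctly": [1, 0, -1, 1, 1],
    "prior_question_had_explanation": [False, True, False, False, False],
})

# Keep question rows only (content_type_id == 0).
questions = toy[toy["content_type_id"] == 0]

# One named aggregation per feature.
features = questions.groupby("user_id").agg(
    correct_mean=("answered_correctly", "mean"),
    correct_count=("answered_correctly", "sum"),
    answers_count=("answered_correctly", "count"),
    time_total=("timestamp", "max"),
    had_explanation_mean=("prior_question_had_explanation", "mean"),
)
# Mean time per answered question, defined as total time / number of answers.
features["time_mean"] = features["time_total"] / features["answers_count"]

print(features)
```

For user 1 this gives a correct rate of 0.5 over 2 answers and a mean time of 10000 ms; for user 2 it gives 1.0 over 2 answers and 25000 ms.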
