Riiid Labs, an AI solutions provider delivering creative disruption to the education market, empowers global education players to rethink traditional ways of learning leveraging AI. With a strong belief in equal opportunity in education, Riiid launched an AI tutor based on deep-learning algorithms in 2017 that attracted more than one million South Korean students. This year, the company released EdNet, the world’s largest open database for AI education containing more than 100 million student interactions.

In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will pair your machine learning skills with Riiid’s EdNet data.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os

This part of the program is based on the notebook by Rohan Rao: https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets

- The Riiid! Answer Correctness Prediction dataset has over 100 million rows and 10 columns. The usual pd.read_csv will result in an out-of-memory error.

- We also convert the dataset into another format that uses less disk space and/or can be read faster on subsequent reads.

Using datatable to read large datasets.
Documentation: https://datatable.readthedocs.io/en/latest/index.html

The training dataset has the following features:


    - row_id: (int64) ID code for the row.
    
    - timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.
    
    - user_id: (int32) ID code for the user.
    
    - content_id: (int16) ID code for the user interaction
    
    - content_type_id: (bool) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
    
    - task_container_id: (int16) ID code for the batch of questions or lectures. (eg. a user might see three questions in a row before seeing the explanations for any of them - those three would all share a task_container_id)
    
    - user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.
    
    - answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.
    
    - prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between (is null for a user's first question bundle or lecture)
    
    - prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
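
The dtypes listed above can also be passed straight to pandas when reading a CSV. The following is a minimal sketch, not the loading path this notebook actually uses (which reads a pre-built pickle). The sample rows are made up, `read_train` is a hypothetical helper, and storing content_type_id as int8 rather than bool is an assumption made to keep parsing of 0/1 values simple.

```python
import io
import pandas as pd

# Column dtypes taken from the feature list above. The two columns that can be
# null use float32 and pandas' nullable "boolean" so missing values survive.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",  # assumption: kept as a small int (0/1)
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

# Tiny made-up sample standing in for the real train.csv
SAMPLE_CSV = (
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
    "2,118363,115,128,0,0,0,1,55000.0,False\n"
)


def read_train(source, chunksize=None):
    """Read train data with explicit dtypes (hypothetical helper).

    With chunksize set, pandas returns an iterator of DataFrames instead of
    one DataFrame, which keeps peak memory lower on very large files.
    """
    return pd.read_csv(source, dtype=TRAIN_DTYPES, chunksize=chunksize)


sample = read_train(io.StringIO(SAMPLE_CSV))
print(sample.dtypes)
print(sample.memory_usage(deep=True))
```

Explicit dtypes avoid pandas defaulting every numeric column to int64 or float64, which is where much of the memory goes on a dataset of this size.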




In [None]:
%%time
# Load the train data set
train = pd.read_pickle('../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip')
train.head()

In [None]:
### checking memory usage
train.memory_usage(deep = True)

In [None]:
train.info()

In [None]:
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('bool')

train.memory_usage(deep=True)

In [None]:
## what is the number of unique users in the dataset
print(f'We have {train.user_id.nunique()} unique user ids in the dataset')

In [None]:
print(train['content_type_id'].value_counts())
print('\nIn percentages\n',train['content_type_id'].value_counts(normalize = True))

In [None]:
### what is the number of interactions that are lectures in the dataset
temp = len(train.loc[train['content_type_id'] == True])
print(f' the number of interactions that are lectures is {temp} and the rest {(len(train) - temp)} are questions')
print(f' the proportion of interactions that are lectures is {temp/len(train):0.2f} and the proportion of questions is {(len(train) - temp)/len(train):0.2f}')

In [None]:
## how many different types of contents are there in these interactions
unique_contents = train['content_id'].nunique()
print(f'There are {unique_contents} unique contents in the entire dataset')

In [None]:
#within the dataset of only questions (leaving out lectures) how many contents are there
unique_questions = train[train['content_type_id'] == False].content_id.nunique()
print(f'There are {unique_questions} unique contents in the dataset of just questions')

In [None]:
### showing the top most used content_ids
common_ids = train['content_id'].value_counts()[0:50]

fig = plt.figure(figsize=(12,6))
ax = common_ids.plot.bar()
plt.title("Fifty most used content id's")
plt.show()

task_container_id: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

In [None]:
## how many different types of task containers are there in these interactions
unique_tasks = train['task_container_id'].nunique()
print(f'There are {unique_tasks} unique task containers in the entire dataset (this includes lectures and questions)')

In [None]:
### showing the top most used task containers
common_tasks = train['task_container_id'].value_counts()[0:30]

fig = plt.figure(figsize=(12,6))
ax = common_tasks.plot.bar()
plt.title("Thirty most used task containers")
plt.show()

user_answer: questions are multiple choice (answers 0-3). As mentioned in the data description, -1 means no answer (the interaction was a lecture rather than a question). We already found that a proportion of 0.019352 of the interactions were lectures, so the share of -1 values here should match.
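
That consistency can be checked directly rather than by eye. The sketch below uses a small made-up frame in place of `train`; the column names follow the dataset, everything else is illustrative.

```python
import pandas as pd

# Made-up stand-in for the train frame: two lectures, three questions
toy = pd.DataFrame({
    "content_type_id": [False, False, True, False, True],
    "user_answer": [0, 3, -1, 2, -1],
})

no_answer = toy["user_answer"] == -1
is_lecture = toy["content_type_id"] == True

# True only if every -1 answer is a lecture row and vice versa
consistent = bool((no_answer == is_lecture).all())
# Share of lecture rows, comparable to the proportion computed earlier
lecture_share = float(is_lecture.mean())

print(consistent, lecture_share)
```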



In [None]:
df = pd.concat([train.user_answer.value_counts(), 
                train.user_answer.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))

print(df)

In [None]:
### showing the distribution of user answers
user_answers = train['user_answer'].value_counts()

fig = plt.figure(figsize=(12,6))
ax = user_answers.plot.bar()
plt.title("User Answers")
plt.show()

Examining the column 'answered_correctly': '-1' stands for lectures, so we can also find out what percentage of the interactions are lectures.

In [None]:
df = pd.concat([train['answered_correctly'].value_counts(), 
                train['answered_correctly'].value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))

print(df)

In [None]:
### for now deleting all the lecture interactions
'''train = train[train['content_type_id'] == False]
train['answered_correctly'].value_counts()'''

In [None]:
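## how many users are there when lectures are excluded
## (the deletion above is commented out, so filter here instead)
n_question_users = train[train['content_type_id'] == False].user_id.nunique()
print(f' There are {n_question_users} users after excluding the lectures')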

In [None]:
### About 2/3 of the questions were answered correctly (about 1/3 incorrectly)

In [None]:
### showing correct answers
correct_answers = train['answered_correctly'].value_counts()

fig = plt.figure(figsize=(12,6))
ax = correct_answers.plot.bar()
plt.title("Correct Answers")
plt.show()

In [None]:
### showing correct answers horizontally
correct_answers = train['answered_correctly'].value_counts()

fig = plt.figure(figsize=(12,6))
ax = correct_answers.plot.barh()
plt.title("Answered Correctly")
plt.show()

In [None]:
### crosstab between answered correctly and user answer
pd.crosstab(train.user_answer, train.answered_correctly)

In [None]:
### crosstab between answered correctly and content type
pd.crosstab(train.content_type_id, train.answered_correctly)

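The crosstab alone does not show a clear connection between answered_correctly and user_answer: whether an answer is correct depends on the question's correct_answer (see the questions metadata below), not on which option (0-3) was chosen.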
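
One way to make the link explicit is to join each interaction to the question's correct_answer and recompute correctness. This is a sketch on made-up data; the column names (content_id, question_id, correct_answer) come from the data description, and the toy frames are assumptions.

```python
import pandas as pd

# Made-up question interactions (no lectures)
train_toy = pd.DataFrame({
    "content_id": [0, 1, 0, 1],
    "user_answer": [2, 1, 3, 1],
    "answered_correctly": [1, 0, 0, 0],
})
# Made-up question metadata
questions_toy = pd.DataFrame({
    "question_id": [0, 1],
    "correct_answer": [2, 3],
})

merged = train_toy.merge(
    questions_toy, left_on="content_id", right_on="question_id", how="left"
)
# An answer is correct exactly when it equals the question's correct_answer
merged["recomputed"] = (
    merged["user_answer"] == merged["correct_answer"]
).astype("int8")

all_match = bool((merged["recomputed"] == merged["answered_correctly"]).all())
print(all_match)
```

On the real data, lecture rows would need to be filtered out first, since they have no matching question_id.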

timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user. As you can see, most interactions are from users that were not active very long on the platform yet.

In [None]:
## so all 393656 users should have a timestamp of zero
zero_timestamps = train[train.timestamp == 0].user_id.nunique()
print(f' of all the {train.user_id.nunique()} users, The number of users with timestamp equal zero is {zero_timestamps}')

Something is wrong here: some user_ids have no entry with timestamp = 0.
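
To see which users are affected, one option is to look at each user's smallest timestamp. The sketch below uses a made-up frame in place of `train`.

```python
import pandas as pd

# Made-up stand-in: user 3 never has a row with timestamp 0
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "timestamp": [0, 500, 0, 900, 1200],
})

# Earliest recorded timestamp per user
first_ts = toy.groupby("user_id")["timestamp"].min()
# Users whose earliest recorded interaction is later than 0
users_without_zero = first_ts[first_ts > 0]

print(users_without_zero)
```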

In [None]:
#1 year = 31536000000 ms
ts = train['timestamp']/(31536000000/12)
fig = plt.figure(figsize=(12,6))
ts.plot.hist(bins=100)
plt.title("Histogram of timestamp")
#plt.xticks(rotation=0)
plt.xlabel("Months between this user interaction and the first event completion from that user")
plt.show()

In [None]:
#1 year = 31536000000 ms
ts = train['timestamp']/(31536000000/365)
fig = plt.figure(figsize=(12,6))
ts.plot.hist(bins=100)
plt.title("Histogram of timestamp")
#plt.xticks(rotation=0)
plt.xlabel("Days between this user interaction and the first event completion from that user")
plt.show()
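
The constant 31536000000 in the two cells above is the number of milliseconds in a 365-day year. A small sketch of the arithmetic, with hypothetical helper names:

```python
MS_PER_SECOND = 1000
MS_PER_DAY = 24 * 60 * 60 * MS_PER_SECOND  # 86,400,000 ms
MS_PER_YEAR = 365 * MS_PER_DAY             # 31,536,000,000 ms
MS_PER_MONTH = MS_PER_YEAR / 12            # average month of a 365-day year


def ms_to_days(ms):
    """Convert a millisecond timestamp to days."""
    return ms / MS_PER_DAY


def ms_to_months(ms):
    """Convert a millisecond timestamp to (average) months."""
    return ms / MS_PER_MONTH


print(MS_PER_YEAR, ms_to_days(MS_PER_DAY), ms_to_months(MS_PER_YEAR))
```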

In [None]:
%who
import gc
gc.collect()

Is there a relationship between timestamp and answered correctly?

In [None]:
#bin_labels_5 = ['Bin_1', 'Bin_2', 'Bin_3', 'Bin_4', 'Bin_5']
train['ts_bin'] = pd.qcut(train['timestamp'], q=20, labels = np.arange(20))

In [None]:
fig = plt.figure(figsize=(12,6))
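# exclude lecture rows (answered_correctly == -1) so they do not pull the mean down
ax = train[train.answered_correctly != -1].groupby('ts_bin')['answered_correctly'].mean().plot.bar()
plt.title("Proportion answered correctly by timestamp bin")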
plt.show()

The earliest interactions have a lower proportion of correct answers; after that, even for interactions much further out, the proportion correct seems to stay about the same.
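
The binning step can be illustrated on a small made-up frame. This sketch drops lecture rows (answered_correctly == -1) before binning, so the per-bin mean is a proportion of correct answers; the data and the choice of two bins are assumptions for illustration.

```python
import pandas as pd

# Made-up interactions; -1 marks lecture rows
toy = pd.DataFrame({
    "timestamp": [0, 10, 20, 30, 40, 50, 60, 70],
    "answered_correctly": [0, -1, 0, 1, 1, 1, -1, 1],
})

questions_only = toy[toy["answered_correctly"] != -1].copy()
# Two equal-sized quantile bins; labels=False returns integer bin codes
questions_only["ts_bin"] = pd.qcut(questions_only["timestamp"], q=2, labels=False)

accuracy_by_bin = questions_only.groupby("ts_bin")["answered_correctly"].mean()
print(accuracy_by_bin)
```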

Is there a relationship between the number of questions answered and the percentage correct?

In [None]:
user_percent = train[train.answered_correctly != -1].groupby('user_id')['answered_correctly'].agg(Mean = 'mean', Nquestions = 'count')

In [None]:
print(f'The highest number of questions answered by a user was {user_percent.Nquestions.max()}')

In [None]:
from scipy.stats import pearsonr
corr, _ = pearsonr(user_percent.Mean, user_percent.Nquestions)
print('Pearsons correlation: %.3f' % corr)

In [None]:
sample = user_percent[user_percent.Nquestions < 5000].sample(n=2000, random_state = 1)
fig = plt.figure(figsize=(12,6))
x = sample.Nquestions
y = sample.Mean
plt.scatter(x, y, marker='o')
plt.title("Percent answered correctly versus number of questions answered")
plt.xticks(rotation=0)
plt.xlabel("Number of questions answered")
plt.ylabel("Percent answered correctly")
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()

Does it help if 'prior_question_had_explanation' is True?

In [None]:
df = pd.concat([train['prior_question_had_explanation'].value_counts(), 
                train['prior_question_had_explanation'].value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))

print(df)

In [None]:
### showing prior question had explanation
correct_answers = train['prior_question_had_explanation'].value_counts()

fig = plt.figure(figsize=(12,6))
ax = correct_answers.plot.bar()
plt.title("Prior Question had Explanation")
plt.show()

In [None]:
prior_answers = train[train.answered_correctly != -1].groupby('prior_question_had_explanation')['answered_correctly'].agg(Mean = 'mean')

In [None]:

fig = plt.figure(figsize=(12,6))
ax = prior_answers.plot.bar()
plt.title("Prior Question had Explanation")
plt.show()

In [None]:
## is there a relationship between prior_question_elapsed_time and answered correctly

In [None]:
prior_time = train[train.answered_correctly != -1].groupby('answered_correctly')['prior_question_elapsed_time'].agg(Mean = 'mean')

In [None]:
fig = plt.figure(figsize=(12,6))
ax = prior_time.plot.bar()
plt.title("Prior Question Time Taken")
plt.show()

Questions

Metadata for the questions posed to users.

    - question_id: foreign key for the train/test content_id column, when the content type is question (0).
    - bundle_id: code for which questions are served together.
    - correct_answer: the answer to the question. Can be compared with the train user_answer column to check if the user was right.
    - part: the relevant section of the TOEIC test.
    - tags: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.


In [None]:
questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')
questions.head()

In [None]:
questions.shape

Analysis with Tags

In [None]:
questions['tags'] = questions['tags'].astype(str)

In [None]:
questions.head()

In [None]:
tags = [x.split() for x in questions[questions.tags != "nan"].tags.values]

In [None]:
tag_list = []
for i in tags:
    for j in i:
        if j not in(tag_list):
            tag_list.append(j)
        
print(len(tag_list))


In [None]:
print(f'There are {len(tag_list)} unique tags in the dataset')
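
The same count can be obtained with a set, which avoids the repeated list membership test. A sketch on made-up tag strings, where "nan" stands for a question without tags as in the cells above:

```python
# Made-up tag strings in the same space-separated format as questions.tags
tag_strings = ["51 131 162 38", "131 36 81", "nan", "131 101 162 92"]

# Collect every distinct tag code, skipping questions without tags
unique_tags = sorted(
    {tag for s in tag_strings if s != "nan" for tag in s.split()},
    key=int,
)

print(len(unique_tags), unique_tags)
```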

In [None]:
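# split each tag string into a list; questions without tags ("nan") get an empty list
tags_list = [x.split() if x != "nan" else [] for x in questions.tags.values]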
questions['tags_list'] = tags_list
questions.head()

Find out which are the hardest tags and easiest tags

In [None]:
## Also add information on lectures; there is a relationship between ever having watched a lecture and percent correct
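
One possible way to rank tags by difficulty is to give each tag its own row, join the interactions to it, and average answered_correctly per tag. This is a sketch on made-up data, not a result from the real dataset; the toy frames and the name `tag_stats` are assumptions.

```python
import pandas as pd

# Made-up question metadata with space-separated tags
questions_toy = pd.DataFrame({
    "question_id": [0, 1, 2],
    "tags": ["1 2", "2", "3 1"],
})
# Made-up question interactions (lectures already excluded)
train_toy = pd.DataFrame({
    "content_id": [0, 0, 1, 2, 2, 1],
    "answered_correctly": [1, 0, 1, 0, 0, 1],
})

# One row per (question, tag) pair
question_tags = questions_toy.assign(
    tag=questions_toy["tags"].str.split()
).explode("tag")[["question_id", "tag"]]

merged = train_toy.merge(
    question_tags, left_on="content_id", right_on="question_id"
)

# Mean correctness and number of answers per tag, hardest first
tag_stats = (
    merged.groupby("tag")["answered_correctly"]
    .agg(accuracy="mean", n_answers="count")
    .sort_values("accuracy")
)

hardest_tag = tag_stats.index[0]
easiest_tag = tag_stats.index[-1]
print(tag_stats)
```

Keeping n_answers next to accuracy makes it easy to ignore tags with too few answers for the mean to be meaningful.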