<a href="https://colab.research.google.com/github/theresaskruzna/riiid_knowledge_tracing/blob/main/01_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Libraries

In [26]:
# Riiid Knowledge Tracing Project
# Initial setup and data exploration

import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import tensorflow as tf
from tensorflow.keras import layers, models

print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.18.0


# Data pre-processing

## Load original data

In [2]:
# This installs the Kaggle API package. The -q flag makes the installation "quiet" (less verbose output).
# This only needs to be run once per Colab session
!pip install -q kaggle

# This imports the files module from Google Colab and opens a file upload dialog where you select your Kaggle API credentials file (kaggle.json) from your local computer.
# Upload your kaggle.json file (from your Kaggle account settings)
from google.colab import files
files.upload()  # Upload your kaggle.json here

!mkdir -p ~/.kaggle # Creates a .kaggle directory in your home folder (~) if it doesn't already exist. The -p flag ensures no error is thrown if the directory already exists.
!cp kaggle.json ~/.kaggle/ # Copies the uploaded kaggle.json file to the .kaggle directory where the Kaggle API expects to find it.
!chmod 600 ~/.kaggle/kaggle.json # Sets permissions on the kaggle.json file to "read/write for owner only" (600). This is required by the Kaggle API for security reasons.

# Download the dataset
!kaggle competitions download -c riiid-test-answer-prediction # Downloads all files for the Riiid Test Answer Prediction competition. The -c flag specifies which competition.
!unzip -q riiid-test-answer-prediction.zip -d riiid_data # Extracts the downloaded zip file into a directory called "riiid_data". The -q flag makes the unzip operation quiet (less verbose).

Saving kaggle.json to kaggle.json
Downloading riiid-test-answer-prediction.zip to /content
 99% 1.27G/1.29G [00:09<00:00, 166MB/s]
100% 1.29G/1.29G [00:09<00:00, 149MB/s]






In [3]:
# This command will list all files and directories within the "riiid_data" directory in a detailed format.
!ls -la riiid_data/

total 5710060
drwxr-xr-x 3 root root       4096 Mar 10 12:25 .
drwxr-xr-x 1 root root       4096 Mar 10 12:25 ..
-rw-r--r-- 1 root root        971 Oct 14  2020 example_sample_submission.csv
-rw-r--r-- 1 root root       6137 Oct 14  2020 example_test.csv
-rw-r--r-- 1 root root       9703 Oct 14  2020 lectures.csv
-rw-r--r-- 1 root root     296161 Oct 14  2020 questions.csv
drwxr-xr-x 2 root root       4096 Mar 10 12:25 riiideducation
-rw-r--r-- 1 root root 5846760913 Oct 14  2020 train.csv


In [4]:
# This command lists detailed information about the specific file kaggle.json in the .kaggle directory in your home folder.
!ls -la ~/.kaggle/kaggle.json

-rw------- 1 root root 70 Mar 10 12:25 /root/.kaggle/kaggle.json


Loading questions.csv and lectures.csv into the notebook.

In [16]:
%%time

questions_df = pd.read_csv('riiid_data/questions.csv')

CPU times: user 8.59 ms, sys: 1.2 ms, total: 9.79 ms
Wall time: 16.7 ms


In [15]:
%%time

lectures_df = pd.read_csv('riiid_data/lectures.csv')

CPU times: user 4.33 ms, sys: 108 µs, total: 4.44 ms
Wall time: 11.5 ms


The original train.csv dataset is too large (5.85 GB) to process in Colab, so a compressed version that has been converted from .csv to .pkl.gzip will be used for the time-series forecasting work instead.
This approach downloads the whole dataset (all five formats, 4.17 GB); I will look into ways of downloading only the single file I need.

In [9]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("rohanrao/riiid-train-data-multiple-formats")

print("Path to dataset files:", path)

os.listdir(path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/rohanrao/riiid-train-data-multiple-formats?dataset_version_number=3...


100%|██████████| 4.17G/4.17G [00:54<00:00, 81.4MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/rohanrao/riiid-train-data-multiple-formats/versions/3


['riiid_train.h5',
 'riiid_train.feather',
 'riiid_train.jay',
 'riiid_train.parquet',
 'riiid_train.pkl.gzip']

In [12]:
%%time

train_df = pd.read_pickle(f"{path}/riiid_train.pkl.gzip")

CPU times: user 2.88 s, sys: 8.79 s, total: 11.7 s
Wall time: 26.1 s


## Dataset description

Tailoring education to a student's ability level is one of the many valuable things an AI tutor can do. Your challenge in this competition is a version of that overall task; you will predict whether students are able to answer their next questions correctly. You'll be provided with the same sorts of information a complete education app would have: that student's historic performance, the performance of other students on the same question, metadata about the question itself, and more.

This is a time-series code competition: you will receive test set data and make predictions with Kaggle's time-series API. Please be sure to review the Time-series API Details section closely.

## Feature description

**Files**


*train.csv*
*   row_id: (int64) ID code for the row.
*   timestamp: (int64) the time in milliseconds between this user interaction and the first event completion from that user.
*   user_id: (int32) ID code for the user.
*   content_id: (int16) ID code for the user interaction
*   content_type_id: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
*   task_container_id: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.
*   user_answer: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.
*   answered_correctly: (int8) if the user responded correctly. Read -1 as null, for lectures.
*   prior_question_elapsed_time: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.
*   prior_question_had_explanation: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.
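
The dtypes in the list above can be written down as a mapping and passed to `pd.read_csv`, so that each column is read in its compact type from the start. The following is a minimal sketch on a tiny in-memory stand-in for train.csv; the name `TRAIN_DTYPES`, the two sample rows, and the choice of pandas' nullable `"boolean"` dtype for the last column are assumptions made for illustration.

```python
import io

import pandas as pd

# Dtypes copied from the feature list above. "boolean" is pandas' nullable
# boolean dtype, chosen here (an assumption) because the column has nulls.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

# Two illustrative rows standing in for the real 5.85 GB file.
sample_csv = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,"
    "prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)

sample_df = pd.read_csv(sample_csv, dtype=TRAIN_DTYPES)
print(sample_df.dtypes)
```

Empty fields in the last two columns are read as missing values rather than raising an error, which is why a nullable dtype is used for the boolean column.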

*questions.csv: metadata for the questions posed to users.*
*   question_id: foreign key for the train/test content_id column, when the content type is question (0).
*   bundle_id: code for which questions are served together.
*   correct_answer: the answer to the question. Can be compared with the train user_answer column to check if the user was right.
*   part: the relevant section of the TOEIC test.
*   tags: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
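
Since `tags` holds one or more codes in a single string, it may be easier to work with once it is split into a list of integers. A minimal sketch, assuming the codes are separated by spaces (as they appear in questions.csv) and using a hypothetical helper `split_tags` on made-up rows:

```python
import pandas as pd

# Made-up rows in the same shape as questions.csv, including one missing tag.
questions_sample = pd.DataFrame(
    {"question_id": [0, 1, 2], "tags": ["51 131 162 38", "131 36 81", None]}
)


def split_tags(tags):
    """Turn a space-separated tag string into a list of ints ([] if missing)."""
    if not isinstance(tags, str):
        return []
    return [int(tag) for tag in tags.split()]


questions_sample["tag_list"] = questions_sample["tags"].apply(split_tags)
print(questions_sample)
```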

*lectures.csv: metadata for the lectures watched by users as they progress in their education.*


*   lecture_id: foreign key for the train/test content_id column, when the content type is lecture (1).
*   part: top level category code for the lecture.
*   tag: one tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.
*   type_of: brief description of the core purpose of the lecture

*example_test_rows.csv*

Three sample groups of the test set data as it will be delivered by the time-series API. The format is largely the same as train.csv. There are two different columns that mirror what information the AI tutor actually has available at any given time, but with the user interactions grouped together for the sake of API performance rather than strictly showing information for a single user at a time. Some users will appear in the hidden test set that have NOT been presented in the train set, emulating the challenge of quickly adapting to modeling new arrivals to a website.
*   prior_group_responses (string) provides all of the user_answer entries for previous group in a string representation of a list in the first row of the group. All other rows in each group are null. If you are using Python, you will likely want to call eval on the non-null rows. Some rows may be null, or empty lists.
*   prior_group_answers_correct (string) provides all the answered_correctly field for previous group, with the same format and caveats as prior_group_responses. Some rows may be null, or empty lists.
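
The description suggests calling `eval` on the non-null rows. A safer alternative, sketched below, is `ast.literal_eval`, which only parses Python literals and will not execute arbitrary code. The helper name `parse_prior_group` is hypothetical, and treating null or empty cells as an empty list is an assumption about how they should be handled.

```python
import ast


def parse_prior_group(value):
    """Parse a string such as "[1, 0, 3]" into a list; null or empty -> []."""
    # Null cells arrive as NaN (a float), so anything that is not a
    # non-empty string is treated as "no information".
    if not isinstance(value, str) or not value.strip():
        return []
    return list(ast.literal_eval(value))


print(parse_prior_group("[1, 0, 3]"))
print(parse_prior_group(float("nan")))
```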

**Time-series API Details**

Refer to the starter notebook for an example of how to complete a submission. The time-series API has changed somewhat from previous competitions!

You should not try to submit anything for the rows that contain lectures.

The API provides user interactions groups in the order in which they occurred. Each group will contain interactions from many different users, but no more than one task_container_id of questions from any single user. Each group has between 1 and 1000 users.

Expect to see roughly 2.5 million questions in the hidden test set.

The API will load up to 1 GB of the test set data in memory after initialization. The initialization step (env.iter_test()) will require meaningfully more memory than that; we recommend you do not load your model until after making that call. The API will also consume roughly 15 minutes of runtime for loading and serving the data, but will also obfuscate the true runtime for all submissions.

The API loads the data using the types specified above (int32 for user_id, int8 for content_type_id, etc).

# Data exploration

## train.pkl.gzip

In [19]:
train_df.head()

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,False,1,3,1,,
1,1,56943,115,5716,False,2,2,1,37000.0,False
2,2,118363,115,128,False,0,0,1,55000.0,False
3,3,131167,115,7860,False,3,0,1,19000.0,False
4,4,137965,115,7922,False,4,1,1,11000.0,False


In [25]:
print(f"Training dataset size: {train_df.shape}\n")

train_df.info()

Training dataset size: (101230332, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101230332 entries, 0 to 101230331
Data columns (total 10 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   row_id                          int64  
 1   timestamp                       int64  
 2   user_id                         int32  
 3   content_id                      int16  
 4   content_type_id                 bool   
 5   task_container_id               int16  
 6   user_answer                     int8   
 7   answered_correctly              int8   
 8   prior_question_elapsed_time     float32
 9   prior_question_had_explanation  object 
dtypes: bool(1), float32(1), int16(2), int32(1), int64(2), int8(2), object(1)
memory usage: 3.7+ GB


The dataset has over 101 million rows across 10 feature columns of various data types.

Next, check whether each column in the new dataset has the data type given in the feature list above.

In [14]:
train_df.memory_usage(deep=True)

Unnamed: 0,0
Index,132
row_id,809842656
timestamp,809842656
user_id,404921328
content_id,202460664
content_type_id,101230332
task_container_id,202460664
user_answer,101230332
answered_correctly,101230332
prior_question_elapsed_time,404921328


The feature column 'prior_question_had_explanation' takes up an unreasonably large chunk of memory compared to the other variables when the dataset is loaded. This is because the feature is stored as an object data type rather than the boolean data type given in the feature list.

In [30]:
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].astype('boolean')

# Verify the change in memory usage (optional)
train_df.memory_usage(deep=True)

Unnamed: 0,0
Index,132
row_id,809842656
timestamp,809842656
user_id,404921328
content_id,202460664
content_type_id,101230332
task_container_id,202460664
user_answer,101230332
answered_correctly,101230332
prior_question_elapsed_time,404921328


Now the memory usage of that variable is comparable to that of the other variables.
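
The size of the effect can be seen on a small made-up column. This is only a sketch: the 3,000 sample values are invented, and the exact byte counts will vary with the pandas and Python versions.

```python
import pandas as pd

# Made-up column of True / False / missing values stored as Python objects.
flags = pd.Series([True, False, None] * 1000, dtype="object")

object_bytes = flags.memory_usage(index=False, deep=True)
boolean_bytes = flags.astype("boolean").memory_usage(index=False, deep=True)

print(f"object:  {object_bytes} bytes")
print(f"boolean: {boolean_bytes} bytes")
```

An object column stores a pointer to a Python object for every row, whereas the nullable boolean dtype keeps one array of values and one array marking the missing entries.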

In [None]:
unique_values = train_df.nunique()
unique_values

Unnamed: 0,0
row_id,101230332
timestamp,72821015
user_id,393656
content_id,13782
content_type_id,2
task_container_id,10000
user_answer,5
answered_correctly,3
prior_question_elapsed_time,3258
prior_question_had_explanation,2


**!!(consult tutor)!!**

Some of the other variables could potentially be downsized too. For instance, 'answered_correctly' has only 3 unique values (-1, 0 and 1). According to the feature list, -1 should be read as null (it marks lecture rows), so it is not the same as 0. Could this column still be converted to a boolean if the -1 values are kept as missing?

Next, 'prior_question_elapsed_time' is float32 in both the dataset and the feature list, possibly because it has decimals or because it contains nulls. For the purpose of this project, wouldn't it be better to convert it to int32?

The feature 'content_type_id' was already converted from int8 in the original dataset to a boolean, as it only has the values 0 and 1.
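
One way to explore the 'answered_correctly' question is pandas' nullable boolean dtype, which can hold True, False and a missing value. The sketch below uses a made-up four-row column and builds the result with `pd.arrays.BooleanArray`, mapping 1 to True, 0 to False and -1 to missing. It is meant as an illustration of the option, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up values: 1 = correct, 0 = incorrect, -1 = null (lecture row).
answered = pd.Series(np.array([1, 0, -1, 1], dtype="int8"))

values = (answered == 1).to_numpy()  # True where the answer was correct
mask = (answered == -1).to_numpy()   # True where the value is missing

answered_bool = pd.Series(pd.arrays.BooleanArray(values, mask), index=answered.index)
print(answered_bool)
```

Note that the nullable boolean dtype stores a values array and a mask array, so as far as memory is concerned it is unlikely to be smaller than the existing int8 column; the possible gain is clearer semantics, not size.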

In [33]:
max_values = train_df[['row_id', 'timestamp', 'user_id', 'prior_question_elapsed_time']].max()
max_values

Unnamed: 0,0
row_id,101230300.0
timestamp,87425770000.0
user_id,2147483000.0
prior_question_elapsed_time,300000.0


From the table above we can see that 'row_id', with a maximum of about 101 million, would fit comfortably into int32, whose upper limit is 2,147,483,647.

'prior_question_elapsed_time' is already float32, so there is nothing to downgrade within the float types. Are the decimals important if the whole numbers represent milliseconds?
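
Whether the decimals matter can be checked directly: test if every non-null value is a whole number, and if an integer column is still wanted, use pandas' nullable `Int32` dtype so that the missing values survive the conversion. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up float32 values with one missing entry.
elapsed = pd.Series(np.array([np.nan, 37000.0, 55000.0, 19000.0], dtype="float32"))

# True if no non-null value has a fractional part.
all_whole = bool((elapsed.dropna() % 1 == 0).all())
print("All values are whole numbers:", all_whole)

# Round first so the cast is safe even if some values had decimals;
# NaN becomes <NA> in the nullable integer dtype.
elapsed_int = elapsed.round().astype("Int32")
print(elapsed_int)
```

A plain NumPy int32 cannot represent missing values, and the nullable `Int32` dtype carries an extra mask array, so this conversion is unlikely to save memory compared with float32.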

In [34]:
train_df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
row_id,101230332.0,,,,50615165.5,29222679.859512,0.0,25307582.75,50615165.5,75922748.25,101230331.0
timestamp,101230332.0,,,,7703643654.326523,11592655308.506727,0.0,524343559.0,2674233764.5,9924550562.75,87425772049.0
user_id,101230332.0,,,,1076732424.202999,619716349.50211,115.0,540811564.0,1071781070.0,1615741676.0,2147482888.0
content_id,101230332.0,,,,5219.604757,3866.35894,0.0,2063.0,5026.0,7425.0,32736.0
content_type_id,101230332.0,2.0,False,99271300.0,,,,,,,
task_container_id,101230332.0,,,,904.062366,1358.302181,0.0,104.0,382.0,1094.0,9999.0
user_answer,101230332.0,,,,1.376123,1.192896,-1.0,0.0,1.0,3.0,3.0
answered_correctly,101230332.0,,,,0.625164,0.522531,-1.0,0.0,1.0,1.0,1.0
prior_question_elapsed_time,98878794.0,,,,25423.84375,21228.390625,0.0,16000.0,21000.0,29666.0,300000.0
prior_question_had_explanation,100837826.0,2.0,True,89685560.0,,,,,,,


In [36]:
missing_values = train_df.isna().sum().to_frame('missing_values')
missing_values

Unnamed: 0,missing_values
row_id,0
timestamp,0
user_id,0
content_id,0
content_type_id,0
task_container_id,0
user_answer,0
answered_correctly,0
prior_question_elapsed_time,2351538
prior_question_had_explanation,392506


## questions.csv

In [57]:
questions_df.head()

Unnamed: 0,question_id,bundle_id,correct_answer,part,tags
0,0,0,0,1,51 131 162 38
1,1,1,1,1,131 36 81
2,2,2,0,1,131 101 162 92
3,3,3,0,1,131 149 162 29
4,4,4,3,1,131 5 162 38


In [55]:
print(f"Questions dataset size: {questions_df.shape}\n")

questions_df.info()

Questions dataset size: (13523, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13523 entries, 0 to 13522
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question_id     13523 non-null  int64 
 1   bundle_id       13523 non-null  int64 
 2   correct_answer  13523 non-null  int64 
 3   part            13523 non-null  int64 
 4   tags            13522 non-null  object
dtypes: int64(4), object(1)
memory usage: 528.4+ KB


The int64 data types in this dataset seem unnecessary, so I will most likely convert them to smaller, better-fitting types.

In [42]:
questions_df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
question_id,13523.0,,,,6761.0,3903.89818,0.0,3380.5,6761.0,10141.5,13522.0
bundle_id,13523.0,,,,6760.510907,3903.857783,0.0,3379.5,6761.0,10140.0,13522.0
correct_answer,13523.0,,,,1.455298,1.149707,0.0,0.0,1.0,3.0,3.0
part,13523.0,,,,4.264956,1.652553,1.0,3.0,5.0,5.0,7.0
tags,13522.0,1519.0,8.0,738.0,,,,,,,


In [43]:
unique_values = questions_df.nunique()
unique_values

Unnamed: 0,0
question_id,13523
bundle_id,9765
correct_answer,4
part,7
tags,1519


For the features 'question_id' and 'bundle_id', int16 is sufficient.

For 'correct_answer' and 'part', int8.

'tags' is categorical, so I might keep that as object.
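
Instead of picking each type by hand, `pd.to_numeric` with `downcast="integer"` chooses the smallest signed integer type that can hold a column's values. A minimal sketch on a made-up frame that reuses the minimum and maximum values reported above:

```python
import pandas as pd

# Made-up rows spanning the value ranges seen in questions.csv.
questions_sample = pd.DataFrame(
    {
        "question_id": [0, 1, 13522],
        "bundle_id": [0, 1, 13522],
        "correct_answer": [0, 1, 3],
        "part": [1, 5, 7],
    },
    dtype="int64",
)

# Downcast every int64 column to the smallest signed integer type that fits.
for column in questions_sample.select_dtypes(include="int64").columns:
    questions_sample[column] = pd.to_numeric(questions_sample[column], downcast="integer")

print(questions_sample.dtypes)
```

The result depends on the values present at the time of the call, so an explicit `astype` with a fixed mapping may be preferable if new, larger IDs could appear later.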

In [61]:
missing_values = questions_df.isnull().sum().to_frame('missing_values')
missing_values

Unnamed: 0,missing_values
question_id,0
bundle_id,0
correct_answer,0
part,0
tags,1


In [62]:
missing_tag_row = questions_df[questions_df['tags'].isnull()]
missing_tag_row

Unnamed: 0,question_id,bundle_id,correct_answer,part,tags
10033,10033,10033,2,6,


In [45]:
duplicates = questions_df.duplicated().sum()
print(f'Duplicated values: {duplicates}')

Duplicated values: 0


Only 1 value is missing for the 'tags' variable. At this dataset size the entire row can probably be deleted; however, as question_id is related to content_id in the main dataset, it is important to check how this particular question is represented in that dataset.

No duplicate values.



In [46]:
print(questions_df.part.value_counts())

part
5    5511
2    1647
3    1562
4    1439
6    1212
7    1160
1     992
Name: count, dtype: int64


## lectures.csv

In [63]:
lectures_df.head()

Unnamed: 0,lecture_id,tag,part,type_of
0,89,159,5,concept
1,100,70,1,concept
2,185,45,6,concept
3,192,79,5,solving question
4,317,156,5,solving question


In [56]:
print(f"Lectures dataset size: {lectures_df.shape}\n")

lectures_df.info()

Lectures dataset size: (418, 4)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   lecture_id  418 non-null    int64 
 1   tag         418 non-null    int64 
 2   part        418 non-null    int64 
 3   type_of     418 non-null    object
dtypes: int64(3), object(1)
memory usage: 13.2+ KB


The int64 data types in this dataset also seem unnecessary, so I will most likely convert them to smaller, better-fitting types.

In [50]:
lectures_df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
lecture_id,418.0,,,,16983.401914,9426.16466,89.0,9026.25,17161.5,24906.25,32736.0
tag,418.0,,,,94.480861,53.586487,0.0,50.25,94.5,140.0,187.0
part,418.0,,,,4.267943,1.872424,1.0,2.0,5.0,6.0,7.0
type_of,418.0,4.0,concept,222.0,,,,,,,


In [51]:
unique_values = lectures_df.nunique()
unique_values

Unnamed: 0,0
lecture_id,418
tag,151
part,7
type_of,4


For the features 'lecture_id' and 'tag', int16 is sufficient.

For 'part', int8.

'type_of' is categorical, so I might keep that as object.
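
If 'type_of' is not kept as object, pandas' `category` dtype is another option for a column with only four distinct values: each distinct string is stored once and every row holds a small integer code. A sketch on made-up data:

```python
import pandas as pd

# Made-up column using the four lecture types found in lectures.csv.
type_of = pd.Series(
    ["concept", "solving question", "concept", "intention", "starter"] * 100,
    dtype="object",
)
type_of_cat = type_of.astype("category")

print(type_of.memory_usage(index=False, deep=True), "bytes as object")
print(type_of_cat.memory_usage(index=False, deep=True), "bytes as category")
print(list(type_of_cat.cat.categories))
```

With only 418 rows in lectures.csv the saving is negligible in absolute terms, so this is mostly a matter of preference.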

In [52]:
missing_values = lectures_df.isna().sum().to_frame('missing_values')
missing_values

Unnamed: 0,missing_values
lecture_id,0
tag,0
part,0
type_of,0


In [53]:
duplicates = lectures_df.duplicated().sum()
print(f'Duplicated values: {duplicates}')

Duplicated values: 0


No missing or duplicate values in this dataset.

In [54]:
print(lectures_df.type_of.value_counts())

type_of
concept             222
solving question    186
intention             7
starter               3
Name: count, dtype: int64


# Feature engineering

## Missing values

### train.pkl.gzip

In [38]:
missing_values = train_df.isna().sum().to_frame('missing_values')
missing_values

Unnamed: 0,missing_values
row_id,0
timestamp,0
user_id,0
content_id,0
content_type_id,0
task_container_id,0
user_answer,0
answered_correctly,0
prior_question_elapsed_time,2351538
prior_question_had_explanation,392506


### questions.csv

In [65]:
missing_values = questions_df.isna().sum().to_frame('missing_values')
missing_values

Unnamed: 0,missing_values
question_id,0
bundle_id,0
correct_answer,0
part,0
tags,1


In [64]:
missing_tag_row = questions_df[questions_df['tags'].isnull()]
missing_tag_row

Unnamed: 0,question_id,bundle_id,correct_answer,part,tags
10033,10033,10033,2,6,


Only 1 value is missing for the 'tags' variable. At this dataset size the entire row can probably be deleted; however, as question_id is related to content_id in the main dataset, it is important to check how this particular question is represented in that dataset.


In [66]:
missing_question_id = missing_tag_row['question_id'].iloc[0]  # Extract the question_id
question_id_counts = train_df[train_df['content_id'] == missing_question_id].shape[0]

print(f"Occurrences of question_id {missing_question_id}: {question_id_counts}")

Occurrences of question_id 10033: 1


The question_id 10033 appears only once in the whole train.pkl.gzip dataset of over 101 million rows, so it's safe to say we can delete that row from questions.csv.
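
One consequence of dropping the row is worth keeping in mind: if the training data is later joined to the question metadata, the single interaction with question 10033 will have no match. The sketch below shows how a left merge with `indicator=True` exposes such rows. The two small frames are made up, and lecture rows are left out for simplicity.

```python
import pandas as pd

# Made-up interactions; content_id 10033 is the question with the missing tag.
train_sample = pd.DataFrame({"row_id": [0, 1, 2], "content_id": [5692, 10033, 5716]})

# Made-up question metadata, with the missing-tag row dropped as above.
questions_sample = pd.DataFrame(
    {"question_id": [5692, 5716, 10033], "tags": ["51 131", "131 36", None]}
).dropna(subset=["tags"])

merged = train_sample.merge(
    questions_sample,
    how="left",
    left_on="content_id",
    right_on="question_id",
    indicator=True,  # adds a "_merge" column showing where each row matched
)

unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

The unmatched row can then be dropped from the merged data or given a placeholder tag, depending on what the model needs.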

In [None]:
questions_df = questions_df.dropna(subset=['tags'])  # Drop rows with missing 'tags'

## Outliers

# Extra code

The code below crashed my runtime session when run on the train.pkl.gzip dataset.

In [None]:
train_df[['row_id',
    'timestamp',
    'user_id',
    'content_id',
    'content_type_id',
    'task_container_id',
    'user_answer',
    'answered_correctly',
    'prior_question_elapsed_time',
    'prior_question_had_explanation']].describe().T

In [None]:
duplicates = train_df.duplicated().sum()
print(f'Duplicated values: {duplicates}')
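
A lighter alternative to the two cells above might be the following sketch, which has not been tested on the full dataset. It assumes that 'row_id' is meant to be a unique key: if every 'row_id' is different, no two complete rows can be identical, so only that one column needs checking. The summary statistics are computed one column at a time instead of on the whole frame. The sample frame is made up.

```python
import pandas as pd

# Made-up stand-in for train_df.
train_sample = pd.DataFrame(
    {
        "row_id": [0, 1, 2, 3],
        "timestamp": [0, 56943, 118363, 131167],
        "user_id": [115, 115, 115, 115],
    }
)

# If row_id is unique, fully duplicated rows are impossible.
row_id_unique = train_sample["row_id"].is_unique
print(f"row_id is unique: {row_id_unique}")

# Describe one column at a time so only a single column is processed per step.
summaries = {col: train_sample[col].describe() for col in ["timestamp", "user_id"]}
summary_table = pd.DataFrame(summaries).T
print(summary_table)
```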

Code to be added/tested

In [None]:
import os
import subprocess

# Define the dataset and file you want
dataset = "rohanrao/riiid-train-data-multiple-formats"
filename = "riiid_train.pkl.gzip"

# Create the directory to store the file if it doesn't exist
os.makedirs("kaggle_data", exist_ok=True)

# Use the kaggle CLI to download just the specific file.
# Passing the arguments as a list avoids going through the shell.
command = ["kaggle", "datasets", "download", dataset, "-f", filename, "-p", "kaggle_data", "--unzip"]
result = subprocess.run(command, capture_output=True, text=True)

if result.returncode == 0:
    filepath = os.path.join("kaggle_data", filename)
    print(f"Successfully downloaded: {filepath}")
    print(f"File size: {os.path.getsize(filepath) / (1024 * 1024):.2f} MB")
else:
    print("Error downloading the file:")
    print(result.stderr)

In [None]:
# The DataFrame is named train_df everywhere else in this notebook.
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].astype('boolean')

train_df.memory_usage(deep=True)

In [None]:
train_df.columns

In [None]:
print(train_df.content_type_id.value_counts())