<h1>Riiid AIEd Challenge 2020 - Part I</h1>

Because of the memory and time restrictions in this competition, the work is divided into several parts (kernels):
<ul>
    <li>Part I - Memory optimization</li>
    <li>Part II - Splitting data</li>
    <li>Part III - Feature engineering</li>
    <li>Part IV - Training and validation</li>
    <li>Part V - Prediction and submission</li>
</ul>

This is Part I. In this part I'll:
<ul>
    <li>Read the competition data with the <code>datatable</code> package and convert columns to smaller data types to reduce the memory footprint.</li>
    <li>Divide the competition data into two parts. The first part, which I'll call <code>past_data</code>, will be used to create features for training, and a small part of it will also serve as the training and validation datasets. The second part will be the test dataset, designed to resemble the competition test set.</li>
    <li>Save the data in pickle format for use in the next part.</li>
</ul>

In [None]:
!python3.7 -m pip install --upgrade pip
!pip install /kaggle/input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
# Imports

import os
import datatable as dt
import pandas as pd
import numpy as np
import pickle
import gc

In [None]:
# Define directories used

DATA_DIR = '/kaggle/input/riiid-test-answer-prediction'
WORKING_DIR = '/kaggle/working'

The train data is huge (over 101 million rows). Loading it into memory with a plain <code>pd.read_csv</code> crashes the kernel. For that reason, I use the <code>datatable</code> package, which loads CSV files more efficiently.

In [None]:
%%time

# Load the train data set
competition_data = dt.fread(os.path.join(DATA_DIR, 'train.csv')).to_pandas()
competition_data.head()
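
As an alternative to <code>datatable</code>, <code>pd.read_csv</code> can be given narrow types up front, so the 64-bit defaults are never materialized. The cell below is only a sketch on a tiny in-memory CSV with made-up values. The dtype mapping is an assumption and must be checked against the real column ranges, and I have not verified that this approach fits in the kernel's memory on the full file.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for train.csv; the values here are made up for illustration.
csv_text = """row_id,timestamp,user_id,content_id,content_type_id,answered_correctly
0,0,115,5692,0,1
1,56943,115,5716,0,1
2,118363,115,128,0,0
"""

# Assumed dtype mapping: each type must be wide enough for the real column range.
dtypes = {
    "timestamp": np.int64,
    "user_id": np.int32,
    "content_id": np.int16,
    "content_type_id": np.int8,
    "answered_correctly": np.int8,
}

# usecols skips row_id entirely, so it never occupies memory.
sample = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes)
print(sample.dtypes)
```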

In [None]:
# Let's check what we have

competition_data.info()

In [None]:
# Let's see if we have nulls

competition_data.isnull().sum()

In [None]:
# Let's look at the maximum of every column. We're especially interested in the value ranges

competition_data.max()

Let's see the maximum value representable by each of the common numeric types:

In [None]:
types = pd.Series(
    data=[np.iinfo(np.int8).max, np.iinfo(np.int16).max, np.iinfo(np.int32).max, np.iinfo(np.int64).max,
          np.finfo(np.float16).max, np.finfo(np.float32).max, np.finfo(np.float64).max],
    index=['np.int8', 'np.int16', 'np.int32', 'np.int64', 'np.float16', 'np.float32', 'np.float64'],
    name='max value'
)
types
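
The comparison against these limits can also be automated. Below is a minimal sketch with a hypothetical helper, <code>smallest_int_dtype</code> (not part of the original kernel), that picks the narrowest signed integer type for a given range. It checks the minimum as well as the maximum, since some columns contain negative codes such as -1.

```python
import numpy as np


def smallest_int_dtype(lo, hi):
    """Return the narrowest signed integer dtype that can hold both lo and hi."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return candidate
    raise ValueError(f"range [{lo}, {hi}] does not fit in int64")


# Illustrative ranges only; check them against the actual column min/max.
print(smallest_int_dtype(-1, 3).__name__)
print(smallest_int_dtype(0, 40_000).__name__)
```

In the kernel, the arguments would come from something like <code>competition_data[col].min()</code> and <code>competition_data[col].max()</code>.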

The values of <code>prior_question_elapsed_time</code> fit comfortably in <code>np.float32</code>, but calculating the mean requires summing all the values, and that sum needs a wider type to stay accurate. So we compute the mean before changing the type.

In [None]:
print('mean calculated with original type (np.float64): ', competition_data.prior_question_elapsed_time.mean())
print('mean calculated with np.float32 type: ', competition_data.prior_question_elapsed_time.astype(np.float32).mean())
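
The effect can be reproduced in isolation with NumPy. This is a sketch on synthetic data, not on the competition column. It contrasts a running total kept in <code>float32</code> with a mean that keeps the <code>float32</code> storage but accumulates in <code>float64</code> through the <code>dtype</code> argument of <code>ndarray.mean</code>.

```python
import numpy as np

# Synthetic column: one million identical float32 values (illustrative, not competition data).
values32 = np.full(1_000_000, 25423.0, dtype=np.float32)

# Sequential accumulation in float32: cumsum keeps the input dtype,
# so the running total is rounded once it outgrows float32 precision.
naive_mean = values32.cumsum()[-1] / values32.size

# Same float32 storage, but the accumulator is float64.
safe_mean = values32.mean(dtype=np.float64)

print("float32 accumulator:", naive_mean)
print("float64 accumulator:", safe_mean)
```

With a <code>float64</code> accumulator the column can stay in the smaller type while the mean is still computed accurately.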

In [None]:
# Let's free a bit of memory by fitting some columns into smaller types

competition_data['content_id'] = competition_data['content_id'].astype(np.int16)
competition_data['content_type_id'] = competition_data['content_type_id'].astype(np.int8)    # code as in example_test
competition_data['task_container_id'] = competition_data['task_container_id'].astype(np.int16)
competition_data['answered_correctly'] = competition_data['answered_correctly'].astype(np.int8)
competition_data['user_answer'] = competition_data['user_answer'].astype(np.int8)
competition_data['prior_question_elapsed_time'] = competition_data['prior_question_elapsed_time'].astype(np.float32)
# Note: a plain bool column cannot hold missing values, so any nulls end up as True or False here
competition_data['prior_question_had_explanation'] = competition_data['prior_question_had_explanation'].astype('bool')

# We don't need row_id
competition_data = competition_data.drop(columns='row_id')

_ = gc.collect()

competition_data.info()

Wow, 2.6 GB instead of 4.6 GB.
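
To see where the savings come from, <code>DataFrame.memory_usage</code> reports bytes per column. The cell below is a sketch on a small synthetic frame. The column names mirror the competition data, but the values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic frame with default 64-bit columns (values are illustrative).
n = 100_000
frame = pd.DataFrame({
    "content_id": np.arange(n, dtype=np.int64) % 30_000,
    "answered_correctly": np.arange(n, dtype=np.int64) % 3 - 1,
    "prior_question_elapsed_time": np.linspace(0.0, 300_000.0, n),
})

before = frame.memory_usage(index=False)

small = frame.astype({
    "content_id": np.int16,
    "answered_correctly": np.int8,
    "prior_question_elapsed_time": np.float32,
})
after = small.memory_usage(index=False)

# Per-column bytes before and after downcasting.
print(pd.DataFrame({"before": before, "after": after}))
```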

<h2>Save data</h2>

In [None]:
# Save the data in pickle format for the next kernels

competition_data.to_pickle(os.path.join(WORKING_DIR, 'competition_data.pkl'))
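
One reason to prefer pickle over CSV here is that the reduced dtypes survive the round trip. The sketch below shows this on a small stand-in frame. The temporary path is hypothetical, and the next kernel would read from wherever this kernel's output is mounted.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for competition_data with already reduced dtypes.
frame = pd.DataFrame({
    "content_id": np.array([5692, 5716, 128], dtype=np.int16),
    "answered_correctly": np.array([1, 1, 0], dtype=np.int8),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "competition_data.pkl")  # hypothetical location
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The dtypes come back exactly as they were saved.
print(restored.dtypes)
```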

That's all, folks!