# Efficient dataframe loading with Datatable

First of all, let's start by quoting what [Sohier Dane](https://www.kaggle.com/sohier) said about the training dataset in the competition's starter [notebook](https://www.kaggle.com/sohier/competition-api-detailed-introduction).

> It's larger than will fit in memory with default settings, so we'll specify more efficient datatypes and only load a subset of the data for now.

After that, the notebook shows how to load the dataset efficiently by specifying data types. However, if you try to load the entire dataset that way using pandas, you will likely hit the RAM limit.
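As a rough illustration of that approach, the sketch below passes explicit dtypes and a row limit to `pd.read_csv`. The CSV text, its values, and the names `csv_text`, `sample_dtype` and `sample_df` are made up for this example; the real file is much larger and has more columns.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up stand-in for train.csv (hypothetical values).
csv_text = (
    "row_id,timestamp,user_id,content_id,answered_correctly\n"
    "0,0,101,500,1\n"
    "1,25000,101,501,0\n"
    "2,61000,102,17,1\n"
)

# Explicit, compact dtypes instead of the int64 default.
sample_dtype = {
    "row_id": np.int64,
    "timestamp": np.int64,
    "user_id": np.int32,
    "content_id": np.int16,
    "answered_correctly": np.int8,
}

# nrows limits how many data rows are parsed, so only a subset is loaded.
sample_df = pd.read_csv(io.StringIO(csv_text), dtype=sample_dtype, nrows=2)
print(sample_df.dtypes)
```

With the real file, the same two arguments keep memory low, but only because most rows are skipped.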

Inspired by [Vopani](https://www.kaggle.com/rohanrao)'s excellent [notebook](https://www.kaggle.com/rohanrao/riiid-with-blazing-fast-rid), we'll see how to load heavy .csv data using the [**Python datatable**](https://datatable.readthedocs.io/en/latest/index.html) package.

In [None]:
# Install the datatable package.
!pip install datatable

In [None]:
import pandas as pd
import datatable as dt
import gc
import numpy as np

## Loading with datatable

Loading .csv data with datatable and converting it to a pandas DataFrame is straightforward.

In [None]:
%%time
train_df = dt.fread("/kaggle/input/riiid-test-answer-prediction/train.csv").to_pandas()

We can see that the entire dataset was loaded in roughly 44 seconds, which is nice given the dataset size.

Now, let's check the dataframe's information.

In [None]:
train_df.info()

As we can see, **datatable** has automatically inferred some column types, in contrast with the rather conservative pandas data loading.
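For comparison, here is a small sketch of what pandas infers when no dtypes are given. The CSV text and the names `csv_text` and `default_df` are made up for this example. Integer columns come back as `int64` and a numeric column with a missing value as `float64`, regardless of how small the values are.

```python
import io

import pandas as pd

# Tiny made-up CSV (hypothetical values); the first row has a missing elapsed time.
csv_text = (
    "user_id,content_type_id,prior_question_elapsed_time\n"
    "1,0,\n"
    "1,0,21000.0\n"
    "2,1,18500.0\n"
)

# No dtype argument: pandas falls back to 64-bit types for numeric columns.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.dtypes)
```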

The entire dataset fits nicely in 4.6 GB without effort. But, as stated in the starter notebook, the data types can be further tweaked to reduce memory consumption.

In [None]:
# Target dtype for each column. The builtin bool replaces np.bool,
# which was deprecated in NumPy 1.20 and removed in NumPy 1.24.
dtype = {
    'row_id': np.int64, 'timestamp': np.int64, 'user_id': np.int32, 'content_id': np.int16, 'content_type_id': np.int8,
    'task_container_id': np.int16, 'user_answer': np.int8, 'answered_correctly': np.int8, 'prior_question_elapsed_time': np.float32,
    'prior_question_had_explanation': bool,
}
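Before casting, it can be worth checking that every integer column actually fits in its target type, since `astype` does not necessarily raise when values fall outside the range. The helper below is a minimal sketch of such a check; `columns_that_overflow`, `toy_df` and `toy_dtype` are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def columns_that_overflow(df, dtype):
    """Return the integer columns whose values do not fit the target dtype."""
    bad = []
    for col, target in dtype.items():
        target = np.dtype(target)
        if target.kind not in "iu":
            continue  # only integer targets have a fixed integer range to check
        info = np.iinfo(target)
        if df[col].min() < info.min or df[col].max() > info.max:
            bad.append(col)
    return bad


# Made-up values: 40000 does not fit in int16 (max 32767).
toy_df = pd.DataFrame({
    "content_id": [0, 13522, 40000],
    "user_answer": [-1, 0, 3],
})
toy_dtype = {"content_id": np.int16, "user_answer": np.int8}

print(columns_that_overflow(toy_df, toy_dtype))  # → ['content_id']
```

An empty list means the integer casts in the mapping are safe for the data at hand.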

In [None]:
# Cast one column at a time to the target dtype.
for col, col_type in dtype.items():
    train_df[col] = train_df[col].astype(col_type)
train_df.info()

We've now freed approximately 1.6 GB of RAM, which can be put to better use later on.
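To see where such savings come from, the sketch below measures a small synthetic frame before and after downcasting with `DataFrame.memory_usage`. The frame, its size, and the names `toy_df`, `before` and `after` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

n_rows = 100_000

# Synthetic frame stored with 64-bit integers, as a default load would give.
toy_df = pd.DataFrame({
    "user_id": np.arange(n_rows, dtype=np.int64),
    "answered_correctly": np.ones(n_rows, dtype=np.int64),
})
before = toy_df.memory_usage(index=False, deep=True).sum()

# Downcast: 8 bytes -> 4 bytes and 8 bytes -> 1 byte per value.
toy_df = toy_df.astype({"user_id": np.int32, "answered_correctly": np.int8})
after = toy_df.memory_usage(index=False, deep=True).sum()

print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

The same two `memory_usage` calls around the casting loop would report the saving on the full `train_df`.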

## Saving in binary format

As an additional step, we'll check the benefits of saving the processed data in a binary format.

Datatable uses the .jay format, which makes reading our dataset a breeze.

In [None]:
dt.Frame(train_df).to_jay("train_df.jay")

In [None]:
del train_df

In [None]:
gc.collect()

Now, we'll load our data back.

In [None]:
%%time
train_df = dt.fread("train_df.jay").to_pandas()

Cool! The entire dataset was loaded in an amazing 4.84 seconds (roughly 9 times faster than the 44-second .csv load). As a bonus, our data types were preserved.

In [None]:
train_df.info()

That's all folks!

This is the first of a series of short notebooks that I'm planning to make. The goal is to build the critical phases of an end-to-end project, step by step. So, stay tuned for the other kernels.

Have you found something useful? Please give it an upvote!