In this competition, many participants have difficulty handling the large `train.csv`. We have to use the Kaggle Notebook environment, which provides 16 GB of RAM.

In this notebook, I introduce `pandas.DataFrame.memory_usage()` and show how to filter columns when loading.

In [None]:
import pandas as pd

import riiideducation

In [None]:
# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

In [None]:
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                       low_memory=False,
                       nrows=10**6,
                       dtype={'row_id': 'int64',
                              'timestamp': 'int64',
                              'user_id': 'int32',
                              'content_id': 'int16',
                              'content_type_id': 'int8',
                              'task_container_id': 'int16',
                              'user_answer': 'int8',
                              'answered_correctly': 'int8',
                              'prior_question_elapsed_time': 'float32',
                              'prior_question_had_explanation': 'boolean',
                             }
                      )

At the end of the `train_df.info()` output, we can see that the memory usage is 31.5 MB.

>memory usage: 31.5 MB

In [None]:
train_df.info()

We can see the memory used by each column with `memory_usage()`. `row_id` and `timestamp` use much more memory than the others because they are stored as `int64`.

In [None]:
train_df.memory_usage()

In [None]:
train_df.memory_usage().plot.barh()
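
To make the per-column numbers easier to read, the following is a minimal sketch that divides `memory_usage()` by the row count to get bytes per row. It uses a small synthetic frame (`demo_df`, with made-up values) instead of the competition data, so only the dtypes are meaningful here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (assumption: values are made up,
# only the dtypes matter for this illustration).
n_rows = 1000
demo_df = pd.DataFrame({
    'row_id': np.arange(n_rows, dtype='int64'),
    'user_id': np.arange(n_rows, dtype='int32'),
    'content_id': np.arange(n_rows, dtype='int16'),
    'answered_correctly': np.ones(n_rows, dtype='int8'),
})

# index=False leaves out the index entry so that only the columns remain.
usage = demo_df.memory_usage(index=False)

# Bytes per row for each column: int64 -> 8, int32 -> 4, int16 -> 2, int8 -> 1.
bytes_per_row = usage / n_rows
print(bytes_per_row)
```

For fixed-width numeric dtypes the cost per row is simply the dtype's item size, which is why the two `int64` columns dominate the bar chart.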

If you only use specific columns, as in the notebook I published, you can load just those columns with `usecols`.

https://www.kaggle.com/sishihara/riiid-answered-correctly-benchmark

In [None]:
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv',
                       low_memory=False,
                       nrows=10**6,
                       usecols=['content_id', 'answered_correctly'],
                       dtype={'content_id': 'int16', 'answered_correctly': 'int8'}
                      )

In [None]:
train_df.info()

The memory usage was reduced to 2.9 MB. This means that you can increase `nrows`.
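
As a rough way to decide how far `nrows` can go, the sketch below estimates how many rows of the two selected columns fit into a given amount of memory. The 1 GiB budget (`memory_budget_bytes`) is a hypothetical figure chosen for illustration, and the estimate covers only the final DataFrame, not the temporary memory `read_csv` needs while parsing.

```python
import numpy as np

# The same dtypes as in the usecols example above.
dtypes = {'content_id': 'int16', 'answered_correctly': 'int8'}

# Item size in bytes of each dtype, summed per row: 2 + 1 = 3.
bytes_per_row = sum(np.dtype(t).itemsize for t in dtypes.values())

# Hypothetical budget for this DataFrame (assumption, not a Kaggle limit).
memory_budget_bytes = 1 * 1024**3

# Upper bound on the number of rows that fit into the budget.
max_rows = memory_budget_bytes // bytes_per_row
print(bytes_per_row, max_rows)
```

At 3 bytes per row, 10**6 rows take about 3,000,000 bytes, which is consistent with the roughly 2.9 MB reported by `info()`.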

Another option is to choose an appropriate type for each column in `dtype`, guided by the results of `memory_usage()`.
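
One possible way to find a suitable integer type is to let pandas pick the smallest one that fits the data with `pd.to_numeric(..., downcast='integer')`. The sketch below applies it to a tiny frame with made-up values (`demo_df` is hypothetical, not the competition data); on real data you would run it on a sample and then pass the resulting types to `dtype` in `read_csv`.

```python
import pandas as pd

# Made-up values for illustration; pandas stores these as int64 by default.
demo_df = pd.DataFrame({
    'content_id': [0, 5692, 32736],
    'user_answer': [-1, 0, 3],
    'timestamp': [0, 56943, 87425772049],
})

# Downcast each column to the smallest signed integer type that holds its values.
downcast_df = pd.DataFrame({
    col: pd.to_numeric(demo_df[col], downcast='integer')
    for col in demo_df.columns
})

print(downcast_df.dtypes)
print(demo_df.memory_usage(index=False).sum(),
      downcast_df.memory_usage(index=False).sum())
```

Note that the downcast type depends only on the values seen, so a sample that misses the largest value could suggest a type that is too small for the full file.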