## Motivation

Suppose we have features keyed by (user_id × content_id) pairs, saved as CSV files:

In [None]:
%%time
import cudf

types = {
        'row_id': 'int64',
        'timestamp': 'int64',
        'user_id': 'int32',
        'content_id': 'int16',
        'content_type_id': 'int8',
        'task_container_id': 'int16',
        'user_answer': 'int8',
        'answered_correctly': 'int8',
        'prior_question_elapsed_time': 'float32',
        'prior_question_had_explanation': 'int8'
}
datapath = '/kaggle/input/riiid-test-answer-prediction/train.csv'

train_X = cudf.read_csv(datapath, dtype=types)
train_X = train_X[train_X['content_type_id'] == 0]
feat_float = train_X.groupby(['user_id', 'content_id'])['answered_correctly'].mean().astype('float32')
feat_int = train_X.groupby(['user_id', 'content_id'])['answered_correctly'].count().astype('int16')
# del train_X

print('number of keys : ', len(feat_int))
feat_float.reset_index().to_csv('user_content_wise_float.csv', index=False)
feat_int.reset_index().to_csv('user_content_wise_int.csv', index=False)

In [None]:
train_X.user_id.nunique(), train_X.content_id.nunique()

In [None]:
!pip install memory_profiler
%load_ext memory_profiler

In [None]:
def all_dict():
    """Build a nested dict: user_id -> {content_id: mean of answered_correctly}."""
    num_user_id = len(feat_float.index.get_level_values(level=0).unique())

    print('num_user_id :', num_user_id)

    feat_dict = dict()
    # Process one user at a time: copy that user's rows from GPU (cudf) to host (pandas)
    for user_id, data in feat_float.groupby(level=0):
        df = data.to_frame().to_pandas()
        feat_dict[user_id] = df.reset_index().drop('user_id', axis=1).set_index('content_id').to_dict()['answered_correctly']
        del df
    return feat_dict

%memit feat_dict = all_dict()
del feat_dict

The result is a huge dict with 86,867,031 keys. Stored as a regular Python dict with 64-bit float values, its estimated RAM usage is 8588 MiB.

That is too much to keep in RAM as an ordinary Python dict, but we can make it fit by borrowing the power of parallel-hashmap, a hash map implemented in C++, and exposing it to Python through pybind11.

In [None]:
del feat_float, feat_int

path_to_parallelmap_folder = "/kaggle/input/parallel-hashmap"
path_to_cppfile = "/kaggle/input/pybind11demo/mydicts.cpp"
module_name = "my_module"

## Compiling and importing

We first need to compile the cpp file with the following command. And no, I don't fully understand what this long command does either. See the [pybind11 tutorial](https://pybind11.readthedocs.io/en/stable/basics.html) for details.

In [None]:
!c++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` $path_to_cppfile -I$path_to_parallelmap_folder -o $module_name`python3-config --extension-suffix`
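The last part of the command, `python3-config --extension-suffix`, appends a platform-specific suffix to the module name, which is why the output file is not simply `my_module.so`. As a small sketch, the same suffix can usually be read from Python's standard `sysconfig` module, so we can predict the file name the compile step should produce. The variable names `ext_suffix` and `expected_filename` are only for illustration.

```python
import sysconfig

module_name = "my_module"  # same module name as used in this notebook

# EXT_SUFFIX is the suffix CPython expects for compiled extension modules,
# for example something like ".cpython-XY-<platform>.so" on Linux.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

# The compile command writes its output to module_name + suffix.
expected_filename = f"{module_name}{ext_suffix}"
print(expected_filename)
```

If the import fails later, comparing this name with the output of `!ls` is a quick first check.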

Now we can import the classes defined in the cpp file (``mydicts.cpp``).

In [None]:
!ls

In [None]:
from my_module import my_dict_int, my_dict_float

## How to use

To instantiate the hash map, we pass the path of the CSV file to the constructor. The C++ side then parses the CSV file and builds the hash map.
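For intuition only, here is a pure-Python sketch of what such a constructor has to do. It is not the code in ``mydicts.cpp``: the function `load_pairs` is hypothetical, and the sample rows are made up. The only thing taken from this notebook is the column layout (`user_id`, `content_id`, `answered_correctly`) written by `to_csv` above.

```python
import csv
import io


def load_pairs(fileobj, cast):
    """Read rows of (user_id, content_id, value) into a dict keyed by the pair."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row: user_id,content_id,answered_correctly
    table = {}
    for user_id, content_id, value in reader:
        table[(int(user_id), int(content_id))] = cast(value)
    return table


# Made-up sample data with the same column layout as the CSV files above.
sample = io.StringIO(
    "user_id,content_id,answered_correctly\n"
    "115,5692,1.0\n"
    "115,5716,0.5\n"
)
table = load_pairs(sample, float)
print(table[(115, 5716)])
```

The C++ version does the same job, but stores the pairs in a parallel-hashmap instead of a Python dict.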

In ``mydicts.cpp``, I defined two operations:

* ``setval(user_id, content_id, value)`` of my_dict_int/float ⇔ ``update({(user_id, content_id): value})`` of Python 3 dict
* ``getval(user_id, content_id)`` of my_dict_int/float ⇔ ``setdefault((user_id, content_id), 0/0.0)`` of Python 3 dict

These operations just reflect my taste; you can modify the cpp file (mydicts.cpp) and use any API you prefer.
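To make the correspondence with the Python dict concrete, here is a minimal pure-Python stand-in with the same two methods. The class name `PyDictStandIn` is hypothetical and this is not how ``mydicts.cpp`` is implemented; it only mirrors the `update` / `setdefault` behavior described above.

```python
class PyDictStandIn:
    """Pure-Python stand-in mirroring the setval/getval semantics described above."""

    def __init__(self, default):
        self._data = {}
        self._default = default  # 0 for the int variant, 0.0 for the float variant

    def setval(self, user_id, content_id, value):
        # Same effect as dict.update({(user_id, content_id): value})
        self._data.update({(user_id, content_id): value})

    def getval(self, user_id, content_id):
        # Same effect as dict.setdefault((user_id, content_id), default):
        # a missing key is inserted with the default value and then returned.
        return self._data.setdefault((user_id, content_id), self._default)


d = PyDictStandIn(default=0)
d.setval(user_id=5, content_id=16, value=123)
print(d.getval(5, 16))
print(d.getval(user_id=7, content_id=1))  # missing key: returns the default
print(len(d._data))  # the missing key was inserted by getval
```

Note that, following the `setdefault` equivalence, looking up a missing key in this stand-in also inserts it, so repeated lookups of unseen pairs grow the map.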

You can also use any data format (not only CSV), as long as you can parse it on the C++ side. In that case, modify the constructor of each class in ``mydicts.cpp`` accordingly.

## Memory/time efficiency

[The parallel hashmap](https://github.com/greg7mdp/parallel-hashmap) is really RAM-friendly: 

* ~700MiB for each 16-bit integer (user_id × content_id)-wise feature
* ~1300MiB for each 32-bit float (user_id × content_id)-wise feature
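As a rough back-of-the-envelope check, we can turn the figures reported in this notebook into bytes per key. This uses only the key count and the memory figures quoted above; since the hash map figures are approximate, the per-key numbers are rough as well.

```python
MIB = 1024 ** 2
NUM_KEYS = 86_867_031  # number of (user_id, content_id) keys reported above

# Memory figures quoted in this notebook, in MiB.
reported_mib = {
    "Python dict (64-bit float values)": 8588,
    "parallel hashmap (16-bit int values)": 700,
    "parallel hashmap (32-bit float values)": 1300,
}

# Convert each figure to bytes and divide by the number of keys.
bytes_per_key = {name: mib * MIB / NUM_KEYS for name, mib in reported_mib.items()}

for name, value in bytes_per_key.items():
    print(f"{name}: ~{value:.1f} bytes per key")
```

The Python dict comes out at roughly a hundred bytes per key, while the hash map figures work out to well under twenty.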

In [None]:
%%time
%memit user_content_feat_float = my_dict_float("/kaggle/working/user_content_wise_float.csv")

In [None]:
%%time
%memit user_content_feat_int = my_dict_int("/kaggle/working/user_content_wise_int.csv")

Accessing and modifying the value is very fast:

In [None]:
%timeit user_content_feat_int.setval(user_id=5, content_id=16, value=123)
%timeit user_content_feat_int.getval(user_id=5, content_id=16)

print(user_content_feat_int.getval(user_id=5, content_id=16))

%timeit user_content_feat_float.setval(user_id=5, content_id=16, value=123.193)
%timeit user_content_feat_float.getval(user_id=5, content_id=16)

print(user_content_feat_float.getval(user_id=5, content_id=16))

## Positional/keyword arguments

We can use both positional and keyword arguments, thanks to pybind11:

In [None]:
print(user_content_feat_int.getval(user_id=2147482888, content_id=9788))

assert user_content_feat_int.getval(user_id=2147482888, content_id=9788) == user_content_feat_int.getval(2147482888, 9788)