# Memory trick to load a lot of data

If you have ever run into memory problems in this challenge or another, this notebook should help you better understand how to deal with large datasets.

Every dataset you load is stored in memory, which is limited.
When using pandas, every float (decimal number) is stored as float64 by default.

A float64 uses 64 bits (8 bytes) to store a single value and ranges between about -1.80e+308 and 1.80e+308.
But the Jane Street dataset never uses such huge numbers!
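
To check this kind of claim on your own data, you can compare a column's minimum and maximum against the limits NumPy reports for each float type. The snippet below is a minimal sketch: the `values` array is made up for illustration and stands in for a real feature column.

```python
import numpy as np

# Hypothetical sample values standing in for one feature column.
values = np.array([-3.2, 0.0, 1.5, 74.4], dtype=np.float64)

f64 = np.finfo(np.float64)
f16 = np.finfo(np.float16)

print("float64 range:", f64.min, f64.max)
print("float16 range:", f16.min, f16.max)

# True when every non-NaN value lies inside the float16 range.
fits_in_float16 = bool(
    (np.nanmin(values) >= f16.min) and (np.nanmax(values) <= f16.max)
)
print("fits in float16:", fits_in_float16)
```

Values outside the float16 range would become `inf` after the conversion, so this check is worth running before downcasting a column.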

A float16 uses 16 bits (2 bytes) of memory and ranges between -65504 and 65504, with roughly 3 significant decimal digits of precision.
If you use it instead of float64, you save 75% of the memory taken by those columns.
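
The saving is easy to verify on a plain NumPy array. This is a small sketch with an arbitrary array size, not a measurement on the real dataset.

```python
import numpy as np

n_values = 1_000_000  # arbitrary size, for illustration only
as_float64 = np.zeros(n_values, dtype=np.float64)
as_float16 = as_float64.astype(np.float16)

# Fraction of memory saved by the downcast.
saving = 1 - as_float16.nbytes / as_float64.nbytes
print("float64: %d bytes" % as_float64.nbytes)  # 8000000
print("float16: %d bytes" % as_float16.nbytes)  # 2000000
print("saving: %.0f%%" % (saving * 100))        # 75%
```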

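Moreover, arithmetic on float16 can be cheaper than on float64 on hardware with native half-precision support, such as many GPUs.
On a CPU, NumPy typically has no native float16 arithmetic, so the gain there is mostly in memory rather than in speed.
With less data to move around, your training and prediction can then be faster!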
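
Whether float16 arithmetic is actually faster depends on your hardware and library, so it is worth measuring. Here is a rough sketch using `time.perf_counter`; the helper `time_square` and the array size are only illustrative, and the timings will differ from one machine to another.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.standard_normal(1_000_000)  # float64 by default
a16 = a64.astype(np.float16)

def time_square(array, repeats=5):
    """Return the best wall-clock time (in seconds) of array * array over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _ = array * array
        best = min(best, time.perf_counter() - start)
    return best

t64 = time_square(a64)
t16 = time_square(a16)
print("float64: %.6f s" % t64)
print("float16: %.6f s" % t16)
```

If float16 turns out slower on your machine, you still keep the memory saving and can convert to float32 right before the heavy computations.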

In [None]:
import pandas as pd
import numpy as np
import time

Now, here is the code I use to load the dataset:

In [None]:
class CFG:
    TRAIN_SIZE = 0.2
    MIN_WEIGHT = 0
    NROWS = None
    FILL_NA = -999

In [None]:
# Loading the float columns as float16 takes 75% less memory than float64,
# so we can train models faster and avoid RAM errors

features_columns = ["feature_%d" % i for i in range(130)]
resp_columns = ["resp_1", "resp_2", "resp_3", "resp_4", "resp"]
# Map every feature and response column to float16
columns_dtypes = {column: "float16" for column in features_columns + resp_columns}

print("Loading dataset...")
dataset = pd.read_csv("/kaggle/input/jane-street-market-prediction/train.csv", delimiter=",", nrows=CFG.NROWS, dtype=columns_dtypes)
dataset = dataset[dataset.weight > CFG.MIN_WEIGHT]
print("Done!")

print("Splitting train/test dataset...")
train_number_items = int(dataset.shape[0] * CFG.TRAIN_SIZE)
train = dataset[:train_number_items]
test = dataset[train_number_items:]  # start right after the last train row, without skipping one
print("Done!")

print("Filling NaN values...")
train = train.fillna(CFG.FILL_NA)
test = test.fillna(CFG.FILL_NA)
print("Done!")

print("Preparing X and y...")
X_train = train[features_columns].to_numpy()
X_test = test[features_columns].to_numpy()
y_train = np.where(train["resp"] > 0, 1, 0)
y_test = np.where(test["resp"] > 0, 1, 0)
d_train = train["date"].to_numpy()
d_test = test["date"].to_numpy()
w_train = train["weight"].to_numpy()
w_test = test["weight"].to_numpy()
r_train = train["resp"].to_numpy()
r_test = test["resp"].to_numpy()
resp_train = train[["resp_1", "resp_2", "resp_3", "resp_4"]].to_numpy()
resp_test = test[["resp_1", "resp_2", "resp_3", "resp_4"]].to_numpy()

resp_label_train = np.where(resp_train > 0, 1, 0)
resp_label_test = np.where(resp_test > 0, 1, 0)

print("Done!")

print("Deleting unused variables...")
del dataset
del train
del test
print("Done!")

print("Train/test sizes: %d/%d" % (len(X_train), len(X_test)))
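
To see what the downcast buys you on a DataFrame, and what it costs in precision, you can compare `memory_usage` before and after. The sketch below uses a small synthetic frame rather than the real `train.csv`; the column names only mimic the dataset.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real data.
rng = np.random.default_rng(0)
frame64 = pd.DataFrame(
    rng.standard_normal((10_000, 3)),
    columns=["feature_0", "feature_1", "resp"],
)
frame16 = frame64.astype("float16")

# Bytes used by the columns, ignoring the index.
bytes64 = frame64.memory_usage(index=False).sum()
bytes16 = frame16.memory_usage(index=False).sum()
print("float64 frame: %d bytes" % bytes64)  # 240000
print("float16 frame: %d bytes" % bytes16)  # 60000

# Largest absolute rounding error introduced by the downcast.
max_error = (frame64 - frame16.astype("float64")).abs().to_numpy().max()
print("max rounding error: %.6f" % max_error)
```

The rounding error is small for values of this magnitude, but it grows with the size of the numbers, so check it on your own columns before relying on float16.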