Poor performance of OneHotEncoder for category_encoders version >=2.0.0 #362

DSOTM-pf opened this issue Jul 12, 2022 · 3 comments

Expected Behavior

Similar memory usage across the different category_encoders versions, or better performance for newer versions.

Actual Behavior

According to the experiment results, memory usage is roughly three times higher (896 MB vs. 288 MB) for category_encoders versions >= 2.0.0 than for version 1.3.0.

Memory (MB)   Version
896           2.3.0
896           2.2.2
896           2.1.0
896           2.0.0
288           1.3.0

Steps to Reproduce the Problem

Step 1: download the dataset:
train & test (63 MB)
Step 2: install category_encoders

pip install category_encoders==<version>
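
To make sure each run actually measures the intended release, the active version can be checked from Python before profiling (recent category_encoders releases expose __version__; for older ones, pip show category_encoders works too):

import category_encoders as ce
print(ce.__version__)  # confirm the version under test before profiling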

Step 3: run the script below for each category_encoders version and record the memory usage

import pandas as pd
import category_encoders as ce
import tracemalloc

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_train.drop("id", axis=1, inplace=True)
df_test.drop("id", axis=1, inplace=True)
cat_labels = [f"cat{i}" for i in range(10)]

tracemalloc.start()
onehot_encoder = ce.one_hot.OneHotEncoder()
# fit on the combined categorical columns of train and test, then transform each split
onehot_encoder.fit(pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0))
train_ohe = onehot_encoder.transform(df_train[cat_labels])
test_ohe = onehot_encoder.transform(df_test[cat_labels])

current, peak = tracemalloc.get_traced_memory()
print(f"OneHotEncoder memory usage is {current / 1024 / 1024:.1f} MB; peak memory was {peak / 1024 / 1024:.1f} MB")

Specifications

  • Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
  • Platform: Ubuntu 16.04
  • OS: Ubuntu
  • CPU: Intel(R) Core(TM) i9-9900K CPU
  • GPU: TITAN V
@PaulWestenthanner
Collaborator

Thanks for the issue report. Off the top of my head, I'd guess this is because most encoders convert the input to a DataFrame and create a deep copy of it; maybe that wasn't the case yet in old versions. I'd need some time to check whether this is really the reason. I'm also not sure the deep copies can safely be removed; there was probably a reason to add them in the first place.
If you want to investigate, feel free; otherwise I'll have a look and keep you posted.
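
If the deep copy is the culprit, the overhead should be roughly one extra copy of the categorical input on top of the encoded output. A quick way to gauge that (a sketch, assuming the same train.csv / test.csv from the report) is to compare the peak traced memory with the in-memory size of the input the encoder sees:

import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
cat_labels = [f"cat{i}" for i in range(10)]
X = pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0)

# deep=True also counts the string/object data, not just the pointers
input_mb = X.memory_usage(deep=True).sum() / 1024 / 1024
print(f"categorical input size: {input_mb:.1f} MB")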

@DSOTM-pf
Author

DSOTM-pf commented Jul 19, 2022

Hi, thanks for your quick reply!
I have observed the same memory usage issue with WOEEncoder (#364). I tried to track down the root cause of the memory increase in these two encoders by reading through the code changes in version 2.0.0, but that turned out to be impractical given the number of changes. I will take your suggestion and check whether it is due to the deep copy.

@PaulWestenthanner
Collaborator

X = X_in.copy(deep=True)

That should be the relevant line.
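
To check whether that copy alone explains the gap, it can be reproduced outside the encoder on the same input and traced in isolation (a sketch; the copy below only mimics what the encoder does internally):

import pandas as pd
import tracemalloc

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
cat_labels = [f"cat{i}" for i in range(10)]
X_in = pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0)

tracemalloc.start()
X = X_in.copy(deep=True)  # the suspected line, run in isolation
current, peak = tracemalloc.get_traced_memory()
print(f"deep copy alone: current {current / 1024 / 1024:.1f} MB, peak {peak / 1024 / 1024:.1f} MB")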
