Poor performance of OneHotEncoder for category_encoders version >=2.0.0 #362

DSOTM-pf opened this issue Jul 12, 2022 · 3 comments

Expected Behavior

Similar memory usage across the different category_encoders versions, or better performance for newer versions.

Actual Behavior

According to the experiment results, memory usage is roughly three times higher (896 MB vs. 288 MB) for category_encoders versions >= 2.0.0 than for version 1.3.0.

Memory (MB)   Version
896           2.3.0
896           2.2.2
896           2.1.0
896           2.0.0
288           1.3.0

Steps to Reproduce the Problem

Step 1: download the dataset:
train & test (63 MB)
Step 2: install category_encoders

pip install category_encoders==<version>
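
To make sure each run actually measures the intended release, the active version can be checked from Python before profiling (recent category_encoders releases expose __version__; for older ones, pip show category_encoders works too):

import category_encoders as ce
print(ce.__version__)  # confirm the version under test before profiling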

Step 3: run the script below for each category_encoders version and record the memory usage

import pandas as pd
import category_encoders as ce
import tracemalloc

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_train.drop("id", axis=1, inplace=True)
df_test.drop("id", axis=1, inplace=True)
cat_labels = [f"cat{i}" for i in range(10)]

tracemalloc.start()
onehot_encoder = ce.one_hot.OneHotEncoder()
# fit on the combined categorical columns of train and test, then transform each split
onehot_encoder.fit(pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0))
train_ohe = onehot_encoder.transform(df_train[cat_labels])
test_ohe = onehot_encoder.transform(df_test[cat_labels])

current, peak = tracemalloc.get_traced_memory()
print(f"OneHotEncoder memory usage is {current / 1024 / 1024:.1f} MB; peak memory was {peak / 1024 / 1024:.1f} MB")

Specifications

  • Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
  • Platform: Ubuntu 16.04
  • OS: Ubuntu
  • CPU: Intel(R) Core(TM) i9-9900K CPU
  • GPU: TITAN V
@PaulWestenthanner
Collaborator

Thanks for the issue report. Off the top of my head, I'd guess this is because most encoders convert the input to a DataFrame and create a deep copy of it; maybe that wasn't the case yet in old versions. I'd need some time to check whether this is really the reason. I'm also not sure the deep copies can safely be removed; there was probably a reason to add them in the first place.
If you want to investigate, feel free; otherwise I'll have a look and keep you posted.
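
If the deep copy is the culprit, the overhead should be roughly one extra copy of the categorical input on top of the encoded output. A quick way to gauge that (a sketch, assuming the same train.csv / test.csv from the report) is to compare the peak traced memory with the in-memory size of the input the encoder sees:

import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
cat_labels = [f"cat{i}" for i in range(10)]
X = pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0)

# deep=True also counts the string/object data, not just the pointers
input_mb = X.memory_usage(deep=True).sum() / 1024 / 1024
print(f"categorical input size: {input_mb:.1f} MB")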

@DSOTM-pf
Author

DSOTM-pf commented Jul 19, 2022

Hi, thanks for your quick reply!
I have observed the same memory usage issue with WOEEncoder (#364). I tried to track down the root cause of the memory increase in these two encoders by reading through the code changes in version 2.0.0, but that turned out to be impractical given the number of changes. I will take your suggestion and check whether it is due to the deep copy.

@PaulWestenthanner
Collaborator

X = X_in.copy(deep=True)

That should be the relevant line.
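
To check whether that copy alone explains the gap, it can be reproduced outside the encoder on the same input and traced in isolation (a sketch; the copy below only mimics what the encoder does internally):

import pandas as pd
import tracemalloc

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
cat_labels = [f"cat{i}" for i in range(10)]
X_in = pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0)

tracemalloc.start()
X = X_in.copy(deep=True)  # the suspected line, run in isolation
current, peak = tracemalloc.get_traced_memory()
print(f"deep copy alone: current {current / 1024 / 1024:.1f} MB, peak {peak / 1024 / 1024:.1f} MB")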
