Gaussian Copula – Memory Issue in Release 0.10.0 #459
Hi @dyuliu, could you provide the code and/or dataset you used, as well as the result of …?
Hi @fealho, the data types are:

```
A    datetime64[ns]
B    object
C    object
D    object
E    float64
F    int64
G    int64
dtype: object
```

The counts of unique values of the three categorical variables:
The stats of the three numeric columns E, F, G:

I provide a simplified version of my code that replicates exactly the same memory issue:

```python
from sdv.constraints import UniqueCombinations
from sdv.tabular import GaussianCopula

unique_info1 = UniqueCombinations(
    columns=['B', 'C'],
    handling_strategy='transform'
)
constraints = [
    unique_info1
]

syn_gen = GaussianCopula(
    constraints=constraints,
    categorical_transformer='categorical_fuzzy',
)
```

This code works perfectly fine in 0.9.0 but runs into a memory issue in 0.10. The reason is the use of …
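For completeness, the fit step that triggers the problem would look roughly like this; the file name and the `data` variable are my placeholders for the (152711, 7) training DataFrame described above:

```python
import pandas as pd

# Hypothetical load of the (152711, 7) training DataFrame
# with the dtypes listed above; name and source are placeholders.
data = pd.read_csv('training_data.csv', parse_dates=['A'])

syn_gen.fit(data)  # reported to exhaust memory in 0.10.0, fine in 0.9.0
```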
I think the reason stems from this function:

```python
uppers = ndtr((X[:, None] - self._model.dataset) / stdev)
```

`X[:, None]` is of shape (152711, 1). Therefore, broadcasting it against `self._model.dataset` (one entry per training row) produces an intermediate array with 152711 × 152711 entries. Let's say one float (f8) takes up 64 bits, so that array needs 152711 × 152711 × 64 / 8 / 1024 / 1024 / 1024 ≈ 174 GB.
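A quick back-of-the-envelope check (my own sketch, not code from copulas) confirms the scale of that intermediate array:

```python
import numpy as np

n = 152_711                   # training rows, also the points the CDF is evaluated at
elements = n * n              # the (n, 1) - (n,) broadcast yields an (n, n) array
gib = elements * 8 / 1024**3  # float64 = 8 bytes per element
print(f"{gib:.1f} GiB")       # ~173.8 GiB
```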
Thanks for the pointer @dyuliu! Indeed, the problem seems to come from the model selection inside copulas, and more specifically from the Gaussian KDE CDF computation, which is not memory efficient. We will open an issue there to work on it. In this case, however, what is being trained is the columns model that will be used only to populate missing columns during conditional sampling, so searching for the best distribution is overkill. @fealho already implemented a workaround that caps the distribution of the internal columns model to a plain Gaussian, which in this case should be more than good enough, so after PR #473 is merged the memory issue should vanish. In the meantime, would you be able to install SDV from the corresponding branch and give it a try on your end?
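For intuition, here is a minimal sketch (mine, not the copulas implementation) of why a plain Gaussian sidesteps the problem: its CDF needs only two fitted parameters, while a naively vectorized KDE CDF compares every evaluation point with every training point. Evaluating the KDE in chunks would bound the memory instead:

```python
import numpy as np
from scipy.special import ndtr

def gaussian_cdf(x, mu, sigma):
    # Plain Gaussian: two fitted parameters, O(n) time and memory.
    return ndtr((x - mu) / sigma)

def kde_cdf_chunked(x, train, bandwidth, chunk=1024):
    # Gaussian KDE CDF: the average of ndtr over all training points.
    # Chunking keeps the intermediate array at (chunk, len(train))
    # instead of (len(x), len(train)).
    out = np.empty(len(x))
    for start in range(0, len(x), chunk):
        block = x[start:start + chunk]
        out[start:start + chunk] = ndtr(
            (block[:, None] - train) / bandwidth
        ).mean(axis=1)
    return out
```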
Sounds good. I will give it a test using a source install.
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
Steps to reproduce
My training data's shape is (152711, 7). I am using `categorical_transformer='categorical_fuzzy'`.
I simply fit the Gaussian Copula model and then encounter the error described before.
I downgraded SDV to 0.9.1 and the memory error is gone.
I must use 0.10.0 because it supports specifying constraints and conditions at the same time.
But in this version I encounter the memory error, which should already have been fixed in version 0.9.0 (https://github.com/sdv-dev/SDV/releases/tag/v0.9.0).
Can someone help me investigate why the memory error appears again?