
Gaussian Copula – Memory Issue in Release 0.10.0 #459

Closed
dyuliu opened this issue Jun 3, 2021 · 5 comments · Fixed by #473
Labels
bug Something isn't working

Comments

@dyuliu
Contributor

dyuliu commented Jun 3, 2021

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.10.0
  • Python version: 3.6
  • Operating System: linux

Error Description

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-8-7aa0aed49c92> in <module>
      6 
      7 print(train_df.shape)
----> 8 syn_gen.fit(train_df)
      9 
     10 end = time.time()

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/tabular/base.py in fit(self, data)
    109                      self._metadata.name, data.shape)
    110         if not self._metadata_fitted:
--> 111             self._metadata.fit(data)
    112 
    113         self._num_rows = len(data)

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/metadata/table.py in fit(self, data)
    489 
    490         LOGGER.info('Fitting constraints for table %s', self.name)
--> 491         constrained = self._fit_transform_constraints(data)
    492         extra_columns = set(constrained.columns) - set(data.columns)
    493 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/metadata/table.py in _fit_transform_constraints(self, data)
    353             self._constraints[idx] = constraint
    354 
--> 355             data = constraint.fit_transform(data)
    356 
    357         return data

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/constraints/base.py in fit_transform(self, table_data)
    232                 Transformed data.
    233         """
--> 234         self.fit(table_data)
    235         return self.transform(table_data)
    236 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/constraints/base.py in fit(self, table_data)
    144             transformed_data = self._hyper_transformer.fit_transform(data_to_model)
    145             self._columns_model = GaussianMultivariate()
--> 146             self._columns_model.fit(transformed_data)
    147 
    148         return self._fit(table_data)

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/__init__.py in decorated(self, X, *args, **kwargs)
    214             raise ValueError('There are nan values in your data.')
    215 
--> 216         return function(self, X, *args, **kwargs)
    217 
    218     return decorated

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/multivariate/gaussian.py in fit(self, X)
    106 
    107             univariate = get_instance(distribution)
--> 108             univariate.fit(column)
    109 
    110             columns.append(column_name)

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/base.py in fit(self, X)
    223             selection_sample = X
    224 
--> 225         self._instance = select_univariate(selection_sample, self.candidates)
    226         self._instance.fit(X)
    227 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/selection.py in select_univariate(X, candidates)
     24             instance = get_instance(model)
     25             instance.fit(X)
---> 26             ks, _ = kstest(X, instance.cdf)
     27             if ks < best_ks:
     28                 best_ks = ks

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/scipy/stats/stats.py in kstest(rvs, cdf, args, N, alternative, mode)
   6892     xvals, yvals, cdf = _parse_kstest_args(rvs, cdf, args, N)
   6893     if cdf:
-> 6894         return ks_1samp(xvals, cdf, args=args, alternative=alternative, mode=mode)
   6895     return ks_2samp(xvals, yvals, alternative=alternative, mode=mode)
   6896 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/scipy/stats/stats.py in ks_1samp(x, cdf, args, alternative, mode)
   6264     N = len(x)
   6265     x = np.sort(x)
-> 6266     cdfvals = cdf(x, *args)
   6267 
   6268     if alternative == 'greater':

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/base.py in cdf(self, X)
    316                 Cumulative distribution values for points in X.
    317         """
--> 318         return self.cumulative_distribution(X)
    319 
    320     def percent_point(self, U):

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/gaussian_kde.py in cumulative_distribution(self, X)
    103         stdev = np.sqrt(self._model.covariance[0, 0])
    104         lower = ndtr((self._get_bounds()[0] - self._model.dataset) / stdev)[0]
--> 105         uppers = ndtr((X[:, None] - self._model.dataset) / stdev)
    106         return (uppers - lower).dot(self._model.weights)
    107 

MemoryError: Unable to allocate 174. GiB for an array with shape (152711, 152711) and data type int64

Steps to reproduce

My training data's shape is (152711, 7), and I am using categorical_transformer='categorical_fuzzy'.

I simply fit the Gaussian Copula model and then hit the error described above.

When I downgrade SDV to 0.9.1, the memory error is gone.

I must use 0.10.0 because it supports specifying constraints and conditions at the same time. But with this version I hit a memory error that was supposedly fixed in version 0.9.0 (https://github.com/sdv-dev/SDV/releases/tag/v0.9.0).

Can someone help me investigate why the memory error has reappeared?

@dyuliu added the bug and pending review labels on Jun 3, 2021
@fealho
Member

fealho commented Jun 9, 2021

Hi @dyuliu, could you provide the code and/or dataset you used, as well as the result of pip freeze? That would help us figure out the error faster.

@dyuliu
Contributor Author

dyuliu commented Jun 11, 2021

Hi @fealho ,

The data is

A                    datetime64[ns]
B                    object
C                    object
D                    object
E                    float64
F                    int64
G                    int64
dtype: object

The counts of unique values of three categorical variables:

  • B 2
  • C 51
  • D 18

The stats of the three numeric columns E, F, G:
[screenshot of summary statistics omitted]

Here is a simplified version of my code that replicates exactly the same memory issue. Relevant code to train the synthetic model:

from sdv.constraints import UniqueCombinations
from sdv.tabular import GaussianCopula

# Only valid combinations of columns B and C should appear in the output
unique_info1 = UniqueCombinations(
    columns=['B', 'C'],
    handling_strategy='transform'
)

constraints = [
    unique_info1
]

syn_gen = GaussianCopula(
    constraints=constraints,
    categorical_transformer='categorical_fuzzy',
)

syn_gen.fit(train_df)  # raises the MemoryError shown above

This code works perfectly fine in 0.9.0 but runs into a memory issue in 0.10.

The cause is the use of UniqueCombinations. If I remove it, things look fine, though I haven't run training to completion because the whole process may take several hours.

@dyuliu
Contributor Author

dyuliu commented Jun 24, 2021

I think the problem stems from the cumulative_distribution function in gaussian_kde.py in the copulas library.

uppers = ndtr((X[:, None] - self._model.dataset) / stdev)

X[:, None] is of shape (152711, 1)
self._model.dataset is of shape (1, 152711)

Therefore, X[:, None] - self._model.dataset broadcasts to an ndarray of shape (152711, 152711).

Each 64-bit element takes 8 bytes, so the total memory cost is 152711 * 152711 * 8 / 1024**3 ≈ 173.75 GiB, matching the 174 GiB reported in the error message.
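The arithmetic above, plus a chunked variant of the same broadcast, can be sketched as follows. The kde_cdf_chunked function is a hypothetical illustration of a memory-bounded evaluation, not code from copulas, and the small arrays are stand-ins for the real data:

```python
import numpy as np
from scipy.special import ndtr

# Memory needed by the full pairwise broadcast: n * n elements, 8 bytes each.
n = 152711
full_gib = n * n * 8 / 1024**3
print(round(full_gib, 2))  # ~173.75 GiB

rng = np.random.default_rng(0)
dataset = rng.normal(size=1000)   # stand-in for self._model.dataset
X = rng.normal(size=500)          # points at which to evaluate the CDF
stdev = 0.3
weights = np.full(dataset.size, 1.0 / dataset.size)

# Full broadcast, as in copulas: intermediate shape (len(X), len(dataset))
full = ndtr((X[:, None] - dataset) / stdev).dot(weights)

def kde_cdf_chunked(X, dataset, stdev, weights, chunk_size=128):
    """Same result, but peak memory is chunk_size * len(dataset) elements."""
    out = np.empty(X.size)
    for start in range(0, X.size, chunk_size):
        chunk = X[start:start + chunk_size]
        out[start:start + chunk_size] = ndtr(
            (chunk[:, None] - dataset) / stdev
        ).dot(weights)
    return out

chunked = kde_cdf_chunked(X, dataset, stdev, weights)
print(np.allclose(full, chunked))  # True
```

With chunk_size rows at a time, peak memory drops from O(n^2) to O(chunk_size * n) while producing the same values.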

@csala
Contributor

csala commented Jun 25, 2021

Thanks for the pointer @dyuliu !

Indeed, the problem seems to come from the model selection inside copulas, and more specifically from the Gaussian KDE CDF computation, which is not memory efficient. We will open an issue there to work on it.

In this case, however, what is being trained is the columns model, which is used only to populate missing columns during conditional sampling, so searching for the best distribution is overkill. @fealho already implemented a workaround that caps the distribution of the internal columns model to a plain Gaussian, which in this case should be more than good enough, so after PR #473 is merged the memory issue should vanish.
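The idea behind the workaround can be sketched with plain scipy; this is not SDV's actual implementation, and norm.fit here is just a stand-in for fitting a parametric Gaussian univariate. A plain Gaussian needs only the mean and standard deviation, and its CDF is evaluated pointwise, so memory stays linear in the column length instead of quadratic:

```python
import numpy as np
from scipy import stats

# Selecting among candidate univariates runs a KS test against each fitted
# CDF; for a Gaussian KDE that evaluation builds the O(n^2) pairwise matrix
# seen above. A plain Gaussian avoids it entirely.
rng = np.random.default_rng(0)
column = rng.normal(loc=5.0, scale=2.0, size=10000)

mu, sigma = stats.norm.fit(column)  # just mean and std, no n x n array
cdf_vals = stats.norm.cdf(np.sort(column), loc=mu, scale=sigma)

print(cdf_vals.shape)  # (10000,) -- linear in the data size
```

For a columns model whose only job is to fill in missing columns during conditional sampling, this trade of fit quality for memory is a reasonable one.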

In the meantime, would you be able to install SDV from the corresponding branch and give it a try on your end?

@dyuliu
Contributor Author

dyuliu commented Jun 25, 2021

> In the meantime, would you be able to install SDV from the corresponding branch and give it a try on your end?

Sounds good, I will test it with a source install.
