
Gaussian Copula – Memory Issue in Release 0.10.0 #459

Closed
dyuliu opened this issue Jun 3, 2021 · 5 comments · Fixed by #473
Labels
bug Something isn't working

Comments

@dyuliu
Contributor

dyuliu commented Jun 3, 2021

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.10.0
  • Python version: 3.6
  • Operating System: linux

Error Description

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-8-7aa0aed49c92> in <module>
      6 
      7 print(train_df.shape)
----> 8 syn_gen.fit(train_df)
      9 
     10 end = time.time()

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/tabular/base.py in fit(self, data)
    109                      self._metadata.name, data.shape)
    110         if not self._metadata_fitted:
--> 111             self._metadata.fit(data)
    112 
    113         self._num_rows = len(data)

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/metadata/table.py in fit(self, data)
    489 
    490         LOGGER.info('Fitting constraints for table %s', self.name)
--> 491         constrained = self._fit_transform_constraints(data)
    492         extra_columns = set(constrained.columns) - set(data.columns)
    493 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/metadata/table.py in _fit_transform_constraints(self, data)
    353             self._constraints[idx] = constraint
    354 
--> 355             data = constraint.fit_transform(data)
    356 
    357         return data

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/constraints/base.py in fit_transform(self, table_data)
    232                 Transformed data.
    233         """
--> 234         self.fit(table_data)
    235         return self.transform(table_data)
    236 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/sdv/constraints/base.py in fit(self, table_data)
    144             transformed_data = self._hyper_transformer.fit_transform(data_to_model)
    145             self._columns_model = GaussianMultivariate()
--> 146             self._columns_model.fit(transformed_data)
    147 
    148         return self._fit(table_data)

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/__init__.py in decorated(self, X, *args, **kwargs)
    214             raise ValueError('There are nan values in your data.')
    215 
--> 216         return function(self, X, *args, **kwargs)
    217 
    218     return decorated

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/multivariate/gaussian.py in fit(self, X)
    106 
    107             univariate = get_instance(distribution)
--> 108             univariate.fit(column)
    109 
    110             columns.append(column_name)

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/base.py in fit(self, X)
    223             selection_sample = X
    224 
--> 225         self._instance = select_univariate(selection_sample, self.candidates)
    226         self._instance.fit(X)
    227 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/selection.py in select_univariate(X, candidates)
     24             instance = get_instance(model)
     25             instance.fit(X)
---> 26             ks, _ = kstest(X, instance.cdf)
     27             if ks < best_ks:
     28                 best_ks = ks

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/scipy/stats/stats.py in kstest(rvs, cdf, args, N, alternative, mode)
   6892     xvals, yvals, cdf = _parse_kstest_args(rvs, cdf, args, N)
   6893     if cdf:
-> 6894         return ks_1samp(xvals, cdf, args=args, alternative=alternative, mode=mode)
   6895     return ks_2samp(xvals, yvals, alternative=alternative, mode=mode)
   6896 

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/scipy/stats/stats.py in ks_1samp(x, cdf, args, alternative, mode)
   6264     N = len(x)
   6265     x = np.sort(x)
-> 6266     cdfvals = cdf(x, *args)
   6267 
   6268     if alternative == 'greater':

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/base.py in cdf(self, X)
    316                 Cumulative distribution values for points in X.
    317         """
--> 318         return self.cumulative_distribution(X)
    319 
    320     def percent_point(self, U):

/opt/cortex-installs/miniconda/envs/sreg/lib/python3.6/site-packages/copulas/univariate/gaussian_kde.py in cumulative_distribution(self, X)
    103         stdev = np.sqrt(self._model.covariance[0, 0])
    104         lower = ndtr((self._get_bounds()[0] - self._model.dataset) / stdev)[0]
--> 105         uppers = ndtr((X[:, None] - self._model.dataset) / stdev)
    106         return (uppers - lower).dot(self._model.weights)
    107 

MemoryError: Unable to allocate 174. GiB for an array with shape (152711, 152711) and data type int64

Steps to reproduce

My training data's shape is (152711, 7), and I am using categorical_transformer='categorical_fuzzy'.

I simply fit the Gaussian Copula model and then hit the error described above.

When I downgrade SDV to 0.9.1, the memory error is gone.

I must use 0.10.0 because it supports specifying constraints and conditions at the same time. But with this version I hit a memory error that was supposedly fixed in version 0.9.0 (https://github.com/sdv-dev/SDV/releases/tag/v0.9.0).

Can someone help me investigate why the memory error has reappeared?

@dyuliu added the bug and pending review labels on Jun 3, 2021
@fealho
Member

fealho commented Jun 9, 2021

Hi @dyuliu, could you provide the code and/or dataset you used, as well as the result of pip freeze? That would help us figure out the error faster.

@dyuliu
Contributor Author

dyuliu commented Jun 11, 2021

Hi @fealho ,

The data is

A                    datetime64[ns]
B                    object
C                    object
D                    object
E                    float64
F                    int64
G                    int64
dtype: object

The counts of unique values of three categorical variables:

  • B 2
  • C 51
  • D 18

The stats of the three numeric columns E, F, G:
[screenshot of summary statistics omitted]

Here is a simplified version of my code that replicates exactly the same memory issue. Relevant code to train the synthetic model:

from sdv.constraints import UniqueCombinations
from sdv.tabular import GaussianCopula

# Only valid combinations of columns B and C should appear in the output
unique_info1 = UniqueCombinations(
    columns=['B', 'C'],
    handling_strategy='transform'
)

constraints = [
    unique_info1
]

syn_gen = GaussianCopula(
    constraints=constraints,
    categorical_transformer='categorical_fuzzy',
)

syn_gen.fit(train_df)  # raises the MemoryError shown above

This code works perfectly fine in 0.9.0 but runs into a memory issue in 0.10.

The cause is the use of UniqueCombinations. If I remove it, things look fine, though I haven't run training to completion because the whole process may take several hours.

@dyuliu
Contributor Author

dyuliu commented Jun 24, 2021

I think the problem stems from the cumulative_distribution function in gaussian_kde.py in the copulas library.

uppers = ndtr((X[:, None] - self._model.dataset) / stdev)

X[:, None] is of shape (152711, 1)
self._model.dataset is of shape (1, 152711)

Therefore, X[:, None] - self._model.dataset broadcasts to an ndarray of shape (152711, 152711).

Each 64-bit element takes 8 bytes, so the total memory cost is 152711 * 152711 * 8 / 1024**3 ≈ 173.75 GiB, matching the 174 GiB reported in the error message.
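The arithmetic above, plus a chunked variant of the same broadcast, can be sketched as follows. The kde_cdf_chunked function is a hypothetical illustration of a memory-bounded evaluation, not code from copulas, and the small arrays are stand-ins for the real data:

```python
import numpy as np
from scipy.special import ndtr

# Memory needed by the full pairwise broadcast: n * n elements, 8 bytes each.
n = 152711
full_gib = n * n * 8 / 1024**3
print(round(full_gib, 2))  # ~173.75 GiB

rng = np.random.default_rng(0)
dataset = rng.normal(size=1000)   # stand-in for self._model.dataset
X = rng.normal(size=500)          # points at which to evaluate the CDF
stdev = 0.3
weights = np.full(dataset.size, 1.0 / dataset.size)

# Full broadcast, as in copulas: intermediate shape (len(X), len(dataset))
full = ndtr((X[:, None] - dataset) / stdev).dot(weights)

def kde_cdf_chunked(X, dataset, stdev, weights, chunk_size=128):
    """Same result, but peak memory is chunk_size * len(dataset) elements."""
    out = np.empty(X.size)
    for start in range(0, X.size, chunk_size):
        chunk = X[start:start + chunk_size]
        out[start:start + chunk_size] = ndtr(
            (chunk[:, None] - dataset) / stdev
        ).dot(weights)
    return out

chunked = kde_cdf_chunked(X, dataset, stdev, weights)
print(np.allclose(full, chunked))  # True
```

With chunk_size rows at a time, peak memory drops from O(n^2) to O(chunk_size * n) while producing the same values.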

@csala
Contributor

csala commented Jun 25, 2021

Thanks for the pointer @dyuliu !

Indeed, the problem seems to come from the model selection inside copulas, and more specifically from the Gaussian KDE CDF computation, which is not memory efficient. We will open an issue there to work on it.

In this case, however, what is being trained is the columns model, which is used only to populate missing columns during conditional sampling, so searching for the best distribution is overkill. @fealho already implemented a workaround that caps the distribution of the internal columns model to a plain Gaussian, which in this case should be more than good enough, so after PR #473 is merged the memory issue should vanish.
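The idea behind the workaround can be sketched with plain scipy; this is not SDV's actual implementation, and norm.fit here is just a stand-in for fitting a parametric Gaussian univariate. A plain Gaussian needs only the mean and standard deviation, and its CDF is evaluated pointwise, so memory stays linear in the column length instead of quadratic:

```python
import numpy as np
from scipy import stats

# Selecting among candidate univariates runs a KS test against each fitted
# CDF; for a Gaussian KDE that evaluation builds the O(n^2) pairwise matrix
# seen above. A plain Gaussian avoids it entirely.
rng = np.random.default_rng(0)
column = rng.normal(loc=5.0, scale=2.0, size=10000)

mu, sigma = stats.norm.fit(column)  # just mean and std, no n x n array
cdf_vals = stats.norm.cdf(np.sort(column), loc=mu, scale=sigma)

print(cdf_vals.shape)  # (10000,) -- linear in the data size
```

For a columns model whose only job is to fill in missing columns during conditional sampling, this trade of fit quality for memory is a reasonable one.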

In the meantime, would you be able to install SDV from the corresponding branch and give it a try on your end?

@dyuliu
Contributor Author

dyuliu commented Jun 25, 2021

> In the meantime, would you be able to install SDV from the corresponding branch and give it a try on your end?

Sounds good, I will test it with a source install.
