Tabular: ensure that values fall within range #200

csala · 2020-09-28T13:31:48Z

Following up on an issue opened in CTGAN: sdv-dev/CTGAN#24 (comment)

Current Tabular model implementations do not properly identify the range in which values should be generated, oftentimes producing values outside of the desired range. This is especially obvious in situations where a value is expected to be always positive but has an average close to 0.

@Baukebrenninkmeijer explained it very well here:

@oregonpillow This occurs because in the continuous columns, Gaussians are being fitted to the distribution. A column with a lot of zeros will have a Gaussian fit around 0, which will inherently result in negative values.

In my own work, I've been detecting the min and max of continuous columns during metadata extraction (transformer.py) and clipping the generated values to them. This limits these anomalies a bit, but it removes some of the probabilistic variation that occurred with this technique. Previously, a synthetic data point could have a higher capital-gain than anyone in the real data. With this limitation, that option is gone.
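The effect described in this quote can be reproduced with a minimal sketch. It assumes a made-up column (mostly zeros plus a few large values, loosely imitating capital-gain) and a single Gaussian fit. It is not the code from transformer.py; the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a column like capital-gain:
# mostly zeros, with a minority of large positive values.
real = np.concatenate([np.zeros(900), rng.uniform(1_000, 20_000, size=100)])

# Fit a single Gaussian (mean and standard deviation) and sample from it.
mu, sigma = real.mean(), real.std()
synthetic = rng.normal(mu, sigma, size=10_000)

# Share of sampled values that are negative, although the real data has none.
negative_share = (synthetic < 0).mean()

# Clip to the range observed in the training data, as described above.
clipped = np.clip(synthetic, real.min(), real.max())

print(f"negative share before clipping: {negative_share:.2f}")
print(f"range after clipping: [{clipped.min():.0f}, {clipped.max():.0f}]")
```

Clipping removes the negative values, but it also piles the clipped mass onto the two boundary values and rules out anything above the observed maximum, which is the trade-off mentioned in the quote.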

We should find a way to allow the users to indicate that the range in which the values are generated needs to be learned from the training data and then ensure that this value range is respected.

csala · 2020-09-28T13:41:50Z

At the moment, this can be achieved with a workaround based on CustomConstraint, which can be used in two ways:

Use a rejection-sampling strategy by setting an is_valid function that discards the rows that do not fall within the expected range. This can make the sampling process slower.
Set a transform strategy that applies a transformation that moves the input data to the real range and then reverts it back. The drawback is that this may require a non-linear transformation, which can affect the correlations in a way that prevents the model from learning them properly.
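The two strategies can be sketched in plain NumPy, independently of SDV. This is only an illustration of the idea, not the CustomConstraint API: the range `LOW`/`HIGH`, the helper `sample_with_rejection`, the logit-based `transform`/`reverse_transform` pair and the stand-in sampler are all hypothetical.

```python
import numpy as np

LOW, HIGH = 0.0, 100.0  # hypothetical valid range for one column


def is_valid(values):
    """Boolean mask: True where a value lies inside [LOW, HIGH]."""
    return (values >= LOW) & (values <= HIGH)


def sample_with_rejection(sampler, num_rows, max_tries=100):
    """Call sampler(num_rows) repeatedly, keeping only valid values."""
    kept = np.empty(0)
    for _ in range(max_tries):
        batch = sampler(num_rows)
        kept = np.concatenate([kept, batch[is_valid(batch)]])
        if len(kept) >= num_rows:
            return kept[:num_rows]
    raise RuntimeError("too many rejected values for this range")


def transform(values, eps=1e-6):
    """Map [LOW, HIGH] onto the real line with a logit (non-linear)."""
    scaled = np.clip((values - LOW) / (HIGH - LOW), eps, 1 - eps)
    return np.log(scaled / (1 - scaled))


def reverse_transform(values):
    """Map any real number back into [LOW, HIGH] with a sigmoid."""
    return LOW + (HIGH - LOW) / (1 + np.exp(-values))


rng = np.random.default_rng(0)


def unconstrained_model(n):
    """Stand-in for a model sampler that ignores the valid range."""
    return rng.normal(10, 30, size=n)


# Strategy 1: rejection sampling (slower, since invalid values are thrown away).
rejected = sample_with_rejection(unconstrained_model, 1_000)

# Strategy 2: the model works in the transformed space; reverting always
# lands inside the range, whatever the model outputs.
reverted = reverse_transform(rng.normal(0, 5, size=1_000))
```

In this sketch the rejection loop draws extra batches to replace discarded values, which is where the slowdown comes from, while the logit stretches values near the boundaries, which is the kind of non-linear distortion that can affect learned correlations.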

For a later release we are working on a better method that handles this properly within the modeling process without the need to add external constraints.

csala · 2020-10-13T07:09:45Z

The recent introduction of CopulaGAN solves this problem by transforming each column using its marginal distribution.

Here is an example using CopulaGAN on the Census dataset forcing Gamma as the distribution for the capital-gain and capital-loss columns, which always get negative values when using plain CTGAN on them:

from sdv.demo import load_tabular_demo
from sdv.tabular import CopulaGAN

census = load_tabular_demo('census')

field_distributions = {
    'capital-gain': 'gamma',
    'capital-loss': 'gamma'
}

model = CopulaGAN(field_distributions=field_distributions)
model.fit(census)
model.sample()
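To illustrate what "transforming each column using its marginal distribution" can look like, here is a minimal sketch using SciPy. It is not CopulaGAN's implementation: the synthetic column, the choice to pin the gamma location at 0 (`floc=0`) and the Gaussian noise standing in for model output are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical strictly positive column used as training data.
real = rng.gamma(shape=2.0, scale=1_000.0, size=5_000)

# Fit a gamma marginal with its location pinned at 0, so its support is [0, inf).
a, loc, scale = stats.gamma.fit(real, floc=0)
marginal = stats.gamma(a, loc=loc, scale=scale)

# Forward: column values -> uniform (via the CDF) -> standard normal space.
u = np.clip(marginal.cdf(real), 1e-9, 1 - 1e-9)
normal_space = stats.norm.ppf(u)

# A model would learn and sample in normal space; Gaussian noise stands in here.
generated = rng.normal(size=5_000)

# Reverse: normal space -> uniform -> column values (via the inverse gamma CDF).
synthetic = marginal.ppf(stats.norm.cdf(generated))

print("minimum synthetic value:", synthetic.min())
```

Because the reverse step goes through the inverse CDF of the fitted gamma, every generated value falls inside that distribution's support, so negatives cannot appear in this sketch regardless of what is produced in normal space.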

tokchinkuan · 2020-11-23T17:42:32Z

Hi, is it necessary to specify the gamma distribution for columns where only positive values are allowed? I trained a CopulaGAN on the UNSW-NB15 dataset without specifying any distribution and found that negative values are still generated in columns where only positive values exist in the real dataset. Thank you.

Baukebrenninkmeijer · 2020-11-24T10:54:34Z

@tokchinkuan Yes. The gamma distribution is bounded below, which is what prevents the negative values. If you don't specify gamma, CopulaGAN will default to using Gaussians, which brings back the problem outlined in the first message of this issue thread.

csala · 2021-09-09T11:51:02Z

This topic has been completely covered in the latest releases, so it can be closed.

Ifrahraoof · 2022-03-22T13:20:43Z

Can you please explain the detection metrics in detail? They are very confusing. Thank you!

csala mentioned this issue Sep 28, 2020

Not working with Discrete_columns containing integers sdv-dev/CTGAN#24

Closed

csala mentioned this issue Oct 13, 2020

Unwanted negative numbers generated #138

Closed

csala mentioned this issue Oct 20, 2020

Limit the value of some variables sdv-dev/CTGAN#72

Closed

JulienGervai mentioned this issue Nov 4, 2020

CustomConstraint is not working ? #235

Closed

csala mentioned this issue Dec 10, 2020

how to deal with the output of the G after training sdv-dev/CTGAN#109

Closed

HarisNaveed17 mentioned this issue Mar 23, 2021

About detection metrics and distribution choice #359

Open

pvk-developer mentioned this issue Mar 26, 2021

Is it possible to limit a column to possitive values? #365

Closed

dyuliu mentioned this issue May 6, 2021

Ensure values fall within the specified range #423

Closed

npatki mentioned this issue May 20, 2021

Giving user an ability to write general constraints as functions #411

Closed

csala closed this as completed Sep 9, 2021
