Tabular: ensure that values fall within range #200

Closed
csala opened this issue Sep 28, 2020 · 6 comments

csala commented Sep 28, 2020

Following up on an issue opened in CTGAN: sdv-dev/CTGAN#24 (comment)

Current Tabular model implementations do not properly identify the range in which values should be generated, oftentimes producing values outside of the desired range. This is especially obvious in situations where a value is expected to be always positive but has an average close to 0.

@Baukebrenninkmeijer explained it very well here:

@oregonpillow This occurs because in the continuous columns, Gaussians are being fitted to the distribution. A column with a lot of zeros will have a Gaussian fit around 0, which will inherently result in negative values.

In my own work, I've been detecting the min and max of continuous columns during metadata extraction (transformer.py) and clipping the resulting values to that range. This limits the anomalies somewhat, but it also removes some of the probabilistic variation that occurred with this technique. Previously, a synthetic data point could have a higher capital-gain than anyone in the real data. With this limitation, that option is gone.
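To illustrate the quoted explanation, here is a minimal sketch in plain Python. It is not the CTGAN transformer code: a single Gaussian stands in for the fitted mixture, and the zero-heavy column is made-up data. It shows that sampling from a Gaussian fitted to such a column yields negative values, and that clipping to the min and max learned from the training data removes them, at the cost of never exceeding the observed maximum.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical zero-heavy, non-negative column (a stand-in for something like capital-gain).
real = [0.0] * 90 + [rng.uniform(1_000, 5_000) for _ in range(10)]

# Fit a single Gaussian to the column and sample from it.
# Assumption: one Gaussian replaces the mixture that the real model would fit.
mu = statistics.mean(real)
sigma = statistics.stdev(real)
synthetic = [rng.gauss(mu, sigma) for _ in range(1_000)]

# Learn the range from the training data, then clip every sample to it.
low, high = min(real), max(real)
clipped = [min(max(value, low), high) for value in synthetic]

n_negative = sum(value < 0 for value in synthetic)
print(f"negative values before clipping: {n_negative}")
print(f"range after clipping: [{min(clipped)}, {max(clipped)}]")
```

Clipping piles the out-of-range samples onto the two boundary values, which is the loss of variation described in the quote.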

We should find a way to allow the users to indicate that the range in which the values are generated needs to be learned from the training data and then ensure that this value range is respected.


csala commented Sep 28, 2020

At the moment, this can be addressed with a workaround based on CustomConstraint, which can be used in two ways:

  1. Use a rejection-sampling strategy by setting an is_valid function that discards the rows that fall outside the expected range. This can make the sampling process slower.
  2. Set a transform strategy that maps the input data onto an unbounded, real-valued range before modeling and then reverts the sampled values back. The drawback is that this may require a non-linear transformation, which can distort the correlations and prevent the model from learning them properly.
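The two strategies above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it does not call the SDV CustomConstraint API, the `amount` field, the `[0, 100]` range, and the `sample_rows` stand-in model are all hypothetical, and a logit mapping is just one possible choice of non-linear transformation.

```python
import math
import random

rng = random.Random(0)
LOW, HIGH = 0.0, 100.0  # hypothetical valid range, e.g. learned from the training data
EPS = 1e-6              # keeps the logit finite at the range boundaries


def is_valid(row):
    """Strategy 1 check: True when the hypothetical 'amount' field is within range."""
    return LOW <= row["amount"] <= HIGH


def sample_rows(n):
    """Stand-in for an unconstrained model: Gaussian values that can be negative."""
    return [{"amount": rng.gauss(5.0, 10.0)} for _ in range(n)]


def sample_with_rejection(n, batch_size=100, max_batches=1_000):
    """Strategy 1: keep drawing batches and discard the rows that fail is_valid."""
    accepted = []
    for _ in range(max_batches):
        accepted.extend(row for row in sample_rows(batch_size) if is_valid(row))
        if len(accepted) >= n:
            return accepted[:n]
    raise RuntimeError("too few valid rows were sampled")


def transform(value):
    """Strategy 2, forward: map [LOW, HIGH] onto the real line with a logit."""
    scaled = (value - LOW) / (HIGH - LOW)
    scaled = min(max(scaled, EPS), 1.0 - EPS)
    return math.log(scaled / (1.0 - scaled))


def reverse_transform(z):
    """Strategy 2, reverse: map any real number back into [LOW, HIGH]."""
    if z >= 0:
        scaled = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)  # avoids overflow for very negative z
        scaled = e / (1.0 + e)
    return LOW + scaled * (HIGH - LOW)


rows = sample_with_rejection(50)
restored = reverse_transform(transform(42.0))
print(len(rows), restored)
```

Rejection sampling draws more rows than it returns, which is where the slowdown comes from. The logit guarantees in-range output for any model value, but it stretches the data near the boundaries, which is the kind of distortion that can affect learned correlations.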

For a later release we are working on a better method that handles this properly within the modeling process without the need to add external constraints.


csala commented Oct 13, 2020

The recent introduction of CopulaGAN solves this problem by transforming each column using its marginal distribution.

Here is an example that uses CopulaGAN on the Census dataset and forces a Gamma distribution for the capital-gain and capital-loss columns, which always end up with some negative values when modeled with plain CTGAN:

from sdv.demo import load_tabular_demo
from sdv.tabular import CopulaGAN

# Load the Census demo dataset
census = load_tabular_demo('census')

# Force a Gamma marginal for the columns that must never be negative
field_distributions = {
    'capital-gain': 'gamma',
    'capital-loss': 'gamma'
}

model = CopulaGAN(field_distributions=field_distributions)
model.fit(census)
model.sample()
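For intuition about why a marginal with non-negative support helps, here is a rough sketch of the idea of transforming a column through its marginal distribution. It is an assumption about the general copula-style technique, not the CopulaGAN source code. To stay within the standard library it uses an exponential marginal (a Gamma with shape 1) and a made-up `SCALE`; the helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()
SCALE = 1_000.0  # hypothetical scale of the fitted exponential marginal


def to_normal_space(x):
    """Apply the marginal CDF, then the inverse standard-normal CDF."""
    u = 1.0 - math.exp(-x / SCALE)
    u = min(max(u, 1e-9), 1.0 - 1e-9)  # inv_cdf requires 0 < u < 1
    return STANDARD_NORMAL.inv_cdf(u)


def from_normal_space(z):
    """Apply the standard-normal CDF, then the inverse marginal CDF."""
    u = min(STANDARD_NORMAL.cdf(z), 1.0 - 1e-12)  # avoid log(0)
    return max(0.0, -SCALE * math.log(1.0 - u))


rng = random.Random(0)

# Whatever real values the model produces in normal space, including very
# negative ones, the reverse transform lands inside the marginal's support.
latent = [rng.gauss(0.0, 1.0) for _ in range(1_000)] + [-50.0, 50.0]
samples = [from_normal_space(z) for z in latent]
roundtrip = from_normal_space(to_normal_space(250.0))

print(f"smallest sample: {min(samples)}")
print(f"round trip of 250.0: {roundtrip}")
```

With a Gaussian marginal the reverse step would map onto the whole real line again, so negative values would remain possible.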

@tokchinkuan

Hi, is it necessary to specify the gamma distribution for columns where only positive values are allowed? I trained a CopulaGAN on the UNSW-NB15 dataset without specifying any distribution and found that negative values are still generated in columns where only positive values exist in the real dataset. Thank you.

@Baukebrenninkmeijer

@tokchinkuan Yes, the gamma distribution specifies that no values outside the original range can exist. If you don't specify gamma, CopulaGAN will default to using Gaussians, which brings back the problem outlined in the first message of this issue thread.
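To find out which columns need an explicit distribution, one option is to compare the sampled values against the range seen in the real data. The helper below is a hypothetical sketch, not part of SDV, and the column names and values are made up for illustration.

```python
def out_of_range_report(real, synthetic):
    """Count synthetic values outside the [min, max] seen in each real column."""
    report = {}
    for column, real_values in real.items():
        low, high = min(real_values), max(real_values)
        n_outside = sum(not (low <= value <= high) for value in synthetic[column])
        if n_outside:
            report[column] = {"min": low, "max": high, "outside": n_outside}
    return report


# Hypothetical columns: 'duration' gets one negative sample, 'bytes' stays in range.
real = {"duration": [0.0, 0.5, 2.0], "bytes": [10, 200, 4000]}
synthetic = {"duration": [-0.3, 1.0, 1.5], "bytes": [50, 300, 3500]}

report = out_of_range_report(real, synthetic)
print(report)  # {'duration': {'min': 0.0, 'max': 2.0, 'outside': 1}}
```

Columns that show up in the report are candidates for an entry in field_distributions.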


csala commented Sep 9, 2021

This topic has been fully covered in the latest releases, so the issue can be closed.

csala closed this as completed Sep 9, 2021
@Ifrahraoof

Can you please explain the detection metrics in detail? They are very confusing. Thank you!
