Conditional sampling with negative float values doesn't work #1161

Closed
pimlock opened this issue Jan 6, 2023 · 2 comments · Fixed by #1616
Labels: bug (Something isn't working), feature:sampling (Related to generating synthetic data after a model is built)

pimlock commented Jan 6, 2023

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.17.2
  • Python version: Python 3.9.10
  • Operating System: Mac OS 12.6.1

Error Description

When sampling records from the model with a negative float as a condition value, no records are returned, because the _filter_conditions method filters them all out.

That's because the method compares a negative distance value against an absolute value (which is always >= 0), so the comparison can never succeed.
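The comparison described above can be reproduced without SDV. The following is a minimal sketch, not the library's actual code: the function name `filter_float_condition`, the `rtol` default, and the `use_abs` switch are hypothetical, and it assumes the tolerance is derived by multiplying the condition value by a relative tolerance, which is what makes it negative for negative values.

```python
import numpy as np
import pandas as pd


def filter_float_condition(sampled, column, value, rtol=0.01, use_abs=False):
    """Keep rows whose `column` is within a relative tolerance of `value`.

    Hypothetical helper for illustration only; not SDV's implementation.
    """
    # Assumption: the tolerance scales with the condition value, so a
    # negative value yields a negative tolerance unless abs() is applied.
    distance = value * rtol
    if use_abs:
        distance = abs(distance)

    # The left-hand side is an absolute difference and is never negative,
    # so it can never be <= a negative distance.
    mask = np.abs(sampled[column] - value) <= distance
    return sampled[mask]


rows = pd.DataFrame({'column1': [-50.0, -50.2, -49.9, -10.0]})

buggy = filter_float_condition(rows, 'column1', -50.0)
fixed = filter_float_condition(rows, 'column1', -50.0, use_abs=True)

print(len(buggy), len(fixed))
```

Taking the absolute value of the tolerance is one possible way to make the check symmetric for negative conditions; it is not necessarily the change that was merged in #1616.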

Steps to reproduce

Sample code that triggers this issue:

import pandas as pd

from sdv.sampling import Condition
from sdv.tabular import GaussianCopula

data = pd.DataFrame({
    'column1': [-float(x) for x in range(100)],
    'column2': list(range(100))
})

model = GaussianCopula()
model.fit(data)

# condition on a negative float value
sampled = model.sample_conditions([Condition({'column1': -50.0})])

# this assertion fails, as the dataframe is empty
assert len(sampled) == 1

pimlock added the bug and new labels on Jan 6, 2023

npatki (Contributor) commented Jan 11, 2023

Thanks for filing @pimlock, confirmed that I can replicate this issue. We'll leave this open until the related changes are merged.

npatki added the under discussion and feature:sampling labels and removed the new label on Jan 11, 2023
npatki removed the under discussion label on Jan 23, 2023

npatki (Contributor) commented Jun 2, 2023

Update: I can confirm that this issue still persists in SDV 1.0+. That version has a new API, so I'm attaching updated code below for replication.

import pandas as pd

from sdv.sampling import Condition
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame({
    'column1': [-float(x) for x in list(range(100))],
    'column2': list(range(100))
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(data)

my_condition = Condition(
    num_rows=10,
    column_values={'column1': -50.0}
)

synth.sample_from_conditions([my_condition])

This now directly results in an error.

ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model. 
Please try again with a different set of values.
