InvalidDataError: The provided data does not match the metadata (although it matches) #1833

Closed
deltaproximity opened this issue Mar 5, 2024 · 4 comments · Fixed by #1858
Labels
bug Something isn't working feature:constraints Related to inputting rules or business logic resolution:WAI The software is working as intended

@deltaproximity

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.10.0
  • Python version: 3.10.6
  • Operating System: Windows 10 22H2

Error Description

Error when applying a 'ScalarRange' constraint to a numeric column while training a synthesizer (see the attached image):

Error message:
File ~.conda\envs\scrf\lib\site-packages\sdv\single_table\base.py:164, in BaseSynthesizer.validate(self, data)
161 errors += self._validate(data) # Validate rules specific to each synthesizer
163 if errors:
--> 164 raise InvalidDataError(errors)

InvalidDataError: The provided data does not match the metadata:

Data is not valid for the 'ScalarRange' constraint:
col_with_constraint
0 1.000000
1 0.936195
2 0.936195
3 0.936195
4 0.936195
+2656 more

This is what the column "col_with_constraint" looks like:
[Two attached images, not reproduced here]

Steps to reproduce

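import os

from sdv.datasets.local import load_csvs
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Load every CSV file in the folder into a dict of DataFrames keyed by file name
datasets = load_csvs(
    folder_name=os.path.join('path_to_data_folder_with_csv_file'),
    read_csv_parameters={
        'skipinitialspace': True,
        #'encoding': 'utf_32'
    })

df = datasets['data_file']

# Assumption: the original snippet never defined `metadata`;
# auto-detecting it from the DataFrame is one way to build it.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

# Restrict col_with_constraint to the closed interval [0.7, 0.9]
constraint1 = {
    'constraint_class': 'ScalarRange',
    #'table_name': 'guests', # for multi table synthesizers
    'constraint_parameters': {
        'column_name': 'col_with_constraint',
        'low_value': 0.7,
        'high_value': 0.9,
        'strict_boundaries': False
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints(constraints=[
    constraint1
])
synthesizer.fit(df)  # raises InvalidDataError: df has values outside [0.7, 0.9]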
@deltaproximity deltaproximity added bug Something isn't working new Automatic label applied to new issues labels Mar 5, 2024
@srinify
Contributor

srinify commented Mar 5, 2024

Hi there @deltaproximity, it looks like the data you're using for training doesn't adhere to the constraint you specified (scalar range from 0.7 to 0.9). I see multiple values outside of this range, such as 1.000000 and 0.936195 in the error message.

Constraints in sdv are used to describe business rules inherent in your real data that you want the trained synthesizer / model to know about. This error is being thrown because sdv detected that the underlying data for training doesn't match the constraint you specified: InvalidDataError: The provided data does not match the metadata: Data is not valid for the 'ScalarRange' constraint:
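
To see which training rows trigger this validation error, the range check can be reproduced with plain pandas before calling fit. This is only a sketch and not SDV's internal validation code: the helper name find_range_violations and the toy values are assumptions for illustration, and unlike SDV's own check it may treat missing values differently (here NaN is reported as a violation).

```python
import pandas as pd


def find_range_violations(data, column, low, high):
    """Return the rows whose value in `column` lies outside [low, high].

    Hypothetical helper for illustration; not part of SDV.
    NaN values count as violations because `between` returns False for them.
    """
    in_range = data[column].between(low, high)  # inclusive on both ends by default
    return data[~in_range]


# Toy data that reuses two values from the error message above
df = pd.DataFrame({'col_with_constraint': [1.0, 0.936195, 0.85, 0.7, 0.65]})

violations = find_range_violations(df, 'col_with_constraint', 0.7, 0.9)
print(f'{len(violations)} of {len(df)} rows fall outside [0.7, 0.9]')
```

If this returns any rows, the constraint as written does not describe the training data, which is the situation the error reports.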

Do you mind sharing more about your use case here? What's the motivation to define such a constraint that deviates from your original data?

@npatki npatki added under discussion Issue is currently being discussed feature:constraints Related to inputting rules or business logic and removed new Automatic label applied to new issues labels Mar 5, 2024
@deltaproximity
Author

Hi @srinify, thanks for your reply. I need the synthesizer to use only the data from the range [0.7, 0.9], because the data I want to sample should have values of this column only in this range. From what I understood from the SDV documentation, conditional sampling only lets you fix exact values; it cannot take a range. Therefore, I wanted to create a synthesizer that learns only from the data in the range specified above.
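
One way to get a synthesizer that learns only from in-range rows is to filter the training DataFrame before fitting, so that every remaining row satisfies the constraint. The sketch below shows only the pandas filtering step on made-up data; the column 'other' and all values are assumptions for illustration, and nobody in this thread confirmed this approach. Note that it discards rows, so the model sees less data.

```python
import pandas as pd

LOW, HIGH = 0.7, 0.9
COLUMN = 'col_with_constraint'

# Made-up stand-in for the real training data
df = pd.DataFrame({
    COLUMN: [1.0, 0.936195, 0.88, 0.75, 0.7, 0.9],
    'other': [10, 11, 12, 13, 14, 15],
})

# Keep only rows inside the closed interval [LOW, HIGH]
training_data = df[df[COLUMN].between(LOW, HIGH)].reset_index(drop=True)

print(f'kept {len(training_data)} of {len(df)} rows')
# `training_data` (rather than `df`) would then be passed to synthesizer.fit(...)
```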

@srinify
Contributor

srinify commented Mar 6, 2024

Thanks for the context. A few things:

As a workaround @deltaproximity what you can do is sample a bunch of rows and filter out the ones outside your range:

# Request more rows than you need. Maybe 1,000 if you need 100 true rows.
synthetic_data = synthesizer.sample(1000)

# Filter out rows
filtered_synthetic_data = synthetic_data[(synthetic_data[COL_NAME] >= LOW_RANGE) & (synthetic_data[COL_NAME] <= HIGH_RANGE)]
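
The workaround above needs a guess at how many rows to oversample. A minimal sketch of the same idea as a loop that keeps sampling in batches until enough in-range rows are collected is shown below. The function sample_in_range and its parameters are hypothetical, and fake_sample is a stand-in for the synthesizer so that the example runs without SDV.

```python
import numpy as np
import pandas as pd


def sample_in_range(sample_fn, column, low, high, num_rows,
                    batch_size=1000, max_batches=100):
    """Call `sample_fn(batch_size)` repeatedly, keeping only rows whose
    `column` value lies in [low, high], until `num_rows` rows are collected.

    Hypothetical helper for illustration; not part of SDV.
    """
    kept = []
    total = 0
    for _ in range(max_batches):
        batch = sample_fn(batch_size)
        batch = batch[batch[column].between(low, high)]
        kept.append(batch)
        total += len(batch)
        if total >= num_rows:
            break
    if total < num_rows:
        raise RuntimeError(
            f'only {total} of {num_rows} rows found after {max_batches} batches')
    return pd.concat(kept, ignore_index=True).head(num_rows)


# Stand-in for the synthesizer: uniform values between 0.5 and 1.0
rng = np.random.default_rng(0)


def fake_sample(n):
    return pd.DataFrame({'col_with_constraint': rng.uniform(0.5, 1.0, size=n)})


result = sample_in_range(fake_sample, 'col_with_constraint', 0.7, 0.9, num_rows=100)
print(len(result))
```

With a fitted synthesizer, sample_fn could be something like lambda n: synthesizer.sample(num_rows=n). The max_batches cap guards against looping forever when the model rarely produces values in the requested range.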

@npatki
Contributor

npatki commented Mar 20, 2024

Hi all, I'm closing this issue out as it has been inactive for a few weeks. I believe we now have other issues that are more suited to the root cause of this (see previous comment).

Please feel free to reply if there is anything more to discuss. We can always reopen the issue for more investigation. Thanks.

@npatki npatki closed this as completed Mar 20, 2024
@npatki npatki added resolution:WAI The software is working as intended and removed under discussion Issue is currently being discussed labels Mar 20, 2024
@amontanez24 amontanez24 added this to the 1.12.0 milestone Apr 11, 2024
5 participants