InvalidDataError: The provided data does not match the metadata (although it matches) #1833

deltaproximity · 2024-03-05T16:44:11Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version: 1.10.0
Python version: 3.10.6
Operating System: Windows 10 22H2

Error Description

Error when using the 'ScalarRange' constraint on a numeric column to train a synthesizer (see the attached image):

Error message:
File ~.conda\envs\scrf\lib\site-packages\sdv\single_table\base.py:164, in BaseSynthesizer.validate(self, data)
161 errors += self._validate(data) # Validate rules specific to each synthesizer
163 if errors:
--> 164 raise InvalidDataError(errors)

InvalidDataError: The provided data does not match the metadata:

Data is not valid for the 'ScalarRange' constraint:
col_with_constraint
0 1.000000
1 0.936195
2 0.936195
3 0.936195
4 0.936195
+2656 more

This is what the column "col_with_constraint" looks like:

Steps to reproduce

import os

from sdv.datasets.local import load_csvs
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer


datasets = load_csvs(
    folder_name = os.path.join('path_to_data_folder_with_csv_file'),
    read_csv_parameters={
        'skipinitialspace': True,
        #'encoding': 'utf_32'
    })

df = datasets['data_file']

constraint1 = {
    'constraint_class': 'ScalarRange',
    #'table_name': 'guests', # for multi table synthesizers
    'constraint_parameters': {
        'column_name': 'col_with_constraint',
        'low_value': 0.7,
        'high_value': 0.9,
        'strict_boundaries': False
    }
}

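# The report does not show how `metadata` was created; detecting it from the data is an assumption.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

synthesizer = GaussianCopulaSynthesizer(metadata)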
synthesizer.add_constraints(constraints=[
    constraint1
])
synthesizer.fit(df)

srinify · 2024-03-05T18:02:56Z

Hi there @deltaproximity, it looks like the data you're using for training doesn't adhere to the constraint you specified (scalar range from 0.7 to 0.9). I see multiple values outside of this range.

Constraints in sdv are used to describe business rules inherent in your real data that you want the trained synthesizer / model to know about. This error is being thrown because sdv detected that the underlying data for training doesn't match the constraint you specified: InvalidDataError: The provided data does not match the metadata: Data is not valid for the 'ScalarRange' constraint:
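To see which training rows trip the constraint before calling fit, a plain pandas check along these lines may help. This is a minimal sketch, not SDV's own validation code: the helper name rows_outside_range is hypothetical, the toy values only mirror the first few shown in the error message, and treating missing values as valid is an assumption.

```python
import pandas as pd


def rows_outside_range(data, column, low, high, strict=False):
    """Return the rows whose value in `column` falls outside the range.

    With strict=False the bounds are inclusive, mirroring
    'strict_boundaries': False in the constraint above.
    Missing values are treated as valid here (an assumption).
    """
    values = data[column]
    inclusive = "neither" if strict else "both"
    in_range = values.isna() | values.between(low, high, inclusive=inclusive)
    return data[~in_range]


# Toy data: the first two values come from the error message, the rest are made up.
df = pd.DataFrame({"col_with_constraint": [1.0, 0.936195, 0.85, 0.7, 0.65]})
violations = rows_outside_range(df, "col_with_constraint", 0.7, 0.9)
print(violations)
```

If the returned frame is non-empty, the training data itself breaks the rule, which is the situation the error reports.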

Do you mind sharing more about your use case here? What's the motivation to define such a constraint that deviates from your original data?

deltaproximity · 2024-03-06T08:56:47Z

Hi @srinify, thanks for your reply. I need the synthesizer to use only the data from the range [0.7, 0.9], because the data I want to sample should have values of this column only in this range. From what I understood from the sdv documentation, conditional sampling lets you fix specific values but not specify a range. Therefore, I wanted to create a synthesizer that learns only from the data in the range specified above.
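One way to get a synthesizer that learns only from in-range data is to filter the training rows before fitting, so that the data and the constraint agree. The sketch below assumes only that the synthesizer exposes fit(data), as in the reproduction code above; the helper name fit_on_range is hypothetical.

```python
import pandas as pd


def fit_on_range(synthesizer, data, column, low, high):
    """Fit `synthesizer` only on rows whose `column` value lies in [low, high].

    Rows with a missing value in `column` are dropped as well, because
    Series.between returns False for NaN.
    """
    in_range = data[column].between(low, high)  # inclusive on both ends
    subset = data[in_range].reset_index(drop=True)
    if subset.empty:
        raise ValueError(f"No rows of {column!r} fall within [{low}, {high}].")
    synthesizer.fit(subset)
    return subset
```

A call might look like fit_on_range(synthesizer, df, 'col_with_constraint', 0.7, 0.9). The trade-off is that the model sees fewer rows, and every other column is learned only from the rows that survive the filter.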

srinify · 2024-03-06T15:38:25Z

Thanks for the context. A few things:

sdv is working as intended and this error is in line with how constraints work. You can read more here about the error.
It looks like InvalidDataError is being returned instead of ConstraintsNotMetError, which is likely a bug! So I'll open a separate bug issue for us to fix that and make the error clearer: When inappropriately applying ScalarRange constraint, InvalidDataError is being returned instead of ConstraintsNotMetError #1842
It would be useful to be able to perform conditional sampling with a specified range (instead of just specific values). So I'll open a feature request issue for that and we'd love it if you could comment with your broader use case in mind: Support for specifying a range during conditional sampling #1843

As a workaround @deltaproximity what you can do is sample a bunch of rows and filter out the ones outside your range:

# Request more rows than you need. Maybe 1,000 if you need 100 true rows.
synthetic_data = synthesizer.sample(1000)

# Filter out rows outside the range (COL_NAME, LOW_RANGE and HIGH_RANGE are placeholders for your column and bounds)
filtered_synthetic_data = synthetic_data[(synthetic_data[COL_NAME] >= LOW_RANGE) & (synthetic_data[COL_NAME] <= HIGH_RANGE)]
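Since a single oversized sample may still leave too few rows after filtering, the workaround can be wrapped in a loop that keeps sampling until enough in-range rows are collected. This is a rough sketch that assumes only that the synthesizer exposes sample(n) returning a DataFrame, as used above; the helper name sample_in_range and the batch_size and max_batches parameters are hypothetical.

```python
import pandas as pd


def sample_in_range(synthesizer, num_rows, column, low, high,
                    batch_size=1000, max_batches=20):
    """Sample in batches and keep only rows with `column` in [low, high]."""
    kept = []
    total = 0
    for _ in range(max_batches):
        batch = synthesizer.sample(batch_size)
        batch = batch[batch[column].between(low, high)]  # inclusive bounds
        kept.append(batch)
        total += len(batch)
        if total >= num_rows:
            break
    if total < num_rows:
        # Give up rather than loop forever when the range is rarely hit.
        raise RuntimeError(
            f"Only {total} of {num_rows} rows found after {max_batches} batches."
        )
    return pd.concat(kept, ignore_index=True).head(num_rows)
```

For example, sample_in_range(synthesizer, 100, COL_NAME, LOW_RANGE, HIGH_RANGE) would return exactly 100 filtered rows, or raise if the range is hit too rarely within the batch budget.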

npatki · 2024-03-20T15:20:39Z

Hi all, I'm closing this issue out as it has been inactive for a few weeks. I believe we now have other issues that are more suited to the root cause of this (see previous comment).

Please feel free to reply if there is anything more to discuss. We can always reopen the issue for more investigation. Thanks.

deltaproximity added the bug and new labels Mar 5, 2024

npatki added the under discussion and feature:constraints labels and removed the new label Mar 5, 2024

This was referenced Mar 6, 2024

When inappropriately applying ScalarRange constraint, InvalidDataError is being returned instead of ConstraintsNotMetError #1842

Closed

Support for specifying a range during conditional sampling #1843

Open

pvk-developer mentioned this issue Mar 20, 2024

InvalidDataError is being returned instead of ConstraintsNotMetError when working with constraints #1858

Merged

npatki closed this as completed Mar 20, 2024

npatki added the resolution:WAI label (the software is working as intended) and removed the under discussion label Mar 20, 2024

amontanez24 added this to the 1.12.0 milestone Apr 11, 2024

amontanez24 assigned pvk-developer Apr 11, 2024
