Constraints should work with timezone-aware datetime columns #1576

Closed
npatki opened this issue Sep 12, 2023 · 0 comments · Fixed by #1631
Labels: feature:constraints (Related to inputting rules or business logic), feature request (Request for a new feature)

npatki commented Sep 12, 2023

I'm filing this issue as a result of #1570

Problem Description

Currently, I am able to use a variety of constraints with datetime columns: ScalarInequality, Inequality, ChainedInequality, etc. These constraints should work for datetime columns even when there are missing values.

However, if my datetime columns are timezone-aware, the constraint causes the synthesizer to crash with a cryptic error that seems unrelated to the actual problem:

InvalidDataError: The provided data does not match the metadata:

ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
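
The wording of this message matches what NumPy raises when `np.isnan` receives an array of a dtype it has no implementation for. The issue does not confirm where inside SDV the call happens, so the following is only a minimal sketch of one plausible cause, assuming the constraint's missing-value check ends up calling `np.isnan` on the column's underlying array. It shows that a timezone-naive column converts to a native `datetime64` array, while a timezone-aware column converts to an object array of `Timestamp` values, which `np.isnan` rejects.

```python
import numpy as np
import pandas as pd

# Two small datetime columns, each with one missing value.
naive = pd.Series(pd.to_datetime(['2023-09-12', None]))
aware = naive.dt.tz_localize('UTC')

# Timezone-naive data converts to a native datetime64 array,
# which np.isnan accepts (NaT is reported as missing).
naive_values = naive.to_numpy()
naive_mask = np.isnan(naive_values)

# Timezone-aware data converts to an object array of Timestamps,
# which np.isnan rejects with a TypeError.
aware_values = aware.to_numpy()
try:
    np.isnan(aware_values)
    aware_error = None
except TypeError as error:
    aware_error = str(error)

print(naive_values.dtype.kind, naive_mask.tolist())
print(aware_values.dtype, aware_error)

# pd.isna handles both representations without raising.
print(pd.isna(aware).tolist())
```

If this is indeed the cause, a missing-value check based on `pd.isna` rather than `np.isnan` would be one way to support both kinds of columns.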

Expected behavior

I expect that all constraints should be able to work with timezone-aware datetime columns. See the example code below.

from sdv.datasets.demo import download_demo
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

import pandas as pd

data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

# update our datetime columns to be timezone aware
data['checkin_date'] = pd.to_datetime(data['checkin_date']).dt.tz_localize('UTC')
data['checkout_date'] = pd.to_datetime(data['checkout_date']).dt.tz_localize('UTC')

# metadata does not need a datetime_format since the columns are
# already represented as datetimes
metadata = SingleTableMetadata.load_from_dict({'METADATA_SPEC_VERSION': 'SINGLE_TABLE_V1',
 'columns': {'guest_email': {'sdtype': 'email', 'pii': True},
  'has_rewards': {'sdtype': 'boolean'},
  'room_type': {'sdtype': 'categorical'},
  'amenities_fee': {'sdtype': 'numerical', 'computer_representation': 'Float'},
  'checkin_date': {'sdtype': 'datetime' },
  'checkout_date': {'sdtype': 'datetime' },
  'room_rate': {'sdtype': 'numerical', 'computer_representation': 'Float'},
  'billing_address': {'sdtype': 'address', 'pii': True},
  'credit_card_number': {'sdtype': 'credit_card_number', 'pii': True}},
 'primary_key': 'guest_email'})

metadata.validate()
metadata.validate_data(data)

my_constraint = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'checkin_date',
        'high_column_name': 'checkout_date',
        'strict_boundaries': True
    }
}

synth = GaussianCopulaSynthesizer(metadata)
synth.add_constraints(constraints=[my_constraint])
synth.fit(data)

Output:

---------------------------------------------------------------------------
InvalidDataError                          Traceback (most recent call last)
<ipython-input-22-01f2dfadf5a6> in <cell line: 14>()
     12 synth = GaussianCopulaSynthesizer(metadata)
     13 synth.add_constraints(constraints=[my_constraint])
---> 14 synth.fit(data)

2 frames
/usr/local/lib/python3.10/dist-packages/sdv/single_table/base.py in fit(self, data)
    375         self._data_processor.reset_sampling()
    376         self._random_state_set = False
--> 377         processed_data = self._preprocess(data)
    378         self.fit_processed_data(processed_data)
    379 

/usr/local/lib/python3.10/dist-packages/sdv/single_table/base.py in _preprocess(self, data)
    319 
    320     def _preprocess(self, data):
--> 321         self.validate(data)
    322         self._data_processor.fit(data)
    323         return self._data_processor.transform(data)

/usr/local/lib/python3.10/dist-packages/sdv/single_table/base.py in validate(self, data)
    145 
    146         if errors:
--> 147             raise InvalidDataError(errors)
    148 
    149     def _validate_transformers(self, column_name_to_transformer):

InvalidDataError: The provided data does not match the metadata:

ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Additional context

If it is not feasible to allow timezone-aware columns through a constraint, then the data validation should fail upfront (even when no constraints are added) and give the user a clearer message explaining why.
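
As a rough illustration of what such an upfront check could look like on the user side, here is a sketch using only pandas. The helper names `find_tz_aware_columns`, `check_no_tz_aware` and `to_naive_utc` are hypothetical and are not part of SDV. The first helper detects timezone-aware columns, the second fails early with an explicit message, and the third is a possible workaround that converts the columns to timezone-naive UTC before fitting.

```python
import pandas as pd


def find_tz_aware_columns(data):
    """Return the names of all timezone-aware datetime columns."""
    return [
        column for column in data.columns
        if isinstance(data[column].dtype, pd.DatetimeTZDtype)
    ]


def check_no_tz_aware(data):
    """Fail early with a clear message if any column is timezone-aware."""
    columns = find_tz_aware_columns(data)
    if columns:
        raise ValueError(
            f'Timezone-aware datetime columns are not supported: {columns}. '
            'Convert them to timezone-naive datetimes first.'
        )


def to_naive_utc(data):
    """Return a copy with timezone-aware columns converted to naive UTC."""
    result = data.copy()
    for column in find_tz_aware_columns(result):
        # Normalize to UTC, then drop the timezone information.
        result[column] = result[column].dt.tz_convert('UTC').dt.tz_localize(None)
    return result


# Small example: one aware column, one naive column.
example = pd.DataFrame({
    'checkin_date': pd.to_datetime(['2023-09-12', None]).tz_localize('UTC'),
    'checkout_date': pd.to_datetime(['2023-09-14', '2023-09-15']),
})

print(find_tz_aware_columns(example))
cleaned = to_naive_utc(example)
print(find_tz_aware_columns(cleaned))
```

Converting to naive UTC discards the original timezone, so this workaround only suits cases where the synthetic data does not need to carry timezone information.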
