Cannot apply Inequality constraint on demo dataset's datetime columns #1203

Closed
npatki opened this issue Jan 27, 2023 · 2 comments
Labels
bug Something isn't working feature:constraints Related to inputting rules or business logic

Comments

@npatki
Contributor

npatki commented Jan 27, 2023

Environment Details

  • SDV version: 1.0.0 (in progress) + existing versions (0.18.0)
  • Python version: 3.8
  • Operating System: Linux (Colab Notebook)

Error Description

The student_placements_pii demo dataset contains start_date and end_date columns that represent dates. They include some missing values.

If I apply an Inequality constraint to designate that start_date < end_date, then the synthesizer crashes during fit.

Steps to reproduce

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

data, metadata = download_demo(
    modality='single_table',
    dataset_name='student_placements_pii'
)

start_lessthan_end = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
      'low_column_name': 'start_date',
      'high_column_name': 'end_date'  
    }   
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints(constraints=[start_lessthan_end])
synthesizer.fit(data)

AggregateConstraintsError: 
ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Notes

Both start_date and end_date columns are being represented as object dtypes in pandas. This should be OK.
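
The error message above is consistent with `np.isnan` being applied to an object-dtype array. Below is a minimal sketch of that behavior using plain NumPy and pandas, independent of SDV. The values are made up, and it is an assumption that the demo columns hold date strings with missing entries in this form.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an object-dtype date column with a missing value.
dates = pd.Series(['2020-01-15', None, '2020-03-01'], dtype=object)

# np.isnan has no implementation for object arrays, so it raises TypeError.
try:
    np.isnan(dates.to_numpy())
    isnan_failed = False
except TypeError as error:
    isnan_failed = True
    isnan_message = str(error)

# pd.isna accepts object dtype and flags the missing entry.
missing_mask = pd.isna(dates).tolist()
```

This only illustrates why an object column can trigger the message; it does not show where SDV makes the call.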

I also tried casting them myself to see if this helps:

import pandas as pd

data['start_date'] = pd.to_datetime(data['start_date'])
data['end_date'] = pd.to_datetime(data['end_date'])

If I do this, then the fit call works, but there is still an error during sample:

IntCastingNaNError: Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.
Cannot convert non-finite values (NA or inf) to integer
@npatki
Contributor Author

npatki commented Feb 23, 2023

@amontanez24 @pvk-developer I am still having an issue with this feature. While it is no longer an issue during fit, the synthesizer still crashes during the sample call.

Should we reopen this issue or should I file a new one for the sample call? The code and stack trace are shown below.

Code

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

synthesizer = GaussianCopulaSynthesizer(metadata)

checkin_lessthan_checkout = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'checkin_date',
        'high_column_name': 'checkout_date'
    }
}

synthesizer.add_constraints([
    checkin_lessthan_checkout
])
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=500)

Stack Trace

Sampling rows:   0%|          | 0/500 [00:00<?, ?it/s]
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py in _sample_with_progress_bar(self, num_rows, max_tries_per_batch, batch_size, output_file_path, show_progress_bar)
    768                 progress_bar.set_description('Sampling rows')
--> 769                 sampled = self._sample_in_batches(
    770                     num_rows=num_rows,

17 frames
/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py in _sample_in_batches(self, num_rows, batch_size, max_tries_per_batch, conditions, transformed_conditions, float_rtol, progress_bar, output_file_path)
    699         for step in range(math.ceil(num_rows / batch_size)):
--> 700             sampled_rows = self._sample_batch(
    701                 batch_size=batch_size,

/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py in _sample_batch(self, batch_size, max_tries, conditions, transformed_conditions, float_rtol, progress_bar, output_file_path)
    632             prev_num_valid = num_valid
--> 633             sampled, num_valid = self._sample_rows(
    634                 num_rows_to_sample,

/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py in _sample_rows(self, num_rows, conditions, transformed_conditions, float_rtol, previous_rows)
    557 
--> 558             sampled = self._data_processor.reverse_transform(sampled)
    559 

/usr/local/lib/python3.8/dist-packages/sdv/data_processing/data_processor.py in reverse_transform(self, data, reset_keys)
    732         for constraint in reversed(self._constraints_to_reverse):
--> 733             reversed_data = constraint.reverse_transform(reversed_data)
    734 

/usr/local/lib/python3.8/dist-packages/sdv/constraints/base.py in reverse_transform(self, table_data)
    286         table_data = table_data.copy()
--> 287         return self._reverse_transform(table_data)
    288 

/usr/local/lib/python3.8/dist-packages/sdv/constraints/tabular.py in _reverse_transform(self, table_data)
    474         if self._is_datetime:
--> 475             diff_column = diff_column.astype('timedelta64[ns]')
    476 

/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5814             # else, only a single dtype is given
-> 5815             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5816             return self._constructor(new_data).__finalize__(self, method="astype")

/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    417     def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 418         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    419 

/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):

/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    590 
--> 591         new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    592 

/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors)
   1308     try:
-> 1309         new_values = astype_array(values, dtype, copy=copy)
   1310     except (ValueError, TypeError):

/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy)
   1256     else:
-> 1257         values = astype_nansafe(values, dtype, copy=copy)
   1258 

/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1167     elif np.issubdtype(arr.dtype, np.floating) and np.issubdtype(dtype, np.integer):
-> 1168         return astype_float_to_int_nansafe(arr, dtype, copy)
   1169 

/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py in astype_float_to_int_nansafe(values, dtype, copy)
   1212     if not np.isfinite(values).all():
-> 1213         raise IntCastingNaNError(
   1214             "Cannot convert non-finite values (NA or inf) to integer"

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

IntCastingNaNError                        Traceback (most recent call last)
<ipython-input-27-b0a36df8dc5e> in <module>
----> 1 synthetic_data_constrained = synthesizer.sample(500)

/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py in sample(self, num_rows, max_tries_per_batch, batch_size, output_file_path)
    806         show_progress_bar = has_constraints or has_batches
    807 
--> 808         return self._sample_with_progress_bar(
    809             num_rows,
    810             max_tries_per_batch,

/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py in _sample_with_progress_bar(self, num_rows, max_tries_per_batch, batch_size, output_file_path, show_progress_bar)
    776 
    777         except (Exception, KeyboardInterrupt) as error:
--> 778             handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
    779 
    780         else:

/usr/local/lib/python3.8/dist-packages/sdv/single_table/utils.py in handle_sampling_error(is_tmp_file, output_file_path, sampling_error)
     80 
     81     if error_msg:
---> 82         raise type(sampling_error)(error_msg + '\n' + str(sampling_error))
     83 
     84     raise sampling_error

IntCastingNaNError: Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.
Cannot convert non-finite values (NA or inf) to integer
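
The trace ends in `astype_float_to_int_nansafe`, which suggests that the column being cast to `timedelta64[ns]` is a float column containing NaN. The sketch below shows the same class of failure with plain pandas, outside SDV. The name `diff_column` is borrowed from the trace, the values are hypothetical, and `pd.to_timedelta` is shown only as one NaN-tolerant alternative, not as the fix SDV adopted.

```python
import numpy as np
import pandas as pd

# Hypothetical differences in nanoseconds; NaN marks a row with a missing date.
diff_column = pd.Series([8.64e13, np.nan])

# Casting float NaN to an integer dtype raises a ValueError
# (IntCastingNaNError is a ValueError subclass in recent pandas).
try:
    diff_column.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# pd.to_timedelta maps NaN to NaT instead of raising.
safe_diff = pd.to_timedelta(diff_column, unit='ns')
```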

@npatki
Contributor Author

npatki commented Mar 2, 2023

Seems like this is working now in the Beta release so I'm closing the issue!
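
For anyone who wants to confirm the behavior on their own synthetic data, here is a minimal sketch of a check that counts rows violating the ordering while skipping rows with a missing date. The helper `count_violations` is hypothetical and not part of SDV, and the sample frame is made up using the first demo dataset's column names. It treats only low > high as a violation; change the comparison to `>=` if a strict inequality is required.

```python
import pandas as pd

def count_violations(table, low, high):
    """Count rows where both dates are present and low is later than high."""
    low_values = pd.to_datetime(table[low])
    high_values = pd.to_datetime(table[high])
    both_present = low_values.notna() & high_values.notna()
    return int((both_present & (low_values > high_values)).sum())

# Hypothetical sample: the second row has no end date, the third row is out of order.
sample = pd.DataFrame({
    'start_date': ['2020-01-01', '2020-02-01', '2020-03-01'],
    'end_date': ['2020-06-01', None, '2020-02-15'],
})
violations = count_violations(sample, 'start_date', 'end_date')
```

Calling `count_violations(synthetic_data, 'checkin_date', 'checkout_date')` on the sampled output would apply the same check to the second example.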

@npatki npatki closed this as completed Mar 2, 2023