Some datetime formats cause InvalidDataError, even if the datetime matches the format #1136

Closed
npatki opened this issue Dec 8, 2022 · 1 comment · Fixed by #1351

npatki commented Dec 8, 2022

Environment Details

  • SDV version: 0.17.1
  • Python version: 3.8
  • Operating System: Linux

Error Description

I have a column with a specific datetime format: '%Y%m%d%H%M%S%f'.
So a value such as 20220902110443000000 should be parsed as Sep 2, 2022, 11:04:43.000000.
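As a quick sanity check that the value really does match the format, here is a minimal sketch using only the Python standard library (it is independent of the SDV and says nothing about how the SDV parses datetimes internally):

```python
from datetime import datetime

# Format and sample value taken from the issue description.
DATETIME_FORMAT = '%Y%m%d%H%M%S%f'
value = '20220902110443000000'

# strptime consumes 4+2+2+2+2+2 digits for the date and time,
# then the remaining 6 digits as microseconds (%f).
parsed = datetime.strptime(value, DATETIME_FORMAT)
print(parsed.isoformat(timespec='microseconds'))
```

So the string is a valid instance of the declared format, and the failure is not caused by malformed data.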

Whenever I try to pass in data in this format, I get an OverflowError, even if I specify the datetime format in the metadata.

Steps to reproduce

import pandas as pd
from sdv.tabular import GaussianCopula

# create some fake data with this format
data = pd.DataFrame(data={
    'my_column': ['20220902110443000000', '20220916230356000000', '20220826173917000000'],
})

# write metadata and specify the datetime format
metadata = {
  'fields': {
      'my_column': {
          'type': 'datetime',
          'format': '%Y%m%d%H%M%S%f'
      }
  }
}

# try to run it through the SDV
model = GaussianCopula(table_metadata=metadata)
model.fit(data)

Stack Trace

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

25 frames
OverflowError: Python int too large to convert to C long

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
TypeError: invalid string coercion to datetime

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/dateutil/parser/_parser.py in _build_naive(self, res, default)
   1233                 repl['day'] = monthrange(cyear, cmonth)[1]
   1234 
-> 1235         naive = default.replace(**repl)
   1236 
   1237         if res.weekday is not None and not res.day:

OverflowError: Python int too large to convert to C long
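
One plausible reading of the OverflowError, offered as an assumption rather than a confirmed diagnosis: if the format string is not applied, the 20-digit value may end up being treated as a single integer, and that integer does not fit in a signed 64-bit C long. The arithmetic can be checked directly:

```python
# Largest value a signed 64-bit integer (C long on 64-bit Linux) can hold.
INT64_MAX = 2**63 - 1

# Sample value from the issue, read as a plain integer instead of a datetime.
value = '20220902110443000000'
as_int = int(value)

# The 20-digit number is larger than the int64 maximum.
exceeds_int64 = as_int > INT64_MAX
print(exceeds_int64)
```

This only shows that the number is too large for a 64-bit integer; it does not prove which code path in pandas or dateutil raised the error.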

Context

What's interesting is that RDT seems to be able to handle this type of data well. So maybe something is misconfigured in the SDV?

from rdt import HyperTransformer
from rdt.transformers.datetime import UnixTimestampEncoder

ht = HyperTransformer()

ht.set_config(config={
    'sdtypes': { 'my_column': 'datetime'},
    'transformers': { 'my_column': UnixTimestampEncoder(datetime_format='%Y%m%d%H%M%S%f') }
})

# this works without crashing!
transformed = ht.fit_transform(data)
reversed_data = ht.reverse_transform(transformed)

@npatki added the bug label on Dec 8, 2022

npatki commented Mar 29, 2023

This issue persists in SDV 1.0:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'string_column': ['20220902110443000000', '20220916230356000000', '20220826173917000000', '20220826212135000000', '20220929111311000000'],
    'int_column': [20220902110443000000, 20220916230356000000, 20220826173917000000, 20220826212135000000, 20220929111311000000]
})

test_metadata = {
  'columns': {
      'string_column': {
          'sdtype': 'datetime',
          'datetime_format': '%Y%m%d%H%M%S%f'
      },
      'int_column': {
          'sdtype': 'datetime',
          'datetime_format': '%Y%m%d%H%M%S%f'
      }
  }
}

metadata = SingleTableMetadata.load_from_dict(test_metadata)
metadata.validate()
model = GaussianCopulaSynthesizer(metadata)
model.validate(data)

This yields an error:

---------------------------------------------------------------------------
InvalidDataError                          Traceback (most recent call last)
<ipython-input-7-e8b8ccdf2b53> in <cell line: 4>()
      2 
      3 model = GaussianCopulaSynthesizer(metadata)
----> 4 model.validate(data)
      5 # model.fit(data)

/usr/local/lib/python3.9/dist-packages/sdv/single_table/base.py in validate(self, data)
    228 
    229         if errors:
--> 230             raise InvalidDataError(errors)
    231 
    232     def _validate_transformers(self, column_name_to_transformer):

InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'string_column': ['20220826173917000000', '20220826212135000000', '20220902110443000000', '+ 2 more'].

Invalid values found for datetime column 'int_column': [20220826173917000000, 20220826212135000000, 20220902110443000000, '+ 2 more'].

Additional Context

In _validate_column, I don't think we're using the provided datetime_format string to verify the datetime values.

See line.
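
To illustrate what format-aware validation could look like, here is a minimal sketch. The helper name `find_invalid_datetimes` is hypothetical and is not part of the SDV; it only shows the idea of checking each value against the provided `datetime_format` instead of relying on format-less parsing.

```python
from datetime import datetime


def find_invalid_datetimes(values, datetime_format):
    """Return the values that do not match datetime_format.

    Hypothetical helper for illustration only; not the SDV's implementation.
    """
    invalid = []
    for value in values:
        try:
            # Cast to str so integer-typed columns are checked the same way.
            datetime.strptime(str(value), datetime_format)
        except ValueError:
            invalid.append(value)
    return invalid


sample = ['20220902110443000000', 20220916230356000000, 'not a date']
print(find_invalid_datetimes(sample, '%Y%m%d%H%M%S%f'))
```

With a check along these lines, both the string and the integer values from the example above would pass, and only values that genuinely fail to match the format would be reported.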

@npatki changed the title from "Datetime format is causing OverflowError: Python int too large to convert to C long" to "Some datetime formats cause InvalidDataError, even if the datetime matches the format" on Mar 29, 2023
@amontanez24 self-assigned this on Apr 19, 2023
@amontanez24 added this to the 1.0.1 milestone on Apr 19, 2023