Quality report produces ValueError if a datetime column is in a formatted string #303

Closed
npatki opened this issue Feb 8, 2023 · 1 comment · Fixed by #306
Labels
bug Something isn't working feature:reports Related to any of the generated reports

npatki commented Feb 8, 2023

Environment Details

  • SDMetrics version: 0.9.0
  • Python version: 3.8
  • Operating System: Linux (Colab Notebook)

Error Description

Starting from SDV 1.0, users will provide a new metadata format. The current version of SDMetrics (0.9.0) already supports this format.

However, I run into an error when datetime columns are stored as formatted strings such as "27 Dec 2020" (format string = "%d %b %Y").

Steps to reproduce

Replicate using SDV 1.0:

from sdv.datasets.demo import download_demo
from sdv.lite import SingleTablePreset
from sdv.evaluation.single_table import evaluate_quality

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=500)

quality_report = evaluate_quality(real_data, synthetic_data, metadata)

Output:

ValueError: Unable to parse string "27 Dec 2020" at position 0

The report should be able to parse this string because I have provided the format in the metadata.
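The failure can also be reproduced with pandas alone. This is only a minimal sketch, assuming the root cause is that `pd.to_numeric` receives the raw date strings (as the stack trace in this report suggests). The sample values and variable names are illustrative and not taken from the demo dataset.

```python
import pandas as pd

# Illustrative object-dtype column of formatted date strings.
dates = pd.Series(["27 Dec 2020", "30 Dec 2020"], dtype=object)

# Converting the raw strings directly fails with a ValueError.
raised = False
try:
    pd.to_numeric(dates)
except ValueError:
    raised = True

# Parsing with the declared format first lets the numeric conversion succeed.
parsed = pd.to_datetime(dates, format="%d %b %Y")
numeric = pd.to_numeric(parsed)
```

After parsing, `numeric` holds integer timestamps, so the later date maps to the larger value.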

Stack Trace

Creating report: 100%|██████████| 4/4 [00:00<00:00,  6.74it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "27 Dec 2020"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
6 frames
<ipython-input-8-8e8df8749faa> in <module>
      1 from sdv.evaluation.single_table import evaluate_quality
      2 
----> 3 quality_report = evaluate_quality(
      4     real_data,
      5     synthetic_data,

/usr/local/lib/python3.8/dist-packages/sdv/evaluation/single_table.py in evaluate_quality(real_data, synthetic_data, metadata, verbose)
     25     """
     26     quality_report = QualityReport()
---> 27     quality_report.generate(real_data, synthetic_data, metadata.to_dict(), verbose)
     28     return quality_report.get_score()
     29 

/usr/local/lib/python3.8/dist-packages/sdmetrics/reports/single_table/quality_report.py in generate(self, real_data, synthetic_data, metadata, verbose)
     81         existing_column_pairs.extend(
     82             list(self._metric_results['CorrelationSimilarity'].keys()))
---> 83         additional_results = discretize_and_apply_metric(
     84             real_data, synthetic_data, metadata, ContingencySimilarity, existing_column_pairs)
     85         self._metric_results['ContingencySimilarity'].update(additional_results)

/usr/local/lib/python3.8/dist-packages/sdmetrics/reports/utils.py in discretize_and_apply_metric(real_data, synthetic_data, metadata, metric, keys_to_skip)
    514     metric_results = {}
    515 
--> 516     binned_real, binned_synthetic, binned_metadata = discretize_table_data(
    517         real_data, synthetic_data, metadata)
    518 

/usr/local/lib/python3.8/dist-packages/sdmetrics/reports/utils.py in discretize_table_data(real_data, synthetic_data, metadata)
    478                     synthetic_col = pd.to_datetime(synthetic_col, format=field_meta['format'])
    479 
--> 480                 real_col = pd.to_numeric(real_col)
    481                 synthetic_col = pd.to_numeric(synthetic_col)
    482 

/usr/local/lib/python3.8/dist-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    181         coerce_numeric = errors not in ("ignore", "raise")
    182         try:
--> 183             values, _ = lib.maybe_convert_numeric(
    184                 values, set(), coerce_numeric=coerce_numeric
    185             )

/usr/local/lib/python3.8/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "27 Dec 2020" at position 0
@npatki npatki added bug Something isn't working feature:reports Related to any of the generated reports labels Feb 8, 2023

npatki commented Feb 9, 2023

The issue is in /reports/utils.py:

With the new metadata format, the datetime format is stored under a key called 'datetime_format'. Currently, we only check for the key called 'format', so the string-to-datetime conversion is skipped and pd.to_numeric receives the raw strings.

We'll need something like this after this line:

if real_col.dtype == 'O' and field_meta.get('datetime_format', ''):
    real_col = pd.to_datetime(real_col, format=field_meta['datetime_format'])
    synthetic_col = pd.to_datetime(synthetic_col, format=field_meta['datetime_format'])
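One way to cover both metadata formats is to look up either key before converting. The helper below is a hypothetical sketch: the name `convert_datetime_columns` and its signature are assumptions made for illustration, and it is not necessarily the change that was merged to fix this issue.

```python
import pandas as pd


def convert_datetime_columns(real_col, synthetic_col, field_meta):
    """Hypothetical helper: parse string datetimes, then convert to numeric.

    Assumes the format lives under 'datetime_format' (new metadata)
    or 'format' (old metadata).
    """
    datetime_format = field_meta.get('datetime_format') or field_meta.get('format')

    # Only object-dtype (string) columns with a declared format need parsing.
    if real_col.dtype == 'O' and datetime_format:
        real_col = pd.to_datetime(real_col, format=datetime_format)
        synthetic_col = pd.to_datetime(synthetic_col, format=datetime_format)

    return pd.to_numeric(real_col), pd.to_numeric(synthetic_col)


# Usage with illustrative data and both metadata styles.
real = pd.Series(["27 Dec 2020", "30 Dec 2020"], dtype=object)
synthetic = pd.Series(["28 Dec 2020", "29 Dec 2020"], dtype=object)

new_real, new_synth = convert_datetime_columns(
    real, synthetic, {'sdtype': 'datetime', 'datetime_format': '%d %b %Y'})
old_real, old_synth = convert_datetime_columns(
    real, synthetic, {'type': 'datetime', 'format': '%d %b %Y'})
```

Reading the format once keeps the old-style and new-style lookups in a single branch, so the two conversions cannot drift apart.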
