Quality Report crashes when numerical column has only NaN values #273

Closed
npatki opened this issue Nov 28, 2022 · 1 comment
Labels: bug (Something isn't working), feature:reports (Related to any of the generated reports), resolution:duplicate (This issue or pull request already exists)

npatki commented Nov 28, 2022

Environment Details

  • SDMetrics version: 0.8.0
  • Python version: 3.7
  • Operating System: Linux

Error Description

A numerical column in the real data may contain missing values. Sometimes, the synthetic data may only produce these missing values and fail to create any numerical values. In such cases, the software crashes when I try to produce a quality report.

Expected Behavior: Certain metrics may not be computable if there are only NaN values. But instead of crashing the report, the error should be noted in the detailed breakdowns, and the report should still produce a score while ignoring the values (along with details, visualizations, etc.)
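The expected behavior could be sketched roughly as follows. This is a minimal illustration only, not SDMetrics code: `run_metrics` and `overall_score` are hypothetical helper names, each metric is assumed to be a plain callable returning a float, and catching every `Exception` is an assumption made for brevity.

```python
import math


def run_metrics(metrics, real_data, synthetic_data):
    """Run every metric; record a failure in the breakdown instead of raising.

    `metrics` maps a metric name to a callable (hypothetical interface).
    """
    results = {}
    for name, metric_fn in metrics.items():
        try:
            score = metric_fn(real_data, synthetic_data)
            results[name] = {'score': score, 'error': None}
        except Exception as exc:  # assumption: any failure is noted, not re-raised
            results[name] = {
                'score': math.nan,
                'error': f'{type(exc).__name__}: {exc}',
            }
    return results


def overall_score(results):
    """Average only the scores that could be computed."""
    scores = [r['score'] for r in results.values() if not math.isnan(r['score'])]
    return sum(scores) / len(scores) if scores else math.nan
```

With this shape, a metric that raises still appears in the detailed breakdown with its error message, and the overall score is computed from the remaining metrics.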

Steps to reproduce

import numpy as np
import pandas as pd
from sdmetrics.reports.single_table import QualityReport

real_data = pd.DataFrame(data={
    'col1': [1, 2, 1, 3, 4],
    'col2': [2, 4, 1, 7, 1]
})

# 'col2' contains only NaN values
synthetic_data = pd.DataFrame(data={
    'col1': [1, 3, 2, 2, 1],
    'col2': [np.nan]*5
})

metadata = {
    'fields': {
        'col1': { 'type': 'numerical', 'subtype': 'integer' },
        'col2': { 'type': 'numerical', 'subtype': 'integer' }
    }
}

report = QualityReport()
report.generate(real_data, synthetic_data, metadata)

Output

Creating report:  50%|█████     | 2/4 [00:00<00:00, 106.06it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-a6be2bd142ce> in <module>
     17 
     18 report = QualityReport()
---> 19 report.generate(real_data, synthetic_data, metadata)

3 frames
/usr/local/lib/python3.7/dist-packages/sdmetrics/reports/single_table/quality_report.py in generate(self, real_data, synthetic_data, metadata)
     71             try:
     72                 self._metric_results[metric.__name__] = metric.compute_breakdown(
---> 73                     real_data, synthetic_data, metadata)
     74             except IncomputableMetricError:
     75                 # Metric is not compatible with this dataset.

/usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/multi_column_pairs.py in compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
    128             synthetic = synthetic_data[list(sorted_columns)]
    129             breakdown[sorted_columns] = cls.column_pairs_metric.compute_breakdown(
--> 130                 real, synthetic, **kwargs)
    131 
    132         return breakdown

/usr/local/lib/python3.7/dist-packages/sdmetrics/column_pairs/statistical/correlation_similarity.py in compute_breakdown(cls, real_data, synthetic_data, coefficient)
     83 
     84         correlation_real, _ = correlation_fn(real_data[column1], real_data[column2])
---> 85         correlation_synthetic, _ = correlation_fn(synthetic_data[column1], synthetic_data[column2])
     86 
     87         if np.isnan(correlation_real) or np.isnan(correlation_synthetic):

/usr/local/lib/python3.7/dist-packages/scipy/stats/stats.py in pearsonr(x, y)
   4014 
   4015     if n < 2:
-> 4016         raise ValueError('x and y must have length at least 2.')
   4017 
   4018     x = np.asarray(x)

ValueError: x and y must have length at least 2.

Note: It is OK that the correlation metric is crashing (correlation is undefined if there are no values). But the report should not crash.
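One possible way to keep the correlation step from raising is to check how many usable pairs remain before computing it. The sketch below is an assumption about a fix, not the library's implementation: `safe_correlation` and `_pearson` are hypothetical names, and a small pure-Python Pearson coefficient stands in for `scipy.stats.pearsonr` so that the example is self-contained.

```python
import math


def _pearson(x, y):
    """Plain Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def safe_correlation(x, y, correlation_fn=_pearson):
    """Drop pairs containing NaN; return NaN if fewer than 2 pairs remain."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if len(pairs) < 2:
        # Correlation is undefined here, so report NaN instead of raising.
        return math.nan
    xs, ys = zip(*pairs)
    return correlation_fn(xs, ys)
```

For the synthetic data in this report, `safe_correlation([1, 3, 2, 2, 1], [math.nan] * 5)` returns NaN rather than raising. The traceback shows that `correlation_similarity.py` already checks `np.isnan(...)` on the next lines, so a NaN result would presumably reach that existing branch.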

npatki added the bug and feature:reports labels on Nov 28, 2022
npatki added this to the 0.8.1 milestone on Nov 30, 2022
npatki removed this from the 0.8.1 milestone on Dec 8, 2022
npatki commented Jul 25, 2023

This is a duplicate of #351, so I'll close it off in favor of the new issue.

npatki closed this as completed on Jul 25, 2023
npatki added the resolution:duplicate label on Jul 25, 2023