Fix ReferentialIntegrity NaN handling #494

Closed
frances-h opened this issue Nov 2, 2023 · 0 comments · Fixed by #499
Labels
bug Something isn't working
  • SDMetrics version: diagnostic_report_updates
  • Python version:
  • Operating System:

Error Description

Currently, the ReferentialIntegrity metric counts a NaN foreign key value as not referencing a parent unless the parent keys also include a NaN. The logic should be updated to handle null foreign keys as follows:

  1. If the real foreign keys contain NaN values, the metric should ignore NaN values in the synthetic foreign keys. For example:
>>> import numpy as np
>>> parent_keys = ['a', 'b', 'c']
>>> foreign_keys = ['a', 'a', 'b', 'c', np.nan]
>>> ReferentialIntegrity.compute_breakdown(
...     real_data=(parent_keys, foreign_keys),
...     synthetic_data=(parent_keys, foreign_keys))
{'score': 1.0}
  2. If the real foreign keys DO NOT contain NaN values, the metric should treat NaN values in the synthetic foreign keys as a regular foreign key value (current functionality). For example:
>>> parent_keys = ['a', 'b', 'c']
>>> foreign_keys = ['a', 'a', 'b', 'c', 'a']
>>> synth_foreign_keys = ['a', 'a', 'b', 'c', np.nan]
>>> ReferentialIntegrity.compute_breakdown(
...     real_data=(parent_keys, foreign_keys),
...     synthetic_data=(parent_keys, synth_foreign_keys))
{'score': 0.8}
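The two rules above could be sketched roughly as follows. This is only an illustration in plain pandas, not the SDMetrics implementation: `referential_integrity_sketch` is a hypothetical helper name, and returning NaN when no foreign keys remain is an assumption the issue does not address.

```python
import numpy as np
import pandas as pd


def referential_integrity_sketch(real_data, synthetic_data):
    """Hypothetical helper (not SDMetrics code).

    Returns the fraction of synthetic foreign keys that appear in the
    synthetic parent keys, following the two rules proposed in this issue.
    """
    _, real_fk = real_data
    synth_pk, synth_fk = synthetic_data
    real_fk = pd.Series(real_fk)
    synth_pk = pd.Series(synth_pk)
    synth_fk = pd.Series(synth_fk)

    # Rule 1: if the real foreign keys contain NaN, null references are
    # legitimate, so NaN values in the synthetic foreign keys are ignored.
    if real_fk.isna().any():
        synth_fk = synth_fk.dropna()

    # Assumption: the score is undefined when nothing is left to check.
    if synth_fk.empty:
        return float('nan')

    # Rule 2 (and the general case): every remaining value, including NaN,
    # is treated as a regular key that must be present in the parent keys.
    return float(synth_fk.isin(synth_pk).mean())


parent_keys = ['a', 'b', 'c']

# Example 1 from the issue: real foreign keys contain NaN.
fk_with_nan = ['a', 'a', 'b', 'c', np.nan]
score_1 = referential_integrity_sketch(
    (parent_keys, fk_with_nan), (parent_keys, fk_with_nan))

# Example 2 from the issue: only the synthetic foreign keys contain NaN.
real_fk = ['a', 'a', 'b', 'c', 'a']
synth_fk = ['a', 'a', 'b', 'c', np.nan]
score_2 = referential_integrity_sketch(
    (parent_keys, real_fk), (parent_keys, synth_fk))

print(score_1, score_2)
```

With these inputs the sketch gives 1.0 for the first example and 0.8 for the second, matching the scores requested above.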

Steps to reproduce

import numpy as np
import pandas as pd

from sdmetrics.column_pairs import ReferentialIntegrity

# Each argument is a (primary keys, foreign keys) tuple; the keys are wrapped in pandas Series here.
real_primary_keys = pd.Series(['a', 'b', 'c'])
real_foreign_keys = pd.Series(['a', 'a', 'a', 'b', 'c', np.nan])
synthetic_primary_keys = pd.Series(['id1', 'id2', 'id3'])
synthetic_foreign_keys = pd.Series(['id1', 'id2', 'id2', 'id3', np.nan, np.nan])

ReferentialIntegrity.compute(
    real_data=(real_primary_keys, real_foreign_keys),
    synthetic_data=(synthetic_primary_keys, synthetic_foreign_keys))
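
To make the gap concrete without calling SDMetrics, the reproduction data can be scored by hand in plain pandas. The `isin(...).mean()` line is an assumed approximation of the current behavior, based only on the description in this issue; the second score applies the proposed rule for the case where the real foreign keys contain NaN.

```python
import numpy as np
import pandas as pd

# Same synthetic keys as in the reproduction steps.
synthetic_primary_keys = pd.Series(['id1', 'id2', 'id3'])
synthetic_foreign_keys = pd.Series(['id1', 'id2', 'id2', 'id3', np.nan, np.nan])

# Assumed approximation of the current behavior: NaN is not among the
# parent keys, so both NaN rows count as broken references.
current_score = synthetic_foreign_keys.isin(synthetic_primary_keys).mean()

# Proposed behavior: the real foreign keys contain NaN, so NaN rows in the
# synthetic foreign keys are dropped before checking references.
non_null_foreign_keys = synthetic_foreign_keys.dropna()
proposed_score = non_null_foreign_keys.isin(synthetic_primary_keys).mean()

print(current_score, proposed_score)
```

Under these assumptions, four of the six foreign keys resolve today (about 0.67), while the proposed rule would score the same data as 1.0.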

frances-h added the bug and new labels on Nov 2, 2023
npatki removed the new label on Nov 2, 2023
npatki added this to the 0.13.0 milestone on Nov 2, 2023