Metadata auto-detection should ensure primary keys are unique (special sdtypes are not exempt from this rule!) #1871

Closed
npatki opened this issue Mar 26, 2024 · 0 comments · Fixed by #1876
Labels
bug Something isn't working feature:metadata Related to describing the dataset
Comments

@npatki
Contributor

npatki commented Mar 26, 2024

Environment Details

  • SDV version: 1.11.0

Error Description

In the metadata auto-detection logic: if a column is detected as a real-world sdtype (such as latitude), auto-detection appears willing to make it the primary key even when its values are not unique. This is a problem because every primary key is expected to be unique. Metadata auto-detection should never select a column as the primary key if it contains repeating values.

Steps to reproduce

In the example below, the latitude column has repeating values. The table has no primary key:

import pandas as pd
import numpy as np

from sdv.metadata import SingleTableMetadata

data = pd.DataFrame(data={
    'Age': [int(i) for i in np.random.uniform(low=0, high=100, size=100)],
    'Sex': np.random.choice(['Male', 'Female'], size=100),
    'latitude': [round(i, 2) for i in np.random.uniform(low=-90, high=+90, size=50)] * 2
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

The metadata detection marks latitude as a primary key column even though it contains repeating values. This should not be possible. As a result, the validation fails.

metadata.validate_data(data)
InvalidDataError: The provided data does not match the metadata:
Key column 'latitude' contains repeating values: [-1.43, -12.66, -18.24, '+ 47 more']
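
The repetition can be confirmed independently of SDV. The following is a minimal sketch in plain Python; `repeated_values` is a hypothetical helper written for illustration and is not part of the SDV API. It uses small fixed data built the same way as the latitude column above (a list of distinct values repeated twice) so the result is deterministic.

```python
from collections import Counter


def repeated_values(values):
    """Return the distinct values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]


# Same construction as the report: distinct values, then the whole list repeated.
latitude = [10.5, -42.13, 77.0] * 2

print(repeated_values(latitude))   # [10.5, -42.13, 77.0]
print(repeated_values([1, 2, 3]))  # []
```

Any column for which this helper returns a non-empty list cannot serve as a primary key.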

Other Context

  1. Another rule for primary keys is that they must be non-null (see "Metadata auto-detection should not assign a primary key if there are NaN values in it", #1740). This rule already works for all sdtypes, as expected.

  2. This issue only happens with real-world sdtypes. If I rename the latitude column to something else, the detected sdtype is no longer latitude but numerical. In that case, auto-detection correctly excludes the column from being a primary key.
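
The two rules above (non-null and unique) can be expressed as one eligibility check that ignores the sdtype entirely. This is only a sketch of the expected behavior, not SDV's actual detection code: `is_missing`, `can_be_primary_key`, and `candidate_primary_keys` are hypothetical names, and the table is modeled as a plain dict of column name to values so the example has no dependencies.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def can_be_primary_key(values):
    """Hypothetical check: every value is present and no value repeats."""
    values = list(values)
    if any(is_missing(v) for v in values):
        return False
    return len(set(values)) == len(values)


def candidate_primary_keys(table):
    """Return the column names in a {name: values} mapping that pass the check."""
    return [name for name, values in table.items() if can_be_primary_key(values)]


table = {
    'latitude': [12.5, -3.25, 12.5, -3.25],   # repeating values
    'user_id': ['a1', 'b2', 'c3', 'd4'],      # unique and non-null
    'score': [1.0, float('nan'), 3.0, 4.0],   # contains a missing value
}

print(candidate_primary_keys(table))  # ['user_id']
```

With pandas, the same condition on a Series `s` can be written as `s.notna().all() and s.is_unique`. The point of the sketch is that the check depends only on the values, so a column named latitude is rejected for the same reason a numerical column would be.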
