Add validate method to SingleTableMetadata #879

Closed
amontanez24 opened this issue Jul 8, 2022 · 0 comments · Fixed by #930
Labels
feature request Request for a new feature
Problem Description

As a user, I want to be able to validate whether my metadata is formatted correctly.

Expected behavior

  • Add validate method
  • Validation consists of validating three separate parts of the metadata. The full details are in the Additional context section.
    1. Validating the columns
    2. Validating the keys
    3. Validating the constraints
  • If the metadata is not valid, raise an InvalidMetadataError with a description of all the errors found.
>>> metadata.validate()
InvalidMetadataError: The metadata is not valid

Error: Invalid values ("pii") for datetime column "start_date".
Error: Invalid regex format string "[A-{6}" for text column "user_id"
Error: Unknown key value 'uuid'. Keys should be columns that exist in the table.
Error: A Unique constraint is being applied to column "user_id". This column is already a key for that table.
Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers.
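
One way to get the "report every error at once" behavior shown above is to have each check return a list of messages and raise only after all checks have run. The following is a minimal sketch, not the SDV implementation: the dict-based metadata layout, the `checks` argument, and the `check_primary_key` helper are assumptions made for illustration. Only the `InvalidMetadataError` name and the message wording come from this issue.

```python
class InvalidMetadataError(Exception):
    """Raised when the metadata fails validation (name taken from the issue)."""


def check_primary_key(metadata):
    """Hypothetical check: return a list of messages, empty when all is well."""
    key = metadata.get("primary_key")
    if isinstance(key, str) and key not in metadata.get("columns", {}):
        return [
            f"Unknown key value '{key}'. "
            "Keys should be columns that exist in the table."
        ]
    return []


def validate(metadata, checks):
    """Run every check, collect all messages, and raise once at the end."""
    errors = []
    for check in checks:
        errors.extend(check(metadata))

    if errors:
        details = "\n".join(f"Error: {message}" for message in errors)
        raise InvalidMetadataError(f"The metadata is not valid\n\n{details}")
```

Calling `validate(metadata, [check_primary_key])` returns `None` for valid metadata. Collecting messages instead of raising inside each check is what lets a single exception describe every problem found.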

Additional context

  • Column validation: Each sdtype has different validation rules. They are listed below.

    • numerical
      • Required attributes are: representation
      • Throw an error if any attributes besides representation are present
        Error: Invalid values ("pii") for numerical column "age".
    • datetime
      • Required attributes are: datetime_format
      • "datetime_format" must be a valid, parsable format string
        Error: Invalid datetime format string "%O" for datetime column "start_date"
      • There should be no other attributes present
        Error: Invalid values ("pii") for datetime column "start_date".
    • categorical
      • Required attributes are either: order or order_by
      • "order" and "order_by" cannot both be present. You can only have 0 or 1 of these attributes
        Error: Categorical column "education" has both an "order" and "order_by" attribute. Only 1 is allowed.
      • If present, "order_by" must be either set to "numerical_value" or "alphabetical"
        Error: Unknown ordering method "testing" provided for categorical column "education". Ordering method must be "numerical_value" or "alphabetical"
      • If present, "order" must be a list with 1 or more elements
        Error: Invalid order value provided for categorical column "education". The "order" must be a list with 1 or more elements.
      • No other attributes can be present
        Error: Invalid values ("pii") for categorical column "education".
    • boolean
      • No required attributes
      • Throw an error if any attributes are present
        Error: Invalid value ("pii") for boolean column "is_subscribed".
    • text
      • Required attributes are: regex_format
      • If "regex_format" is present, it must be a valid regex string that can be parsed
        Error: Invalid regex format string "[A-{6}" for text column "user_id"
      • Throw an error if any other attributes are present
        Error: Invalid values ("pii") for text column "user_id".
    • Real World (Semantic) Types (e.g. phone_number)
      • No required parameters
      • pii is an optional parameter
      • If "pii" exists but it is not True or False, throw an error
        Error: Invalid pii value provided for phone_number column "user_cell". The "pii" value must be set to True or False.
      • Throw an error if any other attributes are present
        Error: Invalid values ("datetime_format") for phone_number column "user_cell".
      • Raise a warning if the sdtype isn't fully supported.
        Warning: sdtype 'location' is not fully supported. The SDV will model this as a categorical variable.
        Warning: sdtype 'location' is not fully supported. The SDV will anonymize this column using random characters.
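
The column rules above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the SDV implementation: it assumes each column is a plain dict with an "sdtype" entry plus its attributes, and `ALLOWED_ATTRIBUTES` and `validate_column` are hypothetical names. It treats any sdtype missing from the table as a semantic type, and it leaves out the required-attribute checks, the datetime format check, and the warnings.

```python
import re

# Hypothetical lookup of the attributes each sdtype may carry, per the rules above.
ALLOWED_ATTRIBUTES = {
    "numerical": {"representation"},
    "datetime": {"datetime_format"},
    "categorical": {"order", "order_by"},
    "boolean": set(),
    "text": {"regex_format"},
}


def validate_column(name, spec):
    """Return a list of error messages for one column specification."""
    errors = []
    sdtype = spec.get("sdtype")
    attributes = {key: value for key, value in spec.items() if key != "sdtype"}
    is_semantic = sdtype not in ALLOWED_ATTRIBUTES
    allowed = {"pii"} if is_semantic else ALLOWED_ATTRIBUTES[sdtype]

    # Any attribute outside the allowed set is reported in one message.
    unexpected = sorted(set(attributes) - allowed)
    if unexpected:
        listed = ", ".join(f'"{key}"' for key in unexpected)
        errors.append(f'Invalid values ({listed}) for {sdtype} column "{name}".')

    if sdtype == "text" and "regex_format" in attributes:
        regex = attributes["regex_format"]
        try:
            re.compile(regex)
        except (re.error, TypeError):
            errors.append(
                f'Invalid regex format string "{regex}" for text column "{name}"'
            )

    if sdtype == "categorical":
        if "order" in attributes and "order_by" in attributes:
            errors.append(
                f'Categorical column "{name}" has both an "order" and "order_by" '
                "attribute. Only 1 is allowed."
            )
        order_by = attributes.get("order_by")
        if "order_by" in attributes and order_by not in ("numerical_value", "alphabetical"):
            errors.append(
                f'Unknown ordering method "{order_by}" provided for categorical '
                f'column "{name}". Ordering method must be "numerical_value" or '
                '"alphabetical"'
            )
        order = attributes.get("order")
        if "order" in attributes and not (isinstance(order, list) and order):
            errors.append(
                f'Invalid order value provided for categorical column "{name}". '
                'The "order" must be a list with 1 or more elements.'
            )

    if is_semantic and "pii" in attributes and not isinstance(attributes["pii"], bool):
        errors.append(
            f'Invalid pii value provided for {sdtype} column "{name}". '
            'The "pii" value must be set to True or False.'
        )

    return errors
```

Keeping the allowed attributes in a lookup table means the "no other attributes" rule is written once rather than repeated for each sdtype.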
  • Key validation

    • "primary_key" must be a string or list of strings
    • "sequence_key" must be a string or list of strings
    • "alternate_keys" must be a list of strings or a nested list of strings
    • "sequence_index" must be a string
    • The strings must correspond to the column names as specified in the other part of the Metadata
      Error: Unknown key value 'uuid'. Keys should be columns that exist in the table.
    • "sequence_index" cannot be the same as "sequence_key"
      Error: sequence_index and sequence_key have the same value ('patient_id'). These columns must be different.
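
A rough sketch of the key rules above, assuming the metadata is a plain dict whose "columns" entry maps column names to their specifications. The helper names and the wording of the type-error messages are hypothetical, since this issue only specifies the "Unknown key value" and "same value" messages.

```python
def _is_str_or_list_of_str(value):
    """True for a string or a list made up only of strings."""
    return isinstance(value, str) or (
        isinstance(value, list) and all(isinstance(item, str) for item in value)
    )


def _as_list(value):
    return [value] if isinstance(value, str) else list(value)


def validate_keys(metadata):
    """Return a list of error messages for the key fields of the metadata."""
    errors = []
    columns = metadata.get("columns", {})
    key_names = []

    for field in ("primary_key", "sequence_key"):
        value = metadata.get(field)
        if value is None:
            continue
        if _is_str_or_list_of_str(value):
            key_names.extend(_as_list(value))
        else:
            # Hypothetical wording: the issue gives no message for type errors.
            errors.append(f'"{field}" must be a string or list of strings.')

    alternate = metadata.get("alternate_keys")
    if alternate is not None:
        if isinstance(alternate, list) and all(
            _is_str_or_list_of_str(entry) for entry in alternate
        ):
            for entry in alternate:
                key_names.extend(_as_list(entry))
        else:
            # Hypothetical wording, as above.
            errors.append(
                '"alternate_keys" must be a list of strings or a nested list of strings.'
            )

    index = metadata.get("sequence_index")
    if index is not None:
        if isinstance(index, str):
            key_names.append(index)
        else:
            # Hypothetical wording, as above.
            errors.append('"sequence_index" must be a string.')

    # Report each unknown name once, in the order it was first seen.
    for name in dict.fromkeys(key_names):
        if name not in columns:
            errors.append(
                f"Unknown key value '{name}'. "
                "Keys should be columns that exist in the table."
            )

    if index is not None and index == metadata.get("sequence_key"):
        errors.append(
            f"sequence_index and sequence_key have the same value ('{index}'). "
            "These columns must be different."
        )

    return errors
```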
  • Constraint validation

    • Use each constraint's _validate_inputs method and surface those errors (see "Add _validate_inputs class method to each constraint" #878)
    • In addition, still check for the following errors
    • Unique
      • Each of the column names in "column_names" must be a column that is present in the "columns" specification
        Error: A Unique constraint is being applied to invalid column names ("age", "weight"). The columns must exist in the table.
      • "column_names" must include at least 1 column that is NOT a primary key or alternate key. Primary keys and alternate keys will already be guaranteed to be unique, so there's no need to add it in as a constraint.
        Error: A Unique constraint is being applied to column "age". This column is already a key for that table.
    • FixedCombinations
      • Each of the column names in "column_names" must be a column that is present in the "columns" specification
        Error: A FixedCombinations constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
    • Inequality
      • The string in "high_column_name" and "low_column_name" must be a column that is present in the "columns" specification
        Error: An Inequality constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
      • Both high and low columns must be either type "numerical" or type "datetime"
        Error: An Inequality constraint is being applied to mismatched sdtypes ("C", "D"). Both columns must be either numerical or datetime.
    • ScalarInequality
      • "column_name" must refer to a column in the table
        Error: A ScalarInequality constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The "value" must make sense based on the column type
        • If the column is "numerical", then "value" must be an int or float
        • If the column is "datetime", then "value" must be a datetime string of the right format
        • No other types are compatible
          Error: A ScalarInequality constraint is being applied to mismatched sdtypes. Numerical columns must be compared to integer or float values. Datetime columns must be compared to datetime strings.
    • Range
      • The strings in each of the column names must be a column that is present in the "columns" specification
        Error: A Range constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
      • All columns must be either type "numerical" or type "datetime"
        Error: A Range constraint is being applied to mismatched sdtypes ("C", "D", "E"). All columns must be either numerical or datetime.
    • ScalarRange
      • "column_name" must refer to a column in the table
        Error: A ScalarRange constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The high and low values must make sense based on the column type
        • If the column is "numerical", then the values must be floats/ints
        • If the column is "datetime", then the values must be a datetime string of the right format
        • No other types are compatible
          Error: A ScalarRange constraint is being applied to mismatched sdtypes. Numerical columns must be compared to integer or float values. Datetime columns must be compared to datetime strings.
    • Positive
      • "column_name" must refer to a column in the table
        Error: A Positive constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The column must be of type "numerical"
        Error: A Positive constraint is being applied to an invalid column ("C"). This constraint is only defined for numerical columns.
    • Negative
      • "column_name" must refer to a column in the table
        Error: A Negative constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The column must be of type "numerical"
        Error: A Negative constraint is being applied to an invalid column ("C"). This constraint is only defined for numerical columns.
    • FixedIncrements
      • Column name should refer to a column defined in the metadata
        Error: A FixedIncrements constraint is being applied to invalid column names ("C"). The columns must exist in the table.
    • OneHotEncoding
      • Column names must be valid columns (present in the "columns" part of the metadata)
        Error: A OneHotEncoding constraint is being applied to invalid column names ("C", "D", "E"). The columns must exist in the table.
    • CustomConstraint
      • Column names must be valid columns (present in the "columns" part of the metadata)
        Error: A <module>.<name> constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
    • Misc
      • If the constraint isn't found, throw an error:
        Error: Invalid constraints ('Other').
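
Several of the constraint rules above share one shape: collect the columns a constraint refers to, confirm they exist, then apply a check specific to the constraint. The sketch below covers the unknown-constraint error, the column-existence check, the Unique key rule, and the numerical rule for Positive and Negative. It is not the SDV implementation. It assumes each constraint is a dict with a "constraint_name" entry next to its parameters, all helper names are hypothetical, and the per-constraint `_validate_inputs` step from #878 is left out.

```python
# Constraint names listed in this issue; anything else is reported as invalid.
KNOWN_CONSTRAINTS = {
    "Unique", "FixedCombinations", "Inequality", "ScalarInequality", "Range",
    "ScalarRange", "Positive", "Negative", "FixedIncrements", "OneHotEncoding",
}


def _quoted(names):
    return ", ".join(f'"{name}"' for name in names)


def _article(name):
    # "An Inequality" but "A OneHotEncoding", matching the messages in the issue.
    return "An" if name[0] in "AEIU" else "A"


def _used_columns(constraint):
    """Collect column references from the parameter names used in the issue."""
    used = list(constraint.get("column_names", []))
    for field in ("column_name", "high_column_name", "low_column_name"):
        if field in constraint:
            used.append(constraint[field])
    return used


def _key_columns(metadata):
    """Primary and alternate key columns, flattened into one set."""
    keys = set()
    entries = [metadata.get("primary_key")] + list(metadata.get("alternate_keys") or [])
    for entry in entries:
        if isinstance(entry, str):
            keys.add(entry)
        elif entry:
            keys.update(entry)
    return keys


def validate_constraint(constraint, metadata):
    """Return a list of error messages for one constraint definition."""
    name = constraint.get("constraint_name")
    if name not in KNOWN_CONSTRAINTS:
        return [f"Invalid constraints ('{name}')."]

    columns = metadata.get("columns", {})
    used = _used_columns(constraint)
    missing = [column for column in used if column not in columns]
    if missing:
        return [
            f"{_article(name)} {name} constraint is being applied to invalid "
            f"column names ({_quoted(missing)}). The columns must exist in the table."
        ]

    errors = []
    if name == "Unique" and used and all(c in _key_columns(metadata) for c in used):
        errors.append(
            f"A Unique constraint is being applied to column {_quoted(used)}. "
            "This column is already a key for that table."
        )
    if name in ("Positive", "Negative"):
        for column in used:
            if columns[column].get("sdtype") != "numerical":
                errors.append(
                    f'A {name} constraint is being applied to an invalid column '
                    f'("{column}"). This constraint is only defined for numerical columns.'
                )
    return errors
```

The sketch returns early when columns are missing, because the sdtype and key checks only make sense for columns that exist.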