Improve metadata detection #1515

Closed
amontanez24 opened this issue Jul 27, 2023 · 0 comments · Fixed by #1529 or #1610
Labels: feature:metadata (Related to describing the dataset), feature request (Request for a new feature)


amontanez24 commented Jul 27, 2023

Problem Description

After using the metadata detection, I still have to make a lot of manual updates to make the metadata match the data. Some of these changes seem like they could be automated.

Expected behavior

When calling detect_from_dataframe or detect_from_csv, we should assign sdtypes using the following rules, checked in order:

  1. If dtype==boolean: The sdtype is categorical
  2. If dtype is a pandas datetime (e.g. datetime64[ns]): The sdtype is always datetime. Don't assign a datetime_format.
  3. If dtype==int: The sdtype should be one of: numerical, categorical or id. We can tell them apart based on cardinality (# of unique values).
    1. If there are <=5 rows, we can't properly detect this. Sdtype is numerical.
    2. If the cardinality is less than round(len / 10) (in other words, the number of unique categories is less than 10% of the number of rows) and the values are all >=0, sdtype is categorical.
    3. Elif there are null values present, sdtype is numerical. (Rationale: an "id" column cannot have null values)
    4. Elif the values are all unique, sdtype is id. Don't assign a regex_format.
    5. Else (Fallback) sdtype is numerical
  4. If dtype==float: The sdtype should be one of: numerical or categorical.
    1. If there are <=5 rows, we can't properly detect this. Sdtype is numerical.
    2. If the cardinality is less than round(len / 10) and all the values are whole numbers (>=0) and there are NaN values, sdtype is categorical.
      • This is for the edge case when the column is supposed to be an int but it has nulls. Pandas does not support int columns with NaNs, so it represents it as a float.
    3. Else (Fallback) sdtype is numerical
  5. If dtype==object:
    1. If the values can be cast as a datetime with a consistent datetime format string: sdtype is datetime with the appropriate datetime_format assigned. Right now, we do not support datetime columns with inconsistent formats or with special values such as "12-31-9999". This one might be slow (especially when matching format), so we would probably require subsampling here.
    2. Elif the entire column can be cast to an int, cast it and then follow the logic of dtype==int
    3. Elif the entire column can be cast to a float, cast it and follow the logic of dtype==float
    4. Elif the values are all unique, sdtype is id and don't assign a regex_format.
    5. Elif the cardinality is less than round(len / 5), sdtype is categorical.
    6. Else (Fallback) the column is probably a pii type (we may learn later that it's actually a foreign key). Assign sdtype: "unknown", pii: True
  6. Any other dtype is unsupported. We should raise an error in this case
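
The rules above could be sketched roughly as follows. This is an illustration only, not the SDV implementation: `detect_sdtype` and `detect_numeric_sdtype` are hypothetical names, step 5.1 (matching a datetime format string on object columns) is left out, and two details the rules don't pin down are assumptions here: cardinality is counted over non-null values, and the error raised for unsupported dtypes is a `ValueError`.

```python
import pandas as pd
from pandas.api import types as pdt


def detect_numeric_sdtype(column: pd.Series) -> str:
    """Hypothetical helper: sdtype for an int or float column (steps 3 and 4)."""
    n_rows = len(column)
    if n_rows <= 5:
        return "numerical"  # too few rows to detect anything reliably

    values = column.dropna()
    has_nulls = len(values) < n_rows
    # Assumption: cardinality is counted over non-null values only.
    low_cardinality = values.nunique() < round(n_rows / 10)
    non_negative = bool((values >= 0).all())

    if pdt.is_integer_dtype(column):
        if low_cardinality and non_negative:
            return "categorical"
        if has_nulls:
            return "numerical"  # an id column cannot contain nulls
        if values.is_unique:
            return "id"
        return "numerical"

    # Float column: categorical only for the "int column with NaNs" edge case.
    whole_numbers = bool((values % 1 == 0).all())
    if low_cardinality and whole_numbers and non_negative and has_nulls:
        return "categorical"
    return "numerical"


def detect_sdtype(column: pd.Series) -> dict:
    """Hypothetical entry point: returns the metadata entry for one column."""
    dtype = column.dtype
    if pdt.is_bool_dtype(dtype):
        return {"sdtype": "categorical"}
    if pdt.is_datetime64_any_dtype(dtype):
        return {"sdtype": "datetime"}  # no datetime_format assigned
    if pdt.is_integer_dtype(dtype) or pdt.is_float_dtype(dtype):
        return {"sdtype": detect_numeric_sdtype(column)}
    if pdt.is_object_dtype(dtype) or pdt.is_string_dtype(dtype):
        # Step 5.1 (datetime format matching) is intentionally omitted here.
        try:
            numeric = pd.to_numeric(column)  # raises if any value is not numeric
        except (ValueError, TypeError):
            numeric = None
        if numeric is not None and (
            pdt.is_integer_dtype(numeric) or pdt.is_float_dtype(numeric)
        ):
            return {"sdtype": detect_numeric_sdtype(numeric)}
        if column.is_unique:
            return {"sdtype": "id"}  # no regex_format assigned
        if column.nunique() < round(len(column) / 5):
            return {"sdtype": "categorical"}
        return {"sdtype": "unknown", "pii": True}
    # Assumption: the issue only says "raise an error"; ValueError is a guess.
    raise ValueError(f"Unsupported dtype: {dtype}")
```

For example, under these assumptions a 100-row integer column with all-unique values and no nulls comes back as `id`, while a 120-row column holding only 0, 1 and 2 comes back as `categorical`.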

Additional context

  • We may also want to check if the dtype == 'categorical'. Right now we're ignoring it because RDT seems to crash with this dtype in some situations
  • We are adding a new sdtype called unknown. Handling this type will be done in Support 'unknown' sdtype #1516.
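
As a small illustration of the two pandas behaviours this issue relies on (the int-with-nulls edge case from the float rule, and the pandas categorical dtype mentioned above), here is a minimal sketch. The variable names are made up for the example; pandas spells the categorical dtype `category`.

```python
import pandas as pd

# An integer column that contains a null is stored as float64 by default.
ints_with_null = pd.Series([1, 2, None])
print(ints_with_null.dtype)  # float64

# One way a pandas categorical column could be recognised.
categories = pd.Series(["a", "b", "a"], dtype="category")
is_category = isinstance(categories.dtype, pd.CategoricalDtype)
print(is_category)  # True
```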