Improve metadata detection #1515

amontanez24 · 2023-07-27T19:39:20Z

Problem Description

After using the metadata detection, I still have to make a lot of manual updates to make the metadata match the data. Some of these changes seem like they could be automated.

Expected behavior

When detect_from_dataframe or detect_from_csv is called, sdtypes should be assigned using the logic below:

If dtype==boolean: The sdtype is categorical.
If dtype==pd.datetime: The sdtype is always datetime. Don't assign a datetime_format.
If dtype==int: The sdtype should be one of: numerical, categorical or id. We can tell them apart based on cardinality (# of unique values).
1. If there are <=5 rows, we can't properly detect this. Sdtype is numerical.
2. If the cardinality is less than round(len / 10) (in other words, the number of unique values is less than 10% of the number of rows) and the values are all >=0, sdtype is categorical.
3. Elif there are null values present, sdtype is numerical. (Rationale: an "id" column cannot have null values)
4. Elif the values are all unique, sdtype is id. Don't assign a regex_format.
5. Else (Fallback) sdtype is numerical
If dtype==float: The sdtype should be one of: numerical or categorical.
1. If there are <=5 rows, we can't properly detect this. Sdtype is numerical.
2. If the cardinality is less than round(len / 10) and all the values are whole numbers (>=0) and there are NaN values, sdtype is categorical.
  - This covers the edge case where the column is supposed to be an int but has nulls. The default pandas int dtypes cannot hold NaN, so pandas represents such a column as a float.
3. Else (Fallback) sdtype is numerical
If dtype==object:
1. If the values can be cast as a datetime with a consistent datetime format string: sdtype is datetime with the appropriate datetime_format assigned. Right now, we do not support datetime columns with inconsistent formats or with special values such as "12-31-9999". This one might be slow (especially when matching format), so we would probably require subsampling here.
2. Elif the entire column can be cast to an int, cast it and then follow the logic of dtype==int
3. Elif the entire column can be cast to a float, cast it and follow the logic of dtype==float
4. Elif the values are all unique, sdtype is id. Don't assign a regex_format.
5. Elif the cardinality is less than round(len / 5), sdtype is categorical.
6. Else (Fallback) the column is probably a pii type (we may learn later that it's actually a foreign key). Assign sdtype: "unknown", pii: True
Any other dtype is unsupported. We should raise an error in this case.
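
The steps above can be sketched in pandas roughly as follows. This is only an illustration of the proposed rules, not the SDV implementation: the function names, the CANDIDATE_FORMATS list, the sample_size default, the choice of ValueError, and the reading of "cardinality" as the number of distinct non-null values are all assumptions made for the example.

```python
from datetime import datetime

import pandas as pd
from pandas.api import types as pdt

# Hypothetical candidate formats; a real detector would need a broader search.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y", "%d/%m/%Y")


def low_cardinality(column, divisor):
    # Assumption: "cardinality" counts distinct non-null values.
    return column.nunique() < round(len(column) / divisor)


def detect_int(column):
    if len(column) <= 5:
        return {"sdtype": "numerical"}  # too few rows to decide
    if low_cardinality(column, 10) and (column.dropna() >= 0).all():
        return {"sdtype": "categorical"}
    if column.isna().any():
        return {"sdtype": "numerical"}  # an id column cannot contain nulls
    if column.is_unique:
        return {"sdtype": "id"}  # no regex_format assigned
    return {"sdtype": "numerical"}


def detect_float(column):
    if len(column) <= 5:
        return {"sdtype": "numerical"}
    values = column.dropna()
    whole_non_negative = ((values % 1 == 0) & (values >= 0)).all()
    if low_cardinality(column, 10) and whole_non_negative and column.isna().any():
        return {"sdtype": "categorical"}  # likely an int column that has nulls
    return {"sdtype": "numerical"}


def guess_datetime_format(column, sample_size=1000):
    # Subsample, since checking every value against every format can be slow.
    values = column.dropna()
    if values.empty:
        return None
    sample = values.sample(n=min(len(values), sample_size), random_state=0)
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in sample:
                datetime.strptime(value, fmt)
            return fmt  # every sampled value matched this one format
        except (ValueError, TypeError):
            continue
    return None


def detect_object(column):
    fmt = guess_datetime_format(column)
    if fmt is not None:
        return {"sdtype": "datetime", "datetime_format": fmt}
    try:
        numeric = pd.to_numeric(column)  # raises if any value is not numeric
    except (ValueError, TypeError):
        numeric = None
    if numeric is not None and pdt.is_integer_dtype(numeric):
        return detect_int(numeric)
    if numeric is not None and pdt.is_float_dtype(numeric):
        return detect_float(numeric)
    if column.is_unique:
        return {"sdtype": "id"}
    if low_cardinality(column, 5):
        return {"sdtype": "categorical"}
    return {"sdtype": "unknown", "pii": True}


def detect_sdtype(column):
    if pdt.is_bool_dtype(column):
        return {"sdtype": "categorical"}
    if pdt.is_datetime64_any_dtype(column):
        return {"sdtype": "datetime"}  # no datetime_format assigned
    if pdt.is_integer_dtype(column):
        return detect_int(column)
    if pdt.is_float_dtype(column):
        return detect_float(column)
    if pdt.is_object_dtype(column):
        return detect_object(column)
    raise ValueError(f"Unsupported dtype: {column.dtype}")
```

For example, detect_sdtype(pd.Series(range(100))) falls through to the uniqueness check and yields an id column, while a 100-row int column holding only the values 0, 1 and 2 is treated as categorical. Letting pd.to_numeric decide between the int and float paths is a shortcut for the two separate "can be cast" steps and may differ from them in edge cases such as strings like "1.0".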

Additional context

We may also want to check if the dtype == 'categorical'. Right now we're ignoring it because RDT seems to crash with this dtype in some situations.
We are adding a new sdtype called unknown. Handling this type will be done in Support 'unknown' sdtype #1516.


amontanez24 added the feature request and feature:metadata labels Jul 27, 2023

This was referenced Jul 27, 2023

Support 'unknown' sdtype #1516

Closed

Detect primary keys in metadata #1521

Closed

R-Palazzo mentioned this issue Aug 3, 2023

Improve metadata detection #1529

Merged

amontanez24 modified the milestones: 1.3.1, 1.4.0 Aug 14, 2023

npatki mentioned this issue Sep 20, 2023

[Metadata Detection] Only make primary/foreign keys sdtype id (leave others as unknown) #1598

Closed

amontanez24 mentioned this issue Sep 27, 2023

Metadata improvements #1610

Merged

amontanez24 closed this as completed in #1610 Sep 27, 2023

amontanez24 assigned R-Palazzo Oct 11, 2023

amontanez24 added this to the 1.5.0 milestone Oct 11, 2023
