After using the metadata detection, I still have to make a lot of manual updates to make the metadata match the data. Some of these changes seem like they could be automated.
If dtype==pd.datetime: The sdtype is always datetime. Don't assign a datetime_format.
If dtype==int: The sdtype should be one of: numerical, categorical or id. We can tell them apart based on cardinality (# of unique values).
  If there are <=5 rows, we can't properly detect this. Sdtype is numerical.
  Elif the cardinality is less than round(len / 10) (in other words, the number of unique values is less than 10% of the number of rows) and the values are all >=0, sdtype is categorical.
  Elif there are null values present, sdtype is numerical. (Rationale: an "id" column cannot have null values.)
  Elif the values are all unique, sdtype is id. Don't assign a regex_format.
  Else (fallback) sdtype is numerical.
If dtype==float: The sdtype should be one of: numerical or categorical.
  If there are <=5 rows, we can't properly detect this. Sdtype is numerical.
  Elif the cardinality is less than round(len / 10) and all the values are whole numbers (>=0) and there are NaN values, sdtype is categorical.
    This is for the edge case where the column is supposed to be an int but it has nulls. By default, pandas cannot store NaN in an int column, so it represents the column as a float.
  Else (fallback) sdtype is numerical.
If dtype==object:
  If the values can be cast as a datetime with a consistent datetime format string, sdtype is datetime with the appropriate datetime_format assigned. Right now, we do not support datetime columns with inconsistent formats or with special values such as "12-31-9999". This check might be slow (especially when matching the format), so we would probably require subsampling here.
  Elif the entire column can be cast to an int, cast it and then follow the logic of dtype==int.
  Elif the entire column can be cast to a float, cast it and follow the logic of dtype==float.
  Elif the values are all unique, sdtype is id. Don't assign a regex_format.
  Elif the cardinality is less than round(len / 5), sdtype is categorical.
  Else (fallback) the sdtype is probably a pii type (we may learn later that it's actually a foreign key). Assign sdtype: "unknown", pii: True.
Any other dtype is unsupported. We should raise an error in this case.
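To make the proposed rules easier to review, here is a minimal sketch of how they could be expressed with pandas. It is not the SDV implementation: the function names (detect_sdtype, detect_int, detect_float, detect_object, matching_datetime_format), the candidate format list and the sample size are all hypothetical. It also assumes that cardinality excludes nulls, that "can be cast" means pd.to_numeric converts every non-null value, and it includes the boolean rule (dtype==boolean is categorical) that this issue lists under the expected behavior.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical candidate formats and sample size; the issue specifies neither.
CANDIDATE_DATETIME_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S")
SAMPLE_SIZE = 1000


def detect_int(column):
    """int rules: numerical, categorical or id."""
    n_rows = len(column)
    if n_rows <= 5:
        return {"sdtype": "numerical"}  # too few rows to tell
    values = column.dropna()
    if values.nunique() < round(n_rows / 10) and (values >= 0).all():
        return {"sdtype": "categorical"}
    if column.isna().any():
        return {"sdtype": "numerical"}  # an id column cannot have nulls
    if column.is_unique:
        return {"sdtype": "id"}  # no regex_format is assigned
    return {"sdtype": "numerical"}


def detect_float(column):
    """float rules: categorical only for int-like columns that contain NaN."""
    n_rows = len(column)
    values = column.dropna()
    if n_rows > 5 and column.isna().any():
        low_cardinality = values.nunique() < round(n_rows / 10)
        whole_non_negative = ((values >= 0) & (values % 1 == 0)).all()
        if low_cardinality and whole_non_negative:
            return {"sdtype": "categorical"}
    return {"sdtype": "numerical"}


def matching_datetime_format(column):
    """Return the first candidate format that parses every sampled value."""
    values = column.dropna()
    if values.empty or not values.map(lambda v: isinstance(v, str)).all():
        return None
    # Subsample because format matching can be slow on large columns.
    values = values.sample(n=min(len(values), SAMPLE_SIZE), random_state=0)
    for fmt in CANDIDATE_DATETIME_FORMATS:
        parsed = pd.to_datetime(values, format=fmt, errors="coerce")
        if parsed.notna().all():
            return fmt
    return None


def detect_object(column):
    """object rules: datetime, int, float, id, categorical, then unknown."""
    fmt = matching_datetime_format(column)
    if fmt is not None:
        return {"sdtype": "datetime", "datetime_format": fmt}
    numeric = pd.to_numeric(column, errors="coerce")
    if numeric[column.notna()].notna().all():  # every non-null value converted
        if ptypes.is_integer_dtype(numeric):
            return detect_int(numeric)
        if ptypes.is_float_dtype(numeric):
            return detect_float(numeric)
    if column.is_unique:
        return {"sdtype": "id"}
    if column.nunique() < round(len(column) / 5):
        return {"sdtype": "categorical"}
    return {"sdtype": "unknown", "pii": True}


def detect_sdtype(column):
    """Dispatch on the pandas dtype; raise for anything unsupported."""
    dtype = column.dtype
    if ptypes.is_bool_dtype(dtype):
        return {"sdtype": "categorical"}
    if ptypes.is_datetime64_any_dtype(dtype):
        return {"sdtype": "datetime"}  # no datetime_format is assigned
    if ptypes.is_integer_dtype(dtype):
        return detect_int(column)
    if ptypes.is_float_dtype(dtype):
        return detect_float(column)
    if ptypes.is_object_dtype(dtype):
        return detect_object(column)
    raise ValueError(f"Unsupported dtype: {dtype}")
```

In this sketch, an object column with nulls that is otherwise int-like is converted to float by pd.to_numeric, so it falls through to the float rules, which is the same edge case the float branch is meant to cover. The 'categorical' pandas dtype is not handled and ends up in the error branch.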
Additional context
We may also want to check if the dtype == 'categorical'. Right now we're ignoring it because RDT seems to crash with this dtype in some situations.
Expected behavior
When calling detect_from_dataframe or detect_from_csv, we should now use the logic in the steps defined above to assign sdtypes, with one additional rule: if dtype==boolean, the sdtype is categorical.
Handling the "unknown" sdtype will be done in Support 'unknown' sdtype #1516.