TableVectorize cleanup! #579
The difference in behavior is that now to use remainder="drop" on numerical
columns, users must do:
tv = TableVectorizer(remainder="drop", numerical_transformer="remainder")
which is a bit more explicit and less error-prone (most users don't want their
numerical values to be dropped).
Seems relevant from a cursory look.
Nice, thanks for spotting that, this is absolutely a typo!
I'm not so sure, I think the logic is that if there are "false missing",
Edit: after reviewing the PR, that makes sense if we remove the function call from
However, I can see a way the logic could be improved (unrelated to the previous paragraph): instead of checking for missing values on X as a whole, we could treat each column independently.
Absolutely! And for 8 bits, there was most likely a technical reason, but I guess this is not a problem today. And for 2 and 4, 5 and 6, you have my 👍 😄
Looks great, thanks
It would be good to have a changelog entry, so that people transitioning from dirty-cat see the difference.
LGTM, but please add a small changelog entry
np.uint32,
np.uint16,
]
include="number"
Nice simplification <3
if X[col].nunique() < self.cardinality_threshold:
    low_card_cat_columns.append(col)
else:
    high_card_cat_columns.append(col)
Much cleaner code. Thanks!
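For readers without the full diff, here is a minimal sketch of the single-loop split quoted above. It uses a plain dict of lists instead of a DataFrame, and the function name `split_by_cardinality` is hypothetical. It also assumes there are no missing values, since `len(set(...))` counts them whereas pandas `nunique()` ignores them by default.

```python
def split_by_cardinality(columns, cardinality_threshold):
    """Split column names into low- and high-cardinality lists in one pass.

    `columns` maps a column name to its list of values (a stand-in for a
    DataFrame; this is an assumption made for the sketch).
    """
    low_card_cat_columns = []
    high_card_cat_columns = []
    for name, values in columns.items():
        # Number of distinct values in the column.
        if len(set(values)) < cardinality_threshold:
            low_card_cat_columns.append(name)
        else:
            high_card_cat_columns.append(name)
    return low_card_cat_columns, high_card_cat_columns


columns = {
    "city": ["Paris", "Rome", "Paris", "Oslo"],  # 3 distinct values
    "id": ["a", "b", "c", "d"],                  # 4 distinct values
}
low, high = split_by_cardinality(columns, cardinality_threshold=4)
print(low, high)  # ['city'] ['id']
```

Each column is visited once and lands in exactly one of the two lists, so the lists cannot disagree the way two separate loops could.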
d3790bc to b9cb1bc
I'm committing to this branch to add the PR to the changelog (great for advanced users, and helps involve the community)
Merged!!
This PR addresses what seems to be an inaccuracy in the way we handle numerical values by default in TableVectorizer, along with various readability issues.
Main issue
We currently have:
but this defeats the purpose of cloning the transformers to avoid altering them.
What we want instead is:
(notice the _)
The difference in behavior is that now to use remainder="drop" on numerical columns, users must do:
tv = TableVectorizer(remainder="drop", numerical_transformer="remainder")
which is a bit more explicit and less error-prone (most users don't want their numerical values to be dropped).
What do you think?
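The two snippets referred to above are not reproduced on this page, so the following is only a generic sketch of the scikit-learn convention the description appears to rely on: constructor parameters stay untouched, and the copy that gets fitted is stored under a name with a trailing underscore. The class names are hypothetical, and `copy.deepcopy` stands in for `sklearn.base.clone` so the example has no dependencies.

```python
import copy


class MeanTransformerStub:
    """Hypothetical stand-in for a scikit-learn style transformer."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self


class VectorizerSketch:
    """Hypothetical estimator; this is not the actual TableVectorizer code."""

    def __init__(self, numerical_transformer=None):
        # Constructor parameters are stored exactly as given.
        self.numerical_transformer = numerical_transformer

    def fit(self, values):
        # Fit a copy and keep it under a trailing-underscore attribute,
        # so the object passed by the user is never altered.
        self.numerical_transformer_ = copy.deepcopy(self.numerical_transformer)
        self.numerical_transformer_.fit(values)
        return self


user_transformer = MeanTransformerStub()
vectorizer = VectorizerSketch(numerical_transformer=user_transformer)
vectorizer.fit([1.0, 2.0, 3.0])

print(hasattr(user_transformer, "mean_"))       # False: the user's object is untouched
print(vectorizer.numerical_transformer_.mean_)  # 2.0: only the copy was fitted
```

Assigning the copy back to `self.numerical_transformer` instead would overwrite the parameter the user passed, which is the kind of alteration cloning is meant to avoid.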
Other smallish readability issues
1. replace_false_missing() needs to be run before _auto_cast(), so that we don't have to call it twice
2. low_card_cat_columns and high_card_cat_columns can be created with a single for-loop instead of two
3. numeric_columns = X.select_dtypes("number") instead of listing all int, uint and float types (uint8 and int8 are missing, is it on purpose?)
4. categorical_columns is a subset of X.columns
5. check_is_fitted() on "transformers_" rather than "columns_", since the latter is created at the beginning of the function whereas the former is created during the fit_transform of ColumnTransformer. In case of error, columns_ would be defined and hence the estimator would be mistakenly considered fitted.
6. remainder, which currently asserts that the default value is drop, whereas it's actually passthrough.
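The point about check_is_fitted() can be illustrated with a small dependency-free sketch. Everything here is hypothetical: `check_fitted` and `NotFittedError` are simplified stand-ins for scikit-learn's `check_is_fitted` and `NotFittedError`, and the estimator only mimics the order in which the two attributes are assigned.

```python
class NotFittedError(Exception):
    """Hypothetical stand-in for sklearn.exceptions.NotFittedError."""


def check_fitted(estimator, attribute):
    """Simplified stand-in: raise if the fitted attribute is absent."""
    if not hasattr(estimator, attribute):
        raise NotFittedError(f"missing attribute {attribute!r}")


class FailingVectorizerSketch:
    """Hypothetical estimator whose fit can fail half-way through."""

    def fit(self, X):
        # Assigned at the very beginning of fit.
        self.columns_ = list(X)
        if not X:
            raise ValueError("nothing to fit")
        # Assigned only once fitting has actually succeeded.
        self.transformers_ = [("passthrough", self.columns_)]
        return self


est = FailingVectorizerSketch()
try:
    est.fit({})  # fails after columns_ has already been set
except ValueError:
    pass

# Checking the early attribute passes silently although fit failed.
check_fitted(est, "columns_")

# Checking the late attribute reports the estimator as not fitted.
try:
    check_fitted(est, "transformers_")
except NotFittedError as exc:
    print("not fitted:", exc)
```

An attribute that is only assigned at the end of a successful fit is therefore the safer one to check.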