[AIR] Refactor _get_unique_value_indices
#24144
LGTM, just a small type annotation nit!
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
dataset: Dataset, *columns: str
dataset: Dataset,
*columns: str,
drop_na_values: bool = False,
Do we need to add this if it's not being used?
It's actually being used in the Categorizer PR! Left it in here by mistake, but if we are going to merge that as well, it doesn't really make a huge difference to keep it in.
Adds a Categorizer preprocessor to automatically set the Categorical dtype on a dataset. This is useful for, e.g., LightGBM, which has built-in support for features with that dtype. Depends on #24144.
Why are these changes needed?
Refactors _get_unique_value_indices (used in Encoder preprocessors) for much improved performance with multiple columns. Also uses the same, more robust intermediary dataset format in _get_most_frequent_values (Imputers). The existing unit tests pass, and no functionality has been changed.
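To illustrate the idea of handling several columns in one pass, here is a minimal sketch. It is not the code from this PR: the function name unique_value_indices_sketch is hypothetical, batches are modelled as plain dicts of column lists rather than a real Dataset, None stands in for a missing value, and raising ValueError on missing values when drop_na_values is False is a choice made for this sketch only.

```python
from typing import Any, Dict, Iterable, List, Set


def unique_value_indices_sketch(
    batches: Iterable[Dict[str, List[Any]]],
    *columns: str,
    drop_na_values: bool = False,
) -> Dict[str, Dict[Any, int]]:
    """Map each unique value of every requested column to an integer index.

    Hypothetical illustration: all columns are collected in a single pass
    over the batches instead of one full pass per column.
    """
    # One set of seen values per column, filled while iterating once.
    seen: Dict[str, Set[Any]] = {col: set() for col in columns}
    for batch in batches:
        for col in columns:
            for value in batch[col]:
                if value is None:
                    if drop_na_values:
                        continue
                    # Assumption of this sketch: missing values are an error
                    # unless the caller asked to drop them.
                    raise ValueError(f"Column {col!r} contains a missing value.")
                seen[col].add(value)
    # Sort so that index assignment is deterministic across runs.
    return {
        col: {value: index for index, value in enumerate(sorted(values))}
        for col, values in seen.items()
    }


example_batches = [
    {"a": ["x", "y"], "b": [1, None]},
    {"a": ["y", "z"], "b": [2, 1]},
]
result = unique_value_indices_sketch(
    example_batches, "a", "b", drop_na_values=True
)
# result == {"a": {"x": 0, "y": 1, "z": 2}, "b": {1: 0, 2: 1}}
```

Keeping one set per column and merging batch by batch means the data is scanned once regardless of how many columns are requested, which is the kind of saving the description refers to.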
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.