[AIR] Refactor `_get_unique_value_indices` #24144

Yard1 · 2022-04-23T22:52:10Z

Why are these changes needed?

Refactors _get_unique_value_indices (used in Encoder preprocessors) for much better performance when multiple columns are passed. _get_most_frequent_values (used in Imputers) now uses the same, more robust intermediate dataset format.

The existing unit tests pass, and no functionality has been changed.
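
As an illustration of the multi-column idea described above, here is a minimal sketch. It is not the code from this PR. It assumes batches are plain lists of row dictionaries (the real function operates on a Ray Dataset), and the helper name `unique_value_indices` is hypothetical. It collects the unique values of every requested column in one pass over the data, rather than one pass per column, then assigns each value an index in sorted order.

```python
from typing import Dict, Hashable, Iterable, List, Set

Row = Dict[str, Hashable]


def unique_value_indices(
    batches: Iterable[List[Row]], *columns: str
) -> Dict[str, Dict[Hashable, int]]:
    """Map each unique value of each column to a stable integer index."""
    # One set per column, all filled during a single pass over the batches.
    seen: Dict[str, Set[Hashable]] = {col: set() for col in columns}
    for batch in batches:
        for row in batch:
            for col in columns:
                seen[col].add(row[col])
    # Sort so the value -> index mapping is deterministic.
    return {
        col: {value: i for i, value in enumerate(sorted(values))}
        for col, values in seen.items()
    }


batches = [
    [{"color": "red", "size": "S"}, {"color": "blue", "size": "M"}],
    [{"color": "red", "size": "L"}],
]
indices = unique_value_indices(batches, "color", "size")
print(indices["color"])  # {'blue': 0, 'red': 1}
```

Sorting the collected values is a choice made for this sketch so that repeated runs produce the same mapping.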

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

clarkzinzow

LGTM, just a small type annotation nit!

python/ray/ml/preprocessors/imputer.py

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

python/ray/ml/preprocessors/imputer.py

matthewdeng · 2022-04-27T17:09:14Z

python/ray/ml/preprocessors/encoder.py

-    dataset: Dataset, *columns: str
+    dataset: Dataset,
+    *columns: str,
+    drop_na_values: bool = False,


Do we need to add this if it's not being used?

Yard1 · 2022-04-27

It's actually being used in the Categorizer PR! I left it in here by mistake, but if we are going to merge that PR as well, keeping it here doesn't make much difference.
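
For context on the signature change in the diff above: in Python, any parameter declared after `*columns` is keyword-only, so adding `drop_na_values` with a default does not affect existing positional callers. The sketch below uses a hypothetical stand-in function on a list of row dictionaries. The behavior shown for `drop_na_values` (skipping `None` values) is an assumption based on the parameter name, not something this thread confirms.

```python
from typing import Dict, Hashable, List, Optional, Set


def unique_values(
    rows: List[Dict[str, Optional[Hashable]]],
    *columns: str,
    drop_na_values: bool = False,  # keyword-only: it follows *columns
) -> Dict[str, Set[Optional[Hashable]]]:
    """Hypothetical stand-in; the None-skipping semantics are assumed."""
    result: Dict[str, Set[Optional[Hashable]]] = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            value = row[col]
            if drop_na_values and value is None:
                continue
            result[col].add(value)
    return result


rows = [{"a": "x"}, {"a": None}, {"a": "y"}]
# Existing call sites keep working unchanged.
with_na = unique_values(rows, "a")
# The new option can only be passed by name.
without_na = unique_values(rows, "a", drop_na_values=True)
```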

Adds a Categorizer preprocessor to automatically set the Categorical dtype on a dataset. This is useful for, e.g., LightGBM, which has built-in support for features with that dtype. Depends on #24144.
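
To illustrate what setting the Categorical dtype means, here is a minimal pandas sketch. It is not the Categorizer implementation, and the DataFrame and column names are made up for the example. Fixing the category list up front keeps the integer codes consistent from one batch to the next.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [1.0, 2.5, 3.0]})

# Fix the set of categories explicitly so every batch shares the same codes.
color_dtype = pd.CategoricalDtype(categories=["blue", "red"])
df["color"] = df["color"].astype(color_dtype)

# Each value is now stored as an integer code into the category list.
codes = df["color"].cat.codes.tolist()
```

A column converted this way carries the `category` dtype that the description above says LightGBM supports.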

Refactor _get_unique_value_indices

5697d07

Yard1 added this to the Ray AIR milestone Apr 23, 2022

Yard1 requested review from matthewdeng and clarkzinzow April 23, 2022 22:52

Yard1 assigned matthewdeng and clarkzinzow Apr 23, 2022

Merge branch 'ray-project:master' into air_categoricals

82367e9

Yard1 mentioned this pull request Apr 25, 2022

[AIR] Add Categorizer preprocessor #24180

Merged

6 tasks

clarkzinzow approved these changes Apr 27, 2022

Update python/ray/ml/preprocessors/imputer.py

6b51d72

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

matthewdeng reviewed Apr 27, 2022

matthewdeng reviewed Apr 27, 2022

clarkzinzow merged commit e62d3fa into ray-project:master Apr 28, 2022

Yard1 deleted the air_categoricals branch November 7, 2022 21:43
