ENH Allows multiclass target in `TargetEncoder` #26674

Merged

ogrisel merged 65 commits into scikit-learn:main from lucyleeow:target_encoder_multi

Sep 7, 2023

Member

lucyleeow commented Jun 23, 2023 (edited)

Reference Issues/PRs

closes #26613

What does this implement/fix? Explain your changes.

Allows multiclass target type in TargetEncoder, following section 3.3 of Micci-Barreca et al.
Uses LabelBinarizer to perform one-vs-rest on y and, for each feature, calculates the one-vs-rest target mean for each class, thus expanding the number of features to n_features * n_classes.
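For readers new to the idea, here is a minimal dependency-free sketch of the one-vs-rest target mean described above. It is not the code from this PR: the helper name `one_vs_rest_target_means` and the toy data are made up for illustration, and it deliberately skips the smoothing and cross fitting that TargetEncoder applies.

```python
from collections import defaultdict


def one_vs_rest_target_means(column, y):
    """Per-category mean of each one-vs-rest indicator of y.

    Hypothetical helper for illustration only: no smoothing and no
    cross fitting, unlike the real TargetEncoder.
    """
    classes = sorted(set(y))
    counts = defaultdict(int)
    hits = defaultdict(lambda: [0] * len(classes))
    for category, label in zip(column, y):
        counts[category] += 1
        # One-vs-rest: only the indicator of the observed class is 1.
        hits[category][classes.index(label)] += 1
    means = {cat: [h / counts[cat] for h in hits[cat]] for cat in counts}
    return classes, means


# One categorical feature and a 3-class target (toy data).
column = ["a", "a", "b", "b", "b"]
y = ["cat", "dog", "dog", "dog", "bird"]

classes, means = one_vs_rest_target_means(column, y)
# Each sample is replaced by n_classes values, so 1 feature becomes 3 columns.
encoded = [means[cat] for cat in column]
```

With `n_features` input columns, repeating this per column yields `n_features * n_classes` output columns.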

Any other comments?

First attempt, needs more thought on some aspects.

I am conflicted on the best order of the output features. Currently the order of features is:

feat0_class0, feat1_class0, feat2_class0, feat0_class1 ... (same classes are grouped)

I think grouping features may make more sense:

feat0_class0, feat0_class1, feat0_class2, feat1_class0 ... (same features are grouped)

This should not be too computationally expensive: it only requires an additional re-ordering of encodings_ (a list of ndarrays), which can be done with a list comprehension over a list of reordering indices.

Any suggestions welcome.
EDIT: have now amended such that same features are grouped together.
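To make the two orderings concrete, a small sketch of the re-ordering by index list mentioned above. The `feat{i}_class{j}` names follow the notation used in this description; the variable names are illustrative and not taken from the PR.

```python
n_features, n_classes = 3, 2

# Class-grouped order (first draft): all features for class 0, then class 1.
class_grouped = [
    f"feat{f}_class{c}" for c in range(n_classes) for f in range(n_features)
]

# Reordering indices: for each feature, pick its entry from every class block.
reorder = [c * n_features + f for f in range(n_features) for c in range(n_classes)]

# Feature-grouped order: all classes for feature 0, then feature 1, and so on.
feature_grouped = [class_grouped[i] for i in reorder]
```

The same `reorder` list could be applied to any per-column list of length `n_features * n_classes`.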

TODO:

Add custom get_feature_names_out for new features names that include classes
Add doc for new class attributes
Update target encoder user guide
Update and add tests

cc @thomasjpfan


          add multiclass fun, first draft

c8d35eb

github-actions bot added the module:preprocessing label

github-actions bot commented Jun 23, 2023 (edited)

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9fb9987. Link to the linter CI: here

lucyleeow added 8 commits

June 23, 2023 16:46


          black

037c6bf


          remove testing cruft

14c6cf3


          Merge branch 'main' into target_encoder_multi

e6b4364


          formatting

4fba5dd


          formatting

2c01358


          reorder feature columns, group same features

6f7c98a


          fix transform_X_ordinal

f773a96


          formatting

f9ced51

thomasjpfan reviewed

Member

thomasjpfan left a comment

I'm happy with the ordering:

feat0_class0, feat0_class1, feat0_class2, feat1_class0

At this point, it'll be good to prioritize writing tests to make sure multiclass gives reasonable results.

sklearn/preprocessing/_target_encoder.py Outdated

Comment on lines 304 to 307

+                          n_classes = self._label_binarizer_.classes_.shape[0]
+                          X_ordinal, X_valid = [
+                              np.repeat(X, n_classes, axis=1) for X in (X_ordinal, X_valid)
+                          ]

Member

thomasjpfan Jul 4, 2023

Same here regarding not needing to repeat X_ordinal and X_valid.

lucyleeow added 18 commits

July 5, 2023 13:09


          Merge branch 'main' into target_encoder_multi

0b9dbc2


          fix target mean in cv

41d86e8


          Merge branch 'main' into target_encoder_multi

896dfbf


          review

2f702e0


          Merge branch 'main' into target_encoder_multi

ecd4342


          typo

0365d3e


          fix multi_idx, remove reodering of encodings

5bc120b


          formatting

d2d37a0


          fix indicies, add transform multiclass fun

6bae2c1


          format

3358fca


          simplify test

6d3be26


          add comment

acec424


          add _ to end of n_classes

2b33d47


          test nitpicks

96fedab


          order X_out in place, use single transform X ordinal fun

ca46455


          add get_feature_names_out

68db78a


          add changelog

c2f8517


          format

c7e9a0f

lucyleeow added 2 commits

August 3, 2023 17:48


          update name off learn_encodings

dce2c56


          format

f99bfae

thomasjpfan reviewed

lucyleeow added 2 commits

August 8, 2023 14:30


          revert test nitpicks

2dca9dd


          review

a01b656

lucyleeow mentioned this pull request

CLN Update var name in TargetEncoder to make consistent #27033

Merged

lucyleeow added 2 commits

August 9, 2023 18:39


          Merge branch 'main' into target_encoder_multi

36b31ab


          Merge branch 'main' into target_encoder_multi

10b51d5

thomasjpfan approved these changes

Member

thomasjpfan left a comment

Thank you for the updates!

thomasjpfan added the Waiting for Second Reviewer label

ogrisel reviewed

Member

ogrisel left a comment

Thanks for the PR. This LGTM besides the following suggestions:

sklearn/preprocessing/_target_encoder.py Outdated

+                      In the multiclass case, `X_ordinal` and `X_unknown_mask` have column
+                      (axis=1) size `n_features`, while `encodings` has length of size
+                      `n_features * n_classes`. `feat_idx` deals with this by repeating
+                      feature indicies by `n_classes` E.g., for 3 features, 2 classes:

Member

ogrisel Sep 6, 2023

It seems that this suggestion was not applied.
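As a rough illustration of the indexing idea in the docstring quoted above (and of the earlier remark that X_ordinal does not need to be repeated), the sketch below maps each position in an `encodings`-like list back to its feature column with integer division. The expression `e_idx // n_classes`, the name `class_idx`, and the toy numbers are assumptions made for this example, not a quote of the merged code.

```python
n_features, n_classes = 3, 2

# One encoding per (feature, class) pair, with same features grouped together.
encoding_positions = range(n_features * n_classes)
feat_idx = [e_idx // n_classes for e_idx in encoding_positions]
class_idx = [e_idx % n_classes for e_idx in encoding_positions]

# Toy ordinal-encoded input: 2 samples, 3 features (category codes 0..2).
X_ordinal = [[0, 1, 0], [1, 0, 2]]

# Hypothetical encodings: encodings[e_idx][code] = e_idx * 10 + code,
# chosen only so the lookups are easy to follow.
encodings = [[e_idx * 10 + code for code in range(3)] for e_idx in encoding_positions]

# Look up each output column through feat_idx instead of repeating X_ordinal.
X_out = [
    [encodings[e_idx][row[feat_idx[e_idx]]] for e_idx in encoding_positions]
    for row in X_ordinal
]
```

Here `X_out` has `n_features * n_classes` columns while `X_ordinal` keeps its original `n_features` columns.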


          review

cdbf2ba

Member Author

lucyleeow commented Sep 7, 2023

@ogrisel thank you for the review, changes made.

lucyleeow and others added 9 commits

September 7, 2023 14:30


          Merge branch 'main' into target_encoder_multi

2c2c23a


          black

704b7b3


          Merge branch 'main' into target_encoder_multi

f80dafe


          Fix merge mistake + simplify target parametrization in test

4cb3821


          Assert rather than comment on expected assertion in test.

53f8966


          Remove redundant merge artifact

c18a4c0


          Rename y_int to y_numeric in tests because it can be a flow for regression problems

607cc1d


          More consistently named private method

33a381e


          Use pytest to check for expected warnings when dealing with unique class values

9fb9987

ogrisel approved these changes

Member

ogrisel left a comment

This is looking good. Thanks very much for the PR @lucyleeow! I pushed a few more small improvements / fixes and I will merge if CI is green.

ogrisel enabled auto-merge (squash)

September 7, 2023 09:13

ogrisel removed the Waiting for Second Reviewer label

Member Author

lucyleeow commented Sep 7, 2023

Thanks @ogrisel and @thomasjpfan !

ogrisel merged commit 872c19e into scikit-learn:main

26 checks passed

lucyleeow deleted the target_encoder_multi branch

September 7, 2023 10:26

lucyleeow mentioned this pull request

ENH Add pos_label parameter to TargetEncoder #27342

Open

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request


          ENH Allows multiclass target in TargetEncoder (scikit-learn#26674)

75d0713

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
