Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Allows multiclass target in TargetEncoder #26674

Merged
merged 65 commits into from Sep 7, 2023

Conversation

lucyleeow
Copy link
Member

@lucyleeow lucyleeow commented Jun 23, 2023

Reference Issues/PRs

closes #26613

What does this implement/fix? Explain your changes.

Allows multiclass target type in TargetEncoder, following section 3.3 of Micci-Barreca et al.
Uses LabelBinarizer to perform on vs rest on y and for each feature and calculates one vs rest target mean for each class, thus expanding number of features to n_features * n_classes.

Any other comments?

First attempt, needs more thought on some aspects.

I am conflicted on the best order of the output features. Currently the order of features is:

feat0_class0, feat1_class0, feat2_class0, feat0_class1 ... (same classes are grouped)

I think grouping features may make more sense:

feat0_class0, feat0_class1, feat0_class2, feat1_class0 ... (same features are grouped)

which should not be too computationally expensive, should just require an additional re-ordering of encodings_ (list of ndarray), which can be done via list comprehension using list of reordering indices.

Any suggestions welcome.
EDIT: have now amended such that same features are grouped together.

TODO:

  • Add custom get_feature_names_out for new features names that include classes
  • Add doc for new class attributes
  • Update target encoder user guide
  • Update and add tests

cc @thomasjpfan

@github-actions
Copy link

github-actions bot commented Jun 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9fb9987. Link to the linter CI: here

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the ordering:

feat0_class0, feat0_class1, feat0_class2, feat1_class0

At this point, it'll good to prioritize writing tests to make sure multiclass gives reasonable results.

sklearn/preprocessing/_target_encoder.py Show resolved Hide resolved
sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
sklearn/preprocessing/tests/test_target_encoder.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
Comment on lines 304 to 307
n_classes = self._label_binarizer_.classes_.shape[0]
X_ordinal, X_valid = [
np.repeat(X, n_classes, axis=1) for X in (X_ordinal, X_valid)
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here regarding not needing to repeat X_ordinal and X_valid.

sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates!

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Aug 9, 2023
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. This LGTM besides the following suggestions:

sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_target_encoder.py Show resolved Hide resolved
sklearn/preprocessing/_target_encoder.py Outdated Show resolved Hide resolved
In the multiclass case, `X_ordinal` and `X_unknown_mask` have column
(axis=1) size `n_features`, while `encodings` has length of size
`n_features * n_classes`. `feat_idx` deals with this by repeating
feature indicies by `n_classes` E.g., for 3 features, 2 classes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this suggestion was not applied.

sklearn/preprocessing/tests/test_target_encoder.py Outdated Show resolved Hide resolved
@lucyleeow
Copy link
Member Author

@ogrisel thank you for the review, changes made.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. Thanks very much for the PR @lucyleeow! I pushed a few more small improvements / fixes and I will merge if CI is green.

@ogrisel ogrisel enabled auto-merge (squash) September 7, 2023 09:13
@ogrisel ogrisel removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Sep 7, 2023
@lucyleeow
Copy link
Member Author

Thanks @ogrisel and @thomasjpfan !

@ogrisel ogrisel merged commit 872c19e into scikit-learn:main Sep 7, 2023
26 checks passed
@lucyleeow lucyleeow deleted the target_encoder_multi branch September 7, 2023 10:26
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Multiclass to Target Encoder
4 participants