Skip to content

Fixing inconsistency in naming features for the StringEncoder#1405

Merged
rcap107 merged 22 commits into
skrub-data:mainfrom
rcap107:string-encoder-feature-names
Jun 6, 2025
Merged

Fixing inconsistency in naming features for the StringEncoder#1405
rcap107 merged 22 commits into
skrub-data:mainfrom
rcap107:string-encoder-feature-names

Conversation

@rcap107

@rcap107 rcap107 commented May 22, 2025

Copy link
Copy Markdown
Member

Features added by the StringEncoder started from 0 up to n_components-1, while other encoders have features in the range 1 to n_components.

This PR addresses the inconsistency.

@rcap107 rcap107 marked this pull request as ready for review May 22, 2025 12:56

@Vincent-Maladiere Vincent-Maladiere left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @rcap107!

Comment thread skrub/_string_encoder.py Outdated
Comment thread skrub/_string_encoder.py Outdated
Comment thread skrub/_string_encoder.py Outdated
Comment thread CHANGES.rst Outdated
rcap107 and others added 2 commits May 23, 2025 17:24
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
Comment thread skrub/_string_encoder.py Outdated
@rcap107

rcap107 commented May 26, 2025

Copy link
Copy Markdown
Member Author

I just realized that the periodic features in the DatetimeEncoder have the same problem (starting from 0)

Should I rename those as well while I'm doing this?

@GaelVaroquaux

GaelVaroquaux commented May 26, 2025 via email

Copy link
Copy Markdown
Member

@Vincent-Maladiere

Copy link
Copy Markdown
Member

I do wonder however if we should start at 0 or 1. Python indexing starts at 0.

In plain English, writing the "dimension 0" feels weird IMO.

@rcap107

rcap107 commented May 27, 2025

Copy link
Copy Markdown
Member Author

Talking with Franck offline, I think we should rename the features in the StringEncoder so they're in 1 to N, because if we're changing the defaults from Gap to StringEncoder then having the features in 0 to N-1 would break even more stuff

@rcap107

rcap107 commented May 28, 2025

Copy link
Copy Markdown
Member Author

ready for review

@Vincent-Maladiere Vincent-Maladiere left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks for adding names and factorizing the utils function!

  • Could you add a test for the get_feature_names_out() method for the TextEncoder?
  • Could you add a default name for the gap encoder as well?

Comment thread skrub/_string_encoder.py Outdated
@Vincent-Maladiere

Copy link
Copy Markdown
Member

skrub meeting conclusion: let's stay consistent with sklearn PCA and start indexing names at 0

@rcap107

rcap107 commented Jun 4, 2025

Copy link
Copy Markdown
Member Author

ready for review

@Vincent-Maladiere Vincent-Maladiere left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rcap107, thanks for moving this PR forward. I have a meta-suggestion:

  • add a get_feature_names_out method to SingleColumnTransformer, and remove it for the TextEncoder, StringEncoder, MinHashEncoder, SplineEncoder, CircularEncoder (transformers like the Gap can overwrite it by defining their own get_feature_names_out)
  • remove the outputs_ attributes and call get_feature_names_out() instead. The utils get_encoder_feature_names should not be used inside individual transformers.

WDYT?

Comment thread skrub/_text_encoder.py Outdated
@rcap107

rcap107 commented Jun 5, 2025

Copy link
Copy Markdown
Member Author

I did the refactoring, now it should be ready

@Vincent-Maladiere Vincent-Maladiere left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment!

Comment thread skrub/_on_each_column.py
Comment thread skrub/_datetime_encoder.py
Comment thread skrub/_datetime_encoder.py
Comment thread skrub/_datetime_encoder.py Outdated
self.all_outputs_ = [
f"{name}_spline_{idx}" for idx in range(self.n_components_)
]
name = sbd.name(X) if sbd.name(X) else ""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to nitpick, but why not have a default name like you suggested below?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both _CircularEncoder and _SplineEncoder are supposed to be used only internally by the DatetimeEncoder, so sbd.name(X) would always be hour/month/weekday etc

I could rename it as "periodic" but in practice it should never be None, which is why I didn't bother adding a check on it until you mentioned it

@Vincent-Maladiere Vincent-Maladiere Jun 6, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I forgot this is not publicly exposed, and we create a name in DatetimeEncoder. You can revert it to:

self.input_name_ = sbd.name(X) + "_spline"

sorry about that :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point, wouldn't it be fine to keep as is, in case we end up exposing those classes? 🤔 maybe add a TODO to remember to set a default value if we do

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave a TODO and revert it, I'd rather not have code that can't be reached

@Vincent-Maladiere Vincent-Maladiere left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (after reverting the sbd.name as said above)

@rcap107 rcap107 force-pushed the string-encoder-feature-names branch from 16c7234 to 9028e3a Compare June 6, 2025 12:14
@rcap107 rcap107 merged commit f070bde into skrub-data:main Jun 6, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants