Fixing inconsistency in naming features for the StringEncoder #1405
Vincent-Maladiere
left a comment
Thank you for this PR @rcap107!
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
I just realized that the periodic features in the DatetimeEncoder have the same problem (starting from 0). Should I rename those as well while I'm doing this?
I just realized that the periodic features in the DatetimeEncoder have the same problem (starting from 0)
Should I rename those as well while I'm doing this?
Consistency is a good thing.
I do wonder however if we should start at 0 or 1. Python indexing starts at 0.
In plain English, writing "dimension 0" feels weird IMO.
After talking with Franck offline, I think we should rename the features in the StringEncoder so they run from 1 to N: if we change the default from the GapEncoder to the StringEncoder, having the features numbered 0 to N-1 would break even more stuff.
ready for review
Vincent-Maladiere
left a comment
Very nice, thanks for adding names and factorizing the utils function!
- Could you add a test for the get_feature_names_out() method for the TextEncoder?
- Could you add a default name for the gap encoder as well?
skrub meeting conclusion: let's stay consistent with sklearn PCA and start indexing names at 0
ready for review
Vincent-Maladiere
left a comment
Hey @rcap107, thanks for moving this PR forward. I have a meta-suggestion:
- add a get_feature_names_out method to SingleColumnTransformer, and remove it for the TextEncoder, StringEncoder, MinHashEncoder, SplineEncoder, CircularEncoder (transformers like the Gap can overwrite it by defining their own get_feature_names_out)
- remove the outputs_ attributes and call get_feature_names_out() instead. The utils get_encoder_feature_names should not be used inside individual transformers.
WDYT?
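A minimal sketch of the suggestion above, using hypothetical stand-in classes rather than skrub's actual implementation: the base class provides a default get_feature_names_out built from attributes set during fit, and a subclass overrides it only when it needs different names. The class bodies, the attribute names input_name_ and n_components_ as used here, and the "_topic_" suffix are assumptions made for illustration.

```python
class SingleColumnTransformerSketch:
    """Hypothetical stand-in for a shared base class; not skrub's real code."""

    def fit(self, column_name, n_components):
        # Assumed attributes: a name prefix and the number of output columns.
        self.input_name_ = column_name
        self.n_components_ = n_components
        return self

    def get_feature_names_out(self):
        # Default naming shared by every subclass, indexed from 0.
        return [f"{self.input_name_}_{idx}" for idx in range(self.n_components_)]


class StringEncoderSketch(SingleColumnTransformerSketch):
    # Inherits the default names; no per-class implementation is needed.
    pass


class GapEncoderSketch(SingleColumnTransformerSketch):
    def get_feature_names_out(self):
        # Overrides the default with its own (made-up) naming scheme.
        return [
            f"{self.input_name_}_topic_{idx}" for idx in range(self.n_components_)
        ]


default_names = StringEncoderSketch().fit("city", 3).get_feature_names_out()
custom_names = GapEncoderSketch().fit("city", 2).get_feature_names_out()
print(default_names)
print(custom_names)
```

With this layout the naming convention lives in one place, so a change such as switching the starting index only touches the base class.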
I did the refactoring, now it should be ready
self.all_outputs_ = [
    f"{name}_spline_{idx}" for idx in range(self.n_components_)
]
name = sbd.name(X) if sbd.name(X) else ""
Sorry to nitpick, but why not have a default name like you suggested below?
both _CircularEncoder and _SplineEncoder are supposed to be used only internally by the DatetimeEncoder, so sbd.name(X) would always be hour/month/weekday etc
I could rename it as "periodic" but in practice it should never be None, which is why I didn't bother adding a check on it until you mentioned it
You're right, I forgot this is not publicly exposed, and we create a name in DatetimeEncoder. You can revert it to:
self.input_name_ = sbd.name(X) + "_spline"
sorry about that :)
at this point, wouldn't it be fine to keep as is, in case we end up exposing those classes? 🤔 maybe add a TODO to remember to set a default value if we do
You can leave a TODO and revert it, I'd rather not have code that can't be reached
Vincent-Maladiere
left a comment
LGTM! (after reverting the sbd.name as said above)
16c7234 to 9028e3a
Features added by the StringEncoder started from 0 up to n_components-1, while other encoders have features in the range 1 to n_components.
This PR addresses the inconsistency.
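To make the inconsistency concrete, here is a small sketch built around a hypothetical helper, make_feature_names (not part of skrub), whose start parameter is the only difference between the two conventions. The prefix "col" and the underscore separator are assumptions for illustration.

```python
def make_feature_names(prefix, n_components, start=0):
    """Hypothetical helper: names run from prefix_<start> to prefix_<start + n - 1>."""
    return [f"{prefix}_{idx}" for idx in range(start, start + n_components)]


zero_based = make_feature_names("col", 3, start=0)
one_based = make_feature_names("col", 3, start=1)
print(zero_based)  # ['col_0', 'col_1', 'col_2']
print(one_based)   # ['col_1', 'col_2', 'col_3']
```

Code that selects columns by name (for example "col_3") works under one convention and fails under the other, which is why mixing the two across encoders is a problem.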