Skip to content

FIX accept input_features parameter in TableVectorizer.get_feature_names_out#1258

Merged
jeromedockes merged 2 commits intoskrub-data:mainfrom
glemaitre:is/1256
Mar 19, 2025
Merged

FIX accept input_features parameter in TableVectorizer.get_feature_names_out#1258
jeromedockes merged 2 commits intoskrub-data:mainfrom
glemaitre:is/1256

Conversation

@glemaitre
Copy link
Copy Markdown
Member

closes #1256

This PR makes sure to expose an input_features parameter in the get_feature_names_out to be compatible with scikit-learn.

Right now, the parameter is ignored. One question would be: do we want to allow overwriting the column names. In this case, we would need to change more code because we would need to overwrite the name of the columns of the output dataframe in transform to have a consistency between get_feature_names_out() and the feature_names_in_ of the subsequent steps in a scikit-learn pipeline.

Copy link
Copy Markdown
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but could we have a tiny changelog entry. it's useful for users

@glemaitre
Copy link
Copy Markdown
Member Author

LGTM, but could we have a tiny changelog entry. it's useful for users

Indeed :)

@jeromedockes
Copy link
Copy Markdown
Member

Thanks a lot @glemaitre !!

Note we have a bunch of estimators defining get_feature_names_out and none of them accept input_features so it might be worth creating another issue with a TODO list

do we want to allow overwriting the column names. In this case, we would need to change more code because we would need to overwrite the name of the columns of the output dataframe in transform to have a consistency between get_feature_names_out() and the feature_names_in_ of the subsequent steps in a scikit-learn pipeline.

I would say no, in any case a preprocessing check in the tablevectorizer raises an exception if the columns of the input are different in transform than in fit. but you know more about get_feature_names_out and how it is used in scikit-learn

@jeromedockes jeromedockes merged commit 43c6fa3 into skrub-data:main Mar 19, 2025
25 checks passed
@GaelVaroquaux
Copy link
Copy Markdown
Member

Note we have a bunch of estimators defining get_feature_names_out and none of them accept input_features so it might be worth creating another issue with a TODO list

+1

Can we use scikit-learn's common tests to detect this?

@glemaitre
Copy link
Copy Markdown
Member Author

Can we use scikit-learn's common tests to detect this?

We should check if the common test is covering this particular feature but I would hope so. I think that we are still bypassing the common test. I should give it another look in skrub.

rcap107 pushed a commit to rcap107/skrub that referenced this pull request Apr 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TableVectorizer.get_feature_names_out() not compatible with scikit-learn Pipeline

3 participants