-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API Idea: add a "drop" argument #670
Comments
I agree that it is something that would be useful, and to reply to your points : I like 2. more than 1. because I think it makes sense to have different dropping strategies for different transformers. That being said, if they share a common, widely used drop case it would make sense to have this argument in TableVectorizer too. This way you would set up the dropping strategy for all of them at once
wdyt ? |
@TheooJ I think Gaël comment is about removing columns based on user-defined lists. You'd need a transformer to perform feature selection. Maybe we could combine 1. and 4. : having a Let's assume a slightly different identity from scikit-learn with verbs rather than nouns ;) |
I think I prefer option 3, "add a Drop transformer". The vectorizer has quite a few parameters already and I believe a slightly longer pipeline with simpler steps is easier to understand than a pipeline where some steps do a lot of things. Also, I'm not sure but there could be situations where a user wants control over where the drop happens, eg to drop a lot of columns as soon as possible to save memory, or to use a column for a join and drop it afterwards for prediction |
I mean 3 or 4 -- ie having a separate transformer for select. indeed it can do more than just subset columns |
Dropping a column is a very common need that we might need to facilitate. Here are a few ideas to do this:
We could add a "drop=None" argument to the TableVectorizer. It would be "None" by default and could take either a column name or a list of columns
A more radical point of view would be that all our major transformers should have the drop argument. This way it is easy to do feature
We could have a "Drop" transformer
We could even a more general transformer that changes columns (mostly renames and drop, I suspect). The rename might be very useful to deal with input tables with multi-index as columns: these are definitely going to lead to convoluted code to support. It could be called "ColSelect". This would be using a verb rather than a noun, as opposed to "ColSelector" but I am more and more thinking that verbs are better: the code looks like a phrase. The drawback is that it contrasts with scikit-learn
The text was updated successfully, but these errors were encountered: