Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Idea: add a "drop" argument #670

Closed
GaelVaroquaux opened this issue Jul 20, 2023 · 4 comments · Fixed by #804
Closed

API Idea: add a "drop" argument #670

GaelVaroquaux opened this issue Jul 20, 2023 · 4 comments · Fixed by #804
Labels
discussion Something somewhat open-ended to discuss enhancement New feature or request

Comments

@GaelVaroquaux
Copy link
Member

Dropping a column is a very common need that we might need to facilitate. Here are a few ideas to do this:

  1. We could add a "drop=None" argument to the TableVectorizer. It would be "None" by default and could take either a column name or a list of columns

  2. A more radical point of view would be that all our major transformers should have the drop argument. This way it is easy to do feature

  3. We could have a "Drop" transformer

  4. We could even a more general transformer that changes columns (mostly renames and drop, I suspect). The rename might be very useful to deal with input tables with multi-index as columns: these are definitely going to lead to convoluted code to support. It could be called "ColSelect". This would be using a verb rather than a noun, as opposed to "ColSelector" but I am more and more thinking that verbs are better: the code looks like a phrase. The drawback is that it contrasts with scikit-learn

@GaelVaroquaux GaelVaroquaux added enhancement New feature or request discussion Something somewhat open-ended to discuss labels Jul 20, 2023
@TheooJ
Copy link
Contributor

TheooJ commented Jul 21, 2023

I agree that it is something that would be useful, and to reply to your points :

I like 2. more than 1. because I think it makes sense to have different dropping strategies for different transformers. That being said, if they share a common, widely used drop case it would make sense to have this argument in TableVectorizer too. This way you would set up the dropping strategy for all of them at once

  1. It could possibly account for other strategies than dropna, for instance practitioners often drop 1) very sparse columns, or features that are present only for a small number of ids, 2) correlations (among features, between feature and target), 3) outliers

  2. I like the idea of verbs over nouns, but I would choose nouns to avoid a contrast with scikit-learn

wdyt ?

@Vincent-Maladiere
Copy link
Member

@TheooJ I think Gaël comment is about removing columns based on user-defined lists. You'd need a transformer to perform feature selection.

Maybe we could combine 1. and 4. : having a drop parameter on the TableVectorizer and allowing renaming or simple column manipulation operations in kwargs. In addition, we could also introduce the ColSelector for usage out of TableVectorizer.

Let's assume a slightly different identity from scikit-learn with verbs rather than nouns ;)

@jeromedockes
Copy link
Member

I think I prefer option 3, "add a Drop transformer". The vectorizer has quite a few parameters already and I believe a slightly longer pipeline with simpler steps is easier to understand than a pipeline where some steps do a lot of things. Also, I'm not sure but there could be situations where a user wants control over where the drop happens, eg to drop a lot of columns as soon as possible to save memory, or to use a column for a join and drop it afterwards for prediction

@jeromedockes
Copy link
Member

I mean 3 or 4 -- ie having a separate transformer for select. indeed it can do more than just subset columns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Something somewhat open-ended to discuss enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants