API Idea: add a "drop" argument #670

GaelVaroquaux · 2023-07-20T17:14:38Z

Dropping a column is a very common need that we might need to facilitate. Here are a few ideas to do this:

1. We could add a "drop=None" argument to the TableVectorizer. It would be "None" by default and could take either a column name or a list of columns.
2. A more radical point of view would be that all our major transformers should have the drop argument. This way it is easy to do feature
3. We could have a "Drop" transformer.
4. We could even have a more general transformer that changes columns (mostly renames and drops, I suspect). The rename might be very useful to deal with input tables with multi-index as columns: these are definitely going to lead to convoluted code to support. It could be called "ColSelect". This would be using a verb rather than a noun, as opposed to "ColSelector", but I am more and more thinking that verbs are better: the code looks like a phrase. The drawback is that it contrasts with scikit-learn.
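To make the first and third ideas above concrete, here is a minimal sketch of what a standalone "Drop" transformer could look like. The class name `Drop`, its `cols` parameter, and the toy table are assumptions made for illustration, not an existing API. It accepts either a single column name or a list of columns, as suggested for the "drop" argument.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class Drop(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that removes user-listed columns."""

    def __init__(self, cols=None):
        # None means "drop nothing"; a str or a list of names is accepted.
        self.cols = cols

    def fit(self, X, y=None):
        # Normalize the argument to a list of column names.
        if self.cols is None:
            cols = []
        elif isinstance(self.cols, str):
            cols = [self.cols]
        else:
            cols = list(self.cols)
        # Fail early if a requested column is absent from the table.
        missing = [c for c in cols if c not in X.columns]
        if missing:
            raise ValueError(f"Columns not found: {missing}")
        self.cols_ = cols
        return self

    def transform(self, X):
        # Return a new dataframe without the listed columns.
        return X.drop(columns=self.cols_)


# Toy table with made-up column names.
df = pd.DataFrame({"id": [1, 2], "city": ["Paris", "Rome"], "price": [3.5, 4.0]})
single = Drop("id").fit_transform(df)
several = Drop(["id", "city"]).fit_transform(df)
untouched = Drop().fit_transform(df)
```

Keeping this as a separate step means it can sit anywhere in a pipeline, whereas a "drop" argument on a transformer ties the drop to that transformer.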

TheooJ · 2023-07-21T15:28:39Z

I agree that it is something that would be useful. To reply to your points:

I like 2. more than 1. because I think it makes sense to have different dropping strategies for different transformers. That being said, if they share a common, widely used drop case it would make sense to have this argument in TableVectorizer too. This way you would set up the dropping strategy for all of them at once

It could possibly account for other strategies than dropna, for instance practitioners often drop 1) very sparse columns, or features that are present only for a small number of ids, 2) correlations (among features, between feature and target), 3) outliers
I like the idea of verbs over nouns, but I would choose nouns to avoid a contrast with scikit-learn
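As an illustration of one of the data-driven strategies mentioned above (dropping very sparse columns), here is a rough sketch using pandas only. The helper name `drop_sparse_columns`, the `max_missing` threshold, and the toy table are hypothetical; this is feature selection based on the data, not dropping from a user-defined list.

```python
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Keep only columns whose fraction of missing values is <= max_missing."""
    # Fraction of missing entries per column.
    missing_fraction = df.isna().mean()
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df[keep]


# Toy table: "b" is 75% missing, "c" is 25% missing, "a" is complete.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [None, None, None, 1.0],
        "c": ["x", None, "y", "z"],
    }
)
dense = drop_sparse_columns(df, max_missing=0.5)
```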

wdyt?

Vincent-Maladiere · 2023-08-01T09:52:55Z

@TheooJ I think Gaël's comment is about removing columns based on user-defined lists. You'd need a transformer to perform feature selection.

Maybe we could combine 1. and 4. : having a drop parameter on the TableVectorizer and allowing renaming or simple column manipulation operations in kwargs. In addition, we could also introduce the ColSelector for usage out of TableVectorizer.

Let's assume a slightly different identity from scikit-learn with verbs rather than nouns ;)

jeromedockes · 2023-09-08T08:09:13Z

I think I prefer option 3, "add a Drop transformer". The vectorizer has quite a few parameters already, and I believe a slightly longer pipeline with simpler steps is easier to understand than a pipeline where some steps do a lot of things. Also, I'm not sure, but there could be situations where a user wants control over where the drop happens, e.g. to drop a lot of columns as soon as possible to save memory, or to use a column for a join and drop it afterwards for prediction.
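A small sketch of the "use a column for a join, then drop it" case, with the drop as its own pipeline step so the user decides where it happens. The tables, column names, and helper functions are made up for illustration; scikit-learn's `FunctionTransformer` stands in here for a dedicated drop transformer.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy tables; the column names are assumptions.
orders = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [10.0, 25.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["FR", "DE"]})


def join_customers(df):
    # "customer_id" is needed here as the join key.
    return df.merge(customers, on="customer_id", how="left")


def drop_join_key(df):
    # Once the join is done, the key is dropped before prediction.
    return df.drop(columns=["customer_id"])


pipeline = make_pipeline(
    FunctionTransformer(join_customers),
    FunctionTransformer(drop_join_key),
)
features = pipeline.fit_transform(orders)
```

Moving the drop step before or after other steps is then a one-line change in the pipeline definition.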

jeromedockes · 2023-09-08T08:12:34Z

I mean 3 or 4, i.e. having a separate transformer for select. Indeed, it can do more than just subset columns.

GaelVaroquaux added the "enhancement" (New feature or request) and "discussion" (Something somewhat open-ended to discuss) labels Jul 20, 2023

jeromedockes mentioned this issue Oct 23, 2023

[MRG] Add SelectCols(cols) and DropCols(cols) transformers #804

Merged

GaelVaroquaux closed this as completed in #804 Oct 27, 2023
