Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add SelectCols(cols) and DropCols(cols) transformers #804

Merged
merged 15 commits into from
Oct 27, 2023

Conversation

jeromedockes
Copy link
Contributor

closes #670

a decision on the API has actually not been made in the linked issue, this is to help make the discussion more concrete

@jeromedockes jeromedockes marked this pull request as draft October 23, 2023 13:13
Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at this PR, and I'm quite convinced that this is the right way to go. Great job @jeromedockes !

Can you add the two objects in the API documentation?

And maybe add a tiny section on https://skrub-data.org/stable/assembling.html, between "joining" and "going further". We should stress here that the benefit is to be able to build pipelines.

examples/04_fuzzy_joining.py Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
skrub/_select_cols.py Show resolved Hide resolved
@jeromedockes
Copy link
Contributor Author

My original view was to rather have only a single object that merged both SelectCols and DropCols.

That's definitely an option, too. A few small things added up and made me lean towards the version with 2 classes:

  • They only have 1 parameter each, so it is natural to pass it as positional argument. Instantiations are a bit less verbose and users don't need to remember the parameter name.

  • The classes are easy to understand almost without reading the documentation because they do 1 simple thing and hold 1 list of column names.

  • "I can drop columns with DropCols" feels easier to discover than "I can drop columns with the 'drop' parameter of SelectCols" (or the other way around "select with SelectCols" vs "select with 'keep' parameter of DropCols")

  • If both drop and select were in one class, we would need a default argument for the "select" parameter to mean "select all" (otherwise using the "drop" parameter would be hard). I didn't see an option I found 100% satisfying for that default value: "None" means the opposite of what we want (ie "all"), a string like "*" would be ok but still it's a bit weird as column names are strings as well, sentinel values like a singleton or enum with a nice repr are too "magic" and we don't know how to import them if we want to pass the default explicitly, ellipsis seems arbitrary, we probably don't want to support slices and they're rarely used outside "[]", etc.

  • We would need to decide and document what happens when both parameters are used, and when a column name ends up in both "drop" and "select". It will almost never happen in practice because it doesn't make much sense, but it still needs to be handled so it adds uninteresting cruft both in the code and the docstring. In general having 2 parameters such that it only makes sense to use 1 at a time sometimes results in annoying docstrings and parameter validation code IMHO.

  • It's likely that we will add other simple manipulations of columns, like renaming. With the current option we could add another simple transformer "RenameCols". If we stuff all operations in one class we would add one more parameter and need to think of its interactions with the other parameters (does the selection happen before or after the renaming?, ...) and guess what users will find surprising or not. So it might become cumbersome to use, describe & maintain.

Of course that's only a first impression and we'll have to see how it plays out in practice and in the examples. They're simple things so we can prototype several approaches and see what we prefer.

The issue also suggested adding a "drop" parameter to most transformers. I thought it would be more work (especially if we want to add "rename", "select" parameters), add parameters to already complex classes with other interactions to think about (do we drop before or after vectorizing?), duplicate some of the work (or at least parameter documentation) across the existing transformers.

OTOH, it would result in shorter pipelines and fewer classes for users to discover, so it is still certainly worth discussing.

@jeromedockes
Copy link
Contributor Author

also pandas and polars both have separate select / getitem and drop

@jeromedockes
Copy link
Contributor Author

Can you add the two objects in the API documentation?

And maybe add a tiny section on https://skrub-data.org/stable/assembling.html, between "joining" and "going further". We should stress here that the benefit is to be able to build pipelines.

I added the api entries and the paragraph in the user guide. I put the api section after "joining" to keep the order similar to the user guide

@jeromedockes jeromedockes marked this pull request as ready for review October 24, 2023 14:56
@jeromedockes
Copy link
Contributor Author

do we want to allow passing a string instead of a list of (1) string if we want to select or drop a single column?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Oct 24, 2023 via email

@jeromedockes jeromedockes changed the title [WIP] add SelectCols(cols) and DropCols(cols) transformers Add SelectCols(cols) and DropCols(cols) transformers Oct 25, 2023
@jeromedockes
Copy link
Contributor Author

ok it is ready to review

@jeromedockes jeromedockes changed the title Add SelectCols(cols) and DropCols(cols) transformers [MRG] Add SelectCols(cols) and DropCols(cols) transformers Oct 25, 2023
@jeromedockes
Copy link
Contributor Author

Also, I named it SelectCols but maybe to be consistent with other transformers it should be ColsSelector and ColsDropper 😐
we still don't have a decision on nouns vs verbs in #751

@GaelVaroquaux
Copy link
Member

@jeromedockes
Copy link
Contributor Author

jeromedockes commented Oct 25, 2023 via email

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

I am not sure if we have polars support. Is it easy to add? In all cases, we should be explicit

doc/assembling.rst Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
skrub/_select_cols.py Show resolved Hide resolved
skrub/_select_cols.py Show resolved Hide resolved
skrub/_select_cols.py Show resolved Hide resolved
@jeromedockes
Copy link
Contributor Author

I am not sure if we have polars support

yes you can see in the tests

@GaelVaroquaux
Copy link
Member

I am not sure if we have polars support

yes you can see in the tests

Awesome!!

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One tiny comment, but to me, this is good to go.

examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
Copy link
Member

@jovan-stojanovic jovan-stojanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice and useful features, thanks @jeromedockes!

@GaelVaroquaux
Copy link
Member

OK, merging then. Thanks @jeromedockes !

@GaelVaroquaux GaelVaroquaux merged commit aa69f4d into skrub-data:main Oct 27, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API Idea: add a "drop" argument
3 participants