ColumnTransformer should be able to use a function to select columns #11190

Closed
jnothman opened this issue Jun 2, 2018 · 21 comments
Labels
Easy Well-defined and straightforward way to resolve Enhancement help wanted

Comments

@jnothman
Member

jnothman commented Jun 2, 2018

The new ColumnTransformer allows the user to specify column names or indices. I think it should be possible to specify a set of columns as a function. Indeed, towards #10603, we should probably have an inbuilt function that distinguishes between categorical and numeric pd.Series dtypes.

In order to support the remainder functionality, I believe the function should have signature: (X,) -> column indices where column indices is any of the supported column specification formats.

Sound good @amueller, @jorisvandenbossche?
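
The `(X,) -> column specification` signature proposed above can be sketched as follows. This assumes a scikit-learn release in which `ColumnTransformer` accepts a callable as the column specifier (which this thread suggests arrived with #11592) and a pandas DataFrame as input. `numeric_columns` and `non_numeric_columns` are hypothetical helpers written for this illustration, not library functions. Note that the released `ColumnTransformer` takes `(name, transformer, columns)` tuples, which differs from the tuple order used in the snippets in this thread.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def numeric_columns(X):
    """Hypothetical selector: names of the columns with a numeric dtype."""
    return X.select_dtypes(include="number").columns.tolist()


def non_numeric_columns(X):
    """Hypothetical selector: names of all columns that are not numeric."""
    return X.select_dtypes(exclude="number").columns.tolist()


X = pd.DataFrame({
    "age": [22.0, 35.0, 58.0],
    "income": [30_000, 52_000, 47_000],
    "city": ["Oslo", "Paris", "Oslo"],
})

preprocess = ColumnTransformer(
    # Each entry is (name, transformer, columns); here columns is a callable
    # that receives X at fit time and returns a list of column names.
    [
        ("num", StandardScaler(), numeric_columns),
        ("cat", OneHotEncoder(), non_numeric_columns),
    ]
)

Xt = preprocess.fit_transform(X)
print(Xt.shape)  # 2 scaled numeric columns + 2 one-hot columns for "city"
```

Because the callables return ordinary column names, the `remainder` handling sees the same kind of specification it would get from a hand-written list.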

@jorisvandenbossche
Member

Yes, that's also something @ogrisel and I were discussing a week ago (it was on my to-do list to open an issue): it would be nice to support functions to, e.g., dynamically determine the columns based on the input dtypes.

In that way, we could for example do

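# numerical_features and categorical_features are proposed helpers, not an existing API
from sklearn.compose import make_column_transformer, numerical_features, categorical_features
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = make_column_transformer(
    (numerical_features, StandardScaler()),
    (categorical_features, OneHotEncoder())
)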

In order to support the remainder functionality, I believe the function should have signature: (X,) -> column indices where column indices is any of the supported column specification formats.

Yes, anything that is otherwise supported is fine (e.g. also a mask, so "column indices" is a slightly confusing name), since the remainder functionality can already handle all those inputs.

In any case, regardless of providing such default functions in sklearn itself, supporting functions is a good idea IMO.

@jorisvandenbossche
Member

@jnothman I think the link in your first issue should be to #10603 instead of #10063 ? (can't edit it)

@jnothman
Member Author

jnothman commented Jun 3, 2018

This is a feature I always expected would happen, but I wanted to get the basic interface right first. Eventually, I might even enjoy something more literate, like:

(ColumnTransformer()
    .append(MyTrans(), dtypes=np.number)
    .append(MyTrans2(), matches=r'^feature_'))

@jnothman added the help wanted, Easy, and good first issue labels and removed the good first issue label on Jun 4, 2018
@partmor
Contributor

partmor commented Jun 5, 2018

Hi @jnothman, @jorisvandenbossche! I'd like to help with this one if it's still needed.

I could start helping with the coding of Joris' interface proposal if you have decided it's a good first approach.

@jnothman
Member Author

jnothman commented Jun 5, 2018

Thanks @partmor, I look forward to your PR. Please start with supporting functions, testing that support, and demonstrating it with a realistic example. Then we can consider simplifying the API for common use cases. @jorisvandenbossche's proposal does not, for instance, account for disparity in the definition of "categorical feature".

@partmor
Contributor

partmor commented Jun 10, 2018

Hi all,

Just to check I have the concepts clear: we would want to define a set of functions, to be passed as callables to make_column_transformer, that encapsulate the column selection logic. Since make_column_transformer already supports boolean column masks, we would be aiming for a (X,) -> boolean column mask interface.

Candidate selectors could be numerical_features and categorical_features, as @jorisvandenbossche pointed out, and feature_matches (which would select columns whose names match a regex, right @jnothman?), for instance.

I think it would be nice to allow combining these selectors intuitively by using operators directly on them, e.g.:

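# numerical_features, categorical_features and feature_matches are proposed helpers
from sklearn.compose import (make_column_transformer, numerical_features,
                             categorical_features, feature_matches)
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = make_column_transformer(
    (numerical_features, StandardScaler()),
    (categorical_features & feature_matches(r'^feature_'), OneHotEncoder())
)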

Tbh, I would have to investigate whether Python callables can be decorated so that using operators on them also returns a callable. Alternatively, we could use instances of a class with overridden __or__ and __and__ operators, for instance.

What do you think?

Regarding pandas support, are we willing to use the categorical dtype, or stay at the dtype == object level?
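
The operator-overloading alternative mentioned above could look roughly like the sketch below. Plain Python functions do not define `&` or `|`, so the selectors here are instances of a small class with `__call__`, `__and__`, `__or__` and `__invert__`. Everything in the block (`Selector`, the selector instances, and the use of a plain dict of column name to values as a stand-in for a DataFrame) is an assumption made for illustration, not an existing scikit-learn API.

```python
import re


class Selector:
    """Hypothetical composable selector.

    Wraps a function that maps a table to a boolean mask, one bool per column.
    """

    def __init__(self, func):
        self.func = func

    def __call__(self, X):
        return list(self.func(X))

    def __and__(self, other):
        # Column is kept only if both selectors keep it.
        return Selector(lambda X: [a and b for a, b in zip(self(X), other(X))])

    def __or__(self, other):
        # Column is kept if either selector keeps it.
        return Selector(lambda X: [a or b for a, b in zip(self(X), other(X))])

    def __invert__(self):
        return Selector(lambda X: [not a for a in self(X)])


def _is_numeric(values):
    # bool is a subclass of int, so exclude it explicitly.
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


# Stand-in table: a dict mapping column name -> list of values.
numerical_features = Selector(lambda X: [_is_numeric(col) for col in X.values()])
categorical_features = ~numerical_features


def feature_matches(pattern):
    """Hypothetical factory: select columns whose name matches a regex."""
    regex = re.compile(pattern)
    return Selector(lambda X: [bool(regex.search(name)) for name in X])


X = {
    "feature_age": [22, 35, 58],
    "feature_city": ["Oslo", "Paris", "Oslo"],
    "id_label": ["a", "b", "c"],
}

mask = (categorical_features & feature_matches(r"^feature_"))(X)
print(mask)  # [False, True, False]
```

With pandas input, the same class could wrap functions that inspect `X.dtypes` and `X.columns` instead; the composition logic would not change.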

@jnothman
Member Author

jnothman commented Jun 10, 2018 via email

@amueller
Member

I think this would be pretty great, but we need to get it right...

@amueller
Member

proposal does not, for instance, account for disparity in the definition of "categorical feature"

can you elaborate @jnothman ?

@amueller
Member

amueller commented Jun 10, 2018

I feel like numerical_features should probably be a factory and return a callable, like numerical_features(include_int=False) or categorical_features(dtypes=[..])?
Hm actually maybe we just want select_types(dtypes=[...]) and select_names(patterns=[..]) as two factories that would probably be useful?

That would make the standard use case a bit more verbose, though :-/

# select_types is the proposed factory, not an existing function
preprocess = make_column_transformer(
    (select_types([float, int]), StandardScaler()),
    (select_types([str, object]), OneHotEncoder())
)

And that wouldn't be very robust wrt different numerical dtypes :-/ (like an unexpected float16 or something)
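
One way to picture the robustness concern is the sketch below, which assumes pandas input and implements a hypothetical `select_types` factory on top of `DataFrame.select_dtypes`. The factory name comes from the comment above; its body and the `include` parameter are assumptions for illustration. Listing concrete dtypes misses an unexpected `float16` column, while the abstract `"number"` selection picks it up.

```python
import numpy as np
import pandas as pd


def select_types(include):
    """Hypothetical factory: return a callable mapping X to a boolean column mask.

    `include` is passed straight to DataFrame.select_dtypes.
    """

    def selector(X):
        selected = X.select_dtypes(include=include).columns
        # One bool per column of X, in column order.
        return X.columns.isin(selected)

    return selector


X = pd.DataFrame({
    "half": np.array([1.5, 2.5], dtype=np.float16),
    "double": np.array([1.0, 2.0], dtype=np.float64),
    "count": np.array([3, 4], dtype=np.int64),
    "city": ["Oslo", "Paris"],
})

exact = select_types([np.float64, np.int64])  # concrete dtypes only
abstract = select_types("number")             # any numeric dtype

print(exact(X).tolist())     # [False, True, True, False]
print(abstract(X).tolist())  # [True, True, True, False]
```

Returning a boolean mask keeps the selector compatible with the column specifications discussed earlier in the thread.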

@amueller
Member

@mfeurer @janvanrijn @joaquinvanschoren relevant to your interests.

@jnothman
Member Author

jnothman commented Jun 10, 2018 via email

@jnothman
Member Author

jnothman commented Jun 10, 2018 via email

@jnothman
Member Author

@partmor, should we be expecting a PR from you?

@partmor
Contributor

partmor commented Jun 16, 2018

Hi @jnothman, apologies for the silence this week. I am having one of those weeks... :(
Yes, it's constantly at the back of my mind. If it's not inconvenient, I'll be able to open a PR with my work before the end of the coming week, to be conservative. Thank you and apologies.

@jnothman
Member Author

jnothman commented Jun 16, 2018 via email

@amueller
Member

I'd love to see this in the release as well, but I probably won't have time to code it before you do, @partmor ;)

@partmor
Contributor

partmor commented Jun 16, 2018

@jnothman @amueller coding this Saturday has gone great so far, so I expect to open a PR very soon, before the weekend ends, probably tonight. The sooner I can start working on your feedback, the better.

Thank you for the encouragement!

@amueller
Member

amueller commented Oct 4, 2018

Should this be closed following #11592 or do we want to keep this open and track actually adding some functions?

@jnothman
Member Author

jnothman commented Oct 5, 2018

I think this should be closed. But we can open an issue for selection function factories... I'm not about to though

@jorisvandenbossche
Member

There is still the open PR that adds some functions: #11301

But in any case, the original point of the issue is solved, so I am fine with closing it.
