ColumnTransformer should be able to use a function to select columns #11190

jnothman · 2018-06-02T23:15:08Z

The new ColumnTransformer allows the user to specify column names or indices. I think it should be possible to specify a set of columns as a function. Indeed, towards #10603, we should probably have an inbuilt function that distinguishes between categorical and numeric pd.Series dtypes.

In order to support the remainder functionality, I believe the function should have signature: (X,) -> column indices where column indices is any of the supported column specification formats.
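To make the link between the proposed signature and remainder concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: `resolve_columns`, `remainder_columns` and `numeric_columns` are hypothetical names, and a plain dict of lists stands in for a DataFrame. The point is only that once a callable is resolved to an ordinary column specification, the leftover columns can be computed as usual.

```python
def resolve_columns(spec, X):
    """Turn a column spec into sorted integer positions.

    X is a dict mapping column name -> list of values (a stand-in for a
    DataFrame). spec may be a callable, a boolean mask, a list of names,
    or a list of integer positions.
    """
    names = list(X)
    if callable(spec):
        spec = spec(X)  # the proposed (X,) -> column spec hook
    spec = list(spec)
    # bool is a subclass of int, so masks must be checked before positions
    if spec and all(isinstance(s, bool) for s in spec):
        return [i for i, keep in enumerate(spec) if keep]
    if all(isinstance(s, str) for s in spec):
        return sorted(names.index(s) for s in spec)
    return sorted(int(s) for s in spec)


def remainder_columns(specs, X):
    """Positions of the columns that no spec selected."""
    used = set()
    for spec in specs:
        used.update(resolve_columns(spec, X))
    return [i for i in range(len(X)) if i not in used]


def numeric_columns(X):
    """Example selector: names of columns holding only ints or floats."""
    return [
        name
        for name, values in X.items()
        if all(isinstance(v, (int, float)) for v in values)
    ]


X = {"age": [23.0, 35.0], "city": ["Paris", "Lyon"], "income": [30.0, 52.0]}
selected = resolve_columns(numeric_columns, X)
leftover = remainder_columns([numeric_columns], X)
```

Because the selector only runs once the data is available, the caller never has to know column names or positions in advance.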

Sound good @amueller, @jorisvandenbossche?


jorisvandenbossche · 2018-06-03T12:09:33Z

Yes, that's also something @ogrisel and I were discussing a week ago (it was on my to-do list to open an issue): it would be nice to support functions that, e.g., dynamically determine the columns based on the input dtypes.

In that way, we could for example do

from sklearn.compose import make_column_transformer, numerical_features, categorical_features
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocess = make_column_transformer(
    (numerical_features, StandardScaler()),
    (categorical_features, OneHotEncoder())
)

> In order to support the remainder functionality, I believe the function should have signature: (X,) -> column indices where column indices is any of the supported column specification formats.

Yes, anything that is otherwise supported is fine (e.g. also a mask, so "column indices" is a slightly confusing name), since the remainder functionality can already handle all those inputs.

In any case, regardless of providing such default functions in sklearn itself, supporting functions is a good idea IMO.
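As a rough illustration of what user-written selectors could look like, the sketch below assumes a scikit-learn version in which ColumnTransformer accepts callables as column specifications, plus pandas input. Here `numerical_features` and `categorical_features` are ordinary user-defined functions, not imports from sklearn.compose, and "everything non-numeric" is just one possible definition of categorical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def numerical_features(X):
    """Names of the numeric columns (user-defined, not part of sklearn)."""
    return list(X.select_dtypes(include="number").columns)


def categorical_features(X):
    """Names of every non-numeric column (one possible definition)."""
    return list(X.select_dtypes(exclude="number").columns)


X = pd.DataFrame({
    "age": [23.0, 35.0, 51.0],
    "income": [30.0, 52.0, 75.0],
    "city": pd.Series(["Paris", "Lyon", "Paris"], dtype=object),
})

# Each entry is (name, transformer, columns); the callables are evaluated
# on X when the transformer is fitted.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("cat", OneHotEncoder(), categorical_features),
])
Xt = preprocess.fit_transform(X)
```

The explicit `(name, transformer, columns)` form is used instead of make_column_transformer so the example does not depend on the tuple order that helper expects.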

jorisvandenbossche · 2018-06-03T12:10:22Z

@jnothman I think the link in your first issue should be to #10603 instead of #10063 ? (can't edit it)

jnothman · 2018-06-03T12:51:20Z

This is a feature I always expected would happen, but I wanted to get the basic interface right first. Eventually, I might even enjoy something more literate, like:

(
    ColumnTransformer()
    .append(MyTrans(), dtypes=np.number)
    .append(MyTrans2(), matches=r'^feature_')
)

partmor · 2018-06-05T09:43:26Z

Hi @jnothman, @jorisvandenbossche! I'd like to help in this one if still needed.

I could start helping with the coding of Joris' interface proposal if you have decided it's a good first approach.

jnothman · 2018-06-05T23:44:53Z

Thanks @partmor, I look forward to your PR. Please start with supporting functions, testing that, and demonstrating it with a realistic example. Then we can consider simplifying the API for common use cases. @jorisvandenbossche's proposal does not, for instance, account for disparity in the definition of "categorical feature".

partmor · 2018-06-10T11:49:01Z

Hi all,

Just to check I have the concepts clear: we would define a set of functions, passed as callables to make_column_transformer, that encapsulate the column selection logic. Since make_column_transformer already supports boolean column masks, we would be aiming for a (X,) -> boolean column mask interface.

Candidate selectors could be numerical_features and categorical_features, as @jorisvandenbossche pointed out, and, for instance, feature_matches (which would select columns whose names match a regex, right @jnothman?).

I think it would be nice to be able to combine these selectors intuitively by applying operators directly to them, e.g.:

from sklearn.compose import (make_column_transformer, numerical_features,
                             categorical_features, feature_matches)
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocess = make_column_transformer(
    (numerical_features, StandardScaler()),
    (categorical_features & feature_matches(r'^feature_'), OneHotEncoder())
)

Tbh, I would have to investigate whether Python callables can be decorated so that applying operators to them also returns a callable. Alternatively, we could use instances of a class with overridden __or__ and __and__ operators.

What do you think?

Regarding pandas support, are we willing to use the categorical dtype, or stay at the dtype == object level?
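For reference, plain Python functions do not implement `&` or `|`, so the operator syntax needs a wrapper object. A minimal sketch of the class-based option mentioned above follows. `Selector` is a hypothetical name, the selectors return boolean masks, and a dict of lists stands in for a DataFrame.

```python
import re


class Selector:
    """Hypothetical wrapper: a callable X -> boolean mask supporting & and |."""

    def __init__(self, func):
        self.func = func

    def __call__(self, X):
        return [bool(flag) for flag in self.func(X)]

    def __and__(self, other):
        # Combining two selectors yields another Selector, so it stays callable
        return Selector(lambda X: [a and b for a, b in zip(self(X), other(X))])

    def __or__(self, other):
        return Selector(lambda X: [a or b for a, b in zip(self(X), other(X))])


# Selects columns whose values are all ints or floats
numerical_features = Selector(
    lambda X: [
        all(isinstance(v, (int, float)) for v in values)
        for values in X.values()
    ]
)


def feature_matches(pattern):
    """Factory: selects columns whose name matches the regex."""
    return Selector(lambda X: [re.search(pattern, name) is not None for name in X])


X = {"feature_a": [1, 2], "feature_b": ["x", "y"], "other": [3.0, 4.0]}
both = (numerical_features & feature_matches(r"^feature_"))(X)
either = (numerical_features | feature_matches(r"^feature_"))(X)
```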

jnothman · 2018-06-10T13:22:35Z

I think you should be very careful to avoid any operator overloading. Scikit-learn tends to be quite conservative in this area, and tries to avoid "magic" in its APIs. Don't worry about defining categorical_features, etc. Worry about ensuring that there's an interface for users to write custom selectors that don't require them to know the columns by name or index beforehand, which the current interface does. Later we can worry about making a library of such selectors and how that would look.
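Selectors can still be combined without any operator overloading by using ordinary higher-order functions. The sketch below is one way to do that. `all_of`, `any_of`, `is_numeric` and `name_matches` are hypothetical names, and a dict of lists again stands in for the input data.

```python
import re


def is_numeric(X):
    """Custom selector: boolean mask of columns holding only ints or floats."""
    return [
        all(isinstance(v, (int, float)) for v in values)
        for values in X.values()
    ]


def name_matches(pattern):
    """Factory returning a selector for columns whose name matches a regex."""
    regex = re.compile(pattern)

    def selector(X):
        return [regex.search(name) is not None for name in X]

    return selector


def all_of(*selectors):
    """Selector that keeps a column only if every given selector keeps it."""
    def combined(X):
        masks = [selector(X) for selector in selectors]
        return [all(flags) for flags in zip(*masks)]

    return combined


def any_of(*selectors):
    """Selector that keeps a column if at least one given selector keeps it."""
    def combined(X):
        masks = [selector(X) for selector in selectors]
        return [any(flags) for flags in zip(*masks)]

    return combined


X = {"feature_a": [1, 2], "feature_b": ["x", "y"], "other": [3.0, 4.0]}
numeric_features_only = all_of(is_numeric, name_matches(r"^feature_"))(X)
numeric_or_feature = any_of(is_numeric, name_matches(r"^feature_"))(X)
```

Everything here is a plain function of X, so nothing beyond "a callable is accepted" has to be agreed on in the API.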

amueller · 2018-06-10T18:35:45Z

I think this would be pretty great, but we need to get it right...

amueller · 2018-06-10T19:06:19Z

> proposal does not, for instance, account for disparity in the definition of "categorical feature"

can you elaborate @jnothman ?

amueller · 2018-06-10T19:08:50Z

I feel like numerical_features should probably be a factory and return a callable, like numerical_features(include_int=False) or categorical_features(dtypes=[..])?
Hm actually maybe we just want select_types(dtypes=[...]) and select_names(patterns=[..]) as two factories that would probably be useful?

That would make the standard use case a bit more verbose, though :-/

preprocess = make_column_transformer(
    (select_types([float, int]), StandardScaler()),
    (select_types([str, object]), OneHotEncoder())
)

And that wouldn't be very robust with respect to different numerical dtypes :-/ (like an unexpected float16 or something)
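A rough sketch of the two factories, assuming pandas input: `select_types` and `select_names` are the names floated above, not existing scikit-learn functions, and the bodies are only one possible implementation. Passing an abstract type such as `np.number` to `DataFrame.select_dtypes` is one way to avoid missing an unexpected float16 column, whereas listing concrete dtypes does miss it.

```python
import re

import numpy as np
import pandas as pd


def select_types(dtypes):
    """Factory: returns a selector giving the names of columns with these dtypes."""
    def selector(X):
        return list(X.select_dtypes(include=dtypes).columns)

    return selector


def select_names(patterns):
    """Factory: returns a selector for columns whose name matches any pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]

    def selector(X):
        return [
            name
            for name in X.columns
            if any(regex.search(str(name)) for regex in compiled)
        ]

    return selector


X = pd.DataFrame({
    "feature_a": np.array([1.0, 2.0], dtype="float16"),
    "feature_b": [1, 2],
    "label": ["x", "y"],
})

all_numbers = select_types([np.number])(X)   # abstract type: catches float16
only_float64 = select_types(["float64"])(X)  # concrete dtype: misses float16
by_name = select_names([r"^feature_"])(X)
```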

amueller · 2018-06-10T19:14:11Z

@mfeurer @janvanrijn @joaquinvanschoren relevant to your interests.

jnothman · 2018-06-10T22:35:58Z

Why not then combine type_selector and name_selector, and work on designing a nice unified interface? Still, I think the key here is to support a callable at all, and then work on the API.

jnothman · 2018-06-10T22:37:46Z

By "disparity in definition of categorical feature" ("disparity" was probably a bad word choice) I meant that sometimes we mean pd.Categorical, sometimes objects, sometimes numpy string-likes, sometimes integers...
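The competing definitions can be made concrete with a small pandas sketch. The three selector names are hypothetical, and the cardinality threshold for integer columns is an arbitrary assumption. On the same frame, each definition picks out a different column.

```python
import pandas as pd


def categorical_by_dtype(X):
    """Columns stored with the pandas categorical dtype."""
    return [c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)]


def categorical_by_object(X):
    """Columns stored with the generic object dtype (typically strings)."""
    return [c for c in X.columns if X[c].dtype == object]


def categorical_by_small_int(X, max_levels=5):
    """Integer columns with few distinct values (threshold is an assumption)."""
    return [
        c
        for c in X.columns
        if pd.api.types.is_integer_dtype(X[c].dtype)
        and X[c].nunique() <= max_levels
    ]


X = pd.DataFrame({
    "colour": pd.Categorical(["red", "green", "red"]),
    "city": pd.Series(["Paris", "Lyon", "Paris"], dtype=object),
    "n_rooms": [1, 2, 1],
    "price": [100.0, 250.5, 99.9],
})

by_dtype = categorical_by_dtype(X)
by_object = categorical_by_object(X)
by_small_int = categorical_by_small_int(X)
```

A single built-in `categorical_features` would have to commit to one of these, which is why supporting arbitrary callables first leaves the choice to the user.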

jnothman · 2018-06-16T12:32:07Z

@partmor, should we be expecting a PR from you?

partmor · 2018-06-16T13:42:55Z

Hi @jnothman apologies for the silence this week. I am having one of those weeks... :(
Yes, it's constantly at the back of my mind. If it's not inconvenient, I'll be able to open a PR with my work before the end of this coming week, to be conservative. Thank you and apologies.

jnothman · 2018-06-16T13:57:05Z

No worries :) I'm just keen to get this into the next release if we can, but that's a few weeks away, it seems.

amueller · 2018-06-16T13:58:21Z

I'd love to see this in the release as well, but I probably won't have time to code it before you do, @partmor ;)

partmor · 2018-06-16T16:10:01Z

@jnothman @amueller coding this Saturday has gone great so far, so I expect to open a PR very soon, before the weekend ends, probably tonight. The sooner I can start working on your feedback, the better.

Thank you for the encouragement!

amueller · 2018-10-04T20:49:21Z

Should this be closed following #11592 or do we want to keep this open and track actually adding some functions?

jnothman · 2018-10-05T06:33:37Z

I think this should be closed. But we can open an issue for selection function factories... I'm not about to though

jorisvandenbossche · 2018-10-05T09:54:25Z

There is still the open PR that adds some functions: #11301

But in any case, the original point of the issue is solved, so I am fine with closing it.

jnothman added the Enhancement label Jun 2, 2018

jnothman added the help wanted, Easy, and good first issue labels and removed the good first issue label Jun 4, 2018

jorisvandenbossche mentioned this issue Jun 13, 2018

[MRG] DOC add mixed categorical / continuous example with ColumnTransformer #11197

Merged

janvanrijn mentioned this issue Jun 14, 2018

Setup of Scikit-learn Experiments openml/Study-14#4

Open

partmor mentioned this issue Jun 16, 2018

[WIP] Column selector functions for ColumnTransformer #11301

Closed

amueller added this to PR phase in Andy's pets Jun 29, 2018

amueller closed this as completed Oct 5, 2018

amueller mentioned this issue Oct 5, 2018

Provide factory functions for selecting columns in ColumnTransformer #12303

Closed

amueller removed this from PR phase in Andy's pets Dec 10, 2018
