ColumnTransformer should be able to use a function to select columns #11190
Comments
Yes, that's also something @ogrisel and I were discussing a week ago (was on my to do list to open an issue), that it would be nice to support functions to eg dynamically determine the columns based on the input dtypes. In that way, we could for example do
Yes, anything that is otherwise supported, is fine (eg also a mask, so "column indices" is a bit confusing name), since the In any case, regardless of providing such default functions in sklearn itself, supporting functions is a good idea IMO. |
This is a feature I always expected would happen, but wanted to get the basic interface right. (Eventually, I might even enjoy something more literate like
ColumnTransformer()
    .append(MyTrans(), dtypes=np.numeric)
    .append(MyTrans2(), matches=r'^feature_')
) |
Hi @jnothman, @jorisvandenbossche! I'd like to help in this one if still needed. I could start helping with the coding of Joris' interface proposal if you have decided it's a good first approach. |
Thanks @partmor, I look forward to your PR. Please start with supporting functions, testing it, and demonstrating with a realistic example. Then we can consider simplifying the API for common use cases. @jorisvandenbossche's proposal does not, for instance, account for disparity in the definition of "categorical feature" |
Hi all, Just to check I have the concepts clear: so we would want to define a set of methods, to be passed as callables to Candidates to selectors could be I think it would be nice to allow combining these selectors intuitively using operators directly on them, e.g.:
from sklearn.compose import (make_column_transformer, numerical_features,
                             categorical_features, feature_matches)
preprocess = make_column_transformer(
    (numerical_features, StandardScaler()),
    (categorical_features & feature_matches(r'^feature_'), OneHotEncoder())
)
Tbh I would have to investigate if in Python callables can be decorated so that using operators on them also returns a callable. Alternatively we could use instances of a class with overridden What do you think? Regarding pandas support, are we willing to use the categorical dtype, or stay at the |
I think you should be very careful to avoid any operator overloading.
Scikit-learn tends to be quite conservative in this area, and tries to
avoid "magic" in its APIs. Don't worry about defining categorical_features,
etc. Worry about ensuring that there's an interface for users to write
custom selectors that don't require them to know the columns by name or
index beforehand, which the current interface does. Later we can worry
about making a library of such selectors and how that would look.
|
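As a rough illustration of the point above, selectors can be combined with an ordinary helper function rather than overloaded operators. This is only a sketch: `feature_matches`, `non_numeric_features` and `select_all_of` are hypothetical names made up for this example, not scikit-learn functions, and the input is assumed to be a pandas DataFrame.

```python
import re

import pandas as pd


def feature_matches(pattern):
    """Hypothetical factory: selector for column names matching a regex."""
    regex = re.compile(pattern)

    def selector(X):
        return [col for col in X.columns if regex.search(col)]

    return selector


def non_numeric_features(X):
    """Hypothetical selector: columns whose dtype is not numeric."""
    return list(X.select_dtypes(exclude="number").columns)


def select_all_of(*selectors):
    """Combine selectors: keep the columns chosen by every one of them."""

    def selector(X):
        chosen = [set(s(X)) for s in selectors]
        # Preserve the original column order of X.
        return [col for col in X.columns if all(col in c for c in chosen)]

    return selector


X = pd.DataFrame({
    "feature_color": ["red", "blue"],
    "feature_size": [1.5, 2.0],
    "label": ["a", "b"],
})

both = select_all_of(non_numeric_features, feature_matches(r"^feature_"))
print(both(X))  # ['feature_color']
```

Each selector keeps the plain `(X,) -> columns` shape, so the combined one is still just a callable and involves no "magic".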
I think this would be pretty great, but we need to get it right... |
can you elaborate @jnothman ? |
I feel like That would make the standard use case a bit more verbose, though :-/
preprocess = make_column_transformer(
    (select_types([float, int]), StandardScaler()),
    (select_types([string, object]), OneHotEncoder())
)
And that wouldn't be very robust wrt different numerical dtypes :-/ (like an unexpected float16 or something) |
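One way to picture the robustness concern is to compare exact dtypes with NumPy's abstract dtype hierarchy. The sketch below is an assumption-laden illustration, not the proposed API: `select_types` here is a hypothetical factory written for this example, it takes NumPy scalar types, and it assumes a pandas DataFrame as input.

```python
import numpy as np
import pandas as pd


def select_types(kinds):
    """Hypothetical factory: select columns whose dtype is a subtype of any kind."""

    def selector(X):
        return [
            col
            for col, dtype in X.dtypes.items()
            # Skip pandas extension dtypes, which np.issubdtype cannot interpret.
            if isinstance(dtype, np.dtype)
            and any(np.issubdtype(dtype, kind) for kind in kinds)
        ]

    return selector


X = pd.DataFrame({
    "a": np.array([1.0, 2.0], dtype=np.float16),
    "b": np.array([1, 2], dtype=np.int32),
    "c": np.array([0.5, 1.5], dtype=np.float64),
    "d": ["x", "y"],
})

exact = select_types([np.float64, np.int64])  # concrete dtypes only
numeric = select_types([np.number])           # abstract parent of all numeric dtypes

print(exact(X))    # ['c']
print(numeric(X))  # ['a', 'b', 'c']
```

Listing concrete types misses the float16 and int32 columns, while selecting on the abstract `np.number` picks up every numeric column.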
@mfeurer @janvanrijn @joaquinvanschoren relevant to your interests. |
why not then combine type_selector and name_selector, and work on designing
a nice unified interface? Still, I think the key here is to support a
callable at all, and then work on the API
|
by "disparity in definition of categorical feature" ("disparity" was
probably a bad word choice) I meant that sometimes we mean pd.Categorical,
sometimes objects, sometimes numpy string-likes, sometimes integers...
|
@partmor, should we be expecting a PR from you? |
Hi @jnothman apologies for the silence this week. I am having one of those weeks... :( |
No worries :)
I'm just keen to get this into the next release if we can, but that's a few
weeks away, it seems.
|
I'd love to see this in the release as well, but I probably won't have time to code it before you do, @partmor ;) |
Should this be closed following #11592 or do we want to keep this open and track actually adding some functions? |
I think this should be closed. But we can open an issue for selection function factories... I'm not about to though |
There is still the open PR that adds some functions: #11301. But in any case, the original point of the issue is solved, so I am fine with closing it. |
The new ColumnTransformer allows the user to specify column names or indices. I think it should be possible to specify a set of columns as a function. Indeed, towards #10603, we should probably have an inbuilt function that distinguishes between categorical and numeric pd.Series dtypes.
In order to support the remainder functionality, I believe the function should have signature: (X,) -> column indices, where column indices is any of the supported column specification formats.
Sound good @amueller, @jorisvandenbossche?
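For readers who want to see the `(X,) -> column indices` idea in action, here is a minimal sketch. It assumes a scikit-learn release in which ColumnTransformer accepts a callable as the column specification (the thread points to #11592 for that) and uses the `(name, transformer, columns)` tuple order. The selector functions `numeric_columns` and `non_numeric_columns` are names invented for this example, and treating "not numeric" as categorical is just one possible definition.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def numeric_columns(X):
    """Return the names of the numeric columns of DataFrame X."""
    return list(X.select_dtypes(include="number").columns)


def non_numeric_columns(X):
    """Return the names of every column that is not numeric."""
    return list(X.select_dtypes(exclude="number").columns)


X = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [30000.0, 52000.0, 81000.0],
    "city": ["Paris", "Oslo", "Paris"],
})

# Each callable is evaluated on X at fit time, so the column names
# do not need to be known when the transformer is constructed.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_columns),
    ("cat", OneHotEncoder(), non_numeric_columns),
])

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

Because the selectors return column names, which is one of the supported specification formats, the rest of ColumnTransformer can treat the result like any explicit column list.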