
[API] Consistent API for attaching properties to samples #4497

Closed

GaelVaroquaux opened this issue Apr 2, 2015 · 96 comments
@GaelVaroquaux (Member)

This is an issue that I am opening for discussion.

Problem:

Sample weights (in various estimators), group labels (for cross-validation objects), and group ids (in learning to rank) are optional pieces of information that need to be passed to estimators and the CV framework, and that need to be kept at the proper shape throughout the data-processing pipeline.

Right now, the code dealing with this is inhomogeneous across the codebase, and the APIs are not fully consistent (e.g. passing sample_weight to objects that do not support it will just crash).

This discussion attempts to address the problems above and to open the door to more flexibility for future evolution.

Core idea

We could have an argument that is a dataframe-like object, i.e. a collection (dictionary) of 1D array-like objects. This argument would be sliced and diced by any code that modifies the number of samples (CV objects, train_test_split), and passed along with the data.

Proposal A

All objects could take the signature fit(X, y, sample_props=None), with y optional for unsupervised learners.

sample_props (name to be debated) would be a dataframe-like object (i.e. either a dict of arrays or a dataframe). It would have a few predefined fields, such as "weight" for sample weights and "group" for sample groups used in cross-validation. It would open the door to attaching domain-specific information to samples, and thus make scikit-learn easier to adapt to specific applications.
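
To make Proposal A concrete, here is a minimal sketch of what it could look like from both sides of the API. The toy WeightedMean estimator is invented for illustration; only the fit(X, y, sample_props=None) signature and the "weight"/"group" field names come from the proposal itself.

import numpy as np

class WeightedMean:
    """Toy regressor that predicts the (optionally weighted) mean of y."""

    def fit(self, X, y, sample_props=None):
        sample_props = {} if sample_props is None else sample_props
        weight = sample_props.get("weight")          # optional per-sample weights
        self.mean_ = np.average(y, weights=weight)   # weights=None -> plain mean
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

X = np.zeros((4, 2))
y = np.array([1.0, 2.0, 3.0, 4.0])
props = {
    "weight": np.array([1.0, 1.0, 0.0, 0.0]),  # per-sample weights
    "group": np.array([0, 0, 1, 1]),           # would be consumed by CV splitters
}
print(WeightedMean().fit(X, y, sample_props=props).predict(X))  # [1.5 1.5 1.5 1.5]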

Proposal B

y could optionally be a dataframe-like object with a compulsory field "target", serving the purpose of the current y, and other fields such as "weight", "group", etc. In that case, arguments like "sample_weight" would disappear into it.
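
For contrast, a hypothetical call under Proposal B, where the extra fields travel inside y (the field names are again only illustrative):

import numpy as np

# Under Proposal B, y itself becomes a dataframe-like container; "target" is
# compulsory and plays the role of today's y.
y_b = {
    "target": np.array([1.0, 2.0, 3.0, 4.0]),
    "weight": np.array([1.0, 1.0, 0.0, 0.0]),
    "group": np.array([0, 0, 1, 1]),
}
# est.fit(X, y_b)  # sample_weight and similar fit arguments would disappear into y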

People at the Paris sprint (including me) seem to lean towards proposal A.

Implementation aspects

The different validation tools will have to be adapted to accept this type of argument. We should not depend on pandas; thus we will accept a dict of arrays (and build a helper function to slice them in the sample direction). This helper should probably also accept dataframes, but given that dataframes can be indexed like dictionaries, this will not be a problem.
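
A sketch of what such a slicing helper might look like; the name index_sample_props and its exact behaviour are assumptions made here for illustration:

import numpy as np

def index_sample_props(sample_props, indices):
    """Slice every per-sample array in sample_props along the sample axis.

    Works on a plain dict of 1D arrays; since DataFrames can be indexed
    column-by-column like dictionaries, a pandas DataFrame works as well.
    """
    if sample_props is None:
        return None
    return {key: np.asarray(sample_props[key])[indices] for key in sample_props}

props = {"weight": np.array([1.0, 0.5, 2.0, 1.0]), "group": np.array([0, 0, 1, 1])}
print(index_sample_props(props, np.array([0, 2])))
# {'weight': array([1., 2.]), 'group': array([0, 1])}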

Finally, the CV objects should be adapted to split the corresponding structure, probably in a follow-up to #4294.

@GaelVaroquaux (Member, Author)

To track the evolution of ideas, here are previous mentions of related ideas:
#2904 (comment)

@GaelVaroquaux (Member, Author)

It would fix #2879, and be a clean alternative to #1574 and #3524

@mjbommar (Contributor) commented Apr 2, 2015

Sorry to be obtuse, but where does the reticence to depend on or better integrate pandas come from? It's hard to find applied examples of sklearn in the community that don't include pandas these days, and the marginal dependencies over numpy and scipy are only dateutil and pytz. It seems as if we'd have to reinvent much of the masking and group-by wheel anyway to support data-dependent CV use cases.

@GaelVaroquaux (Member, Author) commented Apr 2, 2015 via email

@raghavrv (Member) commented Apr 2, 2015

Proposal A along with dict of arrays seems like a good solution to me... :)

@mblondel (Member) commented Apr 6, 2015

@GaelVaroquaux IIRC, you said you were considering a dataset object for out-of-core learning. If that's indeed the case, this should probably be part of our reflection.

@amueller (Member) commented Apr 6, 2015

+1 for A.

I'm still reflecting on whether we need to change the API of all estimators, though. I'd like to avoid that, but I'm not sure it is possible.

I have nothing better than sample_props.

@amueller (Member) commented Apr 6, 2015

It means that everybody who uses sample_weight needs to change their code, which is better than everybody who ever used y needing to change their code (which they would have to do for B).

@jnothman (Member) commented Apr 6, 2015

What's the advantage of A over kwargs?

@amueller (Member) commented Apr 7, 2015

can you elaborate?

@jnothman (Member) commented Apr 7, 2015

A is a dict of names with array values. These variables could be passed directly as **kwargs, similarly resulting in a dict, without changing the current sample weight handling.
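
A rough sketch of that alternative, purely for illustration (none of this is existing scikit-learn code):

def fit(X, y, **sample_props):
    """Accept per-sample arrays directly as keyword arguments."""
    weight = sample_props.pop("sample_weight", None)
    group = sample_props.pop("group", None)
    # Unknown keywords would be silently swallowed unless explicitly rejected:
    if sample_props:
        raise TypeError("Unknown sample properties: %s" % sorted(sample_props))
    ...

# fit(X, y, sample_weight=w, group=g)  # same information as a sample_props dict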

@amueller (Member) commented Apr 7, 2015

So you would add **kwargs to all fit methods and ignore those that are not used?

@jnothman (Member) commented Apr 7, 2015

Perhaps not, but I want to know in what ways this is really a worse solution than sample_props.


@GaelVaroquaux (Member, Author)

Perhaps not, but I want to know in what ways this is really a worse solution than sample_props.

Two aspects.

  • First, it would imply that models tend to swallow arguments without raising errors. For instance, if I don't know that Python is case-sensitive and I write

    fit(X, Y=y)

    I won't get an error message telling me that I didn't pass a valid argument.

  • Second, exposing sample_props as one argument makes it more obvious that it is a homogeneous type. It will also make people's lives easier if they are already using pandas. (I must say that I am a bit scared of coupling too much with pandas; upgrades to pandas tend to break our code, as in Pandas Validation failure #4540.)

I also find that "**kwargs" is harder to understand for someone who is not a Python expert.

@amueller (Member) commented Apr 7, 2015

I think mostly in being a little stricter with the interface. Also, there could be arguments to fit that are not of length n_samples (though we try to avoid them).

@amueller (Member) commented Apr 7, 2015

@GaelVaroquaux I think the issue you mentioned is caused by upgrading sklearn, not upgrading pandas ;)

@GaelVaroquaux (Member, Author) commented Apr 7, 2015 via email

@jnothman (Member) commented Apr 7, 2015

I just thought it worth raising as devil's advocate, so thanks for the initial responses.

First, it would imply that models tend to swallow arguments without raising errors.

Sure, though naming errors are just as real an issue with sample_props. Indeed, a confused user may write sample_props={'sample_weight': [...]} or sample_props={'weights': ...} instead of sample_props={'weight': ...}.

Another issue on which all proposed solutions fail (but where the incumbent approach of passing sample_weight explicitly works fine): if an estimator does not have sample_weight support and it is later implemented, the estimator's behaviour will change implicitly even though the user's code and data do not. Is there any way we can avoid this backwards-compatibility issue? I don't think the friendly answer is UserWarning("Weight support was recently added. Your results may differ from before.")

@GaelVaroquaux (Member, Author) commented Apr 7, 2015 via email

@amueller (Member) commented Apr 7, 2015

Somewhat related question: will transformers also output a modified sample_props? They must, right?

@jnothman (Member) commented Apr 7, 2015

Or perhaps we should at least have a way of introspecting which sample props an estimator or method knows about, so that the user can make assertions in upgrades...? Too frameworkish?


@amueller (Member) commented Apr 9, 2015

Fixes #2630, also see the wiki

@AlexanderFabisch (Member)

To summarize the current state of the discussion, I think something like this would be a nice solution:

Estimator:

def fit(X, y, sample_props=None):
    # validate and extract the sample properties this estimator declares
    weights, groups = check_props(sample_props, required=["weights"], optional=["groups"])
    ...

User:

sklearn.seterr("raise")  # 'ignore', 'warn', 'raise', 'call', 'print', 'log'
est = Bla().fit(X, y, sample_props={"weight": ...})

-> ValueError("Sample properties 'weights' are missing, unknown sample properties 'weight'")

The only thing that is missing is a good way to document the required and optional sample properties of an estimator. I have no idea how we can do this. An advantage of having sample_weight as an explicit argument in fit is that you can directly see that an estimator uses it, and it is obvious whether the description in the docstring is missing or not.
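
For reference, a rough sketch of what a check_props helper along those lines could do; the signature, return convention, and error text here are assumptions for illustration, not part of the proposal:

def check_props(sample_props, required=(), optional=()):
    """Validate a dict of per-sample arrays against the names an estimator declares.

    Returns the arrays in the order required + optional (missing optional
    entries come back as None); missing required or unknown names raise.
    """
    sample_props = {} if sample_props is None else sample_props
    known = set(required) | set(optional)
    missing = [name for name in required if name not in sample_props]
    unknown = [name for name in sample_props if name not in known]
    if missing or unknown:
        raise ValueError(
            "Sample properties %s are missing, unknown sample properties %s"
            % (missing, unknown)
        )
    return tuple(sample_props.get(name) for name in (*required, *optional))

# check_props({"weight": [1.0]}, required=["weights"], optional=["groups"])
# -> ValueError: Sample properties ['weights'] are missing, unknown sample properties ['weight']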

@amueller (Member)

I think just mentioning it in the fit docstring and / or the estimator docstring should be fine, shouldn't it?

@amueller (Member)

I don't think sklearn.seterr("raise") is good, btw. It should be sklearn.seterr(sample_props="raise").
I could also see sklearn.seterr(deprecation="raise"), sklearn.seterr(type_conversion='ignore'), something for convergence issues, etc.
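
A toy sketch of that kind of per-category configuration, in the spirit of numpy's np.seterr; scikit-learn has no such function, so all names and defaults here are made up:

import warnings

_ERR_STATE = {"sample_props": "warn", "deprecation": "warn", "type_conversion": "warn"}

def seterr(**kwargs):
    """Set how each problem category is reported: 'ignore', 'warn' or 'raise'."""
    for category, action in kwargs.items():
        if category not in _ERR_STATE:
            raise ValueError("Unknown error category: %r" % category)
        _ERR_STATE[category] = action

def _report(category, message):
    action = _ERR_STATE[category]
    if action == "raise":
        raise ValueError(message)
    if action == "warn":
        warnings.warn(message)

seterr(sample_props="raise")
# _report("sample_props", "unknown sample property 'weigth'")  # would now raise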

@AlexanderFabisch (Member)

That sounds reasonable. I think this is a more general feature that has an impact on many parts of the library. We should make a separate pull request for it before we deal with the sample properties, shouldn't we? Are there any disadvantages of having such a global state?

@jnothman (Member) commented Sep 22, 2019 via email

@adriangb (Contributor)

I have not read all of the referenced PRs, issues and comments (it's a lot) but I went over this thread briefly.

One comment I have: would it be possible to introduce this new parameter while keeping the existing sample_weight, groups, etc. parameters? Then, as things get updated to support the new parameter, there can be a check that raises an error or warning if the user specifies sample_weight both as its own parameter and within sample_properties. Once sample_properties is fully implemented in the ecosystem, support for the standalone sample_weight could be dropped, with a warning of course.
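
A sketch of the kind of transition check being described; the function name and error message are made up for illustration:

def _resolve_sample_weight(sample_weight=None, sample_props=None):
    """During the transition, accept either spelling but refuse both at once."""
    props_weight = (sample_props or {}).get("sample_weight")
    if sample_weight is not None and props_weight is not None:
        raise ValueError(
            "Pass sample_weight either as its own parameter or inside "
            "sample_props, not both."
        )
    return props_weight if props_weight is not None else sample_weight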

I am also going to briefly detail my use case and results below to support this feature.

I am working on classifying activity data (accelerometer). My data looks something like this:

subject  walk  feature_1  feature_2  y_true  y_pred
0        0     0.2784     0.146      1       ?
0        0     0.1428     0.1286     1       ?
0        1     0.127      0.8127     2       ?
1        0     0.8721     0.328      3       ?
1        1     0.146      0.376      2       ?
1        1     0.4879     0.274      2       ?

In this case, X would be [feature_1, feature_2] and [subject, walk] are the 'sample properties'.
I use subject as the groups in LOGO cross validation because my end goal is to predict for a subject with no existing data.
I need to classify each walk into one of the categories in y_true, but I don't really need to classify each datapoint (in reality, I have thousands of datapoints for each walk). So ideally I would group y_pred by walk and, for each walk, keep only the datapoints above a certain prediction-probability threshold, or pick the top 3 datapoints, etc. The physical reasoning is that if the subject did something weird for 1-2 steps, I am happy to discard that because I have a lot of other data to work from.
I was able to achieve this by essentially copy-pasting sklearn.model_selection._validation.py and hacking it up to pass through a metadata parameter (a dict) to the scoring function, which I gave the signature score(estimator, X, y, metadata=None).

Doing this increased my cross-validation scores considerably; I guess I had a lot of "bad" datapoints that I am now discarding.
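
A condensed sketch of that kind of metadata-aware scorer; the column names, threshold, and aggregation rule are assumptions about the workflow described above, not actual scikit-learn code:

import numpy as np

def walk_level_accuracy(estimator, X, y, metadata=None, threshold=0.6):
    """Score each walk by majority vote over its confident datapoints.

    metadata is assumed to be a dict holding a per-sample "walk" array,
    mirroring the hacked-in parameter described above; y holds per-datapoint
    integer labels.
    """
    proba = estimator.predict_proba(X)
    pred = estimator.classes_[np.argmax(proba, axis=1)]
    confident = proba.max(axis=1) >= threshold

    walks = np.asarray(metadata["walk"])
    y = np.asarray(y)
    correct = []
    for walk in np.unique(walks):
        mask = (walks == walk) & confident
        if not mask.any():            # no confident datapoint: fall back to all of them
            mask = walks == walk
        majority_pred = np.bincount(pred[mask].astype(int)).argmax()
        majority_true = np.bincount(y[mask].astype(int)).argmax()
        correct.append(majority_pred == majority_true)
    return float(np.mean(correct))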

@jnothman (Member)

I've not yet looked at all of your contribution, @adriangb, but you might want to check out the latest proposal at #16079.

@adriangb (Contributor)

I haven't contributed to scikit-learn, @jnothman, but I would love to start! Thanks for referring me to the current discussion; I will comment there.

@jnothman (Member) commented Apr 28, 2020 via email

@adriangb (Contributor)

Well, I don't expect to be able to do too much, but it can't hurt to try! I was also interested in working on IterativeImputer (#16638 (comment)). That's probably not any easier.

@jnothman (Member) commented Apr 29, 2020 via email

@jeremiedbb (Member)

As decided during the dev meeting, we are safely moving it to the 1.2 milestone to give us more time for internal testing.

@jeremiedbb (Member)

moving to 2.0 (#25851)

@adrinjalali (Member)

I guess I'm going to dare to close this one, since we have the API now, implementations for more and more meta-estimators are on the way, and Pipeline is already merged 🥳

Closing this issue makes me so proud 😊
