
[API] Consistent API for attaching properties to samples #4497

Open
GaelVaroquaux opened this issue Apr 2, 2015 · 87 comments · May be fixed by #9566

Comments

@GaelVaroquaux (Member) commented Apr 2, 2015

This is an issue that I am opening for discussion.

Problem:

Sample weights (in various estimators), group labels (for cross-validation objects), and group ids (in learning to rank) are optional pieces of information that need to be passed to estimators and the CV framework, and that need to be kept in the proper shape throughout the data processing pipeline.

Right now, the code dealing with this is inhomogeneous across the codebase, and the APIs are not fully consistent (e.g. passing sample_weight to objects that do not support it will simply crash).

This discussion attempts to address the problems above, and to open the door to more flexibility for future evolution.

Core idea

We could have an argument that is a dataframe-like object, i.e. a collection (dictionary) of 1D array-like objects. This argument would be sliced and diced by any code that modifies the number of samples (CV objects, train_test_split), and passed along with the data.

Proposal A

All objects could take the signature fit(X, y, sample_props=None), with y optional for unsupervised learners.

sample_props (name to be debated) would be a dataframe-like object (i.e. either a dict of arrays, or a dataframe). It would have a few predefined fields, such as "weight" for sample weights and "group" for sample groups used in cross-validation. It would open the door to attaching domain-specific information to samples, and thus make scikit-learn easier to adapt to specific applications.
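
For concreteness, a minimal sketch of what Proposal A could look like from the user's side (DummyEstimator is made up for illustration; the sample_props argument is proposed, not existing, API):

import numpy as np

class DummyEstimator:
    # Stand-in illustrating the proposed signature; real scikit-learn
    # estimators do not accept sample_props today.
    def fit(self, X, y, sample_props=None):
        weights = None if sample_props is None else sample_props.get("weight")
        # A real estimator would use `weights` (and other fields) where supported.
        return self

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
sample_props = {
    "weight": np.array([1.0, 2.0, 0.5, 1.0]),  # per-sample weights
    "group": np.array([0, 0, 1, 1]),           # CV group labels
}

DummyEstimator().fit(X, y, sample_props=sample_props)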

Proposal B

y could optionally be a dataframe-like object with a compulsory field "target", serving the purpose of the current y, and other fields such as "weight", "group"... In that case, arguments such as "sample_weight" would disappear into it.
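
Under Proposal B, the same information would instead travel inside y (again a hypothetical sketch, not existing API):

y_props = {
    "target": [0, 0, 1, 1],          # what is currently passed as y
    "weight": [1.0, 2.0, 0.5, 1.0],  # former sample_weight argument
    "group": [0, 0, 1, 1],           # former cross-validation groups
}
# hypothetical call: est.fit(X, y_props) -- no separate sample_weight argument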

People at the Paris sprint (including me) seem to lean towards proposal A.

Implementation aspects

The different validation tools will have to be adapted to accept this type of argument. We should not depend on pandas; thus we will accept dicts of arrays (and build a helper function to slice them along the sample direction). This helper should probably also accept dataframes, but given that dataframes can be indexed like dictionaries, this will not be a problem.
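
A minimal sketch of such a slicing helper, assuming the values are 1D array-likes (the name index_sample_props is made up for illustration):

import numpy as np

def index_sample_props(sample_props, indices):
    # Slice every field of sample_props along the sample axis.
    # Works for a dict of arrays; dataframes also expose .items(),
    # so they can be handled the same way.
    if sample_props is None:
        return None
    return {key: np.asarray(value)[indices]
            for key, value in sample_props.items()}

props = {"weight": [1.0, 2.0, 0.5], "group": [0, 0, 1]}
train_props = index_sample_props(props, [0, 2])  # weights [1.0, 0.5], groups [0, 1]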

Finally, the CV objects should be adapted to split the corresponding structure, probably in a follow-up to #4294.

@GaelVaroquaux (Member Author) commented Apr 2, 2015

To track the evolution of ideas, here are previous mentions of related ideas:
#2904 (comment)

@GaelVaroquaux (Member Author) commented Apr 2, 2015

It would fix #2879, and be a clean alternative to #1574 and #3524.

@mjbommar (Contributor) commented Apr 2, 2015

Sorry to be obtuse, but where does the reticence to depend on, or better integrate with, pandas come from? It's hard to find applied examples of sklearn in the community that don't include pandas these days, and the marginal dependencies over numpy and scipy are only dateutil and pytz. It seems as if we'd have to reinvent much of the masking and group-by wheel anyway to support data-dependent CV use cases.


@raghavrv (Member) commented Apr 2, 2015

Proposal A along with a dict of arrays seems like a good solution to me... :)

@mblondel (Member) commented Apr 6, 2015

@GaelVaroquaux IIRC, you said you were considering a dataset object for out-of-core learning. If that's indeed the case, this should probably be part of our reflection.

@amueller (Member) commented Apr 6, 2015

+1 for A.

I'm still reflecting on whether we need to change the API of all estimators, though. I'd like to avoid that, but I'm not sure it is possible.

I have nothing better than sample_props.

@amueller (Member) commented Apr 6, 2015

It means that everybody who uses sample_weight needs to change their code, which is better than everybody who ever used y needing to change their code (which they would have to for B).

@jnothman (Member) commented Apr 6, 2015

What's the advantage of A over kwargs?

@amueller (Member) commented Apr 7, 2015

Can you elaborate?

@jnothman (Member) commented Apr 7, 2015

A is a dict of names with array values. These variables could be passed directly as **kwargs, similarly resulting in a dict, without changing the current sample weight handling.
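
For comparison, a sketch of the **kwargs variant (KwargsEstimator is made up for illustration):

class KwargsEstimator:
    def fit(self, X, y, **sample_props):
        weight = sample_props.get("weight")  # None if not supplied
        group = sample_props.get("group")
        # Fitting would consume weight/group where supported.
        return self

# The same information as a sample_props dict, but passed directly:
KwargsEstimator().fit([[0.0], [1.0]], [0, 1], weight=[1.0, 2.0])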

@amueller (Member) commented Apr 7, 2015

So you would add **kwargs to all fit methods and ignore those that are not used?

@jnothman (Member) commented Apr 7, 2015

Perhaps not, but I want to know in what ways this is really a worse solution than sample_props.


@GaelVaroquaux (Member Author) commented Apr 7, 2015

Perhaps not, but I want to know in what ways this is really a worse solution than sample_props.

Two aspects.

  • First, it would imply that models tend to swallow arguments without
    raising errors. For instance, if I don't know that Python is
    case-sensitive and I write

    "fit(X, Y=y)",

    I won't get an error message telling me that I didn't pass a valid
    argument.

  • Second, exposing sample_props as one argument will make it more
    obvious that it is a homogeneous type. It will also make people's
    lives easier if they are already using pandas. (I must say that I am
    a bit scared of coupling too much with pandas; upgrades to pandas
    tend to break our code, as in #4540.)

I also find that "**kwargs" is harder to understand for someone who is not a Python expert.

@amueller (Member) commented Apr 7, 2015

I think the advantage is mostly in being a little stricter with the interface. Also, there could be arguments to fit that are not of length n_samples (though we try to avoid them).

@amueller (Member) commented Apr 7, 2015

@GaelVaroquaux I think the issue you mentioned is caused by upgrading sklearn, not upgrading pandas ;)


@jnothman (Member) commented Apr 7, 2015

I just thought it worth raising as devil's advocate, so thanks for the initial responses.

First, it would imply that models tend to swallow arguments without raising errors.

Sure, though naming errors are just as much a real issue with sample_props. Indeed, a confused user may have sample_props={'sample_weight': [...]} or sample_props={'weights': ...} instead of sample_props={'weight': ...}.

Another issue on which all proposed solutions fail (while the incumbent approach of passing sample_weight explicitly works fine): if an estimator does not have sample_weight support but it is later implemented, its behaviour will change implicitly even though the data does not. Is there any way we can avoid this backwards-compatibility issue? I don't think the friendly answer is UserWarning("Weight support was recently added. Your results may differ from before.")


@amueller (Member) commented Apr 7, 2015

Somewhat related question: will transformers also output a modified sample_props? They must, right?

@jnothman (Member) commented Apr 7, 2015

Or perhaps we should at least have a way of introspecting which sample props an estimator or method knows about, so that the user can make assertions in upgrades...? Too frameworkish?


@amueller (Member) commented Apr 9, 2015

Fixes #2630; also see the wiki.

@AlexanderFabisch (Member) commented Apr 10, 2015

To summarize the current state of the discussion, I think something like this would be a nice solution:

Estimator:

def fit(X, y, sample_props=None):
    # check_props would be a new validation helper (it does not exist yet);
    # it would unpack the declared sample properties:
    weights, groups = check_props(sample_props, required=["weights"], optional=["groups"])
    ...

User:

sklearn.seterr("raise")  # 'ignore', 'warn', 'raise', 'call', 'print', 'log'
est = Bla().fit(X, y, sample_props={"weight": ...})

-> ValueError("Sample properties 'weights' are missing, unknown sample properties 'weight'")

The only thing that is missing is a good way to document the required and optional sample properties of an estimator. I have no idea how we can do this. An advantage of having sample_weight as an explicit argument in fit is that you can directly see that an estimator uses it, and it is obvious whether the description in the docstring is missing or not.

@amueller (Member) commented Apr 10, 2015

I think just mentioning it in the fit docstring and/or the estimator docstring should be fine, shouldn't it?

@amueller (Member) commented Apr 10, 2015

I don't think sklearn.seterr("raise") is good, btw; it should be sklearn.seterr(sample_props="raise").
I could see sklearn.seterr(deprecation="raise"), sklearn.seterr(type_conversion='ignore'), convergence issues, etc.
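
For illustration, such a keyword-based seterr could be sketched as follows (entirely hypothetical -- sklearn.seterr does not exist; the design is modeled on numpy.seterr):

_ERR_CONFIG = {"sample_props": "warn", "deprecation": "warn", "type_conversion": "warn"}
_ALLOWED_ACTIONS = {"ignore", "warn", "raise"}

def seterr(**kwargs):
    # Update per-category error handling and return the old settings,
    # mirroring the numpy.seterr calling convention.
    old = dict(_ERR_CONFIG)
    for category, action in kwargs.items():
        if category not in _ERR_CONFIG or action not in _ALLOWED_ACTIONS:
            raise ValueError("unknown category or action: %s=%r" % (category, action))
        _ERR_CONFIG[category] = action
    return old

seterr(sample_props="raise")  # unknown sample properties would now raise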

@AlexanderFabisch (Member) commented Apr 11, 2015

That sounds reasonable. I think this is a more general feature that has an impact on many parts of the library. We should make a separate pull request for it before we deal with the sample properties, shouldn't we? Are there any disadvantages of having such a global state?

@jnothman (Member) commented Feb 9, 2017

Thinking about this again given the merge of #8278, I would like to note a couple of things about my routing proposal:

  • As long as the sample props are always passed, it is generally possible to do routing (and the renaming of sample props sought by @vene) via wrappers, mixins and the like: simply remove the weight column, or rename it, to opt out of its use.
  • However, it remains necessary for compatibility across versions to have an explicit way to specify routing in meta-estimators and CV estimators, and default routing needs to maintain former behaviour (i.e. not pass sample weights to SearchCV scorers, nor to transformers in a Pipeline), as is clear from an issue like #4632. The only safe (w.r.t. backwards compatibility) alternative is that every time sample_weight (or another attribute) support is added to something pre-existing, a parameter ignore_sample_weight=False also needs to be added, at least to facilitate deprecation in the short term.
  • The necessity of some kind of routing management is not contingent on a generic sample properties implementation. It applies even in the simple case of enabling weighted scoring in GridSearchCV or cross_val_score (#4632).
  • In Pipeline we already have selective passing of fit parameters using double-underscore notation (see the sketch below), but this requires that the fit parameter be repeated for each target estimator. With these semantics extended to other contexts, we would still have the full expressive power of routing; but with a routing parameter in Pipeline and other meta/CV estimators, the routing specification becomes more localised, and this multiplication of input parameters (not to mention the munging of names) is avoided.

So while it's a separate but related issue, I'd like to hear which is preferable: a routing parameter, or always route but require attributes to be ignored by default over a deprecation period.
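
For reference, the existing double-underscore fit-parameter passing in Pipeline looks like this (this part is current API; a routing parameter, by contrast, is hypothetical):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 2.0, 0.5, 1.0])

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
# The step-name prefix routes the fit parameter to that step only,
# and must be repeated for every step that should receive it:
pipe.fit(X, y, svc__sample_weight=w)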

@jnothman jnothman added this to Design/analysis stage in API and interoperability Aug 14, 2017
@jnothman jnothman referenced a pull request that will close this issue Aug 16, 2017
@jnothman jnothman moved this from Design/analysis stage to In progress in API and interoperability Aug 17, 2017
@NicolasWoloszko commented Nov 27, 2017

Hi. So is there a way to take sample weights into account for cross-validation and grid search? Thanks in advance.


@jnothman (Member) commented Jan 14, 2018

If we had a "deep" approach, where the user needs to explicitly specify the path they want a sample prop to take when passing in the sample prop, would it be acceptable to have non-identifier keyword argument names? For example:

GridSearchCV(...).fit(X, y, **{'sample_weight': w, '[score]sample_weight': w})

Python accepts this. And if we want such a deep (path-based) solution, unlike my #9566, we need to express paths with more than just double-underscores, which should be reserved for nested estimators. (I note that such a deep property mechanism makes modifying sample properties in resamplers, as in #3855, problematic: a resampler would need to return props modified with the fully-qualified names.)

@Adglink commented Aug 15, 2018

Any updates on being able to use sample_weight when performing RFECV?


@Adglink commented Aug 16, 2018

No worries, completely understandable. Thanks for all your hard work on the sklearn library. I love how simple and universal this package is.

One last question: do you have any suggestions for a quick and dirty way to apply sample weights in RFECV? I just want to experiment with how sample weights change the features I get.

@kyleengel commented Sep 22, 2019

Solution here for those interested: https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function

Basically, you can define your own scorer and do index matching inside that function to get the right weights. GridSearchCV / RandomizedSearchCV won't do the splitting for you (yet...).
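
A minimal sketch of that workaround, assuming X is a pandas DataFrame so each CV split keeps its original index (the scorer name weighted_r2 is made up for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = pd.DataFrame({"x": rng.randn(20)})
y = 2.0 * X["x"] + 0.1 * rng.randn(20)
weights = pd.Series(rng.rand(20), index=X.index)

def weighted_r2(estimator, X_test, y_test):
    # Index matching: recover this split's weights from the DataFrame
    # index, since the search does not route sample_weight to scorers.
    w = weights.loc[X_test.index]
    return r2_score(y_test, estimator.predict(X_test), sample_weight=w)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0]}, scoring=weighted_r2, cv=3)
search.fit(X, y)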
