[API] Consistent API for attaching properties to samples #4497
To track the evolution of ideas, here are previous mentions of related ideas: |
Sorry to be obtuse, but where does the reticence to depend on, or better integrate with, pandas come from? |
> It's hard to find applied examples of sklearn in the community that don't include pandas these days,

I never use it: my data are images, and images don't fit well in pandas.

> It seems as if we'd have to reinvent much of the masking and group-by wheel anyway to support data-dependent CV use cases.

Only masking, which is trivial, not group-by.
|
Proposal A along with a dict of arrays seems like a good solution to me... :) |
@GaelVaroquaux IIRC, you said you were considering a dataset object for out-of-core learning. If that's indeed the case, this should probably be part of our reflection. |
+1 for A. I'm still reflecting on whether we need to change the API of all estimators, though. I'd like to avoid that, but I'm not sure it is possible. I have nothing better than |
It means that everybody that uses |
What's the advantage of A over kwargs? |
Can you elaborate? |
A is a dict of names with array values. These variables could be passed directly as **kwargs, similarly resulting in a dict, without changing the current sample weight handling. |
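To make the comparison concrete, here is a sketch of the two signatures under discussion; both functions are hypothetical, neither is actual scikit-learn API:

```python
import numpy as np

# Hypothetical signatures, for illustration only.

def fit_with_kwargs(X, y, **sample_props):
    # **kwargs: any keyword becomes a per-sample variable; names are
    # completely free-form, so typos pass silently.
    weight = sample_props.get("weight")
    return weight

def fit_with_props(X, y, sample_props=None):
    # Proposal A: the same dict, but passed through one named argument
    # that validation code can inspect as a whole.
    sample_props = {} if sample_props is None else sample_props
    weight = sample_props.get("weight")
    return weight

X, y = np.zeros((3, 2)), np.zeros(3)
w = np.array([1.0, 0.5, 2.0])

fit_with_kwargs(X, y, weight=w)
fit_with_props(X, y, sample_props={"weight": w})
```

The data passed is identical in both cases; the difference is whether the dict is assembled implicitly by Python or handed over explicitly where validation code can see it.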
So you would add |
Perhaps not, but I want to know in what ways this is really a worse approach.
|
Two aspects. I also find that "**kwargs" is harder to understand for someone who is |
I think mostly in being a little stricter with the interface. Also, there could be arguments to fit that are not of length n_samples (though we try to avoid them). |
@GaelVaroquaux I think the issue you mentioned is caused by upgrading sklearn, not upgrading pandas ;) |
> @GaelVaroquaux I think the issue you mentioned is caused by upgrading sklearn, not upgrading pandas ;)

Well, pandas.Series.dtype.kind was certainly present in Pandas 0.14.1. I didn't check for 0.15.
|
I just thought it worth raising as devil's advocate, so thanks for the initial responses.
Sure, though naming errors are as much a real issue with sample_props. Indeed a confused user may have sample_props={'sample_weight': [...]} or sample_props={'weights': ...} instead of sample_props={'weight': ...}. Another issue in which all proposed solutions fail (but the incumbent approach of passing sample_weight explicitly works fine): if an estimator does not have sample_weight support but then it is implemented, its behaviour will change implicitly though the data does not. Is there any way we can avoid this backwards compatibility issue? |
> Sure, though naming errors are as much a real issue with sample_props. Indeed a confused user may have sample_props={'sample_weight': [...]} or sample_props={'weights': ...} instead of sample_props={'weight': ...}.

Yes, I agree. I think that the proposal is slightly better than "**kwargs" in this respect, but not much better.

> Another issue in which all proposed solutions fail (but the incumbent approach of passing sample_weight explicitly works fine): if an estimator does not have sample_weight support but then it is implemented, its behaviour will change implicitly though the data does not. Is there any way we can avoid this backwards compatibility issue?

That's a very good point. We could suggest a global flag "raise", "warn", "ignore" to deal with unknown sample_props, controlled in the same style as np.seterr, which is an incredibly useful debugging feature in numpy.
|
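A minimal sketch of the np.seterr-style switch suggested above, with entirely hypothetical names (scikit-learn has no such API):

```python
import warnings

# Hypothetical module-level state, in the spirit of np.seterr.
_UNKNOWN_PROPS_ACTION = "warn"  # one of "raise", "warn", "ignore"

def set_sample_props_action(action):
    """Choose what happens when an estimator receives unknown sample props."""
    global _UNKNOWN_PROPS_ACTION
    if action not in ("raise", "warn", "ignore"):
        raise ValueError("action must be 'raise', 'warn' or 'ignore'")
    _UNKNOWN_PROPS_ACTION = action

def check_sample_props(sample_props, known=("weight", "group")):
    """Apply the configured policy to any unrecognized property names."""
    unknown = set(sample_props) - set(known)
    if unknown:
        msg = "Unknown sample properties: %s" % sorted(unknown)
        if _UNKNOWN_PROPS_ACTION == "raise":
            raise ValueError(msg)
        if _UNKNOWN_PROPS_ACTION == "warn":
            warnings.warn(msg)

set_sample_props_action("raise")
# check_sample_props({"wieght": [1.0]})  # -> ValueError: the typo is caught
```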
Somewhat related question: will transformers also output a modified |
Or perhaps we should at least have a way of introspecting which sample properties an estimator supports.
|
To summarize the current state of the discussion, I think something like this would be a nice solution:

Estimator:

User:

-> ValueError("Sample properties 'weights' are missing, unknown sample properties 'weight'")

The only thing that is missing is a good way to document the required and optional sample properties of an estimator. I have no idea how we can do this. An advantage of having |
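A rough sketch of the estimator/user interaction described above; the class and attribute names are invented here for illustration:

```python
import numpy as np

class MyEstimator:
    # Hypothetical way for an estimator to declare its sample properties.
    _required_props = ("weights",)
    _optional_props = ("groups",)

    def fit(self, X, y, sample_props=None):
        sample_props = {} if sample_props is None else sample_props
        known = set(self._required_props) | set(self._optional_props)
        missing = set(self._required_props) - set(sample_props)
        unknown = set(sample_props) - known
        if missing or unknown:
            raise ValueError(
                "Sample properties %s are missing, unknown sample "
                "properties %s" % (sorted(missing), sorted(unknown)))
        return self

X, y = np.zeros((3, 2)), np.array([0, 1, 0])
# A confused user writes 'weight' where the estimator expects 'weights':
try:
    MyEstimator().fit(X, y, sample_props={"weight": np.ones(3)})
except ValueError as exc:
    print(exc)
# Sample properties ['weights'] are missing, unknown sample properties ['weight']
```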
I think just mentioning it in the fit docstring and / or the estimator docstring should be fine, shouldn't it? |
I don't think sklearn.seterr("raise") is good btw. It should be |
That sounds reasonable. I think this is a more general feature that has an impact on many parts of the library. We should make a separate pull request for it before we deal with the sample properties, shouldn't we? Are there any disadvantages of having such a global state? |
Yes, that's quite a neat pandas-based hack. Another solution would be to put the weights as a column in X, then have a transformer that drops them for training...
|
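A sketch of the column-dropping idea just mentioned, assuming the weights are carried in the last column of X; the class below is invented for illustration and is not a scikit-learn transformer:

```python
import numpy as np

class DropWeightColumn:
    """Strip a trailing sample-weight column from X before training."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Everything except the last column: the actual features.
        return X[:, :-1]

    def get_weights(self, X):
        # The last column: the smuggled-in sample weights.
        return X[:, -1]

X_with_weights = np.array([[0.1, 0.2, 1.0],
                           [0.3, 0.4, 0.5]])
t = DropWeightColumn()
X_train = t.transform(X_with_weights)    # features only
weights = t.get_weights(X_with_weights)  # recovered sample weights
```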
I have not read all of the referenced PRs, issues and comments (it's a lot), but I went over this thread briefly. One comment I have: would it be possible to introduce this new parameter while keeping the existing sample_weight parameter? I am also going to briefly detail my use case and results below to support this feature. I am working on classifying activity data (accelerometer). My data looks something like this:
In this case, doing this increased my cross-validation scores considerably; I guess I had a lot of "bad" data points that I am now discarding. |
I haven't contributed to scikit-learn, @jnothman, but would love to start! Thanks for referring me to the current discussion, I will comment there. |
This is a very intricate place to start!! It has challenged those of us who know the scikit-learn API deeply for years.
|
Well, I don't expect to be able to do too much, but it can't hurt to try! I was also interested in working on IterativeImputer (#16638 (comment)). That's probably not any easier. |
IterativeImputer doesn't have as many API quandaries and intricacies involved; more algorithmic questions.
|
As decided during the dev meeting, we are safely moving it to the 1.2 milestone to give us more time for internal testing. |
Moving to 2.0 (#25851). |
I guess I'm going to dare and close this one, since we have the API now and implementations for more and more meta-estimators are on the way. Closing this issue makes me so proud 😊 |
This is an issue that I am opening for discussion.
Problem:
Sample weights (in various estimators), group labels (for cross-validation objects), and group ids (in learning to rank) are optional pieces of information that need to be passed to estimators and the CV framework, and that need to be kept to the proper shape throughout the data processing pipeline.
Right now, the code to deal with this is inhomogeneous across the codebase, and the APIs are not fully consistent (e.g., passing sample_weights to objects that do not support them will just crash).
This discussion attempts to address the problems above and to open the door to more flexibility for future evolution.
Core idea
We could have an argument that is a dataframe-like object, i.e., a collection (dictionary) of 1D array-like objects. This argument would be sliced and diced by any code that modifies the number of samples (CV objects, train_test_split), and passed along with the data.
Proposal A
All objects could take as a signature fit(X, y, sample_props=None), with y optional for unsupervised learners.
sample_props (name to be debated) would be a dataframe-like object (i.e., either a dict of arrays or a dataframe). It would have a few predefined fields, such as "weight" for sample weights and "group" for sample groups used in cross-validation. It would open the door to attaching domain-specific information to samples, and thus make scikit-learn easier to adapt to specific applications.
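A minimal sketch of what Proposal A could look like in practice; the fit function below is illustrative only:

```python
import numpy as np

# Illustration only: the signature Proposal A suggests for all estimators.
def fit(X, y=None, sample_props=None):
    sample_props = {} if sample_props is None else sample_props
    weight = sample_props.get("weight")  # predefined field: sample weights
    group = sample_props.get("group")    # predefined field: CV groups
    # ... estimator-specific fitting code would go here ...
    return weight, group

X = np.random.rand(5, 3)
y = np.array([0, 1, 0, 1, 1])
fit(X, y, sample_props={"weight": np.ones(5),
                        "group": np.array([0, 0, 1, 1, 2])})
```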
Proposal B
y could optionally be a dataframe-like object with a compulsory field "target", serving the purpose of the current y, and other fields such as "weight", "group"... In that case, arguments like "sample_weight" would disappear into it.
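For contrast, a minimal sketch of what Proposal B's y could look like (illustrative only):

```python
import numpy as np

# Illustration only: under Proposal B, y itself is a dataframe-like
# object; "target" is compulsory and plays the role of today's y.
y = {
    "target": np.array([0, 1, 0, 1, 1]),
    "weight": np.array([1.0, 0.5, 1.0, 2.0, 1.0]),  # absorbs sample_weight
    "group":  np.array([0, 0, 1, 1, 2]),
}
```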
People at the Paris sprint (including me) seem to lean towards proposal A.
Implementation aspects
The different validation tools will have to be adapted to accept this type of argument. We should not depend on pandas; thus we will accept dicts of arrays (and build a helper function to slice them in the sample direction). This helper should probably also accept data frames (but given that data frames can be indexed like dictionaries, this will not be a problem).
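A sketch of such a slicing helper, assuming a dict of arrays; the function name is invented here. A DataFrame would also work, since its items() likewise yields (name, values) pairs:

```python
import numpy as np

def slice_sample_props(sample_props, indices):
    """Hypothetical helper: index every per-sample array in a
    dict-of-arrays along the sample axis."""
    return {key: np.asarray(value)[indices]
            for key, value in sample_props.items()}

props = {"weight": np.array([1.0, 0.5, 2.0, 1.0]),
         "group": np.array([0, 0, 1, 1])}
train_idx = np.array([0, 2])
slice_sample_props(props, train_idx)
# {'weight': array([1., 2.]), 'group': array([0, 1])}
```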
Finally, the CV objects should be adapted to split the corresponding structure, probably in a follow-up to #4294.