Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample_weight support to StandardScaler (and friends?) #15601

Closed
amueller opened this issue Nov 12, 2019 · 10 comments
Closed

Add sample_weight support to StandardScaler (and friends?) #15601

amueller opened this issue Nov 12, 2019 · 10 comments

Comments

@amueller
Copy link
Member

amueller commented Nov 12, 2019

Via #15583:
it would be nice to have sample_weight support in StandardScaler. I don't see how to do that for sparse data right now, though (without touching the Cython code).

@jnothman
Copy link
Member

Touching the cython code doesn't look that hard, but it would require deciding if we want to duplicate the functions to handle the weighted case, or whether we can reuse the exisiting code paths (risking loss in both efficiency and precision).

@NicolasHug
Copy link
Member

In #14696 we duplicate. I think it's worth it as long as the function is simple enough?

@lorentzenchr
Copy link
Member

This would also help a "sparse" solution of #3702.

@amueller
Copy link
Member Author

Actually, I'm using a numpy/scipy based helper in my PR and I think this should be fine as long as we don't need inplace?

@panpiort8
Copy link
Contributor

I can take that issue

@panpiort8
Copy link
Contributor

panpiort8 commented Nov 30, 2019

Computing weighted mean and variance (for non-sparse matrix) in partial_fit requires applying _incremental_mean_and_var twice or developing new (to my knowledge) method similar to _incremental_mean_and_var, but with sample_weight (like here). I believe latter option is far better and should be dealt with another issue. What do you think, guys?

@jnothman
Copy link
Member

jnothman commented Dec 2, 2019 via email

@amueller
Copy link
Member Author

amueller commented Dec 3, 2019

I think we should just use a non-cython variant as I did here:
https://github.com/scikit-learn/scikit-learn/pull/15583/files#diff-ce89de6ba5554a75729d37eca95fc24dR835

@panpiort8
Copy link
Contributor

panpiort8 commented Dec 5, 2019

I think we should just use a non-cython variant as I did here:
https://github.com/scikit-learn/scikit-learn/pull/15583/files#diff-ce89de6ba5554a75729d37eca95fc24dR835

But we want StandardScaler to compute stats in incremental (batch) manner, which is, I believe, quite different.

@lorentzenchr
Copy link
Member

#18510 and #18682 added sample_weight to StandardScaler for dense and sparse input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants