MRG Sample weight for StandardScaler #18510
Conversation
I would not have duplicated all the tests by passing sample weights of ones. To me it's overkill, as the code is already covered by a dedicated test. Basically, these updated tests would pass even if sample_weight were ignored internally.
My 2c. Besides that, LGTM.
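To illustrate the point: weights of ones cannot distinguish a scaler that uses sample_weight from one that silently ignores it. A test that can, compares integer weights against explicit row repetition. A minimal sketch, assuming fit accepts a sample_weight keyword as this PR proposes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.randn(5, 2)
sw = np.array([1, 2, 1, 3, 1])

# Fitting with integer weights should match fitting on the repeated rows;
# an implementation that ignored sample_weight would fail these checks.
scaler_w = StandardScaler().fit(X, sample_weight=sw)
scaler_rep = StandardScaler().fit(np.repeat(X, sw, axis=0))

np.testing.assert_allclose(scaler_w.mean_, scaler_rep.mean_)
np.testing.assert_allclose(scaler_w.var_, scaler_rep.var_)
```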
y = rng.randn(n_samples)
sample_weights_OK = rng.randn(n_samples) ** 2 + 1
sample_weights_OK_1 = 1.
sample_weights_OK_2 = 2.
I don't get why you test for 1 and 2, since you don't check the consequence on the estimated mean or variance. Maybe just test 1?
I would just remove the sample_weights_OK* smoke tests. Valid sample weights are already covered by the other tests. Let's focus this test on invalid sample weights (wrong shape or wrong length).
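A sketch of the kind of invalid-input checks meant here, relying on scikit-learn's usual ValueError for malformed weights (the variable names are illustrative, not the actual test code):

```python
import numpy as np
import pytest
from sklearn.preprocessing import StandardScaler

n_samples = 10
X = np.random.RandomState(0).randn(n_samples, 3)

# Wrong shape: sample weights must be a 1-D array or a scalar.
with pytest.raises(ValueError):
    StandardScaler().fit(X, sample_weight=np.ones((n_samples, 2)))

# Wrong length: must match the number of samples.
with pytest.raises(ValueError):
    StandardScaler().fit(X, sample_weight=np.ones(n_samples + 1))
```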
I removed most of those sample_weight tests, except for the one testing dtype, because it caused so much turmoil. I also had to use np.dot() instead of _safe_accumulator_op() in _incremental_weighted_mean_and_var in extmath.py, because it was again causing the dtype-not-found error.
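For context, this is roughly what a dense weighted mean and variance computed with plain np.dot look like. A minimal sketch only; the function name and formulation are illustrative, not the actual _incremental_weighted_mean_and_var code:

```python
import numpy as np

def weighted_mean_and_var(X, sample_weight):
    # np.dot(w, X) accumulates the weighted column sums in one pass.
    w_sum = sample_weight.sum()
    mean = np.dot(sample_weight, X) / w_sum
    # Biased (frequency-weighted) variance around the weighted mean.
    var = np.dot(sample_weight, (X - mean) ** 2) / w_sum
    return mean, var
```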
  scaler = StandardScaler()
- X_scaled = scaler.fit(X).transform(X, copy=True)
+ X_scaled = scaler.fit(X).transform(
+     X, copy=True)
Why this formatting change? I believe the original code was already PEP8 compliant, no?
  scaler = StandardScaler()
- X_scaled = scaler.fit(X).transform(X, copy=True)
+ X_scaled = scaler.fit(X).transform(
+     X, copy=True)
Same remark here.
@@ -534,12 +617,12 @@ def test_standard_scaler_partial_fit():

      for chunk_size in [1, 2, 50, n, n + 42]:
          # Test mean at the end of the process
-         scaler_batch = StandardScaler(with_std=False).fit(X)
+         scaler_batch = StandardScaler(with_std=False).fit(
+             X)
Same remark again.
transformer.inverse_transform(
    [[np.max(transformer.references_)]]))
Here again, these formatting changes are debatable and unrelated to the topic of the PR.
Thanks @ogrisel and @agramfort.
I made these last changes because flake8 was failing locally. However, I have now reverted them all, following your suggestions.
I would rather avoid introducing unrelated formatting changes, to limit the risk of conflicts in other PRs when merging this one (plus it would make the history unnecessarily complex for tools such as git blame).
At some point in the future we will standardize on always using an automated formatter such as "black", but in the meantime, let's not change the formatting of unrelated lines.
Other than that, LGTM :)
@maikia let's try to revert these formatting changes and then let's merge.
Thank you very much @maikia!
For the remaining case of sparse data + missing values + sample_weight, I am not sure which route is best: extending the existing code that was written for the dense case, or starting over from independent code, as you did earlier in this PR, and modifying it to add support for missing values with a sparse X.
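For reference, the dense route amounts to masking missing entries out of both the data and the effective per-column weights before normalizing. A hypothetical sketch of that combination, not code from this PR:

```python
import numpy as np

def weighted_nanmean(X, sample_weight):
    # Missing entries contribute neither to the weighted sum nor to
    # the normalizing weight total of their column.
    mask = ~np.isnan(X)
    Xz = np.where(mask, X, 0.0)
    w_col = sample_weight[:, None] * mask
    return (sample_weight[:, None] * Xz).sum(axis=0) / w_col.sum(axis=0)
```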
@ogrisel |
Towards #15601.

This PR adds the possibility of passing sample_weight to StandardScaler (only for dense X).

Right now #3020 and #17743 are on hold: if we want to substitute the use of normalize in linear models, we need to ensure that the suggested substitution (a Pipeline with a StandardScaler) works in all cases. However, as pointed out in #18159, this would not work when the linear model uses sample_weight. Adding the possibility of passing sample_weight to StandardScaler makes the above substitution valid again.
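A minimal sketch of the intended usage, assuming fit accepts a sample_weight keyword as proposed here; the Ridge step and the double-underscore fit-parameter routing are standard Pipeline conventions, shown for illustration only:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
sw = rng.rand(20)

# StandardScaler alone: the fitted mean is the weighted column average.
scaler = StandardScaler().fit(X, sample_weight=sw)
np.testing.assert_allclose(scaler.mean_, np.average(X, axis=0, weights=sw))

# The substitution for normalize in linear models: route sample_weight
# to both the scaler and the estimator inside the pipeline.
model = make_pipeline(StandardScaler(), Ridge())
model.fit(X, y,
          standardscaler__sample_weight=sw,
          ridge__sample_weight=sw)
```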