
MRG Sample weight for StandardScaler #18510

Merged · 54 commits · Oct 7, 2020

Conversation

@maikia (Contributor) commented Oct 1, 2020

Towards #15601.

This PR adds the possibility of passing sample_weight to StandardScaler (dense X only).

Right now #3020 and #17743 are on hold: before we can substitute the use of normalize in linear models, we need to ensure that the suggested substitution, a Pipeline with a StandardScaler, works in all cases.

However, as pointed out in #18159, this does not work when the linear model uses sample_weight. Adding sample_weight support to StandardScaler makes the substitution valid again.
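
For concreteness, a minimal sketch of the substitution under discussion. The weight forwarding through Pipeline fit parameters is an illustration assuming this PR is merged; it is not code from the PR itself.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
w = rng.rand(20) + 1  # positive sample weights

# Instead of Ridge(normalize=True), scale inside a pipeline ...
pipe = make_pipeline(StandardScaler(), Ridge())
# ... and forward the weights to both steps, so that the scaling
# itself is weighted, which is what this PR enables.
pipe.fit(X, y,
         standardscaler__sample_weight=w,
         ridge__sample_weight=w)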

@jnothman (Member) commented Oct 1, 2020

This should close #15601, right? And it needs to build on #18447.

@maikia (Contributor, Author) commented Oct 2, 2020

Argh, you are right, @jnothman; thanks for pointing this out. I didn't realize there was a completely separate conversation on that topic. I will build on #18447 as soon as it is merged.

sklearn/preprocessing/_data.py: three resolved review threads (outdated)
@agramfort (Member) left a comment

I would not have duplicated all the tests passing sample weights of ones. To me it's overkill, as the code is already covered by a dedicated test. Basically, the updated tests would still pass even if sample_weight were ignored internally.

My 2c. Besides that, LGTM.
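
For illustration, a minimal sketch of the kind of dedicated test referred to above; the values are hypothetical, and it assumes StandardScaler.fit accepts sample_weight as added by this PR.

import numpy as np
from numpy.testing import assert_allclose
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.randn(30, 4)
w = rng.rand(30) + 1  # non-uniform positive weights

# Unit weights must be equivalent to passing no weights at all.
unweighted = StandardScaler().fit(X)
unit = StandardScaler().fit(X, sample_weight=np.ones(30))
assert_allclose(unit.mean_, unweighted.mean_)
assert_allclose(unit.var_, unweighted.var_)

# Non-uniform weights must match an explicit weighted mean, so a
# silently ignored sample_weight would make this assertion fail.
weighted = StandardScaler().fit(X, sample_weight=w)
assert_allclose(weighted.mean_, np.dot(w, X) / w.sum())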

y = rng.randn(n_samples)
sample_weights_OK = rng.randn(n_samples) ** 2 + 1
sample_weights_OK_1 = 1.
sample_weights_OK_2 = 2.
A member commented:

I don't get why you test for both 1 and 2, as you don't check the consequence for the estimated mean or var. Maybe just use 1?

A member commented:

I would just remove the sample_weights_OK* smoke tests. Valid sample weights are already covered by the other tests. Let's focus this test on invalid sample weights (wrong shape as well as wrong length).
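
A minimal sketch of such an invalid-input test; the values are hypothetical, and it assumes sample_weight is validated in fit with the usual ValueError.

import numpy as np
import pytest
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(10, 2)

# Wrong length: 11 weights for 10 samples.
with pytest.raises(ValueError):
    StandardScaler().fit(X, sample_weight=np.ones(11))

# Wrong shape: 2-D weights should be rejected.
with pytest.raises(ValueError):
    StandardScaler().fit(X, sample_weight=np.ones((10, 2)))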

@maikia (Contributor, Author) replied:

I removed most of those sample_weight tests, except for the one testing dtype, because it caused so much turmoil.
I also had to use np.dot() instead of _safe_accumulator_op() in _incremental_weighted_mean_and_var in extmath.py, because the latter was again causing the dtype-not-found error.
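
For context, a simplified sketch of the weighted-mean step and why np.dot helps: it preserves the floating-point dtype of its inputs. This is an illustration only, not the actual _incremental_weighted_mean_and_var code.

import numpy as np

def weighted_mean(X, sample_weight):
    # np.dot follows the dtype of its inputs, so float32 data
    # yields a float32 mean instead of being upcast or rejected.
    return np.dot(sample_weight, X) / sample_weight.sum()

X32 = np.asarray([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
w32 = np.asarray([1.0, 3.0], dtype=np.float32)
m = weighted_mean(X32, w32)
assert m.dtype == np.float32  # dtype is preserved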

scaler = StandardScaler()
- X_scaled = scaler.fit(X).transform(X, copy=True)
+ X_scaled = scaler.fit(X).transform(
+     X, copy=True)
A member commented:

Why this formatting change? I believe the original code was already PEP8 compliant, no?

scaler = StandardScaler()
- X_scaled = scaler.fit(X).transform(X, copy=True)
+ X_scaled = scaler.fit(X).transform(
+     X, copy=True)
A member commented:

Same remark here.

@@ -534,12 +617,12 @@ def test_standard_scaler_partial_fit():

for chunk_size in [1, 2, 50, n, n + 42]:
# Test mean at the end of the process
- scaler_batch = StandardScaler(with_std=False).fit(X)
+ scaler_batch = StandardScaler(with_std=False).fit(
+     X)
A member commented:

Same remark again.

- transformer.inverse_transform(
-     [[np.max(transformer.references_)]]))
+ transformer.inverse_transform(
+     [[np.max(transformer.references_)]]))
A member commented:

Here again, these formatting changes are debatable and unrelated to the topic of the PR.

@maikia (Contributor, Author) replied:

Thanks @ogrisel and @agramfort. I made these last changes because flake8 was failing locally. However, I have now reverted them all, following your suggestions.

@ogrisel (Member) left a comment

I would rather avoid introducing unrelated formatting changes, to limit the risk of conflicts with other PRs when merging this one (plus it would make the history unnecessarily complex for tools such as git blame).

At some point in the future we will standardize on always using an automated formatter such as "black", but in the meantime let's not change the formatting of unrelated lines.

Other than that, LGTM :)

@agramfort changed the title from "WIP Sample weight for StandardScaler" to "MRG Sample weight for StandardScaler" on Oct 6, 2020
@agramfort (Member) left a comment

@maikia, let's revert these formatting changes and then merge.

@ogrisel merged commit 64697c6 into scikit-learn:master on Oct 7, 2020. 7 checks passed.
@ogrisel (Member) commented Oct 7, 2020

Thank you very much @maikia!

@ogrisel (Member) commented Oct 7, 2020

For the remaining case of sparse data + missing values + sample_weight, I am not sure which route is best: extending the existing code that was written for the dense case, or starting over from the independent code you wrote earlier in this PR and modifying it to support missing values with a sparse nan_mask.
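
To make the nan_mask route concrete, a minimal sketch under stated assumptions: the helper weighted_mean_sparse_nan is hypothetical (nothing like it appears in this PR), and implicit zeros are treated as observed zero values.

import numpy as np
import scipy.sparse as sp

def weighted_mean_sparse_nan(X, sample_weight):
    # Hypothetical sketch: per-column weighted mean of a sparse
    # matrix, ignoring NaN entries. Implicit zeros count as
    # observed zeros.
    X = X.tocsc()
    n_samples, n_features = X.shape
    means = np.zeros(n_features)
    for j in range(n_features):
        start, end = X.indptr[j], X.indptr[j + 1]
        rows = X.indices[start:end]
        vals = X.data[start:end]
        nan_mask = np.isnan(vals)
        # Weighted sum over explicitly stored, non-NaN entries;
        # implicit zeros add nothing to the numerator.
        num = np.dot(sample_weight[rows[~nan_mask]], vals[~nan_mask])
        # Total weight excludes rows whose entry in column j is NaN.
        den = sample_weight.sum() - sample_weight[rows[nan_mask]].sum()
        means[j] = num / den
    return means

X = sp.csc_matrix(np.array([[1.0, np.nan], [0.0, 2.0], [3.0, 4.0]]))
w = np.array([1.0, 2.0, 1.0])
print(weighted_mean_sparse_nan(X, w))  # [1.0, 2.666...]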

@maikia (Contributor, Author) commented Oct 7, 2020

@ogrisel, since you don't see an advantage to either of them, I will reuse what I wrote previously and try to add the nan_mask.

amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020