Incremental weighted mean and var #16066
Conversation
Force-pushed from 87b303d to 9d3115f
Thanks @panpiort8! A few comments below.
sklearn/utils/tests/test_extmath.py
Outdated
for mean, var in means_vars:
    X = rng.normal(loc=mean, scale=var, size=SIZE)
    for i, weight in enumerate(weights):
        test(X, weight)
Ideally we should refactor this to use pytest.mark.parametrize
I don't know how to use pytest.mark.parametrize inside a method, and I'm not sure if parametrizing the outermost method is a good idea. I'm open to any suggestions.
@panpiort8 Something like
@pytest.mark.parametrize("mean", [0, 1e7, -1e7])
@pytest.mark.parametrize("var", [1, 1e-8, 1e5])
def test_incremental_weighted_mean_and_variance(mean, var):
...
Then you just use mean and var inside your test function and save some code, i.e. the loops.
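Applied to the test above, the refactor might look like the following sketch. The SIZE constant, the random weights, and the two-batch check are illustrative assumptions, not the PR's actual test body; note that numpy's scale parameter expects a standard deviation, hence np.sqrt(var):

```python
import numpy as np
import pytest

SIZE = 1000  # sample size per case; an assumed value for illustration


@pytest.mark.parametrize("mean", [0, 1e7, -1e7])
@pytest.mark.parametrize("var", [1, 1e-8, 1e5])
def test_incremental_weighted_mean_and_variance(mean, var):
    # pytest runs the cross product (3 means x 3 variances = 9 cases),
    # replacing the hand-written nested loops.
    rng = np.random.RandomState(42)
    X = rng.normal(loc=mean, scale=np.sqrt(var), size=SIZE)
    sample_weight = rng.rand(SIZE)

    # Batch reference result.
    expected = np.average(X, weights=sample_weight)

    # Naive two-batch incremental update of the weighted mean,
    # standing in here for the helper under test.
    half = SIZE // 2
    s1 = sample_weight[:half].sum()
    s2 = sample_weight[half:].sum()
    m1 = np.average(X[:half], weights=sample_weight[:half])
    m2 = np.average(X[half:], weights=sample_weight[half:])
    updated = (s1 * m1 + s2 * m2) / (s1 + s2)
    assert np.isclose(updated, expected)
```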
Force-pushed from 8b7b577 to bd2f3c0
Force-pushed from bd2f3c0 to 24332b4
Not sure if resolving comments notifies anyone, so to be sure: the requested changes are done.
Otherwise this LGTM!
high_mean = 1e7
low_var = 1e-8
high_var = 1e5
normal_mean = 0.0
I think the code below would be clearer if you just had 0 and 1 instead of normal_mean and normal_var; the meaning of scale=1 and loc=0 is clear, while the use of "normal" is confusing in this context.
nan_mask = np.isnan(X)
sample_weight_T = np.reshape(sample_weight, (1, -1))
new_weight_sum = \
You might just remind the reader that this has shape (n_features,), to parallel last_weight_sum.
_safe_accumulator_op(
    np.average, X_0, weights=sample_weight, axis=0)
new_variance *= total_weight_sum / new_weight_sum
new_element = (
I think "element" here is used where mathematicians might use "term"?
last_mean : array-like of shape: (n_features,)

last_variance : array-like of shape: (n_features,)
Suggested change:
- last_variance : array-like of shape: (n_features,)
+ last_variance : None or array-like of shape: (n_features,)
# last = stats until now
# new = the current increment
# updated = the aggregated stats
Should we route to _incremental_mean_and_var when sample_weight and last_weight_sum are None?
Sorry for the big delay in reviewing!
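The routing asked about above might look like the sketch below. This is a hypothetical illustration: the helper names and the unweighted stand-in are assumptions (only the mean update is shown), not sklearn's actual _incremental_mean_and_var.

```python
import numpy as np


def _mean_unweighted(X, last_mean, last_count):
    # Stand-in for the mean part of an unweighted incremental update
    # (hypothetical; not sklearn's _incremental_mean_and_var).
    new_count = X.shape[0]
    updated_count = last_count + new_count
    updated_mean = (last_mean * last_count + X.sum(axis=0)) / updated_count
    return updated_mean, updated_count


def _mean_weighted(X, sample_weight, last_mean, last_weight_sum):
    # Weighted incremental mean update.
    new_weight_sum = sample_weight.sum()
    new_mean = np.average(X, weights=sample_weight, axis=0)
    updated_weight_sum = last_weight_sum + new_weight_sum
    updated_mean = (last_mean * last_weight_sum
                    + new_mean * new_weight_sum) / updated_weight_sum
    return updated_mean, updated_weight_sum


def incremental_mean(X, last_mean, last_weight_sum, sample_weight=None):
    # Route to the unweighted path when no weights are supplied,
    # as the review comment suggests.
    if sample_weight is None:
        return _mean_unweighted(X, last_mean, last_weight_sum)
    return _mean_weighted(X, sample_weight, last_mean, last_weight_sum)
```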
Hi @panpiort8, thanks for your patience! Would you be able to address (and are you still interested in) the comments?
I'll take over this PR.
Reference Issues/PRs
Partially addresses: #15601
What does this implement/fix? Explain your changes.
The partial_fit method of StandardScaler can be called multiple times to incrementally update the mean and variance, but there was no proper way to do this with sample_weight. This PR introduces _incremental_weighted_mean_and_var, which does exactly that.
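The underlying update can be sketched as follows. This is a simplified, dense-only illustration of the standard two-batch combination of weighted sums of squared deviations, following the last/new/updated naming used in the PR; it is not the code merged here (no NaN handling, no _safe_accumulator_op):

```python
import numpy as np


def incremental_weighted_mean_and_var(X, sample_weight,
                                      last_mean, last_variance,
                                      last_weight_sum):
    # last = stats until now, new = the current increment,
    # updated = the aggregated stats.
    new_weight_sum = np.sum(sample_weight)
    new_mean = np.average(X, weights=sample_weight, axis=0)
    updated_weight_sum = last_weight_sum + new_weight_sum
    updated_mean = (last_weight_sum * last_mean
                    + new_weight_sum * new_mean) / updated_weight_sum

    if last_variance is None:
        return updated_mean, None, updated_weight_sum

    # Weighted sums of squared deviations for each batch.
    new_ssd = np.average((X - new_mean) ** 2,
                         weights=sample_weight, axis=0) * new_weight_sum
    last_ssd = last_variance * last_weight_sum
    # Correction term for the shift between the two batch means.
    correction = (last_weight_sum * new_weight_sum / updated_weight_sum
                  * (new_mean - last_mean) ** 2)
    updated_variance = (last_ssd + new_ssd + correction) / updated_weight_sum
    return updated_mean, updated_variance, updated_weight_sum
```

Calling this once per batch reproduces the batch-wide weighted mean and variance, which is what lets StandardScaler.partial_fit honor sample_weight across calls.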
Any other comments?