
Incremental weighted mean and var #16066

Conversation

panpiort8 (Contributor):

Reference Issues/PRs

Partially addresses: #15601

What does this implement/fix? Explain your changes.

The partial_fit method of StandardScaler can be called multiple times to incrementally update the mean and variance, but there was no proper way to do this with sample_weight. This PR introduces _incremental_weighted_mean_and_var, which does exactly that.
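For intuition, here is a minimal sketch of the weighted incremental mean update such a helper performs (names and signature are illustrative, not the PR's exact API):

import numpy as np

def update_weighted_mean(last_mean, last_weight_sum, X, sample_weight):
    # Fold a new batch into the running weighted mean, per feature.
    # X: (n_samples, n_features); sample_weight: (n_samples,).
    new_weight_sum = last_weight_sum + sample_weight.sum()
    batch_weighted_sum = sample_weight @ X  # shape (n_features,)
    updated_mean = (last_weight_sum * last_mean
                    + batch_weighted_sum) / new_weight_sum
    return updated_mean, new_weight_sum

The variance update follows the same pattern, combining weighted sums of squared deviations around the running mean.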

Any other comments?

  1. The basic equations I used can be found here, but I haven't found any (free) papers with a derivation leading to the batch version of them. It's basic math, but the code may seem confusing without that sort of explanation. If it's really necessary, I can provide a brief write-up.
  2. NaN handling is somewhat inelegant, but I haven't found any 'weighted' method similar to np.nanvar/np.nanmean (see the sketch below).
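On point 2: NumPy offers np.average with weights but no NaN-ignoring variant, so one workaround (illustrative, not necessarily the PR's approach) is to go through masked arrays:

import numpy as np

X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
w = np.array([0.25, 0.75])

# Mask NaNs so they contribute neither to the numerator nor to the
# per-column weight sum.
masked = np.ma.masked_invalid(X)
col_means = np.ma.average(masked, axis=0, weights=w)
print(col_means.filled(np.nan))  # -> [2.5, 4.0]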

@panpiort8 force-pushed the incremental_weighted_mean_and_var branch from 87b303d to 9d3115f on January 10, 2020 at 09:06
rth previously approved these changes on January 12, 2020

@rth (Member) left a comment:

Thanks @panpiort8! A few comments below.

(Earlier review threads on sklearn/utils/extmath.py and sklearn/utils/tests/test_extmath.py — outdated, resolved.)
for mean, var in means_vars:
    X = rng.normal(loc=mean, scale=var, size=SIZE)
    for i, weight in enumerate(weights):
        test(X, weight)
Member:

Ideally we should refactor this to use pytest.mark.parametrize

Contributor Author:

I don't know how to use pytest.mark.parametrize inside a method, and I'm not sure if parametrizing the outermost method is a good idea. I'm open to any suggestions.

Member:

@panpiort8 Something like

@pytest.mark.parametrize("mean", [0, 1e7, -1e7])
@pytest.mark.parametrize("var", [1, 1e-8, 1e5])
def test_incremental_weighted_mean_and_variance(mean, var):
    ...

Then you just use mean and var inside your test function and save some code, i.e. the loops.
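With the two decorators stacked, pytest runs the full cross product of the parameters, so the 3 means × 3 variances above become 9 independent test cases without any explicit loops.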

@rth dismissed their stale review on January 12, 2020 at 10:31

Changes requested

@panpiort8 force-pushed the incremental_weighted_mean_and_var branch 3 times, most recently from 8b7b577 to bd2f3c0 on January 12, 2020 at 12:44
@panpiort8 force-pushed the incremental_weighted_mean_and_var branch from bd2f3c0 to 24332b4 on January 12, 2020 at 12:47
@panpiort8 (Contributor Author):

Not sure if resolving comments notifies anyone, so to be sure: the requested changes are done.

@jnothman (Member) left a comment:

Otherwise this LGTM!

high_mean = 1e7
low_var = 1e-8
high_var = 1e5
normal_mean = 0.0
Member:

I think the code below would be clearer if you just had 0 and 1 instead of normal_mean and normal_var; the meaning of scale=1 and loc=0 is clear, while the use of "normal" is confusing in this context.
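That is, something like (illustrative):

X = rng.normal(loc=0, scale=1, size=SIZE)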


nan_mask = np.isnan(X)
sample_weight_T = np.reshape(sample_weight, (1, -1))
new_weight_sum = \
Member:

You might just remind the reader that this has shape (n_features,), to parallel last_weight_sum.
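For example, a one-line comment would do (illustrative; the assignment itself is elided):

new_weight_sum = ...  # shape (n_features,), parallels last_weight_sum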

_safe_accumulator_op(
    np.average, X_0, weights=sample_weight, axis=0)
new_variance *= total_weight_sum / new_weight_sum
new_element = (
Member:

I think element here is used where mathematicians might use term?


last_mean : array-like of shape: (n_features,)

last_variance : array-like of shape: (n_features,)
Member:

Suggested change:
- last_variance : array-like of shape: (n_features,)
+ last_variance : None or array-like of shape: (n_features,)

# last = stats until now
# new = the current increment
# updated = the aggregated stats

Member:

Should we route to _incremental_mean_and_var where sample_weight and last_weight_sum are None?
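A sketch of that routing (the guard is illustrative; _incremental_mean_and_var is the existing unweighted helper in sklearn.utils.extmath):

import numpy as np
from sklearn.utils.extmath import _incremental_mean_and_var

def _incremental_weighted_mean_and_var(X, sample_weight, last_mean,
                                       last_variance, last_weight_sum):
    # Illustrative fallback: with no weights and no accumulated weight
    # sum, defer to the existing unweighted helper (counts start at 0).
    if sample_weight is None and last_weight_sum is None:
        last_sample_count = np.zeros(X.shape[1])
        return _incremental_mean_and_var(
            X, last_mean, last_variance, last_sample_count)
    ...  # weighted path (the code under review)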

@jnothman (Member) commented on May 6, 2020:

Sorry for the big delay in reviewing!

@cmarmo (Member) commented on July 16, 2020:

Hi @panpiort8, thanks for your patience! Are you still interested in, and able to, address the comments?

@albertvillanova (Contributor):

I'm taking over this PR.
