
_weighted_percentile does not lead to the same result as np.median #17370

Closed
glemaitre opened this issue May 28, 2020 · 10 comments

glemaitre commented May 28, 2020

While reviewing a test in #16937, it appeared that our implementation of _weighted_percentile with unit sample_weight leads to a different result from np.median, which is a bit problematic for consistency.

In the gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using _weighted_percentile there.

import pytest
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))  # average of the two middle values
assert median_sklearn == pytest.approx(X[4])              # only the lower middle value
assert median_numpy == median_numpy_percentile
assert median_sklearn == pytest.approx(median_numpy)      # fails: X[4] != mean(X[4], X[5])
glemaitre commented:

Ping @lucyleeow: since you already looked at the code, you might have some intuition about why this is the case. We should actually add the above test as a regression test for our _weighted_percentile.

glemaitre commented May 28, 2020

Does numpy interpolate while we just do an np.searchsorted?

glemaitre commented:

Yes, numpy takes np.mean([x, x+1]) while we take x.

glemaitre commented:

np.percentile is consistent with np.median

glemaitre commented:

Pinging @NicolasHug @adrinjalali @ogrisel. Making things consistent will impact HistGradientBoosting* with the LAD loss.

It would be interesting to make the change and see which tests break.


lucyleeow commented May 28, 2020

For np.median, if n is odd the median is the middle value (after sorting), and if n is even it is the average of the 2 middle values.

We take the 'lower' percentile, as we use the default parameter np.searchsorted(side='left'). Thus, when n is even, _weighted_percentile is not the same as np.median for unit weights. It is the same when n is odd.
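A minimal sketch of this 'lower' behaviour (a hypothetical standalone helper for illustration, not scikit-learn's private function):

```python
import numpy as np

def weighted_percentile_lower(data, weights, percentile=50.0):
    # Sort the data and accumulate the weights in the same order.
    order = np.argsort(data)
    data_sorted = np.asarray(data)[order]
    cum_weights = np.cumsum(np.asarray(weights)[order])
    # First index whose cumulative weight reaches the target mass;
    # side='left' is what produces the 'lower' percentile.
    target = percentile / 100.0 * cum_weights[-1]
    idx = np.searchsorted(cum_weights, target, side="left")
    return data_sorted[idx]

x = np.arange(10, dtype=float)
print(weighted_percentile_lower(x, np.ones(10)))  # 4.0: lower middle value
print(np.median(x))                               # 4.5: average of the two middle values
```

With unit weights and even n, the cumulative weight hits exactly 50% of the total mass at the lower of the two middle values, so no averaging ever happens.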

We could amend _weighted_percentile to use an interpolated median, described well here: #6217 (comment) (I have an implementation of this already, as I didn't realise _weighted_percentile existed). With this implementation, the result always matches np.median with unit weights.

In the end, I think the difference between the 'lower' weighted percentile and the interpolated weighted percentile would generally be small for large n. Though I guess the gap between the '2 middle values' would also affect how different they are.
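One possible interpolated variant is sketched below. It uses midpoint plotting positions, one common choice for interpolated weighted quantiles; the function name is hypothetical and this is not necessarily the exact scheme from #6217:

```python
import numpy as np

def weighted_percentile_linear(data, weights, percentile=50.0):
    # Sort data and weights together.
    order = np.argsort(data)
    data_sorted = np.asarray(data)[order]
    w = np.asarray(weights)[order]
    # Midpoint plotting positions: each point sits at the centre of the
    # probability mass contributed by its weight.
    positions = (np.cumsum(w) - 0.5 * w) / np.sum(w)
    # Linearly interpolate between neighbouring positions.
    return np.interp(percentile / 100.0, positions, data_sorted)

x = np.sort(np.random.RandomState(42).randn(10))
# With unit weights this matches np.median for both even and odd n.
print(np.isclose(weighted_percentile_linear(x, np.ones(10)), np.median(x)))  # True
```

For even n and unit weights, 0.5 falls exactly halfway between the positions of the two middle values, so the interpolation returns their average, just like np.median.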


lucyleeow commented May 28, 2020

For np.percentile, you can select how it is calculated with the interpolation parameter:

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

The default is 'linear', which performs the same as np.median.
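A quick comparison of the two choices (note: in NumPy >= 1.22 this parameter was renamed from interpolation to method):

```python
import numpy as np

x = np.arange(10, dtype=float)  # even length, so the median falls between two values

print(np.percentile(x, 50, method="linear"))  # 4.5, identical to np.median
print(np.percentile(x, 50, method="lower"))   # 4.0, the 'lower' choice that
                                              # _weighted_percentile makes for unit weights
```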

glemaitre commented:

I think that our _weighted_percentile should offer the same default, and maybe an interpolation parameter if required.

lucyleeow commented:

linear = linear interpolation.

lorentzenchr commented:

Our _weighted_percentile implements the method inverted_cdf. It is important to note that quantiles are interval-valued quantities: within that interval, one is free to choose any value. On top of that, different sample estimators exist. So we are well within statistical uncertainty.

np.median says

Given a vector V of length N, the median of V is the middle value of a sorted copy of V, V_sorted, i.e. V_sorted[(N-1)/2] when N is odd, and the average of the two middle values of V_sorted when N is even.

I do not know how to generalize this to account for sample weights. Therefore closing.
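For reference, newer NumPy exposes this estimator directly, which makes the relationship visible (method="inverted_cdf" requires NumPy >= 1.22):

```python
import numpy as np

x = np.arange(10, dtype=float)

# "inverted_cdf": the smallest value whose empirical CDF reaches the quantile,
# i.e. the same 'lower' choice _weighted_percentile makes for unit weights.
print(np.quantile(x, 0.5, method="inverted_cdf"))  # 4.0
print(np.median(x))                                # 4.5
```

Both answers are valid picks from the interval-valued median; they simply correspond to different estimators.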
