
_weighted_percentile does not lead to the same result as np.median #17370

Closed
glemaitre opened this issue May 28, 2020 · 10 comments

glemaitre commented May 28, 2020

While reviewing a test in #16937, it appeared that our implementation of _weighted_percentile with unit sample_weight leads to a different result from np.median, which is a bit problematic for consistency.

In the gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using _weighted_percentile there.

import pytest
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))  # average of the two middle values
assert median_sklearn == pytest.approx(X[4])              # only the lower middle value
assert median_numpy == median_numpy_percentile
assert median_sklearn == pytest.approx(median_numpy)      # fails: X[4] != mean(X[4], X[5])
glemaitre commented:

Ping @lucyleeow: since you already looked at the code, you might have some intuition about why this is the case. We should actually add the above test as a regression test for our _weighted_percentile.

glemaitre commented May 28, 2020

Does numpy interpolate while we just do an np.searchsorted?

glemaitre commented:

Yes, numpy takes np.mean([x, x+1]) while we take x.

glemaitre commented:

np.percentile is consistent with np.median

glemaitre commented:

Pinging @NicolasHug @adrinjalali @ogrisel. Making things consistent will impact HistGradientBoosting* with the LAD loss.

It would be interesting to make the change and see which tests break.


lucyleeow commented May 28, 2020

For np.median, if n is odd the median is the middle value (after sorting), and if n is even it is the average of the 2 middle values.

We take the 'lower' percentile, as we use the default parameter np.searchsorted(side='left'). Thus, when n is even, _weighted_percentile is not the same as np.median for unit weights. It is the same when n is odd.
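A minimal sketch of this 'lower' behaviour (a hypothetical standalone helper for illustration, not scikit-learn's private function):

```python
import numpy as np

def weighted_percentile_lower(data, weights, percentile=50.0):
    # Sort the data and accumulate the weights in the same order.
    order = np.argsort(data)
    data_sorted = np.asarray(data)[order]
    cum_weights = np.cumsum(np.asarray(weights)[order])
    # First index whose cumulative weight reaches the target mass;
    # side='left' is what produces the 'lower' percentile.
    target = percentile / 100.0 * cum_weights[-1]
    idx = np.searchsorted(cum_weights, target, side="left")
    return data_sorted[idx]

x = np.arange(10, dtype=float)
print(weighted_percentile_lower(x, np.ones(10)))  # 4.0: lower middle value
print(np.median(x))                               # 4.5: average of the two middle values
```

With unit weights and even n, the cumulative weight hits exactly 50% of the total mass at the lower of the two middle values, so no averaging ever happens.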

We could amend _weighted_percentile to use an interpolated median, described well here: #6217 (comment) (I have an implementation of this already, as I didn't realise _weighted_percentile existed). With this implementation, the result always matches np.median with unit weights.

In the end, I think the difference between the 'lower' weighted percentile and the interpolated weighted percentile would generally be small for large n. Though I guess the gap between the '2 middle values' would also affect how different they are.
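One possible interpolated variant is sketched below. It uses midpoint plotting positions, one common choice for interpolated weighted quantiles; the function name is hypothetical and this is not necessarily the exact scheme from #6217:

```python
import numpy as np

def weighted_percentile_linear(data, weights, percentile=50.0):
    # Sort data and weights together.
    order = np.argsort(data)
    data_sorted = np.asarray(data)[order]
    w = np.asarray(weights)[order]
    # Midpoint plotting positions: each point sits at the centre of the
    # probability mass contributed by its weight.
    positions = (np.cumsum(w) - 0.5 * w) / np.sum(w)
    # Linearly interpolate between neighbouring positions.
    return np.interp(percentile / 100.0, positions, data_sorted)

x = np.sort(np.random.RandomState(42).randn(10))
# With unit weights this matches np.median for both even and odd n.
print(np.isclose(weighted_percentile_linear(x, np.ones(10)), np.median(x)))  # True
```

For even n and unit weights, 0.5 falls exactly halfway between the positions of the two middle values, so the interpolation returns their average, just like np.median.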


lucyleeow commented May 28, 2020

For np.percentile, you can select how it is calculated with the interpolation parameter:

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

The default is 'linear', which performs the same as np.median.
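A quick comparison of the two choices (note: in NumPy >= 1.22 this parameter was renamed from interpolation to method):

```python
import numpy as np

x = np.arange(10, dtype=float)  # even length, so the median falls between two values

print(np.percentile(x, 50, method="linear"))  # 4.5, identical to np.median
print(np.percentile(x, 50, method="lower"))   # 4.0, the 'lower' choice that
                                              # _weighted_percentile makes for unit weights
```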

glemaitre commented:

I think that our _weighted_percentile should offer the same default, and maybe an interpolation parameter if required.

lucyleeow commented:

linear = linear interpolation.

lorentzenchr commented:

Our _weighted_percentile implements the method inverted_cdf. It is important to note that quantiles are interval-valued quantities: within that interval, one is free to choose any value. On top of that, different sample estimators exist. So we are well within statistical uncertainty.

np.median says

Given a vector V of length N, the median of V is the middle value of a sorted copy of V, V_sorted, i.e. V_sorted[(N-1)/2] when N is odd, and the average of the two middle values of V_sorted when N is even.

I do not know how to generalize this to account for sample weights. Therefore closing.
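For reference, newer NumPy exposes this estimator directly, which makes the relationship visible (method="inverted_cdf" requires NumPy >= 1.22):

```python
import numpy as np

x = np.arange(10, dtype=float)

# "inverted_cdf": the smallest value whose empirical CDF reaches the quantile,
# i.e. the same 'lower' choice _weighted_percentile makes for unit weights.
print(np.quantile(x, 0.5, method="inverted_cdf"))  # 4.0
print(np.median(x))                                # 4.5
```

Both answers are valid picks from the interval-valued median; they simply correspond to different estimators.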
