Wrong computation of weighted minkowski distance in cdist #5718
Comments
Hi, I am not sure whether this is relevant, but even here, the weighting coefficient is powered. Apologies since I am not an expert, but just wanted to point it out. Hope it helps.
Thanks for the reply. Even the docstring for the function you referenced says: In weighted Euclidean distance, weights are not squared, and this is just a special case of the weighted Minkowski distance. I may be wrong, and I would be grateful for a clarification of the definition.
My guess is that this is just a definition issue, because the weights can simply be transformed by the power, AFAICS. An analogy: in least squares, p=2, I got confused for some time whether weights are defined for the residual WLS is usually defined as squared weights which would be outside, i.e.
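To make the "weights can be transformed by the power" point concrete, here is a minimal sketch in plain Python. The function names `powered` and `linear` are hypothetical and chosen only for this illustration; they are not SciPy functions. It suggests that the two conventions give the same distance once the weights of the linear form are replaced by `w**p`, and different distances otherwise.

```python
import math

def powered(u, v, w, p):
    # Weight multiplies the difference before the power: sum((w_i * |u_i - v_i|) ** p)
    return sum((wi * abs(ui - vi)) ** p for ui, vi, wi in zip(u, v, w)) ** (1.0 / p)

def linear(u, v, w, p):
    # Weight multiplies the powered difference: sum(w_i * |u_i - v_i| ** p)
    return sum(wi * abs(ui - vi) ** p for ui, vi, wi in zip(u, v, w)) ** (1.0 / p)

# Illustrative values, not taken from the thread.
u = [0.0, 0.0]
v = [1.0, 2.0]
w = [2.0, 3.0]
p = 3

# Raising the weights to the power p maps one convention onto the other.
w_transformed = [wi ** p for wi in w]
same = math.isclose(powered(u, v, w, p), linear(u, v, w_transformed, p))

# With untransformed weights the two conventions disagree.
different = not math.isclose(powered(u, v, w, p), linear(u, v, w, p))
```

Under this reading, the disagreement in the thread is about which convention the name "weight" should refer to, not about what either formula can express.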
Thanks for the reply! In this case, the docstring for the scipy.spatial.distance.wminkowski function should be changed, because now the weight is not powered according to the formula in the docstring.
Looks like the easiest fix is to fix the docs. I am adding an easy-fix label then.
Clarify that the weights are powered in the computation of weighted minkowski distance in cdist. closes scipygh-5718
This resolution is simply incorrect. Weights should not be powered; doing so violates the logic of weights. I'm working on a PR that will fix this and add weights broadly in stats and spatial.distance, but would be happy to have the conversation here while I get that ready.
@metaperture if we want to change this, it would need a Changing it would require that there's a uniform definition in the literature. The first two papers I checked do have
The third one explicitly discusses linear (without power) and nonlinear ones (with power) though: That last paper also happens to be the most highly cited one. So it's not 100% clear. In order to accommodate both versions and not break backwards compatibility, could we add a parameter to choose between these two versions of weights?
The paper you reference is not proposing that that is what weights mean, it's proposing that as a variation of the standard algorithm (not even of the Minkowski metric, of the Euclidean metric). Note that even in that paper, it calls that approach the non-standard one. It's also saying to use Weights define a probability measure on the data, nonlinear weights would violate that, and a host of related transforms that I'll be referencing in my PR. You can do that for particular algorithmic purposes (Boosting uses weights for its own purposes that aren't supposed to have generalizable meaning), and that's what it seems like is being done here. I'll be uploading my pull request tonight, working on some Python 2.7 issues right now (damn old-style varargs...).
Not in most usages for weights that I have seen (I would consider weights as the reciprocal of the standard deviation in this case, so that the (I don't have an opinion here, but I do in stats applications.)
If you mean that weights are the reciprocal of the variance (std dev squared), then yes, that's a common equivalent transpose of the problem--if you define a probability space where the measure is the inverse variance, then the expected risk minimizer on that space is also the one with the least variance. Weights as measure is the ERM approach, weights like inverse variance is (often, but not always) the purely frequentist equivalent.
Would love to move the discussion to the PR.
No, I mean standard deviation (sqrt of variance).
That's exactly what we are doing in kernel regression (IIRC) where If p = 2, then this coincides with using the variance, but not if p != 2
Do you have a reference for that? The link you gave seems to be inv var weighted. Edit: Oh, you were referring to the original Minkowski definition. Got it. Would still rather have this conversation in the PR if you don't mind :)
There is an error in the weighted minkowski distance computation: in scipy/scipy/spatial/src/distance_impl.h, lines 363, 364:
Currently:
d = fabs(u[i] - v[i]) * w[i];
s = s + pow(d, p);
Should be:
d = fabs(u[i] - v[i]);
s = s + w[i]*pow(d, p);
The weighting coefficient shouldn't be powered.
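For illustration, the two loop bodies above can be transcribed into plain Python to see how far the results diverge. This is only a sketch: the function names `wminkowski_current` and `wminkowski_proposed` are hypothetical, and the final `1/p` root is assumed from the usual Minkowski definition, since the quoted lines show only the accumulation step.

```python
def wminkowski_current(u, v, w, p):
    # Mirrors the "Currently" lines: the weight scales the difference
    # before the power is taken, so the weight is raised to p as well.
    s = 0.0
    for ui, vi, wi in zip(u, v, w):
        d = abs(ui - vi) * wi
        s = s + d ** p
    return s ** (1.0 / p)  # assumed final root, not shown in the quoted lines

def wminkowski_proposed(u, v, w, p):
    # Mirrors the "Should be" lines: the weight scales the powered difference.
    s = 0.0
    for ui, vi, wi in zip(u, v, w):
        d = abs(ui - vi)
        s = s + wi * d ** p
    return s ** (1.0 / p)  # assumed final root, not shown in the quoted lines

# Illustrative values, not taken from the issue.
u = [0.0, 0.0]
v = [1.0, 2.0]
w = [2.0, 3.0]
p = 2

current = wminkowski_current(u, v, w, p)    # sqrt((2*1)**2 + (3*2)**2) = sqrt(40)
proposed = wminkowski_proposed(u, v, w, p)  # sqrt(2*1**2 + 3*2**2) = sqrt(14)
```

With all weights equal to 1 the two variants coincide, so the difference only shows up for non-trivial weights.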