Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong computation of weighted minkowski distance in cdist #5718

Closed
arseniev opened this issue Jan 15, 2016 · 13 comments · Fixed by #5797
Closed

Wrong computation of weighted minkowski distance in cdist #5718

arseniev opened this issue Jan 15, 2016 · 13 comments · Fixed by #5797
Labels
good first issue Good topic for first contributor pull requests, with a relatively straightforward solution scipy.spatial
Milestone

Comments

@arseniev
Copy link

There is an error in the weighted minkowski distance computation: in scipy/scipy/spatial/src/distance_impl.h, lines 363, 364:
Currently:
d = fabs(u[i] - v[i]) * w[i];
s = s + pow(d, p);
Should be:
d = fabs(u[i] - v[i]);
s = s + w[i]*pow(d, p);

The weighting coefficient shouldn't be powered.

@maniteja123
Copy link
Contributor

Hi, I am not sure whether this is relevant, but even here, the weighting coefficient is powered. Apologies since I am not an expert, but just wanted to point it out. Hope it helps.

@arseniev
Copy link
Author

Thanks for the reply. Even the docstring for the function, you've referenced to, says:
.. math::
\left(\sum{(w_i |u_i - v_i|^p)}\right)^{1/p}.
The coefficient is out of the power. It seems, that the implementation you've mentioned is also incorrect...
I'm not an expert either,
but here are some link I've found:
http://stats.stackexchange.com/questions/15289/when-to-use-weighted-euclidean-distance-and-how-to-determine-the-weights-to-use
http://www.softcomputing.es/estylf08/es/006-6-176.pdf

In weighted Euclidean distance, weights are not squared, and this is just a special case of the weighted minkowski. Probably I'm wrong, I would be grateful for the clarification of the definition.

@josef-pkt
Copy link
Member

my guess is that this is just a definition issue, because weights can be just transformed by the power, AFAICS.

analog: in least squares, p=2, I got confused for some time whether weights are defined for the residual u_i - v_i or the squared residuals, i.e. is it w**2 or sqrt(w)

WLS is usually defined as squared weights which would be outside, i.e. w[i]*pow(d, p), but that just looks like a convention to me (like using variance instead of standard deviation).

@arseniev
Copy link
Author

Thanks for the reply! In this case, the docstring for the scipy.spatial.distance.wminkowski function should be changed, because now the weight is not powered according to the formula in the docstring.

@ev-br
Copy link
Member

ev-br commented Feb 2, 2016

Looks like the easiest fix is to fix the docs. Am adding an easy-fix label then

@ev-br ev-br added the good first issue Good topic for first contributor pull requests, with a relatively straightforward solution label Feb 2, 2016
maniteja123 added a commit to maniteja123/scipy that referenced this issue Feb 2, 2016
Clarify that the weights are powered in the computation of weighted minkowski distance in cdist. closes scipygh-5718
@ev-br ev-br added this to the 0.18.0 milestone Feb 2, 2016
@quantology
Copy link

This resolution is simply incorrect. Weights should not be powered, doing so violates the logic of weights. I'm working on a PR that will fix this and add weights broadly in stats and spatial.distance, but would be happy to have the conversation here while I get that ready.

@rgommers
Copy link
Member

rgommers commented Dec 2, 2016

@metaperture if we want to change this, it would need a FutureWarning in the next release (0.19.0) and the change itself in the one after that (1.0).

Changing it would require that there's a uniform definition in the literature. The first two papers I checked do have w without power:

The third one explicitly discuss linear (without power) and nonlinear ones (with power) though:

That last paper also happens to be the most highly cited one. So it's not 100% clear. In order to accommodate both versions and not break backwards compatibility, could we add a parameter to choose between these two versions of weights?

@quantology
Copy link

quantology commented Jan 6, 2017

The paper you reference is not proposing that that is what weights mean, it's proposing that as a variation of the standard algorithm (not even of the Minkowski metric, of the Euclidean metric). Note that even in that paper, it calls that approach the non-standard one. It's also saying to use $w' = w^\beta$, then apply the weights as you would normally--which would be easier under the correct implementation than it would be now (the current implementation already assumes $\beta = p$, which is a special case of that subpoint of that paper).

Weights define a probability measure on the data, nonlinear weights would violate that, and a host of related transforms that I'll be referencing in my PR. You can do that for particular algorithmic purposes (Boosting uses weights for its own purposes that aren't supposed to have generalizable meaning), and that's what it seems like is being done here.

I'll be uploading my pull request tonight, working on some python 2.7 issues right now (damn oldstyle varargs...).

@josef-pkt
Copy link
Member

Weights define a probability measure on the data, nonlinear weights would violate that

not in most usages for weights that I have seen
see for example the link from an earlier comment
http://stats.stackexchange.com/questions/15289/when-to-use-weighted-euclidean-distance-and-how-to-determine-the-weights-to-use

(I would consider weights as the reciprocal of the standard deviation in this case, so that the fabs(u[i] - v[i]) * w[i] is the standardized difference, and the power function defines the "kernel" for the distance.)

(I don't have an opinion here, but I do in stats applications.)

@quantology
Copy link

If you mean that weights are the reciprocal of the variance (std dev squared), then yes, that's a common equivalent transpose of the problem--if you define a probability space where the measure is the inverse variance, then the expected risk minimizer on that space is also the one with the least variance.

Weights as measure is the ERM approach, weights like inverse variance is (often, but not always) the purely frequentist equivalent.

@quantology
Copy link

Would love to move the discussion to the PR.

@josef-pkt
Copy link
Member

josef-pkt commented Jan 6, 2017

No, I mean standard deviation (sqrt of variance).
from the initial description above

Currently:
d = fabs(u[i] - v[i]) * w[i];
s = s + pow(d, p);

that's exactly what we are doing in kernel regression (IIRC) where w would be the inverse bandwidth, and in robust estimation where pow would be some robust function.

If p = 2, then this coincides with using the variance, but not if p != 2

@quantology
Copy link

quantology commented Jan 6, 2017

Do you have a reference for that? The link you gave seems to be inv var weighted.

Edit: Oh you were referring to the original minkowski defintion. Got it. Would still rather have this conversation in the PR if you don't mind :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good topic for first contributor pull requests, with a relatively straightforward solution scipy.spatial
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants