Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Kulsinski" dissimilarity seems wrong (Trac #1484) #2009

Closed
scipy-gitbot opened this issue Apr 25, 2013 · 6 comments · Fixed by #14588
Closed

"Kulsinski" dissimilarity seems wrong (Trac #1484) #2009

scipy-gitbot opened this issue Apr 25, 2013 · 6 comments · Fixed by #14588
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected Migrated from Trac scipy.spatial
Milestone

Comments

@scipy-gitbot
Copy link

Original ticket http://projects.scipy.org/scipy/ticket/1484 on 2011-07-28 by trac user muellner, assigned to trac user peridot.

The "Kulsinski" dissimilarity in the module scipy.spatial.distance seems wrong.

First, the function does not do what the documentation

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.kulsinski.html#scipy.spatial.distance.kulsinski

says: it always returns integer values. Maybe this is a forgotten float(...) in scipy/spatial/distance.py, line 421.

Second, the definition does not agree what I find elsewhere on the net. Compare

http://genome.jouy.inra.fr/doc/genome/statistiques/R-2.6.0/library/prabclus/html/kulczynski.html

and

http://sysbio.oxfordjournals.org/content/55/1/170.full

Can someone please check whether the SciPy definition makes sense and/or is intenionally different from the definition found in R?

@scipy-gitbot
Copy link
Author

@WarrenWeckesser wrote on 2011-08-07

The problem with kulsinski returning an integer was fixed in 32f9e3d8e0154da1b941.

However, it does appear that the formula is not correct. It looks like the calculation should be something like (using the variables from the source code):

1 - 0.5 * ntt * (1.0/u.sum() + 1.0/v.sum())

or equivalently

1 - 0.5 * ntt * (1.0/(ntt + ntf) + 1.0/(nft + ntt))

so, for example, the kulsinksi dissimilarity of [1, 1, 0, 0] and [0, 1, 1, 0] should be 0.5. The current implementation gives:

In [10]: kulsinski([1,1,0,0], [0,1,1,0])
Out[10]: 0.83333333333333337

On the other hand, there are these:
http://www.mothur.org/wiki/Kulczynski
http://www.mothur.org/wiki/Kulczynskicody

A definitive reference would be nice to have.

@scipy-gitbot
Copy link
Author

trac user muellner wrote on 2011-08-16

The formula in the second link which warren.weckesser gave (it's called Kulczynski-Cody index there) agrees with my two links. This formula has the advantage that it agrees with R, so it would not confuse users.

But I agree with warren.weckesser that a definitive reference is desirable. Unfortunately, it's in Polish and from 1928. Let's follow back:

Henning and Hausdorf (doi: 10.1080/10635150500481523) cite Shi (doi:10.1016/0031-0182(93)90084-V). This author mentions two similarity (sic) measures, both attributed to Kulczynski. The reference is

Kulczynski, S., 1928, Zespoly róslin w Pieninach. Bull. Int. Acad. Pol. Sci. Lettres, Sér. B, Suppl., 2:57-203

I haven't read the Polish paper. But according to Shi's paper, the measures are

"Kulczynksi unnamed 1" = 1 - (dissimilarity measure from warren.weckessers first link)

"Kulczynksi unnamed 2" = 1 - (dissimilarity measure from the other three links)

I vote for the second version, since more people appear to use this one. warren.weckesser already stated it: 1 - 0.5 * ntt * (1.0/u.sum() + 1.0/v.sum())

@rgommers
Copy link
Member

rgommers commented Oct 2, 2016

This comment has a more comprehensive overview of issues with distance metrics: #3163 (comment)

@rgommers rgommers removed the good first issue Good topic for first contributor pull requests, with a relatively straightforward solution label Aug 9, 2019
@czgdp1807
Copy link
Member

It seems like there isn't any decision made regarding #3163 (comment) and kulsinski still follows the same behaviour as described above.

@rgommers
Copy link
Member

+1 to adding a new metric as proposed there (kulczynski or kulczynski1), and the release after that deprecating kulsinski.

@czgdp1807
Copy link
Member

adding a new metric as proposed there

I can work on this, as it would be an addition, so nothing existing would be affected. If the code is small enough for it then may be we can have this in master.

the release after that deprecating kulsinski.

For this we will be having a lot of time to decide, so possibly not much to worry about this in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected Migrated from Trac scipy.spatial
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants