Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

ENH: add distance gufuncs to `scipy.spatial` #3163

Open
wants to merge 10 commits into
from

Conversation

Projects
None yet
6 participants
Member

jaimefrio commented Dec 19, 2013

This provides a new submodule umath_distance to scipy.spatial, that provides gufuncs for distance calculations. The Python wrapper to the C coded gufuncs allows pdist and cdist style calculations with a simpler, common interface.

Only functions of two inputs with no additional parameters are included in this version, i.e. everything provided in scipy.spatial.distance except mahalanobis, minkowski, seuclidean and wminkowski.

jaimefrio added some commits Dec 19, 2013

@jaimefrio jaimefrio ENH: add distance gufuncs to `scipy.spatial`
This provides a new submodule `umath_distance` to `scipy.spatial`, that provides gufuncs for distance calculations. The Python wrapper to the C coded gufuncs allows `pdist` and `cdist` style calculations with a simpler, common interface.

Only functions of two inputs with no additional parameters are included in this version, i.e. everything provided in `scipy.spatial.distance` except `mahalanobis`, `minkowski`, `seuclidean` and `wminkowski`.
68d81fd
@jaimefrio jaimefrio ENH: add distance gufuncs to `scipy.spatial`
This provides a new submodule `umath_distance` to `scipy.spatial`, that provides gufuncs for distance calculations. The Python wrapper to the C coded gufuncs allows `pdist` and `cdist` style calculations with a simpler, common interface.

Only functions of two inputs with no additional parameters are included in this version, i.e. everything provided in `scipy.spatial.distance` except `mahalanobis`, `minkowski`, `seuclidean` and `wminkowski`.
ec9afb7
@jaimefrio jaimefrio Merge branch 'umath_distance' of https://github.com/jaimefrio/scipy i…
…nto umath_distance
f8e2a1a
@jaimefrio jaimefrio ENH: add distance gufuncs to `scipy.spatial`
This provides a new submodule `umath_distance` to `scipy.spatial`, that provides gufuncs for distance calculations. The Python wrapper to the C coded gufuncs allows `pdist` and `cdist` style calculations with a simpler, common interface.

Only functions of two inputs with no additional parameters are included in this version, i.e. everything provided in `scipy.spatial.distance` except `mahalanobis`, `minkowski`, `seuclidean` and `wminkowski`.
ac202a5

Coverage Status

Coverage remained the same when pulling f8e2a1a on jaimefrio:umath_distance into d0c442d on scipy:master.

jaimefrio added some commits Dec 19, 2013

@jaimefrio jaimefrio Removed non-ASCII characters from comments in umath_distance.c.src ca…
…using problems with Python 3.x builds
af85b32
@jaimefrio jaimefrio Merge branch 'umath_distance' of https://github.com/jaimefrio/scipy i…
…nto umath_distance

Conflicts:
	scipy/spatial/src/umath_distance.c.src
b6f8fdb
@jaimefrio jaimefrio Removed non-ASCII characters from umath_distance.c.src, this time wit…
…hout

appending the whole file without the changes at the end...
0de46bc

Coverage Status

Coverage remained the same when pulling 0de46bc on jaimefrio:umath_distance into d0c442d on scipy:master.

Coverage Status

Coverage remained the same when pulling a268142 on jaimefrio:umath_distance into d0c442d on scipy:master.

Owner

pv commented Dec 19, 2013

How about just replacing the 1D vector distance functions in scipy.spatial.distance with these, rather than adding a new submodule? As far as I see, there's nothing blocking this?

@pv pv and 1 other commented on an outdated diff Dec 19, 2013

scipy/spatial/src/umath_distance.c.src
+#define END_OUTER_LOOP }
+
+/*
+ *****************************************************************************
+ ** SOME CONSTANTS **
+ *****************************************************************************
+ */
+
+ static npy_double d_nan;
+ static npy_double d_inf;
+ static npy_double d_ninf;
+
+ static void init_constants(void)
+ {
+ /* It seems that NPY_INFINITY and NPY_NAN macros can't be used as
+ initializers, or so says numpy.linalg.umath_linalg.c.src */
@pv

pv Dec 19, 2013

Owner

Is this needed: these don't seem to be used in initializers as far as I see.

@jaimefrio

jaimefrio Dec 19, 2013

Member

d_nan and d_ninf are not used, you are right about that, but d_inf is used in the calculation of the return value of DOUBLE_bray_curtis_distance, d_ninf will come up in handling all possible values of the parameter of the Minkowski distance, and d_nan may eventually be used . All of these are defined in npy_math.h as a casting of the corresponding float values, so for performance you want to have them casted once and then stored in a local variable, which is what happens in init_constants, which is called at module initialization. The code structure is shamelessly copied from the implementation of umath_linalg.c.src

@pv

pv Dec 19, 2013

Owner

I think performance-wise there is little difference: NPY_* are all inlineable constants. Moreover, static variables need to be loaded, and they may need to be loaded more often than the NPY_* constants, as the compiler cannot always assume they remain constant.

Now that you point out umath_linalg.c.src, also that could use some cleanup, I think.

Owner

rgommers commented Dec 19, 2013

Looks like a nice addition to spatial. At the moment it doesn't work with older numpy though; for 1.5.1 I get 35 errors like:

TypeError: 'out' is an invalid keyword to <func_name>
Member

jaimefrio commented Dec 19, 2013

@pv The whole idea behind this is strongly inspired on the linalg rewrite in 1.8, which kept the gufunc implementations in a separate submodule until it was fully tested and validated, at which point it did replace most of the old functionality. So I just followed that same path. I do see an approach like this eventually replacing most of the actual distance code, so if you give me the go ahead I could merge the Python files into a single one, although I'd rather keep it separate until having included and tested the missing distance functions.

Member

jaimefrio commented Dec 19, 2013

@rgommers Is numpy 1.5.1 the oldest supported version? I don't think I have ever used anything earlier than 1.6, so I am not very sure of what may be going on, but I can compile a version and see if I can figure a way around those errors.

Owner

pv commented Dec 19, 2013

@jaimefrio: OK, for WIP you can take any approach that looks good. IIRC, in Numpy we merged the umath_linalg code only after the stuff was incorporated in the linalg functions.

Owner

rgommers commented Dec 19, 2013

Yes, 1.5.1 is the oldest supported version. We don't have plans to drop it in the near future.

Member

WarrenWeckesser commented Dec 19, 2013

FYI: #2009, #2011

Member

jaimefrio commented Dec 19, 2013

@WarrenWeckesser I hadn't seen the one on Kulsinski, but I did the one on Sokal-Michener, which is already corrected in the code in this PR. The Kulsinski measure is specially annoying, because it uses a defintion that AFAIK is only referenced here, and the guy seems to have produced the formula from thin air, so it should probably be replaced with something entirely different. But yes, it's a big mess... The similarities are more or less well defined, and we could implement 20 or 30 of those, e.g. from here, without too much guessing. But it is often harder to come across explicit definitions of dissimilarities, and doing 1 - similarity isn't always the right thing to do.

Member

jaimefrio commented Dec 19, 2013

@rgommers I'm having trouble getting numpy.distutils 1.5 to properly run MinGW on my system (yes, it's Windows...) , will try to set up an old PC box at home with Ubuntu to try it out, but it will take a few days to get it running. Do you have a log of the failures you got when running it under np 1.5 that I can take a look at?

Owner

rgommers commented Dec 21, 2013

It's 35x the same error. This shows the issue clearly:

In [1]: import scipy.spatial._umath_distance as u

In [2]: x = np.arange(3)

In [3]: y = np.ones(3)

In [4]: u._braycurtis(x, y)
Out[4]: 0.33333333333333331

In [5]: u._braycurtis(x, y, out=y)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-609fd75c99f5> in <module>()
----> 1 u._braycurtis(x, y, out=y)

TypeError: 'out' is an invalid keyword to _braycurtis
Member

jaimefrio commented Dec 21, 2013

@rgommers According to the docs (1.5 vs 1.6), ufuncs started taking an out argument in 1.6... That's a big show stopper for the pdist like calculations: the signature of those gufuncs is (n,d)->(p), and since the output shape is not one of the input ones, it relies on an output array of the right shape being passed in. I see three ways out:

  1. Drop the pdist functionality. That would require keeping the existing implementations, and thus having two sets of functions doing similar calculations. I don't like this.
  2. Set up the iterator from scratch for the pdist case. I don't like this too much either. It is going to turn a clean, standard implementation of a few dozen lines of code, into hundreds, and there are going to be hard to track bugs from messing up Python object reference counts or the like.
  3. What I like best would be to keep the current implementation, hack some extra code along the lines of 1, and choose one code or the other at compile or run time depending on the detected numpy version. I can see that being a problem too, as the performance and features of the module would be different depending on the numpy version, and routine testing (e.g. Travis) will probably not catch changes messing the 1.5.1 implementation...

Any thoughts?

Owner

rgommers commented Dec 21, 2013

Hmm, totally forgot about that - I thought out= had been there for ages.

This code doesn't change a lot, so I think (3) is acceptable. Runtime switching in the Python code and a clear explanation in the docs probably.

My default stack is always based on the lowest supported numpy version, so I'll notice regressions quicly.

Member

WarrenWeckesser commented Dec 21, 2013

I haven't looked too closely at this PR (sorry), but the out argument of ufuncs in numpy 1.5 is an optional positional argument. It should work if you just give an additional argument, without specifying the keyword out.

Owner

rgommers commented Feb 2, 2014

@jaimefrio if you update the code with Warren's suggestion I can test it, so no need for you to fight with distutils and MinGW.

@jaimefrio jaimefrio - removed `out` keyword argument in call to gufuncs for compatibility…
… with

  numpy 1.5.
- added Mahalanobis distance function and kernels
- added module build information to bento.info
abd6d9b
Member

jaimefrio commented Feb 4, 2014

Hi @rgommers , new commit with the out kwarg removed, and a little more work on the C kernels. Please, try and see if numpy 1.5 likes it. Thanks!

Coverage Status

Coverage remained the same when pulling abd6d9b on jaimefrio:umath_distance into d0c442d on scipy:master.

Owner

rgommers commented Feb 4, 2014

@jaimefrio works like a charm with numpy 1.5.1.

@jaimefrio jaimefrio - Added C functions and GUFUNC wrappers for all Minkowski family dist…
…ances.

- Started adding additional distance C functions: Soergel, Wave-HEdges,
  intersection, Kulczynski 1.
5cafb89
Member

jaimefrio commented Feb 5, 2014

This last commit has all the gufunc C code for all the distances already in scipy.spatial.distance, so all that's left is wrapping it up nicely in Python and adding full test coverage.

Other than that, there are a few issues with the current implementation that I would like to change. Some are plain bugs that need to be corrected, but other are more subtle, and I am not sure how to go about them. A list of the main ones off the top of my head follows:

  1. hamming doesn't compute the Hamming distance, but scales it dividing by the vector dimension. They are closely related, but are not quite the same. I would like to have matching and sokalmichener return the old value, but make hamming return the proper unscaled Hamming distance. Can this be happily done, or is it best to first add some form of deprecation warning?
  2. sokalmichener is also wrong in the current implementation. It returns the same as rogerstanimoto, but it should return the same as matching. I think there was an open issue complaining about this already.
  3. wminkowski is also computing a slightly erroneous version of the weighted Minkowski distance. According to the documentation, components are subtracted, their absolute value raised to the p-th power, and then get multiplied by the corresponding weight. This is consistent with the few definitions I have seen in the literature. But the actual implementation multiplies by the weight before raising to the p-th power. Can this be changed happily?
  4. minkowski has a check for p >= 1. That condition is usually enforced for the Minkowski distance to be a metric, i.e. for the triangle inequality to hold. This is a very restrictive condition, because many of the other distances are not proper metrics. Furthermore, the condition is not enforced either in pdist or cdist, which will happily take ps smaller than 1. I would like to drop this as well.
  5. kulsinski returns a value based on a formula that I have found in a paper, but which is not consistent with most of the literature. I would like to either add a kulczynski1 function implementing the proper Kulczynski 1 distance, and eventually remove the current kulsinski, or directly replace the current implementation with the new one. Thoughts on this?
Member

jaimefrio commented Feb 5, 2014

The test failures are all related to the redefinition of hamming I mentioned in point 1 of my previous comment: changed the code but didn't update the tests...

@pv pv added the PR label Feb 19, 2014

Owner

rgommers commented Feb 24, 2014

2, 3 and 4 sound like things that can be considered bugs and therefore changed. less sure about 1 and 5, may have to add deprecations. Would be good to ask on the mailing list for opinions from people using these distances.

@pv pv removed the PR label Aug 13, 2014

This was referenced Sep 1, 2014

Owner

rgommers commented Oct 2, 2016

gh-2009 has more details and references on (5), and gh-2011 on (1, 2).

Owner

rgommers commented Oct 2, 2016

@jaimefrio can you comment on your plans for this PR? It seems to be close to completion....

Found this again after looking at the other issues I just linked to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment