Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Closes #5027 distance function always casts bool to double #5384

Merged
merged 1 commit into from
Jan 26, 2016

Conversation

Garrett-R
Copy link
Contributor

Closes #5027: Always casting to double probably incurred a performance hit for the boolean-based distance metrics such as the "matching distance".

I'm a new SciPy contributor, so let me know if I can do anything better.

Two questions:

  1. Wasn't sure if I should add to the release notes since it seems like it was all done by rgommers in the [last release|https://github.com/scipy/scipy/commit/75c17a04530c2604a241209b1e1ed5c79c59cdff], but on the other hand the new developer docs [say you should add a release note|https://github.com/scipy/scipy/blame/master/HACKING.rst.txt#L243].

  2. I didn't add a test since nothing about the functionality of this function (pdist) actually changed and I couldn't think how to test it. Is that fine?

@@ -107,7 +107,6 @@

from . import _distance_wrap
from ..linalg import norm
import collections
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import.

@pv
Copy link
Member

pv commented Nov 15, 2015

Sorry for the delay:
The following seem to also need _convert_to_double:

  • _distance_wrap.pdist_city_block_wrap(X, dm) needs _convert_to_double(X)
  • elif mstr in set(['cosine', 'cos']): (the calculations should cast to double to avoid overflow for integer inputs)
  • elif mstr in set(['old_cosine', 'old_cos']):
  • potentially also other places below

I would suggest the following for clarity: instead of

_distance_wrap.pdist_sqeuclidean_wrap(_convert_to_double(X), dm)

write

X = _convert_to_double(X)
_distance_wrap.pdist_sqeuclidean_wrap(X, dm)

and require that each elif branch must start with one of

X = _convert_to_double(X)
X = _convert_to_bool(X)

Then it's possible to check no casts are missing.

@pv
Copy link
Member

pv commented Nov 15, 2015

The cdist routine also seems to have the same issue (not necessary to fix in this PR though).

@pv pv added the needs-work Items that are pending response from the author label Nov 15, 2015
@Garrett-R Garrett-R force-pushed the fix-5027 branch 2 times, most recently from 1a093bb to 58da4e8 Compare November 27, 2015 01:18
@@ -1249,34 +1249,41 @@ def dfun(u, v):
elif isinstance(metric, string_types):
mstr = metric.lower()

#if X.dtype != np.double and \
# (mstr != 'hamming' and mstr != 'jaccard'):
# TypeError('A double array must be passed.')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this 7 year old commented code.

@Garrett-R
Copy link
Contributor Author

@pv, right I agree with all those things you said. I also did the cdist function while I was at it.

This should be ready to (re-)review.

@larsmans
Copy link
Contributor

This needs a rebase as #5665 got rid of the matching distance (because it's a duplicate of the Hamming distance).

Always casting to double probably incurred a performance hit for
the boolean-based distance metrics such as the "matching distance".
@Garrett-R
Copy link
Contributor Author

Thanks, @larsmans, I've rebased it.

@codecov-io
Copy link

@@            master   #5384   diff @@
======================================
  Files          235     235       
  Stmts        43211   43281    +70
  Branches      8163    8163       
  Methods          0       0       
======================================
+ Hit          33537   33609    +72
+ Partial       2602    2598     -4
- Missed        7072    7074     +2

Review entire Coverage Diff as of 3aeb659

Powered by Codecov. Updated on successful CI builds.

@larsmans
Copy link
Contributor

I'm merging this as it's a big improvement. Thanks @Garrett-R!

larsmans added a commit that referenced this pull request Jan 26, 2016
BUG: Closes #5027 distance function always casts bool to double
@larsmans larsmans merged commit 59f339e into scipy:master Jan 26, 2016
@Garrett-R Garrett-R deleted the fix-5027 branch January 26, 2016 21:02
@Garrett-R
Copy link
Contributor Author

Thanks!

@rgommers rgommers added enhancement A new feature or improvement and removed needs-work Items that are pending response from the author labels Jan 26, 2016
@rgommers rgommers added this to the 0.18.0 milestone Jan 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants