[MRG + 1] BUG: remove checks from PyFunc distance metric (fixes #6287) #6288

jakevdp · 2016-02-05T17:08:55Z

This check makes too many assumptions about the user-defined distance, and should probably be removed. (fixes #6287)

agramfort · 2016-02-05T17:24:51Z

LGTM

yenchenlin · 2016-02-05T21:14:00Z

Once this pilot check is deleted, users may receive ambiguous error messages.

For example, if a custom distance function returns a string instead of a float, the user now sees:

File "sklearn/neighbors/dist_metrics.pyx", line 1114, in sklearn.neighbors.dist_metrics.PyFuncDistance.dist (sklearn/neighbors/dist_metrics.c:11202)
TypeError: a float is required

What if we just change this check a little, as in ~~#6289~~?

It will raise

ValueError("Customize distance function must return a float")

in the above case.

jakevdp · 2016-02-05T21:48:26Z

I'd initially put the check where it is because it happens only once. A check such as #6289 will happen for every evaluation, and I'm afraid the impact on performance will be quite large (though admittedly, the user-defined function is not particularly performant as-is).

yenchenlin · 2016-02-06T04:16:39Z

@jakevdp Thanks for your useful opinion!

What do you think about changing
https://github.com/jakevdp/scikit-learn/blob/fix6287/sklearn/neighbors/dist_metrics.pyx#L1103
to

d = self.func(x1arr, x2arr, **self.kwargs)
try:
    return d
except TypeError:
    raise TypeError("Customize function must return a float")

Since the usual case (i.e. the user didn't do something silly) raises no exception, I think the try-except will be very cheap.
What do you think?

yenchenlin · 2016-02-06T07:19:06Z

I've tested it with the following script:

import numpy as np
from sklearn.neighbors import BallTree
import timeit

n_samples = 10 ** 5
n_dim = 100
X = np.asarray(range(n_samples * n_dim)).reshape(n_samples, n_dim)

def correct_distance(x, y):
    return np.sum((x - y) ** 2)

def balltree():
    b = BallTree(X, metric=correct_distance)

time = timeit.Timer(balltree)
print(min(time.repeat(number=10)))

The version without try-except prints 80.8509781361 s, while the version with try-except prints 80.3812861443 s, which means that adding try-except here barely affects performance.
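
To isolate the cost of the guard itself from tree construction and NumPy arithmetic, a pure-Python micro-benchmark along these lines can help. It is only a sketch: the function names are hypothetical, and the Cython-compiled `dist` method may behave differently from interpreted Python.

```python
import timeit


def returns_directly(value):
    # Baseline: hand the value straight back to the caller.
    return value


def returns_inside_try(value):
    # Same return, wrapped in the guard proposed in this thread.
    try:
        return value
    except TypeError:
        raise TypeError("Custom distance function must return a float")


def compare(number=200000):
    """Return best-of-3 timings in seconds for (unguarded, guarded)."""
    plain = min(timeit.repeat(lambda: returns_directly(1.0),
                              number=number, repeat=3))
    guarded = min(timeit.repeat(lambda: returns_inside_try(1.0),
                                number=number, repeat=3))
    return plain, guarded


if __name__ == "__main__":
    plain, guarded = compare()
    print("without try/except: %.4f s" % plain)
    print("with try/except:    %.4f s" % guarded)
```

The absolute numbers depend on the machine and interpreter; only the difference between the two lines is of interest.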

jakevdp · 2016-02-06T13:55:09Z

@yenchenlin1994 – great idea! I added that to the PR.

GaelVaroquaux · 2016-02-07T09:51:30Z

sklearn/neighbors/dist_metrics.pyx

+            d = self.func(x1arr, x2arr, **self.kwargs)
+            try:
+                return d
+            except TypeError:


Maybe it's my knowledge of Python that is lacking, but I am unsure that the exception will really be captured the way we expect.

Indeed, if I understand things correctly, the exception that we are expecting will be raised outside this function. Hence, I suspect that the try/except will not trigger. It seems to me that whether it triggers depends on the frame in which the exception is raised.

I think that it would be great to have a test that shows us that the exception is indeed raised, given that it is not trivial.

yenchenlin · 2016-02-07

Hello @GaelVaroquaux,

I think the test can be something like the following script. It shows that this exception is indeed raised when a custom distance function returns a non-float.

from sklearn.neighbors import BallTree
import timeit
import numpy as np

def wrong_distance(x, y):
    return "1"

n_samples = 10 ** 3
n_dim = 10
X = np.asarray(range(n_samples * n_dim)).reshape(n_samples, n_dim)
b = BallTree(X, metric=wrong_distance)

jakevdp · 2016-02-07

Thanks @yenchenlin1994, that looks good. Would you like to submit a PR with that test? You'll have to put it in a function, probably in sklearn/neighbors/tests/test_dist_metrics.py. You can use numpy.testing.assert_raises_regexp to assert that the expected exception is being raised. One more comment: to make the test faster, you could use far fewer than 1000x10 points: even something like 5x2 would probably do it.

Let me know if you need help putting that test together!

jakevdp · 2016-02-07

Oh, and @GaelVaroquaux – I had the same thought! I had to run a script like the one @yenchenlin1994 suggested to convince myself that it would catch the exception. I suspect it's caught because Cython generates code that does the type checking in the same block as the return statement.

amueller · 2016-02-09

That's slightly crazy ^^
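
A plain-Python sketch of why the frame question matters. In interpreted Python, `return d` performs no conversion, so a `try` around it can never fire; the guard only works when a conversion to a C double happens at the return statement, as Cython appears to do for a typed return. `coerce_to_double` is a hypothetical stand-in for that generated conversion, and both `dist_*` functions are illustrative names, not scikit-learn API.

```python
import numbers


def coerce_to_double(value):
    # Hypothetical stand-in for the conversion Cython emits when a
    # function declared to return a C double returns a Python object.
    if not isinstance(value, numbers.Real):
        raise TypeError("a float is required")
    return float(value)


def dist_plain_python(func, x1, x2):
    d = func(x1, x2)
    try:
        return d  # no conversion here, so the except branch is unreachable
    except TypeError:
        raise TypeError("Custom distance function must return a float")


def dist_typed_return(func, x1, x2):
    d = func(x1, x2)
    try:
        return coerce_to_double(d)  # conversion runs inside the try block
    except TypeError:
        raise TypeError("Custom distance function must return a float")


def wrong_distance(x, y):
    return "1"
```

With `wrong_distance`, `dist_plain_python` hands the string back unchanged and any failure would surface later in the caller, while `dist_typed_return` raises the clearer `TypeError` from inside the function.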

yenchenlin · 2016-02-08T03:51:19Z

@jakevdp Thanks, I'll send a PR right after this PR gets merged.

jakevdp · 2016-02-08T05:05:14Z

We'll probably want the tests before merging this PR. You could either write a PR to my branch, or write a PR to master now and I'll cherry-pick the commit.

yenchenlin · 2016-02-08T15:46:16Z

@jakevdp I've written a PR to your branch.
Please let me know if I did it wrong.
Thanks!

jakevdp · 2016-02-18T18:29:11Z

I think this can be merged.

amueller · 2016-10-07T23:00:31Z

@jakevdp can you please rebase?

amueller · 2016-10-07T23:01:50Z

It would be kinda nice to add a regression test against the original issue, i.e. have a metric that fails on 10d data but works on 3d data and test it with 3d data?
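
A sketch of the scenario behind this suggestion, in plain Python so it runs without scikit-learn. It assumes the removed check called the metric once on a 10-dimensional probe vector at construction time, which the suggestion implies but the thread does not show; `make_metric_with_pilot_check` and `make_metric` are hypothetical stand-ins for the old and new constructors.

```python
import random


def metric_3d(x, y):
    # Custom metric that is only defined for 3-dimensional points.
    if len(x) != 3 or len(y) != 3:
        raise ValueError("metric_3d expects 3-dimensional points")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def make_metric_with_pilot_check(func):
    # Assumed old behaviour: probe func once with a 10-d vector.
    probe = [random.random() for _ in range(10)]
    func(probe, probe)  # fails for metrics tied to another dimension
    return func


def make_metric(func):
    # Behaviour after this PR: no probe, func only sees the user's data.
    return func


X = [[0.0, 0.0, 0.0], [1.0, 2.0, 2.0]]
metric = make_metric(metric_3d)
squared_distance = metric(X[0], X[1])  # 1 + 4 + 4 = 9.0
```

A regression test along these lines would assert that the 3-d metric can be constructed and evaluated on 3-d data, which the probing constructor would reject before ever seeing that data.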

jakevdp · 2016-10-10T14:11:48Z

Rebased. Let me add a couple tests...

jakevdp · 2016-10-10T14:35:02Z

Regression test and @yenchenlin's test added. If all tests pass, I think this can be merged.

jakevdp · 2016-10-10T16:19:27Z

Tests failed on old numpy versions... switched to using sklearn's backport of assert_raises_regex. We'll see if that does the trick

jakevdp · 2016-10-10T16:56:13Z

Flake8 error due to my over-zealous copy-paste

amueller · 2016-10-10T19:48:39Z

LGTM if everything passes. @agramfort still looks good to you?

jakevdp · 2016-10-10T23:15:08Z

Tests all pass. Good to merge?

jnothman · 2016-10-11T02:20:13Z

LGTM

jnothman · 2016-10-11T02:24:35Z

Added what's new in 0948ce9

GaelVaroquaux reviewed Feb 7, 2016
jakevdp changed the title ~~BUG: remove checks from PyFunc distance metric (fixes #6287)~~ [MRG] BUG: remove checks from PyFunc distance metric (fixes #6287) Feb 18, 2016

amueller mentioned this pull request Oct 7, 2016

Unwanted calls to custom pyfunc metric with DistanceMetric #6500

Closed

amueller added this to the 0.19 milestone Oct 8, 2016

jakevdp and others added 2 commits October 10, 2016 07:09

BUG: remove checks from PyFunc distance metric (fixes scikit-learn#6287)

be3fa17

neighbors: more useful error for bad custom distance metric

5889347

jakevdp force-pushed the fix6287 branch from 4cb8eee to 5889347 Compare October 10, 2016 14:11

TST: add regression test for scikit-learn#6288

9402e17

TST: use sklearn's assert_raises_regex

eb0ae2a

TST: remove duplicate function definition

4ca7ed0

amueller changed the title ~~[MRG] BUG: remove checks from PyFunc distance metric (fixes #6287)~~ [MRG + 1] BUG: remove checks from PyFunc distance metric (fixes #6287) Oct 10, 2016

jnothman merged commit cbd3bca into scikit-learn:master Oct 11, 2016

jnothman added a commit that referenced this pull request Oct 11, 2016

DOC What's new for #6288

0b352a7

amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016

[MRG + 1] FIX: remove checks from PyFunc distance metric (fixes sciki…

76b8cfa

…t-learn#6287) (scikit-learn#6288) # Conflicts: # sklearn/neighbors/tests/test_dist_metrics.py

amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016

DOC What's new for scikit-learn#6288

d280fd8

# Conflicts: # doc/whats_new.rst

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG + 1] FIX: remove checks from PyFunc distance metric (fixes sciki…

8f1f838

…t-learn#6287) (scikit-learn#6288)

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

DOC What's new for scikit-learn#6288

3f7c400

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG + 1] FIX: remove checks from PyFunc distance metric (fixes sciki…

d9a116e

…t-learn#6287) (scikit-learn#6288)

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

DOC What's new for scikit-learn#6288

17b93db
