New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] BUG: remove checks from PyFunc distance metric (fixes #6287) #6288

Merged
merged 5 commits into from Oct 11, 2016

Conversation

Projects
None yet
6 participants
@jakevdp
Member

jakevdp commented Feb 5, 2016

This check makes too many assumptions about the user-defined distance, and should probably be removed. (fixes #6287)

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Feb 5, 2016

Member

LGTM

Member

agramfort commented Feb 5, 2016

LGTM

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Feb 5, 2016

Contributor

Hello @jakevdp

Once this pilot check is deleted, users may receive ambiguous error messages.

For example,
if customize distance function now return a value of type string instead of type float

File "sklearn/neighbors/dist_metrics.pyx", line 1114, in sklearn.neighbors.dist_metrics.PyFuncDistance.dist (sklearn/neighbors/dist_metrics.c:11202) TypeError: a float is required

What if just change this check a little bit such as #6289 ?

It will raise

ValueError("Customize distance function must return a float")

in the above case.

Contributor

yenchenlin commented Feb 5, 2016

Hello @jakevdp

Once this pilot check is deleted, users may receive ambiguous error messages.

For example,
if customize distance function now return a value of type string instead of type float

File "sklearn/neighbors/dist_metrics.pyx", line 1114, in sklearn.neighbors.dist_metrics.PyFuncDistance.dist (sklearn/neighbors/dist_metrics.c:11202) TypeError: a float is required

What if just change this check a little bit such as #6289 ?

It will raise

ValueError("Customize distance function must return a float")

in the above case.

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Feb 5, 2016

Member

I'd initially put the check where it is because it happens only once. A check such as #6289 will happen for every evaluation, and I'm afraid the impact on performance will be quite large (though admittedly, the user-defined function is not particularly performant as-is).

Member

jakevdp commented Feb 5, 2016

I'd initially put the check where it is because it happens only once. A check such as #6289 will happen for every evaluation, and I'm afraid the impact on performance will be quite large (though admittedly, the user-defined function is not particularly performant as-is).

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Feb 6, 2016

Contributor

@jakevdp Thanks for your useful opinion!

What do you think if change
https://github.com/jakevdp/scikit-learn/blob/fix6287/sklearn/neighbors/dist_metrics.pyx#L1103
to

d = self.func(x1arr, x2arr, **self.kwargs)
try:
    return d
except TypeError:
    raise TypeError("Customize function must return a float")

Since the usual case (i.e. if user didn't do something silly) is no exception, I think try-except will be very efficient.
What do you think?

Contributor

yenchenlin commented Feb 6, 2016

@jakevdp Thanks for your useful opinion!

What do you think if change
https://github.com/jakevdp/scikit-learn/blob/fix6287/sklearn/neighbors/dist_metrics.pyx#L1103
to

d = self.func(x1arr, x2arr, **self.kwargs)
try:
    return d
except TypeError:
    raise TypeError("Customize function must return a float")

Since the usual case (i.e. if user didn't do something silly) is no exception, I think try-except will be very efficient.
What do you think?

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Feb 6, 2016

Contributor

I've tested it with the following script:

import numpy as np
from sklearn.neighbors import BallTree
import timeit

n_samples = 10 ** 5
n_dim = 100
X = np.asarray(range(n_samples * n_dim)).reshape(n_samples, n_dim)

def correct_distance(x, y):
    return np.sum((x - y) ** 2)

def balltree():
    b = BallTree(X, metric=correct_distance)

time = timeit.Timer(balltree)
print min(time.repeat(number=10))

The no try-except version prints 80.8509781361 s,
while try-except version prints 80.3812861443 s,

which means adding try-except here barely affect the performance.

Contributor

yenchenlin commented Feb 6, 2016

I've tested it with the following script:

import numpy as np
from sklearn.neighbors import BallTree
import timeit

n_samples = 10 ** 5
n_dim = 100
X = np.asarray(range(n_samples * n_dim)).reshape(n_samples, n_dim)

def correct_distance(x, y):
    return np.sum((x - y) ** 2)

def balltree():
    b = BallTree(X, metric=correct_distance)

time = timeit.Timer(balltree)
print min(time.repeat(number=10))

The no try-except version prints 80.8509781361 s,
while try-except version prints 80.3812861443 s,

which means adding try-except here barely affect the performance.

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Feb 6, 2016

Member

@yenchenlin1994 – great idea! I added that to the PR.

Member

jakevdp commented Feb 6, 2016

@yenchenlin1994 – great idea! I added that to the PR.

Show outdated Hide outdated sklearn/neighbors/dist_metrics.pyx
d = self.func(x1arr, x2arr, **self.kwargs)
try:
return d
except TypeError:

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Feb 7, 2016

Member

Maybe it's my knowledge of Python that is lacking, but I am unsure that the exception will be really captured in the way I think.

Indeed, if I understand things correctly, the exception that we are expecting will be raised outside this function. Hence, I suspect that the try/except will not trigger. It seems to me that, whether it triggers or not is depends on the semantics of in which frame the exception is raised.

I think that it would be great to have a test that shows us that the exception is indeed raised, given that it is not trivial.

@GaelVaroquaux

GaelVaroquaux Feb 7, 2016

Member

Maybe it's my knowledge of Python that is lacking, but I am unsure that the exception will be really captured in the way I think.

Indeed, if I understand things correctly, the exception that we are expecting will be raised outside this function. Hence, I suspect that the try/except will not trigger. It seems to me that, whether it triggers or not is depends on the semantics of in which frame the exception is raised.

I think that it would be great to have a test that shows us that the exception is indeed raised, given that it is not trivial.

This comment has been minimized.

@yenchenlin

yenchenlin Feb 7, 2016

Contributor

Hello @GaelVaroquaux

I think test can be something like the following script,
it can show that this exception is indeed raised when customize distance function returns a non-float.

from sklearn.neighbors import BallTree
import timeit
import numpy as np

def wrong_distance(x, y):
    return "1"

n_samples = 10 ** 3
n_dim = 10
X = np.asarray(range(n_samples * n_dim)).reshape(n_samples, n_dim)
b = BallTree(X, metric=wrong_distance)
@yenchenlin

yenchenlin Feb 7, 2016

Contributor

Hello @GaelVaroquaux

I think test can be something like the following script,
it can show that this exception is indeed raised when customize distance function returns a non-float.

from sklearn.neighbors import BallTree
import timeit
import numpy as np

def wrong_distance(x, y):
    return "1"

n_samples = 10 ** 3
n_dim = 10
X = np.asarray(range(n_samples * n_dim)).reshape(n_samples, n_dim)
b = BallTree(X, metric=wrong_distance)

This comment has been minimized.

@jakevdp

jakevdp Feb 7, 2016

Member

Thanks @yenchenlin1994, that looks good. Would you like to submit a PR with that test? You'll have to put it in a function, probably in sklearn/neighbors/tests/test_dist_metrics.pyx. You can use numpy.testing.assert_raises_regexp to assert that the expected exception is being raised. One more comment: to make the test faster, you could do much fewer than 1000x10 points: even something like 5x2 would probably do it.

Let me know if you need help putting that test together!

@jakevdp

jakevdp Feb 7, 2016

Member

Thanks @yenchenlin1994, that looks good. Would you like to submit a PR with that test? You'll have to put it in a function, probably in sklearn/neighbors/tests/test_dist_metrics.pyx. You can use numpy.testing.assert_raises_regexp to assert that the expected exception is being raised. One more comment: to make the test faster, you could do much fewer than 1000x10 points: even something like 5x2 would probably do it.

Let me know if you need help putting that test together!

This comment has been minimized.

@jakevdp

jakevdp Feb 7, 2016

Member

oh, and @GaelVaroquaux – I had the same thought! I had to run a script like the one @yenchenlin1994 suggested to convince myself that it would catch the exception. I suspect the reason it's caught is because Cython produces code which does the type checking in the same block as the return statement.

@jakevdp

jakevdp Feb 7, 2016

Member

oh, and @GaelVaroquaux – I had the same thought! I had to run a script like the one @yenchenlin1994 suggested to convince myself that it would catch the exception. I suspect the reason it's caught is because Cython produces code which does the type checking in the same block as the return statement.

This comment has been minimized.

@amueller

amueller Feb 9, 2016

Member

That's slightly crazy ^^

@amueller

amueller Feb 9, 2016

Member

That's slightly crazy ^^

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Feb 8, 2016

Contributor

@jakevdp Thanks, I'll send a PR right after this PR get merged.

Contributor

yenchenlin commented Feb 8, 2016

@jakevdp Thanks, I'll send a PR right after this PR get merged.

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Feb 8, 2016

Member

We'll probably want the tests before merging this PR. You could either write a PR to my branch, or write a PR to master now and I'll cherry-pick the commit.

Member

jakevdp commented Feb 8, 2016

We'll probably want the tests before merging this PR. You could either write a PR to my branch, or write a PR to master now and I'll cherry-pick the commit.

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Feb 8, 2016

Contributor

@jakevdp I've written a PR to your branch.
Please notify me if I do it wrong.
Thanks!

Contributor

yenchenlin commented Feb 8, 2016

@jakevdp I've written a PR to your branch.
Please notify me if I do it wrong.
Thanks!

@jakevdp jakevdp changed the title from BUG: remove checks from PyFunc distance metric (fixes #6287) to [MRG] BUG: remove checks from PyFunc distance metric (fixes #6287) Feb 18, 2016

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Feb 18, 2016

Member

I think this can be merged.

Member

jakevdp commented Feb 18, 2016

I think this can be merged.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Oct 7, 2016

Member

@jakevdp can you please rebase?

Member

amueller commented Oct 7, 2016

@jakevdp can you please rebase?

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Oct 7, 2016

Member

It would be kinda nice to add a regression test against the original issue, i.e. have a metric that fails on 10d data but works on 3d data and test it with 3d data?

Member

amueller commented Oct 7, 2016

It would be kinda nice to add a regression test against the original issue, i.e. have a metric that fails on 10d data but works on 3d data and test it with 3d data?

@amueller amueller added this to the 0.19 milestone Oct 8, 2016

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Oct 10, 2016

Member

Rebased. Let me add a couple tests...

Member

jakevdp commented Oct 10, 2016

Rebased. Let me add a couple tests...

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Oct 10, 2016

Member

Regression test and @yenchenlin's test added. If all tests pass, I think this can be merged.

Member

jakevdp commented Oct 10, 2016

Regression test and @yenchenlin's test added. If all tests pass, I think this can be merged.

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Oct 10, 2016

Member

Tests failed on old numpy versions... switched to using sklearn's backport of assert_raises_regex. We'll see if that does the trick

Member

jakevdp commented Oct 10, 2016

Tests failed on old numpy versions... switched to using sklearn's backport of assert_raises_regex. We'll see if that does the trick

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Oct 10, 2016

Member

Flake8 error due to my over-zealous copy-paste

Member

jakevdp commented Oct 10, 2016

Flake8 error due to my over-zealous copy-paste

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Oct 10, 2016

Member

LGTM if everything passes. @agramfort still looks good to you?

Member

amueller commented Oct 10, 2016

LGTM if everything passes. @agramfort still looks good to you?

@amueller amueller changed the title from [MRG] BUG: remove checks from PyFunc distance metric (fixes #6287) to [MRG + 1] BUG: remove checks from PyFunc distance metric (fixes #6287) Oct 10, 2016

@jakevdp

This comment has been minimized.

Show comment
Hide comment
@jakevdp

jakevdp Oct 10, 2016

Member

Tests all pass. Good to merge?

Member

jakevdp commented Oct 10, 2016

Tests all pass. Good to merge?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Oct 11, 2016

Member

LGTM

Member

jnothman commented Oct 11, 2016

LGTM

@jnothman jnothman merged commit cbd3bca into scikit-learn:master Oct 11, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Oct 11, 2016

Member

Added what's new in 0948ce9

Member

jnothman commented Oct 11, 2016

Added what's new in 0948ce9

jnothman added a commit that referenced this pull request Oct 11, 2016

amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016

[MRG + 1] FIX: remove checks from PyFunc distance metric (fixes #6287) (
#6288)

# Conflicts:
#	sklearn/neighbors/tests/test_dist_metrics.py

amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016

DOC What's new for #6288
# Conflicts:
#	doc/whats_new.rst

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment