FIX Sets dtype for pairwise_distances when metric='seuclidean' #15730
Please add a test
Looks like a scipy bug to me. You can pass a float32 array to
Sure thing -- I've added a test to compare the output of
Yeah, I'm not really sure what they were thinking (see the code snippet below). They force conversion to
Regardless, I think this fix is required in the meantime. If they remove this forced conversion to
Yes, the check for the dtype is weird. The validation of mahalanobis params is the same except for this dtype check. We should report this to scipy.
That's right.
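A minimal sketch of the kind of workaround discussed in this thread: compute the per-feature variance as float64 and pass it explicitly as `V`, so the input arrays can keep any floating-point dtype. The helper name `seuclidean_any_dtype` is hypothetical, and this is not necessarily the change made in this PR; it only illustrates the idea using `scipy.spatial.distance.cdist`.

```python
import numpy as np
from scipy.spatial.distance import cdist


def seuclidean_any_dtype(X, Y=None):
    """Standardized Euclidean distances for any floating-point input dtype.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X)
    # Variance is taken over X alone when Y is omitted, else over both sets.
    data = X if Y is None else np.vstack([X, np.asarray(Y)])
    # Accumulate and return the variance as float64, whatever the input dtype,
    # so that V is always a double array.
    V = np.var(data, axis=0, ddof=1, dtype=np.float64)
    return cdist(X, X if Y is None else np.asarray(Y), metric="seuclidean", V=V)


rng = np.random.RandomState(0)
X64 = rng.random_sample((6, 3))
X32 = X64.astype(np.float32)

D64 = seuclidean_any_dtype(X64)
D32 = seuclidean_any_dtype(X32)
```

The two results differ only by the rounding introduced when the inputs are cast to float32.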
Some comments below. Also please add a what's new entry.
@@ -1273,3 +1273,27 @@ def test_pairwise_distances_data_derived_params(n_jobs, metric, dist_function,

    assert_allclose(dist, expected_dist_explicit_params)
    assert_allclose(dist, expected_dist_default_params)


@pytest.mark.parametrize("metric", ["seuclidean"])
You don't need to parametrize for a single value
Sorry, I had originally tried to include 'mahalanobis' in the tests, but it kept failing because one of the Windows test benches didn't like me trying to use np.double. It was throwing an error in one of the np.linalg libraries, so I removed it, as I didn't think it was necessary to test for this purpose.
Anyway, this is fixed in my recent commit.
# check that pairwise distances gives the same result as pdist and cdist
# regardless of input datatype
Please mention that the seuclidean metric used to raise an error on non double dtype, and add the PR number.
Also fixed in recent commit
I opened scipy/scipy#11171 to fix this issue upstream.
Co-Authored-By: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@jeremiedbb thanks for opening that issue with scipy. I've added a new "What's new" entry; however, I'm not very confident that I've gotten the syntax/conventions right. I've tried to mimic the other entries, but let me know if there are any mistakes.
Just a last nitpick from my side. Otherwise looks good. Thanks @ForrestCKoch
Co-Authored-By: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
    "dtype",
    [np.half, np.float, np.double, np.longdouble])
@pytest.mark.parametrize("y_is_x", [True, False], ids=["Y is X", "Y is not X"])
def test_pairwise_distances_input_datatypes(dtype, y_is_x):
This should probably mention seuclidean as it only tests that metric.
Or, rather than adding a test, we should be modifying another test to ensure invariance to dtype across metrics
> This should probably mention seuclidean as it only tests that metric.
To clarify, you mean I should change the name of the test? Otherwise it is mentioned in the following comments.
> Or, rather than adding a test, we should be modifying another test to ensure invariance to dtype across metrics
I had originally tried to extend this to include other metrics, but 'mahalanobis' will fail on the Windows py35_pip_openblas_32bit benchmark because np.linalg doesn't like the usage of np.double, presumably due to the 32-bit OpenBLAS library.
Then it might be best to test on all metrics but xfail the particularly problematic combinations.
At the moment it looks quite strange that you are only testing seuclidean here.
Please see my most recent revisions. I've added each of the non-boolean scipy distance metrics to the tests. A test case is marked xfail only when all of the following hold:
metric == 'mahalanobis'
dtype == np.longdouble
sys.platform.startswith('win')
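The three conditions above can be combined into a single predicate, sketched below. The helper `should_xfail` and the test name are hypothetical and are not taken from the PR's diff; the test body is a placeholder standing in for the real comparison against pdist and cdist.

```python
import sys

import numpy as np
import pytest


def should_xfail(metric, dtype, platform=sys.platform):
    """Return True only when all three conditions listed above hold."""
    return (
        metric == "mahalanobis"
        and dtype is np.longdouble
        and platform.startswith("win")
    )


@pytest.mark.parametrize("metric", ["seuclidean", "mahalanobis"])
@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.longdouble])
def test_dtype_invariance_sketch(metric, dtype):
    if should_xfail(metric, dtype):
        # Mark only the known-problematic combination as an expected failure.
        pytest.xfail("mahalanobis with np.longdouble is expected to fail on Windows")
    # Placeholder body: a real test would compare the pairwise distances
    # against the scipy reference for this metric and dtype.
    X = np.arange(12, dtype=dtype).reshape(4, 3)
    assert X.dtype == dtype
```

Calling `pytest.xfail` inside the test keeps every other metric and dtype combination running normally on every platform.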
@jeremiedbb -- I only applied the tolerance adjustment to the float32, metric=='cosine' test because it is the only one that was failing. Let me know if you would rather it be applied to all float32 cases and I'll make that adjustment. Also, I found a couple of minor typos, which I corrected in becdd36; if this is not okay, I'll revert the commit. I'm not sure why the macOS test bench is still failing, although most other PRs seem to be failing the macOS tests too, so perhaps it's an issue with the test bench?
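As a rough illustration of why a float32 comparison may need a looser tolerance, the sketch below computes cosine distances with plain NumPy in float32 and compares them with a float64 reference. The helper names and the tolerance values are assumptions chosen for this example, not the ones used in the PR.

```python
import numpy as np


def cosine_distances(X):
    """Pairwise cosine distances, computed in the dtype of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def rtol_for(dtype):
    # Hypothetical rule: loosen the relative tolerance for single precision.
    return 1e-4 if np.dtype(dtype) == np.float32 else 1e-7


rng = np.random.RandomState(0)
X64 = rng.random_sample((5, 4))
reference = cosine_distances(X64)  # float64 reference values

X32 = X64.astype(np.float32)
result = cosine_distances(X32)  # same computation carried out in float32

# An absolute tolerance is also needed because the diagonal is close to zero.
np.testing.assert_allclose(result, reference, rtol=rtol_for(np.float32), atol=1e-6)
```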
This is an issue on upstream/master; #17913 is meant to fix it.
Okay, I've merged in the fix and all tests seem to be passing now. Is there anything else I should do here?
just a last comment, otherwise lgtm. Thanks @ForrestCKoch !
Otherwise LGTM!
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…t-learn#15730) Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
Please see my issue at #15731
What does this implement/fix? Explain your changes.
sklearn.metrics.pairwise_distances raises an error when using metric='seuclidean' and the input is not of type np.double.
Any other comments?
This is my first contribution to scikit-learn. I have read through the contributions page, but please let me know if I have done anything wrong.