Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Sets dtype for pairwise_distances when metric='seuclidean' #15730

Merged
merged 52 commits into from
Jul 17, 2020

Conversation

ForrestCKoch
Copy link
Contributor

@ForrestCKoch ForrestCKoch commented Nov 28, 2019

Reference Issues/PRs

Please see my issue at #15731

What does this implement/fix? Explain your changes.

sklearn.metrics.pairwise_distances returns error when using metric='seuclidean' and input is not of type np.double.

Any other comments?

This is my first contribution to scikit-learn. I have read through the contributions page, but please let me know if I have done anything wrong.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test

@jeremiedbb
Copy link
Member

Looks like a scipy bug to me. You can pass a float32 array to scipy.spatial.distance.pdist for X but not for V when metric="seuclidean". However you can pass a float32 array for VI when metric="mahalanobis".

@ForrestCKoch
Copy link
Contributor Author

ForrestCKoch commented Nov 29, 2019

@jnothman

Please add a test

Sure thing -- I've added a test to compare the output of pairwise_distances with pdist and cdist for each of the numpy floating point datatypes.

@jeremiedbb

Looks like a scipy bug to me. You can pass a float32 array to scipy.spatial.distance.pdist for X but not for V when metric="seuclidean". However you can pass a float32 array for VI when metric="mahalanobis".

Yeah, I'm not really sure what they were thinking (see the code snippet below). They force conversion to np.double at the end of their argument validation anyways, so I'm not sure what they were hoping to acheive with this check.

def _validate_seuclidean_kwargs(X, m, n, **kwargs):
    V = kwargs.pop('V', None)
    if V is None:
        V = np.var(X.astype(np.double), axis=0, ddof=1)
    else:
        V = np.asarray(V, order='c')
        if V.dtype != np.double:
            raise TypeError('Variance vector V must contain doubles.')
        if len(V.shape) != 1:
            raise ValueError('Variance vector V must '
                             'be one-dimensional.')
        if V.shape[0] != n:
            raise ValueError('Variance vector V must be of the same '
                             'dimension as the vectors on which the distances '
                             'are computed.')
    kwargs['V'] = _convert_to_double(V)

Regardless, I think this fix is required in the meantime. If they remove this forced conversion to np.double, the tests I have written will fail due to the difference in precision float types, and my changes can be reverted to avoid the performance penalty of type conversion.

@jeremiedbb
Copy link
Member

Yeah, I'm not really sure what they were thinking (see the code snippet below). They force conversion to np.double at the end of their argument validation anyways, so I'm not sure what they were hoping to acheive with this check.

Yes the check for the dtype is weird. the validation of mahalanobis params is the same except for this dtype check. We should report this to scipy.

Regardless, I think this fix is required in the meantime

that's right.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below. Also please add a what's new entry.

@@ -1273,3 +1273,27 @@ def test_pairwise_distances_data_derived_params(n_jobs, metric, dist_function,

assert_allclose(dist, expected_dist_explicit_params)
assert_allclose(dist, expected_dist_default_params)


@pytest.mark.parametrize("metric", ["seuclidean"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to parametrize for a single value

Copy link
Contributor Author

@ForrestCKoch ForrestCKoch Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I had originally tried to include 'mahalanobis' in the tests, but it kept failing because one of the Windows test benches didn't like me trying to use np.double.

It was throwing an error in one of the np.linalg libraries, so I removed it as I didn't think it was necessary to test for this purpose.

Anyways -- This is fixed in my recent commit

sklearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
Comment on lines 1284 to 1285
# check that pairwise distances gives the same result as pdist and cdist
# regardless of input datatype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention that the seuclidean metric used to raise an error on non double dtype, and add the PR number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed in recent commit

@jeremiedbb
Copy link
Member

I opened scipy/scipy#11171 to fix this issue upstream

@ForrestCKoch
Copy link
Contributor Author

ForrestCKoch commented Dec 5, 2019

@jeremiedbb thanks for opening that issue with scipy.

I've added a new "What's new" entry; however, I'm not very confident that I've gotten the syntax/conventions right. I've tried to mimic the other entries, but let me know if there are any mistakes.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a last nitpick from my side. Otherwise looks good. Thanks @ForrestCKoch

sklearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
Co-Authored-By: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
"dtype",
[np.half, np.float, np.double, np.longdouble])
@pytest.mark.parametrize("y_is_x", [True, False], ids=["Y is X", "Y is not X"])
def test_pairwise_distances_input_datatypes(dtype, y_is_x):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably mention seuclidean as it only tests that metric.

Or, rather than adding a test, we should be modifying another test to ensure invariance to dtype across metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably mention seuclidean as it only tests that metric.

To clarify, you mean I should change the name of the test? Otherwise it is mentioned in the following comments.

Or, rather than adding a test, we should be modifying another test to ensure invariance to dtype across metrics

I had originally tried to extend this to include other metrics, but 'malahanobis' will fail on the Windows py35_pip_openblas_32bit benchmark because np.linalg doesn't like the usage of np.double presumably due to the 32bit openblas library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it might be best to test on all but xfail on some particularly problematic combinations.

At the moment it looks quite strange that you are only testing seuclidean here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my most recent revisions. I've added each of the non-boolean scipy distance metrics to the tests. xfails have been introduced if each of the following hold:

  • metric == 'mahalanobis'
  • dtype == np.longdouble
  • sys.platform.startswith('win')

@ForrestCKoch
Copy link
Contributor Author

@jeremiedbb -- I only applied the tolerance adjustment for the float32 and metric=='cosine' test because this is the only one that was failing. Let me know if you would rather it be applied to all float32 cases and I'll make that adjustment.

Also, I found a couple of minor typos which I corrected in becdd36, if this is not okay I'll revert the commit.

I'm not sure why the macOS test bench is still failing, although it seems most other PR's are failing the macOS tests, so perhaps it's possible it's an issue with the test bench?

@cmarmo
Copy link
Contributor

cmarmo commented Jul 14, 2020

I'm not sure why the macOS test bench is still failing, although it seems most other PR's are failing the macOS tests, so perhaps it's possible it's an issue with the test bench?

This is an issue on upstream/master, #17913 is meant to fix it.

@ForrestCKoch
Copy link
Contributor Author

Okay, I've merged in the fix and all tests seem to be passing now. Is there anything else I should do here?

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a last comment, otherwise lgtm. Thanks @ForrestCKoch !

sklearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM!

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
sklearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
sklearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
sklearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
ForrestCKoch and others added 4 commits July 17, 2020 11:51
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan thomasjpfan changed the title [Mrg] set dtype=np.double in np.var usage FIX Sets dtype for pairwise_distances when using metric='seuclidean' Jul 17, 2020
@thomasjpfan thomasjpfan changed the title FIX Sets dtype for pairwise_distances when using metric='seuclidean' FIX Sets dtype for pairwise_distances when metric='seuclidean' Jul 17, 2020
@thomasjpfan thomasjpfan merged commit b4678ca into scikit-learn:master Jul 17, 2020
@ogrisel ogrisel added this to the 0.23.2 milestone Aug 3, 2020
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 3, 2020
…t-learn#15730)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…t-learn#15730)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants