Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] ENH Add parameter to sklearn.manifold.TSNE to transition to squaring non-Euclidean distance metrics by default (#12401) #17662

Merged
merged 55 commits into from Jul 12, 2020

Conversation

joshuacwnewton
Copy link
Contributor

@joshuacwnewton joshuacwnewton commented Jun 23, 2020

Reference Issues/PRs

Fixes #12401
Supersedes and closes #12419

The plan for deprecating legacy squaring behaviour is summarized in this comment.

What does this implement/fix? Explain your changes.

  • Adds a square_distances parameter to TSNE. (default='legacy')
    • True enables squaring for all distances metrics
    • 'legacy' preserves existing behaviour, where distances will only be squared when metric='euclidean'. Raises a FutureWarning when a non-Euclidean metric is provided to alert the deprecation of legacy behavior. Setting True suppresses this warning.
  • Adds a parameter description to the docstring of TSNE
  • Adds tests to check square_distances and check that FutureWarning is raised properly.
  • Modifies existing tests to include square_distances=True to suppress FutureWarnings.

Any other comments?

  • I was uncertain about the usage of pytest.mark.parametrize in my tests. Some existing tests use it, while others iterate through lists of parameters. Style suggestions are appreciated. Answered in feedback.
  • I was uncertain about the descriptions for my tests. Some use docstrings while others use a simple comment. Style suggestions are appreciated. Answered in feedback.
  • I was uncertain about proposing a timeline for changing the default value and/or deprecating 'legacy' behaviour. Discussion in the issue (tSNE with non-Euclidean metric should be able to square the distances #12401) has become outdated, so I chose version numbers based on the deprecation section of the Contributing docs.

I presumed that adding square_distance could allow for negative
precomputed distances in some cases (square_distance=True). However,
`knn.fit()` does its own non-neg check, which forces precomputed
distances to be positive regardless of the new square_distance
parameter.
Tests cover the following areas:
    * Invalid square_distance parameter selection
    * TSNE functioning with metric=Euclidean and square_distance=legacy
    * TSNE functioning with square_distance=True/False
@joshuacwnewton
Copy link
Contributor Author

joshuacwnewton commented Jun 23, 2020

It looks like adding a FutureWarning caused other TSNE tests to fail. I'm going to add @ignore_warnings(category=FutureWarning) to the relevant tests, as per the Contributing docs: "The warning should be caught in all other tests (using e.g., @pytest.mark.filterwarnings), and there should be no warning in the examples."

A related question: should I be running my local tests to fail on Warnings, then? I might have not been using the right pytest flags, as I didn't catch this before opening a PR. (EDIT: Found the relevant pytest docs. I needed to be adding -W error::FutureWarning to my local testing.)

@joshuacwnewton
Copy link
Contributor Author

joshuacwnewton commented Jun 23, 2020

Pinging @flosincapite. I think this is ready to go from [WIP] -> [MRG]. :)

Copy link

@flosincapite flosincapite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! There are some comments, but only because I was extremely picky :)

sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
@joshuacwnewton
Copy link
Contributor Author

joshuacwnewton commented Jun 23, 2020

Thanks a ton for taking the time to review, @flosincapite!

@joshuacwnewton joshuacwnewton changed the title [WIP] ENH Allow non-Euclidean distance metrics to be squared in sklearn.manifold.TSNE [MRG] ENH Allow non-Euclidean distance metrics to be squared in sklearn.manifold.TSNE (#12401) Jun 23, 2020
@joshuacwnewton joshuacwnewton marked this pull request as ready for review June 23, 2020 17:55
Euclidean handles squaring differently than other metrics. Euclidean
is also slower for n_jobs > 1. So, the same call to pairwise_distances
can't be used for every metric -- they have to be separated.
Comment on lines 924 to 944
@pytest.mark.parametrize('method', ['exact', 'barnes_hut'])
@ignore_warnings(category=FutureWarning)
def test_tsne_with_different_squaring_methods(method):
"""Isolate failing behavior in
`test_tsne_with_legacy_euclidean_squaring.`"""
random_state = check_random_state(0)
n_components_original = 3
n_components_embedding = 2
X = random_state.randn(50, n_components_original).astype(np.float32)
X_precomputed_1 = pairwise_distances(X, metric='euclidean') ** 2
X_precomputed_2 = pairwise_distances(X, metric='euclidean', squared=True)
X_transformed_tsne_precomputed_1 = TSNE(
metric='precomputed', n_components=n_components_embedding,
random_state=0, method=method).fit_transform(X_precomputed_1)
X_transformed_tsne_precomputed_2 = TSNE(
metric='precomputed', n_components=n_components_embedding,
random_state=0, method=method).fit_transform(X_precomputed_2)
assert_array_equal(X_transformed_tsne_precomputed_1,
X_transformed_tsne_precomputed_2)


This comment was marked as duplicate.

This comment was marked as duplicate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've uncovered an existing issue with TSNE using this test.

This test isolates failing behaviour in another test, here. That test passes with method='exact', but fails with method='barnes_hut'.

For both tests, I believe the issue stems from TSNE's sensitivity to small changes in the input, which just so happens to be demonstrated nicely by two different squaring methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X_precomputed_1 = pairwise_distances(X, metric='euclidean') ** 2
X_precomputed_2 = pairwise_distances(X, metric='euclidean', squared=True)

Although mathematically the same, they might slightly differ in floating arithmetics. Then we can expect the results of t-sne to be slightly different too. Usually we use assert_allclose to compare arrays of floats. Since you're using float32, you can change the rtol to 1e-6 or even 1e-5.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, it might have been better for me to have also included the output of the failing tests.

E       Mismatched elements: 100 / 100 (100%)
E       Max absolute difference: 182.95767
E       Max relative difference: 35.374535
E        x: array([[-36.36463 , -60.091194],
E              [ 35.217983, -77.36719 ],
E              [-23.366423,  -1.641921],...
E        y: array([[ -83.61592 ,  -40.051804],
E              [ -12.799014,  105.59048 ],
E              [  -2.169685,  -25.965828],...

A small change in the distances (e.g. due to float arithmetic) seems to have caused a large difference in the TSNE output. So, I'm not sure assert_allclose would help here, but I could be mistaken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, that's good to know! I had assumed instability was a general issue, so thank you for clarifying. :)

A quick/dirty way to fix the test_tsne_with_legacy_euclidean_squaring test would be to replace this line with:

if method == 'barnes_hut':
    X_precomputed = pairwise_distances(X, metric='euclidean') ** 2
else:
    X_precomputed = pairwise_distances(X, metric='euclidean', squared=True)

That way the squaring method matches up with what's used internally in TSNE.

But, if I understand correctly, you're asking if there's a way configure TSNE to be stable enough that specifying the squaring method shouldn't be necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going out on a limb here, but the lack of stability could also have something do with with the use of randn to generate sample X values. Without much structure in our sample data, t-SNE might not find meaningful topology in the data, as is shown in this tutorial.

I wonder if it would be better to use a small toy dataset for t-SNE tests that has structure (e.g. a couple of small clusters in 3D). This could better reflect the intended use cases for t-SNE.

EDIT: I shouldn't just propose... I should try it out! I'll see if there's still an issue when using sklearn.make_blobs instead of randn. Turns out sklearn.make_blobs is already used in other t-SNE tests, even.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does seem like make_blobs fixed the stability issues. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot of sense. Good idea :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your help! Your questions pointed me in the right direction. :)

t-SNE relies on structure in input data. When random data is used,
t-SNE becomes very sensitive to slight changes in the input. So,
make_blobs is used to create data with structure instead.
No longer necessary now that source of instability has been identified.
I thought I had written this test, but it wasn't mine! So, this change
shouldn't have happened in the first place.
@joshuacwnewton
Copy link
Contributor Author

Thanks much for the reviews @jeremiedbb. :)

sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Show resolved Hide resolved
* Update warning to announce deprecation
* Warn when 'legacy' is set.
* Because 'legacy' now functions exactly the same as 'warn', remove
'warn' as a setting for 'square_distance'.
* Update tests to reflect changes.
* Update docstring to reflect changes.
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @joshuacwnewton !

sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @joshuacwnewton, made a first pass, mostly looks good but I'm not quite sure about the deprecation plan

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
raise ValueError("All distances should be positive, the "
"metric given is not correct")

if self.metric != "euclidean" and self.square_distance is True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also work in interpreters that aren't CPython? I thought that True being a singleton was a CPython implementation detail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
Only new tests were changed. Other existing tests use docstrings, but
were left unchanged.
I accidentally grouped an `sklearn.isotonic` enhancement in
`sklearn.manifold` when fixing a merge conflict.
For previous wording, it was unclear what should be doing the squaring
(TSNE or the user). New wording makes it explicit that TSNE is doing
the squaring.
Some implementation details will need to be hashed out in 0.26, so
add reminder.
`square_distance` -> `square_distances` pushed some lines > 79.
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @joshuacwnewton , some last nitpicks and one question

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
sklearn/manifold/_t_sne.py Outdated Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Show resolved Hide resolved
sklearn/manifold/tests/test_t_sne.py Outdated Show resolved Hide resolved
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-approving since the remaining comments are easy to address. Thanks @joshuacwnewton !

@joshuacwnewton joshuacwnewton changed the title [MRG] ENH Allow non-Euclidean distance metrics to be squared in sklearn.manifold.TSNE (#12401) [MRG] ENH Add parameter to sklearn.manifold.TSNE to transition to squaring non-Euclidean distance metrics by default (#12401) Jul 11, 2020
@joshuacwnewton
Copy link
Contributor Author

Thanks much for taking the time to review, @NicolasHug! :D

@jnothman jnothman merged commit 89e49b6 into scikit-learn:master Jul 12, 2020
5 checks passed
@jnothman
Copy link
Member

Thanks you @joshuacwnewton

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tSNE with non-Euclidean metric should be able to square the distances
6 participants