[MRG] ENH Add parameter to sklearn.manifold.TSNE to transition to squaring non-Euclidean distance metrics by default (#12401) #17662
Conversation
I presumed that adding square_distance could allow for negative precomputed distances in some cases (square_distance=True). However, `knn.fit()` does its own non-negativity check, which forces precomputed distances to be non-negative regardless of the new square_distance parameter.
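As a minimal illustration of that check (the toy distance matrix below is hypothetical):

import numpy as np
from sklearn.neighbors import NearestNeighbors

D = np.array([[0.0, -1.0],
              [-1.0, 0.0]])  # invalid: negative "distances"

# Per the observation above, fitting on a precomputed matrix runs its own
# non-negativity validation, so negative entries are rejected regardless
# of any squaring parameter on TSNE.
try:
    NearestNeighbors(n_neighbors=1, metric="precomputed").fit(D)
except ValueError as exc:
    print(exc)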
Tests cover the following areas:
* Invalid square_distance parameter selection
* TSNE functioning with metric='euclidean' and square_distance='legacy'
* TSNE functioning with square_distance=True/False
A related question: should I be running my local tests to fail on Warnings, then? I might not have been using the right…
Pinging @flosincapite. I think this is ready to go from [WIP] -> [MRG]. :)
Looks great! There are some comments, but only because I was extremely picky :)
Thanks a ton for taking the time to review, @flosincapite!
Euclidean handles squaring differently than other metrics, and Euclidean is also slower for n_jobs > 1. So the same call to pairwise_distances can't be used for every metric; they have to be separated.
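A minimal sketch of that separation (the helper name and structure are illustrative, not the exact TSNE internals):

from sklearn.metrics import pairwise_distances

def _compute_squared_distances(X, metric, square_distances, n_jobs=None):
    if metric == "euclidean":
        # Euclidean has a dedicated squared variant in pairwise_distances.
        return pairwise_distances(X, metric="euclidean", squared=True)
    # Other metrics: compute the distances first, then square on request.
    distances = pairwise_distances(X, metric=metric, n_jobs=n_jobs)
    if square_distances:
        distances **= 2
    return distances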
sklearn/manifold/tests/test_t_sne.py (Outdated)
@pytest.mark.parametrize('method', ['exact', 'barnes_hut'])
@ignore_warnings(category=FutureWarning)
def test_tsne_with_different_squaring_methods(method):
    """Isolate failing behavior in
    `test_tsne_with_legacy_euclidean_squaring`."""
    random_state = check_random_state(0)
    n_components_original = 3
    n_components_embedding = 2
    X = random_state.randn(50, n_components_original).astype(np.float32)
    X_precomputed_1 = pairwise_distances(X, metric='euclidean') ** 2
    X_precomputed_2 = pairwise_distances(X, metric='euclidean', squared=True)
    X_transformed_tsne_precomputed_1 = TSNE(
        metric='precomputed', n_components=n_components_embedding,
        random_state=0, method=method).fit_transform(X_precomputed_1)
    X_transformed_tsne_precomputed_2 = TSNE(
        metric='precomputed', n_components=n_components_embedding,
        random_state=0, method=method).fit_transform(X_precomputed_2)
    assert_array_equal(X_transformed_tsne_precomputed_1,
                       X_transformed_tsne_precomputed_2)
I believe I've uncovered an existing issue with TSNE using this test. This test isolates failing behaviour in another test, here. That test passes with method='exact', but fails with method='barnes_hut'.

For both tests, I believe the issue stems from TSNE's sensitivity to small changes in the input, which just so happens to be demonstrated nicely by two different squaring methods.
X_precomputed_1 = pairwise_distances(X, metric='euclidean') ** 2
X_precomputed_2 = pairwise_distances(X, metric='euclidean', squared=True)

Although mathematically the same, these might differ slightly in floating-point arithmetic, so we can expect the results of t-SNE to be slightly different too. We usually use assert_allclose to compare arrays of floats. Since you're using float32, you can change the rtol to 1e-6 or even 1e-5.
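For reference, such a comparison might look like this (the tolerance value is illustrative):

from numpy.testing import assert_allclose

# Loosen rtol for float32 inputs; the default (1e-7) is tighter than
# float32 precision warrants here.
assert_allclose(X_transformed_tsne_precomputed_1,
                X_transformed_tsne_precomputed_2, rtol=1e-5)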
Ah, sorry, I should have also included the output of the failing tests:
E Mismatched elements: 100 / 100 (100%)
E Max absolute difference: 182.95767
E Max relative difference: 35.374535
E x: array([[-36.36463 , -60.091194],
E [ 35.217983, -77.36719 ],
E [-23.366423, -1.641921],...
E y: array([[ -83.61592 , -40.051804],
E [ -12.799014, 105.59048 ],
E [ -2.169685, -25.965828],...
A small change in the distances (e.g. due to float arithmetic) seems to have caused a large difference in the TSNE output. So I'm not sure assert_allclose would help here, but I could be mistaken.
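A self-contained way to see the two squaring paths diverge in float32 (sizes and seed are illustrative):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.randn(50, 3).astype(np.float32)

# Squaring after the fact round-trips through a sqrt, so the two results
# differ by small float32 rounding errors.
D1 = pairwise_distances(X, metric='euclidean') ** 2
D2 = pairwise_distances(X, metric='euclidean', squared=True)
print(np.abs(D1 - D2).max())  # nonzero, though tiny relative to the values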
Aha, that's good to know! I had assumed instability was a general issue, so thank you for clarifying. :)
A quick/dirty way to fix the test_tsne_with_legacy_euclidean_squaring test would be to replace this line with:

if method == 'barnes_hut':
    X_precomputed = pairwise_distances(X, metric='euclidean') ** 2
else:
    X_precomputed = pairwise_distances(X, metric='euclidean', squared=True)

That way the squaring method matches up with what's used internally in TSNE.

But, if I understand correctly, you're asking whether there's a way to configure TSNE to be stable enough that specifying the squaring method shouldn't be necessary?
Going out on a limb here, but the lack of stability could also have something to do with the use of randn to generate sample X values. Without much structure in our sample data, t-SNE might not find meaningful topology in the data, as is shown in this tutorial.

I wonder if it would be better to use a small toy dataset with structure (e.g. a couple of small clusters in 3D) for t-SNE tests. This could better reflect the intended use cases for t-SNE.

EDIT: I shouldn't just propose... I should try it out! I'll see if there's still an issue when using sklearn.make_blobs instead of randn. Turns out sklearn.make_blobs is already used in other t-SNE tests, even.
Yes, it does seem like make_blobs fixed the stability issues. :)
That makes a lot of sense. Good idea :)
Thank you for your help! Your questions pointed me in the right direction. :)
t-SNE relies on structure in input data. When random data is used, t-SNE becomes very sensitive to slight changes in the input. So, make_blobs is used to create data with structure instead.
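As a sketch of what that structured test data might look like (cluster parameters are illustrative):

from sklearn.datasets import make_blobs

# A few well-separated 3D clusters give t-SNE genuine structure to embed,
# unlike unstructured randn() samples.
X, _ = make_blobs(n_samples=50, n_features=3, centers=4, random_state=0)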
No longer necessary now that the source of instability has been identified.
Force-pushed from c7c588e to e1da68f
Force-pushed from e1da68f to 6e5481f
I thought I had written this test, but it wasn't mine! So, this change shouldn't have happened in the first place.
Thanks much for the reviews @jeremiedbb. :)
* Update warning to announce deprecation.
* Warn when 'legacy' is set.
* Because 'legacy' now functions exactly the same as 'warn', remove 'warn' as a setting for 'square_distance'.
* Update tests to reflect changes.
* Update docstring to reflect changes.
LGTM, thanks @joshuacwnewton !
Another enhancement was added to the sklearn.manifold section, causing a merge conflict.
thanks @joshuacwnewton, made a first pass, mostly looks good but I'm not quite sure about the deprecation plan
sklearn/manifold/_t_sne.py (Outdated)
raise ValueError("All distances should be positive, the "
                 "metric given is not correct")

if self.metric != "euclidean" and self.square_distance is True:
Does this also work in interpreters that aren't CPython? I thought that True being a singleton was a CPython implementation detail.
This might not be thorough enough, but a quick search yields:
- IronPython: https://ironpython-test.readthedocs.io/en/latest/library/constants.html
- Jython: https://www.jython.org/jython-old-sites/docs/library/stdtypes.html#boolean-values
- Stackless Python: https://stackless.readthedocs.io/en/v3.6.9-slp/library/constants.html
- PyPy: I couldn't find a reference to bools in https://doc.pypy.org/en/latest/cpython_differences.html#object-identity-of-primitive-values-is-and-id, but I did attempt to quickly test it at https://pypyjs.org/ (screenshot not included here).
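Whatever the interpreter, it's worth noting what `is True` actually tests; a quick demonstration of identity versus equality:

square_distance = 1  # truthy, but not the bool singleton True

print(square_distance == True)  # True: equality (bool is a subclass of int)
print(square_distance is True)  # False: identity with the True singleton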
Only new tests were changed. Other existing tests use docstrings, but were left unchanged.
I accidentally grouped an `sklearn.isotonic` enhancement in `sklearn.manifold` when fixing a merge conflict.
With the previous wording, it was unclear which party should be doing the squaring (TSNE or the user). The new wording makes it explicit that TSNE does the squaring.
Some implementation details will need to be hashed out in 0.26, so add a reminder.
`square_distance` -> `square_distances` pushed some lines past 79 characters.
thanks @joshuacwnewton, some last nitpicks and one question
pre-approving since the remaining comments are easy to address. Thanks @joshuacwnewton !
Thanks much for taking the time to review, @NicolasHug! :D
Thank you @joshuacwnewton
ENH Add parameter to sklearn.manifold.TSNE to transition to squaring non-Euclidean distance metrics by default (scikit-learn#12401) (scikit-learn#17662)
Reference Issues/PRs
Fixes #12401
Supersedes and closes #12419
The plan for deprecating legacy squaring behaviour is summarized in this comment.
What does this implement/fix? Explain your changes.
* Adds a square_distances parameter to TSNE (default='legacy').
  * True enables squaring for all distance metrics.
  * 'legacy' preserves existing behaviour, where distances will only be squared when metric='euclidean'. A FutureWarning is raised when a non-Euclidean metric is provided, to alert users to the deprecation of legacy behavior. Setting True suppresses this warning.
* Adds tests for square_distances and checks that FutureWarning is raised properly.
* Updates existing tests to set square_distances=True to suppress FutureWarnings.

Any other comments?
* I was uncertain about the usage of pytest.mark.parametrize in my tests. Some existing tests use it, while others iterate through lists of parameters. Style suggestions are appreciated. (Answered in feedback.)
* I was uncertain about the descriptions for my tests. Some use docstrings while others use a simple comment. Style suggestions are appreciated. (Answered in feedback.)
* I was uncertain about the deprecation timeline for 'legacy' behaviour. Discussion in the issue (tSNE with non-Euclidean metric should be able to square the distances #12401) has become outdated, so I chose version numbers based on the deprecation section of the Contributing docs.
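For completeness, a hypothetical usage sketch of the parameter described above (the metric and data are illustrative, assuming the API introduced by this PR):

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=50, n_features=3, centers=4, random_state=0)

# With a non-Euclidean metric, square_distances=True opts in to the new
# squaring behaviour and suppresses the FutureWarning that the 'legacy'
# default would raise.
X_embedded = TSNE(metric="cosine", square_distances=True,
                  random_state=0).fit_transform(X)
print(X_embedded.shape)  # (50, 2)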