
ENH Improve initialization and learning rate in t-SNE #19491

Merged
merged 71 commits on Apr 26, 2021

Conversation

dkobak
Contributor

@dkobak dkobak commented Feb 18, 2021

This implements suggestions from #18018 (see there for some discussion):

  1. Clarifies the documentation of learning_rate (scikit-learn's definition differs from all other implementations by a factor of 4).
  2. Scales PCA initialization to have the same std as the random initialization. (Update: only issues future warning for now. Would change in v1.2.)
  3. Issues future warning that PCA initialization will become default in v1.2.
  4. Implements learning_rate='auto' that scales the learning rate with the sample size.
  5. Issues future warning that learning_rate='auto' will become default in v1.2.
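The learning_rate='auto' heuristic from points 4 and 5 above can be sketched as follows (a minimal illustration, not the exact scikit-learn code; the factor-of-4 division matches scikit-learn's gradient convention, and early_exaggeration defaults to 12.0 in TSNE):

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    """Heuristic from Belkina et al. 2019 / Kobak et al. 2019:
    lr = N / early_exaggeration, divided by 4 for scikit-learn's
    gradient convention, and floored at 50.0 for small datasets."""
    return max(n_samples / early_exaggeration / 4.0, 50.0)

print(auto_learning_rate(70000))  # MNIST-sized N: ~1458.3
print(auto_learning_rate(1000))   # small N hits the floor: 50.0
```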

I would still have to implement unit tests for future warnings (haven't done it before) and add the changes to whats_new (not quite sure which of the above changes need to be mentioned there). But I'd like to get some feedback from the core developers about whether these suggested changes are all fine. @TomDLT @ogrisel

Update: tests added, changes added.

Member

@TomDLT TomDLT left a comment


Thanks for the pull request!

(6 inline review comments on sklearn/manifold/_t_sne.py — resolved)
@dkobak
Contributor Author

dkobak commented Feb 23, 2021

@TomDLT I fixed and added t-SNE tests and they seem to be working fine, but something is failing in doctest:

================================== FAILURES ===================================
____________________ [doctest] sklearn.manifold._t_sne.TSNE ____________________
...
UNEXPECTED EXCEPTION: FutureWarning("The default initialization in TSNE will change from 'random' to 'pca' in 1.2.")

and I don't know where to fix it :-/
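For context, scikit-learn's test configuration turns warnings into errors, so a docstring example that triggers the new FutureWarning fails as an "unexpected exception". A self-contained sketch of the mechanism, using a hypothetical stand-in for TSNE.fit (not the actual scikit-learn code):

```python
import warnings

def fit_with_default(init="warn"):
    # Hypothetical stand-in for TSNE.fit: warns when the default is used.
    if init == "warn":
        warnings.warn("The default initialization in TSNE will change "
                      "from 'random' to 'pca' in 1.2.", FutureWarning)
        init = "random"
    return init

# Doctest runs treat warnings as errors, so the default raises:
warnings.simplefilter("error", FutureWarning)
try:
    fit_with_default()
except FutureWarning as e:
    print("doctest would fail with:", e)

# Passing the parameter explicitly keeps the example warning-free:
warnings.resetwarnings()
assert fit_with_default(init="random") == "random"
```

One common fix is therefore to pass the future value explicitly inside the docstring example, so the doctest never hits the default path.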

@dkobak
Contributor Author

dkobak commented Apr 7, 2021

@thomasjpfan Thanks a lot for reviewing! I pushed your suggestions and added TODO comments and docstrings as you suggested everywhere else too.

This PR is deprecating three things: two default values and the PCA SD change. I suggest removing the PCA SD change for now and keeping the deprecation for learning_rate and init. In the user guide (doc/modules/manifold.rst), we need to describe learning_rate='auto' with references. In a future PR we can update the user guide for the PCA SD change and add the warning + tests.

Sorry, I am not sure I understand the rationale here. I already have everything regarding PCA SD implemented here; what's the point of taking it out? Also, I would be uncomfortable setting PCA init as default if it does not get scaled to the correct SD. To be honest, I think these changes should go together.

Update: I mean, the SD change is currently implemented but commented out, because it should only go live in version 1.2. Not sure what's the better way to do it? I think the future warning should be happening already in this PR.

             # X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4
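The commented-out line above rescales a PCA initialization so that the standard deviation of its first column matches the 1e-4 scale used by the random initialization. A self-contained sketch of the idea, computing the PCA projection via a plain SVD (an illustration, not the scikit-learn implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 10)

# PCA initialization: project the centered data onto the top 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_embedded = Xc @ Vt[:2].T

# Rescale so the std of the first column matches the random-init scale.
X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4

print(np.std(X_embedded[:, 0]))  # 1e-4 (up to floating-point error)
```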

In the user guide (doc/modules/manifold.rst), we need to describe learning_rate='auto' with references.

This I can do. Update: done!

Also, what do you think about this suggested change (not yet implemented in this PR):

Oh, there is something I forgot to mention in the original issue: after implementing the learning_rate = n/12 heuristic in openTSNE and FIt-SNE we realized that 750 iterations is enough for all practical purposes and made n_iter=750 the default over there in both implementations (see a LONG discussion here KlugerLab/FIt-SNE#88).

So we could also adopt the same convention here, cutting down the number of default iterations from 1000 to 750. This of course would need to go through a deprecation cycle, together with the learning_rate='auto'. What do you think?

In case you think it's a good idea, I am wondering if the deprecation cycle needs to be implemented via n_iter='warn'. Given that this is tied to the learning_rate change, can the learning rate FutureWarning mention that the n_iter will change to 750 together with the future learning rate change? Without an additional n_iter future warning?

PS. Not sure why the milestone check is now suddenly failing...

@dkobak
Contributor Author

dkobak commented Apr 15, 2021

@thomasjpfan Could you clarify what you meant by "removing the PCA SD change for now"? See also my comment above for more considerations. Everything is fixed btw. Cheers!

Member

@thomasjpfan thomasjpfan left a comment


Could you clarify what you meant by "removing the PCA SD change for now"? See also my comment above for more considerations. Everything is fixed btw. Cheers!

What I meant was that the commented out PCA change can be its own pull request. But looking at this again, I am okay with leaving it in.

I can see that the deprecations in this PR are all related, but they could be done in three separate PRs. This makes review easier and helps with merging faster. In general, a PR with a bigger scope has a higher chance of something blocking it from merging.

So we could also adopt the same convention here, cutting down the number of default iterations from 1000 to 750. This of course would need to go through a deprecation cycle, together with the learning_rate='auto'. What do you think?

We can work on this in a follow up PR. This PR is already a net improvement as is.

As for the review, I left comments about using pytest.mark.filterwarnings instead of ignore_warnings that applies to all the tests. Otherwise this looks good to go.
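The pytest pattern suggested here scopes the warning filter to a single test rather than suppressing warnings for the whole module. A generic sketch (the marker syntax is standard pytest; the function names and warning text here are illustrative stand-ins, not scikit-learn code):

```python
import warnings
import pytest

def make_embedding():
    # Stand-in for code that emits the deprecation warning under test.
    warnings.warn("The default learning rate in TSNE will change "
                  "from 200.0 to 'auto' in 1.2.", FutureWarning)
    return 0.0

# The filter applies only to this one test; other tests still see
# (and can assert on) the FutureWarning.
@pytest.mark.filterwarnings("ignore:The default learning rate in TSNE")
def test_embedding_value():
    assert make_embedding() == 0.0
```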

Comment on lines 532 to 534
where N is the sample size, following Belkina et al. 2019 and
Kobak et al. 2019, Nature Communications (or to 50.0, if
N / early_exaggeration / 4 < 50). This will become default in 1.2.
Member


Can we move these references into the References section below and link them here?

Contributor Author


Makes sense, fixed.

(Inline review comments on sklearn/manifold/_t_sne.py and sklearn/manifold/tests/test_t_sne.py — resolved)
@dkobak
Contributor Author

dkobak commented Apr 16, 2021

@thomasjpfan Thanks a lot! I changed the handling of the future warnings. All checks pass.

We can work on this in a follow up PR. This PR is already a net improvement as is.

My suggestion is simply to replace

        if self.learning_rate == 'warn':
            # See issue #18018
            warnings.warn("The default learning rate in TSNE will change "
                          "from 200.0 to 'auto' in 1.2.", FutureWarning)

with

        if self.learning_rate == 'warn':
            # TODO: Change n_iter to 750 in 1.2.
            # See issue #18018
            warnings.warn("The default learning rate in TSNE will change "
                          "from 200.0 to 'auto' in 1.2. At the same time, "
                          "the default number of iterations will decrease "
                          "from 1000 to 750.", FutureWarning)

I think this does not deserve its own PR... There would be nothing else to do at this point, really. Or what do you think?

@TomDLT
Member

TomDLT commented Apr 16, 2021

I am not fully convinced that changing the default number of iterations from 1000 to 750 is necessary. It would probably benefit from a dedicated discussion in a small separate PR.

@dkobak
Contributor Author

dkobak commented Apr 16, 2021

I am not fully convinced that changing the default number of iterations from 1000 to 750 is necessary. It would probably benefit from a dedicated discussion in a small separate PR.

Fair enough. This change would not have any other consequences apart from decreasing the runtime by 25%. I'm fine merging this PR without this change if you guys prefer that.

@dkobak dkobak requested a review from thomasjpfan April 18, 2021 20:45
@thomasjpfan thomasjpfan changed the title Improve initialization and learning rate in t-SNE ENH Improve initialization and learning rate in t-SNE Apr 26, 2021
Member

@thomasjpfan thomasjpfan left a comment


Thank you for working on this issue @dkobak !

LGTM
