[MRG] Fix gradient descent in TSNE #8768

Merged

merged 6 commits into scikit-learn:master on Jun 1, 2017

Conversation

deto (Contributor) commented Apr 20, 2017

Reference Issue

Fixes #8766

What does this implement/fix? Explain your changes.

The way the delta-bar-delta heuristic is implemented in the Python TSNE is incorrect. The sign is flipped, and as a result the algorithm has difficulty converging. For reference, see the MATLAB, R, and C++ implementations (which I've linked in the issue).

This fix corrects the sign issue and updates the constants to match those used in the other implementations.

Any other comments?
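For readers following along, here is a minimal sketch of the delta-bar-delta update in question; variable names are illustrative, and the actual diff appears later in this thread:

import numpy as np

def delta_bar_delta_step(p, grad, update, gains, learning_rate=200.0,
                         momentum=0.8, min_gain=0.01):
    # Correct sign: grow the gain where the gradient disagrees with the
    # previous update (their product is negative), shrink it where they
    # agree. The bug being fixed here had this condition inverted.
    inc = update * grad < 0.0
    gains[inc] += 0.2
    gains[~inc] *= 0.8
    np.clip(gains, min_gain, np.inf, out=gains)
    # Momentum step, scaled per-parameter by the gains.
    update = momentum * update - learning_rate * gains * grad
    return p + update, update, gains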

deto changed the title from "Fix gradient descent in TSNE" to "[MRG] Fix gradient descent in TSNE" on Apr 20, 2017

jnothman (Member) commented Apr 20, 2017

Does fixing this resolve issues related to the default learning rate (#6204, #8752)?

iosonofabio commented Apr 20, 2017

That's a separate PR; our learning rate is a lot higher than those other implementations anyway. That PR addresses convergence into local minima, this one the speed of convergence. But of course one could rerun the example plots in the other issues and PR and compare.

lesteve (Member) commented Apr 21, 2017

Not an expert on t-SNE at all unfortunately. @AlexanderFabisch, do you have an opinion on this fix?

AlexanderFabisch (Member) commented Apr 21, 2017

Yes, it seems like I made a mistake there. Good catch and thanks! :)

The numbers 0.05 and 0.95 come from Geoff Hinton's Coursera course, where the same trick was used to train neural networks with SGD. We can change them to 0.2 and 0.8 to be closer to the original implementation. I believe it won't make a big difference anyway.

We still have to check that all the examples with t-SNE still work as expected, and the unit tests have to be fixed.

AlexanderFabisch (Member) commented Apr 21, 2017

OK, I just checked examples/manifold/plot_compare_methods.py, examples/manifold/plot_manifold_sphere.py, and examples/manifold/plot_lle_digits.py (with learning_rate=200.0). The results look almost the same.

jnothman (Member) commented Apr 27, 2017

Is there a way to test this?

jnothman (Member) commented Apr 27, 2017

Could you please add a bug fix entry to what's new?

sklearn/manifold/t_sne.py
- gains[inc] += 0.05
- gains[dec] *= 0.95
+ gains[inc] += 0.2
+ gains[dec] *= 0.8
np.clip(gains, min_gain, np.inf)

AlexanderFabisch (Member) commented Apr 27, 2017

I just noticed that this line is completely useless. It should be

np.clip(gains, min_gain, np.inf, out=gains)

Do we add it to this PR?

deto (Contributor) commented Apr 27, 2017

Seems silly to open up a whole new issue for it. I can add it to this one.
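To see why the original line is a no-op, here is a tiny self-contained demonstration (array values made up for illustration):

import numpy as np

gains = np.array([0.001, 0.5, 2.0])
min_gain = 0.01

# Without out=, np.clip returns a new array and leaves `gains` untouched.
unused = np.clip(gains, min_gain, np.inf)
assert gains[0] == 0.001

# With out=gains, the clipping happens in place, as intended.
np.clip(gains, min_gain, np.inf, out=gains)
assert gains[0] == min_gain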

deto (Contributor) commented Apr 27, 2017

Added a bugfix entry, but now tests are failing? Did I do something wrong?

deto (Contributor) commented Apr 27, 2017

I'm not sure of the best way to test this. I just ran on the MNIST digits data set, before and after, and can see that it converges to a better solution.

'Exact' implementation:
-- Before fix: 250 iterations, error = 0.776
-- After fix: 675 iterations, error = 0.573

'Barnes-Hut' implementation:
-- Before fix: 225 iterations, error = 0.84
-- After fix: 650 iterations, error = 0.50

Note that it appears to take fewer iterations before the fix: the inability to converge makes the gradient descent quit early due to lack of progress.

The resulting plots look fine either way, but on the personal problem I've been applying this to (single-cell gene expression data), I'm getting better-looking results with the fix.
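For anyone who wants to reproduce this kind of before/after comparison, here is a rough sketch using the digits data bundled with scikit-learn; the exact dataset, parameters, and the use of the n_iter_ and kl_divergence_ attributes are assumptions, not necessarily what deto ran:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

for method in ("exact", "barnes_hut"):
    tsne = TSNE(method=method, random_state=0, verbose=2)
    tsne.fit_transform(X)
    # kl_divergence_ holds the final error; n_iter_ the stopping iteration.
    print(method, tsne.n_iter_, tsne.kl_divergence_)

Running this once on each side of the fix and comparing the stopping iteration and final error would show the effect described above.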

AlexanderFabisch (Member) commented Apr 28, 2017

> Added a bugfix entry, but now tests are failing? Did I do something wrong?

I guess there is a merge conflict. You should rebase on master:

git fetch original_sklearn_origin
git rebase original_sklearn_origin/master
# solve conflict
lesteve (Member) commented Apr 28, 2017

> I guess there is a merge conflict. You should rebase on master:

For this kind of simple conflict it may be easier to do it via the GitHub web interface: click on the "Resolve conflicts" button.

deto (Contributor) commented Apr 28, 2017

Fixed the conflict with some rebasing/squashing/force pushing.

lesteve (Member) commented Apr 28, 2017

You need to change the docstring for Travis to be green. Other than this, LGTM.

Out of interest, do you have an example where we can visually see the improvement with this fix? You were mentioning convergence issues in #8766. Does this PR improve the convergence in your case?

AlexanderFabisch (Member) commented Apr 28, 2017

There are 2 failing unit tests. I think you can update the expected numbers.

lesteve (Member) commented Apr 28, 2017

It is fine to update the expected numbers for the doctest.

test_n_iter_without_progress is probably a bit trickier to fix. Basically you need to find some data and some TSNE parameters for which "did not make any progress during the last 2 episodes. Finished." is printed to stdout. Maybe you have a suggestion, @AlexanderFabisch?

AlexanderFabisch (Member) commented Apr 28, 2017

I don't have an idea. Now I see why it was not a good unit test. :) Well, the main idea is to test whether n_iter_without_progress has an influence. So maybe we can play around with the value a little bit, and we can set min_grad_norm to a lower value. I think that should help.

edit: Forget the last part.

lesteve (Member) commented May 11, 2017

I pushed an update of the doctests of TSNE in order to have a better picture of the tests still failing.

deto (Contributor) commented May 20, 2017

Thanks @lesteve for moving this along.

Now it looks like it's only failing the n_iter_without_progress test, and only on a specific install configuration. Perhaps because the algorithm happened to make progress on every iteration during that run? Hard to tell, as only one in every 25 iterations is printed to standard out.

I can't think of a good way to generate a problem that would definitely trigger that test; I'd recommend the test be either removed or modified. Was that parameter added just due to convergence issues in the first place?

AlexanderFabisch (Member) commented May 20, 2017

I think we all agree that it is a bad test. I cannot think of any other way to test this. Maybe we should just remove it.

lesteve (Member) commented May 22, 2017

OK, I pushed a tiny change to the random_state which seems to fix the issue in the only situation where I can reproduce the problem (OSX with OpenBLAS). This is just a work-around, but at least, if it works, it will allow us to merge this PR.

lesteve (Member) commented May 23, 2017

And now AppVeyor is failing on Python 3.5 32bit 😢

lesteve (Member) commented May 23, 2017

OK, this is a bit of a kludge, but I went for a negative n_iter_without_progress. Hopefully that should do it.
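A rough sketch of what such a test could look like; the fixture, the message format (taken from the string quoted earlier in the thread), and the assertions are illustrative, not the exact scikit-learn test:

import sys
from io import StringIO

import numpy as np
from sklearn.manifold import TSNE

def test_n_iter_without_progress():
    # A negative threshold trips the "no progress" check the first time the
    # error fails to improve between evaluations, forcing an early stop.
    X = np.random.RandomState(0).randn(100, 2).astype(np.float32)
    tsne = TSNE(n_iter_without_progress=-1, verbose=2, random_state=0,
                method="exact")
    old_stdout, sys.stdout = sys.stdout, StringIO()
    try:
        tsne.fit_transform(X)
    finally:
        out = sys.stdout.getvalue()
        sys.stdout = old_stdout
    # Message format quoted from lesteve's comment above, with -1 substituted.
    assert ("did not make any progress during the last -1 episodes. Finished."
            in out)

As the following comments note, even this turned out to be flaky on 32-bit builds, and the test was eventually skipped there.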

lesteve added some commits May 23, 2017

lesteve (Member) commented Jun 1, 2017

OK, I skipped the test on 32bit and now everything passes.

lesteve (Member) commented Jun 1, 2017

Merging this one. Thanks a lot @deto for catching the bug and fixing it!

lesteve merged commit 57415c4 into scikit-learn:master on Jun 1, 2017

5 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/patch: 100% of diff hit (target 95.52%)
codecov/project: Absolute coverage decreased by -0.64% but relative coverage increased by +4.47% compared to ee82c3f
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG] Fix gradient descent in TSNE (#8768)
Skip flaky test_n_iter_without_progress on 32bit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment